# Homework 1

In this homework set, we will practice using pandas to clean and prep our data. 

### Deliverables: 
1. Pandas notebook with outputs
2. The `acs_nyc.csv` file you used
3. (Optional) The ChatGPT output as a CSV.

In [1]:
# We start by importing the libraries we need,
# all in one cell.

# It is a good practice to keep all the imports in one cell so that
# we can easily see what libraries we are using in the notebook.
import pandas as pd

## There is no need to import libraries more than once!



# Household Income and Home Value in NYC
In this exercise, you are going to investigate the relationship between household income and home values in New York City. 

You are given the dataset `acs_nyc.csv`, which was extracted from the IPUMS NHGIS portal. 


# Data Loading and Initial Exploration

In [41]:
acs_nyc = pd.read_csv('acs_nyc.csv')


Display the first 12 rows of the data. (1 pt)

In [4]:
acs_nyc.head(12)

Unnamed: 0,FIPS,hh_income,house_value
0,36005000100,,
1,36005000200,70867.0,457300.0
2,36005000400,98090.0,456100.0
3,36005001600,40033.0,587600.0
4,36005001901,55924.0,
5,36005001902,60804.0,425600.0
6,36005001903,,
7,36005001904,,
8,36005002001,20870.0,
9,36005002002,,441800.0


How many rows and how many columns are in the data? (1pt)

In [11]:
rows, columns = acs_nyc.shape

print(f"there are {rows} rows and {columns} columns")

there are 2327 rows and 3 columns


What are the datatypes for each column? (1pt)

FIPS has type int64, and hh_income and house_value have type float64.

In [13]:
acs_nyc.dtypes

FIPS             int64
hh_income      float64
house_value    float64
dtype: object
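
A note on the `int64` result: an integer column cannot keep leading zeros. New York's state code is 36, so this dataset is unaffected, but a state such as Alabama (code 01) would lose a digit. The sketch below is only an illustration, assuming a small in-memory CSV (`sample_csv`, a hypothetical stand-in for the real file) and using the `dtype` argument of `pd.read_csv` to keep `FIPS` as text.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for acs_nyc.csv. The second row
# (state code 01) is made up to show what happens with a leading zero.
sample_csv = io.StringIO(
    "FIPS,hh_income,house_value\n"
    "36005000200,70867.0,457300.0\n"
    "01001020100,,\n"
)

# dtype={"FIPS": str} reads the identifier as text, so no digits are dropped.
sample = pd.read_csv(sample_csv, dtype={"FIPS": str})

print(sample.dtypes)
print(sample["FIPS"].tolist())
```

With the default settings the second code would be read as the integer `1001020100`, which has only 10 digits.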

Using one function, display the 25th, 50th, 75th percentiles, mean, min, max, count, and standard deviations for the `hh_income` and `house_value` columns. (1 pt)

In [42]:
acs_nyc[["hh_income", "house_value"]].describe()

Unnamed: 0,hh_income,house_value
count,2196.0,1966.0
mean,78730.628871,759977.6
std,38669.599532,364248.7
min,11988.0,9999.0
25%,53277.25,530675.0
50%,73199.5,664600.0
75%,96989.0,908725.0
max,250001.0,2000001.0


# Data Manipulation
The FIPS code is a unique identifier for each administrative unit in the US. The number of digits in the `FIPS` column tells you whether each row is a state, county, tract, or block. The figure below shows how to read a FIPS code.

![This is how to read a FIPS code](https://customer.precisely.com/servlet/rtaImage?eid=ka0Vu0000000qGs&feoid=00N6g00000TynF6&refid=0EM6g0000010wuS)



1. What administrative unit does each row of our dataset represent, based on the number of digits in the `FIPS` column? (1 pt)

You can write your answer in this cell. 

##ANSWER<br>
There are 11 digits in each entry of the FIPS column, which means that each row is a census **tract**.
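
The digit-counting rule can be sketched in a few lines of plain Python. This is a minimal illustration, not part of the assignment: the helper names `fips_level` and `split_tract_fips` are made up here, and the digit counts follow the standard Census layout (2 for state, 5 for county, 11 for tract, 12 for block group, 15 for block).

```python
# Total number of digits at each level of the Census geographic hierarchy.
FIPS_LEVELS = {2: "state", 5: "county", 11: "tract", 12: "block group", 15: "block"}

def fips_level(code):
    """Return the geographic level implied by the number of digits in a FIPS code.

    Note: a code stored as an integer has already lost any leading zero,
    so this only works reliably on strings or on codes like New York's (36...).
    """
    digits = str(code)
    return FIPS_LEVELS.get(len(digits), "unknown")

def split_tract_fips(code):
    """Split an 11-digit tract code into state (2), county (3), and tract (6) parts."""
    digits = str(code)
    return {"state": digits[:2], "county": digits[2:5], "tract": digits[5:11]}

print(fips_level(36005000200))        # tract
print(split_tract_fips(36005000200))  # {'state': '36', 'county': '005', 'tract': '000200'}
```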

2. Using the FIPS code, create the following columns: (5 pts)
- State
- County: 
    - Bronx County: 36005
    - Brooklyn County: 36047
    - Manhattan County: 36061
    - Queens County: 36081
    - Staten Island: 36085
- Tract_ID

In [74]:
# Since I know the data is from NYC, I do not parse the first two digits and match them against every state code.
# This reduces the amount of work I have to do.

# But first I check that this is true: every FIPS code must start with "36".
assert acs_nyc["FIPS"].apply(lambda x: str(x)[:2] != "36").sum()==0

# Now I set the state of every row to NEW YORK.
acs_nyc["State"] = "NEW YORK"

#and create a small custom function that matches code to county
def county_from_FIPS(FIPS):
  county_code = str(FIPS)[:5]
  if county_code=="36005":
    return "Bronx County"
  elif county_code == "36047":
    return "Brooklyn County"
  elif county_code == "36061":
    return "Manhattan County"
  elif county_code == "36081":
    return "Queens County"
  elif county_code == "36085":
    return "Staten Island"
  else:
    return "" 

#apply the function
acs_nyc["County"] = acs_nyc["FIPS"].apply(county_from_FIPS)


# Also check that all conversions were successful (there is no county outside the expected ones).
assert (acs_nyc["County"]=="").sum()==0

# Now we add the tract number. This has to be a string to preserve leading zeros.
# We take the last six digits of the FIPS code.

acs_nyc["Tract_ID"] = acs_nyc["FIPS"].apply(lambda x: str(x)[-6:])
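
As an alternative to the `if`/`elif` function, the same columns can be built with pandas string methods and a dictionary lookup. The sketch below is one possible variant, assuming a small stand-in dataframe `df` with three FIPS codes taken from the data; the names `df`, `COUNTY_NAMES`, and `fips_str` are illustrative only.

```python
import pandas as pd

# Small stand-in for acs_nyc, using three FIPS codes that appear in the data.
df = pd.DataFrame({"FIPS": [36005000200, 36005001600, 36085031901]})

# County codes and names as listed in the question.
COUNTY_NAMES = {
    "36005": "Bronx County",
    "36047": "Brooklyn County",
    "36061": "Manhattan County",
    "36081": "Queens County",
    "36085": "Staten Island",
}

# Convert once to strings, then slice every row at the same time.
fips_str = df["FIPS"].astype(str)
df["County"] = fips_str.str[:5].map(COUNTY_NAMES)  # codes not in the dict become NaN
df["Tract_ID"] = fips_str.str[-6:]                 # stays a string, so leading zeros survive

# Any unexpected county code would show up as a missing value here.
assert df["County"].notna().all()
print(df)
```

With `map`, an unrecognized county code becomes `NaN` rather than an empty string, so the usual `isna()` checks catch it.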

In [75]:
acs_nyc

Unnamed: 0,FIPS,hh_income,house_value,State,County,Tract_ID
1,36005000200,70867.000000,457300.00000,NEW YORK,Bronx County,000200
2,36005000400,98090.000000,456100.00000,NEW YORK,Bronx County,000400
3,36005001600,40033.000000,587600.00000,NEW YORK,Bronx County,001600
4,36005001901,55924.000000,759977.64649,NEW YORK,Bronx County,001901
5,36005001902,60804.000000,425600.00000,NEW YORK,Bronx County,001902
...,...,...,...,...,...,...
2321,36085030301,95913.000000,457600.00000,NEW YORK,Staten Island,030301
2322,36085030302,85842.000000,420500.00000,NEW YORK,Staten Island,030302
2323,36085031901,78730.628871,288300.00000,NEW YORK,Staten Island,031901
2324,36085031902,76066.000000,381600.00000,NEW YORK,Staten Island,031902


3. Check for null values in the `hh_income` and `house_value` columns. Remove any rows with both hh_income and house_value missing. With the rest of the missing values, replace missing values in these columns with the mean value of their respective columns. (5 pts)

In [76]:
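## first I apply a mask that takes out rows where both columns are null
## (.copy() makes the result an independent dataframe, so the .loc assignments below do not trigger a SettingWithCopyWarning)
acs_nyc = acs_nyc[~(acs_nyc['hh_income'].isna() & acs_nyc['house_value'].isna())].copy()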

##replace missing income with mean
mean_income = acs_nyc['hh_income'].mean()
acs_nyc.loc[acs_nyc['hh_income'].isna(), 'hh_income'] =mean_income

##replace missing house value
mean_house_value = acs_nyc['house_value'].mean()
acs_nyc.loc[acs_nyc['house_value'].isna(), 'house_value'] = mean_house_value

Verify that there are no more null values in these columns. (1 pt)

In [77]:
##double check there are no null values
assert len(acs_nyc[acs_nyc['hh_income'].isna() | acs_nyc['house_value'].isna()])==0
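
The same cleaning steps can also be written with `dropna` and `fillna`. This is a minimal sketch on a four-row stand-in dataframe (values taken from the first rows of the data), not a replacement for the solution above; the names `df`, `cols`, and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Four rows from the head of the data: one with both values missing,
# one complete, and two with a single missing value.
df = pd.DataFrame({
    "FIPS": [36005000100, 36005000200, 36005001901, 36005002002],
    "hh_income": [np.nan, 70867.0, 55924.0, np.nan],
    "house_value": [np.nan, 457300.0, np.nan, 441800.0],
})
cols = ["hh_income", "house_value"]

# how="all" drops a row only when *both* listed columns are missing.
cleaned = df.dropna(subset=cols, how="all").copy()

# Passing a Series of means to fillna fills each column with its own mean.
cleaned[cols] = cleaned[cols].fillna(cleaned[cols].mean())

# Count of remaining nulls per column; both should be zero.
print(cleaned[cols].isna().sum())
```

Because `mean()` skips missing values, the column means are the same whether they are computed before or after dropping the rows where both values are missing.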

# Data analysis

Group the data by `County` and find the median household income and house value. Which county has the lowest median household income and house value? Show this by sorting the dataframe's rows from lowest to highest median household income and displaying only the first row. (5pt)

**Bronx County has the lowest of both; see the analysis below.**

In [100]:
# first we get medians for income and house value in each county
county_medians = acs_nyc.groupby("County")[['hh_income', 'house_value']].median().reset_index()

county_medians

Unnamed: 0,County,hh_income,house_value
0,Bronx County,44444.0,550000.0
1,Brooklyn County,69134.5,788650.0
2,Manhattan County,103362.0,863700.0
3,Queens County,78730.628871,640450.0
4,Staten Island,90730.0,592750.0


In [103]:
#we sort by income, and then by house value
print(county_medians.sort_values('hh_income').iloc[0])
print(county_medians.sort_values('house_value').iloc[0])

#these are the same, so Bronx County has both the lowest median household income and the lowest median house value
#we display the row of the dataframe:
county_medians.sort_values('hh_income').head(1)

County         Bronx County
hh_income           44444.0
house_value        550000.0
Name: 0, dtype: object
County         Bronx County
hh_income           44444.0
house_value        550000.0
Name: 0, dtype: object


Unnamed: 0,County,hh_income,house_value
0,Bronx County,44444.0,550000.0
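
A quick cross-check that does not rely on sorting is `idxmin`, which returns the index label of the smallest value in each column. The sketch below rebuilds the county medians by hand from the table above, so the numbers are copied rather than recomputed from the CSV.

```python
import pandas as pd

# County medians copied from the groupby output above.
county_medians = pd.DataFrame({
    "County": ["Bronx County", "Brooklyn County", "Manhattan County",
               "Queens County", "Staten Island"],
    "hh_income": [44444.0, 69134.5, 103362.0, 78730.628871, 90730.0],
    "house_value": [550000.0, 788650.0, 863700.0, 640450.0, 592750.0],
}).set_index("County")

# idxmin gives, for each column, the county with the smallest value.
lowest = county_medians.idxmin()
print(lowest)
```

If the two columns disagreed, `lowest` would show a different county for each, which sorting by a single column would not reveal.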


Which county has the largest standard deviation in household income and house value? Show this by sorting the dataframe's rows from highest to lowest household income variance and displaying only the first row. (5pt)

In [107]:
# first we get standard deviations for income and house value in each county
county_stds = acs_nyc.groupby("County")[['hh_income', 'house_value']].std().reset_index()

#and then we sort it by income, from highest to lowest
county_stds = county_stds.sort_values('hh_income', ascending=False)

#display the sorted dataframe
county_stds

Unnamed: 0,County,hh_income,house_value
2,Manhattan County,58547.942749,485595.141937
1,Brooklyn County,35727.585633,338227.775542
4,Staten Island,27706.657905,123544.229959
3,Queens County,24564.964022,199768.268043
0,Bronx County,24224.235001,194361.027699


In [110]:
#the highest standard deviations in household income and in house value are both in Manhattan
#so we display the first row:
county_stds.head(1)

Unnamed: 0,County,hh_income,house_value
2,Manhattan County,58547.942749,485595.141937
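
The question mentions both standard deviation and variance. Since variance is the square of the standard deviation and both are non-negative, sorting by either one gives the same ranking. The sketch below checks this on the standard deviations copied from the sorted table above; the variable names are illustrative.

```python
import pandas as pd

# Standard deviations copied from the sorted groupby output above.
county_stds = pd.DataFrame({
    "County": ["Manhattan County", "Brooklyn County", "Staten Island",
               "Queens County", "Bronx County"],
    "hh_income": [58547.942749, 35727.585633, 27706.657905,
                  24564.964022, 24224.235001],
    "house_value": [485595.141937, 338227.775542, 123544.229959,
                    199768.268043, 194361.027699],
}).set_index("County")

# Variance is the standard deviation squared.
county_vars = county_stds ** 2

# Rank counties by income spread using each measure.
by_std = county_stds["hh_income"].sort_values(ascending=False).index.tolist()
by_var = county_vars["hh_income"].sort_values(ascending=False).index.tolist()

same_order = by_std == by_var
print(same_order)
print(county_stds.idxmax())
```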


# Bonus (3 pts)
Take our original `acs_nyc.csv` and ask ChatGPT to clean it. What was the prompt that you gave it? Attach the dataset that it returned as a CSV and read it into the cell below.

INSERT YOUR PROMPT HERE.

In [None]:
## INSERT YOUR CODE HERE