# Homework 1

In this homework set, we will practice using pandas to clean and prep our data. 

### Deliverables: 
1. Pandas notebook with outputs
2. The `acs_nyc.csv` file you used
3. (Optional) The ChatGPT output as a CSV.

In [179]:
# We start by importing the libraries we need,
# all in one cell.

# It is a good practice to keep all the imports in one cell so that
# we can easily see what libraries we are using in the notebook.
import pandas as pd

## There is no need to import libraries more than once!

# Household Income and Home Value in NYC
In this exercise, you are going to investigate the relationship between household income and home values in New York City. 

You are given the dataset `acs_nyc.csv`, which was extracted from the IPUMS NHGIS portal. 


# Data Loading and Initial Exploration

In [180]:
acs_nyc = pd.read_csv('acs_nyc.csv')


Display the first 12 rows of the data. (1 pt)

In [181]:
acs_nyc.head(12)

Unnamed: 0,FIPS,hh_income,house_value
0,36005000100,,
1,36005000200,70867.0,457300.0
2,36005000400,98090.0,456100.0
3,36005001600,40033.0,587600.0
4,36005001901,55924.0,
5,36005001902,60804.0,425600.0
6,36005001903,,
7,36005001904,,
8,36005002001,20870.0,
9,36005002002,,441800.0


How many rows and how many columns are in the data? (1pt)

In [182]:
rows, columns = acs_nyc.shape

print(f"there are {rows} rows and {columns} columns")

there are 2327 rows and 3 columns


What are the datatypes for each column? (1pt)

FIPS has type int64, and hh_income and house_value have type float64.

In [183]:
acs_nyc.dtypes

FIPS             int64
hh_income      float64
house_value    float64
dtype: object

Using one function, display the 25th, 50th, 75th percentiles, mean, min, max, count, and standard deviations for the `hh_income` and `house_value` columns. (1 pt)

In [184]:
acs_nyc[["hh_income", "house_value"]].describe()

Unnamed: 0,hh_income,house_value
count,2196.0,1966.0
mean,78730.628871,759977.6
std,38669.599532,364248.7
min,11988.0,9999.0
25%,53277.25,530675.0
50%,73199.5,664600.0
75%,96989.0,908725.0
max,250001.0,2000001.0


# Data Manipulation
The FIPS code is a unique identifier for each administrative unit in the US. The number of digits in the `FIPS` column indicates whether each row is a state, county, tract, or block; the figure below shows how to read a code.

![This is how to read a FIPS code](https://customer.precisely.com/servlet/rtaImage?eid=ka0Vu0000000qGs&feoid=00N6g00000TynF6&refid=0EM6g0000010wuS)



1. What administrative unit does our dataset use, based on the number of digits in the `FIPS` column? (1 pt)

**ANSWER**<br>
Each entry in the FIPS column has 11 digits (2 for the state, 3 for the county, 6 for the tract), which means that the administrative unit is the census **tract**.
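
The digit count can be checked in code rather than by eye. This is a minimal sketch, assuming a small hand-made frame (`sample` is a hypothetical name) whose FIPS values are copied from the `head()` output above:

```python
import pandas as pd

# Hypothetical mini-frame; the FIPS values are copied from the head() output above.
sample = pd.DataFrame({"FIPS": [36005000100, 36005000200, 36005000400]})

# Convert each code to a string and count its digits.
digit_counts = sample["FIPS"].astype(str).str.len()

# A single unique length of 11 corresponds to state (2) + county (3) + tract (6).
print(digit_counts.unique())
```

This works here because the New York state code (36) has no leading zero. For states whose code starts with 0, an integer column would drop that digit, so the length check would need the codes to be read as strings.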

2. Using the FIPS code, create the following columns: (5 pts)
- State
- County: 
    - Bronx County: 36005
    - Brooklyn County: 36047
    - Manhattan County: 36061
    - Queens County: 36081
    - Staten Island: 36085
- Tract_ID

In [185]:
# Since I know the data is from NYC, I do not parse the leading digits and match them against every state code;
# this reduces the amount of work I have to do.

# But I first check that this is true (the New York state code is 36)
assert acs_nyc["FIPS"].astype(str).str.startswith("36").all()

# Now I set all the states to NEW YORK
acs_nyc["State"] = "NEW YORK"

#and create a small custom function that matches code to county
def county_from_FIPS(FIPS):
  county_code = str(FIPS)[:5]
  if county_code=="36005":
    return "Bronx County"
  elif county_code == "36047":
    return "Brooklyn County"
  elif county_code == "36061":
    return "Manhattan County"
  elif county_code == "36081":
    return "Queens County"
  elif county_code == "36085":
    return "Staten Island"
  else:
    return "" 

#apply the function
acs_nyc["County"] = acs_nyc["FIPS"].apply(county_from_FIPS)


# Also check that all conversions were successful (there is no county outside the expected ones)
assert (acs_nyc["County"]=="").sum()==0

#now we add the tract number, this has to be a string to preserve leading zeros
#we index the last six digits

acs_nyc["Tract_ID"] = acs_nyc["FIPS"].apply(lambda x: str(x)[-6:])

In [186]:
acs_nyc

Unnamed: 0,FIPS,hh_income,house_value,State,County,Tract_ID
0,36005000100,,,NEW YORK,Bronx County,000100
1,36005000200,70867.0,457300.0,NEW YORK,Bronx County,000200
2,36005000400,98090.0,456100.0,NEW YORK,Bronx County,000400
3,36005001600,40033.0,587600.0,NEW YORK,Bronx County,001600
4,36005001901,55924.0,,NEW YORK,Bronx County,001901
...,...,...,...,...,...,...
2322,36085030302,85842.0,420500.0,NEW YORK,Staten Island,030302
2323,36085031901,,288300.0,NEW YORK,Staten Island,031901
2324,36085031902,76066.0,381600.0,NEW YORK,Staten Island,031902
2325,36085032300,86471.0,237100.0,NEW YORK,Staten Island,032300


3. Check for null values in the `hh_income` and `house_value` columns. Remove any rows with both hh_income and house_value missing. With the rest of the missing values, replace missing values in these columns with the mean value of their respective columns. (5 pts)

In [187]:
## First I apply a mask that takes out rows where both columns are null.
## .copy() makes the result an independent DataFrame, so the assignments below
## do not act on a view of the original.
acs_nyc = acs_nyc[~(acs_nyc['hh_income'].isna() & acs_nyc['house_value'].isna())].copy()

## Replace missing income with the column mean
mean_income = acs_nyc['hh_income'].mean()
acs_nyc.loc[acs_nyc['hh_income'].isna(), 'hh_income'] = mean_income

## Replace missing house value with the column mean
mean_house_value = acs_nyc['house_value'].mean()
acs_nyc.loc[acs_nyc['house_value'].isna(), 'house_value'] = mean_house_value

Verify that there are no more null values in these columns. (1 pt)

In [188]:
##double check there are no null values
assert len(acs_nyc[acs_nyc['hh_income'].isna() | acs_nyc['house_value'].isna()])==0
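
The null check and the cleaning step can also be written with pandas' built-in `isna`, `dropna`, and `fillna` methods. The sketch below runs on a hypothetical four-row frame (`toy`, with made-up missing-value positions), not on the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is missing both values, rows 2 and 3 are missing one each.
toy = pd.DataFrame({
    "hh_income": [70867.0, np.nan, 55924.0, np.nan],
    "house_value": [457300.0, np.nan, np.nan, 441800.0],
})
cols = ["hh_income", "house_value"]

# Count missing values per column before cleaning.
print(toy[cols].isna().sum())

# Drop rows where BOTH columns are missing, then fill the rest with the column means.
cleaned = toy.dropna(subset=cols, how="all")
cleaned = cleaned.fillna(cleaned[cols].mean())

# Every count should now be zero.
print(cleaned[cols].isna().sum())
```

Rows that are missing both values contribute nothing to either column mean, so the means are the same whether they are computed before or after those rows are dropped.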

# Data analysis

Group the data by `County` and find the median household income and house value. Which county has the lowest median household income and house value? Show this by sorting the dataframe's rows from lowest to highest median household income and displaying only the first row. (5pt)

**Bronx County does; see the analysis below.**

In [189]:
# First we group the data by county and get the medians for income and house value in each
# (FIPS and the text columns are excluded: a median is not meaningful for them, and they differ for every tract)
county_medians = acs_nyc.groupby("County")[['hh_income', 'house_value']].median().reset_index()

#we sort by income and display the first row (lowest median household income) of resulting df
county_medians.sort_values('hh_income').head(1)

Unnamed: 0,County,hh_income,house_value
0,Bronx County,44444.0,550000.0


Which county has the largest standard deviation in household income and house value? Show this by sorting the dataframe's rows from highest to lowest household income variance and displaying only the first row. (5pt)

In [190]:
# first we get standard deviations for income and house value in each county
county_stds = acs_nyc.groupby("County")[['hh_income', 'house_value']].std().reset_index()

#and then we sort it by income
county_stds = county_stds.sort_values('hh_income', ascending=False)

In [191]:
# The highest standard deviation in household income and the highest standard deviation in house value are both in Manhattan,
# so we display the first row:
county_stds.head(1)

Unnamed: 0,County,hh_income,house_value
2,Manhattan County,58547.942749,485595.141937
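
The question mentions both standard deviation and variance. Since variance is the square of the standard deviation and both are non-negative, sorting by either gives the same ranking. A minimal sketch with hypothetical counties "A" and "B" and made-up incomes:

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "County": ["A", "A", "A", "B", "B", "B"],
    "hh_income": [10.0, 20.0, 30.0, 10.0, 50.0, 90.0],
})

grouped = toy.groupby("County")["hh_income"]
by_std = grouped.std().sort_values(ascending=False)
by_var = grouped.var().sort_values(ascending=False)

# Both rankings put the same county first.
print(by_std.index.tolist())
print(by_var.index.tolist())
```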


# Bonus (3 pts)
Take our original `acs_nyc.csv` and ask ChatGPT to clean it. What was the prompt that you gave it? Attach the dataset that it returned as a csv and read it into the cell below.

My prompt:

This is a dataset of household income (hh_income) and home values (house_value) in New York City. There are missing values in hh_income, as well as house_value. Can you remove the rows where both hh_income and house_value are missing, and then replace the rows where only one of the values is missing with the mean value of the columns. 
Also, FIPS is a unique identifier for administrative units, and these are census tracts for New York (hence the first 2 digits are 36, the next 3 digits are of the county, and the last 6 are the TRACT_ID). Can you add three columns, one for the state (all NEW YORK), one for the county (you should use the first 5 digits to find the county using this mapping:
    - Bronx County: 36005
    - Brooklyn County: 36047
    - Manhattan County: 36061
    - Queens County: 36081
    - Staten Island: 36085), and then parse the last 6 digits as strings and put them in the county column?

Since I don't have access to ChatGPT 4, I instead asked ChatGPT 3.5 for code to do the manipulation I wanted (Rifqi said this is OK).<br>
Below is the code it gave me; I ran it locally and then saved the resulting .csv.

In [192]:
# Assuming your dataset is loaded into a DataFrame named df
df = pd.read_csv('acs_nyc.csv')

# Remove rows where both hh_income and house_value are missing
df = df.dropna(subset=['hh_income', 'house_value'], how='all')

# Replace missing values with the mean of the respective columns
df['hh_income'].fillna(df['hh_income'].mean(), inplace=True)
df['house_value'].fillna(df['house_value'].mean(), inplace=True)

# Add columns for State and County
df['State'] = 'NEW YORK'

# Mapping for County based on the first 5 digits of FIPS
county_mapping = {
    36005: 'Bronx County',
    36047: 'Brooklyn County',
    36061: 'Manhattan County',
    36081: 'Queens County',
    36085: 'Staten Island'
}

# Parse the first 5 digits of FIPS and map them to the County column
df['County'] = df['FIPS'].astype(str).str[:5].astype(int).map(county_mapping)

# Parse the last 6 digits and put them in the Tract_ID column
df['Tract_ID'] = df['FIPS'].astype(str).str[5:]

# If you want to save the modified DataFrame to a new CSV file
# df.to_csv("acs_nyc_chatgpt.csv") #I commented this out so it doesn't write to your computer

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['hh_income'].fillna(df['hh_income'].mean(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['house_value'].fillna(df['house_value'].mean(), inplace=True)
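
The warning above recommends assigning the filled column back instead of calling `fillna(..., inplace=True)` on `df[col]`. A minimal sketch of that form, using a hypothetical three-row frame (`toy`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
toy = pd.DataFrame({
    "hh_income": [1.0, np.nan, 3.0],
    "house_value": [np.nan, 20.0, 40.0],
})

# Assign the result back to the column; no inplace call on an intermediate object.
toy["hh_income"] = toy["hh_income"].fillna(toy["hh_income"].mean())
toy["house_value"] = toy["house_value"].fillna(toy["house_value"].mean())

print(toy)
```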


In [193]:
## INSERT YOUR CODE HERE
# read_csv already returns a DataFrame; Tract_ID is read as a string to keep leading zeros
df = pd.read_csv("acs_nyc_chatgpt.csv", index_col=0, dtype={"Tract_ID": object})

df

Unnamed: 0,FIPS,hh_income,house_value,State,County,Tract_ID
1,36005000200,70867.000000,457300.00000,NEW YORK,Bronx County,000200
2,36005000400,98090.000000,456100.00000,NEW YORK,Bronx County,000400
3,36005001600,40033.000000,587600.00000,NEW YORK,Bronx County,001600
4,36005001901,55924.000000,759977.64649,NEW YORK,Bronx County,001901
5,36005001902,60804.000000,425600.00000,NEW YORK,Bronx County,001902
...,...,...,...,...,...,...
2321,36085030301,95913.000000,457600.00000,NEW YORK,Staten Island,030301
2322,36085030302,85842.000000,420500.00000,NEW YORK,Staten Island,030302
2323,36085031901,78730.628871,288300.00000,NEW YORK,Staten Island,031901
2324,36085031902,76066.000000,381600.00000,NEW YORK,Staten Island,031902


In [194]:
# Everything in df is the same as acs_nyc, so ChatGPT did this perfectly!
# Very cool!
assert (df!=acs_nyc).sum().sum()==0
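
An alternative to the element-wise comparison above is `pd.testing.assert_frame_equal`, which reports the first difference it finds and treats NaNs in matching positions as equal. The sketch below is an assumption-laden illustration: it round-trips a hypothetical two-row frame (`original`) through an in-memory CSV instead of the real files.

```python
import io
import pandas as pd

# Hypothetical frame; the values are copied from the output above.
original = pd.DataFrame({
    "FIPS": [36005000200, 36005000400],
    "hh_income": [70867.0, 98090.0],
    "Tract_ID": ["000200", "000400"],
})

# Round-trip through CSV text, reading Tract_ID as a string to keep leading zeros.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0, dtype={"Tract_ID": str})

# Raises an AssertionError describing the mismatch if the frames differ.
# check_exact=False compares floats within a small tolerance.
pd.testing.assert_frame_equal(reloaded, original, check_exact=False)
print("frames match")
```

The tolerance is a design choice: it guards against tiny floating-point differences that a CSV round trip can introduce, which an exact `!=` comparison would flag.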