## Battle of the Cities: Data Cleaning

In [None]:
import pandas as pd
import numpy as np

rent_data = pd.read_csv('Neighborhood_Zri_AllHomesPlusMultifamily.csv')
rent_data.head(2)

## Dropping Data
This data set contains rent values for the entire US. I only want data for NYC and SF, so I'll filter on city first. For New York, I only want Manhattan (New York County), so I'll also filter on county.

In [None]:
rent_data = rent_data[(rent_data["City"] == "New York") | (rent_data["City"] == "San Francisco")]
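
As a side note, the same kind of city filter can be written with `isin`, which stays readable if more cities are added later. This is only a sketch on a made-up toy frame (the rows and the `toy` / `cities` names are hypothetical, not taken from the Zillow file):

```python
import pandas as pd

# Toy stand-in for the Zillow file; these rows are invented for illustration.
toy = pd.DataFrame({
    "RegionName": ["Mission", "Harlem", "Loop"],
    "City": ["San Francisco", "New York", "Chicago"],
})

cities = ["New York", "San Francisco"]

# isin() builds one boolean mask for the whole list of cities.
# .copy() returns an independent frame, so later in-place edits
# are not applied to a slice of the original.
toy_filtered = toy[toy["City"].isin(cities)].copy()
print(toy_filtered["City"].tolist())
```

The `.copy()` is optional here, but it tends to be a safe habit before a series of `inplace=True` operations on a filtered frame.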

In [None]:
rent_data["CountyName"].value_counts()

In [None]:
rent_data = rent_data[(rent_data["CountyName"] == "New York County") | (rent_data["CountyName"] == "San Francisco County")]
rent_data["CountyName"].value_counts()

Now that I have the relevant data for the cities I'm interested in, I'll drop unnecessary columns.

In [None]:
rent_data.drop(["RegionID", "State", "Metro", "CountyName", "SizeRank"], axis=1, inplace=True)

# Rename RegionName to Neighborhood
rent_data.rename(columns={"RegionName":"Neighborhood"}, inplace=True)

rent_data.columns

In [None]:
rent_data.set_index('Neighborhood', inplace=True)
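
For comparison, the drop, rename, and re-index steps can also be chained without `inplace=True`, which leaves the input frame untouched. A minimal sketch on a hypothetical toy frame (column values are invented; only the column names mirror the ones used above):

```python
import pandas as pd

# Toy frame with a few of the same column names; values are made up.
toy = pd.DataFrame({
    "RegionID": [1, 2],
    "RegionName": ["Harlem", "Mission"],
    "City": ["New York", "San Francisco"],
    "SizeRank": [10, 20],
    "2018-01": [2500.0, 3900.0],
})

# Each method returns a new DataFrame, so the steps chain cleanly.
toy_clean = (
    toy.drop(columns=["RegionID", "SizeRank"])
       .rename(columns={"RegionName": "Neighborhood"})
       .set_index("Neighborhood")
)
print(toy_clean.columns.tolist())
```

Either style gives the same result; the chained version just makes it easier to rerun a cell without the frame having already been modified.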

The data for 2019 is incomplete at this time, but I'm only concerned with recent rents, so I'll keep everything from 2018 onward and drop the monthly columns from 2010-09 through 2017-12.

In [None]:
rent_data.drop(rent_data.columns.to_series()["2010-09":"2017-12"], axis=1, inplace=True)

rent_data.columns
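
An equivalent way to pick out a run of month columns is a label slice with `.loc`. The sketch below uses a hypothetical one-row frame with four month columns; note that label slices include both endpoints, so the end month is dropped as well:

```python
import pandas as pd

# Toy frame: one neighborhood, a City column, and four month columns (values invented).
toy = pd.DataFrame(
    [["New York", 2000.0, 2100.0, 2200.0, 2300.0]],
    index=pd.Index(["Harlem"], name="Neighborhood"),
    columns=["City", "2017-11", "2017-12", "2018-01", "2018-02"],
)

# .loc label slices are inclusive on both ends, so "2017-12" is selected too.
old_months = toy.loc[:, "2017-11":"2017-12"].columns
toy_recent = toy.drop(columns=old_months)
print(toy_recent.columns.tolist())
```

This relies on the month columns appearing in chronological order in the file, which is also what the `to_series()` slice above assumes.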

## Handling Missing Values

In [None]:
rent_data.isnull().sum()

Every column has at least a few missing values. Column totals don't show where the gaps are concentrated, though, so I'll transpose the data and count nulls again to see which neighborhoods are missing the most months.

In [None]:
transposed = rent_data.transpose()

transposed.isnull().sum().sort_values(ascending=False)
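
The per-neighborhood counts can also be computed without transposing by summing across columns with `axis=1`. A small sketch on a made-up frame (neighborhood names and rents are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: three neighborhoods, two months, some gaps (values invented).
toy = pd.DataFrame(
    {"2018-01": [2500.0, np.nan, np.nan], "2018-02": [2550.0, 3900.0, np.nan]},
    index=pd.Index(["Harlem", "Mission", "Presidio"], name="Neighborhood"),
)

# axis=1 sums the null flags across columns, giving one count per row.
missing_per_neighborhood = toy.isnull().sum(axis=1).sort_values(ascending=False)
print(missing_per_neighborhood)
```

Both approaches give the same counts; `axis=1` just skips the intermediate transposed frame.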

Considering there are 21 records for each neighborhood, I'll drop neighborhoods that are missing more than half of their data since I want a relatively accurate representation of rents in each area.

In [None]:
rent_data.dropna(thresh=11, axis=0, inplace=True)
rent_data.isnull().sum()
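
One thing to keep in mind with `thresh` is that it counts every non-null value in the row, so a non-rent column such as `City`, if it is still in the frame, counts toward the threshold. The sketch below shows one way to derive the threshold from the month columns only, using `subset`. The toy frame and the `month_cols` / `min_months` names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame: City plus three month columns (values invented).
toy = pd.DataFrame(
    {
        "City": ["New York", "San Francisco"],
        "2018-01": [2500.0, np.nan],
        "2018-02": [2550.0, np.nan],
        "2018-03": [np.nan, 3900.0],
    },
    index=pd.Index(["Harlem", "Presidio"], name="Neighborhood"),
)

# Only the month columns should count toward the threshold.
month_cols = [c for c in toy.columns if c != "City"]

# Keep rows that are missing at most half of the months:
# with 3 months this requires 2 non-null values; with 21 it requires 11.
min_months = (len(month_cols) + 1) // 2

toy_kept = toy.dropna(subset=month_cols, thresh=min_months)
print(toy_kept.index.tolist())
```

With 21 month columns this works out to the same `thresh=11` used above, but it is applied to the rent values alone.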

In [None]:
rent_data.to_csv('rent_data_clean.csv')