## Battle of the Cities: Data Cleaning

In [87]:
import pandas as pd
import numpy as np

rent_data = pd.read_csv('Neighborhood_Zri_AllHomesPlusMultifamily.csv')
rent_data.head(2)

| | RegionID | RegionName | City | State | Metro | CountyName | SizeRank | 2010-09 | 2010-10 | 2010-11 | ... | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 274772 | Northeast Dallas | Dallas | TX | Dallas-Fort Worth-Arlington | Dallas County | 1 | 1074.0 | 1080.0 | 1093.0 | ... | 1367.0 | 1368.0 | 1364.0 | 1360.0 | 1362.0 | 1367.0 | 1371.0 | 1373.0 | 1375.0 | 1378.0 |
| 1 | 112345 | Maryvale | Phoenix | AZ | Phoenix-Mesa-Scottsdale | Maricopa County | 2 | 896.0 | 915.0 | 913.0 | ... | 1167.0 | 1169.0 | 1167.0 | 1164.0 | 1166.0 | 1170.0 | 1178.0 | 1189.0 | 1200.0 | 1207.0 |


## Dropping Data
This dataset contains rent values for neighborhoods across the US. I only want data for New York City and San Francisco, so I'll filter on `City` first. For New York, I only want Manhattan (New York County), so I'll filter on `CountyName` as well.

In [88]:
rent_data = rent_data[(rent_data["City"] == "New York") | (rent_data["City"] == "San Francisco")]

In [89]:
rent_data["CountyName"].value_counts()

San Francisco County    75
Queens County           66
Richmond County         58
Kings County            52
Bronx County            44
New York County         33
Name: CountyName, dtype: int64

In [90]:
rent_data = rent_data[(rent_data["CountyName"] == "New York County") | (rent_data["CountyName"] == "San Francisco County")]
rent_data["CountyName"].value_counts()

San Francisco County    75
New York County         33
Name: CountyName, dtype: int64
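
The two filters above can also be written with `isin`, which avoids chaining `|` comparisons as the list of places grows. This is a minimal sketch on a toy frame; the rows and the names `toy`, `cities`, and `counties` are made up for illustration and are not taken from the Zillow file.

```python
import pandas as pd

# Toy stand-in for rent_data; the rows are illustrative, not real Zillow records.
toy = pd.DataFrame({
    "RegionName": ["Harlem", "Astoria", "Mission", "Maryvale"],
    "City": ["New York", "New York", "San Francisco", "Phoenix"],
    "CountyName": ["New York County", "Queens County",
                   "San Francisco County", "Maricopa County"],
})

cities = ["New York", "San Francisco"]
counties = ["New York County", "San Francisco County"]

# Keep rows whose city AND county both appear in the lists of interest.
filtered = toy[toy["City"].isin(cities) & toy["CountyName"].isin(counties)]
print(filtered["RegionName"].tolist())
```

Combining both conditions in one mask gives the same rows as applying the city filter and then the county filter in sequence.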

Now that I have the relevant data for the cities I'm interested in, I'll drop the unnecessary columns.

In [91]:
rent_data.drop(["RegionID", "State", "Metro", "CountyName", "SizeRank"], axis=1, inplace=True)

# Rename RegionName to Neighborhood
rent_data.rename(columns={"RegionName":"Neighborhood"}, inplace=True)

rent_data.columns

Index(['Neighborhood', 'City', '2010-09', '2010-10', '2010-11', '2010-12',
       '2011-01', '2011-02', '2011-03', '2011-04',
       ...
       '2018-12', '2019-01', '2019-02', '2019-03', '2019-04', '2019-05',
       '2019-06', '2019-07', '2019-08', '2019-09'],
      dtype='object', length=111)

In [92]:
rent_data.set_index('Neighborhood', inplace=True)

The data for 2019 is incomplete at this time (it runs only through September). I'm only concerned with recent rents, so I'll drop the columns from 2010-09 through 2017-12.

In [93]:
rent_data.drop(rent_data.columns.to_series()["2010-09":"2017-12"], axis=1, inplace=True)

rent_data.columns

Index(['City', '2018-01', '2018-02', '2018-03', '2018-04', '2018-05',
       '2018-06', '2018-07', '2018-08', '2018-09', '2018-10', '2018-11',
       '2018-12', '2019-01', '2019-02', '2019-03', '2019-04', '2019-05',
       '2019-06', '2019-07', '2019-08', '2019-09'],
      dtype='object')
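
Because the month labels are zero-padded `YYYY-MM` strings, they sort chronologically as plain text, so the same cut could also be expressed as a comparison on the column labels. The sketch below uses a toy frame with made-up values; `cutoff`, `is_month`, and `old_months` are names introduced here for illustration.

```python
import pandas as pd

# Toy frame shaped like rent_data after set_index; the values are made up.
toy = pd.DataFrame(
    {"City": ["New York"], "2017-11": [3000.0], "2017-12": [3010.0],
     "2018-01": [3020.0], "2018-02": [3030.0]},
    index=pd.Index(["Harlem"], name="Neighborhood"),
)

cutoff = "2018-01"  # first month to keep (assumed for this example)

# Month columns are the ones shaped like YYYY-MM; other columns (City) are left alone.
is_month = toy.columns.str.match(r"^\d{4}-\d{2}$")
old_months = toy.columns[is_month & (toy.columns < cutoff)]

recent = toy.drop(columns=old_months)
print(recent.columns.tolist())
```

This avoids hard-coding the first and last labels of the range to drop, at the cost of relying on the label format staying consistent.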

## Handling Missing Values

In [118]:
rent_data.isnull().sum()

City       0
2018-01    8
2018-02    7
2018-03    7
2018-04    7
2018-05    8
2018-06    7
2018-07    6
2018-08    7
2018-09    6
2018-10    5
2018-11    5
2018-12    5
2019-01    7
2019-02    9
2019-03    7
2019-04    7
2019-05    7
2019-06    7
2019-07    7
2019-08    7
2019-09    7
dtype: int64

Every month column has a few missing values. Transposing the data shows whether those gaps are concentrated in particular neighborhoods.

In [111]:
transposed = rent_data.transpose()

transposed.isnull().sum().sort_values(ascending=False)

Neighborhood
Balboa Terrace          21
Mount Davidson Manor    21
Forest Hill             21
Westwood Highlands      18
North Waterfront        15
                        ..
NoHo                     0
Garment District         0
Haight                   0
Duboce Triangle          0
Upper West Side          0
Length: 108, dtype: int64
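
The same per-neighborhood counts can be obtained without transposing by summing across columns with `axis=1`. A small sketch, using two neighborhood names from the output above with made-up rent values:

```python
import numpy as np
import pandas as pd

# Toy frame; the rent values are invented for illustration.
toy = pd.DataFrame(
    {"City": ["San Francisco", "New York"],
     "2018-01": [np.nan, 3000.0],
     "2018-02": [np.nan, 3010.0],
     "2018-03": [4100.0, 3020.0]},
    index=pd.Index(["Forest Hill", "NoHo"], name="Neighborhood"),
)

# axis=1 sums across columns, giving one missing-value count per row (neighborhood).
missing_per_neighborhood = toy.isnull().sum(axis=1).sort_values(ascending=False)
print(missing_per_neighborhood)
```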

Each neighborhood has 21 monthly values, so I'll drop neighborhoods that are missing more than half of them to keep a reasonably accurate picture of rents in each area.

In [119]:
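# Note: thresh counts non-null values per row, including the always-populated City column,
# so a row with only 10 monthly values still meets thresh=11.
rent_data.dropna(thresh=11, axis=0, inplace=True)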
rent_data.isnull().sum()

City       0
2018-01    1
2018-02    0
2018-03    0
2018-04    0
2018-05    1
2018-06    1
2018-07    0
2018-08    1
2018-09    0
2018-10    0
2018-11    0
2018-12    0
2019-01    1
2019-02    3
2019-03    2
2019-04    2
2019-05    2
2019-06    2
2019-07    2
2019-08    2
2019-09    2
dtype: int64
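
One detail worth checking in the step above: `thresh` counts every non-null value in a row, and `City` is never null. If the intent is to require at least 11 of the 21 months, one option is to restrict the count to the month columns with `subset`. The sketch below runs on synthetic data; `month_cols`, `make_row`, and the row labels are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd

# The 21 month labels kept in the notebook: 2018-01 through 2019-09.
month_cols = ([f"2018-{m:02d}" for m in range(1, 13)]
              + [f"2019-{m:02d}" for m in range(1, 10)])

def make_row(n_present):
    """Return 21 monthly values with the first n_present filled and the rest missing."""
    return [1000.0] * n_present + [np.nan] * (len(month_cols) - n_present)

toy = pd.DataFrame(
    [make_row(10), make_row(11)],
    columns=month_cols,
    index=pd.Index(["ten_months", "eleven_months"], name="Neighborhood"),
)
toy.insert(0, "City", ["San Francisco", "New York"])

# Counting City: both rows have at least 11 non-null values, so both survive.
with_city = toy.dropna(thresh=11, axis=0)

# Counting only the month columns: the row with 10 months is dropped.
months_only = toy.dropna(thresh=11, subset=month_cols, axis=0)

print(len(with_city), len(months_only))
```

Whether the stricter rule changes anything in the real data depends on whether any neighborhood has exactly 10 monthly values, which the outputs shown here do not reveal.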

In [130]:
rent_data.to_csv('rent_data_clean.csv')
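
Since `Neighborhood` is the index at this point, `to_csv` writes it as the first column of the file. When loading the cleaned file later, passing `index_col` restores it as the index. This is a sketch using an in-memory buffer in place of `rent_data_clean.csv`, with a made-up row:

```python
import io
import pandas as pd

# Toy frame with Neighborhood as the index, mirroring the cleaned data's shape.
toy = pd.DataFrame(
    {"City": ["New York"], "2018-01": [3000.0]},
    index=pd.Index(["NoHo"], name="Neighborhood"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col turns the first CSV column back into the index.
restored = pd.read_csv(buffer, index_col="Neighborhood")
print(restored.index.name)
```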