# Battle of the Cities: New York City vs. San Francisco
## Data Cleaning Workbook
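This workbook covers the data cleaning step of the project: loading the median one-bedroom rental price data and preparing the New York City and San Francisco subsets.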

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#importing rent data
rent_data = pd.read_csv("Neighborhood_MedianRentalPrice_1Bedroom.csv")
rent_data.head(3)

| | RegionName | City | State | Metro | CountyName | SizeRank | 2010-02 | 2010-03 | 2010-04 | 2010-05 | ... | 2018-12 | 2019-01 | 2019-02 | 2019-03 | 2019-04 | 2019-05 | 2019-06 | 2019-07 | 2019-08 | 2019-09 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Northeast Dallas | Dallas | TX | Dallas-Fort Worth-Arlington | Dallas County | 1 | | | | | ... | 1050.0 | 1075.0 | 1053.0 | 1050.0 | 1157.0 | 1175.0 | 1240.0 | 1100.0 | 1094.0 | 1150.0 |
| 1 | Maryvale | Phoenix | AZ | Phoenix-Mesa-Scottsdale | Maricopa County | 2 | | | | | ... | 705.0 | 724.0 | 755.0 | 767.5 | 795.0 | 795.0 | 867.0 | 860.0 | 825.0 | 840.0 |
| 2 | Paradise | Las Vegas | NV | Las Vegas-Henderson-Paradise | Clark County | 3 | | | | | ... | 950.0 | 983.0 | 927.0 | 933.0 | 921.0 | 896.0 | 930.0 | 950.0 | 1017.5 | 1000.0 |


I'm only interested in data from New York and San Francisco, so I'll create two data frames for this data.

In [3]:
nyc_data = rent_data[rent_data["City"] == "New York"]
sf_data = rent_data[rent_data["City"] == "San Francisco"]

## Cleaning Manhattan Data

New York City has five boroughs. In the interest of keeping the city analysis similar for both cities, I'll only be looking at Manhattan neighborhoods, which should fall under New York County.
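One way to check the county assumption before filtering is to count the rows per county among the New York entries. The sketch below uses a small hand-built frame (`toy` and its four rows are illustrative stand-ins, not rows taken from the rent file) and shows the same two-step filter, with `.copy()` added so the result is independent of the source frame.

```python
import pandas as pd

# Toy rows standing in for rent_data; only the columns needed here.
toy = pd.DataFrame({
    "RegionName": ["Upper West Side", "Astoria", "Williamsburg", "Mission"],
    "City": ["New York", "New York", "New York", "San Francisco"],
    "CountyName": ["New York County", "Queens County", "Kings County",
                   "San Francisco County"],
})

nyc = toy[toy["City"] == "New York"]

# Count rows per county to see which boroughs are present.
county_counts = nyc["CountyName"].value_counts()
print(county_counts)

# Keep Manhattan only; .copy() avoids modifying a view of the original frame.
manhattan = nyc[nyc["CountyName"] == "New York County"].copy()
print(manhattan)
```

If the counts show a county name spelled differently from "New York County", the equality filter would silently return no rows, so this check is worth running once on the real data.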

In [5]:
nyc_data = nyc_data[nyc_data["CountyName"] == "New York County"]
nyc_data.isnull().sum()

RegionName    0
City          0
State         0
Metro         0
CountyName    0
             ..
2019-05       0
2019-06       0
2019-07       0
2019-08       0
2019-09       0
Length: 122, dtype: int64
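The printed summary is truncated to the first and last five entries, so any columns with missing values in the middle are hidden. A minimal sketch of how to list only the columns that contain missing values, using a made-up frame (`toy` and its numbers are hypothetical, not values from the rent file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for nyc_data; values are made up.
toy = pd.DataFrame({
    "RegionName": ["A", "B", "C"],
    "2011-01": [np.nan, 1500.0, np.nan],
    "2011-02": [2000.0, np.nan, 1800.0],
    "2011-03": [2100.0, 1550.0, 1850.0],
})

null_counts = toy.isnull().sum()

# Keep only the columns with at least one missing value.
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```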

In [6]:
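# Drop the metadata columns and the 2010 months
cols = ["Metro", "CountyName", "SizeRank", "2010-02", "2010-03", "2010-04", "2010-05", "2010-06",
        "2010-07", "2010-08", "2010-09", "2010-10", "2010-11", "2010-12"]
nyc_data = nyc_data.drop(cols, axis=1)

# Fill remaining missing values with the row mean (note: mean, not median) of the monthly rents
month_cols = nyc_data.select_dtypes(include="number").columns
m = nyc_data[month_cols].mean(axis=1)
nyc_data[month_cols] = nyc_data[month_cols].apply(lambda col: col.fillna(m))
# Check that no missing values remain
nyc_data.isnull().sum()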

The Manhattan data starts in Feb. 2012. This is tentative until I've finished cleaning the San Francisco data, because I want both data sets to cover the same span of time.
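One possible way to pick a shared starting month is to find, for each city, the earliest month in which every neighborhood has a value, and then keep the later of the two. The sketch below assumes this would be run before any missing values are filled; `first_complete_month`, `nyc_toy`, and `sf_toy` are hypothetical names with made-up values, and "earliest fully populated month" is only one reasonable definition of where a data set starts.

```python
import numpy as np
import pandas as pd

def first_complete_month(df, id_cols):
    """Return the earliest month column with no missing values, or None."""
    month_cols = [c for c in df.columns if c not in id_cols]
    for col in sorted(month_cols):  # "YYYY-MM" labels sort chronologically
        if df[col].notna().all():
            return col
    return None

# Hypothetical toy frames; values are made up.
nyc_toy = pd.DataFrame({
    "RegionName": ["A", "B"],
    "2011-01": [np.nan, 1.0],
    "2011-02": [2.0, 2.0],
    "2011-03": [3.0, 3.0],
})
sf_toy = pd.DataFrame({
    "RegionName": ["C", "D"],
    "2011-01": [np.nan, np.nan],
    "2011-02": [np.nan, 5.0],
    "2011-03": [6.0, 6.0],
})

ids = ["RegionName"]
# The later of the two start months is the first month both cities cover.
# (Assumes each frame has at least one complete month, so neither is None.)
start = max(first_complete_month(nyc_toy, ids), first_complete_month(sf_toy, ids))

keep = ids + [c for c in nyc_toy.columns if c not in ids and c >= start]
nyc_aligned = nyc_toy[keep]
sf_aligned = sf_toy[keep]
print(start)
```

This check only looks for the first complete month, so later months could still contain gaps that need filling.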

## Cleaning San Francisco Data

In [8]:
sf_data.isnull().sum()

RegionName    0
City          0
State         0
Metro         0
CountyName    0
             ..
2019-05       0
2019-06       0
2019-07       0
2019-08       0
2019-09       0
Length: 122, dtype: int64

In [9]:
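# Drop the metadata columns and the 2010 months
cols = ["Metro", "CountyName", "SizeRank", "2010-02", "2010-03", "2010-04", "2010-05", "2010-06",
        "2010-07", "2010-08", "2010-09", "2010-10", "2010-11", "2010-12"]
sf_data = sf_data.drop(cols, axis=1)

# Fill remaining missing values with the row mean (note: mean, not median) of the monthly rents
month_cols = sf_data.select_dtypes(include="number").columns
m = sf_data[month_cols].mean(axis=1)
sf_data[month_cols] = sf_data[month_cols].apply(lambda col: col.fillna(m))
# Check that no missing values remain
sf_data.isnull().sum()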

RegionName    0
City          0
State         0
2011-01       0
2011-02       0
             ..
2019-05       0
2019-06       0
2019-07       0
2019-08       0
2019-09       0
Length: 108, dtype: int64
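The fill step replaces each missing value with a statistic computed over the same neighborhood's row, and the choice between mean and median matters when a row contains an outlier. The sketch below compares the two on a made-up frame (`toy`, its values, and the `filled_*` names are hypothetical); it illustrates the trade-off and is not a claim about which one suits the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; row A contains one unusually high month.
toy = pd.DataFrame({
    "RegionName": ["A", "B"],
    "2011-01": [np.nan, 1000.0],
    "2011-02": [1000.0, np.nan],
    "2011-03": [1100.0, 1200.0],
    "2011-04": [3000.0, 1400.0],
})

month_cols = toy.columns.drop("RegionName")
row_mean = toy[month_cols].mean(axis=1)      # NaN values are skipped by default
row_median = toy[month_cols].median(axis=1)

# fillna with a Series aligns on the row index, so each row gets its own value.
filled_mean = toy[month_cols].apply(lambda col: col.fillna(row_mean))
filled_median = toy[month_cols].apply(lambda col: col.fillna(row_median))

print(filled_mean)
print(filled_median)
```

In row A the single high month pulls the mean well above the typical rent, while the median stays close to the other observed values.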

### Export data

In [10]:
nyc_data.to_csv('nyc_rent.csv', index=False)
sf_data.to_csv('sf_rent.csv', index=False)
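A quick round-trip check can confirm that an exported file reads back with the same columns and values. This is a minimal sketch on a made-up frame written to a temporary directory; `toy` and `toy_rent.csv` are hypothetical names, not the project's files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame; values are made up.
toy = pd.DataFrame({"RegionName": ["A", "B"], "2011-01": [1000.0, 1200.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_rent.csv")
    # index=False keeps the row index out of the file, as in the export above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded)
```

Writing with `index=False` means the reloaded frame has no extra unnamed index column.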