# Final Project Submission
---

## Info

* Student name: **Barto Molina**
* Student pace: **part time**
* Scheduled project review date/time: **10/15/2019 5:00 PM (EST)**
* Instructor name: **Victor Geislinger**
* Blog post URL: [...](https://medium.com/@bartomolina/...)

## The Project

The goal of this project is to find the five best zipcodes in the US to invest in. For simplicity, I limit the research to New York (NY) zipcodes. Through EDA and modelling, I will try to infer which areas are likely to see the largest percentage increase in the coming years.

## Imports

We'll import the libraries used throughout the rest of the project:

In [76]:
import pandas as pd
import numpy as np
import googlemaps
import folium
import json
import random

In [3]:
# we use dotenv to load the Google API key from a .env file
# use .env.example as a reference
import os
from dotenv import load_dotenv
load_dotenv()

GOOGLE_KEY = os.getenv("GOOGLE_KEY")
gmaps = googlemaps.Client(key=GOOGLE_KEY)
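
If the `.env` file is missing, `os.getenv` silently returns `None`. A minimal sketch of a guard that fails early with a readable message is shown below; `require_env` and `EXAMPLE_KEY` are hypothetical names used only for illustration, not part of dotenv or googlemaps.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# demo with a throwaway variable so the sketch runs without a real key
os.environ['EXAMPLE_KEY'] = 'not-a-real-key'
print(require_env('EXAMPLE_KEY'))
```

In the cell above, the same idea would apply to `GOOGLE_KEY` before it is passed to `googlemaps.Client`.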

## Step 1: Load the Data/Filtering for Chosen Zipcodes

In [5]:
df = pd.read_csv('zillow_data.csv')
df.head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2018-12,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09
0,61639,10025,New York,NY,New York-Newark-Jersey City,New York County,1,167200.0,167400.0,167400.0,...,1051500,1029200,1014300,1008700,991500,973400,965800,967100,967800,965400
1,84654,60657,Chicago,IL,Chicago-Naperville-Elgin,Cook County,2,148200.0,148900.0,149400.0,...,314500,314700,314600,312900,311000,309700,308900,308400,307000,305700
2,61637,10023,New York,NY,New York-Newark-Jersey City,New York County,3,350900.0,351700.0,352300.0,...,1375700,1370100,1369000,1362500,1358300,1369700,1394900,1403200,1405400,1412700
3,84616,60614,Chicago,IL,Chicago-Naperville-Elgin,Cook County,4,160800.0,162700.0,164200.0,...,367100,365900,364800,362800,360300,357600,354200,348900,343800,341500
4,61616,10002,New York,NY,New York-Newark-Jersey City,New York County,5,,,,...,966100,954500,949600,943700,928400,913300,901400,892700,891200,889900


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4770 entries, 0 to 4769
Columns: 289 entries, RegionID to 2019-09
dtypes: float64(237), int64(48), object(4)
memory usage: 10.5+ MB


## Step 2: Data Preprocessing

In [7]:
def get_datetimes(df):
    # the first 7 columns (RegionID ... SizeRank) are identifiers; the rest are 'YYYY-MM' labels
    return pd.to_datetime(df.columns.values[7:], format='%Y-%m')
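
As a quick check of what `get_datetimes` is meant to return, the sketch below parses the 'YYYY-MM' column labels of a toy frame. The toy frame and the `date_cols` name are illustrative assumptions; only the three values are taken from the first row of the data shown earlier.

```python
import pandas as pd

# toy frame with one identifier column and three monthly columns (illustrative only)
toy = pd.DataFrame({'RegionName': [10025],
                    '1996-04': [167200.0],
                    '1996-05': [167400.0],
                    '1996-06': [167400.0]})

# skip the identifier column and parse the remaining labels as month starts
date_cols = pd.to_datetime(toy.columns.values[1:], format='%Y-%m')
print(date_cols)
```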

In [101]:
# we're going to analyze only the New York zipcodes
df_NY = df[df['City'] == 'New York'].copy()
df_NY.reset_index(drop=True, inplace=True)

In [102]:
df_NY.head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2018-12,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09
0,61639,10025,New York,NY,New York-Newark-Jersey City,New York County,1,167200.0,167400.0,167400.0,...,1051500,1029200,1014300,1008700,991500,973400,965800,967100,967800,965400
1,61637,10023,New York,NY,New York-Newark-Jersey City,New York County,3,350900.0,351700.0,352300.0,...,1375700,1370100,1369000,1362500,1358300,1369700,1394900,1403200,1405400,1412700
2,61616,10002,New York,NY,New York-Newark-Jersey City,New York County,5,,,,...,966100,954500,949600,943700,928400,913300,901400,892700,891200,889900
3,61807,10467,New York,NY,New York-Newark-Jersey City,Bronx County,7,,,,...,213300,214000,215400,217600,218200,219100,223800,227600,230100,231700
4,61630,10016,New York,NY,New York-Newark-Jersey City,New York County,9,242300.0,241200.0,240000.0,...,1007700,997500,994400,993300,985100,975200,970200,964400,956700,951700


In [104]:
df_NY['CountyName'].value_counts()

New York County    40
Queens County      31
Kings County       31
Richmond County     8
Bronx County        5
Name: CountyName, dtype: int64

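We're going to select the zipcodes that are less than half an hour from the UN HQ by public transportation. Despite its name, the `Distance` column stores the transit travel time in minutes. For now, the travel time is computed only for the first five rows (`df_NY.head()`).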

In [123]:
df_NY['Distance'] = np.nan
destination = 'United Nations Secretariat Building, East 42nd Street, New York, NY'

# query the transit travel time only for the first five zipcodes
for i, zipcode in df_NY.head().iterrows():
    directions = gmaps.directions(str(zipcode['RegionName']), destination, mode="transit")
    # the API reports the duration in seconds; store it in minutes
    duration = directions[0]['legs'][0]['duration']['value']
    df_NY.at[i, 'Distance'] = duration / 60

In [124]:
df_NY.head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09,Distance
0,61639,10025,New York,NY,New York-Newark-Jersey City,New York County,1,167200.0,167400.0,167400.0,...,1029200,1014300,1008700,991500,973400,965800,967100,967800,965400,34.883333
1,61637,10023,New York,NY,New York-Newark-Jersey City,New York County,3,350900.0,351700.0,352300.0,...,1370100,1369000,1362500,1358300,1369700,1394900,1403200,1405400,1412700,25.316667
2,61616,10002,New York,NY,New York-Newark-Jersey City,New York County,5,,,,...,954500,949600,943700,928400,913300,901400,892700,891200,889900,27.916667
3,61807,10467,New York,NY,New York-Newark-Jersey City,Bronx County,7,,,,...,214000,215400,217600,218200,219100,223800,227600,230100,231700,58.433333
4,61630,10016,New York,NY,New York-Newark-Jersey City,New York County,9,242300.0,241200.0,240000.0,...,997500,994400,993300,985100,975200,970200,964400,956700,951700,16.866667
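
The half-hour cut-off described above could then be applied to these travel times. A minimal sketch on a stand-in frame that reuses the five values from the output above; `MAX_MINUTES` and `df_close` are assumed names for illustration.

```python
import pandas as pd

# stand-in for df_NY.head(), using the transit times (in minutes) from the output above
toy = pd.DataFrame({
    'RegionName': [10025, 10023, 10002, 10467, 10016],
    'Distance': [34.883333, 25.316667, 27.916667, 58.433333, 16.866667],
})

MAX_MINUTES = 30  # hypothetical name for the half-hour threshold

# keep only the zipcodes within the threshold; rows without a travel time (NaN) are dropped too
df_close = toy[toy['Distance'] <= MAX_MINUTES].reset_index(drop=True)
print(df_close['RegionName'].tolist())
```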


In [125]:
path = 'ny_new_york_zip_codes_geo.min.json'
with open(path) as json_file:  
    NY_zip = json.load(json_file)

NY_zip_selected = list()

for zipcode in NY_zip['features']:
    if int(zipcode['properties']['ZCTA5CE10']) in df_NY.head()['RegionName'].values:
        zipcode['properties']['ZCTA5CE10'] = int(zipcode['properties']['ZCTA5CE10'])
        NY_zip_selected.append(zipcode)

NY_zip['features'] = NY_zip_selected
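
The filtering step above can be tried on a tiny hand-made FeatureCollection. This is only a sketch: `toy_geo` and `wanted` are made-up names and values, and the geometry is omitted. The cast to `int` is presumably there so that the GeoJSON key has the same type as the integer `RegionName` column it is later matched against.

```python
# toy FeatureCollection mimicking the structure of the zipcode GeoJSON (geometry omitted)
toy_geo = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'ZCTA5CE10': '10025'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'ZCTA5CE10': '12345'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'ZCTA5CE10': '10016'}, 'geometry': None},
    ],
}

wanted = {10025, 10016}  # zipcodes to keep, as integers like RegionName

selected = []
for feature in toy_geo['features']:
    code = int(feature['properties']['ZCTA5CE10'])
    if code in wanted:
        # store the code as an int so it can be matched against an integer column
        feature['properties']['ZCTA5CE10'] = code
        selected.append(feature)

toy_geo['features'] = selected
print([f['properties']['ZCTA5CE10'] for f in toy_geo['features']])
```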

In [126]:
m = folium.Map(location=[40.75, -73.97], zoom_start=9)

choropleth = folium.Choropleth(highlight=True,
    geo_data=NY_zip,
    name='choropleth',
    data=df_NY.head(),
    columns=['RegionName', 'Distance'],
    key_on='feature.properties.ZCTA5CE10',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='transit time to UN HQ (minutes)'
).add_to(m)

folium.LayerControl().add_to(m)

# add tooltip to see the zipcode
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['ZCTA5CE10'], labels=False)
)

m

In [157]:
# load the New York state zipcodes
# path = 'ny_new_york_zip_codes_geo.min.json'
# with open(path) as json_file:  
#     NY_zip = json.load(json_file)

m = folium.Map(location=[40.75, -73.97], zoom_start=9)

choropleth = folium.Choropleth(highlight=True,
    geo_data=NY_zip,
    name='choropleth',
    data=df_NY,
    columns=['RegionName', 'SizeRank'],
    key_on='feature.properties.ZCTA5CE10',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='size rank'
).add_to(m)

folium.LayerControl().add_to(m)

# add tooltip to see the zipcode
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['ZCTA5CE10'], labels=False)
)

m

## Step 4: Reshape from Wide to Long Format

In [None]:
def melt_data(df):
    # keep every non-date column as an identifier so that only the 'YYYY-MM' columns are melted
    id_vars = [c for c in ['RegionName', 'City', 'State', 'Metro', 'CountyName', 'Distance'] if c in df.columns]
    melted = pd.melt(df, id_vars=id_vars, var_name='time')
    melted['time'] = pd.to_datetime(melted['time'], format='%Y-%m')
    melted = melted.dropna(subset=['value'])
    return melted.groupby('time').aggregate({'value':'mean'})

In [None]:
df_NY.drop(['RegionID', 'SizeRank'], axis=1, inplace=True)

In [None]:
melt_data(df_NY)
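
To make the wide-to-long reshape concrete, here is a minimal sketch on a two-zipcode toy frame, with values taken from the first two months of zipcodes 10025 and 10023 shown earlier. The names `toy`, `long_df` and `monthly_mean` are illustrative assumptions.

```python
import pandas as pd

# wide format: one row per zipcode, one column per month
toy = pd.DataFrame({
    'RegionName': [10025, 10023],
    'City': ['New York', 'New York'],
    '1996-04': [167200.0, 350900.0],
    '1996-05': [167400.0, 351700.0],
})

# long format: one row per (zipcode, month) pair
long_df = pd.melt(toy, id_vars=['RegionName', 'City'], var_name='time')
long_df['time'] = pd.to_datetime(long_df['time'], format='%Y-%m')

# average across zipcodes for each month
monthly_mean = long_df.groupby('time').aggregate({'value': 'mean'})
print(monthly_mean)
```

The result is a single time series indexed by month, which is the shape a time series model expects.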

## Step 3: EDA and Visualization

In [None]:
import matplotlib
import matplotlib.pyplot as plt

# use a large, bold font for all plots
font = {'family' : 'sans-serif',
        'weight' : 'bold',
        'size'   : 22}

matplotlib.rc('font', **font)

# NOTE: if your visualizations are too cluttered to read, try calling 'plt.gcf().autofmt_xdate()'!

## Step 5: ARIMA Modeling

## Step 6: Interpreting Results