# Final Project Submission
---

## Info

* Student name: **Barto Molina**
* Student pace: **part time**
* Scheduled project review date/time: **10/15/2019 5:00 PM (EST)**
* Instructor name: **Victor Geislinger**
* Blog post URL: [...](https://medium.com/@bartomolina/...)

## The Project

The goal of the project is to find the five best zipcodes in the US to invest in. To keep the scope manageable, I limit the research to New York City zipcodes. Through EDA and modelling, I try to identify the areas likely to show the largest percentage increase in value over the coming years.
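
As a rough illustration of the ranking criterion, the percentage increase of each zipcode between two monthly columns can be computed directly on a wide-format table. This is only a sketch: the helper name `pct_increase`, the zipcodes, and the values are made up for the example and are not taken from the project data.

```python
import pandas as pd


def pct_increase(wide, start_col, end_col):
    """Percentage change in value between two monthly columns, per row."""
    return (wide[end_col] - wide[start_col]) / wide[start_col] * 100


# Hypothetical values indexed by zipcode (not taken from zillow_data.csv)
toy = pd.DataFrame(
    {"2013-04": [100000.0, 200000.0], "2018-04": [150000.0, 220000.0]},
    index=pd.Index([11111, 22222], name="RegionName"),
)

# Sort so the strongest performers come first, then keep the top five
growth = pct_increase(toy, "2013-04", "2018-04").sort_values(ascending=False)
print(growth.head(5))
```

The same idea would apply to forecasted values once a model is available: replace the end column with the predicted value.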

## Imports

We import the libraries used throughout the project:

In [9]:
import pandas as pd
import googlemaps

In [10]:
# load the Google API key from a .env file with dotenv
# (see .env.example for the expected format)
import os
from dotenv import load_dotenv
load_dotenv()

GOOGLE_KEY = os.getenv("GOOGLE_KEY")
gmaps = googlemaps.Client(key=GOOGLE_KEY)
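
`os.getenv` returns `None` when the variable is not set, so a missing `.env` file can go unnoticed until the key is actually used. A small guard such as the one below fails early with a clearer message. The helper name `require_env` is hypothetical and not part of dotenv or googlemaps.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; use .env.example as a reference")
    return value


# Usage, after load_dotenv():
#   GOOGLE_KEY = require_env("GOOGLE_KEY")
```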

## Step 1: Load the Data/Filtering for Chosen Zipcodes

In [11]:
df = pd.read_csv('zillow_data.csv')
df.head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
0,84654,60657,Chicago,IL,Chicago,Cook,1,334200.0,335400.0,336500.0,...,1005500,1007500,1007800,1009600,1013300,1018700,1024400,1030700,1033800,1030600
1,90668,75070,McKinney,TX,Dallas-Fort Worth,Collin,2,235700.0,236900.0,236700.0,...,308000,310000,312500,314100,315000,316600,318100,319600,321100,321800
2,91982,77494,Katy,TX,Houston,Harris,3,210400.0,212200.0,212200.0,...,321000,320600,320200,320400,320800,321200,321200,323000,326900,329900
3,84616,60614,Chicago,IL,Chicago,Cook,4,498100.0,500900.0,503100.0,...,1289800,1287700,1287400,1291500,1296600,1299000,1302700,1306400,1308500,1307000
4,93144,79936,El Paso,TX,El Paso,El Paso,5,77300.0,77300.0,77300.0,...,119100,119400,120000,120300,120300,120300,120300,120500,121000,121500


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14723 entries, 0 to 14722
Columns: 272 entries, RegionID to 2018-04
dtypes: float64(219), int64(49), object(4)
memory usage: 30.6+ MB


## Step 2: Data Preprocessing

In [None]:
def get_datetimes(df):
    # assumes every column after the first one is a 'YYYY-MM' date label
    return pd.to_datetime(df.columns.values[1:], format='%Y-%m')

In [13]:
# keep only the zipcodes whose city is New York
df_NY = df[df['City'] == 'New York'].copy()

In [14]:
df_NY.head()

Unnamed: 0,RegionID,RegionName,City,State,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,...,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04
6,61807,10467,New York,NY,New York,Bronx,7,152900.0,152700.0,152600.0,...,394400,400000,407300,411600,413200,414300,413900,411400,413200,417900
10,62037,11226,New York,NY,New York,Kings,11,162000.0,162300.0,162600.0,...,860200,851000,853900,870000,885100,887800,890500,901700,930700,963200
12,62087,11375,New York,NY,New York,Queens,13,252400.0,251800.0,251400.0,...,1022600,1033700,1048600,1066400,1081200,1088800,1092700,1089500,1084000,1084600
13,62045,11235,New York,NY,New York,Kings,14,190500.0,191000.0,191500.0,...,767300,777300,788800,793900,796000,799700,806600,810600,813400,816200
20,61625,10011,New York,NY,New York,New York,21,,,,...,12137600,12112600,12036600,12050100,12016300,11946500,11978100,11849300,11563000,11478300


In [15]:
df_NY['CountyName'].value_counts()

Queens      55
Kings       28
Bronx       13
Richmond    12
New York     6
Name: CountyName, dtype: int64

Next, we select the zipcodes that are less than half an hour from the UN headquarters by public transportation.

In [25]:
destination = 'United Nations Secretariat Building, East 42nd Street, New York, NY'

for i, zipcode in df_NY.head().iterrows():
    print(zipcode['RegionName'])
    # transit lookup, disabled for now; the duration value is in seconds
    #directions = gmaps.directions(zipcode['RegionName'], destination, mode="transit")
    #duration = directions[0]['legs'][0]['duration']['value']

10467
11226
11375
11235
10011
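
The half-hour filter described above could be sketched as follows. This is not the notebook's final implementation: the function names, the `"<zip>, New York, NY"` origin format, and the handling of an empty result are assumptions. The only client call used is `directions(origin, destination, mode="transit")`, the same one that appears in the commented-out lines.

```python
def transit_minutes(client, origin, destination):
    """Transit travel time in minutes, or None if no route is returned."""
    routes = client.directions(origin, destination, mode="transit")
    if not routes:
        return None
    # 'value' holds the duration of the first leg in seconds
    seconds = routes[0]["legs"][0]["duration"]["value"]
    return seconds / 60


def zipcodes_within(client, zipcodes, destination, max_minutes=30):
    """Keep the zipcodes whose transit time to `destination` is under the limit."""
    selected = []
    for zipcode in zipcodes:
        # Assumption: a "<zip>, New York, NY" string is a usable origin
        minutes = transit_minutes(client, f"{zipcode}, New York, NY", destination)
        if minutes is not None and minutes < max_minutes:
            selected.append(zipcode)
    return selected


# Usage with the objects defined earlier in the notebook:
#   close_zipcodes = zipcodes_within(gmaps, df_NY['RegionName'], destination)
```

Each zipcode costs one Directions request, so it may be worth caching the results before re-running the cell.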


## Step 4: Reshape from Wide to Long Format

In [50]:
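def melt_data(df):
    # reshape to long format: one row per (zipcode, month)
    melted = pd.melt(df, id_vars=['RegionName', 'City', 'State', 'Metro', 'CountyName'], var_name='time')
    # column labels look like '1996-04'; infer_datetime_format is deprecated in pandas 2.x
    melted['time'] = pd.to_datetime(melted['time'], format='%Y-%m')
    melted = melted.dropna(subset=['value'])
    # average value across all zipcodes for each month
    return melted.groupby('time').aggregate({'value':'mean'})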

In [51]:
df_NY.drop(['RegionID', 'SizeRank'], axis=1, inplace=True)

In [52]:
melt_data(df_NY)

time,value
1996-04-01,2.148192e+05
1996-05-01,2.152337e+05
1996-06-01,2.156981e+05
1996-07-01,2.161865e+05
1996-08-01,2.167202e+05
1996-09-01,2.173125e+05
1996-10-01,2.179885e+05
1996-11-01,2.188394e+05
1996-12-01,2.199029e+05
1997-01-01,2.208837e+05


## Step 3: EDA and Visualization

In [None]:
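import matplotlib
import matplotlib.pyplot as plt

# 'normal' is not a font family name; use a generic family instead
font = {'family' : 'sans-serif',
        'weight' : 'bold',
        'size'   : 22}

matplotlib.rc('font', **font)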

# NOTE: if your visualizations are too cluttered to read, try calling 'plt.gcf().autofmt_xdate()'!

## Step 5: ARIMA Modeling

## Step 6: Interpreting Results