By:  David R. Torres<br>
Flatiron School<br>
Github repo: https://github.com/davidrtorres/dsc-mod-4-project-v2-1-onl01-dtsc-pt-041320/tree/master

# **Using an ARIMA Model for Time Series Forecasting**

### **Introduction**
Business Problem: I am a consultant for Premium Real Estate, LLC.  The firm asked me to provide analysis and recommendations for investing in real estate in the top 5 zipcodes in Brooklyn, NY that will provide the highest return on investment.  The investment firm is looking for short-term investments with the highest returns over a 3 year period.  The investment firm isn't looking for long term investments.<br>
<br>
I will make recommendations based on the real estate prices in Brooklyn. The top 5 zipcodes or 'best' zipcodes will be those with the highest ROI over the 3 year period.<br>
<br>
For the task, I analyzed real estate sales data from Zillow which covers the time period 4-1-1996 to 4-1-2018.<br>
I used an auto-ARIMA model to conduct a grid search and find the p, d, q and seasonal P, D, Q orders with the lowest AIC score for each zipcode.  I used a SARIMA model to make predictions for the test data.  I used the metric RMSE to evaluate how well my models performed in making predictions.  I then fit models to produce dynamic forecasts for 3 years.<br>

In [1]:
print('Notebook 12-18-20')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from statsmodels.tsa.stattools import adfuller

import warnings
warnings.filterwarnings('ignore')
import itertools
import statsmodels.api as sm

Notebook 12-18-20


In [None]:
zillow = pd.read_csv('https://raw.githubusercontent.com/learn-co-students/dsc-mod-4-project-v2-1-onl01-dtsc-ft-070620/master/time-series/zillow_data.csv')

In [None]:
zillow.rename(columns={'RegionName': 'Zipcode'}, inplace=True)
zillow.head()

## Preprocessing Data
### Melted Data Function
The dataset is from Zillow.com, so I first had to reshape the data frame from wide to long format using the function .melt() and then transform it into a time series data frame.  Then I created a subset of the dataframe containing only properties where column 'State' is 'NY' and column 'CountyName' is 'Kings'.

In [None]:
def melt_data(df):
    """
    df - is the dataframe
    This is a time series so I need a column for dates to become the index.
    .melt() function sets up dataframe so date columns can be merged as a single column.  
    """
    melted = pd.melt(df, id_vars=['RegionID','Zipcode', 'City', 'State', 'Metro', 'CountyName', 
                                  'SizeRank'], var_name='Month', value_name='MeanValue')
    melted['Month'] = pd.to_datetime(melted['Month'], format='%Y-%m')
    #melted = melted.set_index('Month')
    melted = melted.dropna(subset=['MeanValue'])
    return melted

In [None]:
all_zipcodes = melt_data(zillow)

In [None]:
all_zipcodes.columns

In [None]:
all_zipcodes.head(10)

In [None]:
kings_zips = all_zipcodes[(all_zipcodes['CountyName']=='Kings') & (all_zipcodes['State']== 'NY')]
kings_zips

In [None]:
#for loop gets the monthly mean sales price for each Brooklyn zipcode and puts it in dictionary. 

test_dict = {}

for zipcode in kings_zips['Zipcode'].unique(): 
    all_zips = kings_zips[kings_zips['Zipcode'] == zipcode]
    all_zips = all_zips.set_index('Month')['MeanValue']
    all_zips = all_zips.asfreq('MS')
    all_zips.name = zipcode
    test_dict[zipcode] = all_zips
    

In [None]:
test_dict

In [None]:
zip_df = pd.concat(test_dict, axis=1)

In [None]:
len(zip_df)

### **Dataframe of Brooklyn Zipcodes** 

After pre-processing the data I limited the scope of my search to real property in Brooklyn, NY.  These
properties were identified in the dataframe under 'CountyName' as 'Kings'.

In [None]:
type(zip_df[11226])

In [None]:
zip_df.head()

In [None]:
zip_df.tail()

In [None]:
len(zip_df.columns)

In [None]:
zip_df.index

In [None]:
zip_df.isna().sum()

#### **NaN Values**

In [None]:
zip_df.bfill(inplace=True)

In [None]:
zip_df.isna().sum()

In [None]:
zip_df[11238].value_counts(dropna=False)

In [None]:
zip_df.keys()

In [None]:
zip_df[11238].head()

### **Plot of Price Trends of Brooklyn Zipcodes**

This is a plot of the time series data for each zip code demonstrating the price trends.  There are 28 zipcodes.  An overall upward trend can be observed from the years 1996 to 2018.  Regarding the housing bubble, we can see in the plot that housing prices peaked in early 2006, started to decline in 2006 and 2007, and then reached lows in 2012.<br>
The top 5 zip codes have consistently been top 5 performers for around 14 years.  The bottom 3 zipcodes displayed little change in price over time compared to top priced zip codes.  

In [None]:
zip_df.plot(figsize=(12,8))
plt.title("Housing Price Trends ")
#plt.set(title=f'Housing Prices by Year - {zip_df.index.freq}')
plt.xlabel('Year')
plt.ylabel('Brooklyn Real Estate Price Trends for Zipcodes')
plt.legend(bbox_to_anchor=(1.05,1),loc='upper left')

#### Plot of top 6 zipcodes with highest sale prices over extended periods. 

In [None]:
zip_df_1 = zip_df[[11217, 11238,11215,11216,11222,11205]]

In [None]:
zip_df_1.plot(figsize=(12,8))
plt.title("Top 6 Zipcodes With Highest Real Estate Prices")
#plt.set(title=f'Housing Prices by Year - {zip_df.index.freq}')
plt.xlabel('Year')
plt.ylabel('Home Prices')
plt.legend(bbox_to_anchor=(1.05,1),loc='upper left')

Zipcodes 11217, 11238, and 11205 had the same monthly real estate values over the early dates, so I created a plot that leaves out those dates.  <br>

In [None]:
zip_no_nan = zip_df_1['2003-12-01':] 
zip_no_nan

In [None]:
zip_no_nan.plot(figsize=(12,8))
plt.title("Housing Price Trends ")
#plt.set(title=f'Housing Prices by Year - {zip_df.index.freq}')
plt.xlabel('Year')
plt.ylabel('Home Prices')
plt.legend(bbox_to_anchor=(1.05,1),loc='upper left')

In [None]:
zip_df.tail()

### **Train/Test Split**
The zip_df dataset is split into train and test sets to be used as inputs for the models.  The train set runs from 1996-04-01 to 2014-01-01.  Its length is 214 rows, or 214 monthly time periods.<br>
The test set runs from 2014-02-01 to 2018-04-01.  Its length is 51 rows, or 51 monthly time periods.
That is the number of periods we will use for our .predict() method.
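
As a sanity check on the split sizes, the row counts can be derived from the dates alone, because the series has one row per month. This is a minimal sketch using only the standard library. The helper name `months_inclusive` is hypothetical, and the sketch assumes there are no missing months between 1996-04-01 and 2018-04-01 and that the split is coded with `.loc[:'2014-01-01']` for train and `.loc['2014-01-02':]` for test.

```python
from datetime import date

def months_inclusive(start: date, end: date) -> int:
    """Number of month-start timestamps from start to end, both included."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1

# Label-based slicing with .loc includes both endpoints, so the training
# slice up to '2014-01-01' keeps the January 2014 row.
train_rows = months_inclusive(date(1996, 4, 1), date(2014, 1, 1))

# Slicing from '2014-01-02' starts at the next month-start label, 2014-02-01.
test_rows = months_inclusive(date(2014, 2, 1), date(2018, 4, 1))

print(train_rows, test_rows, train_rows + test_rows)
```

Under those assumptions the counts are 214 training rows and 51 test rows, 265 in total.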

In [None]:
year = '2014-01-01'
train_brk = zip_df.loc[:year]
#test_brk = zip_df.loc[year:]
test_brk = zip_df.loc['2014-01-02':]

In [None]:
print(len(zip_df))
print(train_brk.shape)
print(test_brk.shape)

In [None]:
train_brk.tail()

In [None]:
test_brk.head()

In [None]:
test_brk.tail()

In [None]:
test_brk[11238].iloc[[0, -1]]  # first and last rows by position

## **Auto_Arima Model**
Why use an auto-ARIMA model?  We use the auto-ARIMA process because it identifies the optimal parameters for an ARIMA model.  To use an ARIMA model, the p, d, q values must be supplied to the model.  Generally, statistical techniques are used to generate these values: differencing the series to eliminate non-stationarity and reading values off the ACF and PACF plots.  I would add that the ACF and PACF plots are hard to interpret.<br>

What do p, d and q represent?  The p is the parameter associated with the auto-regressive aspect of the model, which incorporates past values.  For example, forecasting that if it rained a lot over the past few days it's likely that it will rain tomorrow as well.<br>  
The d is the parameter associated with the integrated part of the model, which affects the amount of differencing to apply to a time series, i.e., forecasting that the amount of rain tomorrow will be similar to the amount of rain today, if the daily amounts of rain have been similar over the past few days.<br>

The q is the parameter associated with the moving average part of the model, which models the current value in terms of past forecast errors.<br>

In the auto ARIMA, the P, D, and Q describe the same associations as p, d, and q, but correspond with the seasonal components of the model.<br>
The auto ARIMA works similarly to a grid search to find the optimal values for p, d, and q.  For each zip code it fits candidate combinations of these values and keeps the model with the lowest AIC score (best fit).  Because stepwise=True is set here, it follows a stepwise search over neighboring combinations rather than fitting every possible combination.  The final combination of parameters for each zipcode is the one with the lowest AIC.
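
To make the selection rule concrete, here is a minimal sketch of choosing among candidate orders by AIC, where AIC = 2k - 2 ln(L) for a model with k estimated parameters and log-likelihood ln(L). The candidate orders, log-likelihoods, and parameter counts are made up for illustration and are not taken from the Zillow fits. This shows only the comparison step, not how pmdarima searches.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates:
# (order, seasonal_order) -> (log-likelihood, number of estimated parameters)
candidates = {
    ((1, 1, 0), (0, 1, 1, 12)): (-1500.0, 3),
    ((1, 1, 1), (0, 1, 1, 12)): (-1490.0, 4),
    ((2, 1, 2), (1, 1, 1, 12)): (-1489.5, 7),
}

scores = {orders: aic(ll, k) for orders, (ll, k) in candidates.items()}
best_orders = min(scores, key=scores.get)
print(best_orders, scores[best_orders])
```

In this made-up example the largest model has the best log-likelihood but loses on AIC, because the penalty for its extra parameters outweighs the small gain in fit.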

In [None]:
import six
import sys
sys.modules['sklearn.externals.six'] = six

In [None]:
!pip install pmdarima

In [None]:
import six
import joblib
import sys
sys.modules['sklearn.externals.six'] = six
sys.modules['sklearn.externals.joblib'] = joblib
import pmdarima as pm
from pmdarima import auto_arima

In [None]:
def arima_model(df):
    """
    df- dataframe
    function is a gridsearch to get optimal p,d,qs and lowest AIC for the model.
    q-is moving average
    """
    autoarima_model = auto_arima(df, start_p = 0, start_q = 0, #start_q = 0
                              test='adf',             # use adftest to find optimal 'd'
                              max_p = 3, max_q = 3,   # maximum p and q
                              m = 12,                  #frequency of series 
                              d = None,               # let model determine 'd', was 1
                              seasonal = True, 
                              start_P=0, D=1, trace = False, #start  #trace= True
                              error_action ='ignore',   # we don't want to know if an order does not work 
                              suppress_warnings = True,  # we don't want convergence warnings 
                              stepwise = True)           # set to stepwise  
       
    return autoarima_model


In [None]:
stepwise_fit = arima_model(train_brk[11226])

In [None]:
stepwise_fit.summary()

### **Dataframe of p,d,qs, Seasonal p,d,qs and lowest AIC**
The for loop iterates through the Brooklyn zipcodes dataframe (zip_df) and uses the arima_model function to get the best fit parameters (p,d,qs, Seasonal p,d,qs) and lowest AICs for each Brooklyn zipcode.  The list, arima_list, is then converted into a Pandas dataframe.  

In [None]:
arima_list = [['zipcode', 'pdq','seasonal_pdq','aic']] 
for col in zip_df.columns:
  zip_data = arima_model(zip_df[col])
  arima_list.append([col,zip_data.order, zip_data.seasonal_order, zip_data.aic()])
#result   
output_df = pd.DataFrame(arima_list[1:],columns=arima_list[0]) 
output_df  

### **Dataframe of PDQs, Seasonal PDQs and AICs**

In [None]:
output_df

## **SARIMA Model**

### **Fitting a SARIMA Time Series Model**
Using a grid search approach, I used the auto-ARIMA model to identify the set of optimal parameters that produce the best-fitting model of the time series data.  The optimal parameter values are then passed into the SARIMAX model.  I used a SARIMAX model because it takes trends and seasonality into account.  Accordingly, we can model our data without differencing it by hand, because the d and D orders tell the model how much differencing to apply internally.<br>    

The coef column shows the weight of each term and how each one impacts the time series patterns. 
The P>|z| column provides the significance of each weight.<br>

If a weight has a p-value lower than or close to 0.05, it is reasonable to retain it in the model.<br>

Model diagnostics: the purpose is to ensure that the residuals are uncorrelated and normally distributed with zero mean.  N(0,1) is the standard notation for a normal distribution with a mean of 0 and a standard deviation of 1.  If the residuals follow N(0,1) closely, this is a good indication that they are normally distributed.<br>

In [None]:
def fit_ARIMA(df, order=None, seasonal_order=None):
    """
    forecasting statsmodel SARIMAX model
    """
    ARIMA_MODEL = sm.tsa.statespace.SARIMAX(df, 
                                        order=order, 
                                        seasonal_order=seasonal_order, 
                                        enforce_stationarity=False, 
                                        enforce_invertibility=False)

    # Fit the model and print results
    output = ARIMA_MODEL.fit()

    #display / no tables 1
    display(output.summary())
    
    print('\n')
    print('MODEL DIAGNOSTICS')
    
    output.plot_diagnostics(figsize=(15, 18));
    plt.show()
    
    return output

## **Validating the Model**
We're going to see how good our ARIMA model is at forecasting the sale price of homes located in zipcode 11218.

### **One-Step-Ahead Forecasting**

One Step Ahead forecasting means that forecasts at each point are generated using the full history data up to that point to make the prediction.  This allows us to evaluate how good our model is at predicting just one value ahead. 

In order to validate the model, I started by comparing predicted values to real values of the time series.  This will help us understand the accuracy of our forecasts.  I picked zip code 11218 as an example.<br>

The methods .get_prediction() and .conf_int() allow us to obtain the values and related confidence intervals for the time series forecasts.  The method .get_prediction() generates in-sample predictions and requires a start date, so predictions are produced from 1/1/2014 onward.  At each step the model makes its prediction from the known values.  For this part I will be working with the whole dataset and not the train or test sets.<br>

I set the dynamic parameter to false so that the model produces one-step ahead visuals.  The method .conf_int gives us the upper and lower limits on the values of our predictions of the 'pred' object.  This will generate a dataframe of the upper and lower uncertainty range of our prediction. 
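
As a rough illustration of what an upper and lower limit around a prediction means, the sketch below builds a two-sided interval from a forecast mean and its standard error. It assumes normally distributed forecast errors and 95% coverage. The helper name `normal_interval` and the numbers are hypothetical, and this is not a reproduction of how statsmodels computes its intervals internally.

```python
from statistics import NormalDist

def normal_interval(mean: float, std_error: float, alpha: float = 0.05):
    """Two-sided (1 - alpha) interval around a forecast, assuming normal errors."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return mean - z * std_error, mean + z * std_error

# Hypothetical forecast of 1,000,000 with a standard error of 10,000.
lower, upper = normal_interval(1_000_000, 10_000)
print(round(lower), round(upper))
```

A larger standard error widens the band, which is why the shaded region grows as forecasts reach further from the known data.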
    

### Model Diagnostics
In the fit_ARIMA function I included the method '.plot_diagnostics()' to run on the ARIMA output for a plot of the diagnostics to ensure that none of the assumptions are violated. 

The method '.plot_diagnostics()' enables us to confirm whether the residuals remain uncorrelated and normally distributed with zero mean.  If these assumptions do not hold, we cannot move forward and the model needs further tweaking.

The qq-plot on the bottom left shows that for the most part the ordered distribution of residuals (blue dots) follows the linear trend of the samples taken from a standard normal distribution with N(0, 1).  This is a strong indication that the residuals are normally distributed.
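
To show what the points on a QQ-plot are, here is a minimal sketch that pairs sorted, standardized residuals with the matching quantiles of N(0, 1). The residual values are made up, the helper name `qq_points` is hypothetical, and the plotting positions (i + 0.5) / n are one common convention that I am assuming here. If the residuals are roughly normal, the two columns track each other closely.

```python
from statistics import NormalDist, mean, pstdev

def qq_points(residuals):
    """Pair theoretical N(0, 1) quantiles with sorted, standardized residuals."""
    n = len(residuals)
    mu, sigma = mean(residuals), pstdev(residuals)
    ordered = sorted((r - mu) / sigma for r in residuals)
    # Plotting positions (i + 0.5) / n keep probabilities strictly inside (0, 1).
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals, made up for illustration.
residuals = [-1.2, 0.4, 0.1, -0.3, 0.9, -0.6, 1.5, -0.8, 0.2, -0.2]
points = qq_points(residuals)
for theoretical_q, sample_q in points:
    print(f"{theoretical_q:6.2f} {sample_q:6.2f}")
```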

In [None]:
zipcode_osa = 11218
zip_params= output_df[output_df['zipcode']==zipcode_osa]
zip_params.pdq.values[0]
zip_params.seasonal_pdq.values[0]

output_sarima = fit_ARIMA(zip_df[zipcode_osa],order=zip_params.pdq.values[0], seasonal_order= zip_params.seasonal_pdq.values[0] )

pred = output_sarima.get_prediction(start=pd.to_datetime('2014-01-01'), dynamic=False)
pred_conf = pred.conf_int()

In [None]:
zip_params = output_df[output_df['zipcode']==11218]
zip_params

In [None]:
zip_params.columns

In [None]:
zip_params['pdq']

In [None]:
zip_params.pdq.values[0]

#### Plot of One-Step-Ahead Forecasting

Below is a plot of the real and forecasted values of the time series to assess how well the model did.  The plot includes the confidence intervals, which in this case overlap the predicted values.  The mean prediction is marked with the orange line.  The uncertainty range is shaded in green.  The uncertainty is due to the random terms that can't be predicted. 

The central value of the forecast is stored in the .predicted_mean attribute of the pred object.

The method .fill_between() produces the shaded area between our lower and upper limits.

Visually, the model did fairly well at making the predictions for zipcode 11218 because the forecasts align with the true test set values.

In [None]:
plt.figure(figsize=(12,5))
# Plot observed values
ax = train_brk[11218].plot(label='observed')
test_brk[11218].plot(label='Test')
# Plot predicted values
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=0.9)

# Plot the range for confidence intervals
ax.fill_between(pred_conf.index,
                pred_conf.iloc[:, 0],
                pred_conf.iloc[:, 1], color='g', alpha=0.5)

# Set axes labels
ax.set_xlabel('Date')
ax.set_ylabel('Sale Price')
plt.title(f'One-Step-Ahead Forecasting for Zipcode {zipcode_osa}')
plt.legend();

#### Check the Model's Accuracy

I will check the model's accuracy in making the prediction by using the metric RMSE (Root Mean Squared Error).  RMSE can be read as the standard deviation of the residuals (prediction errors).  Residuals measure how far each observed value is from the value the model predicted, and RMSE measures how spread out these residuals are.  It is expressed in the same units as the prices.

The model was able to forecast the monthly sale prices in the test set within 10,607.48 of the real prices.  The prices in zip code 11218 range from around 1,003,700.00 to 2,202,400.00.  
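
For reference, the RMSE calculation itself is short. The sketch below works through it on a tiny made-up series using only the standard library. The values and the helper name `rmse` are hypothetical and unrelated to the Zillow data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical monthly values and forecasts, made up for illustration.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 126.0]
print(rmse(actual, predicted))
```

The errors here are -2, 2, -3 and 4, so the mean squared error is 33 / 4 = 8.25 and the RMSE is its square root, a little under 2.9.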

In [None]:
# Get the real and predicted values for zipcode 11218
forecasted_11218 = pred.predicted_mean
truth_11218 = test_brk[11218]

# Compute the mean squared error (pandas aligns the two series on their dates)
mse = ((forecasted_11218 - truth_11218) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
# Compute the root mean squared error
rmse = np.sqrt(mse)
print('The Root Mean Squared Error of our forecasts is {}'.format(round(rmse, 2)))

In [None]:
train_brk[zipcode_osa].describe().round(3)

### Dynamic Forecasting

We can make predictions further than just one step ahead.  With dynamic forecasting, we predict one step ahead and then use that predicted value to forecast the next value after it.  Because we don't know the shock terms for those future steps, the uncertainty can grow quickly.

Here the dynamic parameter is set to True.
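
The difference between the two modes can be seen on a toy AR(1) model, where each prediction is a coefficient times the previous value. This is a minimal sketch with a made-up series and coefficient. The function names are hypothetical and the model is far simpler than the SARIMA models fitted in this notebook.

```python
def one_step_ahead(observed, phi):
    """Predict each value from the previous *observed* value (AR(1), no intercept)."""
    return [phi * observed[t - 1] for t in range(1, len(observed))]

def dynamic(observed, phi, start):
    """From index `start` on, feed the model's own predictions back in."""
    preds = []
    previous = observed[start - 1]  # last value the model is allowed to see
    for _ in range(start, len(observed)):
        previous = phi * previous
        preds.append(previous)
    return preds

# Hypothetical series and AR coefficient, made up for illustration.
observed = [10.0, 11.0, 12.5, 13.0, 15.0]
phi = 1.1

print(one_step_ahead(observed, phi)[1:])  # predictions for indices 2..4
print(dynamic(observed, phi, start=2))    # predictions for indices 2..4
```

Both modes agree on the first forecast, since each starts from the same observed value. They diverge afterwards because the dynamic forecast never sees the later observations, so its errors compound.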

In [None]:
zipcode_df = 11218
zip_params = output_df[output_df['zipcode']== zipcode_df]
zip_params.pdq.values[0]
zip_params.seasonal_pdq.values[0]

output_sarima = fit_ARIMA(zip_df[zipcode_df ],order=zip_params.pdq.values[0] ,seasonal_order= zip_params.seasonal_pdq.values[0] )
# Get dynamic predictions with confidence intervals as above 
pred_dynamic = output_sarima.get_prediction(start=pd.to_datetime('2014-01-01'), dynamic=True,full_results=True)
pred_dynamic_conf = pred_dynamic.conf_int()

Plotting the observed and forecasted values of the time series, we see that the overall forecasts are accurate even when using dynamic forecasting.  The model was able to forecast the monthly sale prices within 3724.22 of the real prices.   

In [None]:

def prediction_vis(pred_dynamic,pred_dynamic_conf, y):
  # Plot the dynamic forecast with confidence intervals.
  plt.figure(figsize=(12,5))
  # Plot observed values
  ax = y.plot(label='Observed')

  # Plot predicted values
  pred_dynamic.predicted_mean.plot(ax=ax, label='Dynamic Forecast', alpha=0.9)

  # Plot the range for confidence intervals
  ax.fill_between(pred_dynamic_conf.index,
                  pred_dynamic_conf.iloc[:, 0],
                  pred_dynamic_conf.iloc[:, 1], color='g', alpha=0.5)

  # Set axes labels
  ax.set_xlabel('Date')
  ax.set_ylabel('Sale Price')
  plt.legend()

  return ax

In [None]:
prediction_visual = prediction_vis(pred_dynamic,pred_dynamic_conf,train_brk[zipcode_df])
prediction_visual

In [None]:
# Get the real and predicted values for zipcode 11218
forecast_11218 = pred_dynamic.predicted_mean
truth_11218 = train_brk[zipcode_df]
# Note: train_brk ends at 2014-01-01, so only dates present in both series
# contribute to the errors below.

# Compute the mean squared error
mse = ((forecast_11218 - truth_11218) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
# Compute the root mean squared error
rmse = np.sqrt(mse)
print('The Root Mean Squared Error of our forecasts is {}'.format(round(rmse, 2)))

In [None]:
# Get forecast --- steps ahead in future

# prediction = output_sarima.get_forecast(steps=36, dynamic=True)
# prediction.predicted_mean

# # Get confidence intervals of forecasts
# predict_conf = prediction.conf_int()

In [None]:
steps = 36
# Get forecast --- steps ahead in future
prediction_object = output_sarima.get_forecast(steps=steps, dynamic=True)
prediction_object.predicted_mean

predict_conf = prediction_object.conf_int()

In [None]:
prediction_visual_2 = prediction_vis(prediction_object,predict_conf,zip_df[zipcode_df])
prediction_visual_2

### **Return on Investment DataFrame**
The method .get_forecast() computes the forecasted values for a specified number of steps ahead.<br>

The method .conf_int() gets the confidence intervals of forecasts.<br>
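
In this notebook, the ROI for a zipcode compares the first forecast month (treated as the cost) with the last forecast month (treated as the value), and the confidence bounds of the last month give a lower and upper ROI. The sketch below works through that arithmetic with made-up numbers. The function name `roi` and all values are hypothetical.

```python
def roi(cost: float, value: float) -> float:
    """Return on investment: (value - cost) / cost."""
    return (value - cost) / cost

# Hypothetical numbers: the first forecast month as the purchase price, and the
# final forecast month's mean with its lower and upper confidence bounds.
cost = 1_000_000.0
final_mean, final_lower, final_upper = 1_200_000.0, 900_000.0, 1_500_000.0

summary = {
    "roi": roi(cost, final_mean),
    "roi_lower": roi(cost, final_lower),
    "roi_upper": roi(cost, final_upper),
}
print(summary)
```

An ROI of 0.2 corresponds to a 20% gain, and a negative lower bound means the interval includes the possibility of a loss.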


In [None]:
# Get forecast --- steps ahead in future
prediction = output_sarima.get_forecast(steps=36, dynamic=True)
prediction.predicted_mean

# Get confidence intervals of forecasts
predict_conf = prediction.conf_int()


In [None]:
steps = 36
# Get forecast --- steps ahead in future
prediction_object = output_sarima.get_forecast(steps=steps, dynamic=True)

In [None]:
def my_function(prediction_object, zip):
  """
  Builds a summary of the forecast for one zipcode (predicted mean plus the
  lower and upper confidence bounds) and returns its first and last rows,
  which are later used to compute ROI.
  """
  df_Summary = pd.concat([pd.DataFrame({f'Predicted_Mean {zip}':prediction_object.predicted_mean}), prediction_object.conf_int()],axis = 1)
  my_sample = df_Summary.iloc[[0, -1]].round(3)  # first and last forecast rows

  return my_sample


In [None]:
my_output = my_function(prediction_object, zip='11218')
my_output

In [None]:

def my_roi(cost, current):
  """  
  Function to calculate ROI:
  ROI = (Current Value of Investment - Cost of Investment) / Cost of Investment
  """
  return (current - cost) / cost


In [None]:
cost = my_output.iloc[0,0]
current = my_output.iloc[-1,0]

In [None]:

current_lower = my_output.iloc[-1,1]
current_upper = my_output.iloc[-1,2]

In [None]:
#upper lower end
roi_dic = {}

cost = my_output.iloc[0,0]
current = my_output.iloc[-1,0]
current_lower = my_output.iloc[-1,1]
current_upper = my_output.iloc[-1,2]


my_roi(cost, current)
roi_dic['roi'] = my_roi(cost, current)
roi_dic['roi_lower'] = my_roi(cost, current_lower)
roi_dic['roi_upper'] = my_roi(cost, current_upper)

roi_dic

In [None]:
my_output

In [None]:
zip_rois={}
steps = 36

#def zipcode_roi(output_df,):
for zipcode in output_df['zipcode'].unique():
  pdq = output_df.loc[ output_df['zipcode']==zipcode, 'pdq'].iloc[0] 
  seasonal = output_df.loc[ output_df['zipcode']==zipcode, 'seasonal_pdq'].iloc[0] 
  df_ts = zip_df[zipcode]


  output_sarima = fit_ARIMA(df_ts, order=pdq, seasonal_order=seasonal)
  prediction_object = output_sarima.get_forecast(steps=steps, dynamic=True)
  my_output = my_function(prediction_object, zip=zipcode)
  
  roi_dic = {}

  cost = my_output.iloc[0,0]
  current = my_output.iloc[-1,0]
  current_lower = my_output.iloc[-1,1]
  current_upper = my_output.iloc[-1,2]

  my_roi(cost, current)
  roi_dic['roi'] = my_roi(cost, current)
  roi_dic['roi_lower'] = my_roi(cost, current_lower)
  roi_dic['roi_upper'] = my_roi(cost, current_upper)

  zip_rois[zipcode] = pd.Series(roi_dic)
ROI = pd.DataFrame(zip_rois)

In [None]:
roi_df = ROI.T 
roi_df.reset_index(inplace=True)
roi_df.rename(columns={'index':'zipcode'}, inplace=True)
roi_df.style.background_gradient()

#### **ROI Chart**

In [None]:
roi_chart_1 = roi_df.sort_values(by=['roi'],ascending=False)
#roi_chart_1 = roi_chart_1.round(3)
roi_chart_1.style.background_gradient()

## **Dynamic Forecasting**
Forecasting begins 4-1-2018

In [None]:
def forecast_function(output_df, current_zip=None,steps=None):
  
  # roi_t[roi_t[roi_t.name]== current_zip]
  # print('\n')
  zip_params = output_df[output_df['zipcode']==current_zip]
  zip_params.pdq.values[0]
  zip_params.seasonal_pdq.values[0]

  #steps = 36
  output_sar = fit_ARIMA(zip_df[current_zip], order=zip_params.pdq.values[0], seasonal_order=zip_params.seasonal_pdq.values[0])
  prediction = output_sar.get_forecast(steps=steps, dynamic=True)
  prediction.predicted_mean

  # Get confidence intervals of forecasts
  predict_conf = prediction.conf_int()

  return prediction, predict_conf, current_zip


In [None]:
def forecast_visual(prediction,predict_conf, y, figsize=None):
  """
  prediction-statsmodel object
  predict_conf- pd Dataframe
  """
  print(roi_df[roi_df['zipcode']== current_zip])
  print('\n')
  # Plot future predictions with confidence intervals
  fig,ax = plt.subplots(figsize=figsize)
  ax = y.plot(label='Observed') #(10, 8))
  prediction.predicted_mean.plot(ax=ax, label='Future Forecast')
  ax.fill_between(predict_conf.index,
                  predict_conf.iloc[:, 0],
                  predict_conf.iloc[:, 1], color='k', alpha=0.25)
  
    #I added this and can delete  
    #ax.axvline(prediction.predicted_mean.index[12])

  label_font = {'weight':'bold','size':18}
  ax.set_xlabel('Date',fontdict=label_font)
  ax.set_ylabel('Home Prices',fontdict=label_font)
  ax.set_title(f'Price Forecast for Zipcode: {y.name} /{steps} Months ',fontdict=label_font)

  ax.legend(loc="upper left")

  return ax





### **Zipcode: 11223**
Zipcode 11223 ranks in 1st place for ROI.  The ROI is forecast to be 60% on average.<br>
At the lower end of the forecast, the return will be 5.27.<br> 
At the upper end of the forecast, the return will be 86.31.  This will take you into year 2021 and after.<br>
Either way it's a good return on the investment.

In [None]:
#prediction, predict_conf, current_zip =  forecast_function(output_df, current_zip=11210, steps=36)
#df_test['Btime'].iloc[0]
prediction, predict_conf, current_zip =  forecast_function(output_df, current_zip=roi_chart_1['zipcode'].iloc[0], steps=36)

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[0]].plot()

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[0]].describe()

In [None]:
forecast_visual_zip = forecast_visual(prediction,predict_conf,test_brk[roi_chart_1['zipcode'].iloc[0]], figsize=(12,8))
forecast_visual_zip

### **Zipcode: 11210**

In [None]:
prediction, predict_conf, current_zip =  forecast_function(output_df, current_zip=roi_chart_1['zipcode'].iloc[1],steps=36)

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[1]].describe().round(3)

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[1]].plot()

In [None]:
forecast_visual_zip = forecast_visual(prediction,predict_conf,test_brk[roi_chart_1['zipcode'].iloc[1]], figsize=(12,8))
forecast_visual_zip

### **Zipcode: 11230**

In [None]:
prediction, predict_conf, current_zip =  forecast_function(output_df, current_zip=roi_chart_1['zipcode'].iloc[2],steps=36)

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[2]].describe().round(3)

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[2]].plot()

In [None]:
forecast_visual_zip = forecast_visual(prediction,predict_conf,test_brk[roi_chart_1['zipcode'].iloc[2]], figsize=(12,8))
forecast_visual_zip

### **Zipcode: 11224**

In [None]:
prediction, predict_conf, current_zip =  forecast_function(output_df, current_zip=roi_chart_1['zipcode'].iloc[3],steps=36)

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[3]].describe().round(3)

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[3]].plot()

In [None]:
forecast_visual_zip = forecast_visual(prediction,predict_conf,test_brk[roi_chart_1['zipcode'].iloc[3]], figsize=(12,8))
forecast_visual_zip

### **Zipcode: 11233**

In [None]:
prediction, predict_conf, current_zip =  forecast_function(output_df, current_zip=roi_chart_1['zipcode'].iloc[4],steps=36)

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[4]].describe().round(3)

In [None]:
test_brk[roi_chart_1['zipcode'].iloc[4]].plot()

In [None]:

forecast_visual_zip = forecast_visual(prediction,predict_conf,test_brk[roi_chart_1['zipcode'].iloc[4]], figsize=(12,8))
forecast_visual_zip

## ***Stationarity***

### **Zipcode: 11226**

In [None]:
zip_df[11226].plot()

In [None]:
def test_stationarity_1(timeseries, window):
    
    #Defining rolling statistics
    rolmean = timeseries.rolling(window=window).mean()
    rolstd = timeseries.rolling(window=window).std()

    #Plot rolling statistics:
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries.iloc[window:], color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='upper left')
    plt.title('Rolling Mean & Standard Deviation')
    plt.legend(bbox_to_anchor=(1.05,1),loc='upper left')
    plt.show()
    

In [None]:
#Not mine

def dickey_fuller_test_ind_zip(zip_code):
    dftest = adfuller(zip_code)

    # Extract and display test results in a user friendly manner
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dftest)

    print ('Results of Dickey-Fuller Test:')

    return dfoutput

In [None]:
new_dic = {}
for col in zip_df.columns:
  zip_test = dickey_fuller_test_ind_zip(zip_df[col])
  new_dic[col] = zip_test

In [None]:
new_dic[11226]

In [None]:

def dickey_fuller_test_zipcodes(df):
    for col in df.columns:
        dftest = adfuller(df[col])
        dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
        for key,value in dftest[4].items():
            dfoutput['Critical Value (%s)'%key] = value
        print ('Results of Dickey-Fuller Test:')
        
        print(dfoutput) 
        #print(dftest)
        print ('\n')         

In [None]:
dickey_fuller_test_zipcodes(zip_df)

In [None]:
X_1 = zip_df.copy()

In [None]:
def stationary_test(df):
    rolling_mean = df.rolling(window=12).mean()
    rolling_std = df.rolling(window=12).std()

    plt.plot(df,color='blue',label='original')
    plt.plot(rolling_mean, color='red',label='Rolling Mean')
    plt.plot(rolling_std, color='green',label='Rolling STD')
    plt.legend(loc='best')
    plt.title('Rolling Mean and Rolling Standard Deviation')
    #plt.show()
    result = adfuller(df)
    print('ADF statistic: {}'.format(result[0]))
    print('p-value: {}'.format(result[1]))
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t{} : {}'.format(key,value))
        
    names = ['Test Statistic','p-value','#Lags Used','# of Observations Used']
    res  = dict(zip(names,result[:4]))
    res['Stationary Results'] = res['p-value']<.05
    
    return pd.DataFrame(res,index=['AD Fuller Results'])    

In [None]:
stationary_test(zip_df[11226])

###  Zipcode:  11238

In [None]:
#brooklyn_zips[11226]

In [None]:
#stationary_test(zip_df[11238])

### Zipcode:  11215

In [None]:
stationary_test(zip_df[11215])

### Removing Trend
#### Log-Transformation (np.log)

In [None]:
## Log Transform
ts3 = np.log(zip_df[11226])
#ts3.plot()
stationary_test(ts3)

#### Differencing
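
Differencing replaces each value with its change from an earlier value, which removes a steady trend. Here is a minimal sketch on made-up numbers, using only the standard library. The helper name `difference` is hypothetical. On a pandas Series, the `.diff()` call used in this notebook plays the same role.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical series with a steady upward trend of 5 per step.
trend = [100, 105, 110, 115, 120]
print(difference(trend))

# A lag of 12 on monthly data compares each month with the same month a year
# earlier, which is the idea behind the seasonal differencing order D.
monthly = list(range(24))
print(difference(monthly, lag=12))
```

After differencing, the trending series becomes a constant, so its mean no longer depends on time.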

In [None]:
"""
# Differencing subtracts the previous value from each value in the series.
# It is a good way of eliminating trend.

# The differenced series below is centered around 0.
# We achieved stationarity
# by eliminating the month-to-month level patterns.
"""
ts0 = zip_df[11226].diff().dropna()
#ts0.plot()

stationary_test(ts0)

#### Subtract Rolling Mean 

In [None]:
## Subtract Rolling mean
ts2 = (zip_df[11226] - zip_df[11226].rolling(3).mean()).dropna()
#ts2.plot()
stationary_test(ts2)

#### Subtract Exponentially-Weighted Mean 

In [None]:
## Subtract Exponentially Weight Mean Rolling mean
ts4 = (zip_df[11226] - zip_df[11226].ewm(halflife=7).mean()).dropna()
#ts4.plot()
stationary_test(ts4)

#### Seasonal Decomposition 

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(zip_df[11226])#,model='mul')
decomp.plot();

In [None]:
## Get ADFuller Results for seasonal component
stationary_test(decomp.seasonal)

In [None]:
## Get ADFuller Results for trend component
stationary_test(decomp.trend.dropna())

In [None]:
## Get ADFuller Results for resid component
stationary_test(decomp.resid.dropna())

In [None]:
decomp.resid.dropna()

## **RNN**

In [None]:
df_rnn = zip_df[[11238]]

In [None]:
df_rnn.head()

In [None]:
df_rnn.plot()

In [None]:
len(df_rnn)

In [None]:
265-12

In [None]:
"""
x_train= x_train.reshape(-1, 1)
y_train= y_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
"""
train = df_rnn.iloc[:253]
test = df_rnn.iloc[253:]
#test = test.reshape(1, -1)
#train= train.reshape(-1, 1)

In [None]:
test

In [None]:
len(test)

In [None]:
train.shape

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
scaler.fit(train)
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)
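
Note that the scaler is fit on the training data only, so the test data is scaled with the training minimum and maximum. The NumPy sketch below mirrors what `MinMaxScaler` does with its default range of 0 to 1. The arrays and the helper functions `transform` and `inverse_transform` are made up for illustration.

```python
import numpy as np

train_demo = np.array([[100.0], [150.0], [200.0]])  # made-up training values
test_demo = np.array([[220.0], [250.0]])            # made-up test values

# "fit": learn the minimum and maximum from the training data only
data_min = train_demo.min(axis=0)
data_max = train_demo.max(axis=0)

def transform(x):
    return (x - data_min) / (data_max - data_min)

def inverse_transform(x_scaled):
    return x_scaled * (data_max - data_min) + data_min

scaled_train_demo = transform(train_demo)  # values between 0 and 1
scaled_test_demo = transform(test_demo)    # can exceed 1 when prices keep rising

print(scaled_train_demo.ravel())
print(scaled_test_demo.ravel())
```

Fitting on the training data alone avoids leaking information from the test period into the model. The scaled test values can fall outside the 0 to 1 range.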

In [None]:
scaled_train

In [None]:
from keras.preprocessing.sequence import TimeseriesGenerator

In [None]:
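n_input = 2     # number of past time steps in each input window
n_features = 1  # number of columns in the data (one price series)

# batch_size=1: smaller batch sizes tend to lead to better training
generator = TimeseriesGenerator(scaled_train, scaled_train, length=n_input, batch_size=1)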

In [None]:
scaled_train[:5]

In [None]:
len(scaled_train)

In [None]:
"""

Number of samples in the generator: len(scaled_train) - n_input = 253 - 2 = 251

"""
len(generator)
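
The windows produced by the generator can be sketched in plain NumPy. This is an illustration of the idea, not the Keras implementation. It assumes the generator defaults (stride 1, sampling rate 1) and `batch_size=1`. `make_windows` is a hypothetical helper.

```python
import numpy as np

def make_windows(data, n_input):
    """Return (X, y): each X row holds n_input consecutive steps, y is the next step."""
    n_samples = len(data) - n_input
    X = np.array([data[i:i + n_input] for i in range(n_samples)])
    y = np.array([data[i + n_input] for i in range(n_samples)])
    return X, y

data = np.arange(253, dtype=float).reshape(-1, 1)  # stand-in for scaled_train
X, y = make_windows(data, n_input=2)

print(X.shape, y.shape)              # (251, 2, 1) (251, 1)
print(X[0].ravel(), y[0].ravel())    # first window and its target
```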

In [None]:
# create the model and fit it to the generator object
from keras.models import Sequential
from keras.layers import Dense  # for the final output layer
from keras.layers import LSTM   # long short-term memory

In [None]:
n_input = 12    # look at a full year (12 months) of data before predicting the 13th month
n_features = 1  # number of columns in the data; we have 1 column, the price series for the zipcode

# batch_size=1: smaller batch sizes tend to lead to better training
train_generator = TimeseriesGenerator(scaled_train, scaled_train, length=n_input, batch_size=1)

In [None]:
model = Sequential()

model.add(LSTM(150, activation='relu', input_shape=(n_input, n_features)))
# aggregate all the LSTM neurons into a single prediction
model.add(Dense(1))  # a single dense neuron that directly outputs our prediction
model.compile(optimizer='adam', loss='mse')

In [None]:
"""
The number of neurons in the LSTM layer is a tunable choice; it may be worth experimenting with.
"""
model.summary()
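
The parameter counts that `model.summary()` reports for this architecture can be checked by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias). The helper names are made up.

```python
def lstm_params(units, n_features):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (n_features + units) + units)

def dense_params(n_inputs, n_outputs):
    # one weight per input-output pair plus one bias per output
    return n_inputs * n_outputs + n_outputs

lstm_count = lstm_params(units=150, n_features=1)
dense_count = dense_params(n_inputs=150, n_outputs=1)
total = lstm_count + dense_count

print(lstm_count, dense_count, total)  # 91200 151 91351
```

With roughly 91,000 parameters and only about 240 training windows, changing the number of LSTM units is a reasonable thing to experiment with.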

In [None]:
"""
Fit to our training generator.
The more epochs you use, the longer it takes to train.
1 epoch is a single complete pass through the training data.

We get a significant reduction in loss over the first couple of epochs, then around epoch 15 we start seeing convergence.

"""
model.fit_generator(train_generator, epochs=25)

In [None]:
model.history.history.keys()

In [None]:
plt.plot(range(len(model.history.history['loss'])),model.history.history['loss']);

In [None]:
"""
Evaluate on the test data by creating an evaluation batch.
Our network is trained to predict 1 step ahead:
it takes 12 time steps as input
    and then predicts step 13.

We need the last 12 points of the training data in order to predict the first point of the test data.

These are the last 12 points of the training set.
"""
first_eval_batch = scaled_train[-12:]
first_eval_batch

In [None]:
"""
The batch is now 3-dimensional with shape (1, n_input, n_features), so the printed array starts with 3 brackets.
"""
first_eval_batch = first_eval_batch.reshape((1,n_input,n_features))
first_eval_batch

In [None]:
"""
Call the model on first_eval_batch, which returns an array holding the prediction.
Given these 12 points of training data, the model predicts that the value below should be the first point of the test set.
"""
model.predict(first_eval_batch)

In [None]:
scaled_test

In [None]:
"""
Predict not just the first point in the test set but the entire test set.
This is how to forecast into the future
using the RNN model.
"""
# holds the predictions
test_predictions = []
# last n_input points from the training set
first_eval_batch = scaled_train[-n_input:]
# reshape to the format the RNN expects (same format as TimeseriesGenerator)
current_batch = first_eval_batch.reshape((1,n_input,n_features))

# how far into the future to forecast: the length of the test set
for i in range(len(test)):
    # predict 1 time step ahead of the current 12-point window
    current_pred = model.predict(current_batch)[0]  # [0] drops the batch dimension
    test_predictions.append(current_pred)

    # update the current batch: drop the oldest point and append the prediction
    current_batch = np.append(current_batch[:,1:,:],[[current_pred]],axis=1)
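
The sliding-window update in this loop can be tried without a trained network. In the sketch below, `stand_in_predict` is a hypothetical replacement for `model.predict` that simply returns the window mean. The numbers are made up and the window length is 3 instead of 12 to keep the output short.

```python
import numpy as np

n_steps, n_cols = 3, 1

def stand_in_predict(batch):
    """Hypothetical stand-in for model.predict: window mean, shape (1, n_cols)."""
    return batch.mean(axis=1)

history = np.array([[1.0], [2.0], [3.0]])  # last n_steps points of made-up training data
window = history.reshape((1, n_steps, n_cols))

preds = []
for _ in range(4):
    pred = stand_in_predict(window)[0]  # shape (n_cols,)
    preds.append(pred)
    # slide the window: drop the oldest step, append the prediction as the newest step
    window = np.append(window[:, 1:, :], [[pred]], axis=1)

preds = np.array(preds)
print(preds.ravel())
print(window.shape)
```

Because each prediction becomes an input for the next step, errors can accumulate as the forecast horizon grows.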

In [None]:
test_predictions

In [None]:
true_predictions = scaler.inverse_transform(test_predictions)
true_predictions

In [None]:
test = test.copy()  # work on a copy so the new column is not added to a slice of df_rnn
test['Predictions'] = true_predictions

In [None]:
test

### **RNN Plot / Sales v Predicted Values**

In [None]:
"""
Actual sales values vs. predicted values
"""
test.plot(figsize=(12,5));
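
The introduction names RMSE as the evaluation metric. A minimal sketch of that calculation for a set of predictions is shown below with made-up numbers. In the notebook, the two arrays would presumably be the actual and `Predictions` columns of `test`.

```python
import numpy as np

actual = np.array([100.0, 102.0, 105.0])     # made-up actual values
predicted = np.array([101.0, 101.0, 107.0])  # made-up predictions

# Root mean squared error: square the errors, average them, take the square root
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)
```

RMSE is expressed in the same units as the series, so it can be read directly as a typical prediction error in dollars.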

### **Recommendations**
Below are the five Brooklyn zipcodes with the highest predicted ROI, which I recommend investing in:<br>
11223  (63%)<br>
11210  (59%)<br>
11230  (46%)<br>
11224  (45%)<br>
11233  (42%)<br>
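
For reference, a percentage like those above can be derived from a current price and a forecast price. The sketch below assumes ROI is defined as the forecast price change relative to the current price; the exact calculation used for the figures above is not shown in this section. The prices are made up.

```python
def roi(current_price, forecast_price):
    """Return on investment as a fraction of the current price."""
    return (forecast_price - current_price) / current_price

# Made-up prices for illustration only
example_roi = roi(500_000, 815_000)
print(f"{example_roi:.0%}")  # 63%
```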