Use this command to convert the notebook into a slideshow that can be presented:


jupyter nbconvert --to slides  --ServePostProcessor.port=8910 --ServePostProcessor.ip='0.0.0.0' --post serve SMMT.ipynb

In [1]:
import plotly
plotly.tools.set_credentials_file(username='churtado', api_key='iaMRV6ydU9Ove5Yfy0R7')

In [2]:
%load_ext sparkmagic.magics

http://10.13.12.209:8998/

In [3]:
%manage_spark


Added endpoint http://10.13.12.209:8998/
Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
10,,spark,idle,,,✔


SparkSession available as 'spark'.


In [4]:
%%spark
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

val df = sqlContext.read.format("com.crealytics.spark.excel").option("sheetName", "pc").option("useHeader", "true").option("inferSchema", "false").load("/user/hive/data/smmt/smmt.xls")
df.createTempView("pc")

sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@1d01ce68
df: org.apache.spark.sql.DataFrame = [date: string, passenger_cars: string]


In [5]:
%%spark -c sql
SHOW TABLES

Unnamed: 0,database,tableName,isTemporary
0,default,cleaned_taxes,False
1,default,passenger_cars,False
2,,pc,True


In [6]:
%%spark -c sql
SELECT * from pc limit 3

Unnamed: 0,date,passenger_cars
0,2018-06-30,234945
1,2018-05-31,192649
2,2018-04-30,167911


In [7]:
%%spark -c sql
DROP TABLE IF EXISTS passenger_cars

In [8]:
%%spark -c sql 

CREATE TABLE passenger_cars AS
SELECT 
 date, 
 passenger_cars
FROM pc

In [9]:
%%spark -c sql -q -o df_passenger_cars
SELECT * from pc

This notebook is an attempt to forecast the SMMT passenger car (PC) data.  
  
The data consists of two columns: the date and the total number of passenger cars sold.

When decomposing a time series, there are two kinds of components to look out for:  
  
Systematic: Components of the time series that have consistency or recurrence and can be described and modeled.  
Non-Systematic: Components of the time series that cannot be directly modeled.  

Concretely, the components are:  
  
    Level: The average value in the series.  
    Trend: The increasing or decreasing value in the series.  
    Seasonality: The repeating short-term cycle in the series.  
    Noise: The random variation in the series.  
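
To make the four components concrete, here is a minimal sketch that builds a synthetic series from them, once additively and once multiplicatively. Every number in it (a level of 100, a slope of 2 per step, a 12-step seasonal cycle) is an illustrative assumption and not an SMMT figure, and `synthetic_series` is a hypothetical helper name.

```python
import math
import random

def synthetic_series(n=36, period=12, model="additive", seed=0):
    """Compose a series from level, trend, seasonality, and noise (all values hypothetical)."""
    rng = random.Random(seed)
    series = []
    for t in range(n):
        level = 100.0                                  # average value
        trend = 2.0 * t                                # steady increase per step
        season = math.sin(2 * math.pi * t / period)    # repeating cycle
        noise = rng.gauss(0.0, 1.0)                    # random variation
        if model == "additive":
            # y_t = level + trend + seasonality + noise
            value = level + trend + 10.0 * season + noise
        else:
            # y_t = (level + trend) * seasonality * noise, with factors centred on 1
            value = (level + trend) * (1.0 + 0.1 * season) * (1.0 + 0.01 * noise)
        series.append(value)
    return series

additive = synthetic_series(model="additive")
multiplicative = synthetic_series(model="multiplicative")
```

In the additive version the seasonal swing keeps the same size as the trend rises; in the multiplicative version the swing grows with the level of the series.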

Let's try to decompose the series. First, set up the dataframe:

In [10]:
%matplotlib inline

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

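# Work on a copy so the frame returned by sparkmagic is left untouched
df_passenger_cars_decomp = df_passenger_cars.copy()
df_passenger_cars_decomp['date'] = pd.to_datetime(df_passenger_cars_decomp['date'])
# Index by date so the decomposition sees a time-indexed series
df_passenger_cars_decomp = df_passenger_cars_decomp.set_index('date')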

In [11]:
df_passenger_cars_decomp.head(5)

date,passenger_cars
2018-06-30,234945
2018-05-31,192649
2018-04-30,167911
2018-03-31,474069
2018-02-28,80805


We'll start by decomposing the series under an additive model:

In [12]:
from statsmodels.tsa.seasonal import seasonal_decompose

result_additive = seasonal_decompose(df_passenger_cars_decomp, model='additive')
# print(result_additive.trend)
# print(result_additive.seasonal)
# print(result_additive.resid)
# print(result_additive.observed)
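
As a rough sketch of what a naive additive decomposition does: estimate the trend with a centred moving average, average the detrended values at each position in the cycle to get the seasonal component, and treat whatever is left as the residual. This is a simplified illustration and not statsmodels' actual implementation; the names `centred_moving_average` and `naive_additive_decompose` and the toy input are hypothetical.

```python
def centred_moving_average(values, period):
    """Centred moving average; None where the window does not fit."""
    n = len(values)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = values[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: the two end points get half weight each
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period
    return trend

def naive_additive_decompose(values, period):
    """Split values into trend, seasonal, and residual (needs at least two full cycles)."""
    trend = centred_moving_average(values, period)
    detrended = [v - t if t is not None else None for v, t in zip(values, trend)]
    # Average the detrended values at each position in the cycle
    means = []
    for k in range(period):
        vals = [d for d in detrended[k::period] if d is not None]
        means.append(sum(vals) / len(vals))
    # Centre the seasonal effects so they sum to zero over one cycle
    offset = sum(means) / period
    seasonal = [means[i % period] - offset for i in range(len(values))]
    resid = [v - t - s if t is not None else None
             for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, resid

# Toy input: linear trend plus a 4-step seasonal pattern, no noise
pattern = [3, -1, -2, 0]
toy = [10 + t + pattern[t % 4] for t in range(12)]
toy_trend, toy_seasonal, toy_resid = naive_additive_decompose(toy, period=4)
```

The trend and residual are undefined for the first and last half-window of the series, which matches the missing values you see at both ends of a moving-average trend.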

We'll take the observed (original) values from the statsmodels decomposition and plot them:

In [38]:
import plotly.plotly as py
import plotly.graph_objs as go

trace1_observed = go.Scatter(
    x = result_additive.observed.index,
    y = result_additive.observed.values,
    xaxis='x',
    yaxis='y',
    name='observed'
)
data_observed = [trace1_observed]
layout_observed = go.Layout(
    autosize=False,
    width=800,
    height=500,
    xaxis=dict(
        domain=[0, 1]
    ),
    yaxis=dict(
        domain=[0, 1]
    )
)


In [39]:
fig_data = go.Figure(data=data_observed, layout=layout_observed)
py.iplot(fig_data, filename='observed')

In [40]:
import plotly.plotly as py
import plotly.graph_objs as go

trace1_additive = go.Scatter(
    x = result_additive.trend.index,
    y = result_additive.trend.values,
    xaxis='x3',
    yaxis='y3',
    name='trend'
)
trace2_additive = go.Scatter(
    x = result_additive.seasonal.index,
    y = result_additive.seasonal.values,
    xaxis='x2',
    yaxis='y2',
    name='seasonal'
)
trace3_additive = go.Scatter(
    x = result_additive.resid.index,
    y = result_additive.resid.values,
    xaxis='x',
    yaxis='y',
    name='residual'
)
trace4_additive = go.Scatter(
    x = result_additive.observed.index,
    y = result_additive.observed.values,
    xaxis='x4',
    yaxis='y4',
    name='observed'
)
data_additive = [trace4_additive, trace1_additive, trace2_additive, trace3_additive]
layout_additive = go.Layout(
    autosize=False,
    width=800,
    height=500,
    xaxis=dict(
        domain=[0, 1]
    ),
    yaxis=dict(
        domain=[0, 0.23]
    ),
    xaxis2=dict(
        domain=[0, 1],
        anchor='y2'  # an x axis anchors to its paired y axis
    ),
    yaxis2=dict(
        domain=[0.26, 0.48],
        anchor='x2'  # a y axis anchors to its paired x axis
    ),
    xaxis3=dict(
        domain=[0, 1],
        anchor='y3'
    ),
    yaxis3=dict(
        domain=[0.52, 0.73],
        anchor='x3'
    ),
    xaxis4=dict(
        domain=[0, 1],
        anchor='y4'
    ),
    yaxis4=dict(
        domain=[0.76, 1],
        anchor='x4'
    )
)


Now plot the naive additive decomposition (observed, trend, seasonal, and residual panels, top to bottom):

In [41]:
fig_additive = go.Figure(data=data_additive, layout=layout_additive)
py.iplot(fig_additive, filename='additive-subplots')

Now we'll decompose the series using a naive multiplicative model:

In [42]:
from statsmodels.tsa.seasonal import seasonal_decompose

result_mult = seasonal_decompose(df_passenger_cars_decomp, model='multiplicative')
# print(result_mult.trend)
# print(result_mult.seasonal)
# print(result_mult.resid)
# print(result_mult.observed)
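
Under a multiplicative model the observed value is the product of the components, so the residual is a ratio centred on 1 instead of a difference centred on 0, and taking logs turns the product into a sum. A minimal sketch with hypothetical component values (none of these numbers come from the SMMT data):

```python
import math

def multiplicative_residual(observed, trend, seasonal):
    """Residual under y = trend * seasonal * resid."""
    return observed / (trend * seasonal)

# Hypothetical components for a single observation
trend_value, seasonal_factor, resid_factor = 200.0, 1.25, 0.98
observed_value = trend_value * seasonal_factor * resid_factor

recovered = multiplicative_residual(observed_value, trend_value, seasonal_factor)

# On the log scale the same model is additive
log_sum = math.log(trend_value) + math.log(seasonal_factor) + math.log(resid_factor)
```

Here `recovered` equals `resid_factor` and `log_sum` equals `math.log(observed_value)`, up to floating-point rounding.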

In [43]:
import plotly.plotly as py
import plotly.graph_objs as go

trace1_mult = go.Scatter(
    x = result_mult.trend.index,
    y = result_mult.trend.values,
    xaxis='x3',
    yaxis='y3',
    name='trend'
)
trace2_mult = go.Scatter(
    x = result_mult.seasonal.index,
    y = result_mult.seasonal.values,
    xaxis='x2',
    yaxis='y2',
    name='seasonal'
)
trace3_mult = go.Scatter(
    x = result_mult.resid.index,
    y = result_mult.resid.values,
    xaxis='x',
    yaxis='y',
    name='residual'
)
trace4_mult = go.Scatter(
    x = result_mult.observed.index,
    y = result_mult.observed.values,
    xaxis='x4',
    yaxis='y4',
    name='observed'
)
data_mult = [trace4_mult, trace1_mult, trace2_mult, trace3_mult]
layout_mult = go.Layout(
    autosize=False,
    width=800,
    height=500,
    xaxis=dict(
        domain=[0, 1]
    ),
    yaxis=dict(
        domain=[0, 0.23]
    ),
    xaxis2=dict(
        domain=[0, 1],
        anchor='y2'  # an x axis anchors to its paired y axis
    ),
    yaxis2=dict(
        domain=[0.26, 0.48],
        anchor='x2'  # a y axis anchors to its paired x axis
    ),
    xaxis3=dict(
        domain=[0, 1],
        anchor='y3'
    ),
    yaxis3=dict(
        domain=[0.52, 0.73],
        anchor='x3'
    ),
    xaxis4=dict(
        domain=[0, 1],
        anchor='y4'
    ),
    yaxis4=dict(
        domain=[0.76, 1],
        anchor='x4'
    )
)

In [44]:
fig_mult = go.Figure(data=data_mult, layout=layout_mult)
py.iplot(fig_mult, filename='multiplicative-subplots')

  
Next, we'll try fitting a linear regression to this series.
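
As a minimal sketch of that idea, assuming the regression is an ordinary least-squares fit of the monthly count against a simple time index (0, 1, 2, ...): the values below are the five months shown by `head(5)` above, reordered oldest first, and `fit_line` is a hypothetical helper, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# February to June 2018, taken from the head(5) output and sorted oldest first
passenger_cars = [80805, 474069, 167911, 192649, 234945]
months = list(range(len(passenger_cars)))  # 0, 1, 2, ... as the time index

intercept, slope = fit_line(months, passenger_cars)

# Extrapolate one step past the last observed month
forecast_next = intercept + slope * len(passenger_cars)
```

With only five points, and a March value far above its neighbours, this fit is illustrative only; a straight line cannot capture the seasonal pattern that the decompositions above separate out.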
  