# More in-depth example with exercise


**Modules you need:**

For this notebook, make sure you have the following modules installed: plotly, pandas, numpy, statsmodels, scipy

In [1]:
import numpy
import pandas
from plotly import __version__ as plotly_version
from plotly.offline import init_notebook_mode, iplot
import statsmodels
import scipy

print("Plotly version: " + plotly_version)

init_notebook_mode(connected=True)         # initiate notebook for offline plot

Plotly version: 3.4.1


**Generating some synthetic data to use:**

Don't worry about the following code; it's just there to generate some pretty-looking random timeseries data. We end up with a dataframe with two numerical columns, 'A' and 'B', and a 'Date' column.

In [2]:
import pandas as pd
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample

num_points = 1500

np.random.seed(13)
xs = arma_generate_sample([1, -1, 0, 0, 0], [1, -1, -1, -0.8, -0.3], nsample=num_points, sigma=0.5, burnin=100)

np.random.seed(10)
# Second, scaled-down series that is added to xs to build column 'B'.
offsets = arma_generate_sample([1, -1, 0, 0, 0], [1, -1, -1, -0.8, -0.3], nsample=num_points, sigma=0.5, burnin=100) * 0.5

df = pd.date_range(start=pd.to_datetime('2014-01-01'), periods=num_points, name='Date').to_frame(index=False)
df['A'] = xs + 20
df['B'] = xs + offsets + 20
df

| | Date | A | B |
|---|---|---|---|
| 0 | 2014-01-01 | 12.333677 | 7.401696 |
| 1 | 2014-01-02 | 13.135447 | 8.013742 |
| 2 | 2014-01-03 | 13.885421 | 9.309362 |
| 3 | 2014-01-04 | 14.853525 | 11.225588 |
| 4 | 2014-01-05 | 15.735236 | 12.557178 |
| 5 | 2014-01-06 | 16.571813 | 13.540015 |
| 6 | 2014-01-07 | 16.311891 | 13.317491 |
| 7 | 2014-01-08 | 17.209576 | 14.594901 |
| 8 | 2014-01-09 | 16.758510 | 14.273342 |
| 9 | 2014-01-10 | 18.016900 | 15.035362 |
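
If statsmodels is not available, a rough stand-in dataset with the same three columns can be built with NumPy alone. This is only a sketch under stated assumptions: it uses plain random walks rather than the ARMA process above, the name `alt_df` and the seed are illustrative, and the values will not match the table shown.

```python
import numpy as np
import pandas as pd

num_points = 1500
rng = np.random.default_rng(13)  # arbitrary seed, only for repeatability

# Random walks: cumulative sums of Gaussian noise.
# This is a stand-in, NOT the ARMA process used in the notebook.
walk_a = rng.normal(scale=0.5, size=num_points).cumsum()
walk_b = rng.normal(scale=0.25, size=num_points).cumsum()

# Same layout as df: a 'Date' column plus numerical columns 'A' and 'B'.
alt_df = pd.date_range(start="2014-01-01", periods=num_points, name="Date").to_frame(index=False)
alt_df["A"] = walk_a + 20
alt_df["B"] = walk_a + walk_b + 20
```

The plotting cells only rely on the column names, so they should work the same way on a frame like this one.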



**Plotting the two time series:**

We can make a nice plot of series A and series B. Note how the date axis is handled gracefully, *even when you zoom in*.

In [3]:
figure = {
    'data': [
        {
            'y': df['A'],
            'x': df['Date'],
            'name': "Series A"
        },
        {
            'y': df['B'],
            'x': df['Date'],
            'name': "Series B"
        }]
}

iplot(figure)

**Adding lines for mean and standard deviation:**

The following plot of Series A has lines for the mean, and one standard deviation either side of it.

In [4]:
figure = {
    'data': [
        {
            'y': df['A'],
            'x': df['Date'],
            'name': "Series A",
            'line': {'color': "rgb(0, 128, 0)"}
        },
        {
            'y': [df['A'].mean(), df['A'].mean()],
            'x': [df['Date'].min(), df['Date'].max()],
            'mode': 'lines',
            'name': "Series A mean",
            'line': {'color': "rgb(162, 162, 162)"}
        },
        {
            'y': [df['A'].mean() + df['A'].std(), df['A'].mean() + df['A'].std()],
            'x': [df['Date'].min(), df['Date'].max()],
            'mode': 'lines',
            'name': "Series A mean + 1 sigma",
            'line': {'color': "rgb(162, 162, 162)", 'dash': 'dash'}
        },
        {
            'y': [df['A'].mean() - df['A'].std(), df['A'].mean() - df['A'].std()],
            'x': [df['Date'].min(), df['Date'].max()],
            'mode': 'lines',
            'name': "Series A mean - 1 sigma",
            'line': {'color': "rgb(162, 162, 162)", 'dash': 'dash'}
        }
    ],
    'layout': {'showlegend': False}
}

iplot(figure)
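
The three guide-line traces above differ only in their level, name and dash style, so they can also be generated in a loop. The sketch below is one possible way to do that; the helper name `sigma_band_traces` and its parameters are made up for illustration, and the tiny `demo` frame stands in for `df`.

```python
import pandas as pd

def sigma_band_traces(series, dates, n_sigma=1, color="rgb(162, 162, 162)"):
    """Return three horizontal line traces: mean, mean + n_sigma*std, mean - n_sigma*std."""
    mean, std = series.mean(), series.std()
    x_ends = [dates.min(), dates.max()]
    traces = []
    for offset, label, dash in [(0, "mean", "solid"),
                                (n_sigma, f"mean + {n_sigma} sigma", "dash"),
                                (-n_sigma, f"mean - {n_sigma} sigma", "dash")]:
        level = mean + offset * std
        traces.append({
            'y': [level, level],   # same y at both ends gives a horizontal line
            'x': x_ends,
            'mode': 'lines',
            'name': label,
            'line': {'color': color, 'dash': dash},
        })
    return traces

# Small illustrative frame (hypothetical values, same column names as df).
demo = pd.DataFrame({
    'Date': pd.date_range('2014-01-01', periods=4),
    'A': [1.0, 2.0, 3.0, 4.0],
})
band = sigma_band_traces(demo['A'], demo['Date'])
```

With the notebook's data, the list returned by `sigma_band_traces(df['A'], df['Date'])` could be appended to the `data` list after the Series A trace.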

**Smoothing the data:**
    
This shows a couple of ways of smoothing the data, using a moving average and using a *Savitzky-Golay filter* (which here gives a nicer result, I believe).

You can change the window size and see how that changes the smoothed curve; try using 51, for example (some SciPy versions require `window_length` to be odd for `savgol_filter`). For smaller window sizes the smoothed curve will follow the data more closely, but be less picturesque.

Notice that *just like in Analytics*, you can hide a series by clicking it in the legend.

In [5]:
#MA_smoothed_A = df['A']
#SG_smoothed_A = df['A']

from scipy.signal import savgol_filter

SG_smoothed_A = savgol_filter(df['A'].values, window_length=151, polyorder=3)
MA_smoothed_A = df['A'].rolling(window=151, min_periods = 1, center=True).mean()

figure = {
    'data': [
        {
            'y': df['A'],
            'x': df['Date'],
            'name': "Series A",
            'line': {'color': "rgb(0, 128, 0)"}
        },
        {
            'y': SG_smoothed_A,
            'x': df['Date'],
            'name': "SG smoothing",
            'line': {'color': "rgb(255, 128, 128)"}
        },
        {
            'y': MA_smoothed_A,
            'x': df['Date'],
            'name': "Mov. avg.",
            'line': {'color': "rgb(128, 128, 255)"}            
        }]
}

iplot(figure)
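
To see the effect of the window size in isolation, here is a small sketch on a made-up zig-zag series (the values are illustrative, not taken from `df`). It uses the same `rolling(...).mean()` call pattern as the cell above and compares how much of the zig-zag survives with a narrow and a wide window.

```python
import pandas as pd

# Hypothetical, deliberately jagged data.
noisy = pd.Series([0.0, 10.0, 0.0, 10.0, 0.0, 10.0, 0.0])

# Same call pattern as in the cell above, with two different window sizes.
narrow = noisy.rolling(window=3, min_periods=1, center=True).mean()
wide = noisy.rolling(window=5, min_periods=1, center=True).mean()

# Peak-to-peak spread: a smaller value means a flatter, smoother curve.
narrow_spread = narrow.max() - narrow.min()
wide_spread = wide.max() - wide.min()
print("window=3 spread:", narrow_spread)
print("window=5 spread:", wide_spread)
```

The wider window leaves less of the original up-and-down movement, which is the trade-off described above: smoother, but further from the data.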





## Making a scatter plot:

We can also make a scatter plot of series A vs series B.

In [6]:
figure = {
    'data': [
        {
            'x': df['A'],
            'y': df['B'],
            'mode': 'markers',
            'marker': {'size': 5}
        }]
}

iplot(figure)

## The Exercise

Your mission, should you choose to accept it, is to use the dataframe `df` and the explanation given above to produce something as close as possible to the following graph:

![try to match this!](exercise_target_plot.png)

Try to get the following things to match:

- the dashed lines, which now appear in both dimensions and show *two* standard deviations above and below the mean;
- the axis titles and the chart title;
- the axis ranges;
- a large black marker at the mean of the values (hint: try marker sizes between 10 and 20 until you find a good match).


## Bonus: if you are Dead Super Keen (or want to practice your pandas)

Produce a plot like the above one, but instead of plotting A vs B, plot A vs *lagged* A, i.e. plot each day's value of A against the value of A from the day before.

# Hints
If you are interested in tackling the correlation plot challenge on your own, stop reading. But if you are getting stuck, look here for help. You can also use `#python-workshop`.

Below are the steps to build the plots.

The first hint is that you want to set the plot ranges manually. Let us store the plot ranges in variables:

In [7]:
min_x = 0
max_x = 60

min_y = 0
max_y = 80

And the titles:

In [8]:
plot_title = 'Scatter plot of A vs B'
x_title = 'Series A'
y_title = 'Series B'

Secondly, you may want to define the plot layout separately:

In [9]:
layout = {
    'title': plot_title,
    'xaxis': {
        'title': x_title,
        'range': [min_x, max_x]
        },
    'yaxis': {
        'title': y_title,
        'range': [min_y, max_y]
        },
    'showlegend': False
}

Now let us look at the data we want to plot. Remember that each data set graphed is in its own `dict`, and the `data` element of the `figure` object is a list of these dicts.

First we have the scatter of blue points for series A and series B, so we want series A on the x axis and series B on the y axis. We want points (which plotly calls `markers`). We have to guess the marker size from the graph, so use trial and error there.

In [10]:
series_scatter_points = {
    'x': df['A'],
    'y': df['B'],
    'mode': 'markers',
    'marker': {'size': 5}
}

The next element we need is a black point at the mean coordinates of the two series. Note that even though this data element has only a single point, its coordinates still need to be given as lists, each with a single item. The x-coordinate is the mean of series A and the y-coordinate is the mean of series B.

In [11]:
A_mean = df['A'].mean()
B_mean = df['B'].mean()
pt_color_black = "rgb(0,0,0)"
balance_point = {
    'x' : [A_mean],
    'y' : [B_mean],
    'mode' : 'markers',
    'marker' : {'size': 12, 'color' : pt_color_black}
}

Finally, we need two sets of lines, one vertical and one horizontal. All the lines are to be gray and dashed. We already saw above how that is achieved:
`'line': {'color': "rgb(162, 162, 162)", 'dash': 'dash'}`
- The vertical set: two lines, one at x = mean of series A minus 2 times its standard deviation, the other at x = mean of series A plus 2 times its standard deviation.
- The horizontal set: two lines, one at y = mean of series B minus 2 times its standard deviation, the other at y = mean of series B plus 2 times its standard deviation.

In [12]:
gray = "rgb(162, 162, 162)"
guide_line_style = {'color': gray, 'dash': 'dash'}

A_std = df['A'].std()
B_std = df['B'].std()
lower_range_x = A_mean - 2*A_std
upper_range_x = A_mean + 2*A_std
lower_range_y = B_mean - 2*B_std
upper_range_y = B_mean + 2*B_std

lower_vertical = {
    'x': [lower_range_x, lower_range_x],
    'y': [min_y, max_y],
    'mode': 'lines',
    'line': guide_line_style
}

upper_vertical = {
    'x' : [upper_range_x, upper_range_x],
    'y' : [min_y, max_y],
    'mode' : 'lines',
    'line' : guide_line_style
}

lower_horizontal = {
    'x' : [min_x, max_x],
    'y': [lower_range_y, lower_range_y],
    'mode' : 'lines',
    'line' : guide_line_style    
}

upper_horizontal = {
    'x' : [min_x, max_x],
    'y': [upper_range_y, upper_range_y],
    'mode' : 'lines',
    'line' : guide_line_style    
}

Now all we have to do is put these plot elements together into a `figure` object (which is a simple `dict` with two keys: `data` and `layout`) and pass it to the `iplot` function.

In [13]:
figure = {
    'data':
        [series_scatter_points,
            balance_point,
            lower_vertical,
            upper_vertical,
            lower_horizontal,
            upper_horizontal
        ],
    'layout': layout
}

iplot(figure)

## Bonus challenge:
Here we first calculate the lagged series; after that, we only need to reference it in place of series B.

In [14]:
# Pandas has a 'shift' function for this, but we'll do it "on foot".
import datetime as dt
lagged_df = df.copy()
lagged_df['Date'] = lagged_df['Date'] + dt.timedelta(days=1)

# You don't really need to 'validate' here but it's a useful thing to know in general.
merged_df = pd.merge(df, lagged_df, on='Date', validate = 'one_to_one', suffixes=('', '_lagged'))

In [15]:
#lagged_df
merged_df.head()

| | Date | A | B | A_lagged | B_lagged |
|---|---|---|---|---|---|
| 0 | 2014-01-02 | 13.135447 | 8.013742 | 12.333677 | 7.401696 |
| 1 | 2014-01-03 | 13.885421 | 9.309362 | 13.135447 | 8.013742 |
| 2 | 2014-01-04 | 14.853525 | 11.225588 | 13.885421 | 9.309362 |
| 3 | 2014-01-05 | 15.735236 | 12.557178 | 14.853525 | 11.225588 |
| 4 | 2014-01-06 | 16.571813 | 13.540015 | 15.735236 | 12.557178 |
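
For comparison, the `shift` route mentioned in the comment above might look like the sketch below. It uses a small made-up frame (`demo`) rather than `df`, and it assumes one row per consecutive day with no gaps: `shift` moves values by rows, not by dates, so the two approaches only agree under that assumption.

```python
import datetime as dt
import pandas as pd

# Small illustrative frame (hypothetical values, consecutive daily dates).
demo = pd.DataFrame({
    'Date': pd.date_range('2014-01-01', periods=5),
    'A': [12.0, 13.0, 14.0, 15.0, 16.0],
})

# Merge-based lag, following the same steps as the cell above.
lagged = demo.copy()
lagged['Date'] = lagged['Date'] + dt.timedelta(days=1)
merged = pd.merge(demo, lagged, on='Date', validate='one_to_one', suffixes=('', '_lagged'))

# shift-based lag: move the values down one row, then drop the first row,
# which has no previous day to take a value from.
shifted = demo.assign(A_lagged=demo['A'].shift(1)).dropna().reset_index(drop=True)

same = merged['A_lagged'].tolist() == shifted['A_lagged'].tolist()
print("Both approaches give the same lagged column:", same)
```

Either way the first day is lost, which is why the merged frame has one row fewer than the original.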


In [16]:
# note that the ranges are unchanged
# new titles
plot_title = 'Scatter plot of A vs lagged A'
x_title = 'Series A'
y_title = 'Series A lagged'
# reset the layout
layout = {
    'title': plot_title,
    'xaxis': {
        'title': x_title,
        'range': [min_x, max_x]
        },
    'yaxis': {
        'title': y_title,
        'range': [min_y, max_y]
        },
    'showlegend': False
}

In [17]:
# the mean value of series A changes a little bit so i need to recompute it.
# (because the lagged is not defined for the 1st value and we only have 1499 values now).
A_mean = merged_df['A'].mean()
A_std = merged_df['A'].std()
A_lagged_mean = merged_df['A_lagged'].mean()
A_lagged_std = merged_df['A_lagged'].std()
lower_range_x = A_mean - 2*A_std
upper_range_x = A_mean + 2*A_std
lower_range_y = A_lagged_mean - 2*A_lagged_std
upper_range_y = A_lagged_mean + 2*A_lagged_std

In [18]:
# we need to slightly alter our data list components to reference the new lagged variables:
# scatter points:
series_scatter_points = {
    'x': merged_df['A'],
    'y': merged_df['A_lagged'],
    'mode': 'markers',
    'marker': {'size': 5}
}

balance_point = {
    'x' : [A_mean],
    'y' : [A_lagged_mean],
    'mode' : 'markers',
    'marker' : {'size': 12, 'color' : pt_color_black}
}

lower_vertical = {
    'x': [lower_range_x, lower_range_x],
    'y': [min_y, max_y],
    'mode': 'lines',
    'line': guide_line_style
}

upper_vertical = {
    'x' : [upper_range_x, upper_range_x],
    'y' : [min_y, max_y],
    'mode' : 'lines',
    'line' : guide_line_style
}

lower_horizontal = {
    'x' : [min_x, max_x],
    'y': [lower_range_y, lower_range_y],
    'mode' : 'lines',
    'line' : guide_line_style    
}

upper_horizontal = {
    'x' : [min_x, max_x],
    'y': [upper_range_y, upper_range_y],
    'mode' : 'lines',
    'line' : guide_line_style    
}

In [19]:
# now we just have to put it all together:
figure = {
    'data':
        [series_scatter_points,
            balance_point,
            lower_vertical,
            upper_vertical,
            lower_horizontal,
            upper_horizontal
        ],
    'layout': layout
}

iplot(figure)
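
Since the A vs B plot and the A vs lagged A plot are built from the same pieces, the steps could be wrapped in one function instead of being repeated. The following is a minimal sketch, not the notebook's own code: the function name `two_sigma_scatter` and its parameters are hypothetical, and the marker sizes and colours are simply copied from the cells above.

```python
import pandas as pd

def two_sigma_scatter(x, y, title, x_title, y_title, x_range, y_range):
    """Build a figure dict: scatter of x vs y, mean marker, and mean +/- 2 sigma guide lines."""
    guide_style = {'color': "rgb(162, 162, 162)", 'dash': 'dash'}
    data = [
        # All points.
        {'x': x, 'y': y, 'mode': 'markers', 'marker': {'size': 5}},
        # Single black marker at the mean of both series.
        {'x': [x.mean()], 'y': [y.mean()], 'mode': 'markers',
         'marker': {'size': 12, 'color': "rgb(0,0,0)"}},
    ]
    for sign in (-1, 1):
        x_level = x.mean() + sign * 2 * x.std()
        y_level = y.mean() + sign * 2 * y.std()
        # Vertical guide at x_level, spanning the full y range.
        data.append({'x': [x_level, x_level], 'y': list(y_range),
                     'mode': 'lines', 'line': guide_style})
        # Horizontal guide at y_level, spanning the full x range.
        data.append({'x': list(x_range), 'y': [y_level, y_level],
                     'mode': 'lines', 'line': guide_style})
    layout = {
        'title': title,
        'xaxis': {'title': x_title, 'range': list(x_range)},
        'yaxis': {'title': y_title, 'range': list(y_range)},
        'showlegend': False,
    }
    return {'data': data, 'layout': layout}

# Small illustrative frame (hypothetical values).
demo = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [2.0, 4.0, 6.0, 8.0]})
demo_figure = two_sigma_scatter(demo['A'], demo['B'], 'Scatter plot of A vs B',
                                'Series A', 'Series B', (0, 60), (0, 80))
```

In the notebook, the returned dict could be passed straight to `iplot`, once with `df['A']` and `df['B']` and once with `merged_df['A']` and `merged_df['A_lagged']`.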