### Pandas Lab: Time Shifts & Multi-Level Indexing

This lab introduces working with time in a more granular way, and building features when your data has hierarchies or panels, i.e., when you have repeated observations for the same objects.  

This is an important concept because many statistical methods don't explicitly account for values that are naturally correlated with one another over time.  

But lots of data **is** highly correlated over time!  

By the time you're done with this lab, you'll have built 9 columns that capture a variety of information about how an observed value is changing with respect to itself.

**Question 1:** Set the multi-level index so the first level is the Stock symbol itself, and the second level is the date.  Make sure the date column is sorted in ascending order.  You might have to use the `sort_index(level=0)` method to get the values straight.

In [1]:
import pandas as pd
# parse_dates makes sure the Date column is read as datetimes rather than strings
df = pd.read_csv("/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit2/data/stocks_panel.csv", parse_dates=['Date'])

In [2]:
df.head()

,Date,Stock,Price
0,2014-11-05,AAPL,108.860001
1,2014-11-05,AMZN,296.519989
2,2014-11-05,FB,74.830002
3,2014-11-05,MSFT,47.860001
4,2014-11-05,GOOGL,555.950012


In [3]:
df = df.set_index(['Stock', 'Date']).sort_index(level=0)

In [4]:
df.head()

Stock,Date,Price
AAPL,2014-11-05,108.860001
AAPL,2014-11-06,108.699997
AAPL,2014-11-07,109.010002
AAPL,2014-11-10,108.830002
AAPL,2014-11-11,109.699997


**Question 2:** To capture some other aspects of dates, create columns in your dataset that capture these aspects of each timestamp:

  - What quarter it's in
  - Whether or not it's the last day of the month/quarter
  - What day of the week it is (i.e., do price changes vary by day?)
  
**Hint:** The `dt` accessor only works on a Series. For index values, the date parts are available directly as attributes of the index.  Multi indices are also a little tricky.  

To get what you want, try this: `df.index.get_level_values(level=1).your_datepart_here`
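As a rough illustration of the hint, here is a minimal sketch on a made-up panel. The tickers `AAA` and `BBB`, the prices, and the column names are all hypothetical and only there to show the pattern:

```python
import pandas as pd

# Toy panel: two hypothetical tickers, three dates each (made-up prices)
dates = pd.to_datetime(["2020-03-30", "2020-03-31", "2020-04-01"])
toy = pd.DataFrame(
    {"Price": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0]},
    index=pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["Stock", "Date"]),
)

# Pull the Date level out as a DatetimeIndex; date parts are plain attributes on it
date_level = toy.index.get_level_values("Date")

toy["Quarter"] = date_level.quarter
toy["DayOfWeek"] = date_level.dayofweek          # Monday=0 ... Sunday=6
toy["IsQuarterEnd"] = date_level.is_quarter_end  # calendar quarter end, not last trading day
print(toy)
```

Note that `is_month_end` and `is_quarter_end` test the calendar date, so a month whose last day falls on a weekend will not have any trading day flagged.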

In [5]:
df.info()
# Checking to see if it's date time

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6285 entries, (AAPL, 2014-11-05 00:00:00) to (MSFT, 2019-11-01 00:00:00)
Data columns (total 1 columns):
Price    6285 non-null float64
dtypes: float64(1)
memory usage: 77.5+ KB


In [6]:
df['Quarter'] = df.index.get_level_values(level=1).quarter
df.head()

Stock,Date,Price,Quarter
AAPL,2014-11-05,108.860001,4
AAPL,2014-11-06,108.699997,4
AAPL,2014-11-07,109.010002,4
AAPL,2014-11-10,108.830002,4
AAPL,2014-11-11,109.699997,4


In [7]:
# dayofweek is numeric: Monday=0 through Sunday=6
df['Day'] = df.index.get_level_values(level=1).dayofweek
df.head()

Stock,Date,Price,Quarter,Day
AAPL,2014-11-05,108.860001,4,2
AAPL,2014-11-06,108.699997,4,3
AAPL,2014-11-07,109.010002,4,4
AAPL,2014-11-10,108.830002,4,0
AAPL,2014-11-11,109.699997,4,1


In [8]:
df['EndofMonth'] = df.index.get_level_values(level=1).is_month_end
df['EndofQuarter'] = df.index.get_level_values(level=1).is_quarter_end
df.head()

Stock,Date,Price,Quarter,Day,EndofMonth,EndofQuarter
AAPL,2014-11-05,108.860001,4,2,False,False
AAPL,2014-11-06,108.699997,4,3,False,False
AAPL,2014-11-07,109.010002,4,4,False,False
AAPL,2014-11-10,108.830002,4,0,False,False
AAPL,2014-11-11,109.699997,4,1,False,False


**Question 3:** Time Series Embedding

Lots of times if you're trying to predict the value of something tomorrow, the most important piece of information is what the value of something is today, and yesterday, and so on.

Try and create columns that capture previously observed values for each stock.  

Make two columns that capture the value of the following:

 - What the previous recorded price for each stock was
 - The stock price from two observations ago
 
**Remember:** This has to be done on a particular level of the index to make sure it's getting applied appropriately!
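To see why the index level matters, here is a small sketch on made-up data (the tickers, prices, and column names are hypothetical). It compares a plain `shift()` with one applied per stock:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["AAA", "BBB"], pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])],
    names=["Stock", "Date"],
)
toy = pd.DataFrame({"Price": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0]}, index=idx)

# Plain shift ignores the Stock level: BBB's first row inherits AAA's last price
toy["NaiveLag"] = toy["Price"].shift(1)

# Grouping on the Stock level restarts the lag for every stock
toy["GroupedLag"] = toy.groupby(level="Stock")["Price"].shift(1)
print(toy)
```

In the naive version, the first `BBB` row picks up a price that belongs to `AAA`, which is exactly the kind of leak the grouped version avoids.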

In [9]:
df['PreviousValue'] = df.groupby(level=0)['Price'].shift()
df['Value2DaysAgo'] = df.groupby(level=0)['Price'].shift(2)
df.head(10)

Stock,Date,Price,Quarter,Day,EndofMonth,EndofQuarter,PreviousValue,Value2DaysAgo
AAPL,2014-11-05,108.860001,4,2,False,False,,
AAPL,2014-11-06,108.699997,4,3,False,False,108.860001,
AAPL,2014-11-07,109.010002,4,4,False,False,108.699997,108.860001
AAPL,2014-11-10,108.830002,4,0,False,False,109.010002,108.699997
AAPL,2014-11-11,109.699997,4,1,False,False,108.830002,109.010002
AAPL,2014-11-12,111.25,4,2,False,False,109.699997,108.830002
AAPL,2014-11-13,112.82,4,3,False,False,111.25,109.699997
AAPL,2014-11-14,114.18,4,4,False,False,112.82,111.25
AAPL,2014-11-17,113.989998,4,0,False,False,114.18,112.82
AAPL,2014-11-18,115.470001,4,1,False,False,113.989998,114.18


**Question 4:** How did each stock price change compared to the S&P 500? 

It's often useful to see how the item you're tracking moves relative to something else.  

In the data folder is a file called `s&p.csv`, and it contains the price history of the S&P 500 index for each day since its inception. See if you can load it, and merge the `Adj Close` column into your dataset, so there's a column that displays the observed value of the index for every single price observation we have in our dataset.

**Hints:**
 - Merging on multi-level indices is tricky and prone to failure.  To make this a little bit easier, just use `reset_index()` to pop out the date column in the multi-index, and merge on it as if it were a regular column.
 - Make sure both date columns are actually encoded as dates, rather than strings, or else the merge won't work.
 - You'll want to go back to the multi-level index when you're done with this step.
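The three hints can be sketched end to end on toy data. Everything here is made up for illustration: the `panel` and `benchmark` frames, the tickers, and the prices. The `how="left"` and `validate="many_to_one"` arguments are optional safeguards I've added, not something the lab requires:

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-01", "2020-01-02"])
panel = pd.DataFrame(
    {"Price": [10.0, 11.0, 20.0, 21.0]},
    index=pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["Stock", "Date"]),
)

# Hypothetical index file; Date arrives as strings, as it would without parse_dates
benchmark = pd.DataFrame({"Date": ["2020-01-01", "2020-01-02"],
                          "Adj Close": [3000.0, 3010.0]})
benchmark["Date"] = pd.to_datetime(benchmark["Date"])  # make both keys datetimes

merged = (
    panel.reset_index()                                # index levels become columns
    .merge(benchmark[["Date", "Adj Close"]], on="Date",
           how="left",                                 # keep every stock observation
           validate="many_to_one")                     # fail if benchmark dates repeat
    .set_index(["Stock", "Date"])                      # back to the multi-level index
    .sort_index()
)
print(merged)
```

A left join leaves `NaN` in `Adj Close` for any date missing from the benchmark, where the default inner join would silently drop those rows.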

In [10]:
sp = pd.read_csv("/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit2/data/s&p.csv", parse_dates=['Date'])
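# reset_index() turns Stock and Date back into columns so we can merge on Date.
# merge() defaults to an inner join, so any dates missing from the S&P file are dropped.
df = df.reset_index().merge(sp[['Date','Adj Close']], on='Date')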

In [11]:
df = df.set_index(['Stock', 'Date']).sort_index(level=0)

In [12]:
df.head(10)

Stock,Date,Price,Quarter,Day,EndofMonth,EndofQuarter,PreviousValue,Value2DaysAgo,Adj Close
AAPL,2014-11-05,108.860001,4,2,False,False,,,2023.569946
AAPL,2014-11-06,108.699997,4,3,False,False,108.860001,,2031.209961
AAPL,2014-11-07,109.010002,4,4,False,False,108.699997,108.860001,2031.920044
AAPL,2014-11-10,108.830002,4,0,False,False,109.010002,108.699997,2038.26001
AAPL,2014-11-11,109.699997,4,1,False,False,108.830002,109.010002,2039.680054
AAPL,2014-11-12,111.25,4,2,False,False,109.699997,108.830002,2038.25
AAPL,2014-11-13,112.82,4,3,False,False,111.25,109.699997,2039.329956
AAPL,2014-11-14,114.18,4,4,False,False,112.82,111.25,2039.819946
AAPL,2014-11-17,113.989998,4,0,False,False,114.18,112.82,2041.319946
AAPL,2014-11-18,115.470001,4,1,False,False,113.989998,114.18,2051.800049


**Question 5:** Window Statistics

Lots of times, we want to capture some idea of momentum, or how a value compares with what's usually observed.

For example, if we had 48 purchases in a store today, how does that number compare to what's happened in the last 14 days?  Are things trending up or trending down?  

This also allows us to get a clearer picture of general trends in values, even if there are irregular daily spikes.

To handle these sorts of issues, pandas has an entire set of tools for calculating window statistics, available through the `rolling` method. It works like this:

In [13]:
# I'll create a sample dataframe with 36 days' worth of values (Jan 1 through Feb 5)
import numpy as np
index = pd.date_range(start='01/01/2020', end='02/05/2020')
sample_df = pd.DataFrame(np.random.randn(36), index=index, columns=['Value'])
# and here's what it looks like
sample_df.head()

,Value
2020-01-01,1.135131
2020-01-02,0.202561
2020-01-03,-1.589153
2020-01-04,0.393599
2020-01-05,2.585817


In [14]:
# and now we'll see rolling 10 day averages
sample_df.rolling(10).mean()

,Value
2020-01-01,
2020-01-02,
2020-01-03,
2020-01-04,
2020-01-05,
2020-01-06,
2020-01-07,
2020-01-08,
2020-01-09,
2020-01-10,0.257259


You can specify the number of observations to calculate, and choose your aggregator -- `mean()`, `min()`, `sum()`, etc, although `mean()` is the most common.
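Here is a small sketch of swapping aggregators on the same window, using a made-up five-value series. The `min_periods` argument is an extra I've included to show how to avoid the leading `NaN` values; it isn't needed for the lab:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=pd.date_range("2020-01-01", periods=5))

window = s.rolling(3)  # each result uses the current row plus the two before it

summary = pd.DataFrame({
    "mean": window.mean(),
    "min": window.min(),
    "sum": window.sum(),
    # min_periods=1 returns a value as soon as one observation is available
    "mean_partial": s.rolling(3, min_periods=1).mean(),
})
print(summary)
```

With the default settings, the first two rows are `NaN` because a full window of three observations isn't available yet.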

**Your Turn:** Calculate the rolling 5 & 10 day moving averages for each stock inside the dataset.

**Note:** Do *not* try and merge them back into your dataset yet, just make sure you have the values showing up.

In [15]:
rolling5 = df.groupby(level=0)['Price'].rolling(5).mean().values

In [16]:
rolling10 = df.groupby(level=0)['Price'].rolling(10).mean().values

If you take a look at the index of the grouped rolling result (before calling `.values`), you should notice that it has *three* levels to it, and not just two like before.  

Combining datasets with differing numbers of levels is cumbersome, and there's a decent amount of churn in what methods work from one version of Pandas to another.  

For now, try and get these values back into your original dataset by taking the following steps:

 - calculate the 5 & 10 day rolling averages for each stock price on the multilevel index, and save these as variables, and then use the `values` attribute for each one to drop the index and just get the column values (ask me about this if you have questions)
 - use `reset_index()` to flatten the index on your original dataframe back into regular columns
 - create new columns for the 5 & 10 day moving averages in the original dataset, using the values from the first step.
 
So as a quick example, it would sort of work like this:

`five_day = df.groupby(level=0)['Price'].your_stuff_here.values`

And then use this as the basis to make your new column from your original dataframe with the reset index.
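A minimal sketch of these steps on made-up data follows (the tickers, prices, and the `Rolling2` column name are hypothetical). Option 1 mirrors the steps above and assumes the frame is already sorted by stock, since the values are assigned purely by position. Option 2 is an alternative I'm suggesting, not part of the lab: it drops the extra index level so pandas can align on the original index. As the text notes, behavior here has shifted between pandas versions, so treat it as something to verify on your install:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["AAA", "BBB"], pd.date_range("2020-01-01", periods=4)],
    names=["Stock", "Date"],
)
toy = pd.DataFrame(
    {"Price": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]}, index=idx
).sort_index()

rolled = toy.groupby(level="Stock")["Price"].rolling(2).mean()
print(rolled.index.nlevels)  # the group key is added on top of the original levels

# Option 1 (the lab's approach): positional assignment after reset_index
flat = toy.reset_index()
flat["Rolling2"] = rolled.values

# Option 2: drop the extra level so the result aligns on (Stock, Date)
toy["Rolling2"] = rolled.droplevel(0)
print(toy)
```

Both options give the same numbers here. The difference is that Option 2 keeps working if the rows are in a different order, while Option 1 depends on the row order matching exactly.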

In [17]:
df = df.reset_index()

In [18]:
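# These are plain arrays, so they are assigned by position;
# this relies on df still being sorted by Stock, then Date
df['5DayRolling'] = rolling5
df['10DayRolling'] = rolling10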

In [19]:
df.head(15)

,Stock,Date,Price,Quarter,Day,EndofMonth,EndofQuarter,PreviousValue,Value2DaysAgo,Adj Close,5DayRolling,10DayRolling
0,AAPL,2014-11-05,108.860001,4,2,False,False,,,2023.569946,,
1,AAPL,2014-11-06,108.699997,4,3,False,False,108.860001,,2031.209961,,
2,AAPL,2014-11-07,109.010002,4,4,False,False,108.699997,108.860001,2031.920044,,
3,AAPL,2014-11-10,108.830002,4,0,False,False,109.010002,108.699997,2038.26001,,
4,AAPL,2014-11-11,109.699997,4,1,False,False,108.830002,109.010002,2039.680054,109.02,
5,AAPL,2014-11-12,111.25,4,2,False,False,109.699997,108.830002,2038.25,109.498,
6,AAPL,2014-11-13,112.82,4,3,False,False,111.25,109.699997,2039.329956,110.322,
7,AAPL,2014-11-14,114.18,4,4,False,False,112.82,111.25,2039.819946,111.356,
8,AAPL,2014-11-17,113.989998,4,0,False,False,114.18,112.82,2041.319946,112.387999,
9,AAPL,2014-11-18,115.470001,4,1,False,False,113.989998,114.18,2051.800049,113.542,111.281


In [22]:
df2 = df.melt()
df2
# Not a great example of df.melt: with no id_vars, every column gets stacked into variable/value pairs

,variable,value
0,Stock,AAPL
1,Stock,AAPL
2,Stock,AAPL
3,Stock,AAPL
4,Stock,AAPL
...,...,...
75415,10DayRolling,139.598
75416,10DayRolling,139.724
75417,10DayRolling,140.144
75418,10DayRolling,140.512
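For a more readable result, `melt` is usually given identifier columns to hold fixed. The sketch below uses a made-up frame; the tickers, prices, and the `Measure` and `Value` names are hypothetical choices for illustration:

```python
import pandas as pd

wide = pd.DataFrame({
    "Stock": ["AAA", "AAA", "BBB", "BBB"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02"]),
    "Price": [10.0, 11.0, 20.0, 21.0],
    "Rolling2": [float("nan"), 10.5, float("nan"), 20.5],
})

# Keep Stock and Date as identifiers; stack only the numeric measures
long = wide.melt(
    id_vars=["Stock", "Date"],
    value_vars=["Price", "Rolling2"],
    var_name="Measure",
    value_name="Value",
)
print(long)
```

Each row of the result is one (stock, date, measure) observation, which keeps the identifiers attached to every value instead of stacking them alongside the numbers.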
