# Topic 37: Intro to Time Series

- 05/27/21
- onl01-dtsc-ft-022221

## Learning Objectives:

- Learn how to load time series data into pandas
- Learn how to plot time series in pandas
- Learn how to resample at different time frequencies
- Learn about types of time series trends and how to remove them.
- Learn about seasonal decomposition

- Prepare a time series dataset to use for modeling next class

## Questions?

# Intro to Time Series

## References

- [Pandas Timeseries Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
- ['Timeseries Offset Aliases'](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases)
- [Anchored Offsets](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#anchored-offsets)


- [pandas.Timestamp API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html)

# Working with Time Series

In [None]:
## Import the essentials
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import numpy as np
import pandas as pd
import os,sys


## Setting figures to timeseries-friendly
mpl.rcParams['figure.figsize'] = (12,4)
# sns.set_context('talk')

# import warnings
# warnings.filterwarnings('ignore')

## Time Series Tools from statsmodels
import statsmodels.tsa.api as tsa
import statsmodels
print(f'Statsmodels version = {statsmodels.__version__}')


## Creating a Time Series from a DataFrame

### Data is from Baltimore's Open-Data Portal

- New Crime Stats Just downloaded:
- [Baltimore Open Data](https://www.baltimorepolice.org/crime-stats/open-data)
    - We Are Using [Part 1 Crime](https://data.baltimorecity.gov/search?q=crime%20data):
    - https://data.baltimorecity.gov/datasets/3eeb0a2cbae94b3e8549a8193717a9e1_0/explore
    
- **Note: to save space, I converted the .csv to a .csv.gz using this code**
```python 
## Read orig csv from Downloads and save .gz vers to local folder
df = pd.read_csv('/Users/jamesirving/Downloads/Part1_Crime_data.csv')
df.to_csv('baltimore_crime_05-26-21.csv.gz',compression='gzip',index=False)
```

In [None]:
## Read in the data
file = 'baltimore_crime_05-26-21.csv.gz'
df = pd.read_csv(file)
df

In [None]:
## Keeping Necessary/Desired Columns
keep_cols = ['CrimeDateTime','Description','Total_Incidents',
             'District','Neighborhood'
#               'Weapon', 'Latitude','Longitude',
            ]
df = df[keep_cols]
df

## Preparing Data for Time Series Visualization

- Index must be a `DatetimeIndex`
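
A minimal sketch of that conversion, using a tiny made-up frame (`demo`) in place of the Baltimore file. The column names match the ones kept above; the rows themselves are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in rows; the real file has many more rows
demo = pd.DataFrame({
    'CrimeDateTime': ['2021/05/01 14:30:00', '2021/05/01 02:15:00',
                      '2021/05/03 23:45:00'],
    'Description': ['SHOOTING', 'ROBBERY - STREET', 'SHOOTING'],
    'Total_Incidents': [1, 1, 1],
})

# Parse the strings into datetimes; errors='coerce' turns bad values into NaT
demo['CrimeDateTime'] = pd.to_datetime(demo['CrimeDateTime'], errors='coerce')

# Use the datetimes as the index and sort chronologically
demo = demo.set_index('CrimeDateTime').sort_index()

print(type(demo.index))   # the index class after conversion
print(demo.index.freq)    # incident-level data has no regular frequency
```

Note that the index has no frequency yet: each row is a single incident at an arbitrary time, not a regularly spaced observation.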

In [None]:
## Check Index 


In [None]:
## Convert CrimeDateTime to datetime and set as index


#### What do we notice about the index?

- frequency?
- range?

In [None]:
## Inspect the value_counts for the different types of crimes

# display with an inline-barplot inside your df


In [None]:
## Grab All Shootings  - using groupby


In [None]:
## Checking total-incidents value_counts


In [None]:
## Let's get just Shootings in a new series

# Make a crime var with the name of the crime we want

## save ts from Total Incidents and rename series to crime 


In [None]:
## Get list of crimes to iterate through


In [None]:
## make a dict of all crime types' DataFrames 

## For each crime type
    
    ## Get the group df
    
    ## Save the group_df into the CRIMES dict
    
## Display the keys


# Visualizing Time Series

In [None]:
## Pull out shooting from CRIMES dict


In [None]:
## Plot the ts


#### Q: What went wrong? What are we looking at?

- 

## Time Series Frequencies & Resampling

In [None]:
## Resample to daily data ("D")


In [None]:
## Check the index, what's different?


In [None]:
## Plot the time series


>#### Q: It worked! But what's the issue?

## Slicing With Time Series

- Make sure your index is sorted first
- Use `.loc` with dates as strings for slicing
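
A small sketch of both bullets, assuming a made-up daily series (`ts_demo`) rather than the crime data:

```python
import numpy as np
import pandas as pd

# Hypothetical daily values covering 2013 through 2016
idx = pd.date_range('2013-01-01', '2016-12-31', freq='D')
ts_demo = pd.Series(np.arange(len(idx)), index=idx, name='demo')

# Shuffle to mimic an unsorted index, then sort before slicing
ts_demo = ts_demo.sample(frac=1, random_state=0).sort_index()

# Partial date strings work with .loc; both endpoints are inclusive
from_2015 = ts_demo.loc['2015':]
jan_2016 = ts_demo.loc['2016-01']
june_2015 = ts_demo.loc['2015-06-01':'2015-06-30']

print(from_2015.index.min(), len(jan_2016), len(june_2015))
```

Unlike positional slicing, date-string slices include the end label, so `'2015-06-01':'2015-06-30'` returns all 30 days of June.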

In [None]:
## Slice out dates prior to rise in daily counts


>#### Much better! sort of... but what's the issue now?

### Time series Frequencies


- We want the daily counts for our crimes.
    - In order to do so we have to resample the ts using the correct frequency alias.
- For time series modeling, we will need our time series as a specific frequency without missing data.
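
As a rough illustration, here is one way to turn incident-level rows into daily counts. The timestamps are invented, and `.sum()` is just one reasonable aggregation for a column of per-incident counts.

```python
import pandas as pd

# Hypothetical incident-level records: one row per incident, irregular times
stamps = pd.to_datetime(['2021-05-01 02:15', '2021-05-01 14:30',
                         '2021-05-03 23:45', '2021-05-03 08:00',
                         '2021-05-03 09:10'])
incidents = pd.Series(1, index=stamps, name='SHOOTING').sort_index()

# Downsample to calendar days; .sum() adds up the incidents within each day
daily = incidents.resample('D').sum()

print(daily)
print(daily.index.freq)
```

The resampled index now has a regular daily frequency, and days with no incidents (here 2021-05-02) show up as 0 because `.sum()` of an empty bin is 0.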

#### Pandas Frequency Aliases

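https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases

> Note: pandas 2.2 and later deprecate several of the aliases below in favor of new names (for example `M` → `ME`, `Q` → `QE`, `A`/`Y` → `YE`, `H` → `h`, `T` → `min`). Check the link above for the list that matches your installed version.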


|Alias	| Description|
| --- | --- |
|B |	business day frequency|
|C |	custom business day frequency|
|D |	calendar day frequency|
|W |	weekly frequency|
|M |	month end frequency|
|SM |	semi-month end frequency (15th and end of month)|
|BM |	business month end frequency|
|CBM |	custom business month end frequency|
|MS |	month start frequency|
|SMS |	semi-month start frequency (1st and 15th)|
|BMS |	business month start frequency|
|CBMS |	custom business month start frequency|
|Q |	quarter end frequency|
|BQ |	business quarter end frequency|
|QS |	quarter start frequency|
|BQS |	business quarter start frequency|
|A, Y |	year end frequency|
|BA, BY |	business year end frequency|
|AS, YS |	year start frequency|
|BAS, BYS |	business year start frequency|
|BH | business hour frequency|
|H | hourly frequency|
|T, min | minutely frequency|
|S | secondly frequency|
|L, ms | milliseconds|
|U, us | microseconds|
|N | nanoseconds|

#### Compare Resampled ts

In [None]:
## Plot the same ts as different frequencies
## Specify freq codes daily, every 3 days, weekly, Monthly, Quarterly, yearly

## select ts from CRIMES

## For each freq code
    
    ## make a new figure, resample and plot



In [None]:
## Repeat the above loop, but plot it all on one figure

## Specify freq codes

## select ts from CRIMES

## For each freq code, resample and plot


## Visualize all CRIMES as "D" Freq

>- **Loop through all CRIMES, slice out 2015-Present, and resample as "D"**
    - We can always downsample without issue, but upsampling is problematic (it creates timestamps that have no observed values).
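
To make the downsampling/upsampling point concrete, here is a small sketch on made-up data (one incident per day for four weeks):

```python
import pandas as pd

# Hypothetical daily counts: one incident per day for 28 days
idx = pd.date_range('2021-01-01', periods=28, freq='D')
daily = pd.Series(1, index=idx)

# Downsampling: many days collapse into one week, so we can aggregate
weekly = daily.resample('W').sum()

# Upsampling: going back to days creates rows we never observed
back_to_daily = weekly.resample('D').asfreq()

print(weekly)
print(back_to_daily.isna().sum(), 'of', len(back_to_daily), 'days are NaN')
```

Downsampling keeps the total number of incidents intact, while upsampling leaves every in-between day as `NaN` and forces us to decide how to fill it.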

### Using Dictionaries for Time Series preprocessing

In [None]:
## Save all crimes from 2015 on with freq=D in new TS dict

## For each crime
    
    ## Resample and slice and save ts
    


In [None]:
## Check shooting


### Now that we have the same frequency for each crime series, make them into a dataframe
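
A hedged sketch of that step, assuming a dictionary of incident-level series keyed by crime name. `CRIMES_demo`, `TS_demo`, and `ts_df_demo` are illustrative names with invented data, not the notebook's actual variables.

```python
import pandas as pd

# Hypothetical incident-level series for two crime types
CRIMES_demo = {
    'SHOOTING': pd.Series(1, index=pd.to_datetime(
        ['2015-01-01 03:00', '2015-01-03 22:00'])),
    'ROBBERY - STREET': pd.Series(1, index=pd.to_datetime(
        ['2015-01-02 12:00', '2015-01-02 18:30', '2015-01-04 09:00'])),
}

# Resample each series to daily counts
TS_demo = {name: ts.resample('D').sum() for name, ts in CRIMES_demo.items()}

# Concatenate column-wise; the dict keys become the column names and
# dates outside a series' own date range become NaN
ts_df_demo = pd.concat(TS_demo, axis=1)

print(ts_df_demo)
```

Because each series is resampled only between its own first and last incident, the combined frame can contain nulls wherever one crime's range is shorter than another's.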

In [None]:
## Concatenate all ts together into one ts_df


### Visualize all ts with the different frequency codes

In [None]:
## Plot the same ts as different frequencies


### Dealing with Null Values 

In [None]:
## Check For Null Values


In [None]:
# save a T/F index for if a row has any nulls

# check out the null rows


#### Q: what do we notice about our null values? Where are they?

> - We have several options for filling in null values for time series, based on what would be best for our data.
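
The three most common options, sketched on a short made-up series so the differences are easy to see:

```python
import numpy as np
import pandas as pd

# Hypothetical daily counts with gaps
idx = pd.date_range('2021-01-01', periods=5, freq='D')
counts = pd.Series([2, np.nan, np.nan, 1, np.nan], index=idx)

forward = counts.ffill()    # carry the previous non-null value forward
backward = counts.bfill()   # pull the next non-null value backward
zeros = counts.fillna(0)    # treat a missing day as "no incidents"

print(pd.DataFrame({'orig': counts, 'ffill': forward,
                    'bfill': backward, 'zero': zeros}))
```

Note that `bfill` cannot fill a trailing gap (there is no later value to pull from), and `ffill` has the same limitation for a leading gap.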

In [None]:
## BFill null values with the next non-null value


In [None]:
## FFill null values with the previous non-null value


In [None]:
## We have crime counts, so it makes sense to fill with 0


In [None]:
## Save df to csv for time series modeling next class
# ts_df.to_csv('baltimore_crime_counts_2021.csv')

# Time Series Trends

## Types of Trends

<img src="https://raw.githubusercontent.com/learn-co-students/dsc-removing-trends-online-ds-ft-100719/master/images/new_trendseasonal.png" width=80%>

### Stationarity

<div style="text-align:center;font-size:2em">Mean</div>
    
<img src="https://raw.githubusercontent.com/jirvingphd/dsc-types-of-trends-online-ds-ft-100719/master/images/new_mean_nonstationary.png" width=70%>
<br><br>
<div style="text-align:center;font-size:3em">Variance</div>
<img src="https://raw.githubusercontent.com/jirvingphd/dsc-types-of-trends-online-ds-ft-100719/master/images/new_cov_nonstationary.png" width=70%>

## Detecting Trends

In [None]:
## Grab ROBBERY - STREET and resample as weekly data


### Augmented Dickey Fuller Test for Stationarity

In [None]:
## Lab Function
# from statsmodels.tsa.stattools import adfuller

def adfuller_test_df(ts,index=['AD Fuller Results']):
    """Returns the AD Fuller test results and p-value for the null hypothesis
    that the data is non-stationary (i.e., that there is a unit root in the data)."""
    
    df_res = tsa.stattools.adfuller(ts)
    
    names = ['Test Statistic','p-value','#Lags Used','# of Observations Used']
    res  = dict(zip(names,df_res[:4]))
    
    res['p<.05'] = res['p-value']<.05
    res['Stationary?'] = res['p<.05']
    
    if isinstance(index,str):
        index = [index]
    res_df = pd.DataFrame(res,index=index)
    res_df = res_df[['Test Statistic','#Lags Used',
                     '# of Observations Used','p-value','p<.05',
                    'Stationary?']]
    return res_df



def stationarity_check(TS,window=8,plot=True,index=['AD Fuller Results']):
    """Adapted from https://github.com/learn-co-curriculum/dsc-removing-trends-lab/tree/solution"""
    
    # Calculate rolling statistics
    roll_mean = TS.rolling(window=window, center=False).mean()
    roll_std = TS.rolling(window=window, center=False).std()
    
    # Perform the Dickey Fuller Test
    dftest = adfuller_test_df(TS,index=index)
    
    if plot:
        # Plot rolling statistics:
        fig = plt.figure(figsize=(12,6))
        plt.plot(TS, color='blue',label=f'Original (freq={TS.index.freq})')
        plt.plot(roll_mean, color='red', label=f'Rolling Mean (window={window})')
        plt.plot(roll_std, color='black', label = f'Rolling Std (window={window})')
        plt.legend(loc='best')
        plt.title('Rolling Mean & Standard Deviation')
        display(dftest)
        plt.show(block=False)
        
    return dftest
    

In [None]:
## Test stationarity check function 


## Removing Trends 

- For time series modeling, we will want to get our time series stationary.*

#### Trend Removal Methods
- Differencing (`.diff()`)
- Log-Transformation (`np.log`)
- Subtract Rolling Mean (`ts-ts.rolling().mean()`)
- Subtract Exponentially-Weighted Mean (`ts-ts.ewm().mean()`)
- Seasonal Decomposition (`from statsmodels.tsa.seasonal import seasonal_decompose`)

_`*`=caveat to be discussed tomorrow_
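
A sketch of the first four methods on a synthetic weekly series with a linear upward trend. The window and halflife values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series: linear trend plus noise, always positive
idx = pd.date_range('2018-01-07', periods=104, freq='W')
rng = np.random.default_rng(42)
ts_demo = pd.Series(50 + 0.5 * np.arange(104) + rng.normal(0, 2, 104),
                    index=idx)

# Differencing: change from one week to the next (first value is NaN)
diffed = ts_demo.diff().dropna()

# Log transform: compresses large values; requires strictly positive data
logged = np.log(ts_demo)

# Subtract rolling mean: first (window - 1) values are NaN
minus_roll = (ts_demo - ts_demo.rolling(window=8).mean()).dropna()

# Subtract exponentially weighted mean: defined from the first observation
minus_ewm = ts_demo - ts_demo.ewm(halflife=4).mean()

print(diffed.mean(), minus_roll.mean(), minus_ewm.mean())
```

Each result can be passed to `stationarity_check` to compare methods. Keep in mind that count data with zero-incident days cannot be log-transformed directly, since `np.log(0)` is `-inf`.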

In [None]:
## Plot Original Time Series and Check for Stationarity


In [None]:
## Apply differencing, plot and get adfuller test


In [None]:
## Log Transform, plot and get adfuller test


In [None]:
## Subtract Rolling mean


In [None]:
## Subtract Exponentially-Weighted Mean


#### Q: What do we notice? What methods achieved stationarity?

### Seasonal Decomposition

In [None]:
## Use seasonal decompose on the ts and plot
plt.rcParams['figure.figsize']=(12,6)


#### Checking which components are stationary

In [None]:
## Save seasonal/trend/resid in a dictionary.


In [None]:
## Make a list of adfuller results to append

## Save results of orig ts

## Loop through decomp dict, 

    # Fill any missing values, get adfuller result

    
    ## Append res to decomp_stationary

## make into a df


In [None]:
## Plot decomp again for convenient comparison


### Summary

- Tomorrow we will continue working with the dataset we processed today for time series modeling. 

# APPENDIX

### Date Str Formatting




Formatting follows the Python datetime <strong><a href='http://strftime.org/'>strftime</a></strong> codes.<br>
The following examples are based on <tt>datetime.datetime(2001, 2, 3, 16, 5, 6)</tt>:
<br><br>

<table style="display: inline-block">  
<tr><th>CODE</th><th>MEANING</th><th>EXAMPLE</th></tr>
<tr><td>%Y</td><td>Year with century as a decimal number.</td><td>2001</td></tr>
<tr><td>%y</td><td>Year without century as a zero-padded decimal number.</td><td>01</td></tr>
<tr><td>%m</td><td>Month as a zero-padded decimal number.</td><td>02</td></tr>
<tr><td>%B</td><td>Month as locale’s full name.</td><td>February</td></tr>
<tr><td>%b</td><td>Month as locale’s abbreviated name.</td><td>Feb</td></tr>
<tr><td>%d</td><td>Day of the month as a zero-padded decimal number.</td><td>03</td></tr>  
<tr><td>%A</td><td>Weekday as locale’s full name.</td><td>Saturday</td></tr>
<tr><td>%a</td><td>Weekday as locale’s abbreviated name.</td><td>Sat</td></tr>
<tr><td>%H</td><td>Hour (24-hour clock) as a zero-padded decimal number.</td><td>16</td></tr>
<tr><td>%I</td><td>Hour (12-hour clock) as a zero-padded decimal number.</td><td>04</td></tr>
<tr><td>%p</td><td>Locale’s equivalent of either AM or PM.</td><td>PM</td></tr>
<tr><td>%M</td><td>Minute as a zero-padded decimal number.</td><td>05</td></tr>
<tr><td>%S</td><td>Second as a zero-padded decimal number.</td><td>06</td></tr>
</table>
<table style="display: inline-block">
<tr><th>CODE</th><th>MEANING</th><th>EXAMPLE</th></tr>
<tr><td>%#m</td><td>Month as a decimal number. (Windows)</td><td>2</td></tr>
<tr><td>%-m</td><td>Month as a decimal number. (Mac/Linux)</td><td>2</td></tr>
<tr><td>%#x</td><td>Long date</td><td>Saturday, February 03, 2001</td></tr>
<tr><td>%#c</td><td>Long date and time</td><td>Saturday, February 03, 2001 16:05:06</td></tr>
</table>  
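
A short sketch using the same example datetime as the tables. The same codes work for formatting with `strftime` and for parsing with `pd.to_datetime(..., format=...)`. The input string in the parsing line is made up.

```python
from datetime import datetime
import pandas as pd

dt = datetime(2001, 2, 3, 16, 5, 6)

# Formatting: datetime -> string
iso_like = dt.strftime('%Y-%m-%d %H:%M:%S')
long_date = dt.strftime('%A, %B %d, %Y')   # names depend on the locale

# Parsing: string -> Timestamp, using the same codes (day-first example)
parsed = pd.to_datetime('03.02.2001 16:05:06', format='%d.%m.%Y %H:%M:%S')

print(iso_like)
print(long_date)
print(parsed)
```

Giving an explicit `format` removes ambiguity between day-first and month-first strings.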
    

### Attempting to Use Grouper as in the Lab (Not Working)

In [None]:
# ts_q = ts.to_frame()

# ts_q#.plot()

In [None]:
 # Use pandas grouper to group values using annual frequency
# year_groups = ts_q.groupby(pd.Grouper(freq ='A'))

In [None]:
# # Create a new DataFrame and store yearly values in columns 
# ts_annual = pd.DataFrame()

# for yr, group in year_groups:
# #     display(group.values)
# #     ts_annual[yr.year] = group.values.ravel()
    
# # Plot the yearly groups as subplots
# # ts_annual.plot(figsize = (13,8), subplots=True, legend=True);
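
One likely reason the commented-out loop fails is that years have different lengths (leap years, partial first or last years), so assigning raw `group.values` as columns of one DataFrame raises a length-mismatch error. A possible workaround, sketched on made-up data, is to re-index each year by day of year and let `pd.concat` align the columns. The `'YS'` alias is used here instead of `'A'` only to sidestep the alias deprecation in newer pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series spanning three years (2016 is a leap year)
idx = pd.date_range('2015-01-01', '2017-12-31', freq='D')
ts_demo = pd.Series(np.arange(len(idx), dtype=float), index=idx, name='demo')

# Group by calendar year
year_groups = ts_demo.groupby(pd.Grouper(freq='YS'))

annual = {}
for year_start, group in year_groups:
    # Index each year by day-of-year so years of different length can align
    annual[year_start.year] = pd.Series(group.values,
                                        index=group.index.dayofyear)

# Column-wise concat pads the shorter years with NaN
ts_annual = pd.concat(annual, axis=1)
print(ts_annual.shape)
```

From here, `ts_annual.plot(subplots=True)` would give one panel per year, as in the original attempt.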