# Time Series Analysis

* Time series data is one of the most common data types and understanding how to work with it is a critical data science skill if you want to make predictions and report on trends. 
* Any column of a DataFrame can hold datetime objects, but the most important place for them is the `DatetimeIndex`, because a datetime index turns the entire DataFrame into a time series.
* Many Series/DF methods rely on time information in the index to provide time-series functionality.

***

## **NOTE TO SELF: ideas to fill missing values Cov_weather project:** 

* Use shift to fill in missing weekend values of COVID19 cases?
* use forward fill, back fill, or fill_value?
* or: interpolate missing weekend values?

#### Timestamps 
   * `pd.Timestamp()` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html)
   * `time_stamp = pd.Timestamp(datetime(2017, 1, 1))` -OR- `pd.Timestamp('2017-01-01') == time_stamp`
   * use date string or datetime object (above)
   * timestamps created with date only will automatically set time to midnight ("00:00:00")
   * Timestamp object has many attributes to store time-specific information
      * `.year`, `.day_name()`, etc (the older `.weekday_name` attribute was removed in pandas 1.0)
   * In older pandas versions, Timestamps could also carry frequency information (the `freq` attribute was removed in pandas 2.0)
   * __Common timestamp attributes:__ 
       * .second, .minute, .hour
       * .day, .month, .quarter, .year
       * .weekday() (a method; returns the same value as .dayofweek, with Monday = 0)
       * .dayofweek
       * .weekofyear
       * .dayofyear
   * to create a sequence of timestamps: 
        * `pd.date_range()` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html)
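        * `index = pd.date_range(start='2017-1-1', periods=12, freq='M')` OR
        * `index = pd.date_range(start='2017-1-1', end='2017-12-31', freq='M')` # `end` must reach the last month end to get all 12 dates
        * pandas 2.2+ spells the month-end alias `'ME'`; the older `'M'` spelling is deprecated there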
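
A minimal sketch of the attributes and the `date_range()` call above. The `'MS'` (month start) alias is used only because it is spelled the same way across pandas versions; that choice and the variable names are assumptions for illustration.

```python
import pandas as pd

# A date-only string yields a timestamp at midnight.
stamp = pd.Timestamp("2017-01-01")
print(stamp.hour, stamp.year, stamp.quarter)

# dayofweek counts from Monday = 0; day_name() returns the weekday as text.
print(stamp.dayofweek, stamp.day_name())

# Twelve month-start timestamps covering 2017.
index = pd.date_range(start="2017-01-01", periods=12, freq="MS")
print(index[0], index[-1], len(index))
```

With `periods=12`, the last element falls on the first day of December 2017.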

#### Periods and frequencies
   * `pd.Period()`([doc here](https://pandas.pydata.org/docs/reference/api/pandas.Period.html)) & `freq` 
   * Period object is pandas specific
   * a Period always has a frequency; when `freq` is omitted, pandas infers it from the string
   * `period = pd.Period('2017-01-01', 'D')`
   * `period = pd.Period('2017-01')` # frequency inferred as monthly ('M')
   * convert a period to a new frequency:
   * `period.asfreq('D')` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.Period.asfreq.html)
   * You can convert a period to a timestamp object and a timestamp object to a period object:
   * `period.to_timestamp().to_period('M')` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.Period.to_timestamp.html)
   * perform basic date arithmetic with periods:
       * if: `period = pd.Period('2017-01', 'M')`
       * then: `period + 2` = `Period('2017-03', 'M')` #plus two units (frequency) M
   * Sequences of dates & times:    
   * to create a TimeSeries, you need a sequence of dates
   * to create a sequence of Timestamps, use the pandas function `date_range()`
       * `pd.date_range(start, end, periods, freq)`
       * `index = pd.date_range(start='2017-01-01', periods=12, freq='M')` # default freq = 'D'
   * Convert Timestamp index to a Period index with: `index.to_period()`
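
The period operations above can be sketched as follows. Variable names are hypothetical; the conversions rely on the pandas defaults (`asfreq` picks the end of the period, `to_timestamp` picks the start).

```python
import pandas as pd

period = pd.Period("2017-01", freq="M")

# Arithmetic is in units of the period's frequency: two months later.
later = period + 2

# Converting to daily frequency lands on the last day of the month by default.
as_days = period.asfreq("D")

# Period -> Timestamp (start of the period) -> back to a monthly Period.
stamp = period.to_timestamp()
round_trip = stamp.to_period("M")

print(later, as_days, stamp, round_trip)
```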
    
#### Creating a time series
   * `pd.DatetimeIndex()` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html)
   * Create 12 rows x 2 columns of random data to match the 12 datetime indices created above:
       * `data = np.random.random(size=(12, 2))`
       * `pd.DataFrame(data=data, index=index)`
   * Common frequency aliases:
       * Hour : 'H'
       * Day : 'D' ("calendar day"; midnight to midnight)
       * Week : 'W'
       * Month : 'M'
       * Quarter : 'Q'
       * Year : 'A'
       * Each alias can also be anchored to the beginning or end of the period, or to a business-specific definition
   * Convert to **business day** frequency: `.asfreq('B')`

#### Indexing & resampling time series
   * **Upsampling:** Increasing the time frequency (requires generating new data)
   * **Downsampling:** Decreasing the time frequency (requires aggregating data)
   * parse date string to datetime64: 
       * `pd.to_datetime()` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)
       * `google.date = pd.to_datetime(google.date)` #convert whole column(s)
   * date to index: `google.set_index('date', inplace=True)`
   * to select subsets of time series data, use strings that represent a complete date or relevant parts of a date
       * `google.loc['2015'].info()` #returns data from 2015 (row selection with plain `google['2015']` was removed in pandas 2.0)
       * `google.loc['2015-3':'2016-2'].info()` #data from March 2015 through February 2016
       * *Note that the date range is inclusive of the end date.*
       * Also: use `.loc` with complete date and column label: `google.loc['2016-6-1', 'price']`
   * set frequency information with `.asfreq('D')`
   * Convert to business day frequency: `.asfreq('B')`
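
A small sketch of partial-string selection on a `DatetimeIndex`. The `google` frame here is synthetic (one row per calendar day, with the row number as a stand-in price), so the names and values are assumptions, not real data.

```python
import numpy as np
import pandas as pd

# Synthetic daily "prices": the value is simply the row number.
dates = pd.date_range("2015-01-01", "2016-12-31", freq="D")
google = pd.DataFrame({"price": np.arange(len(dates), dtype=float)}, index=dates)

# All rows from 2015.
year_2015 = google.loc["2015"]

# March 2015 through February 2016; the end month is included in full.
span = google.loc["2015-03":"2016-02"]

# A single value: complete date plus column label.
single = google.loc["2016-06-01", "price"]

print(len(year_2015), span.index[0], span.index[-1], single)
```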

#### Lags, changes, and returns for stock price series
   * typical time series manipulations include:
       * Shift or lag values backwards or forwards in time
       * difference in value for a given time period
       * compute percent change over any number of periods (aka rate of growth)
       * instead of manually `pd.to_datetime()`, let `pd.read_csv()` do the parsing work:
           * `google = pd.read_csv('google.csv', parse_dates=['date'], index_col='date')`
           * this converts original date column's column name to index name/header (also 'date')
       * `.shift()`:
           * defaults to periods = 1
           * `periods=1` moves each value one row later in time (the first value will be missing); `periods=-1` moves each value one row earlier (the last value will be missing)
           * `google['shifted'] = google.price.shift()` # defaults to period = 1
           * shifting data is useful for comparing data at different points in time 
       * calculate **rate of change** from period to period (aka **financial return**):
           * all of the following methods have a `periods` keyword that defaults to 1.
           * `google['change'] = google.price.div(google.shifted)`
           * `.div()` allows you to divide a series by a value OR series of values, for instance by another column in the same dataframe
           * calculate one-period percent change: 
               * `google['return'] = google.change.sub(1).mul(100)`
               * create column `return` by subtracting 1 from change and multiplying by 100 to calculate the percent difference
           * `.diff()` : calculates the difference in value for two adjacent periods [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html)
           * `.pct_change()`: `google['pct_change'] = google.price.pct_change().mul(100)` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pct_change.html)
           * to change the value of `periods` and calculate pct_change between values several periods apart:
               * `google['return_3d'] = google.price.pct_change(periods=3).mul(100)`
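
To check that the manual route (`shift`, `div`, `sub`, `mul`) and `.pct_change()` agree, here is a short sketch on a made-up four-day price series (the prices are assumptions chosen to give round returns).

```python
import pandas as pd

# Hypothetical prices for four consecutive days.
price = pd.Series(
    [100.0, 110.0, 99.0, 108.9],
    index=pd.date_range("2017-01-02", periods=4, freq="D"),
)

# shift() with the default periods=1 puts the previous day's price on each row.
shifted = price.shift()

# Manual one-period percent return: price / previous price - 1, times 100.
manual_return = price.div(shifted).sub(1).mul(100)

# The built-in equivalent.
builtin_return = price.pct_change().mul(100)

# Absolute change between adjacent rows.
difference = price.diff()

print(pd.DataFrame({"manual": manual_return, "builtin": builtin_return, "diff": difference}))
```

The first row is `NaN` in every derived column because there is no earlier price to compare with.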

#### Compare time series growth rates
   * We often want to compare time series' trends, but because they start at different levels, this isn't always easy
   * Simple solution: normalize price series to start at 100:
       * Divide all prices by the first price in the series, then multiply by 100
       * All prices then reflect the relative difference from the starting price
       * After multiplying by 100, the distance from 100 is the change since the starting point, in percentage points
   * Normalizing a single series:
       * `first_price = google.price.iloc[0]`
       * `normalized = google.price.div(first_price).mul(100)`
   * Normalizing multiple series:
       * `normalized = prices.div(prices.iloc[0]).mul(100)`
           * `.div`: automatic alignment of Series index and DataFrame columns [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.div.html)
   * add a __benchmark__:

```
index = pd.read_csv('benchmark.csv', parse_dates= ['date'], index_col = 'date')
prices = pd.concat([prices, index], axis = 1).dropna()
```
   * plotting performance difference: `diff = normalized[tickers].sub(normalized['SP500'], axis=0)`
   * `.sub(..., axis=0)`: Subtract a series from each DataFrame column by aligning indices
   * Normalize and plot data in one step: 
       * `data.div(data.iloc[0]).mul(100).plot()`
       * `plt.show()`
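
A sketch of normalizing several series and comparing them with a benchmark. The tickers `AAA` and `BBB` and all prices are made up for illustration; only the `SP500` column name comes from the notes above.

```python
import pandas as pd

# Hypothetical prices for two tickers and a benchmark.
prices = pd.DataFrame(
    {
        "AAA": [50.0, 55.0, 60.0],
        "BBB": [200.0, 190.0, 210.0],
        "SP500": [2000.0, 2100.0, 2200.0],
    },
    index=pd.date_range("2017-01-02", periods=3, freq="D"),
)

# Divide every column by its first value and scale so each series starts at 100.
normalized = prices.div(prices.iloc[0]).mul(100)

# Subtract the benchmark from each ticker, aligning on the date index.
tickers = ["AAA", "BBB"]
diff = normalized[tickers].sub(normalized["SP500"], axis=0)

print(normalized.round(2))
print(diff.round(2))
```

Positive values in `diff` mean the ticker has outperformed the benchmark since the start date, in percentage points.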

#### Changing the time series frequency: resampling
   * frequency is an attribute
   * DatetimeIndex: set & change freq using `.asfreq()`
   * changing frequency affects the values in the dataframe
       * **Upsampling:** converting to higher frequency: fill or interpolate missing data
       * **Downsampling:** converting to a lower frequency: aggregate existing data
       * pandas API: 
           * `.asfreq()` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.asfreq.html)
           * `.reindex()` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reindex.html)
           * `.resample()` + transformation [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html)
       * when you choose quarterly frequency, pandas defaults to December for the end of the 4th quarter
           * this is modifiable
       * Upsampling fill methods:
           * Forward fill: `monthly['ffill'] = quarterly.asfreq('M', method='ffill')` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html) 
           * Back fill: `monthly['bfill'] = quarterly.asfreq('M', method = 'bfill')` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.bfill.html)
           * provide fill value: `monthly['value'] = quarterly.asfreq('M', fill_value=0)` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.asfreq.html)
           * add missing index: `.reindex()`: conform DataFrame to new index, same filling logic as `.asfreq()`
               * `quarterly.reindex(dates)` : default fill value is NaN
       * Creating new date series info
```
dates = pd.date_range(start='2016', periods=4, freq='Q')
data= range(1,5)
quarterly = pd.Series(data=data, index=dates)
```        
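
The fill methods listed above can be compared side by side. This sketch assumes quarter-start (`'QS'`) and month-start (`'MS'`) aliases rather than the period-end ones used in the notes, only because those spellings are stable across pandas versions.

```python
import pandas as pd

# Four quarterly observations, stamped at the start of each quarter.
dates = pd.date_range(start="2016-01-01", periods=4, freq="QS")
quarterly = pd.Series(data=range(1, 5), index=dates)

# Upsample to monthly: new months have no data, so they start out as NaN.
monthly = quarterly.asfreq("MS").to_frame("baseline")

# Fill the gaps three different ways.
monthly["ffill"] = quarterly.asfreq("MS", method="ffill")   # carry last value forward
monthly["bfill"] = quarterly.asfreq("MS", method="bfill")   # pull next value backward
monthly["value"] = quarterly.asfreq("MS", fill_value=0)     # constant fill

print(monthly)
```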

#### Upsampling and interpolation with .resample()
   * `.resample()` ([doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html)) follows a logic similar to `.groupby()`:
       * `resample()`groups data within resampling period and applies one or several methods to each group
           * new date is determined by offset - start, end, etc
           * Upsampling: fill from existing or interpolate values
           * Downsampling: apply aggregation to existing data
       * **Resampling Period & Frequency Offsets**
           * Several alternatives to calendar month end
               * Calendar month end: 'M' | Sample date: 2017-04-30
               * Calendar month start: 'MS' | Sample date: 2017-04-01
               * Business month end: 'BM' | Sample date: 2017-04-28
               * Business month start: 'BMS' | Sample date: 2017-04-03
       * `.resample()` only returns a DataFrame after a second method call:
               * `unrate.resample('MS')` #returns a `DatetimeIndexResampler` object, not data
               * `unrate.asfreq('MS').equals(unrate.resample('MS').asfreq())` #True: both calls return the same data
       * `gdp_1 = gdp.resample('MS').ffill().add_suffix('_ffill')`
       * Interpolate: finds points on straight line between existing data.
           * `.interpolate()` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html)
           * `gdp_2 = gdp.resample('MS').interpolate().add_suffix('_inter')`
       * `pd.concat()` defaults to axis =0; resetting to axis=1 allows you to concatenate horizontally [doc here](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
       * Downsampling and aggregating:
           * **Downsampling:** reduce the frequency of your data | How to represent existing data with new aggregated values?
               * mean, median, last value
           * first assign calendar day frequency using resample:
           * `ozone = ozone.resample('D').asfreq()`
           * convert daily to monthly
           * `ozone.resample('M').mean()`
           * `resample().mean()`: monthly average, assigned to end of calendar month
           * similar to groupby, you can also apply multiple aggregations at once:
           * `.agg()` [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)
           * `ozone.resample('M').agg(['mean', 'std']).head()`
           * `resample().agg()` : list of aggregation functions like groupby
           * `.squeeze()`: converts a single-column DataFrame to a Series [doc here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.squeeze.html)
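
A minimal downsampling sketch. The `ozone` series is synthetic (the value is just the day number), and month-start (`'MS'`) bins are an assumption made so the alias works across pandas versions; the notes above use month-end bins.

```python
import numpy as np
import pandas as pd

# Synthetic daily readings for the first quarter of 2017.
days = pd.date_range("2017-01-01", "2017-03-31", freq="D")
ozone = pd.Series(np.arange(len(days), dtype=float), index=days, name="ozone")

# Daily -> monthly: one aggregated value per calendar month.
monthly_mean = ozone.resample("MS").mean()

# Several aggregations at once, as with groupby.
monthly_stats = ozone.resample("MS").agg(["mean", "std"])

print(monthly_mean)
print(monthly_stats)
```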
           

```
import pandas as pd

# Create the range of dates here
seven_days = pd.date_range(start='2017-01-01', periods=7)

# Iterate over the dates and print the number and name of the weekday
# (day_name() replaces the weekday_name attribute removed in pandas 1.0)
for day in seven_days:
    print(day.dayofweek, day.day_name())
```
***
```
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('nyc.csv')

# Inspect data
print(data.info())

# Convert the date column to datetime64
data.date=pd.to_datetime(data.date)

# Set date column as index
data.set_index('date', inplace=True)

# Inspect data 
print(data.info())

# Plot data
data.plot(subplots=True)
plt.show()
```

```
# Create dataframe prices here
prices = pd.DataFrame()

# Select data for each year and concatenate with prices here 
for year in ['2013', '2014', '2015']:
    price_per_year = yahoo.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)

# Plot prices
prices.plot()
plt.show()
```

* From stock prices to climate data, time series data are found in a wide variety of domains, and being able to effectively work with such data is an increasingly important skill for data scientists.
* Time series analysis deals with data that is ordered in time
* Some useful pandas methods:
    * `df['col'].pct_change()` : percent change
    * `df['col'].diff()` : difference
    * `df['ABC'].corr(df['XYZ'])`: correlation (for pd Series)
* __Google Trends__ allows users to see how often a term is searched for.

* A first step when analyzing a time series is to visualize the data with a plot. 
* Stock and bond markets in the U.S. are closed on different days. For example, although the bond market is closed on Columbus Day (around Oct 12) and Veterans Day (around Nov 11), the stock market is open on those days. One way to see the dates that the stock market is open and the bond market is closed is to convert both indexes of dates into sets and take the difference in sets.
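
The set-difference idea can be sketched with two short, hypothetical lists of trading days around Columbus Day 2017 (the lists are made up for illustration, not taken from real market calendars).

```python
import pandas as pd

# Hypothetical trading days: the bond market skips Monday 2017-10-09.
stocks_dates = pd.to_datetime(["2017-10-06", "2017-10-09", "2017-10-10"])
bonds_dates = pd.to_datetime(["2017-10-06", "2017-10-10"])

# Dates on which stocks traded but bonds did not.
only_stocks = set(stocks_dates) - set(bonds_dates)
print(only_stocks)

# The same result, staying inside pandas.
only_stocks_index = stocks_dates.difference(bonds_dates)
print(only_stocks_index)
```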

### Two Time Series
* Often, two time series vary together
* __correlation coefficient:__ a measure of how much two series vary together; a correlation of 1 means the two series have a perfect positive linear relationship with no deviations, -1 means a perfect negative linear relationship, and 0 means no linear relationship.
#### Common Mistake: Correlation of two trending series 
* Just because two time series trend in the same direction does not mean that they are related; their levels can be highly correlated even when the series have nothing to do with each other.
* Example: Dow Jones and UFO sightings
* If you're looking at the correlation of two stocks, you should look at the comparison of their returns, not their levels
* Compute percent changes in each, then compute correlation
* Scatter plots are also useful for visualizing the correlation between the two variables.

```
#Compute percent change using pct_change()
returns = stocks_and_bonds.pct_change()
#Compute correlation using corr()
correlation = returns['SP500'].corr(returns['US10Y'])
print("Correlation of stocks and interest rates: ", correlation)
#Make scatter plot
plt.scatter(returns['SP500'], returns['US10Y'])
plt.show()
```

* Two trending series may show a strong correlation even if they are completely unrelated. This is referred to as "spurious correlation". That's why when you look at the correlation of say, two stocks, you should look at the correlation of their returns and not their levels.
```
#Compute correlation of levels
correlation1 = levels['DJI'].corr(levels['UFO'])
print("Correlation of levels: ", correlation1)
#Compute correlation of percent changes
changes = levels.pct_change()
correlation2 = changes['DJI'].corr(changes['UFO'])
print("Correlation of changes: ", correlation2)
```
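
Since the `levels` data is not included in these notes, here is a self-contained sketch of the same effect on simulated data. Series `A` and `B` are hypothetical: each is a straight-line trend plus independent noise, so by construction they are unrelated apart from both trending up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
trend = np.linspace(0, 50, n)

# Two unrelated series that both trend upward.
levels = pd.DataFrame({
    "A": 100 + trend + rng.normal(0, 1, n),
    "B": 200 + 2 * trend + rng.normal(0, 1, n),
})

# Correlation of levels is dominated by the shared trend.
corr_levels = levels["A"].corr(levels["B"])

# Correlation of percent changes removes the trend and looks at the noise.
changes = levels.pct_change().dropna()
corr_changes = changes["A"].corr(changes["B"])

print("Correlation of levels:  %4.2f" % corr_levels)
print("Correlation of changes: %4.2f" % corr_changes)
```

The level correlation comes out close to 1 while the correlation of changes stays near zero, which is the spurious-correlation pattern described above.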
### Simple Linear Regression

* Linear Regression aka OLS
* Python packages to perform regressions:
    * In statsmodels:
        * `sm.OLS(y, x).fit()`
    * In numpy:
        * `np.polyfit(x, y, deg=1)`
    * In pandas:
        * `pd.ols(y, x)` (removed in pandas 0.20; use statsmodels instead)
    * In scipy:
        * `stats.linregress(x, y)`
* __Beware:__ The order of x and y is not consistent across all packages
* R-squared measures how closely the data fit the regression line.
    * the R-squared in a simple regression is related to the correlation between the two variables. 
    * the magnitude of the correlation is the square root of the R-squared and the sign of the correlation is the sign of the regression coefficient.
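
The link between R-squared and correlation can be checked numerically. This sketch uses only numpy on simulated data with an assumed true slope of -2; the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = -2.0 * x + rng.normal(size=200)   # assumed relationship: slope of -2 plus noise

# Fit a straight line; polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept

# R-squared: share of the variance in y explained by the line.
r_squared = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

# Correlation between x and y.
correlation = np.corrcoef(x, y)[0, 1]

print("slope: %5.2f  R-squared: %5.3f  correlation: %5.3f" % (slope, r_squared, correlation))
print("sign(slope) * sqrt(R-squared): %5.3f" % (np.sign(slope) * np.sqrt(r_squared)))
```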

```
#Import the statsmodels module
import statsmodels.api as sm
#Compute correlation of x and y
correlation = x.corr(y)
print("The correlation between x and y is %4.2f" %(correlation))
#Convert the Series x to a DataFrame and name the column x
dfx = x.to_frame(name='x')
#Add a constant to the DataFrame dfx
dfx1 = sm.add_constant(dfx)
#Regress y on dfx1
result = sm.OLS(y, dfx1).fit()
#Print out the results and look at the relationship between R-squared and the correlation above
print(result.summary())
```

## Autocorrelation
* Autocorrelation is the correlation of a single Time Series with a lagged copy of itself. 
* Aka 'Serial Correlation'
* Often, when we refer to Autocorrelation, we mean __'Lag-one autocorrelation'__
#### What does it mean when a Series has a positive or a negative autocorrelation?
* __Mean Reversion:__ Negative autocorrelation
* __Momentum__ or __Trend-Following:__ Positive autocorrelation
* Example: Traders on Wall Street use autocorrelation to make money
* Individual stocks have historically negative autocorrelation

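* `df.resample('M').last()` (older pandas versions wrote this as `df.resample(rule='M', how='last')`; the `how` argument has since been removed)
    * 'M' = monthly
    * the method chained after `.resample()` decides how each period is summarized: `.first()`, `.last()`, or `.mean()` for the average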
* __Mean reversion__ in stock prices: prices tend to bounce back, or revert, towards previous levels after large moves, which are observed over time horizons of about a week. 
* A more mathematical way to describe mean reversion is to say that stock returns are negatively autocorrelated.
```
#Convert the daily data to weekly data
MSFT = MSFT.resample(rule='W').last()
#Compute the percentage change of prices
returns = MSFT.pct_change()
#Compute and print the autocorrelation of returns
autocorrelation = returns['Adj Close'].autocorr()
print("The autocorrelation of weekly returns is %4.2f" %(autocorrelation))
```

* When you look at daily changes in interest rates, the autocorrelation is close to zero. However, if you resample the data and look at annual changes, the autocorrelation is negative. This implies that while short term changes in interest rates may be uncorrelated, long term changes in interest rates are negatively autocorrelated. 
```
#Compute the daily change in interest rates 
daily_diff = daily_rates.diff()
# Compute and print the autocorrelation of daily changes
autocorrelation_daily = daily_diff['US10Y'].autocorr()
print("The autocorrelation of daily interest rate changes is %4.2f" %(autocorrelation_daily))
#Convert the daily data to annual data
yearly_rates = daily_rates.resample(rule='A').last()
#Repeat above for annual data
yearly_diff = yearly_rates.diff()
autocorrelation_yearly = yearly_diff['US10Y'].autocorr()
print("The autocorrelation of annual interest rate changes is %4.2f" %(autocorrelation_yearly))
```

## Autocorrelation Function
* __ACF:__ AutoCorrelation Function; shows not only the lag 1 autocorrelation, but the entire autocorrelation function for different lags
* Any significant non-zero autocorrelation implies that the series can be forecast from its past
* ACF useful for model selection
* __Plot ACF in Python:__
    * `from statsmodels.graphics.tsaplots import plot_acf`
    * input x is series or array 
    * `plot_acf(x, lags=20, alpha=0.05)`
    * the argument `lags` indicates how many lags of the autocorrelation function will be plotted.
    * the `alpha` argument sets the width of the confidence interval.
* __Confidence Interval of ACF:__
    * In `plot_acf` the argument `alpha` determines the width of the confidence interval.
    * `alpha` = the chance that, if the true autocorrelation is zero, the sample autocorrelation will fall outside the blue band
    * Example: `alpha` = 0.05 = 5% chance
    * Confidence bands are wider if:
        * alpha is lower
        * fewer observations
    * Under some simplifying assumptions, 95% confidence bands are $\pm 2/\sqrt{N}$
    * If you don't want to see confidence intervals in your plot, set `alpha=1`
* __Numerical Values of ACF:__
    * Instead of plotting ACF, you can also print numerical values of ACF
    * `from statsmodels.tsa.stattools import acf`
    * `print(acf(x))`

```
#Import the acf module and the plot_acf module from statsmodels
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf
#Compute the acf array of HRB
acf_array = acf(HRB)
print(acf_array)
#Plot the acf function
plot_acf(HRB, alpha=1)
plt.show()
```

* the standard deviation of the sample autocorrelation is 1/$\sqrt{N}$ where _N_ is the number of observations.
* if _N_ = 100, for example, the standard deviation of the ACF is 0.1
* since 95% of a normal curve is between +1.96 and -1.96 standard deviations from the mean, the 95% confidence interval is $\pm$1.96/$\sqrt{N}$.
* This approximation only holds when the true autocorrelations are all zero.
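
A quick numerical check of the $\pm 1.96/\sqrt{N}$ band, using simulated white noise (where the true autocorrelations are all zero, so the approximation applies). The sample size and seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_obs = 1000
noise = pd.Series(rng.normal(loc=0, scale=1, size=n_obs))

# Approximate 95% band for a sample autocorrelation when the truth is zero.
band = 1.96 / np.sqrt(n_obs)

# Sample autocorrelations at lags 1 through 20.
sample_acf = np.array([noise.autocorr(lag=k) for k in range(1, 21)])

# Share of lags whose autocorrelation stays inside the band.
share_inside = np.mean(np.abs(sample_acf) < band)

print("band: +/- %5.3f" % band)
print("share of lags inside the band: %4.2f" % share_inside)
```

Roughly 95% of the lags are expected to fall inside the band; with only 20 lags, one or two excursions are not unusual.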

```
#Import the plot_acf module from statsmodels and sqrt from math
from statsmodels.graphics.tsaplots import plot_acf
from math import sqrt
#Compute and print the autocorrelation of MSFT weekly returns
autocorrelation = returns['Adj Close'].autocorr()
print("The autocorrelation of weekly MSFT returns is %4.2f" %(autocorrelation))
#Find the number of observations by taking the length of the returns DataFrame
nobs = len(returns)
#Compute the approximate confidence interval
conf = 1.96/sqrt(nobs)
print("The approximate confidence interval is +/- %4.2f" %(conf))
#Plot the autocorrelation function with 95% confidence intervals and 20 lags using plot_acf
plot_acf(returns, alpha=0.05, lags=20)
plt.show()
```

## White Noise
* White noise is a series with:
    * constant mean
    * constant variance
    * zero autocorrelations at all lags
* Special Case: if data has normal distribution, then: _Gaussian White Noise._
* __Simulating White Noise:__
    * `import numpy as np`
    * `noise = np.random.normal(loc=0, scale=1, size=500)`
    * all autocorrelations of white noise are zero:
    * ` plot_acf(noise, lags=50)`
* White noise cannot be forecasted. 
* Stock returns are often modeled as white noise
* A white noise time series is simply a sequence of uncorrelated random variables that are identically distributed. 

```
#Import the plot_acf module from statsmodels
from statsmodels.graphics.tsaplots import plot_acf
#Simulate white noise returns
returns = np.random.normal(loc=0.02, scale=0.05, size=1000)
#Print out the mean and standard deviation of returns
mean = np.mean(returns)
std = np.std(returns)
print("The mean is %5.3f and the standard deviation is %5.3f" %(mean,std))
#Plot returns series
plt.plot(returns)
plt.show()
#Plot autocorrelation function of white noise returns
plot_acf(returns, lags=20)
plt.show()
```

### Random Walk
* Whereas stock returns are often modeled as white noise, stock prices closely follow a random walk. In other words, today's price is yesterday's price plus some random noise.
* Today's price = yesterday's price + noise
* $P_t = P_{t-1} + \epsilon_t$
* Change in price of a random walk is just white noise:
    * $P_t - P_{t-1} = \epsilon_t$
* If stock prices follow a random walk, then stock returns are white noise. 
* However, when adding noise, you could theoretically get negative prices
* Can't forecast a random walk; the best forecast for tomorrow's price is today's price. 
* In __Random Walk with drift:__ prices drift by $\mu$ every period 
    * Many time series, like stock prices, are random walks but tend to drift up over time.
    * Change in price for a random walk with drift is still white noise, but with a mean of $\mu$
    * If we view stock prices as random walk with drift, then the returns are still white noise, but with an average return of $\mu$ instead of zero.
* __Statistical test for random walk:__
    * Regress current prices on lag prices 
        * If the slope coefficient $\beta$ is not significantly different from one, then we cannot reject the null hypothesis that the series is a random walk.
        * If $\beta$ is significantly less than 1, then we _can_ reject the null hypothesis that the series is a random walk.
        * An equivalent way to perform the test is to regress the difference in prices on the lagged price. Instead of testing whether the slope coefficient is one, we now test whether it is zero. This is the __Dickey-Fuller Test__.
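
The two regressions can be compared on a simulated random walk. This is only a sketch of the point estimates using `np.polyfit`; it does not compute Dickey-Fuller critical values, so it cannot by itself say whether a slope is "significantly" different from one or zero.

```python
import numpy as np

rng = np.random.default_rng(7)
steps = rng.normal(loc=0, scale=1, size=1000)
P = 100 + np.cumsum(steps)            # simulated random walk

lagged = P[:-1]

# Regression 1: current price on lagged price; slope should be close to 1.
beta, _ = np.polyfit(lagged, P[1:], deg=1)

# Regression 2: price change on lagged price; slope should be close to 0.
gamma, _ = np.polyfit(lagged, np.diff(P), deg=1)

print("slope of P_t on P_{t-1}:           %6.4f" % beta)
print("slope of (P_t - P_{t-1}) on P_{t-1}: %6.4f" % gamma)
```

The second slope equals the first minus one, which is why the two formulations of the test are equivalent.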

```
#Generate 500 random steps with mean=0 and standard deviation=1
steps = np.random.normal(loc=0, scale=1, size=500)
#Set first element to 0 so that the first price will be the starting stock price
steps[0]=0
#Simulate stock prices, P with a starting price of 100
P = 100 + np.cumsum(steps)
#Plot the simulated stock prices
plt.plot(P)
plt.title("Simulated Random Walk")
plt.show()
```
```
#Generate 500 random steps
steps = np.random.normal(loc=0.001, scale=0.01, size=500) + 1
#Set first element to 1
steps[0]=1
#Simulate the stock price, P, by taking the cumulative product
P = 100 * np.cumprod(steps)
#Plot the simulated stock prices
plt.plot(P)
plt.title("Simulated Random Walk with Drift")
plt.show()
```
### ADF Test in Python:
* `from statsmodels.tsa.stattools import adfuller`
* Run the augmented Dickey-Fuller test:
* `adfuller(x)`
* With the ADF test, the "null hypothesis" (the hypothesis that we either reject or fail to reject) is that the series follows a random walk. Therefore, a low p-value (say less than 5%) means we can reject the null hypothesis that the series is a random walk.
* (`results[0]` is the test statistic, and `results[1]` is the p-value).
``` 
#Import the adfuller module from statsmodels
from statsmodels.tsa.stattools import adfuller
#Run the ADF test on the price series and print out the results
results = adfuller(AMZN['Adj Close'])
print(results)
#Just print out the p-value
print('The p-value of the test on prices is: ' + str(results[1]))
```
```
#Import the adfuller module from statsmodels
from statsmodels.tsa.stattools import adfuller
#Create a DataFrame of AMZN returns
AMZN_ret = AMZN.pct_change()
#Eliminate the NaN in the first row of returns
AMZN_ret = AMZN_ret.dropna()
#Run the ADF test on the return series and print out the p-value
results = adfuller(AMZN_ret['Adj Close'])
print('The p-value of the test on returns is: ' + str(results[1]))
```
* Most stock prices follow a random walk (perhaps with a drift)

## Stationarity
* __Strong Stationarity:__ entire distribution of data is time-invariant.
* __Weak Stationarity:__ A less restrictive type of stationarity; mean, variance, and auto correlation are time-invariant; easier to test
* If a process is not stationary, then it becomes difficult to model.
* Modeling involves estimating a set of parameters, and if a process is not stationary and the parameters are different at each point in time, then there are too many parameters to estimate. You may end up having more parameters than actual data. 
* Stationarity is necessary for a parsimonious model (with a smaller set of parameters to estimate). 
* A random walk is a common type of non-stationary series; the variance grows with time
* Seasonal series are also non-stationary
* Many non-stationary series can be made stationary through a simple transformation or set of transformations. 
    * First difference
    * Seasonal difference

### Seasonality
* Many time series exhibit strong seasonal behavior. 
* The procedure for removing the seasonal component of a time series is called seasonal adjustment.
* For example, most economic data published by the government is seasonally adjusted.
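* Seasonal differencing subtracts the value from one full season earlier. For quarterly data the periodicity is 4, so you take the "fourth difference": each value minus the value 4 periods before it (`.diff(4)`).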
```
#Import the plot_acf module from statsmodels
from statsmodels.graphics.tsaplots import plot_acf
#Seasonally adjust quarterly earnings
HRBsa = HRB.diff(4)
#Print the first 10 rows of the seasonally adjusted series
print(HRBsa.head(10))
#Drop the NaN data in the first four rows
HRBsa = HRBsa.dropna()
#Plot the autocorrelation function of the seasonally adjusted series
plot_acf(HRBsa)
plt.show()
```

## Introducing an AR model
* AR model = AutoRegressive Model
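* In its standard form, today's value is a mean plus a fraction $\phi$ of yesterday's value plus noise:
    * $R_t = \mu + \phi R_{t-1} + \epsilon_t$
* Since there is only one lagged value on the right hand side, this is called a: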
    * AR model of order 1 or;
    * AR(1) model 
* (AR models of other orders also exist)
* If the AR parameter $\phi$ is 1, then the process is a random walk.
* If the AR parameter $\phi$ is 0, then the process is white noise. 
* In order for the process to be stable and stationary: -1 < $\phi$ < 1
* __Negative $\phi$:__ Mean reversion
* __Positive $\phi$:__ Momentum
* Autocorrelation decays exponentially at a rate of $\phi$
* Therefore, if $\phi$ = 0.9:
    * lag(1) Autocorrelation = 0.9
    * lag(2) Autocorrelation = $0.9^2$
    * lag(3) Autocorrelation = $0.9^3$
    * lag(4) Autocorrelation = $0.9^4$
    * etc...
* When $\phi$ is negative, the autocorrelation function still decays exponentially, but the sign of the autocorrelation reverses at each lag.
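
The exponential decay can be checked with a hand-rolled AR(1) simulation. This sketch uses a plain numpy loop rather than statsmodels, with zero mean and unit-variance noise as assumptions; the names are hypothetical.

```python
import numpy as np
import pandas as pd

phi = 0.9

# Theoretical autocorrelation at lags 1..4 is phi raised to the lag.
theoretical_acf = phi ** np.arange(1, 5)

# Simulate an AR(1) process: each value is phi times the previous value plus noise.
rng = np.random.default_rng(11)
n = 5000
eps = rng.normal(size=n)
r = np.zeros(n)
for t in range(1, n):
    r[t] = phi * r[t - 1] + eps[t]

# Sample autocorrelation at the same lags.
series = pd.Series(r)
sample_acf = [series.autocorr(lag=k) for k in range(1, 5)]

print("theoretical:", np.round(theoretical_acf, 4))
print("sample:     ", np.round(sample_acf, 4))
```

Replacing `phi = 0.9` with `phi = -0.9` keeps the same magnitudes but alternates the sign from one lag to the next.
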
### Simulating an AR Process:
* `statsmodels` provides a module (`arima_process`) for simulating AR processes
* There are a few conventions when using the arima_process module that require some explanation. 
* First, these routines were made very generally to handle both AR and MA models. 
* Second, when inputting the coefficients, you must include the zero-lag coefficient of 1, and the sign of the other coefficients is opposite what we have been using (to be consistent with the time series literature in signal processing). 
* For example, for an AR(1) process with $\phi = 0.9$, the array representing the AR parameters would be `ar = np.array([1, -0.9])`
```
#Import numpy, matplotlib, and the module for simulating data
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
#Plot 1: AR parameter = +0.9
plt.subplot(2,1,1)
ar1 = np.array([1, -0.9])
ma1 = np.array([1])
AR_object1 = ArmaProcess(ar1, ma1)
simulated_data_1 = AR_object1.generate_sample(nsample=1000)
plt.plot(simulated_data_1)
#Plot 2: AR parameter = -0.9
plt.subplot(2,1,2)
ar2 = np.array([1, 0.9])
ma2 = np.array([1])
AR_object2 = ArmaProcess(ar2, ma2)
simulated_data_2 = AR_object2.generate_sample(nsample=1000)
plt.plot(simulated_data_2)
plt.show()
```
* The autocorrelation function decays exponentially for an AR time series at a rate of the AR parameter. 
* A smaller AR parameter will have a steeper decay, and for a negative AR parameter, say -0.9, the decay will flip signs

```
#Import the plot_acf module from statsmodels
from statsmodels.graphics.tsaplots import plot_acf
#Plot 1: AR parameter = +0.9
plot_acf(simulated_data_1, alpha=1, lags=20)
plt.show()
#Plot 2: AR parameter = -0.9
plot_acf(simulated_data_2, alpha=1, lags=20)
plt.show()
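#Plot 3: AR parameter = +0.3 (simulated the same way as the first two series)
simulated_data_3 = ArmaProcess(np.array([1, -0.3]), np.array([1])).generate_sample(nsample=1000)
plot_acf(simulated_data_3, alpha=1, lags=20)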
plt.show()
```

## Estimating and Forecasting an AR Model:
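* To estimate the parameters of an AR model from data, fit the model with `statsmodels`. The `statsmodels.tsa.arima_model.ARMA` class was removed in statsmodels 0.13; on current versions use `statsmodels.tsa.arima.model.ARIMA` with `order=(p, 0, 0)`.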
```
#Import the ARIMA class from statsmodels (it replaces the ARMA class removed in statsmodels 0.13)
from statsmodels.tsa.arima.model import ARIMA
#Fit an AR(1) model to the first simulated data; order=(p, d, q) with d=0 and q=0
mod = ARIMA(simulated_data_1, order=(1,0,0))
res = mod.fit()
#Print out summary information on the fit
print(res.summary())
#Print out the estimates for the constant, phi, and the noise variance
print("When the true phi=0.9, the estimate of phi (and the constant) are:")
print(res.params)
```

```
#Import the ARIMA class and the plot_predict helper from statsmodels
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict
#Forecast the first AR(1) model
mod = ARIMA(simulated_data_1, order=(1,0,0))
res = mod.fit()
plot_predict(res, start=990, end=1010)
plt.show()
```
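The out-of-sample part of such a plot follows a simple pattern for an AR(1): each step ahead moves the forecast a fraction $\phi$ of the remaining distance back toward the mean. The sketch below assumes the parameters are known exactly (so it ignores estimation error and confidence bands); `ar1_forecast` and its arguments are hypothetical names.

```python
import numpy as np

def ar1_forecast(last_value, phi, mu, steps):
    """h-step-ahead forecasts of an AR(1) with mean mu: mu + phi**h * (last_value - mu)."""
    h = np.arange(1, steps + 1)
    return mu + phi ** h * (last_value - mu)

# Forecast 20 steps ahead from a last observed value of 2.0
forecasts = ar1_forecast(last_value=2.0, phi=0.9, mu=0.0, steps=20)
print(np.round(forecasts[:3], 3))   # first three steps ahead
print(np.round(forecasts[-1], 3))   # twenty steps ahead, much closer to the mean
```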
```
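#Import the ARIMA class and the plot_predict function from statsmodels
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict
#Forecast interest rates using an AR(1) model
mod = ARIMA(interest_rate_data, order=(1,0,0))
res = mod.fit()
#Plot the original series and the forecasted series on the same axes
fig, ax = plt.subplots()
interest_rate_data.plot(ax=ax)
plot_predict(res, start=0, end="2022", ax=ax)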
plt.legend(fontsize=8)
plt.show()
```
```
#Import the plot_acf module from statsmodels
from statsmodels.graphics.tsaplots import plot_acf
#Plot the ACF of the interest rate series and of the simulated random walk series in two stacked panels
fig, axes = plt.subplots(2,1)
#Plot the autocorrelation of the interest rate series in the top plot
fig = plot_acf(interest_rate_data, alpha=1, lags=12, ax=axes[0])
#Plot the autocorrelation of the simulated random walk series in the bottom plot
fig = plot_acf(simulated_data, alpha=1, lags=12, ax=axes[1])
#Label axes
axes[0].set_title("Interest Rate Data")
axes[1].set_title("Simulated Random Walk Data")
plt.show()
```
## Choosing the right model:
* You will ordinarily not be told the order of the model you're trying to estimate
* There are two techniques that can help determine the order of the AR model:
    * __Partial Autocorrelation Function (PACF):__ measures the incremental benefit of adding another lag
    * __Information Criteria:__ The more parameters a model has, the better it will fit the data, but this can lead to overfitting. Information criteria adjust the goodness of fit for the number of parameters by imposing a penalty on each parameter used.
        * Two popular adjusted goodness-of-fit measures:
            * __AIC (Akaike Information Criterion):__ `result.aic`
            * __BIC (Bayesian Information Criterion):__ `result.bic`
            * Try a few different models and choose the one with the lowest information criterion
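To illustrate the "incremental benefit of adding another lag" idea behind the PACF, the sketch below estimates the partial autocorrelation at lag k as the last coefficient in a regression of the series on its first k lags. This is one common regression-based estimator written in plain NumPy as an assumption-laden sketch; `pacf_by_regression` is a hypothetical helper, not the statsmodels function.

```python
import numpy as np

def pacf_by_regression(x, nlags):
    """Partial autocorrelation at lag k = coefficient on x[t-k] when regressing
    x[t] on a constant and x[t-1], ..., x[t-k]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    out = [1.0]  # lag 0 is 1 by convention
    for k in range(1, nlags + 1):
        y = x[k:]
        lags = [x[k - j : n - j] for j in range(1, k + 1)]
        X = np.column_stack([np.ones(n - k)] + lags)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        out.append(coef[-1])
    return np.array(out)

# Simulated AR(1) with phi = 0.6 (illustrative data)
rng = np.random.default_rng(11)
eps = rng.standard_normal(3000)
series = np.zeros(3000)
for t in range(1, 3000):
    series[t] = 0.6 * series[t - 1] + eps[t]

pacf = pacf_by_regression(series, nlags=5)
print(np.round(pacf, 2))
```

For an AR(1), only the lag-1 partial autocorrelation should be clearly different from zero, which is the pattern used to read the order off a PACF plot.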
            
```
#Import the ARIMA class for estimating the model
from statsmodels.tsa.arima.model import ARIMA
#Fit the data to an AR(p) for p = 0,...,6 , and save the BIC
BIC = np.zeros(7)
for p in range(7):
    mod = ARIMA(simulated_data_2, order=(p,0,0))
    res = mod.fit()
    #Save BIC for AR(p)
    BIC[p] = res.bic
#Plot the BIC as a function of p
plt.plot(range(1,7), BIC[1:7], marker='o')
plt.xlabel('Order of AR Model')
plt.ylabel('Bayesian Information Criterion')
plt.show()
```
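The penalty idea can be seen in a hand-rolled version of the same loop. The sketch below fits each AR(p) by least squares and uses the Gaussian form `n * ln(RSS / n) + k * ln(n)`, which matches a likelihood-based BIC only up to an additive constant, so the numbers will not equal `res.bic`. The helper `ar_bic`, the seed, and the simulated data are assumptions for illustration.

```python
import numpy as np

def ar_bic(x, p, max_p):
    """BIC (up to an additive constant) of an AR(p) fitted by least squares.
    The first max_p observations are dropped so every order uses the same sample."""
    x = np.asarray(x, dtype=float)
    y = x[max_p:]
    n = len(y)
    cols = [np.ones(n)] + [x[max_p - j : len(x) - j] for j in range(1, p + 1)]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    n_params = p + 1  # AR coefficients plus the constant
    return n * np.log(rss / n) + n_params * np.log(n)

# Simulated AR(1) with phi = -0.9 (illustrative data)
rng = np.random.default_rng(2)
eps = rng.standard_normal(2000)
data = np.zeros(2000)
for t in range(1, 2000):
    data[t] = -0.9 * data[t - 1] + eps[t]

bic = np.array([ar_bic(data, p, max_p=6) for p in range(7)])
best_p = int(np.argmin(bic))
print("BIC by order:", np.round(bic, 1))
print("order with the lowest BIC:", best_p)
```

Going from p=0 to p=1 lowers the criterion sharply because the fit improves a lot; beyond the true order the fit barely improves, so the `ln(n)` penalty per extra parameter tends to push the criterion back up.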

* MA model = moving average model
* In an MA(1) model only the lag-one autocorrelation is nonzero: only last period's shock affects the current period's value (nothing older than that)
* Note: the lag-one autocorrelation is not $\theta$, but $\theta / (1 + \theta^2)$
* 'the bid-ask bounce'
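The lag-one formula above can be checked numerically. This NumPy sketch builds an MA(1) directly from white noise and compares the sample autocorrelations with $\theta / (1 + \theta^2)$; the seed, sample size, and the helper name `ma1_lag1_autocorr` are illustrative assumptions.

```python
import numpy as np

def ma1_lag1_autocorr(theta):
    """Theoretical lag-1 autocorrelation of an MA(1): theta / (1 + theta**2)."""
    return theta / (1 + theta ** 2)

rng = np.random.default_rng(1)
theta = 0.5
eps = rng.standard_normal(100_001)

# MA(1): x[t] = eps[t] + theta * eps[t-1]
x = eps[1:] + theta * eps[:-1]

sample_rho1 = np.corrcoef(x[:-1], x[1:])[0, 1]
sample_rho2 = np.corrcoef(x[:-2], x[2:])[0, 1]

print("theoretical lag-1:", ma1_lag1_autocorr(theta))
print("sample lag-1:     ", round(sample_rho1, 3))
print("sample lag-2:     ", round(sample_rho2, 3))
```

The lag-1 value sits near 0.4 rather than 0.5, and the lag-2 value is near zero because shocks older than one period do not enter an MA(1).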

```
from statsmodels.tsa.arima_process import ArmaProcess
# MA(1) with theta = +0.5: no AR terms, so ar is just the zero-lag coefficient
ar = np.array([1])
ma = np.array([1, 0.5])
MA_object = ArmaProcess(ar, ma)
simulated_data = MA_object.generate_sample(nsample=1000)
plt.plot(simulated_data)
plt.show()
```

```
# import the module for simulating data
from statsmodels.tsa.arima_process import ArmaProcess

# Plot 1: MA parameter = -0.9
plt.subplot(2,1,1)
ar1 = np.array([1])
ma1 = np.array([1, -0.9])
MA_object1 = ArmaProcess(ar1, ma1)
simulated_data_1 = MA_object1.generate_sample(nsample=1000)
plt.plot(simulated_data_1)

# Plot 2: MA parameter = +0.9
plt.subplot(2,1,2)
ar2 = np.array([1])
ma2 = np.array([1, 0.9])
MA_object2 = ArmaProcess(ar2, ma2)
simulated_data_2 = MA_object2.generate_sample(nsample=1000)
plt.plot(simulated_data_2)

plt.show()
```

```
# Import the plot_acf module from statsmodels
from statsmodels.graphics.tsaplots import plot_acf

# Plot 1: MA parameter = -0.9
plot_acf(simulated_data_1, lags=20)
plt.show()
# Plot 2: MA parameter = 0.9
plot_acf(simulated_data_2, lags=20)
plt.show()
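# Plot 3: MA parameter = -0.3 (simulate this series first, since only two were generated above)
from statsmodels.tsa.arima_process import ArmaProcess
simulated_data_3 = ArmaProcess(np.array([1]), np.array([1, -0.3])).generate_sample(nsample=1000)
plot_acf(simulated_data_3, lags=20)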
plt.show()
```

## Estimation and forecasting an MA model
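* Estimating an MA model works the same way as estimating an AR model; only the order changes. With the `ARIMA` class, whose order is `(p, d, q)`, an MA(1) is `order=(0,0,1)` and an AR(1) is `order=(1,0,0)`.

```
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict
mod = ARIMA(simulated_data, order=(0,0,1))
res = mod.fit()
# Date strings for start/end assume the data has a DatetimeIndex
plot_predict(res, start='2016-07-01', end='2017-06-01')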
plt.show()
```

```
# Import the ARIMA class and the plot_predict function from statsmodels
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict

# Fit an MA(1) model to the first simulated data
mod = ARIMA(simulated_data_1, order=(0,0,1))
res = mod.fit()

# Print out summary information on the fit
print(res.summary())

# Print out the estimates of the constant, theta, and the noise variance (sigma2)
print("When the true theta=-0.9, the estimate of theta (and the constant) are:")
print(res.params)

# Forecast from the fitted MA(1) model
plot_predict(res, start=990, end=1010)
plt.show()
```
## ARMA models
* Combination of an AR model and an MA model 
* Can be converted to pure AR or pure MA models

```
# import datetime module
import datetime

# Change the first date to zero
intraday.iloc[0,0] = 0

# Change the column headers to 'DATE' and 'CLOSE'
intraday.columns = ['DATE', 'CLOSE']

# Examine the data types for each column
print(intraday.dtypes)

# Convert DATE column to numeric
intraday['DATE'] = pd.to_numeric(intraday['DATE'])

# Make the `DATE` column the new index
intraday = intraday.set_index('DATE')

# Notice that some rows are missing
print("If there were no missing rows, there would be 391 rows of minute data")
print("The actual length of the DataFrame is:", len(intraday))
# Everything
set_everything = set(range(391))

# The intraday index as a set
set_intraday = set(intraday.index)

# Calculate the difference
set_missing = set_everything - set_intraday

# Print the difference
print("Missing rows: ", set_missing)
# Fill in the missing rows
intraday = intraday.reindex(range(391), method='ffill')


# Change the index to the intraday times
intraday.index = pd.date_range(start='2017-09-01 9:30', end='2017-09-01 16:00', freq='1min')

# Plot the intraday time series
intraday.plot(grid=True)
plt.show()
```
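* An AR(1) model is equivalent to an MA($\infty$) model with the appropriate parameters (the MA coefficients are successive powers of the AR parameter, e.g. $0.8^i$ in the code below)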

```
# import the modules for simulating data and plotting the ACF
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf

# Build a list MA parameters
ma = [.8**i for i in range(30)]

# Simulate the MA(30) model
ar = np.array([1])
AR_object = ArmaProcess(ar, ma)
simulated_data = AR_object.generate_sample(nsample=5000)

# Plot the ACF
plot_acf(simulated_data, lags=30)
plt.show()
```
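The equivalence can also be checked without plots by feeding the same shocks through both representations. This is a NumPy-only sketch under stated assumptions (seed, sample size, and the 30 weights $0.8^0, \dots, 0.8^{29}$ taken from the list above); it is not how statsmodels simulates the process.

```python
import numpy as np

rng = np.random.default_rng(7)
phi = 0.8
n = 2000
eps = rng.standard_normal(n)

# AR(1) built recursively: x[t] = phi * x[t-1] + eps[t]
ar_series = np.zeros(n)
ar_series[0] = eps[0]
for t in range(1, n):
    ar_series[t] = phi * ar_series[t - 1] + eps[t]

# Truncated MA representation: weights phi**i (i = 0..29) applied to the same shocks
weights = phi ** np.arange(30)
ma_series = np.convolve(eps, weights)[:n]

max_gap = np.max(np.abs(ar_series - ma_series))
print("largest absolute difference between the two series:", max_gap)
```

The two series agree up to the small contribution of the weights that were cut off after 30 terms, which is why the ACF of the long MA model looks like that of an AR(1).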
## Cointegration models
* What is cointegration?
* Two series can be random walks, but the linear combination of the two series may not be a random walk (!?)
* If that is true: each series on its own is not forecastable, but combined they are forecastable. In these cases we say the two series are **cointegrated.**
* Analogy: Dog on a leash. The owner may follow a random walk and the dog may follow a random walk, but the difference between their positions may very well be mean reverting. 
* What type of series are cointegrated?
    * Oil and gas
    * With certain commodities, there may be economic forces that link the two prices (like in the instance of gas and oil)
    * Platinum and Palladium
    * Corn and wheat
    * Corn and sugar
    * ... etc...
    * Bitcoin and Ethereum?
* For stocks, a natural starting point for identifying cointegrated pairs is stocks in the same industry. However, competitors are not necessarily economic substitutes. (Coke vs Pepsi?)
### Two steps to test for cointegration
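* First, regress the level of one series on the level of the other series to get the slope *c*
* Second, run an augmented Dickey-Fuller test for a random walk on the linear combination of the two series (the first series minus *c* times the second); if the test rejects a random walk, the two series are cointegrated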

```
from statsmodels.tsa.stattools import coint
coint(P,Q)
```
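The two steps can be sketched by hand on simulated data. The example below is a simplified illustration: it uses a plain Dickey-Fuller regression with no lagged differences and only reports the test statistic, whereas `coint` and `adfuller` also handle lag selection and the proper critical values. The series `P` and `Q`, the slope of 2, the seed, and the helper `dickey_fuller_stat` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Q is a random walk; P = 2*Q + stationary noise, so P and Q are cointegrated by construction
Q = np.cumsum(rng.standard_normal(n))
P = 2.0 * Q + rng.standard_normal(n)

# Step 1: regress the level of P on the level of Q (with a constant) to get the slope c
X = np.column_stack([np.ones(n), Q])
coef, *_ = np.linalg.lstsq(X, P, rcond=None)
intercept, c = coef
spread = P - c * Q - intercept

def dickey_fuller_stat(series):
    """t-statistic on rho in: diff(series)[t] = rho * series[t-1] + error
    (non-augmented Dickey-Fuller regression, no constant)."""
    dy = np.diff(series)
    lagged = series[:-1]
    rho = (lagged @ dy) / (lagged @ lagged)
    resid = dy - rho * lagged
    se = np.sqrt(resid @ resid / (len(dy) - 1) / (lagged @ lagged))
    return rho / se

# Step 2: test the linear combination (the spread) for a random walk
stat_spread = dickey_fuller_stat(spread)
stat_Q = dickey_fuller_stat(Q)

print("estimated slope c:", round(c, 3))
print("DF statistic for the spread:", round(stat_spread, 1))
print("DF statistic for Q alone:   ", round(stat_Q, 1))
```

A very negative statistic for the spread, next to a much milder one for the individual random walk, is the pattern that indicates cointegration.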
***
```
# Plot the prices separately
plt.subplot(2,1,1)
plt.plot(7.25*HO, label='Heating Oil')
plt.plot(NG, label='Natural Gas')
plt.legend(loc='best', fontsize='small')

# Plot the spread
plt.subplot(2,1,2)
plt.plot(7.25*HO-NG, label='Spread')
plt.legend(loc='best', fontsize='small')
plt.axhline(y=0, linestyle='--', color='k')
plt.show()

# Import the adfuller module from statsmodels
from statsmodels.tsa.stattools import adfuller

# Compute the ADF for HO and NG
result_HO = adfuller(HO['Close'])
print("The p-value for the ADF test on HO is ", result_HO[1])
result_NG = adfuller(NG['Close'])
print("The p-value for the ADF test on NG is ", result_NG[1])

# Compute the ADF of the spread
result_spread = adfuller(7.25 * HO['Close'] - NG['Close'])
print("The p-value for the ADF test on the spread is ", result_spread[1])

# Import the statsmodels module for regression and the adfuller function
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

# Regress BTC on ETH
ETH = sm.add_constant(ETH)
result = sm.OLS(BTC,ETH).fit()

# Compute ADF on the spread, using the fitted slope (second parameter, after the constant)
b = result.params.iloc[1]
adf_stats = adfuller(BTC['Price'] - b*ETH['Price'])
print("The p-value for the ADF test is ", adf_stats[1])
```

# Analyzing Temperature Data
* Temperature data:
    * New York City from 1870-2016
    * Downloaded from NOAA
    
$\times$ 1) Convert index to datetime object \
$\times$ 2) Plot the data \
$\times$ 3) Run Augmented Dickey Fuller Test to see whether the data is a random walk \
4) Take first differences of the data to transform it into a stationary series \
5) Compute ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) \
6) Using that as a guide, fit a few AR, MA, and ARMA models to the data \
7) Use information criterion to choose the best model \
8) Forecast temperature over next 30 years \

* __`.corr()`__
* __`.autocorr()`__
* __`.plot(grid=True)`__
* __White Noise:__ a series with
    * constant mean
    * constant variance
    * zero autocorrelations at all lags
    
* **Random Walk: First Differences**
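
The white noise properties and the link between a random walk and its first differences can be checked on simulated data. This is a small NumPy sketch with an arbitrary seed and sample size; `autocorr` is a hypothetical helper that plays the role of pandas' `.autocorr()`.

```python
import numpy as np

rng = np.random.default_rng(0)

def autocorr(x, lag=1):
    """Sample autocorrelation of x at the given lag."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# White noise: constant mean, constant variance, no autocorrelation
noise = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Random walk: each value is the previous value plus a white-noise step
walk = np.cumsum(noise)

# First differences of the walk recover the white-noise steps
changes = np.diff(walk)

print("noise mean / variance:", round(noise.mean(), 2), round(noise.var(), 2))
print("lag-1 autocorr of noise:  ", round(autocorr(noise), 2))
print("lag-1 autocorr of walk:   ", round(autocorr(walk), 2))
print("lag-1 autocorr of changes:", round(autocorr(changes), 2))
```

The walk itself is almost perfectly autocorrelated, while its first differences behave like the original noise, which is why differencing is the usual first step before fitting an ARMA model.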

**Is Temperature a Random Walk (with Drift)?**
* An ARMA model is a simplistic approach to forecasting climate changes, but it illustrates many points.

1)

```
# Import the adfuller function from the statsmodels module
from statsmodels.tsa.stattools import adfuller

# Convert the index to a datetime object
temp_NY.index = pd.to_datetime(temp_NY.index, format='%Y')

# Plot average temperatures
temp_NY.plot()
plt.show()

# Compute and print ADF p-value
result = adfuller(temp_NY['TAVG'])
print("The p-value for the ADF test is ", result[1])
```
*** 
2)
```
# Import the modules for plotting the sample ACF and PACF
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Take first difference of the temperature Series
chg_temp = temp_NY.diff()
chg_temp = chg_temp.dropna()

# Plot the ACF and PACF on the same page
fig, axes = plt.subplots(2,1)

# Plot the ACF
plot_acf(chg_temp, lags=20, ax=axes[0])

# Plot the PACF
plot_pacf(chg_temp, lags=20, ax=axes[1])
plt.show()
```
***
3)
```
# Import the ARIMA class for estimating ARMA models
from statsmodels.tsa.arima.model import ARIMA

# Fit the data to an AR(1) model and print AIC:
mod_ar1 = ARIMA(chg_temp, order=(1,0,0))
res_ar1 = mod_ar1.fit()
print("The AIC for an AR(1) is: ", res_ar1.aic)

# Fit the data to an AR(2) model and print AIC:
mod_ar2 = ARIMA(chg_temp, order=(2,0,0))
res_ar2 = mod_ar2.fit()
print("The AIC for an AR(2) is: ", res_ar2.aic)

# Fit the data to an ARMA(1,1) model and print AIC:
mod_arma11 = ARIMA(chg_temp, order=(1,0,1))
res_arma11 = mod_arma11.fit()
print("The AIC for an ARMA(1,1) is: ", res_arma11.aic)
```
***
4)

```
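# Import the ARIMA class and the plot_predict function from statsmodels
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict

# Forecast temperatures using an ARIMA(1,1,1) model with a drift term (trend='t')
mod = ARIMA(temp_NY, trend='t', order=(1,1,1))
res = mod.fit()

# Plot the original series and the forecasted series on the same axes
fig, ax = plt.subplots()
temp_NY.plot(ax=ax)
plot_predict(res, start='1872-01-01', end='2046-01-01', ax=ax)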
plt.show()
```
***