# INFO 3402 – Class 37: Time series analysis exercise

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Credit also goes to Jake VanderPlas's *[Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html)* and Kevin Markham's [DAT4](https://github.com/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb) notebooks.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

import numpy as np
import pandas as pd

pd.options.display.max_columns = 200

import statsmodels.formula.api as smf
import statsmodels.api as sm

Explore the temporal features of the global temperature data from the [Berkeley Earth Surface Temperature](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data#GlobalTemperatures.csv) dataset.

Start by reading in the data.

In [2]:
temp_df = pd.read_csv('global_temperature.csv',parse_dates=['dt'])
temp_df.columns = ['date','avg_temp','avg_temp_error','max_temp','max_temp_error','min_temp','min_temp_error','avg_all_temp','avg_all_temp_error']
temp_df.head()

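| | date | avg_temp | avg_temp_error | max_temp | max_temp_error | min_temp | min_temp_error | avg_all_temp | avg_all_temp_error |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1750-01-01 | 3.034 | 3.574 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1750-02-01 | 3.083 | 3.702 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1750-03-01 | 5.626 | 3.076 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1750-04-01 | 8.49 | 2.451 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1750-05-01 | 11.573 | 2.072 | NaN | NaN | NaN | NaN | NaN | NaN |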


Extract the month from "date" in `temp_df` and add as a new column.
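One possible approach uses the `.dt` accessor on a datetime column. The sketch below runs on a small made-up `demo_df` standing in for `temp_df` (the temperatures are invented for illustration), so swap in your own DataFrame.

```python
import pandas as pd

# Made-up stand-in for temp_df: six monthly rows with invented temperatures.
demo_df = pd.DataFrame({
    "date": pd.date_range("1750-01-01", periods=6, freq="MS"),
    "avg_temp": [3.0, 3.1, 5.6, 8.5, 11.6, 13.0],
})

# The .dt accessor exposes datetime components of a datetime64 column.
demo_df["month"] = demo_df["date"].dt.month
print(demo_df)
```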

Create a "months_since_1750" variable using `Timestamp` and `Timedelta` (as I did in Class 36).
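A minimal sketch of this step, again on a made-up `demo_df`. Dividing by a `Timedelta` of an average month (365.25 / 12 days, an assumption of this sketch) gives fractional months; the integer alternative built from the year and month is shown for comparison. Neither is necessarily the exact recipe from Class 36.

```python
import pandas as pd

# Made-up stand-in for temp_df: 25 consecutive month-start dates.
demo_df = pd.DataFrame({"date": pd.date_range("1750-01-01", periods=25, freq="MS")})

start = pd.Timestamp("1750-01-01")
# Assumption: treat one month as the average month length in days.
one_month = pd.Timedelta(days=365.25 / 12)

# Timestamp subtraction yields Timedeltas; dividing by a Timedelta yields floats.
demo_df["months_since_1750"] = (demo_df["date"] - start) / one_month

# Exact integer alternative computed from calendar fields.
demo_df["months_since_1750_int"] = (
    (demo_df["date"].dt.year - 1750) * 12 + (demo_df["date"].dt.month - 1)
)
print(demo_df.tail())
```

The fractional version drifts slightly from whole numbers because real months differ in length, which is usually harmless for a regression predictor.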

Create a new pandas `Series` called `avg_temp_s` with "avg_temp" as values and "date" as an index. You may also need to fill in missing values in the data. Go back to Class 15 or 16 on how to handle missing data in time series, or use the [relevant pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) to come up with a strategy for handling these missing data in `avg_temp_s`.
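As one possible strategy, the sketch below builds a date-indexed Series from a made-up `demo_df` with a few gaps and fills them by time-weighted linear interpolation. Interpolation is only one option; forward-filling or dropping the gaps may suit the real data better.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for temp_df with some missing temperatures.
demo_df = pd.DataFrame({
    "date": pd.date_range("1750-01-01", periods=8, freq="MS"),
    "avg_temp": [3.0, 3.1, np.nan, 8.5, 11.6, np.nan, np.nan, 14.0],
})

# Values from "avg_temp", index from "date".
demo_s = demo_df.set_index("date")["avg_temp"]
print("missing before:", demo_s.isna().sum())

# method="time" weights the interpolation by the gap between timestamps.
filled_s = demo_s.interpolate(method="time")
print("missing after:", filled_s.isna().sum())
```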

Plot the raw data for "avg_temp" against "date". If you use the `.plot` method on the `temp_df` DataFrame, you can also pass an option like `figsize=(x,y)` to make an especially wide chart.
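A short sketch of the `figsize` option, using a made-up `demo_df` with an invented annual cycle rather than the real temperatures.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for temp_df: ten years of monthly values with an annual cycle.
dates = pd.date_range("1750-01-01", periods=120, freq="MS")
month = dates.month.to_numpy()
demo_df = pd.DataFrame({
    "date": dates,
    "avg_temp": 8 - 6 * np.cos(2 * np.pi * (month - 1) / 12),
})

# figsize is (width, height) in inches; a large width stretches the time axis.
ax = demo_df.plot(x="date", y="avg_temp", figsize=(16, 4), legend=False)
ax.set_xlabel("date")
ax.set_ylabel("avg_temp")
```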

Use `seasonal_decompose` from statsmodels to visualize the trend, seasonal, and residual components of `avg_temp_s`. 

* What does the **trend** pane suggest about the behavior of this time series? How is this similar or different from the trend in the DIA passenger data?
* What does the **seasonal** pane suggest about the behavior of this time series? How is this similar or different from the seasonal pattern in the DIA passenger data?
* What does the **residual** pane suggest about the behavior of this time series? How is this similar or different from the residuals in the DIA passenger data?

Use the `.rolling()` method (again, see Class 36) to compute a rolling average of the temperatures, choosing an appropriate window to average over. Visualize the rolling average to explore whether there is an underlying trend.

* What does this rolling average chart capture that the components of `seasonal_decompose` do not?
* What might be some possible explanations for why there is so much variance in the old data and less variance in more recent data?
* Try an even coarser rolling window. For example, if you chose a 12-month rolling window, see how the chart changes with something like a 120-month rolling window. Can you find the [Year Without a Summer](https://en.wikipedia.org/wiki/Year_Without_a_Summer) in this data?
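A minimal sketch comparing a 12-month and a 120-month window on a synthetic series (invented trend, annual cycle, and noise). The window sizes are the ones suggested above, not the only reasonable choices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for avg_temp_s: 40 years of monthly values.
rng = np.random.default_rng(0)
idx = pd.date_range("1900-01-01", periods=480, freq="MS")
t = np.arange(len(idx))
demo_s = pd.Series(
    8 + 0.002 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, len(idx)),
    index=idx,
)

# A 12-month window averages over one full annual cycle.
roll_12 = demo_s.rolling(window=12).mean()
# A 120-month window smooths over a decade.
roll_120 = demo_s.rolling(window=120).mean()

ax = roll_12.plot(figsize=(16, 4), label="12-month mean")
roll_120.plot(ax=ax, label="120-month mean")
ax.legend()
```

The first `window - 1` values of each rolling mean are `NaN` because the window is not yet full; pass `min_periods` to `.rolling()` if you want estimates there.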

Make a point plot using seaborn with "month" on the x-axis and "avg_temp" on the y-axis. Which is the warmest month of the year on average? Is this surprising or not? Is it surprising when you consider that this data is supposed to be the average temperature across the globe?
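The plotting call can be sketched like this. The data are synthetic, and the July peak is baked into the made-up values, so the sketch shows the mechanics only and says nothing about the real temperatures.

```python
import numpy as np
import pandas as pd
import seaborn as sb

# Synthetic stand-in for temp_df: 20 years of monthly values peaking in month 7.
rng = np.random.default_rng(0)
dates = pd.date_range("1990-01-01", periods=240, freq="MS")
month = dates.month.to_numpy()
demo_df = pd.DataFrame({
    "month": month,
    "avg_temp": 8 - 6 * np.cos(2 * np.pi * (month - 1) / 12)
                + rng.normal(0, 0.3, len(dates)),
})

# pointplot draws the mean of y for each x category with an error bar.
ax = sb.pointplot(x="month", y="avg_temp", data=demo_df)

# The same monthly means as numbers, to identify the warmest month.
monthly_mean = demo_df.groupby("month")["avg_temp"].mean()
warmest_month = monthly_mean.idxmax()
print(monthly_mean.round(2))
```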

Fit a simple linear regression of "avg_temp" against "months_since_1750". Interpret the coefficients, significance, and model performance.
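A sketch of the simple model using the formula API imported at the top of the notebook. The data are synthetic, with an invented slope of 0.002 degrees per month plus an annual cycle, so the printed numbers are not results for the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for temp_df: 100 years of monthly values.
rng = np.random.default_rng(0)
months_since_1750 = np.arange(1200)
month = months_since_1750 % 12 + 1
demo_df = pd.DataFrame({
    "months_since_1750": months_since_1750,
    "month": month,
    "avg_temp": 5 + 0.002 * months_since_1750
                - 6 * np.cos(2 * np.pi * (month - 1) / 12)
                + rng.normal(0, 0.5, len(months_since_1750)),
})

# Outcome on the left of "~", predictor on the right.
simple_model = smf.ols("avg_temp ~ months_since_1750", data=demo_df).fit()

print(simple_model.params)      # intercept and slope
print(simple_model.pvalues)     # significance of each coefficient
print(simple_model.rsquared)    # share of variance explained
```

`simple_model.summary()` prints the full regression table. The slope is the estimated change in temperature per additional month.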

Fit a linear regression model of "avg_temp" against "months_since_1750", with "month" as a fixed effect. Interpret the coefficients, significance, and model performance compared to the previous regression model.
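One way to add the fixed effect is to wrap "month" in `C()` in the formula so it is treated as a categorical variable. The sketch below fits both models on synthetic data (invented slope, annual cycle, and noise) so the two can be compared side by side.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for temp_df: 100 years of monthly values.
rng = np.random.default_rng(0)
months_since_1750 = np.arange(1200)
month = months_since_1750 % 12 + 1
demo_df = pd.DataFrame({
    "months_since_1750": months_since_1750,
    "month": month,
    "avg_temp": 5 + 0.002 * months_since_1750
                - 6 * np.cos(2 * np.pi * (month - 1) / 12)
                + rng.normal(0, 0.5, len(months_since_1750)),
})

simple_model = smf.ols("avg_temp ~ months_since_1750", data=demo_df).fit()

# C(month) adds one indicator per month, with month 1 as the reference level.
fixed_model = smf.ols("avg_temp ~ months_since_1750 + C(month)", data=demo_df).fit()

print(fixed_model.params)
print("R-squared, simple:", simple_model.rsquared)
print("R-squared, fixed effects:", fixed_model.rsquared)
```

Each `C(month)[T.k]` coefficient is the estimated difference between month `k` and January, holding the time trend fixed.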

## Advanced

Create a new DataFrame with dates at a monthly interval from January 2010 to January 2050. 

Create columns for "month" and "months_since_1750". 

Use the `.predict()` method on both the simple and fixed effects regression models you trained above. 

Visualize the observed and predicted values from 2010 onward.
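The steps in this section can be sketched end to end as follows. Everything here is a stand-in: the training data are synthetic, `add_time_features` is a hypothetical helper, and "months_since_1750" uses the integer year-and-month calculation rather than `Timedelta` arithmetic.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def add_time_features(df):
    """Hypothetical helper: add "month" and "months_since_1750" columns."""
    out = df.copy()
    out["month"] = out["date"].dt.month
    out["months_since_1750"] = (out["date"].dt.year - 1750) * 12 + (out["month"] - 1)
    return out

# Synthetic training data standing in for temp_df.
rng = np.random.default_rng(0)
train_df = add_time_features(
    pd.DataFrame({"date": pd.date_range("1950-01-01", "2015-12-01", freq="MS")})
)
train_df["avg_temp"] = (
    0.002 * train_df["months_since_1750"]
    - 6 * np.cos(2 * np.pi * (train_df["month"] - 1) / 12)
    + rng.normal(0, 0.5, len(train_df))
)

simple_model = smf.ols("avg_temp ~ months_since_1750", data=train_df).fit()
fixed_model = smf.ols("avg_temp ~ months_since_1750 + C(month)", data=train_df).fit()

# Monthly dates from January 2010 through January 2050, inclusive.
future_df = add_time_features(
    pd.DataFrame({"date": pd.date_range("2010-01-01", "2050-01-01", freq="MS")})
)

# .predict() needs the same column names that appear in the formula.
future_df["pred_simple"] = simple_model.predict(future_df)
future_df["pred_fixed"] = fixed_model.predict(future_df)

# Overlay the observed values from 2010 onward with both sets of predictions.
observed = train_df[train_df["date"] >= "2010-01-01"]
fig, ax = plt.subplots(figsize=(16, 4))
ax.plot(observed["date"], observed["avg_temp"], label="observed")
ax.plot(future_df["date"], future_df["pred_simple"], label="simple model")
ax.plot(future_df["date"], future_df["pred_fixed"], label="fixed effects model")
ax.legend()
```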

## Pro-level

Create a periodogram/power spectral density plot to find other cyclical patterns within the data besides an annual (12-month) frequency. 

What are possible explanations for these other frequencies?
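A rough sketch of a periodogram using `scipy.signal.periodogram`, which is one of several tools for this. The signal is synthetic: an annual cycle plus a weaker 60-month cycle whose period was picked arbitrarily for illustration, so the peaks found here say nothing about the real data.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.signal import periodogram

# Synthetic monthly signal: annual cycle + arbitrary 60-month cycle + noise.
rng = np.random.default_rng(0)
n_months = 1200
t = np.arange(n_months)
values = (
    5 * np.sin(2 * np.pi * t / 12)
    + 1 * np.sin(2 * np.pi * t / 60)
    + rng.normal(0, 0.5, n_months)
)

# fs=12 samples per year, so frequencies come back in cycles per year.
freqs, power = periodogram(values, fs=12)

# The two strongest frequencies and their periods in months.
top = np.argsort(power)[::-1][:2]
for i in top:
    print(f"{freqs[i]:.2f} cycles/year (period of {12 / freqs[i]:.0f} months)")

# Skip the zero-frequency bin, which a log axis cannot display.
fig, ax = plt.subplots(figsize=(10, 4))
ax.semilogy(freqs[1:], power[1:])
ax.set_xlabel("frequency (cycles per year)")
ax.set_ylabel("power spectral density")
```

With real data, pass the gap-filled values of `avg_temp_s` instead of `values`; the periodogram does not handle missing values.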