## Problem 1: The Mauna Loa CO2 Concentration

In 1958, Charles David Keeling (1928-2005) from the Scripps Institution of Oceanography began recording carbon dioxide (CO2) concentrations in the atmosphere at an observatory located at about 3,400 m altitude on the Mauna Loa volcano on the island of Hawaii. The location was chosen because it is not influenced by changing CO2 levels due to the local vegetation and because prevailing wind patterns on this tropical island tend to bring well-mixed air to the site. While the recordings are made near a volcano (which tends to produce CO2), wind patterns tend to blow the volcanic CO2 away from the recording site. Air samples are taken several times a day, and concentrations have been observed using the same measuring method for over 60 years. In addition, samples are stored in flasks and periodically reanalyzed for calibration purposes. The observational study is now run by Ralph Keeling, Charles's son. The result is a data set with very few interruptions and very few inhomogeneities. It has been called the "most important data set in modern climate research."

The data set for this problem can be found in `CO2.csv`. It provides the concentration of CO2 recorded at Mauna Loa for each month starting March 1958. More description is provided in the data set file. We will be considering only the CO2 concentration given in column 5. The goal of the problem is to fit the data and understand its variations. You will encounter missing data points; part of the exercise is to deal with them appropriately.

Let $C_i$ be the average CO2 concentration in month $i$ ($i = 1,2,\dots$, counting from March 1958). We will look for a description of the form:

$$
C_i = F(t_i) + P_i + R_i
$$

where:

- $F: t \mapsto F(t)$ accounts for the long-term trend.
- $t_i$ is the time at the middle of the $i$th month, measured in fractions of years after Jan 15, 1958. Specifically, we take

  $$
  t_i=\frac{i+0.5}{12}, \qquad i=0,1,\dots ,
  $$

  where $i=0$ corresponds to Jan, 1958; the $0.5$ is added because the first measurement is halfway through the first month.

- $P_i$ is periodic in $i$ with a fixed period, accounting for the seasonal pattern.
- $R_i$ is the remaining residual that accounts for all other influences.

The decomposition is meaningful only if the range of $F$ is much larger than the amplitude of the $P_i$ and this amplitude in turn is substantially larger than that of $R_i$.
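As a minimal sketch of the time index defined above, the following computes $t_i = (i + 0.5)/12$ for the first few months. The helper name `month_midpoints` is hypothetical and not part of the assignment.

```python
import numpy as np

def month_midpoints(n_months):
    """Return t_i = (i + 0.5) / 12 for i = 0, ..., n_months - 1, in years."""
    i = np.arange(n_months)
    return (i + 0.5) / 12

# Mid-month times for the first three months (i = 0, 1, 2)
t = month_midpoints(3)
print(t)
```

Consecutive values differ by exactly 1/12 of a year, so a seasonal component with a one-year period repeats every 12 entries.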

You are required to split the data into training and test datasets - you can perform an 80:20 split. All model fitting should be done only on the training set and all the remaining data should be used for evaluation (for the purpose of model selection), i.e. prediction errors should be reported with respect to the test set.
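One way to perform the 80:20 split is sketched below. It assumes a chronological split (the earliest 80% of rows for training, the latest 20% for testing), which is a common choice for time series but is not spelled out in the problem statement. The function name `chronological_split`, the column names `t` and `co2`, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd

def chronological_split(df, train_frac=0.8):
    """Split a time-ordered DataFrame into leading (train) and trailing (test) parts."""
    n_train = int(len(df) * train_frac)
    return df.iloc[:n_train], df.iloc[n_train:]

# Toy data standing in for the cleaned CO2 table (values are made up)
demo = pd.DataFrame({
    "t": (np.arange(10) + 0.5) / 12,
    "co2": np.linspace(315.0, 316.0, 10),
})

train, test = chronological_split(demo)
print(len(train), len(test))
```

With this choice every test observation lies later in time than every training observation, so the test error measures how well the fitted model extrapolates.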

At the end of this problem you should be able to

- Handle incomplete data sets using at least one method.
- Perform time series regression and find the deterministic and periodic trends in data.
- Interpret residuals.

### Pre-Processing data

You may notice that there are some inhomogeneities in the data: at these points the CO2 concentration is recorded as -99.99. Before proceeding, we must clean the data. One simple way to do this is to drop all missing values from the table. For the purpose of the problems below, use this simple method of dropping all the missing values.
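Dropping the missing values might look like the sketch below. It uses a small made-up table rather than `CO2.csv`, and the column name `co2` is a hypothetical stand-in for the concentration column.

```python
import numpy as np
import pandas as pd

MISSING = -99.99  # sentinel used for missing concentrations

# Toy table with two missing entries (values are made up)
demo = pd.DataFrame({"co2": [315.71, -99.99, 317.45, 317.50, -99.99]})

# Keep only rows whose value is not the sentinel; isclose avoids
# relying on exact floating-point equality.
is_missing = np.isclose(demo["co2"], MISSING)
clean = demo[~is_missing].reset_index(drop=True)
print(clean)
```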

Other methods include forward filling (filling each missing value with the previous observed value) and interpolation.
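For comparison, these two alternatives can be sketched with pandas as follows. The series values are made up for illustration, and linear interpolation is only one possible interpolation choice.

```python
import numpy as np
import pandas as pd

# Toy series with the -99.99 sentinel marking missing months
raw = pd.Series([315.0, -99.99, 317.0, -99.99, -99.99, 320.0])

# Convert the sentinel to NaN so pandas treats it as missing
s = raw.mask(np.isclose(raw, -99.99))

ffilled = s.ffill()                      # repeat the last observed value
interp = s.interpolate(method="linear")  # straight line between neighbours

print(ffilled.tolist())
print(interp.tolist())
```

Forward filling produces flat steps across each gap, while linear interpolation connects the neighbouring observations. Either one keeps the monthly grid intact, unlike dropping rows.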

The pre-processing should be done before splitting data.


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.tsa.api as smt
import sklearn as sk
```


