# Missing values and interpolation

One common application of interpolation in data analysis is to fill in missing data.

In this exercise, noisy measured data with some dropped or otherwise missing values has been loaded. The goal is to compare two time series and then examine summary statistics of their differences. The problem is that one of the data sets lacks values at some of the times: the pre-loaded series `ts1` has a value for every date, but `ts2` has no data for the weekends.

Your job is first to interpolate `ts2` so that it has a value for every day. Then compute the differences between the two series, now that both are defined at all times. Finally, generate summary statistics that describe the distribution of the differences.
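The three steps can be tried on a small synthetic example before touching the real data. This is a minimal sketch, not the exercise solution: the dates, the values, and the names `full`, `weekdays_only`, `reindexed`, and `filled` are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten consecutive calendar days starting on a Monday.
all_days = pd.date_range("2016-07-04", periods=10, freq="D")
full = pd.Series(np.arange(10, dtype=float), index=all_days)

# Drop Saturday and Sunday to mimic a series with no weekend observations.
weekdays_only = full[full.index.dayofweek < 5]

# Reindexing onto the full calendar inserts NaN for the missing days.
reindexed = weekdays_only.reindex(all_days)

# Linear interpolation fills each NaN from the valid values on either side.
filled = reindexed.interpolate(method="linear")

# Absolute differences and their summary statistics.
abs_diff = (full - filled).abs()
print(abs_diff.describe())
```

Because the synthetic values lie on a straight line, linear interpolation recovers the dropped weekend values and the differences are all zero. With noisy measurements the interpolated values will generally differ from the true ones.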

In [9]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

import data

ts1_values = list(range(len(data.ts1d)))
ts2_values = list(range(len(data.ts2d)))

ts1d = {'Date': data.ts1d, 'Value': ts1_values}
ts2d = {'Date': data.ts2d, 'Value': ts2_values}

df1 = pd.DataFrame(ts1d)
df2 = pd.DataFrame(ts2d)

df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

ts1 = pd.Series(df1['Value'].values, index=df1['Date'])
ts2 = pd.Series(df2['Value'].values, index=df2['Date'])

In [12]:
# Reindex ts2 onto the index of ts1, then fill the resulting NaNs by linear interpolation: ts2_interp
ts2_interp = ts2.reindex(ts1.index).interpolate(method='linear')

# Compute the absolute difference of ts1 and ts2_interp: differences
differences = np.abs(ts1 - ts2_interp)

# Generate and print summary statistics of the differences
print(differences.describe())


count    17.000000
mean      2.882353
std       1.585267
min       0.000000
25%       2.000000
50%       2.666667
75%       4.000000
max       6.000000
dtype: float64
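The `count` reported by `describe()` includes only non-NaN entries, so it is worth checking whether interpolation left any gaps. The sketch below uses made-up values and hypothetical names (`a`, `b`, `b_interp`, `diff`) and assumes the second series has no observation on the first date. With pandas' default forward fill direction for linear interpolation, a NaN before the first valid value is left in place, while a trailing NaN is filled with the last valid value.

```python
import numpy as np
import pandas as pd

# Hypothetical series on a shared five-day index.
idx = pd.date_range("2016-07-04", periods=5, freq="D")
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
b = pd.Series([np.nan, 2.5, np.nan, 4.5, np.nan], index=idx)

# Interior and trailing NaNs are filled; the leading NaN is not.
b_interp = b.interpolate(method="linear")

diff = (a - b_interp).abs()

# Compare the number of remaining NaNs with the count from describe().
n_missing = int(diff.isna().sum())
print("remaining NaNs:", n_missing)
print("count in describe():", diff.describe()["count"])
```

If the number of remaining NaNs is not zero, the summary statistics describe only the dates where both series have a value.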
