# __Data imputation__

### __Have we got what we think we have got?__

According to the source of our CO2 emissions data, readings are dispatched every 5 minutes. But can we be sure that this holds across a period of about 10 years? The next few cells check.

In [39]:
import pandas as pd

In [40]:
file_path = '../../big_data/df_DUID_price_range.pkl'
df = pd.read_pickle(file_path)

df.head()

PeriodID,Price,CO2E_EMISSIONS_FACTOR
2009-07-01 04:00:00,1.35918,0.991217
2009-07-01 04:05:00,-6e-05,0.0
2009-07-01 04:10:00,-6e-05,0.0
2009-07-01 04:15:00,1.44014,0.991217
2009-07-01 04:20:00,1.7548,1.025701


In [41]:
def correct_timedelta(df, time_diff):
    '''
    Find consecutive timestamps whose spacing differs from the expected interval.

    df: table of interest; df.index must be a DatetimeIndex
    time_diff: expected interval in seconds, as int

    Returns two lists:
    lst_1: human-readable messages, one per irregular interval
    lst_2: (timestamp, interval in minutes) tuples, one per irregular interval
    '''
    lst_1 = []
    lst_2 = []

    # Compare each timestamp with its predecessor
    for i in range(1, df.shape[0]):
        delta = abs(df.index[i] - df.index[i-1])
        seconds = delta.total_seconds()
        if int(seconds) != int(time_diff):
            # Report the gap at the timestamp that precedes it
            lst_1.append(f'At time stamp {df.index[i-1]}, the interval is {int(seconds/60)} min or {round(seconds/3600, 2)} h.')
            lst_2.append((df.index[i-1], int(seconds/60)))

    return lst_1, lst_2
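
The loop above visits every row in Python, which can be slow for around a million timestamps. As a rough sketch of a vectorized alternative (not the approach used in this notebook), the same check can be expressed with `Series.diff`. The function name `find_irregular_intervals` and the toy timestamps are assumptions made for illustration only.

```python
import pandas as pd

def find_irregular_intervals(index, expected="5min"):
    """Return the interval lengths that differ from `expected`,
    labelled by the timestamp at which each interval starts."""
    stamps = index.to_series()
    # Difference to the previous timestamp; the first entry is NaT.
    deltas = stamps.diff().abs()
    # NaT compares as "not equal", so exclude it explicitly.
    irregular = deltas.notna() & (deltas != pd.Timedelta(expected))
    result = deltas[irregular]
    # Label each gap with the timestamp that precedes it,
    # mirroring what correct_timedelta reports.
    result.index = pd.DatetimeIndex(stamps.shift(1)[irregular])
    return result

# Toy index (hypothetical values): the 04:10 slot is missing.
idx = pd.to_datetime([
    "2009-07-01 04:00",
    "2009-07-01 04:05",
    "2009-07-01 04:15",
    "2009-07-01 04:20",
])
gaps = find_irregular_intervals(idx)
print(gaps)
```

On the real data this would be called as `find_irregular_intervals(df.index)`. Unlike `correct_timedelta`, it compares exact `Timedelta` values and does not truncate to whole seconds first.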


In [42]:
l1, l2 = correct_timedelta(df, 300)

In [43]:
l1[:20]

['At time stamp 2009-08-17 02:50:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-08-17 03:25:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-08-19 16:55:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-09-02 04:35:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-09-03 03:55:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-09-03 04:10:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-09-04 02:15:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-09-04 02:50:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-09-04 04:20:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-09-13 04:20:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2009-09-13 04:40:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2010-01-23 14:20:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2010-03-30 02:55:00, the interval is 10 min or 0.17 h.',
 'At time stamp 2010-09-18 06:00:00, the interval i

## __Filling in the gaps__

### __DatetimeIndex__

Next, we will fill in the gaps in the DatetimeIndex. Note that this leaves NaN values in the other columns wherever timestamps are inserted into the index.
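
To make that effect concrete before applying it to the full dataset, here is a minimal sketch on a toy frame. The names `toy`, `full_index` and `toy_filled` and the values are made up for illustration; only `reindex` and `date_range` are taken from the cells in this section.

```python
import pandas as pd

# Toy frame with one missing 5-minute slot (04:10); values are illustrative.
toy = pd.DataFrame(
    {"Price": [1.0, 2.0, 3.0]},
    index=pd.DatetimeIndex([
        "2009-07-01 04:00",
        "2009-07-01 04:05",
        "2009-07-01 04:15",
    ]),
)

# Complete 5-minute grid between the first and last timestamp.
full_index = pd.date_range(toy.index.min(), toy.index.max(), freq="5min")

# Rows for timestamps that were absent are created and filled with NaN.
toy_filled = toy.reindex(full_index)
print(toy_filled)
```

The existing rows keep their values, and the inserted 04:10 row holds NaN in `Price`.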

In [44]:
#hypothetical number of 5min intervals given the max and min values of the df time range

h = pd.date_range(start=df.index.min(), end=df.index.max(), freq="5min")
len(h)

1060704

In [45]:
# both sizes are integers, so they can be compared directly
assert df.shape[0] == len(h)

AssertionError: 

Here we clearly see a mismatch between the size our time range would have if every 5-minute interval were present and the size of our actual dataset.
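
The failed assertion shows that the sizes differ, but not by how much or where. One way to quantify the mismatch, sketched here on toy data under the assumption that the index has no duplicates and every timestamp falls on the 5-minute grid, is `Index.difference`. The names `actual`, `expected` and `missing` are illustrative.

```python
import pandas as pd

# Toy index (hypothetical values) with the 04:10 slot missing.
actual = pd.DatetimeIndex([
    "2009-07-01 04:00",
    "2009-07-01 04:05",
    "2009-07-01 04:15",
])
expected = pd.date_range(start=actual.min(), end=actual.max(), freq="5min")

# Timestamps on the complete grid that are absent from the data.
missing = expected.difference(actual)

print(len(expected) - len(actual))  # number of missing slots under the assumptions above
print(missing)
```

On the real data the equivalent call would be `h.difference(df.index)`.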

__Filling in the DatetimeIndex:__

In [46]:
df = df.reindex(h)
print(df.shape)
df.head()

(1060704, 2)


,Price,CO2E_EMISSIONS_FACTOR
2009-07-01 04:00:00,1.35918,0.991217
2009-07-01 04:05:00,-6e-05,0.0
2009-07-01 04:10:00,-6e-05,0.0
2009-07-01 04:15:00,1.44014,0.991217
2009-07-01 04:20:00,1.7548,1.025701


In [47]:
df = df.sort_index(ascending=True)
df.isna().any()

Price                    True
CO2E_EMISSIONS_FACTOR    True
dtype: bool

Note that we have only filled in the DatetimeIndex so far. The other columns now contain NaN values wherever timestamps were inserted. In the next step we will impute data for those columns. Let's discuss how best to do that for a time series.