# Data Cleaning, Preprocessing, and Feature Engineering

In this notebook we read in our data, match each instance to its sky image, and add the image file path to the DataFrame.
After joining the data, we join each irradiance reading to the weather, irradiance, and sky image recorded one interval earlier (5 to 30 minutes before).

All data was downloaded from __[here](https://zenodo.org/record/2826939#.YEPKXi1h1pS)__. Thanks so much to the University of California San Diego team (Carreira Pedro, Hugo; Larson, David; Coimbra, Carlos), who worked so hard on collecting this data and on supporting the work of others in this space.

#### Below we will:
1. [Create dataframes](#Create-DataFrames-for-Models)
    - create filepaths for images
    - get time intervals and join for earlier irradiance data
    - merge weather and irradiance data
    
  
2. [explore our data](#Data-Exploration:)
3. [preprocess/scale our data](#Pre-Processing-and-Scaling)


Import needed libraries:

In [380]:
import pandas as pd
import bz2
from datetime import datetime,timedelta
import tarfile
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from math import pi, sin
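# the helpers below (open_pickle, breakdown_dates, ...) are called unqualified,
# so they are imported into the notebook namespace
from data_processing_functions import *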

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# import sklearn

In [398]:
# TODO: check whether the image timestamps were saved incorrectly. If they are off,
# each row can be matched to the closest file by second; maybe use an upper
# and a lower time bound within the minute?
    

### Create DataFrames for Models


Open the image files to get image names:


In [None]:
new_img_file_names = {}
for yr in [2014,2015,2016]:
    f_name = f'data/Folsom_sky_images_{yr}.tar.bz2'
    print(f_name)
    # mode "r" lets tarfile detect the bz2 compression; the context manager closes the archive
    with tarfile.open(f_name, "r") as tar:
        tar_members_names = tar.getnames()
    new_img_file_names[yr] = {}
    get_all_file_names(tar_members_names,new_img_file_names[yr])


In [None]:
# save_pickle('image_file_names.pkl',new_img_file_names)
# save_pickle('data/df_solar_and_img_data.pkl',df_merge_1)
img_file_names = open_pickle('data/image_file_names.pkl')
df_merge_1 = open_pickle('data/df_solar_and_img_data.pkl')
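
The `save_pickle` and `open_pickle` helpers are not defined in this notebook. A minimal sketch of what they might look like is below, assuming the path comes first and the object second (the order used in the calls above); the real helpers may differ.

```python
import pickle

def save_pickle(path, obj):
    """Serialize obj to path (assumed signature: path first, object second)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def open_pickle(path):
    """Load and return the object stored at path."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Caching intermediate results this way lets the slow tar-scanning and matching cells be skipped on later runs.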

Read in weather and irradiance data:

In [None]:
fol_irr = pd.read_csv('data/Folsom_irradiance.csv',index_col=0)
fol_sat = pd.read_csv('data/Folsom_satellite.csv')
fol_sky_img = pd.read_csv('data/Folsom_sky_image_features.csv',index_col=0)
fol_weather = pd.read_csv('data/Folsom_weather.csv',index_col=0)

Get datetime for each interval:

In [None]:
# fol_sat.columns
datetime_blank_min_before(fol_irr,5)
datetime_blank_min_before(fol_irr,10)
datetime_blank_min_before(fol_irr,15)
datetime_blank_min_before(fol_irr,20)
datetime_blank_min_before(fol_irr,25)
datetime_blank_min_before(fol_irr,30)
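
`datetime_blank_min_before` is a helper that is not shown here. Judging by the column names used later (`5_min_before`, `10_min_before`, ...), it presumably adds a look-back timestamp column in place. The sketch below is an assumption about its behavior, not the actual implementation; the `ts_col` parameter is hypothetical.

```python
import pandas as pd

def datetime_blank_min_before(df, minutes, ts_col="timestamp"):
    """Add a '<minutes>_min_before' column: ts_col shifted back by `minutes` (in place)."""
    ts = pd.to_datetime(df[ts_col])
    df[f"{minutes}_min_before"] = ts - pd.Timedelta(minutes=minutes)
    return df

# toy frame standing in for fol_irr
demo = pd.DataFrame({"timestamp": ["2014-01-02 15:33:00", "2014-01-02 15:34:00"]})
for m in (5, 10):
    datetime_blank_min_before(demo, m)
print(demo)
```

With a helper of this shape, the six calls above could also be written as a loop over `range(5, 35, 5)`.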

In [None]:
fol_weather['timestamp'] = pd.to_datetime(fol_weather['timestamp'])

Merge the irradiance DFs with the Folsom weather DFs on the interval before:

In [None]:
df_5_min_ahead = pd.merge(fol_irr,fol_weather,how="left", left_on="5_min_before", right_on="timestamp")
df_10_min_ahead = pd.merge(fol_irr,fol_weather,how="left", left_on="10_min_before", right_on="timestamp")
df_15_min_ahead = pd.merge(fol_irr,fol_weather,how="left", left_on="15_min_before", right_on="timestamp")
df_20_min_ahead = pd.merge(fol_irr,fol_weather,how="left", left_on="20_min_before", right_on="timestamp")
df_25_min_ahead = pd.merge(fol_irr,fol_weather,how="left", left_on="25_min_before", right_on="timestamp")
df_30_min_ahead = pd.merge(fol_irr,fol_weather,how="left", left_on="30_min_before", right_on="timestamp")
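
To make the effect of these merges concrete, here is a small sketch on toy data (the values are illustrative, not read from the CSV files). Because both frames carry a `timestamp` column and the join keys have different names, pandas suffixes the overlapping column as `timestamp_x` (irradiance time) and `timestamp_y` (weather time), and rows with no weather reading at the look-back time come back as missing.

```python
import pandas as pd

irr = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-02 15:33:00", "2014-01-02 15:34:00"]),
    "ghi": [2.52, 3.17],
})
irr["5_min_before"] = irr["timestamp"] - pd.Timedelta(minutes=5)

# weather exists only for 15:28, so only the first irradiance row finds a match
weather = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-02 15:28:00"]),
    "air_temp": [2.8],
})

merged = pd.merge(irr, weather, how="left",
                  left_on="5_min_before", right_on="timestamp")
print(merged.columns.tolist())
print(len(merged.dropna()))  # unmatched rows are dropped by dropna()
```

This is why the next cell calls `dropna()` and renames `timestamp_x` back to `timestamp`.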

In [None]:
df_5_min_ahead = df_5_min_ahead.dropna()[['timestamp_x','5_min_before', 'ghi', 'dni', 'dhi',
                                  'air_temp', 'relhum', 'press', 'windsp', 'winddir','max_windsp', 
                                  'precipitation']].rename(columns={'timestamp_x':'timestamp'})

df_10_min_ahead = df_10_min_ahead.dropna()[['timestamp_x','10_min_before', 'ghi', 'dni', 'dhi',
                                  'air_temp', 'relhum', 'press', 'windsp', 'winddir','max_windsp', 
                                  'precipitation']].rename(columns={'timestamp_x':'timestamp'})

df_15_min_ahead = df_15_min_ahead.dropna()[['timestamp_x','15_min_before', 'ghi', 'dni', 'dhi',
                                  'air_temp', 'relhum', 'press', 'windsp', 'winddir','max_windsp', 
                                  'precipitation']].rename(columns={'timestamp_x':'timestamp'})

df_20_min_ahead = df_20_min_ahead.dropna()[['timestamp_x','20_min_before', 'ghi', 'dni', 'dhi',
                                  'air_temp', 'relhum', 'press', 'windsp', 'winddir','max_windsp', 
                                  'precipitation']].rename(columns={'timestamp_x':'timestamp'})

df_25_min_ahead = df_25_min_ahead.dropna()[['timestamp_x','25_min_before', 'ghi', 'dni', 'dhi',
                                  'air_temp', 'relhum', 'press', 'windsp', 'winddir','max_windsp', 
                                  'precipitation']].rename(columns={'timestamp_x':'timestamp'})

df_30_min_ahead = df_30_min_ahead.dropna()[['timestamp_x','30_min_before', 'ghi', 'dni', 'dhi',
                                  'air_temp', 'relhum', 'press', 'windsp', 'winddir','max_windsp', 
                                  'precipitation']].rename(columns={'timestamp_x':'timestamp'})

In [None]:
for table in [df_5_min_ahead,df_10_min_ahead,df_15_min_ahead,df_20_min_ahead,df_25_min_ahead,df_30_min_ahead]:
#     print(table.columns[1])
    breakdown_dates(table,table.columns[1])
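
`breakdown_dates` is another helper defined outside this notebook. Based on the columns that appear in later outputs (`year`, `month`, `day`, `hour`, `minute`), a plausible sketch is shown below; treat it as an assumption rather than the actual function.

```python
import pandas as pd

def breakdown_dates(df, col):
    """Split the datetime column `col` into calendar-part columns (in place)."""
    ts = pd.to_datetime(df[col])
    df["year"] = ts.dt.year
    df["month"] = ts.dt.month
    df["day"] = ts.dt.day
    df["hour"] = ts.dt.hour
    df["minute"] = ts.dt.minute
    return df

demo_dates = pd.DataFrame({"5_min_before": ["2014-01-02 15:28:00"]})
breakdown_dates(demo_dates, "5_min_before")
print(demo_dates.iloc[0].to_dict())
```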

Using helper functions, get the times just below and just above each timestamp so that image files can be matched against them:

In [None]:
for table in [df_5_min_ahead,df_10_min_ahead,df_15_min_ahead,df_20_min_ahead,df_25_min_ahead,df_30_min_ahead]:
    print(table.columns[1])
    table['higher_file'] = table.apply(lambda row: make_higher_image_path(row),axis=1)
    table['lower_file'] = table.apply(lambda row: make_lower_image_path(row),axis=1)

Get the correct image file that falls between the two columns above:

In [None]:
for table in [df_5_min_ahead,df_10_min_ahead,df_15_min_ahead,df_20_min_ahead,df_25_min_ahead,df_30_min_ahead]:
    print(table.columns[1])
    table['file'] = table.apply(lambda row: get_correct_file(row,img_file_names),axis=1)
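
The matching helpers (`make_higher_image_path`, `make_lower_image_path`, `get_correct_file`) are not shown. As an alternative illustration of the same idea, and not the method used here, the sketch below matches each timestamp to the nearest image with `pd.merge_asof`. It assumes file names follow the `YYYY/MM/DD/YYYYMMDD_HHMMSS.jpg` pattern seen in the outputs; the function names and the 30-second tolerance are hypothetical.

```python
import pandas as pd

def index_image_files(paths):
    """Parse capture times out of the file names and sort by time."""
    stems = [p.rsplit("/", 1)[-1].rsplit(".", 1)[0] for p in paths]
    times = pd.to_datetime(stems, format="%Y%m%d_%H%M%S").astype("datetime64[ns]")
    out = pd.DataFrame({"file": paths, "img_time": times})
    return out.sort_values("img_time").reset_index(drop=True)

def attach_nearest_image(df, images, time_col, tolerance="30s"):
    """Attach the image closest to time_col; rows with no image in range get NaN."""
    left = df.copy()
    left[time_col] = pd.to_datetime(left[time_col]).astype("datetime64[ns]")
    left = left.sort_values(time_col)  # merge_asof requires sorted keys
    return pd.merge_asof(left, images, left_on=time_col, right_on="img_time",
                         direction="nearest", tolerance=pd.Timedelta(tolerance))

paths = ["2014/01/02/20140102_152808.jpg", "2014/01/02/20140102_152907.jpg"]
rows = pd.DataFrame({"5_min_before": pd.to_datetime(
    ["2014-01-02 15:28:00", "2014-01-02 15:29:00", "2014-01-02 15:40:00"])})
matched = attach_nearest_image(rows, index_image_files(paths), "5_min_before")
print(matched[["5_min_before", "file"]])
```

A nearest match with a tolerance also covers images stamped a second or two before the minute (for example `..._145759.jpg` for a 14:58 target), which a strict same-minute lookup would miss.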

Keep only the instances for which we have an image file:

In [None]:
df_5_min_ahead_w_img = df_5_min_ahead[(~df_5_min_ahead.file.isnull()) & (df_5_min_ahead.file != 0)]
df_10_min_ahead_w_img = df_10_min_ahead[(~df_10_min_ahead.file.isnull()) & (df_10_min_ahead.file != 0)]
df_15_min_ahead_w_img = df_15_min_ahead[(~df_15_min_ahead.file.isnull()) & (df_15_min_ahead.file != 0)]
df_20_min_ahead_w_img = df_20_min_ahead[(~df_20_min_ahead.file.isnull()) & (df_20_min_ahead.file != 0)]
df_25_min_ahead_w_img = df_25_min_ahead[(~df_25_min_ahead.file.isnull()) & (df_25_min_ahead.file != 0)]
df_30_min_ahead_w_img = df_30_min_ahead[(~df_30_min_ahead.file.isnull()) & (df_30_min_ahead.file != 0)]

In [3]:
# save_pickle("df_5_min_ahead_data.pkl",df_5_min_ahead_w_img)
# save_pickle("df_10_min_ahead_data.pkl",df_10_min_ahead_w_img)
# save_pickle("df_15_min_ahead_data.pkl",df_15_min_ahead_w_img)
# save_pickle("df_20_min_ahead_data.pkl",df_20_min_ahead_w_img)
# save_pickle("df_25_min_ahead_data.pkl",df_25_min_ahead_w_img)
# save_pickle("df_30_min_ahead_data.pkl",df_30_min_ahead_w_img)
df_5_min_ahead_w_img = open_pickle("df_5_min_ahead_data.pkl")
df_10_min_ahead_w_img = open_pickle("df_10_min_ahead_data.pkl")
df_15_min_ahead_w_img = open_pickle("df_15_min_ahead_data.pkl")
df_20_min_ahead_w_img = open_pickle("df_20_min_ahead_data.pkl")
df_25_min_ahead_w_img = open_pickle("df_25_min_ahead_data.pkl")
df_30_min_ahead_w_img = open_pickle("df_30_min_ahead_data.pkl")

In [4]:
df_5_min_ahead_w_img = df_5_min_ahead_w_img[['timestamp', '5_min_before', 'ghi', 'air_temp', 
                                             'relhum','press', 'windsp', 'winddir', 
                                             'max_windsp', 'precipitation', 'file']]
df_10_min_ahead_w_img = df_10_min_ahead_w_img[['timestamp', '10_min_before', 'ghi','air_temp', 
                                               'relhum','press', 'windsp', 'winddir', 
                                               'max_windsp', 'precipitation', 'file']]
df_15_min_ahead_w_img = df_15_min_ahead_w_img[['timestamp', '15_min_before', 'ghi', 
                                               'air_temp', 'relhum','press', 'windsp', 'winddir', 
                                               'max_windsp', 'precipitation', 'file']]
df_20_min_ahead_w_img = df_20_min_ahead_w_img[['timestamp', '20_min_before', 'ghi', 
                                               'air_temp', 'relhum','press', 'windsp', 'winddir', 
                                               'max_windsp', 'precipitation', 'file']]
df_25_min_ahead_w_img = df_25_min_ahead_w_img[['timestamp', '25_min_before', 'ghi', 
                                               'air_temp', 'relhum','press', 'windsp', 'winddir', 
                                               'max_windsp', 'precipitation', 'file']]
df_30_min_ahead_w_img = df_30_min_ahead_w_img[['timestamp', '30_min_before', 'ghi', 
                                               'air_temp', 'relhum','press', 'windsp', 'winddir', 
                                               'max_windsp', 'precipitation', 'file']]

Update datetime format on all DFs before we join to get irradiance from earlier timestamps:

In [6]:
for df in [df_5_min_ahead_w_img,df_10_min_ahead_w_img,df_15_min_ahead_w_img,
           df_20_min_ahead_w_img,df_25_min_ahead_w_img,df_30_min_ahead_w_img]:
    
    df['timestamp'] = pd.to_datetime(df['timestamp'])

Now join the tables on their look-back columns. This gives us the irradiance at the earlier time interval.

In [12]:
df_5_min = df_5_min_ahead_w_img.merge(fol_irr,how="left", left_on="5_min_before", right_on="timestamp")
df_10_min = df_10_min_ahead_w_img.merge(fol_irr,how="left", left_on="10_min_before", right_on="timestamp")
df_15_min = df_15_min_ahead_w_img.merge(fol_irr,how="left", left_on="15_min_before", right_on="timestamp")
df_20_min = df_20_min_ahead_w_img.merge(fol_irr,how="left", left_on="20_min_before", right_on="timestamp")
df_25_min = df_25_min_ahead_w_img.merge(fol_irr,how="left", left_on="25_min_before", right_on="timestamp")
df_30_min = df_30_min_ahead_w_img.merge(fol_irr,how="left", left_on="30_min_before", right_on="timestamp")

In [17]:
# df_20_min.head(21)

Use the helper function to keep only the columns the model needs:

In [18]:
df_5_min = update_df_for_model(df_5_min,"5_min_before")
df_10_min = update_df_for_model(df_10_min,"10_min_before")
df_15_min = update_df_for_model(df_15_min,"15_min_before")
df_20_min = update_df_for_model(df_20_min,"20_min_before")
df_25_min = update_df_for_model(df_25_min,"25_min_before")
df_30_min = update_df_for_model(df_30_min,"30_min_before")
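
`update_df_for_model` is not defined in this notebook. From the output further down, the target irradiance ends up in `Y` and the earlier irradiance in `<lag>_i`. The sketch below shows one way that tidy-up could work on a simplified merge where the right-hand frame has only `timestamp` and `ghi`; it is an assumption, and the real function may handle more columns.

```python
import pandas as pd

def update_df_for_model(df, lag_col):
    """Rename the merged irradiance columns and move the target to the front."""
    out = df.rename(columns={
        "timestamp_x": "timestamp",   # time of the target reading
        "ghi_x": "Y",                 # target irradiance
        "ghi_y": f"{lag_col}_i",      # irradiance at the look-back time
    }).drop(columns=["timestamp_y"])
    first = ["Y", "timestamp", lag_col]
    return out[first + [c for c in out.columns if c not in first]]

ahead = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-02 15:33:00"]),
    "5_min_before": pd.to_datetime(["2014-01-02 15:28:00"]),
    "ghi": [2.52],
    "air_temp": [2.8],
})
irr = pd.DataFrame({"timestamp": pd.to_datetime(["2014-01-02 15:28:00"]), "ghi": [4.87]})

merged_irr = ahead.merge(irr, how="left", left_on="5_min_before", right_on="timestamp")
model_df = update_df_for_model(merged_irr, "5_min_before")
print(model_df)
```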

In [370]:
# save_pickle("../data_rp/df_5_min_data.pkl",df_5_min)
# save_pickle("../data_rp/df_10_min_data.pkl",df_10_min)
# save_pickle("../data_rp/df_15_min_data.pkl",df_15_min)
# save_pickle("../data_rp/df_20_min_data.pkl",df_20_min)
# save_pickle("../data_rp/df_25_min_data.pkl",df_25_min)
# save_pickle("../data_rp/df_30_min_data.pkl",df_30_min)
df_5_min = open_pickle("../data_rp/df_5_min_data.pkl")
df_10_min = open_pickle("../data_rp/df_10_min_data.pkl")
df_15_min = open_pickle("../data_rp/df_15_min_data.pkl")
df_20_min = open_pickle("../data_rp/df_20_min_data.pkl")
df_25_min = open_pickle("../data_rp/df_25_min_data.pkl")
df_30_min = open_pickle("../data_rp/df_30_min_data.pkl")

In [371]:
df_5_min.head()

Unnamed: 0,Y,timestamp,5_min_before,air_temp,relhum,press,windsp,winddir,max_windsp,precipitation,file,5_min_before_i
0,2.52,2014-01-02 15:33:00,2014-01-02 15:28:00,2.8,75.06,1010.0,2.0,199.6,2.6,0.0,2014/01/02/20140102_152808.jpg,4.87
1,3.17,2014-01-02 15:34:00,2014-01-02 15:29:00,2.7,75.5,1010.0,1.74,190.4,2.4,0.0,2014/01/02/20140102_152907.jpg,5.59
2,3.9,2014-01-02 15:35:00,2014-01-02 15:30:00,2.7,75.54,1010.0,1.78,193.6,2.3,0.0,2014/01/02/20140102_153008.jpg,1.23
3,4.64,2014-01-02 15:36:00,2014-01-02 15:31:00,2.7,74.98,1010.0,1.72,192.2,2.1,0.0,2014/01/02/20140102_153108.jpg,1.62
4,5.36,2014-01-02 15:37:00,2014-01-02 15:32:00,2.62,74.76,1010.0,1.66,188.2,2.4,0.0,2014/01/02/20140102_153208.jpg,2.04


In [374]:
for df in [df_5_min,df_10_min,df_15_min,df_20_min,df_25_min,df_30_min]:
    breakdown_dates(df,'timestamp')
    df['season'] = df.month.apply(lambda row: replace_m_w_season(row))
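
`replace_m_w_season` is defined elsewhere. The sampled rows shown later (month 1 and 2 map to 1, month 3 to 2, month 7 to 3, months 10 and 11 to 4) are consistent with meteorological seasons, so a possible sketch is given below. The codes for the remaining months, December in particular, are an assumption.

```python
def replace_m_w_season(month):
    """Map month 1-12 to 1 (winter), 2 (spring), 3 (summer), 4 (autumn).

    Assumes meteorological seasons, with December counted as winter.
    """
    return (month % 12) // 3 + 1

print({m: replace_m_w_season(m) for m in range(1, 13)})
```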

In [375]:
df_5_min.columns

Index(['Y', 'timestamp', '5_min_before', 'air_temp', 'relhum', 'press',
       'windsp', 'winddir', 'max_windsp', 'precipitation', 'file',
       '5_min_before_i', 'year', 'month', 'day', 'hour', 'minute', 'season'],
      dtype='object')

In [56]:
df_5_min.columns

Index(['Y', 'timestamp', '5_min_before', 'air_temp', 'relhum', 'press',
       'windsp', 'winddir', 'max_windsp', 'precipitation', 'file',
       '5_min_before_i', 'year', 'month', 'day', 'hour', 'min', 'sec',
       'season'],
      dtype='object')

In [301]:
# t = 5
# for df in [df_5_min,df_10_min,df_15_min,df_20_min,df_25_min,df_30_min]:
#     df.drop(['day','min','sec'], axis=1,inplace=True)
#     df.drop(['month'], axis=1,inplace=True)

#     col = f'{t}_min_before'
#     col_1 = f'{t}_min_before_i'
# #     print(t,col,col_1)
#     df = df[['Y', 'timestamp', col, 'air_temp', 'relhum', 'press',
#        'windsp', 'winddir', 'max_windsp', 'precipitation', 'file',
#        col_1, 'year', 'month', 'hour','season']]
#     t += 5

In [377]:
sampled_dfs = {}
m = 5
for df in [df_5_min,df_10_min,df_15_min,df_20_min,df_25_min,df_30_min]:
    name = f"df_{m}_min"
    col = f"{m}_min_before"
#     df.drop(['day','min','sec'], axis=1,inplace=True)
    d = df.sample(n=10000,random_state=42)
#     breakdown_dates(d,'timestamp')
    d.drop(['timestamp',col], axis=1,inplace=True)
    sampled_dfs[name] = d
    m += 5

In [378]:
sampled_dfs['df_5_min'].columns

Index(['Y', 'air_temp', 'relhum', 'press', 'windsp', 'winddir', 'max_windsp',
       'precipitation', 'file', '5_min_before_i', 'year', 'month', 'day',
       'hour', 'minute', 'season'],
      dtype='object')

In [784]:
sampled_dfs['df_5_min'].head()

Unnamed: 0,Y,air_temp,relhum,press,windsp,winddir,max_windsp,precipitation,file,5_min_before_i,year,month,day,hour,minute,season
212504,16.37,11.1,74.34,1004.0,1.46,55.0,2.1,0.0,2014/11/09/20141109_145759.jpg,12.83,2014,11,9,15,3,4
730167,147.3,17.1,51.86,1004.0,0.3,128.9,0.5,0.0,2016/10/31/20161031_224459.jpg,163.5,2016,10,31,22,50,4
650333,529.2,19.54,51.86,1000.0,1.46,144.2,3.2,0.0,2016/07/17/20160717_155708.jpg,514.1,2016,7,17,16,2,3
540543,557.4,22.6,29.7,1007.0,1.66,175.0,2.8,0.0,2016/03/01/20160301_223608.jpg,568.5,2016,3,1,22,41,2
272258,318.5,14.42,73.84,1008.0,1.5,300.2,2.0,0.0,2015/02/19/20150219_220511.jpg,262.7,2015,2,19,22,10,1


### Data Exploration:

In [71]:
previews = []
described = []
for df in [df_5_min,df_10_min,df_15_min,df_20_min,df_25_min,df_30_min]:
    previews.append(preview_df(df))
    described.append(df.describe())

In [73]:
described[5]

Unnamed: 0,Y,air_temp,relhum,press,windsp,winddir,max_windsp,precipitation,30_min_before_i,year,month,hour,season
count,764846.0,764846.0,764846.0,764846.0,764846.0,764846.0,764846.0,764846.0,764846.0,764846.0,764846.0,764846.0,764846.0
mean,412.049214,21.337232,44.304767,1003.443749,1.567333,216.414405,2.51834,0.002666,412.803596,2015.023526,6.50635,14.907275,2.57509
std,296.527848,8.384509,21.283695,4.969679,0.890709,76.316673,1.312563,0.035558,295.533668,0.810692,3.215236,7.712866,1.052862
min,0.0,-2.9,4.86,983.0,0.0,0.0,0.0,0.0,0.0,2014.0,1.0,0.0,1.0
25%,142.2,14.7,26.94,1000.0,0.94,153.4,1.6,0.0,142.3,2014.0,4.0,14.0,2.0
50%,381.5,20.58,40.48,1003.0,1.42,228.5,2.3,0.0,381.5,2015.0,7.0,17.0,3.0
75%,658.3,27.84,59.64,1007.0,2.0,280.7,3.2,0.0,658.3,2016.0,9.0,20.0,3.0
max,1466.0,42.78,94.0,1021.0,9.3,360.0,13.5,4.77,1466.0,2016.0,12.0,23.0,4.0


In [783]:
df_5_min[df_5_min.columns[1:]].head()

Unnamed: 0,timestamp,5_min_before,air_temp,relhum,press,windsp,winddir,max_windsp,precipitation,file,5_min_before_i,year,month,day,hour,minute,season
0,2014-01-02 15:33:00,2014-01-02 15:28:00,2.8,75.06,1010.0,2.0,199.6,2.6,0.0,2014/01/02/20140102_152808.jpg,4.87,2014,1,2,15,33,1
1,2014-01-02 15:34:00,2014-01-02 15:29:00,2.7,75.5,1010.0,1.74,190.4,2.4,0.0,2014/01/02/20140102_152907.jpg,5.59,2014,1,2,15,34,1
2,2014-01-02 15:35:00,2014-01-02 15:30:00,2.7,75.54,1010.0,1.78,193.6,2.3,0.0,2014/01/02/20140102_153008.jpg,1.23,2014,1,2,15,35,1
3,2014-01-02 15:36:00,2014-01-02 15:31:00,2.7,74.98,1010.0,1.72,192.2,2.1,0.0,2014/01/02/20140102_153108.jpg,1.62,2014,1,2,15,36,1
4,2014-01-02 15:37:00,2014-01-02 15:32:00,2.62,74.76,1010.0,1.66,188.2,2.4,0.0,2014/01/02/20140102_153208.jpg,2.04,2014,1,2,15,37,1


In [106]:
cols = df_5_min.columns[-4:-2].to_list()
cols.append(df_5_min.columns[-1])
print(cols)

['year', 'month', 'season']


In [108]:
df_5_min.columns

Index(['Y', 'timestamp', '5_min_before', 'air_temp', 'relhum', 'press',
       'windsp', 'winddir', 'max_windsp', 'precipitation', 'file',
       '5_min_before_i', 'year', 'month', 'hour', 'season'],
      dtype='object')

### Pre Processing and Scaling

First, let's save this data for future tests with different transformations:

In [381]:
save_pickle('../data_rp/sampled_raw_data.pkl',sampled_dfs)
# print (math.sin(math.pi/2))

Train-test split, following the usual 80/20 proportions (Pareto principle):

In [394]:
train_test_data = {}
for df_name, df in sampled_dfs.items():
    train_test_data[df_name] = {}
    # the first column is the target Y; 'file' is stored separately, not used as a feature
    x_cols = df.columns[1:].to_list()
    train_test_data[df_name]['file'] = df['file']
    x_cols.remove('file')
    X_train, X_test, y_train, y_test = train_test_split(df[x_cols], df[df.columns[0]], 
                                                        test_size=0.20, random_state=42)
    train_test_data[df_name]['X_train'] = X_train
    train_test_data[df_name]['X_test'] = X_test
    train_test_data[df_name]['y_train'] = y_train
    train_test_data[df_name]['y_test'] = y_test
    

Below we pass our train-test data through the processing function to one-hot encode categorical features, min-max scale continuous features, and transform all time data with the sine function. All non-time data will lie in [0,1], and time data in [-1,1]. This transformation and scaling should help our model converge more quickly.

In [399]:
t = 5
for _ in range(0,6):
    df_name = f'df_{t}_min'
    train = train_test_data[df_name]['X_train']
    test = train_test_data[df_name]['X_test']
    x_train,x_test = process_timeahead_attributes(t,train,test,time_to_period)
    train_test_data[df_name]['X_train_p'],train_test_data[df_name]['X_test_p'] = x_train,x_test
    print(df_name)
    t+=5
#     process_timeahead_attributes(con_cols,cat_cols,train,test)

df_5_min
df_10_min
df_15_min
df_20_min
df_25_min
df_30_min
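
`process_timeahead_attributes` and `time_to_period` live outside this notebook. The three transformations described above can be sketched as follows; `preprocess_sketch`, the `PERIODS` mapping, and the choice of columns are all hypothetical. The scaler is fitted on the training split only, so test values can fall slightly outside [0,1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# assumed cycle lengths for the time columns
PERIODS = {"month": 12, "day": 31, "hour": 24, "minute": 60}

def preprocess_sketch(train, test, con_cols, cat_cols, periods=PERIODS):
    scaler = MinMaxScaler().fit(train[con_cols])  # fit on train only
    parts = []
    for df in (train, test):
        con = pd.DataFrame(scaler.transform(df[con_cols]),
                           columns=con_cols, index=df.index)
        # sine transform: each time value becomes a point in [-1, 1]
        time = pd.DataFrame({c: np.sin(2 * np.pi * df[c] / p)
                             for c, p in periods.items() if c in df.columns},
                            index=df.index)
        cat = pd.get_dummies(df[cat_cols].astype(str)).astype(float)
        parts.append(pd.concat([con, time, cat], axis=1))
    train_p, test_p = parts
    # categories missing from the test split become all-zero columns
    test_p = test_p.reindex(columns=train_p.columns, fill_value=0.0)
    return train_p, test_p

train_demo = pd.DataFrame({"air_temp": [0.0, 10.0, 20.0], "hour": [0, 6, 12], "season": [1, 2, 3]})
test_demo = pd.DataFrame({"air_temp": [5.0], "hour": [18], "season": [2]})
train_p, test_p = preprocess_sketch(train_demo, test_demo, ["air_temp"], ["season"])
print(train_p)
print(test_p)
```

Note that a sine alone maps some distinct times to the same value (hour 0 and hour 12 both give 0 here); pairing it with a cosine column is a common way to remove that ambiguity.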


In [406]:
save_pickle('../data_rp/model_data_dict.pkl',train_test_data)

In [None]:
# df_5_min.describe()
# df_10_min.describe()
# df_15_min.describe()
# df_20_min.describe()
# df_25_min.describe()
# df_30_min.describe()

# sns.distplot(df_5_min['5_min_before_i']);
# sns.distplot(df_5_min['Y']);
# sns.distplot(df_5_min['air_temp']);
# g = sns.PairGrid(df_5_min, height=3.5)
# g.map(sns.scatterplot)

# corr = df_5_min.corr()
# plt.figure(figsize=(30,20))
# mask = np.zeros_like(corr)
# mask[np.triu_indices_from(mask)] = True
# with sns.axes_style("white"):
#     ax = sns.heatmap(corr,mask=mask,center=0,cmap="coolwarm",annot=True,linewidths=.5)

# corr = df_30_min.corr()
# plt.figure(figsize=(30,20))
# mask = np.zeros_like(corr)
# mask[np.triu_indices_from(mask)] = True
# with sns.axes_style("white"):
#     ax = sns.heatmap(corr,mask=mask,center=0,cmap="coolwarm",annot=True,linewidths=.5)