<a href="https://colab.research.google.com/github/dd-open-source/ml-projects/blob/main/shell-ai-hackathon-weather-data/Level1/ShellAI_Hackathon_Level_1_2021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We stored the training and testing datasets in Google Drive, so we need to mount the drive in order to access the files.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Python modules used for the hackathon.

In [None]:
import os
import datetime

import IPython
import IPython.display
from IPython.display import clear_output
from google.colab import files

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split


Read the train and test datasets into pandas DataFrames.


In [None]:
# Training csv file: https://drive.google.com/file/d/1C7NuAI1i48y6GYoP4C0_7F4exC0TNdSP/view
# Testing csv file: https://drive.google.com/file/d/1MKRijjGs35uVTYAl4YFLcVV9ODb_5X97/view
# Download above files to your drive and update paths accordingly
training_path = '/content/drive/MyDrive/ML_Projects/shell-ai-data/train.csv'
train_df = pd.read_csv(training_path)
testing_path = '/content/drive/MyDrive/ML_Projects/shell-ai-data/test.csv'
test_df = pd.read_csv(testing_path)

**perform_preprocessing** performs the following feature engineering steps:
* Converts wind speed and direction into a wind velocity vector (components along the x and y axes), after setting negative wind speeds to 0.
* Converts the azimuth angle from degrees to its sin and cos values, as they better capture its cyclic nature.
* Marks negative values in the target column (Total Cloud Cover) as missing, then fills missing values using forward fill (the ffill method in pandas).
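
To see why the sine/cosine encoding helps with cyclic quantities, consider two angles on either side of north. This is only an illustrative sketch: the helper name `encode_angle` and the sample angles are made up and are not taken from the dataset.

```python
import numpy as np

def encode_angle(deg):
    """Map an angle in degrees to the point (sin, cos) on the unit circle."""
    rad = np.deg2rad(deg)
    return np.array([np.sin(rad), np.cos(rad)])

# 359 and 1 degrees are only 2 degrees apart, but their raw values differ by 358.
raw_gap = abs(359 - 1)
encoded_gap = np.linalg.norm(encode_angle(359) - encode_angle(1))

print(raw_gap)                # 358
print(round(encoded_gap, 3))  # about 0.035
```

With the raw angle, a model sees a large jump at the wrap-around point; with the encoded pair, neighbouring directions stay close together.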

In [None]:
def perform_preprocessing(df):
  peak_wind_speed_col = 'Peak Wind Speed @ 6ft [m/s]'
  avg_wind_dir_col = 'Avg Wind Direction @ 6ft [deg from N]'
  azimuth_angle_col = 'Azimuth Angle [degrees]'

  wind_speed = df.pop(peak_wind_speed_col)
  bad_wv = wind_speed < 0
  wind_speed[bad_wv] = 0.0
  wind_angle = df.pop(avg_wind_dir_col)*np.pi / 180

  df['wind_x'] = wind_speed*np.cos(wind_angle)
  df['wind_y'] = wind_speed*np.sin(wind_angle)

  azimuth_angle = df.pop(azimuth_angle_col) * np.pi/180
  df['azimuth_sin'] = np.sin(azimuth_angle)
  df['azimuth_cos'] = np.cos(azimuth_angle)

  total_cloud_cover = 'Total Cloud Cover [%]'
  df.loc[df[total_cloud_cover]<0, total_cloud_cover] = np.nan
  # df.ffill() is the current spelling; fillna(method='ffill') is deprecated in recent pandas
  df = df.ffill()

  return df

**prepare_timeseries_records** takes a time series dataset and converts it into a multi-step regression problem by placing consecutive time steps side by side in a single row.

Example: with `sampling_rate=1`, `stride=1` and `steps=2`, this function converts the following dataframe (the Time column is shown for reference only and is not part of the input)

Time | Feature1 | Feature2
-----|----------|--------
T1   | 1.1 | 2.1 
T2   | 1.2 | 2.2
T3   | 1.3 | 2.3 

to

Feature1_1 | Feature2_1 | Feature1_2 | Feature2_2 
-----------|------------|------------|-----------
1.1        | 2.1        | 1.2        | 2.2 
1.2        | 2.2        | 1.3        | 2.3 
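
The windowing can be tried on a toy frame. The sketch below restates the slicing logic under the hypothetical name `make_windows` so that it runs on its own; the toy values are illustrative.

```python
import pandas as pd

def make_windows(df, sampling_rate, stride, steps, start_offset=0, end_offset=None):
    """Place `steps` shifted copies of df side by side, one copy per time step."""
    parts = []
    for i in range(steps):
        # Copy i starts i * sampling_rate rows later than copy 0
        part = df[i * sampling_rate + start_offset:end_offset:stride].reset_index(drop=True)
        part.columns = [f"{c}_{i + 1}" for c in part.columns]
        parts.append(part)
    # Rows without a full set of steps contain NaN and are dropped
    return pd.concat(parts, axis=1).dropna()

toy = pd.DataFrame({"Feature1": [1.1, 1.2, 1.3], "Feature2": [2.1, 2.2, 2.3]})
windows = make_windows(toy, sampling_rate=1, stride=1, steps=2)
print(windows)
```

Each output row holds all features of step 1 followed by all features of step 2, so the column order is `Feature1_1, Feature2_1, Feature1_2, Feature2_2`.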


In [None]:
def prepare_timeseries_records(df, sampling_rate, stride, steps, start_offset=0, end_offset=None):
  timeseries_dfs = [[]]*steps
  for i in range(steps):
    timeseries_dfs[i] = df[i*sampling_rate+start_offset:end_offset:stride].reset_index(drop=True)
    timeseries_dfs[i].columns = ["{}_{}".format(c, i+1) for c in timeseries_dfs[i].columns]
  return pd.concat(timeseries_dfs, axis=1).dropna()

**prepare_dfs** groups the dataset by date/scenario and creates the train and test datasets using the functions above.
* We group the dataset by date so that windows never combine data across different dates, since in the test set we must predict using only data from within a single day.
* We also ignore rows with missing (negative) target values at the start and end of each day.
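
A minimal sketch of the per-day trimming, using a made-up two-day frame with hypothetical column names `date` and `cloud`. Note one deliberate difference: the sketch adds 1 to the end bound so that the last valid row is kept, whereas the notebook slices `min_valid:max_valid`, which stops one row earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["01/01"] * 5 + ["01/02"] * 5,
    # Negative readings stand for missing target values
    "cloud": [-1, 10, 20, 30, -1, -1, -1, 40, 50, 60],
})

trimmed = {}
for date, day_df in toy.groupby("date"):
    day_df = day_df.reset_index(drop=True)
    valid = day_df.index[day_df["cloud"] >= 0]
    # iloc's end bound is exclusive, so + 1 keeps the last valid row
    trimmed[date] = day_df.iloc[valid.min():valid.max() + 1].reset_index(drop=True)

for date, df in trimmed.items():
    print(date, df["cloud"].tolist())
```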


In [None]:
def prepare_dfs(full_df, groupby, columns, sampling_rate=10, stride=5, input_steps=12):
  x_train_dfs = []
  y_train_dfs = []
  for _, day_df in full_df.groupby(groupby):
    day_df = day_df[columns].reset_index(drop=True)
    valid_indices = day_df.index[day_df['Total Cloud Cover [%]'] >= 0]
    min_valid = min(valid_indices)
    max_valid = max(valid_indices)  

    df = day_df.iloc[min_valid:max_valid, :].reset_index(drop=True)
    df = perform_preprocessing(df)
    x_train_dfs.append(prepare_timeseries_records(df, sampling_rate, stride, input_steps, end_offset=-120))
    y_train_dfs.append(prepare_timeseries_records(df[['Total Cloud Cover [%]']], sampling_rate, stride, 120//sampling_rate, start_offset=input_steps*sampling_rate))
  
  return pd.concat(x_train_dfs, axis=0).reset_index(drop=True), pd.concat(y_train_dfs, axis=0).reset_index(drop=True).iloc[:, ::(30//sampling_rate)].reset_index(drop=True)


**prepare_df_for_predictions** builds the test dataset used to generate predictions for submission to the evaluation portal. It must be called with the same parameters as were used for the training set, so that both datasets have the same set of features.
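
The `start_offset` used for prediction can be worked out by hand. The sketch below assumes each scenario's last row has index 360, which is inferred from the constant in the function rather than stated in the notebook; the parameter values are the ones used later for training.

```python
sampling_rate = 5
input_steps = 24
last_row = 360  # assumed index of the final row in a scenario

# Choose the start so that the final input step lands exactly on the last row
start_offset = last_row - sampling_rate * (input_steps - 1)
rows_used = [start_offset + i * sampling_rate for i in range(input_steps)]

print(start_offset)                 # 245
print(rows_used[0], rows_used[-1])  # 245 360
print(len(rows_used))               # 24
```

Under that assumption each scenario contributes exactly one input row, built from the most recent 24 samples spaced 5 rows apart.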

In [None]:
def prepare_df_for_predictions(full_df, groupby, columns, sampling_rate=10, input_steps=12):
  x_train_dfs = []
  for _, day_df in full_df.groupby(groupby):
    df = day_df[columns].reset_index(drop=True)
    df = perform_preprocessing(df)
    x_train_dfs.append(prepare_timeseries_records(df, sampling_rate, 1, input_steps, start_offset=(360-sampling_rate*(input_steps-1))))
  
  return pd.concat(x_train_dfs, axis=0).reset_index(drop=True)

We construct the training set from both **train_df** and **test_df**: because we use 2 hours of data to predict the next 2 hours, the test scenarios also contain usable input/target windows.

We reserve a small portion (20%) of the time series dataset created from test_df as a held-out set, as it is the closest representation of the final dataset used for predictions.
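
To make the window layout concrete, the row offsets of the inputs and of the kept target columns can be computed directly. This is a sketch that mirrors the constants in `prepare_dfs` and assumes one row per minute, which the notebook does not state explicitly.

```python
SAMPLING_RATE = 5    # rows between consecutive steps
INPUT_STEPS = 24     # input steps per window
HORIZON_ROWS = 120   # rows reserved for targets (hard-coded in prepare_dfs)

# Row offsets relative to the first row of a window
input_offsets = [i * SAMPLING_RATE for i in range(INPUT_STEPS)]
target_offsets = [INPUT_STEPS * SAMPLING_RATE + j * SAMPLING_RATE
                  for j in range(HORIZON_ROWS // SAMPLING_RATE)]
# Mirrors .iloc[:, ::(30 // sampling_rate)] on the target frame
kept_targets = target_offsets[::30 // SAMPLING_RATE]

last_input = input_offsets[-1]
print(last_input)                              # 115
print(kept_targets)                            # [120, 150, 180, 210]
print([t - last_input for t in kept_targets])  # [5, 35, 65, 95]
```

Under this assumption the kept targets sit 5, 35, 65 and 95 rows after the last input row. If the intent is 30, 60, 90 and 120 minutes ahead, starting the column slice at position 5 (`iloc[:, 5::6]` for these settings) would be one way to get there; that is a suggestion, not something the notebook does.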

In [None]:
SAMPLING_RATE=5
INPUT_STEPS=24
columns = ['Global CMP22 (vent/cor) [W/m^2]',
       'Direct sNIP [W/m^2]', 'Azimuth Angle [degrees]',
       'Tower Dry Bulb Temp [deg C]', 'Tower Wet Bulb Temp [deg C]',
       'Tower Dew Point Temp [deg C]', 'Tower RH [%]', 'Total Cloud Cover [%]',
       'Peak Wind Speed @ 6ft [m/s]', 'Avg Wind Direction @ 6ft [deg from N]',
       'Station Pressure [mBar]', 'Precipitation (Accumulated) [mm]',
       'Snow Depth [cm]', 'Moisture', 'Albedo (CMP11)']
X,Y = prepare_dfs(train_df, 'DATE (MM/DD)', columns, sampling_rate=SAMPLING_RATE,stride=1, input_steps=INPUT_STEPS)
X_test_train, Y_test_train = prepare_dfs(test_df, 'scenario_set', columns, sampling_rate=SAMPLING_RATE,stride=1, input_steps=INPUT_STEPS)
X_train1, X_test, Y_train1, Y_test = train_test_split(X_test_train, Y_test_train, test_size=0.2) 
X_pred = prepare_df_for_predictions(test_df, 'scenario_set', columns, sampling_rate=SAMPLING_RATE, input_steps=INPUT_STEPS)
X_full = pd.concat([X, X_train1], axis=0).reset_index(drop=True) 
Y_full = pd.concat([Y, Y_train1], axis=0).reset_index(drop=True) 
print(X_full.shape, Y_full.shape, X_test.shape, Y_test.shape, X_pred.shape)

(195678, 384) (195678, 4) (7415, 384) (7415, 4) (300, 384)


**Scaling**
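
The scaling cell in this section standardizes every feature with the mean and standard deviation of the training set only, and reuses those statistics for the held-out and prediction sets. The toy sketch below shows the same idea; the zero-variance guard is an addition of this sketch (not in the notebook), included because a constant column would otherwise produce NaN values.

```python
import pandas as pd

train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
test = pd.DataFrame({"a": [4.0], "b": [5.0]})

mean = train.mean()
# Guard (sketch-only assumption): replace a zero std so constant columns map to 0
std = train.std().replace(0.0, 1.0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse training statistics, never refit on test data

print(test_scaled)
```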

In [None]:
train_mean = X_full.mean()
train_std = X_full.std()

X_train = (X_full - train_mean) / train_std
X_test = (X_test - train_mean) / train_std
Y_train = Y_full
X_pred = (X_pred - train_mean) / train_std
train_indexs = list(range(X_train.shape[0]))
np.random.shuffle(train_indexs)
X_train = X_train.iloc[train_indexs]
Y_train = Y_train.iloc[train_indexs]

**Model Training**

In [None]:
models_location = '/content/drive/MyDrive/ML_Projects/models/ShellAI/'
MAX_EPOCHS = 5

def compile_and_fit(model, x_train, y_train, model_path, epochs=5, patience=3):

  early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                    patience=patience,
                                                    mode='min')

  model.compile(loss='mean_absolute_error',
                optimizer='adam')
  
  model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=model_path,
    monitor='val_loss',
    mode='min',
    save_weights_only=True,
    save_best_only=True)

  history = model.fit(x=x_train.values, y=y_train.values, epochs=epochs,
                      validation_split=0.1,
                      callbacks=[early_stopping, model_checkpoint_callback])
  return history



We used a combination of an LSTM (long short-term memory) layer and dense layers for the model.

In [None]:
def create_model():
    return tf.keras.Sequential([
      tf.keras.layers.Reshape((INPUT_STEPS,-1), input_shape=(16*INPUT_STEPS,)),
      # LSTM expects input shape (batch size, time steps, features)
      tf.keras.layers.LSTM(32, return_sequences=False),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(512, activation='relu'),
      tf.keras.layers.Dense(4) # Output layer
   ])
    
model = create_model()
print(model.summary())

model_name = "third"
model_path = os.path.join(models_location, model_name, "checkpoint")
history = compile_and_fit(model, X_train, Y_train, model_path, epochs=100, patience=3)
clear_output()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
reshape_1 (Reshape)          (None, 24, 16)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                6272      
_________________________________________________________________
dense_32 (Dense)             (None, 512)               16896     
_________________________________________________________________
dense_33 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_34 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_35 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_36 (Dense)             (None, 512)              
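
The parameter counts in the summary can be reproduced with the standard formulas for LSTM and dense layers. The helper names `lstm_params` and `dense_params` are hypothetical and exist only for this check.

```python
def lstm_params(units, features):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (features + units) + units)

def dense_params(inputs, units):
    # One weight per input/unit pair plus one bias per unit
    return inputs * units + units

# 15 selected columns, 3 popped during preprocessing, 4 derived
# (wind_x, wind_y, azimuth_sin, azimuth_cos)
n_features = 15 - 3 + 4

print(n_features)              # 16
print(lstm_params(32, 16))     # 6272
print(dense_params(32, 512))   # 16896
print(dense_params(512, 512))  # 262656
print(dense_params(512, 4))    # 2052
```

The feature count of 16 also explains the `input_shape=(16*INPUT_STEPS,)` passed to the Reshape layer and the 384 input columns (24 steps of 16 features).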

**Prediction**
* Reloads the best checkpoint, reports the loss on the held-out split, predicts the target values and generates the final submission file.
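
The shape of the submission file can be sketched without the model. The `predictions` array below is a made-up stand-in for the output of `predict`, and the sketch assumes that scenarios are numbered 1 to N in the same order as the prediction rows.

```python
import numpy as np
import pandas as pd

# Stand-in for model output: one row per scenario, one column per horizon
predictions = np.array([[10.0, 12.0, 15.0, 20.0],
                        [50.0, 48.0, 47.0, 45.0]])

horizons = ['30_min_horizon', '60_min_horizon', '90_min_horizon', '120_min_horizon']
pred_df = pd.DataFrame(predictions, columns=horizons)
# Assumption: scenario ids run from 1 to N in row order
pred_df.index = pd.RangeIndex(1, len(pred_df) + 1, name='scenario_set')

print(pred_df.to_csv())
```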

In [None]:
min_loss_index = np.argmin(history.history['val_loss'])
print("train_loss=", history.history['loss'][min_loss_index],"val_loss=", history.history['val_loss'][min_loss_index])
best_model = create_model()
best_model.load_weights(tf.train.latest_checkpoint(os.path.join(models_location, model_name)))
best_model.compile(loss='mean_absolute_error', optimizer='adam')
print("test_loss=", best_model.evaluate(X_test, Y_test, verbose=0))
predictions = best_model.predict(X_pred.values)
pred_df = pd.DataFrame(predictions, columns=['30_min_horizon', '60_min_horizon', '90_min_horizon', '120_min_horizon'])
pred_df['scenario_set'] = [i+1 for i in range( pred_df.shape[0])]
pred_df.set_index('scenario_set', inplace=True)
pred_location = '/content/drive/MyDrive/ML_Projects/shell-ai-data/submission-{}.csv'.format(datetime.datetime.now().isoformat())
pred_df.to_csv(pred_location)
files.download(pred_location)

In [None]:
# last_best_test_loss=3.4990718364715576
# val_loss=3.408386468887329