## Run these cells only once during a session

Before running these cells, three important things should already be in the runtime / working directory:

1. `mlod.zip` - this zip file contains the entire code base
2. `requirements.txt` - the list of packages to install before doing anything
3. this notebook file

### Installing the `mlod` package

This cell extracts the `mlod` package from `mlod.zip` into the Colab working directory

In [None]:
from zipfile import ZipFile

with ZipFile('./mlod.zip', 'r') as zpf:
    zpf.extractall(path='./')

### Installing the packages
Run the cell below to install the packages used in the project

In [None]:
!pip install -r ./requirements.txt

# Challenge #1: AirQo Ugandan Air Quality Forecast Challenge

This notebook contains the reformatted code, properly set up for implementation.
Most of the code abstractions are written inside our package `mlod`, which should be included with this notebook.


## Init Steps

This section sets up the data from Zindi for use in the competition

### Setting up the data

Please upload the data to the `./data` path inside the workspace folder. Run the cell below repeatedly until there are no errors.
Make sure the uploaded files are the `Train.csv` and `Test.csv` used in the competition

In [None]:
from pathlib import Path

train_file_csv = Path('./data/Train.csv')
test_file_csv = Path('./data/Test.csv')

# check that the Train file exists
assert train_file_csv.exists(), 'Make sure the Train csv file exists at the path "%s"' % train_file_csv

# check that the Test file exists
assert test_file_csv.exists(), 'Make sure the Test csv file exists at the path "%s"' % test_file_csv
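
Because the cell above stops at the first missing file, it can be convenient to report every missing file in one pass. The following is a minimal sketch; `find_missing_files` is a hypothetical helper and is not part of the `mlod` package.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_files(paths: Iterable[Path]) -> List[Path]:
    """Return the paths that do not exist (hypothetical helper, not in `mlod`)."""
    return [Path(p) for p in paths if not Path(p).exists()]


# the two files the notebook expects
required = [Path('./data/Train.csv'), Path('./data/Test.csv')]
missing = find_missing_files(required)

if missing:
    print('Missing files:', ', '.join(str(p) for p in missing))
else:
    print('All data files found')
```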

## Actual sequence of processes

### Initiating different processes

Performing the steps that are needed before any form of training

In [None]:
import random
import numpy as np

# using our chosen seed number
from mlod import SEED_NUMBER as MLOD_SEED_NUMBER

# Setting the seed
random.seed(MLOD_SEED_NUMBER)
np.random.seed(MLOD_SEED_NUMBER)
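
As a quick check of why the seeds are set, the sketch below reseeds both generators and confirms that the draws repeat. The seed value `42` is only an illustration; the notebook itself uses `SEED_NUMBER` from `mlod`.

```python
import random

import numpy as np


def seeded_draws(seed: int):
    """Reseed both generators, then draw from each."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3)


# 42 is an arbitrary illustrative seed, not the value used by `mlod`
first = seeded_draws(42)
second = seeded_draws(42)

# the same seed gives the same draws from both generators
print(first[0] == second[0], np.array_equal(first[1], second[1]))
```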

### Load and preprocess data

In [None]:
import pandas as pd

## Fetching the data
train_df = pd.read_csv(train_file_csv)
test_df = pd.read_csv(test_file_csv)

TEST_IDS = test_df['ID']


### Preprocessing the data

[Low-level Preprocessing]<br />
Using `mlod.preprocessors.*`, the data is preprocessed in the following ways:
- Modifying the data so that each row holds atomic values, which makes the data **grow** in size
- Performing **special** feature engineering, which includes:
    - Performing cyclic representation of selected features
    - Using wind speed (`wind_spd`) and direction (`wind_dir`) to obtain
        Cartesian values for speed (`u` and `v`)
    - Adding past features within a certain window of the current row.<br />
        This technique helps the model relate sequences of data
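
The three feature engineering ideas listed above can be sketched on a toy frame as follows. This is not the `mlod` implementation: the function names, the `hour` column, the 24-hour period and the wind angle convention (degrees treated as a plain polar angle) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_cyclic(df: pd.DataFrame, col: str, period: float) -> pd.DataFrame:
    """Encode a periodic column as a sine/cosine pair."""
    angle = 2 * np.pi * df[col] / period
    df[f'{col}_sin'] = np.sin(angle)
    df[f'{col}_cos'] = np.cos(angle)
    return df


def add_wind_components(df: pd.DataFrame) -> pd.DataFrame:
    """Turn speed and direction into Cartesian components.

    Assumption: `wind_dir` is in degrees and is used as a plain polar angle.
    """
    theta = np.deg2rad(df['wind_dir'])
    df['u'] = df['wind_spd'] * np.cos(theta)
    df['v'] = df['wind_spd'] * np.sin(theta)
    return df


def add_lags(df: pd.DataFrame, col: str, window: int) -> pd.DataFrame:
    """Add the previous `window` values of `col` as new columns."""
    for k in range(1, window + 1):
        df[f'{col}_lag{k}'] = df[col].shift(k)
    return df


toy = pd.DataFrame({
    'hour': [0, 6, 12, 18],
    'wind_spd': [1.0, 2.0, 1.0, 2.0],
    'wind_dir': [0.0, 90.0, 180.0, 270.0],
})
toy = add_cyclic(toy, 'hour', period=24)
toy = add_wind_components(toy)
toy = add_lags(toy, 'wind_spd', window=2)
print(toy.round(3))
```

The first rows of the lag columns are `NaN` because there is no earlier row to copy from; how such gaps are handled is a separate design choice.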

### Preprocess + Model Training

Since our approach is an ensemble, and the different models are preprocessed differently, the code below contains each `Model` paired with its `PreProcessor`.

The ensemble is sequential: the out-of-fold (`oof`) values produced by the `1`st process are fed as an extra input to the next one
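
To make the idea of out-of-fold values concrete, here is a minimal sketch that uses only NumPy. It swaps in a least-squares linear model and a simple round-robin assignment of groups to folds; it is not `LGBModel` and not scikit-learn's `GroupKFold`, and `oof_predictions` is a hypothetical name.

```python
import numpy as np


def oof_predictions(x, y, groups, n_splits=3):
    """Predict every row with a model that never saw that row's group."""
    unique_groups = np.unique(groups)
    # assign each group to a fold in round-robin order (illustrative choice)
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    fold_ids = np.array([fold_of_group[g] for g in groups])

    x1 = np.column_stack([x, np.ones(len(x))])  # add an intercept column
    oof = np.zeros(len(y), dtype=float)
    for fold in range(n_splits):
        valid = fold_ids == fold
        # fit on the other folds, predict on the held-out fold
        coef, *_ = np.linalg.lstsq(x1[~valid], y[~valid], rcond=None)
        oof[valid] = x1[valid] @ coef
    return oof


rng = np.random.default_rng(0)
x = rng.normal(size=(60, 2))
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + 1.0
groups = np.repeat(np.arange(12), 5)  # 12 groups of 5 rows each

oof = oof_predictions(x, y, groups)
# the second-stage model sees the original features plus the OOF column
x_stacked = np.column_stack([x, oof])
print(x_stacked.shape)
```

Using held-out predictions rather than in-sample ones keeps the second model from learning on values that already leaked the target.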

#### 1: LightGBM + Version 1 Pre Processing

This first approach includes using our `MlodPreProcessor` and our `LGBModel`

In [None]:
from mlod.preprocessors import MlodPreProcessor

mlod_preprocessor = MlodPreProcessor()
mlod_pp_opts = dict(cols_to_retain=['ID'])
x_train, y_train = mlod_preprocessor.process(train_df, **mlod_pp_opts)

In [None]:
# Training the LGB Model
# ------------------------------
import pandas as pd
from mlod.models import LGBModel
from sklearn.model_selection import GroupKFold

lgb_model = LGBModel('airqo')

In [None]:
x_train_ids = x_train.pop('ID')
fold_group = x_train['day_idx'].astype(str) + '_' + x_train['24hr_idx'].astype(str)

In [None]:
assert 'ID' not in x_train.columns, 'Make sure ID is NOT in the columns'

In [None]:
# performing evaluation, not training, since 'cv' is True
lgb_eval_out = lgb_model.train(x_train, y_train, cv=True, kfold=GroupKFold, group=fold_group, n_splits=3)

# Save the OOF values to feed to the next model
df_to_feed = pd.DataFrame.from_dict({ 'ID': x_train_ids, 'oof': lgb_eval_out['oof'] })

save_path = './lgb_eval.csv'
df_to_feed.to_csv(save_path)
print('Saving the OOF values to path: {}'.format(save_path))

In [None]:
import lightgbm as lgb

# training the model
lgb_model.train(x_train, y_train, cv=False)

# save the model
lgb_model.model.save_model('./lgb-airqo')

#### 2: CatBoost + Version 2 Pre Processing

This second approach includes using our `AirQoPreProcessor` and our `CatBoostModel`

In [None]:
## Training the CatBoost Model
# ------------------------------
from mlod.preprocessors import AirQoPreProcessor

airqo_preprocessor = AirQoPreProcessor()

airqo_pp_opts = dict(cols_to_retain=['ID'])
x_train, y_train = airqo_preprocessor.process(train_df, **airqo_pp_opts)

In [None]:
# since the LGBModel output has many more rows (121x) because
#  of the way it was preprocessed, we average the values per ID
train_feed = df_to_feed.groupby('ID').mean()

# Add the feed value to the data before training the new model
x_train = x_train.join(train_feed, on='ID')

# drop the ID after joining
del x_train['ID']
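
The averaging and joining step can be checked on a toy example. The frames and the `temp` column below are made up for illustration; only the `groupby('ID').mean()` and `join(..., on='ID')` pattern mirrors the cell above.

```python
import pandas as pd

# several expanded rows per ID, as produced by the first preprocessing
expanded = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'b', 'b', 'b'],
    'oof': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# collapse to one averaged value per ID (the index becomes 'ID')
per_id = expanded.groupby('ID').mean()

# one row per ID, as produced by the second preprocessing
features = pd.DataFrame({'ID': ['a', 'b'], 'temp': [21.5, 19.0]})

# match the 'ID' column of `features` against the index of `per_id`
joined = features.join(per_id, on='ID')
print(joined)
```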

In [None]:
from mlod.models import CatBoostModel
from sklearn.model_selection import KFold

cb_model = CatBoostModel('airqo')

In [None]:
# performing cross validation training
cb_eval_out = cb_model.train(x_train, y_train, cv=True, store_cv_models=True, kfold=KFold, n_splits=50)

### Ensemble Prediction

Because we are using an ensemble model, prediction also has to follow the ensemble's chain:
the output of `lgb_model` is used as an input to the `cb_model`.

Below is a function that captures this ensemble prediction.

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm

from mlod.models import Model
from mlod.preprocessors import PreProcessor
from typing import Tuple

import logging
logger = logging.getLogger('mlod')

class EnsemblePredictor:
    def __init__(self, 
                 trained_lgb_model: Model, 
                 cv_trained_cb_model: Model, 
                 lgb_pp_opts: Tuple[PreProcessor, dict], 
                 cb_pp_opts: Tuple[PreProcessor, dict]):
        
        # Checks if the models are trained
        assert trained_lgb_model.model is not None, "the lgb model is not trained"
        assert cv_trained_cb_model.is_cv_trained, "the cb model needs to be trained by cross validation"
        
        self.lgb = trained_lgb_model
        self.cb = cv_trained_cb_model
        
        # load up the preprocessor and config used in LGB model
        lgb_pp, lgp_opts = lgb_pp_opts
        self.lgb_pp = lgb_pp
        self.lgp_opts = lgp_opts
        
        # load up the preprocessor and config used in CatBoost model
        cb_pp, cb_opts = cb_pp_opts
        self.cb_pp = cb_pp
        self.cb_opts = cb_opts
        
    def predict(self, x: pd.DataFrame) -> np.ndarray:
        
        # pre-process like lgb
        x_out_lgb = self.lgb_pp.process(x.copy(), test=True, **self.lgp_opts)
        x_ids = x_out_lgb.pop('ID')
        
        # pre-process like cb
        x_out_cb = self.cb_pp.process(x.copy(), test=True, **self.cb_opts)
        
        logger.info('Making prediction using base model')
        # output for the lgb + merge with x_out_cb
        to_merge = pd.DataFrame.from_dict({ 'ID': x_ids, 'oof': self.lgb.predict(x_out_lgb) })
        
        # mean merge the values
        to_merge = to_merge.groupby('ID').mean()
        x_out_cb = x_out_cb.join(to_merge, on='ID')
        
        # remove ID col + empty unneeded data
        del x_out_cb['ID']
        del to_merge
        
        # store the list of predictions
        ls_preds = []
        
        logger.info('Making prediction using each %d cv models' % len(self.cb.cv_models))
        # get the models used in the cross validations
        for cv_model in tqdm(self.cb.get_cv_models()):
            # make prediction using combined values with the cb model
            pred = cv_model.predict(x_out_cb)
            ls_preds.append(pred)
        
        # compute the mean of the predictions of 
        #  the cross validation models
        return np.mean(ls_preds, 0)
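
The final averaging step of `predict` can be illustrated in isolation. In this sketch `ConstantOffsetModel` is a hypothetical stand-in for the per-fold models; it only needs a `predict` method.

```python
import numpy as np


class ConstantOffsetModel:
    """Hypothetical stand-in for one cross-validation fold model."""

    def __init__(self, offset: float):
        self.offset = offset

    def predict(self, x):
        # row sum plus a fold-specific offset, just to get distinct outputs
        return np.asarray(x).sum(axis=1) + self.offset


def average_predictions(models, x):
    """Average the predictions of several models row by row."""
    preds = [m.predict(x) for m in models]
    # axis 0 runs over the models, so each row keeps its own mean
    return np.mean(preds, axis=0)


models = [ConstantOffsetModel(o) for o in (-1.0, 0.0, 1.0)]
x = np.array([[1.0, 2.0], [3.0, 4.0]])
avg = average_predictions(models, x)
print(avg)
```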

Using this `EnsemblePredictor` and saving predictions

In [None]:
import numpy as np
from mlod.file_utils import PredictionStorage

# Building the ensemble predictor
predictor = EnsemblePredictor(
                    lgb_model, 
                    cb_model, 
                    (mlod_preprocessor, mlod_pp_opts),
                    (airqo_preprocessor, airqo_pp_opts)
                )

y_test = predictor.predict(test_df)

In [None]:
# Store the results for submission
mean_rmse = np.mean([cb_eval_out['rmse'], lgb_eval_out['rmse']])

out_df = pd.DataFrame.from_dict(dict(ID=TEST_IDS.values, target=y_test)).set_index('ID')
out_df.to_csv(f'./airqo_sub{mean_rmse}.csv')

The file to upload should be named `airqo_subXXX.csv`. The value in XXX is the mean RMSE of the two models, which lets us keep track of which training run produced which result.
It is also an indicator of ensemble overfitting.