### This notebook describes how to structure the project.

During the competition there were many tasks to do: planning, researching, testing, tuning, accepting the disappointment of failed tests, and being happy with small improvements.

Therefore, the workspace and the files of the project should be structured flexibly, with as little repeated code as possible. In other words, code should be separated from data and configuration to save time and reduce bugs.

Let us first define the entities in the project.
There are four main entities, and I think these will be the same in all projects: experiment, model, level, and stack. These are modeled by classes as described in this notebook.


I would like to give credit to many kernels and websites, among them:

 - Good introduction to stacking: https://mlwave.com/kaggle-ensembling-guide/
 - Implementation of stacking and some nice discussion: https://www.kaggle.com/getting-started/18153#post103381
 - Stacking solution for a regression problem: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
 
*This is a draft work, and will be improved regularly.*


In [1]:
# Familiar imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

import os
import glob
from datetime import datetime
from pathlib import Path




# helpers
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, OneHotEncoder, PowerTransformer, StandardScaler, \
                                  MinMaxScaler, RobustScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, KFold, cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline

# Models
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# base
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone

# scoring
from sklearn.metrics import mean_squared_error


In [2]:
# notebook options
pd.set_option("display.max_columns", 100)
path = "../input/30-days-of-ml/"
train_file = "train.csv"
test_file = "test.csv"

In [3]:
# Load the training data
train = pd.read_csv(f'{path}{os.sep}{train_file}', index_col=0)
test = pd.read_csv(f'{path}{os.sep}{test_file}', index_col=0)

# Preview the data
# train.describe().T

In [4]:
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

# Preview features
# features.head().T

In [5]:
# identify columns
numerical_cols = features.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = features.select_dtypes(include=['object', 'bool']).columns

# useful for column transformers 
numerical_ix = features.columns.get_indexer(numerical_cols)
categorical_ix = features.columns.get_indexer(categorical_cols)

In [6]:
# work on a copy
X_train = features.copy()
X_test = test.copy()

### Helper functions

These are two functions to save and load predictions. They can be wrapped in a class for better modeling, or kept as they are, since they are independent of the project setting.


In [7]:

## helper functions
def store_oof_predictions(model_name,
                          valid_predictions,
                          test_predictions,
                          X_train,
                          X_test,
                          output_folder,
                          n_folds, 
                          random_state):    
    
    """
    Store the OOF (out-of-fold) and test predictions in CSV files.
    No defaults are provided on purpose, so that the caller has to state
    explicitly which experiment is being stored.
    """
    
    if test_predictions is not None:
        test_df = pd.DataFrame({'Id': X_test.index,
                       model_name: test_predictions})
        test_df.to_csv( f'{output_folder}{os.sep}{model_name}_{n_folds}_test.csv',
                       index=False)

    if valid_predictions is not None:
        valid_df = pd.DataFrame({'Id': X_train.index,
                           model_name: valid_predictions})
        valid_df.to_csv(f'{output_folder}{os.sep}{model_name}_{n_folds}_valid.csv', 
                        index=False)
    
    
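def read_oof_predictions(pattern, 
                     models_names, 
                     index='Id'):    
    """
    Read the files whose names match the glob pattern and start with one of
    the given model names. It returns a dataframe where each column is the
    prediction of one model.
    """
    prefixes = tuple(s.lower() for s in models_names)
    li = [] 
    # sorted: the valid and test files then yield the same column order
    for f in sorted(glob.glob(pattern)):
        file_name = os.path.basename(f) # get file name only
        if file_name.lower().startswith(prefixes): # get only required models
            # 'Id' is the index column written by store_oof_predictions
            df = pd.read_csv(f, index_col=index)
            li.append(df)
    
    # note: pd.concat raises ValueError if no file matched                               
    return pd.concat(li, axis=1)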
    

### Experiment 
Since everything boils down to stacking (even if we have a single level), the experiment class handles the organization of the files produced by a run: the test and OOF predictions. Assuming the project has the following structure, with a folder called **experiments**, we can save our runs in this folder. This is what this class does. Its other important job is to create the CV folds and reuse them. Currently the class supports one seed only, but that can be changed easily by creating a list of CV indexes instead of a single one. I will add that in the next round.

```
    ML30_project
    │   README.md
    │
    └───notebooks
    │   ...
    │
    └───experiments
    │   │   
    │   │
    │   └───experiment_1
    │   │   │   
    │   │   └───level_1
    │   │   │   xgb_test.csv
    │   │   │   xgb_valid.csv
    │   │   │   ...
    │   │   └───level_2
    │   │   │   ...
    │   │   └───level_3
    │   │   │   ...
    │   │   └───level_...
    │   │
    │   │
    │   └───experiment_...
```
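
As a rough sketch of how a prediction file could be located inside this layout, the helper below builds a path from the experiment, level, and model names. The function name `prediction_path` and its arguments are hypothetical; the file name pattern `{model_name}_{n_folds}_{kind}.csv` follows the one used by `store_oof_predictions` above.

```python
from pathlib import Path


def prediction_path(root, experiment, level, model_name, n_folds, kind):
    """Build the path of one prediction file; kind is 'valid' (OOF) or 'test'."""
    if kind not in ("valid", "test"):
        raise ValueError("kind must be 'valid' or 'test'")
    file_name = f"{model_name}_{n_folds}_{kind}.csv"
    return Path(root) / "experiments" / experiment / level / file_name


p = prediction_path("ML30_project", "experiment_1", "level_1", "xgb", 5, "valid")
print(p.as_posix())  # ML30_project/experiments/experiment_1/level_1/xgb_5_valid.csv
```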


>The code that generated the results is important to save too, but that can be done easily by creating a new version of the notebook or by copying the notebook along with its CV/LB results.

>This class matters most when running notebooks on our own computers, since Kaggle already has a nice notebook management system which saves outputs as well.



In [8]:

class Experiment():
    def __init__(self,
                 title='Experiment',
                 n_folds=3,
                 random_state=42,
                 shuffle=True,
                 use_different_random_states=True,
                 main_folder=os.getcwd()):
        
        self.title = title
        self.n_folds = n_folds
        self.random_state = random_state
        self.shuffle = shuffle
        self.use_different_random_states = use_different_random_states
        self.main_folder = main_folder
        
        # create the main folder if it does not exist
        if not os.path.exists(f'{self.main_folder}'):
            os.makedirs(f'{self.main_folder}')
        
    def calc_folds_indexes(self, X, y, n_folds=None, sampler=KFold):
        """
        Create folds from a dataset X and a target y
        sampler: can be KFold,  StratifiedKFold, or any sampling class  
        """
        # if no number of folds is specified, use the experiment's number
        if n_folds is None: 
            n_folds = self.n_folds
            
        # scikit-learn rejects a random_state when shuffle is False
        self.folds = sampler(n_splits=n_folds, 
                        random_state=self.random_state if self.shuffle else None,
                        shuffle=self.shuffle)
        self.folds_idxs = list(self.folds.split(X, y))
        
        
        
    def join_folder(self, folder=None):
        """
        Switch the experiment to an output folder, so that any output is written there.
        This is important as we move from one level to another during the training of the stack.
        If the folder does not exist, it is created first. If folder is None, a folder
        named after the title and the current time stamp is created.
        """
        if folder is not None: # use the specified folder
            self.output_folder = folder
        else: # name the folder after the title and the current time stamp
            time_stamp = datetime.now().isoformat(' ', 'seconds')
            self.output_folder = self.title + ' ' + time_stamp.replace(':', '-')
        # create the folder if it does not exist
        Path(f'{self.main_folder}{os.sep}{self.output_folder}').mkdir(parents=True, exist_ok=True)
    
    def to_dict(self):
        return {"title": self.title,
                "n_folds": self.n_folds,
                "random_state": self.random_state,
                "use_different_random_states": self.use_different_random_states,
                "main_folder": self.main_folder,
                # None until join_folder has been called
                "output_folder": getattr(self, "output_folder", None)}
        
    def __str__(self):
        return str(self.to_dict())
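
The multi-seed extension mentioned above (a list of CV indexes instead of a single one) could be sketched as follows. This is only an illustration under stated assumptions: `kfold_indexes` is a hypothetical NumPy stand-in for `KFold(shuffle=True)`, used so that the sketch has no scikit-learn dependency, and the `seeds` list is arbitrary.

```python
import numpy as np


def kfold_indexes(n_samples, n_folds, seed):
    """Return a list of (train_ix, valid_ix) pairs for one shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    parts = np.array_split(shuffled, n_folds)
    splits = []
    for i in range(n_folds):
        valid_ix = np.sort(parts[i])
        train_ix = np.sort(np.concatenate([parts[j] for j in range(n_folds) if j != i]))
        splits.append((train_ix, valid_ix))
    return splits


# one list of fold indexes per seed, instead of a single list
seeds = [0, 1, 2]
folds_per_seed = {seed: kfold_indexes(n_samples=10, n_folds=5, seed=seed) for seed in seeds}

for seed, splits in folds_per_seed.items():
    print(seed, [valid_ix.tolist() for _, valid_ix in splits])
```

Each model would then be trained once per seed, and its OOF and test predictions averaged over the seeds.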

### ModelWrapper 
The role of this class is to avoid coding a separate class for each model (or model type). Models fall into different categories, and some accept more parameters than others. For instance, xgboost can use an evaluation set to determine the stopping round, while Lasso does not accept such extra parameters.

Thanks to the flexibility of Python and the design of the base models, we can wrap the model and let the wrapper do what the model should do. In fact, we can easily stretch this class to support sklearn pipelines.


In [9]:
class ModelWrapper():
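    def __init__(self, 
                 model,
                 name,
                 uses_eval_set=False,
                 fit_params=None):
        
        self.model = model
        self.name = name
        
        self.uses_eval_set = uses_eval_set
        # any extra params for the 'fit' function
        # (None instead of {} avoids sharing one mutable default between instances)
        self.fit_params = fit_params if fit_params is not None else {}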
                
    def fit(self, X, y, eval_set=None):
        if self.uses_eval_set:
            self.model.fit(X, y, eval_set=eval_set, **(self.fit_params))
        else:
            self.model.fit(X, y, **(self.fit_params)) 
        return self
    

    def predict(self, X):
        return self.model.predict(X)
        
    def clone_me(self, random_state=None):
        wrapper = ModelWrapper(model=clone(self.model), # clone from sklearn.base
                               name=self.name, 
                               uses_eval_set=self.uses_eval_set,
                               fit_params=self.fit_params)
        if random_state is not None:
            wrapper.set_random_state(random_state)
        
        return wrapper
    
    def set_random_state(self, random_state):
        if hasattr(self.model, 'random_state'):
            self.model.random_state = random_state
        elif hasattr(self.model, 'random_seed'):
            self.model.random_seed = random_state
            
    def get_random_state(self):
        if hasattr(self.model, 'random_state'):
            return self.model.random_state 
        elif hasattr(self.model, 'random_seed'):
            return self.model.random_seed
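
To make the dispatch idea concrete, here is a minimal sketch with two hypothetical stand-in models: `PlainModel` plays the part of a model whose `fit` takes only `X` and `y` (like Lasso), and `EvalSetModel` the part of one that also accepts an `eval_set` (like xgboost). None of these names belong to a real library; the function `fit_model` only mirrors the branching in `ModelWrapper.fit`.

```python
class PlainModel:
    """Hypothetical model whose fit takes only X and y."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


class EvalSetModel(PlainModel):
    """Hypothetical model whose fit also accepts an eval_set."""

    def fit(self, X, y, eval_set=None, **fit_params):
        self.n_eval_sets_ = 0 if eval_set is None else len(eval_set)
        return super().fit(X, y)


def fit_model(model, X, y, uses_eval_set=False, eval_set=None, fit_params=None):
    """Forward eval_set only to the models that declare they accept it."""
    fit_params = fit_params or {}
    if uses_eval_set:
        return model.fit(X, y, eval_set=eval_set, **fit_params)
    return model.fit(X, y, **fit_params)


X, y = [[0], [1], [2]], [1.0, 2.0, 3.0]
plain = fit_model(PlainModel(), X, y)
boosted = fit_model(EvalSetModel(), X, y, uses_eval_set=True, eval_set=[(X, y)])
print(plain.predict(X), boosted.n_eval_sets_)  # [2.0, 2.0, 2.0] 1
```

The calling code never needs to know which kind of model it holds; the `uses_eval_set` flag carries that information.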
    
        

### ModelTrainer
The role of this class is to train a model and calculate the OOF and test predictions.


In [10]:
class ModelTrainer():
    def __init__(self,
                  model: ModelWrapper):
        
        self.model = model
        
    def calc_oofs(self,
                  X, y,
                  X_test,
                  folds_idxs,
                  transformer=None,
                  verbose=False,
                  use_different_random_states=True):
        """
        Return the OOF predictions and the test predictions (the meta
        features for the next level), averaged over the folds.
        """
        
        test_predictions = 0
        oof_predictions = np.zeros(len(y), dtype=float)  # float even if y is integer-typed
        valid_mean_score = [] 
                
        for fold, (train_ix, valid_ix) in enumerate(folds_idxs):
            X_train, X_valid = X[train_ix], X[valid_ix]
            y_train, y_valid = y[train_ix], y[valid_ix]
                        
            
            # transform input
            if transformer is not None:
                X_train = transformer.fit_transform(X_train)
                X_valid = transformer.transform(X_valid)

            
            # check if we train each fold on differently initialized clone
            if use_different_random_states:
                model = self.model.clone_me(random_state=fold)
            else:
                model = self.model.clone_me()
            
            # fit the model
            if model.uses_eval_set:
                model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_valid, y_valid)])
            else:
                model.fit(X_train, y_train)
                
            ## predictions
            # on the validation set
            valid_predictions = model.predict(X_valid)
            # RMSE; np.sqrt avoids the `squared` argument, which is deprecated in recent scikit-learn releases
            score = np.sqrt(mean_squared_error(y_valid, valid_predictions))
            valid_mean_score.append(score)
            oof_predictions[valid_ix] = valid_predictions
            
            # on the test set
            # transform it first on a copy
            if transformer is not None:
                X_test_ = transformer.transform(X_test)
            else:
                X_test_ = X_test
            test_predictions += model.predict(X_test_) / len(folds_idxs)
            
            if verbose:
                print('Fold:{} score:{:.4f}'.format(fold + 1, score))
        
        if verbose:
            print('Average score:{:.4f} ({:.4f})'.format(np.mean(valid_mean_score), np.std(valid_mean_score) ))
    
        return oof_predictions, test_predictions
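
The out-of-fold idea itself can be illustrated in a few lines of NumPy. This is a sketch on synthetic data, with an ordinary least-squares fit standing in for the wrapped model; the data, the fold layout, and all variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=20)
X_test = rng.normal(size=(5, 2))

# four folds of contiguous rows
folds = np.array_split(np.arange(20), 4)

oof = np.zeros(20)                         # one prediction per training row
test_pred = np.zeros(5)                    # average of the per-fold test predictions
times_predicted = np.zeros(20, dtype=int)  # bookkeeping for the check below

for valid_ix in folds:
    train_ix = np.setdiff1d(np.arange(20), valid_ix)
    # fit on the training part only
    coef, *_ = np.linalg.lstsq(X[train_ix], y[train_ix], rcond=None)
    # predict the rows the model has not seen
    oof[valid_ix] = X[valid_ix] @ coef
    times_predicted[valid_ix] += 1
    # every fold's model contributes equally to the test prediction
    test_pred += X_test @ coef / len(folds)

rmse = np.sqrt(np.mean((oof - y) ** 2))
print(f"OOF RMSE: {rmse:.4f}")
print("every row predicted exactly once:", bool((times_predicted == 1).all()))
```

Because each row is predicted by a model that never saw it, the OOF column can be used as a training feature for the next level without leaking the target.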

### LevelTrainer
The role of this class is to glue everything together in one place and train each model in a given level.


In [11]:
class LevelTrainer():
    
    def __init__(self,
                 experiment, # Experiment,
                 models,      # ModelWrapper
                 transformer, # data transformer
                 level_num=1):
        self.level_num = level_num
        self.experiment = experiment
        self.models = models
        self.transformer = transformer
        
    def train(self, X_df, y_df, X_test_df, verbose=True):
        
        assert self.models is not None  # check the level's own models, not the global `models` dict
        

        for model_wrapper in self.models:
            
            if verbose:
                print('-'*30)
                print(f'Model:{model_wrapper.name}')
                print('-'*30)
                
            trainer = ModelTrainer(model_wrapper)
            oof_preds, test_preds = trainer.calc_oofs(X_df.values, y_df.values,
                                                      X_test_df.values,
                                                      transformer=self.transformer,
                                                      folds_idxs=self.experiment.folds_idxs,
                                                      verbose=verbose,
                                                      use_different_random_states=self.experiment.use_different_random_states)
            # store predictions
            store_oof_predictions(model_name=model_wrapper.name,
                                  valid_predictions=oof_preds,
                                  test_predictions=test_preds,
                                  X_train=X_df,
                                  X_test=X_test_df,
                                  n_folds=self.experiment.n_folds, 
                                  random_state=model_wrapper.get_random_state(),
                                  output_folder=f'{self.experiment.main_folder}{os.sep}{self.experiment.output_folder}')
            if verbose:
                print('-'*30)

### Hyperparameters

Here go the parameters of each model. These can actually be stored in an external JSON file.


In [12]:
# Lasso
lasso_params = {
                'alpha': 0.00005
}


# Elastic Net
enet_params = {
               'alpha': 0.00005, 
               'l1_ratio': .9
}

# extra-tree
et_params = {
    'n_jobs': -1,
    'n_estimators': 100,
    'max_features': 0.5,
    'max_depth': 12,
    'min_samples_leaf': 2,
}

# random forest

rf_params_2 = {
    'n_jobs': -1,
    'n_estimators': 500,
    'max_depth': 5
}


rf_params_1 = {
    'n_jobs': -1,
    'n_estimators': 100,
    'max_features': 0.2,
    'max_depth': 8,
    'min_samples_leaf': 2
}

# gradient boosting 
gb_params = {
    'n_estimators': 500,
     'max_depth': 3
}

# xgboost

# Using optuna but still rough
xgb_params_2 = {
#                  'tree_method': 'gpu_hist', 
#                  'gpu_id': 0, 
#                  'predictor': 'gpu_predictor',
    
                 'max_depth': 5,
                 'learning_rate': 0.021252439960137114,
                 'n_estimators': 13500,
                 'subsample': 0.62,
                 'booster': 'gbtree',
                 'colsample_bytree': 0.1,
                 'reg_lambda': 0.1584605320779582,
                 'reg_alpha': 15.715145781076245,
                 'n_jobs': -1
}

xgb_params_1 = {
            'random_state': 1, 
            # gpu
    #         'tree_method': 'gpu_hist', 
    #         'gpu_id': 0, 
    #         'predictor': 'gpu_predictor',
            # cpu
            'n_jobs': -1,
            'booster': 'gbtree',
            'n_estimators': 10000,
            # optimized params
            'learning_rate': 0.03628302216953097,
            'reg_lambda': 0.0008746338866473539,
            'reg_alpha': 23.13181079976304,
            'subsample': 0.7875490025178415,
            'colsample_bytree': 0.11807135201147481,
            'max_depth': 3,
            #'min_child_weight': 6
}


# catboost
catb_params = {'iterations': 6800,
              'learning_rate': 0.93,
              'loss_function': "RMSE",
              'random_state': 42,
              'verbose': 0,
              'thread_count': -1,
              'depth': 1,
              'l2_leaf_reg': 3.28}


# using optuna
params_lgb = {
             "n_estimators": 10000,
             'metric':'rmse',
             "objective": "regression",
             'max_depth': 12, 
             'subsample': 0.587082286344555, 
             'colsample_bytree': 0.2157299997089329, 
             'learning_rate': 0.01270518267668901,
             'reg_lambda': 36.78473508062132,
             'reg_alpha': 14.155146595119032, 
             'min_child_samples': 6, 
             'num_leaves': 34, 
             'max_bin': 914,
             'cat_smooth': 26,
             'n_jobs': -1,
             'cat_l2': 0.020257336654989123
        }
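
Storing the parameters in an external JSON file, as suggested above, could look like the sketch below. The file name `hyperparameters.json` and the temporary folder are assumptions for the illustration; the parameter values are the Lasso and Elastic Net ones from the cell above.

```python
import json
import tempfile
from pathlib import Path

params = {
    "Lasso": {"alpha": 0.00005},
    "ElasticNet": {"alpha": 0.00005, "l1_ratio": 0.9},
}

with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "hyperparameters.json"
    # save the parameters next to the experiment outputs
    config_file.write_text(json.dumps(params, indent=2))
    # later, possibly in another notebook: load them back
    loaded = json.loads(config_file.read_text())

loaded_lasso_params = loaded["Lasso"]
print(loaded_lasso_params)
```

The loaded dictionary can be passed on unchanged, for example as `Lasso(**loaded_lasso_params)`.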

### These are model/task dependent parameters

In [13]:
# external hyperparameters

### fit function hyperparameters
# some models require special parameters like early stopping in xgboost and lgbm
fit_params = {'early_stopping_rounds': 300,
                  'verbose': False}

### application/implementation parameters
# These parameters are implementation dependent 
app_params = {'uses_eval_set':True}



### Models
Here we define our models.

In [14]:
lrg = LinearRegression()
# lasso
lasso = Lasso(**lasso_params)

# Elastic net
e_net = ElasticNet(**enet_params)

# KNeighborsRegressor
knn =  KNeighborsRegressor()

# extra-tree
extree = ExtraTreesRegressor(**et_params)

# random forest
rfr = RandomForestRegressor(**rf_params_2)

# gradient boosting
gb = GradientBoostingRegressor(**gb_params)

#lgbm
lgb = LGBMRegressor(**params_lgb)

# xgboost 
# variants
xgb_1 =  XGBRegressor(**xgb_params_1)
xgb_2 =  XGBRegressor(**xgb_params_2)

#catboost
catb = CatBoostRegressor(**catb_params)

In [15]:
# compile all settings in one dictionary; the parameter part can then be
# stored in / loaded from a JSON file (the model objects themselves are not JSON-serializable)
models = {'LinearRegression': {"model":lrg, "fit_kwargs":None, "app_params": None},
          'Lasso': {"model":lasso, "fit_kwargs":None, "app_params": None},
          'ElasticNet': {"model": e_net, "fit_kwargs":None, "app_params": None},
          'ExtraTreesRegressor': {"model": extree, "fit_kwargs":None, "app_params": None},
          'RandomForestRegressor': {"model": rfr, "fit_kwargs":None, "app_params": None},
          'GradientBoostingRegressor': {"model": gb, "fit_kwargs":None, "app_params": None},
          'XGBRegressor-1': {"model": xgb_1, "fit_kwargs":fit_params, "app_params": app_params},
          'XGBRegressor-2': {"model": xgb_2, "fit_kwargs":fit_params, "app_params": app_params},
          'CatBoostRegressor': {"model": catb, "fit_kwargs": fit_params, "app_params": app_params},
          'LGBMRegressor': {"model": lgb, "fit_kwargs":fit_params, "app_params": app_params}
          # we can add any number of models here 
        }

In [16]:
models.keys()

dict_keys(['LinearRegression', 'Lasso', 'ElasticNet', 'ExtraTreesRegressor', 'RandomForestRegressor', 'GradientBoostingRegressor', 'XGBRegressor-1', 'XGBRegressor-2', 'CatBoostRegressor', 'LGBMRegressor'])

### Stacking

Here goes the actual stacking procedure. 
   - We first define the architecture and set up a session.
   - Then we define the stack, that is, the models and transformers in each level.

In [17]:
# settings: experiment and stacking architecture
session_folder = "Experiments/session_1"
# create experiment
ml30_experiment = Experiment(title='ML 30 days',
                    n_folds=5,
                    random_state=42,
                    shuffle=True,
                    use_different_random_states=True,
                    main_folder=f'{os.getcwd()}{os.sep}{session_folder}')

# initialize the stack to the input
X_train_, X_test_ = X_train, X_test 

# any special transformers for any level
level_1_transformers = [('cat', OrdinalEncoder(), categorical_ix), ('num', MinMaxScaler(), numerical_ix)]
level_1_transform = ColumnTransformer(transformers=level_1_transformers)


# define the actual stack
stack = [ {"level-id": "level-1", 
           "models": [
                     'CatBoostRegressor',
                     'XGBRegressor-1',
                     # we can add any model here
                    ],
            "use-only": None, # to select subset of models in the next level
            "n_folds": 10,
            "folder": "level_1", 
            "transformer": level_1_transform, 
            "frozen": True # to freeze the level if already trained
            },
               
           {"level-id": "level-2",
            "models": [
                       'RandomForestRegressor',
                       'LinearRegression',
                     #  other models can be added here
                      ],
            "use-only": None, 
            "n_folds": 10,
            "folder": "level_2",
            "transformer": None,
            "frozen": False
          },
         
         # we can add any number of levels here
         # ...
         
          {"level-id": "meta_level",
            "models": ['RandomForestRegressor'],
            "use-only": None, 
            "n_folds": 10,
            "folder": "meta_level",
            "transformer": None,
            "frozen": False
          }
         
         
        ]
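
Since the stack is plain data, it can be checked before any training starts. The sketch below is one possible check, not part of the original workflow: `validate_stack`, `REQUIRED_KEYS`, and the small `example_stack` are hypothetical names introduced for the illustration.

```python
REQUIRED_KEYS = {"level-id", "models", "folder", "transformer", "frozen"}


def validate_stack(stack, known_models):
    """Return a list of problems; an empty list means the stack looks consistent."""
    problems = []
    for level in stack:
        level_id = level.get("level-id", "<unnamed>")
        missing = REQUIRED_KEYS - set(level)
        if missing:
            problems.append(f"{level_id}: missing keys {sorted(missing)}")
        names = level.get("models") or []
        if not names:
            problems.append(f"{level_id}: no models listed")
        for name in names:
            # 'all' is the special value that selects every registered model
            if name.lower() != "all" and name not in known_models:
                problems.append(f"{level_id}: unknown model '{name}'")
    return problems


example_stack = [
    {"level-id": "level-1", "models": ["XGBRegressor-1"], "folder": "level_1",
     "transformer": None, "frozen": False},
    {"level-id": "level-2", "models": ["LinearRegresion"], "folder": "level_2",
     "transformer": None, "frozen": False},  # deliberate typo in the model name
]

problems = validate_stack(example_stack, known_models={"XGBRegressor-1", "LinearRegression"})
print(problems)  # ["level-2: unknown model 'LinearRegresion'"]
```

With the notebook's own objects, the call would be `validate_stack(stack, known_models=models.keys())`.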
         

- Loop through each level in the stack

In [18]:
# run the stack
for  level in stack:
    
    print('-'*50)
    print(f'Current Level: {level["level-id"]}')
    print('-'*50)
    
    # get models 
    # if the first entry of level['models'] is 'all', use all models
    if level['models'][0].lower() == 'all':
        level_models_names = models.keys()
    else: 
        level_models_names = level['models']

    # join the level's folder
    ml30_experiment.join_folder(folder=level['folder'])
    # create folds indexes for the level
    ml30_experiment.calc_folds_indexes(X=X_train_.values, y=y.values)

    # create models
    model_wrappers = []
    for model_name in level_models_names:

        # get parameters  
        model = models[model_name]
        fit_kwargs = models[model_name]['fit_kwargs']
        app_params = models[model_name]['app_params']

        model_wrapper = ModelWrapper(model=model['model'], name=model_name)
        if fit_kwargs is not None:
            model_wrapper.fit_params = fit_kwargs
        if app_params is not None:
            model_wrapper.uses_eval_set = app_params['uses_eval_set']
        model_wrappers.append(model_wrapper)

        
    # train the level
    
    if not level['frozen']:  # skip any level that is already trained   
        level_trainer = LevelTrainer(experiment=ml30_experiment, 
                             models=model_wrappers, 
                             transformer=level['transformer'])

        level_trainer.train(X_df=X_train_, y_df=y, X_test_df=X_test_)
    else:
        print('This level is already trained')


    # collect results    
    fold_id = ml30_experiment.n_folds
    folder =f'{session_folder}{os.sep}{ml30_experiment.output_folder}'
                            
    print(f"{folder}{os.sep}*{fold_id}_valid.csv")
    
    # new features 
    X_train_ = read_oof_predictions(pattern=f"{folder}{os.sep}*{fold_id}_valid.csv",
                               models_names=level_models_names)
    X_test_ = read_oof_predictions(pattern=f"{folder}{os.sep}*{fold_id}_test.csv",
                              models_names=level_models_names)
    
    # to see each level's output
#     display(X_train_.head(10))
#     display(X_test_.head(10))


--------------------------------------------------
Current Level: level-1
--------------------------------------------------
This level is already trained
Experiments/session_1/level_1/*5_valid.csv


ValueError: No objects to concatenate
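
The `ValueError` above is what `pd.concat` raises when it receives an empty list: level-1 is marked as frozen, so nothing is trained, and apparently no prediction files matched the pattern. A frozen level can only be read back if its files are already on disk. One way to check this before reading is sketched below; `missing_prediction_files` is a hypothetical helper, and the temporary folder only simulates a partially trained level.

```python
import os
import tempfile


def missing_prediction_files(folder, model_names, n_folds):
    """Return the expected OOF/test files that are not on disk."""
    missing = []
    for name in model_names:
        for kind in ("valid", "test"):
            path = os.path.join(folder, f"{name}_{n_folds}_{kind}.csv")
            if not os.path.exists(path):
                missing.append(path)
    return missing


with tempfile.TemporaryDirectory() as folder:
    # simulate a level where only one of the two models has been trained
    for kind in ("valid", "test"):
        open(os.path.join(folder, f"CatBoostRegressor_5_{kind}.csv"), "w").close()

    missing = missing_prediction_files(folder,
                                       ["CatBoostRegressor", "XGBRegressor-1"],
                                       n_folds=5)
    missing_names = [os.path.basename(p) for p in missing]

print(missing_names)  # ['XGBRegressor-1_5_valid.csv', 'XGBRegressor-1_5_test.csv']
```

If the returned list is not empty for a frozen level, the level has to be trained first (set `"frozen": False`) or its files copied into the level's folder.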

In [None]:
# final results
X_test_.head(10)

### Submit the results

In [None]:
predictions = X_test_.iloc[:, -1].values

In [None]:
# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)

In [None]:
# results 
output.head(20)