Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

In [1]:
# Familiar imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import os
import glob
from datetime import datetime
from pathlib import Path

from scipy import stats
from scipy.stats import norm, skew #for some statistics
from scipy.special import boxcox1p



# For ordinal encoding categorical variables, splitting data, pipeline, and so on
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, PowerTransformer, StandardScaler, MinMaxScaler, RobustScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, KFold, cross_val_score 
from sklearn.pipeline import make_pipeline

# Models: linear, nearest-neighbour, kernel, tree ensembles, and gradient boosting
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.neighbors import KNeighborsRegressor 
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# base
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone


from sklearn.metrics import mean_squared_error


In [2]:
pd.set_option("display.max_columns", 100)

# Step 2: Load the data

Next, we'll load the training and test data.  

We set `index_col=0` in the code cell below to use the `id` column to index the DataFrame.  (*If you're not sure how this works, try temporarily removing `index_col=0` and see how it changes the result.*)
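
As a quick illustration of what `index_col=0` changes, the sketch below reads a tiny in-memory CSV twice. The rows and values are made up for the example and only mimic the layout of `train.csv`.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for train.csv (values are illustrative only)
csv_text = "id,cat0,cont0,target\n1,B,0.20,8.1\n2,A,0.74,8.3\n4,B,0.28,7.9\n"

# With index_col=0 the first column (id) becomes the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, id stays an ordinary column and pandas adds a 0..n-1 index
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.index.tolist())
print(without_index.columns.tolist())
```

Keeping `id` in the index means it is never fed to a model as a feature by accident, and it is still available later when building the submission file.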

In [3]:
# Load the training data
train = pd.read_csv("train.csv", index_col=0)
test = pd.read_csv("test.csv", index_col=0)

# Preview the data
train.head()
train.describe().T

feature,count,mean,std,min,25%,50%,75%,max
cont0,300000.0,0.527335,0.230599,-0.118039,0.405965,0.497053,0.66806,1.058443
cont1,300000.0,0.460926,0.214003,-0.069309,0.310494,0.427903,0.615113,0.887253
cont2,300000.0,0.490498,0.253346,-0.056104,0.300604,0.502462,0.647512,1.034704
cont3,300000.0,0.496689,0.219199,0.130676,0.329783,0.465026,0.664451,1.03956
cont4,300000.0,0.491654,0.240074,0.255908,0.284188,0.39047,0.696599,1.055424
cont5,300000.0,0.510526,0.228232,0.045915,0.354141,0.488865,0.669625,1.067649
cont6,300000.0,0.467476,0.210331,-0.224689,0.342873,0.429383,0.573383,1.111552
cont7,300000.0,0.537119,0.21814,0.203763,0.355825,0.504661,0.703441,1.032837
cont8,300000.0,0.498456,0.23992,-0.260275,0.332486,0.439151,0.606056,1.040229
cont9,300000.0,0.474872,0.218007,0.117896,0.306874,0.43462,0.614333,0.982922


The next code cell separates the target (which we assign to `y`) from the training features (which we assign to `features`).

In [4]:
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

# Preview features
features.head().T

id,1,2,3,4,6
cat0,B,B,A,B,A
cat1,B,B,A,B,A
cat2,B,A,A,A,A
cat3,C,A,C,C,C
cat4,B,B,B,B,B
cat5,B,D,D,D,D
cat6,A,A,A,A,A
cat7,E,F,D,E,E
cat8,C,A,A,C,A
cat9,N,O,F,K,N


# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  In this notebook, we'll use ordinal encoding and save our encoded features as new variables `X` and `X_test`.
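
To make the idea concrete, here is a minimal sketch of ordinal encoding on two toy frames. The column values are invented for the example; the pattern (fit on the training data, reuse the fitted encoder on the test data) is the same one applied to `X` and `X_test` later on.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the categorical columns (values are made up)
toy_train = pd.DataFrame({"cat0": ["B", "A", "B"], "cat1": ["C", "C", "A"]})
toy_test = pd.DataFrame({"cat0": ["A", "B"], "cat1": ["A", "C"]})

encoder = OrdinalEncoder()

# Learn the category -> integer mapping from the training data only
toy_train_enc = encoder.fit_transform(toy_train)

# Apply the same mapping to the test data
toy_test_enc = encoder.transform(toy_test)

print(encoder.categories_)  # sorted categories found in each column
print(toy_train_enc)
print(toy_test_enc)
```

By default the encoder raises an error if the test data contains a category that was not seen during fitting, so it is worth checking that both files share the same category values.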

In [5]:
def add_poly_features(train, test, cols=None, concate=True, poly_degree=2):
    
    if cols is  None:
        cols = train.columns
     
    columns = train[cols].columns
    poly = PolynomialFeatures(poly_degree)
    poly_train = pd.DataFrame(poly.fit_transform(train[cols]), index=train.index)
    poly_test = pd.DataFrame(poly.transform(test[cols]), index=test.index)
    
    # stamp these columns
    poly_train = poly_train.add_prefix('poly_')
    poly_test = poly_test.add_prefix('poly_')
    
    if concate:
        train = pd.concat([train, poly_train], axis=1)
        test = pd.concat([test, poly_test], axis=1)
        
    return train, test



In [6]:

# List of categorical columns
object_cols_long = [col for col in features.columns if 'cat' in col]
cont_cols = [col for col in features.columns if 'con' in col]

X = features.copy()
X_test = test.copy()

# cols = ['cont0', 'cont3', 'cont9']
# add poly features
# X, X_test = add_poly_features(X, X_test, cols)




Next, we break off a validation set from the training data.

In [7]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

In [8]:
X_train.shape

(225000, 24)
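
The shape above follows from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows are held out, so 300,000 rows leave 225,000 for training. A small sketch on synthetic arrays (not the competition data) shows the same ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 2 features
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.arange(1000)

# No test_size given, so the default 75% / 25% split is used
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, random_state=0)

print(X_tr.shape, X_va.shape)
```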

### Transformers

In [9]:
# transformation: scaling, encoding, new features, etc.
# helpers
# map columns to indexes 

class ML30Transformer(BaseEstimator, TransformerMixin):
    def __init__(self, 
                 cat_columns=None,
                 cont_columns=None,
                 target_column=None,
                 scaler=StandardScaler(),
                 normalizer=None,
                 cont_encoder=None,
                 cat_encoder=OrdinalEncoder(),
                 target_encoder=None):
        
        self.cat_columns = cat_columns
        self.cont_columns = cont_columns
        self.target_column = target_column
        self.scaler = scaler
        self.normalizer = normalizer
        self.cat_encoder = cat_encoder
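        self.cont_encoder = cont_encoder
        # store every __init__ argument so get_params() and clone() work
        self.target_encoder = target_encoder
    
    def fit(self, X, y=None):
        # only the categorical encoder is fitted; y is accepted for sklearn API compatibility
        self.cat_encoder.fit(X[self.cat_columns])
        return self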
    
    def transform(self, X):
        
        return self.cat_encoder.transform(X[self.cat_columns])
        
        
        
        

In [None]:
# encoding
ml30_transformer = OrdinalEncoder()
#ML30Transformer(cont_columns=X.columns.get_indexer(cont_cols))
X[object_cols_long] = ml30_transformer.fit_transform(X[object_cols_long])
X_test[object_cols_long] = ml30_transformer.transform(X_test[object_cols_long])

# Step 4.1: Predict using Generalized Stacking


In [None]:

# helpers and wrappers

## helper functions
def store_oof_predictions(model_name,
                          valid_predictions,
                          test_predictions,
                          X_train,
                          X_test,
                          stacking_level,
                          output_folder,
                          n_folds, 
                          random_state):    
    
    """
    Store out-of-fold (OOF) and test predictions as CSV files.
    This helper deliberately has no default arguments, so every call
    states exactly which experiment is being stored.
    """
    
    if test_predictions is not None:
        test_df = pd.DataFrame({'Id': X_test.index,
                       model_name: test_predictions})
        test_df.to_csv( f'{output_folder}{os.sep}{model_name}_{n_folds}_test.csv',
                       index=False)

    if valid_predictions is not None:
        valid_df = pd.DataFrame({'Id': X_train.index,
                           model_name: valid_predictions})
        valid_df.to_csv(f'{output_folder}{os.sep}{model_name}_{n_folds}_valid.csv', 
                        index=False)
    
    
def read_oof_predictions(pattern, 
                     models_names, 
                     index='id'):    
    """
    Read the prediction files whose names match the glob pattern.
    Returns a DataFrame in which each column holds the predictions of one model.
    """
    li = [] 
    for f in glob.glob(pattern):
        file_name = f.split(os.sep)[-1] # get file name only
        print(file_name)
        if file_name.lower().startswith(tuple(s.lower() for s in models_names)): # get only required models
            df = pd.read_csv(f, index_col=0)
            li.append(df)
    
                                   
    return pd.concat(li, axis=1)
    

In [None]:
class ModelWrapper():
    def __init__(self, 
                 model,
                 name,
                 uses_eval_set=False,
                 fit_params=None):
        
        self.model = model
        self.name = name
        
        self.uses_eval_set = uses_eval_set
        # any extra params for the 'fit' function (None avoids a shared mutable default)
        self.fit_params = fit_params if fit_params is not None else {}
        
#         self.valid_predictions = None # oof predictions on the validation set
#         self.test_predictions = None  # meta-features on the test set
                
    def fit(self, X, y, eval_set=None):
        if self.uses_eval_set:
            self.model.fit(X, y, eval_set=eval_set, **(self.fit_params))
        else:
            self.model.fit(X, y, **(self.fit_params)) 
        return self
    

    def predict(self, X):
        return self.model.predict(X)
        
    def clone_me(self, random_state=None):
        wrapper = ModelWrapper(model=clone(self.model), # clone from sklearn.base
                               name=self.name, 
                               uses_eval_set=self.uses_eval_set,
                               fit_params=self.fit_params)
        wrapper.name = self.name
        if random_state is not None:
            wrapper.set_random_state(random_state)
        
        return wrapper
    
    def set_random_state(self, random_state):
        if hasattr(self.model, 'random_state'):
            self.model.random_state = random_state
        elif hasattr(self.model, 'random_seed'):
            self.model.random_seed = random_state
            
    def get_random_state(self):
        if hasattr(self.model, 'random_state'):
            return self.model.random_state 
        elif hasattr(self.model, 'random_seed'):
            return self.model.random_seed
    
        

In [None]:
class ModelTrainer():
    def __init__(self,
                  model: ModelWrapper):
        
        self.model = model
        
    def calc_oofs(self,
                  X, y,
                  X_test,
                  folds,
                  transformer=None,
                  verbose=False,
                  use_different_random_states=True):
        """
        Return the out-of-fold (OOF) predictions and the meta-features (fold-averaged test predictions)
        """
        
        test_predictions = 0
        oof_predictions = np.zeros_like(np.array(y))
        valid_mean_score = [] 
                
        for fold, (train_ix, valid_ix) in enumerate(folds.split(X)):
            X_train, X_valid = X[train_ix], X[valid_ix]
            y_train, y_valid = y[train_ix], y[valid_ix]
            
            print(len(train_ix), len(valid_ix))
            
            # transform input
            
            if transformer is not None:
                X_train = transformer.fit_transform(X_train)
                X_valid = transformer.transform(X_valid)

            
            # check if we train each fold on differently initialized clone
            if use_different_random_states:
                model = self.model.clone_me(random_state=fold)
            else:
                model = self.model.clone_me()
            
            # fit the model
            if model.uses_eval_set:
                model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_valid, y_valid)])
            else:
                model.fit(X_train, y_train)
                
            ## predictions
            # on the validation set
            valid_predictions = model.predict(X_valid)
            score = mean_squared_error(y_valid, valid_predictions, squared=False)
            valid_mean_score.append(score)
            oof_predictions[valid_ix] = valid_predictions
            
            # on the test set
            # transform it first, but only if a transformer was supplied
            X_test_ = transformer.transform(X_test) if transformer is not None else X_test
            test_predictions += model.predict(X_test_) / folds.n_splits
            
            if verbose:
                print('Fold:{} score:{:.4f}'.format(fold + 1, score))
        
        if verbose:
            print('Average score:{:.4f} ({:.4f})'.format(np.mean(valid_mean_score), np.std(valid_mean_score) ))
    
        return oof_predictions, test_predictions

In [None]:

class Experiment():
    def __init__(self,
                 title='Experiment',
                 n_folds=3,
                 random_state=42,
                 shuffle=True,
                 use_different_random_states=True,
                 main_folder=os.getcwd()):
        
        self.title = title
        self.n_folds = n_folds
        self.random_state = random_state
        self.shuffle = shuffle
        self.use_different_random_states = use_different_random_states
        self.main_folder = main_folder
        
        

    def create(self, use_folder=None, resume=False):
        
        # create folds
        self.folds = KFold(n_splits=self.n_folds, 
                           random_state=self.random_state,
                           shuffle=self.shuffle)
        
        
        # reuse the given subfolder, or name a new one with the current time stamp
        if use_folder is not None:
            self.output_folder = use_folder 
        else:
            time_stamp = datetime.now().isoformat(' ', 'seconds')
            self.output_folder = self.title + ' ' + time_stamp.replace(':', '-')
        # make sure the output folder exists in both cases
        Path(f'{self.main_folder}{os.sep}{self.output_folder}').mkdir(parents=True, exist_ok=True)
        
     
    def to_dict(self):
        
        return dict({"title": self.title,
                    "n_folds": self.n_folds,
                    "random_state": self.random_state,
                    "use_different_random_states": self.use_different_random_states,
                    "main_folder": self.main_folder,
                    "output_folder": self.output_folder})
        
    def __str__(self):
        
        return str(self.to_dict())

In [None]:
class LevelTrainer():
    
    def __init__(self,
                 experiment, # Experiment,
                 models,      # ModelWrapper
                 transformer, # data transformer
                 level_num=1):
        
        self.level_num = level_num
        self.experiment = experiment
        self.models = models
        self.transformer = transformer
        
    def train(self, X_df, y_df, X_test_df, verbose=True):
        
        assert self.models is not None
        

        for model_wrapper in self.models:
            
            if verbose:
                print('-'*3)
                print(f'Model:{model_wrapper.name}')
                print('-'*3)
                
            trainer = ModelTrainer(model_wrapper)
            oof_preds, test_preds = trainer.calc_oofs(X_df.values, y_df.values,
                                                      X_test_df.values,
                                                      transformer=self.transformer,
                                                      folds=self.experiment.folds,
                                                      verbose=verbose,
                                                      use_different_random_states=self.experiment.use_different_random_states)
            # store predictions
            store_oof_predictions(model_name=model_wrapper.name,
                                  valid_predictions=oof_preds,
                                  test_predictions=test_preds,
                                  X_train=X_df,
                                  X_test=X_test_df,
                                  stacking_level=self.level_num, 
                                  n_folds=self.experiment.n_folds, 
                                  random_state=model_wrapper.get_random_state(),
                                  output_folder=f'{self.experiment.main_folder}{os.sep}{self.experiment.output_folder}')
            if verbose:
                print('-'*3)

### ML Models

#### Hyperparameter ranges
Common ranges (to be added).

#### Hyperparameters: these should go to a JSON file
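
A minimal sketch of how such a JSON round trip could look is shown below. The file name `hyperparams.json`, the temporary folder, and the toy dictionary are assumptions made for the example; they are not part of this notebook's pipeline.

```python
import json
import os
import tempfile

# Hypothetical, shortened parameter set used only for this illustration
toy_params = {
    "rf": {"n_estimators": 100, "max_depth": 8, "max_features": 0.2},
    "et": {"n_estimators": 100, "max_depth": 12, "max_features": 0.5},
}

# Write to a temporary folder so the example leaves the working directory untouched
path = os.path.join(tempfile.mkdtemp(), "hyperparams.json")

with open(path, "w") as f:
    json.dump(toy_params, f, indent=2)

# Load the parameters back; plain numbers and strings survive unchanged
with open(path) as f:
    loaded = json.load(f)

print(loaded["rf"])
```

Only JSON-serializable values (numbers, strings, booleans, lists, dicts) can be stored this way, so model objects themselves would still have to be built in code from the loaded dictionaries.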

In [None]:
# extra-tree
et_params = {
    'n_jobs': -1,
    'n_estimators': 100,
    'max_features': 0.5,
    'max_depth': 12,
    'min_samples_leaf': 2,
}

# random forest
rf_params = {
    'n_jobs': -1,
    'n_estimators': 100,
    'max_features': 0.2,
    'max_depth': 8,
    'min_samples_leaf': 2,
}

# xgboost
# xgb_params = {'n_estimators': 10000,
#               'learning_rate': 0.35,
#               'subsample': 0.926,
#               'colsample_bytree': 0.84,
#               'max_depth': 2,
#               'booster': 'gbtree', 
#               'reg_lambda': 35.1,
#               'reg_alpha': 34.9,
#               'random_state': 42,
#               'n_jobs': -1}

# Using optuna but still rough
xgb_params = {'max_depth': 5,
             'learning_rate': 0.021252439960137114,
             'n_estimators': 13500,
             'subsample': 0.62,
             'booster': 'gbtree',
             'colsample_bytree': 0.1,
             'reg_lambda': 0.1584605320779582,
             'reg_alpha': 15.715145781076245,
             'n_jobs': -1}

# catboost
catb_params = {'iterations': 6800,
              'learning_rate': 0.93,
              'loss_function': "RMSE",
              'random_state': 42,
              'verbose': 0,
              'thread_count': 4,
              'depth': 1,
              'l2_leaf_reg': 3.28}

SEED = 7770777
params_lgb = {"n_estimators": 10000,   
            "boosting_type": "gbdt",
            "objective": "regression",
            "metric": "rmse",
            "learning_rate": 0.007899156646724397,
            "num_leaves": 77,
            "max_depth": 77,
            "feature_fraction": 0.2256038826485174,
            "bagging_fraction": 0.7705303688019942,
            "min_child_samples": 290,
            "reg_alpha": 9.562925363678952,
            "reg_lambda": 9.355810045480153,
            "max_bin": 772,
            "min_data_per_group": 177,
            "bagging_freq": 1,
            "cat_smooth": 96,
            "cat_l2": 17,
            "verbosity": -1,
            "bagging_seed": SEED,
            "feature_fraction_seed": SEED,
            "verbose_eval":1000,
            "seed": SEED,
            "n_jobs":-1}


In [None]:
# external hyperparameters

### fit function hyperparameters
# some models require special parameters, like early stopping in xgboost and lgbm

# xgboost
xgb_fit_params = {'early_stopping_rounds': 200,
                  'verbose': False}

# lgbm
lgb_fit_params = {'early_stopping_rounds': 200,
                  'verbose': False}

### application/implementation parameters
# These parameters are implementation dependent 

# xgboost
xgb_app_params = {'uses_eval_set':True}

# lgbm
lgb_app_params = {'uses_eval_set':True}

#### Models

In [None]:
# lasso
lasso = Lasso(alpha=0.00005)

# Elastic net
e_net = ElasticNet(alpha=0.00005, l1_ratio=.9)

# KNeighborsRegressor
knn =  KNeighborsRegressor()

# extra-tree
extree = ExtraTreesRegressor(**et_params)

# random forest
rfr = RandomForestRegressor(**rf_params)

# xgboost
xgb =  XGBRegressor(**xgb_params)

#catboost
catb = CatBoostRegressor(**catb_params)

#lgbm
lgb = LGBMRegressor(**params_lgb)


In [None]:
# compile all settings in one dictionary; 
# it can then be stored in, or loaded from, a JSON file
models = {'Lasso': {"model":lasso, "fit_kwargs":None, "app_params": None},
          'ElasticNet': {"model": e_net, "fit_kwargs":None, "app_params": None},
          'ExtraTreesRegressor': {"model": extree, "fit_kwargs":None, "app_params": None},
          'RandomForestRegressor': {"model": rfr, "fit_kwargs":None, "app_params": None},
          'XGBRegressor': {"model": xgb, "fit_kwargs":xgb_fit_params, "app_params": xgb_app_params},
          'CatBoostRegressor': {"model": catb, "fit_kwargs":None, "app_params": None},
          'LGBMRegressor': {"model": lgb, "fit_kwargs":lgb_fit_params, "app_params": lgb_app_params}
        }

In [None]:
models.keys()

### Stacking

Stacking with two levels needs two things from every level 1 model: its out-of-fold (OOF) predictions on the training set and its predictions on the test set.

- Train the level 1 models and collect their OOF predictions.
- Use the OOF predictions as input features for another model (the meta-model) and predict the final target.

- There are some variations on this simple procedure; see:
 - https://mlwave.com/kaggle-ensembling-guide/
 - https://www.kaggle.com/getting-started/18153#post103381
 
- One approach would be to:
    - train N models in the first level and get their OOF predictions
    - train a model on the whole training set and get its predictions
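
The two-level procedure can be sketched on synthetic data as follows. This is a simplified illustration and not the `LevelTrainer` pipeline used in this notebook: the data, the choice of `Ridge` and a small random forest as level 1 models, and all variable names are assumptions for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold

# Synthetic regression data: 200 rows to fit on, 100 rows treated as "test"
X_toy, y_toy = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_fit, y_fit = X_toy[:200], y_toy[:200]
X_new = X_toy[200:]

level_1 = [Ridge(alpha=1.0), RandomForestRegressor(n_estimators=20, random_state=0)]
folds = KFold(n_splits=5, shuffle=True, random_state=42)

# One column of OOF predictions and one column of test predictions per model
oof = np.zeros((len(X_fit), len(level_1)))
test_meta = np.zeros((len(X_new), len(level_1)))

for j, base in enumerate(level_1):
    for train_ix, valid_ix in folds.split(X_fit):
        model = clone(base).fit(X_fit[train_ix], y_fit[train_ix])
        # Each row is predicted by a model that never saw it during training
        oof[valid_ix, j] = model.predict(X_fit[valid_ix])
        # Average the test predictions over the folds
        test_meta[:, j] += model.predict(X_new) / folds.get_n_splits()

# Level 2: the meta-model learns how to combine the level 1 predictions
meta = Lasso(alpha=0.001).fit(oof, y_fit)
final_predictions = meta.predict(test_meta)
print(final_predictions[:5])
```

Because every OOF value comes from a model that did not train on that row, the meta-model sees level 1 predictions of roughly the same quality it will later receive for the test set.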
    

In [None]:
# settings: experiment and stacking architecture



#level_1_models = ['CatBoostRegressor', 'LGBMRegressor']
level_1_models = ['all'] # to select all models

output_path = os.getcwd()

ml30_experiment = Experiment(title='ML 30 days',
                        n_folds=5,
                        random_state=42,
                        shuffle=True,
                        use_different_random_states=True,
                        main_folder=f'{output_path}{os.sep}Experiments')

ml30_experiment.create(use_folder='ML 30 days 2021-08-23 17-02-31', resume=True)

print(ml30_experiment)

stack = { 'level_1_models': level_1_models, 'meta_model': 'XGBRegressor'}


# if level_1_models is set to 'all', use every model defined in `models`
if stack['level_1_models'][0].lower() == 'all':
    level_1_models_names = models.keys()
else: 
    level_1_models_names = stack['level_1_models']

In [None]:
# main procedure

# create models
model_wrappers = []

for model_name in level_1_models_names:
    
    # get parameters
    model = models[model_name]
    fit_kwargs = models[model_name]['fit_kwargs']
    app_params = models[model_name]['app_params']
    
    model_wrapper = ModelWrapper(model=model['model'], name=model_name)
    if fit_kwargs is not None:
        model_wrapper.fit_params = fit_kwargs
    if app_params is not None:
        model_wrapper.uses_eval_set = app_params['uses_eval_set']
    model_wrappers.append(model_wrapper)


In [None]:
# train the levels
ml30_transformer = ML30Transformer(cat_columns=X.columns.get_indexer(object_cols_long))

level_trainer = LevelTrainer(experiment=ml30_experiment, 
                             models=model_wrappers, 
                             transformer=None)

level_trainer.train(X_df=X, y_df=y, X_test_df=X_test)

In [None]:
                         
# required results                            
fold_id = 5
folder ='Experiments/ML 30 days 2021-08-23 17-02-31'
                            
level_1_models = [
                  'ExtraTreesRegressor',
                  'LGBMRegressor',
                  'CatBoostRegressor',
                  'ElasticNet'
                  ]

meta_regressor = 'CatBoostRegressor'



# new features 
X_train_ = read_oof_predictions(pattern=f"{folder}{os.sep}*{fold_id}_valid.csv",
                           models_names=level_1_models)
X_test_ = read_oof_predictions(pattern=f"{folder}{os.sep}*{fold_id}_test.csv",
                          models_names=level_1_models)

X_train_.head()

In [None]:
X_train_.corr()

In [None]:
meta_model = ModelWrapper(lasso, name='Meta')
model_trainer = ModelTrainer(meta_model)

meta_valid_results, meta_test_results = model_trainer.calc_oofs(X_train_.values, 
                                                                y.values, 
                                                                X_test_.values,
                                                                ml30_experiment.folds, 
                                                                verbose=True,
                                                                use_different_random_states=True)

In [None]:
#results = np.exp((np.log(lasso_test_results) + np.log(xgb_test_results) + np.log(lgb_test_results) + np.log(catb_test_results)) / 4 )
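# geometric mean of the level 1 test predictions
# (assumes the undefined `test_stack` was meant to be the columns of X_test_, and that all predictions are positive)
results = np.exp(np.mean(np.log(X_test_.values), axis=1))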
#mean_squared_error(y, y3, squared=False)

In [None]:
predictions = meta_test_results

<hr>

In `ModelTrainer.calc_oofs` above, we pass `squared=False` to `mean_squared_error` to get the root mean squared error (RMSE) on each validation fold.
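
As a small illustration of the metric, the sketch below computes RMSE on made-up numbers by taking the square root of the mean squared error. It avoids the `squared` argument, which newer scikit-learn releases have deprecated, so it should give the same value whichever version is installed.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true = np.array([8.0, 7.5, 9.0, 8.5])
y_pred = np.array([8.5, 7.0, 9.0, 8.0])

# RMSE is the square root of the mean of the squared errors
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(round(rmse, 4))
```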

# Step 5: Submit to the competition

The meta-model's test predictions are now stored in `predictions`, so all that remains is to save them to a CSV file.

In [None]:
# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)

In [None]:
output.head(10)