### TabNet for TPS_09

This kernel uses TabNet in PyTorch to solve the claim classification problem.
It could not beat the state-of-the-art GBDT models, but it was part of my final ensemble.
It clearly adds diversity to the base layer of the stack and did not need much feature engineering.


As usual, I reuse the same components I have created for stacking: **experiment**, **model**, **level**, and **stack**. They are modeled by the classes described in this notebook.

In [1]:
!pip install pytorch-tabnet

Collecting pytorch-tabnet
  Downloading pytorch_tabnet-3.1.1-py3-none-any.whl (39 kB)
Installing collected packages: pytorch-tabnet
Successfully installed pytorch-tabnet-3.1.1


In [2]:
# Familiar imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

import os
import gc
import glob
import random
from datetime import datetime
from pathlib import Path



# helpers
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, OneHotEncoder, PowerTransformer, StandardScaler, \
                                  MinMaxScaler, RobustScaler, PolynomialFeatures, QuantileTransformer,  KBinsDiscretizer

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, KFold, cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline, Pipeline

# memory stuff
#from fail_safe_parallel_memory_reduction import Reducer


# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble     import HistGradientBoostingClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import torch
from pytorch_tabnet.tab_model import TabNetClassifier


# base
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone

# scoring
from sklearn import metrics

In [3]:
# notebook options
pd.set_option("display.max_columns", 100)
path = "../input/tabular-playground-series-sep-2021/"
train_file = "train.csv"
test_file = "test.csv"

In [4]:
# Load the training data
train = pd.read_csv(f'{path}{os.sep}{train_file}', index_col=0)
test = pd.read_csv(f'{path}{os.sep}{test_file}', index_col=0)

In [9]:
# Separate target from features
y = train['claim']
X = train.drop(['claim'], axis=1)


### Feature Engineering

In [10]:

# identify columns
numerical_cols = list(X.select_dtypes(include=np.number).columns)
non_numeric_cols = list(X.select_dtypes(include=['object', 'bool']).columns)

print(f'We have {len(numerical_cols)} numeric and {len(non_numeric_cols)} non-numeric features')


# work on a copy
X_train = X.copy()
X_test = test.copy()


# all features
features = non_numeric_cols + numerical_cols

# new features
# https://www.kaggle.com/hiro5299834/tps-sep-2021-single-lgbm
X_train['n_missing'] = X_train[features].isna().sum(axis=1)
X_test['n_missing'] = X_test[features].isna().sum(axis=1)

X_train['std'] = X_train[features].std(axis=1)
X_test['std'] = X_test[features].std(axis=1)



features += ['n_missing', 'std']

# imputation
X_train[features] = X_train[features].fillna(X_train[features].mean())
X_test[features] = X_test[features].fillna(X_test[features].mean())



# useful for column transformers 
numerical_ix = X_train.columns.get_indexer(features)


We have 118 numeric and 0 non-numeric features
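As a minimal sketch of the feature step above on a toy frame (the column names `f1` to `f3` and all values are made up for illustration): the row-wise features are computed before imputation so that `n_missing` still sees the gaps. The sketch also shows one possible variation that this notebook does not use, namely filling the test set with the training means instead of its own means.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are illustrative only
toy_train = pd.DataFrame({"f1": [1.0, np.nan, 3.0],
                          "f2": [4.0, 5.0, np.nan],
                          "f3": [7.0, 8.0, 9.0]})
toy_test = pd.DataFrame({"f1": [np.nan, 2.0],
                         "f2": [6.0, np.nan],
                         "f3": [1.0, 1.0]})
cols = ["f1", "f2", "f3"]

for df in (toy_train, toy_test):
    # Count missing values per row before any imputation
    df["n_missing"] = df[cols].isna().sum(axis=1)
    # Row-wise standard deviation (NaNs are skipped by default)
    df["std"] = df[cols].std(axis=1)

# Variation (an assumption, not the notebook's choice):
# reuse the training means for both sets
train_means = toy_train[cols].mean()
toy_train[cols] = toy_train[cols].fillna(train_means)
toy_test[cols] = toy_test[cols].fillna(train_means)

print(toy_train)
print(toy_test)
```

Whether training statistics or per-set statistics work better here is an empirical question; the sketch only shows how the two steps fit together.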


### Model-dev helper functions

These are helper functions for seeding, saving predictions, creating folds, and scoring. They can be wrapped in a class for better modeling or kept as they are, since they are independent of the project setting.


In [11]:

## helper functions
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

def to_file(data, output_folder, idxs=None, suffix='.csv'):
    print(data)
    df = pd.DataFrame(data)
    df.to_csv(f'{output_folder}{os.sep}{suffix}', index=True)
        
    
def calc_folds_indexes(X, y, n_folds=5, shuffle=True, sampler=KFold, seeds=[42]):
    """
    Create folds from a dataset X and a target y
    sampler: can be KFold,  StratifiedKFold, or any sampling class  
    
    return a list of dictionaries of {'seed':, 'idxs':[train_idxs, test_idxs]}
    """
    folds_idxs_list = []
    for seed in seeds:
        folds = sampler(n_splits=n_folds, 
                        random_state=seed,
                        shuffle=shuffle)

        folds_idxs_list.append({'seed': seed, 'idxs':list(folds.split(X, y))})
        
    return folds_idxs_list   

def score(y, target, average=False):
    # if y is a list then it will return a list of scores
    # if average is True then it will return the mean of the scores
    
    if type(y) in [list, np.ndarray]:
        scores = []
        for y_i in y:
            scores.append(score_func(y_i, target, **score_func_params))
        if average:
            return np.mean(scores)
        else:
            return scores
        
    return score_func(y, target, **score_func_params)
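To illustrate what `calc_folds_indexes` returns, here is a small self-contained sketch that builds the same kind of structure (one entry per seed, each holding the list of train/validation index pairs) on toy data. The toy arrays and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 20 rows with a balanced binary target (illustrative only)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

folds_per_seed = []
for seed in [42, 43]:
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Materialize the generator so the same folds can be reused by every model
    folds_per_seed.append({"seed": seed,
                           "idxs": list(splitter.split(X_toy, y_toy))})

for entry in folds_per_seed:
    valid_sizes = [len(valid_ix) for _, valid_ix in entry["idxs"]]
    print(entry["seed"], valid_sizes)
```

Computing the indices once per seed and passing them around means every model in a level is validated on exactly the same rows, which keeps the out-of-fold predictions comparable.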


In [12]:
# initialize things

# seed
seed = 42
seed_everything(seed)


### ModelWrapper 
The main role of this class is to avoid coding a separate class for each model (or model type). Models fall into different categories, and some accept more parameters than others. For instance, XGBoost can use an evaluation set to determine the stopping round, while Lasso has no such extra parameters.

Thanks to the flexibility of Python and the design of the base models, we can develop a `wrapper` that does what the model should do. In fact, we can easily extend this class to support sklearn pipelines or any other framework we are using. The idea, again, is to separate code from data and try to generalize.


In [14]:
class ModelWrapper():
    def __init__(self, 
                 model,
                 name,
                 uses_eval_set=False,
                 fit_params={}):
        
        self.model = model
        self.name = name
        
        self.uses_eval_set = uses_eval_set
        self.fit_params = fit_params # any extra params for the 'fit' function
                
    def fit(self, X, y, eval_set=None):
        if self.uses_eval_set:
            self.model.fit(X, y, eval_set=eval_set, **(self.fit_params))
        else:
            self.model.fit(X, y, **(self.fit_params)) 
        return self
    

    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)
    
    def clone_me(self, random_state=None):
        wrapper = ModelWrapper(model=clone(self.model), # clone from sklearn.base
                               name=self.name, 
                               uses_eval_set=self.uses_eval_set,
                               fit_params=self.fit_params)
        wrapper.name = self.name
        if random_state is not None:
            wrapper.set_random_state(random_state)
        
        return wrapper
    
    def set_random_state(self, random_state):
        if hasattr(self.model, 'random_state'):
            self.model.random_state = random_state
        elif hasattr(self.model, 'random_seed'):
            self.model.random_seed = random_state
            
    def get_random_state(self):
        if hasattr(self.model, 'random_state'):
            return self.model.random_state 
        elif hasattr(self.model, 'random_seed'):
            return self.model.random_seed
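The dispatch idea behind the wrapper can be sketched in isolation. `EvalSetClassifier` and `fit_any` below are hypothetical names invented for this example; the first is only a stand-in for a model whose `fit` accepts an `eval_set` argument.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression


class EvalSetClassifier(BaseEstimator):
    """Hypothetical stand-in for a model whose fit() accepts eval_set."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y, eval_set=None):
        self.inner_ = LogisticRegression().fit(X, y)
        # Record how many evaluation sets were passed in
        self.n_eval_sets_ = 0 if eval_set is None else len(eval_set)
        return self

    def predict_proba(self, X):
        return self.inner_.predict_proba(X)


def fit_any(model, X, y, eval_set=None, uses_eval_set=False, fit_params=None):
    """Call fit() with or without eval_set, depending on the flag."""
    fit_params = fit_params or {}
    if uses_eval_set:
        return model.fit(X, y, eval_set=eval_set, **fit_params)
    return model.fit(X, y, **fit_params)


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Same call site, two kinds of models
plain = fit_any(clone(LogisticRegression()), X_toy, y_toy)
with_eval = fit_any(clone(EvalSetClassifier()), X_toy, y_toy,
                    eval_set=[(X_toy, y_toy)], uses_eval_set=True)
```

The calling code never needs to know which kind of model it holds; the flag and the extra `fit` arguments travel with the model.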
    
        

### ModelTrainer
The role of this class is to train a model and calculate the OOF (out-of-fold) predictions and the test predictions (meta-features). That is, to cross-validate.


In [15]:
class ModelTrainer():
    def __init__(self,
                  model: ModelWrapper):
        
        self.model = model
        
    def cross_validate(self,
                  X, y,
                  X_test,
                  folds_idxs,
                  transformer=None,
                  fit_transform_on_test_set=False,
                  verbose=False,
                  use_different_random_states=True, 
                  score_function=metrics.roc_auc_score,
                  score_function_params={}):
        """
        Return the oofs predictions and the meta features (test predictions)
        """
        
        test_predictions = 0
        oof_predictions = np.zeros_like(np.array(y), dtype=np.float64)
        valid_mean_score = [] 
        for fold, (train_ix, valid_ix) in enumerate(folds_idxs): # precomputed fold indices are used instead of calling split here, for better generality
            X_train, X_valid = X[train_ix], X[valid_ix]
            y_train, y_valid = y[train_ix], y[valid_ix]
                             
            # transform input
            if transformer is not None:
                X_train = transformer.fit_transform(X_train)
                if fit_transform_on_test_set:
                    X_valid = transformer.fit_transform(X_valid)
                    X_test_ = transformer.fit_transform(X_test)
                else:
                    X_test_ = transformer.transform(X_test)
                    X_valid = transformer.transform(X_valid)
            else:
                X_test_ = X_test
                
            # check if we train each fold on differently initialized clone
            if use_different_random_states:
                model = self.model.clone_me(random_state=fold)
            else:
                model = self.model.clone_me()
            
            # fit the model
            if model.uses_eval_set:
                model.fit(X_train, y_train, eval_set=[(X_train, y_train),(X_valid, y_valid)])
            else:
                model.fit(X_train, y_train)
                
            ## predictions
            # on the validation set
            valid_predications = model.predict_proba(X_valid)[:, -1]
            score = score_function(y_valid, valid_predications)
            valid_mean_score.append(score)
            oof_predictions[valid_ix] = valid_predications
            
            # on the test set        
            test_predictions += model.predict_proba(X_test_)[:, -1] / len(folds_idxs)
            
            if verbose:
                print('Fold:{} score:{:.4f}'.format(fold + 1, score))
        
        if verbose:
            print('Average score:{:.4f} ({:.4f})'.format(np.mean(valid_mean_score), np.std(valid_mean_score) ))
    
        return oof_predictions, test_predictions
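Stripped of transformers and seeding, the core of `cross_validate` is the out-of-fold loop. The following is a minimal sketch on synthetic data, using `LogisticRegression` as a placeholder model rather than the notebook's TabNet.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_new = rng.normal(size=(50, 4))  # plays the role of the test set

folds = list(StratifiedKFold(n_splits=5, shuffle=True,
                             random_state=42).split(X_toy, y_toy))
oof = np.zeros(len(y_toy))
test_pred = np.zeros(len(X_new))

for train_ix, valid_ix in folds:
    model = LogisticRegression().fit(X_toy[train_ix], y_toy[train_ix])
    # Each row is predicted exactly once, by a model that never saw it
    oof[valid_ix] = model.predict_proba(X_toy[valid_ix])[:, 1]
    # Test predictions are averaged over the fold models
    test_pred += model.predict_proba(X_new)[:, 1] / len(folds)

print(f"OOF AUC: {roc_auc_score(y_toy, oof):.4f}")
```

The `oof` vector has one entry per training row and becomes a feature for the next level; `test_pred` is the matching feature for the test set.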

### Level
The level class glues all components in a given layer

In [16]:
class Level():
    def __init__(self,
                level_id,
                models,
                folder,
                transformer,
                n_folds=5,
                seeds=[42],
                frozen=False,
                fit_transform_on_test_set = False,
                use_different_random_states=True):
        
        self.level_id = level_id
        self.models = models
        self.folder = folder
        self.transformer = transformer
        self.n_folds = n_folds
        self.seeds = seeds
        self.frozen = frozen
        self.fit_transform_on_test_set = fit_transform_on_test_set
        self.use_different_random_states = use_different_random_states
    
    def create(self, model_zoo):
        """
         Create a level.
         model_zoo: a dictionary of all available models.
        """
        self.model_wrappers = []
        
        # get models 
        # if models is set to 'all' use all models
        if self.models[0].lower() == 'all':
            level_models_names = model_zoo.keys()
        else: 
            level_models_names = self.models

        for model_name in level_models_names:
            # get parameters
            model = model_zoo[model_name]
            fit_kwargs = model_zoo[model_name]['fit_kwargs']
            app_params = model_zoo[model_name]['app_params']

            model_wrapper = ModelWrapper(model=model['model'], name=model_name)
            if fit_kwargs is not None:
                model_wrapper.fit_params = fit_kwargs
            if app_params is not None:
                model_wrapper.uses_eval_set = app_params['uses_eval_set']
            self.model_wrappers.append(model_wrapper)

#### Level Trainer
Trains all models in a given level.

In [17]:
class LevelTrainer():
    def __init__(self,
                level,
                seeds_folds_idxs_list):
        self.level = level
        self.seeds_folds_idxs_list = seeds_folds_idxs_list
        
    def train(self, X_train, y, X_test, verbose=True, agg_func=None):
        """
        train the level and return the oofs and meta-features for each model in the level.
        If the level has many seeds it will either use the agg_func to combine predictions
        or will just return everything, depending on agg_func
        
        agg_func: can be None, np.mean, or any other numpy reduction function
        """

        level_oof_preds, level_test_preds = {}, {}
        for model_wrapper in self.level.model_wrappers:
            if verbose:
                print('-'*30)
                print(f'Model:{model_wrapper.name}')
                print('-'*30)

            # train each model with as many times as the length of folds_idxs_list 
            model_oof_preds, model_test_preds = [], []
            
            for seeds_folds_idxs in self.seeds_folds_idxs_list:
                seed, folds_idxs = seeds_folds_idxs['seed'], seeds_folds_idxs['idxs']
                print('-'*30)
                print(f'Seed:{seed}')
                print('-'*30)
                
                trainer = ModelTrainer(model_wrapper)
                oof_preds, test_preds = trainer.cross_validate(X_train, 
                                                          y,
                                                          X_test,
                                                          transformer=self.level.transformer,
                                                          folds_idxs=folds_idxs,
                                                          verbose=verbose,
                                                          fit_transform_on_test_set=self.level.fit_transform_on_test_set)
                if agg_func is None:
                    level_oof_preds[f'{model_wrapper.name}_seed_{seed}'] =  oof_preds
                    level_test_preds[f'{model_wrapper.name}_seed_{seed}'] =  test_preds
                else: # collect them in order to aggregate them with the agg_func function
                    model_oof_preds.append(oof_preds)
                    model_test_preds.append(test_preds)

            # aggregate the results of this model over all seeds (row-wise)
            if agg_func is not None:
                level_oof_preds[f'{model_wrapper.name}'] = agg_func(np.column_stack(model_oof_preds), axis=1)
                level_test_preds[f'{model_wrapper.name}'] = agg_func(np.column_stack(model_test_preds), axis=1)

        if verbose:
            print('-'*30)

        return pd.DataFrame(level_oof_preds), pd.DataFrame(level_test_preds)
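When the predictions of several seeds are combined with a reduction such as `np.mean`, the reduction has to run across the seed axis so that one value per row remains. A small sketch with made-up numbers:

```python
import numpy as np

# Hypothetical OOF predictions from three seeds for the same four rows
seed_preds = [np.array([0.60, 0.10, 0.75, 0.30]),
              np.array([0.64, 0.14, 0.71, 0.34]),
              np.array([0.62, 0.12, 0.73, 0.32])]

stacked = np.column_stack(seed_preds)   # shape (n_rows, n_seeds)
averaged = np.mean(stacked, axis=1)     # one value per row

print(stacked.shape, averaged)
```

Without `axis=1`, `np.mean` would collapse the whole matrix into a single scalar, which cannot serve as a meta-feature column.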

### Experiment 
Since in many cases everything boils down to stacking, the experiment class handles the organization of the files produced by a run: the test and OOF predictions. Assuming the project has the following structure, with a folder called **experiments**, we can save our runs in this folder. This class is the entry point for any run (experiment) in the project: it reads the input and the settings and produces the output.

```
   TPS_09_21_project
    │   README.md
    │
    └───notebooks
    │   ...
    │
    └───experiments
    │   │   
    │   │
    │   └───experiment_1   
    │   │   level_1_oofs.csv
    │   │   level_1_test.csv
    │   │   level_2_oofs.csv
    │   │   level_2_test.csv
    │   │   ...
    │   │   meta_level_oofs.csv
    │   │   meta_level_test.csv
    │   └───experiment_...
```


>The code that generated the results is important to save too, but that can be done easily by creating a new version of the notebook or by copying the notebook along with the CV and LB results. If we are running on a local machine without notebooks, we can create a small function to copy the code files to the experiment folder. In other words, save both the code and the results of each experiment for easier lookup.

>This class is most important when running notebooks on our own computers, since Kaggle already has a nice notebook management system that saves outputs as well.



In [18]:
class Experiment():
    def __init__(self,
                 title,
                 description,
                 stack,
                 model_zoo,
                 main_folder=os.getcwd()):
        
        self.title = title
        self.main_folder = main_folder
        self.stack = stack
        self.model_zoo = model_zoo
        self.description = description
        # create the main folder if it does not exist
        if not os.path.exists(f'{self.main_folder}'):
            os.makedirs(f'{self.main_folder}', exist_ok=True)
        
    def join_folder(self, folder=None):
         """
         Join a folder and output where results will be saved.
         If 'folder' is None, it will create a folder
         with a time stamp.
         """

         # create time stamp and subfolder with the current time stamp
         if folder is not None: # if folder is specified
            self.output_folder = folder 
            # create the folder if it does not exist.
            folder_path = f'{self.main_folder}{os.sep}{self.output_folder}'
            if not os.path.exists(folder_path):
                os.makedirs(folder_path)                
         else: # create a folder with the time stamp
            time_stamp = datetime.now().isoformat(' ', 'seconds')
            self.output_folder = self.title + ' ' + time_stamp.replace(':', '-')
            # create the folder; an existing one is reused.
            Path(f'{self.main_folder}{os.sep}{self.output_folder}').mkdir(parents=True, exist_ok=True)
    
    
    def run(self, X_train, y, X_test, 
            train_idxs,
            test_idxs,
            verbose=True, store=True):
        
        # run the stack
        for  level_params in self.stack:
            # create all models in the level
            level = Level(**level_params)
            level.create(self.model_zoo)

            print('-'*50)
            print(f'Current Level: {level.level_id}')
            print('-'*50)

            # join the level's output folder
            #self.join_folder(folder=level.folder)

            # create folds indexes for the level
            seeds_folds_idxs_list = calc_folds_indexes(X=X_train,
                                                       y=y,
                                                       n_folds=level.n_folds,
                                                       sampler=StratifiedKFold,
                                                       seeds=level.seeds)

            # train the level
            if not level.frozen:  # escape any trained level   
                level_trainer = LevelTrainer(level=level, 
                                             seeds_folds_idxs_list=seeds_folds_idxs_list)

                level_oof_preds, level_test_preds =  level_trainer.train(X_train=X_train,
                                                                         y=y,
                                                                         X_test=X_test)
                # store predictions?
                if store:
                    # oofs 
                    level_oof_preds.to_csv(f'{self.main_folder}{os.sep}{self.output_folder}{os.sep}{level.level_id}_oofs.csv')
                    # test predictions
                    level_test_preds.to_csv(f'{self.main_folder}{os.sep}{self.output_folder}{os.sep}{level.level_id}_test.csv')
                    
                
                # update train and test 
                X_train, X_test = level_oof_preds.values, level_test_preds.values
            else:
                print('This level is already trained')
                # load the saved predictions of this level
                folder = f'{self.main_folder}{os.sep}{self.output_folder}'
                
                # new features 
                level_oof_preds = pd.read_csv(f"{folder}{os.sep}{level.level_id}_oofs.csv", index_col=0)
                level_test_preds = pd.read_csv(f"{folder}{os.sep}{level.level_id}_test.csv", index_col=0)
                
                X_train = level_oof_preds.values
                X_test = level_test_preds.values
                
            if verbose:
                display(level_oof_preds.head(10))
                display(level_test_preds.head(10))
                
        # return the last output from the last level
        return level_test_preds
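The note in this section about copying the code files into the experiment folder could look roughly like the sketch below. `snapshot_code` is a hypothetical helper, not part of the `Experiment` class, and the `code` subfolder name is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_code(source_dir, experiment_dir, patterns=("*.py", "*.ipynb")):
    """Copy matching code files into <experiment_dir>/code; return their names."""
    target = Path(experiment_dir) / "code"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for file in sorted(Path(source_dir).glob(pattern)):
            shutil.copy2(file, target / file.name)
            copied.append(file.name)
    return copied


# Usage on throwaway folders
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "notebooks"
    exp = Path(tmp) / "experiments" / "experiment_1"
    src.mkdir(parents=True)
    (src / "train.py").write_text("print('hello')\n")
    (src / "notes.txt").write_text("not code\n")
    copied_files = snapshot_code(src, exp)
    snapshot_exists = (exp / "code" / "train.py").exists()

print(copied_files, snapshot_exists)
```

Keeping the snapshot next to the prediction files makes it possible to tell later which code produced which OOF file.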

### Hyperparameters

Here go the parameters of each model. These can actually be stored in an external JSON file.


In [19]:
# rf
rf_params = {
    'n_jobs': -1,
    'n_estimators': 100,
    'max_features': 0.2,
    'max_depth': 8,
    'min_samples_leaf': 2
}


tabnet_params = {
    'optimizer_fn': torch.optim.Adam,
    'optimizer_params': {'lr': 5e-2, 'weight_decay': 5e-4},
    'scheduler_fn': torch.optim.lr_scheduler.StepLR,
    'scheduler_params': {'step_size': 1, 'gamma': 0.7},
    'mask_type': 'entmax',
    'verbose': 1
}
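Storing parameters in an external JSON file is straightforward for plain values such as the random forest settings. The sketch below round-trips them through a temporary file; the file name `rf_params.json` is an assumption for the example.

```python
import json
import tempfile
from pathlib import Path

# Only JSON-serializable values can be stored directly. Objects such as
# torch.optim.Adam would have to be stored by name and resolved on load.
rf_params_example = {"n_jobs": -1, "n_estimators": 100, "max_features": 0.2,
                     "max_depth": 8, "min_samples_leaf": 2}

with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "rf_params.json"   # hypothetical file name
    params_file.write_text(json.dumps(rf_params_example, indent=2))
    loaded_params = json.loads(params_file.read_text())

print(loaded_params)
```

The TabNet parameters would need an extra mapping step, because they hold references to classes rather than plain values.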

In [20]:
gc.collect()

114

### These are model/task dependent parameters

In [21]:
# external hyperparameters

### fit function hyperparameters
# some models require special parameters like early stopping in xgboost and lgbm
fit_params = {'early_stopping_rounds': 300,
                  'verbose': 1000}

### application/implementation parameters
# These parameters are implementation dependent 
app_params = {'uses_eval_set':True}


tabnet_fit_params ={
    'eval_metric':['auc'],
    'eval_name': ['train', 'valid'],
    'batch_size': 1024 * 10,
    'virtual_batch_size': 128 * 10,
    'max_epochs': 10,
    'patience': 5,
    'num_workers': 0,
    'weights': 1,
    'drop_last': False
}



### Models

In [22]:
rf = RandomForestClassifier(**rf_params)

# DNNs
tabnet_clf = TabNetClassifier(**tabnet_params)


Device used : cpu


In [23]:
# compile all settings in one dictionary, 
# we can store/load it then to a JSON file
model_zoo = {
          'RandomForestClassifier': {"model": rf, "fit_kwargs":None, "app_params": None},
          # NN models
          'TabNetClassifier': {"model": tabnet_clf, "fit_kwargs":tabnet_fit_params, "app_params": app_params},
          # we can add any number of models here 
        }

### Stacking

Here goes the actual stacking procedure. 
   - We first define the architecture and set up a session.
   - Define the stack. That is, the models and transformers in the levels

In [25]:
# settings: experiment and stacking architecture

# initialize the stack to the input
X_train_, X_test_ = X_train, X_test

# any special transformers for any level
#

level_1_transformers = [('num', QuantileTransformer(n_quantiles=200, output_distribution='normal'), numerical_ix)]

level_1_transform = ColumnTransformer(transformers=level_1_transformers)


# define the actual stack
stack = [ {"level_id": "level-1", 
           "models": [
                     'TabNetClassifier'
                    ],
            "n_folds": 5,
            "seeds" : [42, 43, 44, 45, 46],
            "folder": "level_1", 
            "transformer": level_1_transform,
            "fit_transform_on_test_set": False,
            "frozen": False # to freeze the level if already trained
            },
            
         # ...
         # we can add any number of levels here
         # ...
         
          {"level_id": "meta_level",
            "models": [#'LinearRegression',
                       'RandomForestClassifier'
                      ],
            "n_folds": 5,
            "seeds" : [42],
            "folder": "meta_level",
            "transformer": None,
            "fit_transform_on_test_set": False,
            "frozen": False
          }
         
         
        ]
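The level-1 transformer can be tried in isolation. This sketch uses skewed synthetic data and a 400/100 split, both assumptions for the example, to show the case `fit_transform_on_test_set=False`: the quantiles are learned on the training part only and then reused on held-out rows.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X_toy = rng.exponential(size=(500, 3))   # skewed toy features
col_ix = [0, 1, 2]

transform = ColumnTransformer(
    [("num",
      QuantileTransformer(n_quantiles=200, output_distribution="normal",
                          random_state=0),
      col_ix)]
)
X_fit = transform.fit_transform(X_toy[:400])   # fit on the training part only
X_held = transform.transform(X_toy[400:])      # reuse the fitted quantiles

print(X_fit.mean(axis=0), X_held.shape)
```

Mapping the inputs to an approximately normal distribution is a common preprocessing choice for neural networks, which tend to be more sensitive to feature scale than tree models.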
         

- Loop through each level in the stack

In [26]:
# create experiment
experiments_folder = "Experiments"
experiment_folder = 'experiement_1' # if None a folder with time stamp will be created
experiment_description = "Simple model, multiple seeds"


ml30_experiment = Experiment(title='ML 30 days',
                             description=experiment_description,
                             stack=stack,
                             model_zoo=model_zoo,
                             main_folder=f'{os.getcwd()}{os.sep}{experiments_folder}')

ml30_experiment.join_folder(experiment_folder)

results = ml30_experiment.run(X_train=X_train_.values,
                             y=y.values, 
                             X_test=X_test_.values,
                             train_idxs = X_train_.index,
                             test_idxs = X_test_.index)

--------------------------------------------------
Current Level: level-1
--------------------------------------------------
------------------------------
Model:TabNetClassifier
------------------------------
------------------------------
Seed:42
------------------------------
Device used : cpu
epoch 0  | loss: 0.55659 | train_auc: 0.79471 | valid_auc: 0.79455 |  0:01:26s
epoch 1  | loss: 0.51395 | train_auc: 0.79614 | valid_auc: 0.79575 |  0:02:46s
epoch 2  | loss: 0.5146  | train_auc: 0.79648 | valid_auc: 0.79648 |  0:04:03s
epoch 3  | loss: 0.51451 | train_auc: 0.79687 | valid_auc: 0.79715 |  0:05:18s
epoch 4  | loss: 0.51364 | train_auc: 0.80014 | valid_auc: 0.79952 |  0:06:35s
epoch 5  | loss: 0.51379 | train_auc: 0.80132 | valid_auc: 0.80158 |  0:07:53s
epoch 6  | loss: 0.51408 | train_auc: 0.80032 | valid_auc: 0.80017 |  0:09:11s
epoch 7  | loss: 0.51266 | train_auc: 0.80134 | valid_auc: 0.80104 |  0:10:30s
epoch 8  | loss: 0.51184 | train_auc: 0.79989 | valid_auc: 0.79927 |  

Unnamed: 0,TabNetClassifier_seed_42,TabNetClassifier_seed_43,TabNetClassifier_seed_44,TabNetClassifier_seed_45,TabNetClassifier_seed_46
0,0.668845,0.634002,0.665295,0.673753,0.611043
1,0.138175,0.158296,0.170551,0.161148,0.143295
2,0.754606,0.748117,0.769604,0.748956,0.769263
3,0.772215,0.744379,0.786504,0.761118,0.769494
4,0.720653,0.738022,0.729255,0.721497,0.688198
5,0.569254,0.55135,0.597032,0.578916,0.545902
6,0.787783,0.76123,0.783643,0.767866,0.783468
7,0.569412,0.680548,0.596979,0.621404,0.648933
8,0.124382,0.157565,0.139266,0.128661,0.144687
9,0.174613,0.16518,0.165106,0.159535,0.176325


Unnamed: 0,TabNetClassifier_seed_42,TabNetClassifier_seed_43,TabNetClassifier_seed_44,TabNetClassifier_seed_45,TabNetClassifier_seed_46
0,0.616992,0.60259,0.602178,0.623913,0.614374
1,0.109996,0.116707,0.119328,0.11526,0.113463
2,0.604985,0.593591,0.587148,0.623631,0.624244
3,0.12703,0.127495,0.13085,0.125784,0.125076
4,0.156608,0.149459,0.150756,0.163641,0.153743
5,0.16554,0.160551,0.158838,0.163451,0.158534
6,0.753526,0.759853,0.756801,0.763993,0.75619
7,0.144957,0.13881,0.154822,0.163763,0.14997
8,0.610114,0.597375,0.595995,0.625845,0.612657
9,0.757413,0.757797,0.762128,0.752083,0.748238


--------------------------------------------------
Current Level: meta_level
--------------------------------------------------
------------------------------
Model:RandomForestClassifier
------------------------------
------------------------------
Seed:42
------------------------------
Fold:1 score:0.8093
Fold:2 score:0.8101
Fold:3 score:0.8090
Fold:4 score:0.8093
Fold:5 score:0.8085
Average score:0.8092 (0.0005)
------------------------------


Unnamed: 0,RandomForestClassifier_seed_42
0,0.629174
1,0.142287
2,0.762289
3,0.773264
4,0.72274
5,0.548139
6,0.783596
7,0.621731
8,0.129857
9,0.152417


Unnamed: 0,RandomForestClassifier_seed_42
0,0.60301
1,0.100475
2,0.602926
3,0.118946
4,0.143307
5,0.14694
6,0.770042
7,0.137679
8,0.602311
9,0.761035


In [27]:
# final results
results.head(10)

Unnamed: 0,RandomForestClassifier_seed_42
0,0.60301
1,0.100475
2,0.602926
3,0.118946
4,0.143307
5,0.14694
6,0.770042
7,0.137679
8,0.602311
9,0.761035


### Submit the results

In [28]:
predictions = results.iloc[:, -1].values

In [29]:
# Save the predictions to a CSV file
output = pd.DataFrame({'id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)

In [30]:
# results 
output.head(20)

Unnamed: 0,id,target
0,957919,0.60301
1,957920,0.100475
2,957921,0.602926
3,957922,0.118946
4,957923,0.143307
5,957924,0.14694
6,957925,0.770042
7,957926,0.137679
8,957927,0.602311
9,957928,0.761035
