# 🛳 Fast kernel with Intel® Extension for Scikit-learn* - Spaceship Titanic

For classical machine learning, the most popular Python library is Scikit-learn. With Scikit-learn you can fit models and search for optimal parameters, but these runs sometimes take hours. Anyone who uses Scikit-learn would be interested in speeding this up.

I want to show you how to use the Scikit-learn library and get results faster without changing your code. To do this, we will use another Python library, [**Intel® Extension for Scikit-learn***](https://github.com/intel/scikit-learn-intelex). It accelerates Scikit-learn and does not require you to change code written for Scikit-learn.

I will show you how to **speed up** your kernel without changing your code!

You can find more information in [Introduction to scikit-learn-intelex](https://www.kaggle.com/lordozvlad/introduction-to-scikit-learn-intelex)!

### Intel® Extension for Scikit-learn installation:

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import warnings
import gc
from IPython.display import HTML
warnings.filterwarnings("ignore")

from math import sin, cos, pi
from timeit import default_timer as timer
import matplotlib.pyplot as plt

random_state = 42

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

### Reading Data

In [None]:
PATH_TRAIN      = '../input/spaceship-titanic/train.csv'
PATH_TEST       = '../input/spaceship-titanic/test.csv'
PATH_SUBMISSION = '../input/spaceship-titanic/sample_submission.csv'

In [None]:
train_data = pd.read_csv(PATH_TRAIN)
test_data  = pd.read_csv(PATH_TEST)
submission = pd.read_csv(PATH_SUBMISSION)

In [None]:
train_data.info()

In [None]:
train_data[:5]

### Data Pre-Processing

In [None]:
from sklearn.impute import SimpleImputer

imputer_cols = ["Age", "FoodCourt", "ShoppingMall", "Spa", "VRDeck" ,"RoomService"]
imputer = SimpleImputer(strategy = 'median')

imputer.fit(train_data[imputer_cols])

train_data[imputer_cols] = imputer.transform(train_data[imputer_cols])
test_data[imputer_cols] = imputer.transform(test_data[imputer_cols])

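# Assign the result back instead of calling fillna(inplace=True) on a column selection
train_data["HomePlanet"] = train_data["HomePlanet"].fillna('Z')
test_data["HomePlanet"] = test_data["HomePlanet"].fillna('Z')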

In [None]:
from sklearn.preprocessing import LabelEncoder

label_cols = ["HomePlanet", "CryoSleep","Cabin", "Destination" ,"VIP"]

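def label_encoder(train, test, columns):
    for col in columns:
        train[col] = train[col].astype(str)
        test[col] = test[col].astype(str)
        # Fit one encoder on train and test together so that the same
        # category gets the same integer code in both frames.
        encoder = LabelEncoder().fit(pd.concat([train[col], test[col]]))
        train[col] = encoder.transform(train[col])
        test[col] = encoder.transform(test[col])
    return train, test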

train_data, test_data = label_encoder(train_data, test_data, label_cols)
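
Label codes are only comparable between the train and test frames when both are encoded with the same mapping. The dependency-free sketch below illustrates this on made-up values. `fit_codes` is a hypothetical stand-in that mimics how `LabelEncoder` numbers the sorted distinct values; it is not part of the original kernel.

```python
def fit_codes(values):
    """Map each distinct value to an integer in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

# Made-up columns: "Earth" and "Z" happen to be absent from the test column.
train_col = ["Earth", "Europa", "Mars", "Z"]
test_col = ["Europa", "Mars", "Mars"]

# Encoding the test column on its own shifts the codes.
separate_map = fit_codes(test_col)
separate = [separate_map[v] for v in test_col]

# Encoding with a mapping built from both columns keeps them aligned with train.
shared_map = fit_codes(train_col + test_col)
shared = [shared_map[v] for v in test_col]

print(separate)  # [0, 1, 1]
print(shared)    # [1, 2, 2]
```

With separate mappings, "Europa" is 1 in train but 0 in test, so a model trained on the train codes would misread the test rows.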

In [None]:
from sklearn.model_selection import train_test_split

train_data.drop(["PassengerId", "Name" ,"Cabin"] , axis = 1 ,inplace = True)
test_data.drop(["PassengerId", "Name" ,"Cabin"] , axis = 1 ,inplace = True)

X, y = train_data.drop(['Transported'] , axis = 1), train_data['Transported']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state = 777,
                                                    test_size = 0.1)

<a id="top"></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Feature selection</h3>
    
   * [Boruta](#1)
   * [RFE](#2)
   * [Shap](#3) 
   

<a id="1"></a>
### Boruta
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to Feature selection</a>

<center><a><img src="https://miro.medium.com/max/1400/1*J-7kkseitdupEm5adCfHmw.png" alt="header" border="0" width=800 height=400 class="center"></a></center>

In [None]:
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

timeFirstI  = timer()
clf = RandomForestClassifier(n_estimators = 200, n_jobs = -1, max_depth = 5)

trans = BorutaPy(clf, random_state = 777)
sel = trans.fit_transform(X.values, y.values)
timeSecondI = timer()

In [None]:
trans.support_

In [None]:
trans.ranking_

In [None]:
print("Total time with Intel Extension: {} seconds".format(timeSecondI - timeFirstI))

### Default Scikit-learn

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()
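
Patching works by swapping the estimator classes that `sklearn` exposes, so estimators have to be imported again after every `patch_sklearn()` or `unpatch_sklearn()` call. That is why the imports are repeated in the cells of this notebook. The sketch below is a minimal illustration and not part of the original kernel. `forest_modules` is a hypothetical helper that reports which module defines `RandomForestClassifier` in each state, and it returns `None` when the extension is not installed.

```python
def forest_modules():
    """Return (patched_module, stock_module) for RandomForestClassifier,
    or None if scikit-learn-intelex is not available."""
    try:
        from sklearnex import patch_sklearn, unpatch_sklearn
    except ImportError:
        return None

    patch_sklearn()
    try:
        # Import after patching so the accelerated class is picked up.
        from sklearn.ensemble import RandomForestClassifier as PatchedForest
        patched_module = PatchedForest.__module__
    finally:
        # Always restore stock scikit-learn, even if the import fails.
        unpatch_sklearn()

    # Import again after unpatching to get the stock class.
    from sklearn.ensemble import RandomForestClassifier as StockForest
    return patched_module, StockForest.__module__

modules = forest_modules()
print(modules)
```

The helper leaves Scikit-learn unpatched when it returns, which matches the state of the notebook at this point.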

In [None]:
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

timeFirstD  = timer()
clf = RandomForestClassifier(n_estimators = 200, n_jobs = -1, max_depth = 5)

trans = BorutaPy(clf, random_state = 777)
sel = trans.fit_transform(X.values, y.values)
timeSecondD = timer()

In [None]:
trans.support_

In [None]:
trans.ranking_

In [None]:
print("Total time with default Scikit-learn: {} seconds".format(timeSecondD - timeFirstD))

In [None]:
boruta_speedup = round((timeSecondD - timeFirstD) / (timeSecondI - timeFirstI), 2)
HTML(f'<h2>Boruta speedup: {boruta_speedup}x</h2>'
     f'(from {round((timeSecondD - timeFirstD), 2)} to {round((timeSecondI - timeFirstI), 2)} seconds)')
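
The start, stop and ratio pattern above is repeated for every method in this notebook. It could be wrapped in a small helper, sketched here with only the standard library. `time_call` and `speedup` are hypothetical names, and the numbers passed to `speedup` are illustrative rather than measured.

```python
from timeit import default_timer as timer

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = timer()
    result = func(*args, **kwargs)
    return result, timer() - start

def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline time to accelerated time, rounded to 2 decimals."""
    return round(baseline_seconds / accelerated_seconds, 2)

# Time an arbitrary call; sum() stands in for a model fit here.
total, elapsed = time_call(sum, range(1_000_000))
print(f"sum took {elapsed:.4f} seconds")

# Illustrative numbers only: 12 s baseline versus 3 s accelerated.
print(speedup(12.0, 3.0))  # 4.0
```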

<a id="2"></a>
### RFE
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to Feature selection</a>

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

In [None]:
from yellowbrick.model_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

rfecv = RFECV(
    estimator = RandomForestClassifier(n_estimators = 300, random_state = 777),
    n_jobs = -1,
    cv = 5
)

timeFirstI  = timer()
sel = rfecv.fit(X_train, y_train)
timeSecondI = timer()
rfecv.show()
pass

In [None]:
print("Total time with Intel Extension: {} seconds".format(timeSecondI - timeFirstI))

### Default Scikit-learn

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

In [None]:
from yellowbrick.model_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

rfecv = RFECV(
    estimator = RandomForestClassifier(n_estimators = 300, random_state = 777),
    n_jobs = -1,
    cv = 5
)

timeFirstD  = timer()
sel = rfecv.fit(X_train, y_train)
timeSecondD = timer()
rfecv.show()
pass

In [None]:
print("Total time with default Scikit-learn: {} seconds".format(timeSecondD - timeFirstD))

In [None]:
rfe_speedup = round((timeSecondD - timeFirstD) / (timeSecondI - timeFirstI), 2)
HTML(f'<h2>RFE speedup: {rfe_speedup}x</h2>'
     f'(from {round((timeSecondD - timeFirstD), 2)} to {round((timeSecondI - timeFirstI), 2)} seconds)')

<a id="3"></a>
### Shap
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to Feature selection</a>

In [None]:
import shap

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

In [None]:
from sklearn.ensemble import RandomForestClassifier

timeFirstI  = timer()
slf = RandomForestClassifier(n_estimators = 200, random_state = 777)
slf.fit(X_train, y_train)
explainer = shap.TreeExplainer(slf)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train, plot_type="bar")
timeSecondI = timer()

In [None]:
print("Total time with Intel Extension: {} seconds".format(timeSecondI - timeFirstI))

### Default Scikit-learn

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

In [None]:
from sklearn.ensemble import RandomForestClassifier

timeFirstD  = timer()
slf = RandomForestClassifier(n_estimators = 200, random_state = 777)
slf.fit(X_train, y_train)
explainer = shap.TreeExplainer(slf)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train, plot_type="bar")
timeSecondD = timer()

In [None]:
print("Total time with default Scikit-learn: {} seconds".format(timeSecondD - timeFirstD))

In [None]:
shap_speedup = round((timeSecondD - timeFirstD) / (timeSecondI - timeFirstI), 2)
HTML(f'<h2>Shap speedup: {shap_speedup}x</h2>'
     f'(from {round((timeSecondD - timeFirstD), 2)} to {round((timeSecondI - timeFirstI), 2)} seconds)')

### Modeling

In [None]:
from catboost import CatBoostClassifier

model = CatBoostClassifier()
model.fit(X_train, y_train)

In [None]:
# Compare against the string 'True' to turn the predicted labels into booleans
y_pred = model.predict(test_data) == 'True'

In [None]:
submission['Transported'] = y_pred
submission.to_csv('submission.csv', index=False)

## Conclusion

**Intel® Extension for Scikit-learn** lets you:
* Use your Scikit-learn code for training and inference without modification.
* Speed up your kernel.

*Please upvote if you liked it.*

## Other notebooks with scikit-learn-intelex usage

### [[predict sales] Stacking with scikit-learn-intelex](https://www.kaggle.com/alexeykolobyanin/predict-sales-stacking-with-scikit-learn-intelex)

### [[TPS-Aug] NuSVR with Intel Extension for Sklearn](https://www.kaggle.com/alexeykolobyanin/tps-aug-nusvr-with-intel-extension-for-sklearn)

### [Using scikit-learn-intelex for What's Cooking](https://www.kaggle.com/kppetrov/using-scikit-learn-intelex-for-what-s-cooking?scriptVersionId=58739642)

### [Fast KNN using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-knn-using-scikit-learn-intelex-for-mnist?scriptVersionId=58738635)

### [Fast SVC using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-mnist?scriptVersionId=58739300)

### [Fast SVC using scikit-learn-intelex for NLP](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-nlp?scriptVersionId=58739339)

### [Fast AutoML with Intel Extension for Scikit-learn](https://www.kaggle.com/lordozvlad/fast-automl-with-intel-extension-for-scikit-learn)

### [[Titanic] AutoML with Intel Extension for Sklearn](https://www.kaggle.com/lordozvlad/titanic-automl-with-intel-extension-for-sklearn)