# Challenge : predict conversions 🏆🏆

This is the template that shows the different steps of the challenge. In this notebook, all the training/predictions steps are implemented for a very basic model (logistic regression with only one variable). Please use this template and feel free to change the preprocessing/training steps to get the model with the best f1-score ! May the force be with you 🧨🧨  

**For a detailed description of this project, please refer to *02-Conversion_rate_challenge.ipynb*.**

# Import libraries

In [25]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio


# Read file with labels

In [26]:
data = pd.read_csv('conversion_data_train.csv')
print('Set with labels (our train+test) :', data.shape)

Set with labels (our train+test) : (284580, 6)


In [27]:
data.head()
data.describe()
data.isna().sum()

country                0
age                    0
new_user               0
source                 0
total_pages_visited    0
converted              0
dtype: int64

* There are no missing values: no imputation is needed
* No label encoder is needed
* new_user, country and source are categorical: no label encoding is needed (they are one-hot encoded in the training pipeline)
* age and total_pages_visited are quantitative: they must be standardized
* WARNING: outliers in age: 123 years!

In [28]:
# Keep only the rows with age < 90
data = data[data['age'] < 90]
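
As a quick sanity check of this kind of filter, the sketch below applies the same boolean mask to a small hypothetical DataFrame (the `toy` values are made up for illustration) and counts how many rows it removes.

```python
import pandas as pd

# Hypothetical toy data: only the 'age' column matters for this check
toy = pd.DataFrame({"age": [17, 25, 34, 61, 111, 123]})

# Same kind of boolean-mask filter: keep rows with age strictly below 90
filtered = toy[toy["age"] < 90]

n_removed = len(toy) - len(filtered)
print(f"Rows removed: {n_removed} out of {len(toy)}")
print("Max age after filtering:", filtered["age"].max())
```

Running the same two `print` lines on the real `data` before and after filtering would show how many rows the age threshold discards.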

# Explore dataset

In [29]:
# The dataset is quite big : you must create a sample of the dataset before making any visualizations !
data_sample = data.sample(10000)

In [30]:
# Visualize pairwise dependencies on the sample (rows with age >= 90 were already filtered out above)
fig = px.scatter_matrix(data_sample)
fig.update_layout(
        title = go.layout.Title(text = "Bivariate analysis", x = 0.5), showlegend = False, 
            autosize=False, height=800, width = 800)
fig.show()

In [31]:
# pip install kaleido

## Preprocessing 
This is a classification problem, since the target Y (`converted`) is categorical.

In [32]:
# Define the features and the target

features_list = ['country', 'age', 'new_user', 'source', 'total_pages_visited']
target_variable = 'converted'

X = data.loc[:, features_list]
Y = data.loc[:, target_variable]

print('Explanatory variables : ', X.columns)
print()

Explanatory variables :  Index(['country', 'age', 'new_user', 'source', 'total_pages_visited'], dtype='object')



In [33]:
# Divide dataset Train set & Test set 
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0)
print("...Done.")
print()

Dividing into train and test sets...
...Done.
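
The confusion matrices printed later in this notebook suggest that converted users are only roughly 3% of the rows. With such an imbalanced target, one option (not used in the split above) is to pass `stratify` to `train_test_split` so that both parts keep the same class ratio. The sketch below is a minimal illustration on hypothetical synthetic data; the 3% rate and the variable names `X_toy`, `y` are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 300 positives out of 10,000 rows (3%)
rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[:300] = 1
X_toy = rng.normal(size=(10_000, 2))

# stratify=y asks train_test_split to keep the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y, test_size=0.1, random_state=0, stratify=y
)

print("Positive rate in train:", y_tr.mean())
print("Positive rate in test :", y_te.mean())
```

In the notebook itself this would amount to adding `stratify=Y` to the existing call, which changes the split and therefore the scores reported below.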



## Training pipeline

In [34]:
# Put here all the preprocessings
print("Encoding categorical features and standardizing numerical features...")

num_featureencoder = StandardScaler()
num_features = ['age', 'total_pages_visited']

cat_featureencoder = OneHotEncoder(drop='first')
cat_features = ['country', 'new_user', 'source']

# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_featureencoder, num_features),
        ('cat', cat_featureencoder, cat_features)
    ])

print("Preprocessing done")

Encoding categorical features and standardizing numerical features...
Preprocessing done


In [35]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5]) # MUST use this syntax because X_train is a numpy array and not a pandas DataFrame anymore
print()

# Use X_test, and the same preprocessings as in training pipeline, 
# but call "transform()" instead of "fit_transform" methods (see example below)

print("Encoding categorical features and standardizing numerical features...")

# Preprocessings on test set
print("Performing preprocessings on test set...")
X_test = preprocessor.transform(X_test)
print("...Done")
print(X_test[0:5,:])

Performing preprocessings on train set...
...Done.
[[ 0.53681877  0.33742062  0.          0.          1.          0.
   0.          1.        ]
 [-0.55226096  0.03809786  0.          0.          1.          0.
   0.          0.        ]
 [ 0.17379219 -0.85987044  0.          0.          1.          1.
   0.          1.        ]
 [ 1.38388079  0.33742062  0.          1.          0.          1.
   0.          1.        ]
 [ 1.26287193 -0.26122491  0.          0.          0.          0.
   0.          1.        ]]

Encoding categorical features and standardizing numerical features...
Performing preprocessings on test set...
...Done
[[-0.4312521   0.63674339  0.          0.          1.          1.
   0.          1.        ]
 [ 2.10993394 -0.85987044  0.          0.          1.          1.
   0.          1.        ]
 [-1.27831412 -1.15919321  0.          0.          1.          0.
   0.          1.        ]
 [-1.0362964  -0.85987044  0.          0.          1.          1.
   0.          0. 
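
The cells above fit the preprocessor on the train set and reuse it with `transform()` on the test set. One way to make this harder to get wrong, sketched here as an option rather than as the method used in this notebook, is to chain the preprocessor and the classifier in a scikit-learn `Pipeline`, so that cross-validation refits the scaler and the encoder inside each fold. The synthetic data, the toy target rule and the `max_depth=5` setting are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data with the same column names as the notebook
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "country": rng.choice(["US", "UK", "China", "Germany"], size=n),
    "age": rng.integers(17, 60, size=n),
    "new_user": rng.integers(0, 2, size=n),
    "source": rng.choice(["Seo", "Ads", "Direct"], size=n),
    "total_pages_visited": rng.integers(1, 25, size=n),
})
# Made-up rule for the toy target: many pages visited means a conversion
toy_target = (toy["total_pages_visited"] > 15).astype(int)

pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["age", "total_pages_visited"]),
        ("cat", OneHotEncoder(drop="first"), ["country", "new_user", "source"]),
    ])),
    ("classifier", DecisionTreeClassifier(max_depth=5, random_state=0)),
])

# The raw DataFrame goes in: preprocessing is refit on each training fold
scores = cross_val_score(pipeline, toy, toy_target, scoring="f1", cv=5)
print("Mean f1 over folds:", scores.mean())
```

A fitted pipeline also accepts raw DataFrames in `predict`, which removes the need to transform the unlabeled set by hand.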

In [36]:
# Train model
print("Train model...")
classifier = DecisionTreeClassifier()
classifier.fit(X_train, Y_train)

scores = cross_val_score(classifier, X_train, Y_train, scoring = "f1", cv=5)
print('The cross-validated is : ', scores.mean())
print('The standard deviation is : ', scores.std())

# Grid of values to be tested
# (min_samples_split must be an integer >= 2: the value 1 makes every fit that uses it fail)
params = {
    'max_depth': [11, 15, 20],
    'min_samples_leaf': [6, 10, 20],
    'min_samples_split': [2, 3]
}
# cv : the number of folds to be used for CV
# No scoring is given, so candidates are ranked with the classifier's default score (accuracy)
gridsearch = GridSearchCV(classifier, param_grid = params, cv = 10)
gridsearch.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch.best_params_)
print("Best validation accuracy gini: ", gridsearch.best_score_)
print("...Done.")

Train model...
The cross-validated is :  0.9841363423395283
The standard deviation is :  0.00015051019262372634
...Done.
Best hyperparameters :  {'max_depth': 11, 'min_samples_leaf': 10, 'min_samples_split': 2}
Best validation accuracy gini:  0.9854794627518351
...Done.
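
The grid search above does not set `scoring`, so candidates are compared on accuracy, while the leaderboard uses the f1-score. A possible variant, shown here on a hypothetical synthetic dataset rather than on the challenge data, passes `scoring="f1"` so that model selection and the final metric agree. The grid values and the names `X_toy`, `y_toy`, `toy_params` and `toy_search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced toy problem (about 10% positives)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)

toy_params = {
    "max_depth": [3, 5],
    "min_samples_leaf": [5, 10],
    "min_samples_split": [2, 4],  # integer values below 2 are rejected by scikit-learn
}

# scoring="f1" makes the search rank candidates by f1-score instead of accuracy
toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=toy_params,
    scoring="f1",
    cv=5,
)
toy_search.fit(X_toy, y_toy)

print("Best hyperparameters:", toy_search.best_params_)
print("Best cross-validated f1-score:", toy_search.best_score_)
```

With this setting, `best_score_` is a mean f1-score over the folds, so it can be compared directly with the f1-scores computed in the performance assessment.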


In [37]:
# Perform 3-fold cross-validation to evaluate the generalized R2 score obtained with a Ridge model
# (Ridge is a regression model: this R2 score is not the f1-score used for the challenge)
from sklearn.linear_model import Ridge

print("3-fold cross-validation...")
regressor = Ridge()
scores = cross_val_score(regressor, X_train, Y_train, cv=3)
print('The cross-validated R2-score is : ', scores.mean())
print('The standard deviation is : ', scores.std())

In [None]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = gridsearch.best_estimator_.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

# Predictions on test set
print("Predictions on test set...")
Y_test_pred = gridsearch.best_estimator_.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]

Predictions on test set...
...Done.
[0 0 0 ... 0 0 0]



## Performance assessment

In [None]:
# WARNING : Use the same score as the one that will be used by Kaggle !
# Here, the f1-score will be used to assess the performances on the leaderboard
print("f1-score on train set : ", f1_score(Y_train, Y_train_pred))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred))

# Concatenate our train and test set to train your best classifier on all data with labels
X_total = np.append(X_train,X_test,axis=0)
Y_total = np.append(Y_train,Y_test)

gridsearch.best_estimator_.fit(X_total,Y_total)
Y_total_pred = gridsearch.best_estimator_.predict(X_total)

print("f1-score on entire set : ", f1_score(Y_total, Y_total_pred))

f1-score on train set :  0.7729158303224942
f1-score on test set :  0.7466820542412003
f1-score on entire set :  0.772107003306282


In [None]:
# You can also check more performance metrics to better understand what your model is doing
print("Confusion matrix on train set : ")
print(confusion_matrix(Y_train, Y_train_pred))
print()
print("Confusion matrix on test set : ")
print(confusion_matrix(Y_test, Y_test_pred))
print()

print("Confusion matrix on entire set : ")
print(confusion_matrix(Y_total, Y_total_pred))
print()

# Visualize confusion matrices
_ , ax = plt.subplots() # Get subplot from matplotlib
ax.set(title="Confusion Matrix on Train set") # Set a title that we will add into ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(gridsearch, X_train, Y_train, ax=ax) # ConfusionMatrixDisplay from sklearn
plt.show()

_ , ax = plt.subplots() # Get subplot from matplotlib
ax.set(title="Confusion Matrix on Test set") # Set a title that we will add into ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(gridsearch, X_test, Y_test, ax=ax) # ConfusionMatrixDisplay from sklearn
plt.show()


Confusion matrix on train set : 
[[246950    950]
 [  2444   5776]]

Confusion matrix on test set : 
[[27372   128]
 [  311   647]]

Confusion matrix on entire set : 
[[274365   1035]
 [  2756   6422]]
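
The f1-score can be recomputed by hand from a confusion matrix, which helps to see where the model loses points. The short sketch below uses the counts of the test-set confusion matrix printed above; the variable names are illustrative only.

```python
# Counts copied from the test-set confusion matrix printed above
# (scikit-learn layout: rows are true classes, columns are predicted classes)
tn, fp = 27372, 128
fn, tp = 311, 647

precision = tp / (tp + fp)  # share of predicted conversions that are real
recall = tp / (tp + fn)     # share of real conversions that are detected
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1-score  = {f1:.4f}")
```

This gives a precision of about 0.83 and a recall of about 0.68, and reproduces the test f1-score of about 0.7467 reported earlier: the model misses roughly a third of the real conversions, which is what limits the score.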



### Predictions on the TEST set + writing the file

In [None]:
# Read data without labels
data_without_labels = pd.read_csv('conversion_data_test.csv')
print('Prediction set (without labels) :', data_without_labels.shape)

# Warning : check consistency of features_list (must be the same as the features
# used by your best classifier)
features_list = ['country', 'age', 'new_user', 'source', 'total_pages_visited']
X_without_labels = data_without_labels.loc[:, features_list]

# Convert pandas DataFrames to numpy arrays before using scikit-learn
#print("Convert pandas DataFrames to numpy arrays...")
#X_without_labels = X_without_labels.values
#print("...Done")

# print(X_without_labels[0:5,:])

Prediction set (without labels) : (31620, 5)


In [None]:
# WARNING : PUT HERE THE SAME PREPROCESSING AS FOR YOUR TEST SET
# CHECK YOU ARE USING X_without_labels
print("Encoding categorical features and standardizing numerical features...")

X_without_labels = preprocessor.transform(X_without_labels)
print("...Done")
print(X_without_labels[0:5,:])

Encoding categorical features and standardizing numerical features...
...Done
[[-0.31024324  3.33064829  0.          1.          0.          0.
   0.          1.        ]
 [-1.0362964   0.03809786  0.          1.          0.          1.
   1.          0.        ]
 [ 0.17379219 -1.15919321  0.          0.          0.          1.
   0.          1.        ]
 [ 0.17379219  0.33742062  0.          0.          1.          1.
   0.          0.        ]
 [-0.67326982 -0.56054768  0.          0.          0.          0.
   0.          1.        ]]


In [None]:
# Make predictions and dump to file
# WARNING : MAKE SURE THE FILE IS A CSV WITH ONE COLUMN NAMED 'converted' AND NO INDEX !
# WARNING : FILE NAME MUST HAVE FORMAT 'conversion_data_test_predictions_[name].csv'
# where [name] is the name of your team/model separated by a '-'
# For example : [name] = AURELIE-model1
data_end = {
    'converted': gridsearch.best_estimator_.predict(X_without_labels)
}

Y_predictions = pd.DataFrame(columns=['converted'],data=data_end)
Y_predictions.to_csv('conversion_data_test_predictions_CTang-model8.csv', index=False)
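
Before submitting, it can be worth reading the file back to confirm that it has the required format (a single `converted` column and no index column). The sketch below does this with hypothetical predictions and a hypothetical file name written to a temporary folder; only the column name and the file-name pattern come from the instructions above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for the classifier's output
toy_predictions = np.array([0, 1, 0, 0, 1])

# Hypothetical file name following the required pattern, written to a temp folder
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "conversion_data_test_predictions_EXAMPLE-model1.csv")

pd.DataFrame({"converted": toy_predictions}).to_csv(path, index=False)

# Read the file back and check the expected format
check = pd.read_csv(path)
assert list(check.columns) == ["converted"], "expected a single 'converted' column"
assert set(check["converted"].unique()) <= {0, 1}, "predictions must be 0 or 1"
print(check.shape)
```

The same two assertions can be run on the real submission file, together with a check that its number of rows matches the unlabeled set.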
