# Challenge : predict conversions 🏆🏆

This template shows the different steps of the challenge. In this notebook, all the training and prediction steps are implemented for a very basic model (logistic regression with only one variable). Please use this template and feel free to change the preprocessing and training steps to get the model with the best f1-score! May the force be with you 🧨🧨

**For a detailed description of this project, please refer to *02-Conversion_rate_challenge.ipynb*.**

# Import libraries

In [16]:
!pip install -q xgboost

In [17]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay, classification_report
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, StackingClassifier, BaggingClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio


# Read file with labels

In [18]:
import datetime
now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))

2023-04-28 13:57


In [19]:
data = pd.read_csv('conversion_data_train.csv')
print('Set with labels (our train+test) :', data.shape)

Set with labels (our train+test) : (284580, 6)


In [20]:
data.head()
data.describe()
data.isna().sum()

country                0
age                    0
new_user               0
source                 0
total_pages_visited    0
converted              0
dtype: int64

* There are no missing values: no imputation is needed
* No label encoder is needed
* new_user, country and source are categorical: they are one-hot encoded in the preprocessing step
* age and total_pages_visited are quantitative: they must be standardized
* WARNING: outliers in age: 123 years old!
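As a rough illustration of what standardizing a quantitative column means, the sketch below computes z-scores by hand with the standard library. The ages are made-up values, not taken from the dataset. StandardScaler divides by the population standard deviation, which is what `statistics.pstdev` computes.

```python
from statistics import fmean, pstdev

# Hypothetical ages, for illustration only (not taken from the dataset)
ages = [17, 25, 30, 42, 61]

mean_age = fmean(ages)
std_age = pstdev(ages)  # population standard deviation (ddof=0), as StandardScaler uses

# z-score: subtract the mean, then divide by the standard deviation
z_scores = [(a - mean_age) / std_age for a in ages]

print([round(z, 3) for z in z_scores])
```

After this transformation the column has mean 0 and standard deviation 1, so features measured on different scales become comparable.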

In [21]:
# Keep only rows with age < 90
data = data[data['age'] < 90]

# Explore dataset

In [22]:
# The dataset is quite big : you must create a sample of the dataset before making any visualizations !
data_sample = data.sample(10000)

In [23]:
# Visualize pairwise dependencies on the sample (ages >= 90 were already filtered out above)
fig = px.scatter_matrix(data_sample)
fig.update_layout(
        title = go.layout.Title(text = "Bivariate analysis", x = 0.5), showlegend = False, 
            autosize=False, height=800, width = 800)
fig.show()

In [24]:
# pip install kaleido

# Preprocessing
The target Y (converted) is categorical, so we need a classification model!

In [25]:
# Feature definition:

features_list = ['country', 'age', 'new_user', 'source', 'total_pages_visited']
target_variable = 'converted'

X = data.loc[:, features_list]
Y = data.loc[:, target_variable]

print('Explanatory variables : ', X.columns)
print()

Explanatory variables :  Index(['country', 'age', 'new_user', 'source', 'total_pages_visited'], dtype='object')



In [26]:
# Divide dataset Train set & Test set 
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0, stratify = Y)
print("...Done.")
print()

Dividing into train and test sets...
...Done.
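The `stratify` argument keeps the class ratio the same in both splits, which matters when positives are rare. Here is a minimal sketch on a made-up label vector. The 10% positive rate is an assumption chosen for illustration, not the dataset's actual conversion rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 100 positives out of 1000 (illustrative ratio only)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0, stratify=y_demo
)

# Both splits should show (almost) the same share of positives
print("positive rate in train:", y_tr.mean())
print("positive rate in test :", y_te.mean())
```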



In [27]:
# Put here all the preprocessings
print("Encoding categorical features and standardizing numerical features...")

num_featureencoder = StandardScaler()
num_features = ['age', 'total_pages_visited']

cat_featureencoder = OneHotEncoder(drop='first')
cat_features = ['country', 'new_user', 'source']

# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_featureencoder, num_features),
        ('cat', cat_featureencoder, cat_features)
    ])

print("Preprocessing done")

Encoding categorical features and standardizing numerical features...
Preprocessing done


In [28]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5]) # MUST use this syntax because X_train is a numpy array and not a pandas DataFrame anymore
print()

# Use X_test, and the same preprocessings as in training pipeline, 
# but call "transform()" instead of "fit_transform" methods (see example below)

print("Encoding categorical features and standardizing numerical features...")

# Preprocessings on test set
print("Performing preprocessings on test set...")
X_test = preprocessor.transform(X_test)
print("...Done")
print(X_test[0:5,:])

Performing preprocessings on train set...
...Done.
[[-1.51987065 -0.26099836  0.          0.          1.          0.
   0.          1.        ]
 [ 2.2307148   0.03829244  0.          0.          0.          0.
   0.          1.        ]
 [ 1.86775492 -0.56028917  0.          0.          0.          1.
   0.          1.        ]
 [-1.51987065 -1.15887077  0.          0.          1.          0.
   0.          0.        ]
 [-1.03592414  0.33758325  0.          0.          1.          1.
   0.          0.        ]]

Encoding categorical features and standardizing numerical features...
Performing preprocessings on test set...
...Done
[[-0.79395089  2.73190968  0.          0.          0.          1.
   0.          0.        ]
 [ 0.53690202  0.03829244  0.          0.          1.          0.
   1.          0.        ]
 [-0.18901775 -0.26099836  0.          0.          1.          0.
   1.          0.        ]
 [ 0.0529555   0.93616485  0.          0.          0.          1.
   0.          1. 

# Model Training

c1 = LogisticRegressionCV()
c2 = SGDClassifier(alpha = 0.0001)

c1.fit(X_train, Y_train)
c2.fit(X_train, Y_train)

# Voting
c = VotingClassifier(estimators=[("logistic", c1), ("sgd", c2)], voting='hard') # hard: majority vote on predicted labels; soft voting needs predict_proba, which SGDClassifier's default hinge loss does not provide
c.fit(X_train, Y_train)


Y_train_pred = c.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

# Predictions on test set
print("Predictions on test set...")
Y_test_pred = c.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

# Concatenate our train and test set to train your best classifier on all data with labels
X_total = np.append(X_train,X_test,axis=0)
Y_total = np.append(Y_train,Y_test)

c.fit(X_total,Y_total)
print("Predictions on entire set...")
Y_total_pred = c.predict(X_total)
print("...Done.")

print("f1-score on train set : ", f1_score(Y_train, Y_train_pred))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred))
print("f1-score on entire set : ", f1_score(Y_total, Y_total_pred))
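To make the difference between hard and soft voting concrete, here is a small pure-Python sketch for a single sample. The three probabilities are hypothetical and the 0.5 threshold is an assumption of this illustration. It mimics the idea behind VotingClassifier rather than calling it.

```python
# Hypothetical probabilities of class 1 from three classifiers, for one sample
probas = [0.9, 0.4, 0.45]

# Hard voting: each classifier casts one vote for its predicted label
votes = [1 if p >= 0.5 else 0 for p in probas]
hard_prediction = 1 if sum(votes) > len(votes) / 2 else 0

# Soft voting: average the probabilities, then threshold the mean
mean_proba = sum(probas) / len(probas)
soft_prediction = 1 if mean_proba >= 0.5 else 0

print("hard:", hard_prediction, "| soft:", soft_prediction)
```

The two schemes can disagree: one very confident classifier can outweigh two hesitant ones under soft voting, but not under hard voting.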

In [29]:
now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))
# Perform grid search
print("Grid search...")
sgd = SGDClassifier(alpha = 0.0001) # max_iter is left at its default; raise it if a ConvergenceWarning appears
bag = BaggingClassifier(sgd)

# Grid of values to be tested
params = {
     # no prefix needed here: n_estimators belongs to BaggingClassifier itself, not to the wrapped SGDClassifier
    'n_estimators': [20, 30, 40, 50] # n_estimators is a hyperparameter of the ensemble method
}
print(params)
gridsearch0 = GridSearchCV(bag, param_grid = params, cv = 3) # cv : the number of folds to be used for CV
gridsearch0.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch0.best_params_)
print("Best validation accuracy : ", gridsearch0.best_score_)
print()
print("Accuracy on training set : ", gridsearch0.score(X_train, Y_train))
print("Accuracy on test set : ", gridsearch0.score(X_test, Y_test))

now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))

2023-04-28 13:57
Grid search...
{'n_estimators': [20, 30, 40, 50]}
...Done.
Best hyperparameters :  {'n_estimators': 50}
Best validation accuracy :  0.9860963630011182

Accuracy on training set :  0.9861510229579884
Accuracy on test set :  0.9862604540023895
2023-04-28 14:07
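By default, GridSearchCV ranks candidates with the estimator's own score, which is accuracy for classifiers. Since the leaderboard uses the f1-score, one option is to pass `scoring="f1"` so the search optimizes the same metric. This is a sketch on a small synthetic dataset. The data, the LogisticRegression estimator and the `C` grid are assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic, imbalanced dataset (stand-in for the real data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, weights=[0.9, 0.1], random_state=0
)

# scoring="f1" makes the search rank candidates by f1 instead of accuracy
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1",
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best hyperparameters :", search.best_params_)
print("Best validation f1   :", search.best_score_)
```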


In [30]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred0 = gridsearch0.best_estimator_.predict(X_train)
Y_test_pred0 = gridsearch0.best_estimator_.predict(X_test)
# Concatenate our train and test set to train your best classifier on all data with labels
X_total = np.append(X_train,X_test,axis=0)
Y_total = np.append(Y_train,Y_test)

gridsearch0.best_estimator_.fit(X_total,Y_total)
Y_total_pred0 = gridsearch0.best_estimator_.predict(X_total)

Predictions on training set...


In [31]:
c = gridsearch0.best_estimator_

In [32]:
# WARNING : Use the same score as the one that will be used by Kaggle !
# Here, the f1-score will be used to assess the performances on the leaderboard
print("f1-score on train set : ", f1_score(Y_train, Y_train_pred0))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred0))
print("f1-score on entire set : ", f1_score(Y_total, Y_total_pred0))

f1-score on train set :  0.7591825650078077
f1-score on test set :  0.7620206938527084
f1-score on entire set :  0.7612809315866086
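The gap between an accuracy near 0.99 and an f1-score near 0.76 is typical when the positive class is rare. The sketch below recomputes both metrics from confusion-matrix counts. The counts are hypothetical, chosen only to show the effect, and are not the notebook's actual results.

```python
# Hypothetical confusion-matrix counts for a rare positive class
tp, fp, fn, tn = 60, 15, 25, 2900

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
```

Accuracy is dominated by the many true negatives, while f1 only looks at how well the positive class is found.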


now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))
print("Grid search...")
gradboost = GradientBoostingClassifier()

# Grid of values to be tested
params = {
    'max_depth': [1, 2, 3], # no base_estimator_ prefix because these are all arguments of GradientBoostingClassifier
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3, 4],
    'n_estimators': [2, 4, 6, 8, 10]
}
print(params)
gridsearch1 = GridSearchCV(gradboost, param_grid = params, cv = 3) # cv : the number of folds to be used for CV
gridsearch1.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch1.best_params_)
print("Best validation accuracy : ", gridsearch1.best_score_)
print()
print("Accuracy on training set : ", gridsearch1.score(X_train, Y_train))
print("Accuracy on test set : ", gridsearch1.score(X_test, Y_test))
now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))

now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))
# Perform grid search
print("Grid search...")
xgboost = XGBClassifier()

# Grid of values to be tested
params = {
    'max_depth': [2, 4, 6], # exactly the same role as in scikit-learn
    'min_child_weight': [1, 2, 3], # effect is more or less similar to min_samples_leaf and min_samples_split
    'n_estimators': [2, 4, 6, 8,] # exactly the same role as in scikit-learn
}
print(params)
gridsearch2 = GridSearchCV(xgboost, param_grid = params, cv = 3) # cv : the number of folds to be used for CV
gridsearch2.fit(X_train, Y_train)
print("...Done.")
print("Best hyperparameters : ", gridsearch2.best_params_)
print("Best validation accuracy : ", gridsearch2.best_score_)
print()
print("Accuracy on training set : ", gridsearch2.score(X_train, Y_train))
print("Accuracy on test set : ", gridsearch2.score(X_test, Y_test))

now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))

# Predictions and results

now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))

# Predictions on training set
# (no refit on the full data here: refitting before predicting on the test set would leak test labels)
print("Predictions on training set...")
Y_train_pred0 = gridsearch0.best_estimator_.predict(X_train)
Y_train_pred1 = gridsearch1.best_estimator_.predict(X_train)
Y_train_pred2 = gridsearch2.best_estimator_.predict(X_train)
print("...Done.")


# Predictions on test set
print("Predictions on test set...")
Y_test_pred0 = gridsearch0.best_estimator_.predict(X_test)
Y_test_pred1 = gridsearch1.best_estimator_.predict(X_test)
Y_test_pred2 = gridsearch2.best_estimator_.predict(X_test)
print("...Done.")
print()

# Concatenate our train and test set to train your best classifier on all data with labels
X_total = np.append(X_train,X_test,axis=0)
Y_total = np.append(Y_train,Y_test)

gridsearch0.best_estimator_.fit(X_total,Y_total)
gridsearch1.best_estimator_.fit(X_total,Y_total)
gridsearch2.best_estimator_.fit(X_total,Y_total)
print("Predictions on entire set...")
Y_total_pred0 = gridsearch0.best_estimator_.predict(X_total)
Y_total_pred1 = gridsearch1.best_estimator_.predict(X_total)
Y_total_pred2 = gridsearch2.best_estimator_.predict(X_total)
print("...Done.")

## Performance assessment


# WARNING : Use the same score as the one that will be used by Kaggle !
# Here, the f1-score will be used to assess the performances on the leaderboard
print("[Bagging SGD] f1-score on train set : ", f1_score(Y_train, Y_train_pred0))
print("[Bagging SGD] f1-score on test set : ", f1_score(Y_test, Y_test_pred0))
print("[Bagging SGD] f1-score on entire set : ", f1_score(Y_total, Y_total_pred0))

print("[Gradient boosting] f1-score on train set : ", f1_score(Y_train, Y_train_pred1))
print("[Gradient boosting] f1-score on test set : ", f1_score(Y_test, Y_test_pred1))
print("[Gradient boosting] f1-score on entire set : ", f1_score(Y_total, Y_total_pred1))

print("[XGBoost] f1-score on train set : ", f1_score(Y_train, Y_train_pred2))
print("[XGBoost] f1-score on test set : ", f1_score(Y_test, Y_test_pred2))
print("[XGBoost] f1-score on entire set : ", f1_score(Y_total, Y_total_pred2))

# Visualize confusion matrices
_ , ax = plt.subplots() # Get subplot from matplotlib
ax.set(title="Confusion Matrix on Train set") # Set a title that we will add into ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(c, X_train, Y_train, ax=ax) # c is the selected classifier defined above
plt.show()

_ , ax = plt.subplots() # Get subplot from matplotlib
ax.set(title="Confusion Matrix on Test set") # Set a title that we will add into ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(c, X_test, Y_test, ax=ax) # c is the selected classifier defined above
plt.show()

_ , ax = plt.subplots() # Get subplot from matplotlib
ax.set(title="Confusion Matrix on entire set") # Set a title that we will add into ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(c, X_total, Y_total, ax=ax) # c is the selected classifier defined above
plt.show()


print("Report on Train set : \n" + classification_report(Y_train, Y_train_pred))
print("Report on Test set : \n" + classification_report(Y_test, Y_test_pred))
print("Report on entire set : \n" + classification_report(Y_total, Y_total_pred))

### Predictions on the TEST set + writing the submission file

In [33]:
# Read data without labels
data_without_labels = pd.read_csv('conversion_data_test.csv')
print('Prediction set (without labels) :', data_without_labels.shape)

# Warning : check consistency of features_list (must be the same as the features
# used by your best classifier)
features_list = ['country', 'age', 'new_user', 'source', 'total_pages_visited']
X_without_labels = data_without_labels.loc[:, features_list]

# Convert pandas DataFrames to numpy arrays before using scikit-learn
#print("Convert pandas DataFrames to numpy arrays...")
#X_without_labels = X_without_labels.values
#print("...Done")

# print(X_without_labels[0:5,:])

Prediction set (without labels) : (31620, 5)


In [34]:
# WARNING : PUT HERE THE SAME PREPROCESSING AS FOR YOUR TEST SET
# CHECK YOU ARE USING X_without_labels
print("Encoding categorical features and standardizing numerical features...")

X_without_labels = preprocessor.transform(X_without_labels)
print("...Done")
print(X_without_labels[0:5,:])

Encoding categorical features and standardizing numerical features...
...Done
[[-0.31000438  3.33049128  0.          1.          0.          0.
   0.          1.        ]
 [-1.03592414  0.03829244  0.          1.          0.          1.
   1.          0.        ]
 [ 0.17394213 -1.15887077  0.          0.          0.          1.
   0.          1.        ]
 [ 0.17394213  0.33758325  0.          0.          1.          1.
   0.          0.        ]
 [-0.67296426 -0.56028917  0.          0.          0.          0.
   0.          1.        ]]


In [35]:

now = datetime.datetime.now()
print(now.strftime("%Y-%m-%d %H:%M"))

2023-04-28 14:10


In [36]:
# Make predictions and dump to file
# WARNING : MAKE SURE THE FILE IS A CSV WITH ONE COLUMN NAMED 'converted' AND NO INDEX !
# WARNING : FILE NAME MUST HAVE FORMAT 'conversion_data_test_predictions_[name].csv'
# where [name] is the name of your team/model separated by a '-'
# For example : [name] = AURELIE-model1
data_end = {
    'converted': c.predict(X_without_labels)
}

Y_predictions = pd.DataFrame(columns=['converted'],data=data_end)
Y_predictions.to_csv('conversion_data_test_predictions_CTang-model30.csv', index=False)
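
One way to check the submission format (a single column named 'converted', no index) is to write the file and read it back. The sketch below does this in memory. The predictions are made-up values standing in for the output of c.predict(X_without_labels), and the in-memory buffer replaces the real CSV file.

```python
import io
import pandas as pd

# Made-up predictions standing in for c.predict(X_without_labels)
demo_predictions = [0, 1, 0, 0]

submission = pd.DataFrame({"converted": demo_predictions})

# Write without the index, then read back to check the layout
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns), reloaded.shape)
```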
