# Challenge : predict conversions 🏆🏆

This template shows the different steps of the challenge. The notebook implements all the training and prediction steps for a basic baseline model (a logistic regression on five features). Please use this template and feel free to change the preprocessing and training steps to get the model with the best f1-score! May the force be with you 🧨🧨  

**For a detailed description of this project, please refer to *02-Conversion_rate_challenge.ipynb*.**

# Import libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import f1_score, confusion_matrix

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

In [2]:
data = pd.read_csv('conversion_data_train.csv')
print('Set with labels (our train+test) :', data.shape)
data_sample = data.sample(10000)

Set with labels (our train+test) : (284580, 6)


## Choose variables to use in the model, and create train and test sets
**From the EDA, we know that the most useful feature is total_pages_visited. The baseline below uses it together with new_user, age, country and source: in the next cells, we preprocess these features and train a simple logistic regression.**

In [3]:
features_list = ['total_pages_visited','new_user','age','country','source']
numeric_indices = [0,2]
categorical_indices = [1,3,4]
target_variable = 'converted'
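
The hard-coded positions in `numeric_indices` and `categorical_indices` have to stay in sync with `features_list` whenever a feature is added or removed. A minimal sketch, assuming the same numeric/categorical split as above, derives the positions from the names instead. The lists `numeric_features` and `categorical_features` are introduced here for illustration only.

```python
# Feature names as used in the cell above.
features_list = ['total_pages_visited', 'new_user', 'age', 'country', 'source']

# Hypothetical helper lists: which features are treated as numeric or categorical.
numeric_features = ['total_pages_visited', 'age']
categorical_features = ['new_user', 'country', 'source']

# Look up each name's position so the indices follow features_list automatically.
numeric_indices = [features_list.index(name) for name in numeric_features]
categorical_indices = [features_list.index(name) for name in categorical_features]

print(numeric_indices)      # [0, 2]
print(categorical_indices)  # [1, 3, 4]
```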

In [4]:
X = data.loc[:, features_list]
Y = data.loc[:, target_variable]

print('Explanatory variables : ', X.columns)
print()

Explanatory variables :  Index(['total_pages_visited', 'new_user', 'age', 'country', 'source'], dtype='object')



In [5]:
# Split the dataset into a train set and a test set
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0, stratify=Y)
print("...Done.")
print()

Dividing into train and test sets...
...Done.
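
The `stratify=Y` argument matters here because conversions are rare. The sketch below, which uses synthetic labels rather than the challenge data, illustrates that a stratified split keeps the share of positive examples nearly identical in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (3% positives); this is not the challenge data.
y = np.array([1] * 300 + [0] * 9700)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

# With stratify=y, both subsets keep the original positive rate.
print("positive rate, full  :", y.mean())
print("positive rate, train :", y_tr.mean())
print("positive rate, test  :", y_te.mean())
```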



In [6]:
# Transformer for numeric features: standardize to zero mean and unit variance
numeric_transformer = StandardScaler()

# Transformer for categorical features: one-hot encode, dropping the first category
categorical_transformer = OneHotEncoder(drop="first")

# Apply each transformer to its own columns
feature_encoder = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_indices),    
        ('num', numeric_transformer, numeric_indices)
        ]
    )

# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train.head())
X_train = feature_encoder.fit_transform(X_train)
print('...Done.')
print(X_train[0:5,:])
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test.head())
X_test = feature_encoder.transform(X_test) # Don't fit again !!
print('...Done.')
print(X_test[0:5,:])
print()

# No preprocessing on the target
print("No preprocessing needed for target, already in 0 and 1")

Performing preprocessings on train set...
        total_pages_visited  new_user  age country  source
178877                    2         0   23   China  Direct
215523                    2         0   28   China     Ads
73318                     7         1   30   China     Seo
58164                     3         1   37      UK     Seo
234640                    8         1   31      US     Ads
...Done.
[[ 0.          0.          0.          0.          1.          0.
  -0.85939501 -0.91458053]
 [ 0.          0.          0.          0.          0.          0.
  -0.85939501 -0.30994956]
 [ 1.          0.          0.          0.          0.          1.
   0.63639894 -0.06809718]
 [ 1.          0.          1.          0.          0.          1.
  -0.56023622  0.77838618]
 [ 1.          0.          0.          1.          0.          0.
   0.93555773  0.05282902]]

Performing preprocessings on test set...
        total_pages_visited  new_user  age country  source
269237                   14 
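
The "Don't fit again" comment above is worth a small illustration. In this sketch on toy numbers (not the challenge data), `transform` reuses the mean and standard deviation learned from the train values, whereas refitting on the test values would silently use different statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values for a single numeric feature.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

# Fit on train only: the scaler stores the train mean and standard deviation.
scaler = StandardScaler().fit(train)
print("train mean :", scaler.mean_, "train std :", scaler.scale_)

# Correct: the test value is scaled with the train statistics.
test_scaled = scaler.transform(test)
print("transform only :", test_scaled)

# Incorrect: refitting on the test set centers it on its own mean.
refit = StandardScaler().fit_transform(test)
print("refit on test  :", refit)
```

The two results differ, and only the first one puts the test value on the scale the classifier was trained on.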

## Training pipeline

In [7]:
# Train model
print("Train model...")
classifier = LogisticRegressionCV(cv=5) # 
classifier.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


## Test pipeline

## Performance assessment

In [8]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = classifier.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = classifier.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]

Predictions on test set...
...Done.
[0 0 0 ... 0 0 0]



In [9]:
# WARNING : Use the same score as the one that will be used by Kaggle !
# Here, the f1-score will be used to assess the performances on the leaderboard
print("f1-score on train set : ", f1_score(Y_train, Y_train_pred))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred))

f1-score on train set :  0.7642058764473596
f1-score on test set :  0.7590799031476998


In [10]:
# You can also check more performance metrics to better understand what your model is doing
print("Confusion matrix on train set : ")
print(confusion_matrix(Y_train, Y_train_pred))
print()
print("Confusion matrix on test set : ")
print(confusion_matrix(Y_test, Y_test_pred))
print()

Confusion matrix on train set : 
[[246890    970]
 [  2553   5709]]

Confusion matrix on test set : 
[[27433   107]
 [  291   627]]
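
The f1-scores reported earlier can be recovered from these confusion matrices. The sketch below assumes scikit-learn's layout `[[TN, FP], [FN, TP]]`; the helper `f1_from_confusion` is written for illustration and is not part of the template.

```python
def f1_from_confusion(cm):
    """Return precision, recall and F1 from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the output above.
train_cm = [[246890, 970], [2553, 5709]]
test_cm = [[27433, 107], [291, 627]]

for name, cm in [("train", train_cm), ("test", test_cm)]:
    precision, recall, f1 = f1_from_confusion(cm)
    print(f"{name}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

On the test set, recall is lower than precision: 291 of the 918 actual conversions are missed, so recall is where this baseline loses most of its f1-score.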



In [11]:
# # Perform 10-fold cross-validation 
# from sklearn.model_selection import cross_val_score
# print("10-fold cross-validation...")
# scores = cross_val_score(classifier, X_train, Y_train, cv=10)
# print('The cross-validated accuracy is : ', scores.mean())
# print('The standard deviation is : ', scores.std())
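
The commented-out cell above reports accuracy, which is the default score for a classifier, while the leaderboard uses the f1-score. A possible variant, shown on synthetic data standing in for the preprocessed `X_train` and `Y_train`, passes `scoring="f1"` so the cross-validation measures the same metric as the challenge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data; a stand-in for the preprocessed training set.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0
)

model = LogisticRegression(max_iter=1000)

# scoring="f1" makes each fold report the f1-score instead of accuracy.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")
print("Cross-validated f1-score :", scores.mean())
print("Standard deviation :", scores.std())
```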

# Train best classifier on all data and use it to make predictions on X_without_labels
**Before making predictions on the file conversion_data_test.csv, let's train our model on ALL the data that was in conversion_data_train.csv. This sometimes improves the score slightly, because the model is trained on more examples.**

In [12]:
# Concatenate our train and test set to train your best classifier on all data with labels
X = np.append(X_train,X_test,axis=0)
Y = np.append(Y_train,Y_test)

classifier.fit(X,Y)

LogisticRegressionCV(cv=5)
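
Note that the cell above refits only the classifier: `feature_encoder` still holds the statistics learned from the train split. One possible alternative, sketched here on made-up data, wraps the preprocessing and the classifier in a scikit-learn `Pipeline`, so a single `fit` call refits everything on all labelled rows. The step names `preprocess` and `classify` and the plain `LogisticRegression` are choices made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the labelled data; values are made up for illustration.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "total_pages_visited": rng.integers(1, 25, size=n),
    "new_user": rng.integers(0, 2, size=n),
    "age": rng.integers(17, 60, size=n),
    "country": rng.choice(["China", "Germany", "UK", "US"], size=n),
    "source": rng.choice(["Ads", "Direct", "Seo"], size=n),
})
# Toy target: sessions with many page views convert.
demo["converted"] = (demo["total_pages_visited"] > 15).astype(int)

features = ["total_pages_visited", "new_user", "age", "country", "source"]
preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(drop="first"), ["new_user", "country", "source"]),
    ("num", StandardScaler(), ["total_pages_visited", "age"]),
])
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classify", LogisticRegression(max_iter=1000)),
])

# One fit call refits the encoder, the scaler and the classifier on all rows.
model.fit(demo[features], demo["converted"])
predictions = model.predict(demo[features])
print(predictions[:10])
```

With this layout, `model.predict` can be called directly on raw feature columns, which removes the risk of forgetting the preprocessing step on the unlabelled file.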

In [13]:
print("Predictions on full set...")
Y_pred = classifier.predict(X)
print("...Done.")
print(Y_pred)
print()

print("f1-score on full set : ", f1_score(Y,Y_pred))

Predictions on full set...
...Done.
[0 0 0 ... 0 0 0]

f1-score on full set :  0.7637481910274964


In [14]:
# Read data without labels
data_without_labels = pd.read_csv('conversion_data_test.csv')
print('Prediction set (without labels) :', data_without_labels.shape)

# Warning : check consistency of features_list (must be the same as the features 
# used by your best classifier)
X_without_labels = data_without_labels.loc[:, features_list]

# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_without_labels = X_without_labels.values
print("...Done")

print(X_without_labels[0:5,:])

Prediction set (without labels) : (31620, 5)
Convert pandas DataFrames to numpy arrays...
...Done
[[16 0 28 'UK' 'Seo']
 [5 1 22 'UK' 'Direct']
 [1 1 32 'China' 'Seo']
 [6 1 32 'US' 'Ads']
 [3 0 25 'China' 'Seo']]


In [15]:
# WARNING : PUT HERE THE SAME PREPROCESSING AS FOR YOUR TEST SET
# CHECK YOU ARE USING X_without_labels
print("Encoding categorical features and standardizing numerical features...")

X_without_labels = feature_encoder.transform(X_without_labels)
print("...Done")
print(X_without_labels[0:5,:])

Encoding categorical features and standardizing numerical features...
...Done
[[ 0.          0.          1.          0.          0.          1.
   3.32882805 -0.30994956]
 [ 1.          0.          1.          0.          1.          0.
   0.03808136 -1.03550673]
 [ 1.          0.          0.          0.          0.          1.
  -1.1585538   0.17375521]
 [ 1.          0.          0.          1.          0.          0.
   0.33724015  0.17375521]
 [ 0.          0.          0.          0.          0.          1.
  -0.56023622 -0.67272814]]




In [16]:
# Make predictions and dump to file
# WARNING : MAKE SURE THE FILE IS A CSV WITH ONE COLUMN NAMED 'converted' AND NO INDEX !
# WARNING : FILE NAME MUST HAVE FORMAT 'conversion_data_test_predictions_[name].csv'
# where [name] is the name of your team/model separated by a '-'
# For example : [name] = AURELIE-model1
data = {
    'converted': classifier.predict(X_without_labels)
}

Y_predictions = pd.DataFrame(columns=['converted'],data=data)
Y_predictions.to_csv('conversion_data_test_predictions_LogRegCV5.csv', index=False)
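
Before submitting, it can help to read the file back and check the format described in the warnings above. This is a minimal sketch with made-up predictions and a temporary folder; the file name suffix `EXAMPLE-model1` is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Made-up predictions standing in for classifier.predict(X_without_labels).
Y_predictions = pd.DataFrame({"converted": [0, 1, 0, 0, 1]})

# Written to a temporary folder here; the name follows the required pattern.
path = os.path.join(
    tempfile.mkdtemp(), "conversion_data_test_predictions_EXAMPLE-model1.csv"
)
Y_predictions.to_csv(path, index=False)

# Read the file back and check: a single 'converted' column, only 0/1 values.
check = pd.read_csv(path)
assert list(check.columns) == ["converted"]
assert set(check["converted"].unique()) <= {0, 1}
print(check.shape)
```

With the real predictions, the shape should be `(31620, 1)`, matching the number of rows in conversion_data_test.csv.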


## Analyzing the coefficients and interpreting the result
**The template was first written for a single feature (total_pages_visited), where there is no feature importance to analyze 🤔 The model trained above uses five features, so its coefficients are worth a look.**

**Once you've settled on your features, please take some time to analyze the model's parameters and try to find levers for action to improve the newsletter's conversion rate 😎😎**