# KAGGLE-LIKE CHALLENGE
Here you are invited to put everything you have learned about supervised machine learning into practice by building a prediction model on the provided data, in the style of a Kaggle competition.

**How a Kaggle challenge works**
- Kaggle always gives you two datasets:
  - a data_train.csv file containing the features X and the label Y to predict. Use this file to train your models as usual.
  - a data_test.csv file containing the features X in the same format as data_train.csv, but with the labels hidden. Your goal is to make predictions on these data and send them back to Kaggle, so that your model can be evaluated independently.
- Kaggle compares your predictions with the true labels and publishes a leaderboard (teams ranked by score).
- Kaggle announces in advance which metric will be used to evaluate the models: make sure you use the same metric to assess your own models.

**Conversion prediction**

Here, the goal is to build the best possible model for predicting conversions from several explanatory variables. Your models will be evaluated using the f1-score.
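
To see why the choice of metric matters on a dataset where conversions are rare, here is a minimal sketch comparing accuracy and f1-score for a trivial model that never predicts a conversion. The labels are synthetic and the 3% positive rate is an assumption chosen for illustration, not a figure taken from the challenge data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with 3% positives (assumed for illustration)
y_true = np.array([1] * 30 + [0] * 970)

# A trivial "model" that never predicts a conversion
y_pred_all_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_all_zero)
# zero_division=0 avoids a warning when no positive is predicted
f1 = f1_score(y_true, y_pred_all_zero, zero_division=0)

print("accuracy:", acc)
print("f1-score:", f1)
```

The accuracy looks excellent although the model finds no conversion at all, whereas the f1-score drops to zero. This is the reason to evaluate with the leaderboard metric rather than with accuracy.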

*Use the template below as a guide for reading the files, the structure to follow, and writing the final predictions.*

In [232]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Read file with labels

In [215]:
df = pd.read_csv('conversion_data_train.csv')
print('Set with labels (our train+test) :', df.shape)

Set with labels (our train+test) : (284580, 6)


In [216]:
#df['age_page'] = df['age'] * df['total_pages_visited']

# Explore dataset

In [217]:
# Don't forget to compute statistics and visualize your data
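
As a starting point for this exploration step, the sketch below computes a few basic statistics. It runs on a small invented DataFrame (`df_demo`, a hypothetical name) that only mimics the column names of the training file; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Small synthetic sample with the same columns as the training file
# (values are invented for illustration only)
df_demo = pd.DataFrame({
    "country": ["US", "US", "UK", "China", "Germany", "UK"],
    "age": [30, 22, 41, 24, 35, 28],
    "new_user": [1, 0, 1, 1, 0, 0],
    "source": ["Seo", "Ads", "Direct", "Seo", "Ads", "Seo"],
    "total_pages_visited": [3, 15, 2, 1, 12, 4],
    "converted": [0, 1, 0, 0, 1, 0],
})

# Basic statistics for numeric and categorical columns
print(df_demo.describe(include="all"))

# Missing values per column
print(df_demo.isna().sum())

# Class balance of the target
conversion_rate = df_demo["converted"].mean()
print("Overall conversion rate:", conversion_rate)

# Conversion rate per category
rate_by_country = df_demo.groupby("country")["converted"].mean()
print(rate_by_country)
```

Checking the class balance early is useful here, because a rare positive class affects both the choice of metric and the way the data should be split.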

# Make your model (as always)

## Choose variables to use in the model, and create train and test sets

In [218]:
# Separate target variable Y from features X
print("Separating labels from features...")
features_list = ["country", "age", "new_user", "source", "total_pages_visited"]
target_variable = "converted"

X = df.loc[:,features_list]
Y = df.loc[:,target_variable]

print("...Done.")
print()

print('y : ')
print(Y.head())
print()
print('X :')
print(X.head())

Separating labels from features...
...Done.

y : 
0    0
1    0
2    1
3    0
4    0
Name: converted, dtype: int64

X :
   country  age  new_user  source  total_pages_visited
0    China   22         1  Direct                    2
1       UK   21         1     Ads                    3
2  Germany   20         0     Seo                   14
3       US   23         1     Seo                    3
4       US   28         1  Direct                    3


In [219]:
# Search categorical features and numeric features

idx = 0
numeric_features = []
numeric_indices = []
categorical_features = []
categorical_indices = []
for i,t in X.dtypes.items():  # Series.iteritems() was removed in pandas 2.0
  if ('float' in str(t)) or ('int' in str(t)) :
    numeric_features.append(i)
    numeric_indices.append(idx)
  else :
    categorical_features.append(i)
    categorical_indices.append(idx)

  idx = idx + 1

print('Found numeric features ', numeric_features,' at positions ', numeric_indices)
print('Found categorical features ', categorical_features,' at positions ', categorical_indices)

Found numeric features  ['age', 'new_user', 'total_pages_visited']  at positions  [1, 2, 4]
Found categorical features  ['country', 'source']  at positions  [0, 3]
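
The same detection can probably be written more compactly with `select_dtypes`. The sketch below is one possible alternative, shown on a two-row toy DataFrame (`X_demo`, a hypothetical name) with the same columns as `X`.

```python
import pandas as pd

# Toy frame with the same column layout as X (values for illustration)
X_demo = pd.DataFrame({
    "country": ["China", "UK"],
    "age": [22, 21],
    "new_user": [1, 1],
    "source": ["Direct", "Ads"],
    "total_pages_visited": [2, 3],
})

# Column names by type
numeric_features = X_demo.select_dtypes(include="number").columns.tolist()
categorical_features = X_demo.select_dtypes(exclude="number").columns.tolist()

# Positional indices, needed once the DataFrame is converted to a numpy array
numeric_indices = [X_demo.columns.get_loc(c) for c in numeric_features]
categorical_indices = [X_demo.columns.get_loc(c) for c in categorical_features]

print(numeric_features, numeric_indices)
print(categorical_features, categorical_indices)
```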


In [220]:
# Divide the dataset into a train set and a test set
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42, stratify=Y)
print("...Done.")
print()

Dividing into train and test sets...
...Done.
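
The `stratify=Y` argument used above keeps the proportion of conversions the same in both subsets. A minimal sketch of that effect, on synthetic labels with an assumed 3% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 3% positives (assumed for illustration)
y = np.array([1] * 30 + [0] * 970)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("positive rate, full set :", y.mean())
print("positive rate, train    :", y_tr.mean())
print("positive rate, test     :", y_te.mean())
```

Without stratification, a rare class can end up over- or under-represented in the test set, which makes the f1-score measured on it less reliable.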



In [221]:
# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_train = X_train.values
X_test = X_test.values
Y_train = Y_train.values
Y_test = Y_test.values
print("...Done")

print(X_train[0:5,:])
print(X_test[0:2,:])
print()
print(Y_train[0:5])
print(Y_test[0:2])

Convert pandas DataFrames to numpy arrays...
...Done
[['US' 30 0 'Ads' 3]
 ['Germany' 46 0 'Seo' 10]
 ['US' 36 1 'Direct' 1]
 ['US' 41 0 'Direct' 3]
 ['China' 24 1 'Direct' 1]]
[['US' 32 1 'Seo' 2]
 ['China' 45 1 'Ads' 3]]

[0 0 0 0 0]
[0 0]


## Training pipeline

In [222]:
# Encoding categorical features and standardizing numerical features
print("Encoding categorical features and standardizing numerical features...")
print()
print(X_train[0:5,:])
print()
print(X_test[0:5,:])

# Standardization (zero mean, unit variance)
numeric_transformer = StandardScaler()

# OneHotEncoder
categorical_transformer = OneHotEncoder(drop='first')

featureencoder = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_indices),
        ('cat', categorical_transformer, categorical_indices)        
        ]
    )

X_train = featureencoder.fit_transform(X_train)
X_test = featureencoder.transform(X_test)
print("...Done")
print(X_train[0:5,:])
print(X_test[0:5])

Encoding categorical features and standardizing numerical features...

[['US' 30 0 'Ads' 3]
 ['Germany' 46 0 'Seo' 10]
 ['US' 36 1 'Direct' 1]
 ['US' 41 0 'Direct' 3]
 ['China' 24 1 'Direct' 1]]

[['US' 32 1 'Seo' 2]
 ['China' 45 1 'Ads' 3]
 ['UK' 37 0 'Ads' 3]
 ['US' 31 1 'Ads' 5]
 ['UK' 27 1 'Direct' 4]]
...Done
[[-0.06905373 -1.47802048 -0.56068654  0.          0.          1.
   0.          0.        ]
 [ 1.86294015 -1.47802048  1.5340925   1.          0.          0.
   0.          1.        ]
 [ 0.65544398  0.67658061 -1.15919483  0.          0.          1.
   1.          0.        ]
 [ 1.25919206 -1.47802048 -0.56068654  0.          0.          1.
   1.          0.        ]
 [-0.79355144  0.67658061 -1.15919483  0.          0.          0.
   1.          0.        ]]
[[ 0.17244551  0.67658061 -0.85994069  0.          0.          1.
   0.          1.        ]
 [ 1.74219054  0.67658061 -0.56068654  0.          0.          0.
   0.          0.        ]
 [ 0.77619359 -1.47802048 -0.560
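
An alternative to converting to numpy arrays and transforming by hand is to chain the preprocessing and the model in a scikit-learn `Pipeline`. The sketch below is one way this could look, not the template's method: the toy data, the names `X_demo`, `y_demo` and `model`, and the plain `LogisticRegression` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data with the same column names as the challenge dataset
X_demo = pd.DataFrame({
    "country": ["US", "UK", "China", "Germany", "US", "UK", "China", "Germany"],
    "age": [30, 46, 36, 41, 24, 32, 45, 37],
    "new_user": [0, 0, 1, 0, 1, 1, 1, 0],
    "source": ["Ads", "Seo", "Direct", "Direct", "Direct", "Seo", "Ads", "Ads"],
    "total_pages_visited": [3, 10, 1, 3, 1, 2, 14, 16],
})
y_demo = pd.Series([0, 1, 0, 0, 0, 0, 1, 1])

# Columns are selected by name, so the DataFrame can be passed directly
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "new_user", "total_pages_visited"]),
    ("cat", OneHotEncoder(drop="first"), ["country", "source"]),
])

model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression()),
])

# fit() learns the scaling and encoding on the training data, then the classifier
model.fit(X_demo, y_demo)

# predict() applies the already-fitted preprocessing before predicting
preds = model.predict(X_demo)

X_encoded = model.named_steps["preprocess"].transform(X_demo)
print(X_encoded.shape)
```

With this layout the same fitted preprocessing is applied automatically to any new data passed to `model.predict`, which reduces the risk of forgetting a step on the unlabeled set.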

In [223]:
# Train model
print("Train model...")
# Alternative model, kept for reference (not used below).
# Arguments left at their defaults are omitted; max_features='auto' and
# min_impurity_split no longer exist in recent scikit-learn versions.
# classifier = RandomForestClassifier(criterion='entropy', max_depth=10,
#                                     min_samples_split=6, n_estimators=100,
#                                     n_jobs=-1, verbose=2)
classifier = LogisticRegressionCV(Cs= 2, n_jobs= -1, penalty= 'l1', solver= 'saga', verbose= 2, cv=10)
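# L1-regularized logistic regression; Cs=2 compares two values of C (1e-4 and 1e4)
# by 10-fold cross-validation (scored by accuracy, since scoring is left at its default)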
classifier.fit(X_train, Y_train)
print("...Done.")

Train model...
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
convergence after 11 epochs took 2 seconds
convergence after 13 epochs took 2 seconds
convergence after 12 epochs took 2 seconds
convergence after 15 epochs took 3 seconds
convergence after 35 epochs took 5 seconds
convergence after 35 epochs took 6 seconds
convergence after 36 epochs took 7 seconds
convergence after 36 epochs took 6 seconds
convergence after 13 epochs took 2 seconds
convergence after 16 epochs took 2 seconds
convergence after 10 epochs took 2 seconds
convergence after 12 epochs took 2 seconds
convergence after 35 epochs took 6 seconds
convergence after 35 epochs took 5 seconds
convergence after 34 epochs took 6 seconds
convergence after 34 epochs took 6 seconds
convergence after 10 epochs took 1 seconds
convergence after 13 epochs took 2 seconds
convergence after 36 epochs took 5 seconds
convergence after 36 epochs took 4 seconds
[Parallel(n_jobs=-1)]: Done  10 out of  10 |
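
`LogisticRegressionCV` selects the regularization strength by cross-validated accuracy unless told otherwise. Since the leaderboard uses the f1-score, it may be worth passing `scoring="f1"` so that the selection uses the same metric. The sketch below illustrates this on synthetic data from `make_classification`; the dataset, the finer grid `Cs=5`, and the default L2 penalty are assumptions for illustration and differ from the model trained above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic imbalanced data (about 10% positives), for illustration only
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)

clf = LogisticRegressionCV(
    Cs=5,            # 5 values of C on a log scale between 1e-4 and 1e4
    cv=5,
    scoring="f1",    # select C by cross-validated f1-score instead of accuracy
    max_iter=1000,
)
clf.fit(X_syn, y_syn)

print("candidate C values:", clf.Cs_)
print("selected C        :", clf.C_)
```

Inspecting `Cs_` and `C_` after fitting shows which candidate values were compared and which one was kept.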

In [224]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = classifier.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]



## Test pipeline

In [225]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = classifier.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on test set...
...Done.
[0 0 0 ... 0 0 0]



## Performance assessment

In [226]:
# WARNING: use the same metric as the one that will be used by Kaggle!
# Here, the f1-score will be used to assess the performances on the leaderboard
print("f1-score on train set : ", f1_score(Y_train, Y_train_pred))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred))

f1-score on train set :  0.7629043358568479
f1-score on test set :  0.7654519830883832


In [227]:
# You can also check more performance metrics to better understand what your model is doing
print("Confusion matrix on train set : ")
print(confusion_matrix(Y_train, Y_train_pred))
print()
print("Confusion matrix on test set : ")
print(confusion_matrix(Y_test, Y_test_pred))
print()

Confusion matrix on train set :
[[192016    764]
 [  1992   4434]]

Confusion matrix on test set : 
[[82308   312]
 [  853  1901]]
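
The f1-score can be recomputed by hand from the test-set confusion matrix printed above, which helps to see where the errors come from. In scikit-learn's layout, rows are true classes and columns are predicted classes.

```python
import numpy as np

# Test-set confusion matrix printed above
cm = np.array([[82308, 312],
               [853, 1901]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted conversions that are real
recall = tp / (tp + fn)      # share of real conversions that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1-score  = {f1:.4f}")
```

The result matches the f1-score reported on the test set (about 0.765). Recall is noticeably lower than precision, so most of the lost score comes from missed conversions (false negatives) rather than from false alarms.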



In [233]:
# print classification report
print("Classification Report on Test set:\n\n", classification_report(Y_test, Y_test_pred))

Classification Report on Test set:

               precision    recall  f1-score   support

           0       0.99      1.00      0.99     82620
           1       0.86      0.69      0.77      2754

    accuracy                           0.99     85374
   macro avg       0.92      0.84      0.88     85374
weighted avg       0.99      0.99      0.99     85374



# Train best classifier on all data and use it to make predictions on X_without_labels

In [228]:
# Concatenate our train and test set to train your best classifier on all data with labels
X = np.append(X_train,X_test,axis=0)
Y = np.append(Y_train,Y_test)

classifier.fit(X,Y)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
convergence after 10 epochs took 2 seconds
convergence after 10 epochs took 2 seconds
convergence after 12 epochs took 3 seconds
convergence after 12 epochs took 3 seconds
convergence after 33 epochs took 9 seconds
convergence after 33 epochs took 8 seconds
convergence after 33 epochs took 8 seconds
convergence after 34 epochs took 10 seconds
convergence after 11 epochs took 2 seconds
convergence after 8 epochs took 3 seconds
convergence after 13 epochs took 3 seconds
convergence after 12 epochs took 3 seconds
convergence after 34 epochs took 13 seconds
convergence after 33 epochs took 13 seconds
convergence after 34 epochs took 14 seconds
convergence after 33 epochs took 14 seconds
convergence after 10 epochs took 3 seconds
convergence after 11 epochs took 3 seconds
convergence after 34 epochs took 9 seconds
convergence after 34 epochs took 10 seconds
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed: 

LogisticRegressionCV(Cs=2, class_weight=None, cv=10, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=-1, penalty='l1',
                     random_state=None, refit=True, scoring=None, solver='saga',
                     tol=0.0001, verbose=2)

In [229]:
# Read data without labels
data_without_labels = pd.read_csv('conversion_data_test.csv')
print('Prediction set (without labels) :', data_without_labels.shape)

#data_without_labels['age_page'] = data_without_labels['age'] * data_without_labels['total_pages_visited']
# Warning: check the consistency of features_list (it must be the same as the
# features used by your best classifier)
features_list = ["country", "age", "new_user", "source", "total_pages_visited"]
X_without_labels = data_without_labels.loc[:, features_list]

# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_without_labels = X_without_labels.values
print("...Done")

print(X_without_labels[0:5,:])

Prediction set (without labels) : (31620, 5)
Convert pandas DataFrames to numpy arrays...
...Done
[['UK' 28 0 'Seo' 16]
 ['UK' 22 1 'Direct' 5]
 ['China' 32 1 'Seo' 1]
 ['US' 32 1 'Ads' 6]
 ['China' 25 0 'Seo' 3]]


In [230]:
# WARNING : PUT HERE THE SAME PREPROCESSING AS FOR YOUR TEST SET
# CHECK YOU ARE USING X_without_labels
print("Encoding categorical features and standardizing numerical features...")

X_without_labels = featureencoder.transform(X_without_labels)
print("...Done")
print(X_without_labels[0:5,:])

Encoding categorical features and standardizing numerical features...
...Done
[[-0.31055296 -1.47802048  3.3296174   0.          1.          0.
   0.          1.        ]
 [-1.03505067  0.67658061  0.03782176  0.          1.          0.
   1.          0.        ]
 [ 0.17244551  0.67658061 -1.15919483  0.          0.          0.
   0.          1.        ]
 [ 0.17244551  0.67658061  0.33707591  0.          0.          1.
   0.          0.        ]
 [-0.67280182 -1.47802048 -0.56068654  0.          0.          0.
   0.          1.        ]]
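
The reason for calling `transform` here, and not `fit_transform`, is that the unlabeled data must be scaled with the statistics learned on the training data. A minimal sketch of the difference, using invented age values and a standalone `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values for illustration
train_ages = np.array([[20.0], [30.0], [40.0]])
new_ages = np.array([[30.0], [50.0]])

scaler = StandardScaler()
scaler.fit(train_ages)  # mean and standard deviation learned on training data only

# Correct: reuse the training statistics on the new data
reused = scaler.transform(new_ages)

# Incorrect: refitting computes new statistics, so the same age maps to a different value
refit = StandardScaler().fit_transform(new_ages)

print("reused scaler:", reused.ravel())
print("refit scaler :", refit.ravel())
```

An age of 30 is the training mean, so it maps to 0 with the reused scaler, but to -1 with the refitted one. A classifier trained on the first scale would misread the second.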


In [231]:
# Make predictions and dump to file
# WARNING : MAKE SURE THE FILE IS A CSV WITH ONE COLUMN NAMED 'converted' AND NO INDEX !
# WARNING : FILE NAME MUST HAVE FORMAT 'conversion_data_test_predictions_[name].csv'
# where [name] is the name of your team/model separated by a '-'
# For example : [name] = AURELIE-model1
data = {
    'converted': classifier.predict(X_without_labels)
}

Y_predictions = pd.DataFrame(columns=['converted'],data=data)
Y_predictions.to_csv('conversion_data_test_predictions_LORENZO-Model5.csv', index=False)
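
Before submitting, it can help to read the file back and check that it has the expected format. The sketch below does this on invented predictions written to a temporary directory; the file name `conversion_data_test_predictions_EXAMPLE-model1.csv` is hypothetical and only follows the required naming pattern.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in for classifier.predict(X_without_labels); values are invented
predictions = np.array([0, 1, 0, 0, 1])

# Hypothetical file name following the required pattern
path = os.path.join(
    tempfile.mkdtemp(), "conversion_data_test_predictions_EXAMPLE-model1.csv"
)
pd.DataFrame({"converted": predictions}).to_csv(path, index=False)

# Read the file back and check the format
check = pd.read_csv(path)
assert list(check.columns) == ["converted"]      # one column, no index column
assert len(check) == len(predictions)            # one row per prediction
assert set(check["converted"].unique()) <= {0, 1}  # only binary labels
print(check.head())
```

In the notebook, the row count would be compared with the number of rows in the unlabeled set instead of the toy array.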
