# CLEAR BRAIN CHALLENGE

# Import the libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Import the data

In [2]:
data = pd.read_csv('data/conversion_data.csv')
print('shape: ', data.shape)
data.head()

shape:  (316200, 6)


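|   | country | age | new_user | source | total_pages_visited | converted |
|---|---|---|---|---|---|---|
| 0 | UK | 25 | 1 | Ads | 1 | 0 |
| 1 | US | 23 | 1 | Seo | 5 | 0 |
| 2 | US | 28 | 1 | Seo | 4 | 0 |
| 3 | China | 39 | 1 | Seo | 5 | 0 |
| 4 | US | 30 | 1 | Seo | 6 | 0 |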


# PART I - Exploratory Data Analysis

### 1) Country

In [3]:
data['country'].unique()

array(['UK', 'US', 'China', 'Germany'], dtype=object)

The users come from 4 different countries.

### 2) Age

In [4]:
data['age'].unique()

array([ 25,  23,  28,  39,  30,  31,  27,  29,  38,  43,  24,  36,  37,
        33,  20,  35,  17,  50,  22,  18,  34,  19,  42,  32,  21,  48,
        40,  41,  26,  45,  44,  49,  46,  56,  52,  54,  51,  47,  53,
        60,  57,  55,  59,  61,  58,  62,  65,  63,  66,  67,  64,  68,
        69, 123,  70,  73,  77,  72,  79, 111], dtype=int64)

This shows that some users are recorded as 111 or 123 years old. These values are most likely errors introduced when the data was collected.

### 3) New User

In [5]:
data['new_user'].unique()

array([1, 0], dtype=int64)

It seems coherent: If the user is new, the value is 1. Otherwise, the value is 0.

### 4) Source

In [6]:
data['source'].unique()

array(['Ads', 'Seo', 'Direct'], dtype=object)

As mentioned in the README file, the 3 possible sources are advertisements (`Ads`), search results (`Seo`), and direct URL entry (`Direct`).

### 5) total pages visited

In [7]:
data['total_pages_visited'].unique()

array([ 1,  5,  4,  6,  2,  8,  7,  3,  9, 14, 10, 11, 18, 15, 19, 12, 13,
       21, 17, 23, 16, 25, 26, 20, 22, 24, 27, 28, 29], dtype=int64)

The total number of pages visited by a user during the session varies from 1 to 29.

### 6) Converted

In [8]:
data['converted'].unique()

array([0, 1], dtype=int64)

This is consistent: the column only contains 0 or 1. A value of 0 means the user left the session without buying anything, while 1 means the user accepted a trip.

### Conclusion

From this analysis, we have a better understanding of the data. The columns contain no NaN values, and the only unexpected values are the implausible ages (111 and 123). Let's now prepare the data in order to build our prediction model.
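
The absence of missing values can also be checked directly instead of being read off the `unique()` outputs. A minimal sketch on a made-up frame (`toy` and its values are hypothetical and only stand in for the real dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    'country': ['UK', 'US', None, 'China'],
    'age': [25, 23, 28, 123],
    'converted': [0, 0, 1, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Summary statistics make implausible values (such as age 123) easy to spot.
print(toy['age'].describe())
```

On the real data, the same two calls would be applied to `data` instead of `toy`.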

# PART II - Data Cleaning and Feature Preprocessing

### 1) Country

We want integer values instead of string values. We map each unique country to an integer by means of a dictionary and replace the strings in the DataFrame with these integers.

In [9]:
dico_country = {
    'UK': 1,
    'US': 2,
    'China': 3,
    'Germany': 4
}

In [10]:
data['country'] = data['country'].replace(dico_country)
data.head()

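|   | country | age | new_user | source | total_pages_visited | converted |
|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | Ads | 1 | 0 |
| 1 | 2 | 23 | 1 | Seo | 5 | 0 |
| 2 | 2 | 28 | 1 | Seo | 4 | 0 |
| 3 | 3 | 39 | 1 | Seo | 5 | 0 |
| 4 | 2 | 30 | 1 | Seo | 6 | 0 |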


### 2) Age

We noticed during the Exploratory Data Analysis that some customers are 111 or 123 years old. This seems to be an error. So that these records do not distort the prediction model, we drop these rows.

In [11]:
data = data.loc[ (data['age'] != 111) & (data['age'] != 123) , : ]
print('shape: ', data.shape)
data.head()

shape:  (316198, 6)


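|   | country | age | new_user | source | total_pages_visited | converted |
|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | Ads | 1 | 0 |
| 1 | 2 | 23 | 1 | Seo | 5 | 0 |
| 2 | 2 | 28 | 1 | Seo | 4 | 0 |
| 3 | 3 | 39 | 1 | Seo | 5 | 0 |
| 4 | 2 | 30 | 1 | Seo | 6 | 0 |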


We only have 2 rows fewer than at the beginning.
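
Filtering on the two exact values works here, but it would miss any other implausible age. A slightly more general sketch uses a cut-off instead. The threshold `MAX_PLAUSIBLE_AGE = 100` and the frame `toy` are assumptions made for illustration:

```python
import pandas as pd

# Made-up ages, including the two suspicious values seen in the dataset.
toy = pd.DataFrame({'age': [25, 111, 30, 123, 79]})

# Assumed cut-off: anything above it is treated as a data-entry error.
MAX_PLAUSIBLE_AGE = 100

cleaned = toy.loc[toy['age'] <= MAX_PLAUSIBLE_AGE, :]
removed = len(toy) - len(cleaned)
print('rows removed: ', removed)
```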

### 3) Source

We want integer values instead of string values. We map each unique source to an integer by means of a dictionary and replace the strings in the DataFrame with these integers.

In [12]:
dico_source = {
    'Ads': 1,
    'Seo': 2,
    'Direct': 3
}

In [13]:
data['source'] = data['source'].replace(dico_source)
data.head()

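|   | country | age | new_user | source | total_pages_visited | converted |
|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 1 | 1 | 0 |
| 1 | 2 | 23 | 1 | 2 | 5 | 0 |
| 2 | 2 | 28 | 1 | 2 | 4 | 0 |
| 3 | 3 | 39 | 1 | 2 | 5 | 0 |
| 4 | 2 | 30 | 1 | 2 | 6 | 0 |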
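
The integer codes used for `country` and `source` imply an ordering (for example UK < US < China < Germany) that the categories do not really have. One common alternative, which is not used in this notebook, is one-hot encoding. A minimal sketch with `pd.get_dummies` on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Toy frame with made-up rows; the column names mirror the dataset.
toy = pd.DataFrame({
    'country': ['UK', 'US', 'China', 'Germany'],
    'source': ['Ads', 'Seo', 'Direct', 'Seo'],
    'age': [25, 23, 39, 30],
})

# One indicator column per category instead of a single integer code.
encoded = pd.get_dummies(toy, columns=['country', 'source'], dtype=int)
print(encoded.columns.tolist())
print(encoded.head())
```

With this encoding, a linear model learns one coefficient per country and per source rather than a single slope along an arbitrary numbering.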


### Conclusion

We don't need to change anything for the 'new_user', 'total_pages_visited', and 'converted' columns since these three columns already contain only integers. Let's now build the prediction model.

# PART III - Classification

### 1) Training Validation Split

This dataset is all we have for both training the model and validating it, so we split it into separate training and validation sets. The validation set is held out to assess the performance of our classifier once training is finished. Note that we set the seed (`random_state`) to 42, so the pseudo-random shuffle, and therefore the split, is the same on every run.

In [14]:
train, val = train_test_split(data, test_size = 0.1, random_state = 42)
print('train shape: ', train.shape)
print('val shape: ', val.shape)

train shape:  (284578, 6)
val shape:  (31620, 6)
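
The effect of fixing `random_state` can be illustrated on a small array: two independent calls with the same seed return the same split. This is only a sketch, and `rows` is a made-up stand-in for the row indices of the dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the row indices of a dataset with 20 rows.
rows = np.arange(20)

# Two independent calls with the same seed.
train_a, val_a = train_test_split(rows, test_size=0.1, random_state=42)
train_b, val_b = train_test_split(rows, test_size=0.1, random_state=42)

print('same validation rows: ', np.array_equal(val_a, val_b))
print('sizes: ', len(train_a), len(val_a))
```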


### 2) Retrieve X_train and Y_train

We divide our training dataset into a features matrix and a vector of true labels (converted or not converted).

In [15]:
features = ['country', 'age', 'new_user', 'source', 'total_pages_visited']

scaler = StandardScaler()

# Features matrix
X_train = train[features]

# We normalize the matrix
scaler.fit(X_train)
X_train = scaler.transform(X_train)

# Real Values vector
Y_train = train['converted']



Standardizing the features puts them on a comparable scale, which can help when fitting the model. With `StandardScaler`, each feature is rescaled to zero mean and unit variance (the values are not confined to the range 0 to 1). This also lets us compare the coefficients of the features after building the prediction model.
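
To make the transformation concrete, `StandardScaler` computes $z = \frac{x - \mu}{\sigma}$ for each column, where $\mu$ and $\sigma$ are the mean and standard deviation of that column in the data passed to `fit`. A minimal sketch on made-up values (the array `X` is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features (think: age, total_pages_visited).
X = np.array([[20.0, 1.0], [30.0, 5.0], [40.0, 9.0], [50.0, 29.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Same result by hand: z = (x - mean) / std, column by column.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print('column means: ', X_scaled.mean(axis=0))   # close to 0
print('column stds: ', X_scaled.std(axis=0))     # close to 1
print('min / max: ', X_scaled.min(), X_scaled.max())
```

The printed minimum is negative, which shows that the standardized values are centred on 0 rather than bounded by 0 and 1.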

### 3) Fit the model

We fit a Logistic Regression model with these vectors and we look at the training accuracy.

In [16]:
model = LogisticRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_train)

training_accuracy = model.score(X_train, Y_train)

print("Training Accuracy: ", training_accuracy)



Training Accuracy:  0.9852342767185095


### Conclusion

The accuracy seems high. However, we need to evaluate our classifier by other means.

# PART IV - Evaluating Classifiers

First, we are evaluating accuracy on the training set, which may lead to a misleading accuracy measure, especially if we used the training set to identify discriminative features. 

Presumably, our classifier will be used for filtering. There are two kinds of errors we can make:
- False positive (FP): A user who will not buy the service but is predicted as converted.
- False negative (FN): A user who will buy the service but is not predicted as converted.

These definitions depend both on the true labels and the predicted labels. False positives and false negatives may be of differing importance, leading us to consider more ways of evaluating a classifier, in addition to overall accuracy:

**Precision** measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FP}}$ of predicted conversions that are actually conversions.

**Recall** measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FN}}$ of conversions that were correctly flagged as conversions. 

**False-alarm rate** measures the proportion $\frac{\text{FP}}{\text{FP} + \text{TN}}$ of non-conversions that were incorrectly flagged as conversions. 

The following image summarizes these errors:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/700px-Precisionrecall.svg.png" width="500px">

Note that a true positive (TP) is a user that will buy the service and will be considered as converted, and a true negative (TN) is a user that won't buy anything and that won't be considered as converted.
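
As a small worked example of these three definitions, the function below computes them from the four confusion counts. The function name `classification_rates` and the counts are hypothetical, chosen only to make the arithmetic easy to follow:

```python
def classification_rates(tp, tn, fp, fn):
    """Return (precision, recall, false_alarm_rate) from confusion counts."""
    precision = tp / (tp + fp)         # share of predicted conversions that are real
    recall = tp / (tp + fn)            # share of real conversions that were found
    false_alarm_rate = fp / (fp + tn)  # share of non-conversions flagged by mistake
    return precision, recall, false_alarm_rate


# Hypothetical counts for 1000 users.
precision, recall, far = classification_rates(tp=60, tn=900, fp=10, fn=30)
print('precision: ', precision)   # 60 / 70
print('recall: ', recall)         # 60 / 90
print('false alarm rate: ', far)  # 10 / 910
```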

In [17]:
predictions = model.predict(X_train)

TP = sum((predictions == Y_train) & (predictions == 1))
TN = sum((predictions == Y_train) & (predictions == 0))
FP = sum((predictions != Y_train) & (predictions == 1))
FN = sum((predictions != Y_train) & (predictions == 0))

logistic_predictor_precision = TP / (TP + FP) 
logistic_predictor_recall = TP / (TP + FN) 
logistic_predictor_far = FP / (FP + TN) 

In [18]:
print('The training precision is: ', logistic_predictor_precision)
print('\nThe training recall is: ', logistic_predictor_recall)
print('\nThe training False Alarm Rate is: ', logistic_predictor_far)

The training precision is:  0.845430665007604

The training recall is:  0.6647461680617458

The training False Alarm Rate is:  0.004059859321153755
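
The hand-computed counts can be cross-checked with scikit-learn's metric functions. A minimal sketch on made-up labels (`y_true` and `y_pred` below are hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions for 10 users.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
false_alarm_rate = fp / (fp + tn)

print('TN, FP, FN, TP: ', tn, fp, fn, tp)
print('precision: ', precision)
print('recall: ', recall)
print('false alarm rate: ', false_alarm_rate)
```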


### Conclusion

We can see here that the training precision of our model is 84.54%. It means that 84.54% of the clients that are predicted as 'converted' by the model will actually buy the service. 

The training recall here is 66.47%. It means that 66.47% of the converted clients were actually flagged as converted by the prediction model.

The training False Alarm Rate is 0.41%. It is really low! Only 0.41% of the users that leave without buying anything were flagged as converted.

# PART V - Test our model on the validation dataset

### 1) Retrieve X_val and Y_val

We divide our validation dataset into a features matrix and a vector of true labels (converted or not converted).

In [19]:
features = ['country', 'age', 'new_user', 'source', 'total_pages_visited']

scaler = StandardScaler()

# Features matrix
X_val = val[features]

# We normalize the matrix
scaler.fit(X_val)
X_val = scaler.transform(X_val)

# Real Values vector
Y_val = val['converted']

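
Note that the cell above fits a second `StandardScaler` on the validation set, so the validation features are scaled with slightly different statistics than the ones the model was trained on. A common alternative is to reuse the scaler fitted on the training data. The sketch below shows that pattern on synthetic data; the names `X_train_toy` and `X_val_toy` are hypothetical, and the results reported in this notebook were obtained with the cell as written:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and validation feature matrices.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=30.0, scale=8.0, size=(100, 2))
X_val_toy = rng.normal(loc=30.0, scale=8.0, size=(20, 2))

scaler = StandardScaler()
scaler.fit(X_train_toy)                        # statistics come from training data only
X_train_scaled = scaler.transform(X_train_toy)
X_val_scaled = scaler.transform(X_val_toy)     # same mean and std applied to validation

print('training means used for both sets: ', scaler.mean_)
```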


### 2) Use the model

We use the Logistic Regression model with these vectors and we look at the validation accuracy.

In [20]:
from sklearn.linear_model import LogisticRegression

predictions_val = model.predict(X_val)

val_accuracy = model.score(X_val, Y_val)

print("Validation Accuracy: ", val_accuracy)

Validation Accuracy:  0.9858950031625553


Again, we need to evaluate our classifier by other means.

### 3) Evaluating the classifiers

In [21]:
TP = sum((predictions_val == Y_val) & (predictions_val == 1))
TN = sum((predictions_val == Y_val) & (predictions_val == 0))
FP = sum((predictions_val != Y_val) & (predictions_val == 1))
FN = sum((predictions_val != Y_val) & (predictions_val == 0))

logistic_predictor_precision_val = TP / (TP + FP) 
logistic_predictor_recall_val = TP / (TP + FN) 
logistic_predictor_far_val = FP / (FP + TN) 

In [22]:
print('The validation precision is: ', logistic_predictor_precision_val)
print('\nThe validation recall is: ', logistic_predictor_recall_val)
print('\nThe validation False Alarm Rate is: ', logistic_predictor_far_val)

The validation precision is:  0.8586251621271076

The validation recall is:  0.6626626626626627

The validation False Alarm Rate is:  0.0035596486071650174


### Conclusions

We can see here that the validation precision of our model is 85.86%. It means that 85.86% of the clients that are predicted as 'converted' by the model will actually buy the service.

The validation recall here is 66.27%. It means that 66.27% of the converted clients were actually flagged as converted by the prediction model.

The validation False Alarm Rate is 0.36%. It is really low! Only 0.36% of the users that leave without buying anything were flagged as converted.

We will discuss these numbers, and how to increase the precision, in the final part.

# PART VI - Conclusions

### Accuracy and Precision

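As we said before, even though the validation accuracy seems really high (98.59%), the validation precision of the classifier is only 85.86% and the recall only 66.27%. Accuracy is flattering here because most users do not convert, so correctly predicted non-conversions dominate the score. The precision could be improved by adding more features.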
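
To see why accuracy alone can mislead when conversions are rare, consider a trivial predictor that never predicts a conversion. The 3% conversion rate below is an assumption chosen for illustration, and the labels are made up:

```python
import numpy as np

# Hypothetical labels: 30 conversions out of 1000 users (3%).
y_true = np.array([1] * 30 + [0] * 970)

# A trivial "classifier" that never predicts a conversion.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print('accuracy: ', accuracy)
print('recall: ', recall)
```

This baseline reaches a high accuracy while finding none of the conversions, which is why precision and recall are reported alongside accuracy.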

### Coefficients of the model

In [23]:
print('The features are: ', features)

The features are:  ['country', 'age', 'new_user', 'source', 'total_pages_visited']


In [24]:
print('The coefficient of each feature is: ', model.coef_[0])

The coefficient of each feature is:  [-0.34561996 -0.61114143 -0.8076646  -0.05304043  2.51510061]


In [25]:
print('The constant of the model is: ', model.intercept_[0])

The constant of the model is:  -7.079680150480073


Since we normalized the features, we can analyze and compare their coefficients:

- Here, we can see that the total number of pages visited has by far the largest coefficient: it matters most in deciding whether a user is going to buy the service or not.
- A new user is less likely to buy the service than a returning one.
- The age of the user is also an important feature: the older the user is, the less likely they are to buy the service.
- Finally, the country of the user and the source have the smallest coefficients. Since both are encoded as arbitrary integers, their coefficients should be interpreted with caution.
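
One way to read these coefficients is as odds ratios: in a logistic regression, $e^{\beta}$ is the factor by which the odds of conversion are multiplied when the corresponding standardized feature increases by one standard deviation, the other features being held fixed. A minimal sketch using the coefficients printed above (the variable names are only for illustration):

```python
import math

features = ['country', 'age', 'new_user', 'source', 'total_pages_visited']
coefficients = [-0.34561996, -0.61114143, -0.8076646, -0.05304043, 2.51510061]

# Rank the features by the absolute size of their coefficient.
ranked = sorted(zip(features, coefficients), key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked:
    # exp(coef) > 1 raises the odds of conversion, exp(coef) < 1 lowers them.
    print(f'{name}: coefficient={coef:+.2f}, odds ratio={math.exp(coef):.2f}')
```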

### Recommendations and Improvements

In order to improve the model, it could be useful to have more features such as:
   - The gender of the user
   - The price of the service
   - The total duration of the ride
   
These features could improve the precision of the classifier.