# Project

This is a lab based on the Udemy course "Deep Learning Prerequisites: Logistic Regression in Python".
Here I'm following along with the explanations and implementing some of the math as I go.
It is mostly based on the live coding done by the author of the course, but does not strictly follow the examples.

## Import basic libraries

In [1]:
import numpy as np
import pandas as pd

## Read file

In [2]:
df = pd.read_csv('ecommerce_data.csv')

In [3]:
df.head(5)

Unnamed: 0,is_mobile,n_products_viewed,visit_duration,is_returning_visitor,time_of_day,user_action
0,1,0,0.65751,0,3,0
1,1,1,0.568571,0,2,1
2,1,0,0.042246,1,1,0
3,1,1,1.659793,1,1,2
4,0,1,2.014745,1,1,2


### Definition of features

| Feature  | Description | Comments |
|--------------|-------------|--------|
| is_mobile | Whether the user is visiting from mobile | |
| n_products_viewed | Number of products viewed | (During the session) |
| visit_duration | Visit duration in minutes | |
| is_returning_visitor | Whether the user is a returning visitor | |
| time_of_day | 0/1/2/3 | 24 hours split into 4 categories |
| user_action | 0/1/2/3: bounce/add_to_cart/begin_checkout/finish_checkout | |


## Prepare the data

In the course, the author suggests converting the dataframe to a matrix, which is apparently easier to process. I am going to try to apply the same transformations to the dataframe itself.

is_mobile and is_returning_visitor are fine as they are: both are categorical and take only the values 0 and 1.

n_products_viewed and visit_duration are non-categorical numbers. Since scale and distance are important, these numbers need to be normalized (subtract the mean and divide by the standard deviation). This way the inputs also stay in the range where the sigmoid function is most sensitive, rather than where it flattens out.

In [4]:
df['n_products_viewed'] = (df['n_products_viewed'] - df['n_products_viewed'].mean()) / df['n_products_viewed'].std()
df['visit_duration'] = (df['visit_duration'] - df['visit_duration'].mean()) / df['visit_duration'].std()
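As a quick sanity check of the normalization above, here is a minimal sketch on toy data (the values are made up, not taken from ecommerce_data.csv, and `standardize` is a hypothetical helper). After the transformation the column should have a mean of about 0 and a standard deviation of about 1. Note that pandas `.std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

def standardize(col: pd.Series) -> pd.Series:
    """Subtract the mean and divide by the sample standard deviation."""
    return (col - col.mean()) / col.std()

# Toy data, made up for illustration only
toy = pd.DataFrame({"visit_duration": [0.5, 1.0, 1.5, 2.0, 5.0]})
toy["visit_duration_std"] = standardize(toy["visit_duration"])

print(toy["visit_duration_std"].mean())  # close to 0
print(toy["visit_duration_std"].std())   # close to 1
```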

Encode the time of day using one-hot encoding. The geometrical distance between data points is important for the classification, but for categorical data the distance between the category numbers does not really make sense.

This is the same situation I had when I tried to cluster books with authors encoded by numbers. Authors with close IDs were considered similar, when in fact an author ID cannot be a measure of similarity. In that case I changed the algorithm from K-means to K-prototype, but I could have encoded the authors using one-hot encoding too.

In [5]:
# One indicator column per time-of-day category (0..3), computed column-wise
df['tod1'] = (df['time_of_day'] == 0).astype(int)
df['tod2'] = (df['time_of_day'] == 1).astype(int)
df['tod3'] = (df['time_of_day'] == 2).astype(int)
df['tod4'] = (df['time_of_day'] == 3).astype(int)

del df['time_of_day']
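The same one-hot encoding can probably be done in one step with `pd.get_dummies`. This is only a sketch on toy data, not what the course does. Note that the resulting columns are named after the category values (`tod_0` to `tod_3`) rather than `tod1` to `tod4`, and that a column is created only for categories that actually occur in the data.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({"time_of_day": [3, 2, 1, 1, 0]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy["time_of_day"], prefix="tod", dtype=int)
toy = pd.concat([toy.drop(columns="time_of_day"), dummies], axis=1)

print(toy.columns.tolist())  # ['tod_0', 'tod_1', 'tod_2', 'tod_3']
```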

In [6]:
df.head(5)

Unnamed: 0,is_mobile,n_products_viewed,visit_duration,is_returning_visitor,user_action,tod1,tod2,tod3,tod4
0,1,-0.816161,-0.407869,0,0,0,0,0,1
1,1,0.139531,-0.498929,0,1,0,0,1,0
2,1,-0.816161,-1.037804,1,0,0,1,0,0
3,1,0.139531,0.618313,1,2,0,1,0,0
4,0,0.139531,0.981728,1,2,0,1,0,0


Plain (binary) logistic regression can only predict "Yes" or "No", that is, no more than 2 classes. So we are going to keep only the "bounce" (0) and "add_to_cart" (1) actions and predict which of the two happens.

In [7]:
df_f = df[df['user_action'] <= 1]

In [8]:
df_f.head(5)

Unnamed: 0,is_mobile,n_products_viewed,visit_duration,is_returning_visitor,user_action,tod1,tod2,tod3,tod4
0,1,-0.816161,-0.407869,0,0,0,0,0,1
1,1,0.139531,-0.498929,0,1,0,0,1,0
2,1,-0.816161,-1.037804,1,0,0,1,0,0
6,0,-0.816161,0.393614,1,0,0,1,0,0
7,1,-0.816161,-1.044956,0,0,0,0,0,1


## Predict user action (use the math from basic_math)

In [9]:
N = df_f.shape[0]
N

398

Here I'm abandoning the dataframe.

Xb is the feature matrix with a leading column of ones for the bias term (no labels)

In [10]:
a = df_f.loc[:, 'is_mobile':'is_returning_visitor'].to_numpy()
b = df_f.loc[:, 'tod1':'tod4'].to_numpy()
ones = np.ones((N, 1))
Xb = np.concatenate((ones, a, b), axis=1)

In [11]:
Xb[:3]

array([[ 1.        ,  1.        , -0.81616102, -0.4078692 ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ],
       [ 1.        ,  1.        ,  0.13953104, -0.49892862,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ],
       [ 1.        ,  1.        , -0.81616102, -1.03780386,  1.        ,
         0.        ,  1.        ,  0.        ,  0.        ]])

In [12]:
D = Xb.shape[1]
D

9

Random weights initially

In [13]:
w = np.random.randn(D)

In [14]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
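To see why the scale of the inputs matters for the sigmoid, here is a small sketch evaluating it at a few points (the inputs are arbitrary and `sigmoid_demo` is just a local copy of the function above). Around 0 the output changes quickly, while for large |z| it flattens out near 0 or 1.

```python
import numpy as np

def sigmoid_demo(z):
    """Same formula as sigmoid above, repeated so the sketch is self-contained."""
    return 1 / (1 + np.exp(-z))

z_demo = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s_demo = sigmoid_demo(z_demo)

# Values near 0 and 1 at the ends, exactly 0.5 in the middle
print(s_demo)
```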

In [15]:
z = Xb.dot(w)

In [16]:
z[:5]

array([ 0.55420858, -0.69448534, -0.79790094, -0.26087228,  0.94526781])

In [17]:
predictions = sigmoid(z)

In [18]:
predictions[:20]

array([0.63511146, 0.33303603, 0.31047471, 0.43514929, 0.7201625 ,
       0.81106455, 0.88210927, 0.85598754, 0.94661648, 0.02900604,
       0.86692319, 0.24598306, 0.00839984, 0.1601554 , 0.90623126,
       0.98530035, 0.58451359, 0.95510291, 0.49098445, 0.36473055])

Actual labels

In [19]:
T = df_f.loc[:, 'user_action'].to_numpy()

Calculate cross-entropy

In [20]:
def cross_entropy(T, Y):
    return -np.mean(T * np.log(Y) + (1 - T) * np.log(1 - Y))

cross_entropy(T, predictions)

1.6518865596159171
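One caveat with the cross-entropy above: if a prediction is exactly 0 or 1, `np.log` returns `-inf` and the result becomes `nan`. A minimal sketch of a guarded version, assuming a small clipping constant `eps` (my own choice, not from the course), on toy labels and predictions:

```python
import numpy as np

def cross_entropy_clipped(T, Y, eps=1e-12):
    """Mean binary cross-entropy with Y clipped away from 0 and 1."""
    Y = np.clip(Y, eps, 1 - eps)
    return -np.mean(T * np.log(Y) + (1 - T) * np.log(1 - Y))

# Toy data, made up for illustration only
T_toy = np.array([1, 0, 1, 0])
Y_good = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct: low loss
Y_bad = np.array([0.1, 0.9, 0.2, 0.8])   # confident and wrong: high loss
Y_edge = np.array([1.0, 0.0, 1.0, 0.0])  # exact 0/1: nan without clipping

print(cross_entropy_clipped(T_toy, Y_good))
print(cross_entropy_clipped(T_toy, Y_bad))
print(cross_entropy_clipped(T_toy, Y_edge))
```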

## Train the model

In [21]:
XTrain = Xb[:-100]
XTest = Xb[-100:]

TTrain = T[:-100]
TTest = T[-100:]

In [22]:
learning_rate = 0.01
for i in range(100):
    Y = sigmoid(XTrain.dot(w))
    w += learning_rate * XTrain.T.dot(TTrain - Y)

    if i % 10 == 0:
        print(cross_entropy(TTrain, Y))

1.627507159002893
0.278458853514323
0.232454029139268
0.21846609730598493
0.2121919655462052
0.20885508378398815
0.206900262307872
0.2056814090679353
0.20488778748321892
0.20435446759763745
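For reference, the update in the loop above is gradient descent on the summed cross-entropy. A short derivation sketch, where $x_n$ is a row of XTrain, $t_n$ the matching entry of TTrain and $\eta$ the learning rate (the symbols are my notation, not the course's):

$$
J(w) = -\sum_{n=1}^{N} \left[ t_n \log y_n + (1 - t_n) \log(1 - y_n) \right], \qquad y_n = \sigma(w^T x_n)
$$

$$
\frac{\partial J}{\partial w} = \sum_{n=1}^{N} (y_n - t_n)\, x_n = X^T (Y - T)
$$

$$
w \leftarrow w - \eta\, X^T (Y - T) = w + \eta\, X^T (T - Y)
$$

The last line matches `w += learning_rate * XTrain.T.dot(TTrain - Y)`. Note that `cross_entropy` prints the mean, $J / N$, while the step uses the gradient of the sum; the factor $1/N$ is effectively absorbed into the learning rate.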


## Test the model

In [23]:
YTest = sigmoid(XTest.dot(w))
cross_entropy(TTest, YTest)

0.1303259251472449

In [24]:
TTest * 1.0

array([0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 1., 0., 0., 0.,
       0., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1.,
       0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0.,
       1., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.])

In [25]:
np.round(YTest)

array([0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 1., 0., 0., 0.,
       0., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1.,
       0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0.,
       1., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.])

In [26]:
# Take abs() per element so false positives and false negatives cannot cancel out
np.sum(np.abs(TTest - np.round(YTest)))

3.0

There seem to be only 3 incorrectly classified points out of 100, all three of them false positives.

I'm not really sure why the cross-entropy on the test data (0.13) differs so much from the one on the training data (0.20). It probably doesn't matter: cross-entropy is mainly there to guide the weights towards good values, while the actual quality of the model should be expressed in precision and recall.
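Precision and recall could be computed from the labels and the rounded predictions roughly as follows. This is a minimal sketch on toy arrays; `precision_recall` is a hypothetical helper, not something from the course.

```python
import numpy as np

def precision_recall(T, P):
    """Precision and recall for binary labels T and binary predictions P."""
    T = np.asarray(T).astype(bool)
    P = np.asarray(P).astype(bool)
    tp = np.sum(T & P)    # predicted 1, actually 1
    fp = np.sum(~T & P)   # predicted 1, actually 0
    fn = np.sum(T & ~P)   # predicted 0, actually 1
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy data, made up for illustration only
T_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])
P_toy = np.array([0, 1, 1, 1, 0, 0, 1, 0])
precision, recall = precision_recall(T_toy, P_toy)
print(precision, recall)  # 0.75 0.75
```

In this notebook the call would be `precision_recall(TTest, np.round(YTest))`.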