# Classification and Regression Experiments

# Data Preprocessing

In [1]:
#Imports
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [2]:
df = pd.read_csv("../data/cleaned/cleaned_data.csv", parse_dates=['event_time'])

In [None]:
#Creating a new view_count column

view_counts = df[df['event_type'] == 'view'].groupby(['user_id', 'product_id']).size().reset_index(name='view_count')
df = df.merge(view_counts, on=['user_id', 'product_id'], how='left')
df['view_count'] = df['view_count'].fillna(0)

# Creating a column where purchases are labeled as 1 and non-purchases as 0
df['purchased'] = (df['event_type'] == 'purchase').astype(int)

#Create a smaller sample of the dataframe (Random 100,000 Values)
df = df.sample(n=100000, random_state=42) 

# Logistic Regression for brand and category_code Predicting Product Purchases

In [15]:
# Feature Selection
x = df[['brand', 'category_code']]
y = df['purchased']

x = pd.get_dummies(x, columns=['brand', 'category_code'], drop_first=True)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Conduct Experiment
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')

lr_model.fit(x_train, y_train)

y_pred = lr_model.predict(x_test)

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.99      0.56      0.72      9785
           1       0.03      0.66      0.06       215

    accuracy                           0.57     10000
   macro avg       0.51      0.61      0.39     10000
weighted avg       0.97      0.57      0.70     10000



In [17]:
#Experiment Improved

# Feature Selection
x = df[['brand', 'category_code', 'category_id']]
y = df['purchased']

x = pd.get_dummies(x, columns=['brand', 'category_code', 'category_id'], drop_first=True)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Conduct Experiment
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')

lr_model.fit(x_train, y_train)

y_pred = lr_model.predict(x_test)

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.99      0.57      0.73      9785
           1       0.03      0.66      0.06       215

    accuracy                           0.58     10000
   macro avg       0.51      0.62      0.39     10000
weighted avg       0.97      0.58      0.71     10000



The input data (x) for this experiment is brand and category_code. The output or target data (y) for this experiment is purchased which indicates whether a product was purchased (1) or not (0). 

The model reached a high precision of 0.99 for the majority class (non-purchases) but struggled with the minority class (purchases), where precision was only 0.03. Recall for the purchases class was fairly high at 0.66, which means that the model correctly identified 66% of the actual purchases. However, the low precision hurt its overall performance: the f1-score for purchases is only 0.06, and the overall accuracy of 0.57 indicates that the model isn't performing very well.

The model was underfitting: it was not able to capture sufficient patterns to make accurate predictions for the minority class (purchases). The combination of high recall and low precision tells us that the model predicted a lot of purchases, but most of those predictions were wrong.
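To make the precision and recall reading concrete, here is a small worked calculation. The four confusion-matrix counts are hypothetical: they were chosen only so that they agree with the rounded values and supports printed in the first report, and they were not read from the notebook.

```python
# Hypothetical confusion-matrix counts, picked to be consistent with the
# rounded first report above (assumption: the real counts may differ slightly).
tn, fp = 5520, 4265   # actual non-purchases, support 9785
fn, tp = 73, 142      # actual purchases, support 215

# Precision: of everything predicted as a purchase, how much really was one.
precision_1 = tp / (tp + fp)
# Recall: of all real purchases, how many the model found.
recall_1 = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
```

Under these assumed counts, roughly 4,400 rows are flagged as purchases but only 142 of them are real, which is why precision collapses even though recall looks healthy.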

To reduce this underfitting, I could add more features or use a more sophisticated model so that it better captures the complexities of the data. In general, the more complex the model, the less it will underfit. In the "Experiment Improved" section, I added an extra feature (category_id) to the logistic regression. However, the results show that the model got only slightly better at predicting non-purchases (recall rose from 0.56 to 0.57) and did not change at all for purchases (precision 0.03, recall 0.66).

On the data side, I could train this model on a bigger sample of the dataset. With more data, the model would see more instances of items being purchased, which would give it a better opportunity to learn the relationships that lead to a purchase.

# Logistic Regression Experiment for the Newly Created view_count Column and price Predicting Product Purchases

In [4]:
#Feature Selection

x = df[['view_count', 'price']]
y = df['purchased']

#Split the data into training and test sets 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#Conduct Experiment
lr_model = LogisticRegression(max_iter=1000)

lr_model.fit(x_train, y_train)

y_prediction = lr_model.predict(x_test)

report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19591
           1       0.22      0.00      0.01       409

    accuracy                           0.98     20000
   macro avg       0.60      0.50      0.50     20000
weighted avg       0.96      0.98      0.97     20000



In [5]:
#Experiment Changed

x = df[['view_count', 'price']]
y = df['purchased']

#Split the data into training and test sets 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#Conduct Experiment
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')

lr_model.fit(x_train, y_train)

y_prediction = lr_model.predict(x_test)

report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.99      0.82      0.89     19569
           1       0.05      0.45      0.09       431

    accuracy                           0.81     20000
   macro avg       0.52      0.63      0.49     20000
weighted avg       0.97      0.81      0.88     20000



The input data (x) for this experiment was view_count, the number of times a user viewed a product, and price, the price of the product. The target or output data (y) that I used for this prediction task was purchased, which indicates whether a product was purchased.

The model performed well for non-purchased items, with an overall accuracy of 98%. For class 0 (non-purchased items), precision was 0.98: when the model predicted that a product was not purchased, it was right 98% of the time. For class 1 (purchased items), the model was correct only 22% of the time when it predicted that a product had been purchased, and its recall of 0.00 shows that it found almost none of the actual purchases. The model performed extremely well when predicting that a product was not purchased but poorly when predicting that a product was purchased.

The model is underfitting when it comes to the minority class, "purchased". The near-zero recall for that class shows that the model isn't learning enough from the products that were actually purchased.

In the "Experiment Changed" section, I changed the weight of the classes with class_weight='balanced'. This does a better job of addressing the underfitting, but not entirely. The class weights raised the recall for purchased products from 0.00 to 0.45 and reduced the bias toward the majority class. However, precision for purchases is still low at 0.05, and overall accuracy dropped from 0.98 to 0.81. The model is now better at finding purchases, but most of its purchase predictions are still wrong.
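As a rough illustration of what class_weight='balanced' does, scikit-learn documents the balanced weight for each class as n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` and the 980-to-20 label list are hypothetical and only mimic the roughly 2% purchase rate seen in the reports.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical training labels: 980 non-purchases and 20 purchases.
y_example = [0] * 980 + [1] * 20
weights = balanced_class_weights(y_example)
print(weights)
```

With this imbalance, a misclassified purchase costs the model about 49 times as much as a misclassified non-purchase, which pushes it to predict purchases more often: recall goes up while precision stays low.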

On the data side, I could engineer more features that give the model more relevant information for predicting purchases. Features such as "time_spent_viewing" or "user_history" could be useful in showing the model which factors contribute to products being purchased.
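One possible reading of a "user_history" feature is the number of purchases a user made before the current event. The sketch below is only an illustration on a tiny hand-written event list: the column names user_id, event_time and event_type come from the dataset, while `add_prior_purchases`, `prior_purchases` and the sample rows are hypothetical.

```python
from collections import defaultdict

def add_prior_purchases(events):
    """Return events in time order, each with the user's earlier purchase count."""
    ordered = sorted(events, key=lambda e: e["event_time"])
    purchases_so_far = defaultdict(int)
    enriched = []
    for event in ordered:
        user = event["user_id"]
        # Record the count BEFORE this event so the feature never
        # includes the outcome it is supposed to help predict.
        enriched.append({**event, "prior_purchases": purchases_so_far[user]})
        if event["event_type"] == "purchase":
            purchases_so_far[user] += 1
    return enriched

# Hypothetical events; timestamps are zero-padded strings, so they sort correctly.
events = [
    {"user_id": 1, "event_time": "2019-10-01 10:10", "event_type": "view"},
    {"user_id": 1, "event_time": "2019-10-01 10:00", "event_type": "view"},
    {"user_id": 2, "event_time": "2019-10-01 10:06", "event_type": "view"},
    {"user_id": 1, "event_time": "2019-10-01 10:05", "event_type": "purchase"},
    {"user_id": 1, "event_time": "2019-10-01 10:20", "event_type": "purchase"},
]
enriched = add_prior_purchases(events)
for row in enriched:
    print(row["user_id"], row["event_time"], row["event_type"], row["prior_purchases"])
```

Counting only earlier events matters here: a history feature that also sees the current or later rows would leak the target into the inputs.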

# Classification Experiment for price, category_id, and brand Predicting Product Purchases

In [8]:
# Feature Selection
x = df[['price', 'category_id', 'brand']]
y = df['purchased']


x = pd.get_dummies(x, drop_first=True)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Conduct Experiment
rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)


y_prediction = rf_model.predict(x_test)

report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19577
           1       0.05      0.01      0.01       423

    accuracy                           0.98     20000
   macro avg       0.51      0.50      0.50     20000
weighted avg       0.96      0.98      0.97     20000



In [9]:
#Experiment Improved

# Feature Selection
x = df[['price', 'category_id', 'brand']]
y = df['purchased']


x = pd.get_dummies(x, drop_first=True)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Conduct Experiment
rf_model = RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=-1)
rf_model.fit(x_train, y_train)


y_prediction = rf_model.predict(x_test)

report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19577
           1       0.05      0.01      0.01       423

    accuracy                           0.98     20000
   macro avg       0.52      0.50      0.50     20000
weighted avg       0.96      0.98      0.97     20000



The input data (x) for this experiment was price, category_id, and brand. The target data or output (y) was purchased, which indicates whether a product was purchased or not. 

The model achieved an overall accuracy of 98% on the test set. However, the result of this experiment is very similar to the unweighted logistic regression in the previous experiment. The performance metrics show high precision and recall when identifying products that were not purchased but poor performance when identifying products that were purchased: precision for the purchased class is only 5% and recall is only 1%.
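A quick check shows why the 98% accuracy says little about purchases. Using only the support counts printed in the report above, a trivial baseline that always predicts "not purchased" would score about the same; the variable names are illustrative.

```python
# Support counts taken from the classification report above.
support_non_purchase = 19577
support_purchase = 423
total = support_non_purchase + support_purchase

# Accuracy of a baseline that predicts class 0 for every row.
baseline_accuracy = support_non_purchase / total
# Share of the test set that is actually a purchase.
purchase_rate = support_purchase / total

print(f"always-0 baseline accuracy: {baseline_accuracy:.2f}")
print(f"purchase rate: {purchase_rate:.3f}")
```

Because purchases are only about 2% of the rows, accuracy is dominated by the majority class, so precision, recall and f1 for class 1 are the more informative numbers in these experiments.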

The model underfits when it comes to purchased products. This can be seen in the low precision and recall for the purchased class. Although the model performs well in identifying products that were not purchased, it fails to generalize to the minority class, which means that it hasn't learned enough about the characteristics of purchased products. This was surprising because I used an extra feature in comparison to the last experiment. The results are slightly different but mostly the same.

To address the underfitting, I could work on increasing the complexity of the model. For example, I could increase the number of trees (n_estimators) in RandomForestClassifier, although adding trees mostly stabilizes the averaged prediction rather than letting the model learn more complex patterns. Another option would be to tune the tree depth (max_depth) so that the trees capture more complex patterns; note that with the default max_depth=None the trees already grow until their leaves are pure, so there may be little depth left to add. In the "Experiment Improved" section, I increased the number of trees to 500 to see if there was a meaningful change in underfitting, but the results are nearly identical: only the macro-average precision moved, from 0.51 to 0.52.

On the data side, the main thing that I think would help a lot more is creating artificial examples of the minority class (oversampling), which is the purchased products in this case. This would balance the dataset and make sure that the model sees a more even amount of input from both classes.
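A minimal sketch of this idea is random oversampling, which duplicates existing minority rows until both classes are the same size. This is a simpler stand-in for generating truly synthetic examples, and it assumes a binary target. The function `random_oversample` and the toy rows are hypothetical, and the resampling should be applied to the training split only so that the test set keeps its real class balance.

```python
import random
from collections import Counter

def random_oversample(rows, labels, minority_label=1, seed=42):
    """Duplicate randomly chosen minority rows until the classes are balanced."""
    rng = random.Random(seed)
    minority = [row for row, y in zip(rows, labels) if y == minority_label]
    n_majority = sum(1 for y in labels if y != minority_label)
    n_needed = n_majority - len(minority)
    if not minority or n_needed <= 0:
        # Nothing to resample from, or the data is already balanced.
        return list(rows), list(labels)
    extra = rng.choices(minority, k=n_needed)  # sampling with replacement
    return list(rows) + extra, list(labels) + [minority_label] * n_needed

# Hypothetical training split: 8 non-purchases and 2 purchases.
train_rows = [{"price": 10.0 + i} for i in range(10)]
train_labels = [0] * 8 + [1] * 2

rows_bal, labels_bal = random_oversample(train_rows, train_labels)
print(Counter(labels_bal))
```

Duplicated rows add no new information, so this mainly changes how much weight the minority class receives during training, similar in spirit to class_weight='balanced'.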