# Classification and Regression Experiments

# Data Preprocessing

In [2]:
#Imports
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [3]:
df = pd.read_csv("../data/cleaned/cleaned_data.csv", parse_dates=['event_time'])

In [None]:
#Creating a new view_count column
print(df.columns)

view_counts = df[df['event_type'] == 'view'].groupby(['user_id', 'product_id']).size().reset_index(name='view_count')
df = df.merge(view_counts, on=['user_id', 'product_id'], how='left')
df['view_count'] = df['view_count'].fillna(0)

# Creating a column where purchases are labeled as 1 and non-purchases as 0
df['purchased'] = (df['event_type'] == 'purchase').astype(int)

#Create a smaller sample of the dataframe (Random 100,000 Values)
df = df.sample(n=100000, random_state=42) 
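
To make the view_count and purchased columns concrete, here is a minimal sketch of the same steps on a tiny, made-up event log. The rows in `toy` are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy event log; column names follow the notebook.
toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "product_id": [10, 10, 10, 10, 20],
    "event_type": ["view", "view", "purchase", "view", "cart"],
})

# Count view events per (user, product) pair.
view_counts = (
    toy[toy["event_type"] == "view"]
    .groupby(["user_id", "product_id"])
    .size()
    .reset_index(name="view_count")
)

# Left merge keeps every event row; pairs with no views become NaN, then 0.
toy = toy.merge(view_counts, on=["user_id", "product_id"], how="left")
toy["view_count"] = toy["view_count"].fillna(0)

# Binary target: 1 for purchase events, 0 for everything else.
toy["purchased"] = (toy["event_type"] == "purchase").astype(int)

print(toy)
```

Every row for user 1 and product 10 gets view_count 2, including the purchase row, because the count is attached per (user, product) pair rather than per event.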

# Logistic Regression for brand and category_code Predicting Product Purchases

In [10]:
# Feature Selection
x = df[['brand', 'category_code']]
y = df['purchased']

x = pd.get_dummies(x, columns=['brand', 'category_code'], drop_first=True)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Conduct Experiment
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')

lr_model.fit(x_train, y_train)

y_pred = lr_model.predict(x_test)

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.99      0.54      0.70     19574
           1       0.03      0.68      0.06       426

    accuracy                           0.54     20000
   macro avg       0.51      0.61      0.38     20000
weighted avg       0.97      0.54      0.68     20000



In [11]:
#Experiment Improved

# Feature Selection
x = df[['brand', 'category_code', 'category_id']]
y = df['purchased']

x = pd.get_dummies(x, columns=['brand', 'category_code', 'category_id'], drop_first=True)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Conduct Experiment
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')

lr_model.fit(x_train, y_train)

y_pred = lr_model.predict(x_test)

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.99      0.52      0.68     19574
           1       0.03      0.70      0.06       426

    accuracy                           0.52     20000
   macro avg       0.51      0.61      0.37     20000
weighted avg       0.97      0.52      0.67     20000



The input data (x) for this experiment is brand and category_code, one-hot encoded. The target (y) is purchased, which indicates whether a product was purchased (1) or not (0).

The model kept a high precision of 0.99 for the majority class (non-purchases) but struggled with the minority class (purchases), where precision was only 0.03. Recall for purchases was fairly high at 0.68, which means that the model correctly identified 68% of the actual purchases. However, the low precision hurt its overall performance: the overall accuracy of 0.54 indicates that the model isn't performing very well.

The model was underfitting: it was not able to capture enough patterns to make accurate predictions for the minority class (purchased). The combination of high recall and low precision tells us that the model predicted a lot of purchases, but most of those predictions were wrong.
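
The link between high recall and low precision can be checked with a little arithmetic. The sketch below rebuilds approximate confusion-matrix counts from the first report above; because the report only shows rounded values, the counts are estimates rather than the exact numbers from the run.

```python
# Approximate confusion-matrix counts rebuilt from the first report
# (recall values are rounded, so these counts are estimates).
support_neg, support_pos = 19574, 426
recall_neg, recall_pos = 0.54, 0.68

tp = round(recall_pos * support_pos)   # purchases correctly flagged
fn = support_pos - tp                  # purchases missed
tn = round(recall_neg * support_neg)   # non-purchases correctly passed
fp = support_neg - tn                  # non-purchases flagged as purchases

precision_pos = tp / (tp + fp)
accuracy = (tp + tn) / (support_neg + support_pos)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"precision (purchases) = {precision_pos:.2f}")
print(f"accuracy = {accuracy:.2f}")
```

Roughly 290 true purchases sit among about 9,000 false alarms, which is why precision lands near 0.03 even though most real purchases are caught.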

To address this underfitting, I could add more features or use a more sophisticated model, since a more complex model is generally less prone to underfitting. The goal is for the model to better capture the complexities of the data. In the "Experiment Improved" section, I added an extra feature (category_id) to the logistic regression. However, the results barely changed: recall for purchases rose from 0.68 to 0.70, precision for purchases stayed at 0.03, and overall accuracy dropped from 0.54 to 0.52.

On the data side, I could train this model on a bigger sample of the dataset. With more data, the model would see more instances of items being purchased, which would give it a better opportunity to learn the relationships that lead to a purchase.

# Logistic Regression Experiment For New Created Column view_count and price Predicting Product Purchases

In [15]:
#Feature Selection

x = df[['view_count', 'price']]
y = df['purchased']

#Split the data into training and test sets 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#Conduct Experiment
lr_model = LogisticRegression(max_iter=1000)

lr_model.fit(x_train, y_train)

y_prediction = lr_model.predict(x_test)

report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19588
           1       0.00      0.00      0.00       412

    accuracy                           0.98     20000
   macro avg       0.49      0.50      0.49     20000
weighted avg       0.96      0.98      0.97     20000





In [16]:
#Experiment Changed

x = df[['view_count', 'price']]
y = df['purchased']

#Split the data into training and test sets 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#Conduct Experiment
lr_model = LogisticRegression(max_iter=1000, class_weight='balanced')

lr_model.fit(x_train, y_train)

y_prediction = lr_model.predict(x_test)

report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       1.00      0.97      0.99     19561
           1       0.43      0.97      0.60       439

    accuracy                           0.97     20000
   macro avg       0.72      0.97      0.79     20000
weighted avg       0.99      0.97      0.98     20000



The input data (x) for this experiment was view_count, the number of times a user viewed a product, and the price of the product. The target (y) was purchased, which indicates whether a product was purchased.

The model performed well for non-purchased items (class 0): precision was 0.98, recall was 1.00, and overall accuracy was 98%. For purchased items (class 1), however, precision and recall were both 0.00, so the model did not correctly identify a single purchase in the test set. In other words, the model was extremely good at predicting that a product was not purchased but failed at predicting that a product was purchased.

The model is underfitting on the minority class, "purchased". The poor recall shows that the model isn't learning enough from the products that were actually purchased.

In the "Experiment Changed" section, I changed the class weights by setting class_weight='balanced'. This does a better job of addressing the underfitting, but not entirely. The class weights raised recall for purchased products from 0.00 to 0.97 and reduced the bias toward the majority class. However, precision for purchases is still only 0.43, so more than half of the predicted purchases were not actual purchases. The model is now much better at finding purchases, but its purchase predictions are still not very reliable.
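
To see what class_weight='balanced' does, the sketch below computes the weights scikit-learn would assign using n_samples / (n_classes * count_per_class). The label counts are borrowed from the test-set report above purely for illustration; in the notebook the weights come from the training labels, whose exact counts are not shown.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels only: counts borrowed from the test-set report above.
y_demo = np.array([0] * 19561 + [1] * 439)

# "balanced" weight for each class = n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
class_weights = dict(zip([0, 1], weights))
print(class_weights)
```

With these counts, each purchase row weighs roughly 45 times as much as a non-purchase row in the loss. That pushes the model to predict purchases far more often, which fits the jump in recall and the modest precision.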

On the data side, I could engineer more features that give the model more relevant information for its predictions. Features such as "time_spent_viewing" or "user_history" could help show the model which factors contribute to products being purchased.
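
As a rough idea of what a "user_history" feature could look like, the sketch below counts how many purchases each user had made before the current event. The toy rows, the timestamps, and the exact definition of user_history are assumptions for illustration; the notebook does not define this feature.

```python
import pandas as pd

# Hypothetical events; column names mirror the notebook's dataframe.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime([
        "2019-10-01 10:00", "2019-10-01 10:05", "2019-10-02 09:00",
        "2019-10-01 11:00", "2019-10-03 12:00",
    ]),
    "event_type": ["purchase", "view", "purchase", "view", "purchase"],
})
events["purchased"] = (events["event_type"] == "purchase").astype(int)

# Sort chronologically so the running total only looks at the past.
events = events.sort_values(["user_id", "event_time"]).reset_index(drop=True)

# Cumulative purchases per user, minus the current row, so the feature
# never includes the outcome it is meant to help predict.
events["user_history"] = (
    events.groupby("user_id")["purchased"].cumsum() - events["purchased"]
)
print(events)
```

Subtracting the current row matters: without it, the feature would leak the label into the inputs.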

# Classification Experiment for price, category_id, and brand Predicting Product Purchases

In [17]:
# Feature Selection
x = df[['price', 'category_id', 'brand']]
y = df['purchased']


x = pd.get_dummies(x, drop_first=True)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Conduct Experiment
rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)


y_prediction = rf_model.predict(x_test)

report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19574
           1       0.03      0.00      0.01       426

    accuracy                           0.98     20000
   macro avg       0.50      0.50      0.50     20000
weighted avg       0.96      0.98      0.97     20000



In [18]:
#Experiment Improved

# Feature Selection
x = df[['price', 'category_id', 'brand']]
y = df['purchased']


x = pd.get_dummies(x, drop_first=True)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Conduct Experiment
rf_model = RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=-1)
rf_model.fit(x_train, y_train)


y_prediction = rf_model.predict(x_test)

report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19574
           1       0.03      0.00      0.01       426

    accuracy                           0.98     20000
   macro avg       0.50      0.50      0.50     20000
weighted avg       0.96      0.98      0.97     20000



The input data (x) for this experiment was price, category_id, and brand. The target data or output (y) was purchased, which indicates whether a product was purchased or not. 

The model achieved an overall accuracy of 98% on the test set. However, the result is very similar to the unweighted logistic regression in the previous experiment. Precision and recall were high for products that were not purchased (0.98 and 1.00) but very poor for products that were purchased: precision was only 3% and recall rounded to 0%.

The model underfits when it comes to purchased products, as the low precision and recall for that class show. Although the model performs well in identifying products that were not purchased, it fails to generalize to the minority class, which means it hasn't learned enough about the characteristics of purchased products. This was surprising because I used an extra feature in comparison to the last experiment. The results may be slightly different, but they are mostly the same.

To address the underfitting, I could increase the complexity of the model, for example by increasing the number of trees in the RandomForestClassifier. More trees mainly make the ensemble's predictions more stable, so this helps only if the individual trees already pick up some signal. Another option is to adjust tree depth via max_depth, although scikit-learn's default (max_depth=None) already lets trees grow until the leaves are pure, so there is little extra depth to gain. In the "Experiment Improved" section, I increased the number of trees to 500 to see if there was a meaningful change in underfitting, but the results are identical.
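
Before tuning tree depth, it helps to check how deep the trees already are. The sketch below fits two small forests on synthetic, imbalanced stand-in data (not the notebook's dataset) and compares tree depths with the default max_depth=None against an explicit cap of 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (not the notebook's dataset).
X, y = make_classification(
    n_samples=2000, n_features=5, weights=[0.95, 0.05], random_state=42
)

# Default forest: max_depth=None, so each tree grows until its leaves are pure.
rf_default = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
# Capped forest for comparison.
rf_shallow = RandomForestClassifier(
    n_estimators=20, max_depth=3, random_state=42
).fit(X, y)

depths_default = [tree.get_depth() for tree in rf_default.estimators_]
depths_shallow = [tree.get_depth() for tree in rf_shallow.estimators_]
print("default depths:", min(depths_default), "to", max(depths_default))
print("capped depths: ", min(depths_shallow), "to", max(depths_shallow))
```

If the default trees are already deep, setting a larger max_depth changes nothing. Other settings, such as class_weight, may be more promising to try.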

On the data side, the main thing that I think would help is creating artificial examples of the minority class (purchased products in this case), i.e. oversampling. This would balance the dataset so that the model sees a more even number of examples from both classes during training.
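
One simple way to try this is random oversampling with scikit-learn's resample utility, sketched below on a hypothetical training frame. This duplicates existing purchase rows rather than synthesizing new ones, and it should be applied to the training split only so that the test set keeps its real class balance.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training frame: 95 non-purchases, 5 purchases.
train = pd.DataFrame({
    "price": range(100),
    "purchased": [0] * 95 + [1] * 5,
})

majority = train[train["purchased"] == 0]
minority = train[train["purchased"] == 1]

# Sample the minority class with replacement until it matches the majority.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["purchased"].value_counts())
```

After this step both classes have 95 rows, and the balanced frame can be split into x and y as in the cells above.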

# Using Price & Category ID to Predict Brand

In [29]:
# Feature
x = df[['price', 'category_id']]
y = df['brand']



# Label the brand names so we can operate based on it
y = pd.factorize(y)[0]

# Training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Conduct Experiment
lr_model = LogisticRegression(max_iter=1000, multi_class='multinomial', solver='lbfgs')
lr_model.fit(x_train, y_train)

#predictions
y_pred = lr_model.predict(x_test)


report = classification_report(y_test, y_pred)
print(report)



              precision    recall  f1-score   support

           0       0.19      1.00      0.32      3875
           1       0.00      0.00      0.00       394
           2       0.00      0.00      0.00       117
           3       0.00      0.00      0.00        15
           4       0.00      0.00      0.00        21
           5       0.00      0.00      0.00      3062
           6       0.00      0.00      0.00       227
           7       0.00      0.00      0.00      2030
           8       0.00      0.00      0.00       279
           9       0.00      0.00      0.00       832
          10       0.00      0.00      0.00        31
          11       0.00      0.00      0.00       259
          12       0.00      0.00      0.00        66
          13       0.00      0.00      0.00        30
          14       0.00      0.00      0.00         5
          15       0.00      0.00      0.00        17
          16       0.00      0.00      0.00         9
          17       0.00    



In [32]:
# IMPROVED
# Feature ()

from sklearn.preprocessing import LabelEncoder

x = df[['price', 'category_id']]
y = df['brand']



# Try a different label encoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Conduct Experiment
lr_model = LogisticRegression(max_iter=1000, multi_class='multinomial', solver='lbfgs')
lr_model.fit(x_train, y_train)

#predictions
y_pred = lr_model.predict(x_test)


report = classification_report(y_test, y_pred)
print(report)



              precision    recall  f1-score   support

           0       0.00      0.00      0.00       328
           2       0.00      0.00      0.00         5
           5       0.00      0.00      0.00         3
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00        23
           9       0.00      0.00      0.00         2
          10       0.00      0.00      0.00         2
          11       0.00      0.00      0.00         1
          12       0.00      0.00      0.00         3
          13       0.00      0.00      0.00        13
          14       0.00      0.00      0.00         2
          15       0.00      0.00      0.00         2
          16       0.00      0.00      0.00         1
          17       0.00      0.00      0.00         1
          18       0.00      0.00      0.00        21
          19       0.00      0.00      0.00         3
          20       0.00      0.00      0.00         1
          22       0.00    



The input data (x) for this experiment was price and category_id, and the target (y) was brand. I wanted to see whether brand could be determined from category and price, to see if there were trends in how brands price their products.

The model, quite honestly, performed rather poorly. However, I don’t think this is unexpected. Many tech products are priced similarly to their competitors in order to stay competitive. A new iPhone and a new Galaxy are not that far apart price-wise, for example. The model underfitted heavily, with an accuracy of 0.19: in the portion of the report shown, class 0 has a recall of 1.00 and every other class has 0.00, which suggests the model predicted a single brand for nearly every row. I could try a different algorithm, but honestly I think the relationship between these features and brand may simply be too weak to model.

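To tweak it a bit, I’ll try using scikit-learn’s LabelEncoder class, which I’ve used before and gotten good results with. Since this only changes which integer each brand is assigned, it should not be expected to change the model’s performance on its own.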
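
The difference between the two target encodings can be seen on a handful of made-up brand values. In this sketch, pd.factorize numbers brands in order of first appearance, while LabelEncoder numbers them in sorted order; the rows are grouped identically either way.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up brand values for illustration.
brands = pd.Series(["samsung", "apple", "xiaomi", "apple", "samsung"])

codes_factorize = pd.factorize(brands)[0]            # order of first appearance
codes_label = LabelEncoder().fit_transform(brands)   # sorted (alphabetical) order

print(codes_factorize)  # [0 1 2 1 0]
print(codes_label)      # [1 0 2 0 1]

# Both encodings group the rows identically; only the integer names differ.
mapping = set(zip(codes_factorize, codes_label))
print(len(mapping) == brands.nunique())  # True
```

Because the class labels are only renamed, the two reports list the classes in a different order but describe the same prediction problem.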

I think that to tweak this in a more meaningful way, however, my approach in general should be different. Predicting the brand might be hard without an idea of how similarly items of the same type are priced by other brands. I’d need to analyze all Apple products first, for example, break down the price points, and then see which specific items sit at those price points. Even with that level of granularity, I think I’d still struggle. I’m not entirely sure this is even a wise thing to be predicting.


# Classification Experiment for price, category_id, and brand Predicting Product Purchases (Using a Neural Network)

In [40]:
from sklearn.neural_network import MLPClassifier

# Feature Selection
x = df[['price', 'category_id', 'brand']]
y = df['purchased']

# One-hot encoding 
x = pd.get_dummies(x, drop_first=True)


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Use NN
nn_model = MLPClassifier(hidden_layer_sizes=(50,50), max_iter=1000, random_state=42)
nn_model.fit(x_train, y_train)


y_prediction = nn_model.predict(x_test)

# Report
report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19574
           1       0.00      0.00      0.00       426

    accuracy                           0.98     20000
   macro avg       0.49      0.50      0.49     20000
weighted avg       0.96      0.98      0.97     20000





In [45]:
from sklearn.neural_network import MLPClassifier
#Revised

# Feature Selection
x = df[['price', 'category_id', 'brand']]
y = df['purchased']

# One-hot encoding 
x = pd.get_dummies(x, drop_first=True)


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Use NN
nn_model = MLPClassifier(hidden_layer_sizes=(100,100), max_iter=1000, random_state=42)
nn_model.fit(x_train, y_train)


y_prediction = nn_model.predict(x_test)

# Report
report = classification_report(y_test, y_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     19574
           1       0.00      0.00      0.00       426

    accuracy                           0.98     20000
   macro avg       0.49      0.50      0.49     20000
weighted avg       0.96      0.98      0.97     20000






We wanted to see if we could take one of our previous experiments, where we used a RandomForest to predict “purchased” from brand, category_id, and price, and apply a neural network to it.


From our understanding, each connection in the neural network has a tweakable weight (and each node a bias) that determines how inputs are combined into a probable answer; training repeatedly pushes the prediction error backwards through the network to adjust them (backpropagation). The hidden_layer_sizes parameter is a tuple with one entry per hidden layer, where each entry is the number of neurons in that layer. So hidden_layer_sizes=(50,50) means two hidden layers with 50 neurons each.
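
The meaning of hidden_layer_sizes can be checked by inspecting a fitted model. The sketch below trains an MLPClassifier on a small synthetic binary problem (stand-in data with 4 features, not the notebook's dataset) and prints the shape of each weight matrix.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Small synthetic binary problem with 4 input features (stand-in data).
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

nn = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=500, random_state=42)
nn.fit(X, y)

# One weight matrix per pair of adjacent layers:
# input -> hidden 1, hidden 1 -> hidden 2, hidden 2 -> output.
print([w.shape for w in nn.coefs_])  # [(4, 50), (50, 50), (50, 1)]
print(nn.n_layers_)                  # 4 (input + 2 hidden + output)
```

Changing the tuple to (100, 100) keeps two hidden layers and widens each to 100 neurons; adding a third entry would add a third hidden layer.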

As with the RandomForest in the earlier experiment, we got 98% accuracy. However, precision and recall for purchases were both 0.00, so the network never correctly identified a purchase; the 98% matches the share of non-purchases in the test set (19,574 of 20,000). We wanted to see if changing the number of neurons would get us closer to 99%, but going from (50,50) to (100,100) produced the same report.
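
A quick way to put the 98% figure in context is to compare it with a majority-class baseline. The sketch below rebuilds only the label counts from the support column of the report (19,574 non-purchases and 426 purchases) and scores a DummyClassifier that always predicts the most frequent class; the placeholder feature matrix is an assumption, since the baseline ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels rebuilt from the support column of the report.
y_true = np.array([0] * 19574 + [1] * 426)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline

# Always predicts the most frequent class (here: 0, not purchased).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))
print(f"{baseline_acc:.4f}")  # 0.9787
```

A model has to beat this number, or do better on per-class recall and precision, before its accuracy says anything about purchases.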

In the end, even further tweaks didn’t really affect it much. It seemed like just as good a predictive algorithm as the RandomForest here, but given its complexity it is probably better suited to tasks such as image prediction.

# Predicting Event Type via RandomForestClassifier Using Price, View_Count, Category_Id, and Brand

In [43]:

x = df[['price', 'view_count', 'category_id', 'brand']]
y = df['event_type']  

# One-hot encoding
x = pd.get_dummies(x, drop_first=True)


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
report = classification_report(y_test, y_pred)
print(report)


              precision    recall  f1-score   support

        cart       0.56      0.62      0.59       558
    purchase       0.44      0.36      0.40       426
        view       1.00      1.00      1.00     19016

    accuracy                           0.98     20000
   macro avg       0.67      0.66      0.66     20000
weighted avg       0.97      0.98      0.98     20000



In [44]:
#Revised 

x = df[['price', 'view_count', 'category_id', 'brand']]
y = df['event_type']  

# One-hot encoding
x = pd.get_dummies(x, drop_first=True)


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
model.fit(x_train, y_train)


y_pred = model.predict(x_test)
report = classification_report(y_test, y_pred)
print(report)


              precision    recall  f1-score   support

        cart       0.00      0.00      0.00       558
    purchase       0.00      0.00      0.00       426
        view       0.95      1.00      0.97     19016

    accuracy                           0.95     20000
   macro avg       0.32      0.33      0.32     20000
weighted avg       0.90      0.95      0.93     20000





We wanted to see if an event type could be determined given price, view count, category id, and brand, since this would let us automate the process of determining which conditions make a product appealing enough to view but not to add to a cart or purchase. If an item, for example, is interesting to look at but not often purchased, perhaps the price isn’t right. If it’s not even being looked at, then perhaps both the product and the price are the issue.

Using a RandomForest is simple and has thus far proven efficient at getting high accuracy. As expected, it returned 98% accuracy. It predicted views almost perfectly (f1-score 1.00) but was weaker on cart (0.59) and purchase (0.40). We could try deepening the trees or increasing the number of trees, but that likely wouldn’t amount to much. Instead, we wanted to see again how different a neural network approach would be. The neural network did worse: it scored 0.00 on both cart and purchase, and its 95% accuracy matches the share of view events in the test set (19,016 of 20,000).
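
One possible contributor to the neural network's result is feature scale: scikit-learn's documentation notes that multi-layer perceptrons are sensitive to feature scaling, while tree-based models are not. Whether scaling explains the result here was not tested in this notebook. The sketch below only shows how a StandardScaler could be placed in front of the MLPClassifier, using made-up features and a toy target.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Stand-in features on very different scales (values are made up).
price = rng.uniform(1, 2000, size=500)
view_count = rng.integers(0, 20, size=500)
X = np.column_stack([price, view_count])
y = (view_count > 10).astype(int)  # toy target for demonstration only

# StandardScaler rescales each column to zero mean and unit variance
# before the network sees it.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
)
model.fit(X, y)
print(model.predict(X[:5]))
```

Wrapping both steps in one pipeline means the scaler is fit on the training data only and then reused unchanged when predicting on the test set.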

Having more features that allow complex relationships to be modeled would be helpful. I wish there was some kind of “also_purchased” feature that gave an n-depth look at which related items were purchased in a short timespan around this action. This way, users could be better classified and their behavior fit to trends.
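
A rough version of such an “also_purchased” feature could be derived from the event log itself. The sketch below is one hypothetical definition: for each purchase, list the other products the same user purchased within 30 minutes. The toy rows, the window size, and the column name also_purchased are all assumptions.

```python
import pandas as pd

# Hypothetical purchase events; column names mirror the notebook's dataframe.
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "product_id": [10, 11, 12, 10],
    "event_time": pd.to_datetime([
        "2019-10-01 10:00", "2019-10-01 10:20", "2019-10-02 09:00",
        "2019-10-01 11:00",
    ]),
})

window = pd.Timedelta(minutes=30)  # assumed window size

# Pair every purchase with the same user's other purchases.
pairs = purchases.merge(purchases, on="user_id", suffixes=("", "_other"))
pairs = pairs[pairs["product_id"] != pairs["product_id_other"]]
# Keep only pairs that happened close together in time.
pairs = pairs[(pairs["event_time"] - pairs["event_time_other"]).abs() <= window]

# Collect the nearby products into one list per (user, product).
also_purchased = (
    pairs.groupby(["user_id", "product_id"])["product_id_other"]
    .apply(list)
    .reset_index(name="also_purchased")
)
print(also_purchased)
```

The self-merge grows quadratically with the number of purchases per user, so on the full dataset this would need a more careful approach, for example restricting to one session at a time.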
