# E-commerce Recommendation System

## Business Understanding

### Project Overview
E-commerce platforms rely on personalized recommendations to improve user experience, increase engagement, and drive sales. Customers interact with products through various events such as **views, clicks, and add-to-cart actions**, but these interactions are often unstructured. The goal of this project is to **build a recommendation system that predicts item properties for "add to cart" events based on prior "view" events** while also detecting abnormal user behavior to enhance recommendation accuracy.

### Business Objectives
1. **Improve Personalization**
   - Predict which product properties influence a user's decision to add an item to the cart.
   - Help businesses tailor their recommendations based on implicit browsing behavior.

2. **Reduce Noise and Improve Efficiency**
   - Identify and remove **abnormal users** who introduce bias and noise into the dataset.
   - Ensure data quality for better recommendation performance.

3. **Enhance Customer Engagement and Sales**
   - Deliver relevant product recommendations, increasing conversion rates.
   - Improve user retention by optimizing the browsing experience.

### Problem Statement
- Customers interact with multiple products before making a purchase decision. However, the properties influencing these decisions (e.g., price, brand, availability) are not explicitly logged.
- **How can we infer product properties that contribute to an "add to cart" decision based on past "view" events?**
- Additionally, **how can we detect and filter out abnormal users who distort recommendation accuracy?**

### Data Understanding
The project relies on three key datasets:

#### 1. `events.csv` (User Interaction Data)
| Column       | Description |
|-------------|------------|
| `timestamp`  | Time when the interaction occurred. |
| `visitorid`  | Unique identifier for each user. |
| `event`      | Type of interaction (e.g., view, add to cart). |
| `itemid`     | Unique identifier for each product. |
| `transactionid` | Identifies transactions (for purchases). |

#### 2. `item_properties.csv` (Product Metadata)
| Column       | Description |
|-------------|------------|
| `timestamp`  | Time when the property was recorded. |
| `itemid`     | Product identifier. |
| `property`   | Feature of the product (e.g., category, availability). |
| `value`      | Corresponding value of the property. |

#### 3. `category_tree.csv` (Product Category Data)
| Column       | Description |
|-------------|------------|
| `categoryid`  | Child category identifier |
| `parentid`     | Parent category identifier. |

### Project Scope
- **Task 1:** Develop an algorithm to predict item properties for "add to cart" events based on "view" events.
- **Task 2:** Detect abnormal users who generate noise and remove them to improve recommendation accuracy.

By addressing these tasks, the project will deliver a robust recommendation system that enhances **e-commerce personalization** and **business intelligence insights** while ensuring **clean and reliable data**. 🚀

### Hypothesis Testing
- **Null Hypothesis(Ho):**
- **Alternate Hypothesis(Ha):**

### Analytical Questions
1. How do we identify product properties that influence an "add to cart" decision?
2. What factors contribute to the decision to add an item to the cart?
3. How can we detect and filter out abnormal users who distort recommendation accuracy?
4. What are the potential impacts of this recommendation system on the e-commerce business?
5. How can we measure the success of this recommendation system and improve it over time?
6. Does visit time have influence on an "add to cart" decision?
7. What is the view to "add to cart" conversion rate?



#### Import all necessary libraries

In [1]:
# Data Manipulation Libraries
import pandas as pd
import numpy as np
import dask.dataframe as dd

# Data Visualization Libraries
import matplotlib.pyplot as plt
import plotly.express as px

# Statistical Libraries
from scipy import stats

# Feature Engineering Libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Machine Learning Libraries
# from lightgbm import LGBMClassifier

# Metrics Libraries
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#Utils
import warnings
from datetime import datetime
warnings.filterwarnings('ignore')

print("Successfully imported all libraries...")

Successfully imported all libraries...


#### Data Understanding

In [2]:
# Load datasets
events_df = pd.read_csv('../data/events.csv')
item_properties1_df = pd.read_csv('../data/item_properties_part1.1.csv')
item_properties2_df = pd.read_csv('../data/item_properties_part2.csv')
category_tree_df = pd.read_csv('../data/category_tree.csv')


In [3]:
# Load the events_df
events_df

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,
...,...,...,...,...,...
2756096,1438398785939,591435,view,261427,
2756097,1438399813142,762376,view,115946,
2756098,1438397820527,1251746,view,78144,
2756099,1438398530703,1184451,view,283392,


In [4]:
# Load the category_tree_df 
category_tree_df

Unnamed: 0,categoryid,parentid
0,1016,213.0
1,809,169.0
2,570,9.0
3,1691,885.0
4,536,1691.0
...,...,...
1664,49,1125.0
1665,1112,630.0
1666,1336,745.0
1667,689,207.0


In [5]:
# Load the item_properties1_df
item_properties1_df

Unnamed: 0,timestamp,itemid,property,value
0,1435460400000,460429,categoryid,1338
1,1441508400000,206783,888,1116713 960601 n277.200
2,1439089200000,395014,400,n552.000 639502 n720.000 424566
3,1431226800000,59481,790,n15360.000
4,1431831600000,156781,917,828513
...,...,...,...,...
10999994,1439694000000,86599,categoryid,618
10999995,1435460400000,153032,1066,n1020.000 424566
10999996,1440298800000,421788,888,35975 856003 37346
10999997,1437879600000,159792,400,n552.000 639502 n720.000 424566


In [6]:
# Load the item_properties2_df
item_properties2_df

Unnamed: 0,timestamp,itemid,property,value
0,1433041200000,183478,561,769062
1,1439694000000,132256,976,n26.400 1135780
2,1435460400000,420307,921,1149317 1257525
3,1431831600000,403324,917,1204143
4,1435460400000,230701,521,769062
...,...,...,...,...
9275898,1433646000000,236931,929,n12.000
9275899,1440903600000,455746,6,150169 639134
9275900,1439694000000,347565,686,610834
9275901,1433646000000,287231,867,769062


#### Concatenate the item properties dataframes

In [7]:
item_properties_df = pd.concat([item_properties1_df,item_properties2_df],ignore_index=True)

# Display the first few rows of the concatenated dataframe

item_properties_df.head()

Unnamed: 0,timestamp,itemid,property,value
0,1435460400000,460429,categoryid,1338
1,1441508400000,206783,888,1116713 960601 n277.200
2,1439089200000,395014,400,n552.000 639502 n720.000 424566
3,1431226800000,59481,790,n15360.000
4,1431831600000,156781,917,828513


In [8]:
# check the shape of the new items_properties_df

item_properties_df.shape

(20275902, 4)

#### Create New Features from Property column of the item_properties

In [9]:
# initialize the categoryid and available columns
item_properties_df["categoryid"] = np.nan
item_properties_df["available"] = np.nan

# Assign values to the new column based on the 'property' column
item_properties_df.loc[item_properties_df["property"] == "categoryid","categoryid"] = item_properties_df["value"]
item_properties_df.loc[item_properties_df["property"] == "available","available"] = item_properties_df["value"]

In [10]:

# Move all the already existing numerical category ids into categoryid column
item_properties_df.loc[item_properties_df["property"].str.isnumeric(), "categoryid"] = item_properties_df["property"]

In [11]:
# create new column called description column
item_properties_df["description"] = item_properties_df.apply(
    lambda row: row["value"] if row["property"] not in ["categoryid","available"] else np.nan, axis=1 )


In [12]:
# drop unnecessary property and value columns
item_properties_df.drop(["property", "value"], axis=1, inplace=True)

In [13]:
# confirm changes
item_properties_df

Unnamed: 0,timestamp,itemid,categoryid,available,description
0,1435460400000,460429,1338,,
1,1441508400000,206783,888,,1116713 960601 n277.200
2,1439089200000,395014,400,,n552.000 639502 n720.000 424566
3,1431226800000,59481,790,,n15360.000
4,1431831600000,156781,917,,828513
...,...,...,...,...,...
20275897,1433646000000,236931,929,,n12.000
20275898,1440903600000,455746,6,,150169 639134
20275899,1439694000000,347565,686,,610834
20275900,1433646000000,287231,867,,769062


In [None]:
# Get info about the item_properties_df

item_properties_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20275902 entries, 0 to 20275901
Data columns (total 5 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   timestamp    int64 
 1   itemid       int64 
 2   categoryid   object
 3   available    object
 4   description  object
dtypes: int64(2), object(3)
memory usage: 773.5+ MB


: 

In [None]:
# check for null values
item_properties_df.isna().sum()

In [None]:
# Look at the events_df
events_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2756101 entries, 0 to 2756100
Data columns (total 5 columns):
 #   Column         Dtype  
---  ------         -----  
 0   timestamp      int64  
 1   visitorid      int64  
 2   event          object 
 3   itemid         int64  
 4   transactionid  float64
dtypes: float64(1), int64(3), object(1)
memory usage: 105.1+ MB


In [None]:
# check for null values
events_df.isna().sum()

timestamp              0
visitorid              0
event                  0
itemid                 0
transactionid    2733644
dtype: int64

In [None]:
item_properties_df.duplicated().sum()

np.int64(0)

#### Merging Datasets
- The merge strategy to be used for merging the events_df and the item_properties is the left join taking the events_df as the main dataframe. 
- I shall consider the timestamp and the itemid as the two unique columns to base my joins on. 
- I will finally merge the item_event dataframe to category_tree based on the categoryid to form the final_df 

In [None]:
# Convert timestamp in item properties to datetime
item_properties_df["timestamp"] = pd.to_datetime(item_properties_df["timestamp"])

# Convert timestamp in events to
events_df["timestamp"] = pd.to_datetime(events_df["timestamp"])


item_properties_df["timestamp"] = item_properties_df["timestamp"].dt.floor("s")
events_df["timestamp"] = events_df["timestamp"].dt.floor("s")


In [None]:
# confirm changes
item_properties_df["timestamp"].head()

0   1970-01-01 00:23:55
1   1970-01-01 00:24:01
2   1970-01-01 00:23:59
3   1970-01-01 00:23:51
4   1970-01-01 00:23:51
Name: timestamp, dtype: datetime64[ns]

In [None]:
# Rename timestamp columns to datetime column
item_properties_df = item_properties_df.rename(columns = {"timestamp":"datetime"})
events_df = events_df.rename(columns={"timestamp": "datetime"})

In [None]:
# Convert datetime dtype to s
item_properties_df["datetime"] = item_properties_df["datetime"].astype("datetime64[s]")
events_df["datetime"] = events_df["datetime"].astype("datetime64[s]")


In [None]:
# Merge using 
event_item_df = events_df.merge(item_properties_df, on=["datetime","itemid"], how="inner", indicator=True)

In [None]:
# count the indicator types
event_item_df["_merge"].value_counts()

_merge
both          14701714
left_only            0
right_only           0
Name: count, dtype: int64

In [None]:
event_item_df.head()

Unnamed: 0,datetime,visitorid,event,itemid,transactionid,categoryid,available,description,_merge
0,1970-01-01 00:23:53,257597,view,355908,,1036,,726612,both
1,1970-01-01 00:23:53,257597,view,355908,,364,,610075,both
2,1970-01-01 00:23:53,257597,view,355908,,400,,n600.000 424566,both
3,1970-01-01 00:23:53,257597,view,355908,,400,,n600.000 424566,both
4,1970-01-01 00:23:53,257597,view,355908,,1066,,n1020.000 424566,both


In [None]:
# view last five rows
event_item_df.tail()

In [None]:
event_item_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20275902 entries, 0 to 20275901
Data columns (total 9 columns):
 #   Column         Dtype   
---  ------         -----   
 0   timestamp      int64   
 1   itemid         int64   
 2   categoryid     object  
 3   available      object  
 4   description    object  
 5   visitorid      float64 
 6   event          object  
 7   transactionid  float64 
 8   _merge         category
dtypes: category(1), float64(2), int64(2), object(4)
memory usage: 1.2+ GB


In [None]:
# Convert the categoryid and available to int datatype

event_item_df["categoryid"] = event_item_df["categoryid"].astype("Int64")
event_item_df["available"] = event_item_df["available"].astype("Int64")


In [None]:
# Merge item_event_df to category_id
final_df = event_item_df.merge(category_tree_df, how="inner", on="categoryid", indicator=True)


In [None]:
# check the shape of the final_df
final_df.shape

(20275902, 10)

In [None]:
# View first five rows
final_df.head()

Unnamed: 0,timestamp,itemid,categoryid,available,description,visitorid,event,transactionid,_merge,parentid
0,1435460400000,460429,1338,,,,,,left_only,1278.0
1,1441508400000,206783,888,,1116713 960601 n277.200,,,,left_only,866.0
2,1439089200000,395014,400,,n552.000 639502 n720.000 424566,,,,left_only,110.0
3,1431226800000,59481,790,,n15360.000,,,,left_only,1492.0
4,1431831600000,156781,917,,828513,,,,left_only,1374.0


In [None]:
# check the info about final_df
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20275902 entries, 0 to 20275901
Data columns (total 10 columns):
 #   Column         Dtype   
---  ------         -----   
 0   timestamp      int64   
 1   itemid         int64   
 2   categoryid     Int64   
 3   available      Int64   
 4   description    object  
 5   visitorid      float64 
 6   event          object  
 7   transactionid  float64 
 8   _merge         category
 9   parentid       float64 
dtypes: Int64(2), category(1), float64(3), int64(2), object(2)
memory usage: 1.4+ GB


In [None]:
item_properties_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20275902 entries, 0 to 20275901
Data columns (total 5 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   timestamp    int64 
 1   itemid       int64 
 2   categoryid   object
 3   available    object
 4   description  object
dtypes: int64(2), object(3)
memory usage: 773.5+ MB


In [None]:
# perform summary descriptive analysis on the final_df
final_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
timestamp,20275902.0,1435156943682.8833,3327797780.708256,1431226800000.0,1432436400000.0,1433646000000.0,1437879600000.0,1442113200000.0
itemid,20275902.0,233390.432499,134845.230654,0.0,116516.0,233483.0,350304.0,466866.0
categoryid,18772263.0,617.875911,322.80467,0.0,348.0,747.0,888.0,1697.0
available,1503639.0,0.426002,0.494494,0.0,0.0,0.0,1.0,1.0
visitorid,0.0,,,,,,,
transactionid,0.0,,,,,,,
parentid,18604575.0,895.854688,461.922945,8.0,602.0,866.0,1370.0,1698.0


### Exploratory Data Analysis

#### Answering Analytical Questions

#### Hypothesis Testing

## Data Preparation

## Modeling and Evaluation

## Deployment

## Conclusion