# Enhancing E-Commerce Strategies Through Customer Behavior Analysis
<span style="font-size:20px;"> - Amisha Kelkar, Arundhati (Ari) Kolahal, Chaitali Deshmukh, Neha Shastri</span>

## Sentiment Analysis using Pretrained Model from Hugging Face


### Importing Relevant Libraries

In [None]:
! pip install transformers torch
! pip install pyarrow
import ast  # used by safe_convert to parse stringified lists
import pandas as pd
import re
from transformers import pipeline

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Loading the Dataset

In [None]:
df_sentiment_analysis = pd.read_parquet('product_recommendation.parquet')
df_sentiment_analysis.head()


Unnamed: 0,parent_asin,product_title,product_description,product_reviews,review_timestamps
0,B000AST3AK,GE MWF Refrigerator Water Filter | Certified t...,GE MWF Refrigerator Water Filter | Certified t...,[Although it was recommended for my refrigerat...,"[1690065844128, 1685935815785, 1685221375681, ..."
1,B000DLB2FI,Keurig My K-Cup Reusable Coffee Filter - Old M...,Keurig My K-Cup Reusable Coffee Filter - Old M...,[Works but.... This item works but if you like...,"[1681058624205, 1679970351001, 1671392629773, ..."
2,B000UW2DTE,Whirlpool 4396841 PUR [Fast Fill] FILTER3 Refr...,Whirlpool 4396841 PUR [Fast Fill] FILTER3 Refr...,"[Genuine Whirlpool filter works as designed, P...","[1661202847659, 1632334479657, 1616358618992, ..."
3,B001ICYB2M,SAMSUNG Heating Element Dc47-00019A,"SAMSUNG Heating Element Dc47-00019A [""This hig...",[Great value and it does the job. Bought it. A...,"[1681473410328, 1680181341346, 1679992987436, ..."
4,B002JAKRAM,Frigidaire ULTRAWF PureSource Ultra Water and ...,Frigidaire ULTRAWF PureSource Ultra Water and ...,[Excellent Filter for Frigidaire Refrigerator ...,"[1692655397593, 1691978846991, 1690735181462, ..."


In [None]:
df_sentiment_analysis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   parent_asin          100 non-null    object
 1   product_title        100 non-null    object
 2   product_description  100 non-null    object
 3   product_reviews      100 non-null    object
 4   review_timestamps    100 non-null    object
dtypes: object(5)
memory usage: 4.0+ KB


### Sentiment Analysis Using a Pretrained Model
* The sentiment analysis was conducted using a pretrained model from Hugging Face (https://huggingface.co/LiYuan/amazon-review-sentiment-analysis).

* The dataset was cleaned before the pretrained model was applied.

In [None]:
# Sample cleaning function
def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    text = text.lower().strip()  # Convert to lowercase and strip spaces
    return text

def safe_convert(text):
    try:
        return ast.literal_eval(text) if isinstance(text, str) else text
    except (ValueError, SyntaxError):
        return []  # Return empty list if conversion fails

df_sentiment_analysis['product_reviews'] = df_sentiment_analysis['product_reviews'].apply(safe_convert)

# Ensure all elements are strings and apply text cleaning
df_sentiment_analysis['product_reviews'] = df_sentiment_analysis['product_reviews'].apply(lambda x: [clean_text(str(review)) for review in x if isinstance(review, str)])
df_sentiment_analysis['product_description'] = df_sentiment_analysis['product_description'].apply(lambda x : [clean_text(str(x))])
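
To make the effect of the two helpers concrete, the following is a small self-contained check. The sample strings are invented for illustration and do not come from the dataset.

```python
import ast
import re

def clean_text(text):
    # Keep only letters, digits, and whitespace, then lowercase and trim
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text.lower().strip()

def safe_convert(value):
    # Parse a stringified list; pass non-strings through; fall back to []
    try:
        return ast.literal_eval(value) if isinstance(value, str) else value
    except (ValueError, SyntaxError):
        return []

# Hypothetical stringified list of two reviews
sample = "['Works GREAT!!', \"Didn't fit :(\"]"
reviews = safe_convert(sample)
cleaned = [clean_text(r) for r in reviews]
print(cleaned)

# A string that is not a valid Python literal falls back to an empty list
print(safe_convert("not a list ["))
```

Note that removing punctuation also merges contractions ("Didn't" becomes "didnt"), which may or may not matter for the downstream model.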

* Each product's reviews were stored as a single list (100 reviews per product), so the list had to be exploded to obtain one record per review.

In [None]:
df_exploded = df_sentiment_analysis.explode("product_reviews").reset_index(drop=True)


In [None]:
df_exploded.head()

Unnamed: 0,parent_asin,product_title,product_description,product_reviews,review_timestamps
0,B000AST3AK,GE MWF Refrigerator Water Filter | Certified t...,[ge mwf refrigerator water filter certified t...,although it was recommended for my refrigerato...,"[1690065844128, 1685935815785, 1685221375681, ..."
1,B000AST3AK,GE MWF Refrigerator Water Filter | Certified t...,[ge mwf refrigerator water filter certified t...,horrible after installing vibrated and had to ...,"[1690065844128, 1685935815785, 1685221375681, ..."
2,B000AST3AK,GE MWF Refrigerator Water Filter | Certified t...,[ge mwf refrigerator water filter certified t...,oem filter keeps all the governments mind cont...,"[1690065844128, 1685935815785, 1685221375681, ..."
3,B000AST3AK,GE MWF Refrigerator Water Filter | Certified t...,[ge mwf refrigerator water filter certified t...,oem you just cant beat oem generics never seem...,"[1690065844128, 1685935815785, 1685221375681, ..."
4,B000AST3AK,GE MWF Refrigerator Water Filter | Certified t...,[ge mwf refrigerator water filter certified t...,very easy to installed easy to install,"[1690065844128, 1685935815785, 1685221375681, ..."
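
As the output shows, exploding only `product_reviews` leaves the full `review_timestamps` list repeated on every row. If a per-review timestamp were needed, one possible approach is to explode both columns together. This is a sketch on toy data, assuming pandas 1.3 or later, equal-length lists per product, and timestamps in epoch milliseconds; `toy`, `paired`, and `review_time` are illustrative names.

```python
import pandas as pd

# Toy frame mimicking the list-valued columns of the dataset
toy = pd.DataFrame({
    "parent_asin": ["A1", "B2"],
    "product_reviews": [["good", "bad"], ["fine"]],
    "review_timestamps": [[1690065844128, 1685935815785], [1681058624205]],
})

# Explode both list columns in lockstep so each review keeps its own timestamp
paired = toy.explode(["product_reviews", "review_timestamps"]).reset_index(drop=True)

# Assumption: the integers are milliseconds since the Unix epoch
paired["review_time"] = pd.to_datetime(
    paired["review_timestamps"].astype("int64"), unit="ms"
)
print(paired)
```

Because the cleaning step drops non-string reviews, the two lists could end up with different lengths in the real data, in which case the multi-column explode raises an error and the lists would need to be aligned first.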


* Initialized the pretrained model to run the sentiment analysis.

In [None]:
sentiment_pipeline = pipeline("text-classification", model="LiYuan/amazon-review-sentiment-analysis")


Device set to use cuda:0


* The model limits input length to 512 tokens; as a simple approximation, each review was truncated to its first 512 characters.
* The model rates each review on a scale of 1 to 5 stars.

In [None]:
# Keep only the columns needed for scoring; copy to avoid SettingWithCopyWarning
df_sentiment_analysis = df_exploded[["product_title", "product_reviews"]].dropna().copy()

# Cast reviews to strings and keep the first 512 characters of each
df_sentiment_analysis["product_reviews"] = df_sentiment_analysis["product_reviews"].astype(str).str[:512]
df_sentiment_analyzed = df_sentiment_analysis.copy()

# Score each review and keep the predicted star label (e.g. "1 star", "5 stars")
df_sentiment_analyzed["sentiment"] = df_sentiment_analyzed["product_reviews"].apply(lambda x: sentiment_pipeline(x)[0]["label"])

df_sentiment_analyzed.head()


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Unnamed: 0,product_title,product_reviews,sentiment
0,GE MWF Refrigerator Water Filter | Certified t...,although it was recommended for my refrigerato...,1 star
1,GE MWF Refrigerator Water Filter | Certified t...,horrible after installing vibrated and had to ...,1 star
2,GE MWF Refrigerator Water Filter | Certified t...,oem filter keeps all the governments mind cont...,5 stars
3,GE MWF Refrigerator Water Filter | Certified t...,oem you just cant beat oem generics never seem...,1 star
4,GE MWF Refrigerator Water Filter | Certified t...,very easy to installed easy to install,5 stars
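
The pipeline warning above points at the cost of scoring one review at a time. One way to address it is to send reviews in chunks. The sketch below uses a hypothetical helper, `classify_in_batches`, and a stand-in classifier, `fake_pipeline`, so that it runs without downloading the model; it is not the code used to produce the results in this notebook.

```python
def classify_in_batches(texts, classifier, batch_size=32, **kwargs):
    """Score texts in chunks and return one label per text, in order."""
    labels = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # The classifier is expected to return one {"label": ..., "score": ...} dict per text
        results = classifier(chunk, **kwargs)
        labels.extend(result["label"] for result in results)
    return labels

def fake_pipeline(batch, **kwargs):
    # Stand-in for the real model: mimics the output shape of a text-classification pipeline
    return [
        {"label": "5 stars" if "great" in text else "1 star", "score": 1.0}
        for text in batch
    ]

texts = ["great filter", "leaks badly", "great value"]
labels = classify_in_batches(texts, fake_pipeline, batch_size=2)
print(labels)
```

With the real model, the call would presumably look like `classify_in_batches(df_sentiment_analyzed["product_reviews"].tolist(), sentiment_pipeline, truncation=True)`. Transformers pipelines generally accept a list of strings and a `truncation` argument that cuts inputs at the token limit, but this should be checked against the installed version.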


* Categorized reviews marked over 3 stars as positive reviews.
* Categorized reviews marked as 3 stars as neutral reviews.
* Categorized reviews marked below 3 stars as negative reviews.

In [None]:
def map_sentiment(label):
    if label in ["4 stars", "5 stars"]:
        return "Positive"
    elif label == "3 stars":
        return "Neutral"
    else:
        return "Negative"

df_sentiment_analyzed["sentiment_category"] = df_sentiment_analyzed["sentiment"].apply(map_sentiment)
df_sentiment_analyzed.head()

Unnamed: 0,product_title,product_reviews,sentiment,sentiment_category
0,GE MWF Refrigerator Water Filter | Certified t...,although it was recommended for my refrigerato...,1 star,Negative
1,GE MWF Refrigerator Water Filter | Certified t...,horrible after installing vibrated and had to ...,1 star,Negative
2,GE MWF Refrigerator Water Filter | Certified t...,oem filter keeps all the governments mind cont...,5 stars,Positive
3,GE MWF Refrigerator Water Filter | Certified t...,oem you just cant beat oem generics never seem...,1 star,Negative
4,GE MWF Refrigerator Water Filter | Certified t...,very easy to installed easy to install,5 stars,Positive
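
The same three-way mapping can also be written as a dictionary lookup, which makes it easy to summarize the share of each category. The sketch below runs on toy labels. The strings `"2 stars"`, `"3 stars"`, and `"4 stars"` are assumed to follow the pattern visible in the output above, and `STAR_TO_CATEGORY` is an illustrative name.

```python
import pandas as pd

# Assumed label strings, following the "1 star" / "5 stars" pattern in the output
STAR_TO_CATEGORY = {
    "1 star": "Negative",
    "2 stars": "Negative",
    "3 stars": "Neutral",
    "4 stars": "Positive",
    "5 stars": "Positive",
}

toy = pd.DataFrame({"sentiment": ["1 star", "5 stars", "3 stars", "5 stars"]})
toy["sentiment_category"] = toy["sentiment"].map(STAR_TO_CATEGORY)

# Share of reviews in each sentiment category
shares = toy["sentiment_category"].value_counts(normalize=True)
print(shares)
```

Unlike `map_sentiment`, which sends every unrecognized label to "Negative", a dictionary lookup leaves unknown labels as `NaN`, so unexpected model output is easier to spot.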


* Categorized the list of products into broad categories for better interpretability using Gen AI assistance.

In [None]:
def categorize_appliance(text):
    text = text.lower()
    if any(keyword in text for keyword in ["microwave", "blender", "toaster", "coffee", "kettle", "keurig"]):
        return "Kitchen Appliance"
    elif any(keyword in text for keyword in ["vacuum", "dishwasher", "washing machine", "laundry", "dryer"]):
        return "Cleaning Appliance"
    elif any(keyword in text for keyword in ["air conditioner", "heater", "fan", "refrigerator", "fridge", "freezer"]):
        return "Cooling & Heating"
    elif any(keyword in text for keyword in ["oven", "stove", "cooktop", "grill", "burner", "range hood"]):
        return "Cooking Appliance"
    else:
        return "Other"

df_sentiment_analyzed["product_category"] = df_sentiment_analyzed["product_title"].apply(categorize_appliance)
df_sentiment_analyzed.head()


Unnamed: 0,product_title,product_reviews,sentiment,sentiment_category,product_category
0,GE MWF Refrigerator Water Filter | Certified t...,although it was recommended for my refrigerato...,1 star,Negative,Cooling & Heating
1,GE MWF Refrigerator Water Filter | Certified t...,horrible after installing vibrated and had to ...,1 star,Negative,Cooling & Heating
2,GE MWF Refrigerator Water Filter | Certified t...,oem filter keeps all the governments mind cont...,5 stars,Positive,Cooling & Heating
3,GE MWF Refrigerator Water Filter | Certified t...,oem you just cant beat oem generics never seem...,1 star,Negative,Cooling & Heating
4,GE MWF Refrigerator Water Filter | Certified t...,very easy to installed easy to install,5 stars,Positive,Cooling & Heating
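
Two properties of `categorize_appliance` are worth keeping in mind. The first matching branch wins, and `keyword in text` is a substring test, so "fan" also matches "fantastic". One possible refinement, shown here with abbreviated keyword lists and not taken from the original analysis, is to match whole words with a regular expression. `CATEGORY_KEYWORDS`, `PATTERNS`, and `categorize_whole_word` are illustrative names.

```python
import re

# Abbreviated keyword lists for illustration; dict order sets the priority
CATEGORY_KEYWORDS = {
    "Kitchen Appliance": ["microwave", "blender", "coffee"],
    "Cooling & Heating": ["heater", "fan", "refrigerator"],
}

# One compiled pattern per category, matching any keyword as a whole word
PATTERNS = {
    category: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    for category, words in CATEGORY_KEYWORDS.items()
}

def categorize_whole_word(title):
    title = title.lower()
    for category, pattern in PATTERNS.items():
        if pattern.search(title):
            return category
    return "Other"

print(categorize_whole_word("Fantastic Ice Maker"))  # no whole-word keyword present
print(categorize_whole_word("Ceiling Fan with Remote"))
```

The trade-off is that whole-word matching misses plurals such as "fans" unless they are listed explicitly.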


* Calculated average rating of reviews per product

In [None]:
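# Pull the digit out of labels such as "4 stars"; raw string avoids an invalid-escape warning
df_sentiment_analyzed["sentiment"] = df_sentiment_analyzed["sentiment"].str.extract(r"(\d)", expand=False).astype(float)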
avg_sentiment = df_sentiment_analyzed.groupby(["product_category", "product_title"])["sentiment"].mean().reset_index()
avg_sentiment = avg_sentiment.sort_values(["product_category", "sentiment"], ascending=[True, False])
avg_sentiment

Unnamed: 0,product_category,product_title,sentiment
0,Cleaning Appliance,(2023 Update) 3392519 Dryer Thermal Fuse Repla...,4.76
7,Cleaning Appliance,"Dishwasher Magnet Clean Dirty Sign, Universal ...",4.75
5,Cleaning Appliance,Cimkiz Dishwasher Magnet Clean Dirty Sign Shut...,4.66
13,Cleaning Appliance,ULTRA DURABLE 3406107 Dryer Door Switch Replac...,4.61
9,Cleaning Appliance,"OUGAR8 Refrigerator Door Handle Covers,Keep Yo...",4.60
...,...,...,...
96,Other,"Silonn Ice Makers Countertop, 9 Cubes Ready in...",3.30
88,Other,"FRIGIDAIRE EFIC189-Silver Compact Ice Maker, 2...",3.25
89,Other,Frigidaire Ice Maker Machine - SELF CLEANING -...,3.24
81,Other,"AGLUCKY Countertop Ice Maker Machine, Portable...",3.12
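
An average alone does not show how many reviews it is based on. A hedged sketch of reporting the review count next to the mean, on toy data with illustrative column names (`stars`, `avg_stars`, `n_reviews`), could look like this:

```python
import pandas as pd

toy = pd.DataFrame({
    "product_category": ["Other", "Other", "Other", "Cleaning Appliance"],
    "product_title": ["Ice Maker A", "Ice Maker A", "Ice Maker B", "Dryer Fuse"],
    "sentiment": ["5 stars", "1 star", "4 stars", "5 stars"],
})

# Extract the numeric star rating from the label
toy["stars"] = toy["sentiment"].str.extract(r"(\d)", expand=False).astype(float)

# Mean rating and number of reviews per product, best-rated first within each category
summary = (
    toy.groupby(["product_category", "product_title"])["stars"]
    .agg(avg_stars="mean", n_reviews="count")
    .reset_index()
    .sort_values(["product_category", "avg_stars"], ascending=[True, False])
)
print(summary)
```

In the dataset used here every product is described as having 100 reviews, so the count matters mainly if reviews are dropped during cleaning or if the approach is reused on other data.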


* Filtered the top 5 highest-rated products in each category

In [None]:
top_5_per_category = avg_sentiment.groupby("product_category").head(5)
categories = top_5_per_category["product_category"].unique()
# Creating a dictionary to store DataFrames per category
category_dfs = {}

# Loop through each category and create a separate DataFrame
for category in categories:
    category_df = top_5_per_category[top_5_per_category["product_category"] == category][["product_title", "sentiment"]]
    category_df = category_df.rename(columns={"sentiment": "average_review"})
    category_df = category_df.reset_index(drop=True)  # Reset index for readability

    # Store in dictionary with category name as key
    category_dfs[category] = category_df
category_df_list = []

for category, df in category_dfs.items():
    df = df.copy()  # Ensure modifications are applied safely
    df.insert(0, "product_category", category)  # Add category as a column
    df = df.reset_index(drop=True)  # Reset index for readability
    category_df_list.append(df)

for df in category_df_list:
    category_name = df["product_category"].iloc[0]
    print(f"\n{'='*40}")
    print(f"Top 5 Products - {category_name}")
    print(f"{'='*40}")
    print(df.to_string(index=False))  # Print DataFrame without row index
    print("\n")


Top 5 Products - Cleaning Appliance
  product_category                                                                                                                                                                                     product_title  average_review
Cleaning Appliance         (2023 Update) 3392519 Dryer Thermal Fuse Replacement Part by BlueStars - Kenmore Dryer Thermal Fuse Exact Fit for Whirlpool Kenmore - Replaces AP6008325 3388651 694511 80005 WP3392519VP       4.76
Cleaning Appliance                                                           Dishwasher Magnet Clean Dirty Sign, Universal Clean Dirty Magnet for Dishwasher or Refrigerator, Magnetic Dirty Clean Dishwasher Magnet       4.75
Cleaning Appliance    Cimkiz Dishwasher Magnet Clean Dirty Sign Shutter Only Changes When You Push It Non-Scratching Strong Magnet or 3M Adhesive Options Indicator Tells Whether Dishes Are Clean or Dirty (Silver)       4.66
Cleaning Appliance ULTRA DURABLE 3406107 Dryer Door Switch Replacem