# CS 4342 Final Project - Amazon Sales Marketing Data

Analyzing the ratings of Amazon products

## Using Kagglehub to import the dataset
You may need to run `pip install kagglehub` before running this section.
The dataset is relatively small, so the download should not take long.
We print the first three rows to see what each row looks like.

In [21]:
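import kagglehub
import csv
import os
import re

# Download latest version
folder_path = kagglehub.dataset_download("karkavelrajaj/amazon-sales-dataset")
print(folder_path)
# build the path with os.path.join so the separator matches the operating system
file_path = os.path.join(folder_path, "amazon.csv")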

fields = []
text_rows = []

with open(file_path, "r", encoding="utf-8") as csvfile:
    csvreader = csv.reader(csvfile)

    fields = next(csvreader)
    for text_row in csvreader:
        text_rows.append(text_row)

print(fields)
print(text_rows[0])
print(text_rows[1])
print(text_rows[2])

C:\Users\Nora\.cache\kagglehub\datasets\karkavelrajaj\amazon-sales-dataset\versions\1
['product_id', 'product_name', 'category', 'discounted_price', 'actual_price', 'discount_percentage', 'rating', 'rating_count', 'about_product', 'user_id', 'user_name', 'review_id', 'review_title', 'review_content', 'img_link', 'product_link']
['B07JW9H4J1', 'Wayona Nylon Braided USB to Lightning Fast Charging and Data Sync Cable Compatible for iPhone 13, 12,11, X, 8, 7, 6, 5, iPad Air, Pro, Mini (3 FT Pack of 1, Grey)', 'Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables', '₹399', '₹1,099', '64%', '4.2', '24,269', "High Compatibility : Compatible With iPhone 12, 11, X/XsMax/Xr ,iPhone 8/8 Plus,iPhone 7/7 Plus,iPhone 6s/6s Plus,iPhone 6/6 Plus,iPhone 5/5s/5c/se,iPad Pro,iPad Air 1/2,iPad mini 1/2/3,iPod nano7,iPod touch and more apple devices.|Fast Charge&Data Sync : It can charge and sync simultaneously at a rapid speed, Compatible with any charging adaptor, multi-port 

## Preprocessing the data

Some fields hold numbers stored as strings, which we cannot use directly for numeric analysis. These are fields 3-7:
- Discounted price
- Actual price
- Percent discount
- Average rating
- Rating count

For our preprocessing we will convert these to numbers (floats for all except rating count, which becomes an integer).
The price fields are in Indian rupees (₹), but we are more familiar with US dollars, so we will convert them using a fixed rate of 0.011 USD per rupee.
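
As a minimal sketch of these conversions, the snippet below parses one value of each kind. The sample strings come from the first row printed above; the helper names and the `INR_TO_USD` constant are illustrative and are not part of the notebook's own code.

```python
INR_TO_USD = 0.011  # assumed fixed rate, matching the one used in the preprocessing cell

def parse_rupees(text):
    """Turn a string like '₹1,099' into a float amount in rupees."""
    return float(text.lstrip("₹").replace(",", ""))

def parse_percent(text):
    """Turn a string like '64%' into the float 64.0."""
    return float(text.rstrip("%"))

def parse_count(text):
    """Turn a string like '24,269' into the int 24269."""
    return int(text.replace(",", ""))

# fields 3-7 of the first row
sample = ["₹399", "₹1,099", "64%", "4.2", "24,269"]
parsed = [
    parse_rupees(sample[0]) * INR_TO_USD,  # discounted price in USD
    parse_rupees(sample[1]) * INR_TO_USD,  # actual price in USD
    parse_percent(sample[2]),              # discount percentage
    float(sample[3]),                      # average rating
    parse_count(sample[4]),                # rating count
]
print(parsed)
```

Values that are not numeric, such as an empty string or `|`, raise `ValueError` in these helpers, which is what lets a `try`/`except` block skip malformed rows.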

Some columns hold identifiers or URLs that are irrelevant to our analysis:
- product_id (Field 0)
- user_id (Field 9)
- user_name (Field 10)
- review_id (Field 11)
- img_link (Field 14)
- product_link (Field 15)

We will remove these fields from each row and from our `fields` variable.

A few rows fail to parse because of an empty rating count or a non-numeric rating, so we print their numeric fields. They make up a small proportion of the dataset (3 of more than 1,000 rows), so we will leave these rows out.

Some reviews also contain image URLs, such as https://m.media-amazon.com/images/W/WEBP_402378-T1/images/I/81---F1ZgHL._SY88.jpg. We will not be reading these images, so we will use a regular expression to find and remove them.
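
A minimal sketch of that cleanup on a single string. The review text here is invented for illustration; only the URL is the example quoted above. The pattern matches lazily from `https://` up to the first `.jpg` or `.png`.

```python
import re

# lazy match from "https://" to the first ".jpg" or ".png"
IMAGE_URL = re.compile(r"https://.*?\.(?:jpg|png)")

# made-up review text wrapped around the example URL
review = (
    "Works well. "
    "https://m.media-amazon.com/images/W/WEBP_402378-T1/images/I/81---F1ZgHL._SY88.jpg"
    " Cable feels sturdy."
)

cleaned = IMAGE_URL.sub("", review)
print(cleaned)
```

This pattern only targets `https` links ending in `.jpg` or `.png`; links using `http://` or another image extension are not covered by it.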

We print out some of the corrected rows and see that they now contain numeric values, no IDs, and prices converted to USD.

In [22]:
rows = []
fields = fields[1:9] + fields[12:14]

def rupee_str_to_usd_float(rupee_str):
    # drop the leading '₹' and the thousands separators, e.g. '₹1,099' -> 1099.0
    rupee_float = float(rupee_str[1:].replace(",", ""))
    usd_float = rupee_float * 0.011 # conversion rate
    return usd_float

print('Problematic rows:')
for text_row in text_rows:
    try:
        row = [
            # remove field 0 (product_id)
            text_row[1], # product name
            text_row[2], # product categories
            rupee_str_to_usd_float(text_row[3]), # process discount price (string) into number
            rupee_str_to_usd_float(text_row[4]), # process full price (string) into number
            float(text_row[5][:-1]), # process discount percent into number
            float(text_row[6]), # process avg rating into number
            int(text_row[7].replace(",", "")), # process rating count
            text_row[8], # about product
            # remove fields 9-11 (user_id, user_name, review_id)
            text_row[12], # review titles
            re.sub(r'https://.*?\.(?:jpg|png)', '', text_row[13]) # review contents
            # remove fields 14 (img_link) and 15 (product_link)
        ]
        rows.append(row)
    except ValueError:
        print(text_row[3:8])
        continue

print('Fixed rows:')
print(rows[0])
print(rows[1])
print(rows[2])

fields = ['product_name','category','discounted_price','actual_price','discount_percentage',
          'rating','rating_count','about_product','review_title','review_content']


Problematic rows:
['₹199', '₹999', '80%', '3.0', '']
['₹249', '₹999', '75%', '5.0', '']
['₹2,099', '₹2,499', '16%', '|', '992']
Fixed rows:
['Wayona Nylon Braided USB to Lightning Fast Charging and Data Sync Cable Compatible for iPhone 13, 12,11, X, 8, 7, 6, 5, iPad Air, Pro, Mini (3 FT Pack of 1, Grey)', 'Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables', 4.388999999999999, 12.088999999999999, 64.0, 4.2, 24269, "High Compatibility : Compatible With iPhone 12, 11, X/XsMax/Xr ,iPhone 8/8 Plus,iPhone 7/7 Plus,iPhone 6s/6s Plus,iPhone 6/6 Plus,iPhone 5/5s/5c/se,iPad Pro,iPad Air 1/2,iPad mini 1/2/3,iPod nano7,iPod touch and more apple devices.|Fast Charge&Data Sync : It can charge and sync simultaneously at a rapid speed, Compatible with any charging adaptor, multi-port charging station or power bank.|Durability : Durable nylon braided design with premium aluminum housing and toughened nylon fiber wound tightly around the cord lending it superior durabilit

## Multi-Class Logistic Regression

In [23]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

In [24]:
df = pd.DataFrame(rows, columns = fields)

# make sure the numeric columns have numeric dtypes
df['discounted_price'] = df['discounted_price'].astype(float)
df['actual_price'] = df['actual_price'].astype(float)
df['discount_percentage'] = df['discount_percentage'].astype(float)
df['rating'] = df['rating'].astype(float)
df['rating_count'] = df['rating_count'].astype(int)

In [25]:
# prepping review text
df_reviews = df[['review_title','review_content','rating']].dropna()
df_reviews['text'] = df_reviews['review_title'].astype(str) + " " + df_reviews['review_content'].astype(str)
df_reviews['label'] = df_reviews['rating'].round().astype(int)  # average rating rounded to a whole star (1-5)

X_text = df_reviews['text']
y = df_reviews['label']
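
The labels come from rounding each product's average rating. As a small illustration with hypothetical ratings (not taken from the dataset): Python's built-in `round`, like the rounding in NumPy and pandas, sends exact halves to the nearest even integer, so 3.5 and 4.5 both become 4.

```python
from collections import Counter

# hypothetical average ratings, chosen to include exact halves
ratings = [3.5, 4.0, 4.2, 4.4, 4.5, 4.6]

# round-half-to-even: 3.5 -> 4 and 4.5 -> 4
labels = [round(r) for r in ratings]
print(labels)

# how many examples fall into each label
print(Counter(labels))
```

In the notebook, `y.value_counts()` would show how the real labels are distributed across the star values.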

## TF-IDF (Term Frequency-Inverse Document Frequency)
First we use Term Frequency-Inverse Document Frequency (TF-IDF) to create a vector representation of our Amazon reviews. The term frequency is the number of times a word appears in a given text. The inverse document frequency shrinks as a word appears in more of the texts, so very common words like 'the' or 'is' are down-weighted. Using TF-IDF therefore reduces the weight of less informative words and gives a higher score to rare words. The result is a vector with one dimension per word in the vocabulary, where the values are high for words that are frequent in that text but rare across the dataset, and low for words that are common everywhere.

Running multi-class logistic regression on the TF-IDF vectors lets us inspect which words carry the most weight for each rating value.
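
To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on three invented reviews. It uses the smoothed IDF formula and L2 normalization that scikit-learn applies by default, but leaves out the tokenization details, stop-word removal, and vocabulary cap, so it is an illustration rather than a replacement for `TfidfVectorizer`. All names in it are hypothetical.

```python
import math
from collections import Counter

# toy reviews, invented for illustration
docs = [
    "good cable good price",
    "good mouse",
    "amazing mouse",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

# document frequency: number of reviews that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized review."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for doc, tokens in zip(docs, tokenized):
    weights = dict(zip(vocab, tfidf_vector(tokens)))
    print(doc, {w: round(v, 3) for w, v in weights.items() if v > 0})
```

In the third review, "amazing" appears in only one review while "mouse" appears in two, so "amazing" receives the larger weight even though both words occur once.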

In [26]:
vectorizer = TfidfVectorizer(stop_words='english', max_features = 5000)
X = vectorizer.fit_transform(X_text)

In [27]:
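# lbfgs fits a multinomial model for multi-class targets by default;
# the explicit multi_class argument is deprecated since scikit-learn 1.5
log_reg = LogisticRegression(max_iter = 2000, solver = 'lbfgs')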
kf = KFold(n_splits=5, shuffle = True, random_state=42)
scores = cross_val_score(log_reg, X, y, cv=kf, scoring = 'accuracy')

print("K-Fold Accuracy Scores:", scores)
print("Mean Accuracy:", scores.mean())

# cross_val_score fits clones, so fit on the full data to inspect the coefficients
log_reg.fit(X, y)




K-Fold Accuracy Scores: [0.95221843 0.94197952 0.94863014 0.95205479 0.96917808]
Mean Accuracy: 0.9528121931834121
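
When reading an accuracy figure like this, it can help to compare it against the accuracy of always predicting the most common label. A minimal sketch, using hypothetical labels rather than the notebook's `y`; the function name is illustrative:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# hypothetical labels, not taken from the dataset
toy_labels = [4, 4, 4, 4, 5, 4, 3, 4, 4, 4]
print(majority_baseline_accuracy(toy_labels))
```

In the notebook, `majority_baseline_accuracy(list(y))` would give the figure to set beside the mean accuracy above.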




In [28]:
feature_names = vectorizer.get_feature_names_out()
coefs = log_reg.coef_

class_idx = np.where(log_reg.classes_ == 5)[0][0]
coef_class = log_reg.coef_[class_idx]

top_pos_idx = np.argsort(coef_class)[-20:]
top_neg_idx = np.argsort(coef_class)[:20]

print("\nTop words that predict 5-star reviews:")
for i in reversed(top_pos_idx):
    print(f"{feature_names[i]} : {coef_class[i]:.3f}")

print("\nTop words that predict against 5-star reviews (lean towards lower ratings):")
for i in top_neg_idx:
    print(f"{feature_names[i]} : {coef_class[i]:.3f}")


Top words that predict 5-star reviews:
mouse : 1.336
easy : 0.808
screen : 0.718
best : 0.672
amazing : 0.627
instagram : 0.596
install : 0.566
pad : 0.561
big : 0.518
clips : 0.475
wonders : 0.456
installation : 0.444
coffee : 0.439
products : 0.436
manikandan : 0.420
tablet : 0.420
instant : 0.405
ओर : 0.404
excellent : 0.401
tempered : 0.397

Top words that predict against 5-star reviews (lean towards lower ratings):
ok : -0.497
charging : -0.439
remote : -0.413
good : -0.385
watch : -0.377
price : -0.322
battery : -0.316
power : -0.279
work : -0.269
bad : -0.267
amazon : -0.266
nice : -0.263
does : -0.251
received : -0.229
decent : -0.225
usb : -0.214
poor : -0.212
weight : -0.204
days : -0.203
better : -0.199
