# CS 4342 Final Project - Amazon Sales Marketing Data

Analyzing the ratings of Amazon products

## Using Kagglehub to import the dataset
You may need to use `pip install kagglehub` to run this section.
The dataset is relatively small, so this should not take a long time.
We will print a couple example rows to see what each row looks like.

In [3]:
import kagglehub
import csv

# Download latest version
folder_path = kagglehub.dataset_download("karkavelrajaj/amazon-sales-dataset")
print(folder_path)
file_path = folder_path + "/amazon.csv"

fields = []
text_rows = []

with open(file_path, "r", encoding="utf-8") as csvfile:
    csvreader = csv.reader(csvfile)

    fields = next(csvreader)
    for text_row in csvreader:
        text_rows.append(text_row)

print(fields)
print(text_rows[0])
print(text_rows[1])
print(text_rows[2])

/Users/esthermao/.cache/kagglehub/datasets/karkavelrajaj/amazon-sales-dataset/versions/1
['product_id', 'product_name', 'category', 'discounted_price', 'actual_price', 'discount_percentage', 'rating', 'rating_count', 'about_product', 'user_id', 'user_name', 'review_id', 'review_title', 'review_content', 'img_link', 'product_link']
['B07JW9H4J1', 'Wayona Nylon Braided USB to Lightning Fast Charging and Data Sync Cable Compatible for iPhone 13, 12,11, X, 8, 7, 6, 5, iPad Air, Pro, Mini (3 FT Pack of 1, Grey)', 'Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables', '₹399', '₹1,099', '64%', '4.2', '24,269', "High Compatibility : Compatible With iPhone 12, 11, X/XsMax/Xr ,iPhone 8/8 Plus,iPhone 7/7 Plus,iPhone 6s/6s Plus,iPhone 6/6 Plus,iPhone 5/5s/5c/se,iPad Pro,iPad Air 1/2,iPad mini 1/2/3,iPod nano7,iPod touch and more apple devices.|Fast Charge&Data Sync : It can charge and sync simultaneously at a rapid speed, Compatible with any charging adaptor, multi-po

## Preprocessing the data

We notice that some fields are strings depicting numbers, which is not desirable. Specifically fields 3-7 which are:
- Discounted price
- Actual price
- Percent discount
- Average rating
- Rating count
For our preprocessing we will convert these to numbers (all floats except rating count.)
Also, some numeric strings are in rupees, but we are more familiar with US dollars, so we will convert.

A few rows have problems, so we print out the undesirable sections. This is a small proportion (3 of >1k) of the dataset - we will leave these rows out.

We print out some of the corrected rows and see they now contain numbers with prices converted to USD.

In [4]:
rows = []

def rupee_str_to_usd_float(str):
    rupee_float = float(str[1:].replace(",", ""))
    usd_float = rupee_float * 0.011 # conversion rate
    return usd_float

print('Problematic rows:')
for text_row in text_rows:
    try:
        row = text_row
        row[3] = rupee_str_to_usd_float(text_row[3]) # process discount price (string) into number
        row[4] = rupee_str_to_usd_float(text_row[4]) # process full price (string) into number
        row[5] = float(text_row[5][:-1]) # process discount percent into number
        row[6] = float(text_row[6]) # process avg rating into number
        row[7] = int(text_row[7].replace(",", "")) # process int 
        rows.append(row)
    except ValueError:
        print(text_row[3:8])
        continue

print('Fixed rows:')
print(rows[0])
print(rows[1])
print(rows[2])

Problematic rows:
[2.189, 10.988999999999999, 80.0, 3.0, '']
[2.739, 10.988999999999999, 75.0, 5.0, '']
[23.089, 27.488999999999997, 16.0, '|', '992']
Fixed rows:
['B07JW9H4J1', 'Wayona Nylon Braided USB to Lightning Fast Charging and Data Sync Cable Compatible for iPhone 13, 12,11, X, 8, 7, 6, 5, iPad Air, Pro, Mini (3 FT Pack of 1, Grey)', 'Computers&Accessories|Accessories&Peripherals|Cables&Accessories|Cables|USBCables', 4.388999999999999, 12.088999999999999, 64.0, 4.2, 24269, "High Compatibility : Compatible With iPhone 12, 11, X/XsMax/Xr ,iPhone 8/8 Plus,iPhone 7/7 Plus,iPhone 6s/6s Plus,iPhone 6/6 Plus,iPhone 5/5s/5c/se,iPad Pro,iPad Air 1/2,iPad mini 1/2/3,iPod nano7,iPod touch and more apple devices.|Fast Charge&Data Sync : It can charge and sync simultaneously at a rapid speed, Compatible with any charging adaptor, multi-port charging station or power bank.|Durability : Durable nylon braided design with premium aluminum housing and toughened nylon fiber wound tightly around t

In [5]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, accuracy_score

In [6]:
df = pd.DataFrame(rows, columns = fields)

#converting numeric columns 
df['discounted_price'] = df['discounted_price'].astype(float)
df['actual_price'] = df['actual_price'].astype(float)
df['discount_percentage'] = df['discount_percentage'].astype(float)
df['rating'] = df['rating'].astype(float)
df['rating_count'] = df['rating_count'].astype(int)

In [7]:
# prepping review text
df['review_title'] = df['review_title'].fillna('')
df['review_content'] = df['review_content'].fillna('')
df['review_text'] = df['review_title'] + ' ' + df['review_content']

print("Sample review text:", df['review_text'].iloc[0])

Sample review text: Satisfied,Charging is really fast,Value for money,Product review,Good quality,Good product,Good Product,As of now seems good Looks durable Charging is fine tooNo complains,Charging is really fast, good product.,Till now satisfied with the quality.,This is a good product . The charging speed is slower than the original iPhone cable,Good quality, would recommend,https://m.media-amazon.com/images/W/WEBP_402378-T1/images/I/81---F1ZgHL._SY88.jpg,Product had worked well till date and was having no issue.Cable is also sturdy enough...Have asked for replacement and company is doing the same...,Value for money


In [8]:
model = SentenceTransformer('all-MiniLM-L6-v2')
review_embeddings = model.encode(df['review_text'].tolist(), batch_size = 64, show_progress_bar = True)

Batches:   0%|          | 0/23 [00:00<?, ?it/s]

In [9]:
# aggregating to product level
def aggregate_embeddings(df, embeddings):
    product_ids = df['product_id'].unique()
    agg_embeddings = []
    for pid in product_ids:
        mask = df['product_id'] == pid
        agg_emb = embeddings[mask].mean(axis=0)  # mean pooling
        agg_embeddings.append(agg_emb)
    return np.array(agg_embeddings), product_ids

product_embeddings, product_ids = aggregate_embeddings(df, review_embeddings)


In [10]:
# building product-level dataset 
df_product = df.groupby('product_id').agg({
    'discounted_price': 'first',
    'actual_price': 'first',
    'discount_percentage': 'first',
    'rating': 'mean'  # average rating = label
}).reset_index()

numeric_features = df_product[['discounted_price','actual_price','discount_percentage']].values

X = np.hstack([product_embeddings, numeric_features])
y = df_product['rating'].round().astype(int)  # classification target (1–5 stars)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [12]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

clf = LogisticRegression( solver='lbfgs', max_iter=1000)

scores = cross_val_score(clf, X_scaled, y, cv=kf, scoring='accuracy')

print("K-Fold Accuracy Scores:", scores)
print("Mean Accuracy:", scores.mean())


K-Fold Accuracy Scores: [0.94814815 0.93333333 0.92592593 0.91078067 0.91078067]
Mean Accuracy: 0.925793749139474
