# Amazon Sales Analysis - Data Collection and Cleaning

**Team:** CAP_3764_2025_Fall_Team_1  
**Dataset:** BSR Visual Data - Amazon Product Sales with Image Quality Metrics  
**Data Source:** Self-collected using Amazon SP-API (Selling Partner API)

## Data Collection Overview

Our team collected data from **18,000+ Amazon product listings** using:
- **Amazon SP-API** for product metadata, pricing, and sales indicators
- **Custom web scraping tools** for additional product details
- **Computer vision processing** for image quality metrics

## Objectives
1. Load the dataset using our custom data collection module
2. Perform initial data cleaning (remove duplicates, report missing values)
3. Explore data structure and variable types
4. Save the cleaned dataset

In [2]:
# Import required libraries
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import custom module
from data_collection import load_bsr_data, clean_data, get_numerical_columns, get_categorical_columns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Data Collection Methodology

### API-Based Collection
We used the **Amazon SP-API (Selling Partner API)** to extract:
- Product ASIN (unique identifier)
- Product titles and brand information
- Best Sellers Rank (BSR), our primary sales indicator
- Review counts and average ratings
- Product URLs and metadata

### Image Processing Pipeline
For each product, we:
1. Downloaded main product images from Amazon
2. Applied computer vision algorithms to compute quality metrics:
   - **Edge Density:** Measures sharpness and visual detail
   - **Background Analysis:** Calculates white/neutral background percentages
   - **Color Clustering:** Identifies significant color groups and diversity
   - **Clutter Score:** Composite metric of visual complexity
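
Two of the metrics above can be illustrated with a minimal NumPy sketch. This is an illustration rather than the code in our pipeline: the function names, the finite-difference gradient, and both thresholds are assumptions, and the synthetic image stands in for a downloaded product photo.

```python
import numpy as np


def edge_density(gray: np.ndarray, threshold: float = 0.1) -> float:
    """Fraction of pixels whose gradient magnitude exceeds `threshold`.

    `gray` is a 2-D array scaled to [0, 1]. The gradient operator and
    the threshold value are illustrative assumptions.
    """
    grad_rows, grad_cols = np.gradient(gray.astype(float))
    magnitude = np.hypot(grad_rows, grad_cols)
    return float((magnitude > threshold).mean())


def bg_white_pct(rgb: np.ndarray, min_value: float = 0.95) -> float:
    """Fraction of pixels whose R, G and B values are all near white."""
    near_white = (rgb >= min_value).all(axis=-1)
    return float(near_white.mean())


# Synthetic 8x8 RGB image: white canvas with a dark 4x4 square in the middle.
image = np.ones((8, 8, 3))
image[2:6, 2:6, :] = 0.0
gray = image.mean(axis=-1)

print("edge density:", edge_density(gray))
print("white background share:", bg_white_pct(image))
```

A sharper or busier image raises the edge density, while a product shot on a plain white backdrop pushes the white share toward 1.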

### Data Enrichment
- Combined API data with computed image features
- Computed z-score-normalized versions of the image metrics (the `_z` columns)
- Removed duplicates and handled missing values
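
As a sketch of the normalization step, each `_z` column can be derived as z = (x - mean) / std. The helper name `add_z_scores` and the use of the population standard deviation (`ddof=0`) are assumptions; the original pipeline may have used the sample standard deviation instead.

```python
import pandas as pd


def add_z_scores(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of `df` with a `<column>_z` z-score column per input column."""
    out = df.copy()
    for col in columns:
        mean = out[col].mean()
        std = out[col].std(ddof=0)  # population std; an assumption
        # A constant column has std == 0 and would yield NaN here.
        out[f"{col}_z"] = (out[col] - mean) / std
    return out


# Toy values, not taken from the dataset.
toy = pd.DataFrame(
    {"edge_density": [0.01, 0.02, 0.03], "color_entropy": [0.5, 0.7, 0.9]}
)
scored = add_z_scores(toy, ["edge_density", "color_entropy"])
print(scored)
```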

**Final Dataset:** 18,147 rows and 34 columns in the raw file; 16,292 rows remain after the 1,855 duplicate rows are removed in Section 2

## 1. Data Loading

In [3]:
# Load the dataset
df_raw = load_bsr_data('../data/raw/bsr_visual_data.csv')

Dataset loaded successfully!
Shape: (18147, 34)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18147 entries, 0 to 18146
Data columns (total 34 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   asin                   18147 non-null  object 
 1   image_path             18147 non-null  object 
 2   edge_density           18147 non-null  float64
 3   bg_white_pct           18147 non-null  float64
 4   bg_neutral_pct         18147 non-null  float64
 5   n_clusters_sig         18147 non-null  float64
 6   color_entropy          18147 non-null  float64
 7   largest_cluster_pct    18147 non-null  float64
 8   edge_density_z         18147 non-null  float64
 9   n_clusters_sig_z       18147 non-null  float64
 10  color_entropy_z        18147 non-null  float64
 11  bg_white_pct_z         18147 non-null  float64
 12  bg_neutral_pct_z       18147 non-null  float64
 13  largest_cluster_pct_z  18147 non-null  float64
 14  clutte
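
The output above suggests that `load_bsr_data` reads the CSV, prints a confirmation and the shape, and then calls `DataFrame.info()`. A minimal sketch of such a loader follows; the name `load_bsr_data_sketch` and its body are assumptions, not the contents of our `data_collection` module, and the in-memory CSV uses made-up values.

```python
import io

import pandas as pd


def load_bsr_data_sketch(path_or_buffer) -> pd.DataFrame:
    """Read a CSV and print a short summary (hypothetical stand-in)."""
    df = pd.read_csv(path_or_buffer)
    print("Dataset loaded successfully!")
    print(f"Shape: {df.shape}")
    df.info()  # prints column names, non-null counts and dtypes
    return df


# Made-up two-row sample so the sketch runs without the real file.
sample_csv = io.StringIO(
    "asin,edge_density,brand\n"
    "A000000001,0.017,BrandA\n"
    "A000000002,0.008,BrandB\n"
)
df_sample = load_bsr_data_sketch(sample_csv)
```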

In [4]:
# Display first few rows
df_raw.head()

Unnamed: 0,asin,image_path,edge_density,bg_white_pct,bg_neutral_pct,n_clusters_sig,color_entropy,largest_cluster_pct,edge_density_z,n_clusters_sig_z,color_entropy_z,bg_white_pct_z,bg_neutral_pct_z,largest_cluster_pct_z,clutter_score,keyword,source_file,item_name,brand,image_count,main_image_url,has_aplus,has_brand_story,review_count,avg_rating,bsr_best,bsr_paths,units_per_month,sales_velocity_daily,product_url,image_list,predicted_bsr,prediction_error,error_percentage
0,B0BQPNMXQV,images_amz/3347156682c8f0ca.jpg,0.017148,0.575204,0.996824,4.0,0.713655,0.567744,-0.642028,-0.169091,-0.255746,0.289296,0.990601,0.200318,-0.718533,audio headphones catalog full 1757641194,audio_headphones_catalog_full_1757641194.csv,JBL Vibe Beam - True Wireless JBL Deep Bass So...,JBL,18,https://m.media-amazon.com/images/I/31S4tOQj4S...,False,False,,,6.0,"[[""Earbud & In-Ear Headphones"", 6], [""Electron...",,,https://www.amazon.com/dp/B0BQPNMXQV,"[{'variant': 'MAIN', 'url': 'https://m.media-a...",722.581604,716.581604,11943.026734
1,B0CTBCDD6D,images_amz/e7c4b57eac5ca716.jpg,0.007553,0.624459,0.999693,4.0,0.709959,0.608643,-0.920386,-0.169091,-0.278726,0.528078,1.003586,0.453967,-0.961818,audio headphones catalog full 1757641194,audio_headphones_catalog_full_1757641194.csv,JBL Tune 720BT - Wireless Over-Ear Headphones ...,JBL,21,https://m.media-amazon.com/images/I/61EL2AKKcB...,False,False,,,8.0,"[[""Over-Ear Headphones"", 8], [""Electronics"", 1...",,,https://www.amazon.com/dp/B0CTBCDD6D,"[{'variant': 'MAIN', 'url': 'https://m.media-a...",721.520617,713.520617,8919.007708
2,B0CTBCDD6D,images_amz/e7c4b57eac5ca716.jpg,0.007553,0.624459,0.999693,4.0,0.709959,0.608643,-0.920386,-0.169091,-0.278726,0.528078,1.003586,0.453967,-0.961818,audio headphones catalog full 1757641194,audio_headphones_catalog_full_1757641194.csv,JBL Tune 720BT - Wireless Over-Ear Headphones ...,JBL,21,https://m.media-amazon.com/images/I/61EL2AKKcB...,False,False,,,8.0,"[[""Over-Ear Headphones"", 8], [""Electronics"", 1...",,,https://www.amazon.com/dp/B0CTBCDD6D,"[{'variant': 'MAIN', 'url': 'https://m.media-a...",721.520617,713.520617,8919.007708
3,B08WM3LMJF,images_amz/a114c60191cfaf62.jpg,0.014999,0.74281,1.0,3.0,0.557774,0.733643,-0.704373,-1.203691,-1.224974,1.101834,1.004974,1.229213,-1.700246,audio headphones catalog full 1757641194,audio_headphones_catalog_full_1757641194.csv,JBL Tune 510BT - Bluetooth headphones with up ...,JBL,24,https://m.media-amazon.com/images/I/61kFL7ywsZ...,False,False,,,1.0,"[[""On-Ear Headphones"", 1], [""Electronics"", 21]]",,,https://www.amazon.com/dp/B08WM3LMJF,"[{'variant': 'MAIN', 'url': 'https://m.media-a...",490.383849,489.383849,48938.384876
4,B08WM3LMJF,images_amz/a114c60191cfaf62.jpg,0.014999,0.74281,1.0,3.0,0.557774,0.733643,-0.704373,-1.203691,-1.224974,1.101834,1.004974,1.229213,-1.700246,audio headphones catalog full 1757641194,audio_headphones_catalog_full_1757641194.csv,JBL Tune 510BT - Bluetooth headphones with up ...,JBL,24,https://m.media-amazon.com/images/I/61kFL7ywsZ...,False,False,,,1.0,"[[""On-Ear Headphones"", 1], [""Electronics"", 21]]",,,https://www.amazon.com/dp/B08WM3LMJF,"[{'variant': 'MAIN', 'url': 'https://m.media-a...",490.383849,489.383849,48938.384876
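
In the rows above, `bsr_paths` holds a JSON-style list of `[category, rank]` pairs, and `bsr_best` matches the smallest rank in that list (it also matches the first entry, so either rule fits the preview). The sketch below assumes the minimum; the helper name `best_rank` is hypothetical.

```python
import json


def best_rank(bsr_paths: str):
    """Return the smallest rank among [category, rank] pairs, or None if empty."""
    pairs = json.loads(bsr_paths)
    ranks = [rank for _category, rank in pairs]
    return min(ranks) if ranks else None


# Value copied from the B08WM3LMJF row shown above.
example = '[["On-Ear Headphones", 1], ["Electronics", 21]]'
print(best_rank(example))  # prints 1
```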


## 2. Initial Data Cleaning

In [5]:
# Clean the data
df_clean = clean_data(df_raw)

Removed 1855 duplicate rows

Missing values per column:
brand                     152
review_count            16292
avg_rating              16292
units_per_month         16292
sales_velocity_daily    16292
dtype: int64
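
The counts above fit together: 18,147 - 1,855 = 16,292, which equals the missing count reported for `review_count`, `avg_rating`, `units_per_month`, and `sales_velocity_daily`. Assuming `clean_data` only drops exact duplicate rows, those four columns are empty for every remaining row. A minimal sketch of a cleaning step that produces this kind of report is shown below; the function name and body are assumptions, not the actual implementation.

```python
import pandas as pd


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and report missing values (hypothetical)."""
    before = len(df)
    cleaned = df.drop_duplicates().reset_index(drop=True)
    print(f"Removed {before - len(cleaned)} duplicate rows")

    missing = cleaned.isna().sum()
    print("\nMissing values per column:")
    print(missing[missing > 0])  # only columns with at least one gap
    return cleaned


# Toy frame: one duplicated row and an entirely empty numeric column.
nan = float("nan")
toy = pd.DataFrame(
    {
        "asin": ["A1", "A2", "A2", "A3"],
        "brand": ["X", "Y", "Y", None],
        "review_count": [nan, nan, nan, nan],
    }
)
cleaned = clean_data_sketch(toy)
```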


In [6]:
# Check data types
print("Data Types:")
print(df_clean.dtypes)

Data Types:
asin                      object
image_path                object
edge_density             float64
bg_white_pct             float64
bg_neutral_pct           float64
n_clusters_sig           float64
color_entropy            float64
largest_cluster_pct      float64
edge_density_z           float64
n_clusters_sig_z         float64
color_entropy_z          float64
bg_white_pct_z           float64
bg_neutral_pct_z         float64
largest_cluster_pct_z    float64
clutter_score            float64
keyword                   object
source_file               object
item_name                 object
brand                     object
image_count                int64
main_image_url            object
has_aplus                   bool
has_brand_story             bool
review_count             float64
avg_rating               float64
bsr_best                 float64
bsr_paths                 object
units_per_month          float64
sales_velocity_daily     float64
product_url               objec

## 3. Variable Classification

In [7]:
# Get numerical and categorical columns
num_cols = get_numerical_columns(df_clean)
cat_cols = get_categorical_columns(df_clean)

print(f"Numerical Variables ({len(num_cols)}):")
print(num_cols)
print(f"\nCategorical Variables ({len(cat_cols)}):")
print(cat_cols)

Numerical Variables (22):
['edge_density', 'bg_white_pct', 'bg_neutral_pct', 'n_clusters_sig', 'color_entropy', 'largest_cluster_pct', 'edge_density_z', 'n_clusters_sig_z', 'color_entropy_z', 'bg_white_pct_z', 'bg_neutral_pct_z', 'largest_cluster_pct_z', 'clutter_score', 'image_count', 'review_count', 'avg_rating', 'bsr_best', 'units_per_month', 'sales_velocity_daily', 'predicted_bsr', 'prediction_error', 'error_percentage']

Categorical Variables (10):
['asin', 'image_path', 'keyword', 'source_file', 'item_name', 'brand', 'main_image_url', 'bsr_paths', 'product_url', 'image_list']
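
The two lists account for 22 + 10 = 32 of the 34 columns; the boolean columns `has_aplus` and `has_brand_story` appear in neither. One way to reproduce that split is with `DataFrame.select_dtypes`, as sketched below. The helper names are hypothetical, and the real `get_numerical_columns` and `get_categorical_columns` may be implemented differently.

```python
import pandas as pd


def numerical_columns(df: pd.DataFrame) -> list:
    """Integer and float columns; booleans are not matched by 'number'."""
    return df.select_dtypes(include="number").columns.tolist()


def categorical_columns(df: pd.DataFrame) -> list:
    """Everything that is neither numeric nor boolean (here: text columns)."""
    return df.select_dtypes(exclude=["number", "bool"]).columns.tolist()


# Toy frame with one column of each kind.
toy = pd.DataFrame(
    {
        "asin": ["A1", "A2"],
        "edge_density": [0.01, 0.02],
        "image_count": [18, 21],
        "has_aplus": [False, True],
    }
)
print(numerical_columns(toy))
print(categorical_columns(toy))
```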


## 4. Save Cleaned Data

In [8]:
# Save cleaned dataset
df_clean.to_csv('../data/processed/bsr_visual_data_clean.csv', index=False)
print("Cleaned data saved to: data/processed/bsr_visual_data_clean.csv")

Cleaned data saved to: data/processed/bsr_visual_data_clean.csv
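
Passing `index=False` matters when the file is read back: without it, pandas writes the row index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below demonstrates the difference on a toy frame written to a temporary directory; the file names are placeholders.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"asin": ["A1", "A2"], "bsr_best": [6.0, 8.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    toy.to_csv(with_index)  # default index=True writes the row index
    toy.to_csv(without_index, index=False)

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'asin', 'bsr_best']
print(cols_without_index)  # ['asin', 'bsr_best']
```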
