# Amazon Sales Analysis - Data Collection and Cleaning

**Team:** CAP_3764_2025_Fall_Team_1  
**Dataset:** BSR Visual Data - Amazon Product Sales with Image Quality Metrics  
**Data Source:** Self-collected using Amazon SP-API (Selling Partner API)

## Data Collection Overview

Our team collected data from **18,000+ Amazon product listings** using:
- **Amazon SP-API** for product metadata, pricing, and sales indicators
- **Custom web scraping tools** for additional product details
- **Computer vision processing** for image quality metrics

## Objectives
1. Load the dataset using our custom data collection module
2. Perform initial data cleaning (missing values, duplicates)
3. Inspect data types and classify variables as numerical or categorical
4. Save the cleaned dataset for later analysis

In [None]:
# Import required libraries
import sys
sys.path.append('../src')  # make the project's src/ directory importable (path is relative to this notebook)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import custom module
from data_collection import load_bsr_data, clean_data, get_numerical_columns, get_categorical_columns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Data Collection Methodology

### API-Based Collection
We utilized the **Amazon SP-API (Selling Partner API)** to extract:
- Product ASIN (unique identifier)
- Product titles and brand information
- Best Sellers Rank (BSR) - primary sales indicator; a lower rank indicates stronger sales
- Review counts and average ratings
- Product URLs and metadata

### Image Processing Pipeline
For each product, we:
1. Downloaded main product images from Amazon
2. Applied computer vision algorithms to compute quality metrics:
   - **Edge Density:** Measures sharpness and visual detail
   - **Background Analysis:** Calculates white/neutral background percentages
   - **Color Clustering:** Identifies significant color groups and diversity
   - **Clutter Score:** Composite metric of visual complexity
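
As a rough illustration of two of the metrics above, the following minimal sketch computes an edge density and a white-background percentage with NumPy on a synthetic image. It is not our production pipeline: the function names, the gradient-magnitude threshold of 30, and the "near white" cutoff of 240 are assumptions chosen for the example.

```python
import numpy as np

def edge_density(gray: np.ndarray, threshold: float = 30.0) -> float:
    """Fraction of pixels whose gradient magnitude exceeds `threshold` (assumed cutoff)."""
    gy, gx = np.gradient(gray.astype(float))  # finite differences along rows and columns
    magnitude = np.hypot(gx, gy)
    return float((magnitude > threshold).mean())

def white_background_pct(rgb: np.ndarray, min_value: int = 240) -> float:
    """Percentage of pixels where every channel is at least `min_value` (assumed cutoff)."""
    near_white = (rgb >= min_value).all(axis=-1)
    return float(near_white.mean() * 100)

# Synthetic 100x100 image: white canvas with a dark 50x50 square in the centre
image = np.full((100, 100, 3), 255, dtype=np.uint8)
image[25:75, 25:75] = 20
gray = image.mean(axis=-1)  # simple channel average as a grayscale proxy

print(f"Edge density: {edge_density(gray):.4f}")
print(f"White background: {white_background_pct(image):.1f}%")
```

On real product photos the image would first be loaded into an array (for example with Pillow or OpenCV), and both thresholds would need tuning against the actual data.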

### Data Enrichment
- Combined API data with computed image features
- Calculated z-score normalized versions of image metrics
- Removed duplicates and handled missing values
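
The enrichment steps above can be sketched with pandas on toy data. The column names (`asin`, `bsr`, `edge_density`, `clutter_score`), the `_z` suffix, and the use of the population standard deviation (`ddof=0`) are assumptions for illustration; the actual dataset may use different names and conventions.

```python
import pandas as pd

# Toy stand-ins for the API data and the computed image features (hypothetical columns)
api_df = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A3"],   # "A3" appears twice on purpose
    "bsr": [120, 4500, 87, 87],
})
image_df = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "edge_density": [0.10, 0.20, 0.30],
    "clutter_score": [1.0, 3.0, 5.0],
})

# Remove duplicate products, then attach image features by ASIN
enriched = api_df.drop_duplicates(subset="asin").merge(image_df, on="asin", how="left")

# Add a z-score column for each image metric: (value - mean) / standard deviation
for col in ["edge_density", "clutter_score"]:
    mean = enriched[col].mean()
    std = enriched[col].std(ddof=0)  # population std; an assumption, not confirmed by our pipeline
    enriched[f"{col}_z"] = (enriched[col] - mean) / std

print(enriched)
```

A left merge keeps every product from the API data even when its image features are missing, so those gaps remain visible for the later missing-value handling.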

**Final Dataset:** 18,148 unique Amazon products with 30+ features

## 1. Data Loading

In [None]:
# Load the dataset
df_raw = load_bsr_data('../data/raw/bsr_visual_data.csv')

In [None]:
# Display first few rows
df_raw.head()

## 2. Initial Data Cleaning

In [None]:
# Clean the data
df_clean = clean_data(df_raw)
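
`clean_data` lives in our `data_collection` module and its body is not shown in this notebook. The sketch below is a hypothetical stand-in that shows the kind of steps such a function might perform (dropping duplicates and rows without an identifier); the function name `clean_data_sketch` and the `asin` key column are assumptions.

```python
import numpy as np
import pandas as pd

def clean_data_sketch(df: pd.DataFrame, key: str = "asin") -> pd.DataFrame:
    """Hypothetical stand-in for clean_data: drop duplicates and rows lacking a key."""
    out = df.drop_duplicates()                           # exact duplicate rows
    out = out.dropna(subset=[key])                       # rows with no product identifier
    out = out.drop_duplicates(subset=key, keep="first")  # one row per product
    return out.reset_index(drop=True)

# Toy data with one duplicate row, one missing rating, and one missing identifier
raw = pd.DataFrame({
    "asin": ["A1", "A1", "A2", None],
    "rating": [4.5, 4.5, np.nan, 3.9],
})
cleaned = clean_data_sketch(raw)

print(f"Rows before: {len(raw)}, after: {len(cleaned)}")
print("Remaining missing values per column:")
print(cleaned.isna().sum())
```

The sketch deliberately leaves missing ratings in place; whether to impute or drop them depends on the analysis that follows.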

In [None]:
# Check data types
print("Data Types:")
print(df_clean.dtypes)

## 3. Variable Classification

In [None]:
# Get numerical and categorical columns
num_cols = get_numerical_columns(df_clean)
cat_cols = get_categorical_columns(df_clean)

print(f"Numerical Variables ({len(num_cols)}):")
print(num_cols)
print(f"\nCategorical Variables ({len(cat_cols)}):")
print(cat_cols)
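
For readers without access to our module, a comparable classification can be approximated with `DataFrame.select_dtypes`. This is a sketch under the assumption that "numerical" means numeric dtypes and "categorical" means everything else; the helper names are hypothetical, and the real `get_numerical_columns` and `get_categorical_columns` may apply different rules.

```python
import pandas as pd

def numerical_columns_sketch(df: pd.DataFrame) -> list:
    """Columns with a numeric dtype (integers and floats)."""
    return df.select_dtypes(include="number").columns.tolist()

def categorical_columns_sketch(df: pd.DataFrame) -> list:
    """All remaining columns (strings, categories, booleans, dates)."""
    return df.select_dtypes(exclude="number").columns.tolist()

# Toy frame with hypothetical column names
toy = pd.DataFrame({
    "asin": ["A1", "A2"],
    "bsr": [120, 4500],
    "rating": [4.5, 3.9],
    "brand": ["Acme", "Globex"],
})

print(numerical_columns_sketch(toy))
print(categorical_columns_sketch(toy))
```

Identifier columns such as ASINs land in the non-numeric group under this rule even though they are not meaningful categories, so they may need to be excluded by hand.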

## 4. Save Cleaned Data

In [None]:
# Save cleaned dataset (path is relative to this notebook's directory)
output_path = '../data/processed/bsr_visual_data_clean.csv'
df_clean.to_csv(output_path, index=False)
print(f"Cleaned data saved to: {output_path}")