# Amazon Sales Analysis - Exploratory Data Analysis (EDA)

**Team:** CAP_3764_2025_Fall_Team_1  

## Objectives
1. Compute summary statistics for numerical variables
2. Analyze categorical variables
3. Create summary tables
4. Generate visualizations

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Set display options
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

## 1. Load Cleaned Data

In [None]:
# Load cleaned dataset
df = pd.read_csv('../data/processed/bsr_visual_data_clean.csv')
print(f"Dataset shape: {df.shape}")
df.head()
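
Before computing statistics, it can help to check how complete each column is. The following is a minimal sketch, not part of the original pipeline: the helper name `missing_summary` and the toy frame are hypothetical stand-ins, and on the real data the call would be `missing_summary(df)`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing percentage per column, worst first."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A4"],
    "bsr_best": [120.0, None, 4500.0, 98000.0],
    "avg_rating": [4.5, 4.1, None, None],
})
report = missing_summary(toy)
print(report)
```

Columns with a high `missing_pct` are worth noting in the Dataset Overview insight, since `describe()` and `corr()` silently skip missing values.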

## 2. Summary Statistics for Numerical Variables

In [None]:
# Identify numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numerical columns: {len(numerical_cols)}")
print(numerical_cols)

In [None]:
# Compute descriptive statistics
df[numerical_cols].describe()

In [None]:
# Additional statistics
print("Skewness:")
print(df[numerical_cols].skew())
print("\nKurtosis:")
print(df[numerical_cols].kurtosis())
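
Rank-like and count-like variables are often strongly right-skewed, in which case a log transform can make their distribution easier to inspect. The sketch below illustrates the effect on synthetic data; the lognormal sample is an assumption used only for illustration and says nothing about the actual distribution of `bsr_best`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (hypothetical stand-in for a column like bsr_best)
rng = np.random.default_rng(0)
ranks = pd.Series(rng.lognormal(mean=8, sigma=1.5, size=2000), name="bsr_best")

raw_skew = ranks.skew()
# log1p computes log(1 + x), which is safe for zero values
log_skew = np.log1p(ranks).skew()

print(f"Skewness before log1p: {raw_skew:.2f}")
print(f"Skewness after log1p:  {log_skew:.2f}")
```

If the real columns show a similar drop in skewness, plotting and correlating the transformed values may be more informative than using the raw ones.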

## 3. Analysis of Categorical Variables

In [None]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical columns: {len(categorical_cols)}")
print(categorical_cols)

In [None]:
# Analyze each categorical variable
for col in categorical_cols[:5]:  # First 5 to avoid overwhelming output
    print(f"\n{'='*50}")
    print(f"Column: {col}")
    print(f"{'='*50}")
    print(f"Unique values: {df[col].nunique()}")
    print("\nTop 10 values:")
    print(df[col].value_counts().head(10))
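
The printed output above can also be condensed into a single table, which is easier to scan when there are many categorical columns. This is a sketch under assumptions: `categorical_overview` is a hypothetical helper and the toy frame uses made-up values; on the real data the call would be `categorical_overview(df, categorical_cols)`.

```python
import pandas as pd

def categorical_overview(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """One row per categorical column: cardinality, missing count, dominant value."""
    rows = []
    for col in columns:
        counts = frame[col].value_counts()  # excludes NaN by default
        has_values = len(counts) > 0
        rows.append({
            "column": col,
            "n_unique": frame[col].nunique(),
            "n_missing": int(frame[col].isna().sum()),
            "top_value": counts.index[0] if has_values else None,
            # Share of all rows (including missing) taken by the most common value
            "top_share_pct": round(100 * counts.iloc[0] / len(frame), 2) if has_values else 0.0,
        })
    return pd.DataFrame(rows).set_index("column")

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "brand": ["Acme", "Acme", "Zeta", None],
    "category": ["toys", "toys", "toys", "home"],
})
overview = categorical_overview(toy, ["brand", "category"])
print(overview)
```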

## 4. Summary Tables

In [None]:
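# Summary table: Sales metrics by brand (top 10 brands by product count)
if 'brand' in df.columns and 'bsr_best' in df.columns:
    # Aggregate only the columns that are present to avoid a KeyError
    agg_spec = {'bsr_best': ['mean', 'min', 'max']}
    for col in ['review_count', 'avg_rating']:
        if col in df.columns:
            agg_spec[col] = ['mean']
    grouped = df.groupby('brand')
    brand_summary = grouped.agg(agg_spec).round(2)
    # Flatten the two-level column index, e.g. ('bsr_best', 'mean') -> 'bsr_best_mean'
    brand_summary.columns = ['_'.join(col) for col in brand_summary.columns]
    # Rows per brand; size() does not depend on an 'asin' column being present
    brand_summary['product_count'] = grouped.size()
    brand_summary = brand_summary.sort_values('product_count', ascending=False).head(10)
    print("Top 10 Brands Summary:")
    display(brand_summary)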

In [None]:
# Correlation matrix for key numerical variables
key_vars = ['bsr_best', 'review_count', 'avg_rating', 'clutter_score', 
            'edge_density', 'color_entropy']
available_vars = [var for var in key_vars if var in df.columns]
correlation_matrix = df[available_vars].corr()
print("Correlation Matrix:")
display(correlation_matrix)
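
`DataFrame.corr()` defaults to Pearson correlation, which measures linear association. Because a best seller rank is an ordinal quantity, a rank-based (Spearman) correlation may be a useful complement. The sketch below uses a synthetic, perfectly monotonic but nonlinear relationship to show how the two measures can differ; the data and the exponential form are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic example: rank improves (falls) exponentially as reviews grow
toy = pd.DataFrame({"review_count": np.arange(1, 51)})
toy["bsr_best"] = 1e6 * np.exp(-toy["review_count"] / 5)

pearson = toy.corr(method="pearson").loc["review_count", "bsr_best"]
spearman = toy.corr(method="spearman").loc["review_count", "bsr_best"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, the equivalent call would be `df[available_vars].corr(method='spearman')`. A large gap between the two matrices suggests a monotonic but nonlinear relationship.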

## 5. Visualizations

In [None]:
# Distribution of BSR (Best Seller Rank)
if 'bsr_best' in df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Histogram
    axes[0].hist(df['bsr_best'].dropna(), bins=50, edgecolor='black')
    axes[0].set_title('Distribution of Best Seller Rank', fontsize=14, fontweight='bold')
    axes[0].set_xlabel('BSR')
    axes[0].set_ylabel('Frequency')
    
    # Boxplot
    axes[1].boxplot(df['bsr_best'].dropna())
    axes[1].set_title('Boxplot of Best Seller Rank', fontsize=14, fontweight='bold')
    axes[1].set_ylabel('BSR')
    
    plt.tight_layout()
    plt.show()
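
If the histogram above is dominated by a long right tail, a version with log-spaced bins and a log-scaled x-axis may show the bulk of the distribution more clearly. This is a sketch, not the notebook's method: `log_histogram` is a hypothetical helper and `fake_bsr` is synthetic data; on the real data the input would be `df['bsr_best'].dropna()`.

```python
import numpy as np
import matplotlib.pyplot as plt

def log_histogram(values, bins=50):
    """Histogram of positive values using log-spaced bins on a log x-axis."""
    values = np.asarray(values, dtype=float)
    # Keep finite, strictly positive values only (log is undefined otherwise)
    values = values[np.isfinite(values) & (values > 0)]
    edges = np.logspace(np.log10(values.min()), np.log10(values.max()), bins + 1)
    # Pin the outer edges so rounding cannot exclude the extremes
    edges[0], edges[-1] = values.min(), values.max()

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(values, bins=edges, edgecolor="black")
    ax.set_xscale("log")
    ax.set_xlabel("BSR (log scale)")
    ax.set_ylabel("Frequency")
    ax.set_title("Distribution of Best Seller Rank (log-spaced bins)")
    return fig, edges

# Synthetic stand-in for df['bsr_best'] (values are made up)
rng = np.random.default_rng(1)
fake_bsr = rng.lognormal(mean=9, sigma=2, size=1000)
fig, edges = log_histogram(fake_bsr, bins=20)
```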

In [None]:
# Correlation heatmap
if len(available_vars) > 1:  # a heatmap needs at least two variables to be meaningful
    plt.figure(figsize=(10, 8))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
                square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Heatmap - Key Variables', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

In [None]:
# Review count vs Average rating scatter plot
if 'review_count' in df.columns and 'avg_rating' in df.columns:
    plt.figure(figsize=(12, 6))
    plt.scatter(df['review_count'], df['avg_rating'], alpha=0.5)
    plt.xlabel('Review Count', fontsize=12)
    plt.ylabel('Average Rating', fontsize=12)
    plt.title('Review Count vs Average Rating', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
# Brand distribution (top 15 brands)
if 'brand' in df.columns:
    top_brands = df['brand'].value_counts().head(15)
    
    plt.figure(figsize=(12, 6))
    top_brands.sort_values().plot(kind='barh')  # ascending order puts the largest bar on top
    plt.xlabel('Count', fontsize=12)
    plt.ylabel('Brand', fontsize=12)
    plt.title('Top 15 Brands by Product Count', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

In [None]:
# Interactive visualization: BSR vs image clutter score
if 'bsr_best' in df.columns and 'clutter_score' in df.columns:
    # Sample for performance; a fixed seed keeps the plot reproducible
    sample_df = df.sample(min(1000, len(df)), random_state=42)
    # Pass only the hover columns that exist, so a missing column does not raise an error
    hover_cols = [col for col in ['brand', 'review_count'] if col in df.columns]
    
    fig = px.scatter(sample_df, 
                     x='clutter_score', 
                     y='bsr_best',
                     color='avg_rating' if 'avg_rating' in df.columns else None,
                     hover_data=hover_cols or None,
                     title='BSR vs Image Clutter Score',
                     labels={'clutter_score': 'Clutter Score', 'bsr_best': 'Best Seller Rank'})
    fig.show()
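
A scatter plot of a heavily skewed rank can be hard to read, so a binned summary may complement it. The sketch below splits a clutter score into quartiles with `pd.qcut` and reports the median rank per quartile. The toy data is synthetic, with the two columns generated independently, so no trend should be read from it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real columns (values are made up and independent)
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "clutter_score": rng.uniform(0, 1, 400),
    "bsr_best": rng.lognormal(mean=9, sigma=1, size=400),
})

# Quartile bins of the clutter score, labelled from lowest (Q1) to highest (Q4)
toy["clutter_bin"] = pd.qcut(toy["clutter_score"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# The median is more robust than the mean for skewed ranks
binned = toy.groupby("clutter_bin", observed=True)["bsr_best"].agg(["count", "median"])
print(binned)
```

On the real data, the same two steps would apply to `df['clutter_score']` and `df['bsr_best']`. Note that `pd.qcut` raises an error when many tied values produce duplicate bin edges, in which case `duplicates='drop'` is one option.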

## 6. Key Insights

Based on the exploratory analysis:

1. **Dataset Overview:** [Add observations about dataset size and completeness]
2. **Sales Performance:** [Add insights about BSR distribution]
3. **Image Quality:** [Add insights about image metrics]
4. **Brand Analysis:** [Add insights about top brands]
5. **Correlations:** [Add insights about key relationships]

**Next Steps:**
- Further analysis of specific product categories
- Deep dive into image quality impact on sales
- Predictive modeling for sales forecasting