# Part 3 - Exploratory Data Analysis

In [23]:
!pip install matplotlib

Collecting matplotlib
  Downloading matplotlib-3.10.1-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.2-cp311-cp311-win_amd64.whl.metadata (5.5 kB)
Collecting cycler>=0.10 (from matplotlib)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.57.0-cp311-cp311-win_amd64.whl.metadata (104 kB)
Collecting kiwisolver>=1.3.1 (from matplotlib)
  Downloading kiwisolver-1.4.8-cp311-cp311-win_amd64.whl.metadata (6.3 kB)
Collecting pillow>=8 (from matplotlib)
  Downloading pillow-11.2.1-cp311-cp311-win_amd64.whl.metadata (9.1 kB)
Collecting pyparsing>=2.3.1 (from matplotlib)
  Downloading pyparsing-3.2.3-py3-none-any.whl.metadata (5.0 kB)
Downloading matplotlib-3.10.1-cp311-cp311-win_amd64.whl (8.1 MB)

In [33]:
import matplotlib
print(matplotlib.__version__)

3.10.1


## Visualizations

This section performs Task 3, analyzing the Amazon Reviews 2023 dataset through visualizations and correlation analysis. DuckDB queries aggregate data from cleaned Parquet files across all categories. The analysis includes:

- Star Rating Histogram: Distribution of ratings (1-5) to assess review sentiment.
- Top 10 Categories by Review Count: Identifies the most-reviewed categories (e.g., Amazon Home, AMAZON FASHION).
- Top 10 Brands by Review Count (Excluding Unknown): Highlights the most-reviewed brands (e.g., Amazon, Amazon Basics).
- Average Star Rating per Year: Tracks rating trends over time (1996-2023).
- Pearson Correlation (Review Length vs. Star Rating): Measures the linear relationship between review length and rating (-0.0673, a very weak negative correlation, i.e. almost no linear relationship).

Plots are saved to F:/sentiments/sentiments/plots/ for review.
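
As a rough illustration of what the Pearson coefficient mentioned above measures, here is a minimal pure-Python sketch. The function name `pearson_r` and the toy numbers are made up for illustration; the notebook itself gets its value from DuckDB's `CORR` aggregate over the full dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; the 1/n factors cancel in the final ratio
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: review lengths (characters) and star ratings
lengths = [20, 45, 300, 800, 1200]
ratings = [5, 5, 4, 2, 1]
r = pearson_r(lengths, ratings)
print(f"r = {r:.4f}")
```

In this toy sample the longer reviews carry the lower ratings, so `r` comes out strongly negative. A value near zero, like the one reported for the real data, means review length tells you almost nothing about the rating in a linear sense.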

In [11]:
import duckdb
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend: figures are written to files, not displayed
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import os

# Directory and categories
data_dir = Path("F:/sentiments/sentiments")
categories = [
    "All_Beauty", "Amazon_Fashion", "Appliances", "Arts_Crafts_and_Sewing", "Automotive",
    "Baby_Products", "Beauty_and_Personal_Care", "Books", "CDs_and_Vinyl",
    "Cell_Phones_and_Accessories", "Clothing_Shoes_and_Jewelry", "Digital_Music", "Electronics",
    "Gift_Cards", "Grocery_and_Gourmet_Food", "Handmade_Products", "Health_and_Household",
    "Health_and_Personal_Care", "Home_and_Kitchen", "Industrial_and_Scientific", "Kindle_Store",
    "Magazine_Subscriptions", "Movies_and_TV", "Musical_Instruments", "Office_Products",
    "Patio_Lawn_and_Garden", "Pet_Supplies", "Software", "Sports_and_Outdoors",
    "Subscription_Boxes", "Tools_and_Home_Improvement", "Toys_and_Games", "Video_Games", "Unknown"
]

# Output directory for plots
output_dir = data_dir / "plots"
try:
    output_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent folders
    print(f"Output directory created or exists: {output_dir}")
except Exception as e:
    print(f"Error creating output directory {output_dir}: {e}")
    raise

# Create DuckDB connection
con = duckdb.connect()
con.execute("INSTALL parquet; LOAD parquet;")
con.execute("SET memory_limit='12GB';")  # 12GB for 14GB RAM
con.execute("SET threads TO 8;")  # Adjust to CPU cores

# Combine all Parquet files
try:
    con.execute(f"""
        CREATE VIEW combined_data AS
        SELECT
            parent_asin,
            rating,
            text,
            user_id,
            asin,
            categories,
            main_category,
            helpful_vote,
            verified_purchase,
            average_rating,
            rating_number,
            price,
            brand,
            review_length,
            year,
            sentiment
        FROM read_parquet('{data_dir}/sentiment_*.parquet')
    """)
    print("Successfully created combined_data view")
except Exception as e:
    print(f"Error creating combined_data view: {e}")
    raise

# a) Star Rating Histogram
try:
    con.execute("""
        SELECT rating, COUNT(*) AS count
        FROM combined_data
        WHERE rating IS NOT NULL AND rating BETWEEN 1 AND 5
        GROUP BY rating
        ORDER BY rating
    """)
    ratings_data = con.fetchdf()
    print("Star Rating Histogram Data:")
    print(ratings_data)
    if ratings_data.empty:
        print("No data for Star Rating Histogram")
    else:
        plt.figure(figsize=(8, 6))
        plt.bar(ratings_data['rating'], ratings_data['count'], color='skyblue')
        plt.title('Star Rating Histogram')
        plt.xlabel('Rating')
        plt.ylabel('Number of Reviews')
        plt.xticks([1, 2, 3, 4, 5])
        plt.grid(axis='y', alpha=0.75)
        plt.savefig(output_dir / 'star_rating_histogram.png')
        plt.close()
        print("Star Rating Histogram saved")
except Exception as e:
    print(f"Error in Star Rating Histogram: {e}")

# b) Top 10 Categories by Review Count
try:
    con.execute("""
        SELECT main_category, COUNT(*) AS review_count
        FROM combined_data
        WHERE main_category IS NOT NULL
        GROUP BY main_category
        ORDER BY review_count DESC
        LIMIT 10
    """)
    categories_data = con.fetchdf()
    print("Top 10 Categories Data:")
    print(categories_data)
    if categories_data.empty:
        print("No data for Top 10 Categories")
    else:
        plt.figure(figsize=(10, 6))
        plt.bar(categories_data['main_category'], categories_data['review_count'], color='lightgreen')
        plt.title('Top 10 Categories by Review Count')
        plt.xlabel('Category')
        plt.ylabel('Review Count')
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.savefig(output_dir / 'top_10_categories.png')
        plt.close()
        print("Top 10 Categories plot saved")
except Exception as e:
    print(f"Error in Top 10 Categories: {e}")

# c) Top 10 Brands by Review Count (Excluding "Unknown")
try:
    con.execute("""
        SELECT brand, COUNT(*) AS review_count
        FROM combined_data
        WHERE brand IS NOT NULL AND brand != 'Unknown'
        GROUP BY brand
        ORDER BY review_count DESC
        LIMIT 10
    """)
    brands_data = con.fetchdf()
    print("Top 10 Brands Data:")
    print(brands_data)
    if brands_data.empty:
        print("No data for Top 10 Brands")
    else:
        plt.figure(figsize=(10, 6))
        plt.bar(brands_data['brand'], brands_data['review_count'], color='salmon')
        plt.title('Top 10 Brands by Review Count (Excluding Unknown)')
        plt.xlabel('Brand')
        plt.ylabel('Review Count')
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.savefig(output_dir / 'top_10_brands.png')
        plt.close()
        print("Top 10 Brands plot saved")
except Exception as e:
    print(f"Error in Top 10 Brands: {e}")

# d) Average Star Rating per Year
try:
    con.execute("""
        SELECT year, AVG(rating) AS avg_rating
        FROM combined_data
        WHERE year IS NOT NULL AND rating IS NOT NULL
        GROUP BY year
        ORDER BY year
    """)
    trend_data = con.fetchdf()
    print("Average Star Rating per Year Data:")
    print(trend_data)
    if trend_data.empty:
        print("No data for Average Star Rating per Year")
    else:
        plt.figure(figsize=(10, 6))
        plt.plot(trend_data['year'], trend_data['avg_rating'], marker='o', color='purple')
        plt.title('Average Star Rating per Year')
        plt.xlabel('Year')
        plt.ylabel('Average Rating')
        plt.grid(True)
        plt.savefig(output_dir / 'avg_rating_trend.png')
        plt.close()
        print("Average Star Rating per Year plot saved")
except Exception as e:
    print(f"Error in Average Star Rating per Year: {e}")

# e) Pearson Correlation between Review Length and Star Rating
try:
    con.execute("""
        SELECT
            CORR(review_length, rating) AS pearson_corr
        FROM combined_data
        WHERE review_length IS NOT NULL AND rating IS NOT NULL
    """)
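    corr_result = con.fetchdf()
    pearson_corr = corr_result['pearson_corr'].iloc[0]
    # CORR yields NULL (NaN here) with no usable rows or a constant column;
    # without this guard NaN would fall through to the "strong" branch below.
    if np.isnan(pearson_corr):
        raise ValueError("correlation is undefined for this data")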
    print(f"Pearson Correlation between Review Length and Star Rating: {pearson_corr:.4f}")
    print("Interpretation: The correlation indicates the strength and direction of the linear relationship between review length and star rating.")
    if abs(pearson_corr) < 0.1:
        print("The correlation is very weak, suggesting almost no linear relationship.")
    elif abs(pearson_corr) < 0.3:
        print("The correlation is weak, suggesting a slight linear relationship.")
    elif abs(pearson_corr) < 0.5:
        print("The correlation is moderate, suggesting a noticeable linear relationship.")
    else:
        print("The correlation is strong, suggesting a significant linear relationship.")
    if pearson_corr > 0:
        print("A positive value indicates that longer reviews tend to have higher ratings.")
    elif pearson_corr < 0:
        print("A negative value indicates that longer reviews tend to have lower ratings.")
    else:
        print("A correlation of 0 indicates no linear relationship.")
except Exception as e:
    print(f"Error in Pearson Correlation: {e}")

# Clean up
con.close()


Output directory created or exists: F:\sentiments\sentiments\plots
Successfully created combined_data view



Star Rating Histogram Data:
   rating      count
0     1.0   50540693
1     2.0   24497869
2     3.0   35273582
3     4.0   63361041
4     5.0  329351567
Star Rating Histogram saved
Top 10 Categories Data:
               main_category  review_count
0                Amazon Home      84935034
1             AMAZON FASHION      67182919
2   Tools & Home Improvement      30723822
3               Buy a Kindle      30255213
4                      Books      24066763
5  Cell Phones & Accessories      22191807
6                 All Beauty      21635088
7     Health & Personal Care      20811458
8                 Automotive      18538520
9            All Electronics      17127183
Top 10 Categories plot saved



Top 10 Brands Data:
            brand  review_count
0          Amazon       2655762
1   Amazon Basics       1820180
2         SAMSUNG       1025081
3        Skechers        849766
4  Amazon Renewed        831922
5         Generic        758063
6           Hanes        733109
7            Sony        637319
8          Spigen        633336
9           Anker        619502
Top 10 Brands plot saved



Average Star Rating per Year Data:
    year  avg_rating
0   1996    4.708333
1   1997    4.400075
2   1998    4.381895
3   1999    4.310811
4   2000    4.263917
5   2001    4.215918
6   2002    4.190905
7   2003    4.150609
8   2004    4.083347
9   2005    4.062727
10  2006    4.100867
11  2007    4.186756
12  2008    4.143521
13  2009    4.125884
14  2010    4.096444
15  2011    4.090552
16  2012    4.157622
17  2013    4.245316
18  2014    4.273626
19  2015    4.287157
20  2016    4.287941
21  2017    4.248392
22  2018    4.222303
23  2019    4.280047
24  2020    4.186949
25  2021    4.081332
26  2022    4.026181
27  2023    4.062087
Average Star Rating per Year plot saved



Pearson Correlation between Review Length and Star Rating: -0.0673
Interpretation: The correlation indicates the strength and direction of the linear relationship between review length and star rating.
The correlation is very weak, suggesting almost no linear relationship.
A negative value indicates that longer reviews tend to have lower ratings.
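
The rating counts printed above can be summarized a little further. The sketch below was not part of the original run: it copies the five counts from the "Star Rating Histogram Data" output and derives each rating's share and the overall mean rating. The variable names are illustrative.

```python
# Counts copied from the "Star Rating Histogram Data" output above
rating_counts = {
    1: 50_540_693,
    2: 24_497_869,
    3: 35_273_582,
    4: 63_361_041,
    5: 329_351_567,
}

total = sum(rating_counts.values())
# Fraction of all reviews at each star level
shares = {star: n / total for star, n in rating_counts.items()}
# Count-weighted mean of the star values
mean_rating = sum(star * n for star, n in rating_counts.items()) / total

for star, share in shares.items():
    print(f"{star} stars: {share:6.1%}")
print(f"Total reviews: {total:,}")
print(f"Overall mean rating: {mean_rating:.3f}")
```

With these counts, roughly two thirds of all reviews are five-star, so the distribution is heavily skewed toward the top rating.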
