## Lab - EDA Univariate Analysis: Diving into Amazon UK Product Insights

**Objective**: Explore the product listing dynamics on Amazon UK to extract actionable business insights. By understanding the distribution, central tendencies, and relationships of various product attributes, businesses can make more informed decisions on product positioning, pricing strategies, and inventory management.

**Dataset**: This lab uses the [Amazon UK product dataset](https://www.kaggle.com/datasets/asaniczka/uk-optimal-product-price-prediction/),
which provides information on product categories, brands, prices, ratings, and more from Amazon UK. You'll need to download it before you can start working with it.

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

### Part 1: Understanding Product Categories

**Business Question**: What are the most popular product categories on Amazon UK, and how do they compare in terms of listing frequency?

1. **Frequency Tables**:
    - Generate a frequency table for the product `category`.
    - Which are the top 5 most listed product categories?



In [None]:
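# Load the dataset
df = pd.read_csv('amz_uk_price_prediction_dataset.csv')

# Assumption: some copies of the CSV may name the rating column 'stars'.
# Renaming is a no-op if that column is absent, so 'rating' works below either way.
df = df.rename(columns={'stars': 'rating'})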

# Display basic info about the dataset
print(df.info())
print("\nFirst few rows:")
print(df.head())

# Generate frequency table for product category
category_freq = df['category'].value_counts()
print("\nFrequency table for product categories:")
print(category_freq)

print("\n\nTop 5 most listed product categories:")
print(category_freq.head(5))

**Business Insight**: The top 5 categories show where Amazon UK has the most product diversity. These categories likely represent high-demand or high-competition markets.
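
Raw counts are easier to compare across categories when paired with relative shares. The sketch below shows one way to build such a table with `value_counts(normalize=True)`. The toy `categories` series and its category names are made up for illustration and stand in for `df['category']`.

```python
import pandas as pd

# Hypothetical toy data standing in for df['category']
categories = pd.Series(
    ["Sports & Outdoors"] * 5 + ["Beauty"] * 3 + ["Birthday Gifts"] * 2,
    name="category",
)

# Absolute counts and relative shares, both sorted in descending order
counts = categories.value_counts()
shares = categories.value_counts(normalize=True)

# Combine both views into one frequency table (aligned on the category label)
freq_table = pd.DataFrame({"count": counts, "share_pct": (shares * 100).round(1)})
print(freq_table)
```

On the real data, the same two calls on `df['category']` would give the share of all listings held by each category.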


2. **Visualizations**:
    - Display the distribution of products across different categories using a bar chart. *If you face problems understanding the chart, do it for a subset of top categories.*
    - For a subset of top categories, visualize their proportions using a pie chart. Does any category dominate the listings?

In [None]:
# Bar chart for top 20 categories (to make it readable)
top_20_categories = category_freq.head(20)

plt.figure(figsize=(12, 6))
plt.bar(range(len(top_20_categories)), top_20_categories.values)
plt.xlabel('Category', fontsize=12)
plt.ylabel('Number of Products', fontsize=12)
plt.title('Distribution of Products Across Top 20 Categories', fontsize=14)
plt.xticks(range(len(top_20_categories)), top_20_categories.index, rotation=90, ha='center')
plt.tight_layout()
plt.show()

# Pie chart for top 5 categories
top_5_categories = category_freq.head(5)

plt.figure(figsize=(10, 8))
plt.pie(top_5_categories.values, labels=top_5_categories.index, autopct='%1.1f%%', startangle=90)
plt.title('Proportion of Top 5 Product Categories', fontsize=14)
plt.axis('equal')
plt.show()

print(f"\nThe top category '{top_5_categories.index[0]}' represents {top_5_categories.values[0]/category_freq.sum()*100:.2f}% of all listings.")
if top_5_categories.values[0]/category_freq.sum() > 0.20:
    print("This category significantly dominates the listings.")

**Business Insight**: The visualizations help identify which product categories have the most listings. A dominant category may reflect high demand, but it may equally reflect a crowded, oversupplied market, because listing counts measure supply rather than sales. Businesses can use this as a starting point when deciding where to focus marketing efforts or inventory investments.
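
A pie chart restricted to the top categories shows shares among those categories only, not among all listings. One possible way to keep the slices honest is to add an "Other" slice for everything outside the top N, as sketched below. The counts and the value of `top_n` are hypothetical, and `category_freq` here is a stand-in for the frequency table computed earlier.

```python
import pandas as pd

# Hypothetical counts standing in for category_freq
category_freq = pd.Series(
    {"A": 50, "B": 30, "C": 10, "D": 6, "E": 4},
    name="count",
)

top_n = 2  # illustrative value; the lab itself uses the top 5
top = category_freq.nlargest(top_n)
other = category_freq.sum() - top.sum()

# Append a single "Other" slice so the shares refer to all listings
pie_data = pd.concat([top, pd.Series({"Other": other})])
pie_shares = pie_data / pie_data.sum() * 100
print(pie_shares.round(1))
```

Passing `pie_data.values` and `pie_data.index` to `plt.pie` would then produce slices whose percentages match the share of all listings.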

### Part 2: Delving into Product Pricing

**Business Question**: How are products priced on Amazon UK, and are there specific price points or ranges that are more common?

1. **Measures of Centrality**:
    - Calculate the mean, median, and mode for the `price` of products.
    - What's the average price point of products listed? How does this compare with the most common price point (mode)?


In [None]:
# Calculate measures of centrality for price
mean_price = df['price'].mean()
median_price = df['price'].median()
mode_price = df['price'].mode()[0] if len(df['price'].mode()) > 0 else None

print("Price Statistics - Measures of Centrality:")
print(f"Mean (Average) Price: £{mean_price:.2f}")
print(f"Median Price: £{median_price:.2f}")
print(f"Mode (Most Common) Price: £{mode_price:.2f}")

print(f"\n\nInterpretation:")
print(f"- The average product price is £{mean_price:.2f}")
print(f"- The median price of £{median_price:.2f} means that half of the products are priced at or below this value")
print(f"- The most common price point is £{mode_price:.2f}")

if mean_price > median_price:
    print(f"- Since mean > median, the distribution is likely right-skewed: a few high-priced products pull the average up")

**Business Insight**: Understanding the average and most common price points helps sellers position their products competitively. If mean > median, it suggests premium products exist that can command higher prices.
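
To see why the mean and median can diverge, the sketch below uses a small, made-up price list with one expensive item. The values are hypothetical and only illustrate the effect. Note that `mode()` returns a Series because several values can tie for most frequent.

```python
import pandas as pd

# Hypothetical prices: mostly cheap items plus one expensive outlier
prices = pd.Series([9.99, 9.99, 12.50, 15.00, 19.99, 24.99, 499.00])

mean_price = prices.mean()      # pulled upward by the 499.00 item
median_price = prices.median()  # middle value, unaffected by the outlier
modes = prices.mode()           # may contain several values if there are ties

print(f"mean={mean_price:.2f}, median={median_price:.2f}, modes={modes.tolist()}")
```

Here a single expensive product is enough to lift the mean far above the median, which is the pattern the comparison above looks for.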

2. **Measures of Dispersion**:
    - Determine the variance, standard deviation, range, and interquartile range for product `price`.
    - How varied are the product prices? Are there any indicators of a significant spread in prices?

**Business Insight**: High price dispersion indicates a diverse marketplace with products at various price points, allowing businesses to target different customer segments.

In [None]:
# Calculate measures of dispersion for price
variance_price = df['price'].var()
std_price = df['price'].std()
range_price = df['price'].max() - df['price'].min()
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr_price = q3 - q1

print("Price Statistics - Measures of Dispersion:")
print(f"Variance: {variance_price:.2f} (in squared pounds, £²)")
print(f"Standard Deviation: £{std_price:.2f}")
print(f"Range: £{range_price:.2f} (from £{df['price'].min():.2f} to £{df['price'].max():.2f})")
print(f"Interquartile Range (IQR): £{iqr_price:.2f}")
print(f"  Q1 (25th percentile): £{q1:.2f}")
print(f"  Q3 (75th percentile): £{q3:.2f}")

print(f"\n\nInterpretation:")
cv = (std_price / mean_price) * 100
print(f"- Coefficient of Variation: {cv:.2f}%")
if cv > 50:
    print(f"- High variability in prices (CV > 50%), indicating significant price spread across products")
else:
    print(f"- Moderate variability in prices")
print(f"- The IQR of £{iqr_price:.2f} shows that the middle 50% of products vary by this amount")
print(f"- The wide range indicates products from budget to premium price points")
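
The dispersion measures above differ in units and in sensitivity to extreme values. The following sketch, on hypothetical prices, shows how they relate. It assumes pandas defaults, meaning the sample variance and standard deviation (`ddof=1`) and linear interpolation for quantiles.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([10.0, 12.0, 14.0, 16.0, 98.0])

variance = prices.var()          # sample variance (ddof=1), in squared pounds
std = prices.std()               # sample standard deviation, in pounds
iqr = prices.quantile(0.75) - prices.quantile(0.25)  # spread of the middle 50%
cv = std / prices.mean() * 100   # coefficient of variation, unitless percent

print(f"variance={variance:.2f}, std={std:.2f}, IQR={iqr:.2f}, CV={cv:.1f}%")
```

The single extreme price inflates the variance and standard deviation, while the IQR stays small. This is why the IQR is often preferred for skewed data such as prices.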

3. **Visualizations**:
    - Is there a specific price range where most products fall? Plot a histogram to visualize the distribution of product prices. *If it's hard to read these diagrams, think about why that is, and explain how it could be solved.*
    - Are there products that are priced significantly higher than the rest? Use a box plot to showcase the spread and potential outliers in product pricing.

In [None]:
# Histogram of product prices
plt.figure(figsize=(12, 5))

# Original histogram
plt.subplot(1, 2, 1)
plt.hist(df['price'], bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Price (£)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Product Prices', fontsize=14)
plt.grid(axis='y', alpha=0.3)

# Same histogram with a log-scaled y-axis so that sparse high-price bins become visible
plt.subplot(1, 2, 2)
plt.hist(df['price'], bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Price (£)', fontsize=12)
plt.ylabel('Frequency (log scale)', fontsize=12)
plt.title('Distribution of Product Prices (Log Scale)', fontsize=14)
plt.yscale('log')
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("Note: The histogram is hard to read because most products are concentrated at lower prices,")
print("while a few expensive products stretch the x-axis. Using a log scale on the y-axis helps visualize this better.")
print("Another solution would be to use smaller bin sizes or focus on a specific price range.")

# Box plot to identify outliers
plt.figure(figsize=(12, 6))
plt.boxplot(df['price'].dropna(), vert=False, widths=0.5)  # drop NaNs, which would otherwise leave the box empty
plt.xlabel('Price (£)', fontsize=12)
plt.title('Box Plot of Product Prices - Identifying Outliers', fontsize=14)
plt.grid(axis='x', alpha=0.3)
plt.show()

# Identify outliers using IQR method
lower_bound = q1 - 1.5 * iqr_price
upper_bound = q3 + 1.5 * iqr_price
outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]

print(f"\nOutlier Analysis:")
print(f"- Number of outliers: {len(outliers)} ({len(outliers)/len(df)*100:.2f}% of products)")
print(f"- Price range for outliers: £{outliers['price'].min():.2f} to £{outliers['price'].max():.2f}")
print(f"- These are products priced significantly higher than the typical range")
print(f"- The box plot shows a long right tail, confirming premium products exist in the marketplace")

**Business Insight**: Most products fall within a specific lower-to-mid price range, but premium products exist as outliers. This suggests opportunities for both budget-conscious and luxury market segments. Sellers should be strategic about pricing based on their target segment.
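
The 1.5 × IQR rule used for the outlier analysis can be wrapped in a small helper, which also gives a simple way to make the histogram readable by plotting only the non-outlier prices. This is a minimal sketch. The function name `iqr_bounds` and the sample prices are hypothetical.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical prices with one very expensive product
prices = pd.Series([5.0, 8.0, 10.0, 12.0, 15.0, 18.0, 20.0, 950.0])

lower, upper = iqr_bounds(prices)
inside = prices.between(lower, upper)  # inclusive on both ends
typical = prices[inside]               # values within the fences
outliers = prices[~inside]             # values outside the fences

print(f"fences: [{lower:.2f}, {upper:.2f}], outliers: {outliers.tolist()}")
```

Plotting `typical` instead of the full column, for example with `plt.hist(typical, bins=50)`, keeps the x-axis from being stretched by a handful of extreme prices. The outliers should still be reported separately rather than silently dropped.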

### Part 3: Unpacking Product Ratings

**Business Question**: How do customers rate products on Amazon UK, and are there any patterns or tendencies in the ratings?

1. **Measures of Centrality**:
    - Calculate the mean, median, and mode for the `rating` of products.
    - How do customers generally rate products? Is there a common trend?

**Business Insight**: Understanding customer rating patterns helps businesses gauge overall satisfaction and identify if products meet customer expectations.

In [None]:
# Calculate measures of centrality for rating
mean_rating = df['rating'].mean()
median_rating = df['rating'].median()
mode_rating = df['rating'].mode()[0] if len(df['rating'].mode()) > 0 else None

print("Rating Statistics - Measures of Centrality:")
print(f"Mean (Average) Rating: {mean_rating:.2f}")
print(f"Median Rating: {median_rating:.2f}")
print(f"Mode (Most Common) Rating: {mode_rating:.2f}")

print(f"\n\nInterpretation:")
print(f"- Customers generally rate products at an average of {mean_rating:.2f} stars")
print(f"- The median rating of {median_rating:.2f} shows the middle point of all ratings")
print(f"- The most common rating is {mode_rating:.2f} stars")

if mean_rating >= 4.0:
    print(f"- Overall, customers tend to give positive ratings, suggesting good product quality")
elif mean_rating >= 3.0:
    print(f"- Ratings are moderate, indicating mixed customer satisfaction")
else:
    print(f"- Low average ratings suggest potential quality or satisfaction issues")

2. **Measures of Dispersion**:
    - Determine the variance, standard deviation, and interquartile range for product `rating`.
    - Are the ratings consistent, or is there a wide variation in customer feedback?

**Business Insight**: Consistent high ratings indicate reliable product quality. Wide variation in ratings suggests some products excel while others disappoint, requiring quality control improvements.

In [None]:
# Calculate measures of dispersion for rating
variance_rating = df['rating'].var()
std_rating = df['rating'].std()
q1_rating = df['rating'].quantile(0.25)
q3_rating = df['rating'].quantile(0.75)
iqr_rating = q3_rating - q1_rating

print("Rating Statistics - Measures of Dispersion:")
print(f"Variance: {variance_rating:.4f}")
print(f"Standard Deviation: {std_rating:.4f}")
print(f"Interquartile Range (IQR): {iqr_rating:.4f}")
print(f"  Q1 (25th percentile): {q1_rating:.2f}")
print(f"  Q3 (75th percentile): {q3_rating:.2f}")

print(f"\n\nInterpretation:")
if std_rating < 0.5:
    print(f"- Low standard deviation ({std_rating:.4f}) indicates ratings are consistent")
    print(f"- Most customers have similar opinions about products")
elif std_rating < 1.0:
    print(f"- Moderate standard deviation ({std_rating:.4f}) shows some variation in ratings")
else:
    print(f"- High standard deviation ({std_rating:.4f}) indicates wide variation in customer feedback")
    print(f"- Products receive diverse ratings from very low to very high")

print(f"- The IQR of {iqr_rating:.2f} shows the middle 50% of ratings vary by this amount")

3. **Shape of the Distribution**:
    - Calculate the skewness and kurtosis for the `rating` column. 
    - Are the ratings normally distributed, or do they lean towards higher or lower values?

In [None]:
# Calculate skewness and kurtosis for rating
from scipy.stats import skew, kurtosis

skewness_rating = skew(df['rating'].dropna())
kurtosis_rating = kurtosis(df['rating'].dropna())

print("Rating Statistics - Shape of Distribution:")
print(f"Skewness: {skewness_rating:.4f}")
print(f"Kurtosis: {kurtosis_rating:.4f}")

print(f"\n\nInterpretation:")

# Skewness interpretation (0.5 is a rule-of-thumb cutoff, not a formal test)
if abs(skewness_rating) < 0.5:
    print(f"- Skewness close to 0 ({skewness_rating:.4f}) indicates an approximately symmetric distribution")
elif skewness_rating < 0:
    print(f"- Negative skewness ({skewness_rating:.4f}) indicates left-skewed distribution")
    print(f"- Ratings lean toward higher values with a tail toward lower ratings")
else:
    print(f"- Positive skewness ({skewness_rating:.4f}) indicates right-skewed distribution")
    print(f"- Ratings lean toward lower values with a tail toward higher ratings")

# Kurtosis interpretation (scipy's default is excess kurtosis, which is 0 for a normal distribution)
if abs(kurtosis_rating) < 0.5:
    print(f"- Excess kurtosis close to 0 ({kurtosis_rating:.4f}) indicates tails similar to a normal distribution (mesokurtic)")
elif kurtosis_rating > 0:
    print(f"- Positive kurtosis ({kurtosis_rating:.4f}) indicates heavy tails (leptokurtic)")
    print(f"- More extreme ratings (very high or very low) than a normal distribution")
else:
    print(f"- Negative kurtosis ({kurtosis_rating:.4f}) indicates light tails (platykurtic)")
    print(f"- Fewer extreme ratings than a normal distribution")

print(f"\nConclusion:")
if abs(skewness_rating) < 0.5 and abs(kurtosis_rating) < 0.5:
    print(f"- Skewness and excess kurtosis are both near 0, which is consistent with an approximately normal shape")
else:
    print(f"- Ratings deviate from a normal distribution")

**Business Insight**: The shape of the rating distribution reveals customer sentiment patterns. Left-skewed ratings (clustering around high values) suggest strong customer satisfaction, while right-skewed ratings indicate quality concerns.

4. **Visualizations**:
    - Plot a histogram to visualize the distribution of product ratings. Is there a specific rating that is more common?


In [None]:
# Histogram of product ratings
plt.figure(figsize=(12, 6))
plt.hist(df['rating'].dropna(), bins=20, edgecolor='black', alpha=0.7, color='skyblue')
plt.xlabel('Rating', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Product Ratings', fontsize=14)
plt.grid(axis='y', alpha=0.3)
plt.show()

# Count the frequency of each rating
rating_counts = df['rating'].value_counts().sort_index()
print("\nRating Frequency Table:")
print(rating_counts)

most_common_rating = rating_counts.idxmax()
most_common_count = rating_counts.max()
print(f"\n\nMost Common Rating: {most_common_rating} stars")
print(f"Frequency: {most_common_count} products ({most_common_count/len(df)*100:.2f}% of all products)")

# Bar chart for rating distribution
plt.figure(figsize=(10, 6))
rating_counts.plot(kind='bar', color='coral', edgecolor='black', alpha=0.7)
plt.xlabel('Rating', fontsize=12)
plt.ylabel('Number of Products', fontsize=12)
plt.title('Frequency of Each Rating Value', fontsize=14)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

**Business Insight**: Identifying the most common rating helps businesses understand typical customer satisfaction levels. If ratings cluster around 4-5 stars, products generally meet expectations. If they cluster around 1-3 stars, there are quality or service issues to address.
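
One caveat when interpreting the most common rating: if the dataset encodes products with no reviews as a rating of 0, those rows will dominate the frequency table and pull the mean down. That encoding is an assumption here and should be checked against the data, for example by comparing with a review-count column if one exists. The sketch below, on hypothetical values, shows how the summary changes once such rows are set aside.

```python
import pandas as pd

# Hypothetical ratings; 0.0 is ASSUMED to mean "no rating yet"
ratings = pd.Series([0.0, 0.0, 0.0, 4.5, 4.0, 5.0, 4.5])

rated = ratings[ratings > 0]                # keep only products with a rating
share_unrated = (ratings == 0).mean() * 100  # percentage of zero ratings

print(f"mean (all rows):   {ratings.mean():.2f}")
print(f"mean (rated only): {rated.mean():.2f}")
print(f"unrated share:     {share_unrated:.1f}%")
```

If the assumption holds, reporting both views (all products and rated products only) gives a fairer picture of customer satisfaction.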

**Submission**: Submit a Jupyter Notebook that contains your code and a business-centric report summarizing your findings.