In [None]:
import pandas as pd
import plotly.express as px

In [None]:
df = pd.read_csv("amz_uk_price_prediction_dataset.csv")

In [None]:
df.shape

In [None]:
df.head(10)

In [None]:
df.tail(10)

# Part 1: Understanding Product Categories

What are the most popular product categories on Amazon UK, and how do they compare in terms of listing frequency?
- Generate a frequency table for the product category.
- Which are the top 5 most listed product categories?

In [None]:
df["category"].value_counts()

Visualizations:

- Display the distribution of products across different categories using a bar chart. If you face problems understanding the chart, do it for a subset of top categories.
- For a subset of top categories, visualize their proportions using a pie chart. Does any category dominate the listings?

In [None]:
category_counts = df["category"].value_counts().reset_index()
category_counts.columns = ["category", "count"]

fig = px.bar(
    category_counts,
    x="category",
    y="count",
    labels={
        "category": "Product Category",
        "count": "Number of Listings"
    },
    title="Distribution of Products Across Categories"
)
fig.show()

In [None]:
top_5_products = df["category"].value_counts().head(5).reset_index()
top_5_products.columns = ["category", "count"]

fig = px.pie(
    top_5_products,
    names="category",
    values="count",
    title="Top 5 Product Categories by Number of Listings"
)

fig.update_traces(textinfo="percent+label")
fig.show()



## Findings

Top 5 categories by number of listings:
- Sports & Outdoors
- Beauty
- Handmade Clothing, Shoes & Accessories
- Bath & Body
- Birthday Gifts

Sports & Outdoors has far more listings than any other category. One possible reason is that it covers a wide variety of product types, such as sportswear, footwear, and gear.
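To put a number on how strongly one category dominates, the share of all listings can be computed alongside the counts. The following is a minimal sketch; `category_share` is a hypothetical helper name, and the toy series stands in for `df["category"]`.

```python
import pandas as pd


def category_share(categories: pd.Series, top_n: int = 5) -> pd.DataFrame:
    """Return the count and the share of all listings for the top_n categories."""
    counts = categories.value_counts()  # sorted from most to least frequent
    top = counts.head(top_n)
    return pd.DataFrame({
        "count": top,
        "share_of_all_listings": (top / counts.sum()).round(4),
    })


# Toy data for illustration only; with the notebook's data: category_share(df["category"])
toy_categories = pd.Series(["A", "A", "A", "B", "B", "C"])
print(category_share(toy_categories, top_n=2))
```

Note that the share here is relative to all listings, whereas the pie chart above shows proportions within the top 5 only.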

# Part 2: Delving into Product Pricing

Measures of Centrality:

- Calculate the mean, median, and mode for the price of products.
- What's the average price point of products listed? How does this compare with the most common price point (mode)?

In [None]:
df["price"].describe()

In [None]:
df["price"].mode()

- Mean - 89.24
- Median - 19.09
- Mode - 9.99

## Findings

The average price is £89.24, much higher than the most common price (mode) of £9.99.
A likely reason is that a small number of high-priced products pull the mean upwards, while most products are priced at the lower end. The median of £19.09 is consistent with this.
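The three measures can also be collected in one place rather than read off separate cells. This is a sketch only; `central_tendency` is a hypothetical helper, and when a column has several modes it reports the smallest one.

```python
import pandas as pd


def central_tendency(values: pd.Series) -> pd.Series:
    """Return mean, median, and mode of a numeric column."""
    modes = values.mode()  # sorted; may contain more than one value
    return pd.Series({
        "mean": values.mean(),
        "median": values.median(),
        "mode": modes.iloc[0],  # smallest mode if there are ties
    })


# Toy data for illustration only; with the notebook's data: central_tendency(df["price"])
toy_prices = pd.Series([1, 2, 2, 3, 12])
print(central_tendency(toy_prices))
```

In the toy data a single large value lifts the mean above the median and mode, the same pattern seen in the price column.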

Measures of Dispersion:

- Determine the variance, standard deviation, range, and interquartile range for product price.
- How varied are the product prices? Are there any indicators of a significant spread in prices?

In [None]:
df["price"].describe()

In [None]:
df["price"].var()

- Standard deviation - 345.61
- Range (Max-min) - 100,000 - 0 = 100,000
- Interquartile Range - 75%-25% = 45.99 - 9.99 = 36.00
- Variance - 119445.48

## Findings

- Product prices are highly varied. The large standard deviation (345.61) and extremely wide range (100,000) indicate a significant spread in prices.
- The interquartile range (36.00) shows that the middle 50% of products are priced within a much narrower band. This points to high-priced outliers that inflate the overall variability, which is consistent with the gap between the mean and the mode noted in the previous findings.
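The range and interquartile range above were derived by hand from the `describe()` output. As a sketch, they could be computed directly; `dispersion` is a hypothetical helper name, and the variance shown is the sample variance that pandas returns by default.

```python
import pandas as pd


def dispersion(values: pd.Series) -> pd.Series:
    """Return variance, standard deviation, range, and IQR of a numeric column."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return pd.Series({
        "variance": values.var(),  # sample variance (ddof=1), the pandas default
        "std": values.std(),
        "range": values.max() - values.min(),
        "iqr": q3 - q1,
    })


# Toy data for illustration only; with the notebook's data: dispersion(df["price"])
toy_values = pd.Series([1, 2, 3, 4, 5])
print(dispersion(toy_values))
```

The same helper would work for any numeric column, for example `df["stars"]`.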

Visualizations:

- Is there a specific price range where most products fall? Plot a histogram to visualize the distribution of product prices. If it's hard to read these diagrams, think about why this is, and explain how it could be solved.
- Are there products that are priced significantly higher than the rest? Use a box plot to showcase the spread and potential outliers in product pricing.

In [None]:
fig = px.histogram(
    df,
    x="price",
    nbins=10,
    title="Distribution of Product Prices"
)
fig.update_xaxes(title="Product Price")
fig.update_yaxes(title="Number of Products")

fig.show()


## Findings

- The histogram shows a strong right-skew, with most products concentrated at lower prices. 
- Only a small number of very expensive products can be seen extending the distribution. 
- This makes the initial histogram difficult to read, as extreme values stretch the x-axis. Using a log scale or limiting the price range (using a subset) could improve readability.

In [None]:
fig = px.box(
    df,
    y="price",
    title="Box Plot of Product Prices"
)
fig.show()


## Findings

- The box plot shows that most products are priced well below the mean (£89.24): the box is compressed near 0 on this scale, and the median line (£19.09) is barely distinguishable from the axis.
- Many outliers are visible above the box, the highest at £100,000.
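The outliers can also be counted rather than only viewed. Under the conventional 1.5 × IQR rule, and using the quartiles reported earlier (9.99 and 45.99), the upper fence is 45.99 + 1.5 × 36.00 = 99.99, so any price above that would be flagged. A minimal sketch, with `iqr_outliers` as a hypothetical helper name:

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Toy data for illustration only; with the notebook's data: iqr_outliers(df["price"])
toy_prices = pd.Series([1, 2, 3, 4, 5, 100])
outliers = iqr_outliers(toy_prices)
print(len(outliers), "outlier(s):", outliers.tolist())
```

The quartiles pandas computes may differ slightly from the ones the box plot draws, so the count is an approximation of what the chart shows.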

# Part 3: Unpacking Product Ratings

Measures of Centrality:

- Calculate the mean, median, and mode for the rating of products.
- How do customers generally rate products? Is there a common trend?

In [None]:
df["stars"].describe()

- Mean rating - 2.15
- Median - 0

In [None]:
df["stars"].mode()

## Findings

- The average rating of products by the customers is 2.15
- Median value is 0, which means at least 50% of the product ratings are 0
- This is confirmed by the mode value (most frequent rating), which is also 0
- The zeros most likely represent products without ratings (new or rarely purchased items), since the rating scale on Amazon normally runs from 1 to 5, with 5 being the highest
- So the trend is that many products have no rating. The mean of 2.15 is pulled down by these zeros and understates how rated products score; the 75th percentile of 4.40 suggests that products with ratings tend to receive positive scores.
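To separate "no rating" from genuine scores, the zeros can be excluded before summarising. This sketch assumes that a value of 0 in `stars` means the product has no rating, which the data itself does not confirm; `rating_summary` is a hypothetical helper name.

```python
import pandas as pd


def rating_summary(stars: pd.Series) -> pd.Series:
    """Summarise ratings, treating 0 as 'no rating' (an assumption)."""
    rated = stars[stars > 0]
    return pd.Series({
        "share_unrated": (stars == 0).mean(),  # fraction of products with a 0 rating
        "mean_all": stars.mean(),
        "mean_rated": rated.mean(),
        "median_rated": rated.median(),
    })


# Toy data for illustration only; with the notebook's data: rating_summary(df["stars"])
toy_stars = pd.Series([0, 0, 3, 4, 5])
print(rating_summary(toy_stars))
```

Comparing `mean_all` with `mean_rated` shows how much the unrated products pull the overall average down.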

Measures of Dispersion:

- Determine the variance, standard deviation, and interquartile range for product rating.
- Are the ratings consistent, or is there a wide variation in customer feedback?

In [None]:
df["stars"].var()

- Standard deviation --> 2.19
- Range (Max-min) --> 5 - 0 = 5
- Interquartile Range --> 75%-25% = 4.40 - 0 = 4.40
- Variance --> 4.81

## Findings

- An SD of 2.19, about as large as the mean (2.15), indicates a wide spread on a 0-5 scale: ratings are not tightly clustered around the mean
- The IQR shows that the middle 50% of ratings (between the 25th and 75th percentiles) ranges from 0 to 4.40, which is quite wide. Much of this variation likely reflects the split between unrated products (0) and rated ones.
- The relatively high variance of 4.81 confirms that ratings are spread out rather than concentrated around a single value

Visualizations:

- Plot a histogram to visualize the distribution of product ratings. Is there a specific rating that is more common?

In [None]:
fig = px.histogram(
    df,
    x="stars",
    nbins=6,
    title="Distribution of Product Ratings",
    labels={
        "stars": "Product Rating",
        "count": "Number of Products"
    }
)

fig.update_xaxes(title="Product Rating")
fig.update_yaxes(title="Number of Products")

fig.show()

## Findings

- A rating of 0 is the most common value, which indicates that many products have no ratings
- Among rated products, higher ratings (around 4-5 stars) are more frequent than lower ones (1-3 stars)