<a href="https://colab.research.google.com/github/Y-YHat/gen_ai_course/blob/main/1_Data_Exploration_using_Amazon_book_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem statement: Visualizing Amazon Reviews

This notebook explores Amazon product review data through a range of visualization techniques. The primary goal is to uncover insights by examining the relationships between variables such as star rating, review length, product category, and review date.

This exploration involves creating scatterplots, bar charts, box plots, line plots, pie charts, and word clouds to effectively analyze and interpret the data, thereby providing a comprehensive overview of consumer feedback on Amazon products.

The notebook contains one exercise in total:

* [Exercise 1](#ex_1)

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# Import necessary libraries
import pandas as pd

df = pd.read_csv('amazon-product-review-data.csv')

# Verify if the dataset is loaded correctly
print(df.head())

In [None]:
# Display basic information about the dataset
# df.info() prints its summary directly and returns None, so no print() is needed
df.info()

# Check for missing values
print(df.isnull().sum())

In [None]:
# Calculate summary statistics for numerical columns
summary_stats = df.describe()

# Print the summary statistics
print(summary_stats)

In [None]:
# Explore categorical variables
categorical_columns = ['market_place', 'product_category', 'sentiments']

for column in categorical_columns:
    category_counts = df[column].value_counts()
    print(f"Category counts for {column}:\n{category_counts}\n")

In [None]:
# Import necessary libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Create a count plot (one bar per star rating value)
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='star_rating')
plt.title('Distribution of Star Ratings')
plt.xlabel('Star Rating')
plt.ylabel('Count')
plt.show()

In [None]:
# Convert the 'review_date' column to a datetime format
df['review_date'] = pd.to_datetime(df['review_date'])

# Extract the year, month name, and weekday name from the 'review_date' column
df['review_year'] = df['review_date'].dt.year
df['review_month'] = df['review_date'].dt.strftime('%B')
df['review_day'] = df['review_date'].dt.strftime('%A')

# Plot the count of reviews over the years
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='review_year')
plt.title('Review Count Over the Years')
plt.xlabel('Year')
plt.ylabel('Count of Reviews')
plt.xticks(rotation=45)
plt.show()

In [None]:
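# Calculate the length of each review in characters
# (missing review bodies are treated as empty strings so that len() never sees NaN)
df['review_body_length'] = df['review_body'].fillna('').astype(str).str.len()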

# Calculate descriptive statistics for review length
review_length_stats = df['review_body_length'].describe()

# Print the review length statistics
print(review_length_stats)

# Create a histogram for review lengths
plt.figure(figsize=(8, 5))
sns.histplot(data=df, x='review_body_length', bins=30)
plt.title('Distribution of Review Lengths')
plt.xlabel('Review Length')
plt.ylabel('Count')
plt.show()

In [None]:
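# Calculate the correlation matrix for numeric columns only
# (text columns such as 'review_body' would raise an error in pandas 2.x)
correlation_matrix = df.corr(numeric_only=True)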

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title('Correlation Matrix')
plt.show()

<a name="ex_1"></a>
# Exercise 1

**Tasks:**

- Create a scatterplot to visualize the relationship between star ratings and review length
- Create a bar chart to visualize the average star rating for each product category
- Create a box plot to visualize the distribution of star ratings for each product category
- Create a line plot to visualize the trend of star ratings over time
- Create a pie chart to visualize the distribution of sentiments
- Create a word cloud to visualize the most common words in the reviews
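
As a starting point, the aggregation behind three of these tasks (bar chart, line plot, pie chart) might look roughly like the sketch below. It runs on a small made-up DataFrame named `toy`, which is a hypothetical stand-in and not the real dataset; only the column names `product_category`, `star_rating`, `review_date`, and `sentiments` are taken from the cells above. Grouping by year for the trend is one possible choice among several (month or quarter would work the same way).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for the real review DataFrame;
# the column names mirror those used earlier in the notebook.
toy = pd.DataFrame({
    "product_category": ["Books", "Books", "Toys", "Toys", "Toys"],
    "star_rating": [5, 3, 4, 2, 3],
    "review_date": pd.to_datetime(
        ["2014-01-05", "2014-02-10", "2014-01-20", "2015-03-01", "2015-03-15"]
    ),
    "sentiments": ["positive", "neutral", "positive", "negative", "neutral"],
})

# Bar chart input: mean star rating per product category
avg_by_category = toy.groupby("product_category")["star_rating"].mean()

# Line plot input: mean star rating per review year
avg_by_year = toy.groupby(toy["review_date"].dt.year)["star_rating"].mean()

# Pie chart input: number of reviews per sentiment label
sentiment_counts = toy["sentiments"].value_counts()

# Draw the three charts side by side using pandas' matplotlib wrappers
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
avg_by_category.plot(kind="bar", ax=axes[0], title="Average Star Rating by Category")
avg_by_year.plot(kind="line", marker="o", ax=axes[1], title="Average Star Rating by Year")
sentiment_counts.plot(kind="pie", autopct="%1.1f%%", ax=axes[2], title="Sentiment Share")
axes[2].set_ylabel("")  # drop the default series-name label on the pie chart
plt.tight_layout()
plt.show()
```

To apply the same idea to the exercise, replace `toy` with `df` after the `review_date` column has been converted with `pd.to_datetime`. The scatterplot, box plot, and word cloud tasks are left open.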

In [None]:
# Write your code here