# **Project Name**    -  Amazon Prime EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name-**    Velprakash.S

# **Project Summary -**

Amazon Prime Video is one of the leading streaming platforms, offering a vast and diverse library of movies, TV shows, and exclusive original content. As the streaming industry continues to expand, competition among platforms has intensified, making it essential to understand content trends, audience preferences, and regional availability. With millions of subscribers worldwide, Amazon Prime Video caters to a global audience with varied tastes, making data-driven insights crucial for both consumers and businesses in the entertainment sector.

This project aims to analyze Amazon Prime Video’s content library using a structured dataset containing information such as title, genre, release year, IMDb ratings, and regional availability. By examining these attributes, we can uncover valuable trends related to content diversity, geographical distribution, and audience engagement.

One key aspect of this analysis is understanding genre preferences. Different regions and audiences have unique tastes in content, and identifying popular genres can help content creators, distributors, and streaming services refine their strategies. Additionally, tracking release patterns and IMDb ratings can provide insight into the types of content that perform well and resonate with audiences.

Furthermore, regional availability plays a crucial role in content consumption, as licensing agreements and distribution rights vary across different markets. By analyzing which regions have access to the most extensive content libraries, we can gain insights into Amazon Prime Video’s content acquisition strategies and potential opportunities for market expansion.





# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:
1. Content Diversity: What genres and categories dominate the platform?
2. Regional Availability: How does content distribution vary across different regions?
3. Trends Over Time: How has Amazon Prime's content library evolved?
4. IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?

By analyzing this dataset, businesses, content creators, and data analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.


#### **Define Your Business Objective?**


The primary objectives of this project are:

1. **Content Diversity** – Identifying the most common genres and categories on Amazon Prime Video to understand content preferences.
2. **Regional Availability** – Examining how content is distributed across different regions to determine market-specific strategies.
3. **Trends Over Time** – Analyzing the evolution of the content library, such as the number of shows added per year and shifts in genre popularity.
4. **IMDb Ratings & Popularity** – Identifying the highest-rated and most popular content on the platform to understand audience preferences.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
titles= pd.read_csv('titles.csv')
credits= pd.read_csv('credits.csv')
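Since the guidelines above ask for exception handling, the loading step could be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `load_csv` is a hypothetical name, and it assumes the CSV files sit in the notebook's working directory.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV into a DataFrame, failing early with a readable message.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Raise instead of returning None so later cells never run on missing data.
        raise FileNotFoundError(f"Expected dataset file not found: {csv_path.resolve()}")
    df = pd.read_csv(csv_path)
    if df.empty:
        raise ValueError(f"{csv_path.name} was read but contains no rows")
    return df
```

Usage would look like `titles = load_csv('titles.csv')`, so a missing file stops the notebook at the loading cell rather than several cells later.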

### Dataset First View

In [None]:
# Dataset First Look
titles.head()

In [None]:
credits.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Credits file :",credits.shape)
print("Titles file :",titles.shape)

### Dataset Information

In [None]:
# Dataset Info
credits.info()

In [None]:
titles.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count for credits
credits.duplicated().sum()

In [None]:
# Dataset Duplicate Value Count for titles
titles.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
credits.isnull().sum()

In [None]:
titles.isnull().sum()

In [None]:
# Visualizing the missing values
# Plot missing values matrix for credits dataset
plt.figure(figsize=(12,6))
sns.heatmap(credits.isnull(), cmap="viridis", cbar=False, yticklabels=False)
plt.title("Missing Values Heatmap - Credits Dataset")
plt.show()

In [None]:
# Plot missing values matrix for titles dataset
plt.figure(figsize=(12,6))
sns.heatmap(titles.isnull(), cmap="magma", cbar=False, yticklabels=False)
plt.title("Missing Values Heatmap - Titles Dataset")
plt.show()

### What did you know about your dataset?

After analyzing the dataset, here are the key observations:

#### 1. Dataset Overview
The dataset consists of two main files:
* **credits.csv** → Contains information about people involved in shows (actors, directors, etc.).
* **titles.csv** → Contains details about movies and TV shows available on Amazon Prime Video.

#### 2. Missing Values Analysis
The credits dataset has missing values in the character column (about 13% missing).
The titles dataset has missing values in multiple columns, including:
* description (1.2% missing)
* age_certification (65% missing)
* seasons (86% missing)
* imdb_score and imdb_votes (10% missing)

#### 3. Duplicate Values
Duplicate rows found in either dataset are removed during data wrangling to ensure data integrity.

#### 4. Column Insights
The titles.csv dataset includes genre, release year, IMDb ratings, and production countries (used here as a proxy for regional availability), which are useful for content trend analysis.
The credits.csv dataset contains cast and crew details, allowing analysis of popular actors and directors.

#### 5. Potential Insights from Analysis
* Content Diversity → Identifying the most popular genres.
* Regional Availability → Understanding how content varies across different countries.
* Trends Over Time → Tracking changes in content library growth.
* IMDb Ratings & Popularity → Finding top-rated and most popular content.
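The missing-value percentages quoted above can be derived from the null mask rather than read off the raw counts. The sketch below uses a made-up toy frame, and `missing_percent` is a hypothetical helper name; applied to `titles` or `credits` it would return one percentage per column.

```python
import numpy as np
import pandas as pd


def missing_percent(df):
    """Return the percentage of null values per column, largest first."""
    return (df.isnull().mean() * 100).round(1).sort_values(ascending=False)


# Toy frame for illustration only; the values are invented.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "seasons": [np.nan, np.nan, np.nan, 2.0],
    "imdb_score": [7.1, np.nan, 6.3, 8.0],
})

print(missing_percent(toy))
```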

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
credits.columns

In [None]:
titles.columns

In [None]:
# Dataset Describe
credits.describe()

In [None]:
titles.describe()

### Variables Description

#### 1. credits.csv (Cast & Crew Information)
This dataset provides details about the people (actors, directors, and crew members) involved in various shows and movies.

person_id (int64): A unique identifier for each person.

id (object): The unique identifier of the movie or TV show the person is associated with.

name (object): The name of the person (actor, director, writer, etc.).

character (object): The name of the character played by the person (only for actors). This can be NULL for non-acting roles.

role (object): The role of the person in the production (e.g., "actor", "director", "writer").


#### 2. titles.csv (Movie & TV Show Information)
This dataset contains information about the content available on Amazon Prime Video, including metadata such as genres, ratings, and popularity.

id (object): A unique identifier for each movie or TV show. It links to the credits.csv dataset.

title (object): The name of the movie or TV show.

type (object): The type of content, which can be either "movie" or "show".

description (object): A brief summary of the movie or TV show. Some entries may have missing values.

release_year (int64): The year in which the movie or TV show was released.

age_certification (object): The age rating of the movie/show (e.g., PG, R, 18+). Many values are missing.

runtime (int64): The duration of the movie in minutes. For TV shows, this may represent the average episode length.

genres (object): The genres associated with the movie/show, such as "Drama", "Action", or "Comedy".

production_countries (object): The country or countries where the movie/show was produced.

seasons (float64): The number of seasons for TV shows. This column is mostly NULL for movies.

imdb_id (object): The unique IMDb identifier for the movie/show.

imdb_score (float64): The IMDb rating of the movie/show (on a scale from 1 to 10). Some values are missing.

imdb_votes (float64): The number of user votes on IMDb. Some values are missing.

tmdb_popularity (float64): The popularity score from TMDb (The Movie Database).

tmdb_score (float64): The TMDb rating of the movie/show. Many values are missing.



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for credits table.
for column in credits.columns:
    print(f"Unique values in {column}: {credits[column].nunique()}")

In [None]:
# Check Unique Values for titles table.
for column in titles.columns:
    print(f"Unique values in {column}: {titles[column].nunique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
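# Handling missing or null values
# A blanket dropna() on titles would delete every movie (`seasons` is null for movies)
# and most remaining rows (`age_certification` is roughly 65% missing).
# Drop only the rows that lack the fields the rating analysis needs.
credits_df=credits.dropna()
titles_df=titles.dropna(subset=['imdb_score', 'imdb_votes'])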


In [None]:
credits_df.isnull().sum()

In [None]:
titles_df.isnull().sum()
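How rows are dropped matters here because `seasons` is populated only for shows. The toy example below (invented values, lowercase `type` labels as described in the variable list) sketches the difference between a blanket `dropna()` and one restricted with `subset`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration; values are invented.
toy_titles = pd.DataFrame({
    "id": ["tm1", "tm2", "ts1", "ts2"],
    "type": ["movie", "movie", "show", "show"],
    "seasons": [np.nan, np.nan, 3.0, 1.0],
    "imdb_score": [7.2, np.nan, 8.1, 6.4],
})

# Blanket dropna: a null in any column removes the row, so both movies vanish.
blanket = toy_titles.dropna()

# Targeted dropna: only require the column the rating analysis depends on.
targeted = toy_titles.dropna(subset=["imdb_score"])

print(blanket["type"].value_counts().to_dict())
print(targeted["type"].value_counts().to_dict())
```

Comparing `type` counts before and after cleaning is a quick check that a cleaning step has not silently removed a whole category.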

In [None]:
# handling duplicate value
credits=credits_df.drop_duplicates()
titles=titles_df.drop_duplicates()


In [None]:
print(credits.duplicated().sum())
print(titles.duplicated().sum())

In [None]:
# merging the dataset

merged= titles.merge(credits, on="id", how="left")

merged.sample(5)


In [None]:
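#drop unnecessary columns

merged.drop(columns=['person_id', 'description','name','character','role'],inplace=True)

# The merge created one row per credit; with the credit columns gone,
# collapse back to one row per title so counts are not inflated by cast size
merged.drop_duplicates(inplace=True)

merged.columns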

In [None]:
merged.info()
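A left join of titles onto credits yields one row per (title, person) pair, so any per-title count taken on the merged frame is multiplied by the size of the cast. The toy sketch below, with invented data, shows the fan-out and one way to return to title level.

```python
import pandas as pd

# Toy data for illustration; values are invented.
toy_titles = pd.DataFrame({
    "id": ["t1", "t2"],
    "title": ["alpha", "beta"],
    "imdb_score": [7.0, 6.0],
})
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2"],
    "person_id": [1, 2, 3, 4],
    "name": ["p1", "p2", "p3", "p4"],
})

# One row per (title, person) pair: title t1 now appears three times.
toy_merged = toy_titles.merge(toy_credits, on="id", how="left")
print(len(toy_merged))

# Dropping the person-level columns and de-duplicating restores one row per title.
title_level = toy_merged.drop(columns=["person_id", "name"]).drop_duplicates()
print(len(title_level))
```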

In [None]:
#convert into lowercase to avoid inconsistencies

merged['title']=merged['title'].str.strip().str.lower()
merged['genres'] = merged['genres'].str.strip().str.lower()
merged['production_countries'] = merged['production_countries'].str.strip().str.lower()
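If `genres` and `production_countries` are stored as stringified Python lists such as `"['drama', 'comedy']"` (an assumption; the raw cell format is not shown in this notebook), splitting on commas leaves brackets and quotes attached to the values. A hedged sketch of a more robust parser follows; `parse_list_cell` is a hypothetical helper name.

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn "['drama', 'comedy']" into ['drama', 'comedy'].

    Falls back to plain comma splitting when the cell is not a list literal.
    Hypothetical helper for illustration.
    """
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
        if isinstance(parsed, (list, tuple)):
            return [str(item).strip().lower() for item in parsed]
    except (ValueError, SyntaxError, TypeError):
        pass
    return [part.strip().lower() for part in value.split(",") if part.strip()]


# Toy series for illustration; values are invented.
toy_genres = pd.Series(["['Drama', 'Comedy']", "['drama']", "action, thriller", "[]"])
genre_counts = toy_genres.apply(parse_list_cell).explode().dropna().value_counts()
print(genre_counts)
```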



### What all manipulations have you done and insights you found?

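The following manipulations were applied: missing values were handled with `dropna`, duplicate rows were removed with `drop_duplicates`, the titles and credits tables were merged on `id` with a left join, the credit-level columns and `description` were dropped, and the `title`, `genres`, and `production_countries` columns were trimmed and lowercased.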

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# 1. Genre Distribution (content diversity)

In [None]:

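# Count occurrences of each genre
# Strip whitespace plus any list brackets/quotes left over if genres are stored like "['drama', 'comedy']"
genres_clean = merged['genres'].str.split(',').explode().str.strip(" []'\"")
genre_counts = genres_clean[genres_clean != ''].value_counts().head(10)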

plt.figure(figsize=(10, 6))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette="viridis")
plt.xlabel("Number of Titles")
plt.ylabel("Genres")
plt.title("Top 10 Most Common Genres on Amazon Prime")
plt.show()



##### 1. Why did you pick the specific chart?

A bar chart clearly compares categorical counts. It is ideal for displaying the frequency of different genres, making it easy to see which genres dominate.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals which genres appear most frequently. For example, it might show that “drama,” “comedy,” or “action” are among the top genres.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. By knowing which genres are most popular, content creators and platform strategists can focus on investing in or promoting these genres to drive subscriptions and viewer engagement.

# 2. Content Release Trend Over Years – Line Chart

In [None]:
# Count titles per release year and sort by year
yearly_trend = merged['release_year'].value_counts().sort_index()

plt.figure(figsize=(12, 6))
sns.lineplot(x=yearly_trend.index, y=yearly_trend.values, marker="o", linewidth=2)
plt.xlabel("Year")
plt.ylabel("Number of Titles")
plt.title("Trend of Content Releases Over the Years")
plt.show()



##### 1. Why did you pick the specific chart?

A line chart is ideal for showing trends over time, making it easy to see growth, stability, or decline in the number of titles by release year.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how the catalogue is distributed by release year. Because `release_year` is the original release date rather than the date a title joined Prime Video, it reflects the age profile of the library and may reveal periods of rapid growth, stagnation, or decline in releases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding release trends assists in planning future content investments and marketing strategies by identifying successful periods and adjusting production schedules accordingly.

# 3. Regional Availability – Pie Chart

In [None]:
# Count the top 5 production countries
# Strip whitespace plus any list brackets/quotes left over from list-like strings
country_counts = merged['production_countries'].str.split(',').explode().str.strip(" []'\"").value_counts().head(5)

plt.figure(figsize=(8, 8))
plt.pie(country_counts, labels=country_counts.index, autopct='%1.1f%%', colors=sns.color_palette("pastel"))
plt.title("Top 5 Production Countries")
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart effectively shows parts of a whole, making it easy to visualize the percentage share of content produced in different regions.

##### 2. What is/are the insight(s) found from the chart?

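The chart highlights which production countries contribute the most content to the platform. For example, it may indicate that a few countries dominate production. Note that the percentages are shares among the top five countries only, not of the whole catalogue.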

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights guide regional marketing strategies and help in tailoring content licensing or production decisions to focus on strong markets.
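A pie chart limited to the top countries reports each slice as a share of those countries alone. One way to express shares of the full catalogue is to fold the remaining countries into an "other" bucket, as in this sketch with invented counts (the cut-off of three is arbitrary).

```python
import pandas as pd

# Toy data for illustration: one country code per title, values invented.
country_per_title = pd.Series(
    ["us"] * 5 + ["in"] * 3 + ["gb"] * 2 + ["ca", "jp"]
)

counts = country_per_title.value_counts()
top = counts.head(3)
other_total = counts.iloc[3:].sum()

# Append the remainder so the percentages are relative to every title.
shares = pd.concat([top, pd.Series({"other": other_total})])
shares_pct = (shares / shares.sum() * 100).round(1)
print(shares_pct)
```

The resulting series can be passed to `plt.pie` in place of the raw top-N counts.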

# 4. IMDb Score Distribution – Histogram

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(merged['imdb_score'].dropna(), bins=20, kde=True, color="blue")
plt.xlabel("IMDb Score")
plt.ylabel("Frequency")
plt.title("Distribution of IMDb Ratings")
plt.show()


##### 1. Why did you pick the specific chart?

Histograms are excellent for showing the distribution of continuous data. Here, it displays how IMDb ratings are spread across the content.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the overall rating distribution—whether it is skewed, has multiple peaks, or is normally distributed. It helps in understanding how many titles fall into high or low rating categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Insights into rating distributions help assess overall content quality and may guide quality improvement initiatives, content curation, and marketing strategies.
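To put numbers on how many titles fall into high or low rating categories, the scores can be binned with `pd.cut`. The band edges below (5 and 7) are arbitrary assumptions chosen for illustration, and the scores are invented.

```python
import pandas as pd

# Toy scores for illustration; values are invented, None stands in for a missing rating.
scores = pd.Series([3.2, 5.5, 6.1, 7.4, 8.8, None], name="imdb_score")

# Right-inclusive bins: (0, 5], (5, 7], (7, 10]. The cut points are an assumption.
bands = pd.cut(scores, bins=[0, 5, 7, 10], labels=["low", "mid", "high"])

# Missing scores are left out of the counts.
band_counts = bands.value_counts().sort_index()
print(band_counts)
```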

# 5. IMDb Score vs. Popularity – Scatter Plot

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged, x="imdb_score", y="tmdb_popularity", alpha=0.5)
plt.xlabel("IMDb Score")
plt.ylabel("TMDb Popularity")
plt.title("Relationship Between IMDb Score and Popularity")
plt.show()


##### 1. Why did you pick the specific chart?

Scatter plots are ideal for visualizing relationships between two continuous variables. They help to determine if there’s any correlation between ratings and popularity.

##### 2. What is/are the insight(s) found from the chart?

The chart may show a positive correlation (or lack thereof) between IMDb scores and popularity. It helps identify outliers and trends, such as whether higher-rated titles tend to be more popular.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Recognizing the relationship between ratings and popularity aids in identifying which content is performing exceptionally well, thus guiding investment and promotional efforts.
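The visual impression from a scatter plot can be backed by a correlation coefficient. The sketch below uses invented values and computes Pearson correlation directly and Spearman correlation as the Pearson correlation of ranks, which is less sensitive to the extreme popularity values that such data often contains.

```python
import pandas as pd

# Toy data for illustration; values are invented.
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_popularity": [1.0, 2.0, 4.0, 8.0, 100.0],
})

# Keep only rows where both values are present.
pair = toy[["imdb_score", "tmdb_popularity"]].dropna()

# Pearson measures linear association and is pulled around by outliers.
pearson = pair["imdb_score"].corr(pair["tmdb_popularity"])

# Spearman is the Pearson correlation of the ranks, so it measures monotonic association.
spearman = pair["imdb_score"].rank().corr(pair["tmdb_popularity"].rank())

print(round(pearson, 3), round(spearman, 3))
```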

# 6. Runtime Distribution – Histogram

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(merged['runtime'].dropna(), bins=30, kde=True, color="purple")
plt.xlabel("Runtime (minutes)")
plt.ylabel("Frequency")
plt.title("Distribution of Content Runtime")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is ideal for showing the distribution of numerical values, in this case, runtime. It helps to visualize common length patterns.


##### 2. What is/are the insight(s) found from the chart?

The chart may reveal that most movies are around 90–120 minutes, while TV episodes are much shorter. It might also highlight extreme values, such as very short films or extremely long movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding typical runtime helps in content recommendations, production strategies, and marketing campaigns targeting audience preferences.
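Because movie runtimes and episode runtimes sit on different scales, a per-type summary helps when reading a combined histogram. This is a sketch on invented data, assuming `type` takes the values "movie" and "show" as described in the variable list.

```python
import pandas as pd

# Toy data for illustration; values are invented.
toy = pd.DataFrame({
    "type": ["movie", "movie", "movie", "show", "show"],
    "runtime": [90, 100, 120, 22, 45],
})

# Summarise runtime separately for movies and shows.
runtime_summary = toy.groupby("type")["runtime"].agg(["count", "median", "min", "max"])
print(runtime_summary)
```

The same grouping could be passed to `sns.histplot` through its `hue` parameter to draw the two distributions separately.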

# 7. IMDb Score by Genre – Box Plot

In [None]:
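# Explode to one row per (title, genre) so each box represents a single genre,
# then keep the top 5 genres for better readability in the box plot
genre_scores = merged[['genres', 'imdb_score']].copy()
genre_scores['genres'] = genre_scores['genres'].str.split(',')
genre_scores = genre_scores.explode('genres', ignore_index=True)
genre_scores['genres'] = genre_scores['genres'].str.strip(" []'\"")
genre_scores = genre_scores[genre_scores['genres'] != '']
top_genres = genre_scores['genres'].value_counts().index[:5]
filtered_df = genre_scores[genre_scores['genres'].isin(top_genres)]

plt.figure(figsize=(12, 6))
sns.boxplot(data=filtered_df, x="genres", y="imdb_score", palette="Set3")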
plt.xticks(rotation=45)
plt.xlabel("Genres")
plt.ylabel("IMDb Score")
plt.title("Distribution of IMDb Ratings Across Top Genres")
plt.show()


##### 1. Why did you pick the specific chart?

Box plots are excellent for comparing distributions across categories. They succinctly show the median, quartiles, and potential outliers.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals how IMDb scores vary within different genres. You can identify which genres have a higher median score or wider variation in ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights can influence content acquisition and production by highlighting genres that consistently perform well, thereby directing focus and resources toward quality content.

# 8. Number of Titles Per Age Rating – Bar Chart

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(data=merged, x="age_certification", order=merged['age_certification'].value_counts().index, palette="coolwarm")
plt.xlabel("Age Rating")
plt.ylabel("Number of Titles")
plt.title("Content Distribution by Age Rating")
plt.show()



##### 1. Why did you pick the specific chart?

A bar chart is the best way to compare different categories. Here, it shows how many movies and shows fall into each age rating category (e.g., PG, R, 18+, etc.).

##### 2. What is/are the insight(s) found from the chart?

This chart helps answer questions like:

Does Amazon Prime cater more to family-friendly content (G, PG, PG-13) or mature audiences (R, 18+)?

Are certain age groups underserved?

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. If there’s a gap in certain audience segments, Amazon Prime can invest in content acquisition or original productions to fill that void and improve subscriber retention.

# 9. Correlation Heatmap

In [None]:

# add numeric_only=True to tell corr() to ignore string values.
plt.figure(figsize=(12, 6))
sns.heatmap(merged.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap of Numerical Features")
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is the best way to visualize correlations between numerical variables, revealing positive or negative relationships.


##### 2. What is/are the insight(s) found from the chart?


*   It may show whether IMDb rating correlates with runtime, that is, whether longer titles tend to be rated higher or lower.

*   It could reveal whether release year impacts popularity or IMDb scores.






# 10. Pair Plot

In [None]:
# No hue column is set, so a palette argument would have no effect and is omitted
sns.pairplot(merged[['imdb_score', 'runtime', 'release_year', 'tmdb_popularity']])
plt.show()



##### 1. Why did you pick the specific chart?

A pairplot allows us to visualize pairwise relationships and distributions between multiple variables simultaneously.


##### 2. What is/are the insight(s) found from the chart?

We can see if higher IMDb ratings correspond to longer runtimes.

It can also reveal trends between release year and IMDb scores/popularity.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis and insights gained from the dataset, I suggest the following strategies for Amazon Prime Video to achieve its business objectives effectively:

**1. Enhancing Content Diversity**

Invest in underrepresented genres and regional content to attract diverse audiences.

Analyze genre popularity trends for strategic content acquisition.

**2. Optimizing Regional Availability**

Expand licensing agreements to improve content accessibility across regions.

Localize content and acquire region-specific movies and shows.

**3. Leveraging Trends Over Time**

Maintain steady content acquisition and monitor high-performing release years.

Track competitor trends to refine content strategy.

**4. Boosting IMDb Ratings & Popularity**

Promote top-rated content through AI-driven recommendations and marketing.

Invest in high-quality originals and collaborate with influencers for better visibility.

# **Conclusion**

This project provided valuable insights into Amazon Prime Video’s content library by analyzing genre diversity, regional availability, trends over time, and IMDb ratings. Through data cleaning, visualization, and exploratory analysis, we identified key patterns that can help optimize content strategy, enhance user engagement, and drive business growth. Implementing data-driven decisions, such as acquiring trending genres, expanding regional availability, and leveraging top-rated content, will strengthen Amazon Prime Video’s market position and improve subscriber retention.