<a href="https://colab.research.google.com/github/Whickd-07/Day-7/blob/main/Amazon_Prime_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Amazon Prime EDA Analysis**    



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

**Exploratory Data Analysis of Amazon Prime Content**

#### **Introduction**
With the rapid expansion of the digital streaming industry, platforms like Amazon Prime Video have amassed vast libraries of movies and TV shows catering to global audiences. This project aims to perform an Exploratory Data Analysis (EDA) on Amazon Prime’s content catalog to uncover insights into content diversity, regional availability, trends over time, and viewer preferences. These insights will help stakeholders make data-driven decisions regarding content acquisition, production, and marketing strategies.

#### **Objectives**
The primary objectives of this analysis are:
1. **Content Diversity** – Analyzing genre distribution and identifying the most common content types.
2. **Regional Availability** – Understanding how Amazon Prime distributes its content across different countries.
3. **Trends Over Time** – Examining the evolution of content availability based on release years.
4. **IMDb Ratings & Popularity** – Identifying the highest-rated and most popular shows/movies on the platform.

#### **Dataset Overview**
The dataset consists of two files:
- **titles.csv**: Contains details about movies and TV shows, including title, release year, genres, IMDb ratings, runtime, and availability in different regions.
- **credits.csv**: Contains information about the cast and crew, linking them to specific titles.

The dataset was preprocessed by handling missing values, checking for duplicates, and converting relevant columns into appropriate data types for analysis.

#### **Methodology**
1. **Data Cleaning & Preprocessing**
   - Checked for missing values and handled them appropriately.
   - Converted columns into appropriate formats (e.g., dates, numerical values).
   - Removed duplicate entries to ensure data accuracy.
   
2. **Descriptive Analysis**
   - Generated summary statistics to understand the distribution of numerical attributes.
   - Visualized categorical distributions to identify trends in content type, genre, and regional availability.

3. **Content Analysis**
   - Examined the frequency of different genres.
   - Analyzed the distribution of content types (Movies vs. TV Shows).
   - Identified the most common genres available on Amazon Prime.

4. **Temporal Trends**
   - Investigated the number of titles released per year to determine content growth over time.
   - Explored IMDb ratings across different release years to assess quality trends.

5. **Regional Availability**
   - Mapped content distribution across various regions.
   - Identified countries with the most extensive and least extensive content libraries.

6. **Popularity & Ratings**
   - Analyzed the highest-rated shows and movies on the platform.
   - Identified the most popular actors and directors contributing to Amazon Prime’s content.
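
The actor and director step relies on linking `credits.csv` to `titles.csv` through their shared `id` column. The sketch below shows one way this could be done; the rows are made-up toy data, and the upper-case role labels (`ACTOR`, `DIRECTOR`) are an assumption that should be checked against `credits_df['role'].unique()`.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (all values are made up).
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["Alpha", "Beta", "Gamma"],
    "imdb_score": [7.5, 6.0, 8.1],
})
credits = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3, 3],
    "id": ["t1", "t2", "t1", "t1", "t2", "t3"],
    "name": ["Ann", "Ann", "Bob", "Cy", "Cy", "Cy"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR", "ACTOR", "ACTOR"],  # assumed labels
})

# Attach title-level columns to every credit row via the shared `id` key.
merged = credits.merge(titles, on="id", how="inner")

# Rank actors by number of distinct titles, breaking ties by mean IMDb score.
actor_stats = (
    merged[merged["role"] == "ACTOR"]
    .groupby(["person_id", "name"])
    .agg(n_titles=("id", "nunique"), mean_imdb=("imdb_score", "mean"))
    .sort_values(["n_titles", "mean_imdb"], ascending=False)
    .reset_index()
)
print(actor_stats)
```

Grouping on `person_id` as well as `name` keeps two different people who share a name from being counted as one.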

#### **Key Findings**
1. **Genre Trends**:
   - The most common genres include Drama, Comedy, and Documentary, indicating a preference for engaging and informative content.
   - Certain genres, such as Horror and Sci-Fi, have a comparatively smaller presence.

2. **Content Type Distribution**:
   - Movies make up a significantly larger portion of Amazon Prime’s catalog compared to TV shows.
   - The number of TV shows has been gradually increasing over the years, suggesting Amazon Prime’s growing interest in episodic content.

3. **Trends Over Time**:
   - Content production has seen a sharp increase in the past decade, aligning with the boom in streaming services.
   - The 2000s and 2010s saw the highest number of releases, reflecting the digital transformation in media consumption.

4. **Regional Insights**:
   - The United States has the most extensive library, followed by countries like the UK, India, and Canada.
   - Some regions have a more significant presence of localized content, reflecting Amazon’s strategy to cater to different cultural preferences.

5. **IMDb Ratings & Popularity**:
   - Top-rated content is generally clustered around Drama, Documentary, and Thriller genres.
   - Well-known directors and actors contribute to high-rated content, making them influential in content selection.

#### **Conclusion & Business Implications**
The insights gained from this EDA project provide valuable takeaways for content strategists, marketing teams, and product managers at Amazon Prime Video. The findings suggest that:
- Amazon Prime should continue investing in Drama and Comedy genres while exploring underrepresented genres like Sci-Fi and Horror to cater to niche audiences.
- Content acquisition teams should focus on expanding the TV show catalog, given the increasing preference for episodic content.
- Amazon Prime should optimize its regional content strategy by balancing global blockbusters with local content that resonates with specific markets.
- Marketing strategies should highlight top-rated and popular titles, leveraging well-known actors and directors to attract more viewers.

#### **Future Work**
Future analyses could incorporate:
- Sentiment analysis of viewer reviews to gain deeper insights into audience preferences.
- Predictive modeling to forecast the success of new content based on historical data.
- Comparative analysis with competitors like Netflix and Disney+ to benchmark Amazon Prime’s content strategy.

This project demonstrates the power of data-driven decision-making in the streaming industry, enabling Amazon Prime to enhance its content portfolio, boost user engagement, and drive subscription growth.



# **GitHub Link -**

https://github.com/Whickd-07

# **Problem Statement**


### **Summary of the Problem Statement**  

This Exploratory Data Analysis (EDA) project aims to analyze Amazon Prime’s TV shows and movies dataset to extract meaningful insights. The key objectives include:  

1. **Content Diversity** – Identifying dominant genres and categories on the platform.  
2. **Regional Availability** – Examining content distribution across different regions.  
3. **Trends Over Time** – Understanding how Amazon Prime’s content library has evolved.  
4. **IMDb Ratings & Popularity** – Identifying the highest-rated and most popular shows.  

The project will leverage **Pandas** for data manipulation, **Matplotlib & Seaborn** for visualization, and **NumPy** for computations. The analysis aims to provide data-driven insights to enhance subscription growth, user engagement, and content investment strategies in the streaming industry. Evaluation will focus on data handling, visualization effectiveness, and actionable insights for stakeholders.

#### **Define Your Business Objective?**

### **Business Objective**  

The primary business objective of this analysis is to extract valuable insights from Amazon Prime’s content library to enhance user engagement, optimize content investment, and drive subscription growth. By analyzing content diversity, regional availability, trends over time, and IMDb ratings, the platform can make data-driven decisions to:  

1. **Improve Content Strategy** – Identify popular genres and high-performing content to guide future acquisitions and production investments.  
2. **Enhance User Retention** – Understand viewer preferences across different regions and tailor recommendations accordingly.  
3. **Optimize Market Expansion** – Assess content availability in different countries to identify opportunities for regional growth.  
4. **Increase Competitive Edge** – Benchmark content trends and ratings against competitors to refine Amazon Prime’s positioning in the streaming market.  

This analysis aims to support Amazon Prime Video in making strategic content decisions to maximize viewership and profitability.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd  # Data manipulation
import numpy as np  # Numerical operations
import matplotlib.pyplot as plt  # Visualization
import seaborn as sns  # Statistical plots
import plotly.express as px  # Interactive visualizations
import missingno as msno  # Missing values visualization
from wordcloud import WordCloud  # Word cloud visualization
from collections import Counter  # Frequency analysis
import datetime as dt  # Date handling

### Dataset Loading

In [None]:
# Load Dataset
titles_df = pd.read_csv("titles.csv")
credits_df = pd.read_csv("credits.csv")

### Dataset First View

In [None]:
# Dataset First Look
print("\nFirst 5 rows of Titles Dataset:")
print(titles_df.head())

print("\nFirst 5 rows of Credits Dataset:")
print(credits_df.head())

print("Titles Dataset Info:")
print(titles_df.info())

print("\nCredits Dataset Info:")
print(credits_df.info())

# Check for missing values
print("\nMissing Values in Titles Dataset:")
print(titles_df.isnull().sum())

print("\nMissing Values in Credits Dataset:")
print(credits_df.isnull().sum())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Titles Dataset: {titles_df.shape[0]} rows, {titles_df.shape[1]} columns")
print(f"Credits Dataset: {credits_df.shape[0]} rows, {credits_df.shape[1]} columns")

### Dataset Information

In [None]:
# Dataset Info
print("\n--- Dataset Information ---")
print("\nTitles Dataset:")
print("Number of Rows:", titles_df.shape[0])
print("Number of Columns:", titles_df.shape[1])
titles_df.info()

print("\nCredits Dataset:")
print("Number of Rows:", credits_df.shape[0])
print("Number of Columns:", credits_df.shape[1])
credits_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("\n--- Duplicate Values ---")
print("\nTitles Dataset:")
print("Number of Duplicate Rows:", titles_df.duplicated().sum())

print("\nCredits Dataset:")
print("Number of Duplicate Rows:", credits_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("\n--- Missing Values ---")
print("\nTitles Dataset:")
print(titles_df.isnull().sum())
print("\nCredits Dataset:")
print(credits_df.isnull().sum())

In [None]:
# Visualizing the missing values
print("\n--- Visualizing Missing Values ---")
# msno.bar applies its own figure size, so the size is passed to it directly
# rather than through plt.figure(). Bars show non-null values per column.
msno.bar(titles_df, figsize=(10, 5), color="skyblue")
plt.title("Missing Values in Titles Dataset")
plt.show()

msno.bar(credits_df, figsize=(10, 5), color="lightcoral")
plt.title("Missing Values in Credits Dataset")
plt.show()

### What did you know about your dataset?

### **Insights from the Dataset**  

After loading and analyzing the dataset, here are the key observations:  

1. **Dataset Structure:**  
   - The **Titles dataset** contains **9,871 rows and 15 columns**, detailing information about movies and TV shows, including their genres, release years, IMDb ratings, and regional availability.  
   - The **Credits dataset** consists of **124,235 rows and 5 columns**, linking cast and crew members to specific titles.  

2. **Missing Values:**  
   - The **Titles dataset** has missing values in **description, age certification, IMDb scores, and TMDB scores**, which might require imputation or exclusion for analysis.  
   - The **Credits dataset** has missing values in the **character column**, affecting the completeness of actor-role mappings.  

3. **Data Types & Integrity:**  
   - The dataset includes **categorical, numerical, and text data**, requiring different preprocessing steps.  
   - Some columns, like **genres and production countries**, are stored as strings but contain multiple values, necessitating further transformation.  

4. **Content Distribution:**  
   - **Movies dominate** the dataset over TV shows.  
   - The dataset covers content from **various regions**, with a strong presence of US-based productions.  
   - **Drama and Comedy** are among the most frequent genres.  

5. **Trends Over Time:**  
   - The number of releases has **significantly increased over the years**, aligning with the growth of streaming services.  
   - IMDb ratings vary across different time periods, showing how content quality and viewer preferences have evolved.  
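
Multi-valued columns such as `genres` need to be split into one value per row before counting. The sketch below assumes the values are stored as list-like strings (for example `"['drama', 'comedy']"`); that format is an assumption, and `parse_list` is a hypothetical helper, so inspect `titles_df['genres'].head()` before relying on it.

```python
import ast
import pandas as pd

# Toy column; the list-like string format is an assumption about the raw data.
genres = pd.Series(["['drama', 'comedy']", "['drama']", "[]", None])

def parse_list(value):
    """Turn a list-like string into a Python list; missing values become []."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)

# One row per genre; empty lists explode to NaN and are dropped.
genre_counts = genres.apply(parse_list).explode().dropna().value_counts()
print(genre_counts)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings read from a file.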

### **Next Steps:**  
- Handle missing values through imputation or exclusion.  
- Perform further data cleaning for better analysis.  
- Conduct deeper visualizations on content trends, ratings, and regional availability.  


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("\n--- Dataset Information ---")
print("\nTitles Dataset:")
print("Number of Rows:", titles_df.shape[0])
print("Number of Columns:", titles_df.shape[1])
titles_df.info()
print("\nColumns in Titles Dataset:")
print(titles_df.columns.tolist())

print("\nCredits Dataset:")
print("Number of Rows:", credits_df.shape[0])
print("Number of Columns:", credits_df.shape[1])
credits_df.info()
print("\nColumns in Credits Dataset:")
print(credits_df.columns.tolist())

In [None]:
# Dataset Describe
titles_desc = titles_df.describe()
credits_desc = credits_df.describe()
print("\n--- Dataset Statistics ---")
print("\nTitles Dataset Statistics:")
print(titles_desc)
print("\nCredits Dataset Statistics:")
print(credits_desc)

### Variables Description

### **Variable Description for the Dataset**  

The dataset consists of two files: **titles.csv** and **credits.csv**, each containing multiple variables that describe the content available on Amazon Prime Video. Below is a description of each variable:  

#### **1. Titles Dataset (`titles.csv`)**  
| **Column Name**           | **Description**  | **Data Type**  |
|---------------------------|-----------------|---------------|
| `id`                      | Unique identifier for each title.  | Object  |
| `title`                   | Name of the movie or TV show.  | Object  |
| `type`                    | Content type – "Movie" or "Show".  | Object  |
| `description`             | Brief summary of the title.  | Object  |
| `release_year`            | Year the title was released.  | Integer  |
| `age_certification`       | Age rating (e.g., PG, R, etc.).  | Object  |
| `runtime`                 | Duration of the movie (in minutes) or episode length.  | Integer  |
| `genres`                  | Genres associated with the title (e.g., Drama, Comedy).  | Object  |
| `production_countries`    | Countries where the title was produced.  | Object  |
| `seasons`                 | Number of seasons (if the title is a TV show).  | Float  |
| `imdb_id`                 | Unique IMDb identifier.  | Object  |
| `imdb_score`              | IMDb rating of the title.  | Float  |
| `imdb_votes`              | Number of votes received on IMDb.  | Float  |
| `tmdb_popularity`         | Popularity score from TMDB (The Movie Database).  | Float  |
| `tmdb_score`              | TMDB rating of the title.  | Float  |

#### **2. Credits Dataset (`credits.csv`)**  
| **Column Name**  | **Description**  | **Data Type**  |
|------------------|-----------------|---------------|
| `person_id`      | Unique identifier for the person (actor/crew).  | Integer  |
| `id`            | Identifier matching the `id` in the `titles.csv` file.  | Object  |
| `name`          | Name of the actor or crew member.  | Object  |
| `character`     | Name of the character played by the actor (if applicable).  | Object  |
| `role`          | Role of the person in the title (e.g., Actor, Director, Writer).  | Object  |

These variables will be used for **Exploratory Data Analysis (EDA)** to derive insights about content diversity, regional availability, ratings, and popularity on Amazon Prime Video.  


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("\n--- Unique Values Per Column ---")
print("\nTitles Dataset:")
for col in titles_df.columns:
    print(f"{col}: {titles_df[col].nunique()} unique values")

print("\nCredits Dataset:")
for col in credits_df.columns:
    print(f"{col}: {credits_df[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
print("\n--- Dataset Information ---")
print("\nTitles Dataset:")
print("Number of Rows:", titles_df.shape[0])
print("Number of Columns:", titles_df.shape[1])
titles_df.info()
print("\nColumns in Titles Dataset:")
print(titles_df.columns.tolist())

print("\nCredits Dataset:")
print("Number of Rows:", credits_df.shape[0])
print("Number of Columns:", credits_df.shape[1])
credits_df.info()
print("\nColumns in Credits Dataset:")
print(credits_df.columns.tolist())

# Handling missing values
print("\n--- Handling Missing Values ---")
titles_df.fillna({
    'description': 'No description available',
    'age_certification': 'Unknown',
    'seasons': 0,
    'imdb_score': titles_df['imdb_score'].median(),
    'imdb_votes': titles_df['imdb_votes'].median(),
    'tmdb_popularity': titles_df['tmdb_popularity'].median(),
    'tmdb_score': titles_df['tmdb_score'].median()
}, inplace=True)

credits_df.fillna({'character': 'Unknown'}, inplace=True)

# Removing duplicates
titles_df.drop_duplicates(inplace=True)
credits_df.drop_duplicates(inplace=True)

# Converting data types
titles_df['release_year'] = titles_df['release_year'].astype(int)
titles_df['seasons'] = titles_df['seasons'].astype(int)
titles_df['imdb_score'] = titles_df['imdb_score'].astype(float)
titles_df['imdb_votes'] = titles_df['imdb_votes'].astype(float)
titles_df['tmdb_popularity'] = titles_df['tmdb_popularity'].astype(float)
titles_df['tmdb_score'] = titles_df['tmdb_score'].astype(float)

# Checking for missing values again
print("\n--- Missing Values After Cleaning ---")
print("\nTitles Dataset:")
print(titles_df.isnull().sum())
print("\nCredits Dataset:")
print(credits_df.isnull().sum())
# Visualizing missing values after cleaning
# msno.bar applies its own figure size, so the size is passed to it directly.
msno.bar(titles_df, figsize=(10, 5), color="skyblue")
plt.title("Missing Values in Titles Dataset After Cleaning")
plt.show()

msno.bar(credits_df, figsize=(10, 5), color="lightcoral")
plt.title("Missing Values in Credits Dataset After Cleaning")
plt.show()

# Display dataset statistics after cleaning
titles_desc = titles_df.describe()
credits_desc = credits_df.describe()
print("\n--- Dataset Statistics After Cleaning ---")
print("\nTitles Dataset Statistics:")
print(titles_desc)
print("\nCredits Dataset Statistics:")
print(credits_desc)

# Check unique values for each column
print("\n--- Unique Values Per Column ---")
print("\nTitles Dataset:")
for col in titles_df.columns:
    print(f"{col}: {titles_df[col].nunique()} unique values")

print("\nCredits Dataset:")
for col in credits_df.columns:
    print(f"{col}: {credits_df[col].nunique()} unique values")

# Display first few rows of datasets
print("\n--- Preview of Datasets After Cleaning ---")
print("\nTitles Dataset:")
print(titles_df.head())
print("\nCredits Dataset:")
print(credits_df.head())

### What all manipulations have you done and insights you found?

### **Data Manipulations & Insights**  

#### **Data Manipulations Performed**  

1. **Handling Missing Values**  
   - Filled missing values in `description` with **"No description available"**.  
   - Filled `age_certification` with **"Unknown"** where missing.  
   - Replaced missing values in `seasons` with **0** (assuming missing values indicate movies).  
   - Used **median imputation** for missing values in numerical columns like `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score` to maintain data consistency.  
   - Filled missing values in the `character` column of `credits.csv` with **"Unknown"**.  

2. **Removing Duplicates**  
   - Checked and removed any duplicate entries from both datasets to ensure uniqueness.  

3. **Data Type Conversions**  
   - Converted `release_year` and `seasons` to **integer format**.  
   - Converted `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score` to **float format** to ensure numerical consistency.  

4. **Rechecking Missing Values**  
   - After cleaning, the imputed columns no longer contain nulls; columns that were not imputed (for example `imdb_id`) may still have gaps, which the re-check makes visible.  

#### **Key Insights from Data Cleaning & Preprocessing**  

1. **Content Distribution:**  
   - **Movies dominate** the dataset over TV shows.  
   - Amazon Prime offers a **wide range of genres**, but **Drama and Comedy** are the most common.  

2. **Trends Over Time:**  
   - A **significant rise** in content releases has been observed in the **2000s and 2010s**, reflecting the digital streaming boom.  

3. **IMDb & TMDB Scores:**  
   - Many movies and shows have **high IMDb and TMDB ratings**, suggesting strong audience engagement.  
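   - Median imputation kept titles with missing scores in the dataset, at the cost of concentrating the imputed values at the median, which narrows the score distributions and can weaken correlations.  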

4. **Regional Availability:**  
   - Content is available across **multiple countries**, but the **United States dominates** in terms of the number of available titles.  

5. **Missing Values Impact:**  
   - **Age certification** was missing for many entries, which might indicate inconsistencies in data collection.  
   - **IMDb and TMDB ratings** had some missing values, but imputation helped maintain the dataset’s completeness for analysis.  

This cleaned dataset is now **ready for deeper exploratory analysis**, including genre-based trends, popularity analysis, and regional content distribution.
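
Because median imputation changes the shape of a distribution, it can help to record which values were filled in and to compare the spread before and after. A minimal sketch on made-up scores; the `_missing` and `_filled` column names are illustrative and not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy scores with gaps (made-up values).
df = pd.DataFrame({"imdb_score": [5.0, 6.0, np.nan, 7.0, np.nan, 9.0]})

# Flag the originally missing rows before filling them.
df["imdb_score_missing"] = df["imdb_score"].isna()
median_score = df["imdb_score"].median()
df["imdb_score_filled"] = df["imdb_score"].fillna(median_score)

# The median is unchanged by the fill, but the standard deviation shrinks.
std_before = df["imdb_score"].std()
std_after = df["imdb_score_filled"].std()
print(f"median={median_score}, std before={std_before:.2f}, std after={std_after:.2f}")
```

With the flag in place, rating charts can be restricted to rows where `imdb_score_missing` is `False`, so imputed values do not appear as a spike at the median.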

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 - Distribution of Content Types
plt.figure(figsize=(6,4))
sns.countplot(data=titles_df, x='type', palette='coolwarm')
plt.title('Distribution of Content Types')
plt.xlabel('Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart effectively shows the distribution of categorical variables like content type (Movies vs. TV Shows).

##### 2. What is/are the insight(s) found from the chart?

Insights:

- The dataset contains significantly more movies than TV shows.
- This suggests that Amazon Prime’s catalog is movie-heavy, possibly due to the higher production cost and time required for TV series.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:

- A positive impact can be achieved by expanding TV show offerings, as binge-worthy series help in increasing user retention.
- A negative impact could arise if Amazon over-relies on movies, missing the trend of audience preference for long-form content.

#### Chart - 2

In [None]:
# Chart - 2 - Content Release Over the Years
plt.figure(figsize=(10,5))
sns.histplot(titles_df['release_year'], bins=30, kde=True, color='blue')
plt.title('Content Distribution Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

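A histogram helps analyze trends over time by showing how many titles fall into each range of release years. With 30 bins, a bar may span more than one year, so the chart shows the overall shape rather than exact yearly counts.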

##### 2. What is/are the insight(s) found from the chart?

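- Titles released in the past two decades account for a sharply growing share of the catalog.
- The count rises steeply for release years after 2010, which coincides with the streaming boom. Note that `release_year` is the year a title was released, not the year it was added to the platform.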
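
To read the trend from exact counts rather than histogram bars, titles can be tallied per decade. A minimal sketch on made-up years; with the real data, `titles_df['release_year']` would take the place of the toy Series.

```python
import pandas as pd

# Toy release years (made-up values).
years = pd.Series([1995, 2003, 2008, 2012, 2015, 2019, 2019, 2021], name="release_year")

# Floor each year to the start of its decade, then count titles per decade.
decade = (years // 10) * 10
titles_per_decade = decade.value_counts().sort_index()
print(titles_per_decade)
```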

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Positive: Indicates a strong push towards original content production to compete with Netflix & Disney+.
- Negative: Oversaturation may reduce content quality and user engagement.

#### Chart - 3

In [None]:
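# Chart - 3 - Top 10 Genres
# Split multi-valued genre strings, then strip brackets, quotes and spaces so that
# both "['drama', 'comedy']" and "drama, comedy" yield clean labels.
genres_expanded = titles_df['genres'].dropna().str.split(',').explode().str.strip(" []'\"")
genres_expanded = genres_expanded[genres_expanded != '']  # drop empty entries such as "[]"
genre_counts = genres_expanded.value_counts().head(10)
plt.figure(figsize=(10,5))
sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='viridis')
plt.title('Top 10 Genres')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()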

##### 1. Why did you pick the specific chart?

It highlights the most frequently occurring genres, useful for understanding content demand.

##### 2. What is/are the insight(s) found from the chart?

- Drama, Comedy, and Documentary are the most dominant genres.
- Horror and Sci-Fi genres have lower representation.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Expanding underrepresented genres (Sci-Fi, Horror) could attract niche audiences.
- Investing in Drama & Comedy maintains high engagement.

#### Chart - 4

In [None]:
# Chart - 4 - IMDb Score Distribution
plt.figure(figsize=(8,5))
sns.histplot(titles_df['imdb_score'], bins=20, kde=True, color='green')
plt.title('Distribution of IMDb Scores')
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Helps visualize how user ratings are spread across different content.

##### 2. What is/are the insight(s) found from the chart?

IMDb scores mostly range between 5 and 8, suggesting most content has average ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- More high-quality productions can help shift the ratings higher.
- Too many average-rated titles could hurt brand perception.

#### Chart - 5

In [None]:
# Chart - 5 - IMDb Score vs Runtime
plt.figure(figsize=(8,5))
sns.scatterplot(data=titles_df, x='runtime', y='imdb_score', alpha=0.5, color='purple')
plt.title('IMDb Score vs Runtime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

Investigates whether longer content leads to better ratings.

##### 2. What is/are the insight(s) found from the chart?

- No clear correlation between runtime and IMDb score.
- Very long movies (~200 mins) don’t necessarily get higher ratings.
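
The visual impression from the scatter plot can be backed with a correlation coefficient. The sketch below uses made-up values and computes Pearson's r alongside Spearman's rank correlation (obtained here as the Pearson correlation of the ranks), which is less sensitive to a few very long titles.

```python
import pandas as pd

# Toy runtimes and scores (made-up values).
df = pd.DataFrame({
    "runtime": [45, 90, 95, 110, 120, 200],
    "imdb_score": [7.8, 6.1, 6.9, 5.5, 7.2, 6.4],
})

# Pearson measures linear association between the raw values.
pearson = df["runtime"].corr(df["imdb_score"])

# Spearman is the Pearson correlation of the ranks (monotonic association).
spearman = df["runtime"].rank().corr(df["imdb_score"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Values near zero for both coefficients would support the "no clear correlation" reading; the toy numbers here say nothing about the real dataset.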

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus on storytelling quality rather than runtime length.

#### Chart - 6

In [None]:
# Chart - 6 - Boxplot of IMDb Score by Content Type
plt.figure(figsize=(8,5))
sns.boxplot(data=titles_df, x='type', y='imdb_score', palette='coolwarm')
plt.title('IMDb Score Distribution by Content Type')
plt.xlabel('Type')
plt.ylabel('IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

Helps compare IMDb ratings between movies and TV shows.

##### 2. What is/are the insight(s) found from the chart?

TV shows tend to have higher median ratings than movies.
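
The box plot comparison can be checked numerically with a grouped summary. A minimal sketch on made-up scores; the `MOVIE` and `SHOW` labels are assumptions about how the `type` column is coded.

```python
import pandas as pd

# Toy data (made-up values); the type labels are assumed.
df = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [5.9, 6.3, 7.0, 7.1, 7.5],
})

# Count, median and mean IMDb score for each content type.
summary = df.groupby("type")["imdb_score"].agg(["count", "median", "mean"])
print(summary)
```

Reporting the count next to the median shows how many titles each figure rests on, which matters when one group is much smaller than the other.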

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Amazon Prime should invest more in TV shows, as they often receive better ratings and engagement.

#### Chart - 7

In [None]:
# Chart - 7 - Distribution of Runtime
plt.figure(figsize=(8,5))
sns.histplot(titles_df['runtime'], bins=30, kde=True, color='orange')
plt.title('Distribution of Runtime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

Helps understand the general length of movies/shows on the platform.

##### 2. What is/are the insight(s) found from the chart?

Most titles run between 90 and 120 minutes, in line with typical feature-film length. The chart includes both movies and shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Maintain a balance between short and long-form content to cater to different audience preferences.

#### Chart - 8

In [None]:
# Chart - 8 - TMDB Score vs IMDb Score
plt.figure(figsize=(8,5))
sns.scatterplot(data=titles_df, x='imdb_score', y='tmdb_score', alpha=0.5, color='red')
plt.title('TMDB Score vs IMDb Score')
plt.xlabel('IMDb Score')
plt.ylabel('TMDB Score')
plt.show()

##### 1. Why did you pick the specific chart?

Helps compare two different rating sources to check consistency.

##### 2. What is/are the insight(s) found from the chart?

IMDb and TMDB scores are not always aligned, indicating subjective audience preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Ratings alone should not dictate content strategy; audience engagement metrics are also crucial.

#### Chart - 9

In [None]:
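# Chart - 9 - Number of Titles by Production Country
# Strip brackets, quotes and spaces left over from list-like strings such as "['US', 'GB']".
countries = titles_df['production_countries'].dropna().str.split(',').explode().str.strip(" []'\"")
country_counts = countries[countries != ''].value_counts().head(10)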
plt.figure(figsize=(10,5))
sns.barplot(x=country_counts.index, y=country_counts.values, palette='magma')
plt.title('Top 10 Production Countries')
plt.xlabel('Country')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Shows where most content originates from, important for regional content strategy.

##### 2. What is/are the insight(s) found from the chart?

- US dominates content production, followed by UK, India, and Canada.
- Some regions have minimal representation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Investing in localized content could help Amazon expand in underrepresented regions.

#### Chart - 10

In [None]:
# Chart - 10 - Top 10 Most Popular Titles (by TMDB Popularity)
top_popular = titles_df.nlargest(10, 'tmdb_popularity')
plt.figure(figsize=(12,6))
sns.barplot(y=top_popular['title'], x=top_popular['tmdb_popularity'], palette='coolwarm')
plt.title('Top 10 Most Popular Titles by TMDB Popularity')
plt.xlabel('TMDB Popularity')
plt.ylabel('Title')
plt.show()

##### 1. Why did you pick the specific chart?

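Identifies the titles with the highest TMDB popularity score. This score is used here as a proxy for audience interest; it does not measure viewership on Amazon Prime.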

##### 2. What is/are the insight(s) found from the chart?

Popularity is not solely dependent on IMDb ratings—other factors like marketing, brand recognition, and actors play a role.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Amazon should analyze why these titles are successful and replicate those strategies.

#### Chart - 11

In [None]:
# Chart - 11 - Average IMDb Score by Release Year
plt.figure(figsize=(12,6))
# errorbar=None replaces the deprecated ci=None (seaborn >= 0.12)
sns.lineplot(data=titles_df, x='release_year', y='imdb_score', errorbar=None, color='teal')
plt.title('IMDb Score Trend Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Average IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

Shows how average IMDb ratings have changed over time.

##### 2. What is/are the insight(s) found from the chart?

IMDb ratings have slightly decreased over time, which may indicate content dilution.
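
A "slight decrease" can be quantified by fitting a straight line to the yearly mean scores. The sketch below uses made-up values and `np.polyfit` with degree 1; the slope is the average change in score per release year.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values).
df = pd.DataFrame({
    "release_year": [2000, 2000, 2005, 2005, 2010, 2010, 2015, 2015],
    "imdb_score": [7.0, 7.4, 6.8, 7.0, 6.5, 6.9, 6.2, 6.6],
})

# Mean score per release year, matching what the line plot displays.
yearly_mean = df.groupby("release_year")["imdb_score"].mean()

# Least-squares line through the yearly means; slope is score change per year.
slope, intercept = np.polyfit(yearly_mean.index.to_numpy(), yearly_mean.to_numpy(), deg=1)
print(f"Change in mean IMDb score per year: {slope:.3f}")
```

A negative slope is consistent with a decline, though years with few titles carry the same weight as years with many in this unweighted fit.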

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Quality control is crucial; excessive low-rated content could drive subscribers away.

#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,6))
sns.heatmap(titles_df[['runtime', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

It visually represents relationships between numerical variables like IMDb score, runtime, and TMDB popularity.

##### 2. What is/are the insight(s) found from the chart?

- IMDb score correlates positively with IMDb votes: titles with more votes tend to have higher ratings, although the correlation alone does not show which drives the other.
- TMDB popularity does not correlate strongly with IMDb scores, suggesting that factors other than rating influence popularity.
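
Reading exact values off a heatmap is error-prone, so it can help to list the feature pairs sorted by correlation strength. This is a small sketch under the assumption that the same numeric columns are used as in the heatmap cell; `ranked_correlations` is a hypothetical helper name.

```python
from itertools import combinations

import pandas as pd


def ranked_correlations(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """List each unique pair of columns with its Pearson correlation,
    strongest (by absolute value) first."""
    corr = df[cols].corr()  # same Pearson matrix the heatmap displays
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(cols, 2)]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "corr"])
    # Sort on |corr| so strong negative relationships rank high as well.
    pairs = pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False)
    return pairs.reset_index(drop=True)
```

For example, `ranked_correlations(titles_df, ['runtime', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score'])` would show where the IMDb score and IMDb votes pair sits relative to the others.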

#### Chart - 13 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(titles_df[['runtime', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']])
plt.show()

##### 1. Why did you pick the specific chart?

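A pair plot shows every pairwise scatter plot and each variable's distribution in a single grid, which helps analyze pairwise relationships between numerical features.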

##### 2. What is/are the insight(s) found from the chart?

No strong linear relationships are observed between the variables, which suggests that multiple factors contribute to a title’s success.
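
A weak linear relationship does not rule out a monotonic but curved one, which is common when one variable is a heavy-tailed count. A quick check is to compare the Pearson coefficient with the Spearman rank coefficient for the same pair. The sketch below computes Spearman as Pearson on ranks, which avoids any dependency beyond pandas; `pearson_vs_spearman` is a hypothetical helper, not something from the original notebook.

```python
import pandas as pd


def pearson_vs_spearman(df: pd.DataFrame, x: str, y: str) -> dict:
    """Compare linear (Pearson) and rank-based (Spearman) correlation
    for two columns, using only rows where both values are present."""
    pair = df[[x, y]].dropna()
    pearson = pair[x].corr(pair[y])
    # Spearman's rho is the Pearson correlation of the average ranks.
    spearman = pair[x].rank().corr(pair[y].rank())
    return {"pearson": float(pearson), "spearman": float(spearman), "n": len(pair)}
```

If, say, `pearson_vs_spearman(titles_df, 'imdb_votes', 'tmdb_popularity')` reports a Spearman value well above the Pearson value, the two variables move together in a non-linear way that the scatter panels can hide.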

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

### **Recommendations to Achieve Business Objectives**  

To optimize content strategy, improve user engagement, and drive subscription growth, Amazon Prime should focus on the following key areas:  

#### **1. Expand TV Show Offerings**  
- TV shows tend to have **higher IMDb ratings** than movies; the dataset does not measure retention directly, so the expected **audience retention** benefit is an assumption.  
- Investing in **binge-worthy series** can **increase watch time and reduce churn rates**.  
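
The rating gap between TV shows and movies can be checked with a simple grouped summary. This is a minimal sketch that assumes the content type lives in a column named `type` (for example with values such as `MOVIE` and `SHOW`); that column name and the helper `score_by_type` are assumptions, not confirmed by the text above.

```python
import pandas as pd


def score_by_type(
    df: pd.DataFrame, type_col: str = "type", score_col: str = "imdb_score"
) -> pd.DataFrame:
    """Summarize IMDb scores per content type, best mean first.

    'type' is an assumed column name; pass type_col to override it.
    """
    return (
        df.groupby(type_col)[score_col]
        # 'count' ignores missing scores, so n_rated is the number of rated titles.
        .agg(n_rated="count", mean_score="mean", median_score="median")
        .sort_values("mean_score", ascending=False)
    )
```

Reporting the median alongside the mean guards against a handful of extreme titles driving the comparison, and `n_rated` shows whether the two groups are large enough to compare.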

#### **2. Diversify Genre Selection**  
- **Drama and Comedy dominate**, but **Sci-Fi and Horror** have **untapped potential**.  
- Expanding niche genres can **attract new audience segments**.  
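
Genre dominance can be counted directly once multi-genre titles are split into one row per genre. The sketch below assumes the `genres` column stores each title's genres either as a Python list or as a string that looks like one (for example `"['drama', 'comedy']"`); that storage format and the helper name `genre_counts` are assumptions.

```python
import ast

import pandas as pd


def genre_counts(df: pd.DataFrame, col: str = "genres") -> pd.Series:
    """Count how many titles carry each genre label.

    Assumes each cell is a list or a string representation of a list;
    anything else (including missing values) is treated as "no genres".
    """

    def parse(value):
        if isinstance(value, list):
            return value
        if isinstance(value, str):
            try:
                parsed = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                return []
            return list(parsed) if isinstance(parsed, (list, tuple)) else []
        return []

    # explode() gives one row per (title, genre); empty lists become NaN.
    return df[col].map(parse).explode().dropna().value_counts()
```

The same pattern would apply to any other list-like column, and dividing the result by `len(titles_df)` turns the counts into the share of titles per genre.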

#### **3. Focus on Regional Content Expansion**  
- **US-based content dominates**, while other regions have minimal representation.  
- Investing in **localized content for different markets** (e.g., India, Latin America) can help **increase global subscriber count**.  

#### **4. Improve Content Quality**  
- IMDb ratings have **declined over time**, signaling **potential quality issues**.  
- Prioritize **high-quality storytelling** over volume production.  

#### **5. Leverage Popularity Metrics**  
- Some highly popular titles **do not necessarily have high IMDb scores**, indicating that **marketing, branding, and cast selection impact engagement**.  
- **Strategic promotion and partnerships** can enhance visibility for new content.  

#### **6. Maintain a Balanced Content Mix**  
- Most movies range between **90-120 minutes**, aligning with industry norms.  
- A mix of **short-form content (e.g., miniseries) and long-form content (e.g., multi-season shows)** can cater to **diverse viewer preferences**.  

#### **7. Data-Driven Decision Making**  
- Use **viewer engagement data, ratings, and regional preferences** to make **smarter content investments**.  
- Conduct **A/B testing for promotional strategies** to optimize audience reach.  
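
For the A/B testing suggestion, a common way to compare two promotional variants is a two-proportion z-test on their conversion rates. The following is an illustrative sketch using only the standard library; the function name and the notion of a "conversion" (for example a click or a play) are hypothetical, since the dataset contains no experiment data.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference between two conversion rates.

    conv_* are the numbers of conversions, n_* the numbers of users
    exposed to variants A and B. Returns (z statistic, p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

The normal approximation behind this test is reasonable only when each variant has a fairly large number of users and conversions; for small samples an exact test would be a safer choice.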

### **Conclusion**  
By implementing these strategies, Amazon Prime can **enhance its content portfolio, improve customer satisfaction, and drive long-term growth in the competitive streaming industry**.

# **Conclusion**

This Exploratory Data Analysis (EDA) of Amazon Prime’s content catalog provides key insights into content diversity, regional availability, audience preferences, and factors influencing ratings and popularity.  

#### **Key Takeaways:**  
1. **Content Strategy:**  
   - The platform is **heavily movie-focused**, but TV shows have **higher ratings**, which suggests stronger engagement.  
   - Investing in **more TV shows and original series** can improve retention.  

2. **Genre & Audience Preferences:**  
   - **Drama and Comedy dominate**, while **Sci-Fi and Horror are underrepresented**.  
   - Expanding niche genres can **attract diverse audiences**.  

3. **Regional Growth Potential:**  
   - The content library is **dominated by US-based productions**, with **low representation in other markets**.  
   - Increasing **regional content investments** can help in **subscriber expansion globally**.  

4. **Ratings & Quality Control:**  
   - IMDb ratings have **declined over the years**, suggesting **potential content dilution**.  
   - Prioritizing **high-quality productions** over sheer quantity will **enhance audience trust**.  

5. **Marketing & Popularity Trends:**  
   - Some **popular titles have low IMDb ratings**, showing that **marketing, branding, and cast selection influence engagement**.  
   - **Strategic promotions and partnerships** can maximize viewership.  

### **Final Recommendation:**  
To maintain its **competitive edge in the streaming industry**, Amazon Prime must:  
✅ Expand its **TV show offerings** for better engagement.  
✅ Diversify into **underrepresented genres** like **Sci-Fi and Horror**.  
✅ Strengthen **regional content strategies** to **appeal to global audiences**.  
✅ Ensure **higher content quality** to maintain strong ratings.  
✅ Use **data-driven decisions** for content acquisition, promotions, and investments.  

By focusing on these strategies, Amazon Prime can **enhance customer satisfaction, boost engagement, and drive long-term subscription growth** in an increasingly competitive market.
