# **Project Name    - Comprehensive Exploratory Data Analysis (EDA) of Amazon Prime Movies and TV Shows**


##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Contributor Name -** Sana Khan

# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This project analyzes the content library of Amazon Prime Video to extract actionable business insights. In the highly competitive streaming industry, platforms must constantly expand and diversify their libraries to cater to global audiences, which makes data-driven strategies essential for understanding trends and audience preferences.

The core objectives of this problem statement include:

*   Content Diversity: Identifying which genres and categories currently dominate the platform.
*   Regional Availability: Analyzing how content distribution varies across different geographic regions.
*   Trends Over Time: Investigating the evolution of Amazon Prime’s content library to see how it has grown or changed historically.
*   Quality and Popularity: Determining which shows and movies are the highest-rated or most popular based on IMDb and TMDB metrics.

By addressing these questions, the project aims to help businesses, content creators, and data analysts uncover key trends that influence subscription growth, user engagement, and future content investment strategies.

#### **Define Your Business Objective?**

The primary business objective is to use data-driven insights to understand audience preferences and refine content strategy, so that the platform stays competitive in the rapidly growing streaming industry.

The specific business objectives include:

*   Analyzing Content Diversity: Determining which specific genres and categories are most prevalent on the platform to understand the current library's strengths.
*   Assessing Regional Availability: Evaluating how content distribution and availability vary across different geographic regions.
*   Identifying Trends Over Time: Tracking the historical evolution of the Amazon Prime content library to understand growth patterns.
*   Measuring Quality and Popularity: Identifying the highest-rated and most popular shows and movies using IMDb and TMDB metrics to pinpoint what resonates with viewers.
*   Informing Strategic Investment: Uncovering trends that can guide content investment strategies, improve user engagement, and drive subscription growth.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: To make plots appear inline and set a consistent style
%matplotlib inline
sns.set_theme(style="whitegrid")

### Dataset Loading

In [None]:
# Load Dataset
try:
    titles_df = pd.read_csv('titles.csv')
    credits_df = pd.read_csv('credits.csv')

    print("Dataset loaded successfully!")
except Exception as e:
    print(f"Error loading dataset: {e}")

### Dataset First View

In [None]:
# Dataset First Look

# Merge on the 'id' column since it is common to both files
merged_df = pd.merge(titles_df, credits_df, on='id', how='inner')

print(merged_df.head())
# info() prints its report directly and returns None, so it is not wrapped in print()
merged_df.info()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(merged_df.shape)

### Dataset Information

In [None]:
# Dataset Info
merged_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_count = merged_df.duplicated().sum()
print(f"Total Duplicate Rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_counts = merged_df.isnull().sum()
print(null_counts)

In [None]:
# Visualizing the missing values

null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

plt.figure(figsize=(10, 5))
null_counts.plot(kind='bar', color='skyblue')

plt.title('Count of Missing Values per Column')
plt.ylabel('Number of Missing Entries')
plt.xlabel('Columns')
plt.show()

### What did you know about your dataset?

Think of this dataset as a digital library of everything available on Amazon Prime Video in the United States. It isn't just a simple list; it’s a deep dive into what makes a movie or show successful on the platform.

Here is a breakdown of what the data actually tells us:

1. The Two Main "Folders"
The data is split into two connected parts:

The Titles Catalog: This is the "what" and "where". It covers over 9,000 unique titles—from blockbuster movies to binge-worthy TV shows. It tells us their name, how long they are, what year they came out, and their age rating (like PG-13 or R).

The Cast & Crew Credits: This is the "who". It contains over 124,000 records of the actors and directors who brought these stories to life. It even specifies if they were the director or which character an actor played.

2. The "Quality" Scorecard
The dataset keeps track of how much people actually liked what they watched. It includes:

IMDb Scores & Votes: Real-world ratings from millions of viewers that tell us if a show is a masterpiece or a flop.

TMDB Popularity: A pulse on what is "trending" right now versus what is just a classic.

3. The Diversity of Content
The data allows us to see how global Amazon Prime really is. We can look at:

Genres: Whether the platform is dominated by high-octane Action, laugh-out-loud Comedy, or gripping Dramas.

International Reach: Which countries are producing the most content—helping us see how many titles are from the US versus international creators.

4. The Evolution Over Time
By looking at the Release Years, we can see the history of streaming. We can track if Amazon is focusing more on new "Originals" lately or if they prefer building a massive library of older, classic films.

In short, this dataset is a treasure map for understanding the entertainment habits of millions and the business strategy of one of the world's biggest streaming giants.
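The two-part structure described above has a practical consequence: after an inner merge there is one row per (title, person) pair, and titles with no credits disappear. The following is a minimal sketch of how that could be checked, using small made-up stand-ins (`titles_demo`, `credits_demo`) rather than the real CSV files.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (hypothetical values)
titles_demo = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3"],
    "title": ["Film A", "Film B", "Show C"],
})
credits_demo = pd.DataFrame({
    "id": ["tm1", "tm1", "ts3"],
    "name": ["Actor X", "Director Y", "Actor Z"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# A left merge with indicator=True flags titles that have no credit rows
coverage = titles_demo.merge(credits_demo, on="id", how="left", indicator=True)
titles_without_credits = coverage.loc[coverage["_merge"] == "left_only", "id"].nunique()

# An inner merge keeps one row per (title, person) pair and drops unmatched titles
inner_demo = titles_demo.merge(credits_demo, on="id", how="inner")

print(f"Titles with no credits: {titles_without_credits}")
print(f"Rows after inner merge: {len(inner_demo)}")
print(f"Unique titles after inner merge: {inner_demo['id'].nunique()}")
```

Running the same check on the real files would show how many titles the inner merge drops and how far the row count exceeds the title count.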

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# Columns of titles dataset
print("Titles Dataset Columns")
print(list(titles_df.columns))
print(f"Total Columns in Titles: {len(titles_df.columns)}\n")

# Columns of credits dataset
print("Credits Dataset Columns")
print(list(credits_df.columns))
print(f"Total Columns in Credits: {len(credits_df.columns)}")

In [None]:
# Dataset Describe
print(merged_df.describe())

### Variables Description

1. Title Information (from titles.csv)
id: A unique identifier for each title, sourced from JustWatch, used as the primary key for merging.

title: The official name of the movie or TV show.

type (renamed to show_type in the Chart 1 code): Categorizes the content as either a 'MOVIE' or a 'SHOW'.

description: A brief textual summary or plot synopsis of the title.

release_year: The year the content was originally released.

age_certification: The age-based content rating (e.g., G, PG, R, TV-MA).

runtime: The duration of the movie or an individual episode in minutes.

genres: A list of categories associated with the title (e.g., Drama, Comedy).

production_countries: A list of countries involved in the production of the title.

seasons: The total number of seasons available (only applicable for 'SHOW' types).

imdb_id / imdb_score / imdb_votes: The unique ID, average user rating, and total number of user reviews from the IMDb platform.

tmdb_popularity / tmdb_score: The trend-based popularity metric and user score from The Movie Database (TMDB).

2. Cast and Crew Information (from credits.csv)

person_id: A unique identifier for the actor or director.

name: The real-world name of the cast or crew member.

character: The name of the character played by the actor (typically null for directors).

role: Specifies the individual's contribution as either an 'ACTOR' or a 'DIRECTOR'.
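To cross-check descriptions like these against the actual data, a compact per-column summary can be easier to read than printing every unique value. This is a sketch on a small made-up frame (`demo`); the column names mirror the ones described above, but the values are invented.

```python
import pandas as pd

# Small made-up frame; values are illustrative only
demo = pd.DataFrame({
    "title": ["Film A", "Film B", "Show C"],
    "seasons": [None, None, 2.0],
    "imdb_score": [6.1, None, 7.4],
})

# One row per column: data type, number of non-null entries, number of distinct values
summary = pd.DataFrame({
    "dtype": demo.dtypes.astype(str),
    "non_null": demo.notna().sum(),
    "n_unique": demo.nunique(),
})
print(summary)
```

Applied to `merged_df`, the same three-column summary would confirm which fields are mostly empty (for example `seasons` for movies) before any imputation is chosen.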

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in merged_df.columns:
    unique_values = merged_df[col].unique()
    print(f"Unique values for {col}: {unique_values}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# --- 1. Handling Missing Values (NaN) ---
# For TV Shows, 'seasons' is present, but for Movies it is NaN. We fill it with 0.
merged_df['seasons'] = merged_df['seasons'].fillna(0)

# Titles without an age rating are marked as 'Not Rated' to maintain data integrity.
merged_df['age_certification'] = merged_df['age_certification'].fillna('Not Rated')

# Fill missing IMDb and TMDB scores with the median to avoid skewing our analysis.
merged_df['imdb_score'] = merged_df['imdb_score'].fillna(merged_df['imdb_score'].median())
merged_df['tmdb_score'] = merged_df['tmdb_score'].fillna(merged_df['tmdb_score'].median())

# Drop rows where critical cast/crew information is missing as it's hard to impute.
merged_df.dropna(subset=['name', 'role'], inplace=True)


# --- 2. Data Type Correction ---
# Converting year and seasons to integer for better numerical analysis and visualization.
merged_df['release_year'] = merged_df['release_year'].astype(int)
merged_df['seasons'] = merged_df['seasons'].astype(int)


# --- 3. Text/String Cleaning ---
# Genres and production countries often come in a list-like string format e.g., "['drama']".
# We clean these to make them readable for the "Content Diversity" objective.
columns_to_clean = ['genres', 'production_countries']
for col in columns_to_clean:
    merged_df[col] = merged_df[col].str.replace("[", "", regex=False).str.replace("]", "", regex=False).str.replace("'", "", regex=False)


# --- 4. Final Verification ---
# Printing the summary to verify the "Handling Missing Values" milestone.
print("Wrangling Complete. Missing values status:")
print(merged_df.isnull().sum())
print(f"\nFinal cleaned dataset shape: {merged_df.shape}")

### What all manipulations have you done and insights you found?

What we did to clean the data (Manipulations)
Merging the Files: We connected the list of movies (titles.csv) with the list of actors and directors (credits.csv) using their unique ID numbers so we could see the "who" and the "what" in one place.

Filling in the Gaps:

For TV shows, we have season counts, but for movies, that column was empty. We filled those empties with 0 so the computer wouldn't get confused.

If a movie didn't have a maturity rating (like PG or R), we labeled it 'Not Rated' instead of just leaving it blank.

Imputing the Scores: For titles missing an IMDb or TMDB score, we filled in the median (middle) score of all other rows so that missing values would not distort later calculations. Note that this creates an artificial cluster of titles at exactly the median value.

Fixing Numbers: We made sure things like "Release Year" were stored as whole numbers (integers) so we could easily sort them from oldest to newest.

Polishing Text: In the raw data, genres looked messy, like ['drama', 'crime']. We stripped away the brackets and quotes to make them look like clean, readable words.
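Stripping brackets and quotes turns the list-like strings into readable text. An alternative, sketched here under assumptions, is to parse each string into a real Python list with `ast.literal_eval`, which makes later counting more direct. The helper name `parse_list` and the sample strings are hypothetical.

```python
import ast
import pandas as pd

# Made-up examples of the raw list-like strings
raw = pd.Series(["['drama', 'crime']", "['comedy']", "[]"])

def parse_list(text):
    """Parse a list-like string into a Python list; return [] if parsing fails."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return []
    return value if isinstance(value, list) else []

parsed = raw.apply(parse_list)

# explode() gives one row per genre; empty lists become NaN and are dropped
genre_counts = parsed.explode().dropna().value_counts()
print(genre_counts)
```

The trade-off is that the column then holds lists rather than strings, so string methods such as `.str.split(',')` no longer apply to it.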

What we learned so far (Initial Insights)

Enormous Variety: We found over 9,000 unique titles supported by a massive community of over 124,000 actors and directors, showing just how huge the Amazon Prime library really is.

A Mixed Bag: Amazon isn't just about movies; it has a very strong mix of both standalone Movies and multi-season TV Shows.

Missing Labels: A lot of older or niche content doesn't have formal age ratings, suggesting that a good chunk of the library might be classic or independent films that were never officially rated by the MPAA.

International Flavor: While there is a lot of content from the US, we noticed a significant number of titles produced in other countries, proving that the library is quite global.

Wide Quality Range: The IMDb scores span a broad range, so the platform holds everything from critically acclaimed titles to poorly rated ones.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# --- STEP 1: Reuse the cleaned data ---
# Do not reload or re-merge the CSV files here: that would overwrite merged_df
# and discard every cleaning step from the Data Wrangling section.

# --- STEP 2: Standardize Column Names ---
# Cleaning hidden spaces and standardizing naming to avoid KeyErrors
merged_df.columns = [col.strip().lower().replace(' ', '_') for col in merged_df.columns]

# --- STEP 3: Handle the 'show_type' Column Name ---
# Checking if the column is named 'type' (common in this dataset) and renaming it
if 'type' in merged_df.columns:
    merged_df.rename(columns={'type': 'show_type'}, inplace=True)

# --- STEP 4: Visualization (Chart 1) ---
# Why: To visualize the ratio between Movies and TV Shows on the platform
if 'show_type' in merged_df.columns:
    # merged_df has one row per credit, so keep one row per title before counting
    content_counts = merged_df.drop_duplicates(subset='id')['show_type'].value_counts()

    # Offset only the first slice; build the tuple to match the number of categories
    explode = [0.05] + [0] * (len(content_counts) - 1)

    # Create the Pie Chart
    plt.figure(figsize=(8, 8))
    plt.pie(content_counts,
            labels=content_counts.index,
            autopct='%1.1f%%',
            startangle=140,
            colors=['#ff9999', '#66b3ff'],
            explode=explode)

    plt.title('Distribution of Movies vs TV Shows on Amazon Prime')
    plt.axis('equal') # Ensures pie is a circle
    plt.show()
else:
    print("Error: Could not find 'show_type' or 'type' column. Check your dataset.")

##### 1. Why did you pick the specific chart?

A Pie Chart is the most effective tool for visualizing the composition of a whole when dealing with a small number of categories (just two: Movies and Shows). It provides an immediate visual representation of which category holds the majority share.

##### 2. What is/are the insight(s) found from the chart?

The platform is significantly dominated by Movies compared to episodic TV Shows. This indicates that Amazon Prime’s core library strength lies in standalone cinema rather than long-form series.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight helps marketing teams position the service as a "digital movie theater". It also alerts content acquisition teams to potential gaps in episodic content that could be filled to improve user engagement.

Yes, an extreme imbalance can lead to subscriber churn. TV shows typically drive long-term habituation; if the library is too movie-heavy, users may cancel their subscriptions after watching a few specific films because there is no recurring episodic content to keep them coming back week after week.
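Because the merged table holds one row per credit, the Movie/Show split depends on whether rows or titles are counted. The sketch below illustrates the difference on a made-up frame (`merged_demo`); the numbers are invented and only the counting technique is the point.

```python
import pandas as pd

# Made-up merged data: 'tm1' has three credit rows, the others have one each
merged_demo = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "tm2", "ts3"],
    "show_type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW"],
})

# Share by credit rows: titles with large casts weigh more
row_share = merged_demo["show_type"].value_counts(normalize=True)

# Share by title: each title counts exactly once
title_share = (
    merged_demo.drop_duplicates(subset="id")["show_type"]
    .value_counts(normalize=True)
)

print(row_share)
print(title_share)
```

In this toy case movies make up 80% of rows but only two thirds of titles, which is why a title-level count is the safer basis for library-composition claims.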

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Top 10 Genres (Bar Chart)
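# Create df_clean for the visualization
df_clean = merged_df.copy()

# merged_df has one row per credit, so keep one row per title before counting genres
titles_unique = df_clean.drop_duplicates(subset='id')

# Why: To identify which genres dominate the platform's library
plt.figure(figsize=(12, 6))

# 1. Exploding the genre list for accurate counting
# We split by comma, 'explode' to count each genre individually,
# and strip any leftover brackets or quotes from the raw list-like strings
genre_data = titles_unique['genres'].str.split(',').explode().str.strip(" []'\"")
genre_data = genre_data[genre_data != ""]

# 2. Get the top 10 most frequent genres
top_10_genres = genre_data.value_counts().head(10)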

# 3. Create the Bar Chart using Seaborn
sns.barplot(x=top_10_genres.values, y=top_10_genres.index, palette='viridis', hue=top_10_genres.index, legend=False)

# 4. Add titles and labels for professional formatting
plt.title('Top 10 Genres by Content Volume on Amazon Prime')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')

plt.show()

##### 1. Why did you pick the specific chart?

A Horizontal Bar Chart is ideal for comparing the frequency of multiple categories. It prevents the overlapping of long genre names on the y-axis, making the data much more readable than a standard vertical bar chart.

##### 2. What is/are the insight(s) found from the chart?

Drama and Comedy are typically the most prevalent genres in the library. This reveals that Amazon Prime prioritizes mainstream, high-appeal content over niche categories like Sci-Fi or Documentary.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This data informs the recommendation engine to prioritize these popular categories. It also helps in budget allocation—investing more in high-volume genres ensures that the platform satisfies the broadest possible audience.

Oversaturation in a few genres can lead to "content fatigue". If the platform lacks diversity (e.g., very little Horror or Animation), it risks losing niche audiences to competitors like Netflix or Disney+, leading to stagnant growth in those specific market segments.
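Raw genre counts can be complemented by the share of titles that carry each genre, which makes "dominance" easier to compare. This is a sketch on made-up data (`titles_demo`); since a title can have several genres, the shares can sum to more than 1.

```python
import pandas as pd

# Made-up title-level data with comma-separated genres
titles_demo = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "genres": ["drama, crime", "drama", "comedy", "drama, comedy"],
})

# One row per (title, genre) pair
exploded = titles_demo.assign(genre=titles_demo["genres"].str.split(",")).explode("genre")
exploded["genre"] = exploded["genre"].str.strip()

# Fraction of all titles that carry each genre
genre_share = (
    exploded.groupby("genre")["id"].nunique()
    .div(titles_demo["id"].nunique())
    .sort_values(ascending=False)
)
print(genre_share)
```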

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Content Release Trends Over Years (Line Chart)

# 1. Grouping data by release year and counting titles
# Keep one row per title first, because df_clean has one row per credit
yearly_growth = df_clean.drop_duplicates(subset='id')['release_year'].value_counts().sort_index()

# 2. Plotting the Line Chart
plt.figure(figsize=(12, 6))
sns.lineplot(x=yearly_growth.index, y=yearly_growth.values, marker='o', color='tab:blue')

# 3. Adding professional labels and titles
# release_year is the original release date, not the date a title joined the platform
plt.title('Content Release Trends: Titles Released Per Year')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles Released')
plt.grid(True, linestyle='--', alpha=0.6)

plt.show()

##### 1. Why did you pick the specific chart?

A Line Chart is the best choice for visualizing time-series data. It clearly shows the progression, peaks, and valleys of content production across several decades.

##### 2. What is/are the insight(s) found from the chart?

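The chart typically shows a sharp increase in titles released from the late 2010s onward. Note that release_year records when a title was first released, not when it was added to Amazon Prime, so the curve shows that the library skews toward recent releases. This is consistent with, but does not by itself prove, a shift from hosting older classics to producing or acquiring modern "Original" content.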

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This helps the business identify if they are maintaining a consistent production pace. If the trend is upward, it builds investor confidence and signals to subscribers that there will always be something new to watch.

Yes. If the line shows a sharp decline in very recent years, it could indicate a production bottleneck or a loss of licensing deals. A sudden drop in new content can lead to stagnant user growth, as modern audiences primarily subscribe to streaming services for the latest releases.
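Year-by-year lines can be noisy, so grouping release years into decades is one way to summarize the long-run trend. A minimal sketch on invented data (`titles_demo`):

```python
import pandas as pd

# Made-up title-level data
titles_demo = pd.DataFrame({
    "id": list("abcdef"),
    "release_year": [1995, 2008, 2015, 2019, 2019, 2021],
})

# Integer-divide by 10 to map each year to the start of its decade
decade = (titles_demo["release_year"] // 10) * 10

# Number of distinct titles released in each decade
per_decade = titles_demo.groupby(decade)["id"].nunique()
print(per_decade)
```

When reading such a table, keep in mind that the most recent decade is incomplete, so a lower count there does not necessarily signal a decline.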

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Why: To understand the quality spread of the content library
# Distribution of IMDb Scores (Histogram)
plt.figure(figsize=(12, 6))

# 1. Creating a histogram with a Kernel Density Estimate (KDE) line
# Keep one row per title so titles with large casts are not over-counted.
# Note: if missing scores were median-filled during wrangling, expect a spike at the median.
title_scores = df_clean.drop_duplicates(subset='id')['imdb_score']
sns.histplot(title_scores, bins=20, kde=True, color='purple')

# 2. Adding professional formatting
plt.title('Distribution of IMDb Ratings on Amazon Prime')
plt.xlabel('IMDb Score')
plt.ylabel('Frequency (Number of Titles)')

# 3. Adding a vertical line for the average score to provide context
mean_score = title_scores.mean()
plt.axvline(mean_score, color='red', linestyle='--', label=f"Average: {mean_score:.2f}")
plt.legend()

plt.show()

##### 1. Why did you pick the specific chart?

A Histogram with a KDE line is a good way to visualize the distribution and density of numerical data. It allows us to see not just the most common scores, but also how much "quality" content (high scores) vs. "poor" content (low scores) exists.

##### 2. What is/are the insight(s) found from the chart?

Typically, the distribution is roughly bell-shaped, with most titles falling between a score of 5.5 and 7.5. This suggests that while Amazon Prime has a massive library, the majority of it consists of "average" to "good" content, with very few extremely low or perfect 10/10 ratings. Any sharp spike at a single value is likely an artifact of filling missing scores with the median rather than a real feature of the ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Identifying the "sweet spot" of ratings helps the platform curate its "Top Rated" sections. If the average is high, it can be used in marketing campaigns to prove that Amazon Prime offers higher quality content than its competitors.

Yes. If there is a large "left-tail" (a high frequency of very low scores), it indicates that the library is filled with "low-quality filler content". This can damage the brand's reputation and lead to customer dissatisfaction, as users might feel they are paying for a service where they have to dig through poor titles to find something worth watching.
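Since missing scores were filled with the median during wrangling, it is worth checking how much that step reshapes the distribution. The sketch below uses a made-up series (`scores`) to show the effect: imputation piles values onto the median and shrinks the spread.

```python
import numpy as np
import pandas as pd

# Made-up scores with three missing values
scores = pd.Series([4.0, 5.5, 6.0, 6.5, 7.0, np.nan, np.nan, np.nan])

median = scores.median()          # computed on the non-missing values only
imputed = scores.fillna(median)

# Share of entries sitting exactly at the median after imputation
share_at_median = (imputed == median).mean()

print(f"Median: {median}")
print(f"Std before: {scores.std():.3f}, after: {imputed.std():.3f}")
print(f"Share at median after imputation: {share_at_median:.2f}")
```

An alternative design is to plot only the observed scores and report the number of missing ones separately, which avoids the artificial spike altogether.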

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Top 10 Content Producing Countries (Bar Chart)
# Why: To evaluate the platform's global reach and regional focus.
# 1. Cleaning and exploding the production_countries column
# Keep one row per title first, because df_clean has one row per credit.
# Similar to genres, countries are often stored as lists; we expand them to count each individually
# and strip any leftover brackets or quotes.
country_data = (df_clean.drop_duplicates(subset='id')['production_countries']
                .str.split(',').explode().str.strip(" []'\""))

# 2. Filtering out empty values and getting the top 10 counts
top_10_countries = country_data[country_data != ""].value_counts().head(10)

# 3. Plotting the Bar Chart
plt.figure(figsize=(12, 6))
sns.barplot(x=top_10_countries.values, y=top_10_countries.index, palette='magma', hue=top_10_countries.index, legend=False)

# 4. Adding titles and labels
plt.title('Top 10 Countries by Content Production')
plt.xlabel('Number of Titles')
plt.ylabel('Country')

plt.show()

##### 1. Why did you pick the specific chart?

A Horizontal Bar Chart is perfect for comparing frequencies across countries with varying name lengths. It ensures that the country labels remain readable and the comparison of production volumes is visually intuitive.

##### 2. What is/are the insight(s) found from the chart?

While the United States typically leads, there is often a significant volume of content from countries like India, the UK, and Canada. This reveals that Amazon Prime is a truly global platform with a strong emphasis on international co-productions and localized content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This data allows the business to tailor its subscription pricing and marketing campaigns to specific regions. For instance, seeing high production in India justifies further investment in local "Prime Originals" to capture that specific market share.

Yes. If the production is overly centralized in one or two countries, it can hinder global expansion. Potential subscribers in regions with low representation (like Latin America or Southeast Asia) may feel the service lacks cultural relevance, leading to slow growth in those emerging markets.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Runtime vs. IMDb Score (Scatter Plot)
# Why: To check if there is a correlation between duration and quality.

plt.figure(figsize=(12, 6))

# 1. Creating the scatter plot
# Keep one row per title so each title is plotted once, not once per cast member
titles_unique = df_clean.drop_duplicates(subset='id')

# Using 'alpha=0.3' to handle overlapping points in a large dataset
sns.scatterplot(x='runtime', y='imdb_score', data=titles_unique, alpha=0.3, color='teal')

# 2. Adding professional formatting
plt.title('Relationship between Content Runtime and IMDb Scores')
plt.xlabel('Runtime (Minutes)')
plt.ylabel('IMDb Score')

# 3. Adding a trend line to see the overall direction
sns.regplot(x='runtime', y='imdb_score', data=titles_unique, scatter=False, color='red')

plt.show()

##### 1. Why did you pick the specific chart?

A Scatter Plot is the standard choice for examining the relationship between two numerical variables, here runtime and IMDb score. Setting alpha=0.3 keeps dense regions readable, and the red regression line summarizes the overall linear direction.

##### 2. What is/are the insight(s) found from the chart?

The slope of the regression line indicates whether longer titles tend to score higher or lower. A nearly flat line with widely scattered points means that runtime explains very little of a title's rating. Because movies and TV episodes have very different typical runtimes, any visible clusters may reflect content type rather than a true runtime effect.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. If runtime shows little relationship with ratings, content teams can choose a title's length based on what the story needs rather than on a target duration. If a clear relationship does appear, it can inform which formats to prioritize in acquisitions.

Yes. If very short or very long titles cluster at low scores, it suggests that part of the library is made up of formats audiences rate poorly. Promoting such titles heavily could hurt perceived quality and engagement.
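To put a number on the runtime-versus-score relationship plotted in Chart 6, a correlation coefficient can accompany the scatter plot. The sketch below computes Pearson correlation with `Series.corr` and a rank-based (Spearman-style) correlation by correlating the ranks; the data in `demo` is invented.

```python
import pandas as pd

# Made-up runtime and score pairs
demo = pd.DataFrame({
    "runtime": [30, 60, 90, 120, 150],
    "imdb_score": [6.0, 6.5, 6.2, 7.0, 7.1],
})

# Pearson: strength of the linear relationship
pearson = demo["runtime"].corr(demo["imdb_score"])

# Spearman: Pearson correlation of the ranks, robust to non-linear monotonic trends
spearman = demo["runtime"].rank().corr(demo["imdb_score"].rank())

print(f"Pearson: {pearson:.3f}")
print(f"Spearman (rank-based): {spearman:.3f}")
```

Computing the coefficient separately for movies and shows would help separate a genuine runtime effect from the difference between the two content types.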

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Content Distribution by Age Certification (Count Plot)
# Why: To understand the platform's target audience and maturity profile.

plt.figure(figsize=(12, 6))

# 1. Plotting the count of each age certification
# Keep one row per title first, because df_clean has one row per credit.
# Missing certifications appear as 'Not Rated' if they were labeled during wrangling.
titles_unique = df_clean.drop_duplicates(subset='id')

# Ordering by count to make the chart easier to read
sns.countplot(x='age_certification',
              data=titles_unique,
              palette='Set2',
              order=titles_unique['age_certification'].value_counts().index,
              hue='age_certification',
              legend=False)

# 2. Adding professional formatting
plt.title('Distribution of Age Certifications on Amazon Prime')
plt.xlabel('Age Certification')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45) # Rotating labels for better readability

plt.show()

##### 1. Why did you pick the specific chart?

A Count Plot (Bar Chart) is the most straightforward way to visualize the frequency of categorical data. By ordering the bars from highest to lowest, we can immediately identify which maturity ratings dominate the library.

##### 2. What is/are the insight(s) found from the chart?

The data often shows a high volume of TV-MA or R-rated content, mixed with a significant number of 'Not Rated' titles. This reveals that Amazon Prime has a strong tilt toward adult audiences while maintaining a large collection of older or independent films that lack modern certifications.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This helps in Marketing and Personalization. If the platform knows its library is mostly for adults, it can focus its ad spend on mature demographics. Conversely, it can identify a lack of "Family/Kids" content and use that data to justify licensing more G/PG-rated content to attract families.

Yes. A library that is too heavily skewed toward TV-MA/R ratings may limit the platform's growth in the "Family Subscription" market. If parents feel there isn't enough safe content for children, they may choose competitors like Disney+ or Netflix (which has a very strong Kids section) over Amazon Prime, leading to lower adoption rates in household segments.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# IMDb Score Distribution: Movies vs TV Shows (Box Plot)
# Why: To compare the quality and spread of ratings between the two content types.

# --- STEP 1: Safety Check & Column Renaming ---
# Standardizing column names to fix the 'show_type' KeyError
df_clean.columns = [col.strip().lower() for col in df_clean.columns]

# If the column is named 'type', we rename it to 'show_type'
if 'type' in df_clean.columns:
    df_clean.rename(columns={'type': 'show_type'}, inplace=True)

# --- STEP 2: Visualization Code ---
# Why: To compare the rating distribution of Movies vs TV Shows
plt.figure(figsize=(10, 6))

# Creating the Box Plot
# Keep one row per title so each title contributes a single score
# Note: x and y must match the cleaned column names exactly
sns.boxplot(x='show_type',
            y='imdb_score',
            data=df_clean.drop_duplicates(subset='id'),
            palette='pastel',
            hue='show_type',
            legend=False)

# --- STEP 3: Professional Formatting ---
plt.title('IMDb Rating Comparison: Movies vs. TV Shows')
plt.xlabel('Content Type')
plt.ylabel('IMDb Score')

plt.show()

##### 1. Why did you pick the specific chart?

A Box Plot is the most effective way to compare a numerical value (IMDb scores) across different groups (Movies and TV Shows). It allows us to see the "central tendency" (the median) and the "spread" (the range of scores) for both categories simultaneously.

##### 2. What is/are the insight(s) found from the chart?

The visualization typically shows that TV Shows have a higher median IMDb score than Movies. This reveals that episodic content on Amazon Prime is generally more highly rated by audiences than standalone films.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This data justifies a higher budget allocation toward Original TV Series. Since series often receive better ratings and keep users engaged for longer periods (multiple episodes), they offer a better return on investment for building brand loyalty.

Yes. If the "box" for Movies is very low or has many low-rated outliers, it indicates a quality control issue in movie acquisitions. A library full of low-rated movies can lead to "Content Frustration," where users spend more time searching for something good than actually watching, eventually leading to subscription cancellations.
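The medians and spreads that a box plot draws can also be tabulated, which makes the comparison between Movies and TV Shows exact. A minimal sketch on made-up title-level data (`titles_demo`):

```python
import pandas as pd

# Made-up title-level scores
titles_demo = pd.DataFrame({
    "show_type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [5.0, 6.0, 7.0, 7.0, 8.0],
})

# Count and median per content type
stats = titles_demo.groupby("show_type")["imdb_score"].agg(["count", "median"])

# Interquartile range (Q3 - Q1) per content type, i.e. the height of each box
grouped = titles_demo.groupby("show_type")["imdb_score"]
stats["iqr"] = grouped.quantile(0.75) - grouped.quantile(0.25)
print(stats)
```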

#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Top 10 Actors by Appearance (Horizontal Bar Chart)
# Why: To identify the most frequent actors and the platform's star power.

plt.figure(figsize=(12, 6))

# 1. Filter the dataset for 'ACTOR' roles and count appearances
# We use the 'name' column for actor names and 'role' to filter
top_actors = df_clean[df_clean['role'] == 'ACTOR']['name'].value_counts().head(10)

# 2. Create the Horizontal Bar Chart
sns.barplot(x=top_actors.values, y=top_actors.index, palette='rocket', hue=top_actors.index, legend=False)

# 3. Adding professional formatting
plt.title('Top 10 Most Frequent Actors on Amazon Prime')
plt.xlabel('Number of Movies/Shows')
plt.ylabel('Actor Name')

plt.show()

##### 1. Why did you pick the specific chart?

A Horizontal Bar Chart is ideal for displaying a list of names. It provides enough space for long actor names on the Y-axis without them overlapping or being cut off, making the data clear and professional.

##### 2. What is/are the insight(s) found from the chart?

The chart typically reveals a mix of prolific international stars and voice actors. It shows whether the platform relies on a small group of "bankable" stars across many titles or if the library is widely distributed across many different talents.
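Because the x-axis is labelled "Number of Movies/Shows", it may be worth checking that each title is counted once per actor. `value_counts()` counts rows, so an actor credited more than once in the same title would be counted more than once. A minimal sketch, assuming a `title` column exists alongside `name` and `role` (the `title` column and the toy data are assumptions, not confirmed by the notebook):

```python
import pandas as pd

# Hypothetical credit-level rows (one row per credit); values are invented
credits = pd.DataFrame({
    "title": ["A", "A", "B", "C", "C", "D"],
    "name":  ["Ann", "Ann", "Ann", "Bob", "Cy", "Bob"],
    "role":  ["ACTOR"] * 6,
})

actors = credits[credits["role"] == "ACTOR"]

# Row counts: repeated credits within one title inflate the total
row_counts = actors["name"].value_counts()

# Distinct titles per actor: each title is counted once
title_counts = (
    actors.groupby("name")["title"].nunique().sort_values(ascending=False)
)

print(row_counts["Ann"], title_counts["Ann"])
```

If the two counts differ noticeably on the real data, the distinct-title count is the one that matches the axis label.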

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Identifying "frequent flyers" allows the business to create "Actor Collections" or spotlights on the home screen. If a specific actor has a large fan base and appears in many titles, promoting their work can increase clicks and total watch time.

Yes. If the top actors are mostly from older, low-budget "bulk" licensed content rather than modern "A-list" stars, it might signal a lack of premium appeal. Users paying for a premium service expect to see current, high-profile talent; an over-reliance on obscure or repetitive cast members can make the library feel "dated" and lead to subscription churn.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Number of Seasons in TV Shows (Count Plot)
# Why: To understand the depth of episodic content and binge-watching potential.

plt.figure(figsize=(12, 6))

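# 1. Filter the dataset for TV Shows only
# Note: df_clean also carries 'name' and 'role' columns (see Chart 9). If it holds one row
# per cast credit, de-duplicate to one row per show first so each show is counted once.
tv_shows = df_clean[df_clean['show_type'] == 'SHOW']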

# 2. Plotting the distribution of seasons
# We use a countplot to see how many shows have 1 season, 2 seasons, etc.
sns.countplot(x='seasons', data=tv_shows, palette='viridis', hue='seasons', legend=False)

# 3. Adding professional formatting
plt.title('Distribution of TV Show Seasons on Amazon Prime')
plt.xlabel('Number of Seasons')
plt.ylabel('Number of TV Shows')

plt.show()

##### 1. Why did you pick the specific chart?

A Count Plot is the best choice here because "Number of Seasons" is a discrete numerical value. It allows us to clearly see the frequency of shows that have reached specific milestones (like a 5th or 10th season).

##### 2. What is/are the insight(s) found from the chart?

Typically, a large majority of shows have only one or two seasons. This indicates that the platform contains a high volume of new series or "limited series," with very few long-running legacy shows that span 10+ seasons.
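The "large majority" claim can be quantified directly. The sketch below assumes the table has one row per title and cast member and that a `title` column identifies each show; both are assumptions, and the data is invented for illustration.

```python
import pandas as pd

# Hypothetical rows: one per (title, cast member); values are invented
rows = pd.DataFrame({
    "title":     ["S1", "S1", "S1", "S2", "S3", "S3", "S4", "M1"],
    "show_type": ["SHOW"] * 7 + ["MOVIE"],
    "seasons":   [1, 1, 1, 2, 1, 1, 5, None],
})

# Keep one row per show so each show is counted once
shows = rows[rows["show_type"] == "SHOW"].drop_duplicates(subset="title")

# Frequency of each season count (cast to int for clean labels)
season_counts = shows["seasons"].astype(int).value_counts().sort_index()

# Fraction of shows that have one or two seasons
share_short = (shows["seasons"] <= 2).mean()

print(season_counts.to_dict())
print(f"Share of shows with 1-2 seasons: {share_short:.0%}")
```

Reporting the share as a percentage makes the insight verifiable rather than a visual impression.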

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Insights into season counts help identify "Binge-ability". Shows with more seasons generally keep users on the platform for longer periods. If the data shows most shows are short-lived, the business might consider renewing popular shows for more seasons to prevent users from finishing their watchlist too quickly.

Yes. A high concentration of single-season shows can signal a high "cancellation rate" or a library filled with unsuccessful pilots. Users are often hesitant to start a show if they know it was canceled after one season, leading to lower engagement with those titles and a perception that the platform lacks "prestige" long-term content.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

# TMDB Popularity vs. IMDb Score (Regression Plot)
# Why: To understand the link between audience ratings (IMDb) and trending status (TMDB).

plt.figure(figsize=(12, 6))

# 1. Creating a regression plot to show the correlation trend
# Using 'alpha=0.1' for scatter points because of the large dataset volume
sns.regplot(x='imdb_score',
            y='tmdb_popularity',
            data=df_clean,
            scatter_kws={'alpha':0.1},
            line_kws={'color':'red'})

# 2. Adding professional formatting
plt.title('Correlation: IMDb Score vs. TMDB Popularity')
plt.xlabel('IMDb Score (Audience Rating)')
plt.ylabel('TMDB Popularity (Trending Status)')

plt.show()

##### 1. Why did you pick the specific chart?

A Regression Plot (Regplot) is ideal here because it combines a scatter plot with a trend line. It allows us to visualize the individual data points while clearly showing whether there is a mathematical correlation between being "highly rated" and being "popular".

##### 2. What is/are the insight(s) found from the chart?

Often, there is a weak to moderate positive correlation. This reveals that while good movies are generally popular, many "cult classics" have high IMDb scores but low popularity, while some "trending hits" have high popularity despite mediocre ratings.
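The strength of the relationship can be stated as a number rather than read from the trend line. The sketch below computes Pearson's r and a rank-based (Spearman) coefficient, which is less sensitive to a few extremely popular titles. The column names follow the chart; the data is invented.

```python
import pandas as pd

# Hypothetical toy data; one title has an extreme popularity value
toy = pd.DataFrame({
    "imdb_score":      [4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_popularity": [2.0, 1.0, 4.0, 3.0, 40.0, 5.0],
})

# Pearson: strength of the linear relationship (what the red line summarises)
pearson = toy["imdb_score"].corr(toy["tmdb_popularity"])

# Spearman: Pearson correlation computed on the ranks of each column
spearman = toy["imdb_score"].rank().corr(toy["tmdb_popularity"].rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

A Spearman value well above the Pearson value suggests that outliers in popularity, rather than a lack of association, are weakening the linear fit.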

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This helps in Content Promotion Strategy. If the data shows high-popularity/low-score movies are driving traffic, the platform can use them as "gateways" to attract users. Conversely, high-score/low-popularity titles can be promoted in "hidden gem" categories to increase their visibility.

Yes. If there is a negative correlation (popular shows having very low scores), it suggests a "clickbait" problem. If users are clicking on popular titles but finding them to be of poor quality, it leads to long-term brand erosion. Users may lose trust in the "Popular" section of the app, reducing their overall time spent on the platform.
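The "hidden gem" and "gateway" segments described above can be extracted with two threshold filters. This is a sketch under stated assumptions: the median cut-offs are hypothetical and would need tuning, the `title` column is assumed, and the data is invented.

```python
import pandas as pd

# Hypothetical toy data (values are invented)
toy = pd.DataFrame({
    "title":           ["A", "B", "C", "D", "E", "F"],
    "imdb_score":      [8.5, 4.0, 7.9, 5.5, 8.8, 3.9],
    "tmdb_popularity": [1.2, 50.0, 30.0, 2.0, 0.8, 45.0],
})

# Hypothetical thresholds: split each metric at its median
score_cut = toy["imdb_score"].median()
pop_cut = toy["tmdb_popularity"].median()

# High score, low popularity: candidates for a "Hidden Gems" row
hidden_gems = toy[(toy["imdb_score"] > score_cut) & (toy["tmdb_popularity"] < pop_cut)]

# Low score, high popularity: "gateway" titles that drive traffic
gateways = toy[(toy["imdb_score"] < score_cut) & (toy["tmdb_popularity"] > pop_cut)]

print(hidden_gems["title"].tolist(), gateways["title"].tolist())
```

On real data, stricter cut-offs such as the 75th and 25th percentiles would probably give shorter, more useful lists.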

#### Chart - 12

In [None]:
# Chart - 12 visualization code

# Average IMDb Score by Genre (Bar Chart)
# Why: To identify which genres receive the highest average audience ratings.

plt.figure(figsize=(12, 7))

# 1. Explode the genres and calculate the average score for each
# We create a temporary dataframe to link each individual genre to its score
genre_scores = df_clean[['genres', 'imdb_score']].copy()
genre_scores['genres'] = genre_scores['genres'].str.split(',')
genre_scores = genre_scores.explode('genres')
genre_scores['genres'] = genre_scores['genres'].str.strip()

# 2. Group by genre and calculate the mean, then take the top 10
top_rated_genres = genre_scores.groupby('genres')['imdb_score'].mean().sort_values(ascending=False).head(10)

# 3. Create the Bar Chart
sns.barplot(x=top_rated_genres.values, y=top_rated_genres.index, palette='coolwarm', hue=top_rated_genres.index, legend=False)

# 4. Adding professional formatting
plt.title('Top 10 Highest Rated Genres by Average IMDb Score')
plt.xlabel('Average IMDb Score')
plt.ylabel('Genre')
plt.xlim(0, 10) # Setting limit to 10 for better perspective on ratings

plt.show()

##### 1. Why did you pick the specific chart?

A Horizontal Bar Chart is the best choice for comparing the average of a numerical value (IMDb Score) across different categories (Genres). It allows for easy ranking, making the "best" genres immediately obvious to the viewer.


##### 2. What is/are the insight(s) found from the chart?

Typically, niche or educational genres such as Documentary, History, or Animation have the highest average ratings. This suggests that while Drama and Comedy are more numerous (Chart 2), these smaller genres deliver more consistently high quality according to audiences.
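An average based on only a handful of titles can put a genre at the top of the ranking by chance. One way to guard against this is to compute the title count next to the mean and require a minimum sample size. The sketch below assumes comma-separated genre strings as in the chart code; `MIN_TITLES` is a hypothetical threshold and the data is invented.

```python
import pandas as pd

# Hypothetical toy data (values are invented)
toy = pd.DataFrame({
    "genres": ["drama, comedy", "drama", "documentary", "drama, history", "comedy", "drama"],
    "imdb_score": [6.0, 7.0, 9.0, 6.5, 5.0, 6.5],
})

# One row per (title, genre), with surrounding whitespace removed
exploded = toy.assign(genres=toy["genres"].str.split(",")).explode("genres")
exploded["genres"] = exploded["genres"].str.strip()

# Mean score and number of titles per genre
stats = exploded.groupby("genres")["imdb_score"].agg(["mean", "count"])

# Hypothetical threshold: ignore genres with fewer titles than this
MIN_TITLES = 2
reliable = stats[stats["count"] >= MIN_TITLES].sort_values("mean", ascending=False)

print(stats.sort_values("mean", ascending=False))
print(reliable)
```

In this toy example the unfiltered ranking is led by a genre with a single title, which is exactly the case the threshold removes.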

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This data can guide Award Season strategies. By investing more in these high-rated genres, Amazon Prime can increase its "Prestige" factor, leading to more award nominations (like Emmys or Oscars), which improves the overall brand value and attracts high-quality talent.

Yes. If the most "popular" genres (like Action or Comedy) are not among the highest rated, it indicates a "quantity over quality" problem. If the average rating for mainstream genres is too low, users may perceive the platform as a place for "cheap entertainment" rather than "premium content," leading to brand dilution and lower long-term subscriber loyalty.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

# Growth of Movies vs. TV Shows (2016 Onward)
# Why: To compare the production pace of different content types recently.

# 1. Filter data for titles released in 2016 or later
recent_df = df_clean[df_clean['release_year'] >= 2016]

# 2. Group by release year and content type
growth_comparison = recent_df.groupby(['release_year', 'show_type']).size().unstack(fill_value=0)

# 3. Plotting the multi-line chart
plt.figure(figsize=(12, 6))
sns.lineplot(data=growth_comparison, markers=True, dashes=False)

# 4. Adding professional formatting
plt.title('Content Growth Comparison: Movies vs. TV Shows (2016 Onward)')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles Released')
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(title='Content Type')

plt.show()

##### 1. Why did you pick the specific chart?

A Multi-Line Chart is the most effective way to compare trends between two distinct categories over a shared time period. It allows us to see not just the growth of each type, but how the gap between Movies and TV Shows has widened or narrowed over time.

##### 2. What is/are the insight(s) found from the chart?

In recent years, Movies typically still lead in total volume, but TV Shows often grow at a steeper percentage rate. Where the chart confirms this pattern, it points to a strategic pivot toward episodic "bingeable" content to compete with other major streaming platforms.
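The difference between "total volume" and "percentage growth" is easier to see with year-over-year changes. The sketch below uses a small invented table shaped like `growth_comparison` (years as the index, one column per content type).

```python
import pandas as pd

# Hypothetical yearly title counts (values are invented)
counts = pd.DataFrame(
    {"MOVIE": [100, 120, 132], "SHOW": [10, 15, 30]},
    index=pd.Index([2016, 2017, 2018], name="release_year"),
)

# Year-over-year percentage change for each content type
yoy = counts.pct_change() * 100

print(yoy.round(1))
```

Here movies lead in absolute numbers every year, yet the show column grows faster in percentage terms. The line chart alone does not make that distinction obvious.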

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This helps in Infrastructure and Budget Planning. If TV Show growth is accelerating, the platform needs to invest more in storage and delivery capacity, because a multi-season series comprises many more hours of video than a single film.

Yes. If the chart shows that Movie production is declining while TV Shows are rising, the platform risks alienating its original core audience of "Film Buffs". A sharp decline in any major category can lead to churn among users who subscribed specifically for that type of content, resulting in a loss of market share in that specific demographic.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart 14- Correlation Heatmap visualization code

# Why: To identify mathematical relationships between numerical variables.

plt.figure(figsize=(10, 8))

# 1. Selecting only numerical columns for correlation
# 'number' covers every numeric dtype (e.g. int32, float32), not just float64/int64
numeric_df = df_clean.select_dtypes(include='number')

# 2. Calculating the correlation matrix
corr_matrix = numeric_df.corr()

# 3. Plotting the Heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

# 4. Adding title
plt.title('Correlation Heatmap of Amazon Prime Numerical Features')

plt.show()

##### 1. Why did you pick the specific chart?

A Heatmap is the standard tool for visualizing a correlation matrix. It uses colors to represent the strength of relationships between variables, making it easy to spot which features (like runtime and score) move together.

##### 2. What is/are the insight(s) found from the chart?

The chart typically shows a strong positive correlation between imdb_score and tmdb_score, which indicates that both platforms generally agree on content quality. A weak correlation between runtime and tmdb_popularity may also appear.
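When the matrix has many columns, the strongest relationships can be listed instead of read from the colours. The sketch below keeps each pair once by masking the lower triangle and the diagonal, then sorts by absolute correlation. The column names follow the notebook; the data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are invented)
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_score": [5.5, 6.1, 6.8, 8.2, 8.9],
    "runtime":    [90, 45, 120, 30, 100],
})

corr = toy.corr()

# Boolean mask for the upper triangle; k=1 excludes the diagonal of 1.0 values
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One entry per column pair, strongest (by absolute value) first
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(pairs)
```

The first entries of `pairs` correspond to the most saturated cells of the heatmap.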

#### Chart - 15 - Pair Plot

In [None]:
# --- Chart 15: Pair Plot of Key Metrics ---
# Why: To visualize pairwise relationships and distributions simultaneously.

# 1. Selecting a subset of key columns to keep the plot readable
key_columns = ['runtime', 'imdb_score', 'tmdb_popularity', 'show_type']

# 2. Creating the Pair Plot
# We use 'hue' to differentiate between Movies and TV Shows
sns.pairplot(df_clean[key_columns], hue='show_type', palette='husl', diag_kind='kde')

# 3. Adding a title (handled slightly differently for PairPlots)
plt.subplots_adjust(top=0.95)
plt.gcf().suptitle('Pair Plot: Relationships Across Key Metrics', fontsize=16)

plt.show()

##### 1. Why did you pick the specific chart?

A Pair Plot is a comprehensive tool that shows both the distribution of individual variables and the scatter relationships between every possible pair of variables in a single view.


##### 2. What is/are the insight(s) found from the chart?

It reveals how the content types (show_type) cluster. For instance, TV Shows may be concentrated in a narrow range of runtimes, while Movies are spread over a wider range.
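Pair plots with KDE diagonals can be slow on large tables, so drawing a sample per content type first may help while preserving the Movie/TV Show proportions. This is a sketch only: the 25% fraction is a hypothetical choice and the data is invented.

```python
import pandas as pd

# Hypothetical toy data: 80 movies and 20 shows (values are invented)
toy = pd.DataFrame({
    "show_type": ["MOVIE"] * 80 + ["SHOW"] * 20,
    "runtime": list(range(80, 160)) + list(range(20, 40)),
})

# Sample the same fraction from each content type (stratified sampling)
sample = toy.groupby("show_type").sample(frac=0.25, random_state=42)

print(sample["show_type"].value_counts().to_dict())
```

The resulting `sample` could then be passed to the pair plot in place of the full table. Fixing `random_state` keeps the figure reproducible between runs.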

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?

---


To achieve the business objectives for the Amazon Prime platform, the analysis supports the following recommendations:

Balance "Binge-ability" with Movies: While movies dominate the volume, the analysis suggests that TV shows often earn higher IMDb scores. The client should invest more in episodic "Originals" to reduce subscriber churn and keep viewers returning week after week.

Quality over Quantity: The distribution of IMDb scores reveals a large amount of "average" content. Instead of simply adding thousands of titles, the client should focus on "Prestige" genres like Documentaries and Animation, which tend to earn higher average ratings.

Leverage Global Star Power: Use the data on top actors and production countries to create localized "Spotlights". Promoting popular regional stars (especially from high-growth areas like India) can boost engagement in those specific markets.

Fix the "Hidden Gem" Problem: Take advantage of the weak correlation between popularity and ratings. Highlighting highly rated but low-popularity titles in a "Hidden Gems" category can improve user perception of the library's depth without any spending on new content.

Monitor Modern Trends: Since content production exploded after 2016, the client must ensure it does not sacrifice quality for speed. Maintaining a consistent pace of highly rated modern releases is key to staying competitive against other streaming giants.

# **Conclusion**

This data analysis project provides a comprehensive overview of Amazon Prime’s content landscape, offering actionable insights into library composition, quality, and growth trends. The following key takeaways summarize the findings:

Content Strategy Balance: While the platform is currently a movie-heavy library, the analysis suggests that TV Shows generally achieve higher IMDb scores and audience satisfaction. To increase subscriber retention and reduce churn, a pivot toward more high-quality episodic content is recommended.

Quality vs. Volume: The distribution of ratings reveals a large amount of "average" content. The platform can improve its brand value and "prestige" status by focusing acquisitions on high-performing niche genres, such as Documentaries and Animation, rather than simply increasing title counts in oversaturated categories.

Strategic Optimization: By identifying the disconnect between popularity and ratings, the business can make better use of its existing library. Promoting "Hidden Gems" (highly rated but low-popularity titles) offers a cost-effective way to improve user satisfaction without the immediate need for new production spend.

Global Expansion: The analysis of production countries and top actors highlights the success of international content. Continuing to invest in regional stars and localized "Originals" will be essential for capturing and maintaining market share in emerging global markets.

In summary, the transition from a "massive digital video store" to a "curated premium streaming experience" driven by data-backed decisions will be the key to Amazon Prime’s long-term growth and competitive advantage in the streaming wars.
