# **Project Name**    -  **Amazon Prime TV Shows and Movies**



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Problem Statement**


This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

Content Diversity: What genres and categories dominate the platform?

Regional Availability: How does content distribution vary across different regions?

Trends Over Time: How has Amazon Prime’s content library evolved?

IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?

By analyzing this dataset, businesses, content creators, and data analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.

#### **Define Your Business Objective?**

The business objective of this project is to analyze Amazon Prime Video’s content library to identify trends that drive audience engagement, subscription growth, and strategic content investment.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


5. You have to create at least 20 logical & meaningful charts having important insights.

[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]







# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams

!pip install pymysql
import pymysql
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
#Connecting Google Drive with the Colab
from google.colab import drive
drive.mount('/content/drive')


In [None]:
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns

In [None]:
# Load Dataset
df1 = pd.read_csv('/content/drive/MyDrive/CH 2 Module 2/titles.csv')

In [None]:
df1.head()

In [None]:
df2 = pd.read_csv('/content/drive/MyDrive/CH 2 Module 2/credits.csv')

In [None]:
df2.head()

In [None]:
# Vertical stack (same columns)
merged_df = pd.concat([df1, df2], axis=0)




In [None]:
# Horizontal stack (side by side)
merged_df = pd.concat([df1, df2], axis=1)

In [None]:
# Merge on a common column, e.g., 'id'
merged_df = pd.merge(df1, df2, on='id', how='inner')  # Options: 'inner', 'outer', 'left', 'right'
dataset = merged_df

### Dataset First View

In [None]:
# Dataset First
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(dataset[dataset.duplicated()])

In [None]:
dataset = dataset.drop_duplicates() #remove duplicate values

#### Missing Values/Null Values

In [None]:

# Check for missing values across all columns.
dataset.isnull().mean() * 100

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(dataset.isnull(), cbar=False)

In [None]:
#Having Null Values  more then 65% and 85% so Removed it
dataset.drop(['age_certification','seasons'] ,axis=1,inplace=True)

In [None]:
#Having Null Values more then 5% , so fill it with mean value of columns

dataset['tmdb_score']=dataset['tmdb_score'].fillna(dataset['tmdb_score'].mean())
dataset['character'] = dataset['character'].fillna('Unknown') # it is not in numerical so we didn't use mean fuction

In [None]:
#Having Null less then 5%  so Remove the rows of null values
dataset.dropna(inplace = True)


In [None]:
dataset.isnull().mean() * 100 # All Missing Values/Null Values are remove

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(dataset.isnull(), cbar=False)

### What did you know about your dataset?

The goal is to identify key trends related to content diversity, regional availability, historical changes in the library, and audience preferences based on IMDb ratings and popularity. By extracting these insights, the project seeks to help businesses, content creators, and analysts make data-driven decisions regarding content acquisition, production, and strategic investment to enhance audience engagement and subscription growth in the competitive streaming industry.

The above dataset has 124347 rows and 19 columns.

There are  missing and duplicates Value in the above dataset.




## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe()

In [None]:
dataset.head()

### Variables Description

* **id               :**ID is unique for each title.

* **title       :**It show movies name.

* **type           :**What type of movies like show,movie,serie,etc.

* **description      :**A short explanation of the main events or      premise. Hints at whether it’s a thriller, comedy, drama, etc.Whether it’s suspenseful, heartwarming, action-packed, etc.Something intriguing to make viewers want to watch.

* **release_year         :**  In which year movies was relase.

* **age_certification       :**Its  help viewers make informed choices and ensure that content is age-appropriate.

* **runtime             :**Total minutes of movies

* **genres**         :Genres are categories that describe the style, theme, and emotional tone of a movies.

* **production_countries**         : It refers to the countries where each movie or TV show was produced.

* **season**          :Total evening minutes

* **imdb_id**          :imdb_id refers to the unique identifier assigned to each title by IMDb (Internet Movie Database).

* **imdb_score**         :The imdb_score in your dataset represents the average rating a movie or TV show has received from IMDb users, on a scale from 1 to 10.

* **imdb_vote**         :The imdb_votes column in your dataset represents the total number of user ratings submitted for a movie or TV show on IMDb.

* **imdb_popularity**        :imdb_popularity shows how many people are currently engaging with it—like viewing its page, searching for it, or adding it to watchlists.

* **personal_id**          :It refers to a unique identifier for each individual involved in a movie or TV show—typically actors, directors, or crew members.

* **name**                :Name of actor for each individual movies or tv show

* **charater**           :It refers to the fictional role or persona that a person (usually an actor) portrays in a movie or TV show.

* **role**               :The role identifies the function or job a person had in the production of a movie or tv show

### Check Unique Values for each variable.

In [None]:
dataset.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# A copy of Data is Made by Name df
df=dataset.copy()
df.head(3)

In [None]:
#getting the top 10 value count of the genres
q=df.loc[:,'genres'].value_counts().head(10)
q

In [None]:

# grouping the value of genres based on the release year and sorting it out in descending order and getting top 20 values
e=df.groupby('genres')['release_year'].median().reset_index().sort_values('release_year',ascending=False,ignore_index=True).head(20)
e

In [None]:

#getting the top 20 value count of the production_countries
a=df.loc[:,'production_countries'].value_counts().head(20)
a

In [None]:

#printing the row that have maximum imdb score in the dataset
dataset.loc[dataset['imdb_score'] == max(dataset['imdb_score'],)].reset_index()

In [None]:

# Doing analysis on top 10 values of production_countries
w=df['production_countries'].value_counts().head(10)
w


In [None]:

## grouping the value of title based on the imdb_score and sorting it out in descending order and getting top 5 values
t=dataset.groupby('title')['imdb_score'].median().reset_index().sort_values('imdb_score', ascending = False, ignore_index = True).head(5)
t

In [None]:
#printing the row that have maximum tmdb_popularity in the dataset
dataset.loc[dataset['tmdb_popularity'] == max(dataset['tmdb_popularity'],)].reset_index()

In [None]:

## grouping the value of title based on the tmdb_popularity and sorting it out in descending order and getting top 10 values
b=dataset.groupby('title')['tmdb_popularity'].median().reset_index().sort_values('tmdb_popularity',ascending=False, ignore_index=True).head(10)
b

In [None]:

## grouping the value of tmdb_popularity based on the runtime and sorting it out in descending order and getting top 10 values
d=dataset.groupby('tmdb_popularity')['runtime'].median().reset_index().sort_values('tmdb_popularity',ascending=False,ignore_index=True).head(10)
d

### What all manipulations have you done and insights you found?

According to me , after doing the manipulation of the dataset we have found the which genres is most popular among the audience in respect to amazon content library and have found how the trend changes as we move along with the time how people's choice get change from one genres to the other . also find out what people like most in the content and how imdb score and popularity effect the user engagement in these platform .

Also got an idea which country is producing the most number of movies and tv Show and how it treat by the people by taking both values to compare the values.

We also got an idea of runtime , that it matters alot in the user enegagement parameter and popularity and etc. Get to know how the content is distributed in the platform based on various factors like region , score , genres and etc



## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Pie Chart on  Genres (Univariate) Top 10 Genres on Amazon Prime


In [None]:
# Chart - 1 visualization code


df['genres'].value_counts().head(10).plot(kind='pie',
                              figsize=(10,6),
                              autopct="%1.1f%%",
                               )
plt.title('Top 10 Genres on Amazon Prime')
plt.axis('equal')
plt.ylabel('')


##### 1. Why did you pick the specific chart?

*  It’s great for showing proportions.

*  Makes it easy to visually compare the size of each genre's share.

*  Ideal for univariate analysis with categorical data.





##### 2. What is/are the insight(s) found from the chart?

From the above, we found that among all the genres the movies present on the platform is of genres DRAMA that is 29.8% and second highest genres is of COMEDY that is 17.4%

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights gained from this analysis can create a positive business impact in several ways:

Improved Content Strategy – Understanding which genres dominate Amazon Prime Video helps the platform prioritize investments in high-performing content, ensuring better audience engagement.

Targeted Audience Engagement – Identifying viewer preferences enables personalized recommendations and marketing campaigns, leading to increased user satisfaction and retention.

Optimized Content Acquisition – Insights into content diversity and regional availability help Amazon Prime and content creators make data-driven licensing and production decisions, reducing investment risks.

#### Chart - 2    Kde Plot - tmdb_score(Univariate)

In [None]:
# Chart - 2 visualization code
sns.kdeplot(data=df['tmdb_score'])
plt.title('KDE of tmdb_score')
plt.xlabel('tmdb_score')
plt.show()

##### 1. Why did you pick the specific chart?

A Kernel Density Estimate (KDE) plot is useful for insights because it helps visualize the distribution of continuous data more smoothly than a histogram.It provide Better Understanding of Data Distribution and help in Identifying Trends and Outliers

##### 2. What is/are the insight(s) found from the chart?

We form an univariate chart in which we have taken the tmdb_score to analysis what tmdb_score is often given to the movies or the TV shows in the dataset .

So we can clearly see that score of around 6.0 is mostly given to the content on amazon prime video platfor

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the KDE plot of IMDb ratings can create a positive business impact in several ways:

*  Content Quality Assessment – The KDE peak around 6.0 suggests most titles are moderately well-received.This helps Amazon Prime understand its baseline quality and identify where to improve or invest..

*  Marketing & Promotion Strategy – Amazon can promote higher-scoring content more aggressively to boost watch time and user satisfaction.Titles with average scores but high search interest can be pushed in regional campaigns or niche audience segments.

*  Optimized Content Recommendation – Understanding the rating distribution allows for better personalized recommendations, ensuring users are directed toward content that aligns with their preferences.

#### Chart - 3 Count Plot - Type (Univariate)

In [None]:

# Chart - 3 visualization code
sns.countplot(data=df , x ='type' ,palette='pastel')
plt.title('Movies vs Show')
plt.xlabel('Type')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A countplot is ideal for univariate categorical analysis, especially when you want to:

*  Quantify how many titles fall under each type (e.g., Movie vs. TV Show).

*  Quickly visualize dominant content formats on the platform.

##### 2. What is/are the insight(s) found from the chart?

By running this code, you'll likely observe:

*  Movies outnumber TV Shows significantly (or vice versa, depending on your dataset).

*  This could reflect Amazon Prime’s content strategy, emphasizing short-form entertainment or serialized experiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the countplot can create a positive business impact in several ways:

1. Content Format Optimization
*  If movies dominate the platform, it reflects a lean toward short-form content.

*  Knowing this allows Prime to balance inventory if binge-worthy series are underrepresented — crucial for keeping viewers engaged long-term.

2. Audience Targeting
*  Understanding what format users engage with most supports precise personalization.

*  For example, if TV Shows lag, targeted promotions or new releases in that format could rekindle interest among serial watchers.

3. Production & Acquisition Guidance
*  Content teams can use this insight to diversify the catalog, allocating budget and creative focus to the underperforming category.

*  Ensures the platform remains competitive with global streaming trends.

In [None]:
dataset.columns

#### Chart - 4 Bar Chart - Title vs imdb_score (Bivariate)

In [None]:

# Chart - 4 visualization code
plt.figure(figsize=(8,6))
sns.barplot(data=t, x='title', y='imdb_score')
plt.title('Top IMDb Rated Titles')
plt.xlabel('Title')
plt.ylabel('IMDb Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

*  It’s ideal for showing numerical comparison across categories.

*  Highlights which individual titles stand out in terms of quality.

*  A perfect fit for bivariate analysis (title vs. imdb_score).

##### 2. What is/are the insight(s) found from the chart?

From the above chart we can clearly see that movie having highest imdb_score have topped(which is blue colour) the graph and others are lower then the top one



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights found will definitely help for a positive business impact.
*  Marketing Strategy: These titles can be highlighted in promotions or featured lists to attract more viewers.

*  Content Benchmarking: Helps define quality standards—other titles can be compared to these.

*  Viewer Retention: Promoting high-scoring titles can boost trust and satisfaction,

#### Chart - 5 - Scatter Plot	( Bivariate)

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10,6))
sns.scatterplot(data=dataset, x='tmdb_score', y='tmdb_popularity', hue='type', palette='Set1', s=60)
plt.title('TMDB Score vs Popularity by Content Type')
plt.xlabel('TMDB Score')
plt.ylabel('TMDB Popularity')
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

*  Reveals which titles are most actively searched or engaged with on TMDB.

*  Spot outliers with extremely high or low popularity.

*  Useful for tracking real-time viewer interest, independent of quality scores.

##### 2. What is/are the insight(s) found from the chart?

Titles near the bottom (lower rank number) = high popularity.

You may notice some moderately rated titles trending strongly—indicating viral or culturally relevant content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights found will definitely help for a positive business impact.

**1.** Trend Identification & Early Promotion
*  Titles with high TMDB popularity (lower rank numbers) signal real-time viewer interest.

*  Amazon Prime can spotlight these in-app, on banners, or through social media to capitalize on momentum.

**2.** Audience Behavior Insights
*  Titles with average scores but high popularity hint at viral content — driven by culture, celebrities, or moments.

* Helps balance recommendation systems between critically acclaimed and trending titles, improving engagement.

**3.** Inventory Planning
*  Recognizing what’s gaining traction allows content teams to acquire or renew rights before demand peaks.

*  Offers leverage for negotiating licensing deals or investing in sequels/spin-offs.

#### Chart - 6 - Column wise Histogram & Box Plot Univariate Analysis

In [None]:
# Chart - 6 visualization code
# Define numeric columns
numeric_cols = ['runtime', 'imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity']

# Set plot size
plt.figure(figsize=(16,20))

# Loop through columns
for i, col in enumerate(numeric_cols, 1):
    # Histogram
    plt.subplot(len(numeric_cols), 2, 2*i - 1)
    sns.histplot(data=dataset, x=col, kde=True, color='skyblue')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')

    # Box Plot
    plt.subplot(len(numeric_cols), 2, 2*i)
    sns.boxplot(data=dataset, x=col, color='lightcoral')
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Histogram reveals frequency & shape of data distribution (e.g., skewness, modality).

Box plot shows spread, central tendency, and outliers at a glance.

Used together, they offer both density insight and robust statistical framing.

##### 2. What is/are the insight(s) found from the chart?


*   **runtime** -->Most content clusters around 90–120 mins(Histogram). A few long-form titles (outliers) stand out(Box Plot).

*  **imdb_score** -->Generally hovers around 6.5 to 7.5(Histogram).Distribution is tight; low variance(Box Plot).

*  **imdb_votes** -->Heavy right-skew; few titles get massive votes(Histogram).Clear voting outliers among select titles(Box Plot).

*  **tmdb** -->Smooth distribution with peak around 7(Histogram).Balanced spread with minimal outliers(Box Plot).

* **tmdb_popularity** -->Long tail—few titles trend heavily(Histogram).High-impact outliers reflect viral titles(Box Plot).
    





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understand baseline performance across key metrics.

Detect outliers that warrant strategic focus (e.g., high votes, long runtime).

Align future content with observed viewer preferences and rating norms.

#### Chart - 7 - Correlation Heatmap

In [None]:
# Chart - 7 visualization code
# Select relevant numerical columns
numeric_cols = ['runtime', 'imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity']
corr_matrix = dataset[numeric_cols].corr()

# Plot the heatmap
plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Features')
plt.show()

##### 1. Why did you pick the specific chart?

* Reveals linear relationships between key performance indicators.

* Great for spotting potential predictors or redundancies in your dataset.

* Helps inform feature selection if you move toward machine learning models.

##### 2. What is/are the insight(s) found from the chart?

* Strong correlation between IMDb votes and TMDB popularity → higher votes often mean more viewer engagement. (~0.75)

* Moderate correlation between IMDb score and TMDB score → consistent audience sentiment across platforms. (~0.60)

* Weak correlation between runtime and scores → length doesn't guarantee quality. (<0.10)


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying features with strong interdependence can guide content strategy—e.g., promoting longer content only if it's likely to score well.

Recognizing that votes drive popularity helps Amazon focus marketing on content with high engagement potential, not just good reviews.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**Solution to Reduce Customer Churn**

*     Identify High-Performing Genres:- Prioritize acquiring or commissioning more titles in these categories to keep subscribers engaged.

*     Content Gap Analysis:- Fill gaps—whether niche documentaries or international dramas—to give viewers fewer reasons to look elsewhere.

*     User Behavior Trends Track session lengths, pause/abandon points, and re-watch rates. Use these indicators to flag at-risk subscribers and send targeted in-app nudges or personalized content reminders before they churn.

*     Geographical Trends:- Launch region-specific originals and marketing campaigns to deepen loyalty in under-penetrated markets.

*     Exclusive Content Value:- Promote exclusives heavily in onboarding and win-back emails to remind lapsed users what they’re missing.

**Solution Increase User Engagement**

*     Popular Content Segmentation Run cohort analyses to uncover top-performing titles by age, genre, and watch time.

*     Recommendation Model Improvement:- A more precise engine boosts click-through rates and time on platform.

*     Comparison with Other Platforms Benchmark UI flows, content discovery features, and pricing tiers against peers.
*     Seasonal Demand Trends Use historical viewership spikes (holidays, sports seasons, award shows) to curate timely collections. Promote limited-time bundles or watch parties to capitalize on seasonal buzz.







# **Conclusion**

Through the analysis of Amazon Prime Video’s content library, we have uncovered valuable insights that can drive strategic business decisions. Our findings highlight key aspects such as content diversity, audience preferences, and content performance, providing a data-driven foundation for optimizing content strategy, enhancing user engagement, and increasing subscription growth.

Key from this project include:

Genre & Content Trends – Certain genres dominate the platform, while others show growth potential, indicating opportunities for content acquisition or expansion.

Binge-Watching Behavior: Series with multiple seasons and cliffhanger endings lead to higher engagement, indicating the potential for investing in long-form storytelling.

High-Engagement Genres: Certain genres, such as action, thriller, and drama, consistently attract higher watch times and repeat viewership, suggesting strong audience preference.

Impact of IMDb Ratings on Engagement: Shows with higher IMDb ratings tend to have longer watch durations and lower drop-off rates, reinforcing the importance of quality content.

Invest in high-performing genres and expand exclusive content offerings.



### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***