# **Project Name**    - Amazon_prime_Exploratory_Data_Analysis

![Amazon Prime Video logo](https://logos-marcas.com/wp-content/uploads/2021/03/Amazon-Prime-Video-Emblema.jpg)




##### **Project Type**    - Exploratory Data Analysis (EDA)
##### **Contribution**    - Individual


# **Project Summary -**

Amazon Prime Video is one of the leading global streaming platforms that offers a wide range of movies, TV shows, and original content across various genres. With millions of subscribers worldwide, Prime Video faces immense competition from platforms like Netflix, Disney+, and Hulu. To attract and retain audiences, content quality plays a crucial role.

This project is based on a dataset containing 124,347 records across 19 attributes. The dataset provides rich information such as title, genre, release year, runtime, IMDb and TMDb ratings, votes, cast details, and production countries. Initial analysis reveals a striking insight — only 5–10% of the content achieves high ratings on IMDb and TMDb, suggesting that there are hidden patterns behind the success of certain shows and movies.

The goal of this project is to uncover these patterns and generate actionable insights that can help production houses, streaming platforms, and content creators optimize their strategies, save time, and invest in creating content that resonates strongly with audiences.

Approach Used:

The project follows these steps:

1) Loading our Data: We loaded the dataset in Google Colab and read the CSV file containing 124,347 records and 19 columns.

2) Data Cleaning and Processing: Removed duplicate entries. Treated missing values by either dropping them or filling with appropriate assumptions. Standardized data types (for example, converting runtime into consistent format, ensuring IMDb scores were numeric). Cleaned categorical columns like genres and production countries for uniform analysis.

3) Analysis and Visualization: We explored key variables such as genres, release year, runtime, IMDb score, TMDb score, production countries, and age certifications to identify trends and patterns. Analyzed which genres and production countries perform the best. Compared the success of TV shows vs. movies. Studied the relationship between runtime and ratings. Examined IMDb and TMDb scores to highlight high-performing content. Answered hypothetical questions to uncover the factors behind why only 5–10% of the content performs well.

4) Future Scope of Further Analysis: The dataset provides opportunities for deeper insights. For example: Identifying trends in popular genres over decades. Analyzing the effect of release years and seasons on ratings. Studying the impact of production countries on content success. Exploring more micro-trends such as which age certifications lead to higher audience approval.

* Types of Graphs Used for Data Visualization: Count Plot, Bar Plot, Scatter Plot, Heatmap, Box Plot

* Python Libraries Used: Matplotlib and Seaborn for plotting, NumPy and Pandas for data handling
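The cleaning steps listed in step 2 above can be sketched on a toy DataFrame. The column values below are made up for illustration, and the median fill is only one of the "appropriate assumptions" the step mentions; it is a minimal sketch, not the exact procedure applied to the full dataset.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration)
raw = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "runtime": ["90", "90", "45", None],
    "imdb_score": ["7.1", "7.1", "n/a", "6.0"],
})

# 1. Remove exact duplicate rows
clean = raw.drop_duplicates().copy()

# 2. Standardize types: anything that is not a number becomes NaN
for col in ["runtime", "imdb_score"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

# 3. Fill the remaining numeric gaps with each column's median
clean = clean.fillna(clean[["runtime", "imdb_score"]].median())

print(clean)
```

Converting with `errors="coerce"` before filling means placeholder strings such as "n/a" are treated the same way as genuinely missing values.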


# **GitHub Link -**

https://github.com/YUVRAJSONDHIYA

# **Problem Statement**


This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

* Content Diversity: What genres and categories dominate the platform?
* Regional Availability: How does content distribution vary across different regions?
* Trends Over Time: How has Amazon Prime’s content library evolved?
* IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?

By analyzing this dataset, businesses, content creators, and data analysts can uncover key trends that influence subscription growth, user engagement, and content investment strategies in the streaming industry.

#### **Define Your Business Objective?**


Identify High-Performing Content: Analyze IMDb and TMDb ratings to determine the characteristics of shows and movies that perform exceptionally well on the platform.

Understand Genre and Category Trends: Discover which genres and categories attract the most viewership and critical acclaim to guide future content investments.

Evaluate Regional Preferences: Assess the distribution of content across different production countries to identify opportunities for regional targeting and expansion.

Examine Content Evolution: Study how Amazon Prime’s content library has evolved over time in terms of genres, runtime, and quality to predict future trends.

Optimize Content Strategy: Provide insights that help production houses and Amazon Prime focus on producing content with the highest potential for user engagement and subscription growth.

Enhance Audience Satisfaction: Understand the impact of factors such as runtime, age certification, and release year on viewer ratings to improve content relevance and audience satisfaction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

# Load the 'credits' dataset from the CSV file
credits = pd.read_csv("/content/credits.csv")

# Load the 'titles' dataset from the  CSV file
titles = pd.read_csv("/content/titles.csv")

# Merge the two datasets on the common column 'id'
# Perform an inner join to keep only the rows where 'id' exists in both DataFrames
data = pd.merge(credits,titles,on='id',how='inner')
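Because `credits` has one row per person and `titles` has one row per title, the inner join above produces one row per (title, person) pair. The sketch below illustrates this on two toy DataFrames (`titles_demo` and `credits_demo` are hypothetical stand-ins with made-up values) and uses the `indicator` option of `pd.merge` to see which ids fail to match.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (made-up values)
titles_demo = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
credits_demo = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t9"],
    "name": ["Ann", "Bob", "Cid", "Dee"],
})

# indicator=True adds a _merge column showing where each row came from
check = pd.merge(credits_demo, titles_demo, on="id", how="outer", indicator=True)
print(check["_merge"].value_counts())

# The inner join keeps one row per (title, person) pair
merged = pd.merge(credits_demo, titles_demo, on="id", how="inner")
print(len(merged), "rows,", merged["id"].nunique(), "unique titles")
```

The row count of the merged data is therefore a count of credits, not of titles; counting titles requires `nunique()` on `id` or dropping duplicate ids first.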


### Dataset First View

In [None]:
# Dataset First Look
# Display the first 5 rows of the merged dataset
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12, 6))
sns.heatmap(data.isnull(),
            cmap='rocket',  # better color scheme
            cbar=False,
            yticklabels=False)  # hides row numbers for cleaner view
plt.title('Heatmap of Missing Values', fontsize=15)
plt.xlabel('Columns')
plt.tight_layout()
plt.show()

### What did you know about your dataset?

This dataset contains detailed information about Amazon Prime TV Shows and Movies, comprising 124,347 records across 19 attributes. Key columns such as imdb_score, tmdb_score, imdb_votes, tmdb_popularity, name (actor/crew), and title provide valuable insights into the relationship between content characteristics and their popularity or audience ratings. These attributes enable us to analyze patterns in viewer engagement, evaluate the impact of cast and crew on content performance, and explore trends across genres, countries, and age certifications. This forms a strong foundation for extracting meaningful insights and making data-driven decisions.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
list(data.columns)

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description

* person_id : Unique identifier of the person (cast or crew member)

* id : Unique identifier of the title

* name : Name of the actor or crew member

* character : Name of the character played in the content

* role : Role of the person in the content (for example, actor or director)

* title : Name of the content

* type : Type of content (Movie or Show)

* description : Short synopsis of the movie or show

* release_year : Year in which the content was released

* age_certification : Age rating of content

* runtime : Duration of movies/shows, in minutes

* genres : Genres the content belongs to

* production_countries : Countries where the content was produced

* seasons : Number of seasons (only applicable for TV Shows)  

* imdb_id : Unique identifier for the content on IMDb  

* imdb_score : IMDb rating score for the content (0 to 10 scale)

* imdb_votes : Total number of votes on IMDb

* tmdb_popularity : Popularity score from TMDb

* tmdb_score : TMDb rating score (0 to 10 scale)


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Loop through each column in the DataFrame and print the number of unique values for that column
for i in list(data.columns):
  print("No. of unique values in ",i,"is",data[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Creating a copy of the current dataset and assigning to df
df = data.copy()

In [None]:
# Analysing the null values
df.isnull().sum()

In [None]:
# Analysing the null values in percentage
(df.isnull().sum()/len(df))*100
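One way to turn the percentages above into a decision is to compare them with a cut-off. The sketch below uses a toy DataFrame and a hypothetical 50% threshold; both are assumptions for illustration rather than values taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values
demo = pd.DataFrame({
    "seasons": [np.nan, np.nan, np.nan, 2.0],
    "imdb_score": [7.0, np.nan, 6.5, 8.0],
    "title": ["A", "B", "C", "D"],
})

# Share of missing values per column, in percent
missing_pct = demo.isnull().mean() * 100

# Hypothetical cut-off: flag columns that are more than half empty
THRESHOLD = 50
to_drop = missing_pct[missing_pct > THRESHOLD].index.tolist()
print(to_drop)
```

Columns below the threshold stay in the data and become candidates for imputation instead.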

## Filling the missing values

In [None]:
# Dropping 'age_certification' and 'seasons' columns due to high percentage of missing values.
# (above 50%), which may lead to unreliable analysis and cannot be imputed meaningfully.

df.drop(["age_certification","seasons"],axis=1,inplace=True)

In [None]:
# Replace missing values in the 'character' column with the most frequent value (mode)
character_mode = df['character'].mode()[0]

df.fillna({"character": character_mode}, inplace=True)

In [None]:
# Replace missing values in the 'description' column with the most frequent value (mode)
description_mode = df["description"].mode()[0]

df.fillna({"description": description_mode}, inplace=True)

In [None]:
# replace missing values in the 'imdb_id' column with the most frequent value (mode)
imdb_id_mode = df['imdb_id'].mode()[0]

df.fillna({'imdb_id':imdb_id_mode},inplace=True)

In [None]:
# Check the skewness of 'imdb_score' to decide an appropriate imputation method
skewness = df["imdb_score"].skew()  # Negative skew indicates a left-skewed distribution

# Fill missing values in 'imdb_score' with the median
imdb_score_median = df["imdb_score"].median()
df.fillna({"imdb_score": imdb_score_median}, inplace=True)

In [None]:
# Check the skewness of 'imdb_votes' to determine the best imputation method
skewness = df["imdb_votes"].skew()  # Positive skew indicates a right-skewed distribution

# Fill missing values in 'imdb_votes' with the median
imdb_votes_median = df["imdb_votes"].median()
df.fillna({"imdb_votes": imdb_votes_median}, inplace=True)

In [None]:
# Check the skewness of 'tmdb_popularity' to determine the best imputation method
skewness = df['tmdb_popularity'].skew()

# fill missing values of 'tmdb_popularity' with median
tmdb_popularity_median = df['tmdb_popularity'].median()
df.fillna({"tmdb_popularity":tmdb_popularity_median},inplace=True)

In [None]:
# Check the skewness of 'tmdb_score' to determine the best imputation method
skewness = df["tmdb_score"].skew() # Negative skew indicates a left-skewed distribution

# Fill missing values of 'tmdb_score' with the median
tmdb_score_median = df["tmdb_score"].median()
df.fillna({'tmdb_score':tmdb_score_median},inplace=True)
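The cells above compute the skewness and then fill with the median. The rule behind that choice can be written as a small helper; `impute_by_skew` and its `limit` of 0.5 are hypothetical names and values chosen for this sketch, and the sample series is made up.

```python
import pandas as pd


def impute_by_skew(series: pd.Series, limit: float = 0.5) -> pd.Series:
    """Fill NaNs with the mean if the data is roughly symmetric, else the median."""
    skew = series.skew()
    fill_value = series.mean() if abs(skew) <= limit else series.median()
    return series.fillna(fill_value)


# Made-up, heavily right-skewed vote counts with one missing value
votes = pd.Series([10, 12, 11, 13, 5000, None])
filled = impute_by_skew(votes)
print(filled.iloc[-1])
```

For skewed columns the median is less sensitive to extreme values, which is why it is the fallback here.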

In [None]:
# Analysing the null values
df.isnull().sum()

In [None]:
# Analysing the null values in percentage
(df.isnull().sum()/len(df))*100

In [None]:
# Renaming the production countries as countries
df=df.rename(columns={"production_countries":"countries"})

In [None]:
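# Create a new column 'movies' that holds the title for MOVIE rows (NaN for shows)
df.loc[:,'movies'] = df.loc[df.loc[:,'type']=='MOVIE','title']


In [None]:
# Drop rows where 'movies' is missing
# Note: this removes every SHOW row, so df contains only movies from here on
df.dropna(subset=['movies'],inplace=True)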

In [None]:
# Renaming name column as actor
df.rename(columns={"name":'actor'},inplace=True)

In [None]:
#check the missing values
df.isnull().sum()

Data Manipulation

In [None]:
# 1.Which are the top 10 countries releasing the most movies with IMDb > 7 and TMDb > 7.5

# Filter movies with good IMDb and TMDb scores; .copy() avoids writing to a view of df
good_movies = df[(df['imdb_score']>7) & (df['tmdb_score']>7.5) & (df['type']=='MOVIE')].copy()

# Clean the countries column (string to list)
good_movies['countries'] = good_movies['countries'].str.strip("[]").str.replace("'", "").str.split(", ")

# Explode to get one country per row
explode = good_movies.explode("countries")

# Remove empty values
explode = explode[explode['countries'].str.strip() != ""]

# Count unique titles (by id) per country; the merged data has one row per credited person
top_countries = (
    explode.groupby("countries", as_index=False)["id"].nunique()
    .rename(columns={"id": "high_rated_movies"})
    .sort_values(by="high_rated_movies", ascending=False)
    .head(10)
)

top_countries
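The bracket-stripping approach above relies on the exact formatting of the stored strings. An alternative, assuming every value is a valid Python list literal such as `"['US', 'GB']"`, is to parse it with `ast.literal_eval`. The toy DataFrame below is made up for illustration.

```python
import ast

import pandas as pd

# Toy data with made-up values; 'countries' is stored as a string
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "countries": ["['US', 'GB']", "['IN']", "[]"],
})

# Turn each string into a real Python list
demo["countries"] = demo["countries"].apply(ast.literal_eval)

# One row per (title, country); empty lists become NaN and are dropped
exploded = demo.explode("countries").dropna(subset=["countries"])
print(exploded["countries"].value_counts())
```

`literal_eval` raises an error on malformed strings, so it surfaces bad rows instead of silently producing odd labels.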

In [None]:
# 2. Which are the top 10 most common genres on Amazon Prime, and how many titles belong to each genre?

# Clean 'genres' column (string to list)
df["genres"] = df["genres"].str.strip("[]").str.replace("'", "").str.split(", ")

# Explode genres into separate rows
explode_genres = df.explode("genres")

# Remove empty values
explode_genres = explode_genres[explode_genres["genres"].str.strip() != ""]

# Group by genres and count unique titles (by id);
# the merged data has one row per credited person, so a plain row count would inflate the totals
top_genres = (
    explode_genres.groupby("genres", as_index=False)
    ["id"].nunique()
    .rename(columns={"id": "total_titles"})
    .sort_values(by="total_titles", ascending=False)
    .head(10)
)

top_genres



In [None]:
# 3. Which are the top 10 production countries contributing the most content on Amazon Prime?

# Clean 'countries' column (string to list)
df["countries"] = df["countries"].str.strip("[]").str.replace("'", "").str.split(", ")

# Explode countries into separate rows
explode_countries = df.explode("countries")

# Remove empty values
explode_countries = explode_countries[explode_countries["countries"].str.strip() != ""]

# Group by countries and count unique titles (by id);
# the merged data has one row per credited person, so a plain row count would inflate the totals
top_countries_content = (
    explode_countries.groupby("countries", as_index=False)
    ["id"].nunique()
    .rename(columns={"id": "total_titles"})
    .sort_values(by="total_titles", ascending=False)
    .head(10)
)

top_countries_content


In [None]:
# 4.  How has the number of movies and shows released each year changed over time?

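# Use the merged data here: df holds only movies after the earlier dropna on 'movies'
# Group by release year and type (Movie/Show) and count unique titles by id
yearly_trend = (
    data.groupby(["release_year", "type"], as_index=False)
    ["id"].nunique()
    .rename(columns={"id": "total_titles"})
    .sort_values(by="release_year")
)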

yearly_trend

In [None]:
# 5. Which are the top 10 movies/shows with the highest TMDb popularity score?

# Select top 10 titles with highest TMDb popularity
# Drop duplicate titles so each movie/show appears only once
top_popular = (
    df.drop_duplicates(subset="title")
    [["title", "type", "tmdb_popularity", "imdb_score", "tmdb_score"]]
    .sort_values(by="tmdb_popularity", ascending=False)
    .head(10)
    .reset_index(drop=True)
)

print(top_popular)


### What all manipulations have you done and insights you found?

#### Created a copy of the dataset

* Ensures the original data remains safe for reference.

#### Handled Missing Values

* Dropped columns age_certification and seasons (too many missing values).

* Filled character, description, and imdb_id with mode (most frequent value).

* Filled numerical columns (imdb_score, imdb_votes, tmdb_popularity, tmdb_score) with median after checking skewness.

* Ensured no important numeric column was left blank.

#### Renamed Columns for Clarity

* Renamed production_countries → countries for easier reference.

* Renamed name → actor for better clarity.

#### Created a New Column

* Added a movies column that holds the title only when the content type is MOVIE, then dropped the rows where it is empty, so df contains only movies.

#### Standardized String Values

* Converted the genres and countries strings into lists to make the dataset analysis-ready.

Analysis Summaries

1. Top 10 Countries Releasing the Most Movies with High IMDb & TMDb Scores
I filtered the dataset to include only movies with IMDb > 7 and TMDb > 7.5.
After cleaning and exploding the countries column, I grouped by countries to count unique high-rated movies.
The results showed the United States leading, followed by India and the United Kingdom.
This reflects the dominance of Hollywood while highlighting India’s rapid growth in global cinema.

2. Top 10 Most Common Genres
I exploded the genres column to separate multiple entries per movie.
After grouping and counting, I identified the most common genres on Amazon Prime.
Drama and Comedy dominated the platform, with Action also having a strong presence.
This confirms the consistent popularity of mainstream genres across streaming audiences.

3. Top 10 Production Countries Contributing the Most Content
I cleaned and exploded the countries column, then grouped the data to count the number of titles contributed by each country.
The United States emerged as the largest contributor, followed by India and the UK.
This demonstrates Amazon Prime’s heavy reliance on content from these countries.

4. Number of Movies and Shows Released Each Year
I grouped the dataset by release year and type to count how many titles were released each year.
Releases rise sharply in recent decades, reflecting the streaming boom era and Amazon Prime’s aggressive content expansion strategy.

5. Top 10 Movies/Shows by TMDb Popularity Score
I sorted the dataset by TMDb popularity after removing duplicate titles.
The analysis revealed the most popular titles on Amazon Prime.
Titles such as All the Old Knives and The Eighth Clause ranked among the highest,
indicating the type of content that resonates most with audiences.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Set up the figure size for better visualization
plt.figure(figsize=(8,6))

# Creating a histplot with a KDE (Kernel Density Estimate) curve
# Filtering only MOVIE rows and keeping one row per title,
# because the merged data repeats each title for every credited person
movies_unique = df[df["type"]=="MOVIE"].drop_duplicates(subset="id")
sns.histplot(data=movies_unique,x="release_year",color='blue',bins=30,kde=True)

# Adding labels and titles for clarity
plt.xlabel("Release Year")
plt.ylabel("Number Of Movies")
plt.title("Distribution Of Release year")

# Adding grid for better readability of values
plt.grid(True)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE curve shows how movie releases are distributed across years.


##### 2. What is/are the insight(s) found from the chart?

The chart shows a significant increase in the number of movies released after 1980, with the highest number of releases falling between 2000 and 2020.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights will help create a positive business impact.
The surge in movie releases after 1980, especially between 2000 and 2020, shows a growing demand for diverse content.
This trend highlights the opportunity for production houses to focus more on genres and themes that resonated during this peak period.
No negative growth is observed, as the trend reflects steady expansion rather than decline.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Filtering the SHOW type from the merged data (df holds only movies after wrangling)
# and keeping one row per title so each show is counted once
show = data[data["type"]== "SHOW"].drop_duplicates(subset="id")

# Set up the figure size for better visualization
plt.figure(figsize=(8,6))

# Creating a histplot with a KDE (Kernel Density Estimate) curve
sns.histplot(data = show,x = "release_year",bins=30,kde=True,color="red")

# Adding labels and titles for better clarity
plt.title("Distribution of Show over the period of time")
plt.xlabel("Release Years",labelpad=15,fontsize=12)
plt.ylabel("Total number of Shows",labelpad=15,fontsize=12)

# Adding grid for better readability
plt.minorticks_on()
plt.grid(which='minor', linestyle='--', alpha=0.1, color='black')

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A histogram with a KDE curve shows how show releases are distributed across years.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that the number of TV shows on Amazon Prime increased significantly after the year 2000.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights will help create a positive business impact. The sharp increase in TV shows after 2000 highlights a strong audience demand for series, guiding production houses to invest more in long-format content. This trend indicates opportunities for higher user engagement and subscription growth. No signs of negative growth are observed, as the rise has been consistent without any major decline.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Set up Figure Size For Better Visualization
plt.figure(figsize=(6,6))

# Creating a boxplot of tmdb_score column
sns.boxplot(x='tmdb_score',data=df)

# Adding labels and title for better clarity
plt.xlabel("TMDB SCORE",labelpad=15)
plt.title("Distribution Of TMDB SCORE",pad=30)

# Adding grid for better readability
plt.grid(True)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

Box plot was chosen because it's ideal for visualizing the distribution of numerical values like tmdb_score. Also it helps us to know the median, IQR and outliers.

##### 2. What is/are the insight(s) found from the chart?

The median TMDb score is around 6, showing that most Amazon Prime titles receive moderate ratings.

The interquartile range (IQR) is narrow (roughly 5.5 to 7), meaning most content scores are clustered in the mid-range.

Several outliers exist on both low (below 3) and high (above 9) ends, indicating a few poorly received and a few exceptionally successful titles.

Overall, Amazon Prime content tends to hover around average ratings, with only a small share of titles standing out as highly rated.
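The median, IQR, and outlier limits read off the box plot can also be computed directly. The sketch below uses a made-up series of scores and the 1.5 × IQR whisker rule that seaborn's box plot applies by default.

```python
import pandas as pd

# Made-up scores for illustration
scores = pd.Series([2.0, 5.5, 5.8, 6.0, 6.1, 6.4, 7.0, 9.8])

# Quartiles and interquartile range
q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]

print(f"median={median:.2f}, IQR={iqr:.3f}, outliers={outliers.tolist()}")
```

Running the same lines on `df["tmdb_score"]` would give exact values to quote instead of estimates read from the plot.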

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact. Since most titles cluster around a TMDb score of 5.5–7, production houses can focus on identifying factors that push content into the higher-rated category (above 8). The presence of outliers on the lower side indicates some content is underperforming, which could negatively impact viewer satisfaction and subscriptions. Addressing the quality gaps behind these low-rated titles can help reduce negative growth and improve overall audience engagement.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Set up figure size for better visualization
plt.figure(figsize=(6,6))

# Creating a boxplot of imdb_score column
sns.boxplot(x='imdb_score', data=df)

# Adding labels and title for better clarity
plt.xlabel("IMDB SCORE",labelpad=15)
plt.title("Distribution Of IMDB SCORE",pad=30)

# Adding grid for better readability
plt.grid(True)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

Box plot was chosen because it's ideal for visualizing the distribution of numerical values like imdb_score. Also it helps us to know the median, IQR and outliers.

##### 2. What is/are the insight(s) found from the chart?

The median IMDb score is around 6, showing that most Amazon Prime titles receive moderate ratings.

The interquartile range (IQR) is narrow (roughly 5.5 to 7), meaning most content scores are clustered in the mid-range.

Several outliers exist on both low (below 3) and high (above 9) ends, indicating a few poorly received and a few exceptionally successful titles.

Overall, Amazon Prime content tends to hover around average ratings, with only a small share of titles standing out as highly rated.
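The "small share of highly rated titles" can be quantified. The sketch below reuses the thresholds applied earlier in this notebook (IMDb > 7 and TMDb > 7.5) on a made-up DataFrame and counts each title once by dropping duplicate ids first.

```python
import pandas as pd

# Made-up data; t1 appears twice, as it would for a title with two credited people
demo = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3", "t4"],
    "imdb_score": [8.1, 8.1, 6.0, 7.4, 5.2],
    "tmdb_score": [7.9, 7.9, 6.3, 7.0, 5.0],
})

# One row per title so large casts do not inflate the share
titles_only = demo.drop_duplicates(subset="id")

# A title is high-rated only if it clears both thresholds
high = (titles_only["imdb_score"] > 7) & (titles_only["tmdb_score"] > 7.5)
share = high.mean() * 100
print(f"{share:.1f}% of titles are high-rated")
```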

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights will help create a positive business impact. The IMDb score distribution shows most titles perform in the average range, while only a few stand out with high ratings. By analyzing what makes those highly rated titles successful, production houses can replicate similar strategies. However, the presence of low-rated outliers signals potential risks of negative growth if quality issues are not addressed.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Set up the figure size for better visualization
plt.figure(figsize=(6,6))

# Creating a countplot for distribution of content type
# Keep one row per title (the merged data repeats each title for every credited person)
sns.countplot(x='type', data=data.drop_duplicates(subset="id"), hue="type", palette=["orange","blue"], legend=True)

# Adding labels and titles for better clarity
plt.xlabel("Type")
plt.ylabel("Number of titles")
plt.title("Distribution of content type  : Movies vs Shows",pad=30)

# Adding grid for better readability
plt.grid(axis='y')

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A count plot is used to show the frequency of categorical data, making it easy to compare categories like Movies vs Shows. It quickly highlights imbalances, helping identify which type dominates on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Amazon Prime hosts a significantly higher number of Movies compared to TV Shows. This indicates the platform’s focus on expanding its movie library, while shows form a much smaller portion of the content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can create a positive business impact as they show Amazon Prime’s strong movie library, which attracts a wide range of audiences. However, the lower number of TV shows may lead to negative growth, since many subscribers now prefer binge-worthy series, and limited shows could reduce long-term engagement.

#### Chart - 6

In [None]:

# Chart - 6 visualization code
# 'genres' may hold lists (after the earlier cleaning) or raw strings such as "['drama', 'comedy']"
# Convert to string, remove brackets and quotes, then split into one genre per row
df_genres = (
    df['genres'].dropna().astype(str)
    .str.replace(r"[\[\]']", "", regex=True)
    .str.split(',')
    .explode()
    .str.strip()
)

# Drop empty labels left by titles with no genre
df_genres = df_genres[df_genres != ""]

# Count genres
genre_counts = df_genres.value_counts().head(10)  # Top 10 genres

# Pie Chart
plt.figure(figsize=(8,8))
plt.pie(genre_counts,
        labels=genre_counts.index,
        autopct='%1.1f%%',
        startangle=140,
        colors=plt.cm.Set3.colors)

plt.title("Top 10 Genres Distribution", fontsize=16)
plt.show()


##### 1. Why did you pick the specific chart?

I picked a pie chart because the genres column is categorical, and a pie chart clearly shows the proportion of each genre. It helps to quickly compare which genres dominate and gives an easy visual understanding of their distribution.

##### 2. What is/are the insight(s) found from the chart?

The pie chart illustrates the distribution of the top 10 movie genres. Drama is the most prevalent genre, and comedy, action, and thriller also represent substantial portions of the distribution. Because brackets and quotes are removed from the labels before counting, each genre appears as a single slice, so the shares can be compared directly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can drive positive business impact by guiding content strategy, marketing, and audience targeting toward popular genres like drama, comedy, action, and thriller, boosting engagement and revenue. However, there are risks of negative growth if the platform over-relies on these top genres, leading to audience fatigue, lack of variety, and missed opportunities in underrepresented genres with growth potential. Balancing popular and niche genres is key to sustainable growth.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Set up figure size for better visualization
plt.figure(figsize=(6,6))

# Creating a violin plot to check the distribution of runtime
sns.violinplot(data=df,x="runtime")

# Adding label and title for better clarity
plt.xlabel("Runtime")
plt.title("Distribution of Runtime",pad=20)

# Adding grid for better readability
plt.grid(True)

# Display the plot
plt.show()



##### 1. Why did you pick the specific chart?

The violin plot was chosen because it shows both the spread (median, IQR, outliers) and the density of runtimes, making it clearer where values are most concentrated.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most content has a runtime clustered between 80–120 minutes, indicating this is the typical length range for movies/shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insight can create a positive business impact. Knowing that most content has a runtime of 80–120 minutes helps Amazon Prime align with audience viewing preferences and optimize future content acquisition or production strategies.

No negative growth is indicated from this insight, since it reflects the current market trend. However, relying only on this range without experimenting with shorter or longer formats could limit audience diversity, so balance is important.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Set up the figure size
plt.figure(figsize=(8,6))

# Scatter plot: IMDb Score vs TMDb Score
sns.scatterplot(x="imdb_score", y="tmdb_score", hue="type", data=df, palette="Set1", alpha=0.7)

# Adding labels and title for clarity
plt.xlabel("IMDb Score")
plt.ylabel("TMDb Score")
plt.title("Relationship between IMDb Score and TMDb Score", fontsize=14, pad=20)

# Adding grid for readability
plt.grid(True, linestyle="--", alpha=0.6)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is chosen to visualize the relationship between IMDb and TMDb scores, showing whether higher IMDb scores correspond to higher TMDb scores. It also highlights patterns, clusters, and outliers in the ratings effectively.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows that IMDb and TMDb scores are generally positively correlated, meaning movies rated highly on IMDb tend to have high TMDb scores as well. However, the spread of dots indicates some variability, suggesting that the two platforms don’t always agree on individual movie ratings.
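The strength of the relationship seen in the scatter plot can be summarized with a correlation coefficient. The sketch below computes the Pearson correlation on a made-up DataFrame; on the real data it would be reasonable to compute it only on rows where both scores were observed, since median-filled values pile up at a single point.

```python
import pandas as pd

# Made-up score pairs for illustration
demo = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 6.5],
    "tmdb_score": [5.5, 6.2, 6.8, 8.1, 7.0],
})

# Pearson correlation: +1 is a perfect positive linear relationship, 0 is none
pearson = demo["imdb_score"].corr(demo["tmdb_score"])
print(f"Pearson r = {pearson:.2f}")
```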

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact: The insight that IMDb and TMDb scores are generally correlated can help platforms recommend popular movies confidently, optimize content acquisition, and predict audience preferences, improving engagement and revenue.

Potential negative growth: The variability in scores indicates inconsistent ratings for some movies, which could lead to poor recommendations if relied on solely, potentially decreasing user satisfaction and trust.

Justification: For example, a movie rated highly on IMDb but low on TMDb might not appeal to all audiences, so ignoring this discrepancy could hurt viewer retention.
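To make the discrepancy concrete, titles where the two platforms disagree can be listed explicitly. This is a sketch under stated assumptions: the 1.5-point threshold is arbitrary, `flag_score_disagreements` is a hypothetical helper, and the sample rows are invented for illustration. Only the column names `title`, `imdb_score`, and `tmdb_score` follow the notebook.

```python
import pandas as pd

def flag_score_disagreements(frame: pd.DataFrame, threshold: float = 1.5) -> pd.DataFrame:
    """Return titles whose IMDb and TMDb scores differ by more than `threshold` points."""
    scored = frame.dropna(subset=["imdb_score", "tmdb_score"]).copy()
    scored["score_gap"] = scored["imdb_score"] - scored["tmdb_score"]
    flagged = scored[scored["score_gap"].abs() > threshold]
    # Largest absolute gap first
    return flagged.sort_values("score_gap", key=lambda s: s.abs(), ascending=False)

# Invented sample rows, for illustration only
sample_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [8.2, 6.5, 5.0, 7.0],
    "tmdb_score": [5.9, 6.7, 7.1, None],
})
disagreements = flag_score_disagreements(sample_df)
print(disagreements[["title", "score_gap"]])
```

A recommendation pipeline could treat flagged titles with extra caution instead of relying on a single score.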

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Set up figure size for better visualization
plt.figure(figsize = (6,5))

# Plotting a scatterplot to visually analyze the distribution of Runtime and TMDB Popularity
sns.scatterplot(data = df, x = 'runtime', y = 'tmdb_popularity')

# Adding title and labels for better clarity
plt.title("Runtime vs TMDB Popularity",pad = 10, bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("Runtime",labelpad = 15,fontsize = 12,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))
plt.ylabel("TMDB Popularity",labelpad = 15,fontsize = 12,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot helps to visually analyze the relationship between two numerical columns (Runtime, TMDB Popularity).

##### 2. What is/are the insight(s) found from the chart?


Content with a runtime between 90 and 120 minutes performs well in terms of popularity metrics.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insight that movies with a runtime between 90 and 120 minutes perform well in popularity metrics can help platforms and studios optimize content production. They can prioritize producing or acquiring movies within this runtime range to maximize audience engagement, improve viewer retention, and increase revenue from popular content.
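A scatter plot makes it hard to compare runtime ranges precisely, so the 90–120 minute claim can be checked by binning runtimes. The sketch below assumes the band edges shown (they are illustrative, not taken from the notebook) and uses a hypothetical helper name and invented sample rows. The column names `runtime` and `tmdb_popularity` follow the notebook.

```python
import pandas as pd

def popularity_by_runtime_band(frame: pd.DataFrame, value_col: str = "tmdb_popularity") -> pd.DataFrame:
    """Count and median of `value_col` per runtime band (left-closed bands, in minutes)."""
    bins = [0, 60, 90, 120, 150, float("inf")]
    labels = ["<60", "60-89", "90-119", "120-149", "150+"]
    banded = frame.dropna(subset=["runtime", value_col]).copy()
    banded["runtime_band"] = pd.cut(banded["runtime"], bins=bins, labels=labels, right=False)
    return banded.groupby("runtime_band", observed=False)[value_col].agg(["count", "median"])

# Invented sample rows, for illustration only
sample_df = pd.DataFrame({
    "runtime": [45, 95, 110, 100, 130, 200, 85],
    "tmdb_popularity": [2.0, 10.0, 14.0, 12.0, 6.0, 3.0, 5.0],
})
band_summary = popularity_by_runtime_band(sample_df)
print(band_summary)
```

The median is used here rather than the mean because popularity values may be dominated by a few very popular titles. The `count` column shows how many titles support each band.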

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Group by release year to calculate average IMDb score
avg_imdb_per_year = df.groupby('release_year')['imdb_score'].mean().reset_index()

# Plot line chart
plt.figure(figsize=(12,6))
plt.plot(avg_imdb_per_year['release_year'], avg_imdb_per_year['imdb_score'], marker='o', color='blue')
plt.title('Average IMDb Score Trend Over Years')
plt.xlabel('Release Year')
plt.ylabel('Average IMDb Score')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is ideal for visualizing trends over time, such as average IMDb scores across release years. It clearly shows how ratings change year by year, highlights upward or downward trends, and makes it easy to spot patterns or anomalies in the data.

##### 2. What is/are the insight(s) found from the chart?

The line chart shows that the average IMDb score fluctuates over the years but generally remains above 5.5, indicating a consistent baseline of movie quality. There are periods of high average scores (around 6.0–7.0), suggesting years with well-received films. The chart also shows resilience, as dips in scores are often followed by recoveries, and peaks in certain years highlight periods with a concentration of critically acclaimed or popular movies.
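Peaks in a yearly average can come from years with very few titles, so it may help to look at the number of titles behind each point. This sketch rests on two assumptions that the notebook does not confirm: that `df` may hold several rows per title (for example one per cast member), which is why rows are deduplicated on `title` and `release_year`, and that 30 titles is a reasonable minimum for a stable average. `yearly_score_trend` is a hypothetical helper and the sample rows are invented.

```python
import pandas as pd

def yearly_score_trend(frame: pd.DataFrame, min_titles: int = 30) -> pd.DataFrame:
    """Average IMDb score per release year, with the number of titles behind each average."""
    titles = (frame.dropna(subset=["imdb_score"])
                   .drop_duplicates(subset=["title", "release_year"]))
    yearly = (titles.groupby("release_year")["imdb_score"]
                    .agg(avg_score="mean", n_titles="count")
                    .reset_index())
    # Mark years whose average rests on enough titles to be trusted
    yearly["reliable"] = yearly["n_titles"] >= min_titles
    return yearly

# Invented sample rows, for illustration only ("Drama A" appears twice on purpose)
sample_df = pd.DataFrame({
    "title": ["Old Classic", "Drama A", "Drama A", "Comedy B", "Thriller C"],
    "release_year": [1950, 2019, 2019, 2019, 2019],
    "imdb_score": [9.0, 6.0, 6.0, 7.0, 5.0],
})
trend = yearly_score_trend(sample_df, min_titles=2)
print(trend)
```

Plotting only the rows where `reliable` is true, or shading the others, would make it easier to tell real trends from small-sample noise.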

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the average IMDb score trend can help studios and streaming platforms plan content strategies by identifying eras or genres associated with higher-rated movies. Consistently high baseline scores indicate that investing in similar types of films is likely to attract audiences and maintain engagement. Additionally, recognizing periods of peak performance can guide marketing and acquisition decisions to focus on movies with proven audience appeal, ultimately improving revenue and brand reputation.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Set up figure size for better visualization
plt.figure(figsize = (6,5))

# Plotting a scatterplot to visually analyze the correlation between Runtime and IMDB Votes
sns.scatterplot(data = df, x ='runtime', y ='imdb_votes')

# Adding title and labels for better clarity
plt.title("Runtime vs IMDB Votes",pad = 10,bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("Runtime",labelpad = 15,fontsize = 12,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))
plt.ylabel("IMDb Votes (in millions)",labelpad = 15,bbox = dict(facecolor = 'lightyellow', edgecolor = 'black'))

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot helps to visually analyze the relationship between two numerical columns (Runtime, IMDb Votes).

##### 2. What is/are the insight(s) found from the chart?


Content with a runtime between 90 and 120 minutes performs well in terms of popularity, measured here by IMDb votes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Businesses should invest in content with a runtime between 90 and 120 minutes, since movies and shows in this range are the ones gaining popularity.
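Vote counts are often dominated by a handful of blockbuster titles, so comparing the mean and the median inside and outside the 90–120 minute range can show whether the pattern holds for typical titles. This is only a sketch: `votes_inside_vs_outside` is a hypothetical helper and the sample rows are invented. The column names `runtime` and `imdb_votes` follow the notebook.

```python
import numpy as np
import pandas as pd

def votes_inside_vs_outside(frame: pd.DataFrame, low: int = 90, high: int = 120) -> pd.DataFrame:
    """Compare IMDb vote statistics for runtimes inside vs outside [low, high] minutes."""
    voted = frame.dropna(subset=["runtime", "imdb_votes"]).copy()
    in_range = voted["runtime"].between(low, high)  # inclusive on both ends
    voted["runtime_group"] = np.where(in_range, "inside", "outside")
    return voted.groupby("runtime_group")["imdb_votes"].agg(["count", "mean", "median"])

# Invented sample rows, for illustration only
sample_df = pd.DataFrame({
    "runtime": [95, 100, 118, 60, 150, 45],
    "imdb_votes": [1000, 3000, 500000, 800, 1200, 400],
})
vote_summary = votes_inside_vs_outside(sample_df)
print(vote_summary)
```

When the mean is far above the median, as in the invented "inside" group here, a few titles are carrying the average and the investment recommendation deserves a closer look.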

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Group by release_year to calculate average runtime
avg_runtime_per_year = df.groupby('release_year')['runtime'].mean().reset_index()

# Plot line chart
plt.figure(figsize=(14,7))
sns.lineplot(x='release_year', y='runtime', data=avg_runtime_per_year, marker='o', color='green')
plt.title('Average Movie Runtime Over Years')
plt.xlabel('Release Year')
plt.ylabel('Average Runtime (minutes)')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A line chart was chosen because it effectively shows trends over time.  

##### 2. What is/are the insight(s) found from the chart?

The line chart shows that from 1960 to 2020, the yearly average runtime stayed between 80 and 120 minutes, reflecting a standard length. There are minor fluctuations year-to-year, but overall runtimes remained consistent. In recent decades, average runtimes trend toward the upper bound (~120 minutes), suggesting modern films are slightly longer, possibly due to more complex storytelling.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact. Understanding that audiences prefer movies within 80–120 minutes helps producers and studios optimize movie length for better engagement and satisfaction. The trend toward slightly longer modern films indicates opportunities to invest in high-quality, engaging content without exceeding viewers’ attention spans, improving both box office performance and streaming retention.
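The chart title refers to movies, but the grouped average is taken over every row of `df`. If `df` still contains `SHOW` rows at this point (an assumption, since the notebook state is not visible here), episode runtimes would pull the yearly averages down. A minimal sketch of a movie-only, per-decade view, with a hypothetical helper name and invented sample rows:

```python
import pandas as pd

def movie_runtime_by_decade(frame: pd.DataFrame) -> pd.DataFrame:
    """Average runtime of MOVIE rows per decade, with the number of rows per decade."""
    movies = frame[frame["type"] == "MOVIE"].dropna(subset=["runtime", "release_year"]).copy()
    movies["decade"] = (movies["release_year"] // 10) * 10
    return movies.groupby("decade")["runtime"].agg(["count", "mean"])

# Invented sample rows, for illustration only
sample_df = pd.DataFrame({
    "release_year": [1965, 1968, 2015, 2018, 2016],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE"],
    "runtime": [90, 100, 120, 30, 110],
})
decade_summary = movie_runtime_by_decade(sample_df)
print(decade_summary)
```

Aggregating by decade smooths year-to-year noise, which makes a gradual shift toward longer films easier to confirm or reject.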

#### Chart - 13

In [None]:
# Filter: high-scoring movies (IMDb > 7, TMDb > 7.5, type MOVIE, released in 2010 or later)
score_high = df[(df["imdb_score"] > 7) & (df["tmdb_score"] > 7.5) & (df["type"] == "MOVIE") & (df["release_year"] >= 2010)].drop_duplicates(subset="title")
# Keep the 10 highest-scoring titles (IMDb first, TMDb as tie-breaker)
high_ = score_high.sort_values(by = ["imdb_score","tmdb_score"],ascending = [False,False]).head(10)

# Using melt() to combine IMDB & TMDB Score
actor_by_score = high_.melt(id_vars = "actor",value_vars = ["imdb_score","tmdb_score"],var_name = "score_type",value_name = "score")


# Set up figure size for better visualization
plt.figure(figsize = (8,8))

# Creating a barplot to visually compare IMDb and TMDb scores for the top 10 actors
# (palette is a list so that the colour order is deterministic; a set has no fixed order)
sns.barplot(data = actor_by_score,x = "actor",y = "score", hue = "score_type",palette = ["blue","orange"])

# Adding title and labels for better clarity
plt.title("Top 10 actors by IMDB & TMDB Score (2010 onwards)",pad=15,fontsize=15, bbox = dict(facecolor = 'White', edgecolor = 'black'))
plt.xlabel("Actors",labelpad = 10,fontsize = 14)
plt.ylabel("Score",labelpad = 10,fontsize = 14)
plt.xticks(rotation = 90)

# Display the plot
plt.show()



##### 1. Why did you pick the specific chart?

A barplot is ideal for comparing numerical values across categories, such as IMDb vs TMDb scores for the top 10 actors. It clearly shows differences in scores between actors, making it easy to identify who performs better on each platform.

##### 2. What is/are the insight(s) found from the chart?

The bar chart highlights differences between IMDb and TMDb scores among the top 10 actors in movies released from 2010 onward. Chinmay Mandlekar has the highest IMDb score, while Alexander Babu leads in TMDb score. Most actors have higher IMDb than TMDb scores, though some (like Alexander Babu and Malhar Thakar) show the opposite trend. Overall, scores range from ~7.5 to just under 10, so every actor shown is attached to a highly rated title, with notable discrepancies between the two rating platforms for certain actors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Production houses should consider casting actors such as Chinmay Mandlekar and Suriya, as they are among the highest scorers on IMDb as well as TMDb.
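Because the chart keeps one row per title, each actor's bar reflects a single movie. A casting recommendation may be more robust if it averages over several titles per actor. The sketch below assumes the `actor`, `title`, `imdb_score`, and `tmdb_score` columns used in the chart code. The helper name, the `min_titles` cut-off, and the sample rows are hypothetical.

```python
import pandas as pd

def actor_score_summary(frame: pd.DataFrame, min_titles: int = 3) -> pd.DataFrame:
    """Average scores per actor, keeping only actors with at least `min_titles` titles."""
    scored = (frame.dropna(subset=["imdb_score", "tmdb_score"])
                   .drop_duplicates(subset=["actor", "title"]))
    summary = scored.groupby("actor").agg(
        n_titles=("title", "nunique"),
        avg_imdb=("imdb_score", "mean"),
        avg_tmdb=("tmdb_score", "mean"),
    )
    return summary[summary["n_titles"] >= min_titles].sort_values("avg_imdb", ascending=False)

# Invented sample rows, for illustration only
sample_df = pd.DataFrame({
    "actor": ["Actor X", "Actor Y", "Actor Y"],
    "title": ["a", "b", "c"],
    "imdb_score": [9.5, 8.0, 7.0],
    "tmdb_score": [9.0, 7.5, 7.5],
})
actor_summary = actor_score_summary(sample_df, min_titles=2)
print(actor_summary)
```

In this invented sample, "Actor X" has the single best title but is dropped for having only one credit, which is the distinction the cut-off is meant to capture.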

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Filtering numerical data
num_data = df.select_dtypes(include=["int64","float64"])
correl = num_data.corr()

# Set up figure size for better visualization
plt.figure(figsize = (8,6))

# Plot heatmap
sns.heatmap(correl,annot = True,cmap = "rainbow")
plt.tight_layout()

# Display the heatmap
plt.show()


##### 1. Why did you pick the specific chart?

The heatmap was chosen as it clearly visualizes correlations between numerical variables, making patterns and relationships easy to interpret at a glance.

##### 2. What is/are the insight(s) found from the chart?

1. IMDb Score vs TMDB Score → Moderate-to-strong positive correlation (0.59)  
   → Both rating platforms generally agree on content quality.

2. Release Year vs Runtime → Weak positive correlation (0.29)  
   → Newer releases tend to have slightly longer runtimes.

3. Runtime vs IMDb Score → Weak positive correlation (0.26)  
   → Longer movies/shows tend to be rated slightly higher.

4. IMDb Votes vs IMDb Score → Weak positive correlation (0.28)  
   → Higher-rated titles tend to attract somewhat more votes.
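Reading coefficients off a heatmap is easier when the pairs are also listed in ranked order. The following sketch flattens a correlation matrix into a sorted table. The strength labels use a common rule of thumb (|r| up to 0.3 weak, up to 0.7 moderate, above that strong), which is an assumption rather than something the notebook defines. The helper name and sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def ranked_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """List each pair of numeric columns once, sorted by absolute correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna().reset_index()
    pairs.columns = ["var_1", "var_2", "r"]
    pairs["strength"] = pd.cut(pairs["r"].abs(), bins=[0, 0.3, 0.7, np.inf],
                               labels=["weak", "moderate", "strong"], include_lowest=True)
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)

# Invented sample data, for illustration only
sample_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
ranked = ranked_correlations(sample_df)
print(ranked)
```

In the notebook this could be applied to `df` directly. The resulting table pairs well with the heatmap because it states each coefficient once, next to a consistent label.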

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Creating pairplot
sns.pairplot(df)

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose the pairplot because it shows both the distribution of each variable and the relationships between pairs of variables, making it easier to detect patterns, correlations, or clusters in the dataset.

##### 2. What is/are the insight(s) found from the chart?

* IMDb vs TMDB Scores → Clear positive linear relationship, showing both platforms largely agree on ratings.

* IMDb Votes vs IMDb Score → Mild positive trend; higher-rated movies generally attract more votes.

* Runtime → Most films fall between 50–150 minutes, the standard content length range.

* IMDb Votes vs Runtime → Weak correlation; a few longer films get more votes, likely due to big-budget or popular productions.
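A pair plot over every numeric column of a dataset with over a hundred thousand rows can be slow and crowded. One option is to plot a random sample of selected columns. In this sketch the column list reuses names that appear elsewhere in the notebook, while the sample size, the seed, and the helper name `pairplot_sample` are assumptions.

```python
import pandas as pd

# Column names as used elsewhere in the notebook
PAIRPLOT_COLUMNS = ["runtime", "imdb_score", "imdb_votes", "tmdb_popularity", "tmdb_score"]

def pairplot_sample(frame: pd.DataFrame, columns=PAIRPLOT_COLUMNS, n: int = 5000, seed: int = 42) -> pd.DataFrame:
    """Select the columns of interest, drop incomplete rows, and down-sample to at most n rows."""
    subset = frame[list(columns)].dropna()
    if len(subset) > n:
        subset = subset.sample(n=n, random_state=seed)
    return subset

# Invented demo frame, for illustration only
demo = pd.DataFrame({col: range(100) for col in PAIRPLOT_COLUMNS})
demo["release_year"] = 2020
small = pairplot_sample(demo, n=10)
print(small.shape)

# In the notebook, the sample could then be plotted with:
# sns.pairplot(pairplot_sample(df))
```

Fixing the seed keeps the sampled plot reproducible between runs.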

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?


To optimize business outcomes, films should generally be produced with runtimes of 80–120 minutes to match audience preferences, while prioritizing popular genres such as drama, comedy, action, and thriller. Casting high-performing actors like Chinmay Mandlekar and Alexander Babu can enhance both IMDb and TMDb scores. Balancing content quality and popularity ensures strong critical and audience approval, while strategic release planning based on yearly trends can maximize engagement. Focusing on genre-specific performance, targeting key markets, and adjusting content length for modern audiences further improves viewer satisfaction.

Marketing campaigns should highlight runtime, genre, and lead actors, and streaming platforms can personalize recommendations based on these trends. Data-driven production decisions, budget allocation, and planning sequels or franchises using historical insights help optimize ROI. Additionally, considering demographics, international appeal, and award strategies ensures broad audience reach.

Overall, implementing these strategies supports higher engagement, improved revenue, and long-term growth in the film industry.

# **Conclusion**

* High-Quality Movies: Films with strong content, good direction, and production quality tend to receive higher IMDb and TMDb scores.

* Popular Genres: Drama, comedy, action, and thriller dominate the market and attract larger audiences.

* Bigger Audience Base: Movies with standard runtimes (80–120 minutes) maintain higher engagement and satisfaction.

* Top-Producing Countries: Countries producing the most films have established infrastructure and audiences, making them key markets.

* Shows Growing in Demand: Certain genres and content types are increasingly popular on streaming platforms, reflecting changing viewer preferences.

* Actors Matter: High-performing actors like Chinmay Mandlekar and Alexander Babu are associated with the highest-rated titles in this dataset.

* Trends Over Time: Average runtimes and IMDb/TMDb scores show fluctuations, but recent decades indicate a shift toward longer, well-received films.

* Data-Driven Decisions: Insights from genres, actors, runtimes, and popularity help in planning production, marketing, and distribution.

* Content Personalization: Streaming platforms can recommend content based on audience preferences, runtime, and genre trends.

* Business Impact: Leveraging these factors ensures higher engagement, better revenue, and sustained growth in the film and streaming industry.
