# **Project Name**    -  Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**


Analyzing content is essential for OTT platforms aiming to sustain audience engagement and remain competitive within the rapidly evolving digital entertainment sector.

Amazon Prime Video, a leading global streaming provider, features a comprehensive and diverse selection of films and television programs, encompassing a wide range of genres, languages, and cultural backgrounds. As the appetite for customized and high-caliber content continues to expand, it becomes increasingly important to evaluate which categories perform optimally and how consumer preferences shift over time.

This project conducts an exploratory data analysis of Amazon Prime’s media catalog to uncover key insights and recurring trends related to content format, release periods, genre traction, and audience ratings. The analysis is intended to offer strategic guidance for content procurement, elevate user experience, and enhance overall platform efficiency.

The insights derived from this study are poised to inform high-level business decisions by identifying priority content types, optimal release windows, and genres that demonstrate the strongest viewer appeal—thereby driving user engagement, improving satisfaction, and fostering sustained growth for the platform.


# **Problem Statement**


As the demand for personalized and premium content continues to rise across OTT platforms, it becomes essential to analyze Amazon Prime’s extensive content library to uncover underlying trends and viewer behavior. This project seeks to examine the platform’s catalog in order to gain insights into audience preferences, content effectiveness, and genre appeal—ultimately guiding more informed content acquisition decisions and enhancing viewer engagement strategies.



#### **Define Your Business Objective?**

Strengthen user engagement and satisfaction by improving content availability and personalization features on Amazon Prime Video.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
data1=pd.read_csv('/credits.csv.zip')
data2=pd.read_csv('/titles.csv.zip')

### Dataset First View

In [None]:
# data1 Dataset First Look
data1.head()

In [None]:
# data2 Dataset First Look
data2.head()

In [None]:
# Merge credits (data1) with titles (data2) on the shared title id
df=pd.merge(data1,data2,on='id',how='left')

In [None]:
#Final Dataset view after merging
df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

This data set has 124347 rows and 19 columns.
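
The row count above depends on how the left merge matched ids between the two files. As a hedged sketch on made-up rows (the toy ids, names, and titles are hypothetical, not taken from the dataset), `validate` and `indicator` can confirm that each credit row matched at most one title and show which rows found no match:

```python
import pandas as pd

# Toy stand-ins for the credits and titles tables (all values are made up)
credits = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t9"],
    "name": ["A", "B", "C", "D"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR"],
})
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["First", "Second", "Third"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
})

# validate="many_to_one" raises if an id repeats in titles;
# indicator=True adds a _merge column showing where each row matched
merged = pd.merge(credits, titles, on="id", how="left",
                  validate="many_to_one", indicator=True)

unmatched_credits = int((merged["_merge"] == "left_only").sum())
titles_without_credits = sorted(set(titles["id"]) - set(credits["id"]))
print(unmatched_credits, titles_without_credits)
```

In this toy example one credit row has no matching title and one title has no credits. With `how='left'` the latter does not appear in the merged frame at all.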

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(df.duplicated().sum())

As we can see, there are 168 duplicated rows in the data, which need to be removed during data wrangling for better analysis.
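
Before dropping duplicates, it can help to look at the repeated rows themselves. A minimal sketch on a made-up frame (the column values are hypothetical):

```python
import pandas as pd

# Toy frame with one fully repeated row (values are made up)
toy = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t2"],
    "name": ["A", "A", "B", "C"],
})

# duplicated() flags the second and later copies of a row
n_dupes = int(toy.duplicated().sum())

# keep=False flags every copy, which is handy for inspecting them side by side
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each repeated row
deduped = toy.drop_duplicates()
print(n_dupes, len(all_copies), len(deduped))
```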

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

The output shows that some columns contain null values while others contain none. However, placeholder strings such as "NA", "N/A", "None", "-", or "null" may also be present without being detected by `isnull()`.



In [None]:
# Replace placeholder strings with NaN, then recount the missing values
df.replace(["null", "NULL", "NaN", "None"], np.nan, inplace=True)
df.isnull().sum()

The cell above replaces such placeholder strings with np.nan and then checks for nulls again.
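
As a hedged sketch, a case-insensitive variant of the same idea can also catch spellings such as "N/A" or "-". The helper name `to_nan` and the placeholder set are assumptions for illustration, and the toy column is made up:

```python
import numpy as np
import pandas as pd

# Toy column; the placeholder spellings are examples, not taken from the dataset
toy = pd.DataFrame({"age_certification": ["R", "N/A", "-", "PG", "none", " "]})

# Hypothetical set of placeholder spellings, compared in lower case
placeholders = {"na", "n/a", "none", "null", "nan", "-", ""}

def to_nan(value):
    """Return np.nan for placeholder strings (case-insensitive), else the value."""
    if isinstance(value, str) and value.strip().lower() in placeholders:
        return np.nan
    return value

cleaned = toy["age_certification"].map(to_nan)
print(cleaned.isnull().sum())
```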

### What did you know about your dataset?

This dataset lists the titles available on Amazon Prime Video in the United States, so that the catalog can be analyzed for interesting facts. It has some flaws, such as missing values and duplicated rows, which need to be handled for a better understanding of the data.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**id**: The title ID on JustWatch.

**title**: The name of the title.

**type**: TV show or movie.

**description**: A brief description.

**release_year**: The release year.

**age_certification**: The age certification.

**runtime**: The length of the episode (SHOW) or movie.

**genres**: A list of genres.

**production_countries**: A list of countries that produced the title.

**seasons**: Number of seasons if it's a SHOW.

**imdb_id**: The title ID on IMDB.

**imdb_score**: Score on IMDB.

**imdb_votes**: Votes on IMDB.

**tmdb_popularity**: Popularity on TMDB.

**tmdb_score**: Score on TMDB.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code


## Step-1  : Handling Duplicates

In [None]:
# Write your code to make your dataset analysis ready.
df.drop_duplicates(inplace=True)
print(df.duplicated().sum())

## Step-2 : Handling missing values

In [None]:
# Count missing values per column after removing duplicates
df.isnull().sum()

In [None]:
# Variables with null values need to be filled or dropped as required
# 'character' contains 16277 null values
df['character'].mode()  # categorical data: nulls can be replaced with the mode ('Himself')
# Assign the result back instead of using inplace=True on a column selection
df['character'] = df['character'].fillna('Himself')
print(df['character'].isnull().sum())

In [None]:
# 'description' has 91 missing values, a very small fraction of the data, so those rows are dropped
df.dropna(subset=["description"],inplace=True)
print(df['description'].isnull().sum())

In [None]:
# age_certification is an object column; nulls are replaced with the mode
df['age_certification'].dtypes
df['age_certification'].mode()
df['age_certification'] = df['age_certification'].fillna(df['age_certification'].mode()[0])
print(df['age_certification'].isnull().sum())


In [None]:
# seasons--> float64 datatype
df['seasons'].dtypes
df['seasons'].skew()

In [None]:
# seasons: right-skewed, so missing values are replaced by the median
df['seasons'] = df['seasons'].fillna(df['seasons'].median())
print(df['seasons'].isnull().sum())

In [None]:
# imdb-id : Object data
df['imdb_id'].dtypes
df['imdb_id'].count()

In [None]:
# Replace missing 'imdb_id' values with 'unknown'
# An IMDb id is unique to each title, so the mode is not meaningful; fill with 'unknown' instead
df['imdb_id'] = df['imdb_id'].fillna('unknown')
print(df['imdb_id'].isnull().sum())

In [None]:
# imdb-score : float64
df['imdb_score'].dtypes
df['imdb_score'].skew()

In [None]:
# imdb_score is only slightly left-skewed (skewness above -0.5), so missing values are replaced by the mean
df['imdb_score'] = df['imdb_score'].fillna(df['imdb_score'].mean())
print(df['imdb_score'].isnull().sum())

In [None]:
#imdb_votes : float64
df['imdb_votes'].dtypes
df['imdb_votes'].skew()

In [None]:
# imdb_votes: highly right-skewed, so missing values are replaced by the median
df['imdb_votes'] = df['imdb_votes'].fillna(df['imdb_votes'].median())
print(df['imdb_votes'].isnull().sum())

In [None]:
# tmdb_popularity : float64
df['tmdb_popularity'].dtypes
print(df['tmdb_popularity'].isnull().sum())

In [None]:
# Only 13 values are missing, so these rows can be dropped too
df.dropna(subset=['tmdb_popularity'],inplace =True)
print(df['tmdb_popularity'].isnull().sum())

In [None]:
# tmdb_score
df['tmdb_score'].dtypes
df['tmdb_score'].skew()

In [None]:
#check distribution using histogram
plt.figure(figsize=(10,5))
sns.histplot(df,x='tmdb_score',bins=20,kde=True)
plt.show()

In [None]:
# The histogram shows a roughly symmetric distribution
# Fill missing values with the mean
df['tmdb_score'] = df['tmdb_score'].fillna(df['tmdb_score'].mean())
print(df['tmdb_score'].isnull().sum())

In [None]:
df.isnull().sum()

In [None]:
df

In [None]:
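# Save the cleaned DataFrame as Excel
df.to_excel("df.xlsx", index=False)

# Download the file when running in Google Colab; skip the download elsewhere
try:
    from google.colab import files
    files.download("df.xlsx")
except ImportError:
    pass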


In [None]:
#numerical columns
numeric_cols= df.select_dtypes(include=['int64','float64'])
numeric_cols

### What all manipulations have you done and insights you found?

* The dataset initially contained 168 duplicate rows, which were removed using the drop_duplicates() method.

* Several columns had missing values, which were handled as follows:

 - 'tmdb_popularity' and 'description' had very few missing values, so the rows containing them were dropped from the dataset.

 - Categorical columns such as 'age_certification' and 'character' (object types) had missing values that were filled using the mode.

 - Numerical columns like 'seasons' and 'imdb_votes', which were skewed, had their missing values filled using the median.

 - For columns like 'imdb_score' and 'tmdb_score', which were roughly symmetric, missing values were imputed using the mean.
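
The mean-versus-median rule described above can be sketched as a small helper. The function name `impute_numeric` and the 0.5 skewness threshold are assumptions for illustration (the threshold mirrors the -0.5 cutoff mentioned in the code comments), and the data is made up:

```python
import pandas as pd

def impute_numeric(series, skew_threshold=0.5):
    """Fill NaNs with the mean if |skewness| <= threshold, else with the median."""
    skewness = series.skew()
    use_mean = abs(skewness) <= skew_threshold
    fill_value = series.mean() if use_mean else series.median()
    return series.fillna(fill_value), ("mean" if use_mean else "median")

# Toy data: one symmetric column and one with a long right tail
symmetric = pd.Series([5.0, 6.0, 7.0, None, 6.0])
right_tailed = pd.Series([1.0, 1.0, 2.0, None, 40.0])

filled_sym, method_sym = impute_numeric(symmetric)
filled_tail, method_tail = impute_numeric(right_tailed)
print(method_sym, method_tail)
```

Returning the chosen method alongside the filled series makes it easy to log which rule was applied to each column.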

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

Histogram

**Visualize the distribution of content releases**

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(6,3))
sns.histplot(df,x='release_year',bins=20,kde=True)
plt.show()

##### 1. Why did you pick the specific chart?

We selected a histogram of `release_year` to show how release years are distributed over time. It highlights patterns such as the increase in recent releases and the left-skewed shape of the data.


##### 2. What is/are the insight(s) found from the chart?

The distribution of movie release years is left-skewed (skewness = -1.14), meaning a few very old movies pull the average earlier.

Most movies were released between 2010 and 2022, with a peak around 2020.

Mean release year is 1996, but this is lower than the median (2009) due to the influence of older movies (as early as 1912).

Movie production was low before 1960, then steadily increased, with a major boom after 2000.

The surge in recent releases reflects the growth of digital platforms and filmmaking technology.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Movies involving actors show greater variability in runtimes, reflecting diverse content that can attract a wider audience and enhance recommendation systems. However, excessive runtimes may lower viewer engagement, highlighting the need to balance creativity with audience viewing habits.

#### Chart - 2

Histogram

**Illustrate the distribution of runtimes across various types of content**

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10,5))
fig=sns.histplot(df,x='runtime',bins=20,kde=True,color = "green")
plt.show()

##### 1. Why did you pick the specific chart?

We selected the histogram  to effectively show the distribution of movie runtimes. This helps in understanding how most movies vary in length, revealing patterns and outliers in terms of time.

##### 2. What is/are the insight(s) found from the chart?

* Most movies have a runtime between 80 and 120 minutes.
* Very few movies run for around 180 minutes.
* A few movies exceed 500 minutes, indicating outliers.
* The distribution is right-skewed, indicating a small number of long-duration movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart reveals that most movies have runtimes between 80–120 minutes, helping businesses tailor content accordingly. Long-duration movies are rare and could lead to poor performance if not strategically planned. This data-driven insight supports informed decisions in production and distribution for better audience engagement and profitability.

#### Chart - 3

Box Plot

**Check whether any movies have a runtime far from that of the other movies**

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,4))     #box plot to check if there is any outlier present in runtime
sns.boxplot(df, x = 'runtime', color='lightgreen')
plt.title("Box Plot of Runtime")
plt.show()

##### 1. Why did you pick the specific chart?

 The box plot clearly highlights if any movies have significantly longer or shorter durations compared to others

##### 2. What is/are the insight(s) found from the chart?

The graph shows outliers in the runtime column that lie far above the rest of the data: some exceed 300 minutes and some exceed 500 minutes.
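
To list such outliers rather than only see them, the usual 1.5 × IQR rule (the default whisker setting of seaborn's box plot) can be applied directly. A minimal sketch on made-up runtimes:

```python
import pandas as pd

# Toy runtimes in minutes (made-up values, with two extreme entries)
runtime = pd.Series([85, 90, 95, 100, 105, 110, 120, 320, 540])

# Quartiles and interquartile range
q1, q3 = runtime.quantile(0.25), runtime.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

outliers = runtime[(runtime < lower_fence) | (runtime > upper_fence)]
print(outliers.tolist())
```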

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Movies with extremely long durations need to be evaluated. For example, they could be split into parts or released as a series to improve user engagement.
* If such long runtimes are not aligned with audience preferences, they might lead to negative growth, such as decreased engagement or higher bounce rates.


#### Chart - 4

PIE CHART

In [None]:
# Chart - 4 visualization code
# Pie Chart to get the proportion of each content type
type_counts = df['type'].value_counts()
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%')
plt.title("Distribution of Content Type")
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart visually represents the proportion of each content type in the dataset and makes it easy to compare the shares of 'Movie' and 'Show'.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that 'Movie' and 'Show' account for 93.5% and 6.5% of the records, respectively.
* 'Movie' records clearly dominate the dataset compared with 'Show' records.
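
Because the merged frame holds one row per credited person, a title with a large cast is counted many times in the shares above. As a hedged sketch on made-up rows (the toy ids are hypothetical), counting each title once gives a catalog-level proportion:

```python
import pandas as pd

# Toy merged frame: one row per credited person, so titles repeat (values are made up)
merged = pd.DataFrame({
    "id": ["m1", "m1", "m1", "m2", "m2", "s1"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW"],
})

# Share of rows (what a pie chart of the merged frame shows)
row_share = merged["type"].value_counts(normalize=True)

# Share of titles: keep one row per title id before counting
title_share = (merged.drop_duplicates(subset="id")["type"]
               .value_counts(normalize=True))
print(row_share.round(3).to_dict(), title_share.round(3).to_dict())
```

In the toy data the movie share drops from five sixths of the rows to two thirds of the titles, which shows how cast size can inflate a row-based proportion.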

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* As 'Movie' dominates the platform, content strategy and marketing efforts should focus on optimizing movie-related content, improving recommendations, and attracting more viewers who are interested in movies.
* At the same time, the small share of 'Show' is concerning. There may be several reasons behind it, such as a content gap or lack of interest, and these need to be addressed. Otherwise the platform risks losing potential viewers who prefer series or episodic content.
* This insight can guide the business to consider expanding its show library to cater to a broader audience and improve user retention.

#### Chart - 5

PIE CHART

To show the proportion of each Role

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(6,3))
role_counts=df['role'].value_counts()
plt.pie(role_counts,labels=role_counts.index,autopct='%1.1f%%')
plt.show()

##### 1. Why did you pick the specific chart?

Pie Chart is visually representing the proportion of each role in the dataset.

##### 2. What is/are the insight(s) found from the chart?

* There is majority in the role of 'Actors' (93.3%) which is dominating the dataset.
* The role of 'Director'(6.7%) is quite less in the dataset.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The chart is dominated by the 'Actor' role, which can support targeted marketing and actor-centric strategies.
* The limited data for the 'Director' role points to an imbalance in the dataset, which might lead to missed opportunities.
* It would be better to balance the dataset to support more comprehensive and inclusive business decisions.

#### Chart - 6

COUNT PLOT

To show how many records belong to each certification category.

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(6,3))
sns.countplot(data=df,x='age_certification',palette='viridis')
plt.show()

##### 1. Why did you pick the specific chart?

The age_certification column contains multiple categories, so count plot would provide much better and immediate understanding of how many records belong to each age certification category.

##### 2. What is/are the insight(s) found from the chart?

* The 'R' certification has the highest number of entries by a large margin followed by 'PG-13' and 'PG'.
* The certifications like 'TV-PG','TV-G','TV-Y','TV-14','NC-17','TV-Y7' and 'TV-MA' have very few entries, which indicates that they are underrepresented.
* There is notable imbalance in the distribution of age certification in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, the gained insights can help create a positive business impact.
The chart reveals that a very high proportion of content is ‘R’ rated, which indicates a strong focus on mature audience engagement. This is valuable for businesses targeting adult demographics, as it allows them to design marketing strategies and recommendations that cater specifically to this group.
* The over-concentration on ‘R’ rated content creates an imbalance in age certifications. This limits appeal to younger and family-friendly audiences, resulting in missed opportunities to expand market reach.

#### Chart - 7

SCATTER PLOT

How does TMDB popularity depend on release year?


In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(8,4))
sns.scatterplot(data=df,x='release_year',y='tmdb_popularity')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows how two variables (tmdb_popularity and release_year) are related and helps visualize any patterns, trends, and anomalies.

##### 2. What is/are the insight(s) found from the chart?

* Movies released before 1980 generally have very low tmdb_popularity scores, suggesting older films attract less current attention on TMDB.

* From around 2000 onwards, there is a noticeable increase in the popularity of movies, with more scattered points reaching higher values.

* The chart suggests that newer movies tend to be significantly more popular than older ones, likely due to increased media coverage, digital availability, and larger global audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, the gained insights can help create a positive business impact.
* The chart shows that movies released in recent years (especially after 2000 and strongly after 2015) gain much higher popularity compared to older films. This insight helps businesses focus marketing, streaming promotions, and investments on newer content, which is more likely to attract higher audience engagement and revenue.
* The consistently low popularity of older movies indicates that they contribute less to engagement. Over-reliance on recent releases may also create a short-term growth bias, while neglecting opportunities to repackage or revive older classics for niche audiences.


#### Chart - 8

SCATTER PLOT

What is the relationship between movie runtime and number of votes received?


In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(8,4))
sns.scatterplot(data=df,x='runtime',y='imdb_votes')
plt.show()

##### 1. Why did you pick the specific chart?

The scatterplot has been chosen to visually analyze the relationship between a movie's runtime and the number of IMDb votes received for it. This helps to identify patterns or trends that could influence viewer engagement.

##### 2. What is/are the insight(s) found from the chart?

* The runtime of most movies lies between 50 and 180 minutes, and titles in this range receive noticeably more imdb_votes than very short or very long ones.
* Extremely long movies (more than 200 minutes) do not consistently receive high votes, which suggests diminishing audience interest or limited viewership.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Since the majority of high-vote movies fall in the 50–180 min range, producers and streaming platforms can prioritize films in this sweet spot.
* This increases the chances of better audience reach, engagement, customer satisfaction, and higher box office or streaming views.

* Very long movies (>180 mins) and very short movies (<50 mins) rarely get high votes, which may discourage repeat viewing, limit the audience base, and result in lower ROI for producers.

#### Chart - 9

SCATTER PLOT

To analyze how TMDB popularity varies with TMDB score.

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8,4))
sns.scatterplot(data=df,x='tmdb_score',y='tmdb_popularity')
plt.show()

##### 1. Why did you pick the specific chart?

The scatter plot was chosen to visually analyze the relationship between tmdb_score (a measure of quality) and tmdb_popularity (a measure of audience engagement). This helps identify patterns or trends that can inform decisions.

##### 2. What is/are the insight(s) found from the chart?





Movies with TMDB scores between 6–8 show higher popularity values which suggests that better-rated movies are generally more popular among audiences.
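
One way to check a band-level claim like this numerically is to bin the scores and compare a robust summary of popularity per band. A minimal sketch on made-up values (the bin edges are an assumption chosen to isolate the 6 to 8 band):

```python
import pandas as pd

# Toy scores and popularity values (made up for illustration)
toy = pd.DataFrame({
    "tmdb_score": [3.5, 5.0, 6.2, 6.8, 7.4, 7.9, 8.6, 9.1],
    "tmdb_popularity": [1.0, 2.0, 8.0, 12.0, 20.0, 10.0, 4.0, 3.0],
})

# Right-closed bands: (0, 4], (4, 6], (6, 8], (8, 10]
bins = [0, 4, 6, 8, 10]
toy["score_band"] = pd.cut(toy["tmdb_score"], bins=bins)

# The median is less sensitive than the mean to a few very popular titles
median_pop = toy.groupby("score_band", observed=True)["tmdb_popularity"].median()
print(median_pop)
```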

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, the insights can create positive business impact if companies prioritize mid-to-high scoring films(6-8) and apply strong marketing strategies.

* Negative growth may happen if resources are wasted on very low-rated films or if businesses wrongly assume high ratings always equal high popularity.

#### Chart - 10

LINE PLOT

To visualize the change in TMDB popularity across years

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(8,4))
sns.lineplot(data=df,x='release_year',y='tmdb_popularity')
plt.show()

##### 1. Why did you pick the specific chart?

It helps identify how popularity of movies has evolved from the 1910s to 2020s.
This chart shows not just individual data points but overall patterns and shifts over time.

##### 2. What is/are the insight(s) found from the chart?

* Popularity of films has grown steadily over decades, with a huge boom after 2015–2020, driven by streaming platforms and global audience access.
* Older movies (pre-1980s) show very little popularity, meaning modern releases drive audience attention today.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, these insights help in creating positive business impact by focusing on modern films and streaming opportunities.
* Negative growth risk exists if companies over-invest in older films or fail to adapt to the shorter attention span of audiences today.

#### Chart - 11

LINE PLOT

To show how the number of TV show seasons has changed over the years

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(6,3))
sns.lineplot(data=df,x='release_year',y='seasons')
plt.show()

##### 1. Why did you pick the specific chart?

This line plot was selected to visualize the trend of average TV show seasons over the years. It helps track how the number of seasons has evolved from the early 1900s to recent years.

##### 2. What is/are the insight(s) found from the chart?

* From 1920 to 1940, the average number of seasons remained relatively stable.
* The chart shows a rise in the average number of TV show seasons, peaking around 2000, followed by a sharp decline after 2015, indicating a shift in audience engagement towards fewer seasons.
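
Missing `seasons` values were filled with the median for every row, including movies, so a per-year average over the whole frame mixes movie rows into the trend. As a hedged sketch on made-up rows (it assumes the `type` column uses the values `SHOW` and `MOVIE`), restricting to shows before averaging isolates the TV trend:

```python
import numpy as np
import pandas as pd

# Toy frame before imputation: movies have no season count (values are made up)
toy = pd.DataFrame({
    "type": ["SHOW", "SHOW", "MOVIE", "MOVIE", "SHOW"],
    "release_year": [2000, 2000, 2000, 2018, 2018],
    "seasons": [6.0, 4.0, np.nan, np.nan, 1.0],
})

# Keep only shows, then average the season count per release year
shows = toy[toy["type"] == "SHOW"]
avg_seasons = shows.groupby("release_year")["seasons"].mean()
print(avg_seasons.to_dict())
```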

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, the trends show demand patterns that can guide investment in limited series and reduce risk.
The chart shows a rise in the average number of TV show seasons peaking around 2000, followed by a sharp decline after 2015. This suggests a shift toward shorter series, likely due to changing audience preferences and streaming trends.
* The decline in long-running shows and audience fragmentation indicate risk if businesses push extended series.

#### Chart - 12

LINE CHART

How has the average runtime of movies evolved over the years?

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(8,4))
sns.lineplot(data=df,x='release_year',y='runtime')
plt.show()

##### 1. Why did you pick the specific chart?


* The chart clearly shows the trend of runtime over release years.
* It highlights historical variations (early 1900s long runtimes, later stabilization).
* Line chart is suitable to show continuous changes over time.
* The shaded area gives an idea of variation/uncertainty in runtimes.

##### 2. What is/are the insight(s) found from the chart?

* In the early 1900s, runtimes were highly variable, with some very long movies.
* From the 1930s to the 1950s, runtimes dropped and stabilized around 70–90 minutes.
* After the 1960s, runtimes gradually increased and remained steady around 90–110 minutes.
* In the 2000s–2020s, runtimes stayed mostly consistent without extreme peaks. Movie runtimes have stabilized over time, suggesting that an industry standard of around 90–110 minutes has become the norm.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, the stabilization of runtimes (~90–110 mins) provides a clear industry benchmark.
* It helps producers/streaming platforms plan content length according to audience expectations.
* Extremely long runtimes (early 1900s peaks) or deviations from the standard may lead to lower audience retention.

#### Chart - 13

BAR CHART

Which performs better on IMDb ratings: Shows or Movies?

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(8,4))
sns.barplot(data=df,x='type',y='imdb_score')
plt.title('IMDB score for Shows and Movies')
plt.show()

##### 1. Why did you pick the specific chart?


* A bar chart is best for comparing categorical data (SHOW vs MOVIE).

* It provides a direct visual comparison of average IMDb scores.

##### 2. What is/are the insight(s) found from the chart?

* Shows have a higher average IMDb score (7.0) compared to Movies(5.9).
* This indicates that audiences tend to rate shows more positively than movies.
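
The averages behind the bars can be read off with a group-by instead of being estimated from the chart. A minimal sketch on made-up rows (the toy ids and scores are hypothetical); it keeps one row per title first so that large casts do not weight the mean:

```python
import pandas as pd

# Toy merged frame: titles repeat once per credited person (values are made up)
toy = pd.DataFrame({
    "id": ["m1", "m1", "m2", "s1", "s1", "s2"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "imdb_score": [6.0, 6.0, 5.0, 7.5, 7.5, 6.5],
})

# One row per title, then mean, median, and title count per content type
per_title = toy.drop_duplicates(subset="id")
summary = per_title.groupby("type")["imdb_score"].agg(["mean", "median", "count"])
print(summary)
```

Reporting the count next to the mean shows how many titles each average rests on.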

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, the higher IMDb scores of shows compared with movies suggest that investing in quality shows can increase audience satisfaction.
* Platforms can prioritize original shows to attract and retain subscribers.
* Movies scoring lower on average may indicate declining audience preference for movies, so over-investment in movies without addressing quality could lead to poor returns.


#### Chart - 14

BAR CHART

Who receives more audience attention on IMDb: Actors or Directors?

In [None]:
# Chart - 14 visualization code
plt.figure(figsize=(8,4))
sns.barplot(data=df,x='role',y='imdb_votes')
plt.title('Audience Engagement: Actors vs Directors on IMDb')
plt.show()

##### 1. Why did you pick the specific chart?

* A bar chart is best for comparing two categories (Actor vs Director).
* It provides a simple and direct visual comparison of IMDb votes.
* Quickly highlights that actors receive more audience attention than directors.

##### 2. What is/are the insight(s) found from the chart?

* Actors receive significantly more IMDb votes compared to Directors.
* This shows that audiences are generally more engaged with on-screen talent than behind-the-scenes roles.
* Indicates that star power heavily influences audience interaction and visibility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, it shows that audiences engage more with actors than directors.
* Businesses can leverage star power for marketing and promotions.
* While star power drives short-term engagement and revenue, ignoring directors’ contribution could harm content quality, leading to negative growth over time.

#### Chart - 15

BOX PLOT

How does the runtime of shows differ from that of movies, and which category tends to be longer?

In [None]:
# Chart - 15 visualization code
plt.figure(figsize=(8,4))
sns.boxplot(data=df,x='type',y='runtime')
plt.title('Runtime Comparison of Shows and Movies')
plt.show()

##### 1. Why did you pick the specific chart?

* A boxplot is ideal to compare the distribution of runtimes between shows and movies.
* It shows the variability and extreme values (like very long movies) that a simple bar or line chart would miss.



##### 2. What is/are the insight(s) found from the chart?

* Movies generally have longer runtimes than shows (higher median), whereas shows are shorter and more consistent, with less variation in runtimes.
* Movies show a wide spread and many outliers, some exceeding 500 minutes, unlike shows.
* This indicates that movies cater to longer viewing sessions, while shows are designed for shorter, episodic engagement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, the chart gives a clear understanding that shows are shorter and more consistent, while movies are longer and more varied.
* It helps businesses decide content strategy : use shows for regular engagement and movies for special/longer viewing.

#### Chart - 16

BAR CHART

Which type of content, shows or movies, has a longer average runtime?

In [None]:
# Chart - 16 visualization code
plt.figure(figsize=(6,3))
sns.barplot(data=df,x='type',y='runtime')
plt.title('Average Runtime Comparison: Shows vs Movies')
plt.show()

##### 1. Why did you pick the specific chart?

* A bar chart is ideal for comparing averages between two categories (shows vs movies).
* It provides a clear visual difference in runtime with simple interpretation.

##### 2. What is/are the insight(s) found from the chart?

* Movies have a much longer average runtime (100 mins) compared to shows (40 mins).
* This highlights the difference in content structure: movies are designed for one-time, long viewing, while shows focus on shorter, episodic formats.
* Audiences consume shows in smaller time slots, whereas movies require longer dedicated viewing.
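
Because `sns.barplot` plots the mean of `y` per category by default, the bar heights can be reproduced with a plain `groupby`. The sketch below uses made-up runtimes (chosen only so the arithmetic is easy to follow) and the assumed labels `MOVIE` and `SHOW`; it is not the notebook's actual data.

```python
import pandas as pd

# Hypothetical runtimes in minutes; not taken from the actual catalog.
sample = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "runtime": [95, 100, 105, 30, 40, 50],
})

# Mean runtime per content type: the same statistic the bar heights encode.
avg_runtime = sample.groupby("type")["runtime"].mean()

# Ratio of the movie average to the show average.
ratio = avg_runtime["MOVIE"] / avg_runtime["SHOW"]

print(avg_runtime)
print(f"Movies run {ratio:.1f}x longer than shows on average")
```

Running the same two lines on the notebook's `df` would give the exact averages rather than values estimated from the chart.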

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Yes, it helps businesses understand that shows are suited to frequent, short engagement, while movies are suited to long, immersive viewing.
* Streaming platforms can design strategies by balancing both formats to target different audience needs.
* If platforms push movies too frequently, viewers may feel time-constrained, reducing engagement.
* Over-reliance on short shows without strong storytelling may fail to retain audiences.

#### Chart - 17

#BAR CHART
Show the content type with the highest number of seasons using a bar chart.

In [None]:
# Chart - 17 visualization code
plt.figure(figsize=(8,4))
sns.barplot(data=df,x='type', y='seasons')
plt.xticks(rotation=45)
plt.title("Seasons in Shows and Movies")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing categorical data, and in this case, it helps quickly identify which content type has the highest average number of seasons.

##### 2. What is/are the insight(s) found from the chart?

* The chart shows that Shows have a significantly higher average number of seasons than Movies. This is expected, because movies are typically one-off content, while shows often span multiple seasons.
* This insight highlights that shows tend to offer longer-term content engagement.
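
One caveat when reading this chart: the bar for movies depends on how `seasons` is stored for them. The sketch below assumes (this is not confirmed by the notebook) that movies carry either a missing value or 0 in `seasons`, and shows on made-up data how that choice changes the average.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in which movies have no season count recorded.
sample = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "seasons": [np.nan, np.nan, 1, 3, 5],
})

# With NaN left in place, the movie mean is undefined (NaN).
mean_with_nan = sample.groupby("type")["seasons"].mean()

# Filling missing values with 0 gives movies a well-defined mean of 0.
filled = sample.assign(seasons=sample["seasons"].fillna(0))
mean_filled = filled.groupby("type")["seasons"].mean()

# Restricting the analysis to shows sidesteps the question entirely.
shows_only_mean = sample.loc[sample["type"] == "SHOW", "seasons"].mean()

print(mean_with_nan, mean_filled, shows_only_mean, sep="\n\n")
```

If the comparison is mainly about shows, the shows-only average is arguably the more informative number.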

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The insight supports investing in shows for longer viewer engagement, leading to positive business impact, with no indication of negative growth.

#### Chart - 18

#BAR PLOT
Which country's TV shows and films are most widely viewed?

In [None]:
# Chart - 18 visualization code
plt.figure(figsize=(10, 5))
# Get the top 10 production countries based on tmdb_popularity
top_10_countries = df.groupby('production_countries')['tmdb_popularity'].sum().nlargest(10).index
# Filter the DataFrame to include only the top 10 countries
filtered_df = df[df['production_countries'].isin(top_10_countries)]
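# Create the bar plot; bar height is the mean popularity per country (seaborn's default estimator),
# and bars are ordered by the total popularity used to select the top 10
sns.barplot(data=filtered_df, x='production_countries', y='tmdb_popularity', order=list(top_10_countries))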
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.title("Top 10 Production Countries by TMDB Popularity")
plt.tight_layout()  # Adjust layout to prevent labels from overlapping
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are excellent for comparing values across categories. Here, it allows us to easily compare the TMDB popularity scores of the top 10 production countries.

##### 2. What is/are the insight(s) found from the chart?



* The United States and Japan have significantly higher TMDB popularity than all other countries, which suggests that the most popular titles in the catalog are produced in the US and Japan.
* While the US and Japan lead, countries like India (IN), the United Kingdom (GB), and Canada (CA) also appear, showing a global contribution to popular content, albeit on a smaller scale.
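
The chart's top 10 countries are selected by total popularity, while `sns.barplot` draws the mean per country by default, so the two rankings can differ. A minimal sketch with made-up numbers (the country codes and scores are illustrative only) shows how a country with many titles can lead on the sum while another leads on the mean.

```python
import pandas as pd

# Hypothetical popularity scores; not taken from the actual catalog.
sample = pd.DataFrame({
    "production_countries": ["US", "US", "US", "US", "JP", "JP", "IN"],
    "tmdb_popularity": [10.0, 12.0, 8.0, 10.0, 30.0, 5.0, 5.0],
})

# Total, average and title count per country, side by side.
by_country = (
    sample.groupby("production_countries")["tmdb_popularity"]
    .agg(["sum", "mean", "count"])
)

top_by_sum = by_country["sum"].idxmax()    # country with the most total popularity
top_by_mean = by_country["mean"].idxmax()  # country with the most popular average title

print(by_country)
print(f"Top by sum: {top_by_sum}, top by mean: {top_by_mean}")
```

Reporting the title count alongside both statistics helps separate "produces a lot of content" from "produces popular content".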

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights show that the US and Japan dominate in TMDB popularity, which suggests that focusing on US- and Japan-based or collaborative productions can increase global viewership and drive business growth. This can create a positive impact by aligning content strategy with audience demand. However, over-reliance on US and Japanese content may limit regional diversity and lead to missed opportunities in other growing markets.

#### Chart - 19 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(8,4))
sns.heatmap(numeric_cols.corr(),annot=True,cmap='GnBu')
plt.show()

##### 1. Why did you pick the specific chart?

The heatmap was chosen because it is an excellent way to visually represent the correlation between multiple numerical variables in a grid format.

##### 2. What is/are the insight(s) found from the chart?

* imdb_score has a moderate positive correlation (0.6) with tmdb_score, which makes sense as both are scoring systems evaluating content.

* imdb_votes also shows a weak positive correlation with both imdb_score (0.26) and tmdb_score (0.22), indicating that titles with more votes tend to score slightly higher.

* runtime, seasons, and release_year show weak correlations with most variables, suggesting they may not strongly influence ratings or popularity.

* There is no strong multicollinearity, as most correlation values are low to moderate, making it safer to use multiple features in a model.
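
The multicollinearity remark can be checked by listing the feature pairs whose absolute correlation exceeds a cutoff. The sketch below is one way to do that; the helper `high_corr_pairs`, the 0.8 cutoff (a common rule of thumb, chosen here for illustration) and the synthetic columns are all assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return numeric column pairs whose |Pearson r| exceeds `threshold`."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() > threshold]


# Synthetic data: x2 is nearly a linear function of x1, x3 is unrelated.
sample = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],
    "x3": [5, 1, 4, 2, 3],
    "label": list("vwxyz"),  # non-numeric column, ignored by the helper
})

flagged = high_corr_pairs(sample)
print(flagged)
```

An empty result on the real numeric columns would support the statement that no pair of features is strongly collinear.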

#### Chart - 20 - Pair Plot

In [None]:
# Pair Plot visualization code
# sns.pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(df)
plt.show()

##### 1. Why did you pick the specific chart?

* A pair plot was chosen to explore relationships between multiple numeric features and detect patterns, correlations, and outliers.

##### 2. What is/are the insight(s) found from the chart?

* The pair plot shows a positive but weak relationship between IMDb votes and scores (0.26 in the correlation heatmap), indicating that widely reviewed content tends to be rated somewhat higher.
* TMDb popularity also shows a mild positive trend with TMDb scores.
* Runtime and release year show high variability without clear trends. This reflects the diversity of the content and suggests opportunities for personalized recommendations and content segmentation.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To increase user engagement and satisfaction on Amazon Prime Video, the EDA findings suggest the following:

* Leverage recent releases – Promote and prioritize movies released after 2020, as they dominate the platform and align with current user interest.

* Diversify content types – Add more TV shows to balance with the high number of movies and boost long-term engagement.

* Expand genre variety – Introduce more niche genres like Sci-Fi, Documentary, and Horror to attract a wider audience.

* Offer content for all age groups – Ensure a mix of family-friendly and mature content based on ratings analysis.

* Enhance regional content – Invest more in content from countries like India and the UK to attract global viewers.

* Monitor content performance – Regularly analyze viewer behavior to keep improving recommendations and content strategy.

# **Conclusion**

* Amazon Prime should increase the number of high-quality TV shows, as they offer stronger viewer engagement and longer watch time. Combining this with strategic genre expansion, localization, and personalized recommendations will help drive higher user satisfaction and retention.

* Popular genres like Drama, Comedy, and Action dominate, but there is room for expanding into niche genres to attract diverse viewers.

* The platform features a large volume of content released after 2020, highlighting a strategy focused on modern and up-to-date offerings.
