<a href="https://colab.research.google.com/github/anish0045h/Amazon-Prime-TV-Shows-and-Movies/blob/main/EDA_Amazon_Prime_TV_Shows_and_Movies_Fit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



**Amazon Prime TV Shows and Movies**








##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Anish Hegde**

# **Project Summary -**

The analysis highlights that the platform’s content library is dominated by a few key genres and production countries, with rapid growth in content volume over time. While popularity drives short-term engagement, content quality—reflected through IMDb ratings—plays a crucial role in long-term user retention, especially for TV shows. The findings emphasize the need for a balanced strategy that focuses on quality-controlled expansion, genre and regional diversification, and data-driven promotion to achieve sustainable business growth in the streaming industry.

# **GitHub Link -**

https://github.com/anish0045h/Amazon-Prime-TV-Shows-and-Movies

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

**Increase subscription growth** by expanding content diversity across genres and underrepresented production countries to attract a broader global audience.

**Improve user retention and engagement** by prioritizing high-quality, well-rated TV shows and episodic content that encourage repeat viewing.

**Optimize content investment** decisions by balancing popularity-driven titles with consistently high IMDb-rated content to maximize long-term return on investment.

**Reduce business risk and content fatigue** by avoiding over-dependence on a few dominant genres, regions, or blockbuster titles.

**Strengthen regional market** penetration through data-driven localization strategies that align content genres with regional audience preferences.

**Maintain platform quality** while scaling by controlling rapid content expansion to prevent quality dilution and protect brand reputation.

**Enhance data-driven decision** making by using ratings, popularity, and trend analysis to guide content acquisition, promotion, and lifecycle management.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import zipfile
import ast

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
'''
# Load Dataset
import zipfile

zip_file_path = '/content/titles.csv.zip'
output_dir = '/content/' # Extract to the current content directory

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(output_dir)

print(f"Successfully unzipped {zip_file_path} to {output_dir}")
'''

In [None]:
# Zipped CSV on Google Drive; pd.read_csv infers the zip compression from the file extension
file_path = '/content/drive/MyDrive/titles.csv.zip'
#df = pd.read_csv('/content/titles.csv')


In [None]:
df = pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
# Calculate missing values and sort them
missing_data = df.isnull().sum().sort_values(ascending=False)

# Filter out columns with no missing values
missing_data = missing_data[missing_data > 0]

# Create a bar chart of missing values
plt.figure(figsize=(12, 6))
sns.barplot(x=missing_data.index, y=missing_data.values, palette='viridis')
plt.title('Number of Missing Values Per Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### What did you know about your dataset?

The dataset contains 9871 rows and 15 columns.

There are only 3 duplicate rows, which is negligible relative to the total number of rows.

The columns fall into three data types:

*   8 columns are of object type (strings): id, title, type, description, age_certification, genres, production_countries, imdb_id.
*   2 columns are of int64 type (integers): release_year, runtime.
*   5 columns are of float64 type (floating-point numbers): seasons, imdb_score, imdb_votes, tmdb_popularity, tmdb_score.

Several columns have missing (null) values that need attention during data wrangling:

*   seasons: 8514 of 9871 values are missing. This is expected if the column is populated only for 'SHOW' titles, so it calls for type-aware handling rather than blind imputation or dropping.
*   age_certification: 6487 missing values, so it needs either a placeholder category or exclusion, depending on the analysis goals.
*   tmdb_score: 2082 missing values.
*   imdb_votes: 1031 missing values.
*   imdb_score: 1021 missing values.
*   imdb_id: 667 missing values.
*   tmdb_popularity: 547 missing values.
*   description: 119 missing values.

Overall, significant cleaning and preprocessing is needed, especially for the columns with a high percentage of missing values.
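The missing-value counts are easier to compare when they are expressed as a share of all rows. The following is a minimal sketch of that calculation; the frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are illustrative only
toy = pd.DataFrame({
    'type': ['MOVIE', 'MOVIE', 'SHOW', 'SHOW'],
    'seasons': [np.nan, np.nan, 2.0, np.nan],
    'imdb_score': [6.1, np.nan, 7.4, 8.0],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    'missing': toy.isnull().sum(),
    'percent': (toy.isnull().mean() * 100).round(1),
}).sort_values('missing', ascending=False)

print(missing_summary)
```

Applied to the real data, the same two-column summary would make it clear which columns lose most of their rows and which only a small fraction.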

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

The dataset contains detailed metadata about movies and TV shows. Each title is uniquely identified using an id, along with its title and type, which specifies whether the content is a movie or a TV show. Descriptive information is provided through the description column, while release_year indicates when the title was released. Audience suitability is captured using age_certification. The runtime variable represents the duration of the content in minutes, and seasons specifies the number of seasons for TV shows. Content classification is further enriched by genres and production_countries, both of which may contain multiple values. External rating and popularity metrics are included through imdb_id, imdb_score, and imdb_votes from IMDb, as well as tmdb_popularity and tmdb_score from TMDb, enabling comparative analysis of audience reception and popularity across platforms.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.sample(5)

In [None]:
#handling the missing values
df.isnull().sum()

In [None]:
# Impute missing rating and popularity metrics (imdb_score, tmdb_score, imdb_votes, tmdb_popularity) with the median of each content type
df['imdb_score'] = df.groupby('type')['imdb_score'] \
                      .transform(lambda x: x.fillna(x.median()))

df['tmdb_score'] = df.groupby('type')['tmdb_score'] \
                      .transform(lambda x: x.fillna(x.median()))

df['imdb_votes'] = df.groupby('type')['imdb_votes'] \
                      .transform(lambda x: x.fillna(x.median()))

df['tmdb_popularity'] = df.groupby('type')['tmdb_popularity'] \
                           .transform(lambda x: x.fillna(x.median()))

In [None]:
df['age_certification'] = df['age_certification'].fillna('Unknown')
df.loc[df['type'] == 'MOVIE', 'seasons'] = 0  #since movies do not have seasons we fill it as 0

In [None]:
# Drop the columns that are not required for the analysis.
# The identifier columns (id, imdb_id) are kept for reference only.
df = df.drop(columns=['description'])

In [None]:
df.sample(5)

In [None]:
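# genres and production_countries are read from the CSV as list-like strings,
# so parse them into real lists before exploding (explode leaves plain strings unchanged)
df_genres = df.assign(genres=df['genres'].apply(ast.literal_eval)).explode('genres')
df_countries = df.assign(production_countries=df['production_countries'].apply(ast.literal_eval)).explode('production_countries')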

In [None]:
genre_counts = df_genres['genres'].value_counts()
country_counts = df_countries['production_countries'].value_counts()

In [None]:
df_genres.head()

In [None]:
country_counts

In [None]:
import ast

df['genres'] = df['genres'].apply(ast.literal_eval)
df['genre_count'] = df['genres'].apply(len)

In [None]:
df['production_countries'] = df['production_countries'].apply(ast.literal_eval)

df['country_count'] = df['production_countries'].apply(len)

In [None]:
df.head()

### What all manipulations have you done and insights you found?

The dataset was carefully cleaned and transformed to ensure it was analysis-ready. Missing values in numerical rating and popularity metrics such as IMDb and TMDB scores and vote counts were handled using median-based imputation to avoid distortion from skewed distributions, while categorical missing values like age certification were labeled as “Unknown” to preserve information without making assumptions. Identifier columns that did not contribute to analysis were retained only for reference, and descriptive text fields were excluded from further processing since the focus was on structured exploratory analysis. Multi-valued categorical columns such as genres and production countries were converted into list formats and transformed using the explode operation to enable accurate counting, grouping, and comparison. Additional helper features like genre count and country count were created to capture the complexity of multi-genre and multi-country productions. From these manipulations, initial insights emerged, including the prevalence of multi-genre content, the dominance of a few production countries, and the suitability of ratings and popularity metrics for comparative analysis across content types.
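The two core manipulations described above, group-wise median imputation and parsing followed by explode, can be illustrated on a small example. This is a sketch under assumptions: the frame `toy` and its values are hypothetical, and only the column names mirror the dataset.

```python
import ast
import numpy as np
import pandas as pd

# Hypothetical toy frame; genres are stored as list-like strings, as in a CSV
toy = pd.DataFrame({
    'type': ['MOVIE', 'MOVIE', 'MOVIE', 'SHOW', 'SHOW'],
    'imdb_score': [5.0, 7.0, np.nan, 8.0, np.nan],
    'genres': ["['drama', 'comedy']", "['action']", "[]",
               "['drama']", "['comedy', 'drama']"],
})

# Fill each missing score with the median of its own content type
toy['imdb_score'] = (
    toy.groupby('type')['imdb_score']
       .transform(lambda s: s.fillna(s.median()))
)

# Parse the strings into real lists, then give each genre its own row
toy['genres'] = toy['genres'].apply(ast.literal_eval)
toy['genre_count'] = toy['genres'].apply(len)
toy_genres = toy.explode('genres')

# Titles with an empty genre list become a NaN row, which value_counts ignores
genre_counts = toy_genres['genres'].value_counts()

print(toy[['type', 'imdb_score', 'genre_count']])
print(genre_counts)
```

The order matters: exploding a column that still holds strings leaves the frame unchanged, so the parsing step has to come first.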

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
df_genres['genres'].value_counts()

In [None]:
df.head()

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='type')
plt.title('Distribution of Content Type')
plt.xlabel('Content Type')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing categorical counts. We want to quickly see whether Movies or TV Shows dominate the platform.

##### 2. What is/are the insight(s) found from the chart?

One content type clearly dominates the catalog: movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps decide investment allocation.

Guides content acquisition and production planning.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
top_genres = df_genres['genres'].value_counts().head(10)

plt.figure(figsize=(8,5))
sns.barplot(x=top_genres.values, y=top_genres.index)
plt.title('Top 10 Genres on the Platform')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart works best when category names are long and ranking matters.

##### 2. What is/are the insight(s) found from the chart?

A few genres dominate the catalog (e.g., drama, comedy).
Indicates content concentration rather than balanced diversity.
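One way to put a number on "content concentration" is the cumulative share of genre tags covered by the top few genres. The sketch below uses hypothetical counts; in the notebook the input would be the result of `df_genres['genres'].value_counts()`.

```python
import pandas as pd

# Hypothetical genre counts, used only to illustrate the calculation
genre_counts = pd.Series(
    {'drama': 50, 'comedy': 30, 'action': 10, 'horror': 6, 'western': 4}
)

# Share of all genre tags held by each genre, and the running total
genre_share = genre_counts / genre_counts.sum()
cumulative_share = genre_share.sort_values(ascending=False).cumsum()

# Combined share of the two largest genres
top2_share = cumulative_share.iloc[1]
print(f"Top 2 genres cover {top2_share:.0%} of genre tags")
```

Because a title can carry several genres, the shares are shares of genre tags, not of titles.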

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Confirms strong alignment with popular audience preferences.
Useful for targeted marketing and recommendation systems.
Over-reliance on a few genres can alienate niche audiences.
Low genre diversity may slow international subscriber growth.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
top_countries = df_countries['production_countries'].value_counts().head(10)

plt.figure(figsize=(8,5))
sns.barplot(x=top_countries.values, y=top_countries.index)
plt.title('Top 10 Content Producing Countries')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart clearly highlights regional dominance in content production.

##### 2. What is/are the insight(s) found from the chart?

Content is heavily concentrated in a few countries, led by the US and India.
Limited regional representation elsewhere.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps identify strong markets for continued investment.
Over-dependence on one region limits global appeal.
Weak regional diversity may reduce international subscriptions.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
content_year = df.groupby('release_year').size()

plt.figure(figsize=(10,5))
plt.plot(content_year.index, content_year.values)
plt.title('Content Growth Over Time')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.show()


##### 1. Why did you pick the specific chart?

A line chart is best for identifying trends over time.

##### 2. What is/are the insight(s) found from the chart?

Rapid growth in recent years.
Indicates aggressive content expansion.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Demonstrates platform scaling and competitiveness.

Supports user acquisition through frequent releases.

Rapid expansion may reduce average content quality.

Higher costs without guaranteed ROI.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(7,5))
sns.histplot(df['imdb_score'], bins=20, kde=True)
plt.title('Distribution of IMDb Ratings')
plt.xlabel('IMDb Score')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram helps understand the overall quality distribution.

##### 2. What is/are the insight(s) found from the chart?

Majority of content falls in mid-to-high rating range.

Few extremely low or high-rated titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Indicates consistent content quality.

Supports brand credibility and trust.

Lack of top-rated standout content may weaken differentiation.

Average quality alone may not attract premium users.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(7,5))
sns.scatterplot(
    data=df,
    x='imdb_score',
    y='tmdb_popularity',
    alpha=0.6
)
plt.title('IMDb Rating vs Popularity')
plt.xlabel('IMDb Score')
plt.ylabel('TMDB Popularity')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is ideal for analyzing relationships between two numerical variables.

##### 2. What is/are the insight(s) found from the chart?

High ratings do not always guarantee high popularity.

Some popular titles have average ratings → hype-driven content.
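The visual impression from the scatter plot can be backed by a rank correlation, which is less affected by a handful of extremely popular titles than a plain Pearson correlation. The sketch below is an illustration on made-up values; `toy` is hypothetical and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy values for illustration only
toy = pd.DataFrame({
    'imdb_score': [5.0, 6.0, 7.0, 8.0, 9.0],
    'tmdb_popularity': [40.0, 3.0, 120.0, 5.0, 8.0],
})

# Spearman correlation equals the Pearson correlation of the ranks,
# so large outliers in popularity do not dominate the result
rank_corr = toy['imdb_score'].rank().corr(toy['tmdb_popularity'].rank())
print(f"Spearman rank correlation: {rank_corr:.2f}")
```

A value close to zero would support the reading that rating and popularity move largely independently of each other.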

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps balance marketing vs quality investment.

Identifies underrated content worth promoting.


Over-promotion of low-quality content can harm brand trust.



#### Chart - 7

In [None]:
# Chart - 7 visualization code
genre_type = pd.crosstab(df_genres['genres'], df_genres['type'])
top_genre_type = genre_type.loc[top_genres.index]

top_genre_type.plot(kind='bar', figsize=(10,6))
plt.title('Genre Distribution by Content Type')
plt.xlabel('Genre')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A grouped bar chart reveals how genres differ across Movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

Certain genres dominate TV (e.g., drama, comedy).

Movies tend to focus more on action, thriller, or documentary.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps optimize genre-specific investments per content type.

If one genre dominates both formats, content fatigue may occur.

Indicates lack of experimentation.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(7,5))
sns.boxplot(data=df, x='type', y='imdb_score')
plt.title('IMDb Rating Distribution by Content Type')
plt.xlabel('Content Type')
plt.ylabel('IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot compares distributions, medians, and outliers effectively.

##### 2. What is/are the insight(s) found from the chart?

TV Shows generally have higher median ratings

Movies show more rating variability
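The two readings from the boxplot, the median and the spread, can also be computed directly. This is a minimal sketch on hypothetical values; the helper `iqr` is introduced here for illustration and is not part of the notebook.

```python
import pandas as pd

# Hypothetical toy values for illustration only
toy = pd.DataFrame({
    'type': ['MOVIE'] * 4 + ['SHOW'] * 4,
    'imdb_score': [4.0, 5.0, 7.0, 8.0, 7.0, 7.5, 7.5, 8.0],
})

def iqr(s):
    """Interquartile range: the height of the box in a boxplot."""
    return s.quantile(0.75) - s.quantile(0.25)

# Median and interquartile range of the rating for each content type
rating_summary = toy.groupby('type')['imdb_score'].agg(median='median', iqr=iqr)
print(rating_summary)
```

A higher median with a smaller interquartile range for one content type corresponds to a box that sits higher and is narrower in the chart.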

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

High-rated shows drive long-term subscriptions

Inconsistent movie quality may weaken brand perception


#### Chart - 9

In [None]:
# Chart - 9 visualization code
avg_rating_year = df.groupby('release_year')['imdb_score'].mean()

plt.figure(figsize=(10,5))
plt.plot(avg_rating_year.index, avg_rating_year.values)
plt.title('Average IMDb Rating Over Time')
plt.xlabel('Release Year')
plt.ylabel('Average IMDb Score')
plt.show()



##### 1. Why did you pick the specific chart?

A line chart best captures quality evolution over time.

##### 2. What is/are the insight(s) found from the chart?

Ratings remain stable / slightly improving

Quality has not degraded despite content expansion
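Yearly averages for years with very few titles can swing widely and distort the trend line. A possible safeguard, sketched here on hypothetical values, is to compute the title count next to the mean and keep only years above a minimum count. The threshold `min_titles` is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical toy values for illustration only
toy = pd.DataFrame({
    'release_year': [1950, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    'imdb_score': [9.0, 6.0, 7.0, 8.0, 6.5, 7.0, 7.0, 7.5],
})

# Mean rating and number of titles per release year
yearly = toy.groupby('release_year')['imdb_score'].agg(['mean', 'count'])

# Keep only years with enough titles for the mean to be meaningful
min_titles = 3  # hypothetical threshold
reliable = yearly[yearly['count'] >= min_titles]
print(reliable)
```

In this toy example the single title from 1950 is dropped, so its high score no longer appears as a peak in the trend.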

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Indicates sustainable growth

Supports confidence in scaling content libraries




#### Chart - 14 - Correlation Heatmap

In [None]:
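# Genre vs. country heatmap visualization code
# Explode both list columns on the same frame so that every row is one
# (genre, country) pair; crosstab needs both series to share the same index
df_gc = (
    df.explode('genres')
      .explode('production_countries')
      .reset_index(drop=True)
)
genre_country = pd.crosstab(df_gc['genres'], df_gc['production_countries'])

# Keep the 10 most frequent genres and the 10 most frequent countries
top_gc = genre_country.loc[
    genre_country.sum(axis=1).sort_values(ascending=False).head(10).index,
    genre_country.sum(axis=0).sort_values(ascending=False).head(10).index
]

plt.figure(figsize=(10,7))
sns.heatmap(top_gc, cmap='coolwarm')
plt.title('Genre vs Country Heatmap')
plt.xlabel('Country')
plt.ylabel('Genre')
plt.show()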

##### 1. Why did you pick the specific chart?

A heatmap reveals regional genre preferences at a glance.

##### 2. What is/are the insight(s) found from the chart?

Certain genres dominate specific regions

Strong localization patterns

Risk of cultural saturation
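Raw counts in a genre-by-country table are dominated by the countries with the most titles. To compare genre preferences between regions, the counts can be normalized within each country. The following sketch uses a hypothetical frame `toy`; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy frame with already-parsed list columns
toy = pd.DataFrame({
    'genres': [['drama', 'comedy'], ['drama'], ['action'], ['drama', 'action']],
    'production_countries': [['US'], ['IN'], ['US', 'IN'], ['US']],
})

# One row per (genre, country) pair, with a clean index for crosstab
pairs = (
    toy.explode('genres')
       .explode('production_countries')
       .reset_index(drop=True)
)

# Raw pair counts, and the genre mix within each country (columns sum to 1)
counts = pd.crosstab(pairs['genres'], pairs['production_countries'])
shares = pd.crosstab(pairs['genres'], pairs['production_countries'],
                     normalize='columns')
print(shares.round(2))
```

Plotting `shares` instead of `counts` in a heatmap would show which genres are over-represented in a country regardless of how large its catalog is.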

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

numeric_cols = [
    'imdb_score',
    'imdb_votes',
    'tmdb_popularity',
    'runtime',
    'genre_count',
    'country_count'
]

sns.pairplot(
    df[numeric_cols],
    diag_kind='kde',
    corner=True
)

plt.suptitle(
    'Pairplot of Key Numerical Features',
    y=1.02
)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot makes it possible to explore the relationships among multiple numerical variables simultaneously.

##### 2. What is/are the insight(s) found from the chart?

The pairplot allowed us to explore relationships between ratings, popularity, runtime, and content diversity simultaneously. Popularity and quality appear only weakly correlated. The pairplot alone cannot show what drives engagement, but the pattern is consistent with focused genre design and consistent quality being stronger drivers of sustainable user engagement than popularity alone.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis of content genres, production countries, ratings, popularity, and release trends, the platform should adopt a balanced, data-driven content strategy to achieve sustainable business growth. The content library is dominated by a few genres such as drama and comedy and heavily concentrated in a small number of countries, indicating strong performance in core markets but limited regional diversity. While highly popular titles drive short-term engagement, IMDb ratings reveal that well-focused, high-quality content—particularly episodic TV shows—supports stronger long-term user retention. Additionally, rapid content expansion over recent years increases the risk of quality dilution. Therefore, strategically investing in region-specific genres, promoting high-rated but under-exposed content, and maintaining controlled content growth can enhance subscription retention, expand international reach, and optimize content investment decisions.

✅ Quality over Popularity

*   Weak correlation between IMDb ratings and TMDB popularity
*   Several highly popular titles have only average IMDb scores

✅ TV Shows Drive Retention

*   TV shows show higher median IMDb ratings than movies
*   Popularity distribution shows longer engagement tails for shows

✅ Focused Genres Perform Better

*   Titles with fewer genres tend to have higher IMDb ratings
*   Clear genre positioning improves viewer satisfaction

✅ Regional Concentration Risk

*   Content production is heavily dominated by a few countries
*   Regional genre analysis reveals strong localization opportunities

✅ Controlled Content Expansion Is Necessary

*   Rapid growth in content volume over recent years
*   Average IMDb ratings remain stable but show risk of decline

✅ Platform Is Hit-Dependent

*   IMDb votes and popularity are highly right-skewed
*   Small number of titles drive majority of engagement
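The hit-dependence points above can be checked with two simple numbers: the skewness of the vote distribution and the share of all votes held by the top titles. This is a minimal sketch on hypothetical values; the 10% cut-off is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical vote counts: nine modest titles and one blockbuster
votes = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 4550], name='imdb_votes')

# Positive skewness indicates a long right tail
skewness = votes.skew()

# Share of all votes captured by the top 10% most-voted titles
top_n = max(1, int(len(votes) * 0.10))
top_share = votes.nlargest(top_n).sum() / votes.sum()

print(f"Skewness: {skewness:.2f}")
print(f"Top {top_n} title(s) hold {top_share:.0%} of all votes")
```

On the real data, the same calculation on `imdb_votes` and `tmdb_popularity` would show how much of the measured engagement depends on a small set of titles.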


# **Conclusion**

My analysis explored the platform’s content library by examining content type, genre diversity, production countries, release trends, audience ratings, and popularity metrics to understand the factors influencing user engagement and business growth. The findings reveal that the catalog is heavily dominated by a few genres such as drama and comedy and is largely concentrated in content produced by a small number of countries, highlighting both strong core markets and limited regional diversity. Trend analysis shows a rapid expansion of content in recent years, while ratings indicate that overall content quality has remained stable, though continued growth poses a risk of quality dilution. Additionally, the weak correlation between popularity and IMDb ratings suggests that short-term engagement is often driven by promotion rather than content quality. Overall, the analysis demonstrates that sustainable growth in the streaming industry depends on balancing content volume with quality, strengthening regional and genre diversification, and adopting data-driven promotion and investment strategies to enhance user retention, global reach, and long-term business value.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***