<a href="https://colab.research.google.com/github/deepmalakale/EDA-AmazonPrime-TVShows-and-Movies/blob/main/AmazonPrime_TVShows_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Exploratory Data Analysis on Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The **“Amazon Prime TV Shows and Movies”** project analyzes Amazon Prime Video’s content library to uncover key **trends, patterns, and insights** using Python (Pandas, Matplotlib, Seaborn, Plotly) and an interactive **Streamlit web app** for Univariate, Bivariate, and Multivariate exploration.

**📂 Datasets Used**

**Titles Dataset:** 9,871 records covering title, type, genre, runtime, release year, and ratings.

**Credits Dataset:** 124,000+ entries detailing actors, directors, and their roles.

**🧹 Data Preparation**
Data was cleaned, standardized, and merged, ensuring accuracy by handling missing values and normalizing categories.

**📊 Key Insights**

Movies dominate over TV shows.

Drama, Comedy, and Action are top genres.

U.S. and India lead in content production.

High IMDb/TMDB scores correlate with popularity.

Mature-rated content is most common.

**🧭 Impact & Conclusion**
The findings guide content strategy, audience targeting, and regional diversification, proving how data-driven insights can strengthen Amazon Prime Video’s growth, engagement, and competitive edge in the streaming market.

# **GitHub Link -**

GitHub link: https://github.com/deepmalakale/EDA-AmazonPrime-TVShows-and-Movies/blob/main/AmazonPrime_TVShows_Movies.ipynb

# **Problem Statement**


The rapid expansion of video streaming platforms has created an environment where data-driven decision-making is critical for sustaining competitive advantage. Amazon Prime Video, with its vast library of movies and TV shows, faces the challenge of understanding its diverse catalog in terms of content diversity, audience preferences, and regional distribution.

The key problems addressed in this project are:

**Content Diversity & Trends** – What are the dominant genres, formats (movies vs. shows), and age certifications? How has the catalog evolved over the years?

**Regional Analysis** – Which countries contribute the most to Amazon Prime’s content library? Are certain regions underrepresented or overrepresented?

**Audience Reception & Popularity** – How do IMDb and TMDB ratings, votes, and popularity scores correlate, and what do they reveal about audience engagement?

**Runtime & Release Insights** – What patterns emerge in terms of runtime, release years, and seasonal content availability?

**Cast & Crew Contributions** – Who are the most frequently featured actors and directors, and how do they influence content success?

By addressing these questions, the project aims to transform **raw datasets** (titles and credits) into a **clean, structured, and insightful analysis.** The insights generated will not only enhance the understanding of Amazon Prime’s current catalog but also **provide strategic guidance for content acquisition, audience targeting, and recommendation systems** in the competitive streaming industry.

#### **Define Your Business Objective?**


This project aims to analyze Amazon Prime Video’s catalog to uncover insights into audience preferences, content diversity, and market trends.

By exploring genres, regions, release patterns, ratings, and cast/crew data, **the goal is to:**

Enhance content strategy by identifying high-performing genres and regions.

Boost audience engagement through data-driven recommendations.

Drive business growth by aligning content investments with viewer demand.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
#from google.colab import drive
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
'''try:
  drive.mount('/content/drive')
except Exception as e:
  print('Error mounting google drive')'''

### Dataset Loading

In [None]:
'''#creating file path
directory = '/content/drive/MyDrive/Colab Notebooks/capston_project/mod2_capston/AmazonPrime'
filename_titles = 'titles.csv'
filename_credits = 'credits.csv'''

In [None]:
'''try:
  filepath_titles = os.path.join(directory, filename_titles)
  filepath_credits = os.path.join(directory, filename_credits)
except Exception as e:
  print('Creation of filepath unsuccesful')'''
filepath_titles='/content/titles.csv'
filepath_credits='/content/credits.csv'

In [None]:
# Load Dataset
try:
  titles = pd.read_csv(filepath_titles)
  credits = pd.read_csv(filepath_credits)
except Exception as e:
  # Report the cause and stop: later cells need both dataframes
  print(f'Error loading data: {e}')
  raise


### Dataset First View

In [None]:
# Dataset First Look
titles.head()

In [None]:

credits.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Titles Dataset - ', titles.shape)
print('----------------')
print('Credits Dataset - ', credits.shape)

### Dataset Information

In [None]:
# Dataset Info
print('Titles Dataset Info:\n')
print(titles.info())
print('----------------\n')
print('Credits Dataset Info:\n')
print(credits.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print('Titles Dataset Duplicate Values - ', titles.duplicated().sum())
print('----------------')
print('Credits Dataset Duplicate Values - ', credits.duplicated().sum())

In [None]:
print('---------Visualising duplicates in titles dataset--------')
titles[titles.duplicated()].sort_values(by = 'id')

In [None]:
print('---------Visualising duplicates in credits dataset--------')
credits[credits.duplicated()].sort_values(by = 'id')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print('Titles Dataset Missing Values:\n')
print(titles.isnull().sum())
print('----------------')
print('Credits Dataset Missing Values:\n')
print(credits.isnull().sum())

In [None]:
# Visualizing the missing values in Titles Data using heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(titles.isnull(), cmap = 'viridis',cbar = False, yticklabels = False)
plt.title('Missing values Heatmap')

In [None]:
# Visualizing the missing values in Credits Data using heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(credits.isnull(), cmap = 'viridis',cbar = False, yticklabels = False)
plt.title('Missing values Heatmap')

### What did you know about your dataset?

**📊 Dataset Overview**

---

The dataset used in this project comes from Amazon Prime Video’s catalog (U.S. region) 🇺🇸 — originally in two raw files: titles.csv 📄 and credits.csv 🎭 — and later cleaned 🧼, transformed 🔄, and merged into a consolidated dataframe df 📊.

**📐 Dataset Size & Structure**

---

Titles Dataset: 9,871 unique shows 🎬 & movies 🍿 with 15 attributes.

Credits Dataset: 124,235 records of cast 👥 and crew 🎥 with 5 attributes.

**Final Dataframe (df):** Combines both datasets, enabling title-level and person-level insights 🔍.

**🧮 Types of Variables**

---

**Categorical Variables:** Content type (Movie/Show 🎞️📺), Age Certification 🔞, Genres 🎭, Production Countries 🌍, Role (Actor/Director 🎬).

**Numerical Variables:** Release Year 📅, Runtime ⏱️, Number of Seasons 📺, IMDb/TMDB Ratings ⭐, IMDb Votes 🗳️, TMDB Popularity 🔥.

**Text Variables:** Title 📖, Description 📝, Name 🧑‍💼, Character.


**🎭 Content Composition**

---

The catalog includes both movies 🎞️ and TV shows 📺.

Genres are diverse (Drama 🎭, Comedy 😂, Action 💥, Thriller 😱, Documentary 🎥).

Content originates from multiple production countries 🌎, with the U.S. 🇺🇸 being dominant.

**⭐ Ratings & Popularity**

---

Most titles feature IMDb and TMDB ratings ⭐ — allowing assessment of audience reception.

TMDB popularity scores 🔥 offer insights into trending content.

**👥 Cast & Crew Data**

---

The credits dataset enriches analysis by linking actors 🎭 and directors 🎬 to titles.

This enables identification of the most frequent actors/directors and collaboration patterns 🔄.

**🧹 Data Quality & Cleaning**

---

Raw datasets had missing values 🚫, outliers 📌 and inconsistencies (e.g., missing runtimes ⏱️, null age certifications 🔞, duplicate records 🔁).

**Wrangling steps included:**

---

Handling null values (imputation or removal) 🧼.

Removing or capping outliers (extreme runtimes ⏱️ or unrealistic ratings ⭐).

Standardising categorical variables (genres 🎭, countries 🌍).

Merging duplicates and ensuring consistency across datasets 🔄.

**📊 Final Dataset (df)**

---

**The final dataframe is structured and reliable, capturing:**

**Title metadata** (name 📖, type 🎞️📺, genres 🎭, ratings ⭐, popularity 🔥).

**Credits metadata** (actors 🎭, directors 🎬, roles 🎭).

It serves as a comprehensive foundation for Exploratory Data Analysis (EDA) 🔍 and deriving business insights 💡.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print('Titles Dataset Columns:\n')
print(titles.columns)
print('----------------')
print('Credits Dataset Columns:\n')
print(credits.columns)

In [None]:
# Dataset Describe
print('Titles Dataset Describe:\n')
print(titles.describe())
print('----------------')
print('Credits Dataset Describe:\n')
print(credits.describe(include = 'all'))

### Variables Description

The dataset consists of two files — titles.csv and credits.csv — which were merged into a consolidated dataframe (df) for analysis. Below is the description of the key variables:

**Titles Dataset (titles.csv)**

> Contains metadata of TV shows and movies available on Amazon Prime.

- id : Unique identifier of the title (`on JustWatch`).

- title : Name of the movie or TV show.

- type : Content type: Movie or TV Show.

- description : Brief synopsis of the title.

- release_year : Year in which the title was released.

- age_certification : Age rating (`e.g., PG, R, 18+`).

- runtime : Duration of the movie or average length of an episode (`in minutes`).

- genres : List of genres (`Drama, Comedy, Action, etc.`).

- production_countries : Countries involved in producing the title.

- seasons : Number of seasons (`only for TV shows`).

- imdb_id : Unique IMDb identifier.

- imdb_score : IMDb rating score (`0–10`).

- imdb_votes : Number of IMDb user votes.

- tmdb_popularity : Popularity score from TMDB.

- tmdb_score : TMDB rating score.

**Credits Dataset (credits.csv)**

> Contains cast and crew details for each title.

- person_id : Unique identifier for an actor/director.

- id : Title ID (`to link with titles dataset`).

- name : Name of the person (`actor or director`).

- character : Character played (`if actor`).

- role : Role in the production (`ACTOR or DIRECTOR`).
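
The `genres` and `production_countries` columns are described above as lists. A minimal sketch of turning such cells into real Python lists, assuming they are stored in the CSV as list-like strings such as `"['drama', 'comedy']"` (an assumption, not verified here). The helper `parse_list_cell` and the toy dataframe are hypothetical and only illustrate the idea:

```python
import ast
import pandas as pd

def parse_list_cell(value):
    """Turn a list-like string such as "['drama', 'comedy']" into a Python list."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []  # NaN or other non-string input
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # text that is not a valid Python literal
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

# Toy stand-in for titles[['id', 'genres']]; the values are illustrative only
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "genres": ["['drama', 'comedy']", "['action']", "[]"],
})

toy_titles["genres_list"] = toy_titles["genres"].apply(parse_list_cell)
# explode() gives one row per (title, genre); empty lists become NaN and are dropped
genre_counts = toy_titles.explode("genres_list")["genres_list"].dropna().value_counts()
print(genre_counts)
```

Once the cells are real lists, per-genre or per-country counts no longer depend on string matching.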

### Check Unique Values for each variable.

In [None]:
#Check count of unique values for each variable
print('Count of unique values in Titles Dataset--------')
for col in titles.columns:
  count = titles[col].nunique()
  print(f'{col} : {count}')

print('----------------')

print('Count of unique values in Credits Dataset')
for col in credits.columns:
  count = credits[col].nunique()
  print(f'{col} : {count}')

In [None]:
# Check Unique Values for each variable.
print('Unique Values in Titles Dataset-------')
for col in titles.columns:
  print(f"{col} : {titles[col].unique()}")


In [None]:
print('Unique values in Credits Dataset--------')
for col in credits.columns:
  print(f'{col} : {credits[col].unique()}')

## ***3. Data Wrangling***

#### Initial Preprocessing


In [None]:
#Remove Duplicated Values
try:
  titles.drop_duplicates(inplace = True)
  credits.drop_duplicates(inplace = True)
except Exception as e:
  print('Could not delete duplicates')

In [None]:
print('Duplicate Count after processing')
print(f'Duplicate count in Titles Dataset: {titles.duplicated().sum()}')
print('-----------')
print(f'Duplicate count in Credits Dataset: {credits.duplicated().sum()}')


#### Drop columns with 70% or more null values

In [None]:
#Checking % of null values for each column

titlesLen = titles.shape[0]
for col in titles.columns:
  null_count = titles[col].isnull().sum()
  print(f'% null values in {col} : {(null_count * 100 / titlesLen):.2f}%')

In [None]:
#We can see that the seasons column has more than 70% null values
#So we will drop this column

try:
  titles = titles.drop(columns = ['seasons'])
except Exception as e:
  print(f'Error dropping columns: {e}')
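
The 70% rule can also be applied without naming the column by hand. A minimal sketch on a toy dataframe (the values and the constant name `NULL_THRESHOLD` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the titles dataframe; values are illustrative only
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "seasons": [np.nan, np.nan, np.nan, 2.0],   # 75% missing
    "runtime": [90.0, np.nan, 45.0, 30.0],      # 25% missing
})

NULL_THRESHOLD = 0.70  # assumed cut-off, matching the 70% rule used above

null_share = toy.isnull().mean()  # fraction of nulls per column
cols_to_drop = null_share[null_share >= NULL_THRESHOLD].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)

print(null_share.mul(100).round(2))
print("Dropped:", cols_to_drop)
```

Selecting columns from the computed shares keeps the drop step correct if the data changes.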


In [None]:
#Remaining columns after dropping seasons
titles.columns

In [None]:
#Check Null values in other columns for insights
titles[titles['description'].isnull() & titles['imdb_score'].isnull() & titles['imdb_votes'].isnull() & titles['tmdb_popularity'].isnull() & titles['tmdb_score'].isnull()]

In [None]:
#Since we can see that there are 69 rows with all null column values
# We can drop these rows as well

rowIndex = list(titles[titles['description'].isnull() & titles['imdb_score'].isnull() & titles['imdb_votes'].isnull() & titles['tmdb_popularity'].isnull() & titles['tmdb_score'].isnull()].index)
titles = titles.drop(index= rowIndex)

### **Cleaning of Columns:**

#### Clean Description Column

In [None]:
#Since we cannot use any statistical method to fill null values for description column, we will instead fill it with 'Unknown'
#This will help us to keep track for further processing and analysis

titles['description'] = titles['description'].fillna('Unknown')

In [None]:
titles['description'].isnull().sum()

#### Clean Age-Certification Column

In [None]:
# We will fill the null values with 'Unknown'
# The remaining values will give us insights to analyse this column
# Cleaning any leading/trailing whitespace as well

titles['age_certification'] = titles['age_certification'].fillna('Unknown').str.strip()


In [None]:
titles['age_certification'].isnull().sum()

#### Clean Imdb-ID Column

In [None]:

# We will fill the null values with 'Unknown'
# Cleaning any leading/trailing whitespace as well

titles['imdb_id'] = titles['imdb_id'].fillna('Unknown').str.strip()

In [None]:
titles['imdb_id'].isnull().sum()

#### Clean imdb score

In [None]:
#Check the data type of the imdb_score column to choose an appropriate imputation method
titles['imdb_score'].dtype

In [None]:
#Check the skewness of the column
titles['imdb_score'].skew()

In [None]:
#The column is approximately normally distributed (low skewness); visualizing this through a histogram
plt.title('Visualization of imdb_score (includes Null)')
sns.histplot(data = titles, x = 'imdb_score', kde = True)

In [None]:
titles["type"].head()

In [None]:
#Using mean on groupby 'type' column to fill null values
titles['imdb_score'] = titles['imdb_score'].fillna(titles.groupby('type')['imdb_score'].transform('mean'))

In [None]:
#Rounding the mean values to 2 decimal places
titles['imdb_score'] = titles['imdb_score'].apply(lambda x: round(x,2))

In [None]:
titles['imdb_score'].isnull().sum()
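
To show what the group-wise fill above does, here is a minimal sketch on toy data. The `MOVIE` and `SHOW` labels and the scores are illustrative assumptions:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [6.0, 8.0, np.nan, 7.5, np.nan],
})

# transform('mean') returns a Series aligned to the original rows,
# holding each row's group mean (NaNs are skipped when averaging)
group_mean = toy.groupby("type")["imdb_score"].transform("mean")
toy["imdb_score_filled"] = toy["imdb_score"].fillna(group_mean).round(2)
print(toy)
```

The missing movie score receives the movie mean and the missing show score receives the show mean, so one type does not pull the other's fill value.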

#### Clean imdb_votes

In [None]:
#Check column datatype
titles['imdb_votes'].dtype

In [None]:
#Check column skewness
titles['imdb_votes'].skew()

In [None]:
#Visualising column skewness through histplot
plt.title('Visualization of imdb_votes (includes null)')
plt.xscale("log")
sns.histplot(data = titles, x = 'imdb_votes', kde = True)


In [None]:
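#Filling null values with the mean within each type group
#(vote counts are often right-skewed; if the skewness above is large, 'median' is the more robust choice)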
titles['imdb_votes'] = titles['imdb_votes'].fillna(titles.groupby('type')['imdb_votes'].transform('mean'))

In [None]:
titles['imdb_votes'].isnull().sum()

In [None]:
#Since votes cannot be in fraction, we will convert this column datatype to int
titles['imdb_votes'] = titles['imdb_votes'].astype(int)

#### Clean tmdb_popularity

In [None]:
titles['tmdb_popularity'].dtype

In [None]:
#Check column skewness
titles['tmdb_popularity'].skew()

In [None]:
#The column values are right skewed by a large margin
#Visualising it through a histplot

plt.title('Visualization of tmdb_popularity (includes Null)')
plt.xscale("symlog") #Symmetric log scale to visualize wide range of skewed data
sns.histplot(data = titles, x = 'tmdb_popularity', kde = True)

In [None]:
titles['tmdb_popularity'].isnull().sum()

In [None]:
#Since the column is right skewed
#We will fill the null values with median within each type group

titles['tmdb_popularity'] = titles['tmdb_popularity'].fillna(titles.groupby('type')['tmdb_popularity'].transform('median'))

In [None]:
#After cleaning null values
print(titles['tmdb_popularity'].isnull().sum())
print('-----------------')
print('tmdb_popularity values after processing null values')
print(titles['tmdb_popularity'].head(10))
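
The mean-versus-median decision used across these columns can be written as one rule. A minimal sketch, where `fill_by_group` and the `skew_limit` of 1.0 are hypothetical choices for illustration and not something the notebook defines:

```python
import numpy as np
import pandas as pd

def fill_by_group(frame, column, group_col="type", skew_limit=1.0):
    """Fill nulls in `column` with the per-group mean, or the per-group
    median when the column's absolute skewness exceeds `skew_limit`."""
    stat = "median" if abs(frame[column].skew()) > skew_limit else "mean"
    filler = frame.groupby(group_col)[column].transform(stat)
    return frame[column].fillna(filler), stat

# Toy data with one extreme value so the column is strongly right-skewed
toy = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 4,
    "tmdb_popularity": [1.0, 2.0, 3.0, 200.0, np.nan, 4.0, 5.0, 6.0, np.nan],
})
filled, chosen = fill_by_group(toy, "tmdb_popularity")
print(chosen)
print(filled.tolist())
```

With an extreme value present, the median keeps the fill close to typical titles, while the mean would be pulled toward the outlier.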

#### Clean tmdb_score

In [None]:
titles['tmdb_score'].isnull().sum()

In [None]:
#Find skewness of the tmdb_score column
titles['tmdb_score'].skew()

In [None]:
#Visualising skewness through histplot
plt.title("Visualising tmdb_score (Includes Null)")
sns.histplot(data = titles, x ='tmdb_score', kde = True)

In [None]:
#tmdb_score is approximately normally distributed, so we replace the null values with the per-type mean
titles['tmdb_score'] = titles['tmdb_score'].fillna(titles.groupby('type')['tmdb_score'].transform('mean'))

In [None]:
titles['tmdb_score'].isnull().sum()

In [None]:
titles.info()

### Preprocess Credits Dataset

### Clean Character Column

In [None]:
credits.isnull().sum()

In [None]:
# Filling the null character values with Unknown
credits['character'] = credits['character'].fillna('Unknown')

In [None]:
titles.isnull().sum()

In [None]:
credits.isnull().sum()

### Merging Titles and Credits Dataset on common 'id' column

In [None]:
titles.head()

In [None]:
credits.head()

In [None]:
#Now that both datasets are clean, we can merge them on the common 'id' column for final analysis and visualization
#We use a left outer join, keeping all rows from the titles dataset and the matching rows from the credits dataset

ct_dataset = pd.merge(titles, credits, how = 'left', on = 'id')
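
A left join can be sanity-checked with the `validate` and `indicator` parameters of `pd.merge`. A minimal sketch on toy frames (all ids, names, and values are illustrative):

```python
import pandas as pd

toy_titles = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "person_id": [101, 102, 103],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

merged = pd.merge(
    toy_titles, toy_credits,
    how="left", on="id",
    validate="one_to_many",   # raises MergeError if ids repeat in toy_titles
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)
print(merged["_merge"].value_counts())

# Titles that found no matching credit rows
titles_without_credits = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(titles_without_credits)
```

Rows marked `left_only` are the titles whose credit columns come back null after the merge.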

In [None]:
ct_dataset.shape

In [None]:
#First look of merged dataset

ct_dataset.head()

In [None]:
# Checking for any null values after merging
ct_dataset.isnull().sum()

In [None]:
ct_dataset["release_year"].dtype

In [None]:
# First look at the null data

ct_dataset[ct_dataset['person_id'].isnull() & ct_dataset['name'].isnull() & ct_dataset['character'].isnull() & ct_dataset['role'].isnull()]

In [None]:
# From this data we can figure out that several other columns have values such as 'Unknown' or mean / median processed values
# Narrowing the data based on these metrics to possibly drop from the table
temp = ct_dataset[ct_dataset['person_id'].isnull() & ct_dataset['name'].isnull() & ct_dataset['character'].isnull() & ct_dataset['role'].isnull()]
temp[(temp['age_certification'] == 'Unknown')]

In [None]:
# Checking if there are any duplicate values in the merged dataset

ct_dataset.duplicated().sum()

In [None]:
# Removing rows where most values are NAN and other columns are 'Unknown'
rows = temp[(temp['age_certification'] == 'Unknown')].index
ct_dataset.drop(index = rows, inplace = True)

### Filling Null values in merged dataset with 'Unknown' since these are categorical data

In [None]:
# Final processing
# Applying directly to all the null column values

df = ct_dataset.fillna('Unknown')

In [None]:

# Last check for any null values
df.isnull().sum()

**Remove Outliers**

In [None]:
#Outlier visualisation code using Box Plot
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd


# Get numeric columns
numeric_cols = df.select_dtypes(include='number').columns.tolist()
num_cols = len(numeric_cols)

# Compute number of rows needed, 3 plots per row
n_cols_per_row = 3
n_rows = (num_cols + n_cols_per_row - 1) // n_cols_per_row

# Create subplots grid
fig, axes = plt.subplots(n_rows, n_cols_per_row, figsize=(6 * n_cols_per_row, 4 * n_rows), squeeze=False)

# Flatten axes for easy indexing
axes_flat = axes.flatten()

# Set Seaborn style
sns.set(style="whitegrid")

# Use a color palette
colors = sns.color_palette("Set2", num_cols)  # Enough colors for all plots

for idx, col in enumerate(numeric_cols):
    ax = axes_flat[idx]
    sns.boxplot(
        x=df[col],
        orient='h',
        ax=ax,
        width=0.6,
        whis=1.5,
        fliersize=5,
        linewidth=1.2,
        color=colors[idx]  # Use color instead of palette to avoid warning
    )
    ax.set_title(col, fontsize=12)
    ax.set_yticks([])         # Hide y-axis ticks
    sns.despine(ax=ax, left=True)

# Turn off unused subplots, if any
for idx in range(num_cols, len(axes_flat)):
    fig.delaxes(axes_flat[idx])

plt.tight_layout(pad=3)
plt.show()

In [None]:
#Outlier removal based on the IQR rule
import pandas as pd

def remove_outliers_iqr(df):
    df_clean = df.copy()
    num_cols = df_clean.select_dtypes(include='number').columns

    for col in num_cols:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Keep only rows within bounds
        df_clean = df_clean[(df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound)]

    return df_clean

# Apply the function once and compare shapes before and after
print("Original shape:", df.shape)
df = remove_outliers_iqr(df)
print("Shape after outlier removal:", df.shape)
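
The function above filters column by column, so each later column's quartiles are computed on rows that already survived earlier filters. A variant, sketched here under the assumption that all bounds should come from the unfiltered data, builds one combined mask instead. The name `iqr_mask` and the toy values are hypothetical:

```python
import pandas as pd

def iqr_mask(frame, columns, k=1.5):
    """Boolean mask of rows lying inside [Q1 - k*IQR, Q3 + k*IQR] for every
    listed column, with all bounds computed on the unfiltered data."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Compare each column against its own bounds (aligned on column names)
    inside = frame[columns].ge(lower, axis=1) & frame[columns].le(upper, axis=1)
    return inside.all(axis=1)

toy = pd.DataFrame({
    "runtime": [90, 95, 100, 105, 110, 600],
    "imdb_score": [6.5, 7.0, 7.2, 6.8, 7.1, 6.9],
})
mask = iqr_mask(toy, ["runtime", "imdb_score"])
toy_clean = toy[mask]
print(toy.shape, "->", toy_clean.shape)
```

The two approaches can keep different rows. Neither is the single correct one, but the choice affects how many rows remain.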

In [None]:
#Box plot Visualization after removal of Outliers

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np



# Select numeric columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()

# Define subplot layout: 3 plots per row
n_cols = 3
n_rows = int(np.ceil(len(num_cols) / n_cols))

# Seaborn style and color palette
sns.set(style="whitegrid")
colors = sns.color_palette("Set2", len(num_cols))

# Create subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(6 * n_cols, 4 * n_rows))
axes = axes.flatten()  # Flatten in case of multiple rows

# Plot each numeric column as horizontal boxplot
for i, col in enumerate(num_cols):
    sns.boxplot(x=df[col], ax=axes[i], color=colors[i], orient='h')
    axes[i].set_title(f'Boxplot of {col}', fontsize=12)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout(pad=3)  # Good spacing between subplots
plt.show()

### What all manipulations have you done and insights you found?

🛠️ Data Manipulations Performed

To ensure the dataset was ready for analysis, several data cleaning and transformation steps were performed:

🔗 Merging Datasets

Combined the titles dataset (titles.csv) and credits dataset (credits.csv) on the common key (id).

Resulted in a consolidated dataframe (df) containing both title-level metadata and cast/crew details.

🧹 Handling Missing Values

Removed duplicate entries.

Imputed or dropped missing values for fields like description, runtime, age certification, and ratings (imdb_score, tmdb_score).

📉 Outlier Treatment

Identified and removed outliers in runtime, ratings, and popularity scores using the IQR method.

Ensured unrealistic entries (e.g., extremely high runtimes) were corrected or excluded.

🔄 Data Standardization

Normalized categorical variables like genres, production countries, and age certification.

Set id as a unique index where appropriate.

🧠 Feature Engineering

Derived year-wise distributions of titles.

Classified content into Movies vs. TV Shows.

Counted actor/director frequency to study cast/crew prominence.
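
A minimal sketch of the actor/director frequency count, assuming the merged dataframe's `id`, `name`, and `role` columns. The toy names are fictional and the `title_count` column name is an illustrative choice:

```python
import pandas as pd

toy = pd.DataFrame({
    "id":   ["t1", "t1", "t2", "t2", "t3"],
    "name": ["Ann Lee", "Raj Rao", "Ann Lee", "Raj Rao", "Ann Lee"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "DIRECTOR", "ACTOR"],
})

# Count distinct titles per person within each role
appearances = (
    toy[toy["name"] != "Unknown"]   # skip placeholder rows left by fillna
    .groupby(["role", "name"])["id"]
    .nunique()
    .sort_values(ascending=False)
    .rename("title_count")
    .reset_index()
)
print(appearances)
```

Using `nunique()` on `id` counts each title once per person, even if a person is credited for several characters in the same title.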

📊 Visualization & EDA

Created box plots, bar plots, violin plots, treemaps, bubble plots, heatmaps, pie/donut charts, and correlation plots using Seaborn and Plotly for interactivity.

🔍 Insights Found

🎬 Content Composition

Amazon Prime hosts more movies than TV shows.

Certain genres (Drama, Comedy, Action) dominate the catalog, while niche genres are less represented.

📈 Trends Over Time

Significant growth in content release after 2010, showing Prime’s rapid expansion.

Recent years show diversification in genres and international productions.

🌍 Regional Distribution

Majority of titles are produced in the United States, but strong representation also comes from India, UK, and Canada.

⏱️ Runtime Analysis

Movies generally fall within the 90–120 minute range.

Outliers were present (very short or very long movies), which were filtered during cleaning.

⭐ Audience Ratings & Popularity

High IMDb/TMDB ratings correlate with higher popularity scores.

A few low-rated titles still have high popularity, suggesting marketing or trending factors.

🔞 Age Certification

A significant proportion of content is rated 18+ or PG-13, highlighting Prime’s focus on adult and teen audiences.

🎭 Cast & Crew Insights

Frequent actors and directors were identified (through credits dataset).

Certain directors and actors are repeatedly associated with highly-rated content.

In [None]:
# Save DataFrame as CSV
df.to_csv('mydata.csv', index=False)

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## Chart - 1: Histogram
## Univariate Analysis
**“How are IMDb scores distributed across Movies and TV Shows on Amazon Prime?”**

In [None]:
# Histogram visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Set Seaborn style
sns.set_theme(style="whitegrid")

# Use safe font (no emoji issues)
plt.rcParams['font.family'] = 'DejaVu Sans'

# Create the figure
plt.figure(figsize=(12, 7))

# Plot histogram
ax = sns.histplot(
    data=df,
    x="imdb_score",
    hue="type",                   # Differentiate Movies and TV Shows
    bins=30,
    kde=True,
    palette=["#1f77b4", "#ff7f0e"],   # Prime-inspired colors
    alpha=0.7,
    legend=True                   # Seaborn handles legend automatically
)

# Title and labels
plt.title("Distribution of IMDb Scores by Content Type on Amazon Prime",
          fontsize=18, fontweight="bold", color="#2E4053", pad=15)
plt.xlabel("IMDb Score", fontsize=14, color="#1a5276")
plt.ylabel("Number of Titles", fontsize=14, color="#1a5276")

# ✅ Get the legend from Seaborn axis (no manual plt.legend)
leg = ax.get_legend()
if leg is not None:
    leg.set_title("Type")
    plt.setp(leg.get_title(), fontsize=13, fontweight='bold')
    plt.setp(leg.get_texts(), fontsize=12)

# Grid and annotations
plt.grid(alpha=0.2)
plt.text(8.5, 4000, "Higher-rated Titles →", color="#145A32", fontsize=13, fontweight="bold")
plt.text(4.0, 4000, "← Lower-rated Titles", color="#922B21", fontsize=13, fontweight="bold")

plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

**📊 The histogram is ideal for this analysis because:**

---
It shows the distribution pattern of a single numeric variable (imdb_score).

The use of hue='type' adds a comparative layer between Movies and TV Shows.

It reveals skewness, helping us spot whether the majority of Prime content is highly rated, average, or low-rated.

It’s visually intuitive — easy for both technical and non-technical audiences to interpret.

### **2. What is/are the insight(s) found from the chart?**

**After examining the visualization, several insights emerged:**

---
⭐ Most IMDb scores lie between 6.0 and 8.0, showing that Prime content is generally well-received.

🎥 Movies (blue) have a slightly higher density around 7.0–8.0, suggesting that they tend to achieve better audience ratings than TV Shows.

⚙️ TV Shows (orange) display more spread, indicating diverse viewer reception.

🚫 Few titles fall below 4.0, meaning low-quality content is minimal.

🏆 A small cluster above 9.0 represents critically acclaimed or exclusive original titles.

💡 **Inference:** Amazon Prime maintains strong quality consistency across its library, with most content achieving above-average viewer satisfaction.
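
Because `df` is the merged title-credit table, each title appears once per credited person, so a histogram of `df` weights titles by cast size. A minimal sketch of checking the score bands on one row per title instead; the toy values and band edges are illustrative assumptions:

```python
import pandas as pd

toy = pd.DataFrame({
    "id":         ["t1", "t1", "t1", "t2", "t3", "t3"],
    "type":       ["MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE", "MOVIE"],
    "imdb_score": [7.4, 7.4, 7.4, 5.2, 8.6, 8.6],
})

# One row per title, so a large cast does not inflate a title's weight
per_title = toy.drop_duplicates(subset="id")

# Right-closed bands: (0, 4], (4, 6], (6, 8], (8, 10]
bands = pd.cut(
    per_title["imdb_score"],
    bins=[0, 4, 6, 8, 10],
    labels=["0-4", "4-6", "6-8", "8-10"],
)
band_share = bands.value_counts(normalize=True).sort_index()
print(band_share)
```

Comparing these shares with the histogram shows whether the 6.0 to 8.0 concentration holds at title level or is driven by titles with many credits.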

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

---
**💼 Positive Business Impact**

---
✅ These insights have strong strategic implications for Amazon Prime Video.

**🧾 Quality Benchmarking:**

---
Most titles scoring between 6–8 gives Amazon a performance benchmark for upcoming releases.

**🎬 Content Investment Strategy:**

---
Since Movies outperform TV Shows slightly, Amazon can prioritize investment in high-rated genres or formats.

**👥 Customer Retention:**

---
A consistently strong rating distribution increases viewer trust and satisfaction, leading to higher retention rates.

**🔁 Recommendation Optimization:**

---
IMDb scores can refine the recommendation algorithm, promoting higher-rated titles to improve engagement.

---

**⚠️ Potential Negative Insights & Risks**

---

**🚩 Possible Concern:**

---
If future analyses show IMDb ratings drifting below 5,
**it may indicate:**

A decline in production quality,
Declining audience involvement, or
Content overload with too many average shows.

**📉 Business Impact:**

---
This could lead to reduced engagement, brand fatigue, and eventually subscriber churn.

---
**✅ Action Plan:**

---
Continuously monitor IMDb score trends.


Use viewer feedback and ratings as early warning signals for content strategy adjustments.

## Chart - 2: Bar plot
## Univariate Analysis

**“Which content rating category (like TV-MA, PG-13, etc.) dominates Amazon Prime’s library, and how does it differ between Movies and TV Shows?”**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Copy the original DataFrame to avoid modifying it directly
data = df.copy()

# 🧹 Clean and preprocess the data
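# Handle missing, blank, or placeholder ratings
# ('Unknown' is the placeholder written during data wrangling)
data['age_certification'] = data['age_certification'].fillna("Not Rated")
data['age_certification'] = data['age_certification'].replace({'': "Not Rated", 'Unknown': "Not Rated"})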

# Standardize 'type' column: remove spaces, fix casing, and unify naming
data['type'] = data['type'].str.strip().str.title()

# Fix inconsistent entries (ensure all are either "Movie" or "TV Show")
data['type'] = data['type'].replace({
    'Tv Show': 'TV Show',
    'Show': 'TV Show',
    'Tvshow': 'TV Show',
    'Movies': 'Movie'
})

# Check final unique values (optional)
# print(data['type'].unique())

# 🧩 Group by age certification and type
grouped_data = (
    data.groupby(['age_certification', 'type'])
    .size()
    .reset_index(name='count')
    .sort_values(by='count', ascending=False)
)

# 🎨 Define a professional color palette
palette = {"Movie": "#1f77b4", "TV Show": "#ff7f0e"}

# 📊 Create the figure
plt.figure(figsize=(12, 7))
sns.set_theme(style="whitegrid")

# 📈 Grouped bar plot
sns.barplot(
    data=grouped_data,
    x='age_certification',
    y='count',
    hue='type',
    palette=palette
)

# 🏷️ Titles and axis labels
plt.title("Distribution of Content Ratings by Type on Amazon Prime",
          fontsize=18, fontweight="bold", color="#2E4053", pad=15)
plt.xlabel("Age Certification (Content Rating)", fontsize=14, color="#1a5276")
plt.ylabel("Number of Titles", fontsize=14, color="#1a5276")
plt.xticks(rotation=45, fontsize=12)
plt.legend(title="Type", title_fontsize=13, fontsize=12)

plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

✅ **Categorical Comparison:**

---
The variable age_certification is categorical (e.g., 7+, 13+, 16+, 18+). A bar plot is ideal for comparing the frequency of each category visually.



✅ **Clarity & Interpretability:**

---
Bar plots clearly show which certifications dominate — easy to interpret at a glance and perfect for business storytelling or stakeholder presentations.

### **2. What is/are the insight(s) found from the chart?**

📈 **1. Dominant Certifications:**

---
The chart likely shows that most titles fall under “13+” and “16+”, indicating Amazon Prime’s core audience is teenagers and adults.

🎬 **2. Content Type Trends:**

---
Movies have higher counts in 13+ and 16+ ratings — suggesting more mature themes like action, thriller, or drama.

👨‍👩‍👧 **3. Gaps in “Kids” or “Family” Content:**

---
The 7+ or All Ages segments are likely small, showing a limited focus on younger audiences compared to competitors like Disney+.
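
The dominance and gap statements above can be checked numerically with a column-normalised crosstab. A minimal sketch on toy data with one row per title (the certification labels and counts are illustrative, not results from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "age_certification": ["R", "R", "PG-13", "Unknown", "TV-MA", "TV-14"],
})

# Column-normalised crosstab: each column (content type) sums to 1
share = pd.crosstab(toy["age_certification"], toy["type"], normalize="columns")
print(share.round(2))

# Most common certification within each content type
top_per_type = share.idxmax()
print(top_per_type)
```

Shares within each type make Movies and TV Shows comparable even when one type has far more titles than the other.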

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

---
✅ **Positive Impact**

---

**Targeted Engagement:**

---
Focusing on the 13+ to 18+ age groups aligns with the most active streaming audience, which can boost watch time and subscription retention.

**Strong Adult Genre Portfolio:**

---
Consistent mature content reinforces Prime Video’s premium and global appeal in genres like crime, action, and drama.

---
**⚠️ Negative Growth Risk**

---
**Underrepresentation of Family/Kids Content:**

---
The small share of 7+ or All Ages titles limits potential audience diversification.

This may reduce Prime’s competitiveness in the family entertainment market.



## Chart - 3: Pie plot

## Univariate Analysis
**“What is the distribution of content production countries on Amazon Prime Video?” 🌍**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Work on a copy so the cleaning steps below do not modify df
data = df.copy()

# Clean and prepare data
data['production_countries'] = data['production_countries'].fillna("Unknown")
data['main_country'] = data['production_countries'].apply(lambda x: x.split(",")[0].strip() if isinstance(x, str) else "Unknown")

# Count top 8 countries
country_dist = data['main_country'].value_counts().head(8)

# Refined color palette (pastel theme)
colors = sns.color_palette("pastel", n_colors=len(country_dist))

# Create Pie Plot
plt.figure(figsize=(9, 9))
plt.pie(
    country_dist,
    labels=country_dist.index,
    autopct='%1.1f%%',
    startangle=120,
    colors=colors,
    textprops={'fontsize': 13, 'color': '#2E4053'}
)

# Title
plt.title("🌍 Distribution of Content Production Countries on Amazon Prime Video",
          fontsize=18, fontweight="bold", color="#2E4053", pad=20)

plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

🍰 A Pie Plot suits proportions of categories within a whole.
Here the whole is the combined total of the top 8 production countries, so each slice shows a country's share of that group rather than of the entire catalog. This still allows a quick visual comparison of dominance and diversity.

### **2. What is/are the insight(s) found from the chart?**

🔹 The United States leads by a wide margin.

🔹 India and United Kingdom follow as strong contributors.

🔹 Other countries like Canada, Japan, and France hold smaller but notable shares.

🔹 This suggests a Western-centric catalog, but also a growing global diversity in content creation.
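
Because the pie is drawn from the top 8 countries only, its percentages are shares of those eight countries. One way to make the slices refer to the full catalog is to fold the remaining countries into an "Other" slice. This is a minimal sketch under that assumption; the country codes and counts are toy values, and `top_n` is a hypothetical parameter name.

```python
import pandas as pd

# Toy country labels (hypothetical values, for illustration only)
main_country = pd.Series(
    ["US"] * 4 + ["IN"] * 3 + ["GB"] * 2 + ["JP", "FR", "CA"]
)

top_n = 3  # the notebook uses 8; a small value keeps the toy example readable
counts = main_country.value_counts()

# Keep the top N countries and fold everything else into one "Other" slice
top = counts.head(top_n)
other_total = counts.iloc[top_n:].sum()
country_dist = pd.concat([top, pd.Series({"Other": other_total})])

# Shares now refer to the whole catalog, not just the top N
shares = (country_dist / country_dist.sum() * 100).round(1)
print(shares)
```

The resulting `country_dist` can be passed to `plt.pie` in the same way as the original series.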

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

---
**Positive impact**

---
This analysis helps Amazon Prime identify which regions are major content producers and where investment gaps exist.

**It supports decisions like:**

---
Expanding localized content production in underrepresented regions.

Balancing global vs. regional content mix for higher engagement.

💰 In business terms, this can boost global subscriptions and increase market inclusivity — leading to positive growth.


---
**⚠️ Negative impact.**

---
Overreliance on the US content base can limit growth in diverse global markets.

If the content remains too Western-centric, it may fail to attract audiences in countries with strong local preferences (e.g., Japan, Korea, Middle East).

Hence, while the dominance of US titles ensures quality, lack of localization could hinder subscriber growth in emerging regions.

## Chart - 4: Box plot

## Bi-Variate Analysis

**“How does the runtime distribution vary between Movies and TV Shows on Amazon Prime Video?” ⏱️🎥📺**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


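# Work on a copy so the cleaning steps below do not modify df
data = df.copy()

# Clean and prepare data
data['runtime'] = data['runtime'].fillna(0)
data = data[data['runtime'] > 0].copy()  # .copy() avoids SettingWithCopyWarning on the next assignment
data['age_certification'] = data['age_certification'].fillna('Not Rated')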

# Set theme
sns.set_theme(style="whitegrid")

plt.figure(figsize=(12, 7))

# Box Plot with hue for age certification
sns.boxplot(
    data=data,
    x='type',
    y='runtime',
    hue='age_certification',
    palette='Set2'
)

# Titles and axis labels
plt.title("⏱️ Runtime Distribution by Content Type and Age Rating on Amazon Prime Video",
          fontsize=18, fontweight="bold", color="#2E4053", pad=15)
plt.xlabel("Content Type", fontsize=14, color="#1a5276")
plt.ylabel("Runtime (minutes)", fontsize=14, color="#1a5276")
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(title="Age Certification", fontsize=11, title_fontsize=12, loc="upper right")

plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

📦 A Box Plot is ideal for comparing distributions of a numerical variable (runtime) across different categories (type: Movie vs. TV Show, split further by age certification).

**✅ Reasons for choosing Box Plot:**

---

Highlights outliers: exceptionally long or short titles.

Allows side-by-side comparison between the two content types.



### **2. What is/are the insight(s) found from the chart?**

🎬 Movies show much longer runtimes, especially for adult-rated and unrestricted titles (R, 18+, Not Rated), which tend to run between 100 and 150 minutes.

📺 TV Shows, across all age certifications, cluster around 20–60 minutes, showing structured episodic formats.

**Interestingly:**

---
Family or Kids-rated content (e.g., PG, G) tends to have shorter runtimes, both for movies and shows.

Adult-rated content dominates the longer runtime range, which may indicate larger production scale.
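
The runtime ranges quoted above are read off the chart by eye. They can be checked against the quartiles that the box plot itself draws. A minimal sketch, assuming `type` and `runtime` columns as in the plotting cell; the runtimes below are toy values, not results from the real data.

```python
import pandas as pd

# Toy runtimes in minutes (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "type": ["Movie"] * 5 + ["TV Show"] * 5,
    "runtime": [90, 100, 110, 120, 150, 20, 25, 30, 45, 60],
})

# Quartiles per content type: the same numbers a box plot draws as its box
runtime_summary = (
    toy.groupby("type")["runtime"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
runtime_summary.columns = ["q1", "median", "q3"]

# Interquartile range: the height of the box
runtime_summary["iqr"] = runtime_summary["q3"] - runtime_summary["q1"]

print(runtime_summary)
```

Adding `age_certification` to the `groupby` keys would give the same summary for each box in the chart.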


### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**📈 Positive Impacts:**

---
Enables content duration optimization for different viewer groups (e.g., shorter family content, longer mature dramas).

Guides recommendation algorithms — showing users runtimes that match their viewing habits.

These insights improve user satisfaction, platform retention, and personalization accuracy, all of which enhance Prime Video’s business growth.



**⚠️ Potential Risk:**

---
If too much focus is placed on long-format and adult-rated content, it might alienate younger or family audiences.

To avoid this, Amazon must sustain diverse content lengths and age categories, ensuring inclusivity and viewer variety.

## Chart - 5: Scatter Plot
## Bi-Variate Analysis

**“Is there a relationship between IMDb Score and TMDb Popularity for Amazon Prime Titles?” ⭐📈**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Copy dataset
data = df.copy()

# 🧹 Clean data: handle missing or invalid values
data = data.dropna(subset=['imdb_score', 'tmdb_popularity', 'type'])
data = data[(data['imdb_score'] > 0) & (data['tmdb_popularity'] > 0)]

# 🔤 Standardize and clean the 'type' column completely
data['type'] = data['type'].astype(str).str.strip().str.title()
data['type'] = data['type'].replace({
    'Tv Show': 'TV Show',
    'Show': 'TV Show',
    'Tvshow': 'TV Show',
    'Tv Shows': 'TV Show',
    'Movie ': 'Movie',
    'Movies': 'Movie'
})

# ✅ Keep only 'Movie' and 'TV Show' types
data = data[data['type'].isin(['Movie', 'TV Show'])]

# 🎨 Set the visual style
sns.set_theme(style="whitegrid")

plt.figure(figsize=(12, 7))

# 📊 Scatter Plot
sns.scatterplot(
    data=data,
    x='imdb_score',
    y='tmdb_popularity',
    hue='type',
    palette={'Movie': '#1f77b4', 'TV Show': '#ff7f0e'},
    alpha=0.7,
    s=70,
    edgecolor='white'
)

# 🏷️ Titles and axis labels
plt.title("⭐ Relationship between IMDb Score and TMDb Popularity on Amazon Prime Video",
          fontsize=18, fontweight="bold", color="#2E4053", pad=15)
plt.xlabel("IMDb Score", fontsize=14, color="#1A5276")
plt.ylabel("TMDb Popularity", fontsize=14, color="#1A5276")
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.legend(title="Content Type", fontsize=11, title_fontsize=12, loc="upper left")

plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

Understanding this correlation helps gauge audience engagement.

This is a bivariate relationship — comparing two continuous variables (imdb_score and tmdb_popularity) — making a Scatter Plot ideal.

### **2. What is/are the insight(s) found from the chart?**

**📈 The scatter plot shows that:**

---
There is a moderate positive correlation between IMDb Score and TMDb Popularity — meaning higher-rated titles tend to be more popular.

Movies (🔵 blue dots) generally dominate the upper right section — combining both high ratings and strong popularity.

TV Shows (🟠 orange dots) exhibit broader spread, indicating variable audience engagement regardless of rating.

**In short:**

---
⭐ High IMDb Score ≈ High Popularity, but the trend is stronger for Movies than TV Shows.
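
"Moderate positive correlation" is a visual reading of the scatter plot, and a coefficient per content type would put a number on it. The sketch below computes a Spearman rank correlation as the Pearson correlation of ranks, a choice made here because popularity scores are usually heavily skewed. The toy values are hypothetical and are constructed only to show the mechanics.

```python
import pandas as pd

# Toy scores and popularity values (hypothetical, for illustration only)
toy = pd.DataFrame({
    "type": ["Movie"] * 4 + ["TV Show"] * 4,
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 5.0, 6.0, 7.0, 8.0],
    "tmdb_popularity": [2.0, 4.0, 9.0, 30.0, 10.0, 3.0, 12.0, 5.0],
})

# Spearman correlation = Pearson correlation of the ranks.
# Being rank-based, a few extremely popular titles do not dominate the result.
corr_by_type = {
    name: group["imdb_score"].rank().corr(group["tmdb_popularity"].rank())
    for name, group in toy.groupby("type")
}

for name, rho in corr_by_type.items():
    print(f"{name}: Spearman rho = {rho:.2f}")
```

On the real data, comparing the two coefficients is one way to test whether the trend is in fact stronger for Movies than for TV Shows.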

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**📈 Positive Impacts:**

---
Helps content creators focus marketing on high-rated but underexposed titles — increasing viewership efficiently.

Assists recommendation systems in promoting high-quality content that’s also gaining traction.

Guides production strategy by showing which content quality levels align with user engagement patterns.

These insights lead to smarter promotion, investment, and audience retention strategies.

⚠️ **Possible Risks:**

---
If Prime over-prioritizes popular but low-rated titles, it risks lowering perceived platform quality over time.


## Chart - 6: Line Plot
## Bi-Variate Analysis

**“How has the release trend of Movies and TV Shows on Amazon Prime Video changed over the years?” 📅📈**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

data = df.copy()

# 🧹 Clean and prepare data
data['release_year'] = pd.to_numeric(data['release_year'], errors='coerce')
data = data.dropna(subset=['release_year', 'type'])
data = data[data['release_year'] >= 1980]  # Filter relevant years

# ✅ Standardize 'type' column
data['type'] = data['type'].str.strip().str.title()  # Fix casing
data['type'] = data['type'].replace({
    'Tv Show': 'TV Show',
    'Show': 'TV Show',
    'Movie ': 'Movie'
})

# 📊 Group by release year and content type
trend_data = (
    data.groupby(['release_year', 'type'])
    .size()
    .reset_index(name='count')
)

# 🎨 Set style
sns.set_theme(style="whitegrid")

plt.figure(figsize=(12, 7))

# 📈 Line Plot
sns.lineplot(
    data=trend_data,
    x='release_year',
    y='count',
    hue='type',
    marker='o',
    linewidth=2.5,
    palette={'Movie': '#1f77b4', 'TV Show': '#ff7f0e'}
)

# 🏷 Titles and labels
plt.title("📅 Trend of Content Releases on Amazon Prime (Movies vs TV Shows)",
          fontsize=18, fontweight="bold", color="#2E4053", pad=15)
plt.xlabel("Release Year", fontsize=14, color="#1a5276")
plt.ylabel("Number of Titles Released", fontsize=14, color="#1a5276")
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.legend(title="Content Type", title_fontsize=13, fontsize=12, loc="upper left")

plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

**This question helps us understand content growth patterns over time, revealing:**

---
Whether Amazon Prime is releasing more movies or TV shows in recent years.

Which content type dominates different time periods.


A **Line Plot** is perfect for showing trends and comparative growth between multiple categories.

### **2. What is/are the insight(s) found from the chart?**

**📈 The line plot reveals:**

---
🎬 Movies have shown steady growth since the early 2000s, with a sharp rise post-2015.

📺 TV Shows exhibit rapid acceleration after 2017, reflecting Amazon’s heavy investment in Prime Originals (like The Boys, Jack Ryan, Mirzapur).

Releases show a noticeable spike during the COVID-19 pandemic (2020–2021), possibly due to increased digital releases.

**In short:**

Amazon Prime has shifted from being movie-centric to more balanced, increasingly focusing on serial content in recent years.
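
The claimed shift toward a more balanced catalog can be checked by tracking the share of TV Shows among each year's releases, rather than comparing two raw count lines. A minimal sketch, assuming `release_year` and an already standardized `type` column; the rows are toy values and `tv_share` is a name introduced here for illustration.

```python
import pandas as pd

# Toy per-title rows (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "release_year": [2015, 2015, 2015, 2015, 2020, 2020, 2020, 2020],
    "type": ["Movie", "Movie", "Movie", "TV Show",
             "Movie", "Movie", "TV Show", "TV Show"],
})

# Titles per year and type, one column per type (missing combinations become 0)
yearly = (
    toy.groupby(["release_year", "type"])
    .size()
    .unstack(fill_value=0)
)

# Share of TV Shows among each year's releases
yearly["tv_share"] = yearly["TV Show"] / (yearly["Movie"] + yearly["TV Show"])
print(yearly)
```

A rising `tv_share` over time would support the "more balanced" reading even if both raw counts are increasing.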

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**Positive Impacts**

---
📊 Enables Amazon Prime to analyze historical content growth and forecast future release needs.

🎯 Helps determine whether to invest more in movies or shows depending on audience consumption patterns.

🕵️ Informs marketing campaigns by identifying peak content release years and genres of focus.

📅 Supports strategic planning for content diversification and scheduling.

**⚠️ Possible Risk:**

---
Overproduction post-2020 may lead to content oversaturation, reducing average engagement per title.

If TV Show growth continues unchecked without equal quality control, it might dilute overall brand value.

## Chart - 7: Box plot
## Bi-Variate Analysis

**“How does IMDb Score vary across different Age Certifications on Amazon Prime Video?” 🎬📊**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

data = df.copy()

# 🧹 Clean and standardize 'type' column
data['type'] = data['type'].astype(str).str.strip().str.title()
data['type'] = data['type'].replace({
    'Tv Show': 'TV Show',
    'Show': 'TV Show',
    'Tv-Show': 'TV Show',
    'Movie ': 'Movie',
    'Movies': 'Movie'
})

# ✅ Keep only valid types
data = data[data['type'].isin(['Movie', 'TV Show'])]

# 🔍 Optional sanity check
print("Unique type values after cleaning:", data['type'].unique())

# 🎨 Set theme
sns.set_theme(style="whitegrid")

plt.figure(figsize=(12, 7))

# 📦 Box Plot (fixed)
sns.boxplot(
    data=data,
    x='age_certification',
    y='imdb_score',
    hue='type',
    palette={'Movie': '#1f77b4', 'TV Show': '#ff7f0e'}
)

# 🏷️ Labels and Title
plt.title("🎬 IMDb Score Distribution by Age Certification and Type",
          fontsize=18, fontweight='bold', color='#2E4053')
plt.xlabel("Age Certification", fontsize=14, color='#1A5276')
plt.ylabel("IMDb Score", fontsize=14, color='#1A5276')
plt.legend(title="Content Type", title_fontsize=13, fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

This question explores whether content maturity levels (like ‘U’, ‘U/A 13+’, ‘A’) influence audience ratings (IMDb scores).


**For example:**

Are family-friendly titles performing better than adult-rated ones?

Do mature-rated shows get higher ratings because of richer storytelling?

A Box Plot is perfect here since it visualizes how ratings are distributed across several categorical groups (age certifications).

### **2. What is/are the insight(s) found from the chart?**

**📦 The box plot reveals interesting patterns:**

---
‘U/A 13+’ and ‘U/A 16+’ content often receives higher IMDb scores, suggesting audiences appreciate teen-to-young-adult-oriented stories.

‘A’ (Adult) titles show wider variation — some highly rated, some poorly received — indicating mixed audience reception.

‘U’ (Universal) titles have moderate ratings, hinting that family content performs steadily but rarely tops the charts.

Within most categories, Movies (🔵) tend to slightly outperform TV Shows (🟠) in rating.
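
Statements such as "higher scores" and "wider variation" can be backed by a per-certification summary of the median and the spread. A minimal sketch using named aggregations; the certification labels follow the ones quoted in the text, and the scores are toy values, not the dataset's.

```python
import pandas as pd

# Toy IMDb scores (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "age_certification": ["U/A 13+"] * 3 + ["A"] * 3,
    "imdb_score": [6.0, 7.0, 8.0, 3.0, 6.0, 9.0],
})

# Median for the typical score, standard deviation for the spread,
# and the title count to show how much data backs each figure
score_summary = toy.groupby("age_certification")["imdb_score"].agg(
    n_titles="count",
    median_score="median",
    score_std="std",
)
print(score_summary)
```

Grouping by both `age_certification` and `type` would extend this to the Movie vs. TV Show comparison made above.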

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**Positive Impacts:**

---

Helps to identify which certification categories attract higher ratings.

Informs future investments — for instance, doubling down on U/A 13+ content, which is broadly appealing.

Supports marketing segmentation, targeting promotions more effectively to specific audience groups.

**⚠️ Potential Risks:**

---
Overemphasis on adult content (‘A’ rated) could alienate family audiences, shrinking Prime’s overall viewership base.

Highly variable ratings in the adult category suggest quality inconsistency, which might hurt user retention if not addressed.

Hence, Prime must balance audience maturity levels while ensuring consistent storytelling quality.


## Chart - 8: Bar Plot
## Bi-Variate Analysis

**“How does the average IMDb Score vary across different Genres for Movies and TV Shows on Amazon Prime Video?” 🎬📊**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

data = df.copy()

# 🧹 Clean and standardize 'type' column
data['type'] = data['type'].astype(str).str.strip().str.title()

# Replace all common variants of show/movie
data['type'] = data['type'].replace({
    'Tv Show': 'TV Show',
    'Show': 'TV Show',
    'Tv-Show': 'TV Show',
    'Tvshow': 'TV Show',
    'Movies': 'Movie',
    'Movie ': 'Movie',
    'Film': 'Movie'
})

# Drop invalid/unknown values
valid_types = ['Movie', 'TV Show']
data = data[data['type'].isin(valid_types)]

# 🧭 Sanity check
print("✅ Unique type values after cleaning:", data['type'].unique())

# 🎭 Detect genre column
genre_col = "genres"

# 🧼 Clean genre data
data[genre_col] = data[genre_col].fillna("Unknown")
data[genre_col] = data[genre_col].apply(lambda x: x.split(",")[0].strip() if isinstance(x, str) else x)

# 🎯 Filter valid IMDb scores
data = data.dropna(subset=[genre_col, "imdb_score"])
data = data[(data["imdb_score"] > 0) & (data["imdb_score"] <= 10)]

# 🧩 Top 10 Genres
top_genres = data[genre_col].value_counts().head(10).index
filtered_data = data[data[genre_col].isin(top_genres)]

# 📊 Group by Genre and Type
grouped_data = (
    filtered_data.groupby([genre_col, "type"], as_index=False)["imdb_score"]
    .mean()
    .sort_values(by="imdb_score", ascending=False)
)

# 🎨 Define consistent palette
palette = {"Movie": "#1f77b4", "TV Show": "#ff7f0e"}

# 🪄 Extra Safety Check — Fix if any unknown type sneaks in
missing_keys = set(grouped_data['type'].unique()) - set(palette.keys())
if missing_keys:
    print(f"⚠️ Unknown type(s) found and removed: {missing_keys}")
    grouped_data = grouped_data[grouped_data['type'].isin(palette.keys())]

# 📈 Plot
sns.set_theme(style="whitegrid")
plt.figure(figsize=(14, 8))

bar_plot = sns.barplot(
    data=grouped_data,
    x=genre_col,
    y="imdb_score",
    hue="type",
    palette=palette,
    dodge=True
)

# 🏷 Add labels
for p in bar_plot.patches:
    bar_plot.annotate(
        f"{p.get_height():.2f}",
        (p.get_x() + p.get_width() / 2., p.get_height()),
        ha="center", va="bottom",
        fontsize=11, color="#2E4053", fontweight="bold",
        xytext=(0, 5), textcoords="offset points"
    )

# ✨ Title & Labels
plt.title("🎬 Average IMDb Score by Top 10 Genres (Movies vs TV Shows)",
          fontsize=18, fontweight="bold", color="#2E4053", pad=20)
plt.xlabel("Genre", fontsize=14, color="#1A5276")
plt.ylabel("Average IMDb Score", fontsize=14, color="#1A5276")
plt.xticks(rotation=30, fontsize=12)
plt.legend(title="Content Type", title_fontsize=13, fontsize=12)
plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

This question helps us understand which genres consistently perform better in terms of viewer ratings (IMDb Score).

It also shows whether Movies or TV Shows dominate in specific genres, helping Prime Video decide which type of content to focus on for each category.

A **Bar Plot** is perfect here since it effectively compares average metrics (IMDb Scores) across categorical variables (Genres and Content Types).

### **2. What is/are the insight(s) found from the chart?**

🎥 Drama and Documentary often lead with higher IMDb scores, indicating critical acclaim and audience engagement.

📺 Reality and Animation tend to score lower, suggesting less critically rated content.

🎞️ Movies generally show more genre variation in scores, reflecting diversity in production quality.
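
Average scores per genre and type can be misleading when a group contains only a handful of titles. Reporting the group size next to the mean, and optionally filtering on it, is one way to guard against that. This is a sketch with toy data; `min_titles` is a hypothetical threshold, not a value taken from the notebook.

```python
import pandas as pd

# Toy titles with a single primary genre each (hypothetical values)
toy = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "comedy", "comedy", "reality"],
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "imdb_score": [7.0, 8.0, 7.5, 6.0, 6.5, 9.0],
})

# Mean score together with the number of titles behind it
genre_stats = (
    toy.groupby(["genres", "type"], as_index=False)
    .agg(mean_score=("imdb_score", "mean"), n_titles=("imdb_score", "count"))
)

min_titles = 2  # hypothetical cut-off for a "reliable" average
reliable = genre_stats[genre_stats["n_titles"] >= min_titles]
print(reliable)
```

In the toy data the single "reality" show scores 9.0, but with one title behind it that average says little about the genre.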

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*


✅ **Positive Impact:**

---
Helps Amazon Prime prioritize high-performing genres (e.g., investing more in Drama/Documentary originals).

⚠️ **Negative Trend Insight:**

---
Underperforming genres like Reality or Animation may need content quality improvements or marketing support to boost ratings and user trust.

## Chart - 9: Violin Plot
## Bi-Variate Analysis

**“How does TMDb Popularity vary across different Release Decades for Movies and TV Shows on Amazon Prime Video?”**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = df.copy()

# 🧹 STEP 1: Fully clean and standardize 'type'
data['type'] = (
    data['type']
    .astype(str)
    .str.strip()
    .str.title()  # Converts 'tv show' → 'Tv Show'
)

# Replace all possible variants
data['type'] = data['type'].replace({
    'Tv Show': 'TV Show',
    'Tvshow': 'TV Show',
    'Tv-Show': 'TV Show',
    'Show': 'TV Show',
    'Movies': 'Movie',
    'Movie ': 'Movie',
    'Film': 'Movie'
})

# 🧭 Check unique values before proceeding
print("✅ Unique 'type' values after cleaning:", data['type'].unique())

# Keep only valid entries
data = data[data['type'].isin(['Movie', 'TV Show'])]

# 🧮 STEP 2: Prepare numeric columns
data['release_year'] = pd.to_numeric(data['release_year'], errors='coerce')
data['release_decade'] = (data['release_year'] // 10 * 10).astype('Int64')

# 🧹 STEP 3: Filter valid rows
data = data.dropna(subset=['release_decade', 'tmdb_popularity'])
data = data[(data['tmdb_popularity'] > 0) &
            (data['tmdb_popularity'] < data['tmdb_popularity'].quantile(0.99))]

# 🎨 STEP 4: Define palette
palette = {"Movie": "#1f77b4", "TV Show": "#ff7f0e"}

# 🧩 STEP 5: Final safety filter
unknown_types = set(data['type'].unique()) - set(palette.keys())
if unknown_types:
    print(f"⚠️ Removed unknown type(s): {unknown_types}")
    data = data[data['type'].isin(palette.keys())]

# ✅ Final check
print("🎯 Types used for plotting:", data['type'].unique())

# 🎻 STEP 6: Plot
sns.set_theme(style="whitegrid")
plt.figure(figsize=(12, 7))

sns.violinplot(
    data=data,
    x='release_decade',
    y='tmdb_popularity',
    hue='type',
    split=True,
    inner='quartile',
    palette=palette
)

plt.title("🎻 Distribution of TMDB Popularity Across Release Decades by Type",
          fontsize=18, fontweight="bold", color="#2E4053", pad=15)
plt.xlabel("Release Decade", fontsize=14, color="#1A5276")
plt.ylabel("TMDB Popularity", fontsize=14, color="#1A5276")
plt.legend(title="Content Type", title_fontsize=13, fontsize=12)
plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

**A violin plot is perfect here because it:**

---
Captures distribution of popularity within each decade.

Allows comparison of Movies vs. TV Shows across time — showing which format dominates public attention.

Visualizes data density.

### **2. What is/are the insight(s) found from the chart?**

🎬 Movies from 2010s onward tend to show higher popularity peaks but also more variation, meaning only a few become viral hits.

📺 TV Shows gain steadier popularity growth from the 2010s to 2020s, reflecting the rise of streaming-era binge content.

🧩 Older decades (1980s, 1990s) show lower but tighter distributions, suggesting only a handful of timeless titles still attract attention.
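
The decade-level observations can be cross-checked with a small summary table, which also shows how many titles sit behind each violin. A minimal sketch, assuming numeric `release_year` and `tmdb_popularity` columns; all values below are toy numbers for illustration.

```python
import pandas as pd

# Toy titles (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "release_year": [1985, 1994, 1999, 2012, 2015, 2021],
    "tmdb_popularity": [1.5, 2.0, 2.5, 4.0, 40.0, 8.0],
})

# Floor each year to the start of its decade, e.g. 1994 -> 1990
toy["release_decade"] = toy["release_year"] // 10 * 10

# Count, median and maximum popularity per decade
decade_summary = toy.groupby("release_decade")["tmdb_popularity"].agg(
    n_titles="count",
    median_popularity="median",
    max_popularity="max",
)
print(decade_summary)
```

A large gap between the median and the maximum within a decade is the numeric counterpart of the "few viral hits" pattern described above.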

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**✅ Positive Insights:**

---
Amazon Prime can focus marketing on reviving high-performing older titles (nostalgia-driven engagement).

Understanding decade-wise popularity helps curate “trending classics” sections to retain diverse audiences.

⚠️ **Potential Risk:**

---
The decline in popularity for older TV content indicates a need for remastering, better recommendation placement, or exclusive promotion to regain visibility.

## Chart - 10: Donut chart
## Univariate Analysis

**👉 “Genre Distribution on Amazon Prime Video”**

In [None]:
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import rgb2hex
import pandas as pd

# Work on a copy so the cleaning steps below do not modify df
data = df.copy()

# ✅ Data Cleaning
data['genres'] = data['genres'].fillna("Unknown")

# If multiple genres are listed, split and count each one separately
data_exploded = data.assign(genres=data['genres'].str.split(',')).explode('genres')
data_exploded['genres'] = data_exploded['genres'].str.strip()

# Group by genres and get count
genre_data = data_exploded['genres'].value_counts().reset_index()
genre_data.columns = ['Genre', 'Count']

# 🧩 Group minor genres into "Other" to declutter the plot
threshold = 0.02 * genre_data['Count'].sum()  # genres less than 2% of total
major_genres = genre_data[genre_data['Count'] >= threshold]
others = pd.DataFrame([{'Genre': 'Other', 'Count': genre_data[genre_data['Count'] < threshold]['Count'].sum()}])
genre_data = pd.concat([major_genres, others], ignore_index=True)

# 🎨 Seaborn palette → HEX for Plotly
sns_colors = sns.color_palette("Spectral", n_colors=len(genre_data))
hex_colors = [rgb2hex(c) for c in sns_colors]

# 🍩 Create a beautiful, larger Donut Plot
fig = px.pie(
    genre_data,
    names='Genre',
    values='Count',
    title='🍩 Genre Distribution on Amazon Prime Video',
    hole=0.35,  # slightly smaller hole → more visible donut
    color_discrete_sequence=hex_colors
)

# ✨ Style the plot
fig.update_traces(
    textinfo='percent+label',
    textfont_size=14,
    hovertemplate='<b>%{label}</b><br>Titles: %{value}<br>Share: %{percent}',
    marker=dict(line=dict(color='#FFFFFF', width=2)),
    pull=[0.05 if g != 'Other' else 0 for g in genre_data['Genre']]  # pop out major slices
)

# 💅 Layout improvements
fig.update_layout(
    title_font=dict(size=22, family='Arial Black', color='#2E4053'),
    font=dict(size=14, family='Arial'),
    showlegend=True,
    legend_title_text='Genre',
    legend=dict(
        orientation="v",
        yanchor="top",
        y=0.9,
        xanchor="left",
        x=1.05,
        bgcolor='rgba(255,255,255,0.8)',
        bordercolor='#D0D3D4',
        borderwidth=1,
        font=dict(size=12)
    ),
    height=700,  # ✅ Make the donut much larger
    width=950,
    margin=dict(t=100, b=50, l=50, r=200),
    paper_bgcolor='white'
)

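# 🧮 Center annotation with total count
# Note: genres were exploded above, so this sums genre tags;
# a title with several genres is counted once per genre
total_tags = genre_data['Count'].sum()
fig.add_annotation(
    text=f"<b>Total Genre Tags</b><br>{total_tags}",
    x=0.5, y=0.5, showarrow=False,
    font=dict(size=18, color="#2E4053")
)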

# 🎯 Show the plot
fig.show()


### **1. Why did you pick the specific chart?**


This visualization helps us understand the overall composition of Prime Video’s content library by genre, giving insights into what type of content dominates the platform — whether it’s drama, comedy, action, documentary, or others.

A **Donut Plot** is an ideal choice when we want to:

Visualize proportion or percentage composition of categorical data.

Maintain visual appeal while comparing categories in a circular layout.

It’s visually engaging, easy to interpret, and suitable for representing categorical dominance — exactly what we need here.

### **2. What is/are the insight(s) found from the chart?**

**We can observe that:**

---
Certain genres like Drama, Comedy, and Action occupy the largest portions, indicating strong audience demand for these categories.

Documentary and Animation genres form smaller shares but still significant.

The “Other” category consolidates many small genres, suggesting diverse but less frequent content types exist in the catalog.


### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

---
**Positive Impact**

---

**✅ Content Strategy:**

---
Amazon can identify over- or under-represented genres.

**✅ Recommendation System:**

---
Genre popularity can help personalize content suggestions — boosting user engagement.

**✅ Diversification:**

---
Recognizing lesser-represented genres helps invest in fresh, underexplored segments to appeal to new demographics.

⚠️ **Potential Risks:**

---

Extremely small genre shares are grouped into “Other,” so micro trends may be hidden.
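
To keep those micro trends visible, the genres that fall below the threshold can be listed before they are merged into "Other". A minimal sketch of the same split, explode, and threshold steps; the genre strings are toy values and the 20% cut-off is chosen only to suit the tiny sample. Note that exploding multi-genre titles counts one row per genre tag, so the total (7 here) exceeds the number of titles (5).

```python
import pandas as pd

# Toy comma-separated genre strings (hypothetical values, for illustration only)
genres = pd.Series(["drama, comedy", "drama", "drama, action", "comedy", "western"])

# One row per genre tag, with surrounding whitespace removed
exploded = genres.str.split(",").explode().str.strip()
genre_counts = exploded.value_counts()

# The notebook uses 2%; 20% suits this tiny toy sample
threshold = 0.20 * genre_counts.sum()
minor = genre_counts[genre_counts < threshold]

# Genres that would disappear into the "Other" slice
print(sorted(minor.index))
```

Printing or saving this list alongside the chart keeps the small genres available for follow-up analysis.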


## Chart - 11: Treemap
## Multi-variate Analysis

**“Which countries contribute the most Movies and TV Shows to Amazon Prime’s global catalog?”**

In [None]:
import pandas as pd
import plotly.express as px
import seaborn as sns
from matplotlib.colors import to_hex


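# Work on a copy: the steps below overwrite production_countries,
# which would otherwise change df for every later cell
data = df.copy()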

# 🧹 Data Cleaning
data['production_countries'] = data['production_countries'].fillna("Unknown")
data['type'] = data['type'].fillna("Unknown")

# Take only the first listed country if multiple are present
data['production_countries'] = data['production_countries'].apply(
    lambda x: x.split(',')[0].strip() if isinstance(x, str) else x
)

# 🧩 Group and Filter for Top 20 Countries
grouped_data = (
    data.groupby(['production_countries', 'type'])
    .size()
    .reset_index(name='count')
)
top_countries = (
    grouped_data.groupby('production_countries')['count']
    .sum()
    .nlargest(20)
    .index
)
grouped_data = grouped_data[grouped_data['production_countries'].isin(top_countries)]

# 🎨 Color Palette
palette = sns.color_palette("Set2", n_colors=len(grouped_data['type'].unique()))
color_discrete_sequence = [to_hex(c) for c in palette]

# 🌍 Create Treemap
fig = px.treemap(
    grouped_data,
    path=['production_countries', 'type'],
    values='count',
    color='type',
    color_discrete_sequence=color_discrete_sequence,
    hover_data={'production_countries': True, 'count': True},
    title="🌍 Top 20 Countries by Number of Movies and TV Shows on Amazon Prime"
)

# ✨ Improve Layout for Readability
fig.update_traces(
    textinfo="label+value+percent parent",
    # %{parent} shows the enclosing node (the country for a type block)
    hovertemplate="<b>%{label}</b><br>Parent: %{parent}<br>Titles: %{value}"
)

fig.update_layout(
    title_font=dict(size=22, color="#2E4053", family="Arial Black"),
    font=dict(size=14),
    margin=dict(t=100, l=50, r=50, b=50),
    height=800,
    uniformtext=dict(minsize=12, mode='hide'),  # hide overlapping text
    treemapcolorway=color_discrete_sequence
)

fig.show()


### **1. Why did you pick the specific chart?**

✅ A Treemap is an excellent visualization for hierarchical and proportional data.

**It allows us to see:**

---
How each country’s total content volume compares to others.

The composition within each country (Movies vs TV Shows).

The relative dominance of one category over another visually through block sizes.

### **2. What is/are the insight(s) found from the chart?**

🇺🇸 United States dominates content production on Amazon Prime, especially in Movies.

🇮🇳 India ranks second, showing a balanced presence in both Movies and TV Shows.

🇬🇧 United Kingdom and 🇨🇦 Canada are also major contributors, focusing mostly on Movies.

Smaller regions like 🇯🇵 Japan and 🇦🇺 Australia appear but with lesser diversity.

💬 This indicates Amazon Prime’s strong global presence, but content diversity could still be improved in underrepresented regions.
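
Whether a country leans toward Movies or has a "balanced presence" can be read more precisely from a row-normalized cross-tabulation than from block sizes. A minimal sketch, assuming one country per title as in the cleaning step above; the rows are toy values and `n_titles` is a column name introduced here for illustration.

```python
import pandas as pd

# Toy titles with a single production country each (hypothetical values)
toy = pd.DataFrame({
    "production_countries": ["US", "US", "US", "US", "IN", "IN", "JP"],
    "type": ["Movie", "Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
})

# Row-normalized cross-tabulation: each country's shares sum to 1
mix = pd.crosstab(toy["production_countries"], toy["type"], normalize="index")

# Total titles per country, to read the shares alongside volume
mix["n_titles"] = toy["production_countries"].value_counts()

print(mix.sort_values("n_titles", ascending=False))
```

Keeping the title count next to the shares helps avoid over-reading the mix for countries with only a few titles.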

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

---
**🚀 Positive Impact:**

---

**Strategic content acquisition:**

---
 Amazon can target countries with lower content representation to expand regional reach.

**Localization opportunities:**

---
 More localized productions could attract non-English-speaking audiences.

**Balanced portfolio:**

---
 Understanding dominance in Movies vs TV Shows helps in content strategy and budget allocation.

**Potential Risks:**

Only the top 20 countries were included in the treemap, which diminishes the visibility of countries with smaller content contributions.


## Chart - 12: Bubble plot
## Multi-variate Analysis

**👉 Is there a relationship between IMDb score, runtime, and TMDb popularity of Amazon Prime titles?**

In [None]:
import plotly.express as px
import pandas as pd

data=df

# 🧹 Data Cleaning
data = data.dropna(subset=['imdb_score', 'runtime', 'tmdb_popularity', 'type'])
data = data[(data['imdb_score'] > 0) & (data['runtime'] > 0) & (data['tmdb_popularity'] > 0)]

# 🎨 Create the Bubble Plot
fig = px.scatter(
    data,
    x='imdb_score',
    y='runtime',
    size='tmdb_popularity',   # Bubble size = popularity
    color='type',             # Hue based on type (Movie or Show)
    hover_name='title',       # Show title on hover
    size_max=40,
    opacity=0.6,
    color_discrete_sequence=px.colors.qualitative.Set2,
    title='🎬 IMDb Score vs Runtime vs TMDb Popularity on Amazon Prime Video'
)

# ✨ Update Layout
fig.update_layout(
    xaxis_title='IMDb Score',
    yaxis_title='Runtime (minutes)',
    plot_bgcolor='white',
    title_font_size=20,
    legend_title_text='Content Type',
    height=600,
)

# 📈 Display
fig.show()


### **1. Why did you pick the specific chart?**

A **Bubble Plot** was chosen because it shows three quantitative variables (IMDb Score, Runtime, Popularity) together with a categorical hue (Type) in one view, which makes it a compact way to reveal trends, clusters, and correlations in viewer engagement data.

### **2. What is/are the insight(s) found from the chart?**

Titles with moderate IMDb scores (6–8) tend to have higher TMDb popularity, indicating that popularity isn’t always tied to top IMDb scores.


Movies generally have longer runtimes than TV Shows, as seen by the clustering pattern.

A few outliers (high popularity + short runtime) could represent viral mini-series or short films.

There is a positive trend between IMDb score and popularity, but it is not strongly linear, suggesting that other factors also make a difference.

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**🚀 Positive Impact:**

---
It identifies which types of content attract viewers most, even without the highest IMDb ratings.

Content strategy teams can focus on producing or acquiring content with optimal runtime and audience engagement patterns.

Recommendation algorithms can be tuned — showing users more of what’s popular in their preferred runtime range.

It can also support marketing decisions, such as promoting mid-rated but high-engagement titles.

⚠️ **Negative Impact:**

---
Rows with missing or zero values for runtime and tmdb_popularity were removed for accuracy, so the remaining sample may not represent the full catalog and could mislead decisions.
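To judge how much this filtering could bias the chart, it may help to count how many rows each condition removes before plotting. The sketch below mirrors the `dropna` and `> 0` filters used for the bubble plot. The helper name `cleaning_report` and the toy rows are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd


def cleaning_report(frame, cols):
    """Count, per column, the rows that are missing or non-positive, plus total rows dropped."""
    invalid = frame[cols].isna() | (frame[cols] <= 0)
    report = invalid.sum().rename("rows_flagged").to_frame()
    report["share"] = report["rows_flagged"] / len(frame)
    # A row is dropped if any of its checked columns is invalid
    dropped = int(invalid.any(axis=1).sum())
    return report, dropped


# Made-up rows, for illustration only
toy = pd.DataFrame({
    "imdb_score": [7.1, np.nan, 6.4, 8.0, 5.5],
    "runtime": [95, 120, 0, 45, 30],
    "tmdb_popularity": [12.3, 4.0, 2.2, np.nan, 0.0],
})

report, dropped = cleaning_report(toy, ["imdb_score", "runtime", "tmdb_popularity"])
print(report)
print(f"Rows dropped: {dropped} of {len(toy)}")
```

If a large share of rows is flagged for one column, that is a signal the chart describes only part of the catalog.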



## Chart - 13 - Correlation Heatmap
Multi-variate Analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data=df

# 🧹 Data Cleaning: Keep only numerical columns relevant for correlation
num_cols = ['imdb_score', 'runtime', 'release_year', 'tmdb_popularity']
corr_data = data[num_cols].dropna()

# 🔢 Compute Correlation Matrix
corr_matrix = corr_data.corr(method='pearson')

# 🎨 Visualization
sns.set_theme(style="white")  # apply the theme before the figure is created
plt.figure(figsize=(10, 7))

# Create heatmap
heatmap = sns.heatmap(
    corr_matrix,
    annot=True,
    cmap="coolwarm",  # Beautiful color balance between -1 and +1
    center=0,
    linewidths=0.8,
    square=True,
    fmt=".2f",
    annot_kws={"size": 12, "weight": "bold"}
)

# Titles and Labels
# Emoji removed from the title: default Matplotlib fonts often lack the glyph
plt.title("Correlation Heatmap of Key Quantitative Features", fontsize=18, fontweight="bold", color="#1B4F72", pad=15)
plt.xticks(rotation=45, ha='right', fontsize=12, color="#154360")
plt.yticks(fontsize=12, color="#154360")

plt.tight_layout()
plt.show()


### **1. Why did you pick the specific chart?**

✅ A Correlation Heatmap is an efficient way to analyze relationships among multiple continuous variables simultaneously.

**Here’s why:**

---
It shows the pairwise correlation coefficients between variables in a single view.

It summarizes inter-variable dependencies in a dataset.
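For reference, each cell of the heatmap is a Pearson coefficient, i.e. the covariance of two columns scaled by both standard deviations. The sketch below computes it by hand and compares it with `DataFrame.corr`. The helper `pearson_r` and the toy values are hypothetical and exist only to illustrate the formula.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: sum of co-deviations over the product of deviation norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up values, for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0],
    "runtime": [80, 95, 90, 120],
})

manual = pearson_r(toy["imdb_score"], toy["runtime"])
from_pandas = toy.corr(method="pearson").loc["imdb_score", "runtime"]
print(round(manual, 4), round(from_pandas, 4))
```

Both values should agree up to floating-point error, which confirms what the annotated numbers in the heatmap represent.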

### **2. What is/are the insight(s) found from the chart?**

**📈 Observations from the Heatmap:**

---
imdb_score and tmdb_popularity show a moderate positive correlation, indicating that higher-rated titles often have higher audience engagement.

runtime has a weak correlation with both imdb_score and tmdb_popularity, suggesting that a title's length has little linear association with its rating or popularity.

release_year and tmdb_popularity show a slight positive correlation, suggesting that newer titles tend to gain more traction on Prime Video.

Most correlations are weak to moderate, indicating diverse audience interests.
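Pearson correlation only measures linear association. If popularity is heavily skewed (an assumption here, not something verified against the dataset), a rank-based Spearman coefficient may be a useful cross-check. A minimal sketch with contrived values:

```python
import pandas as pd

# Contrived values: popularity rises with score, but far from linearly
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_popularity": [1.0, 2.0, 4.0, 8.0, 100.0],
})

# Pearson looks at linear association, Spearman at monotonic (rank) association
pearson = toy.corr(method="pearson").loc["imdb_score", "tmdb_popularity"]
spearman = toy.corr(method="spearman").loc["imdb_score", "tmdb_popularity"]
print(f"Pearson: {pearson:.2f}  Spearman: {spearman:.2f}")
```

When the two coefficients differ noticeably on the real data, the relationship is probably monotonic but non-linear, and a single Pearson heatmap would understate it.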

### **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**💼 Positive Impact:**

---
Helps content curators understand which attributes drive popularity — crucial for deciding future investments.

Correlations highlight key predictors for user engagement models.

Amazon Prime can optimize content recommendations using these relationships.

Marketing teams can focus on highly correlated content types.


⚠️ **Negative Impact:**

---

Ignoring weakly correlated factors could mean overlooking subtle but crucial behavioral patterns, especially across different audience segments or genres.
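One way to check for such hidden patterns is to compute the same correlation separately for each segment. The sketch below uses the `type` column as the segment. The helper `corr_by_group` is hypothetical, and the toy data is deliberately contrived so that the pooled correlation is zero while each segment shows a clear trend.

```python
import pandas as pd


def corr_by_group(frame, group_col, x, y):
    """Pearson correlation of x and y computed separately within each group."""
    return pd.Series(
        {name: g[x].corr(g[y]) for name, g in frame.groupby(group_col)},
        name=f"corr({x}, {y})",
    )


# Contrived data: opposite trends in the two segments cancel out when pooled
toy = pd.DataFrame({
    "type": ["MOVIE"] * 4 + ["SHOW"] * 4,
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 5.0, 6.0, 7.0, 8.0],
    "tmdb_popularity": [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0],
})

pooled = toy["imdb_score"].corr(toy["tmdb_popularity"])
per_type = corr_by_group(toy, "type", "imdb_score", "tmdb_popularity")
print(f"Pooled: {pooled:.2f}")
print(per_type)
```

The same helper could be applied to genre or age-certification columns, if those are available, to see whether a weak overall correlation hides stronger segment-level ones.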

## Chart - 14 - Pair Plot
Multi-variate Analysis

**How do IMDb Score, TMDb Popularity, and Runtime interact across Movies and TV Shows on Amazon Prime Video?**

In [None]:
# 📦 Import Libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd


data = df

# 🧹 Data Cleaning
data = data.dropna(subset=['imdb_score', 'tmdb_popularity', 'runtime', 'type'])
data = data[(data['imdb_score'] > 0) & (data['tmdb_popularity'] > 0) & (data['runtime'] > 0)]

# 🎨 Create the Pair Plot
sns.set_theme(style="whitegrid", context="talk")  # set_theme is the current name for sns.set

pair_plot = sns.pairplot(
    data,
    vars=['imdb_score', 'tmdb_popularity', 'runtime'],
    hue='type',
    palette='Set2',
    diag_kind='kde',
    plot_kws={'alpha': 0.7, 's': 50, 'edgecolor': 'k'}
)

# .figure is the preferred accessor in recent seaborn versions (.fig is the older alias)
pair_plot.figure.suptitle("Pair Plot – IMDb Score, TMDb Popularity & Runtime by Type",
                          y=1.03, fontsize=16, fontweight="bold")
plt.show()


### **1. Why did you pick the specific chart?**

📊 The **Pair Plot** was used to explore relationships between multiple numerical variables (IMDb Score, TMDb Popularity, and Runtime) simultaneously.

**It helps in identifying:**

---
Correlations between these features

Cluster patterns by content type (Movie vs TV Show)

Outliers or unique combinations within the dataset

### **2. What is/are the insight(s) found from the chart?**

🔹 **1. IMDb Score vs TMDb Popularity:**

---
Movies tend to have a more diverse range of IMDb scores, while their popularity (TMDb) varies widely — suggesting that a highly rated movie isn’t always the most popular.

🔹 **2. Runtime Patterns:**

---
Movies generally have higher runtimes but tighter clustering, while TV Shows display more spread, due to varying episode lengths.

🔹**3. Correlation Observation:**

---
There’s a mild positive correlation between IMDb score and TMDb popularity — titles with good IMDb ratings often show higher popularity, but exceptions exist.
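The runtime pattern described above can also be checked numerically by comparing the median and interquartile range (IQR) of `runtime` per content type. This is a sketch under stated assumptions: the helper `runtime_spread` is hypothetical, and the toy runtimes were made up to resemble the described pattern rather than taken from the dataset.

```python
import pandas as pd


def runtime_spread(frame):
    """Median and interquartile range of runtime for each content type."""
    grouped = frame.groupby("type")["runtime"]
    out = pd.DataFrame({
        "median": grouped.median(),
        "q1": grouped.quantile(0.25),
        "q3": grouped.quantile(0.75),
    })
    out["iqr"] = out["q3"] - out["q1"]
    return out


# Made-up runtimes in minutes, for illustration only
toy = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "runtime": [90, 95, 100, 105, 110, 20, 25, 45, 60, 90],
})

spread = runtime_spread(toy)
print(spread)
```

A higher median with a smaller IQR for movies than for shows would support the visual reading of the pair plot.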


## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain Briefly.

**🎯 Final Insights**

---

1️⃣ Movies dominate over TV shows on Amazon Prime.

2️⃣ Drama, Comedy, and Action are the most popular genres.

3️⃣ U.S. and India lead in content production; other regions are underrepresented.

4️⃣ Higher IMDb/TMDB ratings correlate positively with votes and popularity, although the link with popularity is moderate rather than strong.

5️⃣ TV-MA and 16+ rated content prevails, showing focus on mature audiences.

6️⃣ A few top titles drive the majority of overall popularity.
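The concentration claim in the last insight can be expressed as a single number: the share of total popularity held by the top fraction of titles. The sketch below is one hedged way to compute it. The helper `top_share` and the toy popularity values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def top_share(values, top_frac=0.01):
    """Share of the total held by the top `top_frac` of items (at least one item)."""
    s = pd.Series(values, dtype=float).dropna().sort_values(ascending=False)
    k = max(1, int(np.ceil(len(s) * top_frac)))
    return float(s.iloc[:k].sum() / s.sum())


# Made-up popularity values, for illustration only
toy_popularity = [100, 50, 10, 10, 10, 5, 5, 5, 3, 2]

share = top_share(toy_popularity, top_frac=0.2)
print(f"Top 20% of titles hold {share:.0%} of total popularity")
```

On the real data, the same function could be called on the `tmdb_popularity` column with a smaller `top_frac` to see how concentrated popularity actually is.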

**💡 Business Recommendations:**

---
**🎬 Focus on Top Genres** – Invest in Drama, Comedy, and Action for higher engagement.

**🌍 Expand Globally** – Produce or acquire more regional content beyond U.S. and India.

**🏆 Prioritize Quality** – Emphasize high-rated, high-impact titles over volume.

**🎯 Personalized Experience** – Use viewer age and genre data for smarter recommendations.

**👥 Leverage Top Talent** – Promote and collaborate with popular actors/directors.

In summary, the EDA reveals that focusing on popular genres, quality content, and regional diversity will help Amazon Prime Video enhance engagement, broaden its audience, and strengthen its global position.

# **Conclusion**

This project analyzed Amazon Prime Video’s content to uncover trends in genres, ratings, and audience preferences.
The analysis showed that movies dominate, with Drama, Comedy, and Action as top genres, and the U.S. and India leading in production.
 Higher ratings aligned with stronger engagement, and mature content prevailed.

Overall, Amazon Prime should focus on high-demand genres, expand regional diversity, and prioritize quality content to boost viewer engagement and growth.
