<a href="https://colab.research.google.com/github/deepmalakale/EDA-AmazonPrime-TVShows-and-Movies/blob/main/AmazonPrime_TVShows_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 🎬 Exploratory Data Analysis on Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA
##### **Contribution**    - Group


# **Project Summary -**

The project **“Amazon Prime TV Shows and Movies”** focuses on conducting an in-depth analysis of Amazon Prime Video’s content library, with the objective of **uncovering meaningful trends, patterns, and actionable insights** related to its catalog of shows and movies.

The entire analysis was performed using **Python in Google Colab, leveraging popular data analysis and visualization libraries such as Pandas, Matplotlib, Seaborn, and Plotly.** To make the project interactive, a **Streamlit web application** was also developed, where users can dynamically explore Amazon Prime’s content through plots categorized under Univariate, Bivariate, and Multivariate analysis.

📂 **Datasets Used**

1️⃣ **Titles Dataset** (titles.csv)

Contains metadata for 9,871 titles with 15 attributes, including title, type (movie or TV show), release year, age certification, runtime, genres, production countries, and performance metrics such as IMDb and TMDB ratings, votes, and popularity.

2️⃣ **Credits Dataset** (credits.csv)

Includes over 124,000 records detailing cast and crew members, capturing information such as person name, role (actor/director), and character associations with each title.

🧹**Data Wrangling & Preprocessing**

These datasets were carefully cleaned, standardized, and merged into a consolidated dataframe (df) through a structured process of data wrangling.
This process involved handling missing values, addressing outliers, resolving redundancies, and normalizing categorical data to ensure a high-quality dataset for robust analysis.

📊 **Exploratory Data Analysis (EDA)**

Exploratory Data Analysis (EDA) was then carried out using a combination of statistical techniques and interactive visualizations.
Tools like Seaborn and Plotly were employed to generate box plots, violin plots, donut charts, treemaps, bubble plots, correlation heatmaps, and pair plots.
These visualizations provided clarity, interactivity, and deeper engagement with the data.

💡 **Key Insights & Findings**

The findings revealed key insights into:

The distribution of TV shows vs. movies on the platform

The dominant genres and regional production trends

Runtime and age certification patterns across categories

Shifts in Amazon Prime’s content catalog over the years

The role of IMDb and TMDB ratings, votes, and popularity in shaping audience perception

Contributions of actors and directors frequently appearing across titles

🧭 **Strategic Impact**

These insights support strategic decision-making for content acquisition, audience targeting, and recommendation systems.
For instance, identifying genres with consistently high ratings can guide future investments, while analyzing production country trends highlights opportunities for market expansion.

🏁 **Conclusion**

In conclusion, this project demonstrates how raw datasets, when thoroughly cleaned and analyzed, can be transformed into data-driven insights that strengthen Amazon Prime Video’s content strategy, competitive positioning, and audience engagement.

# **GitHub Link -**

GitHub Link: https://colab.research.google.com/github/deepmalakale/EDA-AmazonPrime-TVShows-and-Movies/blob/main/AmazonPrime_TVShows_Movies.ipynb


# **Problem Statement**


The rapid expansion of video streaming platforms has created an environment where data-driven decision-making is critical for sustaining competitive advantage. Amazon Prime Video, with its vast library of movies and TV shows, faces the challenge of understanding its diverse catalog in terms of content diversity, audience preferences, and regional distribution.

The key problems addressed in this project are:

**Content Diversity & Trends** – What are the dominant genres, formats (movies vs. shows), and age certifications? How has the catalog evolved over the years?

**Regional Analysis** – Which countries contribute the most to Amazon Prime’s content library? Are certain regions underrepresented or overrepresented?

**Audience Reception & Popularity** – How do IMDb and TMDB ratings, votes, and popularity scores correlate, and what do they reveal about audience engagement?

**Runtime & Release Insights** – What patterns emerge in terms of runtime, release years, and seasonal content availability?

**Cast & Crew Contributions** – Who are the most frequently featured actors and directors, and how do they influence content success?

By addressing these questions, the project aims to transform **raw datasets** (titles and credits) into a **clean, structured, and insightful analysis.** The insights generated will not only enhance the understanding of Amazon Prime’s current catalog but also **provide strategic guidance for content acquisition, audience targeting, and recommendation systems** in the competitive streaming industry.

#### **Define Your Business Objective?**

The objective of this project is to **leverage data analysis of Amazon Prime Video’s catalog to understand audience preferences, content diversity, and market trends.** By examining factors such as genres, regional production, release patterns, ratings, and cast/crew involvement, the project **aims to generate actionable insights** that can help:

**Enhance content strategy** by identifying popular genres, formats, and regions.

**Improve audience engagement** through data-driven recommendations.

**Support business growth** by aligning content acquisition and investment with viewer demand.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
#from google.colab import drive
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
'''try:
  drive.mount('/content/drive')
except Exception as e:
  print('Error mounting google drive')'''

### Dataset Loading

In [None]:
'''#creating file path
directory = '/content/drive/MyDrive/Colab Notebooks/capston_project/mod2_capston/AmazonPrime'
filename_titles = 'titles.csv'
filename_credits = 'credits.csv'''

In [None]:
'''try:
  filepath_titles = os.path.join(directory, filename_titles)
  filepath_credits = os.path.join(directory, filename_credits)
except Exception as e:
  print('Creation of filepath unsuccessful')'''
filepath_titles='/content/titles.csv'
filepath_credits='/content/credits.csv'

In [None]:
# Load Dataset
try:
  titles = pd.read_csv(filepath_titles)
  credits = pd.read_csv(filepath_credits)
except Exception as e:
  # Report the cause and stop: later cells depend on both dataframes
  print(f'Error loading data: {e}')
  raise


### Dataset First View

In [None]:
# Dataset First Look
titles.head()

In [None]:

credits.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('Titles Dataset - ', titles.shape)
print('----------------')
print('Credits Dataset - ', credits.shape)

### Dataset Information

In [None]:
# Dataset Info
print('Titles Dataset Info:\n')
print(titles.info())
print('----------------\n')
print('Credits Dataset Info:\n')
print(credits.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print('Titles Dataset Duplicate Values - ', titles.duplicated().sum())
print('----------------')
print('Credits Dataset Duplicate Values - ', credits.duplicated().sum())

In [None]:
print('---------Visualising duplicates in titles dataset--------')
titles[titles.duplicated()].sort_values(by = 'id')

In [None]:
print('---------Visualising duplicates in credits dataset--------')
credits[credits.duplicated()].sort_values(by = 'id')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print('Titles Dataset Missing Values:\n')
print(titles.isnull().sum())
print('----------------')
print('Credits Dataset Missing Values:\n')
print(credits.isnull().sum())

In [None]:
# Visualizing the missing values in Titles Data using heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(titles.isnull(), cmap = 'viridis',cbar = False, yticklabels = False)
plt.title('Missing Values Heatmap - Titles Dataset')

In [None]:
# Visualizing the missing values in Credits Data using heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(credits.isnull(), cmap = 'viridis',cbar = False, yticklabels = False)
plt.title('Missing Values Heatmap - Credits Dataset')

### What did you know about your dataset?

📊 Dataset Overview

The dataset used in this project comes from Amazon Prime Video’s catalog (U.S. region) 🇺🇸 — originally in two raw files: titles.csv 📄 and credits.csv 🎭 — and later cleaned 🧼, transformed 🔄, and merged into a consolidated dataframe df 📊.

📐 Dataset Size & Structure

Titles Dataset: 9,871 unique shows 🎬 & movies 🍿 with 15 attributes.

Credits Dataset: 124,235 records of cast 👥 and crew 🎥 with 5 attributes.

Final Dataframe (df): Combines both datasets, enabling title-level and person-level insights 🔍.

🧮 Types of Variables

Categorical Variables: Content type (Movie/Show 🎞️📺), Age Certification 🔞, Genres 🎭, Production Countries 🌍, Role (Actor/Director 🎬).

Numerical Variables: Release Year 📅, Runtime ⏱️, Number of Seasons 📺, IMDb/TMDB Ratings ⭐, IMDb Votes 🗳️, TMDB Popularity 🔥.

Text Variables: Title 📖, Description 📝, Name 🧑‍💼, Character 🎭.

🎭 Content Composition

The catalog includes both movies 🎞️ and TV shows 📺.

Genres are diverse (Drama 🎭, Comedy 😂, Action 💥, Thriller 😱, Documentary 🎥).

Content originates from multiple production countries 🌎, with the U.S. 🇺🇸 being dominant.

⭐ Ratings & Popularity

Most titles feature IMDb and TMDB ratings ⭐ — allowing assessment of audience reception.

TMDB popularity scores 🔥 offer insights into trending content.

👥 Cast & Crew Data

The credits dataset enriches analysis by linking actors 🎭 and directors 🎬 to titles.

This enables identification of the most frequent actors/directors and collaboration patterns 🔄.

🧹 Data Quality & Cleaning

Raw datasets had missing values 🚫, outliers 📌 and inconsistencies (e.g., missing runtimes ⏱️, null age certifications 🔞, duplicate records 🔁).

Wrangling steps included:

Handling null values (imputation or removal) 🧼.

Removing or capping outliers (extreme runtimes ⏱️ or unrealistic ratings ⭐).

Standardising categorical variables (genres 🎭, countries 🌍).

Merging duplicates and ensuring consistency across datasets 🔄.

📊 Final Dataset (df)

The final dataframe is structured and reliable, capturing:

Title metadata (name 📖, type 🎞️📺, genres 🎭, ratings ⭐, popularity 🔥).

Cast & crew information (actors 🎭, directors 🎬, roles 🎭).

It serves as a comprehensive foundation for Exploratory Data Analysis (EDA) 🔍 and deriving business insights 💡.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print('Titles Dataset Columns:\n')
print(titles.columns)
print('----------------')
print('Credits Dataset Columns:\n')
print(credits.columns)

In [None]:
# Dataset Describe
print('Titles Dataset Describe:\n')
print(titles.describe())
print('----------------')
print('Credits Dataset Describe:\n')
print(credits.describe(include = 'all'))

### Variables Description

The dataset consists of two files — titles.csv and credits.csv — which were merged into a consolidated dataframe (df) for analysis. Below is the description of the key variables:

**Titles Dataset (titles.csv)**

> Contains metadata of TV shows and movies available on Amazon Prime.

- id : Unique identifier of the title (`on JustWatch`).

- title : Name of the movie or TV show.

- type : Content type: Movie or TV Show.

- description : Brief synopsis of the title.

- release_year : Year in which the title was released.

- age_certification : Age rating (`e.g., PG, R, 18+`).

- runtime : Duration of the movie or average length of an episode (`in minutes`).

- genres : List of genres (`Drama, Comedy, Action, etc.`).

- production_countries : Countries involved in producing the title.

- seasons : Number of seasons (`only for TV shows`).

- imdb_id : Unique IMDb identifier.

- imdb_score : IMDb rating score (`0–10`).

- imdb_votes : Number of IMDb user votes.

- tmdb_popularity : Popularity score from TMDB.

- tmdb_score : TMDB rating score.

**Credits Dataset (credits.csv)**

> Contains cast and crew details for each title.

- person_id : Unique identifier for an actor/director.

- id : Title ID (`to link with titles dataset`).

- name : Name of the person (`actor or director`).

- character : Character played (`if actor`).

- role : Role in the production (`ACTOR or DIRECTOR`).
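
The `genres` and `production_countries` fields are described above as lists. The sketch below shows one way such columns could be unpacked for per-genre counting. It assumes the raw CSV stores each list as a string such as `"['drama', 'comedy']"`, which should be checked against the actual file; the toy rows and the `genres_list` column name are made up for illustration.

```python
import ast
import pandas as pd

# Toy rows; the string format of `genres` is an assumption about the raw CSV.
toy = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "genres": ["['drama', 'comedy']", "['action']", "[]"],
})

# Convert each string into a real Python list.
toy["genres_list"] = toy["genres"].apply(ast.literal_eval)

# One row per (title, genre); titles with an empty list become NaN and are dropped.
exploded = toy.explode("genres_list").dropna(subset=["genres_list"])
genre_counts = exploded["genres_list"].value_counts()
print(genre_counts)
```

Counting on the exploded column treats each genre separately, whereas counting on the raw column treats every distinct combination of genres as its own category.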

### Check Unique Values for each variable.

In [None]:
#Check count of unique values for each variable
print('Count of unique values in Titles Dataset--------')
for col in titles.columns:
  count = titles[col].nunique()
  print(f'{col} : {count}')

print('----------------')

print('Count of unique values in Credits Dataset')
for col in credits.columns:
  count = credits[col].nunique()
  print(f'{col} : {count}')

In [None]:
# Check Unique Values for each variable.
print('Unique Values in Titles Dataset-------')
for col in titles.columns:
  print(f"{col} : {titles[col].unique()}")


In [None]:
print('Unique values in Credits Dataset--------')
for col in credits.columns:
  print(f'{col} : {credits[col].unique()}')

## 3. ***Data Wrangling***

## Initial Preprocessing


In [None]:
#Remove Duplicated Values
try:
  titles.drop_duplicates(inplace = True)
  credits.drop_duplicates(inplace = True)
except Exception as e:
  print('Could not delete duplicates')

In [None]:
print('Duplicate Count after processing')
print(f'Duplicate count in Titles Dataset: {titles.duplicated().sum()}')
print('-----------')
print(f'Duplicate count in Credits Dataset: {credits.duplicated().sum()}')


### Drop columns with 70% or more null values
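
The 70% rule in this section can also be written as a small reusable helper. This is a minimal sketch on a toy dataframe, not the exact code used in this notebook; the function name `drop_sparse_columns` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold=0.70):
    """Return a copy without columns whose null fraction is >= threshold."""
    null_fraction = frame.isnull().mean()
    sparse = null_fraction[null_fraction >= threshold].index.tolist()
    return frame.drop(columns=sparse), sparse

# Toy data: `seasons` is 75% null, `runtime` is 25% null.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "seasons": [np.nan, np.nan, np.nan, 2.0],
    "runtime": [90, np.nan, 45, 110],
})
trimmed, dropped = drop_sparse_columns(toy)
print(dropped)
```

Returning the list of dropped columns alongside the dataframe makes the decision easy to log and review.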

In [None]:
#Checking % of null values for each column

titlesLen = titles.shape[0]
for col in titles.columns:
  null_count = titles[col].isnull().sum()
  print(f'% null values in {col} : {(null_count * 100 / titlesLen):.2f}%')

In [None]:
#We can see that the seasons column has more than 70% null values
#So we will drop this column

try:
  titles = titles.drop(columns = ['seasons'])
except Exception as e:
  print(f'Error dropping columns: {e}')


In [None]:
#Remaining columns after dropping seasons
titles.columns

In [None]:
#Check Null values in other columns for insights
titles[titles['description'].isnull() & titles['imdb_score'].isnull() & titles['imdb_votes'].isnull() & titles['tmdb_popularity'].isnull() & titles['tmdb_score'].isnull()]

In [None]:
#We can see that there are 69 rows where description, scores, votes and popularity are all null
#We can drop these rows as well

rowIndex = list(titles[titles['description'].isnull() & titles['imdb_score'].isnull() & titles['imdb_votes'].isnull() & titles['tmdb_popularity'].isnull() & titles['tmdb_score'].isnull()].index)
titles = titles.drop(index= rowIndex)

# Clean Description Column

In [None]:
#Since we cannot use any statistical method to fill null values for description column, we will instead fill it with 'Unknown'
#This will help us to keep track for further processing and analysis

titles['description'] = titles['description'].fillna('Unknown')

In [None]:
titles['description'].isnull().sum()

# Clean Age-Certification Column

In [None]:
# We will fill the null values with 'Unknown'
# The remaining values will give us insights to analyse this column
# Cleaning any leading/trailing whitespaces as well

titles['age_certification'] = titles['age_certification'].fillna('Unknown').str.strip()


In [None]:
titles['age_certification'].isnull().sum()

# Clean Imdb-ID Column

In [None]:

# We will fill the null values with 'Unknown'
# Cleaning any leading/trailing whitespaces as well

titles['imdb_id'] = titles['imdb_id'].fillna('Unknown').str.strip()

In [None]:
titles['imdb_id'].isnull().sum()

# Clean imdb score

In [None]:
#Check DataType of the imdb column to use appropriate statistical model
titles['imdb_score'].dtype

In [None]:
#Check the skewness of the column
titles['imdb_score'].skew()

In [None]:
#The column is approximately normally distributed (low skewness); visualizing this through a histogram
plt.title('Visualization of imdb_score (includes Null)')
sns.histplot(data = titles, x = 'imdb_score', kde = True)

In [None]:
titles["type"].head()

In [None]:
#Using mean on groupby 'type' column to fill null values
titles['imdb_score'] = titles['imdb_score'].fillna(titles.groupby('type')['imdb_score'].transform('mean'))

In [None]:
#Rounding the mean values to 2 decimal places
titles['imdb_score'] = titles['imdb_score'].round(2)

In [None]:
titles['imdb_score'].isnull().sum()

# Clean imdb_votes

In [None]:
#Check column datatype
titles['imdb_votes'].dtype

In [None]:
#Check column skewness
titles['imdb_votes'].skew()

In [None]:
#Visualising column skewness through histplot
plt.title('Visualization of imdb_votes (includes null)')
plt.xscale("log")
sns.histplot(data = titles, x = 'imdb_votes', kde = True)


In [None]:
#Fill null values with the mean within each type group
#(if the skewness above is large, the median is the more robust choice)
titles['imdb_votes'] = titles['imdb_votes'].fillna(titles.groupby('type')['imdb_votes'].transform('mean'))

In [None]:
titles['imdb_votes'].isnull().sum()

In [None]:
#Since votes cannot be in fraction, we will convert this column datatype to int
titles['imdb_votes'] = titles['imdb_votes'].astype(int)

# Clean tmdb_popularity

In [None]:
titles['tmdb_popularity'].dtype

In [None]:
#Check column skewness
titles['tmdb_popularity'].skew()

In [None]:
#The column values are right skewed by a large margin
#Visualising it through a histplot

plt.title('Visualization of tmdb_popularity (includes Null)')
plt.xscale("symlog") #Symmetric log scale to visualize wide range of skewed data
sns.histplot(data = titles, x = 'tmdb_popularity', kde = True)

In [None]:
titles['tmdb_popularity'].isnull().sum()

In [None]:
#Since the column is right skewed
#We will fill the null values with median within each type group

titles['tmdb_popularity'] = titles['tmdb_popularity'].fillna(titles.groupby('type')['tmdb_popularity'].transform('median'))

In [None]:
#After cleaning null values
print(titles['tmdb_popularity'].isnull().sum())
print('-----------------')
print('tmdb_popularity values after processing null values')
print(titles['tmdb_popularity'].head(10))

# Clean tmdb_score

In [None]:
titles['tmdb_score'].isnull().sum()

In [None]:
#Find skewness of the tmdb_score column
titles['tmdb_score'].skew()

In [None]:
#Visualising skewness through histplot
plt.title("Visualising tmdb_score (Includes Null)")
sns.histplot(data = titles, x ='tmdb_score', kde = True)

In [None]:
#tmdb_score is approximately normally distributed, so we replace the null values with the mean based on type
titles['tmdb_score'] = titles['tmdb_score'].fillna(titles.groupby('type')['tmdb_score'].transform('mean'))

In [None]:
titles['tmdb_score'].isnull().sum()

In [None]:
titles.info()

# Preprocess Credits Dataset

## Clean Character Column

In [None]:
credits.isnull().sum()

In [None]:
# Filling the null character values with Unknown
credits['character'] = credits['character'].fillna('Unknown')

In [None]:
titles.isnull().sum()

In [None]:
credits.isnull().sum()

# Merging Titles and Credits Dataset on common 'id' column

In [None]:
titles.head()

In [None]:
credits.head()

In [None]:
#Now that both datasets are clean, we can merge them on the common 'id' column for final analysis and visualization
#We use a left outer join, keeping all rows from the titles dataset and the matching rows from the credits dataset

ct_dataset = pd.merge(titles, credits, how = 'left', on = 'id')

In [None]:
ct_dataset.shape

In [None]:
#First look of merged dataset

ct_dataset.head()

In [None]:
# Checking for any null values after merging
ct_dataset.isnull().sum()

In [None]:
ct_dataset["release_year"].dtype

In [None]:
# First look at the null data

ct_dataset[ct_dataset['person_id'].isnull() & ct_dataset['name'].isnull() & ct_dataset['character'].isnull() & ct_dataset['role'].isnull()]

In [None]:
# From this data we can figure out that several other columns have values such as 'Unknown' or mean / median processed values
# Narrowing the data based on these metrics to possibly drop from the table
temp = ct_dataset[ct_dataset['person_id'].isnull() & ct_dataset['name'].isnull() & ct_dataset['character'].isnull() & ct_dataset['role'].isnull()]
temp[(temp['age_certification'] == 'Unknown')]

In [None]:
# Checking if there are any duplicate values in the merged dataset

ct_dataset.duplicated().sum()

In [None]:
# Removing rows where most values are NAN and other columns are 'Unknown'
rows = temp[(temp['age_certification'] == 'Unknown')].index
ct_dataset.drop(index = rows, inplace = True)

### Filling Null values in merged dataset with 'Unknown' since these are categorical data

In [None]:
# Final processing
# Applying directly to all the null column values

df = ct_dataset.fillna('Unknown')

In [None]:

# Last check for any null values
df.isnull().sum()

**Remove Outliers**

In [None]:
#Outlier visualisation code using Box Plot
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd


# Get numeric columns
numeric_cols = df.select_dtypes(include='number').columns.tolist()
num_cols = len(numeric_cols)

# Compute number of rows needed, 3 plots per row
n_cols_per_row = 3
n_rows = (num_cols + n_cols_per_row - 1) // n_cols_per_row

# Create subplots grid
fig, axes = plt.subplots(n_rows, n_cols_per_row, figsize=(6 * n_cols_per_row, 4 * n_rows), squeeze=False)

# Flatten axes for easy indexing
axes_flat = axes.flatten()

# Set Seaborn style
sns.set(style="whitegrid")

# Use a color palette
colors = sns.color_palette("Set2", num_cols)  # Enough colors for all plots

for idx, col in enumerate(numeric_cols):
    ax = axes_flat[idx]
    sns.boxplot(
        x=df[col],
        orient='h',
        ax=ax,
        width=0.6,
        whis=1.5,
        fliersize=5,
        linewidth=1.2,
        color=colors[idx]  # Use color instead of palette to avoid warning
    )
    ax.set_title(col, fontsize=12)
    ax.set_yticks([])         # Hide y-axis ticks
    sns.despine(ax=ax, left=True)

# Turn off unused subplots, if any
for idx in range(num_cols, len(axes_flat)):
    fig.delaxes(axes_flat[idx])

plt.tight_layout(pad=3)
plt.show()

In [None]:
#Outlier Removal code based on IQR
import pandas as pd

def remove_outliers_iqr(df):
    df_clean = df.copy()
    num_cols = df_clean.select_dtypes(include='number').columns

    for col in num_cols:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Keep only rows within bounds
        df_clean = df_clean[(df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound)]

    return df_clean

# Apply function once and reuse the result
print("Original shape:", df.shape)
df_cleaned = remove_outliers_iqr(df)
print("Shape after outlier removal:", df_cleaned.shape)
df = df_cleaned

In [None]:
#Box plot Visualization after removal of Outliers

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np



# Select numeric columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()

# Define subplot layout: 3 plots per row
n_cols = 3
n_rows = int(np.ceil(len(num_cols) / n_cols))

# Seaborn style and color palette
sns.set(style="whitegrid")
colors = sns.color_palette("Set2", len(num_cols))

# Create subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(6 * n_cols, 4 * n_rows))
axes = axes.flatten()  # Flatten in case of multiple rows

# Plot each numeric column as horizontal boxplot
for i, col in enumerate(num_cols):
    sns.boxplot(x=df[col], ax=axes[i], color=colors[i], orient='h')
    axes[i].set_title(f'Boxplot of {col}', fontsize=12)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout(pad=3)  # Good spacing between subplots
plt.show()

### What all manipulations have you done and insights you found?

🛠️ Data Manipulations Performed

To ensure the dataset was ready for analysis, several data cleaning and transformation steps were performed:

🔗 Merging Datasets

Combined the titles dataset (titles.csv) and credits dataset (credits.csv) on the common key (id).

Resulted in a consolidated dataframe (df) containing both title-level metadata and cast/crew details.
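
A left join on `id` can be sanity-checked with the `validate` and `indicator` options of `pd.merge`. The following is a minimal sketch on toy frames (`titles_toy` and `credits_toy` are hypothetical stand-ins for the real data), showing how titles without any credits could be identified after merging.

```python
import pandas as pd

titles_toy = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Actor One", "Director One", "Actor Two"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# validate raises if `id` is not unique on the titles side;
# indicator adds a `_merge` column describing where each row came from.
merged = pd.merge(titles_toy, credits_toy, how="left", on="id",
                  validate="one_to_many", indicator=True)

# Titles that found no matching credit rows.
unmatched = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(merged.shape, unmatched)
```

Because one title can have many credit rows, the merged frame has one row per title and person, so title-level statistics computed on it are weighted by cast size unless the rows are deduplicated by `id` first.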

🧹 Handling Missing Values

Removed duplicate entries.

Imputed or dropped missing values for fields like description, runtime, age certification, and ratings (imdb_score, tmdb_score).
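
The choice between mean and median imputation can be tied to the measured skewness of the column. The helper below is an illustrative sketch only: the name `impute_by_group`, the cut-off of 1.0 for "strongly skewed", and the toy values are all assumptions rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

def impute_by_group(frame, column, group_col="type", skew_limit=1.0):
    """Fill nulls with the group mean, or the group median when |skew| > skew_limit."""
    strategy = "median" if abs(frame[column].skew()) > skew_limit else "mean"
    group_value = frame.groupby(group_col)[column].transform(strategy)
    return frame[column].fillna(group_value), strategy

toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "score": [5.0, 6.0, 7.0, np.nan, 6.0, 7.0, 8.0],      # symmetric
    "votes": [10, 20, 30, np.nan, 15, 25, 100000],         # one extreme value
})

filled_score, strategy_score = impute_by_group(toy, "score")
filled_votes, strategy_votes = impute_by_group(toy, "votes")
print(strategy_score, strategy_votes)
```

In the toy data the symmetric `score` column is filled with the group mean, while the heavily skewed `votes` column falls back to the group median so that a single extreme value does not inflate the filled entries.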

📉 Outlier Treatment

Identified and removed outliers in runtime, ratings, and popularity scores using the IQR method.

Ensured unrealistic entries (e.g., extremely high runtimes) were corrected or excluded.
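
The IQR rule is easy to verify by hand on a small series. The sketch below computes the bounds for a toy runtime column and then applies them to several columns in a single pass, so every column's quartiles are computed on the same unfiltered data. It is a variation for illustration, not the notebook's function: filtering column by column, as done earlier, recomputes quartiles on rows that have already been trimmed. The names `iqr_bounds` and `remove_outliers_single_pass` are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers_single_pass(frame, columns, k=1.5):
    """Keep rows that lie inside the fences of every listed column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        low, high = iqr_bounds(frame[col], k)
        mask &= frame[col].between(low, high)
    return frame[mask]

toy = pd.DataFrame({
    "runtime": [85, 90, 95, 100, 105, 110, 400],
    "score": [6.0, 6.5, 7.0, 7.0, 7.5, 8.0, 7.0],
})
low, high = iqr_bounds(toy["runtime"])      # Q1 = 92.5, Q3 = 107.5, IQR = 15
cleaned = remove_outliers_single_pass(toy, ["runtime", "score"])
print(low, high, len(cleaned))
```

Only the 400-minute row falls outside the runtime fences of 70 and 130, so six of the seven toy rows remain.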

🔄 Data Standardization

Normalized categorical variables like genres, production countries, and age certification.

Set id as a unique index where appropriate.

🧠 Feature Engineering

Derived year-wise distributions of titles.

Classified content into Movies vs. TV Shows.

Counted actor/director frequency to study cast/crew prominence.
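
Counting how often a person appears can be sketched as a group-by over role and name. The example uses a toy credits table (`credits_toy` is hypothetical) and counts distinct titles per person, so repeated rows for the same title are not double counted.

```python
import pandas as pd

credits_toy = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t2", "t3"],
    "name": ["Ann", "Bob", "Ann", "Cy", "Ann"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "DIRECTOR", "ACTOR"],
})

# Number of distinct titles per person within each role, most frequent first.
appearances = (
    credits_toy.drop_duplicates(subset=["id", "name", "role"])
    .groupby(["role", "name"])["id"]
    .nunique()
    .sort_values(ascending=False)
)
top_actor = appearances.loc["ACTOR"].idxmax()
print(appearances)
print(top_actor)
```

Grouping by a person identifier rather than by name would be safer on real data, since two different people can share a name.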

📊 Visualization & EDA

Created box plots, bar plots, violin plots, treemaps, bubble plots, heatmaps, pie/donut charts, and correlation plots using Seaborn and Plotly for interactivity.

🔍 Insights Found

🎬 Content Composition

Amazon Prime hosts more movies than TV shows.

Certain genres (Drama, Comedy, Action) dominate the catalog, while niche genres are less represented.

📈 Trends Over Time

Significant growth in content release after 2010, showing Prime’s rapid expansion.

Recent years show diversification in genres and international productions.

🌍 Regional Distribution

Majority of titles are produced in the United States, but strong representation also comes from India, UK, and Canada.

⏱️ Runtime Analysis

Movies generally fall within the 90–120 minute range.

Outliers were present (very short or very long movies), which were filtered during cleaning.

⭐ Audience Ratings & Popularity

High IMDb/TMDB ratings correlate with higher popularity scores.

A few low-rated titles still have high popularity, suggesting marketing or trending factors.

🔞 Age Certification

A significant proportion of content is rated 18+ or PG-13, highlighting Prime’s focus on adult and teen audiences.

🎭 Cast & Crew Insights

Frequent actors and directors were identified (through credits dataset).

Certain directors and actors are repeatedly associated with highly-rated content.

In [None]:
# Save DataFrame as CSV
df.to_csv('mydata.csv', index=False)

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## Chart - 1 : Histogram
## Univariate Analysis

In [None]:
# Histogram visualization code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
import warnings

# Suppress warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Seaborn style
sns.set_style("whitegrid")

numeric_cols = ['release_year', 'runtime', 'imdb_score', 'tmdb_score']

# Choose a categorical column as hue
hue_col = 'type'  # change to any meaningful categorical column in your dataset

# Determine plots per row for better visibility
plots_per_row = 2
n_rows = math.ceil(len(numeric_cols) / plots_per_row)
plt.figure(figsize=(16, 6 * n_rows))

for idx, col in enumerate(numeric_cols):
    plt.subplot(n_rows, plots_per_row, idx + 1)

    # Dynamic palette for hue
    n_hue_levels = df[hue_col].nunique()
    palette = sns.color_palette("Set2", n_hue_levels)

    sns.histplot(
        data=df,
        x=col,
        hue=hue_col,
        kde=True,
        bins=30,
        palette=palette,
        alpha=0.7
    )

    plt.xlabel(col)
    plt.ylabel('Count')
    plt.title(f'Histogram of {col} by {hue_col}', fontsize=14, weight='bold')

plt.tight_layout()
plt.show()

### **1. Why did you pick the specific chart?**

**Histograms reveal distributions clearly**


---


Histograms split numerical data into bins and show how frequently values occur in each range. This makes it easy to see patterns such as skewness, symmetry, or multimodality.

For example, imdb_score or runtime may not be uniformly distributed, and histograms reveal that directly.
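
What a histogram does under the hood can be shown with `np.histogram`, which returns the count per bin and the bin edges. The scores below are made-up values chosen for illustration, not taken from the dataset.

```python
import numpy as np

# Hypothetical scores; five equal-width bins covering 4.0 to 9.0.
scores = np.array([4.8, 5.5, 5.9, 6.1, 6.2, 6.4, 6.8, 7.1, 7.9, 8.5])
counts, edges = np.histogram(scores, bins=5, range=(4.0, 9.0))

# The modal bin is the one with the highest count.
modal = int(np.argmax(counts))
print(counts.tolist())
print(edges[modal], edges[modal + 1])
```

Here the 6 to 7 bin holds the most values, which is exactly the kind of peak a plotted histogram makes visible at a glance.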

**KDE (Kernel Density Estimate) adds smooth insight**


---


KDE overlays a smooth curve on the histogram. It helps reveal the underlying probability distribution, which is particularly useful for large datasets (like ours with 87k rows). Peaks, valleys, and modes are easier to identify than with bars alone.
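
The idea behind KDE can be sketched in a few lines of NumPy: place a Gaussian bump on each observation and average them. This is a simplified illustration, not Seaborn's implementation; the sample runtimes and the bandwidth of 8.0 are arbitrary assumptions.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical runtimes: a short cluster and a long cluster.
samples = np.array([25, 28, 30, 32, 90, 95, 100, 105], dtype=float)
grid = np.linspace(0, 150, 301)          # step of 0.5 minutes
density = gaussian_kde_1d(samples, grid, bandwidth=8.0)

# A density should integrate to roughly 1 over the grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

The resulting curve has two separate peaks, one per cluster, with a valley in between. A larger bandwidth would smooth the two peaks together, so the bandwidth choice affects how many modes appear.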

**Hue shows categorical separation**


---


Using hue (like type) allows comparison of distributions across categories.
For instance, comparing runtime for Movies vs. TV Shows in one plot is immediate and visual.
Without hue, we would need separate plots for each category, which is less efficient.


### **2. What is/are the insight(s) found from the chart?**

**IMDb Score (imdb_score)**


---


Most titles are rated 5–7 ⭐, showing generally good quality.
Very few extremely low (<4) or very high (9–10) ratings 😮.
Peak around 6 👍, indicating most content is above average.
Movies vs TV shows (hue) may show slight differences.

**Runtime (runtime)**


---


Two peaks:

Short runtimes (TV shows 📺)

Longer runtimes (Movies 🎬)

Hue shows TV shows tend to be shorter, movies longer ⏳.

**TMDb Score (tmdb_score)**


---


Similar to IMDb scores, clustering around 6–7 🎯.
Shows content quality is generally good, few extreme scores 😅.

**Release Year (release_year)**


---


Histogram shows production trends over time ⏰.
Peaks indicate years with more content created 📈.
Hue reveals whether Movies or TV shows dominated those years.

**Overall Insights**


---


1) Most content is well-rated 👍

2) Runtime and votes are skewed, showing most content is shorter and less voted ⏳.

3) Movies vs TV shows differ in runtime, votes, and scores 📺🎬.

4) Popularity is concentrated in a few hit titles 🌟.

5) Production trends show booms in certain years ⏰.

## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

### Positive Growth

**Understanding Ratings (IMDb/TMDb scores)** ⭐


---


Most content is rated in the mid-to-upper range (roughly 5–7 on IMDb and 6–7 on TMDb, per the histograms), meaning viewers generally like it 👍.

- **Business impact:** Focus marketing on titles with high scores to attract more viewers. You could also analyze what makes top-rated content successful and replicate it.

**Runtime Trends ⏳**


---


Movies are longer 🎬, TV shows shorter 📺.

- **Business impact:** Helps content planning: knowing that audiences prefer shorter formats for TV and longer ones for movies allows efficient resource allocation.

**Temporal Trends (Release Year) ⏰**


---


Peaks in certain years show production booms.

- **Business impact:** Planning release schedules around high engagement periods can maximize reach and revenue. Historical trends help forecast future production strategies.


---



### Potential Negative Growth

**Heavy Reliance on High Ratings (IMDb/TMDb) ⭐**


---


- **Observation:** Most content clusters around moderate-to-high ratings (roughly 5–7 on IMDb and 6–7 on TMDb).

- **Negative growth risk:** Overconfidence in existing content quality could ignore underperforming or niche segments. New content in unexplored genres or formats may be neglected.

- **Justification:** Focusing only on “safe” content may stifle innovation and audience diversification.

**Bimodal Runtime Trends ⏳**


---


- **Observation:** TV shows are short, movies are long.

- **Negative growth risk:**
If the platform produces only one type (e.g., long movies), it may miss the binge-watching audience that prefers shorter content.

- **Justification:** Not catering to audience format preferences can reduce engagement and subscriptions.

## Chart - 2 : Bar plot
## Univariate Analysis

In [None]:
# Bar plot visualization code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
import warnings

# Suppress UserWarnings (like palette mismatch)
warnings.filterwarnings("ignore", category=UserWarning)

# Seaborn style
sns.set_style("whitegrid")

# Parameters
MAX_UNIQUE = 30   # Above this many unique values, only the top N categories are shown
TOP_N = 10        # Number of top categories to show if > MAX_UNIQUE
n_cols_per_row = 2

categorical_cols = ['genres', 'production_countries', 'name', 'character']

n_rows = math.ceil(len(categorical_cols) / n_cols_per_row)
plt.figure(figsize=(20, 5 * n_rows))

for idx, col in enumerate(categorical_cols):
    plt.subplot(n_rows, n_cols_per_row, idx + 1)

    hue_col = 'type'

    # Drop the 'Unknown' placeholder first so it cannot take a top-N slot,
    # then keep only the most frequent categories for high-cardinality columns
    plot_data = df[df[col] != 'Unknown'].copy()
    if plot_data[col].nunique() > MAX_UNIQUE:
        top_categories = plot_data[col].value_counts().nlargest(TOP_N).index
        plot_data = plot_data[plot_data[col].isin(top_categories)]

    # Dynamically adjust palette length to match hue levels
    n_hue_levels = plot_data[hue_col].nunique()
    palette = sns.color_palette("Set2", n_hue_levels)
    sns.countplot(data=plot_data, x=col, hue=hue_col, palette=palette, order= plot_data[col].value_counts().index)
    plt.legend(title=hue_col, loc='upper right')

    plt.xticks(rotation=45, ha='right')
    plt.xlabel('')
    plt.ylabel('Count')
    plt.title(f'{col}', fontsize=14, weight='bold')

plt.tight_layout()
plt.show()


## **1. Why did you pick the specific chart?**

**Show frequency of categories clearly 📊**


---


Bar plots display how many times each category appears.
This is critical for understanding class imbalance, popular categories, or rare categories.

> **Example:** If a “genre” column has 50% “Drama” and 5% “Horror,” a bar plot makes this immediately visible.
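
The shares in the example above can be checked numerically before plotting. This is a minimal sketch, assuming a toy `genres` column whose values are invented for illustration (they are not taken from the Prime catalog):

```python
import pandas as pd

# Toy genre column; the values are invented for illustration only
genres = pd.Series(["Drama"] * 10 + ["Comedy"] * 9 + ["Horror"] * 1, name="genres")

# Absolute counts are what a count plot draws as bar heights
counts = genres.value_counts()

# normalize=True converts the counts into shares of the total
shares = genres.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Looking at shares alongside the bars makes it easier to state the imbalance precisely rather than reading it off the axis.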



**Works well with hue for comparisons 🌈**


---


By adding hue, we can compare subgroups within each category.

> **Example:** “Movies vs TV shows” within each genre — we can see which type dominates.
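
The counts behind a hue-split bar plot can also be tabulated directly. The sketch below assumes a small made-up frame that reuses the `genres` and `type` column names; the rows and the `MOVIE`/`SHOW` labels are illustrative assumptions:

```python
import pandas as pd

# Toy rows; column names mirror the notebook, values are invented
toy = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "comedy", "comedy", "horror"],
    "type":   ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE"],
})

# Rows are genres, columns are content types, cells are counts
genre_by_type = pd.crosstab(toy["genres"], toy["type"])
print(genre_by_type)
```

Each row of the table corresponds to one group of side-by-side bars in the plot.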

**Handles high-cardinality columns with top N 🎯**


---


For columns with too many unique values (like names, IDs), plotting all categories would be messy.
Showing only the top N categories (`TOP_N = 10` in the code above) keeps plots clean and interpretable.
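
The top-N idea can be isolated into a small helper. This is a sketch under stated assumptions: `top_n_rows` is a hypothetical function name, and the toy `name` values are invented; it drops the `'Unknown'` placeholder before ranking so the placeholder cannot occupy a top slot.

```python
import pandas as pd

def top_n_rows(frame: pd.DataFrame, col: str, n: int,
               placeholder: str = "Unknown") -> pd.DataFrame:
    """Keep rows whose value in `col` is among the n most frequent,
    ignoring rows that hold the placeholder value."""
    known = frame[frame[col] != placeholder]
    top = known[col].value_counts().nlargest(n).index
    return known[known[col].isin(top)]

# Toy data: 'Unknown' is the most frequent value but should not be plotted
toy = pd.DataFrame({"name": ["A", "A", "A", "B", "B", "C",
                             "Unknown", "Unknown", "Unknown", "Unknown"]})
top2 = top_n_rows(toy, "name", n=2)
print(top2["name"].value_counts())
```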

**Supports log scale for skewed data 📉**


---


Some categories may have very high counts compared to others.
A log-scaled count axis (not applied in the code above) can make the smaller categories visible without losing the big ones.
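
To see why a log axis helps, compare the same counts on a linear and a logarithmic scale. The numbers below are invented to mimic a strong imbalance; this is only an illustration of the effect, not output from the dataset:

```python
import numpy as np

# Invented counts with a strong imbalance between categories
counts = np.array([48_000, 3_000, 120])

# Height of each bar relative to the tallest one on a linear axis
linear_ratio = counts / counts.max()

# Orders of magnitude, which a log-scaled axis spaces evenly
log_counts = np.log10(counts)

print(linear_ratio.round(4))
print(log_counts.round(2))
```

On a linear axis the smallest bar is well under 1% of the tallest and is effectively invisible, while on a log scale the three values sit roughly one order of magnitude apart. In matplotlib the equivalent switch is `ax.set_yscale("log")` on the count axis.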

**Subplots maximize readability 🖼️**


---


Placing two plots per row (`n_cols_per_row = 2`) keeps each plot large enough to read while still showing all relevant columns.
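
The grid size follows from a ceiling division of the number of plots by the plots per row. A minimal sketch, where `grid_shape` is a hypothetical helper name introduced only for this illustration:

```python
import math

def grid_shape(n_plots: int, n_cols_per_row: int = 2) -> tuple[int, int]:
    """Return (rows, cols) needed to place n_plots panels
    with a fixed number of columns per row."""
    n_rows = math.ceil(n_plots / n_cols_per_row)
    return n_rows, n_cols_per_row

print(grid_shape(4))  # four categorical columns, two per row
print(grid_shape(5))  # an odd count leaves one empty panel in the last row
```

When the count is odd, the leftover axis can be removed with `fig.delaxes`, as some of the later cells do.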

## **2. What is/are the insight(s) found from the chart?**

**Movies Dominate Shows**


---


The dataset is heavily skewed toward movies across genres, actors, and roles.

**Genres – Drama Leads Strongly**


---


Drama is the most common genre by a large margin, followed by comedy and documentaries.

Multi-genre blends (Drama+Romance, Thriller+Drama, etc.) are also well-represented.

**Production – US-Centric**


---


Nearly all productions are US-based (~48k), with India, UK, Canada, and France trailing far behind.

Co-productions exist but are minimal in volume.

**Actors & Characters – Classic & Generic**


---


Frequent actors are largely from classic Hollywood (e.g., Roy Rogers, Gene Autry).

Character roles are often generic (Himself, Nurse, Sheriff, Waitress), pointing to many documentaries and background roles.

## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

### Positive Impact

**Focus on high-demand categories 🌟**


---



*   **Observation:** The tallest bars indicate the most frequent or popular categories.
*   **Business impact:** You can invest more in popular genres, regions, or product types to maximize ROI, engagement, or sales.






**Explore underserved categories 🌱**


---



*   **Observation:** Short bars show underrepresented categories.
*   **Business impact:** These represent growth opportunities. Investing in niche or emerging categories can capture untapped market segments.






**Subgroup strategies with hue 🌈**

---

*   **Observation:** Hue comparisons reveal patterns within subgroups (e.g., Movies vs TV Shows).
*   **Business impact:** Enables tailored strategies for different audiences, improving user satisfaction and retention.




**Early detection of issues ⚠️**

---

*   **Observation:** Tiny or unexpected categories can indicate data inconsistencies.
*   **Business impact:** Fixing data quality issues early prevents wrong business decisions based on flawed data.





### Negative Impacts

**Over-reliance on dominant categories ⚠️**

---

*   **Observation:** Some categories dominate the dataset (very tall bars), while others are rare.
*   **Negative impact:** Focusing too much on the top categories may ignore emerging or niche markets. Competitors could capture these underserved segments, leading to missed growth opportunities.


**Skewed subgroups with hue 🌈**

---

*   **Observation:** If a hue category (e.g., “TV Shows” within “Comedy”) is underrepresented, certain audience preferences may be neglected.
*   **Negative impact:** Ignoring these subgroups can reduce customer satisfaction in specific segments, potentially causing churn or declining engagement.


**High-cardinality categories with insufficient focus 🔍**

---

*   **Observation:** Columns with many unique values (e.g., production countries, creators) may have some categories barely represented.
*   **Negative impact:** Without targeted strategies, these low-frequency categories may never gain traction, limiting expansion opportunities.


**Hidden data quality issues 🛑**

---

*   **Observation:** Unexpected or empty categories appear as tiny bars.
*   **Negative impact:** Poor data quality can mislead decision-making, causing investments in wrong areas, inefficiency, or reputational risk.


**Overcrowded focus on niche categories without ROI 💸**

---

*   **Observation:** Some rare categories are small but tempting to invest in.
*   **Negative impact:** Spending too much on very low-frequency categories may drain resources without sufficient return.




#### Chart - 3: Pie plot

## Univariate Analysis

In [None]:
# Chart - 3: Pie chart visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import math

# Seaborn style
sns.set_style("whitegrid")

# Thresholds
max_unique_for_plot = 50      # skip columns with too many unique values
top_categories = 15           # show top N categories in pie chart

# Choose hue column (categorical with 2-8 unique values)
hue_col = 'type'

# Count total plots: one age_certification pie chart per hue value (movie, show)
total_plots = df[hue_col].nunique()

# Subplot layout: 2 per row
n_cols_per_row = 2
n_rows = math.ceil(total_plots / n_cols_per_row)
fig, axes = plt.subplots(n_rows, n_cols_per_row, figsize=(10 * n_cols_per_row, 8 * n_rows))
axes = axes.flatten()

# Color palette
colors = sns.color_palette("Set2", top_categories)

plot_idx = 0
for hue_val in df[hue_col].unique():
    data = df[df[hue_col] == hue_val]['age_certification'].value_counts().sort_values(ascending=False)

    axes[plot_idx].pie(data, labels=data.index, autopct='%1.1f%%', startangle=140,
                        colors=colors, shadow=False)
    axes[plot_idx].set_title(f"age_certification distribution for {hue_col} = {hue_val}",
                              fontsize=13, fontweight='bold')
    plot_idx += 1

plt.tight_layout()
plt.show()

## **1. Why did you pick the specific chart?**


**Why Pie Charts Are Useful for This Dataset**  

---  

**Quick Comparison Across Categories 📊**  

---  
- With hue applied (when a small secondary categorical variable is available), we can compare distributions across subgroups without creating multiple separate charts.  

**Top Categories Emphasis ✨**  

---  
- By limiting to top N categories (e.g., top 15), the charts focus on the most meaningful data and avoid clutter.  

**Immediate Visual Impact 🎨**  

---  
- Pie charts provide an instant visual sense of dominance.

**Overall Insight**  

---  
- Pie charts make it easy to see proportions and categorical distributions at a glance.  
- With careful handling (top categories, hue, clean labels), they are highly effective for this dataset 😎.
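
The percentages a pie chart prints are simply within-group shares. The sketch below assumes a toy frame with the notebook's `type` and `age_certification` column names; the certification values and counts are invented:

```python
import pandas as pd

# Toy data; values are invented for illustration
toy = pd.DataFrame({
    "type": ["MOVIE"] * 6 + ["SHOW"] * 4,
    "age_certification": ["R", "R", "R", "PG-13", "PG-13", "G",
                          "TV-MA", "TV-MA", "TV-14", "TV-PG"],
})

# Share of each certification within each content type (each group sums to 1)
shares = toy.groupby("type")["age_certification"].value_counts(normalize=True)
print(shares.round(2))
```

Each group of shares corresponds to one pie, so the table is a quick way to verify the slice labels.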


## **2. What is/are the insight(s) found from the chart?**

**Type Distribution (Movies vs TV Shows) 🎬**  

---  

- Movies dominate the dataset, making up the majority of entries, while TV shows are fewer.  
- Suggests that most of the content is movie-focused.  

**Summary Insight**  

---  

- Most content is movies, focused on teen/adult ratings.  
- Helps in targeting production, marketing, and audience strategies.

## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

### Positives  

**Focus on Popular Content Types 🎬**  

---  

- **Observation:** Movies dominate the dataset.  
- **Positive growth opportunity:** Investing in movie production or promotion will likely maximize audience engagement and revenue.  
> ✅ Positive impact: Strategically producing or marketing content in the teen/adult rating bands can increase viewership and retention.  



### Negatives  

**Over-reliance on Movies**  

---  

- **Observation:** TV shows are underrepresented.  
- **Negative growth risk:** Neglecting TV content could limit audience engagement, especially for viewers who prefer series.  
> ❌ Potential impact: May reduce overall platform engagement.  


**Overall Insight**  

---  

> Focus on popular content types can maximize growth, but over-reliance on a few categories may pose risks to engagement and diversity.

#### Chart - 4: Box plot

## Univariate Analysis

In [None]:
# Box plot visualization code
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Suppress warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Select numeric columns
num_cols = ['imdb_votes', 'tmdb_popularity']

# Define subplot layout: 2 plots per row
n_cols = 2
n_rows = int(np.ceil(len(num_cols) / n_cols))

# Seaborn style and color palette
sns.set(style="whitegrid")
colors = sns.color_palette("Set2", len(num_cols))

# Create subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10 * n_cols, 4 * n_rows))
axes = axes.flatten()  # Flatten in case of multiple rows

# Plot each numeric column as horizontal boxplot
for i, col in enumerate(num_cols):
    sns.boxplot(x=df[col], ax=axes[i], color=colors[i], orient='h')
    axes[i].set_title(f'Boxplot of {col}', fontsize=12)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout(pad=15)  # Good spacing between subplots
plt.show()


## **1. Why did you pick the specific chart?**

Box plots are specifically designed for numeric/continuous data. They help summarize large datasets in a compact and visual way.

- Show the distribution of data clearly (median, quartiles, range).

- Reveal skewness in numeric features (e.g., imdb_votes may be highly skewed).

- Highlight outliers such as extremely long movies or unusually high vote counts.

- Enable comparisons across categories using hue or categorical axes (e.g., runtime of Movies vs. TV Shows).

- Work well for multiple numeric columns when arranged in subplots, offering a cleaner comparison than histograms or line plots.

Using box plots provides both clarity and efficiency, making them ideal for exploratory data analysis.
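
The points a box plot draws as outliers follow the 1.5 × IQR rule by default in matplotlib and seaborn. A minimal sketch of that rule, assuming invented vote counts and a hypothetical helper name `iqr_outliers`:

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Invented vote counts with one extreme title
votes = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 5000], name="imdb_votes")
print(iqr_outliers(votes))
```

Listing the flagged rows this way makes it possible to inspect which titles sit beyond the whiskers instead of only seeing dots.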

## **2. What is/are the insight(s) found from the chart?**

**Distribution of Numeric Columns**  

---  

**imdb_votes**  
Highly right-skewed: most titles receive relatively few votes, while a handful of titles with very large vote counts appear as outliers.  

**tmdb_popularity**  
Highly skewed distribution: a few titles get massive attention (outliers), while most get moderate votes/popularity.  

**Skewness and Spread**  

---  

imdb_votes and tmdb_popularity are right-skewed, indicating most content has moderate attention, and few are extremely popular.  

imdb_score might have a tighter spread, showing most content falls in a typical rating range.  
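
Skewness can be quantified rather than judged by eye. The sketch below uses invented vote counts and compares the sample skewness before and after a `log1p` transform, a common (but here assumed, not source-confirmed) way to tame right-skewed counts:

```python
import numpy as np
import pandas as pd

# Invented vote counts: many small values and one very large one
votes = pd.Series([5, 8, 12, 20, 35, 60, 150, 900, 25_000], name="imdb_votes")

raw_skew = votes.skew()            # skewness of the raw counts
log_skew = np.log1p(votes).skew()  # skewness after log(1 + x)

print(round(raw_skew, 2), round(log_skew, 2))
```

A positive value indicates a long right tail; the transformed series is still right-skewed in this toy example, but noticeably less so.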





## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

## **Positive Business Impact**

Understanding Content Performance


---


IMDB/TMDB scores and votes show which content is most liked and engaged with.

**Actionable insight**: Invest more in content types or genres that consistently receive high scores and popularity. ✅

Outlier Identification


---


Extremely long or short content, or highly unpopular content, is visible as outliers.

**Actionable insight**: Avoid producing extremely long content unless it’s proven to engage audiences; similarly, identify why some content fails to gain votes/popularity. ✅

Skewed Popularity Metrics


---


imdb_votes and tmdb_popularity are right-skewed, showing that a few titles dominate attention.

**Actionable insight:** Strategically promote underperforming but high-quality content to balance audience engagement. ✅

## **Negative Impact**

Skewed Popularity Metrics


---


**Observation:** Columns like imdb_votes and tmdb_popularity are highly right-skewed.

**Negative impact:** Most content gets low engagement while only a few titles dominate popularity.

**Risk:** Relying on the same top-performing content repeatedly could lead to over-dependence on a few hits, leaving most content underperforming.

Low Ratings


---


**Observation:** Outliers with very low imdb_score or tmdb_score appear in the box plots.

**Negative impact:**

Low-rated content can damage brand perception if such content is frequent. May reduce subscriber retention or discourage engagement.


#### Chart - 5: Scatter Plot
## Bi-Variate Analysis ( IMDB Score Vs TMDB Score )

In [None]:
# Scatter Plot visualization code

import warnings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Suppress warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Seaborn style
sns.set_style("whitegrid")
palette = ["green", "orange"]  # custom palette

# Choose a hue column (categorical with few categories)
hue_col = 'type'
x_col = 'imdb_score'
y_col = 'tmdb_score'

sns.scatterplot(
        data=df,
        x=x_col,
        y=y_col,
        hue=hue_col,
        palette=palette,
        alpha=0.7,
        edgecolor="black",
        s=40,
)


plt.title(f"{x_col} vs {y_col} by {hue_col}", fontsize=14, weight="bold")
plt.xlabel(x_col, fontsize=12)
plt.ylabel(y_col, fontsize=12)

plt.legend(title=hue_col, fontsize=9, title_fontsize=10, loc="best", frameon=True)

plt.show()

## **1. Why did you pick the specific chart?**

Scatter plots are the most effective way to visualize the relationship between two numeric variables (e.g., budget vs revenue, duration vs rating). They allow trends, clusters, and outliers to be seen clearly in two-dimensional space.

- Detect correlations and patterns such as positive, negative, or no relationship. For example, if budget increases and revenue also increases, the plot shows an upward trend.

- Highlight outliers like movies with unusually high revenue compared to the average.

- Add a categorical column as hue (e.g., genre, type) to compare how groups behave in relation to numeric pairs.

- Scale easily to many numeric column pairs, making them systematic tools for exploratory data analysis.

- Scatter plots are chosen because they are the most informative for analyzing numeric relationships, while still allowing categorical insights through hue.

## **2. What is/are the insight(s) found from the chart?**

**imdb_score vs tmdb_score by type**


---


There is a positive correlation between IMDb scores and TMDB scores. Content rated highly on IMDb generally receives higher ratings on TMDB as well.

Most titles cluster in the mid-to-high range, specifically between IMDb scores 5–8 and TMDB scores 5–7.5, showing that the majority of movies and shows fall in this typical band.
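
The visual impression of a positive relationship can be backed by a correlation coefficient. A minimal sketch with invented score pairs that reuse the `imdb_score` and `tmdb_score` column names:

```python
import pandas as pd

# Invented score pairs for illustration
toy = pd.DataFrame({
    "imdb_score": [4.0, 5.5, 6.0, 6.5, 7.0, 8.0],
    "tmdb_score": [5.0, 5.5, 6.5, 6.0, 7.0, 7.5],
})

# Pearson correlation: +1 is a perfect upward line, 0 is no linear relationship
r = toy["imdb_score"].corr(toy["tmdb_score"])
print(round(r, 2))
```

On the real data, the same call on `df` would give a single number to report alongside the scatter plot.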

**Movies vs Shows**

- Movies dominate the dataset (green points far outnumber orange).

- Shows tend to cluster more consistently toward higher IMDb ratings (6–8), suggesting stronger audience approval for series compared to movies.

**Outliers**

- A few titles have very low IMDb scores (~3–4) but moderate TMDB ratings.

- Some titles show the reverse: high IMDb scores (~8–9) but relatively lower TMDB ratings.
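
Titles where the two platforms disagree can be listed by computing the gap between the scores. This is a sketch with invented rows; the `score_gap` column and the threshold of 2.0 points are assumptions chosen for illustration:

```python
import pandas as pd

# Invented titles and scores
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [3.5, 8.8, 6.5, 7.0],
    "tmdb_score": [6.0, 6.2, 6.4, 7.1],
})

# Positive gap: IMDb rates the title higher; negative gap: TMDB rates it higher
toy["score_gap"] = toy["imdb_score"] - toy["tmdb_score"]

THRESHOLD = 2.0  # arbitrary cut-off chosen for this sketch
disagreements = toy[toy["score_gap"].abs() >= THRESHOLD]
print(disagreements)
```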

**Overall**

IMDb and TMDB scores are strongly aligned, most content sits in the 5–8 range, and while movies dominate in number, shows often achieve slightly stronger consistency in ratings.

## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

### Positives  

**Risk Management ⚖️**  

---  

- **Observation:** Outlier analysis pinpoints projects with higher chances of failure (e.g., high budget but low audience interest).  
- **Positive growth opportunity:** Helps avoid risky investments.  
- **Justification:** Learning from past flops improves financial planning and reduces losses.  

**Market Segmentation 🛒**  

---  

- **Observation:** Clustering in scatter plots reveals distinct market segments (indie films, blockbusters, streaming content).  
- **Positive growth opportunity:** Enables targeted marketing and distribution strategies.  
- **Justification:** Better segmentation drives audience engagement and revenue growth.  

**Overall Insight**  

- Scatter plots provide actionable insights for smarter investment, risk mitigation, and audience targeting.  
- Applying these findings can improve profitability, reduce losses, and optimize content strategies.  

---

### Negatives  

**Over-investment Risk 🚨**  

---  

- **Observation:** Very high-budget movies don’t always generate proportional revenue.  
- **Negative growth risk:** Excessive spending may reduce profitability.  
- **Justification:** Over-investment in large projects can harm ROI and limit overall growth.  

**Genre Saturation 📉**  

---  

- **Observation:** Certain genres consistently show low or stagnant performance.  
- **Negative growth risk:** Heavy investment in underperforming genres drains resources.  
- **Justification:** Ignoring performance trends can reduce profitability and market relevance.  

**Inefficient Resource Allocation 🔄**  

---  

- **Observation:** Smaller projects may outperform big-budget productions.  
- **Negative growth risk:** Over-prioritizing large budgets could miss opportunities in mid/low-budget hits.  
- **Justification:** Misallocation limits returns and reduces portfolio efficiency.  

**Audience Shift Trends ⏩**  

---  

- **Observation:** Declining trends in certain categories (e.g., theatrical releases vs streaming).  
- **Negative growth risk:** Ignoring changing audience preferences can reduce competitiveness.  
- **Justification:** Failure to adapt may lead to lost market share and stagnant growth.  

**Overall Insight**  

- Scatter plots not only highlight opportunities but also reveal risks in budget, genre, and audience trends.  
- Addressing these negative signals is crucial to maintain profitability and sustainable growth.


#### Chart - 6: Line Plot
## Bi-Variate Analysis

In [None]:
# Line Plot visualization code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import math

# Suppress warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Seaborn style
sns.set_style("whitegrid")
sns.set_palette("Set2")

# Identify numeric and categorical columns
num_cols = ['runtime', 'imdb_votes','tmdb_popularity']
cat_cols = ['release_year']

# Possible hue column (few unique categories)
hue_col = 'type'

# Build valid pairs: (time/ordered col, numeric col)
pairs = [(x, y) for x in cat_cols for y in num_cols]

# Subplot setup: 2 per row for readability
n_cols_per_row = 2
n_rows = math.ceil(len(pairs) / n_cols_per_row)
fig, axes = plt.subplots(n_rows, n_cols_per_row, figsize=(10 * n_cols_per_row, 6 * n_rows))
axes = axes.flatten()

for idx, (x_col, y_col) in enumerate(pairs):
    ax = axes[idx]

    if hue_col and hue_col != x_col:
        sns.lineplot(data=df, x=x_col, y=y_col, hue=hue_col, marker="o", ax=ax)
    else:
        grouped = df.groupby(x_col)[y_col].mean().reset_index()
        sns.lineplot(data=grouped, x=x_col, y=y_col, marker="o", ax=ax)

    ax.set_title(f"{y_col} vs {x_col}" + (f" by {hue_col}" if hue_col else ""), fontsize=14, weight="bold")
    ax.set_xlabel(x_col, fontsize=12)
    ax.set_ylabel(y_col, fontsize=12)
    ax.tick_params(axis='x', rotation=30)

    if hue_col:
        ax.legend(title=hue_col, fontsize=9, title_fontsize=10, loc="best", frameon=True)
    else:
        ax.get_legend().remove()

# Remove unused subplots
for j in range(len(pairs), len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout(pad=3)
plt.show()


## **1. Why did you pick the specific chart?**

**Why Line Charts for This Dataset 📈**  

---  

The dataset includes **time variables** (e.g., `release_year`) and ordered categorical fields, making line charts ideal for showing trends and changes over time.  

**Temporal Trends ⏳**  

- Line charts reveal how metrics like popularity, ratings, or runtime evolve across years.  


**Ordered Categories 📊**  

- Line charts handle progression across these ordered categories better than bar or scatter plots.  

**Comparison with Hue 🎨**  

- Adding hue (e.g., type = Movie 🎬 vs TV Show 📺) allows easy comparison of multiple trends at once.  
- Example: Determining which content type gained more attention after 2015.  

**Easy Storytelling 📝**  

- Line charts are intuitive and allow stakeholders to quickly spot growth 📈, decline 📉, or stability ➖.  
- Perfect for communicating trends and making data-driven business decisions.  

**Overall**  

Line charts are ideal for this dataset because they highlight **temporal patterns ⏳**, **ordered categorical progressions 📊**, and **trend-based business insights 🔑**.
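
When several rows share the same x value, seaborn's `lineplot` by default draws the mean of y for each x (with a confidence band). The sketch below reproduces that aggregation by hand on invented rows, so the plotted line can be checked against a table:

```python
import pandas as pd

# Invented rows using the notebook's column names
toy = pd.DataFrame({
    "release_year": [2019, 2019, 2019, 2020, 2020, 2020],
    "type":         ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "SHOW"],
    "runtime":      [100, 120, 40, 90, 30, 50],
})

# Mean runtime per year and content type; one column per hue level
yearly_mean = (
    toy.groupby(["release_year", "type"])["runtime"]
       .mean()
       .unstack("type")
)
print(yearly_mean)
```

Each column of the resulting table corresponds to one line in the chart.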


## **2. What is/are the insight(s) found from the chart?**

📊 Insights from Line Charts

- **Content Growth Over Time 🎬📺**  
  Number of movies 🎥 and TV shows 📺 released has increased significantly after 2010 🚀.  
  This suggests the streaming boom and global content expansion 🌍.

- **Runtime Trends ⏱️**  
  Average movie runtimes are increasing 📉, while TV shows maintain steady lengths ➖.

- **Score and Popularity**  
  Score and popularity trends look consistent over time.

✨ **In short:**  
The dataset shows a clear surge in content production 📈, especially in recent years.  
Viewer preferences are shifting toward shorter, diverse, and family-friendly content 🎉.  
These trends can guide content strategy, marketing, and platform investments 💰.
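
Counting releases per year needs one row per title, because a frame merged with credits repeats each title once per cast or crew member. A minimal sketch, assuming a hypothetical `id` column that uniquely identifies a title (the rows are invented):

```python
import pandas as pd

# Invented merged rows: title t1 appears twice, t3 three times
toy = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3", "t3", "t3"],
    "release_year": [2009, 2009, 2015, 2015, 2015, 2015],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
})

# Collapse to one row per title before counting
titles = toy.drop_duplicates(subset="id")

# Number of distinct titles per year and content type
titles_per_year = (
    titles.groupby(["release_year", "type"]).size().unstack(fill_value=0)
)
print(titles_per_year)
```

Without the de-duplication step, titles with large casts would be over-counted.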


## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**Positive and Negative Business Insights from Line Charts**  

---  

Line charts reveal both opportunities and risks in content production, helping guide investment and strategy decisions.  

**Positive Business Impact 🌟**  

---  

- **Informed Budget Allocation 💰**  
  Trends over years highlight periods with better ROI.  
  Helps studios or platforms allocate marketing & production budgets smartly 🧠.  

- **Timing Releases for Maximum Impact ⏰**  
  Analysis of release trends identifies peak production periods.  
  Enables strategic launches for maximum visibility and revenue 💸.  

**Potential Negative Growth Insights ⚠️**  

---  

- **Over-saturation Risk 🎯**  
  Rapid content growth after 2010 🚀 may create too many similar shows/movies.  
  High competition can lower viewer attention per title → reduced revenue per production 😬.  

- **Niche Genre Underperformance 📉**  
  Certain genres (e.g., Documentaries 🎥 or experimental formats) show flat or declining trends ➖.  
  Heavy investment in these areas may lead to negative ROI if audience interest doesn’t grow.  

- **Audience Fatigue 😓**  
  Decreasing runtimes (ultra-short content) or drops in quality can reduce engagement despite higher production.  
  Overproduction without quality may harm brand reputation and long-term growth.  

**Overall Insight**  
---  

Line charts provide a clear view of historical trends, enabling smarter budget allocation, optimized release timing, and awareness of potential over-saturation or audience fatigue risks.


#### Chart - 7 : Box plot
## Bi-Variate Analysis

In [None]:
# Chart - 7 visualization code
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
warnings.filterwarnings("ignore")

# Set style
sns.set(style="whitegrid")

# Create subplots grid
fig, axes = plt.subplots(3, 2, figsize=(16, 16))
fig.suptitle("Bivariate Box Plots", fontsize=18, fontweight='bold')
df = df[df != 'Unknown']  # Mask 'Unknown' entries as NaN so they are left out of the plots

# Define a palette for more colorful plots
palette = sns.color_palette("Set2")

# 1. IMDb Score by Type
sns.boxplot(x='type', y='imdb_score', data=df, ax=axes[0,0], palette=palette)
axes[0,0].set_title('IMDb Score by Type', fontsize=13, fontweight='bold')

# 2. IMDb Votes by Type
sns.boxplot(x='type', y='imdb_votes', data=df, ax=axes[0,1], palette=palette)
axes[0,1].set_title('IMDb Votes by Type', fontsize=13, fontweight='bold')

# 3. TMDB Popularity by Type
sns.boxplot(x='type', y='tmdb_popularity', data=df, ax=axes[1,0], palette=palette)
axes[1,0].set_title('TMDB Popularity by Type', fontsize=13, fontweight='bold')

# 4. Runtime by Age Certification
sns.boxplot(x='age_certification', y='runtime', data=df, ax=axes[1,1], palette=palette)
axes[1,1].set_title('Runtime by Age Certification', fontsize=13, fontweight='bold')
axes[1,1].tick_params(axis='x', rotation=45)

# 5. IMDb Score by Age Certification
sns.boxplot(x='age_certification', y='imdb_score', data=df, ax=axes[2,0], palette=palette)
axes[2,0].set_title('IMDb Score by Age Certification', fontsize=13, fontweight='bold')
axes[2,0].tick_params(axis='x', rotation=45)

# 6. TMDB Score by Type
sns.boxplot(x='type', y='tmdb_score', data=df, ax=axes[2,1], palette=palette)
axes[2,1].set_title('TMDB Score by Type', fontsize=13, fontweight='bold')

# Adjust layout for better visibility
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

## **1. Why did you pick the specific chart?**

**Numeric vs Categorical**

---  

- **Observation:** Columns like `imdb_score`, `imdb_votes`, `tmdb_popularity`, `runtime`, and `tmdb_score` are numeric, while `type` and `age_certification` are categorical.  
- **Why box plots fit:** Boxplots clearly show how the distribution of numeric values differs across categories.  
> ✅ Benefit: Enables better understanding of patterns and differences between content types or age groups.  

**Outlier Detection**  

---  

- **Observation:** Boxplots visually highlight outliers.  
- **Why box plots fit:** Identifying extreme values in `imdb_votes` or `tmdb_popularity` helps address skewness or anomalies.  
> ✅ Benefit: Supports more accurate analysis and decision-making.  

**Distribution Insights**  

---  

- **Observation:** Median, quartiles, and spread of numeric values across categories are easily visible.  
- **Why box plots fit:** They make it easy to check patterns such as whether one content type generally has higher IMDb scores.  
> ✅ Benefit: Guides content strategy, recommendation systems, and targeted marketing.  
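
The quantities a grouped box plot displays (quartiles and median per category) can be tabulated with a group-wise quantile. A minimal sketch on invented scores, reusing the `type` and `imdb_score` column names:

```python
import pandas as pd

# Invented scores for two content types
toy = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 8.0, 6.0, 6.5, 7.0, 7.5, 8.5],
})

# Q1, median, and Q3 of imdb_score within each type
summary = (
    toy.groupby("type")["imdb_score"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(summary)
```

Comparing the `0.5` column across rows is the numeric counterpart of comparing the median lines of two boxes.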



## **2. What is/are the insight(s) found from the chart?**

**Bivariate Boxplot Insights 😃**  

---  

**IMDb Score by Type 🎬**

---  
- Movies (`type='Movie'`) generally have higher median IMDb scores compared to TV shows.  
- TV shows show more spread in ratings, indicating variability in quality.  

**IMDb Votes by Type 🗳️**  

---  
- Movies receive significantly more votes than TV shows on average.  
- Outliers indicate some movies are extremely popular with massive vote counts.  

**TMDB Popularity by Type ⭐**

---  
- Movies tend to have higher median popularity than TV shows.  
- TV shows show a wider range of popularity, including some very low and some high popularity outliers.  

**Runtime by Age Certification ⏱️**  

---  
- Most movies cluster around certain runtime ranges depending on age certification.  
- Longer runtimes are more common for PG-13 and R rated movies.  

**IMDb Score by Age Certification 📊**

---  
- Age certification doesn’t strongly affect median IMDb score, but some certifications (like R) have higher maximum ratings.  
- Ratings spread is wider for PG-13 and R, showing variability in reception.  

**TMDB Score by Type 🎯**  

---  
- Movies generally have higher median TMDB scores than TV shows.  
- The spread for TV shows indicates greater variation in audience ratings.  

**Overall Takeaways**  

---  
- Movies dominate in popularity, votes, and scores compared to TV shows.  
- Certain age certifications are associated with longer runtimes and slightly higher ratings.  
- Outliers in votes, popularity, and scores highlight exceptionally popular or critically acclaimed content.


## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**Positive and Negative Business Impact ✅❌**  

---  

**Focus on Movies for Popularity and Engagement**  

---  
- Movies have higher median IMDb and TMDB scores, and more votes than TV shows.  
- Suggests investing in movie production or marketing could maximize viewer engagement and revenue.  

**Target Age Certifications Strategically**  

---  
- Longer runtimes and higher maximum ratings are seen in PG-13 and R rated movies.  
- Creating content in these certifications could attract audiences who value higher-rated or longer content, improving retention.  

**Identify Outliers as High-Performers**  

---  
- Exceptionally popular movies (high votes/popularity) can be studied to replicate successful traits.  
- Boosts future content quality and informs production strategy.  

**TV Shows Lag in Popularity and Ratings ❌**  

---  
- Lower median scores and votes indicate that TV shows, on average, underperform compared to movies.  
- Investing heavily in poorly performing TV shows without addressing quality or promotion could negatively affect engagement and revenue.  

**High Variability in Certain Categories**  

---  
- Some age certifications show wide spreads in IMDb scores and runtime.  
- Without careful targeting, producing content for these categories might lead to inconsistent audience reception, harming brand reputation or profitability.  

**Summary**  

---  
- Analysis highlights where to invest and focus (movies, PG-13/R content) while cautioning against potential underperforming areas (TV shows or highly variable categories).  
- 😎 In short: Data-backed guidance supports strategic decisions to maximize positive business impact and avoid negative growth.


#### Chart - 8: Bar Plot
## Bi-Variate Analysis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")

# ===============================
# Parameters
# ===============================
MAX_UNIQUE = 30   # Max allowed unique categories
TOP_N = 20        # If categories exceed MAX_UNIQUE, show top N
n_cols_per_row = 2

# Target categorical and numerical columns
categorical_cols = ['genres', 'production_countries']
num_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Drop NA in selected categorical columns
# (df is reassigned here and inside the loop, so later cells see the reduced frame)
df = df.dropna(subset=categorical_cols).copy()

# Seaborn theme
sns.set_theme(style="whitegrid")

# Prepare figure layout (2x2 grid)
fig, axes = plt.subplots(2, 2, figsize=(20, 12))
axes = axes.flatten()

plot_index = 0

# Loop through categorical columns
for cat_col in categorical_cols:
    # Multi-valued cells (comma-separated): keep only the first listed category
    df[cat_col] = df[cat_col].astype(str).str.split(',').apply(lambda x: x[0].strip() if len(x) > 0 else None)

    # Reduce categories if too many unique values
    if df[cat_col].nunique() > MAX_UNIQUE:
        top_cats = df[cat_col].value_counts().nlargest(TOP_N).index
        df = df[df[cat_col].isin(top_cats)]

    # Plot numeric columns until the 2x2 grid is full
    # (with four or more numeric columns, all panels use the first categorical column)
    for num_col in num_cols:
        if plot_index >= 4:
            break

        # Skip empty or invalid numerical columns
        df[num_col] = pd.to_numeric(df[num_col], errors='coerce')
        if df[num_col].dropna().empty:
            continue

        # Create the bar plot
        sns.barplot(
            data=df,
            x=cat_col,
            y=num_col,
            ax=axes[plot_index],
            palette=sns.color_palette("viridis", n_colors=min(df[cat_col].nunique(), 10)),
            errorbar=None  # seaborn >= 0.12; on older versions use ci=None
        )

        axes[plot_index].set_title(f'{num_col} by {cat_col}', fontsize=15, fontweight='bold', pad=15)
        axes[plot_index].set_xlabel(cat_col, fontsize=13)
        axes[plot_index].set_ylabel(num_col, fontsize=13)
        axes[plot_index].tick_params(axis='x', rotation=30)

        plot_index += 1

# Remove unused subplots if fewer than 4
for j in range(plot_index, len(axes)):
    fig.delaxes(axes[j])

# Adjust spacing and layout
plt.suptitle("🎬 Bivariate Analysis — Genres & Production Countries", fontsize=20, fontweight='bold', y=0.98)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()


## **1. Why did you pick the specific chart?**

**A bar plot is ideal when:**

___
- One variable is categorical (genres)
- The other is numerical (imdb_score, runtime, tmdb_popularity, etc.)

It compares the mean of a numeric variable across categories (seaborn's default estimator), so differences between genres are easy to read at a glance. It does not show the spread within each category.
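
The bar heights are per-category means, so they can be cross-checked with a plain `groupby`. The sketch below uses a small invented frame (`toy`) in place of `df` and assumes `genres` is stored as a comma-separated string. Unlike the plotting cell, it counts a title under every genre it lists and leaves the input frame unchanged.

```python
import pandas as pd

# Hypothetical stand-in for df; titles and scores are invented.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["drama, comedy", "drama", "comedy", "action, drama"],
    "imdb_score": [8.0, 6.0, 5.0, 7.0],
})

# One row per (title, genre) pair, built on a new frame so toy is not modified.
long_df = toy.assign(genres=toy["genres"].str.split(",")).explode("genres")
long_df["genres"] = long_df["genres"].str.strip()

# Mean score per genre, sorted from highest to lowest.
genre_means = long_df.groupby("genres")["imdb_score"].mean().sort_values(ascending=False)
print(genre_means)
```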


## **2. What is/are the insight(s) found from the chart?**

**🎬 1. Genre-wise Insights**

---
**When plotting genres vs numeric variables:**
___

- **IMDb Score 🎯:** Certain genres such as “Documentary” or “Drama” tend to have higher average IMDb scores, indicating stronger audience appreciation compared to genres like “Horror” or “Reality.”
- **Runtime ⏱️:** Genres like “Drama” and “Action” have longer runtimes on average, while “Comedy” and “Animation” tend to be shorter.
- **TMDB Popularity 🔥:** “Action” and “Adventure” genres show higher popularity scores, reflecting stronger viewer engagement and trending status.

**🧠 Interpretation:**
This helps identify which genres perform well both critically (IMDb) and commercially (popularity).


## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**✅ Positive Business Impact**
___

| Insight | Business Impact |
|---|---|
| **🎬 High-rated genres (e.g., Drama, Documentary)** | Investing more in these genres can improve platform reputation and viewer satisfaction, leading to better customer retention. |
| **🔥 High-popularity genres (e.g., Action, Adventure)** | These attract new users and increase viewership hours, which helps in subscriber growth and ad revenue. |
| **🧠 Understanding low-rated or less popular genres** | Helps Prime Video reduce investment risks by avoiding or redesigning underperforming categories. |


**⚠️ Negative or Limiting Insights**
___


| Observation | Potential Negative Impact | Justification |
|---|---|---|
| ⬇️ Certain genres (e.g., Horror, Reality TV) have low IMDb ratings. | Investing heavily in such genres may lead to low audience satisfaction and higher churn rates. | Indicates these categories are not resonating with the Prime Video audience. |

#### Chart - 9: Violin Plot
Bivariate Analysis

In [None]:
# Violin Plot visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import math
import pandas as pd

# Seaborn style
sns.set_style("whitegrid")

# Define variables
num_cols = ['tmdb_popularity', 'runtime', 'imdb_score']
cat_col = 'production_countries'
hue_col = 'type'

# Filter dataset to top 10 most frequent production countries
top10_countries = df[cat_col].value_counts().nlargest(10).index
temp_df = df[df[cat_col].isin(top10_countries)]

# Subplot setup — 2 plots per row
n_cols_per_row = 2
n_rows = math.ceil(len(num_cols) / n_cols_per_row)
fig, axes = plt.subplots(n_rows, n_cols_per_row, figsize=(12 * n_cols_per_row, 6 * n_rows))
axes = axes.flatten()

palette = sns.color_palette("Set2")

# Create violin plots for each numeric column
for i, num_col in enumerate(num_cols):
    ax = axes[i]

    sns.violinplot(
        data=temp_df,
        x=cat_col,
        y=num_col,
        hue=hue_col,
        palette=palette,
        ax=ax,
        inner="quart",
        cut=0
    )

    # Apply log scale for better visual clarity (zero values cannot be drawn on a log axis)
    ax.set_yscale("log")

    ax.set_title(f"{num_col.capitalize()} by {cat_col} (Hue: {hue_col})", fontsize=14, fontweight="bold")
    ax.tick_params(axis='x', rotation=45)
    ax.set_xlabel("Production Country")
    ax.set_ylabel(f"Log-scaled {num_col.replace('_', ' ').capitalize()}")

# Remove unused subplot if any
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout(pad=3)
plt.show()



## **1. Why did you pick the specific chart?**



**Shows Distribution Shape 🎨**  

---  

- Unlike a box plot, violin plots display the full distribution of the data using a kernel density estimate.  
- Reveals skewness, multi-modality, and other patterns in numeric variables that boxplots alone might hide.  


**🎨 Visualizing Distribution:**

---
Violin plots reveal how each numeric variable (e.g., tmdb_popularity, runtime, imdb_score) is distributed across the top 10 production countries.


**⚖️ Perfect for Comparisons:**

---
With hue='type', each country’s plot shows separate distributions for Movies and TV Shows, making differences in popularity, rating, or duration instantly visible.
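
The quartile lines that `inner="quart"` draws inside each violin can also be tabulated, which makes the Movie vs TV Show comparison easier to quote in a report. A minimal sketch on an invented sample follows; the country codes, type labels, and runtimes are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample; column names mirror df, values are invented.
toy = pd.DataFrame({
    "production_countries": ["US", "US", "US", "US", "IN", "IN", "IN", "IN"],
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "runtime": [90, 110, 40, 50, 120, 160, 20, 30],
})

# 25th, 50th and 75th percentiles of runtime for each country and type.
quartiles = (
    toy.groupby(["production_countries", "type"])["runtime"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```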


## **2. What is/are the insight(s) found from the chart?**

**🎬 1. TMDB Popularity vs Production Countries**

---
Movies show higher median TMDB popularity compared to TV Shows in most countries — especially in the USA, India, and the UK.

This pattern hints that Movies dominate global buzz, while TV Shows sustain steady but smaller audiences.

**⏱ 2. Runtime vs Production Countries**

---

Movies generally have longer runtimes and a wider spread, showing the diversity in film lengths across different production countries.

India and the USA have the broadest range of movie runtimes, possibly due to the variety of genres they produce.

**⭐ 3. IMDb Score vs Production Countries**

---

Movies from countries like the USA and UK exhibit a wider score spread, showing both critically acclaimed and low-rated productions.

This implies that TV content is more predictable in quality, while Movies are more hit-or-miss but occasionally reach higher peaks.

## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*


**Positive Growth:**

---
The insights directly guide content investment, regional strategy, and viewer retention efforts — all key to increasing subscriber base and engagement.


**Potential Risks:**

---
Overproduction of low-rated movies or underrepresentation of international shows could stagnate growth if not addressed.

**🧠 Conclusion**

---
Overall, the gained insights are highly actionable and positive — they can guide Amazon Prime Video to make data-driven content, marketing, and regional investment decisions.

However, careful quality control and geographic diversification are essential to avoid negative growth risks related to uneven performance and limited audience diversity.

#### Chart - 10: Donut chart

In [None]:
# Donut chart visualization code
import pandas as pd
import plotly.express as px
import plotly.subplots as sp
import plotly.graph_objects as go
import math

# Hardcoded categorical columns
cat_cols = ['type', 'age_certification', 'genres', 'production_countries']

# Choose hue column wherever possible (must have <=12 unique values for readability)
possible_hue = 'type' if df['type'].nunique() <= 12 else None

top_n = 15        # top N categories
min_pct = 2       # combine slices <2% into 'Others'

# Subplot setup: 2 per row
n_cols_per_row = 2
n_rows = math.ceil(len(cat_cols) / n_cols_per_row)
fig = sp.make_subplots(
    rows=n_rows, cols=n_cols_per_row,
    subplot_titles=[f"{col}" for col in cat_cols],
    specs=[[{'type':'domain'}]*n_cols_per_row for _ in range(n_rows)]
)

# Color palette
colors_palette = px.colors.qualitative.Pastel

for i, col in enumerate(cat_cols):
    # Handle multi-valued columns
    if df[col].dropna().astype(str).str.contains(',').any():
        value_counts = df[col].dropna().str.split(',').explode().str.strip().value_counts()
    else:
        value_counts = df[col].value_counts()

    # Take top N categories
    value_counts = value_counts.head(top_n)

    # Combine small slices into 'Others'
    total = value_counts.sum()
    small_slices = value_counts[value_counts/total*100 < min_pct]
    value_counts_combined = value_counts[value_counts/total*100 >= min_pct]
    if not small_slices.empty:
        value_counts_combined['Others'] = small_slices.sum()

    row = i // n_cols_per_row + 1
    col_pos = i % n_cols_per_row + 1

    # Add one donut per column (a single trace each; possible_hue is not applied to the traces)
    fig.add_trace(
        go.Pie(
            labels=value_counts_combined.index,
            values=value_counts_combined.values,
            hole=0.35,
            marker=dict(colors=colors_palette * ((len(value_counts_combined) // len(colors_palette)) + 1)),
            textinfo="percent+label",
            hovertemplate="<b>%{label}</b><br>Count: %{value}<br>Percent: %{percent}<extra></extra>",
            pull=[0.05 if j == value_counts_combined.values.argmax() else 0 for j in range(len(value_counts_combined))],
            sort=False
        ),
        row=row, col=col_pos
    )

# Layout updates
fig.update_layout(
    title_text="Donut Charts of Key Categorical Columns",
    title_x=0.5,
    showlegend=True,
    height=450 * n_rows,
    width=900,
    template="plotly_white"
)

fig.show()


## **1. Why did you pick the specific chart?**

I picked donut charts because 🍩✨:

- **Perfect for categorical data 😎** – Columns like type, age_certification, genres, and production_countries are all categorical, and donuts clearly show the proportion of each category.

- **Handles multi-valued categories 🎯** – For columns like genres and production_countries, we can split and visualize top categories while combining smaller ones into “Others,” making it easy to read.

- **Interactive & clean 👀** – Using Plotly, we get hover info with counts and percentages, so you can quickly understand the data.

- **Visual emphasis 🌟** – The hole in the donut draws attention to the proportions, and we can highlight the largest slice for extra focus.

- **Comparative clarity 📊** – Multiple donuts in subplots let you compare distributions across columns without clutter.

- **Business insights 💡** – Quickly shows dominant types, certifications, genres, or countries, helping in decisions like content strategy or targeting audiences.



## **2. What is/are the insight(s) found from the chart?**

**Insights from Donut Charts 😄🍩**  

---  

**Content Type Distribution 📺🎬**  

---  

- One type (e.g., Movie or TV Show) dominates the dataset, showing which type is more prevalent.  
- Helps understand the focus of the platform or dataset.  

**Age Certification Patterns 🔞👶**  

---  

- Most titles fall under certain age certifications (like TV-MA, PG-13), giving insight into the target audience demographics.  

**Genre Popularity 🎭**  

---  

- Some genres appear far more frequently (e.g., Drama, Comedy), while niche genres have smaller proportions combined into “Others.”  
- Shows which genres are most produced or available, guiding content strategy.  

**Production Countries 🌍**  

---  

- A few countries dominate production, indicating regional content concentration.  
- Can guide market expansion or regional marketing strategies.  

**Small Slices Combined as “Others” 📝**  

---  

- Combining minor categories keeps the visualization clean and highlights major contributors, making insights clearer.  
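
A minimal sketch of the "Others" grouping is shown below on invented counts. The helper name `collapse_small_slices` is hypothetical. One difference from the plotting cell is worth noting: the cell keeps only the top 15 categories before computing percentages, so anything beyond the top 15 is left out of the donut entirely, whereas this sketch folds every small category into "Others" and keeps the total unchanged.

```python
import pandas as pd

def collapse_small_slices(counts: pd.Series, min_pct: float = 2.0) -> pd.Series:
    """Merge categories whose share is below min_pct into one 'Others' entry."""
    share = counts / counts.sum() * 100
    kept = counts[share >= min_pct].copy()
    remainder = counts[share < min_pct].sum()
    if remainder > 0:
        kept["Others"] = remainder
    return kept

# Hypothetical genre counts, invented for illustration.
counts = pd.Series({"drama": 60, "comedy": 30, "western": 1, "war": 1, "sport": 8})
collapsed = collapse_small_slices(counts, min_pct=2.0)
print(collapsed)
```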

**Interactive Hover Insights 🖱️**  

---  

- You can see exact counts and percentages for each category, making it easy to extract precise business insights.  

**Overall Insight**  

---  

- The charts clearly show which categories dominate, which are niche, and where the focus lies, helping to make data-driven content and marketing decisions 😄📊.


## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

1️⃣ Positive Business Impact ✅

- **Content Strategy 🎯:**  
  By seeing which types (Movie vs TV Show) and genres dominate, the platform can invest in popular content to attract more viewers.

- **Target Audience Alignment 👨‍👩‍👧‍👦:**  
  Age certification distribution shows which audience segments are most served, helping marketing campaigns and content recommendations.

- **Regional Focus 🌍:**  
  Production countries highlight where content is concentrated, guiding regional partnerships, licensing, and local content production.

- **Resource Allocation 💡:**  
  Knowing the high-performing categories allows efficient allocation of budget, promotions, and original content creation.

2️⃣ Potential Negative Growth Insights ⚠️

- **Over-reliance on a few categories 📉:**  
  If most content comes from a single type (Movie) or a few genres, engagement may drop among audiences seeking variety.

- **Regional concentration risk 🌐:**  
  If production is dominated by a few countries, the platform might struggle to expand globally or lose appeal in underrepresented regions.

- **Niche genre neglect 💤:**  
  Smaller genres are combined into “Others,” which suggests each of them has limited representation. This could discourage users looking for diverse content, reducing retention.

In short, the donut charts provide clear insights for growth, highlight popular and underrepresented areas, and allow for strategic decisions to maximize engagement while avoiding pitfalls 😄✨


#### Chart - 11: Treemap

In [None]:
# Tree map visualization code
import pandas as pd
import plotly.express as px



# Helper function: Keep top N categories, group rest as "Others"
def top_n_with_others(series, n=10):
    top_n = series.value_counts().nlargest(n).index
    return series.apply(lambda x: x if x in top_n else "Others")

# Apply cleanup for categorical columns
df["genres_clean"] = top_n_with_others(df["genres"], 10)
df["countries_clean"] = top_n_with_others(df["production_countries"], 10)
df["age_cert_clean"] = top_n_with_others(df["age_certification"], 6)

# ---------------- Variation 1: Type > Age Certification > Genres ---------------- #
fig1 = px.treemap(
    df,
    path=["type", "age_cert_clean", "genres_clean"],
    values=None,
    color="type",
    color_discrete_sequence=px.colors.qualitative.Set3,
    title="🎬 Treemap (Clean): Type > Age Certification > Genres",
)
fig1.update_traces(textinfo="label+percent parent")
fig1.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig1.show()

# ---------------- Variation 2: Type > Countries ---------------- #
fig2 = px.treemap(
    df,
    path=["type", "countries_clean"],
    values=None,
    color="countries_clean",
    color_discrete_sequence=px.colors.qualitative.Pastel,
    title="🌍 Treemap (Clean): Type > Production Countries",
)
fig2.update_traces(textinfo="label+percent parent")
fig2.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig2.show()

# ---------------- Variation 3: Countries > Genres ---------------- #
fig3 = px.treemap(
    df,
    path=["countries_clean", "genres_clean"],
    values=None,
    color="countries_clean",
    color_discrete_sequence=px.colors.qualitative.Safe,
    title="📌 Treemap (Clean): Production Countries > Genres",
)
fig3.update_traces(textinfo="label+percent parent")
fig3.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig3.show()

## **1. Why did you pick the specific chart?**

- **Hierarchical relationships:**  
  The dataset has multiple categorical layers (type, age_certification, genres, production_countries). Treemaps are perfect to show nested categories and their relative sizes in one view.

- **Proportional insight:**  
  Displays the proportion of each category within its parent, so you can quickly see dominant types, genres, or countries.

- **Handles categorical data elegantly:**  
  Instead of multiple bar plots, treemaps compress information visually, making patterns easier to spot.

- **Interactive & engaging:**  
  With Plotly, hovering gives detailed labels and percentages, enhancing user-friendly exploration.

- **Good for decision-making:**  
  Helps identify popular content types, gaps, and regional trends, which is crucial for business strategy.
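
The "percent parent" figures that `textinfo="label+percent parent"` prints on each tile can be reproduced with a grouped count. The sketch below does this for a two-level hierarchy on an invented frame; the type labels and genre values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample with the same column names as the treemap input.
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "genres_clean": ["drama", "drama", "drama", "comedy", "drama", "comedy"],
})

# Number of titles in each (type, genre) leaf of the hierarchy.
leaf = toy.groupby(["type", "genres_clean"]).size().rename("count").reset_index()

# Share of each leaf within its parent type.
leaf["pct_of_parent"] = leaf["count"] / leaf.groupby("type")["count"].transform("sum") * 100
print(leaf)
```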


## **2. What is/are the insight(s) found from the chart?**

**Insights from Treemap Charts 😄📊**  

---  

**Content Type Distribution 🎬**  

---  

- Movies vs TV Shows are clearly distinguishable.  
- One type (usually Movies) dominates the dataset in terms of quantity 😎.  

**Age Certification Patterns 🧒🔞**  

---  

- Certain age ratings (like PG or R) dominate for specific types.  
- Helps understand what type of audience the content is targeting 😄.  

**Genre Popularity 🎭**  

---  

- Within each type and age certification, some genres are more prevalent.  
- Popular genres can be identified quickly without looking at raw numbers 🤩.  

**Production Country Trends 🌍**  

---  

- Some countries (e.g., USA) dominate content production.  
- Allows geographic content strategy, focusing on high-producing regions 🌟.  

**Content Gaps ⚡**  

---  

- Smaller blocks or "Others" highlight underrepresented genres or countries.  
- These gaps can inform new content opportunities 😄✨.  

**Hierarchical Patterns 📊**  

---  

- Treemap shows how genres are distributed within type and age certification, or how countries contribute to genres.  
- Makes it easy to spot correlations across categories 🤓.  



## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

### Positives  

**Content Strategy 🎬**  

---  

- **Observation:** Most popular genres and types are identifiable for each age group.  
- **Positive growth opportunity:** Focus on creating or acquiring content that attracts the largest audience 😎.  
> ✅ Positive impact: Maximizes engagement and platform reach.  

**Geographic Expansion 🌍**  

---  

- **Observation:** Certain countries produce most content.  
- **Positive growth opportunity:** Target high-production regions for partnerships or licensing deals 🤩.  
> ✅ Positive impact: Expands global presence and content variety.  



### Negatives  

**Over-investing in Saturated Categories 🎭**  

---  

- **Observation:** Majority of content belongs to one genre or type (e.g., Movies → Drama → PG).  
- **Negative growth risk:** Investing more resources here may yield diminishing returns.  
> ❌ Potential impact: Reduced ROI and inefficient resource allocation.  

**Ignoring Underrepresented Markets 🧩**  

---  

- **Observation:** Countries or genres with small representation may be overlooked.  
- **Negative growth risk:** Ignoring them could miss opportunities for emerging audiences.  
> ❌ Potential impact: Limits growth and market diversification.  

**Overall Insight**  

---  

> The treemap clearly guides where to focus for maximum ROI 😄🌟, while warning against over-saturation and misaligned investments ⚠️.


#### Chart - 12: Bubble plot

In [None]:
# Bubble plot visualization code
import pandas as pd
import plotly.express as px


# Filter dataset: only titles with significant votes
df_clean = df[df["imdb_votes"] > 1000]

# Limit to top 5 genres for clarity
top_genres = df_clean["genres"].value_counts().head(5).index
df_clean = df_clean[df_clean["genres"].isin(top_genres)]

# Create interactive bubble plot
fig = px.scatter(
    df_clean,
    x="imdb_score",
    y="tmdb_popularity",
    size="imdb_votes",
    color="genres",
    hover_name="title",
    hover_data={
        "release_year": True,
        "imdb_score": True,
        "imdb_votes": True,
        "type": True,
        "genres": False
    },
    size_max=50,                     # bigger bubbles
    opacity=0.7,
    color_discrete_sequence=px.colors.qualitative.Vivid,  # vibrant colors
    title="🎈 Bubble Plot: TMDB Popularity vs IMDb Score (Top Genres)",
    template="plotly_white"

)


# Show the plot
fig.show()


## **1. Why did you pick the specific chart?**

- **Two key variables on axes 📊:**  
  - imdb_score on the x-axis shows viewer ratings.  
  - tmdb_popularity on the y-axis shows popularity.  

- **Third variable via bubble size 🎈:**  
  - imdb_votes determines bubble size, highlighting how many viewers rated each title.  

- **Categorical distinction via color (hue) 🎨:**  
  - Using genres as color makes it easy to compare performance across genres.  


✅ In short: The bubble plot shows ratings, popularity, vote volume, and genre differences at one glance, which is much harder with a plain scatter plot or bar chart 😄✨.


## **2. What is/are the insight(s) found from the chart?**

**Popularity vs Quality ⭐**  

---  

- Larger bubbles with higher IMDb scores indicate titles that are both popular and highly rated.  
- Some big bubbles with moderate scores show popular titles that may not be critically acclaimed.  

**Genre Performance 🎭**  

---  

- Certain genres dominate high IMDb scores and votes (e.g., Action, Drama, Comedy).  
- Some genres may have many releases but lower average scores, suggesting mixed audience reception.  

**Temporal Trends 📅**  

---  

- release_year is available only in the hover tooltip of this plot, not on an axis, so year-level patterns have to be read title by title.  
- A separate plot with release_year on an axis would be needed to confirm peaks of highly-rated titles in particular years.  

**Type Comparison 🎬**  

---  

- type (Movies vs Shows) is likewise shown in the hover tooltip, so differences between formats can be checked per title.  
- TV Shows may have smaller bubbles but steady scores, while Movies might have larger bubbles.  
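
The plotting cell for this chart passes `type` only as hover data. If a side-by-side comparison of Movies and Shows is wanted, one option is Plotly Express's `facet_col` argument. The sketch below is an assumption about how that could look: the helper names `prepare_bubble_data` and `make_bubble_figure` are hypothetical, and the sample rows are invented.

```python
import pandas as pd

def prepare_bubble_data(df: pd.DataFrame, min_votes: int = 1000, top_k: int = 5) -> pd.DataFrame:
    """Keep well-voted titles that belong to the top_k most frequent genres."""
    voted = df[df["imdb_votes"] > min_votes]
    top_genres = voted["genres"].value_counts().head(top_k).index
    return voted[voted["genres"].isin(top_genres)].copy()

def make_bubble_figure(data: pd.DataFrame):
    """Bubble plot with one panel per content type."""
    import plotly.express as px  # imported here so the data prep runs without Plotly
    return px.scatter(
        data, x="imdb_score", y="tmdb_popularity",
        size="imdb_votes", color="genres",
        facet_col="type",  # one panel for each value of type
        hover_name="title", size_max=50, opacity=0.7,
    )

# Hypothetical rows, invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["MOVIE", "SHOW", "MOVIE", "SHOW"],
    "genres": ["drama", "drama", "comedy", "drama"],
    "imdb_score": [7.5, 8.1, 6.2, 5.9],
    "tmdb_popularity": [40.0, 12.0, 25.0, 3.0],
    "imdb_votes": [5000, 2000, 3000, 500],
})
bubble_df = prepare_bubble_data(toy, min_votes=1000, top_k=1)
print(bubble_df)
```

Calling `make_bubble_figure(bubble_df).show()` would then render the faceted figure in the notebook.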

**Audience Engagement Insights 😄**  

---  

- The combination of size (votes) and rating (score) highlights titles that engage large audiences, useful for content strategy and marketing.


## **3. Will the gained insights help creating a positive business impact?**
*Are there any insights that lead to negative growth? Justify with specific reason.*

**Positive Business Impact ✅**  

---  

**Identifying High-Performing Genres and Titles**  

---  

- The plot clearly shows which genres consistently attract viewers with high IMDb scores and votes (e.g., Drama, Comedy, Action).  
- Helps in strategic content creation or acquisition, targeting genres that bring both popularity and critical acclaim.  

**Trend Analysis Over Years**  

---  

- Observing peaks in certain years allows businesses to plan releases or campaigns during periods with higher audience engagement.  

**Type-Based Insights (Movies vs Shows)**  

---  

- Separating formats helps allocate resources efficiently—for example, investing more in the type that delivers better ROI in terms of audience engagement and ratings.  

**Audience Engagement**  

---  

- Bubble size indicates viewer interest, which can guide marketing priorities to promote titles likely to go viral or gain traction.  

**Negative Growth Insights ⚠️**  

---  

**Over-Saturated Genres**  

---  

- Some genres may have many titles but smaller bubbles or lower scores, indicating that producing more content in these categories may not significantly increase engagement.  

**Low-Rated Popular Titles**  

---  

- Large bubbles with moderate or low IMDb scores suggest high viewership but poor audience satisfaction, which can harm brand reputation if repeated frequently.  

**Temporal Decline Trends**  

---  

- Certain years may show fewer high-rated releases, hinting at shifts in audience preferences or content quality issues that can negatively impact future planning.


#### Chart - 13: Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import pandas as pd
import plotly.express as px

# Select numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Compute correlation matrix
corr_matrix = df[numeric_cols].corr()

# Create interactive heatmap with sober colors
fig = px.imshow(
    corr_matrix,
    text_auto=".2f",                      # show correlation values
    color_continuous_scale='Blues',       # sober blue gradient
    aspect="auto",
    labels=dict(x="Columns", y="Columns", color="Correlation"),
    title="📊 Interactive Correlation Heatmap (Sober Colors)"
)

# Improve layout for readability
fig.update_layout(
    xaxis_title="Numeric Columns",
    yaxis_title="Numeric Columns",
    xaxis_tickangle=-45,
    width=800,
    height=700,
)

fig.show()


## **1. Why did you pick the specific chart?**

- **Shows pairwise relationships clearly 🔍:**  
  With multiple numeric columns like imdb_score, imdb_votes, runtime, tmdb_popularity, and tmdb_score, a heatmap allows you to see how each variable relates to others at a glance.  

- **Highlights strong correlations ⭐:**  
  Strong positive correlations (variables moving in the same direction) stand out as the darkest cells; with the sequential Blues scale, negative correlations (moving in opposite directions) appear as the lightest cells.  

- **Compact representation 📦:**  
  Instead of plotting multiple scatter plots for each pair, a heatmap summarizes all pairwise correlations in one concise visual.  

- **Interactive exploration with Plotly 🖱️:**  
  Using hover, you can see exact correlation values, making it easier to interpret the data interactively.  
  Zooming and panning allow for better analysis when many variables are present.  

- **Professional & clean look 🎨:**  
  With a sober color palette, it’s suitable for presentations, dashboards, or reports, making insights immediately understandable.  

✅ In short: The correlation heatmap lets you quickly identify relationships, spot potential predictors, and understand how numerical features interact, which is essential for any data-driven decision-making process 😎✨.
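
Because correlations range from -1 to 1, a diverging colour scale centred on zero can make the sign easier to read than a sequential one. The sketch below shows one way to do that with `px.imshow` using the `RdBu` scale and fixed `zmin`/`zmax`; the helper name `correlation_figure` and the sample values are assumptions for illustration.

```python
import pandas as pd

def correlation_figure(corr: pd.DataFrame):
    """Heatmap with a diverging scale pinned to the full correlation range."""
    import plotly.express as px  # imported here so the numeric part runs without Plotly
    return px.imshow(
        corr,
        text_auto=".2f",
        color_continuous_scale="RdBu",  # opposite signs get opposite hues
        zmin=-1, zmax=1,                # zero always maps to the neutral midpoint
        aspect="auto",
    )

# Hypothetical numeric columns, invented for illustration.
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0],
    "tmdb_score": [5.5, 6.0, 7.5, 8.0],
    "runtime": [120, 110, 100, 90],
})
corr = toy.corr()
print(corr.round(2))
```

`correlation_figure(corr).show()` would display the figure; on the real data `corr` would be `df[numeric_cols].corr()`.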


## **2. What is/are the insight(s) found from the chart?**

**IMDb Score vs Other Metrics ⭐**  

---  

- **imdb_score and imdb_votes:** Usually a mild positive correlation, indicating that highly-rated titles tend to attract more votes, but some popular titles may not have high scores.  
- **imdb_score and tmdb_score:** Strong positive correlation, meaning that ratings from IMDb and TMDB are generally aligned, reflecting consistent audience perception.  

**Popularity vs Engagement 🎯**  

---  

- **tmdb_popularity and imdb_votes:** Positive correlation, suggesting that titles that are popular on TMDB tend to get more votes on IMDb, indicating cross-platform engagement.  

**Runtime Patterns ⏱️**  

---  

- **runtime correlations:** Usually low or negligible with other variables, implying that runtime has little linear association with ratings or popularity.  

**TMDB Score Insights 🎬**  

---  

- **tmdb_score and tmdb_popularity:** Slight positive correlation, meaning higher-rated titles on TMDB are somewhat more popular, but there are exceptions.  

**General Observations 😎**  

---  

- Most correlations are moderate, with no extreme negative relationships. The features therefore carry largely non-redundant information, which can be useful for modeling or predictive analytics.  
- Highly correlated metrics (imdb_score & tmdb_score) can be combined or weighted in business insights.  
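
To read these pairs off as a ranked list instead of scanning the grid, the upper triangle of the correlation matrix can be flattened and sorted by absolute value. This is a minimal sketch on invented numbers; the helper name `ranked_pairs` is hypothetical.

```python
import numpy as np
import pandas as pd

def ranked_pairs(corr: pd.DataFrame) -> pd.Series:
    """Return each unordered column pair once, sorted by absolute correlation."""
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # strictly above the diagonal
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Hypothetical numeric columns, invented for illustration.
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0],
    "tmdb_score": [5.5, 6.0, 7.5, 8.0],
    "runtime": [90, 120, 100, 110],
})
ranked = ranked_pairs(toy.corr())
print(ranked)
```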



#### Chart - 14: Pair Plot

In [None]:
# Pair Plot visualization code
import pandas as pd
import plotly.express as px


# Select numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Choose a hue for categorical separation
hue_col = 'type'  # Movies vs Shows

# Create interactive scatter matrix

fig = px.scatter_matrix(
    df,
    dimensions=numeric_cols,
    color=hue_col,
    hover_name='title',
    title="🎨 Interactive Pair Plot of Numeric Columns",
    color_discrete_sequence=px.colors.qualitative.Pastel,  # clean, soft colors
    height=800,
    width=900
)

# Update marker style for clarity
fig.update_traces(
    marker=dict(size=4, opacity=0.8, line=dict(width=0.4, color='DarkSlateGrey')),
    diagonal_visible=False
)

# Update layout for good look and feel
fig.update_layout(
    dragmode='select',
    hovermode='closest',
    plot_bgcolor='white',
    title_font_size=20,
    margin=dict(l=50, r=50, t=80, b=50)
)

# Panel spacing is left at Plotly's defaults: layout entries are graph objects,
# not plain dicts, so a dict-based domain tweak would never apply

fig.show()


## **1. Why did you pick the specific chart?**

- **Bivariate Analysis Across All Numeric Columns 📊:**  
  The dataset has multiple numeric columns (imdb_score, imdb_votes, tmdb_popularity, tmdb_score, runtime, release_year).  
  A scatter matrix lets us visually inspect relationships between every pair of numeric variables at once 🔍.  

- **Interactive Exploration 🖱️:**  
  Using Plotly makes it interactive: hover to see movie titles, zoom, pan, and select points.  
  Very useful for spotting patterns, outliers, or clusters in large datasets 🌟.  

- **Hue for Categorical Separation 🎨:**  
  Adding type (Movies vs Shows) as hue lets us compare patterns within each category.  
  We can see if Movies and Shows differ in imdb_score, popularity, or runtime 🎬.  

- **Compact and Informative 🗂️:**  
  Instead of multiple individual scatter plots, the scatter matrix combines all relationships into one grid, saving space and giving a complete view ✨.  

- **Detect Correlations and Patterns Quickly 💡:**  
  Makes it easy to spot linear/non-linear relationships, clusters, or extreme values, which can inform further analysis or modeling decisions 🔑.  

- **Interactive & Colorful 😎:**  
  Plotly’s interactivity and vibrant but clean colors make the visualization more appealing and insightful than static plots 🌈.
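
Vote counts are typically heavy-tailed, which can squash most points into one corner of each panel. One option, sketched below on invented values, is to add a `log1p`-transformed copy of such columns and pass that to the scatter matrix instead. The choice of which columns to transform is an assumption and should be checked against the actual distributions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are invented for illustration.
toy = pd.DataFrame({
    "imdb_score": [6.1, 7.4, 8.2, 5.0],
    "imdb_votes": [0, 150, 12000, 900000],
})

skewed_cols = ["imdb_votes"]  # assumption: picked after inspecting the distributions
plot_df = toy.copy()
for col in skewed_cols:
    # log1p(x) = log(1 + x), so zero vote counts stay defined
    plot_df[f"log_{col}"] = np.log1p(plot_df[col])

# Skewness before and after the transform.
print(plot_df["imdb_votes"].skew(), plot_df["log_imdb_votes"].skew())
```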


## **2. What is/are the insight(s) found from the chart?**

### Positives  

**IMDb Scores vs Release Year 🎬**  

---  

- **Observation:** Older movies/shows tend to have a wide range of IMDb scores, while newer releases cluster in a narrower range.  
- **Positive growth opportunity:** Shows may have slightly lower scores compared to movies in recent years.  
> ✅ Positive impact: Helps plan content release strategies and monitor quality trends.  

**Votes vs Popularity 📊**  

---  

- **Observation:** imdb_votes and tmdb_popularity are positively correlated.  
- **Positive growth opportunity:** Highly popular titles tend to receive more votes; outliers with extremely high votes/popularity stand out clearly 🌟.  
> ✅ Positive impact: Identifies top-performing content for marketing and promotion focus.  

**Genre/Type Differences 🎨**  

---  

- **Observation:** Using type as hue (Movies vs Shows) reveals patterns: Movies generally have higher tmdb_popularity than Shows; certain clusters are dominated by one type.  
- **Positive growth opportunity:** Target content type-specific strategies to enhance engagement.  
> ✅ Positive impact: Guides type-specific production and promotion planning.  

**Score Relationships ⭐**  

---  

- **Observation:** imdb_score and tmdb_score are moderately correlated; some exceptions appear as outliers.  
- **Positive growth opportunity:** Titles rated highly on IMDb often also rate highly on TMDB.  
> ✅ Positive impact: Supports content quality assessment and cross-platform consistency.  

**Runtime Patterns ⏱️**  

---  

- **Observation:** Movies have a wider range of runtimes; Shows cluster at shorter durations.  
- **Positive growth opportunity:** Extreme runtime values can indicate anomalies or special content.  
> ✅ Positive impact: Aids in content planning and audience targeting.  
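
The runtime comparison can also be summarized numerically. This is a minimal sketch, assuming columns named `type` and `runtime` (in minutes); `sample_df` and `runtime_spread` are hypothetical and the values are invented for illustration.

```python
import pandas as pd

# Synthetic stand-in for the merged dataframe; runtimes are illustrative only.
sample_df = pd.DataFrame({
    "type": ["MOVIE"] * 4 + ["SHOW"] * 4,
    "runtime": [60, 95, 120, 180, 22, 30, 45, 60],
})

def runtime_spread(df):
    """Summarize runtime per content type: min, median, max, and range."""
    summary = df.groupby("type")["runtime"].agg(["min", "median", "max"])
    # The range (max - min) is a simple measure of how widely runtimes vary.
    summary["range"] = summary["max"] - summary["min"]
    return summary

summary = runtime_spread(sample_df)
print(summary)
```

A larger `range` for one type than the other would support the observation that its runtimes are more spread out.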

**Outliers & Clusters 🔍**  

---  

- **Observation:** Outliers, such as movies with extremely high votes or unusually high/low scores, are easily spotted; clusters reveal common patterns in popular, well-rated content 🎉.  
- **Positive growth opportunity:** Leverage insights from clusters for content recommendations and marketing.  
> ✅ Positive impact: Enables identification of high-value content and trends.  
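
Outliers spotted by eye can be listed explicitly. The sketch below uses the common 1.5 × IQR rule of thumb, which is an assumption for illustration and not necessarily the rule applied during data wrangling in this notebook. `sample_df` and `flag_iqr_outliers` are hypothetical names.

```python
import pandas as pd

# Synthetic stand-in for the merged dataframe; vote counts are illustrative only.
sample_df = pd.DataFrame({
    "title": list("ABCDEFGH"),
    "imdb_votes": [100, 200, 300, 400, 500, 600, 700, 50000],
})

def flag_iqr_outliers(df, column, k=1.5):
    """Return rows whose value lies more than k * IQR outside the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (df[column] < lower) | (df[column] > upper)
    return df[mask]

outliers = flag_iqr_outliers(sample_df, "imdb_votes")
print(outliers)
```

For heavily skewed columns such as vote counts, applying the same rule to log-transformed values may give a more useful cutoff.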

**Overall Insight**  

---  

> The scatter matrix allows visual exploration of relationships, detection of outliers, and observation of type-based patterns across multiple numeric features at once. It’s effective for spotting trends, anomalies, and correlations in the dataset 📽️✨.


## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

**From the analysis of the Amazon Prime TV Shows and Movies dataset, we observed that**:


1. Movies dominate the platform compared to TV shows.

2. Drama, Comedy, and Action emerged as the most popular genres.

3. The United States and India lead in content production, with other countries contributing smaller shares.

4. IMDb and TMDB scores correlate with votes and popularity, suggesting that higher-rated titles tend to see stronger engagement.

5. Certain age certifications (like TV-MA/16+) dominate, indicating a preference for mature audience content.

6. Bubble and Treemap plots highlighted that a small portion of titles accounts for a disproportionate share of total popularity.
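
The last observation, that a small portion of titles carries much of the popularity, can be checked with a simple concentration measure. This is a minimal sketch with made-up numbers; the series name `tmdb_popularity` follows the dataset, while `top_share` is a hypothetical helper.

```python
import math
import pandas as pd

# Synthetic popularity values for ten titles; illustrative only.
popularity = pd.Series([100, 50, 10, 8, 7, 6, 5, 5, 5, 4], name="tmdb_popularity")

def top_share(values, fraction):
    """Share of the total contributed by the top `fraction` of entries."""
    values = values.dropna()
    # Always include at least one title, and round the group size up.
    n_top = max(1, math.ceil(len(values) * fraction))
    return values.nlargest(n_top).sum() / values.sum()

share_10 = top_share(popularity, 0.10)
print(f"Top 10% of titles hold {share_10:.0%} of total popularity")
```

Running the same function on the real `tmdb_popularity` column would show how concentrated popularity actually is in the catalog.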



👉 **Based on these findings, the solution to achieve the business objective is**:

1. Prioritize High-Demand Genres – Focus future acquisitions on Drama, Comedy, and Action as they attract the largest audience share.

2. Invest in Regional Diversity – Strengthen content in emerging markets (beyond US & India) to capture untapped audiences.

3. Quality Over Quantity – Since titles with higher IMDb/TMDB ratings gain more traction, prioritize acquiring or producing high-quality titles instead of merely expanding volume.

4. Targeted Recommendations – Use insights from age certifications and viewer preferences to improve recommendation systems for different age groups.

5. Talent Strategy – Highlight top-performing directors/actors in promotions and collaborations to boost engagement.

# **Conclusion**

This project provided valuable insights into Amazon Prime Video’s content library by analyzing both titles and credits data. Through systematic data cleaning, integration, and exploratory data analysis, we identified key patterns in content distribution, genres, ratings, and audience preferences. The findings revealed that movies dominate the platform, with Drama, Comedy, and Action being the most prevalent genres. The United States and India emerged as the leading content producers, and higher IMDb and TMDB scores were associated with greater popularity and engagement. Furthermore, the analysis showed that mature audience certifications are the most common, which points to a catalog weighted toward adult-oriented content.

Overall, the analysis demonstrates that Amazon Prime’s growth strategy should focus on strengthening high-demand genres, expanding regional diversity, and prioritizing quality over quantity in content production. These insights not only inform content acquisition and recommendation strategies but also provide actionable guidance to enhance user engagement and drive subscription growth in the competitive streaming market.

### ***Hurrah! You have successfully completed your EDA Capstone Project!***