# **Project Name**    - Amazon Prime TV Shows and Movies  




##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -**  Aditya Anilkumar


# **Project Summary -**

This project performs an exploratory data analysis on Amazon Prime's TV shows and movies dataset. It aims to uncover key trends, patterns, and insights about content type, genre distribution, production countries, language diversity, and release timelines. The ultimate goal is to generate useful insights that can support decisions on content acquisition, regional targeting, and audience engagement strategies.

# **GitHub Link -**


> https://github.com/aditya18101999



# **Problem Statement**


The objective is to understand the types and characteristics of the content available on Amazon Prime Video. We want to analyze what types of content are offered, which genres are most common, how release years are spread out, and which countries or languages contribute most to the content library.


#### **Define Your Business Objective?**



The business objective is to give Amazon Prime useful insights into its content library. These insights can influence decisions on what type of content to promote, acquire, or produce, based on viewer preferences, historical trends, and genre popularity.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. For each chart, make sure the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
credits_df = pd.read_csv("/content/credits.csv")
titles_df = pd.read_csv("/content/titles.csv")

### Dataset First View

In [None]:
# Dataset First Look
print("Titles Dataset:")
display(titles_df.head())

print("Credits Dataset:")
display(credits_df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Titles shape:", titles_df.shape)
print("Credits shape:", credits_df.shape)

### Dataset Information

In [None]:
# Dataset Info
# Titles dataset structure
print("Titles DataFrame Info:")
titles_df.info()

# Credits dataset structure
print("\nCredits DataFrame Info:")
credits_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Titles duplicates:", titles_df.duplicated().sum())
print("Credits duplicates:", credits_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Titles missing values:\n", titles_df.isnull().sum())
print("\nCredits missing values:\n", credits_df.isnull().sum())

In [None]:
# Visualizing the missing values
missing_percent = titles_df.isnull().mean() * 100
missing_percent = missing_percent[missing_percent > 0].sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=missing_percent.values, y=missing_percent.index, palette="magma")
plt.title("Missing Data Percentage by Column")
plt.xlabel("Percentage (%)")
plt.ylabel("Column Name")
plt.show()

### What did you know about your dataset?

After exploring the `titles` and `credits` datasets, here are the key observations:

#### Titles Dataset:
- Contains **9,871 records** with metadata about Amazon Prime shows and movies.
- Each title has attributes such as `type`, `release_year`, `genres`, `runtime`, `age_certification`, and various popularity scores (`imdb`, `tmdb`).
- **Missing data is present** in several fields:
  - `age_certification` is missing in about 66% of rows.
  - `seasons` is missing in most rows, which makes sense since it only applies to TV shows.
  - Ratings and popularity metrics (`imdb_score`, `tmdb_score`) also have moderate missing data.
- There are **3 duplicate entries** that need to be cleaned.
- Column types are mostly appropriate, such as numerical for scores and runtime, and object for text fields.

#### Credits Dataset:
- Contains **124,235 records** detailing cast and crew information for various titles.
- Each row links a `person_id` and `name` to a title `id`, along with their `role` (e.g., ACTOR, DIRECTOR).
- The `character` column is missing in about 13% of entries, likely because it doesn't apply to non-actors.
- **56 duplicate rows** exist and should be removed.
- Overall, the dataset is large and well-structured for analyzing contributors, like the most frequent directors or actors.

These insights helped identify which columns are clean and useful for exploratory data analysis, and which require cleaning or combining.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Titles columns:", titles_df.columns)
print("Credits columns:", credits_df.columns)

In [None]:
# Dataset Describe
print("Titles describe:")
display(titles_df.describe())

print("\nCredits describe:")
display(credits_df.describe())

### Variables Description

**Summary of Variable Descriptions**  

**Titles Dataset (titles_df)**  
The titles_df dataset mainly contains metadata about individual movies and TV shows.

*Identifiers (id, imdb_id)*: Unique codes to identify each title. These serve as primary keys or links to external databases like IMDb.

*Basic Information (title, type, description, release_year)*: Important descriptive attributes such as the title, whether it's a movie or show, a summary, and its original release year.

*Content Characteristics (age_certification, runtime, genres, production_countries, seasons)*: Information about the content's classification (age rating), length, themes, countries of origin, and the number of seasons for TV series.

*Popularity & Ratings (imdb_score, imdb_votes, tmdb_popularity, tmdb_score)*: Metrics that show audience reception and popularity from IMDb and TMDb, including average scores and vote counts.

**Credits Dataset (credits_df)**  
The credits_df dataset details the individuals involved in each title's production.

*Person & Title Identifiers (person_id, id)*: Unique codes for each individual, along with a foreign key that links back to the specific title they worked on.

*Personal & Role Details (name, character, role)*: The person's name, the specific character they portrayed (if an actor), and their overall role in the production (for example, 'ACTOR' or 'DIRECTOR').

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Titles unique values:")
display(titles_df.nunique())

print("\nCredits unique values:")
display(credits_df.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Remove duplicates
titles_df.drop_duplicates(inplace=True)
credits_df.drop_duplicates(inplace=True)

# Filter credits - ACTOR/DIRECTOR role for now
actors_df = credits_df[credits_df['role'] == 'ACTOR']
directors_df = credits_df[credits_df['role'] == 'DIRECTOR']

# Aggregate actors per title (as comma-separated string)
actors_grouped = actors_df.groupby('id')['name'].apply(lambda x: ', '.join(x)).reset_index()
actors_grouped.rename(columns={'name': 'actors'}, inplace=True)

# Merge titles with directors and actors
df = titles_df.merge(directors_df[['id', 'name']], on='id', how='left')
df.rename(columns={'name': 'director'}, inplace=True)

df = df.merge(actors_grouped, on='id', how='left')

# Clean/format important columns/handling null values
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')
df['age_certification'] = df['age_certification'].fillna('Unknown')
df['genres'] = df['genres'].fillna('Unknown')
df['production_countries'] = df['production_countries'].fillna('Unknown')
df['seasons'] = df['seasons'].fillna(0).astype(int)  # treat missing as 0 for movies
df['imdb_score'] = df['imdb_score'].fillna(df['imdb_score'].median())
df['tmdb_score'] = df['tmdb_score'].fillna(df['tmdb_score'].median())

# reset index
df.reset_index(drop=True, inplace=True)

# Final check
df.info()


### What all manipulations have you done and insights you found?

We cleaned and enriched the dataset with both director and actor information:

- Removed duplicate rows from both datasets.
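- Handled missing values: filled `age_certification`, `genres`, and `production_countries` with `'Unknown'`, `seasons` with 0, and `imdb_score` and `tmdb_score` with their medians.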
- Filtered `credits` to extract both `DIRECTOR` and `ACTOR` roles.
- Merged directors into the `titles` dataset using `id`.
- Aggregated actor names per title and merged as a separate column.
- Final dataset now includes `title`, `type`, `genres`, `director`, and a list of `actors`, making it ready for deep analysis of cast/crew and their influence on content performance.
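A title can have more than one director, so the left merge on `id` is one-to-many and may repeat a title across several rows. That would inflate any later chart that counts rows as titles. The following is a minimal sketch of how this could be checked, using small made-up frames (the `toy_` names and all values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical toy data: title "t2" has two directors
toy_titles = pd.DataFrame({"id": ["t1", "t2"], "title": ["A", "B"]})
toy_directors = pd.DataFrame({
    "id": ["t1", "t2", "t2"],
    "name": ["Dir X", "Dir Y", "Dir Z"],
})

# Same kind of left merge as in the wrangling step
merged = toy_titles.merge(toy_directors, on="id", how="left")

# Number of rows added by the one-to-many relationship
extra_rows = len(merged) - len(toy_titles)
print("Rows before:", len(toy_titles), "after:", len(merged), "extra:", extra_rows)

# One row per title, suitable for charts that count titles
per_title = merged.drop_duplicates(subset="id")
print("Unique titles:", len(per_title))
```

On the real data, the equivalent check would compare `len(df)` with `len(titles_df)` after the merge. If the difference is nonzero, title-level charts could be drawn from `df.drop_duplicates(subset='id')`, while director-level charts keep the full frame.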


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***


#### Chart - 1  Distribution of Movies vs TV Shows

In [None]:
# Chart - 1 visualization code
sns.countplot(data=df, x='type', palette='Set2')
plt.title("Distribution of Content Types on Amazon Prime")
plt.xlabel("Type of Content")
plt.ylabel("Number of Titles")
plt.show()


##### 1. Why did you pick the specific chart?

Understanding the overall mix between movies and TV shows is a foundational part of content strategy analysis. This helps us understand the platform's focus: one-off viewing experiences (movies) vs long-form engagement (TV shows).

##### 2. What is/are the insight(s) found from the chart?

The platform has significantly more movies than shows. This suggests Amazon Prime leans heavily toward shorter, one-time view content. TV shows, which drive longer user engagement, are fewer in comparison.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps stakeholders evaluate whether the content portfolio supports long-term user retention. TV shows tend to keep users subscribed longer due to episodic formats. A heavier movie library may lead to shorter viewing bursts but could also reduce content fatigue if more diverse. Depending on engagement data, Amazon may need to rebalance this mix.
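To put a number on "significantly more movies", the share of each content type can be computed alongside the count plot. This is a small sketch on a made-up column; the values `MOVIE` and `SHOW` are assumed labels and the proportions are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for df['type']
toy_type = pd.Series(["MOVIE", "MOVIE", "MOVIE", "SHOW"], name="type")

# Share of each content type, in percent
type_share = toy_type.value_counts(normalize=True).mul(100).round(1)
print(type_share)
```

Applied to the real frame, `df['type'].value_counts(normalize=True)` would give the actual split to quote next to the chart.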

#### Chart - 2  Top Genres

In [None]:
# Chart - 2 visualization code
# Fix genres by removing unwanted characters and exploding
df_genre_clean = df.copy()

# Remove stray brackets and quotes (clean up stringified lists)
df_genre_clean['genres'] = df_genre_clean['genres'].str.replace(r"[\[\]']", "", regex=True)

# Split by comma
df_genre_clean['genres'] = df_genre_clean['genres'].str.split(', ')

# Explode
df_genre_clean = df_genre_clean.explode('genres')

# Count top genres
top_genres = df_genre_clean['genres'].value_counts().head(10)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x=top_genres.values, y=top_genres.index, palette='Set3')
plt.title("Top 10 Genres on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Genre")
plt.show()


##### 1. Why did you pick the specific chart?

Genres are key drivers of user preferences and engagement. By understanding which genres dominate, we can align content supply with demand — or identify underserved genres with high potential.

##### 2. What is/are the insight(s) found from the chart?

The most common genres are Drama, Comedy, and Action — traditional high-demand categories. Niche genres like Documentary or Sci-Fi are present but appear less frequently.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This indicates Amazon is focused on broad-appeal genres. However, underrepresented genres could be strategic opportunities — for example, increasing investment in documentaries if they have high retention. Also, knowing dominant genres helps tailor personalized recommendations or content clustering.

#### Chart - 3: Titles per Year

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='release_year', bins=30, kde=False, color='skyblue')
plt.title("Distribution of Titles by Release Year")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.show()


##### 1. Why did you pick the specific chart?

To assess whether Amazon's catalog favors recent vs classic content. This helps evaluate how up-to-date and culturally relevant the catalog is.

##### 2. What is/are the insight(s) found from the chart?

Most content is released post-2000, with a noticeable spike after 2015. This suggests that Amazon has been rapidly expanding its catalog in the last decade, likely due to streaming growth.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The focus on recent content aligns with user expectations — fresh, contemporary media. However, older classic titles can appeal to niche, loyal audiences. This analysis supports both licensing decisions and archive curation strategies.
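The "post-2000" and "after 2015" statements can be backed with a per-decade count and a simple share. The sketch below uses invented release years purely for illustration:

```python
import pandas as pd

# Hypothetical stand-in for df['release_year']
toy_years = pd.Series([1955, 1998, 2004, 2016, 2019, 2021], name="release_year")

# Floor each year to its decade, then count titles per decade
decade = (toy_years // 10) * 10
titles_per_decade = decade.value_counts().sort_index()
print(titles_per_decade)

# Fraction of titles released in 2015 or later
share_2015_plus = (toy_years >= 2015).mean()
print(f"Share released 2015 or later: {share_2015_plus:.0%}")
```

The cutoff year 2015 is taken from the text above; any other threshold can be substituted.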



#### Chart - 4: Top Directors

In [None]:
# Chart - 4 visualization code
# Filter directors with at least 3 titles
df_directors = df.dropna(subset=['director', 'imdb_score'])
top_directors = df_directors.groupby('director').agg(
    title_count=('title', 'count'),
    avg_rating=('imdb_score', 'mean')
)
top_directors = top_directors[top_directors['title_count'] >= 3]
top_directors = top_directors.sort_values('avg_rating', ascending=False).head(10)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x=top_directors['avg_rating'], y=top_directors.index, palette='viridis')
plt.title("Top 10 Directors by Average IMDb Score (Min 3 Titles)")
plt.xlabel("Average IMDb Score")
plt.ylabel("Director")
plt.show()

# Optional: print top directors
print(top_directors)


##### 1. Why did you pick the specific chart?

To highlight top-performing directors based on audience ratings, while ensuring statistical reliability with a minimum title count.

##### 2. What is/are the insight(s) found from the chart?

These directors consistently produce highly-rated content. They may not be the most frequent, but their impact is strong.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Perfect for quality partnerships. Amazon can promote these directors or fund their future projects to maintain or raise content standards.
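A hard cutoff of three titles is one way to guard against small samples. An alternative, shown here only as a sketch and not as the method used in this notebook, is a shrunk (weighted) average that pulls directors with few titles toward the global mean. The prior weight `m` and the toy data are assumptions chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data: B has a single very high-rated title
toy = pd.DataFrame({
    "director": ["A"] * 5 + ["B"] + ["C"] * 4,
    "imdb_score": [8.5] * 5 + [9.0] + [6.0] * 4,
})

# Per-director title count and plain average
stats = toy.groupby("director")["imdb_score"].agg(n="count", avg="mean")

global_mean = toy["imdb_score"].mean()
m = 3  # assumed prior weight: behaves like m extra titles at the global mean

# Shrunk average: (n * avg + m * global_mean) / (n + m)
stats["shrunk"] = (stats["n"] * stats["avg"] + m * global_mean) / (stats["n"] + m)

ranked = stats.sort_values("shrunk", ascending=False)
print(ranked)
```

In this toy example the plain average ranks B first on the strength of one title, while the shrunk average ranks A first because A's score is supported by more titles.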

#### Chart - 5: IMDb Score Distribution

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10,5))
sns.histplot(df['imdb_score'], bins=20, kde=True, color='green')
plt.title("IMDb Score Distribution")
plt.xlabel("IMDb Score")
plt.ylabel("Number of Titles")
plt.show()


##### 1. Why did you pick the specific chart?

IMDb score is a strong proxy for public reception and quality. Analyzing score distribution shows us how much of Amazon’s catalog is well-rated vs poorly rated.

##### 2. What is/are the insight(s) found from the chart?

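Most titles score between 5 and 7, so the catalog clusters around mid-range ratings. Only a small number of titles score below 4 or above 9. Because missing scores were filled with the median during wrangling, part of the central peak reflects imputed values rather than observed ratings.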

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Amazon could use this to define quality benchmarks — e.g., avoid acquiring titles scoring under 5 unless niche. It also helps content curators and machine learning teams filter for top picks in recommendation engines. The lack of high-rated titles may signal a need to focus on critical acclaim, not just quantity.
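The wrangling step fills missing `imdb_score` values with the median, which adds mass at the center of this histogram. The sketch below, on made-up scores, shows one way to measure how much of the column is imputed and how the spread changes as a result:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing values
toy_scores = pd.Series([4.0, 6.0, 7.0, 8.0, np.nan, np.nan], name="imdb_score")

# Remember which rows were missing before filling
was_missing = toy_scores.isna()
filled = toy_scores.fillna(toy_scores.median())

share_imputed = was_missing.mean()
std_observed = toy_scores.std()   # NaN values are skipped by default
std_filled = filled.std()

print(f"Imputed share: {share_imputed:.1%}")
print(f"Std (observed only): {std_observed:.2f}, std (after fill): {std_filled:.2f}")
```

If the imputed share is large on the real data, plotting only the observed scores (from the data before filling) may give a more faithful picture of the distribution.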



#### Chart - 6  Top 10 Most Frequent Actors with their avg rating on Amazon Prime

In [None]:
# Chart - 6 visualization code
# Explode actor names (already comma-separated from earlier merge)
df_actors = df.copy()
df_actors = df_actors.dropna(subset=['actors', 'imdb_score'])  # remove rows without actors or score
df_actors['actors'] = df_actors['actors'].str.split(', ')
df_actors = df_actors.explode('actors')

# Group by actor: count + average IMDb score
top_actors = df_actors.groupby('actors').agg(
    title_count=('title', 'count'),
    avg_rating=('imdb_score', 'mean')
).sort_values('title_count', ascending=False).head(10)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x=top_actors['title_count'], y=top_actors.index, palette='mako')
plt.title("Top 10 Most Frequent Actors on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Actor")
plt.show()

# Show avg IMDb score (optional print)
print(top_actors)



##### 1. Why did you pick the specific chart?

To identify actors who appear most frequently in Amazon Prime content and evaluate their average critical reception.

##### 2. What is/are the insight(s) found from the chart?

This shows which actors Amazon collaborates with the most. Some may have a high presence but lower ratings, while others strike a balance between the two.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps recognize influential actors and potential fan-base anchors. Those with high frequency and high scores could be ideal for future collaborations or exclusives.
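Joining actor names with `', '` and splitting on the same separator later can break any name that itself contains a comma. A possible alternative, sketched here on invented credits, is to keep the names as Python lists so that no string round trip is needed:

```python
import pandas as pd

# Hypothetical credits; one name contains ", "
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Sammy Davis, Jr.", "Jane Doe", "Jane Doe"],
})

# Join-then-split: the name with a comma is cut into two pieces
joined = toy_credits.groupby("id")["name"].apply(", ".join)
split_again = joined.str.split(", ").explode()

# Keeping lists avoids the problem
as_lists = toy_credits.groupby("id")["name"].agg(list)
exploded = as_lists.explode()

print("Join/split entries:", len(split_again))
print("List-based entries:", len(exploded))
```

Whether this matters for the real data depends on how many credited names include a comma, which can be checked with `credits_df['name'].str.contains(', ').sum()`.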



#### Chart - 7  Top 10 Countries by Number of Titles

In [None]:
# Chart - 7 visualization code
# Clean and explode production countries
df_country = df.copy()
df_country['production_countries'] = df_country['production_countries'].str.replace(r"[\[\]']", "", regex=True)  # strip brackets and quotes, as done for genres
df_country['production_countries'] = df_country['production_countries'].str.split(', ')
df_country = df_country.explode('production_countries')

top_countries = df_country['production_countries'].value_counts().head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='Blues_d')
plt.title("Top 10 Producing Countries on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Country Code")
plt.show()

##### 1. Why did you pick the specific chart?

To analyze the geographic diversity of Amazon’s content offerings and identify leading production markets.

##### 2. What is/are the insight(s) found from the chart?

The US dominates content production.

A few countries like IN (India) and GB (UK) also contribute significantly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps with localization strategies and market-specific marketing. If Amazon wants to grow in global markets, boosting regional content is key.
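The `genres` and `production_countries` columns appear to hold lists stored as strings. Instead of stripping characters with a regular expression, such values could be parsed with `ast.literal_eval` from the standard library. The sketch below assumes the stored format looks like `"['US', 'GB']"`; the helper name `parse_list` and the toy rows are hypothetical:

```python
import ast
import pandas as pd

# Hypothetical rows in the assumed stringified-list format
toy = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "production_countries": ["['US']", "['US', 'GB']", "[]"],
})

def parse_list(value):
    """Parse a stringified list; return [] for anything that is not a list."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []
    return parsed if isinstance(parsed, list) else []

toy["country_list"] = toy["production_countries"].apply(parse_list)

# Empty lists become NaN after explode, so drop them before counting
exploded = toy.explode("country_list").dropna(subset=["country_list"])
country_counts = exploded["country_list"].value_counts()
print(country_counts)
```

With this approach, titles with an empty list simply drop out of the count rather than appearing as a blank category.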

#### Chart - 8 IMDb Score by Genre (Top 10 Genres)

In [None]:
# Reuse exploded + cleaned genre data
genre_scores = df_genre_clean.copy()
genre_scores = genre_scores.dropna(subset=['imdb_score'])

top10 = genre_scores['genres'].value_counts().head(10).index
filtered = genre_scores[genre_scores['genres'].isin(top10)]

plt.figure(figsize=(12,6))
sns.boxplot(data=filtered, x='genres', y='imdb_score', palette='cubehelix')
plt.title("IMDb Score Distribution by Top 10 Genres")
plt.xlabel("Genre")
plt.ylabel("IMDb Score")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I selected this chart to move beyond just genre frequency and explore how genres perform in terms of audience ratings. While some genres are popular, they might not always be well-received. This visualization lets us compare the quality distribution across genres and identify which ones consistently perform well, poorly, or show high variability.

##### 2. What is/are the insight(s) found from the chart?

Documentary, History, and Crime genres tend to have higher median IMDb scores, indicating consistent audience appreciation.

Drama and Comedy, though frequent, show wider variance — suggesting they include both highly rated and poorly rated content.

Genres like Action and Adventure have relatively lower medians and tighter interquartile ranges, indicating a more predictable but mid-tier performance.

Outliers exist in nearly every genre, reflecting both critically acclaimed and underperforming titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Double down on critically acclaimed niches like Documentaries and History to strengthen content prestige.

For frequent genres like Drama and Comedy, apply stricter quality control or rating thresholds to improve average viewer satisfaction.

Use this insight to improve genre-based recommendations — promoting high-scoring sub-genres to drive engagement.
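The statements about medians and interquartile ranges can be read off a summary table as well as the box plot. A minimal sketch on invented scores follows; the genre labels and the helper `iqr` are illustrative assumptions:

```python
import pandas as pd

# Hypothetical exploded genre rows
toy = pd.DataFrame({
    "genres": ["drama"] * 4 + ["documentary"] * 4,
    "imdb_score": [3.0, 5.0, 7.0, 9.0, 7.0, 7.5, 8.0, 8.5],
})

def iqr(scores):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return scores.quantile(0.75) - scores.quantile(0.25)

# Title count, median, and spread per genre, highest median first
genre_summary = (
    toy.groupby("genres")["imdb_score"]
       .agg(n="count", median="median", iqr=iqr)
       .sort_values("median", ascending=False)
)
print(genre_summary)
```

On the real data the same aggregation could be run on `filtered`, which makes it easier to quote exact medians next to the chart.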

#### Chart - 9 Number of Titles by Age Certification

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8,5))
sns.countplot(data=df[df['age_certification'] != 'Unknown'], x='age_certification', palette='Set2')
plt.title("Content Count by Age Certification")
plt.xlabel("Age Rating")
plt.ylabel("Number of Titles")
plt.show()

##### 1. Why did you pick the specific chart?

This chart helps answer a key business question: “Is Amazon Prime targeting content appropriately across different age demographics?” By examining age certifications, we can determine whether Amazon leans more toward family-friendly content, mature themes, or has a balanced offering.

##### 2. What is/are the insight(s) found from the chart?

The majority of content falls under mature categories, such as TV-MA and R, indicating a strong focus on adult audiences.

There is a moderate presence of PG-13 and TV-14 content, appealing to teens and general audiences.

Family-friendly titles (G, PG, TV-G) are relatively limited — possibly underrepresented on the platform.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If Amazon wants to expand into family or children’s markets, it may need to invest more in G or PG content.

The current skew toward mature ratings supports a premium adult-oriented brand, but may limit accessibility in households with young children.

Marketing strategies can be tailored accordingly — highlight mature drama/thriller content for adult users, or fill genre gaps in family/kids categories.

#### Chart - 10 Average TMDb Popularity by Content Type

In [None]:
# Chart - 10 visualization code
popularity = df.groupby('type')['tmdb_popularity'].mean()

plt.figure(figsize=(6,4))
sns.barplot(x=popularity.index, y=popularity.values, palette='pastel')
plt.title("Average TMDb Popularity by Content Type")
plt.xlabel("Content Type")
plt.ylabel("Avg TMDb Popularity")
plt.show()

##### 1. Why did you pick the specific chart?

This chart was chosen to explore how content format (Movie vs TV Show) influences audience engagement or buzz, as measured by TMDb popularity. It helps answer the strategic question:

“Which type of content drives more attention on Amazon Prime?”

Understanding this relationship can guide decisions on content development, promotion, and platform focus.

##### 2. What is/are the insight(s) found from the chart?

TV Shows have a higher average TMDb popularity score than Movies.

This suggests that, while movies dominate the catalog in quantity, TV shows may be more successful in capturing user interest.

It could be due to longer viewer retention, ongoing discussion, or fan engagement with serialized content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This finding implies that Amazon Prime should consider:

Investing more in TV series, particularly original shows, to boost ongoing platform engagement.

Highlighting popular shows more aggressively in marketing and homepage recommendations.

Using TMDb popularity as a proxy for early success prediction — especially for newer or lesser-known titles.

Since popularity often correlates with user retention and subscription value, focusing on high-performing content types like TV shows can directly support growth and loyalty objectives.
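Popularity metrics are often heavily right-skewed, so a handful of very popular titles can dominate a mean. Reporting the median next to the mean is a quick robustness check. The sketch below uses invented values in which one show is an extreme outlier:

```python
import pandas as pd

# Hypothetical popularity values; one SHOW is an extreme outlier
toy = pd.DataFrame({
    "type": ["MOVIE"] * 4 + ["SHOW"] * 4,
    "tmdb_popularity": [1.0, 2.0, 3.0, 4.0, 1.0, 1.0, 2.0, 96.0],
})

# Mean, median, and count per content type
pop_summary = toy.groupby("type")["tmdb_popularity"].agg(["mean", "median", "count"])
print(pop_summary)
```

In this toy example shows have the higher mean but the lower median, which illustrates why the conclusion above is worth confirming with both statistics on the real data.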


#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Select relevant numerical columns
numeric_cols = ['imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity', 'runtime', 'release_year']

# Drop rows with missing values for clean correlation
df_corr = df[numeric_cols].dropna()

# Compute correlation matrix
corr_matrix = df_corr.corr()

# Plot the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5, linecolor='gray')
plt.title("Correlation Heatmap of Numeric Features", fontsize=14)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The correlation heatmap provides a quick overview of linear relationships between numerical variables like IMDb and TMDb scores, popularity, votes, runtime, and release year.

It is especially useful during exploratory data analysis to:

* Detect multicollinearity

* Reveal redundant features

* Understand how different content attributes relate to performance

This chart helps decide what features might be worth focusing on in recommendations, marketing, or predictive modeling.

##### 2. What is/are the insight(s) found from the chart?

* imdb_score and tmdb_score show a strong positive correlation (~0.65–0.75), confirming consistency between different rating platforms.

* imdb_votes and tmdb_popularity also have a moderate positive correlation, suggesting that popular content tends to attract more user reviews.

* imdb_score and imdb_votes are positively correlated — indicating that better-rated content often garners more user engagement.

* runtime has almost no correlation with other features — meaning longer content doesn't necessarily get higher ratings or more attention.

* release_year shows a weak negative correlation with tmdb_score and imdb_votes, possibly because older content has had more time to gather votes or newer content is being rated more harshly.




##### 3. Business Impact:

* Helps Amazon focus on key predictors of content success: score, votes, and popularity.

* Guides feature selection for building recommendation systems or content success prediction models — for example, runtime and release_year may have low predictive value.

* Confirms that cross-platform ratings (IMDb vs TMDb) are aligned, which strengthens confidence in using either metric.

* Suggests that increasing visibility of already well-rated titles could increase engagement, since popularity correlates with high votes and scores.
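`DataFrame.corr()` defaults to Pearson correlation, which measures linear association and can understate relationships involving skewed columns such as vote counts. Spearman rank correlation, available through the `method` parameter, is a possible complement. The values below are invented to show the difference:

```python
import pandas as pd

# Hypothetical data: votes grow monotonically but far from linearly with score
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "imdb_votes": [10, 100, 1_000, 10_000, 1_000_000],
})

pearson = toy.corr(method="pearson").loc["imdb_score", "imdb_votes"]
spearman = toy.corr(method="spearman").loc["imdb_score", "imdb_votes"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Running `df_corr.corr(method='spearman')` on the real data and comparing it with the heatmap above would show whether any of the weak linear correlations hide a stronger monotonic relationship.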

#### Chart - 12 - Pair Plot

In [None]:
# Selecting relevant numeric features
numeric_features = ['imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity', 'runtime']

# Optional: remove extreme outliers to improve clarity
df_pair = df[numeric_features].copy()
df_pair = df_pair[df_pair['imdb_votes'] < df_pair['imdb_votes'].quantile(0.99)]  # remove top 1% of vote outliers
df_pair = df_pair[df_pair['tmdb_popularity'] < df_pair['tmdb_popularity'].quantile(0.99)]

# Plot
sns.pairplot(df_pair, diag_kind='kde', corner=True, plot_kws={'alpha': 0.5})
plt.suptitle("Pair Plot of Numeric Features (IMDb, TMDb, Runtime)", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

The pair plot is a powerful multivariate visualization that allows us to:

Observe relationships between numeric features like scores, popularity, votes, and runtime.

Detect correlations, clusters, or nonlinear trends that might not be visible in a single chart.

Identify outliers and data distributions in one place.

It’s especially useful in exploratory data analysis (EDA) to generate hypotheses for further investigation.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot:

* A strong positive relationship exists between imdb_score and tmdb_score, which confirms consistency between the two platforms’ ratings.

* imdb_votes is moderately positively correlated with both imdb_score and tmdb_popularity, indicating that higher-rated content tends to get more attention.

* runtime shows little correlation with either scores or popularity — suggesting content length doesn't significantly impact how well a title is rated or how popular it is.

* Some clusters are visible in tmdb_popularity and imdb_votes, likely reflecting a small number of blockbuster titles getting disproportionately high views and ratings.

##### 3. Business Impact

These relationships help Amazon Prime identify which features are most predictive of user engagement and content success.

A high imdb_score or tmdb_score along with high tmdb_popularity may define a “quality hit”, ideal for spotlighting or franchise investment.

Weak correlation between runtime and success implies creative freedom in content length without risking user reception — useful for production strategy.

The plot can also inform machine learning features for models predicting content performance or viewer preferences.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?

The goal of this project was to help Amazon Prime understand patterns in its existing catalog to support content planning, recommendation strategies, and market targeting. Based on insights from our visual analysis, we can address the business objectives as follows:

1. **Catalog Composition**  
* Chart 1 (Type Distribution) shows that Amazon Prime primarily features movies, with significantly fewer TV shows.  
* This indicates that Amazon could consider investing in more serialized content to compete with platforms that focus on binge-worthy series.

2. **Genre Trends**  
* Chart 2 (Top 10 Genres) confirms that Drama, Comedy, and Action are the most popular genres.  
* Chart 8 (IMDb Score by Genre) reveals that while Drama is plentiful, genres like Documentary, History, and Crime have higher median IMDb scores. This suggests they may perform better critically, despite being less common.  
* Recommendation: Focus on underrepresented but high-performing genres to improve the perception of content quality.

3. **Content Freshness**  
* Chart 3 (Titles per Year) shows that the catalog is weighted toward recent releases, especially since 2010. However, there is a slight drop in the most recent years, possibly due to licensing delays or the impact of the pandemic.  
* Continued investment in new, original titles can help keep the platform competitive.

4. **Contributor Impact**  
* Chart 4 (Top Directors by IMDb Score) shows that a few directors consistently produce highly rated content. These individuals could be valuable partners for future projects.  
* Chart 6 (Top Actors by Frequency and Score) highlights actors who appear often, which is useful for promotions centered around their star power.  
* Combining this data with score insights helps target creators who engage audiences.

5. **Production Diversity**  
* Chart 7 (Top Countries) confirms that Amazon’s content is mainly US-centric, with secondary contributions from India and the UK.  
* To expand in emerging markets, Amazon should consider investing more in local and regional productions and promoting non-English titles.

6. **Content Quality & Engagement Drivers**  
* Chart 5 (Distribution of IMDb Scores) and Chart 10 (TMDb Popularity by Type) show that most content has mid-range scores, with TV shows having a higher average TMDb popularity per title.  
* Chart 9 (Titles by Age Certification) reveals that Amazon mostly targets a mature audience (TV-MA, R), while family-oriented content is limited.  
* Charts 11 & 12 (Correlation Heatmap + Pair Plot) confirm that popularity and votes correlate with scores, but runtime has little effect, suggesting that shorter content isn't penalized.

**Summary:**  
* Increase investment in high-scoring, niche genres like History and Crime.  
* Expand TV show offerings, especially for global and younger audiences.  
* Use creator performance data to inform casting, directing, and promotion decisions.  
* Improve regional diversity in content to enhance international growth.  
* Leverage ratings, votes, and popularity correlations to refine recommendation systems.

# **Conclusion**

Through a mix of univariate, bivariate, and multivariate visual analysis, we’ve found valuable insights from Amazon Prime’s content catalog. The platform excels in drama and action films but may be missing high-quality niche genres and younger audiences.

Strong connections between IMDb and TMDb ratings and popularity metrics suggest Amazon can confidently use this data to improve content recommendations and promotions. Contributor analytics also highlight strategic opportunities for partnerships with successful actors and directors.

In conclusion, the data supports practical steps for Amazon Prime to:

- Refine its content mix based on genre performance,
- Diversify production across countries and certifications,
- Optimize recommendations and acquisition strategies using score-popularity patterns.

These insights fit well with the platform’s goals of increasing engagement, improving viewer satisfaction, and expanding market share through thoughtful, data-driven content strategy.
