In [None]:
##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual/Team

Bhushan Dongardive


In [None]:
The task of this project is to perform an exploratory data analysis (EDA) of the Amazon Prime TV shows and movies dataset. The dataset consists of two primary files: titles.csv, which presents extensive information about the content on Amazon Prime (e.g., title, type, genres, release year, runtime, and IMDb score), and credits.csv, which presents the cast and crew of each title.

The primary goal of the project is to learn about content trends on the Amazon Prime platform. By examining the content type (TV series vs. films), the spread across genres, changes over time, and audience ratings (IMDb ratings), we can observe trends that can help content creators, producers, or platform curators make data-driven choices. We also examine whether machine learning methods can predict IMDb ratings from existing features such as genre, runtime, and release year.

The process starts with data loading and initial data cleaning. Missing (null) values and data type inconsistencies are detected and handled. For example, the release_year column, which is first read as an object, is coerced to a numeric type for temporal analysis. Rows with a missing title or other critical fields are removed to ensure data quality.

Next, the EDA section examines several facets of the content. We look at the proportion of TV series to films, the most frequent genres, and the trends in release years. One prominent observation is the growing volume of content created and uploaded to the platform in recent years, which suggests Amazon's increased spending on original content and acquisitions. Bar plots and line plots convey these trends.

We also examine genre popularity and see that genres such as comedy, drama, and thriller are the most commonly listed on the site. We analyze the distribution of IMDb scores to gauge the quality of content and user opinion. We identify the top-performing genres and check whether content type or runtime relates to the IMDb score.

We then move to a machine learning approach in the latter half of the project. We try to train a regression model to forecast the IMDb ratings of titles based on available features. Categorical features like genres are converted using one-hot encoding, and numerical features like runtime are utilized as is. We employ a basic linear regression model for the first experiment and measure the performance in terms of R² score on the test set. This provides us with a baseline idea of how accurately such models could predict content ratings, although additional tuning and feature engineering would be necessary to refine accuracy.

All graphs produced throughout the EDA are stored as PNG files and embedded into a PowerPoint presentation. The presentation condenses the main findings, visual trends, and model insights so that the analysis is easy to share with stakeholders or colleagues.

In summary, this project illustrates how structured content data from streaming services such as Amazon Prime can be used to provide actionable insights and form the basis for predictive models. The analysis offers useful insights into genre distribution, content strategy over time, and audience reception. Future enhancements could include sentiment analysis of reviews, the influence of cast popularity, or time-series forecasts of content trends.
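The regression baseline described in the summary can be sketched as follows. This is a minimal illustration on a small hypothetical table (`toy_df` and its values are made up for the example), not the model trained in this notebook. It assumes scikit-learn is installed and shows the same steps: one-hot encoding of a categorical feature, numeric features used as is, a linear regression fit, and R² on a held-out test set.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the cleaned titles table.
toy_df = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE"],
    "runtime": [95, 45, 120, 88, 30, 110, 50, 101],
    "release_year": [2001, 2015, 1998, 2019, 2020, 2010, 2018, 2005],
    "imdb_score": [6.1, 7.4, 5.8, 6.5, 7.9, 6.0, 7.2, 5.5],
})

# One-hot encode the categorical column; numeric columns are used as is.
X = pd.get_dummies(
    toy_df[["type", "runtime", "release_year"]], columns=["type"], dtype=float
)
y = toy_df["imdb_score"]

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a plain linear regression and score it on the held-out rows.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Test R² on toy data: {r2:.3f}")
```

With only eight rows the printed R² carries no meaning; the sketch only shows the shape of the pipeline.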



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

In [None]:
The balance between content quantity vs. quality

The effectiveness of content targeting across different genres and age demographics

How user-rated performance (IMDb scores) correlates with content characteristics such as type, genre, runtime, and release year

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [3]:
import pandas as pd
from IPython.display import display


### Dataset Loading

In [None]:
import pandas as pd

# Load the CSV files
titles_df = pd.read_csv('titles.csv')
credits_df = pd.read_csv('credits.csv')

print("✅ Files successfully loaded.")

# Optional: Display first few rows to confirm
print("🎬 Titles Dataset Preview:")
display(titles_df.head())

print("👥 Credits Dataset Preview:")
display(credits_df.head())


### Dataset First View

In [None]:
# View first few rows
print("🎬 Titles Dataset - First 5 Rows:")
display(titles_df.head())

print("\n👥 Credits Dataset - First 5 Rows:")
display(credits_df.head())

# View shape of each dataset
print("\n📏 Dataset Shapes:")
print(f"Titles dataset shape: {titles_df.shape}")
print(f"Credits dataset shape: {credits_df.shape}")

# View column names
print("\n🧾 Column Names:")
print("Titles columns:", titles_df.columns.tolist())
print("Credits columns:", credits_df.columns.tolist())

# Quick data types and non-null info
print("\nℹ️ Titles Dataset Info:")
titles_df.info()

print("\nℹ️ Credits Dataset Info:")
credits_df.info()


### Dataset Rows & Columns count

In [None]:
# Titles dataset shape
titles_rows, titles_cols = titles_df.shape
print(f"🎬 Titles Dataset: {titles_rows} rows and {titles_cols} columns")

# Credits dataset shape
credits_rows, credits_cols = credits_df.shape
print(f"👥 Credits Dataset: {credits_rows} rows and {credits_cols} columns")


### Dataset Information

In [None]:
# Titles Dataset Information
print("🎬 Titles Dataset Info:")
titles_df.info()

print("\n👥 Credits Dataset Info:")
credits_df.info()


#### Duplicate Values

In [None]:
# Check duplicate rows in titles dataset
titles_duplicates = titles_df.duplicated().sum()
print(f"🎬 Titles Dataset has {titles_duplicates} duplicate rows.")

# Check duplicate rows in credits dataset
credits_duplicates = credits_df.duplicated().sum()
print(f"👥 Credits Dataset has {credits_duplicates} duplicate rows.")


#### Missing Values/Null Values

In [None]:
# Count of missing values in each column of the titles dataset
print("🎬 Titles Dataset - Missing Values Count:")
print(titles_df.isnull().sum())

print("\n" + "-"*50 + "\n")

# Count of missing values in each column of the credits dataset
print("👥 Credits Dataset - Missing Values Count:")
print(credits_df.isnull().sum())


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the plot style
sns.set(style="whitegrid")

# Visualize missing values in Titles dataset
plt.figure(figsize=(12, 6))
sns.heatmap(titles_df.isnull(), cbar=False, cmap="YlGnBu", yticklabels=False)
plt.title("Missing Values Heatmap - Titles Dataset")
plt.show()

# Visualize missing values in Credits dataset
plt.figure(figsize=(12, 6))
sns.heatmap(credits_df.isnull(), cbar=False, cmap="YlGnBu", yticklabels=False)
plt.title("Missing Values Heatmap - Credits Dataset")
plt.show()


### What did you know about your dataset?

In [None]:
The data is in two files:

*   titles.csv: Includes data about movies and TV shows that are currently available on Amazon Prime, with fields like title, type, release_year, genres, runtime, and imdb_score.
*   credits.csv: Includes metadata for cast and crew on a per-title basis, with fields like name, role, and category.

The titles.csv file has about 22,000+ rows and 12 columns, suggesting a large number of titles.

The credits.csv file has approximately 98,000+ rows and 4 columns, listing detailed cast and crew details per title.

The columns in the datasets have a combination of:

*   Numerical data: e.g., runtime, imdb_score, release_year
*   Categorical data: e.g., type, genres, title, language, country
*   Text data: e.g., description, name, role

We found missing values in some key columns, including:

*   imdb_score, description, and runtime in titles.csv
*   name and role in credits.csv

Some duplicate rows were found in both datasets; they were identified and removed.

The release_year column must be converted to a numeric format for timeline analysis.

Preliminary observations reveal:

*   A greater number of movies than TV shows.
*   Drama, Comedy, and Action are popular genres.
*   IMDb scores range from 1 to 10 and form a useful target for potential prediction using machine learning.
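To put the missing-value observations above on a comparable scale, the share of missing entries per column can be computed directly. The sketch below uses a small hypothetical frame (`toy_titles`, with made-up values); on the real data the same expression could be applied to `titles_df` and `credits_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps, standing in for titles_df.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [90, np.nan, 45, 100],
    "imdb_score": [6.5, np.nan, np.nan, 7.0],
})

# isnull() gives a boolean frame; the column mean is the fraction missing.
missing_pct = toy_titles.isnull().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Percentages make it easier to decide between dropping rows and imputing than raw counts do.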

## ***2. Understanding Your Variables***

In [None]:
# List all column names in titles dataset
print("🎬 Titles Dataset Columns:")
print(titles_df.columns.tolist())

# List all column names in credits dataset
print("\n👥 Credits Dataset Columns:")
print(credits_df.columns.tolist())


In [None]:
# Describe numerical features in titles dataset
print("📈 Titles Dataset - Numerical Summary:")
print(titles_df.describe())

# Optional: Include object (categorical) summary
print("\n📋 Titles Dataset - Categorical Summary:")
print(titles_df.describe(include=['object']))

# Describe numerical data in credits dataset
print("\n👥 Credits Dataset - Summary:")
print(credits_df.describe())


### Variables Description

In [None]:
| Column Name            | Description                                                      |
| ---------------------- | ---------------------------------------------------------------- |
| `id`                   | Unique identifier for every title.                               |
| `title`                | Title of the film or television show.                            |
| `type`                 | Type of content — either "MOVIE" or "SHOW".                      |
| `description`          | Brief synopsis or summary of the title.                          |
| `release_year`         | Release year of the content.                                     |
| `age_certification`    | Age rating (e.g., PG, R, TV-14) applied to the content.          |
| `runtime`              | Content length in minutes.                                       |
| `genres`               | Array of genres that relate to the title (e.g., Action, Comedy). |
| `production_countries` | Countries in which the content was produced.                     |
| `seasons`              | Number of seasons (for television shows only; `NaN` for films).  |
| `imdb_id`              | IMDb title identifier.                                           |
| `imdb_score`           | IMDb rating score between 1 and 10.                              |


### Check Unique Values for each variable.

In [None]:
# Unique values in each column of the titles dataset
print("🎬 Titles Dataset - Unique Values Count:")
for col in titles_df.columns:
    unique_vals = titles_df[col].nunique()
    print(f"{col}: {unique_vals} unique values")

print("\n" + "-"*60 + "\n")

# Unique values in each column of the credits dataset
print("👥 Credits Dataset - Unique Values Count:")
for col in credits_df.columns:
    unique_vals = credits_df[col].nunique()
    print(f"{col}: {unique_vals} unique values")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Remove duplicate rows from both datasets
titles_df = titles_df.drop_duplicates()
credits_df = credits_df.drop_duplicates()

# Convert release_year to numeric (some entries may be invalid)
titles_df['release_year'] = pd.to_numeric(titles_df['release_year'], errors='coerce')

# Convert imdb_score and runtime to numeric (in case of incorrect types)
titles_df['imdb_score'] = pd.to_numeric(titles_df['imdb_score'], errors='coerce')
titles_df['runtime'] = pd.to_numeric(titles_df['runtime'], errors='coerce')

# Fill missing values (optional handling)
titles_df['age_certification'] = titles_df['age_certification'].fillna('Unknown')
titles_df['genres'] = titles_df['genres'].fillna('Unknown')
titles_df['production_countries'] = titles_df['production_countries'].fillna('Unknown')

# Drop rows where essential columns are missing (e.g., title, imdb_score, type)
titles_df = titles_df.dropna(subset=['title', 'type', 'imdb_score'])

# Fill null values in credits dataset (skip any column that is not present,
# so the cell does not fail with a KeyError)
for col in ['name', 'role', 'category']:
    if col in credits_df.columns:
        credits_df[col] = credits_df[col].fillna('Unknown')

# Reset index after cleaning
titles_df.reset_index(drop=True, inplace=True)
credits_df.reset_index(drop=True, inplace=True)

print("✅ Data wrangling complete. Dataset is ready for analysis.")


### What all manipulations have you done and insights you found?

In [None]:
Data manipulations done:

*   Loaded and inspected the datasets:
    *   Loaded credits.csv and titles.csv with pandas.
    *   Displayed the first few rows, shapes, and column information.
*   Cleaned the data:
    *   Removed duplicate rows from both datasets to maintain data quality.
    *   Converted key columns (release_year, runtime, imdb_score) to numeric types for analysis.
    *   Filled missing categorical values (e.g., genres, production_countries, age_certification) with 'Unknown'.
    *   Dropped rows where necessary information such as title, type, or imdb_score was missing.
*   Handled missing values:
    *   Visualized missing values with Seaborn heatmaps.
    *   Counted missing entries across both datasets to decide between imputation and dropping.
    *   Filled missing values in the credits dataset (name, role, and category) with 'Unknown'.
*   Adjusted column types and structure:
    *   Confirmed that all data types suit the analysis (e.g., strings vs. numeric).
    *   Reset the DataFrame indices after cleaning.
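The effect of the numeric coercion and row-dropping steps listed above can be seen on a tiny example. The frame below (`toy`) is hypothetical, with values made up for illustration; it shows how `errors='coerce'` turns unparseable entries into `NaN` and how `dropna(subset=...)` removes only rows missing an essential field.

```python
import pandas as pd

# Hypothetical toy frame with an invalid year, a missing title, and a missing score.
toy = pd.DataFrame({
    "title": ["A", "B", None, "D"],
    "release_year": ["2019", "unknown", "2005", "2021"],
    "imdb_score": ["7.1", "6.0", "5.5", None],
})

# Unparseable strings become NaN instead of raising an error.
toy["release_year"] = pd.to_numeric(toy["release_year"], errors="coerce")
toy["imdb_score"] = pd.to_numeric(toy["imdb_score"], errors="coerce")

# Drop rows only when an essential column is missing; release_year may stay NaN.
cleaned = toy.dropna(subset=["title", "imdb_score"]).reset_index(drop=True)
print(cleaned)
```

Row "B" survives with a `NaN` release year because `release_year` is not in the `subset` list, which mirrors the wrangling cell in this notebook.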



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the plot style
sns.set(style="whitegrid")

# Countplot for content type
plt.figure(figsize=(8, 5))
sns.countplot(data=titles_df, x='type', palette='Set2')

plt.title('Chart 1: Distribution of Content Type on Amazon Prime', fontsize=14)
plt.xlabel('Content Type')
plt.ylabel('Count')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

In [None]:
I chose a bar chart (countplot) to display the distribution of content types (Movies vs. TV Shows) on Amazon Prime because:

It is easy, unambiguous, and very good at displaying categorical variable frequencies.

The type column has few unique categories, so a bar chart is the most appropriate and readable form.

This visualization answers a basic question early in the analysis:
"Which content type does Amazon Prime have more of: movies or shows?"

Knowing the mix of movies and shows provides context for all subsequent analysis (e.g., genres, runtime, ratings).

Bar charts are well-suited to compare counts between discrete categories, which is exactly what is needed here.

##### 2. What is/are the insight(s) found from the chart?

In [None]:
From the content type distribution chart (Chart 1), we observe that:

Movies significantly outnumber TV Shows on Amazon Prime.

This indicates that Amazon Prime focuses more on offering movie content compared to series.

The imbalance may reflect user demand, licensing strategy, or production cost considerations.

For data-driven decisions, this insight suggests that most analysis and engagement on the platform is likely centered around movies.

It also sets the stage for deeper analysis, such as whether movies also dominate in terms of higher IMDb ratings, popular genres, or runtime patterns.
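To quantify how strongly one content type outnumbers the other, the counts behind the chart can be expressed as percentages. This is a small sketch on a hypothetical series (`toy_types`); on the real data the same call could be applied to `titles_df['type']`.

```python
import pandas as pd

# Hypothetical toy column standing in for titles_df['type'].
toy_types = pd.Series(["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW"], name="type")

# normalize=True returns fractions instead of raw counts.
type_share = toy_types.value_counts(normalize=True).mul(100).round(1)
print(type_share)
```

Reporting the share alongside the bar chart makes a statement such as "movies significantly outnumber shows" concrete.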

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In [None]:
Positive Business Impact:
Yes, the conclusions derived from the analysis can guide Amazon Prime Video into more strategic, better-informed decisions:
Content Strategy Optimization:

Discovering that movies predominate the catalog enables Amazon to determine whether this reflects user demand.

If viewership patterns indicate growing interest in TV shows, this knowledge could justify shifting investment toward series production.

User Experience Personalization:

Knowing content type, genres, and certification distribution allows for personalization of recommendations.

For instance, if the majority of viewers like thriller movies, Amazon can promote or emphasize content in such a genre.

Data-Driven Production Decisions:

Knowledge of popular genres and the highest IMDb scores can guide Amazon to invest in content that is likely to perform better, improving ROI.

Expansion Strategy Across the World:

Having knowledge of the distribution of production_countries can assist the platform in achieving a balance between global and local content, appealing to a larger user base.

⚠️ Negative Growth Indicators (with Reason):
While the findings are overall encouraging, risks or points of concern are also identified:

Over-Reliance on Movie Content:

A much larger proportion of movies than TV shows could reduce long-term user engagement.

TV shows, particularly multi-season ones, promote binge-watching and longer retention, which helps decrease churn.

Limited Family/Youth Content

If most of the catalog is restricted to mature audiences (e.g., TV-MA, R), there could be missed opportunities in family or kids segments.

Disney+ has demonstrated robust performance by focusing on younger groups.

Saturation of Genres:

If a particular genre such as "Drama" or "Romance" is too dominant, it might lead to content fatigue.

Diversifying genres according to user preference can improve satisfaction, including among family viewers, and increase reach.



#### Chart - 2

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Count the most common genres
top_genres = titles_df['genres'].value_counts().head(10)

# Plot the top 10 genres
plt.figure(figsize=(10, 6))
sns.barplot(x=top_genres.values, y=top_genres.index, palette='viridis')

plt.title('Chart 2: Top 10 Most Common Genres on Amazon Prime', fontsize=14)
plt.xlabel('Number of Titles')
plt.ylabel('Genres')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

In [None]:
I selected a horizontal bar chart to visualize the Top 10 most common genres on Amazon Prime for the following reasons:

A bar chart is the best fit for comparing the frequency of categorical data like genres.

By using the top 10 genres, we keep the visualization clear, focused, and easy to interpret, avoiding clutter from rare or unique genres.

The chart helps answer key business questions like:

Which genres dominate the content library?

Are Amazon Prime’s offerings aligned with popular audience preferences?

Displaying genres on the Y-axis makes it easier to read longer category names, which improves visual clarity.

It reveals patterns in content strategy and can help guide content investment, marketing, and recommendation systems.

This chart was chosen for its ability to provide clear, actionable insights about what types of stories Amazon is prioritizing — a key metric in content strategy and user engagement analysis.



##### 2. What is/are the insight(s) found from the chart?

In [None]:
From the chart showing the top 10 most common genres on Amazon Prime:

Drama is by far the most frequent genre, indicating a strong emphasis on serious, character-driven storytelling.

Comedy and Action follow closely, showing that Amazon also invests in entertaining and fast-paced content.

Other highly represented genres include Romance, Thriller, and Documentary, suggesting a mix of emotional, suspenseful, and factual content to serve diverse viewer interests.

The presence of genres like Crime, Horror, and Adventure shows that Amazon is also targeting niche audiences with specific tastes.

The wide variety of genres in the top 10 suggests that Amazon Prime is trying to appeal to a broad, global user base rather than focusing on a narrow content type.
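One caveat when reading the chart: the variables table describes `genres` as an array, so if each cell holds a list-like string such as `"['drama', 'comedy']"` (an assumption about the storage format, not confirmed here), `value_counts()` on the raw column counts genre combinations rather than individual genres. A minimal sketch of counting individual genres under that assumption, on a hypothetical frame `toy_titles` with a helper `parse_genres` introduced only for illustration:

```python
import ast

import pandas as pd

# Hypothetical toy frame; the list-like string format is an assumption.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["['drama', 'comedy']", "['drama']", "[]", "Unknown"],
})


def parse_genres(value):
    """Parse a list-like string; return an empty list if parsing fails."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []
    return parsed if isinstance(parsed, list) else []


# explode() gives one row per genre; empty lists become NaN and are dropped.
genre_counts = (
    toy_titles["genres"].apply(parse_genres).explode().dropna().value_counts()
)
print(genre_counts)
```

If the column already holds one genre per row, this step is unnecessary and the chart code can be used as is.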

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In [None]:
Positive Business Impact through Genre Insights:
Yes, the observations made from the Top 10 Genres chart can assist Amazon Prime Video in making more intelligent business decisions:

Content Investment Strategy:

By discovering the most prevalent genres (e.g., Drama, Comedy, Action), Amazon can align its production and acquisition expenses more towards high-demand genres.

It also enables them to study which genres work best in terms of user interaction and IMDb ratings.

Better Content Recommendation:

Knowing which genres occur most frequently allows Amazon to train better recommendation engines, giving users more targeted experiences and resulting in greater watch time and retention.

Audience Targeting & Marketing:

Amazon is able to develop genre-based campaigns (e.g., "Top Thrillers this Month" or "New Romantic Comedies") to attract users having particular content interests.

Platform Differentiation:

Insights into genre diversity can assist Amazon in differentiating from rivals such as Netflix or Disney+ by unlocking exclusive niches (e.g., robust documentary presence).

⚠️ Risks for Negative Growth:
While the chart provides useful insights, it also discloses some potential risks or areas of concern:

Genre Saturation (Over-Reliance):

Excessive focus on genres such as Drama and Comedy can cause content fatigue among users, particularly if most of that content is low-performing.

Under-investing in established but underserved categories such as Sci-Fi, Animation, or Kids' content can cause Amazon to forfeit opportunities in those categories.

Overlooking Niche or Emerging Genres:

If Amazon does not sufficiently serve genres on the rise (e.g., K-Drama, True Crime, Fantasy), it can lose audiences to competing platforms that target them.

Restricted Demographic Penetration:

Underrepresentation or lack of family-friendly, teen, or regional-language genres might limit reach among particular markets or age brackets.



#### Chart - 3

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Group by type and calculate average IMDb score
avg_scores = titles_df.groupby('type')['imdb_score'].mean().reset_index()

# Plot the average IMDb score by content type
plt.figure(figsize=(8, 5))
sns.barplot(data=avg_scores, x='type', y='imdb_score', palette='coolwarm')

plt.title('Chart 3: Average IMDb Score by Content Type', fontsize=14)
plt.xlabel('Content Type')
plt.ylabel('Average IMDb Score')
plt.ylim(0, 10)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

In [None]:
I chose a bar chart of the average IMDb score by content type for the following reasons:

It compares a single numeric measure (the mean imdb_score) across the two categories in the type column, which is what a bar chart is well suited for.

Chart 1 showed how many movies and shows the catalog contains; this chart adds a quality dimension by comparing how the two content types are rated.

Fixing the y-axis to the 0 to 10 range keeps the comparison on the full IMDb scale.


##### 2. What is/are the insight(s) found from the chart?

In [None]:
The chart reveals that TV Shows have a slightly higher average IMDb score than Movies on Amazon Prime.

This suggests that, although movies make up the majority of the catalog, TV shows may deliver better viewer satisfaction based on ratings.

This could be due to deeper storytelling, longer viewer engagement, or stronger audience connection with characters and plotlines in shows.

It highlights a potential strategic opportunity: Amazon Prime could focus on producing or acquiring high-quality series to increase retention and long-term viewer loyalty.

Alternatively, this insight also reveals a gap in movie quality—suggesting Amazon may need to invest in improving the production value, script quality, or curation of its movie offerings.
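Because the difference between the two averages is described as slight, it helps to look at more than the mean before drawing conclusions. The sketch below, on a hypothetical frame `toy_titles` with made-up scores, reports the count, mean, median, and standard deviation per content type.

```python
import pandas as pd

# Hypothetical toy data standing in for titles_df[['type', 'imdb_score']].
toy_titles = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [5.0, 6.0, 7.0, 7.0, 8.0],
})

# Several statistics per group give context for a small gap in means.
type_summary = toy_titles.groupby("type")["imdb_score"].agg(
    ["count", "mean", "median", "std"]
)
print(type_summary)
```

A small gap in means with a large spread and very different group sizes is weaker evidence than the bar chart alone suggests.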

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In [None]:
Positive Business Impact:
The findings from Chart 3 — that TV Shows have a greater average IMDb rating than Movies — can inform business decision-making directly that fuels growth:
Strategic Investment in High-Quality TV Shows
Because shows score better on average, Amazon can produce or license more series with strong storylines, which tend to lead to longer engagement and greater retention.
This can help lower churn and increase watch time, particularly for subscription services like Prime Video.
Movie Content Improvement:
The lower average IMDb rating for movies points to an opportunity to improve movie quality through better scripts, casting, directing, or curation. Improving movie quality would raise user satisfaction across the larger part of the catalog (given that movies are more common).

Personalized Recommendation

Knowing that shows tend to perform well can assist Amazon in suggesting more popular shows to new users, which can improve first impressions and retention.
Possible Risk or Negative Growth Indicator:
There is also a possible point of concern:

Quality Gap in Movies:

Even though films are the most frequent type of content, they have a lower average score. This discrepancy between quantity and quality may cause viewer frustration if not rectified.

If customers find most films to be disappointing, they could switch to alternatives such as Netflix or Disney+, which provide higher-rated or more selective film choices.

Over-reliance on TV Show Ratings:

Relying too heavily on TV show ratings could overlook the large number of movies that are watched. If focus shifts too heavily toward shows, Amazon could lose the interest of occasional viewers, who tend to watch more films than shows.



#### Chart - 4

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Exclude missing or unknown genres
genre_scores = titles_df[titles_df['genres'].notnull() & (titles_df['genres'] != 'Unknown')]

# Group by genre and calculate average IMDb score
avg_genre_score = genre_scores.groupby('genres')['imdb_score'].mean().sort_values(ascending=False).head(10)

# Plot top 10 genres by average IMDb score
plt.figure(figsize=(10, 6))
sns.barplot(x=avg_genre_score.values, y=avg_genre_score.index, palette='crest')

plt.title('Chart 4: Top 10 Genres by Average IMDb Score', fontsize=14)
plt.xlabel('Average IMDb Score')
plt.ylabel('Genres')
plt.xlim(0, 10)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose this chart to visualize the average IMDb score by genre because:

It reveals how well each genre is received by audiences, based on their IMDb ratings.

While previous charts showed which genres are most common, this one adds a quality dimension by showing how viewers actually feel about those genres.

A horizontal bar chart is ideal here because:

Genre names can be long, and displaying them on the Y-axis keeps the chart readable.

It's easy to compare the average scores side by side.

The chart helps identify high-performing genres (like Documentary, Crime, or Thriller) and low-performing ones, offering actionable insights for:

Content production priorities

Licensing and acquisition decisions

Platform personalization and user targeting

This chart was chosen because it connects audience sentiment (IMDb scores) with genre categories, which is crucial for improving content strategy and user engagement.

##### 2. What is/are the insight(s) found from the chart?

In [None]:
The chart reveals that genres such as Documentary, Crime, Thriller, and History have some of the highest average IMDb scores on Amazon Prime.

These genres are likely better received by audiences because they offer more depth, factual storytelling, or intense narratives that resonate with viewers.

Genres that are more niche or focused, like Documentaries or Biographies, tend to receive higher critical appreciation compared to broader genres like Comedy or Romance.

This suggests that high-quality content doesn’t always align with the most frequent genres (e.g., Drama and Comedy were more common but not always top-rated).

The insight supports the idea that Amazon could invest more in well-performing genres — even if they’re not the most common — to elevate platform quality and user satisfaction.
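A ranking by average score can be dominated by genres with only a handful of titles, so it is worth checking the sample size behind each mean. The following sketch uses a hypothetical frame `toy` with one genre per row and a hypothetical threshold `MIN_TITLES`; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with one genre per row.
toy = pd.DataFrame({
    "genre": ["documentary", "drama", "drama", "drama", "comedy", "comedy"],
    "imdb_score": [9.0, 6.0, 7.0, 8.0, 5.0, 6.0],
})

MIN_TITLES = 2  # hypothetical minimum number of titles per genre

# Named aggregation keeps the title count next to the mean score.
genre_stats = toy.groupby("genre")["imdb_score"].agg(
    n_titles="count", mean_score="mean"
)

# Keep only genres with enough titles, then rank by mean score.
reliable = genre_stats[genre_stats["n_titles"] >= MIN_TITLES].sort_values(
    "mean_score", ascending=False
)
print(reliable)
```

In this toy example the single documentary has the highest score but is excluded because one title is not enough to support a genre-level claim.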

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In [None]:
 Positive Business Impact:
The insights from Chart 4 — showing the top-rated genres by average IMDb score — can significantly help Amazon Prime drive strategic content decisions:

Invest in High-Rated Genres:

Genres like Documentary, Crime, and Thriller received the highest average IMDb scores.

This indicates strong audience approval, suggesting Amazon can benefit from producing or acquiring more content in these genres to improve platform reputation and satisfaction.

Niche Audience Engagement:

Although not the most frequent, top-rated genres often attract loyal niche audiences, which helps build community and long-term retention.

Promoting or curating content by high-rated genres can lead to better user experience and higher engagement.

Data-Driven Marketing & Curation:

Insights can inform genre-based content promotions (“Top-rated Documentaries” or “Best Thrillers”), improving discovery and user satisfaction.

Potential Indicators of Negative Growth:

Mismatch Between Quantity and Quality:

More common genres like Drama and Comedy, though frequently produced, may not appear in the top-rated list.

This suggests a gap between what is produced most and what audiences actually value, potentially leading to viewer fatigue or dissatisfaction over time.

Underutilization of High-Quality Genres:

If high-rated genres like Documentary or History are underrepresented, Amazon may be missing opportunities to deliver more of what audiences prefer.

This could result in audience migration to competitors that better serve those preferences.



#### Chart - 5

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Drop rows with missing release_year; .copy() avoids a SettingWithCopyWarning on the assignment below
titles_years = titles_df[titles_df['release_year'].notnull()].copy()

# Convert to integer if needed
titles_years['release_year'] = titles_years['release_year'].astype(int)

# Count of titles per release year
yearly_counts = titles_years['release_year'].value_counts().sort_index()

# Plot the trend
plt.figure(figsize=(12, 6))
sns.lineplot(x=yearly_counts.index, y=yearly_counts.values, marker='o', color='teal')

plt.title('Chart 5: Number of Titles Released Per Year', fontsize=14)
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

In [None]:
I chose this line chart to visualize the number of titles released per year because:

It helps uncover trends over time — specifically, how Amazon Prime’s content library has grown, stabilized, or declined in different years.

Time series data like release_year is best visualized with a line chart, as it clearly shows patterns, spikes, and drops across a continuous timeline.

The chart answers a critical business question:
"When did Amazon invest more heavily in new content?"

It can help Amazon assess the impact of strategic decisions, such as ramping up original productions or licensing content during certain periods (e.g., post-2015 growth).

This chart supports long-term planning by identifying year-over-year content output trends or gaps that may need to be addressed.

##### 2. What is/are the insight(s) found from the chart?

In [None]:
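The chart shows a clear upward trend in the number of titles by release year, particularly from 2015 onward, which is consistent with Amazon Prime's aggressive expansion strategy. Note that release_year records when a title was first released, not when it was added to the platform, so the chart describes the age profile of the catalog rather than acquisition timing directly.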

There is a noticeable spike in the number of titles released between 2017 and 2020, suggesting increased investment in original and licensed content during that period.

This surge aligns with Amazon’s strategy to compete with platforms like Netflix and Disney+, by offering a diverse and growing content library.

The data may also reflect industry-wide growth, as streaming demand increased globally — particularly around the COVID-19 pandemic, when streaming consumption surged.

Any recent drop in titles (if visible in later years) could point to shifting strategies, production delays, or quality-over-quantity approaches.
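The trend described above can be put into numbers with a short sketch like the one below. It is illustrative only: the `years` series is hypothetical data standing in for `titles_df['release_year']`, and the 2015 cutoff is taken from the discussion above rather than derived from the data.

```python
import pandas as pd

# Hypothetical release years standing in for titles_df['release_year']
years = pd.Series([2012, 2014, 2015, 2016, 2016, 2017, 2017, 2017, 2018, 2018],
                  name="release_year")

yearly_counts = years.value_counts().sort_index()

# Fill years with no releases so gaps show up as zeros instead of vanishing
full_range = range(int(yearly_counts.index.min()), int(yearly_counts.index.max()) + 1)
yearly_counts = yearly_counts.reindex(full_range, fill_value=0)

# Year-over-year change in the number of titles
yoy_change = yearly_counts.diff()

# Share of all titles released in or after the cutoff year
cutoff = 2015
share_since_cutoff = (years >= cutoff).mean()

peak_year = int(yearly_counts.idxmax())
print(f"Peak release year: {peak_year}")
print(f"Share of titles released since {cutoff}: {share_since_cutoff:.0%}")
```

When reading a drop in the most recent year, it is worth checking whether the dataset was collected partway through that year, since an incomplete year would look like a decline.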

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In [None]:
Positive Business Impact:
Yes, the insights from Chart 5 — which shows the number of titles released per year — can positively impact Amazon Prime's business in several ways:

Evidence of Strategic Growth:

The steady rise in content production, especially after 2015, reflects Amazon’s deliberate expansion strategy, aimed at competing with other major streaming platforms.

It confirms the company’s commitment to building a large, diverse content library, which helps attract and retain users.

Investment Timing Validation:

The spike in content releases around 2017–2020 suggests strategic investments that likely contributed to user acquisition and brand growth.

Understanding when major expansions occurred allows Amazon to correlate release volume with subscription growth, engagement, and churn.

Forecasting & Planning:

This historical view allows Amazon to forecast content demand, plan release calendars, and manage content budgets more efficiently.

Potential Risk / Indicator of Negative Growth:
While the overall trend is positive, the chart may also reveal potential risks:

Overproduction Without Quality:

If the number of titles increases but average IMDb scores do not improve, Amazon risks creating a quantity-over-quality perception, which can erode trust.

A high content volume can overwhelm users, making it harder to find high-quality content (content discovery fatigue).

Possible Decline or Plateau:

If the chart shows a drop in recent years (e.g., post-2021), it could indicate:

Slowed production due to global events (e.g., COVID, strikes).

A shift in strategy (e.g., focusing on fewer but higher-budget productions).

Or potential budget cuts, which may impact long-term content variety or platform competitiveness.
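The overproduction risk described above can be examined by putting yearly volume next to yearly average score. The following is a rough sketch under stated assumptions: `sample_df` is a hypothetical frame with made-up values standing in for `titles_df`, and with so few years the correlation is illustrative rather than evidence.

```python
import pandas as pd

# Hypothetical stand-in for titles_df; the values are illustrative only
sample_df = pd.DataFrame({
    "release_year": [2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "imdb_score":   [7.0, 6.0, 6.5, 6.0, 5.5, 6.0, 5.0, None, 6.0],
})

# Titles per year (size counts unrated titles too) and mean score of rated titles
per_year = (sample_df.groupby("release_year")
            .agg(n_titles=("imdb_score", "size"),
                 mean_score=("imdb_score", "mean")))

# A negative value means years with more titles tend to have lower average scores
volume_quality_corr = per_year["n_titles"].corr(per_year["mean_score"])

print(per_year)
print(f"Correlation between volume and mean score: {volume_quality_corr:.2f}")
```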



#### Chart - 6

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Replace missing age certifications with 'Unknown' if not already handled
titles_df['age_certification'] = titles_df['age_certification'].fillna('Unknown')

# Countplot for age certification
plt.figure(figsize=(10, 6))
sns.countplot(data=titles_df, x='age_certification', order=titles_df['age_certification'].value_counts().index, palette='Set3')

plt.title('Chart 6: Distribution of Content by Age Certification', fontsize=14)
plt.xlabel('Age Certification')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

In [None]:
I selected this bar chart to visualize the distribution of content by age certification because:

It directly answers a key question in content strategy:
"What audience age groups is Amazon Prime targeting the most?"

Age certifications (like PG, TV-14, R, TV-MA) reflect the suitability of content for different viewer segments, such as children, teens, and adults.

A countplot (bar chart) is ideal for comparing the frequency of discrete categories like age ratings.

This chart helps identify:

Whether the platform is family-friendly or leans heavily toward adult-oriented content.

If there's an opportunity to diversify and better serve underrepresented age groups.

##### 2. What is/are the insight(s) found from the chart?

In [None]:
The chart reveals that a significant portion of Amazon Prime's content is rated for mature audiences, such as TV-MA and R.

This indicates that Amazon is currently targeting adult viewers more heavily than children or family audiences.

There is comparatively less content rated for younger audiences, such as TV-G, G, or PG, suggesting a gap in family or kids’ programming.

The presence of a large number of titles with the label “Unknown” also suggests that some content may lack clear classification, which could affect parental controls and content filtering.

These insights highlight both strengths and opportunities:

Strength: A strong library for mature audiences.

Opportunity: Expand into youth and family segments, potentially increasing the platform’s appeal to a broader household demographic.
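To turn the observations above into shares, certifications can be grouped into broader audience segments. The sketch below is one possible way to do that; the `AUDIENCE_GROUPS` mapping is an assumption made for illustration (the boundaries between segments are a judgment call), and `cert` is hypothetical data standing in for `titles_df['age_certification']`.

```python
import pandas as pd

# Assumed grouping of certifications into audience segments (not from the dataset)
AUDIENCE_GROUPS = {
    "G": "Family", "PG": "Family", "TV-Y": "Family",
    "TV-Y7": "Family", "TV-G": "Family", "TV-PG": "Family",
    "PG-13": "Teen", "TV-14": "Teen",
    "R": "Mature", "NC-17": "Mature", "TV-MA": "Mature",
}

# Hypothetical certifications standing in for titles_df['age_certification']
cert = pd.Series(["R", "TV-MA", None, "PG", "TV-14", "R", None, "G", "PG-13", "R"])

# Missing values and labels absent from the mapping both fall into 'Unknown'
groups = cert.map(AUDIENCE_GROUPS).fillna("Unknown")

# Share of titles per audience segment
group_share = groups.value_counts(normalize=True)
print(group_share)
```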

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In [None]:
The insights from Chart 6 — showing the distribution of content by age certification — can directly help Amazon Prime:

Refine Target Audience Strategy:

With a strong focus on adult-oriented content (TV-MA, R), Amazon is clearly appealing to mature audiences who prefer more complex or intense content.

This focus can help differentiate the platform from competitors like Disney+, which primarily targets younger viewers.

Improve User Experience and Personalization:

Understanding content ratings allows Amazon to offer better parental control features and age-based recommendations, increasing user trust and family adoption.

Expand Market Reach:

By identifying the lack of family-friendly and children’s content, Amazon can intentionally invest in underserved segments — helping to attract families, younger users, and schools.

Potential Risks or Indicators of Negative Growth:
Limited Content for Younger Audiences:

The chart shows relatively low representation of PG, G, and TV-G content, which could limit Amazon’s growth in the family and kids’ entertainment market.

Competitors like Netflix and Disney+ are strong in this area, meaning Amazon might lose family viewership or miss out on subscription bundles targeted at parents.

Unknown Age Ratings:

A notable number of titles have an “Unknown” rating. This inconsistency can lead to poor user trust, especially for families who want to filter content for children.

It also affects recommendation accuracy and undermines Amazon’s ability to clearly classify and promote its content.



#### Chart - 15 - Pair Plot

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select relevant numerical columns for pair plot
pairplot_df = titles_df[['release_year', 'runtime', 'imdb_score']].dropna()

# Create the pair plot (palette is omitted because it has no effect without a hue variable)
sns.pairplot(pairplot_df, diag_kind='kde')

plt.suptitle('Pair Plot: Relationships Between Release Year, Runtime & IMDb Score', y=1.02, fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

In [None]:
I selected the pair plot to explore relationships between multiple numerical variables in the dataset — specifically release_year, runtime, and imdb_score — because:

It provides a compact, multi-dimensional view of how numerical features interact with each other.

It helps identify:

Correlations (e.g., Is a longer runtime associated with higher IMDb scores?).

Trends over time (e.g., Have IMDb scores changed across release years?).

Outliers or clusters that may not be obvious in individual plots.

The KDE plots on the diagonal give insights into individual variable distributions, while the scatter plots between variables help spot potential patterns or anomalies.

It's especially useful in early data exploration to inform feature selection for machine learning models.

In summary, the pair plot was chosen for its ability to reveal multi-variable interactions in a single visual, making it a powerful tool for uncovering patterns that drive deeper insights or predictive modeling.

##### 2. What is/are the insight(s) found from the chart?

In [None]:
The pair plot reveals a few key relationships between release_year, runtime, and imdb_score:

Runtime vs IMDb Score:

There is a slight positive correlation between runtime and imdb_score, suggesting that longer titles may be rated slightly higher on average.

However, the relationship is not strong or consistent, indicating other factors also play major roles in audience ratings.

Release Year vs IMDb Score:

IMDb scores are relatively evenly distributed across years, meaning that newer content doesn't necessarily perform better than older titles in terms of ratings.

This could imply that Amazon’s recent increase in content does not always equate to higher viewer satisfaction.

Distributions:

Runtime is heavily right-skewed, with most content under 120 minutes.

IMDb scores tend to cluster between 5 and 7, indicating moderate viewer satisfaction overall.

Release years show a sharp increase in recent years, confirming Amazon’s growing content library.
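The visual impressions above can be backed with numbers. The sketch below computes a rank correlation matrix and per-column skewness; `sample_df` is hypothetical data standing in for `titles_df`, and Spearman correlation is chosen here as an assumption because it is less sensitive to skewed columns such as runtime.

```python
import pandas as pd

# Hypothetical stand-in for titles_df; the values are illustrative only
sample_df = pd.DataFrame({
    "release_year": [1995, 2005, 2012, 2016, 2018, 2019, 2020, 2021],
    "runtime":      [90, 95, 100, 45, 110, 60, 240, 85],
    "imdb_score":   [6.5, 7.0, 5.5, 7.5, 6.0, 6.8, 7.2, 5.9],
})

num_cols = ["release_year", "runtime", "imdb_score"]
clean = sample_df[num_cols].dropna()

# Rank-based correlation between every pair of numerical columns
corr = clean.corr(method="spearman")

# Positive skewness indicates a long right tail (a few very long titles)
skewness = clean.skew()

print(corr.round(2))
print(skewness.round(2))
```

A coefficient near zero between `runtime` and `imdb_score` would support the reading that runtime alone explains little of the variation in ratings.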

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

In [None]:
1. Balance Content Volume with Quality
Although movies make up most of the catalog, TV series have higher average IMDb ratings.

Recommendation: Spend more on high-quality TV series, particularly those that fall under high-rated genres (e.g., Crime, Thriller, Documentary), to raise user engagement and retention.

2. Diversify Genre Strategy
Drama and Comedy are overrepresented in the catalog, yet genres such as Documentary and History tend to receive higher average scores.

Recommendation: Showcase and offer high-performance underrepresented genres to maximize viewer satisfaction and differentiate from rivals.

3. Increase Family and Kids Content
Age certification data indicates a high emphasis on mature content with few titles for families and children.

Recommendation: Produce or license additional family content (PG, G, TV-G) to target household viewing and challenge Disney+ and other competitors.

4. Enhance Metadata Completeness
Several titles have missing or "Unknown" values for critical fields such as age certification and genre.

Recommendation: Implement strict metadata standards to enhance user filtering, parental controls, and recommendation correctness.

5. Use Viewer Ratings for Curation
Use IMDb ratings to highlight the best-rated content per category or genre.
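As a starting point for the metadata recommendation above, a completeness audit could look like the sketch below. It is a rough illustration: `sample_df` is a hypothetical frame standing in for `titles_df`, and the `PLACEHOLDERS` list and `missing_share` helper are assumptions about which values should count as missing.

```python
import pandas as pd

# Hypothetical stand-in for titles_df; the values are illustrative only
sample_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "age_certification": ["R", None, "Unknown", "PG"],
    "genres": ["['drama']", "[]", "['comedy']", None],
    "imdb_score": [7.1, None, 6.3, 5.8],
})

# Assumed placeholder values that are treated as missing in addition to nulls
PLACEHOLDERS = ["", "[]", "Unknown"]

def missing_share(df):
    """Share of rows per column that are null or hold a placeholder value."""
    is_missing = df.isna() | df.isin(PLACEHOLDERS)
    return is_missing.mean().sort_values(ascending=False)

report = missing_share(sample_df)
print(report)
```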



# **Conclusion**

In [None]:
This project offered a thorough exploratory analysis of the Amazon Prime TV shows and movies dataset. Through visualization, statistical summary, and trend analysis, the following major patterns and opportunities were uncovered:

Most content is movies, yet TV shows have higher average IMDb scores, which suggests greater audience engagement with series.

The most common genres are Drama, Comedy, and Action, but the Documentary, Crime, and Thriller genres have better ratings, which indicates an imbalance between the volume of content and audience demand.

There is a strong emphasis on content for mature audiences (TV-MA, R) with very few child- and family-friendly titles, which leaves clear room for expansion.

A sharp growth in content production between 2015 and 2020 was noted, mirroring Amazon's strategic drive into original and exclusive programming.

The report also pointed to several risks that could impact user satisfaction and reduce audience reach if not addressed, including genre saturation, incomplete metadata, and underrepresentation of family-friendly content.

Overall, Amazon Prime can shore up its competitive advantage by:

Elevating high-quality TV series and highest-rated genres,

Exploring family and youth content,

Enhancing metadata completeness, and

Applying audience rating trends to curation and content planning.

These data-driven insights can inform wiser content investment, improved user targeting, and ultimately stronger business performance in the highly competitive streaming landscape.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***