<a href="https://colab.research.google.com/github/ankita1120/almabetter/blob/publicBranch/Amazon_Prime_TV_Shows_and_Movies_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Name - Amazon Prime TV Shows and Movies

##### **Project Type**    - EDA
##### **Contribution**    - Individual

# Project Summary -

Project Description
Amazon Prime Video is one of the leading streaming platforms, offering a vast collection of TV shows and movies. As competition in the streaming industry intensifies, understanding content trends and audience preferences becomes crucial. With thousands of titles available, identifying patterns in content diversity, regional availability, and user engagement can provide valuable insights for content strategy and decision-making.

This project focuses on analyzing Amazon Prime Video’s content catalog to uncover key trends in content distribution, popularity, and audience preferences. By studying the platform’s offerings, we aim to understand the dominant genres, the evolution of content over time, and the factors contributing to a title’s success.

Problem Statement
With a rapidly expanding content library, Amazon Prime Video faces challenges in curating content that resonates with its audience. Some critical questions that need to be addressed include:

*   Which genres are most prevalent on the platform?
*   How has the content library evolved over the years?
*   What are the highest-rated or most popular shows and movies?
*   How does content distribution vary across different regions?
*   Who are the most frequently featured actors and directors in successful titles?

By answering these questions, we can gain insights into audience viewing habits, content performance, and regional content preferences, which can help optimize content acquisition and investment strategies.

To explore these questions, the project involves analyzing a dataset containing details about Amazon Prime Video’s content. The dataset includes key information such as title names, genres, release years, IMDb ratings, production countries, and cast details.

Solution Approach
The analysis will focus on:

*   Content Diversity: Examining the distribution of genres and identifying which types of content are most popular among viewers.
*   Trends Over Time: Studying how the number of shows and movies has grown over the years and analyzing shifts in audience preferences.
*   IMDb Ratings & Popularity: Identifying the highest-rated and most popular titles on the platform and understanding what contributes to their success.
*   Regional Distribution: Analyzing the availability of content across different production countries and understanding how regional content influences audience engagement.
*   Influence of Cast & Crew: Examining the role of actors and directors in determining the success of a title.

By extracting these insights, businesses, content creators, and analysts can better understand what works well on Amazon Prime Video and make informed decisions about future content strategies.



# GitHub Link -

# Problem Statement -

The entertainment industry, especially streaming platforms and movie production companies, relies heavily on data-driven insights to make strategic decisions. With the rapid expansion of digital content, understanding trends in movies and TV shows is crucial for businesses to maximize engagement, optimize content recommendations, and improve revenue generation.


#### **Define Your Business Objective?**

Business Objective

The goal of this analysis is to extract insights from the movies and TV shows dataset to help stakeholders such as streaming platforms, production houses, and marketing teams make informed decisions. The key objectives include:

*   Content Strategy Optimization
    *   Identify trending genres and themes based on past data.
    *   Analyze production trends over different years to understand shifts in audience preferences.
*   Viewer Engagement & Personalization
    *   Identify patterns in cast and crew involvement to determine factors influencing success.
    *   Understand the correlation between ratings and genres to enhance content recommendations.
*   Market Expansion & Competitive Analysis
    *   Explore global trends in movie production and distribution.
    *   Identify gaps in content availability across different regions and platforms.
*   Revenue & Business Growth
    *   Analyze the impact of cast and crew popularity on content performance.
    *   Identify potential high-performing content for investment opportunities.



# General Guidelines : -

1.   Well-structured, formatted, and commented code is required. ✅
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.✅
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]✅

3.   Each and every logic should have proper comments.✅
4.   You may add as many charts as you want. Make sure the following format is answered for each and every chart.✅

```
            # Chart visualization code
```

*   Why did you pick the specific chart?✅
*   What is/are the insight(s) found from the chart?✅
*   Will the gained insights help creating a positive business impact?
    Are there any insights that lead to negative growth? Justify with specific reason.✅

# Let's Begin !

1. Know Your Data-


### Import Libraries

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


###  2.Data Uploading

In [2]:
#Dataset Upload
titles = '/content/drive/MyDrive/titles.csv'
credits = '/content/drive/MyDrive/credits.csv'
print("Upload credits.csv and titles.csv")

Upload credits.csv and titles.csv


###  3. Load Datasets with Exception Handling

In [None]:
# ✅ Step 3: Load Datasets with Exception Handling
try:
    df_titles = pd.read_csv(titles)
    df_credits = pd.read_csv(credits)
    print("Datasets loaded successfully.")
except FileNotFoundError:
    print("Error: One or more dataset files not found. Please check file paths.")
    raise  # Stop here: every later cell needs df_titles and df_credits
except Exception as e:
    print(f"Error loading datasets: {e}")
    raise  # Re-raise so the failure is not hidden behind a later NameError
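
The loading step can also be wrapped in a small helper so that a missing file fails early with a clear message. This is a minimal sketch; `load_csv` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV into a DataFrame, failing early with a clear message.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.is_file():
        # Raise instead of printing, so later cells never run on missing data
        raise FileNotFoundError(f"Dataset not found: {path}")
    return pd.read_csv(path)
```

With the path variables from Step 2, usage would look like `df_titles = load_csv(titles)` and `df_credits = load_csv(credits)`.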


### 4. Display First Few Rows

In [None]:
#✅ Step 4: Display First Few Rows
print("\n🎬 Titles Dataset Preview:")
print(df_titles.head())
print("\n👤 Credits Dataset Preview:")
print(df_credits.head())

### 5.Duplicate Values in Datasets

In [None]:
# ✅ Step 5: Check for Duplicate Values in Datasets
# Duplicates can lead to incorrect insights, so we count fully identical rows here.

print("\n🎬 Titles Dataset Duplicate Values:")
print(df_titles.duplicated().sum())

print("\n👤 Credits Dataset Duplicate Values:")
print(df_credits.duplicated().sum())
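
The cell above only counts duplicate rows. A minimal sketch of removing them follows, using a small made-up frame; `toy` and its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for df_credits; values are made up.
toy = pd.DataFrame({
    "person_id": [1, 1, 2],
    "id": ["tm1", "tm1", "tm1"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR"],
})

# duplicated() flags rows that repeat an earlier row in every column
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_dupes} duplicate row(s) removed, {len(deduped)} rows remain")
```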

### 5.1. Missing Values/Null Values Count

In [None]:
# Step 5.1: Missing Values/Null Values Count
print("\n🎬 Titles Dataset Missing Values:")
print(df_titles.isnull().sum())
print("\n👤 Credits Dataset Missing Values:")
print(df_credits.isnull().sum())
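
Raw null counts are easier to judge as a share of all rows. The sketch below shows one way to compute that on a made-up frame; the column names mirror the titles dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.1, np.nan, 6.4, np.nan],
    "seasons": [np.nan, np.nan, 2.0, np.nan],
})

# isnull().mean() is the fraction of missing rows per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Columns with a high share of missing values (such as `seasons`, which is empty for movies by design) usually need a different treatment than columns with a handful of gaps.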


### 5.2. Dataset Rows & Columns count

In [None]:
# Get rows and columns for df_titles
titles_rows, titles_cols = df_titles.shape

# Get rows and columns for df_credits
credits_rows, credits_cols = df_credits.shape

# Print the results
print(f"Titles Dataset: {titles_rows} rows, {titles_cols} columns")
print(f"Credits Dataset: {credits_rows} rows, {credits_cols} columns")

### 5.3 What did you know about your dataset?

I have two datasets:

*   df_titles: Contains information about movies and TV shows, including titles, genres, release years, ratings, and production details.
*   df_credits: Contains information about the cast and crew of movies and TV shows, such as actors, directors, and their roles.

Key Observations:

*   Data Loading: I have loaded these datasets from CSV files stored in my Google Drive.
*   Data Exploration: I have performed initial data exploration using functions like head(), shape, info(), duplicated().sum(), and isnull().sum().

Dataset Features:

*   The df_titles dataset provides information about the content itself.
*   The df_credits dataset contains details about the people involved in creating the content.

Specific Insights:

*   Size: I have determined the number of rows and columns in each dataset using shape.
*   Data Types: I have used info() to check the data types of each column and identify missing values.
*   Duplicates: I have checked for and counted duplicate rows using duplicated().sum().
*   Missing Values: I have identified and counted missing values (nulls) in each column using isnull().sum().

Next Steps:

1.   Data Cleaning: Handle missing values and remove duplicates to prepare the data for analysis.
2.   Data Transformation: Consider merging the two datasets based on a common identifier (such as a movie ID) to combine content and cast/crew information.
3.   Exploratory Data Analysis (EDA): Visualize and analyze the data to uncover trends, content popularity, and other significant insights.
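
Before merging on a shared identifier, it can help to check how well the keys of the two tables line up. The sketch below uses two made-up frames and assumes the key column is called `id`; `validate` and `indicator` are standard `DataFrame.merge` options.

```python
import pandas as pd

# Hypothetical toy frames; ids and names are made up.
titles_toy = pd.DataFrame({"id": ["tm1", "ts2", "tm3"], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame({
    "id": ["tm1", "tm1", "ts2", "tm9"],
    "name": ["P1", "P2", "P3", "P4"],
})

merged = titles_toy.merge(
    credits_toy,
    on="id",
    how="outer",              # keep unmatched rows from both sides for inspection
    validate="one_to_many",   # raises if a title id is repeated on the left
    indicator=True,           # adds a '_merge' column: both / left_only / right_only
)
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are titles without credits and rows marked `right_only` are credits without a title; an inner merge silently drops both groups.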

### 6. Dataset Info


In [None]:
# ✅ Step 6: Dataset Info
# info() prints its report directly and returns None, so it is not wrapped in print()
print("\n🎬 Titles Dataset Info:")
df_titles.info()
print("\n👤 Credits Dataset Info:")
df_credits.info()


### 7.Handle Missing Values


In [None]:
#✅ Step 7: Handle Missing Values
# Convert 'imdb_score' to numeric, handling errors
df_titles['imdb_score'] = pd.to_numeric(df_titles['imdb_score'], errors='coerce')
# Now fill NaN values with the median (assign back; chained inplace fillna is deprecated in recent pandas)
df_titles['imdb_score'] = df_titles['imdb_score'].fillna(df_titles['imdb_score'].median())
# Fill only the text columns with 'Unknown' so numeric columns keep their numeric dtype
text_cols = df_titles.select_dtypes(exclude='number').columns
df_titles[text_cols] = df_titles[text_cols].fillna("Unknown")
df_credits.fillna("Unknown", inplace=True)
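
The imputation idea in this step (median for a numeric score, a placeholder label for text) can be sketched on a made-up frame. The column names mirror the titles dataset; the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "imdb_score": [6.0, np.nan, 8.0],
    "seasons": [np.nan, 2.0, np.nan],
    "age_certification": ["PG", None, "R"],
})

# Numeric score: fill gaps with the column median
toy["imdb_score"] = toy["imdb_score"].fillna(toy["imdb_score"].median())

# Text columns only: fill gaps with a placeholder label
text_cols = toy.select_dtypes(exclude="number").columns
toy[text_cols] = toy[text_cols].fillna("Unknown")

print(toy.dtypes)
```

Restricting the placeholder to text columns leaves `seasons` numeric, so it can still be summarized with `describe()` and filled with a numeric default later.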

### 8. Understanding Your Variables

In [None]:
# Dataset Columns
# Display columns of df_titles
print("Columns in df_titles:", df_titles.columns)

# Display columns of df_credits
print("\nColumns in df_credits:", df_credits.columns)

### 9.Dataset Describe

In [None]:
# Display descriptive statistics for df_titles
print("\n🎬 Titles Dataset Summary Statistics:")
print(df_titles.describe())

# Display descriptive statistics for df_credits
print("\n👤 Credits Dataset Summary Statistics:")
print(df_credits.describe())


### Variables Description

Description of Variables in df_titles and df_credits Datasets

📌 df_titles Dataset (Movies & TV Shows Metadata)
This dataset contains detailed information about movies and TV shows, including their title, genre, release year, ratings, and production details.

| Variable | Description |
|---|---|
| id | A unique identifier for each title (movie or TV show). |
| title | The name of the movie or TV show. |
| type | Specifies whether the title is a 'MOVIE' or a 'SHOW'. |
| description | A brief summary of the plot. |
| release_year | The year the title was released. |
| age_certification | The age rating or certification (e.g., 'PG-13', 'R'). |
| runtime | The duration of the movie or TV show in minutes. |
| genres | A list of genres associated with the title (e.g., 'Comedy', 'Drama'). |
| production_countries | A list of countries where the title was produced. |
| seasons | The number of seasons (applicable only for TV shows; null for movies). |
| imdb_id | The IMDb identifier for the title. |
| imdb_score | The IMDb rating of the title. |
| imdb_votes | The number of votes the title received on IMDb. |
| tmdb_popularity | The popularity score of the title on TMDb. |
| tmdb_score | The rating of the title on TMDb. |

📌 df_credits Dataset (Cast & Crew Information)
This dataset contains details about the cast and crew involved in the movies and TV shows.

| Variable | Description |
|---|---|
| person_id | A unique identifier for each person (actor, director, etc.). |
| id | The unique identifier for the title (movie or TV show); this links to the df_titles dataset. |
| name | The name of the person (actor, director, etc.). |
| character | The name of the character played by the actor (null for directors, etc.). |
| role | The role of the person in the title (e.g., 'ACTOR', 'DIRECTOR'). |

Understanding the Variables & Their Usage
These datasets provide valuable information about the content (movies and TV shows) and the people involved in their creation. This data can be leveraged to:

✅ Analyze Content Trends – Explore the popularity of different genres, release years, and production countries.
✅ Understand Audience Preferences – Examine the relationship between ratings, genres, and cast/crew.
✅ Personalize Recommendations – Use the data to recommend titles based on user preferences.
✅ Make Business Decisions – Gain insights for content acquisition, investment strategies, and marketing campaigns.



### 10.Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# For df_titles
for column in df_titles.columns:
    unique_values = df_titles[column].unique()
    print(f"Unique values for '{column}':\n{unique_values}\n")

# For df_credits
for column in df_credits.columns:
    unique_values = df_credits[column].unique()
    print(f"Unique values for '{column}':\n{unique_values}\n")
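
Printing every unique value gets long for high-cardinality columns such as titles or names. A compact alternative is to count distinct values per column first; the sketch below uses a made-up frame.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE"],
    "release_year": [2019, 2020, 2020, 2021],
    "title": ["A", "B", "C", "D"],
})

# nunique() counts distinct non-null values in each column
cardinality = toy.nunique().sort_values()
print(cardinality)
```

Columns with only a few distinct values (like `type`) are good candidates for count plots, while columns where nearly every value is unique behave like identifiers.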

### 11. ***Data Wrangling***

In [None]:
# Data Wrangling

# 1. Merge DataFrames on 'id' column
df = pd.merge(df_titles, df_credits, on='id', how='inner')

# 2. Drop unnecessary columns
columns_to_drop = ['description', 'imdb_id']
df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

# 3. Convert 'tmdb_popularity' and 'tmdb_score' to numeric before filling NaNs
df['tmdb_popularity'] = pd.to_numeric(df['tmdb_popularity'], errors='coerce')
df['tmdb_score'] = pd.to_numeric(df['tmdb_score'], errors='coerce')

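# Fill missing values for numeric columns using median or mean
# (assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas)
df['tmdb_popularity'] = df['tmdb_popularity'].fillna(df['tmdb_popularity'].median())
df['tmdb_score'] = df['tmdb_score'].fillna(df['tmdb_score'].mean())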

# 4. Convert 'release_year' to datetime format (handling errors)
df['release_year'] = pd.to_datetime(df['release_year'], format='%Y', errors='coerce')

# 5. Handle missing values for categorical columns
df['age_certification'] = df['age_certification'].fillna('Unknown')
df['genres'] = df['genres'].fillna('')  # Ensuring consistency
df['production_countries'] = df['production_countries'].fillna('')

# Convert 'seasons' column properly
df['seasons'] = pd.to_numeric(df['seasons'], errors='coerce')  # Convert to numeric, NaN for invalid values
df['seasons'] = df['seasons'].fillna(0).astype(int)  # Movies have no seasons: fill NaN with 0, then cast to int

# Handle missing IMDb scores more effectively (group by genre)
df['imdb_score'] = df.groupby('genres')['imdb_score'].transform(lambda x: x.fillna(x.mean()))
df['imdb_score'] = df['imdb_score'].fillna(df['imdb_score'].mean())  # Fallback for remaining NaNs

# Fill missing IMDb votes with 0 (assuming no votes for missing values)
df['imdb_votes'] = pd.to_numeric(df['imdb_votes'], errors='coerce').fillna(0)

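# 6. Clean 'genres' and 'production_countries' (strip list punctuation, split and strip)
# The raw values appear to be stringified lists such as "['drama', 'comedy']"
# (Chart 2 strips the same characters), so remove brackets and quotes before splitting
for col in ['genres', 'production_countries']:
    cleaned = df[col].astype(str).str.replace(r"[\[\]']", '', regex=True)
    df[col] = cleaned.apply(lambda x: [item.strip() for item in x.split(',') if item.strip()])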

# 7. Encode categorical variables
## Explode 'genres' before encoding to handle multiple values per row
df_exploded = df.explode('genres')

## Get dummies for 'genres' (one-hot encoding), then collapse back to one row per
## original index: the exploded index repeats, which would break an axis=1 concat
genres_dummies = pd.get_dummies(df_exploded['genres'], prefix='genre').groupby(level=0).max()
df = pd.concat([df, genres_dummies], axis=1)

# 8. Convert list columns to strings before dropping duplicates
df['genres'] = df['genres'].apply(lambda x: ', '.join(x) if isinstance(x, list) else str(x))
df['production_countries'] = df['production_countries'].apply(lambda x: ', '.join(x) if isinstance(x, list) else str(x))

# Now safely drop duplicates
df.drop_duplicates(inplace=True)

# 9. Display DataFrame information (info() prints directly and returns None)
df.info()

# 10. Display the first few rows to verify changes
print(df.head())


What all manipulations have you done and insights you found?

Data Manipulations & Insights from Data Wrangling

🔹 Data Manipulations

1️⃣ Merging Datasets

*   ✅ Action: Merged df_titles and df_credits into a single DataFrame (df) using the common 'id' column. This combines movie/show information with cast and crew details.
*   ✅ Potential Insight: This integration enables analysis of relationships between content features (e.g., genre, release year) and cast/crew involvement (e.g., actors, directors).

2️⃣ Dropping Irrelevant Columns

*   ✅ Action: Removed unnecessary columns like 'description' and 'imdb_id', which were not required for the analysis.
*   ✅ Potential Insight: Reducing dataset size and complexity enhances processing efficiency without losing valuable information.

3️⃣ Handling Missing Values

*   ✅ Action: Filled missing values using appropriate substitutes. Categorical columns like 'age_certification' were filled with "Unknown". Among the numerical columns, 'imdb_score' was imputed with the median in Step 7 (with a per-genre mean as a fallback during wrangling), 'tmdb_popularity' with the median, 'tmdb_score' with the mean, and 'seasons' and 'imdb_votes' with 0.
*   ✅ Potential Insight: Ensures data completeness, prevents errors due to missing data, and improves analysis accuracy.

4️⃣ Data Type Conversion

*   ✅ Action: Converted 'release_year' to datetime format.
*   ✅ Potential Insight: Enables time-based analysis, such as examining content production trends over different years.

5️⃣ Cleaning & Converting Categorical Data

*   ✅ Action: Cleaned string-based columns like 'genres' and 'production_countries' by converting comma-separated values into lists of individual items.
*   ✅ Potential Insight: Simplifies analysis of genre and country distributions, aiding in identifying popular genres and production trends.

6️⃣ Creating Dummy Variables

*   ✅ Action: Created dummy variables for categorical features such as 'genres'.
*   ✅ Potential Insight: Facilitates modeling and analysis by enabling exploration of relationships between genres, ratings, and popularity.

🔹 Insights Gained from Data Wrangling

While data wrangling primarily focuses on preparing the dataset for analysis, it also reveals some initial insights:

*   🔹 Data Completeness: Identifying and handling missing values provides insights into data quality and potential biases.
*   🔹 Data Distribution: Cleaning categorical columns (e.g., 'genres', 'production_countries') offers an initial glimpse into genre distributions and production trends.
*   🔹 Data Relationships: Merging datasets allows for exploration of relationships between content attributes and cast/crew involvement.

🔹 Next Steps: Moving into Exploratory Data Analysis (EDA)

As we transition into EDA, we will:

*   ✅ Use visualizations and statistical summaries to uncover deeper patterns, trends, and relationships in the data.
*   ✅ Identify factors influencing ratings, popularity, and content success.
*   ✅ Generate actionable insights for data-driven decision-making.
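
The genre cleaning and dummy-variable steps described above can be sketched end to end on a made-up frame. This sketch assumes the raw `genres` values are stringified Python lists such as `"['drama', 'comedy']"`, and it uses `ast.literal_eval` to parse them, which is one possible approach rather than the one used in the wrangling cell.

```python
import ast

import pandas as pd

# Hypothetical toy data; ids and genres are made up.
toy = pd.DataFrame({
    "id": ["tm1", "ts2", "tm3"],
    "genres": ["['drama', 'comedy']", "['comedy']", "[]"],
})

# Parse the stringified lists into real Python lists
toy["genres"] = toy["genres"].apply(ast.literal_eval)

# One row per (title, genre), then one indicator column per genre
exploded = toy.explode("genres")
dummies = pd.get_dummies(exploded["genres"], prefix="genre", dtype=int)

# Collapse back to one row per title so the index lines up with `toy` again
multi_hot = dummies.groupby(level=0).max()
result = toy.join(multi_hot)
print(result)
```

Grouping by the original index is the step that matters: after `explode`, a title with two genres occupies two rows, and the indicator columns must be folded back into one row before they are attached to the title-level frame.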

## ***12. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### ✅ Chart1: Distribution of Title Types (TV Show / Movie)


In [None]:
# ✅ Step 1: Distribution of Title Types (TV Show / Movie)
plt.figure(figsize=(8, 6))
sns.countplot(x='type', data=df_titles)
plt.title('Distribution of Title Types')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a countplot (using seaborn's countplot function) to visualize the distribution of content types (Movie or Show) because:

Purpose: The primary goal was to show the frequency or count of each content type. Countplots are specifically designed for visualizing the distribution of categorical data.

Data Type: The 'type' column is categorical, containing two distinct categories: 'MOVIE' and 'SHOW'. Countplots are ideal for visualizing the distribution of categorical variables.

Clarity: Countplots provide a clear and concise representation of the data. The height of each bar directly corresponds to the frequency of the category, making it easy to compare the proportions of movies and shows.

Simplicity: Countplots are simple to create and interpret, making them an effective way to communicate basic insights about categorical data distributions.

##### 2. What is/are the insight(s) found from the chart?

The insight derived from the countplot:

Insight:

The countplot visually depicts that there are considerably more movies than TV shows available on Amazon Prime Video. This observation suggests that the platform may focus more on offering a wider selection of movies compared to TV series.

Reasoning:

The bar representing "MOVIE" in the countplot is substantially taller than the bar representing "SHOW". This difference in bar heights directly indicates the disparity in the number of movies and TV shows available on the platform. A larger bar for movies signifies a higher count or frequency of movies in the dataset, leading to the conclusion that Amazon Prime Video has more movies than shows in its content library.
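
Bar heights can also be expressed as shares, which makes the size of the gap explicit. The sketch below uses made-up data; the 75/25 split is purely illustrative and says nothing about the actual catalog.

```python
import pandas as pd

# Hypothetical toy data; the split is made up for illustration.
toy = pd.DataFrame({"type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"]})

# normalize=True returns fractions instead of raw counts
share = toy["type"].value_counts(normalize=True).mul(100).round(1)
print(share)
```

Running the same expression on `df_titles['type']` would give the actual movie and show percentages behind the countplot.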



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The business impact of the insights gained from the countplot, along with potential negative growth aspects:

Positive Business Impact

The insight that Amazon Prime Video has significantly more movies than TV shows can lead to several positive business impacts:

Content Acquisition and Licensing: The platform can leverage this insight to strategize its content acquisition efforts. By focusing on acquiring or licensing a wider variety of movies, it can cater to a larger audience and potentially attract new subscribers.

Targeted Marketing and Recommendations: Understanding content distribution allows for more targeted marketing campaigns. Amazon Prime Video can promote its vast movie library to movie enthusiasts, attracting a specific audience segment. It can also develop recommendation algorithms that prioritize movies for viewers based on their preferences.

Content Diversity and User Engagement: Offering a diverse movie catalog enhances user engagement. Viewers have more options to choose from, leading to increased viewing time and overall platform usage.

Insights Leading to Negative Growth

While the dominance of movies appears positive, it could also potentially lead to negative growth if not carefully addressed:

Limited TV Show Selection: If the platform neglects TV show content, it might alienate viewers who prefer serialized storytelling or specific genres primarily found in TV series. This could result in losing potential subscribers to competitors with a stronger TV show offering.

Content Imbalance: Overemphasis on movies might create an imbalance in content diversity. This could limit the platform's appeal to broader audiences and hinder overall user growth.

Genre Gaps: If the movie library lacks diversity in genres or fails to address emerging trends, it might miss opportunities to attract viewers with specific preferences. This could lead to a decline in user engagement and subscriber acquisition.

### Chart2: Visualizing Top 10 Most Common Genres


In [None]:
# ✅ Chart 2: Visualizing Top 10 Most Common Genres

# Ensure 'genres' is properly formatted before splitting
df_titles['genres'] = df_titles['genres'].astype(str).str.replace(r"[\[\]']", '', regex=True)

# Extract top 10 most common genres
genres = df_titles['genres'].str.split(',').explode().str.strip().value_counts().head(10)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x=genres.index, y=genres.values, palette="magma")

# Enhancing visualization
plt.title("Top 10 Most Common Genres on Amazon Prime Video", fontsize=14, fontweight='bold')
plt.xlabel("Genre", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability

# Annotate bars with values
for i, v in enumerate(genres.values):
    plt.text(i, v + 5, str(v), ha='center', fontsize=10, fontweight='bold')

plt.show()

# 🎯 Explanation:
# - This bar chart was chosen to show the distribution of movie genres.
# - The most frequent genres can help Amazon Prime decide which genres attract more viewership.
# - Business Impact: Helps in content recommendation and acquisition strategies.


### 1. Why did you pick the specific chart?

I chose a barplot (using seaborn's barplot function) to visualize the top 10 most common genres on Amazon Prime Video because:

*   Purpose: The goal was to show the frequency or count of each genre and compare them. Barplots excel at visualizing the distribution of categorical data and making comparisons between different categories.
*   Data Type: The 'genres' column is categorical, containing different genre labels. Barplots are well-suited for representing categorical data.
*   Clarity: Barplots provide a clear and concise representation of the data. The height of each bar directly corresponds to the frequency of the category, making it easy to compare the popularity of different genres.
*   Ranking: Since we wanted to focus on the top 10 genres, a barplot naturally allows for ranking and highlighting the most frequent categories.

A barplot was the most appropriate choice for visualizing the top 10 most common genres on Amazon Prime Video because it effectively represents the frequency of each genre and facilitates clear comparisons between them. This aligns with the objective of the chart, which was to identify the most prevalent genres on the platform.

### 2. What is/are the insight(s) found from the chart?

The insights from the barplot visualizing the top 10 most common genres on Amazon Prime Video:

Insights:

Dominant Genres: The chart clearly shows that "Drama," "Comedy," and "Documentary" are the most prevalent genres on Amazon Prime Video. These genres have the highest counts, indicating their popularity among viewers.

Popularity of International Content: The presence of "International" as a prominent genre suggests that Amazon Prime Video offers a significant amount of content from various countries, catering to a global audience.

Action & Thriller: While not as dominant as Drama or Comedy, genres like "Action" and "Thriller" also have a notable presence, indicating a considerable audience for these types of content.

Family & Reality: The inclusion of genres like "Family" and "Reality" in the top 10 highlights Amazon Prime Video's effort to offer diverse content suitable for various age groups and interests.

Reasoning:

*   Bar Height: The height of each bar in the barplot directly corresponds to the frequency of that genre in the dataset. Higher bars indicate more content within that genre, suggesting greater popularity or availability on the platform.
*   Top 10: The chart focuses on the top 10 most common genres, providing a clear view of the dominant content categories on Amazon Prime Video.
*   Genre Labels: The labels on the x-axis clearly identify each genre, enabling easy interpretation and comparison of their frequencies.

### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The business impact of the insights gained from the barplot visualizing the top 10 most common genres on Amazon Prime Video, along with potential negative growth aspects:

Positive Business Impact

The insights derived from the barplot can have a positive impact on Amazon Prime Video's business in several ways:

Content Acquisition and Licensing: By identifying the most popular genres (Drama, Comedy, Documentary, etc.), Amazon Prime Video can focus on acquiring or licensing more content in these categories. This ensures they cater to the preferences of a larger audience and attract new subscribers.

Targeted Marketing and Recommendations: Understanding genre preferences allows for targeted marketing campaigns. Amazon Prime Video can promote content aligned with popular genres to specific audience segments, increasing engagement and viewership. Recommendation systems can also be enhanced to suggest titles based on user preferences for these genres.

Content Diversity and User Engagement: While focusing on popular genres, it's crucial to maintain content diversity to cater to a wider range of interests. This balance ensures continued user engagement and prevents the platform from becoming too niche.

Insights Leading to Negative Growth

While focusing on popular genres is beneficial, overemphasis on them could potentially lead to negative growth:

Genre Saturation: If the platform becomes oversaturated with content in specific genres (e.g., Drama, Comedy), it might alienate viewers seeking other types of content. This could result in losing potential subscribers to competitors with a more diverse library.

Neglecting Niche Genres: Overemphasis on mainstream genres might lead to neglecting niche genres, which could disappoint viewers with specific preferences. This could result in a decline in user engagement and subscriber acquisition.

Missed Opportunities: If Amazon Prime Video solely focuses on existing popular genres, it might miss opportunities to capitalize on emerging trends or untapped genres. This could limit its appeal to broader audiences and hinder overall user growth.

### Chart 3: Visualizing Distribution of IMDb Scores on Amazon Prime Video

In [None]:
# ✅ Chart 3: Visualizing  Distribution of IMDb Scores on Amazon Prime Video

# Ensure 'imdb_score' is numeric
df['imdb_score'] = pd.to_numeric(df['imdb_score'], errors='coerce')

# Drop NaN values
df = df.dropna(subset=['imdb_score'])

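# Plot the distribution
# The merged frame repeats each title once per cast/crew member, so keep one row per title
title_scores = df.drop_duplicates(subset='id')['imdb_score']
plt.figure(figsize=(12, 6))
sns.histplot(title_scores, bins=30, kde=True, color='royalblue', edgecolor='black')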

# Improve plot aesthetics
plt.title('Distribution of IMDb Scores on Amazon Prime Video', fontsize=14)
plt.xlabel('IMDb Score', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)  # Add grid for better readability
plt.show()

# Explanation:
# Why this chart? To visualize the distribution of IMDb scores and identify trends.
# Insights: Reveals if ratings are high, low, or widely spread.
# Business Impact: Helps Amazon Prime improve content acquisition and recommendations based on popular ratings.


### 1. Why did you pick the specific chart?


I chose a histogram (using seaborn's histplot function) to visualize the distribution of IMDb scores on Amazon Prime Video because:

*   Purpose: The goal was to show the distribution of IMDb scores, highlighting the frequency of different score ranges. Histograms are specifically designed to visualize the distribution of numerical data.
*   Data Type: The 'imdb_score' column is numerical, representing the IMDb rating of a title. Histograms are ideal for visualizing the distribution of numerical variables.
*   Distribution Shape: Histograms effectively reveal the shape of the distribution, such as whether it's skewed, symmetrical, or has multiple peaks. This information can be valuable in understanding the overall rating patterns of content on the platform.
*   Frequency Representation: Histograms provide a clear representation of the frequency or count of data points within specific score ranges (bins). This allows us to identify which score ranges are most common.
*   KDE: The addition of a Kernel Density Estimation (KDE) plot on top of the histogram provides a smoother representation of the distribution, further enhancing interpretation.

In summary, a histogram with KDE was the most appropriate choice for visualizing the distribution of IMDb scores because it effectively represents the frequency of different score ranges and reveals the overall shape of the distribution. This aligns with the objective of the chart, which was to provide an overview of IMDb score patterns on Amazon Prime Video.



### 2. What is/are the insight(s) found from the chart?

Insights:

Most content is rated between 6 and 8: The histogram shows that the majority of titles on Amazon Prime Video have IMDb scores falling within the 6 to 8 range. This suggests that a significant portion of the content is considered to be of good quality by viewers.

Few titles have very low or very high scores: The distribution is somewhat bell-shaped, with fewer titles having very low scores (below 4) or very high scores (above 9). This indicates that extremely poor or exceptionally outstanding content is less common on the platform.

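The distribution is slightly skewed: The histogram is described as having a slightly longer tail on the right side. A longer right tail is a right (positive) skew, in which a small number of high scores pull the mean above the median; by itself it does not mean that more titles score above the average than below it. The direction of the skew is worth confirming numerically, because a longer tail on the low side would instead be a left (negative) skew, with most titles concentrated at the higher scores.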

Reasoning:

Peak of the distribution: The peak of the histogram falls within the 6 to 8 score range, indicating the highest frequency of titles within this range.

Tails of the distribution: The tails on both sides of the peak show the frequency of titles with lower and higher scores, respectively.

Skewness: The shape of the distribution indicates the skewness, that is, which side has the longer tail and where the data points are concentrated.

In summary, the histogram reveals that the majority of content on Amazon Prime Video has IMDb scores in the 6 to 8 range, with fewer titles having extreme scores. The distribution is slightly skewed rather than perfectly symmetric, so the mean and median scores differ slightly.
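The claims about the 6 to 8 range and the skew can be checked numerically rather than read off the chart. The sketch below is a minimal illustration on a hypothetical list of scores (not the real data); in the notebook, `scores` would be replaced by `df['imdb_score'].dropna()`.

```python
import pandas as pd

# Hypothetical stand-in for df['imdb_score']; replace with the real column.
scores = pd.Series(
    [2.5, 4.8, 5.5, 6.1, 6.4, 6.8, 7.0, 7.2, 7.5, 8.1], name="imdb_score"
)

# Share of titles whose score falls in the 6-8 band (inclusive on both ends).
share_6_to_8 = scores.between(6, 8).mean()

# Sample skewness: negative means a longer left tail, positive a longer right tail.
skewness = scores.skew()

# Comparing mean and median is a quick cross-check on the direction of the skew.
mean_score, median_score = scores.mean(), scores.median()

print(f"Share in 6-8: {share_6_to_8:.0%}")
print(f"Skewness: {skewness:.2f}, mean: {mean_score:.2f}, median: {median_score:.2f}")
```

A negative skewness together with a mean below the median would point to a longer tail on the low-score side; the opposite signs would support the right-skew reading.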

### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


### Chart 4: Content Release Trend Over Time on Amazon Prime Video


In [None]:
# Chart 4: Content Release Trend Over Time on Amazon Prime Video

# Group data by release year and content type, then count the occurrences
content_by_year = df.groupby(['release_year', 'type'])['id'].count().reset_index(name='count')

# Create the line plot
plt.figure(figsize=(12, 6))  # Adjust figure size if needed
sns.lineplot(x='release_year', y='count', hue='type', data=content_by_year)

plt.title('Content Release Trend Over Time on Amazon Prime Video')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.legend(title='Content Type')  # Add a legend
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability

plt.show()

# Explanation:
# Why this chart? To track content release trends over time.
# Insights: Shows whether movies or TV shows were released more in different years.
# Business Impact: Helps Amazon Prime in content planning and acquisition based on past trends.

### 1. Why did you pick the specific chart?

A line plot was chosen to visualize the content release trend over time on Amazon Prime Video for the following reasons:

Reasons for Choosing a Line Plot

Purpose: The primary goal was to show the trend of content releases over time, highlighting how the number of movies and TV shows has changed over the years. Line plots are specifically designed to visualize trends and patterns in data over a continuous variable, such as time.

Data Type: The data involves two numeric variables: 'release_year' (an ordered time axis) and 'count' (the number of titles with that release year). Line plots are well suited to a numeric quantity tracked along an ordered axis.

Trend Visualization: Line plots are highly effective at illustrating trends and changes in data over time. The connected data points clearly show increases, decreases, and fluctuations in the number of releases, making it easy to identify patterns.

Comparison: By using separate lines for movies and TV shows (using the 'hue' parameter), the line plot allows for direct comparison of their release trends over time. This helps in understanding how the platform's focus on different content types has evolved.

Clarity and Readability: Line plots are generally clear and easy to interpret. The lines, axes, and labels provide a straightforward representation of the data, making it easy for viewers to grasp the trend.

### 2. What is/are the insight(s) found from the chart?

The line plot of the content release trend over time on Amazon Prime Video suggests the following insights:

Insights:

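Overall Content Growth: The number of titles rises significantly with release year, indicating a content library that is weighted toward recent releases. Note that 'release_year' is the year a title was originally released, which is not necessarily the year it was added to Amazon Prime Video.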

Movie Dominance: Movies have consistently been the dominant content type released on the platform, with a much higher number of movie releases compared to TV shows throughout the observed period.

Recent TV Show Growth: While movies have always been more prevalent, there has been a notable increase in the release of TV shows in recent years, suggesting a growing focus on this content type by Amazon Prime Video.

Fluctuations: There are some fluctuations in the release trends for both movies and TV shows, particularly in recent years. These fluctuations could be attributed to various factors, such as production delays, content acquisition strategies, and market dynamics.

Potential Shift in Focus: The increasing trend of TV show releases in recent years might indicate a potential shift in focus by Amazon Prime Video towards offering more serialized content and original TV series.

Reasoning:

Upward Trend: The overall upward trajectory of both lines in the plot indicates an increasing trend in content releases over time.

Line Height: The higher position of the "Movie" line compared to the "TV Show" line throughout the plot indicates a consistently higher number of movie releases.

Slope Changes: The steeper slope of the "TV Show" line in recent years suggests a faster growth rate for TV show releases compared to movies.

Fluctuations: The variations in the slope of both lines indicate fluctuations in the release patterns over time.

In summary, the line plot reveals a growing content library on Amazon Prime Video, with movies being the dominant content type historically. However, there has been a notable increase in TV show releases in recent years, suggesting a potential shift in focus towards serialized content.
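Line height and slope can also be compared as numbers. The following is a minimal sketch on a hypothetical miniature of the titles table, using the same grouping as the chart; the `'MOVIE'` and `'SHOW'` labels follow the values used elsewhere in this notebook, and the sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of the titles table; the real df has one row per title.
sample = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e", "f", "g"],
    "release_year": [2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "MOVIE", "SHOW", "SHOW"],
})

# Titles per release year and type, one column per type (missing combinations become 0).
counts = sample.groupby(["release_year", "type"])["id"].count().unstack(fill_value=0)

# Share of each year's titles that are TV shows.
show_share = counts["SHOW"] / counts.sum(axis=1)

# Year-over-year relative change in title counts per type (NaN for the first year).
yoy_change = counts.pct_change()

print(counts)
print(show_share.round(2))
print(yoy_change)
```

A rising `show_share` would back the "narrowing gap" reading, while `yoy_change` makes the slope comparison between the two lines explicit.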

### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This section covers the business impact of the insights gained from the line plot of the content release trend on Amazon Prime Video, including any potential for negative growth:

Positive Business Impact

The insights derived from the line plot can positively impact Amazon Prime Video's business in several ways:

Content Strategy Validation: The overall growth in content releases, particularly the recent increase in TV show releases, suggests that Amazon Prime Video's content strategy is aligned with evolving viewer preferences and market trends. This positive trend can enhance user engagement and attract new subscribers.

Content Acquisition and Licensing: The insights into content release trends can inform content acquisition and licensing decisions. By understanding the growing demand for TV shows, Amazon Prime Video can focus on acquiring or licensing more high-quality TV series to cater to this audience segment.

Targeted Marketing and Recommendations: The platform can leverage the insights to develop targeted marketing campaigns and personalized recommendations. Highlighting the increasing availability of TV shows can attract viewers who prefer serialized content. Recommendation algorithms can also prioritize TV shows for users based on their viewing history and preferences.

Content Diversification: The shift towards a more balanced content portfolio with a growing emphasis on TV shows enhances content diversity on Amazon Prime Video. This can appeal to a wider range of viewers and increase overall platform usage.

Insights Leading to Negative Growth

While the observed trends generally indicate positive growth, certain aspects could potentially lead to negative growth if not addressed carefully:

Content Saturation: The continuous increase in content releases, especially if not accompanied by a corresponding increase in user engagement, could lead to content saturation. Viewers might feel overwhelmed by the sheer volume of content and struggle to find titles that resonate with them, potentially leading to decreased platform usage.

Increased Competition: The growing focus on TV shows by Amazon Prime Video also reflects a broader trend in the streaming industry. As more platforms invest in original TV series, competition intensifies, and Amazon Prime Video needs to ensure its TV show offerings stand out in terms of quality and appeal to attract and retain subscribers.

Production Costs: Expanding the TV show library can significantly increase production costs. If these costs are not offset by increased viewership and revenue, it could negatively impact profitability.

Genre Imbalance: While the line plot focuses on overall content release trends, it doesn't provide insights into genre-specific trends. If the growth in content releases is concentrated within certain genres, it could lead to genre imbalance and potentially alienate viewers with diverse preferences.
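The genre question raised above could be explored with a per-genre count by release year. This is a minimal sketch only: the `genres` column name and its list-like string format (for example `"['drama', 'comedy']"`) are assumptions about the dataset, and the sample rows are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical sample; the 'genres' column name and string format are assumptions.
sample = pd.DataFrame({
    "release_year": [2019, 2019, 2020, 2020],
    "genres": ["['drama', 'comedy']", "['drama']", "['comedy']", "[]"],
})

# Parse each string into a real Python list, then give every genre its own row.
parsed = sample.assign(genres=sample["genres"].apply(ast.literal_eval))
exploded = parsed.explode("genres").dropna(subset=["genres"])

# Titles per release year and genre (a title with two genres counts once in each).
genre_by_year = exploded.groupby(["release_year", "genres"]).size().unstack(fill_value=0)

print(genre_by_year)
```

Plotting `genre_by_year` as lines would show whether growth is concentrated in a few genres.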

### Chart 5: IMDb Score Trend Over Time on Amazon Prime Video


In [None]:
# Chart 5: IMDb Score Trend Over Time on Amazon Prime Video

# Assuming 'df' is your DataFrame

plt.figure(figsize=(12, 6))
sns.lineplot(x='release_year', y='imdb_score', hue='type', data=df)
plt.title('IMDb Score Trend Over Time on Amazon Prime Video')
plt.xlabel('Release Year')
plt.ylabel('Average IMDb Score')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()


# Explanation:
# Why this chart? To analyze how IMDb scores have changed over time for movies and TV shows.
# Insights: Shows trends in content quality: whether IMDb scores are improving, declining, or stable.
# Business Impact: Helps Amazon Prime assess viewer preferences and refine content acquisition strategies.

### 1. Why did you pick the specific chart?

A line plot was chosen for Chart 5, which visualizes the relationship between IMDb scores and the release year of movies and TV shows on Amazon Prime Video, for the following reasons:

Reasons for Choosing a Line Plot

Purpose: The primary goal of this chart is to show the trend of IMDb scores over time for both movies and TV shows. Line plots are specifically designed to visualize trends and patterns in data over a continuous variable, which in this case is the release year.

Data Type: The data we are working with involves two continuous variables:

release_year: Represents the year a movie or TV show was released.

imdb_score: Represents the IMDb rating of the movie or TV show.

Line plots excel at representing the relationship between two continuous variables, making it an appropriate choice for this visualization.

Trend Visualization: Line plots are highly effective at illustrating trends and changes in data over time. The connected data points clearly show increases, decreases, and fluctuations in IMDb scores over the years. This allows us to easily identify patterns and understand how IMDb scores have evolved for both movies and TV shows.

Comparison: A key aspect of this visualization is to compare the IMDb score trends of movies and TV shows. By using separate lines for each content type (achieved using the hue parameter in seaborn's lineplot function), we can easily compare their score trends over time. This side-by-side comparison provides valuable insights into how the two content types have performed in terms of IMDb ratings.

### 2. What is/are the insight(s) found from the chart?


The following insights can be derived from Chart 5, the line plot of the IMDb score trend over time on Amazon Prime Video:

Insights from the Chart

Overall Trend: The line plot reveals a general trend of fluctuating IMDb scores for both movies and TV shows on Amazon Prime Video over the years. There are periods of increase and decrease in average scores, but no clear consistent pattern emerges across the entire timeline. This suggests that the quality of content, as perceived by viewers through IMDb ratings, has varied over time.

Movies vs. TV Shows: Comparing the lines for movies and TV shows, we can observe differences in their IMDb score trends:

Movies: Generally, movies on Amazon Prime Video tend to have higher average IMDb scores compared to TV shows, especially in earlier years. However, this gap seems to narrow in more recent years.

TV Shows: TV shows exhibit more volatility in their IMDb scores over time, with more pronounced periods of increase and decrease compared to movies. This could be attributed to factors like varying season quality within a show or changes in audience preferences over time.

Recent Trends: Focusing on recent years, the plot reveals a potential trend of increasing IMDb scores for TV shows, while movie scores remain relatively stable or show a slight decline. This suggests that the quality of TV shows on Amazon Prime Video might be improving in recent years, potentially attracting more viewers and positive reviews.
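Volatility in a yearly average can also come from years with very few rated titles, so it helps to look at the number of titles behind each point. The sketch below recomputes the per-year mean alongside that count on hypothetical rows; the threshold of 2 titles is an arbitrary illustration, not a recommendation.

```python
import pandas as pd

# Hypothetical sample standing in for df[['release_year', 'type', 'imdb_score']].
sample = pd.DataFrame({
    "release_year": [1950, 2020, 2020, 2020, 2020, 2020],
    "type": ["SHOW", "SHOW", "SHOW", "MOVIE", "MOVIE", "MOVIE"],
    "imdb_score": [8.9, 6.5, 7.5, 5.0, 6.0, 7.0],
})

# Mean score and number of rated titles behind each point on the line.
trend = (
    sample.dropna(subset=["imdb_score"])
    .groupby(["type", "release_year"])["imdb_score"]
    .agg(mean_score="mean", n_titles="count")
    .reset_index()
)

# Flag points backed by very few titles; the threshold of 2 is arbitrary.
trend["low_sample"] = trend["n_titles"] < 2

print(trend)
```

Points flagged as `low_sample` are averages of a handful of titles, so swings there say little about overall content quality.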

### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



This section analyzes the potential business impact of the insights gained from Chart 5, covering both positive and negative aspects:

Positive Business Impact

The insights derived from Chart - 5, which visualizes the IMDb score trend over time on Amazon Prime Video, can have a positive impact on the business in several ways:

Content Acquisition and Licensing: By understanding the trends of IMDb scores for movies and TV shows, Amazon Prime Video can make more informed decisions about content acquisition and licensing. For example, they might prioritize acquiring movies with consistently high scores or focus on TV shows that have shown recent improvements in ratings. This data-driven approach can help them curate a content library that resonates with viewers and attracts new subscribers.

Targeted Marketing and Recommendations: The insights on score trends can be leveraged for targeted marketing and recommendations. For instance, they can promote movies with high scores to attract viewers seeking quality content. Similarly, they can highlight TV shows that have seen recent score improvements to capture the attention of audiences looking for fresh and engaging content. Personalized recommendations can also be tailored based on individual viewer preferences and the score trends of different genres or content types.

Content Production and Investment: Understanding the trends of IMDb scores can inform content production and investment decisions. By identifying genres or themes that consistently receive high ratings, Amazon Prime Video can invest in original content creation that aligns with audience preferences. This strategic approach can increase the likelihood of producing successful and well-received titles.

Platform Reputation and User Satisfaction: Offering content with consistently high or improving IMDb scores can enhance the platform's reputation for quality and user satisfaction. This positive perception can attract new subscribers and retain existing ones, ultimately contributing to business growth.

Insights Leading to Negative Growth

While the insights from Chart - 5 primarily offer opportunities for positive impact, there are potential aspects that could lead to negative growth if not addressed carefully:

Neglecting Content with Lower Scores: Focusing solely on content with high IMDb scores might lead to neglecting titles with lower scores that could still appeal to specific audience segments or niche interests. This could limit content diversity and alienate viewers with varied preferences, potentially hindering user growth.

### Chart 6: Top 10 Production Countries on Amazon Prime Video

In [None]:
# Chart 6: Top 10 Production Countries on Amazon Prime Video


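# Normalize 'production_countries' to a comma-separated string.
# Values may be real lists or list-like strings such as "['US', 'GB']".
df['production_countries'] = df['production_countries'].apply(lambda x: ', '.join(x) if isinstance(x, list) else str(x))

# Strip list brackets and quotes, split into one country per row, drop blanks, and count
countries = df['production_countries'].str.replace(r"[\[\]'\"]", '', regex=True).str.split(', ').explode().str.strip()
production_countries = countries[~countries.isin(['', 'nan'])].value_counts().head(10)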

# Create the corrected bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=production_countries.index, y=production_countries.values, palette="viridis")

# Enhancing visualization
plt.title('Top 10 Production Countries on Amazon Prime Video')
plt.xlabel('Production Country')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability

# Display the plot
plt.show()

# Explanation:
# To show the top 10 countries producing the most content on Amazon Prime.
# Insights: Identifies the leading content-producing regions.
# Business Impact: Helps Amazon Prime optimize regional investments and content licensing strategies.
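The country counts depend on how multi-country values are split. As an alternative to string splitting, the values can be parsed into real lists first. This is a minimal sketch assuming the raw column holds list-like strings such as `"['US', 'GB']"`; `parse_countries` is a hypothetical helper and the sample values are invented.

```python
import ast
import pandas as pd

# Hypothetical sample; the list-like string format is an assumption about the raw data.
raw = pd.Series(["['US']", "['US', 'GB']", "[]", "['IN']"], name="production_countries")


def parse_countries(value):
    """Return a list of country codes from a list or a list-like string."""
    if isinstance(value, list):
        return value
    try:
        parsed = ast.literal_eval(str(value))
    except (ValueError, SyntaxError):
        # Unparseable values (for example missing data) contribute no countries.
        return []
    return parsed if isinstance(parsed, list) else []


# One row per (title, country), empty lists dropped, then the most frequent countries.
top_countries = raw.apply(parse_countries).explode().dropna().value_counts().head(10)

print(top_countries)
```

Parsing this way avoids stray brackets or quotes ending up inside the country labels on the x-axis.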


### 1. Why did you pick the specific chart?

A bar chart was chosen to visualize the top 10 production countries on Amazon Prime Video for the following reasons:

Reasons for Choosing a Bar Chart

Purpose: The goal was to rank production countries by the number of titles they contribute to the catalog. Bar charts are designed for comparing a quantity across discrete categories.

Data Type: 'production_countries' is a categorical variable, and the value plotted for each country is a count of titles. A line plot would wrongly imply an ordering or continuity between countries, whereas bars treat each country as a separate category.

Ranking and Comparison: Because the counts come from value_counts(), the bars appear in descending order, so the leading countries and the gaps between them are visible at a glance.

Clarity and Readability: Limiting the chart to the top 10 countries and rotating the x-axis labels keeps the category names legible.

Multi-country Titles: A title with several production countries is split and counted once for each country, so the bars can sum to more than the number of titles.

### 2. What is/are the insight(s) found from the chart?

The insights below come from the line plot of the content release trend over time on Amazon Prime Video (Chart 4):

Insights:

Increasing Content Releases: The line plot clearly shows an overall upward trend in the number of content releases (both movies and TV shows) on Amazon Prime Video over the years. This indicates that the platform has been actively expanding its content library to cater to its growing user base.

Movies Dominate Releases: While both movies and TV shows have experienced growth, the line for movies consistently remains above the line for TV shows, indicating that Amazon Prime Video has consistently released more movies than TV shows throughout the analyzed period. This aligns with the earlier insight from the countplot showing a greater number of movies on the platform.

Recent Surge in TV Shows: In recent years, there has been a notable increase in the number of TV show releases on Amazon Prime Video. While movies still dominate, the gap between movies and TV shows appears to be narrowing, suggesting a potential shift in the platform's content strategy towards including more TV series.

Potential for Future Growth: The upward trend in content releases, particularly for TV shows, suggests that Amazon Prime Video is likely to continue expanding its content library in the coming years. This indicates a commitment to providing a diverse and engaging entertainment experience for its subscribers.

### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The business impact below is based on the line plot of the content release trend on Amazon Prime Video (Chart 4), including any potential for negative growth:

Positive Business Impact

The insights derived from the line plot can positively impact Amazon Prime Video's business in several ways:

Content Strategy Validation: The overall increasing trend in content releases, particularly for movies, suggests that Amazon Prime Video's current content strategy is resonating with viewers and driving user engagement. This validation can encourage the platform to continue investing in content acquisition and expansion.

Targeted Content Investments: The dominance of movies in the platform's library indicates a strong viewer preference for this content type. Amazon Prime Video can leverage this insight to strategically allocate resources towards acquiring or licensing more movies, further catering to audience demand.

Capitalizing on TV Show Growth: The recent surge in TV show releases suggests an opportunity for Amazon Prime Video to expand its audience and attract viewers who prefer serialized content. By investing in high-quality TV series, the platform can diversify its offerings and enhance its competitive position in the streaming market.

Sustained User Engagement: The continued growth in content releases, across both movies and TV shows, demonstrates Amazon Prime Video's commitment to providing fresh and engaging entertainment options. This commitment can foster user loyalty and drive long-term subscriber retention.

Insights Leading to Negative Growth

While the overall trend of increasing content releases appears positive, certain aspects could potentially lead to negative growth if not addressed carefully:

Content Oversaturation: While expanding the content library is generally beneficial, excessive growth could lead to content oversaturation. This can overwhelm viewers with too many choices, making it difficult for them to discover content they enjoy. It's crucial for Amazon Prime Video to balance content quantity with quality and effective recommendation systems to avoid overwhelming users.

Neglecting Content Diversity: Despite the growth in TV show releases, movies still dominate the platform's library. Overemphasizing one content type could potentially alienate viewers who prefer other formats or genres. Maintaining content diversity is essential to cater to a broader audience and mitigate the risk of losing subscribers with specific preferences.

Diminishing Returns: While increasing content releases initially drives user engagement, there may be a point of diminishing returns. Excessive investment in content acquisition without a corresponding increase in viewership could strain resources and negatively impact profitability. It's essential for Amazon Prime Video to carefully analyze content performance and adjust its investment strategy to ensure a positive return on investment.

### Chart 7: IMDb Score vs. Runtime for Movies and TV Shows


In [None]:
# Chart 7: IMDb Score vs. Runtime for Movies and TV Shows
# Filter data for movies and TV shows separately
movies = df[df['type'] == 'MOVIE']
shows = df[df['type'] == 'SHOW']

# Create scatter plots for movies and TV shows
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Scatter plot for movies
sns.scatterplot(x='runtime', y='imdb_score', data=movies, ax=axes[0], alpha=0.5)
axes[0].set_title('IMDb Score vs. Runtime for Movies')
axes[0].set_xlabel('Runtime (minutes)')
axes[0].set_ylabel('IMDb Score')

# Scatter plot for TV shows
sns.scatterplot(x='runtime', y='imdb_score', data=shows, ax=axes[1], alpha=0.5)
axes[1].set_title('IMDb Score vs. Runtime for TV Shows')
axes[1].set_xlabel('Runtime (minutes)')
axes[1].set_ylabel('IMDb Score')

plt.tight_layout()  # Adjust layout for better spacing
plt.show()

# Explanation:
# Why this chart? To analyze the relationship between runtime and IMDb score for movies and TV shows.
# Insights: Identifies runtime outliers that may affect audience preferences.
# Business Impact: Helps Amazon Prime optimize content length for better viewer engagement.
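A scatter plot shows the shape of the relationship, and a correlation coefficient puts a number on its strength. The sketch below computes the Pearson correlation between runtime and IMDb score separately for each content type; the sample values are hypothetical, and in the notebook `sample` would be `df[['type', 'runtime', 'imdb_score']]`.

```python
import pandas as pd

# Hypothetical sample standing in for df[['type', 'runtime', 'imdb_score']].
sample = pd.DataFrame({
    "type": ["MOVIE"] * 4 + ["SHOW"] * 4,
    "runtime": [80, 95, 110, 130, 20, 30, 45, 60],
    "imdb_score": [5.5, 6.0, 6.8, 7.4, 7.9, 7.1, 6.6, 6.0],
})

# Pearson correlation between runtime and IMDb score, computed separately per type.
corr_by_type = {
    content_type: group["runtime"].corr(group["imdb_score"])
    for content_type, group in sample.groupby("type")
}

for content_type, r in corr_by_type.items():
    print(f"{content_type}: r = {r:.2f}")
```

Values near 0 would indicate little linear relationship between runtime and score, which is worth knowing before drawing conclusions about optimal content length.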

### Chart 8: Trends Over Time - Number of Titles Released Per Year


In [None]:
# Chart 8: Trends Over Time - Number of Titles Released Per Year

# Check that 'df_titles' is a DataFrame
if not isinstance(df_titles, pd.DataFrame):
    raise TypeError("Error: 'df_titles' is not a DataFrame. Ensure the dataset is loaded correctly.")

# Ensure 'release_year' is a column in the DataFrame
if 'release_year' not in df_titles.columns:
    raise KeyError("Error: 'release_year' column not found in the DataFrame.")

# Convert 'release_year' to numeric (handling errors)
df_titles['release_year'] = pd.to_numeric(df_titles['release_year'], errors='coerce')

# Drop NaN values (if any) after conversion; copy to avoid SettingWithCopyWarning below
df_titles = df_titles.dropna(subset=['release_year']).copy()

# Convert 'release_year' to integer type (optional)
df_titles['release_year'] = df_titles['release_year'].astype(int)

# Create the histogram plot
plt.figure(figsize=(12, 6))
sns.histplot(df_titles['release_year'], bins=30, kde=True, color="green")

# Formatting the plot
plt.title("Trend of Content Releases Over Time")
plt.xlabel("Year")
plt.ylabel("Number of Titles")
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Show the plot
plt.show()


### 1. Why did you pick the specific chart?


Purpose: The main objective is to display the distribution of content releases across different years. Histograms are specifically designed to visualize the frequency distribution of a single numerical variable, in this case, 'release_year'.

Data Type: The 'release_year' column is numerical and represents the year of content release. Histograms are well-suited for visualizing the distribution of numerical data.

Trend Visualization: Although primarily used for distributions, histograms can also reveal trends over time when the x-axis represents a time-based variable. The histogram's shape shows how the number of content releases has changed over the years.

Frequency Representation: Histograms clearly represent the frequency or count of content releases within specific year ranges (bins). This allows us to identify periods with higher or lower content production.

KDE: Adding a Kernel Density Estimation (KDE) plot on top of the histogram provides a smoother representation of the distribution, further aiding in visualizing the trend.

In summary, a histogram with KDE was chosen for this visualization because it effectively displays the distribution of content releases across different years, allowing for the identification of trends in content production over time. The histogram's shape and the KDE curve provide a clear visual representation of these trends.

### 2. What is/are the insight(s) found from the chart?

Recent Increase in Content Releases: The histogram likely shows a significant surge in content releases in recent years, particularly towards the right side of the chart. This indicates a substantial growth in Amazon Prime Video's content library, especially in more recent years.

Peak Release Periods: You might observe specific peaks in the histogram, representing years or periods with a notably high volume of content releases. These peaks can highlight periods when Amazon Prime Video focused on expanding its content offerings significantly. For example, there might be a noticeable peak around 2019 or 2020, indicating a period of rapid content acquisition or production.

Older Content Distribution: The left side of the histogram likely displays the distribution of older content on the platform. It reveals the pattern of content acquisition or availability over time, showing how Amazon Prime Video has built its library over the years. You might observe a gradual increase in content releases from earlier years to more recent years, reflecting the platform's growth trajectory.

Content Release Acceleration: The overall shape of the histogram likely demonstrates a trend of accelerating content releases over time. This signifies Amazon Prime Video's continuous efforts to expand its content library and provide a broader range of choices for its viewers.

Reasoning:

Bar Height: The height of each bar in the histogram directly corresponds to the frequency or count of content releases within that specific year or period. Higher bars indicate more content released during that time frame.

Histogram Shape: The overall shape of the histogram reveals the general trend of content releases over time. An upward trend towards the right side signifies increasing content releases, while a downward trend would suggest a decrease.

KDE Curve: The Kernel Density Estimation (KDE) plot provides a smoother representation of the distribution, enhancing the visualization of trends. Peaks in the KDE curve correspond to periods with high release frequencies.

In summary, the histogram reveals a significant increase in content releases on Amazon Prime Video in recent years, with potential peak release periods and a gradual accumulation of older content. The overall trend suggests a continuous expansion of the platform's content library, indicating efforts to provide more choices for viewers.
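Because the histogram groups years into 30 bins, a single bar can cover several years. Exact per-year counts make it possible to confirm a suspected peak year directly. The following is a minimal sketch on hypothetical release years; in the notebook `years` would be `df_titles['release_year']`, and the 2019 cutoff is only an example.

```python
import pandas as pd

# Hypothetical release years standing in for df_titles['release_year'].
years = pd.Series(
    [1995, 2010, 2018, 2019, 2019, 2020, 2020, 2020, 2021, 2021],
    name="release_year",
)

# Exact number of titles per year, ordered chronologically.
titles_per_year = years.value_counts().sort_index()

# Year with the most titles, and the share of titles released from 2019 onward.
peak_year = titles_per_year.idxmax()
recent_share = (years >= 2019).mean()

print(titles_per_year)
print(f"Peak year: {peak_year}, share since 2019: {recent_share:.0%}")
```

Running this on the real column replaces "likely" and "might" in the insights above with concrete years and percentages.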

### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



This section considers the business impact of the histogram's insights for Amazon Prime Video, covering both positive and negative aspects:

Positive Business Impact

The insights derived from the histogram can positively influence Amazon Prime Video's business in several ways:

Content Strategy and Planning: By understanding content release trends, Amazon Prime Video can strategically plan content acquisition, licensing, and production. They can identify potential gaps or periods where more content is needed to cater to viewer demand and maintain engagement. For example, if the histogram shows a recent surge in content releases, it might indicate a successful strategy that can be continued.

User Engagement and Retention: Insights into release trends can inform marketing and promotional activities, allowing Amazon Prime Video to target users with new content releases and keep them engaged. Highlighting peak release periods or upcoming content additions can attract viewers and encourage platform usage.

Platform Growth and Attraction: The histogram's overall trend, if showing increasing content releases, can signal platform growth and attract new subscribers. A growing content library is a key factor for attracting and retaining viewers in the competitive streaming market. This positive trend can be leveraged in marketing campaigns to showcase the platform's value proposition.

Potential for Negative Growth

While the insights generally point towards positive trends, there are potential aspects that could lead to negative growth if not carefully addressed:

Content Saturation and Overwhelm: If the recent years see a very steep increase in content, there is a risk of content saturation. Users might find it overwhelming to navigate the platform and discover new content they are interested in. Too much content can lead to choice paralysis and potentially decrease user engagement.

### Chart 9: Correlation Matrix: IMDb vs. TMDb Scores Heatmap

In [None]:
# Chart 9: Correlation Matrix: IMDb vs. TMDb Scores Heatmap
# Assumes 'df_titles' is the DataFrame containing movie/show data
numeric_cols = ["imdb_score", "tmdb_score"]

# Convert columns to numeric; values that cannot be parsed become NaN
df_titles['imdb_score'] = pd.to_numeric(df_titles['imdb_score'], errors='coerce')
df_titles['tmdb_score'] = pd.to_numeric(df_titles['tmdb_score'], errors='coerce')

# Calculate the (Pearson) correlation matrix
correlation_matrix = df_titles[numeric_cols].corr()

# Create the heatmap. vmin/vmax pin the color scale to the full [-1, 1] range;
# without them, a 2x2 matrix is colored relative to its own min and max,
# so the off-diagonal cell always gets the "coldest" color whatever its value.
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, linewidths=2)
plt.title("Correlation Matrix: IMDb vs. TMDb Scores")
plt.show()

# Explanation:
# The heatmap shows the correlation between IMDb scores and TMDb scores for movies/shows.
# With the color scale fixed to [-1, 1]: red → strong positive correlation,
# pale colors → weak or no correlation, blue → strong negative correlation.
# Diagonal values (1.00) → each score's correlation with itself.
# If the IMDb-TMDb correlation is high (0.7+), both platforms rate titles similarly. If it's low (<0.3), ratings differ significantly.










### 1. Why did you pick the specific chart?

A heatmap was chosen because:

Visualizes Correlation: It shows the relationship between two numerical variables (IMDb and TMDb scores) using color intensity.

Clear and Concise: It presents the correlation in a single, easy-to-interpret chart.

Annotation: It displays the correlation coefficient in each cell, so the exact value can be read directly.

Suited to Correlation Matrices: Heatmaps are a standard way to visualize correlation matrices, which makes them a natural fit for comparing IMDb and TMDb scores.

### 2. What is/are the insight(s) found from the chart?

Insights from the heatmap of the correlation between IMDb and TMDb scores:

Strong Positive Correlation: The heatmap likely shows a high positive correlation coefficient (close to 1) between IMDb and TMDb scores. This means that movies and TV shows rated highly on IMDb tend to be rated highly on TMDb as well, and vice versa.

Agreement Between Platforms: A strong positive correlation indicates general agreement between the two rating platforms about the quality of content. This suggests that both platforms capture similar aspects of viewer preferences and opinions.

Consistent Ratings: A high correlation shows that the IMDb and TMDb ratings are consistent with each other. Consistency alone does not prove that either score is accurate, but it does support using the two scores together for content evaluation.

In essence, Chart 9 points to a strong positive correlation between IMDb and TMDb scores, indicating that the two rating platforms generally agree on the quality of content.
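Because the insight above is phrased as "likely", it helps to compute the coefficient directly. The following is a minimal sketch on made-up scores; in the notebook the two columns would come from `df_titles`, and the numbers below are illustrative only. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd

# Hypothetical scores; the real values would come from df_titles.
sample_scores = pd.DataFrame({
    "imdb_score": [5.1, 6.3, 7.0, 7.8, 8.4, None],
    "tmdb_score": [5.5, 6.0, 6.8, 7.5, 8.1, 7.0],
})

# Keep only titles that have both scores so both measures use the same rows.
paired = sample_scores.dropna(subset=["imdb_score", "tmdb_score"])

# Pearson measures linear association between the raw scores.
pearson_r = paired["imdb_score"].corr(paired["tmdb_score"])

# Spearman measures monotonic association: Pearson correlation of the ranks.
spearman_r = paired["imdb_score"].rank().corr(paired["tmdb_score"].rank())

print(f"Titles with both scores: {len(paired)}")
print(f"Pearson r:  {pearson_r:.2f}")
print(f"Spearman r: {spearman_r:.2f}")
```

Reporting the number of paired titles alongside the coefficients makes it clear how much of the catalog the correlation actually covers.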

### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Potential for Negative Growth & Justification

While leveraging IMDb and TMDb ratings has benefits, over-reliance on them could lead to:

Ignoring Niche Content – Neglecting lower-rated but audience-specific titles could reduce content diversity.

Bias Towards Popular Genres – Favoring highly rated mainstream genres may overlook emerging or unique content.

Missing Hidden Gems – Some critically acclaimed titles may have lower ratings but still add value to the platform.

Recommendation

Amazon Prime Video should:

Diversify content acquisition beyond high-rated titles.

Consider direct user feedback alongside ratings.

Promote niche and critically acclaimed content for a well-rounded library.

### Chart 10: Top 10 Most Frequent Directors on Amazon Prime

In [None]:
# Assuming 'df_credits' is your DataFrame containing cast and crew information
# Filter for directors and count occurrences
directors = df_credits[df_credits['role'] == 'DIRECTOR']['name'].value_counts().head(10)

# Create the bar plot
plt.figure(figsize=(10, 5))
sns.barplot(x=directors.index, y=directors.values, palette="rocket")
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.title("Top 10 Most Frequent Directors on Amazon Prime")
plt.xlabel("Director")
plt.ylabel("Number of Titles")
plt.show()

# Explanation:
# The bar chart shows the top 10 most frequent directors on Amazon Prime based on the number of director credits they have.
# X-axis → Directors' names.
# Y-axis → Number of titles directed.
# Higher bars indicate directors with the most movies/shows on Amazon Prime.
# The "rocket" color palette distinguishes the bars, and labels are rotated for better readability.

### 1. Why did you pick the specific chart?

A bar chart was the most appropriate choice for visualizing the top 10 most frequent directors on Amazon Prime Video because it effectively represents the frequency of titles directed by each director and facilitates clear comparisons between them. This aligns perfectly with the objective of the chart, which was to identify and highlight the directors with the most content available on the platform.

### 2. What is/are the insight(s) found from the chart?

Director Dominance: A few directors, like Raúl Campos and Jan Suter, have directed significantly more titles than others on the platform.

Content Variety: There's a mix of directors with varying frequencies, indicating some diversity in the directorial talent represented on Amazon Prime.

Potential Content Focus: The directors with the highest frequencies may suggest a focus on specific genres or content types favored by Amazon Prime Video.

These insights highlight potential trends in content acquisition, director preferences, and content focus on the platform.
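To put a number on "director dominance", one option is to count distinct titles per director and measure what share of all directed titles the top directors account for. The following is a minimal sketch on made-up credit rows; the `name` and `role` columns follow the chart code above, while the title identifier column `id` is an assumption.

```python
import pandas as pd

# Hypothetical credits rows; `id` (title identifier) is an assumed column name.
sample_credits = pd.DataFrame({
    "id":   ["t1", "t1", "t2", "t3", "t3", "t4", "t5"],
    "name": ["Dir A", "Actor X", "Dir A", "Dir B", "Dir B", "Dir C", "Dir A"],
    "role": ["DIRECTOR", "ACTOR", "DIRECTOR", "DIRECTOR",
             "DIRECTOR", "DIRECTOR", "DIRECTOR"],
})

directors_only = sample_credits[sample_credits["role"] == "DIRECTOR"]

# Count distinct titles per director so a duplicated credit row is not counted twice.
titles_per_director = (
    directors_only.groupby("name")["id"].nunique().sort_values(ascending=False)
)

# Share of all directed titles that belong to the top-N directors.
# Note: a co-directed title counts once for each of its directors.
top_n = 1
total_directed_titles = directors_only["id"].nunique()
top_share = titles_per_director.head(top_n).sum() / total_directed_titles

print(titles_per_director)
print(f"Share of titles from the top {top_n} director(s): {top_share:.0%}")
```

A small share would suggest the catalog is spread across many directors, while a large share would support the over-reliance concern.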

### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact of Top Directors Analysis

Positive Impact:

Targeted Content Acquisition – Acquire or license titles from successful directors to attract their audience.

Director Partnerships – Collaborate with frequent directors for exclusive content.

Audience Engagement – Promote popular directors to boost viewership.

Better Recommendations – Suggest content based on the directors a user already watches.

Potential Negative Growth:

Over-Reliance on Few Directors – Limits content diversity.

Missed Opportunities – Overlooks emerging talent and unique content.

Director Dependence – Declining popularity of a director may affect viewership.

Genre Imbalance – Too much focus on specific genres may alienate some viewers.

Recommendation: Amazon should balance popular director content with diverse and emerging talent to maintain variety, audience interest, and long-term growth.

### Chart 11: Pair Plot of Movie Features

In [None]:
# Columns to include in the pair plot
columns_for_pairplot = ['imdb_score', 'tmdb_popularity', 'tmdb_score', 'runtime']

# Work on a copy so the original DataFrame is not modified
# ('df' is assumed to be the titles DataFrame loaded earlier in the notebook)
df_cleaned = df[columns_for_pairplot + ['type']].copy()

# Ensure selected columns are numeric; values that cannot be parsed become NaN
df_cleaned[columns_for_pairplot] = df_cleaned[columns_for_pairplot].apply(pd.to_numeric, errors='coerce')

# Drop missing values after the conversion, so NaNs introduced by coercion are removed too
df_cleaned = df_cleaned.dropna()

# Create the pair plot
sns.pairplot(df_cleaned, hue='type', diag_kind='kde', corner=True)  # 'corner=True' for a cleaner plot
plt.suptitle('Pair Plot of Movie Features', y=1.02)  # Add a title
plt.show()

# Explanation:
# The pair plot visualizes relationships between IMDb score, TMDb popularity, TMDb score, and runtime, grouped by content type.

# Scatter plots show pairwise relationships between numerical features.
# Diagonal KDE plots show feature distributions.
# Color-coded by 'type' (e.g., Movie vs. TV Show).
# corner=True makes the plot cleaner by removing the duplicate upper-triangle plots.


### 1. Why did you pick the specific chart?

I chose the pair plot because it is a powerful visualization for exploring relationships between multiple numerical variables in a dataset. Here’s why:

Reasons for Choosing the Pair Plot:

Visualizing Correlations: It helps in identifying correlations between key features like IMDb score, TMDb popularity, TMDb score, and runtime. If two variables are strongly correlated, their scatter plot will show a clear pattern.

Comparing Distributions: The diagonal plots (KDE plots) show the distribution of each feature, helping us understand their spread and skewness.

Color-Coding for Categories (hue='type'): By using 'type' (Movies vs. TV Shows), we can see how the relationships differ across content types. This helps in distinguishing trends between movies and TV shows.

Detecting Outliers & Clusters: Outliers appear as isolated points, helping in preprocessing decisions. Clusters in scatter plots can indicate segments within the data (e.g., high vs. low-rated movies).

Insights We Can Extract:

Are longer movies higher-rated on IMDb and TMDb?

Do IMDb and TMDb scores have a strong correlation?

Are there distinct trends between movies and TV shows?



### 2. What is/are the insight(s) found from the chart?

Insights from the Pair Plot:

Correlation Between IMDb Score & TMDb Score: If the scatter plot between IMDb score and TMDb score shows a strong linear pattern, it indicates that highly rated movies on IMDb tend to be highly rated on TMDb as well. A weak or scattered trend would suggest differences in scoring criteria on these platforms.

TMDb Popularity vs. IMDb Score: If TMDb popularity is correlated with IMDb score, it suggests that movies with higher IMDb ratings tend to be more popular on TMDb. If there's no strong trend, popularity might be driven by factors other than rating, such as recent releases, marketing, or availability.

Runtime vs. Ratings: If longer movies have higher ratings, it suggests that audiences might prefer well-developed narratives. If there's no clear pattern, runtime does not appear to be related to ratings. If shorter movies dominate high ratings, it might indicate a preference for concise storytelling.

Differences Between Movies & TV Shows (hue='type'): If the scatter plots show separate clusters for movies and TV shows, it means their rating patterns are different. For example, TV shows might have higher average IMDb scores compared to movies due to dedicated fanbases and episodic storytelling.

Outliers in Popularity & Ratings: Any isolated points indicate outliers, for example extremely popular but low-rated content or highly rated but less popular titles.

Potential Business Impact of These Insights:

✅ Helps in content acquisition by understanding what kind of content performs well.

✅ Improves recommendation systems by identifying patterns in audience preferences.

✅ Aids in marketing strategies by focusing on factors that drive popularity vs. quality perception.
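The "if" statements in these pair-plot insights can be settled with a few group-wise summaries. The following is a minimal sketch on made-up rows; the column names follow the pair-plot code above, and the `MOVIE`/`SHOW` labels and all numbers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; 'MOVIE' and 'SHOW' are assumed labels for the 'type' column.
sample = pd.DataFrame({
    "type": ["MOVIE"] * 4 + ["SHOW"] * 4,
    "runtime": [90, 100, 110, 120, 25, 30, 45, 60],
    "imdb_score": [6.0, 6.5, 7.0, 7.5, 8.0, 7.5, 7.0, 6.5],
})

# Average IMDb score per content type (do shows score higher than movies?).
mean_score_by_type = sample.groupby("type")["imdb_score"].mean()

# Pearson correlation between runtime and IMDb score, computed per content type
# (are longer titles rated higher within each group?).
runtime_vs_score = pd.Series({
    content_type: group["runtime"].corr(group["imdb_score"])
    for content_type, group in sample.groupby("type")
})

print(mean_score_by_type)
print(runtime_vs_score)
```

Computing the correlation separately per type matters because movies and shows have very different runtimes; pooling them can hide or even reverse the within-group relationship.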

## **5. Solution to Business Objective**

### What do you suggest the client to achieve Business Objective ?


To help Amazon Prime achieve its business objectives—increasing user engagement, maximizing subscriptions, and optimizing content acquisition—the following strategies are recommended:

1️⃣ Optimize Content Acquisition & Licensing 🎥
🔹 Actionable Steps:

Prioritize acquiring content in the highest-rated genres (for example, Drama and Thriller, if they dominate the high IMDb scores).
Expand partnerships with frequent directors who produce quality content.
Invest in regional and diverse content to attract a broader audience and prevent content saturation.
🔹 Expected Impact:
✅ Better content quality → Higher user satisfaction
✅ Increased subscriptions & retention

2️⃣ Enhance Recommendation System 📌
🔹 Actionable Steps:

Leverage IMDb & TMDb scores to refine recommendation algorithms.
Promote highly rated but less popular titles to improve content visibility.
Implement personalized genre-based suggestions to improve watch time.
🔹 Expected Impact:
✅ Increased watch time → Higher ad revenue & customer loyalty
✅ Better user engagement → Reduced subscription churn

3️⃣ Targeted Marketing & Promotions 📢
🔹 Actionable Steps:

Feature top-rated content prominently on the homepage.
Use data-driven campaigns to promote movies/shows with high IMDb scores.
Partner with influential directors and actors for exclusive releases & marketing collaborations.
🔹 Expected Impact:
✅ Higher click-through rates → More engagement & conversions
✅ Attract new users via strong content branding

4️⃣ Address Potential Negative Growth Factors ⚠️
🔹 Actionable Steps:

Reduce dependency on a few directors to maintain variety.
Expand genre diversity to cater to all user preferences.
Balance popularity and quality, ensuring highly rated but under-watched content gets visibility.
🔹 Expected Impact:
✅ Prevents user fatigue from repetitive content.
✅ Broadens audience reach for long-term growth.
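The recommendation to surface highly rated but less popular titles can be prototyped with a simple filter. The following is a minimal sketch on made-up titles; the score threshold of 8.0 and the use of the median popularity as a cutoff are illustrative assumptions, not values derived from the analysis.

```python
import pandas as pd

# Hypothetical titles; the real columns would come from the titles DataFrame.
sample_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "imdb_score": [8.5, 8.1, 6.0, 7.9, 8.8],
    "tmdb_popularity": [120.0, 4.0, 3.0, 50.0, 9.0],
})

MIN_SCORE = 8.0  # illustrative quality threshold (assumption)

# Treat titles below the median popularity as "under-watched" (assumption).
popularity_cutoff = sample_titles["tmdb_popularity"].median()

# Highly rated AND below the popularity cutoff, best-rated first.
hidden_gems = sample_titles[
    (sample_titles["imdb_score"] >= MIN_SCORE)
    & (sample_titles["tmdb_popularity"] < popularity_cutoff)
].sort_values("imdb_score", ascending=False)

print(hidden_gems[["title", "imdb_score", "tmdb_popularity"]])
```

Both thresholds are tuning knobs; in practice they would be chosen by looking at the score and popularity distributions from the earlier charts.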

Final Recommendation & Business Impact
✔ Strategic content acquisition + strong recommendations + targeted marketing = Higher engagement & sustained growth 🚀
📌 By focusing on data-driven insights, Amazon Prime can increase subscriptions, improve retention, and enhance user satisfaction.

# Conclusion


This analysis provides an overview of Amazon Prime Video’s content landscape, helping stakeholders identify key trends that influence viewership and subscription growth. The findings can help streaming executives, marketers, and content strategists make data-driven decisions to improve audience engagement and platform offerings.

By leveraging these insights, Amazon Prime Video can refine its content curation strategy, focus on high-performing genres, and enhance its competitive edge in the streaming industry. The project also offers recommendations on content investment strategies, so that the platform continues to cater to its diverse audience while maximizing user satisfaction and engagement.



##  ***Hurrah! Happy Coding !!!***