# **Project Name**    -



##### **Project Type**    - EDA On Amazon Prime
##### **Contribution**    - Individual
##### **Name**            - Ashwin Suryawanshi


# **Project Summary -**

With the rapid expansion of the streaming industry, Amazon Prime Video has established itself as one of the leading platforms, offering a vast and diverse content library to millions of users worldwide. As competition grows, understanding content trends, regional preferences, and viewer engagement becomes essential for optimizing content strategy.

This project performs an in-depth Exploratory Data Analysis (EDA) on Amazon Prime’s content catalog to uncover key patterns, evaluate content diversity, and analyze viewer ratings. The insights derived will help improve content curation, recommendation algorithms, and audience engagement strategies through data-driven decision-making.

**Dataset Overview :-**

The analysis is based on two key datasets:

**titles.csv** – Contains over 9,000 unique titles, providing details such as:

* Show type (Movie or TV Show)
* Genres
* Production countries
* IMDb ratings
* Release years
* Runtime

**credits.csv** – A comprehensive dataset with over 124,000 records, mapping actors and directors to their respective titles, enabling an analysis of key contributors in the industry.

The dataset consists of both categorical (e.g., genres, production countries) and numerical (e.g., IMDb scores, runtime) variables, allowing for a detailed exploration of Amazon Prime’s content landscape.

**Analysis Approach :-**

The project follows a structured EDA workflow, involving:

**1. Data Cleaning & Preprocessing**

* Handling missing values in genres, ratings, and production details.
* Standardizing categorical variables (e.g., genre formatting).
* Identifying and removing duplicate records to ensure data integrity.

**Exploratory Analysis & Insights**

* Content Type Distribution – Analyzing the proportion of movies vs. TV shows.
* Regional Production Trends – Identifying leading production countries and global content distribution.
* Genre Popularity – Exploring the most and least common genres and their trends over time.
* IMDb Ratings & Viewer Preferences – Understanding rating distributions across genres, release years, and regions.
* Key Contributors Analysis – Evaluating the impact of directors and actors on content success.

**2. Visualization & Interpretation :-**

The findings are presented through interactive visualizations, including:

* Histograms & Box Plots – To examine ratings, runtime, and release year distributions.
* Bar Charts & Pie Charts – To analyze content proportions and genre dominance.
* Heatmaps & Scatter Plots – To reveal correlations between ratings, regions, and content categories.

**3. Business Objective & Impact :-**

The insights derived from this EDA are crucial for Amazon Prime’s content strategy. By understanding viewer preferences, content trends, and regional demand, Amazon can:

* Enhance content acquisition by identifying high-performing genres and regions.
* Optimize recommendation algorithms for improved user engagement.
* Expand regional content offerings by recognizing underserved markets.
* Improve content marketing strategies based on audience preferences and ratings.

Ultimately, this analysis empowers Amazon Prime Video to make data-driven decisions, refine its content curation process, and deliver a more engaging, diverse, and personalized streaming experience for its users.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This dataset was created to analyze all shows available on Amazon Prime Video, allowing us to extract valuable insights such as:

1. Content Diversity: What genres and categories dominate the platform?
2. Regional Availability: How does content distribution vary across different regions?
3. Trends Over Time: How has Amazon Prime’s content library evolved?
4. IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?

#### **Define Your Business Objective?**

In today's competitive streaming industry, platforms like Amazon Prime Video are constantly expanding their content libraries to cater to diverse audiences. With a growing number of shows and movies available on the platform, data-driven insights play a crucial role in understanding trends, audience preferences, and content strategy.

This dataset was created to list all shows available on Amazon Prime streaming and analyze the data to find interesting facts. This dataset includes information about content available in the United States.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.colors as pc
import plotly.graph_objects as go
from plotly.subplots import make_subplots

### Dataset Loading

In [None]:
# Load Dataset

df1 = pd.read_csv('/content/titles.csv')
df2 = pd.read_csv('/content/credits.csv')
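
The guidelines above ask for exception handling, so a guarded loader may be worth considering. The following is a minimal sketch under stated assumptions, not the loading code this notebook uses: `load_csv` and `required_columns` are hypothetical names, and the demo reads a throwaway file with made-up rows rather than the real dataset.

```python
import os
import tempfile

import pandas as pd


def load_csv(path, required_columns=()):
    """Read a CSV file, raising a clear error if the file or expected columns are missing."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Dataset not found: {path}")
    frame = pd.read_csv(path)
    missing = [col for col in required_columns if col not in frame.columns]
    if missing:
        raise ValueError(f"{path} is missing expected columns: {missing}")
    return frame


# Demo on a temporary file (illustrative data only, not the real titles.csv)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "titles_demo.csv")
    pd.DataFrame({"id": ["a1", "a2"], "title": ["One", "Two"]}).to_csv(demo_path, index=False)
    demo_df = load_csv(demo_path, required_columns=["id", "title"])

print(demo_df.shape)
```

Failing early with a descriptive message makes a missing upload easier to diagnose than a traceback from a later cell.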

### Dataset First View

In [None]:
# Dataset First Look
df1.head()

In [None]:
df2.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# Number of rows and columns in df1
rows_df1, cols_df1 = df1.shape
print(f"df1: Rows = {rows_df1}, Columns = {cols_df1}")

# Number of rows and columns in df2
rows_df2, cols_df2 = df2.shape
print(f"df2: Rows = {rows_df2}, Columns = {cols_df2}")

### Dataset Information

In [None]:
# Dataset Info df1
df1.info()

In [None]:
# Dataset Info df2
df2.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count df1

df1.duplicated().sum()


In [None]:
# Dataset Duplicate Value Count df2

df2.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count df1

print(df1.isnull().sum())

In [None]:
# Missing Values/Null Values Count df2

print(df2.isnull().sum())

In [None]:
# Visualizing the missing values

# For df1
plt.figure(figsize=(10, 6))
sns.heatmap(df1.isnull(), cbar=False, cmap='coolwarm', yticklabels=False)
plt.title('Missing Values in df1')
plt.show()

# For df2
plt.figure(figsize=(10, 6))
sns.heatmap(df2.isnull(), cbar=False, cmap="coolwarm", yticklabels=False)
plt.title('Missing Values in df2')
plt.show()

### What did you know about your dataset?

**Understanding the Dataset :**

The dataset consists of two CSV files:

* titles.csv (df1) – Contains metadata about movies and TV shows available on Amazon Prime.
* credits.csv (df2) – Contains information about the cast and crew involved in these movies and shows.

**1. Structure of the Dataset**

**titles.csv (Movie & TV Show Metadata)
This dataset has 9,871 records and 15 columns, including:**

* **id:** Unique identifier for each title.
* **title:** Name of the movie or TV show.
* **type:** Categorizes the content as MOVIE or SHOW.
* **description:** A brief synopsis of the title.
* **release_year:** The year the title was released.
* **age_certification:** Age rating (e.g., PG-13, R, TV-MA), but is missing for many records.
* **runtime:** Duration in minutes.
* **genres:** List of genres (e.g., Action, Drama, Comedy).
* **production_countries:** The country/countries where the title was produced.
* **seasons:** Number of seasons (only applicable to TV shows).
* **imdb_id:** IMDb identifier for external reference.
* **imdb_score & tmdb_score:** Ratings from IMDb and TMDb.
* **imdb_votes:** Number of votes on IMDb.
* **tmdb_popularity:** Popularity score on TMDb.

**credits.csv (Cast & Crew Information)
This dataset contains 124,235 records and 5 columns:**

* **person_id:** Unique identifier for each person.
* **id:** Corresponds to titles.csv to link cast/crew to specific titles.
* **name:** Name of the actor or crew member.
* **character:** Role played by an actor (e.g., "Sherlock Holmes").
* **role:** The person's role in production (e.g., ACTOR, DIRECTOR, WRITER).

**2. Data Quality Analysis**

**Duplicates**

* credits.csv contains 56 duplicate rows that should be removed.
* titles.csv contains 3 duplicate rows that should be removed.

**Missing Values**

The dataset has some missing values in key columns:

**credits.csv Missing Values:**

* character: 16,287 missing values.

**titles.csv Missing Values:**

* age_certification: 6,487 missing values.
* seasons: 8,514 missing values.
* imdb_score: 1,021 missing values.
* tmdb_score: 2,082 missing values.
* imdb_votes: 1,031 missing values.
* description: 119 missing values.
* tmdb_popularity: 547 missing values.

**3. Key Observations**

* The dataset contains information about a variety of movies and TV shows.
* The IMDb and TMDb scores can be useful for analyzing popularity and quality.
* Genres and runtime can help analyze trends in content duration and categories.
* Cast and crew data allows for deeper analysis of popular actors and directors.
* Handling missing values and duplicates is crucial for improving data quality.



## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

print(df1.columns)
print(df2.columns)

In [None]:
# Dataset Describe (df1)

df1.describe(include="all")

In [None]:
# Dataset Describe (df2)

df2.describe(include="all")

### Variables Description


**titles.csv (df1)**

* **id:** A unique identifier for each title (movie or TV show). It acts as the key for joining with credits.csv and is not analyzed as a quantity.
* **title:** The name of the movie or TV show. This is a textual variable.
* **type:**  Indicates whether the title is a 'MOVIE' or a 'SHOW'. This is a categorical variable.
* **description:** A brief summary or synopsis of the movie or TV show. This is a textual variable.
* **release_year:** The year the title was released. This is a numerical variable.
* **age_certification:** The age rating for the title (e.g., PG-13, R, TV-MA). This is a categorical variable, with potential missing values.
* **runtime:**  The duration of the movie or TV show in minutes. This is a numerical variable.
* **genres:** A list of genres associated with the title (e.g., Action, Comedy, Drama). This is a categorical variable, potentially multi-valued.
* **production_countries:** A list of countries where the title was produced. This is a categorical variable, potentially multi-valued.
* **seasons:** The number of seasons for a TV show.  This is a numerical variable, applicable only to TV shows, and will have many missing values for movies.
* **imdb_id:**  A unique identifier from IMDb for the title. This is a textual variable.
* **imdb_score:** The average user rating from IMDb. This is a numerical variable.
* **imdb_votes:** The number of votes received on IMDb. This is a numerical variable.
* **tmdb_popularity:** The popularity score from TMDb. This is a numerical variable.
* **tmdb_score:** The average user rating from TMDb. This is a numerical variable.


**credits.csv (df2)**

* **person_id:** A unique identifier for each person involved in the production. This is a numerical variable.
* **id:** A foreign key linking to the 'id' column in titles.csv, indicating the title associated with the person. Like the key it references, it is an identifier rather than a quantity.
* **name:** The name of the person involved in the production.  This is a textual variable.
* **character:**  The specific character played by an actor in the title. This is a textual variable, and many missing values are expected.
* **role:** The person's role in the production (e.g., actor, director, writer). This is a categorical variable.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Check Unique Values for each variable in df1
for i in df1.columns.tolist():
  print("No. of unique values in",i,"is",df1[i].nunique())


In [None]:
# Check Unique Values for each variable in df2
for i in df2.columns.tolist():
  print("No. of unique values in",i,"is",df2[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Merge the two dataframes
df_merge = pd.merge(df1, df2, on='id', how='left')
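
A left merge keeps every title, including titles that have no credits, and repeats each title once per credited person. The sketch below, on made-up toy frames, shows one way to check both effects with the `indicator` and `validate` options of `pd.merge`. The `*_demo` names are hypothetical. Note that `validate="one_to_many"` raises an error when `id` repeats on the left side, so on the real data it would presumably need the duplicate title rows removed first.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (made-up values)
titles_demo = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
credits_demo = pd.DataFrame({"id": ["t1", "t1", "t2"], "name": ["X", "Y", "Z"]})

# indicator=True adds a '_merge' column showing where each row came from;
# validate raises MergeError if 'id' is not unique on the left side
merged_demo = pd.merge(
    titles_demo, credits_demo, on="id", how="left",
    indicator=True, validate="one_to_many",
)

# Titles with no matching credits are tagged 'left_only'
unmatched_titles = (merged_demo["_merge"] == "left_only").sum()
print(f"Rows: {len(merged_demo)}, titles without credits: {unmatched_titles}")
```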

In [None]:
# Create a copy of the current dataset and assigning to df
df = df_merge.copy()

# Select key columns to keep for analysis
key_columns =['id', 'title', 'type', 'release_year',
       'age_certification', 'runtime', 'genres', 'production_countries',
       'seasons', 'imdb_id', 'imdb_score', 'imdb_votes', 'tmdb_popularity',
       'tmdb_score', 'person_id', 'name', 'character','role']

# Filter df to include only the selected columns
df = df.loc[:,key_columns]

In [None]:
# Display the shape of the df (rows, columns)
df.shape

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

In [None]:
# Remove duplicate rows from the df

df.drop_duplicates(inplace=True)

In [None]:
# Check for duplicate records based on specific columns

df.duplicated(subset=['id' ,'person_id','name' , 'character']).sum()

In [None]:
# Remove duplicate rows while keeping the first occurrence

df.drop_duplicates(subset=['id' ,'person_id','name' , 'character'],keep= 'first',inplace=True)

In [None]:
# Display the shape of the df after removing duplicates

df.shape

In [None]:
# Function to check the percentage of missing values in each column

def missing_percentage(df):
  """
  Calculates the percentage of missing values in each column of a Pandas DataFrame.

  Args:
    df: The input Pandas DataFrame.

  Returns:
    A Pandas Series containing the percentage of missing values for each column.
  """
  missing_values = df.isnull().sum()
  percentage_missing = (missing_values / len(df)) * 100
  return percentage_missing

print(missing_percentage(df=df))

In [None]:
# Check the number of movies that have null values in the 'seasons' column

print(df[(df['type'] == 'MOVIE') & (df['seasons'].isnull())].shape)

In [None]:
# Display the count of unique values in the 'seasons' column (including NaN)

print(df['seasons'].value_counts(dropna=False).head())

In [None]:
# Replace NaN values in 'seasons' with 0 for movies (since movies don’t have seasons)

df.loc[df['type'] == 'MOVIE', 'seasons'] = 0

In [None]:
# Fill missing values in the 'age_certification' and  'character' columns with 'Unknown'

df.fillna({'age_certification' : 'Unknown' ,'character' : 'Unknown' } ,inplace=True)

In [None]:
# Re-check for missing values

print(missing_percentage(df=df))

In [None]:
# Drop remaining rows with missing values
df.dropna(inplace=True)

In [None]:
# Display the shape of the df (rows, columns)
df.shape

In [None]:
# Check the number of rows where 'genres' or 'production_countries' is empty

print(df[df["genres"] == "[]"].shape)
print(df[df["production_countries"] == "[]"].shape)

In [None]:
# Remove rows where 'genres' or 'production_countries' are empty

df = df[df["genres"] != "[]"]
df = df[df["production_countries"] != "[]"]

In [None]:
# Display the shape of the df after filtering
df.shape

In [None]:
# Print the maximum and minimum release years
print(df['release_year'].max())
print(df['release_year'].min())

# Keep only titles released in 2012 or later; .copy() avoids a SettingWithCopyWarning
# when columns are assigned on the filtered frame in later cells

df = df[df['release_year'] >= 2012].copy()

In [None]:
# Convert 'seasons' and 'person_id' columns to integer type
df['seasons'] = df['seasons'].astype(int)
df['person_id'] = df['person_id'].astype(int)


In [None]:
# Extract the primary genre from the list of genres
df['primary_genre'] = df['genres'].apply(lambda x: str(x.split(',')[0]).strip('[]'))

# Extract the primary country from the list of production countries
df['primary_country'] = df['production_countries'].apply(lambda x: str(x.split(',')[0]).strip('[]'))

In [None]:
# Remove extra quotation marks from extracted genre and country values

df["primary_genre"] = df["primary_genre"].str.strip("'")
df["primary_country"] = df["primary_country"].str.strip("'")
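
The string splitting above relies on the list being stored as text such as `"['drama', 'comedy']"`. An alternative, sketched here on made-up values, is to parse each string with `ast.literal_eval` from the standard library and take the first element. `first_item` and `genres_demo` are hypothetical names, and the sketch assumes every entry is a valid Python list literal.

```python
import ast

import pandas as pd

# Toy rows in the same string format as 'genres' (made-up values)
genres_demo = pd.Series(["['drama', 'comedy']", "['action']", "[]"])


def first_item(list_string):
    """Return the first element of a list written as a string, or None if it is empty."""
    items = ast.literal_eval(list_string)
    return items[0] if items else None


primary_demo = genres_demo.apply(first_item)
print(primary_demo.tolist())
```

Parsing the literal avoids the separate quote-stripping step and handles empty lists explicitly, at the cost of failing loudly on malformed strings.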

In [None]:
# Create a new df excluding rows where age certification is 'Unknown'
df_without_unknown = df[df['age_certification'] != 'Unknown']

In [None]:
# Filtering df to include only TV Shows
tv_shows_data = df[df['type'] == 'SHOW']

In [None]:
# Function to mark outliers using Interquartile Range (IQR)

def mark_outliers_iqr(data, column):
    """
    Marks outliers in a DataFrame column using the Interquartile Range (IQR) method.

    Args:
        data: Pandas DataFrame containing the data.
        column: Name of the numerical column to analyze.

    Returns:
        The same DataFrame, modified in place, with an additional 'outlier' column
        (1 = outlier, 0 = not an outlier).
    """

    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    data['outlier'] = 0  # Initialize the outlier column
    data.loc[(data[column] < lower_bound) | (data[column] > upper_bound), 'outlier'] = 1

    return data
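
As a worked example of the IQR rule, the sketch below applies the same 1.5 × IQR fences to five made-up values. It is a non-mutating variant written for illustration: `iqr_outlier_mask` and `runtime_demo` are hypothetical names and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy runtimes in minutes (made-up values)
runtime_demo = pd.Series([1, 2, 3, 4, 100])


def iqr_outlier_mask(values):
    """Return a boolean Series that is True where a value lies outside the 1.5 * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (values < lower) | (values > upper)


mask_demo = iqr_outlier_mask(runtime_demo)
print(mask_demo.tolist())  # only the value 100 falls outside the fences
```

Here Q1 = 2 and Q3 = 4, so the IQR is 2 and the fences are -1 and 7.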

In [None]:
df.head()

### What all manipulations have you done and insights you found?

 **Data Cleaning and Preprocessing Summary**

**1. Merging Datasets:**
* Combined 'titles.csv' and 'credits.csv' on the common 'id' column using a left merge, so each row of the merged data is one title-person pair. The 'description' column was left out of the columns kept for analysis.

**2. Handling Duplicates and Missing Values:**
* Dropped fully duplicated rows, then dropped rows repeating the same 'id', 'person_id', 'name', and 'character', keeping the first occurrence.
* 'seasons': Set to 0 for movies, since movies do not have seasons.
* 'age_certification' and 'character': Replaced missing values with 'Unknown'.
* Remaining missing values (for example in 'imdb_score', 'tmdb_score', 'imdb_votes', and 'tmdb_popularity'): Dropped the affected rows rather than imputing them.
* Removed rows where 'genres' or 'production_countries' was an empty list ('[]').

**3. Filtering and Data Type Conversion:**
* Kept only titles released in 2012 or later.
* Converted 'seasons' and 'person_id' to integer type.

**4. Feature Engineering:**
* Extracted 'primary_genre' and 'primary_country' as the first entry of the 'genres' and 'production_countries' columns. This simplifies genre and region analysis by giving each title a single categorical value.
* Created two subsets for later charts: one without 'Unknown' age certifications and one containing only TV shows.
* Defined a helper, mark_outliers_iqr, to flag outliers with the IQR method.

**Insights:**

* Merging the datasets provides a richer dataset for analysis, allowing for correlations between production details (cast, crew) and movie/show characteristics. Because each title appears once per credited person, title-level counts need one row per title.
* Missing values were handled conservatively. Categorical gaps were labeled 'Unknown', and rows with missing numerical values were dropped instead of imputed, which reduces the row count but avoids introducing estimated scores.
* Feature engineering creates new variables for further analysis, like 'primary_genre' and 'primary_country' for simplified genre-based and region-based analysis.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Histogram On Distribution Of IMDB Scores

In [None]:
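# Create a histogram of the distribution of IMDB scores
# (one row per title, so titles with large casts do not weight the counts)

fig = px.histogram(df.drop_duplicates(subset='id'), x="imdb_score", nbins=30, title="Distribution of IMDB Scores")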
fig.update_layout(xaxis_title="IMDB Score", yaxis_title="Count")
fig.show()

##### 1. Why did you pick the specific chart?

I chose a histogram to visualize the distribution of IMDB scores. Histograms are well suited to showing the frequency distribution of a single numerical variable: grouping scores into bins makes it easy to see the central tendency, spread, and skewness of the IMDB scores.


##### 2. What is/are the insight(s) found from the chart?

 The distribution of IMDB scores appears to be roughly normal or slightly right-skewed.  Most movies/shows have IMDB scores in a certain range, with a few outliers at higher or lower scores.  The peak of the distribution indicates the most frequent IMDB score.
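
The shape described above can be checked numerically. The sketch below, on made-up scores, compares the mean with the median and computes the sample skewness with `Series.skew()`. The `*_demo` names are hypothetical, and the toy numbers say nothing about the real catalog.

```python
import pandas as pd

# Toy title-person rows (made-up values); each title repeats once per credit
rows_demo = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3", "t4"],
    "imdb_score": [8.0, 8.0, 8.0, 6.0, 6.5, 3.0],
})

# Keep one row per title so large casts do not weight the statistics
scores_demo = rows_demo.drop_duplicates(subset="id")["imdb_score"]

summary_demo = {
    "mean": scores_demo.mean(),
    "median": scores_demo.median(),
    "skew": scores_demo.skew(),  # > 0: right tail is longer; < 0: left tail is longer
}
print(summary_demo)
```

A mean below the median together with negative skewness points to a longer left tail, while the reverse points to a longer right tail.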


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of IMDB scores can have a positive business impact in several ways:
* **Content Strategy:**  Knowing where scores concentrate helps inform content acquisition and production decisions. If most titles sit in the middle of the range, the platform could prioritize acquiring or producing higher-rated content to attract viewers and stay competitive, and it could review how it handles the lowest-rated titles.
* **Marketing and Promotion:**  Promoting movies and shows with high IMDB scores is likely to be more effective. This can be used to tailor marketing campaigns to focus on the strengths of well-rated content and attract a wider audience.
* **User Experience:**  The distribution of scores can guide the platform’s recommendation systems. By identifying content with high IMDB scores, the platform can tailor recommendations to match user preferences more effectively.

A potential negative insight is the presence of content with very low IMDB scores. A handful of poorly rated titles among many well-rated ones is unlikely to matter, but if a substantial share of the library scores low, viewers may conclude that the platform's overall standard is low, which could lead to churn and slower subscriber growth.



#### Chart - 2 Donut Chart On Distribution Of Content Types

In [None]:
# Calculate content type distribution (one row per title, not per credit)
content_type_counts = df.drop_duplicates(subset='id')['type'].value_counts()

# Create the donut chart (a pie chart with a hole)
fig = px.pie(
    values=content_type_counts.values,
    names=content_type_counts.index,
    title='Distribution of Content Types',
    hole=0.3  # Creates a donut chart
)
fig.show()

##### 1. Why did you pick the specific chart?

I chose a donut chart to visualize the distribution of content types (movies vs. TV shows). Donut charts are visually appealing variations of pie charts, making it easy to compare proportions of different categories within a whole.  The hole in the center can also hold a label such as the total number of entries, although this chart leaves it empty.  Since there are only two categories (movies and TV shows), a donut chart effectively displays the relative proportions.


##### 2. What is/are the insight(s) found from the chart?

The donut chart shows the proportion of movies versus TV shows in the dataset.  It quickly reveals whether one type of content is significantly more prevalent than the other.
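
The proportion depends on what a row represents. After the merge in the wrangling step, `df` holds one row per title-person pair, so counting rows weights each title by the size of its cast. The sketch below, on made-up rows, contrasts row-level and title-level shares. The `*_demo` and `*_share` names are hypothetical.

```python
import pandas as pd

# Toy title-person rows (made-up values): one movie with three credits, two shows with one each
rows_demo = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
})

# Share of rows per type (weighted by number of credits)
row_share = rows_demo["type"].value_counts(normalize=True)

# Share of titles per type (each title counted once)
title_share = rows_demo.drop_duplicates(subset="id")["type"].value_counts(normalize=True)

print(row_share)
print(title_share)
```

In this toy case movies account for most rows but only a third of the titles, which is why the two views can tell different stories.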


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of content types can have a positive business impact:
* **Content Acquisition:**  If the chart reveals a disproportionate amount of one content type, it can inform future content licensing or original production decisions. For example, if there’s a lack of TV shows, the business can prioritize acquiring or producing more TV shows to meet user demand and diversify its content library.
* **Marketing and Promotion:**  Marketing efforts can be tailored towards the more prevalent content type.  For instance, if movies make up most of the catalog, the platform could increase advertising and promotional campaigns targeted towards movie lovers.
* **User Experience:**  Knowing the proportion of each type of content could influence how content is organized and presented to users on the platform.  For example, grouping shows together and movies together makes it easier for users to find exactly what they are looking for.

A potential negative insight is an imbalance in content types. For example, if the platform has an overwhelming number of movies and very few TV shows, it might alienate users who prefer TV shows. Those users might switch to a different streaming service, which would hurt user growth and could cause revenue loss.


#### Chart - 3 Bar Chart On Top 10 Most Popular Genres On Amazon Prime

In [None]:
# Create a bar chart showing the top 10 most frequent primary genres (one row per title).
top_10_genres = df.drop_duplicates(subset='id')['primary_genre'].value_counts().nlargest(10)
fig = px.bar(
    x=top_10_genres.index,
    y=top_10_genres.values,
    labels={'x': 'Primary Genre', 'y': 'Count'},
    title='Top 10 Most Popular Genres On Amazon Prime',
    color=top_10_genres.index,
)
fig.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the top 10 most frequent primary genres because bar charts are effective for comparing the frequencies or counts of different categorical variables.  The length of each bar directly represents the number of movies/shows in each genre, enabling quick comparisons between genres.


##### 2. What is/are the insight(s) found from the chart?

The bar chart visually represents the top 10 most frequent genres in the dataset.  The insights gained include the relative prevalence of these genres and which genres are most common.  Note that this measures how often a genre appears in the catalog, not how often it is watched.  From this chart, we can determine which types of movies and shows are most represented in the catalog.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the most popular genres can have several positive business impacts:
* **Content Strategy:**  Knowing which genres are most prevalent can inform content acquisition decisions.  For example, if "Comedy" and "Drama" are the top genres, then the platform can decide whether to keep building on these categories or to broaden the catalog elsewhere.  This also informs content creation of original series and movies.
* **Marketing and Promotion:** Targeted marketing campaigns can be designed around the genres with the deepest catalog.  By identifying the most common genres, the business can tailor marketing strategies to specific genre preferences, increasing the effectiveness of advertising campaigns and drawing in new viewers or subscribers.
* **User Experience:**  The insights from the genre distribution can enhance the platform's recommendation systems.  By recognizing user preferences and popular genres, the system can offer more relevant suggestions.

A potential negative impact could result from an over-reliance on a few popular genres.  If the platform predominantly offers only a few dominant genres, it might alienate viewers who prefer other genres, limiting the appeal and variety of content available.  This could lead to user dissatisfaction and a negative impact on growth, as viewers search for other platforms that offer more genre variety.


#### Chart - 4 Scatter Plot On Relationship Between IMDB Score & TMDB Popularity

In [None]:
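# Create a scatter plot to explore the relationship between IMDB score and TMDB popularity
# (one point per title; the merged df repeats each title once per credit)
fig = px.scatter(df.drop_duplicates(subset='id'), x='imdb_score', y='tmdb_popularity',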
                 color='primary_genre', hover_data=['title'],
                 title='Relationship between IMDB Score and TMDB Popularity by Genre')
fig.update_layout(xaxis_title='IMDB Score', yaxis_title='TMDB Popularity')
fig.show()


##### 1. Why did you pick the specific chart?

 I chose a scatter plot to visualize the relationship between IMDB score and TMDB popularity. Scatter plots are ideal for identifying correlations or patterns between two numerical variables.  In this case, we want to see if higher IMDB scores tend to correlate with higher TMDB popularity.  Color-coding by genre adds another dimension, allowing us to see if certain genres exhibit different relationships.


##### 2. What is/are the insight(s) found from the chart?

The scatter plot helps us understand the relationship between IMDB score and TMDB popularity, and how this relationship varies across genres. We can observe whether there’s a positive correlation, negative correlation, or no correlation between the two metrics. The color coding by genre helps us identify potential genre-specific trends.  We are looking for clusters or patterns that might indicate that certain genres tend to be more popular (on TMDB) or better-rated (on IMDB).
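
Whether the two metrics move together can also be quantified. The sketch below, on made-up pairs, computes the Pearson correlation and a rank-based (Spearman) correlation, obtained here as the Pearson correlation of the ranks. The `*_demo` and `*_r` names are hypothetical, and the toy numbers do not describe the real data.

```python
import pandas as pd

# Toy score/popularity pairs (made-up values): monotonic but far from linear
pairs_demo = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_popularity": [1.0, 4.0, 9.0, 16.0, 100.0],
})

# Pearson measures linear association
pearson_r = pairs_demo["imdb_score"].corr(pairs_demo["tmdb_popularity"])

# Spearman measures monotonic association: the Pearson correlation of the ranks
spearman_r = pairs_demo["imdb_score"].rank().corr(pairs_demo["tmdb_popularity"].rank())

print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")
```

Popularity scores tend to have a long right tail, so a rank-based measure may be the more robust of the two.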


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the relationship between IMDB score and TMDB popularity can lead to several positive business impacts:
* **Content Selection:**  If there's a strong positive correlation, then the platform can prioritize content with high TMDB popularity, as it is likely to also have a high IMDB score and be well-received by viewers. This helps to target more successful content.
* **Marketing and Promotion:**  Movies/shows with both high IMDB scores and TMDB popularity can be highlighted in marketing campaigns. This is because these movies/shows are more likely to be appealing to a large audience.
* **Algorithm Development:**  The relationship between these two metrics can be used to improve the platform's recommendation algorithms.  The algorithms can be tuned to recommend content with a combination of high ratings and popularity.

A potential negative impact could arise from an over-reliance on TMDB popularity as a predictor of success. If the platform prioritizes content only based on TMDB popularity without considering other factors such as IMDB scores or critical reviews, it might miss out on high-quality but less popular titles. This could lead to a less diverse content library and potentially alienate some viewers who value critically acclaimed content over trends.  It could also cause a lack of diversity in genres.


#### Chart - 5 Line Chart On Trend Of Content Releases Over Years

In [None]:
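# Create a line chart showing the trend of content releases over the years.
# Count each title once; the merged df has one row per title-person pair.

release_year_counts = (
    df.drop_duplicates(subset='id')
    .groupby(['release_year', 'type']).size().reset_index(name='counts')
)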

fig = px.line(
    release_year_counts,
    x='release_year',
    y='counts',
    color='type',
    labels={'release_year': 'Release Year', 'counts': 'Number of Titles'},
    title='Trend of Content Releases Over the Years by Content Type'
)
fig.show()

##### 1. Why did you pick the specific chart?

I chose a line chart to visualize the trend of content releases over the years. Line charts are excellent for displaying trends over time.  By plotting the number of releases against the release year, we can easily identify any increasing or decreasing trends in the number of movies and TV shows released each year.  The use of color by content type helps in comparing the trends for movies versus TV shows.


##### 2. What is/are the insight(s) found from the chart?

The line chart shows the number of movies and TV shows released each year, allowing us to observe trends in content production and release over time. We can visually see if there are any periods of growth, decline, or stability in releases. We can also see if the release trends differ between movies and TV shows.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding content release trends can significantly impact business strategy:

* **Content Planning:**  The chart reveals periods of high or low content availability.  This can inform decisions about content acquisition and production.  If there's a consistent upward trend, the business might need to plan for increased content licensing or original productions.  If there are dips in releases, the business might need to analyze why and adjust content strategies accordingly.
* **Resource Allocation:**  The release trend data can guide resource allocation. Knowing the anticipated volume of content releases each year allows better planning of staff, marketing budget, and technical infrastructure.
* **Future Projections:**  Analyzing the trend can help forecast future content release volumes, enabling proactive planning and informed decisions about investment and growth.


**Negative Growth Insight:**  A downward trend in content releases could signal a potential problem.  Decreased content availability could lead to user dissatisfaction and churn as viewers have less content to choose from. This could trigger a loss of subscribers and revenue.  If, however, the downward trend in movie and show releases is coupled with a concurrent upward trend in the overall viewership (viewing hours), then there may be no problem.  We would need to examine other metrics, such as subscriber and revenue numbers.
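
To make "upward" and "downward" trends concrete, the yearly counts can be turned into year-over-year changes. This is a rough sketch under assumptions: the counts below are made up, the frame only mirrors the shape of `release_year_counts` from the cell above, and the calculation assumes each content type has a row for every consecutive year.

```python
import pandas as pd

# Toy counts shaped like `release_year_counts`; the numbers are made up.
release_year_counts = pd.DataFrame({
    "release_year": [2018, 2019, 2020, 2018, 2019, 2020],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "counts": [100, 120, 90, 20, 30, 45],
})

# Sort so that each row follows the previous year of the same content type.
trend = release_year_counts.sort_values(["type", "release_year"]).copy()

# Percentage change relative to the previous year, computed per content type.
trend["yoy_change_pct"] = trend.groupby("type")["counts"].pct_change() * 100

# Years in which a content type released fewer titles than the year before.
declines = trend[trend["yoy_change_pct"] < 0]

print(trend)
print(declines)
```

The first year of each content type has no previous year to compare against, so its change is `NaN` and it never appears in `declines`.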


#### Chart - 6 Box Plot On Distribution Of IMDB Scores For Different Genres

In [None]:
# Create a box plot to visualize the distribution of IMDB scores for different genres.

fig = px.box(df, x="primary_genre", y="imdb_score",
             title="Distribution of IMDB Scores Across Genres",
             color="primary_genre")
fig.update_layout(xaxis_title="Primary Genre", yaxis_title="IMDB Score")
fig.show()

##### 1. Why did you pick the specific chart?

I chose a box plot to visualize the distribution of IMDB scores for different genres because box plots are excellent for comparing the distributions of a numerical variable (IMDB score) across different categories (genres). They show the median, quartiles, and outliers for each genre, making it easy to see the central tendency, spread, and potential differences in the distributions.


##### 2. What is/are the insight(s) found from the chart?

The box plot reveals how IMDB scores are distributed across various genres.  We can see which genres tend to have higher median scores, wider ranges of scores, and more or fewer outliers.  It allows us to identify genres that consistently receive high or low ratings and understand the variability within each genre's ratings.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of IMDB scores across genres can provide several positive business impacts:

* **Targeted Content Acquisition:**  Genres with consistently high median IMDB scores and smaller ranges (indicating more consistent quality) can be prioritized for content acquisition. This helps ensure higher quality content for viewers.
* **Genre-Specific Marketing:**  The box plot helps tailor marketing campaigns to emphasize the strengths of each genre. For example, a genre with consistently high ratings can be marketed as having high-quality content.
* **Content Recommendations:**  The platform can use the insights to improve recommendations.  If a user prefers a specific genre, the platform can recommend titles from that genre with high IMDB scores, rather than those with low scores.
* **Original Content Strategy:**  This analysis can guide decisions around original productions. It's possible to identify genres with relatively higher potential ratings and produce more content in these genres.

**Negative Growth Insight:**  Genres with consistently low IMDB scores and a large range (indicating inconsistent quality) could signal potential problems. If the platform heavily invests in a particular genre that consistently underperforms in terms of ratings, it could lead to user dissatisfaction and churn.  Viewers may lose interest in the platform if it primarily offers low-rated content within a particular genre they enjoy, especially if competitors offer higher-quality options.
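
The "high median, small range" criterion described above can also be read off a summary table instead of the plot. The sketch below is illustrative only: the `sample` data is made up, and the helper `iqr` is a name introduced here, not something defined elsewhere in the notebook.

```python
import pandas as pd

# Toy data standing in for `df`; the genres and scores are made up.
sample = pd.DataFrame({
    "primary_genre": ["drama"] * 4 + ["comedy"] * 4 + ["horror"] * 4,
    "imdb_score": [7.0, 7.2, 7.4, 7.6, 5.0, 6.0, 7.0, 8.0, 4.0, 4.5, 5.0, 5.5],
})


def iqr(scores: pd.Series) -> float:
    """Interquartile range: the height of the box in a box plot."""
    return scores.quantile(0.75) - scores.quantile(0.25)


# One row per genre: median score, spread of the middle 50%, and title count.
genre_summary = (
    sample.groupby("primary_genre")["imdb_score"]
    .agg(median="median", iqr=iqr, titles="count")
    .sort_values("median", ascending=False)
)

print(genre_summary)
```

Including the title count matters in practice, because a genre with only a handful of titles can show a high median and a narrow box purely by chance.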


#### Chart - 7 Scatter Plot On Age Certification vs Avg IMDB Scores

In [None]:
# Compute the average IMDB score for each age certification
avg_scores = df.groupby("age_certification", as_index=False)["imdb_score"].mean()

# Create the scatter plot of the per-certification averages
fig = px.scatter(avg_scores, x="age_certification", y="imdb_score",
                 title="Age Certification vs. Average IMDB Score",
                 color="age_certification",  # Color points by age certification
                 labels={"age_certification": "Age Certification", "imdb_score": "Average IMDB Score"})
fig.show()


##### 1. Why did you pick the specific chart?

I chose a scatter plot to visualize the relationship between age certifications and average IMDB scores. Scatter plots are effective for identifying potential correlations or patterns between a categorical variable (age certification) and a numerical variable (IMDB score).  The color-coding by age certification helps distinguish the different age groups and their corresponding IMDB scores.


##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows the average IMDB score for each age certification category. We can see if there's a trend between the age rating and the IMDB score.  For example, we might observe if certain age certifications tend to have higher or lower average IMDB scores.  This could reveal potential biases in how different age groups are perceived or rated.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the relationship between age certification and IMDB scores can provide several positive business impacts:

* **Content Strategy:** If a particular age certification consistently receives high IMDB scores, the business may prioritize acquiring or producing more content within that rating category.
* **Marketing and Promotion:**  Marketing campaigns can target specific age groups by highlighting the quality of content (as measured by IMDB scores) available within their respective age ratings.
* **Parental Controls:** The data can inform improvements in parental controls and recommendations, making it easier for parents to find appropriate content for their children's age group.

**Negative Growth Insight:**  If a particular age certification consistently receives low IMDB scores, the platform might see negative growth in that target demographic. For example, if content rated for a younger audience consistently receives low ratings, it could lead to dissatisfaction among family viewers.  Competitors offering better-rated content for that age group might attract users away, resulting in negative growth.  Furthermore, if the age certification data shows that a particular demographic is under-represented in the catalog, marketing campaigns may fail to reach or retain that audience, resulting in slow or negative growth.


#### Chart - 8 Bar Chart On Top 10 Countries With Most Number Of Titles

In [None]:
# Top 10 countries with the most titles on Amazon Prime

country_counts = df['primary_country'].value_counts().nlargest(10)

fig = px.bar(
    x=country_counts.index,
    y=country_counts.values,
    labels={'x': 'Country', 'y': 'Number of Titles'},
    title='Top 10 Countries with the Most Titles on Amazon Prime',
    color=country_counts.index,  # Color the bars by country
)
fig.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the top 10 countries with the most titles because bar charts excel at comparing counts across categories.  The length of each bar directly represents the number of movies/shows from a specific country, making it easy to compare the representation of various countries.


##### 2. What is/are the insight(s) found from the chart?

The bar chart visually represents the top 10 countries with the highest number of titles available on the platform.  The insights gained include the relative prominence of each country in the content library.  This information is useful for understanding the geographical diversity of the platform's content.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the geographical distribution of content can have a positive impact:
* **Content Acquisition:**  The chart informs content acquisition decisions.  If there's underrepresentation from a specific region, the platform can focus on acquiring more content from that region to expand its geographical diversity.
* **Targeted Marketing:**  The platform can launch region-specific marketing campaigns based on the popularity of content from certain countries.  This allows the business to tailor its promotional efforts for maximum impact in each region.
* **User Experience:**  Knowing the countries represented in the catalog allows for better organization of the content and potentially creation of specific sections or collections for each region, improving user discoverability.
* **International Expansion:**  The data can inform decisions about international expansion, identifying regions where more content is needed or where user demand might be high.  Understanding local content preferences in countries could provide a competitive advantage.

**Negative Growth Insight:**  A lack of diversity in countries represented could limit the platform's appeal to international users.  If most content comes from a single country or a small group of countries, it might alienate viewers from other regions who are seeking more geographically diverse content.  This could lead to lower user engagement, decreased user satisfaction, and possible churn, ultimately impacting business growth negatively.  It may also signal a missed opportunity to tap into different markets and their local content preferences.
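
How concentrated the catalog is can be expressed as a share rather than judged from bar heights. The snippet below is a small sketch with made-up country codes; only the column name `primary_country` is taken from the notebook.

```python
import pandas as pd

# Toy column standing in for df['primary_country']; the values are made up.
countries = pd.Series(["US"] * 6 + ["IN"] * 2 + ["GB", "CA"], name="primary_country")

# Fraction of all titles attributed to each country, largest first.
country_share = countries.value_counts(normalize=True)

# Share held by the single largest country and by the three largest combined.
top_share = country_share.iloc[0]
top3_share = country_share.head(3).sum()

print(country_share)
print(f"Largest country: {country_share.index[0]} ({top_share:.0%} of titles)")
print(f"Top 3 countries: {top3_share:.0%} of titles")
```

Tracking these shares over time would show whether the catalog is becoming more or less geographically diverse.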


#### Chart - 9 Histogram On Distribution Of TV Show Seasons On Amazon Prime

In [None]:
# Filtering df to include only TV Shows
tv_shows_data = df[df['type'] == 'SHOW']

# Creating a histogram to show the distribution of seasons in TV shows
fig = px.histogram(tv_shows_data, x="seasons",
                   title="Distribution of TV Show Seasons on Amazon Prime",
                   labels={"seasons": "Number of Seasons"},
                   color_discrete_sequence=px.colors.qualitative.Prism)  # Use a qualitative color palette
fig.update_layout(xaxis_title="Number of Seasons", yaxis_title="Number of TV Shows")
fig.show()


##### 1. Why did you pick the specific chart?

I chose a histogram to visualize the distribution of the number of seasons in TV shows. Histograms are effective for displaying the frequency distribution of a numerical variable. In this case, we want to see how many TV shows have 1 season, 2 seasons, 3 seasons, and so on.  This helps to understand the typical length of TV shows available on the platform.


##### 2. What is/are the insight(s) found from the chart?

The histogram shows the distribution of the number of seasons for TV shows. We can observe the most frequent number of seasons, the range of seasons, and the overall shape of the distribution (e.g., is it skewed, normal, etc.). This information helps to understand the typical length of TV shows on the platform and identify any outliers (shows with unusually high or low numbers of seasons).


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the distribution of seasons in TV shows can have a positive business impact:
* **Content Strategy:** The distribution can inform content acquisition and original programming decisions.  If most shows have a certain number of seasons, this might indicate a viewer preference for shows of that length.  Acquiring content that aligns with these viewer preferences can enhance user satisfaction.
* **User Experience:**  Knowing the season distribution can help in the design of user interfaces and navigation.  Understanding the typical show length can influence how seasons are grouped and displayed, improving user experience.
* **Marketing and Promotion:**  Marketing efforts can highlight shows with a popular number of seasons or cater to preferences for shorter or longer series, depending on the distribution.

**Negative Growth Insight:** An overrepresentation of very short TV shows (e.g., a large proportion of shows with only one season) might indicate a potential problem.  Viewers might prefer longer series, and a lack of variety in length could lead to user dissatisfaction and a negative impact on growth.  Alternatively, an overrepresentation of very long series (e.g., tens of seasons) might not be appealing to the majority of the viewers.  Viewers might prefer shorter series or movies instead of committing to a lengthy TV show. This would require further analysis to understand viewer preferences related to show length.  If viewer data and user surveys show a preference for longer format TV content, and there is an under-representation of such content, that could also lead to negative growth.
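
Whether single-season shows are actually over-represented can be checked with a couple of summary numbers. This is a hedged sketch on made-up rows; it assumes, as in the cell above, that TV shows are marked with `type == 'SHOW'` and that `seasons` is missing for movies.

```python
import pandas as pd

# Toy rows standing in for `df`; the values are made up.
sample = pd.DataFrame({
    "type": ["SHOW", "SHOW", "SHOW", "SHOW", "MOVIE", "SHOW"],
    "seasons": [1, 1, 2, 5, None, 1],
})

# Keep only TV shows, mirroring the filter used for the histogram.
tv = sample[sample["type"] == "SHOW"]

# Number of shows for each season count, ordered by season count.
season_counts = tv["seasons"].value_counts().sort_index()

# Fraction of shows that have exactly one season, and the typical length.
single_season_share = (tv["seasons"] == 1).mean()
median_seasons = tv["seasons"].median()

print(season_counts)
print(f"Single-season share: {single_season_share:.0%}")
print(f"Median seasons: {median_seasons}")
```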


#### Chart - 10 Density Heatmap On Distribution Of IMDB Score

In [None]:
# Creating Density Heatmap Chart On IMDB Score Distribution Across Various Genres

fig = px.density_heatmap(df, x='primary_genre', y='imdb_score',
                         title='IMDB Score Distribution Across Genres (Heatmap)',
                         labels={'primary_genre': 'Primary Genre', 'imdb_score': 'IMDB Score'},
                         nbinsx=20, nbinsy=20) # Adjust nbins for resolution
fig.show()



##### 1. Why did you pick the specific chart?

I chose a density heatmap to visualize the distribution of IMDB scores across different genres because it effectively shows the concentration of data points in a 2D space.  A heatmap allows us to identify areas where many movies/shows with similar IMDB scores exist for each genre. It provides a visual representation of the relationship between a categorical variable (genre) and a continuous variable (IMDB score), and it helps reveal areas of higher density (more movies/shows with those scores within a genre).


##### 2. What is/are the insight(s) found from the chart?

The density heatmap provides insights into how IMDB scores are distributed within different genres. It allows us to identify genres that tend to have movies/shows clustered around certain scores. We can see which genres have a higher concentration of high-scoring movies/shows and which genres show more variation or lower scores. For example, a genre might show a dense cluster of ratings in the 7-8 range, or some genres may have a more dispersed rating distribution, indicating greater variability in quality for that particular genre.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding IMDB score distribution within genres has several potential business impacts.
* **Genre-Specific Content Focus:**  The heatmap reveals genres with higher concentrations of high IMDB scores, indicating a preference for quality content in those genres.  The platform can focus more on acquiring/producing high-quality content within those genres.
* **Content Curation:**  Insights from the density heatmap can inform the curation of content collections for users. By grouping high-density areas within genres, the platform can suggest relevant and high-quality content.
* **Understanding User Preferences:**  The density heatmap can provide clues about viewer preferences for genres and score ranges.  This can inform future content acquisition and production decisions.

**Negative Growth Insight:**  Genres showing a high concentration of low IMDB scores could lead to negative growth.  If the platform continues to heavily invest in or promote genres that primarily receive low ratings, this may drive away viewers.  If viewer data suggests a preference for high quality shows and the platform fails to cater to that, then the platform may see negative growth.  This could also impact the perception of the platform in the market.  Competitors with similar genres, but with a higher density of high scores, may attract subscribers and content creators.
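
The densities in the heatmap have a simple tabular counterpart: the share of each genre's titles that falls into each score band. The sketch below uses made-up data, and the band edges (5 and 7) and the labels `low`, `mid`, `high` are arbitrary choices for illustration, not thresholds taken from the analysis.

```python
import pandas as pd

# Toy data standing in for `df`; the genres and scores are made up.
sample = pd.DataFrame({
    "primary_genre": ["drama", "drama", "drama", "comedy", "comedy",
                      "horror", "horror", "horror"],
    "imdb_score": [7.5, 8.1, 6.2, 5.5, 7.1, 3.9, 4.8, 6.5],
})

# Assumed score bands: (0, 5] is low, (5, 7] is mid, (7, 10] is high.
sample["score_band"] = pd.cut(
    sample["imdb_score"], bins=[0, 5, 7, 10], labels=["low", "mid", "high"]
)

# Row-normalised table: each row sums to 1 and shows how a genre's titles
# are spread across the score bands.
band_share = pd.crosstab(
    sample["primary_genre"], sample["score_band"], normalize="index"
)

print(band_share.round(2))
```

Sorting this table by the `low` column is a quick way to surface the genres with a high concentration of low scores.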


#### Chart - 11 Correlation Heatmap

In [None]:
import plotly.figure_factory as ff

# Selecting relevant numerical columns for correlation analysis
correlation_columns = ['runtime', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score', 'seasons']

# Compute the correlation matrix (ensure no NaNs)
correlation_matrix = df[correlation_columns].corr().fillna(0)  # Replace NaNs with 0

# Convert to NumPy array
z_values = correlation_matrix.values

# Create annotations (convert numerical values to strings with 2 decimal places)
annotations = [[f"{val:.2f}" for val in row] for row in z_values]

# Create heatmap
fig_heatmap = ff.create_annotated_heatmap(
    z=z_values,
    x=list(correlation_matrix.columns),
    y=list(correlation_matrix.index),
    annotation_text=annotations,  # Display correlation values
    colorscale='RdBu',  # You can try 'Viridis' or 'Blues' if you prefer
    reversescale=True  # Reverse color scale for better visualization
)

# Adding title to the layout
fig_heatmap.update_layout(
    title="Heatmap: Correlation Between Numeric Variables",
    xaxis_title="Variables",
    yaxis_title="Variables",
    font=dict(family="Arial", size=14, color="black"),
    height=600, width=1100  # Adjust size if needed
)

# Show the heatmap
fig_heatmap.show()


##### 1. Why did you pick the specific chart?

I chose a correlation heatmap to visualize the relationships between several numerical variables in the dataset. Heatmaps are ideal for displaying correlation coefficients between multiple variables simultaneously.  The color intensity and shading represent the strength and direction of the correlation, making it easy to quickly identify strongly correlated or inversely correlated variables.


##### 2. What is/are the insight(s) found from the chart?

The heatmap shows the correlation coefficients between pairs of numerical variables, such as runtime, IMDB score, IMDB votes, TMDB popularity, TMDB score, and seasons (for TV shows).  Positive coefficients indicate that as one variable increases, the other tends to increase as well.  Negative coefficients indicate an inverse relationship: as one variable increases, the other tends to decrease. Because `RdBu` is a diverging color scale, the two hues separate the high and low ends of the range, and the annotated value in each cell gives the exact sign and strength of the relationship.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding the correlations between these variables can have a significant business impact:

* **Content Recommendation:** Strong correlations can inform the content recommendation engine. For instance, if runtime and IMDB score are positively correlated, the platform can recommend longer movies/shows with high scores to users who enjoy similar content.
* **Content Strategy:**  If TMDB popularity and IMDB score have a strong positive correlation, it suggests that popular movies/shows on TMDB tend to have higher ratings on IMDB.  This can inform the acquisition strategy, prioritizing content that's already popular on other platforms.
* **Resource Allocation:**  Identifying correlations can help allocate resources efficiently. For example, if marketing spend data were available (it is not part of this dataset) and showed a strong correlation with IMDB votes, that would suggest marketing efforts influence viewer engagement and could guide future marketing strategies.

**Negative Growth Insight:** Weak or negative correlations between key metrics could indicate potential problems.  For example, if there's a negative correlation between runtime and IMDB score, it suggests that longer movies/shows tend to have lower ratings.  This could imply that users dislike long movies/shows or that longer content might not be as well-produced.  Continuing to produce long content may result in lower user satisfaction.

Conversely, a weak positive correlation between marketing spend (which this dataset does not include) and IMDB votes could mean that marketing campaigns are not as effective at driving engagement as anticipated, indicating a need to revisit marketing strategies to positively impact key metrics like viewership or subscriptions. If the platform has limited resources and invests heavily in content with low correlation to success metrics (like IMDB score or viewership), it might see a lower return on investment and slower growth.
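
To see at a glance which relationships are strongest, the correlation matrix can be flattened into a ranked list of variable pairs. The following is a sketch on made-up numbers; the column names match a subset of `correlation_columns`, but the values and the resulting ranking are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df[correlation_columns]; the values are made up.
sample = pd.DataFrame({
    "runtime": [90, 100, 110, 120, 45, 50],
    "imdb_score": [6.0, 6.5, 7.0, 7.5, 8.0, 7.8],
    "imdb_votes": [1000, 5000, 20000, 80000, 3000, 2500],
    "tmdb_popularity": [3.0, 6.0, 15.0, 60.0, 5.0, 4.0],
})

corr = sample.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so that each
# pair of variables appears exactly once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to a Series indexed by (variable, variable) and rank the pairs
# by the absolute size of their correlation.
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)

print(pairs)
```

Ranking by absolute value keeps strong negative relationships near the top of the list instead of pushing them to the bottom.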


#### Chart - 12 Pair Plot

In [None]:
# Select relevant numerical columns for pair plot
numerical_cols = ['runtime', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']

# Create the pair plot
sns.pairplot(df[numerical_cols])
plt.suptitle('Pair Plot of Numerical Variables', y=1.02) # Add title
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pair plot to visualize the relationships between multiple numerical variables simultaneously. A pair plot creates a matrix of scatter plots, where each plot shows the relationship between two variables.  The diagonal of the matrix shows the distribution of each individual variable (often a histogram or kernel density estimate). This allows for a quick visual overview of how all selected numerical variables relate to each other, which is useful for identifying patterns, correlations, and potential outliers.


##### 2. What is/are the insight(s) found from the chart?

The pair plot reveals relationships between pairs of numerical variables (e.g., runtime vs. IMDB score, IMDB score vs. TMDB popularity). From the scatter plots, we can observe trends, clusters, and correlations.  For example, a positive linear trend might indicate a positive correlation between two variables, while no clear pattern suggests a lack of correlation.  The distributions on the diagonal help understand the ranges and distributions of each individual variable.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, insights from the pair plot can have a positive business impact:
* **Content Strategy:** Identifying correlations between variables can inform content acquisition and production strategies. If runtime is negatively correlated with IMDB scores, it could guide the platform to focus on shorter content with a higher probability of receiving good ratings.
* **Targeted Marketing:** Correlations can be used to target specific segments of users with greater precision.  If IMDB votes are correlated with TMDB popularity, the platform can leverage that to boost campaigns for already-popular content.
* **Resource Allocation:** Understanding these correlations can help in efficiently allocating marketing budgets across different types of content.  If TMDB popularity and IMDB votes have a strong positive correlation, it might indicate that marketing efforts on highly popular titles on TMDB may translate to more votes on IMDB.

**Negative Growth Insight:**  A lack of correlation between seemingly related variables might highlight areas of inefficiency. For instance, if marketing spend, which is not part of this dataset, showed no correlation with IMDB votes or TMDB popularity, it could suggest that the current marketing campaigns are ineffective or not targeted well enough. That would mean marketing investments are not translating into engagement and that the strategies need review.  Further investigation and more granular data might be necessary to uncover the reasons behind such weak correlations and improve future campaigns.  Additionally, if a key metric (like user retention) shows a negative trend in relation to certain variables, it points toward a need for changes to avert negative growth.


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis of content release trends, IMDB scores, and geographical distribution, here's a suggested approach for the client:

1. **Prioritize high-quality content:** Focus on acquiring and producing content with high IMDB scores, especially within genres demonstrating consistent quality.  Identify genres with higher median scores and smaller ranges, indicating more reliable high quality.  Pay attention to age certifications;  prioritize age categories with high average IMDB scores to maintain user satisfaction across all demographics.

2. **Diversify content geographically:** Expand the content library to include more titles from underrepresented countries. This increases the platform's appeal to a global audience, reduces reliance on a limited number of sources and mitigates the risk of alienating viewers in other regions.

3. **Optimize content planning and resource allocation:** Analyze content release trends to anticipate future demand, then plan content acquisition and production around them and adjust resource allocation (staff, budget, and technical infrastructure) to match. Periods of high content availability require increased resources; periods of lower content availability require adjustments.

4. **Refine recommendation algorithms:** Incorporate IMDB scores, TMDB popularity, and genre information to improve the recommendation engine.  This ensures users are recommended high-quality, relevant content that caters to their preferences.

5. **Targeted marketing:** Create targeted marketing campaigns for specific genres, age certifications, and regions. Highlight high-quality content and promote it based on the identified trends.

6. **Monitor key metrics:** Continuously monitor key metrics such as user engagement, churn rate, and subscriber numbers.  Closely track content performance to identify any genres or regions where improvements are needed.

By implementing these strategies, the client can improve user satisfaction, attract new subscribers, and ensure long-term growth.

# **Conclusion**

This analysis of the Amazon Prime Movies and TV Shows dataset provided valuable insights into content trends, audience preferences, and geographical distribution.  Key findings include the temporal trends in content releases, the distribution of IMDB scores across different genres and age certifications, and the geographical representation of content.  These insights have direct implications for business strategy, guiding content acquisition, resource allocation, marketing efforts, and the development of personalized recommendation systems.  By focusing on high-quality content, diversifying the geographic representation, and proactively adapting to changing trends, the platform can improve user satisfaction, attract new viewers, and achieve sustainable growth.  Continuous monitoring of key performance indicators and a data-driven approach to content management are essential for maintaining a competitive advantage in the evolving streaming landscape.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***