# **Project Name**    - Amazon Prime Titles EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

This project performs Exploratory Data Analysis (EDA) on Amazon Prime titles and credits datasets.
We aim to uncover patterns, trends, and insights in movie/show metadata such as genres, release years, IMDb scores, and cast.
The EDA will help identify popular genres, top-rated content, trends over the years, and actor contributions.
The final notebook will include at least 20 visualizations following Univariate, Bivariate, and Multivariate approaches, with actionable business insights.

# **GitHub Link -**

https://github.com/ajaymolsivan/Amazon-Prime-Titles-EDA

# **Problem Statement**


Amazon Prime hosts a vast library of shows and movies. Understanding which factors contribute to a title's success can improve content acquisition and recommendation strategies.

#### **Define Your Business Objective?**

To analyze Amazon Prime's titles and identify patterns in genres, popularity, ratings, release trends, and cast involvement to support better decision-making in content investment and user recommendations.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
titles_df = pd.read_csv('titles.csv')
credits_df = pd.read_csv('credits.csv')


### Dataset First View

In [None]:
# Dataset First Look
print(titles_df.head())
print(credits_df.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Titles dataset shape:", titles_df.shape)
print("Credits dataset shape:", credits_df.shape)


### Dataset Information

In [None]:
# Dataset Info
print("Titles dataset info:")
print(titles_df.info())
print("\nCredits dataset info:")
print(credits_df.info())




#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Duplicate Titles:", titles_df.duplicated().sum())
print("Duplicate Credits:", credits_df.duplicated().sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing Values in Titles:")
print(titles_df.isnull().sum())
print("\nMissing Values in Credits:")
print(credits_df.isnull().sum())

In [None]:
# Visualizing the missing values
msno.matrix(titles_df)
msno.matrix(credits_df)

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Titles dataset columns:", titles_df.columns)
print("Credits dataset columns:", credits_df.columns)

In [None]:
# Dataset Describe
print("Titles dataset describe:")
print(titles_df.describe())
print("\nCredits dataset describe:")
print(credits_df.describe())

### Variables Description

titles.csv


id: Unique identifier for each title (used to link with credits).

title: Name of the movie or show.

type: Specifies whether the title is a MOVIE or a SHOW.

description: A short synopsis or summary of the content.

release_year: The year the content was released.

age_certification: Indicates the target age group (e.g., PG, TV-MA).

runtime: Duration of the movie/show in minutes.

genres: A list of genres the content belongs to (e.g., Drama, Comedy).

production_countries: Countries involved in production.

seasons: Number of seasons (only applicable for shows).

imdb_id: Identifier used for IMDb cross-referencing.

imdb_score: IMDb rating score (scale of 1–10).

imdb_votes: Total number of IMDb votes received.

tmdb_popularity: Popularity score from TMDB (The Movie Database).

tmdb_score: Rating score from TMDB.

credits.csv


id: Foreign key to map with titles.csv (id column).

name: Name of the actor, actress, director, or crew member.

role: Role played in production (e.g., ACTOR, DIRECTOR).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique values in Titles:")
print(titles_df.nunique())
print("\nUnique values in Credits:")
print(credits_df.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
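# Convert stringified lists (e.g. "['drama', 'comedy']") into Python lists.
# Values that are not strings (e.g. already-parsed lists on a re-run) are left as-is.
import ast
def to_list(value):
    return ast.literal_eval(value) if isinstance(value, str) else value
titles_df['genres'] = titles_df['genres'].apply(to_list)
titles_df['production_countries'] = titles_df['production_countries'].apply(to_list)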


### What all manipulations have you done and insights you found?

*   Converted stringified lists to Python lists
*   Handled missing values
*   Checked and removed duplicates
*   Exploded multi-valued columns
*   Mapped and grouped data
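The wrangling steps listed above can be sketched on a small example. This is a minimal sketch, not the notebook's actual pipeline: `safe_parse_list` is a hypothetical helper, and `toy_titles` is made-up data that only mimics the `id`, `title`, and `genres` columns of `titles.csv`.

```python
import ast
import pandas as pd

def safe_parse_list(value):
    """Parse a stringified list; return [] for missing or malformed values."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []

# Hypothetical toy rows standing in for titles_df (not real data)
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t2", "t3"],
    "title": ["A", "B", "B", "C"],
    "genres": ["['drama', 'comedy']", "['action']", "['action']", None],
})

# 1. Remove duplicate titles (same id)
toy_titles = toy_titles.drop_duplicates(subset="id").reset_index(drop=True)

# 2. Convert stringified lists, treating missing values as empty lists
toy_titles["genres"] = toy_titles["genres"].apply(safe_parse_list)

# 3. Explode to one row per (title, genre); empty lists become NaN and are dropped
genre_rows = toy_titles.explode("genres").dropna(subset=["genres"])

# 4. Group: count titles per genre
genre_counts = genre_rows["genres"].value_counts()
print(genre_counts)
```

Exploding first keeps later per-genre aggregations as plain `groupby` calls instead of repeated list-membership filters.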



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
titles_df['type'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Content Type Distribution')
plt.ylabel('Count')
plt.xlabel('Type')
plt.show()


##### 1. Why did you pick the specific chart?

To compare how the catalogue is split between movies and shows.

##### 2. What is/are the insight(s) found from the chart?

The bar chart shows which content type (MOVIE or SHOW) dominates the catalogue and by how much.

This split helps plan which content type to invest more in.



#### Chart - 2

In [None]:
# Chart - 2 visualization code
titles_df['age_certification'].value_counts().plot(kind='bar', color='orange')
plt.title('Age Certification Distribution')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To understand which age groups are most commonly targeted by Amazon Prime content.

##### 2. What is/are the insight(s) found from the chart?

Most content is rated "TV-MA" and "TV-14", indicating a focus on mature and teen audiences. Certifications like "PG" and "G" are relatively rare.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing which age groups are targeted most helps optimize marketing strategies and identify untapped age segments like children and families. Neglecting such segments could lead to missed opportunities.



#### Chart - 3

In [None]:
# Chart - 3 visualization code
from collections import Counter
genres = Counter([genre for sublist in titles_df['genres'] for genre in sublist])
pd.Series(genres).nlargest(10).plot(kind='bar', color='teal')
plt.title('Top 10 Genres')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

To identify the most frequently appearing genres and understand content diversity.

##### 2. What is/are the insight(s) found from the chart?

Drama, Comedy, and Action are the most dominant genres. Some niche genres like Documentary also appear in the top 10.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps identify which genres are already well-represented and where there might be saturation or room for new content types. Overemphasis on saturated genres may lead to viewer fatigue.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
titles_df['release_year'].value_counts().sort_index().plot(kind='line')
plt.title('Content Released per Year')
plt.ylabel('Number of Titles')
plt.show()


##### 1. Why did you pick the specific chart?

To analyze content production trends over the years.

##### 2. What is/are the insight(s) found from the chart?

There has been significant growth in the number of titles released post-2010, peaking around 2020.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Shows increasing investment in digital content. May also help identify years with high-quality releases or assess content aging.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
sns.histplot(titles_df['imdb_score'], bins=20, kde=True)
plt.title('Distribution of IMDb Scores')
plt.show()


##### 1. Why did you pick the specific chart?

To understand how content is generally rated by users.

##### 2. What is/are the insight(s) found from the chart?

Most content has IMDb scores between 6 and 8, indicating average to above-average quality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It provides a quality benchmark for future content development. Low ratings suggest areas of improvement; consistently low-rated content could harm reputation.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
sns.histplot(titles_df['tmdb_score'], bins=20, kde=True, color='purple')
plt.title('Distribution of TMDB Scores')
plt.show()


##### 1. Why did you pick the specific chart?

To compare TMDB scores with IMDb and check for consistency or biases.

##### 2. What is/are the insight(s) found from the chart?

TMDB scores also cluster around 6 to 8. A few titles exceed 9, which could be used as promotional highlights.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Comparing TMDB and IMDb can validate content quality and identify discrepancies.
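One way to make the IMDb/TMDB comparison concrete is to compute a per-title gap between the two scores. The sketch below assumes only the `imdb_score` and `tmdb_score` columns described earlier; `score_gap` is a hypothetical helper and `toy_scores` is made-up data.

```python
import pandas as pd

def score_gap(df, imdb_col="imdb_score", tmdb_col="tmdb_score"):
    """Return titles with both scores, sorted by the largest absolute gap."""
    scored = df.dropna(subset=[imdb_col, tmdb_col]).copy()
    scored["score_gap"] = scored[imdb_col] - scored[tmdb_col]
    scored["abs_gap"] = scored["score_gap"].abs()
    return scored.sort_values("abs_gap", ascending=False)

# Hypothetical toy rows (not real data)
toy_scores = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.0, 5.0, None, 8.0],
    "tmdb_score": [6.5, 8.0, 7.0, 8.0],
})

gaps = score_gap(toy_scores)
print(gaps[["title", "score_gap"]])
```

Titles at the top of this table are the ones where the two rating sources disagree most and may deserve a closer look.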

#### Chart - 7

In [None]:
# Chart - 7 visualization code
sns.histplot(titles_df['runtime'], bins=30, color='coral')
plt.title('Runtime Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

To see what runtime lengths are most common.

##### 2. What is/are the insight(s) found from the chart?

Most content ranges between 60 and 120 minutes. A few outliers exist below 30 minutes or over 180 minutes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps decide optimal content length based on viewer tolerance and platform expectations.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
titles_df.groupby('type')['imdb_score'].mean().plot(kind='bar', color='green')
plt.title('Average IMDb Score by Type')
plt.ylabel('Average Score')
plt.show()


##### 1. Why did you pick the specific chart?

To compare quality (via IMDb scores) between shows and movies.

##### 2. What is/are the insight(s) found from the chart?

Movies tend to have a slightly higher average score than shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Guides investment focus. If shows are underperforming, improvements in script or casting may be needed.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
sns.lineplot(data=titles_df, x='release_year', y='imdb_score')
plt.title('IMDb Score Trend Over Years')
plt.show()


##### 1. Why did you pick the specific chart?

To identify whether content quality (ratings) has improved or declined over time.

##### 2. What is/are the insight(s) found from the chart?

IMDb scores remain mostly stable over time with slight fluctuations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps evaluate if increased production volume is maintaining or reducing content quality.
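To check volume against quality directly, the yearly title count and the yearly mean IMDb score can be placed side by side. This is a minimal sketch on made-up data; `yearly_volume_and_quality` is a hypothetical helper that assumes the `release_year` and `imdb_score` columns.

```python
import pandas as pd

def yearly_volume_and_quality(df):
    """Return the number of scored titles and mean IMDb score per release year."""
    scored = df.dropna(subset=["imdb_score"])
    return (
        scored.groupby("release_year")["imdb_score"]
        .agg(title_count="count", avg_imdb="mean")
        .sort_index()
    )

# Hypothetical toy rows (not real data)
toy_years = pd.DataFrame({
    "release_year": [2019, 2019, 2020, 2020, 2020],
    "imdb_score": [7.0, 8.0, 6.0, 6.0, None],
})

yearly = yearly_volume_and_quality(toy_years)
print(yearly)
```

Note that `title_count` here counts only titles with an IMDb score, so years with many unscored titles will look smaller than in Chart 4.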

#### Chart - 10

In [None]:
# Chart - 10 visualization code
genre_scores = {}
for genre in genres:
    avg_score = titles_df[titles_df['genres'].apply(lambda x: genre in x)]['imdb_score'].mean()
    genre_scores[genre] = avg_score

pd.Series(genre_scores).sort_values(ascending=False).head(10).plot(kind='barh')
plt.title('Top Genres by Avg IMDb Score')
plt.xlabel('Average Score')
plt.show()


##### 1. Why did you pick the specific chart?

To evaluate which genres not only occur frequently but also perform best with audiences.

##### 2. What is/are the insight(s) found from the chart?

Documentary, History, and Crime genres have higher average scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps identify high-performing niches even if they are less frequent.
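Average scores for rare genres can be driven by a handful of titles, so it may help to report the title count next to the mean and apply a minimum threshold. The sketch below is one possible approach; `genre_score_summary` and its `min_titles` parameter are hypothetical, and the data is made up.

```python
import pandas as pd

def genre_score_summary(df, min_titles=2):
    """Mean IMDb score and title count per genre, keeping genres with enough titles."""
    exploded = df.explode("genres").dropna(subset=["genres", "imdb_score"])
    summary = exploded.groupby("genres")["imdb_score"].agg(["mean", "count"])
    kept = summary[summary["count"] >= min_titles]
    return kept.sort_values("mean", ascending=False)

# Hypothetical toy rows; genres are already parsed into Python lists
toy_genres = pd.DataFrame({
    "genres": [["drama", "crime"], ["drama"], ["comedy"], ["comedy", "drama"], ["history"]],
    "imdb_score": [8.0, 6.0, 5.0, 7.0, 9.5],
})

genre_summary = genre_score_summary(toy_genres, min_titles=2)
print(genre_summary)
```

In this toy example "history" has the highest score but only one title, so the threshold excludes it from the ranking.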

#### Chart - 11

In [None]:
# Chart - 11 visualization code
sns.boxplot(data=titles_df, x='age_certification', y='runtime')
plt.title('Runtime across Age Certifications')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

To analyze whether runtime is influenced by content rating.

##### 2. What is/are the insight(s) found from the chart?

TV-MA content tends to have longer runtimes. G or PG-rated content is generally shorter.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps in planning runtimes based on target age group and audience attention span.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
sns.scatterplot(data=titles_df, x='runtime', y='imdb_score')
plt.title('IMDb Score vs Runtime')
plt.show()


##### 1. Why did you pick the specific chart?

To check if longer content correlates with higher ratings.

##### 2. What is/are the insight(s) found from the chart?

No strong correlation, though extremely short or long content tends to receive more polarized ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps refine editing and pacing decisions.
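The visual impression from the scatter plot can be backed by a correlation coefficient. The sketch below computes Pearson correlation and a rank-based (Spearman-style) correlation, obtained here by ranking both columns and then applying Pearson; `runtime_score_correlation` is a hypothetical helper and the data is made up.

```python
import pandas as pd

def runtime_score_correlation(df):
    """Return (pearson, spearman) correlation between runtime and IMDb score."""
    pair = df[["runtime", "imdb_score"]].dropna()
    pearson = pair["runtime"].corr(pair["imdb_score"])
    # Spearman correlation is the Pearson correlation of the ranks
    spearman = pair["runtime"].rank().corr(pair["imdb_score"].rank())
    return pearson, spearman

# Hypothetical toy rows (not real data)
toy_runtime = pd.DataFrame({
    "runtime": [10, 20, 30, 40, 50],
    "imdb_score": [1.0, 2.0, 3.0, 10.0, None],
})

pearson_r, spearman_r = runtime_score_correlation(toy_runtime)
print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")
```

A value near zero for both coefficients would support the "no strong correlation" reading; a large gap between the two suggests a monotonic but non-linear relationship or the influence of outliers.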

#### Chart - 13

In [None]:
# Chart - 13 visualization code
sns.countplot(x='seasons', data=titles_df[titles_df['type'] == 'SHOW'])
plt.title('Seasons Distribution (SHOWS)')
plt.show()


##### 1. Why did you pick the specific chart?

To understand how many seasons most shows have.

##### 2. What is/are the insight(s) found from the chart?

Most shows have only 1 or 2 seasons, suggesting a trend of short series.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Helps in planning new series structures and setting viewer expectations.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization
sns.heatmap(titles_df[['runtime', 'imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity']].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

To identify correlations between numerical variables such as runtime, scores, and popularity.

##### 2. What is/are the insight(s) found from the chart?

IMDb votes strongly correlate with TMDB popularity. Runtime has a weak correlation with both scores.

#### Chart - 15 - Pair Plot

In [None]:
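# Pair Plot visualization code
# tmdb_popularity is included so the votes/popularity relationship discussed below is visible
sns.pairplot(titles_df[['runtime', 'imdb_score', 'tmdb_score', 'imdb_votes', 'tmdb_popularity']])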
plt.suptitle('Pairplot of Scores and Runtime', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

To explore multivariate relationships between scores, votes, and runtime.

##### 2. What is/are the insight(s) found from the chart?

High IMDb votes often align with higher TMDB popularity, but no clear pattern with runtime.


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

The client should focus more on producing content in popular genres with higher IMDb and TMDB scores. Shows with high vote counts and ratings indicate viewer satisfaction. Also, analysis of cast reveals top contributors.
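A cast analysis of this kind would join `credits.csv` to `titles.csv` on `id`. The sketch below shows one possible way to rank actors by number of titles and average IMDb score. It assumes only the `id`, `name`, and `role` columns described in the variables section; `top_actors` is a hypothetical helper and both tables are made-up data.

```python
import pandas as pd

def top_actors(titles, credits, n=3):
    """Rank actors by number of distinct titles, then by average IMDb score."""
    actors = credits[credits["role"] == "ACTOR"]
    merged = actors.merge(titles[["id", "imdb_score"]], on="id", how="inner")
    summary = merged.groupby("name").agg(
        title_count=("id", "nunique"),
        avg_imdb=("imdb_score", "mean"),
    )
    return summary.sort_values(["title_count", "avg_imdb"], ascending=False).head(n)

# Hypothetical toy tables (not real data)
toy_titles_small = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "imdb_score": [8.0, 6.0, 7.0],
})
toy_credits = pd.DataFrame({
    "id": ["t1", "t2", "t1", "t3"],
    "name": ["Alice", "Alice", "Bob", "Cara"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "DIRECTOR"],
})

actor_ranking = top_actors(toy_titles_small, toy_credits)
print(actor_ranking)
```

Filtering on `role` first keeps directors out of the actor ranking; the same function could be adapted to rank directors by changing that filter.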

# **Conclusion**

The EDA has uncovered valuable insights about content types, genres, ratings, release trends, and actor influence on Amazon Prime. These findings can guide content strategy, improve user engagement, and drive data-backed decisions. With a balanced focus on user preferences, genre diversity, and content quality, Amazon Prime can maintain a competitive edge in the streaming market.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***