
### Project Summary

In the era of digital entertainment, streaming platforms such as Amazon Prime Video have become dominant players, offering audiences across the globe a wide variety of movies, TV shows, and original productions. With the massive growth in streaming content, understanding the trends, preferences, and underlying patterns in such a vast library of titles is both fascinating and valuable. This project focuses on conducting an in-depth Exploratory Data Analysis (EDA) of Amazon Prime content using two key datasets: one containing detailed title information (titles.csv) and another capturing cast and crew details (credits.csv). Together, these datasets provide a comprehensive view of what Amazon Prime offers, how content has evolved over the years, and what factors might influence ratings and popularity.

The project begins by examining content distribution, analyzing the balance between movies and TV shows, release trends across years, and the average duration of films and series. This helps us understand how Amazon Prime has positioned its content portfolio and whether there has been a shift in production strategies over time. For instance, by studying year-wise release patterns, we can identify growth phases and explore how streaming platforms have adapted to global demand, particularly during significant events like the COVID-19 pandemic.

A key focus of the analysis is on genres, as they form the backbone of audience preferences. By identifying the most common genres, their frequency, and their performance in terms of IMDb and TMDB ratings, we can highlight which types of content resonate most with viewers. Tracking genre popularity across decades also reveals how audience tastes have evolved. For example, genres like drama and comedy have historically dominated, while genres such as documentary and thriller have gained traction in recent years. This provides valuable insights into content creation trends and how Amazon Prime caters to diverse audience needs.

Another crucial dimension explored is ratings and popularity. IMDb scores, votes, and TMDB ratings are used to measure quality and audience engagement. By analyzing the distribution of ratings, the correlation between IMDb and TMDB scores, and the relationship between votes and popularity, we gain a better understanding of what drives viewer satisfaction and recognition. Highlighting the highest-rated and most-voted content gives an indication of timeless classics as well as current audience favorites.

The project also delves into age certifications, shedding light on how Amazon Prime addresses audience segments based on maturity ratings. By examining the distribution of content across certifications such as G, PG, PG-13, and R, we can determine whether the platform is more family-oriented or targeted towards mature audiences. Furthermore, comparing ratings across certifications provides insights into whether content targeted towards certain age groups tends to perform better critically.

From a global perspective, country-wise analysis helps uncover the geographical diversity of Amazon Prime’s library. By identifying leading content-producing nations, analyzing US vs. non-US contributions, and comparing average ratings by country, we gain insights into international representation. A visual representation through a world map further strengthens this global outlook.

The inclusion of the credits.csv dataset allows for cast and crew analysis, which provides a human-centric dimension to the study. Identifying the most frequent actors, directors, and writers highlights the creative contributors behind the platform’s success. Furthermore, evaluating their contribution in terms of frequency and average ratings adds depth to understanding how much influence star power and creative direction have on audience reception.

Finally, the project investigates time-based trends and engagement metrics such as IMDb votes and TMDB popularity. By looking at decade-wise content growth, runtime patterns, and the evolution of genres, we trace the historical journey of entertainment on Amazon Prime. Engagement metrics further reveal the gap between critical acclaim and audience attention, helping us understand the interplay between quality and popularity.

In summary, this project provides a holistic view of Amazon Prime’s content strategy and audience reception through comprehensive exploratory data analysis. By combining statistical KPIs with visual storytelling, we uncover patterns that not only describe the present landscape of Amazon Prime’s library but also point towards emerging trends in the global streaming industry. The insights from this study can be leveraged by content creators, streaming platforms, and even audiences to understand what makes content engaging, successful, and impactful in today’s digital entertainment era.



#### Problem Statement

With the rapid expansion of streaming platforms like Amazon Prime Video, understanding the vast library of content has become a challenge. The platform hosts thousands of movies and shows across different genres, countries, and age groups, making it essential to analyze patterns that drive popularity and audience engagement. However, the data is scattered across multiple attributes such as ratings, runtime, cast, and genres, which makes it difficult to identify clear insights. This project aims to perform exploratory data analysis to uncover meaningful trends, evaluate content distribution, and highlight factors that contribute to the success and global reach of Amazon Prime’s content.


#### Business Objective

The objective of this project is to analyze Amazon Prime’s content library to uncover insights that can support data-driven decision-making for content strategy. By examining trends in genres, ratings, release years, popularity, and cast/crew contributions, the goal is to identify what resonates most with audiences. This analysis will help in understanding viewer preferences, optimizing content acquisition or production, and enhancing customer satisfaction. Ultimately, the insights aim to support Amazon Prime in strengthening its competitive edge in the global streaming market.

In [3]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly.express as px 

In [4]:
# Step 1 - Know Your Data
# Forward slashes keep the relative paths portable across Windows, macOS, and Linux
title_df = pd.read_csv("data/titles.csv")
credit_df = pd.read_csv("data/credits.csv")

In [5]:
title_df.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts20945,The Three Stooges,SHOW,The Three Stooges were an American vaudeville ...,1934,TV-PG,19,"['comedy', 'family', 'animation', 'action', 'f...",['US'],26.0,tt0850645,8.6,1092.0,15.424,7.6
1,tm19248,The General,MOVIE,"During America’s Civil War, Union spies steal ...",1926,,78,"['action', 'drama', 'war', 'western', 'comedy'...",['US'],,tt0017925,8.2,89766.0,8.647,8.0
2,tm82253,The Best Years of Our Lives,MOVIE,It's the hope that sustains the spirit of ever...,1946,,171,"['romance', 'war', 'drama']",['US'],,tt0036868,8.1,63026.0,8.435,7.8
3,tm83884,His Girl Friday,MOVIE,"Hildy, the journalist former wife of newspaper...",1940,,92,"['comedy', 'drama', 'romance']",['US'],,tt0032599,7.8,57835.0,11.27,7.4
4,tm56584,In a Lonely Place,MOVIE,An aspiring actress begins to suspect that her...,1950,,94,"['thriller', 'drama', 'romance']",['US'],,tt0042593,7.9,30924.0,8.273,7.6


In [6]:
credit_df.head()

Unnamed: 0,person_id,id,name,character,role
0,59401,ts20945,Joe Besser,Joe,ACTOR
1,31460,ts20945,Moe Howard,Moe,ACTOR
2,31461,ts20945,Larry Fine,Larry,ACTOR
3,21174,tm19248,Buster Keaton,Johnny Gray,ACTOR
4,28713,tm19248,Marion Mack,Annabelle Lee,ACTOR


In [7]:
title_df.shape

(9871, 15)

In [8]:
credit_df.shape

(124235, 5)

In [9]:
title_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9871 entries, 0 to 9870
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    9871 non-null   object 
 1   title                 9871 non-null   object 
 2   type                  9871 non-null   object 
 3   description           9752 non-null   object 
 4   release_year          9871 non-null   int64  
 5   age_certification     3384 non-null   object 
 6   runtime               9871 non-null   int64  
 7   genres                9871 non-null   object 
 8   production_countries  9871 non-null   object 
 9   seasons               1357 non-null   float64
 10  imdb_id               9204 non-null   object 
 11  imdb_score            8850 non-null   float64
 12  imdb_votes            8840 non-null   float64
 13  tmdb_popularity       9324 non-null   float64
 14  tmdb_score            7789 non-null   float64
dtypes: float64(5), int64(2), object(8)

In [10]:
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124235 entries, 0 to 124234
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   person_id  124235 non-null  int64 
 1   id         124235 non-null  object
 2   name       124235 non-null  object
 3   character  107948 non-null  object
 4   role       124235 non-null  object
dtypes: int64(1), object(4)
memory usage: 4.7+ MB


In [12]:
title_df.duplicated().sum()

3

In [13]:
credit_df.duplicated().sum()

56

In [14]:
title_df.isnull().sum()

id                         0
title                      0
type                       0
description              119
release_year               0
age_certification       6487
runtime                    0
genres                     0
production_countries       0
seasons                 8514
imdb_id                  667
imdb_score              1021
imdb_votes              1031
tmdb_popularity          547
tmdb_score              2082
dtype: int64

In [15]:
credit_df.isnull().sum()

person_id        0
id               0
name             0
character    16287
role             0
dtype: int64



A first look at the data shows that title_df has 9871 rows and 15 columns, while credit_df has 124235 rows and 5 columns. Checking for duplicates turned up 3 duplicate rows in title_df and 56 duplicate rows in credit_df. The null counts per column are: description 119, age_certification 6487, seasons 8514, imdb_id 667, imdb_score 1021, imdb_votes 1031, tmdb_popularity 547, and tmdb_score 2082 in title_df, plus character 16287 in credit_df. The two largest gaps deserve separate treatment: seasons appears to be populated only for shows (1357 non-null values), and age_certification is missing for roughly two thirds of all titles (6487 of 9871).
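Raw null counts are easier to compare across columns when expressed as a share of all rows. The following is a minimal sketch of such a report; `toy_df` and `null_report` are hypothetical names introduced for illustration, and the toy values are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for title_df
toy_df = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3", "tm4"],
    "age_certification": [np.nan, "PG", np.nan, np.nan],
    "seasons": [np.nan, np.nan, 2.0, np.nan],
    "imdb_score": [7.1, np.nan, 8.0, 6.2],
})

def null_report(df):
    """Return null counts and null percentages per column, largest share first."""
    report = pd.DataFrame({
        "null_count": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(2),
    })
    return report.sort_values("null_pct", ascending=False)

report = null_report(toy_df)
print(report)
```

Calling `null_report(title_df)` on the real data should give the same counts as `title_df.isnull().sum()`, with the percentage column alongside.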


In [16]:
title_df.columns

Index(['id', 'title', 'type', 'description', 'release_year',
       'age_certification', 'runtime', 'genres', 'production_countries',
       'seasons', 'imdb_id', 'imdb_score', 'imdb_votes', 'tmdb_popularity',
       'tmdb_score'],
      dtype='object')

In [17]:
credit_df.columns

Index(['person_id', 'id', 'name', 'character', 'role'], dtype='object')

In [18]:
title_df.describe()

Unnamed: 0,release_year,runtime,seasons,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
count,9871.0,9871.0,1357.0,8850.0,8840.0,9324.0,7789.0
mean,2001.327221,85.973052,2.791452,5.976395,8533.614,6.910204,5.984247
std,25.810071,33.512466,4.148958,1.343842,45920.15,30.004098,1.517986
min,1912.0,1.0,1.0,1.1,5.0,1.1e-05,0.8
25%,1995.5,65.0,1.0,5.1,117.0,1.232,5.1
50%,2014.0,89.0,1.0,6.1,462.5,2.536,6.0
75%,2018.0,102.0,3.0,6.9,2236.25,5.634,6.9
max,2022.0,549.0,51.0,9.9,1133692.0,1437.906,10.0


In [19]:
credit_df.describe()

Unnamed: 0,person_id
count,124235.0
mean,406473.7
std,561629.6
min,1.0
25%,38992.0
50%,133949.0
75%,571256.0
max,2371153.0



#### Variables Description

The project uses two datasets, titles.csv and credits.csv, containing detailed information about Amazon Prime content. The titles dataset includes variables such as title, type (MOVIE/SHOW), release_year, runtime, genres, age_certification, production_countries, imdb_score, and tmdb_popularity, which describe each piece of content’s characteristics and performance. The credits dataset provides information on the people involved in the content, including name, character, and role. Note that role takes only two distinct values in this data, one of which is ACTOR, so it cannot separate actors, directors, and writers into three groups. Both datasets share the id column, which links each credit row to its title. Together, these variables allow for a comprehensive analysis of content distribution, quality, audience ratings, and the creative contributors behind Amazon Prime’s library.


#### Check Unique Values for Each Variable

In [22]:
title_df.nunique()

id                      9868
title                   9737
type                       2
description             9734
release_year             110
age_certification         11
runtime                  207
genres                  2028
production_countries     497
seasons                   32
imdb_id                 9201
imdb_score                86
imdb_votes              3650
tmdb_popularity         5325
tmdb_score                89
dtype: int64
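
The 2028 distinct values in genres (and 497 in production_countries) count combinations, not individual genres or countries: as the head() output shows, each cell holds a string that looks like a Python list, such as "['romance', 'war', 'drama']". A minimal sketch of turning those strings into one row per genre follows; `toy_df` and `genre_list` are hypothetical names, and the toy rows are made up for illustration.

```python
import ast
import pandas as pd

# Hypothetical toy rows mimicking how genres is stored in titles.csv:
# each cell is a string that looks like a Python list.
toy_df = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3"],
    "genres": ["['comedy', 'drama']", "['drama']", "[]"],
})

# Parse each string into a real list; literal_eval only accepts Python literals
toy_df["genre_list"] = toy_df["genres"].apply(ast.literal_eval)

# One row per (title, genre); an empty list becomes NaN and is dropped
exploded = toy_df.explode("genre_list").dropna(subset=["genre_list"])

# Frequency of each individual genre
genre_counts = exploded["genre_list"].value_counts()
print(genre_counts)
```

The same pattern should apply to production_countries, assuming its cells follow the same list-like string format.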

In [23]:
credit_df.nunique()

person_id    80508
id            8861
name         79758
character    71097
role             2
dtype: int64

##### Handling Null Values

In [25]:
# Replacing NaN values with 'Not Rated' in the age_certification column.
# Assigning the result back avoids the chained-assignment FutureWarning that
# title_df[col].fillna(..., inplace=True) triggers in recent pandas versions.
title_df['age_certification'] = title_df['age_certification'].fillna("Not Rated")

# Replacing null values with the column mean in the imdb_score column
title_df['imdb_score'] = title_df['imdb_score'].fillna(title_df['imdb_score'].mean())

# Dropping duplicate rows from both title_df and credit_df
title_df.drop_duplicates(inplace=True)
credit_df.drop_duplicates(inplace=True)

# Merging title_df and credit_df on the shared 'id' column;
# the left join keeps titles that have no credit rows
mreged_df = title_df.merge(credit_df, on='id', how='left')

# Creating a decade column (added to title_df after the merge, so mreged_df does not contain it)
title_df['decade'] = (title_df['release_year'] // 10) * 10


In [27]:
mreged_df.shape

(125186, 19)
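
The merged frame has one row per (title, credit) pair, which is why it is far longer than title_df. The row count is consistent with the earlier outputs: 124235 - 56 = 124179 deduplicated credit rows, plus 9868 - 8861 = 1007 title ids that do not appear in credit_df, gives 125186. A minimal sketch of checking this kind of merge with the `validate` and `indicator` parameters of `DataFrame.merge` follows; the frames `titles` and `credits` and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames standing in for title_df and credit_df
titles = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3"],
    "title": ["A", "B", "C"],
})
credits = pd.DataFrame({
    "id": ["tm1", "tm1", "ts3"],
    "name": ["Ann", "Bob", "Cy"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],  # made-up role values
})

# validate raises MergeError if 'id' is not unique on the left side;
# indicator adds a _merge column recording where each row came from
merged = titles.merge(
    credits, on="id", how="left",
    validate="one_to_many", indicator=True,
)

# Titles that found no matching credit rows
no_credits = merged.loc[merged["_merge"] == "left_only", "id"].tolist()

# Rows per title: titles with more credits occupy more rows
rows_per_title = merged.groupby("id").size()
print(no_credits)
print(rows_per_title)
```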

In [28]:
# Bucket runtime into Short (< 30), Medium (30 to 90), and Long (> 90).
# np.select checks the conditions in order, like an if/elif chain,
# but works on whole columns instead of one row at a time.
mreged_df['runtime_category'] = np.select(
    [mreged_df['runtime'] < 30, mreged_df['runtime'] <= 90],
    ['Short', 'Medium'],
    default='Long',
)


In [30]:
# Add column is_movie: 1 for movies, 0 for shows
mreged_df['is_movie'] = (mreged_df['type'] == "MOVIE").astype(int)


In [31]:
# Categorize IMDb ratings (Poor, Average, Good, Excellent)
# Conditions are checked in order: missing, < 5, < 7, < 8.5, otherwise Excellent
score = mreged_df['imdb_score']
mreged_df['rating_category'] = np.select(
    [score.isna(), score < 5, score < 7, score < 8.5],
    ['Not Rated', 'Poor', 'Average', 'Good'],
    default='Excellent',
)
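
Because the null-handling step already filled missing imdb_score values with the column mean, no score is NaN at this point, so the 'Not Rated' branch never applies. The filled titles instead land in "Average", since the mean (about 5.98 according to describe()) falls between 5 and 7. One possible way to keep the two apart is to record which scores were missing before imputing. The sketch below uses `pd.cut` with left-closed bins; `toy_df` and `score_missing` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores; NaN marks a title without an IMDb score
toy_df = pd.DataFrame({"imdb_score": [4.2, np.nan, 7.0, 8.5, 6.9]})

# Record which scores are missing before any imputation overwrites them
toy_df["score_missing"] = toy_df["imdb_score"].isna()

# Left-closed bins reproduce the thresholds < 5, < 7, < 8.5, else Excellent
bins = [-np.inf, 5, 7, 8.5, np.inf]
labels = ["Poor", "Average", "Good", "Excellent"]
category = pd.cut(toy_df["imdb_score"], bins=bins, labels=labels, right=False)

# pd.cut leaves NaN for missing scores; label those rows explicitly
toy_df["rating_category"] = category.astype(object).where(
    ~toy_df["score_missing"], "Not Rated"
)
print(toy_df)
```

With the flag stored as its own column, the score itself can still be imputed afterwards for calculations that need a complete column.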

### KPIs (Key Performance Indicators) Calculation

In [32]:
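# Note: mreged_df has one row per (title, credit) pair, so the counts below are
# numbers of merged rows, and the averages are weighted by the number of credits per title.
total_titles = mreged_df.shape[0]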
total_movies = mreged_df[mreged_df['type'] == "MOVIE"].shape[0]
total_shows = mreged_df[mreged_df['type'] == "SHOW"].shape[0]
avg_runtime = mreged_df['runtime'].mean()
avg_imdb = mreged_df['imdb_score'].mean()
oldest_year = mreged_df['release_year'].min()
newest_year = mreged_df['release_year'].max()
print("Total Titles:", total_titles)
print("Movies:", total_movies)
print("Shows:", total_shows)
print("Average Runtime:", round(avg_runtime, 2))
print("Average IMDb Score:", round(avg_imdb, 2))
print("Release Years Range:", oldest_year, "-", newest_year)

Total Titles: 125186
Movies: 116685
Shows: 8501
Average Runtime: 95.35
Average IMDb Score: 5.97
Release Years Range: 1912 - 2022
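
These figures are computed on mreged_df, so "Total Titles: 125186" counts merged rows, not titles; title_df.nunique() reports 9868 distinct ids. The averages are affected too: the runtime mean here is 95.35, against 85.97 in title_df.describe(), because titles with many credits are counted many times. A minimal sketch of computing the same KPIs at title level follows; `toy_merged` and `per_title` are hypothetical names with made-up values.

```python
import pandas as pd

# Hypothetical toy frame shaped like the merged data: one row per (title, credit)
toy_merged = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "ts2", "tm3"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE"],
    "runtime": [120, 120, 120, 30, 90],
    "name": ["Ann", "Bob", "Cy", "Dee", None],
})

# Collapse back to one row per title before computing title-level KPIs
per_title = toy_merged.drop_duplicates(subset="id")

kpis = {
    "total_titles": per_title["id"].nunique(),
    "movies": int((per_title["type"] == "MOVIE").sum()),
    "shows": int((per_title["type"] == "SHOW").sum()),
    "avg_runtime": round(per_title["runtime"].mean(), 2),
}

# Same metric on the merged rows, for comparison: titles with many credits dominate
row_weighted_runtime = round(toy_merged["runtime"].mean(), 2)
print(kpis, row_weighted_runtime)
```

In the notebook itself, running the KPI code on title_df rather than mreged_df should have the same effect, since title_df already holds one row per title after duplicates are dropped.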
