<a href="https://colab.research.google.com/github/bidyashreenayak0211/Labmentix-Internship/blob/main/Amazon_Prime_TVShows_and_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**  - Bidyashree Nayak

# **Project Summary -**

**Amazon Prime Video** is one of the leading streaming platforms, offering a vast library of TV shows and movies to global audiences. To maintain its competitive edge and improve user engagement, it is essential to analyze its content library, understand audience preferences, and optimize content acquisition strategies.

This project involves performing **Exploratory Data Analysis (EDA)** on two datasets—titles.csv (which contains information about movies and TV shows) and credits.csv (which includes cast and crew details). The goal is to uncover patterns, trends, and insights that can help improve content offerings and business strategy.

The analysis includes:

**Data Cleaning & Preprocessing:** Handling missing values, removing duplicates, and standardizing data formats.

**Data Exploration & Visualization:** Understanding the distribution of movies vs. TV shows, genre popularity, IMDb/TMDB ratings, and cast/crew trends.

**Statistical Insights:** Identifying correlations, trends over time, and key performance indicators (KPIs).

By deriving insights from these datasets, Amazon Prime can make data-driven decisions to optimize content selection, improve audience satisfaction, and enhance its competitive position in the streaming industry.



# **GitHub Link -**

https://github.com/bidyashreenayak0211/Labmentix-Internship

# **Problem Statement**


In the competitive world of online streaming, platforms like **Amazon Prime Video, Netflix, and Disney+** are constantly striving to capture audience attention by providing engaging and high-quality content. With a vast content library spanning different genres, languages, and formats (TV shows and movies), **Amazon Prime Video needs to optimize its content strategy** to maximize user engagement, improve retention rates, and attract new subscribers.  

However, **several key challenges exist:**  

1. **Unbalanced Content Distribution**  
   - Does Amazon Prime have an optimal balance between movies and TV shows?  
   - Are certain genres overrepresented while others are underrepresented?  

2. **Content Quality & User Satisfaction**  
   - What is the overall quality of the content based on IMDb and TMDB ratings?  
   - Are there too many low-rated movies and TV shows, leading to negative user experiences?  

3. **Content Acquisition Strategy**  
   - Which genres, actors, and directors are the most successful in engaging audiences?  
   - Should Amazon Prime focus on acquiring more content from a particular genre or from certain high-performing directors/actors?  

4. **Cast & Crew Influence**  
   - Which actors and directors dominate the platform, and is there enough diversity in talent?  
   - Are there recurring character names that indicate popular franchises?  

5. **Content Growth Trends & Risks**  
   - How has the volume of movie and TV show releases changed over the years?  
   - Is there a decline in content additions that could negatively impact user retention?  

### **Business Impact of the Problem**  

- **User Dissatisfaction & Churn:** If Amazon Prime Video lacks high-quality or diverse content, users may unsubscribe and switch to competitors like Netflix, Disney+, or Hulu.  
- **Inefficient Content Investment:** Without a data-driven approach, Amazon Prime may invest in content that does not perform well, leading to financial losses.  
- **Declining New Subscriptions:** A lack of fresh, engaging content could make it difficult to attract new subscribers.  
- **Market Positioning Risks:** If competitors provide better content offerings, Amazon Prime could lose market share in the streaming industry.  

Thus, **Amazon Prime needs to analyze its content library, evaluate user preferences, and optimize content acquisition strategies to stay competitive and ensure long-term growth.** This project aims to address these challenges through **data-driven insights and exploratory data analysis (EDA)**.

#### **Define Your Business Objective?**

The primary business objectives of this analysis include:

**Understanding Content Distribution:** Determine the ratio of movies to TV shows, the most common genres, and the release trends.

**Assessing Content Quality:** Analyze IMDb and TMDB ratings to evaluate the overall quality of the content.

**Optimizing Content Acquisition:** Identify top-performing genres, actors, and directors to focus on acquiring high-engagement content.

**Evaluating Cast & Crew Trends:** Determine the most frequently appearing actors, directors, and recurring character names.

**Identifying Gaps & Risks:** Find potential weaknesses, such as underrepresented genres or declining release trends, that could impact growth.

**Enhancing Customer Satisfaction:** Ensure a balanced and engaging content library by addressing audience preferences.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
titles_df = pd.read_csv("/content/titles.csv")
credits_df = pd.read_csv("/content/credits.csv")

### Dataset First View

In [None]:
# Dataset First Look
titles_df.head(), credits_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
titles_shape = titles_df.shape
credits_shape = credits_df.shape

titles_shape, credits_shape

### Dataset Information

In [None]:
# Dataset Info
# info() prints its summary and returns None, so call it once per DataFrame
titles_df.info()
credits_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
titles_duplicates = titles_df.duplicated().sum()
credits_duplicates = credits_df.duplicated().sum()

titles_duplicates, credits_duplicates


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
titles_missing = titles_df.isnull().sum()
credits_missing = credits_df.isnull().sum()
titles_missing, credits_missing

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12, 6))
sns.heatmap(titles_df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values in titles.csv")
plt.show()

plt.figure(figsize=(12, 6))
sns.heatmap(credits_df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values in credits.csv")
plt.show()

### What did you know about your dataset?

The data consists of two tables: `titles.csv`, with **9,871** entries describing movies and TV shows, and `credits.csv`, with **124,235** records mapping cast and crew members to those titles. The titles table holds metadata such as title, genre, IMDb and TMDB scores, runtime, and production country, while the credits table holds names, roles, and character information.

Missing values are the main issue, particularly in `age_certification` (66% missing), `seasons` (missing mostly for movies), and `imdb_score` and `tmdb_score` (missing for around 1,000 and 2,000 titles, respectively). The `character` column in the credits table is also empty for over **16,000** entries, which likely represent crew members rather than actors. There are duplicate records as well (**3 in the titles table and 56 in the credits table**). These issues may affect the analysis, especially rating-based insights and recommendation models, so cleaning should include imputing missing values, removing duplicates, and ensuring data consistency. Heatmaps and histograms can help reveal patterns in the missing data and in the rating distributions.
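
The missing-value percentages quoted above can be derived rather than read off the raw counts. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary`; the demo frame uses made-up values, not the real data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first.

    `missing_summary` is a hypothetical helper, not part of the notebook.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny illustrative frame (made-up values).
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "age_certification": [None, "PG", None, None],
    "imdb_score": [6.1, None, 7.0, 5.5],
})
demo_summary = missing_summary(demo)
print(demo_summary)
```

On the real data, `missing_summary(titles_df)` and `missing_summary(credits_df)` would give the same counts as `isnull().sum()` with the percentage alongside.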

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
titles_columns = titles_df.dtypes
credits_columns = credits_df.dtypes
titles_columns, credits_columns

In [None]:
# Dataset Describe
titles_df.describe(), credits_df.describe()

### Variables Description

### **Variable Descriptions**  

The dataset consists of two primary data files:  
1. **`titles.csv`** – Contains information about movies and TV shows.  
2. **`credits.csv`** – Contains cast and crew details for each title.  

---

### **1. `titles.csv` (Movies & TV Shows Metadata)**  

| **Column Name**          | **Data Type** | **Description** |
|--------------------------|-------------|----------------|
| `id`                     | object      | Unique identifier for each title (movie/TV show). |
| `title`                  | object      | Name of the movie or TV show. |
| `type`                   | object      | Indicates whether the title is a "MOVIE" or a "SHOW". |
| `description`            | object      | Short summary of the movie/TV show. |
| `release_year`           | int64       | The year the title was released. |
| `age_certification`      | object      | Age rating (e.g., PG, R, TV-14), but has missing values. |
| `runtime`                | int64       | Duration of the movie in minutes; for TV shows, it represents episode length. |
| `genres`                 | object      | A list of genres assigned to the title (e.g., Drama, Comedy). |
| `production_countries`   | object      | The countries involved in producing the title. |
| `seasons`                | float64     | Number of seasons (only applicable for TV shows, NaN for movies). |
| `imdb_id`                | object      | Unique IMDb identifier for the title. |
| `imdb_score`             | float64     | IMDb rating (scale of 1 to 10), but some values are missing. |
| `imdb_votes`             | float64     | Total number of user votes for IMDb rating. |
| `tmdb_popularity`        | float64     | Popularity score from TMDB (The Movie Database). |
| `tmdb_score`             | float64     | TMDB rating (scale of 1 to 10), but some values are missing. |

#### **Statistical Observations:**
- **`release_year`**: Ranges from **1912 to 2022**, with most movies produced in the 2000s and 2010s.  
- **`runtime`**: Median duration is **89 minutes**, but the maximum is **549 minutes**, indicating some long-form content.  
- **`seasons`**: Most TV shows have **1-3 seasons**, but some reach **51 seasons**.  
- **`imdb_score` and `tmdb_score`**: Ratings are mostly between **5.1 and 6.9**, with the highest-rated content reaching **9.9/10**.  
- **`imdb_votes`**: Varies significantly, with some titles receiving over **1 million votes**, indicating popularity.  

---

### **2. `credits.csv` (Cast & Crew Information)**  

| **Column Name**  | **Data Type** | **Description** |
|------------------|-------------|----------------|
| `person_id`      | int64       | Unique identifier for the individual (actor/director). |
| `id`            | object      | Foreign key linking to `titles.csv`, identifying which title the person is associated with. |
| `name`          | object      | Name of the actor or director. |
| `character`     | object      | Character name played by the actor (NaN for directors). |
| `role`          | object      | Role type – either **ACTOR** or **DIRECTOR**. |

#### **Statistical Observations:**
- **`person_id`**: Unique IDs assigned to cast/crew members, ranging up to **2.37 million**.  
- **`character`**: Some actors appear multiple times, often playing different roles.  
- **`role`**: More **ACTORS** than **DIRECTORS**, which is expected in entertainment datasets.  
- **Duplicates & Missing Values**: Some `character` fields are missing, possibly for background or voice actors.  
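
Whether the missing `character` values belong mainly to directors or to actors can be checked directly instead of guessed. The sketch below assumes a hypothetical helper `missing_character_by_role` and should be run on the raw credits table, before any placeholder is filled in; the demo rows are invented.

```python
import pandas as pd


def missing_character_by_role(credits: pd.DataFrame) -> pd.DataFrame:
    """Count rows and missing `character` values for each role.

    Hypothetical helper for illustration only.
    """
    flagged = credits.assign(character_missing=credits["character"].isna())
    return flagged.groupby("role")["character_missing"].agg(rows="size", missing="sum")


# Invented rows that mimic the credits.csv columns.
demo_credits = pd.DataFrame({
    "name": ["P1", "P2", "P3", "P4"],
    "character": ["Hero", None, None, None],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "DIRECTOR"],
})
role_gaps = missing_character_by_role(demo_credits)
print(role_gaps)
```

If nearly all gaps fall under `DIRECTOR`, the missing values are structural rather than a data-quality problem among actors.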

---

### **Key Takeaways**
- **Most content is from recent decades**, aligning with the rise of streaming services.  
- **IMDb and TMDB scores sit mostly in the mid-range (5.1 to 6.9)**, but missing ratings indicate incomplete data.  
- **A small number of actors and directors dominate the platform**, which may limit diversity.  
- **Most TV shows have 1-3 seasons**, suggesting that long-running series are less common on Amazon Prime.  

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Check unique values for each variable in titles.csv
titles_unique_values = titles_df.nunique()

# Check unique values for each variable in credits.csv
credits_unique_values = credits_df.nunique()

titles_unique_values,credits_unique_values

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
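# Reload the raw datasets so this cell can be re-run on its own
# (same paths as in the "Dataset Loading" cell; pandas is already imported as pd)
titles_df = pd.read_csv("/content/titles.csv")
credits_df = pd.read_csv("/content/credits.csv")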

# Remove duplicates
titles_df.drop_duplicates(inplace=True)
credits_df.drop_duplicates(inplace=True)

# Ensure we are working with a full copy of the DataFrame
titles_df = titles_df.copy()
credits_df = credits_df.copy()

# Fill missing numerical values with median
titles_df['imdb_score'] = titles_df['imdb_score'].fillna(titles_df['imdb_score'].median())
titles_df['imdb_votes'] = titles_df['imdb_votes'].fillna(titles_df['imdb_votes'].median())
titles_df['tmdb_score'] = titles_df['tmdb_score'].fillna(titles_df['tmdb_score'].median())

# Fill missing categorical values with "Unknown"
titles_df['age_certification'] = titles_df['age_certification'].fillna("Unknown")
credits_df['character'] = credits_df['character'].fillna("Unknown")

# Convert 'release_year' to datetime format
titles_df['release_year'] = pd.to_numeric(titles_df['release_year'], errors='coerce')
titles_df['release_year'] = pd.to_datetime(titles_df['release_year'], format='%Y', errors='coerce')

# Display post-cleaning changes
print("\nAfter Cleaning:")
print("Titles DataFrame - Missing Values:\n", titles_df.isnull().sum())
print("Credits DataFrame - Missing Values:\n", credits_df.isnull().sum())
print("\nTitles DataFrame - Duplicates:", titles_df.duplicated().sum())
print("Credits DataFrame - Duplicates:", credits_df.duplicated().sum())

In [None]:
# Save cleaned data to new CSV files
titles_df.to_csv("titles_cleaned.csv", index=False)
credits_df.to_csv("credits_cleaned.csv", index=False)

### What all manipulations have you done and insights you found?

### **Data Wrangling: Manipulations & Insights**  

#### **1. Data Cleaning Steps Performed**
The following manipulations were applied to clean and preprocess the data:  

1. **Loaded Datasets**: Imported `titles.csv` and `credits.csv`.  
2. **Removed Duplicate Entries**:  
   - **Before:** `titles.csv` had 3 duplicate rows, and `credits.csv` had 56 duplicates.  
   - **After:** All duplicates were removed, ensuring data consistency.  
3. **Missing Value Treatment:**  
   - **Numerical Columns:**  
     - `imdb_score`, `imdb_votes`, and `tmdb_score` were filled with their median values.  
   - **Categorical Columns:**  
     - `age_certification` (age ratings) and `character` (in `credits.csv`) were filled with `"Unknown"` to avoid losing valuable rows.  
4. **Converted `release_year` to Datetime Format**:  
   - **Ensured consistency in date representation** for better time-based analysis.  
5. **Final Check for Missing Values & Duplicates:**  
   - **Missing Values in `credits.csv`:**  Fully handled.  
   - **Missing Values in `titles.csv`:**  
     - `description`, `seasons`, `imdb_id`, and `tmdb_popularity` still have some missing values.  
   - **Duplicates:** None remained after cleaning.  
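
One side effect of median filling is that imputed scores become indistinguishable from real ones in later charts. A possible refinement, sketched here under the assumption of a hypothetical helper `fill_median_with_flag`, is to record which rows were imputed so they can be excluded when needed. The demo values are made up.

```python
import pandas as pd


def fill_median_with_flag(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Median-fill `column` and add a boolean `<column>_imputed` marker.

    Hypothetical helper; the notebook itself fills without a marker.
    """
    out = df.copy()
    out[f"{column}_imputed"] = out[column].isna()
    out[column] = out[column].fillna(out[column].median())
    return out


demo = pd.DataFrame({"imdb_score": [5.0, None, 7.0, 9.0]})
demo_filled = fill_median_with_flag(demo, "imdb_score")

# Rows where the flag is False hold original ratings only.
observed_only = demo_filled.loc[~demo_filled["imdb_score_imputed"], "imdb_score"]
print(demo_filled)
```

Rating histograms and scatter plots can then be drawn from `observed_only`-style subsets to avoid an artificial spike at the median.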

---

#### **2. Insights Found from Data Wrangling**
1. **IMDb & TMDB Score Insights:**  
   - Missing scores were filled with the median. The gaps themselves suggest that **some content did not have ratings**, possibly due to low popularity or limited audience engagement.  
   - A median IMDb score of about 6.1 suggests that **most content is of average quality**.  

2. **Age Certification Gaps:**  
   - A **significant portion (6,487 entries) had missing `age_certification`** before cleaning.  
   - This indicates that **many movies or shows were not classified with appropriate age ratings**, which may impact parental control and audience targeting.  

3. **Seasons Data Issue:**  
   - **8,511 missing values in `seasons`** indicate that a majority of the dataset consists of movies (since seasons only apply to TV shows).  
   - Amazon Prime may have **more movies than TV shows**, highlighting a potential gap in episodic content.  

4. **Potential Issues with Content Metadata:**  
   - `imdb_id` missing for **667 titles** means that **some content lacks IMDb integration**, possibly affecting its visibility and discoverability.  
   - `tmdb_popularity` missing for **547 titles** suggests that **certain movies/shows have little engagement or no public data on TMDB**.  

5. **Character Names Missing in `credits.csv`**:  
   - Many character names were originally missing (16,287 entries before cleaning).  
   - After filling them with `"Unknown"`, it became evident that **some actors were not assigned specific character names**, possibly for minor roles or uncredited appearances.  

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
# Read the cleaned datasets
titles_dff = pd.read_csv("titles_cleaned.csv")
credits_dff = pd.read_csv("credits_cleaned.csv")

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Visualization 1: Count of Movies vs. TV Shows
plt.figure(figsize=(8, 5))
sns.countplot(data=titles_dff, x='type', palette='coolwarm')
plt.title("Distribution of Movies vs. TV Shows")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

 A bar chart is effective for categorical comparisons, showing the count of movies vs. TV shows clearly.

##### 2. What is/are the insight(s) found from the chart?

The dataset has a higher number of movies compared to TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If more movies are present, Amazon Prime could focus on exclusive TV show content to balance the offering.

**Negative Growth Insight:**
If TV shows are significantly fewer, it may indicate that users looking for long-term content engagement might switch to competitors like Netflix.
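
The balance between movies and TV shows is easier to discuss as a percentage than as raw bar heights. A minimal sketch, assuming a hypothetical helper `type_share` and invented demo rows:

```python
import pandas as pd


def type_share(titles: pd.DataFrame) -> pd.Series:
    """Return the percentage of titles per `type` (hypothetical helper)."""
    return (titles["type"].value_counts(normalize=True) * 100).round(1)


# Invented rows; the real column holds "MOVIE" and "SHOW".
demo = pd.DataFrame({"type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"]})
share = type_share(demo)
print(share)
```

Calling `type_share(titles_dff)` would report the actual split for the cleaned data.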

#### Chart - 2

In [None]:
# Chart - 2 visualization code

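# Visualization 2: Top 10 Genres
# After the CSV round-trip each 'genres' value is a string such as "['drama', 'comedy']"
# (assumed format), so parse it back into a list before exploding
import ast
genres_list = titles_dff['genres'].apply(ast.literal_eval).explode().value_counts().head(10)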
plt.figure(figsize=(10, 5))
sns.barplot(x=genres_list.index, y=genres_list.values, palette='viridis')
plt.xticks(rotation=45)
plt.title("Top 10 Most Common Genres")
plt.xlabel("Genre")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart helps to visualize the most frequently occurring genres effectively.

##### 2. What is/are the insight(s) found from the chart?

Certain genres like "Drama" and "Comedy" are dominant.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Amazon Prime can prioritize acquiring or producing content in these high-demand genres.

**Negative Growth Insight:** If some genres (e.g., Sci-Fi, Documentaries) have lower representation, it might alienate audiences interested in those categories, potentially driving them to competitors.
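
To look at under-represented genres as well as dominant ones, the full genre ranking is needed. The sketch below assumes that `genres` is stored as a stringified Python list (for example `"['drama', 'comedy']"`), which is how list columns typically come back from a CSV; `genre_counts` is a hypothetical helper and the demo values are invented.

```python
import ast

import pandas as pd


def genre_counts(genres: pd.Series) -> pd.Series:
    """Count individual genres from a column of stringified lists.

    Assumes each value looks like "['drama', 'comedy']".
    """
    parsed = genres.dropna().apply(ast.literal_eval)
    # explode() gives one row per genre; empty lists become NaN and are
    # dropped by value_counts().
    return parsed.explode().value_counts()


demo = pd.Series(["['drama', 'comedy']", "['drama']", "[]", "['scifi', 'drama']"])
counts = genre_counts(demo)
print(counts)
```

`counts.head(10)` gives the most common genres and `counts.tail(10)` the least represented ones.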

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Visualization 3: IMDB Score Distribution
plt.figure(figsize=(10, 5))
sns.histplot(titles_dff['imdb_score'], bins=30, kde=True, color='green')
plt.title("Distribution of IMDB Scores")
plt.xlabel("IMDB Score")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram shows how ratings are spread across movies and TV shows.

##### 2. What is/are the insight(s) found from the chart?

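Most IMDb scores fall roughly between 5 and 7, in line with the 5.1 to 6.9 range and the median of about 6.1 reported earlier, which suggests a catalog of mostly average quality. Part of the peak near the median comes from the median imputation applied during data wrangling.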

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

High ratings can attract more subscribers.

**Negative Growth Insight:** If a large portion has lower ratings (e.g., below 5), it may indicate quality issues, leading to negative user retention.
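
The share of low-rated titles can be quantified directly. This is a minimal sketch, assuming a hypothetical helper `low_rated_share` and the threshold of 5 mentioned above; it is best applied to the raw scores so that median-filled values do not dilute the result. The demo scores are made up.

```python
import pandas as pd


def low_rated_share(scores: pd.Series, threshold: float = 5.0) -> float:
    """Percentage of rated titles scoring below `threshold` (hypothetical helper)."""
    rated = scores.dropna()
    return round(float((rated < threshold).mean() * 100), 1)


demo_scores = pd.Series([4.0, 6.5, 7.2, 3.9, None])
low_pct = low_rated_share(demo_scores)
print(low_pct)
```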

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Visualization 4: TMDB Score vs. IMDB Score
plt.figure(figsize=(8, 5))
sns.scatterplot(data=titles_dff, x='imdb_score', y='tmdb_score', alpha=0.5)
plt.title("IMDB Score vs. TMDB Score")
plt.xlabel("IMDB Score")
plt.ylabel("TMDB Score")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot helps analyze the correlation between the two scoring systems.

##### 2. What is/are the insight(s) found from the chart?

The two scores show a strong positive correlation, meaning ratings are broadly consistent across the two platforms.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If scores are aligned, Prime can use TMDB/IMDB data to predict user engagement.

**Negative Growth Insight:** If scores differ widely, it may indicate biased or unreliable rating systems affecting user trust.
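
The strength of the relationship can be backed by a number. The sketch below computes the Pearson correlation on titles that have both scores; `score_correlation` is a hypothetical helper, and it is meant for the raw table, since median-filled values would distort the coefficient. The demo values are invented.

```python
import pandas as pd


def score_correlation(titles: pd.DataFrame) -> float:
    """Pearson correlation between IMDb and TMDB scores on rows that have both."""
    both = titles[["imdb_score", "tmdb_score"]].dropna()
    return float(both["imdb_score"].corr(both["tmdb_score"]))


demo = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, None, 8.0],
    "tmdb_score": [5.5, 6.5, 7.5, 7.0, None],
})
r = score_correlation(demo)
print(round(r, 3))
```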

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Visualization 5: Top 10 Actors with Most Appearances
top_actors = credits_dff[credits_dff['role'] == 'ACTOR']['name'].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=top_actors.index, y=top_actors.values, palette='magma')
plt.xticks(rotation=45)
plt.title("Top 10 Actors with Most Appearances")
plt.xlabel("Actor Name")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is effective in ranking actors by the number of projects they appear in.

##### 2. What is/are the insight(s) found from the chart?

Some actors are featured prominently, suggesting their popularity or frequent collaborations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Amazon can prioritize content with these actors to drive engagement.

**Negative Growth Insight:** Over-reliance on a few actors might limit diversity, potentially alienating audiences looking for fresh faces.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Visualization 6: Top 10 Directors with Most Titles Directed
top_directors = credits_dff[credits_dff['role'] == 'DIRECTOR']['name'].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=top_directors.index, y=top_directors.values, palette='coolwarm')
plt.xticks(rotation=45)
plt.title("Top 10 Directors with Most Titles Directed")
plt.xlabel("Director Name")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart helps to visualize which directors contribute the most content.

##### 2. What is/are the insight(s) found from the chart?

Some directors have a significant number of projects, indicating potential high-performing partnerships.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Partnering with these directors for exclusive content can enhance viewership.

**Negative Growth Insight:** Over-reliance on a small pool of directors could limit variety, impacting content diversity.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Visualization 7: Distribution of Roles (Actors vs. Directors)
plt.figure(figsize=(7, 5))
sns.countplot(data=credits_dff, x='role', palette='Set2')
plt.title("Distribution of Roles in Credits Dataset")
plt.xlabel("Role")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart helps to compare the number of actors vs. directors in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Actors significantly outnumber directors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

More actors mean greater content variety, while fewer directors may indicate concentrated creative control.

**Negative Growth Insight:** If director numbers are too low, it could indicate limited creative diversity, potentially making content repetitive.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

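# Visualization 8: Most Common Character Names in TV Shows & Movies
# Skip the "Unknown" placeholder added during cleaning so it does not top the ranking
known_characters = credits_dff.loc[credits_dff['character'] != 'Unknown', 'character']
top_characters = known_characters.value_counts().head(10)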
plt.figure(figsize=(10, 5))
sns.barplot(x=top_characters.index, y=top_characters.values, palette='viridis')
plt.xticks(rotation=45)
plt.title("Top 10 Most Common Character Names")
plt.xlabel("Character Name")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart highlights frequently appearing character names very easily.

##### 2. What is/are the insight(s) found from the chart?

Certain character names appear repeatedly, likely due to recurring roles in TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps Amazon analyze franchise success by tracking popular characters.

**Negative Growth Insight:** If repetitive character names indicate overuse of similar themes, audiences might seek variety elsewhere.


#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Visualization 9: Number of Titles Each Actor Has Appeared In (Top 20)
actor_appearances = credits_dff[credits_dff['role'] == 'ACTOR']['name'].value_counts().head(20)
plt.figure(figsize=(12, 6))
sns.barplot(x=actor_appearances.index, y=actor_appearances.values, palette='magma')
plt.xticks(rotation=90)
plt.title("Top 20 Actors with Most Appearances")
plt.xlabel("Actor Name")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart helps to identify actors with the most projects.

##### 2. What is/are the insight(s) found from the chart?

Some actors are dominant across multiple shows/movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Investing in content featuring popular actors can increase engagement.

**Negative Growth Insight:** Too much reliance on a few actors may lead to brand fatigue.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Visualization 10: Distribution of Cast Appearances (How Many Titles Do Actors Work In?)
cast_counts = credits_dff[credits_dff['role'] == 'ACTOR']['name'].value_counts()
plt.figure(figsize=(10, 5))
# log_scale=True bins the counts in log space so the bars stay evenly spaced on the log axis
sns.histplot(cast_counts, bins=50, kde=True, color='green', log_scale=True)
plt.title("Distribution of Number of Titles Actors Have Worked In")
plt.xlabel("Number of Titles (Log Scale)")
plt.ylabel("Count of Actors")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE curve shows how the number of titles per actor is distributed.

##### 2. What is/are the insight(s) found from the chart?

A few actors work on a vast number of titles, while most work on fewer.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in negotiating contracts with actors based on demand.

**Negative Growth Insight:** If only a few actors dominate, it may indicate a lack of fresh talent.
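
How concentrated the cast is can be summarized with two numbers: the share of actors who appear in a single title and the share of all acting credits held by the top few names. This is a rough sketch with a hypothetical helper `actor_concentration`; it groups by `name` as the charts above do, so distinct people who share a name are merged. The demo rows are invented.

```python
import pandas as pd


def actor_concentration(credits: pd.DataFrame, top_n: int = 10) -> dict:
    """Summarize how acting credits are spread across actors (hypothetical helper)."""
    actors = credits.loc[credits["role"] == "ACTOR", "name"]
    per_actor = actors.value_counts()
    return {
        # Percentage of actors with exactly one credit.
        "single_title_pct": round(float((per_actor == 1).mean() * 100), 1),
        # Percentage of all acting credits held by the top_n actors.
        "top_n_credit_pct": round(float(per_actor.head(top_n).sum() / per_actor.sum() * 100), 1),
    }


demo_credits = pd.DataFrame({
    "name": ["A", "A", "A", "B", "C", "D", "E"],
    "role": ["ACTOR"] * 6 + ["DIRECTOR"],
})
concentration = actor_concentration(demo_credits, top_n=1)
print(concentration)
```

Grouping by `person_id` instead of `name` would avoid the shared-name issue.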

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To address these business objectives, the following solutions were implemented:

**Content Distribution Analysis:**

A bar chart comparing movies vs. TV shows helped determine whether Amazon Prime has a balanced content library.
Genre distribution analysis identified the most popular genres, allowing the platform to prioritize acquisitions.

**Content Quality Assessment:**

IMDb and TMDB ratings were analyzed to understand user satisfaction levels.
A scatter plot of IMDb vs. TMDB scores ensured rating consistency.
Low-rated content was flagged as a potential risk area.

**Optimizing Content Acquisition:**

Yearly trends of movie/TV show releases were studied to predict future demand.
The most common actors and directors were identified, allowing Amazon to collaborate with high-engagement artists.

**Evaluating Cast & Crew Trends:**

A bar chart of the most common character names highlighted franchise potential.
Actor and director distributions revealed who contributes the most content.

**Identifying Gaps & Risks:**

Missing values in age_certification suggested the need for proper content classification.
Low representation of niche genres (e.g., documentaries, sci-fi) indicated a potential market opportunity.
Declining release trends in recent years suggested a slowdown in content acquisition.

**Enhancing Customer Satisfaction:**

By aligning new content acquisitions with top-rated genres and actors, Amazon can boost engagement.
Avoiding an over-reliance on specific actors and directors ensures fresh and diverse content.
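
The yearly release trend mentioned above could be computed along the following lines. This is a sketch under assumptions: `releases_per_year` is a hypothetical helper, and it takes the first four characters of `release_year` so that it works whether the column holds plain years or date strings such as `2019-01-01` (the form written to `titles_cleaned.csv`). The demo rows are made up.

```python
import pandas as pd


def releases_per_year(titles: pd.DataFrame) -> pd.DataFrame:
    """Count titles per release year and type (hypothetical helper)."""
    # Works for ints (2019), date strings ("2019-01-01"), and datetimes.
    year = pd.to_numeric(titles["release_year"].astype(str).str[:4], errors="coerce")
    counts = (
        titles.assign(year=year)
        .dropna(subset=["year"])
        .groupby(["year", "type"])
        .size()
        .unstack(fill_value=0)
    )
    counts.index = counts.index.astype(int)
    return counts


demo = pd.DataFrame({
    "release_year": ["2019-01-01", "2019-01-01", "2020-01-01", "2021-01-01"],
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE"],
})
trend = releases_per_year(demo)
print(trend)
```

The resulting table has one row per year and one column per type, so `trend.plot()` draws one line for movies and one for shows.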

# **Conclusion**

This project provided critical insights into Amazon Prime Video’s content library. Key findings include:  
- **Movies dominate the platform**, but TV shows still have a strong presence.  
- **Drama and Comedy are the most popular genres**, indicating user preference trends.  
- **Content releases peaked in the 2010s but show a slight decline in recent years.**  
- **IMDb and TMDB ratings are mostly mid-range** (median IMDb score of about 6.1), suggesting broadly average content quality with room to improve.  
- **A few actors and directors dominate the platform**, which may limit content diversity.  

From a business perspective, these insights can **guide Amazon Prime Video’s content strategy** in the following ways:  
1. **Investing more in TV shows** to balance the content library.  
2. **Acquiring and producing more high-rated genres** to enhance user engagement.  
3. **Collaborating with diverse actors and directors** to refresh content.  
4. **Ensuring a steady influx of new content** to maintain platform growth.  

By leveraging these insights, **Amazon Prime Video can enhance its content catalog, retain existing users, and attract new subscribers, leading to long-term success in the competitive streaming market.**

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***