<a href="https://colab.research.google.com/github/aakash23082005/gfg_projects_21days/blob/main/Project2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2: In-Depth Exploratory Data Analysis (EDA)
## Netflix Content Analysis 🎬

**Project Objective:** To perform an in-depth exploratory data analysis of the Netflix dataset. We will explore trends in content production, identify popular genres, analyze content ratings, and understand the distribution of movies and TV shows on the platform. This project builds on foundational EDA by introducing time-series analysis and more complex data cleaning and transformation techniques.

### Core Concepts We'll Cover:
1.  **Data Cleaning & Transformation:** Handling missing values and converting data types (especially dates).
2.  **Time-Series Analysis:** Analyzing how content has been added to Netflix over the years.
3.  **Text Data Manipulation:** Parsing and analyzing columns with multiple values, like `listed_in` (genres) and `cast`.
4.  **Geographical & Rating Analysis:** Understanding where content comes from and its maturity level.
5.  **Feature Engineering:** Creating new, insightful features like 'content age'.
6.  **Advanced Visualization:** Creating insightful plots to understand distributions and relationships in the data.

### Step 1: Setup - Importing Libraries

As always, we begin by importing our essential data science toolset, including a new library for word clouds.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Set a consistent style for our plots
sns.set_style('darkgrid')

### Step 2: Data Loading and Initial Inspection

We'll load the `netflix_titles.csv` dataset and perform a high-level overview.

In [None]:
!git clone "https://github.com/HarshvardhanSingh-13/Datasets"

In [None]:
netflix_df = pd.read_csv('/content/Datasets/Netflix_Titles Dataset/netflix_titles.csv')
netflix_df.head()

In [None]:
# Get a concise summary of the dataframe
netflix_df.info()

**Interpretation of `.info()`:**
- We have 7787 entries (titles).
- **Key Problem:** The `date_added` column is of type `object` (a string), not a `datetime` object. We cannot perform time-based analysis until this is corrected.
- **Missing Values:** `director`, `cast`, `country`, `date_added`, and `rating` all have missing values. `director` has the most significant number of nulls.

### Step 3: Data Cleaning and Transformation

This step is critical for ensuring our analysis is accurate. We will handle missing values and correct data types.

#### **Theoretical Concept: Data Type Conversion & Handling Nulls**
Data often comes in non-ideal formats. Storing dates as strings, for example, prevents us from extracting components like the year or month, or from plotting data over time. Converting columns to their proper data types (`pd.to_datetime`, `.astype()`) is a fundamental preprocessing step.

For null values, we have several strategies:
1.  **Drop:** If only a very small percentage of rows have missing data, dropping them might be acceptable (`.dropna()`).
2.  **Fill/Impute:** Replace missing values with a placeholder (like "Unknown") or a statistical measure (like the mode for categorical data). This is useful when you don't want to lose the other information in those rows.
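To make the two strategies concrete, here is a minimal sketch on a made-up table (the `toy` DataFrame and its values are assumptions for illustration, not rows from the Netflix dataset). It first measures how much is missing per column, then applies each strategy.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "director": ["Ann Lee", None, None, "Raj Rao", None],
    "rating": ["TV-MA", "TV-14", None, "PG", "TV-MA"],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Strategy 1 (drop): remove rows where a sparsely missing column is null
dropped = toy.dropna(subset=["rating"])

# Strategy 2 (fill): keep every row and mark the gaps with a placeholder
filled = toy.assign(director=toy["director"].fillna("Unknown"))

print(len(dropped), filled["director"].tolist())
```

Checking the percentage first is what justifies the choice: a column that is mostly complete can afford dropped rows, while a column with many gaps usually cannot.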

In [None]:
# 1. Handle missing values in 'director' and 'cast'
# Since these are text fields and many are missing, we'll fill them with 'Unknown'.
netflix_df['director'] = netflix_df['director'].fillna('Unknown')
netflix_df['cast'] = netflix_df['cast'].fillna('Unknown')

In [None]:
# 2. Handle missing 'country'
# We'll fill with the mode, which is the most common country.
mode_country = netflix_df['country'].mode()[0]
netflix_df['country'] = netflix_df['country'].fillna(mode_country)

In [None]:
# 3. Drop the few rows with missing 'date_added' and 'rating'
# Since the number is small (less than 0.2% of data), dropping them is a safe option.
netflix_df.dropna(subset=['date_added', 'rating'], inplace=True)

In [None]:
# 4. Convert 'date_added' to datetime objects
# format='mixed' (pandas 2.0+) infers the format of each value individually
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'], format='mixed', dayfirst=False)

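* **format='mixed':** This argument (available in pandas 2.0 and later) tells pandas to infer the format of each date string individually instead of assuming one format for the whole column. This is helpful when the strings in the column follow different formats. The pandas documentation describes this mode as risky, so it is worth pairing it with an explicit `dayfirst` setting, as done here.

* **dayfirst=False:** This argument, which is also the default, specifies that an ambiguous date (e.g., 01/02/2023) is interpreted as month first (January 2nd) rather than day first (February 1st).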
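The effect of `dayfirst` is easiest to see on a single ambiguous string. The sketch below also shows an alternative for columns whose layout is known: an explicit `format` combined with `errors="coerce"`. The `raw` values are hypothetical and only assume a "Month day, year" layout.

```python
import pandas as pd

# An ambiguous string: month-first by default, day-first on request
month_first = pd.to_datetime("01/02/2023", dayfirst=False)
day_first = pd.to_datetime("01/02/2023", dayfirst=True)
print(month_first.date(), day_first.date())

# Hypothetical raw values, including stray whitespace and one bad entry
raw = pd.Series([" August 4, 2017", "September 25, 2021", "not a date"])

# When the layout is known, an explicit format is stricter than inference;
# errors="coerce" turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")
print(parsed)
```

Counting the `NaT` values afterwards (`parsed.isna().sum()`) is a quick way to see how many rows failed to parse.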

In [None]:
# 5. Create new features for year and month added
netflix_df['year_added'] = netflix_df['date_added'].dt.year
netflix_df['month_added'] = netflix_df['date_added'].dt.month

In [None]:
# Verify our cleaning and transformation
print("Missing values after cleaning:")
print(netflix_df.isnull().sum())
print("\nData types after transformation:")
print(netflix_df.dtypes)

### Step 4: Exploratory Data Analysis & Visualization

#### 4.1 What is the distribution of content type?

In [None]:
plt.figure(figsize=(8, 6))
type_counts = netflix_df['type'].value_counts()
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', startangle=140, colors=['#e60023', '#221f1f'])
plt.title('Proportion of Movies vs. TV Shows')
plt.ylabel('')
plt.show()

**Insight:** The Netflix library is dominated by Movies, which make up roughly 70% of the content in this dataset.

#### 4.2 How has content been added over time?

In [None]:
# Group data by year and content type
content_over_time = netflix_df.groupby(['year_added', 'type']).size().unstack().fillna(0)

# DataFrame.plot creates its own figure, so a separate plt.figure() call is not needed
content_over_time.plot(kind='line', marker='o', figsize=(14, 8))
plt.title('Content Added to Netflix Over the Years (by Type)')
plt.xlabel('Year Added')
plt.ylabel('Number of Titles Added')
plt.legend(title='Content Type')
plt.grid(True)
plt.show()

**Insight:** By separating movies and TV shows, we can see that while both grew significantly, the addition of movies accelerated much more dramatically, peaking in 2019. The growth in TV shows has been more steady. There appears to be a slight slowdown in content additions in 2020 and 2021, which could be due to the COVID-19 pandemic affecting productions or the dataset being incomplete for the latest year.
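The table behind this chart comes from a group-count-pivot pattern. A small sketch on made-up rows (the `toy` DataFrame is an assumption for illustration) shows what each step produces, and adds a year-over-year difference as one way to check a slowdown numerically.

```python
import pandas as pd

toy = pd.DataFrame({
    "year_added": [2018, 2018, 2019, 2019, 2019, 2020],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"],
})

# Count titles per (year, type), then pivot 'type' into columns.
# fill_value=0 covers year/type combinations that never occur.
counts = toy.groupby(["year_added", "type"]).size().unstack(fill_value=0)
print(counts)

# Year-over-year change: negative values mark a drop in additions
yearly_change = counts.diff()
print(yearly_change)
```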

In [None]:
netflix_df.head(2)

#### 4.3 What are the most popular genres?

#### **Theoretical Concept: Handling Multi-Value Text Columns**
The `listed_in` column contains strings with multiple genres separated by commas (e.g., "Dramas, International Movies"). To analyze each genre individually, we need to transform the data. A common technique is to:
1.  **Split** the string in each row into a list of genres.
2.  **Explode** the DataFrame so that each genre in the list gets its own row, duplicating the other information for that title.
This allows us to perform a `value_counts()` on the genres.
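The split-then-explode steps can be sketched on three made-up titles (the `toy` DataFrame is an assumption for illustration). The extra `str.strip()` is a defensive addition in case the separator is not always a comma followed by exactly one space.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Show A", "Film B", "Film C"],
    "listed_in": ["Dramas, International Movies", "Comedies", "Dramas, Comedies"],
})

# Step 1: split each string into a list of genres
# Step 2: explode so each genre gets its own row (the title is repeated)
exploded = (
    toy.assign(genre=toy["listed_in"].str.split(", "))
       .explode("genre")
)

# Tidy any stray whitespace around genre names
exploded["genre"] = exploded["genre"].str.strip()

genre_counts = exploded["genre"].value_counts()
print(exploded[["title", "genre"]])
print(genre_counts)
```

Note that the exploded frame has more rows than the original, so it should only be used for genre-level counts, not for counting titles.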

In [None]:
# Split the 'listed_in' column and explode it
genres = netflix_df.assign(genre=netflix_df['listed_in'].str.split(', ')).explode('genre')

In [None]:
# Get the top 15 genres and their counts
top_genres_counts = genres['genre'].value_counts().reset_index()
top_genres_counts.columns = ['genre', 'count'] # Rename columns for clarity

# Select only the top 15 for plotting
top_genres_counts_plot = top_genres_counts.head(15)

plt.figure(figsize=(12, 8))
sns.barplot(y='genre', x='count', data=top_genres_counts_plot, palette='mako', hue='genre', legend=False)
plt.title('Top 15 Genres on Netflix')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()

**Insight:** "International Movies" is the most common genre tag, highlighting Netflix's global content strategy. This is followed by Dramas, Comedies, and Action & Adventure.

#### 4.4 What is the distribution of content duration?

In [None]:
# Separate movies and TV shows
movies_df = netflix_df[netflix_df['type'] == 'Movie'].copy()
tv_shows_df = netflix_df[netflix_df['type'] == 'TV Show'].copy()

In [None]:
# Clean and convert duration for movies
movies_df['duration_min'] = movies_df['duration'].str.replace(' min', '').astype(int)

# Clean and convert duration for TV shows
tv_shows_df['seasons'] = tv_shows_df['duration'].str.replace(' Seasons', '').str.replace(' Season', '').astype(int)

In [None]:
# Plot the distributions
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Movie Duration Distribution
sns.histplot(ax=axes[0], data=movies_df, x='duration_min', bins=50, kde=True, color='skyblue').set_title('Movie Duration Distribution (minutes)')

# TV Show Season Distribution
sns.countplot(ax=axes[1], x='seasons', data=tv_shows_df, palette='rocket', order=tv_shows_df['seasons'].value_counts().index, hue='seasons', legend=False).set_title('TV Show Season Distribution')

plt.show()

**Insight:**
- The majority of movies on Netflix are between 80 and 120 minutes long, which is standard for feature films.
- The vast majority of TV shows on Netflix are short-lived, with most having only 1 season. This could reflect a strategy of producing many pilots and only renewing the most successful ones, or a focus on limited series.
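As an alternative to chained `str.replace` calls, the numeric part of `duration` can be pulled out with a regular expression, which works for both "min" and "Season(s)" values. This is a sketch on made-up rows; the `toy` DataFrame and the `duration_value` column name are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "duration": ["90 min", "125 min", "1 Season", "3 Seasons"],
})

# Pull out the leading number regardless of the unit that follows
toy["duration_value"] = (
    toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

movie_minutes = toy.loc[toy["type"] == "Movie", "duration_value"]
tv_seasons = toy.loc[toy["type"] == "TV Show", "duration_value"]

# Share of shows with exactly one season
single_season_share = (tv_seasons == 1).mean()
print(movie_minutes.describe())
print(single_season_share)
```

Computing a share like `single_season_share` on the real data would put a number on the "most shows have one season" observation.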

#### 4.5 Where does the content come from? (Geographical Analysis)

In [None]:
# Handle the multi-country listings similar to genres
countries = netflix_df.assign(country=netflix_df['country'].str.split(', ')).explode('country')

In [None]:
# Get the top 15 countries and their counts
top_countries_counts = countries['country'].value_counts().reset_index()
top_countries_counts.columns = ['country', 'count'] # Rename columns for clarity

In [None]:
# Select only the top 15 for plotting
top_countries_counts_plot = top_countries_counts.head(15)

plt.figure(figsize=(12, 10))
sns.barplot(y='country', x='count', data=top_countries_counts_plot, palette='viridis', hue='country', legend=False)
plt.title('Top 15 Content Producing Countries on Netflix')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.show()

**Insight:** The United States is by far the largest producer of content available on Netflix, although its lead may be somewhat inflated because missing `country` values were filled with the mode in Step 3. India is a strong second, and the UK, Japan, and South Korea also represent major content markets for the platform, emphasizing its global nature.
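Because Step 3 filled missing countries with the mode, it is worth checking how much that choice moves the ranking. The sketch below compares mode filling with an explicit placeholder on made-up rows; the `toy` data and the `country_counts` helper are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": [
        "United States", "United States", "India", None,
        "United States, India", None, "Japan",
    ],
})

def country_counts(series):
    # Split multi-country strings, explode, and tidy whitespace
    parts = series.str.split(",").explode().str.strip()
    # Drop empty strings left by stray or trailing commas
    return parts[parts != ""].value_counts()

# Option 1: fill gaps with the mode, as in Step 3
mode_filled = toy["country"].fillna(toy["country"].mode()[0])
# Option 2: fill gaps with an explicit placeholder
placeholder_filled = toy["country"].fillna("Unknown")

by_mode = country_counts(mode_filled)
by_placeholder = country_counts(placeholder_filled)
print(by_mode)
print(by_placeholder)
```

In this toy case the top country gains two titles purely from imputation. On the real data the ranking may well be unchanged, but the placeholder version keeps imputed rows visible as their own category.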

In [None]:
netflix_df.head(2)

#### 4.6 What are the maturity ratings of the content?

In [None]:
plt.figure(figsize=(12, 8))
sns.countplot(x='rating', data=netflix_df, order=netflix_df['rating'].value_counts().index, palette='crest', hue='rating', legend=False)
plt.title('Distribution of Content Ratings on Netflix')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

**Insight:** A large portion of Netflix's content is aimed at adults and older teens, with `TV-MA` (Mature Audience) and `TV-14` (Parents Strongly Cautioned) being the two most common ratings. This suggests a focus on older viewers over family and children's content (`TV-G`, `TV-Y`).

### Step 5: Feature Engineering - Content Freshness
Let's create a new feature to analyze how old content is when it gets added to Netflix. This can tell us about their acquisition strategy (buying old classics vs. releasing new originals).

In [None]:
# Create the 'age_on_netflix' feature
netflix_df['age_on_netflix'] = netflix_df['year_added'] - netflix_df['release_year']

# Filter out any potential errors where year_added is before release_year
content_age = netflix_df[netflix_df['age_on_netflix'] >= 0]

plt.figure(figsize=(14, 7))
sns.histplot(data=content_age, x='age_on_netflix', bins=50, kde=True)
plt.title('Distribution of Content Age When Added to Netflix')
plt.xlabel('Content Age (Years)')
plt.ylabel('Number of Titles')
plt.show()

**Insight:** The large spike at `0` indicates that a significant amount of content is added in the same year it's released, which is characteristic of "Netflix Originals." However, there is a very long tail, showing that Netflix also heavily invests in acquiring licensed content that can be decades old, building a deep library of classic films and shows.
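The spike at `0` and the long tail can also be summarized with two numbers: the share of titles added in their release year and the median age. A minimal sketch on made-up years (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "release_year": [2019, 2019, 1995, 2021, 2010],
    "year_added":   [2019, 2020, 2018, 2020, 2016],
})

toy["age_on_netflix"] = toy["year_added"] - toy["release_year"]

# Negative ages (added before release) are treated as data errors here
valid = toy[toy["age_on_netflix"] >= 0]

# Share of valid titles added in the same year they were released
same_year_share = (valid["age_on_netflix"] == 0).mean()
median_age = valid["age_on_netflix"].median()
print(same_year_share, median_age)
```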

### Step 6: Deeper Multivariate Analysis

In [None]:
# Analyze movie duration across different top genres
top_genres = genres['genre'].value_counts().index[:5]
genres_movies = genres[(genres['type'] == 'Movie') & (genres['genre'].isin(top_genres))].copy()
genres_movies['duration_min'] = genres_movies['duration'].str.replace(' min', '').astype(int)

plt.figure(figsize=(15, 8))
sns.boxplot(data=genres_movies, x='genre', y='duration_min', palette='pastel', hue='genre', legend=False)
plt.title('Movie Duration by Top Genres')
plt.xlabel('Genre')
plt.ylabel('Duration (minutes)')
plt.xticks(rotation=45)
plt.show()

**Insight:** While the median duration for most top genres is similar (around 90-100 minutes), we can see some interesting variations. For example, Dramas tend to have a wider range of durations, with many longer films. International Movies also show a broad distribution, reflecting diverse filmmaking styles from around the world.

### Step 7: Word Cloud from Content Descriptions
As a final visual analysis, let's generate a word cloud from the `description` column to see what themes and words are most common in Netflix content.

In [None]:
# Combine all descriptions into a single string
text = ' '.join(netflix_df['description'])

# Create and generate a word cloud image
wordcloud = WordCloud(width=800, height=400, background_color='black').generate(text)

# Display the generated image
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Words in Netflix Content Descriptions', fontsize=20)
plt.show()

**Insight:** The word cloud highlights common themes and subjects. Words like "life," "family," "love," "young," "friends," and "world" are prominent, suggesting that much of the content revolves around human relationships and personal journeys. Action-oriented words like "find," "secret," and "new" also appear frequently.

### Step 8: Final Conclusion and Summary of Insights

This in-depth EDA of the Netflix dataset has revealed several key characteristics and strategies of the platform's content library.

**Key Findings:**
1.  **Content Strategy:** Netflix's library is movie-heavy (~70%), and the platform aggressively added content between 2016-2019. Their strategy involves a mix of brand new originals (added the same year they are released) and a deep library of licensed older content.
2.  **Global Dominance:** While the US is the top content producer, the platform is heavily international, with India being a massive contributor. This is reflected in the most common genre tag ("International Movies").
3.  **Target Audience:** The content library is skewed towards mature audiences, with `TV-MA` and `TV-14` being the most common ratings.
4.  **Content Format & Genre:** Dramas and Comedies are among the most common genre tags. Most movies stick to a standard 80-120 minute runtime, while the vast majority of TV shows only last for a single season, suggesting a high-risk, high-reward approach to series production.
5.  **Common Themes:** Descriptions of content frequently revolve around universal themes of life, family, love, and discovery.

**Limitations:** This dataset is a snapshot in time and lacks viewership data. Therefore, our analysis is of the *supply* of content, not its *demand* or popularity. Nonetheless, this EDA provides a strong, multi-faceted understanding of the composition and evolution of the Netflix library.

# Submission Q's

* How has the distribution of content ratings changed over time?
* Is there a relationship between content age and its type (Movie vs. TV Show)?
* Can we identify any trends in content production based on the release year vs. the year added to Netflix?
* What are the most common word pairs or phrases in content descriptions?
* Who are the top directors on Netflix?

In [None]:
content_rating = netflix_df.groupby(['year_added', 'rating']).size().unstack().fillna(0)

# DataFrame.plot creates its own figure, so a separate plt.figure() call is not needed
content_rating.plot(kind='line', marker='o', figsize=(14, 8))
plt.title('Content Added to Netflix Over the Years (by Rating)')
plt.xlabel('Year Added')
plt.ylabel('Number of Titles Added')
plt.legend(title='Content Rating')
plt.grid(True)
plt.show()

**Summary and Insight:**

*   **Dominance of Mature Ratings:** `TV-MA` and `TV-14` consistently represent the largest categories of content added each year, especially from 2016 onwards. This reinforces the earlier insight that Netflix primarily targets adult audiences.
*   **Significant Growth Across All Ratings (2016-2019):** Content additions increased substantially across almost all rating categories, peaking around 2019. This mirrors the overall content addition trend observed earlier.
*   **Steep Rise in TV-MA and TV-14:** The lines for `TV-MA` and `TV-14` climb steeply through 2019, indicating a growing focus on mature and teen-oriented content during that period.
*   **Limited Children's Content:** Ratings like `TV-Y` (all children) and `TV-G` (general audiences) remain relatively low compared to other categories, with only a modest increase over the years. This suggests that while Netflix does offer content for younger viewers, it is not the primary focus in terms of volume.
*   **TV-PG and PG-13 Growth:** `TV-PG` (parental guidance suggested) and `PG-13` (parents strongly cautioned; some material may be inappropriate for children under 13) also grew, indicating a broad appeal to families and younger teens.
*   **R and NR Categories:** The `R` (restricted) rating shows a noticeable increase, particularly in later years, suggesting an expansion into more adult-oriented films. `NR` (not rated) content fluctuates but generally follows the overall trend.
*   **Post-2019 Slowdown:** As with the overall trend, content additions slow down or level off across most rating categories in 2020 and 2021, possibly due to external factors like the pandemic.

In essence, volume grew across all rating categories, with a pronounced and sustained focus on mature and teen audiences (`TV-MA`, `TV-14`). Netflix has clearly prioritized building a robust library for these demographics, while still providing a diverse range of content for other age groups.
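Raw counts mostly show that the library grew. To see whether the *distribution* of ratings changed, one option is to convert each year's counts into shares. The sketch below does this on made-up rows; the `toy` DataFrame is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "year_added": [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019],
    "rating": ["TV-MA", "TV-MA", "TV-14", "TV-Y",
               "TV-MA", "TV-MA", "TV-MA", "TV-14", "TV-Y"],
})

# Titles per (year, rating)
counts = pd.crosstab(toy["year_added"], toy["rating"])

# Divide each row by its total so every year sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```

Plotting `shares` instead of `counts` (for example as a stacked area chart) would separate a real shift in rating mix from overall growth.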

In [None]:
movies_df_age = netflix_df[netflix_df['type'] == 'Movie'].copy()
tv_shows_df_age = netflix_df[netflix_df['type'] == 'TV Show'].copy()

In [None]:
combined_df_age = pd.concat([movies_df_age, tv_shows_df_age])

plt.figure(figsize=(10, 7))
sns.violinplot(x='type', y='age_on_netflix', data=combined_df_age, palette='pastel', hue='type', legend=False)
plt.title('Distribution of Content Age When Added to Netflix by Content Type')
plt.xlabel('Content Type')
plt.ylabel('Content Age (Years)')
plt.show()


When added to Netflix, movies tend to have a wider age distribution, encompassing both new releases (around 0 years old) and a substantial proportion of very old content (decades old). In contrast, TV shows generally exhibit a younger age distribution, with a stronger concentration on recent productions (0-5 years old).

### Data Analysis Key Findings
*   **Movies**:
    *   A significant portion of movies are added to Netflix in their release year (around 0 years old).
    *   Movies also show a long and substantial tail extending to much older ages (decades old), indicating a strategy of acquiring a vast catalog of older films.
    *   The overall age distribution for movies when added to Netflix is notably wider.
*   **TV Shows**:
    *   Many TV shows are also added in their release year (around 0 years old).
    *   However, the distribution for older TV shows is shorter and less dense compared to movies, suggesting less acquisition of very old TV series.
    *   The age distribution for TV shows is more concentrated towards newer content, primarily within the 0-5 year range.
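The visual comparison can be backed with a few summary statistics per content type. This is a sketch on made-up ages; the `toy` DataFrame and the choice of the 90th percentile as a "tail" measure are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "age_on_netflix": [0, 1, 12, 40, 0, 1, 3],
})

# Median for the typical title, 90th percentile and max for the tail
summary = toy.groupby("type")["age_on_netflix"].agg(
    median="median",
    p90=lambda s: s.quantile(0.9),
    oldest="max",
)
print(summary)
```

A much larger `p90` for movies than for TV shows would be the numeric counterpart of the longer tail seen in the violin plot.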


In [None]:
plt.figure(figsize=(15, 10))
sns.histplot(x='release_year', y='year_added', data=netflix_df, bins=30, cmap='viridis')
plt.title('Density of Content Added to Netflix by Release Year and Year Added')
plt.xlabel('Release Year')
plt.ylabel('Year Added')
plt.show()

### Summary of Trends in Content Production (Release Year vs. Year Added)

From the 2D histogram visualizing `release_year` against `year_added`, coupled with the insights from `age_on_netflix`, several key trends in Netflix's content acquisition strategy become apparent:

1.  **Diagonal Line of Netflix Originals/New Releases:** A prominent diagonal band along `release_year = year_added` indicates a substantial volume of content being added to Netflix in the same year it was released. This represents Netflix's focus on original productions and acquiring fresh, newly released content.

2.  **Concentration of Recent Content:** There's a high density of content released in the more recent years (e.g., from 2010s onwards) and added to Netflix in the subsequent years, forming a dense cluster towards the top-right of the plot. This suggests an active acquisition strategy for contemporary content.

3.  **Spread of Older Content Along the Release-Year Axis:** Content released in earlier decades (e.g., pre-2000s) sits far to the left on the horizontal `release_year` axis and appears across many different `year_added` values on the vertical axis. This shows that Netflix has been consistently adding older content to its library over many years, indicating a strategy to build a comprehensive catalog that includes classics and licensed content with significant age gaps between their release and addition to the platform.

4.  **Content Age Distribution:** The `age_on_netflix` feature further clarifies this. While a large peak at `age_on_netflix = 0` (same-year release and addition) is observed, there's also a long tail in the distribution, confirming that Netflix acquires content that can be several decades old.

5.  **Shifting Acquisition Focus Over Time:** The heatmap also subtly suggests a shift in Netflix's acquisition over its operational years. In earlier years of Netflix's content addition (e.g., before 2015), there might have been a higher proportion of older content being added. As the platform matured, the focus shifted more towards contemporary and original content, as evidenced by the increasing density along the `release_year = year_added` diagonal in later `year_added` ranges.

In conclusion, Netflix employs a dual strategy: rapidly adding newly released and original content to stay current and competitive, while simultaneously enriching its library with a diverse collection of older, licensed titles to cater to a broader audience and provide a deep catalog.
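The same release-versus-added relationship can be read from a small table rather than a heatmap. The sketch below buckets release years by decade and cross-tabulates them against the year added; the `toy` rows and the `release_decade` column are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "release_year": [1985, 1999, 2012, 2018, 2019, 2019, 2020],
    "year_added":   [2017, 2019, 2016, 2018, 2019, 2020, 2020],
})

# Bucket release years by decade, e.g. 1985 -> 1980
toy["release_decade"] = (toy["release_year"] // 10) * 10

# Rows: decade of release; columns: year added to the platform
table = pd.crosstab(toy["release_decade"], toy["year_added"])
print(table)

# Share of titles added in the same year they were released
same_year = (toy["release_year"] == toy["year_added"]).mean()
print(round(same_year, 2))
```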

In [None]:
import nltk
nltk.download('stopwords')
import string
from nltk.corpus import stopwords

In [None]:
# Get English stopwords
STOPWORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    # Split into words and filter stopwords
    words = [word for word in text.split() if word not in STOPWORDS]
    # Join words back into a string
    return ' '.join(words)

# Apply the preprocessing function to the 'description' column
netflix_df['cleaned_description'] = netflix_df['description'].apply(preprocess_text)

print("First 5 cleaned descriptions:")
print(netflix_df[['description', 'cleaned_description']].head())

In [None]:
from nltk.util import ngrams
from collections import Counter

# Create an empty list to store formatted bigrams
bigrams_list = []

# Iterate through each cleaned_description
for description in netflix_df['cleaned_description']:
    # Split the description into words
    words = description.split()
    # Generate bigrams from the words
    bigrams = list(ngrams(words, 2))
    # Format each bigram and append to the list
    for bigram in bigrams:
        bigrams_list.append('_'.join(bigram))

# Use Counter to count the occurrences of each formatted bigram
bigram_counts = Counter(bigrams_list)

# Print the 20 most common bigrams and their counts
print("20 Most Common Bigrams:")
for bigram, count in bigram_counts.most_common(20):
    print(f"{bigram}: {count}")

With the bigram counts in hand, the next step is a word cloud that shows the most common phrases visually.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate word cloud from bigram counts
wordcloud = WordCloud(width=800, height=400, background_color='white', collocations=False).generate_from_frequencies(bigram_counts)

# Display the generated image:
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Common Word Pairs in Netflix Content Descriptions', fontsize=20)
plt.show()

### Summary of Common Word Pairs and Their Insights

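Reviewing the word cloud and the list of the top 20 most common bigrams reveals clear patterns in the narratives and themes prevalent across Netflix content descriptions. Because stopwords and punctuation were stripped before pairing, some bigrams are compressed forms of longer phrases: `falls_love` comes from "falls in love", `based_true` from "based on a true story", and `standup_special` from "stand-up special".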

**Key Observations from Top Bigrams:**

*   **Relational and Personal Narratives:** Phrases like `young_man`, `young_woman`, `best_friends`, `best_friend`, `falls_love`, and `two_young` strongly indicate a focus on character-driven stories centered around relationships, coming-of-age, romance, and personal development. These narratives often explore the journeys and experiences of individuals, particularly youth.
*   **Settings and Locations:** `high_school` and `new_york` (and `york_city`) frequently appear, suggesting common settings for these character-driven stories. High schools often serve as backdrops for youthful dramas and comedies, while New York City represents a diverse, bustling urban environment for various plots.
*   **Genre Indicators:** `standup_special`, `true_story`, `based_true`, `world_war` (and `war_ii`), `documentary_follows`, `documentary_series`, and `serial_killer` point towards specific genres or narrative styles. There's a notable presence of comedy (stand-up), historical dramas/documentaries (World War II, true stories), and crime/mystery (serial killer, documentary following events).
*   **Adventure and Exploration:** `around_world` and `road_trip` hint at themes of travel, discovery, and adventure, suggesting content that involves characters embarking on journeys or exploring new places.

**What these phrases reveal about Netflix content:**

1.  **Target Audience:** The prominence of `high_school`, `young_man`, and `young_woman` reinforces a strong appeal to young adult and adolescent demographics. However, the inclusion of `true_story`, `world_war`, and `serial_killer` also indicates content aimed at more mature audiences interested in factual, historical, or crime-related narratives.
2.  **Narrative Focus:** Netflix content frequently revolves around the lives and experiences of individuals, often focusing on their relationships (`best_friends`, `falls_love`), challenges, and personal growth. There's a blend of fictional dramas, comedies, and real-life inspired stories.
3.  **Content Diversity:** While there's a strong emphasis on character-centric dramas and comedies, Netflix also clearly invests in documentaries and true-crime stories, as well as period pieces and action-adventure content, showcasing a diverse portfolio to cater to varied tastes.
4.  **Global & Local Themes:** Although `new_york` points to a specific locale, the "International Movies" genre tag (from the earlier analysis) and phrases like `around_world` suggest that Netflix's content, even when focused on personal narratives, often has a global perspective or involves characters traversing different cultures and places.

In essence, the common word pairs shown in the word cloud and detailed in the bigram list underscore Netflix's strategy to deliver a mix of engaging, character-driven narratives (especially for younger audiences) alongside a significant offering of real-life inspired, historical, and true-crime content, set in both familiar and exotic locations.
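To see how bigrams such as `falls_love` arise, here is a standard-library sketch of the same pipeline on two made-up descriptions. The tiny `STOPWORDS_DEMO` set and the helper names are assumptions for illustration; the notebook itself uses NLTK's English stopword list.

```python
import string
from collections import Counter

# A tiny hand-picked stopword list for illustration only
STOPWORDS_DEMO = {"a", "in", "on", "the", "with", "is", "and"}

def clean_tokens(text):
    # Lowercase, strip punctuation, then drop stopwords
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOPWORDS_DEMO]

def bigrams(tokens):
    # Pair each token with its successor
    return list(zip(tokens, tokens[1:]))

descriptions = [
    "A young man falls in love with a stand-up comic.",
    "Based on a true story, a young man is on the run.",
]

counts = Counter(
    "_".join(pair)
    for text in descriptions
    for pair in bigrams(clean_tokens(text))
)
print(counts.most_common(3))
```

One side effect is visible here: removing punctuation lets bigrams cross clause boundaries (`story_young` in the second description), so rare pairs should be read with some caution.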


In [None]:
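# Exclude the 'Unknown' placeholder added in Step 3 so it is not counted as a director
known_directors = netflix_df.loc[netflix_df['director'] != 'Unknown', 'director']
top_directors = known_directors.value_counts().reset_index()
top_directors.columns = ['director', 'count'] # Rename columns for clarity
top_directors_plot = top_directors.head(15)


plt.figure(figsize=(12, 8))
sns.barplot(y='director', x='count', data=top_directors_plot, palette='mako', hue='director', legend=False)
plt.title('Top 15 Directors on Netflix')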
plt.xlabel('Count')
plt.ylabel('Directors')
plt.show()
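Two details can distort a director ranking: the `'Unknown'` placeholder introduced in Step 3, and rows that may list several co-directors in one comma-separated string. The sketch below handles both on made-up rows; the `toy` DataFrame and the names in it are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "director": [
        "Ann Lee", "Unknown", "Ann Lee, Raj Rao",
        "Unknown", "Raj Rao", "Ann Lee",
    ],
})

# Give each co-director their own row, as done for genres and countries
directors = toy["director"].str.split(", ").explode().str.strip()

# Drop the placeholder so it does not appear in the ranking
directors = directors[directors != "Unknown"]

top = directors.value_counts()
print(top)
```

Whether co-directed titles should count once per director is a judgment call; this version credits every listed director.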