cephasM/netflix

📊 Netflix Data Analysis Project

Author: Pierre Lukozi

LinkedIn: https://www.linkedin.com/in/pierre-musili-65195a112/

🎯 Objective

This project performs an exploratory data analysis (EDA) of the Netflix dataset. The goal is to uncover key insights regarding:

Types of content (Movies vs. TV Shows)

Most frequent directors

Top producing countries

Distribution of age ratings

Trends in content release and addition over time

📁 Project Structure

This notebook includes the following steps:

Data loading and cleaning

Exploratory Data Analysis (EDA)

Visualization of main insights
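The first step, data loading and cleaning, could look roughly like the sketch below. The notebook's own cleaning code is not shown in this README, so the column names, the `clean_titles` helper, and the sample rows are assumptions. The `"Unknown"` placeholder matches the value the director plots filter on.

```python
import pandas as pd


def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the titles table (hypothetical helper)."""
    out = df.drop_duplicates().copy()
    # Fill missing text fields with the "Unknown" placeholder.
    for col in ("director", "country"):
        out[col] = out[col].fillna("Unknown").str.strip()
    # Parse dates such as "September 25, 2021"; unparseable values become NaT.
    out["date_added"] = pd.to_datetime(
        out["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
    )
    return out


# Tiny made-up sample standing in for the real dataset file.
raw = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "director": ["A. Director", None, "A. Director", "A. Director"],
    "country": ["India", "United States", None, "India"],
    "date_added": ["September 25, 2021", " August 1, 2020", None, "September 25, 2021"],
})

df1 = clean_titles(raw)

# Share of Movies vs. TV Shows after cleaning.
type_share = df1["type"].value_counts(normalize=True)
print(type_share)
```

With the real data, `raw` would come from `pd.read_csv(...)` on the dataset file instead of the inline sample.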

📈 Visual Results & Interpretations

1. Distribution of Age Ratings

    # Age ratings for shows in the dataset
    # `ax` is the Axes of the rating count plot; the line that creates it is not shown in this README.
    for p in ax.patches:
        height = p.get_height()
        # Format as an integer so labels read "120" rather than "120.0".
        ax.annotate(f'{height:.0f}', (p.get_x() + p.get_width() / 2, height),
                    ha='center', va='bottom', fontsize=10)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

✅ Insight: The most common age ratings are TV-MA, TV-14, and TV-PG, indicating a catalog weighted toward mature and teen-oriented content.
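The annotation loop above needs an existing `ax`. A minimal end-to-end sketch, assuming the bars come from a plain count of the `rating` column, is shown here with Matplotlib only. The sample ratings are made up, and the `Agg` backend line is there only so the snippet runs without a display (drop it in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; the real notebook uses the full dataset in df1.
df1 = pd.DataFrame({"rating": ["TV-MA", "TV-14", "TV-MA", "TV-PG", "TV-MA", "TV-14"]})

# Count titles per rating, most frequent first.
rating_counts = df1["rating"].value_counts()

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(rating_counts.index.tolist(), rating_counts.tolist(), color="skyblue")
ax.set_title("Age ratings for shows in the dataset")
ax.set_ylabel("Number of titles")

# Write the count above each bar.
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f"{int(height)}", (p.get_x() + p.get_width() / 2, height),
                ha="center", va="bottom", fontsize=10)

labels = [t.get_text() for t in ax.texts]
plt.xticks(rotation=45)
plt.tight_layout()
```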

Figure: Age ratings for shows in the dataset

2. Top 10 Most Prolific Directors

    # Top 10 directors in the dataset
    plt.figure(figsize=(10, 5))

    # Data for the top 10 directors, excluding the "Unknown" placeholder
    top_directors = df1[df1["director"] != "Unknown"]["director"].value_counts().nlargest(10)

    # Draw the plot
    ax = top_directors.plot(kind="barh", color="skyblue")

    # Add the count as a label next to each bar
    for i, v in enumerate(top_directors):
        ax.text(v + 1, i, str(v), color='black', va='center', fontsize=10)

    plt.title("Top 10 directors by number of shows directed")
    plt.xlabel("Number of shows")
    plt.tight_layout()
    plt.show()

...

✅ Insight: Directors like Raúl Campos and Jan Suter appear most frequently. They are often associated with local productions and documentaries.
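One caveat when counting directors: if the `director` field can hold several comma-separated names (an assumption about the dataset, not something this README states), `value_counts()` treats each team as a single "director". The sketch below, on made-up names, contrasts counting per credit string with counting per individual.

```python
import pandas as pd

# Made-up sample with co-directed titles.
df1 = pd.DataFrame({
    "title": ["T1", "T2", "T3", "T4"],
    "director": ["Ana Ruiz, Ben Cole", "Ana Ruiz", "Unknown", "Ben Cole, Ana Ruiz"],
})

known = df1[df1["director"] != "Unknown"]

# Counting the raw strings: every distinct team is its own entry.
by_credit = known["director"].value_counts()

# Splitting on commas and exploding gives one row per individual name.
by_person = (
    known["director"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(by_credit)
print(by_person)
```

Which view is appropriate depends on the question: per-credit counts rank directing teams, per-person counts rank individuals.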

Figure: Top 10 directors in the dataset

3. Age Rating Distribution for Top 10 Directors

    # Distribution of ratings for the top 10 directors
    # Keep only rows with a known director
    df_known = df1[df1["director"] != "Unknown"]

    # Keep the 10 most frequent directors
    top_directors = df_known["director"].value_counts().nlargest(10).index
    df_top_directors = df_known[df_known["director"].isin(top_directors)]

    # Plot
    plt.figure(figsize=(12, 6))
    ax = sns.countplot(data=df_top_directors, x="director", hue="rating", order=top_directors)
    plt.title("Distribution of ratings for the top 10 directors")
    plt.xlabel("Directors")
    plt.ylabel("Number of titles per rating")
    plt.xticks(rotation=45)
    plt.legend(title="Rating")
    plt.tight_layout()

    # Annotate each non-empty bar with its count
    for p in ax.patches:
        height = p.get_height()
        if height > 0:
            ax.annotate(f'{height:.0f}', (p.get_x() + p.get_width() / 2., height),
                        ha='center', va='bottom', fontsize=8)

    plt.show()

...

✅ Insight: Each director works with a variety of ratings. Some focus on mature content, while others lean toward more family-friendly ratings.
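The same director-by-rating breakdown can be read as a table rather than a grouped bar chart. This is a sketch using `pd.crosstab` on made-up data; the top-2 cutoff stands in for the notebook's top 10, and `dominant` is a hypothetical name for each director's most frequent rating.

```python
import pandas as pd

# Made-up sample; the notebook works on the full df1.
df1 = pd.DataFrame({
    "director": ["A", "A", "A", "B", "B", "Unknown", "C"],
    "rating": ["TV-MA", "TV-MA", "TV-14", "TV-PG", "TV-PG", "TV-MA", "TV-14"],
})

df_known = df1[df1["director"] != "Unknown"]
top = df_known["director"].value_counts().nlargest(2).index

# Rows: directors, columns: ratings, cells: number of titles.
table = pd.crosstab(df_known["director"], df_known["rating"]).loc[top]

# Most frequent rating for each director.
dominant = table.idxmax(axis=1)

print(table)
print(dominant)
```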

Figure: Top 10 directors

4. Rating Distribution by Country (Top 10)

    # Distribution of ratings per country (top 10 countries)
    # Top 10 countries by number of titles
    top_countries = df1['country'].value_counts().nlargest(10).index
    df_top_countries = df1[df1['country'].isin(top_countries)]

    # Plot
    plt.figure(figsize=(12, 6))
    ax = sns.countplot(data=df_top_countries, x='country', hue='rating', order=top_countries)
    plt.title("Distribution of ratings per country (top 10 countries)")
    plt.xlabel("Country")
    plt.ylabel("Number of titles per rating")
    plt.xticks(rotation=45)
    plt.legend(title="Rating")
    plt.tight_layout()

    # Annotate each non-empty bar with its count
    for p in ax.patches:
        height = p.get_height()
        if height > 0:
            ax.annotate(f'{height:.0f}', (p.get_x() + p.get_width() / 2., height),
                        ha='center', va='bottom', fontsize=8)

    plt.show()

...

✅ Insight: The United States dominates in volume, followed by India and the United Kingdom, reflecting their strong media production industries.
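Because one country dominates in raw volume, absolute counts make the rating mix of smaller countries hard to compare. One option, sketched here on made-up data, is to normalize within each country so every row sums to 1. The top-2 cutoff stands in for the notebook's top 10.

```python
import pandas as pd

# Made-up sample; the notebook works on the full df1.
df1 = pd.DataFrame({
    "country": ["United States"] * 4 + ["India"] * 2,
    "rating": ["TV-MA", "TV-MA", "TV-14", "PG", "TV-14", "TV-14"],
})

top_countries = df1["country"].value_counts().nlargest(2).index
subset = df1[df1["country"].isin(top_countries)]

# normalize="index" converts counts to per-country shares.
share = pd.crosstab(
    subset["country"], subset["rating"], normalize="index"
).loc[top_countries]

print(share.round(2))
```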

5. Release Year Distribution

    sns.histplot(df1["release_year"])

✅ Insight: Most content was released between 2010 and 2020, peaking in 2018–2019, likely due to Netflix’s global expansion.
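A histogram with automatic bins can blur individual years, so claims about a date range or a peak year are easier to check numerically. The sketch below uses made-up years, and the variable names are illustrative. In the plot itself, passing `discrete=True` to `sns.histplot` is one way to get one bar per year.

```python
import pandas as pd

# Made-up sample; the notebook works on the full df1.
df1 = pd.DataFrame({"release_year": [1999, 2012, 2016, 2018, 2018, 2019, 2021, 2018]})

# Titles per release year, in chronological order.
per_year = df1["release_year"].value_counts().sort_index()

# Fraction of titles released from 2010 through 2020 (both ends included).
share_2010_2020 = df1["release_year"].between(2010, 2020).mean()

# Year with the most titles.
peak_year = per_year.idxmax()

print(per_year)
print(share_2010_2020, peak_year)
```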

Figure: Visualization of release years

6. Number of Shows Released Each Year Since 2008

    # Number of shows released each year since 2008
    order = range(2008, 2022)
    plt.figure(figsize=(10, 5))

    # Customize colours; the palette needs one entry per value of "type".
    # The "TV Show" colour is an assumption: the source palette lists only "Movie",
    # which fails if df1 contains both types.
    palette_colors = {"Movie": "skyblue", "TV Show": "salmon"}
    p = sns.countplot(x="release_year", data=df1, hue="type", order=order, palette=palette_colors)
    plt.title("Number of shows released each year since 2008 that are on Netflix")
    plt.xlabel("")

    # Annotate each non-empty bar with its count
    for i in p.patches:
        if i.get_height() > 0:
            p.annotate(format(i.get_height(), '.0f'),
                       (i.get_x() + i.get_width() / 2., i.get_height()),
                       ha="center", va="center", xytext=(0, 10), textcoords="offset points")

...

✅ Insight: Netflix's content production significantly increased after 2015, with a strong upward trend until 2020.
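The counts behind this chart can also be produced as a year-by-type table, which makes the trend easy to inspect or export. This sketch uses made-up rows and reindexes to the same 2008 to 2021 range as the plot, so years without titles appear as zeros.

```python
import pandas as pd

# Made-up sample; the notebook works on the full df1.
df1 = pd.DataFrame({
    "release_year": [2007, 2008, 2008, 2010, 2021, 2021, 2021],
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show", "Movie"],
})

years = list(range(2008, 2022))

# Rows: release years, columns: content types, cells: number of titles.
by_year_type = (
    pd.crosstab(df1["release_year"], df1["type"])
    .reindex(years, fill_value=0)
)

print(by_year_type)
```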

Figure: Number of shows released each year

🧰 Technologies Used

Python

Pandas

Matplotlib

Seaborn

Jupyter Notebook

📌 Conclusion

This exploratory data analysis (EDA) revealed key trends in Netflix's catalog: the dominance of certain countries and directors, a strong push in recent years for more content, and a concentration in mature audience ratings.

These findings form a solid foundation for future work such as:

Building a recommendation engine

Genre and duration analysis

NLP on descriptions and reviews

🖊️ Author

Pierre Lukozi

📎 LinkedIn Profile
