📊 Netflix Data Analysis Project
Author: Pierre Lukozi
LinkedIn: https://www.linkedin.com/in/pierre-musili-65195a112/
🎯 Objective
This project aims to perform an exploratory data analysis (EDA) on the Netflix dataset. The goal is to uncover key insights regarding:
Types of content (Movies vs. TV Shows)
Most frequent directors
Top producing countries
Distribution of age ratings
Trends in content release and addition over time
📁 Project Structure
This notebook includes the following steps:
Data loading and cleaning
Exploratory Data Analysis (EDA)
Visualization of main insights
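The loading and cleaning step could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's actual code: the file name `netflix_titles.csv`, the helper `clean_netflix`, and the choice to fill missing `director` and `country` values with "Unknown" are all hypothetical (the plots later on do filter out an "Unknown" director, which suggests a similar fill was used).

```python
import pandas as pd


def clean_netflix(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: drop exact duplicates and fill missing text fields."""
    out = df.drop_duplicates().copy()
    for col in ("director", "country"):
        # Assumed convention: missing values become the placeholder "Unknown".
        out[col] = out[col].fillna("Unknown")
    # Strip stray whitespace so identical names are counted together.
    out["director"] = out["director"].str.strip()
    return out.reset_index(drop=True)


# Tiny made-up sample standing in for pd.read_csv("netflix_titles.csv") (hypothetical path).
raw = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "director": [" Jane Doe", None, "John Roe", " Jane Doe"],
    "country": ["India", "United States", None, "India"],
    "rating": ["TV-MA", "TV-14", "TV-PG", "TV-MA"],
})

df1 = clean_netflix(raw)
print(df1)
```

Keeping the cleaning in one function makes it easy to rerun on a fresh download and to check on a small sample before applying it to the full dataset.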
📈 Visual Results & Interpretations
1. Distribution of Age Ratings
# `ax` is the Axes holding the rating count plot (its creation is not shown in this excerpt)
# Label each bar with its count
for p in ax.patches:
    height = int(p.get_height())  # bar heights are floats; cast for clean labels
    ax.annotate(f'{height}', (p.get_x() + p.get_width() / 2, height), ha='center', va='bottom', fontsize=10)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
✅ Insight: The most common age ratings are TV-MA, TV-14, and TV-PG, indicating that the catalog skews toward mature and teen-oriented content.
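The excerpt above starts at the annotation loop, so the plot that produces `ax` is not visible. A minimal, self-contained sketch of one way to get an annotated count plot follows. It uses pandas' built-in bar plot, which may differ from what the notebook used, and the ratings are made-up sample values rather than the real `df1["rating"]` column.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ratings standing in for df1["rating"].
ratings = pd.Series(["TV-MA", "TV-MA", "TV-MA", "TV-14", "TV-14", "TV-PG"])
counts = ratings.value_counts()  # sorted from most to least frequent

fig, ax = plt.subplots(figsize=(10, 5))
counts.plot(kind="bar", color="skyblue", ax=ax)

labels = []
for p in ax.patches:
    height = int(p.get_height())  # bar heights are numeric; cast for clean labels
    labels.append(height)
    ax.annotate(str(height), (p.get_x() + p.get_width() / 2, height),
                ha="center", va="bottom", fontsize=10)

ax.set_title("Distribution of age ratings")
plt.xticks(rotation=45)
plt.tight_layout()
# In a notebook, finish with plt.show().
```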
2. Top 10 Most Prolific Directors
# Top 10 directors in the dataset
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
# Exclude the "Unknown" placeholder before counting
top_directors = df1[df1["director"] != "Unknown"]["director"].value_counts().nlargest(10)
ax = top_directors.plot(kind="barh", color="skyblue")
# Write each count just past the end of its bar
for i, v in enumerate(top_directors):
    ax.text(v + 1, i, str(v), color='black', va='center', fontsize=10)
plt.title("Top 10 directors by number of shows directed")
plt.xlabel("Number of shows")
plt.tight_layout()
plt.show()
...
✅ Insight: Directors like Raúl Campos and Jan Suter appear most frequently. They are often associated with local productions and documentaries.
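One caveat when counting directors: in catalog data of this kind, a `director` cell may list several names separated by commas, in which case a pair of co-directors is counted as a single entry. Whether that applies to this dataset is an assumption. If it does, a sketch like the following credits each person separately. The sample rows and names are invented for illustration.

```python
import pandas as pd

# Invented rows; "Unknown" mimics the placeholder used for missing directors.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Ann Lee, Bo Kim", "Ann Lee", "Unknown", "Bo Kim, Ann Lee"],
})

known = sample[sample["director"] != "Unknown"]
per_person = (
    known["director"]
    .str.split(",")   # "Ann Lee, Bo Kim" -> ["Ann Lee", " Bo Kim"]
    .explode()        # one row per individual name
    .str.strip()      # remove the space left after each comma
    .value_counts()
)
print(per_person)
```

Counting per person and counting per credited team answer different questions, so it is worth stating which one a chart shows.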
3. Age Rating Distribution for Top 10 Directors
# Rating distribution for the 10 most prolific directors
import seaborn as sns

# Keep only rows with a known director
df_known = df1[df1["director"] != "Unknown"]
top_directors = df_known["director"].value_counts().nlargest(10).index
df_top_directors = df_known[df_known["director"].isin(top_directors)]

plt.figure(figsize=(12, 6))
ax = sns.countplot(data=df_top_directors, x="director", hue="rating", order=top_directors)
plt.title("Rating distribution for the top 10 directors")
plt.xlabel("Director")
plt.ylabel("Number of titles")
plt.xticks(rotation=45)
plt.legend(title="Rating")
plt.tight_layout()

# Annotate non-empty bars with their counts
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height), ha='center', va='bottom', fontsize=8)
plt.show()
...
✅ Insight: Each director works across a variety of ratings. Some focus on mature content, while others lean toward family-friendly titles.
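A grouped bar chart with many hue levels can be hard to read, so a plain table of the same counts may help when checking the insight above. The following is a small sketch using `pd.crosstab`. The data is invented and only two "top" directors are kept to keep the output short.

```python
import pandas as pd

# Invented sample standing in for df_known[["director", "rating"]].
sample = pd.DataFrame({
    "director": ["Ann Lee", "Ann Lee", "Ann Lee", "Bo Kim", "Bo Kim", "Cy Park"],
    "rating": ["TV-MA", "TV-MA", "TV-14", "TV-PG", "TV-PG", "TV-MA"],
})

# Names of the most frequent directors (top 2 here, top 10 in the analysis).
top = sample["director"].value_counts().nlargest(2).index

# Rows: directors, columns: ratings, cells: number of titles.
table = pd.crosstab(sample["director"], sample["rating"]).loc[top]
print(table)
```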
4. Rating Distribution by Country (Top 10)
# Rating distribution per country (top 10 countries)
# Select the 10 countries with the most titles
top_countries = df1['country'].value_counts().nlargest(10).index
df_top_countries = df1[df1['country'].isin(top_countries)]

plt.figure(figsize=(12, 6))
ax = sns.countplot(data=df_top_countries, x='country', hue='rating', order=top_countries)
plt.title("Rating distribution per country (top 10 countries)")
plt.xlabel("Country")
plt.ylabel("Number of titles")
plt.xticks(rotation=45)
plt.legend(title="Rating")
plt.tight_layout()

# Annotate non-empty bars with their counts
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height), ha='center', va='bottom', fontsize=8)
plt.show()
...
✅ Insight: The United States dominates in volume, followed by India and the United Kingdom, reflecting their strong media production industries.
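Because one country dominates in raw volume, absolute counts make it hard to compare the rating mix between countries. A possible complement, sketched here on invented data, is to look at each rating's share within a country instead.

```python
import pandas as pd

# Invented sample standing in for df_top_countries[["country", "rating"]].
sample = pd.DataFrame({
    "country": ["United States"] * 4 + ["India"] * 2,
    "rating": ["TV-MA", "TV-MA", "TV-14", "TV-PG", "TV-14", "TV-14"],
})

# Share of each rating within a country; every row sums to 1.
shares = (
    sample.groupby("country")["rating"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(shares.round(2))
```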
5. Release Year Distribution
sns.histplot(df1["release_year"])
✅ Insight: Most content was released between 2010 and 2020, peaking in 2018–2019, likely due to Netflix’s global expansion.
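A statement such as "most content was released between 2010 and 2020" can be backed by a number read from the data rather than from the histogram alone. The sketch below shows one way to compute that share. The years are invented, so the printed value says nothing about the real dataset.

```python
import pandas as pd

# Invented release years standing in for df1["release_year"].
years = pd.Series([1999, 2011, 2015, 2018, 2018, 2019, 2019, 2021], name="release_year")

# Fraction of titles released from 2010 to 2020, both bounds included.
share_2010_2020 = years.between(2010, 2020).mean()

# Titles per release year, in chronological order.
per_year = years.value_counts().sort_index()

print(f"Share released 2010-2020: {share_2010_2020:.0%}")
print(per_year)
```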
6. Number of Shows Released Each Year Since 2008
order = range(2008, 2022)
plt.figure(figsize=(10, 5))
# A dict palette must cover every value of the hue column; the "TV Show" colour is an assumed choice
palette_colors = {"Movie": "skyblue", "TV Show": "salmon"}
ax = sns.countplot(x="release_year", data=df1, hue="type", order=order, palette=palette_colors)
plt.title("Number of shows released each year since 2008 that are on Netflix")
plt.xlabel("")
# Label each non-empty bar with its count, 10 points above the bar top
for patch in ax.patches:
    if patch.get_height() > 0:
        ax.annotate(format(patch.get_height(), '.0f'), (patch.get_x() + patch.get_width() / 2., patch.get_height()), ha="center", va="center", xytext=(0, 10), textcoords="offset points")
...
✅ Insight: The number of titles in the catalog rises sharply for release years after 2015, with a strong upward trend until 2020.
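When a dictionary is passed as `palette` together with `hue`, seaborn expects a colour for every category in the hue column. One way to avoid a missing key is to build the dictionary from the data, as in the sketch below. The colour names and the sample rows are illustrative assumptions. The same snippet also tabulates the yearly counts per type, which are the numbers the bar labels would display.

```python
import pandas as pd

# Invented sample standing in for df1[["release_year", "type"]].
sample = pd.DataFrame({
    "release_year": [2016, 2016, 2017, 2018, 2018, 2018],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"],
})

colors = ["skyblue", "salmon", "lightgreen"]  # illustrative colour choices
levels = sorted(sample["type"].unique())      # every category that hue will encounter
palette_colors = dict(zip(levels, colors))    # one colour per category

# Titles per release year and type; missing combinations become 0.
yearly = sample.groupby(["release_year", "type"]).size().unstack(fill_value=0)

print(palette_colors)
print(yearly)
```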
🧰 Technologies Used
Python
Pandas
Matplotlib
Seaborn
Jupyter Notebook
📌 Conclusion
This exploratory data analysis (EDA) revealed key trends in Netflix's catalog: the dominance of certain countries and directors, a strong push in recent years for more content, and a concentration in mature audience ratings.
These findings form a solid foundation for future work such as:
Building a recommendation engine
Genre and duration analysis
NLP on descriptions and reviews