<a href="https://colab.research.google.com/github/Usman-12478/Netflix-Data-Analysis/blob/main/Netflix_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset
df = pd.read_csv('netflix_titles_2021.csv')

In [None]:
df.head()

**What types of shows or movies are uploaded on Netflix?**


In [None]:
df.info()

In [None]:
df['type'].value_counts()

In [None]:
df['rating'].value_counts()

In [None]:
df.shape

In [None]:
df.size

In [None]:
df.columns.tolist()

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
df.describe(include='all')

In [None]:
df.nunique()

In [None]:
df.isnull().sum()
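
Raw null counts are hard to compare across columns, so it can help to express them as a share of all rows. The following is a minimal sketch on a hypothetical three-column stand-in for `df` (the values in `sample` are made up for illustration; only the column names follow the notebook).

```python
import pandas as pd

# Hypothetical stand-in for the Netflix DataFrame; values are invented
sample = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['X', None, None, 'Y'],
    'country': ['India', 'Canada', None, 'India'],
})

# isnull().mean() gives the fraction of missing values per column;
# multiply by 100 for a percentage and sort so the worst column comes first
missing_pct = (sample.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the real data, the same expression would be `(df.isnull().mean() * 100).sort_values(ascending=False)`.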

**What is the correlation between features?**

In [None]:
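# Keep numeric columns only; corr() is not meaningful for text columns
numerical_df = df.select_dtypes(include='number')
correlation_matrix = numerical_df.corr()
# cmap (rather than color) selects the colour scale of the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='Reds')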
plt.xlabel('Features')
plt.ylabel('Features')
plt.title('Correlation Between Features')
plt.show()


**Which shows are most watched on Netflix?**

In [None]:

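# value_counts() only counts how often each title string occurs in the catalog;
# it is a proxy and does not measure viewership.
df = df.dropna(subset=['title']).copy()
df['title'] = df['title'].astype(str)

most_watched = df['title'].value_counts().head(10)
print(most_watched)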


- **What is the distribution of ratings?**


In [None]:
df['rating'].value_counts().plot(kind='bar', title='Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


- **Which has the highest rating: TV shows or movies?**


In [None]:
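# This compares how many titles each type has; 'rating' holds maturity labels, not scores
df['type'].value_counts().plot(kind='bar', title='Number of Titles by Type')
plt.ylabel('Count')
plt.show()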
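
One way to relate `type` to `rating` is a cross-tabulation, which shows how the rating labels are distributed within movies and within TV shows. The sketch below runs on a small hypothetical `sample` frame (invented rows, same column names as the notebook) and assumes that "highest rating" can be read as "most common rating label per type".

```python
import pandas as pd

# Hypothetical rows; only the column names match the notebook's DataFrame
sample = pd.DataFrame({
    'type':   ['Movie', 'Movie', 'TV Show', 'TV Show', 'TV Show', 'Movie'],
    'rating': ['TV-MA', 'TV-MA', 'TV-14',   'TV-14',   'TV-MA',   'PG-13'],
})

# Rows are rating labels, columns are content types, cells are title counts
rating_by_type = pd.crosstab(sample['rating'], sample['type'])
print(rating_by_type)

# Most frequent rating label within each type
most_common_rating = rating_by_type.idxmax()
print(most_common_rating)
```

On the full dataset, `pd.crosstab(df['rating'], df['type'])` could also be passed to `.plot(kind='bar')` for a grouped chart.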

- **What is the best month for releasing content?**


In [None]:
date_format = '%B %d, %Y'

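# Strip stray whitespace first so values such as ' August 4, 2017' still match the format
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), format=date_format, errors='coerce')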

missing_dates = df['date_added'].isnull().sum()
print(f'Number of missing dates after conversion: {missing_dates}')

if missing_dates > 0:
    print("\nRows with missing dates:")
    print(df[df['date_added'].isnull()])

df['month_added'] = df['date_added'].dt.month

# "Best" here means the calendar month (1-12) in which the most titles were added
best_month = int(df['month_added'].value_counts().idxmax())
print(f'The month with the most titles added is: {best_month}')
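
A month number is harder to read than a month name, and date strings with stray whitespace may fail to parse against a strict format. The sketch below illustrates both points on a hypothetical `sample` column (the dates are invented, and the leading space in the second value is deliberate).

```python
import pandas as pd

# Hypothetical dates; the second value carries a leading space on purpose
sample = pd.DataFrame({'date_added': ['September 25, 2021', ' August 4, 2017',
                                      'September 1, 2020', None]})

# Trim whitespace before parsing; unparseable or missing values become NaT
parsed = pd.to_datetime(sample['date_added'].str.strip(),
                        format='%B %d, %Y', errors='coerce')

# Count titles per month name; NaT rows are dropped by value_counts()
month_counts = parsed.dt.month_name().value_counts()
busiest_month = month_counts.idxmax()
print(f'Month with the most titles added: {busiest_month}')
```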


- **Which genres are most watched on Netflix?**


In [None]:
# Counts how many titles list each genre; this reflects catalog size, not viewership
genres = df['listed_in'].str.split(', ').explode().value_counts().head(10)
genres.plot(kind='bar', title='Most Common Genres on Netflix')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.show()
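
The `split` / `explode` chain used for the genre count can be easier to follow on a tiny example. This sketch uses a hypothetical three-row `sample` (the genre strings are illustrative) to show how a comma-separated `listed_in` value contributes one count per genre.

```python
import pandas as pd

# Hypothetical rows in the same comma-separated style as 'listed_in'
sample = pd.DataFrame({'listed_in': ['Dramas, International Movies',
                                     'Comedies, Dramas',
                                     'Dramas']})

# split() turns each string into a list, explode() gives each list item
# its own row, and value_counts() tallies the resulting genre labels
genre_counts = sample['listed_in'].str.split(', ').explode().value_counts()
print(genre_counts)
```

A title tagged with several genres is counted once under each of them, so the totals can exceed the number of titles.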



- **How many movies have been released over the years?**


In [None]:
# No filter on 'type' here, so the counts include both movies and TV shows
df['release_year'].value_counts().sort_index().plot(kind='line', title='Titles Released Over the Years')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()


- **How many movies were made per year?**


In [None]:
movies_per_year = df[df['type'] == 'Movie']['release_year'].value_counts().sort_index()
movies_per_year.plot(kind='bar', title='Movies Made Per Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()


- **What is the show ID and director for 'House of Cards'?**


In [None]:
house_of_cards = df[df['title'] == 'House of Cards']
show_id = house_of_cards['show_id'].values[0]
director = house_of_cards['director'].values[0]
print(f'Show ID: {show_id}, Director: {director}')
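
Indexing with `.values[0]` raises an `IndexError` when no row matches, and a missing director prints as `nan`. A more defensive lookup might look like the sketch below. The helper `lookup_title` and the rows in `sample` are hypothetical (the IDs and director values are invented, not taken from the dataset).

```python
import pandas as pd

# Hypothetical rows; show_id and director values are invented
sample = pd.DataFrame({
    'show_id': ['s1', 's2'],
    'title': ['House of Cards', 'Example Title'],
    'director': [None, 'Jane Doe'],
})

def lookup_title(frame, title):
    """Return (show_id, director) for the first exact title match, or None."""
    match = frame.loc[frame['title'] == title, ['show_id', 'director']]
    if match.empty:
        return None  # no such title, so there is nothing to index into
    row = match.iloc[0]
    # Replace a missing director with a readable placeholder
    director = row['director'] if pd.notna(row['director']) else 'Unknown'
    return row['show_id'], director

found = lookup_title(sample, 'House of Cards')
missing = lookup_title(sample, 'No Such Title')
print(found, missing)
```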


- **List all movies released in 2000.**


In [None]:
movies_2000 = df[(df['type'] == 'Movie') & (df['release_year'] == 2000)]
print(movies_2000)


- **Show only the titles of TV shows released in India.**


In [None]:
tv_shows_india = df[(df['type'] == 'TV Show') & (df['country'] == 'India')]
print(tv_shows_india['title'])


- **Show only the titles of TV shows released in the United States.**

In [None]:
tv_shows_US = df[(df['type'] == 'TV Show') & (df['country'] == 'United States')]
print(tv_shows_US['title'])
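
An exact comparison such as `df['country'] == 'India'` skips any row whose `country` value lists several countries. Assuming multi-country values are comma-separated, the sketch below matches a country anywhere in the list. The helper `titles_for_country` and the rows in `sample` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; 'Show B' lists two countries, 'Show D' has none
sample = pd.DataFrame({
    'title': ['Show A', 'Show B', 'Film C', 'Show D'],
    'type': ['TV Show', 'TV Show', 'Movie', 'TV Show'],
    'country': ['India', 'United Kingdom, India', 'India', None],
})

def titles_for_country(frame, content_type, country):
    """Titles of the given type whose country list contains `country`."""
    # Missing countries become empty strings so split() is always safe
    countries = frame['country'].fillna('').str.split(',')
    in_country = countries.apply(
        lambda names: country in [name.strip() for name in names]
    )
    return frame.loc[(frame['type'] == content_type) & in_country, 'title'].tolist()

india_tv = titles_for_country(sample, 'TV Show', 'India')
print(india_tv)
```

Comparing whole, stripped names avoids the partial matches that a plain substring search could produce.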


- **Identify the top 10 directors who have contributed the most TV shows and movies to Netflix.**


In [None]:
top_directors = df['director'].value_counts().head(10)
print(top_directors)
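
`value_counts()` on the raw `director` column treats a co-directed title as its own entry. Assuming co-directors are separated by `', '`, the sketch below credits each person individually. The names in `sample` are made up for illustration.

```python
import pandas as pd

# Hypothetical director strings, one of them listing two people
sample = pd.DataFrame({'director': ['Ann Lee, Bo Kim', 'Ann Lee', None,
                                    'Bo Kim', 'Ann Lee']})

# Drop missing values, split co-directors apart, then count each name
director_counts = (
    sample['director']
    .dropna()
    .str.split(', ')
    .explode()
    .value_counts()
)
print(director_counts.head(10))
```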


- **How many movies/TV shows has Tom Cruise been cast in?**


In [None]:
tom_cruise = df[df['cast'].str.contains('Tom Cruise', na=False)]
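# The question asks "how many", so report the count before listing the rows
print(f'Titles featuring Tom Cruise: {len(tom_cruise)}')
print(tom_cruise)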


- **How many movies have a "TV-14" rating in Canada?**


In [None]:
tv14_canada = df[(df['rating'] == 'TV-14') & (df['country'] == 'Canada') & (df['type'] == 'Movie')]
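# The question asks "how many", so report the count before listing the rows
print(f'TV-14 movies from Canada: {len(tv14_canada)}')
print(tv14_canada)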
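
When only the number is needed, summing the boolean mask avoids building the filtered DataFrame at all. This is a minimal sketch on a hypothetical `sample` frame with invented rows.

```python
import pandas as pd

# Hypothetical rows; only the first one satisfies all three conditions
sample = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show', 'Movie'],
    'rating': ['TV-14', 'TV-MA', 'TV-14', 'TV-14'],
    'country': ['Canada', 'Canada', 'Canada', 'India'],
})

mask = (
    (sample['type'] == 'Movie')
    & (sample['rating'] == 'TV-14')
    & (sample['country'] == 'Canada')
)

# True counts as 1 and False as 0, so the sum is the number of matching rows
tv14_canada_count = int(mask.sum())
print(tv14_canada_count)
```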