# Netflix Culture
## Identifying Trends and Patterns in TV Shows and Movies


In [None]:
from IPython.display import Image
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

img = mpimg.imread('netflix1.png')
plt.figure(figsize=(14, 10))
plt.imshow(img)
plt.show()
img = mpimg.imread('netflix2.png')
plt.figure(figsize=(14, 10))
plt.imshow(img)
plt.show()
img = mpimg.imread('netflix3.png')
plt.figure(figsize=(14, 10))
plt.imshow(img)
plt.show()

# Introduction


Netflix is one of the most popular streaming services in the world. It offers its subscribers a wide range of movies and TV shows and is known for its foreign-language, genre-specific, and binge-worthy content. It provides the audience with high-quality original productions, which is one reason for its success in the market. Like many companies, Netflix relies on a huge amount of data (Big Data). As a large streaming company, it collects data about its subscribers' actions: what they watch most, when they watch, and for how long. This data also feeds its recommendation system. By analyzing subscribers' viewing history and behavior, Netflix offers content that each subscriber is most likely to be interested in. As a result, the audience stays engaged with the platform, which in turn benefits the company.


# The aim of our project

Since some of our group members are Netflix subscribers, we became interested in analyzing patterns in its TV shows and movies. We chose two datasets containing different kinds of information about Netflix content, such as the titles of the movies and TV shows, the release years, the directors, etc. This data gave us the opportunity to analyze patterns in the content and to build visualizations that show them more clearly. The purpose of this paper is to clean, analyze, and visualize the data while explaining our steps in detail. All processing is done in Python.


# Datasets

As mentioned above, we have two datasets, netflix_titles.csv and imdb_top_1000.csv. Both were obtained from Kaggle.com, one of the largest data science communities and a source of useful resources.

netflix_titles.csv contains unlabelled text data on about 8,800 Netflix shows and movies, with details such as cast, release year, rating, and description.

imdb_top_1000.csv is an IMDB dataset of the top 1000 movies and TV shows.

In addition to the datasets, we use a GeoJSON file called countries.geojson.


### Data Cleaning


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from wordcloud import WordCloud, STOPWORDS
import itertools
from textblob import TextBlob

The cell above imports all the libraries we use:
1) NumPy is a library for numerical computing in Python. It provides tools for working with arrays and matrices.
2) Pandas is a library for data manipulation and analysis. It provides the DataFrame structure that we use throughout.
3) Seaborn is a library for statistical data visualization. It provides a high-level interface for creating statistical graphics.
4) Matplotlib is a library for creating static, animated, and interactive visualizations in Python. It provides tools for creating many types of plots.
5) Wordcloud is a library for creating word clouds in Python. A word cloud is a visual representation of text data, where the size of each word is proportional to its frequency in the text.
6) TextBlob is a Python library for processing textual data. It is built on top of the Natural Language Toolkit (NLTK) library and provides a simple API for common natural language processing (NLP) tasks such as sentiment analysis, part-of-speech tagging, noun phrase extraction, and more.
7) Finally, itertools is a standard-library module for working with iterators, which are objects that can be looped over. It provides tools for creating, combining, and manipulating iterators.


In [None]:
df = pd.read_csv("./netflix_titles.csv")
df2 = pd.read_csv("./imdb_top_1000.csv")
pd.set_option('display.max_columns', None)

This cell loads the data. The first two lines create two pandas DataFrames, df and df2, by reading the CSV files, which lets us work with the data inside the datasets. The third line tells pandas to display all the columns of a DataFrame; without setting the option to None, pandas would limit the number of displayed columns by default.


### Getting basic information about dataset

In [None]:
df.isnull().sum()

isnull().sum() is a pandas method chain that counts the missing values in each column of the DataFrame df. Missing values here are NaN, None, and NaT; zeros and empty strings are not counted as missing.
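
As a minimal sketch of what isnull().sum() counts, the hypothetical toy DataFrame below (made up for illustration, not part of our dataset) mixes real missing values with a zero and an empty string:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one None, one NaN, one empty string and one zero.
toy = pd.DataFrame({
    "director": ["A", None, "C"],
    "duration": ["90 min", np.nan, ""],
    "release_year": [2020, 0, 2019],
})

# isnull() marks None/NaN as True; sum() adds up the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Only None and NaN are counted here; the 0 and the empty string are treated as ordinary values.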

In [None]:
df["country"] = df["country"].fillna("MISSING")
df["duration"] = df["duration"].fillna("0 min")
df["director"] = df["director"].fillna("Unknown")
df["cast"] = df["cast"].fillna("Unknown")
df["date_added"] = df["date_added"].fillna("Unknown")
df["rating"] = df["rating"].fillna("Unknown")
df.head(n=3)

Each line fills the missing values of one column with a placeholder string and assigns the result back to that column.

The "country" column is filled with "MISSING".

The "duration" column is filled with "0 min", so that every value can later be split into a number and a unit.

The "director", "cast", "date_added", and "rating" columns are filled with "Unknown".

In all of these cases the fill value is a string.
We assign the result back to the column so that the original DataFrame is changed rather than a copy of it.
At the end, these columns contain no missing (NaN) values.
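
The same cleaning can be sketched in a single call by passing a dictionary to fillna(). This is an alternative formulation, not the cell we ran; toy and fill_values are hypothetical names introduced for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with a few missing values.
toy = pd.DataFrame({
    "country": ["India", np.nan],
    "director": [np.nan, "X"],
    "duration": ["90 min", np.nan],
})

# Maps each column name to the placeholder used for its missing values.
fill_values = {"country": "MISSING", "director": "Unknown", "duration": "0 min"}

# fillna returns a new DataFrame; toy itself is left unchanged.
toy_filled = toy.fillna(fill_values)
print(toy_filled)
```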

In [None]:
df.describe()

describe() is a method that returns a summary of the central tendency, dispersion, and shape of the distribution of the numeric columns of our DataFrame. Here the only numeric column is release_year.
In the output we can see the rows "count", "mean", "std", "min", "25%", "50%", "75%", and "max". count is the number of non-null values in the column, mean is the average, std is the standard deviation, min is the minimum value, 25%, 50%, and 75% are the 25th, 50th (median), and 75th percentiles, and max is the maximum value.

In [None]:
df.shape

df.shape returns a tuple with the number of rows and the number of columns in the dataframe. In our case the number of the rows is 8807 and the number of columns is 12.

In [None]:
df.columns

In pandas, the attribute df.columns returns the names of the columns as an Index. We can see the column names of our dataset above, along with the dtype of the Index, which is "object".

In [None]:
df.count()

count() method returns the number of non-null values in each column of a DataFrame. This method can be used to quickly identify missing values in a Dataframe.

In [None]:
df.nunique()

nunique() method returns the number of unique values in each column of a Dataframe.

In [None]:
print(f" dtype - show_id: {df.show_id.dtype}")
print(f" dtype - type: {df.type.dtype}")
print(f" dtype - title: {df.title.dtype}")
print(f" dtype - director: {df.director.dtype}")
print(f" dtype - cast: {df.cast.dtype}")
print(f" dtype - country: {df.country.dtype}")
print(f" dtype - date_added: {df.date_added.dtype}")
print(f" dtype - release_year: {df.release_year.dtype}")
print(f" dtype - rating: {df.rating.dtype}")
print(f" dtype - duration: {df.duration.dtype}")
print(f" dtype - listed_in: {df.listed_in.dtype}")
print(f" dtype - description: {df.description.dtype}")

All lines in this cell follow the same pattern for different columns, so we explain only the first one.

print(f" dtype - show_id: {df.show_id.dtype}")
This prints the data type of the column. The text "dtype - show_id:" in the f-string labels the output with the column name, so that each printed value is easy to read.

From the output we can see that we have 11 columns of type object and 1 column of type int64.
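
As a shorter alternative sketch, the same information is available from the dtypes attribute, which can be looped over instead of writing one print per column. The two-column toy DataFrame is hypothetical:

```python
import pandas as pd

# Hypothetical two-column stand-in for df.
toy = pd.DataFrame({"show_id": ["s1", "s2"], "release_year": [2020, 2021]})

# dtypes is a Series that maps each column name to its data type.
for column, dtype in toy.dtypes.items():
    print(f" dtype - {column}: {dtype}")

# Counts how many columns hold numeric data.
n_numeric = sum(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)
print(n_numeric)
```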

In [None]:
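df = df.dropna(axis="index", how="all") #drops rows in which every value is missing; dropna returns a new DataFrame, so we assign it back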

The dropna() method removes missing values from a DataFrame. The axis parameter specifies whether rows or columns are removed. The how parameter is the condition for removal: with how="all", a row is removed only if all of its values are missing, while how="any" removes it if at least one value is missing.
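
A small sketch of the difference between how="all" and how="any", using a hypothetical three-row DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is completely empty, row 2 is partly empty.
toy = pd.DataFrame({
    "title": ["A", np.nan, "C"],
    "director": ["X", np.nan, np.nan],
})

# how="all" removes only rows in which every value is missing.
only_empty_rows_removed = toy.dropna(axis="index", how="all")

# how="any" removes rows with at least one missing value.
any_missing_removed = toy.dropna(axis="index", how="any")

print(len(toy), len(only_empty_rows_removed), len(any_missing_removed))
```

Both calls return new DataFrames, so toy keeps its three rows unless the result is assigned back to it.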

### Working with duration column and splitting it into 2 columns

In [None]:
df["duration"].unique()

unique() returns an array of the unique values in a column (a pandas Series). In our case it returns the unique values in the "duration" column.

In [None]:
l_num_dur=list() #will hold the numeric part of each duration
l_seas_min=list() #will hold the unit of each duration
for i in df["duration"]: #loops over the duration column
    num_dur=int(i.split()[0]) #the integer value, e.g. 90 from "90 min"
    w_dur=i.split()[1] #the unit: "min", "Season" or "Seasons"
    l_num_dur.append(num_dur) #adds the numerical value of the duration
    l_seas_min.append(w_dur) #adds the unit
df["Number duration"]=l_num_dur #stores the numbers as a new column
df["Season/min"]=l_seas_min #stores the units as a new column
df.head(n=3) #shows the first 3 rows of the dataframe

The code above splits the duration column into two parts: an integer and a unit ("min", "Season", or "Seasons"). Each line is explained in the code cell.
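
The same split can also be sketched without an explicit loop, using the pandas string methods. This is an alternative to the cell above, shown on a hypothetical three-row DataFrame:

```python
import pandas as pd

# Hypothetical stand-in for the duration column.
toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season"]})

# str.split(expand=True) returns one column per piece: 0 = number, 1 = unit.
parts = toy["duration"].str.split(expand=True)

toy["Number duration"] = parts[0].astype(int)  # numeric part as integers
toy["Season/min"] = parts[1]                   # "min", "Season" or "Seasons"
print(toy)
```

Note that a single-season show yields the unit "Season", not "Seasons", which matters when filtering on this column later.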

In [None]:
df["rating"].unique()

In this case unique() will return an array of unique values in the "rating" column.

Here, as we can see, we have a problem: three values that belong to the "duration" column have been placed in the "rating" column. We will locate these rows and assign them their true values. Their "rating" will be marked as "Unknown", while their "duration" will get its true value.

In [None]:
print(df.loc[df["rating"]=="74 min", "duration"])
print(df.loc[df["rating"]=="84 min", "duration"])
print(df.loc[df["rating"]=="66 min", "duration"])
#The duration of these 3 rows was missing (filled with "0 min" above); the actual values had been put in the rating column

In [None]:
for wrong_rating in ["74 min", "84 min", "66 min"]:
    mask = df["rating"] == wrong_rating #rows where a duration was stored as the rating
    df.loc[mask, "duration"] = wrong_rating #moves the value to the duration column
    df.loc[mask, "Number duration"] = int(wrong_rating.split()[0]) #keeps the split columns in sync
    df.loc[mask, "Season/min"] = "min"
    df.loc[mask, "rating"] = "Unknown" #the real rating is not known

### Number of TV Shows and Movies

In [None]:
print(df["type"].unique()) #here we see that there are only two types: Movie and TV show
df_movies = df[df["type"] == "Movie"]
df_series = df[df["type"] == "TV Show"]

Two new DataFrames, one for movies and one for TV shows, are created from the original df. They let us count and compare movies and TV shows independently.

In [None]:
n_of_movies_and_tv_series = len(df) #total number of titles (movies and TV shows together)
n_of_movies =  df_movies["title"].count()
n_of_shows = df_series["title"].count()
print(f"There are {n_of_movies} movies and {n_of_shows} TV shows in our dataset")
#that means that the Netflix catalog contains more movies than TV shows
n_of_movies_and_tv_series

The cell above performs some basic analysis: it calculates the number of movies, the number of TV shows, and the total number of titles in the dataset.

In [None]:
percentage_movies = (n_of_movies / n_of_movies_and_tv_series) * 100
percentage_tvshows =  (n_of_shows / n_of_movies_and_tv_series) * 100

labels = ['Movies', 'TV Shows']
values = [percentage_movies, percentage_tvshows]

colors_pie = ['#E50914', '#8C8C8C', '#221F1F']

plt.figure(figsize=(8,8))
plt.pie(values, colors = colors_pie, autopct='%1.1f%%', textprops={'fontsize': 14})
plt.title('Movies and TV Shows')
plt.legend(loc='upper right', labels=labels)
plt.show() 

The percentage of movies is calculated by dividing the number of movies by the total number of movies and TV shows, and the percentage of TV shows is calculated the same way. The pie chart is then drawn with matplotlib, with the labels "Movies" and "TV Shows" and matching colors. It makes the distribution of movies and TV shows in the dataset easier to see.
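
As a compact sketch of the same calculation, value_counts() can return both the counts and the percentages directly. The four-row toy DataFrame is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the "type" column of df.
toy = pd.DataFrame({"type": ["Movie", "TV Show", "Movie", "Movie"]})

# Absolute number of rows per type.
type_counts = toy["type"].value_counts()

# normalize=True returns fractions of the total; times 100 gives percentages.
type_share = toy["type"].value_counts(normalize=True) * 100

print(type_counts)
print(type_share)
```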

### Years when the highest/lowest number of movies/TV shows available on Netflix were produced.

In [None]:
count_dict = {} #creates an empty dictionary

for year in df["release_year"]: #iterates over the release years in the dataframe
    if year in count_dict: #checks whether that year already exists in dictionary as a key
        count_dict[year] += 1 #if yes, adds 1 to its value
    else:
        count_dict[year] = 1 #if no, adds it to the dictionary as a key

max_n = max(count_dict.values()) #finds the largest number of titles released in a single year
min_n = min(count_dict.values()) #finds the smallest number of titles released in a single year
list_of_min = []
list_of_max = []
for i in count_dict.keys():
    if count_dict[i] == max_n:
        list_of_max.append(i) #iterating over dictionary keys, finds what keys have the value = to maximum number of movies, and add to the list
        
    if count_dict[i] == min_n:
        list_of_min.append(i) #the same with minimum number of movies
        
print(f"{list_of_max} is/are the year(s) when the highest number of film/series available on Netflix were produced.")
print(f"{list_of_min} is/are the year(s) when the lowest number of film/series available on Netflix were produced.")

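This algorithm counts the titles per release year and identifies the years in which the most and the fewest of the TV shows and movies available on Netflix were produced. Note that these are the release years of titles in the catalog, not only of titles produced by Netflix itself.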
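
The counting and the search for the busiest and quietest years can also be sketched with value_counts(). The years below are hypothetical and chosen so that there are ties:

```python
import pandas as pd

# Hypothetical release years; 2018 and 2020 tie for the most titles.
toy_years = pd.Series([2018, 2018, 2019, 2020, 2020, 1999])

# Number of titles per release year.
counts = toy_years.value_counts()

# Keeps every year that reaches the maximum (or minimum), so ties are preserved.
busiest = sorted(counts[counts == counts.max()].index)
quietest = sorted(counts[counts == counts.min()].index)

print(busiest, quietest)
```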

In [None]:
keys = list(count_dict.keys()) #creates a list from dictionary keys
keys.sort() #sorts the list to avoid the mess in the graph

value_counts_2 = {} #creates an empty dictionary
for i in keys:
    value_counts_2[i] = count_dict[i] 

keys_2 = list(value_counts_2.keys())
values_2 = list(value_counts_2.values())

plt.figure(figsize=(12,10))
plt.plot(keys_2, values_2, color="red")
plt.xlabel("Years")
plt.xlim(1997,2023) #since the netflix was founded only in 1997
plt.ylabel("Number of movies")
plt.title("Number of produced movies over years")
plt.show()

The code above sorts the years and plots the number of titles per release year as a line chart. The x-axis is limited to 1997-2023, since Netflix was founded in 1997.

### Rating System

In [None]:
rating_list = list(df["rating"].unique()) # creates a list of all unique movie ratings in the dataset

rating_list.remove("Unknown") # removes the "Unknown" rating from the list

rating_count_dict = {} # creates an empty dictionary to store the counts of each rating

for rating in df["rating"]: # Iterates over each movie in the dataset
    # if the rating is already in the dictionary, increment its count
    if rating in rating_count_dict:
        rating_count_dict[rating] += 1
    # if the rating is not "Unknown" and not already in the dictionary, add it with a count of 1
    elif rating != "Unknown":
        rating_count_dict[rating] = 1

keys_3 = rating_count_dict.keys() #gets the keys(ratings)
values_3 = rating_count_dict.values() #gets the values(counts)

plt.figure(figsize=(12,8))# creates a figure with a size of 12 x 8 inches
plt.bar(keys_3, values_3, color="darkred", ec="black")
#creates a bar chart with the ratings on the x-axis, counts on the y-axis, dark red bars, and black edges
plt.xticks(rotation=45) #rotates the x-axis labels by 45 degrees for readability
plt.ylabel("Number of movies") #sets y-axis label
plt.xlabel("Rating") #sets x-axis label
plt.title("Motion Picture Association film rating system") #sets the title
plt.show() #displays the chart

The graph makes it easier to see how Netflix movies and TV shows are rated. It shows that the largest share of the Netflix selection is rated for mature audiences (TV-MA), with TV-14 following behind. The graph also includes a limited number of G-, PG-, and TV-G-rated films and television programs that are appropriate for younger viewers. Overall, this chart offers useful information about the kinds of content Netflix offers and the intended viewers for each rating group.

## Wordcloud of movie titles

Let's create something interesting: a word cloud built from the words used in the titles. What will this give us? We will find out which words were used most often across the more than 8000 titles in the dataset.

In [None]:
title_text = ' '.join(df['title'].dropna())
wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', colormap="Reds", width=800, height=400).generate(title_text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()


Based on the titles of the films and TV series in the dataset, this code creates a word cloud. The WordCloud class from the wordcloud library is used for this purpose. Those words that appear most frequently in the names of the films and TV series in the dataset are represented visually in the resulting word cloud.
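
The frequencies behind a word cloud can be sketched with the standard library alone. The titles and the tiny stopword set below are hypothetical stand-ins; the WordCloud class uses its own, much larger STOPWORDS set and its own tokenization:

```python
import re
from collections import Counter

# Hypothetical titles and a deliberately tiny stopword set.
titles = ["Love Story", "A Love Song", "The Last Dance"]
stopwords = {"a", "the"}

# Lowercases each title, extracts the words and drops the stopwords.
words = [
    word
    for title in titles
    for word in re.findall(r"[a-z']+", title.lower())
    if word not in stopwords
]

# Counter maps each word to the number of times it occurs.
freq = Counter(words)
print(freq.most_common(3))
```

In a word cloud, the words with the highest counts are the ones drawn largest.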

### Getting top directors (more than 10 titles)

In [None]:
director_counts = df[df["director"]!="Unknown"]['director'].value_counts()
popular_directors=director_counts[director_counts>10]

colormap=cm.ScalarMappable(cmap=cm.Reds) # create a colormap
colormap.set_clim(popular_directors.min(), popular_directors.max()) # set the limits for the colormap
#color=colormap.to_rgba(popular_directors)

plt.figure(figsize=(12,8)) # creates a figure with a size of 12 x 8 inches
popular_directors.plot(kind='bar', color=colormap.to_rgba(popular_directors), ec="black" ) # plot a bar chart with the colormap
plt.title('Number of Movies Directed by Each Director (directors with more than 10 titles)') # set the title
plt.xlabel('Director') # set the x-axis label
plt.ylabel('Number of Movies') # set the y-axis label
plt.xticks(rotation=45, fontsize = 12) # rotate the x-axis labels
plt.show() # display the plot

#known issue in the code: two directors who worked together are returned as one entry

Taking into account only filmmakers who have directed more than 10 titles, this code creates a bar plot showing the number of titles that each director (or group of directors, if they worked on the same project together) has directed. The value_counts() function counts the number of titles per director, and the popular_directors variable keeps only those with more than 10. Using a colormap, the color of each bar is determined by the number of titles the filmmaker has directed. The plot is then drawn with the popular_directors.plot() method, with axis labels, a title, and rotated tick labels. It makes it easier to see which directors have the most titles on Netflix.
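
One possible way to count co-directors individually, rather than as a group, is to split the "director" strings before counting. This is a sketch under the assumption that co-directors are separated by ", "; the director names are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the "director" column of df.
toy = pd.DataFrame({
    "director": ["Director A, Director B", "Director B", "Unknown",
                 "Director A, Director B"],
})

# Ignores the placeholder used for missing directors.
known = toy.loc[toy["director"] != "Unknown", "director"]

# Splits each entry into a list of names, then gives each name its own row.
individual_counts = known.str.split(", ").explode().value_counts()
print(individual_counts)
```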

### Duration distribution

In [None]:
duration_of_movies=df[df["Season/min"]=="min"]["Number duration"]
plt.figure(figsize=(12,8))
plt.hist(duration_of_movies, bins=25, color="red", ec="black")
plt.title("Duration Distribution")
plt.xlabel("Movie Duration")
plt.ylabel("Number of movies")
plt.show()

Here we display the distribution of movie runtimes in the dataset. To extract the duration values, we first filter the dataframe to include only the rows representing movies (not TV shows). Then, using the plt.hist() function with 25 bins, we plot a histogram of the durations. The figure shows that the majority of the films in the dataset have running times between 70 and 120 minutes, peaking around 90 minutes. A very small number of movies are longer than 200 minutes. This plot gives a general idea of how long Netflix movies tend to be, which can be helpful for content producers who wish to make movies that appeal to the platform's customers.

In [None]:
duration_of_tvshows=df[df["Season/min"].isin(["Season", "Seasons"])]["Number duration"] #includes shows with a single season ("1 Season")
plt.figure(figsize=(12,8))
plt.hist(duration_of_tvshows, bins=10, color="black", ec="red")
plt.title("Duration Distribution")
plt.xlabel("TV Show Duration")
plt.ylabel("Number of TV Shows")
plt.show()

Just like the previous plot, this histogram displays durations, this time of TV shows and measured in seasons. It indicates that most Netflix TV shows have between 1 and 4 seasons, with 1-2 seasons having the highest frequency. A significant percentage of TV series run for 5–10 seasons, while a much smaller proportion run for more than 10.

### Average movie duration over time

In [None]:
movies_only = df[df["Season/min"]=="min"] #keeps only the movies, whose duration is measured in minutes
duration_stats = movies_only.groupby("release_year")["Number duration"].agg(["mean", "max", "min"]) #mean, longest and shortest runtime per release year
mean_durations = duration_stats["mean"]

plt.figure(figsize=(12,8))
plt.bar(mean_durations.index, mean_durations.values, color="white", edgecolor="red")
plt.title('Average movie duration for each year')
plt.xlabel('year')
plt.ylabel('average movie duration')
plt.show()

According to the plot, the average movie runtime has varied throughout the years, generally growing since the early 2000s. Note that this analysis takes only movies into account and ignores TV shows. As a result, the plot offers only a partial view of how the duration of Netflix content has changed.

## Sentiment Analysis

Sentiment analysis is used to estimate the emotion or attitude expressed in a text. In our case, by applying it to the "description" column, we can estimate whether the content on Netflix is mainly described in positive terms or not.

For this, we use the TextBlob library. Each TextBlob object has a sentiment property with two values, polarity and subjectivity. We will focus on polarity. It is a number from -1 to 1: if it is greater than 0 the description is classified as positive, if it is less than 0 as negative, and if it equals 0 as neutral.
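
The classification rule described above can be sketched as a small function. To keep the example self-contained, the polarity scores are hypothetical placeholder numbers rather than values computed by TextBlob, and label_polarity is a name introduced for the example:

```python
import pandas as pd


def label_polarity(polarity: float) -> str:
    """Maps a polarity score in [-1, 1] to a sentiment label."""
    if polarity < 0:
        return "Negative"
    if polarity > 0:
        return "Positive"
    return "Neutral"


# Hypothetical polarity scores standing in for TextBlob output.
scores = pd.Series([-0.4, 0.0, 0.25])

# apply() calls the function once per score.
labels = scores.apply(label_polarity)
print(labels.tolist())
```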

In [None]:
for index,row in df.iterrows():
    
    d=row['description']
    d_blob=TextBlob(d)
    d_polarity=d_blob.sentiment.polarity
    
    if d_polarity<0:
        d_sentiment='Negative'
    elif d_polarity>0:
        d_sentiment='Positive'
    else:
        d_sentiment='Neutral'
    df.loc[index,'Description_Sentiment']=d_sentiment #stores the label for the current row only
    

In [None]:
d_sentiment_counts=df.groupby(["release_year", "Description_Sentiment"]).size().unstack()

d_sentiment_counts = d_sentiment_counts.loc[2000:]

colors = {'Negative': 'red', 'Neutral': 'black', 'Positive': 'grey'}

fig, ax = plt.subplots(figsize=(12, 8)) #creates the figure and the axes that the bar chart is drawn on
d_sentiment_counts.plot(kind='bar', stacked=True, color=[colors[c] for c in d_sentiment_counts.columns], ax=ax)

plt.xlabel('Year')
plt.ylabel('Number of movies')
plt.title('Sentiment Content of Netflix')
plt.show()

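As we can see, the number of titles with positive descriptions has increased significantly over the years, which suggests that the content on Netflix has become more positive. Since the number of titles per year changes as well, comparing shares rather than raw counts would be needed to confirm this.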
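
A sketch of how shares per year could be compared instead of raw counts, using pd.crosstab with normalize="index". The six rows are hypothetical and chosen so that the count of positive titles doubles while their share stays the same:

```python
import pandas as pd

# Hypothetical sentiment labels for two release years.
toy = pd.DataFrame({
    "release_year": [2019, 2019, 2020, 2020, 2020, 2020],
    "Description_Sentiment": ["Positive", "Negative", "Positive",
                              "Positive", "Negative", "Neutral"],
})

# Raw number of titles per year and sentiment.
counts = pd.crosstab(toy["release_year"], toy["Description_Sentiment"])

# normalize="index" divides each row by its total, giving shares per year.
shares = pd.crosstab(toy["release_year"], toy["Description_Sentiment"],
                     normalize="index")

print(counts)
print(shares)
```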

### Movies distribution among countries

In [None]:
countries_list = [] #creates a new list for countries
for i in df["country"]: #iterates over the values in column "country"
    if "," in i: #finds the rows which have several countries, separated by commas
        a=str(i).split(sep = ",") #splits the value into a list of countries and assigns it to a
        for b in a: #iterates over list a with separated countries
            countries_list.append(b) #appends the countries to the initial list
    else:
        countries_list.append(i) #if there are no commas, it means that only one country is mentioned, so we append that country to the list
    
new_countries_list =[x for x in countries_list if x != "MISSING" and x != ""] #deletes all the "Missing" values from countries_list and assigns to a new list
new_countries_list_2 = [x.strip() for x in new_countries_list] #since some of them come with unnecessary spaces, strip deletes those spaces

from collections import Counter

country_counts = Counter(new_countries_list_2) #creates a Counter object and assigns it to country_counts
country_counts_dict = dict(country_counts) #converts the Counter object country_counts to a regular dictionary country_counts_dict.
countries = country_counts_dict.keys() #creates a new variable called countries, which contains a list of all the unique country names in new_countries_list_2.
n_of_movies_countries = country_counts_dict.values() #creates a new variable called n_of_movies_countries, which contains a list of the number of movies produced in each country.

The Counter object counts the number of occurrences of each country in the list and returns a dictionary-like object where the keys are the countries and the values are the counts.

The keys() method is used to extract the keys (i.e., the country names) from the dictionary country_counts_dict.

The values() method is used to extract the values (i.e., the counts) from the dictionary country_counts_dict.

Together, the code above creates a dictionary in which the keys are the countries and the values show how many titles were produced in each country.
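
The same country counts can be sketched in pandas alone. This is an alternative formulation shown on hypothetical rows; unlike the loop above, it strips the spaces before filtering, so an entry that consists only of a space is dropped as well:

```python
import pandas as pd

# Hypothetical stand-in for the "country" column of df.
toy = pd.DataFrame({
    "country": ["United States, India", "India", "MISSING", "France, "],
})

# One row per country: split on commas, explode the lists, strip the spaces.
countries = toy["country"].str.split(",").explode().str.strip()

# Removes the placeholder and any empty leftovers.
countries = countries[(countries != "MISSING") & (countries != "")]

country_counts = countries.value_counts()
print(country_counts)
```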

In [None]:
import json
import folium

with open('countries.geojson') as geo_file: #opens the countries.geojson file, which contains geographic data for all the countries in the world
    geo_json_data = json.load(geo_file) #loads it as a dictionary; the file is closed automatically afterwards
df_2 = pd.DataFrame({'Country': list(countries), 'Movie Count': list(n_of_movies_countries)}) #creates a new pandas DataFrame called df_2
#Country column contains a list of all the unique country names in new_countries_list_2
#Movie Count column contains a list of the number of movies produced in each country.
map = folium.Map(location=[37.0902, -95.7129], zoom_start=2) #creates a new folium Map object, which will be used to display the choropleth map.
#"Location" sets the center of map to be United States
#"Zoom start" sets the zoom level of the map 
folium.Choropleth(
    geo_data= geo_json_data,
    name='choropleth',
    data=df_2,
    columns=['Country', 'Movie Count'],
    key_on='properties.ADMIN', #specifies the key in the geo_json_data that matches the country names in df_2
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Number of Movies', #sets the label
    highlight=True
    
).add_to(map)

map
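
A choropleth can only color countries whose names match between the data and the GeoJSON file. As a hedged sketch, the check below lists names that have no match. The GeoJSON fragment and the country names are hypothetical; whether countries.geojson actually spells a given country differently from our dataset is an assumption that should be checked against the real file:

```python
# Hypothetical fragment with the same structure as a GeoJSON FeatureCollection.
geo = {
    "features": [
        {"properties": {"ADMIN": "United States of America"}},
        {"properties": {"ADMIN": "India"}},
    ]
}

# Hypothetical country names as they might appear in the dataset.
dataset_countries = ["United States", "India"]

# Collects every country name used in the GeoJSON data.
geo_names = {feature["properties"]["ADMIN"] for feature in geo["features"]}

# Names that appear in the dataset but not in the GeoJSON data.
unmatched = sorted(set(dataset_countries) - geo_names)
print(unmatched)
```

Any name in unmatched would be left uncolored on the map until it is renamed to the spelling used in the GeoJSON file.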

# New Dataset - IMDB TOP 1000 

Our second dataset includes the top 1000 IMDB movies. We do not use it separately; instead, we use it together with our original Netflix dataset. We will first merge the two datasets to find the Netflix titles that are also listed among the top titles on IMDB, and then we will work with this new dataset. But first, we will get basic information about the second dataset, "IMDB top 1000" (df2), by using the same functions that were used for the first dataset.

## Basic information about the Dataset

In [None]:
df2.isnull().sum()

In [None]:
df2.describe()

In [None]:
df2.shape

In [None]:
df2.columns

In [None]:
df2.count()

In [None]:
df2.nunique()

In order to merge the two datasets on the title, we rename the columns of df2 so that its title column has the same name, "title", as the one in df.

In [None]:
df2.columns=['Poster_Link', 'title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross']

For now, we will change the dtype of the "Released_Year" column of df2, since it is of dtype("O"). We will convert it to Int64; the reason will become clear shortly.
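
A minimal sketch of this conversion on hypothetical values, where "PG" stands in for any entry that is not a year:

```python
import pandas as pd

# Hypothetical stand-in for the "Released_Year" column of df2.
toy = pd.DataFrame({"Released_Year": ["1994", "PG", "2010"]})

# errors="coerce" turns values that cannot be parsed into missing values;
# the nullable Int64 dtype can hold integers and missing values together.
toy["Released_Year"] = pd.to_numeric(
    toy["Released_Year"], errors="coerce"
).astype("Int64")

print(toy["Released_Year"])
```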

In [None]:
df2['Released_Year'] = pd.to_numeric(df2['Released_Year'], errors='coerce').astype('Int64') #replaces the column, so that it gets the new Int64 dtype; values that are not years become missing

In [None]:
merged_df = pd.merge(df, df2, on='title')

We merged the datasets, and at first sight everything seems fine. However, there are cases where two different movies, made by different directors in different years, have the same title. We should therefore identify these mismatched rows. We do this by comparing the "Released_Year" column of df2 with the "release_year" column of df. Since the dtypes of these two columns were different, comparing them directly could have produced mistakes. That is why we changed the dtype of "Released_Year" a few lines above.
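
The idea can be sketched on two hypothetical mini-datasets in which "Film B" is the title of two different films. This simplified check compares only the years, whereas our code also compares the directors:

```python
import pandas as pd

# Hypothetical Netflix rows.
netflix = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "release_year": [2019, 1996],
})

# Hypothetical IMDB rows; "Film B" here is a different film from 2004.
imdb = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "Released_Year": pd.array([2019, 2004], dtype="Int64"),
})

# An inner merge keeps the titles that appear in both datasets.
merged = pd.merge(netflix, imdb, on="title")

# Keeps only the rows where both datasets agree on the year.
same_film = merged[merged["release_year"] == merged["Released_Year"]]
print(same_film)
```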

In [None]:
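#rows where the director differs between the two datasets ("sxal" means "wrong")
sxal_director=merged_df[merged_df["director"]!=merged_df["Director"]]
#of those, rows where the release year differs as well: same title, but a different film
sxaltaridirectorov=sxal_director[sxal_director["release_year"]!=sxal_director["Released_Year"]]
sxaltaridirectorov[["director", "Director", "release_year", "Released_Year"]]
#drops the mismatched rows from the merged dataset
netflix_top_imdb=merged_df.drop(sxaltaridirectorov.index)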

After these steps we have the dataset we need: the Netflix titles that are included in the IMDB top 1000.

### Finding correlation between MetaScore and IMDB Rating

Both Metascore and IMDB rating are used to indicate the quality of a movie. What is the difference between them, and why should we look for a link between them? Metascore is a weighted average of critic reviews from a variety of publications, including newspapers, magazines, and online review sites, on a scale of 0-100. The IMDb (Internet Movie Database) rating, on the other hand, is given by registered users of the IMDb website. Users rate movies and TV shows on a scale of 1 to 10, and the IMDb rating is calculated as a weighted average of the user ratings. In other words, Metascore reflects the opinion of professional critics, while the IMDB rating reflects the opinion of the audience.

Now, let's find out whether the ratings of critics and users are similar. For this, we will plot a graph which shows the relation between them.

In [None]:
meta_score=netflix_top_imdb["Meta_score"]
imdb_rating=netflix_top_imdb["IMDB_Rating"]

corr_coeff = round(meta_score.corr(imdb_rating), 2)
print(f"Correlation coefficient: {corr_coeff}")
print("There is a weak positive correlation between user and critic ratings.")

plt.figure(figsize=(12,8))
sns.regplot(x=meta_score, y=imdb_rating, scatter=True, color='black', line_kws={"color": "red"})
plt.xlabel("Meta Score")
plt.ylabel("IMDB Rating")
plt.title("Correlation between user ratings and critic reviews")
plt.show()

For the highest-rated Netflix movies, the correlation between user ratings and critic reviews is calculated and shown. The correlation between the Meta_score and IMDB_Rating columns is first computed with the pandas corr() function. The coefficient, rounded to two decimal places, is then printed together with a description of the association. The plot shows that there is a weak positive correlation between the Meta Score and the IMDB Rating, so one rating cannot be reliably inferred from the other.
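
By default, corr() computes the Pearson correlation coefficient. As a sketch, the hypothetical scores below show that it agrees with the textbook formula, the covariance divided by the product of the standard deviations:

```python
import numpy as np
import pandas as pd

# Hypothetical critic scores (0-100) and user ratings (1-10).
meta = pd.Series([60, 70, 80, 90])
rating = pd.Series([7.6, 7.9, 7.8, 8.4])

# Pearson correlation as computed by pandas.
r_pandas = meta.corr(rating)

# The same coefficient from the definition.
dx = meta - meta.mean()
dy = rating - rating.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r_pandas, 2), round(r_manual, 2))
```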

### Correlation between Number of Votes and IMDB Rating

In [None]:
no_of_votes=netflix_top_imdb["No_of_Votes"]
imdb_rating=netflix_top_imdb["IMDB_Rating"]

corr_coeff = round(no_of_votes.corr(imdb_rating), 2)
print(f"Correlation coefficient: {corr_coeff}")
print("There is a moderate positive correlation between user ratings and number of their votes.")

plt.figure(figsize=(12,8))
sns.regplot(x=no_of_votes, y=imdb_rating, scatter=True, color='red', line_kws={"color": "black"})
plt.xlabel("Number of votes")
plt.ylabel("IMDB Rating")
plt.title("Correlation between user ratings and number of votes")
plt.show()

The relationship between the number of votes and the IMDB rating of the highest-rated Netflix movies and TV series is examined here. The correlation coefficient between the two variables comes out to 0.59, which indicates a moderate positive link between user ratings and the total number of votes. The link is then depicted using a scatter plot of the data points and a regression line. We can see an upward trend, which supports our theory that the IMDB rating tends to rise as the number of votes does.
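
To avoid writing the verbal description by hand, it could be derived from the coefficient. The thresholds 0.3 and 0.7 below are an assumed rule of thumb, not a fixed standard, and describe_strength is a hypothetical helper:

```python
def describe_strength(r: float) -> str:
    """Turns a correlation coefficient into a short verbal description."""
    size = abs(r)
    # Assumed rule-of-thumb thresholds; other sources draw the lines elsewhere.
    if size < 0.3:
        strength = "weak"
    elif size < 0.7:
        strength = "moderate"
    else:
        strength = "strong"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        return "no correlation"
    return f"{strength} {direction}"


print(describe_strength(0.59))
```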

### Genre distribution (in Netflix movies that are in the IMDB top 1000 movies list)

In [None]:
popular_genres = []
for index, row in netflix_top_imdb.iterrows():
    popular_genres.extend(row['Genre'].split(', '))


genres_count = pd.Series(popular_genres).value_counts()

plt.figure(figsize=(12,8))
plt.bar(genres_count.index, genres_count.values, color="red", ec="black", linewidth=2)
plt.xticks(rotation=45)
plt.xlabel('Genre')
plt.ylabel('Number of movies')
plt.title('Top genres among Netflix movies in the IMDB top 1000')
plt.show()

We use a pandas Series to count the occurrences of each genre in the list and visualize the results with a bar chart. The resulting graph shows the most common genres among the highest-rated films along with the number of films in each genre. It offers information about the tastes of the viewers and raters of these films. In this instance, we can observe that Drama, Action, Comedy, Adventure, and Crime are the most common genres.
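
The same count can also be obtained without an explicit loop. The sketch below uses invented rows (the `movies` frame is hypothetical) and assumes, as in our data, that genres are separated by a comma and a space.

```python
import pandas as pd

# Hypothetical toy rows; the real Genre column stores comma-separated genres
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "Genre": ["Drama, Crime", "Action, Drama", "Comedy"],
})

# split each string into a list, then give every genre its own row and count
genres_count = movies["Genre"].str.split(", ").explode().value_counts()
print(genres_count)
```

Note that a film with several genres is counted once per genre, so the bars add up to more than the number of films.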

### Top 10 Actors (in Netflix movies that are in the IMDB top 1000 movies list)

In [None]:
stars=netflix_top_imdb[["Star1", "Star2", "Star3", "Star4"]]
actors = []
for index, row in stars.iterrows():
    actors.append(row["Star1"])
    actors.append(row["Star2"])
    actors.append(row["Star3"])
    actors.append(row["Star4"])


actors_count = pd.Series(actors).value_counts()
top_actor_count = actors_count[:10]

plt.figure(figsize=(12,8))
plt.barh(top_actor_count.index, top_actor_count.values, color="black", ec="red", linewidth=2)
plt.xlabel("Number of movies")
plt.ylabel("Actors")
plt.yticks(fontsize=12)
plt.title('Top 10 actors in Netflix movies from the IMDB top 1000')
plt.show()

We pick the top 10 actors from the 'actors_count' Series, i.e. those who have appeared in the most top 1000 IMDb-rated movies, and plot their counts as a horizontal bar graph. The plot, made with matplotlib, shows the top 10 actors on the y-axis and the number of films they have appeared in on the x-axis. The graph indicates that Aamir Khan, Robert De Niro and Mark Ruffalo are the top 3 actors whose films most often rank in the top 1000 IMDb ratings.
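
For reference, the four star columns can also be stacked into one long column with melt() before counting. The names in `stars_toy` are placeholders, not real entries from our dataset.

```python
import pandas as pd

# Hypothetical toy frame with the same layout as the Star columns
stars_toy = pd.DataFrame({
    "Star1": ["Actor A", "Actor B", "Actor A"],
    "Star2": ["Actor C", "Actor A", "Actor D"],
})

# melt() puts all star columns into one column; value_counts() ignores missing names
actors_count_toy = stars_toy.melt(value_name="actor")["actor"].value_counts()
top_actor_toy = actors_count_toy.head(2)
print(top_actor_toy)
```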

### WordCloud of Actors' names

Word clouds again! They are beautiful, aren't they? Now let's create a word cloud of the actors' names.
The actors whose films are in the top 1000 IMDB ratings are gathered into a word cloud here. Spaces inside each name are removed first, so that every actor counts as a single word. The names that appear most frequently in the dataset are drawn in a bigger font. The result is a visual representation of the most frequently appearing actors in the top 1000 IMDB-rated Netflix films.

In [None]:
actors_array=np.array(actors)
actors_array=np.char.replace(actors_array, " ", "")


plt.figure(figsize=(12,8))
wordcloud = WordCloud(background_color='black', colormap="Reds", width=800, height=400).generate(' '.join(actors_array))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

### Collaborations between Directors and Actors

What if we want to know which famous directors work with which actors? What if we want to get some insights into their collaborations? After all, the success of a project comes from successful collaboration between the actor and the director. What we need, then, is a collaboration graph of directors and actors. However, since our data is large and contains many directors and actors, the picture would not be clear if we drew it for the whole dataset. Thus, we create the graph for the collaborations between the top 10 directors and the actors they worked with.

The collaboration between the top 10 directors and the actors they have worked with in films from the IMDB top 1000 is displayed as a network built with the NetworkX library. First, a new graph is created with G = nx.Graph(). Each director and actor is added as a node with G.add_nodes_from(), carrying a type attribute of "director" or "actor". Edges are added with G.add_edge(), each representing a collaboration between a director and an actor. The nodes are drawn with nx.draw_networkx_nodes() (directors in red, actors in pink), the edges with nx.draw_networkx_edges(), and the names with nx.draw_networkx_labels(). The result is a network of collaborations between the top 10 directors and their actors.

In [None]:
#Collaboration graph for the top 10 directors
import networkx as nx

#finding top 10 directors
top_directors2=netflix_top_imdb.groupby("Director").size()
top_directors2=top_directors2.sort_values(ascending=False) #to make it descending
top_directors2=top_directors2[0:10].index #top10
top_directors2=top_directors2.tolist() 

collab=netflix_top_imdb[netflix_top_imdb["Director"].isin(top_directors2)]
collab=collab[["Director", "Star1", "Star2", "Star3", "Star4"]]
top_actors2 = collab[['Star1', 'Star2', 'Star3', 'Star4']].values.flatten().tolist()


G = nx.Graph()
G.add_nodes_from(top_directors2, type='director')
G.add_nodes_from(top_actors2, type='actor')

for index, row in collab.iterrows():
    director = row['Director']
    for actor in row[['Star1', 'Star2', 'Star3', 'Star4']]:
        G.add_edge(director, actor)

plt.figure(figsize=(30,30))

pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos, nodelist=top_directors2, node_size=1000, node_color='red', alpha=0.8)
nx.draw_networkx_nodes(G, pos, nodelist=top_actors2, node_size=500, node_color='pink', alpha=0.8)
nx.draw_networkx_edges(G, pos, width=1, alpha=0.3)
nx.draw_networkx_labels(G, pos, font_size=10)
plt.axis('off')
plt.show()

As we can see from the graph, the red dots represent the 10 directors, while the smaller pink dots are the actors they collaborated with. An edge shows that the particular director-actor pair worked together.
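
The graph draws one edge per director-actor pair, no matter how many films the pair made together. To see which collaborations are repeated, the pairs can be counted first. This is a sketch on invented rows: `collab_toy` only mimics the layout of our `collab` frame.

```python
import pandas as pd

# Hypothetical toy rows with the same column layout as `collab`
collab_toy = pd.DataFrame({
    "Director": ["Director X", "Director X", "Director Y"],
    "Star1": ["Actor A", "Actor A", "Actor B"],
    "Star2": ["Actor B", "Actor C", "Actor C"],
})

# one row per (director, actor) appearance
pairs = collab_toy.melt(id_vars="Director", value_name="Actor")[["Director", "Actor"]]

# number of films each pair shares, most frequent pairs first
pair_counts = pairs.groupby(["Director", "Actor"]).size().sort_values(ascending=False)
print(pair_counts.head())
```

These counts could, for instance, be passed as a `weight` attribute when adding the edges, so that frequent collaborations are drawn with thicker lines.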

### Grouping the movies based on their IMDB Rating

We group the films in netflix_top_imdb by their IMDB rating and print the titles of the films in each group. The data is sorted by IMDB rating in descending order before grouping. This offers a quick way to see which movies fall into each rating category and how their IMDB ratings compare with one another.

In [None]:
# sort=False keeps the groups in the sorted (descending) order; by default groupby re-sorts the keys ascending
groups=(netflix_top_imdb.sort_values('IMDB_Rating', ascending=False)).groupby("IMDB_Rating", sort=False)
for rating, group in groups:
    print(f"Movies with IMDB rating {rating}:")
    for title in group['title']:
        print(title)
    print("\n")
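
The order in which groupby returns its groups is easy to overlook: by default the keys are sorted in ascending order, whatever the order of the rows. A minimal sketch on invented titles (the `ratings` frame is hypothetical) shows how `sort=False` keeps the highest rating first:

```python
import pandas as pd

# Hypothetical toy data
ratings = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "IMDB_Rating": [8.1, 9.0, 8.1, 8.5],
})

ordered = ratings.sort_values("IMDB_Rating", ascending=False)

# sort=False keeps groups in order of first appearance, i.e. highest rating first here
titles_by_rating = {
    rating: group["title"].tolist()
    for rating, group in ordered.groupby("IMDB_Rating", sort=False)
}
print(titles_by_rating)
```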

### Movie Search System

In [None]:
print("If you don't have particular preferences for any of those questions, please type 'skip'")
user_type = input("The type: Movie or TV Show: ")
user_dir = input("Preferred director's name: ")
user_country = input("Country: ")
user_year = input("The interval of years(e.g. 2010-2022): ")
user_duration = input("The interval of duration in minutes/seasons: ")

The user is asked to answer a few questions to make the search more effective. If they don't have a specific preference, they can type "skip".

In [None]:
user_pref_list = [user_type, user_dir, user_country, user_year, user_duration] #creates a list of all the user preferences
user_pref_list_valid = [i for i in user_pref_list if i != "skip"] #filters the list so that no "skip" value remains
length  = len(user_pref_list_valid) #calculates how many variables are stored in the list
percentage_for_1 = 100/length if length > 0 else 0 #contribution of each inputted preference (0 if everything was skipped)
df["percentage"] = 0.0 #creates a new (float) column in our dataframe, 
#which will indicate the percentage of similarity of the exact movie to the user's search
if user_type != "skip": 
    type_matches = df["type"] == user_type #sets a condition for our later searching process
    if any(type_matches): #if the user's preferred type is found, the percentage is incremented by percentage_for_1
        df.loc[type_matches, "percentage"] += percentage_for_1
        
if user_dir != "skip":
    type_matches = df["director"] == user_dir
    if any(type_matches):
        df.loc[type_matches, "percentage"] += percentage_for_1
    
if user_country != "skip":
    type_matches = df["country"] == user_country
    if any(type_matches):
        df.loc[type_matches, "percentage"] += percentage_for_1
        
#The same thing is done as in the first if else block    

if user_year != "skip":
    #the variable user_year contains an interval where the minimum and maximum years are separated by '-'
    user_year = user_year.split("-") #splits the variable user_year to obtain a list of 2 elements
    min_year = int(user_year[0]) #first element is the starting point of the interval(minimum)
    max_year = int(user_year[1]) #second element is the ending point of the interval(maximum)

    year_matches = (df["release_year"] >= min_year) & (df["release_year"] <= max_year)
    if any(year_matches):
        df.loc[year_matches, "percentage"] += percentage_for_1
        
if user_duration != "skip":
    user_duration = user_duration.split("-")
    min_dur = int(user_duration[0])
    max_dur = int(user_duration[1])
    type_matches = (df["Number duration"] >= min_dur) & (df["Number duration"] <= max_dur)
    if any(type_matches):
        df.loc[type_matches, "percentage"] += percentage_for_1
#The same is done with the duration which is also given by the interval
df_search = df.sort_values(by='percentage', ascending = False) #creates a new Data Frame, sorted with respect to Percentage


In [None]:
df_subset = df_search.head(10) #takes the first 10 rows of the sorted dataframe
#Prints out the titles of the 10 Movies/TV Shows that are most likely to correspond to the user's search
for title, percentage in zip(df_subset["title"], df_subset["percentage"]):
    print(f"The title {title} corresponds to your search by {percentage:.0f} %")

This code is part of a movie search system that matches user preferences with movies in a dataset to generate a list of recommended films. The process takes user preferences for the type of movie, director, country, release year, and duration as input and creates a list of all the user preferences. It then filters the list so that no "skip" value remains, calculates how many variables are stored in the list, and calculates how much is the contribution of each inputted preference. The system then creates a new column in a data frame, indicating the exact movie's similarity percentage to the user's search. It sets a condition for each user preference and increments the percentage column for each matched movie in the data frame. Finally, the system creates a new data frame, sorted with respect to the percentage column containing the recommended movies.
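
The scoring idea described above can be condensed into one function. The sketch below is a simplified variant under our own assumptions: the function name `score_titles`, its parameters and the toy `catalog` are hypothetical, and it is not the exact code of the cells above.

```python
import pandas as pd


def score_titles(df, exact=None, ranges=None):
    """Add a 'percentage' column: share of the given preferences each row satisfies.

    exact  : dict mapping column name -> required value
    ranges : dict mapping column name -> (minimum, maximum), both inclusive
    """
    exact = exact or {}
    ranges = ranges or {}
    n_prefs = len(exact) + len(ranges)
    scored = df.copy()
    if n_prefs == 0:  # nothing to match on, so avoid dividing by zero
        scored["percentage"] = 0.0
        return scored
    hits = pd.Series(0, index=scored.index)
    for column, value in exact.items():
        hits += (scored[column] == value).astype(int)
    for column, (low, high) in ranges.items():
        hits += scored[column].between(low, high).astype(int)
    scored["percentage"] = 100 * hits / n_prefs
    return scored.sort_values("percentage", ascending=False)


# Hypothetical toy catalogue
catalog = pd.DataFrame({
    "title": ["Show A", "Film B", "Film C"],
    "type": ["TV Show", "Movie", "Movie"],
    "release_year": [2015, 2012, 1999],
})
result = score_titles(catalog, exact={"type": "Movie"}, ranges={"release_year": (2010, 2022)})
print(result[["title", "percentage"]])
```

Here "Film B" satisfies both preferences and gets 100 %, while the other two titles satisfy one of the two and get 50 %.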

### Recommendation Modeling

In [None]:
df1 = pd.read_csv("./netflix_titles.csv")

In [None]:
#!pip install rake_nltk
import pandas as pd
import numpy as np
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from numpy import savetxt
import nltk
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
data = df2.drop(['Poster_Link', 'Released_Year', 'Certificate', 'Runtime','IMDB_Rating', 'Meta_score', 'No_of_Votes', 'Gross'],axis=1)

In [None]:
data.rename(columns = {'title':'Title'}, inplace = True)

In [None]:
data.rename(columns = {'Overview':'Plot'}, inplace = True)

In [None]:
data['Actors'] = data['Star1']+','+data['Star2']+','+data['Star3']+','+data['Star4'] #comma between every pair of names so they can be split later

In [None]:
data = data.drop(['Star1','Star2','Star3','Star4'],axis=1)

In [None]:
data.head()

The rake_nltk library is installed, the required libraries are imported, and the stopwords and punkt resources are downloaded from NLTK. A copy of the df2 dataset without the columns we do not need is stored in data. Two columns are renamed to Title and Plot, and the four actor columns are combined into a single column called Actors. The four original actor columns are then removed.
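
The separator matters when combining the actor columns, because the Actors string is later split on commas. A compact way to join several columns with the same separator is sketched below. The `cast` frame and its names are placeholders.

```python
import pandas as pd

# Hypothetical toy frame with the four star columns
cast = pd.DataFrame({
    "Star1": ["Actor A", "Actor E"],
    "Star2": ["Actor B", "Actor F"],
    "Star3": ["Actor C", "Actor G"],
    "Star4": ["Actor D", "Actor H"],
})
star_columns = ["Star1", "Star2", "Star3", "Star4"]

# join the four names of every row with a comma so they can be split apart again later
cast["Actors"] = cast[star_columns].agg(",".join, axis=1)
print(cast["Actors"].tolist())
```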

In [None]:
df1_dropped = df1.drop(['show_id', 'type', 'country', 'date_added', 'release_year', 'rating', 'duration' ],axis=1)
data_netflix = df1_dropped.dropna()
data_netflix
data_netflix.columns =['Title','Director', 'Actors', 'Genre', 'Plot']
data_netflix = data_netflix[['Title', 'Genre', 'Plot', 'Director', 'Actors']]
data_full = pd.concat([data_netflix, data])
data_full.drop_duplicates(subset='Title', inplace = True)

We drop the columns we do not need from the Netflix dataset (df1) and keep only the rows without missing values. The remaining columns are renamed and reordered to match the cleaned IMDB dataset (data), and the two datasets are concatenated. Finally, duplicate rows are removed based on the Title column. The resulting dataframe (data_full) contains the titles, genres, plots, directors, and actors from both datasets.
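
When a title occurs in both datasets, the order of the frames in concat decides which copy survives. A small sketch with invented rows (both toy frames are hypothetical) illustrates this:

```python
import pandas as pd

# Hypothetical toy frames; "Film B" appears in both
netflix_part = pd.DataFrame({"Title": ["Film A", "Film B"], "Plot": ["netflix text a", "netflix text b"]})
imdb_part = pd.DataFrame({"Title": ["Film B", "Film C"], "Plot": ["imdb text b", "imdb text c"]})

combined = pd.concat([netflix_part, imdb_part], ignore_index=True)

# keep="first" (the default) keeps the row from the frame listed first in concat
combined = combined.drop_duplicates(subset="Title").reset_index(drop=True)
print(combined)
```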

In [None]:
# to extract key words from Plot to a list
data_full['Key_words'] = ''   # initializing a new column
r = Rake()   # Rake ignores stop words and punctuation when extracting key words

def extract_key_words(plot):
    r.extract_keywords_from_text(plot)   # to extract key words
    key_words_dict_scores = r.get_word_degrees()    # dictionary with key words and their degree scores
    return list(key_words_dict_scores.keys())

# assign the whole column: writing to `row` inside iterrows() is not guaranteed to update the DataFrame
data_full['Key_words'] = data_full['Plot'].apply(extract_key_words)


We use the RAKE algorithm (via rake_nltk) to extract key words from the 'Plot' column of each row and add them to a new column called 'Key_words' in the 'data_full' DataFrame. The 'Key_words' column is first initialized as empty. For each row, the key words are then extracted from 'Plot'. Rake returns them as a dictionary that maps each key word to its degree score. The dictionary keys, i.e. the key words themselves, are stored as a list in the 'Key_words' column of that row. As a result, 'Key_words' holds the most important terms in the plot of each TV show or movie.
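
To make the shape of the result concrete without installing rake_nltk, the sketch below fills a list-valued column using a deliberately simple stand-in: it only lowercases the text and drops a few stop words. It is not RAKE, and the stop list and plots are invented for illustration.

```python
import re
import pandas as pd

# Tiny hand-written stop list for illustration only (an assumption, not NLTK's list)
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "his", "her", "is"}


def simple_key_words(plot):
    """Lowercase the text, split it into words, drop stop words (order kept, no repeats)."""
    words = re.findall(r"[a-z']+", plot.lower())
    seen = []
    for word in words:
        if word not in STOP_WORDS and word not in seen:
            seen.append(word)
    return seen


# Hypothetical plots
plots = pd.DataFrame({"Plot": ["A detective returns to his hometown.", "The heist of the century."]})

# apply() builds the whole column at once, one list per row
plots["Key_words"] = plots["Plot"].apply(simple_key_words)
print(plots["Key_words"].tolist())
```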

In [None]:
data_full['Key_words']

In [None]:
data_full.head()

In [None]:
def split_and_normalize(text, drop=()):
    # split on commas, convert to lowercase, and merge firstname & surname into one word
    names = [x.lower().replace(' ', '') for x in text.split(',')]
    for token in drop:   # remove unwanted substrings such as "tv" and "shows"
        names = [x.replace(token, '') for x in names]
    return names

# assign whole columns: writing to `row` inside iterrows() is not guaranteed to update the DataFrame
data_full['Genre'] = data_full['Genre'].map(lambda x: split_and_normalize(x, drop=('tv', 'shows')))
data_full['Actors'] = data_full['Actors'].map(split_and_normalize)
data_full['Director'] = data_full['Director'].map(split_and_normalize)

The Genre, Actors, and Director columns of the data_full dataframe are split on commas into lists of strings. All strings are then converted to lowercase and their spaces are removed, so that each first name and surname merge into one word. In Genre, the substrings "tv" and "shows" are also removed.

In [None]:
data_full

In [None]:
columns = ['Genre', 'Director', 'Actors', 'Key_words']

# join the words of all four list columns into one string per row
# (assigned as a whole column: writing to `row` inside iterrows() is not guaranteed to update the DataFrame)
data_full['Bag_of_words'] = data_full[columns].apply(
    lambda row: ' '.join(' '.join(row[col]) for col in columns), axis=1)

# strip white spaces in front and behind, collapse repeated whitespace (if any)
data_full['Bag_of_words'] = data_full['Bag_of_words'].str.strip().str.replace(r'\s+', ' ', regex=True)

data_full_final = data_full[['Title','Bag_of_words']].reset_index(drop=True)

The words from the Genre, Director, Actors, and Key_words columns, which were split, lowercased and stripped of spaces in the previous steps, are concatenated into a single string for each row and stored in a new column named Bag_of_words. Excess whitespace is then removed from Bag_of_words, and a new dataframe named data_full_final is created that contains only the Title and Bag_of_words columns.
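
What a single bag of words looks like can be seen on one invented row. The values below, including the director name "janedoe", are placeholders.

```python
import pandas as pd

# Hypothetical single row with list-valued columns, as produced by the previous steps
features = pd.DataFrame({
    "Genre": [["drama", "crime"]],
    "Director": [["janedoe"]],
    "Actors": [["actora", "actorb"]],
    "Key_words": [["detective", "hometown"]],
})
columns = ["Genre", "Director", "Actors", "Key_words"]


def to_bag(row):
    # flatten the four lists into one space-separated string
    return " ".join(word for col in columns for word in row[col])


features["Bag_of_words"] = features.apply(to_bag, axis=1)
print(features["Bag_of_words"][0])
```

Because names were merged into single words earlier, "janedoe" is one token and cannot be confused with another person who shares only a first name or surname.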

In [None]:
data_full

In [None]:
data_full_final['Bag_of_words'][:10].tolist()

In [None]:
data_full_final

In [None]:
count = CountVectorizer()
count_matrix = count.fit_transform(data_full_final['Bag_of_words'])

cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
count_matrix

In [None]:
indices = pd.Series(data_full_final['Title'])

In [None]:
indices.tolist()

This creates a pandas Series named indices that holds the title of every TV show and movie in the dataset. Calling .tolist() converts it into a list for display.

In [None]:
data_full_final

In [None]:
def recommend(title, cosine_sim = cosine_sim):
    recommended_movies = []
    matches = indices[indices == title].index   # positions of the titles equal to the input title
    if len(matches) == 0:   # unknown title: nothing to recommend
        return recommended_movies
    idx = matches[0]
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)   # similarity scores in descending order
    top_5_indices = list(score_series.iloc[1:6].index)   # to get the indices of the top 5 most similar movies
    # [1:6] skips position 0, which is normally the input movie itself

    for i in top_5_indices:   # to append the titles of the top 5 similar movies to the recommended_movies list
        recommended_movies.append(data_full_final['Title'].iloc[i])

    return recommended_movies

We write a function named recommend, which accepts a movie title and a cosine similarity matrix (cosine_sim) as inputs and returns a list of 5 movies that are most similar to the given film. The function begins by looking up the index of the input movie in indices, a pandas Series that contains the title of every movie in the dataset. The row of the cosine_sim matrix at that index holds the similarity between the input movie and every other movie, and it is turned into a Series of similarity scores. The scores are sorted in descending order with sort_values(). Next, iloc[1:6] selects the second through sixth items of the sorted scores (positions 1 to 5), which gives the indices of the five most similar movies apart from the input movie itself. A for loop then appends the titles of these 5 movies to the recommended_movies list, which is finally returned.
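
The whole chain, from bags of words to a recommendation, can be followed on a four-film toy catalogue. In this sketch the word counts and the cosine similarity are computed by hand with NumPy as stand-ins for CountVectorizer and cosine_similarity. The titles, bags and the function name `recommend_toy` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy catalogue
toy = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C", "Film D"],
    "Bag_of_words": [
        "action hero city battle",
        "action hero space battle",
        "romance paris letters",
        "romance city letters",
    ],
})

# word-count vectors built by hand (a stand-in for CountVectorizer)
vocabulary = sorted({word for bag in toy["Bag_of_words"] for word in bag.split()})
counts = np.array(
    [[bag.split().count(word) for word in vocabulary] for bag in toy["Bag_of_words"]],
    dtype=float,
)

# cosine similarity = dot product of the length-normalised vectors
unit = counts / np.linalg.norm(counts, axis=1, keepdims=True)
similarity = unit @ unit.T


def recommend_toy(title, top_n=2):
    matches = toy.index[toy["Title"] == title]
    if len(matches) == 0:
        return []  # unknown title: return nothing instead of raising
    idx = matches[0]
    order = np.argsort(-similarity[idx], kind="stable")  # most similar first
    order = [i for i in order if i != idx][:top_n]       # drop the input film itself
    return toy["Title"].iloc[order].tolist()


print(recommend_toy("Film A"))  # ['Film B', 'Film D']
```

"Film B" shares three of its four words with "Film A" and comes first, "Film D" shares only "city", and "Film C" shares nothing and is left out.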

In [None]:
recommend('Captain America: Civil War')

In [None]:
recommend('Once Upon a Time in America')

# Conclusion

1. The Netflix dataset consists of 8807 rows and 12 columns.
2. Netflix has 6131 movies and 2676 TV shows.
3. Movies make up 69.6% of the catalogue and TV shows 30.4%.
4. The largest number of movies on Netflix were released in 2018.
5. The fewest movies available on Netflix are from 1961, 1959, 1925, 1966 and 1947.
6. Starting from 2011, the number of movies released yearly increased drastically. However, it started to decline after 2018 and kept declining until the latest year available in this dataset, 2021.
7. Most Netflix movies and TV series are rated for mature audiences (TV-MA), with TV-14 following closely behind. Very few movies/TV shows have the ratings NC-17, UR or TV-Y7-FV.
8. The words most frequently used in the titles of Netflix movies/TV shows are Love, Life, Girl, Christmas and World.
9. The director with the greatest number of movies is Rajiv Chilaka.
10. The majority of the dataset's films have running times between 70 and 120 minutes, peaking around 90 minutes. There are a very small number of movies that are longer than 200 minutes, but they do exist.
11. Most Netflix TV shows have 1 to 4 seasons, with 1 or 2 seasons being the most frequent. A significant percentage of TV series run for 5 to 10 seasons.
12. Movies released in 1960-1967 have the greatest durations among those available on Netflix. Even though Netflix did not exist at that time, its platform contains movies and shows from those years.
13. Over the years, the number of movies with positive content increased in comparison with movies with neutral and negative content.
14. The countries where most Netflix movies/TV shows were produced are the United States (3690), India (1046) and the United Kingdom (806).
15. There are 144 Netflix movies that are included in the IMDB top 1000 movies list.
16. There is a weak positive correlation between IMDB Ratings and MetaScores. That means that, given the MetaScore, we cannot reliably guess the IMDB Rating.
17. The actors with the greatest number of movies are Aamir Khan, Leonardo DiCaprio and Mark Ruffalo. (This information is demonstrated using both the word cloud and the bar plot.)
18. We visualized the collaborations between the top 10 directors and the actors they worked with as a network graph.
19. We created a mechanism for filtering movies based on users' preferences (type, duration, director, country, etc.).
20. We created a recommendation system which helps people find movies that are similar to their favourite movies.
