# Analysis of the Netflix catalogue
## Visualization, data mining and prediction of data
This notebook contains the results of using the aforementioned techniques to understand how the Netflix catalogue has evolved over the years, and to represent the data in an understandable manner.

Most of the Python code that does the heavy lifting, i.e. data sanitizing and number crunching, resides in separate .py files in order to keep the notebook clean. The notebook itself only features code that is relevant for plots and other visualizations.



# Python setup
We recommend opening this notebook using JupyterLab in order to be able to view all the interactive plots.

NB! Please run the cell below only once. The change-directory command moves the notebook's working directory one level up the directory tree each time it runs, so running it again makes the file paths defined in the code disagree with the current directory of the notebook. Restart the kernel if an IOError occurs.
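A possible way to make the directory change safe to repeat is to look for the project root instead of moving up one level blindly. The sketch below is only an illustration under assumptions: the helper name `find_project_root` and the idea of using the `src` folder as a marker are hypothetical and not part of the project code.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Hypothetical usage inside the notebook (assumes <root>/src/main.py exists):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the search starts at the current directory itself, calling it a second time from the project root returns the same directory, so the cell could be re-run without restarting the kernel.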



In [None]:
# Only run this cell one time. Restart kernel and run again to fix IOError.
import os
import yrs_months

# Changes the notebook working directory one level up.
%cd ..

# Running main python script.
%run -i "src/main.py"

# Data set from main.py
data_set

# Chapter 1. Genre analysis

## Most Popular movie and series genres
### Discarding the 'movie' and 'tv show' entries in the genre list

In [None]:
# Import packages needed for visualization
import datetime as dt
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt 
import seaborn as sns
from wordcloud import WordCloud
import chord # Need to install - pip install chord

In [None]:
# Split genres into list on comma and put each item on separate line
genres = data_set['listed_in'].dropna().str.split(', ').explode().copy()

Use a word cloud to visualize the most frequent genres in the Netflix library.

In [None]:
# Make word cloud using frequency of genres.
plt.subplots(figsize=(10,10))
wordcloud = WordCloud(
                          background_color='Black',
                          width=1920,
                          height=1080
                         ).generate_from_frequencies(genres.value_counts())
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
netflix_data = data_set
order =  sorted(netflix_data.release_year.unique())[-15:-1]
plt.figure(figsize=(15,7))
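# Pass the columns by keyword; newer seaborn versions do not accept the data as a positional argument
g = sns.countplot(x="release_year", hue="type", data=netflix_data, order=order, palette="pastel")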
plt.title("Movies vs TV-Shows released on Netflix")
plt.xlabel("Production year")
plt.ylabel("Total Count")
plt.show()

In [None]:
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(x="rating", data=netflix_data, palette="Set2", order=netflix_data['rating'].value_counts().index[0:15])

In [None]:
rating_order =  ['G', 'TV-Y', 'TV-G', 'PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG', 'PG-13', 'TV-14', 'R', 'NC-17', 'TV-MA']

movie_rating = netflix_data['rating'].value_counts()
#tv_rating = tv_show['rating'].value_counts()[rating_order].fillna(0)

In [None]:
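# NOTE: temp2 is created in the cell with the `replacements` dictionary further down; run that cell first.
temp3 = temp2.explode('listed_in')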
temp3 = temp3.rename(columns={'listed_in': 'Genre', 'rating': 'PG-Rating'})
df = temp3.groupby(['Genre','PG-Rating']).size().unstack(fill_value=0)
df = df[df > 0].fillna(0)

df[1:]

In [None]:
sns.heatmap(df[1:]).set_title("Genre vs PG Rating")

In [None]:
replacements = {
        "& Talk": "",
        "Classic & Cult": "Classic, Cult",
        "Features": "",
        "Series": "",
        "Comedy": "Comedies",
        "British": "International",
        "Spanish-Language": "International",
        "Children & Family": "Kids'",
        "TV Shows": "",
        "Movies": "",
        "Docuseries": "Documentaries",
        "& Talk Shows": "",
        "Stand-Up": "",
        "TV": "",
        "Shows": "",
        " ": "",
    }

temp2 = netflix_data.copy()
temp2.listed_in = temp2.listed_in.replace(replacements, regex=True).str.split(',').apply(lambda x: [i for i in x if i != 'International'])

We want to plot the genres and the total count of each genre, separated into movies and TV shows.
We decided to remove the categories International Movies and International TV Shows, as these are overrepresented in the data set: they are given to all titles not from the US. This category is always coupled with another genre and is therefore not seen as one of the main genres.
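The split-and-count step with the excluded categories can be sketched in plain Python. This is a minimal illustration on made-up rows, not the notebook's pandas code; the sample strings and the helper name `count_genres` are assumptions.

```python
from collections import Counter

# Hypothetical rows mimicking the comma-separated 'listed_in' column
listed_in = [
    "Dramas, International Movies",
    "Comedies, Dramas, International Movies",
    "Documentaries",
]

# Categories that say where a title comes from rather than what genre it is
EXCLUDED = {"International Movies", "International TV Shows"}


def count_genres(rows, excluded=EXCLUDED):
    """Split each genre string on ', ' and count every genre not in `excluded`."""
    counts = Counter()
    for row in rows:
        for genre in row.split(", "):
            if genre not in excluded:
                counts[genre] += 1
    return counts


genre_counts = count_genres(listed_in)
print(genre_counts.most_common())
```

Filtering by name, rather than dropping the last entry of a sorted index, keeps working even if the international category stops being the most frequent one.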

In [None]:
# Extract movie genres
genres_movies = data_set[data_set["type"]=="Movie"]['listed_in'].dropna().str.split(', ').explode().copy()

In [None]:
# Plot bar plot of all movie genres
plt.figure(figsize=(12,6))
sns.countplot(y=genres_movies, order=genres_movies.value_counts(ascending=True).index[:-1]) #removed international movies
plt.title("Movies by Genre")
plt.ylabel("Genre")
plt.xlabel("Total Count")

In [None]:
#Extract TV genres
genres_tv = data_set[data_set["type"]=="TV Show"]['listed_in'].dropna().str.split(', ').explode().copy()

In [None]:
# Plot TV genre count
plt.figure(figsize=(12,6))
sns.countplot(y=genres_tv, order=genres_tv.value_counts(ascending=True).index[:-1]) # Removed international TV shows
plt.title("TV Shows by Genre")
plt.ylabel("Genre")
plt.xlabel("Total Count")

In [None]:
plt.figure(figsize = (15,15))
plt.pie(
    [genre_value for genre_value in pop_movie_genre.values()],
    labels=[genre_keys for genre_keys in pop_movie_genre.keys()],
    autopct=None
)
plt.show()

In [None]:
plt.figure(figsize = (15,15))
plt.pie(
    [genre_value for genre_value in pop_series_genre.values()],
    labels=[genre_keys for genre_keys in pop_series_genre.keys()],
    autopct=None
)
plt.show()


Look at the 5 most frequent movie and TV genres and plot them against the year added to see if there are any patterns.
International TV Shows and International Movies have again been removed.

In [None]:
genre_time = data_set[['date_added','listed_in']].copy()
genre_time = genre_time[genre_time['date_added'] != 'Unknown date_added']
genre_time['month_added'] = genre_time['date_added'].str.replace(',', '').str.lstrip().apply(lambda x: dt.datetime.strptime(x,'%B %d %Y')).dt.month_name()
genre_time['year_added'] = genre_time['date_added'].str.replace(',', '').str.lstrip().apply(lambda x: dt.datetime.strptime(x,'%B %d %Y')).dt.year
#year_released = genre_time['date_added']
genre_time['listed_in'] = genre_time['listed_in'].str.split(', ')
genre_time = genre_time.explode('listed_in')
#print(genre_time)

filter_list_m = ['Dramas', 'Comedies', 'Documentaries', 'Action & Adventure', 'Independent Movies']
filter_list_tv = ["TV Dramas", "TV Comedies", "Crime TV Shows", "Kids' TV", "Docuseries"]
top_m_genres = genre_time[genre_time.listed_in.isin(filter_list_m)]
top_tv_genres = genre_time[genre_time.listed_in.isin(filter_list_tv)]


fig, axes = plt.subplots(1, 2, figsize=(15, 8))
#fig.suptitle("Movies TV Shows added to Netflix by Year for top 5 genres")
fig.tight_layout()

sns.countplot(ax=axes[0], x="year_added", hue="listed_in" ,data=top_m_genres, palette="pastel")
axes[0].set_title("Movies")
axes[0].set_xlabel("Year added")
axes[0].set_ylabel("Total count")
axes[0].legend(loc=2)

sns.countplot(ax=axes[1], x="year_added", hue="listed_in" ,data=top_tv_genres, palette="pastel")
axes[1].set_title("TV Shows")
axes[1].set_xlabel("Year added")
axes[1].set_ylabel("")
axes[1].legend(loc=2)


Plot a heatmap of genre and year added to Netflix to see if there are any patterns in which genres have been popular over time.

In [None]:
# Create a table from the dataset with year added and genre
year_genre = genre_time[["year_added", "listed_in"]]

# Group by genre and count how many titles were added each year
group_y = year_genre.groupby("listed_in")
group_y = group_y['year_added'].value_counts() # count values per year
group_y = group_y.unstack() 
group_y = group_y.fillna(0) # fill NaNs with 0

# Check table
group_y

In [None]:
# Plot heatmap (all genres are included here, and the columns are years)
plt.figure(figsize=(12,10))
sns.heatmap(group_y, cmap="Greens")
plt.title("Heatmap of genres and year added to Netflix")
plt.ylabel("Genre")
plt.xlabel("Year added to Netflix")

Heatmap of the month added to Netflix for the 15 most frequent genres.
We want to check if there is a pattern in when during the year the genres are added to Netflix.
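The month-level view relies on two small steps: parsing the month out of each date string, and ordering the months by calendar rather than alphabetically. A minimal sketch on hypothetical date strings, assuming an English locale for the month names, could look like this:

```python
import calendar
import datetime as dt
from collections import Counter

# Hypothetical 'date_added' values; some have leading whitespace
date_added = [" September 9, 2019", "January 1, 2020", "September 25, 2018"]


def month_name(date_string):
    """Parse a 'Month day, Year' string and return the full month name."""
    parsed = dt.datetime.strptime(date_string.strip(), "%B %d, %Y")
    return parsed.strftime("%B")


month_counts = Counter(month_name(d) for d in date_added)

# list(calendar.month_name)[1:] gives the months in calendar order
# (index 0 is an empty string)
calendar_months = list(calendar.month_name)[1:]
ordered = [(m, month_counts.get(m, 0)) for m in calendar_months]
print(ordered)
```

Months without any titles still appear with a count of 0, which keeps the heatmap columns complete.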

In [None]:
# Create a table from the dataset with month added and genre
month_genre = genre_time[["month_added", "listed_in"]]

# Get most popular genres by value_counts() and only select these from data set
popular_genres = data_set.listed_in.str.split(', ').explode().value_counts().index[:17]
month_genre = month_genre[month_genre.listed_in.isin(popular_genres)]

# Remove International TV Shows and International Movies, as these categories do not give the genre of the title, only that it was not made in the US. They are overrepresented in the dataset and not that interesting.
month_genre = month_genre[month_genre.listed_in != 'International TV Shows']
month_genre = month_genre[month_genre.listed_in != 'International Movies']

# Group month added and genre and make table with value counts
group = month_genre.groupby("listed_in")
group = group['month_added'].value_counts() #count values in month
group = group.unstack() 
group = group.fillna(0) #fill nans with 0

# Reindex to sort months by calendar and not alphabetically
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
group = group.reindex(columns=months) #sort months according to calendar

# Check table
group

In [None]:
# Plot heatmap
plt.figure(figsize=(12,8))
sns.heatmap(group, cmap="Greens")
plt.title("Heatmap of 15 most popular genres and month added to Netflix")
plt.ylabel("Genre")
plt.xlabel("Month added to Netflix")

Check if there has been a change in the number of international movies and TV shows added.

In [None]:
filter_list_int = ["International TV Shows", "International Movies"]
top_int = genre_time[genre_time.listed_in.isin(filter_list_int)]

plt.figure(figsize=(12,6))
sns.countplot(x="year_added", hue="listed_in" ,data=top_int, palette="pastel")
plt.legend(loc=2)
plt.xlabel("Year")

Check most frequent cast in the most frequent movie genres. 

In [None]:
genre_cast = data_set[['cast','listed_in']].copy()
genre_cast = genre_cast[genre_cast['cast'] != 'Unknown cast']
genre_cast['cast'] = genre_cast['cast'].str.split(', ')
genre_cast = genre_cast.explode('cast')

m_per_cast = genre_cast['cast'].value_counts()

genre_cast['listed_in'] = genre_cast['listed_in'].str.split(', ')
genre_cast = genre_cast.explode('listed_in')

top_comedy_cast = genre_cast[genre_cast['listed_in'] == 'Comedies']['cast']

top_action_cast = genre_cast[genre_cast['listed_in'] == 'Action & Adventure']['cast']

top_thriller_cast = genre_cast[genre_cast['listed_in'] == 'Thrillers']['cast']
top_drama_cast = genre_cast[genre_cast['listed_in'] == 'Dramas']['cast']

fig, axes = plt.subplots(2, 2, figsize=(20, 8))
#fig.suptitle("Movies TV Shows added to Netflix by Year for top 5 genres")
#fig.tight_layout()

sns.countplot(ax=axes[0,0], y=top_comedy_cast, palette="pastel", order=top_comedy_cast.value_counts(ascending=False).index[:10])
axes[0,0].set_title("Comedies")
axes[0,0].set_xlabel("")

sns.countplot(ax=axes[0,1], y=top_action_cast, palette="pastel", order=top_action_cast.value_counts(ascending=False).index[:10])
axes[0,1].set_title("Action & Adventure")
axes[0,1].set_xlabel("")
axes[0,1].set_ylabel("")

sns.countplot(ax=axes[1,0], y=top_drama_cast, palette="pastel", order=top_drama_cast.value_counts(ascending=False).index[:10])
axes[1,0].set_title("Dramas")
axes[1,0].set_xlabel("No. of Movies")

sns.countplot(ax=axes[1,1], y=top_thriller_cast, palette="pastel", order=top_thriller_cast.value_counts(ascending=False).index[:10])
axes[1,1].set_title("Thrillers")
axes[1,1].set_xlabel("No. of Movies")
axes[1,1].set_ylabel("")


Check most frequent directors in most frequent movie genres

In [None]:
genre_director = data_set[['director','listed_in']].copy()
genre_director = genre_director[genre_director['director'] != 'Unknown director']
genre_director['director'] = genre_director['director'].str.split(',')
genre_director = genre_director.explode('director')
genre_director['director'] = genre_director['director'].str.strip()

m_per_dir = genre_director['director'].value_counts()

genre_director['listed_in'] = genre_director['listed_in'].str.split(',')
genre_director = genre_director.explode('listed_in')
genre_director['listed_in'] = genre_director['listed_in'].str.lstrip()


top_comedy_dir = genre_director[genre_director['listed_in'] == 'Stand-Up Comedy']['director']

top_action_dir = genre_director[genre_director['listed_in'] == 'Action & Adventure']['director']

top_thriller_dir = genre_director[genre_director['listed_in'] == 'Thrillers']['director']
top_doc_dir = genre_director[genre_director['listed_in'] == 'Documentaries']['director']
top_drama_dir = genre_director[genre_director['listed_in'] == 'Dramas']['director']

fig, axes = plt.subplots(2, 2, figsize=(20, 8))
#fig.suptitle("Movies TV Shows added to Netflix by Year for top 5 genres")
#fig.tight_layout()

sns.countplot(ax=axes[0,0], y=top_comedy_dir, palette="pastel", order=top_comedy_dir.value_counts(ascending=False).index[:10])
axes[0,0].set_title("Stand-Up Comedy")
axes[0,0].set_xlabel("")

sns.countplot(ax=axes[0,1], y=top_action_dir, palette="pastel", order=top_action_dir.value_counts(ascending=False).index[:10])
axes[0,1].set_title("Action & Adventure")
axes[0,1].set_xlabel("")
axes[0,1].set_ylabel("")

sns.countplot(ax=axes[1,0], y=top_drama_dir, palette="pastel", order=top_drama_dir.value_counts(ascending=False).index[:10])
axes[1,0].set_title("Dramas")
axes[1,0].set_xlabel("No. of Movies")

sns.countplot(ax=axes[1,1], y=top_doc_dir, palette="pastel", order=top_doc_dir.value_counts(ascending=False).index[:10])
axes[1,1].set_title("Documentaries")
axes[1,1].set_xlabel("No. of Movies")
axes[1,1].set_ylabel("")

# Classifying directors from heatmap

### Creating director - genre matrix

In [None]:
populated = director_classification.populate_director_genre_dataframe()

In [None]:
import seaborn as sns

# Work on a copy so the populated DataFrame is not mutated directly,
# as it takes some time to create it.
director_genre = populated.copy()

# Drop director columns with fewer than 8 registered movies in total,
# in addition to the 'Unknown director' column.
to_drop = [d for d in director_genre.columns
           if director_genre[d].sum() < 8 or d == 'Unknown director']
director_genre = director_genre.drop(columns=to_drop)

sns.heatmap(director_genre) # Creating heatmap
plt.show()

From this heatmap, which contains the most active directors in the Netflix catalogue, we can "classify" which genre a certain director belongs to. The heatmap shows that "Stand-Up Comedy" is the genre in which the most active directors work, and that Jan Suter is a "Stand-Up Comedy" director. McG can with some certainty be classified as an "Action & Adventure" director.
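Reading the dominant genre off the heatmap can also be expressed as a small rule: pick the genre with the highest count for each director and report how large a share of the director's titles it covers. The sketch below uses made-up counts and a hypothetical helper name, and is not taken from `director_classification`.

```python
# Hypothetical director -> {genre: number of titles} counts
director_genre_counts = {
    "Director A": {"Stand-Up Comedy": 18, "Documentaries": 1},
    "Director B": {"Action & Adventure": 5, "Comedies": 4, "Dramas": 3},
}


def dominant_genre(genre_counts):
    """Return (genre, share) for the genre with the highest count."""
    total = sum(genre_counts.values())
    genre = max(genre_counts, key=genre_counts.get)
    return genre, genre_counts[genre] / total


for director, counts in director_genre_counts.items():
    genre, share = dominant_genre(counts)
    print(f"{director}: {genre} ({share:.0%} of titles)")
```

The share makes the "with some certainty" part explicit: a director whose top genre covers most of their titles is a clearer case than one whose titles are spread over several genres.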

# Genres added per year (Sander)
## The following plots show how often each genre was added per year

In [None]:
import pandas as pd
import numpy as np
import correlation_between_genres
import matplotlib.pyplot as plt
import chord

# read_csv already returns a DataFrame
df = pd.read_csv('netflix_titles.csv')
dfMovies = df[df['type'] == 'Movie']
dfSeries = df[df['type'] == 'TV Show']

# Handling null values
### Number of movies and series that are missing the date they were added

In [None]:
missingMovieDates = len(dfMovies[dfMovies['date_added'].isnull()])
missingSeriesDates = len(dfSeries[dfSeries['date_added'].isnull()])
print('Total number of movies with an unknown added date: ', missingMovieDates)
print('Total number of series with an unknown added date: ', missingSeriesDates)

In [None]:
def genresAddedPerYear(df: pd.DataFrame) -> pd.DataFrame:
    # Gets the dates from the DataFrame and converts the format to datetime
    dates = pd.to_datetime(df['date_added'])
    # Removes day and month, as we are only interested in the year
    dates = dates.dt.year

    # Splits the listed_in column into individual genre columns
    genres = correlation_between_genres.genresOfMoviesSeries(df)

    # Puts dates and genres into one table
    genreAdded = genres.copy()
    genreAdded.insert(0, 'date_added', dates)

    # Name of every genre
    uniqueGenres = correlation_between_genres.totalOccurenceOfGenres(genres).keys().tolist()
    # Every year in which something was added, without missing values, in ascending order
    uniqueYears = np.sort(genreAdded['date_added'].dropna().unique())

    counts = pd.DataFrame(0, index=uniqueGenres, columns=uniqueYears)

    for i, movie in genreAdded.iterrows():
        yearAdded = movie['date_added']
        genresOfMovie = movie.iloc[1:4]

        for genre in genresOfMovie:
            # Skip titles with a missing date and empty genre slots
            if pd.isna(yearAdded) or pd.isna(genre):
                continue
            # .loc avoids chained assignment, which may silently fail to update the table
            counts.loc[genre, yearAdded] += 1

    return counts

### DataFrame showing how many movies/series with a given genre were added per year. Note that a movie/series may have multiple genres, so the sum of each column is not the same as the number of movies/series added that year.
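The difference between genre tags per year and titles per year can be seen on a small made-up example. The titles below are hypothetical and the counting is done in plain Python rather than with the notebook's helper functions.

```python
from collections import Counter

# Hypothetical titles: (year added, list of genres)
titles = [
    (2018, ["Dramas", "Comedies"]),
    (2018, ["Dramas"]),
    (2019, ["Documentaries", "Dramas", "Comedies"]),
]

genre_year_counts = Counter()   # (genre, year) -> number of titles with that genre
titles_per_year = Counter()     # year -> number of titles

for year, genres in titles:
    titles_per_year[year] += 1
    for genre in genres:
        genre_year_counts[(genre, year)] += 1

for year in sorted(titles_per_year):
    tags = sum(n for (_, y), n in genre_year_counts.items() if y == year)
    print(year, "titles:", titles_per_year[year], "genre tags:", tags)
```

Each year's tag total is at least as large as its title count, because every title contributes one count per genre it carries.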

In [None]:
genresAddedPerYearSeries = genresAddedPerYear(dfSeries)
genresAddedPerYearMovies = genresAddedPerYear(dfMovies)

### Removing the data for the year 2020, as the year is not over yet and the data is therefore not representative

In [None]:
genresAddedPerYearSeries = genresAddedPerYearSeries.drop(2020, axis = 'columns')
genresAddedPerYearMovies = genresAddedPerYearMovies.drop(2020, axis = 'columns')

### Displaying DataFrames

In [None]:
display(genresAddedPerYearSeries)
display(genresAddedPerYearMovies)

# Visualizing the data

In [None]:
movieGenres = genresAddedPerYearMovies.index
seriesGenres = genresAddedPerYearSeries.index

fig, (ax1, ax2) = plt.subplots(1, 2, )
fig.subplots_adjust(right = 2, top = 1)

ax1.set_xlabel('Years')
ax1.set_ylabel('Added to genre')
ax1.set_title('Movies added to genre per year', fontweight = 'bold')

ax2.set_xlabel('Years')
ax2.set_ylabel('Added to genre')
ax2.set_title('Series added to genre per year', fontweight = 'bold')



for genre in movieGenres:
    ax1.plot(genresAddedPerYearMovies.loc[genre], marker = '.')

for genre in seriesGenres:
    ax2.plot(genresAddedPerYearSeries.loc[genre], marker = '.')

ax1.legend(movieGenres, loc = 2, fontsize = 5)
ax2.legend(seriesGenres, loc = 2, fontsize = 5)
plt.show()

We can tell from these two plots that the amount of new content added has increased every year, and that the growth started to spike around 2015. We can also tell that no new series were added in the years between 2008 and 2012. In total, 11 movies/series are missing an added date, and 10 of them are series. Some or all of these might have been added in the years between 2008 and 2012, but we cannot know.

## Combinations of genres
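The combination counts in this section amount to a genre co-occurrence matrix: for every pair of genres, count the titles that carry both. A minimal sketch on hypothetical titles, using only the standard library instead of the `correlation_between_genres` helpers, is shown here to illustrate the idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical genre lists, one per title
title_genres = [
    ["Dramas", "Comedies"],
    ["Dramas", "Comedies", "Romantic Movies"],
    ["Dramas", "Thrillers"],
]

pair_counts = Counter()
for genres in title_genres:
    # sorted() makes ("Comedies", "Dramas") and ("Dramas", "Comedies") the same key
    for pair in combinations(sorted(set(genres)), 2):
        pair_counts[pair] += 1

# Square, symmetric matrix with zeros on the diagonal
names = sorted({g for genres in title_genres for g in genres})
matrix = [
    [0 if a == b else pair_counts[tuple(sorted((a, b)))] for b in names]
    for a in names
]

print(names)
for row in matrix:
    print(row)
```

A list of lists plus a list of names is the same shape of input that the chord diagram cells in this section build from their matrix.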

In [None]:
def genreCombos(df: pd.DataFrame) -> pd.DataFrame:
    # Genre of every movie/series
    genres = correlation_between_genres.genresOfMoviesSeries(df)

    # How many occurrences a genre needs to have to be included in the data
    TOTAL_OCCURENCE_THRESHOLD = 100

    # How many times a genre occurs
    genreOccurence = correlation_between_genres.totalOccurenceOfGenres(genres)

    # Only keeping genres that have an occurrence higher than the threshold
    genreOccurence = genreOccurence[genreOccurence > TOTAL_OCCURENCE_THRESHOLD]

    # Cross-section between genres
    corrMatrix = pd.DataFrame(index = genreOccurence.keys(), columns = genreOccurence.keys())

    # Filling corrMatrix with values
    for genre1 in corrMatrix.keys():
        # Titles that have genre1; computed once per column instead of once per cell
        mainGenre = correlation_between_genres.moviesSeriesWithGenre(genres, genre1)
        for genre2 in corrMatrix.keys():
            if genre1 == genre2:
                # .loc[row, column] avoids chained assignment
                corrMatrix.loc[genre2, genre1] = 0
            else:
                genreCombination = correlation_between_genres.moviesSeriesWithGenre(mainGenre, genre2)
                corrMatrix.loc[genre2, genre1] = len(genreCombination)

    return corrMatrix

### Combination of genres in movies
NB! To show this plot open in JupyterLab.

In [None]:
moviesGenreCombos = genreCombos(dfMovies)

# Converts the cross-section matrix to a list, as it is needed in the next step.
genreCombinationValues = moviesGenreCombos.values.tolist()
genreNames = moviesGenreCombos.index.tolist()

# NOTE: Requires to be run in jupyter lab as the plot won't show in notebook.
chord.Chord(genreCombinationValues, genreNames, margin=80, font_size_large='10px').show()

### Combination of genres in series
NB! To show this plot open in JupyterLab.

In [None]:
seriesGenreCombos = genreCombos(dfSeries)

# Converts the cross-section matrix to a list, as it is needed in the next step.
genreCombinationValues = seriesGenreCombos.values.tolist()
genreNames = seriesGenreCombos.index.tolist()

# NOTE: Requires to be run in jupyter lab as the plot won't show in notebook.
chord.Chord(genreCombinationValues, genreNames, margin=80, font_size_large='10px').show()

## Chapter 1.2 - Patterns in genres (Aleksander)

In [None]:
#Work with desired data
df = data_set[["type","date_added"]].copy()
df = yrs_months.valid_dates(df)
df

In [None]:
#Get month and year columns
df = yrs_months.create_month_column(df)
df = yrs_months.create_year_column(df)
df

In [None]:
months = yrs_months.all_months()
years = yrs_months.all_years()

In [None]:
df = yrs_months.create_table(df)

In [None]:
df

In [None]:
yrs_months.heatmap(df,title="Content update per month and year",xlab="Month",ylab="Year")

In [None]:
#Christmas
df_xmas = data_set[["date_added","description"]].copy()
df_xmas = df_xmas[df_xmas["description"].str.contains("Christmas")]
df_xmas 

In [None]:
df_xmas = yrs_months.valid_dates(df_xmas) #Remove all rows with "Unknown date_added"
df_xmas = yrs_months.create_month_column(df_xmas) #Create month column
df_xmas = yrs_months.create_year_column(df_xmas) #Create year column
df_xmas

In [None]:
df_xmas_tab = yrs_months.create_table(df_xmas)

In [None]:
df_xmas_tab

In [None]:
yrs_months.heatmap(df_xmas_tab,title="Overview of when Christmas content is added",xlab="Months",ylab="Years")

In [None]:
#Find all Horror related categories and see if more horror content is added before Halloween (should come in September/October)
df_horror = data_set[["date_added","listed_in"]].copy()
df_horror = df_horror[df_horror["listed_in"].str.contains("Horror")]

In [None]:
df_horror = yrs_months.valid_dates(df_horror)
df_horror = yrs_months.create_month_column(df_horror)
df_horror = yrs_months.create_year_column(df_horror)

In [None]:
df_horror

In [None]:
df_horror_tab = yrs_months.create_table(df_horror)

In [None]:
df_horror_tab

In [None]:
yrs_months.heatmap(df_horror_tab,title="Overview of when Horror content is added",xlab="Months",ylab="Years")

In [None]:
#Romantic movies around Valentine's Day? Should be added in January/February

#Find all Romantic related categories and see if more romantic content is added before Valentine's Day
df_romantic = data_set[["date_added","listed_in"]].copy()
df_romantic = df_romantic[df_romantic["listed_in"].str.contains("Romantic")]
df_romantic = yrs_months.valid_dates(df_romantic)
df_romantic = yrs_months.create_month_column(df_romantic)
df_romantic = yrs_months.create_year_column(df_romantic)
df_romantic_tab = yrs_months.create_table(df_romantic)

In [None]:
yrs_months.heatmap(df_romantic_tab,title="Overview of when Romantic content is added",xlab="Months",ylab="Years")

# Summary genre

# Content evolution of Netflix.

In [None]:
# Import the library used for plot
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (35,6))
sns.countplot(x='release_year', data = data_set, hue='type') # Plot the release year

In [None]:
# See when series and movies were added to the Netflix catalogue
df = data_set.copy()
df['year_added'] = df['date_added'].str[-4:]
# Rows without a date hold the placeholder 'Unknown date_added', whose last four characters are 'dded'
df['year_added'] = df['year_added'].replace({"dded": "unknown"})
df = df.sort_values(by=['year_added'])

plt.figure(figsize = (35,10))
sns.countplot(x='year_added', data = df, hue='type')


In [None]:
# Check when they added new and old content to the library

data_set['release_year'].value_counts()

In [None]:
import numpy as np

# Check when Netflix started to add older content to the catalogue

# Make conditions for the plot
conditions = [
    (df['release_year'] >= 1925) & (df['release_year'] <= 1945),
    (df['release_year'] >= 1946) & (df['release_year'] <= 1965),
    (df['release_year'] >= 1966) & (df['release_year'] <= 1985),
    (df['release_year'] >= 1986) & (df['release_year'] <= 2007),
    (df['release_year'] >= 2008) & (df['release_year'] <= 2020)
]

values = ['1925-1945', '1946-1965', '1966-1985', '1986-2007', '2008-2020'] # Label for each release-year interval

# Add the interval label as a new column; a string default keeps the column dtype consistent
df['decade'] = np.select(conditions, values, default='unknown')

# Drop content made after Netflix was launched
df = df.drop(df.loc[df['decade']=='2008-2020'].index)


plt.figure(figsize = (35,20))
sns.countplot(x='year_added', data = df, hue='decade')



# Movie recommendations

The aim of this part is to create content recommendations based on a given title. 
In order to create recommendations, one needs to identify similarities between the given title and other titles in the dataset. 

The similarity between titles will be found by comparing the description, genre and cast columns. 

To create recommendations, it is crucial to quantify the similarities. For this purpose, the cosine similarity will be utilized. The cosine similarity measures the similarity between two vectors by comparing the angle between them and determining whether they point in the same direction \[source: Data Mining: Concepts and Techniques, chap 2.5.7]

Source link: https://www.sciencedirect.com/science/article/pii/B9780123814791000022

To fully understand this concept before the development of the recommendation system, a simple example is explained:

Consider two texts:

textA = Hello Hello Goodbye

textB = Goodbye Hello Goodbye

By identifying the words and their frequency in the two texts, the table below is obtained: 

|  | Hello | Goodbye |
| --- | --- | --- |
| textA | 2 | 1 |
| textB | 1 | 2 |

This can now be visualized as vectors: 


In [None]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")
# Word counts given as (Goodbye, Hello) so that they match the x and y axes below
textB_occurrences = [2,1]
textA_occurrences = [1,2]
vectors = np.array([textB_occurrences,textA_occurrences])
origin = np.array([[0, 0],[0,0]])
# First arrow (red) is textB, second arrow (green) is textA
plt.quiver(*origin,vectors[:,0],vectors[:,1],color=["r","g"],angles="xy",scale_units="xy",scale=1)
plt.xlim(0,2.5)
plt.ylim(0,2.5)
plt.xlabel("Goodbye")
plt.ylabel("Hello")
plt.text(0.4,0.5,"\u03B8",fontsize=17)
plt.text(1.02,2.02,"textA",fontsize=12,color="g")
plt.text(2.02,1.02,"textB",fontsize=12,color="r")
plt.title("Vectorized representation of frequency of words")


By letting $\theta$ be the angle between the two vectors, the cosine similarity  is calculated by: 
\begin{equation}
\cos \theta = \frac{A \cdot B}{ |A| \cdot |B|},
\end{equation}
where $A$ and $B$ are vectors representing the occurrences of words in textA and textB, and $|A|$ and $|B|$ are the lengths of the vectors.
Recall from trigonometry the relation between the angle $\theta$ and its cosine:

|  Degrees | cos $\theta$ |
|  --- | --- |
|  0 | 1 |
|  30 | 0.866 |
| 60 |  0.5 |
| 90 |  0 |

For this reason, the cosine similarity is used to identify similar content. The smaller the angle $\theta$ between two vectors, the more similar the content is. 
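As a worked check of the formula, the two example vectors (2, 1) and (1, 2) give a dot product of 4 and lengths of $\sqrt{5}$ each, so $\cos \theta = 4/5 = 0.8$, which corresponds to an angle of about 36.9 degrees. The sketch below reproduces this with the standard library only; the function name `cosine_sim` is just an illustrative choice.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equally long count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


text_a = [2, 1]  # Hello: 2, Goodbye: 1
text_b = [1, 2]  # Hello: 1, Goodbye: 2

similarity = cosine_sim(text_a, text_b)
angle = math.degrees(math.acos(similarity))
print(round(similarity, 3), round(angle, 1))
```

Two identical texts would give a similarity of 1 (angle 0), while texts with no words in common would give 0 (angle 90 degrees).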

As mentioned above, the description, genre and cast column will be used to identify similarities. We start by extracting the dataframe with the respective columns:

In [None]:
df = data_set[["title","description","listed_in","cast"]].copy()

In order to compute cosine similarities, we need to convert the information in the columns to strings. Therefore, the columns are cleaned and then combined into a separate column. We start by removing stop words from the description column and cleaning it up: 
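The effect of the cleaning step can be illustrated on a single sentence. This is a simplified sketch: the stop-word set is a tiny hypothetical one (the notebook uses NLTK's English list), and punctuation is stripped before the stop words are dropped, which differs slightly from the order used in the notebook cells.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}


def clean_text(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip punctuation and drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in no_punct.split() if word not in stop_words)


cleaned = clean_text("The robots return to Earth, and a new battle begins.")
print(cleaned)
```

Removing these very common words means that two descriptions are not judged similar merely because both contain words such as "the" or "and".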

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = stopwords.words('english')

In [None]:
df

In [None]:
#Clean the description column
df["description"] = df["description"].str.lower() #converting all words to lower case
df["description"] = df["description"].str.split() #splitting each description into a list of words
#Dropping stop words and joining the remaining words back into one string; each word is followed by a space
df["description"] = df["description"].apply(lambda x: ''.join([word + ' ' for word in x if word not in stop_words]))
df["description"] = df["description"].apply(lambda x: x.replace(',','')) #Removing commas
df["description"] = df["description"].apply(lambda x: x.replace('.','')) #Removing dots

In [None]:
#Clean the listed_in column
df["listed_in"] = df["listed_in"].str.lower()
df["listed_in"] = df["listed_in"].str.split()
#Dropping stop words and joining the remaining words back into one string; each word is followed by a space
df["listed_in"] = df["listed_in"].apply(lambda x: ''.join([word + ' ' for word in x if word not in stop_words]))
df["listed_in"] = df["listed_in"].apply(lambda x: x.replace(',',''))
df["listed_in"] = df["listed_in"].apply(lambda x: x.replace('.',''))
df["listed_in"] = df["listed_in"].apply(lambda x: x.replace('&',''))

In [None]:
#Clean the cast column
df["cast"] = df["cast"].str.lower()
df["cast"] = df["cast"].apply(lambda x: x.replace(' ','')) #removing whitespace between first and last name so the cosine similarity compares entire actor names, and not first and last names separately
df["cast"] = df["cast"].apply(lambda x: x.replace(',',' ')) #separating the actors with spaces instead of commas



In [None]:
#Combine description, listed_in and cast as new column 
df["all_info"] = df["description"] + df["listed_in"] + df["cast"]
#df.iloc[0].all_info run this line to see an example

In [None]:
df

Now that each title has a single column containing the description, the genres and the cast, we can count the occurrences of the different words and create a matrix containing the counts of the unique words. This can be achieved by using sklearn.feature_extraction.text.CountVectorizer, which is described as a way to "convert a collection of text documents to a matrix of token counts" \[source: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html ]. 
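To get a feeling for what such a matrix of token counts contains, the two example texts from earlier can be counted by hand. The sketch below is a plain-Python illustration of the idea and not scikit-learn's implementation; the variable names are assumptions.

```python
from collections import Counter

documents = ["hello hello goodbye", "goodbye hello goodbye"]

# Vocabulary: every distinct token, sorted so the columns have a fixed order
vocabulary = sorted({token for doc in documents for token in doc.split()})

# One row per document, one column per vocabulary token
count_matrix_sketch = [
    [Counter(doc.split())[token] for token in vocabulary] for doc in documents
]

print(vocabulary)
print(count_matrix_sketch)
```

Each row is the count vector of one document, which is exactly the kind of vector the cosine similarity compares.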

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer() #Initialize a CountVectorizer object
count_matrix = cv.fit_transform(df["all_info"]) # Identify and count instances in the "all_info" column. 
# print(count_matrix.toarray()) print this line to get a feeling of what the count_matrix looks like

Once the count_matrix is obtained, the next step is to find the cosine similarity. For this purpose, sklearn.metrics.pairwise.cosine_similarity is used \[source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html].

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cos_sim = cosine_similarity(count_matrix) #Now the cosine similarities have been quantified
#print(cos_sim) print this line to get a feeling of what cosine sims looks like

#Defining two helper functions
def get_title_from_idx(df,idx):
    """
    Returns the title of an entry from a given index
    """
    return df[df.index == idx].title.values[0]

def get_idx_from_title(df,title):
    """
    Returns the index of a movie from a given title
    """
    return df[df["title"]==title].index.values.astype(int)[0]

In [None]:
title_user_likes = "Transformers Prime" #Test code for transformers prime
title_idx = get_idx_from_title(df,title_user_likes) #identify the index of the title that the user likes

With the index of the movie, we can find the list of the cosine similarities for the specific title. This is done by:

In [None]:
specific_cos_sim = cos_sim[title_idx]

We need to keep track of which index each score in specific_cos_sim belongs to. By enumerating, the array of cosine similarity scores is converted into tuples containing the index and the score. Finally, by converting the result into a list, we get a list of (index, score) tuples:

In [None]:
similar_content = list(enumerate(specific_cos_sim))

Now we wish to sort the list of tuples by similarity score. 

This is achieved with Python's built-in sorted() function, which builds a new list. According to the official documentation, the key argument should specify a function of one argument that is used to extract a comparison key from each element in the iterable. We wish to sort on the second value of each tuple (which is the cosine similarity score), so we define a function to achieve this:

source:
https://docs.python.org/3/library/functions.html#sorted

In [None]:
def key_func(val):
    return val[1]

sorted_similar = sorted(similar_content, key=key_func, reverse=True)  # reverse=True sorts in descending order
# Note: could also have used key=lambda x: x[1]
sorted_similar_suggestions = sorted_similar[1:]  # Skip the first entry, which is normally the title itself


# Print top 5 similar content:
print("Top 5 titles similar to " + title_user_likes + " are:\n")
for e in sorted_similar_suggestions[:5]:
    # Each tuple holds the index at position 0 and the cosine similarity score at position 1,
    # so e[0] is what the helper function needs.
    print(get_title_from_idx(df, e[0]))
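
The enumerate, sort and skip steps above can also be sketched with NumPy. The helper below, `top_n_similar`, is a hypothetical name and not part of the notebook. It excludes the liked title by its index rather than by assuming it sorts first, which matters if another title happens to have a similarity of exactly 1. It returns row positions, so it assumes the DataFrame uses the default 0..n-1 index.

```python
import numpy as np

def top_n_similar(cos_sim, title_idx, n=5):
    """Return the row indices of the n most similar titles, excluding title_idx itself."""
    scores = np.asarray(cos_sim[title_idx], dtype=float).copy()
    scores[title_idx] = -np.inf                 # never recommend the title itself
    order = np.argsort(-scores, kind="stable")  # descending by score
    return order[:n].tolist()

# Hypothetical 4x4 similarity matrix
toy_sim = np.array([
    [1.0, 0.9, 0.0, 0.4],
    [0.9, 1.0, 0.1, 0.3],
    [0.0, 0.1, 1.0, 0.2],
    [0.4, 0.3, 0.2, 1.0],
])
print(top_n_similar(toy_sim, title_idx=0, n=2))
```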


As seen from the list above, entering Transformers Prime as liked content returns content similar to it. This is especially noticeable because the top suggestion is another Transformers movie, which suggests that the recommendation system works well for this example. 

# Neural Network

Ex 4. Try to predict the genre of a movie/show based on the director, actors, etc. using
machine/deep learning techniques.

Here we take a dive into a neural network model. We decided to use the [MLPClassifier][MLPClassifier] model from the sklearn library. We will start this chapter with some disclaimers. 

Only the finalized model is displayed here. The data has been polished in the Python file _cleaning_data_for_NN.py_ to improve the final product. The changes were mainly to combine similar genres and to remove the names of actors/actresses that appeared only once in the dataset. We believed this would improve our model. 

As stated, the model has been run several times, but only the final model is displayed below. The previous models have been saved, and their reports are located below in the [result](#Results) section.


## Programming Logic

The core of the programming logic for the final model is simple. In the _cleaning_data_for_NN.py_ file we start by cleaning the provided [netflix data][netflixData] set, removing NaN values and values that were incoherent. Movies/TV Shows labeled _just_ Movies/TV Shows were dropped from the dataset, and genres labeled with "TV Show" or "Movie", for example Romantic Movies (or Romantic TV Shows), were renamed to just Romantic. The main reason was to increase the number of data points sharing the same genre and to reduce the number of classes in the output data. 

Afterwards the data was checked for duplicates. We had issues with some labels because they were mentioned twice in the same row. The biggest issue concerned "International Movies/TV Shows": our decision to combine all International genres into one larger genre called "International" meant that it often appeared more than once in the rows it belonged to. We decided to remove these duplicates, and the Pandas DataFrame was saved to another CSV file, called _cleaned_Netflix_for_NN.csv_.
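
The cleaning script itself is not shown in this notebook, so the following is only a rough sketch of the renaming and de-duplication described above. The function name `clean_genres` and the suffix rules are assumptions and may differ from what _cleaning_data_for_NN.py_ actually does.

```python
import pandas as pd

def clean_genres(listed_in):
    """Strip 'Movies'/'TV Shows' suffixes, drop bare type labels, remove duplicates."""
    cleaned = []
    for genre in listed_in.split(", "):
        for suffix in (" Movies", " TV Shows"):
            if genre.endswith(suffix):
                genre = genre[: -len(suffix)]
        if genre in ("Movies", "TV Shows"):
            continue  # a bare type label carries no genre information
        if genre not in cleaned:
            cleaned.append(genre)  # keep the first occurrence, preserve order
    return ", ".join(cleaned)

# Hypothetical stand-in for the listed_in column
toy = pd.Series([
    "International Movies, Romantic Movies",
    "International TV Shows, International Movies, Dramas",
    "Movies",
])
print(toy.map(clean_genres).tolist())
```

Rows that end up with an empty string had no real genre and could then be dropped from the dataset.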

[MLPClassifier]: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
[netflixData]: https://www.kaggle.com/shivamb/netflix-shows

### Reading of CSV file, creating and transforming X & y data.

The file _final_NN_script.py_ contains the logic of the final Neural Network algorithm. It will be presented in parts below.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction import text 
from sklearn.metrics import classification_report, multilabel_confusion_matrix

In [None]:
netflix_data=pd.read_csv("cleaned_Netflix_for_NN.csv")

smaller_data = netflix_data.copy()

y = smaller_data.listed_in
X = [','.join((d, c, r, t)) for d,c,r,t in zip(
                                                smaller_data.director, 
                                                smaller_data.cast, 
                                                smaller_data.rating, 
                                                smaller_data.title
                                            )]

# Custom stop words for the CountVectorizer to ignore while transforming.
customStopWords=['no cast', 'no director', 'movies', 'tv shows',
                'lgbtq movies', 'teen tv shows', 'cult'] 

# Find all actors that only appear once in the dataset
customCastStopWords = smaller_data.cast.str.split(', ').explode().value_counts()[
    smaller_data.cast.str.split(', ').explode().value_counts() < 2].keys()
customCastStopWords = [x.lower() for x in customCastStopWords] # Make all values lowercase

# Add stopwords for vectorizer into single frozenset
stop_words = text.ENGLISH_STOP_WORDS.union(customStopWords)
stop_words = stop_words.union(list(customCastStopWords))

# Split data into train and test data, training data = 80% of original data & test = 20%.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1000) 

# Define the vectorizer algorithm
matrix = CountVectorizer(
    tokenizer=lambda row: [x.strip() for x in row.split(',') if x != ''], 
    stop_words=stop_words)

# Transform X data
x_train_fit = matrix.fit_transform(X_train)
x_test_fit = matrix.transform(X_test)

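# Transform y data
# Note: re-fitting the same vectorizer replaces its vocabulary with the genres,
# so from here on `matrix` describes y rather than X.
y_train_fit = matrix.fit_transform(y_train)
y_test_fit = matrix.transform(y_test)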

# Printing out all genres in the y data
print("Genres:")
for i in matrix.get_feature_names():
    print(i)

#### X & y data

A few lines of the code section above need to be explained. The first is the creation of the y and X data (output and input data) used by the neural network algorithm. 
```python
y = smaller_data.listed_in
X = [','.join((d, c, r, t)) for d,c,r,t in zip(
                                                smaller_data.director, 
                                                smaller_data.cast, 
                                                smaller_data.rating, 
                                                smaller_data.title
                                            )]
```

The y data is the output to predict, and in this model we are trying to predict genres from several columns of the data set. It is therefore fairly self-explanatory that the y (output) data points need to be the genres, which are stored in the _listed_in_ column of the dataset.
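
For comparison, a common alternative way to encode multi-genre targets is scikit-learn's MultiLabelBinarizer. This is not what the notebook does (it reuses the CountVectorizer for y); the sketch below uses a made-up `toy_listed_in` series only to show what a genre indicator matrix looks like.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the listed_in column
toy_listed_in = pd.Series(["Dramas, International", "Comedies", "Dramas, Comedies"])

mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(toy_listed_in.str.split(", "))

print(mlb.classes_)   # genre names, one per output column
print(y_indicator)    # 1 where the title carries that genre, else 0
```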

The X data is built with a simple (and potentially improvable) use of the Python _join_ method. We make use of the _CountVectorizer_ class provided by the sklearn library. This will be explained further later; for now we just need to know that its input must be a sequence of items that are either strings or bytes. 

[From the countVectorizer documentation][countVectorizer]: 
> input : string {‘filename’, ‘file’, ‘content’}, default=’content’ <br>
> _Otherwise the input is expected to be a sequence of items that can be of type string or byte._

To get around this (easily), since we want to use several columns, the join method was used. For each row, we join all directors, cast members, the rating and the title into a single comma-separated string. This is discussed in the [Improvements chapter](#Improvments-and-further-research).

[countVectorizer]: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

#### Stopwords

The next part to explain is the stopword logic. Because the Netflix data set has empty cells in nearly every column, we used the imputation method explained earlier and replaced the empty cells with filler values such as _No Director_ or _No Cast_. Because these values are self-made and carry no meaning, they should not be counted by the CountVectorizer. This is where the stopwords come into play: by listing _No Director_ and _No Cast_ as stopwords (among others), we can rely on the vectorizer to ignore them when building the fitted data.


```python
# Custom stop words for the CountVectorizer to ignore while transforming.
customStopWords=['no cast', 'no director', 'movies', 'tv shows',
                'lgbtq movies', 'teen tv shows', 'cult'] 

customCastStopWords = smaller_data.cast.str.split(', ').explode().value_counts()[
    smaller_data.cast.str.split(', ').explode().value_counts() < 2].keys()
                                                                                 
customCastStopWords = [x.lower() for x in customCastStopWords] # Make all values lowercase

# Add stopwords for vectorizer into single frozenset
stop_words = text.ENGLISH_STOP_WORDS.union(customStopWords)
stop_words = stop_words.union(list(customCastStopWords))
```

This section of code creates the stopwords for the final model. The ``` 'lgbtq movies', 'teen tv shows', 'cult' ``` values were added to the stopwords because of the results in the [Classification Report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) of a previous model. 
We further decided to remove all actors that only appeared once in the dataset. The variable _customCastStopWords_ stores these names. 
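
The selection of single-appearance actors can be sketched on a tiny example. Here the counts are computed once and reused, which is equivalent to the notebook's code but avoids running value_counts twice. The names `toy_cast`, `appearances` and `single_appearance` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for smaller_data.cast
toy_cast = pd.Series(["Ann Lee, Bo Chen", "Ann Lee, Cy Diaz", "No Cast"])

appearances = toy_cast.str.split(", ").explode().value_counts()  # counted once, reused below
single_appearance = [name.lower() for name in appearances[appearances < 2].index]

print(sorted(single_appearance))
```

Names are lowercased because the CountVectorizer lowercases its input before comparing tokens with the stopwords.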

#### CountVectorizer

After splitting the data into train and test sets with a ratio of 80 to 20 percent of the original dataset, the CountVectorizer is used. The documentation describes this sklearn class as follows: 
> Convert a collection of text documents to a matrix of token counts

The vectorizer is configured with a lambda function: ```  tokenizer=lambda row: [x.strip() for x in row.split(',') if x != ''] ``` The tokenizer parameter is documented as follows: 
> tokenizer : callable, default=None <br>
Override the string tokenization step while preserving the preprocessing and n-grams generation steps. 

This parameter controls how each row is split into tokens. The lambda function mirrors the join used to combine the different columns: each row is split on commas and every item is stripped of surrounding whitespace, so that identical items are counted as the same token.  
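
To see the effect of the comma tokenizer, here is a minimal sketch on two made-up joined rows. The named function `split_on_commas` is a stand-in for the lambda (it additionally drops items that are empty after stripping), and `token_pattern=None` is passed only to signal that the default word pattern is not used.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_on_commas(row):
    """Treat every comma-separated field as one token, e.g. a full actor name."""
    return [x.strip() for x in row.split(",") if x.strip() != ""]

# Hypothetical joined rows: director, cast, rating, title
toy_rows = [
    "No Director,Ann Lee, Bo Chen,TV-MA,Some Title",
    "Kim Park,Ann Lee,PG,Other Title",
]

toy_vec = CountVectorizer(tokenizer=split_on_commas, token_pattern=None,
                          stop_words=["no director", "no cast"])
toy_X = toy_vec.fit_transform(toy_rows)

print(sorted(toy_vec.vocabulary_))  # whole names and titles, lowercased
```

Each full name or title becomes a single feature, and the filler value "no director" is left out because it is listed as a stopword.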

### MLPClassifier method

In [None]:
hidden_layer = 1000 #round(x_train_fit.shape[1]*(2/3) + y_train_fit.shape[1])

# Small datetime to check when ML started
datetime_object = datetime.datetime.now()
print("Begin ML: ", datetime_object)

# Neural Network algorithm
clf = MLPClassifier(hidden_layer_sizes=(hidden_layer,hidden_layer ),
                    solver='adam', verbose=True, 
                    random_state=1, max_iter=50) 

clf.fit(x_train_fit, y_train_fit)

# Small datetime to check when ML stopped
datetime_object = datetime.datetime.now()
print("End ML: ", datetime_object)

## Results

Since this is an unbalanced multilabel classification problem, we look at the micro average to best judge the results of the NN model.
Taking the _drama_ genre as an example: 

Precision is calculated as: <br>
drama correctly identified, divided by drama correctly identified plus other genres identified as drama.

Recall is calculated as: <br>
drama correctly identified, divided by drama correctly identified plus drama identified as other genres. 
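
The two definitions, and the micro average, can be checked on a small made-up example. The indicator matrices below are hypothetical and only illustrate the arithmetic; they are not results from the model.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical indicator matrices: rows = titles, columns = (drama, comedy)
y_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
y_pred = np.array([[1, 0], [0, 1], [1, 1], [1, 0]])

# Drama column by hand
tp = int(np.sum((y_true[:, 0] == 1) & (y_pred[:, 0] == 1)))  # drama correctly identified
fp = int(np.sum((y_true[:, 0] == 0) & (y_pred[:, 0] == 1)))  # other titles predicted as drama
fn = int(np.sum((y_true[:, 0] == 1) & (y_pred[:, 0] == 0)))  # drama titles that were missed

drama_precision = tp / (tp + fp)
drama_recall = tp / (tp + fn)

# The micro average pools the counts of all genres before dividing
micro_precision = precision_score(y_true, y_pred, average="micro")
micro_recall = recall_score(y_true, y_pred, average="micro")
print(drama_precision, drama_recall, micro_precision, micro_recall)
```

Because the micro average pools all counts, frequent genres weigh more heavily in it than rare ones.
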
#### First model

First we present the classification report of the first model.



In [None]:
from first_NN_model import first_NN_model

x_train_fit, x_test_fit, y_train_fit, y_test_fit, matrix = first_NN_model()

fileName = 'first_unflitered_NN_model.sav'

loaded_model = pickle.load(open(fileName, 'rb'))
y_pred = loaded_model.predict(x_test_fit)


print("Classification Report: \n", classification_report(y_test_fit,y_pred, target_names=list(matrix.get_feature_names())))


Even before cleaning up the data we see a good precision score; however, the recall score is lower. 

#### Improving genre data

The first attempt to improve the results was to redefine the genres. The previous report shows that several genres had a precision and recall score of 0, so we decided to combine those genres and/or remove them from the transformed dataset. The resulting report is seen below.

In [None]:
from cleaned_genres_script import cleaned_genres_NN_model

x_train_fit, x_test_fit, y_train_fit, y_test_fit, matrix = cleaned_genres_NN_model()

fileName = 'cleaned_genres_NN_model.sav'

loaded_model = pickle.load(open(fileName, 'rb'))
y_pred = loaded_model.predict(x_test_fit)


print("Classification Report: \n", classification_report(y_test_fit,y_pred, target_names=list(matrix.get_feature_names())))


The result shows a good improvement from 0.19 to 0.38 when we cleaned up the genres. We roughly halved the number of genres, from 42 to 22. Our hypothesis that more refined genres would help had a positive result. However, we can still try to get better results. 

#### Final result

The final model removed all actors that only appeared in one movie (approx. 20k names). The logic is shown in the [first subchapter of the Programming Logic](#Reading-of-CSV-file,-creating-and-transforming-X-&-y-data.) chapter. We assumed that these names would only confuse the model and not be helpful. The results are printed below using the already saved model.

In [None]:
# Import values needed for classification report
from final_NN_script import final_NN_model

x_train_fit, x_test_fit, y_train_fit, y_test_fit, matrix = final_NN_model()
# Compute results
fileName = 'Final_NN_model.sav'

loaded_model = pickle.load(open(fileName, 'rb'))
y_pred = loaded_model.predict(x_test_fit)


print("Classification Report: \n", classification_report(y_test_fit,y_pred, target_names=list(matrix.get_feature_names())))


Here we actually see a further improvement in recall and a small reduction in precision. Can further small adjustments improve the model? 

### Confusion Matrix Visualization

In [None]:
# Confusion Matrix
cm = multilabel_confusion_matrix(y_test_fit, y_pred)

# Make labels out of genres
labels = matrix.get_feature_names()

def multilabel_cm_plot(confusion_matrix, axes, class_label, class_names, fontsize=14):
    """Draw one 2x2 confusion matrix as a heatmap on the given axes."""
    # Rows of each sklearn matrix are the true labels, columns the predictions.
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names,
    )

    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d", cbar=False, ax=axes)
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    axes.set_xlabel('Predicted label')
    axes.set_ylabel('True label')
    axes.set_title(class_label)

# Plot Confusion Matrix
fig, ax = plt.subplots(6, 4, figsize=(7, 12))

# Index 0 of each matrix is the negative class, so "N" comes before "Y"
for axes, cfs_matrix, label in zip(ax.flatten(), cm, labels):
    multilabel_cm_plot(cfs_matrix, axes, label, ["N", "Y"])

fig.tight_layout()
plt.show()

Finally, we show the confusion matrices of all the genres from the final model. 
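
When reading these plots it helps to know how scikit-learn lays out each 2x2 matrix: rows are the true labels, columns are the predictions, and index 0 is the negative class. The sketch below uses made-up indicator matrices to show that layout; it is an illustration, not output from the model.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Hypothetical indicator matrices: rows = titles, columns = two genres
y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [0, 0]])
y_pred = np.array([[1, 0], [0, 1], [0, 1], [0, 0], [0, 1]])

toy_cm = multilabel_confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = toy_cm[0].ravel()  # first genre: [[TN, FP], [FN, TP]]
print(toy_cm[0])
print(tn, fp, fn, tp)
```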

### Improvments and further research

Here are some features that could have been improved given more knowledge and time.

#### Questions
* Could a word embedding model be used to further generalize the names of actors/actresses? 
* Could the title be used to a better extent? Maybe split it and remove more of the repeated words, instead of using the whole title string as a single node?
* Could word embedding be used to generalize the training/test data better than the manual cleaning?

#### Improvements
* The model code could be refactored into functions that accept parameters, to make it more reusable.
* The cleaning could be refined to accept different datasets; the genres that were changed are currently hardcoded in a dictionary. This is not optimal. 
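
As one possible step towards reusability, the load, predict and report steps that are repeated for each saved model in the Results section could be wrapped in a single function. The helper name `report_saved_model` is hypothetical; this is a sketch under that assumption, not the project's code.

```python
import pickle
from sklearn.metrics import classification_report

def report_saved_model(file_name, x_test, y_test, target_names=None):
    """Load a pickled classifier and return its classification report as text."""
    with open(file_name, "rb") as handle:   # the file is closed even if loading fails
        model = pickle.load(handle)
    y_pred = model.predict(x_test)
    return classification_report(y_test, y_pred, target_names=target_names, zero_division=0)
```

It could then be called once per saved model, for example with 'Final_NN_model.sav', x_test_fit, y_test_fit and the genre names from the vectorizer. Note that recent scikit-learn releases provide the genre names through get_feature_names_out() rather than get_feature_names().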

# Preparing data for machine learning

In [None]:
import pandas as pd
import numpy as np
import data_sanitizer
import matplotlib.pyplot as plt

dataSet = pd.read_csv('../netflix_titles.csv')

In [None]:
dataSet = pd.read_csv('../netflix_titles.csv')
# Copy listed_in column into its own dataframe, split each genre into its own column and rename the columns
genres = dataSet['listed_in'].str.split(', ', expand=True)
genres.rename(columns = {0: 'genre1', 1: 'genre2', 2: 'genre3'}, inplace = True)
genres

In [None]:
# Do the same as above for cast and director columns
cast = dataSet['cast'].str.split(', ', expand = True)
director = dataSet['director'].str.split(', ', expand = True)

# Rename the positional columns 0, 1, 2, ... to actor1, actor2, ... and director1, director2, ...
cast = cast.rename(columns=lambda i: 'actor{}'.format(i + 1))
director = director.rename(columns=lambda i: 'director{}'.format(i + 1))

cast

In [None]:
# Remove International Movies and International TV Shows from genres as it is overrepresented and doesn't really tell us much

genres = genres.replace({'International Movies': None})
genres = genres.replace({'International TV Shows': None})

In [None]:
# Set any cell that doesn't have a value to nan
director = director[director.notnull()]
cast = cast[cast.notnull()]
genres = genres[genres.notnull()]
genres

In [None]:
# Fetch how many times each director occurs in movies and series
directorOccurence = pd.Series(dtype='float64')  # explicit dtype avoids the empty-Series dtype warning

for col in director:
    count = director[col].value_counts()
    directorOccurence = directorOccurence.add(count, fill_value = 0)

directorOccurence = directorOccurence.sort_values(ascending=False)

# Which directors occur in more than 7 movies and series
directorOccurence[directorOccurence > 7]

In [None]:
# Same as above but for actors instead
actorOccurence = pd.Series(dtype='float64')  # explicit dtype avoids the empty-Series dtype warning

for col in cast:
    count = cast[col].value_counts()
    actorOccurence = actorOccurence.add(count, fill_value = 0)

actorOccurence = actorOccurence.sort_values(ascending=False)

# Actors that occur in more than 2 movies and series
actorOccurence[actorOccurence > 2]
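
The column-by-column accumulation above can also be sketched as a single step: put all name columns into one long Series and count it. The frame `toy_cast` below is a made-up stand-in for the expanded cast columns.

```python
import pandas as pd

# Hypothetical stand-in for the expanded cast frame (one actor per column)
toy_cast = pd.DataFrame({
    "actor1": ["Ann Lee", "Bo Chen", "Ann Lee"],
    "actor2": ["Bo Chen", None, "Cy Diaz"],
})

# stack() lines all names up in one Series; value_counts() ignores missing cells
toy_occurrence = toy_cast.stack().value_counts()
print(toy_occurrence)
```

value_counts() already returns the names sorted from most to least frequent, so no separate sort_values call is needed.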

In [None]:
cast.count()
# We can tell from this that there are very few data entries in the columns further out and we choose to drop
# these as there won't be much information lost compared to the potential gain in memory saved.

In [None]:
# Only want to keep columns with more than 100 data points

# A potential problem with this approach is that we could lose information about an actor who is always listed among the last cast members
cast = cast.loc[:, cast.count() > 100]
cast

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.subplots_adjust(right = 3, top = 2)

# Plot comparing the directors who occur the most.
directorsTop = directorOccurence.iloc[0:50].sort_values()

ax1.set_title('Top 50 directors by occurrence', fontweight = 'bold')
ax1.set_xlabel('Occurrence')
ax1.barh(width = directorsTop.values, y = directorsTop.keys())

# Plot comparing the actors who occur the most.
actorsTop = actorOccurence.iloc[0:50].sort_values()

ax2.set_title('Top 50 actors by occurrence', fontweight = 'bold')
ax2.set_xlabel('Occurrence')
ax2.barh(width = actorsTop.values, y = actorsTop.keys())

fig.show()

In [None]:
# Check if there are any common genres the directors occur in.

# Set to any value between 0 and 49, where 49 is the director with the highest occurrence
# (directorsTop holds 50 entries sorted in ascending order)
DIRECTOR_INDEX = 25

director_with_highest_occurance = directorsTop.index[DIRECTOR_INDEX]

movies_series_director_occurs_indexes = []

for col in director.columns:
    movies_series_director_occurs_indexes += director.loc[director[col] == director_with_highest_occurance].index.values.tolist()


genres.iloc[movies_series_director_occurs_indexes]

# We can tell from this that there actually seems to be a really big correlation between director and genres (at least for the top 25) as a director
# keeps making movies/series in the same genres.

In [None]:
# Check if there are any common genres the actors occur in.

# Set to any value between 0 and 49, where 49 is the actor with the highest occurrence
# (actorsTop holds 50 entries sorted in ascending order)
ACTOR_INDEX = 25

actor_with_highest_occurance = actorsTop.index[ACTOR_INDEX]

movies_series_actor_occurs_indexes = []



for col in cast.columns:
    movies_series_actor_occurs_indexes += cast.loc[cast[col] == actor_with_highest_occurance].index.values.tolist()


genres.iloc[movies_series_actor_occurs_indexes]

# There seems to be some correlation between an actor and genres as well, but not as strong as with the directors. As
# many of the actors do occur in the same genres multiple times, but they also deviate from their "main genre" at times.
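
The lookup used in the two cells above (collecting row indexes column by column) can be sketched more compactly with a boolean mask. The helper `genres_for_person` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical stand-ins for the expanded cast and genres frames
toy_cast = pd.DataFrame({
    "actor1": ["Ann Lee", "Bo Chen", "Cy Diaz"],
    "actor2": ["Bo Chen", None, "Ann Lee"],
})
toy_genres = pd.DataFrame({"genre1": ["Dramas", "Comedies", "Dramas"]})

def genres_for_person(people, genres, name):
    """Return the genre rows of every title in which `name` appears in any column."""
    appears = people.eq(name).any(axis=1)   # True for rows where any column equals name
    return genres.loc[appears]

print(genres_for_person(toy_cast, toy_genres, "Ann Lee"))
```

The same function works for directors by passing the director frame instead of the cast frame.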

This relationship between directors and genres, and between actors and genres, can be used to predict the genre of a movie based on its director and actors. The issue is that the number of movies/series an actor or director has been part of drops off quickly, as we can tell from the occurrence plots. This means that if we want to use this information to create a machine learning model, it could be hard to reach high accuracy whenever the input doesn't contain any of the directors or actors with a high occurrence, and there aren't particularly many of them.