# Netflix next banger

## Background

Netflix is a member-based internet entertainment service founded in 1997. For a monthly fee, members can stream TV series, movies and documentaries on internet-connected devices. Netflix’s content selection relies on a complex set of algorithms that track customers’ online behaviour and are meant to optimize decisions for viewer satisfaction. The company therefore depends heavily on user data. (Clares-Gavilán, Fernández-Manzano and Neira 2016, 568-576)

The company’s operational environment is characterized by a hands-off approach to management. The workforce is heterogeneous: about 3,500 people whom leaders trust to have the company’s best interests at heart and to act accordingly (Gulati 2018, 4-13). Furthermore, a majority of Netflix’s content comes from outside sources, and the company’s operational flexibility is limited by outside partners (Ciejka 2018, 3-38).


## Problem description

As a result of intense competition and slow growth in the number of new subscribers, Netflix’s share price declined in fall 2019. Netflix therefore wants to maximize the success of its next production. The problem lies in not knowing what will be best received by the public. (Ny Teknik, 2019)

Additionally, because some of its content comes from outside sources, part of the problem becomes choosing the right content. If Netflix has to choose between several newly released movies or series, how will it know which one has the best chance of success?

In our project we will gather data on movies and their scripts. We will look for commonly used words, actors, genres and directors to predict which attributes Netflix should be most interested in when creating its next movie, and which outside-produced movies it should add to its assortment. Finally, upcoming movies will be discussed with our analysed data in mind.


## Previous Studies
Many studies have been carried out in this field using different approaches. Most of the studies we found focus on extracting data from social media to map how users’ online discussions, such as reviews, can predict a movie’s success. Fewer studies focus on the attributes of a movie itself and how these relate to its success.

In their study Movie Success Prediction Using Data Mining (2017), Ahmad et al. aim to predict a movie’s success based on the weighted attributes budget, actors, director, producer, set locations, story writer, movie release day, competing movie releases at the same time, music, release location and target audience. The weighting is done using a mathematical model. The authors state that “success cannot be predicted based on a particular attribute”, and weighting of multiple attributes will therefore be integrated in this study as well. Furthermore, the study gives a basic framework for how to value different attributes and which method to consider for the data mining. Lastly, the article shows how classification and clustering can be used in the context of prediction, and it gives a reference point to compare the results of this study to.

Another similar study is Hammad Afzal’s article Prediction of Movies popularity Using Machine Learning Techniques (2016), which describes how machine learning approaches can predict a movie’s success based on datasets of IMDb scores and attributes. The most important finding for this report is that the attributes contributing the most information were the metascore and number of votes for each movie, the Oscar awards won by the movie, and the number of screens the movie is going to be shown on. This is taken into account when choosing attributes for this report.

Furthermore, Meenakshi et al. also try to determine a movie’s success using several of the same attributes in their 2018 study. In the paper A Data mining Technique for Analyzing and Predicting the success of Movie, they employ a decision tree together with clusters to categorize movies as either “Flop”, “Average” or “Hit”. They found that budget was no indication of how well a movie would be rated, and that genre was the most important node in their decision tree. (Meenakshi et al. 2018, 1-6)

The paper Predicting movie success and academy awards through sentiment and social network analysis examines how much influence online communities have on the actual success of movies. The paper focuses on a model that predicts Academy Award nominees by analysing the opinions of online communities. It then uses the same approach to see whether there is a correlation between online community discussion and movie success at the box office. In this way, the study was able to predict real events, such as nine Academy Award winners, by applying the models to social networks in online movie communities. (Krauss et al. 2008, 9-10)


## Data collection

We import data from four separate tables:
 - movies_metadata, containing several different attributes
 - title_principals, containing a connection between a movie and its cast members
 - movie_names, containing information about cast members
 - box_office // TODO 

In [144]:
import numpy as np
import csv

movie_data = []
title_principals = []
movie_names = []

with open('movies_metadata.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        movie_data.append(row) 

In [2]:
with open('title_principals.tsv', encoding="utf8") as tsvfile4:
    reader_titles = csv.reader(tsvfile4, delimiter='\t')
    for row in reader_titles:
        title_principals.append(row)

In [145]:
with open('movie_names.tsv', encoding="utf8") as tsvfile4:
    reader_rating = csv.reader(tsvfile4, delimiter='\t')
    for row in reader_rating:
        movie_names.append(row)
        
# dtype=object keeps rows of unequal length as separate lists.
movie_data_np = np.array(movie_data[1:], dtype=object)
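
The three loading cells above repeat the same pattern, so it could be folded into one helper. The following is a minimal sketch, not the notebook's own code: `read_rows` is a hypothetical name, and the demo file with its column names is made up purely to show the call.

```python
import csv
import os
import tempfile

def read_rows(path, delimiter=','):
    """Return all non-empty rows of a delimited text file as lists of strings."""
    with open(path, newline='', encoding='utf-8') as f:
        return [row for row in csv.reader(f, delimiter=delimiter) if row]

# Demo on a throwaway file (made-up content, not the project data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.tsv')
    with open(demo_path, 'w', newline='', encoding='utf-8') as f:
        f.write('tconst\tordering\tnconst\n')
        f.write('tt0000001\t1\tnm0000001\n')
    demo_rows = read_rows(demo_path, delimiter='\t')

print(demo_rows)
```

With such a helper, each table would be loaded with a single call, for example `read_rows('title_principals.tsv', delimiter='\t')`.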

We begin our extraction by taking the movies_metadata table and selecting the attributes we need. The genres are stored together as a string representation of a list of objects, so we split it up and keep only the genre names in an array. We then sort the movies by IMDb id for the later joins and remove any movie with a rating below 7.0.

In [146]:
movie_data_selected_attributes = []

# Selecting attributes (the highest column index used is 23, so shorter rows are skipped)
for movie in movie_data_np:
    if(len(movie) > 23):
        movie_data_selected_attributes.append([movie[6], movie[3], movie[8], movie[14], movie[22], movie[23]])

print("Filter attributes done!")
        
movie_data_stripped_genres = []

# Rearranging genres
for movie in movie_data_selected_attributes:
    genre_arr = movie[1].split('\'')
    genres = []
    for i in range(5, len(genre_arr), 6):
        genres.append(genre_arr[i])       
    movie_data_stripped_genres.append([movie[0], genres, movie[2], movie[3], movie[4], movie[5]])

print("Genre rearranged done!")    

movies_above_7 = []

def custom_sort(t):
    return t[0]

def try_parse_float(value):
    # Return the value as a float, or -1 if it cannot be parsed.
    try:
        number = float(value)
    except (TypeError, ValueError):
        number = -1
    return number

movie_data_stripped_genres.sort(key=custom_sort)

# Filter out movies with rating < 7.0
for movie in movie_data_stripped_genres:
    if(try_parse_float(movie[4]) >= 7.0):
        movies_above_7.append(movie)

print("Filter movies below 7 done!")
movies_above_7 = movies_above_7[3:]

Filter attributes done!
Genre rearranged done!
Filter movies below 7 done!
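
Splitting the genre string on quote characters depends on the exact position of every quote. As a hedged alternative, the sketch below parses the string with `ast.literal_eval` instead. It assumes that the genres column holds a Python-style list of dictionaries with a `name` key, which is what the quote-splitting code above implies. `parse_genres` and the sample string are illustrative and not part of the notebook.

```python
import ast

def parse_genres(raw):
    """Return the genre names from a stringified list of {'id', 'name'} dicts."""
    try:
        entries = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Malformed or empty cells yield no genres.
        return []
    if not isinstance(entries, list):
        return []
    return [e['name'] for e in entries if isinstance(e, dict) and 'name' in e]

# Illustrative sample in the assumed format.
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_genres(sample))  # ['Animation', 'Comedy']
```

Unlike the index-based split, this version does not break if a genre name itself contains an apostrophe, as long as the cell is still a valid Python literal.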


We define a helper function to compare ids. All IMDb ids have the form "xx0124002", where x is a letter, so the function drops the two-letter prefix and returns the number. With this function we iterate through title_principals, the list with all connections, and movies_above_7, our filtered and sorted list of movies. The connection is made by comparing ids from both lists, where movie_id iterates to "catch up" to title_id, since there are many more titles than movies in our lists.

The movie list is then filtered to remove any movie that did not get a corresponding crew.

In [12]:
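def get_number(string):
    # Drop the two-letter prefix and return the numeric part, or 0 if it cannot be parsed.
    try:
        return int(string[2:])
    except (TypeError, ValueError):
        return 0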
    
movies_with_crew = []

# Iterate through title_principals and movies_above_7, join on id.
# Both lists are assumed to be sorted by id, so one forward pass is enough.
j = 0
movie_id = get_number(movies_above_7[0][0]) if movies_above_7 else 0
for i in range(len(title_principals)):
    title_id = get_number(title_principals[i][0])
    # Let movie_id catch up to the current title_id.
    while(movie_id < title_id and j < len(movies_above_7) - 1):
        j += 1
        movie_id = get_number(movies_above_7[j][0])
    if(movie_id == title_id):
        movies_above_7[j].append(title_principals[i][2])
        
movies_with_actors = []

# Remove non-matches, missing values
for movie in movies_above_7:
    if(len(movie) > 6):
        movies_with_actors.append(movie)
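
The pointer-based join above only works if both lists are sorted by id. As a sketch of an alternative that does not rely on ordering, the connections can first be grouped in a dictionary keyed on the movie id. `attach_crew` is a hypothetical helper; the column positions (movie id in column 0, person id in column 2 of the connection table) follow the code above, and the demo rows are made up.

```python
from collections import defaultdict

def attach_crew(movies, principals):
    """Return copies of the movie rows with matching person ids appended.

    Movies without any matching connection are left out of the result.
    """
    wanted = {movie[0] for movie in movies}
    crew_by_title = defaultdict(list)
    for row in principals:
        # Only keep connections for movies we actually care about.
        if len(row) > 2 and row[0] in wanted:
            crew_by_title[row[0]].append(row[2])
    matched = []
    for movie in movies:
        crew = crew_by_title.get(movie[0])
        if crew:
            matched.append(list(movie) + crew)
    return matched

# Made-up demo rows, not taken from the project data.
demo_movies = [
    ['tt0000002', ['Drama'], 'Title A', '1999-01-01', '7.5', '10'],
    ['tt0000005', ['Comedy'], 'Title B', '2001-01-01', '8.0', '20'],
]
demo_principals = [
    ['tconst', 'ordering', 'nconst'],
    ['tt0000002', '1', 'nm0000011'],
    ['tt0000002', '2', 'nm0000012'],
    ['tt0000003', '1', 'nm0000099'],
]
demo_joined = attach_crew(demo_movies, demo_principals)
print(demo_joined)
```

This version compares the id strings directly, so the numeric helper is not needed, at the cost of keeping the relevant connections in memory.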

The filtered list, movies_with_actors, is saved as "movies_and_crew.csv" so that we do not have to do this tedious task again...

In [15]:
with open('movies_and_crew.csv', 'w', newline='', encoding='utf-8') as writeFile:
    writer = csv.writer(writeFile)
    writer.writerows(movies_with_actors)

Read movies_and_crew:

In [30]:
movies_and_crew = []

with open('movies_and_crew.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        if(len(row) > 0):
            movies_and_crew.append(row)


We filter out movies released before 1960 and remove everyone who is neither an actor nor a director from movie_names. We then define a function that takes an array of actor ids and returns the corresponding names. The inner loop breaks early once the row count passes the numeric actor id: since the list is sorted by id, no person can appear at a position beyond their own id.

In [152]:
movies_after_1960 = []

# Remove movies before 1960
for movie in movies_and_crew:
    if(try_parse_float(movie[3].split('-')[0]) >= 1960):
        movies_after_1960.append(movie)
    
movie_names_actors = []

# Keep only actors and directors
for name in movie_names:
    if("actor" in name[4] or "director" in name[4]):
        movie_names_actors.append(name)
    
# Takes a list of actor-ids and returns an array of names
def get_actor_names(actorList):
    actors = []
    for actor in actorList:
        actor_id = get_number(actor)
        count = 0
        for name in movie_names_actors:
            count += 1
            if(count > actor_id):
                break
            if(actor == name[0]):
                actors.append(name[1])
                break
    return actors
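
The lookup above scans the name list once per actor. A hedged alternative is to build a dictionary from id to name once and then look each id up directly. The sketch below assumes the same column layout as the code above (id in column 0, name in column 1). `build_name_index` and `lookup_names` are hypothetical names, and the demo rows are placeholders.

```python
def build_name_index(names):
    """Map each person id (column 0) to the person's name (column 1)."""
    return {row[0]: row[1] for row in names if len(row) > 1}

def lookup_names(person_ids, name_index):
    """Return the names for the ids that are present in the index, in order."""
    return [name_index[pid] for pid in person_ids if pid in name_index]

# Placeholder rows, not taken from the project data.
demo_names = [
    ['nm0000011', 'Person A'],
    ['nm0000012', 'Person B'],
]
demo_index = build_name_index(demo_names)
demo_lookup = lookup_names(['nm0000012', 'nm0000099', 'nm0000011'], demo_index)
print(demo_lookup)  # ['Person B', 'Person A']
```

The dictionary does not depend on the list being sorted, and unknown ids are skipped just as in the loop-based version.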

We look up all actors of each movie by id and replace the ids with the actors' names. This uses the filtered movie_names_actors list from the previous step.

In [None]:
movies_with_actor_names = []

# Replace actor ids with names
for movie in movies_after_1960:
    actors = get_actor_names(movie[6:])
    movies_with_actor_names.append([movie[:5], actors])

Aaaaand we save it.

In [6]:
with open('movies_and_crew_names.csv', 'w', newline='', encoding='utf-8') as writeFile:
    writer = csv.writer(writeFile)
    writer.writerows(movies_with_actor_names)

In [29]:
import csv

movies_with_actor_names = []

with open('movies_and_crew_names.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        if(len(row) > 0):
            movies_with_actor_names.append(row)

movies_with_actor_names = movies_with_actor_names[1:]

Aaaaand we did something wrong when we saved it... Cleaning up some chars and setting up the rows again.

In [141]:
we_doing_this_again = []

# Each saved row holds two stringified lists: the attributes and the actors.
# Strip the list punctuation and split each one back into separate fields.
for row in movies_with_actor_names:
    for cell in row:
        for char in ('\'', '"', '[', ']', ' '):
            cell = cell.replace(char, '')
        we_doing_this_again.append(cell.split(','))

nice_and_clean = []

for i in range(0, len(we_doing_this_again), 2):
    nice_and_clean.append([we_doing_this_again[i], we_doing_this_again[i+1]])

for values in nice_and_clean:
    genres = []
    for i in range(1, (len(values[0]) - 3)):
        genres.append(values[0][i])
    nr_of_genres = len(genres)
    values[0] = [values[0][0], genres, values[0][1+nr_of_genres], values[0][2+nr_of_genres], values[0][3+nr_of_genres], values[1]]

correct_format = []

for value in nice_and_clean:
    correct_format.append(value[0])
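
The cleanup above is needed because nested lists were written to the CSV as their string representation. One way to avoid this, sketched here under our own assumptions, is to JSON-encode list-valued cells on write and decode them on read. `write_rows_with_lists` and `read_rows_with_lists` are hypothetical helpers, the demo row is made up, and an in-memory buffer stands in for the file.

```python
import csv
import io
import json

def write_rows_with_lists(rows, f):
    """Write rows to f, JSON-encoding any list-valued cell."""
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([json.dumps(c) if isinstance(c, list) else c for c in row])

def read_rows_with_lists(f, list_columns):
    """Read rows from f and decode the given column indexes back into lists."""
    rows = []
    for row in csv.reader(f):
        if not row:
            continue
        for i in list_columns:
            row[i] = json.loads(row[i])
        rows.append(row)
    return rows

# Made-up demo row: id, genres, title, release date, rating, actors.
demo_rows = [
    ['tt0000002', ['Drama', 'Comedy'], 'Some, Title', '1999-01-01', '7.5',
     ['Person A', 'Person B']],
]
buffer = io.StringIO(newline='')
write_rows_with_lists(demo_rows, buffer)
buffer.seek(0)
restored = read_rows_with_lists(buffer, list_columns=[1, 5])
print(restored == demo_rows)
```

Because the lists are encoded explicitly, spaces and commas inside titles and names survive the round trip, which the character-stripping cleanup does not guarantee.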

We import a table containing movie-director connections, connect all director ids to their corresponding movies, and then replace each id with a name.

In [111]:
movie_crews = []

with open('data_crew.tsv', encoding="utf8") as tsvfile4:
    reader_titles = csv.reader(tsvfile4, delimiter='\t')
    for row in reader_titles:
        movie_crews.append(row)

In [128]:
movie_with_directors = []

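def get_number(string):
    # Drop the two-letter prefix and return the numeric part, or 0 if it cannot be parsed.
    try:
        return int(string[2:])
    except (TypeError, ValueError):
        return 0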
    
# Iterate through all movies and movie_crews, join on movie id.
# Both lists are assumed to be sorted by id, so one forward pass is enough.
j = 0
crew_id = 0
for i in range(len(correct_format)):
    movie_id = get_number(correct_format[i][0])
    # Let crew_id catch up to the current movie_id.
    while(crew_id < movie_id and j < len(movie_crews) - 1):
        j += 1
        crew_id = get_number(movie_crews[j][0])
    # Append the directors column if the crew row has one.
    if(crew_id == movie_id and len(movie_crews[j]) > 1):
        correct_format[i].append(movie_crews[j][1])

In [132]:
with open('movies_crew_and_director.csv', 'w', newline='', encoding='utf-8') as writeFile:
    writer = csv.writer(writeFile)
    writer.writerows(correct_format)

In [203]:
import csv

movies_with_directors = []

with open('movies_crew_and_director.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        if(len(row) > 0):
            movies_with_directors.append(row)

movies_with_directors = movies_with_directors[1:]

We match director ids to names in the same way as we did for actor ids.

In [222]:
def get_director_names(director):
    director_id = get_number(director)
    count = 0
    for name in movie_names:
        count += 1
        if(count > director_id):
            break
        if(director == name[0]):
            return name[1]
    return ""

movies_with_director_names = []

# Replace director ids with names
for movie in movies_with_directors:
    directors = get_director_names(movie[6])
    movie[6] = directors
    movies_with_director_names.append(movie)


In [224]:
movies_with_directors

with open('complete_array.csv', 'w', newline='', encoding='utf-8') as writeFile:
    writer = csv.writer(writeFile)
    writer.writerows(movies_with_directors)
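
The director lookup above treats the directors field as a single id. If a field holds several ids, get_number cannot parse it and returns 0, so the lookup ends with an empty string. The sketch below shows one possible way to handle such fields. It assumes, without having checked the data, that multiple ids are separated by commas and that a missing value is written as `\N`. The helper names and demo values are hypothetical.

```python
def split_director_ids(directors_field):
    """Split a directors field into individual person ids, dropping empty markers."""
    return [d for d in directors_field.split(',') if d and d != '\\N']

def director_names(directors_field, name_index):
    """Return the known names for every id in the field, in order."""
    return [name_index[d] for d in split_director_ids(directors_field) if d in name_index]

# Made-up demo index and field, not taken from the project data.
demo_index = {'nm0000011': 'Person A', 'nm0000012': 'Person B'}
demo_directors = director_names('nm0000011,nm0000012', demo_index)
print(demo_directors)  # ['Person A', 'Person B']
```

Returning a list instead of a single string keeps co-directed movies in the data set, which may matter when directors are later used as an attribute.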

## Data analysis

*Document your choice and motivation for selected data mining method(s) here. Choose a data mining method(s) to use in Python code to perform an analysis of your chosen dataset. Describe why you chose the method(s) and what interesting things you have found from the analysis.*

*Replace the contents of this cell with your own text.*

In [3]:
# Add your own code

## Evaluation of results

*Document an evaluation of your analysis results and describe how potentially actionable they are.*

*Replace the contents of this cell with your own text.*

In [None]:
# Add your own code

## Schedule and description of project plan

*Rough schedule for the project beyond the pilot study presented in 3-5. This does not have to be advanced, you can simply provide an estimate based upon reported schedules for similar projects in the literature.*

*Replace the contents of this cell with your own text.*

## Ethical aspects that need to be considered

*Are there ethical aspects that need to be considered? Are there legal implications (e.g., personal data / GDPR)? Are there implications if the case organization is a business, public authority, or nonprofit entity?*

*Replace the contents of this cell with your own text.*