# Netflix next banger

## Background

Netflix is a member-based internet entertainment service founded in 1997. The service allows members to watch tv-series, movies and documentaries through a streaming software on internet-connected devices against a monthly fee. Netflix’s content relies on a complex structure of algorithms that watches customer’s online use and are supposed to optimize satisfying decisions. Thus the company is in great need of user data. (Clares-Gavilán, Fernández-Manzano och Neira 2016, 568-576)

The company’s operational environment is characterized by hands-off approach to management. The workforce is heterogeneous, about 3,500 people who 
leaders trust to have the company’s best interests at heart and act correspondingly. (Gulati 2018, 4-13).  Furthermore a majority of Netflix’s content is from outside sources and the company’s operational flexibility is limited by outside partners (Ciejka 2018, 3-38).


## Problem description

As an effect of high competition and slow growth in the number of new subscribers Netflix share price has declined in fall 2019, hence Netflix wants to maximize the success of their next creation. The problem lies in not knowing what will be best received by the public. (Ny Teknik, 2019)

Additionally, by having some of its content come from outside sources, part of the problem becomes choosing the right content. If Netflix have to choose between several newly released movies or series, how will they know which one has the best chance of success?

In our project we will gather data from movies and their scripts. We will look for commonly used words, actors, genres and directors to predict which attributes Netflix should be the most interested in when creating their next movie and also which outside-produced movies they should add to their assortment. Finally upcoming movies will also be discussed with our analysed data in mind.


## Previous Studies
There are many different researches that has been carried out in the field using different approaches. Most of the studies found focuses on extracting data from social media to map how users online discussions, like reviews etc can predict a movies success. There are fewer studies that focuses on the attributes of a movie to explore how this is related to a movies success. 

In Ahmad et. al’s study Movie Success Prediction Using Data Mining (2017) the authors presents a study which aims to predict a movies success based on the weighted attributes budget, actors, director, producer, set locations, story writer, movie release day, competing movie releases at the same time, music, release location and target audience. The weighing is done using a mathematical model. It’s stated that “success cannot be predicted based on a particular attribute” and therefore weighing of multiple attributes will be integrated in this study as well. Furthermore the study gives a basic framework in how to value different attributes and what method to consider for the data mining. Lastly, the article shows how classification and clustering can be used in the context of prediction. It also gives a reference point to compare the results of this study to. 

Another similar study is Hammad Afzals article Prediction of Movies popularity Using Machine Learning Techniques (2016) which describes how machine learning approaches can predict an movies success based on datasets of IMDB scores and attributes. The most important finding of this article regarding the work of this report is that the attributes which contributed with most information was metascore and number of votes for each movie, Oscar awards won by the movies and the number of screens the movie is going to be screened which is considered in the choosing of attributes for this report. 

Furthermore Meenakshi et al also try to determine a movies success using several of the same attributes in their 2018 study. In the paper A Data mining Technique for Analyzing and Predicting the success of Movie they employ a decision tree together with clusters to categorize movies as either “Flop”, “Average” or “Hit”. They found that budget was no indication of how well a movie would be rated, and that genre was the most important node in their decision tree. (Meenakshi et al. 2018, 1-6)

The paper Predicting movie success and academy awards through sentiment and social network analysis talks about how much influence online communities have on the actual success of movies. The paper focuses on a model that that predicts academy award nominees by researching the opinions of online communities. Afterwards it uses the same approach to see if there is a correlation between online communities and movie success at the box office. By doing this, the study were able to predict different real events, such as nine academy award winners, by considering and using mentioned models on social networks in movie communities online.(Jonas Sebastian Krauss et al. 2008, 9-10). 


## Data collection

importing data

In [1]:
import csv
import numpy as np
import pandas as pd
        
movie_basics = []
    
with open('data_basics.tsv', encoding="utf8") as tsvfile1:
    reader_basic = csv.reader(tsvfile1, delimiter='\t')
    for row in reader_basic:
        movie_basics.append(row)
        
movie_names = []

with open('movie_names.tsv', encoding="utf8") as tsvfile4:
    reader_rating = csv.reader(tsvfile4, delimiter='\t')
    for row in reader_rating:
        movie_names.append(row)
        
movie_crews = []
        
with open('data_crew.tsv', encoding="utf8") as tsvfile2:
    reader_crew = csv.reader(tsvfile2, delimiter='\t')
    for row in reader_crew:
        movie_crews.append(row)
        
movie_ratings = []

with open('data_ratings.tsv', encoding="utf8") as tsvfile3:
    reader_rating = csv.reader(tsvfile3, delimiter='\t')
    for row in reader_rating:
        movie_ratings.append(row)

# Remove header row
movie_ratings = movie_ratings[1:]

filter out unwanted attributes

In [30]:
movie_basics_filtered = []

def try_parse_float(input):
    try:
        number = float(input)
    except:
        number = -1
    return number

# get appropriate attributes
for movie in movie_basics:
    if(len(movie) > 8 and try_parse_float(movie[5]) > 2010):
        movie_basics_filtered.append([movie[0], movie[2], movie[5], movie[8]])

join movies with ratings

In [31]:
movie_basics_above_threshold = []

def get_number(string):
    try:
        return (int) (string[2:])
    except:
        return 0
    
# stepping through movie_basics and movie_ratings simultainously, since they are sorted
j = 0
rating_id = 0
for i in range(len(movie_basics_filtered)):
    movie_id = get_number(movie_basics_filtered[i][0])
    while (rating_id < movie_id):
        j += 1
        if(j > len(movie_ratings) -1):
            break
        rating_id = get_number(movie_ratings[j][0])
        
    if(j > len(movie_ratings) -1):
        break
    if(movie_id == rating_id and try_parse_float(movie_ratings[j][1]) >= 7.0):
        
        movie_basics_filtered[i].append([movie_ratings[j][2], movie_ratings[j][1]])

movie_basics_and_ratings = []

# Only keeping any movie that had a rating 
for movie in movie_basics_filtered:
    if(len(movie) > 4):
        movie_basics_and_ratings.append(movie)


joining movies with directors

In [None]:
def get_person(person):
    person_id = get_number(person)
    for i in range (len(movie_names)):
        try:
            if(get_number(movie_names[i][0]) == person_id):
                return movie_names[i][1]
        except:
            print(i)

# stepping through movie_basics and movie_crews and connecting on movie id
j = 1
crew_id = 0
directors = []
for i in range(len(movie_basics_and_ratings)):
    movie_id = get_number(movie_basics_and_ratings[i][0])
    while(crew_id < movie_id):
        j += 1
        crew_id = get_number(movie_crews[j][0])
    if(crew_id == movie_id):
        for member in movie_crews[j][1].split(','):
            directors.append(get_person(member))
        movie_basics_and_ratings[i].append(directors)
        directors = []

In [47]:
count = 0
for movie in movie_basics_and_ratings:
    if(len(movie) > 5):
        count += 1
count

151

In [42]:
movie_basics_and_ratings[130:155]

[['tt10003330',
  'Camp Hot Streets',
  '2019',
  'Action,Animation,Comedy',
  ['9', '7.7'],
  ['Pete Michels']],
 ['tt10003408', '129. Bolum', '2019', '\\N', ['6', '9.7'], ['Onur Tan']],
 ['tt10003626',
  'The Treasure',
  '2019',
  'Action,Adventure,Animation',
  ['6', '7.0'],
  ['Sylvain Blais']],
 ['tt10003660',
  'Adam Acts His Age',
  '2019',
  'Comedy',
  ['83', '8.3'],
  ['Kevin Bright']],
 ['tt10003700',
  'Sierra Maestra',
  '2019',
  'Drama',
  ['157', '7.9'],
  ['Steve Shill']],
 ['tt10003824',
  'Louis T. Steinhil (No. 27)',
  '2019',
  'Crime,Drama,Mystery',
  ['531', '8.8'],
  ['Bill Roe']],
 ['tt10003978',
  'Fiddler: A Miracle of Miracles',
  '2019',
  'Documentary',
  ['114', '7.2'],
  [None]],
 ['tt10004066', 'Papaya Bull', '2017', 'Animation', ['6', '8.5']],
 ['tt10004088', 'The Rank Room', '2019', 'Reality-TV', ['8', '7.4']],
 ['tt10004142',
  'Crime and Punishment',
  '2019',
  'Action,Adventure,Drama',
  ['690', '7.5']],
 ['tt10004144',
  'Judge, Jury...',
  '201

3466
Susan Drennan Mcgrath


In [13]:
movie_names[107048][0]
get_number(movie_names[97048][0])

101510

## Data analysis

*Document you choice and motivation for selected data mining method(s) here. Choose a data mining method(s) to use in Python code to perform an analysis of your chosen dataset. Describe why you chose the method(s) and what interesting things you have found from the analysis.*

*Replace the contents of this cell with your own text.*

In [3]:
# Add your own code

## Evaluation of results

*Document an evaluation your analysis results and describe how potentially actionable they are.*

*Replace the contents of this cell with your own text.*

In [None]:
# Add your own code

## Schedule and description of project plan

*Rough schedule for the project beyond the pilot study presented in 3-5. This does not have to be advanced, you can simply provide an estimate based upon reported schedules for similar projects in the literature.*

*Replace the contents of this cell with your own text.*

## Ethical aspects that need to be considered

*Are there ethical aspects that need to be considered? Are there legal implications (e.g., personal data / GDPR)? Are there implications if the case organization is a business, public authority, or nonprofit entity?*

*Replace the contents of this cell with your own text.*