# Main Question: 

### Can we predict performance of a given movie?

In this project, we try to predict the **performance** of a movie with the help of several datasets. We extract, clean and explore the gathered data to gain insights into the main question, and then use a machine learning model to answer it.


Before we start, we define a movie's **performance** in terms of its **ratings** (given by critics), as well as the **revenue** and **profits** it earned in cinemas and theaters.

--

--

# Data Acquisition
We shall use movie data from TMDB and OMDb. Why scrape data from two sources?
Firstly, we might uncover more variables or factors that are useful when analysing performance. Secondly, OMDb offers several different variables for the same performance metric. For instance, the rating of a movie can be measured by its Rotten Tomatoes score, its IMDb rating, or both. To measure the profits of a movie, we can consider using the box-office values or the budget variables.
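
As an illustration of how revenue and budget could be combined into a profit figure, here is a minimal sketch on made-up numbers. The column names `budget`, `revenue` and `profit`, and the definition of profit as revenue minus budget, are assumptions for illustration and are not taken from the project's data.

```python
import pandas as pd

# Made-up numbers for illustration; not taken from the project's data
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "budget": [100, 50],
    "revenue": [250, 30],
})

# Profit is taken here as revenue minus budget (an assumed definition)
toy["profit"] = toy["revenue"] - toy["budget"]
print(toy)
```

A negative value in the assumed `profit` column marks a movie that earned less than it cost.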

In [None]:
#Basic libraries
import numpy as np
import pandas as pd
import requests
import json
import pprint

**NOTE**: Our API key has expired, thus running the code in this notebook will give an error. However, we have already saved the required data into multiple csv files, such as Bigger_Data.csv and cleanData.csv, which are provided in the same zip submission file.
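
One way to make a notebook robust to an expired key is to read from a saved CSV file whenever it exists and only call the API otherwise. The sketch below is not the code used in this project; `load_or_fetch` and its `fetch` argument are hypothetical names introduced for illustration.

```python
import os

import pandas as pd


def load_or_fetch(csv_path, fetch):
    """Return the saved CSV if it exists; otherwise call fetch() and save the result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    frame = fetch()  # fetch is any function that returns a DataFrame
    frame.to_csv(csv_path, index=False)
    return frame
```

With such a helper, the scraping code would only run on the first execution, and later runs would read the saved file.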

--

# TMDB Dataset Acquisition

In [None]:
movielist = []
for page in range(1, 9000):  # TMDB page numbers start at 1, so page 0 is skipped
    url = "https://api.themoviedb.org/3/discover/movie?api_key=4568df78fb309238e68974a24013b626&language=en-US&include_adult=false&page=" + str(page)
    # Note that we decided to exclude adult movies
    r = requests.get(url)
    if 200 <= r.status_code < 300:  # A 2xx status code indicates a successful request; 4xx codes (such as 401) are client errors and 5xx codes are server errors
        data = r.json()  # Parse the JSON response body
        movielist.extend(data["results"])  # Add every movie on this page to the list

movieData = pd.DataFrame(movielist)  # Convert the raw JSON records in the list to a DataFrame

We took into account the fact that children and teenagers cannot watch adult movies, which causes certain performance metrics, such as revenue or profits, to differ greatly between adult and non-adult movies. Therefore, we decided to exclude adult movies from our data scraping.


The same reasoning applies to the language of a movie. Since English is a global language, English movies should be available to audiences in almost every country. The audience size should therefore be more evenly distributed, and the performance metrics less "skewed". This will make our predictions fairer.

Thus, in our data extraction, we decided to include only English, non-adult movies, using the parameters &language=en-US
and include_adult=false in the request URL.
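
The query string can also be assembled from a dictionary rather than by string concatenation, which makes the individual parameters easier to see. The following is a minimal sketch using only the standard library; `discover_url` is a hypothetical helper and `YOUR_API_KEY` is a placeholder, not a real key.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/discover/movie"


def discover_url(api_key, page):
    """Build a discover URL with the query parameters described above."""
    params = {
        "api_key": api_key,
        "language": "en-US",
        "include_adult": "false",
        "page": page,
    }
    return BASE_URL + "?" + urlencode(params)


print(discover_url("YOUR_API_KEY", 1))
```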

In [None]:
movieData = pd.read_csv("Bigger_Data.csv")

movieData = movieData[movieData['imdb_id'].notnull()]  # Drop the rows without an IMDb ID from the TMDB set, since OMDb is queried by IMDb ID

print(len(movieData["imdb_id"]))  # Number of movies we can look up in the OMDb dataset
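
To show what the `notnull()` filter does, here is a small sketch on a made-up table. The titles and IDs are invented for illustration.

```python
import pandas as pd

# Made-up rows; the second movie has no IMDb ID
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "imdb_id": ["tt0000001", None, "tt0000003"],
})

# Keep only the rows whose imdb_id is present
with_id = toy[toy["imdb_id"].notnull()]
print(len(with_id))
```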

--

--

# OMDb Dataset Acquisition

In [None]:
headers = {'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
}  # Set a browser user agent so that our requests look like they come from a person using a browser.
# Certain websites might block what seems to be an automated request.

ombdList = []
for i in range(1000):  # OMDb API keys have a limit of 1000 requests per day
    url = "http://www.omdbapi.com/?i=" + str(movieData["imdb_id"].values[i]) + "&apikey=f625be79"
    r = requests.get(url, headers=headers)
    if 200 <= r.status_code < 300:  # successful GET request
        data = r.json()
        ombdList.append(data)

In [None]:
for i in range(1000, 2000):  # Next 1000 movies, using a second API key
    url = "http://www.omdbapi.com/?i=" + str(movieData["imdb_id"].values[i]) + "&apikey=76052a79"
    r = requests.get(url, headers=headers)
    if 200 <= r.status_code < 300:  # successful GET request
        data = r.json()
        ombdList.append(data)

In [None]:
for i in range(2000, 3000):  # Next 1000 movies, using a third API key
    url = "http://www.omdbapi.com/?i=" + str(movieData["imdb_id"].values[i]) + "&apikey=5018e12e"
    r = requests.get(url, headers=headers)
    if 200 <= r.status_code < 300:  # successful GET request
        data = r.json()
        ombdList.append(data)

In [None]:
for i in range(3000, 4000):  # Next 1000 movies, using a fourth API key
    url = "http://www.omdbapi.com/?i=" + str(movieData["imdb_id"].values[i]) + "&apikey=7b1bb186"
    r = requests.get(url, headers=headers)
    if 200 <= r.status_code < 300:  # successful GET request
        data = r.json()
        ombdList.append(data)

In [None]:
ombdData=pd.DataFrame(ombdList)  #Convert raw JSON data in list to Dataframe
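
The four loops above differ only in the index range and the API key, so the same idea can be written once. The sketch below only plans which key serves which ID and makes no network requests. `assign_keys` and `keep_found` are hypothetical helpers, the IDs and keys are placeholders, and the filter assumes that OMDb marks a failed lookup with `"Response": "False"` in the returned JSON.

```python
def assign_keys(imdb_ids, api_keys, daily_limit=1000):
    """Pair each IMDb ID with an API key, using each key for at most daily_limit IDs."""
    plan = []
    for position, imdb_id in enumerate(imdb_ids):
        key_index = position // daily_limit
        if key_index >= len(api_keys):
            break  # no keys left; the remaining IDs have to wait for another day
        plan.append((imdb_id, api_keys[key_index]))
    return plan


def keep_found(records):
    """Keep only the records reported as found (assumed: "Response" == "True")."""
    return [record for record in records if record.get("Response") == "True"]


# Placeholder IDs and keys, with a tiny limit so the result is easy to read
plan = assign_keys(["tt1", "tt2", "tt3", "tt4", "tt5"], ["KEY_A", "KEY_B"], daily_limit=2)
print(plan)
```

Each `(imdb_id, api_key)` pair in the plan corresponds to one request of the kind made in the loops above. With two keys and a limit of two, the fifth ID is left out of the plan.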

--

--

---
## Data Cleaning

In [None]:
ombdData.dtypes

As mentioned above, we judge the performance of a movie based on three separate metrics: "Ratings", "Revenue" and "Profits".

To increase the accuracy and reliability of a movie's ratings, we use more than one rating metric. Instead of relying solely on TMDB's popularity rating, we scrape OMDb for other metrics such as the IMDb rating and the Metascore.
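
OMDb returns these rating fields as strings, so they need to be converted to numbers before they can be compared or combined. The following is a minimal sketch on made-up values. The column `Metascore10` and the rescaling to a 0 to 10 range are assumptions for illustration, not steps taken in this notebook.

```python
import pandas as pd

# Made-up values in the string form that OMDb uses, including the "N/A" marker
toy = pd.DataFrame({
    "imdbRating": ["8.8", "N/A", "6.1"],
    "Metascore": ["74", "51", "N/A"],
})

# Convert the strings to numbers; anything that is not a number becomes NaN
for col in ["imdbRating", "Metascore"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Put the Metascore (0 to 100) on the same 0 to 10 scale as the IMDb rating
toy["Metascore10"] = toy["Metascore"] / 10
print(toy)
```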

Other factors that might influence the profit or ratings of a movie include its release date and its genre.


We also took OMDb's genres instead of TMDB's genre list, as the TMDB genre list only contains a dictionary of genre IDs, while OMDb contains a string. Taking OMDb's genre column therefore means that we do not have to scrape again to look up the genre names.
E.g. "Genre":"Action, Sci-Fi, Thriller" (OMDb) vs {ID: 4; ID: 3} (TMDB)
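
Because the genres arrive as one comma-separated string, they can be split into a list per movie when individual genres are needed. This is a minimal sketch on made-up rows; the `genre_list` column is a hypothetical name.

```python
import pandas as pd

# Made-up rows in the comma-separated form shown above
toy = pd.DataFrame({"Genre": ["Action, Sci-Fi, Thriller", "Drama"]})

# Split each string on ", " to get one list of genres per movie
toy["genre_list"] = toy["Genre"].str.split(", ")
print(toy["genre_list"].tolist())
```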


In [None]:
ombdData = ombdData[["imdbID","Metascore", "imdbRating", "Genre", "Year", "Ratings", "Rated"]]
#Merge the TMDB and OMDb datasets on the IMDb ID; how='right' keeps every OMDb row, even one without a TMDB match
CleaningData = pd.merge(movieData, ombdData, left_on = 'imdb_id', right_on = 'imdbID', how = 'right')
CleaningData.dtypes
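
The choice of `how` decides which rows survive the merge. The sketch below contrasts `how='right'` with `how='inner'` on two made-up tables that share only some IDs. The data is invented for illustration.

```python
import pandas as pd

# Made-up tables: tt1 appears only in the left table, tt4 only in the right one
tmdb = pd.DataFrame({"imdb_id": ["tt1", "tt2", "tt3"], "title": ["A", "B", "C"]})
omdb = pd.DataFrame({"imdbID": ["tt2", "tt3", "tt4"], "imdbRating": ["7.0", "6.5", "8.1"]})

# how='right' keeps every row of the right table, filling missing left columns with NaN
right = pd.merge(tmdb, omdb, left_on="imdb_id", right_on="imdbID", how="right")

# how='inner' keeps only the rows whose ID appears in both tables
inner = pd.merge(tmdb, omdb, left_on="imdb_id", right_on="imdbID", how="inner")

print(len(right), len(inner))
```

In this toy case the right merge keeps `tt4` with an empty `title`, while the inner merge drops it.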

---
## Dropping unnecessary columns
We drop the columns that clearly do not have any influence on profit or ratings.
Some of these columns are also categorical and stored as dictionaries; actors, for example, take too many distinct values for us
to analyse this variable.

In [None]:
to_drop = ['adult', 'backdrop_path', 'belongs_to_collection', 'genres', 'homepage', 'original_title','overview',
           'poster_path','status','video', 'Unnamed: 0', "id", "tagline", "spoken_languages", "production_companies", 
           "production_countries", "imdb_id", "imdbID"] 

CleaningData.drop(to_drop, inplace=True, axis=1)    

---
## Dropping null, N/A and zero values

In [None]:
CleaningData.isnull().sum() #Checking number of null values in each column

In [None]:
for col in CleaningData.columns:
    CleaningData = CleaningData[CleaningData[col] != "N/A"]  # Some of the missing values are labelled "N/A"

for col in CleaningData.columns:
    CleaningData = CleaningData[CleaningData[col] != 0]  # A value of 0 also marks missing data

CleaningData = CleaningData.dropna()  # Drop the rows that still contain null (NaN) values

CleaningData.reset_index(drop=True, inplace=True)  # Reset the index, since removing rows does not realign it
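
The three kinds of missing value handled above (the string "N/A", the number 0 and real nulls) can be seen on a small made-up table. This sketch uses invented data and checks for zeros only in an assumed `revenue` column.

```python
import numpy as np
import pandas as pd

# Made-up rows: one clean, one with zero revenue, one with "N/A", one with a real null
toy = pd.DataFrame({
    "revenue": [100, 0, 250, 80],
    "imdbRating": ["7.0", "6.5", "N/A", "8.1"],
    "Metascore": ["70", "65", "55", None],
})

cleaned = toy.replace("N/A", np.nan)        # turn the "N/A" marker into a real null
cleaned = cleaned.dropna()                  # drop every row that contains a null
cleaned = cleaned[cleaned["revenue"] != 0]  # drop the rows whose revenue is 0
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Only the first row survives all three filters.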

In [None]:
CleaningData.isnull().sum() #Checking number of null values in each column is 0

In [None]:
CleaningData.to_csv('cleanData.csv')  #This is our cleaned data without null values and unnecessary columns.