# Prediction of Movie Revenue

Dataset from Kaggle: **"Full TMDB Movies Dataset 2024"** by *ASANICZKA*  
Source: https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies

### Essential Libraries

In [823]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

### Importing Dataset & Listing Variables

In [824]:
tmdbdata = pd.read_csv('TMDB_movie_dataset_v11.csv')
tmdbdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1024235 entries, 0 to 1024234
Data columns (total 24 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   id                    1024235 non-null  int64  
 1   title                 1024223 non-null  object 
 2   vote_average          1024235 non-null  float64
 3   vote_count            1024235 non-null  int64  
 4   status                1024235 non-null  object 
 5   release_date          897532 non-null   object 
 6   revenue               1024235 non-null  int64  
 7   runtime               1024235 non-null  int64  
 8   adult                 1024235 non-null  bool   
 9   backdrop_path         286248 non-null   object 
 10  budget                1024235 non-null  int64  
 11  homepage              110690 non-null   object 
 12  imdb_id               577196 non-null   object 
 13  original_language     1024235 non-null  object 
 14  original_title        1024223 non-

### Sieving columns

These will be the columns we work with.

In [825]:
# Keep only the columns used in the analysis; .copy() gives an independent DataFrame
keep_cols = ['id', 'title', 'status', 'revenue', 'vote_average', 'vote_count', 'runtime', 'adult', 'budget',
             'original_language', 'popularity', 'genres', 'production_companies', 'production_countries', 'spoken_languages']
sievedtmdbdata = tmdbdata[keep_cols].copy()
sievedtmdbdata.shape
#sievedtmdbdata.head()

(1024235, 15)

### Column Description
1. id: Unique identifier for each movie. (type: int)
2. title: Title of the movie. (type: str)
3. status: The status of the movie (e.g., Released, Rumored, Post Production, etc.). (type: str)
4. revenue: Total revenue generated by the movie. (type: int)
5. vote_average: Average vote or rating given by viewers. (type: float)
6. vote_count: Number of votes the movie has received. (type: int)
7. runtime: Duration of the movie in minutes. (type: int)
8. adult: Indicates if the movie is suitable only for adult audiences. (type: bool)
9. budget: Budget allocated for the movie. (type: int)
10. original_language: Original language in which the movie was produced. (type: str)
11. popularity: Popularity score of the movie. (type: float)
12. genres: List of genres the movie belongs to. (type: str)
13. production_companies: List of production companies involved in the movie. (type: str)
14. production_countries: List of countries involved in the movie production. (type: str)
15. spoken_languages: List of languages spoken in the movie. (type: str)
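
Since the list-like columns are stored as plain strings, they usually need splitting before any per-genre analysis. The sketch below is one possible way to do this, assuming the values are separated by a comma followed by a space, as in the sample rows. The `toy` frame and the `genre_list` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy rows using genre strings in the same format as the dataset
toy = pd.DataFrame({
    "title": ["Inception", "The Avengers", "Bumbhaiyya"],
    "genres": ["Action, Science Fiction, Adventure",
               "Science Fiction, Action, Adventure",
               None],
})

# Split each string on ", " into a Python list; missing values stay missing
toy["genre_list"] = toy["genres"].str.split(", ")

# explode() gives one row per (movie, genre) pair, which makes counting easy;
# value_counts() ignores the missing entry
genre_counts = toy.explode("genre_list")["genre_list"].value_counts()
print(genre_counts)
```

The same approach may not be safe for `production_companies`, because a company name could itself contain a comma.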

### Cleaning Data

- Convert all 0 values to NaN.
- Drop rows containing NaN:
    - The dataset contains movies with revenue = 0, runtime = 0, budget = 0, vote_count = 0 and so on. Such zeros are unlikely to be genuine values, since a real movie normally has a non-zero runtime, budget, etc., so they are treated as missing data.
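
The effect of these two steps can be seen on a small toy frame. This is only a sketch: the `toy` data and the `numeric_cols` list are made up for illustration, and, unlike the notebook cell, the replacement here is restricted to an assumed subset of numeric columns so that other columns are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the movie table
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "revenue": [1000, 0, 500],
    "budget": [0, 200, 300],
    "adult": [False, False, True],
})

# Columns in which a 0 is read as "not recorded" (an assumed subset)
numeric_cols = ["revenue", "budget"]

# Step 1: turn zeros into NaN, only in the chosen columns
cleaned = toy.copy()
cleaned[numeric_cols] = cleaned[numeric_cols].replace(0, np.nan)

# Step 2: keep only the rows where every chosen column is present
complete = cleaned.dropna(subset=numeric_cols)
print(complete)
```

Only movie `C` survives, because `A` has no budget and `B` has no revenue.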

In [826]:
sievedtmdbdata.replace(0, np.nan, inplace=True)
sievedtmdbdata

Unnamed: 0,id,title,status,revenue,vote_average,vote_count,runtime,adult,budget,original_language,popularity,genres,production_companies,production_countries,spoken_languages
0,27205,Inception,Released,8.255328e+08,8.364,34495.0,148.0,False,160000000.0,en,83.952,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili"
1,157336,Interstellar,Released,7.017292e+08,8.417,32571.0,169.0,False,165000000.0,en,140.241,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English
2,155,The Dark Knight,Released,1.004558e+09,8.512,30619.0,152.0,False,185000000.0,en,130.643,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin"
3,19995,Avatar,Released,2.923706e+09,7.573,29815.0,162.0,False,237000000.0,en,79.932,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish"
4,24428,The Avengers,Released,1.518816e+09,7.710,29166.0,143.0,False,220000000.0,en,98.082,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1024230,692340,Zoltan Kaszas: Modern Male,Released,,,,65.0,False,,en,0.863,Comedy,,United States of America,English
1024231,692341,Händel / Mozart: Der Messias,Released,,,,135.0,False,,de,0.600,Music,"3sat, ORF, Unitel",Austria,
1024232,692342,Jak se naučit švédsky,Released,,,,,False,,cs,0.600,Comedy,"Československá televize Praha, Filmové studio ...",Czechoslovakia,Czech
1024233,692345,Bumbhaiyya,Released,,,,5.0,False,,en,0.600,,,,


### Dropping rows with NaN values in the columns needed for numerical and categorical visualisation

In [827]:
sievedtmdbdata.dropna(subset=['revenue', 'budget', 'runtime', 'vote_average', 'popularity', 'genres', 'vote_count', 'original_language'], inplace=True)
sievedtmdbdata.shape

(10233, 15)

- After cleaning, 10233 movies remain, each with complete information in the fields required for both numerical and categorical analysis.
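
Because so many rows are removed, it can be useful to check which columns account for the missing values. The following is a minimal sketch on made-up data, not output from the real dataset. The names `toy`, `required`, `missing_per_column` and `fraction_kept` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which zeros have already been converted to NaN
toy = pd.DataFrame({
    "revenue": [100.0, np.nan, np.nan, 50.0],
    "budget": [10.0, 20.0, np.nan, np.nan],
    "runtime": [90.0, 100.0, 80.0, 120.0],
})
required = ["revenue", "budget", "runtime"]

# Number of missing values in each required column
missing_per_column = toy[required].isna().sum()
print(missing_per_column)

# Share of rows that survive dropna() on the required columns
rows_kept = len(toy.dropna(subset=required))
fraction_kept = rows_kept / len(toy)
print(f"{rows_kept} of {len(toy)} rows kept ({fraction_kept:.0%})")
```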

In [828]:
sievedtmdbdata

Unnamed: 0,id,title,status,revenue,vote_average,vote_count,runtime,adult,budget,original_language,popularity,genres,production_companies,production_countries,spoken_languages
0,27205,Inception,Released,8.255328e+08,8.364,34495.0,148.0,False,160000000.0,en,83.952,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili"
1,157336,Interstellar,Released,7.017292e+08,8.417,32571.0,169.0,False,165000000.0,en,140.241,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English
2,155,The Dark Knight,Released,1.004558e+09,8.512,30619.0,152.0,False,185000000.0,en,130.643,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin"
3,19995,Avatar,Released,2.923706e+09,7.573,29815.0,162.0,False,237000000.0,en,79.932,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish"
4,24428,The Avengers,Released,1.518816e+09,7.710,29166.0,143.0,False,220000000.0,en,98.082,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
346034,592192,WWR Adios Aurora,Released,1.122000e+03,10.000,1.0,110.0,False,8665.0,en,1.127,"Drama, Action",Women's Wrestling Revolution,United States of America,English
346035,309625,Winter Carnival,Released,4.742860e+05,6.000,1.0,105.0,False,412640.0,en,0.636,"Romance, Comedy, Drama",Walter Wanger Productions,United States of America,English
346215,280402,Freaky Night,Released,1.000000e+03,4.500,1.0,15.0,False,500.0,en,0.600,Horror,Mystery Forest,Norway,
346490,402714,Sming,Released,8.903800e+04,6.000,1.0,105.0,False,15000000.0,th,1.405,"Drama, Action, Horror",Fast Time Motion Pictures,Thailand,Thai
