# ML Project : Predicting Movie Revenue
Made by:

Corentin Maillard 21306

Mourad Mettioui 195019

## A) Data Understanding and analysis

### 1) Load the Dataset and libraries

In [152]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset_file_path = 'dataset.csv'
dataset_supplement_file_path = 'dataset_supplement.csv'
# use the first column of each CSV file as the index
dataset_data = pd.read_csv(dataset_file_path, index_col=0)
dataset_supplement_data = pd.read_csv(dataset_supplement_file_path, index_col=0)

dataset_supplement_data.drop(columns=['title'], inplace=True)
left = dataset_data.set_index(['id'])
right = dataset_supplement_data.set_index(['movie_id'])
data_combind = left.join(right)
# data_combind.set_index(['id'])
data_combind = data_combind.sort_values(by = 'id')
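The join above can be illustrated on two small frames. This is only a sketch: the toy values and the helper name `combine` are assumptions made for the example and are not part of the project's data or code.

```python
import pandas as pd

# Toy stand-ins for dataset.csv and dataset_supplement.csv (values are made up)
main = pd.DataFrame({"id": [12, 5, 7], "revenue": [100.0, 250.0, None]})
supplement = pd.DataFrame({"movie_id": [5, 12], "genres": ["Drama", "Comedy"]})

def combine(main_df, supplement_df):
    """Left-join the supplement onto the main table, matching id with movie_id."""
    left = main_df.set_index("id")
    right = supplement_df.set_index("movie_id")
    # DataFrame.join aligns on the index and keeps every row of `left`
    return left.join(right).sort_index()

combined = combine(main, supplement)
print(combined)
```

A movie that is missing from the supplement (id 7 in the toy data) is kept, with NaN in the supplement columns, because `join` performs a left join by default.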


### 2) Understand the structure of the dataset

In [181]:
# examine the contents of the resulting DataFrame (data_combind.head() would show only the first five rows):
data_combind

We can see the different features which are:
- vote_count
- vote_average
- title
- tagline
- status
- spoken_languages
- duration
- release
- countries_of_production
- production
- popularity
- overview
- original_title
- original_language
- keywords
- id
- homepage
- genres
- financial_investment
- actors
- production_crew

And we have the target:
- revenue

In [123]:
# check whether every column is complete (non-null count and dtype per column)
data_combind.info()
# isnull().sum() gives the number of missing values per column directly
data_combind.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
Index: 4803 entries, 5 to 459488
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   vote_count               4803 non-null   int64  
 1   vote_average             4323 non-null   float64
 2   title                    4803 non-null   object 
 3   tagline                  3959 non-null   object 
 4   status                   4803 non-null   object 
 5   spoken_languages         4803 non-null   object 
 6   duration                 4801 non-null   float64
 7   revenue                  4755 non-null   float64
 8   release                  4802 non-null   object 
 9   countries_of_production  4803 non-null   object 
 10  production               4803 non-null   object 
 11  popularity               4803 non-null   float64
 12  overview                 4800 non-null   object 
 13  original_title           4803 non-null   object 
 14  original_language        45

vote_count                    0
vote_average                480
title                         0
tagline                     844
status                        0
spoken_languages              0
duration                      2
revenue                      48
release                       1
countries_of_production       0
production                    0
popularity                    0
overview                      3
original_title                0
original_language           240
keywords                      0
homepage                   3091
genres                        0
financial_investment        384
actors                        0
production_crew               0
dtype: int64

We can see that there are only 1712 non-null homepage values out of 4803 entries, so we will discard this feature because too much of it is missing.

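For the other features, the counts above show which ones are complete and which ones have missing values: tagline (844), vote_average (480), financial_investment (384), original_language (240) and revenue (48) are the most affected, while overview, duration and release miss only a few values.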
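The decision to discard homepage (3091 missing values out of 4803, about 64%) can also be expressed as a rule on the share of missing values per column. The sketch below is an illustration on a toy frame; the helper name `columns_above_missing_ratio` and the 0.5 threshold are assumptions, not choices made in this project.

```python
import pandas as pd

def columns_above_missing_ratio(df, threshold=0.5):
    """Return the columns whose share of missing values exceeds `threshold`."""
    missing_ratio = df.isnull().mean()  # fraction of missing values per column
    return missing_ratio[missing_ratio > threshold].index.tolist()

# Toy data (made up): homepage is mostly missing, duration only partly
toy = pd.DataFrame({
    "homepage": [None, None, "http://example.com", None],
    "duration": [90.0, 120.0, None, 105.0],
    "title": ["a", "b", "c", "d"],
})

to_drop = columns_above_missing_ratio(toy, threshold=0.5)
cleaned = toy.drop(columns=to_drop)
print(to_drop)
```

With a threshold of this kind, a column such as tagline (844 missing values) would be kept, so it would still need a separate decision.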

In [139]:
# count the number of distinct values in each column
data_combind.nunique()

vote_count                 1609
vote_average                 70
title                      4800
tagline                    3944
status                        3
spoken_languages            544
duration                    156
revenue                    3268
release                    3280
countries_of_production     469
production                 3697
popularity                 4802
overview                   4800
original_title             4801
original_language            37
keywords                   4222
homepage                   1691
genres                     1175
financial_investment        412
actors                     4761
production_crew            4776
dtype: int64

In [135]:
# the id is not information that will ever be seen by the customer,
# so we can assume that it is not an important feature for our model.
# It is already the index here, so it does not need to be dropped as a column:
# data_combind.drop(columns=['id'], inplace=True)
# homepage has too many missing values, so we drop it
data_combind.drop(columns=['homepage'], inplace=True)
data_combind.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4803 entries, 5 to 459488
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   vote_count               4803 non-null   int64  
 1   vote_average             4323 non-null   float64
 2   title                    4803 non-null   object 
 3   tagline                  3959 non-null   object 
 4   status                   4803 non-null   object 
 5   spoken_languages         4803 non-null   object 
 6   duration                 4801 non-null   float64
 7   revenue                  4755 non-null   float64
 8   release                  4802 non-null   object 
 9   countries_of_production  4803 non-null   object 
 10  production               4803 non-null   object 
 11  popularity               4803 non-null   float64
 12  overview                 4800 non-null   object 
 13  original_title           4803 non-null   object 
 14  original_language        45

### 3) Perform Exploratory Data Analysis (EDA)

In [147]:
data_combind.describe()

| | vote_count | vote_average | duration | revenue | popularity | financial_investment |
|---|---|---|---|---|---|---|
| count | 4803.0 | 4323.0 | 4801.0 | 4755.0 | 4803.0 | 4419.0 |
| mean | 690.217989 | 6.090354 | 106.875859 | 82314860.0 | 21.492301 | 28984660.0 |
| std | 1234.585891 | 1.193315 | 22.611935 | 163087200.0 | 31.81665 | 40655260.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 54.0 | 5.6 | 94.0 | 0.0 | 4.66807 | 750000.0 |
| 50% | 235.0 | 6.2 | 103.0 | 19184020.0 | 12.921594 | 14800000.0 |
| 75% | 737.0 | 6.8 | 118.0 | 93119110.0 | 28.313505 | 40000000.0 |
| max | 13752.0 | 10.0 | 338.0 | 2787965000.0 | 875.581305 | 380000000.0 |


In [161]:
# sns.pairplot(dataset_data)  # Visualize pairwise relationships using a pairplot
# plt.show()

We can see that the revenue is affected by:
- the financial investment
- the vote count

For the other features (vote average, duration, popularity), their impact on the revenue is not clearly visible.

We can see that the vote average is affected by:
- the vote count
- the financial investment
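The visual impressions listed above can be complemented by a numeric check. The following sketch ranks the numeric columns by their Pearson correlation with a target column; the helper name `correlation_with` and the toy values are assumptions made for the illustration, and nothing here reports results on the real dataset.

```python
import pandas as pd

def correlation_with(df, target):
    """Pearson correlation of every numeric column with `target`, strongest first."""
    numeric = df.select_dtypes(include="number")  # ignore text columns
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data (made up): revenue grows almost linearly with the investment
toy = pd.DataFrame({
    "financial_investment": [1.0, 2.0, 3.0, 4.0, 5.0],
    "duration": [100.0, 95.0, 130.0, 90.0, 110.0],
    "revenue": [2.1, 3.9, 6.2, 8.0, 9.9],
    "title": ["a", "b", "c", "d", "e"],
})

ranking = correlation_with(toy, "revenue")
print(ranking)
```

Applied to the combined data, the same call could be used for revenue and for vote_average. Correlation only captures linear relationships, so it supplements the pairplot rather than replacing it.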

One big problem is that the non-numeric features cannot be explored with the approach used above.

We will use one-hot encoding, which generally performs best, on:
- the countries_of_production
- the spoken_languages
- genres
- status
- original_language
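As a minimal sketch of one-hot encoding for the features listed above, the example below treats two cases. It assumes that single-valued columns such as status can be passed to `pd.get_dummies`, and that multi-valued columns such as genres have first been reduced to comma-separated strings; the toy values are made up.

```python
import pandas as pd

# Toy data (made up): status holds one value per row, genres holds several
toy = pd.DataFrame({
    "status": ["Released", "Rumored", "Released"],
    "genres": ["Action,Drama", "Drama", "Comedy,Action"],
})

# Single-valued column: one indicator column per distinct value
status_dummies = pd.get_dummies(toy["status"], prefix="status")

# Multi-valued column: split on the comma, then one indicator column per item
genre_dummies = toy["genres"].str.get_dummies(sep=",").add_prefix("genres_")

encoded = pd.concat(
    [toy.drop(columns=["status", "genres"]), status_dummies, genre_dummies],
    axis=1,
)
print(encoded)
```

Splitting before encoding matters for columns like genres and spoken_languages: they have 1175 and 544 distinct combinations respectively, so encoding each combination as its own category would create far more columns than encoding the individual items.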

For features like the title, tagline, overview and original_title, we know that they can have an influence on the revenue, but they are too complex to exploit, so we will discard them.

In [180]:

import json

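def extract_iso(df, col_name, iso_num):
    """Replace each JSON list in df[col_name] by a comma-separated string.

    Every cell is expected to hold a JSON-encoded list of dicts; the value
    stored under the key iso_num is taken from each dict.
    """
    def join_values(raw):
        # parse the JSON string, then keep only the requested field of each item
        return ','.join(str(item[iso_num]) for item in json.loads(raw))

    # assign through df[col_name] so that the DataFrame itself is updated
    df[col_name] = df[col_name].apply(join_values)
    return df[col_name]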

# Use the function with the column name and the key to extract (ISO code or name)

# extract_iso(data_combind, 'countries_of_production', 'iso_3166_1')
# extract_iso(data_combind, 'spoken_languages', 'iso_639_1')
# extract_iso(data_combind, 'keywords', 'name')
# extract_iso(data_combind, 'genres', 'name')
# extract_iso(data_combind, 'actors', 'name')
# extract_iso(data_combind, 'production_crew', 'name')
# extract_iso(data_combind, 'production', 'name')





data_combind

# print(data_combind.nunique())
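To make the effect of the extraction step concrete, here is a small sketch on a toy column. It assumes that each cell is a JSON-encoded list of dicts, as the use of `json.loads` above suggests; the helper name `join_json_field`, the sample strings and the choice to map a missing cell to an empty string are assumptions made for the illustration.

```python
import json
import pandas as pd

def join_json_field(raw, key):
    """Turn a JSON-encoded list of dicts into a comma-separated string of one field."""
    if pd.isna(raw):
        return ""  # assumption: treat a missing cell like an empty list
    return ",".join(str(item[key]) for item in json.loads(raw))

# Toy column (made up) with two languages, an empty list and a missing value
toy = pd.DataFrame({
    "spoken_languages": [
        '[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "fr", "name": "French"}]',
        "[]",
        None,
    ]
})

toy["spoken_languages"] = toy["spoken_languages"].apply(
    lambda raw: join_json_field(raw, "iso_639_1")
)
print(toy["spoken_languages"].tolist())
```

The resulting comma-separated strings are the form that a later encoding step can split back into individual items.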