<h1> CSMODEL Project 1 </h1>
<h2> Group 23</h2>
Members:
Lima, Alfonso Gabriel V.
Olalia, Pamela Kirsten G.
Ortega, A


## Netflix Original Films & IMDb Scores
The dataset used in this notebook is called *Netflix Original Films & IMDB Scores*. It covers all films produced by Netflix as of June 1, 2021, and records basic information about each film: title, genre, premiere date, runtime, IMDb score, and language.

The data in this dataset was acquired via web scraping of a [Wikipedia page](https://en.wikipedia.org/wiki/Lists_of_Netflix_original_films) by Nakul Lakhotia about Netflix's produced films over the years. The owner of the dataset then manually matched each film with its corresponding IMDb score. The IMDb scores were acquired from the official website, [imdb.com](https://www.imdb.com/), which is a premier website for movie reviews and critique.



## pandas, matplotlib and chardet
**pandas** is a software library for Python that is designed for data manipulation and data analysis. **matplotlib** is a software library for data visualization, which allows us to easily render various types of graphs. **chardet** is a software library for Python that serves as a universal character encoding detector. We will be using these libraries in this notebook; the import cell below also loads **numpy** and **seaborn**.

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import chardet

## Reading the dataset

In [10]:
with open('NetflixOriginals.csv', 'rb') as f:
    # detect the encoding from the raw bytes; for a large file, read only part of it (e.g. f.readline())
    result = chardet.detect(f.read())

In [11]:
netflix_df = pd.read_csv('NetflixOriginals.csv', encoding=result['encoding'])

## Structure of the Dataset

In [12]:
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 584 entries, 0 to 583
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Title       584 non-null    object 
 1   Genre       584 non-null    object 
 2   Premiere    584 non-null    object 
 3   Runtime     584 non-null    int64  
 4   IMDB Score  584 non-null    float64
 5   Language    584 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 27.5+ KB


The dataset has a total of 584 observations, one for each Netflix film, and 6 variables that describe each observation.

In [13]:
netflix_df.head()

| | Title | Genre | Premiere | Runtime | IMDB Score | Language |
|---|---|---|---|---|---|---|
| 0 | Enter the Anime | Documentary | August 5, 2019 | 58 | 2.5 | English/Japanese |
| 1 | Dark Forces | Thriller | August 21, 2020 | 81 | 2.6 | Spanish |
| 2 | The App | Science fiction/Drama | December 26, 2019 | 79 | 2.6 | Italian |
| 3 | The Open House | Horror thriller | January 19, 2018 | 94 | 3.2 | English |
| 4 | Kaali Khuhi | Mystery | October 30, 2020 | 90 | 3.4 | Hindi |


## Variables in the Dataset

The following are detailed descriptions of each column in the dataset: <br>
> 1. ***Title*** - The name of the film. This uniquely identifies each observation in the dataset. <br>
2. ***Genre*** - The type of film based on its narrative elements. This describes what kind of plot the film portrays. Some observations may have a mix of 2 or more genres. <br>
3. ***Premiere*** - The date when the film was released to the public.<br>
4. ***Runtime*** - The overall length of the film in minutes.<br>
5. ***IMDB Score*** - The score of the film on the IMDb website. The scores are taken from members of the IMDb community.<br>
6. ***Language*** - The languages used in the film. Some films may have a mix of 2 or more languages.

## Data Cleaning

#### Missing Values
The following code returns the number of null values for each variable in the dataset. A null value represents a missing value.

In [14]:
netflix_df.isnull().sum()

Title         0
Genre         0
Premiere      0
Runtime       0
IMDB Score    0
Language      0
dtype: int64

As seen above, no observation contains a null value in any variable.
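To show what the check looks like when gaps do exist, here is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up for illustration and are not part of the dataset). It counts missing values per column and then pulls out the affected rows for inspection.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for netflix_df, with two gaps added on purpose.
toy_df = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Runtime": [90, np.nan, 75],
    "Language": ["English", "Hindi", None],
})

# Per-column count of missing values, the same call used on netflix_df.
missing_per_column = toy_df.isnull().sum()

# Rows that contain at least one missing value.
rows_with_missing = toy_df[toy_df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

On the real dataset, `rows_with_missing` would be empty, since every count is 0.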

#### Data Type
The following code returns the data type of each variable in the dataset.

In [15]:
netflix_df.dtypes

Title          object
Genre          object
Premiere       object
Runtime         int64
IMDB Score    float64
Language       object
dtype: object

Almost all of the variables have the appropriate data type for their values; the exception is *Premiere*. Since the *Premiere* column contains the premiere date of each film, its appropriate data type is **Datetime**, not object/string.

The following code will convert the data type to **Datetime**.

In [16]:
netflix_df["Premiere"] = pd.to_datetime(netflix_df['Premiere'])
netflix_df.dtypes

Title                 object
Genre                 object
Premiere      datetime64[ns]
Runtime                int64
IMDB Score           float64
Language              object
dtype: object

Now all of the variables in the dataset have the appropriate data type for their values.
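A conversion like this can fail, or silently misparse, if some date strings follow a different pattern. The sketch below is one way to surface such rows. It assumes the `"Month day, year"` pattern seen in the `head()` output, and `raw_dates` is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample: two dates in the pattern shown by head(), plus one bad value.
raw_dates = pd.Series(["August 5, 2019", "January 19, 2018", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, format="%B %d, %Y", errors="coerce")

# The original strings that could not be parsed, for manual review.
unparsed = raw_dates[parsed.isna()]

print(parsed)
print(unparsed)
```

If `unparsed` is empty for the real column, every date matched the assumed pattern.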

#### Duplicate Values
The *Title* column is the only variable expected to be distinct for every observation, because it names the film rather than describing it. So, to check for duplicate observations, we count the unique values under the *Title* column. If the count equals the number of observations, no title is repeated and therefore no observation is duplicated.


In [17]:
netflix_df['Title'].nunique()

584

The code above returns the number of unique values under the *Title* column. It is equal to the number of observations (584); therefore, there are **no duplicate values in the dataset**.
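As a complementary check, pandas can flag duplicates directly. This is a minimal sketch on a hypothetical toy frame (`toy_df` is illustrative only), showing both a whole-row check and a title-only check.

```python
import pandas as pd

# Hypothetical toy frame with one repeated row.
toy_df = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film B", "Film C"],
    "Runtime": [90, 80, 80, 75],
})

# Number of rows that exactly repeat an earlier row across all columns.
full_row_duplicates = toy_df.duplicated().sum()

# Every row whose Title appears more than once (keep=False marks all copies).
repeated_titles = toy_df[toy_df.duplicated(subset="Title", keep=False)]

# True only when no Title value is repeated.
titles_are_unique = toy_df["Title"].is_unique

print(full_row_duplicates, titles_are_unique)
print(repeated_titles)
```

The title-only check is the stricter of the two: two different films could in principle share a title, so a repeated title is worth inspecting rather than dropping automatically.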

#### Data Format
Now we want to check whether the values within each column follow a consistent format. The following blocks of code display the unique values of each column.

In [18]:
netflix_df['Title'].unique()


array(['Enter the Anime', 'Dark Forces', 'The App', 'The Open House',
       'Kaali Khuhi', 'Drive', 'Leyla Everlasting',
       'The Last Days of American Crime', 'Paradox', 'Sardar Ka Grandson',
       'Searching for Sheela', 'The Call', 'Whipped',
       'All Because of You', 'Mercy', 'After the Raid', 'Ghost Stories',
       'The Last Thing He Wanted', 'What Happened to Mr. Cha?',
       'Death Note', "Hello Privilege. It's Me, Chelsea",
       'Secret Obsession', 'Sextuplets', 'The Girl on the Train',
       'Thunder Force', 'Fatal Affair', 'Just Say Yes',
       'Seriously Single', 'The Misadventures of Hedi and Cokeman',
       '5 Star Christmas', 'After Maria',
       'I Am the Pretty Thing That Lives in the House', 'Paris Is Us',
       'Porta dos Fundos: The First Temptation of Christ', 'Rattlesnake',
       'The Players', 'We Are One', 'Finding Agnes', 'IO', 'Sentinelle',
       'Sol Levante', 'The Binding', 'We Can Be Heroes',
       'Christmas Crossfire', 'Coin Heist', 'Mr

In [19]:
netflix_df['Genre'].unique()


array(['Documentary', 'Thriller', 'Science fiction/Drama',
       'Horror thriller', 'Mystery', 'Action', 'Comedy',
       'Heist film/Thriller', 'Musical/Western/Fantasy', 'Drama',
       'Romantic comedy', 'Action comedy', 'Horror anthology',
       'Political thriller', 'Superhero-Comedy', 'Horror',
       'Romance drama', 'Anime / Short', 'Superhero', 'Heist', 'Western',
       'Animation/Superhero', 'Family film', 'Action-thriller',
       'Teen comedy-drama', 'Romantic drama', 'Animation',
       'Aftershow / Interview', 'Christmas musical',
       'Science fiction adventure', 'Science fiction', 'Variety show',
       'Comedy-drama', 'Comedy/Fantasy/Family', 'Supernatural drama',
       'Action/Comedy', 'Action/Science fiction',
       'Romantic teenage drama', 'Comedy / Musical', 'Musical',
       'Science fiction/Mystery', 'Crime drama',
       'Psychological thriller drama', 'Adventure/Comedy', 'Black comedy',
       'Romance', 'Horror comedy', 'Christian musical',
       'Rom

For the *Genre* column, we can see that some multi-genre values are formatted inconsistently: some have spaces before and after the "/" while others do not. This might make values harder to find later on, so we remove the spaces around the separator. <br> <br>
The following code will do that:


In [20]:
netflix_df["Genre"] = netflix_df["Genre"].str.replace(" / ","/")
netflix_df["Genre"].unique()

array(['Documentary', 'Thriller', 'Science fiction/Drama',
       'Horror thriller', 'Mystery', 'Action', 'Comedy',
       'Heist film/Thriller', 'Musical/Western/Fantasy', 'Drama',
       'Romantic comedy', 'Action comedy', 'Horror anthology',
       'Political thriller', 'Superhero-Comedy', 'Horror',
       'Romance drama', 'Anime/Short', 'Superhero', 'Heist', 'Western',
       'Animation/Superhero', 'Family film', 'Action-thriller',
       'Teen comedy-drama', 'Romantic drama', 'Animation',
       'Aftershow/Interview', 'Christmas musical',
       'Science fiction adventure', 'Science fiction', 'Variety show',
       'Comedy-drama', 'Comedy/Fantasy/Family', 'Supernatural drama',
       'Action/Comedy', 'Action/Science fiction',
       'Romantic teenage drama', 'Comedy/Musical', 'Musical',
       'Science fiction/Mystery', 'Crime drama',
       'Psychological thriller drama', 'Adventure/Comedy', 'Black comedy',
       'Romance', 'Horror comedy', 'Christian musical',
       'Romantic 

Now the "/" separators in the multi-genre values are consistent.
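The literal replacement handles the `" / "` case only. A slightly more general sketch, under the assumption that other spacing variants might occur, uses a regular expression for the separator. It also builds a lower-cased comparison key, since the outputs in this notebook show both `'Variety show'` and `'Variety Show'`. The `genres` Series here is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample with several spacing variants and one capitalization variant.
genres = pd.Series([
    "Anime / Short", "Comedy /Musical", "Action/ Comedy",
    "Variety show", "Variety Show",
])

# Remove any whitespace around "/" and trim the ends of each value.
normalized = genres.str.replace(r"\s*/\s*", "/", regex=True).str.strip()

# Lower-cased key that treats 'Variety show' and 'Variety Show' as one genre.
comparison_key = normalized.str.lower()

print(normalized.tolist())
print(comparison_key.nunique())
```

Whether to merge values that differ only in capitalization is a judgment call; the key makes such pairs easy to spot before deciding.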

In [21]:
netflix_df["Premiere"].unique()

array(['2019-08-05T00:00:00.000000000', '2020-08-21T00:00:00.000000000',
       '2019-12-26T00:00:00.000000000', '2018-01-19T00:00:00.000000000',
       '2020-10-30T00:00:00.000000000', '2019-11-01T00:00:00.000000000',
       '2020-12-04T00:00:00.000000000', '2020-06-05T00:00:00.000000000',
       '2018-03-23T00:00:00.000000000', '2021-05-18T00:00:00.000000000',
       '2021-04-22T00:00:00.000000000', '2020-11-27T00:00:00.000000000',
       '2020-09-18T00:00:00.000000000', '2020-10-01T00:00:00.000000000',
       '2016-11-22T00:00:00.000000000', '2019-12-19T00:00:00.000000000',
       '2020-01-01T00:00:00.000000000', '2020-02-21T00:00:00.000000000',
       '2021-01-01T00:00:00.000000000', '2017-08-25T00:00:00.000000000',
       '2019-09-13T00:00:00.000000000', '2019-07-18T00:00:00.000000000',
       '2019-08-16T00:00:00.000000000', '2021-02-26T00:00:00.000000000',
       '2021-04-09T00:00:00.000000000', '2020-07-16T00:00:00.000000000',
       '2021-04-02T00:00:00.000000000', '2020-07-31

In [22]:
netflix_df["Runtime"].unique()

array([ 58,  81,  79,  94,  90, 147, 112, 149,  73, 139,  97, 101,  25,
       144, 115, 102, 100,  64,  99, 120, 105,  89, 107,  95,  37,  83,
        46,  85,  88,  86,  80,   4,  93, 106, 103, 119,  96, 113, 104,
        10,  98, 117,  70, 131,  87,  60, 116,  92, 121,  78, 114,  56,
        21,  63, 126, 142, 108, 125,  91,  49, 118,  34, 124,  52, 111,
        75, 148,  32,  23,  53, 132, 123, 122, 128,  82,  84,  42, 151,
        72,  30, 129,  44, 134, 109,  16,  41,  28,  74,   9, 155,  55,
        40,  17, 136, 130,  19,  54,  76,  39,   7,  57,  14,  31,  48,
        27,  45,  36,  47, 110, 138, 133, 140,  13,  11,  24,  15,  26,
       137,  71, 135,  12, 209,  51, 153])

In [23]:
netflix_df["IMDB Score"].unique()

array([2.5, 2.6, 3.2, 3.4, 3.5, 3.7, 3.9, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6,
       4.7, 4.8, 4.9, 5. , 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9,
       6. , 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7. , 7.1, 7.2,
       7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8. , 8.1, 8.2, 8.3, 8.4, 8.5,
       8.6, 9. ])

In [24]:
netflix_df["Language"].unique()

array(['English/Japanese', 'Spanish', 'Italian', 'English', 'Hindi',
       'Turkish', 'Korean', 'Indonesian', 'Malay', 'Dutch', 'French',
       'English/Spanish', 'Portuguese', 'Filipino', 'German', 'Polish',
       'Norwegian', 'Marathi', 'Thai', 'Swedish', 'Japanese',
       'Spanish/Basque', 'Spanish/Catalan', 'English/Swedish',
       'English/Taiwanese/Mandarin', 'Thia/English', 'English/Mandarin',
       'Georgian', 'Bengali', 'Khmer/English/French', 'English/Hindi',
       'Tamil', 'Spanish/English', 'English/Korean', 'English/Arabic',
       'English/Russian', 'English/Akan', 'English/Ukranian/Russian'],
      dtype=object)

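From the preceding code, the *Genre* column had inconsistent separators, which were then fixed. The *Language* output also shows spelling variants such as 'Thia/English' and 'English/Ukranian/Russian', and the same pair of languages listed in both orders ('English/Spanish' and 'Spanish/English'); these values are left unchanged here. The data in the remaining columns is consistent.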
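A minimal sketch of how the *Language* values could be normalized, should that be wanted. The helper `normalize_languages` and the `spelling_fixes` mapping are hypothetical, and the corrected spellings are assumptions about what the original entries intended.

```python
import pandas as pd

# Sample values copied from the Language output above.
languages = pd.Series([
    "English/Spanish", "Spanish/English",
    "Thia/English", "English/Ukranian/Russian",
])

# Assumed intended spellings (hypothetical mapping, not part of the dataset).
spelling_fixes = {"Thia": "Thai", "Ukranian": "Ukrainian"}

def normalize_languages(value):
    """Fix known misspellings and sort the languages alphabetically."""
    parts = [part.strip() for part in value.split("/")]
    parts = [spelling_fixes.get(part, part) for part in parts]
    return "/".join(sorted(parts))

normalized = languages.map(normalize_languages)
print(normalized.tolist())
```

Sorting discards whatever meaning the original order carried (for example, a primary language listed first), so this step only fits if the order is not important to the analysis.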

## Exploratory Data Analysis

In [53]:
netflix_df.describe()

| | Runtime | IMDB Score | LanguageValue | GenreValue |
|---|---|---|---|---|
| count | 584.0 | 584.0 | 584.0 | 584.0 |
| mean | 93.577055 | 6.271747 | 4.938356 | 21.457192 |
| std | 27.761683 | 0.979256 | 5.588794 | 29.044915 |
| min | 4.0 | 2.5 | 0.0 | 0.0 |
| 25% | 86.0 | 5.7 | 3.0 | 0.0 |
| 50% | 97.0 | 6.35 | 3.0 | 9.0 |
| 75% | 108.0 | 7.0 | 3.0 | 32.0 |
| max | 209.0 | 9.0 | 37.0 | 114.0 |


#### Question 1: Do Language and Genre have an effect on the IMDB Score and the Runtime?

In [54]:
# converting language category to numerical values
netflix_df['LanguageValue'] = pd.factorize(netflix_df.Language)[0]

In [55]:
netflix_df.Language.value_counts()

English                       401
Hindi                          33
Spanish                        31
French                         20
Italian                        14
Portuguese                     12
Indonesian                      9
Japanese                        6
Korean                          6
German                          5
Turkish                         5
English/Spanish                 5
Polish                          3
Dutch                           3
Marathi                         3
English/Japanese                2
English/Mandarin                2
English/Hindi                   2
Thai                            2
Filipino                        2
Thia/English                    1
English/Taiwanese/Mandarin      1
Swedish                         1
Spanish/English                 1
Spanish/Basque                  1
Spanish/Catalan                 1
Malay                           1
Georgian                        1
English/Swedish                 1
Tamil         

In [56]:
netflix_df.LanguageValue.value_counts()

3     401
4      33
1      31
10     20
2      14
12     12
7       9
20      6
6       6
14      5
5       5
11      5
15      3
9       3
17      3
30      2
18      2
26      2
0       2
13      2
35      1
28      1
34      1
33      1
32      1
31      1
36      1
29      1
19      1
27      1
25      1
24      1
23      1
22      1
21      1
16      1
8       1
37      1
Name: LanguageValue, dtype: int64
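For reference, `pd.factorize` assigns integer codes in order of first appearance, which is why English/Japanese (the first row of the dataset) is 0 and English is 3 in the counts above. The sketch below reproduces that on a short hypothetical Series whose first four values follow the `head()` output.

```python
import pandas as pd

# First four values follow the head() output; the last one repeats "Spanish".
languages = pd.Series(["English/Japanese", "Spanish", "Italian", "English", "Spanish"])

# codes: one integer per row; uniques: the label that each code stands for.
codes, uniques = pd.factorize(languages)

print(list(codes))    # codes follow order of first appearance
print(list(uniques))  # uniques[code] recovers the original label
```

The codes are labels only. Their numeric order carries no meaning, so a larger code does not indicate "more" of anything.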

In [57]:
# converting genre category to numerical values
netflix_df['GenreValue'] = pd.factorize(netflix_df.Genre)[0]

In [58]:
netflix_df.Genre.value_counts()

Documentary               159
Drama                      77
Comedy                     49
Romantic comedy            39
Thriller                   33
                         ... 
Horror anthology            1
Variety Show                1
Animation/Comedy            1
Science fiction/Action      1
Family/Comedy-drama         1
Name: Genre, Length: 115, dtype: int64

In [59]:
netflix_df.GenreValue.value_counts()

0      159
9       77
6       49
10      39
1       33
      ... 
58       1
7        1
8        1
12       1
114      1
Name: GenreValue, Length: 115, dtype: int64

In [60]:
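# correlate only the numeric columns; selecting them explicitly avoids errors on text columns in newer pandas versions
netflix_df[["Runtime", "IMDB Score", "LanguageValue", "GenreValue"]].corr()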

| | Runtime | IMDB Score | LanguageValue | GenreValue |
|---|---|---|---|---|
| Runtime | 1.0 | -0.040896 | -0.037447 | 0.042249 |
| IMDB Score | -0.040896 | 1.0 | 0.052679 | 0.109471 |
| LanguageValue | -0.037447 | 0.052679 | 1.0 | -0.055059 |
| GenreValue | 0.042249 | 0.109471 | -0.055059 | 1.0 |
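Because *LanguageValue* and *GenreValue* are codes assigned in order of first appearance, a correlation coefficient computed on them depends on that arbitrary ordering. One possible alternative, sketched here on hypothetical toy data rather than the real dataset, is to compare summary statistics per category with `groupby`.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Language": ["English", "English", "Hindi", "Hindi", "Spanish"],
    "Runtime": [100, 90, 120, 130, 95],
    "IMDB Score": [6.0, 7.0, 6.5, 5.5, 6.2],
})

# Mean and number of films per language for both outcome variables.
summary = toy_df.groupby("Language")[["Runtime", "IMDB Score"]].agg(["mean", "count"])

print(summary)
```

The `count` column matters when reading the means: in the real dataset many languages and genres have only one film, so their group means describe a single observation.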
