## Netflix IMDb Score Analysis

IMDb, which stands for Internet Movie Database, is an online database of information related to films, television programs, home videos, and streaming content. It is a comprehensive resource that includes details about movies, TV shows, cast and crew information, user ratings, reviews, and more. Users can rate and review movies on IMDb based on their personal opinions and experiences.

In [1]:
# importing necessary libraries
import pandas as pd
import os

In [2]:
os.getcwd()

'C:\\Users\\WELLS\\Documents\\rebuilding portfolio projects\\DecoderBot\\Netflix'

In [3]:
data = pd.read_csv(r'C:\Users\WELLS\Documents\rebuilding portfolio projects\DecoderBot\Netflix\Task 1- Netflix TV Shows and Movies.csv')

In [4]:
# previewing the data
data.head()

| | index | id | title | type | description | release_year | age_certification | runtime | imdb_id | imdb_score | imdb_votes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tm84618 | Taxi Driver | MOVIE | A mentally unstable Vietnam War veteran works ... | 1976 | R | 113 | tt0075314 | 8.3 | 795222.0 |
| 1 | 1 | tm127384 | Monty Python and the Holy Grail | MOVIE | King Arthur, accompanied by his squire, recrui... | 1975 | PG | 91 | tt0071853 | 8.2 | 530877.0 |
| 2 | 2 | tm70993 | Life of Brian | MOVIE | Brian Cohen is an average young Jewish man, bu... | 1979 | R | 94 | tt0079470 | 8.0 | 392419.0 |
| 3 | 3 | tm190788 | The Exorcist | MOVIE | 12-year-old Regan MacNeil begins to adapt an e... | 1973 | R | 133 | tt0070047 | 8.1 | 391942.0 |
| 4 | 4 | ts22164 | Monty Python's Flying Circus | SHOW | A British sketch comedy series with the shows ... | 1969 | TV-14 | 30 | tt0063929 | 8.8 | 72895.0 |


In [5]:
data.shape

(5283, 11)

### Understanding the data

-----

**id**: This column represents a unique identifier for each movie or TV show in the dataset.


**title**: This column contains the title of the movie or TV show.


**type**: This column indicates the type of content, such as whether it's a movie or a TV show.


**description**: This column contains a brief description or summary of the movie or TV show.


**release_year**: This column indicates the year when the movie or TV show was released.


**age_certification**: This column contains the age certification or rating assigned to the content, indicating the appropriate audience age.


**runtime**: This column represents the duration or runtime of the movie or TV show in minutes.


**imdb_id**: This column contains the IMDb identifier for the movie or TV show.


**imdb_score**: This column represents the IMDb score assigned to the movie or TV show by users. It ranges from 1 (lowest) to 10 (highest).


**imdb_votes**: This column contains the number of votes that contributed to the IMDb score for the movie or TV show.

----

### Data Preprocessing

In [6]:
# let's drop the columns we won't need for the analysis

data.drop(columns = ['index', 'id', 'imdb_id'], inplace = True)

In [7]:
data.shape

(5283, 8)

In [8]:
# checking for missing values, incorrect formatting...

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5283 entries, 0 to 5282
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   title              5283 non-null   object 
 1   type               5283 non-null   object 
 2   description        5278 non-null   object 
 3   release_year       5283 non-null   int64  
 4   age_certification  2998 non-null   object 
 5   runtime            5283 non-null   int64  
 6   imdb_score         5283 non-null   float64
 7   imdb_votes         5267 non-null   float64
dtypes: float64(2), int64(2), object(4)
memory usage: 330.3+ KB


---

As the output shows, the *age_certification* column is missing for 2,285 of the 5,283 observations (about 43%). We can't afford to drop those rows, because losing that much data would significantly affect the analysis. We also can't impute the most frequent value in the column, because filling so many rows with a single category would bias the analysis.

Instead, we remove the entire column, so that every remaining metric is computed over the same set of rows.
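To put a number on the missing values before deciding between dropping and imputing, one option is to compute the share of nulls per column. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not the Netflix data); on the real data the same expression would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "age_certification": ["R", np.nan, np.nan, "PG"],
    "imdb_votes": [100.0, 250.0, np.nan, 40.0],
})

# isna() marks missing cells; the column mean of those booleans is the
# fraction missing, which we express as a percentage.
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)
```

A column with a large share of missing values is a candidate for removal, while a column with only a handful of nulls can usually be handled by dropping the affected rows.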

In [9]:
# dropping the age_certification column

data.drop(columns = ['age_certification'], inplace = True)

Next, let's look at the imdb_votes column. It has 5,267 non-null values out of 5,283, so only 16 rows are missing. In this case, we simply drop the rows where imdb_votes is null.

In [10]:
data.dropna(subset=['imdb_votes'], inplace=True)

In [11]:
data.shape

(5267, 7)

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5267 entries, 0 to 5282
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         5267 non-null   object 
 1   type          5267 non-null   object 
 2   description   5263 non-null   object 
 3   release_year  5267 non-null   int64  
 4   runtime       5267 non-null   int64  
 5   imdb_score    5267 non-null   float64
 6   imdb_votes    5267 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 329.2+ KB


In [13]:
# dropping the description column (free text, and it still has 4 missing values)

data.drop(columns = ['description'], inplace = True)

In [14]:
# dropping duplicate rows, if any

data.drop_duplicates(inplace=True)

In [15]:
data.shape

(5267, 6)
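Because `drop_duplicates` removes rows silently, it can help to count the duplicates first. The following is a minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `data`):

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values only).
toy = pd.DataFrame({
    "title": ["A", "B", "A"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
    "imdb_score": [7.0, 8.1, 7.0],
})

# duplicated() flags every row that repeats an earlier row in all columns.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates()

print(n_duplicates, len(toy), len(deduped))
```

In the notebook the row count is 5,267 both before and after the call, which suggests the data had no fully duplicated rows.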

The data is now clean and correctly formatted, with no missing values or duplicate rows. It is ready for analysis and further preprocessing.

In [16]:
# first of all, let's check the summary of the numeric data

data.describe()

| | release_year | runtime | imdb_score | imdb_votes |
|---|---|---|---|---|
| count | 5267.0 | 5267.0 | 5267.0 | 5267.0 |
| mean | 2015.869375 | 79.308335 | 6.533264 | 23407.19 |
| std | 7.353379 | 38.886352 | 1.161348 | 87134.32 |
| min | 1953.0 | 0.0 | 1.5 | 5.0 |
| 25% | 2015.0 | 46.0 | 5.8 | 521.0 |
| 50% | 2018.0 | 87.0 | 6.6 | 2279.0 |
| 75% | 2020.0 | 106.0 | 7.4 | 10144.0 |
| max | 2022.0 | 235.0 | 9.6 | 2268288.0 |


IMDb scores range from 1 to 10. The minimum (1.5) and maximum (9.6) of imdb_score both fall inside that range, so there are no invalid scores in the data.
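The same range check can be written as an explicit test instead of being read off the summary table. The sketch below assumes a made-up frame (`toy`) whose values mirror the extremes reported by `describe()`. It also flags rows with a runtime of 0, since the summary shows a minimum runtime of 0.0 that may be worth a closer look.

```python
import pandas as pd

# Toy frame mirroring the extremes seen in the summary (illustrative only).
toy = pd.DataFrame({
    "imdb_score": [8.3, 1.5, 9.6],
    "runtime": [113, 0, 30],
})

# between() is inclusive on both ends by default, so this checks 1 <= score <= 10.
scores_in_range = bool(toy["imdb_score"].between(1, 10).all())

# Rows whose runtime is zero minutes; whether these are errors is a judgment call.
zero_runtime_rows = toy[toy["runtime"] == 0]

print(scores_in_range, len(zero_runtime_rows))
```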

In [17]:
data.head()

| | title | type | release_year | runtime | imdb_score | imdb_votes |
|---|---|---|---|---|---|---|
| 0 | Taxi Driver | MOVIE | 1976 | 113 | 8.3 | 795222.0 |
| 1 | Monty Python and the Holy Grail | MOVIE | 1975 | 91 | 8.2 | 530877.0 |
| 2 | Life of Brian | MOVIE | 1979 | 94 | 8.0 | 392419.0 |
| 3 | The Exorcist | MOVIE | 1973 | 133 | 8.1 | 391942.0 |
| 4 | Monty Python's Flying Circus | SHOW | 1969 | 30 | 8.8 | 72895.0 |


Now we perform feature engineering using the imdb_votes and imdb_score columns.

We create a column called imdb_strength by dividing imdb_votes by imdb_score.

Why are we doing this?

Because how much weight an IMDb score deserves depends on the number of votes behind it.

We therefore need imdb_strength to compare the strength of the score across different titles.

In [18]:
data['imdb_strength'] = data.imdb_votes / data.imdb_score

In [19]:
data['imdb_strength'] = data['imdb_strength'].round(1)

In [20]:
data.head()

| | title | type | release_year | runtime | imdb_score | imdb_votes | imdb_strength |
|---|---|---|---|---|---|---|---|
| 0 | Taxi Driver | MOVIE | 1976 | 113 | 8.3 | 795222.0 | 95809.9 |
| 1 | Monty Python and the Holy Grail | MOVIE | 1975 | 91 | 8.2 | 530877.0 | 64741.1 |
| 2 | Life of Brian | MOVIE | 1979 | 94 | 8.0 | 392419.0 | 49052.4 |
| 3 | The Exorcist | MOVIE | 1973 | 133 | 8.1 | 391942.0 | 48387.9 |
| 4 | Monty Python's Flying Circus | SHOW | 1969 | 30 | 8.8 | 72895.0 | 8283.5 |
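To see how the new column behaves, here is a small worked example that recomputes imdb_strength for two rows taken from the preview. The frame name `toy` is an assumption for illustration; the values come from the table above.

```python
import pandas as pd

# Two rows copied from the preview output.
toy = pd.DataFrame({
    "title": ["Taxi Driver", "Monty Python's Flying Circus"],
    "imdb_score": [8.3, 8.8],
    "imdb_votes": [795222.0, 72895.0],
})

# Same definition as in the notebook: votes divided by score, rounded to 1 decimal.
toy["imdb_strength"] = (toy["imdb_votes"] / toy["imdb_score"]).round(1)

print(toy[["title", "imdb_strength"]])
```

Monty Python's Flying Circus has the higher score (8.8 versus 8.3) but a much lower imdb_strength, because the value is driven mainly by the number of votes. With this definition, imdb_strength reads more as a measure of how many votes back a score than as a measure of the score itself.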


Now, the data is ready for the dashboard build-up.

In [None]:
# exporting the cleaned data

data.to_csv('netflix.csv', index = False)
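As a quick sanity check on the export, the sketch below writes a made-up one-row frame (`toy`, an illustrative assumption) to an in-memory buffer instead of a file and reads it back. It shows that `index = False` keeps the row index out of the CSV, so no extra unnamed column appears when the file is loaded again.

```python
import io

import pandas as pd

# One illustrative row; the real notebook exports the full cleaned frame.
toy = pd.DataFrame({
    "title": ["Taxi Driver"],
    "imdb_score": [8.3],
    "imdb_votes": [795222.0],
})

# Write to an in-memory text buffer so the example leaves no file behind.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new frame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
```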