## DATA PREPROCESSING


# 1. Overview of preprocessing and data exploration

- Handling missing data

- Noise handling

In [1]:
# Import libraries
import requests
import numpy as np
import pandas as pd

# Read the raw data file


In [2]:
#Read data
data_origin = pd.read_csv("movie.csv")
print(data_origin)

                                           Poster_Link  \
0    https://m.media-amazon.com/images/M/MV5BM2MyNj...   
1    https://m.media-amazon.com/images/M/MV5BMTMxNT...   
2    https://m.media-amazon.com/images/M/MV5BNzA5ZD...   
3    https://m.media-amazon.com/images/M/MV5BNGNhMD...   
4    https://m.media-amazon.com/images/M/MV5BNDE4OT...   
..                                                 ...   
191  https://m.media-amazon.com/images/M/MV5BMzFkM2...   
192  https://m.media-amazon.com/images/M/MV5BN2JlZT...   
193  https://m.media-amazon.com/images/M/MV5BMjM1Nj...   
194  https://m.media-amazon.com/images/M/MV5BMjAwMT...   
195  https://m.media-amazon.com/images/M/MV5BMjAwMT...   

                                             Title Certificate  Runtime (min)  \
0                                    The Godfather           A            175   
1                                  The Dark Knight          UA            152   
2    The Lord of the Rings: The Return of the King          

The dataset has 196 rows and 16 columns.

## Does the raw data have duplicate rows?

In [3]:
# Check whether the data has duplicate rows
num_duplicated_rows = data_origin.duplicated().sum()
if num_duplicated_rows == 0:
    print("Your raw data has no duplicated lines.")
else:
    # Pick the singular or plural noun to match the count
    ext = "lines" if num_duplicated_rows > 1 else "line"
    print(f"Your raw data has {num_duplicated_rows} duplicated {ext}. Please deduplicate your raw data.")

Your raw data has 5 duplicated lines. Please deduplicate your raw data.


In [4]:
# Drop duplicate rows, keeping the first occurrence of each
data_origin.drop_duplicates(inplace=True)

# Check again whether the data has duplicate rows
num_duplicated_rows = data_origin.duplicated().sum()
if num_duplicated_rows == 0:
    print("Your raw data has no duplicated lines.")
else:
    # Pick the singular or plural noun to match the count
    ext = "lines" if num_duplicated_rows > 1 else "line"
    print(f"Your raw data has {num_duplicated_rows} duplicated {ext}. Please deduplicate your raw data.")

Your raw data has no duplicated lines.


The check found 5 duplicate rows; after dropping them, 191 rows remain.

## What data type does each column currently have? Are there any columns whose data types are not suitable for further processing?

In [7]:
#Type of each column
dtypes = data_origin.dtypes

print(dtypes)

Poster_Link          object
Title                object
Certificate          object
Runtime (min)         int64
Genre                object
Overview             object
Meta_score          float64
Director             object
No_of_Votes           int64
Gross                object
Year                  int64
imdbRating          float64
imdbRatingVotes       int64
rottenRating          int64
metacriticRating      int64
Actors               object
dtype: object
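In the listing above, `Gross` is stored as `object` even though it presumably holds monetary amounts, so it cannot be summarized numerically as it stands. The following is a minimal sketch of one way to convert it, assuming the values are strings with comma thousands separators (for example `"134,966,411"`); that format is an assumption, since the raw values are not shown here. The helper name `gross_to_numeric` and the sample frame are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the real format of the Gross column is an assumption.
sample = pd.DataFrame({"Gross": ["134,966,411", "534,858,444", None]})


def gross_to_numeric(series: pd.Series) -> pd.Series:
    """Strip comma separators and convert to float; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


sample["Gross_num"] = gross_to_numeric(sample["Gross"])
print(sample.dtypes)
```

With `errors="coerce"`, anything that is not a number after cleaning (including a placeholder such as `"unknown"`) ends up as `NaN` rather than raising an error.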


# Check the percentage of missing data in the columns

In [8]:
#Percentage of missing data
missing_percentage = data_origin.isnull().mean() * 100
print("Missing ratio")
print(missing_percentage)

Missing ratio
Poster_Link         0.00000
Title               0.00000
Certificate         0.00000
Runtime (min)       0.00000
Genre               0.00000
Overview            0.00000
Meta_score          0.00000
Director            0.00000
No_of_Votes         0.00000
Gross               1.04712
Year                0.00000
imdbRating          0.00000
imdbRatingVotes     0.00000
rottenRating        0.00000
metacriticRating    0.00000
Actors              0.00000
dtype: float64


In [10]:
# Fill missing values in all columns with "unknown"
data_origin.fillna("unknown", inplace=True)
#check again
#Percentage of missing data
missing_percentage = data_origin.isnull().mean() * 100
print("Missing ratio")
print(missing_percentage)

Missing ratio
Poster_Link         0.0
Title               0.0
Certificate         0.0
Runtime (min)       0.0
Genre               0.0
Overview            0.0
Meta_score          0.0
Director            0.0
No_of_Votes         0.0
Gross               0.0
Year                0.0
imdbRating          0.0
imdbRatingVotes     0.0
rottenRating        0.0
metacriticRating    0.0
Actors              0.0
dtype: float64
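Here only `Gross` had missing values, so the blanket `fillna("unknown")` touched a single text column. If a numeric column also had gaps, the same call would put strings into it and turn it into `object`. A minimal sketch of a more selective variant, which fills only the non-numeric columns, is shown below; the small frame `df` is made-up illustration data, not taken from `movie.csv`.

```python
import numpy as np
import pandas as pd

# Made-up illustration data with one gap in a text column and one in a numeric column
df = pd.DataFrame({
    "Gross": ["1,000", None],
    "Meta_score": [90.0, np.nan],
})

# Select the non-numeric columns and fill only those with the placeholder
text_cols = df.select_dtypes(exclude="number").columns
df[text_cols] = df[text_cols].fillna("unknown")

print(df)
print(df.dtypes)
```

The numeric column keeps its `float64` dtype and its `NaN`, which can then be handled separately (for example by imputation) if needed.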


### Having measured the percentage of missing data in each column, we now split the columns into two groups for processing: numeric and non-numeric.

# For each column with numeric data type, how are the values distributed? 

For columns with numeric data types, we will calculate:
- Percentage (from 0 to 100) of missing values
- The min
- The lower quartile
- The median
- The upper quartile
- The max

Columns with numeric data: Runtime (min), Meta_score, No_of_Votes, Year, imdbRating, imdbRatingVotes, rottenRating, and metacriticRating.
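The summary listed above can be sketched as follows. This is one possible implementation, not necessarily the one used for this notebook: the helper name `numeric_profile` and the small `demo` frame are hypothetical, and in the notebook the function would be called on `data_origin` instead.

```python
import numpy as np
import pandas as pd


def numeric_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one column per numeric column of df, one row per statistic."""
    num = df.select_dtypes(include="number")
    profile = pd.DataFrame({
        "missing_percentage": num.isna().mean() * 100,
        "min": num.min(),
        "lower_quartile": num.quantile(0.25),
        "median": num.median(),
        "upper_quartile": num.quantile(0.75),
        "max": num.max(),
    })
    # Transpose so that statistics are rows and data columns are columns
    return profile.T


# Made-up illustration data; the text column is ignored by the profile
demo = pd.DataFrame({
    "Title": ["a", "b", "c", "d", "e"],
    "Runtime (min)": [100, 120, 140, 160, 180],
    "Meta_score": [80.0, 90.0, np.nan, 70.0, 100.0],
})

profile = numeric_profile(demo)
print(profile)
```

Missing values are skipped by `min`, `quantile`, `median`, and `max` by default, so the statistics describe only the observed values while `missing_percentage` reports how many were absent.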

# Save data to the processed file


In [12]:
data_origin.to_csv('data_processed.csv',index=False, encoding='utf-8-sig')