# **PART 2. DATA PREPROCESSING**

## **1. Overview of preprocessing and data exploration**
- Handling missing data
- Handling noise
- Standardizing data
- Getting a first look at the data

## **2. Read the original obtained data file**


### Import libraries

In [1]:
#import library
import requests
import numpy as np
import pandas as pd

In [2]:
#Read data
IMDB = pd.read_csv('imdb.csv')
TMDB = pd.read_csv('TMDB.csv')
ROTTEN = pd.read_csv('rotten.csv')

In [3]:
print(IMDB.head())

                      Title  imdbRating
0  The Shawshank Redemption         9.3
1             The Godfather         9.2
2           The Dark Knight         9.0
3     The Godfather Part II         9.0
4              12 Angry Men         9.0


In [4]:
print(TMDB.head())

                                         Poster_Link  \
0  https://image.tmdb.org/t/p/w500/9cqNxx0GxF0bfl...   
1  https://image.tmdb.org/t/p/w500/3bhkrj58Vtu7en...   
2  https://image.tmdb.org/t/p/w500/hek3koDUyRQk7F...   
3  https://image.tmdb.org/t/p/w500/sF1U4EUQS8YHUY...   
4  https://image.tmdb.org/t/p/w500/ow3wq89wM8qd5X...   

                      Title  \
0  The Shawshank Redemption   
1             The Godfather   
2     The Godfather Part II   
3          Schindler's List   
4              12 Angry Men   

                                            Overview Certificate  \
0  Imprisoned in the 1940s for the double murder ...           R   
1  Spanning the years 1945 to 1955, a chronicle o...           R   
2  In the continuing saga of the Corleone crime f...           R   
3  The true story of how businessman Oskar Schind...           R   
4  The defense and the prosecution have rested an...          NR   

   Runtime (min)                Genre  \
0            142         D

In [5]:
print(ROTTEN.head())

               Title  rottenRating
0      The Godfather            97
1         Casablanca            99
2  L.A. Confidential            99
3      Seven Samurai           100
4           Parasite            99


In [6]:
# # Remove parentheses and commas from the imdbRating column of the IMDB DataFrame
# IMDB['imdbRating'] = IMDB['imdbRating'].str.replace('[(),]', '', regex=True).astype(float)
# # Convert the imdbRating column to float
# IMDB['imdbRating'] = IMDB['imdbRating'].astype(float)
# # Save the result back to the CSV file
# IMDB.to_csv('imdb.csv', index=False)

In [7]:
print(IMDB.head())

                      Title  imdbRating
0  The Shawshank Redemption         9.3
1             The Godfather         9.2
2           The Dark Knight         9.0
3     The Godfather Part II         9.0
4              12 Angry Men         9.0


In [8]:
import pandas as pd
from fuzzywuzzy import fuzz, process

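# Read the three files
df1 = pd.read_csv('imdb.csv')
df2 = pd.read_csv('TMDB.csv')
df3 = pd.read_csv('rotten.csv')

# Title lists of the second and third files
titles_b = df2['Title'].tolist()
titles_c = df3['Title'].tolist()

# Find the best match for a title in a list of titles, subject to a score threshold.
# Note: token_set_ratio scores 100 when one title's tokens are a subset of the other's,
# so "The Godfather Part II" can match "The Godfather" (see row 3 of the output).
def match_title(title, choices, threshold=90):
    result = process.extractOne(title, choices, scorer=fuzz.token_set_ratio)
    if result is None:  # extractOne returns None when there are no choices
        return None
    match, score = result
    return match if score >= threshold else None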

# 1) Match df1 -> df2
df1['Matched_B'] = df1['Title'].apply(lambda t: match_title(t, titles_b))

# 2) Match df1 -> df3 (uses the original Title column; Matched_B could be used instead to chain the matches)
df1['Matched_C'] = df1['Title'].apply(lambda t: match_title(t, titles_c))

# 3) Merge the results with df2 and df3
#   a) with df2
merged = pd.merge(
    df1, df2,
    left_on='Matched_B', right_on='Title',
    how='left',
    suffixes=('', '_b')
)

#   b) with df3
merged = pd.merge(
    merged, df3,
    left_on='Matched_C', right_on='Title',
    how='left',
    suffixes=('', '_c')
)

# 4) Drop the temporary columns and the Title columns brought in by the secondary merges
merged.drop(['Matched_B', 'Matched_C', 'Title_b', 'Title_c'], axis=1, inplace=True)

# Final result
print(merged.head())
merged.to_csv('merged_3files.csv', index=False)

data = merged


                      Title  imdbRating  \
0  The Shawshank Redemption         9.3   
1             The Godfather         9.2   
2           The Dark Knight         9.0   
3     The Godfather Part II         9.0   
4              12 Angry Men         9.0   

                                         Poster_Link  \
0  https://image.tmdb.org/t/p/w500/9cqNxx0GxF0bfl...   
1  https://image.tmdb.org/t/p/w500/3bhkrj58Vtu7en...   
2  https://image.tmdb.org/t/p/w500/qJ2tW6WMUDux91...   
3  https://image.tmdb.org/t/p/w500/3bhkrj58Vtu7en...   
4  https://image.tmdb.org/t/p/w500/ow3wq89wM8qd5X...   

                                            Overview Certificate  \
0  Imprisoned in the 1940s for the double murder ...           R   
1  Spanning the years 1945 to 1955, a chronicle o...           R   
2  Batman raises the stakes in his war on crime. ...       PG-13   
3  Spanning the years 1945 to 1955, a chronicle o...           R   
4  The defense and the prosecution have rested an...          NR
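
In the output above, row 3 ("The Godfather Part II") carries the poster and overview of "The Godfather". That is consistent with how a token-set score behaves: when one title's tokens are a subset of the other's, the score is 100. The following is a minimal sketch of a stricter matcher, not the notebook's method. It swaps in the standard-library `difflib` for fuzzywuzzy, and the helper names `normalize` and `match_title_strict` as well as the 0.9 threshold are assumptions made for illustration.

```python
import difflib
import re

def normalize(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())

def match_title_strict(title, choices, threshold=0.9):
    """Return the best match for `title` in `choices`, or None.

    An exact match on the normalized title wins. Otherwise the whole-string
    similarity ratio (hypothetical threshold) must reach `threshold`.
    """
    norm_to_choice = {normalize(c): c for c in choices}
    key = normalize(title)
    if key in norm_to_choice:
        return norm_to_choice[key]
    best, best_score = None, 0.0
    for norm, original in norm_to_choice.items():
        score = difflib.SequenceMatcher(None, key, norm).ratio()
        if score > best_score:
            best, best_score = original, score
    return best if best_score >= threshold else None

choices = ["The Godfather", "Schindler's List", "12 Angry Men"]
print(match_title_strict("The Godfather Part II", choices))  # sequel vs. original
print(match_title_strict("Schindlers List", choices))        # small spelling difference
```

Whole-string similarity penalizes the extra "Part II" tokens, so the sequel is left unmatched rather than attached to the wrong record, while small spelling differences still match.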

## Does the raw data have duplicate rows?

In [9]:
# Check whether the data has duplicate rows
num_duplicated_rows = data.duplicated().sum()
if num_duplicated_rows > 0:
    print(f"Found {num_duplicated_rows} duplicated rows. Removing duplicates...")
    data = data.drop_duplicates()
    print("✅ Duplicates removed.")
else:
    print("✅ Your data has no duplicated rows.")

✅ Your data has no duplicated rows.
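
`duplicated()` without arguments only flags rows that are identical in every column. Because the fuzzy merge can attach the same TMDB record to more than one IMDb title, a check on a subset of columns may be more informative. The sketch below uses a toy frame; the column names follow the notebook, but the values are illustrative assumptions.

```python
import pandas as pd

# Toy frame mimicking the merged data: two different titles share one poster.
toy = pd.DataFrame({
    "Title": ["The Godfather", "The Godfather Part II", "12 Angry Men"],
    "Poster_Link": ["poster_a", "poster_a", "poster_b"],
})

# No row is identical across all columns, so this count is zero.
full_dupes = int(toy.duplicated().sum())

# Rows that share a non-null Poster_Link with another row.
# keep=False marks every member of a duplicate group, not just the later ones.
has_poster = toy["Poster_Link"].notna()
shared = toy[has_poster & toy.duplicated(subset=["Poster_Link"], keep=False)]

print(full_dupes)
print(shared["Title"].tolist())
```

The `notna()` filter matters because `duplicated` treats missing values as equal to each other, which would otherwise flag every row whose TMDB fields are empty.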


## **3. What data type does each column currently have? Are there any columns whose data types are not suitable for further processing?**

In [10]:
#Type of each column
dtypes = data.dtypes
print("Data types of each column:")
print(dtypes)

Data types of each column:
Title             object
imdbRating       float64
Poster_Link       object
Overview          object
Certificate       object
Runtime (min)    float64
Genre             object
Actors            object
Director          object
Year             float64
tmdbRating       float64
rottenRating     float64
dtype: object
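
`Year` and `Runtime (min)` appear as `float64` here, which is what pandas produces when a whole-number column contains missing values. If integer semantics are preferred, one option is the nullable `Int64` dtype. This is a sketch on toy data, not a step taken in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1994.0, 1972.0, None],
    "Runtime (min)": [142.0, 175.0, None],
})

# "Int64" (capital I) is pandas' nullable integer dtype; missing values become <NA>.
converted = toy.astype({"Year": "Int64", "Runtime (min)": "Int64"})
print(converted.dtypes)
```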


## **4. Check the percentage of missing data in the columns**

In [11]:
#Percentage of missing data
missing_percentage = data.isnull().mean() * 100
print("Missing ratio")
print(missing_percentage)

Missing ratio
Title             0.0
imdbRating        0.0
Poster_Link      16.8
Overview         16.8
Certificate      20.4
Runtime (min)    16.8
Genre            16.8
Actors           16.8
Director         16.8
Year             16.8
tmdbRating       16.8
rottenRating     45.2
dtype: float64


- After computing the missing-data percentages, we identify the features that have a large share of missing values. Such features contribute little to the analysis stage and are candidates for removal from the dataset.

- The threshold for "large" depends on the goals of the analysis. Here we use 75%: any column whose missing percentage exceeds it is dropped and an updated dataframe is returned. In this dataset the highest missing percentage is 45.2% (`rottenRating`), so no column is dropped.

In [12]:
def drop_missing_features(df: pd.DataFrame, missing_percentage: pd.Series, threshold: float = 75.0) -> pd.DataFrame:
    # Find columns with missing data percentage greater than the threshold
    cols_to_drop = missing_percentage[missing_percentage > threshold].index
    # Drop those columns from the DataFrame
    return df.drop(columns=cols_to_drop)

# Apply the function to the dataframe `data` using the `missing_percentage` series
raw_df = drop_missing_features(data, missing_percentage)

# Display the first few rows of the resulting dataframe
raw_df.head()


| | Title | imdbRating | Poster_Link | Overview | Certificate | Runtime (min) | Genre | Actors | Director | Year | tmdbRating | rottenRating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 9.3 | https://image.tmdb.org/t/p/w500/9cqNxx0GxF0bfl... | Imprisoned in the 1940s for the double murder ... | R | 142.0 | Drama, Crime | Morgan Freeman, Tim Robbins, Bob Gunton, Willi... | Frank Darabont | 1994.0 | 8.7 | NaN |
| 1 | The Godfather | 9.2 | https://image.tmdb.org/t/p/w500/3bhkrj58Vtu7en... | Spanning the years 1945 to 1955, a chronicle o... | R | 175.0 | Drama, Crime | Marlon Brando, Al Pacino, James Caan, Robert D... | Francis Ford Coppola | 1972.0 | 8.686 | 97.0 |
| 2 | The Dark Knight | 9.0 | https://image.tmdb.org/t/p/w500/qJ2tW6WMUDux91... | Batman raises the stakes in his war on crime. ... | PG-13 | 152.0 | Drama, Action, Crime, Thriller | Christian Bale, Heath Ledger, Aaron Eckhart, M... | Christopher Nolan | 2008.0 | 8.5 | 94.0 |
| 3 | The Godfather Part II | 9.0 | https://image.tmdb.org/t/p/w500/3bhkrj58Vtu7en... | Spanning the years 1945 to 1955, a chronicle o... | R | 175.0 | Drama, Crime | Marlon Brando, Al Pacino, James Caan, Robert D... | Francis Ford Coppola | 1972.0 | 8.686 | 97.0 |
| 4 | 12 Angry Men | 9.0 | https://image.tmdb.org/t/p/w500/ow3wq89wM8qd5X... | The defense and the prosecution have rested an... | NR | 97.0 | Drama | Martin Balsam, John Fiedler, Lee J. Cobb, E.G.... | Sidney Lumet | 1957.0 | 8.548 | 100.0 |


After determining the percentage of missing data in the columns, we now **divide the columns into two categories, numeric and non-numeric data types,** for processing.

## **5. For each column with numeric data type, how are the values distributed?**

For columns with numeric data types, we will calculate:
- Percentage (from 0 to 100) of missing values
- The min
- The lower quartile
- The median
- The upper quartile
- The max



In [13]:
def missing_ratio(column):
    return column.isna().mean() * 100

def lower_quartile(column):
    return column.quantile(0.25)

def median(column):
    return column.median()

def upper_quartile(column):
    return column.quantile(0.75)

# Select numerical columns (float64, int64) from the DataFrame
numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Dictionary to store the statistics
statistics = {
    "missing_ratio": [],
    "min": [],
    "lower_quartile": [],
    "median": [],
    "upper_quartile": [],
    "max": []
}

# Calculate statistics for each numerical column
for col in numeric_cols:
    missing_ratio_val = missing_ratio(data[col])
    min_val = data[col].min()
    lower_quartile_val = lower_quartile(data[col])
    median_val = median(data[col])
    upper_quartile_val = upper_quartile(data[col])
    max_val = data[col].max()
    
    statistics["missing_ratio"].append(missing_ratio_val)
    statistics["min"].append(min_val)
    statistics["lower_quartile"].append(lower_quartile_val)
    statistics["median"].append(median_val)
    statistics["upper_quartile"].append(upper_quartile_val)
    statistics["max"].append(max_val)

# Create a DataFrame from the statistics dictionary
num_col_info_df = pd.DataFrame(statistics, index=numeric_cols).T.round(1)
num_col_info_df


| | imdbRating | Runtime (min) | Year | tmdbRating | rottenRating |
|---|---|---|---|---|---|
| missing_ratio | 0.0 | 16.8 | 16.8 | 16.8 | 45.2 |
| min | 8.0 | 22.0 | 1921.0 | 7.9 | 89.0 |
| lower_quartile | 8.1 | 109.0 | 1964.8 | 8.0 | 94.0 |
| median | 8.2 | 127.5 | 1995.0 | 8.1 | 96.0 |
| upper_quartile | 8.4 | 148.2 | 2009.0 | 8.3 | 98.0 |
| max | 9.3 | 233.0 | 2024.0 | 8.7 | 100.0 |
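
The same summary can probably be produced more compactly with `DataFrame.quantile`, since the 0 and 1 quantiles are the minimum and maximum. The sketch below runs on a toy frame with assumed values and is meant as an alternative layout of the computation, not a replacement for the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "imdbRating": [9.3, 9.2, 9.0, 9.0],
    "rottenRating": [None, 97.0, 94.0, 100.0],
    "Title": ["a", "b", "c", "d"],
})

# include="number" selects every numeric dtype, not only float64 and int64.
numeric = toy.select_dtypes(include="number")

# quantile() with a list returns one row per requested quantile; NaN is skipped.
summary = numeric.quantile([0, 0.25, 0.5, 0.75, 1])
summary.index = ["min", "lower_quartile", "median", "upper_quartile", "max"]

# Prepend the missing ratio so the layout matches the table above.
missing = (numeric.isna().mean() * 100).rename("missing_ratio").to_frame().T
num_info = pd.concat([missing, summary]).round(1)
print(num_info)
```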


## **6. For each column with a non-numeric data type, how are the values distributed?**

In [14]:
non_numeric_cols = raw_df.select_dtypes(exclude=['float64', 'int64']).columns
cat_statistics = {
    "missing_ratio": [],
    "num_values": [],  # Number of unique values
}
for col in non_numeric_cols:
    # Use a distinct name so the missing_ratio() helper defined earlier is not shadowed
    missing_ratio_val = raw_df[col].isna().mean() * 100
    num_values = raw_df[col].nunique()

    cat_statistics["missing_ratio"].append(round(missing_ratio_val, 1))
    cat_statistics["num_values"].append(num_values)

cat_col_info_df = pd.DataFrame(cat_statistics, index=non_numeric_cols).T
print(cat_col_info_df)

               Title  Poster_Link  Overview  Certificate  Genre  Actors  \
missing_ratio    0.0         16.8      16.8         20.4   16.8    16.8   
num_values     250.0        201.0     201.0          6.0  116.0   200.0   

               Director  
missing_ratio      16.8  
num_values        127.0  
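
For `Genre`, the unique-value count above refers to whole comma-separated combinations such as "Drama, Crime", not to individual genres. A minimal sketch of counting single genres, assuming the comma-separated format seen in the earlier output (the toy values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Genre": ["Drama, Crime", "Drama, Action, Crime, Thriller", "Drama", None],
})

# Split each cell on commas, give every genre its own row, and trim whitespace.
genres = toy["Genre"].dropna().str.split(",").explode().str.strip()
genre_counts = genres.value_counts()
print(genre_counts)
```

The same pattern would apply to `Actors`, which is stored in the same comma-separated form.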


## **7. Fill in the missing cells**

In [17]:
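import os
import requests
import csv
import time

# Read the TMDB API key from the environment rather than hard-coding it in the notebook.
# TMDB_API_KEY is an assumed variable name; set it before running this cell.
API_KEY = os.environ.get('TMDB_API_KEY', '')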
BASE_URL = 'https://api.themoviedb.org/3'
SEARCH_URL = f'{BASE_URL}/search/movie'

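def search_movie_by_title(title):
    """Return the first TMDB search hit for `title`, or None.

    The first hit is not guaranteed to be the intended film, especially for
    short or common titles, so filled values should be spot-checked.
    """
    params = {
        'api_key': API_KEY,
        'query': title,
        'language': 'en-US'
    }
    # A timeout keeps one stalled request from hanging the whole loop
    response = requests.get(SEARCH_URL, params=params, timeout=10)
    if response.status_code == 200:
        results = response.json().get('results', [])
        return results[0] if results else None
    return None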

def get_movie_details(movie_id):
    url = f'{BASE_URL}/movie/{movie_id}?language=en-US&api_key={API_KEY}'
    return requests.get(url).json()

def get_movie_credits(movie_id):
    url = f'{BASE_URL}/movie/{movie_id}/credits?api_key={API_KEY}'
    return requests.get(url).json()

def get_movie_certification(movie_id):
    url = f'{BASE_URL}/movie/{movie_id}/release_dates?api_key={API_KEY}'
    response = requests.get(url)
    if response.status_code == 200:
        results = response.json().get("results", [])
        for country in results:
            if country.get("iso_3166_1") == "US":
                for release in country.get("release_dates", []):
                    cert = release.get("certification")
                    if cert:
                        return cert
    return ''

def fill_missing_fields(row):
    title = row['Title']
    movie = search_movie_by_title(title)
    if not movie:
        print(f"❌ Movie not found: {title}")
        return row

    movie_id = movie['id']
    details = get_movie_details(movie_id)
    credits = get_movie_credits(movie_id)
    cert = get_movie_certification(movie_id)

    # Fill in missing fields only
    if not row.get('Poster_Link'):
        poster_path = details.get('poster_path', '')
        row['Poster_Link'] = f'https://image.tmdb.org/t/p/w500{poster_path}' if poster_path else ''

    if not row.get('Overview'):
        row['Overview'] = details.get('overview', '')

    if not row.get('Certificate'):
        row['Certificate'] = cert

    if not row.get('Runtime (min)'):
        row['Runtime (min)'] = details.get('runtime', 0)

    if not row.get('Genre'):
        genres = [genre['name'] for genre in details.get('genres', [])]
        row['Genre'] = ', '.join(genres)

    if not row.get('Actors'):
        actors = [cast['name'] for cast in credits.get('cast', [])[:5]]
        row['Actors'] = ', '.join(actors)

    if not row.get('Director'):
        directors = [crew['name'] for crew in credits.get('crew', []) if crew['job'] == 'Director']
        row['Director'] = ', '.join(directors)

    if not row.get('Year'):
        # release_date may be missing or null, so fall back to an empty string before slicing
        row['Year'] = (details.get('release_date') or '')[:4]

    if not row.get('tmdbRating'):
        row['tmdbRating'] = details.get('vote_average', 0)

    # Keep imdbRating and rottenRating as is (already in input)
    row['imdbRating'] = row.get('imdbRating', '')
    row['rottenRating'] = row.get('rottenRating', '')

    time.sleep(0.25)
    print(f"✅ Filled: {title}")
    return row

def main():
    input_file = 'merged_3files.csv'
    output_file = 'movie.csv'

    with open(input_file, mode='r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        rows = list(reader)

    filled_rows = []
    for row in rows:
        # Check if any critical field is missing
        needs_fill = any(not row.get(field) for field in [
            'Poster_Link', 'Overview', 'Certificate', 'Runtime (min)',
            'Genre', 'Actors', 'Director', 'Year', 'tmdbRating'
        ])
        if needs_fill:
            row = fill_missing_fields(row)
        filled_rows.append(row)

    # Define column order
    fieldnames = [
        'Poster_Link', 'Title', 'Certificate', 'Overview',
        'Runtime (min)', 'Genre', 'Actors', 'Director',
        'Year', 'imdbRating', 'rottenRating', 'tmdbRating'
    ]

    with open(output_file, mode='w', encoding='utf-8', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for row in filled_rows:
            writer.writerow({field: row.get(field, '') for field in fieldnames})

    print("✅ Done! Output saved to:", output_file)

if __name__ == '__main__':
    main()


✅ Filled: Star Wars: Episode IV - A New Hope
✅ Filled: WALL·E
✅ Filled: 12th Fail
✅ Filled: Capernaum
✅ Filled: The Chaos Class Failed the Class
✅ Filled: The Apartment
✅ Filled: North by Northwest
✅ Filled: Die Hard
✅ Filled: Indiana Jones and the Last Crusade
✅ Filled: Snatch
✅ Filled: L.A. Confidential
✅ Filled: Batman Begins
✅ Filled: The Kid
✅ Filled: Pan's Labyrinth
✅ Filled: A Beautiful Mind
✅ Filled: Finding Nemo
✅ Filled: Monty Python and the Holy Grail
✅ Filled: The Bridge on the River Kwai
✅ Filled: Fargo
✅ Filled: Warrior
✅ Filled: Mad Max: Fury Road
✅ Filled: Children of Heaven
✅ Filled: Memories of Murder
✅ Filled: Ratatouille
✅ Filled: Monsters, Inc.
✅ Filled: Jaws
✅ Filled: Wild Strawberries
✅ Filled: Logan
✅ Filled: Rocky
✅ Filled: Tokyo Story
✅ Filled: Spotlight
✅ Filled: The Big Lebowski
✅ Filled: The Terminator
✅ Filled: Maharaja
✅ Filled: Pirates of the Caribbean: The Curse of the Black Pearl
✅ Filled: Jai Bhim
✅ Filled: Hotel Rwanda
✅ Filled: Platoon
✅ Filled: Bef
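
The filling rule used above (overwrite a field only when the CSV cell is empty) can be separated from the network calls, which makes it easy to check in isolation. The following is a sketch under that assumption; `fill_only_missing` is a hypothetical helper and the `fetched` values are illustrative.

```python
def fill_only_missing(row: dict, fetched: dict, fields: list) -> dict:
    """Return a copy of `row` where empty `fields` are taken from `fetched`.

    csv.DictReader yields '' for empty cells, so "missing" means a falsy value.
    Fields that already hold a value are never overwritten.
    """
    filled = dict(row)
    for field in fields:
        if not filled.get(field) and fetched.get(field):
            filled[field] = fetched[field]
    return filled

row = {"Title": "Die Hard", "Director": "", "Year": "1988"}
fetched = {"Director": "John McTiernan", "Year": "0000"}  # illustrative values
result = fill_only_missing(row, fetched, ["Director", "Year"])
print(result)
```

Returning a copy rather than mutating `row` keeps the original CSV values available for comparison after the fill.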

### Check again after filling null values

In [18]:
data_new = pd.read_csv('movie.csv')
print(data_new.head())

# Check whether the data has duplicate rows
num_duplicated_rows = data_new.duplicated().sum()
if num_duplicated_rows > 0:
    print(f"Found {num_duplicated_rows} duplicated rows. Removing duplicates...")
    data_new = data_new.drop_duplicates()
    print("✅ Duplicates removed.")
else:
    print("✅ Your data has no duplicated rows.")
    

                                         Poster_Link  \
0  https://image.tmdb.org/t/p/w500/9cqNxx0GxF0bfl...   
1  https://image.tmdb.org/t/p/w500/3bhkrj58Vtu7en...   
2  https://image.tmdb.org/t/p/w500/qJ2tW6WMUDux91...   
3  https://image.tmdb.org/t/p/w500/3bhkrj58Vtu7en...   
4  https://image.tmdb.org/t/p/w500/ow3wq89wM8qd5X...   

                      Title Certificate  \
0  The Shawshank Redemption           R   
1             The Godfather           R   
2           The Dark Knight       PG-13   
3     The Godfather Part II           R   
4              12 Angry Men          NR   

                                            Overview  Runtime (min)  \
0  Imprisoned in the 1940s for the double murder ...          142.0   
1  Spanning the years 1945 to 1955, a chronicle o...          175.0   
2  Batman raises the stakes in his war on crime. ...          152.0   
3  Spanning the years 1945 to 1955, a chronicle o...          175.0   
4  The defense and the prosecution have rested an