**Linear Regression**

Link for the dataset: https://www.kaggle.com/datasets/kianindeed/imdb-movie-dataset-dec-2023

This dataset contains the top IMDb movies, updated as of 15 Dec 2023. The file is in CSV format and contains 11 columns: an unnamed index column (Unnamed: 0) plus Moive Name, Rating, Votes, Meta Score, Genre, PG Rating, Year, Duration, Cast, and Director. The data has 1950 rows.

**Cleaning and modifying data**

In [450]:
# import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [451]:
# load the dataset into the Jupyter notebook
df = pd.read_csv("imdb_movie_data_2023.csv")

In [452]:
# look at the first rows to see how this dataset needs to be modified
df.head()

# I had to look up some column names to better understand what they mean.
# Metascore is considered the rating of a film. Scores are assigned
# to a movie's reviews from a large group of the world's most respected critics,
# and a weighted average is applied to summarize the range of their opinions.
# https://www.imdb.com/list/ls051211184/#:~:text=Metascore%20is%20considered%20the%20rating,to%20summarize%20their%20opinions%20range.

# PG Rating indicates the audience a movie is intended for,
# i.e. whether there are any restrictions on who may watch it

Unnamed: 0.1,Unnamed: 0,Moive Name,Rating,Votes,Meta Score,Genre,PG Rating,Year,Duration,Cast,Director
0,0,Leave the World Behind,6.5,90000.0,67.0,"Drama, Mystery, Thriller",R,2023,2h 18m,"Julia Roberts, Mahershala Ali, Ethan Hawke, My...",Sam Esmail
1,1,Wonka,7.4,24000.0,66.0,"Adventure, Comedy, Family",PG,2023,1h 56m,"Timothée Chalamet, Gustave Die, Murray McArthu...",Paul King
2,2,Poor Things,8.5,6700.0,86.0,"Comedy, Drama, Romance",R,2023,2h 21m,"Emma Stone, Mark Ruffalo, Willem Dafoe, Ramy Y...",Yorgos Lanthimos
3,3,Killers of the Flower Moon,7.8,128000.0,89.0,"Crime, Drama, History",R,2023,3h 26m,"Leonardo DiCaprio, Robert De Niro, Lily Gladst...",Martin Scorsese
4,4,May December,7.0,21000.0,85.0,"Comedy, Drama",R,2023,1h 57m,"Natalie Portman, Chris Tenzis, Charles Melton,...",Todd Haynes


In [453]:
# check the data types of the columns
df.dtypes

# I need to check whether the following columns can be converted to numeric:
# Genre, PG Rating, Duration

Unnamed: 0      int64
Moive Name     object
Rating        float64
Votes         float64
Meta Score    float64
Genre          object
PG Rating      object
Year            int64
Duration       object
Cast           object
Director       object
dtype: object

In [454]:
# I can drop the columns Cast and Director
# because they contain a lot of text that cannot be converted to numeric values;
# the Moive Name column is unnecessary for the linear regression,
# so we drop it as well
df = df.drop(columns=['Cast', 'Director', 'Moive Name'])

In [455]:
# drop all rows that contain NaN values
df.dropna(inplace=True)
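
Before dropping rows, it can help to see how many values are missing per column and how many rows would be lost. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the movie data (values are made up)
toy = pd.DataFrame({
    "Rating": [6.5, 7.4, np.nan, 7.8],
    "Votes": [90000.0, np.nan, 6700.0, 128000.0],
    "Meta Score": [67.0, 66.0, np.nan, 89.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# How many rows dropna() would keep versus discard
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(f"dropna() would remove {rows_before - rows_after} of {rows_before} rows")
```

If a single column accounts for most of the missing values, dropping or imputing that column may preserve more rows than dropping every incomplete row.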

In [456]:
# Creating a set of unique values by splitting on spaces and commas
unique_values = set(' '.join(df['Genre'].values.tolist()).replace(',', ' ').split())

# Convert the set back to a list
unique_values_list = list(unique_values)
unique_values_list

['Biography',
 'Animation',
 'Romance',
 'Documentary',
 'Sci-Fi',
 'Family',
 'Horror',
 'Sport',
 'Western',
 'Mystery',
 'Drama',
 'Music',
 'Musical',
 'Crime',
 'Adventure',
 'War',
 'Thriller',
 'Action',
 'Fantasy',
 'History',
 'Comedy']
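
The same list of unique genres can probably also be obtained with pandas string methods, which avoids splitting on spaces (that would break any multi-word genre name, although none appears in the list above). This is a sketch on a toy frame; `toy` and its rows are illustrative assumptions:

```python
import pandas as pd

# Toy frame with the same "comma + space" separated genre strings
toy = pd.DataFrame({
    "Genre": [
        "Drama, Mystery, Thriller",
        "Adventure, Comedy, Family",
        "Comedy, Drama",
    ]
})

# Split each string into a list, put every genre on its own row,
# strip stray whitespace, and keep the distinct values in sorted order
genres = sorted(toy["Genre"].str.split(",").explode().str.strip().unique())
print(genres)
```

Sorting the result also makes the order reproducible, whereas the order of a Python set can differ between runs.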

In [457]:
# creating a new column for each genre in unique_values, initialised to 0
for genre in unique_values:
    df[genre] = 0

In [458]:
# check that we do not have any duplicate column names
columns_list = df.columns.tolist()
sorted_columns = sorted(columns_list)
sorted_columns

['Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Duration',
 'Family',
 'Fantasy',
 'Genre',
 'History',
 'Horror',
 'Meta Score',
 'Music',
 'Musical',
 'Mystery',
 'PG Rating',
 'Rating',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'Unnamed: 0',
 'Votes',
 'War',
 'Western',
 'Year']

In [459]:
# encoding the values of the genre columns as integer labels (0 or 1)
from sklearn.preprocessing import LabelEncoder

variables = ['Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'War',
 'Western']
encoder = LabelEncoder()

# applying the LabelEncoder to the genre columns in our DataFrame
df[variables] = df[variables].apply(encoder.fit_transform)
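
Note that the loop above fills every genre column with 0, and a LabelEncoder fitted on a constant column returns that column unchanged, so the genre columns stay all zero. If the goal is a 0/1 indicator per genre for each movie, one possible approach is `Series.str.get_dummies`. The sketch below uses a toy frame (`toy` and its values are illustrative assumptions) and is not the notebook's original method:

```python
import pandas as pd

# Toy frame with the same genre string format as the dataset
toy = pd.DataFrame({
    "Rating": [6.5, 7.4, 7.0],
    "Genre": [
        "Drama, Mystery, Thriller",
        "Adventure, Comedy, Family",
        "Comedy, Drama",
    ],
})

# One 0/1 column per genre; a row gets 1 for every genre listed in its string.
# Whole tokens are matched, so "Music" is not confused with "Musical".
genre_flags = toy["Genre"].str.get_dummies(sep=", ")

# Replace the text column with the indicator columns
toy_encoded = pd.concat([toy.drop(columns="Genre"), genre_flags], axis=1)
print(toy_encoded)
```

Because the indicators are built directly from the Genre strings, no separate zero-initialisation or label-encoding step is needed in this variant.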

In [460]:
# check that the Genre column was split correctly into numeric columns
columns_list = df.columns.tolist()
sorted_columns = sorted(columns_list)
sorted_columns

['Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Duration',
 'Family',
 'Fantasy',
 'Genre',
 'History',
 'Horror',
 'Meta Score',
 'Music',
 'Musical',
 'Mystery',
 'PG Rating',
 'Rating',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'Unnamed: 0',
 'Votes',
 'War',
 'Western',
 'Year']

In [461]:
# verifying the current state of the DataFrame
df.head()

Unnamed: 0.1,Unnamed: 0,Rating,Votes,Meta Score,Genre,PG Rating,Year,Duration,Biography,Animation,...,Music,Musical,Crime,Adventure,War,Thriller,Action,Fantasy,History,Comedy
0,0,6.5,90000.0,67.0,"Drama, Mystery, Thriller",R,2023,2h 18m,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,7.4,24000.0,66.0,"Adventure, Comedy, Family",PG,2023,1h 56m,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,8.5,6700.0,86.0,"Comedy, Drama, Romance",R,2023,2h 21m,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,7.8,128000.0,89.0,"Crime, Drama, History",R,2023,3h 26m,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,7.0,21000.0,85.0,"Comedy, Drama",R,2023,1h 57m,0,0,...,0,0,0,0,0,0,0,0,0,0


In [462]:
# I can now drop the Genre column;
# I can also drop "Unnamed: 0" because it duplicates the DataFrame index
df = df.drop(columns=['Genre', 'Unnamed: 0'])

In [463]:
df.head()

Unnamed: 0,Rating,Votes,Meta Score,PG Rating,Year,Duration,Biography,Animation,Romance,Documentary,...,Music,Musical,Crime,Adventure,War,Thriller,Action,Fantasy,History,Comedy
0,6.5,90000.0,67.0,R,2023,2h 18m,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7.4,24000.0,66.0,PG,2023,1h 56m,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,8.5,6700.0,86.0,R,2023,2h 21m,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,7.8,128000.0,89.0,R,2023,3h 26m,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,7.0,21000.0,85.0,R,2023,1h 57m,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [464]:
# this creates one column per category of the PG Rating variable
from sklearn.preprocessing import OneHotEncoder
variables = ['PG Rating']

# I use an encoder so that the new columns contain only numeric data
encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
one_hot_encoded = encoder.fit_transform(df[variables]).astype(int)
df = pd.concat([df,one_hot_encoded],axis=1).drop(columns=variables)

In [465]:
# verifying the current state of the DataFrame
df

Unnamed: 0,Rating,Votes,Meta Score,Year,Duration,Biography,Animation,Romance,Documentary,Sci-Fi,...,PG Rating_PG-13,PG Rating_Passed,PG Rating_R,PG Rating_TV-14,PG Rating_TV-G,PG Rating_TV-MA,PG Rating_TV-PG,PG Rating_TV-Y7,PG Rating_Unrated,PG Rating_X
0,6.5,90000.0,67.0,2023,2h 18m,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,7.4,24000.0,66.0,2023,1h 56m,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,8.5,6700.0,86.0,2023,2h 21m,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,7.8,128000.0,89.0,2023,3h 26m,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,7.0,21000.0,85.0,2023,1h 57m,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1944,6.3,22000.0,67.0,2021,2h 21m,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1945,7.1,172000.0,59.0,2008,1h 52m,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1946,7.6,198000.0,79.0,1986,1h 36m,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1947,6.5,71000.0,46.0,1992,1h 44m,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [466]:
# we can drop PG Rating_Unrated, which carries no rating information,
# and we can also delete the last column, PG Rating_X:
# because we used OneHotEncoder, one category column is redundant
df = df.drop(columns=['PG Rating_Unrated', 'PG Rating_X'])
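
If the reason for dropping a category is to avoid perfectly collinear dummy columns in the regression, pandas can do the encoding and the drop in one step. This is a sketch under that assumption, on a toy frame with made-up values; note that `drop_first=True` removes the alphabetically first category rather than a category of your choice:

```python
import pandas as pd

# Toy frame with a categorical rating column (values are made up)
toy = pd.DataFrame({
    "Rating": [6.5, 7.4, 8.5, 7.1],
    "PG Rating": ["R", "PG", "R", "PG-13"],
})

# One-hot encode "PG Rating" as integers and drop the first category ("PG"),
# which then acts as the reference level for the remaining dummy columns
encoded = pd.get_dummies(toy, columns=["PG Rating"], drop_first=True, dtype=int)
print(encoded)
```

A row with zeros in all remaining dummy columns then corresponds to the dropped reference category.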

In [467]:
# now we can modify the last remaining column, Duration:
# we need to remove "h" and "m" and convert the value into minutes,
# so we write a function that converts a string duration
# into a numeric one
def convert_to_minutes(duration_str):
    try:
        # if the value is already an integer, return it as is
        if isinstance(duration_str, int):
            return duration_str

        # split the string on whitespace into parts such as '2h' and '18m'
        parts = duration_str.split()

        # initialize hours and minutes
        hours, minutes = 0, 0

        # check each part and update hours or minutes accordingly
        for part in parts:
            if 'h' in part:
                hours = int(part.replace('h', ''))
            elif 'm' in part:
                minutes = int(part.replace('m', ''))

        # calculate total minutes
        total_minutes = hours * 60 + minutes
        return total_minutes
    except Exception as e:
        print(f"Error processing {duration_str}: {e}")
        return pd.NA

# applying the conversion function to the 'Duration' column
df['Duration'] = df['Duration'].apply(convert_to_minutes)
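
As an alternative sketch, the same conversion can be written with a regular expression, which rejects unexpected strings instead of silently treating them as 0 minutes. The function name `duration_to_minutes` is hypothetical, and the sketch assumes the input is always a string in the form "2h 18m", "3h" or "45m":

```python
import re

# Optional "<digits>h" followed by optional "<digits>m", with flexible spacing
_DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")


def duration_to_minutes(text):
    """Return total minutes for a duration string, or None if it cannot be parsed."""
    match = _DURATION_RE.match(text)
    # Reject strings that do not fit the pattern or contain neither hours nor minutes
    if match is None or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


print(duration_to_minutes("2h 18m"))
print(duration_to_minutes("3h"))
print(duration_to_minutes("not a duration"))
```

Returning None for unparseable input makes bad values easy to find afterwards, for example with `isna()` once the function has been applied to a column.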

In [468]:
# checking the dataset that we finally have
df.head()

# we can see that all the data we have now is numeric,
# so we can proceed with checking the balance of the data

Unnamed: 0,Rating,Votes,Meta Score,Year,Duration,Biography,Animation,Romance,Documentary,Sci-Fi,...,PG Rating_NC-17,PG Rating_PG,PG Rating_PG-13,PG Rating_Passed,PG Rating_R,PG Rating_TV-14,PG Rating_TV-G,PG Rating_TV-MA,PG Rating_TV-PG,PG Rating_TV-Y7
0,6.5,90000.0,67.0,2023,138,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,7.4,24000.0,66.0,2023,116,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,8.5,6700.0,86.0,2023,141,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,7.8,128000.0,89.0,2023,206,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,7.0,21000.0,85.0,2023,117,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


**Checking the balance of the data**

In [469]:
df.describe()

Unnamed: 0,Rating,Votes,Meta Score,Year,Duration,Biography,Animation,Romance,Documentary,Sci-Fi,...,PG Rating_NC-17,PG Rating_PG,PG Rating_PG-13,PG Rating_Passed,PG Rating_R,PG Rating_TV-14,PG Rating_TV-G,PG Rating_TV-MA,PG Rating_TV-PG,PG Rating_TV-Y7
count,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,...,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0,1784.0
mean,6.952691,292544.9,62.142377,2006.497758,117.170404,0.0,0.0,0.0,0.0,0.0,...,0.003924,0.148543,0.338004,0.005045,0.453475,0.002242,0.000561,0.007848,0.002803,0.001121
std,0.862546,318430.6,16.569672,15.241521,21.743498,0.0,0.0,0.0,0.0,0.0,...,0.062534,0.355737,0.473163,0.070868,0.49797,0.047312,0.023676,0.088263,0.052881,0.033473
min,2.4,107.0,14.0,1938.0,69.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,6.4,90000.0,50.0,1998.0,102.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,7.0,200000.0,63.0,2010.0,114.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,7.6,377250.0,74.0,2019.0,129.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
max,9.3,2800000.0,100.0,2023.0,246.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
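
In the summary above, the genre columns have a mean and standard deviation of 0.0, which means they are constant. A constant column cannot help a linear regression, so it may be worth detecting such columns automatically. A minimal sketch on a toy frame (names and values are illustrative assumptions):

```python
import pandas as pd

# Toy frame: "Drama" is constant, the other columns vary
toy = pd.DataFrame({
    "Rating": [6.5, 7.4, 8.5],
    "Drama": [0, 0, 0],
    "PG Rating_R": [1, 0, 1],
})

# A column with at most one distinct value carries no information
constant_columns = [col for col in toy.columns if toy[col].nunique() <= 1]
print(constant_columns)

# Drop the constant columns before fitting a model
toy_reduced = toy.drop(columns=constant_columns)
```

A long list of constant columns is usually a sign that an earlier encoding step did not do what was intended.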


In [470]:
# count the number of movies per release year
class_counts = df['Year'].value_counts()
class_counts

2023    173
2022    100
2019     71
2021     67
2017     63
       ... 
1946      1
1938      1
1966      1
1944      1
1956      1
Name: Year, Length: 79, dtype: int64
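
With 79 distinct years, many of which occur only once, the per-year counts are hard to read. Grouping the years into decades is one possible way to summarise the balance; the sketch below uses a short made-up series rather than the real column:

```python
import pandas as pd

# Toy series of release years (values are made up)
years = pd.Series([2023, 2023, 2022, 2019, 1998, 1986, 1946], name="Year")

# Integer-divide by 10 and multiply back to get the decade, e.g. 2023 -> 2020
decade = (years // 10) * 10

# Count movies per decade, ordered chronologically
decade_counts = decade.value_counts().sort_index()
print(decade_counts)
```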