**Linear Regression**

Link for the dataset: https://www.kaggle.com/datasets/kianindeed/imdb-movie-dataset-dec-2023

This dataset contains the top IMDb movies, updated through 15 Dec 2023. The file is in CSV format and contains 11 columns: an unnamed index column (Unnamed: 0) plus Moive Name (spelled this way in the file), Rating, Votes, Meta Score, Genre, PG Rating, Year, Duration, Cast, and Director. The data has 1950 rows.

In [58]:
# import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [46]:
# load the dataset into the Jupyter notebook
df = pd.read_csv("imdb_movie_data_2023.csv")

In [47]:
# look at the first rows to see how this dataset can be modified
df.head()

# I had to look up some column names to better understand what they mean.
# Metascore is considered the rating of a film. Scores are assigned
# to a movie's reviews by a large group of the world's most respected critics,
# and a weighted average is applied to summarize the range of their opinions.
# https://www.imdb.com/list/ls051211184/#:~:text=Metascore%20is%20considered%20the%20rating,to%20summarize%20their%20opinions%20range.

# PG Rating indicates the intended audience of a movie,
# i.e. whether there are any audience restrictions
Unnamed: 0.1,Unnamed: 0,Moive Name,Rating,Votes,Meta Score,Genre,PG Rating,Year,Duration,Cast,Director
0,0,Leave the World Behind,6.5,90000.0,67.0,"Drama, Mystery, Thriller",R,2023,2h 18m,"Julia Roberts, Mahershala Ali, Ethan Hawke, My...",Sam Esmail
1,1,Wonka,7.4,24000.0,66.0,"Adventure, Comedy, Family",PG,2023,1h 56m,"Timothée Chalamet, Gustave Die, Murray McArthu...",Paul King
2,2,Poor Things,8.5,6700.0,86.0,"Comedy, Drama, Romance",R,2023,2h 21m,"Emma Stone, Mark Ruffalo, Willem Dafoe, Ramy Y...",Yorgos Lanthimos
3,3,Killers of the Flower Moon,7.8,128000.0,89.0,"Crime, Drama, History",R,2023,3h 26m,"Leonardo DiCaprio, Robert De Niro, Lily Gladst...",Martin Scorsese
4,4,May December,7.0,21000.0,85.0,"Comedy, Drama",R,2023,1h 57m,"Natalie Portman, Chris Tenzis, Charles Melton,...",Todd Haynes


In [48]:
# check the data types of the columns
df.dtypes

# I need to check whether the following columns can be modified:
# Genre, PG Rating, Duration, Cast, Director

Unnamed: 0      int64
Moive Name     object
Rating        float64
Votes         float64
Meta Score    float64
Genre          object
PG Rating      object
Year            int64
Duration       object
Cast           object
Director       object
dtype: object
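
Duration is stored as text such as "2h 18m", so it has to become a number before it can be used as a regression feature. The following is a minimal sketch of one way to do that, not part of the original notebook: the helper name `duration_to_minutes` is hypothetical, and it assumes every value looks like "2h 18m", "3h" or "45m" (anything else becomes NaN).

```python
import re

import pandas as pd


def duration_to_minutes(text):
    """Convert strings such as '2h 18m', '3h' or '45m' to total minutes.

    Hypothetical helper: returns NaN for missing or unrecognised values.
    """
    if not isinstance(text, str):
        return float("nan")
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text)
    if match is None or not any(match.groups()):
        return float("nan")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


# small stand-in for df['Duration'] (assumed format, see above)
sample = pd.Series(["2h 18m", "1h 56m", "3h", "45m", None])
minutes = sample.apply(duration_to_minutes)
print(minutes)
```

In the notebook this could be applied as `df['Duration'].apply(duration_to_minutes)`, provided the real column contains no other formats.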

In [49]:
# I can drop the Cast and Director columns
# because they contain a lot of data that cannot be converted to numeric values
# (the list is only defined here; the drop itself is not applied in this cell)
columns_to_drop = ['Cast', 'Director']

In [50]:
# Next, I can work with the Genre column.
# The values in each row can be split into separate columns
# because the number of possible genres is limited
# and they probably have an effect on the rating of these movies

# Split the data in the Genre column into lists
# in order to make new columns from them
# (note: splitting on ',' keeps the leading space of later genres, e.g. ' Mystery')
df['Genre'] = df['Genre'].str.split(',')

In [51]:
# make new columns that contain genres of movies
df = df.join(df['Genre'].apply(pd.Series).add_prefix('Genre_'))
df

Unnamed: 0.1,Unnamed: 0,Moive Name,Rating,Votes,Meta Score,Genre,PG Rating,Year,Duration,Cast,Director,Genre_0,Genre_1,Genre_2
0,0,Leave the World Behind,6.5,90000.0,67.0,"[Drama, Mystery, Thriller]",R,2023,2h 18m,"Julia Roberts, Mahershala Ali, Ethan Hawke, My...",Sam Esmail,Drama,Mystery,Thriller
1,1,Wonka,7.4,24000.0,66.0,"[Adventure, Comedy, Family]",PG,2023,1h 56m,"Timothée Chalamet, Gustave Die, Murray McArthu...",Paul King,Adventure,Comedy,Family
2,2,Poor Things,8.5,6700.0,86.0,"[Comedy, Drama, Romance]",R,2023,2h 21m,"Emma Stone, Mark Ruffalo, Willem Dafoe, Ramy Y...",Yorgos Lanthimos,Comedy,Drama,Romance
3,3,Killers of the Flower Moon,7.8,128000.0,89.0,"[Crime, Drama, History]",R,2023,3h 26m,"Leonardo DiCaprio, Robert De Niro, Lily Gladst...",Martin Scorsese,Crime,Drama,History
4,4,May December,7.0,21000.0,85.0,"[Comedy, Drama]",R,2023,1h 57m,"Natalie Portman, Chris Tenzis, Charles Melton,...",Todd Haynes,Comedy,Drama,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1945,1945,"Definitely, Maybe",7.1,172000.0,59.0,"[Comedy, Drama, Romance]",PG-13,2008,1h 52m,"Ryan Reynolds, Rachel Weisz, Abigail Breslin, ...",Adam Brooks,Comedy,Drama,Romance
1946,1946,The Fly,7.6,198000.0,79.0,"[Drama, Horror, Sci-Fi]",R,1986,1h 36m,"Jeff Goldblum, Geena Davis, John Getz, Joy Bou...",David Cronenberg,Drama,Horror,Sci-Fi
1947,1947,The Mighty Ducks,6.5,71000.0,46.0,"[Comedy, Drama, Family]",PG,1992,1h 44m,"Emilio Estevez, Joss Ackland, Lane Smith, Heid...",Stephen Herek,Comedy,Drama,Family
1948,1948,Little Giants,6.4,30000.0,,,PG,1994,1h 47m,,,,,


In [52]:
unique_values = df['Genre_1'].unique()
unique_values

# Drama and Romance
# History and Biography
# Action and Adventure

array([' Mystery', ' Comedy', ' Drama', ' Adventure', ' Family', nan,
       ' Crime', ' Thriller', ' Musical', ' Romance', ' Fantasy',
       ' Action', ' Horror', ' Music', ' Sci-Fi', ' History', ' Western',
       ' War', ' Sport', ' Biography'], dtype=object)
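
The unique values above keep a leading space (' Mystery', ' Comedy'), and the positional columns Genre_0, Genre_1, Genre_2 spread the same genre over several columns. One possible alternative, shown here only as a sketch on toy data, is to build one 0/1 column per genre with `Series.str.get_dummies`. It assumes the genres are still the original comma-plus-space separated strings, that is, before the column is replaced by lists.

```python
import pandas as pd

# toy stand-in for the original (unsplit) Genre column
genres = pd.Series(
    ["Drama, Mystery, Thriller", "Comedy, Drama", None],
    name="Genre",
)

# splitting on ', ' avoids the leading spaces; missing values give all-zero rows
genre_flags = genres.str.get_dummies(sep=", ").add_prefix("Genre_")
print(genre_flags)
```

Each movie then has a 1 in every genre column it belongs to, regardless of the order in which the genres were listed.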

In [53]:
unique_values = df['PG Rating'].unique()
unique_values

array(['R', 'PG', 'PG-13', 'TV-PG', nan, 'G', 'TV-G', 'TV-14', 'TV-Y7',
       'TV-MA', 'Unrated', '18+', 'Passed', 'Approved', 'NC-17', 'X',
       '16+', '13+', 'GP'], dtype=object)

In [54]:
# one-hot encode PG Rating: this creates one 0/1 column per category
from sklearn.preprocessing import OneHotEncoder
variables = ['PG Rating']

# use the encoder to get columns that contain only numeric data
encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
one_hot_encoded = encoder.fit_transform(df[variables]).astype(int)
df = pd.concat([df,one_hot_encoded],axis=1).drop(columns=variables)

In [55]:
# we can delete the NaN column because it simply means
# that there is no info for PG Rating;
# we can also drop PG Rating_Unrated for the same reason,
# and we can delete the last category column, PG Rating_X,
# because one-hot encoding makes one category column redundant
columns_to_drop = ['PG Rating_Unrated', 'PG Rating_nan', 'PG Rating_X']
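
Note that this assignment replaces the earlier `columns_to_drop` list (Cast and Director), and neither list has been passed to `drop` yet. A minimal sketch of applying both at once is shown below on a toy frame; the frame `toy` and the combined list are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# toy frame with a few of the column names that appear in the notebook
toy = pd.DataFrame({
    "Rating": [6.5, 7.4],
    "Cast": ["A, B", "C, D"],
    "Director": ["E", "F"],
    "PG Rating_R": [1, 0],
    "PG Rating_X": [0, 0],
    "PG Rating_Unrated": [0, 0],
    "PG Rating_nan": [0, 0],
})

# combine both lists so that neither one overwrites the other
columns_to_drop = [
    "Cast", "Director",
    "PG Rating_Unrated", "PG Rating_nan", "PG Rating_X",
]

# drop returns a new DataFrame; the original frame is left unchanged
reduced = toy.drop(columns=columns_to_drop)
print(reduced.columns.tolist())
```

Because `drop` returns a copy, the original frame still has Cast and Director available for later inspection.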

In [56]:
df

Unnamed: 0.1,Unnamed: 0,Moive Name,Rating,Votes,Meta Score,Genre,Year,Duration,Cast,Director,...,PG Rating_Passed,PG Rating_R,PG Rating_TV-14,PG Rating_TV-G,PG Rating_TV-MA,PG Rating_TV-PG,PG Rating_TV-Y7,PG Rating_Unrated,PG Rating_X,PG Rating_nan
0,0,Leave the World Behind,6.5,90000.0,67.0,"[Drama, Mystery, Thriller]",2023,2h 18m,"Julia Roberts, Mahershala Ali, Ethan Hawke, My...",Sam Esmail,...,0,1,0,0,0,0,0,0,0,0
1,1,Wonka,7.4,24000.0,66.0,"[Adventure, Comedy, Family]",2023,1h 56m,"Timothée Chalamet, Gustave Die, Murray McArthu...",Paul King,...,0,0,0,0,0,0,0,0,0,0
2,2,Poor Things,8.5,6700.0,86.0,"[Comedy, Drama, Romance]",2023,2h 21m,"Emma Stone, Mark Ruffalo, Willem Dafoe, Ramy Y...",Yorgos Lanthimos,...,0,1,0,0,0,0,0,0,0,0
3,3,Killers of the Flower Moon,7.8,128000.0,89.0,"[Crime, Drama, History]",2023,3h 26m,"Leonardo DiCaprio, Robert De Niro, Lily Gladst...",Martin Scorsese,...,0,1,0,0,0,0,0,0,0,0
4,4,May December,7.0,21000.0,85.0,"[Comedy, Drama]",2023,1h 57m,"Natalie Portman, Chris Tenzis, Charles Melton,...",Todd Haynes,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1945,1945,"Definitely, Maybe",7.1,172000.0,59.0,"[Comedy, Drama, Romance]",2008,1h 52m,"Ryan Reynolds, Rachel Weisz, Abigail Breslin, ...",Adam Brooks,...,0,0,0,0,0,0,0,0,0,0
1946,1946,The Fly,7.6,198000.0,79.0,"[Drama, Horror, Sci-Fi]",1986,1h 36m,"Jeff Goldblum, Geena Davis, John Getz, Joy Bou...",David Cronenberg,...,0,1,0,0,0,0,0,0,0,0
1947,1947,The Mighty Ducks,6.5,71000.0,46.0,"[Comedy, Drama, Family]",1992,1h 44m,"Emilio Estevez, Joss Ackland, Lane Smith, Heid...",Stephen Herek,...,0,0,0,0,0,0,0,0,0,0
1948,1948,Little Giants,6.4,30000.0,,,1994,1h 47m,,,...,0,0,0,0,0,0,0,0,0,0


In [57]:
unique_values = df['Director'].unique()
unique_values

array(['Sam Esmail', 'Paul King', 'Yorgos Lanthimos', 'Martin Scorsese',
       'Todd Haynes', 'Francis Lawrence', 'Ridley Scott',
       'Christopher Nolan', 'Richard Curtis', 'Reginald Hudlin',
       'Emerald Fennell', 'Greta Gerwig', 'Jeremiah S. Chechik',
       'Zack Snyder', 'Simon Cellan Jones', 'Alexander Payne', 'McG',
       'James Mangold', 'Jon Favreau', 'Ron Howard', 'Bradley Cooper',
       'Randy Zisk', 'Sean Durkin',
       'Robert MarianettiRobert SmigelDavid Wachtenheim', 'Phillip Noyce',
       'Gareth Edwards', 'Yarrow CheneyScott Mosier', 'Sam Fell',
       'John McTiernan', 'David Fincher',
       'Aaron HorvathMichael JelenicPierre Leduc', 'Tommy Wirkola',
       'Chris Columbus', 'Kenneth Branagh', 'Nia DaCosta',
       'Chris BuckFawn Veerasunthorn', 'Robert Zemeckis', 'Frank Capra',
       'Emma Tammi', nan, 'John Pasquin', 'Michael Curtiz',
       'Brian Helgeland', 'Eli Roth', 'Nancy Meyers', 'Gary Ross',
       'George Clooney', 'Bob Clark', 'Blitz Bazawul