**Linear Regression**

Link for the dataset: https://www.kaggle.com/datasets/kianindeed/imdb-movie-dataset-dec-2023

This dataset contains top IMDb movies, updated through 15 Dec 2023. The file is in CSV format and contains 11 columns: an index column (Unnamed: 0) plus Moive Name (spelled this way in the file), Rating, Votes, Meta Score, Genre, PG Rating, Year, Duration, Cast, and Director. The data has 1950 rows.

**Cleaning and modifying data**

In [10]:
# import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [11]:
# load the dataset into the Jupyter notebook
df = pd.read_csv("imdb_movie_data_2023.csv")

In [12]:
# preview the data to see how this dataset can be modified
df.head()

# I had to look up some column names to better understand what they mean.
# Metascore is considered the rating of a film. Scores are assigned
# to a movie's reviews by a large group of the world's most respected critics,
# and a weighted average is applied to summarize the range of their opinions.
# https://www.imdb.com/list/ls051211184/#:~:text=Metascore%20is%20considered%20the%20rating,to%20summarize%20their%20opinions%20range.

# PG Rating indicates the audience a movie is intended for,
# i.e. whether there are any audience restrictions

Unnamed: 0.1,Unnamed: 0,Moive Name,Rating,Votes,Meta Score,Genre,PG Rating,Year,Duration,Cast,Director
0,0,Leave the World Behind,6.5,90000.0,67.0,"Drama, Mystery, Thriller",R,2023,2h 18m,"Julia Roberts, Mahershala Ali, Ethan Hawke, My...",Sam Esmail
1,1,Wonka,7.4,24000.0,66.0,"Adventure, Comedy, Family",PG,2023,1h 56m,"Timothée Chalamet, Gustave Die, Murray McArthu...",Paul King
2,2,Poor Things,8.5,6700.0,86.0,"Comedy, Drama, Romance",R,2023,2h 21m,"Emma Stone, Mark Ruffalo, Willem Dafoe, Ramy Y...",Yorgos Lanthimos
3,3,Killers of the Flower Moon,7.8,128000.0,89.0,"Crime, Drama, History",R,2023,3h 26m,"Leonardo DiCaprio, Robert De Niro, Lily Gladst...",Martin Scorsese
4,4,May December,7.0,21000.0,85.0,"Comedy, Drama",R,2023,1h 57m,"Natalie Portman, Chris Tenzis, Charles Melton,...",Todd Haynes


In [13]:
# check the data types of the columns
df.dtypes

# I need to check whether the following columns can be modified:
# Genre, PG Rating, Duration

Unnamed: 0      int64
Moive Name     object
Rating        float64
Votes         float64
Meta Score    float64
Genre          object
PG Rating      object
Year            int64
Duration       object
Cast           object
Director       object
dtype: object
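
Duration is stored as text such as "2h 18m", so it cannot be used as a numeric feature as-is. One possible conversion to total minutes is sketched below. The helper name `duration_to_minutes` is hypothetical and not part of the original notebook, and the sketch assumes every value uses only the "Xh", "Ym" or "Xh Ym" patterns visible in the output.

```python
import pandas as pd

def duration_to_minutes(text: str) -> int:
    """Convert strings such as '2h 18m', '3h' or '45m' to total minutes.

    Hypothetical helper: assumes only 'h' and 'm' suffixed parts occur.
    """
    hours = 0
    minutes = 0
    for part in text.split():
        if part.endswith("h"):
            hours = int(part[:-1])
        elif part.endswith("m"):
            minutes = int(part[:-1])
    return hours * 60 + minutes

# A few Duration values taken from the notebook output
sample = pd.Series(["2h 18m", "1h 56m", "3h"], name="Duration")
sample_minutes = sample.apply(duration_to_minutes)
print(sample_minutes.tolist())  # [138, 116, 180]
```

On the real data, the same function could be applied with `df['Duration'].apply(duration_to_minutes)` once missing values have been handled.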

In [14]:
# I can drop the Cast and Director columns
# because they contain a lot of data that cannot be converted to numeric values.
# Moive Name is unnecessary for linear regression,
# so we drop that column as well
df = df.drop(columns=['Cast', 'Director', 'Moive Name'])

In [15]:
# drop all rows that contain NaN values
df.dropna(inplace=True)
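
Before dropping rows, it can help to see how many values are missing in each column and how many rows the drop removes. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Rating": [6.5, 7.4, np.nan],
    "Meta Score": [67.0, np.nan, 86.0],
    "Year": [2023, 2023, 2023],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Compare row counts before and after dropping rows with any NaN
rows_before = len(toy)
cleaned = toy.dropna()
rows_dropped = rows_before - len(cleaned)
print(rows_dropped)  # 2
```

The same two checks on `df` would show how much of the 1950 rows is lost to `dropna`.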

In [16]:
# Splitting genres and creating one-hot encoding
genres = df['Genre'].str.get_dummies(sep=', ')

# Concatenate one-hot encoded genres with original DataFrame
df = pd.concat([df, genres], axis=1)

# Dropping the original 'Genre' column
df.drop('Genre', axis=1, inplace=True)
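
To illustrate what the encoding step produces, here is a minimal sketch on two Genre strings taken from the preview above. The variable names `toy_genres` and `encoded` are assumptions used only for this example.

```python
import pandas as pd

# Two Genre values copied from the df.head() output
toy_genres = pd.Series(["Drama, Mystery, Thriller", "Comedy, Drama"])

# Split each string on ', ' and create one indicator column per genre
encoded = toy_genres.str.get_dummies(sep=", ")
print(encoded)
```

Each distinct genre becomes its own column, and a movie with several genres gets a 1 in each matching column. Note that the separator must include the space after the comma; otherwise labels such as " Drama" and "Drama" would end up as separate columns.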

In [17]:
# check that no two columns share the same name
columns_list = df.columns.tolist()
sorted_columns = sorted(columns_list)
sorted_columns

['Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Duration',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Meta Score',
 'Music',
 'Musical',
 'Mystery',
 'PG Rating',
 'Rating',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'Unnamed: 0',
 'Votes',
 'War',
 'Western',
 'Year']
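
Scanning the sorted list by eye works for a few dozen columns. As an alternative, the check can be done programmatically. The sketch below uses a made-up frame with a deliberately repeated column name (`toy` is an assumption for illustration).

```python
import pandas as pd

# Made-up frame with a deliberately duplicated column name
toy = pd.DataFrame([[1, 2, 3]], columns=["Drama", "Year", "Drama"])

# Index.duplicated() marks every repeat of a name after its first occurrence
duplicated_names = toy.columns[toy.columns.duplicated()].tolist()
print(duplicated_names)  # ['Drama']

# is_unique is True only when no column name repeats
has_unique_columns = toy.columns.is_unique
print(has_unique_columns)  # False
```

For the notebook's DataFrame, `df.columns.is_unique` returning True would confirm the same thing as the manual inspection.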

In [18]:

df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1939,1940,1941,1942,1943,1944,1945,1946,1947,1949
Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1939,1940,1941,1942,1943,1944,1945,1946,1947,1949
Rating,6.5,7.4,8.5,7.8,7.0,7.1,6.6,8.4,7.6,5.6,...,7.2,7.9,6.9,6.6,5.6,6.3,7.1,7.6,6.5,7.1
Votes,90000.0,24000.0,6700.0,128000.0,21000.0,56000.0,66000.0,553000.0,517000.0,13000.0,...,149000.0,81000.0,54000.0,215000.0,328000.0,22000.0,172000.0,198000.0,71000.0,203000.0
Meta Score,67.0,66.0,86.0,89.0,85.0,54.0,64.0,89.0,55.0,47.0,...,65.0,69.0,60.0,83.0,32.0,67.0,59.0,79.0,46.0,65.0
PG Rating,R,PG,R,R,R,PG-13,R,R,R,PG,...,R,PG-13,R,PG-13,PG-13,R,PG-13,R,PG,R
Year,2023,2023,2023,2023,2023,2023,2023,2023,2003,2023,...,2002,1965,2018,2009,2014,2021,2008,1986,1992,2017
Duration,2h 18m,1h 56m,2h 21m,3h 26m,1h 57m,2h 37m,2h 38m,3h,2h 15m,1h 57m,...,2h 18m,3h 17m,2h 39m,1h 39m,2h 45m,2h 21m,1h 52m,1h 36m,1h 44m,1h 55m
Action,0,0,0,0,0,1,1,0,0,0,...,1,0,1,0,1,0,0,0,0,1
Adventure,0,1,0,0,0,1,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
Animation,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
