## Importing required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

## Importing and preprocessing the dataset

#### Importing the dataset

In [2]:
df = pd.read_csv('movies.csv')

#### Viewing and understanding the structure of the dataset

In [3]:
df.head()

Unnamed: 0,name,rating,genre,year,released,score,votes,director,writer,star,country,budget,gross,company,runtime
0,The Shining,R,Drama,1980,"June 13, 1980 (United States)",8.4,927000.0,Stanley Kubrick,Stephen King,Jack Nicholson,United Kingdom,19000000.0,46998772.0,Warner Bros.,146.0
1,The Blue Lagoon,R,Adventure,1980,"July 2, 1980 (United States)",5.8,65000.0,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,4500000.0,58853106.0,Columbia Pictures,104.0
2,Star Wars: Episode V - The Empire Strikes Back,PG,Action,1980,"June 20, 1980 (United States)",8.7,1200000.0,Irvin Kershner,Leigh Brackett,Mark Hamill,United States,18000000.0,538375067.0,Lucasfilm,124.0
3,Airplane!,PG,Comedy,1980,"July 2, 1980 (United States)",7.7,221000.0,Jim Abrahams,Jim Abrahams,Robert Hays,United States,3500000.0,83453539.0,Paramount Pictures,88.0
4,Caddyshack,R,Comedy,1980,"July 25, 1980 (United States)",7.3,108000.0,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States,6000000.0,39846344.0,Orion Pictures,98.0


In [4]:
df.shape

(7668, 15)

The dataset contains 7668 rows and 15 columns.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7668 entries, 0 to 7667
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      7668 non-null   object 
 1   rating    7591 non-null   object 
 2   genre     7668 non-null   object 
 3   year      7668 non-null   int64  
 4   released  7666 non-null   object 
 5   score     7665 non-null   float64
 6   votes     7665 non-null   float64
 7   director  7668 non-null   object 
 8   writer    7665 non-null   object 
 9   star      7667 non-null   object 
 10  country   7665 non-null   object 
 11  budget    5497 non-null   float64
 12  gross     7479 non-null   float64
 13  company   7651 non-null   object 
 14  runtime   7664 non-null   float64
dtypes: float64(5), int64(1), object(9)
memory usage: 898.7+ KB


Several columns in the dataset have null values. Of the 15 columns, 5 are of float type, 9 are of object type, and 1 is of integer type.

In [6]:
df.isnull().sum()

name           0
rating        77
genre          0
year           0
released       2
score          3
votes          3
director       0
writer         3
star           1
country        3
budget      2171
gross        189
company       17
runtime        4
dtype: int64

The budget column has far more null values than any other column: 2171 of its 7668 entries (about 28%) are missing. Since this column is not needed for the analysis, it is removed rather than filled in.
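
To compare columns on the same scale, the null counts can be expressed as fractions of the total row count. The snippet below is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and only stands in for movies.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for movies.csv (values chosen for illustration).
toy = pd.DataFrame({
    "score": [8.4, 5.8, np.nan, 7.7],
    "budget": [19000000.0, np.nan, np.nan, 3500000.0],
    "runtime": [146.0, 104.0, 124.0, 88.0],
})

# isnull() marks missing cells as True; the column-wise mean of those
# booleans is the fraction of missing values in each column.
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

Applied to the real dataset, the same call would be `df.isnull().mean()`.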

In [7]:
model_data = df.drop(['name','year','released','budget','gross'],axis=1)

The columns that are not required (name, year, released, budget and gross) are removed to create the dataset for the model.

In [8]:
model_data.dropna(inplace = True)

Every row that contains a null value in any of the remaining columns is removed from the model dataset.
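
As a rough illustration of what dropna does, the sketch below uses a hypothetical four-row frame (not the real data). A row is dropped as soon as any one of its cells is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 lacks a rating, row 2 lacks a runtime.
toy = pd.DataFrame({
    "rating": ["R", None, "PG", "R"],
    "runtime": [146.0, 104.0, np.nan, 98.0],
})

# dropna() with default arguments removes every row with at least one null.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))  # 4 -> 2
```

The surviving rows keep their original index labels (0 and 3 here), so the index is no longer contiguous after the call.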

In [9]:
label_encoder = LabelEncoder()

In [10]:
columns = ['rating','genre','director','writer','star','country','company']

for col in columns:
    model_data[col] = label_encoder.fit_transform(model_data[col])

Our model is a linear regression model, which works on numerical values only. The label encoder converts the categorical data in each of the selected columns, including the rating column, into integer codes.
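
The sketch below shows what label encoding does on a small hypothetical frame. Unlike the cell above, it keeps one encoder per column in a dictionary (the `encoders` name is an assumption for illustration), so that the integer codes can later be mapped back to the original labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Action", "Comedy"],
    "country": ["United Kingdom", "United States", "United States", "United States"],
})

encoders = {}  # one fitted encoder per column, kept for decoding later
for col in ["genre", "country"]:
    enc = LabelEncoder()
    # Classes are sorted, so Action -> 0, Comedy -> 1, Drama -> 2.
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(toy["genre"].tolist())  # [2, 1, 0, 1]
print(list(encoders["genre"].inverse_transform([0, 1, 2])))
```

Note that the codes follow the sorted order of the labels, so a linear model will treat them as if that order and spacing were meaningful.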

## Model building and evaluation

#### Model training

The model used for the prediction is a linear regression model.

In [11]:
model = LinearRegression()

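The data in the model_data dataset is split into features (X, every column except rating) and the target (Y, the encoded rating column), and then into training and testing datasets.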

In [12]:
X = model_data.drop('rating',axis=1)
Y = model_data['rating']

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=42)
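
To make the effect of test_size=0.1 and random_state=42 concrete, here is a minimal sketch on synthetic arrays (the `X_toy` and `y_toy` names and values are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 2 features each.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# test_size=0.1 holds out 10% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42
)
print(X_tr.shape, X_te.shape)  # (90, 2) (10, 2)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```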

The model is trained on the training dataset.

In [14]:
model.fit(X_train, Y_train)

#### Model evaluation

The model makes predictions on the test dataset.

In [15]:
Y_pred = model.predict(X_test)

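These predictions are compared to the actual values, and the model is evaluated using three metrics: mean absolute error, mean squared error and root mean squared error. Because the target is the label-encoded rating, the errors are expressed in units of the encoded label codes.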

In [16]:
print('Mean absolute error is:', mean_absolute_error(Y_test, Y_pred))
print('Mean squared error is:', mean_squared_error(Y_test, Y_pred))
print('Root mean squared error is:', np.sqrt(mean_squared_error(Y_test, Y_pred)))

Mean absolute error is: 0.8494451251706017
Mean squared error is: 1.2876486737862354
Root mean squared error is: 1.1347460833976186
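
The three metrics above can also be computed by hand, which makes their definitions explicit. The sketch below uses small made-up arrays (`y_true` and `y_hat` are hypothetical, not the notebook's data) and checks the results against the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat          # [1, 0, -2, -1]
mae = np.mean(np.abs(errors))    # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)       # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)              # square root of the MSE

print(mae, mse, rmse)

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
```

The squared error penalizes large misses more heavily than the absolute error, and taking the root brings the value back to the units of the target.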
