# **Objective**



1.   Build a model that predicts the rating of a movie from features such as genre, director, and actors. Regression techniques are a natural fit for this problem.
2.   The goal is to analyze historical movie data and develop a model that accurately estimates the rating given to a movie by users or critics.
3.   The Movie Rating Prediction project is a chance to explore data analysis, preprocessing, feature engineering, and machine learning modeling. It gives insight into the factors that influence movie ratings and results in a model that estimates those ratings.

# Downloading the dataset

In [None]:
%pip install kaggle



In [None]:
!mkdir ~/.kaggle

In [None]:
! kaggle datasets download adrianmcmahon/imdb-india-movies

Downloading imdb-india-movies.zip to /content
100% 494k/494k [00:00<00:00, 890kB/s]


In [None]:
!unzip /content/imdb-india-movies.zip -d /content/

Archive:  /content/imdb-india-movies.zip
  inflating: /content/IMDb Movies India.csv  


# Importing Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

# Reading & Understanding the data

In [None]:
data = pd.read_csv('/content/IMDb Movies India.csv', sep=',', encoding='latin-1')

In [None]:
data.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


# Checking for missing data

In [None]:
data.isnull().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

# Data Cleanup

## The Year, Duration, and Votes columns contain null values and are stored with the object datatype

## Convert their datatypes, keeping NaN values as they are:


### *   Year : object -> float
### *   Duration : object -> float eliminating 'min' and extracting the value as a float
### *   Votes : object -> float
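The three conversions listed above can also be written with vectorized pandas string methods. The following is a minimal sketch on a small hand-made frame (the sample rows are illustrative, not taken from the dataset). It assumes the raw strings follow the '(2019)', '109 min', and '1,086' patterns seen in this notebook.

```python
import pandas as pd

# Toy frame mimicking the raw string formats (illustrative values only).
raw = pd.DataFrame({
    "Year": ["(2019)", None, "(1997)"],
    "Duration": ["109 min", "90 min", None],
    "Votes": ["8", "1,086", None],
})

# Pull the four-digit year out of the parentheses; missing values stay NaN.
year = pd.to_numeric(raw["Year"].str.extract(r"(\d{4})")[0], errors="coerce")

# Keep the leading number of minutes and drop the ' min' suffix.
duration = pd.to_numeric(raw["Duration"].str.extract(r"(\d+)")[0], errors="coerce")

# Remove thousands separators before converting to a number.
votes = pd.to_numeric(raw["Votes"].str.replace(",", "", regex=False), errors="coerce")

print(pd.DataFrame({"Year": year, "Duration": duration, "Votes": votes}))
```

With `errors="coerce"`, any entry that does not parse as a number becomes NaN instead of raising an error, so unexpected strings are worth checking for separately.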






In [None]:
data.Year.unique()

array([nan, '(2019)', '(2021)', '(2010)', '(1997)', '(2005)', '(2008)',
       '(2012)', '(2014)', '(2004)', '(2016)', '(1991)', '(1990)',
       '(2018)', '(1987)', '(1948)', '(1958)', '(2017)', '(2020)',
       '(2009)', '(2002)', '(1993)', '(1946)', '(1994)', '(2007)',
       '(2013)', '(2003)', '(1998)', '(1979)', '(1951)', '(1956)',
       '(1974)', '(2015)', '(2006)', '(1981)', '(1985)', '(2011)',
       '(2001)', '(1967)', '(1988)', '(1995)', '(1959)', '(1996)',
       '(1970)', '(1976)', '(2000)', '(1999)', '(1973)', '(1968)',
       '(1943)', '(1953)', '(1986)', '(1983)', '(1989)', '(1982)',
       '(1977)', '(1957)', '(1950)', '(1992)', '(1969)', '(1975)',
       '(1947)', '(1972)', '(1971)', '(1935)', '(1978)', '(1960)',
       '(1944)', '(1963)', '(1940)', '(1984)', '(1934)', '(1955)',
       '(1936)', '(1980)', '(1966)', '(1949)', '(1962)', '(1964)',
       '(1952)', '(1933)', '(1942)', '(1939)', '(1954)', '(1945)',
       '(1961)', '(1965)', '(1938)', '(1941)', '(1931)', 

In [None]:
data["Year"] = data["Year"].apply(lambda x: int(str(x)[1:5]) if not isinstance(x, float) else np.nan)
data.Year

0           NaN
1        2019.0
2        2021.0
3        2019.0
4        2010.0
          ...  
15504    1988.0
15505    1999.0
15506    2005.0
15507    1988.0
15508    1998.0
Name: Year, Length: 15509, dtype: float64

In [None]:
data.Duration.unique()

array([nan, '109 min', '90 min', '110 min', '105 min', '147 min',
       '142 min', '59 min', '82 min', '116 min', '96 min', '120 min',
       '161 min', '166 min', '102 min', '87 min', '132 min', '66 min',
       '146 min', '112 min', '168 min', '158 min', '126 min', '94 min',
       '138 min', '124 min', '144 min', '157 min', '136 min', '107 min',
       '113 min', '80 min', '122 min', '149 min', '148 min', '130 min',
       '121 min', '188 min', '115 min', '103 min', '114 min', '170 min',
       '100 min', '99 min', '140 min', '128 min', '93 min', '125 min',
       '145 min', '75 min', '111 min', '134 min', '85 min', '104 min',
       '92 min', '137 min', '127 min', '150 min', '119 min', '135 min',
       '86 min', '76 min', '70 min', '72 min', '151 min', '95 min',
       '52 min', '89 min', '143 min', '177 min', '117 min', '123 min',
       '154 min', '88 min', '175 min', '153 min', '78 min', '139 min',
       '133 min', '101 min', '180 min', '60 min', '46 min', '164 min',
       '

In [None]:
data['Duration'] = data['Duration'].apply(lambda x: int(str(x).split(' ')[0]) if not isinstance(x, float) else np.nan)
data.Duration

0          NaN
1        109.0
2         90.0
3        110.0
4        105.0
         ...  
15504      NaN
15505    129.0
15506      NaN
15507      NaN
15508    130.0
Name: Duration, Length: 15509, dtype: float64

In [None]:
np.set_printoptions(threshold=np.inf)
data.Votes.unique()

array([nan, '8', '35', '827', '1,086', '326', '11', '17', '59', '983',
       '512', '6,619', '162', '72', '63', '26', '6,329', '1,002', '15',
       '1,235', '10', '16', '3,100', '1,559', '1,811', '1,069', '3,223',
       '1,892', '20', '106', '14', '21', '33', '24,034', '21,938', '112',
       '94', '52', '361', '642', '5', '7', '32', '194', '514', '165',
       '2,322', '357,889', '23', '358', '6', '238', '9', '4,373', '392',
       '128', '252', '93', '80', '1,128', '2,548', '75', '36', '82', '19',
       '171', '31', '281', '398', '5,640', '34', '449', '249', '5,459',
       '66', '2,767', '1,901', '38', '412', '4,637', '179', '202',
       '4,145', '181', '5,227', '142', '627', '337', '24', '75,118',
       '1,621', '866', '348', '115', '339', '28', '264', '150', '18',
       '69', '568', '196', '97', '149', '62', '266', '357', '29', '227',
       '13', '42', '22', '101', '30', '381', '274', '275', '25', '448',
       '586', '40', '144', '46', '65', '37', '79', '88', '2,998', '15

### While evaluating the code snippet below, one special case turned up: a single entry in the Votes column contains '$5.16M'. Since this is the only value that differs from the numeric format of the rest of the column, we replace it with 516 so that the column can be converted consistently.
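A special case like this can be located by checking every entry against the expected format. The sketch below runs on a short illustrative Series rather than the real column, and assumes that a well-formed vote count contains only digits and commas.

```python
import pandas as pd

# Illustrative values; '$5.16M' is the anomaly described above.
votes_raw = pd.Series(["8", "1,086", "$5.16M", None, "357,889"])

# A well-formed entry contains only digits and commas; missing values count as "not matched".
is_well_formed = votes_raw.str.fullmatch(r"[\d,]+", na=False)

# Keep the non-missing entries that break the expected pattern.
malformed = votes_raw[votes_raw.notna() & ~is_well_formed]
print(malformed.tolist())
```

Running a check of this kind before the conversion shows how many entries need special handling.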

In [None]:
data['Votes'] = data['Votes'].replace('$5.16M', 516)
data['Votes'] = data['Votes'].apply(lambda x: int(str(x).replace(',', '')) if not isinstance(x, float) else np.nan)
data.Votes.unique()

array([        nan, 8.00000e+00, 3.50000e+01, 8.27000e+02, 1.08600e+03,
       3.26000e+02, 1.10000e+01, 1.70000e+01, 5.90000e+01, 9.83000e+02,
       5.12000e+02, 6.61900e+03, 1.62000e+02, 7.20000e+01, 6.30000e+01,
       2.60000e+01, 6.32900e+03, 1.00200e+03, 1.50000e+01, 1.23500e+03,
       1.00000e+01, 1.60000e+01, 3.10000e+03, 1.55900e+03, 1.81100e+03,
       1.06900e+03, 3.22300e+03, 1.89200e+03, 2.00000e+01, 1.06000e+02,
       1.40000e+01, 2.10000e+01, 3.30000e+01, 2.40340e+04, 2.19380e+04,
       1.12000e+02, 9.40000e+01, 5.20000e+01, 3.61000e+02, 6.42000e+02,
       5.00000e+00, 7.00000e+00, 3.20000e+01, 1.94000e+02, 5.14000e+02,
       1.65000e+02, 2.32200e+03, 3.57889e+05, 2.30000e+01, 3.58000e+02,
       6.00000e+00, 2.38000e+02, 9.00000e+00, 4.37300e+03, 3.92000e+02,
       1.28000e+02, 2.52000e+02, 9.30000e+01, 8.00000e+01, 1.12800e+03,
       2.54800e+03, 7.50000e+01, 3.60000e+01, 8.20000e+01, 1.90000e+01,
       1.71000e+02, 3.10000e+01, 2.81000e+02, 3.98000e+02, 5.640

In [None]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Year,14981.0,1987.012215,25.416689,1913.0,1968.0,1991.0,2009.0,2022.0
Duration,7240.0,128.126519,28.912724,2.0,110.0,131.0,147.0,321.0
Rating,7919.0,5.841621,1.381777,1.1,4.9,6.0,6.8,10.0
Votes,7920.0,1938.340783,11601.694372,5.0,16.0,55.0,404.0,591417.0


In [None]:
data.isnull().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

In [None]:
def scatter_plot(x):
    fig = px.scatter(x=data[x], y=data['Rating'], title = f"{x} vs Rating",
                    labels = {'x': f'{x} of the Movie', 'y':'Rating'}, color_discrete_sequence=['blue'])
    fig.update_traces(marker=dict(size=10,
                                line=dict(width=3,
                                            color='black')))
    fig.show()

## Duration of movies has no linear correlation with their ratings.

In [None]:
scatter_plot('Duration')

## The scatter plot indicates that Drama is the most common genre in the dataset

In [None]:
scatter_plot('Genre')
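A scatter plot shows genre frequency only indirectly, so a direct count can serve as a cross-check. This is a minimal sketch on made-up labels. In the notebook the same call would be applied to `data['Genre']`.

```python
import pandas as pd

# Illustrative genre labels only.
genres = pd.Series(["Drama", "Drama", "Action", "Drama", "Comedy, Romance", None])

# Count movies per genre label, most frequent first (missing values are ignored).
genre_counts = genres.value_counts()
print(genre_counts.head())

# The most frequent label is the first entry of the index.
most_common = genre_counts.index[0]
```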

In [None]:
scatter_plot('Director')

In [None]:
scatter_plot('Votes')

# Insights :-

## 1. We can discern from the scatter plot above that there isn't a linear relationship between the duration and the movie's rating.

## 2. We cannot conclude that Drama or Action, Crime, Thriller are the best-rated genres: the dataset simply contains more movies in these genres, so they have more points at every rating level. A comparison based on this plot alone would not be valid.

## 3. The ratings appear evenly distributed across directors.

## 4. The scatter plot for Votes vs Rating suggests that movies with more votes tend to have higher ratings. However, there are other factors that also play a role in a movie's rating, such as its genre, release date, and popularity.
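The visual impressions above can be backed with numbers by comparing Pearson correlation (linear association) with Spearman correlation (monotonic, rank-based association). The sketch below uses invented values, so its output says nothing about the real dataset. It only shows how the two measures can disagree when a variable such as Votes is heavily skewed.

```python
import pandas as pd

# Illustrative numbers only; in the notebook these would be columns of `data`.
toy = pd.DataFrame({
    "Duration": [90, 110, 130, 150, 170, 120],
    "Votes": [5, 50, 500, 5000, 50000, 20],
    "Rating": [5.5, 6.0, 6.2, 6.5, 7.0, 5.8],
})

# Correlation of each feature with Rating under both methods.
pearson = toy.corr(method="pearson")["Rating"].drop("Rating")
spearman = toy.corr(method="spearman")["Rating"].drop("Rating")

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

In this toy frame Votes and Rating are perfectly rank-ordered, so the Spearman value is 1 while the Pearson value is lower because the relationship is far from a straight line.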

In [None]:
corr_matrix = data.corr(method='spearman', numeric_only=True)
fig = px.imshow(corr_matrix,
                x = corr_matrix.columns,
                y = corr_matrix.columns,
                color_continuous_scale="Blues", text_auto = True)
fig.update_layout(width=800, height=600)
fig.show()


## Based on the graph above, it appears that there is a weak or negligible correlation among the variables in our dataset.

# Exploratory Data Analysis

In [None]:
data.dropna(subset=['Rating','Year'], inplace=True)

In [None]:
data['Votes'] = data['Votes'].astype(int)

In [None]:
data['Duration'] = data['Duration'].fillna(data['Duration'].mean())

In [None]:
data.dropna(subset=['Actor 1','Actor 2','Actor 3','Director'],inplace=True)

In [None]:
data['Year']= data['Year'].astype(int)

In [None]:
data['Duration'] = data['Duration'].astype(int)

In [None]:
data

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,#Gadhvi (He thought he was Gandhi),2019,109,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,#Yaaram,2019,110,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,...Aur Pyaar Ho Gaya,1997,147,"Comedy, Drama, Musical",4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,...Yahaan,2005,142,"Drama, Romance, War",7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,?: A Question Mark,2012,82,"Horror, Mystery, Thriller",5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia
...,...,...,...,...,...,...,...,...,...,...
15501,Zulm Ki Hukumat,1992,132,"Action, Crime, Drama",5.3,135,Bharat Rangachary,Dharmendra,Moushumi Chatterjee,Govinda
15503,Zulm Ki Zanjeer,1989,125,"Action, Crime, Drama",5.8,44,S.P. Muthuraman,Chiranjeevi,Jayamalini,Rajinikanth
15504,Zulm Ko Jala Doonga,1988,132,Action,4.6,11,Mahendra Shah,Naseeruddin Shah,Sumeet Saigal,Suparna Anand
15505,Zulmi,1999,129,"Action, Drama",4.5,655,Kuku Kohli,Akshay Kumar,Twinkle Khanna,Aruna Irani


## Top 10 rated movies

In [None]:
data.loc[data['Rating'].sort_values(ascending=False)[:10].index]

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
8339,Love Qubool Hai,2020,94,"Drama, Romance",10.0,5,Saif Ali Sayeed,Ahaan Jha,Mahesh Narayan,Rajasree Rajakumari
5410,Half Songs,2021,79,"Music, Romance",9.7,7,Sriram Raja,Raj Banerjee,Emon Chatterjee,Purshottam Mulani
2563,Breed,2020,132,Drama,9.6,48,Bobby Kumar,Bobby Kumar,Ashfaq,Fasih Choudhry
6852,June,2021,93,Drama,9.4,18,Suhrud Godbole,Vaibhav Khisti,Nilesh Divekar,Jitendra Joshi
14222,The Reluctant Crime,2020,113,Drama,9.4,16,Arvind Pratap,Dharmendra Ahir,Awanish Kotnal,Rakhi Mansha
5077,Gho Gho Rani,2019,105,"History, Romance",9.4,47,Munni Pankaj,Nishi Neha Mishra,Pankaj Kamal,Akash Kumar
11843,Refl3ct,2021,65,Sci-Fi,9.3,467,Nikhil Mahar,Vijay Mahar,Vijay Mahar,Nikhil Mahar
1729,Baikunth,2021,72,Family,9.3,29,Vishwa Bhanu,Vishwa Bhanu,Sangam Shukla,Vijay Thakur
13231,Sindhustan,2019,64,"Documentary, Family, History",9.3,36,Sapna Bhavnani,Leila Advani,Laj Badlani,Chaho Bhara
9105,Meher,2020,132,Drama,9.3,27,Rajat Bhardwaj,Amrit,Dimple Chauhan,Sapna Das


## Average movie ratings by year


### *   Best Ratings - 1948
### *   Worst Ratings - 2002



In [None]:
# Average rating per release year
avg_rating_by_year = data.groupby('Year')['Rating'].mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x = avg_rating_by_year.index, y = avg_rating_by_year, mode='lines', name='Average Rating', line=dict(color='#2ca02c')))

fig.update_layout(
    xaxis=dict(title='Year', tickvals=list(range(1917, 2023, 4))),
    yaxis=dict(title='Average Rating', range=[0, avg_rating_by_year.max()]),
    title='Average Rating by Year',
    legend=dict(x=0, y=1),
)
fig.show()

In [None]:
# Average number of votes per release year
avg_votes_by_year = data.groupby('Year')['Votes'].mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x = avg_votes_by_year.index, y = avg_votes_by_year, mode='lines', name='Average Votes', line=dict(color='blue')))

fig.update_layout(
    xaxis=dict(title='Year', tickvals=list(range(1917, 2023, 4))),
    yaxis=dict(title='Average Votes', range=[0, avg_votes_by_year.max()]),
    title='Average Votes by Year',
    legend=dict(x=0, y=1),
)
fig.show()

In [None]:
# Average number of votes per rating value
avg_votes_by_rating = data.groupby('Rating')['Votes'].mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x = avg_votes_by_rating.index, y = avg_votes_by_rating, mode='lines', name='Average Votes', line=dict(color='blue')))

fig.update_layout(
    xaxis=dict(title='Rating', tickvals= np.arange(0.0,10.5,0.5)),
    yaxis=dict(title='Average Votes', range=[0, avg_votes_by_rating.max()]),
    title='Average Votes by Rating',
    legend=dict(x=0, y=1),
)
fig.show()

# Data Preprocessing

## To predict ratings, I will replace each genre with the average rating for all movies in that specific genre, and I will follow the same process for directors and actors.
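Replacing a category with the mean of the target is often called target (or mean) encoding. When the means are computed on all rows, each movie's own rating leaks into its features. One common variation, shown here as a sketch and not as the method used in this notebook, learns the mapping from the training split only. The toy rows, the `Genre_enc` column name, and the global-mean fallback are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative rows only.
toy = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Action", "Action", "Comedy", "Drama", "Action", "Horror"],
    "Rating": [7.0, 6.0, 5.0, 4.0, 6.5, 8.0, 5.5, 3.0],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Learn the genre -> mean rating mapping from the training rows only.
genre_means = train.groupby("Genre")["Rating"].mean()
global_mean = train["Rating"].mean()

# Apply the mapping to both splits; genres unseen in training fall back to the global mean.
train["Genre_enc"] = train["Genre"].map(genre_means)
test["Genre_enc"] = test["Genre"].map(genre_means).fillna(global_mean)

print(test[["Genre", "Rating", "Genre_enc"]])
```

Even in this variation the training rows are encoded with means that include their own rating. Computing the means out-of-fold would reduce that effect further.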

### Genre Analysis

In [None]:
genre = data.groupby('Genre').agg({'Rating':['mean','count']})
genre.reset_index(inplace=True)
genre.columns = ['Genre','Avg Rating','Movie Count']
genre['Avg Rating'] = genre['Avg Rating'].round(1)
genre

Unnamed: 0,Genre,Avg Rating,Movie Count
0,Action,5.0,391
1,"Action, Adventure",5.6,24
2,"Action, Adventure, Biography",7.8,1
3,"Action, Adventure, Comedy",5.6,40
4,"Action, Adventure, Crime",5.6,16
...,...,...,...
411,"Thriller, Action",4.3,1
412,"Thriller, Musical, Mystery",7.1,1
413,"Thriller, Mystery",6.5,3
414,"Thriller, Mystery, Family",6.1,1


In [None]:
genre.isnull().sum()

Genre          0
Avg Rating     0
Movie Count    0
dtype: int64

### Directors Analysis

In [None]:
directors  = data.groupby('Director').agg({'Rating':['mean','count']})
directors.columns = directors.columns.droplevel(0)
directors.reset_index(inplace=True)
directors.columns = ['Director','Average Rating','Movie count']
directors['Average Rating'] = directors['Average Rating'].round(1)
directors.sort_values(by='Movie count',ascending=False,inplace=True)
directors.head()

Unnamed: 0,Director,Average Rating,Movie count
1346,Mahesh Bhatt,5.5,45
589,David Dhawan,5.2,43
905,Hrishikesh Mukherjee,7.1,42
2423,Shakti Samanta,6.6,38
1163,Kanti Shah,4.9,37


### Actors Analysis

In [None]:
df_melted = data.melt(id_vars='Rating', value_name='actor', var_name='role', value_vars=['Actor 1', 'Actor 2', 'Actor 3'])
actor_scores = df_melted.groupby('actor')['Rating'].agg(['mean', 'count'])
actor_scores.reset_index(inplace=True)
actor_scores.columns = ['Actor','Average Score', 'Number of movies']
actor_scores.sort_values('Number of movies', ascending=False, inplace=True)
actor_scores['Average Score'] = actor_scores['Average Score'].round(1)
actor_scores.sort_values(by='Average Score',ascending=False,inplace=True)
actor_scores

Unnamed: 0,Actor,Average Score,Number of movies
2718,Mahesh Narayan,10.0,1
3983,Rajasree Rajakumari,10.0,1
196,Ahaan Jha,10.0,1
3946,Raj Banerjee,9.7,1
3853,Purshottam Mulani,9.7,1
...,...,...,...
4588,Sameer Malhotra,1.7,1
2978,Meghna Desai,1.7,1
4277,Richard Harris,1.6,1
2055,Jasmine Kaur,1.6,1


In [None]:
genre_dict = dict(zip(genre['Genre'],genre['Avg Rating']))
directors_dict = dict(zip(directors['Director'],directors['Average Rating']))
actor_score_dict = dict(zip(actor_scores['Actor'], actor_scores['Average Score']))

In [None]:
df = data.select_dtypes(include=np.number)
df

Unnamed: 0,Year,Duration,Rating,Votes
1,2019,109,7.0,8
3,2019,110,4.4,35
5,1997,147,4.7,827
6,2005,142,7.4,1086
8,2012,82,5.6,326
...,...,...,...,...
15501,1992,132,5.3,135
15503,1989,125,5.8,44
15504,1988,132,4.6,11
15505,1999,129,4.5,655


In [None]:
scalar = MinMaxScaler()
df = pd.DataFrame(scalar.fit_transform(df),columns = df.columns)

In [None]:
data = data.drop(['Name'],axis=1)

data['Genre'] = data['Genre'].map(genre_dict)
data['Director'] = data['Director'].map(directors_dict)
data['Actor 1'] = data['Actor 1'].map(actor_score_dict)
data['Actor 2'] = data['Actor 2'].map(actor_score_dict)
data['Actor 3'] = data['Actor 3'].map(actor_score_dict)
data

Unnamed: 0,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,2019,109,6.3,7.0,8,7.0,6.6,7.0,7.0
3,2019,110,5.7,4.4,35,4.4,5.7,4.4,4.4
5,1997,147,6.2,4.7,827,5.4,4.9,5.9,6.5
6,2005,142,6.8,7.4,1086,7.5,5.6,5.4,6.7
8,2012,82,5.5,5.6,326,5.6,5.6,5.8,5.6
...,...,...,...,...,...,...,...,...,...
15501,1992,132,5.6,5.3,135,5.6,5.8,6.1,4.9
15503,1989,125,5.6,5.8,44,5.9,6.4,6.6,5.7
15504,1988,132,5.0,4.6,11,4.1,6.2,4.1,6.2
15505,1999,129,5.5,4.5,655,5.2,5.5,4.9,5.6


In [None]:
data.dropna(inplace=True)

In [None]:
data[['Rating','Votes','Year']] = scalar.fit_transform(data[['Rating','Votes','Year']])

In [None]:
data[['Genre','Director','Duration','Actor 1','Actor 2','Actor 3']] = scalar.fit_transform(data[['Genre','Director','Duration','Actor 1','Actor 2','Actor 3']])
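The cells above fit the scaler on the full dataset before the train/test split. An alternative, sketched here on random illustrative numbers and not part of the original workflow, is to fit the scaler on the training features only and reuse it for the test features. The `feature_scaler` name and the generated data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Illustrative feature matrix (two columns on different scales) and target.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1930, 2022, size=20),
    rng.integers(5, 5000, size=20),
]).astype(float)
y = rng.uniform(1.0, 10.0, size=20)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training features only, then reuse it for the test features.
feature_scaler = MinMaxScaler()
X_train_scaled = feature_scaler.fit_transform(X_train)
X_test_scaled = feature_scaler.transform(X_test)

# Training columns span [0, 1]; test values may fall slightly outside that range.
print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Keeping the test set out of the `fit` step means the reported test metrics are not influenced by test-set minimum and maximum values.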

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7558 entries, 1 to 15508
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      7558 non-null   float64
 1   Duration  7558 non-null   float64
 2   Genre     7558 non-null   float64
 3   Rating    7558 non-null   float64
 4   Votes     7558 non-null   float64
 5   Director  7558 non-null   float64
 6   Actor 1   7558 non-null   float64
 7   Actor 2   7558 non-null   float64
 8   Actor 3   7558 non-null   float64
dtypes: float64(9)
memory usage: 590.5 KB


In [None]:
corr_df = data.corr(numeric_only=True)
corr_df['Rating'].sort_values(ascending=False)

Rating      1.000000
Director    0.789429
Actor 3     0.690653
Actor 2     0.683428
Actor 1     0.681067
Genre       0.414580
Votes       0.134655
Duration    0.004747
Year       -0.194990
Name: Rating, dtype: float64

In [None]:
fig = px.imshow(corr_df,
                x = corr_df.columns,
                y = corr_df.columns,
                color_continuous_scale="Blues", text_auto = True)
fig.update_layout(width=1000, height=800)
fig.show()


# Model Building

In [None]:
models = {
    'RandomForestRegressor': RandomForestRegressor(),
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor()
}

params = {
    'RandomForestRegressor': { 'n_estimators': [75,100,125,150], 'max_features': ['sqrt', 'log2'] },
    'LinearRegression': {  },
    'DecisionTreeRegressor': {'max_depth': [7, 8, 9, 10, 20] ,'random_state': [42]}
}

In [None]:
X = data.drop('Rating',axis=1)
y = data['Rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model_name, model in models.items():

    model_to_tune = GridSearchCV(model, params[model_name], cv=5)
    model_to_tune.fit(X_train, y_train)

    y_pred = model_to_tune.predict(X_train)
    y_pred_test = model_to_tune.predict(X_test)

    print('---'*50, end='\n')
    print('Model Name: ', model_name)
    print(f"Best parameters: {model_to_tune.best_params_}")
    print(f"Best score: {model_to_tune.best_score_}")
    print('R2 score for training data: ',r2_score(y_train,y_pred))
    print('R2 score for testing data: ',r2_score(y_test,y_pred_test))
    print('Mean squared error: ',mean_squared_error(y_test,y_pred_test))
    print('Mean absolute error: ',mean_absolute_error(y_test,y_pred_test))
    print('---'*50, end='\n')

------------------------------------------------------------------------------------------------------------------------------------------------------
Model Name:  RandomForestRegressor
Best parameters: {'max_features': 'log2', 'n_estimators': 125}
Best score: 0.763832440572546
R2 scorefor training data:  0.9668227292965451
R2 score for testing data:  0.7674068637640283
Mean squared error:  0.005440692579967926
Mean absolute error:  0.052341537364009275
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Model Name:  LinearRegression
Best parameters: {}
Best score: 0.7154700713890136
R2 scorefor training data:  0.7175629897029184
R2 score for testing data:  0.714308626772798
Mean squared error:  0.00668273776101952
Mean absolute error:  0.0

# Conclusion:

### The analysis of the IMDb India Movies dataset highlights several factors associated with a movie's rating, including the director, actors, and genre. Of the three regression models evaluated (RandomForestRegressor, LinearRegression, and DecisionTreeRegressor), the RandomForestRegressor performed best: it had the highest R2 score on the testing data and the lowest mean squared error and mean absolute error. It shows promise for rating prediction on this dataset, but its training R2 (about 0.97) is well above its testing R2 (about 0.77), which points to overfitting.
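Because Rating was min-max scaled before modeling, the reported errors are in scaled units. The sketch below shows one way to express the random forest's test MAE in rating points. It assumes the ratings in the modeling data span 1.1 to 10.0, as in the raw `describe()` output. The range after cleaning may differ, so the result is indicative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative ratings; only the assumed minimum (1.1) and maximum (10.0) matter here.
ratings = np.array([[1.1], [4.9], [6.0], [6.8], [10.0]])

target_scaler = MinMaxScaler().fit(ratings)

# Min-max scaling divides by (max - min), so absolute errors scale by the same factor.
rating_range = target_scaler.data_range_[0]

mae_scaled = 0.052341537364009275  # random forest test MAE reported above
mae_in_rating_points = mae_scaled * rating_range
print(round(mae_in_rating_points, 2))
```

Under that assumption the random forest is off by roughly 0.47 rating points on average on the test set.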