# IMDB MOVIES INDIA RATING PREDICTION

### Contents of the Project:

>1. <a href="#Description">Description</a>
2. <a href="#Importing-the-Essential-Libraries">Importing the Essential Libraries</a>
3. <a href="#Creating-DataFrame">Creating DataFrame</a>
4. <a href="#Exploring-the-DataFrame">Exploring the DataFrame</a>
5. <a href="#Null-Values">Null Values</a>
6. <a href="#Values-in-Genre-Column">Values in Genre Column</a>
7. <a href="#Droping-Unwanted-Columns">Dropping Unwanted Columns</a>
8. <a href="#Reindexing-the-Column">Reindexing the Column</a>
9. <a href="#Percentage-of-Missing-Data">Percentage of Missing Data</a>
10. <a href="#Preprocessing-the-Votes-column">Preprocessing the Votes Column</a>
11. <a href="#Top-5-Directors">Top 5 Directors</a>
12. <a href="#Top-5-Actor-1">Top 5 Actor 1</a>
13. <a href="#Top-5-Actor-2">Top 5 Actor 2</a>
14. <a href="#Top-5-Actor-3">Top 5 Actor 3</a>
15. <a href="#Splitting-the-Dataset">Splitting the Dataset</a>
16. <a href="#Numerical-Pipeline">Numerical Pipeline</a>
17. <a href="#Categorical-Column">Categorical Column</a>
18. <a href="#Combining-both-Pipelines">Combining both Pipelines</a>
19. <a href="#Applying-Linear-Regression">Applying Linear Regression</a>
20. <a href="#Fitting-in-Train-Dataset">Fitting in Train Dataset</a>
21. <a href="#Predicting-the-X_test">Predicting the X_test</a>
22. <a href="#Predicting-the-random-columns">Predicting the Random Data</a>

### Description

> Every dataset has a story, and this one is pulled from IMDb.com and covers all the Indian movies on the platform. Clean the data by removing missing values or filling them with average values; this makes the data easier to work with during EDA.

>Build a model that predicts the rating of a movie based on
features like genre, director, and actors. You can use regression
techniques to tackle this problem.

>![convert notebook to web app](https://storage.googleapis.com/kaggle-datasets-images/1416444/2346296/3903011cecd40b873ea4f106b8aca27b/dataset-cover.jpg?t=2021-06-18-01-08-01)
<br>
The goal is to analyze historical movie data and develop a model
that accurately estimates the rating given to a movie by users or
critics.
The Movie Rating Prediction project lets you practice data
analysis, preprocessing, feature engineering, and machine
learning modeling. It offers insight into the factors
that influence movie ratings and lets you build a model that
estimates those ratings.
<br>

### Importing the Essential Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

### Creating DataFrame

In [2]:
df=pd.read_csv(r"K:\Ml Dataset\IMDb Movies India.csv",encoding='latin-1')
s,k=df.shape
print('Number of Rows: ',s)
print('Number of Columns: ',k)

Number of Rows:  15509
Number of Columns:  10


### Exploring the DataFrame

In [3]:
df.head(5)

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,0.0,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),-2019.0,109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,-2021.0,90 min,"Drama, Musical",0.0,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,-2019.0,110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,-2010.0,105 min,Drama,0.0,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali


In [4]:
df.columns

Index(['Name', 'Year', 'Duration', 'Genre', 'Rating', 'Votes', 'Director',
       'Actor 1', 'Actor 2', 'Actor 3'],
      dtype='object')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  float64
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    15509 non-null  float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(2), object(8)
memory usage: 1.2+ MB


### Null Values

In [6]:
df.isnull().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating         0
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

### Values in Genre Column

In [7]:
df['Genre'].value_counts()

Drama                         2780
Action                        1289
Thriller                       779
Romance                        708
Drama, Romance                 524
                              ... 
Action, Musical, War             1
Horror, Crime, Thriller          1
Animation, Comedy                1
Romance, Action, Crime           1
Adventure, Fantasy, Sci-Fi       1
Name: Genre, Length: 485, dtype: int64

In [8]:
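# Build two genre columns from the comma-separated Genre string

tempgen=df['Genre'].str.split(',',expand=True).iloc[:,0:2]
tempgen.columns=['genre_1','genre_2']
# Fall back to the first genre when a movie lists only one
# (assignment instead of a chained inplace fillna, which newer pandas versions discourage)
tempgen['genre_2']=tempgen['genre_2'].fillna(tempgen['genre_1'])
tempgen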

Unnamed: 0,genre_1,genre_2
0,Drama,Drama
1,Drama,Drama
2,Drama,Musical
3,Comedy,Romance
4,Drama,Drama
...,...,...
15504,Action,Action
15505,Action,Drama
15506,Action,Action
15507,Action,Action
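
Splitting on a bare comma leaves a leading space on every genre after the first (for example `' Musical'` from `'Drama, Musical'`). The following is a minimal sketch, on a hypothetical three-row frame rather than the real dataset, of stripping that whitespace before filling the second genre:

```python
import pandas as pd

# Hypothetical sample that mimics the Genre column (not the real data)
sample = pd.DataFrame({"Genre": ["Drama", "Drama, Musical", None]})

parts = sample["Genre"].str.split(",", expand=True).iloc[:, 0:2]
parts.columns = ["genre_1", "genre_2"]

# Remove the whitespace left behind by the comma split
parts = parts.apply(lambda col: col.str.strip())

# Reuse the first genre when no second genre is listed
parts["genre_2"] = parts["genre_2"].fillna(parts["genre_1"])
print(parts)
```

Rows whose `Genre` is missing stay missing in both columns, which matches the behavior of the cell above.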


In [9]:
# Split 'Duration' (e.g. '109 min') into the number and the unit
dur=df['Duration'].str.split(' ',expand=True).iloc[:,0:2]
dur.columns=['duration(min)','non']
dur

Unnamed: 0,duration(min),non
0,,
1,109,min
2,90,min
3,110,min
4,105,min
...,...,...
15504,,
15505,129,min
15506,,
15507,,
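
The split above keeps `duration(min)` as strings. If a numeric column is wanted before modeling, one option is to extract the digits and convert them with `pd.to_numeric`. This is a sketch on a hypothetical sample series, not a step taken in this notebook:

```python
import pandas as pd

# Hypothetical sample that mimics the Duration column
sample = pd.Series(["109 min", None, "90 min"], name="Duration")

# Pull out the leading digits, then convert to numbers (missing stays missing)
minutes = pd.to_numeric(sample.str.extract(r"(\d+)", expand=False), errors="coerce")
print(minutes)
```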


In [10]:
df=pd.concat([df,tempgen],axis=1)
df

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3,genre_1,genre_2
0,,,,Drama,0.0,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia,Drama,Drama
1,#Gadhvi (He thought he was Gandhi),-2019.0,109 min,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid,Drama,Drama
2,#Homecoming,-2021.0,90 min,"Drama, Musical",0.0,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana,Drama,Musical
3,#Yaaram,-2019.0,110 min,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor,Comedy,Romance
4,...And Once Again,-2010.0,105 min,Drama,0.0,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali,Drama,Drama
...,...,...,...,...,...,...,...,...,...,...,...,...
15504,Zulm Ko Jala Doonga,-1988.0,,Action,4.6,11,Mahendra Shah,Naseeruddin Shah,Sumeet Saigal,Suparna Anand,Action,Action
15505,Zulmi,-1999.0,129 min,"Action, Drama",4.5,655,Kuku Kohli,Akshay Kumar,Twinkle Khanna,Aruna Irani,Action,Drama
15506,Zulmi Raj,-2005.0,,Action,0.0,,Kiran Thej,Sangeeta Tiwari,,,Action,Action
15507,Zulmi Shikari,-1988.0,,Action,0.0,,,,,,Action,Action


In [11]:
df=pd.concat([df,dur],axis=1)
df

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3,genre_1,genre_2,duration(min),non
0,,,,Drama,0.0,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia,Drama,Drama,,
1,#Gadhvi (He thought he was Gandhi),-2019.0,109 min,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid,Drama,Drama,109,min
2,#Homecoming,-2021.0,90 min,"Drama, Musical",0.0,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana,Drama,Musical,90,min
3,#Yaaram,-2019.0,110 min,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor,Comedy,Romance,110,min
4,...And Once Again,-2010.0,105 min,Drama,0.0,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali,Drama,Drama,105,min
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15504,Zulm Ko Jala Doonga,-1988.0,,Action,4.6,11,Mahendra Shah,Naseeruddin Shah,Sumeet Saigal,Suparna Anand,Action,Action,,
15505,Zulmi,-1999.0,129 min,"Action, Drama",4.5,655,Kuku Kohli,Akshay Kumar,Twinkle Khanna,Aruna Irani,Action,Drama,129,min
15506,Zulmi Raj,-2005.0,,Action,0.0,,Kiran Thej,Sangeeta Tiwari,,,Action,Action,,
15507,Zulmi Shikari,-1988.0,,Action,0.0,,,,,,Action,Action,,


### <a id="Droping-Unwanted-Columns"></a>Dropping Unwanted Columns

In [12]:
df.drop(['Name','Year','Genre','genre_2','non','Duration'],
        axis = 1,
        inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Rating         15509 non-null  float64
 1   Votes          7920 non-null   object 
 2   Director       14984 non-null  object 
 3   Actor 1        13892 non-null  object 
 4   Actor 2        13125 non-null  object 
 5   Actor 3        12365 non-null  object 
 6   genre_1        13632 non-null  object 
 7   duration(min)  7240 non-null   object 
dtypes: float64(1), object(7)
memory usage: 969.4+ KB


### Reindexing the Column

In [13]:
# Reorder the columns so that the target (Rating) comes last
df=df.iloc[:,[1,2,3,4,5,6,7,0]]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Votes          7920 non-null   object 
 1   Director       14984 non-null  object 
 2   Actor 1        13892 non-null  object 
 3   Actor 2        13125 non-null  object 
 4   Actor 3        12365 non-null  object 
 5   genre_1        13632 non-null  object 
 6   duration(min)  7240 non-null   object 
 7   Rating         15509 non-null  float64
dtypes: float64(1), object(7)
memory usage: 969.4+ KB


### Percentage of Missing Data

In [14]:
# Checking the percentage of null values per column so that data imputation can be done
df.isnull().mean()*100

Votes            48.932878
Director          3.385131
Actor 1          10.426204
Actor 2          15.371720
Actor 3          20.272100
genre_1          12.102650
duration(min)    53.317429
Rating            0.000000
dtype: float64

In [15]:
# Number of rows with 5 or more missing values
(df.isnull().sum(axis=1).sort_values(ascending=False)>=5).sum()

1619

In [16]:
# Dropping rows that have fewer than 5 non-null values

df.dropna(thresh=5, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(thresh=5, inplace=True)
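
The warning above most likely appears because `df` was produced by an `iloc` selection, so pandas cannot tell whether the in-place operation targets a copy. A common way to avoid it is to take an explicit `.copy()` and reassign the result instead of using `inplace=True`. The frame below is a hypothetical stand-in for the real data:

```python
import pandas as pd

# Hypothetical frame with the target column first, as in the notebook
frame = pd.DataFrame({
    "Rating": [7.0, 0.0],
    "Votes": ["8", None],
    "Director": ["A", None],
})

# An explicit copy makes the reordered frame independent of the original
reordered = frame.iloc[:, [1, 2, 0]].copy()

# Reassign rather than mutate in place
reordered = reordered.dropna(thresh=2)
print(reordered)
```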


In [17]:
# Rows that remain: those with at most 3 missing values
df.shape

(13179, 8)
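
`thresh` counts non-null values, not missing ones: with 8 columns, `thresh=5` keeps a row only if it has at least 5 non-null values, that is, at most 3 missing. A small sketch on a hypothetical two-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical 8-column frame: row 0 has 3 missing values, row 1 has 4
demo = pd.DataFrame(
    [
        [1, 2, 3, 4, 5, np.nan, np.nan, np.nan],
        [1, 2, 3, 4, np.nan, np.nan, np.nan, np.nan],
    ],
    columns=list("abcdefgh"),
)

missing_per_row = demo.isnull().sum(axis=1)
kept = demo.dropna(thresh=5)

print(missing_per_row.tolist())  # [3, 4]
print(kept.index.tolist())       # [0]
```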

In [18]:
df['genre_1'].value_counts()

Drama          4072
Action         3315
Comedy         1507
Romance         617
Thriller        543
Crime           440
Horror          356
Adventure       237
Documentary     207
Fantasy         176
Musical         149
Family          146
Mystery         139
Biography       130
Animation        72
History          24
Music            10
Sci-Fi            9
Sport             8
War               7
Reality-TV        1
Name: genre_1, dtype: int64

In [19]:
# dropping the duplicate data

df.drop_duplicates(keep='first',inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop_duplicates(keep='first',inplace=True)


In [20]:
df.shape

(13161, 8)

### Preprocessing the Votes column

In [21]:
df['Votes'] = df['Votes'].str.replace(',', '')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Votes'] = df['Votes'].str.replace(',', '')


In [22]:
df['Votes'] = df['Votes'].fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Votes'] = df['Votes'].fillna(0)


In [23]:
df['Votes'].unique()

array([0, '8', '35', ..., '70344', '408', '1496'], dtype=object)

In [24]:
df['Votes']=df['Votes'].astype('int')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Votes']=df['Votes'].astype('int')
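
The three steps above (remove thousands separators, fill missing values with 0, cast to integer) can also be written as one chain. This sketch uses a hypothetical sample series; `errors="coerce"` is an added assumption that turns any unparseable entry into a missing value instead of raising an error:

```python
import pandas as pd

# Hypothetical sample that mimics the Votes column
votes = pd.Series(["8", "1,496", None, "70,344"], name="Votes")

clean = (
    pd.to_numeric(votes.str.replace(",", "", regex=False), errors="coerce")
    .fillna(0)
    .astype(int)
)
print(clean.tolist())  # [8, 1496, 0, 70344]
```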


### Top 5 Directors

In [25]:
df.groupby('Director').Rating.mean().sort_values(ascending=False).head(5)

Director
Saif Ali Sayeed    10.0
Sriram Raja         9.7
Bobby Kumar         9.6
Suhrud Godbole      9.4
Arvind Pratap       9.4
Name: Rating, dtype: float64

### Top 5 Actor 1

In [26]:
df.groupby('Actor 1').Rating.mean().sort_values(ascending=False).head(5)

Actor 1
Ahaan Jha            10.0
Raj Banerjee          9.7
Vaibhav Khisti        9.4
Nishi Neha Mishra     9.4
Dharmendra Ahir       9.4
Name: Rating, dtype: float64

### Top 5 Actor 2

In [27]:
df.groupby('Actor 2').Rating.mean().sort_values(ascending=False).head(5)

Actor 2
Mahesh Narayan     10.0
Emon Chatterjee     9.7
Ashfaq              9.6
Pankaj Kamal        9.4
Awanish Kotnal      9.4
Name: Rating, dtype: float64

### Top 5 Actor 3

In [28]:
df.groupby('Actor 3').Rating.mean().sort_values(ascending=False).head(5)

Actor 3
Rajasree Rajakumari    10.0
Purshottam Mulani       9.7
Fasih Choudhry          9.6
Rakhi Mansha            9.4
Akash Kumar             9.4
Name: Rating, dtype: float64
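
A ranking by mean rating can be dominated by names that appear in very few films. One way to check is to compute the film count alongside the mean and require a minimum number of films. The data and the `min_films` threshold below are hypothetical:

```python
import pandas as pd

# Hypothetical ratings: director A has a single, highly rated film
films = pd.DataFrame({
    "Director": ["A", "B", "B", "B", "C", "C"],
    "Rating": [10.0, 7.0, 8.0, 9.0, 6.0, 7.0],
})

stats = films.groupby("Director")["Rating"].agg(["mean", "count"])

min_films = 2  # hypothetical threshold, tune to the dataset
ranked = stats[stats["count"] >= min_films].sort_values("mean", ascending=False)
print(ranked)
```

The same pattern applies to the `Actor 1`, `Actor 2`, and `Actor 3` columns.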

### Splitting the Dataset

In [29]:
x_train, x_test, y_train, y_test = train_test_split(
    df.drop(labels=["Rating"],axis=1),
    df['Rating'],
    test_size = 0.25,
    random_state = 0)

### Numerical Pipeline

In [30]:
numeric_process=Pipeline(
    steps=[('imputation',SimpleImputer(missing_values=np.nan,strategy="mean")),
          ("Standardise",StandardScaler())])
numeric_process

### Categorical Column

In [31]:
categorical_process=Pipeline(
    steps=[('cat_imputation',SimpleImputer(fill_value="missing",strategy="constant")),
          ('onehot',OneHotEncoder(sparse_output=False,handle_unknown='ignore'))])
categorical_process

In [32]:
df.head(2)

Unnamed: 0,Votes,Director,Actor 1,Actor 2,Actor 3,genre_1,duration(min),Rating
0,0,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia,Drama,,0.0
1,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid,Drama,109.0,7.0


### Combining both Pipelines

In [33]:
preprocessor=ColumnTransformer([
    ('numerical',numeric_process,[0,6]),
    ('categorical',categorical_process,[1,2,3,4,5])
],remainder='passthrough')
preprocessor
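
The transformer above selects columns by integer position, which silently depends on the column order. `ColumnTransformer` also accepts column names when the input is a DataFrame. The sketch below shows that variant on a hypothetical four-row frame with a reduced set of columns; it is an alternative formulation, not the notebook's own cell:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample with a subset of the notebook's feature columns
X = pd.DataFrame({
    "Votes": [8, 35, 0, 655],
    "Director": ["A", "B", np.nan, "A"],
    "genre_1": ["Drama", "Comedy", "Drama", np.nan],
    "duration(min)": [109.0, 110.0, np.nan, 129.0],
})

numeric_cols = ["Votes", "duration(min)"]
categorical_cols = ["Director", "genre_1"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

by_name = ColumnTransformer([
    ("numerical", numeric_steps, numeric_cols),
    ("categorical", categorical_steps, categorical_cols),
])

Xt = by_name.fit_transform(X)
# 2 scaled numeric columns + 3 Director categories + 3 genre categories
print(Xt.shape)
```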

### Applying Linear Regression

In [34]:
from sklearn.pipeline import make_pipeline
# LinearRegression is already imported in the first cell
pipe=make_pipeline(preprocessor,LinearRegression())
pipe

### Fitting in Train Dataset

In [35]:
pipe.fit(x_train,y_train)

### Predicting the X_test

In [36]:
y_pred=pipe.predict(x_test)
y_pred

array([ 7.82287235e+10,  1.27169800e+00,  3.10000610e+00, ...,
       -5.51076371e+11, -5.79009558e+11,  1.57432233e+11])
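
Predictions such as `7.82e+10` and `-5.79e+11` lie far outside the 0 to 10 rating scale. One plausible cause (an assumption, not verified against this dataset) is that unregularized least squares becomes numerically unstable when the one-hot encoding produces thousands of sparse, nearly unique columns. The sketch below, on synthetic data, swaps in `Ridge` (a regularized linear model, not the notebook's `LinearRegression`) and reports the mean absolute error on the test split. The data, the reduced column set, and `alpha=1.0` are all hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the movie data (ratings are random, so the score is not meaningful)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Votes": rng.integers(0, 1000, size=n),
    "Director": rng.choice([f"director_{i}" for i in range(50)], size=n),
})
y = pd.Series(rng.uniform(1, 10, size=n), name="Rating")

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pre = ColumnTransformer([
    ("num", StandardScaler(), ["Votes"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Director"]),
])

# The L2 penalty keeps the coefficients of rare categories small
model = make_pipeline(pre, Ridge(alpha=1.0))
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("prediction range:", y_pred.min(), y_pred.max())
```

Comparing the prediction range with the rating scale is a quick sanity check before looking at any error metric.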

### Predicting the random columns

In [38]:
# Predict the rating for a single hand-written row (same column order as x_train)
pipe.predict([[8,'Gaurav Bakshi','Rasika Dugal','Vivek Ghamande','Arvind Jangid','Drama',109]])



array([7.12490845])
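
The row predicted above uses names that also occur in the dataset. Because the encoder is built with `handle_unknown='ignore'`, a name never seen during fitting does not raise an error; its one-hot columns are simply all zero. A minimal sketch on a hypothetical four-row training frame with a reduced set of columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data (not the real dataset)
train = pd.DataFrame({
    "Votes": [8, 35, 655, 11],
    "Director": ["A", "B", "A", "C"],
})
ratings = [7.0, 4.4, 4.5, 4.6]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Director"])],
    remainder="passthrough",  # Votes is passed through unchanged
)
model = make_pipeline(pre, LinearRegression()).fit(train, ratings)

# A one-row DataFrame with named columns and an unseen director
new_row = pd.DataFrame([{"Votes": 100, "Director": "Unseen"}])
pred = model.predict(new_row)
print(pred.shape)  # (1,)
```

Passing a DataFrame with named columns also makes the input order explicit, unlike a bare nested list.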