#MOVIE RATING PREDICTION WITH PYTHON

#Instructions:

Build a model that predicts the rating of a movie from features such as genre, director, and actors, using regression techniques. The goal is to analyze historical movie data and develop a model that accurately estimates the rating users or critics give to a movie. The project covers data analysis, preprocessing, feature engineering, and machine learning modeling, and gives insight into the factors that influence movie ratings.

#Provide some trends:

* Year with the best rating
* Does the length of a movie have any impact on its rating?
* Top 10 movies according to rating, per year and overall
* Number of popular movies released each year
* Number of votes for the movies that performed best in rating, per year and overall
* Which director directed the most movies
* Which actor starred in each movie
* Any other trends or future predictions you can find

#Importing libraries, reading the dataset, and taking a first look at the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('IMDb Movies India.csv', encoding='latin-1')
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali


In [None]:
#Getting the Shape
print('Dataset Shape:', df.shape)

Dataset Shape: (15509, 10)


In [None]:
#Getting the Dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


In [None]:
#Getting the Null values
df.isnull().sum()

Unnamed: 0,0
Name,0
Year,528
Duration,8269
Genre,1877
Rating,7590
Votes,7589
Director,525
Actor 1,1617
Actor 2,2384
Actor 3,3144


In [None]:
#Getting Duplicate values
df.duplicated().sum()

6

In [None]:
#Summary statistics of the numeric columns
df.describe()

Unnamed: 0,Rating
count,7919.0
mean,5.841621
std,1.381777
min,1.1
25%,4.9
50%,6.0
75%,6.8
max,10.0


#Making a copy so the original data stays untouched

In [None]:
df1 = df.copy()

#Data cleaning

In [None]:
#Handling Null values
df1.dropna(inplace=True)

In [None]:
df1.isnull().sum()

Unnamed: 0,0
Name,0
Year,0
Duration,0
Genre,0
Rating,0
Votes,0
Director,0
Actor 1,0
Actor 2,0
Actor 3,0


In [None]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5659 entries, 1 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      5659 non-null   object 
 1   Year      5659 non-null   object 
 2   Duration  5659 non-null   object 
 3   Genre     5659 non-null   object 
 4   Rating    5659 non-null   float64
 5   Votes     5659 non-null   object 
 6   Director  5659 non-null   object 
 7   Actor 1   5659 non-null   object 
 8   Actor 2   5659 non-null   object 
 9   Actor 3   5659 non-null   object 
dtypes: float64(1), object(9)
memory usage: 486.3+ KB


In [None]:
#Handling the duplicates
df1.drop_duplicates(inplace=True)
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5659 entries, 1 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      5659 non-null   object 
 1   Year      5659 non-null   object 
 2   Duration  5659 non-null   object 
 3   Genre     5659 non-null   object 
 4   Rating    5659 non-null   float64
 5   Votes     5659 non-null   object 
 6   Director  5659 non-null   object 
 7   Actor 1   5659 non-null   object 
 8   Actor 2   5659 non-null   object 
 9   Actor 3   5659 non-null   object 
dtypes: float64(1), object(9)
memory usage: 486.3+ KB


In [None]:
#checking duplicates
df1.duplicated().sum()

0

#Removing unwanted symbols from columns "Name", "Year", "Duration", and "Votes", and changing the data type of "Year", "Duration", and "Votes"

In [None]:
df1.head(5)

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,...Aur Pyaar Ho Gaya,(1997),147 min,"Comedy, Drama, Musical",4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,...Yahaan,(2005),142 min,"Drama, Romance, War",7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,?: A Question Mark,(2012),82 min,"Horror, Mystery, Thriller",5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia


In [None]:
df1['Name'] = df1['Name'].str.replace(r"[^A-Za-z\s'\-]", '', regex=True)
df1['Year'] = df1['Year'].str.replace(r'[()]', '', regex=True)
df1['Duration'] = df1['Duration'].str.replace(' min', '', regex=False)
df1['Votes'] = df1['Votes'].str.replace(',', '', regex=False)
df1.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,Gadhvi He thought he was Gandhi,2019,109,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,Yaaram,2019,110,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,Aur Pyaar Ho Gaya,1997,147,"Comedy, Drama, Musical",4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,Yahaan,2005,142,"Drama, Romance, War",7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,A Question Mark,2012,82,"Horror, Mystery, Thriller",5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia


In [None]:
#Changing the data type from object to int
df1['Year'] = df1['Year'].astype(int)
df1['Duration'] = df1['Duration'].astype(int)
df1['Votes'] = df1['Votes'].astype(int)
df1.info()


<class 'pandas.core.frame.DataFrame'>
Index: 5659 entries, 1 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      5659 non-null   object 
 1   Year      5659 non-null   int64  
 2   Duration  5659 non-null   int64  
 3   Genre     5659 non-null   object 
 4   Rating    5659 non-null   float64
 5   Votes     5659 non-null   int64  
 6   Director  5659 non-null   object 
 7   Actor 1   5659 non-null   object 
 8   Actor 2   5659 non-null   object 
 9   Actor 3   5659 non-null   object 
dtypes: float64(1), int64(3), object(6)
memory usage: 615.4+ KB
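
The `astype(int)` calls above work because every remaining value is already a clean digit string. As a minimal sketch on made-up rows, a more defensive variant could use `pd.to_numeric(..., errors='coerce')`, so that malformed values become `NaN` instead of raising. The helper name `to_number` and the sample values are illustrative assumptions, not part of the dataset or the original notebook.

```python
import pandas as pd

# Toy rows that mimic the raw column formats; values are illustrative only.
raw = pd.DataFrame({
    "Year": ["(2019)", "(1997)", None],
    "Duration": ["109 min", "147 min", "n/a"],
    "Votes": ["8", "1,086", "35"],
})

def to_number(series: pd.Series) -> pd.Series:
    """Strip every non-digit character, then convert; bad values become NaN."""
    digits = series.str.replace(r"[^\d]", "", regex=True)
    return pd.to_numeric(digits, errors="coerce")

clean = raw.apply(to_number)
print(clean)
```

Rows that end up as `NaN` can then be inspected or dropped explicitly, rather than stopping the notebook with a `ValueError`.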


In [None]:
# Splitting the 'Genre' column by comma and space
df1['Genre'] = df1['Genre'].str.split(', ')

# Exploding the list so that each genre gets its own row
df1 = df1.explode('Genre')

# Filling any null values in the 'Genre' column with the most frequent (mode) genre
# (assigning the result back avoids pandas' chained-assignment warning for inplace=True)
df1['Genre'] = df1['Genre'].fillna(df1['Genre'].mode()[0])

# Removing duplicates to ensure each combination of Genre and other columns is unique
df1.drop_duplicates(subset=['Name', 'Year', 'Genre'], inplace=True)

In [None]:
df1.head(10)

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,Gadhvi He thought he was Gandhi,2019,109,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,Yaaram,2019,110,Comedy,4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
3,Yaaram,2019,110,Romance,4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,Aur Pyaar Ho Gaya,1997,147,Comedy,4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
5,Aur Pyaar Ho Gaya,1997,147,Drama,4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
5,Aur Pyaar Ho Gaya,1997,147,Musical,4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,Yahaan,2005,142,Drama,7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
6,Yahaan,2005,142,Romance,7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
6,Yahaan,2005,142,War,7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,A Question Mark,2012,82,Horror,5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia


#Exploratory Data Analysis on the cleaned data

Resetting the Index

In [None]:
df1.reset_index(drop=True, inplace=True)
df1.head(10)

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,Gadhvi He thought he was Gandhi,2019,109,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
1,Yaaram,2019,110,Comedy,4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
2,Yaaram,2019,110,Romance,4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
3,Aur Pyaar Ho Gaya,1997,147,Comedy,4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
4,Aur Pyaar Ho Gaya,1997,147,Drama,4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
5,Aur Pyaar Ho Gaya,1997,147,Musical,4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,Yahaan,2005,142,Drama,7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
7,Yahaan,2005,142,Romance,7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,Yahaan,2005,142,War,7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
9,A Question Mark,2012,82,Horror,5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia


In [None]:
# Keep one row per movie title (the first genre listed for each movie);
# note that this also drops distinct movies that happen to share a title
df1 = df1.drop_duplicates(subset='Name', keep='first')
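
Dropping duplicates by name keeps only the first genre of each movie, so the extra genres produced by `explode` are discarded. A minimal sketch of an alternative, not used in this notebook, is to build one 0/1 indicator column per genre from the raw comma-separated `Genre` string with `Series.str.get_dummies`, which keeps every movie on a single row. The toy frame below is illustrative.

```python
import pandas as pd

# Toy frame in the same shape as the raw 'Genre' column (illustrative values).
movies = pd.DataFrame({
    "Name": ["Yaaram", "Yahaan"],
    "Genre": ["Comedy, Romance", "Drama, Romance, War"],
})

# One 0/1 indicator column per genre; each movie stays on a single row.
genre_flags = movies["Genre"].str.get_dummies(sep=", ")
encoded = pd.concat([movies[["Name"]], genre_flags], axis=1)
print(encoded)
```

With this layout a movie tagged "Drama, Romance, War" contributes to all three genres instead of only the first one.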

Handling the outliers with z-score

In [None]:
#Summary statistics, to see the range of each numeric column before removing outliers
df1.describe()

Unnamed: 0,Year,Duration,Rating,Votes
count,5363.0,5363.0,5363.0,5363.0
mean,1996.419728,133.042886,5.912717,2680.202872
std,19.712718,25.47053,1.379154,13896.686378
min,1931.0,21.0,1.1,5.0
25%,1983.0,118.0,5.0,30.0
50%,2002.0,135.0,6.1,128.0
75%,2013.0,150.0,6.9,903.5
max,2021.0,321.0,10.0,591417.0


In [None]:
#Import stats and perform Z-score
import scipy.stats as stats
df1 = df1[(np.abs(stats.zscore(df1[['Rating', 'Votes', 'Duration']], nan_policy='omit')) < 3).all(axis=1)]

In [None]:
df1.shape

(5260, 10)
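
To make the filter above concrete, here is a small sketch of the same rule (keep rows whose absolute z-score is below 3) written out by hand on a single made-up column. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`; the numbers are illustrative.

```python
import numpy as np
import pandas as pd

# 19 ordinary vote counts plus one extreme value (illustrative numbers).
votes = pd.Series([100 + i for i in range(19)] + [500_000], name="Votes")

# Population z-score (ddof=0), the same default scipy.stats.zscore uses.
z = (votes - votes.mean()) / votes.std(ddof=0)

# Keep only the rows within three standard deviations of the mean.
kept = votes[np.abs(z) < 3]
print(len(votes), "->", len(kept))
```

Because a single extreme value inflates both the mean and the standard deviation, this rule tends to remove only the most extreme rows of a heavily skewed column such as `Votes`.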

#"Unveiling Insights: A Dive into Univariate Analysis"

In [None]:
#Performing the histogram plot
import plotly.express as px
fig = px.histogram(data_frame=df1, x="Duration")
fig.show()

In [None]:
#Performing the histogram plot
fig = px.histogram(data_frame=df1, x="Rating")
fig.show()

'Duration' and 'Rating' show a roughly symmetrical, bell-shaped (approximately normal) distribution

In [None]:
#Performing the violin plot
fig = px.violin(data_frame=df1, x="Votes")
fig.show()

'Votes' demonstrates a rightward skew in its distribution

In [None]:
#Number of movies in each genre, most common first
g_cnt = df1['Genre'].value_counts().reset_index()
g_cnt.columns = ['Genre', 'Count']
g_cnt

Unnamed: 0,Genre,Count
0,Drama,1717
1,Action,1502
2,Comedy,942
3,Crime,251
4,Romance,151
5,Horror,119
6,Adventure,97
7,Thriller,86
8,Musical,82
9,Biography,76


* Analysis of Year with Best rating

In [None]:
#Group by Year and calculate avg-Rating
avg_ratings = df1.groupby('Year')['Rating'].mean().reset_index()

Best_year = avg_ratings.loc[avg_ratings['Rating'].idxmax()]

print(f"The year with the best average rating is {int(Best_year['Year'])} with an average rating of {Best_year['Rating']:.2f}.")

The year with the best average rating is 1952 with an average rating of 7.21.
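
An average over a year with very few movies can be dominated by one or two titles. As a hedged sketch, the same ranking can be restricted to years with a minimum number of movies. The threshold `MIN_MOVIES` is a hypothetical parameter and the data below is made up for illustration.

```python
import pandas as pd

# Illustrative ratings: 1952 has a single highly rated movie.
toy = pd.DataFrame({
    "Year":   [1952, 2019, 2019, 2019, 2020, 2020, 2020],
    "Rating": [9.0,  7.0,  8.0,  7.5,  6.0,  6.5,  7.0],
})

MIN_MOVIES = 3  # hypothetical threshold; tune it to the dataset

# Mean rating and number of movies for each year.
per_year = toy.groupby("Year")["Rating"].agg(mean_rating="mean", n_movies="count")

# Only years with enough movies compete for "best year".
eligible = per_year[per_year["n_movies"] >= MIN_MOVIES]
best_year = eligible["mean_rating"].idxmax()

print(per_year)
print("Best year with enough movies:", best_year)
```

Printing `n_movies` next to the mean makes it easy to see how much evidence is behind each year's average.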


In [None]:
#Same output on Line-plot for better Visualization
fig = px.line(data_frame=avg_ratings, x='Year', y='Rating',
              title='Average Movie Ratings by Year',
              labels={'Year': 'Year', 'Rating': 'Average Rating'},
              markers=True)  # Adds markers for better visibility

fig.update_layout(xaxis_title='Year', yaxis_title='Average Rating')
fig.show()

* Does the length of a movie have any impact on its rating?

In [None]:
# Descriptive statistics Analysis
des_stats = df1[['Duration', 'Rating']].describe()
print(des_stats)

          Duration       Rating
count  5260.000000  5260.000000
mean    132.770722     5.898992
std      24.006217     1.359466
min      57.000000     1.800000
25%     118.000000     5.000000
50%     135.000000     6.100000
75%     150.000000     6.900000
max     208.000000    10.000000


In [None]:
# Calculate correlation coefficient
correlation = df1['Duration'].corr(df1['Rating'])
print(f'Correlation between Duration and Rating: {correlation:.2f}')

Correlation between Duration and Rating: -0.00


In [None]:
import plotly.express as px

# Scatter plot
fig = px.scatter(df1, x='Duration', y='Rating',
                 title='Movie Duration vs Rating',
                 labels={'Duration': 'Movie Duration (minutes)', 'Rating': 'Movie Rating'},
                 trendline='ols') #Adding a trendline
fig.show()


Based on the three analyses above:

* There is no meaningful linear relationship between the length of a movie and its rating: the correlation coefficient rounds to -0.00, and the R-squared of the OLS trendline is correspondingly close to zero.

* A coefficient this close to zero is not statistically distinguishable from zero at this sample size (5,260 movies), so other factors are likely far more important in determining movie ratings.
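
Whether a correlation this small can be told apart from zero can be checked by hand with the standard t statistic for a Pearson coefficient, t = r·sqrt(n − 2) / sqrt(1 − r²). The sketch below plugs in an upper bound for |r| implied by the printed value of -0.00 and the row count reported by `df1.shape`; the function name `t_statistic` is illustrative.

```python
import math

def t_statistic(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation r differs from 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# A coefficient that prints as -0.00 has magnitude below 0.005.
# n = 5260 is the row count reported by df1.shape above.
t_upper = t_statistic(0.005, 5260)
print(f"|t| is at most about {t_upper:.2f}")
```

The resulting bound is well under the value of roughly 1.96 needed for significance at the 5% level with a sample this large, so the data gives no evidence of a linear link between duration and rating.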

#Top 10 movies according to rating, per year and overall

In [None]:
# Sort the dataframe by rating in descending order
top_10_overall = df1.sort_values(by='Rating', ascending=False).head(10)

# output
print("Top 10 Movies by Rating (Overall):")
print(top_10_overall[['Name', 'Rating', 'Year']])


Top 10 Movies by Rating (Overall):
                      Name  Rating  Year
6712       Love Qubool Hai    10.0  2020
4306            Half Songs     9.7  2021
10943  The Reluctant Crime     9.4  2020
4031          Gho Gho Rani     9.4  2019
5488                  June     9.4  2021
4074           God of gods     9.3  2019
1141          Ashok Vatika     9.3  2018
10129           Sindhustan     9.3  2019
9173                Reflct     9.3  2021
6717          Love Sorries     9.3  2021
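
The cell above covers the overall ranking only. A minimal sketch of the per-year part of the question: sort by year and rating, then take the first rows of each year's group. The toy data and `TOP_N = 2` are illustrative; on `df1` the same pattern would use `TOP_N = 10`.

```python
import pandas as pd

# Illustrative data; column names match the notebook's df1.
toy = pd.DataFrame({
    "Name":   ["A", "B", "C", "D", "E"],
    "Year":   [2020, 2020, 2020, 2021, 2021],
    "Rating": [9.4, 10.0, 7.1, 9.7, 9.3],
})

TOP_N = 2  # the notebook's question asks for 10

# Within each year, the highest-rated movies come first; head() keeps TOP_N of them.
top_per_year = (
    toy.sort_values(["Year", "Rating"], ascending=[True, False])
       .groupby("Year")
       .head(TOP_N)
       .reset_index(drop=True)
)
print(top_per_year)
```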


#Number of popular movies released each year

In [None]:
#A movie counts as popular if its rating is above 7.0 and it has more than 1000 votes
pop_movies = df1[(df1['Rating'] > 7.0) & (df1['Votes'] > 1000)]

per_year = pop_movies['Year'].value_counts().sort_index()

print(per_year)

Year
1951     1
1953     1
1955     1
1957     4
1958     2
        ..
2017    16
2018    24
2019    26
2020    17
2021     4
Name: count, Length: 64, dtype: int64


#Number of votes for the movies that performed best in rating, per year and overall

In [None]:
#Per year
per_year = df1.loc[df1.groupby('Year')['Rating'].idxmax()]

#Sort by Year
per_year = per_year.sort_values('Year')

#Overall rating
overall = df1.loc[df1['Rating'].idxmax()]

print("Best Movie per Year (with Number of Votes):")
print(per_year[['Year', 'Name', 'Rating', 'Votes']])

print("\nBest Movie Overall (with Number of Votes):")
print(overall[['Name', 'Year', 'Rating', 'Votes']])

Best Movie per Year (with Number of Votes):
       Year                    Name  Rating  Votes
10900  1931  The Light of the World     6.2    112
3694   1933                    Fate     6.2     12
7170   1934                 Mazdoor     8.5      6
4899   1935                Inquilab     7.4     38
5316   1936             Jeevan Naya     7.3      6
...     ...                     ...     ...    ...
9171   2017     Rediscovering India     9.0     62
1141   2018            Ashok Vatika     9.3      7
4031   2019            Gho Gho Rani     9.4     47
6712   2020         Love Qubool Hai    10.0      5
4306   2021              Half Songs     9.7      7

[90 rows x 4 columns]

Best Movie Overall (with Number of Votes):
Name      Love Qubool Hai
Year                 2020
Rating               10.0
Votes                   5
Name: 6712, dtype: object
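
Several of the top-rated titles above have only a handful of votes. One common way to account for this, which is not part of the original analysis, is a vote-weighted (Bayesian average) rating, WR = v/(v + m)·R + m/(v + m)·C, where R is the movie's rating, v its vote count, C the mean rating over all movies, and m a prior weight. The value of `m` below is a hypothetical choice and the data is made up.

```python
import pandas as pd

# Illustrative movies: a perfect score on 5 votes vs. a high score on 5000 votes.
toy = pd.DataFrame({
    "Name":   ["Few votes", "Many votes", "Average"],
    "Rating": [10.0, 8.5, 6.0],
    "Votes":  [5, 5000, 200],
})

m = 100                    # hypothetical prior weight, in votes
C = toy["Rating"].mean()   # mean rating across all movies

# Ratings backed by few votes are pulled towards the overall mean C.
v, R = toy["Votes"], toy["Rating"]
toy["Weighted"] = (v / (v + m)) * R + (m / (v + m)) * C

best = toy.loc[toy["Weighted"].idxmax(), "Name"]
print(toy.sort_values("Weighted", ascending=False))
print("Best by weighted rating:", best)
```

A larger `m` demands more votes before a movie's own rating outweighs the overall mean.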


#Which director directed the most movies

In [None]:
#Count the number of movies directed by each director
d_m_count = df1['Director'].value_counts()

#Director with the most movies, and the number of movies directed
director_name = d_m_count.idxmax()
max_count = d_m_count.max()

print(f"The director who directed the most movies is {director_name}, with {max_count} movies.")

The director who directed the most movies is David Dhawan, with 37 movies.


#Which actor starred in each movie

In [None]:
if 'Actor 1' in df1.columns:
    star_actors = df1[['Name', 'Actor 1']]

    #Rename 'Actor 1' column to 'Star Actor'
    star_actors = star_actors.rename(columns={'Actor 1': 'Star Actor'})

    print(star_actors)
else:
    print("The 'Actor 1' column does not exist in the dataset.")


                                  Name       Star Actor
0      Gadhvi He thought he was Gandhi     Rasika Dugal
1                               Yaaram          Prateik
3                    Aur Pyaar Ho Gaya       Bobby Deol
6                               Yahaan  Jimmy Sheirgill
9                      A Question Mark        Yash Dave
...                                ...              ...
11967                           Zubaan    Vicky Kaushal
11968                         Zubeidaa   Karisma Kapoor
11971                  Zulm Ki Zanjeer      Chiranjeevi
11974                            Zulmi     Akshay Kumar
11976                     Zulm-O-Sitam       Dharmendra

[5260 rows x 2 columns]
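
The listing above shows the lead actor of each movie. A related question is which actor appears in the most movies across all three actor columns. As a sketch on made-up rows with placeholder names, the three columns can be stacked into one long Series and counted.

```python
import pandas as pd

# Placeholder names; column names match the notebook's df1.
toy = pd.DataFrame({
    "Name":    ["M1", "M2", "M3"],
    "Actor 1": ["Actor A", "Actor B", "Actor E"],
    "Actor 2": ["Actor C", "Actor A", "Actor B"],
    "Actor 3": ["Actor D", "Actor C", "Actor A"],
})

# Stack the three actor columns into one long Series, then count appearances.
appearances = toy[["Actor 1", "Actor 2", "Actor 3"]].stack().value_counts()
top_actor = appearances.idxmax()

print(appearances)
print("Most frequent actor:", top_actor)
```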


#  Data Preprocessing

In [None]:
df2 = df1.copy()

In [None]:
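#Mean-rating (target) encoding of the categorical columns.
#Note: the means are computed on the full dataset, before the train/test split,
#so the test-set metrics further down may be optimistic.
#Genre encoding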
genre_mean_rating = df2.groupby('Genre')['Rating'].transform('mean')
df2['Genre_mean_rating'] = genre_mean_rating

# Director encoding
director_mean_rating = df2.groupby('Director')['Rating'].transform('mean')
df2['Director_encoded'] = director_mean_rating

#Actor 1 encoding
actor1_mean_rating = df2.groupby('Actor 1')['Rating'].transform('mean')
df2['Actor1_encoded'] = actor1_mean_rating

#Actor 2 encoding
actor2_mean_rating = df2.groupby('Actor 2')['Rating'].transform('mean')
df2['Actor2_encoded'] = actor2_mean_rating

#Actor 3 encoding
actor3_mean_rating = df2.groupby('Actor 3')['Rating'].transform('mean')
df2['Actor3_encoded'] = actor3_mean_rating

In [None]:
from sklearn.model_selection import train_test_split
X = df2[['Year', 'Duration', 'Votes','Genre_mean_rating','Director_encoded','Actor1_encoded', 'Actor2_encoded', 'Actor3_encoded']]
y = df2['Rating']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
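
In the cells above the category means are computed from every row, including rows that later land in the test set, so information about the test ratings leaks into the features. A minimal sketch of a stricter variant splits first and learns the means from the training rows only; categories unseen in training fall back to the training mean. The helper `target_encode` and the toy rows are illustrative assumptions, not the notebook's method.

```python
import pandas as pd

def target_encode(train: pd.DataFrame, test: pd.DataFrame,
                  column: str, target: str = "Rating"):
    """Mean-encode `column` using statistics from the training rows only."""
    means = train.groupby(column)[target].mean()
    fallback = train[target].mean()          # used for unseen categories
    train_enc = train[column].map(means)
    test_enc = test[column].map(means).fillna(fallback)
    return train_enc, test_enc

# Illustrative rows; 'Dir C' appears only in the test split.
train = pd.DataFrame({"Director": ["Dir A", "Dir A", "Dir B"],
                      "Rating":   [6.0, 8.0, 5.0]})
test = pd.DataFrame({"Director": ["Dir A", "Dir C"],
                     "Rating":   [9.0, 4.0]})

train_enc, test_enc = target_encode(train, test, "Director")
print(test_enc.tolist())
```

The training rows still see their own rating inside the mean; an out-of-fold encoding would be stricter still, at the cost of more code.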

#Model Building and Evaluation

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
rf = RandomForestRegressor()
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
print('MSE: ',mean_squared_error(y_test, rf_pred))
print('MAE: ',mean_absolute_error(y_test, rf_pred))
print('R2: ',r2_score(y_test, rf_pred))

MSE:  0.309585830798479
MAE:  0.36932129277566544
R2:  0.8369568449673469
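
To put these numbers in context, the MSE above corresponds to an RMSE of about 0.56 rating points. It can also help to compare the model against a baseline that always predicts the mean training rating. The sketch below uses made-up values purely to show the comparison; the helper names `rmse` and `r2` are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the rating."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up numbers, purely to show the comparison.
y_train_mean = 6.5
y_test = [5.0, 6.0, 7.0, 8.0]
model_pred = [5.5, 6.0, 6.5, 8.0]
baseline_pred = [y_train_mean] * len(y_test)

model_r2 = r2(y_test, model_pred)
baseline_r2 = r2(y_test, baseline_pred)
print("model    RMSE:", rmse(y_test, model_pred), "R2:", model_r2)
print("baseline RMSE:", rmse(y_test, baseline_pred), "R2:", baseline_r2)
```

A model is only useful to the extent that it beats this baseline on held-out data.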


By Girish Kumar (Internship_Trainee@CodSoft)