# Movie Rating Prediction

This project predicts movie ratings from structured attributes such as release year, duration, number of votes, and genre. Ratings capture audience feedback, so modelling them can help reveal which factors influence a movie's reception.

The objective of this project is to explore the dataset, prepare relevant features, and build a regression model to predict movie ratings.

In [1]:
import os
os.listdir("/content")

['.config', 'IMDb Movies India.csv', 'sample_data']

In [2]:
import pandas as pd
df = pd.read_csv("/content/IMDb Movies India.csv", encoding='latin-1')

In [3]:
df.head()

|   | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|------|------|----------|-------|--------|-------|----------|---------|---------|---------|
| 0 |  | NaN | NaN | Drama | NaN | NaN | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical | NaN | NaN | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama | NaN | NaN | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |


In [4]:
df.shape

(15509, 10)

In [5]:
df.columns

Index(['Name', 'Year', 'Duration', 'Genre', 'Rating', 'Votes', 'Director',
       'Actor 1', 'Actor 2', 'Actor 3'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


In [7]:
df.isnull().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64


## Exploratory Data Analysis

The dataset contains 15,509 Indian movies with attributes such as year of release, duration, genre, votes, and ratings. It mixes numerical and categorical features, and several numeric fields (`Year`, `Duration`, `Votes`) are stored as text. Missing values are widespread: `Duration` is missing in 8,269 rows and `Rating` in 7,590, roughly half the dataset in each case. Careful data cleaning and feature selection are therefore needed before building a prediction model.
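Raw missing-value counts are easier to compare as fractions of the total row count. The sketch below shows one way to compute them, using a made-up four-row frame (`toy` is a hypothetical stand-in, not the real dataset); the same expression could be applied to `df`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
toy = pd.DataFrame({
    "Year": ["(2019)", None, "(2010)", "(1997)"],
    "Duration": ["109 min", None, None, "120 min"],
    "Rating": [7.0, None, None, 6.2],
})

# isnull() gives a boolean frame; the column mean of booleans is the
# fraction of missing values in that column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting by share puts the columns that will cost the most rows under `dropna()` at the top.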

In [8]:
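# Keep only the modelling columns; .copy() avoids a possible
# SettingWithCopyWarning when columns are reassigned below.
df = df[["Year", "Duration", "Votes", "Genre", "Rating"]].copy()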

In [9]:
df["Year"] = df["Year"].astype(str)

In [10]:
df["Year"] = df["Year"].str.extract(r"(\d{4})")

In [11]:
df["Year"] = pd.to_numeric(df["Year"], errors="coerce")

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      14981 non-null  float64
 1   Duration  7240 non-null   object 
 2   Votes     7920 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
dtypes: float64(2), object(3)
memory usage: 605.9+ KB


In [13]:
df = pd.read_csv("/content/IMDb Movies India.csv", encoding="latin-1")

In [14]:
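# Keep only the modelling columns; .copy() avoids a possible
# SettingWithCopyWarning when columns are reassigned below.
df = df[["Year", "Duration", "Votes", "Genre", "Rating"]].copy()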

In [15]:
df["Year"].head(10)

0       NaN
1    (2019)
2    (2021)
3    (2019)
4    (2010)
5    (1997)
6    (2005)
7    (2008)
8    (2012)
9    (2014)
Name: Year, dtype: object


In [16]:
df["Year"] = df["Year"].astype(str)
df["Year"] = df["Year"].str.extract(r"(\d{4})")
df["Year"] = pd.to_numeric(df["Year"], errors="coerce")

In [17]:
df["Duration"] = df["Duration"].astype(str)


In [18]:
df["Duration"] = df["Duration"].str.replace("min","", regex=False)

In [19]:
df["Duration"] = pd.to_numeric(df["Duration"], errors="coerce")

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      14981 non-null  float64
 1   Duration  7240 non-null   float64
 2   Votes     7920 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
dtypes: float64(3), object(2)
memory usage: 605.9+ KB


In [21]:
df["Votes"] = df["Votes"].astype(str)

In [22]:
df["Votes"] = df["Votes"].str.replace(",", "",regex=False)

In [23]:
df["Votes"] = pd.to_numeric(df["Votes"], errors="coerce")

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Year      14981 non-null  float64
 1   Duration  7240 non-null   float64
 2   Votes     7919 non-null   float64
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
dtypes: float64(4), object(1)
memory usage: 605.9+ KB
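The three conversions above follow the same pattern: cast to string, strip the non-numeric decoration, then coerce to a number. As a sketch, they could be gathered into one function. The name `clean_numeric_columns` and the sample values are hypothetical; in particular `"$5.16M"` is only an invented example of a value that cannot be parsed, similar to the single `Votes` entry that became missing after coercion.

```python
import pandas as pd

def clean_numeric_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: parse Year, Duration and Votes into numbers."""
    out = raw.copy()
    # "(2019)" -> 2019.0; expand=False returns a Series for a single group.
    year_text = out["Year"].astype(str).str.extract(r"(\d{4})", expand=False)
    out["Year"] = pd.to_numeric(year_text, errors="coerce")
    # "109 min" -> 109.0
    duration_text = out["Duration"].astype(str).str.replace("min", "", regex=False)
    out["Duration"] = pd.to_numeric(duration_text, errors="coerce")
    # "1,086" -> 1086.0; anything unparseable becomes NaN.
    votes_text = out["Votes"].astype(str).str.replace(",", "", regex=False)
    out["Votes"] = pd.to_numeric(votes_text, errors="coerce")
    return out

# Made-up rows for illustration only.
raw = pd.DataFrame({
    "Year": ["(2019)", None],
    "Duration": ["109 min", None],
    "Votes": ["1,086", "$5.16M"],
})
cleaned = clean_numeric_columns(raw)
print(cleaned)
```

Working on a copy leaves the raw frame untouched, so the cleaning can be re-run without reloading the CSV.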


In [25]:
df["Primary_Genre"] = df["Genre"].str.split(",").str[0]

In [26]:
df = df.drop("Genre", axis=1)

In [28]:
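# Encode genres as integer codes. Note: missing genres get the code -1,
# so the dropna() in the next cell does not remove those rows.
df["Primary_Genre"] = df["Primary_Genre"].astype("category").cat.codes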

In [29]:
df = df.dropna()
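The order of the encoding and `dropna()` steps matters, because `cat.codes` represents a missing category as -1 rather than as NaN. The following check on a made-up three-row frame (`toy` and `Genre_Code` are hypothetical names) illustrates the effect.

```python
import pandas as pd

# Hypothetical frame with one missing genre.
toy = pd.DataFrame({"Primary_Genre": ["Drama", None, "Comedy"]})

# Categories are sorted alphabetically: Comedy -> 0, Drama -> 1, missing -> -1.
toy["Genre_Code"] = toy["Primary_Genre"].astype("category").cat.codes

# dropna() on the encoded column keeps every row, since -1 is not NaN.
codes_only = toy[["Genre_Code"]].dropna()

print(toy["Genre_Code"].tolist())  # [1, -1, 0]
print(len(codes_only))             # 3
```

If rows without a genre should be excluded, one option is to call `dropna()` before encoding; doing so would likely change the number of rows that reach the model.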

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5851 entries, 1 to 15508
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Year           5851 non-null   float64
 1   Duration       5851 non-null   float64
 2   Votes          5851 non-null   float64
 3   Rating         5851 non-null   float64
 4   Primary_Genre  5851 non-null   int8   
dtypes: float64(4), int8(1)
memory usage: 234.3 KB


In [31]:
x = df.drop("Rating", axis=1)
y = df["Rating"]

In [32]:
x.shape, y.shape

((5851, 4), (5851,))

In [33]:
from sklearn.model_selection import train_test_split

In [34]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [35]:
from sklearn.ensemble import RandomForestRegressor

In [36]:
model = RandomForestRegressor(n_estimators=100, random_state=42)

In [37]:
model.fit(x_train, y_train)

In [38]:
y_pred = model.predict(x_test)

In [39]:
from sklearn.metrics import mean_absolute_error, r2_score

In [40]:
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [41]:
mae, r2

(0.8717187446626815, 0.28165760016292996)

## Model Evaluation

The model was evaluated on the held-out 20% test split using Mean Absolute Error (MAE) and the R-squared score. The MAE of approximately 0.87 means that, on average, predicted ratings differ from actual ratings by a little under one point on the 10-point rating scale. The R-squared value of about 0.28 means the model explains roughly 28% of the variance in test-set ratings: the selected features capture some of the variability, but most of it is driven by factors not included in the model.
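To make the two metrics concrete, the following sketch computes them by hand on four made-up ratings (illustrative numbers, not taken from the dataset) and compares a set of predictions against a baseline that always predicts the mean rating. The helper names `mae` and `r2` are hypothetical; they are intended to mirror the standard definitions behind `mean_absolute_error` and `r2_score`.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up ratings for illustration only.
y_true = [6.0, 7.0, 5.0, 8.0]
y_pred = [6.5, 6.0, 5.5, 7.0]

# Baseline: always predict the mean of the true ratings.
baseline = [float(np.mean(y_true))] * len(y_true)

model_mae, model_r2 = mae(y_true, y_pred), r2(y_true, y_pred)
baseline_mae, baseline_r2 = mae(y_true, baseline), r2(y_true, baseline)
print(model_mae, model_r2)
print(baseline_mae, baseline_r2)
```

A constant mean prediction scores an R-squared of 0 by construction, so an R-squared of 0.28 can be read as a 28% reduction in squared error relative to that baseline.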

## Conclusion

In this project, a random forest regression model was built to predict movie ratings from four structured features: release year, duration, number of votes, and primary genre. Because movie ratings are complex and subjective, the model achieves only modest predictive performance (MAE of about 0.87, R-squared of about 0.28). The project highlights the difficulty of predicting audience ratings and reinforces the importance of feature selection, data preprocessing, and realistic interpretation of model results when working with real-world datasets.