-Build a model that predicts the rating of a movie based on
features like genre, director, and actors. You can use regression
techniques to tackle this problem.

-The goal is to analyze historical movie data and develop a model
that accurately estimates the rating given to a movie by users or
critics.

-The Movie Rating Prediction project lets you practice data
analysis, preprocessing, feature engineering, and machine
learning modeling. It also gives insight into the factors
that influence movie ratings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import numpy as np

In [None]:
ds=pd.read_csv("/content/drive/MyDrive/CodSoft Internship Program/IMDb movie dataset/IMDb Movies India.csv",encoding="latin1")

In [None]:
ds.head(5)

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali


In [None]:
ds.tail(5)

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
15504,Zulm Ko Jala Doonga,(1988),,Action,4.6,11.0,Mahendra Shah,Naseeruddin Shah,Sumeet Saigal,Suparna Anand
15505,Zulmi,(1999),129 min,"Action, Drama",4.5,655.0,Kuku Kohli,Akshay Kumar,Twinkle Khanna,Aruna Irani
15506,Zulmi Raj,(2005),,Action,,,Kiran Thej,Sangeeta Tiwari,,
15507,Zulmi Shikari,(1988),,Action,,,,,,
15508,Zulm-O-Sitam,(1998),130 min,"Action, Drama",6.2,20.0,K.C. Bokadia,Dharmendra,Jaya Prada,Arjun Sarja


#Data cleaning

In [None]:
# Check for missing values
ds.isnull().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

In [None]:
# Print sample data from 'Year' and 'Duration' columns
print(ds[['Year', 'Duration']].head(10))

# Check data types of 'Year' and 'Duration'
print(ds[['Year', 'Duration']].dtypes)


     Year Duration
0     NaN      NaN
1  (2019)  109 min
2  (2021)   90 min
3  (2019)  110 min
4  (2010)  105 min
5  (1997)  147 min
6  (2005)  142 min
7  (2008)   59 min
8  (2012)   82 min
9  (2014)  116 min
Year        object
Duration    object
dtype: object



Several columns have many missing values: Duration is empty in 8,269 of the 15,509 rows and Rating in 7,590. Here’s how we can handle them:

Drop rows with missing target values (Rating): Since we're predicting ratings, we can't use rows without them.

Handle missing values in other columns:

For categorical columns (Genre, Director, Actor 1, Actor 2, Actor 3), we can fill missing values with a placeholder like "Unknown".

For numerical columns (Year, Duration, Votes), we can fill missing values with the median value of each column. These columns are stored as strings (dtype object), so they have to be converted to numbers first.
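
The rules above can be sketched on a toy frame. This is a minimal illustration, assuming the same column names as the dataset; the three sample rows are made up, and the names `toy` and `numeric_columns` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; column names mirror the dataset
toy = pd.DataFrame({
    "Genre": ["Drama", np.nan, "Action"],
    "Director": [np.nan, "A. Director", "B. Director"],
    "Duration": [100.0, np.nan, 140.0],
    "Rating": [7.0, 5.5, np.nan],
})

# 1. Rows without a target cannot be used for supervised training
toy = toy.dropna(subset=["Rating"]).copy()

# 2. Categorical gaps get an explicit placeholder category
categorical_columns = ["Genre", "Director"]
toy[categorical_columns] = toy[categorical_columns].fillna("Unknown")

# 3. Numerical gaps get the column median
numeric_columns = ["Duration"]
toy[numeric_columns] = toy[numeric_columns].fillna(toy[numeric_columns].median())

print(toy)
```

Note that in this sketch the median is computed after the rows without a rating are dropped, so it only reflects rows that can actually be used for training.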

In [None]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


In [None]:
# Fill missing values in categorical columns with "Unknown"
categorical_columns = ['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']
ds.loc[:, categorical_columns] = ds.loc[:, categorical_columns].fillna('Unknown')

# Clean up 'Year' column: extract the digits from strings like "(2019)" and convert to float
# expand=False returns a Series instead of a one-column DataFrame
ds['Year'] = ds['Year'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)

# Clean up 'Duration' column
ds['Duration'] = ds['Duration'].astype(str).str.replace(' min', '', regex=False)  # Remove ' min' suffix
ds['Duration'] = pd.to_numeric(ds['Duration'], errors='coerce')  # Convert to numeric

# Fill missing values in 'Year' and 'Duration' with median
# (assign the result back; inplace=True on a selected column is deprecated in recent pandas)
ds['Year'] = ds['Year'].fillna(ds['Year'].median())
ds['Duration'] = ds['Duration'].fillna(ds['Duration'].median())

# Check if all missing values are handled
missing_values_after = ds.isnull().sum()
print(missing_values_after)

# Print sample data from 'Year' and 'Duration' columns
print(ds[['Year', 'Duration']].head(10))

# Check data types of 'Year' and 'Duration'
print(ds[['Year', 'Duration']].dtypes)

Name           0
Year           0
Duration       0
Genre          0
Rating      7590
Votes       7589
Director       0
Actor 1        0
Actor 2        0
Actor 3        0
dtype: int64
     Year  Duration
0  1991.0     131.0
1  2019.0     109.0
2  2021.0      90.0
3  2019.0     110.0
4  2010.0     105.0
5  1997.0     147.0
6  2005.0     142.0
7  2008.0      59.0
8  2012.0      82.0
9  2014.0     116.0
Year        float64
Duration    float64
dtype: object
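
The string cleanup in the cell above can be checked on a few sample values. This is a small sketch, assuming raw values shaped like the ones printed earlier ("(2019)", "109 min"); the Series names are hypothetical.

```python
import pandas as pd

# Sample values in the same shape as the raw 'Year' and 'Duration' columns
year_raw = pd.Series(["(2019)", "(1997)", None])
duration_raw = pd.Series(["109 min", None, "59 min"])

# Pull the four-digit year out of the parentheses
year = year_raw.str.extract(r"(\d{4})", expand=False).astype(float)

# Strip the unit suffix, then convert; anything unparseable becomes NaN
duration = pd.to_numeric(
    duration_raw.str.replace(" min", "", regex=False), errors="coerce"
)

# Fill the gaps with each column's median
year_filled = year.fillna(year.median())
duration_filled = duration.fillna(duration.median())

print(year_filled.tolist())
print(duration_filled.tolist())
```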


In [None]:
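# Clean up 'Votes' column
# Convert to numeric; any string that is not a plain number (for example one
# containing a thousands separator) is coerced to NaN
ds['Votes'] = pd.to_numeric(ds['Votes'], errors='coerce')
# Fill missing values with the median (assign back instead of using inplace=True)
ds['Votes'] = ds['Votes'].fillna(ds['Votes'].median())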


In [None]:
# Drop rows with missing target values (Rating)
ds = ds.dropna(subset=['Rating'])
missing_values_after = ds.isnull().sum()
print(missing_values_after)


Name        0
Year        0
Duration    0
Genre       0
Rating      0
Votes       0
Director    0
Actor 1     0
Actor 2     0
Actor 3     0
dtype: int64


#Encode Categorical Variables

In [None]:
# Ensure one-hot encoding for categorical columns
categorical_columns = ['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']
ds_encoded = pd.get_dummies(ds, columns=categorical_columns, drop_first=True)

# Check the resulting dataframe
print(ds_encoded.head(5))
print(ds_encoded.shape)


                                 Name    Year  Duration  Rating  Votes  \
1  #Gadhvi (He thought he was Gandhi)  2019.0     109.0     7.0    8.0   
3                             #Yaaram  2019.0     110.0     4.4   35.0   
5                ...Aur Pyaar Ho Gaya  1997.0     147.0     4.7  827.0   
6                           ...Yahaan  2005.0     142.0     7.4   35.0   
8                  ?: A Question Mark  2012.0      82.0     5.6  326.0   

   Genre_Action, Adventure  Genre_Action, Adventure, Biography  \
1                    False                               False   
3                    False                               False   
5                    False                               False   
6                    False                               False   
8                    False                               False   

   Genre_Action, Adventure, Comedy  Genre_Action, Adventure, Crime  \
1                            False                           False   
3                 

In [None]:
# Verify data types of all columns
print(ds_encoded.dtypes[ds_encoded.dtypes == 'object'])


Name    object
dtype: object


In [None]:
# Drop the 'Name' column
ds_encoded = ds_encoded.drop(columns=['Name'])

# Ensure all columns are now numeric
print(ds_encoded.dtypes[ds_encoded.dtypes == 'object'])


Series([], dtype: object)
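
To see what the one-hot encoding step does on a small scale, here is a sketch with a made-up four-row frame (the data and the name `toy` are hypothetical). Each distinct string becomes its own column, which is why combined genres such as "Action, Adventure" get separate columns above, and `drop_first=True` removes the first category in sorted order.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Genre": ["Drama", "Action", "Drama", "Comedy"],
    "Rating": [7.0, 4.5, 6.2, 5.6],
})

# One column per distinct genre string, minus the first category ("Action")
encoded = pd.get_dummies(toy, columns=["Genre"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The dropped category is represented by a row in which all remaining indicator columns are false, as in the "Action" row here.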


#Split the Data into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = ds_encoded.drop('Rating', axis=1)
y = ds_encoded['Rating']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)



X_train shape: (6335, 12062)
X_test shape: (1584, 12062)
y_train shape: (6335,)
y_test shape: (1584,)


#Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Create and train the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print("Random Forest Mean Squared Error:", mse_rf)
print("Random Forest R-squared:", r2_rf)


Random Forest Mean Squared Error: 1.2567384450757577
Random Forest R-squared: 0.32402348928958924


Mean Squared Error (MSE): The MSE of approximately 1.26 corresponds to a root mean squared error of about 1.12, so a typical prediction misses the actual rating by roughly one point on IMDb's 10-point scale. That is a usable starting point, but not a precise estimate.

R-squared (R²): With an R-squared value of about 0.32, the model explains around 32% of the variance in the test ratings and leaves about 68% unexplained. It captures some real signal from the features, but most of the variation in ratings is still unaccounted for.
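
The two reported metrics can be put on a more readable scale with a little arithmetic. This sketch only reuses the numbers printed by the Random Forest cell; the variable names `rmse_rf`, `baseline_mse`, and `baseline_rmse` are introduced here for illustration. It relies on the definition R² = 1 - MSE_model / MSE_baseline, where the baseline always predicts the mean rating of the test set.

```python
import math

# Values copied from the Random Forest output above
mse_rf = 1.2567384450757577
r2_rf = 0.32402348928958924

# RMSE is expressed in the same units as the target (rating points)
rmse_rf = math.sqrt(mse_rf)

# Rearranging R^2 = 1 - MSE_model / MSE_baseline gives the error of a
# baseline that always predicts the mean test rating
baseline_mse = mse_rf / (1 - r2_rf)
baseline_rmse = math.sqrt(baseline_mse)

print(f"Model RMSE:    {rmse_rf:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```

Comparing the two RMSE values shows how much the model improves on simply predicting the average rating.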