# Movie Rating Prediction with Python
### Goal: Analyze historical movie data and develop a model that predicts the rating of a movie based on features like genre, director, and actors.

In [83]:
# imports
import pandas as pd   
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 

### Data Exploration

In [84]:
# load data
df = pd.read_csv("./IMDb.csv", na_values="?",encoding='latin-1')

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


In [86]:
df.describe()

| | Rating |
|---|---|
| count | 7919.0 |
| mean | 5.841621 |
| std | 1.381777 |
| min | 1.1 |
| 25% | 4.9 |
| 50% | 6.0 |
| 75% | 6.8 |
| max | 10.0 |


In [87]:
df.head()

| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | Drama | | | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical | | | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama | | | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |


In [88]:
df.isna().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

The output above reveals several data-quality problems that could lead to inaccurate predictions:
1. The Name column contains special characters such as '#'.
2. The Year column has missing values, and each year is wrapped in '()'.
3. The Duration column has missing values and an inconsistent format: some durations are given in hours and others in minutes.
4. The Genre column has missing values and lists several genres for some movies, so feature engineering to split movies by genre is worth considering.
5. The Rating column has missing values.
6. The Votes column has missing values.
7. The Director column and the three actor columns have missing values.
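
The observations above can also be checked programmatically rather than by eye. The following is a minimal sketch on a small hypothetical sample (the `sample` frame and the `summarize_issues` helper are illustrative and not part of the notebook). It reports the share of missing values per column and counts how many non-missing `Year` and `Duration` entries deviate from the `(YYYY)` and `N min` patterns seen in `df.head()`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw columns; this is not the real IMDb.csv.
sample = pd.DataFrame({
    "Name": ["#Yaaram", "...And Once Again", "Movie C"],
    "Year": ["(2019)", "(2010)", np.nan],
    "Duration": ["110 min", "105 min", np.nan],
})

def summarize_issues(frame):
    """Return the missing-value share per column and format-mismatch counts."""
    missing_share = frame.isna().mean().round(3)
    year = frame["Year"].dropna()
    duration = frame["Duration"].dropna()
    mismatches = {
        # Non-missing years that are not exactly "(YYYY)"
        "Year": int((~year.str.fullmatch(r"\(\d{4}\)")).sum()),
        # Non-missing durations that are not exactly "<digits> min"
        "Duration": int((~duration.str.fullmatch(r"\d+ min")).sum()),
    }
    return missing_share, mismatches

missing_share, mismatches = summarize_issues(sample)
print(missing_share)
print(mismatches)
```

A non-zero mismatch count would point to rows that need a closer look before the cleaning steps are applied.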

### Data Preprocessing

In [89]:
# 1. Name Column
# Remove duplicate rows based on movie name
df.drop_duplicates(subset='Name', keep='first', inplace=True)

# Strip leading special characters (lstrip only affects the start of each name)
df['Name'] = df['Name'].str.lstrip('!@#$%^&*()')
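
Note that `str.lstrip` removes characters only from the start of each string. As a small sketch on hypothetical names (the `names` series is made up for illustration), compare it with a regex replacement that removes the same characters anywhere in the name:

```python
import pandas as pd

# Hypothetical movie names used only for illustration
names = pd.Series(["#Yaaram", "...And Once Again", "Love & War!"])

# lstrip removes only leading characters from the given set
leading_stripped = names.str.lstrip("!@#$%^&*()")

# A regex character class removes those characters anywhere in the string
all_stripped = names.str.replace(r"[!@#$%^&*()]", "", regex=True)

print(leading_stripped.tolist())
print(all_stripped.tolist())
```

Neither variant touches the leading dots in `...And Once Again`, because `.` is not in the character set.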

In [90]:
# Drop any rows with missing data
df.dropna(inplace=True)
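
Dropping every row with any missing value is simple but can discard a large share of the data. As a hedged sketch on a toy frame (the `toy` data is hypothetical), the difference between dropping on all columns and dropping only on the target column can be measured like this:

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Rating": [7.0, np.nan, 5.5, 6.1],
    "Actor 3": ["x", "y", np.nan, "z"],
})

# Rows left after dropping any row with a missing value
rows_all = len(toy.dropna())

# Rows left after dropping only rows that lack the target
rows_target = len(toy.dropna(subset=["Rating"]))

print(rows_all, rows_target)
```

Whether the stricter or the more lenient variant is appropriate depends on how the remaining gaps would be handled later; this notebook uses the stricter one.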

In [91]:
df.isna().sum()

Name        0
Year        0
Duration    0
Genre       0
Rating      0
Votes       0
Director    0
Actor 1     0
Actor 2     0
Actor 3     0
dtype: int64

In [92]:
# Now that rows with missing values have been dropped, clean the remaining features
# Year - remove '()'

# Use a raw string for the regex and expand=False so extract returns a Series
df['Year'] = df['Year'].str.extract(r'(\d+)', expand=False).astype(int)

In [93]:
df['Duration']

1        109 min
3        110 min
5        147 min
6        142 min
8         82 min
          ...   
15493    115 min
15494    153 min
15503    125 min
15505    129 min
15508    130 min
Name: Duration, Length: 5148, dtype: object

In [94]:
# Duration - convert all values to minutes, strip the unit text, and convert to type int.

def convert_to_minutes(duration_str):
    """Convert a duration string such as '109 min' to total minutes."""
    if pd.isnull(duration_str) or duration_str == '':
        return None  # Return None for missing or empty strings
    # Initialize total minutes to zero
    total_minutes = 0
    # Check for the presence of hours ('h' also matches 'hr')
    if 'h' in duration_str:
        # Split once on the first 'h': hours before it, the remainder after it
        hours_part, duration_str = duration_str.split('h', 1)
        total_minutes += int(hours_part) * 60
    # Check and clean up the minutes part ('m' also matches 'min')
    if 'm' in duration_str:
        # Extract minutes and convert to int
        minutes = int(''.join(filter(str.isdigit, duration_str)))
        total_minutes += minutes
    return total_minutes

# Re-apply the conversion function to the duration column
df['Duration'] = df['Duration'].apply(convert_to_minutes)

# Check the result
df['Duration']


1        109
3        110
5        147
6        142
8         82
        ... 
15493    115
15494    153
15503    125
15505    129
15508    130
Name: Duration, Length: 5148, dtype: int64
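
As an alternative to a row-wise `apply`, the same conversion can be sketched with vectorized string methods. This is a minimal sketch under the assumption that hours are marked with `h`/`hr` and minutes with `m`/`min`; the `durations` series is hypothetical, and values without any unit are left as missing rather than guessed.

```python
import pandas as pd

# Hypothetical duration strings, including formats not shown in the notebook output
durations = pd.Series(["109 min", "2 h 5 min", "1 hr", "90"])

# Capture the number directly before an 'h' and before an 'm', if present
h = pd.to_numeric(durations.str.extract(r"(\d+)\s*h", expand=False))
m = pd.to_numeric(durations.str.extract(r"(\d+)\s*m", expand=False))

# Combine both parts; keep NaN where neither an hours nor a minutes part was found
total_minutes = (h.fillna(0) * 60 + m.fillna(0)).where(h.notna() | m.notna())

print(total_minutes.tolist())
```

The result is a float series because of the possible missing values; it can be cast to int once no NaN remains.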

In [95]:
df['Genre']

1                            Drama
3                  Comedy, Romance
5           Comedy, Drama, Musical
6              Drama, Romance, War
8        Horror, Mystery, Thriller
                   ...            
15493                        Drama
15494    Biography, Drama, History
15503         Action, Crime, Drama
15505                Action, Drama
15508                Action, Drama
Name: Genre, Length: 5148, dtype: object

In [96]:
# Genre - split on ',' and explode so that each (movie, genre) pair gets its own row

df['Genre'] = df['Genre'].str.split(',')
df = df.explode('Genre')

In [98]:
# Remove duplicates: keep the first row per movie, so only its first-listed genre remains
df = df.drop_duplicates(subset='Name')

In [99]:
df['Genre']

1            Drama
3           Comedy
5           Comedy
6            Drama
8           Horror
           ...    
15493        Drama
15494    Biography
15503       Action
15505       Action
15508       Action
Name: Genre, Length: 5148, dtype: object
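
The steps above keep a single genre per movie. If all genres should be retained as features, one possible approach (not the one used in this notebook) is one indicator column per genre. A minimal sketch on hypothetical genre strings:

```python
import pandas as pd

# Hypothetical genre strings in the same comma-separated format as the raw column
genres = pd.Series(["Drama", "Comedy, Romance", "Comedy, Drama, Musical"])

# Normalize the separator to '|' (the default for get_dummies), tolerating stray spaces
normalized = genres.str.replace(r"\s*,\s*", "|", regex=True)

# One indicator column per distinct genre
genre_flags = normalized.str.get_dummies()

print(genre_flags)
```

Each row keeps all of its genres this way, at the cost of a wider feature matrix.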

In [None]:
# Votes - remove commas

df['Votes'] = df['Votes'].str.replace(',', '').astype(int)
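
`astype(int)` raises an error if any value is still non-numeric after the commas are removed. A more tolerant variant, sketched here on hypothetical values (the `"$5.16M"` entry is an assumed example of a malformed value, not taken from the dataset), converts unparseable entries to NaN so they can be inspected or dropped:

```python
import pandas as pd

# Hypothetical vote strings; the last one is an assumed malformed entry
votes_raw = pd.Series(["8", "1,086", "$5.16M"])

# Remove thousands separators, then coerce anything unparseable to NaN
votes = pd.to_numeric(votes_raw.str.replace(",", "", regex=False), errors="coerce")

print(votes.tolist())
print(int(votes.isna().sum()), "value(s) could not be parsed")
```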

In [None]:
# Finally, reset the index
df.reset_index(drop=True, inplace=True)

In [None]:
df.info()

In [None]:
df['Name'].describe()

In [None]:
df['Duration'].unique()

In [None]:
# Save the cleaned data to a new CSV file
df.to_csv('clean_movie_data.csv', index=False)

In [None]:
df_clean = pd.read_csv('clean_movie_data.csv')