<a href="https://colab.research.google.com/github/Youssef-Rafikk/CODSOFT/blob/main/Task_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#-------------------------------------------------------------------------------------------------------------------------------#
# Load the data with specified encoding
file_path = '/content/IMDb Movies India.csv'  # Update this path to your file path
df = pd.read_csv(file_path, encoding='latin1')  # Try 'latin1' or 'cp1252' if 'utf-8' fails

#-------------------------------------------------------------------------------------------------------------------------------#
# Display the first few rows of the dataframe
print(df.head())

#-------------------------------------------------------------------------------------------------------------------------------#
# Preprocess the data
# Handle missing values
df = df.dropna()

#-------------------------------------------------------------------------------------------------------------------------------#
# Clean the 'Year' column: pull the digits out of strings such as '(2019)'
df['Year'] = df['Year'].str.extract(r'(\d+)', expand=False).astype(int)

#-------------------------------------------------------------------------------------------------------------------------------#
# Convert 'Duration' to a numerical value (in minutes), e.g. '109 min' -> 109
df['Duration'] = df['Duration'].str.replace(' min', '', regex=False).astype(int)

#-------------------------------------------------------------------------------------------------------------------------------#
# Convert categorical variables to numerical using Label Encoding
label_encoders = {}
for column in ['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    label_encoders[column] = le

#-------------------------------------------------------------------------------------------------------------------------------#
# Define features (X) and target (y)
X = df[['Year', 'Duration', 'Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']]
y = df['Rating']

#-------------------------------------------------------------------------------------------------------------------------------#
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#-------------------------------------------------------------------------------------------------------------------------------#
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#-------------------------------------------------------------------------------------------------------------------------------#
# Train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

#-------------------------------------------------------------------------------------------------------------------------------#
# Make predictions
y_pred = model.predict(X_test)

#-------------------------------------------------------------------------------------------------------------------------------#
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f'Root Mean Squared Error: {rmse}')

#-------------------------------------------------------------------------------------------------------------------------------#
# Save the model and encoders
import pickle

with open('movie_rating_model.pkl', 'wb') as file:
    pickle.dump(model, file)

with open('label_encoders.pkl', 'wb') as file:
    pickle.dump(label_encoders, file)

with open('scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)
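
The three pickled artifacts are only useful together: a new movie has to pass through the same label encoders and the same scaler before the model can score it. The following is a minimal sketch of that round trip, not the notebook's own code. It assumes a tiny synthetic training frame in place of the real CSV, writes the pickles to a temporary directory, and introduces two hypothetical helpers, `load_artifact` and `predict_rating`. A label that was not present at fit time cannot be encoded, so the sketch raises an error for it.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

FEATURES = ['Year', 'Duration', 'Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']
CATEGORICAL = ['Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']

# Synthetic stand-in for the cleaned training frame (assumption: NOT the real data).
train = pd.DataFrame({
    'Year': [2010, 2015, 2019, 2021],
    'Duration': [105, 120, 110, 90],
    'Genre': ['Drama', 'Comedy', 'Drama', 'Comedy'],
    'Director': ['Dir A', 'Dir B', 'Dir A', 'Dir C'],
    'Actor 1': ['Act 1', 'Act 2', 'Act 1', 'Act 3'],
    'Actor 2': ['Act 4', 'Act 5', 'Act 4', 'Act 5'],
    'Actor 3': ['Act 6', 'Act 6', 'Act 7', 'Act 7'],
    'Rating': [7.0, 5.5, 6.8, 4.4],
})
first_movie = train.loc[0, FEATURES].to_dict()  # raw values, captured before encoding

# Fit the encoders, scaler, and model in the same order as the cell above.
label_encoders = {}
for column in CATEGORICAL:
    label_encoders[column] = LabelEncoder().fit(train[column])
    train[column] = label_encoders[column].transform(train[column])
scaler = StandardScaler().fit(train[FEATURES])
model = LinearRegression().fit(scaler.transform(train[FEATURES]), train['Rating'])

# Assumption: a temporary directory is used instead of the notebook's working directory.
workdir = tempfile.mkdtemp()
for name, obj in [('movie_rating_model.pkl', model),
                  ('label_encoders.pkl', label_encoders),
                  ('scaler.pkl', scaler)]:
    with open(os.path.join(workdir, name), 'wb') as file:
        pickle.dump(obj, file)


def load_artifact(name):
    """Hypothetical helper: unpickle one saved artifact from workdir."""
    with open(os.path.join(workdir, name), 'rb') as file:
        return pickle.load(file)


def predict_rating(movie):
    """Hypothetical helper: encode, scale, and score one movie given as a dict."""
    row = pd.DataFrame([movie], columns=FEATURES)
    for column, encoder in load_artifact('label_encoders.pkl').items():
        unseen = set(row[column]) - set(encoder.classes_)
        if unseen:
            raise ValueError(f'Unseen {column} value(s): {sorted(unseen)}')
        row[column] = encoder.transform(row[column])
    scaled = load_artifact('scaler.pkl').transform(row)
    return float(load_artifact('movie_rating_model.pkl').predict(scaled)[0])


print(predict_rating(first_movie))
```

Only pickles from a trusted source should be loaded, since `pickle.load` can execute arbitrary code. Keeping the encoders in a dict keyed by column name, as the cell above does, is what lets the loop re-apply each one to the matching column.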


                                 Name    Year Duration            Genre  \
0                                         NaN      NaN            Drama   
1  #Gadhvi (He thought he was Gandhi)  (2019)  109 min            Drama   
2                         #Homecoming  (2021)   90 min   Drama, Musical   
3                             #Yaaram  (2019)  110 min  Comedy, Romance   
4                   ...And Once Again  (2010)  105 min            Drama   

   Rating Votes            Director       Actor 1             Actor 2  \
0     NaN   NaN       J.S. Randhawa      Manmauji              Birbal   
1     7.0     8       Gaurav Bakshi  Rasika Dugal      Vivek Ghamande   
2     NaN   NaN  Soumyajit Majumdar  Sayani Gupta   Plabita Borthakur   
3     4.4    35          Ovais Khan       Prateik          Ishita Raj   
4     NaN   NaN        Amol Palekar  Rajat Kapoor  Rituparna Sengupta   

           Actor 3  
0  Rajendra Bhatia  
1    Arvind Jangid  
2       Roy Angana  
3  Siddhant Kapoor  
4    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

(Warning repeated for each in-place column assignment: 'Year', 'Duration', and the label-encoded columns.)
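
The warning in this output is pandas' SettingWithCopyWarning. A likely cause, stated here as an assumption about the pandas version that produced it, is that the frame returned by `dropna()` is flagged as a possible copy of the original, so later column assignments look ambiguous. A common remedy is to take an explicit copy before assigning. The sketch below shows that pattern on three sample rows whose values come from the `head()` output above. The names `raw` and `clean` are illustrative only.

```python
import pandas as pd

# Three sample rows shaped like the raw CSV (values taken from the head() output).
raw = pd.DataFrame({
    'Year': ['(2019)', '(2021)', '(2019)'],
    'Duration': ['109 min', '90 min', '110 min'],
    'Rating': [7.0, None, 4.4],
})

# .copy() makes the filtered frame independent of `raw`, so the column
# assignments below are unambiguous.
clean = raw.dropna().copy()

# Same cleaning steps as in the notebook cell, applied to the explicit copy.
clean['Year'] = clean['Year'].str.extract(r'(\d+)', expand=False).astype(int)
clean['Duration'] = clean['Duration'].str.replace(' min', '', regex=False).astype(int)

print(clean)
```

The row with the missing rating is dropped, and the two remaining rows end up with integer 'Year' and 'Duration' columns.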