# MOVIE GENRE CLASSIFICATION

# Problem Statement: Create a machine learning model that can predict the genre of a movie based on its plot summary or other textual information. You can use techniques like TF-IDF or word embeddings with classifiers such as Naive Bayes, Logistic Regression, or Support Vector Machines.
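
The approach named in the problem statement can be sketched on a toy corpus before touching the real data. The snippet below is a minimal illustration, not the notebook's method: the four plot strings, their labels, and the names `toy_plots`, `toy_genres`, `candidates`, and `toy_predictions` are invented for the example. It wires TF-IDF features into each of the three suggested classifiers through a scikit-learn pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data, only for illustration.
toy_plots = [
    "a detective hunts a killer through the city",
    "a killer stalks a detective and his family",
    "two friends fall in love on a funny road trip",
    "a funny wedding goes wrong for two friends",
]
toy_genres = ["thriller", "thriller", "comedy", "comedy"]

# The three classifier families mentioned in the problem statement.
candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
}

toy_predictions = {}
for name, clf in candidates.items():
    # Each pipeline learns its own TF-IDF vocabulary, then fits the classifier.
    model = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    model.fit(toy_plots, toy_genres)
    toy_predictions[name] = model.predict(["a killer hides in the city"])[0]

print(toy_predictions)
```

Keeping the vectorizer and the classifier in one pipeline means the same object can later be passed to cross-validation without refitting the vocabulary by hand.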

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import classification_report, accuracy_score

In [16]:
training_data_file = "train_data.txt"
validation_data_file = "test_data_solution.txt"
test_data_file = "test_data.txt"

In [17]:
# Fields are separated by " ::: "; a multi-character separator requires the Python parsing engine.
train_df = pd.read_csv(training_data_file, delimiter=" ::: ", engine='python', names=["index", "movie_name", "genre", "description"])
validation_df = pd.read_csv(validation_data_file, delimiter=" ::: ", engine='python', names=["index", "movie_name", "genre", "description"])

In [18]:
test_df = pd.read_csv(test_data_file, delimiter=" ::: ", engine='python', names=["index", "movie_name", "description"])
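
The three `read_csv` calls above differ only in whether a `genre` column is present, so they could be folded into one helper. The sketch below assumes a hypothetical function `load_plots` (not part of the notebook) and reads two abbreviated sample rows from an in-memory buffer instead of the real files. It also strips stray whitespace from the text columns, which makes later exact-match lookups less fragile.

```python
import io

import pandas as pd


def load_plots(source, has_genre=True):
    """Read a ' ::: '-separated file or buffer (hypothetical helper)."""
    columns = ["index", "movie_name", "description"]
    if has_genre:
        columns.insert(2, "genre")
    # The multi-character separator is handled by the Python engine.
    df = pd.read_csv(source, sep=" ::: ", engine="python", names=columns)
    # Trim leading/trailing whitespace in every text column.
    text_columns = [c for c in columns if c != "index"]
    df[text_columns] = df[text_columns].apply(lambda col: col.str.strip())
    return df


# Two abbreviated rows in the same layout as the labelled files.
sample = io.StringIO(
    "1 ::: Edgar's Lunch (1998) ::: thriller ::: L.R. Brane loves his life\n"
    "2 ::: La guerra de papá (1977) ::: comedy ::: Spain, March 1964\n"
)
sample_df = load_plots(sample)
print(sample_df)
```

With the real data, the same helper would be called as `load_plots(training_data_file)` and `load_plots(test_data_file, has_genre=False)`.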

In [19]:
train_df

Unnamed: 0,index,movie_name,genre,description
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his doc...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous re...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fiel...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends meet...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-rec...
...,...,...,...,...
54209,54210,"""Bonino"" (1953)",comedy,This short-lived NBC live sitcom centered on B...
54210,54211,Dead Girls Don't Cry (????),horror,The NEXT Generation of EXPLOITATION. The siste...
54211,54212,Ronald Goedemondt: Ze bestaan echt (2008),documentary,"Ze bestaan echt, is a stand-up comedy about gr..."
54212,54213,Make Your Own Bed (1944),comedy,Walter and Vivian live in the country and have...


In [20]:
test_df

Unnamed: 0,index,movie_name,description
0,1,Edgar's Lunch (1998),"L.R. Brane loves his life - his car, his apart..."
1,2,La guerra de papá (1977),"Spain, March 1964: Quico is a very naughty chi..."
2,3,Off the Beaten Track (2010),One year in the life of Albin and his family o...
3,4,Meu Amigo Hindu (2015),"His father has died, he hasn't spoken with his..."
4,5,Er nu zhai (1955),Before he was known internationally as a marti...
...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)","Covering multiple genres, Tales of Light & Dar..."
54196,54197,Der letzte Mohikaner (1965),As Alice and Cora Munro attempt to find their ...
54197,54198,Oliver Twink (2007),"A movie 169 years in the making. Oliver Twist,..."
54198,54199,Slipstream (1973),"Popular, but mysterious rock D.J Mike Mallard ..."


In [21]:
validation_df

Unnamed: 0,index,movie_name,genre,description
0,1,Edgar's Lunch (1998),thriller,"L.R. Brane loves his life - his car, his apart..."
1,2,La guerra de papá (1977),comedy,"Spain, March 1964: Quico is a very naughty chi..."
2,3,Off the Beaten Track (2010),documentary,One year in the life of Albin and his family o...
3,4,Meu Amigo Hindu (2015),drama,"His father has died, he hasn't spoken with his..."
4,5,Er nu zhai (1955),drama,Before he was known internationally as a marti...
...,...,...,...,...
54195,54196,"""Tales of Light & Dark"" (2013)",horror,"Covering multiple genres, Tales of Light & Dar..."
54196,54197,Der letzte Mohikaner (1965),western,As Alice and Cora Munro attempt to find their ...
54197,54198,Oliver Twink (2007),adult,"A movie 169 years in the making. Oliver Twist,..."
54198,54199,Slipstream (1973),drama,"Popular, but mysterious rock D.J Mike Mallard ..."


In [22]:
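# Combine the training set and the labelled test set before vectorization.
# Caveat: validation_df contains the same descriptions as test_df, so the
# predictions made on test_df below are for plots the model may have trained on.
combined_df = pd.concat([train_df, validation_df], ignore_index=True)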

In [23]:
# TF-IDF vectorization: keep the 5,000 most frequent terms, dropping English stop words
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf_vectorizer.fit_transform(combined_df['description'])
y = combined_df['genre']
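
Fitting the vectorizer on the combined data lets vocabulary and IDF weights from the held-out descriptions influence training. A common alternative, sketched here on invented toy frames (`toy_train`, `toy_holdout`, and `vectorizer` are illustrative names, not the notebook's variables), is to call `fit_transform` on the training text only and plain `transform` on everything else.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy data, only for illustration.
toy_train = pd.DataFrame({
    "description": ["a detective hunts a killer", "friends plan a funny wedding"],
    "genre": ["thriller", "comedy"],
})
toy_holdout = pd.DataFrame({
    "description": ["a killer boards a spaceship"],
    "genre": ["thriller"],
})

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
# Vocabulary and IDF weights come from the training text only.
X_fit = vectorizer.fit_transform(toy_train["description"])
# Held-out text is projected onto that fixed vocabulary; unseen words are ignored.
X_holdout = vectorizer.transform(toy_holdout["description"])

print(X_fit.shape, X_holdout.shape)
print("spaceship" in vectorizer.vocabulary_)
```

Both matrices share the same columns, so a classifier fitted on `X_fit` can score `X_holdout` directly.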

In [24]:
# Train/validation split (80/20); random_state is fixed so the split is reproducible
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
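
The genre counts in this dataset are heavily skewed, so a purely random split can leave rare genres under-represented in one of the two parts. One option is a stratified split. The sketch below uses invented toy labels and illustrative names (`toy_labels`, `toy_features`, `l_tr`, `l_te`) to show that `stratify` keeps the class ratio the same on both sides.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Invented toy labels with a 3:1 imbalance.
toy_labels = ["drama"] * 15 + ["comedy"] * 5
toy_features = [[i] for i in range(len(toy_labels))]

f_tr, f_te, l_tr, l_te = train_test_split(
    toy_features,
    toy_labels,
    test_size=0.2,
    stratify=toy_labels,   # preserve the drama/comedy ratio in both parts
    random_state=42,       # fixed seed so the split is repeatable
)

print(Counter(l_tr), Counter(l_te))
```

Applied to the notebook's data, this would correspond to passing `stratify=y` to the existing `train_test_split` call.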

In [25]:
#Train Classifier (Logistic Regression)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)

In [26]:
#  Predict Genre for Test Data
X_test = tfidf_vectorizer.transform(test_df['description'])
y_pred = classifier.predict(X_test)
test_df['predicted_genre'] = y_pred

In [27]:
#  Interactive Movie Prediction
while True:
    movie_name = input("Enter a movie name (or 'quit' to exit): ")
    if movie_name.lower() == 'quit':
        break
    else:
        movie = test_df[test_df['movie_name'] == movie_name]
        if not movie.empty:
            predicted_genre = movie.iloc[0]['predicted_genre']
            print(f"Predicted Genre for '{movie_name}': {predicted_genre}")
        else:
            print(f"Movie '{movie_name}' not found in the test dataset.")

Enter a movie name (or 'quit' to exit): "Tales of Light & Dark" (2013)	
Movie '"Tales of Light & Dark" (2013)	' not found in the test dataset.
Enter a movie name (or 'quit' to exit): Edgar's Lunch (1998)
Predicted Genre for 'Edgar's Lunch (1998)': short
Enter a movie name (or 'quit' to exit): Meu Amigo Hindu (2015)
Predicted Genre for 'Meu Amigo Hindu (2015)': drama
Enter a movie name (or 'quit' to exit): quit
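
The first lookup in the session above appears to fail because the pasted title carries a trailing tab, so the exact string comparison does not match. A more forgiving lookup could normalize both sides before comparing. The helper `lookup_genre` below is a hypothetical sketch, and the genre values in the toy frame are placeholders, not model output.

```python
import pandas as pd


def lookup_genre(predictions, title):
    """Return the predicted genre for a title, or None if it is absent."""
    # Ignore surrounding whitespace and letter case on both sides.
    wanted = title.strip().casefold()
    names = predictions["movie_name"].str.strip().str.casefold()
    matches = predictions.loc[names == wanted, "predicted_genre"]
    return matches.iloc[0] if not matches.empty else None


# Toy frame with placeholder genres, only for illustration.
toy_predictions = pd.DataFrame({
    "movie_name": ['"Tales of Light & Dark" (2013)', "Slipstream (1973)"],
    "predicted_genre": ["genre_a", "genre_b"],
})

found = lookup_genre(toy_predictions, '"Tales of Light & Dark" (2013)\t')
missing = lookup_genre(toy_predictions, "No Such Movie (2000)")
print(found, missing)
```

Inside the interactive loop, the same function would replace the direct `test_df['movie_name'] == movie_name` comparison.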


In [28]:
print("\nTest Results on Validation Set:")
# Predicted labels for the held-out 20% split
y_val_pred = classifier.predict(X_val)
print(classification_report(y_val, y_val_pred))


Test Results on Validation Set:




              precision    recall  f1-score   support

      action       0.47      0.30      0.36       541
       adult       0.65      0.31      0.42       247
   adventure       0.55      0.22      0.31       312
   animation       0.46      0.10      0.17       201
   biography       0.00      0.00      0.00       104
      comedy       0.54      0.61      0.57      2908
       crime       0.32      0.05      0.09       214
 documentary       0.69      0.85      0.76      5213
       drama       0.57      0.77      0.65      5516
      family       0.53      0.15      0.24       316
     fantasy       0.60      0.08      0.14       115
   game-show       0.84      0.70      0.76        73
     history       0.00      0.00      0.00        86
      horror       0.66      0.57      0.61       916
       music       0.62      0.51      0.56       290
     musical       0.33      0.03      0.05       115
     mystery       0.22      0.02      0.03       118
        news       0.80    

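
The `_warn_prf` fragments in the raw output are most likely the tail of scikit-learn's undefined-metric warning, which is raised when a class (here, for example, biography and history) never appears among the predictions. A minimal sketch of how that case can be reported explicitly, using invented toy labels and illustrative names (`toy_true`, `toy_pred`, `report`, `toy_accuracy`):

```python
from sklearn.metrics import accuracy_score, classification_report

# Invented toy labels: "biography" is never predicted.
toy_true = ["drama", "drama", "comedy", "biography"]
toy_pred = ["drama", "drama", "comedy", "drama"]

# zero_division=0 reports 0.0 for undefined precision instead of warning.
report = classification_report(
    toy_true, toy_pred, zero_division=0, output_dict=True
)
toy_accuracy = accuracy_score(toy_true, toy_pred)

print(classification_report(toy_true, toy_pred, zero_division=0))
print(f"accuracy: {toy_accuracy:.2f}")
```

Passing `output_dict=True` returns the same numbers as a nested dictionary, which is easier to sort or plot per genre than the printed table.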
