# Final Project Machine Learning Course


## Movie Genre Classification
Dataset link: [Genre Classification Dataset - IMDb](https://www.kaggle.com/datasets/hijest/genre-classification-dataset-imdb/data)





### The task is to classify each movie's genre based on its description.






In [1]:
import pandas as pd

# Read the data using pandas.
# The files have no header row and use ':::' as the field separator;
# a multi-character separator requires the Python parsing engine.

train_path = 'Genre Classification Dataset/train_data.txt'
train_data = pd.read_csv(train_path, delimiter=':::', engine='python', names=['ID', 'Title', 'Genre', 'Description'])

test_path = 'Genre Classification Dataset/test_data_solution.txt'
test_data = pd.read_csv(test_path, delimiter=':::', engine='python', names=['ID', 'Title', 'Genre', 'Description'])




## Load dataset
### Overview:

The dataset is sourced from IMDb (Internet Movie Database),
a comprehensive online database containing information about films, TV shows, videos, games, and streaming content.

### Content:
The dataset consists of two main components: train data and test data.

* Train Data:

  - Each entry includes an ID, title, genre, and description.
  - The ID uniquely identifies each entry.
  - Title represents the name of the movie or TV show.
  - Genre specifies the category or categories to which the title belongs.
  - Description provides a brief overview or summary of the title's plot or content.

* Test Data:

  - Similar to train data, each entry comprises an ID, title, genre, and description.
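
To make the record layout concrete, here is a minimal sketch of how one raw line splits into the four fields described above. It assumes each line looks like `ID ::: Title ::: Genre ::: Description`, with spaces around the separator. The sample lines and the helper name `parse_record` are made up for illustration and are not taken from the dataset files.

```python
import re

# Hypothetical sample lines in the assumed "ID ::: Title ::: Genre ::: Description" layout.
sample_lines = [
    "1 ::: Cupid (1997) ::: thriller ::: A brother and sister reunite.",
    "2 ::: Spirits (2014/I) ::: horror ::: Four students explore a house ::: and regret it.",
]

FIELDS = ["ID", "Title", "Genre", "Description"]


def parse_record(line):
    """Split one raw line into a dict keyed by the four column names."""
    # maxsplit=3 keeps any ':::' that occurs inside the description in the last field.
    parts = re.split(r"\s*:::\s*", line.strip(), maxsplit=3)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


records = [parse_record(line) for line in sample_lines]
for record in records:
    print(record["ID"], "|", record["Genre"], "|", record["Description"])
```

Splitting on `\s*:::\s*` also strips any padding around titles and genres. If the raw file does pad the separator, this matters for the labels, because `" drama "` and `"drama"` would otherwise count as different classes.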

In [2]:
# view the data
print("shape of train data")
print(train_data.shape)

print("shape of test data")
print(test_data.shape)

shape of train data
(54214, 4)
shape of test data
(54200, 4)


In [3]:
train_data.head(n=10)

Unnamed: 0,ID,Title,Genre,Description
0,1,Oscar et la dame rose (2009),drama,Listening in to a conversation between his do...
1,2,Cupid (1997),thriller,A brother and sister with a past incestuous r...
2,3,"Young, Wild and Wonderful (1980)",adult,As the bus empties the students for their fie...
3,4,The Secret Sin (1915),drama,To help their unemployed father make ends mee...
4,5,The Unrecovered (2007),drama,The film's title refers not only to the un-re...
5,6,Quality Control (2011),documentary,Quality Control consists of a series of 16mm ...
6,7,"""Pink Slip"" (2009)",comedy,In tough economic times Max and Joey have all...
7,8,One Step Away (1985),crime,Ron Petrie (Keanu Reeves) is a troubled teen ...
8,9,"""Desperate Hours"" (2016)",reality-tv,"A sudden calamitous event, causing great loss..."
9,10,Spirits (2014/I),horror,Four high school students embark on a terrify...


### Data Cleaning and Preprocessing
We clean and preprocess the data in two steps:
- Remove duplicates and NaN values.
- Preprocess the text data (description) for better model performance.


In [4]:
import nltk
from nltk import LancasterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\97252\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\97252\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\97252\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
## Check for Duplicates and Remove them
num = train_data.duplicated().sum()
print(num)
if num > 0:
    train_data.drop_duplicates(inplace=True)

0


In [6]:
## Check for nan values
num = train_data.isna().sum()
print(num)
train_data.dropna(inplace=True)

ID             0
Title          0
Genre          0
Description    0
dtype: int64


In [15]:
# Preprocessing

# Build the set of English stopwords once, so clean_text does not rebuild it for every row
stop_words = set(stopwords.words('english'))
print(stop_words)

# Example for LancasterStemmer
# words = ['sincerely','electricity','roughly','ringing']
# Lanc = LancasterStemmer()
# for w in words:
#     print(w, " : ", Lanc.stem(w))


# Stemmer instance (not used by clean_text below)
Lanc = LancasterStemmer()


def clean_text(text):
    # Remove URLs and email addresses
    text = re.sub(r'\b(?:https?://|www\.)\S+\b', '', text)
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)

    # Remove everything except letters and whitespace (digits, punctuation, symbols).
    # Note: contractions are not expanded, so "doesn't" becomes "doesnt"
    # and is not caught by the stopword filter.
    text = re.sub(r'[^A-Za-z\s]', '', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize text
    word_tokens = word_tokenize(text)

    # Remove stopwords (uses the module-level stop_words set built above)
    word_tokens = [word for word in word_tokens if word not in stop_words]

    # Lemmatization
    # We found that it didn't improve accuracy and might have even reduced it.
    # lemmatizer = WordNetLemmatizer()
    # word_tokens = [lemmatizer.lemmatize(word) for word in word_tokens]

    # Join tokens back into a single string
    cleaned_text = ' '.join(word_tokens)

    return cleaned_text



train_data['Text_cleaning'] = train_data['Description'].apply(clean_text)
test_data['Text_cleaning'] = test_data['Description'].apply(clean_text)
train_data.head()

{'each', 'ours', 'he', 'had', 'here', 'does', 'no', 'herself', 'again', 'itself', 've', 'is', 'out', 'own', 'there', 'them', 'than', 'myself', 'y', 'my', 'below', 'just', 'were', 'during', 'yourselves', 'yourself', "doesn't", 'any', 'don', 'shan', 'few', 'me', 'both', 'nor', 'very', 'to', 'not', 'into', "mightn't", 'she', 'once', "should've", "she's", 'with', 're', "you'll", 'aren', 'too', "that'll", 'who', 'ma', 'your', 'so', 'did', 'same', 'ain', 'they', 'under', 'ourselves', 'where', 's', 'was', 'which', 'o', 'an', 'himself', 'between', 'all', "it's", 'hers', 'a', 'isn', "needn't", 'be', 'because', 'before', 'has', 'it', 'off', 'but', 'being', 'those', 'didn', 't', 'as', 'further', "wouldn't", 'through', 'most', 'him', 'on', 'when', "couldn't", 'having', 'these', 'hasn', 'some', "didn't", 'or', 'in', 'mightn', 'yours', 'doesn', "won't", 'haven', 'i', 'up', 'over', 'now', 'more', "haven't", 'her', 'and', 'for', 'then', 'won', 'd', "weren't", "don't", "you're", 'am', 'until', 'theirs'

KeyboardInterrupt: 
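
The cleaning steps can be traced on a single sentence. The following is a simplified sketch, not the notebook's `clean_text`: it uses `str.split` instead of `word_tokenize`, and a tiny hand-picked stopword set (`DEMO_STOP_WORDS`, an assumption for illustration), so it runs without any NLTK downloads. The input sentence is also made up.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list instead.
DEMO_STOP_WORDS = {"a", "the", "is", "in", "and", "for"}


def clean_text_demo(text):
    """Apply the same sequence of steps as clean_text, with simplified tokenization."""
    # 1. Remove URLs
    text = re.sub(r"\b(?:https?://|www\.)\S+\b", "", text)
    # 2. Keep only letters and whitespace
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # 3. Lowercase
    text = text.lower()
    # 4. Tokenize on whitespace (stand-in for word_tokenize)
    tokens = text.split()
    # 5. Drop stopwords
    tokens = [token for token in tokens if token not in DEMO_STOP_WORDS]
    # 6. Join back into one string
    return " ".join(tokens)


raw = "Ron Petrie is a troubled teen in 1985! See www.example.com for more."
cleaned = clean_text_demo(raw)
print(cleaned)
```

Because step 2 deletes apostrophes, a contraction such as "Don't" comes out as the single token `dont`, which a stopword list containing `don't` will not match.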

### Data Visualization


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train_data["Genre"].value_counts()

In [None]:
# Set the style of seaborn
sns.set(style="whitegrid")

# Plot the genre distribution: one vertical bar per genre, sorted by frequency
plt.figure(figsize=(10, 8))
sns.countplot(x='Genre', data=train_data, order=train_data['Genre'].value_counts().index)
plt.title('Genre Distribution')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()

## Text to Vectors

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform training data
tfidf_train = tfidf_vectorizer.fit_transform(train_data['Text_cleaning'])

# Transform test data
tfidf_test = tfidf_vectorizer.transform(test_data['Text_cleaning'])


# Initialize CountVectorizer
count_vectorizer = CountVectorizer()

# Fit and transform training data
count_train = count_vectorizer.fit_transform(train_data['Text_cleaning'])

# Transform test data
count_test = count_vectorizer.transform(test_data['Text_cleaning'])

In [None]:
# from gensim.models import Word2Vec
# import numpy as np
#
# # Assuming train_data['Text_cleaning'] contains preprocessed text data
# sentences = [text.split() for text in train_data['Text_cleaning']]
#
# # Train Word2Vec model
# word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
#
# # Function to convert text to a fixed-length vector representation
# def text_to_vector(text):
#     words = text.split()
#     vectors = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
#     if vectors:
#         return np.mean(vectors, axis=0)
#     else:
#         return np.zeros(word2vec_model.vector_size)  # Return zero vector if no words found in Word2Vec model
#
# # Convert train and test data to vectors
# word2vec_train = np.array([text_to_vector(text) for text in train_data['Text_cleaning']])
# word2vec_test = np.array([text_to_vector(text) for text in test_data['Text_cleaning']])

# Now you can use these vectors with any classifier of your choice (e.g., Naive Bayes, SVM, etc.)
# Train and evaluate the classifier using word2vec_train and word2vec_test

## Evaluate the Model


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Initialize the Naive Bayes classifier
nb_classifier = MultinomialNB()

# Train the classifier
nb_classifier.fit(tfidf_train, train_data['Genre'])

# Make predictions on the test data
predictions = nb_classifier.predict(tfidf_test)

# Evaluate the model
accuracy = accuracy_score(test_data['Genre'], predictions)
print("Accuracy:", accuracy)

# Get classification report
print("\nClassification Report:")
print(classification_report(test_data['Genre'], predictions))
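
The same fit, predict, and score flow can be written more compactly. This sketch swaps in scikit-learn's `Pipeline`, which the notebook itself does not use, to chain the vectorizer and the classifier into one object. The toy texts, labels, and variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up cleaned descriptions and genres.
train_texts = [
    "detective hunts killer city",
    "police chase killer night",
    "family wedding goes wrong laughs",
    "friends road trip laughs",
]
train_labels = ["thriller", "thriller", "comedy", "comedy"]
test_texts = ["killer stalks detective", "wedding road trip"]
test_labels = ["thriller", "comedy"]

# The pipeline fits the vectorizer on the training text, then the classifier on its output.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(train_texts, train_labels)

# predict() applies the already-fitted vectorizer to the test text before classifying.
preds = model.predict(test_texts)
acc = accuracy_score(test_labels, preds)
print(list(preds), acc)
```

Bundling the two steps means the vectorizer can never be refitted on test data by accident, and the whole model can be passed to tools such as cross-validation as a single estimator.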

In [None]:
# Initialize the Naive Bayes classifier
nb_classifier_count = MultinomialNB()

# Train the classifier
nb_classifier_count.fit(count_train, train_data['Genre'])

# Make predictions on the test data
predictions_count = nb_classifier_count.predict(count_test)

# Evaluate the model
accuracy_count = accuracy_score(test_data['Genre'], predictions_count)
print("Accuracy with CountVectorizer:", accuracy_count)

# Get classification report
print("\nClassification Report with CountVectorizer:")
print(classification_report(test_data['Genre'], predictions_count))

In [None]:
# Note: MultinomialNB requires non-negative features, and averaged Word2Vec
# vectors contain negative values, so this cell would raise a ValueError if enabled.
# nb_classifier_word2vec = MultinomialNB()
# nb_classifier_word2vec.fit(word2vec_train, train_data['Genre'])
#
# predictions_word2vec = nb_classifier_word2vec.predict(word2vec_test)
#
# accuracy_word2vec = accuracy_score(test_data['Genre'], predictions_word2vec)
# print("Accuracy with Word2Vec:", accuracy_word2vec)
#
# print("\nClassification Report with Word2Vec:")
# print(classification_report(test_data['Genre'], predictions_word2vec))

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# Initialize the KNN classifier
knn_classifier = KNeighborsClassifier()

# Train the KNN classifier
knn_classifier.fit(tfidf_train, train_data['Genre'])

# Make predictions on the test data using KNN
predictions_knn = knn_classifier.predict(tfidf_test)

# Evaluate the KNN model
accuracy_knn = accuracy_score(test_data['Genre'], predictions_knn)
print("\nAccuracy (KNN):", accuracy_knn)

# Get classification report for KNN
print("\nClassification Report (KNN):")
print(classification_report(test_data['Genre'], predictions_knn))

In [None]:
# Initialize the KNN classifier for the CountVectorizer features
knn_classifier_count = KNeighborsClassifier()

# Train the KNN classifier
knn_classifier_count.fit(count_train, train_data['Genre'])

# Make predictions on the test data with the classifier trained on count features
predictions_knn_count = knn_classifier_count.predict(count_test)

# Evaluate the KNN model
accuracy_knn_count = accuracy_score(test_data['Genre'], predictions_knn_count)
print("\nAccuracy (KNN with CountVectorizer):", accuracy_knn_count)

# Get classification report for KNN
print("\nClassification Report (KNN with CountVectorizer):")
print(classification_report(test_data['Genre'], predictions_knn_count))

In [None]:
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier
#
# print('DecisionTreeClassifier')
#
# # Initialize the Decision Tree classifier
# dt_classifier = DecisionTreeClassifier()
#
# # Train the Decision Tree classifier
# dt_classifier.fit(tfidf_train, train_data['Genre'])
#
# # Make predictions on the test data using Decision Tree
# predictions_dt = dt_classifier.predict(tfidf_test)
#
# # Evaluate the Decision Tree model
# accuracy_dt = accuracy_score(test_data['Genre'], predictions_dt)
# print("Accuracy (Decision Tree):", accuracy_dt)
#
# # Get classification report for Decision Tree
# print("\nClassification Report (Decision Tree):")
# print(classification_report(test_data['Genre'], predictions_dt))

In [None]:
# # Initialize the Random Forest classifier
# rf_classifier = RandomForestClassifier()
#
# # Train the Random Forest classifier
# rf_classifier.fit(tfidf_train, train_data['Genre'])
#
# # Make predictions on the test data using Random Forest
# predictions_rf = rf_classifier.predict(tfidf_test)
#
# # Evaluate the Random Forest model
# accuracy_rf = accuracy_score(test_data['Genre'], predictions_rf)
# print("\nAccuracy (Random Forest):", accuracy_rf)
#
# # Get classification report for Random Forest
# print("\nClassification Report (Random Forest):")
# print(classification_report(test_data['Genre'], predictions_rf))







In [None]:
# from sklearn.linear_model import LogisticRegression
#
#
# # Initialize the Logistic Regression classifier
# lr_classifier = LogisticRegression(multi_class='multinomial',max_iter=100, solver='sag')
#
# # Train the Logistic Regression classifier
# lr_classifier.fit(tfidf_train, train_data['Genre'])
#
# # Make predictions on the test data using Logistic Regression
# predictions_lr = lr_classifier.predict(tfidf_test)
#
# # Evaluate the Logistic Regression model
# accuracy_lr = accuracy_score(test_data['Genre'], predictions_lr)
# print("Accuracy (Logistic Regression):", accuracy_lr)
#
# # Get classification report for Logistic Regression
# print("\nClassification Report (Logistic Regression):")
# print(classification_report(test_data['Genre'], predictions_lr))

In [None]:
# print('svm')
# # Initialize the SVM classifier
# svm_classifier = SVC(kernel='linear')
#
# # Train the classifier
# svm_classifier.fit(tfidf_train, train_data['Genre'])
#
# # Make predictions on the test data
# predictions = svm_classifier.predict(tfidf_test)
#
# # Evaluate the model
# accuracy = accuracy_score(test_data['Genre'], predictions)
# print("Accuracy:", accuracy)
#
# # Get classification report
# print("\nClassification Report:")
# print(classification_report(test_data['Genre'], predictions))



