# IST 736 - Text Mining: Movie Recommender
By: Gianni Conde, Sana Khan, Sahil Nanavaty
Spring 2024

## Goal
To determine movie recommendations based on movie genre and description using supervised and unsupervised machine learning techniques.  

### Role
My role in this analysis was to perform data cleansing and preparation, assist in creating exploratory visualizations, create training and testing data sets, and develop supervised machine learning models. The models that I specifically developed were Multinomial and Bernoulli Naive Bayes. 

## About the Data
The data was retrieved from Kaggle and originated from IMDB.com. The original corpus contained 16 separate CSV files that each represented a genre. These genres included action, adventure, animation, biography, crime, family, fantasy, film-noir, history, horror, mystery, romance, sci-fi, sports, thriller, and war. Each file contained 14 columns and a varying number of rows, ranging from 987 to 52,618. The columns represented movie ID, movie name, year, certificate, runtime, genre, rating, description, director, director ID, star, star ID, votes, and gross in USD.

Link(s):
https://www.kaggle.com/datasets/rajugc/imdb-movies-dataset-based-on-genre

In [None]:
## Libraries
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn import metrics
# plot_confusion_matrix is not imported: it was removed in scikit-learn 1.2 and is not used below
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import (accuracy_score, f1_score, classification_report)
from sklearn.metrics import silhouette_samples, silhouette_score
from nltk.stem import WordNetLemmatizer 
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import random as rd
import itertools
import json


from sklearn.decomposition import PCA
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
import pyLDAvis
from gensim.models import LdaModel


## Importing the Data

In [None]:
## Reading csv files
action = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\action.csv')    
adventure = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\adventure.csv')    
animation = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\animation.csv')    
bio = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\biography.csv')    
crime = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\crime.csv')    
family = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\family.csv')    
fantasy = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\fantasy.csv')    
film_noir = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\film-noir.csv')    
history = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\history.csv')    
horror = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\horror.csv')    
mystery = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\mystery.csv')  
romance = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\romance.csv')    
scifi = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\scifi.csv')    
sports = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\sports.csv')    
thriller = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\thriller.csv')    
war = pd.read_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\data\war.csv') 

## Data Cleansing and Preparation

In [None]:
## Adding Genre column
action['genre'] = 'action'
adventure['genre'] = 'adventure'
animation['genre'] = 'animation'
bio['genre'] = 'biography'
crime['genre'] = 'crime'
family['genre'] = 'family'
fantasy['genre'] = 'fantasy'
film_noir['genre'] = 'film-noir'
history['genre'] = 'history'
horror['genre'] = 'horror'
mystery['genre'] = 'mystery'
romance['genre'] = 'romance'
scifi['genre'] = 'sci-fi'
sports['genre'] = 'sports'
thriller['genre'] = 'thriller'
war['genre'] = 'war'

In [None]:
## Concatenate all files into one dataframe
#movies = pd.concat(file_list, ignore_index = True)
movies = pd.concat([action,adventure,animation,bio,crime,family,fantasy,
                    film_noir,history,horror,mystery,romance,scifi,
                    sports,thriller,war])
# Resetting index
movies = movies.reset_index(drop=True)
movies.head()
## Writing the dataframe to a csv file
movies.to_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\new data\combined_data.csv', index = False)

In [None]:
## Inspection
print(movies.shape) # 368,300 rows (movies), 14 columns (categories)
print('The column names are:\n')
for col in movies.columns:
    print(col)
    
## Checking for missing values
movies.isnull().sum()

In [None]:
## Year
movies['year'].unique()
unwanted_years = ['I', 'II', 'V', 'III', 'VII', 'IV', 'XXIII', 'IX', 'XV', 'VI',
                  'X', 'XIV', 'XIX', 'XXIX', 'XXI', 'VIII', 'XI', 'XVIII', 'XII', 
                  'XIII', 'LXXI', 'XVI', 'XX', 'XXXIII', 'XXXII', 'XXXVI', 'XVII', 
                  'LXIV', 'LXII', 'LXVIII', 'XL', 'XXXIV', 'XXXI', 'XLV', 'XLIV', 
                  'XXIV', 'XXVII', 'LX', 'XXV', 'XXXIX', '2029', 'XXVIII', 'XXX', 
                  'LXXII', '1909', 'XXXVIII', 'XXII', 'LVI', 'LVII', 'XLI', 'LII', 
                  'XXXVII', 'LIX', 'LVIII', 'LXX', 'XLIII', 'XLIX', 'LXXIV', 'XXVI', 
                  'C', 'XLI', 'LVII', 'LV','XLVI', 'LXXVII', 'XXXV', 'LIV', 'LI', 
                  'LXXXII', 'XCIX', 'LXIII']
# .copy() avoids pandas SettingWithCopyWarning on the assignments below
movies_clean = movies[~movies['year'].isin(unwanted_years)].copy()
# Dropping rows with missing years
movies_clean = movies_clean.dropna(subset=['year'])
movies_clean['year'].isnull().sum()
# Make 'year' column int
movies_clean['year'] = movies_clean['year'].astype(int)
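
The hard-coded list of Roman numerals above could also be expressed as a pattern check. The following is a minimal sketch, assuming that the unwanted values are valid Roman numerals; the helper name `is_roman_numeral` and the sample values are hypothetical, and the two out-of-range years in the list ('2029' and '1909') would still need to be handled separately.

```python
import re

# Strict pattern for Roman numerals from 1 to 3999
# (hypothetical helper, not part of the original notebook)
ROMAN_RE = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

def is_roman_numeral(value):
    """Return True if value is a non-empty string that is a valid Roman numeral."""
    return isinstance(value, str) and value != "" and ROMAN_RE.match(value) is not None

# Toy values mimicking what the 'year' column may contain
sample_years = ["I", "XXIII", "LXXXII", "XCIX", "C", "1999", "2029", None]
flags = [is_roman_numeral(y) for y in sample_years]
print(flags)
```

With a helper like this, the filter could be written as `movies[~movies['year'].map(is_roman_numeral)]` instead of listing every numeral by hand.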

In [None]:
## Removing unwanted columns
movies_clean = movies_clean.drop(['movie_id', 'director_id', 'star_id'], axis = 1)

In [None]:
## Director
# Remove newline characters
movies_clean['director'] = movies_clean['director'].str.replace('\n', '')
movies_clean['director'] = movies_clean['director'].fillna('none')

## Star
# Remove newline characters
movies_clean['star'] = movies_clean['star'].str.replace('\n', '')
movies_clean['star'] = movies_clean['star'].fillna('none')
# Rename column
movies_clean.rename(columns = {'star':'actors'}, inplace = True)

## Runtime
movies_clean = movies_clean.dropna(subset=['runtime'])
#Change type to int
movies_clean['runtime'] = movies_clean['runtime'].str.replace('min', '')
movies_clean['runtime'] = movies_clean['runtime'].str.replace(',', '').astype(int)
# Convert to timedelta (minutes)
movies_clean['runtime'] = pd.to_timedelta(movies_clean['runtime'], unit='m')

## Certificate
movies_clean['certificate'].unique()
movies_clean['certificate'].value_counts()
movies_clean['certificate'] = movies_clean['certificate'].fillna('NR')
movies_clean['certificate'] = movies_clean['certificate'].replace({'Not Rated':'NR'}, regex = True)

## Movie Name
movies_clean = movies_clean.dropna(subset=['movie_name'])

In [None]:
## Gross
# Rename column
movies_clean.rename(columns = {'gross(in $)':'revenue'}, inplace = True)
# Imputing missing revenue with the median
movies_clean['revenue'] = movies_clean['revenue'].fillna(movies_clean['revenue'].median())

## Votes
movies_clean['votes'] = movies_clean['votes'].fillna(0).astype(int)

## Rating
movies_clean['rating'] = movies_clean['rating'].fillna(0).astype(float)

In [None]:
## Final check for missing values
movies_clean.isna().sum()

In [None]:
## Checking for duplicated rows
duplicate_values = movies_clean['movie_name'].duplicated()
print(duplicate_values)
## Removing duplicate rows
movies_clean = movies_clean.drop_duplicates(subset=['movie_name'], keep='first')
print(movies_clean)

## Writing the clean dataframe to a csv file
movies_clean.to_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\new data\movies_clean.csv', index = False)

## Exploratory Data Analysis

In [None]:
movies_clean.describe()
movies_clean.columns.unique()
movies_clean.shape # 142,626 movies, 11 columns
movies_clean.info()

#options for plot styles
print(plt.style.available)

In [None]:
genre_counts = movies_clean['genre'].value_counts()
genre_counts

## Plotting movies per genre
plt.figure(figsize=(10, 6))
genre_counts.plot(kind='bar', color='skyblue')
plt.title('Number of Movies per Genre')
plt.xlabel('Genre')
plt.ylabel('Number of Movies')
plt.xticks(rotation=45, ha='right')  
plt.tight_layout()
plt.show()

In [None]:
revenue_genre = movies_clean.groupby('genre', as_index=False)['revenue'].sum().sort_values(by='revenue', ascending=False)
revenue_genre

## Plotting revenue by genre
plt.figure(figsize=(14,7))
ax = sns.barplot(x=revenue_genre['revenue'], 
                  y=revenue_genre['genre'], 
                  palette='crest')
plt.xlabel('revenue (hundreds of billions)', fontdict = {'fontname': 'Times New Roman', 'color': 'black', 'fontsize' : '15'})
plt.ylabel('genre',fontdict = {'fontname': 'Times New Roman', 'color': 'black', 'fontsize' : '15'})
plt.title('Revenue by Genre', 
          fontdict = {'fontname': 'Times New Roman', 'color': 'black', 'fontsize' : '25'})
plt.show()

In [None]:
# Calculate average ratings per genre
avg_ratings = movies_clean.groupby('genre')['rating'].mean().sort_values(ascending=False).index

# Create a boxplot with the specified order
plt.figure(figsize=(12, 8))
# The column is named 'genre' (lowercase) in movies_clean
sns.boxplot(x='genre', y='rating', data=movies_clean, order=avg_ratings, palette='viridis')
plt.title('Boxplot of Ratings per Genre')
plt.xlabel('Genre')
plt.ylabel('Rating')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better visibility
plt.show()

In [None]:
### Creating a smaller data frame
## Genres: sci-fi, crime, romance, sports, horror
movies_clean['genre'].unique()
# .copy() so that the in-place column drop later does not act on a view of movies_clean
movies_new = movies_clean.loc[movies_clean['genre'].isin(
    ['sci-fi','crime','romance','sports','horror'])].copy()
movies_new['genre'].unique()
movies_new  # 62,010 rows, 11 columns

# Writing the new data set to a csv
movies_new.to_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\new data\5_genres.csv', index = False)

In [None]:
## DF only containing genre and description
movies_new.columns
columns_dropped = ['movie_name','year','certificate','runtime','rating', 
                   'director','actors','votes','revenue']
movies_new.drop(columns_dropped, axis = 1, inplace = True)
movies_new

In [None]:
print(movies_new)  # 62,010 descriptions (rows), 2 columns
## Path to the 5-genre csv file (a string, not a df)
movie_description = r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\new data\5_genres.csv'

## Model Preparations

In [None]:
### Sampling the data (30%)
myDF = movies_new.sample(frac = 0.3, random_state = 42)
print(myDF.shape)  # 18,603 descriptions 
print(myDF)
print(type(myDF))

### Writing the DF to a csv
myDF.to_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\new data\descriptions.csv', index = False)
infile = r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\new data\descriptions.csv'
print(infile)
print(type(infile))

In [None]:
### Tokenize and Vectorize the descriptions
## Create the list of descriptions
## Keep the labels
description_LIST = []
label_LIST = []
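with open(infile, 'r', encoding="utf-8") as FILE:
    FILE.readline()  # skip the header row
    for row in FILE:
        # Split on the first comma only: genre label first, then the description text
        next_label, next_description = row.split(',', 1)
        label_LIST.append(next_label)
        description_LIST.append(next_description)
# No explicit close is needed: the with-block closes the file automatically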
    
print('The description list is:\n')
print(description_LIST)
print('The label list is:\n')
print(label_LIST)
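
Splitting each line on the first comma works as long as every record occupies exactly one line. A description that contains a line break is written by `to_csv` as a quoted field spanning several lines, and a plain line-by-line split would then break. The following is a minimal sketch of the same read using the standard `csv` module; the two sample rows are hypothetical and only mimic the genre, description layout of `descriptions.csv`.

```python
import csv
import io

# Hypothetical sample in the same "genre,description" layout as descriptions.csv.
# The first description is quoted and contains both commas and a line break.
sample = (
    'genre,description\n'
    'horror,"A cabin, a storm,\nand no way out."\n'
    'crime,A detective returns.\n'
)

labels, descriptions = [], []
reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row
for label, description in reader:
    labels.append(label)
    descriptions.append(description)

print(labels)
print(descriptions)
```

For a real file, the same loop would read from `open(infile, newline='', encoding='utf-8')` instead of the in-memory string.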

In [None]:
## Remove all words in descriptions that match the genres.
# Genre names taken from the labels (assumed to be the words that should be removed)
genre_words = {label.lower() for label in label_LIST}
new_description_LIST = []

for element in description_LIST:
    ## make into list
    all_words = element.split(" ")
    ## Now remove words that are in your topics
    new_words_LIST = []
    for word in all_words:
        word = word.lower()
        if word not in genre_words:
            new_words_LIST.append(word)
    ##turn back to string
    new_words = " ".join(new_words_LIST)
    ## Place into new_description_LIST
    new_description_LIST.append(new_words)

In [None]:
## Setting the old description list to the new one
description_LIST = new_description_LIST
print(description_LIST) 

In [None]:
### Lemmatizer
lemmer = WordNetLemmatizer()
def my_lemmer(str_input):
    words = re.sub(r"[^A-Za-z\-]", " ", str_input).lower().split()
    words = [lemmer.lemmatize(word) for word in words]
    return words

In [None]:
### Instantiating vectorizers
## CountVectorizer
myCV = CountVectorizer(input = 'content',
                       lowercase = True,
                       stop_words = 'english',
                       max_features = 1000,
                       tokenizer = my_lemmer)
## TfidfVectorizer (binary term frequency)
myCV2 = TfidfVectorizer(input = 'content',
                           stop_words = 'english',
                           lowercase = True,
                           max_features = 1000,
                           binary = True,
                           tokenizer = my_lemmer)
## Binary CountVectorizer (word presence/absence, used for Bernoulli NB)
myCV3 = CountVectorizer(input = 'content',
                           stop_words = 'english',
                           lowercase = True,
                           max_features = 1000,
                           binary = True,
                           tokenizer = my_lemmer)
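
To make the difference between the three feature sets concrete, the sketch below builds raw counts, binary presence flags, and TF-IDF weights by hand for three toy documents. It is an illustration only: the documents are hypothetical, and the TF-IDF step uses raw counts with the smoothed IDF formula and L2 row normalization that scikit-learn applies by default, whereas `myCV2` above sets `binary = True`, which replaces counts with 0/1 before weighting.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy descriptions (not taken from the data set)
docs = [
    ["ghost", "house", "ghost"],
    ["love", "house"],
    ["love", "story", "love"],
]
vocab = sorted({word for doc in docs for word in doc})

# Raw term counts: the kind of input a multinomial model expects
counts = [[Counter(doc)[word] for word in vocab] for doc in docs]

# Presence/absence flags: the kind of input a Bernoulli model expects
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = [sum(row[j] for row in binary) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (rows of zeros are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print(vocab)
print(counts)
print(binary)
```

Words that occur in fewer documents (here "ghost" and "story") receive a larger IDF weight than words shared across documents ("house" and "love").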

In [None]:
## Applying vectorizers
matrix_CV = myCV.fit_transform(description_LIST)
matrix_CV2 = myCV2.fit_transform(description_LIST)
matrix_CV3 = myCV3.fit_transform(description_LIST)
print(type(matrix_CV))
print(type(matrix_CV2))
print(type(matrix_CV3))

In [None]:
## Retrieving column names
cols1 = myCV.get_feature_names_out()
cols2 = myCV2.get_feature_names_out()
cols3 = myCV3.get_feature_names_out()
print(cols1)
print(cols2)
print(cols3)

In [None]:
## Creating the dataframes
DF1 = pd.DataFrame(matrix_CV.toarray(), columns = cols1)
DF2 = pd.DataFrame(matrix_CV2.toarray(), columns = cols2)
DF3 = pd.DataFrame(matrix_CV3.toarray(), columns = cols3)
## Checking columns 
DF1.columns
DF2.columns
DF3.columns

In [None]:
## Remove columns containing numbers
def remove_cols(myStrings):
    return any(char.isdigit() for char in myStrings)
# Iterate over a copy of the column names because columns are dropped inside the loop
for next_col in list(DF1.columns):
    ## Also remove columns whose name is 2 characters or fewer
    if remove_cols(next_col) or len(str(next_col)) <= 2:
        print(next_col)
        DF1 = DF1.drop([next_col], axis = 1)
        # DF2 and DF3 may not share every column with DF1
        DF2 = DF2.drop([next_col], axis = 1, errors = 'ignore')
        DF3 = DF3.drop([next_col], axis = 1, errors = 'ignore')
 
print(DF1)
print(DF2) 
print(DF3) 

In [None]:
## Removing leftover hyphen tokens
DF1 = DF1.drop('-year-old', axis = 1)
DF2 = DF2.drop('-year-old', axis = 1)
DF2 = DF2.drop('-', axis = 1)
DF3 = DF3.drop('-year-old', axis = 1)
DF3 = DF3.drop('-', axis = 1)
print(DF1)
print(DF2)
print(DF3)

In [None]:
## Adding labels to the dfs
DF1.insert(loc = 0, column = 'GENRE', value = label_LIST)
DF2.insert(loc = 0, column = 'GENRE', value = label_LIST)
DF3.insert(loc = 0, column = 'GENRE', value = label_LIST)
print(DF1) # 18,603 rows, 998 unique words
print(DF2) # 18,603 rows, 999 unique words
print(DF3)
## Writing the new dfs to their own csv files
DF1.to_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\new data\df_CV.csv', index = False)
DF2.to_csv(r'C:\Users\cassh\OneDrive\Desktop\IST-736\Project\new data\df_TFID.csv', index = False)

In [None]:
## Splitting the data into training and testing sets
train1, test1 = train_test_split(DF1, test_size = 0.3, random_state = 42)
train2, test2 = train_test_split(DF2, test_size = 0.3, random_state = 42)
train3, test3 = train_test_split(DF3, test_size = 0.3, random_state = 42)
print(train1) # 13,022 rows
print(test1)  # 5,581 rows
print(train2)
print(test2)

In [None]:
## Saving and removing labels DF1
train1_LABEL = train1['GENRE']
train1_DATA = train1.drop(columns='GENRE')
test1_LABEL = test1['GENRE']
test1_DATA = test1.drop(columns='GENRE')
print(train1_LABEL)
print(train1_DATA)
print(test1_LABEL)
print(test1_DATA)

In [None]:
## Saving and removing labels DF2
train2_LABEL = train2['GENRE']
train2_DATA = train2.drop(columns='GENRE')
test2_LABEL = test2['GENRE']
test2_DATA = test2.drop(columns='GENRE')
print(train2_LABEL)
print(train2_DATA)
print(test2_LABEL)
print(test2_DATA)

In [None]:
## Saving and removing labels DF3
train3_LABEL = train3['GENRE']
train3_DATA = train3.drop(columns='GENRE')
test3_LABEL = test3['GENRE']
test3_DATA = test3.drop(columns='GENRE')
print(train3_LABEL)
print(train3_DATA)
print(test3_LABEL)
print(test3_DATA)

## Results

### Model 1: Multinomial Naive Bayes

In [None]:
### Fitting the Multinomial NB model
## Instantiate
NB = MultinomialNB()
## Model Fitting
NB_model1 = NB.fit(train1_DATA, train1_LABEL)
print('\nThe classes are:')
print(NB_model1.classes_)  # crime, horror, romance, sci-fi, sports
print('\nThe class counts are:')
print(NB_model1.class_count_)  # crime:3488, horror:3128, romance:5457, scifi:526, sports:423
print('\nThe feature log probabilities are:')
print(NB_model1.feature_log_prob_)

In [None]:
## Prediction
prediction1 = NB_model1.predict(test1_DATA)
print(np.round(NB_model1.predict_proba(test1_DATA), 2))
print('\nThe prediction from NB is:')
print(prediction1)
print('\nThe actual labels are:')
print(test1_LABEL)
print('\nThe value counts of the actual labels are:')
print(test1_LABEL.value_counts())  # crime:2303, horror:1537, romance:1330, scifi:219, sports:192

In [None]:
## Model Accuracy
accuracy1 = accuracy_score(test1_LABEL, prediction1)
print('Accuracy of MultinomialNB: {}%'.format(round(accuracy1 * 100, 2))) # accuracy of 66.91%
print(classification_report(test1_LABEL, prediction1))

In [None]:
## Confusion Matrix
cm_NB1 = confusion_matrix(test1_LABEL, prediction1)
print(cm_NB1)
classes = ['crime','horror','romance','sci-fi','sports'] 
sns.heatmap(cm_NB1, annot = True, fmt = 'd', cmap = 'flare', 
            xticklabels = classes, yticklabels = classes)
plt.title('Confusion Matrix for MultinomialNB (CV)')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

### Model 2: Bernoulli Naive Bayes

In [None]:
### Fitting the Bernoulli NB model (CV)
## Instantiate
BNB = BernoulliNB()
## Model Fitting
BNB_model2 = BNB.fit(train3_DATA, train3_LABEL)
print('\nThe classes are:')
print(BNB_model2.classes_)  # crime, horror, romance, sci-fi, sports
print('\nThe class counts are:')
print(BNB_model2.class_count_)  # crime:3488, horror:3128, romance:5457, scifi:526, sports:423
print('\nThe feature log probabilities are:')
print(BNB_model2.feature_log_prob_)

In [None]:
## Prediction
prediction4 = BNB_model2.predict(test3_DATA)
print(np.round(BNB_model2.predict_proba(test3_DATA), 2))
print('\nThe prediction from NB is:')
print(prediction4)
print('\nThe actual labels are:')
print(test3_LABEL)
print('\nThe value counts of the actual labels are:')
print(test3_LABEL.value_counts())  # crime:2303, horror:1537, romance:1330, scifi:219, sports:192

In [None]:
## Model Accuracy
accuracy4 = accuracy_score(test3_LABEL, prediction4)
print('Accuracy of BernoulliNB: {}%'.format(round(accuracy4 * 100, 2))) # accuracy of 66.87%
print(classification_report(test3_LABEL, prediction4))

In [None]:
## Confusion Matrix
cm_BNB2 = confusion_matrix(test3_LABEL, prediction4)
print(cm_BNB2)
classes = ['crime','horror','romance','sci-fi','sports'] 
sns.heatmap(cm_BNB2, annot = True, fmt = 'd', cmap = 'flare_r', 
            xticklabels = classes, yticklabels = classes)
plt.title('Confusion Matrix for BernoulliNB (CV)')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

When comparing the results of the two Naive Bayes models, they are nearly identical (66.91% accuracy for Multinomial and 66.87% for Bernoulli). Their similar accuracy scores may imply that the features are not highly correlated and that the genres were reasonably well separated by the movie description features. The low classification report scores for sci-fi and sports were most likely due to low support.
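
The link between low support and weak per-class scores can be seen by computing precision and recall directly from a confusion matrix. The sketch below uses a hypothetical 3-class matrix (the numbers are invented for illustration and are not the notebook's results); the function name `per_class_scores` is also an assumption.

```python
# Hypothetical confusion matrix (rows = true labels, columns = predicted labels).
# These numbers are illustrative only and are NOT the results of this notebook.
labels = ["crime", "romance", "sports"]
cm = [
    [80, 18, 2],
    [15, 84, 1],
    [6, 9, 5],
]

def per_class_scores(cm):
    """Return (precision, recall, support) for each class of a square confusion matrix."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # true instances of class k
        predicted = sum(cm[r][k] for r in range(n))  # times class k was predicted
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        scores.append((precision, recall, support))
    return scores

scores = per_class_scores(cm)
for name, (p, r, s) in zip(labels, scores):
    print(f"{name:8s} precision={p:.2f} recall={r:.2f} support={s}")
```

In this toy matrix the small class is mostly absorbed by the two large classes, so its recall is low even though overall accuracy remains fairly high.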

### Model 3: Support Vector Machine (SVM)

In [None]:
# Define kernels and costs
kernels = ['linear', 'rbf', 'poly']
costs = [0.1, 1, 10]

# Train models
results = []
for kernel in kernels:
    for cost in costs:
        model = SVC(kernel=kernel, C=cost)
        model.fit(train2_DATA, train2_LABEL)
        predictions = model.predict(test2_DATA)
        accuracy = accuracy_score(test2_LABEL, predictions)
        results.append((kernel, cost, accuracy))
        print(f'Kernel: {kernel}, Cost: {cost}, Accuracy: {accuracy}')
        
# Results
for result in results:
    print(f'Kernel: {result[0]}, Cost: {result[1]}, Accuracy: {result[2]}')

The linear kernel, while simplistic, was expected to offer the most computationally efficient model. In contrast, the RBF and polynomial kernels were hypothesized to yield higher accuracy scores after precise parameter tuning.

In [None]:
# Confusion Matrix    
for kernel in kernels:
    for cost in costs:
        model = SVC(kernel=kernel, C=cost)
        model.fit(train2_DATA, train2_LABEL)
        predictions = model.predict(test2_DATA)
        
        # Calculate accuracy (already stored in results by the previous cell)
        accuracy = accuracy_score(test2_LABEL, predictions)
        
        # Generate confusion matrix from the true and predicted labels
        cm = confusion_matrix(test2_LABEL, predictions, labels=model.classes_)
        
        # Plot confusion matrix
        plt.figure(figsize=(10,7))
        sns.heatmap(cm, annot=True, fmt='d', xticklabels=model.classes_, yticklabels=model.classes_)
        plt.title(f'Confusion Matrix for Kernel: {kernel}, Cost: {cost}')
        plt.xlabel('Predicted')
        plt.ylabel('True')
        plt.show()
        
        print(f'Kernel: {kernel}, Cost: {cost}, Accuracy: {accuracy}')

Within the linear kernel, higher values of the cost parameter penalize misclassified points more heavily, which narrows the margin around the decision boundary. However, the recorded improvements in accuracy were relatively small across different cost values. This suggests that the linear kernel may have reached its performance limit on this dataset. 
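
For reference, the role of the cost parameter can be read off the standard soft-margin SVM objective. This is the textbook formulation for the two-class case, not something specific to this notebook; $\phi$ denotes the feature map implied by the kernel and $\xi_i$ the slack allowed for observation $i$.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\qquad\text{subject to}\qquad
y_i\,\bigl(w^{\top}\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0 .
```

A larger $C$ makes each unit of slack more expensive, so the optimizer accepts a smaller margin in exchange for fewer training errors; a smaller $C$ favors a wider margin and tolerates more misclassified training points.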

The RBF kernel's accuracy score improved when moving from a low cost value of 0.1 to a higher cost value of 1. This indicates that a moderate cost value was potentially more suitable for this kernel on this dataset. However, increasing the cost parameter further to 10 resulted in a decrease in accuracy. This could indicate that the model fit the training data too closely, leading to overfitting and reduced generalization.

The polynomial kernel displayed lower accuracy than the other kernels across all cost values. Increasing the cost value from 0.1 to 1 led to a significant improvement in accuracy, suggesting that a moderate cost value was more appropriate for the polynomial kernel. 

Ultimately, increasing the cost parameter to 10 resulted in a decrease in accuracy, similar to the behavior observed with the RBF kernel. This indicates that the decision boundary may have been too complex, which led to overfitting.

### Model 4: K-Means Clustering

In [None]:
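# Cluster on the word features only; the GENRE label column is not numeric
X = DF1.drop(columns='GENRE', errors='ignore')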
inertia = []
k_values = range(1, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot the Elbow curve
plt.plot(k_values, inertia, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

In [None]:
# Remove the genre labels so that only the word features are clustered
DF1 = DF1.drop('GENRE', axis=1)

# Instantiate and fit the KMeans model (k = 3)
My_KMean = KMeans(n_clusters=3, random_state=42)
My_KMean.fit(DF1)

# Predict clusters
My_labels = My_KMean.predict(DF1)
print(My_labels)

# KMeans (k = 4) on L2-normalized rows
DF1_norm = preprocessing.normalize(DF1)
My_KMean2 = KMeans(n_clusters=4, random_state=42).fit(DF1_norm)
My_labels2 = My_KMean2.predict(DF1_norm)
print(My_labels2)

# Silhouette score for k = 3
My_KMean3 = KMeans(n_clusters=3, random_state=42)
My_KMean3.fit(DF1)
My_labels3 = My_KMean3.predict(DF1)
print("Silhouette Score for k = 3 \n", silhouette_score(DF1, My_labels3))

In [None]:
# Add the cluster labels to the original DataFrame
DF1['Cluster_KMeans3'] = My_labels3

# Perform PCA on the word features only (not the cluster labels) to plot in 2D
word_features = DF1.drop(columns=['Cluster_KMeans3'])
pca = PCA(n_components=2)
DF1_pca = pca.fit_transform(word_features)

# Plot the data points with color-coded clusters
plt.figure(figsize=(12, 6))
# Plot the KMeans (k=3) clusters
plt.subplot(1, 2, 1)
sns.scatterplot(x=DF1_pca[:, 0], y=DF1_pca[:, 1], hue=DF1['Cluster_KMeans3'], palette='viridis', legend='full')
plt.title('KMeans (k=3) Clustering')

# Plot the silhouette scores
plt.subplot(1, 2, 2)
sns.barplot(x=[silhouette_score(word_features, My_labels3)], y=['KMeans (k=3)'], palette='viridis')
plt.xlabel('Silhouette Score')
plt.title('Silhouette Score for KMeans (k=3)')
plt.tight_layout()
plt.show()

It appears that clusters 0 and 1 have many similarities, with a few outliers. However, cluster 2 is spread out and does not share as many similarities amongst the movie genres. If a viewer watched a movie from cluster 0 and wanted another movie from the same cluster, the recommendation would likely be a good match. However, given the disparity within cluster 2, it may be difficult to provide an accurate recommendation from that cluster. 

The silhouette score of 0.16 further suggests that while the clusters are distinct enough to be separate, the separation is not particularly strong. This means that either the descriptions contain ambiguous terms or the genres are not distinctly different in the features used for clustering.
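
To show what a silhouette score measures, the sketch below computes it by hand for two toy sets of 2-D points. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the nearest other cluster; the point's score is `(b - a) / max(a, b)`. The function name and the toy points are assumptions made for illustration, and the notebook itself relies on scikit-learn's `silhouette_score`.

```python
import math

def silhouette_score_simple(points, labels):
    """Mean silhouette coefficient with Euclidean distance (toy implementation)."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        # a: mean distance to the other members of the point's own cluster
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / sum(1 for lab in labels if lab == c)
            for c in clusters if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)

labels = [0, 0, 1, 1]
well_separated = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
close_together = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

score_separated = silhouette_score_simple(well_separated, labels)
score_close = silhouette_score_simple(close_together, labels)
print(round(score_separated, 2), round(score_close, 2))
```

The well-separated points score about 0.90, while the points that sit next to each other score about 0.17, which is the range in which clusters exist but are only weakly separated.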

In [None]:
# Cluster on the word features only, excluding any cluster-label columns added earlier
word_features = DF1.drop(columns=['Cluster_KMeans3', 'Cluster_KMeans4'], errors='ignore')
kmeans4 = KMeans(n_clusters=4, random_state=42)
kmeans4.fit(word_features)
my_labels4 = kmeans4.predict(word_features)

DF1['Cluster_KMeans4'] = my_labels4

# Perform PCA for dimensionality reduction to plot in 2D
pca = PCA(n_components=2)
DF1_pca = pca.fit_transform(word_features)

# Plot the data points with color-coded clusters
plt.figure(figsize=(14, 6))
# Plot the KMeans (k=4) clusters
plt.subplot(1, 2, 1)
sns.scatterplot(x=DF1_pca[:, 0], y=DF1_pca[:, 1], hue=DF1['Cluster_KMeans4'], palette='viridis', legend='full')
plt.title('KMeans (k=4) Clustering')

# Calculate the silhouette score for k=4 and plot
silhouette_avg4 = silhouette_score(word_features, my_labels4)
plt.subplot(1, 2, 2)
sns.barplot(x=['KMeans (k=4)'], y=[silhouette_avg4], palette='viridis')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for KMeans (k=4)')
plt.tight_layout()
plt.show()

This model used 4 clusters because the elbow curve appears to plateau between 3 and 4. The clusters improve because the new cluster absorbs some of the data points that were previously assigned to other clusters. Cluster 1 has some overlap with cluster 3, but is otherwise more defined in comparison to the previous plot. Cluster 3 has a relatively wider spread, indicating that its movies may not share as many similarities. Clusters 0 and 2 contain some outliers, but otherwise have a more defined shape, which indicates some similarities in the words found in the descriptions. 

The silhouette score for 4 clusters improved to 0.35. This score suggests that the clusters are reasonably well separated and that the points within each cluster are relatively close to each other. 

### Model 5: Latent Dirichlet Allocation (LDA)

In [None]:
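# max_features mirrors the 1,000-word cap of the earlier vectorizers (an assumption)
MyVectLDA_DH = CountVectorizer(input='content', stop_words="english", max_features=1000)

# Vectorize the description text; passing a DataFrame here would iterate over its column names
Vect_DH = MyVectLDA_DH.fit_transform(description_LIST)
ColumnNamesLDA_DH = MyVectLDA_DH.get_feature_names_out()
CorpusDF_DH = pd.DataFrame(Vect_DH.toarray(), columns=ColumnNamesLDA_DH)
print(CorpusDF_DH)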

lda_model_DH = LatentDirichletAllocation(n_components=5, max_iter=100, learning_method='online')
LDA_DH_Model = lda_model_DH.fit_transform(Vect_DH)

print("SIZE: ", LDA_DH_Model.shape)  # (NO_DOCUMENTS, NO_TOPICS)

# Let's see how the first document in the corpus looks like in
## different topic spaces
print(LDA_DH_Model[0])
print(LDA_DH_Model[6])
print("List of prob: ")
print(LDA_DH_Model)

In [None]:
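exclude_word = 'cluster_kmeans3'  

# Plot settings (num_top_words and fontsize_base are assumed values)
num_topics = lda_model_DH.n_components
num_top_words = 10
fontsize_base = 12
# Vocabulary and word-by-topic weights taken from the fitted vectorizer and LDA model
vocab_array = np.asarray(MyVectLDA_DH.get_feature_names_out())
word_topic = lda_model_DH.components_.T  # shape: (n_words, n_topics)

plt.figure(figsize=(num_topics * 5, 8))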

for t in range(num_topics):
    plt.subplot(1, num_topics, t + 1)
    plt.ylim(0, num_top_words + 0.5)
    plt.xticks([])
    plt.yticks([])
    plt.title('Topic #{}'.format(t))
    top_words_idx = np.argsort(word_topic[:,t])[::-1]  # descending order
    top_words_idx = top_words_idx[:num_top_words]
    top_words = vocab_array[top_words_idx]
    top_words_shares = word_topic[top_words_idx, t]

    word_count = 0
    for i, (word, share) in enumerate(zip(top_words, top_words_shares)):
        if word != exclude_word:  # Only plot the word if it is not the excluded word
            plt.text(0.3, num_top_words-word_count-0.5, word, fontsize=fontsize_base)
            word_count += 1
        if word_count >= num_top_words:
            break

plt.tight_layout()
plt.show()

In [None]:
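# scikit-learn models use pyLDAvis.lda_model (pyLDAvis >= 3.4; older releases name it pyLDAvis.sklearn)
import pyLDAvis.lda_model

pyLDAvis.enable_notebook()
# prepare() takes the fitted LDA estimator, the document-term matrix, and the fitted vectorizer
panel = pyLDAvis.lda_model.prepare(lda_model_DH, Vect_DH, MyVectLDA_DH, mds='tsne')
pyLDAvis.display(panel)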

The LDA model suggests that it is difficult to identify a clear genre for any of the topics. Topic 0 could be crime films. Words like suspect, murdered, and survive would align best with movies containing criminal elements. Topic 2 could be related to horror movies. Words such as thriller, violent, woods, and unexpected indicate something fearful, which is the basis of the horror genre. Topic 3 could potentially be about sports movies, with words like team, race, and time. Topic 4 seems best aligned to sci-fi with zombie, vampire, and tale. Topic 1 does not appear to have a clear genre. However, based on words like woman, tragic, and teenager, the topic could pertain to romance movies. 

While LDA is effective at providing a general idea of the top features per topic, it was unfortunately not able to provide definitive genre classifications. This is likely due to the overlap between the text descriptions and the fact that movies can belong to multiple distinct genres. Therefore, LDA may not be the most suitable model to use for a recommendation system. 