<a href="https://colab.research.google.com/github/ferrefab/ml2project/blob/main/multimodalMovieClassification_ferrefab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Project Title: Multimodal Movie Genre Classification**

Project Goal: To classify movies into genres using both plot summaries and poster images.

Motivation and Relevancy: Accurate genre classification can improve movie recommendation systems. The project is relevant because it evaluates three different models on the same problem. The goal is to find out whether a text model or an image model classifies genres better for this specific problem, or whether a combination of the two results in an even more accurate model.

Data Collection:
- Kaggle dataset for IMDb movie IDs
- get movie data from IMDb
- get poster images from TMDB

Model:
- pre-trained text model (BERT) for genre classification based on plot summaries
- pre-trained image model (VGG) for genre classification based on poster images
- combine the two models to create one singular, hopefully more powerful model, to classify movie genres

Validation:
- Evaluate all models using accuracy and F1 score
- Compare the multimodal model with text-only and image-only models
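Because each movie can carry several genres, the F1 score in the validation plan has to be averaged over labels. The following is a minimal, dependency-free sketch of a micro-averaged F1 on 0/1 indicator rows. The helper names `micro_f1` and `binarize`, the 0.5 threshold, and the toy data are illustrative assumptions, not part of the notebook; scikit-learn's `f1_score(y_true, y_pred, average='micro')` computes the same quantity.

```python
def binarize(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 genre indicators (threshold is an assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives over all (movie, genre) cells."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: 2 movies, 3 genres (hypothetical values)
y_true = [[1, 0, 1], [0, 1, 0]]
y_prob = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.6]]
y_pred = binarize(y_prob)
score = micro_f1(y_true, y_pred)
print(y_pred, score)
```

Micro-averaging weights frequent genres more heavily; a macro average (the mean of per-genre F1 scores) would treat rare genres such as Film-Noir equally.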

In [1]:
# install required libraries
!pip install tensorflow torch transformers scikit-learn opencv-python imdbpy



First, I downloaded a Kaggle dataset listing 5000 movies from IMDb. The goal was to get a list of unique movie IDs and then scrape these movies individually from IMDb to get the title, the full plot, and the genres of each movie.

The link to the dataset: https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset?resource=download

In the repository, the Kaggle dataset is the file "imdb_movies_kaggle_dataset.csv".

In [2]:
#extract unique movie_ids into an array
import pandas as pd

url =  'https://raw.githubusercontent.com/ferrefab/ml2project/main/data/imdb_movies_kaggle_dataset.csv'
df = pd.read_csv(url)
#df = pd.read_csv('imdb_movies_kaggle_dataset.csv')
movie_ids = df['movie_id'].unique()

movie_ids

array([ 499549,  449088, 2379713, ..., 2107644, 2070597,  378407])

Scrape the title, full plot, and genres (the target) from IMDb.

In [None]:
from imdb import IMDb
import requests
import os
import pandas as pd

imdb = IMDb()

def fetch_movie_details(movie_id):
    movie = imdb.get_movie(movie_id)
    title = movie.get('title')
    plot = movie.get('plot outline')
    genres = movie.get('genres')
    return title, plot, genres

movies_data = []
for movie_id in movie_ids:
    try:
        title, plot, genres = fetch_movie_details(movie_id)
        movies_data.append({'title': title, 'plot': plot, 'genres': genres})
    except Exception as e:
        print(f"Error fetching movie {movie_id}: {e}")

movies_df = pd.DataFrame(movies_data) #File generated from this code cell (check folder "data")
movies_df.to_csv('movies_data.csv', index=False)
movies_df.head()


2024-05-30 22:34:58,664 CRITICAL [imdbpy] /usr/local/lib/python3.10/dist-packages/imdb/_exceptions.py:32: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/title/tt0277296/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': RemoteDisconnected('Remote end closed connection without response')},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/imdb/parser/http/__init__.py", line 233, in retrieve_unicode
    response = uopener.open(url)
  File "/usr/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(

Error fetching movie 277296: {'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/title/tt0277296/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': RemoteDisconnected('Remote end closed connection without response')}
Error fetching movie 349710: {'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/title/tt0349710/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': RemoteDisconnected('Remote end closed connection without response')}
Error fetching movie 790724: {'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/title/tt0790724/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(ConnectionResetError(104, 'Connection reset by peer'))}


2024-05-31 02:30:10,511 CRITICAL [imdbpy] /usr/local/lib/python3.10/dist-packages/imdb/_exceptions.py:32: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/title/tt0092325/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': TimeoutError('The read operation timed out')},); kwds: {}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/imdb/parser/http/__init__.py", line 233, in retrieve_unicode
    response = uopener.open(url)
  File "/usr/lib/python3.10/urllib/request.py", line 519, in open
    response = self._open(req, data)
  File "/usr/lib/python3.10/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnec

Error fetching movie 92325: {'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/title/tt0092325/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': TimeoutError('The read operation timed out')}


Unnamed: 0,title,plot,genres
0,Avatar,"When his brother is killed in a robbery, parap...","[Action, Adventure, Fantasy, Sci-Fi]"
1,Pirates of the Caribbean: At World's End,After losing Jack Sparrow to the locker of Dav...,"[Action, Adventure, Fantasy]"
2,Spectre,A cryptic message from the past sends James Bo...,"[Action, Adventure, Thriller]"
3,The Dark Knight Rises,Despite his tarnished reputation after the eve...,"[Action, Drama, Thriller]"
4,Star Wars: Episode VII - The Force Awakens,,"[Documentary, Short]"


After scraping, the data gathered from IMDb was stored in the file "movies_data.csv". I was able to gather data from {number} different movies. However, some data was missing: for some movies, either the plot or the genres had not been scraped properly from the website. I filled in some of these values in the CSV file, either manually or using ChatGPT 4o, to get the best possible dataset with more plots and genres to work with. Rows that still lacked either a plot or a genre were deleted.
The complete file is saved as **"movies_data_updated.csv"** and is the file loaded in the next code cell.
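The deletion of incomplete rows described above was done by hand, but the same rule can be sketched in pandas. This is only an illustration under assumptions: the helper `drop_incomplete` and the toy rows are hypothetical, and whitespace-only strings are treated as missing, which the notebook does not state.

```python
import pandas as pd


def drop_incomplete(df, required=("plot", "genres")):
    """Keep only rows where every required column holds a non-blank value."""
    cleaned = df.copy()
    for col in required:
        # NaN/None and whitespace-only strings both count as missing here.
        blank = cleaned[col].fillna("").astype(str).str.strip() == ""
        cleaned = cleaned[~blank]
    return cleaned.reset_index(drop=True)


# Hypothetical rows standing in for movies_data.csv
raw = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "plot": ["Some plot", None, "Another plot", "   "],
    "genres": ["['Drama']", "['Action']", None, "['Comedy']"],
})
cleaned = drop_incomplete(raw)
print(cleaned["title"].tolist())
```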

In [4]:
#Preprocess the plot text before feeding it to the model: remove special characters and digits, lowercase everything, tokenize, and remove stop words
import re
import nltk
import pandas as pd

nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    if not isinstance(text, str):
        return ""
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\d', ' ', text)
    text = text.lower()
    words = word_tokenize(text)
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

url = 'https://raw.githubusercontent.com/ferrefab/ml2project/main/data/movies_data_updated.csv'
# movies_df = pd.read_csv('movies_data_updated.csv')
movies_df = pd.read_csv(url)
movies_df['processed_plot'] = movies_df['plot'].apply(preprocess_text)
movies_df.to_csv('movies_data_processed.csv', index=False) #File generated from this code cell (check folder "data")
movies_df.head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,title,plot,genres,processed_plot
0,Avatar,"When his brother is killed in a robbery, parap...","['Action', 'Adventure', 'Fantasy', 'Sci-Fi']",brother killed robbery paraplegic marine jake ...
1,Pirates of the Caribbean: At World's End,After losing Jack Sparrow to the locker of Dav...,"['Action', 'Adventure', 'Fantasy']",losing jack sparrow locker davy jones team tur...
2,Spectre,A cryptic message from the past sends James Bo...,"['Action', 'Adventure', 'Thriller']",cryptic message past sends james bond rogue mi...
3,The Dark Knight Rises,Despite his tarnished reputation after the eve...,"['Action', 'Drama', 'Thriller']",despite tarnished reputation events dark knigh...
4,John Carter,"John Carter, a Civil War veteran, who in 1868 ...","['Action', 'Adventure', 'Sci-Fi']",john carter civil war veteran trying live norm...


This step again produces a new file, now containing the processed data. The next step loads this complete, processed file, "movies_data_processed.csv", to train a classifier on top of the pre-trained BERT model.
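In the CSV file, the genres of each movie are stored as the string form of a Python list, so they have to be parsed before they can be encoded as labels. The sketch below parses them with `ast.literal_eval`, which only accepts literals and is therefore a safer alternative to `eval`, and then builds a multi-hot matrix by hand. The toy strings are illustrative; scikit-learn's `MultiLabelBinarizer`, which the notebook uses, produces an equivalent indicator matrix with alphabetically sorted classes.

```python
import ast

# Hypothetical values in the same format as the 'genres' column
genre_strings = ["['Action', 'Adventure']", "['Drama']", "['Action', 'Drama']"]

# literal_eval parses the list literal without executing arbitrary code
genre_lists = [ast.literal_eval(s) for s in genre_strings]

# Sorted class names define the column order of the label matrix
classes = sorted({genre for genres in genre_lists for genre in genres})

# One row per movie, one 0/1 column per genre
multi_hot = [[1 if c in genres else 0 for c in classes] for genres in genre_lists]

print(classes)
print(multi_hot)
```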

**Step 1: Build and Train pre-trained BERT Model**



In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

url =  'https://raw.githubusercontent.com/ferrefab/ml2project/main/data/movies_data_processed.csv'
data = pd.read_csv(url)
#data = pd.read_csv('movies_data_processed.csv')

plots = data['processed_plot'].values
genres = data['genres'].apply(eval).values  #Convert string representation of list to actual list

#MultiLabelBinarizer to one-hot encode the genre labels
mlb = MultiLabelBinarizer()
genre_labels = mlb.fit_transform(genres)

genre_classes = mlb.classes_
print("Genre classes:", genre_classes)

#Load Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

max_length = 512
input_ids = []
attention_masks = []

for plot in plots:
    encoded_dict = tokenizer.encode_plus(
        plot,
        add_special_tokens=True,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='np',
    )

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

input_ids = np.squeeze(np.array(input_ids))
attention_masks = np.squeeze(np.array(attention_masks))
labels = np.array(genre_labels)

#Split data
#A single call splits ids, masks and labels with the same shuffle, so they stay aligned
train_inputs, validation_inputs, train_masks, validation_masks, train_labels, validation_labels = train_test_split(
    input_ids, attention_masks, labels, test_size=0.1, random_state=42)

#Load BERT model
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

#Freeze BERT model layers to reduce computing time
for layer in bert_model.layers:
    layer.trainable = False

#Create model
input_ids_in = tf.keras.Input(shape=(max_length,), dtype='int32')
attention_masks_in = tf.keras.Input(shape=(max_length,), dtype='int32')
bert_output = bert_model([input_ids_in, attention_masks_in])[1]
dropout = tf.keras.layers.Dropout(0.3)(bert_output)
output = tf.keras.layers.Dense(len(genre_classes), activation='sigmoid')(dropout)
model = tf.keras.Model(inputs=[input_ids_in, attention_masks_in], outputs=output)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5), loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

history = model.fit([train_inputs, train_masks], train_labels, validation_data=([validation_inputs, validation_masks], validation_labels), epochs=50, batch_size=32)

model.save('bert_movie_classification_model.h5')


Genre classes: ['Action' 'Adventure' 'Animation' 'Biography' 'Comedy' 'Crime'
 'Documentary' 'Drama' 'Family' 'Fantasy' 'Film-Noir' 'History' 'Horror'
 'Music' 'Musical' 'Mystery' 'News' 'Reality-TV' 'Romance' 'Sci-Fi'
 'Short' 'Sport' 'Thriller' 'War' 'Western']


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 512)]                0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, 512)]                0         []                            
                                                                                                  
 tf_bert_model (TFBertModel  TFBaseModelOutputWithPooli   1094822   ['input_1[0][0]',             
 )                           ngAndCrossAttentions(last_   40         'input_2[0][0]']             
                             hidden_state=(None, 512, 7                                           
                             68),                                                             



In the next step we use the same movies as for the BERT model and download their posters as images from TMDB. Afterwards, we build a VGG image classification model to classify the movies and see whether it performs better than the BERT model.


**DISCLAIMER:** The reviewer has to insert the provided API key in this part of the code. Generating the folder might take a while, but the folder is essential to train the model. The same folder with the data is available for demonstration through the OneDrive link in the README file.
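Instead of pasting the key into the code, it could be read from an environment variable, and the search URL could be built with proper URL encoding so that titles containing spaces or `&` do not break the query. This is a sketch under assumptions: the environment variable name `TMDB_API_KEY` and the helper `build_search_url` are hypothetical, and no request is sent here. The endpoint is the one already used in the notebook.

```python
import os
from urllib.parse import urlencode

# Same search endpoint as in the notebook cell
TMDB_SEARCH_ENDPOINT = "https://api.themoviedb.org/3/search/movie"


def build_search_url(title, api_key):
    """Return the search URL with the title and key safely URL-encoded."""
    query = urlencode({"api_key": api_key, "query": title})
    return f"{TMDB_SEARCH_ENDPOINT}?{query}"


# Environment variable name is an assumption; falls back to the placeholder string
api_key = os.environ.get("TMDB_API_KEY", "TMDB_API_KEY")

# Demo with a dummy key so nothing secret is printed
url = build_search_url("Fast & Furious 6", api_key="demo-key")
print(url)
```

With `requests`, the same encoding happens automatically when the values are passed through the `params` argument of `requests.get`.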

In [None]:
import os
import requests
import pandas as pd

image_dir = 'movie_posters'
os.makedirs(image_dir, exist_ok=True)

def fetch_poster_url(title):
    api_key = 'TMDB_API_KEY'  # Insert API Key here
    search_url = 'https://api.themoviedb.org/3/search/movie'
    # Pass the query via params so titles with spaces, '&' or '#' are URL-encoded
    response = requests.get(search_url, params={'api_key': api_key, 'query': title})
    if response.status_code == 200:
        data = response.json()
        if data['results']:
            poster_path = data['results'][0]['poster_path']
            # Some TMDB entries have no poster (poster_path is None)
            if poster_path:
                return f'https://image.tmdb.org/t/p/w500{poster_path}'
    return None

url =  'https://raw.githubusercontent.com/ferrefab/ml2project/main/data/movies_data.csv'
movies_df = pd.read_csv(url)
#movies_df = pd.read_csv('movies_data.csv')
movies_df = movies_df.drop(columns=['plot'])

#Download images and add poster URLs to the DataFrame
for index, row in movies_df.iterrows():
    poster_url = fetch_poster_url(row['title'])
    if poster_url:
        response = requests.get(poster_url)
        if response.status_code == 200:
            image_path = os.path.join(image_dir, f"{row['title'].replace('/', '_')}.jpg")
            with open(image_path, 'wb') as f:
                f.write(response.content)
            movies_df.at[index, 'poster_url'] = image_path

#new DataFrame with movie title, image path, and genres
movies_df = movies_df[['title', 'poster_url', 'genres']]
movies_df.head()
movies_df.to_csv('movies_posters.csv', index=False)


In [None]:
movies_df.head()

Unnamed: 0,title,poster_url,genres
0,Avatar,movie_posters/Avatar.jpg,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi']"
1,Pirates of the Caribbean: At World's End,movie_posters/Pirates of the Caribbean: At Wor...,"['Action', 'Adventure', 'Fantasy']"
2,Spectre,movie_posters/Spectre.jpg,"['Action', 'Adventure', 'Thriller']"
3,The Dark Knight Rises,movie_posters/The Dark Knight Rises.jpg,"['Action', 'Drama', 'Thriller']"
4,Star Wars: Episode VII - The Force Awakens,movie_posters/Star Wars: Episode VII - The For...,"['Documentary', 'Short']"


In [None]:
#Download movies posters to a zip folder
#import shutil
#from google.colab import files

#folder = 'movie_posters'
#zip = 'movie_posters.zip'

#shutil.make_archive(zip.replace('.zip', ''), 'zip', folder)

#files.download(zip)


After this step I again edited the file movies_posters manually, adding missing values where I could and deleting movies that did not have a poster, which produced the file movies_posters_updated.csv.
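Finding the movies without a poster can also be automated by checking which image files actually exist on disk. The following sketch is an illustration only: the helper `rows_with_poster` is hypothetical, rows are plain dictionaries rather than a DataFrame, and an empty file is treated as a missing poster, which is an assumption.

```python
import os
import tempfile


def rows_with_poster(rows, path_key="poster_url"):
    """Keep only rows whose poster path points to a non-empty file on disk."""
    kept = []
    for row in rows:
        path = row.get(path_key)
        if isinstance(path, str) and os.path.isfile(path) and os.path.getsize(path) > 0:
            kept.append(row)
    return kept


# Demo in a temporary folder with one real file and two missing posters
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "Avatar.jpg")
    with open(existing, "wb") as f:
        f.write(b"\xff\xd8\xff")  # a few placeholder bytes, not a valid image
    rows = [
        {"title": "Avatar", "poster_url": existing},
        {"title": "Missing", "poster_url": os.path.join(tmp, "Missing.jpg")},
        {"title": "No path", "poster_url": None},
    ]
    kept_titles = [row["title"] for row in rows_with_poster(rows)]

print(kept_titles)
```

On a DataFrame, the same check could be applied as a boolean mask over the `poster_url` column.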


**Step 2: Build and Train pre-trained VGG Image Classification model**

In [None]:
import os
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

#Load the CSV file
url =  'https://raw.githubusercontent.com/ferrefab/ml2project/main/data/movies_posters_updated.csv'
movies_df = pd.read_csv(url)  #must be named movies_df, which the rest of this cell uses
#movies_df = pd.read_csv('movies_posters_updated.csv')

#MultiLabelBinarizer to one-hot encode the genre labels
mlb = MultiLabelBinarizer()
movies_df['genres'] = movies_df['genres'].apply(eval)
genre_labels = mlb.fit_transform(movies_df['genres'])

#Preprocess images
def preprocess_image(img_path, target_size=(224, 224)):
    img = load_img(img_path, target_size=target_size)
    img_array = img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    img_array = preprocess_input(img_array)  # Preprocess for VGG
    return img_array

image_dir = 'movie_posters'
images = []
for img_path in movies_df['poster_url']:
    img_array = preprocess_image(os.path.join(image_dir, img_path))
    images.append(img_array)

images = np.vstack(images)

#prepare & split data
labels = np.array(genre_labels)
train_images, val_images, train_labels, val_labels = train_test_split(images, labels, test_size=0.1, random_state=42)

#Load VGG16 model
vgg = VGG16(include_top=False, weights='imagenet', input_shape=(224, 224, 3))

#Freeze the layers except last 4 to reduce computing time
for layer in vgg.layers[:-4]:
    layer.trainable = False

#Create model
x = Flatten()(vgg.output)
x = Dense(1024, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
output = Dense(len(mlb.classes_), activation='sigmoid')(x)
model = Model(inputs=vgg.input, outputs=output)

model.compile(optimizer=Adam(learning_rate=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(train_images, train_labels, validation_data=(val_images, val_labels), epochs=50, batch_size=32)
model.save('vgg_movie_classification_model.h5')


Epoch 1/50 through Epoch 50/50 (per-epoch metrics not included in the output)




**Step 3: Concatenate models and bring it all together**

In this final step I again prepare the data for the BERT and VGG models individually and then concatenate the two to create one single model that hopefully performs better than the two individual ones.
Here I had to manually reduce the number of data points in the poster data so that the two CSV files with the plots and posters, loaded for the BERT and VGG models, contained the same number of movies. Otherwise the concatenation wouldn't work.
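Matching the two files by hand works, but the row order also has to agree so that each plot is paired with its own poster. A possible way to sketch this alignment is an inner merge on the title. The toy frames below are hypothetical, and joining on `title` assumes titles are unique in both files, which may not hold for remakes; `validate="one_to_one"` makes pandas raise an error if that assumption is violated.

```python
import pandas as pd

# Hypothetical stand-ins for the plot file and the poster file
plots_df = pd.DataFrame({
    "title": ["Avatar", "Spectre", "John Carter"],
    "processed_plot": ["plot a", "plot b", "plot c"],
    "genres": ["['Action']", "['Action', 'Thriller']", "['Sci-Fi']"],
})
posters_df = pd.DataFrame({
    "title": ["Spectre", "Avatar"],
    "poster_url": ["Spectre.jpg", "Avatar.jpg"],
})

# Inner merge keeps only movies present in both files, in the order of plots_df
aligned = plots_df.merge(posters_df, on="title", how="inner", validate="one_to_one")

print(aligned[["title", "poster_url"]])
```

Text inputs, images, and labels can then all be derived from the rows of `aligned`, so one split keeps the modalities paired.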

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

#Load data
url =  'https://raw.githubusercontent.com/ferrefab/ml2project/main/data/movies_data_processed.csv'
data = pd.read_csv(url)

plots = data['processed_plot'].values
genres = data['genres'].apply(eval).values  # Convert string representation of list to actual list

#MultiLabelBinarizer to one-hot encode the genre labels
mlb = MultiLabelBinarizer()
genre_labels = mlb.fit_transform(genres)
genre_classes = mlb.classes_
print("Genre classes:", genre_classes)

#Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

#Tokenize plot summaries
max_length = 512
input_ids = []
attention_masks = []

for plot in plots:
    encoded_dict = tokenizer.encode_plus(
        plot,
        add_special_tokens=True,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='np',
    )

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

input_ids = np.squeeze(np.array(input_ids))
attention_masks = np.squeeze(np.array(attention_masks))
labels = np.array(genre_labels)

# A single call splits ids, masks and labels with the same shuffle, so they stay aligned
train_input_ids, val_input_ids, train_attention_masks, val_attention_masks, train_labels, val_labels = train_test_split(
    input_ids, attention_masks, labels, test_size=0.1, random_state=42)


Genre classes: ['Action' 'Adventure' 'Animation' 'Biography' 'Comedy' 'Crime'
 'Documentary' 'Drama' 'Family' 'Fantasy' 'Film-Noir' 'History' 'Horror'
 'Music' 'Musical' 'Mystery' 'News' 'Reality-TV' 'Romance' 'Sci-Fi'
 'Short' 'Sport' 'Thriller' 'War' 'Western']




In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Load the movie data with poster URLs
url = 'https://raw.githubusercontent.com/ferrefab/ml2project/main/data/movies_posters_multimodal.csv'
movies_df = pd.read_csv(url)
#movies_df = pd.read_csv('movies_posters_multimodal.csv')

# Convert genres from string to list
movies_df['genres'] = movies_df['genres'].apply(eval)

# Use MultiLabelBinarizer to one-hot encode the genre labels
mlb = MultiLabelBinarizer()
genre_labels = mlb.fit_transform(movies_df['genres'])

# Preprocess images
def preprocess_image(img_path, target_size=(224, 224)):
    img = load_img(img_path, target_size=target_size)
    img_array = img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    img_array = img_array / 255.0  # Normalize to [0, 1]
    return img_array

image_dir = 'movie_posters'
images = []
for img_path in movies_df['poster_url']:
    img_array = preprocess_image(os.path.join(image_dir, img_path))
    images.append(img_array)

images = np.vstack(images)

# Convert lists to arrays
labels = np.array(genre_labels)

# Ensure the number of images and labels match
assert len(images) == len(labels), "Number of images and labels must match"

# Split the data into training and validation sets
train_images, val_images, train_labels, val_labels = train_test_split(images, labels, test_size=0.1, random_state=42)


In [None]:
#this exact transformers version was needed to make sure the models could be combined
!pip install transformers==4.37.2




In [None]:
import os
import pandas as pd
import numpy as np
import transformers
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.layers import Input, Dense, Dropout, concatenate, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.applications import VGG16
from tensorflow.keras.optimizers import Adam
from transformers import TFBertModel, BertTokenizer
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# input layers for BERT
bert_input_ids = Input(shape=(512,), dtype=tf.int32, name='input_ids')
bert_attention_masks = Input(shape=(512,), dtype=tf.int32, name='attention_masks')
bert_token_type_ids = Input(shape=(512,), dtype=tf.int32, name='token_type_ids')

# Load BERT model
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

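# Keep BERT trainable for fine-tuning. TFBertModel exposes its encoder as a single
# Keras layer, so this loop marks the whole encoder as trainable, not only the last blocks.
for layer in bert_model.layers[-10:]:
    layer.trainable = True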

# BERT outputs
bert_outputs = bert_model(bert_input_ids, attention_mask=bert_attention_masks, token_type_ids=bert_token_type_ids)
bert_output = bert_outputs.last_hidden_state[:, 0, :]  # Use the first token ([CLS]) representation

# Define the VGG model
vgg_model = VGG16(include_top=False, weights='imagenet', input_shape=(224, 224, 3))

# input layer for VGG
vgg_input = vgg_model.input

# VGG outputs
vgg_output = vgg_model.output
vgg_output_flatten = Flatten()(vgg_output)

# Concatenate BERT and VGG outputs
combined_output = concatenate([bert_output, vgg_output_flatten])

x = Dense(1024, activation='relu')(combined_output)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
final_output = Dense(len(mlb.classes_), activation='sigmoid')(x)

combined_model = Model(inputs=[bert_input_ids, bert_attention_masks, bert_token_type_ids, vgg_input], outputs=final_output)
combined_model.compile(optimizer=Adam(learning_rate=2e-5), loss='binary_crossentropy', metrics=['accuracy'])
combined_model.summary()

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_length = 512
input_ids = []
attention_masks = []
token_type_ids = []

for plot in plots:
    encoded_dict = tokenizer.encode_plus(
        plot,
        add_special_tokens=True,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_token_type_ids=True,
        return_tensors='np',
    )

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])
    token_type_ids.append(encoded_dict['token_type_ids'])

input_ids = np.squeeze(np.array(input_ids))
attention_masks = np.squeeze(np.array(attention_masks))
token_type_ids = np.squeeze(np.array(token_type_ids))
labels = np.array(genre_labels)

# Split data into training and validation sets
train_input_ids, val_input_ids, train_attention_masks, val_attention_masks, train_token_type_ids, val_token_type_ids, train_labels, val_labels = train_test_split(
    input_ids, attention_masks, token_type_ids, labels, test_size=0.1, random_state=42)

# Callbacks for early stopping and saving best model
callbacks = [
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ModelCheckpoint(filepath='multimodal_movie_classification_model.h5', save_best_only=True, monitor='val_loss')
]

# Train combined model
history = combined_model.fit(
    [train_input_ids, train_attention_masks, train_token_type_ids, train_images], train_labels,
    validation_data=([val_input_ids, val_attention_masks, val_token_type_ids, val_images], val_labels),
    epochs=20,
    batch_size=16,
    callbacks=callbacks
)

combined_model.save('multimodal_movie_classification_model2.h5')



Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_2 (InputLayer)        [(None, 224, 224, 3)]        0         []                            
                                                                                                  
 block1_conv1 (Conv2D)       (None, 224, 224, 64)         1792      ['input_2[0][0]']             
                                                                                                  
 block1_conv2 (Conv2D)       (None, 224, 224, 64)         36928     ['block1_conv1[0][0]']        
                                                                                                  
 block1_pool (MaxPooling2D)  (None, 112, 112, 64)         0         ['block1_conv2[0][0]']        
                                                                                            



Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20


In [None]:
import shutil

# Define the path to the file in session storage
session_file_path = 'multimodal_movie_classification_model2.h5'

# Define the path to the destination in Google Drive
drive_file_path = '/content/drive/MyDrive'

# Copy the file from session storage to Google Drive
shutil.copy(session_file_path, drive_file_path)

print("File has been copied to your Google Drive.")


File has been copied to your Google Drive.


# **Step 4: Results and Evaluation of the three models:**

**1. BERT text analysis Model**

**Accuracy and loss trends:**

The training started with an initial loss of 0.6073 and an accuracy of 0.0275 in the first epoch.
By the end of the 50 epochs, the training loss decreased to 0.2944 and the accuracy improved to 0.2270.
The validation loss started at 0.4821 and decreased to 0.2767 by the final epoch.
The validation accuracy showed an improvement from 0.0322 in the first epoch to 0.2511 in the last epoch.

**Training and Validation Loss:**

Both training and validation losses consistently decreased over the epochs, indicating that the model was learning effectively.
There was no significant divergence between the training and validation loss curves, suggesting that overfitting was not a major issue.
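
The comparison of the two loss curves can be made concrete with a small check. This is a minimal sketch that uses only the first- and last-epoch values reported above; the `history` dictionary mirrors the shape of a Keras `History.history` object, and `generalization_gap` is a hypothetical helper name, not code from this notebook.

```python
# First- and last-epoch losses reported above for the BERT model.
# The intermediate epochs are not listed in the text, so they are omitted.
history = {
    "loss": [0.6073, 0.2944],
    "val_loss": [0.4821, 0.2767],
}


def generalization_gap(history):
    """Return val_loss - loss for each recorded epoch (hypothetical helper)."""
    return [round(v - t, 4) for t, v in zip(history["loss"], history["val_loss"])]


gaps = generalization_gap(history)
print(gaps)
```

A gap that stays close to zero or negative means the validation loss is not drifting above the training loss, which is what one would expect when overfitting is not a major issue.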

**How accurate was it?**

Overall, the model shows a clear learning trend, but its accuracy indicates that it needs further tuning and possibly more data to perform well.
The accuracy improved steadily but modestly. Starting from a very low baseline, reaching around **25% validation accuracy** is a sign of progress, but it also shows that the model has considerable room for improvement. For this project, however, I wanted to see whether the BERT or the VGG model would perform better using roughly the same architecture and data, so no further improvements were made.
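
Since the project plan also lists the F1 score as an evaluation metric, here is a minimal sketch of a micro-averaged F1 on multi-hot genre vectors. It assumes a multi-label setup in which the predictions have already been thresholded to 0/1. The sample labels are invented for illustration, and `micro_f1` is a hypothetical helper, not code from this notebook.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (movie, genre) pairs of 0/1 labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # genre correctly predicted
            elif t == 0 and p == 1:
                fp += 1  # genre predicted but not present
            elif t == 1 and p == 0:
                fn += 1  # genre present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented example: 3 movies, 4 genres (not data from this project).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 0]]
score = micro_f1(y_true, y_pred)
print(round(score, 3))
```

Unlike plain accuracy, this metric gives partial credit when a model predicts some but not all genres of a movie correctly.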




**2. VGG Image Classification Model**

**Accuracy and loss trends:**

At the beginning of training (Epoch 1), the model had a low accuracy of 14.40% on the training set and 20.74% on the validation set. The initial loss values were 0.5731 for the training set and 0.3017 for the validation set.

Validation accuracy remained relatively low and fluctuated between 20.74% and 33.06%. The **highest validation accuracy of 33.06%** was achieved in the last epoch, which is still significantly lower than what would be expected from a well-performing model, but already better than the BERT text analysis model.

**Progression and Overfitting:**

Over the epochs, the training loss consistently decreased, reaching a very low value of 0.0198 by the 50th epoch. Training accuracy showed improvement but remained low, peaking at around 49.55%. Validation loss decreased initially but started to increase from around Epoch 11, indicating potential overfitting. This suggests that the model began to memorize the training data rather than learning to generalize from it.
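
A simple way to locate where overfitting sets in is to find the epoch with the lowest validation loss. The sketch below uses made-up loss values shaped like the trend described above (they are not the logged values from this run), and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1


# Illustrative values only: falling until epoch 10, rising afterwards.
val_losses = [0.30, 0.28, 0.27, 0.26, 0.25, 0.245, 0.24, 0.238, 0.236, 0.235,
              0.24, 0.25, 0.26]
epoch = best_epoch(val_losses)
print(epoch)
```

Weights saved at that epoch would be the natural candidate to keep, since every later epoch fits the training data more closely while generalizing worse.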



**3. Concatenated model using both BERT and VGG**

**Accuracy and loss trends:**

The training loss and accuracy improved consistently over the epochs, indicating the model was learning from the training data. The validation loss and accuracy showed improvement initially but started to plateau and slightly fluctuate after the 7th epoch.

The validation accuracy, however, peaked at 46% in the 8th epoch before early stopping ended training at Epoch 11. Even though there is still room for improvement, this is a higher accuracy than either of the two individual models achieved.
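
The stopping point follows from the `EarlyStopping(monitor='val_loss', patience=5)` callback used for training: the run halts once the validation loss has failed to improve for five consecutive epochs. The following is a simplified pure-Python sketch of that rule (it ignores options such as `min_delta`), run on invented loss values; `stop_epoch` is a hypothetical helper.

```python
def stop_epoch(val_losses, patience):
    """Return (stopping epoch, best epoch), both 1-based.

    Simplified patience rule: stop after `patience` consecutive epochs
    without a new lowest validation loss.
    """
    best_loss = float("inf")
    best = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best
    return len(val_losses), best


# Invented values: validation loss bottoms out at epoch 6.
val_losses = [0.40, 0.35, 0.32, 0.30, 0.29, 0.28,
              0.285, 0.29, 0.29, 0.30, 0.31, 0.32]
stopped, best = stop_epoch(val_losses, patience=5)
print(stopped, best)
```

Under this simplified rule, a run that stops at epoch 11 with `patience=5` last improved its validation loss at epoch 6, so the epoch with the best monitored loss and the epoch with the best validation accuracy do not have to coincide.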

**Overfitting?**

The gap between the training and validation accuracy suggests a potential overfitting issue. The validation accuracy did not improve significantly after the initial epochs, which might indicate that the model is not generalizing well to unseen data.

**Conclusion**

Overall, none of the three models reached the performance expected of a well-trained model, even though a dataset with over 4000 instances was used for each. However, the results suggest that, for this problem and setup, a pre-trained image classification model (33.06% validation accuracy) performs better than a pre-trained BERT text analysis model (around 25%).

In addition, the fine-tuned, concatenated model performed best (46% validation accuracy), using both text features from the plots and image features from the movie posters to classify the genres.

Overall, this was a very fun project to create, and there is still room for improvement in the future. So, **what would you change?**