# Final Project: Movie Poster Multi-Label Genre Prediction
David Mehoves

## Introduction

This project builds a CNN that classifies the images from a Kaggle movie poster dataset into one or more appropriate genres. The dataset contains 7,867 images and a training dataset of 7,242 images labeled with 1 to 3 of the following 25 genres: Action, Adventure, Animation, Biography, Comedy, Crime, Documentary, Drama, Family, Fantasy, History, Horror, Music, Musical, Mystery, N/A, News, Reality TV, Romance, Sci-Fi, Short, Sport, Thriller, War, Western.

## Data Loading and Analysis

In [1]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D,MaxPool2D,Dense,Flatten,BatchNormalization,Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing import image
from sklearn.model_selection import train_test_split


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import kagglehub

# Download latest version
path = kagglehub.dataset_download("raman77768/movie-classifier")

print("Path to dataset files:", path)

df = pd.read_csv(f'{path}/Multi_Label_dataset/train.csv')

Path to dataset files: C:\Users\david\.cache\kagglehub\datasets\raman77768\movie-classifier\versions\1


The function below blends each movie's genre labels into one `unique` combined label string (for example, `comedy_drama`).

In [38]:
import ast

def single_label(in_list):
    """Join each movie's genres into one sorted, lowercase, underscore-separated label."""
    temp_list = []
    for movie in in_list:
        # The Genre column holds a stringified list such as "['Comedy', 'Drama']";
        # ast.literal_eval parses it without executing arbitrary code
        genres = ast.literal_eval(movie)
        temp_list.append("_".join(gen.lower() for gen in sorted(genres)))
    return temp_list

In [39]:
genre_df = df["Genre"]

genre_list = single_label(genre_df)
temp_dict = {}

# Count how often each combined label occurs
for row in genre_list:
    if row not in temp_dict:
        temp_dict[row] = 1
    else:
        temp_dict[row] += 1

print(temp_dict)

In [40]:
count_list = []
for key, value in temp_dict.items():
    count_list.append({"genre": key, "count": value})

sorted_data = sorted(count_list, key=lambda x: x["count"])
print(f"Unique Labels: {len(sorted_data)}")
sorted_data

Unique Labels: 462


[{'genre': 'fantasy_mystery_thriller', 'count': 1},
 {'genre': 'documentary_horror_thriller', 'count': 1},
 {'genre': 'action_family_romance', 'count': 1},
 {'genre': 'comedy_musical_sci-fi', 'count': 1},
 {'genre': 'comedy_horror_musical', 'count': 1},
 {'genre': 'action_horror_romance', 'count': 1},
 {'genre': 'action_adventure_war', 'count': 1},
 {'genre': 'animation_comedy_crime', 'count': 1},
 {'genre': 'war', 'count': 1},
 {'genre': 'adventure_horror_mystery', 'count': 1},
 {'genre': 'comedy_mystery_sci-fi', 'count': 1},
 {'genre': 'adventure_history_romance', 'count': 1},
 {'genre': 'biography_drama_horror', 'count': 1},
 {'genre': 'comedy_sci-fi_western', 'count': 1},
 {'genre': 'adventure_animation_crime', 'count': 1},
 {'genre': 'biography_comedy_music', 'count': 1},
 {'genre': 'romance_western', 'count': 1},
 {'genre': 'action_drama_music', 'count': 1},
 {'genre': 'animation_sci-fi_short', 'count': 1},
 {'genre': 'family_fantasy_music', 'count': 1},
 ...]

I decided not to create multi-genre single labels for several reasons:
- Doing so greatly increased the number of unique labels and many of them appeared very infrequently in the dataset.
- Because the distribution of labels was imbalanced, I was concerned the model would be biased towards the more frequent labels without significant data augmentation.
- The increase in unique labels meant training would require more computational resources.

## Data Cleaning
In this section I check for posters without a usable genre label and handle the ten posters labeled as `N/A`.

In [41]:
len(df[df['N/A']==1])

10

It looks like ten images have no genre label (`N/A`). Because the count is so small, I decided to remove them from the training data.
- Removing the N/A rows initially broke my image loading because it left 'dead' index values (gaps) in the DataFrame index. I needed a way to clean the data that also renumbers the remaining rows, so I added `.reset_index(drop=True)`.

In [42]:
df = df[df['N/A']==0] \
    .reset_index(drop=True)
len(df[df['N/A']==1])

0
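
The 'dead' index problem described above can be reproduced on a toy frame. This is only a sketch: the `Id` values are made up, and the frame stands in for the much larger `train.csv`. It shows why a positional loop such as `df['Id'][i]` fails after filtering, and why `.reset_index(drop=True)` fixes it.

```python
import pandas as pd

# Toy stand-in for train.csv; these Ids are hypothetical
toy_df = pd.DataFrame({"Id": ["a", "b", "c", "d"], "N/A": [0, 1, 0, 0]})

# Filtering keeps the original index labels, so label 1 is now missing
filtered = toy_df[toy_df["N/A"] == 0]
print(list(filtered.index))  # [0, 2, 3]

# filtered["Id"][1] looks up the index *label* 1, which no longer exists
try:
    filtered["Id"][1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels
cleaned = filtered.reset_index(drop=True)
print(list(cleaned.index))  # [0, 1, 2]
print(cleaned["Id"][1])     # c
```

An alternative would be to index by position with `.iloc`, but resetting the index keeps the later loading loop unchanged.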

## Data Preprocessing

### Feature Engineering
- Resize the images so they all have the same dimensions
- Normalize the pixels down from 0-255 to a range of 0-1

In [43]:
WIDTH = 200
HEIGHT = 200

def create_X_features_from_Images(width,height):
    X = []
    for i in range(df.shape[0]):
        img_path = f'{path}/Multi_Label_dataset/Images/'+df['Id'][i]+'.jpg'
        # Load the image as RGB (3 channels) and resize it; target_size is (height, width)
        img = image.load_img(img_path,target_size=(height,width))
        # Convert to an array, then normalize by dividing by 255 so pixel values fall in 0-1
        img = image.img_to_array(img)
        img = img/255.0
        X.append(img)
    return np.array(X)

X = create_X_features_from_Images(WIDTH, HEIGHT)
X.shape

(7244, 200, 200, 3)

### Data Augmentation
- Increase the number of images in genres with low counts (e.g. Reality-TV and News) so they are better balanced against the high-count genres. This reduces bias toward the common genres, helps avoid underfitting the rare ones, and should lead to better generalization.
- Note: some genres legitimately occur more often than others because they are more popular with audiences.


### Dataset Splits 
- Ensure all genres are represented in the training, testing, and validation datasets
- Remove the `Id` and `Genre` columns, leaving only the one-hot encoded genre columns as the y labels
- Split into train, test, and validation datasets

In [None]:
y = df.drop(['Id','Genre'],axis=1)
y = y.to_numpy()

# 80% Train, 10% Test, 10% Validate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
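
As a quick sanity check on the split sizes, the arithmetic can be worked through by hand. This sketch assumes scikit-learn rounds a fractional `test_size` up and gives the remainder to the training side; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_first, n_second) for a two-way split.

    Assumption: the second ("test") side is rounded up and the first
    side receives whatever is left, as train_test_split does.
    """
    n_second = math.ceil(n_samples * test_fraction)
    return n_samples - n_second, n_second

n_total = 7244  # rows left after dropping the ten N/A posters (X.shape[0])

# First split: 80% train, 20% held out
n_train, n_holdout = split_sizes(n_total, 0.2)

# Second split: the held-out 20% is halved into validation and test
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)  # 5795 724 725
print(n_train // 64)             # 90 training steps per epoch at batch size 64
```

These sizes line up with the `90/90` steps per epoch and the `182`, `23`, and `23` evaluation steps (at Keras's default evaluation batch size of 32) that appear in the training output later on.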

## Model Building & Training

### Model Selection
A Convolutional Neural Network (CNN) model is well-suited to classifying movie poster images into one or more genres because it excels at extracting features from complex images and can recognize objects regardless of their position in the image. CNNs are also capable of predicting multiple labels simultaneously.

### Loss Function
Binary Cross-Entropy – if each instance can belong to multiple classes simultaneously    
~~Sparse Categorical CrossEntropy  if using independent multi-genre single labels~~ (I have removed this as an option because I am no longer using multi-genre single labels)
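
To make the loss concrete, here is a minimal pure-Python sketch of binary cross-entropy for a single poster. The four labels and logits are made-up values for illustration, and the helper names are hypothetical; Keras computes the same quantity in a vectorized way.

```python
import math

def sigmoid(z):
    """Squash a raw score into an independent probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one sample."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical poster: 4 genre slots, two of them true
y_true = [1, 0, 1, 0]
logits = [2.0, -1.0, 0.0, -3.0]

probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(y_true, probs)

print([round(p, 3) for p in probs])
print(round(loss, 4))
```

Because each sigmoid output is scored on its own, the probabilities do not need to sum to 1, which is what allows several genres to be predicted for the same poster.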

### Hyperparameters

- <strong>Activation:</strong>    

    Middle Layers - ReLU activation to mitigate the vanishing gradient problem, which will allow the model to learn faster and perform better.    
    Output Layer Options:    
    >Sigmoid – will output a probability for each label independently, which is essential to multi-label classification if we keep our labels separated, since labels are correlated both positively and negatively.    
    >~~Softmax – if we pre-process the data to create independent multi-genre labels~~

- <strong>Optimizer:</strong> Adam adjusts the learning rate dynamically which is beneficial to a model learning complex patterns from movie poster input
- <strong>Filter Size (Kernel Size):</strong> Determines the dimensions of the filters applied during convolution operations.
- <strong>Number of Filters:</strong> Specifies how many filters are used in each convolutional layer.
- <strong>Stride:</strong> Defines the step size with which the filter moves across the input image.
- <strong>Padding:</strong> Controls whether the input is padded with zeros around the border, affecting the output size.
- <strong>Pooling Size:</strong> Sets the dimensions of the pooling window, typically used in max or average pooling layers.
- <strong>Learning Rate:</strong> Determines the step size at each iteration while moving toward a minimum of the loss function.
- <strong>Batch Size:</strong> The number of training examples utilized in one iteration.
- <strong>Number of Epochs:</strong> The number of complete passes through the training dataset.
- <strong>Dropout Rate:</strong> The fraction of neurons to drop during training to prevent overfitting.
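
The effect of kernel size, stride, padding, and pooling size on the feature-map dimensions can be traced with a little arithmetic. The sketch below assumes the Keras defaults used by the models in this notebook (stride 1 and no padding for `Conv2D`, pooling stride equal to the pool size) and follows a 200x200 input through three 3x3 convolution and 2x2 pooling stages; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, pool):
    """Output width/height of a pooling layer whose stride equals the pool size."""
    return size // pool

size = 200  # input width and height
for _ in range(3):
    size = conv_out(size, kernel=3)  # 200 -> 198, 99 -> 97, 48 -> 46
    size = pool_out(size, pool=2)    # 198 -> 99, 97 -> 48, 46 -> 23

flat = size * size * 128           # flattened features after the last 128-filter stage
dense_params = flat * 1024 + 1024  # weights plus biases of a Dense(1024) layer

print(size, flat, dense_params)  # 23 67712 69338112
```

Under these assumptions, almost all of the parameters sit in the first dense layer rather than in the convolutions, which is worth keeping in mind when tuning the dropout rate or the input size.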


### Training

In [None]:
Conv2D = tf.keras.layers.Conv2D
MaxPooling2D = tf.keras.layers.MaxPooling2D
Flatten = tf.keras.layers.Flatten
Dense = tf.keras.layers.Dense
Dropout = tf.keras.layers.Dropout
ImageDataGenerator = tf.keras.preprocessing.image.ImageDataGenerator
BatchNormalization = tf.keras.layers.BatchNormalization
Activation = tf.keras.layers.Activation

In [97]:

def evaluate(model):
    print("\n\nEvaluate Training Data")
    model.evaluate(X_train, y_train)
    print("Evaluate Test Data")
    model.evaluate(X_test, y_test)
    print("Evaluate Validation Data")
    model.evaluate(X_valid, y_valid)


#### Test #1: Initial Setup

In [None]:
NUMBER_OF_CLASSES = 25
BatchSize = 64
NUMBER_OF_EPOCHS = 10

INPUT_SHAPE = (WIDTH, HEIGHT, 3)  
KERNAL_SIZE = 3  # Typical filter size for CNNs
POOL_SIZE = (2, 2)  # Typical pooling size

model = tf.keras.models.Sequential(
    [
        Conv2D(32, KERNAL_SIZE, activation='relu', input_shape=INPUT_SHAPE),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(64, KERNAL_SIZE, activation='relu'),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(128, KERNAL_SIZE, activation='relu'),
        MaxPooling2D(pool_size=POOL_SIZE),

        Flatten(),
        Dense(1024, activation='relu'),
        Dropout(0.5),
        Dense(NUMBER_OF_CLASSES, activation='sigmoid')
    ]
)

model.summary()

In [64]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [65]:
augmentor = ImageDataGenerator(
    rotation_range=30,
    zoom_range=0.2,
    width_shift_range=0.25,
    height_shift_range=0.25, 
    shear_range=0.20,
    horizontal_flip=True,
    fill_mode="nearest"
)

In [None]:
history = model.fit(
    augmentor.flow(
        X_train,
        y_train,
        batch_size=BatchSize
    ),
    validation_data=(X_test, y_test), 
    steps_per_epoch=len(X_train) // BatchSize,
    epochs=NUMBER_OF_EPOCHS
)

Epoch 1/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 90s 967ms/step - accuracy: 0.2302 - loss: 0.3306 - val_accuracy: 0.3697 - val_loss: 0.2374
Epoch 2/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.2188 - loss: 0.2309 - val_accuracy: 0.3683 - val_loss: 0.2374
Epoch 3/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 89s 966ms/step - accuracy: 0.3017 - loss: 0.2419 - val_accuracy: 0.3614 - val_loss: 0.2374
Epoch 4/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 3s 24ms/step - accuracy: 0.3906 - loss: 0.2465 - val_accuracy: 0.3586 - val_loss: 0.2350
Epoch 5/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 86s 936ms/step - accuracy: 0.2934 - loss: 0.2391 - val_accuracy: 0.3655 - val_loss: 0.2370
Epoch 6/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 3s 22ms/step - accuracy: 0.4062 - loss: 0.2245 - val_accuracy: 0.3710 - val_loss: 0.2370
Epoch 7/10
...

In [68]:
evaluate(model)



Evaluate Training Data
182/182 ━━━━━━━━━━━━━━━━━━━━ 16s 88ms/step - accuracy: 0.3248 - loss: 0.2279
Evaluate Test Data
23/23 ━━━━━━━━━━━━━━━━━━━━ 2s 87ms/step - accuracy: 0.3430 - loss: 0.2355
Evaluate Validation Data
23/23 ━━━━━━━━━━━━━━━━━━━━ 2s 94ms/step - accuracy: 0.3379 - loss: 0.2266


The model reaches an accuracy of roughly 33% on all three splits. The large dropout rate appeared to cause some fluctuation during training, so I would like to reduce it and see whether training becomes more stable without overfitting.

#### Test #2: Lower dropout rate

In [None]:
NUMBER_OF_CLASSES = 25
BatchSize = 64
NUMBER_OF_EPOCHS = 10

INPUT_SHAPE = (WIDTH, HEIGHT, 3)  
KERNAL_SIZE = 3  # Typical filter size for CNNs
POOL_SIZE = (2, 2)  # Typical pooling size

model = tf.keras.models.Sequential(
    [
        Conv2D(32, KERNAL_SIZE, activation='relu', input_shape=INPUT_SHAPE),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(64, KERNAL_SIZE, activation='relu'),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(128, KERNAL_SIZE, activation='relu'),
        MaxPooling2D(pool_size=POOL_SIZE),

        Flatten(),
        Dense(1024, activation='relu'),
        Dropout(0.2),
        Dense(NUMBER_OF_CLASSES, activation='sigmoid')
    ]
)

model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

augmentor = ImageDataGenerator(
    rotation_range=30,
    zoom_range=0.2,
    width_shift_range=0.25,
    height_shift_range=0.25, 
    shear_range=0.20,
    horizontal_flip=True,
    fill_mode="nearest"
)
steps_per_epoch = max(1, len(X_train) // BatchSize)
history = model.fit(
    augmentor.flow(
        X_train,
        y_train,
        batch_size=BatchSize
    ),
    validation_data=(X_test, y_test), 
    steps_per_epoch=steps_per_epoch,
    epochs=NUMBER_OF_EPOCHS
)

evaluate(model)

Epoch 1/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 88s 944ms/step - accuracy: 0.2479 - loss: 0.3084 - val_accuracy: 0.3600 - val_loss: 0.2389
Epoch 2/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.2656 - loss: 0.2428 - val_accuracy: 0.3586 - val_loss: 0.2402
Epoch 3/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 86s 938ms/step - accuracy: 0.3021 - loss: 0.2377 - val_accuracy: 0.3517 - val_loss: 0.2355
Epoch 4/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.3438 - loss: 0.2461 - val_accuracy: 0.3614 - val_loss: 0.2379
Epoch 5/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 85s 922ms/step - accuracy: 0.3031 - loss: 0.2358 - val_accuracy: 0.3683 - val_loss: 0.2417
Epoch 6/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.3750 - loss: 0.2281 - val_accuracy: 0.3586 - val_loss: 0.2416
Epoch 7/10
...

#### Test #3: More, complex layers, adding BatchNormalization

In [None]:
NUMBER_OF_CLASSES = 25
BatchSize = 64
NUMBER_OF_EPOCHS = 10

INPUT_SHAPE = (WIDTH, HEIGHT, 3)  
KERNAL_SIZE = 3  # Typical filter size for CNNs
POOL_SIZE = (2, 2)  # Typical pooling size

model = tf.keras.models.Sequential(
    [
        Conv2D(32, KERNAL_SIZE, activation='relu', input_shape=INPUT_SHAPE),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(64, KERNAL_SIZE, activation='relu'),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(128, KERNAL_SIZE, activation='relu'),
        MaxPooling2D(pool_size=POOL_SIZE),

        Flatten(),
        Dense(1024),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2),

        Dense(1024),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2),

        Dense(1024),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2),

        Dense(NUMBER_OF_CLASSES, activation='sigmoid')
    ]
)

model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

augmentor = ImageDataGenerator(
    rotation_range=30,
    zoom_range=0.2,
    width_shift_range=0.25,
    height_shift_range=0.25, 
    shear_range=0.20,
    horizontal_flip=True,
    fill_mode="nearest"
)

train_dataset = tf.data.Dataset.from_generator(
    lambda: augmentor.flow(
        X_train,
        y_train,
        batch_size=BatchSize
    ),
    output_signature=(
        tf.TensorSpec(shape=(None, *X_train.shape[1:]), dtype=tf.float32),
        tf.TensorSpec(shape=(None, y_train.shape[1]), dtype=tf.float32)
    )
).repeat()

history = model.fit(
    train_dataset,
    validation_data=(X_test, y_test), 
    steps_per_epoch=steps_per_epoch,
    epochs=NUMBER_OF_EPOCHS
)

evaluate(model)

Epoch 1/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 93s 1s/step - accuracy: 0.2374 - loss: 0.3402 - val_accuracy: 0.1738 - val_loss: 0.4559
Epoch 2/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 85s 946ms/step - accuracy: 0.2962 - loss: 0.2433 - val_accuracy: 0.3793 - val_loss: 0.2640
Epoch 3/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 85s 944ms/step - accuracy: 0.3190 - loss: 0.2417 - val_accuracy: 0.3545 - val_loss: 0.2568
Epoch 4/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 88s 979ms/step - accuracy: 0.3207 - loss: 0.2408 - val_accuracy: 0.3669 - val_loss: 0.2440
Epoch 5/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 88s 981ms/step - accuracy: 0.3207 - loss: 0.2360 - val_accuracy: 0.2634 - val_loss: 0.2442
Epoch 6/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 88s 975ms/step - accuracy: 0.3405 - loss: 0.2350 - val_accuracy: 0.3600 - val_loss: 0.2694
Epoch 7/10
...

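#### Test #4: Precision, Recall, and F1 metrics with more epochs
- Switch metrics from `['accuracy']` to `['Precision', 'Recall', tf.keras.metrics.F1Score]`
    - Precision - the share of predicted genres that are correct
    - Recall - the share of a poster's true genres that the model captures (multi-label)
    - F1-Score - the harmonic mean of Precision and Recall, balancing the two
- Increase epochs from 10 to 20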
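
The three metrics listed above can be worked through on a tiny multi-label example. This is a hand-rolled, micro-averaged sketch with made-up labels and predictions, not the Keras implementation; the four columns stand for four hypothetical genre slots.

```python
def precision_recall_f1(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all (poster, genre) pairs."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1  # genre predicted and actually present
            elif p and not t:
                fp += 1  # genre predicted but not present
            elif t and not p:
                fn += 1  # genre present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Columns: Action, Comedy, Drama, Romance (illustrative only)
y_true = [[0, 1, 1, 1],
          [1, 0, 1, 0],
          [0, 1, 0, 0]]
y_pred = [[0, 0, 1, 0],
          [1, 0, 1, 0],
          [0, 0, 1, 0]]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, round(f1, 2))  # 0.75 0.5 0.6
```

A cautious model like this one predicts few genres, so most of what it predicts is right (high precision) while many true genres are missed (low recall).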

In [None]:
NUMBER_OF_CLASSES = 25
BatchSize = 64
NUMBER_OF_EPOCHS = 20

INPUT_SHAPE = (WIDTH, HEIGHT, 3)  
KERNAL_SIZE = 3  # Typical filter size for CNNs
POOL_SIZE = (2, 2)  # Typical pooling size

model = tf.keras.models.Sequential(
    [
        Conv2D(32, KERNAL_SIZE, activation='relu', input_shape=INPUT_SHAPE),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(64, KERNAL_SIZE, activation='relu'),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(128, KERNAL_SIZE, activation='relu'),
        MaxPooling2D(pool_size=POOL_SIZE),

        Flatten(),
        Dense(1024),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2),

        Dense(1024),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2),

        Dense(1024),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2),

        Dense(NUMBER_OF_CLASSES, activation='sigmoid')
    ]
)

model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['Precision', 'Recall', tf.keras.metrics.F1Score])

augmentor = ImageDataGenerator(
    rotation_range=30,
    zoom_range=0.2,
    width_shift_range=0.25,
    height_shift_range=0.25, 
    shear_range=0.20,
    horizontal_flip=True,
    fill_mode="nearest"
)

train_dataset = tf.data.Dataset.from_generator(
    lambda: augmentor.flow(
        X_train,
        y_train,
        batch_size=BatchSize
    ),
    output_signature=(
        tf.TensorSpec(shape=(None, *X_train.shape[1:]), dtype=tf.float32),
        tf.TensorSpec(shape=(None, y_train.shape[1]), dtype=tf.float32)
    )
).repeat()

history = model.fit(
    train_dataset,
    validation_data=(X_test, y_test), 
    steps_per_epoch=steps_per_epoch,
    epochs=NUMBER_OF_EPOCHS
)

evaluate(model)

Epoch 1/20
90/90 ━━━━━━━━━━━━━━━━━━━━ 91s 969ms/step - Precision: 0.3019 - Recall: 0.2549 - f1_score: 0.0529 - loss: 0.3445 - val_Precision: 0.3563 - val_Recall: 0.2250 - val_f1_score: 0.0273 - val_loss: 0.4029
Epoch 2/20
90/90 ━━━━━━━━━━━━━━━━━━━━ 89s 994ms/step - Precision: 0.5533 - Recall: 0.2181 - f1_score: 0.0518 - loss: 0.2426 - val_Precision: 0.4611 - val_Recall: 0.2303 - val_f1_score: 0.0525 - val_loss: 0.2687
Epoch 3/20
90/90 ━━━━━━━━━━━━━━━━━━━━ 86s 959ms/step - Precision: 0.5640 - Recall: 0.2196 - f1_score: 0.0548 - loss: 0.2407 - val_Precision: 0.5229 - val_Recall: 0.1749 - val_f1_score: 0.0463 - val_loss: 0.2584
Epoch 4/20
90/90 ━━━━━━━━━━━━━━━━━━━━ 85s 946ms/step - Precision: 0.5860 - Recall: 0.2151 - f1_score: 0.0578 - loss: 0.2381 - val_Precision: 0.4905 - val_Recall: 0.1673 - val_f1_score: 0.0470 - val_loss: 0.2746
Epoch 5/20
...

I adjusted the `evaluate` function to report a more realistic F1 score: it now searches for the decision threshold (the probability above which a genre counts as `True`) that gives the best macro F1 on the validation set.

In [2]:

from sklearn.metrics import f1_score



def evaluate(model):
    print("\n\nEvaluate Training Data")
    model.evaluate(X_train, y_train)
    print("Evaluate Test Data")
    model.evaluate(X_test, y_test)
    print("Evaluate Validation Data")
    model.evaluate(X_valid, y_valid)
    
    print("Computing best threshold and associated F1 score")
    # After training, get predictions on the validation set
    y_pred_probs = model.predict(X_valid)  # shape: (num_samples, num_classes)

    threshold = 0.01
    best_threshold = threshold
    best_score = 0
    inc = 0.01
    while threshold < 1:
        y_pred_binary = (y_pred_probs > threshold).astype(int)
        score = f1_score(y_valid, y_pred_binary, average='macro', zero_division=0)
        if score > best_score:
            best_score = score
            best_threshold = threshold
        threshold += inc
    print(f"F1-Score: {round(best_score, 4)}, Best Threshold: {best_threshold}")

In [103]:
evaluate(model)



Evaluate Training Data
182/182 ━━━━━━━━━━━━━━━━━━━━ 16s 85ms/step - Precision: 0.6423 - Recall: 0.2010 - f1_score: 0.0729 - loss: 0.2190
Evaluate Test Data
23/23 ━━━━━━━━━━━━━━━━━━━━ 2s 83ms/step - Precision: 0.6891 - Recall: 0.2227 - f1_score: 0.0712 - loss: 0.2342
Evaluate Validation Data
23/23 ━━━━━━━━━━━━━━━━━━━━ 2s 83ms/step - Precision: 0.6505 - Recall: 0.1921 - f1_score: 0.0725 - loss: 0.2236
Computing best threshold and associated F1 score
23/23 ━━━━━━━━━━━━━━━━━━━━ 2s 82ms/step
F1-Score: 0.2014, Best Threshold: 0.09


The genres the model selects are correct about 65% of the time (precision), but it captures only about 20% of the genres actually associated with each poster (recall), so recall and F1 are low. I believe this is because some genres are rare. I may need to augment or oversample the rare classes to improve recall.

I suggest looking into the following:
- Adjust the sampling of less common genres

#### Test #5: `iterative_train_test_split` and Balance Labels

In [33]:
from skmultilearn.model_selection import iterative_train_test_split
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D,MaxPool2D,Dense,Flatten,BatchNormalization,Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing import image
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import kagglehub
from collections import Counter

# Download latest version
path = kagglehub.dataset_download("raman77768/movie-classifier")

print("Path to dataset files:", path)

df = pd.read_csv(f'{path}/Multi_Label_dataset/train.csv')

df = df[df['N/A']==0] \
    .reset_index(drop=True)


genre_counts = Counter(genre for row in df['Genre'] for genre in eval(row))
print(genre_counts)



Path to dataset files: C:\Users\david\.cache\kagglehub\datasets\raman77768\movie-classifier\versions\1
Counter({'Drama': 3619, 'Comedy': 2900, 'Action': 1343, 'Romance': 1334, 'Crime': 1176, 'Thriller': 918, 'Adventure': 870, 'Documentary': 652, 'Horror': 503, 'Fantasy': 467, 'Mystery': 454, 'Biography': 441, 'Family': 434, 'Sci-Fi': 399, 'Music': 305, 'Animation': 244, 'History': 224, 'Sport': 221, 'War': 144, 'Musical': 97, 'Western': 50, 'Short': 46, 'News': 21, 'Reality-TV': 2})


This shows that the label counts range from just 2 (Reality-TV) to 3,619 (Drama). I need to increase the number of images for the rarer genres to bring the training data into better balance.

In [34]:
def genre_likelihood(genre):
    # Inverse proportional to representation
    return (max_count - genre_counts[genre]) / max_count
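
To make the weighting concrete, here is a small self-contained sketch that applies the same formula to four counts copied from the `Counter` output above. The two-genre poster at the end is a made-up example of how per-genre values add up into a duplication weight.

```python
# Four counts copied from the Counter output; the rest are omitted for brevity
genre_counts = {"Drama": 3619, "Comedy": 2900, "News": 21, "Reality-TV": 2}
max_count = max(genre_counts.values())

def genre_likelihood(genre):
    # Inversely related to representation: 0 for the most common genre,
    # approaching 1 for the rarest ones
    return (max_count - genre_counts[genre]) / max_count

weights = {g: genre_likelihood(g) for g in genre_counts}
for genre, weight in weights.items():
    print(f"{genre}: {weight:.4f}")

# Hypothetical poster tagged Comedy + News: its weight is the sum of both genres
poster_weight = sum(weights[g] for g in ["Comedy", "News"])
print(f"Comedy + News poster weight: {poster_weight:.4f}")
```

Posters carrying rare genres therefore receive larger weights and are more likely to be picked when sampling duplicates.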

In [35]:

# Maximum fraction by which a genre's count may fall below the most common genre's count
x_percent = 0.10

# Step 1: Find the count of the most common genre and derive the target count
max_count = max(genre_counts.values())
print(f"Max Cout: {max_count}")
target_count = int(max_count * (1 - x_percent))
print(f"Minimum Representation: {target_count}")

# Step 2: Identify underrepresented labels
underrepresented_labels = {label: target_count - count for label, count in genre_counts.items() if count < target_count}

# Precompute likelihoods for efficiency
genre_likelihoods = {g: genre_likelihood(g) for g in genre_counts}
print(f"Genre Likelihoods: {genre_likelihoods}")

overrepresented_labels_set = {lbl for lbl, cnt in genre_counts.items() if cnt > target_count}
augmented_data = []

for label, needed in underrepresented_labels.items():
    if needed <= 0:
        continue

    # Filter images containing the label
    images_candidates = df[df['Genre'].apply(
        lambda x: label in eval(x) and not any(lbl in eval(x) for lbl in overrepresented_labels_set)
    )].copy()

    if images_candidates.empty:
        continue

    # Compute image weight for each candidate
    images_candidates['duplication_weight'] = images_candidates['Genre'].apply(
        lambda x: sum(genre_likelihoods[g] for g in eval(x))
    )

    total_weight = images_candidates['duplication_weight'].sum()
    if total_weight == 0:
        # If total_weight is 0, skip to avoid division by zero
        continue

    images_candidates['duplication_prob'] = images_candidates['duplication_weight'] / total_weight

    # Sample according to these probabilities
    duplicates = images_candidates.sample(n=needed, replace=True, weights='duplication_prob', random_state=42)
    augmented_data.append(duplicates)

# Combine all duplicates with the original df
if augmented_data:
    augmented_df = pd.concat([df] + augmented_data, ignore_index=True)
else:
    augmented_df = df

# Now augmented_df contains the original data plus the selected duplicates
print("Final dataset size:", len(augmented_df))
genre_counts = Counter(genre for row in augmented_df['Genre'] for genre in eval(row))
print(genre_counts)


augmented_data


Max Cout: 3619
Minimum Representation: 3257
Genre Likelihoods: {'Comedy': 0.19867366675877315, 'Drama': 0.0, 'Romance': 0.6313898867090356, 'Music': 0.9157225752970434, 'Sci-Fi': 0.8897485493230174, 'Thriller': 0.7463387676153633, 'Action': 0.6289030118817353, 'Adventure': 0.759602100027632, 'Crime': 0.6750483558994197, 'Horror': 0.8610113290964355, 'Musical': 0.9731970157502072, 'Biography': 0.8781431334622823, 'History': 0.9381044487427466, 'Mystery': 0.8745509809339597, 'Fantasy': 0.8709588284056369, 'Family': 0.8800773694390716, 'Sport': 0.9389334070185134, 'War': 0.9602100027631942, 'Animation': 0.9325780602376347, 'Documentary': 0.8198397347333517, 'Western': 0.9861840287372202, 'Short': 0.9872893064382426, 'Reality-TV': 0.9994473611494888, 'News': 0.9941972920696325}
Final dataset size: 68910
Counter({'Comedy': 19715, 'Documentary': 18656, 'Action': 14381, 'Adventure': 12766, 'Thriller': 9339, 'Crime': 7399, 'Horror': 7361, 'History': 7006, 'Family': 6945, 'Fantasy': 6929, 'Sci-

[             Id                              Genre  Action  Adventure  \
 2354  tt0116242   ['Comedy', 'Musical', 'Romance']       0          0   
 6592  tt1213663     ['Action', 'Comedy', 'Sci-Fi']       1          0   
 4839  tt0807758              ['Comedy', 'Romance']       0          0   
 3874  tt0245803    ['Action', 'Comedy', 'Fantasy']       1          0   
 961   tt0096874  ['Adventure', 'Comedy', 'Sci-Fi']       0          1   
 ...         ...                                ...     ...        ...   
 4324  tt0477347  ['Action', 'Adventure', 'Comedy']       1          1   
 990   tt0097366     ['Comedy', 'Mystery', 'Crime']       0          0   
 453   tt0091680              ['Comedy', 'Romance']       0          0   
 4153  tt0360139              ['Comedy', 'Romance']       0          0   
 199   tt0087932      ['Action', 'Comedy', 'Crime']       1          0   
 
       Animation  Biography  Comedy  Crime  Documentary  Drama  ...  \
 2354          0          0       1    

This has brought the genre counts much closer together: instead of a ratio of roughly 3.6k to 2 between the most and least common genres, the most extreme ratio is now about 19k to 3k.

Unfortunately, loading this many images at once uses a large amount of memory, and in testing I did in fact run out. I will need to load and process the images in batches on the fly.
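
A rough estimate shows why loading everything at once fails. This sketch assumes each pixel value is stored as a 4-byte `float32` (the Keras default for image arrays) and ignores any temporary copies made while building the array; `array_gigabytes` is a hypothetical helper.

```python
def array_gigabytes(n_images, width=200, height=200, channels=3, bytes_per_value=4):
    """Approximate size of a dense image array in gigabytes (1 GB = 1e9 bytes)."""
    return n_images * width * height * channels * bytes_per_value / 1e9

original_gb = array_gigabytes(7244)    # dataset before oversampling
augmented_gb = array_gigabytes(68910)  # dataset after oversampling

print(round(original_gb, 2))   # 3.48
print(round(augmented_gb, 1))  # 33.1
```

Under these assumptions the oversampled array would need roughly ten times the memory of the original one, which is why streaming images from disk batch by batch is the more practical route.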

In [36]:
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import ast

# augmented_df = augmented_df.drop(['Id','Genre', 'duplication_weight', 'duplication_prob'],axis=1)

# Split into train and temp
train_df, temp_df = train_test_split(augmented_df, test_size=0.4, random_state=42)
# Split temp into validation and test
valid_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

WIDTH = 200
HEIGHT = 200
BATCH_SIZE = 128

def load_and_preprocess_image(img_path, width=200, height=200):
    img = load_img(img_path, target_size=(width, height))
    img = img_to_array(img)/255.0
    return img

def gen_paths_and_labels(df, path_prefix):
    # Extract the label columns
    label_cols = [c for c in df.columns if c not in ['Id', 'Genre', 'duplication_weight', 'duplication_prob', 'N/A']]
    for _, row in df.iterrows():
        img_path = f"{path_prefix}/Multi_Label_dataset/Images/{row['Id']}.jpg"
        labels = row[label_cols].values.astype('float32')
        yield img_path, labels

def process_record(img_path, label):
    img = tf.numpy_function(load_and_preprocess_image, [img_path, WIDTH, HEIGHT], tf.float32)
    img.set_shape((WIDTH, HEIGHT, 3))
    return img, label

# Create a reusable function to build a dataset from a df
def build_dataset(df, path_prefix, batch_size=BATCH_SIZE, shuffle=False):
    label_cols = [c for c in df.columns if c not in ['Id','Genre','duplication_weight','duplication_prob', 'N/A']]
    num_samples = len(df)

    # From_generator expects a callable that returns a generator
    dataset = tf.data.Dataset.from_generator(
        lambda: gen_paths_and_labels(df, path_prefix),
        output_signature=(
            tf.TensorSpec(shape=(), dtype=tf.string),
            tf.TensorSpec(shape=(len(label_cols),), dtype=tf.float32)
        )
    )

    dataset = dataset.map(process_record, num_parallel_calls=tf.data.AUTOTUNE)

    if shuffle:
        dataset = dataset.shuffle(buffer_size=num_samples, reshuffle_each_iteration=True)
    dataset = dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

    return dataset

# Now build separate datasets for train, valid, and test
train_dataset = build_dataset(train_df, path, batch_size=BATCH_SIZE, shuffle=True)
valid_dataset = build_dataset(valid_df, path, batch_size=BATCH_SIZE, shuffle=False)
test_dataset  = build_dataset(test_df, path, batch_size=BATCH_SIZE, shuffle=False).repeat()
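
One caveat with this setup: the duplicates are created before the row-level split, so copies of the same poster can land in both the training set and the validation or test sets, which may inflate the evaluation scores. A possible alternative, sketched below in plain Python with made-up Ids, is to split on unique `Id` values so that every copy of a poster stays on one side. `split_ids` is a hypothetical helper, not something the notebook uses.

```python
import random

def split_ids(ids, holdout_fraction, seed=42):
    """Split the *unique* ids into (train_ids, holdout_ids) sets."""
    unique_ids = sorted(set(ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_holdout = int(len(unique_ids) * holdout_fraction)
    return set(unique_ids[n_holdout:]), set(unique_ids[:n_holdout])

# Hypothetical Ids; "tt003" has been duplicated by oversampling
rows = ["tt001", "tt002", "tt003", "tt003", "tt003", "tt004", "tt005"]

train_ids, holdout_ids = split_ids(rows, holdout_fraction=0.4)

# Every copy of a poster follows its Id to exactly one side
train_rows = [r for r in rows if r in train_ids]
holdout_rows = [r for r in rows if r in holdout_ids]

print(len(train_rows), len(holdout_rows))
```

With a DataFrame, the same idea would amount to filtering `augmented_df` by membership of its `Id` column in each set.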

Because many images in the training set are now exact duplicates, I need to randomize them a bit with augmentation so the copies are not identical.

In [37]:
from tensorflow.keras.layers import RandomFlip, RandomRotation, RandomZoom, RandomTranslation
from tensorflow.keras import Sequential


augmentation = Sequential([
    RandomFlip("horizontal"),
    RandomRotation(0.1), 
    RandomZoom(0.2),
    RandomTranslation(0.25, 0.25)
], name="augmentation_layer")

def augment_image(img, label):
    img = augmentation(img, training=True)
    return img, label

train_dataset = train_dataset \
    .map(augment_image, num_parallel_calls=tf.data.AUTOTUNE)

train_dataset = train_dataset.repeat()
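
The factors passed to the augmentation layers above are fractions rather than absolute units. As a rough sketch of what they translate to, the snippet below converts them into degrees and pixels. The 224x224 poster size is an assumption for illustration only (the notebook defines `WIDTH` and `HEIGHT` elsewhere), and `augmentation_ranges` is a hypothetical helper, not part of Keras.

```python
import math

def augmentation_ranges(width, height, rotation=0.1, zoom=0.2, translation=(0.25, 0.25)):
    """Convert Keras-style augmentation factors into human-readable ranges.

    Assumes the documented conventions: RandomRotation factors are fractions
    of a full turn (2*pi), RandomZoom factors are fractions of the image size,
    and RandomTranslation takes (height_factor, width_factor).
    """
    height_factor, width_factor = translation
    return {
        "max_rotation_degrees": rotation * 360.0,
        "max_zoom_percent": zoom * 100.0,
        "max_shift_px_vertical": height_factor * height,
        "max_shift_px_horizontal": width_factor * width,
    }

# Hypothetical input size; substitute the notebook's WIDTH and HEIGHT
ranges = augmentation_ranges(width=224, height=224)
for name, value in ranges.items():
    print(f"{name}: +/-{value:g}")
```

Under these assumptions a poster can be rotated by up to 36 degrees and shifted by up to a quarter of its size in each direction, which may be worth keeping in mind when judging how much of the original poster survives augmentation.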


In [38]:
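from sklearn.metrics import f1_score

def evaluate(model):
    # Number of full batches in each split (any remainder is skipped)
    train_steps = len(train_df) // BATCH_SIZE
    test_steps = len(test_df) // BATCH_SIZE
    valid_steps = len(valid_df) // BATCH_SIZE

    # Use the same label columns, in the same order, that build_dataset(valid_df, ...) yields
    label_cols = [c for c in valid_df.columns if c not in ['Id', 'Genre', 'duplication_weight', 'duplication_prob', 'N/A']]
    total_preds = valid_steps * BATCH_SIZE  # total number of predicted samples
    y_valid = valid_df[label_cols].to_numpy()
    y_valid = y_valid[:total_preds]  # drop rows beyond the last full batch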
    
    print("\n\nEvaluate Training Data")
    model.evaluate(train_dataset, steps=train_steps)
    print("Evaluate Test Data")
    model.evaluate(test_dataset, steps=test_steps)
    print("Evaluate Validation Data")
    model.evaluate(valid_dataset, steps=valid_steps)


    print("Computing best threshold and associated F1 score")
    # After training, get predictions on the validation set
    y_pred_probs = model.predict(valid_dataset, steps=valid_steps)  # shape: (num_samples, num_classes)

    # Sweep thresholds from 0.01 to 0.99 and keep the one with the best macro F1.
    # np.arange avoids the drift that repeated float addition accumulates.
    best_threshold = 0.01
    best_score = 0
    for threshold in np.arange(0.01, 1.0, 0.01):
        y_pred_binary = (y_pred_probs > threshold).astype(int)
        score = f1_score(y_valid, y_pred_binary, average='macro', zero_division=0)
        if score > best_score:
            best_score = score
            best_threshold = threshold
    print(f"F1-Score: {round(best_score, 4)}, Best Threshold: {best_threshold}")
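
The threshold search in `evaluate` can be tried in isolation on a small hand-made example. The sketch below uses made-up labels and probabilities (not results from this model) and a small NumPy `macro_f1` helper, which is a hypothetical stand-in written to mirror `f1_score(..., average='macro', zero_division=0)`.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro-averaged F1 for binary indicator matrices; a class with no
    true and no predicted positives scores 0 (like zero_division=0)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    denom = 2 * tp + fp + fn
    f1 = np.divide(2 * tp, denom, out=np.zeros(len(tp), dtype=float), where=denom > 0)
    return f1.mean()

def best_threshold(y_true, y_probs, thresholds=np.arange(0.01, 1.0, 0.01)):
    """Return (threshold, score) with the highest macro F1; ties keep the lowest threshold."""
    best_t, best_score = thresholds[0], -1.0
    for t in thresholds:
        score = macro_f1(y_true, (y_probs > t).astype(int))
        if score > best_score:
            best_t, best_score = t, score
    return float(best_t), float(best_score)

# Made-up data: 4 samples, 3 classes
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_probs = np.array([[0.305, 0.055, 0.025],
                    [0.105, 0.255, 0.035],
                    [0.355, 0.205, 0.045],
                    [0.085, 0.065, 0.155]])

threshold, score = best_threshold(y_true, y_probs)
print(f"Best threshold: {threshold:.2f}, macro F1: {score:.4f}")
```

As in the notebook, a single global threshold is shared by all classes; a per-class threshold would be a different design and is not what `evaluate` does.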

In [39]:
Conv2D = tf.keras.layers.Conv2D
MaxPooling2D = tf.keras.layers.MaxPooling2D
Flatten = tf.keras.layers.Flatten
Dense = tf.keras.layers.Dense
Dropout = tf.keras.layers.Dropout
ImageDataGenerator = tf.keras.preprocessing.image.ImageDataGenerator
BatchNormalization = tf.keras.layers.BatchNormalization
Activation = tf.keras.layers.Activation

In [40]:
NUMBER_OF_CLASSES = 24
NUMBER_OF_EPOCHS = 1

INPUT_SHAPE = (WIDTH, HEIGHT, 3)
KERNEL_SIZE = 3  # Typical filter size for CNNs
POOL_SIZE = (2, 2)  # Typical pooling size

STRIDES = 2

model = tf.keras.models.Sequential(
    [
        # Declare the input shape with an explicit Input layer
        tf.keras.Input(shape=INPUT_SHAPE),
        Conv2D(32, KERNEL_SIZE, activation='relu', strides=STRIDES),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(64, KERNEL_SIZE, activation='relu', strides=STRIDES),
        MaxPooling2D(pool_size=POOL_SIZE),

        Conv2D(128, KERNEL_SIZE, activation='relu', strides=STRIDES),
        MaxPooling2D(pool_size=POOL_SIZE),

        Flatten(),
        Dense(1024),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2),

        Dense(1024),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2),

        Dense(1024),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2),

        Dense(NUMBER_OF_CLASSES, activation='sigmoid')
    ]
)

model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['Precision', 'Recall'])

steps_per_epoch = train_df.shape[0] // BATCH_SIZE
num_valid_samples = len(valid_df)
validation_steps = num_valid_samples // BATCH_SIZE

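history = model.fit(
    train_dataset,
    # Validate on the validation split (validation_steps is derived from valid_df),
    # so the test set stays unseen during training
    validation_data=valid_dataset,
    validation_steps=validation_steps,
    steps_per_epoch=steps_per_epoch,
    epochs=NUMBER_OF_EPOCHS
)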


evaluate(model)



323/323 ━━━━━━━━━━━━━━━━━━━━ 298s 796ms/step - Precision: 0.3970 - Recall: 0.1258 - loss: 0.3313 - val_Precision: 0.4690 - val_Recall: 0.1037 - val_loss: 0.3178


Evaluate Training Data
323/323 ━━━━━━━━━━━━━━━━━━━━ 95s 135ms/step - Precision: 0.5021 - Recall: 0.1010 - loss: 0.3196
Evaluate Test Data
107/107 ━━━━━━━━━━━━━━━━━━━━ 15s 140ms/step - Precision: 0.4682 - Recall: 0.1037 - loss: 0.3173
Evaluate Validation Data
107/107 ━━━━━━━━━━━━━━━━━━━━ 15s 141ms/step - Precision: 0.4702 - Recall: 0.1047 - loss: 0.3184
Computing best threshold and associated F1 score
107/107 ━━━━━━━━━━━━━━━━━━━━ 15s 137ms/step
F1-Score: 0.2416, Best Threshold: 0.04
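
To see how much the strided convolutions and pooling layers shrink each poster before the dense layers, the feature-map size can be traced by hand. The sketch below does so for the model above, assuming Keras defaults (`padding='valid'` for both layer types, pooling stride equal to the pool size) and a hypothetical 224x224 input, since `WIDTH` and `HEIGHT` are defined elsewhere in the notebook. `conv_out` and `trace_shapes` are illustrative helpers, not library functions.

```python
def conv_out(size, kernel, stride):
    """Output length along one axis for a 'valid' convolution or pooling."""
    return (size - kernel) // stride + 1

def trace_shapes(width, height, filters=(32, 64, 128), kernel=3, conv_stride=2, pool=2):
    """Return the (w, h, channels) shape after each Conv2D + MaxPooling2D pair."""
    shapes = []
    w, h = width, height
    for channels in filters:
        # Conv2D with 'valid' padding and the given stride
        w, h = conv_out(w, kernel, conv_stride), conv_out(h, kernel, conv_stride)
        # MaxPooling2D with stride equal to the pool size
        w, h = conv_out(w, pool, pool), conv_out(h, pool, pool)
        shapes.append((w, h, channels))
    return shapes

shapes = trace_shapes(224, 224)  # 224x224 is an assumed input size
for i, shape in enumerate(shapes, start=1):
    print(f"after block {i}: {shape}")

w, h, c = shapes[-1]
flattened = w * h * c
print("flattened features:", flattened)
```

Under these assumptions only a 3x3 grid of 128 channels (1,152 features) reaches the first `Dense` layer, so combining `strides=2` with pooling discards spatial detail quickly.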


## Evaluation

### Metrics
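- Precision - the fraction of predicted genres that are correct
- Recall - the fraction of a poster's true genres that the model predicts (each poster can have several)
- F1-Score - the harmonic mean of Precision and Recall, balancing the two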
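
As a small worked example of these three metrics, the sketch below computes precision, recall, and F1 by hand, pooling counts over all posters and genres. The two posters and their labels are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up example: 2 posters, 4 genres (columns: action, comedy, drama, horror)
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 1]])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted and correct
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted but wrong
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # true genres that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Note that this pools all labels together, whereas the `evaluate` function above reports a macro-averaged F1, which computes F1 per genre and then averages, so rare genres weigh as much as common ones.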


### Validation

## Conclusion

## References & Acknowledgements