
# Multimodal Training Using Hugging Face, Keras, and TensorFlow

This notebook walks through training a multimodal model that takes both text and image inputs. We'll use the Flickr8k dataset, which pairs each image with five captions, and a small pretrained language model from Hugging Face.

## 1. Environment Setup

Ensure you have the necessary Python packages installed.
    

In [None]:

!pip install tensorflow transformers datasets tensorflow_hub matplotlib 'keras==3.2'



## 2. Load the Hugging Face Model

We'll start by loading a small pretrained language model from Hugging Face to process the text. This example uses `distilbert-base-uncased`.
    

In [None]:

from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf

BATCH_SIZE = 128  # Adjust based on your hardware and RAM
EPOCHS = 5
PATIENCE = 5

# Set the seed for reproducibility
tf.random.set_seed(42)

# Load the tokenizer for the DistilBERT model
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Load the DistilBERT model itself
text_model = TFAutoModel.from_pretrained('distilbert-base-uncased')



## 3. Prepare the Image Model

Next, we prepare the image encoder. We use `MobileNetV2` from `tensorflow.keras.applications`, pre-trained on ImageNet, because it is lightweight and effective for this task.
    

In [None]:
from tensorflow.keras.applications import MobileNetV2

# Load a pre-trained MobileNetV2 model, excluding the top layers (which are meant for classification)
image_model = MobileNetV2(include_top=False, input_shape=(224, 224, 3))

# Add a GlobalAveragePooling2D layer to reduce the spatial dimensions of the feature map
image_model = tf.keras.Sequential([image_model, tf.keras.layers.GlobalAveragePooling2D()])


## 4. Load and Preprocess the Dataset

We will now load the Flickr8k dataset, which includes images and captions. We'll preprocess the images and captions, making them ready for input into our model.
    

In [None]:

from datasets import load_dataset

# Load the Flickr8k dataset from the Hugging Face datasets library
dataset = load_dataset("jxie/flickr8k", split='train[:500]')  # We use 500 examples to make it fast


In [None]:
dataset[0]

In [None]:
import random
import numpy as np
from PIL import Image

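def preprocess_text(example, label_index):
    caption_key = f'caption_{label_index}'
    # Apply the tokenizer, padding or truncating every caption to 128 tokens
    encodings = tokenizer(example[caption_key], truncation=True, padding='max_length', max_length=128)
    return {'input_ids': encodings['input_ids'], 'attention_mask': encodings['attention_mask'], 'original': example}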

def preprocess_image(example):
    try:
        image = example['image']  # This is already a PIL image
        image = image.convert('RGB')
        image = np.array(image)  # Convert PIL image to array
        image = tf.image.resize(image, (224, 224)).numpy()  # Resize the image to the required input size
        image = image / 127.5 - 1.0  # Normalize pixels to [-1, 1], the range MobileNetV2 expects
        return image
    except Exception as e:
        print(f"Error processing image: {e}")
        return None  # Return None if the image is bad

def map_function(example):
    label_index = random.randint(0, 4)  # Pick one of the five captions at random
    text = preprocess_text(example, label_index)  # Tokenize once and reuse the result
    return {
        'input_ids': text['input_ids'],
        'attention_mask': text['attention_mask'],
        'image': preprocess_image(example),
        'label': label_index
    }
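
To make the role of `input_ids` and `attention_mask` concrete, here is a minimal sketch of fixed-length padding and truncation. It uses a toy whitespace tokenizer with a made-up vocabulary: `TOY_VOCAB`, `toy_encode`, and the ID values are hypothetical and do not match DistilBERT's WordPiece vocabulary. The real tokenizer also adds special tokens, which this sketch leaves out.

```python
# Toy illustration of padding and truncation to a fixed length.
# TOY_VOCAB and toy_encode are hypothetical; they do not match DistilBERT's tokenizer.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "a": 2, "dog": 3, "runs": 4, "on": 5, "grass": 6}

def toy_encode(text, max_length):
    """Return input_ids and attention_mask, both exactly max_length long."""
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.lower().split()]
    ids = ids[:max_length]                # truncate long captions
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [TOY_VOCAB["[PAD]"]] * n_pad   # pad short captions
    mask += [0] * n_pad                   # 0 marks padding
    return {"input_ids": ids, "attention_mask": mask}

encoded = toy_encode("A dog runs on grass", max_length=8)
print(encoded["input_ids"])       # [2, 3, 4, 5, 6, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every caption ends up with the same length, which is what lets the examples be stacked into one integer tensor per batch.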

In [None]:
dataset = dataset.map(map_function)

In [None]:
dataset[0]
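
One common way to normalize images for MobileNetV2 is to scale the 8-bit pixel values to the range [-1, 1], which is the range its pretrained ImageNet weights expect. The NumPy sketch below shows that scaling on a tiny made-up array; `scale_to_unit_range` and `fake_image` are hypothetical names used only for illustration.

```python
import numpy as np

def scale_to_unit_range(pixels):
    """Map uint8 pixel values in [0, 255] to float32 values in [-1, 1]."""
    return pixels.astype(np.float32) / 127.5 - 1.0

# A fake 2x2 RGB image standing in for a resized photo.
fake_image = np.array([[[0, 0, 0], [255, 255, 255]],
                       [[127, 128, 64], [10, 200, 30]]], dtype=np.uint8)

scaled = scale_to_unit_range(fake_image)
print(scaled.dtype, scaled.min(), scaled.max())  # float32 -1.0 1.0
```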

In [None]:
import tensorflow as tf

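# Ensure that the images are in the correct float32 format and text inputs are in int32
def prepare_dataset(dataset):
    return tf.data.Dataset.from_tensor_slices((
        {
            'input_ids': np.array(dataset['input_ids'], dtype=np.int32),  # Text input IDs
            'attention_mask': np.array(dataset['attention_mask'], dtype=np.int32),  # Attention masks
            'image_input': np.array(dataset['image'], dtype=np.float32),  # Image inputs
        },
        np.array(dataset['label'], dtype=np.int32)  # Labels
    )).shuffle(1000).batch(BATCH_SIZE)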

train_dataset = prepare_dataset(dataset)

In [None]:
for batch in train_dataset.take(1):
    print(batch[0]['input_ids'].shape)
    print(batch[0]['attention_mask'].shape)
    print(batch[0]['image_input'].shape)
    print(batch[1].shape)
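
With 500 examples and `BATCH_SIZE = 128`, the final batch is smaller than the others, because `batch` keeps the remainder by default. The arithmetic can be checked with a short sketch; `batch_sizes` is a hypothetical helper, not part of `tf.data`.

```python
def batch_sizes(num_examples, batch_size):
    """Sizes of the batches produced when the remainder is kept."""
    full, remainder = divmod(num_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])

sizes = batch_sizes(500, 128)
print(sizes)       # [128, 128, 128, 116]
print(len(sizes))  # 4
```

So each epoch runs four steps, and the leading dimension printed for the first batch should be 128.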


## 5. Build the Multimodal Model

Now that we have models for both text and image data, we can combine them into a single multimodal model. This model takes the tokenized caption and the image as inputs and outputs a probability for each of the five caption indices.
    

In [None]:
from tensorflow.keras.layers import Layer

class ReduceMeanLayer(Layer):
    def call(self, inputs):
        return tf.reduce_mean(inputs, axis=1)

# Instantiate the custom layer
reduce_mean_layer = ReduceMeanLayer()
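
`tf.reduce_mean(inputs, axis=1)` averages over every token position, padding included. A common alternative is to weight the average by the attention mask so that padding does not dilute the sentence vector. The NumPy sketch below contrasts the two on made-up numbers; `masked_mean` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import numpy as np

def masked_mean(token_features, attention_mask):
    """Average token vectors over the sequence axis, ignoring padded positions."""
    mask = attention_mask[..., np.newaxis].astype(token_features.dtype)  # (batch, seq, 1)
    summed = (token_features * mask).sum(axis=1)                         # (batch, hidden)
    counts = np.maximum(mask.sum(axis=1), 1.0)                           # avoid division by zero
    return summed / counts

# One sequence of 4 positions with hidden size 2; the last two positions are padding.
features = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0, 0]])

plain = features.mean(axis=1)         # what a reduce-mean over axis 1 computes
masked = masked_mean(features, mask)  # average over the two real tokens only
print(plain)   # [[5.5 6. ]]
print(masked)  # [[2. 3.]]
```

The plain mean is pulled toward the padding vectors, while the masked mean reflects only the real tokens. Whether this matters in practice depends on how much padding the captions carry.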


In [None]:
from tensorflow.keras.layers import Input, Lambda, Dense, Concatenate
from tensorflow.keras.models import Model

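# Define the input layers for text and images. Name them input_ids, attention_mask and image_input
input_ids = Input(shape=(128,), dtype='int32', name='input_ids')
attention_mask = Input(shape=(128,), dtype='int32', name='attention_mask')
image_input = Input(shape=(224, 224, 3), dtype='float32', name='image_input')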

# Process the text inputs using a Lambda layer
def distilbert_encode(inputs):
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']
    return text_model(input_ids=input_ids, attention_mask=attention_mask)[0]

text_features = Lambda(distilbert_encode, output_shape=(128, 768))({
    'input_ids': input_ids,
    'attention_mask': attention_mask
})

# Apply the custom reduce mean layer to the text features
reduced_text_features = reduce_mean_layer(text_features)

# Process the image input through the image model
image_features = image_model(image_input)

# Combine (concatenate) the features from both modalities
combined_features = Concatenate()([reduced_text_features, image_features])

# Add a dense layer for learning complex patterns
dense_layer = Dense(128, activation='relu')(combined_features)

# Output layer with 5 units (for 5 possible captions), using softmax activation
output = Dense(5, activation='softmax')(dense_layer)

# Create the full multimodal model
multimodal_model = Model(inputs=[input_ids, attention_mask, image_input], outputs=output)


In [None]:
multimodal_model.summary()
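
The fusion step is plain concatenation followed by two dense layers. The NumPy sketch below traces the shapes with random weights, assuming a 768-dimensional text vector (DistilBERT's hidden size) and a 1280-dimensional image vector (MobileNetV2's pooled output at its default width). All weights and inputs are random placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 4
text_vec = rng.normal(size=(batch, 768))    # pooled text features (placeholder values)
image_vec = rng.normal(size=(batch, 1280))  # pooled image features (placeholder values)

combined = np.concatenate([text_vec, image_vec], axis=-1)  # (4, 2048)

# Dense(128, relu) followed by Dense(5, softmax), with random placeholder weights.
w1, b1 = rng.normal(scale=0.01, size=(2048, 128)), np.zeros(128)
w2, b2 = rng.normal(scale=0.01, size=(128, 5)), np.zeros(5)

hidden = np.maximum(combined @ w1 + b1, 0.0)
logits = hidden @ w2 + b2
logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

print(combined.shape, hidden.shape, probs.shape)  # (4, 2048) (4, 128) (4, 5)
```

Each row of `probs` sums to 1, one probability per caption index.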


## 6. Compile and Train the Model

We now compile the model with the Adam optimizer and sparse categorical cross-entropy, a loss suited to integer class labels. After that, we'll train the model on our dataset.
    

In [None]:
# Compile the model
multimodal_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model for 5 epochs
history = multimodal_model.fit(train_dataset, epochs=EPOCHS)
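
`PATIENCE` is defined earlier, but the training step does not use it. The usual role for such a constant is early stopping: training halts once the monitored loss has failed to improve for that many consecutive epochs. Below is a plain-Python sketch of that rule on made-up loss values. `stopping_epoch` is a hypothetical helper; in Keras, the built-in `EarlyStopping` callback with its `patience` argument plays this role.

```python
def stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up losses: improvement stalls after epoch 3.
losses = [1.60, 1.52, 1.48, 1.49, 1.50, 1.51]
print(stopping_epoch(losses, patience=2))  # 5
print(stopping_epoch(losses, patience=5))  # None
```

Under this rule, `EPOCHS = 5` with `PATIENCE = 5` could never stop early, since at most four epochs can pass without improvement after the first.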



## 7. Evaluate the Model

After training, we evaluate the model on the training data, so the numbers below measure fit rather than generalization. We also plot the training accuracy per epoch.
    

In [None]:

import matplotlib.pyplot as plt

# Evaluate the model on the training set (no held-out split is used here)
results = multimodal_model.evaluate(train_dataset)
print(f'Training loss: {results[0]}, Training accuracy: {results[1]}')

# Plot the training history for accuracy
plt.plot(history.history['accuracy'], label='accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
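
To judge these numbers, it helps to know what chance looks like. With five classes, a model that always outputs a uniform distribution scores 20% accuracy in expectation and a sparse categorical cross-entropy of ln 5. The sketch below computes that loss from its definition, the negative log of the probability assigned to the true class; the probability vectors are illustrative.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: negative log-probability of the true class."""
    return -math.log(probs[label])

uniform = [0.2] * 5                                   # no preference among the five classes
confident_right = [0.9, 0.025, 0.025, 0.025, 0.025]   # strongly favors class 0

chance_loss = sparse_categorical_crossentropy(uniform, label=3)
good_loss = sparse_categorical_crossentropy(confident_right, label=0)

print(round(chance_loss, 4))  # 1.6094
print(round(good_loss, 4))    # 0.1054
```

Because the label in this notebook is the index of a randomly chosen caption, results close to a loss of 1.61 and an accuracy of 0.2 would suggest the model has not picked up a usable signal.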
