

Gemma, a lightweight open model built from the same research and technology used to develop the Gemini models, is now available in the KerasNLP collection. Thanks to Keras 3, Gemma runs on the JAX, PyTorch, and TensorFlow backends. Alongside this launch, Keras introduces several new features tailored for large language models, including a new LoRA (Low-Rank Adaptation) API and capabilities for large-scale model-parallel training, which means that we can also fine-tune using TPUs!

Following the original tutorial based on the Hugging Face interface (https://www.kaggle.com/code/lucamassaron/fine-tune-gemma-7b-it-for-sentiment-analysis), we are again going to deal with sentiment analysis on financial and economic information, but this time having Gemma work with a TPU VM v3-8. Sentiment analysis on financial and economic information is highly relevant for businesses for several key reasons, ranging from market insights (gaining valuable insights into market trends, investor confidence, and consumer behavior) to risk management (identifying potential reputational risks) to investment decisions (by gauging the sentiment of stakeholders, investors, and the general public, businesses can assess the potential success of various investment opportunities).

Before getting into the technicalities of fine-tuning a large language model like Gemma, we had to find a suitable dataset to demonstrate the potential of fine-tuning.

Particularly within the realm of finance and economic texts, annotated datasets are notably rare, with many being exclusively reserved for proprietary purposes. To address the issue of insufficient training data, scholars from the Aalto University School of Business introduced in 2014 a set of approximately 5000 sentences. This collection aimed to establish human-annotated benchmarks, serving as a standard for evaluating alternative modeling techniques. The involved annotators (16 people with adequate background knowledge on financial markets) were instructed to assess the sentences solely from the perspective of an investor, evaluating whether the news potentially holds a positive, negative, or neutral impact on the stock price.

The FinancialPhraseBank dataset is a comprehensive collection that captures the sentiments of financial news headlines from the viewpoint of a retail investor. Comprising two key columns, namely "Sentiment" and "News Headline," the dataset classifies sentiments as negative, neutral, or positive. This structured dataset serves as a valuable resource for analyzing and understanding the complex dynamics of sentiment in the domain of financial news. It has been used in various studies and research initiatives since its introduction in the work by Malo, P., Sinha, A., Korhonen, P., Wallenius, J., and Takala, P., "Good debt or bad debt: Detecting semantic orientations in economic texts", published in the Journal of the Association for Information Science and Technology in 2014.

As a first step, we install the specific libraries necessary to make this example work.

* tensorflow-cpu: This library provides the backbone for executing TensorFlow computations on CPUs, enabling efficient processing of machine learning tasks without the need for GPU acceleration.

* keras-nlp: The KerasNLP library offers a rich collection of tools and utilities tailored for natural language processing tasks. It bundles various models and functionalities, including Gemma, discussed earlier.

* tensorflow-hub: TensorFlow Hub serves as a repository for pre-trained machine learning models, enabling easy access to a wide array of models for tasks such as transfer learning and feature extraction. Its integration enhances the capabilities of the example by facilitating the use of pre-trained embeddings and models.

* keras: Keras, a high-level neural networks API, provides a user-friendly interface for building, training, and deploying deep learning models. Its integration with TensorFlow allows for seamless interoperability and simplifies the development process, making it an indispensable component for this example.

In [None]:
!pip install -q tensorflow-cpu
!pip install -q -U keras-nlp tensorflow-hub
!pip install -q -U "keras>=3"

Thanks to Keras 3, users gain the flexibility to select the backend on which their model runs.

The code imports the os module and sets two environment variables: one selects JAX as the backend, the other pre-allocates the TPU memory:

* KERAS_BACKEND: The Keras 3 distribution API is only implemented for the JAX backend for now.
* XLA_PYTHON_CLIENT_MEM_FRACTION: Pre-allocate 90% of TPU memory to minimize memory fragmentation and allocation overhead.

In [None]:
import os

os.environ["KERAS_BACKEND"] = "jax"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"
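Keras reads KERAS_BACKEND when it is first imported, so the variables above have to be set before any `import keras` statement. The ordering constraint can be sketched as follows; `configure_backend` is a hypothetical helper written only for illustration, it is not part of Keras or JAX.

```python
import os
import sys


def configure_backend(backend="jax", mem_fraction=0.9):
    """Hypothetical helper: set the backend variables before Keras is imported."""
    # Keras picks its backend at import time, so setting it later has no effect
    if "keras" in sys.modules:
        raise RuntimeError("Set KERAS_BACKEND before importing keras.")
    if not 0.0 < mem_fraction <= 1.0:
        raise ValueError("mem_fraction must be in (0, 1].")
    # Environment variables are always strings
    os.environ["KERAS_BACKEND"] = backend
    os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = str(mem_fraction)
    names = ("KERAS_BACKEND", "XLA_PYTHON_CLIENT_MEM_FRACTION")
    return {name: os.environ[name] for name in names}


settings = configure_backend("jax", 0.9)
print(settings)
```

Restarting the notebook kernel is the simplest way to switch backend once Keras has already been imported.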

The next cell imports the warnings module and calls warnings.filterwarnings("ignore"), which suppresses all warnings so that they are not displayed. During training, many warnings appear that do not prevent the fine-tuning but can be distracting and make you wonder whether you are doing things correctly.

In [None]:
import warnings
warnings.filterwarnings("ignore")

The following cell contains all the other imports needed to run the notebook. In addition, the jax.devices() function retrieves the list of all available devices that JAX can use for computation.

In [None]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import matplotlib.pyplot as plt

import keras
import keras_nlp
from tensorflow.data import Dataset

from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split

import jax
print(jax.devices())

To fine-tune the larger Gemma 7B model efficiently, a distributed setup is recommended, such as leveraging a TPUv3 with 8 TPU cores available on platforms like Kaggle or an 8-GPU machine from Google Cloud. The provided code snippet configures the model for distributed training using model parallelism. Essentially, it organizes the 8 accelerators into a 1 x 8 matrix, where the dimensions represent "batch" and "model". Model weights are distributed across the "model" dimension, which is split between the 8 accelerators, while data batches remain unpartitioned due to the "batch" dimension being set to 1.

The code initializes a device mesh and configures the model layout accordingly. It sets up the mapping between model components and the distributed layout, specifying how different parts of the model are distributed across the available accelerators. Once the model configuration is set, the Gemma 7B model is loaded and ready for training using the model.fit() method or text generation using the generate() method.

In [None]:
def create_device_mesh():
    """
    Create a device mesh with (1, 8) shape so that the weights are sharded across all 8 TPUs
    """
    device_mesh = keras.distribution.DeviceMesh(
        (1, 8),
        ["batch", "model"],
        devices=keras.distribution.list_devices())
    return device_mesh

model_dim = "model"
device_mesh = create_device_mesh()
layout_map = keras.distribution.LayoutMap(device_mesh)

# Weights that match 'token_embedding/embeddings' will be sharded on 8 TPUs
layout_map["token_embedding/embeddings"] = (None, model_dim)
# Regex to match against the query, key and value matrices in the decoder
# attention layers
layout_map["decoder_block.*attention.*(query|key|value).*kernel"] = (None, model_dim, None)

layout_map["decoder_block.*attention_output.*kernel"] = (
    None, None, model_dim)
layout_map["decoder_block.*ffw_gating.*kernel"] = (model_dim, None)
layout_map["decoder_block.*ffw_linear.*kernel"] = (None, model_dim)

model_parallel = keras.distribution.ModelParallel(device_mesh, layout_map, batch_dim_name="batch")
keras.distribution.set_distribution(model_parallel)
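To get a feeling for what the layout tuples above mean, the following sketch computes the per-device shape of a sharded tensor in plain Python. It is only an illustration: `shard_shape` is a hypothetical helper, the tensor shapes are toy values rather than the real Gemma dimensions, and the sketch assumes every sharded axis divides evenly across the mesh.

```python
def shard_shape(shape, layout, mesh):
    """Hypothetical helper: per-device shape of a tensor sharded by `layout`.

    shape:  full tensor shape, e.g. (1024, 512)
    layout: one mesh-axis name (or None for "not sharded") per tensor axis
    mesh:   mapping from mesh-axis name to the number of devices on that axis
    """
    per_device = []
    for size, axis_name in zip(shape, layout):
        parts = 1 if axis_name is None else mesh[axis_name]
        if size % parts != 0:
            raise ValueError(f"Axis of size {size} does not split into {parts} parts")
        per_device.append(size // parts)
    return tuple(per_device)


# Same mesh as in the notebook: 1 device along "batch", 8 along "model"
mesh = {"batch": 1, "model": 8}

# Toy shapes chosen for readability
embedding_shard = shard_shape((1024, 512), (None, "model"), mesh)
attention_shard = shard_shape((16, 512, 64), (None, "model", None), mesh)

print(embedding_shard)  # (1024, 64)
print(attention_shard)  # (16, 64, 64)
```

Each accelerator therefore stores one eighth of every sharded weight, which is what makes a model of this size fit in memory.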

Gemma models come with a user-friendly KerasNLP API and a highly intuitive Keras implementation. Instantiating the model requires just a single line of code, making it exceptionally straightforward to use.

In [None]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_7b_en")

The code in the next cell performs the following steps:

1. Reads the input dataset from the all-data.csv file, a comma-separated values (CSV) file with two columns: sentiment and text.
2. Splits the dataset into training and test sets, with 300 samples per sentiment class in each set. The split is done class by class, so that each set contains a balanced sample of positive, neutral, and negative sentiments.
3. Shuffles the train data in a replicable order (random_state=10).
4. Treats the residual examples, those in neither train nor test, as evaluation data for reporting purposes during training (they are not used for early stopping). The evaluation data is sampled with replacement in order to have a 50/50/50 sample (negative instances are very few, hence they have to be repeated).
5. Transforms the texts contained in the train, evaluation, and test data into prompts to be used by Gemma: the train and evaluation prompts contain the expected answer we want to fine-tune the model with, while the test prompts leave it blank.
6. Keeps the true test labels in y_true; the train and evaluation prompts are wrapped into tf.data datasets in the cell that follows.

This prepares, in a single cell, the X_train, X_eval, and X_test DataFrames to be used in our fine-tuning.

In [None]:
filename = "../input/sentiment-analysis-for-financial-news/all-data.csv"

df = pd.read_csv(filename, 
                 names=["sentiment", "text"],
                 encoding="utf-8", encoding_errors="replace")

X_train = list()
X_test = list()
for sentiment in ["positive", "neutral", "negative"]:
    train, test  = train_test_split(df[df.sentiment==sentiment], 
                                    train_size=300,
                                    test_size=300, 
                                    random_state=42)
    X_train.append(train)
    X_test.append(test)

X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

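# Residual rows: everything not drawn into train or test for ANY sentiment class
# (X_train and X_test still carry the original df index at this point)
used_idx = set(X_train.index) | set(X_test.index)
X_eval = df[~df.index.isin(used_idx)]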
X_eval = (X_eval
          .groupby('sentiment', group_keys=False)
          .apply(lambda x: x.sample(n=50, random_state=10, replace=True)))
X_train = X_train.reset_index(drop=True)

def generate_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative"

            [{data_point["text"]}] = {data_point["sentiment"]}
            """.strip()

def generate_test_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets, 
            determine if it is positive, neutral, or negative, and return the answer as 
            the corresponding sentiment label "positive" or "neutral" or "negative"

            [{data_point["text"]}] = 

            """.strip()

X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), 
                       columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), 
                      columns=["text"])

y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])
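The construction of the evaluation set (residual rows, then resampling with replacement) can be tried on a toy DataFrame. This is a minimal sketch with made-up rows and hand-picked train and test indices; it uses pandas' `DataFrameGroupBy.sample` instead of the `groupby(...).apply(...)` call of the cell above, which is a substitution made here for brevity.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
df = pd.DataFrame({
    "sentiment": ["positive"] * 6 + ["neutral"] * 6 + ["negative"] * 4,
    "text": [f"headline {i}" for i in range(16)],
})

# Pretend these rows were already drawn into the train and test sets
train_idx = [0, 1, 6, 7, 12]
test_idx = [2, 3, 8, 9, 13]

# Residual rows: in neither train nor test, whatever their class
used = set(train_idx) | set(test_idx)
residual = df[~df.index.isin(used)]

# Same number of rows per class; replace=True lets small classes repeat rows
eval_df = (residual
           .groupby("sentiment", group_keys=False)
           .sample(n=3, replace=True, random_state=10))

print(eval_df["sentiment"].value_counts().to_dict())
```

Only two residual rows are available per class here, so drawing three of them necessarily repeats at least one row, just as the few residual negative headlines are repeated in the notebook.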

This code snippet creates TensorFlow datasets (train_data and eval_data) from the text values of training (X_train) and evaluation (X_eval) datasets, respectively. Each dataset is batched with a batch size of 1, meaning each element in the dataset is treated as an individual batch. This allows for efficient processing of the text data during training and evaluation using TensorFlow's tf.data module.

In [None]:
train_data = Dataset.from_tensor_slices(X_train.text.values).batch(1)
eval_data = Dataset.from_tensor_slices(X_eval.text.values).batch(1)

As a test we retrieve a single element from the train_data dataset.

In [None]:
train_data.unbatch().take(1).get_single_element().numpy()

The following function, predict, takes in a set of test data X_test, a language model model, and an optional max_length parameter, which defaults to 128. It iterates through each instance in X_test, extracting the text prompt. Then, it generates a text continuation from the model given the prompt, up to a maximum length of max_length. If the generated output contains no line with an "=" sign (the marker that precedes the answer), it doubles the length limit and retries until a usable output is obtained or until the limit exceeds 512.

Once a usable output is generated, the function extracts the answer, that is, the text after the "=" sign, and appends the corresponding sentiment label (positive, negative, or neutral, with none as a fallback when no label is recognized) to the list y_pred. Finally, it returns the list of predicted sentiment labels for each instance in the test data.

In [None]:
def predict(X_test, model, max_length=128):
    y_pred = []
    for i in range(len(X_test)):
        prompt = X_test.iloc[i]["text"]
        # Every prompt starts again from the default length budget
        length = max_length

        while True:
            outputs = model.generate(prompt, max_length=length)
            # Keep only the lines that carry an answer marker
            result = [item for item in outputs.split("\n") if "=" in item]
            if result:
                break
            # No answer line found: double the budget and retry
            length *= 2
            if length > 512:
                # Give up on this prompt and fall back to "none"
                result = [" = none"]
                break
        answer = result[0].split("=")[-1].lower()
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred
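The parsing step of predict can be checked in isolation, without loading the model. The sketch below reproduces only the label extraction on hand-written strings; `extract_label` is a hypothetical stand-alone helper and the example texts are invented.

```python
def extract_label(generated_text):
    """Hypothetical helper: pull the sentiment label out of a generated text."""
    # The answer is expected on the first line that contains an "=" sign
    lines = [line for line in generated_text.split("\n") if "=" in line]
    if not lines:
        return "none"
    # Whatever follows the last "=" on that line is the candidate answer
    answer = lines[0].split("=")[-1].lower()
    # Same order of checks as in predict
    for label in ("positive", "negative", "neutral"):
        if label in answer:
            return label
    return "none"


print(extract_label("Analyze the sentiment...\n\n[Profit rose sharply] = Positive"))
print(extract_label("[Sales were flat] = maybe"))
print(extract_label("no answer marker in this text"))
```

The three calls print positive, none, and none: matching is case-insensitive, and anything that is not one of the three labels falls back to none.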

The evaluate function is designed to assess the performance of a sentiment classification model by comparing its predicted labels (y_pred) against the true labels (y_true).

First, it defines a list of sentiment labels and a mapping from these labels to numerical values. It then uses this mapping to convert the sentiment labels in both y_true and y_pred arrays to numerical representations. Note that a none prediction, like any unrecognized label, is mapped to the same value as neutral.

The function computes the overall accuracy of the predictions and prints it out. It also generates accuracy scores for each individual sentiment label, providing insight into how well the model performs for each class.

Furthermore, it generates a classification report, which includes precision, recall, F1-score, and support metrics for each sentiment class, providing a more detailed evaluation of the model's performance.

Finally, the function computes and displays a confusion matrix, where each row corresponds to a true sentiment class and each column to a predicted one, so that the diagonal counts the correct predictions and the off-diagonal cells show which classes get confused with each other.

Overall, this function provides a comprehensive evaluation of the sentiment classification model's performance, including overall accuracy, accuracy per class, detailed classification report metrics, and a confusion matrix.

In [None]:
def evaluate(y_true, y_pred):
    labels = ['positive', 'neutral', 'negative']
    mapping = {'positive': 2, 'neutral': 1, 'none':1, 'negative': 0}
    def map_func(x):
        return mapping.get(x, 1)
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)
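The per-label accuracy printed by evaluate is computed only on the examples that truly belong to a label, so it coincides with the recall of that class, which can also be read off the confusion matrix as the diagonal divided by the row sums. A small NumPy sketch with made-up labels illustrates the relationship (the matrix is built by hand here instead of calling scikit-learn):

```python
import numpy as np

# Made-up labels using the same coding as evaluate: 0=negative, 1=neutral, 2=positive
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2])

# Confusion matrix: rows are true classes, columns are predicted classes
conf = np.zeros((3, 3), dtype=int)
np.add.at(conf, (y_true, y_pred), 1)

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy = conf.trace() / conf.sum()

# Per-class accuracy as computed in evaluate, i.e. the recall of each class
per_class = conf.diagonal() / conf.sum(axis=1)

print(conf)
print(round(accuracy, 3))
print(per_class.round(3))
```

On these toy labels the overall accuracy is 6/9, while the per-class values are 1/2, 2/3, and 3/4.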

At this point, we are ready to test the Gemma 7B model and see how it performs on our problem without any fine-tuning. This allows us to get insights on the model itself and establish a baseline.

In [None]:
y_pred = predict(X_test, gemma_lm)
evaluate(y_true, y_pred)

Now it is time to fine-tune the Gemma 7B model.

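Calling enable_lora(rank=8) freezes the original weight matrices of the pertinent layers and adds to each of them a trainable update expressed as the product of two low-rank matrices (A x B, with rank 8). Since only these small matrices are trained, the number of trainable parameters within the model drops drastically.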

In [None]:
gemma_lm.backbone.enable_lora(rank=8)
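The low-rank idea can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions, not the KerasNLP implementation: the layer sizes are toy values, A is given a small random initialization and B a zero initialization (a common LoRA choice), and no scaling factor is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 48, 8  # toy sizes, not Gemma's

W = rng.normal(size=(d_in, d_out))        # pretrained weights, kept frozen
A = rng.normal(size=(d_in, rank)) * 0.01  # trainable low-rank factor
B = np.zeros((rank, d_out))               # trainable, starts at zero

x = rng.normal(size=(2, d_in))
y_base = x @ W
y_lora = x @ (W + A @ B)  # adapted layer: frozen weights plus low-rank update

full_params = W.size
lora_params = A.size + B.size

print(full_params, lora_params)     # 3072 896
print(np.allclose(y_base, y_lora))  # True: B is zero, so nothing changes yet
```

With these toy sizes the adapter trains 896 values instead of 3072, and because B starts at zero the adapted layer initially behaves exactly like the pretrained one.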

First, we set the sequence length of Gemma's preprocessor to 512. This defines the maximum length of the input sequences that the model processes during fine-tuning.

Then we define the optimizer using AdamW, a variant of Adam that incorporates weight decay regularization. Parameters such as learning rate, weight decay, clip value, and gradient accumulation steps are also configured here.

Afterwards, we specify which variables should be excluded from weight decay. In this case, biases and scale parameters are excluded, which can help prevent unnecessary regularization of these parameters.

Finally, we compile the Gemma language model with its loss, optimizer, and metrics. Sparse categorical crossentropy is computed with from_logits=True because the model outputs raw logits over the vocabulary rather than probabilities. The optimizer defined above is passed along with weighted metrics for evaluation purposes.

As a last step, we print a summary of the Gemma language model, which provides information about its architecture, layer configurations, and the number of trainable parameters.

In [None]:
gemma_lm.preprocessor.sequence_length = 512

optimizer = keras.optimizers.AdamW(
    learning_rate=2e-4,
    weight_decay=0.001,
    clipvalue=0.3,
    gradient_accumulation_steps=16,
)
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.summary()
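Two of the optimizer settings above lend themselves to a small numeric illustration. With a batch size of 1 and gradient_accumulation_steps=16, the weights are updated once every 16 batches using the average of the accumulated gradients, and clipvalue=0.3 bounds every gradient element to the interval [-0.3, 0.3]. The NumPy sketch below uses made-up gradients and only 4 accumulation steps for brevity; the order in which averaging and clipping are combined is an assumption of the sketch, not a statement about Keras internals.

```python
import numpy as np


def accumulate_and_clip(micro_grads, clipvalue):
    """Average per-batch gradients, then clip each element (assumed ordering)."""
    mean_grad = np.mean(micro_grads, axis=0)
    return np.clip(mean_grad, -clipvalue, clipvalue)


# Made-up gradients of two weights over 4 consecutive batches
micro_grads = np.array([
    [0.2, -1.0],
    [0.0, -0.2],
    [0.4, -0.6],
    [0.2, -0.2],
])

update_grad = accumulate_and_clip(micro_grads, clipvalue=0.3)
print(update_grad)  # [ 0.2 -0.3]

# Effective batch size with the notebook's settings
batch_size, accumulation_steps = 1, 16
effective_batch_size = batch_size * accumulation_steps
print(effective_batch_size)  # 16
```

The first weight's averaged gradient (0.2) is left untouched, while the second (-0.5) is clipped to -0.3.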

Now we train the Gemma language model using the fit method. The train_data variable represents the training dataset that will be used to update the model parameters during training. The epochs parameter specifies the number of times the entire training dataset will be iterated over during training.

The validation_data parameter is used to provide an evaluation dataset (eval_data) to assess the model's performance on data that it hasn't been trained on. This helps monitor the model's generalization ability and detect overfitting. 

The fit method trains the model for the specified number of epochs (5 in this case) while monitoring performance on the validation data. The training history, including loss and metrics values over each epoch, is stored in the history variable.

In [None]:
history = gemma_lm.fit(train_data, epochs=5, validation_data=eval_data)

This code defines and executes a Python function, plot_keras_history, designed to visualize the training and validation history of Keras models. Given a Keras training history object and a list of measures (such as loss or accuracy), the function generates a set of subplots, with each subplot representing a measure's history. It plots the training and validation values of each measure over epochs, labels the axes appropriately, and displays a legend to differentiate between the training and validation data. The call at the end of the cell passes the measures to plot as a list, here the loss and the sparse categorical accuracy.

In [None]:
def plot_keras_history(history, measures):
    """
    history: Keras training history
    measures = list of names of measures
    """
    rows = len(measures) // 2 + len(measures) % 2
    # squeeze=False always returns a 2-D array of axes, even with a single row
    fig, panels = plt.subplots(rows, 2, figsize=(15, 5), squeeze=False)
    plt.subplots_adjust(top = 0.99, bottom=0.01, hspace=0.4, wspace=0.2)
    panels = panels.ravel()
    for k, measure in enumerate(measures):
        panel = panels[k]
        panel.set_title(measure + ' history')
        panel.plot(history.epoch, history.history[measure], label="Train "+measure)
        panel.plot(history.epoch, history.history["val_"+measure], label="Validation "+measure)
        panel.set(xlabel='epochs', ylabel=measure)
        panel.legend()
        
    plt.show()
    
plot_keras_history(history, measures=["loss", "sparse_categorical_accuracy"])
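Besides plotting, the same history can be inspected directly: history.history is a dictionary that maps each measure name to a list with one value per epoch, with validation measures prefixed by "val_". The sketch below uses a plain dictionary with invented values in place of a real training history, and `best_epoch` is a hypothetical helper.

```python
# Invented values standing in for history.history (illustration only)
history_dict = {
    "loss": [1.20, 0.80, 0.60, 0.50, 0.45],
    "val_loss": [1.10, 0.85, 0.70, 0.72, 0.75],
}


def best_epoch(history_dict, measure="val_loss"):
    """Hypothetical helper: zero-based index of the epoch with the lowest value."""
    values = history_dict[measure]
    return min(range(len(values)), key=values.__getitem__)


best = best_epoch(history_dict)
print(best, history_dict["val_loss"][best])  # 2 0.7
```

In this invented run the training loss keeps falling while the validation loss bottoms out at the third epoch (index 2), the typical signature of overfitting that the plots help to spot.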

In [None]:
y_pred = predict(X_test, gemma_lm)
evaluate(y_true, y_pred)

The following code creates a Pandas DataFrame called evaluation containing the text, true labels, and predicted labels from the test set. This is especially useful for understanding the errors that the fine-tuned model makes, and for getting insights on how to improve the prompt.

In [None]:
evaluation = pd.DataFrame({'text': X_test["text"], 
                           'y_true':y_true, 
                           'y_pred': y_pred},
                         )
evaluation.to_csv("test_predictions.csv", index=False)
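Once the predictions are saved, a quick way to study the errors is to keep only the misclassified rows and count which label pairs get confused. The sketch below runs on a toy DataFrame with invented rows that mimics the columns of evaluation.

```python
import pandas as pd

# Invented rows with the same columns as the evaluation DataFrame
evaluation = pd.DataFrame({
    "text": ["headline a", "headline b", "headline c", "headline d"],
    "y_true": ["positive", "neutral", "negative", "neutral"],
    "y_pred": ["positive", "positive", "negative", "none"],
})

# Keep only the rows where the prediction differs from the true label
errors = evaluation[evaluation["y_true"] != evaluation["y_pred"]]

# Count the (true label, predicted label) pairs among the errors
confusions = pd.crosstab(errors["y_true"], errors["y_pred"])

print(errors)
print(confusions)
```

Note that this raw string comparison treats a none prediction as an error, whereas the evaluate function maps none to neutral before scoring.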