# Tutorial 2c: Classifying embeddings with Keras (Kaggle 5-day Generative AI course)

### Overview
In this notebook, you'll learn to use the embeddings produced by the Gemini API to train a model that classifies newsgroup posts by category (the newsgroup each post came from), using only the post contents.

This technique uses the Gemini API's embeddings as input, avoiding the need to train on text input directly, and as a result it is able to perform quite well using relatively few examples compared to training a text model from scratch.

In [2]:
!pip install -U -q "google-genai==1.7.0"
from google import genai
from google.genai import types

genai.__version__

'1.7.0'

In [3]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

client = genai.Client(api_key=GOOGLE_API_KEY)


**Get the News Data:** We download the 20 Newsgroups dataset of newsgroup posts, already split into training and test subsets.

**See the Categories:** We look at the list of newsgroup topics (such as autos, space, and medicine).

**Show an Example:** We print the text of the first post in the training set.

In [5]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset="train")
newsgroups_test = fetch_20newsgroups(subset="test")

# View list of class names for dataset
newsgroups_train.target_names

print(newsgroups_train.data[0])


From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







### Start by preprocessing the data for this tutorial in a Pandas dataframe. 
To remove any sensitive information like names and email addresses, you will take only the subject and body of each message. This is an optional step that transforms the input data into more generic text, rather than email posts, so that it will work in other contexts.

In [7]:
import email
import re
import pandas as pd

def preprocess_text(data):
    # Extract subject and body, remove email addresses, and truncate to 5,000 characters
    msg = email.message_from_string(data)
    text = f"{msg['Subject']}\n\n{msg.get_payload()}"
    return re.sub(r"[\w\.-]+@[\w\.-]+", "", text)[:5000]

def preprocess_newsgroup_data(dataset):
    # Create DataFrame and preprocess text
    df = pd.DataFrame({"Text": dataset.data, "Label": dataset.target})
    df["Text"] = df["Text"].apply(preprocess_text)
    df["Class Name"] = df["Label"].map(lambda l: dataset.target_names[l])
    return df
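
To see what `preprocess_text` does to a single post, here is a minimal sketch that applies the same three steps (parse the message, keep subject and body, strip email addresses) to a made-up message. The sample post and its address are invented for illustration; only the standard-library `email` and `re` calls from the cell above are used.

```python
import email
import re

# Hypothetical post in the same header-plus-body layout as the dataset.
sample_post = (
    "From: someone@example.com (A. Poster)\n"
    "Subject: Test subject\n"
    "Lines: 2\n"
    "\n"
    "Contact me at someone@example.com for details.\n"
)

# Parse the headers, then keep only the subject and the body.
msg = email.message_from_string(sample_post)
text = f"{msg['Subject']}\n\n{msg.get_payload()}"

# Remove anything that looks like an email address and cap the length.
cleaned = re.sub(r"[\w\.-]+@[\w\.-]+", "", text)[:5000]
print(cleaned)
```

The `From` and `Lines` headers are dropped entirely, and the address in the body is replaced by an empty string, so the surrounding spaces remain.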


In [8]:
# Apply preprocessing function to training and test datasets
df_train = preprocess_newsgroup_data(newsgroups_train)
df_test = preprocess_newsgroup_data(newsgroups_test)

df_train.head()

| | Text | Label | Class Name |
|---|---|---|---|
| 0 | WHAT car is this!?\n\n I was wondering if anyo... | 7 | rec.autos |
| 1 | SI Clock Poll - Final Call\n\nA fair number of... | 4 | comp.sys.mac.hardware |
| 2 | PB questions...\n\nwell folks, my mac plus fin... | 4 | comp.sys.mac.hardware |
| 3 | Re: Weitek P9000 ?\n\nRobert J.C. Kyanko () wr... | 1 | comp.graphics |
| 4 | Re: Shuttle Launch Question\n\nFrom article <>... | 14 | sci.space |


Next, you will sample some of the data by taking 100 data points per category from the training dataset (and 25 per category from the test dataset), and dropping most of the categories to keep this tutorial quick to run. Choose the science categories to compare.

In [9]:
def sample_data(df, num_samples, classes_to_keep):
    # Sample rows, selecting num_samples of each Label.
    df = (
        df.groupby("Label")[df.columns]
        .apply(lambda x: x.sample(num_samples))
        .reset_index(drop=True)
    )

    # Copy the filtered rows so the column assignments below do not act on a view.
    df = df[df["Class Name"].str.contains(classes_to_keep)].copy()

    # We have fewer categories now, so re-calibrate the label encoding.
    df["Class Name"] = df["Class Name"].astype("category")
    df["Encoded Label"] = df["Class Name"].cat.codes

    return df
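
The re-calibration step matters because the original `Label` values index all 20 newsgroups, while the classifier needs labels that run from 0 up to the number of kept classes minus one. The following is a rough sketch of that mapping in plain Python rather than pandas, assuming codes are assigned in sorted order of the class names (which is consistent with the `Encoded Label` values shown later in this notebook). The `kept` list is made up for illustration.

```python
# Hypothetical class names remaining after filtering, in arbitrary row order.
kept = ["sci.space", "sci.crypt", "sci.med", "sci.electronics", "sci.crypt"]

# Assumption: categories are the sorted unique names, and each code is
# the position of the name in that sorted list.
categories = sorted(set(kept))
code_of = {name: code for code, name in enumerate(categories)}

encoded = [code_of[name] for name in kept]
print(categories)
print(encoded)
```

With this mapping the labels are contiguous, which is what a sparse categorical loss expects.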

In [14]:
TRAIN_NUM_SAMPLES = 100
TEST_NUM_SAMPLES = 25
# Class name should contain 'sci' to keep science categories.
# Try different labels from the data - see newsgroups_train.target_names
CLASSES_TO_KEEP = "sci"

df_train = sample_data(df_train, TRAIN_NUM_SAMPLES, CLASSES_TO_KEEP)
df_test = sample_data(df_test, TEST_NUM_SAMPLES, CLASSES_TO_KEEP)

df_train.value_counts("Class Name")


Class Name
sci.crypt          100
sci.electronics    100
sci.med            100
sci.space          100
Name: count, dtype: int64

In [13]:
df_test.value_counts("Class Name")


Class Name
sci.crypt          25
sci.electronics    25
sci.med            25
sci.space          25
Name: count, dtype: int64

## Create the embeddings

**Turn Text into Numbers:** We'll use the Gemini API to convert each news article into a list of numbers, called "embeddings." These numbers represent the meaning of the text.

**Tell Gemini Our Goal:** We'll specify that we're using these embeddings for "classification," meaning we want to sort the articles into categories. This helps Gemini generate the most useful numbers.

**One by One:** The API processes each article separately, creating a number list for each one. This might take a while if you have many articles.

In [23]:
from google.api_core import retry
import tqdm
from tqdm.rich import tqdm as tqdmr
import warnings

# Enable progress bars for Pandas and suppress warnings
tqdmr.pandas()

warnings.filterwarnings("ignore", category=tqdm.TqdmExperimentalWarning)

@retry.Retry(
    predicate=lambda e: isinstance(e, genai.errors.APIError) and e.code in {429, 503},
    timeout=300
)
def embed_text(text: str) -> list[float]:
    """Generate embeddings for a given text using the specified model."""
    response = client.models.embed_content(
        model="models/text-embedding-004",
        contents=text,
        config=types.EmbedContentConfig(task_type="classification")
    )
    return response.embeddings[0].values

def create_embeddings(df):
    """Add embeddings to the DataFrame based on the 'Text' column."""
    df["Embeddings"] = df["Text"].progress_apply(embed_text)
    return df
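
The `retry.Retry` decorator above re-runs `embed_text` when the API reports a rate limit (429) or a temporary outage (503). As a simplified illustration of that idea, and not the actual `google.api_core` implementation, the sketch below retries a flaky call with exponential backoff. `FakeAPIError`, `call_with_retry`, and `flaky_embed` are hypothetical names used only here.

```python
import time


class FakeAPIError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code


def call_with_retry(fn, retryable_codes=(429, 503), max_attempts=5, base_delay=0.01):
    """Call fn, retrying on retryable status codes with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except FakeAPIError as err:
            # Give up on non-retryable errors or when attempts are exhausted.
            if err.code not in retryable_codes or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


attempts = []


def flaky_embed():
    """Fail twice with a rate-limit error, then return a made-up vector."""
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeAPIError(429)
    return [0.1, 0.2, 0.3]


vector = call_with_retry(flaky_embed)
print(len(attempts), vector)
```

Errors with other status codes are raised immediately, mirroring the predicate in the decorator, which only matches codes 429 and 503.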


**Explanation:**
We're taking our training and testing data, and we're adding columns to them that contain the numerical representations of the article text, so that our machine learning model can understand them.

In [24]:
df_train = create_embeddings(df_train)
df_test = create_embeddings(df_test)


In [25]:
df_train.head()


| | Text | Label | Class Name | Encoded Label | Embeddings |
|---|---|---|---|---|---|
| 0 | Re: Clipper considered harmful\n\nIn article <... | 11 | sci.crypt | 0 | [0.004346496, 0.029194878, -0.06622962, 0.0297... |
| 1 | Re: Once tapped, your code is no good any more... | 11 | sci.crypt | 0 | [-0.017911352, 0.017148165, -0.04155469, -0.00... |
| 2 | Cripple Chip\n\nHow about this: The\nTelCo ha... | 11 | sci.crypt | 0 | [-0.013676126, 0.03906149, -0.041958194, 0.003... |
| 3 | Re: clipper chip --Bush did it\n\n (John Gilbe... | 11 | sci.crypt | 0 | [-0.0047690864, 0.024499344, -0.039486606, 0.0... |
| 4 | Clipper chip -- technical details\n\nI receive... | 11 | sci.crypt | 0 | [0.01475392, 0.020964768, -0.057706054, 0.0120... |



## Build a classification model

Here you will define a simple model that accepts the **raw embedding data** as input, has one **hidden layer**, and an **output layer** specifying the class probabilities. The prediction will correspond to the probability of a piece of text being a particular class of news.

When you run the model, Keras will take care of details like **shuffling the data points**, calculating metrics and other ML boilerplate.

We'll build a simple **"sorting machine" (model)** that takes the article's **number codes (embeddings) as input**. It **learns patterns** in these codes to **guess the article's category**, giving us a **probability for each category**. **Keras** handles the **learning process**.

In [33]:
import keras
from keras import Sequential, layers

def build_classification_model(input_dim: int, num_classes: int) -> Sequential:
    return Sequential([
        layers.Dense(input_dim, activation='relu', input_shape=(input_dim,)),
        layers.Dense(num_classes, activation='softmax')
    ])
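
To make the two layers concrete, here is a minimal sketch of the same forward pass in plain Python: a dense layer with ReLU, followed by a dense layer with softmax. The tiny dimensions and random weights are assumptions chosen for readability; a real model would use the full embedding size and weights learned during training.

```python
import math
import random


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per output unit plus a bias."""
    return [sum(x * w for x, w in zip(inputs, row)) + b for row, b in zip(weights, biases)]


def relu(values):
    """Zero out negative activations."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
input_dim, num_classes = 4, 3  # toy sizes, not the real embedding size

# Made-up embedding and untrained random weights, for illustration only.
embedding = [random.uniform(-1, 1) for _ in range(input_dim)]
hidden_w = [[random.uniform(-1, 1) for _ in range(input_dim)] for _ in range(input_dim)]
hidden_b = [0.0] * input_dim
out_w = [[random.uniform(-1, 1) for _ in range(input_dim)] for _ in range(num_classes)]
out_b = [0.0] * num_classes

hidden = relu(dense(embedding, hidden_w, hidden_b))
probs = softmax(dense(hidden, out_w, out_b))
print(probs)
```

The output has one probability per class, which is the shape the softmax layer in `build_classification_model` produces for each input embedding.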


In [34]:
# Derive the embedding size from observing the data. The embedding size can also be specified
# with the `output_dimensionality` parameter to `embed_content` if you need to reduce it.
embedding_size = len(df_train["Embeddings"].iloc[0])

classifier = build_classification_model(
    embedding_size, len(df_train["Class Name"].unique())
)
classifier.summary()

classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=["accuracy"],
)

## Train the model
Finally, you can train your model. This code uses early stopping to exit the training loop once the training accuracy stabilises, so the number of epochs executed may be lower than the specified value.
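
The stopping rule can be sketched in plain Python. This is a simplified illustration of a patience-based check, assuming a metric where higher is better and where any strict increase counts as an improvement. `stopping_epoch` and the accuracy values are made up for illustration and are not part of Keras.

```python
def stopping_epoch(metric_history, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(metric_history, start=1):
        if value > best:
            # New best value: remember it and reset the patience counter.
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical per-epoch accuracies: the best value is reached at epoch 4
# and is not beaten in the following three epochs.
accuracies = [0.28, 0.57, 0.86, 0.97, 0.97, 0.96, 0.97, 0.98]
stop_at = stopping_epoch(accuracies, patience=3)
print(stop_at)
```

In this toy run the later value of 0.98 is never reached, because the patience window closes first. A larger `patience` trades extra epochs for a lower chance of stopping too soon.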

In [35]:
import numpy as np


NUM_EPOCHS = 20
BATCH_SIZE = 32

# Split the x and y components of the train and validation subsets.
y_train = df_train["Encoded Label"]
x_train = np.stack(df_train["Embeddings"])
y_val = df_test["Encoded Label"]
x_val = np.stack(df_test["Embeddings"])

# Specify that it's OK to stop early if accuracy stabilises.
early_stop = keras.callbacks.EarlyStopping(monitor="accuracy", patience=3)

# Train the model for the desired number of epochs.
history = classifier.fit(
    x=x_train,
    y=y_train,
    validation_data=(x_val, y_val),
    callbacks=[early_stop],
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS,
)

Epoch 1/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 1s 25ms/step - accuracy: 0.2805 - loss: 1.3620 - val_accuracy: 0.7000 - val_loss: 1.2559
Epoch 2/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.5739 - loss: 1.2086 - val_accuracy: 0.7000 - val_loss: 1.1130
Epoch 3/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.8556 - loss: 1.0317 - val_accuracy: 0.9200 - val_loss: 0.9373
Epoch 4/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9657 - loss: 0.7927 - val_accuracy: 0.9000 - val_loss: 0.7662
Epoch 5/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.9698 - loss: 0.6332 - val_accuracy: 0.9100 - val_loss: 0.6259
Epoch 6/20
13/13 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.9784 - loss: 0.4799 - val_accuracy: 0.9200 - val_loss: 0.5250
Epoch 7/20
13/13 ━━━━━━━━

## Evaluate model performance
Use Keras Model.evaluate to calculate the loss and accuracy on the test dataset.

In [36]:
classifier.evaluate(x=x_val, y=y_val, return_dict=True)

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9623 - loss: 0.2181


{'accuracy': 0.949999988079071, 'loss': 0.2744240164756775}
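
To show what the two reported numbers mean, here is a minimal sketch that computes accuracy and sparse categorical cross-entropy by hand on a toy batch. The probabilities and labels are made up, and the helper functions are hypothetical. Keras applies additional numerical safeguards (such as clipping probabilities), so treat this as an approximation of the idea rather than the library's implementation.

```python
import math


def sparse_categorical_crossentropy(probabilities, labels):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(row[label]) for row, label in zip(probabilities, labels)]
    return sum(losses) / len(losses)


def accuracy(probabilities, labels):
    """Fraction of rows whose highest-probability class matches the label."""
    hits = sum(
        1
        for row, label in zip(probabilities, labels)
        if max(range(len(row)), key=row.__getitem__) == label
    )
    return hits / len(labels)


# Toy predictions for three examples over three classes (illustrative only).
toy_probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
]
toy_labels = [0, 1, 2]  # integer class indices, like "Encoded Label"

toy_loss = sparse_categorical_crossentropy(toy_probabilities, toy_labels)
toy_accuracy = accuracy(toy_probabilities, toy_labels)
print(f"accuracy={toy_accuracy:.4f} loss={toy_loss:.4f}")
```

The third example is misclassified, which lowers the accuracy, and its low probability for the true class contributes the largest share of the loss.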

## Try a custom prediction
Now that you have a trained model with good evaluation metrics, you can try to make a prediction with new, hand-written data. Use the provided example or try your own data to see how the model performs.

In [37]:
def make_prediction(text: str) -> list[float]:
    """Infer categories from the provided text."""
    # Remember that the model takes embeddings as input, so calculate them first.
    embedded = embed_text(text)

    # And recall that the input must be batched, so here they are wrapped as a
    # list to provide a batch of 1.
    inp = np.array([embedded])

    # And un-batched here.
    [result] = classifier.predict(inp)
    return result

In [38]:
# This example avoids any space-specific terminology to see if the model avoids
# biases towards specific jargon.
new_text = """
First-timer looking to get out of here.

Hi, I'm writing about my interest in travelling to the outer limits!

What kind of craft can I buy? What is easiest to access from this 3rd rock?

Let me know how to do that please.
"""

result = make_prediction(new_text)

for idx, category in enumerate(df_test["Class Name"].cat.categories):
    print(f"{category}: {result[idx] * 100:0.2f}%")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 51ms/step
sci.crypt: 0.01%
sci.electronics: 1.28%
sci.med: 0.09%
sci.space: 98.61%
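
To turn the probabilities into a single predicted label, pick the class with the highest probability. The sketch below does this in plain Python, using the rounded percentages printed above as hard-coded inputs. In the notebook itself you would use `result` and the category list directly.

```python
# Category names and rounded probabilities copied from the printed output above.
categories = ["sci.crypt", "sci.electronics", "sci.med", "sci.space"]
probabilities = [0.0001, 0.0128, 0.0009, 0.9861]

# Index of the largest probability (an argmax over the class scores).
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
predicted = categories[best_index]
confidence = probabilities[best_index]

print(f"Predicted class: {predicted} ({confidence:.2%})")
```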
