# Preparing Headline Data
This notebook turns a dataset of news headlines into vector embeddings and stores them for later use with our IVF index. We load the headlines from a CSV file, generate an embedding for each one with the Gemini API (pausing 10 seconds after every call to respect rate limits), and save the vectors with their metadata to a JSON file through `VectorStorage`. This preprocessing step builds the vector database behind the semantic search demonstrations in the IVF index notebook.

In [18]:
import pandas as pd
from minivecdb.vector_storage import VectorStorage
from minivecdb import gen_emb
from types import SimpleNamespace
import time
import fastcore.all as fc

In [23]:
df = pd.read_csv("data/random_headlines.csv")  # Headlines generated with Grok
df.head(), df.shape

(                                            Headline          Genre
 0  Scientists Discover New Species of Fish in Pac...        Science
 1  Local Bakery Wins National Award for Best Croi...           Food
 2         New Action Movie Breaks Box Office Records  Entertainment
 3     City Council Approves New Park Renovation Plan     Local News
 4               Tech Startup Secures $10M in Funding     Technology,
 (383, 2))

## Our Dataset

We're working with a dataset of random news headlines across different genres. Each headline has:
- The headline text
- A genre label (like Science, Technology, Sports)

This will let us build a semantic search engine that matches headlines by meaning rather than by exact keywords.

In [7]:
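def compute_emb(x):
    """Embed one DataFrame row and bundle the vector with its metadata."""
    headline, genre = x['Headline'], x['Genre']
    vector = gen_emb.get_embedding(headline)
    time.sleep(10)  # pause after every API call to stay under the rate limit
    return SimpleNamespace(vector=vector, metadata={'headline': headline, 'genre': genre})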

## Creating Embeddings

The `compute_emb` function:
1. Takes one row of the DataFrame and reads its `Headline` and `Genre` fields
2. Generates an embedding for the headline text with `gen_emb.get_embedding`, our Gemini wrapper
3. Sleeps for 10 seconds after the call to respect API rate limits
4. Returns a `SimpleNamespace` object holding the vector and its metadata

The delay matters when working with external APIs, which often limit how quickly you can send requests.
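As a variation on the fixed `time.sleep(10)`, the pause can be moved into a small wrapper that waits only for whatever part of the interval has not already elapsed. This is a minimal sketch, not what the notebook does: `throttled` and `fake_embedding` are hypothetical names introduced here, and `fake_embedding` stands in for `gen_emb.get_embedding` so the example runs without an API key.

```python
import time

def throttled(fn, min_interval, clock=time.monotonic, sleep=time.sleep):
    """Wrap fn so consecutive calls start at least min_interval seconds apart."""
    last_call = None

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            # Wait only for the part of the interval that has not elapsed yet.
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        return fn(*args, **kwargs)

    return wrapper

# Hypothetical stand-in for gen_emb.get_embedding (assumption, not the real API).
def fake_embedding(text):
    return [float(len(text))] * 3

# A short interval keeps the demo fast; the notebook uses 10 seconds.
get_emb = throttled(fake_embedding, min_interval=0.05)
vectors = [get_emb(h) for h in ["first headline", "second"]]
```

Unlike the notebook's version, this wrapper does not sleep before the first call or after the last one, so it saves one interval per run and avoids waiting when the call itself was slow.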

In [8]:
emb_obj = df.apply(compute_emb, axis=1)

## Processing All Headlines

We apply `compute_emb` to every row of the DataFrame with `df.apply(..., axis=1)`. The result is a pandas Series holding one embedding object for each of the 383 headlines.

This is the most time-consuming part of the process (about an hour for this dataset) because:
1. Creating each embedding requires an API call
2. We add a 10-second delay between calls
3. We're processing hundreds of headlines
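The "about an hour" figure can be checked with a quick calculation. The sketch below counts only the sleep time, using the 383 headlines and 10-second delay from this notebook; the time spent inside each API call comes on top and is not estimated here.

```python
n_headlines = 383   # rows in the dataset
delay_s = 10        # seconds slept after each API call

sleep_total_s = n_headlines * delay_s
minutes, seconds = divmod(sleep_total_s, 60)
print(f"{sleep_total_s} s of sleep = {minutes} min {seconds} s")
# prints "3830 s of sleep = 63 min 50 s"
```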

In [12]:
len(emb_obj)

383

In [21]:
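vs = VectorStorage(768)  # storage for 768-dimensional vectors
# Add every vector together with its metadata; res collects whatever vs.add returns
res = emb_obj.map(lambda o: vs.add(o.vector, o.metadata))
len(res)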

383

## Storing Our Embeddings

Now we:
1. Create a VectorStorage for our 768-dimensional vectors
2. Add each embedding with its metadata (headline text and genre)
3. Save the complete storage to disk as a JSON file

This creates a searchable database of headline embeddings that we can load and use anytime without redoing the API calls.
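Because the storage is created for 768-dimensional vectors, it can be worth confirming that every embedding has that length before adding it. The following is a minimal sketch under stated assumptions: `split_by_dim` is a hypothetical helper, the sample objects are made-up stand-ins for the real embeddings, and nothing here relies on how `VectorStorage` itself handles mismatched dimensions.

```python
from types import SimpleNamespace

EXPECTED_DIM = 768  # matches VectorStorage(768) in the notebook

def split_by_dim(items, dim=EXPECTED_DIM):
    """Separate objects whose vector has the expected length from the rest."""
    good, bad = [], []
    for item in items:
        (good if len(item.vector) == dim else bad).append(item)
    return good, bad

# Stand-in data shaped like the objects returned by compute_emb.
sample = [
    SimpleNamespace(vector=[0.0] * 768, metadata={'headline': 'ok', 'genre': 'Science'}),
    SimpleNamespace(vector=[0.0] * 3, metadata={'headline': 'truncated', 'genre': 'Food'}),
]
good, bad = split_by_dim(sample)
print(len(good), [b.metadata['headline'] for b in bad])
```

Only the objects in `good` would then be passed on to storage, while `bad` lists the headlines that need to be embedded again.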

In [22]:
vs.save("data/headline_embeddings.json")