# Exercise 9: Using the Google Gemini API

In this exercise we will learn how to use the Google Gemini API.

## 1. Basic usage / Generation

The following code connects to the Google Gemini API in its most basic form.

In [None]:
from google import genai
import os

# Read the key from the environment; never hardcode API keys in a notebook.
api_key = os.environ["GEMINI_API_KEY"]

client = genai.Client(api_key=api_key)

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Explain how AI works in a few words"
)
print(response.text)

AI analyzes vast amounts of data to find patterns and make predictions.
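
Because the key is a secret, it helps to fail early with a clear message when it is missing. A minimal sketch, assuming the `GEMINI_API_KEY` variable name used above; the helper `load_api_key` is a hypothetical name and not part of the `google-genai` library:

```python
import os


def load_api_key(var_name: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in the environment variable `var_name`.

    Hypothetical helper: raises instead of silently falling back to a default.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "export it before running the notebook."
        )
    return key
```

With this in place, the client could be created as `genai.Client(api_key=load_api_key())`.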


## 2. Retrieval

### 2.1 Loading the 20 Newsgroups corpus

In [None]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroups = newsgroups.data

In [None]:
import pandas as pd
df = pd.DataFrame(newsgroups, columns=['text'])
df

Unnamed: 0,text
0,\n\nI am sure some bashers of Pens fans are pr...
1,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...
4,1) I have an old Jasmine drive which I cann...
...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...
18842,\nNot in isolated ground recepticles (usually ...
18843,I just installed a DX2-66 CPU in a clone mothe...
18844,\nWouldn't this require a hyper-sphere. In 3-...


In [None]:
import re
df = df.dropna(subset=['text']).reset_index(drop=True)

def normalize_text(s: str)->str:
    s = re.sub(r'\s+', ' ', s).strip()
    return s

df["text_norm"] = df["text"].astype(str).map(normalize_text)
df.head()

Unnamed: 0,text,text_norm
0,I am sure some bashers of Pens fans are pretty...,I am sure some bashers of Pens fans are pretty...
1,My brother is in the market for a high-perform...,My brother is in the market for a high-perform...
2,Finally you said what you dream about. Mediter...,Finally you said what you dream about. Mediter...
3,Think! It's the SCSI card doing the DMA transf...,Think! It's the SCSI card doing the DMA transf...
4,1) I have an old Jasmine drive which I cannot ...,1) I have an old Jasmine drive which I cannot ...
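
To make the effect of the normalization concrete, the same regular expression can be applied to a short string with mixed newlines and tabs. A small sketch; the sample string is made up in the style of the corpus:

```python
import re


def normalize_text(s: str) -> str:
    # Same rule as in the cell above: any run of whitespace becomes one space.
    return re.sub(r"\s+", " ", s).strip()


raw = "\n\nThink!\n\nIt's the SCSI card\tdoing the DMA"
clean = normalize_text(raw)
print(clean)  # Think! It's the SCSI card doing the DMA
```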


### 2.2 Converting to embeddings

In [None]:
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100):
    """
    Character-based chunking.
    A max_chars of roughly 600-1000 usually works well.
    overlap helps avoid cutting ideas in half.
    """
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        chunk = text[start:end]
        chunk = chunk.strip()
        if len(chunk) > 0:
            chunks.append(chunk)
        if end == n:
            break
        start = max(0, end - overlap)
    return chunks

records = []
for i, row in df.iterrows():
    chunks = chunk_text(row["text_norm"], max_chars=800, overlap=100)
    for j, ch in enumerate(chunks):
        records.append({
            "doc_id": int(i),
            "chunk_id": j,
            "text": ch
        })

chunks_df = pd.DataFrame(records)
chunks_df.head(), len(chunks_df)

(   doc_id  chunk_id                                               text
 0       0         0  I am sure some bashers of Pens fans are pretty...
 1       1         0  My brother is in the market for a high-perform...
 2       2         0  Finally you said what you dream about. Mediter...
 3       2         1  urds and Turks once upon a time! Ohhhh so swed...
 4       3         0  Think! It's the SCSI card doing the DMA transf...,
 38871)
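
To see how `max_chars` and `overlap` interact, the same sliding window can be run on a short string. The sketch below re-implements the loop in compact form. The guard against `overlap >= max_chars` is an addition that the original function does not have; without it, such arguments would never advance the window on texts longer than `max_chars`.

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars].strip()
        if chunk:
            chunks.append(chunk)
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", max_chars=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the one-character overlap requested.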

In [None]:
from sentence_transformers import SentenceTransformer

MODEL_NAME = "intfloat/e5-base-v2"   # recommended for retrieval
model = SentenceTransformer(MODEL_NAME)

# Texts to index (passages)
passages = ["passage: " + t for t in chunks_df["text"].tolist()]


In [None]:
embeddings = model.encode(
    passages,
    batch_size=16,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
).astype("float32")


In [None]:
print(embeddings.shape, embeddings.dtype)

(38871, 768) float32


In [None]:
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
np.save(
    "/content/drive/MyDrive/embeddings2.npy",
    embeddings
)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
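
Saving the matrix pays off only if it is reloaded in later sessions instead of being recomputed. A minimal sketch of that cache pattern, assuming NumPy is available; `load_or_compute` and the temporary path are hypothetical, and the random matrix merely stands in for the real `model.encode(...)` call:

```python
import os
import tempfile

import numpy as np


def load_or_compute(path: str, compute) -> np.ndarray:
    """Load embeddings from `path` if present, else compute and save them."""
    if os.path.exists(path):
        return np.load(path)
    embeddings = compute()
    np.save(path, embeddings)
    return embeddings


def fake_encode() -> np.ndarray:
    # Stand-in for model.encode(...): 4 vectors of dimension 8.
    rng = np.random.default_rng(0)
    return rng.random((4, 8), dtype=np.float32)


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings_demo.npy")
first = load_or_compute(cache_path, fake_encode)   # computes and saves
second = load_or_compute(cache_path, fake_encode)  # loads from disk
print(first.shape, second.dtype, np.array_equal(first, second))
```

In the notebook, `path` would be the Drive location used above and `compute` a function wrapping the `model.encode` call.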


### 2.3 Creating a query and running the search

In [None]:
def embed_query(query: str) -> np.ndarray:
    q = "query: " + query
    vec = model.encode(
        [q],
        convert_to_numpy=True,
        normalize_embeddings=True
    ).astype("float32")
    return vec

query_text = "Battery"

query_vec = embed_query(query_text)
query_vec.shape

(1, 768)

Retrieve the 10 chunks most similar to the query (the search below uses `k=10`).

In [None]:
!pip install faiss-cpu


In [None]:
# Base code for FAISS
import faiss
import numpy as np

# Assumes `embeddings` is an N x D float32 array
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# D holds the squared L2 distances, I the row indices of the 10 nearest chunks
D, I = index.search(query_vec, k=10)
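
Since the embeddings were created with `normalize_embeddings=True`, every vector has unit length, and for unit vectors the squared L2 distance and the cosine similarity are tied by `||a - b||^2 = 2 - 2 (a . b)`. Ranking by smallest L2 distance therefore gives the same order as ranking by highest cosine similarity. A small standard-library check of that identity on made-up 3-dimensional vectors (the vectors and helper names are illustrative only):

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [-1.0, 0.5, 0.2])]

for d in docs:
    # For unit vectors: ||q - d||^2 == 2 - 2 * (q . d)
    assert math.isclose(squared_l2(query, d), 2 - 2 * dot(query, d), abs_tol=1e-12)

by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
print(by_l2 == by_cosine)  # True
```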

In [None]:
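# Concatenate the retrieved chunks into one context string.
# The labels stay in Spanish; note that `idx` is the row index in chunks_df, not the doc_id.
text = ""
for idx in I[0]:
    text_content = chunks_df.loc[idx, 'text']
    text += f"--- Documento ID: {idx}\nContenido:\n{text_content}\n\n"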

In [None]:
text

'--- Documento ID: 38840\nContenido:\nAnaheim.\n\n--- Documento ID: 32749\nContenido:\nWith your date/time problems, you MIGHT have a problem with the Dallas Clock Chip (I\'m making a possibly bad assumption that your system has a clock chip and that it\'s the standard Dallas Clock Chip). I always look at the battery and the clock chip when such things go wrong-- at least, as the first course of action. Mel. White/Data Services/City of Garland, Texas\n\n--- Documento ID: 15083\nContenido:\nDFW was designed with the STS in mind (which really mean very little). Much of their early PR material had scenes with a shuttle landing and two or three others pulled up to gates. I guess they were trying to stress how advanced the airport was. For Dallas types: Imagine the fit Grapevine and Irving would be having if the shuttle WAS landing at DFW. (For the rest, they are currently having some power struggles between the airport and surrounding cities).\n\n--- Documento ID: 307\nContenido:\n...\n\n-
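
The context string above labels each chunk only with its row index. A slightly more structured prompt can keep the retrieval rank and separate the instruction from the context. The sketch below is one possible layout, not a prescribed format; `build_prompt`, the dictionary keys, and the `max_chars` budget are hypothetical, and the two sample hits are shortened from the output above:

```python
def build_prompt(query: str, hits: list[dict], max_chars: int = 4000) -> str:
    """Assemble a summarization prompt from retrieved chunks.

    `hits` is a list of {"chunk_row": int, "text": str} dicts, best match first.
    Chunks are added in rank order until `max_chars` of context is reached;
    the first chunk is always kept.
    """
    parts = []
    used = 0
    for rank, hit in enumerate(hits, start=1):
        block = f"[{rank}] (chunk {hit['chunk_row']})\n{hit['text']}"
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        f"Query: {query}\n\n"
        f"Retrieved passages:\n{context}\n\n"
        "Summarize what the passages say about the query."
    )


hits = [
    {"chunk_row": 32749, "text": "I always look at the battery and the clock chip."},
    {"chunk_row": 38840, "text": "Anaheim."},
]
print(build_prompt("Battery", hits))
```

With the notebook's variables, `hits` could be built as `[{"chunk_row": int(i), "text": chunks_df.loc[i, "text"]} for i in I[0]]`.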

In [None]:
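# The prompt is written in Spanish ("The query was: ... and the results are {...} Give me a summary"),
# which is why the answer below is in Spanish as well.
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="La query fue: " + query_text + " y los resultados son {" + text + "} Dame un resumen"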
)
print(response.text)

Los documentos proporcionados ofrecen una visión variada sobre las baterías, abarcando desde principios químicos y mantenimiento hasta proyectos educativos y aplicaciones técnicas. Aquí tienes un resumen de los puntos principales:

### 1. Química y Funcionamiento
*   **Procesos Químicos:** Se describe el funcionamiento de las baterías de plomo-ácido como una reacción química reversible y exotérmica. La descarga ocurre mediante un cambio en la composición química de las celdas.
*   **Factores de Descarga:** La humedad y los electrolitos disueltos (como la lluvia ácida) pueden crear caminos conductores que descargan la batería. Asimismo, la evaporación del agua aumenta la concentración de ácido sulfúrico, lo que también impulsa la descarga.
*   **Recarga:** El proceso es reversible si se introduce electricidad mediante un alternador o cargador.

### 2. Proyectos DIY y Educación
*   **Construcción Casera:** Se sugieren métodos para crear baterías con fines educativos usando metales comune