# Generating Embeddings with OpenAI

This notebook generates embeddings for chatbot conversations.  
Steps:
1. Load JSON files containing cleaned conversations.  
2. Extract message text.  
3. Use OpenAI's embedding model to create vector representations.  
4. Save the embeddings to new JSON files for later use.  

### 1. Import required libraries
We use:
- `openai` for embeddings.  
- `pandas` and `numpy` (optional; not used in this notebook, but useful for later analysis).  
- `os` and `json` for file handling.  
- `cosine_similarity` from sklearn to compare embeddings if needed.  

In [2]:
import openai
import pandas as pd
import numpy as np
import os
import json
from sklearn.metrics.pairwise import cosine_similarity
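
For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their norms. The following is a minimal, dependency-free sketch of that computation for a single pair. The helper name `cosine` is hypothetical and is not part of this notebook; `cosine_similarity` from sklearn computes the same quantity for whole matrices of vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and values close to -1 indicate opposite directions.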

### 2. Configure API key and paths
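We define:
- The OpenAI API key (replace the `'YOUR-API-KEY'` placeholder with your own key).  
- The folder with the input JSON files (`processed_json`).  
- The folder where embeddings will be saved (`embeddingopenAIFiles`), which is created if it does not exist.  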

In [3]:
openai.api_key = 'YOUR-API-KEY'
folder_path = "../raw data/processed_json"
out_path = "embeddingopenAIFiles"
os.makedirs(out_path, exist_ok=True)
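
As an alternative to writing the key into the notebook, the key could be read from an environment variable. The sketch below assumes the key has been exported as `OPENAI_API_KEY` (the variable name the OpenAI Python library also looks for by default); the helper `load_api_key` and its parameters are hypothetical.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY", environ=None):
    """Return the API key stored in an environment variable, or raise if unset."""
    # `environ` can be swapped for a plain dict, which makes the helper easy to test.
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key

# Usage (assumption: the variable was set before the kernel started):
# openai.api_key = load_api_key()
```

This keeps the secret out of the saved notebook and fails early with a readable message when the variable is missing.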

### 3. Define embedding model
We use `text-embedding-3-small`, shortened from its default 1536 dimensions to 768 through the API's `dimensions` parameter.

In [None]:
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 768

### 4. Function to generate an embedding
This function calls OpenAI’s embedding API with the selected model and dimensions.

In [None]:
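def generar_embedding(texto):
    """Return the embedding of `texto` as a list of EMBEDDING_DIMENSIONS floats."""
    response = openai.embeddings.create(
        model=EMBEDDING_MODEL,
        input=texto,
        dimensions=EMBEDDING_DIMENSIONS
    )
    # A single input string yields exactly one item in response.data
    return response.data[0].embedding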
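
The function above sends one request per message. The embeddings endpoint also accepts a list of strings as `input`, so a batched variant could reduce the number of requests. The following is a sketch under assumptions: the helper name `generar_embeddings_lote`, the `client` parameter, and the `batch_size` of 100 are hypothetical choices, not part of the original notebook.

```python
def generar_embeddings_lote(textos, client, model="text-embedding-3-small",
                            dimensions=768, batch_size=100):
    """Embed a list of texts in batches, preserving input order (hypothetical helper)."""
    vectores = []
    for inicio in range(0, len(textos), batch_size):
        lote = textos[inicio:inicio + batch_size]
        response = client.embeddings.create(
            model=model, input=lote, dimensions=dimensions
        )
        # Each returned item carries the index of its input within the batch;
        # sorting by it guarantees the vectors line up with `lote`.
        ordenados = sorted(response.data, key=lambda item: item.index)
        vectores.extend(item.embedding for item in ordenados)
    return vectores

# Usage (assumption: `openai` is configured as in the cells above):
# embeddings = generar_embeddings_lote(textos, openai)
```

Passing the client as an argument keeps the helper testable with a stand-in object and lets it work with either the `openai` module or an explicit client instance.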

### 5. List input files
We check which JSON files exist in the input folder.
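For testing, we only take the first 10 directory entries (sorted by name). Non-JSON entries among them are skipped during processing, so fewer than 10 files may actually be embedded.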
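
If exactly 10 JSON files are wanted for a test run, one option is to filter by extension before applying the limit. This is a minimal sketch; the helper name `listar_json` and its `limit` parameter are hypothetical and do not appear in the original notebook.

```python
import os

def listar_json(folder, limit=10):
    """Return up to `limit` .json file names from `folder`, sorted by name."""
    nombres = sorted(
        n for n in os.listdir(folder)
        # Keep regular files only; names such as 'x.json.amltmp' do not end in '.json'.
        if n.endswith(".json") and os.path.isfile(os.path.join(folder, n))
    )
    return nombres[:limit]

# Usage (assumption: `folder_path` is defined as in the configuration cell):
# files = listar_json(folder_path)
```

Filtering first means that hidden or temporary entries no longer count toward the limit.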

### 6. Process each file
For every JSON file:
1. Open and parse the content.  
2. Extract message texts.  
3. Generate embeddings for each text.  
4. Save the embeddings into a new JSON file. 

In [8]:
print("📁 Carpeta de entrada:", os.path.abspath(folder_path))

try:
    files = sorted(os.listdir(folder_path))[:10]
except FileNotFoundError:
    print(f"❌ La carpeta {folder_path} no existe.")
    raise  # re-raise to stop the cell; exit() would shut down the notebook kernel

print("📄 Archivos detectados:", files)

for name_file in files:
    ruta = os.path.join(folder_path, name_file)

    if not name_file.endswith('.json'):
        continue

    if not os.path.isfile(ruta):
        print(f"⚠️ Archivo no encontrado (saltado): {ruta}")
        continue

    with open(ruta, 'r', encoding='utf-8') as f:
        try:
            data = json.load(f)
        except Exception as e:
            print(f"❌ Error leyendo {name_file}: {e}")
            continue

    textos = []

    # ✅ Extract the message texts (only string 'content' fields)
    if isinstance(data, dict) and 'messages' in data:
        textos = [msg['content'] for msg in data['messages']
                  if isinstance(msg, dict) and 'content' in msg and isinstance(msg['content'], str)]

    if not textos:
        print(f"⚠️ No se encontraron textos en {name_file}")
        continue

    try:
        embeddings = [generar_embedding(t) for t in textos]
    except Exception as e:
        print(f"❌ Error generando embeddings en {name_file}: {e}")
        continue

    nombre_salida = os.path.splitext(name_file)[0] + "_embeddings.json"
    ruta_salida = os.path.join(out_path, nombre_salida)

    with open(ruta_salida, 'w', encoding='utf-8') as f:
        json.dump(embeddings, f)

    print(f"✅ Procesado y guardado: {ruta_salida} ({len(embeddings)} vectores)")


📁 Carpeta de entrada: /mnt/batch/tasks/shared/LS_root/mounts/clusters/elena1/code/Users/elena/elena/raw data/processed_json
📄 Archivos detectados: ['.amlignore', '.amlignore.amltmp', '10ER3iMDW5w4lIlSnHBrIq-in.json', '10er3imdw5w4lilsnhbriq-in.json.amltmp', '111jaycyxHF58KEVIsa0kA-in.json', '11fDRFTlZAPCMeBqTSeYmg-in.json', '12EdU3GcQf2AgjlcgTLb4R-in.json', '12UHWJ1DhbmBIcfVJlbSPK-in.json', '13KMJrTS9SO2aIA8r7SNCj-in.json', '147nMJUjrHyIxKu9Aoka4U-in.json']
✅ Procesado y guardado: embeddingopenAIFiles/10ER3iMDW5w4lIlSnHBrIq-in_embeddings.json (41 vectores)
✅ Procesado y guardado: embeddingopenAIFiles/111jaycyxHF58KEVIsa0kA-in_embeddings.json (26 vectores)
✅ Procesado y guardado: embeddingopenAIFiles/11fDRFTlZAPCMeBqTSeYmg-in_embeddings.json (55 vectores)
✅ Procesado y guardado: embeddingopenAIFiles/12EdU3GcQf2AgjlcgTLb4R-in_embeddings.json (18 vectores)
✅ Procesado y guardado: embeddingopenAIFiles/12UHWJ1DhbmBIcfVJlbSPK-in_embeddings.json (12 vectores)
✅ Procesado y guardado: embedding