# Name: Cesar Duque

# Workshop: Using cloud tools for information retrieval

## Objective:

Learn to use two vector databases, ChromaDB and Pinecone, to run similarity searches over text embeddings. Vector databases are essential tools in Information Retrieval (IR) and are widely used in applications such as search engines, recommender systems, and natural language processing (NLP).

In [1]:
import pandas as pd
import chromadb
import numpy as np
from pinecone import Pinecone, ServerlessSpec
from google.colab import drive
from gensim.models import KeyedVectors

First we initialize Pinecone with the API key obtained from our personal account.

In [2]:
import os
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])  # assumes the key is stored in this environment variable rather than hardcoded

We create a new index named vinos300 with 300 dimensions, using the cosine metric so that it computes the cosine similarity between vectors.

In [3]:
pc.create_index(
    name="vinos300",
    dimension=300, # must match the 300-dimensional word2vec vectors
    metric="cosine", # cosine similarity between vectors
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)
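As a reminder of what the cosine metric measures, here is a minimal sketch in NumPy. The helper `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions, not part of the Pinecone API; the index computes this internally.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

# Toy 3-dimensional vectors (illustrative values, not real embeddings)
u = [1.0, 0.0, 0.0]
v = [2.0, 0.0, 0.0]   # same direction as u, different length
w = [0.0, 3.0, 0.0]   # orthogonal to u

sim_same = cosine_similarity(u, v)
sim_orth = cosine_similarity(u, w)
print(sim_same, sim_orth)
```

Because only the direction matters, vector length does not affect the score, which is convenient when embeddings are averages over texts of different lengths.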

We load the index into the following variable:

In [4]:
index = pc.Index("vinos300")

We load the wine file:

In [5]:
wine_df = pd.read_csv("./winemag-data_first150k.csv")
wine_df

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
...,...,...,...,...,...,...,...,...,...,...,...
150925,150925,Italy,Many people feel Fiano represents southern Ita...,,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Feudi di San Gregorio
150926,150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,91,27.0,Champagne,Champagne,,Champagne Blend,H.Germain
150927,150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,91,20.0,Southern Italy,Fiano di Avellino,,White Blend,Terredora
150928,150928,France,"A perfect salmon shade, with scents of peaches...",Grand Brut Rosé,90,52.0,Champagne,Champagne,,Champagne Blend,Gosset


We load the word2vec model:

In [6]:
drive.mount('/content/drive')
word2vec_model = KeyedVectors.load_word2vec_format('/content/drive/My Drive/modelos/GoogleNews-vectors-negative300.bin', binary=True)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We build the corpus from the id and description columns, keeping the first 30 rows:


In [7]:
corpus = wine_df[['Unnamed: 0','description']][:30]
corpus

Unnamed: 0.1,Unnamed: 0,description
0,0,This tremendous 100% varietal wine hails from ...
1,1,"Ripe aromas of fig, blackberry and cassis are ..."
2,2,Mac Watson honors the memory of a wine once ma...
3,3,"This spent 20 months in 30% new French oak, an..."
4,4,"This is the top wine from La Bégude, named aft..."
5,5,"Deep, dense and pure from the opening bell, th..."
6,6,Slightly gritty black-fruit aromas include a s...
7,7,Lush cedary black-fruit aromas are luxe and of...
8,8,This re-named vineyard was formerly bottled as...
9,9,The producer sources from two blocks of the vi...


We define the following function to create the embedding of each text passed in, using word2vec:



In [15]:
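def generate_word2vec_embeddings(texts):
    """Return one embedding per text: the mean of its in-vocabulary word vectors."""
    embeddings = []
    for text in texts:
        # Lowercase and split on whitespace; out-of-vocabulary tokens are skipped
        tokens = text.lower().split()
        word_vectors = [word2vec_model[word] for word in tokens if word in word2vec_model]
        if word_vectors:
            embeddings.append(np.mean(word_vectors, axis=0))
        else:
            # No known token: fall back to a zero vector of the model's dimension
            embeddings.append(np.zeros(word2vec_model.vector_size))
    return np.array(embeddings)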
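To make the averaging step concrete, the following sketch applies the same idea to a tiny hypothetical vocabulary. The dict `toy_model`, the helper `mean_pool`, and the 2-dimensional vectors are assumptions for illustration only; the notebook itself uses the 300-dimensional GoogleNews model.

```python
import numpy as np

# Hypothetical 2-dimensional "model": a plain dict standing in for word2vec_model
toy_model = {
    "red": np.array([1.0, 0.0]),
    "wine": np.array([0.0, 1.0]),
}
VECTOR_SIZE = 2

def mean_pool(text, model, size):
    """Average the vectors of the known tokens; zero vector if none are known."""
    vectors = [model[tok] for tok in text.lower().split() if tok in model]
    if not vectors:
        return np.zeros(size)
    return np.mean(vectors, axis=0)

pooled = mean_pool("Red wine", toy_model, VECTOR_SIZE)          # both tokens known
partial = mean_pool("red grapefruit", toy_model, VECTOR_SIZE)   # unknown token skipped
empty = mean_pool("zzz qqq", toy_model, VECTOR_SIZE)            # no token known
print(pooled, partial, empty)
```

Note that a text with no known tokens maps to the zero vector, for which cosine similarity is undefined, so such rows may deserve special handling before being indexed.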


We generate the embeddings of the corpus descriptions:

In [23]:
word2vec_embeddings = generate_word2vec_embeddings(corpus['description'])
print("Word2Vec Shape:", word2vec_embeddings.shape)

Word2Vec Shape: (30, 300)
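One detail worth checking in `generate_word2vec_embeddings`: splitting on whitespace leaves punctuation attached to tokens, so a token such as `fig,` will likely miss the vocabulary. A possible alternative, sketched here with a hypothetical `simple_tokenize` helper and a sample phrase adapted from one of the descriptions, is to extract word-like tokens with a regular expression.

```python
import re

# Word-like tokens, optionally with an apostrophe suffix (e.g. "it's")
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")

def simple_tokenize(text):
    """Lowercase and extract word-like tokens, dropping surrounding punctuation."""
    return TOKEN_RE.findall(text.lower())

sample = "Ripe aromas of fig, blackberry and cassis."
whitespace_tokens = sample.lower().split()
regex_tokens = simple_tokenize(sample)

print(whitespace_tokens)
print(regex_tokens)
```

Whether this changes the retrieved results on the wine data has not been measured here; it is only a suggestion to try.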


In [22]:
pd.DataFrame(word2vec_embeddings)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,0.019787,0.034147,-0.008846,0.051933,-0.016366,-0.037549,0.05248,-0.133699,0.041895,0.135602,...,-0.087264,0.004612,-0.051172,-0.012683,-0.025946,-0.033582,0.039914,-0.015733,0.066266,-0.027847
1,0.001686,-0.001247,-0.000655,0.14048,-0.014441,-0.00249,0.091776,-0.131765,0.019266,0.165058,...,-0.100096,-0.047632,-0.022128,-0.020958,-0.025494,-0.013828,0.03543,-0.044538,0.064084,0.032215
2,-0.017582,0.064089,0.024086,0.108896,-0.010757,-0.030537,0.055744,-0.153785,0.043589,0.120384,...,-0.079395,-0.03962,-0.038397,-0.008357,-0.034771,-0.088459,0.011192,-0.040925,0.091102,0.017694
3,0.03255,0.045905,0.005632,0.101596,-0.020803,-0.022248,0.034838,-0.134144,0.052734,0.130473,...,-0.097524,-0.013736,-0.045929,-0.005602,-0.041582,-0.00159,0.058337,-0.03576,0.049629,0.008004
4,0.021038,0.003638,0.034816,0.05965,-0.024282,0.007119,0.025699,-0.151583,0.075873,0.108792,...,-0.080433,-0.027292,-0.055293,0.064706,-0.023048,0.005717,0.015189,-0.022822,0.063493,-0.004583
5,0.024773,0.037948,-0.007518,0.09352,-0.030345,-0.016964,0.05942,-0.092487,0.068705,0.138466,...,-0.065226,-0.039165,-0.037136,0.026566,-0.007809,0.00599,0.022459,-0.006457,0.071308,0.031391
6,-0.013163,-0.010038,-0.005573,0.128697,-0.024703,0.009325,0.059494,-0.081149,0.05303,0.11236,...,-0.042375,-0.046615,-0.009515,-0.017981,-0.015074,0.004817,0.036903,-0.040093,0.026081,0.03755
7,0.028637,0.001697,-0.002092,0.106546,-0.003952,0.024813,0.073684,-0.121805,0.044484,0.142862,...,-0.092592,-0.074621,-0.000606,-0.000127,-0.038869,0.022445,0.018914,-0.05629,0.112041,0.033354
8,0.033648,0.006038,-0.031523,0.078414,-0.034637,0.045929,0.032513,-0.133533,0.08311,0.127848,...,-0.054502,-0.038911,-0.053522,0.070187,-0.008705,-0.02346,0.046324,-0.019357,0.128292,-0.008424
9,0.018751,0.038643,0.005528,0.069594,-0.038254,0.006929,0.085988,-0.118543,-0.010177,0.148384,...,-0.075537,-0.036528,-0.018738,-0.00163,-0.016193,-0.00911,0.036957,-0.076431,0.065915,0.017405


We create the list of vectors that we will upload to Pinecone:

In [17]:
vectors = [{'id': str(i), 'values': word2vec_embeddings[i]} for i in range(30)]
pd.DataFrame(vectors)

Unnamed: 0,id,values
0,0,"[0.019786645, 0.034147214, -0.008846283, 0.051..."
1,1,"[0.0016860962, -0.001247406, -0.00065493584, 0..."
2,2,"[-0.01758194, 0.06408924, 0.02408564, 0.108896..."
3,3,"[0.032550234, 0.04590465, 0.0056318804, 0.1015..."
4,4,"[0.021037823, 0.0036380924, 0.03481565, 0.0596..."
5,5,"[0.024773298, 0.03794842, -0.007517787, 0.0935..."
6,6,"[-0.013163249, -0.010037509, -0.005572695, 0.1..."
7,7,"[0.028636653, 0.0016968889, -0.0020921289, 0.1..."
8,8,"[0.033648174, 0.006037839, -0.031522624, 0.078..."
9,9,"[0.018751256, 0.038642913, 0.0055281697, 0.069..."


And now we use that list to upload the vectors to Pinecone:

In [18]:
index.upsert(vectors=vectors, namespace='vectors')


{'upserted_count': 30}
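With 30 vectors a single upsert call is enough. For a larger slice of the dataset, one would probably send the records in batches. The sketch below shows only the batching logic; the helper `chunked`, the batch size of 8, and the stand-in records are illustrative assumptions, and the actual `index.upsert` call is left as a comment because it needs a live index.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in records shaped like the `vectors` list above (values shortened)
records = [{"id": str(i), "values": [0.0, 0.0, 0.0]} for i in range(30)]

batches = list(chunked(records, batch_size=8))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)

# With a live index, each batch would be sent separately, e.g.:
# for batch in chunked(vectors, batch_size=8):
#     index.upsert(vectors=batch, namespace="vectors")
```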

We print the index stats to check its dimension and the number of vectors stored:

In [19]:
print(index.describe_index_stats())


{'dimension': 300,
 'index_fullness': 0.0,
 'namespaces': {'vectors': {'vector_count': 30}},
 'total_vector_count': 30}


We define a query and generate its embedding with the same function:

In [21]:
query_str = 'wine'
query_vector = generate_word2vec_embeddings([query_str])
pd.DataFrame(query_vector)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,-0.175781,-0.109375,-0.189453,0.15625,0.006775,0.233398,0.05542,-0.296875,0.132812,0.287109,...,0.045898,-0.209961,0.115723,0.186523,0.172852,-0.084473,0.296875,-0.037842,0.308594,0.050049


Using the generated query vector, we run the query:

In [24]:
index.query(
    namespace="vectors",
    vector=query_vector.tolist(),
    top_k=3,
    include_values=False
)

{'matches': [{'id': '4', 'score': 0.537523627, 'values': []},
             {'id': '29', 'score': 0.53684485, 'values': []},
             {'id': '8', 'score': 0.535202205, 'values': []}],
 'namespace': 'vectors',
 'usage': {'read_units': 5}}
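The response only carries ids and scores, so the texts have to be looked up separately. A minimal sketch of that step follows. It assumes the response can be read like a plain dict with the shape printed above, and it uses a hypothetical `descriptions` dict with placeholder strings in place of the real `wine_df` lookup.

```python
# Response copied from the output above, reduced to the fields used here
response = {
    "matches": [
        {"id": "4", "score": 0.537523627},
        {"id": "29", "score": 0.53684485},
        {"id": "8", "score": 0.535202205},
    ]
}

# Hypothetical stand-in for the description lookup; in the notebook this
# would be built from wine_df ('Unnamed: 0' -> 'description')
descriptions = {
    4: "description of wine 4",
    8: "description of wine 8",
    29: "description of wine 29",
}

def rank_matches(response, descriptions):
    """Return (id, score, text) tuples sorted by descending score."""
    rows = []
    for match in response["matches"]:
        wine_id = int(match["id"])  # ids were upserted as strings
        rows.append((wine_id, match["score"], descriptions.get(wine_id)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

ranked = rank_matches(response, descriptions)
for wine_id, score, text in ranked:
    print(wine_id, round(score, 4), text)
```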

We display a wine description from the dataset. Note that the cell below looks up id 13, whereas the top match returned by the query above is id 4.

In [35]:
print('texto: ',wine_df[wine_df['Unnamed: 0'] == 13]['description'].values[0])

texto:  This wine is in peak condition. The tannins and the secondary flavors dominate this ripe leather-textured wine. The fruit is all there as well: dried berries and hints of black-plum skins. It is a major wine right at the point of drinking with both the mature flavors and the fruit in the right balance.


# ChromaDB

We initialize the ChromaDB client:

In [36]:
client = chromadb.Client()

We create a collection and give it a name:

In [47]:
collection = client.create_collection(name="vinos_collection")

We create the ids and the vectors to be inserted into the collection:

In [48]:
ids = [str(i) for i in range(30)]
vectors = [word2vec_embeddings[i].tolist() for i in range(30)]

We insert the ids and vectors into the collection:

In [49]:
collection.add(ids=ids, embeddings=vectors)

We define a query and compute its embedding with word2vec:

In [50]:
query_str = 'wine'
query_vector = generate_word2vec_embeddings([query_str])[0].tolist()

Then we run the query against the collection, specifying how many results we want:

In [51]:
results = collection.query(query_embeddings=query_vector, n_results=5)
results

{'ids': [['29', '8', '21', '17', '11']],
 'distances': [[8.08436107635498,
   8.170435905456543,
   8.233306884765625,
   8.275045394897461,
   8.287361145019531]],
 'metadatas': [[None, None, None, None, None]],
 'embeddings': None,
 'documents': [[None, None, None, None, None]],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}
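The Pinecone query above ranked ids 4, 29, 8, while this one ranks 29, 8, 21. A likely reason is the metric: the Pinecone index was created with cosine similarity, whereas this collection was created without specifying a distance, and Chroma's default is, to my knowledge, squared L2. The sketch below uses hypothetical toy vectors and helper functions to show that the two metrics can disagree on raw vectors and that they become equivalent once vectors are normalized to unit length.

```python
import numpy as np

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.dot(diff, diff))

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy vectors (illustrative): `far` points the same way as the query but is longer
query = np.array([1.0, 0.0])
far = np.array([5.0, 0.0])    # best by cosine, worst by L2
near = np.array([1.0, 1.0])   # worse by cosine, best by L2

raw_l2 = (squared_l2(query, far), squared_l2(query, near))
raw_cos = (cosine(query, far), cosine(query, near))
print(raw_l2, raw_cos)

# For unit vectors, squared L2 equals 2 - 2 * cosine, so the rankings agree
unit_l2 = squared_l2(normalize(query), normalize(near))
identity_rhs = 2.0 - 2.0 * cosine(query, near)
print(unit_l2, identity_rhs)
```

To compare the two databases on equal terms, one option is to normalize the embeddings before inserting them. Another is to configure the collection for cosine distance; Chroma's documentation describes an `hnsw:space` setting for this, which should be checked against the installed version.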

We print the 5 results that best match the query, along with their distance and id (a lower distance means a closer match):

In [70]:
for dist, doc_id in zip(results['distances'][0], results['ids'][0]):
    print('id: ', doc_id)
    print('distancia: ', dist)
    print('texto: ', wine_df[wine_df['Unnamed: 0'] == int(doc_id)]['description'].values[0])
    print('\n')

id:  29
distancia:  8.08436107635498
texto:  This standout Rocks District wine brings earth shaking aromas of black-olive brine, tapenade, green olive, stargazer lilies, orange peel and crushed gravel. The smoked meat, charcuterie and blue-fruit flavors don't hold back, bringing a lovely sense of texture and detail. It's an intense wine that completely demands your attention.


id:  8
distancia:  8.170435905456543
texto:  This re-named vineyard was formerly bottled as deLancellotti. You'll find striking minerality underscoring chunky black fruits. Accents of citrus and graphite comingle, with exceptional midpalate concentration. This is a wine to cellar, though it is already quite enjoyable. Drink now through 2030.


id:  21
distancia:  8.233306884765625
texto:  Alluring, complex and powerful aromas of grilled meat, berries, tea, smoke, vanilla and spice cover every base. An intense palate is concentrated but still elegant. Blackberry, molasses and mocha flavors finish with chocolaty o