<a href="https://colab.research.google.com/github/Viny2030/HUMAI/blob/main/3_Embeddings_Pre_entrenados.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/institutohumai/cursos-python/blob/master/NLP/3_Embeddings/3_Embeddings_Pre-entrenados.ipynb"> <img src='https://colab.research.google.com/assets/colab-badge.svg' /> </a>

# Importing Pre-trained Embeddings

So far we have learned how to train our embeddings from scratch on our own dataset. However, this is often unnecessary: chances are that many people have already faced the same problem and trained embeddings that can be reused.

In this class we will learn how to import pre-trained embeddings into our models.

## Downloading the data

In this section we will download a txt file containing 1,000,653 word embeddings of dimension 300, trained on the [Spanish Billion Words Corpus](https://crscardellino.ar/SBWCE/). These embeddings were trained with word2vec, using the following parameters:

* The selected algorithm was the skip-gram model with negative sampling.
* The minimum word frequency was 5.
* The number of “noise words” for negative sampling was 20.
* The 273 most common words were downsampled.
* The dimension of the final word embedding was 300.

The original corpus had the following amount of data:

* A total of 1,420,665,810 raw words.
* A total of 46,925,295 sentences.
* A total of 3,817,833 unique tokens.

After applying the skip-gram model, filtering out words with fewer than 5 occurrences, and downsampling the 273 most common words, the following values were obtained:

* A total of 771,508,817 raw words.
* A total of 1,000,653 unique tokens.
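As a quick check on these figures, the following snippet computes what fraction of the corpus survived the filtering. The variable names are illustrative; the numbers are copied from the two lists.

```python
# Corpus statistics quoted in the lists (before and after filtering).
raw_words_before = 1_420_665_810
raw_words_after = 771_508_817
unique_tokens_before = 3_817_833
unique_tokens_after = 1_000_653

# Fraction of running words and of distinct tokens that were kept.
kept_words = raw_words_after / raw_words_before
kept_types = unique_tokens_after / unique_tokens_before

print(f"running words kept: {kept_words:.1%}")
print(f"unique tokens kept: {kept_types:.1%}")
```

Roughly half of the running words but only about a quarter of the distinct tokens remain. This is consistent with most distinct tokens in a large corpus being rare, so the minimum-frequency cut removes many of them.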

The following cell downloads and decompresses the txt file with the embeddings. The bzip2 compression algorithm is rather slow, so be patient: decompressing the file can take a few minutes.

In [1]:
!wget https://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.txt.bz2
!bzip2 -d SBW-vectors-300-min5.txt.bz2

--2025-01-27 11:32:55--  https://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.txt.bz2
Resolving cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)... 200.16.17.55
Connecting to cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)|200.16.17.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 818175453 (780M) [application/x-bzip2]
Saving to: ‘SBW-vectors-300-min5.txt.bz2’


2025-01-27 11:33:37 (19.1 MB/s) - ‘SBW-vectors-300-min5.txt.bz2’ saved [818175453/818175453]
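If disk space is tight, Python's standard `bz2` module can read a compressed file as a text stream without first decompressing it to disk. The sketch below is an alternative to the `bzip2 -d` step, not what the notebook does. It uses a tiny stand-in file (`sample.txt.bz2`, a hypothetical name with made-up contents) so that it runs on its own.

```python
import bz2
import os
import tempfile

# Build a tiny stand-in for SBW-vectors-300-min5.txt.bz2 (hypothetical data).
sample = "2 3\nde 0.1 0.2 0.3\nla 0.4 0.5 0.6\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    f.write(sample)

# Stream the compressed file line by line; no decompressed copy is written.
with bz2.open(path, "rt", encoding="utf-8") as f:
    header = f.readline().split()
    n_vectors, dim = int(header[0]), int(header[1])
    first_token = f.readline().split(" ")[0]

print(n_vectors, dim, first_token)
```

Reading this way trades disk space for decompression time on every pass over the file.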



The first line of the txt file gives the number of embeddings and the dimension of each one. Every subsequent line contains a token followed by its vector.

In [2]:
with open( "SBW-vectors-300-min5.txt", 'r') as f:
  n_lin = 0
  for line in f:
    print(line)
    n_lin += 1
    if n_lin>3: break


1000653 300

de -0.029648 0.011336 0.019949 -0.088832 -0.025225 0.056844 0.025473 0.014068 0.163694 -0.067154 0.014738 0.027134 0.066443 -0.044846 -0.044987 -0.040898 0.030311 0.034196 -0.049240 0.008537 -0.068091 -0.087938 0.035300 0.149385 -0.012350 0.012613 0.029350 0.069596 0.039111 0.057652 0.069954 -0.066217 -0.041784 0.028623 0.026772 -0.066392 0.002953 -0.012188 -0.030363 0.040222 0.034858 0.027469 -0.029034 -0.048748 -0.038582 -0.051553 -0.033501 -0.019008 0.003043 0.110712 -0.025096 0.111082 0.035244 0.114207 0.010195 0.051511 -0.040649 -0.113944 0.044873 0.052011 0.067360 0.049054 -0.127085 -0.031846 0.032848 0.040825 -0.084873 0.059801 -0.067424 0.016531 -0.084565 0.057024 0.083288 -0.010136 -0.048508 0.051757 0.046664 0.018102 -0.052320 -0.000765 0.053662 -0.009967 0.082858 0.009068 0.054575 -0.003466 -0.023376 0.023069 0.088513 0.018504 -0.039503 -0.032980 -0.002139 0.000010 -0.107627 0.007699 0.046351 -0.003062 0.030500 0.113650 0.032536 -0.097301 -0.013734 0.098345 0.08
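This file format can be parsed in a few lines. The following is a minimal sketch on an in-memory sample; `parse_w2v_text` is a hypothetical helper, not part of the notebook, and it assumes tokens contain no spaces.

```python
import io

def parse_w2v_text(stream):
    """Parse word2vec text format: a 'count dim' header, then 'token v1 ... v_dim' lines."""
    count, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        token, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {token!r}, got {len(values)}")
        vectors[token] = values
    return count, dim, vectors

# Tiny made-up sample with 2 tokens of dimension 3.
sample = io.StringIO("2 3\nde -0.03 0.01 0.02\nla 0.05 -0.04 0.00\n")
count, dim, vectors = parse_w2v_text(sample)
print(count, dim, vectors["de"])
```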

## Loading the contents into memory

Next we create a class that lets us store the embeddings in memory and access them in a more structured way.

This class has several useful attributes:
* idx_to_token: a list containing the tokens
* idx_to_vec: a tensor containing the embeddings, one row per token
* dim: the dimension of the embeddings
* token_to_idx: a dictionary that maps each token to its index


In [3]:
import torch
class TokenEmbedding:
  """Token Embedding."""
  def __init__(self, file_name, n):
    self.idx_to_token, self.idx_to_vec, self.dim = self._load_embedding(
        file_name, n)
    self.unknown_idx = 0
    self.token_to_idx = {token: idx for idx, token in
                          enumerate(self.idx_to_token)}


  def _load_embedding(self, file_name, n):
    idx_to_token, idx_to_vec = ['<unk>'], []
    with open( file_name, 'r') as f:
      first_read = True
      i=0
      for line in f:
        if n<i: break
        else: i+=1
        if first_read:
          first_read = False
          continue
        elems = line.rstrip().split(' ')
        token, elems = elems[0], [float(elem) for elem in elems[1:]]
        # Skip header information, such as the top row in fastText
        if len(elems) > 1:
            idx_to_token.append(token)
            idx_to_vec.append(elems)
    idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec
    return idx_to_token, torch.tensor(idx_to_vec), len(idx_to_vec[0])

  def __getitem__(self, tokens):
    indices = [self.token_to_idx.get(token, self.unknown_idx)
                for token in tokens]
    vecs = self.idx_to_vec[torch.tensor(indices)]
    return vecs

  def __len__(self):
    return len(self.idx_to_token)

Next we load the embeddings into the `spanish_w2v` object. Note that we only load 500k tokens, because the full set does not fit in memory.



In [8]:
spanish_w2v = TokenEmbedding("SBW-vectors-300-min5.txt",500000)
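A back-of-the-envelope estimate shows why loading everything is costly. The final float32 tensor is modest, but `_load_embedding` first builds nested Python lists, where each number is a separate float object. The sketch assumes 64-bit CPython, where each list slot holds an 8-byte pointer; the overhead of the list objects themselves is ignored.

```python
import sys

n_tokens, dim = 1_000_653, 300

# Final storage: one float32 (4 bytes) per value.
tensor_bytes = n_tokens * dim * 4

# Intermediate storage: one Python float object plus one list pointer per value.
# The pointer size of 8 bytes is an assumption (64-bit build).
per_value = sys.getsizeof(0.5) + 8
list_bytes = n_tokens * dim * per_value

print(f"float32 tensor: {tensor_bytes / 1e9:.2f} GB")
print(f"Python lists:   {list_bytes / 1e9:.2f} GB")
```

Under these assumptions the intermediate lists need several times the memory of the final tensor, so loading fewer tokens is a simple way to stay within the RAM of a notebook session.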

With the `spanish_w2v` object it is easy to access the embedding of any loaded word.

In [9]:
id_mesa = spanish_w2v.token_to_idx["mesa"]
spanish_w2v.idx_to_vec[id_mesa]

tensor([-7.2965e-02,  1.8812e-02,  1.0573e-01,  3.3087e-02, -8.5658e-02,
         9.9318e-02, -6.4557e-02,  4.0981e-02,  2.6975e-02, -6.2916e-02,
         5.1500e-04, -2.4984e-02,  1.7487e-02, -3.5730e-03, -2.8035e-02,
         1.3317e-02,  5.2800e-02, -3.8670e-03,  2.7829e-02, -6.9262e-02,
        -3.3747e-02, -4.3120e-02, -4.5179e-02,  7.9108e-02,  9.0945e-02,
        -2.9899e-02, -1.9439e-02,  1.1969e-01,  5.3333e-02,  4.1652e-02,
        -7.9298e-02, -1.1909e-01,  4.7590e-03, -8.5445e-02, -5.1491e-02,
         2.8829e-02,  4.3646e-02,  2.3469e-02, -3.9472e-02,  6.6565e-02,
         3.4349e-02, -1.1352e-01,  2.9633e-02, -6.8393e-02, -7.8980e-02,
        -5.1030e-02,  1.5873e-02,  6.8210e-03, -1.3847e-02, -1.0377e-01,
         1.5152e-02,  6.3327e-02,  4.6139e-02,  4.1723e-02,  7.0962e-02,
        -5.1490e-02, -3.6193e-02, -6.5074e-02,  5.2661e-02,  4.8595e-02,
        -3.9467e-02,  1.7295e-02,  2.7180e-02, -5.6163e-02, -1.5222e-01,
        -1.8895e-02, -3.7351e-02, -9.4126e-02,  1.0
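Note that the two access paths behave differently for words that were not loaded: `token_to_idx[...]` is a plain dictionary lookup and raises `KeyError`, while `spanish_w2v[[...]]` falls back to index 0, the all-zero `<unk>` vector. The toy sketch below mirrors that logic with plain Python lists; the vocabulary and vectors are made up.

```python
# Toy stand-ins for the attributes of TokenEmbedding (made-up values).
idx_to_token = ["<unk>", "mesa", "silla"]
idx_to_vec = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
unknown_idx = 0

def lookup(tokens):
    """Same fallback as TokenEmbedding.__getitem__: unknown tokens get index 0."""
    return [idx_to_vec[token_to_idx.get(t, unknown_idx)] for t in tokens]

print(lookup(["mesa", "xyzzy"]))

# A direct dictionary lookup has no fallback.
try:
    token_to_idx["xyzzy"]
    raised = False
except KeyError:
    raised = True
print("direct dict lookup raised KeyError:", raised)
```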

## Evaluating the pre-trained embeddings

Using the loaded word2vec vectors, we will demonstrate their semantics by applying them to the following word similarity and analogy tasks.

### Word Similarity

To find words that are semantically similar to an input word, based on the cosine similarity between word vectors, we implement the following knn (nearest neighbors) function.

In [10]:
def knn(W, x, k):
    # Cosine similarity between the query x and every row of W;
    # the 1e-9 term avoids division by zero for all-zero rows such as '<unk>'
    cos = torch.mv(W, x.reshape(-1,)) / (
        torch.sqrt(torch.sum(W * W, axis=1) + 1e-9) *
        torch.sqrt((x * x).sum()))
    _, topk = torch.topk(cos, k=k)
    return topk, [cos[int(i)] for i in topk]

In [17]:
def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1)
    for i, c in zip(topk[1:], cos[1:]):  # Exclude the input word
        print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}')

get_similar_tokens('mesa', 10, spanish_w2v)

cosine sim=0.666: mesas
cosine sim=0.611: tapete
cosine sim=0.599: sentarse
cosine sim=0.585: sentarlos
cosine sim=0.569: mantel
cosine sim=0.562: redonda
cosine sim=0.557: sentasen
cosine sim=0.533: sentándose
cosine sim=0.525: mesita
cosine sim=0.522: sillas
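To see what `knn` computes without the tensor machinery, here is the same cosine-similarity ranking written in plain Python on three made-up 2-D vectors. The helper names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_plain(vectors, query, k):
    """Return the k tokens whose vectors have the highest cosine similarity to query."""
    scored = [(cosine(vec, query), token) for token, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(token, score) for score, token in scored[:k]]

# Made-up vectors: 'mesas' points almost the same way as 'mesa', 'perro' does not.
vectors = {"mesa": [1.0, 0.0], "mesas": [0.9, 0.1], "perro": [0.0, 1.0]}
neighbours = knn_plain(vectors, vectors["mesa"], k=2)
print(neighbours)
```

The query word itself ranks first with similarity 1, which is why `get_similar_tokens` asks for `k + 1` neighbours and skips the first one.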


Next, we search for similar words using the pre-trained embeddings of the `TokenEmbedding` instance.

In [13]:
get_similar_tokens('dia', 7, spanish_w2v)

cosine sim=0.665: dias
cosine sim=0.604: esperança
cosine sim=0.594: despues
cosine sim=0.587: festes
cosine sim=0.584: día
cosine sim=0.577: allà
cosine sim=0.575: manana


In [11]:
get_similar_tokens('muchacho', 3, spanish_w2v)

cosine sim=0.784: joven
cosine sim=0.765: niño
cosine sim=0.759: chico


### Analogies

Besides finding similar words, we can also apply word vectors to word analogy tasks. For example, “hombre”:“mujer”::“hijo”:“hija” is the form of an analogy: “hombre” (man) is to “mujer” (woman) as “hijo” (son) is to “hija” (daughter). Specifically, the word analogy completion task can be defined as follows: for a word analogy $a:b :: c:d$, given the first three words $a$, $b$ and $c$, find $d$. Denote the vector of word $w$ as $vec(w)$. To complete the analogy, we look for the word whose vector is most similar to the result of $vec(c) + vec(b) - vec(a)$.

In [14]:
def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed[[token_a, token_b, token_c]]
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    return embed.idx_to_token[int(topk[0])]  # Token whose vector is closest to x
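One caveat: `get_analogy` returns the single nearest neighbour of $vec(b) - vec(a) + vec(c)$, which can be one of the three input words. A common variant skips the inputs when ranking. The sketch below shows that variant on made-up 2-D vectors in plain Python; it is an illustration, not the notebook's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Solve a:b :: c:? by ranking against vec(b) - vec(a) + vec(c), skipping a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {t: v for t, v in vectors.items() if t not in (a, b, c)}
    return max(candidates, key=lambda t: cosine(candidates[t], target))

# Made-up vectors: first axis ~ "family role", second axis ~ "gender".
vectors = {
    "hombre": [1.0, 0.0], "mujer": [1.0, 1.0],
    "hijo": [2.0, 0.0], "hija": [2.0, 1.0],
    "mesa": [-1.0, 0.5],
}
result = analogy("hombre", "mujer", "hijo", vectors)
print(result)
```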

Let's verify the "hombre-mujer" (man-woman) analogy using the loaded word vectors.

In [19]:
get_analogy('hombre', 'mujer', 'abuelo', spanish_w2v)

'abuela'

In [15]:
get_analogy('hombre', 'mujer', 'hijo', spanish_w2v)

'hija'

Let's look at a "country-demonym" analogy.

In [16]:
get_analogy('Argentina', 'argentino', 'España', spanish_w2v)

'español'

Try your own analogies and see which ones word2vec captured and which ones it did not.

## Generating an Embedding layer with pre-trained vectors

What we need to do at this point is create an Embedding layer, that is, a lookup table that maps integer indices (which represent words) to dense vectors. It takes integers as input, looks them up in an internal table, and returns the associated vectors.
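Conceptually, an embedding layer is just row indexing into a matrix. The following plain-Python sketch (made-up weights, illustrative names) shows the lookup for a batch of index sequences.

```python
# A 4-word vocabulary with 3-dimensional vectors (made-up values).
weight = [
    [0.0, 0.0, 0.0],   # index 0
    [0.1, 0.2, 0.3],   # index 1
    [0.4, 0.5, 0.6],   # index 2
    [0.7, 0.8, 0.9],   # index 3
]

def embed(batch):
    """Replace every integer index in a batch of sequences with its row of weight."""
    return [[weight[idx] for idx in sequence] for sequence in batch]

batch = [[1, 3], [2, 0]]   # 2 sequences of length 2
out = embed(batch)
print(out)                 # nested lists with shape 2 x 2 x 3
```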



However, we are not going to use all the embeddings we downloaded, only those for the words in the vocabulary we built for our task. In our case we will reuse the Quijote vocabulary from the previous class.

In [3]:
!pip install torchtext



In [1]:
from collections import Counter, OrderedDict
from torchtext.vocab import vocab

!wget https://www.gutenberg.org/files/2000/2000-0.txt
!mv "2000-0.txt" "quijote.txt"

def read_txt(filename, n_ignored):
    """
    Loads a txt file into a list of sentences (one per line).
    Each sentence is in turn a list of whitespace-separated tokens.
    Ignores the first n_ignored lines.
    """
    with open(filename) as f:
        raw_text = f.read()
    return [line.split() for i,line in enumerate(raw_text.split('\n')) if i>=n_ignored]

def make_vocab(oraciones,min_freq=1):
  # If oraciones is a list of lists, flatten it into a single list of tokens;
  # otherwise assume it is already a flat list of tokens
  if oraciones and isinstance(oraciones[0], list):
    tokens = [token for line in oraciones for token in line]
  else:
    tokens = list(oraciones)
  counter_obj = Counter(tokens)
  # Sort by descending frequency so the most frequent tokens get the lowest indices
  sorted_by_freq_tuples = sorted(counter_obj.items(), key=lambda x: x[1], reverse=True)
  ordered_dict = OrderedDict(sorted_by_freq_tuples)
  vocabulario = vocab(ordered_dict, min_freq=min_freq)
  return vocabulario, ordered_dict

oraciones_quijote = read_txt("quijote.txt",28)
vocab_quijote, ordered_dict = make_vocab(oraciones_quijote,10)
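The effect of `min_freq` can also be reproduced with the standard library alone. This sketch builds a token-to-index mapping from a toy corpus; `build_vocab` is a hypothetical helper and only approximates what `torchtext.vocab.vocab` returns.

```python
from collections import Counter

def build_vocab(sentences, min_freq=1):
    """Map tokens seen at least min_freq times to indices, most frequent first."""
    counts = Counter(token for sentence in sentences for token in sentence)
    # most_common() sorts by descending count.
    kept = [token for token, freq in counts.most_common() if freq >= min_freq]
    return {token: idx for idx, token in enumerate(kept)}

sentences = [["en", "un", "lugar"], ["de", "la", "Mancha"], ["en", "la", "Mancha"]]
token_to_idx = build_vocab(sentences, min_freq=2)
print(token_to_idx)
```

Only the three tokens that occur twice survive the cut, which is the same kind of pruning that reduces the Quijote vocabulary when `min_freq=10`.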




--2025-01-27 13:34:26--  https://www.gutenberg.org/files/2000/2000-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2226045 (2.1M) [text/plain]
Saving to: ‘2000-0.txt’


2025-01-27 13:34:27 (4.65 MB/s) - ‘2000-0.txt’ saved [2226045/2226045]



In [2]:
f"This vocabulary has a size of {len(vocab_quijote.get_itos())}"

'This vocabulary has a size of 3241'

We need to build a weight matrix that will be loaded into the PyTorch embedding layer. Its shape will be `(length of the dataset vocabulary, dimension of the word vectors)`.

For each word in the dataset vocabulary, we check whether it is in the word2vec vocabulary. If it is, we load its pre-trained word vector. Otherwise, we initialize a random vector.
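This rule can be sketched on toy data before applying it to the real vocabulary. Everything here (the two-word `pretrained` table, the vocabulary, the helper name, the standard deviation of 0.6) is chosen for illustration only.

```python
import random

def build_weights(vocab_tokens, pretrained, dim, seed=0):
    """Copy the pre-trained vector when available, otherwise draw a random one."""
    rng = random.Random(seed)
    weights, missing = [], []
    for token in vocab_tokens:
        if token in pretrained:
            weights.append(list(pretrained[token]))
        else:
            # No pre-trained vector: sample each component from N(0, 0.6^2).
            weights.append([rng.gauss(0.0, 0.6) for _ in range(dim)])
            missing.append(token)
    return weights, missing

pretrained = {"mesa": [0.1, 0.2], "silla": [0.3, 0.4]}
vocab_tokens = ["mesa", "Quijote,", "silla"]
weights, missing = build_weights(vocab_tokens, pretrained, dim=2)
print(len(weights), "rows;", "missing:", missing)
```

Tokens such as `Quijote,` illustrate a practical point: because the text is split on whitespace only, punctuation stays attached to words, and such tokens are unlikely to appear in a pre-trained vocabulary.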

In [6]:
import numpy as np

In [29]:
import numpy as np

matrix_len = len(vocab_quijote.get_itos())
weights_matrix = np.zeros((matrix_len, spanish_w2v.dim))
words_found = 0
words_not_found = []

for i, word in enumerate(vocab_quijote.get_itos()):
    # spanish_w2v[[word]] never raises KeyError (unknown tokens map to the
    # all-zero '<unk>' vector), so check membership explicitly
    if word in spanish_w2v.token_to_idx:
        weights_matrix[i] = spanish_w2v[[word]][0].numpy()
        words_found += 1
    else:
        weights_matrix[i] = np.random.normal(scale=0.6, size=(spanish_w2v.dim, ))
        words_not_found.append(word)

weights_matrix.shape

We can now compare how many words of our vocabulary had a pre-trained word2vec vector against the size of the vocabulary.

In [30]:
words_found, matrix_len

We can also check that the row of the matrix matches the vector we loaded from the file (except for floating-point differences).

In [31]:
word = "mesa"
i = vocab_quijote[word]
[(weights_matrix[i][j],spanish_w2v[[word]][0][j]) for j in range(300)]



Finally, we create a neural network with an Embedding layer as its first layer (we load the weight matrix into it) followed by a GRU layer. In the forward pass we must call the embedding layer first.

In [32]:
from torch import nn

def create_emb_layer(weights_matrix, non_trainable=False):
    num_embeddings, embedding_dim = weights_matrix.shape
    emb_layer = nn.Embedding(num_embeddings, embedding_dim)
    emb_layer.load_state_dict({'weight': torch.tensor(weights_matrix)})
    if non_trainable:
        emb_layer.weight.requires_grad = False

    return emb_layer, num_embeddings, embedding_dim
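PyTorch also offers `nn.Embedding.from_pretrained`, which builds the layer and copies the weights in one call; its `freeze` argument plays the role of `non_trainable` in `create_emb_layer`. A minimal sketch with a made-up 4 x 3 matrix:

```python
import torch
from torch import nn

# Made-up weight matrix: 4 tokens, 3 dimensions.
weights = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# freeze=True sets requires_grad=False on the embedding weights.
emb = nn.Embedding.from_pretrained(weights, freeze=True)

indices = torch.tensor([[0, 2], [3, 1]])   # batch of 2 sequences, length 2
vectors = emb(indices)

print(vectors.shape)
print(emb.weight.requires_grad)
```

Freezing keeps the pre-trained vectors fixed during training; passing `freeze=False` would let them be fine-tuned together with the rest of the network.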

In [33]:
class ToyNN(nn.Module):
    def __init__(self, weights_matrix, hidden_size, num_layers):
        super().__init__()
        self.embedding, num_embeddings, embedding_dim = create_emb_layer(weights_matrix, True)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.gru = nn.GRU(embedding_dim, hidden_size, num_layers, batch_first=True)

    def forward(self, inp, hidden):
        return self.gru(self.embedding(inp), hidden)

In [34]:
model = ToyNN(weights_matrix,256,2)
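To check that the pieces fit together, the sketch below runs a forward pass through an equivalent embedding-plus-GRU stack, using a small random weight matrix instead of the real one. The sizes (vocabulary of 10, embedding dimension 8, hidden size 16) are arbitrary choices for illustration.

```python
import torch
from torch import nn

vocab_size, emb_dim, hidden_size, num_layers = 10, 8, 16, 2

# Frozen embedding from a random stand-in matrix, followed by a GRU.
embedding = nn.Embedding.from_pretrained(torch.randn(vocab_size, emb_dim), freeze=True)
gru = nn.GRU(emb_dim, hidden_size, num_layers, batch_first=True)

batch_size, seq_len = 4, 5
inp = torch.randint(0, vocab_size, (batch_size, seq_len))   # token indices
hidden = torch.zeros(num_layers, batch_size, hidden_size)   # initial hidden state

output, h_n = gru(embedding(inp), hidden)
print(output.shape)   # (batch_size, seq_len, hidden_size)
print(h_n.shape)      # (num_layers, batch_size, hidden_size)
```

With `batch_first=True` the input and output tensors put the batch dimension first, while the hidden state keeps the layer dimension first.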