<a href="https://colab.research.google.com/github/ahcamachod/1904-word2vec-entrenamiento-de-word-embedding/blob/aula-5/word2vec_entrenamiento.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec: Training Word Embeddings

This notebook walks through a project for training your own word embedding representations with **Word2Vec**.


The main Python resources we build on are:


*https://spacy.io/*

*https://radimrehurek.com/gensim/models/word2vec.html*


The design of the Word2Vec architectures is documented in:


*https://arxiv.org/pdf/1301.3781.pdf*

## Lesson 1

### 1.2 Getting started with spaCy

In [1]:
import logging 

logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.INFO)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd

noticias_train = pd.read_csv('/content/drive/MyDrive/word2vec/noticias_entrenamiento.csv')
noticias_test = pd.read_csv('/content/drive/MyDrive/word2vec/noticias_prueba.csv')

2022-05-21 16:26:58,041 - NumExpr defaulting to 2 threads.


In [4]:
noticias_train.shape

(91844, 7)

In [5]:
noticias_train.sample(2)

Unnamed: 0,fecha,titulo,pais,extracto,resumen,enlace,categoria
46969,2022-03-17 12:33:17,"Correos en Canarias niega que exista ""precarie...",ES,Correos en Canarias se encuentra inmersa en un...,Oficina de Correos. Archivo Correos en Canaria...,https://diariodeavisos.elespanol.com/2022/03/c...,economia
36498,2022-04-03 22:51:17,Bad Bunny gana Grammy por mejor disco de músic...,PR,El cantante tiene otras seis nominaciones,"Benito Ocasio Martínez, conocido como 'Bad Bun...",https://www.metro.pr/entretenimiento/2022/04/0...,entretenimiento


In [6]:
noticias_test.shape

(22961, 7)

In [7]:
noticias_test.sample(2)

Unnamed: 0,fecha,titulo,pais,extracto,resumen,enlace,categoria
9045,2022-03-23 08:00:00,Detuvieron en Hungría al principal sospechoso ...,AR,El principal sospechoso de la muerte del exrug...,El principal sospechoso de la muerte del exrug...,https://www.grupolaprovincia.com/deportes/detu...,deportes
18886,2022-03-31 15:11:30,Casi 35.000 firmantes piden incluir la memoria...,ES,"Entre los promotores de la iniciativa, difundi...",Representantes de organizaciones de memoria y ...,https://www.publico.es/politica/35-000-firmant...,politica


In [8]:
#!python -m spacy download es_core_news_sm

In [9]:
import spacy

nlp = spacy.load("es_core_news_sm")

### 1.3 spaCy and its structures

In [10]:
texto = "Big Data es una ciencia que nos permite trabajar velozmente con grandes volúmenes de datos."

In [11]:
doc = nlp(texto)

In [12]:
type(doc)

spacy.tokens.doc.Doc

In [13]:
doc[0]

Big

In [14]:
doc.ents

(Big Data,)

In [15]:
doc[5].is_stop

True

In [16]:
doc[9].is_stop

False

In [17]:
doc[9].pos_

'ADV'

## Lesson 2

### 2.1 Preprocessing the data

In [18]:
texto_para_tratamiento = (titulo.lower() for titulo in noticias_train.titulo)

In [19]:
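def trata_texto(doc):
  """Return the alphabetic, non-stop-word tokens of `doc` joined by spaces.

  Returns None when two or fewer tokens survive the filter.
  """
  token_valido = []
  for token in doc:
    # Keep only purely alphabetic tokens that are not stop words.
    valido = not token.is_stop and token.is_alpha
    if valido:
      token_valido.append(token.text)
  if len(token_valido) > 2:
    return " ".join(token_valido)
  return None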

In [20]:
texto = "Big Data es una ciencia que nos permite trabajar velozmente con grandes volúmenes de datos."
doc = nlp(texto)
trata_texto(doc)

'Big Data ciencia permite velozmente volúmenes datos'

In [21]:
texto = "Big Data 4673423 es una ciencia! que nos permite  %¨&%$$%& trabajar velozmente con grandes -0898982 volúmenes de datos."
doc = nlp(texto)
trata_texto(doc)

'Big Data ciencia permite velozmente volúmenes datos'
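
Note that `trata_texto` yields `None` whenever two or fewer tokens survive the filter, which is where the null rows counted later in the notebook come from. A minimal sketch of that behavior, assuming only the `text`, `is_stop`, and `is_alpha` attributes matter; `FakeToken` is a hypothetical stand-in, not a spaCy class:

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the three token attributes used by the filter."""
    text: str
    is_stop: bool = False
    is_alpha: bool = True


def trata_texto(doc):
    # Same logic as the notebook function: keep alphabetic, non-stop-word tokens.
    token_valido = [t.text for t in doc if not t.is_stop and t.is_alpha]
    if len(token_valido) > 2:
        return " ".join(token_valido)
    return None  # two or fewer valid tokens: the title is discarded


largo = [FakeToken("big"), FakeToken("data"), FakeToken("es", is_stop=True),
         FakeToken("ciencia"), FakeToken("4673423", is_alpha=False)]
corto = [FakeToken("hola"), FakeToken("de", is_stop=True), FakeToken("mundo")]

resultado_largo = trata_texto(largo)
resultado_corto = trata_texto(corto)
print(resultado_largo)
print(resultado_corto)
```

The short title keeps only two valid tokens, so it comes back as `None` and ends up as a missing value once the results are placed in a DataFrame.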

### 2.2 Optimizing the data processing

In [22]:
from time import time

t0 = time()
texto_tratado = [trata_texto(doc) for doc in nlp.pipe(texto_para_tratamiento, batch_size=1000, n_process=-1)]
tf = time()-t0

print(tf/60)

3.1199692209561665


In [23]:
titulos_tratados = pd.DataFrame({'titulo': texto_tratado})
titulos_tratados.head()

Unnamed: 0,titulo
0,tenso debate senado argentino refinanciamiento...
1,triunfo dramático cruz azul duelo campeones atlas
2,moderar inflación urgente marcha plan calviño
3,llega the batman a hbo max mira fecha estreno ...
4,guzmán default fmi implicaba ajuste y caída pr...


In [24]:
len(titulos_tratados)

91844

## Lesson 3

### 3.1 Word2Vec hyperparameters

In [25]:
from gensim.models import Word2Vec

#modelo_w2v = Word2Vec(sg=0, size=300, window=2)

2022-05-21 16:30:14,118 - 'pattern' package not found; tag filters are not available for English


### 3.2 More on the hyperparameters

In [60]:
# Alura Aluras Aura Alure
# `size` is the gensim 3.x parameter name; gensim 4 renamed it to `vector_size`.
modelo_w2v = Word2Vec(sg=0, size=300, window=2, min_count=5, alpha=0.03, min_alpha=0.007)
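
`alpha` and `min_alpha` bound the learning rate: the gensim documentation describes the rate as dropping linearly from `alpha` to `min_alpha` as training progresses. A rough sketch of such a schedule follows; the helper `learning_rate` and the sampled progress points are illustrative assumptions, not part of the gensim API:

```python
def learning_rate(progress, alpha=0.03, min_alpha=0.007):
    """Linearly interpolate from alpha (progress=0.0) to min_alpha (progress=1.0)."""
    return alpha - (alpha - min_alpha) * progress


# Sample the schedule at the start, quarter points, and end of training.
schedule = [round(learning_rate(p), 5) for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(schedule)
```

With the values used in this notebook, the rate starts at 0.03 and ends at 0.007, so later updates move the vectors less than early ones.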

### 3.3 Word2Vec vocabulary

In [27]:
#modelo_w2v.build_vocab()

In [28]:
#lista_lista_tokens = [titulo.split(" ") for titulo in titulos_tratados.titulo]

In [29]:
titulos_tratados.isnull().value_counts()

titulo
False     90725
True       1119
dtype: int64

In [30]:
len(titulos_tratados)

91844

In [31]:
titulos_tratados = titulos_tratados.dropna().drop_duplicates()

In [32]:
len(titulos_tratados)

82706

In [33]:
lista_lista_tokens = [titulo.split(" ") for titulo in titulos_tratados.titulo]

In [61]:
modelo_w2v.build_vocab(lista_lista_tokens, progress_per=5000)

2022-05-21 19:31:41,917 - collecting all words and their counts
2022-05-21 19:31:41,919 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-05-21 19:31:41,937 - PROGRESS: at sentence #5000, processed 36024 words, keeping 11773 word types
2022-05-21 19:31:41,955 - PROGRESS: at sentence #10000, processed 72289 words, keeping 17745 word types
2022-05-21 19:31:41,969 - PROGRESS: at sentence #15000, processed 108884 words, keeping 22201 word types
2022-05-21 19:31:41,985 - PROGRESS: at sentence #20000, processed 145197 words, keeping 25928 word types
2022-05-21 19:31:41,999 - PROGRESS: at sentence #25000, processed 181770 words, keeping 29043 word types
2022-05-21 19:31:42,011 - PROGRESS: at sentence #30000, processed 217994 words, keeping 31821 word types
2022-05-21 19:31:42,027 - PROGRESS: at sentence #35000, processed 254754 words, keeping 34393 word types
2022-05-21 19:31:42,046 - PROGRESS: at sentence #40000, processed 291245 words, keeping 36753 word types
2022-05
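
`build_vocab` first counts every word type and then keeps only those seen at least `min_count` times, which is why the vocabulary used for training is much smaller than the number of word types reported in the log. A toy sketch of that filtering step with `collections.Counter`; the sample sentences and the threshold of 2 are made up for illustration, and gensim's real implementation does more than this:

```python
from collections import Counter

# Invented miniature corpus, already tokenized.
sentences = [
    ["google", "maps", "android"],
    ["google", "chrome", "android"],
    ["google", "doodle"],
]
min_count = 2  # illustrative; the notebook uses min_count=5

# Count every word type, then keep those that reach the threshold.
counts = Counter(word for sentence in sentences for word in sentence)
vocabulary = sorted(word for word, n in counts.items() if n >= min_count)

print(len(counts), "word types before filtering")
print(vocabulary)
```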

## Lesson 4

### 4.1 Training the CBOW model

In [35]:
dir(modelo_w2v)

['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_adapt_by_suffix',
 '_check_input_data_sanity',
 '_check_training_sanity',
 '_clear_post_train',
 '_do_train_epoch',
 '_do_train_job',
 '_get_job_params',
 '_get_thread_working_mem',
 '_job_producer',
 '_load_specials',
 '_log_epoch_end',
 '_log_epoch_progress',
 '_log_progress',
 '_log_train_end',
 '_minimize_model',
 '_raw_word_count',
 '_save_specials',
 '_set_train_params',
 '_smart_save',
 '_train_epoch',
 '_train_epoch_corpusfile',
 '_update_job_params',
 '_worker_loop',
 '_worker_loop_corpusfile',
 'accuracy',
 'alpha',
 'batch_words',
 'build_vocab',
 'build_vocab_from_freq',
 'ca

In [62]:
modelo_w2v.train(lista_lista_tokens, total_examples=modelo_w2v.corpus_count, epochs=30)

2022-05-21 19:32:04,284 - training model with 3 workers on 14832 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=2
2022-05-21 19:32:05,304 - EPOCH 1 - PROGRESS: at 69.56% examples, 344173 words/s, in_qsize 5, out_qsize 0
2022-05-21 19:32:05,646 - worker thread finished; awaiting finish of 2 more threads
2022-05-21 19:32:05,654 - worker thread finished; awaiting finish of 1 more threads
2022-05-21 19:32:05,673 - worker thread finished; awaiting finish of 0 more threads
2022-05-21 19:32:05,675 - EPOCH - 1 : training on 604214 raw words (497549 effective words) took 1.4s, 361487 effective words/s
2022-05-21 19:32:06,701 - EPOCH 2 - PROGRESS: at 71.24% examples, 352231 words/s, in_qsize 6, out_qsize 0
2022-05-21 19:32:07,066 - worker thread finished; awaiting finish of 2 more threads
2022-05-21 19:32:07,071 - worker thread finished; awaiting finish of 1 more threads
2022-05-21 19:32:07,091 - worker thread finished; awaiting finish of 0 more threads
2022-05-21 19

(14926199, 18126420)
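
The tuple returned by `train` appears to be the pair (effective words trained, raw words seen), and it can be cross-checked against the log: 604214 raw words per epoch over 30 epochs gives the second number exactly. A small arithmetic check using only figures copied from the output above; the interpretation of the gap between the two counts is an assumption based on the `min_count` and `sample` settings:

```python
raw_words_per_epoch = 604214        # from the EPOCH 1 log line
effective_words_epoch_1 = 497549    # from the EPOCH 1 log line
epochs = 30
trained_words, raw_words = 14926199, 18126420  # tuple returned by train()

# The raw total is exactly the per-epoch count times the number of epochs.
assert raw_words_per_epoch * epochs == raw_words

# Share of raw words actually used for training. The remainder is presumably
# dropped by min_count filtering and frequent-word downsampling (sample=0.001).
ratio_epoch_1 = effective_words_epoch_1 / raw_words_per_epoch
ratio_total = trained_words / raw_words
print(f"{ratio_epoch_1:.3f} {ratio_total:.3f}")
```

Both ratios land at roughly 82%, so the first epoch is representative of the whole run.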

In [38]:
modelo_w2v.wv.most_similar('google')

2022-05-21 16:47:44,439 - precomputing L2-norms of word weight vectors


[('maps', 0.6369116902351379),
 ('comandos', 0.5456988215446472),
 ('android', 0.535834789276123),
 ('chrome', 0.5238631367683411),
 ('apple', 0.49066129326820374),
 ('youtube', 0.4783938527107239),
 ('aplicación', 0.4661206901073456),
 ('gmail', 0.46569114923477173),
 ('emojis', 0.4551786482334137),
 ('búsquedas', 0.45385390520095825)]

In [39]:
modelo_w2v.wv.most_similar('microsoft')

[('nvidia', 0.6905907392501831),
 ('intel', 0.656556248664856),
 ('nothing', 0.6197691559791565),
 ('ubisoft', 0.6106555461883545),
 ('legends', 0.5697228312492371),
 ('valve', 0.5564619302749634),
 ('portátiles', 0.5517499446868896),
 ('anonymous', 0.5517434477806091),
 ('apex', 0.551315188407898),
 ('playstation', 0.5439505577087402)]

In [40]:
modelo_w2v.wv.most_similar('barcelona')

[('barça', 0.500052273273468),
 ('mazatlán', 0.4968731701374054),
 ('guaireña', 0.48919326066970825),
 ('fc', 0.4702135920524597),
 ('athletic', 0.4365805387496948),
 ('jude', 0.4323718547821045),
 ('laporta', 0.43132758140563965),
 ('ousmane', 0.42659521102905273),
 ('erling', 0.4196600914001465),
 ('mbappé', 0.41796988248825073)]

In [41]:
modelo_w2v.wv.most_similar('messi')

[('lionel', 0.6659276485443115),
 ('scaloni', 0.5988843441009521),
 ('neymar', 0.5686666965484619),
 ('stegen', 0.5630759596824646),
 ('bombonera', 0.5622174739837646),
 ('roccuzzo', 0.5572490692138672),
 ('psg', 0.5546132922172546),
 ('ter', 0.5535578727722168),
 ('lewandowski', 0.5412269830703735),
 ('alturria', 0.5191042423248291)]

In [42]:
modelo_w2v.wv.most_similar('ferrari')

[('leclerc', 0.7298225164413452),
 ('sainz', 0.6480276584625244),
 ('verstappen', 0.6208952069282532),
 ('bahrein', 0.5997858047485352),
 ('obradoiro', 0.5260280966758728),
 ('pole', 0.5216978788375854),
 ('hyundai', 0.507645845413208),
 ('position', 0.5043960809707642),
 ('australia', 0.5037184953689575),
 ('memphis', 0.48718520998954773)]
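
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal pure-Python sketch of that measure; the three-dimensional vectors are invented for illustration and are not taken from the trained model:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (made up): the first two point in similar directions.
toy_vectors = {
    "ferrari": [0.9, 0.1, 0.3],
    "leclerc": [0.8, 0.2, 0.4],
    "gmail": [-0.2, 0.9, -0.1],
}

# Rank every other word by its similarity to the query, highest first.
query = "ferrari"
ranking = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

A score near 1 means the two vectors point in almost the same direction; a score near 0 or below means the words rarely share contexts.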

### 4.2 Training the Skip-gram model

In [43]:
modelo_w2v_sg = Word2Vec(sg=1, size=300, window=5, min_count=5, alpha=0.03, min_alpha=0.007)

modelo_w2v_sg.build_vocab(lista_lista_tokens, progress_per=5000)

modelo_w2v_sg.train(lista_lista_tokens, total_examples=modelo_w2v_sg.corpus_count, epochs=30)

2022-05-21 16:59:02,385 - collecting all words and their counts
2022-05-21 16:59:02,392 - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-05-21 16:59:02,414 - PROGRESS: at sentence #5000, processed 36024 words, keeping 11773 word types
2022-05-21 16:59:02,429 - PROGRESS: at sentence #10000, processed 72289 words, keeping 17745 word types
2022-05-21 16:59:02,447 - PROGRESS: at sentence #15000, processed 108884 words, keeping 22201 word types
2022-05-21 16:59:02,466 - PROGRESS: at sentence #20000, processed 145197 words, keeping 25928 word types
2022-05-21 16:59:02,483 - PROGRESS: at sentence #25000, processed 181770 words, keeping 29043 word types
2022-05-21 16:59:02,500 - PROGRESS: at sentence #30000, processed 217994 words, keeping 31821 word types
2022-05-21 16:59:02,517 - PROGRESS: at sentence #35000, processed 254754 words, keeping 34393 word types
2022-05-21 16:59:02,533 - PROGRESS: at sentence #40000, processed 291245 words, keeping 36753 word types
2022-05

(14927359, 18126420)
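
The two architectures differ in how each window becomes a prediction task: CBOW (`sg=0`) predicts the center word from its context, while Skip-gram (`sg=1`) predicts each context word from the center word. A simplified sketch of the training examples each one would derive from a title; it assumes a fixed window and ignores details of the real implementation such as negative sampling, and the helper names are hypothetical:

```python
def context(tokens, i, window):
    """Words within `window` positions of index i, excluding the center word."""
    start = max(0, i - window)
    return tokens[start:i] + tokens[i + 1:i + 1 + window]


def cbow_examples(tokens, window):
    # One example per position: (context words) -> center word.
    return [(tuple(context(tokens, i, window)), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    # One example per (center word, context word) pair.
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context(tokens, i, window)]


titulo = ["triunfo", "dramático", "cruz", "azul"]
cbow = cbow_examples(titulo, window=2)
skipgram = skipgram_examples(titulo, window=2)
print(cbow[0])
print(skipgram[:2])
print(len(cbow), len(skipgram))
```

Skip-gram produces more, finer-grained examples from the same text, which is one common explanation for why it tends to train more slowly but can represent infrequent words better.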

In [44]:
modelo_w2v.wv.most_similar('ferrari')

[('leclerc', 0.7298225164413452),
 ('sainz', 0.6480276584625244),
 ('verstappen', 0.6208952069282532),
 ('bahrein', 0.5997858047485352),
 ('obradoiro', 0.5260280966758728),
 ('pole', 0.5216978788375854),
 ('hyundai', 0.507645845413208),
 ('position', 0.5043960809707642),
 ('australia', 0.5037184953689575),
 ('memphis', 0.48718520998954773)]

In [45]:
modelo_w2v_sg.wv.most_similar('ferrari')

2022-05-21 17:03:15,045 - precomputing L2-norms of word weight vectors


[('sainz', 0.7075117230415344),
 ('leclerc', 0.684792160987854),
 ('bahrein', 0.5940914750099182),
 ('verstappen', 0.5722754001617432),
 ('position', 0.5639119148254395),
 ('baréin', 0.5567221641540527),
 ('bull', 0.5490049123764038),
 ('pole', 0.5217989087104797),
 ('charles', 0.49677011370658875),
 ('bahréin', 0.4868166148662567)]

In [46]:
modelo_w2v.wv.most_similar('google')

[('maps', 0.6369116902351379),
 ('comandos', 0.5456988215446472),
 ('android', 0.535834789276123),
 ('chrome', 0.5238631367683411),
 ('apple', 0.49066129326820374),
 ('youtube', 0.4783938527107239),
 ('aplicación', 0.4661206901073456),
 ('gmail', 0.46569114923477173),
 ('emojis', 0.4551786482334137),
 ('búsquedas', 0.45385390520095825)]

In [47]:
modelo_w2v_sg.wv.most_similar('google')

[('maps', 0.6980351209640503),
 ('chrome', 0.5779889822006226),
 ('doodle', 0.5422887206077576),
 ('caffarena', 0.48039737343788147),
 ('búsquedas', 0.46368515491485596),
 ('desenfocar', 0.45587313175201416),
 ('apk', 0.44785791635513306),
 ('desarrolladores', 0.447754442691803),
 ('chromebook', 0.4472600817680359),
 ('gmail', 0.4441489577293396)]

In [63]:
modelo_w2v.wv.save_word2vec_format('/content/drive/MyDrive/word2vec/modelo_cbow_300.txt', binary=False)
modelo_w2v_sg.wv.save_word2vec_format('/content/drive/MyDrive/word2vec/modelo_sg_300.txt', binary=False)

2022-05-21 19:33:12,226 - storing 14832x300 projection weights into /content/drive/MyDrive/word2vec/modelo_cbow_300.txt
2022-05-21 19:33:15,532 - storing 14832x300 projection weights into /content/drive/MyDrive/word2vec/modelo_sg_300.txt
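
With `binary=False`, the saved file is plain text in the word2vec format: a header line with the vocabulary size and vector dimension (here `14832 300`, matching the log), followed by one line per word with its components. A hedged sketch of reading that layout by hand; the three-word sample is invented and `read_word2vec_text` is a hypothetical helper. In practice, gensim's `KeyedVectors.load_word2vec_format` handles this:

```python
import io


def read_word2vec_text(handle):
    """Parse the word2vec text format into (dimension, {word: [floats]})."""
    n_words, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for _ in range(n_words):
        parts = handle.readline().rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Made-up miniature file: 3 words of dimension 2.
sample = io.StringIO("3 2\ngoogle 0.1 0.2\nmaps 0.3 0.4\nchrome -0.5 0.6\n")
dim, vectors = read_word2vec_text(sample)
print(dim, sorted(vectors))
```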


## Lesson 5

### 5.1 Starting the classifier

### 5.2 Combining vectors

### 5.3 Vectorizing the headlines

## Lesson 6

### 6.1 Classifying the headlines

### 6.2 Comparing the models