# Setup in Colab

Create a shortcut in your Drive to the [data](https://drive.google.com/drive/folders/1djjceNkO42vrB10PubYTzQydfccPbzdB?usp=sharing).



In [1]:

import gensim

In [2]:
# Clone the repo so we can use the library code
!git clone https://github.com/elsonidoq/ml-practico-2022.git
!cd ml-practico-2022; git pull

Cloning into 'ml-practico-2022'...
remote: Enumerating objects: 370, done.
remote: Counting objects: 100% (57/57), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 370 (delta 36), reused 35 (delta 16), pack-reused 313
Receiving objects: 100% (370/370), 4.83 MiB | 29.59 MiB/s, done.
Resolving deltas: 100% (227/227), done.
Already up to date.


In [3]:
import sys
sys.path.append('ml-practico-2022/lib')

In [4]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Training a model

In [6]:
from os import path
from taller_model_selection.serialize import iter_jl

In [7]:
from gensim.models import Word2Vec

In [31]:
from tqdm import tqdm
import re

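word_pat = re.compile(r'\w+')  # runs of word characters (letters, digits, underscore)
# .match() anchors at the start, so this rejects any token that begins with a digit
num_pat = re.compile(r'\d+\.?\d*')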

def tokenize(sent):
  return [
      tok for tok in word_pat.findall(sent.lower())
      if len(tok) > 1 and num_pat.match(tok) is None
  ]

class SentenceIterator:
  # this class is a "restartable iterator" over sentences for word2vec;
  # restartable means you can run `for thing in iterator`
  # as many times as you want
  def __init__(self, data):
    self.data = data

  def __iter__(self):
    for row in tqdm(self.data):
      yield tokenize(row['title'])
      for line in row['description'].replace('. ', '\n').replace('<br>', '\n').split('\n'):
        sent = tokenize(line)
        if sent: yield sent
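
A small sketch of what `tokenize` keeps and drops. The function is restated with raw-string patterns so the block runs on its own, and the sample title is made up for illustration, not taken from the dataset.

```python
import re

word_pat = re.compile(r'\w+')        # runs of word characters
num_pat = re.compile(r'\d+\.?\d*')   # .match() only anchors at the token start

def tokenize(sent):
    return [
        tok for tok in word_pat.findall(sent.lower())
        if len(tok) > 1 and num_pat.match(tok) is None
    ]

# hypothetical listing title
tokens = tokenize('Depto 3 amb. con 2amb extra, 75 m2 y balcón')
print(tokens)  # ['depto', 'amb', 'con', 'extra', 'm2', 'balcón']
```

Single-character tokens (`3`, `y`) and tokens that start with a digit (`2amb`, `75`) are dropped, while tokens that merely contain a digit (`m2`) survive. That is consistent with entries such as `r2a` or `profesional1` showing up in the similarity lists.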

In [None]:
# load the pre-trained model
model = Word2Vec.load('/content/gdrive/MyDrive/taller-model-selection-data/properties.w2v')

In [60]:
# train a model

PATH = '/content/gdrive/MyDrive/taller-model-selection-data'
data = list(iter_jl(path.join(PATH, 'X_train.jl')))

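# `size` and `iter` are the gensim 3.x argument names;
# gensim 4.x renamed them to `vector_size` and `epochs`
model = Word2Vec(SentenceIterator(data), size=50, iter=10)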

100%|██████████| 161020/161020 [00:30<00:00, 5362.53it/s]
100%|██████████| 161020/161020 [01:07<00:00, 2379.97it/s]
100%|██████████| 161020/161020 [01:01<00:00, 2606.16it/s]
100%|██████████| 161020/161020 [01:02<00:00, 2567.32it/s]
100%|██████████| 161020/161020 [01:09<00:00, 2319.44it/s]
100%|██████████| 161020/161020 [01:09<00:00, 2314.99it/s]
100%|██████████| 161020/161020 [01:00<00:00, 2683.47it/s]
100%|██████████| 161020/161020 [01:02<00:00, 2573.89it/s]
100%|██████████| 161020/161020 [01:01<00:00, 2622.77it/s]
100%|██████████| 161020/161020 [01:03<00:00, 2548.80it/s]
100%|██████████| 161020/161020 [01:04<00:00, 2511.72it/s]
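
The eleven progress bars are consistent with gensim making one pass over the corpus to build the vocabulary and then one pass per training epoch (`iter=10`). This is why the sentence source has to be restartable. The sketch below uses plain Python with toy rows (both the rows and the class and function names are hypothetical) to contrast a one-shot generator with an object that defines `__iter__`.

```python
class Restartable:
    """Starts a fresh pass over the data every time it is iterated."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        for row in self.data:
            yield row.split()


def one_shot(data):
    """A plain generator: exhausted after the first full pass."""
    for row in data:
        yield row.split()


rows = ['amplio living', 'cocina integrada']  # toy data

gen = one_shot(rows)
gen_passes = [len(list(gen)) for _ in range(3)]

it = Restartable(rows)
it_passes = [len(list(it)) for _ in range(3)]

print(gen_passes)  # [2, 0, 0]
print(it_passes)   # [2, 2, 2]
```

With a plain generator, every pass after the first would see an empty corpus.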


In [61]:
vocab = [
    k for k, v in model.wv.vocab.items() if v.count > 100
]

In [62]:
len(vocab)

6832
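
The frequency filter above relies on `model.wv.vocab`, which exists in gensim 3.x. In gensim 4.x that attribute was removed; word counts are read with `wv.get_vecattr(word, 'count')` and the vocabulary is exposed as `wv.key_to_index`. A minimal sketch of a helper that handles both follows. The name `frequent_words` is hypothetical, and the block is exercised with stand-in objects and made-up counts rather than a real model.

```python
from types import SimpleNamespace


def frequent_words(wv, min_count=100):
    """Return the words whose corpus count is strictly above min_count."""
    if hasattr(wv, 'key_to_index'):  # gensim 4.x style
        return [
            w for w in wv.key_to_index
            if wv.get_vecattr(w, 'count') > min_count
        ]
    # gensim 3.x style: wv.vocab maps word -> object with a .count attribute
    return [w for w, v in wv.vocab.items() if v.count > min_count]


# Stand-ins that mimic only the attributes used above (counts are made up)
counts = {'balcón': 250, 'cochera': 101, 'rosetones': 12}
old_wv = SimpleNamespace(
    vocab={w: SimpleNamespace(count=c) for w, c in counts.items()}
)
new_wv = SimpleNamespace(
    key_to_index={w: i for i, w in enumerate(counts)},
    get_vecattr=lambda w, attr: counts[w],
)

print(frequent_words(old_wv))  # ['balcón', 'cochera']
print(frequent_words(new_wv))  # ['balcón', 'cochera']
```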

In [63]:
from IPython.display import display_markdown
from random import choice

for i in range(30):
  word = choice(vocab)
  display_markdown(f'most similar to **{word}**', raw=True)
  for word2, sim in model.wv.most_similar(word):
    display_markdown(f'* {word2} ({sim:.02f})', raw=True)
  display_markdown('___', raw=True)  # raw=True is needed for the rule to render

most similar to **incorporados**

* tantos (0.66)

* existentes (0.63)

* funcionaron (0.62)

* agregados (0.59)

* otros (0.59)

* mecanismos (0.58)

* superan (0.57)

* presentan (0.57)

* tienen (0.57)

* rosetones (0.56)

most similar to **daniel**

* nidia (0.76)

* champanier (0.75)

* grela (0.75)

* claudia (0.75)

* pepa (0.74)

* atri (0.74)

* karina (0.74)

* rodrigo (0.74)

* silvina (0.73)

* reina (0.72)

most similar to **hs**

* cofre (0.84)

* horas (0.82)

* hrs (0.82)

* diurna (0.80)

* noctura (0.79)

* nocturna (0.73)

* prosegur (0.72)

* vigilancia (0.71)

* camaras (0.70)

* garita (0.70)

most similar to **ajuste**

* verificacin (0.69)

* verificación (0.61)

* ajust (0.61)

* confirmación (0.51)

* variación (0.51)

* verificaci (0.50)

* modificacioneslas (0.50)

* fijaestacionamiento (0.47)

* lavaboskitchenette1 (0.47)

* modificaciones (0.47)

most similar to **señorial**

* racionalista (0.86)

* clasico (0.82)

* solido (0.77)

* sólido (0.76)

* lujoso (0.74)

* antiguo (0.73)

* elegante (0.69)

* magnifico (0.68)

* clásico (0.68)

* emblematico (0.68)

most similar to **realiza**

* rota (0.68)

* fundó (0.67)

* transfiere (0.67)

* adhesion (0.66)

* sembrado (0.65)

* realizará (0.65)

* pesifica (0.65)

* actualiza (0.64)

* ajusta (0.64)

* arruine (0.64)

most similar to **dada**

* debido (0.77)

* fluidez (0.68)

* mejora (0.67)

* dinámica (0.66)

* óptima (0.66)

* comodidad (0.66)

* intensa (0.64)

* gracias (0.63)

* facilita (0.63)

* exquisita (0.60)

most similar to **vinílicos**

* vinilicos (0.93)

* parquette (0.87)

* cant (0.85)

* entarugados (0.85)

* flexiplast (0.85)

* pocelanato (0.85)

* flotantes (0.84)

* maderacarpinterías (0.84)

* tarugados (0.84)

* parquets (0.83)

most similar to **vale**

* pena (0.94)

* vení (0.81)

* veni (0.80)

* llamame (0.74)

* acompañame (0.73)

* verla (0.73)

* pedinos (0.71)

* venir (0.69)

* visitamos (0.69)

* venga (0.68)

most similar to **antiguedad**

* antigüedad (0.95)

* antigúedad (0.71)

* antigedad (0.69)

* antig (0.68)

* m2antigüedad (0.68)

* estrenaredificio (0.66)

* vendiendo (0.60)

* fabricante (0.60)

* estadoedificio (0.58)

* m2muy (0.57)

most similar to **cid**

* campeador (0.76)

* walmart (0.64)

* distante (0.64)

* apocas (0.63)

* hospital (0.62)

* obelisco (0.61)

* hosp (0.59)

* easy (0.56)

* facultada (0.56)

* ameghino (0.56)

most similar to **perú**

* méxico (0.82)

* basualdo (0.82)

* zamudio (0.81)

* tacuari (0.81)

* adolfo (0.80)

* valentín (0.80)

* helguera (0.80)

* agote (0.80)

* rocha (0.78)

* chile (0.78)

most similar to **by**

* through (0.92)

* is (0.92)

* its (0.91)

* are (0.91)

* it (0.91)

* also (0.91)

* which (0.91)

* that (0.90)

* this (0.90)

* into (0.90)

most similar to **soñás**

* soñas (0.91)

* alcanzá (0.87)

* necesites (0.87)

* necesitás (0.81)

* conozcas (0.79)

* visitamos (0.77)

* invitamos (0.77)

* estabas (0.76)

* encuentres (0.76)

* estrenas (0.75)

most similar to **yesería**

* yeseria (0.86)

* moldura (0.82)

* buñados (0.81)

* cielorasos (0.81)

* molduras (0.78)

* alpress (0.77)

* aislantes (0.76)

* terminacio (0.76)

* enteladas (0.76)

* pinturas (0.76)

most similar to **monserrat**

* balvanera (0.90)

* once (0.84)

* flores (0.81)

* barracas (0.81)

* almagro (0.80)

* boedo (0.78)

* liniers (0.77)

* constitucion (0.76)

* recoleta (0.75)

* versalles (0.74)

most similar to **sitúa**

* emplaza (0.89)

* desarrollará (0.81)

* ubica (0.81)

* ubicó (0.81)

* erige (0.81)

* alza (0.80)

* situa (0.78)

* alojará (0.78)

* implanta (0.77)

* ubicará (0.77)

most similar to **usado**

* utilizado (0.90)

* usada (0.83)

* utlizado (0.75)

* utilizada (0.73)

* atelier (0.67)

* adaptado (0.65)

* usarlo (0.64)

* havana (0.64)

* tranquilamente (0.64)

* ganado (0.62)

most similar to **mascotas**

* profesionalapto (0.72)

* mascota (0.70)

* crèdito (0.69)

* credito (0.69)

* profesionalexpensas (0.69)

* profesionalse (0.67)

* crdito (0.63)

* porfesional (0.63)

* profesional1 (0.63)

* profes (0.63)

most similar to **durlock**

* tabique (0.82)

* pared (0.74)

* durlok (0.72)

* removibles (0.63)

* divisiones (0.63)

* vidisoria (0.60)

* bañosposibilidad (0.59)

* machimbre (0.59)

* tiraron (0.59)

* levantar (0.59)

most similar to **cpu**

* cu (0.92)

* r2a (0.88)

* e3 (0.87)

* r2bi (0.86)

* r2b (0.86)

* r2b1 (0.85)

* c3ii (0.85)

* r2aii (0.85)

* r2bii (0.85)

* zonificación (0.84)

most similar to **lópez**

* vicente (0.90)

* lopez (0.84)

* solano (0.83)

* vte (0.79)

* bertazza (0.79)

* hilanderia (0.70)

* mariscal (0.70)

* louge (0.66)

* ee (0.65)

* resnick (0.64)

most similar to **desarrollador**

* término (0.63)

* indexados (0.62)

* developers (0.61)

* probada (0.61)

* pesificado (0.60)

* transcurso (0.60)

* vendedor (0.59)

* cuidando (0.59)

* forcinito (0.59)

* mra (0.59)

most similar to **escenario**

* tango (0.76)

* motivo (0.73)

* ritmo (0.73)

* había (0.71)

* testigo (0.70)

* conventillo (0.69)

* algo (0.68)

* todavía (0.68)

* llegó (0.68)

* impulso (0.67)

most similar to **importe**

* ago (0.74)

* moderadas (0.69)

* mínimas (0.69)

* anunciados (0.68)

* bajísimas (0.68)

* gasto (0.67)

* bajisimas (0.66)

* bjas (0.66)

* sep (0.66)

* enero (0.65)

most similar to **aerotermia**

* vrv (0.70)

* refrigerante (0.66)

* climatización (0.64)

* centralizado (0.64)

* autoportantes (0.63)

* aeroterapia (0.63)

* climatizacion (0.63)

* filtrado (0.62)

* videovigilancia (0.62)

* condensador (0.61)

most similar to **biblioteca**

* vajillero (0.68)

* bibliotecas (0.62)

* cama (0.62)

* salita (0.61)

* lámpara (0.61)

* almohadas (0.61)

* playroom (0.60)

* escritorio (0.60)

* frascos (0.60)

* sillon (0.59)

most similar to **virreyes**

* olmos (0.83)

* saguier (0.73)

* congresos (0.73)

* andes (0.73)

* miserere (0.72)

* ejercito (0.72)

* incas (0.71)

* periodistas (0.71)

* corrales (0.70)

* practican (0.69)

most similar to **aterrazados**

* franceses (0.74)

* vidriados (0.73)

* extensiones (0.73)

* terazas (0.73)

* corridos (0.70)

* terrazas (0.68)

* hermosos (0.66)

* expansiones (0.65)

* internos (0.65)

* canteros (0.64)

most similar to **nada**

* hacerle (0.85)

* humedades (0.80)

* serlo (0.78)

* arreglo (0.75)

* aun (0.75)

* filtraciones (0.72)

* romper (0.71)

* dejar (0.71)

* algo (0.70)

* verifique (0.69)

In [64]:
model.save('properties.w2v')