# Setup in Colab

Create a shortcut in your Drive to the [data](https://drive.google.com/drive/folders/1djjceNkO42vrB10PubYTzQydfccPbzdB?usp=sharing).



In [1]:

import gensim

In [2]:
# Clone the repo so we can use the library code
!git clone https://github.com/elsonidoq/ml-practico-2022.git
!cd ml-practico-2022; git pull

Cloning into 'ml-practico-2022'...
remote: Enumerating objects: 370, done.
remote: Counting objects: 100% (57/57), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 370 (delta 36), reused 35 (delta 16), pack-reused 313
Receiving objects: 100% (370/370), 4.83 MiB | 29.59 MiB/s, done.
Resolving deltas: 100% (227/227), done.
Already up to date.


In [3]:
import sys
sys.path.append('ml-practico-2022/lib')

In [4]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Training a model

In [6]:
from os import path
from taller_model_selection.serialize import iter_jl

In [7]:
from gensim.models import Word2Vec

In [31]:
from tqdm import tqdm
import re

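word_pat = re.compile(r'\w+')  # alphanumeric runs (raw string avoids invalid-escape warnings)
num_pat = re.compile(r'\d+.\d*')  # with match(): any token of 2+ characters that starts with a digit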

def tokenize(sent):
  return [
      tok for tok in word_pat.findall(sent.lower())
      if len(tok) > 1 and num_pat.match(tok) is None
  ]

class SentenceIterator:
  # this class is a "restartable iterator" over sentences for word2vec;
  # restartable means you can run `for thing in iterator`
  # as many times as you want
  def __init__(self, data):
    self.data = data

  def __iter__(self):
    for row in tqdm(self.data):
      yield tokenize(row['title'])
      for line in row['description'].replace('. ', '\n').replace('<br>', '\n').split('\n'):
        sent = tokenize(line)
        if sent: yield sent
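
To see what the tokenizer and the sentence splitting produce, here is a minimal, self-contained sketch. The listing dict is made-up sample data, and `split_sentences` is a hypothetical helper that mirrors the `__iter__` logic above for a single row, without `tqdm`.

```python
import re

word_pat = re.compile(r'\w+')
num_pat = re.compile(r'\d+.\d*')

def tokenize(sent):
    # lowercase, keep alphanumeric tokens of 2+ characters that do not start with a digit
    return [
        tok for tok in word_pat.findall(sent.lower())
        if len(tok) > 1 and num_pat.match(tok) is None
    ]

def split_sentences(row):
    # hypothetical helper: same logic as SentenceIterator.__iter__, for one row
    yield tokenize(row['title'])
    text = row['description'].replace('. ', '\n').replace('<br>', '\n')
    for line in text.split('\n'):
        sent = tokenize(line)
        if sent:
            yield sent

# made-up listing, only for illustration
row = {
    'title': 'Depto 3 amb con balcón',
    'description': 'Cocina integrada. Superficie 45 m2<br>Apto profesional',
}
sentences = list(split_sentences(row))
for sent in sentences:
    print(sent)
```

Note that the title is yielded even when it tokenizes to an empty list, whereas empty description lines are skipped.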

In [None]:
# load the pre-trained model
model = Word2Vec.load('/content/gdrive/MyDrive/taller-model-selection-data/properties.w2v')

In [35]:
# train a model

PATH = '/content/gdrive/MyDrive/taller-model-selection-data'
data = list(iter_jl(path.join(PATH, 'X_train.jl')))

model = Word2Vec(SentenceIterator(data))

100%|██████████| 161020/161020 [00:28<00:00, 5696.29it/s]
100%|██████████| 161020/161020 [01:00<00:00, 2671.88it/s]
100%|██████████| 161020/161020 [00:56<00:00, 2838.99it/s]
100%|██████████| 161020/161020 [00:58<00:00, 2761.39it/s]
100%|██████████| 161020/161020 [00:57<00:00, 2806.71it/s]
100%|██████████| 161020/161020 [00:57<00:00, 2822.17it/s]
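
The six progress bars correspond to six full passes over the data: with gensim's defaults, `Word2Vec` scans the corpus once to build the vocabulary and then trains for five epochs. That is why the corpus has to be a restartable iterable rather than a plain generator. The sketch below contains no gensim code; `consume` is a hypothetical stand-in that simply iterates the corpus repeatedly, and `CountingSentences` is an illustrative class, not part of the notebook.

```python
class CountingSentences:
    """Restartable iterable: each `for` loop starts a fresh pass."""

    def __init__(self, data):
        self.data = data
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        for row in self.data:
            yield row.split()

def one_shot(data):
    # a plain generator can only be consumed once
    for row in data:
        yield row.split()

def consume(corpus, epochs=5):
    # hypothetical stand-in for training: one vocabulary scan plus `epochs` passes;
    # returns how many sentences each pass saw
    return [sum(1 for _ in corpus) for _ in range(1 + epochs)]

rows = ['living comedor amplio', 'cocina integrada']
restartable = CountingSentences(rows)
restartable_counts = consume(restartable)    # every pass sees all rows
generator_counts = consume(one_shot(rows))   # only the first pass sees data
print(restartable_counts, restartable.passes)
print(generator_counts)
```

With the generator, every pass after the first sees no sentences, so a model fed this way would build its vocabulary and then train on nothing.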


In [54]:
# keep only the words that appear more than 100 times in the corpus
vocab = [
    k for k, v in model.wv.vocab.items() if v.count > 100
]

In [55]:
len(vocab)

6832
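
`model.wv.vocab` is the gensim 3.x interface; gensim 4 replaced it with `key_to_index` plus `get_vecattr`. The helper below is a sketch that should work with either version. `frequent_words` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that mimic the two interfaces so the example runs without a trained model.

```python
from types import SimpleNamespace

def frequent_words(wv, min_count=100):
    """Words seen more than `min_count` times, for gensim 3.x or 4.x KeyedVectors."""
    if hasattr(wv, 'key_to_index'):
        # gensim >= 4: counts are stored as per-key vector attributes
        return [w for w in wv.key_to_index if wv.get_vecattr(w, 'count') > min_count]
    # gensim 3.x: wv.vocab maps each word to an object with a .count field
    return [w for w, v in wv.vocab.items() if v.count > min_count]

# stand-in objects that mimic the two interfaces (not real gensim models)
counts = {'cocina': 250, 'balcón': 101, 'rowa': 100}
old_style = SimpleNamespace(
    vocab={w: SimpleNamespace(count=c) for w, c in counts.items()}
)
new_style = SimpleNamespace(
    key_to_index={w: i for i, w in enumerate(counts)},
    get_vecattr=lambda w, attr: counts[w],
)
old_result = frequent_words(old_style)
new_result = frequent_words(new_style)
print(old_result, new_result)
```

In the notebook this would be called as `vocab = frequent_words(model.wv)`.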

In [56]:
from IPython.display import display_markdown
from random import choice

for i in range(30):
  word = choice(vocab)
  display_markdown(f'most similar to **{word}**', raw=True)
  for word2, sim in model.wv.most_similar(word):
    display_markdown(f'* {word2} ({sim:.02f})', raw=True)
  display_markdown('___', raw=True)  # raw=True is needed for the string to render as a horizontal rule

most similar to **fama**

* dbj (0.85)

* benedetto (0.83)

* jauregui (0.81)

* olivero (0.81)

* parra (0.79)

* flavio (0.78)

* fonte (0.77)

* matera (0.77)

* guanziroli (0.76)

* claudio (0.75)

most similar to **contractual**

* contractural (0.71)

* contractualtodas (0.62)

* ningún (0.58)

* estimativo (0.55)

* cmcplm (0.51)

* excluyente (0.51)

* contractuales (0.51)

* hubiese (0.51)

* bebederos (0.49)

* ningun (0.48)

most similar to **trabajando**

* plataformas (0.67)

* recorrela (0.66)

* estamos (0.64)

* estoy (0.63)

* atendiendo (0.63)

* resolver (0.58)

* sepas (0.57)

* formulario (0.57)

* asesorarlo (0.57)

* reserve (0.57)

most similar to **permanentemente**

* analizan (0.96)

* desenvolvimiento (0.94)

* efectuada (0.78)

* cliente (0.57)

* comprador (0.55)

* pida (0.54)

* rige (0.52)

* supervisa (0.52)

* copropietarios (0.52)

* sepas (0.52)

most similar to **camión**

* camiones (0.90)

* camion (0.89)

* contenedor (0.89)

* indepeniente (0.80)

* indepen (0.77)

* contenedores (0.75)

* utilitarios (0.75)

* imperial (0.74)

* porche (0.73)

* undependiente (0.72)

most similar to **estratégico**

* estrategico (0.75)

* privilegiado (0.67)

* cervecero (0.65)

* estratgico (0.60)

* punto (0.59)

* encuentro (0.58)

* neuralgico (0.54)

* estratégicamente (0.54)

* neurálgico (0.52)

* geográfico (0.50)

most similar to **continuá**

* guardá (0.81)

* buscador (0.81)

* búsqueda (0.81)

* propuestamañana (0.80)

* pedi (0.80)

* llegá (0.80)

* pedí (0.79)

* pedaleando (0.78)

* hacenos (0.77)

* visitaescuchamos (0.76)

most similar to **grifería**

* griferia (0.85)

* griferias (0.78)

* griferías (0.75)

* canillas (0.72)

* grifaría (0.72)

* cromada (0.72)

* fv (0.70)

* monocomando (0.70)

* canilla (0.69)

* hidromet (0.69)

most similar to **sedes**

* instituciones (0.72)

* universitarias (0.68)

* inmediaciones (0.65)

* primarias (0.65)

* universidades (0.64)

* públicas (0.63)

* capillas (0.62)

* entidades (0.62)

* escuelas (0.61)

* publicas (0.60)

most similar to **ángel**

* angel (0.92)

* ngel (0.72)

* rosada (0.65)

* rocatagliata (0.56)

* ho (0.56)

* calise (0.55)

* ventaph (0.55)

* alma (0.54)

* badii (0.52)

* cumbre (0.51)

most similar to **pujante**

* precisamente (0.71)

* trendy (0.62)

* densamente (0.62)

* cotizada (0.60)

* segura (0.59)

* privilegiado (0.59)

* demandadas (0.58)

* oulet (0.57)

* renombrados (0.56)

* glamorosa (0.56)

most similar to **lic**

* piersimoni (0.91)

* stella (0.90)

* nº815 (0.90)

* luciana (0.90)

* analia (0.89)

* alexis (0.89)

* fabiana (0.88)

* nancy (0.88)

* maris (0.88)

* gontmaher (0.88)

most similar to **encargado**

* encargada (0.81)

* portería (0.77)

* permanente (0.66)

* encargados (0.66)

* porteria (0.58)

* ayudante (0.55)

* contratado (0.54)

* horas (0.53)

* portero (0.52)

* limpieza (0.51)

most similar to **entran**

* mover (0.61)

* moto (0.60)

* cargador (0.58)

* caben (0.58)

* estacionados (0.57)

* motos (0.54)

* monta (0.54)

* neumáticos (0.48)

* concesionarias (0.48)

* colchonerias (0.48)

most similar to **has**

* is (0.90)

* rooms (0.90)

* floor (0.89)

* it (0.89)

* which (0.89)

* that (0.88)

* its (0.88)

* are (0.88)

* large (0.88)

* have (0.88)

most similar to **pudiendo**

* pueden (0.63)

* surgir (0.58)

* puede (0.54)

* podrían (0.48)

* podría (0.48)

* supo (0.45)

* deben (0.44)

* podrán (0.44)

* pasada (0.43)

* suele (0.43)

most similar to **otros**

* notables (0.52)

* vacíos (0.51)

* emblemáticos (0.49)

* tantos (0.48)

* otras (0.48)

* preferidos (0.48)

* algunos (0.47)

* tanguerías (0.47)

* intermediario (0.47)

* distintos (0.47)

most similar to **destinados**

* depósitos (0.67)

* destinadas (0.62)

* comerciales (0.59)

* residenciales (0.54)

* gastronomicos (0.53)

* residencia (0.53)

* barriales (0.52)

* gubernamentales (0.50)

* administrativos (0.49)

* predominantemente (0.49)

most similar to **saber**

* descriptas (0.67)

* debida (0.65)

* concordantes (0.63)

* inscripción (0.61)

* subsiguientes (0.57)

* delimitaron (0.57)

* adecue (0.56)

* descritas (0.54)

* posesiones (0.53)

* terminan (0.53)

most similar to **m²**

* m2 (0.76)

* mts2 (0.73)

* mt2 (0.66)

* mts² (0.56)

* sup (0.56)

* cub (0.54)

* mt (0.53)

* semicub (0.52)

* mtsmetros (0.52)

* m2superficie (0.50)

most similar to **requiera**

* dirige (0.70)

* referimos (0.69)

* direcciona (0.68)

* envuelve (0.67)

* llevan (0.67)

* permitan (0.67)

* trasladamos (0.66)

* deriva (0.66)

* recuerda (0.65)

* regala (0.64)

most similar to **cobre**

* desagues (0.81)

* cañería (0.77)

* drenajes (0.75)

* condensado (0.75)

* alimentación (0.71)

* vacía (0.70)

* montante (0.69)

* termomecánica (0.69)

* manojos (0.69)

* canalización (0.69)

most similar to **semicub**

* desc (0.89)

* descub (0.83)

* tot (0.80)

* cub (0.78)

* mtssup (0.78)

* sup (0.76)

* subtotal (0.76)

* semidescubierta (0.76)

* m2balcón (0.74)

* m2amenities (0.74)

most similar to **sr**

* solicite (0.69)

* ochoa (0.68)

* justos (0.67)

* entrevista (0.66)

* cel (0.65)

* dios (0.64)

* licata (0.64)

* celu (0.64)

* montero (0.63)

* salamone (0.63)

most similar to **cisterna**

* elevadoras (0.71)

* australiano (0.70)

* bombeo (0.68)

* bombas (0.67)

* termo (0.61)

* lts (0.61)

* rowa (0.61)

* tanque (0.59)

* presurizadora (0.59)

* arqueológica (0.58)

most similar to **danesa**

* ética (0.72)

* lighthouse (0.70)

* msgsssv (0.67)

* frer (0.61)

* lone (0.60)

* bajarlas (0.58)

* prestigioso (0.57)

* amazonia (0.57)

* privilegiado (0.56)

* asociadosmateriales (0.55)

most similar to **service**

* wash (0.75)

* meeting (0.75)

* bike (0.75)

* guest (0.74)

* bath (0.74)

* fun (0.73)

* space (0.73)

* meditation (0.72)

* place (0.71)

* barbecue (0.71)

most similar to **representa**

* primordial (0.60)

* verdadero (0.60)

* saben (0.60)

* significa (0.59)

* vaya (0.58)

* contiene (0.57)

* aumenta (0.56)

* refleja (0.56)

* es (0.56)

* surge (0.56)

most similar to **ndash**

* br (0.57)

* calef (0.56)

* rsaquo (0.53)

* lido (0.53)

* carpinter (0.52)

* profusi (0.51)

* ograve (0.51)

* descripci (0.50)

* bull (0.50)

* nataci (0.50)

most similar to **ninguna**

* alguna (0.67)

* garantizar (0.58)

* nada (0.57)

* afectación (0.56)

* implícitamente (0.55)

* reubicación (0.53)

* romper (0.53)

* condicionado (0.52)

* humedades (0.52)

* constituyen (0.51)
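
`most_similar` ranks words by the cosine similarity between their vectors, which is the number shown in parentheses above. Here is a minimal pure-Python sketch of that ranking. The 3-dimensional vectors are made up (gensim's default is 100 dimensions), and `top_similar` is a hypothetical helper, not the gensim implementation.

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_similar(vectors, word, topn=10):
    # hypothetical helper: rank every other word by cosine similarity to `word`
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# made-up toy vectors, not taken from the trained model
vectors = {
    'm2': [1.0, 0.1, 0.0],
    'mts2': [0.9, 0.2, 0.0],
    'cocina': [0.0, 1.0, 0.3],
}
ranking = top_similar(vectors, 'm2')
for other, sim in ranking:
    print(f'{other} ({sim:.2f})')
```

Words whose vectors point in nearly the same direction score close to 1, which is how spelling variants such as `m2` and `mts2` end up next to each other in the lists above.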

In [38]:
# save the trained model to the current working directory
model.save('properties.w2v')