In [7]:
!pip install fasttext
!git clone https://github.com/elsonidoq/ml-practico-2022.git 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fasttext
  Using cached fasttext-0.9.2.tar.gz (68 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp37-cp37m-linux_x86_64.whl size=3156315 sha256=16d5659516b3a3f0eb0b2051c6b19f98e26dc145439e59183d81692883693a5b
  Stored in directory: /root/.cache/pip/wheels/4e/ca/bf/b020d2be95f7641801a6597a29c8f4f19e38f9c02a345bab9b
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2
fatal: destination path 'ml-practico-2022' already exists and is not an empty directory.


In [1]:
import sys
from google.colab import drive

sys.path.append('ml-practico-2022/lib')
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
from tqdm import tqdm
import re

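# Copied from the word2vec notebook so the results are comparable
word_pat = re.compile(r'\w+')  # runs of alphanumeric characters
# Note: the unescaped '.' matches any character, so together with the
# len(tok) > 1 check below, every token that starts with a digit is dropped
num_pat = re.compile(r'\d+.\d*')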

def tokenize(sent):
  return [
      tok for tok in word_pat.findall(sent.lower())
      if len(tok) > 1 and num_pat.match(tok) is None
  ]

def normalize(sent):
  return ' '.join(tokenize(sent))



class SentenceIterator:
  # This class is a "resettable iterator" over sentences for word2vec.
  # Resettable means you can run `for thing in iterator`
  # as many times as you want.
  def __init__(self, data):
    self.data = data

  def __iter__(self):
    for row in tqdm(self.data):
      yield normalize(row['title'])
      for line in row['description'].replace('. ', '\n').replace('<br/>', '\n').replace('<br>', '\n').split('\n'):
        sent = normalize(line)
        if sent: yield sent
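
To make the "resettable" idea concrete, the sketch below contrasts a class that defines `__iter__` with a plain generator expression. It is only an illustration: `Sentences` is a hypothetical, simplified stand-in for `SentenceIterator` (titles only, no `tqdm`), the `rows` data is made up, and `tokenize`/`normalize` are repeated so the snippet runs on its own.

```python
import re

word_pat = re.compile(r'\w+')
num_pat = re.compile(r'\d+.\d*')

def tokenize(sent):
    return [
        tok for tok in word_pat.findall(sent.lower())
        if len(tok) > 1 and num_pat.match(tok) is None
    ]

def normalize(sent):
    return ' '.join(tokenize(sent))

class Sentences:
    # Hypothetical, simplified stand-in for SentenceIterator (titles only)
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        # A fresh generator is created every time a `for` loop starts
        for row in self.data:
            yield normalize(row['title'])

# Made-up rows, only for illustration
rows = [
    {'title': 'Depto 3 ambientes 120m2 con balcón'},
    {'title': 'PH a 2 cuadras del subte'},
]

resettable = Sentences(rows)
first_pass = list(resettable)
second_pass = list(resettable)  # yields the same sentences again

one_shot = (normalize(row['title']) for row in rows)
gen_first = list(one_shot)
gen_second = list(one_shot)     # empty: a generator can only be consumed once

print(first_pass)
print(second_pass)
print(gen_second)
```

Anything that makes several passes over the data needs the first behavior; a bare generator would silently yield nothing from the second pass on.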

In [3]:
import os
import csv
from taller_model_selection.evaluate import load_train_dev_test

PATH = '/content/gdrive/MyDrive/taller-model-selection-data/'
X_train, y_train = load_train_dev_test(PATH)[0]

{'pct(train)': 0.809998757918271, 'pct(dev)': 0.09000124208172898, 'pct(test)': 0.1}


In [11]:
texts = SentenceIterator(X_train)

with open('data.txt', 'w') as f:
  first = True
  for line in texts:
    if not first: f.write('\n')
    f.write(line)
    first = False

100%|██████████| 130426/130426 [00:21<00:00, 6127.08it/s]


In [10]:
# only a sample of the data
!head -n 100000 data.txt > hdata.txt
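
Note that `head` keeps the first 100,000 lines of the file rather than a random subset. If a uniform random sample is preferred, one option is reservoir sampling, sketched below. This is a different technique from the one used in the cell above; `sample_lines` is a hypothetical helper and the demo lines are made up.

```python
import random

def sample_lines(lines, k, seed=0):
    """Return k items drawn uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items
            reservoir.append(line)
        else:
            # Replace a random slot with probability k / (i + 1)
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir

# Made-up input, only to show the call; with the real file it would be
# something like: with open('data.txt') as f: sample = sample_lines(f, 100000)
lines = [f'line {i}' for i in range(1000)]
sample = sample_lines(lines, 10)
print(len(sample))
```

The reservoir is not kept in the original order, which should not matter for training since each line is an independent sentence.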

In [11]:
import fasttext
model = fasttext.train_unsupervised(
    'hdata.txt', model='skipgram',# epoch=10, minn=7, maxn=12
)


In [12]:
model.save_model(os.path.join(PATH, "full_fasttext_normalized.bin"))

In [14]:
import fasttext
model = fasttext.load_model(os.path.join(PATH, "full_fasttext_normalized.bin"))



In [None]:
model.get_subwords('departamento')
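
As a rough picture of what the subwords are, the sketch below enumerates character n-grams in pure Python. It assumes fastText's defaults for unsupervised training (`minn=3`, `maxn=6`) and the `<` / `>` word-boundary markers; `char_ngrams` is a hypothetical helper, not part of the library, and the real `get_subwords` also returns the full word (when it is in the vocabulary) along with the subword indices.

```python
def char_ngrams(word, minn=3, maxn=6):
    # Wrap the word in boundary markers, as fastText does
    wrapped = f'<{word}>'
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            end = start + n
            if end > len(wrapped):
                break  # longer n-grams from this start would not fit either
            ngrams.append(wrapped[start:end])
    return ngrams

ngrams = char_ngrams('casa')
print(ngrams)
# ['<ca', '<cas', '<casa', '<casa>', 'cas', 'casa', 'casa>', 'asa', 'asa>', 'sa>']
```

Because a word vector is built from these pieces, words that share many n-grams (for example a common suffix) tend to end up close to each other.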

In [9]:
from IPython.display import display_markdown
from random import choice

for i in range(30):
  word = choice(model.words)
  display_markdown(f'most similar to **{word}**', raw=True)
  for sim, word2 in model.get_nearest_neighbors(word):
    display_markdown(f'* {word2} ({sim:.02f})', raw=True)
  display_markdown('___', raw=True)  # raw=True so the rule is rendered as Markdown

most similar to **discordancias**

* inexactitudes (1.00)

* facturas (0.99)

* arrojar (0.98)

* títulos (0.97)

* planos (0.97)

* aviso (0.97)

* actuales (0.96)

* proporciones (0.96)

* proporcionados (0.96)

* lo (0.96)

most similar to **condicionada**

* hipotecario (0.96)

* refaccionada (0.95)

* medidor (0.95)

* congreso (0.94)

* limita (0.94)

* racionalista (0.94)

* transferencia (0.94)

* crédito (0.94)

* locomoción (0.94)

* país (0.94)

most similar to **balcón**

* balc (0.99)

* living (0.99)

* cómodo (0.99)

* balcon (0.99)

* toilette (0.99)

* amplia (0.99)

* amplio (0.99)

* ventila (0.99)

* pulmón (0.99)

* principal (0.99)

most similar to **desarrolla**

* desarrollo (1.00)

* desarrollada (1.00)

* microcine (1.00)

* climatizada (0.99)

* azotea (0.99)

* lacroze (0.99)

* poseen (0.99)

* eslavonia (0.99)

* teléfono (0.99)

* diaz (0.99)

most similar to **fijas**

* ambas (0.99)

* heras (0.99)

* salas (0.98)

* mesas (0.98)

* terminadas (0.98)

* barandas (0.98)

* cocheras (0.98)

* bajas (0.98)

* rosas (0.97)

* bauleras (0.97)

most similar to **ascensor**

* aluminio (0.99)

* hermosa (0.99)

* studio (0.99)

* hidromasaje (0.99)

* frc (0.99)

* luo (0.99)

* luz (0.99)

* pileta (0.99)

* hidro (0.99)

* circuito (0.99)

most similar to **zonificación**

* verificación (0.98)

* publicación (0.97)

* zonificacion (0.97)

* meramente (0.96)

* tasación (0.95)

* verificaciones (0.95)

* variación (0.95)

* documentación (0.95)

* ver (0.95)

* verificacin (0.94)

most similar to **estudiantes**

* vendibles (0.99)

* flotantes (0.98)

* detalles (0.98)

* tren (0.98)

* ortiz (0.97)

* paneles (0.97)

* niveles (0.97)

* multiples (0.97)

* principales (0.97)

* córdoba (0.97)

most similar to **ejercen**

* civil (1.00)

* código (0.99)

* lealtad (0.99)

* regulan (0.98)

* defensa (0.98)

* codigo (0.98)

* corretaje (0.98)

* leyes (0.97)

* constitucionales (0.97)

* ley (0.97)

most similar to **máximo**

* vivienda (0.97)

* comisión (0.97)

* requerir (0.96)

* máxima (0.96)

* le (0.95)

* contratos (0.94)

* contrario (0.94)

* será (0.93)

* quince (0.93)

* contrato (0.93)

most similar to **aviso**

* inexactitudes (0.97)

* discordancias (0.97)

* precios (0.96)

* títulos (0.95)

* proporciones (0.95)

* facturas (0.95)

* pueden (0.95)

* lo (0.95)

* arrojar (0.95)

* proporcionados (0.95)

most similar to **materiales**

* calles (0.99)

* detalles (0.99)

* paneles (0.99)

* flotantes (0.98)

* niveles (0.98)

* múltiples (0.98)

* multiples (0.98)

* bares (0.97)

* razonables (0.97)

* suites (0.97)

most similar to **embajadas**

* pierdas (0.98)

* ajustadas (0.97)

* características (0.97)

* casas (0.97)

* adheridas (0.96)

* días (0.96)

* mismas (0.96)

* blindadas (0.96)

* todas (0.96)

* disponibles (0.96)

most similar to **separada**

* ducha (0.99)

* toilette (0.99)

* baño (0.99)

* mucha (0.99)

* separado (0.99)

* incorporada (0.99)

* mucho (0.99)

* grande (0.99)

* completo (0.99)

* lava (0.99)

most similar to **todos**

* solados (0.99)

* autos (0.98)

* pintados (0.98)

* años (0.97)

* dos (0.97)

* altos (0.97)

* deptos (0.97)

* privados (0.96)

* os (0.96)

* fijos (0.96)

most similar to **razonables**

* multiples (1.00)

* múltiples (1.00)

* paneles (0.99)

* niveles (0.99)

* detalles (0.99)

* lugares (0.99)

* bares (0.98)

* dolares (0.98)

* ascensores (0.98)

* herrajes (0.98)

most similar to **pedido**

* afip (0.99)

* resol (0.99)

* cumplimiento (0.98)

* nº (0.98)

* requisitos (0.98)

* coti (0.98)

* supeditada (0.98)

* pedir (0.98)

* resolución (0.98)

* parte (0.95)

most similar to **optativas**

* carcter (0.98)

* reservas (0.98)

* orientativas (0.98)

* ratificarse (0.98)

* carácter (0.98)

* expresadas (0.97)

* empresas (0.97)

* gráfica (0.97)

* adheridas (0.97)

* verificacin (0.97)

most similar to **masajes**

* frances (0.99)

* herrajes (0.99)

* ascensores (0.99)

* anafes (0.99)

* multiples (0.98)

* paredes (0.98)

* sobre (0.98)

* silestone (0.98)

* jardines (0.98)

* guardacoches (0.98)

most similar to **linea**

* transporte (0.99)

* cercanía (0.98)

* fe (0.98)

* línea (0.98)

* córdoba (0.98)

* cercano (0.98)

* mitre (0.98)

* zona (0.98)

* cabildo (0.98)

* tren (0.97)

most similar to **caja**

* ultima (0.99)

* antiguo (0.99)

* memoria (0.99)

* hs (0.98)

* soho (0.98)

* guardacoches (0.98)

* rivadavia (0.98)

* azotea (0.98)

* boulevard (0.98)

* desarrollo (0.98)

most similar to **perimetral**

* alberto (1.00)

* almagro (0.99)

* alvarez (0.99)

* parking (0.99)

* nuñez (0.99)

* nivel (0.99)

* rodeada (0.99)

* out (0.99)

* histórico (0.99)

* patrimonio (0.99)

most similar to **hotel**

* loft (1.00)

* tel (0.99)

* limpieza (0.99)

* ffcc (0.99)

* harbour (0.99)

* autónoma (0.99)

* ss (0.99)

* pozo (0.99)

* ofrecen (0.99)

* ms (0.99)

most similar to **fijas**

* ambas (0.99)

* heras (0.99)

* salas (0.98)

* mesas (0.98)

* terminadas (0.98)

* barandas (0.98)

* cocheras (0.98)

* bajas (0.98)

* rosas (0.97)

* bauleras (0.97)

most similar to **médicos**

* campos (0.99)

* nosotros (0.99)

* primeros (0.99)

* teatros (0.99)

* derechos (0.99)

* terrenos (0.99)

* algunos (0.99)

* circuitos (0.99)

* fijos (0.99)

* diversos (0.98)

most similar to **uno**

* jardin (0.99)

* parquet (0.99)

* jardín (0.99)

* sector (0.99)

* uso (0.99)

* divino (0.99)

* moderno (0.99)

* ntilde (0.99)

* hermoso (0.99)

* playroom (0.99)

most similar to **vez**

* mendoza (0.99)

* pensado (0.98)

* mujer (0.98)

* menor (0.98)

* rivadavia (0.98)

* avda (0.98)

* suspendido (0.98)

* saldo (0.98)

* tipo (0.98)

* terreno (0.98)

most similar to **cnel**

* jorge (1.00)

* be (0.99)

* ofrecen (0.99)

* brokerage (0.99)

* núñez (0.99)

* limpieza (0.99)

* numero (0.99)

* quot (0.99)

* país (0.99)

* triunvirato (0.99)

most similar to **mono**

* antiguedad (0.98)

* tenis (0.98)

* modelo (0.98)

* imperdible (0.98)

* monroe (0.98)

* monitoreo (0.98)

* estratégica (0.98)

* directo (0.98)

* tour (0.98)

* listo (0.98)

most similar to **gastos**

* informes (0.96)

* info (0.95)

* físicas (0.95)

* inquilinos (0.94)

* discapacidades (0.94)

* casos (0.93)

* meses (0.93)

* sean (0.93)

* gestoría (0.93)

* art (0.93)
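
The scores in parentheses are similarities between word vectors; as far as I know, `get_nearest_neighbors` ranks by cosine similarity. The sketch below shows that ranking on toy data. The two-dimensional vectors are made up, and `cosine` and `nearest_neighbors` are hypothetical helpers, not the library's implementation.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(vectors, query, k=10):
    # Score every other word against the query and keep the k best
    scored = [
        (cosine(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, reverse=True)[:k]

# Made-up toy vectors, only for illustration
toy = {
    'balcón': [1.0, 0.0],
    'balcon': [0.9, 0.1],
    'cochera': [0.0, 1.0],
}

neighbors = nearest_neighbors(toy, 'balcón', k=2)
for sim, word in neighbors:
    print(f'* {word} ({sim:.02f})')
```

A score near 1.00 means the two vectors point in almost the same direction, so lists where every neighbor scores 0.97 or higher suggest the vectors are not yet well separated.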

In [13]:
from IPython.display import display_markdown
from random import choice

words = ['cdra', 'elena', 'daniel', 'hs', 'señorial', 'vinílicos', 'by', 'monserrat', ]
for word in words:
  display_markdown(f'most similar to **{word}**', raw=True)
  for sim, word2 in model.get_nearest_neighbors(word):
    display_markdown(f'* {word2} ({sim:.02f})', raw=True)
  display_markdown('___', raw=True)  # raw=True so the rule is rendered as Markdown

most similar to **cdra**

* olazábal (0.82)

* cuadra (0.81)

* talcahuano (0.81)

* olazabal (0.81)

* azurduy (0.80)

* cramer (0.80)

* anchorena (0.80)

* fé (0.80)

* chivilcoy (0.80)

* ibera (0.80)

most similar to **elena**

* karina (0.82)

* susana (0.81)

* liliana (0.81)

* gomez (0.81)

* cecilia (0.80)

* claudia (0.79)

* natalia (0.79)

* silvana (0.79)

* sabrina (0.79)

* bibiana (0.79)

most similar to **daniel**

* daniela (0.96)

* vazquez (0.87)

* gonzalo (0.86)

* kuzzel (0.86)

* leandro (0.86)

* alejandro (0.85)

* luciano (0.85)

* guastello (0.85)

* muñoz (0.85)

* cmpcsi (0.84)

most similar to **hs**

* vigilancia (0.83)

* totem (0.75)

* horas (0.75)

* centralseguridad (0.72)

* hrs (0.71)

* seguridad (0.71)

* tótem (0.70)

* sábados (0.70)

* nocturna (0.70)

* cortesia (0.68)

most similar to **señorial**

* antiguo (0.77)

* imperial (0.77)

* armonía (0.76)

* clasico (0.76)

* clásico (0.75)

* armonioso (0.74)

* renovado (0.74)

* mayoria (0.74)

* goza (0.74)

* barrial (0.74)

most similar to **vinílicos**

* vinilicos (0.96)

* vinílico (0.90)

* cermicos (0.88)

* técnicos (0.87)

* melamínicos (0.85)

* porcellanatos (0.85)

* porcelanatos (0.85)

* vinilico (0.85)

* hidrolaqueados (0.85)

* spc (0.85)

most similar to **by**

* heating (0.94)

* three (0.94)

* which (0.93)

* two (0.93)

* beautiful (0.93)

* this (0.93)

* floors (0.93)

* bright (0.93)

* bedrooms (0.93)

* floor (0.93)

most similar to **monserrat**

* monseñor (0.87)

* balvanera (0.84)

* zapiola (0.83)

* aranguren (0.83)

* azurduy (0.83)

* dumont (0.83)

* chivilcoy (0.83)

* chacarita (0.82)

* boedo (0.82)

* pompeya (0.82)