<a href="https://colab.research.google.com/github/csralvall/clustering/blob/main/clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering

> Trabajo perteneciente a la cátedra "Text Mining" de Laura Alonso Alemany - FaMAF UNC. 2021

- _Corpus:_ compendio de noticias de "La Voz del Interior"
- _Referencias:_ 
    - [word_clustering](https://github.com/danibosch/word_clustering) 
    de [@danibosch](https://github.com/danibosch)
    - [textmining-clustering](https://github.com/facumolina/textmining-clustering) 
    de [@facumolina](https://github.com/facumolina/textmining-clustering)

## Etapas:

#### 1. Preprocesamiento:
   1. [Limpieza del dataset](#regex_clean):
   
       Se quitan simbolos no alfanumericos mediante regex.
       
       
   2. [Procesamiento con el pipeline de spacy](#spacy_pipeline):
   
       El pipeline de spacy ejecuta las siguientes tareas:
       - Segmentacion en tokens.
       - Vectorizacion.
       - Analisis morfologico (genero, numero, etc.).
       - Analisis de dependencias.
       - Clasificacion por reglas.
       - Lematizacion.
       - Clasificacion de entidades.
       - Conteo de frecuencias de lemas ([agregado](#lemma_freq_counter)).
       
       
   3. [Filtrado](#filtering):
   
      Se busca un subconjunto de tokens
      que cumplan las siguientes caracteristicas:
       - no contiene caracteres no alfabeticos
       - no es "stop word"
       - no expresa un numero
       
       
   4. [Generacion de **features**](#feature_generation):
   
      Por cada palabra del conjunto procesado en el paso anterior 
      generar un conjunto de features de acuerdo con:
       - el tipo de palabra.
       - su funcion sintactica.
       - su coocurrencia con palabras en su misma oracion (contexto).
       - el tipo de entidad que es, en el caso que lo sea.
       - la dependencia con su predecesor sintactico en el arbol de dependencias.
       
       
   5. [Separacion de palabras y features](#key_feature_division).
   6. [Vectorizacion de features](#feature_vectorization).
   7. [Normalizacion de features](#feature_normalization).
    

#### 2. [Clusterizacion](#clustering):
   1. Eleccion del numero de clusters.
   2. Seleccion del estado aleatorio para obtener resultados deterministicos.
   3. Instanciacion del algoritmo KMeans para obtener los clusters.
    
#### 3. Resultados:
   1. [Conteo de palabras por cluster](#summary).
   2. [Listado de clusters con la siguiente informacion](#results):
      - Numero de cluster.
      - Features mas determinantes para ese cluster.
      - Palabras dentro del cluster.

In [1]:
!pip install -U pip setuptools wheel
!pip install -U spacy sklearn numpy spacy nltk gensim



In [2]:
!python -m spacy download es_core_news_md

Collecting es-core-news-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.1.0/es_core_news_md-3.1.0-py3-none-any.whl (42.7 MB)
[K     |████████████████████████████████| 42.7 MB 48 kB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('es_core_news_md')


In [3]:
from functools import reduce
from gensim.models import Word2Vec
import multiprocessing
from nltk.cluster import kmeans, cosine_distance
import numpy as np
import re
import spacy
from spacy.tokens import Token
from spacy.language import Language
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize
from collections import Counter

# Preprocessing

In [4]:
# load trained model from spacy
nlp = spacy.load("es_core_news_md")

<a id="lemma_freq_counter"></a>

In [5]:
@Language.component("lemma_counter")
def lemma_counter(doc):
    '''
    custom component to count lemma frecuency
    '''
    lemmas = [token.lemma_ for token in doc]
    lemmas_count = Counter(lemmas)
    get_lemma_count = lambda x: lemmas_count[x.lemma_]
    Token.set_extension("lemma_count", getter = get_lemma_count)
    return doc

In [6]:
# add custom component to pipeline
nlp.add_pipe("lemma_counter", name="lemma_freq_counter", last=True)

<function __main__.lemma_counter>

<a id="regex_clean"></a>

In [7]:
!wget https://cs.famaf.unc.edu.ar/~laura/corpus/lavoztextodump.txt.tar.gz
!tar -xvf lavoztextodump.txt.tar.gz

--2021-10-06 18:53:05--  https://cs.famaf.unc.edu.ar/~laura/corpus/lavoztextodump.txt.tar.gz
Resolving cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)... 200.16.17.55
Connecting to cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)|200.16.17.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12136935 (12M) [application/x-gzip]
Saving to: ‘lavoztextodump.txt.tar.gz.4’


2021-10-06 18:53:08 (7.71 MB/s) - ‘lavoztextodump.txt.tar.gz.4’ saved [12136935/12136935]

lavoztextodump.txt


In [8]:
# open dataset
filename = "lavoztextodump.txt"
text_file = open(filename, "r")
# load in memory
dataset = text_file.read()
#clean data
dataset = re.sub('&#[0-9]{0,5}.', '', dataset)
# close file
text_file.close()

<a id="spacy_pipeline"></a>

In [9]:
# show pipeline steps
nlp.pipe_names

['tok2vec',
 'morphologizer',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'lemma_freq_counter']

In [10]:
# process dataset with spacy pipeline
# dataset is pruned to the maximum amount possible in a laptop or desk computer
doc = nlp(dataset[:1000000])

<a id="filtering"></a>

In [11]:
# function designed to refine dataset processing
def filter_tokens(doc: spacy.tokens.doc.Doc) -> [spacy.tokens.token.Token]:
    def is_target_token(token: spacy.tokens.token.Token) -> bool:
        return token.is_alpha and not token.is_stop and len(token.sent) > 10
    
    return filter(is_target_token, doc)

In [88]:
# more dataset preprocessing
tokens = filter_tokens(doc)

<a id="feature_generation"></a>

In [89]:
def feature_generator(
    tokens: [spacy.tokens.token.Token],
    *, # all remaining arguments should be named
    token_threshold: int,
    context_threshold: int,
) -> dict:
    dicc = {}
    for token in tokens:
        lema = token.lemma_
        # reduce dimensionality: discard words with not enough examples
        if token._.lemma_count < token_threshold:
            continue
        # check if some form of this word has been processed before
        # or create new dict
        features = dicc.get(lema, {})

        # create feature based on part of speech for this token
        pos = "POS_" + token.pos_
        # increment feature count
        features[pos] = features.get(pos, 0) + 1
    
        # create feature based on syntactic dependency of token
        dep = "DEP__" + token.dep_
        # increment feature count
        features[dep] = features.get(dep, 0) + 1
            
        # insert context based feature using co-frequence in sentences
        for word in token.sent:
            if word == token:
                continue
            # reduce dimensionality: drop unfrequent words
            if word._.lemma_count > context_threshold:
                if word.like_num:
                    context = "NUM"
                else:
                    context = word.lemma_

                # increment feature count
                features[context] = features.get(context, 0) + 1
                
        # create feature based on type of entity
        if token.ent_type > 0:
            ent = "ENT_" + token.ent_type_
            # increment feature count
            features[ent] = features.get(ent, 0) + 1

        # create feature based on parent-child dependency
        tripla = "TRIPLA__" + token.lemma_ + "__" + token.dep_ + "__" + token.head.lemma_
        # increment feature count
        features[tripla] = features.get(tripla, 0) + 1

        # create feature based on parent-child dependency structure
        tripla_struct = "TRIPLA__" + token.pos_ + "__" + token.dep_ + "__" + token.head.pos_
        # increment feature count
        features[tripla] = features.get(tripla, 0) + 1

        # create feature based on ancestor-token-child structure
        for ancestor in token.ancestors:
          for child in token.children:
            tripla_struct = "ATC_" + ancestor.pos_ + "__" + token.pos_ + "__" + child.pos_ 
            # increment feature count
            features[tripla_struct] = features.get(tripla, 0) + 1
        
        dicc[lema] = features
    
    return dicc

In [90]:
# get {word: features} dictionary
dicc = feature_generator(tokens, token_threshold=10, context_threshold=50)

<a id="key_feature_division"></a>

In [91]:
# unpack keys and features in two separated tuples to ease processing
(keys, features) = zip(*dicc.items())

<a id="feature_vectorization"></a>

In [92]:
def vectorize_features(features: (dict)) -> np.ndarray:
    vectorized_features = DictVectorizer(sparse=False).fit_transform(features)
    return vectorized_features

In [93]:
# use sklearn dict vectorizer to create arrays from each set of features
vectorized_features = vectorize_features(features)

<a id="feature_normalization"></a>

In [94]:
def reduce_matrix(matrix: np.ndarray, *, variance_treshold: float):
    print(f'INPUT SHAPE: {matrix.shape}')
    # reduce all vectors to [0, 1] space
    normalized_matrix = normalize(matrix, axis=1)
    # compute variances in each row
    matrix_variances = np.var(matrix, axis=0)
    # create mask for features with high correlation (low variance)
    bool_mask = np.where(matrix_variances < variance_treshold)
    # filter features with high correlation (variance under treshold)
    raked_matrix = np.delete(normalized_matrix, bool_mask, axis=1)
    print(f'OUTPUT SHAPE: {raked_matrix.shape}')
    return raked_matrix

In [95]:
reduced_matrix = reduce_matrix(vectorized_features, variance_treshold=0.01)

INPUT SHAPE: (1664, 40531)
OUTPUT SHAPE: (1664, 3104)


<a id="clustering"></a>

# Clustering

In [96]:
def generate_clusters(
    matrix: np.ndarray,
    n_clusters: int
) -> KMeans:
    # generate word clusters using the KMeans algorithm.
    print("\nClustering started")
    # Instantiate KMeans clusterer for n_clusters
    km_model = KMeans(n_clusters=n_clusters, random_state=3)
    # create clusters
    km_model.fit(matrix)
    print("Clustering finished")
    return km_model

In [97]:
clusters = generate_clusters(reduced_matrix, 100)


Clustering started
Clustering finished


# Results

<a id="summary"></a>

In [98]:
def display_summary(clusters: KMeans):
    cluster_count = Counter(sorted(clusters.labels_))
    for cluster in cluster_count:
        print ("Cluster#", cluster," - Total words:", cluster_count[cluster])

In [99]:
# show number of words captured by each cluster
display_summary(clusters)

Cluster# 0  - Total words: 3
Cluster# 1  - Total words: 23
Cluster# 2  - Total words: 17
Cluster# 3  - Total words: 1
Cluster# 4  - Total words: 5
Cluster# 5  - Total words: 7
Cluster# 6  - Total words: 6
Cluster# 7  - Total words: 25
Cluster# 8  - Total words: 21
Cluster# 9  - Total words: 15
Cluster# 10  - Total words: 23
Cluster# 11  - Total words: 21
Cluster# 12  - Total words: 74
Cluster# 13  - Total words: 9
Cluster# 14  - Total words: 20
Cluster# 15  - Total words: 23
Cluster# 16  - Total words: 3
Cluster# 17  - Total words: 13
Cluster# 18  - Total words: 20
Cluster# 19  - Total words: 9
Cluster# 20  - Total words: 3
Cluster# 21  - Total words: 23
Cluster# 22  - Total words: 1
Cluster# 23  - Total words: 26
Cluster# 24  - Total words: 7
Cluster# 25  - Total words: 48
Cluster# 26  - Total words: 1
Cluster# 27  - Total words: 1
Cluster# 28  - Total words: 16
Cluster# 29  - Total words: 20
Cluster# 30  - Total words: 24
Cluster# 31  - Total words: 4
Cluster# 32  - Total words: 1
Cl

<a id="results"></a>

In [100]:
def display_clusters(clusters: KMeans, keys: (str)):
    cluster_count = Counter(sorted(clusters.labels_))
    print("Top words per cluster:\n")
    #sort cluster centers by proximity to centroid
    order_centroids = clusters.cluster_centers_.argsort()[:, ::-1] 
    
    for cluster_idx in cluster_count:
        print(f"Cluster {cluster_idx} - Total words: {cluster_count[cluster_idx]}")
        print("\n\nWords:", end='')
        # get words inside each cluster
        cluster_words = np.where(clusters.labels_ == cluster_idx)[0]
        # print all words inside cluster
        for idx in cluster_words:
            print(f" {keys[idx]}", end=",")
        print("\n\n")

In [101]:
def display_cluster_features(clusters: KMeans, features: (dict)):
    cluster_count = Counter(sorted(clusters.labels_))
    print("Most decisive features for each cluster:\n")
    #sort cluster centers by proximity to centroid
    order_centroids = clusters.cluster_centers_.argsort()[:, ::-1] 
    
    # flatten features dict
    flat_features = reduce(lambda x, y: {**x, **y}, features, {})
    # get feature keys
    feature_keys = list(flat_features.keys())

    # iterate over each cluster
    for cluster_idx in cluster_count:
        print(f"Cluster {cluster_idx} - Total words: {cluster_count[cluster_idx]}")
        print("Frequent terms:", end='')
        # print most determinant features for this cluster
        for ind in order_centroids[cluster_idx, :10]:
            print(f' {feature_keys[ind]}', end=',')

        print("\n\n")

In [102]:
display_clusters(clusters, keys)

Top words per cluster:

Cluster 0 - Total words: 3


Words: armar, añadir, repetir,


Cluster 1 - Total words: 23


Words: entrada, hogar, censo, feriado, predio, trabajo, texto, palabra, escena, entorno, tendencia, especialista, fenómeno, despacho, muerto, hospital, aula, esquina, justicia, horario, piso, tía, tarifa,


Cluster 2 - Total words: 17


Words: a, y, estar, o, haber, acerca, mañana, justo, mil, ser, e, ciento, deber, nueve, poder, ambos, cuyo,


Cluster 3 - Total words: 1


Words: pesar,


Cluster 4 - Total words: 5


Words: Latinoamérica, Jujuy, Honduras, Venezuela, Cuba,


Cluster 5 - Total words: 7


Words: Mestre, Perón, Chávez, Edwards, Mario, Vargas, Llosa,


Cluster 6 - Total words: 6


Words: diverso, distinto, tal, demasiado, semejante, tanto,


Cluster 7 - Total words: 25


Words: Gobierno, Consejo, Ejecutivo, Justicia, Legislatura, Corte, Congreso, Frente, Cívico, PJ, Peronismo, UCR, Concejo, Plan, Premio, Twitter, PV, Iglesia, Policía, Internet, Clarín, Asamble

In [103]:
def search_word_cluster(clusters: KMeans, corpus: (str), searched_word: str) -> [str]:
    clusts = clusters.labels_
    search_idx = corpus.index(searched_word)
    return [word for idx, word in enumerate(corpus, 0) if clusts[search_idx] == clusts[idx]]

In [104]:
search_word_cluster(clusters, keys, 'viernes')

['miércoles',
 'mejora',
 'reforma',
 'martes',
 'actor',
 'factor',
 'municipio',
 'desafío',
 'viernes',
 'convocatoria',
 'lunes',
 'comercio',
 'terreno',
 'lluvia',
 'intervención',
 'sanción',
 'oficialismo',
 'radicalismo',
 'establecimiento',
 'calle',
 'jueves',
 'crisis',
 'economía',
 'carne',
 'presidenta',
 'industria',
 'decreto',
 'anuncio',
 'economista',
 'institucionalidad',
 'oferta',
 'disputa',
 'expectativa',
 'respaldo',
 'verano',
 'interna',
 'disidente',
 'bloque',
 'fórmula',
 'aplicación',
 'oposición',
 'domingo',
 'superficie',
 'cultura',
 'herramienta',
 'margen',
 'administración',
 'usuario',
 'urna',
 'organismo',
 'boleta',
 'aspirante',
 'aeropuerto',
 'mérito',
 'partida',
 'comicio',
 'edil',
 'novela']

In [105]:
search_word_cluster(clusters, keys, 'Brasil')

['l',
 'Unidos',
 'Córdoba',
 'Capital',
 'Cruz',
 'Santa',
 'Argentina',
 'San',
 'Cuarto',
 'España',
 'Colón',
 'Alemania',
 'Judicial',
 'Federal',
 'Curitiba',
 'China',
 'Brasilia',
 'Brasil',
 'Chile',
 'Europa',
 'América',
 'Ecuador',
 'Quito',
 'Madrid']

In [106]:
search_word_cluster(clusters, keys, 'Sota')

['Pedro',
 'Daniel',
 'Grahovac',
 'Cristina',
 'Fernández',
 'Kirchner',
 'Néstor',
 'Juez',
 'Oscar',
 'Aguad',
 'Juan',
 'Micheli',
 'Sota',
 'Schiaretti',
 'Giacomino',
 'Liu',
 'Miguel',
 'Macri',
 'Presidenta',
 'Timerman',
 'Apablaza',
 'Piñera',
 'Sala',
 'Correa',
 'Patiño',
 'Bunge',
 'Campana',
 'Scotto']

In [107]:
search_word_cluster(clusters, keys, 'Facultad')

['Educación',
 'Ministerio',
 'Ley',
 'Facultad',
 'Provincia',
 'Nación',
 'Punilla',
 'Plaza',
 'Secretaría',
 'Universidad',
 'Suprema',
 'Casa',
 'Agricultura',
 'Comercio',
 'Servicios',
 'Cámara',
 'Afip',
 'Centro',
 'Epec',
 'Públicos',
 'Municipalidad',
 'Turismo',
 'Filosofía',
 'Tribunales',
 'Julio',
 'Palacio',
 'Fiesta',
 'Bustos']

In [108]:
search_word_cluster(clusters, keys, 'colegio')

['plazo',
 'ingreso',
 'toma',
 'medida',
 'colegio',
 'ley',
 'escuela',
 'principio',
 'fuerza',
 'lucha',
 'agua',
 'cuestión',
 'causa',
 'protesta',
 'edificio',
 'canal',
 'argumento',
 'posición',
 'información',
 'respuesta',
 'estadística',
 'tema',
 'motivo',
 'visita',
 'derecho',
 'expediente',
 'sector',
 'funcionario',
 'dirigente',
 'norma',
 'legislador',
 'presupuesto',
 'libro',
 'atención',
 'estrategia',
 'idea',
 'recurso',
 'sociedad',
 'mecanismo',
 'mercado',
 'resultado',
 'inflación',
 'crecimiento',
 'aumento',
 'impuesto',
 'democracia',
 'proceso',
 'impacto',
 'salida',
 'voto',
 'gobernador',
 'elección',
 'peronismo',
 'condición',
 'abogado',
 'intención',
 'gremio',
 'fallo',
 'tribunal',
 'kirchnerismo',
 'ausencia',
 'autoridad',
 'detalle',
 'nombre',
 'tarea',
 'noticia',
 'prensa',
 'vía',
 'conducción',
 'minero',
 'fiscal',
 'autor',
 'alumno',
 'turno',
 'magistrado',
 'sentencia',
 'vigencia',
 'monto']

In [120]:
search_word_cluster(clusters, keys, 'sábado')

['sentido',
 'proyecto',
 'objetivo',
 'forma',
 'experiencia',
 'institución',
 'sábado',
 'reclamo',
 'anteproyecto',
 'cambio',
 'plan',
 'obra',
 'debate',
 'estudiante',
 'educación',
 'posibilidad',
 'noche',
 'discusión',
 'seguridad',
 'conflicto',
 'acción',
 'infraestructura',
 'propuesta',
 'participación',
 'necesidad',
 'acto',
 'entidad',
 'ciudadano',
 'base',
 'punto',
 'reacción',
 'dato',
 'provincia',
 'presencia',
 'capital',
 'investigación',
 'sistema',
 'gestión',
 'camino',
 'aporte',
 'número',
 'mano',
 'país',
 'mayoría',
 'pregunta',
 'población',
 'diferencia',
 'reunión',
 'mitad',
 'consecuencia',
 'empresa',
 'iniciativa',
 'curso',
 'actividad',
 'informe',
 'comisión',
 'marcha',
 'valor',
 'modelo',
 'negocio',
 'precio',
 'mundo',
 'importancia',
 'presentación',
 'control',
 'campaña',
 'organización',
 'encuesta',
 'demanda',
 'costo',
 'falta',
 'intendentes',
 'línea',
 'agenda',
 'marco',
 'puesto',
 'cargo',
 'paso',
 'gobierno',
 'mediados',
 

In [121]:
def in_same_cluster(clusters: KMeans, corpus: (str), words: [str]) -> bool:
    clusts = clusters.labels_
    word_clusters = map(lambda w: clusts[corpus.index(w)], words)
    number_of_clusters = len(set(word_clusters))
    return number_of_clusters <= 1

In [122]:
in_same_cluster(clusters, keys, ['lunes', 'martes', 'jueves', 'viernes', 'domingo'])

True

# Clustering with word embeddings

In [123]:
# load trained model from spacy
nlp = spacy.load("es_core_news_md")

In [124]:
# show pipeline steps
nlp.pipe_names

['tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [125]:
# process dataset with spacy pipeline
# dataset is pruned to the maximum amount possible in a laptop or desk computer
# disable ner and parser for speed
with nlp.select_pipes(disable=["ner"]):
  doc = nlp(dataset[:1000000])

In [126]:
# function designed to refine dataset processing
def filter_tokens_in_sent(sent: spacy.tokens.span.Span) -> [spacy.tokens.token.Token]:
    def is_target_token(token: spacy.tokens.token.Token) -> bool:
        return token.is_alpha and not token.is_stop
    
    return filter(is_target_token, sent)

In [127]:
def lemmatize(token: spacy.tokens.token.Token) -> str:
  return token.lemma_

In [128]:
def filter_sents(doc: spacy.tokens.doc.Doc) -> [[str]]:
  sents = []
  for sent in doc.sents:
      sent_tokens = filter_tokens_in_sent(sent)
      sents.append(list(map(lemmatize, sent_tokens)))
  return sents

In [129]:
def generate_embedding(sentences: [[str]]) -> ([str], np.ndarray):
  # Count the number of cores in a computer
  cores = multiprocessing.cpu_count()
  w2v_model = Word2Vec(
                     min_count=20,
                     window=2,
                     #size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)

  w2v_model.build_vocab(sentences, progress_per=10000)

  w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

  words = w2v_model.wv.index_to_key
  normed_vectors = w2v_model.wv.get_normed_vectors()

  return (words, normed_vectors)

In [130]:
words, normed_vectors = generate_embedding(filter_sents(doc))

In [131]:
clusters = generate_clusters(normed_vectors, 45)


Clustering started
Clustering finished


In [132]:
display_summary(clusters)

Cluster# 0  - Total words: 10
Cluster# 1  - Total words: 56
Cluster# 2  - Total words: 22
Cluster# 3  - Total words: 15
Cluster# 4  - Total words: 24
Cluster# 5  - Total words: 8
Cluster# 6  - Total words: 35
Cluster# 7  - Total words: 24
Cluster# 8  - Total words: 20
Cluster# 9  - Total words: 3
Cluster# 10  - Total words: 17
Cluster# 11  - Total words: 5
Cluster# 12  - Total words: 12
Cluster# 13  - Total words: 5
Cluster# 14  - Total words: 17
Cluster# 15  - Total words: 21
Cluster# 16  - Total words: 40
Cluster# 17  - Total words: 35
Cluster# 18  - Total words: 4
Cluster# 19  - Total words: 27
Cluster# 20  - Total words: 8
Cluster# 21  - Total words: 12
Cluster# 22  - Total words: 1
Cluster# 23  - Total words: 31
Cluster# 24  - Total words: 5
Cluster# 25  - Total words: 60
Cluster# 26  - Total words: 18
Cluster# 27  - Total words: 1
Cluster# 28  - Total words: 2
Cluster# 29  - Total words: 40
Cluster# 30  - Total words: 21
Cluster# 31  - Total words: 17
Cluster# 32  - Total words: 

In [133]:
display_clusters(clusters, words)

Top words per cluster:

Cluster 0 - Total words: 10


Words: decidir, Capital, privado, septiembre, localidad, educativo, Paz, continuar, morir, plaza,


Cluster 1 - Total words: 56


Words: y, a, año, mil, nacional, María, dejar, mundo, importante, producir, generar, incluir, nivel, pensar, vuelo, víctima, Santa, kilómetro, casa, Unidos, funcionar, resto, agosto, avenida, firma, pareja, inflación, central, norte, función, vivienda, nacer, metro, cubrir, red, obstante, muerto, auto, avión, oportunidad, camión, marcar, departamento, servir, Cruz, importancia, imponer, edad, Perú, blanco, gustar, Alta, salario, proponer, Turismo, Colón,


Cluster 2 - Total words: 22


Words: Villa, Carlos, Cámara, empresario, destacar, sufrir, Nobel, discusión, director, anunciar, empezar, premio, municipal, conocido, verano, compañía, extremo, noviembre, lugar, cuestionar, opinión, entrar,


Cluster 3 - Total words: 15


Words: electrónico, caer, legislador, conocer, capital, llamado, enero, superar, pe

In [134]:
search_word_cluster(clusters, words, 'colegio')

['escuela',
 'estudiante',
 'toma',
 'colegio',
 'alumno',
 'Educación',
 'secundario',
 'tomado',
 'levantar']

In [135]:
search_word_cluster(clusters, words, 'lunes')

['Córdoba',
 'proyecto',
 'cordobés',
 'Provincia',
 'ruta',
 'pagar',
 'brasileño',
 'lunes',
 'estudio',
 'área',
 'entregar',
 'habitante',
 'registrar',
 'julio',
 'construcción',
 'par',
 'Municipalidad',
 'Jorge',
 'suerte',
 'revelar']

In [138]:
search_word_cluster(clusters, words, 'Afip')

['provincia',
 'quedar',
 'sumar',
 'barrio',
 'Afip',
 'edificio',
 'cifra',
 'insistir',
 'representar',
 'informe',
 'aula',
 'atención',
 'Legislatura',
 'cuerpo',
 'rescate',
 'suba',
 'depender']

In [137]:
in_same_cluster(clusters, words, ['lunes', 'jueves', 'viernes', 'domingo'])

False