<a href="https://colab.research.google.com/github/csralvall/clustering/blob/main/clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering

> Trabajo perteneciente a la cátedra "Text Mining" de Laura Alonso Alemany - FaMAF UNC. 2021

- _Corpus:_ compendio de noticias de "La Voz del Interior"
- _Referencias:_ 
    - [word_clustering](https://github.com/danibosch/word_clustering) 
    de [@danibosch](https://github.com/danibosch)
    - [textmining-clustering](https://github.com/facumolina/textmining-clustering) 
    de [@facumolina](https://github.com/facumolina/textmining-clustering)

## Etapas:

#### 1. Preprocesamiento:
   1. [Limpieza del dataset](#regex_clean):
   
       Se quitan simbolos no alfanumericos mediante regex.
       
       
   2. [Procesamiento con el pipeline de spacy](#spacy_pipeline):
   
       El pipeline de spacy ejecuta las siguientes tareas:
       - Segmentacion en tokens.
       - Vectorizacion.
       - Analisis morfologico (genero, numero, etc.).
       - Analisis de dependencias.
       - Clasificacion por reglas.
       - Lematizacion.
       - Clasificacion de entidades.
       - Conteo de frecuencias de lemas ([agregado](#lemma_freq_counter)).
       
       
   3. [Filtrado](#filtering):
   
      Se busca un subconjunto de tokens
      que cumplan las siguientes caracteristicas:
       - no contiene caracteres no alfabeticos
       - no es "stop word"
       - no expresa un numero
       
       
   4. [Generacion de **features**](#feature_generation):
   
      Por cada palabra del conjunto procesado en el paso anterior 
      generar un conjunto de features de acuerdo con:
       - el tipo de palabra.
       - su funcion sintactica.
       - su coocurrencia con palabras en su misma oracion (contexto).
       - el tipo de entidad que es, en el caso que lo sea.
       - la dependencia con su predecesor sintactico en el arbol de dependencias.
       
       
   5. [Separacion de palabras y features](#key_feature_division).
   6. [Vectorizacion de features](#feature_vectorization).
   7. [Normalizacion de features](#feature_normalization).
    

#### 2. [Clusterizacion](#clustering):
   1. Eleccion del numero de clusters.
   2. Seleccion del estado aleatorio para obtener resultados deterministicos.
   3. Instanciacion del algoritmo KMeans para obtener los clusters.
    
#### 3. Resultados:
   1. [Conteo de palabras por cluster](#summary).
   2. [Listado de clusters con la siguiente informacion](#results):
      - Numero de cluster.
      - Features mas determinantes para ese cluster.
      - Palabras dentro del cluster.

In [1]:
!pip install -U pip setuptools wheel
!pip install -U spacy sklearn numpy spacy nltk gensim



In [2]:
!python -m spacy download es_core_news_md

Collecting es-core-news-md==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_md-3.1.0/es_core_news_md-3.1.0-py3-none-any.whl (42.7 MB)
[K     |████████████████████████████████| 42.7 MB 48 kB/s 
Installing collected packages: es-core-news-md
Successfully installed es-core-news-md-3.1.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('es_core_news_md')


In [40]:
from functools import reduce
from gensim.models import Word2Vec
import multiprocessing
from nltk.cluster import kmeans, cosine_distance
import numpy as np
import re
import spacy
from spacy.tokens import Token
from spacy.language import Language
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize
from collections import Counter

# Preprocessing

In [111]:
# load trained model from spacy
nlp = spacy.load("es_core_news_md")

<a id="lemma_freq_counter"></a>

In [112]:
@Language.component("lemma_counter")
def lemma_counter(doc):
    '''
    custom component to count lemma frecuency
    '''
    lemmas = [token.lemma_ for token in doc]
    lemmas_count = Counter(lemmas)
    get_lemma_count = lambda x: lemmas_count[x.lemma_]
    Token.set_extension("lemma_count", getter = get_lemma_count)
    return doc

In [113]:
# add custom component to pipeline
nlp.add_pipe("lemma_counter", name="lemma_freq_counter", last=True)

<function __main__.lemma_counter>

<a id="regex_clean"></a>

In [114]:
!wget https://cs.famaf.unc.edu.ar/~laura/corpus/lavoztextodump.txt.tar.gz
!tar -xvf lavoztextodump.txt.tar.gz

--2021-10-06 18:40:37--  https://cs.famaf.unc.edu.ar/~laura/corpus/lavoztextodump.txt.tar.gz
Resolving cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)... 200.16.17.55
Connecting to cs.famaf.unc.edu.ar (cs.famaf.unc.edu.ar)|200.16.17.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12136935 (12M) [application/x-gzip]
Saving to: ‘lavoztextodump.txt.tar.gz.2’


2021-10-06 18:40:39 (7.76 MB/s) - ‘lavoztextodump.txt.tar.gz.2’ saved [12136935/12136935]

lavoztextodump.txt


In [115]:
# open dataset
filename = "lavoztextodump.txt"
text_file = open(filename, "r")
# load in memory
dataset = text_file.read()
#clean data
dataset = re.sub('&#[0-9]{0,5}.', '', dataset)
# close file
text_file.close()

<a id="spacy_pipeline"></a>

In [116]:
# show pipeline steps
nlp.pipe_names

['tok2vec',
 'morphologizer',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'lemma_freq_counter']

In [117]:
# process dataset with spacy pipeline
# dataset is pruned to the maximum amount possible in a laptop or desk computer
doc = nlp(dataset[:1000000])

ValueError: ignored

<a id="filtering"></a>

In [None]:
# function designed to refine dataset processing
def filter_tokens(doc: spacy.tokens.doc.Doc) -> [spacy.tokens.token.Token]:
    def is_target_token(token: spacy.tokens.token.Token) -> bool:
        return token.is_alpha and not token.is_stop and len(token.sent) > 10
    
    return filter(is_target_token, doc)

In [None]:
# more dataset preprocessing
tokens = filter_tokens(doc)

<a id="feature_generation"></a>

In [None]:
def feature_generator(
    tokens: [spacy.tokens.token.Token],
    *, # all remaining arguments should be named
    token_threshold: int,
    context_threshold: int,
) -> dict:
    dicc = {}
    for token in tokens:
        lema = token.lemma_
        # reduce dimensionality: discard words with not enough examples
        if token._.lemma_count < token_threshold:
            continue
        # check if some form of this word has been processed before
        # or create new dict
        features = dicc.get(lema, {})

        # create feature based on part of speech for this token
        pos = "POS_" + token.pos_
        # increment feature count
        features[pos] = features.get(pos, 0) + 1
    
        # create feature based on syntactic dependency of token
        dep = "DEP__" + token.dep_
        # increment feature count
        features[dep] = features.get(dep, 0) + 1
            
        # insert context based feature using co-frequence in sentences
        for word in token.sent:
            if word == token:
                continue
            # reduce dimensionality: drop unfrequent words
            if word._.lemma_count > context_threshold:
                if word.like_num:
                    context = "NUM"
                else:
                    context = word.lemma_

                # increment feature count
                features[context] = features.get(context, 0) + 1
                
        # create feature based on type of entity
        if token.ent_type > 0:
            ent = "ENT_" + token.ent_type_
            # increment feature count
            features[ent] = features.get(ent, 0) + 1

        # create feature based on parent-child dependency
        tripla = "TRIPLA__" + token.lemma_ + "__" + token.dep_ + "__" + token.head.lemma_
        # increment feature count
        features[tripla] = features.get(tripla, 0) + 1

        # create feature based on parent-child dependency structure
        tripla_struct = "TRIPLA__" + token.pos_ + "__" + token.dep_ + "__" + token.head.pos_
        # increment feature count
        features[tripla] = features.get(tripla, 0) + 1

        # create feature based on ancestor-token-child structure
        for ancestor in token.ancestors:
          for child in token.children:
            tripla_struct = "ATC_" + ancestor.pos_ + "__" + token.pos_ + "__" + child.pos_ 
            # increment feature count
            features[tripla] = features.get(tripla, 0) + 1


    
        dicc[lema] = features
    
    return dicc

In [None]:
# get {word: features} dictionary
dicc = feature_generator(tokens, token_threshold=15, context_threshold=50)

<a id="key_feature_division"></a>

In [31]:
# unpack keys and features in two separated tuples to ease processing
(keys, features) = zip(*dicc.items())

<a id="feature_vectorization"></a>

In [32]:
def vectorize_features(features: (dict)) -> np.ndarray:
    vectorized_features = DictVectorizer(sparse=False).fit_transform(features)
    return vectorized_features

In [33]:
# use sklearn dict vectorizer to create arrays from each set of features
vectorized_features = vectorize_features(features)

<a id="feature_normalization"></a>

In [38]:
def reduce_matrix(matrix: np.ndarray, *, variance_treshold: float):
    print(f'INPUT SHAPE: {matrix.shape}')
    # reduce all vectors to [0, 1] space
    normalized_matrix = normalize(matrix, axis=1)
    # compute variances in each row
    matrix_variances = np.var(matrix, axis=0)
    # create mask for features with high correlation (low variance)
    bool_mask = np.where(matrix_variances < variance_treshold)
    # filter features with high correlation (variance under treshold)
    raked_matrix = np.delete(normalized_matrix, bool_mask, axis=1)
    print(f'OUTPUT SHAPE: {raked_matrix.shape}')
    return raked_matrix

In [46]:
reduced_matrix = reduce_matrix(vectorized_features, variance_treshold=0.001)

INPUT SHAPE: (1664, 39633)
OUTPUT SHAPE: (1664, 39632)


<a id="clustering"></a>

# Clustering

In [21]:
def generate_clusters(
    matrix: np.ndarray,
    n_clusters: int
) -> KMeans:
    # generate word clusters using the KMeans algorithm.
    print("\nClustering started")
    # Instantiate KMeans clusterer for n_clusters
    km_model = KMeans(n_clusters=n_clusters, random_state=3)
    # create clusters
    km_model.fit(matrix)
    print("Clustering finished")
    return km_model

In [41]:
clusters = generate_clusters(reduced_matrix, 150)


Clustering started
Clustering finished


# Results

<a id="summary"></a>

In [24]:
def display_summary(clusters: KMeans):
    cluster_count = Counter(sorted(clusters.labels_))
    for cluster in cluster_count:
        print ("Cluster#", cluster," - Total words:", cluster_count[cluster])

In [None]:
# show number of words captured by each cluster
display_summary(clusters)

<a id="results"></a>

In [44]:
def display_clusters(clusters: KMeans, keys: (str)):
    cluster_count = Counter(sorted(clusters.labels_))
    print("Top words per cluster:\n")
    #sort cluster centers by proximity to centroid
    order_centroids = clusters.cluster_centers_.argsort()[:, ::-1] 
    
    for cluster_idx in cluster_count:
        print(f"Cluster {cluster_idx} - Total words: {cluster_count[cluster_idx]}")
        print("\n\nWords:", end='')
        # get words inside each cluster
        cluster_words = np.where(clusters.labels_ == cluster_idx)[0]
        # print all words inside cluster
        for idx in cluster_words:
            print(f" {keys[idx]}", end=",")
        print("\n\n")

In [45]:
def display_cluster_features(clusters: KMeans, features: (dict)):
    cluster_count = Counter(sorted(clusters.labels_))
    print("Most decisive features for each cluster:\n")
    #sort cluster centers by proximity to centroid
    order_centroids = clusters.cluster_centers_.argsort()[:, ::-1] 
    
    # flatten features dict
    flat_features = reduce(lambda x, y: {**x, **y}, features, {})
    # get feature keys
    feature_keys = list(flat_features.keys())

    # iterate over each cluster
    for cluster_idx in cluster_count:
        print(f"Cluster {cluster_idx} - Total words: {cluster_count[cluster_idx]}")
        print("Frequent terms:", end='')
        # print most determinant features for this cluster
        for ind in order_centroids[cluster_idx, :10]:
            print(f' {feature_keys[ind]}', end=',')

        print("\n\n")

In [None]:
display_clusters(clusters)

In [31]:
def search_word_cluster(clusters: KMeans, corpus: (str), searched_word: str) -> [str]:
    clusts = clusters.labels_
    search_idx = corpus.index(searched_word)
    return [word for idx, word in enumerate(corpus, 0) if clusts[search_idx] == clusts[idx]]

In [None]:
search_word_cluster(clusters, keys, 'viernes')

In [None]:
search_word_cluster(clusters, keys, 'Brasil')

In [34]:
search_word_cluster(clusters, keys, 'Sota')

['Sota']

In [35]:
search_word_cluster(clusters, keys, 'Facultad')

['Facultad']

In [None]:
search_word_cluster(clusters, keys, 'colegio')

In [29]:
def in_same_cluster(clusters: KMeans, corpus: (str), words: [str]) -> bool:
    clusts = clusters.labels_
    word_clusters = map(lambda w: clusts[corpus.index(w)], words)
    number_of_clusters = len(set(word_clusters))
    return number_of_clusters <= 1

In [30]:
in_same_cluster(clusters, keys, ['lunes', 'jueves', 'viernes', 'domingo'])

False

In [None]:
in_same_cluster(clusters, keys, ['lunes', 'martes'])

# Clustering with word embeddings

In [66]:
# load trained model from spacy
nlp = spacy.load("es_core_news_md")

In [67]:
# show pipeline steps
nlp.pipe_names

['tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [72]:
# process dataset with spacy pipeline
# dataset is pruned to the maximum amount possible in a laptop or desk computer
# disable ner and parser for speed
with nlp.select_pipes(disable=["ner"]):
  doc = nlp(dataset[:1000000])

In [99]:
# function designed to refine dataset processing
def filter_tokens_in_sent(sent: spacy.tokens.span.Span) -> [spacy.tokens.token.Token]:
    def is_target_token(token: spacy.tokens.token.Token) -> bool:
        return token.is_alpha and not token.is_stop
    
    return filter(is_target_token, sent)

In [100]:
def lemmatize(token: spacy.tokens.token.Token) -> str:
  return token.lemma_

In [101]:
def filter_sents(doc: spacy.tokens.doc.Doc) -> [[str]]:
  sents = []
  for sent in doc.sents:
      sent_tokens = filter_tokens_in_sent(sent)
      sents.append(list(map(lemmatize, sent_tokens)))
  return sents

In [102]:
def generate_embedding(sentences: [[str]]) -> ([str], np.ndarray):
  # Count the number of cores in a computer
  cores = multiprocessing.cpu_count()
  w2v_model = Word2Vec(
                     min_count=20,
                     window=2,
                     #size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)

  w2v_model.build_vocab(sentences, progress_per=10000)

  w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

  words = w2v_model.wv.index_to_key
  normed_vectors = w2v_model.wv.get_normed_vectors()

  return (words, normed_vectors)

In [103]:
words, normed_vectors = generate_embedding(filter_sents(doc))

In [None]:
clusters = generate_clusters(normed_vectors, 45)

In [None]:
display_summary(clusters)

In [None]:
display_clusters(clusters, words)