# Clustering

> Trabajo perteneciente a la cátedra "Text Mining" de Laura Alonso Alemany - FaMAF UNC. 2021

- _Corpus:_ compendio de noticias de "La Voz del Interior"
- _Referencias:_ 
    - [word_clustering](https://github.com/danibosch/word_clustering) 
    de [@danibosch](https://github.com/danibosch)
    - [textmining-clustering](https://github.com/facumolina/textmining-clustering) 
    de [@facumolina](https://github.com/facumolina/textmining-clustering)

## Etapas:

#### 1. Preprocesamiento:
   1. [Limpieza del dataset](#regex_clean):
   
       Se quitan simbolos no alfanumericos mediante regex.
       
       
   2. [Procesamiento con el pipeline de spacy](#spacy_pipeline):
   
       El pipeline de spacy ejecuta las siguientes tareas:
       - Segmentacion en tokens.
       - Vectorizacion.
       - Analisis morfologico (genero, numero, etc.).
       - Analisis de dependencias.
       - Clasificacion por reglas.
       - Lematizacion.
       - Clasificacion de entidades.
       - Conteo de frecuencias de lemas ([agregado](#lemma_freq_counter)).
       
       
   3. [Filtrado](#filtering):
   
      Se busca un subconjunto de tokens
      que cumplan las siguientes caracteristicas:
       - no contiene caracteres no alfabeticos
       - no es "stop word"
       - no expresa un numero
       
       
   4. [Generacion de **features**](#feature_generation):
   
      Por cada palabra del conjunto procesado en el paso anterior 
      generar un conjunto de features de acuerdo con:
       - el tipo de palabra.
       - su funcion sintactica.
       - su coocurrencia con palabras en su misma oracion (contexto).
       - el tipo de entidad que es, en el caso que lo sea.
       - la dependencia con su predecesor sintactico en el arbol de dependencias.
       
       
   5. [Separacion de palabras y features](#key_feature_division).
   6. [Vectorizacion de features](#feature_vectorization).
   7. [Normalizacion de features](#feature_normalization).
    

#### 2. [Clusterizacion](#clustering):
   1. Eleccion del numero de clusters.
   2. Seleccion del estado aleatorio para obtener resultados deterministicos.
   3. Instanciacion del algoritmo KMeans para obtener los clusters.
    
#### 3. Resultados:
   1. [Conteo de palabras por cluster](#summary).
   2. [Listado de clusters con la siguiente informacion](#results):
      - Numero de cluster.
      - Features mas determinantes para ese cluster.
      - Palabras dentro del cluster.

In [1]:
from functools import reduce
import numpy as np
import re
import spacy
from spacy.tokens import Token
from spacy.language import Language
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize
from collections import Counter

# Preprocessing

In [2]:
# load trained model from spacy
nlp = spacy.load("es_core_news_sm", exclude=['vectors'])

<a id="lemma_freq_counter"></a>

In [3]:
@Language.component("lemma_counter")
def lemma_counter(doc):
    '''
    custom component to count lemma frecuency
    '''
    lemmas = [token.lemma_ for token in doc]
    lemmas_count = Counter(lemmas)
    get_lemma_count = lambda x: lemmas_count[x.lemma_]
    Token.set_extension("lemma_count", getter = get_lemma_count)
    return doc

In [4]:
# add custom component to pipeline
nlp.add_pipe("lemma_counter", name="lemma_freq_counter", last=True)

<function __main__.lemma_counter(doc)>

<a id="regex_clean"></a>

In [5]:
# open dataset
filename = "lavoztextodump.txt"
text_file = open(filename, "r")
# load in memory
dataset = text_file.read()
#clean data
dataset = re.sub('&#[0-9]{0,5}.', '', dataset)
# close file
text_file.close()

<a id="spacy_pipeline"></a>

In [6]:
# show pipeline steps
nlp.pipe_names

['tok2vec',
 'morphologizer',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'lemma_freq_counter']

In [7]:
# process dataset with spacy pipeline
# dataset is pruned to the maximum amount possible in a laptop or desk computer
doc = nlp(dataset[:1000000])

<a id="filtering"></a>

In [8]:
# function designed to refine dataset processing
def filter_tokens(doc: spacy.tokens.doc.Doc) -> [spacy.tokens.token.Token]:
    def is_target_token(token: spacy.tokens.token.Token) -> bool:
        return token.is_alpha and not token.like_num and not token.is_stop
    
    return filter(is_target_token, doc)

In [9]:
# more dataset preprocessing
tokens = filter_tokens(doc)

<a id="feature_generation"></a>

In [10]:
def feature_generator(
    tokens: [spacy.tokens.token.Token],
    threshold_token: int,
    threshold_context: int,
) -> dict:
    dicc = {}
    for token in tokens:
        lema = token.lemma_
        features = {}
        # reduce dimensionality: discard words with not enough examples
        if token._.lemma_count < threshold_token:
            continue
        # check if some form of this word has been processed before
        if lema in dicc:
            features = dicc[lema]

        # create feature based on part of speech for this token
        pos = "POS_" + token.pos_
        if not pos in features:
            features.setdefault(pos, 0)
        features[pos] += 1
    
        # create feature based on syntactic dependency of token
        dep = "DEP__" + token.dep_
        if not dep in features:
            features.setdefault(dep, 0)
        features[dep] += 1
            
        # insert context based feature using co-frequence in sentences
        for word in token.sent:
            # reduce dimensionality: drop unfrequent words
            if word._.lemma_count > threshold_context:
                if word.like_num:
                    context = "NUM"
                elif word.is_alpha and not word.is_stop:
                    context = word.lemma_
                else:
                    continue
                    
                if not context in features:
                    features.setdefault(context, 0)
                features[context] += 1
                
        # create feature based on type of entity
        if token.ent_type:
            ent = "ENT_" + token.ent_type_
            if not ent in features:
                features.setdefault(ent, 0)
            features[ent] += 1

        # create feature based on parent-child dependency
        tripla = "TRIPLA__" + token.lemma_ + "__" + token.dep_ + "__" + token.head.lemma_
        if not tripla in features:
            features.setdefault(tripla, 0)
        features[tripla] += 1
    
        dicc[lema] = features
    
    return dicc

In [11]:
# get {word: features} dictionary
dicc = feature_generator(tokens, 100, 50)

<a id="key_feature_division"></a>

In [12]:
# unpack keys and features in two separated tuples to ease processing
(keys, features) = zip(*dicc.items())

<a id="feature_vectorization"></a>

In [13]:
def vectorize_features(features: (dict)) -> np.ndarray:
    matrix = DictVectorizer(sparse=False).fit_transform(features)
    return matrix

In [14]:
# use sklearn dict vectorizer to create arrays from each set of features
matrix = vectorize_features(features)

<a id="feature_normalization"></a>

In [15]:
# reduce all vectors to [0, 1] space
matrix = normalize(matrix)

<a id="clustering"></a>

# Clustering

In [16]:
def generate_clusters(
    matrix: np.ndarray,
    n_clusters: int
) -> KMeans:
    # generate word clusters using the KMeans algorithm.
    print("\nClustering started")
    # Instantiate KMeans clusterer for n_clusters
    km_model = KMeans(n_clusters=n_clusters, random_state=3)
    # create clusters
    km_model.fit(matrix)
    print("Clustering finished")
    return km_model

In [17]:
clusters = generate_clusters(matrix, 20)


Clustering started
Clustering finished


# Results

<a id="summary"></a>

In [18]:
def display_summary(clusters: KMeans):
    cluster_count = Counter(sorted(clusters.labels_))
    for cluster in cluster_count:
        print ("Cluster#", cluster," - Total words:", cluster_count[cluster])

In [19]:
# show number of words captured by each cluster
display_summary(clusters)

Cluster# 0  - Total words: 3
Cluster# 1  - Total words: 2
Cluster# 2  - Total words: 1
Cluster# 3  - Total words: 2
Cluster# 4  - Total words: 6
Cluster# 5  - Total words: 3
Cluster# 6  - Total words: 11
Cluster# 7  - Total words: 6
Cluster# 8  - Total words: 3
Cluster# 9  - Total words: 2
Cluster# 10  - Total words: 5
Cluster# 11  - Total words: 1
Cluster# 12  - Total words: 6
Cluster# 13  - Total words: 1
Cluster# 14  - Total words: 6
Cluster# 15  - Total words: 1
Cluster# 16  - Total words: 2
Cluster# 17  - Total words: 9
Cluster# 18  - Total words: 1
Cluster# 19  - Total words: 4


<a id="results"></a>

In [20]:
def display_clusters(clusters: KMeans):
    cluster_count = Counter(sorted(clusters.labels_))
    print("Top terms and words per cluster:\n")
    #sort cluster centers by proximity to centroid
    order_centroids = clusters.cluster_centers_.argsort()[:, ::-1] 
    
    # flatten features dict
    flat_features = reduce(lambda x, y: x | y, features, {})
    # get feature keys
    feature_keys = list(flat_features.keys())

    # iterate over each cluster
    for cluster_idx in cluster_count:
        print(f"Cluster {cluster_idx} - Total words: {cluster_count[cluster_idx]}")
        print("Frequent terms:", end='')
        # print most determinant features for this cluster
        for ind in order_centroids[cluster_idx, :10]:
            print(f' {feature_keys[ind]}', end=',')

        print("\n\nWords:", end='')
        # get words inside each cluster
        cluster_words = np.where(clusters.labels_ == cluster_idx)[0]
        # print all words inside cluster
        for idx in cluster_words:
            print(f" {keys[idx]}", end=",")
        print("\n\n")

In [21]:
display_clusters(clusters)

Top terms and words per cluster:

Cluster 0 - Total words: 3
Frequent terms: colegio, TRIPLA__a__mark__cambiar, TRIPLA__tanto__nmod__otro, TRIPLA__poder__acl__Honduras, TRIPLA__Schiaretti__nsubj__eludir, TRIPLA__ser__cop__número, TRIPLA__Schiaretti__nmod__relación, lograr, TRIPLA__medio__conj__sector, grupo,

Words: haber, deber, poder,


Cluster 1 - Total words: 2
Frequent terms: grupo, TRIPLA__tanto__nmod__otro, TRIPLA__poder__aux__asistir, TRIPLA__tanto__det__sed, TRIPLA__poder__acl__Honduras, dejar, problema, o, TRIPLA__a__advmod__dejar, TRIPLA__a__case__medida,

Words: caso, tiempo,


Cluster 2 - Total words: 1
Frequent terms: TRIPLA__tanto__det__tractor, TRIPLA__a__case__unidad, TRIPLA__poder__acl__Honduras, TRIPLA__tanto__nmod__otro, TRIPLA__ciento__compound__230, DEP__fixed, persona, lograr, proyecto, TRIPLA__a__case__muerte,

Words: ver,


Cluster 3 - Total words: 2
Frequent terms: TRIPLA__a__mark__llamar, toma, Gobierno, TRIPLA__poder__acl__Honduras, año, TRIPLA__tanto__nmod_