# *La Voz del Interior* COOC and Clustering 
This notebook demonstrates standard text mining methods for unsupervised learning: a cooccurrence matrix, Word2Vec neural embeddings, and k-means clustering.
### Description of dataset
The data was taken from 1666 recent articles (745195 words, 4646557 bytes) from *La Voz del Interior*, the newspaper of record in Córdoba, Argentina.


In [1]:
import pandas as pd
import numpy as np
import pickle
import nltk
import re
from sklearn import preprocessing
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans
from collections import Counter
from nltk.cluster import kmeans
from nltk.corpus import stopwords
from nltk import sent_tokenize, word_tokenize
from gensim.models import Word2Vec
nltk.download("punkt")

[nltk_data] Downloading package punkt to /users/bjames/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Run this cell if you want to download the data directly.

In [None]:
%%bash

mkdir -p ./data
python -m spacy download es
curl -L -o ./data/lavoz.txt.tar.gz https://cs.famaf.unc.edu.ar/~laura/corpus/lavoztextodump.txt.tar.gz
tar xvf ./data/lavoz.txt.tar.gz -C ./data
ls ./data
head -n 6 ./data/lavoztextodump.txt

A function to prepare the articles for the pipeline.

In [2]:
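def corpus_iterator(corpus_file):
    """Yield one dictionary per article: the line that follows a "-" separator
    goes under "title" and the article text goes under "body"."""
    document = {
        "title": None,
        "body": None
    }
    new_document = False  # defined up front so the first line cannot hit an unbound name

    with open(corpus_file, "r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip() == "-":
                # a lone "-" separates articles: start a fresh document
                new_document = True
                document = {
                    "title": None,
                    "body": None
                }
            elif new_document:
                # the first line after the separator is the title
                document["title"] = line.strip()
                new_document = False
            else:
                document["body"] = line.strip()
                yield document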

-------------------------------------------------------------------------
Helper code for non-GPU users: work with a smaller dataset.

In [3]:
!wc ./data/lavoztextodump.txt 

   38809  5615246 34886711 ./data/lavoztextodump.txt


In [4]:
!head -n 5000 ./data/lavoztextodump.txt > ./data/lavoztextodump-short2.txt
# by changing the value following '-n' you can control your input
!wc ./data/lavoztextodump-short2.txt

   5000  745195 4646557 ./data/lavoztextodump-short2.txt


-------------------------------------------------------------------------


In [5]:
filename = "data/lavoztextodump-short2.txt"
with open(filename, "r", encoding="utf-8") as text_file:
    dataset = text_file.read()

## Preprocessing Pipeline with nltk
1. sentence tokenize
2. word tokenize
3. eliminate numbers
4. lowercase
5. lemmatize
6. eliminate stopwords
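
As a rough illustration of the steps listed above, here is a minimal sketch on one toy sentence, applying them in the order the notebook's `tokenize_text` does. The regular-expression tokenizer, the four-word stopword set, and the two-entry lemma table are stand-ins invented for this example; the notebook itself uses nltk's tokenizers, nltk's Spanish stopword list, and the full lemma dictionary.

```python
import re

# Hypothetical miniature resources, for illustration only.
toy_stopwords = {"el", "la", "los", "en"}
toy_lemmas = {"niños": "niño", "jugaron": "jugar"}

def toy_pipeline(text):
    tokens = re.findall(r"\w+", text)       # crude word tokenization (stand-in for nltk)
    result = []
    for token in tokens:
        if token.isdigit():                  # eliminate numbers
            continue
        low = token.lower()                  # lowercase
        lemma = toy_lemmas.get(low, low)     # lemmatize when the form is known
        if lemma not in toy_stopwords:       # eliminate stopwords
            result.append(lemma)
    return result

print(toy_pipeline("Los niños jugaron en la plaza 3 horas"))
```

Words missing from the lemma table (such as "horas" here) pass through unchanged, which is also how the notebook's `lemmatize` function behaves.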

In [6]:
# high-frequency "stopwords" (requires the nltk "stopwords" corpus)
stopwords = nltk.corpus.stopwords.words('spanish')
# let's add punctuation and a few pesky tokens to the stopword list
punctuation = '.', '$','%','*1','*2','*asesor','*concejal','*director','*docente','*doctor','*economista','.','.', '-c','*', '&','#','!',"''",':','?', ';', ')', '(', '``', ',', '-'
stopwords.extend(punctuation)

In [7]:
# build a Python dictionary from a Spanish lemma list (uploaded);
# each line holds a tab-separated pair: the lemma, then an inflected form
with open("lemmatization-es.txt", "r", encoding="utf-8") as lemma_file:
    lemma_raw = lemma_file.read()
lemma = lemma_raw.split("\n")

lemma_dict = {}
for pair in lemma:
    w = pair.split("\t")
    if len(w) == 2:
        lemma_dict[w[1]] = w[0]

In [8]:
def lemmatize(word): # create a lemmatizing function
    if word in lemma_dict:
        word = lemma_dict[word]
    return word

In [9]:
# create a tokenizing function
def tokenize_text(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if token.isdigit():
            continue
        low_token = token.lower()
        lemma = lemmatize(low_token)
        if lemma not in stopwords:
            filtered_tokens.append(lemma)
    return filtered_tokens

In [10]:
# execute the preprocessing process
the_news = []
for key, value in enumerate(corpus_iterator("data/lavoztextodump-short2.txt")):
    p = tokenize_text(value["body"])
    the_news.append(p)
print("We are looking at ",len(the_news),"articles.")

We are looking at  1666 articles.


# COOC: four-word window
Context is essential to meaning in human language. A **cooccurrence matrix** gives us semantic and idiomatic insight into the language by showing which words often appear together and whether any dependencies exist between words. I chose a four-word window because I wanted to get as much context as possible into my matrix.
* Example of cooccurrence: **"albert"** will have a high cooccurrence count with the word **"einstein"**
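
To make the window idea concrete, here is a minimal sketch that counts the neighbors of a single target word. The token list is a made-up example and `window_counts` is a hypothetical helper, not the function used in this notebook (which builds one dictionary for every word and also stores each word's own frequency).

```python
from collections import Counter

def window_counts(tokens, target, window):
    """Count the words found within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        counts.update(left + right)
    return counts

# Made-up token sequence, for illustration only.
tokens = ["célebre", "físico", "albert", "einstein",
          "postular", "teoría", "albert", "einstein"]
counts = window_counts(tokens, "albert", 2)
print(counts)
```

With a window of 2, "einstein" is counted twice because it sits next to both occurrences of "albert"; a wider window pulls in more distant context at the cost of noisier counts.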


In [11]:
def dict_cooc_gen(e,w):
    """Takes as input a list of token lists (here, one list of tokens per article).
        Returns a list of cooccurrence dictionaries, one per distinct word, and the list of those words as the index for a matrix.
        Each dictionary also stores the word's own frequency under its own key.
        The window size w (words to the left and to the right) is chosen when executing."""
   
    cooc = [] # our list of dictionaries
    idx = [] # our list of words (for the matrix index)
    for sent in e:
        for m,tok in enumerate(sent):
            
            vecinos = []
            # neighbors to the right
            for l in range(1,w+1):
                if m+l > len(sent)-1:
                    break
                else:
                    vecinos.append(sent[m+l])
            # neighbors to the left
            for l in range(1,w+1):
                if m-l < 0:
                    break
                else:
                    vecinos.append(sent[m-l])
            
            
            # find (or create) the dictionary for this word before adding its neighbors
            if tok in idx:
                i = idx.index(tok)
                cooc[i][tok] +=1
            else :
                cooc.append({tok:1}) 
                idx.append(tok)
                i = idx.index(tok)

            for v in vecinos:
                if not v in cooc[i]:
                    cooc[i][v] =1
                else:
                    cooc[i][v] +=1
                        

    return cooc, idx

In [14]:
coocs, idx = dict_cooc_gen(the_news,4) 
coocs[558:559]  # let's look at one of the dictionaries ('albert': 5 is the word's own frequency)

[{'albert': 5,
  'einstein': 4,
  'célebre': 1,
  'teoría': 1,
  'año': 1,
  'postular': 1,
  'vez': 1,
  'alguno': 1,
  'comer': 1,
  'quizá': 1,
  'insigne': 1,
  'emblemático': 1,
  'tiempo': 1,
  'guerra': 1,
  'masivo': 1,
  'destrucción': 1,
  'ser': 1,
  'modelar': 1,
  'verdadero': 1,
  'popper': 1,
  'karl': 1,
  'parir': 1,
  'instinto': 1,
  'culpar': 1,
  'esfumar': 1,
  'si': 1,
  'decir': 1,
  'dar': 1,
  'jugar': 1,
  'vestir': 1,
  'sabin': 1,
  'hogar': 1,
  'iii': 1,
  'ngeles': 1,
  'colegiar': 1,
  'director': 1,
  'gorosito': 1,
  'cristino': 1}]

# To vector space!
A mathematical representation of our words allows us to apply sophisticated statistical and probabilistic methods to our data, such as the **k-means** clustering algorithm, which will enable us to see the underlying structure of the language use in *La Voz del Interior*.
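
As a conceptual sketch of the step that follows, this is roughly what turning a list of count dictionaries into a matrix involves: one row per dictionary, one column per distinct key, keys sorted alphabetically. `dicts_to_matrix` and the toy dictionaries are hypothetical; the notebook relies on scikit-learn's `DictVectorizer`, which returns a SciPy sparse matrix by default rather than nested lists.

```python
def dicts_to_matrix(dicts):
    """Turn a list of {feature: count} dicts into a dense, row-per-dict matrix."""
    features = sorted({key for d in dicts for key in d})    # one column per distinct key
    column = {name: j for j, name in enumerate(features)}
    matrix = [[0.0] * len(features) for _ in dicts]
    for i, d in enumerate(dicts):
        for key, value in d.items():
            matrix[i][column[key]] = float(value)
    return features, matrix

# Toy cooccurrence dictionaries, for illustration only.
toy_coocs = [
    {"albert": 2, "einstein": 2},
    {"einstein": 2, "albert": 2, "teoría": 1},
]
features, matrix = dicts_to_matrix(toy_coocs)
print(features)
print(matrix)
```

Any key missing from a dictionary becomes a zero in that row, which is why the resulting matrix is mostly zeros for a large vocabulary.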

In [15]:
vectorizer = DictVectorizer()
vec = vectorizer.fit_transform(coocs)

In [16]:
# 'terms' will be needed when we print out the clusters
# (scikit-learn 1.0+ spells this get_feature_names_out(); older releases use get_feature_names())
terms = vectorizer.get_feature_names_out()
for i in terms[7000:7020]:  # let's just look at a snapshot of our key words
    print(i)

custodio
cutáneo
cuyo
cuzco
cuál
cuán
cuándo
cuántico
cuánto
cuñado
cuño
cy
cynthia
cábala
cáceres
cádiz
cálculo
cálido
cámara
cámaras-


In [17]:
matriz = pd.DataFrame(vec.toarray(), columns=terms)
matriz.index = idx  # label each row with its word (works across pandas versions)
matriz.head(10)

Unnamed: 0,*ex,*experto,*fiscal,*horacio,*integrante,*legislador,*lic,*ministro,*médica,*pablo,...,óxido,úlcera,últimamente,último,única-,únicamente,único,útero,útil,﻿1
claro,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
crespo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
rodolfo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
martínez,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
imaginar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
unir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,11.0,0.0,0.0,4.0,0.0,0.0,7.0
preferir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
pensarlo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
decir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,21.0,0.0,0.0,6.0,0.0,2.0,10.0
hacer,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,28.0,0.0,0.0,15.0,0.0,0.0,9.0


Our matrix is extremely sparse (lots of zeros), and the raw counts differ widely from column to column, so we **normalize** it: dividing each column by its maximum rescales every count to a float between 0 and 1. The zeros stay zero.
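
A small sketch of column-wise max scaling on a toy count matrix, written in plain Python so the arithmetic is visible. `scale_columns_by_max` is a hypothetical helper; the guard that maps an all-zero column to 0.0 is a choice made for this example (the pandas expression `matriz / matriz.max(axis=0)` would produce NaN there instead).

```python
def scale_columns_by_max(matrix):
    """Divide every entry by the largest value in its column."""
    n_cols = len(matrix[0])
    col_max = [max(row[j] for row in matrix) for j in range(n_cols)]
    return [
        [row[j] / col_max[j] if col_max[j] else 0.0 for j in range(n_cols)]
        for row in matrix
    ]

# Toy counts, for illustration only.
counts = [
    [1.0, 0.0, 4.0],
    [0.0, 0.0, 2.0],
    [2.0, 0.0, 0.0],
]
scaled = scale_columns_by_max(counts)
for row in scaled:
    print(row)
```

The largest count in each column becomes 1.0 and every other entry becomes a fraction of it, so the number of zeros in the matrix does not change.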

In [19]:
matrix_normed = matriz / matriz.max(axis=0)
matrix_normed.head(50) # just a small glimpse of the matrix

Unnamed: 0,*ex,*experto,*fiscal,*horacio,*integrante,*legislador,*lic,*ministro,*médica,*pablo,...,óxido,úlcera,últimamente,último,única-,únicamente,único,útero,útil,﻿1
claro,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.001256,0.0,0.0,0.0,0.0,0.0,0.0
crespo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
rodolfo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
martínez,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.003413,0.0,0.0,0.002611
imaginar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
unir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.013819,0.0,0.0,0.013652,0.0,0.0,0.018277
preferir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.003413,0.0,0.0,0.0
pensarlo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
decir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.2,0.026382,0.0,0.0,0.020478,0.0,0.1,0.02611
hacer,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.035176,0.0,0.0,0.051195,0.0,0.0,0.023499


In [None]:
filename = "trained/lavoz_matrix.pickle" # save our normalized matrix just in case
with open(filename, 'wb') as fileObj:
    pickle.dump(matrix_normed, fileObj)

In [None]:
filename = "trained/lavoz_matrix.pickle"
with open(filename, 'rb') as f:
    matrix_normed = pickle.load(f)

## Word2Vec: it's not a vectorization, it's an embedding!
We will set aside our hand-made cooccurrence matrix now in favor of the more 'information-rich' **neural embeddings** of Word2Vec. 
This two-layer neural network provides us with much more than our count-based cooccurrence matrix. These neural embeddings have much lower dimensionality and carry a semantic representation, a goal shared by count-based techniques such as [Latent Semantic Analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis).
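
One common way to read "semantic representation" is that words used in similar contexts end up with vectors pointing in similar directions, which can be measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not output from the model trained in this notebook, whose vectors have 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", invented for this example.
toy_vectors = {
    "soja": [0.9, 0.1, 0.0],
    "maíz": [0.8, 0.2, 0.1],
    "lula": [0.0, 0.1, 0.9],
}

crops = cosine_similarity(toy_vectors["soja"], toy_vectors["maíz"])
mixed = cosine_similarity(toy_vectors["soja"], toy_vectors["lula"])
print(round(crops, 3), round(mixed, 3))
```

In this toy setup the two crop words score close to 1, while the crop and the politician score close to 0.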
    


In [22]:
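def gen_vectors(normalized_text):
    # Written against gensim 3.x. In gensim 4 the `size` argument is named
    # `vector_size`, and `model.wv.vocab` is replaced by `model.wv.key_to_index`.
    print("\ninitialized")
    model = Word2Vec(normalized_text,
                     window=4,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20)
    # collect one embedding per vocabulary word, in vocabulary order
    vects = []
    for word in model.wv.vocab:
        vects.append(model.wv[word])

    matrix = np.array(vects)
    print("Matrix shape:",matrix.shape)
    print("finished")
    return model.wv.vocab,matrix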



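**Word2Vec Parameters:**
* four-word window (window=4), to be consistent with the cooc above
* dimensionality (size=300): each word is embedded as a 300-dimensional vector
* downsampling of high-frequency words (sample=6e-5)
* negative sampling (negative=20): 20 "noise words" are drawn for each training example, the top of the 5-20 range suggested in the gensim documentation
* learning rate (alpha=0.03), dropping linearly to min_alpha=0.0007 as training progresses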
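
To give a feel for what `sample=6e-5` does, the sketch below computes the chance that a single occurrence of a word is kept during training, using the subsampling rule from the original word2vec C code. That gensim applies exactly the same rule is an assumption here, and `keep_probability` is a hypothetical helper, not part of gensim.

```python
import math

def keep_probability(word_fraction, sample=6e-5):
    """Chance of keeping one occurrence of a word that makes up
    `word_fraction` of all tokens (rule from the original word2vec C code)."""
    ratio = sample / word_fraction
    return min(1.0, (math.sqrt(1.0 / ratio) + 1.0) * ratio)

# A word that is 1% of the corpus is dropped most of the time...
print(round(keep_probability(0.01), 4))
# ...while a rare word (0.001% of the corpus) is always kept.
print(keep_probability(0.00001))
```

The smaller the `sample` threshold, the more aggressively very frequent words are thinned out, which leaves more of the training signal to rarer, more informative words.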

In [23]:
vocabulary, vectors = gen_vectors(the_news)


initialized
Matrix shape: (7323, 300)
finished


# Clusters!
* The embeddings are L2-normalized (each vector scaled to unit length) before clustering.
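
One reason to normalize first: for unit-length vectors, squared Euclidean distance (what k-means minimizes) is a simple function of cosine similarity, namely `2 - 2 * cos`. The sketch below checks that identity on two toy vectors; the helper functions are hypothetical and written in plain Python, whereas the notebook uses `preprocessing.normalize`, which applies the L2 norm to each row by default.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy vectors, for illustration only.
u = l2_normalize([3.0, 4.0])
v = l2_normalize([1.0, 0.0])

lhs = squared_distance(u, v)     # squared Euclidean distance between unit vectors
rhs = 2 - 2 * dot(u, v)          # 2 - 2 * cosine similarity
print(lhs, rhs)
```

After normalization, words are therefore grouped by the direction of their vectors rather than by their length.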

In [24]:
def gen_clusters(vectors):
    print("\nclustering started")
    vectors = preprocessing.normalize(vectors)
    km_model = KMeans(n_clusters=CLUSTERS_NUMBER)
    km_model.fit(vectors)
    print("clustering finished")
    return km_model

In [25]:
CLUSTERS_NUMBER=50
km_model = gen_clusters(vectors)


clustering started
clustering finished
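
For readers who want to see what happens inside the fit, here is a minimal sketch of the basic k-means loop (Lloyd's algorithm) on four toy points with hand-picked starting centroids. It is a plain-Python illustration, not what scikit-learn runs internally: `KMeans` adds smarter initialization (k-means++ by default) and convergence checks.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations: assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(old)   # leave an empty cluster's centroid in place
        centroids = new_centroids
    return labels, centroids

# Toy 2-D points and starting centroids, for illustration only.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels, centroids = kmeans(points, centroids=[[0.0, 0.0], [6.0, 5.0]])
print(labels)
print(centroids)
```

Because the result depends on where the centroids start, different runs with random initialization can produce different clusters.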


# Results
--------------------------------------------

In [26]:
def show_results(vocabulary,model):
    # Show results
    c = Counter(sorted(model.labels_))
    print("\nTotal clusters:",len(c))
    for cluster in c:
        print ("Cluster#",cluster," - Total words:",c[cluster])

    # Show top terms and words per cluster
    print("Top words per cluster:")
    print()

    keysVocab = list(vocabulary.keys())
    for n in range(len(c)):
        print("Cluster %d" % n)
        print("Words:", end='')
        word_indexs = [i for i,x in enumerate(list(model.labels_)) if x == n]
        for i in word_indexs:
            print(' %s' % keysVocab[i], end=',')
        print()
        print()

    print()

In [29]:
show_results(vocabulary,km_model)


Total clusters: 50
Cluster# 0  - Total words: 11
Cluster# 1  - Total words: 325
Cluster# 2  - Total words: 1
Cluster# 3  - Total words: 422
Cluster# 4  - Total words: 1
Cluster# 5  - Total words: 59
Cluster# 6  - Total words: 17
Cluster# 7  - Total words: 48
Cluster# 8  - Total words: 368
Cluster# 9  - Total words: 340
Cluster# 10  - Total words: 41
Cluster# 11  - Total words: 30
Cluster# 12  - Total words: 24
Cluster# 13  - Total words: 78
Cluster# 14  - Total words: 529
Cluster# 15  - Total words: 233
Cluster# 16  - Total words: 361
Cluster# 17  - Total words: 1
Cluster# 18  - Total words: 150
Cluster# 19  - Total words: 1
Cluster# 20  - Total words: 144
Cluster# 21  - Total words: 359
Cluster# 22  - Total words: 1
Cluster# 23  - Total words: 1
Cluster# 24  - Total words: 342
Cluster# 25  - Total words: 1
Cluster# 26  - Total words: 566
Cluster# 27  - Total words: 257
Cluster# 28  - Total words: 33
Cluster# 29  - Total words: 21
Cluster# 30  - Total words: 1
Cluster# 31  - Total wor

Words: encabezar, provincial, olga, nacional, radical, dirigente, radicalismo, legislador, definición, referente, vicepresidente, unión, frente, electoral, cívico, votar, k, senador, juez, candidatura, presidencial, elección, opositor, diputar, oscar, oficialista, pj, peronismo, peronista, kirchnerista, mestre, apoyar, respaldar, interno, eduardo, kirchnerismo, precandidato, blocar, ricardo, alfonsín, postulante, ucr, giacomino, electrónico, juárez, héctor, duhalde, boleta, coalición, comicio, ramón, vicegobernador, riutort, martí, accastello, alianza, cobos, avalle, vecinalista,

Cluster 6
Words: nadia, anonimato, tiananmen, coronar, retener, ceguera, descripción, luneta, carnaval, carrefour, dispositivo, intoxicación, strata, firmante, contagioso, estigma, ostropolsky,

Cluster 7
Words: fabio, vergonzoso, alfabetización, cuchillo, catolicismo, desencadenar, abstención, coraje, insecto, cavar, management, sindicar, tapiar, agujerar, ponce, sida, zabala, perderse, espectador, consejero

Words: secundario, ipem, belgrano, alejandro, carbó, tomar, jerónimo, luis, cabrero, colegio, edilicio, educación, estudiante, escuela, estudiantil, protestar, reclamo, docente, educativo, instituto, preuniversitario, garzón, agulla, tosco, asamblea, grahovac, colegiar, walter, alumno, buffoni,

Cluster 12
Words: conversar, ayacucho, francia, sexo, despacio, raza, américa, archivo, sinónimo, quieto, obstaculizar, económicamente, post, acampar, agostina, envidiar, empleador, mejoramiento, rayo, remero, socioeconómico, contrarrestar, negociable, alvarez,

Cluster 13
Words: disgustar, recurrir, bovino, cerdo, constancia, arbitral, impositivo, deliberante, seno, núcleo, canadiense, bienvenido, incuestionable, mc, indirecto, minimizar, difundir, concentrarse, desnudar, prófugo, balar, desventaja, cm3, cultivo, estético, postre, abastar, urbe, adopción, apodar, similitud, criollo, 3,8, concreción, paulista, proyectil, psicológico, lan, encubrimiento, prevalecer, experimental, ieral, electora

Words: casar, mujer, sorprender, madre, personar, edad, circunstancia, momento, ver, ayudar, parejo, tratar, orgullo, menos, tres, llegar, grupo, leer, familia, noche, adolescente, joven, atentar, sufrir, esgrimir, secreto, matar, indispensable, ocurrir, rápido, hombre, caso, excepcional, cómo, investigación, conti, todavía, allí, acompañar, visual, ubicar, rescatar, iglesia, ingresar, profesional, reclutar, tardar, simple, amarillo, informar, padre, discordia, talento, sospechar, tránsito, familiar, doméstico, avanzar, sexual, transmisor, honestidad, provocar, identificar, violencia, actuar, escena, hecho, niño, someter, constante, siete, march, incidir, colectivo, delito, ignorancia, domiciliar, despertar, permanencia, desear, equipar, mover, soldar, injusticia, sanar, cometer, gravar, fuente, avalar, abordar, rango, bárbaro, producir, roca, urgencia, corear, víctima, guardia, bloquear, detenido, condenar, dulce, grave, acosar, policial, juicio, brasilia, identificación, dificultad, 

Words: pensarlo, decir, hacer, tener, conocer, tiempo, llevar, haber, lograr, bien, querer, ser, siempre, pensar, gran, coser, ir, cambiar, difícil, vida, ninguno, problema, compartir, mediar, trabajar, objetivo, común, aspecto, r, día, ahora, durar, poder, postergar, interesar, económico, conflicto, poner, ¿la, importante, decidir, institución, reclamar, normativo, volver, advertir, saber, luchar, forzar, posibilidad, acta, dar, emplear, avión, contar, venir, completar, mejor, discusión, seguridad, conjuntar, calidad, proponer, justificar, especialmente, generar, ciudadano, consenso, flojo, llevarse, factor, facturar, aparecer, situación, soler, declarar, patrimonio, juntar, reconocimiento, comparar, década, conservar, extranjero, relevante, menor, seriar, explicar, considerar, mano, inclusión, subterráneo, impedir, estructurar, tierra, ejemplo, público, existir, legar, derecho, ente, preparar, capacitación, último, mayoría, hora, cuestionario, fin, permitir, gradar, discapacidad, pue

Words: gustar, después, separar, razonar, amor, ilusionar, manejar, miércoles, pedir, abrir, copérnico, martes, charlar, aguar, recordar, coincidir, públicamente, posiblemente, profesor, justar, amigo, cantar, presenciar, temer, conjunto, viernes, aprovechamiento, relativizar, roberto, figueroa, domingo, dante, prestar, vaivén, distanciar, nacimiento, propio, cuerpo, continuo, imitar, paro, multar, objeción, erosionar, gallego, justicialismo, enemigo, valoración, solá, transitar, correcto, despachar, asimismo, reelegir, pegar, corriente, tejer, rosado, lealtad, alejandra, vigo, invitar, agrupación, liderar, allegar, noviembre, fórmula, gatica, germán, confirmar, trascender, siguiente, pertenencia, marcelo, festejar, televisión, interpretar, quebrar, éxito, foto, soltar, temblar, tucumán, habitáculo, morar, consecutivo, extremar, moral, religioso, católico, tradición, red, gregorio, gavier, esbozar, jorge, jujuy, impacto, salvador, césar, ciccra, rattazzi, poderoso, giorgi, material, in

Words: asegurar, rodear, dudar, través, formar, firmar, principiar, radiar, relacionar, risa, distinguir, desarrollar, entidad, comprometer, respetar, sobrar, cuñado, asar, considerarse, vaticinio, machar, otorgar, posta, país, ste, sancionar, nativo, suspender, principio, sociedad, surgir, elaborar, presupuesto, dialogar, marchar, terciario, duro, república, caber, aclarar, aborigen, atento, alimentación, pactar, tampoco, señalar, descalificar, incluso, dificultar, conformar, dependiente, promover, concretar, alemania, europeo, obligar, justamente, composición, externo, esforzar, cuanto, rentar, requisito, específico, almacenero, adherir, comportamiento, explicitar, nominal, ong, empresario, enunciar, finalmente, recambiar, federalismo, alternativo, esperanzar, probablemente, transparentar, históricamente, descartar, descuento, gobernabilidad, ¿no, máximo, epec, francés, topar, autoridad, preparativo, tinelli, acceder, manifestación, reforzar, reacción, expresar, supervisión, centrode

Words: imaginar, ¿qué, enojar, vez, acostumbrar, descubrir, sentir, largar, feliz, hermoso, ¿por, lado, nunca, pasión, primero, empezar, luna, leyenda, construir, fábrica, besar, parar, ¿se, quedar, nervioso, aconsejar, generoso, reflejo, sugerir, irrumpir, error, interpretación, posicionar, junior, preguntar, doler, disfrutar, puerta, excusar, alimentar, bello, plato, siglo, frase, especie, unesco, lucir, visitante, histórico, recuperar, solar, multiplicar, actualidad, renovación, atender, operativo, minuto, diferenciar, feriar, homosexual, proporcionar, forraje, abril, ancho, incertidumbre, alambrar, picar, acelerar, maduro, sustentable, defensoría, retirar, españa, cultura, entrevistar, enviar, racionar, largo, ¿cuánto, influir, librar, editorial, autor, conclusión, crisis, historia, emprender, perfecto, dibujar, criar, capítulo, formal, creciente, multitud, palabra, particular, cabo, enseñanza, positivo, agravar, aun, atravesar, reglar, popular, stock, agenciar, lengua, corazón, co

# Analysis
After much trial and error with the hyperparameters, patterns began to emerge in the results. Each time I ran the k-means algorithm, clusters appeared for the following themes: 
* numbers and scientific words of measurement
* political names and words associated with politics in Argentina
* industrial and agricultural words

-------------------------------------------------------------------------
## Example Clusters from Above:

Cluster 11 - **Education**
Words: secundario, ipem, belgrano, alejandro, carbó, tomar, jerónimo, luis, cabrero, colegio, edilicio, educación, estudiante, escuela, estudiantil, protestar, reclamo, docente, educativo, instituto, preuniversitario, garzón, agulla, tosco, asamblea, grahovac, colegiar, walter, alumno, buffoni

Cluster 29 - **Agriculture and Industry**
Words: año, mil, subir, ciento, promediar, hectárea, pagar, exportación, tonelada, precio, millón, valor, cuota, dólar, mes, peso, incrementar, cosechar, soja, maíz, trigo

Cluster 31 - **Brazilian Politics**
Words: verde, marino, candidato, silva, dilma, rousseff, serra, pt, psdb, luiz, inácio, lula

Cluster 41 - **Argentine Politics**
Words: manuel, ex, daniel, presidente, josé, cristino, fernández, kirchner, intendente, néstor, gobernador, juan, sota, schiaretti


In [None]:
# save clusters to file if you like them
filename = "trained/lavoz_clusters2.pickle"
with open(filename, 'wb') as fileObj:
    pickle.dump(km_model, fileObj)

In [28]:
# load them back up
filename = "trained/lavoz_clusters2.pickle"
with open(filename, 'rb') as f:
    km_model = pickle.load(f)