## Trabajo Práctico 1 - Clustering

DiploDatos 2018 - Aprendizaje No Supervisado

Mario Ferreyra - Emiliano Kokic

### Datos: Dataset of references (urls) to news web pages
https://archive.ics.uci.edu/ml/datasets/News+Aggregator

Descripción del problema: el dataset contiene dentro de sus atributos título y categoria de noticia:
- Entretenimiento
- Ciencia y Tecnología
- Negocios
- Salud

La idea del problema sería intentar aplicar técnicas de clustering sobre los títulos de cada noticia y ver si de acuerdo a su semántica se agrupan según las distintas categorias.

In [1]:
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import spacy
from spacy.lang.en import English
from gensim.models import Word2Vec, Doc2Vec, KeyedVectors
from nltk.cluster import KMeansClusterer
import numpy as np
import time

In [2]:
np.random.seed(0)  # For reproducibility

#### Cargamos el dataset

In [3]:
data = pd.read_csv('./NewsAggregatorDataset/newsCorpora.csv', header=None, delimiter='\t')
data.columns = ['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP']
'''
ID          Numeric ID
TITLE       News title 
URL         Url
PUBLISHER   Publisher name
CATEGORY    News category (b = business, t = science and technology, e = entertainment, m = health)
STORY       Alphanumeric ID of the cluster that includes news about the same story
HOSTNAME    Url hostname
TIMESTAMP   Approximate time the news was published, as the number of milliseconds since the epoch
'''

'\nID          Numeric ID\nTITLE       News title \nURL         Url\nPUBLISHER   Publisher name\nCATEGORY    News category (b = business, t = science and technology, e = entertainment, m = health)\nSTORY       Alphanumeric ID of the cluster that includes news about the same story\nHOSTNAME    Url hostname\nTIMESTAMP   Approximate time the news was published, as the number of milliseconds since the epoch\n'

In [4]:
data.shape

(422419, 8)

In [5]:
# Nos quedamos solo con las dos columnas de interés
data = data[['TITLE', 'CATEGORY']]
data.head()

Unnamed: 0,TITLE,CATEGORY
0,"Fed official says weak data caused by weather,...",b
1,Fed's Charles Plosser sees high bar for change...,b
2,US open: Stocks fall after Fed official hints ...,b
3,"Fed risks falling 'behind the curve', Charles ...",b
4,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,b


In [6]:
grouped = data.groupby('TITLE').count().sort_values('CATEGORY',ascending=False)
grouped.head()

Unnamed: 0_level_0,CATEGORY
TITLE,Unnamed: 1_level_1
The article requested cannot be found! Please refresh your browser or go back ...,145
Business Highlights,59
Posted by Parvez Jabri,59
Posted by Imaduddin,53
Posted by Shoaib-ur-Rehman Siddiqui,52


Nos quedamos solo con los títulos que ocurren una sola vez.

In [7]:
unique_titles = grouped[grouped.CATEGORY == 1].index.values.tolist()
unique_titles[0:5]

['Oracle shares slide on weaker-than-expected Q3 results',
 'Only one kind of inflation matters',
 "Only one large bank fails Fed's stress test",
 "Original 'Star Wars' stars Harrison Ford, Mark Hamill and Carrie Fisher officially  ...",
 "One of America's First 'Bionic Eyes' Goes to a Former Weightlifter in Michigan"]

In [8]:
data = data[data.TITLE.isin(unique_titles)]

In [9]:
data.groupby('CATEGORY').count()

Unnamed: 0_level_0,TITLE
CATEGORY,Unnamed: 1_level_1
b,108143
e,142220
m,41908
t,101128


Nos quedamos con 10000 títulos de cada categoría (ya que sino se necesita demasiado cómputo)

In [10]:
new_data = data[data.CATEGORY == 'b'][0:10000]
new_data = new_data.append(data[data.CATEGORY == 'e'][0:10000], ignore_index=True)
new_data = new_data.append(data[data.CATEGORY == 'm'][0:10000], ignore_index=True)
new_data = new_data.append(data[data.CATEGORY == 't'][0:10000], ignore_index=True)
new_data.groupby('CATEGORY').count()

Unnamed: 0_level_0,TITLE
CATEGORY,Unnamed: 1_level_1
b,10000
e,10000
m,10000
t,10000


La idea es entonces aplicar clustering sobre los títulos de las noticias y ver si se agrupan por categoría de acuerdo a su semantica. Es decir, vamos a considerar 4 clusters.

### Análisis del texto | tokenizing

Para analizar el texto debemos estudiar la frecuencia de las palabras, es decir, separar el texto en unidades sintácticas o *tokens*.

In [11]:
def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]

    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)

    return filtered_tokens

In [12]:
totalvocab_tokenized = []

for index, row in new_data.iterrows():
    allwords_tokenized = tokenize_only(row['TITLE'])
    totalvocab_tokenized.extend(allwords_tokenized)

In [13]:
print('Hay en total ' + str(len(totalvocab_tokenized)) + ' tokens \n')
len(totalvocab_tokenized)
print (totalvocab_tokenized[0:50])

Hay en total 361184 tokens 

['fed', 'official', 'says', 'weak', 'data', 'caused', 'by', 'weather', 'should', 'not', 'slow', 'taper', 'fed', "'s", 'charles', 'plosser', 'sees', 'high', 'bar', 'for', 'change', 'in', 'pace', 'of', 'tapering', 'us', 'open', 'stocks', 'fall', 'after', 'fed', 'official', 'hints', 'at', 'accelerated', 'tapering', 'fed', 'risks', 'falling', "'behind", 'the', 'curve', 'charles', 'plosser', 'says', 'fed', "'s", 'plosser', 'nasty', 'weather']


In [14]:
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(
    stop_words='english',
    tokenizer=tokenize_only,
    ngram_range=(1, 3)
)
tfidf_matrix = tfidf_vectorizer.fit_transform(new_data['TITLE'])
print(tfidf_matrix.shape)
# terms = tfidf_vectorizer.get_feature_names()

(40000, 277162)


### Buscar clusters | Kmeans

In [15]:
num_clusters = 4

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()
# print (clusters)

# Recuento del número de elementos en cada cluster
for i in range(num_clusters):
    print ('El cluster %i tiene %i elementos' % (i, clusters.count(i)))

El cluster 0 tiene 1832 elementos
El cluster 1 tiene 5572 elementos
El cluster 2 tiene 32067 elementos
El cluster 3 tiene 529 elementos


In [17]:
news = {
    'title': new_data['TITLE'].tolist(),
    'category': new_data['CATEGORY'].tolist(),
    'cluster': clusters
}
news = pd.DataFrame(news, index = [clusters] , columns = ['title', 'category'])

print(news.loc[0].groupby('category').count())
print(news.loc[1].groupby('category').count())
print(news.loc[2].groupby('category').count())
print(news.loc[3].groupby('category').count())

          title
category       
e           855
m           772
t           205
          title
category       
b          1396
e          1989
m           986
t          1201
          title
category       
b          8604
e          6627
m          8242
t          8594
          title
category       
e           529


Notar que uno de los clusters concentra casi la totalidad de los datos, sumado al hecho de que las categorias no parecen agruparse en distintos clusters.

### Probemos utilizando el tokenizer de Spacy

In [19]:
# NOTA: token.is_stop nos devuelve un booleano que indica si el token es o no una stopword
# NOTA: en consola ejecutar la primera vez: python -m spacy download en
nlp = spacy.load('en')
tokenizer = English().Defaults.create_tokenizer(nlp)
tokens = []
for doc in tokenizer.pipe(new_data['TITLE']):
    tokens.append([token.text for token in doc if re.search('[a-zA-Z]', token.text)
                   and not token.is_stop])    
# tokens es una lista de listas donde cada lista contiene los tokens de cada titulo

In [20]:
tokens[0]

['Fed',
 'official',
 'says',
 'weak',
 'data',
 'caused',
 'weather',
 'slow',
 'taper']

In [21]:
len(tokens)

40000

### Probemos utilizando Gensim para generar word embeddings

In [23]:
# NOTA: la dimensionalidad de los vectores generados por Gensim es por defecto igual a 100.
# Esto se puede cambiar utilizando el parámetro 'size' del método Word2Vec
model = Word2Vec(tokens, min_count=1)

In [24]:
word_vectors = model.wv

In [25]:
len(word_vectors.vectors)

25262

### Para generar el vector asociado a una oración probemos sumando los vectores de las palabras que la componen.

In [26]:
sen_vector = []
for i in range(len(tokens)):
    sen_vector.append(sum([word_vectors.get_vector(token) for token in tokens[i]]))

new_data['sen_vector'] = sen_vector

In [27]:
new_data.shape

(40000, 3)

In [28]:
new_data['sen_vector'][0]

array([-6.9365424e-01,  1.7465535e+00, -5.6100558e-03, -7.4673064e-02,
       -1.0276629e+00, -1.5303760e+00, -4.2149308e-01, -1.1297854e+00,
        2.4284315e+00,  2.0443189e+00,  6.7305312e-02, -2.3876677e+00,
       -2.1749310e+00, -1.4219019e+00,  3.2185075e+00, -2.7297885e+00,
        3.5554972e+00,  1.3318231e+00, -5.6807715e-01, -7.2824962e-02,
        4.6024850e-01,  4.2863836e+00,  3.2643213e+00, -2.4070446e+00,
        2.4743659e+00, -1.5242141e+00,  3.9516438e-02, -2.9862692e+00,
        1.1444232e+00, -1.9588170e+00,  2.5828264e+00, -9.8179710e-01,
        5.8011436e+00, -2.1580663e+00, -2.2712402e+00,  2.5019841e+00,
       -4.1087584e+00, -3.4930363e+00,  2.0053573e+00, -5.8574090e+00,
        1.6151816e-01,  6.4584394e+00, -2.7966452e-01,  2.1548975e+00,
        8.5283279e-01, -2.3360028e+00, -3.7928710e+00, -4.5333138e+00,
        1.9424528e+00,  9.6115440e-01,  1.3376893e+00,  3.3069875e+00,
        2.4346318e+00, -1.8079658e+00, -8.3988035e-01,  4.5729880e+00,
      

### Utilicemos KMeans de NLTK

In [29]:
NUM_CLUSTERS = 4
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance)
assigned_clusters = kclusterer.cluster(new_data['sen_vector'].values, assign_clusters=True)

In [30]:
for i in range(NUM_CLUSTERS):
    print ('El cluster %i tiene %i elementos' % (i, assigned_clusters.count(i)))

len(assigned_clusters)

El cluster 0 tiene 15649 elementos
El cluster 1 tiene 1482 elementos
El cluster 2 tiene 14900 elementos
El cluster 3 tiene 7969 elementos


40000

In [31]:
news = {
    'title': new_data['TITLE'].tolist(),
    'category': new_data['CATEGORY'].tolist(),
    'cluster': assigned_clusters
}
news = pd.DataFrame(
    news,
    index=[assigned_clusters],
    columns = ['title', 'category']
)

print(news.loc[0].groupby('category').count())
print(news.loc[1].groupby('category').count())
print(news.loc[2].groupby('category').count())
print(news.loc[3].groupby('category').count())

          title
category       
b          3058
e          4454
m          3099
t          5038
          title
category       
b          1239
e           133
m            75
t            35
          title
category       
b          4392
e          2102
m          5377
t          3029
          title
category       
b          1311
e          3311
m          1449
t          1898


No se obtienen buenos resultados ya sea usando distancia Euclídea o distancia Coseno.

### Tomamos el promedio de los vectores de palabras que ocurren en una oración y eso lo utilizamos como el vector que representa esa oración

In [34]:
sen_vector = []
for i in range(len(tokens)):
    sen_vector.append(np.mean([word_vectors.get_vector(token) for token in tokens[i]], axis=0))

new_data['sen_vector'] = sen_vector

In [35]:
NUM_CLUSTERS = 4
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance)
assigned_clusters = kclusterer.cluster(new_data['sen_vector'].values, assign_clusters=True)

for i in range(NUM_CLUSTERS):
    print ('El cluster %i tiene %i elementos' % (i, assigned_clusters.count(i)))

len(assigned_clusters)

news = {
    'title': new_data['TITLE'].tolist(),
    'category': new_data['CATEGORY'].tolist(),
    'cluster': assigned_clusters
}
news = pd.DataFrame(
    news,
    index=[assigned_clusters],
    columns = ['title', 'category']
)

print(news.loc[0].groupby('category').count())
print(news.loc[1].groupby('category').count())
print(news.loc[2].groupby('category').count())
print(news.loc[3].groupby('category').count())

El cluster 0 tiene 1559 elementos
El cluster 1 tiene 18205 elementos
El cluster 2 tiene 6518 elementos
El cluster 3 tiene 13718 elementos
          title
category       
b          1270
e           134
m           108
t            47
          title
category       
b          5101
e          2880
m          5871
t          4353
          title
category       
b           661
e          3177
m          1022
t          1658
          title
category       
b          2968
e          3809
m          2999
t          3942


Nuevamente no obtenemos buenos resultados.

### Elementos cercanos al centroide de cada Cluster

In [36]:
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(
    stop_words='english',
    tokenizer=tokenize_only,
    ngram_range=(1, 3)
)

tfidf_matrix = tfidf_vectorizer.fit_transform(new_data['TITLE'])
print(tfidf_matrix.shape)

# terms = tfidf_vectorizer.get_feature_names()

num_clusters = 4

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

# print (clusters)

# Recuento del número de elementos en cada cluster
for i in range(num_clusters):
    print ('El cluster %i tiene %i elementos' % (i, clusters.count(i)))

(40000, 277162)
El cluster 0 tiene 33322 elementos
El cluster 1 tiene 529 elementos
El cluster 2 tiene 5761 elementos
El cluster 3 tiene 388 elementos


In [37]:
# Transform X to a cluster-distance space.
# In the new space, each dimension is the distance to the cluster centers.
# Note that even if X is sparse, the array returned by transform will typically be dense.

X_transformed = km.transform(tfidf_matrix)
print("Shape X_transformed = {}".format(X_transformed.shape))
display(X_transformed)

Shape X_transformed = (40000, 4)


array([[0.99944658, 1.04142837, 1.00314848, 1.04209393],
       [0.99990217, 1.04116972, 0.99875999, 1.04234858],
       [0.99966024, 1.04142866, 1.00331806, 1.04192938],
       ...,
       [1.00008886, 1.04152   , 1.00340228, 1.04158144],
       [0.99988267, 1.04171564, 1.00365004, 1.04113871],
       [0.99889601, 1.03892275, 1.00276408, 1.04104361]])

In [38]:
# function to get closest titles to centroid in cluster
def get_closest_titles(X_trans, clust, max_t=10):
    idx = np.argsort(X_trans[:, clust])[::-1][:50]
    #titles = data['TITLE'].iloc[idx].drop_duplicates()
    titles = data.iloc[idx].drop_duplicates()

    return titles[:max_t]

In [39]:
for n_clust in range(num_clusters):
    closest_titles_to_centroid = get_closest_titles(X_transformed, n_clust, max_t=5)
    print('\tCluster {}'.format(n_clust))
    print('\tTitles and Categories:')
    print(closest_titles_to_centroid)
    print('\n')

	Cluster 0
	Titles and Categories:
                                                   TITLE CATEGORY
16120  Two Weeks Remain for Healthcare Enrollment in ...        b
13121  [Weekend Poll] Are You Keeping Your Amazon Pri...        b
10920  American Idol: Jena Irene, C.J. Harris and Ale...        e
22791  Stones cancel Aussie tour following Scott's death        e
23785  Study Doubts Saturated Fat's Link to Heart Dis...        m


	Cluster 1
	Titles and Categories:
                                                   TITLE CATEGORY
23251  Danica McKellar Dances Foxtrot For Week 1 Of DWTS        e
10369  Happy birthday World Wide Web! Why the D.C. re...        t
27938              Oprah creates tea drink for Starbucks        e
34023  Half of breast cancer surgeries in UK not need...        m
30174  Mt. Gox suddenly finds 200000 missing bitcoins...        b


	Cluster 2
	Titles and Categories:
                                                   TITLE CATEGORY
21971  Dealing with compact car r

### Probemos utilizando el modelo word2vec pre-entrenado de Google

Utilizando KeyedVectors de Gensim para cargar el modelo tiene la desventaja de que no se puede seguir entrenando. Pero es más eficiente que utilizar gensim.models.Word2Vec
https://radimrehurek.com/gensim/models/keyedvectors.html#module-gensim.models.keyedvectors

In [40]:
start = time.time()
# model = KeyedVectors.load_word2vec_format('/users/ekokic/thesis/models/GoogleNews-vectors-negative300.bin', binary=True)
# model.save('/users/ekokic/thesis/models/word2vecGoogle.model')
model = KeyedVectors.load('/users/ekokic/thesis/models/word2vecGoogle.model')
end = time.time()
print('demora: {}'.format(end-start))

demora: 10.572213649749756


In [41]:
print('Cantidad de word embeddings: {}'.format(len(model.vectors)))

Cantidad de word embeddings: 3000000


In [42]:
print('Dimensionalidad de los vectores: {}'.format(model.vector_size))

Dimensionalidad de los vectores: 300


In [43]:
sen_vector = []
count = 0
for i in range(len(tokens)):
    try:
        sen_vector.append(sum([model.get_vector(token) for token in tokens[i]]))
    except Exception:
        sen_vector.append(np.nan)
        count += 1
print('Cantidad de palabras que no están en el vocabulario del modelo: {}'.format(count))        

new_data['sen_vector'] = sen_vector

Cantidad de palabras que no están en el vocabulario del modelo: 9920


In [44]:
new_data.head()

Unnamed: 0,TITLE,CATEGORY,sen_vector
0,"Fed official says weak data caused by weather,...",b,"[0.29174805, 0.81469727, -0.42895508, 0.797119..."
1,Fed's Charles Plosser sees high bar for change...,b,
2,US open: Stocks fall after Fed official hints ...,b,"[-0.42236328, 0.5252075, -0.58447266, 1.198486..."
3,"Fed risks falling 'behind the curve', Charles ...",b,"[0.033691406, 0.782959, -0.8574219, 1.7483215,..."
4,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,b,


Para simplificar el problema vamos a dropear aquellas oraciones que contienen palabras que no están en el vocabulario. 

Lo ideal sería entrenar el modelo pre-entrenado con este corpus así se actualiza el vocabulario.

In [45]:
new_data = new_data.dropna()

In [46]:
new_data.groupby('CATEGORY').count()

Unnamed: 0_level_0,TITLE,sen_vector
CATEGORY,Unnamed: 1_level_1,Unnamed: 2_level_1
b,7435,7435
e,7007,7007
m,8394,8394
t,7244,7244


In [47]:
NUM_CLUSTERS = 4
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance)
assigned_clusters = kclusterer.cluster(new_data['sen_vector'].values, assign_clusters=True)

for i in range(NUM_CLUSTERS):
    print ('El cluster %i tiene %i elementos' % (i, assigned_clusters.count(i)))

len(assigned_clusters)

news = {
    'title': new_data['TITLE'].tolist(),
    'category': new_data['CATEGORY'].tolist(),
    'cluster': assigned_clusters
}
news = pd.DataFrame(
    news,
    index=[assigned_clusters],
    columns = ['title', 'category']
)

print(news.loc[0].groupby('category').count())
print(news.loc[1].groupby('category').count())
print(news.loc[2].groupby('category').count())
print(news.loc[3].groupby('category').count())

El cluster 0 tiene 4211 elementos
El cluster 1 tiene 6438 elementos
El cluster 2 tiene 9512 elementos
El cluster 3 tiene 9919 elementos
          title
category       
b           724
e            80
m            84
t          3323
          title
category       
b           205
e          5006
m           570
t           657
          title
category       
b          2619
e          1731
m          3114
t          2048
          title
category       
b          3887
e           190
m          4626
t          1216


### Conclusiones

En nuestro caso particular aplicar clustering no obtuvo buenos resultados. Probablemente al utilizar solamente los títulos no se aporta suficiente información. Además, dichos títulos pueden contener palabras similares pero que correspondan a categorias de noticias distintas (ambigüedad).