## Practical Assignment 1 - Clustering

DiploDatos 2018 - Unsupervised Learning

Mario Ferreyra - Emiliano Kokic

### Data: Dataset of references (URLs) to news web pages
https://archive.ics.uci.edu/ml/datasets/News+Aggregator

Problem description: among its attributes, the dataset includes the title and the category of each news item. The categories are:
- Entertainment
- Science and Technology
- Business
- Health

The idea is to apply clustering techniques to the news titles and see whether, based on their semantics, they group according to these categories.

In [36]:
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import spacy
from spacy.lang.en import English
from gensim.models import Word2Vec, Doc2Vec
from nltk.cluster import KMeansClusterer

#### Loading the dataset

In [2]:
data = pd.read_csv('./NewsAggregatorDataset/newsCorpora.csv', header=None, delimiter='\t')
data.columns = ['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP']
# Column reference:
# ID          Numeric ID
# TITLE       News title
# URL         URL
# PUBLISHER   Publisher name
# CATEGORY    News category (b = business, t = science and technology, e = entertainment, m = health)
# STORY       Alphanumeric ID of the cluster that includes news about the same story
# HOSTNAME    URL hostname
# TIMESTAMP   Approximate time the news was published, as the number of milliseconds since the epoch

In [3]:
data.shape

(422419, 8)

In [4]:
# Keep only the two columns of interest
data = data[['TITLE', 'CATEGORY']]
data.head()

,TITLE,CATEGORY
0,"Fed official says weak data caused by weather,...",b
1,Fed's Charles Plosser sees high bar for change...,b
2,US open: Stocks fall after Fed official hints ...,b
3,"Fed risks falling 'behind the curve', Charles ...",b
4,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,b


In [5]:
data.groupby('TITLE').count().sort_values('CATEGORY',ascending=False).head()

TITLE,CATEGORY
The article requested cannot be found! Please refresh your browser or go back ...,145
Business Highlights,59
Posted by Parvez Jabri,59
Posted by Imaduddin,53
Posted by Shoaib-ur-Rehman Siddiqui,52


We remove the rows that have no real title (only the "article not found" placeholder shown above), since they carry no information and would add noise when clustering.

In [6]:
no_title = data.groupby('TITLE').count().sort_values('CATEGORY',ascending=False).head().index.values[0]
print(no_title)
data = data[data.TITLE != no_title]

The article requested cannot be found! Please refresh your browser or go back  ...


In [7]:
data.groupby('CATEGORY').count()

CATEGORY,TITLE
b,115965
e,152339
m,45637
t,108333


We keep 2000 titles from each category (otherwise the computation becomes too expensive).

In [8]:
# Take the first 2000 rows of each category and stack them with a fresh index
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
new_data = pd.concat(
    [data[data.CATEGORY == cat][0:2000] for cat in ['b', 'e', 'm', 't']],
    ignore_index=True)
new_data.groupby('CATEGORY').count()

CATEGORY,TITLE
b,2000
e,2000
m,2000
t,2000


The idea, then, is to apply clustering to the news titles and see whether they group by category according to their semantics. Since there are four categories, we will look for 4 clusters.

### Text analysis | tokenizing

To analyze the text we need to study word frequencies, which first requires splitting the text into syntactic units, or *tokens*.

In [9]:
def tokenize_only(text):
    # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [10]:
totalvocab_tokenized = []

for index, row in new_data.iterrows():
    allwords_tokenized = tokenize_only(row['TITLE'])
    totalvocab_tokenized.extend(allwords_tokenized)

In [11]:
print('There are ' + str(len(totalvocab_tokenized)) + ' tokens in total \n')
print(totalvocab_tokenized[0:50])

There are 71705 tokens in total 

['fed', 'official', 'says', 'weak', 'data', 'caused', 'by', 'weather', 'should', 'not', 'slow', 'taper', 'fed', "'s", 'charles', 'plosser', 'sees', 'high', 'bar', 'for', 'change', 'in', 'pace', 'of', 'tapering', 'us', 'open', 'stocks', 'fall', 'after', 'fed', 'official', 'hints', 'at', 'accelerated', 'tapering', 'fed', 'risks', 'falling', "'behind", 'the', 'curve', 'charles', 'plosser', 'says', 'fed', "'s", 'plosser', 'nasty', 'weather']
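
Since the stated goal is to study word frequencies, a quick count of the most common tokens is a natural next check. The sketch below is only an illustration: `sample_tokens` is a hypothetical stand-in holding the tokens of the first two titles shown above, and `stopwords` is a tiny set chosen for the example rather than the list scikit-learn uses. The same `Counter` call should work on `totalvocab_tokenized`.

```python
from collections import Counter

# Hypothetical stand-in for totalvocab_tokenized: tokens of the first two titles.
sample_tokens = ['fed', 'official', 'says', 'weak', 'data', 'caused', 'by',
                 'weather', 'should', 'not', 'slow', 'taper',
                 'fed', "'s", 'charles', 'plosser', 'sees', 'high', 'bar',
                 'for', 'change', 'in', 'pace', 'of', 'tapering']

# Illustrative stopword set; an assumption made for this example only.
stopwords = {'by', 'should', 'not', 'for', 'in', 'of', "'s"}

# Count how often each non-stopword token occurs.
freq = Counter(tok for tok in sample_tokens if tok not in stopwords)

# most_common(n) returns the n most frequent (token, count) pairs.
print(freq.most_common(3))
```

Tokens that dominate this count across all categories are poor candidates for separating the clusters, which is one reason stopwords are filtered before vectorizing.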


In [12]:
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize_only, 
                                   ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(new_data['TITLE'])
print(tfidf_matrix.shape)

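# get_feature_names() was removed in scikit-learn 1.2; older versions (< 1.0) only have the old name
terms = tfidf_vectorizer.get_feature_names_out()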

(8000, 62858)
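
The matrix above has one row per title and one column per unigram, bigram or trigram. To make the weights themselves less opaque, here is a minimal sketch of how a TF-IDF weight can be computed by hand. It assumes the smoothed IDF and L2 normalisation that scikit-learn applies by default. The corpus `docs` and the helper `tfidf` are hypothetical names introduced for this example, and only unigrams are considered.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized titles (hypothetical data for illustration).
docs = [['fed', 'official', 'says', 'taper'],
        ['fed', 'plosser', 'taper'],
        ['xbox', 'titanfall', 'release']]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return a dict term -> L2-normalised TF-IDF weight for one document."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
# Terms shared with another title ('fed', 'taper') end up with lower weights
# than terms that appear only in this title ('official', 'says').
print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

With short titles and n-grams up to length 3, most columns are non-zero in very few rows, so the real matrix is extremely sparse.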


### Finding clusters | KMeans

In [13]:
num_clusters = 4

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

# print (clusters)

# Count the number of elements in each cluster
for i in range(num_clusters):
    print('Cluster %i has %i elements' % (i, clusters.count(i)))

Cluster 0 has 694 elements
Cluster 1 has 6890 elements
Cluster 2 has 158 elements
Cluster 3 has 258 elements


In [14]:
news = { 'title': new_data['TITLE'].tolist(), 'category': new_data['CATEGORY'].tolist(), 'cluster': clusters}
news = pd.DataFrame(news, index = [clusters] , columns = ['title', 'category'])
print(news.loc[0].groupby('category').count())
print(news.loc[1].groupby('category').count())
print(news.loc[2].groupby('category').count())
print(news.loc[3].groupby('category').count())

          title
category       
e           622
m            72
          title
category       
b          2000
e          1378
m          1512
t          2000
          title
category       
m           158
          title
category       
m           258


Note that one of the clusters holds almost all of the data (6890 of the 8000 titles), including every business and every science and technology title.
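
To put a number on how well the clusters match the categories, one option is cluster purity: assign each cluster its majority category and measure the fraction of titles that agree. The sketch below is not part of the original analysis. It copies the per-cluster counts from the output above into a plain dictionary (`counts`, a name introduced here) and computes each cluster's share of the data and the overall purity.

```python
# Category counts per cluster, copied from the output above.
counts = {
    0: {'e': 622, 'm': 72},
    1: {'b': 2000, 'e': 1378, 'm': 1512, 't': 2000},
    2: {'m': 158},
    3: {'m': 258},
}

total = sum(sum(c.values()) for c in counts.values())

# Share of all titles that falls in each cluster.
share = {k: sum(c.values()) / total for k, c in counts.items()}

# Purity: assign each cluster its majority category and count the hits.
purity = sum(max(c.values()) for c in counts.values()) / total

print(share)
print(round(purity, 3))
```

The purity comes out at about 0.38. For reference, putting all 8000 titles in a single cluster would give 0.25 with four balanced categories, so this clustering recovers little of the category structure.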

### Let's try the spaCy tokenizer

In [15]:
# NOTE: token.is_stop is a boolean that tells whether the token is a stopword.
# This cell targets spaCy 2.x; in spaCy 3.x, spacy.load('en') and
# Defaults.create_tokenizer are no longer available, and English().tokenizer is the equivalent.
nlp = spacy.load('en')
tokenizer = English().Defaults.create_tokenizer(nlp)
tokens = []
for doc in tokenizer.pipe(new_data['TITLE']):
    tokens.append([token.text for token in doc if re.search('[a-zA-Z]', token.text)
                   and not token.is_stop])
# tokens is a list of lists; each inner list holds the tokens of one title

In [16]:
tokens[0]

['Fed',
 'official',
 'says',
 'weak',
 'data',
 'caused',
 'weather',
 'slow',
 'taper']

In [17]:
len(tokens)

8000

### Let's try gensim to generate word embeddings

In [38]:
sentences = tokens
model = Word2Vec(sentences, min_count=1)
len(sentences)

8000

In [22]:
word_vectors = model.wv

In [23]:
len(word_vectors.vectors)

10006

### Let's use nltk's KMeans

In [24]:
NUM_CLUSTERS = 4
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance)
assigned_clusters = kclusterer.cluster(word_vectors.vectors, assign_clusters=True)

In [25]:
for i in range(NUM_CLUSTERS):
    print('Cluster %i has %i elements' % (i, assigned_clusters.count(i)))
len(assigned_clusters)

Cluster 0 has 7406 elements
Cluster 1 has 75 elements
Cluster 2 has 1336 elements
Cluster 3 has 1189 elements


10006

In [49]:
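# Reference usage with tokenized sentences:
# sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
# sentence_president = 'The president greets the press in Chicago'.lower().split()
# similarity = word_vectors.wmdistance(sentence_obama, sentence_president)
# print("{:.4f}".format(similarity))
print(new_data['TITLE'][0])
print(new_data['TITLE'][1])
# NOTE: wmdistance expects lists of tokens. Raw strings are passed here, so they are
# iterated character by character and the value printed below is not a word-level
# distance. Passing tokens[0] and tokens[1] would compare the tokenized titles.
similarity = word_vectors.wmdistance(new_data['TITLE'][0], new_data['TITLE'][1])
print("{:.4f}".format(similarity))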

Fed official says weak data caused by weather, should not slow taper
Fed's Charles Plosser sees high bar for change in pace of tapering
0.0644
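
The call above passes the raw title strings, whereas `wmdistance` treats each document as a sequence of tokens. The sketch below does not call gensim. It uses a hypothetical helper, `in_vocab`, and a small made-up vocabulary to show what a token-based function sees when it is handed a string instead of a token list.

```python
title = 'Fed official says weak data caused by weather, should not slow taper'

# Tokens of the same title, copied from the tokens[0] output above.
title_tokens = ['Fed', 'official', 'says', 'weak', 'data', 'caused',
                'weather', 'slow', 'taper']

# Hypothetical stand-in for the model vocabulary (word_vectors in the notebook).
vocab = {'Fed', 'official', 'says', 'weak', 'data', 'weather', 'taper', 'a', 'I'}

def in_vocab(document):
    """Keep the items of `document` that are in the vocabulary."""
    return [tok for tok in document if tok in vocab]

# Iterating over a raw string yields characters, so only single-character
# vocabulary entries can match.
from_string = in_vocab(title)

# Iterating over a token list yields whole words.
from_tokens = in_vocab(title_tokens)

print(from_string)
print(from_tokens)
```

With the string input only the letter `a` survives the vocabulary filter, so any distance computed from it says nothing about the words in the title.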


In [35]:
word_vectors.index2word

["'s",
 'The',
 'Titanfall',
 'To',
 'Bachelor',
 'US',
 'New',
 'Xbox',
 'SXSW',
 'One',
 'In',
 'Juan',
 'Pablo',
 'Snowden',
 'For',
 'A',
 'Google',
 'Bieber',
 'Is',
 'Alzheimer',
 'Of',
 'Justin',
 'test',
 'China',
 'With',
 'Lena',
 'Dunham',
 'GM',
 'cancer',
 'On',
 'True',
 'iOS',
 'says',
 'new',
 'Apple',
 'Detective',
 'Cancer',
 'And',
 'Season',
 'Game',
 'Study',
 "n't",
 'Lindsay',
 'Blood',
 'Cosmos',
 'Miley',
 'Neil',
 'Colorado',
 'Gox',
 'Cyrus',
 'FDA',
 'Will',
 'Thrones',
 'Edward',
 'Keibler',
 'Stacy',
 'Finale',
 'Lohan',
 'Live',
 'After',
 'Chiquita',
 'Mt.',
 'Young',
 'recall',
 'Carney',
 'Bank',
 'Microsoft',
 'heart',
 'May',
 'You',
 'Stocks',
 'More',
 'What',
 'Bitcoin',
 'risk',
 'NSA',
 'It',
 'Selena',
 'Drug',
 'Android',
 'Gomez',
 'TV',
 'study',
 'data',
 'sales',
 'George',
 'Colon',
 'Flappy',
 'Company',
 'Twitter',
 'Mobile',
 'Up',
 'CEO',
 'American',
 'High',
 'Fyffes',
 'Galavis',
 'Ukraine',
 'bankruptcy',
 'Be',
 'files',
 'I',
 ...

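### Problem!

I have the word embeddings and can group them into clusters, but what I actually want to cluster are the titles (whole sentences). The clusterer above returned 10006 labels, one per vocabulary word, not 8000, one per title, which is why the cell below fails.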

In [26]:
news = { 'title': new_data['TITLE'].tolist(), 'category': new_data['CATEGORY'].tolist(),
        'cluster': assigned_clusters}
news = pd.DataFrame(news, index = [assigned_clusters] , columns = ['title', 'category'])
print(news.loc[0].groupby('category').count())
print(news.loc[1].groupby('category').count())
print(news.loc[2].groupby('category').count())
print(news.loc[3].groupby('category').count())

ValueError: Shape of passed values is (2, 8000), indices imply (2, 10006)
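
One common way around this mismatch is to build one vector per title by averaging the vectors of its tokens, and then cluster those title vectors instead of the word vectors. The sketch below is an assumption about how this could be done, not something the notebook implements. `toy_vectors`, `toy_titles` and `title_vector` are hypothetical names, and plain lists stand in for the gensim vectors to keep the example dependency-free.

```python
# Hypothetical 3-dimensional word vectors standing in for model.wv.
toy_vectors = {
    'fed':    [1.0, 0.0, 0.0],
    'taper':  [0.8, 0.2, 0.0],
    'xbox':   [0.0, 1.0, 0.0],
    'cancer': [0.0, 0.0, 1.0],
}
dim = 3

def title_vector(title_tokens, vectors, dim):
    """Average the vectors of the known tokens; all zeros if none is known."""
    known = [vectors[t] for t in title_tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across the known tokens.
    return [sum(component) / len(known) for component in zip(*known)]

toy_titles = [['fed', 'taper'], ['xbox', 'launch'], ['cancer'], ['unknown']]

# One row per title, so the matrix lines up with the list of titles.
title_matrix = [title_vector(t, toy_vectors, dim) for t in toy_titles]

print(len(title_matrix), len(title_matrix[0]))
print(title_matrix[0])
```

Applied to the notebook's `tokens`, the analogous matrix would have 8000 rows, so the resulting cluster labels would line up with `new_data`. Titles that end up as all-zero vectors would need to be dropped before clustering with cosine distance, which is undefined for the zero vector.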