Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 4.3: Keywords and Clustering

In this lab, we learn how to cluster documents. The code is partially adapted from [this notebook](https://www.kaggle.com/cherishzhang/clustering-on-papers). We compare different ways to represent the keywords in documents.

## 1. Tf-idf

Calculating tf-idf (term frequency-inverse document frequency) is a simple approach to extracting the keywords of an article: a term scores high in a document if it occurs often in that document but in few other documents. The class TfidfVectorizer from the *sklearn* module calculates the tf-idf scores for all terms in our documents.
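
To make the scores less of a black box, here is a minimal sketch that computes tf-idf by hand for a made-up toy corpus. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which, to our understanding, is what TfidfVectorizer does with its default settings. The corpus and the helper `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made up for illustration)
docs = [
    ["vegan", "diet", "health"],
    ["vegan", "diet", "climate"],
    ["vegan", "recipe"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized tf-idf scores for one tokenized document."""
    tf = Counter(doc)
    # Term frequency times smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

scores = tfidf(docs[0])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {score:.3f}")
```

In the first document, "health" receives the highest score because it occurs in no other document, while "vegan" occurs everywhere and is ranked last.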

In [None]:
import pandas as pd
import stanza
import string

# This is very simplistic pre-processing. You might want to modify it
def preprocess(article):
    processed_article = nlp.process(article)
    all_lemmas = []
    for s in processed_article.sentences: 
        if len(s.text.strip())>0:
            lemmas = [word.lemma.lower() for word in s.words if word.lemma is not None]
            stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
            clean_lemmas = [lemma for lemma in lemmas if lemma not in stopwords and lemma not in string.punctuation]
            all_lemmas.extend(clean_lemmas)
    return all_lemmas

# Read in TSV
tsv_file = "../data/veganism_overview_en.tsv"
news_content = pd.read_csv(tsv_file, sep="\t", keep_default_na=False, header=0)
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')

# We filter out empty articles
news_content = news_content[news_content["Text"].str.len() >0 ]
articles = news_content["Text"]


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# You can play around with the ngram range, e.g. ngram_range=(1, 2) also includes bigrams
vectorizer = TfidfVectorizer(use_idf=True, tokenizer=preprocess, ngram_range=(1, 1))
tf_idf = vectorizer.fit_transform(articles)
all_terms = vectorizer.get_feature_names_out()


The terms are ordered alphabetically. **Do some spot checks and come up with ideas for better pre-processing of the articles.**

In [None]:
# Look at the first 50 terms (change the slice to inspect other parts of the vocabulary)
print(all_terms[0:50])

# Select a document
i = 3
print(tf_idf[i])

In [None]:
# Look up the terms behind some column indices of the tf-idf matrix
# (adjust the indices to the ones that show up in your own output)
print(all_terms[3891])
print(all_terms[2914])
print(all_terms[912])
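
One pre-processing idea that spot checks often suggest is to filter out tokens that are unlikely to be keywords. The sketch below is only an illustration: the helper `is_informative` and its `min_length` threshold are our own hypothetical choices, not part of the lab code. Note that a check like `lemma in string.punctuation` only catches single punctuation characters, so multi-character tokens such as "..." slip through.

```python
import string

def is_informative(lemma, min_length=3):
    """Hypothetical filter: keep only lemmas that look like content words."""
    if len(lemma) < min_length:
        return False  # drops very short tokens such as "s" or "--"
    if any(ch.isdigit() for ch in lemma):
        return False  # drops numbers and tokens such as "2019" or "10%"
    if all(ch in string.punctuation for ch in lemma):
        return False  # drops multi-character punctuation such as "..."
    return True

sample = ["vegan", "2019", "...", "--", "s", "diet", "10%", "plant-based"]
kept = [lemma for lemma in sample if is_informative(lemma)]
print(kept)
```

A filter like this could be applied to `clean_lemmas` inside `preprocess`. Whether it helps depends on your data, so check the vocabulary again afterwards.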

## 2. Clustering

In clustering, we try to infer groups of similar documents. Here, we use the k-means algorithm of the *sklearn* module and the tf-idf vectors as document representation. The number of clusters is an experimental parameter. **Analyze the clusters you obtain. Do they correspond to useful conceptual groups? What happens if you vary the number of clusters?**

Instead of clustering documents, you could also cluster sentences from multiple documents. This could result in argumentative clusters.
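
To get an intuition for what k-means does, here is a minimal pure-Python sketch of its two alternating steps on made-up 2D points. It is a simplification: the *sklearn* implementation adds, among other things, a smarter initialization and a convergence check, while this sketch uses fixed starting centroids and a fixed number of rounds.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its points. Repeats for n_iter rounds."""
    centroids = [list(c) for c in centroids]  # do not modify the caller's list
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Made-up points that form two obvious groups
points = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

With tf-idf vectors, the points have thousands of dimensions instead of two, but the procedure is the same.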

In [None]:
# How many clusters do you expect? 
from sklearn.cluster import KMeans
num_clusters = 4
km = KMeans(n_clusters=num_clusters)
km.fit(tf_idf)


In [None]:
# Output the clusters
clusters = km.labels_.tolist()
clustered_articles = {
    'Title': news_content["Title"],
    'Author': news_content["Author"],
    'Publisher': news_content["Publisher"],
    'Cluster': clusters,
}
overview = pd.DataFrame(clustered_articles, columns=['Author', 'Title', 'Publisher', 'Cluster'])
overview

## 3. Represent a document by keywords

Instead of representing a document by all of its words, we could focus on the most relevant words. In this example, we extract the words with the highest tf-idf as keywords. **Do you think these are representative keywords? What could be improved?**

In [None]:
import numpy as np
# We extract the keywords
num_keywords = 10

def get_top_tfidf_features(row, terms, top_n=25):
    top_ids = np.argsort(row)[::-1][:top_n]
    top_features = [terms[i] for i in top_ids]
    return top_features, top_ids

keywords = []
keyword_ids = []
for i in range(0, tf_idf.shape[0]):
    row = np.squeeze(tf_idf[i].toarray())
    top_terms, top_ids= get_top_tfidf_features(row, all_terms, top_n=num_keywords)
    keywords.append(top_terms)
    keyword_ids.append(top_ids)
# Show a few keywords
for x in range(8):
    print("Keywords for article " + str(x))
    print(keywords[x])
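
One possible improvement: sorting all scores and cutting off at `top_n` always returns `top_n` terms, so an article with fewer distinct terms is padded with terms whose score is zero. The sketch below shows a hypothetical variant, `top_nonzero_terms`, that skips such terms. It works on plain Python sequences. For a numpy row you could pass `row.tolist()`.

```python
def top_nonzero_terms(scores, terms, top_n=10):
    """Hypothetical variant of the keyword extraction: return up to top_n
    (term, score) pairs, skipping terms whose score is zero."""
    ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
    return [(term, score) for term, score in ranked[:top_n] if score > 0]

# Made-up scores for a very short article
toy_terms = ["animal", "diet", "meat", "vegan", "zebra"]
toy_scores = [0.0, 0.6, 0.0, 0.8, 0.0]
top = top_nonzero_terms(toy_scores, toy_terms, top_n=3)
print(top)
```

Keeping the scores next to the terms also makes it easier to judge how strong each keyword is.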



## 4. Represent keywords with vectors

We could now calculate the clusters directly on the keyword ids as document representation (it might be a good idea to try this out). This representation has two disadvantages: 1. the clustering algorithm takes the order of the keywords into account (e.g. the keyword "the" in position 2 is not treated as similar to "the" in position 4), and 2. the ids do not capture similarities between words.
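
Both disadvantages can be illustrated with a small sketch. The keyword ids below are made up. We compare the Euclidean distance between id vectors (the distance k-means relies on) with an order-independent set overlap (Jaccard).

```python
import math

# Made-up keyword ids for three short documents
doc_a = [17, 42, 305]   # e.g. "animal", "diet", "vegan"
doc_b = [42, 305, 17]   # same keywords as doc_a, in a different order
doc_c = [18, 43, 306]   # three unrelated keywords with neighbouring ids

def jaccard(x, y):
    """Overlap of two keyword sets, ignoring order (1.0 = identical sets)."""
    return len(set(x) & set(y)) / len(set(x) | set(y))

print("distance a-b:", math.dist(doc_a, doc_b))
print("distance a-c:", math.dist(doc_a, doc_c))
print("jaccard  a-b:", jaccard(doc_a, doc_b))
print("jaccard  a-c:", jaccard(doc_a, doc_c))
```

On the id vectors, `doc_a` ends up much closer to the unrelated `doc_c` than to `doc_b`, although `doc_a` and `doc_b` contain exactly the same keywords.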

We now represent each keyword with a vector from a pre-trained embedding model (trained on Wikipedia and news text) and then take the mean vector over all keywords of a document. Loading the model takes time. We will learn more about word vectors in the next lecture.
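
The following sketch illustrates why the mean vector is a more suitable representation. The 3-dimensional vectors are invented for this example (real fastText vectors have 300 dimensions), and `mean_vector` and `cosine` are small helper functions of our own.

```python
import math

# Made-up 3-dimensional "embeddings" for illustration only
toy_vectors = {
    "vegan":      [0.9, 0.1, 0.0],
    "vegetarian": [0.8, 0.2, 0.0],
    "diet":       [0.6, 0.4, 0.1],
    "football":   [0.0, 0.1, 0.9],
    "match":      [0.1, 0.0, 0.8],
}

def mean_vector(words):
    """Average the vectors of all words, dimension by dimension."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

doc_1 = mean_vector(["vegan", "diet"])
doc_2 = mean_vector(["diet", "vegetarian"])  # different words, different order
doc_3 = mean_vector(["football", "match"])

print("doc_1 vs doc_2:", round(cosine(doc_1, doc_2), 3))
print("doc_1 vs doc_3:", round(cosine(doc_1, doc_3), 3))
```

The order of the keywords no longer matters, and documents with related but different keywords end up close to each other.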

In [None]:
from gensim.models import KeyedVectors
print("loading")
fasttext_model = KeyedVectors.load_word2vec_format("../data/wiki-news-300d-1M.vec")
print("done loading")


In [None]:
all_doc_representations = []

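for doc_keywords in keywords:
    doc_representation = []
    for keyword in doc_keywords:
        try:
            word_representation = fasttext_model[keyword]
            doc_representation.append(word_representation)
        except KeyError as e:
            # We simply ignore unknown words
            print(e)

    if doc_representation:
        # Take the mean over the keywords
        mean_keywords = np.mean(doc_representation, axis=0)
    else:
        # None of the keywords is in the model: fall back to a zero vector
        # so that every document still gets a vector of the same size
        mean_keywords = np.zeros(fasttext_model.vector_size)
    all_doc_representations.append(mean_keywords)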


In [None]:
# Now, let's cluster on the mean keyword vector
from sklearn.cluster import KMeans
num_clusters = 4
km = KMeans(n_clusters=num_clusters)
km.fit(all_doc_representations)
# Output the clusters
clusters = km.labels_.tolist()
clustered_articles = {
    'Title': news_content["Title"],
    'Author': news_content["Author"],
    'Publisher': news_content["Publisher"],
    'Cluster': clusters,
}
overview = pd.DataFrame(clustered_articles, columns=['Author', 'Title', 'Publisher', 'Cluster'])
overview

## 5. Word clouds

Word clouds are a way to visualize keywords. They have lost popularity recently, but they can still be useful for exploration when you want to investigate the quality of your clusters. The size of a word in the word cloud reflects its frequency.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def wordcloud_cluster_byIds(clusterId, clusters, keywords):
    words = []
    for i in range(0, len(clusters)):
        if clusters[i] == clusterId:
            for word in keywords[i]:
                words.append(word)
    print(words)
    # Generate a word cloud based on the frequency of the terms in the cluster
    wordcloud = WordCloud(max_font_size=40, relative_scaling=.8).generate(' '.join(words))
   
    plt.figure()
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.savefig(str(clusterId)+".png")

In [None]:
wordcloud_cluster_byIds(3, clusters, keywords)
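
Since the word size reflects frequency, it can help to look at the counts behind a word cloud directly. The sketch below uses made-up toy data, and `keyword_frequencies` is a hypothetical helper of our own.

```python
from collections import Counter

def keyword_frequencies(cluster_id, clusters, keywords):
    """Count how often each keyword occurs across the articles of one cluster."""
    counts = Counter()
    for label, doc_keywords in zip(clusters, keywords):
        if label == cluster_id:
            counts.update(doc_keywords)
    return counts

# Toy data (made up): three articles, two clusters
toy_clusters = [0, 1, 0]
toy_keywords = [["vegan", "diet"], ["football"], ["vegan", "climate"]]
freqs = keyword_frequencies(0, toy_clusters, toy_keywords)
print(freqs.most_common(3))
```

Such a counter can also be passed to `WordCloud(...).generate_from_frequencies(freqs)`. This avoids the extra tokenization that `generate` applies to the joined string, which may split or merge keywords.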

## 6. Clustering by style
Instead of clustering documents based on their content, you could also cluster documents based on the stylistic features you extracted in lab 3. **Try it out!**
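
One thing to keep in mind: stylistic features often live on very different scales, and k-means relies on distances, so a feature with large values can dominate the clustering. A common remedy is to standardize each feature first. The sketch below assumes two hypothetical features per article (average sentence length and type-token ratio) with made-up values. Your features from lab 3 may differ.

```python
import statistics

# Hypothetical stylistic features per article (made-up values):
# (average sentence length in tokens, type-token ratio)
features = [
    (12.0, 0.62),
    (31.5, 0.48),
    (18.2, 0.55),
    (25.0, 0.51),
]

def standardize(rows):
    """Z-score each column so that every feature has mean 0 and std 1.
    Assumes that no column is constant (its std would be 0)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

scaled = standardize(features)
for row in scaled:
    print([round(value, 2) for value in row])
```

The resulting list of lists can be clustered in the same way as before, e.g. with `KMeans(n_clusters=num_clusters).fit(scaled)`.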


