# Practical Topic Finding for Short Texts

## 1. Introduction


In [1]:
%matplotlib inline

In [216]:
import numpy as np
import matplotlib.pyplot as plt


from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation
from sklearn.cluster import KMeans

## 2. Topic Finding Models

NMF comes in several variants (for example, with a KL-divergence or an L2 loss), each with its own parameters to set. Comparisons of these variants can be found elsewhere.

Putting theory aside, let's see how the models work within sklearn.

KMeans is not a traditional topic-finding model; however, it can be used to ...., given a document vector (e.g., tfidf, or word2vec if you have large data).

First, let's generate some texts for testing. Artificial texts give us a microscopic view of how the models behave.

In [288]:
def generate_clearcut_topics():
    ## for demonstration purposes, don't take it personally : )
    return np.repeat(["we love bergers", "we hate sandwiches"], [1000, 1000])

def generate_unbalanced_topics():
    return np.repeat(["we love bergers", "we hate sandwiches"], [10, 1000])


def generate_semantic_context_topics():
    return np.repeat(["we love bergers"
                      , "we hate bergers"
                      , "we love sandwiches"
                      , "we hate sandwiches"], 1000)

# def generate_noisy_topics():
#     return np.repeat(["you, me, he, and everybody love bergers"
#                       , "you, me, he, she, and everybody hate sandwiches"], 1000)

def generate_noisy_topics():
    def _random_typos(word, n):
        typo_index = np.random.randint(0, len(word), n)
        return [word[:i]+"X"+word[i+1:] for i in typo_index]
    t1 = ["we love %s" % w for w in _random_typos("bergers", 15)]
    t2 = ["we hate %s" % w for w in _random_typos("sandwiches", 15)]
    return np.r_[t1, t2]

sample_texts = {
    "clearcut topics": generate_clearcut_topics()
    , "unbalanced topics": generate_unbalanced_topics()
    , "semantic topics": generate_semantic_context_topics()
    , "noisy topics": generate_noisy_topics()
}

clearcut_topics = generate_clearcut_topics()
unbalanced_topics = generate_unbalanced_topics()

We will try the models on these different texts.

In [230]:
def find_topic(texts, topic_model, n_topics, vec_model="tf", thr=1e-2, **kwargs):
    """Return the topics found in texts as a printable string - for demonstration on simple data
    texts: array-like strings
    topic_model: {"nmf", "svd", "lda", "kmeans"} for LSA_NMF, LSA_SVD, LDA, KMeans
    n_topics: # of topics in texts
    vec_model: {"tf", "tfidf"} for term_freq, term_freq_inverse_doc_freq
    thr: threshold for finding keywords in a topic model
    """
    ## 1. vectorization
    vectorizer = CountVectorizer() if vec_model == "tf" else TfidfVectorizer()
    text_vec = vectorizer.fit_transform(texts)
    ## get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    words = np.array(vectorizer.get_feature_names_out())
    ## 2. topic finding
    topic_models = {"nmf": NMF, "svd": TruncatedSVD, "lda": LatentDirichletAllocation, "kmeans": KMeans}
    topicfinder = topic_models[topic_model](n_topics, **kwargs).fit(text_vec)
    ## compare strings with != ("is not" tests object identity, not equality)
    topic_dists = topicfinder.components_ if topic_model != "kmeans" else topicfinder.cluster_centers_
    #return topic_dists
    topic_dists /= topic_dists.max(axis = 1).reshape((-1, 1))
    ## 3. keywords for topics
    ## Unlike the other models, LSA_SVD generates both positive and negative values in the topic_word
    ## distribution, which makes it more ambiguous to choose keywords for topics. The signs of the
    ## weights are kept with the words for demonstration here
    def _topic_keywords(topic_dist):
        keywords_index = np.abs(topic_dist) >= thr
        keywords_prefix = np.where(np.sign(topic_dist) > 0, "", "^")[keywords_index]
        keywords = " | ".join(map(lambda x: "".join(x), zip(keywords_prefix, words[keywords_index])))
        return keywords
    
    topic_keywords = map(_topic_keywords, topic_dists)
    return "\n".join("Topic %i: %s" % (i, t) for i, t in enumerate(topic_keywords))

In [229]:

kmeans = KMeans(5, ).fit(np.random.randn(10, 3))
kmeans.cluster_centers_ / kmeans.cluster_centers_.max(axis = 1).reshape((-1, 1))

array([[ 0.16968111,  1.        , -0.42926189],
       [ 1.85104976,  1.        ,  1.55310426],
       [ 0.49873455, -1.60964749,  1.        ],
       [ 0.43113289,  0.388843  ,  1.        ],
       [ 0.14845003,  1.        ,  0.67450416]])

In [217]:
def cluster_topic(texts, n_topics, vec_model="tf", **kwargs):
    """Return a list of topics from texts by clustering
    texts: array-like strings
    n_topics: # of topics in texts
    vec_model: {"tf", "tfidf"} for term_freq, term_freq_inverse_doc_freq
    """
    ## 1. vectorization
    vectorizer = CountVectorizer() if vec_model == "tf" else TfidfVectorizer()
    text_vec = vectorizer.fit_transform(texts)
    ## get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    words = np.array(vectorizer.get_feature_names_out())
    ## 2. document clustering
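    kmeans = KMeans(n_clusters=n_topics, **kwargs).fit(text_vec)
    ## 3. find keywords, either by frequent words or cluster centers; we use the latter here
    ## (completion sketch, an assumption mirroring find_topic: keep the words whose
    ## max-normalized cluster-center weight is at least 1e-2)
    centers = kmeans.cluster_centers_ / kmeans.cluster_centers_.max(axis=1).reshape((-1, 1))
    topic_keywords = [" | ".join(words[center >= 1e-2]) for center in centers]
    return "\n".join("Topic %i: %s" % (i, t) for i, t in enumerate(topic_keywords))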

## SVD: start from the complete vector, flip some bits, then flip more bits, keeping the directions orthogonal
- easy to understand, but its results are hard to interpret: because there are both positive and negative values, a topic cannot be interpreted as a probability distribution
- just like PCA: it finds orthogonal directions that explain most of the variance in the texts
- could be useful for document clustering/classification because there are no redundant features?? All directions are orthogonal and have a bigger chance of being independent features??
- limit: only so many topics can be found ....
- co-occurring words do not necessarily have similar weights (both high or both low) in the same topic; they can actually have different signs
- it is seldom useful to use the learned topics individually, but they might be useful as a whole
- might be useful for finding minor topics in unbalanced texts if the group size is not too small

*** explain how it works ***

difference between tf and tfidf

ORTHOGONAL

In [203]:
print(find_topic(sample_texts["clearcut topics"], "svd", 4, vec_model="tf"))

Topic 0: bergers | hate | love | sandwiches | we
Topic 1: bergers | ^hate | love | ^sandwiches
Topic 2: bergers | hate | love | sandwiches | ^we
Topic 3: ^hate | sandwiches


In [204]:
print(find_topic(sample_texts["clearcut topics"], "svd", 4, vec_model="tfidf"))

Topic 0: bergers | hate | love | sandwiches | we
Topic 1: bergers | ^hate | love | ^sandwiches
Topic 2: bergers | hate | love | sandwiches | ^we
Topic 3: bergers | ^hate | ^love | sandwiches


***unbalanced topics***

In [205]:
print(find_topic(sample_texts["unbalanced topics"], "svd", 3, vec_model="tf"))

Topic 0: hate | sandwiches | we
Topic 1: bergers | ^hate | love | ^sandwiches | we
Topic 2: ^bergers | ^hate | ^love | ^sandwiches | we


## LDA: try to glue similar words
- finds topics that are interpretable to human beings: similar words are grouped together
- co-occurring words tend to be grouped together as much as possible
- this can be a problem for noisy inputs, or when the same meaning is expressed with a variety of words
- not so good at finding minor topics in unbalanced texts

In [206]:
print(find_topic(sample_texts["clearcut topics"], "lda", 4, vec_model="tf"))

Topic 0: bergers | love | we
Topic 1: bergers | love | we
Topic 2: hate | sandwiches | we
Topic 3: bergers | love | we


In [207]:
print(find_topic(sample_texts["clearcut topics"], "lda", 4, vec_model="tfidf"))

Topic 0: bergers | love | we
Topic 1: hate | sandwiches | we
Topic 2: bergers | love | we
Topic 3: bergers | love | we


Minor topics: the minor topic has been merged with a bigger topic.

In [208]:
print(find_topic(sample_texts["unbalanced topics"], "lda", 4, vec_model="tf"))

Topic 0: hate | sandwiches | we
Topic 1: bergers | hate | love | sandwiches | we
Topic 2: hate | sandwiches | we
Topic 3: hate | sandwiches | we


In [298]:
print(find_topic(sample_texts["noisy topics"], "lda", 2, vec_model="tfidf"))

Topic 0: bergerx | bergexs | bergxrs | berxers | bexgers | bxrgers | hate | love | sandwichex | sandwichxs | sandwicxes | sandwixhes | sandwxches | sanxwiches | sxndwiches | we | xandwiches | xergers
Topic 1: bergerx | bergexs | bergxrs | berxers | bexgers | bxrgers | hate | love | sandwichex | sandwichxs | sandwicxes | sandwixhes | sandwxches | sanxwiches | sxndwiches | we | xandwiches | xergers


## NMF
- something in between: also interpretable, trying to find "independent" topics as much as possible??
- even works with unbalanced topics
- however, not as consistent as LDA when the number of topics grows (potentially beyond the number of latent topics). On the other hand, LDA will try to glue smaller topics found previously into bigger ones
- more robust to noise

In [209]:
print(find_topic(sample_texts["clearcut topics"], "nmf", 5, vec_model="tf"))

Topic 0: hate | sandwiches | we
Topic 1: bergers | love | we
Topic 2: bergers | love | we
Topic 3: hate | sandwiches | we
Topic 4: hate | sandwiches | we


In [210]:
print(find_topic(sample_texts["unbalanced topics"], "nmf", 5, vec_model="tfidf"))

Topic 0: bergers | love | we
Topic 1: hate | sandwiches | we
Topic 2: bergers | love | we
Topic 3: hate | sandwiches | we
Topic 4: hate | sandwiches | we


In [211]:
print(find_topic(sample_texts["clearcut topics"], "nmf", 25, vec_model="tfidf"))

Topic 0: hate | sandwiches | we
Topic 1: bergers | love | we
Topic 2: hate | sandwiches | we
Topic 3: hate | sandwiches | we
Topic 4: bergers | love | we
Topic 5: hate | sandwiches | we
Topic 6: hate | sandwiches | we
Topic 7: we
Topic 8: bergers | love | we
Topic 9: hate | sandwiches | we
Topic 10: bergers | love | we
Topic 11: bergers | love | we
Topic 12: love
Topic 13: bergers | love | we
Topic 14: bergers | love | we
Topic 15: bergers | love | we
Topic 16: bergers | we
Topic 17: bergers | love | we
Topic 18: bergers | love | we
Topic 19: bergers | love | we
Topic 20: love | we
Topic 21: hate | sandwiches | we
Topic 22: bergers | love | we
Topic 23: hate | sandwiches
Topic 24: bergers | love | we


In [212]:
print(find_topic(sample_texts["clearcut topics"], "lda", 20, vec_model="tf"))

Topic 0: bergers | hate | love | sandwiches | we
Topic 1: bergers | love | we
Topic 2: hate | sandwiches | we
Topic 3: bergers | love | we
Topic 4: bergers | hate | love | sandwiches | we
Topic 5: bergers | love | we
Topic 6: bergers | love | we
Topic 7: bergers | hate | love | sandwiches | we
Topic 8: bergers | hate | love | sandwiches | we
Topic 9: bergers | love | we
Topic 10: bergers | hate | love | sandwiches | we
Topic 11: bergers | love | we
Topic 12: bergers | love | we
Topic 13: bergers | love | we
Topic 14: bergers | love | we
Topic 15: bergers | love | we
Topic 16: bergers | love | we
Topic 17: bergers | love | we
Topic 18: bergers | love | we
Topic 19: bergers | love | we


In [299]:
print(find_topic(sample_texts["noisy topics"], "nmf", 2, vec_model="tfidf"))

Topic 0: hate | sandwichex | sandwichxs | sandwicxes | sandwixhes | sandwxches | sanxwiches | sxndwiches | we | xandwiches
Topic 1: bergerx | bergexs | bergxrs | berxers | bexgers | bxrgers | love | we | xergers


## Finding semantic groups
None of them can find semantically similar groups of keywords, e.g., "love, hate" or "sandwiches, bergers".

The reason is that short sentences don't have enough repetition of contexts to extract such groups, and most models focus on co-occurrence instead of context similarity.

- SVD did quite well at capturing the dimensions in terms of context?
- LDA tends to find big topics with many co-occurring words
- NMF is somewhere in between??

In [213]:
print(find_topic(sample_texts["semantic topics"], "nmf", 5, vec_model="tfidf"))

Topic 0: bergers | hate | we
Topic 1: bergers | we
Topic 2: love | we
Topic 3: sandwiches | we
Topic 4: hate | we


In [214]:
print(find_topic(sample_texts["semantic topics"], "lda", 5, vec_model="tfidf"))

Topic 0: love | sandwiches | we
Topic 1: bergers | love | we
Topic 2: bergers | love | we
Topic 3: bergers | love | we
Topic 4: bergers | hate | we


In [215]:
print(find_topic(sample_texts["semantic topics"], "svd", 4, vec_model="tfidf"))

Topic 0: bergers | hate | love | sandwiches | we
Topic 1: ^hate | love
Topic 2: bergers | ^sandwiches
Topic 3: bergers | hate | love | sandwiches | ^we


## KMeans
- not really a topic model; it just clusters and labels documents
- no limit on the number of topics
- good at finding minor topics in unbalanced texts
- the result is robust to the number of topics: there is redundancy

In [253]:
print(find_topic(sample_texts["unbalanced topics"], "kmeans", 10, vec_model="tf"))

Topic 0: hate | sandwiches | we
Topic 1: bergers | love | we
Topic 2: hate | sandwiches | we
Topic 3: hate | sandwiches | we
Topic 4: hate | sandwiches | we
Topic 5: hate | sandwiches | we
Topic 6: hate | sandwiches | we
Topic 7: hate | sandwiches | we
Topic 8: hate | sandwiches | we
Topic 9: hate | sandwiches | we


Selecting a good word vector is essential; e.g., it might be dominated by noise in the sentences, such as stop words.

## Summary
- KMeans is almost always a good place to start because
    - it's cheap to run (even with large data, e.g., with MiniBatchKMeans)
    - it provides useful information about the structure of the texts
    - it can be combined with a variety of vector representations, e.g., tf, tfidf, ngrams, doc2vec
    - it is especially useful for short sentences, where it is less usual for a document to have more than one topic
    - on the other hand, it is sensitive to the document representation because the clustering is ...