# <center>Topic Modeling for Exploratory Text Analysis</center>
<center>An SSDA workshop</center>
<center>If you have any questions, feel free to contact me (Meng) at caimeng2@msu.edu</center>

# Before we start

In this workshop, we will use a few packages that are not included in Anaconda by default. Please make sure to install `nltk`, `scikit-learn`, `spacy`, `pyLDAvis`, and `gensim`. The following commands should work on most systems. Detailed installation instructions can be found on each package's website (see the links below).

- `conda install nltk` https://www.nltk.org/install.html

- `conda install scikit-learn` https://scikit-learn.org/stable/install.html

- `pip install spacy` https://spacy.io/usage

- `pip install pyLDAvis` https://pypi.org/project/pyLDAvis/

- `pip install gensim` https://radimrehurek.com/gensim/
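
To confirm that everything is in place before running the notebook, a quick check along these lines may help. This is only a sketch: the helper name `check_installed` is made up for this example, and it relies on `importlib.util.find_spec` from the standard library. Note that `scikit-learn` is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names of the packages used in this workshop
# (scikit-learn is imported under the name "sklearn").
REQUIRED = ["nltk", "sklearn", "spacy", "pyLDAvis", "gensim"]

def check_installed(module_names):
    """Return a dict mapping each top-level module name to True (found) or False."""
    return {name: find_spec(name) is not None for name in module_names}

# Hypothetical helper, not part of any of the packages above.
for name, ok in check_installed(REQUIRED).items():
    print(f"{name:10s} {'found' if ok else 'MISSING'}")
```

Any package reported as `MISSING` can be installed with the matching command from the list above.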


# Steps for topic modeling

We've already covered what LDA topic modeling is and why it is useful. In this notebook, we will take a look at how to build a topic model in Python.

- [Step 1: data cleaning](#1)
- [Step 2: tokenization](#2)
- [Step 3: stemming / lemmatization](#3)
- [Step 4: remove stop words](#4)

At this point, the data is ready for topic modeling.

- [Step 5: find topics](#5)
- [Step 6: interpret the topics and improve the model](#6)

Optional steps:

- [Label documents](#7)
- [Visualization](#8)

Jump to:
- [Activities - try it yourself](#activity)
- [Additional resources](#additional)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from spacy.lang.en import English
from nltk.corpus import wordnet as wn
import gensim
from gensim import corpora
import pyLDAvis.gensim # in recent pyLDAvis releases this module is named pyLDAvis.gensim_models
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
pd.set_option("display.max_colwidth", 80)
import warnings
warnings.simplefilter("ignore")

Note: the above cell will take a few seconds to run.

<a name="1"></a>

## Step 1: data cleaning

Data: `msutweets.csv`. Two years of tweets from [@michiganstateu](https://twitter.com/michiganstateu) (from 4/30/2018 to 4/30/2020). No retweets.

The whole dataset has 1580 tweets. Let's randomly draw a sample of 200 as our example data.

In [None]:
raw = pd.read_csv("msutweets.csv", header=None) # import data
raw.columns = ["created_at", "id", "hashtags", "user_mentions", "in_reply_to_status_id",
               "in_reply_to_user_id", "in_reply_to_screen_name", "username", "user_id",
               "profile_location", "description", "user_url", "followers", "friends",
               "user_created_at", "verified", "geo", "coordinates", "place",
               "retweet", "favorite", "tweets"] # set up column names ("user_id" is an assumed label that avoids a duplicate "id")
df = raw[["tweets"]] # remove columns we don't need
df = df.sample(n=200, random_state=0) # randomly sample 200 tweets and set a seed for reproducibility
df

We use **regular expressions** (also referred to as regex or regexp) to clean our text. This step usually requires customized solutions for different datasets. For tweets, we generally want to remove URLs, #hashtags, numbers, and punctuation. For our example, because the purpose is not to build a social network or run a sentiment analysis, we delete @mentions and emojis as well. Note that the function below strips only the `#` symbol and keeps the text of each hashtag.

On a side note, regular expressions are a powerful tool for pattern matching in text. However, the syntax is hard to read and write. I found https://regex101.com/ very helpful for experimenting with regular expressions.
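
To see what each part of the cleaning expression in this notebook matches, it can help to try the pieces one at a time. The sketch below does that on a made-up tweet (the text and the variable names are invented for illustration) and then combines the pieces in the same order the notebook uses.

```python
import re

# A made-up tweet, used only for illustration.
tweet = "Congrats @michiganstateu grads! #SpartansWill 2020 https://msu.edu/news"

# Each pattern is one alternative of the combined cleaning expression.
mention = r"@[A-Za-z0-9]+"       # "@" followed by letters or digits
special = r"[^0-9A-Za-z \t]"     # any character that is not a letter, digit, space, or tab
link = r"\w+:\/\/\S+"            # e.g. "https://..." up to the next whitespace

print(re.findall(mention, tweet))  # the @mention
print(re.findall(link, tweet))     # the URL
print(re.findall(special, tweet))  # individual special characters

# Combine the alternatives, then drop digits and collapse extra spaces.
combined = f"({mention})|({special})|({link})"
no_noise = " ".join(re.sub(combined, "", tweet).split())
cleaned = " ".join(re.sub(r"\d+", "", no_noise).split())
print(cleaned)  # Congrats grads SpartansWill
```

Notice that the hashtag text survives: only the `#` character is removed, because the word itself consists of ordinary letters.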

In [None]:
def TextCleaner(dirty_text):
    """Remove links, special characters, and numbers from text."""
    dirty_text = dirty_text.replace("&amp", "") # remove "&"
    semi_clean_text = " ".join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)",
                                      "", dirty_text).split()) # remove @somebody, special characters, and links
    clean_text = " ".join(re.sub(r"\d+", "", semi_clean_text).split()) # remove numbers
    return clean_text

In [None]:
clean_tweets = []
for tweet in df.tweets:
    clean_tweets.append(TextCleaner(tweet))
df.insert(1, "clean_tweets", clean_tweets)

Here's what our clean data look like.

In [None]:
df.head(10)

If you would like to do a simple word count to see what the most frequently used words are, here's a function that plots the top words and n-grams.

**N-gram** is a concept in Natural Language Processing. It simply means a sequence of n words. For example,

&emsp;&emsp;&emsp;&emsp;"SSDA" is a uni-gram (an individual word);

&emsp;&emsp;&emsp;&emsp;"good morning" is a bi-gram (2-gram);

&emsp;&emsp;&emsp;&emsp;"how are you" is a tri-gram (3-gram);

&emsp;&emsp;&emsp;&emsp;"Mary had a little lamb" is a 5-gram, and there are four bi-grams in this sentence: "Mary had", "had a", "a little", and "little lamb".
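
The idea can be sketched in a few lines of plain Python. The helper `ngrams` below is a hypothetical function written for this illustration; it is not what `CountVectorizer` uses internally, but it produces the same kind of word sequences.

```python
def ngrams(sentence, n):
    """Return all n-grams (as strings) from a whitespace-separated sentence."""
    words = sentence.split()
    # Slide a window of n words across the sentence.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "Mary had a little lamb"
print(ngrams(sentence, 2))  # ['Mary had', 'had a', 'a little', 'little lamb']
print(ngrams(sentence, 3))  # ['Mary had a', 'had a little', 'a little lamb']
```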

In [None]:
def PlotWords(text, n, ngram_min=1, ngram_max=1):
    """
    Plot the distribution of top words in text.
    Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    """
    vec = CountVectorizer(ngram_range=(ngram_min, ngram_max), stop_words=None).fit(text)
    #vec = CountVectorizer(ngram_range=(ngram_min, ngram_max), stop_words="english").fit(text)
    bag_of_words = vec.transform(text)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    df = pd.DataFrame(words_freq[:n], columns=["Topwords", "Count"])
    fig, ax = plt.subplots()
    ax.barh(df["Topwords"], df["Count"])
    ax.invert_yaxis()
    ax.set_xlabel("Count")
    ax.set_title("Top {} words/phrases".format(n))
    return ax

In [None]:
PlotWords(df.clean_tweets, 10, 1, 1) # consider only individual words

In [None]:
PlotWords(df.clean_tweets, 10, 2, 3) # consider only bi- and tri-grams

<a name="2"></a>

## Step 2: tokenization

Tokenization is essentially chopping a phrase, a sentence, or a paragraph into smaller units, such as individual words or n-grams. These smaller units are called tokens.

There are many ways to tokenize text data. We introduce two here. [This article](https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/) presents six distinct ways.

### Python's split()

`split()` returns a list of strings after breaking the given string by the specified separator (only one separator at a time). It does not consider punctuation as a separate token.
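
Both caveats are easy to see on a short example. The sentence here is a slightly modified version of the one used in this notebook, with a comma added for illustration.

```python
sentence = "MSU leaders are meeting daily, and taking all appropriate steps."

# With no argument, split() breaks on any run of whitespace.
# Punctuation stays attached to the neighboring word ("daily," and "steps.").
print(sentence.split())

# With an explicit separator, only that one separator is used.
print(sentence.split(","))
```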

In [None]:
"MSU leaders are meeting daily and taking all appropriate steps to ensure the health of the Spartan community.".split()

### spaCy

spaCy has a reputation for being super fast. It is written in [Cython](https://cython.org/) (C-extension for Python, which means accelerated code execution speed).

In [None]:
parser = English()
mytext = parser("MSU leaders are meeting daily and taking all appropriate steps to ensure the health of the Spartan community.")
tokens = []
for token in mytext:
    tokens.append(token.text)
print(tokens)

In [None]:
def Tokenize(text):
    """Break text into tokens."""
    parser = English()
    parsed_text = parser(text)
    tokens = []
    for token in parsed_text:
        if token.orth_.isspace():
            continue
        else:
            tokens.append(token.lower_) # convert to lowercase
    return tokens

In [None]:
Tokenize("MSU leaders are meeting daily and taking all appropriate steps to ensure the health of the Spartan community.")

<a name="3"></a>

## Step 3: stemming / lemmatization

The aim of stemming and lemmatization is the same: to return a word to its root. The difference is the way they work.

> **Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
>
> **Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the **lemma**.
>
>Source: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

For example,

|**Word**|**Stemming**|**Lemmatization**|
|--------|------------|-----------------|
|studying| study      |study            |
|studies | studi      |study            |
|was     |wa          |be               |

Lemmatization is usually the preferred way of reducing related words to a common base.
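
The difference between the two approaches can be sketched with toy versions of each. Everything below is invented for illustration: `toy_stem` applies three hand-picked suffix rules (it is not a real stemming algorithm), and `TOY_LEMMAS` is a three-word lookup table standing in for a real vocabulary.

```python
def toy_stem(word):
    """Crude suffix stripping: a toy stand-in for a real stemmer."""
    for suffix, replacement in [("ies", "i"), ("ing", ""), ("s", "")]:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

# A tiny hand-made lookup table standing in for a real vocabulary.
TOY_LEMMAS = {"studying": "study", "studies": "study", "was": "be"}

def toy_lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["studying", "studies", "was"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based version needs no vocabulary but produces non-words such as "studi" and "wa", while the lookup version returns real dictionary forms only for words it already knows.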

In [None]:
lemma = wn.morphy("spartans")
print(lemma)

In [None]:
lemma = wn.morphy("is")
print(lemma)

In [None]:
lemma = wn.morphy("are") # if you're interested, try "was", what did you find?
print(lemma)

In [None]:
#print(wn.morphy.__doc__)

In [None]:
lemma = wn.morphy("are", wn.VERB)
print(lemma)

Notice that it didn't do a very good job without a part-of-speech (POS) tag. However, it is impossible to manually provide appropriate POS tags for every word in a large amount of text. In this workshop, let's ignore POS tags and see what we get. If you'd like to know more about word lemmatization with POS tags, check out this [article](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/).

In [None]:
lemma = wn.morphy("msu")
print(lemma)

In [None]:
def GetLemma(word):
    """Get the root of a given word."""
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

In [None]:
GetLemma("msu")

<a name="4"></a>

## Step 4: remove stop words

In [None]:
#print(set(nltk.corpus.stopwords.words("english"))) # check out nltk stopwords

In [None]:
def ProcessToken(mytext):
    """Tokenize, remove stopwords, and get lemma of a given sentence."""
    en_stop = set(nltk.corpus.stopwords.words("english"))
    tokens = Tokenize(mytext)
    clean_tokens = []
    for token in tokens:
        if token not in en_stop:
            clean_tokens.append(GetLemma(token))
    return clean_tokens

In [None]:
ProcessToken("MSU leaders are meeting daily and taking all appropriate steps to ensure the health of the Spartan community")

In [None]:
processed_tweets = []
for tweet in clean_tweets:
    processed_tweets.append(ProcessToken(tweet))

In [None]:
#processed_tweets

<a name="5"></a>

## Step 5: find topics

Finally, we are ready for LDA topic modeling. We will use `gensim` for this task.

We first need to initialize a **dictionary**, which maps each unique token in our documents (i.e., all tweets) to an integer id.

In [None]:
dictionary = corpora.Dictionary(processed_tweets) # initialize a dictionary
#list(dictionary.values())
#len(dictionary) # number of unique tokens
#print(dictionary.token2id) # token to id map

Then we need to convert our documents to bags of words; the resulting collection is referred to as a **corpus** in `gensim`. Here the corpus is a list with one entry per document, and each entry holds (token id, frequency) pairs for the tokens in that document.

In [None]:
corpus = [dictionary.doc2bow(t) for t in processed_tweets] # initialize a corpus
#[i for i in corpus] # take a look at the corpus: word id and its frequency
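
To make the dictionary and bag-of-words ideas concrete, here is a plain-Python sketch on two made-up documents. It only mimics the concept: the names `token2id` and `to_bow` are chosen for this example, and `gensim` may assign ids in a different order.

```python
from collections import Counter

# Two made-up, already tokenized documents.
docs = [["go", "green", "go", "white"], ["green", "campus"]]

# "Dictionary": give each unique token an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Convert a list of tokens into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

# "Corpus": one bag of words per document.
bows = [to_bow(d) for d in docs]
print(token2id)  # {'go': 0, 'green': 1, 'white': 2, 'campus': 3}
print(bows)      # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (3, 1)]]
```

Word order is lost in this representation; only which tokens occur, and how often, is kept.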

Now all we need to do is let `gensim` train a model with the dictionary and the corpus as inputs. We also need to decide how many topics there are in our tweets. Two is a good place to start.

In [None]:
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20) # train the model

In [None]:
model.print_topics(num_words=5)

<a name="6"></a>

## Step 6: interpret the topics and improve the model

Interpreting the topics requires a lot of human knowledge and judgment. We need to apply our domain knowledge to check if the topics generated by the model make sense. If not, go back and find ways to improve the model. Here are a few things to experiment with:

- Adjust the number of passes
- Adjust the number of topics
- Eliminate words that do not carry meaning by themselves, e.g. get
- Eliminate words that are too common, e.g. msu
- Use POS tags to improve the accuracy of lemmatization 
- Look at words that are from specific parts of speech (only nouns, only adjectives, both nouns and verbs, etc.)
- Instead of only uni-grams, consider adding bi-grams or higher-order n-grams

Let's try removing some words first.

In [None]:
def ProcessToken2(mytext, remove_term=None):
    """
    Tokenize, remove stopwords, remove user-defined terms, and get lemma of a given sentence.

    Input:
    mytext -- a string;
    remove_term -- a list of words to remove (default: none).
    Output:
    clean tokens.
    """
    en_stop = set(nltk.corpus.stopwords.words("english"))
    remove_term = set(remove_term or []) # None instead of [] avoids a mutable default argument
    tokens = Tokenize(mytext)
    final_tokens = []
    for token in tokens:
        if token in remove_term or token in en_stop:
            continue # skip user-defined terms and stop words
        lemma = GetLemma(token)
        if lemma not in remove_term: # the lemma itself may also be a term to remove
            final_tokens.append(lemma)
    return final_tokens

In [None]:
ProcessToken2("MSU leaders are meeting daily and taking all appropriate steps to ensure the health of the Spartan community",
              remove_term=["msu", "spartan"])

In [None]:
processed_tweets = []
for tweet in clean_tweets:
    processed_tweets.append(ProcessToken2(tweet, remove_term=["msu", "spartan", "get"]))

In [None]:
dictionary = corpora.Dictionary(processed_tweets)
corpus = [dictionary.doc2bow(t) for t in processed_tweets]
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20)

In [None]:
model.print_topics(num_words=5)

<a name="activity"></a>

# Activities - try it yourself

### Increase passes and change the number of topics?

In [None]:
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=50)
model.print_topics()

### Eliminate words with extreme frequencies?

In [None]:
#help(dictionary.filter_extremes)

In [None]:
#Only keep tokens that are contained in at least 5 tweets and no more than 90% of tweets
dictionary = corpora.Dictionary(processed_tweets)
dictionary.filter_extremes(no_below=5, no_above=0.9)
corpus = [dictionary.doc2bow(t) for t in processed_tweets]
model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
model.print_topics()
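
The effect of `no_below` and `no_above` can be sketched in plain Python. The function `filter_tokens` below is a hypothetical illustration of the idea on made-up documents; it ignores other details of `filter_extremes` (such as its `keep_n` limit and how it rounds the upper threshold).

```python
from collections import Counter

def filter_tokens(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents (a count)
    and in at most `no_above` of all documents (a fraction)."""
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    keep = {t for t, freq in doc_freq.items() if no_below <= freq <= max_docs}
    return [[t for t in doc if t in keep] for doc in docs]

# Made-up tokenized documents.
docs = [
    ["campus", "student", "research"],
    ["campus", "student", "football"],
    ["campus", "research", "award"],
    ["campus", "student"],
]
print(filter_tokens(docs, no_below=2, no_above=0.9))
# [['student', 'research'], ['student'], ['research'], ['student']]
```

Here "campus" is dropped because it appears in every document, and "football" and "award" are dropped because each appears in only one.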

### Anything else you'd like to try?

Here's a great article on how to build better topic models by looking at **topic coherence**: [Evaluate Topic Models: Latent Dirichlet Allocation](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0).
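
To give a feel for what a coherence score measures, here is a minimal sketch of one common variant, often called UMass coherence, computed from document co-occurrence counts. The function name and the toy documents are made up for this example; the linked article and `gensim`'s `CoherenceModel` support other measures as well, so treat this as an illustration of the idea rather than the method used there.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic, given its top words (most probable first)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    # In each pair, `earlier` ranks higher in the topic than `later`.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score

# Made-up tokenized documents.
docs = [
    ["campus", "student"],
    ["campus", "student"],
    ["campus", "research"],
    ["football", "game"],
]
print(umass_coherence(["campus", "student"], docs))  # words that co-occur
print(umass_coherence(["campus", "game"], docs))     # words that never co-occur
```

Higher (less negative) scores suggest that a topic's top words tend to appear in the same documents.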

<a name="7"></a>

## Optional: label documents

In [None]:
def ShowTopics(text, model, corpus):
    """Show the topic distributions in each document."""
    topics = []
    for i in range(len(text)):
        topics.append(model[corpus[i]])
    return topics

In [None]:
def WhichTopic(text, model, corpus):
    """Find the most likely topic for each document."""
    topic = []
    for i in range(len(text)):
        topic_index, topic_value = max(model[corpus[i]], key=lambda item: item[1])
        topic.append(topic_index)
    return topic

In [None]:
df["topics"] = ShowTopics(processed_tweets, model, corpus)
df["most_likely"] = WhichTopic(processed_tweets, model, corpus)
df.head()
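
The "most likely topic" is simply the pair with the largest probability. The sketch below shows this selection step on a made-up topic distribution that has the same shape as the entries in the `topics` column, a list of (topic id, probability) pairs.

```python
# A made-up topic distribution for one document: (topic_id, probability) pairs.
doc_topics = [(0, 0.27), (1, 0.73)]

# max() compares the pairs by their second element, the probability.
topic_index, topic_value = max(doc_topics, key=lambda item: item[1])
print(topic_index, topic_value)  # 1 0.73
```

When two topics have nearly equal probabilities, a single label hides that ambiguity, so it is worth keeping the full distribution alongside it.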

<a name="8"></a>

## Optional: visualization

Note: this is extremely slow.

In [None]:
# In recent pyLDAvis releases, use pyLDAvis.gensim_models.prepare(...) instead
lda_display = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

<a name="additional"></a>

# Additional resources

The LDA paper: [Latent Dirichlet Allocation](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

Other topic modeling algorithms: [Topic Modeling with LSA, PLSA, LDA & lda2Vec](https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05)

Comparing different tokenization methods and libraries: [How to Get Started with NLP – 6 Unique Methods to Perform Tokenization](https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/)

Comparing different lemmatization methods and libraries: [Lemmatization Approaches with Examples in Python](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)

Building interpretable topic models: [Evaluate Topic Models: Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)

More on visualization: [pyLDAvis](https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb#topic=0&lambda=1&term=)
