## LDA topic modeling goals


The utility of topic modeling methods lies in their capacity to uncover unobserved variables (topics) that shape the meaning of textual documents. Scholars use topic modeling to uncover latent topics in a wide array of textual information, from shorter texts such as Twitter posts to longer texts such as journal articles.



This notebook applies LDA modeling to an experimental dataset investigating participants' goal inferences. 


### Key Python libraries:
- gensim (https://radimrehurek.com/gensim/)
- nltk (https://www.nltk.org)
- spacy (https://spacy.io)

### Helpful Links:
- https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
- https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158






## A Comprehensive Example:

The data represent the responses of 119 participants to the questionnaire described in the paper "A Theory-Driven Computational Measure of Goal Activation in Communication Science". 


Participants were asked to list all the goals they could think of for a total of four time points. Each goal serves as a single document, leaving us with a total of 2976 documents—each participant providing up to 40 documents, across four time points. 


LDA assumes that a single document can contribute to multiple topics simultaneously; in other words, LDA models each document as a mixture of topics and each topic as a distribution over words. The aim of the analysis is to investigate participants’ open-ended responses to the questionnaire.
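
To make the "mixture of topics" idea concrete, here is a minimal sketch of the generative story that LDA assumes, using only the standard library. The two topics, their word probabilities, and the function names are made up for illustration; they are not taken from the study data or from gensim.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one toy document: draw a topic mixture, then draw each word."""
    vocab = sorted(topic_word[0])
    theta = sample_dirichlet(alpha, rng)  # this document's mixture over topics
    words = []
    for _ in range(n_words):
        # choose a topic for this word position, then a word from that topic
        z = rng.choices(range(len(topic_word)), weights=theta)[0]
        weights = [topic_word[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical two-topic model over a made-up five-word vocabulary
toy_topics = [
    {"exercise": 0.50, "health": 0.40, "money": 0.05, "save": 0.04, "learn": 0.01},
    {"exercise": 0.02, "health": 0.03, "money": 0.50, "save": 0.40, "learn": 0.05},
]
rng = random.Random(42)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=6, rng=rng)
print([round(p, 3) for p in theta], words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions.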

###  Steps of the analysis:

#### 1. Preparing data for LDA
    a. Spell check
    b. Expand contractions
    c. Read the data 
    d. Check data integrity
    e. Delete missing values
#### 2. Text preprocessing
    a. Tokenization
    b. Lemmatization  
    c. Stop Word Removal
    d. Bigrams and Trigrams
    e. Exclude terms in > 99% and < 1% of documents
    f. Generate Corpus and Dictionary
#### 3. Model selection (Selecting the number of topics (k))
    a. Computing Model Perplexity
    b. Analyzing model results through pyLDAvis visualization
    c. Saving selected model results

In [None]:
## Load Required Libraries

#general
import numpy as np
import pandas as pd
import re
import pickle
from IPython.display import display

#setting up Jupyter notebook 
%matplotlib inline
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)

#text preprocessing
import nltk
from nltk.corpus import stopwords

import spacy
from spacy.lang.en import English

from gensim.models import Phrases
from gensim.utils import simple_preprocess


#modeling
import gensim
from gensim.models.ldamodel import LdaModel


#plotting
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

In [None]:
import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.0/en_core_web_md-2.2.0.tar.gz

### 1. Preparing data for LDA

Before you read in your data, run your textual data through a spell checker manually, so that you can use the semantic and syntactic context of each response when selecting the correct spelling.

Likewise, we had human coders expand all English contractions (e.g., "don't" -> "do not") to ensure accuracy.

After you read the data in, check its integrity to avoid unexpected errors, and delete missing (null) values.

In [None]:
## File paths

#data files
file_location = './data/experimental_data.xlsx'

#stop words
stopwords_location = './data/stopwords.txt'

In [None]:
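## Check to make sure the dataset looks correct
try:
    # .xlsx files are binary, so no text encoding is passed to read_excel
    data = pd.read_excel(file_location)
    print("{} Rows.  {} Columns.".format(*data.shape))
except FileNotFoundError:
    # only a missing file is reported here; other errors surface with their own traceback
    print("Dataset could not be loaded. Is the dataset missing?")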

In [None]:
## Spot checks
indices = [0,333,777]

samples = pd.DataFrame(data.loc[indices, :], columns = data.keys()).reset_index(drop = True)
print("Sample responses:")
display(samples)

In [None]:
## Check number of null values in each column of the full dataset
pd.DataFrame(data.isnull().sum(), columns=['Number of NULL values'])

In [None]:
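## Remove missing (null) values from the data

#finding null values in the full dataset
print("=============Full Dataset=============")
print('Number of rows in goals:', len(data['goals']))
print("-------------------")
print("Null Values in goals: {}".format(data['goals'].isnull().sum()))

#removing null values from the full dataset
#(rows are dropped from `data` itself so that it stays aligned with the model output later on)
data = data.dropna(subset=['goals']).reset_index(drop=True)
goals = data['goals']
print('Number of rows in goals after removing nulls:', len(goals))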


### 2. Text preprocessing
This step prepares the input for a ‘bag-of-words’ LDA model. It includes tokenization, lemmatization, stop word removal, adding bigrams and trigrams, excluding words that appear in > 99% or < 1% of documents, and generating the Corpus and Dictionary.

**Tokenization** splits each response into individual words (tokens). As part of this step, we convert the text to lowercase and remove punctuation and special characters. We should also be careful to remove alphanumerics, numbers, extra spaces, and words that appear in the corpus fewer than two times.
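
As a rough illustration of what tokenization does, the sketch below uses a regular expression from the standard library. It is a simplification for teaching purposes, not gensim's `simple_preprocess` implementation; the function name and the example sentence are made up.

```python
import re

def simple_tokenize(text, min_len=2):
    """Lowercase the text, keep alphabetic runs only, and drop very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len]

# Punctuation, digits, and one-letter leftovers such as the "x" in "3x" are discarded
print(simple_tokenize("Save $500 by June; exercise 3x a week!"))
# ['save', 'by', 'june', 'exercise', 'week']
```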


**Lemmatization** reduces the size of the vocabulary in the model. It transforms words to their lemma (e.g., assaulted -> assault), so that the model can analyze several inflected forms of a word as a single word. In addition, lemmatization with spaCy makes it possible to keep only words with certain parts of speech (e.g., noun, adj, vb, adv).


**Stop Word Removal** is often an important step toward better model input. Stop words are very common words in a language (e.g., a, an, the). Note: you can edit the stop words txt file to add additional words to filter out. We recommend filtering out as few stop words as possible, as even commonly occurring words can offer meaningful information, especially when responses are terse. However, depending on the specific characteristics of the textual data, stop word removal may be necessary to minimize model noise.

**Bigrams and Trigrams** are sequences of two and three consecutive words that frequently co-occur.
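
A minimal sketch of bigram detection is shown below. It keeps adjacent word pairs by raw count, which is a simplification: gensim's `Phrases` model scores candidate pairs against a threshold instead. The function name and toy documents are hypothetical.

```python
from collections import Counter

def frequent_bigrams(docs, min_count=2):
    """Return adjacent word pairs that occur at least min_count times in the corpus."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))  # all adjacent pairs in this document
    return {" ".join(pair) for pair, n in counts.items() if n >= min_count}

toy_docs = [
    ["lose", "weight", "fast"],
    ["lose", "weight"],
    ["save", "money"],
    ["eat", "healthy", "lose", "weight"],
]
print(frequent_bigrams(toy_docs))
# {'lose weight'}
```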

**Exclude terms in > 99% and < 1% of documents** removes words that carry little content, either because they appear in nearly every document or because they appear in almost none. This reduces model noise.
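
The filtering rule can be sketched in plain Python as follows. The function name, the thresholds used in the example call, and the toy documents are assumptions chosen so that the effect is visible on a four-document corpus.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_share=0.01, max_share=0.99):
    """Keep only terms whose share of documents lies within [min_share, max_share]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term at most once per document
    keep = {t for t, df in doc_freq.items() if min_share <= df / n_docs <= max_share}
    return [[t for t in tokens if t in keep] for tokens in docs]

toy_docs = [["goal", "save", "money"], ["goal", "save"], ["goal", "travel"], ["goal", "money"]]
# "goal" appears in 100% of documents and "travel" in only 25%, so both are dropped
print(filter_by_document_frequency(toy_docs, min_share=0.3, max_share=0.99))
# [['save', 'money'], ['save'], [], ['money']]
```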


**Corpus** is our collection of documents (i.e., our textual questionnaire responses). <br>
**Dictionary** is a list of the unique words in the corpus; each unique word is assigned an index.
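
The relationship between the two can be sketched without any library. This toy version assigns ids in order of first appearance, which may differ from the ids gensim's `Dictionary` assigns; the function names are ours.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each unique token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in docs:
        for t in tokens:
            token2id.setdefault(t, len(token2id))
    return token2id

def doc_to_bow(tokens, token2id):
    """Turn one document into a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_docs = [["save", "money", "save"], ["travel", "money"]]
token2id = build_dictionary(toy_docs)
toy_corpus = [doc_to_bow(doc, token2id) for doc in toy_docs]
print(token2id)    # {'save': 0, 'money': 1, 'travel': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation, which is why it is called a bag of words.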

In [None]:
## Convert the text to lowercase
goals = goals.str.lower()

print("=======Full Dataset==============\n")
print(goals.head(1))

In [None]:
## Remove all punctuation and tokenize texts

#define helpful function
def tokenize(sentences):
    for sentence in sentences:
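        # simple_preprocess lowercases and tokenizes the text, which discards punctuation;
        # deacc=True additionally strips accent marks from the tokens
        yield(simple_preprocess(str(sentence), deacc=True))

#tokenize full data set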
goals_tokens = list(tokenize(goals))


print("\n[INFO] goals....................\n")
print(goals_tokens[:2])

In [None]:
## Lemmatize words, keeping only the parts of speech listed in ALLOWED_POSTAGS

#parts of speech to keep (see https://spacy.io/api/annotation)
ALLOWED_POSTAGS = ('NOUN', 'ADJ', 'VERB', 'ADV', 'SCONJ', 'PART', 'INTJ', 'AUX', 'ADP')

#define helpful function
def lemmatization(texts, allowed_postags=ALLOWED_POSTAGS):
    """Return the lemma of every token whose part-of-speech tag is in allowed_postags."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

#initialise Spacy
import en_core_web_md
#Initialize spacy model, keeping only tagger component (for efficiency)
nlp = en_core_web_md.load(disable=['parser', 'ner'])

#lemmatize and keep only the allowed parts of speech
goals_lemma = lemmatization(goals_tokens)
print(str(len(goals_lemma)))
print(goals_lemma[:4])

In [None]:
## Prepare to remove stop words
import nltk
from nltk.corpus import stopwords as nltk_stopwords
nltk.download('stopwords')

#from here on, `stopwords` is a plain set of words (NLTK's English list plus our own additions)
stopwords = set(nltk_stopwords.words('english'))
with open(stopwords_location, 'r') as f:
    newStopWords = [line.strip() for line in f if line.strip()]  # skip blank lines
stopwords.update(newStopWords)
print(len(stopwords))

In [None]:
## Remove stop words 

#define helpful function
def remove_stopwords(texts):
    # each doc is a list of lemmas; re-join it so that simple_preprocess receives plain text
    return [[word for word in simple_preprocess(" ".join(doc)) if word not in stopwords] for doc in texts]

#remove stop words 
goals_stopwords = remove_stopwords(goals_lemma)

print("\n[INFO] goals....................\n")
print(goals_stopwords[:2])

In [None]:
## Select Trigrams and Bigrams

#NOTE: gensim 3.x takes a bytes delimiter here (b' '); newer releases (4.x) take a str (' ')
goals_bigram = Phrases(goals_stopwords, min_count=3, delimiter=b' ', threshold=1)
goals_trigram = Phrases(goals_bigram[goals_stopwords], threshold=1)

goals_bigram_mod = gensim.models.phrases.Phraser(goals_bigram)
goals_trigram_mod = gensim.models.phrases.Phraser(goals_trigram)

#append every detected phrase (a token containing a space) to its document,
#so that the original single-word tokens are kept as well
for doc in goals_stopwords:
    phrases = [token for token in goals_trigram_mod[goals_bigram_mod[doc]] if ' ' in token]
    doc.extend(phrases)
print("\n[INFO] goals....................\n")
print(goals_stopwords[:2])

In [None]:
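## Generate Dictionary
dictionary_goals = gensim.corpora.Dictionary(goals_stopwords)
#no_below is an absolute number of documents, while no_above is a fraction of the corpus,
#so the 1% lower bound is converted into a document count first
min_docs = max(1, int(0.01 * dictionary_goals.num_docs))
dictionary_goals.filter_extremes(no_below=min_docs, no_above=0.99)

## Generate Corpus
corpus_goals = [dictionary_goals.doc2bow(text) for text in goals_stopwords]

## Save Corpus and Dictionary on a local drive
with open('./output/corpus_goals.pkl', 'wb') as f:
    pickle.dump(corpus_goals, f)
dictionary_goals.save('./output/dictionary_goals.gensim')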

### 3. Model selection (Selecting the number of topics (k))
A model is characterized by its hyperparameters, which define the prior distribution of topics within each document and the prior distribution of words within each topic. These should be defined based on theoretical assumptions about how we think the topics are actually distributed in our data. The LDA model from the gensim library has the following hyperparameters:
- **Beta** (referred to as 'eta' in gensim) = the prior on the distribution of words per topic
- **Alpha** = the prior on the distribution of topics per document



Alpha can be set to ‘symmetric’, ‘asymmetric’, or ‘auto’; eta can be set to ‘symmetric’ or ‘auto’ (gensim does not accept 'asymmetric' for eta), where:
- ‘auto’ = the model learns the best values for the hyperparameters as it is trained on more and more data (i.e., it learns an asymmetric prior from the corpus). See http://jonathan-huang.org/research/dirichlet/dirichlet.pdf for an overview
- 'asymmetric' = uses a fixed, normalized asymmetric prior of 1.0 / (topic_index + sqrt(k)), where k is the number of topics
- 'symmetric' = uses a fixed symmetric prior of 1.0 / k
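
To see what the two fixed priors look like, the sketch below computes them for a small k. The formulas follow our reading of the gensim documentation (1/k for 'symmetric', and a normalized 1/(topic_index + sqrt(k)) for 'asymmetric'); treat them as an assumption and check them against the gensim version you have installed. The function names are ours.

```python
import math

def symmetric_prior(k):
    """Every topic receives the same prior weight, 1/k."""
    return [1.0 / k] * k

def asymmetric_prior(k):
    """Earlier topics receive more prior weight; the weights are normalized to sum to 1."""
    raw = [1.0 / (i + math.sqrt(k)) for i in range(k)]
    total = sum(raw)
    return [v / total for v in raw]

print(symmetric_prior(4))
print([round(v, 3) for v in asymmetric_prior(4)])
```

With the asymmetric prior, the model starts from the assumption that a few topics are more prevalent than the rest.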



In Bayesian statistics, we have to define the distributions (i.e., prior distributions) of unknown variables (e.g., ϕ and θ) before running the data analysis. These should be defined based on theoretical assumptions about how we think the topics are actually distributed in our data. In our case, it makes sense to assume that some documents discuss more or fewer topics than other documents; thus, we set the document-topic distribution to be asymmetric.


We recommend setting alpha = 'auto' as it sets the distribution to be asymmetric, and learns the best alpha value (i.e., lowest perplexity scores) from the data itself. It also makes sense to assume that some topics contain more words than others. Thus, we recommend setting the distribution of the number of words per topic to be asymmetric as well.

In addition, the gensim LDA model has the following parameters:
- **Passes** = number of times the model goes through the entire corpus during training (more passes give the model more opportunity to converge, at the cost of training time)
- **Iterations** = maximum number of iterations over each document when inferring its topic distribution
- **Chunksize** = number of documents to load into memory at a time (smaller chunk sizes save memory, but take longer to train)
- **Update_every** = number of chunks to process before each model update (the maximization step)
- **Random state** = sets the seed to make the model reproducible
- **Number of topics (k)**

**Number of topics (k)** defines the LDA model. Researchers must tell the model how many (k) prominent goal inference topics to sort each ‘bag of words’ document into. Problematically, several different k-values might work. Thus, we use a metric called perplexity to help us to determine the optimal number of topics. The utility in perplexity comes from comparing perplexity values across models with differing k-values to pinpoint the best model (i.e., the model with the lowest perplexity score). 

We recommend testing the perplexity of the model with a variety of k values, and then running the final model with the k-value selected on the basis of those scores. **Model perplexity** is a frequently used metric that gauges how well a model fits the data.
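
One detail worth keeping in mind: gensim's `LdaModel.log_perplexity` returns a per-word likelihood bound (a negative number), and gensim reports perplexity as 2^(-bound). A minimal sketch of the conversion, with a function name of our own choosing and made-up input values:

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (log base 2) into a perplexity value."""
    return 2.0 ** (-per_word_bound)

# A bound closer to zero corresponds to a lower (better) perplexity
print(bound_to_perplexity(-7.0))  # 128.0
print(bound_to_perplexity(-6.0))  # 64.0
```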

In [None]:
## Set model Hyper Parameters
k = [1,2,3,4,5,6,7,8,9,10,11,12]
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta= 'auto'
per_word_topics=True

lda_model_goals = []

In [None]:
## Get Perplexity Scores of Training Dataset
print("\n***********************************************************************")
print("[INFO] goals Full Dataset LDA Results....")
print("***********************************************************************")

scores = []

for i in k:
    lda_model_goals = LdaModel(corpus=corpus_goals,
                                          id2word=dictionary_goals,
                                          num_topics=i, 
                                          random_state=random_state,
                                          update_every=update_every,
                                          chunksize=chunksize,
                                          passes=passes,
                                          iterations=iterations,
                                          alpha=alpha,
                                          eta=eta,                                                            
                                          per_word_topics=per_word_topics)

    # log_perplexity returns the per-word likelihood bound; perplexity is 2^(-bound)
    perplexity = np.exp2(-lda_model_goals.log_perplexity(corpus_goals))
    scores.append(perplexity)
    print('\nPerplexity (num_topics = {}): '.format(i), perplexity)

In [None]:
## Choosing Optimal Model (k) with Perplexity Scores

#create Figure and Axes instances
fig, ax = plt.subplots(1, figsize=(15,5))

#plot
topic_num = k[:len(scores)]  # the k values that were actually evaluated
ax.plot(topic_num, scores, color='b')

#hide the plot frame (spines) but keep ticks and labels
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.xticks(topic_num)

#set title and grid
plt.title('Model Perplexity Score by number of topics (k)', size=16)
ax.set_xlabel('Number of topics (k)', size=12)
plt.grid(True, axis='y', alpha=0.5)

plt.show()

Usually, lower perplexity scores indicate a better fit to the data, and smaller k-values yield a more parsimonious set of topics. However, the perplexity score will often keep decreasing as k increases. In these instances, it’s best to select the model that yields the lowest perplexity value before the values flatten out.
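
The "before the values flatten out" rule can be written down as a simple heuristic. The sketch below is one possible reading, not the procedure used in this notebook: the function name, the 2% tolerance, and the perplexity values are all made up, and the result should always be checked against the plot and against human judgment of topic coherence.

```python
def pick_k_before_plateau(k_values, perplexities, min_rel_gain=0.02):
    """Return the first k after which perplexity improves by less than min_rel_gain."""
    for idx in range(len(k_values) - 1):
        current, following = perplexities[idx], perplexities[idx + 1]
        rel_gain = (current - following) / current  # negative if perplexity rises
        if rel_gain < min_rel_gain:
            return k_values[idx]
    return k_values[-1]  # no plateau found within the tested range

# Made-up perplexity values for k = 2..6
print(pick_k_before_plateau([2, 3, 4, 5, 6], [200.0, 150.0, 120.0, 118.0, 117.0]))  # 4
```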

**Note: when selecting the optimal number of topics, we need to find a balance between overfitting and underfitting the model**

**Overfitting** (i.e., too many topics) can make it harder for human coders to label the resulting topics, since there is less coherence among the words in each topic. At the same time, the resulting topics overlap less in their words.

**Underfitting** (i.e., too few topics) doesn't produce enough variance, limiting options for statistical analyses. Labeling the resulting topics becomes easier, since each topic comprises a more coherent list of words. At the same time, the resulting topics overlap more in their words, which leads to increased variance in the distribution of topics in each document.

**pyLDAvis visualization** of a model with the selected value for k helps to assess overfitting and underfitting. Reading pyLDAvis:

Left pane:
- The area of each circle represents the prevalence of each topic over the entire corpus 
- The distance between the centers of the circles indicates the similarity between topics (i.e., inter-topic differences)

Right pane:
- If you hover over a particular topic on the left, the histogram on the right side lists the top 30 most relevant terms
- The widths of the gray bars represent the corpus-wide frequencies of each term, and the widths of the red bars represent the topic-specific frequencies of each term
- A slider at the top can adjust the relevance metric (λ); however, for our purposes, be sure it is set to λ = 1. For more information on the relevance metric, see the library documentation.

Documentation for this library can be found here: https://www.aclweb.org/anthology/W14-3110.pdf. 
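
For readers curious about what the λ slider changes, the relevance metric described in the paper linked above combines a term's probability within a topic with its lift over the corpus-wide probability. A minimal sketch, with made-up probabilities and function names of our own:

```python
import math

def relevance(p_word_given_topic, p_word, lam=1.0):
    """Relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1.0 - lam) * math.log(lift)

def rank_terms(terms, lam):
    """Sort terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam=lam), reverse=True)

# Hypothetical (p(w|topic), p(w)) pairs: "goal" is common everywhere, "marathon" is topic-specific
toy_terms = {"goal": (0.20, 0.20), "marathon": (0.10, 0.01)}
print(rank_terms(toy_terms, lam=1.0))  # ['goal', 'marathon']
print(rank_terms(toy_terms, lam=0.0))  # ['marathon', 'goal']
```

At λ = 1 the ranking depends only on how probable a term is within the topic, which is the setting used in this notebook.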

**In the following steps we test LDA model hyper parameters for k in range [3-11].**

## After looking at the perplexity scores for k=1-12 topics, we can see that k=6 topics yielded the lowest perplexity value before the trend increased at k=7 topics. Accordingly, we proceeded with k=6 topics.

In [None]:
## Initializing LDA Models and Parameters
topic_number = 6
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

print("\n***********************************************************************")
print("[INFO] goals Full Dataset LDA Results....")
print("***********************************************************************")

lda_model_goals = LdaModel(corpus=corpus_goals,
                                      id2word=dictionary_goals,
                                      num_topics=topic_number, 
                                      random_state=random_state,
                                      update_every=update_every,
                                      chunksize=chunksize,
                                      passes=passes,
                                      iterations=iterations,
                                      alpha=alpha,
                                      eta=eta,
                                      per_word_topics=per_word_topics)

# log_perplexity returns the per-word likelihood bound; perplexity is 2^(-bound)
print('\nPerplexity (topic_number = {}): '.format(topic_number), np.exp2(-lda_model_goals.log_perplexity(corpus_goals)))

In [None]:
## Analyze Model results
print("\n***********************************************************************")
print("[INFO] goals Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_goals.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

# print("goals.....k = 6...................")
# lda_display = pyLDAvis.gensim.prepare(lda_model_goals, corpus_goals, dictionary_goals)
# pyLDAvis.display(lda_display)

lda_display = pyLDAvis.gensim.prepare(lda_model_goals, corpus_goals, dictionary_goals)
pyLDAvis.display(lda_display)


## However, the k=6 topic model was not deemed coherent by human coders (i.e., it did not explain the data well); thus we determined a k=9 topic LDA model would produce the next simplest model because k=9 topics yielded the lowest perplexity value before the trend flattened out at k=10 topics. 

##### k = 9 topics

In [None]:
## Initializing LDA Models and Parameters
topic_number = 9
random_state=42
update_every=1
chunksize=1800
passes=300
iterations=850
alpha='auto'
eta='auto'
per_word_topics=True

print("\n***********************************************************************")
print("[INFO] goals Full Dataset LDA Results....")
print("***********************************************************************")

lda_model_goals = LdaModel(corpus=corpus_goals,
                                      id2word=dictionary_goals,
                                      num_topics=topic_number, 
                                      random_state=random_state,
                                      update_every=update_every,
                                      chunksize=chunksize,
                                      passes=passes,
                                      iterations=iterations,
                                      alpha=alpha,
                                      eta=eta,
                                      per_word_topics=per_word_topics)

# log_perplexity returns the per-word likelihood bound; perplexity is 2^(-bound)
print('\nPerplexity (topic_number = {}): '.format(topic_number), np.exp2(-lda_model_goals.log_perplexity(corpus_goals)))

In [None]:
## Analyze Model results
print("\n***********************************************************************")
print("[INFO] goals Full Dataset Model Results....")
print("***********************************************************************")

print("\n[INFO] Num_topics: {}\n".format(topic_number))
topics = lda_model_goals.show_topics(num_topics=topic_number, num_words=10, log=True, formatted=True)
for topic in topics:
    print(topic)

# print("goals.....k = 9...................")
# lda_display = pyLDAvis.gensim.prepare(lda_model_goals, corpus_goals, dictionary_goals)
# pyLDAvis.display(lda_display)

lda_display = pyLDAvis.gensim.prepare(lda_model_goals, corpus_goals, dictionary_goals)
pyLDAvis.display(lda_display)

### 4. Saving selected model results
Of the models that fit the data well, k=9 topics yielded the lowest perplexity value before the trend flattened out at k=10 topics. We can also see that the k=9 topics appear to be relatively spread out, with no overlapping topics. Thus, we determined that a k=9-topic LDA model would be the simplest of the models that explain the data well.


To save results from the LDA model with selected parameters and number of topics we 
- rerun the model with k=9
- generate a column that tells us which topic each response contributed the most to
- save the analysis results to an excel file for topic validation
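
The second step boils down to taking, for each document, the topic with the largest share. A minimal sketch on a made-up topic distribution, written in the (topic_id, probability) format that gensim returns for a document; the function name is ours:

```python
def dominant_topic(doc_topics):
    """Return the id of the topic with the highest probability for one document."""
    return max(doc_topics, key=lambda pair: pair[1])[0]

# Hypothetical topic distribution for a single response
toy_doc_topics = [(0, 0.10), (3, 0.65), (7, 0.25)]
print(dominant_topic(toy_doc_topics))  # 3
```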

In [None]:
## Initializing LDA Models and Parameters
topic_number = 9
random_state=42
update_every=1

chunksize=1800
passes=300
iterations=850  # same value as in the model-selection runs above
alpha='auto'
eta='auto'
per_word_topics=True

print("\n***********************************************************************")
print("[INFO] goals Full Dataset LDA Results....")
print("***********************************************************************")

lda_model_goals = LdaModel(corpus=corpus_goals,
                                      id2word=dictionary_goals,
                                      num_topics=topic_number, 
                                      random_state=random_state,
                                      update_every=update_every,
                                      chunksize=chunksize,
                                      passes=passes,
                                      iterations=iterations,
                                      alpha=alpha,
                                      eta=eta,
                                      per_word_topics=per_word_topics)

# log_perplexity returns the per-word likelihood bound; perplexity is 2^(-bound)
print('\nPerplexity (topic_number = {}): '.format(topic_number), np.exp2(-lda_model_goals.log_perplexity(corpus_goals)))

In [None]:
## goals Model Results
print("\n***********************************************************************")
print("[INFO] Goal Inferences Model Results....")
print("***********************************************************************")

print("\n[INFO] Number of topics: {}\n".format(topic_number))
topics = lda_model_goals.show_topics(num_topics=topic_number, num_words=11, log=True, formatted=True)
for topic in topics:
    print(topic)

# print("Goal Inferences .....k = 9...................")
# lda_display = pyLDAvis.gensim.prepare(lda_model_goals, corpus_goals, dictionary_goals)
# pyLDAvis.display(lda_display)

lda_display = pyLDAvis.gensim.prepare(lda_model_goals, corpus_goals, dictionary_goals)
pyLDAvis.display(lda_display)

In [None]:
## Generate a column that tells us which topic each response contributed the most to

#define helpful function
def format_topics_sentences(ldamodel, corpus, texts):
    """Return one row per document: its dominant topic and its topic contributions.

    `texts` is accepted for compatibility with the call below but is not used.
    """
    rows = []

    # Get the main topic of each document
    for row_list in ldamodel[corpus]:
        row = row_list[0] if ldamodel.per_word_topics else row_list
        row = sorted(row, key=lambda x: x[1], reverse=True)

        # Record the dominant topic and the contribution of every reported topic
        raw_frame = {}
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:
                raw_frame['Dominant'] = topic_num

            # the '8' is the number of decimal places kept; it is unrelated to k
            raw_frame['Topic' + str(topic_num)] = round(prop_topic, 8)

        rows.append(raw_frame)

    # Build the frame once (DataFrame.append was removed in pandas 2.0);
    # topics that gensim did not report for a document are left as NaN
    return pd.DataFrame(rows)


df_topic_sents_keywords_goals = format_topics_sentences(ldamodel=lda_model_goals, 
                                                                   corpus=corpus_goals, 
                                                                   texts=goals_stopwords)

#rename index of the dataframe
#df_dominant_topic_goals = df_topic_sents_keywords_goals.reset_index()
df_dominant_topic_goals = df_topic_sents_keywords_goals.reset_index(drop=True)

df_dominant_topic_goals.index.name='Document_No';

print(df_dominant_topic_goals.head(117))

In [None]:
df_dominant_topic_goals.head(812).T

In [None]:
## Generate a data frame to export the results into

goals = np.array(data['goals'])

results = {
    'goals': goals,
    'lda_topics_goals': np.array(df_dominant_topic_goals['Dominant']),
}

#one column per topic, holding each document's contribution to that topic
for t in range(topic_number):
    column = 'topic{}_contrib_lda_topics_goals'.format(t)
    results[column] = np.array(df_dominant_topic_goals['Topic{}'.format(t)])

frame = pd.DataFrame(results)

In [None]:
## Export results to an .xlsx file
frame.to_excel("./output/LDA_results.xlsx")

In [None]:
frame.to_csv("./output/LDA_results.tsv", sep="\t", index=False)

In [None]:
frame.head()