# NLP analysis

```
conda create --name NLP -c conda-forge python=3.10 jupyter pandas numpy matplotlib openpyxl nltk gensim pyldavis spacy
```

In [None]:
## If you are running this for the first time on a new installation, uncomment below and run this cell
## (This only needs to be run once.)

# import nltk
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# import spacy
# spacy.cli.download('en_core_web_sm')

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# import my code (and set to autoreload <-- only necessary while coding/debugging)
%load_ext autoreload
%autoreload 2

from NLPforISP import *

## Read in the data file

In [None]:
# full data file with multiple sheets
filename = 'data/ITP_CourseArtifacts_June 2021_END_of_Course_DeIDENTIFIED.xlsx'

# sheet name for this analysis, containing responses to one question
#sheet = 'Course Meta SelfEff'
sheet = 'Course Meta App'

df = pd.read_excel(filename, sheet)
df

## Get the bigrams and trigrams and create bar charts of the results
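
As a rough illustration of what n-gram counting involves, here is a minimal sketch using only the standard library. The count_ngrams helper is hypothetical and is not the getNgrams function from NLPforISP (whose implementation is not shown here and may do more preprocessing). It simply lowercases the text, splits on whitespace, drops stopwords, and counts consecutive runs of n words.

```python
from collections import Counter

def count_ngrams(text, n, stopwords=()):
    """Count n-grams of lowercased, whitespace-separated tokens after dropping stopwords."""
    stop = set(stopwords)
    tokens = [t for t in text.lower().split() if t not in stop]
    # zip n staggered copies of the token list to form consecutive n-token windows
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)

# made-up sample text, for illustration only
sample = "active learning helps students and active learning helps instructors etc"
bigram_counts = count_ngrams(sample, 2, stopwords=["and", "etc"])
print(bigram_counts.most_common(2))
```

Note that stopwords are removed before the windows are formed, so the words on either side of a removed stopword become adjacent in this sketch.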

In [None]:
# add appropriate words that will be ignored in the analysis
additional_stopwords = ['1', '2', 'one', 'two', 'etc']

# get a string of the words contained in all the answers from this DataFrame
string_of_answers = getStringOfWords(df, 1)

# get the bigrams and trigrams
bigrams = getNgrams(string_of_answers, 2, additional_stopwords = additional_stopwords)
trigrams = getNgrams(string_of_answers, 3, additional_stopwords = additional_stopwords)

In [None]:
# create a plot of the bigrams and trigrams
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
N = 20
plotNgrams(bigrams, N, ax = ax1)
plotNgrams(trigrams, N, ax = ax2)
_ = ax1.set_title(str(N) + ' Most Frequently Occurring Bigrams')
_ = ax2.set_title(str(N) + ' Most Frequently Occurring Trigrams')
plt.subplots_adjust(wspace = 0.6, left = 0.15, right = 0.99, top = 0.95, bottom = 0.07)

f.savefig('ngrams_' + sheet.replace(' ','') + '.png', bbox_inches = 'tight')

## Topic modeling

This section uses NLTK and gensim to run the Latent Dirichlet Allocation (LDA) algorithm, an unsupervised learning method that extracts the main topics (each represented as a set of words) that occur in a collection of text samples.
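
LDA does not operate on raw text; it operates on word counts per document. The sketch below shows, under simplifying assumptions, what a "dictionary" (a token-to-id mapping) and a "bag of words" (per-document id counts) look like. The helpers build_dictionary and doc_to_bow are hypothetical stand-ins written with the standard library only. Their no_below, no_above, and keep_n arguments are meant to mimic the same-named arguments passed to runLDATopicModel, which presumably follow gensim's Dictionary.filter_extremes.

```python
from collections import Counter

def build_dictionary(docs, no_below=1, no_above=1.0, keep_n=None):
    """Map each kept token to an integer id, filtering tokens by document frequency."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each token
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = [t for t, freq in doc_freq.items()
            if freq >= no_below and freq / n_docs <= no_above]
    # most frequent tokens first; ties broken alphabetically so the result is deterministic
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return {token: idx for idx, token in enumerate(kept)}

def doc_to_bow(doc, dictionary):
    """Return a sorted list of (token_id, count) pairs for one tokenized document."""
    counts = Counter(dictionary[t] for t in doc if t in dictionary)
    return sorted(counts.items())

# made-up, already tokenized answers, for illustration only
toy_docs = [
    ["student", "group", "learn", "group"],
    ["student", "group", "discussion"],
    ["student", "feedback"],
]
toy_dictionary = build_dictionary(toy_docs, no_below=2, no_above=1.0)
toy_bow = [doc_to_bow(doc, toy_dictionary) for doc in toy_docs]
print(toy_dictionary)
print(toy_bow)
```

With no_below = 2, tokens that occur in only one toy document are dropped, which is the same kind of filtering that no_below = 15 applies to the real answers.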

In [None]:
# run the topic model (which also generates a "dictionary" and a "bag of words")
dictionary, bow_corpus, lda_model, perplexity, coherence = runLDATopicModel(df, 1, 5, workers = 6, 
    additional_stopwords = additional_stopwords, no_below = 15, no_above = 1, keep_n = int(1e5),
    random_state = 1234)

In [None]:
# check the dictionary
printDictionary(dictionary, 10)

In [None]:
# check the bag of words
printBagOfWords(dictionary, bow_corpus, 0)

In [None]:
# check the topic model
printLDATopicModel(lda_model)

## Optimization

Run a series of LDA models with different numbers of topics, then plot the coherence and perplexity scores to try to identify the optimal number of topics.

In [None]:
num_topics = np.arange(10) + 1
dictionary, bow_corpus, lda_model, perplexity, coherence = runLDATopicModel(df, 1, num_topics, workers = 6, 
    additional_stopwords = additional_stopwords, no_below = 15, no_above = 1, keep_n = int(1e5),
    random_state = 1234)

In [None]:
# choose the index of the best model by selecting the maximum coherence score
# choose the 'c_v' measure of coherence for this

best_index = np.argmax(coherence['c_v'])
num_topics[best_index]

In [None]:
# plot the results
# higher coherence is better
# lower perplexity is better

f, (ax1, ax2) = plotLDAMetrics(num_topics, coherence, perplexity, best_index)
f.savefig('metrics_' + sheet.replace(' ','') + '.png', bbox_inches = 'tight')

In [None]:
# calculate the probabilities for each answer being in each topic
df_p = getLDAProbabilities(lda_model[best_index], bow_corpus, df, 1)
df_p

In [None]:
# plot a KDE of the probability distributions for each topic
f, ax = plotTopLDAProbabilitiesKDE(df_p)#, bw_method = 0.3)
f.savefig('probabilities_' + sheet.replace(' ','') + '.png', bbox_inches = 'tight')

In [None]:
# get summary information about the topics
df_p.describe()

In [None]:
# print the answers that have the maximum probability for each topic
printBestLDATopicSentences(df_p, dictionary, lda_model[best_index], n_answers = 20, n_sentences = 3)

## Visualization using pyLDAvis

- https://nbviewer.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb
- https://github.com/bmabey/pyLDAvis
- https://nbviewer.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb

Most of the visualization is self-explanatory, but the slider that adjusts the "relevance metric" takes some reading. 
The following explanation is quoted from: https://we1s.ucsb.edu/research/we1s-tools-and-software/topic-model-observatory/tmo-guide/tmo-guide-pyldavis/

"A “relevance metric” slider scale at the top of the right panel controls how the words for a topic are sorted. As defined in the article by Sievert and Shirley (the creators of LDAvis, on which pyLDAvis is based), “relevance” combines two different ways of thinking about the degree to which a word is associated with a topic:

On the one hand, we can think of a word as highly associated with a topic if its frequency in that topic is high. By default the lambda (λ) value in the slider is set to “1,” which sorts words by their frequency in the topic (i.e., by the length of their red bars).

On the other hand, we can think of a word as highly associated with a topic if its “lift” is high. “Lift”–a term that Sievert and Shirley borrow from research on topic models by others–means basically how much a word’s frequency sticks out in a topic above the baseline of its overall frequency in the model (i.e., “the ratio of a term’s probability within a topic to its marginal probability across the corpus,” or the ratio between its red bar and blue bar).

By default, pyLDAvis is set for λ = 1, which sorts words just by their frequency within the specific topic (by their red bars). By contrast, setting λ = 0 sorts words by their “lift.” This means that words whose red bars are nearly as long as their blue bars will be sorted at the top."
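
The relevance formula from Sievert and Shirley that the slider controls is relevance(w, t | λ) = λ log p(w|t) + (1 − λ) log(p(w|t) / p(w)). A minimal sketch with made-up probabilities (the three words and their values are purely illustrative and are not taken from this data set) shows how the ordering changes between λ = 1 and λ = 0:

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w)), where the second term is the log of the lift."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))

# toy values: word -> (p(w | topic), p(w) across the corpus)
words = {
    "student": (0.20, 0.18),  # frequent in the topic, but frequent everywhere
    "rubric": (0.05, 0.01),   # rarer, but concentrated in this topic
    "class": (0.10, 0.12),
}

def rank(lam):
    """Return the toy words sorted from most to least relevant for a given lambda."""
    return sorted(words, key=lambda w: relevance(*words[w], lam), reverse=True)

print(rank(1.0))  # ordered by frequency within the topic
print(rank(0.0))  # ordered by lift
```

Intermediate values of λ blend the two orderings, which is why moving the slider gradually promotes topic-specific words over generally common ones.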

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models

In [None]:
pyLDAvis.enable_notebook()

In [None]:
# Note: best_index was chosen above (maximum 'c_v' coherence) from the list of models in lda_model
pyLDAvis.gensim_models.prepare(lda_model[best_index], bow_corpus, dictionary)

## Term Frequency – Inverse Document Frequency (TF-IDF) analysis

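TF-IDF (computed here with scikit-learn’s TfidfVectorizer) weights each word by how often it appears in a given answer (term frequency), scaled down by how many answers contain that word (inverse document frequency). Words that are frequent in one answer but rare across the whole collection receive the highest scores.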

https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html
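
To make the weighting concrete, here is a standard-library sketch of the calculation. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is my understanding of TfidfVectorizer's default settings. The tfidf helper and the three sample answers are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows), where rows[i][j] is the L2-normalized TF-IDF of vocab[j] in docs[i]."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    # number of documents that contain each term
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    # smoothed inverse document frequency (assumed to match scikit-learn's default)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        # scale each row to unit length; guard against an all-zero row
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

# made-up, already preprocessed answers, for illustration only
sample_answers = ["group work helped", "group discussion helped", "feedback helped"]
vocab, rows = tfidf(sample_answers)
for term, score in zip(vocab, rows[2]):
    print(f"{term:12s}{score:.3f}")
```

In the third sample answer, "feedback" scores higher than "helped" because "helped" appears in every answer and so carries little distinguishing weight.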

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# convert the answers column to a list
list_of_answers = df[df.columns[1]].tolist() 

# preprocess each answer separately
processed_answers = []
for answer in list_of_answers:
    processed_answers.append(' '.join(preprocess(answer, additional_stopwords = additional_stopwords)))

In [None]:
# TF-IDF (word level)
vectorizer = TfidfVectorizer(analyzer='word')
tfidf_vector = vectorizer.fit_transform(processed_answers)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), columns = vectorizer.get_feature_names_out())
tfidf_df

In [None]:
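# limit to only the most important words:
# keep words whose TF-IDF scores, summed over all answers, total at least 1% of the number of answers,
# and drop answers whose TF-IDF row is entirely zero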
lim = 0.01*tfidf_df.shape[0]
tfidf_df_cull = tfidf_df.loc[(tfidf_df.sum(axis=1) != 0), (tfidf_df.sum(axis=0) >= lim)]
tfidf_df_cull

#  TODO


## Try Mallet LDA?

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ <-- this also contains some great additional steps to check out

Following the steps from here: https://radimrehurek.com/gensim_3.8.3/models/wrappers/ldamallet.html

(Working in WSL to compile the Mallet code.)

```
sudo apt update
sudo apt-get install default-jdk
git clone https://github.com/mimno/Mallet.git
cd Mallet/
ant
```

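But the gensim.models.wrappers module (including LdaMallet) was removed in gensim 4.0, so the cell below only runs with an older release such as gensim 3.8.3.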

In [None]:
# NOTE: requires gensim < 4.0 (e.g., 3.8.3); the wrappers module was removed in gensim 4.0
import gensim

path_to_mallet_binary = "/c/Users/ageller/NUIT/projects/BennettGoldberg/Mallet/bin/mallet"

dictionary, bow_corpus, processed_answers = getBagOfWords(df, 1, additional_stopwords = additional_stopwords, no_below = 15, no_above = 1, keep_n = int(1e5))

model = gensim.models.wrappers.LdaMallet(path_to_mallet_binary, corpus = bow_corpus, num_topics = 5, 
                                         id2word = dictionary)
# common_corpus (from the gensim docs example) is not defined here, so use the first answer in bow_corpus
vector = model[bow_corpus[0]]  # LDA topics of a single document