# Initial NLP analysis

```
conda create --name NLP -c conda-forge python=3.10 jupyter pandas numpy matplotlib openpyxl nltk gensim pyldavis
```


# Eventually I should write this as a .py file and import functions here

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# full data file with multiple sheets
filename = 'data/ITP_CourseArtifacts_June 2021_END_of_Course_DeIDENTIFIED.xlsx'

In [None]:
# sheet name for this analysis, containing responses to one question
sheet = 'Course Meta SelfEff'

In [None]:
df = pd.read_excel(filename, sheet_name = sheet)
df

## Look for n-grams

- NLTK (followed this): https://towardsdatascience.com/from-dataframe-to-n-grams-e34e29df3460
- textBlob (haven't tried) : https://levelup.gitconnected.com/simple-nlp-in-python-f5196db63aff


In [None]:
import unicodedata
import re

import nltk
from nltk.corpus import stopwords

import gensim

In [None]:
# this only needs to be run once
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

In [None]:
# add appropriate words that will be ignored in the analysis
additional_stopwords = ['1', '2', 'one', 'two', 'etc']

In [None]:
def preprocess(text, additional_stopwords = None, wlen = 3, stem = True):
    """
    A simple function to clean up the data. After encoding and basic regex parsing,
    every word that is not a stop word and is longer than wlen characters
    is lemmatized and (optionally) stemmed.
    
    originally from here : https://towardsdatascience.com/from-dataframe-to-n-grams-e34e29df3460
    with modifications by AMG
    """
    
    # define the lemmatizer, stemmer and stopwords
    wnl = nltk.stem.WordNetLemmatizer()
    stemmer = nltk.stem.SnowballStemmer('english')
    #stemmer = nltk.stem.PorterStemmer()
    # a set gives fast lookups; None avoids a mutable default argument
    stopwords = set(nltk.corpus.stopwords.words('english')) | set(additional_stopwords or [])
    
    # initial simple regex parsing and create a list of words
    text = (unicodedata.normalize('NFKD', text)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
        .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    
    # pass through the lemmatizer and (optionally) the stemmer
    processed = []
    for word in words:
        if (word not in stopwords and len(word) > wlen):
            w = wnl.lemmatize(word)
            #print(word, wnl.lemmatize(word), stemmer.stem(word), stemmer.stem(w))
            if (stem):# and not w.endswith('e')):
                processed.append(stemmer.stem(w))
            else:
                processed.append(w)
    
    return processed

In [None]:
# testing
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
# join the words with spaces (str(list) would include the brackets, quotes and commas)
string_of_words = ' '.join(original_words)
print(preprocess(string_of_words, additional_stopwords = additional_stopwords, stem = True))
print(preprocess(string_of_words, additional_stopwords = additional_stopwords, stem = False))
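
The encoding and regex step inside `preprocess` can be looked at on its own. The sketch below is a stand-alone illustration (the helper name `normalize_tokens` is made up for this example and is not part of the notebook). It shows that accents are dropped and that apostrophes are removed before the stop-word check, so a contraction such as "don't" becomes "dont", which probably will not match the NLTK stop-word list.

```python
import re
import unicodedata


def normalize_tokens(text):
    """Lowercase, strip accents and punctuation, then split on whitespace."""
    ascii_text = (unicodedata.normalize('NFKD', text)
                  .encode('ascii', 'ignore')
                  .decode('ascii')
                  .lower())
    # drop every character that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', ascii_text).split()


print(normalize_tokens("Café, I don't know!"))
```

If contractions matter for the analysis, one option is to add their apostrophe-free forms to `additional_stopwords`.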

In [None]:
# get all the words in order (excluding the stop words)

# 1. convert the answers column to a list
list_of_answers = df[df.columns[1]].tolist() 

# 2. join the answers into one long string (str() guards against non-text cells such as NaN)
string_of_answers = ' '.join(str(answer) for answer in list_of_answers)

# 3. run through the "preprocess" function that will return a list of (lemmatized) words
# SHOULD I STEM FIRST??
processed_words = preprocess(string_of_answers, additional_stopwords = additional_stopwords, stem = True)
processed_words[:10]

In [None]:
# get the bigrams
bigrams = pd.Series(nltk.ngrams(processed_words, 2)).value_counts()
bigrams

In [None]:
# get the trigrams
trigrams = pd.Series(nltk.ngrams(processed_words, 3)).value_counts()
trigrams
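
Because all answers are joined into one long string before counting, some of the n-grams above may pair the last word of one answer with the first word of the next. A minimal sketch of an alternative, assuming each answer has already been preprocessed into its own token list (the helper names and the toy `docs` are hypothetical), counts n-grams within each answer only:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples, in order."""
    return list(zip(*(tokens[i:] for i in range(n))))


def count_ngrams_per_document(documents, n):
    """Count n-grams inside each document so that none span two documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return counts


# hypothetical, already-preprocessed answers
docs = [['learn', 'new', 'skill'], ['learn', 'new', 'tool']]
print(count_ngrams_per_document(docs, 2).most_common(1))
```

The resulting `Counter` could be passed to `pd.Series` to get a table similar to `bigrams` above.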

In [None]:
# plot the results

N = 20
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))

ind = np.arange(N)

bigrams_plot = bigrams[0:N].sort_values()
ax1.barh(ind, bigrams_plot, 0.9, color = 'gray')
ax1.set_yticks(ind)
_ = ax1.set_yticklabels(bigrams_plot.index.str.join(sep=' '))
_ = ax1.set_title(str(N) + ' Most Frequently Occurring Bigrams')
_ = ax1.set_xlabel('# of Occurrences')

trigrams_plot = trigrams[0:N].sort_values()
ax2.barh(ind, trigrams_plot, 0.9, color = 'gray')
ax2.set_yticks(ind)
_ = ax2.set_yticklabels(trigrams_plot.index.str.join(sep=' '))
_ = ax2.set_title(str(N) + ' Most Frequently Occurring Trigrams')
_ = ax2.set_xlabel('# of Occurrences')

plt.subplots_adjust(wspace = 0.9, left = 0.15, right = 0.99, top = 0.95, bottom = 0.07)

plt.savefig('ngrams.png')

## Topic modeling

- NLTK and gensim : https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925
- NLTK and gensim : https://towardsdatascience.com/introduction-to-nlp-part-5b-unsupervised-topic-model-in-python-ab04c186f295
- pyLDAvis : https://www.projectpro.io/article/10-nlp-techniques-every-data-scientist-should-know/415#toc-10
- pyLDAvis : https://neptune.ai/blog/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know

This section tries NLTK + gensim with the Latent Dirichlet Allocation (LDA) algorithm, which uses unsupervised learning to extract the main topics (each topic being a set of weighted words) that occur in a collection of text samples. The first link above has a very good general explanation of the method, and [here's the Jupyter notebook on their github repo](https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb).

In [None]:
# preprocess each answer separately
processed_answers = []
for answer in list_of_answers:
    # str() guards against non-text cells such as NaN
    processed_answers.append(preprocess(str(answer), additional_stopwords = additional_stopwords))
processed_answers

In [None]:
# convert to a "bag of words"
dictionary = gensim.corpora.Dictionary(processed_answers)

# filter (optional)
# remove words appearing in fewer than no_below documents
# remove words appearing in more than no_above (fraction) of all documents
no_below = 15
no_above = 1.0 # 1.0 keeps every word, i.e. this filter is not used
keep_n = int(1e5)
dictionary.filter_extremes(no_below = no_below, no_above = no_above, keep_n = keep_n)

# Create the bag-of-words representation of each document, i.e., for each document a list of
# (word id, count) pairs reporting which words appear and how many times. Save this to 'bow_corpus'
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_answers]
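
To make the bag-of-words format concrete, here is a small pure-Python sketch of the same idea. It is only an illustration: the function names are made up, and gensim may assign different integer ids than this toy vocabulary does.

```python
from collections import Counter


def build_vocab(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# hypothetical, already-preprocessed answers
docs = [['course', 'help', 'course'], ['help', 'learn']]
vocab = build_vocab(docs)
print(vocab)
print([to_bow(doc, vocab) for doc in docs])
```

Words removed by a filter such as `filter_extremes` would simply be absent from the vocabulary and therefore skipped.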

In [None]:
# Checking dictionary created
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

In [None]:
# check the bag of words
answer_num = 0
bow_answer = bow_corpus[answer_num]

for word_id, n_times in bow_answer:
    print(f'Word {word_id} ("{dictionary[word_id]}") appears {n_times} time(s).')

## Running LDA using Bag of Words

*This markdown is adapted from [here](https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb).*

**We will be running LDA using multiple CPU cores to parallelize and speed up model training.**

Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of worker processes to use for parallelization. By default, gensim uses all available cores minus one.
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave these at their default values for now (a symmetric prior of `1/num_topics`).
    - Alpha is the prior on the per-document topic distribution.
        * High alpha: Every document has a mixture of all topics (documents appear similar to each other).
        * Low alpha: Every document has a mixture of very few topics.

    - Eta is the prior on the per-topic word distribution.
        * High eta: Each topic has a mixture of most words (topics appear similar to each other).
        * Low eta: Each topic has a mixture of few words.
* **passes** is the number of training passes through the corpus.

Documentation here: https://radimrehurek.com/gensim/models/ldamodel.html
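
The effect of alpha described above can be seen without training a model. The sketch below uses only the standard library and assumes a symmetric Dirichlet prior. The function names, the 8 topics, and the two alpha values are choices made for this illustration, not values taken from gensim. It samples topic mixtures and reports the average weight of the dominant topic.

```python
import random


def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) prior."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, num_topics=8, num_samples=2000, seed=0):
    """Average weight of the dominant topic across sampled mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, num_topics, rng))
               for _ in range(num_samples)) / num_samples


print(mean_largest_share(0.1))   # low alpha: one topic tends to dominate
print(mean_largest_share(10.0))  # high alpha: weight is spread across topics
```

The same reasoning applies to eta, with words in a topic taking the place of topics in a document.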

# I NEED TO CHECK ALL THESE PARAMETERS

In [None]:
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 5, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# Train your lda model using gensim.models.LdaMulticore 
lda_model =  gensim.models.LdaMulticore(
    bow_corpus, 
    num_topics = 8, 
    id2word = dictionary,  
    passes = 20,
    workers = 2
)

In [None]:
# For each topic, explore the words occurring in that topic and their relative weights
# Then a human would need to give each topic a name (or "theme")
for idx, topic in lda_model.print_topics():
    print(f'Topic: {idx}\nWords: {topic}\n')

In [None]:
# Compute Coherence Score
# I need to look up the args here as well
coherence_model_lda = gensim.models.coherencemodel.CoherenceModel(
    model = lda_model, 
    texts = processed_answers, 
    dictionary = dictionary, 
    coherence = 'c_v'
)
coherence_lda = coherence_model_lda.get_coherence()
print('gensim coherence score: ', coherence_lda)

## Optimization (todo)

I could calculate the coherence score for several settings of the LDA model arguments (e.g., the number of topics), and then pick the setting with the best coherence score.
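
A minimal sketch of that search, assuming the training and scoring steps are wrapped in two callables (the function name `sweep_num_topics` and the parameters `train` and `score` are hypothetical). The commented lines show how the callables might be built from the objects defined earlier in this notebook; that part is untested.

```python
def sweep_num_topics(candidates, train, score):
    """Train one model per topic count; return (best_k, {k: score})."""
    scores = {k: score(train(k)) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# In this notebook the callables could be (untested, reusing names defined above):
# train = lambda k: gensim.models.LdaMulticore(
#     bow_corpus, num_topics = k, id2word = dictionary, passes = 20, workers = 2)
# score = lambda m: gensim.models.coherencemodel.CoherenceModel(
#     model = m, texts = processed_answers, dictionary = dictionary,
#     coherence = 'c_v').get_coherence()
# best_k, scores = sweep_num_topics(range(2, 13), train, score)
```

LDA training is stochastic, so scores can vary between runs. Passing a fixed `random_state` to the model may make the comparison more repeatable.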

## Visualization using pyLDAvis

- https://nbviewer.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb
- https://github.com/bmabey/pyLDAvis
- https://nbviewer.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb

Most of the visualization is self-explanatory, but the slider to adjust the "relevance metric" takes some reading.
From here: https://we1s.ucsb.edu/research/we1s-tools-and-software/topic-model-observatory/tmo-guide/tmo-guide-pyldavis/

"A “relevance metric” slider scale at the top of the right panel controls how the words for a topic are sorted. As defined in the article by Sievert and Shirley (the creators of LDAvis, on which pyLDAvis is based), “relevance” combines two different ways of thinking about the degree to which a word is associated with a topic:

On the one hand, we can think of a word as highly associated with a topic if its frequency in that topic is high. By default the lambda (λ) value in the slider is set to “1,” which sorts words by their frequency in the topic (i.e., by the length of their red bars).

On the other hand, we can think of a word as highly associated with a topic if its “lift” is high. “Lift”–a term that Sievert and Shirley borrow from research on topic models by others–means basically how much a word’s frequency sticks out in a topic above the baseline of its overall frequency in the model (i.e., “the ratio of a term’s probability within a topic to its marginal probability across the corpus,” or the ratio between its red bar and blue bar).

By default, pyLDAvis is set for λ = 1, which sorts words just by their frequency within the specific topic (by their red bars). By contrast, setting λ = 0 sorts words by their “lift.” This means that words whose red bars are nearly as long as their blue bars will be sorted at the top."
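
The relevance described in the quote is defined by Sievert and Shirley as λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)). The sketch below evaluates that formula directly; the two words and their probabilities are made up for illustration and do not come from this dataset.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


def rank_words(topic_probs, corpus_probs, lam):
    """Sort a topic's words from most to least relevant for a given lambda."""
    return sorted(topic_probs,
                  key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
                  reverse=True)


# made-up probabilities for illustration only
topic_probs = {'course': 0.30, 'feedback': 0.10}   # p(word | topic)
corpus_probs = {'course': 0.25, 'feedback': 0.02}  # p(word) across the corpus

print(rank_words(topic_probs, corpus_probs, lam = 1.0))  # by frequency in the topic
print(rank_words(topic_probs, corpus_probs, lam = 0.0))  # by lift
```

With λ = 1 the common word "course" ranks first, while with λ = 0 the rarer but more topic-specific "feedback" moves to the top.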

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models

In [None]:
pyLDAvis.enable_notebook()

In [None]:
pyLDAvis.gensim_models.prepare(lda_model, bow_corpus, dictionary)