# NLP analysis

```
conda create --name nlp -c conda-forge python=3.10 jupyter pandas numpy matplotlib openpyxl nltk gensim pyldavis spacy scikit-learn kneed
conda activate nlp 
## pip install bertopic
## pip install git+https://github.com/boudinfl/pke.git 
```

In [None]:
## If you are running this for the first time on a new installation, uncomment below and run this cell
## (This only needs to be run once.)

# import nltk
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# import spacy
# spacy.cli.download('en_core_web_sm')

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# set to autoreload <-- only necessary while coding/debugging
%load_ext autoreload
%autoreload 2

# import my code 
from NLPforISP import *

## Read in the data file

In [None]:
# full data file with multiple sheets
filename = 'data/ITP_CourseArtifacts_June 2021_END_of_Course_DeIDENTIFIED.xlsx'

# sheet name for this analysis, containing responses to one question
#sheet = 'Course Meta SelfEff'
sheet = 'Course Meta App'

df = pd.read_excel(filename, sheet)
df

## Get the bigrams and trigrams and create bar charts of the results
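
`getNgrams` lives in `NLPforISP` and is not shown in this notebook. As a rough idea of what n-gram counting involves, here is a minimal sketch that assumes simple lowercasing and whitespace tokenization (the real function may clean the text differently). The names `count_ngrams`, `toy_text`, and `toy_bigrams` are made up for this illustration.

```python
from collections import Counter

def count_ngrams(text, n, stopwords=()):
    """Count n-grams of whitespace-separated, lowercased tokens, skipping stopwords."""
    stop = set(stopwords)
    tokens = [t for t in text.lower().split() if t not in stop]
    # zip n staggered copies of the token list to form consecutive n-word tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(' '.join(g) for g in grams)

# hypothetical text, only for illustration
toy_text = 'active learning helps students and active learning helps instructors etc'
toy_bigrams = count_ngrams(toy_text, 2, stopwords=['etc', 'and'])
print(toy_bigrams.most_common(2))
```

Note that removing stopwords before building the n-grams joins words that were not adjacent in the original text (here, "students active").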

In [None]:
# add appropriate words that will be ignored in the analysis
additional_stopwords = ['1', '2', 'one', 'two', 'etc']

# get a string of the words contained in all the answers from this DataFrame
string_of_answers = getStringOfWords(df, 1)

# get the bigrams and trigrams
bigrams = getNgrams(string_of_answers, 2, additional_stopwords = additional_stopwords)
trigrams = getNgrams(string_of_answers, 3, additional_stopwords = additional_stopwords)

In [None]:
# create a plot of the bigrams and trigrams
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
N = 20
plotNgrams(bigrams, N, ax = ax1)
plotNgrams(trigrams, N, ax = ax2)
_ = ax1.set_title(str(N) + ' Most Frequently Occurring Bigrams')
_ = ax2.set_title(str(N) + ' Most Frequently Occurring Trigrams')
plt.subplots_adjust(wspace = 0.6, left = 0.15, right = 0.99, top = 0.95, bottom = 0.07)

f.savefig('ngrams_' + sheet.replace(' ','') + '.png', bbox_inches = 'tight')

## Topic modeling

This section uses NLTK and gensim to run the Latent Dirichlet Allocation (LDA) algorithm, an unsupervised learning method that extracts the main topics (each one a set of words) occurring in a collection of text samples.

In [None]:
# run the topic model (which also generates a "dictionary" and a "bag of words")
dictionary, bow_corpus, lda_model, perplexity, coherence = runLDATopicModel(df, 1, 5, workers = 6, 
    additional_stopwords = additional_stopwords, no_below = 15, no_above = 1, keep_n = int(1e5),
    random_state = 1234)

In [None]:
# check the dictionary
printDictionary(dictionary, 10)

In [None]:
# check the bag of words
printBagOfWords(dictionary, bow_corpus, 0)

In [None]:
# check the topic model
printLDATopicModel(lda_model)

## Optimization

Run a series of LDA models and plot the coherence and perplexity scores to try to identify the optimal number of topics

In [None]:
num_topics = np.arange(10) + 1
dictionary, bow_corpus, lda_model, perplexity, coherence = runLDATopicModel(df, 1, num_topics, workers = 6, 
    additional_stopwords = additional_stopwords, no_below = 15, no_above = 1, keep_n = int(1e5),
    random_state = 1234)

In [None]:
# choose the index of the best model by selecting the maximum coherence score
# choose the 'c_v' measure of coherence for this

best_index = np.argmax(coherence['c_v'])
num_topics[best_index]

In [None]:
# plot the results
# higher coherence is better
# lower perplexity is better

f, (ax1, ax2) = plotLDAMetrics(num_topics, coherence, perplexity, best_index)
f.savefig('metrics_' + sheet.replace(' ','') + '.png', bbox_inches = 'tight')

In [None]:
# calculate the probabilities for each answer being in each topic
df_p = getLDAProbabilities(lda_model[best_index], bow_corpus, df, 1)
df_p

In [None]:
# plot a KDE of the probability distributions for each topic
f, ax = plotTopLDAProbabilitiesKDE(df_p)#, bw_method = 0.3)
f.savefig('probabilities_' + sheet.replace(' ','') + '.png', bbox_inches = 'tight')

In [None]:
# get summary information about the topics
df_p.describe()

In [None]:
# print the answers that have the maximum probability for each topic
printBestLDATopicSentences(df_p, dictionary, lda_model[best_index], n_answers = 20, n_sentences = 3)

## Visualization using pyLDAvis

- https://nbviewer.org/github/bmabey/hacker_news_topic_modelling/blob/master/HN%20Topic%20Model%20Talk.ipynb
- https://github.com/bmabey/pyLDAvis
- https://nbviewer.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb

Most of the visualization is self-explanatory, but the slider that adjusts the "relevance metric" takes some reading.
From here: https://we1s.ucsb.edu/research/we1s-tools-and-software/topic-model-observatory/tmo-guide/tmo-guide-pyldavis/

"A “relevance metric” slider scale at the top of the right panel controls how the words for a topic are sorted. As defined in the article by Sievert and Shirley (the creators of LDAvis, on which pyLDAvis is based), “relevance” combines two different ways of thinking about the degree to which a word is associated with a topic:

On the one hand, we can think of a word as highly associated with a topic if its frequency in that topic is high. By default the lambda (λ) value in the slider is set to “1,” which sorts words by their frequency in the topic (i.e., by the length of their red bars).

On the other hand, we can think of a word as highly associated with a topic if its “lift” is high. “Lift”–a term that Sievert and Shirley borrow from research on topic models by others–means basically how much a word’s frequency sticks out in a topic above the baseline of its overall frequency in the model (i.e., the “the ratio of a term’s probability within a topic to its marginal probability across the corpus,” or the ratio between its red bar and blue bar).

By default, pyLDAvis is set for λ = 1, which sorts words just by their frequency within the specific topic (by their red bars).  By contrast, setting λ = 0 sorts words by their “lift.” This means that words whose red bars are nearly as long as their blue bars will be sorted at the top."
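
To make the slider concrete: Sievert and Shirley define relevance as λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), where p(w|t) is the word's probability within the topic and p(w) is its probability across the corpus. The sketch below just evaluates that formula. The two words and their probabilities are made-up numbers, and `relevance`, `toy_terms`, and `rank` are names I invented here (this is not the pyLDAvis API).

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Sievert & Shirley relevance: lam*log p(w|t) + (1-lam)*log(p(w|t)/p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# hypothetical probabilities: (p(w | topic), p(w) over the whole corpus)
toy_terms = {
    'student': (0.10, 0.09),    # frequent everywhere, so low lift
    'syllabus': (0.03, 0.004),  # rarer, but concentrated in this topic
}

def rank(lam):
    """Sort the toy words from most to least relevant for a given lambda."""
    return sorted(toy_terms, key=lambda w: relevance(*toy_terms[w], lam), reverse=True)

print(rank(1.0))  # ordering by within-topic frequency
print(rank(0.0))  # ordering by lift
```

With these numbers, "student" ranks first at λ = 1 and "syllabus" ranks first at λ = 0, which is the reordering the slider produces.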

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models

In [None]:
pyLDAvis.enable_notebook()

In [None]:
# Note: I chose the best index from the lda_models array while plotting the coherence and perplexity metrics
pyLDAvis.gensim_models.prepare(lda_model[best_index], bow_corpus, dictionary)

## Term Frequency – Inverse Document Frequency (TF-IDF) analysis

TF-IDF (using scikit-learn’s TfidfVectorizer) assigns each word in a document a weighted importance score: the word's frequency within that document, scaled down by how many documents in the collection contain the word. Here each survey answer is treated as one document.

https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html
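
For intuition, here is a hand-rolled sketch of the textbook weighting, tf × log(N / df). This is not what `TfidfVectorizer` computes exactly: by default scikit-learn smooths the IDF term and L2-normalizes each row, so its numbers will differ. The function `tfidf` and the three toy documents are made up for this illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return weights

# hypothetical documents, only for illustration
toy_docs = ['inclusive syllabus design',
            'inclusive classroom climate',
            'inclusive assessment design']
toy_weights = tfidf(toy_docs)
print(toy_weights[0])
```

A word that appears in every document ("inclusive") gets a weight of zero, while a word unique to one document ("syllabus") gets the highest weight in that document.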

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
additional_stopwords = ['1', '2', 'one', 'two', 'etc', 'also']
dictionary, bow_corpus, processed_answers = getBagOfWords(df, 1,  additional_stopwords = additional_stopwords)
processed_answers_list = [' '.join(x) for x in processed_answers]
vocab = [v for k, v in dictionary.iteritems()]

In [None]:
# TF-IDF (word level)
vectorizer = TfidfVectorizer(analyzer = 'word', ngram_range = (1,2), min_df = 0.01, vocabulary = vocab)
tfidf_vector = vectorizer.fit_transform(processed_answers_list)

tfidf_df = pd.DataFrame(tfidf_vector.toarray(), columns = vectorizer.get_feature_names_out())
tfidf_df

# k-means

- using the TF-IDF vectorizer (in sklearn) from above and then sklearn's KMeans
- this method requires knowing the number of clusters you want
- https://towardsdatascience.com/clustering-product-names-with-python-part-1-f9418f8705c8
- https://towardsdatascience.com/clustering-product-names-with-python-part-2-648cc54ca2ac
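
Since k-means needs the number of clusters up front, the cells below scan a range of k values and look for the "elbow" in the SSE curve using `kneed`. As a rough illustration of the idea (a simplified heuristic, not kneed's actual implementation), the sketch below picks the point farthest below the straight line joining the ends of the curve. The function `find_elbow` and the toy SSE values are made up.

```python
def find_elbow(ks, sse):
    """Return the k whose (k, sse) point lies farthest below the chord joining the endpoints."""
    x0, y0, x1, y1 = ks[0], sse[0], ks[-1], sse[-1]
    # normalize both axes to [0, 1] so that neither dominates the comparison
    xs = [(k - x0) / (x1 - x0) for k in ks]
    ys = [(s - y1) / (y0 - y1) for s in sse]
    # after normalization the chord runs from (0, 1) to (1, 0), i.e. x + y = 1;
    # for a convex, decreasing curve the elbow is the point with the largest gap below it
    gaps = [1 - (x + y) for x, y in zip(xs, ys)]
    return ks[gaps.index(max(gaps))]

# hypothetical SSE values that drop quickly and then flatten
toy_ks = [2, 3, 4, 5, 6, 7, 8]
toy_sse = [100.0, 60.0, 35.0, 30.0, 27.0, 25.0, 24.0]
print(find_elbow(toy_ks, toy_sse))
```

This assumes the curve is decreasing and roughly convex, which matches the `curve = "convex", direction = "decreasing"` settings passed to `KneeLocator` below.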

In [None]:
from sklearn.cluster import KMeans
from kneed import KneeLocator

In [None]:
#Test increments of clusters using elbow method
sse = {}
for k in np.arange(2,20):
    kmeans = KMeans(n_clusters = k, max_iter = 1000).fit(tfidf_vector)
    sse[k] = kmeans.inertia_

In [None]:
# find the elbow in this curve
x = list(sse.keys())
y = list(sse.values())
kneedle = KneeLocator(x, y, S = 0.0, curve = "convex", direction = "decreasing", online = False, interp_method = "interp1d")

print(kneedle.knee, kneedle.elbow)

f,ax = plt.subplots()
ax.axvline(kneedle.elbow, linestyle = '--', color = 'gray')
ax.plot(x,y, '-o', color = 'k')
ax.set_xlabel('number of clusters')
ax.set_ylabel('SSE')

In [None]:
n_clusters = kneedle.elbow

km_model = KMeans(n_clusters = n_clusters)
km_model.fit(tfidf_vector)

result = pd.concat([df, 
                    pd.DataFrame(tfidf_vector.toarray(),
                                 columns = vectorizer.get_feature_names_out()
                                )
                   ],axis=1)

result['cluster'] = km_model.predict(tfidf_vector)

column_number = 1
num_words_to_keep = 5

# Label each cluster with the word(s) that all of its entries have in common
clusters = result['cluster'].unique()
labels = []
for i in range(len(clusters)):
    subset = result[result['cluster'] == clusters[i]]
    exclude = [result.columns[j] for j in range(column_number + 1)] + ['cluster']
    subset_words = subset.drop(exclude, axis = 1)

    # count the number of times each word appears and take the top N
    count = subset_words.astype(bool).sum(axis = 0).sort_values(ascending = False)
    words = ' '.join(count[0:num_words_to_keep].index)
    labels.append(words)

labels_table = pd.DataFrame(zip(clusters,labels), columns=['cluster','label']).sort_values('cluster')
# result_labelled = pd.merge(result,labels_table,on = 'cluster',how = 'left')


labels_table


In [None]:
# get sentences from the closest answers to the cluster
dist = km_model.transform(tfidf_vector)

# get a list of rows to remove that have <= 1 word in the matrix
row_sums = tfidf_vector.toarray().astype(bool).sum(axis = 1)
result_prune = result[row_sums > 1]
dist_prune = dist[row_sums > 1]


import spacy  # needed here unless NLPforISP already exposes it through the * import
nlp = spacy.load('en_core_web_sm')

n_answers = 10
n_sentences = 2

for i in range(len(clusters)):
    print(f'********** Cluster {i} **********')
    print(labels_table.loc[labels_table['cluster'] == i]['label'].values)
    print('\n')
    
    nearest_answer_indices = np.argsort(dist_prune[:,i])[:n_answers]
    nearest_answers = result_prune.iloc[nearest_answer_indices]
    # print(nearest_answers['cluster'])
    
    # combine these into one long text
    combined_answers = nearest_answers[nearest_answers.columns[column_number]].str.cat(sep=' ')
    
    # split this into sentences
    doc = nlp(combined_answers)
    sentences = np.array([s.text for s in doc.sents])
    sentences_processed = []
    for s in doc.sents:
        text = preprocess(s.text, additional_stopwords = additional_stopwords)
        sentences_processed.append(' '.join(text))
    
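    # get TF-IDF vectors for these sentences, reusing the vocabulary and IDF weights
    # that the k-means model was trained on (fit_transform would refit the IDF weights)
    tfidf_vector_sentences = vectorizer.transform(sentences_processed)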
    
    # get the distances
    dist_sentences = km_model.transform(tfidf_vector_sentences)

    # prune as above
    row_sums = tfidf_vector_sentences.toarray().astype(bool).sum(axis = 1)
    sentences_prune = sentences[row_sums > 1]
    dist_sentences_prune = dist_sentences[row_sums > 1]

    # get the nearest sentences
    nearest_sentence_indices = np.argsort(dist_sentences_prune[:,i])[:n_sentences]
    nearest_sentences = sentences_prune[nearest_sentence_indices]
    nearest_distances = dist_sentences_prune[:,i][nearest_sentence_indices]
    print(f'Most relevant {n_sentences} sentence(s) from the top {n_answers} answers:\n')
        
    for (s,d) in zip(nearest_sentences, nearest_distances):
        print(f'Distance from centroid = {d:.4}')
        print(s.strip(),'\n')
        



In [None]:
filename = 'data/ITP_CourseArtifacts_June 2021_END_of_Course_DeIDENTIFIED.xlsx'
num_topics = np.arange(2,20)
results1_kmeans = runNLPPipeline(filename, sheet = 'Course Meta App', column_number = 1, num_topics = num_topics,
    n_answers = 20, n_sentences = 3, tfidf_ngram_range = (1,2), tfidf_min_df = 0.01,
    run_lda = False, run_lsi = False, run_ngrams = False,
 )

#  TODO


## Try Mallet LDA?

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ <-- this also contains some great additional steps to check out

Following steps from here : https://radimrehurek.com/gensim_3.8.3/models/wrappers/ldamallet.html

(Working in WSL to compile the Mallet code.)

```
sudo apt update
sudo apt-get install default-jdk
git clone https://github.com/mimno/Mallet.git
cd Mallet/
ant
```

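But this wrapper no longer exists in gensim: the `gensim.models.wrappers` module (including `LdaMallet`) was removed in gensim 4.0, so the cell below only runs with gensim 3.8.3 or earlier.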

In [None]:
path_to_mallet_binary = "/c/Users/ageller/NUIT/projects/BennettGoldberg/Mallet/bin/mallet"

dictionary, bow_corpus, processed_answers = getBagOfWords(df, 1,  additional_stopwords = additional_stopwords, no_below = 15, no_above = 1, keep_n = int(1e5))

import gensim  # gensim.models.wrappers requires gensim < 4.0

model = gensim.models.wrappers.LdaMallet(path_to_mallet_binary, corpus = bow_corpus, num_topics = 5, 
                                         id2word = dictionary)
vector = model[bow_corpus[0]]  # LDA topics of the first document

# Try PKE (Python Keyphrase Extraction)

- https://medium.com/nlplanet/two-minutes-nlp-keyword-and-keyphrase-extraction-with-pke-5a0260e75f3e
- https://github.com/boudinfl/pke (and see linked Colab notebooks)

In [None]:
import pke

In [None]:
string_of_answers = getStringOfWords(df, 1)

In [None]:
# initialize a keyphrase extraction model, here TFxIDF
extractor = pke.unsupervised.TfIdf()  

# load the content of the document  (str or spacy Doc)
extractor.load_document(input = string_of_answers)   

# identify keyphrase candidates
extractor.candidate_selection()    

# weight keyphrase candidates
extractor.candidate_weighting()    

In [None]:
# select the 10-best candidates as keyphrases
keyphrases = extractor.get_n_best(n = 10)          
keyphrases

In [None]:
# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load text
extractor.load_document(input = string_of_answers, language = 'en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
# In TopicRank, candidate weighting is a three-step process:
#  1. candidate clustering (grouping keyphrase candidates into topics)
#  2. graph construction (building a complete-weighted-graph of topics)
#  3. rank topics (nodes) using a random walk algorithm
extractor.candidate_weighting()

In [None]:
# for each keyphrase candidate
for i, candidate in enumerate(extractor.candidates):
    
    # print out the candidate id, its stemmed form 
    print("candidate {}: {} (stemmed form)".format(i, candidate))
    
    # print out the surface forms of the candidate
    print(" - surface forms:", [ " ".join(u) for u in extractor.candidates[candidate].surface_forms])
    
    # print out the corresponding offsets
    print(" - offsets:", extractor.candidates[candidate].offsets)
    
    # print out the corresponding sentence ids
    print(" - sentence_ids:", extractor.candidates[candidate].sentence_ids)
    
    # print out the corresponding PoS patterns
    print(" - pos_patterns:", extractor.candidates[candidate].pos_patterns)

In [None]:
# for each topic of the document
for i, topic in enumerate(extractor.topics):
    
    # print out the topic id and the candidates it groups together
    print("topic {}: {} ".format(i, ';'.join(topic)))

In [None]:
# histogram of the weights
f,ax = plt.subplots()
ax.hist(extractor._w.values(), bins= 100)
ax.set_yscale('log')

In [None]:
# let's have a look at the graph-based representation of the document
#
# here, nodes are topics, edges between topics are weighted according to 
# the strength of their semantic relation measured by the reciprocal distances
# between the offset positions of the candidate keyphrases

import networkx as nx

# set the labels as list of candidates for each topic
labels = {i: ';'.join(topic) for i, topic in enumerate(extractor.topics)}

# set the weights of the edges
edge_weights = [extractor.graph[u][v]['weight'] for u,v in extractor.graph.edges()]

# set the weights of the nodes (topic weights are stored in _w attribute)
sizes = [10e3*extractor._w[i] for i, topic in enumerate(extractor.topics)]

# draw the graph
nx.draw_shell(extractor.graph,
              #with_labels = True, labels = labels, 
              #width = edge_weights, 
              node_size = sizes)

In [None]:
# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n = 10)

# for each of the best candidates
for i, (candidate, score) in enumerate(keyphrases):
    
    # print out its rank, phrase and score
    print("rank {}: {} ({})".format(i, candidate, score))

In [None]:
# 1. create a YAKE extractor.
extractor = pke.unsupervised.YAKE()

# 2. load the content of the document.
extractor.load_document(input = string_of_answers, language = 'en', normalization = None)

# 3. select {1-3}-grams not containing punctuation marks and not
#    beginning/ending with a stopword as candidates.
extractor.candidate_selection(n = 3)

# 4. weight the candidates using YAKE weighting scheme, a window (in
#    words) for computing left/right contexts can be specified.
window = 2
use_stems = False # use stems instead of words for weighting
extractor.candidate_weighting(window = window, use_stems = use_stems)

In [None]:
# 5. get the 10-highest scored candidates as keyphrases.
#    redundant keyphrases are removed from the output using levenshtein
#    distance and a threshold.
threshold = 0.8
keyphrases = extractor.get_n_best(n = 10, threshold = threshold)

keyphrases

# LSI model using gensim
- https://medium.com/@zeina.thabet/topic-modeling-with-lsi-lda-and-document-clustering-with-carrot2-part-1-5b1fbec737f6

In [None]:
num_topics = np.arange(12) + 3
dictionary, bow_corpus, models, coherence = runLSITopicModel(df, column_number = 1, num_topics = num_topics, 
                                                             random_seed = 1234)

In [None]:
# choose the index of the best model by selecting the maximum coherence score
coherence_method = 'c_v'
best_index = np.argmax(coherence[coherence_method])
print(f'  -- The best model has {num_topics[best_index]} topics, using the "{coherence_method}" coherence method')

In [None]:
f, ax = plotLSIMetrics(num_topics, coherence, best_index)

In [None]:
lsi_model = models[best_index]
lsi_model.show_topics()

In [None]:
# finds the Euclidean distances between each topic and a given answer
df_d = getLSIDistances(lsi_model, bow_corpus, df)
df_d

In [None]:
# uses gensim to get the sum of the vector coefficients for each answer
# I'm not sure how this is helpful, unless there are only 2 topics
df_v = getLSIVectors(lsi_model, bow_corpus, df)
df_v

In [None]:
plotTopLSIVectorsKDE(df_d)

In [None]:
plotTopLSIVectorsKDE(df_v)

In [None]:
topic_vectors = lsi_model.get_topics()
print(topic_vectors.shape)
print(len(bow_corpus), len(dictionary))

In [None]:
print(topic_vectors[0])
print(sum(topic_vectors[0]))

In [None]:
# proving that the lsi_model[bow] is the same as summing the coefficients for the words in bow for a given topic
text = df[df.columns[1]][0]
text_list = preprocess(text)
bow = dictionary.doc2bow(text_list)

topic_index = 1

x = 0
for w in bow:
    val = topic_vectors[topic_index][w[0]]*w[1]
    print(val)
    x += val
print('sum',x)
print('check',lsi_model[bow][topic_index])

In [None]:
tvector = topic_vectors[1]
sentence_vector = np.zeros(len(tvector))
#sentence_vector[0] = tvector[0]
print(np.linalg.norm(tvector - sentence_vector))
print(np.dot(tvector.T, sentence_vector))
text = preprocess("Improving syllabus resources to be more welcoming and accommodating.  Using formative assessments to gauge classroom climate.")
bow = dictionary.doc2bow(text)
print(bow)
for i,w in enumerate(bow):
    sentence_vector[w[0]] = tvector[w[0]]
print(np.linalg.norm(tvector - sentence_vector))

In [None]:
# this doesn't seem to be working correctly
printBestLSITopicSentences(df_d, dictionary, lsi_model)#, show_answers = True)

In [None]:
s = "(1) Designing content and presentation to better address student's interests and goals."
preprocess(s)

In [None]:
# calculate the dot product between the answer and topic vectors (a similarity, not a distance)
answer_vector = np.zeros(len(topic_vectors[topic_index]))
for w in bow:
    # w = (word id, count): place the coefficient at the word's own index
    answer_vector[w[0]] = topic_vectors[topic_index][w[0]]
dist = np.dot(answer_vector, topic_vectors[topic_index])
print(dist)

In [None]:
lsi_model[bow]

In [None]:
use_vector = [(i, v) for i, v in enumerate(topic_vectors[topic_index])]
print(use_vector)

In [None]:
import gensim  # needed here unless NLPforISP already exposes it through the * import
index = gensim.similarities.MatrixSimilarity(topic_vectors)

In [None]:
text = df[df.columns[1]][0]

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

available_words = np.array(list(dictionary.items()))[:,1]

sentence_strength = {}
for sentence in doc.sents:
    for word in sentence:
        if (word.text in available_words):
            print(dictionary.doc2bow([word.text]))
#             prob = lda_model.get_term_topics(word.text)[topic_number][1]
#             if sentence in sentence_strength.keys():
#                 sentence_strength[sentence] += prob
#             else:
#                 sentence_strength[sentence] = prob

In [None]:
dictionary, bow_corpus, processed_answers = getBagOfWords(df, 1)
lsimodel = gensim.models.LsiModel(corpus = bow_corpus, num_topics = 10, id2word = dictionary)

In [None]:
# inspect the u matrix
u_df = pd.DataFrame(data= lsimodel.projection.u)
# sort by column 0
u_df.sort_values(axis= 0, by= 0, ascending= False)

In [None]:
# inspect the s matrix
s_df = pd.DataFrame(data= lsimodel.projection.s)
s_df

In [None]:
lsimodel.show_topics(num_topics=10)

In [None]:
lsitopics = [[word for word, prob in topic]
             for topicid, topic in lsimodel.show_topics(formatted = False)]

coherence_model_lsi = gensim.models.CoherenceModel(model = lsimodel, texts = processed_answers, 
                                                   dictionary = dictionary, coherence='c_v', topics = lsitopics)
coherence_lsi = coherence_model_lsi.get_coherence()
print('\nCoherence Score: ', coherence_lsi)

# BERTopic

- https://towardsdatascience.com/meet-bertopic-berts-cousin-for-advanced-topic-modeling-ea5bf0b7faa3
- https://github.com/MaartenGr/BERTopic

In [None]:
from bertopic import BERTopic

# Use Bennett's categories to put answers into buckets

Look at the `ISTP Codebook.xlsx` file for the words that define each category.

- **Identity**
    - Instructor Identity
    - Student Identity
    - Positionality
- **Course Structure**
    - Overall course design
    - Implementation
- **Course Climate**
    - Creating an inclusive course climate
    - Conflict management
- **Concept Knowledge**
    - Understanding inclusive practices
    - Interpersonal; sharing knowledge
- **Planned Implementation**
    - Synthesis of knowledge
    - Reasons for hesitancies or resistance


For each answer in a category, I also want to know whether the response describes confidence that has already increased or confidence that is still being gained.  I think I may need to split each answer into sentences and analyze each sentence separately...
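
One possible way to try the sentence-level idea is sketched below. It is only an illustration, not the approach used in the cells that follow (those chunk each answer word by word). The cue words are copied from that later chunking cell; the regex sentence splitter, the substring matching, and the names `tag_sentences`, `GAINING_CUES`, `GAINED_CUES`, and `toy_answer` are assumptions made up for this sketch.

```python
import re

# cue words taken from the word-level chunking cell later in this notebook
GAINING_CUES = ('increasing', 'gaining', 'less confident', 'not confident',
                'am still', 'im still', 'still need')
GAINED_CUES = ('increased', 'gained', 'confident', 'now feel')

def tag_sentences(answer):
    """Split an answer into sentences and tag each one as gained, gaining, or neither."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', answer) if s.strip()]
    tagged = []
    for s in sentences:
        # drop punctuation (e.g. "I'm" -> "im") so the cue phrases can match
        plain = re.sub(r'[^\w\s]', '', s.lower())
        # check the "gaining" cues first, since "not confident" also contains "confident"
        if any(cue in plain for cue in GAINING_CUES):
            tagged.append(('gaining', s))
        elif any(cue in plain for cue in GAINED_CUES):
            tagged.append(('gained', s))
        else:
            tagged.append(('neither', s))
    return tagged

# hypothetical answer, only for illustration
toy_answer = ("I now feel confident designing a syllabus. "
              "I'm still gaining confidence with conflict management. Thanks!")
for label, sentence in tag_sentences(toy_answer):
    print(label, '|', sentence)
```

Substring matching is crude (for example, "confident" also matches "confidently"), and spaCy's sentence splitter would likely be more robust than the regex used here.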

In [None]:
# settings
additional_stopwords = ['']
wlen = 3
stem = True
no_below = 1
no_above = 1
keep_n = int(1e10)

In [None]:
import json

In [None]:
# read in the categories
# I created this file from the ISTP Codebook.xlsx file
with open('data/category_words.json') as json_file:
    categories_all = json.load(json_file)


#categories_all

In [None]:
# first, collapse the subcategories
categories_primary = {}
for k1 in categories_all:
    categories_primary[k1] = categories_all[k1]['text']
    for k2 in categories_all[k1]['subcategories']:
        categories_primary[k1] += ' ' + categories_all[k1]['subcategories'][k2]

#categories_primary

In [None]:
processed_categories = {}
for key in categories_primary:
    processed_categories[key] = preprocess(
        categories_primary[key], additional_stopwords = additional_stopwords, wlen = wlen, stem = stem
    )
categories_names = np.array(list(processed_categories.keys()))
#processed_categories

In [None]:
# try to eliminate overlapping words between categories?

In [None]:
# full data file with multiple sheets
filename = 'data/ITP_CourseArtifacts_June 2021_END_of_Course_DeIDENTIFIED.xlsx'

# sheet name for this analysis, containing responses to one question
sheet = 'Course Meta SelfEff'

df = pd.read_excel(filename, sheet)
df

In [None]:
# for each answer, try to divide into chunks of "gained confidence" and "still gaining confidence"
# this really should have been split into two questions!
import re
import unicodedata

results_df = df.copy()

gained_list = ['' for x in range(len(df))]
gaining_list = ['' for x in range(len(df))]
neither_list = ['' for x in range(len(df))]
for i, row in df.iterrows():
    n_key = 0
    # use df.columns here; the `columns` list is not defined until a later cell
    answer = re.sub(r'[^\w\s]', '', (unicodedata.normalize('NFKD', row[df.columns[1]])
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
        .lower()))

    words = np.array(answer.split(' '))
    igained_start = -1
    igaining_start = -1

    gained = []
    gaining = []
    neither = []
    for k, w in enumerate(words):
        if (k < len(words) - 2):
            if ( 
                    (w == 'increasing') or 
                    (w == 'gaining') or 
                    (w + words[k + 1] == 'lessconfident') or 
                    (w + words[k + 1] == 'notconfident') or 
                    (w + words[k + 1] == 'amstill') or 
                    (w + words[k + 1] == 'imstill') or
                    (w + words[k + 1] == 'stillneed')
            ):
                if (igaining_start < 0): 
                    igaining_start = k
                if (igained_start >= 0):
                    #gained += ' '.join(words[igained_start:k])
                    gained.extend([x + igained_start for x in range(k - igained_start)])
                    igained_start = -1
                    
            elif ( (
                    (w == 'increased') or 
                    (w == 'gained') or 
                    (w == 'confident') or 
                    (w + words[k + 1] == 'nowfeel')
                ) and (k - igaining_start > 3 or igaining_start == -1)
            ): 
                if (igained_start < 0):
                    igained_start = k
                if (igaining_start >= 0):
                    #gaining += ' '.join(words[igaining_start:k])
                    gaining.extend([x + igaining_start for x in range(k - igaining_start)])
                    igaining_start = -1


                    
    # use >= 0 so that a chunk starting at the first word (index 0) is not dropped
    if (igained_start >= 0):
        #gained += ' '.join(words[igained_start:len(words)])
        gained.extend([x + igained_start for x in range(len(words) - igained_start)])

    if (igaining_start >= 0):
        #gaining += ' '.join(words[igaining_start:len(words)])        
        gaining.extend([x + igaining_start for x in range(len(words) - igaining_start)])

    for x in range(len(words)):
        if (x not in gained and x not in gaining):
            neither.append(x)

    if (len(gained) > 0):
        gained_list[i] = ' '.join(words[gained])
    if (len(gaining) > 0):
        gaining_list[i] = ' '.join(words[gaining])
    if (len(neither) > 0):
        neither_list[i] = ' '.join(words[neither])

    #print(i)
    #print(answer + '\n')
    #print('GAINED', gained_list[i], '\n')
    #print('GAINING', gaining_list[i], '\n')
    #print('NEITHER', neither_list[i], '\n')
    
results_df['gained'] = gained_list
results_df['gaining'] = gaining_list
results_df['neither'] = neither_list
            

In [None]:
results_df

In [None]:
columns = results_df.columns.to_list()
cols = [2,3] #gained, gaining

for column_number in cols:
    column_name = columns[column_number]
    
    # first create a bag of words for each survey answer
    dictionary, bow_corpus, processed_answers = getBagOfWords(
        results_df, column_number,  additional_stopwords = additional_stopwords, 
        wlen = wlen, stem = stem, no_below = no_below, no_above = no_above, 
        keep_n = keep_n
    )

    # loop through and count the words
    # how do I normalize this to get a scale that will allow me to identify categories that the answers fall into
    # I could divide by the number of words in the answer, but what if the answer is relevant to multiple categories?
    # I will add a column that has the number of words in the answer so that I can use that later
    categories_scores = np.zeros((len(df), len(processed_categories.keys())))
    for i, a in enumerate(processed_answers):
        for j, key in enumerate(categories_names):
            for w in processed_categories[key]:
                if (w in a):
                    categories_scores[i,j] += 1/len(a)

    # maybe I can keep track of some kind of maximum density of words within a group based on a rolling window?
    window = 10 # this is meant to represent some average sentence length in the answers (I could calculate that)
    categories_density = np.zeros((len(df), len(processed_categories.keys())))
    for i, a in enumerate(processed_answers):
        for j, key in enumerate(categories_names):
            density = 0
            wuse = np.min([window, len(a)])
            for k, wa in enumerate(a):
                if ((len(a) - k) >= wuse):
                    wlist = a[k:(k + wuse)]
                    d = 0
                    for w in processed_categories[key]:
                        if (w in wlist):
                            d += 1/len(wlist)
                    density = np.max([density, d])
            categories_density[i,j] = density

    # add these counts to the dataframe
    for j, key in enumerate(categories_names):
        results_df[key + ' ' + column_name] = categories_scores[:, j]
    for j, key in enumerate(categories_names):
        results_df[key + ' ' + column_name + ' density'] = categories_density[:, j]


# TODO: impose some threshold to place in "Not specific"

In [None]:
results_df

In [None]:
f, ax = plt.subplots(1, 2, figsize=(12, 5))
for j, key in enumerate(categories_names):
    ax[0].scatter(results_df[key + ' gained'], results_df[key + ' gained density'], label = key)
    ax[1].scatter(results_df[key + ' gaining'], results_df[key + ' gaining density'], label = key)
ax[0].legend()
ax[0].set_xlabel('fraction gained')
ax[0].set_ylabel('density gained')
ax[1].set_xlabel('fraction gaining')
ax[1].set_ylabel('density gaining')

In [None]:
f, ax = plt.subplots(1, 2, figsize=(12, 5))
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
full_array = []
for j, key in enumerate(categories_names):
    full_array.extend(results_df[key + ' gained'].to_list())
    ax[0].hist(results_df[key + ' gained'], label = key, alpha = 0.5, color = colors[j], bins = 20)
    ax[0].hist(results_df[key + ' gained'], histtype = 'step', color = colors[j], linewidth = 2, bins = 20)
ax[0].legend()
ax[0].set_xlabel('fraction')
ax[0].set_ylabel('N')

full_array = np.array(full_array)
full_array = full_array[np.where(np.array(full_array) > 0)]

ax[1].hist(full_array, bins = 20)
median_fraction_gained, one_sig_fraction_gained, three_sig_fraction_gained = np.percentile(full_array, [50, 84.1, 99.8])
ax[1].axvline(median_fraction_gained, color = colors[1])
ax[1].axvline(one_sig_fraction_gained, linestyle = 'dashed', color = colors[1])
ax[1].axvline(three_sig_fraction_gained, linestyle = 'dotted', color = colors[1])
print(median_fraction_gained, one_sig_fraction_gained)

In [None]:
f, ax = plt.subplots(1, 2, figsize=(12, 5))
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
full_array = []
for j, key in enumerate(categories_names):
    full_array.extend(results_df[key + ' gaining'].to_list())
    ax[0].hist(results_df[key + ' gaining'], label = key, alpha = 0.5, color = colors[j], bins = 20)
    ax[0].hist(results_df[key + ' gaining'], histtype = 'step', color = colors[j], linewidth = 2, bins = 20)
ax[0].legend()
ax[0].set_xlabel('fraction')
ax[0].set_ylabel('N')

# keep only the positive values
full_array = np.array(full_array)
full_array = full_array[full_array > 0]

ax[1].hist(full_array, bins = 20)
# median, +1 sigma (84.1st percentile) and roughly +3 sigma (99.8th percentile)
median_fraction_gaining, one_sig_fraction_gaining, three_sig_fraction_gaining = np.percentile(full_array, [50, 84.1, 99.8])
ax[1].axvline(median_fraction_gaining, color = colors[1])
ax[1].axvline(one_sig_fraction_gaining, linestyle = 'dashed', color = colors[1])
ax[1].axvline(three_sig_fraction_gaining, linestyle = 'dotted', color = colors[1])
print(median_fraction_gaining, one_sig_fraction_gaining)

In [None]:
f, ax = plt.subplots(1, 2, figsize=(12, 5))
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
full_array = []
for j, key in enumerate(categories_names):
    full_array.extend(results_df[key + ' gained density'].to_list())
    ax[0].hist(results_df[key + ' gained density'], label = key, alpha = 0.5, color = colors[j], bins = np.arange(11)/10.)
    ax[0].hist(results_df[key + ' gained density'], histtype = 'step', color = colors[j], linewidth = 2, bins = np.arange(11)/10.)
ax[0].legend()
ax[0].set_xlabel('density')
ax[0].set_ylabel('N')

# keep only the positive values
full_array = np.array(full_array)
full_array = full_array[full_array > 0]

ax[1].hist(full_array, bins = np.arange(11)/10.)
# median, +1 sigma (84.1st percentile) and roughly +3 sigma (99.8th percentile)
median_density_gained, one_sig_density_gained, three_sig_density_gained = np.percentile(full_array, [50, 84.1, 99.8])
ax[1].axvline(median_density_gained, color = colors[1])
ax[1].axvline(one_sig_density_gained, linestyle = 'dashed', color = colors[1])
ax[1].axvline(three_sig_density_gained, linestyle = 'dotted', color = colors[1])
print(median_density_gained, one_sig_density_gained)

In [None]:
f, ax = plt.subplots(1, 2, figsize=(12, 5))
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
full_array = []
for j, key in enumerate(categories_names):
    full_array.extend(results_df[key + ' gaining density'].to_list())
    ax[0].hist(results_df[key + ' gaining density'], label = key, alpha = 0.5, color = colors[j], bins = np.arange(11)/10.)
    ax[0].hist(results_df[key + ' gaining density'], histtype = 'step', color = colors[j], linewidth = 2, bins = np.arange(11)/10.)
ax[0].legend()
ax[0].set_xlabel('density')
ax[0].set_ylabel('N')

# keep only the positive values
full_array = np.array(full_array)
full_array = full_array[full_array > 0]

ax[1].hist(full_array, bins = np.arange(11)/10.)
# median, +1 sigma (84.1st percentile) and roughly +3 sigma (99.8th percentile)
median_density_gaining, one_sig_density_gaining, three_sig_density_gaining = np.percentile(full_array, [50, 84.1, 99.8])
ax[1].axvline(median_density_gaining, color = colors[1])
ax[1].axvline(one_sig_density_gaining, linestyle = 'dashed', color = colors[1])
ax[1].axvline(three_sig_density_gaining, linestyle = 'dotted', color = colors[1])
print(median_density_gaining, one_sig_density_gaining)
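
The four histogram cells above repeat the same steps: gather the values, drop the zeros, and take the 50th, 84.1st and 99.8th percentiles. A minimal sketch of that shared step as a helper, assuming only NumPy; the name `positive_percentiles` and the toy values are hypothetical and not part of `NLPforISP`.

```python
import numpy as np

def positive_percentiles(values, percentiles=(50, 84.1, 99.8)):
    """Return the requested percentiles of the strictly positive entries in values."""
    arr = np.asarray(values, dtype=float)
    # drop zeros (and any negative values) before computing the percentiles
    arr = arr[arr > 0]
    return np.percentile(arr, percentiles)

# toy values, made up for illustration only
toy_values = [0, 0, 1, 2, 3, 4, 5]
toy_median, toy_one_sig, toy_three_sig = positive_percentiles(toy_values)
print(toy_median, toy_one_sig, toy_three_sig)
```

With a helper like this, each cell would only need to pass in its own column values, for example the concatenated `' gained density'` columns.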

In [None]:
# get the numbers of answers that have a given category above some threshold
# there's probably a faster way to do this!
category_numbers_gained = {}
category_numbers_gaining = {}
for j, key in enumerate(categories_names):
    category_numbers_gained[key] = 0
    category_numbers_gaining[key] = 0

# not sure what the best numbers are here
frac_threshold_gained = one_sig_fraction_gained
dens_threshold_gained = one_sig_density_gained
for key in categories_names:
    for i, row in results_df.iterrows():
        if ((row[key + ' gained'] > frac_threshold_gained) or (row[key + ' gained density'] > dens_threshold_gained)):
            category_numbers_gained[key] += 1
                

# not sure what the best numbers are here
frac_threshold_gaining = one_sig_fraction_gaining
dens_threshold_gaining = one_sig_density_gaining
for key in categories_names:
    for i, row in results_df.iterrows():
        if ((row[key + ' gaining'] > frac_threshold_gaining) or (row[key + ' gaining density'] > dens_threshold_gaining)):
            category_numbers_gaining[key] += 1

print(category_numbers_gained)
print(category_numbers_gaining)
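
The row-by-row loops above can also be written as one boolean comparison per category. A minimal sketch, assuming `results_df` has the `<category> gained` and `<category> gained density` style columns used above; the helper name `count_above_threshold` and the toy DataFrame are hypothetical.

```python
import pandas as pd

def count_above_threshold(results_df, categories_names, column_name, frac_threshold, dens_threshold):
    """Count rows where the fraction OR the density exceeds its threshold, per category."""
    counts = {}
    for key in categories_names:
        frac = results_df[key + ' ' + column_name]
        dens = results_df[key + ' ' + column_name + ' density']
        # boolean mask over all rows at once; summing the True values gives the count
        counts[key] = int(((frac > frac_threshold) | (dens > dens_threshold)).sum())
    return counts

# toy data, made up for illustration only
toy_df = pd.DataFrame({
    'A gained': [0.0, 0.3, 0.6],
    'A gained density': [0.0, 0.9, 0.1],
    'B gained': [0.1, 0.1, 0.1],
    'B gained density': [0.2, 0.2, 0.8],
})
toy_counts = count_above_threshold(toy_df, ['A', 'B'], 'gained', 0.5, 0.5)
print(toy_counts)
```

The same call with `column_name = 'gaining'` and the gaining thresholds would fill `category_numbers_gaining`.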

In [None]:
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
f,ax = plt.subplots()
data_gained = np.array(list(category_numbers_gained.values()))
data_gaining = np.array(list(category_numbers_gaining.values()))
srt_gained = np.argsort(data_gained)

x = np.arange(len(categories_names))
# sort the labels in the same order as the bars; np.array() allows this even if categories_names is a list
sorted_names = np.array(categories_names)[srt_gained]
rects_gained = ax.barh(x + 0.2, data_gained[srt_gained], height = 0.4, 
                       tick_label = sorted_names, color = colors[0], label = 'Gained Confidence')
rects_gaining = ax.barh(x - 0.2, data_gaining[srt_gained], height = 0.4, 
                        tick_label = sorted_names, color = colors[1], label = 'Still Gaining Confidence')
#ax.set_xlim(0,60)
ax.legend(loc = 'lower center')

# remove the right, top and bottom spines, the tick marks and the x ticks
aoff = ['right', 'top', 'bottom']
for side in aoff:
    ax.spines[side].set_visible(False)
ax.tick_params(length=0)
ax.set_xticks([])

# label each bar with its count
for rects, color in zip([rects_gained, rects_gaining], colors[:2]):
    for r in rects:
        w = r.get_width()
        y = r.get_y()
        ax.text(w + 5, y + 0.1, w, ha = 'center', va = 'bottom', zorder = 3, color = color)

In [None]:
# check the script

import json

with open('data/category_words.json') as json_file:
    categories_all = json.load(json_file)
    
# collapse the subcategories
categories_primary = {}
for k1 in categories_all:
    categories_primary[k1] = categories_all[k1]['text']
    for k2 in categories_all[k1]['subcategories']:
        categories_primary[k1] += ' ' + categories_all[k1]['subcategories'][k2]

# run the analysis on the primary categories for one sheet (responses to one question)
results_df, category_numbers_gained, category_numbers_gaining = separate_into_predefined_categories(
                categories_dict = categories_primary, 
                filename = 'data/ITP_CourseArtifacts_June 2021_END_of_Course_DeIDENTIFIED.xlsx', 
                sheet = 'Course Meta SelfEff', 
                column_number = 1, 
                additional_stopwords = [''], 
                wlen = 3, 
                figname = "categories_histogram_CourseMetaSelfEff_primary.png")

In [None]:
# run all the subcategories
categories_sub = {}
for k1 in categories_all:
    for k2 in categories_all[k1]['subcategories']:
        categories_sub[k2] = categories_all[k1]['subcategories'][k2]

# run the analysis on the subcategories for one sheet (responses to one question)
results_df, category_numbers_gained, category_numbers_gaining = separate_into_predefined_categories(
                categories_dict = categories_sub, 
                filename = 'data/ITP_CourseArtifacts_June 2021_END_of_Course_DeIDENTIFIED.xlsx', 
                sheet = 'Course Meta SelfEff', 
                column_number = 1, 
                additional_stopwords = [''], 
                wlen = 3, 
                figname = "categories_histogram_CourseMetaSelfEff_subcategories.png")
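
The two cells above assume that each entry in `data/category_words.json` has a `text` string and a `subcategories` dictionary mapping subcategory names to strings of words. A minimal sketch of both ways of building the category dictionary, using a made-up stand-in for the file contents (the category names and words below are hypothetical, not taken from the real file):

```python
# made-up stand-in for the contents of category_words.json
toy_categories_all = {
    'teaching': {
        'text': 'teach instruct',
        'subcategories': {'planning': 'lesson plan', 'assessment': 'grade rubric'},
    },
    'research': {
        'text': 'research study',
        'subcategories': {},
    },
}

# primary categories: the category's own words plus the words of every subcategory
toy_primary = {
    k1: ' '.join([v['text'], *v['subcategories'].values()])
    for k1, v in toy_categories_all.items()
}

# subcategories only: flatten one level, keyed by subcategory name
toy_sub = {
    k2: words
    for v in toy_categories_all.values()
    for k2, words in v['subcategories'].items()
}

print(toy_primary)
print(toy_sub)
```

Note that flattening by subcategory name means two subcategories with the same name under different primary categories would overwrite each other.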