## TODO's - October

1. Refine preprocessing pipeline (use spacy or nltk or some combination of the two)
    - There are some quirks with n-grams currently, look into refining the implementation
    - Some words like "use", "since", "r", "x", are not being filtered out by stopword removal

2. Web scraping for job data
    - Collect like 50-100 examples per week and create a similar preprocessing pipeline 
    - Look for ways to programmatically filter sections we want (responsibilities and qualifications).

3. Look into topic labeling
    - Automatically extracting top n words (and sorting them by relevance)
    - Look at how relevance is computed at https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/_prepare.py
    - BERTopic?
    
4. Finish Introduction and Data sections before midterm break
    - Literature review (Blei paper, Daniel paper, Journal of DSE paper, possibly find topic labeling papers?)
    - Decide on final dataset 


## TODO's - Final Submission

1. Use HDP output or LDA equivalent for module data. 
2. Preprocess and filter job data (maybe add more from LinkedIn if less than 150 after filtering).
3. Same analysis on job data.
4. Results section: Add visualizations and metrics.
5. Discussion section: talk about overlaps and differences between both datasets.
6. Finalize paper.

## Topic Modeling on MDS Program Lecture Material

### Some notation

- A 'document' is just a collection of words.
    - Initially, after loading the data, one document is contained in a string, containing all the text from one module.
    - After preprocessing, one document is represented in a "bag of words" format, which means it is a *list* of individual tokens (words).
- A 'corpus' is a collection of documents.
- d = number of documents in the corpus
- k = number of topics for the topic model to find
- |V| = size of vocabulary, i.e. number of distinct tokens in the corpus
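
As a small illustration of this notation, the sketch below computes d and |V| for a toy corpus. The two documents are made up for the example and are not taken from the lecture data.

```python
# Toy corpus: two hypothetical "modules", each already cleaned into a bag of words.
corpus = [
    ["data", "model", "data", "sample"],
    ["model", "function", "value"],
]

d = len(corpus)  # number of documents in the corpus

# The vocabulary is the set of distinct tokens across all documents.
vocabulary = sorted({token for doc in corpus for token in doc})
vocab_size = len(vocabulary)  # |V|

print(d, vocab_size, vocabulary)
```

Note that "data" appears twice in the first document but counts only once towards |V|.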

### Imports and loading data

In [1]:
# Used to tokenize the text; i.e. create a dictionary mapping words to integers. The dictionary can be used to create a term-document matrix.
from gensim.corpora import Dictionary

## Preprocessing with nltk
import string   # contains a public variable with all ASCII punctuation characters
import nltk

# list of all stopwords such as 'and', 'the', 'is', etc.
nltk.download('stopwords')  

# WordNet is a lexical database of English words that groups words into sets of synonyms, while also recording semantic relationships between words such as "is-a", "part-of", and "opposite-of" relationships.
nltk.download('wordnet')    

# Open Multilingual WordNet (omw) links hand created wordnets and automatically created wordnets for different languages.
nltk.download('omw-1.4')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk import ngrams

## Preprocessing with gensim and spacy
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.models.phrases import ENGLISH_CONNECTOR_WORDS

import spacy

from textacy import extract

import numpy as np

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [4]:
# For topic visualizations 
import pyLDAvis.gensim_models as gensim_vis
import pyLDAvis
# For enabling HTML widget in Jupyter notebook
from pyLDAvis import enable_notebook

enable_notebook()

Location of environment for personal reference:  c:\Users\syeda\miniconda3\envs\dir-st\lib\ (in case large models are downloaded for testing and need to be deleted)

In [7]:
import os

def combine_text_files_to_list(input_directory):

    txt_files = [os.path.join(input_directory, file) for file in os.listdir(input_directory) if file.endswith(".txt")]
    corpus = []

    for txt_file in txt_files:
        
        try:
            # Read the entire file as a string and add the string to the corpus
            with open(txt_file, 'r', encoding='utf-8') as file:
                file_content = file.read()  
                corpus.append(file_content)  
                
        except Exception as e:
            print(f"An error occurred while reading {txt_file}: {e}")
    
    return corpus

strings_list = combine_text_files_to_list("../Dataset/Parsed_Lectures")
print("Corpus combined successfully as a list of strings.")

Corpus combined successfully as a list of strings.


Each string in `strings_list` is all the text from one PDF of lecture slides.

In [8]:
print(len(strings_list))
print(strings_list[0][:500])

162
 Learning Objectives•  Explain why it is important to understand and use correct terminology.            •          Define: computer, software, memory, data, memory size/data size, cloud            •          Explain "Big Data" and describe data growth in the coming years.            •          Compare and contrast: digital versus analog            •          Briefly explain how integers, doubles, and strings are encoded.            •          Explain why ASCII table is required for character en


In [9]:
total_words = 0   # named accumulator, so the built-in sum() is not shadowed
doc_length = []
for doc in strings_list:
    words = doc.split()
    total_words += len(words)
    print("Number of words: ", len(words))
    doc_length.append(len(words))
    
print(f"Total number of words in the corpus: {total_words}")
print(f"Mean number of words per document: {round(np.mean(doc_length),2)}")
print(f"Standard deviation: {round(np.std(doc_length),2)}")

Number of words:  2429
Number of words:  4714
Number of words:  2076
Number of words:  4303
Number of words:  3140
Number of words:  1810
Number of words:  1829
Number of words:  3466
Number of words:  2130
Number of words:  3368
Number of words:  2624
Number of words:  3376
Number of words:  2477
Number of words:  3001
Number of words:  2815
Number of words:  1843
Number of words:  3389
Number of words:  2099
Number of words:  2520
Number of words:  1622
Number of words:  1099
Number of words:  2014
Number of words:  2447
Number of words:  2023
Number of words:  2621
Number of words:  2121
Number of words:  3201
Number of words:  1874
Number of words:  2069
Number of words:  5461
Number of words:  1241
Number of words:  2446
Number of words:  1683
Number of words:  5307
Number of words:  3546
Number of words:  4360
Number of words:  1584
Number of words:  2097
Number of words:  3871
Number of words:  3414
Number of words:  3068
Number of words:  3669
Number of words:  1922
Number of w

### Cleaning and preprocessing the corpus

For this task, we explored two options: nltk and spaCy. Overall, we found spaCy a bit easier to use. In both cases, the input is a list of strings and the returned corpus is a list of lists of strings, where each nested list holds the cleaned words from one module.
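
To make the input and output shapes concrete, here is a minimal standard-library sketch of the same idea. It is not the nltk or spaCy pipeline used in this notebook: `clean_toy`, the toy strings, and the tiny stop word set are all hypothetical and only illustrate the list-of-strings to list-of-lists transformation.

```python
import re

def clean_toy(doc, stop_words):
    """Lowercase, keep alphabetic tokens longer than one character, drop stop words."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if len(t) > 1 and t not in stop_words]

# Hypothetical raw documents (one string per module).
toy_strings = ["Explain • Big Data and the cloud.", "Define: x = 3 bytes of memory"]
toy_stop_words = {"and", "the", "of"}  # far smaller than the real stop word lists

# One bag of words (list of tokens) per input string.
toy_bags = [clean_toy(doc, toy_stop_words) for doc in toy_strings]
print(toy_bags)
```

The single-character token "x" and the digit "3" are dropped here, which mirrors the single-character and numeric filters in the functions below.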

In [11]:
def clean_with_nltk(doc):
    
    stop_words = set(stopwords.words('english'))
    punctuation = set(string.punctuation)
    lemmatizer = WordNetLemmatizer()
    lower_case_sentences = doc.lower().split()

    stop_free = " ".join([word for word in lower_case_sentences if word not in stop_words])             # only keep words that are not stopwords
    # print(stop_free)
    punc_free = "".join(ch for ch in stop_free if ch not in punctuation and not ch.isnumeric() and not ch == "•")         # only keep characters that are not punctuation and not numbers
    # print(punc_free)
    lemmatized = " ".join(lemmatizer.lemmatize(word) for word in punc_free.split())             # lemmatize words; with no POS tag given, WordNetLemmatizer treats every word as a noun
    # print(lemmatized)

    # We do this separately later for nltk
    # bigrams = list(ngrams(lemmatized, 2))  
    # trigrams = list(ngrams(lemmatized, 3))  
    # bigram_strings = ["_".join(bigram) for bigram in bigrams]  # Join bigram words with an underscore
    # trigram_strings = ["_".join(trigram) for trigram in trigrams]

    return lemmatized 

In [12]:
"""
Old function, kept here because of how stupid it is
def clean_with_spacy(doc):
    
    spacy_doc = nlp(doc.lower())  
    ngrams = [
        ngram.text.replace(" ", "_")    # ngrams are separated by spaces, so we replace them with underscores
        for ngram in extract.ngrams(spacy_doc, n = 2, min_freq = 4, filter_punct = True, filter_nums = True, exclude_pos=["PROPN", "ORG", "DATE", "X"]) 
        if not ngram.text.__contains__("=") 
            and not ngram.text.__contains__("@") 
            and not ngram.text.__contains__("$")
    ]
    
    # Remove stopwords, punctuation, and numeric tokens
    tokens = [
        token.lemma_ 
        for token in spacy_doc 
        if not token.is_stop and not token.is_punct and not token.is_digit and token.is_alpha       # Keep only words that are not stop words
            and token.text not in ["_", "+", "=", "\n","-","*","<",">"]                             # Remove special characters       
            and not len(token.text) == 1                                                            # Remove single character words
    ]    

    tokens = [token.replace("datum", "data") for token in tokens]  # Replace 'datum' (lemma of data) with 'data' for clarity                                                                         
    
    return tokens + ngrams
"""

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  

# Add custom stop words 
nlp.Defaults.stop_words |= {"ubc", "mds", "lecture", "lab", "assignments", "example", "british","columbia", "introduction" ,"page", "file", "question", "ex", "import", "jeffrey", "andrews", "irene", "vrbik", "shan", "du", "ifeoma", "adaji", "gema", "rodrigues", "fatemeh", "fard", "emelie", "gustafsson", "heinz", "bauschke", "travis", "douglas", "jones", "dave", "xiaoping", "shi", "khalad", "hasan", "ladan", "tazik", "ramon", "lawrence", "chu", "miller", "casey", "ritish", "smith", "lee", "university", "ιc", "jan", "feb", "mar", "tn", "pu", "xn", "ee", "sa", "fa", "toys", "bat", "clothing", "apples", "jacknife", "jacket", "following", "treatment", "let", "return", "returns", "true", "nh", "λy", "𝑘th", "ll", "lll", "calibri", "york", "florida", "illinois", "texas", "francisco", "quartersales", "quarterpivot", "food", "wind", "steak", "xlsx", "phd", "na", "kkt", "dur", "earlier", "city", "street", "false"}

def clean_without_ngrams(doc):

    spacy_doc = nlp(doc.lower())

    # Remove stopwords, punctuation, and numeric tokens
    tokens = [
        token.text 
        for token in spacy_doc 
        if not token.is_stop and not token.is_punct and not token.is_digit and token.is_alpha       # Keep only words that are not stop words
            and token.text not in ["_", "+", "=", "\n","-","*","<",">"]                             # Remove special characters       
            and not len(token.text) == 1                                                            # Remove single character words
            # and token.pos_ in ["NOUN", "ADJ", "VERB", "ADV"]                                        # Keep only nouns, adjectives, verbs, and adverbs
    ]    
                                                                           
    return tokens

def lemmatize(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    tokens = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        sent_tokens = []
        for token in doc: 
            if "_" in token.text:
                sent_tokens.append(token.text)
            else:
                if token.pos_ in allowed_postags:
                    sent_tokens.append(token.lemma_)
                    
        sent_tokens = [token.replace("datum", "data") for token in sent_tokens]
        tokens.append(sent_tokens)

    return tokens

#### Cleaning with spaCy 

In [13]:
bag_of_words_list = [clean_without_ngrams(doc) for doc in strings_list]

bigram = Phrases(bag_of_words_list, min_count=10, threshold=20) 
bigram_mod = Phraser(bigram)    # For speed

# Add bigrams
bag_of_words_list = [bigram_mod[doc] for doc in bag_of_words_list]

# Lemmatize the words, excluding bigrams
bag_of_words_list = lemmatize(bag_of_words_list)

total_words = 0   # named accumulator, so the built-in sum() is not shadowed
for doc in bag_of_words_list:
    total_words += len(doc)

print(f"Total number of words in the cleaned corpus: {total_words}")

Total number of words in the cleaned corpus: 151447


In [14]:
print(bag_of_words_list[0][:10] + bag_of_words_list[0][-10:])

['learn', 'explain', 'important', 'understand', 'use', 'correct', 'terminology', 'define', 'computer', 'software', 'necessary', 'transform', 'data', 'format', 'excel', 'analysis', 'ubco_master', 'data', 'science', 'data']


#### Cleaning with nltk

In [None]:
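# `corpus` only exists inside combine_text_files_to_list(); the loaded documents are in strings_list
nltk_cleaned_corpus = [clean_with_nltk(doc).split() for doc in strings_list]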
print(nltk_cleaned_corpus[0])

In [None]:
total_words = 0   # named accumulator, so the built-in sum() is not shadowed
for doc in nltk_cleaned_corpus:
    total_words += len(doc)

print(f"Total number of words in the cleaned corpus: {total_words}")

Total number of words in the cleaned corpus: 181461


In [None]:
bigram = Phrases(nltk_cleaned_corpus, min_count=10, connector_words=ENGLISH_CONNECTOR_WORDS)  
# trigram = Phrases(bigram[clean_corpus], threshold=10, connector_words=ENGLISH_CONNECTOR_WORDS)

bigram_mod = Phraser(bigram)
# trigram_mod = Phraser(trigram)

# add bigrams to the clean corpus (the trigram step above is commented out)
corpus_with_bigrams = [bigram_mod[doc] for doc in nltk_cleaned_corpus]

total_words = 0   # named accumulator, so the built-in sum() is not shadowed
for doc in corpus_with_bigrams:
    total_words += len(doc)

print(f"Total number of words in the nltk corpus with ngrams: {total_words}")

<class 'list'>
Total number of words in the corpus with ngrams: 164907


#### Preprocessing into Document-Term matrix and id2word dictionary 

In [139]:
# Create a dictionary mapping token ID integers to words
dictionary = Dictionary(bag_of_words_list)    

# Create the document-term matrix in sparse form: one list per document of (token_id, count) pairs.
# Conceptually this is a d x |V| matrix where entry (i, j) is how many times token j appears in document i; zero counts are simply omitted.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in bag_of_words_list]  

print(doc_term_matrix[0])

[(0, 1), (1, 3), (2, 1), (3, 1), (4, 4), (5, 6), (6, 1), (7, 1), (8, 2), (9, 1), (10, 2), (11, 13), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 4), (20, 2), (21, 1), (22, 5), (23, 6), (24, 22), (25, 1), (26, 4), (27, 1), (28, 12), (29, 2), (30, 3), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 15), (37, 2), (38, 3), (39, 4), (40, 2), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 2), (50, 1), (51, 1), (52, 1), (53, 19), (54, 1), (55, 1), (56, 1), (57, 2), (58, 1), (59, 4), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 2), (68, 1), (69, 65), (70, 1), (71, 5), (72, 6), (73, 1), (74, 1), (75, 2), (76, 3), (77, 1), (78, 1), (79, 1), (80, 10), (81, 4), (82, 3), (83, 3), (84, 1), (85, 2), (86, 2), (87, 1), (88, 1), (89, 1), (90, 2), (91, 1), (92, 4), (93, 1), (94, 2), (95, 1), (96, 1), (97, 20), (98, 11), (99, 2), (100, 1), (101, 1), (102, 2), (103, 2), (104, 2), (105, 1), (106, 1), (107, 1), (108, 1), (109, 6), 

In [140]:
dictionary.save_as_text("lectures_dictionary.txt")

In [105]:
print(len(dictionary))

9922
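
The relationship between the sparse `(token_id, count)` pairs printed above and the conceptual d x |V| matrix can be sketched in plain Python. This is only an illustration: the helper names are hypothetical, and the IDs gensim's `Dictionary` assigns may follow a different order than the first-appearance order assumed here.

```python
from collections import Counter

def build_token_ids(bags):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token2id = {}
    for bag in bags:
        for token in bag:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_sparse_bow(bag, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in bag)
    return sorted(counts.items())

def to_dense_row(sparse_bow, vocab_size):
    """Expand the sparse pairs into one row of the d x |V| matrix."""
    row = [0] * vocab_size
    for token_id, count in sparse_bow:
        row[token_id] = count
    return row

toy_bags = [["data", "model", "data"], ["model", "value"]]
token2id = build_token_ids(toy_bags)
sparse = [to_sparse_bow(bag, token2id) for bag in toy_bags]
dense = [to_dense_row(row, len(token2id)) for row in sparse]
print(sparse)
print(dense)
```

With 162 documents and 9922 tokens, most entries of the dense matrix would be zero, which is why the sparse form is the one passed to the models.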


### Topic modeling

#### First run of LDA

In [16]:
NUM_TOPICS = 23
PATH_TO_MODEL = f"162_Lectures_Test_LDA_spacy_{NUM_TOPICS}_topics"
lda_model = None

In [35]:
from gensim.models import LdaModel
# from pprint import pprint

lda_model = LdaModel(doc_term_matrix, num_topics=NUM_TOPICS, id2word = dictionary, random_state=448)
lda_model.show_topics(num_topics = -1, num_words = 10)
# pprint(lda_model.print_topics(num_topics=NUM_TOPICS, num_words=3))

[(0,
  '0.016*"function" + 0.015*"data" + 0.012*"value" + 0.010*"use" + 0.007*"test" + 0.006*"number" + 0.006*"list" + 0.006*"create" + 0.005*"column" + 0.005*"set"'),
 (1,
  '0.026*"data" + 0.012*"model" + 0.011*"value" + 0.008*"function" + 0.007*"probability" + 0.006*"variable" + 0.005*"time" + 0.005*"estimate" + 0.005*"sample" + 0.005*"number"'),
 (2,
  '0.011*"data" + 0.009*"function" + 0.009*"value" + 0.008*"set" + 0.008*"model" + 0.006*"sample" + 0.006*"group" + 0.005*"method" + 0.005*"probability" + 0.005*"number"'),
 (3,
  '0.020*"data" + 0.011*"value" + 0.008*"model" + 0.006*"function" + 0.006*"sample" + 0.005*"class" + 0.005*"use" + 0.004*"estimate" + 0.003*"number" + 0.003*"variable"'),
 (4,
  '0.010*"data" + 0.007*"package" + 0.007*"function" + 0.007*"use" + 0.006*"test" + 0.006*"error" + 0.005*"value" + 0.005*"table" + 0.005*"number" + 0.004*"select"'),
 (5,
  '0.013*"model" + 0.010*"data" + 0.007*"value" + 0.006*"sample" + 0.005*"set" + 0.005*"number" + 0.005*"variable" +

Each row corresponds to a topic, and the coefficient next to each word is the probability of that word being sampled from that topic. The order of the rows is arbitrary. Note that each row actually contains |V| coefficients that sum to 1; here we only show the top 10 words, sorted by their coefficients.
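
As a rough sketch of what "top 10 words sorted by their coefficients" means, the snippet below ranks the entries of a single topic row. The function name, the four-word vocabulary, and the probabilities are hypothetical; a real row would have |V| entries.

```python
def top_n_words(topic_row, id2word, n=10):
    """Return the n most probable (word, probability) pairs of one topic."""
    ranked = sorted(enumerate(topic_row), key=lambda pair: pair[1], reverse=True)
    return [(id2word[token_id], prob) for token_id, prob in ranked[:n]]

# Hypothetical topic over a 4-word vocabulary; the probabilities sum to 1.
toy_id2word = {0: "data", 1: "model", 2: "value", 3: "sample"}
toy_topic = [0.5, 0.2, 0.25, 0.05]

top2 = top_n_words(toy_topic, toy_id2word, n=2)
print(top2)
```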

These topics, however, are not very good. There is arguably no automatic quantitative metric that reliably measures how "good" the topics found by a topic model are. One candidate is "coherence": the "u_mass" variant is based on the co-occurrence of word pairs across documents, while "c_v" has been found to have the highest correlation with human ratings. Here is an example using "u_mass".

In [36]:
lda_model.top_topics(doc_term_matrix, dictionary=dictionary, coherence='u_mass', topn=10)

[([(0.02506777, 'data'),
   (0.008995723, 'model'),
   (0.008952424, 'function'),
   (0.0067993742, 'value'),
   (0.0054631867, 'time'),
   (0.0052471864, 'set'),
   (0.005177578, 'use'),
   (0.004867948, 'number'),
   (0.0047514704, 'output'),
   (0.004589467, 'variable')],
  -0.3354164662938507),
 ([(0.02665689, 'data'),
   (0.012298621, 'value'),
   (0.010319182, 'model'),
   (0.008858897, 'function'),
   (0.0065668197, 'distribution'),
   (0.0062051914, 'number'),
   (0.0058252197, 'set'),
   (0.0049118996, 'use'),
   (0.0046668258, 'time'),
   (0.004485156, 'sample')],
  -0.388593523397973),
 ([(0.013296178, 'data'),
   (0.010989369, 'model'),
   (0.006540581, 'value'),
   (0.006004543, 'sample'),
   (0.0056757624, 'use'),
   (0.0049256054, 'function'),
   (0.0048304754, 'number'),
   (0.004410559, 'estimate'),
   (0.0043707774, 'set'),
   (0.003883086, 'problem')],
  -0.4292133121228956),
 ([(0.009532973, 'data'),
   (0.0066517214, 'function'),
   (0.0055661397, 'value'),
   (0.0

As you can see, these values (the negative floats after each list of 10 words) are hard to interpret, so we won't use these metrics moving forward and will instead judge the topics subjectively. Also, as we'll see later, better topics seem to have *worse* coherence values.
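
To give some intuition for why these scores tend to be negative, here is a minimal sketch of the UMass coherence as defined by Mimno et al. (2011): a sum of log ratios of co-document frequency to document frequency over ordered pairs of top words. This is an assumption about the underlying formula, not a copy of gensim's code; gensim's `u_mass` may normalise or aggregate the pairs differently, so the numbers are not directly comparable. The toy documents are hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over pairs with w_l ranked above w_m."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):  # l < m
        w_l, w_m = top_words[l], top_words[m]
        score += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return score

toy_docs = [["data", "model"], ["data", "value"], ["data"], ["data", "sample"]]
score = umass_coherence(["data", "model", "value"], toy_docs)
print(score)
```

Each term is the log of a ratio that is usually below 1 (a pair cannot co-occur in more documents than the higher-ranked word appears in), so the score moves towards 0 only when the top words almost always appear together.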

In [37]:
from gensim.test.utils import datapath
lda_model.save(datapath(PATH_TO_MODEL))

# Datapath: c:\Users\syeda\miniconda3\envs\dir-st\lib\site-packages\gensim\test\test_data\

Visualizing with pyLDAvis

In [38]:
# To save computation, previous results can be loaded from the disk after running only the cell where PATH_TO_MODEL is defined
lda_model_to_display = LdaModel.load(datapath(PATH_TO_MODEL)) if lda_model is None else lda_model 

# Options for 'mds' (dimensionality reduction): mds = 'pcoa' (Principal Coordinate Analysis), 'tsne', 'mmds'
LDAvis_prepared = gensim_vis.prepare(lda_model_to_display, doc_term_matrix, dictionary, mds='mmds')
pyLDAvis.display(LDAvis_prepared)

# To save the visualization to an HTML file
pyLDAvis.save_html(LDAvis_prepared, 'Lectures_Test_run_LDA_'+ str(NUM_TOPICS) + '.html')

In [39]:
LDAvis_prepared = gensim_vis.prepare(lda_model_to_display, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, 'Lectures_Test_run_LDA_'+ str(NUM_TOPICS) + '_pcoa.html')

How to interpret the visualizations: 

- The size of each circle reflects how prevalent the topic is in the corpus (i.e. topics whose words occur more frequently in the corpus are larger).
- The distance between circles reflects how dissimilar the topics' word distributions are.
- Overlap between circles represents some similarity in the per-word probabilities for those topics.
- Note that the radii of the circles and the distances between circles are not on the same scale; they are presented this way for simplicity of visualization.

Rough overview of how the visualization is generated:

- First, the k x |V| topic-word (phi) matrix from the LdaModel object is used to calculate pairwise distances between topics using Jensen-Shannon divergence, a measure of the dissimilarity between two probability distributions (recall that each row of this matrix is a probability distribution over all words in the vocabulary).
- Then, this distance matrix is projected into a 2D plane using the various options for dimensionality reduction techniques accepted as parameters. 
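
The first step can be sketched as follows. This is a plain-Python illustration of Jensen-Shannon divergence on two hypothetical three-word topics, not the code pyLDAvis runs; the function names and toy distributions are assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Two hypothetical topics over a 3-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
topics = [topic_a, topic_b]

# Pairwise k x k distance matrix, which would then be projected to 2D.
distances = [[jensen_shannon(p, q) for q in topics] for p in topics]
print(distances)
```

The divergence is 0 for identical topics and at most log 2 (with natural logarithms), so topics dominated by the same few words end up close together in the plot.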

For more info, see: https://pyldavis.readthedocs.io/en/latest/_modules/pyLDAvis/_prepare.html#js_PCoA

Here we show two dimensionality reduction techniques on the same LDA model: MMDS (Metric Multidimensional Scaling) and PCoA (Principal Coordinate Analysis, which is different from Principal Component Analysis).

Commonly occurring words like 'data', 'model', 'value', and 'function' clog up the outputs for most of the topics. Beyond that, most topics seem too general to be given narrower labels like "machine learning" or "databases". Lastly, topics after topic 13 don't seem to have many words within them, so around 10-12 topics should also be explored.
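
One possible way to reduce the influence of words that appear in almost every document is to filter tokens by document frequency. The sketch below is a standard-library illustration with a hypothetical helper and toy data; the parameter names are chosen to mirror gensim's `Dictionary.filter_extremes(no_below, no_above)`, whose exact behaviour should be checked in its documentation before relying on it here.

```python
from collections import Counter

def filter_by_document_frequency(bags, no_below=1, no_above=0.5):
    """Keep tokens found in at least no_below docs and at most a no_above share of docs."""
    n_docs = len(bags)
    # Count each token once per document it appears in.
    doc_freq = Counter(token for bag in bags for token in set(bag))
    keep = {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[token for token in bag if token in keep] for bag in bags]

toy_bags = [
    ["data", "model", "regression"],
    ["data", "table", "query"],
    ["data", "model", "layer"],
    ["data", "query", "index"],
]
filtered = filter_by_document_frequency(toy_bags, no_below=1, no_above=0.5)
print(filtered)
```

In this toy corpus "data" appears in every document and is removed, while words shared by only half of the documents survive. Whether this helps on the lecture corpus is untested here.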

#### Testing an HDP model

HDP stands for Hierarchical Dirichlet Process. For an HDP topic model, we don't need to provide `k`, the number of topics.

In [40]:
from gensim.models import HdpModel
# from pprint import pprint

hdp_model = HdpModel(doc_term_matrix, id2word = dictionary)
# hdp_model.optimal_ordering()
hdp_model.show_topics(num_topics=15)

[(0,
  '0.014*sample + 0.010*data + 0.007*probability + 0.006*population + 0.004*unit + 0.004*error + 0.004*use + 0.004*model + 0.003*distribution + 0.003*number + 0.003*value + 0.003*test + 0.003*estimate + 0.003*information + 0.003*function + 0.002*stratum + 0.002*mean + 0.002*student + 0.002*design + 0.002*system'),
 (1,
  '0.007*model + 0.007*value + 0.006*data + 0.005*estimate + 0.005*block + 0.004*mean + 0.004*distribution + 0.003*use + 0.003*variable + 0.003*plot + 0.003*time + 0.003*response + 0.003*parameter + 0.003*sample + 0.003*prediction + 0.003*selection + 0.003*problem + 0.002*proposal + 0.002*error + 0.002*design'),
 (2,
  '0.015*data + 0.007*rdd + 0.005*model + 0.005*spark + 0.004*time + 0.004*function + 0.004*number + 0.004*value + 0.004*set + 0.004*variable + 0.003*point + 0.003*notation + 0.003*partition + 0.003*use + 0.003*predictive_modelling + 0.003*program + 0.003*wage_genes + 0.003*process + 0.002*distribute + 0.002*output'),
 (3,
  '0.009*function + 0.007*valu

In [41]:
hdp_model.optimal_ordering()
hdp_model.show_topics(num_topics=15)

[(0,
  '0.014*sample + 0.010*data + 0.007*probability + 0.006*population + 0.004*unit + 0.004*error + 0.004*use + 0.004*model + 0.003*distribution + 0.003*number + 0.003*value + 0.003*test + 0.003*estimate + 0.003*information + 0.003*function + 0.002*stratum + 0.002*mean + 0.002*student + 0.002*design + 0.002*system'),
 (1,
  '0.007*model + 0.007*value + 0.006*data + 0.005*estimate + 0.005*block + 0.004*mean + 0.004*distribution + 0.003*use + 0.003*variable + 0.003*plot + 0.003*time + 0.003*response + 0.003*parameter + 0.003*sample + 0.003*prediction + 0.003*selection + 0.003*problem + 0.002*proposal + 0.002*error + 0.002*design'),
 (2,
  '0.015*data + 0.007*rdd + 0.005*model + 0.005*spark + 0.004*time + 0.004*function + 0.004*number + 0.004*value + 0.004*set + 0.004*variable + 0.003*point + 0.003*notation + 0.003*partition + 0.003*use + 0.003*predictive_modelling + 0.003*program + 0.003*wage_genes + 0.003*process + 0.002*distribute + 0.002*output'),
 (3,
  '0.009*function + 0.007*valu

These topics are better! Broadly speaking, we can see topics for probability and modeling, supervised learning techniques, databases, Python programming, Excel, etc. Towards the end of the list we do see some junk topics, and as you'll see, many more junk topics were actually found after topic 14.

In [42]:
alpha, beta = hdp_model.hdp_to_lda()
print(alpha.shape, beta.shape)

(150,) (150, 9922)


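However, we see that the HDP model actually holds 150 topics, most of which are junk. This number matches the default top-level truncation (`T=150`) of gensim's `HdpModel`, so it appears to be an upper bound set by the implementation rather than a count inferred from the data. Obtaining a roughly equivalent LDA model with `suggested_lda_model()`, which reuses the alpha and beta from the HDP model, we see the following: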

In [43]:
suggested_lda_model = hdp_model.suggested_lda_model()
suggested_lda_model.show_topics(num_topics=10, num_words=15)

[(138,
  '0.000*"misclassi" + 0.000*"guenin" + 0.000*"consist" + 0.000*"clo" + 0.000*"concise" + 0.000*"email" + 0.000*"involve" + 0.000*"correspondingly" + 0.000*"chlostrol" + 0.000*"omitting" + 0.000*"mean_squared" + 0.000*"negligible" + 0.000*"loosely" + 0.000*"nonrespondent" + 0.000*"windstate"'),
 (74,
  '0.000*"init" + 0.000*"bid" + 0.000*"submodule" + 0.000*"intend" + 0.000*"actcost" + 0.000*"float" + 0.000*"strangitie" + 0.000*"seii" + 0.000*"obstacle" + 0.000*"motivate" + 0.000*"scrapy" + 0.000*"hmc" + 0.000*"preliminary" + 0.000*"dyno" + 0.000*"packagesg"'),
 (82,
  '0.000*"hurry" + 0.000*"operating" + 0.000*"result" + 0.000*"instantiate" + 0.000*"sized" + 0.000*"judgement" + 0.000*"isdecimal" + 0.000*"housing" + 0.000*"destine" + 0.000*"equijoin" + 0.000*"ˆiα" + 0.000*"destination" + 0.000*"eavesdrop" + 0.000*"nosniff" + 0.000*"dimensionsy"'),
 (115,
  '0.000*"illiteracy" + 0.000*"uncorrelate" + 0.000*"generl" + 0.000*"proper" + 0.000*"poisson_processes" + 0.000*"stump" + 0.

These topics don't make sense. Notice how every word probability rounds to 0.000.
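
The topic IDs printed above (138, 74, 82, ...) are not the first topics of the model, so it may help to rank topics by weight explicitly before inspecting them. Below is a minimal sketch, assuming `alpha` is a sequence of per-topic weights like the one returned by `hdp_to_lda()`; the helper names, the threshold, and the weights are hypothetical.

```python
def rank_topics_by_weight(alpha, top_k=5):
    """Return (topic_id, weight) pairs for the top_k heaviest topics."""
    order = sorted(range(len(alpha)), key=lambda i: alpha[i], reverse=True)
    return [(i, alpha[i]) for i in order[:top_k]]

def count_topics_above(alpha, min_share=0.01):
    """Count topics whose share of the total weight is at least min_share."""
    total = sum(alpha)
    return sum(1 for a in alpha if a / total >= min_share)

# Hypothetical weights for 6 topics; a real HDP alpha here would have 150 entries.
toy_alpha = [0.40, 0.30, 0.002, 0.25, 0.001, 0.047]

top3 = rank_topics_by_weight(toy_alpha, top_k=3)
n_substantial = count_topics_above(toy_alpha, min_share=0.01)
print(top3, n_substantial)
```

A count like `n_substantial` could serve as a rough starting point for the number of LDA topics, although the 1% threshold is arbitrary.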

In [44]:
LDAvis_prepared = gensim_vis.prepare(suggested_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, 'Lectures_Test_run_HDP2LDA.html')

Maybe `suggested_lda_model()` doesn't work as intended; either way, the visualization is not very appealing.

However, we saw earlier from the top 15 topics generated by HDP that obtaining meaningful topics is possible.

We will now try tweaking some of the parameters and inputs for the LDA model, until the results are satisfactory. LDA is more desirable as we can visualize the results with pyLDAvis, get document-topic probabilities with built-in functions, and generally be able to explain the results a bit better.

#### Testing different parameters for LDA

In [27]:
# Changing number of topics 
new_lda_model = LdaModel(doc_term_matrix, num_topics=5, id2word = dictionary)
new_lda_model.show_topics(num_words=20)

[(0,
  '0.022*"data" + 0.008*"function" + 0.007*"value" + 0.007*"use" + 0.007*"sample" + 0.007*"test" + 0.006*"set" + 0.005*"model" + 0.004*"estimate" + 0.004*"class" + 0.004*"method" + 0.004*"code" + 0.004*"database" + 0.003*"number" + 0.003*"error" + 0.003*"package" + 0.003*"create" + 0.003*"case" + 0.003*"mean" + 0.003*"time"'),
 (1,
  '0.012*"data" + 0.010*"function" + 0.010*"value" + 0.008*"model" + 0.007*"use" + 0.006*"number" + 0.005*"time" + 0.004*"give" + 0.004*"sample" + 0.004*"set" + 0.004*"select" + 0.003*"create" + 0.003*"regression" + 0.003*"distribution" + 0.003*"result" + 0.003*"need" + 0.003*"table" + 0.003*"prior" + 0.003*"list" + 0.003*"add"'),
 (2,
  '0.029*"data" + 0.009*"value" + 0.007*"model" + 0.006*"function" + 0.005*"number" + 0.005*"set" + 0.005*"column" + 0.004*"use" + 0.004*"group" + 0.004*"dataframe" + 0.004*"probability" + 0.004*"create" + 0.004*"output" + 0.003*"time" + 0.003*"variable" + 0.003*"mean" + 0.003*"new" + 0.003*"key" + 0.003*"problem" + 0.003

There is again a lot of overlap, with the top 3 words being similar in all topics and "data", "function", and "value" occurring frequently in multiple topics. We likely need more topics, as we can see that topic 0 mixes some database concepts with general terms like "sample" and "estimate".
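
The amount of overlap between topics can also be put into a number. A simple option, sketched here under the assumption that each topic is summarised by its list of top words, is the Jaccard similarity between those lists. The helper names and toy topics are hypothetical and this is not a metric used elsewhere in this notebook.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of words common to both lists, relative to all distinct words in either."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def mean_pairwise_overlap(topics_top_words):
    """Average Jaccard similarity over all pairs of topics."""
    pairs = list(combinations(topics_top_words, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical top-4 word lists for three topics.
toy_topics = [
    ["data", "function", "value", "use"],
    ["data", "function", "value", "model"],
    ["data", "value", "model", "number"],
]

overlap = mean_pairwise_overlap(toy_topics)
print(round(overlap, 3))
```

A value near 1 means the topics share almost all of their top words, while a value near 0 means they are clearly separated, so this could be tracked while varying `num_topics` or `passes`.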

In [45]:
# Changing number of topics 
new_lda_model = LdaModel(doc_term_matrix, num_topics=13, id2word = dictionary)
new_lda_model.show_topics(num_topics = 13, num_words = 10)

[(0,
  '0.023*"data" + 0.007*"value" + 0.005*"use" + 0.005*"model" + 0.005*"number" + 0.005*"function" + 0.004*"set" + 0.003*"test" + 0.003*"find" + 0.003*"probability"'),
 (1,
  '0.023*"data" + 0.007*"function" + 0.007*"value" + 0.006*"model" + 0.005*"set" + 0.005*"sample" + 0.005*"use" + 0.004*"create" + 0.004*"result" + 0.004*"class"'),
 (2,
  '0.014*"data" + 0.009*"value" + 0.008*"use" + 0.006*"create" + 0.006*"function" + 0.005*"set" + 0.004*"output" + 0.004*"dataframe" + 0.004*"list" + 0.003*"table"'),
 (3,
  '0.015*"model" + 0.015*"data" + 0.009*"value" + 0.006*"use" + 0.005*"set" + 0.005*"time" + 0.005*"distribution" + 0.004*"number" + 0.004*"probability" + 0.004*"function"'),
 (4,
  '0.028*"data" + 0.014*"function" + 0.012*"model" + 0.011*"value" + 0.006*"use" + 0.005*"regression" + 0.005*"test" + 0.005*"set" + 0.005*"variable" + 0.005*"error"'),
 (5,
  '0.016*"data" + 0.010*"value" + 0.007*"model" + 0.006*"create" + 0.006*"class" + 0.006*"use" + 0.006*"function" + 0.005*"outp

It has somehow gotten worse at capturing some expected topics like databases from earlier. It seems that the underlying problem might be in another parameter.

In [46]:
# Trying multiple passes through the corpus
new_lda_model = LdaModel(doc_term_matrix, num_topics = 13, id2word = dictionary, passes = 10, random_state=448)
new_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.019*"function" + 0.018*"value" + 0.015*"data" + 0.012*"use" + 0.010*"test" + 0.010*"list" + 0.008*"code" + 0.008*"create" + 0.008*"column" + 0.008*"number"'),
 (1,
  '0.024*"data" + 0.018*"model" + 0.012*"value" + 0.010*"function" + 0.009*"probability" + 0.009*"estimate" + 0.008*"distribution" + 0.008*"variable" + 0.007*"regression" + 0.006*"mean"'),
 (2,
  '0.021*"set" + 0.014*"function" + 0.013*"pc" + 0.012*"point" + 0.012*"convex" + 0.011*"case" + 0.010*"method" + 0.010*"minimizer" + 0.010*"give" + 0.009*"let"'),
 (3,
  '0.029*"data" + 0.009*"problem" + 0.008*"class" + 0.008*"big" + 0.006*"linear_discriminant" + 0.006*"machine" + 0.005*"course" + 0.005*"analysis" + 0.005*"work" + 0.005*"model"'),
 (4,
  '0.024*"package" + 0.012*"problem" + 0.011*"module" + 0.010*"function" + 0.009*"condition" + 0.009*"solution" + 0.009*"error" + 0.009*"pypi" + 0.007*"solve" + 0.006*"let"'),
 (5,
  '0.020*"layer" + 0.016*"relationship" + 0.015*"model" + 0.014*"network" + 0.013*"convolution"

These topics are not great, but we can see some topics for databases, probability, supervised learning, optimization, etc. These aren't as well-defined as the HDP topics we saw earlier, but this is a step in the right direction. 

We can see if this improves with number of passes.

In [47]:
# Trying multiple passes through the corpus
new_lda_model = LdaModel(doc_term_matrix, num_topics = 13, id2word = dictionary, passes = 25, random_state=448)
new_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.020*"function" + 0.018*"value" + 0.015*"data" + 0.012*"use" + 0.010*"list" + 0.010*"test" + 0.008*"code" + 0.008*"create" + 0.008*"string" + 0.008*"number"'),
 (1,
  '0.025*"data" + 0.019*"model" + 0.012*"value" + 0.010*"function" + 0.009*"estimate" + 0.009*"probability" + 0.008*"variable" + 0.008*"distribution" + 0.007*"regression" + 0.007*"mean"'),
 (2,
  '0.022*"set" + 0.015*"function" + 0.014*"pc" + 0.013*"point" + 0.012*"convex" + 0.011*"minimizer" + 0.011*"case" + 0.011*"method" + 0.010*"let" + 0.010*"give"'),
 (3,
  '0.031*"data" + 0.009*"big" + 0.009*"problem" + 0.008*"class" + 0.007*"linear_discriminant" + 0.006*"machine" + 0.006*"course" + 0.006*"analysis" + 0.005*"work" + 0.005*"privacy"'),
 (4,
  '0.026*"package" + 0.015*"module" + 0.011*"problem" + 0.010*"condition" + 0.010*"error" + 0.009*"pypi" + 0.009*"function" + 0.008*"solution" + 0.007*"solve" + 0.006*"install"'),
 (5,
  '0.021*"layer" + 0.016*"relationship" + 0.015*"network" + 0.014*"model" + 0.013*"convol

These topics are definitely easier to distinguish.

In [53]:
new_lda_model = LdaModel(doc_term_matrix, num_topics = 13, id2word = dictionary, passes = 50, random_state=448)
new_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.020*"function" + 0.018*"value" + 0.015*"data" + 0.012*"use" + 0.010*"list" + 0.010*"test" + 0.009*"code" + 0.008*"create" + 0.008*"string" + 0.008*"number"'),
 (1,
  '0.025*"data" + 0.020*"model" + 0.012*"value" + 0.010*"function" + 0.010*"estimate" + 0.009*"probability" + 0.008*"variable" + 0.008*"regression" + 0.008*"distribution" + 0.007*"mean"'),
 (2,
  '0.022*"set" + 0.015*"function" + 0.014*"pc" + 0.014*"point" + 0.012*"convex" + 0.012*"minimizer" + 0.011*"let" + 0.011*"method" + 0.011*"case" + 0.010*"give"'),
 (3,
  '0.032*"data" + 0.009*"big" + 0.008*"problem" + 0.007*"course" + 0.007*"analysis" + 0.007*"linear_discriminant" + 0.007*"machine" + 0.007*"class" + 0.005*"work" + 0.005*"time"'),
 (4,
  '0.029*"package" + 0.016*"module" + 0.011*"problem" + 0.010*"error" + 0.010*"condition" + 0.010*"pypi" + 0.007*"function" + 0.007*"install" + 0.007*"code" + 0.007*"solution"'),
 (5,
  '0.022*"layer" + 0.017*"network" + 0.016*"relationship" + 0.014*"model" + 0.014*"convolutio

These are the best so far! While some topics might not be all that clear, Topic 2 is probability/statistics, Topic 3 is linear optimization, Topic 4 seems to be Python programming, Topic 5 is neural networks, Topic 6 is database related, Topic 7 is Bayesian stats, and Topic 8 is supervised learning techniques.

Let's see if more passes improves the quality of topics.

In [49]:
new_lda_model = LdaModel(doc_term_matrix, num_topics = 13, id2word = dictionary, passes = 100, random_state=448)
new_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.021*"function" + 0.018*"value" + 0.014*"data" + 0.012*"use" + 0.011*"list" + 0.010*"test" + 0.009*"code" + 0.009*"create" + 0.008*"string" + 0.008*"number"'),
 (1,
  '0.026*"data" + 0.021*"model" + 0.013*"value" + 0.010*"estimate" + 0.010*"function" + 0.009*"variable" + 0.009*"probability" + 0.008*"regression" + 0.007*"distribution" + 0.007*"mean"'),
 (2,
  '0.022*"set" + 0.016*"function" + 0.014*"point" + 0.014*"pc" + 0.012*"minimizer" + 0.012*"let" + 0.012*"convex" + 0.011*"method" + 0.010*"case" + 0.010*"give"'),
 (3,
  '0.032*"data" + 0.010*"big" + 0.008*"problem" + 0.008*"course" + 0.008*"analysis" + 0.007*"linear_discriminant" + 0.007*"machine" + 0.005*"class" + 0.005*"work" + 0.005*"material"'),
 (4,
  '0.030*"package" + 0.018*"module" + 0.011*"error" + 0.010*"condition" + 0.010*"pypi" + 0.010*"problem" + 0.007*"install" + 0.007*"code" + 0.007*"python_package" + 0.007*"function"'),
 (5,
  '0.023*"layer" + 0.017*"network" + 0.016*"relationship" + 0.014*"convolution" + 0

Topics didn't change much, so we can say that the estimates for the topic distributions converge after a sufficient number of passes.
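
One rough way to back up this observation is to compare the top words of the same topic across two runs. The sketch below is not a formal convergence test: it only parses the strings returned by `show_topics` and computes the Jaccard overlap of the two word sets. The helper names `top_words_from` and `jaccard` are our own, and the two example strings are Topic 0 from the 50-pass and 100-pass runs above.

```python
import re

def top_words_from(topic_string):
    """Extract the quoted words from a gensim show_topics string."""
    return re.findall(r'"([^"]+)"', topic_string)

def jaccard(words_a, words_b):
    """Jaccard overlap between two word lists, ignoring order and weights."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

# Topic 0 as printed for passes = 50 and passes = 100
topic0_50 = '0.020*"function" + 0.018*"value" + 0.015*"data" + 0.012*"use" + 0.010*"list" + 0.010*"test" + 0.009*"code" + 0.008*"create" + 0.008*"string" + 0.008*"number"'
topic0_100 = '0.021*"function" + 0.018*"value" + 0.014*"data" + 0.012*"use" + 0.011*"list" + 0.010*"test" + 0.009*"code" + 0.009*"create" + 0.008*"string" + 0.008*"number"'

overlap = jaccard(top_words_from(topic0_50), top_words_from(topic0_100))
print(overlap)  # 1.0, since both runs list the same ten words
```

An overlap near 1 for most topics would support the claim; this check ignores the weights, so it cannot detect small shifts in probability.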

In [54]:
LDAvis_prepared = gensim_vis.prepare(new_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, 'Lectures_100pass_LDA_13.html')

From the visualization, we can see clear zones with probability, excel, databases, and Python programming! It is especially interesting to see how there is a supercluster for probability distributions, which encompasses/is close to a topic about Markov-Chain related theory! Also, it should be noted that many topics are mostly junk and contain seemingly unrelated words while accounting for few tokens.

We should eventually home in on an appropriate value for k, after deciding on all the other parameter values.

Before that, we can also try changing another parameter in LdaModel called 'alpha'. The default value of alpha is 'symmetric', which places a symmetric Dirichlet prior on each document's distribution over topics.

Essentially, this means that all our topic models until now have assumed alpha to be a vector of length k where all values are equal (the default value is 1/k).
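
To make the role of alpha more concrete, the sketch below draws document-topic mixtures from a Dirichlet prior using only the standard library. It illustrates the prior itself, not what gensim does internally. `sample_dirichlet` is a hypothetical helper, and the asymmetric vector is made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

k = 13
rng = random.Random(448)

symmetric_alpha = [1.0 / k] * k              # every topic gets the same prior weight
asymmetric_alpha = [1.0] + [0.05] * (k - 1)  # made-up example that favours topic 0

sym_mix = sample_dirichlet(symmetric_alpha, rng)
asym_mix = sample_dirichlet(asymmetric_alpha, rng)

# With alpha values well below 1, most of the mass tends to land on a few topics
print([round(p, 3) for p in sym_mix])
print([round(p, 3) for p in asym_mix])
```

With a symmetric prior no topic is favoured before seeing the data, while an asymmetric prior lets some topics be expected more often than others.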

In [51]:
print(new_lda_model.alpha)

# Since k = 13, all values of alpha will be 1/13

[0.07692308 0.07692308 0.07692308 0.07692308 0.07692308 0.07692308
 0.07692308 0.07692308 0.07692308 0.07692308 0.07692308 0.07692308
 0.07692308]


In [55]:
# Testing alpha = 'auto', meaning that the alpha prior is no longer assumed to be symmetric and is learned from the data. 

auto_lda_model = LdaModel(doc_term_matrix, num_topics = 13, id2word = dictionary, alpha = 'auto', passes = 50)
auto_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.025*"image" + 0.019*"convolution" + 0.019*"markov_chain" + 0.019*"state" + 0.013*"transition" + 0.012*"compartment" + 0.011*"probability" + 0.011*"cross_validation" + 0.010*"model" + 0.010*"layer"'),
 (1,
  '0.036*"data" + 0.016*"value" + 0.015*"model" + 0.014*"regression" + 0.010*"relationship" + 0.009*"sample" + 0.007*"variable" + 0.007*"select" + 0.007*"error" + 0.007*"residual"'),
 (2,
  '0.020*"set" + 0.014*"function" + 0.011*"point" + 0.011*"minimizer" + 0.011*"let" + 0.011*"pc" + 0.011*"convex" + 0.011*"method" + 0.011*"problem" + 0.010*"case"'),
 (3,
  '0.027*"class" + 0.015*"factor" + 0.015*"test" + 0.013*"effect" + 0.011*"block" + 0.011*"experiment" + 0.010*"design" + 0.009*"classifier" + 0.009*"hyperplane" + 0.008*"support_vector"'),
 (4,
  '0.012*"time" + 0.012*"data" + 0.010*"rdd" + 0.008*"function" + 0.008*"cell" + 0.008*"problem" + 0.008*"point" + 0.007*"find" + 0.007*"value" + 0.006*"number"'),
 (5,
  '0.016*"data" + 0.011*"probability" + 0.009*"privacy" + 0.0

In [57]:
print(auto_lda_model.alpha)

[0.01037392 0.02794973 0.01626494 0.01367742 0.02262196 0.01463735
 0.02639266 0.03807209 0.03883517 0.01059209 0.04052699 0.01785196
 0.02190316]


In [56]:
LDAvis_prepared = gensim_vis.prepare(auto_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, 'Lectures_Alpha_LDA_13.html')

This is even better! Neural networks, traditional machine learning algorithms, predictive modeling on time series, etc... Plus a new smaller cluster for command line and source control. Seems like 10 topics is a good amount. 

We can do the same thing with the parameter "eta", which is the name gensim uses for the "beta" Dirichlet prior from the paper (the prior on each topic's distribution over words).

In [58]:
auto_lda_model = LdaModel(doc_term_matrix, num_topics = 13, id2word = dictionary, alpha = 'auto', eta = 'auto', passes = 50)
auto_lda_model.show_topics(num_topics = -1, num_words = 13)

[(0,
  '0.034*"database" + 0.023*"data" + 0.021*"user" + 0.013*"privilege" + 0.010*"view" + 0.010*"macro" + 0.010*"object" + 0.008*"create" + 0.008*"access" + 0.008*"grant" + 0.007*"use" + 0.007*"select" + 0.007*"update"'),
 (1,
  '0.022*"model" + 0.020*"distribution" + 0.015*"prior" + 0.013*"data" + 0.012*"probability" + 0.012*"value" + 0.011*"posterior" + 0.011*"sample" + 0.009*"function" + 0.009*"normal" + 0.008*"mean" + 0.008*"number" + 0.008*"simulate"'),
 (2,
  '0.016*"table" + 0.009*"value" + 0.008*"select" + 0.008*"data" + 0.008*"block" + 0.008*"group" + 0.008*"factor" + 0.007*"employee" + 0.007*"attribute" + 0.007*"query" + 0.007*"relationship" + 0.007*"design" + 0.007*"relation"'),
 (3,
  '0.042*"data" + 0.010*"privacy" + 0.007*"information" + 0.006*"table" + 0.006*"use" + 0.005*"value" + 0.005*"record" + 0.005*"column" + 0.005*"security" + 0.005*"right" + 0.004*"analysis" + 0.004*"key" + 0.004*"time"'),
 (4,
  '0.012*"branch" + 0.011*"create" + 0.011*"change" + 0.010*"add" +

We will look at the visualization directly to judge these topics.

In [59]:
print(auto_lda_model.alpha)
print(auto_lda_model.eta)

[0.02473091 0.01621987 0.00616194 0.01416105 0.01736566 0.01356093
 0.01139303 0.01502603 0.01140727 0.01734054]
[0.10823856 0.08902945 0.08991552 ... 0.08943335 0.09130245 0.09040242]


In [59]:
LDAvis_prepared = gensim_vis.prepare(auto_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, 'Lectures_AlphaEta_LDA_13.html')

From the visualization, we can see that these topics are fairly good. There are very few clearly junk topics, and "nearby" topics seem to be related in the way we would expect. However, looking closer, we can see that some unrelated modules are combined into topics, like visualization and version control in Topic 9. We also don't see a clear topic for deep learning concepts, which are taught in one or two modules. We need to try playing around with the number of topics.

In [60]:
auto_lda_model.top_topics(doc_term_matrix, dictionary=dictionary, coherence='u_mass', topn=10)

[([(0.03080338, 'data'),
   (0.027073132, 'model'),
   (0.012912613, 'estimate'),
   (0.012740664, 'regression'),
   (0.011573426, 'value'),
   (0.010506095, 'function'),
   (0.010155926, 'observation'),
   (0.008580755, 'variable'),
   (0.008272204, 'fit'),
   (0.008062827, 'predictor')],
  -0.6835266504471039),
 ([(0.025470572, 'function'),
   (0.01914074, 'class'),
   (0.013960703, 'test'),
   (0.013805784, 'value'),
   (0.013312178, 'object'),
   (0.013249324, 'list'),
   (0.013077197, 'code'),
   (0.012108417, 'package'),
   (0.011469826, 'string'),
   (0.0109232515, 'use')],
  -0.7212922984581832),
 ([(0.022434076, 'model'),
   (0.020137763, 'distribution'),
   (0.015208991, 'prior'),
   (0.0125170695, 'data'),
   (0.012285216, 'probability'),
   (0.011503181, 'value'),
   (0.0114624845, 'posterior'),
   (0.010662859, 'sample'),
   (0.009220459, 'function'),
   (0.008731245, 'normal')],
  -0.7574259545818814),
 ([(0.021919647, 'data'),
   (0.012912154, 'time'),
   (0.01055452, 's

These coherence scores are worse than the ones we saw in the first model, which is a good enough reason not to use such metrics.
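
For context on what these numbers measure, here is a minimal sketch of the UMass coherence score in the form usually attributed to Mimno et al.: for every ordered pair of top words, take the log of the smoothed co-document frequency over the document frequency of the earlier word. The function name `umass_coherence` and the toy documents are our own, and gensim's `top_topics` may aggregate or normalise the pairwise terms differently, so the values are not directly comparable to the output above.

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style coherence for an ordered list of top words.

    documents is a list of token lists; all counts are document frequencies.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score

# Made-up documents for illustration only
toy_docs = [
    ["prior", "posterior", "likelihood"],
    ["prior", "posterior", "sample"],
    ["table", "query", "select"],
    ["table", "query", "database"],
]

related = umass_coherence(["prior", "posterior"], toy_docs)
unrelated = umass_coherence(["prior", "query"], toy_docs)
print(related, unrelated)
```

Higher scores mean the top words appear together in more documents. With only one document per module, the co-occurrence counts are coarse, which may be part of why the metric is hard to interpret here.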

#### Testing k-values

Note that since the priors are no longer symmetric, the estimation cannot be parallelized, so it will take noticeably longer. We first checked k = 7, 10, 13, 17, and 25.

7 and 10 were not enough, as there was too much overlap between unrelated topics. 25 was too many: it produced several junk topics and some topics with 0% of tokens. We then decided to add 15.

We then compared 15 and 17, and both seemed to have a few confusing topics, with 17 having only two topics (Topics #13 and #16) that contained a few unrelated words. We decided to have one final comparison with 17, 18, and 19 topics.

In [81]:
NUM_TOPICS = 7

In [82]:
final_lda_model = LdaModel(doc_term_matrix, num_topics = NUM_TOPICS, id2word = dictionary, alpha = 'auto', eta = 'auto', passes = 50)

LDAvis_prepared = gensim_vis.prepare(final_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, f'Lectures_Final_{NUM_TOPICS}.html')

final_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.021*"data" + 0.020*"model" + 0.016*"sample" + 0.012*"distribution" + 0.012*"value" + 0.009*"probability" + 0.008*"prior" + 0.008*"estimate" + 0.008*"function" + 0.008*"mean"'),
 (1,
  '0.028*"data" + 0.007*"database" + 0.006*"system" + 0.005*"class" + 0.005*"user" + 0.005*"use" + 0.005*"key" + 0.004*"web" + 0.004*"group" + 0.004*"cluster"'),
 (2,
  '0.019*"data" + 0.019*"model" + 0.011*"predictor" + 0.011*"set" + 0.011*"class" + 0.009*"relationship" + 0.009*"branch" + 0.008*"estimate" + 0.008*"classification" + 0.007*"observation"'),
 (3,
  '0.016*"data" + 0.016*"layer" + 0.015*"model" + 0.011*"network" + 0.010*"input" + 0.009*"training" + 0.008*"output" + 0.007*"image" + 0.007*"function" + 0.006*"value"'),
 (4,
  '0.042*"data" + 0.014*"value" + 0.012*"dataframe" + 0.009*"column" + 0.008*"use" + 0.007*"index" + 0.007*"output" + 0.007*"time" + 0.006*"operation" + 0.006*"rdd"'),
 (5,
  '0.014*"tree" + 0.012*"set" + 0.009*"problem" + 0.008*"factor" + 0.008*"function" + 0.008*"so

In [83]:
NUM_TOPICS = 10

In [84]:
final_lda_model = LdaModel(doc_term_matrix, num_topics = NUM_TOPICS, id2word = dictionary, alpha = 'auto', eta = 'auto', passes = 50)

LDAvis_prepared = gensim_vis.prepare(final_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, f'Lectures_Final_{NUM_TOPICS}.html')

final_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.023*"data" + 0.008*"class" + 0.007*"attribute" + 0.007*"use" + 0.006*"information" + 0.005*"privacy" + 0.005*"relationship" + 0.005*"project" + 0.004*"model" + 0.004*"type"'),
 (1,
  '0.037*"sample" + 0.017*"population" + 0.010*"estimate" + 0.010*"value" + 0.009*"set" + 0.008*"use" + 0.008*"unit" + 0.007*"probability" + 0.007*"mean" + 0.007*"data"'),
 (2,
  '0.017*"problem" + 0.014*"solution" + 0.012*"function" + 0.010*"point" + 0.010*"set" + 0.010*"minimizer" + 0.010*"convex" + 0.010*"let" + 0.009*"constraint" + 0.008*"find"'),
 (3,
  '0.026*"data" + 0.017*"dataframe" + 0.012*"value" + 0.012*"column" + 0.010*"output" + 0.008*"use" + 0.008*"create" + 0.007*"function" + 0.007*"rdd" + 0.007*"layer"'),
 (4,
  '0.022*"data" + 0.012*"table" + 0.010*"database" + 0.008*"select" + 0.008*"value" + 0.007*"create" + 0.007*"use" + 0.007*"user" + 0.007*"column" + 0.006*"query"'),
 (5,
  '0.010*"branch" + 0.009*"factor" + 0.008*"effect" + 0.008*"design" + 0.008*"experiment" + 0.008*"change

In [89]:
NUM_TOPICS = 15

In [90]:
final_lda_model = LdaModel(doc_term_matrix, num_topics = NUM_TOPICS, id2word = dictionary, alpha = 'auto', eta = 'auto', passes = 50)

LDAvis_prepared = gensim_vis.prepare(final_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, f'Lectures_Final_{NUM_TOPICS}.html')

final_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.019*"data" + 0.013*"value" + 0.010*"plot" + 0.009*"column" + 0.009*"cell" + 0.008*"use" + 0.007*"select" + 0.007*"dash" + 0.007*"function" + 0.006*"point"'),
 (1,
  '0.032*"table" + 0.030*"database" + 0.019*"query" + 0.018*"select" + 0.014*"user" + 0.012*"value" + 0.012*"employee" + 0.010*"list" + 0.010*"column" + 0.010*"string"'),
 (2,
  '0.061*"data" + 0.016*"value" + 0.012*"column" + 0.011*"dataframe" + 0.010*"type" + 0.010*"group" + 0.008*"set" + 0.008*"cluster" + 0.008*"attribute" + 0.008*"relationship"'),
 (3,
  '0.018*"function" + 0.013*"value" + 0.012*"use" + 0.011*"list" + 0.011*"data" + 0.009*"number" + 0.008*"object" + 0.008*"graph" + 0.007*"code" + 0.006*"search"'),
 (4,
  '0.066*"class" + 0.016*"module" + 0.014*"method" + 0.014*"def" + 0.012*"object" + 0.011*"attribute" + 0.011*"error" + 0.011*"package" + 0.010*"age" + 0.010*"data"'),
 (5,
  '0.022*"test" + 0.015*"branch" + 0.014*"command" + 0.012*"code" + 0.012*"change" + 0.011*"create" + 0.010*"repository" + 0.

In [92]:
NUM_TOPICS = 17

In [93]:
final_lda_model = LdaModel(doc_term_matrix, num_topics = NUM_TOPICS, id2word = dictionary, alpha = 'auto', eta = 'auto', passes = 75)

LDAvis_prepared = gensim_vis.prepare(final_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, f'Lectures_Final_{NUM_TOPICS}.html')

final_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.025*"data" + 0.025*"value" + 0.021*"function" + 0.013*"use" + 0.010*"list" + 0.010*"column" + 0.008*"create" + 0.008*"string" + 0.008*"object" + 0.008*"output"'),
 (1,
  '0.030*"data" + 0.015*"rdd" + 0.011*"privacy" + 0.010*"probability" + 0.009*"record" + 0.008*"information" + 0.008*"brief_history" + 0.007*"motivating_review" + 0.005*"spark" + 0.005*"individual"'),
 (2,
  '0.032*"data" + 0.025*"model" + 0.010*"estimate" + 0.009*"regression" + 0.009*"value" + 0.008*"set" + 0.008*"observation" + 0.008*"variable" + 0.008*"fit" + 0.008*"predictor"'),
 (3,
  '0.019*"audience" + 0.015*"mixture_models" + 0.014*"email" + 0.013*"proposal" + 0.012*"default" + 0.011*"student" + 0.011*"write" + 0.010*"communication" + 0.010*"use" + 0.010*"presentation"'),
 (4,
  '0.021*"likelihood_prior" + 0.020*"posterior_exchangeability" + 0.019*"prior" + 0.017*"likelihood" + 0.017*"function" + 0.015*"posterior" + 0.013*"data" + 0.012*"condition" + 0.010*"probability" + 0.010*"distribution"'),
 (5,
  

In [101]:
NUM_TOPICS = 18

In [102]:
final_lda_model = LdaModel(doc_term_matrix, num_topics = NUM_TOPICS, id2word = dictionary, alpha = 'auto', eta = 'auto', passes = 75)

LDAvis_prepared = gensim_vis.prepare(final_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, f'Lectures_Final_Chosen_{NUM_TOPICS}.html')

final_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.014*"address" + 0.014*"network" + 0.010*"server" + 0.009*"command" + 0.008*"client" + 0.008*"internet" + 0.008*"loop" + 0.007*"data" + 0.007*"ip_address" + 0.007*"host"'),
 (1,
  '0.028*"data" + 0.023*"value" + 0.022*"function" + 0.018*"model" + 0.013*"distribution" + 0.012*"time" + 0.011*"probability" + 0.010*"simulate" + 0.009*"mean" + 0.009*"number"'),
 (2,
  '0.012*"branch" + 0.010*"function" + 0.009*"solution" + 0.008*"problem" + 0.008*"point" + 0.008*"let" + 0.008*"constraint" + 0.008*"support_vector" + 0.007*"graph" + 0.007*"hyperplane"'),
 (3,
  '0.046*"data" + 0.008*"web" + 0.008*"group" + 0.007*"cluster" + 0.007*"use" + 0.007*"element" + 0.006*"set" + 0.006*"open" + 0.005*"graph" + 0.005*"software"'),
 (4,
  '0.021*"model" + 0.019*"data" + 0.011*"estimate" + 0.010*"layer" + 0.009*"class" + 0.008*"set" + 0.008*"training" + 0.007*"value" + 0.007*"classification" + 0.007*"network"'),
 (5,
  '0.030*"state" + 0.028*"markov_chain" + 0.028*"probability" + 0.020*"model" + 0

In [103]:
LDAvis_prepared = gensim_vis.prepare(final_lda_model, doc_term_matrix, dictionary, mds='mmds')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, f'Lectures_Final_Chosen_{NUM_TOPICS}.html')

In [104]:
final_lda_model.save(datapath('Lectures_Chosen_18'))

In [136]:
topic_words = []
for i in range(1,NUM_TOPICS+1): 
    topic = {}   
    topic[i] = LDAvis_prepared.sorted_terms(topic = i, _lambda = 0.67)[:30]["Term"].tolist()
    topic_words.append(topic)

In [138]:
for i in range(NUM_TOPICS):
    print(topic_words[i])

{1: ['model', 'layer', 'training', 'estimate', 'data', 'classification', 'network', 'class', 'observation', 'prediction', 'set', 'convolution', 'block', 'input', 'method', 'image', 'response', 'fit', 'predictor', 'error', 'parameter', 'predict', 'variable', 'task', 'weight', 'bootstrap', 'vector', 'matrix', 'neural_networks', 'output']}
{2: ['data', 'array', 'index', 'audience', 'search', 'macro', 'problem', 'hash', 'algorithm', 'item', 'output', 'use', 'structure', 'information', 'queue', 'stack', 'need', 'email', 'record', 'pypi', 'proposal', 'list', 'sort', 'communication', 'time', 'object', 'purpose', 'value', 'business', 'presentation']}
{3: ['function', 'test', 'code', 'package', 'value', 'def', 'exception', 'dash', 'module', 'error', 'use', 'app', 'plot', 'object', 'create', 'python', 'argument', 'try', 'widget', 'output', 'number', 'add', 'pass', 'raise', 'testing', 'column', 'write', 'list', 'run', 'install']}
{4: ['function', 'value', 'data', 'simulate', 'distribution', 'mode
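
The `_lambda = 0.67` argument above controls how pyLDAvis ranks terms by relevance. As a minimal sketch, assuming the definition from Sievert and Shirley's LDAvis paper, relevance is a weighted sum of the log probability of a word within a topic and the log of its lift (that probability divided by the word's overall corpus probability). The probabilities below are made up for illustration, and `relevance` and `rank_terms` are hypothetical helpers rather than pyLDAvis functions.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.67):
    """lam * log p(w | t) + (1 - lam) * log(p(w | t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Made-up probabilities: (p(word | topic), p(word) in the whole corpus)
toy_terms = {
    "data": (0.040, 0.035),          # frequent everywhere
    "markov_chain": (0.028, 0.003),  # concentrated in this topic
    "use": (0.010, 0.012),
}

def rank_terms(terms, lam):
    """Sort terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam=lam), reverse=True)

print(rank_terms(toy_terms, lam=1.0))   # ranks purely by p(w | t)
print(rank_terms(toy_terms, lam=0.0))   # ranks purely by lift
print(rank_terms(toy_terms, lam=0.67))
```

Lower values of lambda push generic words like "data" down the list in favour of words that are specific to the topic, which is why the relevance-sorted lists differ from the raw `show_topics` output.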

In [96]:
NUM_TOPICS = 19

In [97]:
final_lda_model = LdaModel(doc_term_matrix, num_topics = NUM_TOPICS, id2word = dictionary, alpha = 'auto', eta = 'auto', passes = 50)

LDAvis_prepared = gensim_vis.prepare(final_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, f'Lectures_Final_{NUM_TOPICS}.html')

final_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.052*"sample" + 0.043*"error" + 0.036*"exception" + 0.021*"unit" + 0.018*"population" + 0.017*"try" + 0.014*"raise" + 0.012*"student" + 0.011*"def" + 0.011*"program"'),
 (1,
  '0.034*"data" + 0.024*"table" + 0.020*"column" + 0.017*"dataframe" + 0.016*"value" + 0.015*"create" + 0.014*"select" + 0.013*"row" + 0.012*"query" + 0.011*"attribute"'),
 (2,
  '0.020*"factor" + 0.015*"effect" + 0.013*"block" + 0.012*"mean" + 0.012*"design" + 0.011*"group" + 0.009*"variable" + 0.009*"response" + 0.009*"model" + 0.008*"level"'),
 (3,
  '0.035*"probability" + 0.020*"hyperplane" + 0.019*"classifier" + 0.017*"observation" + 0.015*"margin" + 0.015*"brief_history" + 0.014*"motivating_review" + 0.012*"maximal_margin" + 0.011*"data" + 0.011*"bayesian"'),
 (4,
  '0.048*"class" + 0.034*"classification" + 0.020*"data_carts" + 0.017*"tree" + 0.016*"regression" + 0.015*"support_vector" + 0.014*"motivate" + 0.012*"classifier" + 0.012*"observation" + 0.011*"linear"'),
 (5,
  '0.014*"macro" + 0.012*"dia

In [87]:
NUM_TOPICS = 25

In [88]:
final_lda_model = LdaModel(doc_term_matrix, num_topics = NUM_TOPICS, id2word = dictionary, alpha = 'auto', eta = 'auto', passes = 50)

LDAvis_prepared = gensim_vis.prepare(final_lda_model, doc_term_matrix, dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, f'Lectures_Final_{NUM_TOPICS}.html')

final_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.024*"web" + 0.013*"data" + 0.010*"internet" + 0.010*"request" + 0.007*"new" + 0.007*"network" + 0.007*"application" + 0.007*"convex_analysis" + 0.006*"packet" + 0.006*"port"'),
 (1,
  '0.029*"layer" + 0.026*"network" + 0.017*"block" + 0.014*"address" + 0.013*"number" + 0.010*"class" + 0.009*"increase" + 0.009*"key" + 0.008*"convolution" + 0.007*"bit"'),
 (2,
  '0.030*"tree" + 0.014*"search" + 0.012*"problem" + 0.012*"graph" + 0.009*"node" + 0.009*"algorithm" + 0.009*"order" + 0.008*"data" + 0.008*"sort" + 0.007*"structure"'),
 (3,
  '0.044*"data" + 0.031*"model" + 0.014*"value" + 0.013*"regression" + 0.011*"estimate" + 0.011*"function" + 0.010*"variable" + 0.008*"set" + 0.008*"fit" + 0.007*"probability"'),
 (4,
  '0.051*"state" + 0.048*"markov_chain" + 0.033*"distribution" + 0.033*"probability" + 0.028*"transition" + 0.021*"simulate" + 0.019*"transition_matrix" + 0.018*"stationary_distribution" + 0.013*"time" + 0.012*"model"'),
 (5,
  '0.022*"data" + 0.018*"value" + 0.015*"us

#### Testing a topic model where the document term matrix is TF-IDF weighted 

In [56]:
from gensim.models import TfidfModel

# Copy the corpus so that filtering does not modify doc_term_matrix in place
# (plain assignment would only create a second name for the same list)
tf_corpus = list(doc_term_matrix)
tf_dictionary = dictionary  # only read below, so sharing the object is fine

tfidf = TfidfModel(corpus=tf_corpus, id2word=tf_dictionary)

low_value = 0.03  # drop tokens whose TF-IDF weight falls below this threshold
words = []  # keeps track of every dropped token
for i in range(len(tf_corpus)):
    bow = tf_corpus[i]
    tfidf_weights = tfidf[bow]
    tfidf_ids = [token_id for token_id, value in tfidf_weights]
    bow_ids = [token_id for token_id, value in bow]
    low_value_words = [token_id for token_id, value in tfidf_weights if value < low_value]
    # The words with tf-idf score 0 will be missing from the TF-IDF output
    words_missing_in_tfidf = [token_id for token_id in bow_ids if token_id not in tfidf_ids]
    drops = low_value_words + words_missing_in_tfidf
    for item in drops:
        words.append(tf_dictionary[item])

    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]
    tf_corpus[i] = new_bow

In [57]:
idf_lda_model = LdaModel(corpus=tf_corpus, id2word=tf_dictionary, num_topics=10, random_state=448, passes=20, alpha="auto", eta = "auto")
idf_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.101*"prior" + 0.071*"posterior" + 0.046*"likelihood" + 0.045*"introduction" + 0.038*"normal" + 0.032*"probability" + 0.028*"beta" + 0.024*"chain" + 0.024*"bayesian" + 0.022*"regression"'),
 (1,
  '0.036*"def" + 0.032*"package" + 0.030*"git" + 0.029*"python" + 0.025*"branch" + 0.022*"exception" + 0.022*"module" + 0.019*"testing" + 0.017*"repository" + 0.017*"object"'),
 (2,
  '0.026*"command" + 0.018*"git" + 0.017*"group" + 0.017*"open" + 0.017*"line" + 0.016*"distance" + 0.014*"cluster" + 0.013*"echo" + 0.011*"key" + 0.011*"mixture"'),
 (3,
  '0.024*"emp" + 0.023*"random" + 0.023*"database" + 0.018*"query" + 0.017*"key" + 0.016*"eno" + 0.016*"sql" + 0.015*"probability" + 0.013*"density" + 0.013*"dno"'),
 (4,
  '0.030*"distribution" + 0.022*"return" + 0.019*"probability" + 0.016*"markov" + 0.016*"likelihood" + 0.015*"list" + 0.015*"string" + 0.014*"state" + 0.014*"chain" + 0.013*"regression"'),
 (5,
  '0.032*"linear" + 0.024*"regression" + 0.023*"cell" + 0.021*"mar" + 0.017*"l

In [58]:
LDAvis_prepared = gensim_vis.prepare(idf_lda_model, tf_corpus, tf_dictionary, mds='pcoa')
pyLDAvis.display(LDAvis_prepared)
pyLDAvis.save_html(LDAvis_prepared, 'IDF_auto_run_LDA_10.html')

Interesting, but the filtering removes too many words, which distorts the measure of each topic's importance (calculated by the % of tokens that belong to each topic), and that measure is the main reason we care about using LDA for this.
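
The reason some tokens vanish entirely can be seen in the weighting itself. The sketch below reimplements, in plain Python, what we understand to be gensim's default `TfidfModel` weighting before normalisation: the raw term count times log2(N / document frequency). Treat that formula as an assumption to check against the gensim documentation; `tfidf_weights` and the toy corpus are made up for illustration.

```python
import math

def tfidf_weights(bow_corpus):
    """Unnormalised TF-IDF: term count * log2(N / document frequency)."""
    n_docs = len(bow_corpus)
    doc_freq = {}
    for bow in bow_corpus:
        for token in bow:
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return [
        {token: count * math.log2(n_docs / doc_freq[token]) for token, count in bow.items()}
        for bow in bow_corpus
    ]

# Toy corpus: token -> count per document ("data" occurs in every document)
toy_corpus = [
    {"data": 5, "prior": 2},
    {"data": 3, "query": 4},
    {"data": 7, "layer": 1},
]

weights = tfidf_weights(toy_corpus)
print(weights[0])  # "data" gets weight 0.0 no matter how often it occurs
```

Under this weighting, any word that appears in every module is dropped from every document regardless of its count, so very frequent tokens stop contributing to the per-topic token shares.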

Could also try other ways of doing this with: Dictionary.filter_extremes 

Should probably retry these two with corpus consisting of documents that are individual pdfs/lectures.

In the end, we go with the 18-topic model.

Some interesting words that we could look for in the jobs dataset dictionary could be: git, training, simulate, probability, regression, database, experiment, excel, time series

## Finding word frequencies for words from jobs dictionary 

In [155]:
vocab = []

for item in dictionary.items():
    vocab.append(item[1])

In [146]:
print(vocab[:10])
print(len(vocab))

['abbreviation', 'add', 'address', 'advanced', 'allow', 'analog', 'analysis', 'analyst', 'analytic', 'app']
9922


In [178]:
words_to_find = ["deploy", "pipeline", "etl", "llm", "power_bi", "generative_ai", "gcp", "spark", "hadoop", "git", "training", "simulate", "probability", "regression", "database", "experiment", "excel", "time_series", "deep_learning", "visualization", "data", "markov_chain", "neural_network", "pytorch", "tensorflow", "kafka", "nlp"]
for word in words_to_find:
    print(word, end=": ")
    if word in vocab:
        print(final_lda_model.get_term_topics(word, minimum_probability=0.001))
    else:
        print("not found")
        

deploy: []
pipeline: []
etl: not found
llm: not found
power_bi: not found
generative_ai: not found
gcp: not found
spark: [(17, 0.010765541)]
hadoop: [(17, 0.0035910984)]
git: [(2, 0.0033630973)]
training: [(4, 0.008145138)]
simulate: [(1, 0.0097651975), (5, 0.013888511), (16, 0.002324933)]
probability: [(1, 0.011176379), (4, 0.0027122817), (5, 0.027418612), (10, 0.002061787), (13, 0.021222476), (14, 0.01113725)]
regression: [(1, 0.0037703142), (4, 0.0029165489), (7, 0.0029271643), (16, 0.02352627)]
database: [(12, 0.028981334), (15, 0.001560567), (17, 0.0012988469)]
experiment: [(10, 0.014843564), (13, 0.0021932065)]
excel: [(7, 0.0020061985), (8, 0.0017035467)]
time_series: [(1, 0.0034515331), (14, 0.0011038793)]
deep_learning: [(4, 0.0013561486)]
visualization: [(15, 0.0011943558)]
data: [(0, 0.0070041586), (1, 0.028142318), (2, 0.0014796851), (3, 0.045594253), (4, 0.018646404), (5, 0.01824003), (6, 0.002973792), (7, 0.033636827), (8, 0.023922192), (9, 0.0048884694), (10, 0.001275668

## The below section is experimental:

Note: add an explanation of what an embedding is, how they are learned, sentence vs word level embeddings (and the fact that we use word level). Also describe each approach, what worked and what didn't.  

### Trying to assign a label to a topic using word embeddings of the top 20 words in a topic sorted by relevance

In [17]:
# Top 20 words for mmds visualization of LDA model with 5 topics
top_words = ["model", "data", "value", "function", "distribution", "example", "probability", "number", "using", "use", "simulate", "sample", "independent", "average", "mean", "figure", "estimate", "variable", "measurement", "plot"]
print(len(top_words))

20


In [74]:

# import gensim.downloader as api
# model_location = api.load("fasttext-wiki-news-subwords-300", return_path=True)
# print(model_location)
# Stored at C:\Users\syeda/gensim-data\fasttext-wiki-news-subwords-300\


C:\Users\syeda/gensim-data\fasttext-wiki-news-subwords-300\fasttext-wiki-news-subwords-300.gz


In [43]:
from gensim.models.fasttext import load_facebook_model

model_location = datapath("C:/Users/syeda/OneDrive/Desktop/4th Year/DATA448/cc.en.300.bin")
pretrained_model = load_facebook_model(model_location)
finetuned_model = load_facebook_model(model_location)

In [46]:
import numpy as np

word_embeddings = [pretrained_model.wv[word] for word in top_words]
mean_vector = np.mean(word_embeddings, axis=0)

pt_similar_words = pretrained_model.wv.similar_by_vector(mean_vector, topn=5)
print(pt_similar_words)

topic_label = pt_similar_words[0][0]
print(f"Representative word for the topic: {topic_label}")

[('calculate', 0.6181637048721313), ('use', 0.6073808670043945), ('extrapolate', 0.5911571383476257), ('calculation', 0.5882555842399597), ('estimate', 0.5864962935447693)]
Representative word for the topic: calculate
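
The idea behind this cell can be sketched without any trained model: average the vectors of the topic's top words, then pick the vocabulary word whose vector has the highest cosine similarity to that average. The 3-dimensional vectors below are made up for illustration (real FastText vectors have 300 dimensions), and `mean_vector` and `cosine` are hypothetical helpers standing in for `np.mean` and `similar_by_vector`.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up word-level "embeddings"
toy_embeddings = {
    "probability": [0.9, 0.1, 0.0],
    "distribution": [0.8, 0.2, 0.1],
    "sample": [0.7, 0.3, 0.0],
    "statistics": [0.8, 0.2, 0.0],
    "database": [0.0, 0.1, 0.9],
}

toy_topic_words = ["probability", "distribution", "sample"]
centroid = mean_vector([toy_embeddings[w] for w in toy_topic_words])

# Exclude the topic's own words so the label is not simply one of the inputs
candidates = [w for w in toy_embeddings if w not in toy_topic_words]
label = max(candidates, key=lambda w: cosine(centroid, toy_embeddings[w]))
print(label)  # statistics
```

Excluding the input words is a design choice that differs from the cell above, where words such as "use" and "estimate" from `top_words` appear among the nearest neighbours.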


In [52]:
finetuned_model.build_vocab(corpus_with_bigrams_trigrams, update=True)  # Add the new words to the vocabulary
finetuned_model.train(corpus_with_bigrams_trigrams, total_examples=len(corpus_with_bigrams_trigrams), epochs=10)  # Fine-tune the model

(57851, 291540)

In [53]:
# Now you can use the updated model with embeddings that include domain-specific words
ft_word_embeddings = [finetuned_model.wv[word] for word in top_words]
ft_mean_vector = np.mean(ft_word_embeddings, axis=0)

ft_similar_words = finetuned_model.wv.similar_by_vector(ft_mean_vector, topn=5)
print(ft_similar_words)

ft_topic_label = ft_similar_words[0][0]
print(f"Representative word for the topic: {ft_topic_label}")

[('variation', 0.9998103976249695), ('calculation', 0.9998043179512024), ('estimation', 0.9997916221618652), ('computer-simulation', 0.9997856020927429), ('correlation', 0.9997814893722534)]
Representative word for the topic: variation


In [49]:
np.allclose(mean_vector, ft_mean_vector, atol=1e-4)

False

In [54]:
from gensim.models import FastText

custom_model = FastText(vector_size=100, window=3, min_count=1, sentences=corpus_with_bigrams_trigrams, epochs=10)

In [55]:
custom_embeddings = [custom_model.wv[word] for word in top_words]
custom_mean_vector = np.mean(custom_embeddings, axis=0)

similar_words = custom_model.wv.similar_by_vector(custom_mean_vector, topn=5)
print(similar_words)

custom_topic_label = similar_words[0][0]
print(f"Representative word for the topic: {custom_topic_label}")

[('distancetraveled', 0.999996542930603), ('projected', 0.9999964833259583), ('example_consider', 0.9999963641166687), ('thersystemanintroductionandoverview', 0.9999961853027344), ('mentioned', 0.9999961256980896)]
Representative word for the topic: distancetraveled


### Trying to assign a label to a topic using a pre-trained transformer by encoding the top 20 words in a topic 

#### Finetuned T5

In [1]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("Michau/t5-base-en-generate-headline")

In [4]:
# Top 30 words for mmds visualization of LDA model with 5 topics
top_words = ["value","function","datum","random","use","model","variable", "time","figure","example","number","plot","estimate","random_variable","lag","histogram","probability","series","mean","simulate","standard","sample","r","regression","follow","level","distribution","variance","x","pseudorandom_number"]
print(len(top_words))

30


In [20]:
# Function to generate a one-word topic label from a list of words
def generate_topic_label(top_words: list) -> str:
    
    input_string = "label these topics: " + " ".join(top_words)
    print(input_string)
    
    # Tokenize the input string
    encoding = tokenizer.encode(input_string, return_tensors="pt")
    
    # Generate the label using the model
    output = model.generate(encoding, max_length=5, num_beams=4, early_stopping=True)
    
    # Decode the output to get the label
    label = tokenizer.decode(output[0], skip_special_tokens=True)
    
    return label

In [21]:
topic_label = generate_topic_label(top_words)
print(f"Generated topic label: {topic_label}")

label these topics: value function datum random use model variable time figure example number plot estimate random_variable lag histogram probability series mean simulate standard sample r regression follow level distribution variance x pseudorandom_number




Generated topic label: 


#### Finetuned BART

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

mname = "cristian-popa/bart-tl-all"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

In [2]:
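def generate_topic_label_with_BART(top_words: list[str]) -> str:
    # NOTE: passing a list makes the tokenizer treat every word as its own input sequence
    # (a batch of len(top_words) sequences), and only outputs[0] is decoded below,
    # so the returned label reflects top_words[0] alone.
    # Passing " ".join(top_words) instead would encode all words as a single sequence.
    enc = tokenizer(top_words, return_tensors="pt", truncation=True, padding="max_length", max_length=128)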
    outputs = model.generate(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        max_length=15,
        min_length=1,
        do_sample=False,
        num_beams=25,
        length_penalty=1.0,
        repetition_penalty=1.5
    )

    label = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return label

In [8]:
!nvidia-smi

Wed Oct 23 13:29:08 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.44                 Driver Version: 552.44         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   66C    P8             11W /   95W |      73MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [7]:
topic_label = generate_topic_label_with_BART(top_words)
print(f"Generated topic label: {topic_label}")

Generated topic label: rate of return


### Trying BERTopic to get topic info for the entire corpus

In [None]:
#!pip install BERTopic
# !pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.17.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tf_keras-2.17.0-py3-none-any.whl (1.7 MB)
Installing collected packages: tf-keras
Successfully installed tf-keras-2.17.0


In [52]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# `corpus` is a list of strings (one string per chapter).
# Split each string into 5 roughly equal parts to get more, shorter documents.
split_corpus = []

# Split each string into 5 parts
for string in corpus:
    # Calculate the length of each part
    part_length = max(1, len(string) // 5)  # Ensure at least one character per part
    parts = [string[i:i + part_length] for i in range(0, len(string), part_length)]
    
    # If there are more than 5 parts, combine excess parts
    while len(parts) > 5:
        last_part = parts.pop()
        parts[-1] += last_part  # Combine excess into the last part
    
    # Add the parts to the split_corpus
    split_corpus.extend(parts)

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words='english')

# Initialize BERTopic model
topic_model = BERTopic(vectorizer_model=vectorizer_model)


# Fit the BERTopic model on your corpus and extract topics
topics, probabilities = topic_model.fit_transform(split_corpus)



False


In [55]:
print(topic_model.get_topic_info())
# topic_model.visualize_topics() does not work because only one topic lol

   Topic  Count                                  Name  \
0     -1     40  -1_function_random_data_distribution   

                                      Representation  \
0  [function, random, data, distribution, example...   

                                 Representative_Docs  
0  [ying this in the inverse CDF method runs as f...  


#### Less than ideal results

- BERTopic does not work out of the box on a corpus of 8 documents (here each chapter is one string, so the corpus is a list of 8 strings), so each document is split evenly into 5 parts to give 40 documents.
- The output is a single "topic" with index -1. According to the BERTopic documentation, topic ID -1 is for outliers, i.e. documents that "do not fit into any topics". All 40 of our documents are assigned to this topic.
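
The even split described above cuts each string at character positions, which is why the representative document in the output starts mid-word ("ying this in ..."). A minimal sketch of the same 5-way split done on word boundaries instead is shown below; `split_into_parts` and `toy_corpus` are hypothetical names introduced for illustration, not part of the notebook.

```python
def split_into_parts(text: str, n_parts: int = 5) -> list[str]:
    """Split `text` into up to `n_parts` chunks of near-equal word count."""
    words = text.split()
    if not words:
        return []
    # Never create more parts than there are words
    n_parts = min(n_parts, len(words))
    base, extra = divmod(len(words), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `extra` parts get one additional word
        size = base + (1 if i < extra else 0)
        parts.append(" ".join(words[start:start + size]))
        start += size
    return parts


# Toy stand-in for the real corpus (a list of strings, one per chapter)
toy_corpus = [
    "the inverse CDF method runs as follows for a random variable",
    "a histogram shows the distribution of simulated values",
]
split_corpus = [part for doc in toy_corpus for part in split_into_parts(doc)]
print(len(split_corpus))
```

Note that `text.split()` collapses newlines and repeated spaces, so the chunks lose the original layout; that should not matter for embedding, but it is a design choice worth keeping in mind.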

#### Trying BERTopic on complete modules corpus

In [68]:
from bertopic import BERTopic

topic_model = BERTopic()        # Default arguments as used on the website: https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html
topics, probabilities = topic_model.fit_transform(corpus)




In [69]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,18,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 580\n\nModeling and Simulation ...


In [16]:
topic_model.get_document_info(corpus)

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,UC\nDa...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
1,Lecture 7: Functional-style programming and\nH...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
2,Data Structures and\nAlgorithms\n\nUBCO Master...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
3,UC\nPy...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
4,UC\nSQ...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
5,Version Control\n\nUBCO Master of Data Science...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
6,Data Profiling and\nCleaning\nHandling Missing...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
7,Completely Randomized Designs (CRD)\n ...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
8,"551 Lec 5 - Tables, styling, performance\nYou ...",-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
9,Moving beyond linearity in response\n\n ...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False


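Still terrible results: all 18 documents are again assigned to the outlier topic -1, and its representation consists only of stop words. Not sure what I'm doing wrong.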
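
One possible explanation, offered as a guess rather than a diagnosis: BERTopic's defaults (`min_topic_size=10`, UMAP `n_neighbors=15`) target corpora with far more documents, so HDBSCAN may find no dense cluster among 18 documents and label everything -1. The sketch below passes smaller custom UMAP and HDBSCAN models plus a stop-word-removing `CountVectorizer`. The helper `small_corpus_settings` and its heuristic values are assumptions made up for illustration, not recommendations from the BERTopic documentation, and the model is untested on this corpus.

```python
def small_corpus_settings(n_docs: int) -> dict:
    """Heuristic (assumed) neighbourhood and cluster sizes for a tiny corpus."""
    return {
        # UMAP needs fewer neighbours than there are documents
        "n_neighbors": max(2, min(15, n_docs - 1)),
        # Allow clusters much smaller than BERTopic's default of 10
        "min_cluster_size": max(2, min(10, n_docs // 5)),
    }


def build_topic_model(n_docs: int):
    """Build a BERTopic model with sub-models sized for `n_docs` documents."""
    # Imported lazily so the helper above works without these packages installed
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    settings = small_corpus_settings(n_docs)
    umap_model = UMAP(
        n_neighbors=settings["n_neighbors"],
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=42,
    )
    hdbscan_model = HDBSCAN(
        min_cluster_size=settings["min_cluster_size"],
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    # Stop words are removed only when building topic representations
    vectorizer_model = CountVectorizer(stop_words="english")
    return BERTopic(
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
    )


print(small_corpus_settings(18))
```

Usage would be `topic_model = build_topic_model(len(corpus))` followed by `topic_model.fit_transform(corpus)`. With only a handful of documents, `n_components=5` may itself be too large for UMAP, so splitting the modules into more, shorter documents is probably still needed.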

### Checking if BERTopic works in general (20 Newsgroups)

In [5]:
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

In [10]:
print(len(docs))
print(type(docs))
print(docs[0][:100])
print(type(docs[0]))

18846
<class 'list'>


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about 
<class 'str'>


In [9]:
print(len(docs[0]))

712


In [11]:
topics, probs = topic_model.fit_transform(docs)

In [12]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,6596,-1_to_the_of_and,"[to, the, of, and, is, for, in, you, it, that]",[\nProbably because it IS rape.\n\n\nSo nothin...
1,0,1832,0_game_team_games_he,"[game, team, games, he, players, season, hocke...","[\nWales Conference, Adams Division, Semifinal..."
2,1,616,1_key_clipper_chip_encryption,"[key, clipper, chip, encryption, keys, escrow,...",[The following document summarizes the Clipper...
3,2,464,2_israel_israeli_jews_arab,"[israel, israeli, jews, arab, jewish, arabs, p...","[\n\n""Assuming""? Also: come on, Brad. If we ar..."
4,3,451,3_ites_cheek_yep_huh,"[ites, cheek, yep, huh, ken, , , , , ]","[Ken\n, \nYep.\n, ites:]"
...,...,...,...,...,...
211,210,10,210_oil_lights_indicators_service,"[oil, lights, indicators, service, reset, indi...",[Derek....\n\nThere is a tool available to res...
212,211,10,211_needles_acupuncture_needle_syringe,"[needles, acupuncture, needle, syringe, hypode...",[\nIt is illegal to perform acupuncture with u...
213,212,10,212_alarm_sensor_alarms_shock,"[alarm, sensor, alarms, shock, car, viper, alp...",[Just found a great deal on a Clifford Delta c...
214,213,10,213_religion_supreme_arf_definition,"[religion, supreme, arf, definition, belief, l...",[\n .\n It's my understanding that ...
