## TODOs

1. Refine preprocessing pipeline (use spaCy, NLTK, or some combination of the two)
    - There are some quirks with n-grams currently; look into refining the implementation
    - Some words such as "use", "since", "r", and "x" are not being filtered out by stopword removal

2. Web scraping for job data
    - Collect roughly 50-100 examples per week and create a similar preprocessing pipeline
    - Look for ways to programmatically filter sections we want (responsibilities and qualifications).

3. Look into topic labeling
    - Automatically extracting top n words (and sorting them by relevance)
    - Look at how relevance is computed at https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/_prepare.py
    - BERTopic?
    
4. Finish Introduction and Data sections before midterm break
    - Literature review (Blei paper, Daniel paper, Journal of DSE paper, possibly find topic labeling papers?)
    - Decide on final dataset 
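
The relevance score mentioned in item 3 can be sketched in a few lines. This follows the definition from the LDAvis paper (Sievert and Shirley), relevance(w | t) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). The function name, the toy probabilities, and the choice of λ = 0.6 are illustrative assumptions, not values taken from this project or from pyLDAvis's code.

```python
import math

def relevance_ranking(topic_term_probs, corpus_term_probs, lam=0.6, top_n=3):
    """Rank terms of one topic by relevance.

    topic_term_probs: p(w | t) for each term in the topic (hypothetical input)
    corpus_term_probs: marginal p(w) for each term over the whole corpus
    lam: weight between raw topic probability (1.0) and lift (0.0)
    """
    scores = {}
    for term, p_wt in topic_term_probs.items():
        p_w = corpus_term_probs[term]
        scores[term] = lam * math.log(p_wt) + (1 - lam) * math.log(p_wt / p_w)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy numbers: "use" is common everywhere, "prior" is specific to this topic
topic_probs = {"model": 0.05, "prior": 0.03, "use": 0.04}
corpus_probs = {"model": 0.04, "prior": 0.005, "use": 0.06}

ranked = relevance_ranking(topic_probs, corpus_probs, lam=0.6)
by_probability = relevance_ranking(topic_probs, corpus_probs, lam=1.0)
print(ranked)
print(by_probability)
```

With λ = 1 the ranking reduces to sorting by topic probability; lowering λ pushes generic words such as "use" down the list, which is the behaviour we want for topic labels.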


### Imports and loading data

In [1]:
import string   # contains a public variable with all ASCII punctuation characters
import nltk

# list of all stopwords such as 'and', 'the', 'is', etc.
nltk.download('stopwords')  

# WordNet is a lexical database of English words that groups words into sets of synonyms, while also recording semantic relationships between words such as "is-a", "part-of", and "opposite-of" relationships.
nltk.download('wordnet')    

# Open Multilingual WordNet (omw) links hand created wordnets and automatically created wordnets for different languages.
nltk.download('omw-1.4')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk import ngrams

# Maps each unique token to an integer id. The dictionary can then be used to create a term-document matrix (bag-of-words representation).
from gensim.corpora import Dictionary

from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.models.phrases import ENGLISH_CONNECTOR_WORDS

import spacy

from textacy import extract

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Location of environment for personal reference:  c:\Users\syeda\miniconda3\envs\dir-st\lib\ (in case large models are downloaded for testing and need to be deleted)

In [2]:
import os

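# NOTE: this version is shadowed by the redefinition with the same name further down; `base_name` is not used in the body.
def combine_text_files_to_list(base_name, num_files):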
    corpus = []

    # Loop through all group files and add their content as separate strings in the corpus list
    for i in range(1, num_files + 1):
        file_path = f"Parsed/group_{i}.txt"
        
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                file_content = file.read()  # Read the entire file as a string
                corpus.append(file_content)  # Add the file's content as a string to the corpus list
                
        except FileNotFoundError:
            print(f"File {file_path} not found.")
        except Exception as e:
            print(f"An error occurred while reading {file_path}: {e}")
    
    return corpus

def combine_text_files_to_list(input_directory):

    txt_files = [os.path.join(input_directory, file) for file in os.listdir(input_directory) if file.endswith(".txt")]
    corpus = []

    for txt_file in txt_files:
        
        try:
            # Read the entire file as a string and add the string to the corpus
            with open(txt_file, 'r', encoding='utf-8') as file:
                file_content = file.read()  
                corpus.append(file_content)  
                
        except Exception as e:
            print(f"An error occurred while reading {txt_file}: {e}")
    
    return corpus

# corpus = combine_text_files_to_list('ModelingSimulation_inR', 8)
corpus = combine_text_files_to_list("Parsed_Slides")
print("Corpus combined successfully as a list of strings.")

Corpus combined successfully as a list of strings.


`corpus` is currently a list of strings, where each string is all the text from one module.

In [3]:
print(len(corpus))
print(corpus[0][:500])

18
                                        UC
Data Formats

UBCO Master of Data Science – DATA 530

                                          1
---
Learning Objectives•  Explain why it is important to understand and use correct terminology.
           •          Define: computer, software, memory, data, memory size/data size, cloud
           •          Explain "Big Data" and describe data growth in the coming years.
           •          Compare and contrast: digital versus analog
           •    


In [9]:
import numpy as np

total_words = 0
doc_length = []     # length of each document in characters
for doc in corpus:
    total_words += len(doc.split())
    print("Number of characters: ", len(doc))
    doc_length.append(len(doc))
    
print(np.std(doc_length))       # standard deviation of document length (characters)
print(np.mean(doc_length))      # mean document length (characters)
print(f"Total number of words in the corpus: {total_words}")

Number of characters:  274069
Number of characters:  145877
Number of characters:  159501
Number of characters:  162439
Number of characters:  243960
Number of characters:  210122
Number of characters:  175243
Number of characters:  223707
Number of characters:  144625
Number of characters:  150600
Number of characters:  106105
Number of characters:  152243
Number of characters:  191219
Number of characters:  239481
Number of characters:  141983
Number of characters:  329900
Number of characters:  135871
Number of characters:  179817
55102.74906229473
187042.33333333334
Total number of words in the corpus: 351454


### Cleaning the corpus

In [4]:
def clean_with_nltk(doc):
    
    stop_words = set(stopwords.words('english'))
    punctuation = set(string.punctuation)
    lemmatizer = WordNetLemmatizer()
    lower_case_sentences = doc.lower().split()

    stop_free = " ".join([word for word in lower_case_sentences if word not in stop_words])             # only keep words that are not stopwords
    # print(stop_free)
    punc_free = "".join(ch for ch in stop_free if ch not in punctuation and not ch.isnumeric() and ch != "•")         # only keep characters that are not punctuation, numbers, or bullet symbols
    # print(punc_free)
    lemmatized = " ".join(lemmatizer.lemmatize(word) for word in punc_free.split())             # lemmatize words; without a POS tag, WordNetLemmatizer treats every word as a noun
    # print(lemmatized)

    # We do this separately later for nltk
    # bigrams = list(ngrams(lemmatized, 2))  
    # trigrams = list(ngrams(lemmatized, 3))  
    # bigram_strings = ["_".join(bigram) for bigram in bigrams]  # Join bigram words with an underscore
    # trigram_strings = ["_".join(trigram) for trigram in trigrams]

    return lemmatized 

def clean_with_spacy(doc):

    spacy_parser = spacy.load("en_core_web_sm")
    # Add custom stop words, mostly including header and footer information like names of instructors, name of university, filler words like 'example', 'page', etc.
    spacy_parser.Defaults.stop_words |= {"ubc", "mds", "lecture", "lab", "assignments", "example", "page", "file", "question", "ex", "import", "jeffrey", "andrews", "irene", "vrbik", "shan", "du", "ifeoma", "adaji", "gema", "rodrigues", "fatemeh", "fard", "emelie", "gustafsson", "xiaoping", "shi", "ladan", "tazik", "ramon", "lawrence"}
    
    spacy_doc = spacy_parser(doc.lower())

    ngrams = [
        ngram.text.replace(" ", "_")    # words within an n-gram are separated by spaces, so we replace them with underscores
        for ngram in extract.ngrams(spacy_doc, n = 2, min_freq = 4, filter_punct = True, filter_nums = True, exclude_pos=["PROPN", "ORG", "DATE", "X"]) 
        if not any(symbol in ngram.text for symbol in ("=", "@", "$"))      # skip n-grams that contain code or math symbols
    ]
    
    # Remove stopwords, punctuation, and numeric tokens
    tokens = [
        token.lemma_ 
        for token in spacy_doc 
        if not token.is_stop and not token.is_punct and not token.is_digit and token.is_alpha       # Keep only alphabetic words that are not stop words
            and token.text not in ["_", "+", "=", "\n","-","*","<",">"]                             # Remove special characters
            and token.lemma_ != "datum"                                                             # Drop tokens whose lemma is "datum" (i.e. "data")
            and len(token.text) != 1                                                                # Remove single character words
    ]                                                                             
    
    return tokens + ngrams

Apply the cleaning functions to the entire corpus. There are two options, NLTK and spaCy; the spaCy pipeline offers more filtering options. In both cases, the cleaned corpus is a list of lists of strings, where each inner list holds the tokens of one module after cleaning.
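
One thing worth checking for the leftover stop words: `clean_with_spacy` tests `is_stop` on the surface token but emits the lemma, so an inflected form that is not itself a stop word can come out as a lemma that is. The sketch below illustrates the idea in plain Python by checking both the token and its lemma. The stop-word set and the lemma table are toy stand-ins (assumptions for illustration), and whether this is the actual cause in our corpus has not been verified.

```python
# Toy stand-ins for a stop-word list and a lemmatizer (hypothetical values)
STOP_WORDS = {"the", "use", "since"}
LEMMAS = {"uses": "use", "models": "model"}

def clean_tokens(tokens):
    """Keep lemmas of tokens that pass the stop-word and length filters."""
    kept = []
    for token in tokens:
        lemma = LEMMAS.get(token, token)
        # Check the surface form AND the lemma against the stop-word list
        if token in STOP_WORDS or lemma in STOP_WORDS:
            continue
        # Drop single-character leftovers such as "r" or "x"
        if len(lemma) == 1:
            continue
        kept.append(lemma)
    return kept

cleaned = clean_tokens(["the", "analyst", "uses", "r", "models"])
print(cleaned)
```

Here "uses" is dropped because its lemma "use" is in the stop-word set, even though "uses" itself is not.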

#### Cleaning with spaCy 

In [5]:
corpus_with_bigrams = [clean_with_spacy(doc) for doc in corpus]

In [6]:
sum = 0
for doc in corpus_with_bigrams:
    sum += len(doc)

print(f"Total number of words in the cleaned corpus: {sum}")

Total number of words in the cleaned corpus: 155063


#### Cleaning with nltk

In [None]:
nltk_cleaned_corpus = [clean_with_nltk(doc).split() for doc in corpus]
print(nltk_cleaned_corpus[0])

In [None]:
sum = 0
for doc in nltk_cleaned_corpus:
    sum += len(doc)

print(f"Total number of words in the cleaned corpus: {sum}")

Total number of words in the cleaned corpus: 181461


In [None]:
bigram = Phrases(nltk_cleaned_corpus, min_count=10, connector_words=ENGLISH_CONNECTOR_WORDS)  
# trigram = Phrases(bigram[clean_corpus], threshold=10, connector_words=ENGLISH_CONNECTOR_WORDS)

bigram_mod = Phraser(bigram)
# trigram_mod = Phraser(trigram)

# add bigrams to the cleaned corpus (the trigram model above is commented out)
corpus_with_bigrams = [bigram_mod[doc] for doc in nltk_cleaned_corpus]

sum = 0
for doc in corpus_with_bigrams:
    sum += len(doc)

print(f"Total number of words in the nltk corpus with ngrams: {sum}")

<class 'list'>
Total number of words in the corpus with ngrams: 164907
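
For intuition on which word pairs `Phrases` merges: as documented for gensim's default scorer, a candidate bigram is scored as (count(a, b) − min_count) · vocab_size / (count(a) · count(b)) and kept when the score exceeds `threshold` (10.0 by default, which is what the call above uses since only `min_count` is set). The sketch below re-implements that formula in plain Python; the function name and all counts are made up for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram 'a b' from raw corpus counts."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

THRESHOLD = 10.0  # gensim's documented default

# Hypothetical counts: the pair co-occurs 30 times vs. only 11 times
frequent_pair = phrase_score(count_a=50, count_b=40, count_ab=30, vocab_size=5000, min_count=10)
rare_pair = phrase_score(count_a=50, count_b=40, count_ab=11, vocab_size=5000, min_count=10)

print(frequent_pair, frequent_pair > THRESHOLD)
print(rare_pair, rare_pair > THRESHOLD)
```

Because `min_count` is subtracted before scaling, pairs that occur only slightly more often than `min_count` score close to zero regardless of how rare the individual words are.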


### Topic Modeling

In [7]:
print(corpus_with_bigrams[0])



In [8]:
dictionary = Dictionary(corpus_with_bigrams)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus_with_bigrams]
print(doc_term_matrix[0])

[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 2), (7, 4), (8, 1), (9, 8), (10, 4), (11, 2), (12, 4), (13, 6), (14, 17), (15, 7), (16, 5), (17, 7), (18, 5), (19, 1), (20, 1), (21, 1), (22, 1), (23, 15), (24, 1), (25, 3), (26, 2), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 4), (39, 48), (40, 1), (41, 1), (42, 1), (43, 3), (44, 14), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 7), (53, 7), (54, 5), (55, 3), (56, 1), (57, 4), (58, 1), (59, 1), (60, 1), (61, 1), (62, 10), (63, 7), (64, 1), (65, 1), (66, 2), (67, 2), (68, 3), (69, 1), (70, 1), (71, 18), (72, 1), (73, 26), (74, 2), (75, 1), (76, 1), (77, 4), (78, 2), (79, 1), (80, 1), (81, 5), (82, 2), (83, 1), (84, 1), (85, 1), (86, 1), (87, 6), (88, 42), (89, 6), (90, 5), (91, 1), (92, 8), (93, 14), (94, 1), (95, 2), (96, 1), (97, 1), (98, 1), (99, 2), (100, 3), (101, 1), (102, 3), (103, 3), (104, 2), (105, 1), (106, 7), (107, 1), (108, 20), (109, 6),
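
Each pair in the output above is `(token_id, count)` for one document. A minimal plain-Python sketch of the same bag-of-words idea is shown below; it is a simplification for illustration (ids are assigned in first-seen order here, which may not match how gensim's `Dictionary` assigns them), and the helper names are hypothetical.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to every unique token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a list of tokens into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

toy_docs = [["model", "value", "model"], ["value", "prior"]]
token2id = build_token2id(toy_docs)
bow = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(bow)
```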

In [9]:
NUM_TOPICS = 19
PATH_TO_MODEL = f"18_Modules_Test_LDA_spacy_{NUM_TOPICS}_topics"
lda_model = None

In [10]:
from gensim.models import LdaModel
# from pprint import pprint

lda_model = LdaModel(doc_term_matrix, num_topics=NUM_TOPICS, id2word = dictionary)
lda_model.show_topics(num_words=20)
# pprint(lda_model.print_topics(num_topics=NUM_TOPICS, num_words=3))

[(0,
  '0.015*"model" + 0.008*"value" + 0.006*"function" + 0.005*"prior" + 0.005*"use" + 0.005*"sample" + 0.005*"regression" + 0.005*"likelihood" + 0.004*"distribution" + 0.004*"time" + 0.004*"probability" + 0.004*"number" + 0.004*"linear" + 0.004*"posterior" + 0.004*"true" + 0.003*"introduction" + 0.003*"estimate" + 0.003*"set" + 0.003*"normal" + 0.003*"give"'),
 (5,
  '0.009*"model" + 0.008*"value" + 0.007*"function" + 0.005*"linear" + 0.005*"sample" + 0.005*"number" + 0.005*"variable" + 0.005*"regression" + 0.004*"probability" + 0.004*"use" + 0.004*"estimate" + 0.004*"error" + 0.003*"test" + 0.003*"time" + 0.003*"mean" + 0.003*"distribution" + 0.003*"line" + 0.003*"true" + 0.003*"code" + 0.003*"create"'),
 (9,
  '0.008*"value" + 0.007*"function" + 0.006*"model" + 0.006*"sample" + 0.004*"variable" + 0.004*"error" + 0.004*"use" + 0.004*"test" + 0.004*"time" + 0.004*"class" + 0.004*"return" + 0.003*"linear" + 0.003*"regression" + 0.003*"number" + 0.003*"observation" + 0.003*"set" + 0.0

In [11]:
lda_model.top_topics(doc_term_matrix, dictionary=dictionary, coherence='u_mass')

[([(0.00960559, 'model'),
   (0.0075156935, 'value'),
   (0.0045612715, 'test'),
   (0.004507823, 'function'),
   (0.0043454287, 'select'),
   (0.004243544, 'variable'),
   (0.0042251986, 'number'),
   (0.004034812, 'class'),
   (0.0039899326, 'linear'),
   (0.0039058563, 'regression'),
   (0.003825588, 'set'),
   (0.0035794568, 'use'),
   (0.0034324385, 'time'),
   (0.0034122407, 'sample'),
   (0.003378474, 'true'),
   (0.00333054, 'error'),
   (0.0032164913, 'method'),
   (0.0031696854, 'return'),
   (0.0031450232, 'type'),
   (0.0030944792, 'python')],
  -0.06660145604640283),
 ([(0.0118768215, 'function'),
   (0.010095084, 'model'),
   (0.008466003, 'value'),
   (0.0058045257, 'number'),
   (0.005567429, 'random'),
   (0.005316095, 'use'),
   (0.0050123553, 'estimate'),
   (0.004615025, 'variable'),
   (0.004548658, 'distribution'),
   (0.0044490886, 'mean'),
   (0.0041482616, 'sample'),
   (0.004098337, 'linear'),
   (0.0039621363, 'probability'),
   (0.0038901756, 'return'),
   (
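
Each entry returned by `top_topics` pairs a topic's top words with a coherence score (for example the -0.0666 above). A rough sketch of the UMass idea is given below: for every ordered pair of top words, take the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word. This follows the formula from Mimno et al. with +1 smoothing; gensim's implementation may differ in smoothing and in how pair scores are aggregated, so the numbers are not directly comparable. The function name and toy documents are assumptions.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """Sum of log((D(w_later, w_earlier) + 1) / D(w_earlier)) over word pairs.

    top_words must be ordered from most to least probable, and every word
    is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(word in doc for word in words) for doc in doc_sets)

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
    return score

toy_docs = [["model", "prior"], ["model"], ["model", "value"], ["value"]]
score = umass_coherence(["model", "prior", "value"], toy_docs)
print(score)
```

Scores closer to zero mean the top words tend to appear in the same documents.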

In [12]:
# print(f"{lda_model.id2word(ID)}, {prob}" for ID,prob in lda_model.get_topic_terms(topicid = 0, topn = 20))
top_words_0 = []

for ID, prob in lda_model.get_topic_terms(topicid=0, topn=20):
    print(f"{lda_model.id2word[ID]}, {prob}")
    top_words_0.append(lda_model.id2word[ID])

model, 0.015108348801732063
value, 0.008143560029566288
function, 0.006454497575759888
prior, 0.0051698507741093636
use, 0.0051424698904156685
sample, 0.004788743797689676
regression, 0.004744137171655893
likelihood, 0.004531285259872675
distribution, 0.004434437956660986
time, 0.004343572072684765
probability, 0.004184996709227562
number, 0.003984047099947929
linear, 0.0038952650502324104
posterior, 0.0037972114514559507
true, 0.0036061606369912624
introduction, 0.0034552463330328465
estimate, 0.003364811884239316
set, 0.0033608609810471535
normal, 0.003152866382151842
give, 0.0030026957392692566


In [13]:
from gensim.test.utils import datapath
lda_model.save(datapath(PATH_TO_MODEL))

# Datapath: c:\Users\syeda\miniconda3\envs\dir-st\lib\site-packages\gensim\test\test_data\

In [14]:
from gensim.models import HdpModel
# from pprint import pprint

hdp_model = HdpModel(doc_term_matrix, id2word = dictionary)
hdp_model.show_topics()

[(0,
  '0.007*network + 0.007*layer + 0.007*command + 0.006*use + 0.006*function + 0.005*output + 0.005*cell + 0.005*number + 0.005*neural + 0.005*value + 0.005*input + 0.004*model + 0.004*git + 0.004*line + 0.004*add + 0.004*mar + 0.004*time + 0.004*create + 0.003*open + 0.003*code'),
 (1,
  '0.016*sample + 0.012*function + 0.011*value + 0.008*treatment + 0.007*population + 0.007*return + 0.007*use + 0.006*mean + 0.006*model + 0.006*variable + 0.006*unit + 0.006*list + 0.006*design + 0.005*factor + 0.005*true + 0.005*block + 0.005*time + 0.005*test + 0.005*distribution + 0.005*effect'),
 (2,
  '0.040*prior + 0.028*posterior + 0.024*model + 0.018*likelihood + 0.018*introduction + 0.015*normal + 0.013*probability + 0.012*distribution + 0.011*beta + 0.010*chain + 0.009*bayesian + 0.009*sample + 0.009*regression + 0.008*diagnostic + 0.007*binomial + 0.007*parameter + 0.007*plot + 0.006*stan + 0.006*step + 0.006*metropolis'),
 (3,
  '0.016*emp + 0.015*select + 0.015*database + 0.015*table 

In [None]:
hdp_model.hdp_to_lda()

(array([2.75000000e-01, 2.24647887e-01, 1.29123126e-01, 1.00672268e-01,
        7.73019198e-02, 5.83410715e-02, 4.31723929e-02, 3.12310928e-02,
        2.20037244e-02, 1.50269338e-02, 9.88614064e-03, 6.21414554e-03,
        3.68964892e-03, 2.03566837e-03, 8.26990274e-04, 4.13495137e-04,
        2.06747569e-04, 1.03373784e-04, 5.16868921e-05, 2.58434461e-05,
        1.29217230e-05, 6.46086152e-06, 3.23043076e-06, 1.61521538e-06,
        8.07607690e-07, 4.03803845e-07, 2.01901922e-07, 1.00950961e-07,
        5.04754806e-08, 2.52377403e-08, 1.26188702e-08, 6.30943508e-09,
        3.15471754e-09, 1.57735877e-09, 7.88679384e-10, 3.94339692e-10,
        1.97169846e-10, 9.85849231e-11, 4.92924615e-11, 2.46462308e-11,
        1.23231154e-11, 6.16155769e-12, 3.08077885e-12, 1.54038942e-12,
        7.70194711e-13, 3.85097356e-13, 1.92548678e-13, 9.62743389e-14,
        4.81371695e-14, 2.40685847e-14, 1.20342924e-14, 6.01714618e-15,
        3.00857309e-15, 1.50428655e-15, 7.52143273e-16, 3.760716

In [17]:
import pyLDAvis.gensim_models as gensim_vis
from pyLDAvis import enable_notebook

# For visualizing the topics in a Jupyter notebook
enable_notebook()

lda_model_to_display = LdaModel.load(datapath(PATH_TO_MODEL)) if lda_model is None else lda_model 

# Options for dimensionality reduction: mds = 'pcoa', 'tsne', 'mmds'
LDAvis_prepared = gensim_vis.prepare(lda_model_to_display, doc_term_matrix, dictionary, mds='mmds')
LDAvis_prepared   # display the interactive visualization

# To save the visualization to an HTML file (requires `import pyLDAvis`)
# pyLDAvis.save_html(LDAvis_prepared, 'Test_run_LDA_'+ str(NUM_TOPICS) + '.html')


#LSA not supported in pyLDAvis

#LSAvis_prepared = gensim_vis.prepare(lsa_model, doc_term_matrix, dictionary) 
#pyLDAvis.save_html(LSAvis_prepared, 'topics_modeling_basics_LSA_'+ str(NUM_TOPICS) +'.html')

## The below section is experimental:

Note: add an explanation of what an embedding is, how they are learned, sentence vs word level embeddings (and the fact that we use word level). Also describe each approach, what worked and what didn't.  

### Trying to assign a label to a topic using word embeddings of the top 20 words in a topic sorted by relevance

In [17]:
# Top 20 words for mmds visualization of LDA model with 5 topics
top_words = ["model", "data", "value", "function", "distribution", "example", "probability", "number", "using", "use", "simulate", "sample", "independent", "average", "mean", "figure", "estimate", "variable", "measurement", "plot"]
print(len(top_words))

20


In [74]:

# import gensim.downloader as api
# model_location = api.load("fasttext-wiki-news-subwords-300", return_path=True)
# print(model_location)
# Stored at C:\Users\syeda/gensim-data\fasttext-wiki-news-subwords-300\


C:\Users\syeda/gensim-data\fasttext-wiki-news-subwords-300\fasttext-wiki-news-subwords-300.gz


In [43]:
from gensim.models.fasttext import load_facebook_model

model_location = datapath("C:/Users/syeda/OneDrive/Desktop/4th Year/DATA448/cc.en.300.bin")
pretrained_model = load_facebook_model(model_location)
finetuned_model = load_facebook_model(model_location)

In [46]:
import numpy as np

word_embeddings = [pretrained_model.wv[word] for word in top_words]
mean_vector = np.mean(word_embeddings, axis=0)

pt_similar_words = pretrained_model.wv.similar_by_vector(mean_vector, topn=5)
print(pt_similar_words)

topic_label = pt_similar_words[0][0]
print(f"Representative word for the topic: {topic_label}")

[('calculate', 0.6181637048721313), ('use', 0.6073808670043945), ('extrapolate', 0.5911571383476257), ('calculation', 0.5882555842399597), ('estimate', 0.5864962935447693)]
Representative word for the topic: calculate
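
The labeling step above boils down to averaging the word vectors and picking the vocabulary word with the highest cosine similarity to that mean. The sketch below shows the same idea with plain Python lists. The 2-dimensional toy vectors and the function names are made up for illustration, and the `exclude_inputs` option is an assumption added here because the result above contains words ('use', 'estimate') that are already in `top_words`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def label_from_mean_vector(top_words, vectors, exclude_inputs=True):
    """Return the vocabulary word closest to the mean of the top-word vectors."""
    dims = len(next(iter(vectors.values())))
    mean = [sum(vectors[w][d] for w in top_words) / len(top_words) for d in range(dims)]
    # Optionally skip the input words so the label is not one of them
    candidates = [w for w in vectors if not (exclude_inputs and w in top_words)]
    return max(candidates, key=lambda w: cosine(mean, vectors[w]))

# Toy 2-d "embeddings" (hypothetical values)
toy_vectors = {
    "model": [1.0, 0.0],
    "estimate": [0.6, 0.8],
    "plot": [0.0, 1.0],
    "calculate": [0.8, 0.4],
}
label = label_from_mean_vector(["model", "estimate"], toy_vectors)
print(label)
```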


In [52]:
finetuned_model.build_vocab(corpus_with_bigrams_trigrams, update=True)  # Add the new words to the vocabulary
finetuned_model.train(corpus_with_bigrams_trigrams, total_examples=len(corpus_with_bigrams_trigrams), epochs=10)  # Fine-tune the model

(57851, 291540)

In [53]:
# Now you can use the updated model with embeddings that include domain-specific words
ft_word_embeddings = [finetuned_model.wv[word] for word in top_words]
ft_mean_vector = np.mean(ft_word_embeddings, axis=0)

ft_similar_words = finetuned_model.wv.similar_by_vector(ft_mean_vector, topn=5)
print(ft_similar_words)

ft_topic_label = ft_similar_words[0][0]
print(f"Representative word for the topic: {ft_topic_label}")

[('variation', 0.9998103976249695), ('calculation', 0.9998043179512024), ('estimation', 0.9997916221618652), ('computer-simulation', 0.9997856020927429), ('correlation', 0.9997814893722534)]
Representative word for the topic: variation


In [49]:
np.allclose(mean_vector, ft_mean_vector, atol=1e-4)

False

In [54]:
from gensim.models import FastText

custom_model = FastText(vector_size=100, window=3, min_count=1, sentences=corpus_with_bigrams_trigrams, epochs=10)

In [55]:
custom_embeddings = [custom_model.wv[word] for word in top_words]
custom_mean_vector = np.mean(custom_embeddings, axis=0)

similar_words = custom_model.wv.similar_by_vector(custom_mean_vector, topn=5)
print(similar_words)

custom_topic_label = similar_words[0][0]
print(f"Representative word for the topic: {custom_topic_label}")

[('distancetraveled', 0.999996542930603), ('projected', 0.9999964833259583), ('example_consider', 0.9999963641166687), ('thersystemanintroductionandoverview', 0.9999961853027344), ('mentioned', 0.9999961256980896)]
Representative word for the topic: distancetraveled


### Trying to assign a label to a topic using a pre-trained transformer by encoding the top 20 words in a topic 

#### Finetuned T5

In [1]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("Michau/t5-base-en-generate-headline")

In [4]:
# Top 30 words for mmds visualization of LDA model with 5 topics
top_words = ["value","function","datum","random","use","model","variable", "time","figure","example","number","plot","estimate","random_variable","lag","histogram","probability","series","mean","simulate","standard","sample","r","regression","follow","level","distribution","variance","x","pseudorandom_number"]
print(len(top_words))

30


In [20]:
# Function to generate a one-word topic label from a list of words
def generate_topic_label(top_words: list) -> str:
    
    input_string = "label these topics: " + " ".join(top_words)
    print(input_string)
    
    # Tokenize the input string
    encoding = tokenizer.encode(input_string, return_tensors="pt")
    
    # Generate the label using the model
    output = model.generate(encoding, max_length=5, num_beams=4, early_stopping=True)
    
    # Decode the output to get the label
    label = tokenizer.decode(output[0], skip_special_tokens=True)
    
    return label

In [21]:
topic_label = generate_topic_label(top_words)
print(f"Generated topic label: {topic_label}")

label these topics: value function datum random use model variable time figure example number plot estimate random_variable lag histogram probability series mean simulate standard sample r regression follow level distribution variance x pseudorandom_number




Generated topic label: 


#### Finetuned BART

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

mname = "cristian-popa/bart-tl-all"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

In [2]:
def generate_topic_label_with_BART(top_words: list[str]) -> str:
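    # Join the words into one string so they form a single input sequence; passing the list tokenizes each word as a separate batch item, and only outputs[0] is decoded below
    enc = tokenizer(" ".join(top_words), return_tensors="pt", truncation=True, padding="max_length", max_length=128)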
    outputs = model.generate(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        max_length=15,
        min_length=1,
        do_sample=False,
        num_beams=25,
        length_penalty=1.0,
        repetition_penalty=1.5
    )

    label = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return label

In [8]:
!nvidia-smi

Wed Oct 23 13:29:08 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.44                 Driver Version: 552.44         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   66C    P8             11W /   95W |      73MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [7]:
topic_label = generate_topic_label_with_BART(top_words)
print(f"Generated topic label: {topic_label}")

Generated topic label: rate of return


### Trying BERTopic to get topic info for the entire corpus

In [None]:
#!pip install BERTopic
# !pip install tf-keras



In [52]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# `corpus` is a list of strings (one string per document)
# Split each document into 5 roughly equal parts to get more, shorter documents
split_corpus = []

for doc in corpus:
    # Calculate the length of each part in characters
    part_length = max(1, len(doc) // 5)  # Ensure at least one character per part
    parts = [doc[i:i + part_length] for i in range(0, len(doc), part_length)]
    
    # If there are more than 5 parts, combine excess parts
    while len(parts) > 5:
        last_part = parts.pop()
        parts[-1] += last_part  # Combine excess into the last part
    
    # Add the parts to the split_corpus
    split_corpus.extend(parts)

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words='english')

# Initialize BERTopic model
topic_model = BERTopic(vectorizer_model=vectorizer_model)


# Fit the BERTopic model on your corpus and extract topics
topics, probabilities = topic_model.fit_transform(split_corpus)



False


In [55]:
print(topic_model.get_topic_info())
# topic_model.visualize_topics() does not work here because only one topic was found

   Topic  Count                                  Name  \
0     -1     40  -1_function_random_data_distribution   

                                      Representation  \
0  [function, random, data, distribution, example...   

                                 Representative_Docs  
0  [ying this in the inverse CDF method runs as f...  


#### Less than ideal results: 

- BERTopic does not work out of the box with a corpus of 8 documents (here each chapter is one document stored as a string, so the corpus is a list of 8 strings), so we split the 8 documents into 40 by dividing each one evenly into 5 parts.
- The output is a single "topic" with index -1. According to the BERTopic documentation, topic ID -1 is for documents that "do not fit into any topics". All of our documents are assigned to this topic.
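
A side effect of splitting by character count is that chunks can start mid-word (the representative document above begins with "ying this in the inverse CDF"). A possible alternative, sketched below, is to split on whitespace so that every chunk contains whole words. The function name is hypothetical, and note that this version normalizes whitespace, so line breaks in the original text are lost.

```python
def split_on_words(text, n_parts=5):
    """Split text into n_parts chunks of whole words, as evenly as possible."""
    words = text.split()
    n_parts = min(n_parts, len(words))  # cannot make more chunks than words
    if n_parts == 0:
        return []
    base, extra = divmod(len(words), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks get one additional word
        size = base + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks

sample = "applying this in the inverse CDF method runs as follows today"
chunks = split_on_words(sample, n_parts=5)
print(chunks)
```

Whether this changes the BERTopic result has not been tested; it only removes the broken words at chunk boundaries.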