## TODOs - October

1. Refine preprocessing pipeline (use spacy or nltk or some combination of the two)
    - There are some quirks with n-grams currently, look into refining the implementation
    - Some words like "use", "since", "r", "x", are not being filtered out by stopword removal

2. Web scraping for job data
    - Collect roughly 50-100 examples per week and create a similar preprocessing pipeline
    - Look for ways to programmatically filter sections we want (responsibilities and qualifications).

3. Look into topic labeling
    - Automatically extracting top n words (and sorting them by relevance)
    - Look at how relevance is computed at https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/_prepare.py
    - BERTopic?
    
4. Finish Introduction and Data sections before midterm break
    - Literature review (Blei paper, Daniel paper, Journal of DSE paper, possibly find topic labeling papers?)
    - Decide on final dataset 


## TODOs - Final Submission

1. Use HDP output or LDA equivalent for module data. 
2. Preprocess and filter job data (maybe add more from LinkedIn if less than 150 after filtering).
3. Same analysis on job data.
4. Results section: Add visualizations and metrics.
5. Discussion section: talk about overlaps and differences between both datasets.
6. Finalize paper.

## Topic Modeling on MDS Program Lecture Material

### Some notation

- A 'document' is just a collection of words.
    - Initially, after loading the data, one document is contained in a string, containing all the text from one module.
    - After preprocessing, one document is represented in a "bag of words" format, which means it is a *list* of individual tokens (words).
- A 'corpus' is a collection of documents.
- d = number of documents in the corpus
- k = number of topics for the topic model to find
- |V| = size of vocabulary, i.e. number of distinct tokens in the corpus
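
To make the notation concrete, here is a minimal sketch on a made-up three-document corpus. The documents and the whitespace tokenizer are illustrative assumptions, not the actual preprocessing used later in this notebook.

```python
# Toy corpus: each string stands in for the full text of one module (made-up data)
raw_corpus = [
    "data model data",
    "model regression",
    "git command line",
]

# Bag-of-words format: each document becomes a list of tokens
tokenized = [doc.split() for doc in raw_corpus]

d = len(tokenized)                                          # number of documents
vocabulary = {token for doc in tokenized for token in doc}  # distinct tokens in the corpus

print(d)                # 3
print(len(vocabulary))  # 6
```

Note that `data` appears twice in the first document but counts only once towards |V|.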

### Imports and loading data

In [9]:
import string   # contains a public variable with all ASCII punctuation characters
import nltk

# list of all stopwords such as 'and', 'the', 'is', etc.
nltk.download('stopwords')  

# WordNet is a lexical database of English words that groups words into sets of synonyms, while also recording semantic relationships between words such as "is-a", "part-of", and "opposite-of" relationships.
nltk.download('wordnet')    

# Open Multilingual WordNet (omw) links hand created wordnets and automatically created wordnets for different languages.
nltk.download('omw-1.4')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk import ngrams

# Used to tokenize the text; i.e. create a dictionary mapping words to integers. The dictionary can be used to create a term-document matrix.
from gensim.corpora import Dictionary

from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.models.phrases import ENGLISH_CONNECTOR_WORDS

import spacy

from textacy import extract

import numpy as np

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\syeda\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [70]:
# For topic visualizations 
import pyLDAvis.gensim_models as gensim_vis
# For enabling HTML widget in Jupyter notebook
from pyLDAvis import enable_notebook
from pyLDAvis import display

enable_notebook()

Location of environment for personal reference:  c:\Users\syeda\miniconda3\envs\dir-st\lib\ (in case large models are downloaded for testing and need to be deleted)

In [3]:
import os

def combine_text_files_to_list(input_directory):

    txt_files = [os.path.join(input_directory, file) for file in os.listdir(input_directory) if file.endswith(".txt")]
    corpus = []

    for txt_file in txt_files:
        
        try:
            # Read the entire file as a string and add the string to the corpus
            with open(txt_file, 'r', encoding='utf-8') as file:
                file_content = file.read()  
                corpus.append(file_content)  
                
        except Exception as e:
            print(f"An error occurred while reading {txt_file}: {e}")
    
    return corpus

corpus = combine_text_files_to_list("../Dataset/Parsed_Slides")
print("Corpus combined successfully as a list of strings.")

Corpus combined successfully as a list of strings.


`corpus` is currently a list of strings, where each string is all the text from one module.

In [4]:
print(len(corpus))
print(corpus[0][:500])

18
                                        UC
Data Formats

UBCO Master of Data Science – DATA 530

                                          1
---
Learning Objectives•  Explain why it is important to understand and use correct terminology.
           •          Define: computer, software, memory, data, memory size/data size, cloud
           •          Explain "Big Data" and describe data growth in the coming years.
           •          Compare and contrast: digital versus analog
           •    


In [None]:
total_words = 0   # named total_words to avoid shadowing the built-in sum()
doc_lengths = []
for doc in corpus:
    words = doc.split()
    total_words += len(words)
    print("Number of words: ", len(words))
    doc_lengths.append(len(words))
    
print(f"Total number of words in the corpus: {total_words}")
print(f"Mean number of words per document: {round(np.mean(doc_lengths),2)}")
print(f"Standard deviation: {round(np.std(doc_lengths),2)}")

Number of words:  20326
Number of words:  20455
Number of words:  17429
Number of words:  16384
Number of words:  25640
Number of words:  18240
Number of words:  16140
Number of words:  26821
Number of words:  16101
Number of words:  16242
Number of words:  11286
Number of words:  20009
Number of words:  17638
Number of words:  26592
Number of words:  17207
Number of words:  31114
Number of words:  11464
Number of words:  22366
Total number of words in the corpus: 351454
Mean number of words: 19525.22
Standard deviation: 5144.3


### Cleaning and preprocessing the corpus

For this task, we explored two options, nltk and spaCy; spaCy is somewhat easier to use, while nltk requires more manual steps. In both cases, the input is a list of strings and the cleaned corpus is a list of lists of strings, where each nested list holds the cleaned tokens from one module.

In [16]:
def clean_with_nltk(doc):
    
    stop_words = set(stopwords.words('english'))
    punctuation = set(string.punctuation)
    lemmatizer = WordNetLemmatizer()
    lower_case_words = doc.lower().split()

    stop_free = " ".join([word for word in lower_case_words if word not in stop_words])             # only keep words that are not stopwords
    # print(stop_free)
    punc_free = "".join(ch for ch in stop_free if ch not in punctuation and not ch.isnumeric() and not ch == "•")         # only keep characters that are not punctuation, digits, or bullet symbols
    # print(punc_free)
    lemmatized = " ".join(lemmatizer.lemmatize(word) for word in punc_free.split())             # lemmatize words to their base form; without a POS argument, WordNetLemmatizer treats every word as a noun
    # print(lemmatized)

    # We do this separately later for nltk
    # bigrams = list(ngrams(lemmatized, 2))  
    # trigrams = list(ngrams(lemmatized, 3))  
    # bigram_strings = ["_".join(bigram) for bigram in bigrams]  # Join bigram words with an underscore
    # trigram_strings = ["_".join(trigram) for trigram in trigrams]

    return lemmatized 

def clean_with_spacy(doc):

    nlp = spacy.load("en_core_web_sm")
    # Add custom stop words, mostly including header and footer information like names of instructors, name of university, filler words like 'example', 'page', etc.
    nlp.Defaults.stop_words |= {"ubc", "mds", "lecture", "lab", "assignments", "example", "page", "file", "question", "ex", "import", "jeffrey", "andrews", "irene", "vrbik", "shan", "du", "ifeoma", "adaji", "gema", "rodrigues", "fatemeh", "fard", "emelie", "gustafsson", "xiaoping", "shi", "ladan", "tazik", "ramon", "lawrence"}
    
    spacy_doc = nlp(doc.lower(), disable=["parser", "ner"])  # Disable the parser and named entity recognition since we only need the tokenization, lemmatization, and POS tagging

    ngrams = [
        ngram.text.replace(" ", "_")    # words within an n-gram are separated by spaces, so we replace the spaces with underscores
        for ngram in extract.ngrams(spacy_doc, n = 2, min_freq = 4, filter_punct = True, filter_nums = True, exclude_pos=["PROPN", "ORG", "DATE", "X"]) 
        if "=" not in ngram.text 
            and "@" not in ngram.text 
            and "$" not in ngram.text
    ]
    
    # Remove stopwords, punctuation, and numeric tokens
    tokens = [
        token.lemma_ 
        for token in spacy_doc 
        if not token.is_stop and not token.is_punct and not token.is_digit and token.is_alpha       # Keep only words that are not stop words
            and token.text not in ["_", "+", "=", "\n","-","*","<",">"]                             # Remove special characters       
            and not len(token.text) == 1                                                            # Remove single character words
    ]    

    tokens = [token.replace("datum", "data") for token in tokens]  # Replace 'datum' (lemma of data) with 'data' for clarity                                                                         
    
    return tokens + ngrams

#### Cleaning with spaCy 

In [17]:
corpus_with_bigrams = [clean_with_spacy(doc) for doc in corpus]

In [18]:
total_tokens = 0   # named total_tokens to avoid shadowing the built-in sum()
for doc in corpus_with_bigrams:
    total_tokens += len(doc)

print(f"Total number of words in the cleaned corpus: {total_tokens}")

Total number of words in the cleaned corpus: 157613


In [64]:
print(corpus_with_bigrams[0][:10] + corpus_with_bigrams[0][-10:])

['uc', 'data', 'format', 'ubco', 'master', 'data', 'science', 'data', 'learn', 'explain', 'machine_learning', 'learning_studio', 'machine_learning', 'machine_learning', 'learning_studio', 'machine_learning', 'machine_learning', 'learning_studio', 'machine_learning', 'learning_studio']


#### Cleaning with nltk

In [None]:
nltk_cleaned_corpus = [clean_with_nltk(doc).split() for doc in corpus]
print(nltk_cleaned_corpus[0])

In [None]:
total_tokens = 0   # named total_tokens to avoid shadowing the built-in sum()
for doc in nltk_cleaned_corpus:
    total_tokens += len(doc)

print(f"Total number of words in the cleaned corpus: {total_tokens}")

Total number of words in the cleaned corpus: 181461
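
One possible contributor to stopwords surviving `clean_with_nltk` is the order of operations: stopwords are filtered on whitespace-split tokens before punctuation is stripped, so a token such as `since,` does not match the stop list and only becomes `since` afterwards. The sketch below illustrates this with a tiny, hypothetical stop list (`STOP_WORDS` and the helper names are assumptions for illustration; the real pipeline uses nltk's list).

```python
import string

# Hypothetical, tiny stop list for illustration only
STOP_WORDS = {"the", "is", "since", "we"}

def strip_punct(text):
    """Remove ASCII punctuation characters from a string."""
    return "".join(ch for ch in text if ch not in string.punctuation)

def stop_then_strip(text):
    """Same order as clean_with_nltk: stopword filter first, punctuation second."""
    kept = [w for w in text.lower().split() if w not in STOP_WORDS]
    return strip_punct(" ".join(kept)).split()

def strip_then_stop(text):
    """Alternative order: strip punctuation first, then filter stopwords."""
    return [w for w in strip_punct(text.lower()).split() if w not in STOP_WORDS]

sentence = "Since, the model is simple, we fit it."
print(stop_then_strip(sentence))  # ['since', 'model', 'simple', 'fit', 'it']
print(strip_then_stop(sentence))  # ['model', 'simple', 'fit', 'it']
```

Whether this explains the specific words listed in the TODOs would need to be checked against the actual stop list.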


In [None]:
bigram = Phrases(nltk_cleaned_corpus, min_count=10, connector_words=ENGLISH_CONNECTOR_WORDS)  
# trigram = Phrases(bigram[clean_corpus], threshold=10, connector_words=ENGLISH_CONNECTOR_WORDS)

bigram_mod = Phraser(bigram)
# trigram_mod = Phraser(trigram)

# add bigrams to the clean corpus (the trigram model above is commented out)
corpus_with_bigrams = [bigram_mod[doc] for doc in nltk_cleaned_corpus]

total_tokens = 0   # named total_tokens to avoid shadowing the built-in sum()
for doc in corpus_with_bigrams:
    total_tokens += len(doc)

print(f"Total number of words in the nltk corpus with ngrams: {total_tokens}")

<class 'list'>
Total number of words in the corpus with ngrams: 164907


#### Preprocessing into Document-Term matrix and id2word dictionary 

In [81]:
# Create a dictionary mapping token ID integers to words
dictionary = Dictionary(corpus_with_bigrams)    

# Create a sparse d x |V| document-term matrix: one list per document, holding a (token_id, count) pair for every token that appears in it.
# The count for document i and token j is the number of times token j appears in document i; tokens absent from a document are omitted.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in corpus_with_bigrams]  

print(doc_term_matrix[0])

[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 2), (7, 4), (8, 1), (9, 8), (10, 4), (11, 2), (12, 4), (13, 6), (14, 17), (15, 7), (16, 5), (17, 7), (18, 5), (19, 1), (20, 1), (21, 1), (22, 1), (23, 15), (24, 1), (25, 3), (26, 2), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 4), (39, 48), (40, 1), (41, 1), (42, 1), (43, 3), (44, 14), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 7), (53, 7), (54, 5), (55, 3), (56, 1), (57, 4), (58, 1), (59, 1), (60, 1), (61, 1), (62, 10), (63, 7), (64, 1), (65, 1), (66, 2), (67, 2), (68, 3), (69, 1), (70, 1), (71, 18), (72, 1), (73, 26), (74, 2), (75, 1), (76, 1), (77, 4), (78, 2), (79, 1), (80, 1), (81, 5), (82, 2), (83, 1), (84, 1), (85, 1), (86, 1), (87, 6), (88, 42), (89, 6), (90, 5), (91, 1), (92, 8), (93, 14), (94, 1), (95, 2), (96, 1), (97, 1), (98, 1), (99, 2), (100, 3), (101, 1), (102, 3), (103, 3), (104, 2), (105, 1), (106, 7), (107, 1), (108, 20), (109, 6),
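
The `(token_id, count)` pairs printed above can be reproduced in spirit with plain Python. This is a minimal sketch on made-up documents; the id assignment (order of first appearance) is an assumption and may differ from how gensim's `Dictionary` numbers tokens.

```python
from collections import Counter

# Made-up tokenized documents (stand-ins for corpus_with_bigrams)
docs = [
    ["data", "model", "data"],
    ["model", "regression"],
]

# Assign an integer id to each distinct token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(bow_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]

# Expanding to a dense d x |V| matrix makes the implicit zeros visible
dense = [[dict(bow).get(j, 0) for j in range(len(token2id))] for bow in bow_corpus]
print(dense)  # [[2, 1, 0], [0, 1, 1]]
```

The sparse form stores only non-zero counts, which matters here because each module uses a small fraction of the full vocabulary.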

### Topic modeling

#### First run of LDA

In [53]:
NUM_TOPICS = 18
PATH_TO_MODEL = f"18_Modules_Test_LDA_spacy_{NUM_TOPICS}_topics"
lda_model = None

In [102]:
from gensim.models import LdaModel
# from pprint import pprint

lda_model = LdaModel(doc_term_matrix, num_topics=NUM_TOPICS, id2word = dictionary)
lda_model.show_topics(num_topics = -1, num_words = 10)
# pprint(lda_model.print_topics(num_topics=NUM_TOPICS, num_words=3))

[(0,
  '0.016*"data" + 0.007*"value" + 0.006*"sample" + 0.006*"model" + 0.005*"prior" + 0.005*"number" + 0.004*"function" + 0.004*"test" + 0.004*"use" + 0.004*"class"'),
 (1,
  '0.020*"data" + 0.007*"model" + 0.007*"value" + 0.005*"function" + 0.004*"use" + 0.004*"error" + 0.004*"variable" + 0.004*"code" + 0.004*"number" + 0.004*"class"'),
 (2,
  '0.013*"function" + 0.012*"data" + 0.011*"value" + 0.007*"variable" + 0.006*"sample" + 0.006*"model" + 0.006*"number" + 0.005*"random" + 0.005*"use" + 0.005*"mean"'),
 (3,
  '0.024*"data" + 0.013*"model" + 0.012*"value" + 0.007*"linear" + 0.006*"variable" + 0.006*"function" + 0.005*"regression" + 0.005*"sample" + 0.005*"probability" + 0.004*"estimate"'),
 (4,
  '0.012*"data" + 0.009*"model" + 0.005*"function" + 0.005*"use" + 0.005*"sample" + 0.005*"prior" + 0.004*"set" + 0.004*"value" + 0.004*"time" + 0.004*"class"'),
 (5,
  '0.015*"data" + 0.012*"model" + 0.006*"value" + 0.006*"prior" + 0.005*"class" + 0.005*"estimate" + 0.005*"function" + 0.

Each row corresponds to a topic, and the coefficient next to each word is the probability of that word being sampled from that topic. The order of the rows is arbitrary. Each row actually contains |V| coefficients that sum to 1; here we only show the top 10 words, sorted by coefficient.
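
As a small illustration of reading one such row, the sketch below takes a made-up topic distribution over a five-word vocabulary, checks that it sums to 1, and prints the top words in the same `prob*"word"` style. The numbers and helper names are hypothetical.

```python
# Hypothetical topic row: probabilities over a 5-word vocabulary (made-up numbers)
topic_row = {"data": 0.40, "model": 0.25, "value": 0.20, "git": 0.10, "sql": 0.05}

def top_n_words(row, n):
    """Return the n most probable (word, probability) pairs, highest first."""
    return sorted(row.items(), key=lambda item: item[1], reverse=True)[:n]

def format_topic(pairs):
    """Format pairs in the 'prob*"word" + ...' style used in the topic listings."""
    return " + ".join(f'{prob:.3f}*"{word}"' for word, prob in pairs)

# A full row is a probability distribution, so it sums to 1 (up to rounding)
print(round(sum(topic_row.values()), 10))  # 1.0

print(format_topic(top_n_words(topic_row, 3)))
# 0.400*"data" + 0.250*"model" + 0.200*"value"
```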

These topics, however, are not very good: the same generic words ('data', 'model', 'value', 'function') dominate every topic. Printing the topics with their coherence scores:

In [57]:
lda_model.top_topics(doc_term_matrix, dictionary=dictionary, coherence='u_mass')

[([(0.022746926, 'data'),
   (0.0073463954, 'model'),
   (0.0072606923, 'value'),
   (0.007202027, 'function'),
   (0.0071181622, 'use'),
   (0.0055006463, 'return'),
   (0.004518674, 'code'),
   (0.004265579, 'sample'),
   (0.004188394, 'error'),
   (0.004135151, 'test'),
   (0.0038909747, 'linear'),
   (0.003801089, 'number'),
   (0.0035658542, 'time'),
   (0.0033851427, 'object'),
   (0.0032262348, 'regression'),
   (0.003181321, 'variable'),
   (0.0031317982, 'true'),
   (0.003053496, 'class'),
   (0.0030201823, 'give'),
   (0.002883781, 'set')],
  -0.06786412198060018),
 ([(0.01783041, 'data'),
   (0.009124272, 'value'),
   (0.005935375, 'model'),
   (0.005802993, 'function'),
   (0.0052927528, 'use'),
   (0.004334752, 'output'),
   (0.0038711063, 'number'),
   (0.0037848954, 'variable'),
   (0.0036884136, 'return'),
   (0.0036399919, 'time'),
   (0.0035264364, 'observation'),
   (0.0034690364, 'set'),
   (0.0033882384, 'true'),
   (0.0033833415, 'sample'),
   (0.0031649943, 'colu

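The highest u_mass coherence in the output above is about -0.07. Under u_mass, values closer to 0 indicate higher coherence, but a topic made up of words that appear in nearly every document (such as 'data' and 'model') scores close to 0 almost automatically, so this number does not mean the topics are well separated.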
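
A simplified sketch of a UMass-style coherence score may help when interpreting these numbers. It averages log((D(w_i, w_j) + ε) / D(w_j)) over word pairs, where D counts the documents containing the given words. The toy documents, the value of `EPSILON`, and the helper names are assumptions; gensim's implementation may differ in details such as segmentation and aggregation.

```python
import math
from itertools import combinations

EPSILON = 1e-12  # smoothing constant to avoid log(0); an assumed value

def doc_frequency(docs, *words):
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in docs if all(w in doc for w in words))

def umass_coherence(top_words, docs):
    """Mean over word pairs of log((D(w_i, w_j) + eps) / D(w_j)).

    top_words is ordered from most to least probable; in each pair,
    w_j is the higher-ranked word. Every word must occur in some document.
    """
    scores = []
    for w_j, w_i in combinations(top_words, 2):
        joint = doc_frequency(docs, w_i, w_j)
        scores.append(math.log((joint + EPSILON) / doc_frequency(docs, w_j)))
    return sum(scores) / len(scores)

# Made-up documents represented as sets of tokens
docs = [
    {"data", "model", "sql", "table"},
    {"data", "model", "git", "command"},
    {"data", "model", "prior", "posterior"},
]

generic = umass_coherence(["data", "model"], docs)  # words found in every document
mixed = umass_coherence(["sql", "git"], docs)       # words that never co-occur

print(generic)  # very close to 0
print(mixed)    # strongly negative
```

In this toy setup the generic pair scores near 0 simply because both words occur everywhere, which is why a near-zero score alone does not show that a topic is distinctive.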

In [58]:
from gensim.test.utils import datapath
lda_model.save(datapath(PATH_TO_MODEL))

# Datapath: c:\Users\syeda\miniconda3\envs\dir-st\lib\site-packages\gensim\test\test_data\

Visualizing with pyLDAvis

In [71]:
# To save all the computation, previous results can be loaded from the disk after running only the cell where PATH_TO_MODEL is defined
lda_model_to_display = LdaModel.load(datapath(PATH_TO_MODEL)) if lda_model is None else lda_model 

# Options for 'mds' (dimensionality reduction): mds = 'pcoa' (Principal Coordinate Analysis), 'tsne', 'mmds'
LDAvis_prepared = gensim_vis.prepare(lda_model_to_display, doc_term_matrix, dictionary, mds='mmds')
display(LDAvis_prepared)

# To save the visualization to an HTML file (needs `import pyLDAvis` first, since only a submodule and two functions are imported above)
# pyLDAvis.save_html(LDAvis_prepared, f'Test_run_LDA_{NUM_TOPICS}.html')

In [72]:
gensim_vis.prepare(lda_model_to_display, doc_term_matrix, dictionary, mds='pcoa')

How to interpret the visualizations: 

- The size of each circle reflects the prevalence of the topic in the corpus (i.e. topics whose words account for a larger share of the corpus are larger).
- The distance between the centers of two circles approximates how dissimilar the topics' word distributions are.
- Overlap between circles indicates some similarity in the per-word probabilities of those topics.
- Note that the radii of the circles and the distances between circles are not on the same scale; they are presented this way for simplicity of visualization.

Rough overview of how the visualization is generated:

- First, the k x |V| topic-word (beta) matrix from the LdaModel object is used to calculate pairwise distances between topics, using Jensen-Shannon Divergence (a method of measuring the similarity between two probability distributions, remembering that each row of the beta matrix is a probability distribution over all words in the vocabulary). 
- Then, this distance matrix is projected into a 2D plane using the various options for dimensionality reduction techniques accepted as parameters. 
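
The first step above can be sketched in plain Python. The code below computes pairwise Jensen-Shannon divergences between the rows of a made-up 3 x 4 beta matrix; it stops at the distance matrix and does not reproduce the 2D projection, and the function names are our own rather than pyLDAvis's.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Made-up topic-word rows over a 4-word vocabulary (each row sums to 1)
beta = [
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]

k = len(beta)
distances = [[js_divergence(beta[i], beta[j]) for j in range(k)] for i in range(k)]
for row in distances:
    print([round(x, 3) for x in row])
```

Topics 0 and 1 have nearly identical rows and end up close together, while topic 2 is far from both; the divergence is symmetric and bounded above by ln 2.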

For more info, see: https://pyldavis.readthedocs.io/en/latest/_modules/pyLDAvis/_prepare.html#js_PCoA

Here we show two dimensionality reduction techniques on the same LDA model: MMDS (Metric Multidimensional Scaling) and PCoA (Principal Coordinate Analysis, which is different from Principal Component Analysis).

Commonly occurring words like 'data', 'model', 'value', and 'function' clog up the output for most topics. Moreover, most topics seem too general to be given narrower labels like "machine learning" or "databases". Lastly, topics after topic 13 appear to contain few words, so a model with around 10-12 topics should also be explored.

#### Testing an HDP model

HDP stands for Hierarchical Dirichlet Process. For an HDP topic model, we don't need to provide `k`, the number of topics.

In [37]:
from gensim.models import HdpModel
# from pprint import pprint

hdp_model = HdpModel(doc_term_matrix, id2word = dictionary)
hdp_model.optimal_ordering()
hdp_model.show_topics(num_topics=15)

[(0,
  '0.012*data + 0.010*variable + 0.010*random + 0.010*function + 0.010*value + 0.007*number + 0.007*command + 0.007*probability + 0.006*model + 0.006*density + 0.006*line + 0.005*use + 0.005*time + 0.005*mean + 0.005*simulate + 0.005*git + 0.005*distribution + 0.005*poisson + 0.005*open + 0.004*sample'),
 (1,
  '0.026*data + 0.015*model + 0.013*observation + 0.008*class + 0.008*estimate + 0.007*classifier + 0.006*classification + 0.006*set + 0.006*linear + 0.006*tree + 0.006*group + 0.006*predictor + 0.006*distance + 0.005*hyperplane + 0.005*vector + 0.005*regression + 0.005*support + 0.005*mean + 0.005*margin + 0.005*training'),
 (2,
  '0.039*prior + 0.028*posterior + 0.023*model + 0.018*likelihood + 0.017*introduction + 0.015*normal + 0.012*probability + 0.012*distribution + 0.012*data + 0.011*beta + 0.009*chain + 0.009*bayesian + 0.009*sample + 0.009*regression + 0.008*diagnostic + 0.007*binomial + 0.007*parameter + 0.006*plot + 0.006*stan + 0.006*step'),
 (3,
  '0.016*emp + 0.

These topics are very good! Broadly speaking, we can see topics for probability and modeling, supervised learning techniques, databases, Python programming, Excel, version control, data structures and algorithms, etc.

In [39]:
alpha, beta = hdp_model.hdp_to_lda()
print(alpha.shape, beta.shape)

(150,) (150, 11094)


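However, we see that the HDP model actually contains 150 topics, most of which are junk topics. (150 is the default truncation level `T` in gensim's `HdpModel`, so it is an upper bound rather than a number the model discovered.) Obtaining a somewhat equivalent LDA model using the same alpha and beta from the HDP model, we see the following: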

In [60]:
suggested_lda_model = hdp_model.suggested_lda_model()
suggested_lda_model.show_topics(num_topics=10, num_words=15)

[(106,
  '0.000*"rstrip" + 0.000*"picker" + 0.000*"repeatedly" + 0.000*"fluent" + 0.000*"pr" + 0.000*"crd_case" + 0.000*"la" + 0.000*"thread" + 0.000*"arcs" + 0.000*"classification_task" + 0.000*"richmond" + 0.000*"outsize" + 0.000*"deviance" + 0.000*"deletion" + 0.000*"people"'),
 (28,
  '0.000*"binary_tree" + 0.000*"expandtab" + 0.000*"gene" + 0.000*"posterior_mean" + 0.000*"tew" + 0.000*"mary" + 0.000*"const" + 0.000*"alertbeforeoverwrite" + 0.000*"examplesupervise" + 0.000*"yalue" + 0.000*"carefully" + 0.000*"rstan" + 0.000*"collcc" + 0.000*"drop_missing" + 0.000*"dashboardnevv"'),
 (32,
  '0.000*"resultdefective" + 0.000*"glass" + 0.000*"eve" + 0.000*"gershwin" + 0.000*"vvi" + 0.000*"stem" + 0.000*"controllable" + 0.000*"bus" + 0.000*"primer" + 0.000*"specifie" + 0.000*"aii" + 0.000*"enderby" + 0.000*"query" + 0.000*"immunologist" + 0.000*"inverted"'),
 (59,
  '0.000*"youtube" + 0.000*"cnns" + 0.000*"versa" + 0.000*"auxiliary" + 0.000*"littermate" + 0.000*"pmy" + 0.000*"malware" +

These topics don't make sense. Notice how all word probabilities round to 0.000.

In [61]:
# Visualizing the equivalent LDA model output from the HdpModel object
gensim_vis.prepare(suggested_lda_model, doc_term_matrix, dictionary, mds='mmds')

It is possible that `suggested_lda_model()` doesn't work as intended; either way, the resulting visualization is not very informative.

However, we saw earlier, from the top 15 topics generated by HDP, that obtaining meaningful topics is possible.

We will now try tweaking some of the parameters and inputs for the LDA model, until the results are satisfactory. LDA is more desirable as we can visualize the results with pyLDAvis, get document-topic probabilities with built-in functions, and generally be able to explain the results a bit better.

#### Testing different parameters for LDA

In [76]:
# Changing number of topics 
new_lda_model = LdaModel(doc_term_matrix, num_topics=5, id2word = dictionary)
new_lda_model.show_topics(num_words=20)

[(0,
  '0.013*"data" + 0.007*"model" + 0.007*"value" + 0.006*"function" + 0.005*"sample" + 0.005*"true" + 0.004*"use" + 0.004*"number" + 0.004*"column" + 0.003*"return" + 0.003*"time" + 0.003*"select" + 0.003*"create" + 0.003*"add" + 0.003*"variable" + 0.003*"code" + 0.003*"distribution" + 0.003*"probability" + 0.003*"key" + 0.003*"class"'),
 (1,
  '0.016*"data" + 0.010*"value" + 0.008*"model" + 0.005*"sample" + 0.005*"function" + 0.005*"prior" + 0.004*"use" + 0.004*"posterior" + 0.004*"number" + 0.004*"probability" + 0.004*"return" + 0.004*"select" + 0.004*"time" + 0.004*"mean" + 0.003*"likelihood" + 0.003*"true" + 0.003*"output" + 0.003*"test" + 0.003*"regression" + 0.003*"distribution"'),
 (2,
  '0.025*"data" + 0.013*"model" + 0.009*"value" + 0.006*"variable" + 0.006*"function" + 0.006*"use" + 0.005*"distribution" + 0.005*"linear" + 0.005*"prior" + 0.005*"mean" + 0.005*"regression" + 0.004*"number" + 0.004*"error" + 0.004*"probability" + 0.004*"estimate" + 0.004*"likelihood" + 0.004

There is again a lot of overlap, with the top 3 words being similar in all topics, and with "regression", "distribution", and "probability" occurring frequently in multiple topics. We likely need more topics.

In [None]:
# Changing number of topics 
new_lda_model = LdaModel(doc_term_matrix, num_topics=13, id2word = dictionary)
new_lda_model.show_topics(num_topics = 13, num_words = 10)

[(0,
  '0.015*"data" + 0.007*"function" + 0.006*"value" + 0.006*"model" + 0.005*"number" + 0.004*"prior" + 0.004*"sample" + 0.004*"use" + 0.004*"set" + 0.004*"variable"'),
 (1,
  '0.016*"data" + 0.010*"model" + 0.008*"value" + 0.005*"function" + 0.004*"time" + 0.004*"sample" + 0.004*"mean" + 0.004*"distribution" + 0.004*"use" + 0.004*"output"'),
 (2,
  '0.016*"data" + 0.010*"value" + 0.009*"function" + 0.007*"model" + 0.006*"use" + 0.004*"class" + 0.004*"true" + 0.004*"create" + 0.004*"number" + 0.003*"test"'),
 (3,
  '0.017*"data" + 0.010*"value" + 0.007*"function" + 0.005*"use" + 0.005*"model" + 0.005*"number" + 0.004*"variable" + 0.004*"true" + 0.004*"create" + 0.004*"time"'),
 (4,
  '0.027*"data" + 0.010*"value" + 0.008*"model" + 0.007*"function" + 0.005*"number" + 0.004*"variable" + 0.004*"random" + 0.004*"use" + 0.004*"linear" + 0.003*"mean"'),
 (5,
  '0.022*"data" + 0.015*"model" + 0.010*"value" + 0.009*"prior" + 0.006*"posterior" + 0.006*"linear" + 0.006*"probability" + 0.005*"

Exactly the same issue. It seems that the underlying problem might be in another parameter.

In [94]:
# Trying multiple passes through the corpus
new_lda_model = LdaModel(doc_term_matrix, num_topics = 13, id2word = dictionary, passes = 10)
new_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.001*"data" + 0.001*"value" + 0.001*"model" + 0.001*"use" + 0.001*"function" + 0.000*"class" + 0.000*"number" + 0.000*"return" + 0.000*"variable" + 0.000*"create"'),
 (1,
  '0.015*"emp" + 0.015*"select" + 0.015*"database" + 0.014*"table" + 0.014*"data" + 0.011*"query" + 0.011*"key" + 0.010*"eno" + 0.010*"sql" + 0.008*"dno"'),
 (2,
  '0.017*"data" + 0.015*"command" + 0.010*"git" + 0.009*"open" + 0.009*"line" + 0.008*"use" + 0.007*"echo" + 0.007*"create" + 0.006*"key" + 0.006*"address"'),
 (3,
  '0.025*"data" + 0.012*"model" + 0.009*"tree" + 0.008*"algorithm" + 0.007*"value" + 0.007*"node" + 0.007*"search" + 0.007*"estimate" + 0.006*"function" + 0.006*"markov"'),
 (4,
  '0.029*"sample" + 0.016*"treatment" + 0.014*"population" + 0.011*"unit" + 0.010*"factor" + 0.010*"design" + 0.008*"block" + 0.007*"estimate" + 0.007*"effect" + 0.007*"sampling"'),
 (5,
  '0.047*"data" + 0.016*"model" + 0.013*"group" + 0.012*"distance" + 0.010*"cluster" + 0.008*"mixture" + 0.008*"mean" + 0.007*"su

These are way better! We can see some topics for version control and CLI, databases, clustering, probability, neural networks, etc. These aren't as well-defined as the HDP topics we saw earlier, but this is a step in the right direction. We can get coherence estimates for these:

In [98]:
new_lda_model.top_topics(texts=corpus_with_bigrams, topn = 5, dictionary=dictionary, coherence='u_mass')

[([(0.0014016541, 'data'),
   (0.0008516956, 'value'),
   (0.00065876043, 'model'),
   (0.00056428084, 'use'),
   (0.00051012554, 'function')],
  1.0000889005818408e-12),
 ([(0.01844681, 'data'),
   (0.013736188, 'function'),
   (0.010870896, 'value'),
   (0.009157626, 'return'),
   (0.007932162, 'use')],
  -0.017147524150961026),
 ([(0.018807977, 'random'),
   (0.017570755, 'function'),
   (0.016726077, 'variable'),
   (0.016259432, 'value'),
   (0.013538574, 'probability')],
  -0.14395388756261035),
 ([(0.03204114, 'data'),
   (0.013893893, 'linear'),
   (0.012481209, 'model'),
   (0.012240197, 'value'),
   (0.010555445, 'regression')],
  -0.1621860432420283),
 ([(0.02542874, 'data'),
   (0.01197635, 'model'),
   (0.009108014, 'tree'),
   (0.008039388, 'algorithm'),
   (0.007424762, 'value')],
  -0.22096470973196275),
 ([(0.020044066, 'layer'),
   (0.018470049, 'network'),
   (0.0131868245, 'neural'),
   (0.010748517, 'model'),
   (0.010137633, 'input')],
  -0.2315007612950553),
 ([(

We can also see if this improves with the number of passes.

In [99]:
# Trying multiple passes through the corpus
new_lda_model = LdaModel(doc_term_matrix, num_topics = 13, id2word = dictionary, passes = 25)
new_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.040*"prior" + 0.028*"posterior" + 0.023*"model" + 0.018*"likelihood" + 0.017*"introduction" + 0.015*"normal" + 0.013*"probability" + 0.012*"distribution" + 0.012*"data" + 0.011*"beta"'),
 (1,
  '0.047*"data" + 0.016*"model" + 0.013*"group" + 0.012*"distance" + 0.010*"cluster" + 0.008*"mixture" + 0.008*"mean" + 0.007*"supervised" + 0.006*"fa" + 0.006*"mixture_models"'),
 (2,
  '0.016*"emp" + 0.015*"select" + 0.015*"database" + 0.014*"table" + 0.014*"data" + 0.011*"query" + 0.011*"key" + 0.010*"eno" + 0.010*"sql" + 0.008*"dno"'),
 (3,
  '0.021*"data" + 0.011*"cell" + 0.010*"mar" + 0.008*"macro" + 0.008*"excel" + 0.008*"food" + 0.007*"format" + 0.007*"toy" + 0.006*"feb" + 0.006*"jacket"'),
 (4,
  '0.038*"data" + 0.023*"model" + 0.013*"linear" + 0.013*"regression" + 0.010*"estimate" + 0.010*"value" + 0.008*"distribution" + 0.008*"error" + 0.008*"variable" + 0.007*"fit"'),
 (5,
  '0.000*"data" + 0.000*"function" + 0.000*"model" + 0.000*"value" + 0.000*"return" + 0.000*"use" + 0.00

These topics are definitely easier to distinguish.

In [100]:
new_lda_model.top_topics(texts=corpus_with_bigrams, topn = 5, dictionary=dictionary, coherence='u_mass')

[([(0.00012880351, 'data'),
   (0.000108789645, 'function'),
   (0.000108348424, 'model'),
   (0.00010589592, 'value'),
   (0.00010256286, 'return')],
  -0.022863365534955882),
 ([(0.000120268145, 'data'),
   (0.00010872201, 'model'),
   (0.00010788249, 'value'),
   (0.000102317805, 'function'),
   (0.00010081423, 'sample')],
  -0.022863365534955882),
 ([(0.014609571, 'model'),
   (0.011688336, 'value'),
   (0.011552398, 'function'),
   (0.010902283, 'variable'),
   (0.009636585, 'random')],
  -0.07292862271650179),
 ([(0.01571558, 'app'),
   (0.013955539, 'dash'),
   (0.013290086, 'data'),
   (0.013143517, 'plot'),
   (0.012876387, 'value')],
  -0.09650808960176231),
 ([(0.038469072, 'data'),
   (0.023475152, 'model'),
   (0.013386337, 'linear'),
   (0.012581783, 'regression'),
   (0.009769557, 'estimate')],
  -0.22796739026044666),
 ([(0.015549483, 'emp'),
   (0.015059328, 'select'),
   (0.014709127, 'database'),
   (0.014429376, 'table'),
   (0.014291155, 'data')],
  -0.251307752668

In [111]:
gensim_vis.prepare(new_lda_model, doc_term_matrix, dictionary, mds='mmds')

We can also try changing another `LdaModel` parameter, `alpha`.

In [88]:
# The default value of alpha is 'symmetric', which assumes a symmetric Dirichlet prior over the per-document topic distributions. 
# This means that all the models so far used an alpha vector of length k in which every value is equal; gensim sets each value to 1/k (Dirichlet parameters do not have to sum to 1 in general).
print(new_lda_model.alpha)

# Since k = 13, all values of alpha will be 1/13

[0.07692308 0.07692308 0.07692308 0.07692308 0.07692308 0.07692308
 0.07692308 0.07692308 0.07692308 0.07692308 0.07692308 0.07692308
 0.07692308]


In [105]:
# Testing alpha = 'auto', meaning that the alpha prior is no longer assumed to be symmetric and is learned from the data. 

auto_lda_model = LdaModel(doc_term_matrix, num_topics = 13, id2word = dictionary, alpha = 'auto', passes = 20)
auto_lda_model.show_topics(num_topics = -1, num_words = 10)

[(0,
  '0.014*"data" + 0.010*"function" + 0.010*"value" + 0.008*"return" + 0.008*"use" + 0.008*"def" + 0.007*"class" + 0.007*"python" + 0.007*"test" + 0.007*"code"'),
 (1,
  '0.023*"prior" + 0.019*"model" + 0.016*"posterior" + 0.013*"probability" + 0.013*"likelihood" + 0.012*"normal" + 0.011*"distribution" + 0.010*"introduction" + 0.010*"data" + 0.010*"value"'),
 (2,
  '0.010*"data" + 0.009*"model" + 0.008*"layer" + 0.008*"network" + 0.007*"tree" + 0.007*"function" + 0.007*"observation" + 0.006*"value" + 0.006*"class" + 0.006*"algorithm"'),
 (3,
  '0.000*"data" + 0.000*"model" + 0.000*"function" + 0.000*"value" + 0.000*"use" + 0.000*"number" + 0.000*"key" + 0.000*"create" + 0.000*"select" + 0.000*"table"'),
 (4,
  '0.000*"data" + 0.000*"model" + 0.000*"value" + 0.000*"sample" + 0.000*"function" + 0.000*"use" + 0.000*"distribution" + 0.000*"variable" + 0.000*"prior" + 0.000*"probability"'),
 (5,
  '0.036*"data" + 0.017*"value" + 0.014*"model" + 0.012*"linear" + 0.010*"regression" + 0.00

In [107]:
auto_lda_model.top_topics(texts=corpus_with_bigrams, topn = 5, dictionary=dictionary, coherence='u_mass')

[([(0.00042033297, 'data'),
   (0.00021554166, 'model'),
   (0.00020096704, 'function'),
   (0.00018997246, 'value'),
   (0.00018637287, 'use')],
  1.0000889005818408e-12),
 ([(0.014069638, 'data'),
   (0.010341384, 'function'),
   (0.009705822, 'value'),
   (0.0083693275, 'return'),
   (0.007697625, 'use')],
  -0.017147524150961026),
 ([(0.00020413817, 'data'),
   (0.00015306866, 'model'),
   (0.0001468107, 'value'),
   (0.00013594923, 'sample'),
   (0.00012988842, 'function')],
  -0.017147524150961026),
 ([(0.03579087, 'data'),
   (0.01652196, 'value'),
   (0.014390431, 'model'),
   (0.012084322, 'linear'),
   (0.010351111, 'regression')],
  -0.18574265037330495),
 ([(0.015547936, 'emp'),
   (0.01505812, 'select'),
   (0.014706542, 'database'),
   (0.014427981, 'table'),
   (0.014290942, 'data')],
  -0.2513077526689583),
 ([(0.028842596, 'sample'),
   (0.016180214, 'treatment'),
   (0.014138098, 'population'),
   (0.01095222, 'unit'),
   (0.010135706, 'factor')],
  -0.299276460530273
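
For reference, `coherence='u_mass'` scores a topic by how often its top words co-occur in the same documents. The sketch below is a simplified, standard-library version of that idea and is not gensim's exact implementation: the function name, the toy documents, the smoothing constant, and the choice to average over word pairs are all assumptions made for illustration.

```python
import math


def umass_coherence(top_words, documents, eps=1e-12):
    """Average log((D(w_i, w_j) + eps) / D(w_j)) over ordered pairs of top words.

    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(top_words[i], top_words[j])
            scores.append(math.log((joint + eps) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy tokenised corpus (made up for this example)
toy_docs = [["prior", "posterior"], ["prior", "posterior"], ["select", "table"]]

always_together = umass_coherence(["prior", "posterior"], toy_docs)
never_together = umass_coherence(["prior", "table"], toy_docs)
print(always_together, never_together)
```

Scores near 0 mean the top words almost always appear together, while strongly negative scores mean they rarely share a document.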

In [108]:
print(auto_lda_model.alpha)

[0.02894416 0.02311588 0.02289192 0.010843   0.01012669 0.02420626
 0.01258473 0.01334439 0.01257119 0.01228462 0.01322843 0.0104722
 0.01245722]
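
To make the learned prior easier to read, the printed values can be summed and ranked. This is a small sketch that copies the numbers from the output above; the variable names are illustrative.

```python
# Values copied from the printed auto_lda_model.alpha above
learned_alpha = [
    0.02894416, 0.02311588, 0.02289192, 0.010843, 0.01012669, 0.02420626,
    0.01258473, 0.01334439, 0.01257119, 0.01228462, 0.01322843, 0.0104722,
    0.01245722,
]

alpha_total = sum(learned_alpha)
print(round(alpha_total, 4))  # the learned values sum to about 0.2071

# Rank topics from smallest to largest prior weight
ranked = sorted(enumerate(learned_alpha), key=lambda pair: pair[1])
for topic_id, value in ranked[:3]:
    print(topic_id, value)
```

The three smallest entries belong to topics 4, 11 and 3. Topics 3 and 4 are also the ones whose word weights all print as 0.000 in the `show_topics` output, which suggests the model is barely using them.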


In [109]:
gensim_vis.prepare(auto_lda_model, doc_term_matrix, dictionary, mds='mmds')

## The below section is experimental:

Note: add an explanation of what an embedding is, how they are learned, sentence vs word level embeddings (and the fact that we use word level). Also describe each approach, what worked and what didn't.  

### Trying to assign a label to a topic using word embeddings of the top 20 words in a topic sorted by relevance

In [17]:
# Top 20 words for mmds visualization of LDA model with 5 topics
top_words = ["model", "data", "value", "function", "distribution", "example", "probability", "number", "using", "use", "simulate", "sample", "independent", "average", "mean", "figure", "estimate", "variable", "measurement", "plot"]
print(len(top_words))

20


In [74]:

# import gensim.downloader as api
# model_location = api.load("fasttext-wiki-news-subwords-300", return_path=True)
# print(model_location)
# Stored at C:\Users\syeda/gensim-data\fasttext-wiki-news-subwords-300\


C:\Users\syeda/gensim-data\fasttext-wiki-news-subwords-300\fasttext-wiki-news-subwords-300.gz


In [43]:
from gensim.models.fasttext import load_facebook_model
from gensim.test.utils import datapath

model_location = datapath("C:/Users/syeda/OneDrive/Desktop/4th Year/DATA448/cc.en.300.bin")
pretrained_model = load_facebook_model(model_location)
finetuned_model = load_facebook_model(model_location)

In [46]:
import numpy as np

word_embeddings = [pretrained_model.wv[word] for word in top_words]
mean_vector = np.mean(word_embeddings, axis=0)

pt_similar_words = pretrained_model.wv.similar_by_vector(mean_vector, topn=5)
print(pt_similar_words)

topic_label = pt_similar_words[0][0]
print(f"Representative word for the topic: {topic_label}")

[('calculate', 0.6181637048721313), ('use', 0.6073808670043945), ('extrapolate', 0.5911571383476257), ('calculation', 0.5882555842399597), ('estimate', 0.5864962935447693)]
Representative word for the topic: calculate
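
The labeling idea used here (average the word vectors, then look up the nearest words by cosine similarity) can be illustrated without a pretrained model. The sketch below uses made-up 3-dimensional vectors; `toy_embeddings`, `mean_vector` and `cosine` are hypothetical names for this example and the numbers carry no meaning beyond it.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings: three "statistics" words and one unrelated word
toy_embeddings = {
    "mean": [0.9, 0.1, 0.0],
    "variance": [0.8, 0.2, 0.1],
    "estimate": [0.85, 0.15, 0.05],
    "table": [0.0, 0.1, 0.9],
}

topic_words = ["mean", "variance"]
centroid = mean_vector([toy_embeddings[w] for w in topic_words])

# Rank the whole vocabulary by similarity to the centroid
ranked = sorted(
    toy_embeddings,
    key=lambda w: cosine(toy_embeddings[w], centroid),
    reverse=True,
)
print(ranked)
```

In this toy vocabulary the word closest to the centroid is not one of the two input words, which mirrors how `similar_by_vector` can return a label such as "calculate" that is not in `top_words`.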


In [52]:
finetuned_model.build_vocab(corpus_with_bigrams_trigrams, update=True)  # Add the new words to the vocabulary
finetuned_model.train(corpus_with_bigrams_trigrams, total_examples=len(corpus_with_bigrams_trigrams), epochs=10)  # Fine-tune the model

(57851, 291540)

In [53]:
# Now you can use the updated model with embeddings that include domain-specific words
ft_word_embeddings = [finetuned_model.wv[word] for word in top_words]
ft_mean_vector = np.mean(ft_word_embeddings, axis=0)

ft_similar_words = finetuned_model.wv.similar_by_vector(ft_mean_vector, topn=5)
print(ft_similar_words)

ft_topic_label = ft_similar_words[0][0]
print(f"Representative word for the topic: {ft_topic_label}")

[('variation', 0.9998103976249695), ('calculation', 0.9998043179512024), ('estimation', 0.9997916221618652), ('computer-simulation', 0.9997856020927429), ('correlation', 0.9997814893722534)]
Representative word for the topic: variation


In [49]:
np.allclose(mean_vector, ft_mean_vector, atol=1e-4)

False

In [54]:
from gensim.models import FastText

custom_model = FastText(vector_size=100, window=3, min_count=1, sentences=corpus_with_bigrams_trigrams, epochs=10)

In [55]:
custom_embeddings = [custom_model.wv[word] for word in top_words]
custom_mean_vector = np.mean(custom_embeddings, axis=0)

similar_words = custom_model.wv.similar_by_vector(custom_mean_vector, topn=5)
print(similar_words)

custom_topic_label = similar_words[0][0]
print(f"Representative word for the topic: {custom_topic_label}")

[('distancetraveled', 0.999996542930603), ('projected', 0.9999964833259583), ('example_consider', 0.9999963641166687), ('thersystemanintroductionandoverview', 0.9999961853027344), ('mentioned', 0.9999961256980896)]
Representative word for the topic: distancetraveled


### Trying to assign a label to a topic using a pre-trained transformer by encoding the top 20 words in a topic 

#### Finetuned T5

In [1]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("Michau/t5-base-en-generate-headline")

In [4]:
# Top 30 words for mmds visualization of LDA model with 5 topics
top_words = ["value","function","datum","random","use","model","variable", "time","figure","example","number","plot","estimate","random_variable","lag","histogram","probability","series","mean","simulate","standard","sample","r","regression","follow","level","distribution","variance","x","pseudorandom_number"]
print(len(top_words))

30


In [20]:
# Function to generate a short topic label from a list of words
def generate_topic_label(top_words: list) -> str:
    
    input_string = "label these topics: " + " ".join(top_words)
    print(input_string)
    
    # Tokenize the input string
    encoding = tokenizer.encode(input_string, return_tensors="pt")
    
    # Generate the label using the model
    output = model.generate(encoding, max_length=5, num_beams=4, early_stopping=True)
    
    # Decode the output to get the label
    label = tokenizer.decode(output[0], skip_special_tokens=True)
    
    return label

In [21]:
topic_label = generate_topic_label(top_words)
print(f"Generated topic label: {topic_label}")

label these topics: value function datum random use model variable time figure example number plot estimate random_variable lag histogram probability series mean simulate standard sample r regression follow level distribution variance x pseudorandom_number




Generated topic label: 


#### Finetuned BART

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

mname = "cristian-popa/bart-tl-all"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

In [2]:
def generate_topic_label_with_BART(top_words: list[str]) -> str:
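    # NOTE: passing a list of strings makes the tokenizer treat every word as its own sequence,
    # so outputs[0] below is the label generated for the first word only.
    # To label the topic as a whole, pass a single string such as " ".join(top_words).
    enc = tokenizer(top_words, return_tensors="pt", truncation=True, padding="max_length", max_length=128)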
    outputs = model.generate(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        max_length=15,
        min_length=1,
        do_sample=False,
        num_beams=25,
        length_penalty=1.0,
        repetition_penalty=1.5
    )

    label = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return label

In [8]:
!nvidia-smi

Wed Oct 23 13:29:08 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.44                 Driver Version: 552.44         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   66C    P8             11W /   95W |      73MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [7]:
topic_label = generate_topic_label_with_BART(top_words)
print(f"Generated topic label: {topic_label}")

Generated topic label: rate of return


### Trying BERTopic to get topic info for the entire corpus

In [None]:
#!pip install BERTopic
# !pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.17.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tf_keras-2.17.0-py3-none-any.whl (1.7 MB)
Installing collected packages: tf-keras
Successfully installed tf-keras-2.17.0


In [52]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# `corpus` is a list of strings, one string per document.
# Split each document into 5 roughly equal parts so that BERTopic has more documents to cluster.
split_corpus = []

for doc in corpus:
    # Length of each part in characters (at least 1)
    part_length = max(1, len(doc) // 5)
    parts = [doc[i:i + part_length] for i in range(0, len(doc), part_length)]

    # Integer division can leave more than 5 parts; merge the excess into the 5th part
    while len(parts) > 5:
        last_part = parts.pop()
        parts[-1] += last_part

    # Add the parts to the split_corpus
    split_corpus.extend(parts)

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words='english')

# Initialize BERTopic model
topic_model = BERTopic(vectorizer_model=vectorizer_model)


# Fit the BERTopic model on your corpus and extract topics
topics, probabilities = topic_model.fit_transform(split_corpus)



False


In [55]:
print(topic_model.get_topic_info())
# topic_model.visualize_topics() does not work here because only one topic was found

   Topic  Count                                  Name  \
0     -1     40  -1_function_random_data_distribution   

                                      Representation  \
0  [function, random, data, distribution, example...   

                                 Representative_Docs  
0  [ying this in the inverse CDF method runs as f...  


#### Less than ideal results: 

- BERTopic does not work out of the box with a corpus of 8 documents (in this case, each chapter is one document as a string so the corpus is a list of 8 strings), so we need to split the 8 documents into 40 documents (by evenly splitting each doc into 5 docs).
- The output is only one "topic" with index -1. According to BERTopic documentation, topic ID -1 is for documents that "do not fit into any topics". All of our documents are assigned to this topic. 
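
One side effect of splitting by character count is that chunks can start or end in the middle of a word (the representative document above begins with "ying this in the inverse CDF method"). A possible alternative, sketched here with a hypothetical helper `split_into_chunks`, is to split on word boundaries instead. Note that this collapses runs of whitespace into single spaces.

```python
def split_into_chunks(text: str, n_chunks: int = 5) -> list[str]:
    """Split text into up to n_chunks pieces of near-equal word count."""
    words = text.split()
    if not words:
        return []
    n_chunks = min(n_chunks, len(words))  # never create empty chunks
    base, extra = divmod(len(words), n_chunks)

    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional word
        size = base + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks


# Made-up sentence, used only to show the behaviour
example = ("the inverse CDF method draws a uniform number and maps it "
           "through the quantile function")
chunks = split_into_chunks(example, n_chunks=5)
for chunk in chunks:
    print(repr(chunk))
```

Applied to the notebook's data, this would amount to something like `split_corpus = [c for doc in corpus for c in split_into_chunks(doc)]`. Whether it changes the BERTopic result has not been tested here.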

#### Trying BERTopic on complete modules corpus

In [68]:
from bertopic import BERTopic

topic_model = BERTopic()        # Default arguments as used on the website: https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html
topics, probabilities = topic_model.fit_transform(corpus)




In [69]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,18,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 580\n\nModeling and Simulation ...


In [16]:
topic_model.get_document_info(corpus)

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,UC\nDa...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
1,Lecture 7: Functional-style programming and\nH...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
2,Data Structures and\nAlgorithms\n\nUBCO Master...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
3,UC\nPy...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
4,UC\nSQ...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
5,Version Control\n\nUBCO Master of Data Science...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
6,Data Profiling and\nCleaning\nHandling Missing...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
7,Completely Randomized Designs (CRD)\n ...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
8,"551 Lec 5 - Tables, styling, performance\nYou ...",-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False
9,Moving beyond linearity in response\n\n ...,-1,-1_the_of_to_is,"[the, of, to, is, and, in, data, for, we, that]",[ DATA 581\n\nModeling and Simulati...,the - of - to - is - and - in - data - for - w...,0.0,False


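Still terrible results: every document is again assigned to the outlier topic -1, and I am not yet sure what I'm doing wrong. One possible cause is the corpus size: BERTopic's default `min_topic_size` is 10, which leaves little room for more than one cluster among 18 documents.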

### Checking if it works in general

In [5]:
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

In [10]:
print(len(docs))
print(type(docs))
print(docs[0][:100])
print(type(docs[0]))

18846
<class 'list'>


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about 
<class 'str'>


In [9]:
print(len(docs[0]))

712


In [11]:
topics, probs = topic_model.fit_transform(docs)

In [12]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,6596,-1_to_the_of_and,"[to, the, of, and, is, for, in, you, it, that]",[\nProbably because it IS rape.\n\n\nSo nothin...
1,0,1832,0_game_team_games_he,"[game, team, games, he, players, season, hocke...","[\nWales Conference, Adams Division, Semifinal..."
2,1,616,1_key_clipper_chip_encryption,"[key, clipper, chip, encryption, keys, escrow,...",[The following document summarizes the Clipper...
3,2,464,2_israel_israeli_jews_arab,"[israel, israeli, jews, arab, jewish, arabs, p...","[\n\n""Assuming""? Also: come on, Brad. If we ar..."
4,3,451,3_ites_cheek_yep_huh,"[ites, cheek, yep, huh, ken, , , , , ]","[Ken\n, \nYep.\n, ites:]"
...,...,...,...,...,...
211,210,10,210_oil_lights_indicators_service,"[oil, lights, indicators, service, reset, indi...",[Derek....\n\nThere is a tool available to res...
212,211,10,211_needles_acupuncture_needle_syringe,"[needles, acupuncture, needle, syringe, hypode...",[\nIt is illegal to perform acupuncture with u...
213,212,10,212_alarm_sensor_alarms_shock,"[alarm, sensor, alarms, shock, car, viper, alp...",[Just found a great deal on a Clifford Delta c...
214,213,10,213_religion_supreme_arf_definition,"[religion, supreme, arf, definition, belief, l...",[\n .\n It's my understanding that ...
