# An Exploration of UW Pharmacy Student Reflections

Author: Michael Florip (michaelflorip@berkeley.edu)

## Experimenting with topicwizard
Medium Link: https://medium.com/@power.up1163/visualizing-topic-models-with-topicwizard-ee5b4428405e

The code is from the Medium article, with minor adjustments for my own experimentation.

### Install topic-wizard from PyPI

In [71]:
#%pip install topic-wizard

### Loading a corpus to train our topic model
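A corpus is a collection of authentic text or audio organized into a dataset. Here the corpus is the 20 Newsgroups dataset, loaded through scikit-learn; 'subset="all"' loads both the training and test splits.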

In [72]:
from sklearn.datasets import fetch_20newsgroups

corpus = fetch_20newsgroups(subset="all").data

### Set up the topic modeling pipeline
1. Use 'Pipeline' from sklearn to chain the steps, starting with the conversion of text data into a bag-of-words representation
2. Apply Non-negative Matrix Factorization (NMF) to that representation for topic modeling

In [73]:
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

### Cut low and high frequency words and filter out English stopwords
> * 'min_df = 10' means a word must appear in at least 10 documents to be included in the vocabulary, which helps remove rare words
> * 'max_df = 0.5' means a word that appears in more than 50% of documents will be excluded, which is useful for removing very common words that are too frequent to be informative (beyond the stop words)
> * 'stop_words = "english"' tells the vectorizer to remove common English stop words like "the", "is", "in", etc.

In [74]:
# Transforms text documents into a matrix of token counts
vectorizer = CountVectorizer(min_df=10, max_df=0.5, stop_words="english")

### Create a topic model
> * NMF is used for dimensionality reduction; it decomposes the high-dimensional bag-of-words matrix into two lower-dimensional matrices, one representing the documents and topics, the other representing the topics and words
> * 'n_components=10' specifies the number of topics (10)
> * NMF factorizes the original matrix (V) into two matrices (W, H) such that V ≈ WH, with the constraint that all three matrices have no negative elements (this makes W and H easier to inspect and lets them represent additive combinations of the original features)
> * W (Document-Topic Matrix) - each row corresponds to a document, each column represents a topic. The values indicate the association strength of each *document to the topics*
> * H (Topic-Term Matrix) - each row corresponds to a topic, each column represents a word in the vocabulary. The values indicate the association strength of *words to the topics*

In [75]:
# We create a topic model with ten topics
topic_model = NMF(n_components=10)
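As a quick check on the shapes of W and H, here is a minimal sketch that factorizes a small made-up matrix. The matrix 'V', the choice of 2 topics, and the 'init', 'max_iter', and 'random_state' settings are assumptions for illustration only; the notebook's model uses the defaults with 10 topics.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical count matrix V: 6 documents x 5 words, all entries non-negative
rng = np.random.default_rng(0)
V = rng.integers(0, 5, size=(6, 5)).astype(float)

toy_nmf = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = toy_nmf.fit_transform(V)  # document-topic matrix, one row per document
H = toy_nmf.components_       # topic-term matrix, one row per topic

print(W.shape, H.shape)
# W @ H only approximates V; this is the size of the leftover error
print(np.linalg.norm(V - W @ H))
```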

### Pipeline Setup
> * The text data first goes through the 'CountVectorizer', stored as the variable 'vectorizer', to be transformed into a matrix of token counts
> * Then, this matrix is fed into the NMF topic model to perform topic modeling

In [76]:
# Then we set up a pipeline
topic_pipeline = Pipeline(
    [
        ("vectorizer", vectorizer),
        ("topic_model", topic_model),
    ]
)
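A minimal sketch of the same two-step pipeline on a made-up corpus shows what comes out the other end. The documents, the 2-topic setting, and the extra NMF arguments are my own assumptions, chosen so the example runs instantly.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two obvious themes
toy_docs = [
    "rocket launch orbit satellite",
    "satellite orbit telescope rocket",
    "recipe oven bake bread",
    "bread recipe flour oven",
]

toy_pipeline = Pipeline(
    [
        ("vectorizer", CountVectorizer(stop_words="english")),
        ("topic_model", NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)),
    ]
)

# fit_transform runs the vectorizer, then NMF, and returns the document-topic matrix
doc_topic = toy_pipeline.fit_transform(toy_docs)
print(doc_topic.shape)                 # (number of documents, number of topics)
print(list(toy_pipeline.named_steps))  # step names, in the order they run
```

Each step can be retrieved by name afterwards, for example 'toy_pipeline.named_steps["topic_model"]'.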

### Training the model
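> * '.fit' first fits the vectorizer on the corpus, then trains the NMF model on the resulting count matrix to identify 10 topics based on the patterns of word occurrences across documents
> * After fitting, the H matrix (the word weights for each topic) is available as 'topic_model.components_'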

In [77]:
topic_pipeline.fit(corpus)

### Visualize the results

In [78]:
%pip install topic-wizard

In [79]:
#%pip install --upgrade numpy
%pip install --upgrade scikit-learn
# Restart the kernel after upgrading so the new scikit-learn version is the one imported

In [80]:
import topicwizard

topicwizard.visualize(pipeline=topic_pipeline, corpus=corpus)

AttributeError: module 'sklearn.metrics._dist_metrics' has no attribute 'DistanceMetric64'

In [81]:
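# Needs the import in the previous cell to have succeeded; the NameError below
# means that import failed, so the name 'topicwizard' was never bound
topicwizard.load(filename="topic_data.joblib")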

NameError: name 'topicwizard' is not defined
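
The AttributeError refers to an internal scikit-learn module, which may point to mismatched package versions, for example if scikit-learn was upgraded while an older copy was still loaded in the running kernel. That is my assumption, not a confirmed diagnosis. A minimal sketch for checking which versions are installed, using a hypothetical helper 'installed_versions':

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["scikit-learn", "numpy", "topic-wizard"])
for name, installed in versions.items():
    print(f"{name}: {installed or 'not installed'}")
```

Comparing these against the versions topic-wizard declares as requirements, then restarting the kernel before importing again, is one place to start.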