# An Exploration of UW Pharmacy Student Reflections

Author: Michael Florip (michaelflorip@berkeley.edu)

## Experimenting with topicwizard
Medium Link: https://medium.com/@power.up1163/visualizing-topic-models-with-topicwizard-ee5b4428405e

The code is from the Medium article, with minor adjustments for my own experimentation.

### Install topic-wizard from PyPI

In [44]:
%pip install topic-wizard

Collecting topic-wizard
  Obtaining dependency information for topic-wizard from https://files.pythonhosted.org/packages/0c/53/ed1062e9e51aab0632cf0626046cd07671483082d6bcc6566ed0c4f11bbe/topic_wizard-0.5.0-py3-none-any.whl.metadata
  Downloading topic_wizard-0.5.0-py3-none-any.whl.metadata (9.3 kB)
Collecting Pillow<11.0.0,>=10.1.0 (from topic-wizard)
  Obtaining dependency information for Pillow<11.0.0,>=10.1.0 from https://files.pythonhosted.org/packages/4a/92/a6eb4a8210d3597897ddf2d6af37898eb74e116bd2c6d2bcd9ac4080ebb5/pillow-10.2.0-cp310-cp310-macosx_10_10_x86_64.whl.metadata
  Downloading pillow-10.2.0-cp310-cp310-macosx_10_10_x86_64.whl.metadata (9.7 kB)
Collecting dash<3.0.0,>=2.7.1 (from topic-wizard)
  Obtaining dependency information for dash<3.0.0,>=2.7.1 from https://files.pythonhosted.org/packages/b2/10/388c4a697275417a6974033e6ea7235d61e648e6c39d9cc06fcc6a6f71d4/dash-2.15.0-py3-none-any.whl.metadata
  Downloading dash-2.15.0-py3-none-any.whl.metadata (11 kB)
Collecting d

  Downloading pynndescent-0.5.11-py3-none-any.whl.metadata (6.8 kB)
Collecting Werkzeug<3.1 (from dash<3.0.0,>=2.7.1->topic-wizard)
  Obtaining dependency information for Werkzeug<3.1 from https://files.pythonhosted.org/packages/c3/fc/254c3e9b5feb89ff5b9076a23218dafbc99c96ac5941e900b71206e6313b/werkzeug-3.0.1-py3-none-any.whl.metadata
  Downloading werkzeug-3.0.1-py3-none-any.whl.metadata (4.1 kB)
Collecting itsdangerous>=2.1.2 (from Flask<3.1,>=1.0.4->dash<3.0.0,>=2.7.1->topic-wizard)
  Using cached itsdangerous-2.1.2-py3-none-any.whl (15 kB)
Collecting blinker>=1.6.2 (from Flask<3.1,>=1.0.4->dash<3.0.0,>=2.7.1->topic-wizard)
  Obtaining dependency information for blinker>=1.6.2 from https://files.pythonhosted.org/packages/fa/2a/7f3714cbc6356a0efec525ce7a0613d581072ed6eb53eb7b9754f33db807/blinker-1.7.0-py3-none-any.whl.metadata
  Downloading blinker-1.7.0-py3-none-any.whl.metadata (1.9 kB)
Collecting cachelib<0.10.0,>=0.9.0 (from Flask-Caching<3.0.0,>=2.1.0->dash-extensions<2.0.0,>=1.

Installing collected packages: editorconfig, dataclass-wizard, dash-table, dash-mantine-components, dash-iconify, dash-html-components, dash-core-components, Werkzeug, tenacity, retrying, Pillow, more-itertools, llvmlite, jsbeautifier, joblib, itsdangerous, cachelib, blinker, scikit-learn, plotly, pandas, numba, Flask, wordcloud, pynndescent, Flask-Caching, dash, umap-learn, dash-extensions, topic-wizard
  Attempting uninstall: Werkzeug
    Found existing installation: Werkzeug 2.3.7
    Uninstalling Werkzeug-2.3.7:
      Successfully uninstalled Werkzeug-2.3.7
  Attempting uninstall: Pillow
    Found existing installation: Pillow 9.2.0
    Uninstalling Pillow-9.2.0:
      Successfully uninstalled Pillow-9.2.0
  Attempting uninstall: joblib
    Found existing installation: joblib 1.1.1
    Uninstalling joblib-1.1.1:
      Successfully uninstalled joblib-1.1.1
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.1.3
    Uninstalling scikit-learn-1.1.3:
  

### Loading a corpus to train our topic model
A corpus is a collection of authentic text or audio organized into a dataset. Here the corpus is the 20 Newsgroups dataset, which scikit-learn downloads on first use.

In [24]:
from sklearn.datasets import fetch_20newsgroups

corpus = fetch_20newsgroups(subset="all").data

### Set up the topic modeling pipeline
1. Using 'Pipeline' from sklearn to streamline the process of converting text data into a bag-of-words representation
2. Applying Non-negative Matrix Factorization (NMF) for topic modeling

In [25]:
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

### Cut low and high frequency words and filter out English stopwords
> * 'min_df = 10' means a word must appear in at least 10 documents to be included in the vocabulary, which helps remove rare words
> * 'max_df = 0.5' means a word that appears in more than 50% of documents will be excluded, which is useful for removing very common words that are too frequent to be informative (beyond the stop words)
> * 'stop_words = "english"' tells the vectorizer to remove common English stop words like "the", "is", "in", etc.

In [26]:
# Transforms text documents into matrix of token counts
vectorizer = CountVectorizer(min_df=10, max_df=0.5, stop_words="english")

### Create a topic model
> * NMF is used for dimensionality reduction; it decomposes the high-dimensional bag-of-words matrix into two lower-dimensional matrices, one representing the documents and topics, the other representing the topics and words
> * 'n_components=10' specifies the number of topics (10)
> * NMF approximately factorizes the original matrix (V) into two matrices (W, H) such that V ≈ WH, with the constraint that all three matrices have no negative elements (this makes W and H easier to inspect and lets them represent additive combinations of the original features)
> * W (Document-Topic Matrix) - each row corresponds to a document, each column represents a topic. The values indicate the association strength of each *document to the topics*
> * H (Topic-Term Matrix) - each row corresponds to a topic, each column represents a word in the vocabulary. The values indicate the association strength of *words to the topics*

In [27]:
# We create a topic model with ten topics
topic_model = NMF(n_components=10)
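
The shapes of W and H are easier to follow on a small matrix. The sketch below factorizes a made-up 4 x 5 count matrix into two topics. The matrix `V`, the names `toy_nmf`, `W`, and `H`, and the `init`, `random_state`, and `max_iter` settings are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical document-term count matrix V: 4 documents x 5 words
V = np.array(
    [
        [3, 2, 0, 0, 1],
        [4, 1, 0, 0, 0],
        [0, 0, 2, 3, 1],
        [0, 0, 4, 1, 0],
    ],
    dtype=float,
)

# Two topics; a fixed random_state keeps the run repeatable
toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)  # document-topic matrix, shape (4, 2)
H = toy_nmf.components_       # topic-term matrix, shape (2, 5)

print(W.shape, H.shape)
print(np.round(W @ H, 1))     # approximate reconstruction of V
```

The product `W @ H` only approximates `V`; how close it gets depends on the number of topics.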

### Pipeline Setup
> * The text data first goes through the 'CountVectorizer', stored as the variable 'vectorizer', to be transformed into a matrix of token counts
> * Then, this matrix is fed into the NMF topic model to perform topic modeling

In [28]:
# Then we set up a pipeline
topic_pipeline = Pipeline(
    [
        ("vectorizer", vectorizer),
        ("topic_model", topic_model),
    ]
)
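
As a rough check of what the pipeline does, the sketch below runs the two steps by hand and compares the result with a pipeline built from identically configured steps. The toy documents and all variable names here are hypothetical, and `init` and `random_state` are fixed only so the two runs are comparable.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "the rocket launch reached orbit",
    "nasa plans a new rocket launch",
    "the goalie stopped the puck",
    "hockey fans cheered the goalie",
]

# Manual version: vectorize first, then factorize the counts
manual_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)
manual_doc_topic = NMF(
    n_components=2, init="nndsvda", random_state=0, max_iter=500
).fit_transform(manual_counts)

# Pipeline version: the same two steps chained together
toy_pipeline = Pipeline(
    [
        ("vectorizer", CountVectorizer(stop_words="english")),
        ("topic_model", NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)),
    ]
)
piped_doc_topic = toy_pipeline.fit_transform(toy_docs)

# Both routes should give the same document-topic matrix
print(np.allclose(manual_doc_topic, piped_doc_topic))
```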

### Training the model
> * '.fit' first fits the vectorizer on the corpus, then trains the NMF model to identify 10 topics within the text based on the patterns of word occurrences across documents
> * The H matrix (available as 'components_' on the fitted NMF model) will have the word weights for each topic

In [29]:
topic_pipeline.fit(corpus)

### Visualize the results

In [42]:
import topicwizard

# Launch the topicwizard web app on the fitted pipeline and the corpus
topicwizard.visualize(pipeline=topic_pipeline, corpus=corpus)

In [43]:
import topicwizard

# Load previously saved topic data (assumes "topic_data.joblib" exists in the working directory)
topicwizard.load(filename="topic_data.joblib")