<a href="https://colab.research.google.com/github/banned-books/project_banned_books/blob/main/unsupervised_topic_modeling/BERTopic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Mount GDrive & Bring in Custom BERTopic Evaluation functions

In [1]:
from google.colab import drive
drive.mount('/content/drive')

# Bring in our custom BERTopic evaluation methods notebook and functions
%run '/content/drive/MyDrive/Colab Notebooks/unsupervised_topic_modeling/Evaluation_Functions_BERTopic_Modeling.ipynb'

Mounted at /content/drive
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
#  Run this locale override if you hit the occasional Colab error:
#  NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968.
import locale
locale.getpreferredencoding = lambda: "UTF-8"

## Import Libraries

In [None]:
# Import data cleaning & manipulation libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
import re
from itertools import combinations

# Import data visualization libraries
import matplotlib
import matplotlib.pyplot as plt
import scattertext as st
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'colab'
import chart_studio.plotly as py
import chart_studio.tools as tls

# Dimensionality reduction (UMAP) and clustering (HDBSCAN)
from umap import UMAP
from hdbscan import HDBSCAN

# Import libraries for NLP work
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from scipy import linalg
from scipy.cluster import hierarchy as sch
import gensim
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel
import nltk
nltk.download('punkt')
nltk.download('stopwords')

%matplotlib inline
np.set_printoptions(suppress=True)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Import Cleaned Topic Modeling Data on Banned Books & Amazon.com Reviews




### Banned Books Meta Dataset

In [None]:
df = pd.read_pickle('/content/drive/MyDrive/Colab Notebooks/data/cleaned_trained_data/clean_topic_modeling.pkl')
df = df.drop_duplicates('title')
df.head(1)

Unnamed: 0,goodreads_image_url,title,author,goodreads_published_date,goodreads_description,goodreads_tags,type_of_ban,state,district,ban_date,...,goodreads_product_url,amazon_url,secondary_authors,illustrators,translators,text,Tokens,Joined_Tokens,top_10_words,parse
0,https://images-na.ssl-images-amazon.com/images...,The Haters,"Andrews, Jesse",2016-04-07,From Jesse Andrews author of the New York Tim...,"young adult, contemporary, music, fiction, rea...",Banned Pending Investigation,Michigan,Rochester Community Schools District,April 2022,...,https://www.goodreads.com/book/show/26095121-t...,https://www.amazon.com/Haters-Jesse-Andrews/pr...,,,,"the haters young adult, contemporary, music, f...","[haters, young, adult, contemporary, music, fi...",haters young adult contemporary music fiction ...,"[haters, road, trip, music, humor, contemporar...","(haters, young, adult, contemporary, music, fi..."


### Amazon.com Reviews Dataset

We broke this data down into **one year of Amazon reviews before the first book ban** and **one year of Amazon reviews after the first book ban**.
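The split itself happens upstream of this notebook, so the code for it is not shown here. As a rough sketch of the idea only, the snippet below divides dated reviews into a one-year window before and a one-year window after a cutoff date. The function name, the cutoff date, and the toy reviews are all hypothetical placeholders, not the project's actual values.

```python
from datetime import date, timedelta


def split_reviews_by_ban(reviews, first_ban, window_days=365):
    """Split (review_date, text) pairs into windows before and after first_ban."""
    window = timedelta(days=window_days)
    before = [r for r in reviews if first_ban - window <= r[0] < first_ban]
    after = [r for r in reviews if first_ban <= r[0] < first_ban + window]
    return before, after


# Hypothetical cutoff and toy reviews, for illustration only
first_ban = date(2021, 7, 1)
reviews = [
    (date(2020, 1, 15), "outside both windows"),
    (date(2021, 5, 20), "inside the year before"),
    (date(2022, 3, 12), "inside the year after"),
]

before, after = split_reviews_by_ban(reviews, first_ban)
print(len(before), len(after))
```

Reviews that fall outside both windows are dropped, which is one reasonable reading of "one year before" and "one year after".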

In [None]:
before_amz_df = pd.read_pickle('/content/drive/MyDrive/Colab Notebooks/data/cleaned_trained_data/before_ban_amazon_review_data.pkl')
before_amz_df.head(1)

Unnamed: 0,review_date,reviews,pre_process
0,2021-06-06,"An Engrossing Page Turner About Race, Class an...",an engrossing page turner about race class an...


In [None]:
after_amz_df = pd.read_pickle('/content/drive/MyDrive/Colab Notebooks/data/cleaned_trained_data/after_ban_amazon_review_data.pkl')
after_amz_df.head(1)

Unnamed: 0,review_date,reviews,pre_process
2,2022-09-12,Wow I ordered this book for my teenage daughte...,wow i ordered this book for my teenage daughte...


# BERTopic Modeling | Model Training (and Hyper-Tuning) on Banned Books Metadata


## Hypertune BERTopic parameters

We created evaluation methods to tune and evaluate our BERTopic model, and imported those functions into this notebook. You can find that notebook at `Evaluation_Functions_BERTopic_Modeling.ipynb`.
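The imported helpers are defined in that separate notebook and are not reproduced here. To give a feel for what the two topic diversity metrics measure, here is a minimal sketch on toy word lists. The function names and the toy topics are hypothetical, and this is not the project's implementation of `proportion_unique_words` or `pairwise_jaccard_diversity`.

```python
from itertools import combinations


def unique_word_proportion(topics):
    """Share of distinct words among all top words across topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


def mean_pairwise_jaccard_distance(topics):
    """Average of 1 - |A & B| / |A | B| over all pairs of topics."""
    distances = []
    for a, b in combinations(topics, 2):
        a, b = set(a), set(b)
        distances.append(1 - len(a & b) / len(a | b))
    return sum(distances) / len(distances)


# Toy topics, each represented by its top 3 words (illustration only)
toy_topics = [
    ["lgbt", "fiction", "books"],
    ["race", "nonfiction", "justice"],
    ["fiction", "romance", "adult"],
]

print(unique_word_proportion(toy_topics))
print(mean_pairwise_jaccard_distance(toy_topics))
```

Both values approach 1.0 when topics share few top words, which is why scores close to 1 are read as high topic diversity.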

In [None]:
coherence_lst = []
perplexity_lst = []
proportion_lst = []
jaccard_lst = []

for i in range(2, 20):

    # Convert training set to list of documents
    docs = df["Joined_Tokens"].drop_duplicates().to_list()

    # Train the BERTopic model
    vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
    representation_model = MaximalMarginalRelevance(diversity=0.20)
    embedding_model = SentenceTransformer('all-mpnet-base-v2')

    # Tune the model
    topic_model = BERTopic(
        representation_model = representation_model,
        vectorizer_model = vectorizer_model,
        embedding_model = embedding_model,
        language = 'english', 
        top_n_words = 3, # determined by the Term Score Decline Per Topic function below
        nr_topics = 'auto', # auto for HDBSCAN (suggested by the BERTopic author)
        min_topic_size = i,
        calculate_probabilities = True,
        verbose = True
    )

    topics, probs = topic_model.fit_transform(docs)

    # Calculate coherence and append to list
    coherence = calculate_coherence(docs, topics)
    coherence_lst.append(coherence)

    # Calculate perplexity and append to list
    perplexity = calculate_perplexity()
    perplexity_lst.append(perplexity[1])

    # Calculate the proportion of unique words (for topic diversity)
    metadata_proportion_of_diverse_topics = proportion_unique_words(topic_model, topk=3)
    proportion_lst.append(metadata_proportion_of_diverse_topics)

    # Calculate the average pairwise jaccard distance between topics
    metadata_average_pairwise_jaccard_distance_between_topics = pairwise_jaccard_diversity(topic_model, topk=3)
    jaccard_lst.append(metadata_average_pairwise_jaccard_distance_between_topics)


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:30:56,384 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:31:10,467 - BERTopic - Reduced dimensionality
2023-04-15 06:31:13,138 - BERTopic - Clustered reduced embeddings
2023-04-15 06:31:30,678 - BERTopic - Reduced number of topics from 199 to 79


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:31:34,726 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:31:42,013 - BERTopic - Reduced dimensionality
2023-04-15 06:31:42,833 - BERTopic - Clustered reduced embeddings
2023-04-15 06:31:51,305 - BERTopic - Reduced number of topics from 112 to 40


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:31:55,515 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:32:02,282 - BERTopic - Reduced dimensionality
2023-04-15 06:32:02,759 - BERTopic - Clustered reduced embeddings
2023-04-15 06:32:10,204 - BERTopic - Reduced number of topics from 83 to 36


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:32:14,194 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:32:20,836 - BERTopic - Reduced dimensionality
2023-04-15 06:32:21,153 - BERTopic - Clustered reduced embeddings
2023-04-15 06:32:27,461 - BERTopic - Reduced number of topics from 61 to 41


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:32:31,408 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:32:37,973 - BERTopic - Reduced dimensionality
2023-04-15 06:32:38,222 - BERTopic - Clustered reduced embeddings
2023-04-15 06:32:42,979 - BERTopic - Reduced number of topics from 48 to 29


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:32:47,386 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:32:54,023 - BERTopic - Reduced dimensionality
2023-04-15 06:32:54,266 - BERTopic - Clustered reduced embeddings
2023-04-15 06:32:59,392 - BERTopic - Reduced number of topics from 49 to 34


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:33:03,386 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:33:10,175 - BERTopic - Reduced dimensionality
2023-04-15 06:33:10,370 - BERTopic - Clustered reduced embeddings
2023-04-15 06:33:14,246 - BERTopic - Reduced number of topics from 36 to 27


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:33:18,136 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:33:24,672 - BERTopic - Reduced dimensionality
2023-04-15 06:33:24,876 - BERTopic - Clustered reduced embeddings
2023-04-15 06:33:28,895 - BERTopic - Reduced number of topics from 40 to 25


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:33:32,774 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:33:39,793 - BERTopic - Reduced dimensionality
2023-04-15 06:33:39,956 - BERTopic - Clustered reduced embeddings
2023-04-15 06:33:43,323 - BERTopic - Reduced number of topics from 34 to 24


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:33:47,227 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:33:53,818 - BERTopic - Reduced dimensionality
2023-04-15 06:33:54,005 - BERTopic - Clustered reduced embeddings
2023-04-15 06:33:57,250 - BERTopic - Reduced number of topics from 31 to 22


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:34:01,177 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:34:07,755 - BERTopic - Reduced dimensionality
2023-04-15 06:34:07,913 - BERTopic - Clustered reduced embeddings
2023-04-15 06:34:11,081 - BERTopic - Reduced number of topics from 31 to 22


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:34:15,106 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:34:21,580 - BERTopic - Reduced dimensionality
2023-04-15 06:34:21,736 - BERTopic - Clustered reduced embeddings
2023-04-15 06:34:25,313 - BERTopic - Reduced number of topics from 29 to 29


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:34:29,375 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:34:36,664 - BERTopic - Reduced dimensionality
2023-04-15 06:34:36,790 - BERTopic - Clustered reduced embeddings
2023-04-15 06:34:38,559 - BERTopic - Reduced number of topics from 19 to 11


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:34:42,580 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:34:49,287 - BERTopic - Reduced dimensionality
2023-04-15 06:34:49,416 - BERTopic - Clustered reduced embeddings
2023-04-15 06:34:51,436 - BERTopic - Reduced number of topics from 19 to 13


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:34:55,816 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:35:02,636 - BERTopic - Reduced dimensionality
2023-04-15 06:35:02,757 - BERTopic - Clustered reduced embeddings
2023-04-15 06:35:04,364 - BERTopic - Reduced number of topics from 18 to 8


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:35:08,152 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:35:14,687 - BERTopic - Reduced dimensionality
2023-04-15 06:35:14,818 - BERTopic - Clustered reduced embeddings
2023-04-15 06:35:17,041 - BERTopic - Reduced number of topics from 21 to 17


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:35:20,898 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:35:28,118 - BERTopic - Reduced dimensionality
2023-04-15 06:35:28,222 - BERTopic - Clustered reduced embeddings
2023-04-15 06:35:29,646 - BERTopic - Reduced number of topics from 11 to 11


Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:35:33,534 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:35:40,566 - BERTopic - Reduced dimensionality
2023-04-15 06:35:40,683 - BERTopic - Clustered reduced embeddings
2023-04-15 06:35:42,324 - BERTopic - Reduced number of topics from 17 to 12


## Find the number of terms per topic to display using `.visualize_term_rank(log_scale=True)`

Per BERTopic's documentation:

> Topics are represented by a number of words starting with the best representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative a word to the topic is. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point adding words to the topic representation only marginally increases the total c-TF-IDF score and would not be beneficial for its representation.

> To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. In other words, the position of the words (term rank), where the words with the highest c-TF-IDF score will have a rank of 1, will be put on the x-axis. Whereas the y-axis will be populated by the c-TF-IDF scores. The result is a visualization that shows you the decline of c-TF-IDF score when adding words to the topic representation. It allows you, using the elbow method, the select the best number of words in a topic.

We use this method to visualize the c-TF-IDF score decline. Below, the term score begins to drop off around `top_n_words=2` and levels out by the third term. We therefore set `top_n_words=3`, since showing more terms per topic would not improve the topic representations.
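The elbow is normally judged by eye from the plot. As a hedged illustration of the same idea in code, the sketch below walks down a list of descending scores and stops once the next drop becomes small relative to the top score. The function name, the threshold, and the toy scores are hypothetical and are not taken from BERTopic or from this project's results.

```python
def elbow_rank(scores, min_relative_drop=0.05):
    """Return how many top-ranked terms to keep.

    Scores must be sorted in descending order. The walk stops at the first
    rank where the drop to the next score is below min_relative_drop times
    the top score.
    """
    top = scores[0]
    for rank in range(1, len(scores)):
        drop = scores[rank - 1] - scores[rank]
        if drop < min_relative_drop * top:
            return rank
    return len(scores)


# Toy c-TF-IDF-like scores for one topic (illustration only)
toy_scores = [0.30, 0.18, 0.11, 0.10, 0.095, 0.09]
print(elbow_rank(toy_scores))
```

A fixed threshold is only one possible heuristic; it is worth checking the result against the plot rather than relying on it alone.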



In [None]:
topic_model.visualize_term_rank(log_scale=True)

## Calculate the `min_topic_size` parameter by checking coherence, perplexity, and topic diversity metrics for a range of `min_topic_size` parameters (along with subjectively verifying the interpretability of the topics to the human eye)

In [None]:
i_lst = list(range(2, 20))  # same min_topic_size values as the tuning loop

metadata_df = {
    'min_topic_size': i_lst,
    'coherence': coherence_lst,
    'perplexity': perplexity_lst,
    'topic_diversity_proportion': proportion_lst,
    'topic_diversity_jaccard': jaccard_lst
}

metadata_df = pd.DataFrame(metadata_df)
metadata_df.head()

Unnamed: 0,min_topic_size,coherence,perplexity,topic_diversity_proportion,topic_diversity_jaccard
0,2,0.86654,-7.161762,0.987179,0.999734
1,3,0.749495,-7.0896,1.0,1.0
2,4,0.717416,-7.100289,0.952381,0.998319
3,5,0.696587,-7.099599,0.933333,0.997821
4,6,0.630622,-7.165181,0.940476,0.997354


## Chart best evaluation scores for a range of `min_topic_size` values

In [None]:
fig = px.line(metadata_df, x="min_topic_size", y="coherence", title='Topic coherence per min_topic_size')
fig.show()

In [None]:
fig = px.line(metadata_df, x="min_topic_size", y="perplexity", title='Topic perplexity per min_topic_size')
fig.show()

In [None]:
fig = px.line(metadata_df, x="min_topic_size", y="topic_diversity_proportion", title='Proportion of unique words (for topic diversity) per min_topic_size')
fig.show()

In [None]:
fig = px.line(metadata_df, x="min_topic_size", y="topic_diversity_jaccard", title='Average pairwise jaccard distance (for topic diversity) between topics per min_topic_size')
fig.show()

# Train the Best Tuned BERTopic Model on the Banned Book Metadata

This is our best topic model for the banned book metadata. 

In [None]:
# Convert training set to list of documents
docs = df["Joined_Tokens"].to_list()

# Train the BERTopic model
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
representation_model = MaximalMarginalRelevance(diversity=0.20)
embedding_model = SentenceTransformer('all-mpnet-base-v2')

# Tune the model
topic_model = BERTopic(
    representation_model = representation_model,
    vectorizer_model = vectorizer_model,
    embedding_model = embedding_model,
    language = 'english', 
    top_n_words = 3, # determined by the term score decline plot above
    nr_topics = 'auto', # auto for HDBSCAN (suggested by the BERTopic author)
    min_topic_size = 4, # determined by hypertuning with evaluation metrics
    calculate_probabilities = True,
    verbose = True
)

topics, probs = topic_model.fit_transform(docs)

# Calculate coherence
coherence = calculate_coherence(docs, topics)

# Calculate perplexity
perplexity = calculate_perplexity()[1]

# Calculate the proportion of unique words (for topic diversity)
metadata_proportion_of_diverse_topics = proportion_unique_words(topic_model, topk=3)

# Calculate the average pairwise jaccard distance between topics
metadata_average_pairwise_jaccard_distance_between_topics = pairwise_jaccard_diversity(topic_model, topk=3)

Batches:   0%|          | 0/52 [00:00<?, ?it/s]

2023-04-15 06:35:46,351 - BERTopic - Transformed documents to Embeddings
2023-04-15 06:35:55,062 - BERTopic - Reduced dimensionality
2023-04-15 06:35:55,519 - BERTopic - Clustered reduced embeddings
2023-04-15 06:36:03,846 - BERTopic - Reduced number of topics from 80 to 62


### Let's evaluate our best trained model for our books metadata

In [None]:
metadata_df = {
    'min_topic_size': [4],
    'coherence': [coherence],
    'perplexity': [perplexity],
    'topic_diversity_proportion': [metadata_proportion_of_diverse_topics],
    'topic_diversity_jaccard': [metadata_average_pairwise_jaccard_distance_between_topics]
}

metadata_df = pd.DataFrame(metadata_df)
metadata_df.head()

Unnamed: 0,min_topic_size,coherence,perplexity,topic_diversity_proportion,topic_diversity_jaccard
0,4,0.754232,-7.099415,0.918033,0.998087


# View our Topic Modeling Results

## Topic Word Scores

In [None]:
fig = topic_model.visualize_barchart(
    top_n_topics=4, 
    n_words=3, 
    title='Banned Book Topic Word Scores',
    width=400  
)

fig.show()

# Hierarchical Clustering

In [None]:
hierarchical_topics = topic_model.hierarchical_topics(docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics, top_n_topics=4)

100%|██████████| 15/15 [00:00<00:00, 18.05it/s]


In [None]:
tree = topic_model.get_topic_tree(hierarchical_topics, tight_layout=True)
print(tree)

.
├─lgbt_realistic fiction_books
│ ├─lgbt_realistic fiction_books
│ │ ├─lgbt_realistic fiction_books
│ │ │ ├─■──mental health_adult lgbt_fiction mental ── Topic: 2
│ │ │ └─■──lgbt_realistic fiction_books ── Topic: 0
│ │ └─■──coming age_contemporary african_adult contemporary ── Topic: 11
│ └─■──adult sports_audiobook_contemporary ── Topic: 15
└─race_nonfiction_social justice
  ├─race_nonfiction_social justice
  │ ├─social justice_nonfiction_anti racist
  │ │ ├─social justice_nonfiction_sociology
  │ │ │ ├─■──colores_colores nuestra_multicultural ── Topic: 13
  │ │ │ └─■──social justice_anti racist_sociology ── Topic: 1
  │ │ └─nonfiction_pakistan_biography nonfiction
  │ │   ├─■──nonfiction_books biography_picture books ── Topic: 6
  │ │   └─■──pakistan_malala_nonfiction childrens ── Topic: 8
  │ └─childrens_cultural_muslims
  │   ├─muslims_islam_religion
  │   │ ├─muslims_islam_religion
  │   │ │ ├─■──family childrens_books family_lgbt ── Topic: 10
  │   │ │ └─■──muslims_islam_religio

# Let's check out our Document and Topic embeddings

In [None]:
# Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings)

# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings, topics=[0,1,2,3])

# Let's check out a similarity metric heatmap between topics

In [None]:
topic_model.visualize_heatmap()

# Let's graph how these topics have changed over time

**Per the PEN America / ALA Report**

> "Beginning in 2021, a range of individuals and groups sought to remove from schools books focused on issues of race or the history of slavery and racism, mirroring a campaign pushed by some legislators to pass educational gag orders—bills restricting discussion of these and other concepts in school classrooms and curricula. **Although the campaign to enact educational gag orders initially focused on misapplications of the academic term “critical race theory” to censor discussions of race and racism, over the past year, it morphed to include a heightened focus on LGBTQ+ issues and identities.**"

We can see this in our topics over time below (mapping the ban date and the frequency of topics over time). 
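To make the "frequency over time" idea concrete, the sketch below counts documents per month and topic. It assumes `ban_date` values are strings such as "April 2022", as in the preview row of the metadata above; the function name and the toy inputs are hypothetical. It is a simplified stand-in for the counting step only, since `topics_over_time` also recomputes topic representations for each time bin.

```python
from collections import Counter
from datetime import datetime


def monthly_topic_counts(ban_dates, topics, fmt="%B %Y"):
    """Count documents per (month, topic) pair."""
    counts = Counter()
    for raw_date, topic in zip(ban_dates, topics):
        # "April 2022" parses to the first day of that month
        month = datetime.strptime(raw_date, fmt).date()
        counts[(month, topic)] += 1
    return counts


# Toy ban dates and topic assignments (illustration only)
toy_dates = ["April 2022", "April 2022", "November 2021", "April 2022"]
toy_assignments = [0, 0, 1, 1]

counts = monthly_topic_counts(toy_dates, toy_assignments)
for (month, topic), n in sorted(counts.items()):
    print(month.isoformat(), topic, n)
```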




In [None]:
# Prepare timestamp data
timestamps = df['ban_date'].to_list()

# Create topics over time
topics_over_time = topic_model.topics_over_time(
    docs, 
    timestamps,
    nr_bins=20, 
    global_tuning=True, 
    evolution_tuning=True
)

# Clean up topic labels
topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 separator=", ")

topic_model.set_topic_labels(topic_labels)

12it [00:07,  1.67it/s]


In [None]:
fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=4, custom_labels=True)

# Move the location of the legend
fig.update_layout(legend=dict(yanchor="top", y=1, xanchor="left", x=0.009))

# Update titles and axes
fig.update_layout(
    title={
        'text': "Banned Book Topics Over Time",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Date of Book Ban",
    yaxis_title="Frequency",
)

fig.show()

# Save the best banned book model

In [None]:
topic_model.save("metadata_best_tm.pkl")

# BERTopic on Amazon.com Reviews

## One Year Before the First Book Ban

### Run Evaluations and Hyper Tune the BERTopic Model

In [None]:
# Set some parameters for the BERTopic model
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
representation_model = MaximalMarginalRelevance(diversity=0.10)

# Prepare data
before_amz_df = before_amz_df.dropna()
before_timestamps = before_amz_df['review_date'].to_list()
docs = before_amz_df['pre_process'].to_list()

# Train the before book ban Amazon reviews BERTopic model
topic_model = BERTopic(
    vectorizer_model = vectorizer_model, 
    representation_model = representation_model,
    language = 'english', 
    nr_topics ='auto', 
    n_gram_range = (1,2),
    min_topic_size = 10, 
    verbose=True
)

topics, probs = topic_model.fit_transform(docs)

Batches:   0%|          | 0/1172 [00:00<?, ?it/s]

2023-04-16 09:37:19,272 - BERTopic - Transformed documents to Embeddings
2023-04-16 09:37:38,668 - BERTopic - Reduced dimensionality
2023-04-16 09:37:41,380 - BERTopic - Clustered reduced embeddings
2023-04-16 09:38:26,578 - BERTopic - Reduced number of topics from 405 to 324


In [None]:
# Calculate coherence 
before_coherence = calculate_coherence(docs, topics)

# Calculate perplexity 
before_perplexity = calculate_perplexity()[1]

# Calculate the proportion of unique words (for topic diversity)
before_metadata_proportion_of_diverse_topics = proportion_unique_words(topic_model, topk=3)

# Calculate the average pairwise jaccard distance between topics (for topic diversity)
before_metadata_average_pairwise_jaccard_distance_between_topics = pairwise_jaccard_diversity(topic_model, topk=3)

In [None]:
before_df = {
    'min_topic_size': [10],
    'coherence': [before_coherence],
    'perplexity': [before_perplexity],
    'topic_diversity_proportion': [before_metadata_proportion_of_diverse_topics],
    'topic_diversity_jaccard': [before_metadata_average_pairwise_jaccard_distance_between_topics]
}

before_df = pd.DataFrame(before_df)
before_df.head()

Unnamed: 0,min_topic_size,coherence,perplexity,topic_diversity_proportion,topic_diversity_jaccard
0,10,0.558982,-9.729903,0.916667,0.990253


Let's see how many words we should show per topic.

In [None]:
topic_model.visualize_term_rank(log_scale=True)

### Let's view topic scores

In [None]:
fig = topic_model.visualize_barchart(
    top_n_topics=5, 
    n_words=2, 
    title='Before First Book Ban Amazon Reviews Topic Word Scores',
    width=400  
)

fig.show()

### Let's view the topics over time

In [None]:
# Create topics over time
topics_over_time = topic_model.topics_over_time(
    docs, 
    before_timestamps,
    nr_bins=20, 
    global_tuning=True, 
    evolution_tuning=True
)

# Clean up topic labels
topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 separator=", ")

topic_model.set_topic_labels(topic_labels)

20it [07:56, 23.83s/it]


In [None]:
# Visualize topics over time
fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=2, custom_labels=True, topics = [0, 1])

# Move the location of the legend
fig.update_layout(legend=dict(yanchor="top", y=1, xanchor="left", x=0.8))

# Update titles and axes
fig.update_layout(
    title={
        'text': "Amazon Reviews Topics Over Time (Before First Book Ban)",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Amazon.com Review Date",
    yaxis_title="Frequency",
)

fig.show()

### Save the Best Model re: Amazon Reviews on Banned Books (One Year Before First Book Ban)

In [None]:
topic_model.save("before_amazon_bert_tm.pkl")

## One Year After the First Book Ban

### Run Evaluations and Hyper Tune the BERTopic Model

In [None]:
# Set some parameters for the BERTopic model
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
representation_model = MaximalMarginalRelevance(diversity=0.10)

# Prepare data
after_amz_df = after_amz_df.dropna()
after_timestamps = after_amz_df['review_date'].to_list()
docs = after_amz_df['pre_process'].to_list()

# Train the after book ban Amazon reviews BERTopic model
topic_model = BERTopic(
    vectorizer_model = vectorizer_model, 
    representation_model = representation_model,
    language = 'english', 
    nr_topics ='auto', 
    n_gram_range = (1,2),
    min_topic_size = 10, 
    verbose=True
)

topics, probs = topic_model.fit_transform(docs)

Batches:   0%|          | 0/775 [00:00<?, ?it/s]

2023-04-15 09:20:11,056 - BERTopic - Transformed documents to Embeddings
2023-04-15 09:20:24,464 - BERTopic - Reduced dimensionality
2023-04-15 09:20:26,565 - BERTopic - Clustered reduced embeddings
2023-04-15 09:20:53,734 - BERTopic - Reduced number of topics from 273 to 208


In [None]:
# Calculate coherence 
after_coherence = calculate_coherence(docs, topics)

# Calculate perplexity 
after_perplexity = calculate_perplexity()[1]

# Calculate the proportion of unique words (for topic diversity)
after_metadata_proportion_of_diverse_topics = proportion_unique_words(topic_model, topk=3)

# Calculate the average pairwise jaccard distance between topics (for topic diversity)
after_metadata_average_pairwise_jaccard_distance_between_topics = pairwise_jaccard_diversity(topic_model, topk=3)

In [None]:
after_df = {
    'min_topic_size': [10],
    'coherence': [after_coherence],
    'perplexity': [after_perplexity],
    'topic_diversity_proportion': [after_metadata_proportion_of_diverse_topics],
    'topic_diversity_jaccard': [after_metadata_average_pairwise_jaccard_distance_between_topics]
}

after_df = pd.DataFrame(after_df)
after_df.head()

Unnamed: 0,min_topic_size,coherence,perplexity,topic_diversity_proportion,topic_diversity_jaccard
0,10,0.442687,-9.280038,0.956522,0.999481


Let's see how many words we should show per topic.

In [None]:
topic_model.visualize_term_rank(log_scale=True)

### Let's view topic scores

In [None]:
fig = topic_model.visualize_barchart(
    top_n_topics = 5, 
    n_words = 2, # Determined by the .visualize_term_rank() method
    title = 'After Book Ban Amazon Reviews Topic Word Scores',
    width = 400  
)

fig.show()

### Let's view the topics over time

In [None]:
# Create topics over time
topics_over_time = topic_model.topics_over_time(
    docs, 
    after_timestamps,
    nr_bins=20, 
    global_tuning=True, 
    evolution_tuning=True
)

# Clean up topic labels
topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 separator=", ")

topic_model.set_topic_labels(topic_labels)

20it [03:56, 11.82s/it]


In [None]:
# Visualize topics over time
fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=3, custom_labels=True, topics = [0, 1, 2, 3])

# Move the location of the legend
fig.update_layout(legend=dict(yanchor="top", y=1, xanchor="left", x=0.8))

# Update titles and axes
fig.update_layout(
    title={
        'text': "Amazon Reviews Topics Over Time (After First Book Ban)",
        'y':1,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Amazon.com Review Date",
    yaxis_title="Frequency",
)

fig.show()

### Save the Best Model re: Amazon Reviews on Banned Books (One Year After the First Book Ban)

In [None]:
topic_model.save("after_amazon_bert_tm.pkl")



