# Topic Modeling

Using BERTopic

## Set up environment

In [None]:
!pip install transformers
!pip install torch
!pip install datasets
!pip install bertopic[flair]

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd gdrive/My\ Drive/amicus-iv

Mounted at /content/gdrive
/content/gdrive/My Drive/amicus-iv


Saving locations -- change these for different models!

In [None]:
model_folder = 'topic-modeling/models/legalbert-RRamicus/'
output_folder = 'topic-modeling/output/legalbert-RRamicus/'
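
Both folders share the model name, so one option is to derive them from a single variable so they cannot drift apart. This is only a sketch: the helper name `make_folders` and its `base` argument are assumptions for illustration, not part of the notebook.

```python
import posixpath

def make_folders(model_name, base="topic-modeling"):
    """Return (model_folder, output_folder) for one model name, each ending in '/'."""
    # Joining with a trailing "" keeps the final slash that the notebook's paths rely on
    model_folder = posixpath.join(base, "models", model_name, "")
    output_folder = posixpath.join(base, "output", model_name, "")
    return model_folder, output_folder

model_folder, output_folder = make_folders("legalbert-RRamicus")
```

Changing the single argument then switches both locations at once.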

You'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

In [None]:
import pandas as pd
import numpy as np
from html import unescape

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import Counter

from transformers import AutoTokenizer
from datasets import load_dataset, load_metric, Dataset

from huggingface_hub import notebook_login

from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings

from sklearn.preprocessing import MinMaxScaler
from umap import UMAP
from typing import List
import hdbscan
import matplotlib.pyplot as plt

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Define similarity function

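We want to group topics based on how similar they are. This is an adaptation of BERTopic's `visualize_topics()` function that returns the 2D topic coordinates and cluster labels as a DataFrame instead of drawing a plot.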

In [None]:
def get_similar_topics(topic_model,
                     topics: List[int] = None,
                     top_n_topics: int = None,
                     width: int = 650,
                     height: int = 650):
    # Select topics based on top_n and topics args
    if topics is not None:
        topics = list(topics)
    elif top_n_topics is not None:
        topics = sorted(topic_model.get_topic_freq().Topic.to_list()[1:top_n_topics + 1])
    else:
        topics = sorted(list(topic_model.get_topics().keys()))

    # Extract topic words and their frequencies
    topic_list = sorted(topics)
    frequencies = [topic_model.topic_sizes[topic] for topic in topic_list]
    words = [" | ".join([word[0] for word in topic_model.get_topic(topic)[:5]]) for topic in topic_list]

    # seed
    np.random.seed(11)

    # Embed c-TF-IDF into 2D
    all_topics = sorted(list(topic_model.get_topics().keys()))
    indices = np.array([all_topics.index(topic) for topic in topics])
    embeddings = topic_model.c_tf_idf.toarray()[indices]
    embeddings = MinMaxScaler().fit_transform(embeddings)
    embeddings = UMAP(n_neighbors=2, n_components=2, metric='hellinger', random_state=42).fit_transform(embeddings)

    # cluster based on above
    labels = hdbscan.HDBSCAN(min_samples=1, min_cluster_size=3).fit_predict(embeddings)

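    # Collect the results in a DataFrame (no plot is drawn); the [1:] slices skip
    # the first row, which is the outlier topic -1 when all topics are used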
    df = pd.DataFrame({"x": embeddings[1:, 0], "y": embeddings[1:, 1], 'Label':labels[1:],
                       "Topic": topic_list[1:], "Words": words[1:], "Size": frequencies[1:]})
    return df
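
The returned frame has one row per topic with its 2D coordinates and an HDBSCAN cluster label, where HDBSCAN uses -1 for points it treats as noise. A minimal sketch of turning such rows into groups of related topics follows. The rows are made up rather than real model output, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up rows shaped like the output of get_similar_topics (not real results)
rows = [
    {"Topic": 0, "Label": 1},
    {"Topic": 1, "Label": 0},
    {"Topic": 2, "Label": 1},
    {"Topic": 3, "Label": -1},  # -1 = HDBSCAN noise, i.e. not assigned to any cluster
    {"Topic": 4, "Label": 0},
]

def group_by_label(rows):
    """Map each cluster label to the list of topic ids that received it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Label"]].append(row["Topic"])
    return dict(groups)

groups = group_by_label(rows)
```

Topics that land under the same label are candidates for being reviewed or merged together. Those under -1 were not close enough to any group.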

## Data

BERTopic's `fit_transform()` takes a list of documents (strings), so we need to set this up ourselves.
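
As a rough illustration of what "a list of documents" means here, the sketch below reads a `text` column into a plain list of non-empty strings. The in-memory CSV and the helper name `load_documents` are stand-ins for illustration. The notebook itself reads the file with pandas.

```python
import csv
import io

# Tiny stand-in for the CSV on Drive; only the 'text' column matters here
sample_csv = io.StringIO(
    "case,brief_party,text\n"
    "Rust v Sullivan,0,abortion battle conflict\n"
    "Rust v Sullivan,0,\n"
    "Rust v Sullivan,1,  center reproductive health \n"
)

def load_documents(handle, text_column="text"):
    """Return the non-empty values of text_column as a list of stripped strings."""
    docs = []
    for row in csv.DictReader(handle):
        text = (row.get(text_column) or "").strip()
        if text:  # skip rows with no text
            docs.append(text)
    return docs

docs = load_documents(sample_csv)
```

If rows are dropped this way, they should be dropped from the DataFrame too, so that the per-document topic assignments still line up with the rows when they are merged back later.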

## Option 0: Read in text from drive

I have saved a file on Google Drive called "data/amicus_text_clean_512.csv", which contains the result of following the steps of option 1 below. Since those steps produce the same results each time, we don't need to keep re-running them.

In [None]:
df = pd.read_csv('data/amicus_text_clean_512.csv')
df.head(5)

Unnamed: 0,case,id,brief,brief_party,text
0,Rust v Sullivan,861819857503,"Rust v Sullivan. Amici Brief for Respondent, b...",0,abortion battle conflict enumerated right life...
1,Rust v Sullivan,861819857503,"Rust v Sullivan. Amici Brief for Respondent, b...",0,center reproductive health technical ability p...
2,Rust v Sullivan,861819857503,"Rust v Sullivan. Amici Brief for Respondent, b...",0,erus since made recently fertilized ovum possi...
3,Rust v Sullivan,861819857503,"Rust v Sullivan. Amici Brief for Respondent, b...",0,breakthroughs may comprise invention new medic...
4,Rust v Sullivan,861819857503,"Rust v Sullivan. Amici Brief for Respondent, b...",0,ally approximately babies survived extreme sta...


# Part 0: Train BERTopic using fine-tuned transformer

Flair allows you to choose almost any 🤗 transformers model. Select any public model from the HF model hub and pass it to BERTopic.

In [None]:
model_checkpoint = 'repro-rights-amicus-briefs/legal-bert-base-uncased-finetuned-RRamicus'

## Training

So, we can use our fine-tuned model here! We use legal-bert-base-uncased fine-tuned on our reproductive rights amicus briefs. The variable names keep the `bbu_ft` (bert-base-uncased, fine-tuned) prefix.

Note that you have to make the model public in order to do this. 

**Only do this once! Skip to 'load saved model' section if this has already been completed**

Takes 9 minutes.

In [None]:
# init embeddings and model
bbu_ft_embed = TransformerDocumentEmbeddings(model_checkpoint)
bbu_ft_tm = BERTopic(embedding_model=bbu_ft_embed, language = 'english', calculate_probabilities=True, verbose=True)

Some weights of BertModel were not initialized from the model checkpoint at repro-rights-amicus-briefs/legal-bert-base-uncased-finetuned-RRamicus and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Fit the model to our data (9 mins)

In [None]:
# fit model
bbu_ft_topics, bbu_ft_probs = bbu_ft_tm.fit_transform(df['text'])

7590it [05:23, 23.49it/s]
2022-03-31 18:57:49,884 - BERTopic - Transformed documents to Embeddings
2022-03-31 18:58:42,419 - BERTopic - Reduced dimensionality with UMAP
2022-03-31 18:59:20,279 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Save model

In [None]:
bbu_ft_tm.save(model_folder + 'legalbert_rramicus')



## Extract Topics

Only do this once; once saved, skip to next section.

In [None]:
bbu_ft_freq = bbu_ft_tm.get_topic_info()
bbu_ft_freq.head(10)

Unnamed: 0,Topic,Count,Name
0,-1,1979,-1_abortion_court_health_state
1,0,316,0_respectfully_counsel_attorney_submitted
2,1,131,1_trimester_interest_pregnancy_state
3,2,101,2_minor_minors_parental_parents
4,3,88,3_madsen_speech_injunction_zone
5,4,87,4_federal_jurisdiction_appeal_declaratory
6,5,85,5_standing_thirdparty_assert_singleton
7,6,77,6_animus_class_interstate_travel
8,7,74,7_unborn_child_person_personhood
9,8,72,8_griswold_privacy_right_appellants


Save files -- make sure to change file names! 

* topics_model_name.csv = full list of words for each topic
* topic_classification_model_name.csv = topic classification for each paragraph in the data

In [None]:
# full list of topics
bbu_full_topics = bbu_ft_tm.get_topics()

#convert full topic dict to df and transpose
topics_df = pd.DataFrame(bbu_full_topics,
                         index=['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10'])\
                         .transpose()

# get just the word
topics_df = topics_df.applymap(lambda x: x[0])

# add col w/concatenated list
#topics_df['all_words'] = topics_df.apply(', '.join, axis=1) #insert at end
topics_df.insert(0, 'topic', topics_df.apply(', '.join, axis=1))

# remove indiv. word columns (word1,...,word10)
topics_df.drop(list(topics_df.filter(regex = 'word')), axis = 1, inplace = True)

# convert index to a column (this is the topic id)
topics_df.insert(0, 'topic_id', topics_df.index)

# add count frequency 
topic_ct = bbu_ft_freq[['Topic', 'Count']]
topics_df = topics_df.merge(topic_ct, how='left', left_on='topic_id', right_on='Topic')
topics_df.drop('Topic', axis=1, inplace=True)

# save
topics_df.to_csv(output_folder + 'topics_legalbert_rramicus.csv', index=False)
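
For reference, the reshaping above can be sketched without pandas: `get_topics()` returns a dict that maps each topic id to a list of `(word, score)` pairs, and each pair list is collapsed into one comma-separated string. The data below is made up, and `topics_to_rows` is a hypothetical helper.

```python
# Made-up topics in the shape returned by get_topics(): {topic_id: [(word, score), ...]}
full_topics = {
    -1: [("abortion", 0.02), ("court", 0.01)],
    0: [("minor", 0.05), ("parental", 0.04)],
}
doc_counts = {-1: 10, 0: 4}  # hypothetical number of documents per topic

def topics_to_rows(full_topics, doc_counts):
    """Build one row per topic: its id, its words joined by ', ', and its document count."""
    rows = []
    for topic_id, pairs in full_topics.items():
        words = ", ".join(word for word, _score in pairs)  # keep the word, drop the score
        rows.append({"topic_id": topic_id, "topic": words, "Count": doc_counts.get(topic_id)})
    return rows

topic_rows = topics_to_rows(full_topics, doc_counts)
```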

In [None]:
# classification by paragraph
topic_id = bbu_ft_freq[['Topic', 'Name']]
output_df = df.copy()
output_df['topic_id'] = bbu_ft_topics
output_df = output_df.merge(topic_id, how='left', left_on='topic_id', right_on='Topic')
output_df.drop('Topic',axis=1,inplace=True)
output_df.rename({'Name' : 'topic_name'},axis=1, inplace=True)
output_df.to_csv(output_folder + 'topic_classification_legalbert_rramicus.csv', index=False)

# Part 1: Load Saved Model

Run this code if the previous sections (training, extract topics) have already been run once

In [None]:
# read in model
bbu_ft_tm = BERTopic.load(model_folder + 'legalbert_rramicus')
# frequency of each topic
bbu_ft_freq = bbu_ft_tm.get_topic_info()
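# full topics (named differently from bbu_ft_topics, which holds the
# per-document topic assignments returned by fit_transform)
bbu_full_topics = bbu_ft_tm.get_topics()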
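
The "train once, then load" rule can also be enforced in code rather than by skipping cells by hand. A minimal sketch, assuming the model is saved as a single file at a known path: `load_or_train` and the stand-in callables are hypothetical, and in the notebook they would wrap `BERTopic.load` and the training cells.

```python
import os
import tempfile

def load_or_train(path, load_fn, train_fn):
    """Call load_fn(path) if something is saved at path, otherwise call train_fn()."""
    if os.path.exists(path):
        return load_fn(path), "loaded"
    return train_fn(), "trained"

# Demonstration with stand-in callables and a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    saved = os.path.join(tmp, "legalbert_rramicus")
    _, first = load_or_train(saved, load_fn=lambda p: "model", train_fn=lambda: "model")
    open(saved, "w").close()  # pretend the trained model has now been saved
    _, second = load_or_train(saved, load_fn=lambda p: "model", train_fn=lambda: "model")
```

The first call trains because nothing is saved yet, and the second call loads.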

## Investigate topics

Get similar topics to a word

In [None]:
similar_topics, similarity = bbu_ft_tm.find_topics("physician", top_n=5)
print(similar_topics)
bbu_ft_tm.get_topic(similar_topics[1])

[221, 308, 30, 1, 10]


[('adolescents', 0.010568672251375847),
 ('national', 0.008347920650109343),
 ('bolton', 0.007635618346580972),
 ('nonhospital', 0.007278862325871135),
 ('health', 0.007069897410251013),
 ('organization', 0.006916993516579048),
 ('black', 0.0064672655865262415),
 ('censors', 0.006422766222265313),
 ('leaflet', 0.005916714514360049),
 ('doe', 0.005912550652236706)]

In [None]:
representative_docs = bbu_ft_tm.get_representative_docs(85)
representative_docs

["2003, pub. l. no. 105, § 2 ( 14 ) ( g ), 117 stat. 1201, 1205 ( “ in addition promoting maternal health, prohibition draw bright line clearly distinguishes abortion infanticide, preserves integrity medical profession, promotes respect human life ” ). kinds concerns, bearing matters moral consequence, protection persons injuries suffered hands other, private persons, concerns traditionally fallen within police powers state local government. traditionally, exclusively. civil rights acts fourteenth amendment, example, authority congress flexed protect people minority races assaults hands private thugs varying degrees organization. could assembled cohorts ku klux klan, members lynching party brought forth ad hoc way, private persons collusion authorities. see, e. g., * 11 united states v. price, 383 u. s. 787 ( 1966 ) ( applying federal statute law enforcement officials killed three civil rights workers ). difficulty civil rights analogy, though, context civil rights act, congress reachi

## Visualize Topics

In [None]:
bbu_ft_tm.visualize_topics()

## Topics per class

We can divide up the topics into those that appear in one class vs the other (fem briefs and opp briefs)

In [None]:
# bbu_ft_topics must be the per-document topic assignments returned by fit_transform
topics_per_class = bbu_ft_tm.topics_per_class(df['text'].tolist(), bbu_ft_topics, df['brief_party'].tolist())
topics_per_class.head(10)

In [None]:
#fem_brief_bbu_topics = topics_per_class[topics_per_class['Class']==1].drop(['Class'],axis=1,inplace=False)
bbu_ft_tm.visualize_topics_per_class(topics_per_class, top_n_topics=5, normalize_frequency=True)
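
At its core, a per-class breakdown counts how often each topic is assigned within each class. The sketch below covers only that counting step, with made-up assignments. BERTopic's own `topics_per_class` does more than this, and `count_topics_per_class` is a hypothetical helper.

```python
from collections import Counter

# Made-up per-document topic assignments and class labels (0 = opp, 1 = fem)
doc_topics = [0, 0, 1, 2, 1, 0]
doc_classes = [0, 1, 1, 0, 1, 0]

def count_topics_per_class(doc_topics, doc_classes):
    """Count documents for every (class, topic) pair."""
    return Counter(zip(doc_classes, doc_topics))

class_topic_counts = count_topics_per_class(doc_topics, doc_classes)
```

A pair that never occurs, such as topic 1 in class 0 here, simply has a count of zero.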

## Reduce n topics

This is a manual decision -- skip for now, this will be done later after we select a model! 

In [None]:
#new_topics, new_probs = topic_model.reduce_topics(df['text'], topics, probs, nr_topics=60)

## Topic hierarchy

Another way to visually examine how topics are related to one another. From looking at this, I think it would make more sense to topic model the fem and opp briefs separately, since they often use similar language and topics but are articulating very different points about them!

In [None]:
bbu_ft_tm.visualize_hierarchy(top_n_topics=25)

## Topic Similarity

Each topic has a vector representation, from c-TF-IDF and from the document embeddings. Applying cosine similarity to these topic vectors gives a similarity matrix that shows how similar each pair of topics is.

In [None]:
bbu_ft_tm.visualize_heatmap(n_clusters=10, width=1000, height=1000)
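
The similarity matrix itself is easy to sketch. Assuming each topic is represented by a numeric vector, entry (i, j) is the cosine of the angle between the vectors for topics i and j. The vectors below are made up and stand in for rows of the c-TF-IDF matrix.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up topic vectors (not real model output)
topic_vectors = [
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],  # same direction as the first vector
    [0.0, 3.0, 0.0],  # orthogonal to the first two
]

# similarity[i][j] compares topic i with topic j; the matrix is symmetric
similarity = [[cosine_similarity(u, v) for v in topic_vectors] for u in topic_vectors]
```

Vectors that point the same way score close to 1 regardless of their length, and vectors with no shared terms score 0.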

# Part 2: Seed topics (*skip*)

https://maartengr.github.io/BERTopic/api/bertopic.html

In [None]:
seed_topic_list = [['physician', 'doctor', 'medical professional', 'medical expert'], ['women', 'mother']]
seed_topic_model = BERTopic(language = 'english', calculate_probabilities=True, verbose=True,
                            seed_topic_list = seed_topic_list)
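# NOTE: `sequences` is not defined anywhere in this notebook; define it as a
# list of document strings before running this cell
seed_topics, seed_probs = seed_topic_model.fit_transform(sequences)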

2022-03-16 17:40:37,439 - BERTopic - Transformed documents to Embeddings


2022-03-16 17:41:03,827 - BERTopic - Reduced dimensionality with UMAP
2022-03-16 17:41:24,975 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [None]:
seed_freq = seed_topic_model.get_topic_info()
seed_freq.head(10)

Unnamed: 0,Topic,Count,Name
0,-1,5744,-1_the_to_of_and
1,0,473,0_parental_minors_parents_minor
2,1,372,1_court_federal_courts_legislative
3,2,303,2_mortality_pregnancy_complications_deaths
4,3,265,3_hill_zone_buffer_speech
5,4,224,4_roe_privacy_wade_right
6,5,200,5_louisiana_620_privileges_admitting
7,6,180,6_psychological_women_study_mental
8,7,179,7_children_parents_family_parental
9,8,171,8_human_life_being_we


In [None]:
seed_women_similar_topics, seed_women_similarity = seed_topic_model.find_topics("women's rights", top_n=5)
print(seed_women_similar_topics)
seed_topic_model.get_topic(seed_women_similar_topics[0])

[165, 139, 86, 91, 127]


[('women', 0.014136297774918507),
 ('laws', 0.011999050580616942),
 ('illegal', 0.010237065081556607),
 ('enforcement', 0.009734576408527344),
 ('dying', 0.0094023498417061),
 ('abortions', 0.009385982433003582),
 ('rape', 0.008878073201327344),
 ('incest', 0.008369399996816313),
 ('prosecution', 0.0077237846543999214),
 ('criminal', 0.007508319106164667)]

In [None]:
seed_phys_similar_topics, seed_phys_similarity = seed_topic_model.find_topics("doctor", top_n=5)
print(seed_phys_similar_topics)
seed_topic_model.get_topic(seed_phys_similar_topics[1])

[164, 66, 77, 93, 11]


[('hospital', 0.02913038745081937),
 ('credentialing', 0.025738828794159216),
 ('privileges', 0.022390621401369354),
 ('hospitals', 0.019021824083630595),
 ('care', 0.016055733923885766),
 ('staff', 0.015923603500972786),
 ('physician', 0.015268006792430806),
 ('ms', 0.015124700368570216),
 ('admitting', 0.012263187376225206),
 ('physicians', 0.011998829970855503)]

# Part 3: Split into fem and opp

In this section, we fit models for fem and opp briefs separately in order to get more specific topic information. 

Experiment with removing additional words

Candidates for removal: numbers (including “fifth”, “ii”, “2d”, etc.), dates (“1923”), “et” and “al”, “per”, “circuit”, “district”, some acronyms (e.g., ussc, www), some proper names (thorp, hollen), and case names (Akron, Casey).

In [None]:
df_clean = df.copy()
#df_clean['text'] = df_clean['text'].str.replace('[{}]'.format(string.punctuation), '')

rmv_list = ['ii', 'https', 'al', 'et', 'per', 'www', 'llp', 'id', 'nos', 'pdf', 'http',
            'ul', 'fi', 'ri', 'sb', 'ql', 'li', 'fs',
            'circuit', 'district', 'supra', 'supp', 'decisis', 'amici', 'curiae', 'court', 'courts', 'supreme', 'appeals',
            'appeal', 'appellants', 'appellant', 'appellee', 'appellees',
            'first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth', 'eleventh', 'twelfth']

df_clean['text'] = df_clean['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (rmv_list)]))
#df_clean['text_2'] = df_clean['text'].apply(lambda x: [word for word in x.split()])
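
The removal list above contains only fixed words, while the note also mentions numbers and dates. One way to catch those is a pattern instead of an ever-growing list. This is a sketch under assumptions: the regular expression and the helper `drop_tokens` are illustrative, and the stop-word set is a small subset of `rmv_list`.

```python
import re

# Tokens made only of digits ("1923"), optionally followed by an
# ordinal-style suffix ("2d", "3rd", "14th")
NUMERIC_TOKEN = re.compile(r"^\d+(?:st|nd|rd|th|d)?$")

def drop_tokens(text, stop_words):
    """Remove stop words and numeric tokens from a whitespace-tokenized string."""
    kept = [
        word for word in text.split()
        if word not in stop_words and not NUMERIC_TOKEN.match(word)
    ]
    return " ".join(kept)

stop_words = {"et", "al", "circuit", "ninth"}  # small illustrative subset
cleaned = drop_tokens("roe et al 1923 ninth circuit 2d privacy 14th", stop_words)
```

Using a set for the stop words also makes each membership check constant-time, which helps when the function is applied to every paragraph.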

Split data into fem and opp

In [None]:
# split data
opp_df = df_clean[df_clean['brief_party']==0]
fem_df = df_clean[df_clean['brief_party']==1]

## Fem topic model

### Initial train + save

Load the saved model into a new variable so that re-fitting it on the fem briefs doesn't overwrite the model trained on the full data.

In [None]:
fem_tm = BERTopic.load(model_folder + 'legalbert_rramicus')

Fit the model on only the docs of interest (5 min)

In [None]:
# fit model
fem_topics, fem_probs = fem_tm.fit_transform(fem_df['text'])

3461it [02:26, 23.70it/s]
2022-03-31 19:04:57,290 - BERTopic - Transformed documents to Embeddings
2022-03-31 19:05:35,604 - BERTopic - Reduced dimensionality with UMAP
2022-03-31 19:05:36,089 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [None]:
fem_topic_info = fem_tm.get_topic_info()
fem_topic_info.head(5)

Unnamed: 0,Topic,Count,Name
0,0,1619,0_abortion_women_health_care
1,-1,574,-1_abortion_state_health_women
2,1,140,1_injunction_petitioners_zone_speech
3,2,128,2_respectfully_counsel_conclusion_attorney
4,3,71,3_minors_minor_parental_parents


In [None]:
fem_full_topics = fem_tm.get_topics()

Save

In [None]:
# full list of topics
fem_full_topics = fem_tm.get_topics()

#convert full topic dict to df and transpose
topics_df = pd.DataFrame(fem_full_topics,
                         index=['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10'])\
                         .transpose()

# get just the word
topics_df = topics_df.applymap(lambda x: x[0])

# add col w/concatenated list
#topics_df['all_words'] = topics_df.apply(', '.join, axis=1) #insert at end
topics_df.insert(0, 'topic', topics_df.apply(', '.join, axis=1))

# remove indiv. word columns (word1,...,word10)
topics_df.drop(list(topics_df.filter(regex = 'word')), axis = 1, inplace = True)

# convert index to a column (this is the topic id)
topics_df.insert(0, 'topic_id', topics_df.index)

# add count frequency 
topic_ct = fem_topic_info[['Topic', 'Count']]
topics_df = topics_df.merge(topic_ct, how='left', left_on='topic_id', right_on='Topic')
topics_df.drop('Topic', axis=1, inplace=True)

# save
topics_df.to_csv(output_folder + 'fem_topics_clean_legalbert_rramicus.csv', index=False)

# classification by paragraph
topic_id = fem_topic_info[['Topic', 'Name']]
output_df = fem_df.copy()
output_df['topic_id'] = fem_topics
output_df = output_df.merge(topic_id, how='left', left_on='topic_id', right_on='Topic')
output_df.drop('Topic',axis=1,inplace=True)
output_df.rename({'Name' : 'topic_name'},axis=1, inplace=True)
output_df.to_csv(output_folder + 'fem_topic_clean_classification_legalbert_rramicus.csv', index=False)

Next, cluster the topics using HDBSCAN

In [None]:
fem_embed = get_similar_topics(fem_tm)

In [None]:
fem_topic_df = pd.read_csv(output_folder + 'fem_topics_clean_legalbert_rramicus.csv')
fem_embed = fem_embed.sort_values('Label')
fem_embed.rename({'Topic':'topic_id', 'Label':'label'}, axis=1, inplace=True)
fem_embed = fem_embed.merge(fem_topic_df, how='left', on = 'topic_id')
fem_embed.drop(['Words', 'Size'], axis=1, inplace=True)
fem_embed = fem_embed[['topic_id', 'label', 'topic', 'Count', 'x', 'y']]
fem_embed.to_csv(output_folder + 'fem_topics_clean_labels_legalbert_rramicus.csv')

Save model

In [None]:
fem_tm.save(model_folder + 'fem_legalbert_rramicus')

### Load from saved

In [None]:
# load the model under the same name it was saved with above
fem_tm = BERTopic.load(model_folder + 'fem_legalbert_rramicus')


### Explore

In [None]:
fem_tm.visualize_topics()

Find topics

In [None]:
similar_topics, similarity = fem_tm.find_topics("medical", top_n=5)
print(similar_topics)
print(similarity)
for i in range(len(similar_topics)):
  print(fem_tm.get_topic(similar_topics[i]))

[57, 46, -1, 35, 59]
[0.9860490372255521, 0.9815115778404839, 0.9811166197960979, 0.9800830365274396, 0.9793343559125567]
[('act', 0.01543954118686647), ('emergency', 0.014231235818460135), ('medical', 0.013855040367193103), ('physician', 0.01369661624339168), ('patient', 0.013609853728641), ('hampshire', 0.012316739126711787), ('health', 0.012138903189156435), ('abortion', 0.01207191176826808), ('physicians', 0.011911218625010789), ('judge', 0.011798782385998659)]
[('organization', 0.03879029805705921), ('rights', 0.028612205730736546), ('reproductive', 0.02238318008521494), ('national', 0.021545567586049752), ('women', 0.021178794951134908), ('civil', 0.01938167511672356), ('legal', 0.01754986814847482), ('education', 0.016915347950587963), ('health', 0.01645128531672538), ('advocacy', 0.015872087366163745)]
[('court', 0.009003044817547996), ('abortion', 0.0078714103351809), ('state', 0.007542976329646191), ('right', 0.006502928377333628), ('health', 0.006451944211089597), ('women', 

## Opp topic model

### Initial train + save

Load the saved model into a new variable so that re-fitting it on the opp briefs doesn't overwrite the model trained on the full data.

In [None]:
opp_tm = BERTopic.load(model_folder + 'legalbert_rramicus')

Fit the model on only the docs of interest (5 min)

In [None]:
# fit model
opp_topics, opp_probs = opp_tm.fit_transform(opp_df['text'])

4129it [03:01, 22.69it/s]
2022-03-31 19:11:20,390 - BERTopic - Transformed documents to Embeddings
2022-03-31 19:11:37,142 - BERTopic - Reduced dimensionality with UMAP
2022-03-31 19:11:40,134 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [None]:
opp_topic_info = opp_tm.get_topic_info()
opp_topic_info.head(5)

Unnamed: 0,Topic,Count,Name
0,-1,1048,-1_abortion_roe_state_right
1,0,188,0_respectfully_submitted_counsel_conclusion
2,1,184,1_parental_parents_minor_minors
3,2,162,2_hobbs_extortion_property_act
4,3,129,3_murder_evidence_any_unborn


Save

In [None]:
# full list of topics
opp_full_topics = opp_tm.get_topics()

#convert full topic dict to df and transpose
topics_df = pd.DataFrame(opp_full_topics,
                         index=['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10'])\
                         .transpose()

# get just the word
topics_df = topics_df.applymap(lambda x: x[0])

# add col w/concatenated list
#topics_df['all_words'] = topics_df.apply(', '.join, axis=1) #insert at end
topics_df.insert(0, 'topic', topics_df.apply(', '.join, axis=1))

# remove indiv. word columns (word1,...,word10)
topics_df.drop(list(topics_df.filter(regex = 'word')), axis = 1, inplace = True)

# convert index to a column (this is the topic id)
topics_df.insert(0, 'topic_id', topics_df.index)

# add count frequency 
topic_ct = opp_topic_info[['Topic', 'Count']]
topics_df = topics_df.merge(topic_ct, how='left', left_on='topic_id', right_on='Topic')
topics_df.drop('Topic', axis=1, inplace=True)

# save
topics_df.to_csv(output_folder + 'opp_topics_clean_legalbert_rramicus.csv', index=False)

# classification by paragraph
topic_id = opp_topic_info[['Topic', 'Name']]
output_df = opp_df.copy()
output_df['topic_id'] = opp_topics
output_df = output_df.merge(topic_id, how='left', left_on='topic_id', right_on='Topic')
output_df.drop('Topic',axis=1,inplace=True)
output_df.rename({'Name' : 'topic_name'},axis=1, inplace=True)
output_df.to_csv(output_folder + 'opp_topic_clean_classification_legalbert_rramicus.csv', index=False)

Next, cluster the topics using HDBSCAN

In [None]:
opp_embed = get_similar_topics(opp_tm)

In [None]:
opp_topic_df = pd.read_csv(output_folder + 'opp_topics_clean_legalbert_rramicus.csv')
opp_embed = opp_embed.sort_values('Label')
opp_embed.rename({'Topic':'topic_id', 'Label':'label'}, axis=1, inplace=True)
opp_embed = opp_embed.merge(opp_topic_df, how='left', on = 'topic_id')
opp_embed.drop(['Words', 'Size'], axis=1, inplace=True)
opp_embed = opp_embed[['topic_id', 'label', 'topic', 'Count', 'x', 'y']]
opp_embed.to_csv(output_folder + 'opp_topics_clean_labels_legalbert_rramicus.csv')

In [None]:
opp_tm.save(model_folder + 'opp_legalbert_rramicus')



### Load from saved

In [None]:
# load the model under the same name it was saved with above
opp_tm = BERTopic.load(model_folder + 'opp_legalbert_rramicus')

### Explore

In [None]:
#opp_tm.visualize_topics(top_n_topics=50)
opp_tm.visualize_topics()