# Topic Modeling

Using BERTopic

In [11]:
model_checkpoint = 'bert-base-uncased'

## Set up environment

In [None]:
!pip install transformers
!pip install datasets
!pip install bertopic

You'll need to enable a GPU for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down
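
To confirm that the runtime actually exposes a GPU after changing the setting, a check along these lines can help. This is a minimal sketch that assumes PyTorch is the installed backend; `gpu_available` is a hypothetical helper name, not part of any library.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # assumption: PyTorch is the backend in this runtime
    except ImportError:
        return False
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` on a GPU runtime, restarting the runtime after changing the accelerator setting is usually the first thing to try.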

In [8]:
import pandas as pd
import numpy as np
from html import unescape

from bertopic import BERTopic

from transformers import pipeline
from transformers import DataCollatorWithPadding
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
# load_metric is not used in this notebook and is no longer available in recent versions of datasets
from datasets import load_dataset, Dataset

from huggingface_hub import notebook_login

## Data

BERTopic's `fit_transform` takes a list of documents (plain strings), so we need to build that list ourselves.

### Option 1: Tokenize text then decode back to original text

In [6]:
!git config --global credential.helper store
# get access token on Huggingface website > settings > access token (make sure it's a write token)
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


Read in the Hugging Face dataset

In [None]:
ds_path = 'repro-rights-amicus-briefs/repro-rights-amicus'
# use_auth_token must be True because this is a private dataset
ds = load_dataset(ds_path, use_auth_token=True)

# convert HTML entities (e.g. &amp;) back to plain characters
ds = ds.map(
    lambda x: {"text": [unescape(o) for o in x["text"]]}, batched=True
)
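
As a quick illustration of what `unescape` does to each document, the snippet below runs it on a made-up string. The sample text is an assumption for demonstration and is not taken from the dataset.

```python
from html import unescape

# Made-up sample containing named and numeric HTML entities
raw = "Smith &amp; Jones argue the statute is &quot;void&quot; under &#167; 2"
clean = unescape(raw)
print(clean)  # Smith & Jones argue the statute is "void" under § 2
```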

Tokenize

In [None]:
# instantiate tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# split each document into overlapping chunks of at most 510 tokens;
# consecutive chunks share 128 tokens (the stride)
def tokenize_and_split(examples):
    result = tokenizer(
        examples["text"],
        truncation=True,
        max_length=510,  # just under BERT's 512-token limit
        stride=128,
        return_overflowing_tokens=True,
        padding="max_length",
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

# tokenize
tokenized_ds = ds.map(tokenize_and_split, batched = True, batch_size = 100)
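
To make the roles of `max_length`, `stride`, and the sample mapping concrete, here is a pure-Python sketch of the same idea on toy data. It is not the tokenizer's implementation: it ignores special tokens and padding, and `sliding_windows` and `chunk_batch` are hypothetical helper names introduced only for this illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens,
    where consecutive windows share `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must be non-negative and smaller than max_length")
    step = max_length - stride
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows


def chunk_batch(batch_token_ids, max_length, stride):
    """Chunk every document and record which document each chunk came from."""
    chunks, sample_map = [], []
    for doc_idx, token_ids in enumerate(batch_token_ids):
        for window in sliding_windows(token_ids, max_length, stride):
            chunks.append(window)
            sample_map.append(doc_idx)
    return chunks, sample_map


# Toy batch: one 10-token document and one 3-token document
batch = [list(range(10)), list(range(3))]
cases = ["Case A", "Case B"]  # made-up metadata, one entry per document

chunks, sample_map = chunk_batch(batch, max_length=4, stride=1)
chunk_cases = [cases[i] for i in sample_map]  # repeat metadata once per chunk

print(chunks)
print(sample_map)
print(chunk_cases)
```

The last line mirrors what the loop over `examples.items()` does in `tokenize_and_split`: each metadata value is repeated once for every chunk produced from its document, so chunks and metadata stay aligned.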

# decode token IDs back to text; the result is lowercased because the checkpoint is uncased
def decode_chunks(example):
    result = tokenizer.batch_decode(
        example['input_ids'],
        skip_special_tokens=True,  # drops [CLS], [SEP], and [PAD]
        clean_up_tokenization_spaces=True
    )
    example['text_chunk'] = result
    return example

# decode
tokenized_ds = tokenized_ds.map(decode_chunks, batched=True, batch_size=100)

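Put the document chunks into a list, since BERTopic expects its documents as a list of strings. The metadata lists are built in the same split order so they stay aligned with `sequences`.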

In [13]:
# new way using decoded tokenized text
sequences = tokenized_ds['train']['text_chunk'] + tokenized_ds['valid']['text_chunk'] + tokenized_ds['test']['text_chunk']
case = tokenized_ds['train']['case'] + tokenized_ds['valid']['case'] + tokenized_ds['test']['case']
brief_ids = tokenized_ds['train']['id'] + tokenized_ds['valid']['id'] + tokenized_ds['test']['id']
brief_names = tokenized_ds['train']['brief'] + tokenized_ds['valid']['brief'] + tokenized_ds['test']['brief']
brief_party = tokenized_ds['train']['brief_party'] + tokenized_ds['valid']['brief_party'] + tokenized_ds['test']['brief_party']

# check we have the results we expect
print(type(sequences))
print(type(sequences[0]))
print(len(sequences))

<class 'list'>
<class 'str'>
15911
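
The repeated concatenation across splits can also be written as a small helper, which makes it harder to mix up the split order between columns. This is a sketch on a toy dictionary standing in for the tokenized dataset; `concat_column` is a hypothetical name and the values are made up.

```python
def concat_column(dataset_dict, column, splits=("train", "valid", "test")):
    """Concatenate one column across splits, always in the same split order."""
    values = []
    for split in splits:
        values.extend(dataset_dict[split][column])
    return values


# Toy stand-in for the tokenized dataset (made-up values)
toy = {
    "train": {"text_chunk": ["a", "b"], "case": ["X", "X"]},
    "valid": {"text_chunk": ["c"], "case": ["Y"]},
    "test": {"text_chunk": ["d"], "case": ["Z"]},
}

sequences_toy = concat_column(toy, "text_chunk")
case_toy = concat_column(toy, "case")

# The lists must have equal length for chunk i to map to metadata i
assert len(sequences_toy) == len(case_toy)
print(list(zip(sequences_toy, case_toy)))
```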


### Option 2: Split text by words

Define a function to split text into chunks of 512 words. Since we aren't using Hugging Face pipelines here, this is a rough cut: word counts only approximate token counts, so we have to accept that we're introducing some inefficiency into the process.

In [None]:
def split_text(text, n):
    # split text on whitespace
    words = text.split()
    # join the words back into strings of at most n words each
    return [' '.join(words[i:i+n]) for i in range(0, len(words), n)]
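
A short usage example shows the chunking behavior, including the shorter final chunk. The function is restated so the snippet runs on its own, and the sample sentence is made up.

```python
def split_text(text, n):
    # same logic as above, restated so this snippet is self-contained
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(0, len(words), n)]


sample = "the court held that the statute imposed an undue burden"
print(split_text(sample, 4))
# ['the court held that', 'the statute imposed an', 'undue burden']
```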

In [None]:
n = 512
# `df` is assumed to be a pandas DataFrame with one brief per row and its text in 'txt_short'
df_512 = df.copy()
df_512['txt_split'] = df_512['txt_short'].apply(lambda text: split_text(text, n))
# explode gives each chunk its own row and repeats the brief's metadata
df_512 = df_512.explode('txt_split')
df_512 = df_512.drop('txt_short', axis=1).rename({'txt_split': 'text'}, axis=1)
len(df_512)

11804

In [None]:
df_512.head(1)

| | case | brief | id | text |
|---|---|---|---|---|
| 0 | Anders v Floyd | Anders v Floyd - amicus brief for appellant (o... | 861815186515 | many roe v wade killings are murder the eviden... |


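Make a list of documents. Do not shuffle it: the order has to stay aligned with the rows of `df_512` so each chunk can be traced back to its brief.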

In [None]:
list_512 = list(df_512['text'])

## Training

Instantiate BERTopic and set the language to English. Note that we aren't doing any fine-tuning here.

In [14]:
topic_model = BERTopic(language = 'english', calculate_probabilities=True, verbose=True)

"Train"

Note that `fit_transform` takes a list of documents and, optionally, precomputed document embeddings through its `embeddings` argument.

In [19]:
topics, probs = topic_model.fit_transform(sequences)

Batches:   0%|          | 0/498 [00:00<?, ?it/s]

2022-03-14 22:52:23,072 - BERTopic - Transformed documents to Embeddings
2022-03-14 22:52:43,352 - BERTopic - Reduced dimensionality with UMAP
2022-03-14 22:53:08,808 - BERTopic - Clustered UMAP embeddings with HDBSCAN
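
For the precomputed-embeddings route, a sketch could look like the following. It assumes the `sentence-transformers` package is installed and uses `all-MiniLM-L6-v2` as the encoder, which is a choice made for this example rather than something this notebook specifies; `fit_with_precomputed_embeddings` is a hypothetical helper name.

```python
def fit_with_precomputed_embeddings(docs, model_name="all-MiniLM-L6-v2"):
    """Embed docs once, then fit BERTopic on the stored embeddings."""
    # Imported inside the function so the heavy dependencies load only when it is called
    from sentence_transformers import SentenceTransformer
    from bertopic import BERTopic

    encoder = SentenceTransformer(model_name)
    embeddings = encoder.encode(docs, show_progress_bar=True)

    topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
    # The documents are still required; the embeddings only skip the embedding step
    topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
    return topic_model, topics, probs, embeddings
```

Calling `fit_with_precomputed_embeddings(sequences)` would return the embeddings alongside the model, so they can be saved and reused when experimenting with different clustering settings without re-encoding every chunk.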


## Extracting Topics

Topics by frequency. Topic -1 collects the outliers that HDBSCAN did not assign to any cluster, so it can be ignored.

In [20]:
freq = topic_model.get_topic_info()
freq.head(20)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5703 | -1_the_of_to_in |
| 1 | 0 | 1090 | 0_privileges_texas_admitting_hospital |
| 2 | 1 | 380 | 1_parental_parents_minors_minor |
| 3 | 2 | 293 | 2_zone_hill_buffer_speech |
| 4 | 3 | 222 | 3_akron_462_informed_information |
| 5 | 4 | 222 | 4_roe_wade_privacy_right |
| 6 | 5 | 195 | 5_undue_burden_test_casey |
| 7 | 6 | 186 | 6_standing_party_third_singleton |
| 8 | 7 | 161 | 7_human_life_being_you |
| 9 | 8 | 160 | 8_clinic_violence_1993_clinics |
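
Before setting topic -1 aside, it is worth checking how much of the corpus it covers. The sketch below computes the outlier share from a list of topic assignments (`outlier_share` is a hypothetical helper, shown on a made-up list) and then applies the same arithmetic to the counts reported in this notebook: 5703 outliers out of 15911 chunks.

```python
from collections import Counter


def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topics)
    return counts[-1] / len(topics)


# Toy assignment list (made up): 2 of 5 documents are outliers
print(outlier_share([-1, 0, 1, -1, 0]))  # 0.4

# Using the counts reported in this notebook
share = 5703 / 15911
print(f"{share:.1%}")  # 35.8%
```

On the fitted model the same helper could be called as `outlier_share(topics)`, using the list returned by `fit_transform`.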


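We can examine some of these topics more closely. Note that the outputs below do not line up with the topic numbers in the table above (for example, the words under `get_topic(0)` match topic 1 in the table); they appear to come from a different run, and topic numbering is not stable across runs.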

In [None]:
topic_model.get_topic(0)

[('parents', 0.014397456990439846),
 ('parental', 0.01402645158938964),
 ('minors', 0.01370549086483294),
 ('minor', 0.01106826919559555),
 ('notification', 0.007774049548184148),
 ('consent', 0.007628639853892523),
 ('their', 0.007156751210838383),
 ('to', 0.006659743026310319),
 ('and', 0.006493053856067767),
 ('the', 0.006465293374814961)]

In [None]:
topic_model.get_topic(2)

[('roe', 0.01581732814050267),
 ('right', 0.01003336402104284),
 ('wade', 0.009168815228328082),
 ('privacy', 0.00832984437226401),
 ('court', 0.008192189682724357),
 ('us', 0.0076920990797694055),
 ('the', 0.00736961382140694),
 ('in', 0.006981806785062368),
 ('of', 0.006881830186251918),
 ('this', 0.006809258071137317)]

In [None]:
topic_model.get_topic(7)

[('undue', 0.019824720214606333),
 ('burden', 0.019606947388422908),
 ('casey', 0.017283465929548318),
 ('test', 0.011102389524990456),
 ('regulations', 0.00951305775486606),
 ('regulation', 0.009493764112620744),
 ('hellerstedt', 0.009365489487234004),
 ('that', 0.007981565522492512),
 ('id', 0.007970457851775492),
 ('obstacle', 0.007930386246121999)]

## Visualize topics in space

This first visualization projects our topics onto two dimensions so we can see which topics are similar to one another. Similar topics could be merged to reduce the number of topics. Note that this is an interactive visual.

In [None]:
topic_model.visualize_topics()
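
Because the visual is interactive, it does not survive a static export of the notebook. One way to keep it is to write the figure to a standalone HTML file, as in this sketch. It relies on `visualize_topics` returning a Plotly figure; `save_topic_map` and the default file name are hypothetical choices for this example.

```python
def save_topic_map(topic_model, path="topic_map.html"):
    """Write the interactive intertopic distance map to a standalone HTML file."""
    fig = topic_model.visualize_topics()  # returns a Plotly figure
    fig.write_html(path)
    return path
```

Calling `save_topic_map(topic_model)` after fitting would produce a file that opens in any browser with the hover and zoom behavior intact.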

## Topic hierarchy

Another way to visually examine how topics are related to one another. Just from looking at this, I think it would make more sense to topic model pro-women and pro-opp briefs separately, since they often use similar language/topics but are articulating very different points on them!

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Reduce n topics

How many topics to reduce to is a manual decision.

In [None]:
#new_topics, new_probs = topic_model.reduce_topics(list_512, topics, probs, nr_topics=60)

# Part 2: Use fine-tuned transformer

Flair allows you to choose almost any 🤗 transformers model. Simply select any model from the Hugging Face model hub and pass it to BERTopic:

In [None]:
!pip install bertopic[flair]

So, we can use our fine-tuned model here!

In [None]:
from flair.embeddings import TransformerDocumentEmbeddings

# 'roberta-base' is the stock checkpoint; pass the fine-tuned model's path or hub ID here instead
roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)