# Batch Legal BERT Mockup Model: Using Different Embeddings

In [1]:
%%capture
# Install BERTopic (the %%capture cell magic has to be the first line of the cell)
!pip install bertopic

In [2]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
#Imports

import pandas as pd
import nltk 

import string
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords 

from bertopic import BERTopic  # BERTopic model: https://github.com/MaartenGr/BERTopic

*Working with the scikit-learn 20 Newsgroups dataset used in the BERTopic examples, to check out its structure*

In [4]:
#Loading + exploring data

from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

In [5]:
print(f"docs consists of {len(docs)} strings in a {type(docs)}")

docs consists of 18846 strings in a <class 'list'>


In [6]:
docs[10]

'the blood of the lamb.\n\nThis will be a hard task, because most cultures used most animals\nfor blood sacrifices. It has to be something related to our current\npost-modernism state. Hmm, what about used computers?\n\nCheers,\nKent'

In [7]:
#Training

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)


Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2022-06-01 08:27:01,519 - BERTopic - Transformed documents to Embeddings
2022-06-01 08:27:32,476 - BERTopic - Reduced dimensionality
2022-06-01 08:28:15,380 - BERTopic - Clustered reduced embeddings


In [8]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,6630,-1_to_the_is_and
1,0,1819,0_game_team_games_he
2,1,583,1_key_clipper_chip_encryption
3,2,526,2_ites_cheek_yep_huh
4,3,489,3_israel_israeli_jews_arab
...,...,...,...
219,218,10,218_drunk_cjackson_dwi_driving
220,219,10,219_religion_wars_history_crusades
221,220,10,220_accelerations_45g_pistrix_acceleration
222,221,10,221_board_motherboard_turbo_wires
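
Topic `-1` collects the documents that the clustering step could not assign to any topic, so its size is worth keeping an eye on. A minimal sketch, assuming the two counts are copied by hand from the outputs above (`outlier_share` is a hypothetical helper, not part of BERTopic):

```python
def outlier_share(outlier_count: int, total_docs: int) -> float:
    """Fraction of documents that landed in the outlier topic (-1)."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return outlier_count / total_docs


# Counts taken from the outputs above: 6630 outliers out of 18846 documents.
share = outlier_share(6630, 18846)
print(f"{share:.1%} of the documents are outliers")
```

For this run, roughly 35% of the documents were not assigned to any topic.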


In [9]:
# Trying it out with very little data

test = docs[0:10]

In [10]:
topic_model_small = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model_small.fit_transform(test)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2022-06-01 08:28:35,218 - BERTopic - Transformed documents to Embeddings
2022-06-01 08:28:40,789 - BERTopic - Reduced dimensionality
2022-06-01 08:28:40,799 - BERTopic - Clustered reduced embeddings


In [11]:
topic_model_small.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,10,-1_the_to_and_of


*Working with Jakob's DF*

In [12]:
#Loading DF

data = pd.read_csv("/content/drive/MyDrive/test_data_scraped_new.csv")


In [13]:
data.columns

Index(['Unnamed: 0', 'Date of document', 'Title', 'Subtitle', 'CELEX number',
       'EUROVOC descriptor', 'Subject matter', 'Directory code', 'Author',
       'In force indicator', 'Content'],
      dtype='object')

In [14]:
eu_docs = data.Content.tolist()

In [15]:
print(f"eu_docs consists of {len(eu_docs)} strings in a {type(eu_docs)}")

eu_docs consists of 20 strings in a <class 'list'>


In [16]:
eu_docs[1]

' (1) The objective of the Union’s policy on asylum is to develop and establish a Common European Asylum System (CEAS) that is consistent with the values and humanitarian tradition of the Union and governed by the principle of solidarity and fair sharing of responsibility. (2) A common policy on asylum based on the full and inclusive application of the Geneva Convention Relating to the Status of Refugees of 28\xa0July 1951, as amended by the New York Protocol of 31\xa0January 1967, is a constituent part of the Union’s objective of establishing progressively an area of freedom, security and justice open to third-country nationals or stateless persons who seek international protection in the Union. (3) The CEAS is based on common minimum standards for procedures for international protection, recognition and protection offered at Union level and for reception conditions, and it establishes a system for determining the Member State responsible for examining applications for international p

In [17]:
# Instantiating BERTopic
# fit_transform already fits the model, so a separate .fit() call would only train it a second time.

eu_topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = eu_topic_model.fit_transform(eu_docs)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2022-06-01 08:28:55,837 - BERTopic - Transformed documents to Embeddings
2022-06-01 08:28:58,697 - BERTopic - Reduced dimensionality
2022-06-01 08:28:58,707 - BERTopic - Clustered reduced embeddings


In [None]:
eu_topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,20,-1_the_of_and_to
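
With only 20 documents, every text ends up in the outlier topic. One possible way to get more (and shorter) documents out of the same data is to split each text at its numbered recitals, which are visible in `eu_docs[1]`. The following is a rough sketch under that assumption: `split_recitals` is a hypothetical helper, the sample string is shortened by hand, and the regex is a heuristic that would also split at in-text references such as "paragraph (2)".

```python
import re
from typing import List


def split_recitals(text: str) -> List[str]:
    """Split a legal text at markers like '(1) ', '(2) ' into separate recitals."""
    # Replace non-breaking spaces (\xa0), which show up in the scraped content.
    text = text.replace("\xa0", " ")
    # A marker must sit at the start of the text or follow whitespace,
    # so references like 'Article 3(1)' are left alone.
    parts = re.split(r"(?:^|\s)\(\d+\)\s+", text)
    return [part.strip() for part in parts if part.strip()]


sample = (
    " (1) The objective of the Union's policy on asylum is to develop a CEAS."
    " (2) A common policy based on the Convention of 28\xa0July 1951."
    " (3) The CEAS is based on Article 3(1) and common minimum standards."
)
recitals = split_recitals(sample)
print(len(recitals), "recitals")
```

Applied to all entries, e.g. `[r for doc in eu_docs for r in split_recitals(doc)]`, this would produce a longer list of shorter strings that can be passed to `fit_transform` in place of `eu_docs`. Whether that is enough to form meaningful topics would have to be tested.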


### *Different Embedding-Models*

In [18]:
# all-mpnet-base-v2, see: https://www.sbert.net/docs/pretrained_models.html
# fit_transform already fits the model, so no separate .fit() call is needed.

eu_topic_model_2 = BERTopic(embedding_model="all-mpnet-base-v2")
topics, probs = eu_topic_model_2.fit_transform(eu_docs)

2022-06-01 08:29:19,847 - BERTopic - Transformed documents to Embeddings
2022-06-01 08:29:22,383 - BERTopic - Reduced dimensionality
2022-06-01 08:29:22,393 - BERTopic - Clustered reduced embeddings


In [19]:
eu_topic_model_2.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,20,-1_the_of_and_to


In [20]:
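# Getting the top 5 topics
# The EU input data (20 documents) was far too small to give anything meaningful,
# so this uses the model trained on the 20 Newsgroups data.
# Note: get_topic_info(5) would return the row for topic 5, not the first five rows.

freq = topic_model.get_topic_info().head(5)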

In [21]:
# Select the most frequent topic (topic -1 holds the outliers, so topic 0 is the largest real topic)
topic_model.get_topic(0)

[('game', 0.010320859262103797),
 ('team', 0.008996932328732101),
 ('games', 0.007170429245303449),
 ('he', 0.006975207241406137),
 ('players', 0.006282292942426023),
 ('season', 0.006213359656378095),
 ('hockey', 0.006115068609007399),
 ('play', 0.005768177503279186),
 ('25', 0.005627208580949827),
 ('year', 0.005577108940980226)]
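
The numbers next to each word are c-TF-IDF scores, which the BERTopic documentation describes as the term frequency within a topic multiplied by `log(1 + A / f_x)`, where `A` is the average number of words per topic and `f_x` is the frequency of word `x` across all topics. The following is a simplified pure-Python sketch of that formula, not BERTopic's actual implementation. The function name `c_tf_idf` and the toy tokens are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def c_tf_idf(topic_tokens: Dict[int, List[str]]) -> Dict[int, Dict[str, float]]:
    """Score each word per topic: normalized tf * log(1 + avg_words / word_freq)."""
    counts = {topic: Counter(tokens) for topic, tokens in topic_tokens.items()}

    # f_x: how often each word occurs across all topics.
    word_freq = Counter()
    for counter in counts.values():
        word_freq.update(counter)

    # A: average number of words per topic.
    avg_words = sum(len(tokens) for tokens in topic_tokens.values()) / len(topic_tokens)

    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            word: (tf / n_words) * math.log(1 + avg_words / word_freq[word])
            for word, tf in counter.items()
        }
    return scores


# Toy example: all documents of a topic are treated as one long token list.
toy = {
    0: "game team game season".split(),
    1: "key chip key encryption".split(),
}
scores = c_tf_idf(toy)
top_word_topic_0 = max(scores[0], key=scores[0].get)
print(top_word_topic_0)
```

Words that are frequent inside one topic but rare overall receive the highest scores, which is why a general word such as `he` can still rank high when it is used heavily in the sports discussions.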

In [22]:
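# Visualize topic probabilities
# `probs` was overwritten by the later fit_transform calls on the 20 EU documents,
# so probs[200] raised an IndexError. Recompute the probabilities for that
# 20 Newsgroups document with the newsgroups model instead.

_, doc_probs = topic_model.transform([docs[200]])
topic_model.visualize_distribution(doc_probs[0], min_probability=0.015)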

In [23]:
#Visualize Terms

topic_model.visualize_barchart(top_n_topics=3)