<a href="https://colab.research.google.com/github/andreea-bodea/bachelors-thesis-informatics/blob/main/BT%20INFO%20-%20Model%201%3A%20TSDAE%20on%20Parler%20%2B%20BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Model 1: TSDAE on Parler + BERTopic on Parler

---

TSDAE = Transformer-based Sequential Denoising Auto-Encoder

Unsupervised training method for SBERT (Sentence Transformers)

https://www.sbert.net/examples/unsupervised_learning/TSDAE/README.html

BERTopic with Custom Embeddings 

https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#visual-overview


In [None]:
%%capture
!pip install bertopic

In [None]:
%%capture
!pip install joblib==1.1.0

In [None]:
from bertopic import BERTopic 
from umap import UMAP

In [None]:
%%capture
!pip install sentence_transformers 
!pip install utils 

In [None]:
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Upload CSV files with posts to Google Colab
# Sample for training the sentence transformer: parleys_train (100,000 posts)
# Sample for topic modelling: parleys_test (~300,000 posts)
from google.colab import files
uploaded = files.upload()

Saving parleys_test.csv to parleys_test.csv
Saving parleys_train.csv to parleys_train.csv


In [None]:
# Read csv files into pandas dataframes 
import pandas as pd
import io
parleys_train = pd.read_csv(io.BytesIO(uploaded['parleys_train.csv']))
parleys_test = pd.read_csv(io.BytesIO(uploaded['parleys_test.csv']))

In [None]:
parleys_train

Unnamed: 0,body
0,already know that next move hope trump nerve a...
1,praying rudy america need
2,electedofficialsgoingto something pay stop kin...
3,kpeklund leadership leader need leader
4,grew hour north austin
...,...
99995,think meant aim bidena deranged person speaker...
99996,pitter patter let get there country save becam...
99997,know pissed knew trump vote getting going stay...
99998,quite frankly that sure gather sheep rule over...


In [None]:
parleys_test

Unnamed: 0,body
0,glad see parler free speech actually alive wel...
1,not enough year minimum
2,wonder kamalaharris blm think white guy placed...
3,agreed seemed like close race till inner city ...
4,well well abercrombie fitch president canada e...
...,...
309063,politician concerned covering ass not represen...
309064,rent kid hell barack mike rented them
309065,whom biden carry anything especially itcome pe...
309066,pedo fly head never lie


In [None]:
# Transform pandas dataframes to lists with posts 
posts_train = parleys_train['body'].tolist()
Parler_posts_test = parleys_test['body'].tolist()

In [None]:
posts_train

In [None]:
Parler_posts_test

TSDAE

In [None]:
# Create the special denoising dataset that adds noise on-the-fly
dataset = datasets.DenoisingAutoEncoderDataset(posts_train)
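
The denoising dataset pairs each post with a corrupted copy, and the model learns to reconstruct the original from the corrupted input. As a rough illustration of token-deletion noise, here is a minimal pure-Python sketch. The helper `delete_tokens`, the whitespace tokenization, and the 0.6 deletion ratio are assumptions made for this example; this is not the library's own implementation.

```python
import random


def delete_tokens(text, del_ratio=0.6, rng=None):
    """Return a noisy copy of `text` with roughly `del_ratio` of its tokens removed.

    Hypothetical helper for illustration only; tokens are split on whitespace.
    """
    rng = rng or random.Random()
    tokens = text.split()
    if not tokens:
        return text
    # Keep each token with probability (1 - del_ratio)
    kept = [tok for tok in tokens if rng.random() >= del_ratio]
    # Never return an empty sentence: fall back to one randomly chosen token
    if not kept:
        kept = [rng.choice(tokens)]
    return " ".join(kept)


rng = random.Random(0)  # fixed seed so the example is repeatable
original = "praying rudy america need"
noisy = delete_tokens(original, rng=rng)
print(original, "->", noisy)
```

Because the noise is sampled each time an item is read, the same post can be corrupted differently on each pass over the data.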

In [None]:
# DataLoader to batch your data
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)

In [None]:
# Define your sentence transformer model (SBERT) using CLS pooling
bert = models.Transformer('bert-base-uncased')
pooling = models.Pooling(bert.get_word_embedding_dimension(), 'cls')
sentence_model = SentenceTransformer(modules=[bert, pooling])
sentence_model

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
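
The module summary above shows `pooling_mode_cls_token: True`, meaning the sentence embedding is the vector of the first token ([CLS]) rather than an average over all tokens. The toy sketch below contrasts the two strategies on made-up 3-dimensional vectors. The helper names and values are hypothetical, and real mean pooling would also mask padding tokens.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings):
    """Average the token vectors dimension by dimension (no padding mask here)."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


# Made-up embeddings for a three-token input; row 0 plays the role of [CLS]
token_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [5.0, 2.0, 1.0],
]

print(cls_pooling(token_embeddings))   # [1.0, 0.0, 2.0]
print(mean_pooling(token_embeddings))  # [3.0, 2.0, 1.0]
```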

In [None]:
# Use the denoising auto-encoder loss
loss = losses.DenoisingAutoEncoderLoss(sentence_model, tie_encoder_decoder=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.3.crossattention.self.query.weight', 'bert.encoder.layer.9.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.11.crossattention.self.query.weight', 'bert.encode

In [None]:
# Call the fit method
sentence_model.fit(
    train_objectives=[(dataloader, loss)], 
    epochs=1,
    weight_decay=0, 
    scheduler='constantlr', 
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True,
    use_amp=False
)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/12500 [00:00<?, ?it/s]
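
The progress bar reports 12,500 iterations for the single epoch. This is consistent with 100,000 training posts, a batch size of 8, and `drop_last=True`, as the arithmetic sketch below shows. The helper name `batches_per_epoch` is hypothetical.

```python
def batches_per_epoch(num_samples, batch_size, drop_last=True):
    """Number of batches a loader yields per epoch for the given settings."""
    full_batches, remainder = divmod(num_samples, batch_size)
    if drop_last or remainder == 0:
        return full_batches
    # Without drop_last, the incomplete final batch is kept
    return full_batches + 1


print(batches_per_epoch(100_000, 8))  # 12500
```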

In [None]:
# Save the model
sentence_model.save('output/tsdae-parler-bert-base-uncased')
# Load the saved model
sentence_model = SentenceTransformer('output/tsdae-parler-bert-base-uncased')

BERTopic

In [None]:
# Prepare embeddings using the custom-trained Sentence-BERT model
embeddings = sentence_model.encode(Parler_posts_test, show_progress_bar=True)

Batches:   0%|          | 0/9659 [00:00<?, ?it/s]

In [None]:
# Set a random_state in UMAP to make the results reproducible (at the expense of performance)
umap_model = UMAP(n_neighbors=15, n_components=5, 
                  min_dist=0.0, metric='cosine', random_state=42)

In [None]:
# Train topic model using the custom-trained embeddings
# Extract topics and generate probabilities
topic_model = BERTopic(nr_topics=10, umap_model=umap_model)
topics, probs = topic_model.fit_transform(Parler_posts_test, embeddings)

In [None]:
# Access information about all topics that were generated
# -1 refers to all outliers and should typically be ignored
topics_df = topic_model.get_topic_info()
topics_df
# topics_df.to_csv('Topics_Model_1.csv', index=False)

Unnamed: 0,Topic,Count,Name
0,-1,267126,-1_not_trump_people_like
1,0,7952,0_vaccine_covid_mask_virus
2,1,7192,1_parler_twitter_follow_welcome
3,2,4796,2_fbi_barr_biden_hunter
4,3,3867,3_party_republican_cruz_not
5,4,3374,4_thank_bless_god_christmas
6,5,3322,5_god_lord_jesus_amen
7,6,2886,6_antifa_police_blm_supporter
8,7,2859,7_stopthesteal_maga_trump_wwg
9,8,2859,8_fox_news_tucker_newsmax
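
The counts above show that most posts are assigned to the outlier topic -1. A minimal sketch of how to quantify this follows, using only the counts from the rows displayed above. The helper name `outlier_share` is hypothetical, and any topic rows not displayed are not included, so the share is relative to the displayed rows only.

```python
def outlier_share(topic_counts):
    """Fraction of documents assigned to the outlier topic (-1)."""
    total = sum(topic_counts.values())
    if total == 0:
        return 0.0
    return topic_counts.get(-1, 0) / total


# Counts copied from the topic table rows shown above
topic_counts = {
    -1: 267126, 0: 7952, 1: 7192, 2: 4796, 3: 3867,
    4: 3374, 5: 3322, 6: 2886, 7: 2859, 8: 2859,
}

print(f"Outlier share among displayed rows: {outlier_share(topic_counts):.1%}")
```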


In [None]:
# Access all topics
all_topics = topic_model.get_topics()
all_topics

{-1: [('not', 0.037548003242796245),
  ('trump', 0.022637567900951755),
  ('people', 0.021581741655621565),
  ('like', 0.018390835757544525),
  ('get', 0.018210999814194197),
  ('need', 0.01768560521452715),
  ('would', 0.016780761158704215),
  ('president', 0.01640371639275503),
  ('election', 0.015875827868097152),
  ('one', 0.01560033672039805)],
 0: [('vaccine', 0.0958981741061146),
  ('covid', 0.07667463291047258),
  ('mask', 0.06003965946435323),
  ('virus', 0.05816901961587197),
  ('not', 0.048379833924601316),
  ('flu', 0.041692374195482336),
  ('people', 0.0318819179442111),
  ('death', 0.027456333120661723),
  ('get', 0.02459601980601283),
  ('test', 0.024164596297214656)],
 1: [('parler', 0.08594744382690121),
  ('twitter', 0.0629607830575142),
  ('follow', 0.05603188832769673),
  ('welcome', 0.05346702047994627),
  ('glad', 0.04784652726662846),
  ('account', 0.04461550108169735),
  ('post', 0.04442878819017714),
  ('facebook', 0.040528565737872425),
  ('content', 0.0343542

In [None]:
# Transform topics to a dataframe and save as CSV file
# Keep only the words of each topic and drop their scores
list_with_all_topics = [[word for word, _ in all_topics[key]] for key in all_topics]
print(list_with_all_topics)

# Derive the row labels from the topic ids instead of hard-coding them
topic_labels = ['-1' if key == -1 else f'Topic {key}' for key in all_topics]
topics_df = pd.DataFrame(list_with_all_topics, index=topic_labels,
                         columns=[f'Word {i}' for i in range(1, 11)])
topics_df.to_csv('Model_1_Topics_Complete.csv')

[['not', 'trump', 'people', 'like', 'get', 'need', 'would', 'president', 'election', 'one'], ['vaccine', 'covid', 'mask', 'virus', 'not', 'flu', 'people', 'death', 'get', 'test'], ['parler', 'twitter', 'follow', 'welcome', 'glad', 'account', 'post', 'facebook', 'content', 'not'], ['fbi', 'barr', 'biden', 'hunter', 'doj', 'flynn', 'not', 'epstein', 'cia', 'durham'], ['party', 'republican', 'cruz', 'not', 'senator', 'need', 'trump', 'gop', 'senate', 'vote'], ['thank', 'bless', 'god', 'christmas', 'happy', 'dan', 'you', 'well', 'keep', 'merry'], ['god', 'lord', 'jesus', 'amen', 'pray', 'truth', 'evil', 'not', 'prevail', 'light'], ['antifa', 'police', 'blm', 'supporter', 'capitol', 'protest', 'peaceful', 'capital', 'trump', 'not'], ['stopthesteal', 'maga', 'trump', 'wwg', 'electionfraud', 'wga', 'kag', 'voterfraud', 'election', 'trumptrain'], ['fox', 'news', 'tucker', 'newsmax', 'watch', 'hannity', 'watching', 'cnn', 'network', 'not'], ['nut', 'ball', 'like', 'not', 'hollywood', 'you', 'bi

In [None]:
# Extract representative docs for all topics
# representative_docs = topic_model.get_representative_docs()

# Extract representative docs of a specific topic
# representative_docs = topic_model.get_representative_docs(0)

# Extract representative docs for all topics as dataframe and save as CSV file
# Collect rows in a list first: DataFrame.append was removed in pandas 2.0
rows = []
for key in all_topics.keys():
    if key == -1:
        continue  # skip the outlier topic
    for representative_doc in topic_model.get_representative_docs(key):
        rows.append({'Topic': key, 'Representative Post': representative_doc})
all_topics_representative_docs_df = pd.DataFrame(rows, columns=['Topic', 'Representative Post'])
all_topics_representative_docs_df.to_csv('Model_1_Topics_Representative_Posts.csv')
all_topics_representative_docs_df

Unnamed: 0,Topic,Representative Post
0,0,please not not need get killed way many scient...
1,0,would president take vaccine virus
2,0,house still take vaccine
3,0,hell especially mask not work
4,0,guess mean mask actually work
...,...,...
1498,9,many child attitude left gotten insane
1499,9,intelligent year old attitude
1500,9,keep opening fuck natzi cocksucker
1501,9,bhahahaha keep voting cocksucker


Visualizations of Topics

In [None]:
# Visualize Topics -> Intertopic Distance Map
topic_model.visualize_topics()

In [None]:
# Visualize Topics -> Barchart
topic_model.visualize_barchart()

In [None]:
# Visualize Topics -> Hierarchy
topic_model.visualize_hierarchy()

In [None]:
"""
# Save topic model
topic_model.save("Model_1")
# Load saved model
loaded_model = BERTopic.load("Model_1") 
# Access single topic -> topic 0 = most frequent topic that was generated
topic_model.get_topic(0)
# Find topics most similar to a search_term
similar_topics, similarity = topic_model.find_topics("election", top_n=5)
topic_model.get_topic(similar_topics[0])
# Visualize Topic -> Similarity (Heatmap)
topic_model.visualize_heatmap()
"""