
Model 3: TSDAE on Gab + BERTopic on Parler

---

TSDAE = Transformer-based Sequential Denoising Auto-Encoder

Unsupervised Training Method for SBERT = Sentence Transformers

https://www.sbert.net/examples/unsupervised_learning/TSDAE/README.html

BERTopic with Custom Embeddings 

https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#visual-overview


In [None]:
%%capture
!pip install bertopic

In [None]:
%%capture
!pip install joblib==1.1.0

In [None]:
from bertopic import BERTopic 
from umap import UMAP

In [None]:
%%capture
!pip install sentence_transformers 
!pip install utils 

In [None]:
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Upload the CSV files with posts to Google Colab
# Sample for training the sentence transformer: gab_train.csv (100,000 posts)
# Sample for topic modelling: parleys_test.csv (~300,000 posts)
from google.colab import files
uploaded = files.upload()

Saving gab_train.csv to gab_train.csv
Saving parleys_test.csv to parleys_test.csv


In [None]:
# Read csv files into pandas dataframes 
import pandas as pd
import io
gab_train = pd.read_csv(io.BytesIO(uploaded['gab_train.csv']))
parleys_test = pd.read_csv(io.BytesIO(uploaded['parleys_test.csv']))

In [None]:
gab_train

Unnamed: 0,body
0,traitor alert lindsey graham revealed secret p...
1,mean let not upset goober
2,dnc obey law dnc worker suing party failing pa...
3,thus reason altright right whole develop psych...
4,sweden police scared good reason yeah jesus de...
...,...
99995,delete duckduckgo messed computer
99996,coming alttech project take note
99997,next news network youtube view edward snowden ...
99998,would rather burn abandoned christian church l...


In [None]:
parleys_test

Unnamed: 0,body
0,glad see parler free speech actually alive wel...
1,not enough year minimum
2,wonder kamalaharris blm think white guy placed...
3,agreed seemed like close race till inner city ...
4,well well abercrombie fitch president canada e...
...,...
309063,politician concerned covering ass not represen...
309064,rent kid hell barack mike rented them
309065,whom biden carry anything especially itcome pe...
309066,pedo fly head never lie


In [None]:
# Transform pandas dataframes to lists with posts 
posts_train = gab_train['body'].tolist()
Parler_posts_test = parleys_test['body'].tolist()

In [None]:
# Preview the first few posts instead of printing the whole list
posts_train[:5]

In [None]:
# Preview the first few posts instead of printing the whole list
Parler_posts_test[:5]

TSDAE

In [None]:
# Create the special denoising dataset that adds noise on-the-fly
dataset = datasets.DenoisingAutoEncoderDataset(posts_train)
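
The "noise" added by the denoising dataset is, as far as the sentence-transformers documentation describes it, random token deletion (with a deletion ratio of about 0.6 by default). The following is only a minimal sketch of that idea, not the library's implementation: `delete_noise` is a hypothetical helper, and it splits on whitespace rather than using the NLTK tokenizer.

```python
import random

def delete_noise(text, del_ratio=0.6, rng=random):
    """Randomly drop roughly `del_ratio` of the whitespace-separated tokens."""
    words = text.split()
    if not words:
        return text
    # Keep each token with probability (1 - del_ratio)
    kept = [w for w in words if rng.random() > del_ratio]
    if not kept:
        # Always keep at least one token so the encoder input is never empty
        kept = [rng.choice(words)]
    return " ".join(kept)

rng = random.Random(0)  # fixed seed only to make the illustration repeatable
original = "delete duckduckgo messed computer"
noisy = delete_noise(original, rng=rng)
print(noisy)
```

During TSDAE training the encoder sees the damaged sentence and the decoder has to reconstruct the original one from the sentence embedding alone.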

In [None]:
# DataLoader to batch your data
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)

In [None]:
# Define your sentence transformer model (SBERT) using CLS pooling
bert = models.Transformer('bert-base-uncased')
pooling = models.Pooling(bert.get_word_embedding_dimension(), 'cls')

sentence_model = SentenceTransformer(modules=[bert, pooling])
sentence_model


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
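
To illustrate what the `'cls'` pooling mode selected above means, here is a small sketch in plain Python. The token vectors are made-up toy values and both helper functions are hypothetical; the real Pooling module works on batched tensors.

```python
def cls_pooling(token_embeddings):
    """Use the vector of the first token ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])

def mean_pooling(token_embeddings, attention_mask):
    """Average the vectors of all non-padding tokens (shown for comparison)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy example: 3 real tokens plus 1 padding token, embedding dimension 2
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

cls_vec = cls_pooling(token_embeddings)
mean_vec = mean_pooling(token_embeddings, attention_mask)
print(cls_vec)   # [1.0, 2.0]
print(mean_vec)  # [3.0, 4.0]
```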

In [None]:
# Use the denoising auto-encoder loss
loss = losses.DenoisingAutoEncoderLoss(sentence_model, tie_encoder_decoder=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.3.crossattention.self.value.weight', 'bert.encoder.layer.3.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.4.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.3.crossattention.self.key.bias', 'bert

In [None]:
# Call the fit method
sentence_model.fit(
    train_objectives=[(dataloader, loss)], 
    epochs=1,
    weight_decay=0, 
    scheduler='constantlr', 
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True,
    use_amp=False
)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/12500 [00:00<?, ?it/s]
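
The 12500 iterations in the progress bar follow from the dataset size and the DataLoader settings. A short check, assuming gab_train.csv holds exactly 100,000 posts as stated in the upload cell (`steps_per_epoch` is a hypothetical helper):

```python
def steps_per_epoch(num_examples, batch_size, drop_last=True):
    """Number of batches a DataLoader yields in one epoch."""
    full_batches, remainder = divmod(num_examples, batch_size)
    if drop_last or remainder == 0:
        return full_batches
    return full_batches + 1  # the last, smaller batch is kept

train_steps = steps_per_epoch(100_000, 8)
print(train_steps)  # 12500
```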

In [None]:
# Save the model
sentence_model.save('output/tsdae-gab-bert-base-uncased')
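# Load the saved model (e.g. in a later session)
# sentence_model = SentenceTransformer('output/tsdae-gab-bert-base-uncased')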

BERTopic

In [None]:
# Prepare embeddings using the custom-trained Sentence-BERT model
embeddings = sentence_model.encode(Parler_posts_test, show_progress_bar=True)

Batches:   0%|          | 0/9659 [00:00<?, ?it/s]
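
The batch count in the progress bar can be cross-checked in the same way. This sketch assumes 309,067 Parler posts (rows 0 to 309066 in parleys_test) and an encoding batch size of 32, which I believe is the default of `encode` but is not set explicitly in this notebook.

```python
import math

num_posts = 309_067     # assumption: rows 0..309066 of parleys_test
encode_batch_size = 32  # assumption: default batch size of encode()

# Unlike the training DataLoader, the last partial batch is not dropped
num_batches = math.ceil(num_posts / encode_batch_size)
print(num_batches)  # 9659
```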

In [None]:
# Set a random_state in UMAP to remove its stochastic behavior and make the results reproducible (at the expense of performance)
umap_model = UMAP(n_neighbors=15, n_components=5, 
                  min_dist=0.0, metric='cosine', random_state=42)

In [None]:
# Train the topic model using the custom-trained embeddings
# Extract topics and generate probabilities
topic_model = BERTopic(nr_topics=10, umap_model=umap_model)
topics, probs = topic_model.fit_transform(Parler_posts_test, embeddings)

In [None]:
# Access information about all topics that were generated
# -1 refers to all outliers and should typically be ignored
topics_df = topic_model.get_topic_info()
topics_df
# topics_df.to_csv('Topics_Model_3.csv', index=False);

Unnamed: 0,Topic,Count,Name
0,-1,274471,-1_not_trump_people_get
1,0,6320,0_biden_joe_president_hunter
2,1,4092,1_parler_echo_welcome_glad
3,2,3804,2_party_republican_rinos_rino
4,3,3762,3_twitter_facebook_account_rumble
5,4,3592,4_god_praying_amen_jesus
6,5,3172,5_boy_basement_proud_like
7,6,2897,6_china_chinese_virus_biden
8,7,2455,7_not_people_country_reset
9,8,2324,8_fox_newsmax_news_watch
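
The table shows that topic -1 (outliers) holds most of the posts. A quick calculation of that share, assuming the total is 309,067 posts (rows 0 to 309066 of parleys_test) and taking the outlier count from the table above:

```python
outlier_count = 274_471  # Count for topic -1 in the table above
total_posts = 309_067    # assumption: rows 0..309066 of parleys_test

outlier_share = outlier_count / total_posts
print(f"{outlier_share:.1%} of the posts were not assigned to any topic")
```

A share this high suggests the named topics describe only a small part of the corpus, which is worth keeping in mind when interpreting them.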


In [None]:
# Access all topics
all_topics = topic_model.get_topics()
all_topics

{-1: [('not', 0.03748692706485034),
  ('trump', 0.02262379070419398),
  ('people', 0.0218075897876576),
  ('get', 0.018198159834632317),
  ('like', 0.017852049051464404),
  ('need', 0.017472798903113203),
  ('would', 0.016698384138847992),
  ('election', 0.015748169406708694),
  ('one', 0.015635167420000467),
  ('president', 0.01555573756948953)],
 0: [('biden', 0.1914852641592044),
  ('joe', 0.12549927230194727),
  ('president', 0.054760039490187186),
  ('hunter', 0.054146139457599246),
  ('concede', 0.05157496593462588),
  ('not', 0.0499141727110443),
  ('kamala', 0.04118277245419473),
  ('trump', 0.03743797462576485),
  ('harris', 0.03466446407308461),
  ('sleepy', 0.025421120791363317)],
 1: [('parler', 0.24727091002719798),
  ('echo', 0.08311944928926272),
  ('welcome', 0.07872570048962546),
  ('glad', 0.07220895710092726),
  ('follow', 0.06864688924803998),
  ('post', 0.05145175987698095),
  ('massupvote', 0.046905428751352174),
  ('content', 0.043071195257295726),
  ('maga', 0.0

In [None]:
# Transform topics to a dataframe and save as CSV file
# Keep only the words of each topic and drop their scores
list_with_all_topics = [[word for word, _ in words] for words in all_topics.values()]
print(list_with_all_topics)

# Build the row labels from the topic ids instead of hard-coding them
topics_df = pd.DataFrame(
    list_with_all_topics,
    index=['-1' if key == -1 else f'Topic {key}' for key in all_topics],
    columns=[f'Word {i}' for i in range(1, 11)],
)
topics_df.to_csv('Model_3_Topics_Complete.csv')

[['not', 'trump', 'people', 'get', 'like', 'need', 'would', 'election', 'one', 'president'], ['biden', 'joe', 'president', 'hunter', 'concede', 'not', 'kamala', 'trump', 'harris', 'sleepy'], ['parler', 'echo', 'welcome', 'glad', 'follow', 'post', 'massupvote', 'content', 'maga', 'truly'], ['party', 'republican', 'rinos', 'rino', 'not', 'democrat', 'mitch', 'kemp', 'cruz', 'new'], ['twitter', 'facebook', 'account', 'rumble', 'tech', 'not', 'duck', 'big', 'parlor', 'deleted'], ['god', 'praying', 'amen', 'jesus', 'lord', 'pray', 'prevail', 'not', 'evil', 'prayer'], ['boy', 'basement', 'proud', 'like', 'nut', 'not', 'bitch', 'little', 'back', 'fuck'], ['china', 'chinese', 'virus', 'biden', 'beijing', 'swalwell', 'fang', 'communist', 'spy', 'not'], ['not', 'people', 'country', 'reset', 'order', 'world', 'ppl', 'america', 'american', 'one'], ['fox', 'newsmax', 'news', 'watch', 'oan', 'tucker', 'watching', 'network', 'cnn', 'oann'], ['antifa', 'blm', 'police', 'supporter', 'defund', 'peaceful

In [None]:
# Extract representative docs for all topics
# representative_docs = topic_model.get_representative_docs()

# Extract representative docs of a specific topic
# representative_docs = topic_model.get_representative_docs(0)

# Extract representative docs for all topics as dataframe and save as CSV file
# Rows are collected in a list first because DataFrame.append was removed in pandas 2.0
rows = []
for key in all_topics.keys():
    if key == -1:
        continue  # skip the outlier topic
    for representative_doc in topic_model.get_representative_docs(key):
        rows.append({'Topic': key, 'Representative Post': representative_doc})
all_topics_representative_docs_df = pd.DataFrame(rows, columns=['Topic', 'Representative Post'])
all_topics_representative_docs_df.to_csv('Model_3_Topics_Representative_Posts.csv')
all_topics_representative_docs_df

Unnamed: 0,Topic,Representative Post
0,0,well trumpers not accept biden biden accept ge...
1,0,dead people voted biden
2,0,biden never ever president
3,0,million president doctor trump voter subtract ...
4,0,believe chinajoe got million vote president tr...
...,...,...
1372,9,person shot unarmed woman unacceptable
1373,9,help shoot unarmed woman traitor
1374,9,utter disgrace see difference treatment ashli ...
1375,9,ashli babbitt may forever rest peace


Visualizations of Topics

In [None]:
# Visualize Topics -> Intertopic Distance Map
topic_model.visualize_topics()

In [None]:
# Visualize Topics -> Barchart
topic_model.visualize_barchart()

In [None]:
# Visualize Topics -> Hierarchy
topic_model.visualize_hierarchy()

In [None]:
"""
# Save topic model
topic_model.save("Model_1")
# Load saved model
loaded_model = BERTopic.load("Model_1") 
# Access single topic -> topic 0 = most frequent topic that was generated
topic_model.get_topic(0)
# Find topics most similar to a search_term
similar_topics, similarity = topic_model.find_topics("election", top_n=5)
topic_model.get_topic(similar_topics[0])
# Visualize Topic -> Similarity (Heatmap)
topic_model.visualize_heatmap()
"""