<a href="https://colab.research.google.com/github/andreea-bodea/bachelors-thesis-informatics/blob/main/BT%20INFO%20-%20Model%204%3A%20TSDAE%20on%20Parler%26Gab%20%2B%20BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Model 4: TSDAE on Parler&Gab + BERTopic on Parler

---

TSDAE = Transformer-based Sequential Denoising Auto-Encoder

Unsupervised training method for SBERT (Sentence Transformers)

https://www.sbert.net/examples/unsupervised_learning/TSDAE/README.html

BERTopic with Custom Embeddings 

https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#visual-overview


In [None]:
%%capture
!pip install bertopic

In [None]:
%%capture
!pip install joblib==1.1.0

In [None]:
from bertopic import BERTopic 
from umap import UMAP

In [None]:
%%capture
!pip install sentence_transformers 
!pip install utils 

In [None]:
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Upload csv files with posts to Google Colab 
# Sample for training the sentence transformer: 50,000 posts sampled from gab_train.csv + 50,000 posts sampled from parleys_train.csv
# Sample for topic modelling: parleys_test.csv (~300,000 posts)
from google.colab import files
uploaded = files.upload()

Saving gab_train.csv to gab_train.csv
Saving parleys_test.csv to parleys_test.csv
Saving parleys_train.csv to parleys_train.csv


In [None]:
# Read csv files into pandas dataframes 
import pandas as pd
import io
gab_train = pd.read_csv(io.BytesIO(uploaded['gab_train.csv']))
parleys_train = pd.read_csv(io.BytesIO(uploaded['parleys_train.csv']))
sample_train = pd.concat([parleys_train.sample(n=50000, random_state=1, ignore_index=True), gab_train.sample(n=50000, random_state=1, ignore_index=True)]).reset_index(drop=True)
parleys_test = pd.read_csv(io.BytesIO(uploaded['parleys_test.csv']))

In [None]:
sample_train

Unnamed: 0,body
0,let sooner rather later
1,would rather colonoscopy receive award named c...
2,you allbreathe not over election getting start...
3,awww poor guy need attention special champ let...
4,going happen gitmo recently expanded billion o...
...,...
99995,apparently nick clegg book not well
99996,lord protect guide inspire president put divin...
99997,would fun maybe start selling cabelas brownell...
99998,society drift truth hate speak george orwell


In [None]:
parleys_test

Unnamed: 0,body
0,glad see parler free speech actually alive wel...
1,not enough year minimum
2,wonder kamalaharris blm think white guy placed...
3,agreed seemed like close race till inner city ...
4,well well abercrombie fitch president canada e...
...,...
309063,politician concerned covering ass not represen...
309064,rent kid hell barack mike rented them
309065,whom biden carry anything especially itcome pe...
309066,pedo fly head never lie


In [None]:
# Transform pandas dataframes to lists with posts 
posts_train = sample_train['body'].tolist()
Parler_posts_test = parleys_test['body'].tolist()

In [None]:
posts_train

In [None]:
Parler_posts_test

TSDAE

In [None]:
# Create the special denoising dataset that adds noise on-the-fly
dataset = datasets.DenoisingAutoEncoderDataset(posts_train)
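
The "noise" added by the denoising dataset is, to my understanding, random token deletion (the TSDAE documentation describes deleting roughly 60% of the tokens, using NLTK tokenization, which is why `punkt` is downloaded above). The sketch below only illustrates the idea: `delete_tokens` is a hypothetical helper, it splits on whitespace instead of using NLTK, and it is not the library's implementation.

```python
import random

def delete_tokens(text, del_ratio=0.6, rng=random):
    """Randomly drop about `del_ratio` of the whitespace-separated tokens (illustrative only)."""
    tokens = text.split()
    if not tokens:
        return text
    # Keep a token only if its random draw exceeds the deletion ratio
    kept = [tok for tok in tokens if rng.random() > del_ratio]
    if not kept:
        # Keep one token so the corrupted post is never empty
        kept = [rng.choice(tokens)]
    return " ".join(kept)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = "society drift truth hate speak george orwell"
noisy = delete_tokens(original, rng=rng)
print(noisy)
```

During training, the encoder sees the corrupted post and the decoder has to reconstruct the original one from the sentence embedding alone.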

In [None]:
# DataLoader to batch your data
dataloader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)

In [None]:
# Define your sentence transformer model (SBERT) using CLS pooling
bert = models.Transformer('bert-base-uncased')
pooling = models.Pooling(bert.get_word_embedding_dimension(), 'cls')
sentence_model = SentenceTransformer(modules=[bert, pooling])
sentence_model

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
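
The output above shows `pooling_mode_cls_token: True`, meaning the sentence embedding is the vector of the first (`[CLS]`) token instead of an average over all tokens. A minimal sketch with made-up 3-dimensional token vectors (the real model uses 768 dimensions) shows the difference; the helper names and numbers are hypothetical, and the mean version ignores padding masks for simplicity.

```python
# Toy token embeddings for one post: [CLS], two word tokens, [SEP] (values are made up)
token_embeddings = [
    [0.9, 0.1, 0.0],  # [CLS]
    [0.2, 0.4, 0.6],
    [0.0, 0.8, 0.2],
    [0.1, 0.1, 0.1],  # [SEP]
]

def cls_pooling(embeddings):
    """Use the first token's vector as the sentence embedding."""
    return embeddings[0]

def mean_pooling(embeddings):
    """Average the token vectors dimension by dimension."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

print(cls_pooling(token_embeddings))
print(mean_pooling(token_embeddings))
```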

In [None]:
# Use the denoising auto-encoder loss
loss = losses.DenoisingAutoEncoderLoss(sentence_model, tie_encoder_decoder=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.9.crossattention.self.value.bias', 'bert.encoder.layer.8.crossattention.self.value.bias', 'bert.encoder.layer.7.crossattention.self.value.bias', 'bert.encoder.layer.4.crossattention.self.query.weight', 'bert.encoder.lay

In [None]:
# Call the fit method
sentence_model.fit(
    train_objectives=[(dataloader, loss)], 
    epochs=1,
    weight_decay=0, 
    scheduler='constantlr', 
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True,
    use_amp=False
)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/12500 [00:00<?, ?it/s]
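
The 12500 iterations in the progress bar follow directly from the training sample size and the batch size used in the `DataLoader` above. A quick check, assuming all 100,000 sampled posts end up in the dataset:

```python
n_posts = 50_000 + 50_000   # Parler sample + Gab sample
batch_size = 8              # as passed to DataLoader
# drop_last=True discards a final partial batch, hence integer division
steps_per_epoch = n_posts // batch_size
print(steps_per_epoch)  # 12500
```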

In [None]:
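# Save the model
sentence_model.save('output/tsdae-parler&gab-bert-base-uncased')
# Load the saved model (e.g. in a later session)
# sentence_model = SentenceTransformer('output/tsdae-parler&gab-bert-base-uncased')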

BERTopic

In [None]:
# Prepare embeddings using the custom-trained Sentence-BERT model
embeddings = sentence_model.encode(Parler_posts_test, show_progress_bar=True)

Batches:   0%|          | 0/9659 [00:00<?, ?it/s]
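
The 9659 batches shown in the progress bar are consistent with the size of the test set. The check below assumes that `parleys_test` has 309,067 rows (its displayed row labels run from 0 to 309066) and that `encode` uses its default batch size of 32, since no `batch_size` is passed above.

```python
import math

n_posts = 309_067   # assumption: row labels 0..309066 in parleys_test
batch_size = 32     # assumption: default batch size of encode()
# The last batch may be partial, so round up
n_batches = math.ceil(n_posts / batch_size)
print(n_batches)  # 9659
```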

In [None]:
# Fix random_state in UMAP to remove its stochastic behavior, so that the results can be reproduced (at the expense of performance)
umap_model = UMAP(n_neighbors=15, n_components=5, 
                  min_dist=0.0, metric='cosine', random_state=42)
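
With `metric='cosine'`, UMAP compares embeddings by the angle between them and ignores their length. A minimal sketch of cosine distance on toy 2-dimensional vectors; `cosine_distance` is a hypothetical helper written for illustration, not a UMAP function.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Same direction, different length: distance 0
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
# Orthogonal vectors: distance 1
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))
```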

In [None]:
# Train topic model using the custom-trained embeddings
# Extract topics and generate probabilities
topic_model = BERTopic(nr_topics=10, umap_model=umap_model)
topics, probs = topic_model.fit_transform(Parler_posts_test, embeddings)

In [None]:
# Access information about all topics that were generated
# -1 refers to all outliers and should typically be ignored
topics_df = topic_model.get_topic_info()
topics_df
# topics_df.to_csv('Topics_Model_4.csv', index=False);

Unnamed: 0,Topic,Count,Name
0,-1,274126,-1_not_trump_people_like
1,0,9116,0_vaccine_covid_virus_mask
2,1,4410,1_god_jesus_lord_amen
3,2,3558,2_flynn_bless_christmas_thank
4,3,3542,3_bitch_shit_like_nut
5,4,2905,4_party_republican_rino_rinos
6,5,2686,5_fox_newsmax_news_tucker
7,6,2454,6_ballot_vote_machine_voting
8,7,2344,7_state_court_supreme_vote
9,8,1972,8_antifa_blm_police_defund
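
The outlier topic (-1) dominates this result. A short calculation with the counts from the table above makes the share explicit; it assumes the test set has 309,067 posts (row labels 0 to 309066 in `parleys_test`).

```python
# Counts copied from the topic table above
topic_counts = {-1: 274126, 0: 9116, 1: 4410, 2: 3558, 3: 3542,
                4: 2905, 5: 2686, 6: 2454, 7: 2344, 8: 1972}
n_posts = 309_067  # assumption: number of rows in parleys_test

outlier_share = topic_counts[-1] / n_posts
print(f"{outlier_share:.1%}")  # 88.7%
```

Under that assumption, only about one post in nine is assigned to an actual topic.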


In [None]:
# Access all topics
all_topics = topic_model.get_topics()
all_topics

{-1: [('not', 0.037280593197119946),
  ('trump', 0.023260193404010004),
  ('people', 0.021344620798374417),
  ('like', 0.018368399963159403),
  ('get', 0.018225703729306382),
  ('need', 0.01748407262069445),
  ('would', 0.01676577264373836),
  ('president', 0.016307235410565713),
  ('election', 0.015731810557054783),
  ('know', 0.01539472855997045)],
 0: [('vaccine', 0.09704500742757814),
  ('covid', 0.06899175361745483),
  ('virus', 0.058049405663955084),
  ('mask', 0.05585695820108425),
  ('not', 0.04692261595683422),
  ('flu', 0.037277643239231446),
  ('people', 0.031007270409651247),
  ('get', 0.025058889447149443),
  ('fauci', 0.024491163886572357),
  ('death', 0.02438880177761878)],
 1: [('god', 0.1050679551000328),
  ('jesus', 0.051806785956881325),
  ('lord', 0.04622892713862443),
  ('amen', 0.04421058375600297),
  ('pray', 0.034506757022166365),
  ('not', 0.03346462258986835),
  ('evil', 0.030227217588475526),
  ('prevail', 0.028704746654028394),
  ('christ', 0.025102843101559

In [None]:
# Transform topics to dataframe and save as CSV file
# Keep only the words of each topic (drop the c-TF-IDF scores)
list_with_all_topics = [[word for word, _ in all_topics[key]] for key in all_topics]
print(list_with_all_topics)

# Row labels follow the topic ids: '-1' for outliers, then 'Topic 0', 'Topic 1', ...
topics_df = pd.DataFrame(list_with_all_topics,
                         index=['-1' if key == -1 else f'Topic {key}' for key in all_topics],
                         columns=[f'Word {i}' for i in range(1, 11)])
topics_df.to_csv('Model_4_Topics_Complete.csv')
topics_df

[['not', 'trump', 'people', 'like', 'get', 'need', 'would', 'president', 'election', 'know'], ['vaccine', 'covid', 'virus', 'mask', 'not', 'flu', 'people', 'get', 'fauci', 'death'], ['god', 'jesus', 'lord', 'amen', 'pray', 'not', 'evil', 'prevail', 'christ', 'faith'], ['flynn', 'bless', 'christmas', 'thank', 'happy', 'god', 'general', 'merry', 'family', 'pardon'], ['bitch', 'shit', 'like', 'nut', 'lmao', 'fucking', 'face', 'not', 'piece', 'ankle'], ['party', 'republican', 'rino', 'rinos', 'new', 'trump', 'not', 'gop', 'patriot', 'vote'], ['fox', 'newsmax', 'news', 'tucker', 'watch', 'hannity', 'bye', 'oan', 'watching', 'network'], ['ballot', 'vote', 'machine', 'voting', 'recount', 'dead', 'dominion', 'not', 'audit', 'count'], ['state', 'court', 'supreme', 'vote', 'elector', 'scotus', 'penny', 'election', 'mike', 'not'], ['antifa', 'blm', 'police', 'defund', 'burn', 'capital', 'supporter', 'not', 'capitol', 'trump'], ['fbi', 'doj', 'barr', 'bill', 'swalwell', 'cia', 'russia', 'gate', 'w

Unnamed: 0,Word 1,Word 2,Word 3,Word 4,Word 5,Word 6,Word 7,Word 8,Word 9,Word 10
-1,not,trump,people,like,get,need,would,president,election,know
Topic 0,vaccine,covid,virus,mask,not,flu,people,get,fauci,death
Topic 1,god,jesus,lord,amen,pray,not,evil,prevail,christ,faith
Topic 2,flynn,bless,christmas,thank,happy,god,general,merry,family,pardon
Topic 3,bitch,shit,like,nut,lmao,fucking,face,not,piece,ankle
Topic 4,party,republican,rino,rinos,new,trump,not,gop,patriot,vote
Topic 5,fox,newsmax,news,tucker,watch,hannity,bye,oan,watching,network
Topic 6,ballot,vote,machine,voting,recount,dead,dominion,not,audit,count
Topic 7,state,court,supreme,vote,elector,scotus,penny,election,mike,not
Topic 8,antifa,blm,police,defund,burn,capital,supporter,not,capitol,trump


In [None]:
# Extract representative docs for all topics
# representative_docs = topic_model.get_representative_docs()

# Extract representative docs of a specific topic
# representative_docs = topic_model.get_representative_docs(0)

# Extract representative docs for all topics as dataframe and save as CSV file
# Rows are collected in a list first because DataFrame.append was removed in pandas 2.0
rows = []
for key in all_topics.keys():
    if key == -1:
        continue
    for representative_doc in topic_model.get_representative_docs(key):
        rows.append({'Topic': key, 'Representative Post': representative_doc})
all_topics_representative_docs_df = pd.DataFrame(rows, columns=['Topic', 'Representative Post'])
all_topics_representative_docs_df.to_csv('Model_4_Topics_Representative_Posts.csv')
all_topics_representative_docs_df

Unnamed: 0,Topic,Representative Post
0,0,tracking bill gate population control
1,0,covid not pandemic pandemic global socialism
2,0,two test procedure came back negative reading ...
3,0,idea pretend mask working pretend wearing one
4,0,good thee not know nothing worry about always ...
...,...,...
1261,9,fang fang wear mask
1262,9,either wife friend fang fang asking friend
1263,9,email sorryass obama hillary show knew hillary...
1264,9,fbi sold fisa court fake dossier authority fact


Visualizations of Topics

In [None]:
# Visualize Topics -> Intertopic Distance Map
topic_model.visualize_topics()

In [None]:
# Visualize Topics -> Barchart
topic_model.visualize_barchart()

In [None]:
# Visualize Topics -> Hierarchy
topic_model.visualize_hierarchy()

In [None]:
"""
# Save topic model
topic_model.save("Model_1")
# Load saved model
loaded_model = BERTopic.load("Model_1") 
# Access single topic -> topic 0 = most frequent topic that was generated
topic_model.get_topic(0)
# Find topics most similar to a search_term
similar_topics, similarity = topic_model.find_topics("election", top_n=5)
topic_model.get_topic(similar_topics[0])
# Visualize Topic -> Similarity (Heatmap)
topic_model.visualize_heatmap()
"""