# BERTopic & ANTMN

### The following code is an example of combining [BERTopic](https://github.com/MaartenGr/BERTopic) with the [ANTMN methodology (Walter & Ophir, 2019)](https://github.com/DrorWalt/ANTMN).

### First import the packages

In [None]:
# standard library
import csv
import datetime
import io
import re
import string
import sys
import time
import unicodedata

# third-party packages
import hdbscan
import nltk
import numpy
import numpy as np
import pandas as pd
import requests
import scipy
from bertopic import BERTopic
from scipy import sparse
from scipy.sparse import csr_matrix, csc_matrix
from umap import UMAP

### Load the data from the GitHub link, and always inspect the data after loading

In [None]:
url = "https://raw.githubusercontent.com/aysedeniz09/bertmodels/main/data/Data_Class_ADL.csv" # Make sure the URL is the raw version of the file on GitHub
response = requests.get(url)
response.raise_for_status() # stop here if the download failed
download = response.content
df = pd.read_csv(io.StringIO(download.decode('utf-8')))
print(df.head())

### Load the embedding model you will use. For more options, check the [Hugging Face models](https://huggingface.co/models).

In [None]:
from flair.embeddings import TransformerDocumentEmbeddings
roberta = TransformerDocumentEmbeddings('roberta-base')

### Create a function to clean the text. You can also add stopword removal in this step.

In [None]:
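def text_clean(x):

    ### Light
    x = x.lower() # lowercase everything
    x = x.encode('ascii', 'ignore').decode()  # drop non-ASCII characters
    x = re.sub(r'https?\S+', ' ', x) # remove links (http and https)
    # cleaning up text
    x = re.sub(r'\'\w+', '', x) # remove apostrophe suffixes such as 's
    x = re.sub(r'\w*\d+\w*', '', x) # remove tokens that contain digits
    x = re.sub(r'\s{2,}', ' ', x) # collapse repeated whitespace
    x = re.sub(r'\s[^\w\s]\s', ' ', x) # remove stand-alone punctuation without gluing the neighbouring words together
    
    ### Heavy
    x = re.sub(r'@\S+', ' ', x) # remove @mentions
    x = re.sub(r'#\S+', ' ', x) # remove hashtags
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x) # replace punctuation with spaces
    # remove single letters and numbers surrounded by space
    x = re.sub(r'\s[a-z]\s|\s[0-9]\s', ' ', x)
    x = re.sub(r'\s+', ' ', x).strip() # collapse the whitespace left behind

    return x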
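
The heading above mentions that stopwords can be handled in this step. A minimal sketch of one way to do that, assuming a plain whitespace tokenizer; `CUSTOM_STOPWORDS` and `remove_stopwords` are hypothetical names for illustration, and the word list is a placeholder rather than a recommended one:

```python
# Hypothetical stopword list for illustration; replace it with your own
# list or with one from a package such as nltk.
CUSTOM_STOPWORDS = {"the", "a", "an", "and", "of", "to", "rt"}


def remove_stopwords(text, stopwords=CUSTOM_STOPWORDS):
    """Drop whitespace-separated tokens that appear in the stopword set."""
    tokens = text.split()
    return " ".join(tok for tok in tokens if tok not in stopwords)


example = remove_stopwords("rt the future of the news and the web")
print(example)
```

A call such as `remove_stopwords(x)` could be placed just before `return x` in `text_clean`. Keep in mind that transformer embeddings use sentence context, so heavy stopword removal before embedding may not always help.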

### Drop empty rows first, then convert the text to strings in preparation for BERTopic

In [None]:
train = df.copy() # work on a copy so the original df stays untouched
nan_value = float("NaN")
train.replace(["", " "], nan_value, inplace=True) # treat empty strings and single spaces as missing
train.dropna(subset=["originaltext"], inplace=True) # drop rows without text
train["originaltext"] = train["originaltext"].astype(str) # BERTopic expects strings
train.info() # check the dataframe
train.head() # check the dataframe again
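
Two pandas details matter in this step: `dropna` returns a new frame unless its result is assigned (or `inplace=True` is passed), and whitespace-only strings are not treated as missing. The sketch below illustrates both on a toy frame. It uses a boolean mask instead of `replace` plus `dropna`, which is a swapped-in technique rather than the notebook's own code; `toy`, `has_text`, and `cleaned` are hypothetical names, and only the `originaltext` column name comes from the data used here.

```python
import pandas as pd

# Toy frame standing in for the real data.
toy = pd.DataFrame({"originaltext": ["first post", "", " ", None, "second post"]})

# Without assignment or inplace=True this line has no effect on `toy`.
toy.dropna(subset=["originaltext"])
rows_still_present = len(toy)

# Keep rows whose text is neither missing nor empty/whitespace-only.
has_text = toy["originaltext"].notna() & toy["originaltext"].str.strip().ne("")
cleaned = toy[has_text].reset_index(drop=True)

print(rows_still_present)
print(cleaned)
```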

### Apply the text-cleaning function, then convert the cleaned text column to a list

In [None]:
train['cleaned_text'] = train.originaltext.apply(text_clean)
traintext = train.cleaned_text.to_list()

### Fit BERTopic. To run ANTMN afterwards, **calculate_probabilities=True** must be set!

In [None]:
start_time = time.time() # to check time
umap_model = UMAP(n_neighbors=2, n_components=2, 
                  min_dist=0.0, metric='cosine', random_state=5) # fixing random_state in UMAP makes the BERTopic results reproducible
topic_model = BERTopic(umap_model=umap_model, embedding_model=roberta, nr_topics=20, calculate_probabilities=True).fit(traintext)
# nr_topics must be an integer (20 here) or the string "auto"; the string "20" would not request 20 topics
# for this example I have set nr_topics to 20, however it is usually recommended to use "auto" and let BERTopic find the optimal number of topics
print("--- %s seconds ---" % (time.time() - start_time))

### Map the probabilities using the HDBSCAN clustering algorithm

In [None]:
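start_time = time.time() # to check time
# Alternative without refitting: recompute the membership vectors from the fitted HDBSCAN model.
# Its result would be overwritten by fit_transform below, so it is left commented out.
# Note that _map_probabilities is a private BERTopic method and may change between versions.
# probs = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model) # clustering algorithm
# probs = topic_model._map_probabilities(probs, original_topics=True)
topics, probs = topic_model.fit_transform(traintext) # refits the model; probs is the document-topic probability matrix
df_prob = pd.DataFrame(probs) # THIS IS THE DOCUMENT THAT WILL BE USED IN ANTMN
#topic_model.save("Bert_Model_Outputs/antmn_sample_v1") # to save the model for future use
print("--- %s seconds ---" % (time.time() - start_time))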

### At this step you can save df_prob as a CSV and run ANTMN in R with the R code, or continue in this script.

In [None]:
df_prob.to_csv("BERTopic_ANTMN_Probabilities.csv")
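
As a rough orientation for the Python route, the sketch below shows what ANTMN-style code typically does first with this matrix: treat each topic as a node and weight the tie between two topics by the cosine similarity of their columns. This is an assumption based on the Walter & Ophir supplemental R code, not the exact code of this notebook. `toy_probs` and `topic_cosine_similarity` are hypothetical names, and the tiny made-up matrix stands in for `df_prob`.

```python
import numpy as np


def topic_cosine_similarity(doc_topic):
    """Cosine similarity between the topic columns of a document-topic matrix."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    norms = np.linalg.norm(doc_topic, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for topics with no probability mass
    unit = doc_topic / norms  # scale every topic column to unit length
    return unit.T @ unit  # (n_topics, n_topics) similarity matrix


# Rows are documents, columns are topics (made-up values).
toy_probs = np.array([
    [0.8, 0.2, 0.0],
    [0.6, 0.4, 0.0],
    [0.0, 0.1, 0.9],
])

sim = topic_cosine_similarity(toy_probs)
print(np.round(sim, 3))
```

The resulting symmetric matrix can serve as a weighted adjacency matrix for a topic network. With the real data, `df_prob.to_numpy()` would take the place of `toy_probs`.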

### Save the topic names and frequencies to a CSV file

In [None]:
freq = topic_model.get_topic_info() 
freq.to_csv("BERTopic_ANTMN_TopicNamesandFreq.csv")

# BERTopic & ANTMN

## The method follows the supplemental code of Walter, D., & Ophir, Y. (2019). News Frame Analysis: An Inductive Mixed-Method Computational Approach. Communication Methods and Measures. https://doi.org/10.1080/19312458.2019.1639145

**Note: Unlike LDA, BERTopic does not connect all documents with each other. The SpinGlass algorithm therefore had to be removed.**

## Use the [R code](https://github.com/aysedeniz09/bertmodels/blob/f27341708deff031832b70d1dd3d3cba4a13ad2b/bert_antmn_R.md) to run ANTMN on the BERTopic objects. To continue in Python, follow the steps below.