# BERTopic & ANTMN
This notebook shows how to combine [BERTopic](https://github.com/MaartenGr/BERTopic) with the ANTMN methodology [(Walter & Ophir, 2019)](https://github.com/DrorWalt/ANTMN).

1. First, import the packages.

In [3]:
import csv
import datetime
import io
import re
import string
import sys
import time
import unicodedata

import hdbscan
import nltk
import numpy
import numpy as np
import pandas as pd
import requests
import scipy
from bertopic import BERTopic
from scipy import sparse
from scipy.sparse import csr_matrix, csc_matrix
from umap import UMAP

2. Load the data from the GitHub link, and always inspect the data before continuing.

In [5]:
url = "https://raw.githubusercontent.com/aysedeniz09/bertmodels/main/data/Data_Class_ADL.csv"  # make sure the url points to the raw version of the file on GitHub
download = requests.get(url).content
df = pd.read_csv(io.StringIO(download.decode('utf-8')))
print(df.head())  # check the first rows

   Unnamed: 0   index        date  source.domain  \
0           1   98474  2021-11-21    foxnews.com   
1           2   23319  2021-07-29    foxnews.com   
2           3  144569  2021-04-16  dailywire.com   
3           4   38059  2021-08-26      abc13.com   
4           5   97919  2021-11-28     silive.com   

                                        originaltext  
0  chicago mayor needs to dump police boss if <U+...  
1  randi weingarten ripped after telling msnbc 'w...  
2  pfizer ceo: third covid vaccine dose <U+0091>l...  
3  texas a&m researchers develop treatment to hel...  
4  nyc civil service exam: these applications are...  


3. Load the embedding model you will use. For more options, check the [Hugging Face models](https://huggingface.co/models).

In [6]:
from flair.embeddings import TransformerDocumentEmbeddings
roberta = TransformerDocumentEmbeddings('roberta-base')

4. Create a function to clean the text. You can also remove stopwords in this step.

In [7]:
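def text_clean(x):
    """Lowercase a document and strip links, mentions, hashtags, digits, and punctuation."""

    ### Light
    x = x.lower()  # lowercase everything
    x = x.encode('ascii', 'ignore').decode()  # drop non-ASCII characters
    x = re.sub(r'https?\S+', ' ', x)  # remove links
    # cleaning up text
    x = re.sub(r'\'\w+', '', x)  # remove apostrophe suffixes such as 's
    x = re.sub(r'\w*\d+\w*', '', x)  # remove tokens that contain digits
    x = re.sub(r'\s[^\w\s]\s', ' ', x)  # remove stand-alone punctuation marks without gluing words together

    ### Heavy
    x = re.sub(r'@\S+', ' ', x)  # remove whole @mentions, not just their first character
    x = re.sub(r'#\S+', ' ', x)  # remove hashtags
    x = re.sub('[%s]' % re.escape(string.punctuation), ' ', x)  # remove remaining punctuation
    # remove single letters and digits surrounded by whitespace
    x = re.sub(r'\s[a-z]\s|\s[0-9]\s', ' ', x)
    x = re.sub(r'\s{2,}', ' ', x).strip()  # collapse repeated whitespace last

    return x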
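
The step above notes that stopwords can be handled here as well. A minimal sketch of one way to do that, assuming a hand-picked stopword set, is shown below. `EXTRA_STOPWORDS` and `remove_stopwords` are hypothetical names used for illustration only; they are not part of the original notebook.

```python
# Hypothetical stopword set for illustration; replace it with your own list.
EXTRA_STOPWORDS = {"the", "a", "an", "to", "of", "and"}


def remove_stopwords(text, stopwords=EXTRA_STOPWORDS):
    """Drop whitespace-separated tokens that appear in `stopwords`."""
    tokens = text.split()
    return " ".join(tok for tok in tokens if tok not in stopwords)


example = remove_stopwords("the mayor needs to dump the police boss")
print(example)
```

A helper like this could be called just before `return x` in `text_clean`, or applied to the cleaned column afterwards.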

5. Drop the empty rows so that only documents with text are passed to BERTopic.

In [9]:
train = df.copy()  # work on a copy so the original df stays untouched
nan_value = float("NaN")
# treat empty and single-space strings as missing values
train.replace(["", " "], nan_value, inplace=True)
# drop every row that has no text
train.dropna(subset=["originaltext"], inplace=True)
train.info()  # check the dataframe
train.head()  # check the dataframe again

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     1500 non-null   int64 
 1   index          1500 non-null   int64 
 2   date           1500 non-null   object
 3   source.domain  1500 non-null   object
 4   originaltext   1500 non-null   object
dtypes: int64(2), object(3)
memory usage: 58.7+ KB


|   | Unnamed: 0 | index | date | source.domain | originaltext |
|---|---|---|---|---|---|
| 0 | 1 | 98474 | 2021-11-21 | foxnews.com | chicago mayor needs to dump police boss if <U+... |
| 1 | 2 | 23319 | 2021-07-29 | foxnews.com | randi weingarten ripped after telling msnbc 'w... |
| 2 | 3 | 144569 | 2021-04-16 | dailywire.com | pfizer ceo: third covid vaccine dose <U+0091>l... |
| 3 | 4 | 38059 | 2021-08-26 | abc13.com | texas a&m researchers develop treatment to hel... |
| 4 | 5 | 97919 | 2021-11-28 | silive.com | nyc civil service exam: these applications are... |
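
As a rough alternative to the cleanup in step 5, the sketch below drops rows whose text is missing or blank in a single pass. It runs on a made-up toy frame, and `drop_empty_text` is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd


def drop_empty_text(frame, column="originaltext"):
    """Return a copy of `frame` without rows whose `column` is missing or blank."""
    text = frame[column]
    # A row counts as empty if it is NaN/None or contains only whitespace.
    is_blank = text.isna() | text.astype(str).str.strip().eq("")
    return frame.loc[~is_blank].reset_index(drop=True)


# Toy data, invented for illustration.
toy = pd.DataFrame({"originaltext": ["some headline", "", "   ", None, "another one"]})
kept = drop_empty_text(toy)
print(len(kept))
```

Unlike replacing only `""` and `" "`, this also catches strings made of several spaces or tabs.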


6. Apply the text-cleaning function, then convert the cleaned text column to a list.

In [21]:
train['cleaned_text'] = train.originaltext.apply(text_clean)
traintext = train.cleaned_text.to_list()

6. Fit the BERTopic model. To run ANTMN afterwards, you must set **calculate_probabilities=True**!

In [22]:
start_time = time.time()  # to time the run
# fixing random_state makes UMAP, and therefore BERTopic, reproducible
umap_model = UMAP(n_neighbors=2, n_components=2,
                  min_dist=0.0, metric='cosine', random_state=5)
# nr_topics expects an integer (or "auto"), not the string "20"
topic_model = BERTopic(umap_model=umap_model, embedding_model=roberta, nr_topics=20, calculate_probabilities=True).fit(traintext)
# for this example I have set nr_topics to 20; usually it is recommended to leave it as "auto" and let BERTopic find the optimal number of topics
print("--- %s seconds ---" % (time.time() - start_time))

--- 230.85283041000366 seconds ---


7. Map the document-topic probabilities using the HDBSCAN clustering algorithm.

In [23]:
start_time = time.time()  # to time the run
probs = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model)  # soft cluster memberships from HDBSCAN
probs = topic_model._map_probabilities(probs, original_topics=True)  # map the memberships to BERTopic's topics
# note: fit_transform refits the model and overwrites the probs computed in the two lines above
topics, probs = topic_model.fit_transform(traintext)
df_prob = pd.DataFrame(probs)  # THIS IS THE DOCUMENT-TOPIC MATRIX THAT WILL BE USED IN ANTMN
#topic_model.save("Bert_Model_Outputs/antmn_sample_v1")  # uncomment to save the model for future use
print("--- %s seconds ---" % (time.time() - start_time))

--- 240.29081916809082 seconds ---


8. At this step you can save *df_prob* as a CSV and switch to the R code to run ANTMN in R, or continue in this script.

In [25]:
df_prob.to_csv("BERTopic_ANTMN_Probabilities.csv")
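
One detail worth keeping in mind when the CSV is read back, in Python or in R: `to_csv` writes the row index as an extra first column by default. The sketch below uses a made-up two-topic matrix and an in-memory buffer instead of a file to show the effect; `toy_prob` and the other names are illustrative only.

```python
import io

import pandas as pd

# Toy document-topic matrix, invented for illustration.
toy_prob = pd.DataFrame([[0.7, 0.2], [0.1, 0.8]])

buffer = io.StringIO()
toy_prob.to_csv(buffer)  # the row index is written as an unnamed first column

buffer.seek(0)
reloaded_default = pd.read_csv(buffer)  # the index comes back as "Unnamed: 0"

buffer.seek(0)
reloaded_indexed = pd.read_csv(buffer, index_col=0)  # the index is restored

print(list(reloaded_default.columns))
print(list(reloaded_indexed.columns))
```

Passing `index_col=0` when reading, or `index=False` when writing, keeps the index from being mistaken for a topic column.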

9. Save the topic names and frequencies to a CSV file.

In [26]:
freq = topic_model.get_topic_info() 
freq.to_csv("BERTopic_ANTMN_TopicNamesandFreq.csv")

## BERTopic & ANTMN

The method comes from the supplemental code of Walter, D., & Ophir, Y. (2019). News Frame Analysis: An Inductive Mixed-Method Computational Approach. Communication Methods and Measures. https://doi.org/10.1080/19312458.2019.1639145.

### Note: Unlike LDA, BERTopic does not connect all documents with each other, so the resulting network can be disconnected. I therefore had to remove the SpinGlass algorithm.
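
To make the note above concrete, here is a minimal sketch under stated assumptions. As I understand the cited supplemental code, ANTMN links topics by the cosine similarity of their document loadings, and to my knowledge igraph's SpinGlass routine only works on connected graphs. The toy matrix and the name `topic_cosine_network` are hypothetical; this is not the authors' function.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def topic_cosine_network(doc_topic):
    """Return a topic-by-topic cosine similarity matrix with a zero diagonal."""
    theta = np.asarray(doc_topic, dtype=float)
    norms = np.linalg.norm(theta, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero topic columns
    unit = theta / norms
    sim = unit.T @ unit
    np.fill_diagonal(sim, 0.0)  # no self-loops in the network
    return sim


# Toy document-topic matrix (4 documents x 3 topics), invented for illustration.
toy_probs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.7],
])

sim = topic_cosine_network(toy_probs)
# Treat any positive similarity as an edge and count the connected components.
adjacency = csr_matrix((sim > 0).astype(int))
n_components, labels = connected_components(adjacency, directed=False)
print(n_components)
```

In this toy case the third topic never shares a document with the other two, so the network splits into more than one component, which is the situation a community detection method has to tolerate.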

1. Install packages

In [28]:
!pip install igraph
!pip install latent-semantic-analysis

Collecting igraph
  Downloading igraph-0.10.1-cp39-abi3-win_amd64.whl (2.9 MB)
Collecting texttable>=1.6.2
  Downloading texttable-1.6.4-py2.py3-none-any.whl (10 kB)
Installing collected packages: texttable, igraph
Successfully installed igraph-0.10.1 texttable-1.6.4
Collecting latent-semantic-analysis
  Downloading latent_semantic_analysis-0.1.0-py3-none-any.whl (10 kB)
Installing collected packages: latent-semantic-analysis
Successfully installed latent-semantic-analysis-0.1.0


2. Load packages

3. Write the function