In [None]:
import tensorflow_docs.vis.embed as embed
embed.embed_file('coversheet.png')

# Modelling changes in Bitcoin tweets over time

## Abstract

Bitcoin's prominence as a leading cryptocurrency has prompted a huge amount of discource on social media platforms. However, its price is also known to be volatile, with fluctations caused by a wide range of factors. This project aims to uncover the topics relating Bitcoin in tweets, and explore how these topics vary over time and whether these changes align when compared the Bitcoin's trade value.

The results of this have been... 

## Introduction and literature review

Bitcoin has proven to be the most widely known and heavily traded cryptocurrency, with recent research from the Financial Conduct Authority (FCA) revealing that 78% of respondants surveyed had heard of it and that 60% of those trading in cryptocurrencies held the asset [[1]](#ref1). Because of this rise to prominence, Bitcoin has become economically and culturally significant. However, it has also generated a large amount of controversies and criticism stemming from its negative impact on the environment, lack of regulation and from its price volatility making investments risky [[2]](#ref2). All of these properties generate discourse and fuel discussions on social media, which can be used as a yardstick to measuring public perception towards Bitcoin as news and changes are reported. 

The key themes within these discussions can be summarised using topic modelling techniques. There is extensive literature detailing the effectiveness of different methods on social media data. One study used Latent Dirichlet Allocation (LDA) to uncover hidden topics in global warming tweets [[3]](#ref3). However, this paper didn't test other models or tailor the process to the data. Ortu et al produced a more detailed review, using LDA in conjunction with the Hawkes model on cryptocurrecy related Reddit posts to identify topic changes which preceded price fluctuations [[4]](#ref4). However, while LDA is very common in topic modelling, its effectiveness on short form data is called into question by a paper by Curiskis et al. A variety of feature representation methods and clustering techniques were tested on tweets, finding that doc2vec word embeddings outperforming tf-idf (term frequency- inverse word frequency), and k means clustering was more effective than LDA on shorter text lengths [[5]](#ref4). 

More recent word embeddings based models show promise, but also need to be able to monitor changes in topics over time. Bernie et al proposed first identifying word frequency changes during periods of intense crypto price volatility, then applying word2vec models to measuring how distinct the resulting topics were using cosine similarity [[6]](#ref6). While this did find interpretable topics, it also relied on identifying specific time periods of interest in advance, and may less effective when applied over wider time scales. Finally, another study tested the properties and suitability of several different model types for Twitter data, with BERTopic being noted both for its ability to create relevant topics and its adaptability allowing for use as a dynamic topic model [[7]](#ref7).

Research has also been conducted into the best methods to process tweets for modelling. While they provide a huge number of observations for study, each individual post is by nature is very short, making the extraction of meaningful topics difficult. As well as this, they are known to include high levels of noise, which can be a barrier to effective modelling. Belal et al proposes a solution to this by detailing a number of preprocessing steps in order to improve the quality of short text data, which lead to measurable improvements in classification models on clensed data compared to uncleansed [[8]](#ref8).

Given the wide range of methods available, I decided to test how LDA compared against BERTopic to determine the best model to produce interpretable results for my specific data.  

In [1]:
#Libraries
import pandas as pd
import numpy as np
from tqdm import tqdm

#cleaning
import emoji
import gensim
import re
from gensim.parsing.preprocessing import strip_tags
import spacy
import contractions
import better_profanity

#Analysis
from itertools import chain
from collections import Counter
import networkx as nx

#Modelling
import gensim.downloader as api
from bertopic import BERTopic
from hdbscan import HDBSCAN
from bertopic.vectorizers import ClassTfidfTransformer
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

#setup
nlp = spacy.load("en_core_web_md") #Note- first install model with pip
tqdm.pandas() # progress bar for pipeline

#Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

import warnings
warnings.filterwarnings("ignore")#remove warnings for html output




## Data

I chose to use a large dataset of 4,689,354 tweets which were categorised using the Bitcoin hashtag [[9]](#ref9). I chose this as it included observations over a relatively long time range (from February 2021- January 2023). However, full analysis on this much data would be prohibitively resource intensive and unnecessary for the purposes of this report. Therefore, I examined the data in more detail to retain only the relevant data. The date and tweets themselves were the only useful variables so I adapted the 'read_csv' function to reduce processing times. The data contains no missing values but converting the date format revealed 66 erroneous values to remove.  

After examining a random sample of 500 tweets I determined approximately 25% of observations were spam. These took the form of promotional content, links to likely malicious webpages and autogenerated price reports. The prevalence of spam in tweets is well documented and detecting it could form a full project in its own right [[10]](#ref10). However, given the time constraints I instead aimed to filter out as much as possible by removing all duplicates, tweets containing some specific hashtags or which had more than 10 hashtags in total, and used regex patterns to remove trading price tweets. While this may excluse some genuine posts, there were enough observations in the dataset to allow for greedy filtering steps to extract the highest quality data.  

The maximum character limit for tweets prior to February 2023 was 280 [[11]](#ref11). While there are longer strings in the dataset, these are caused by high numbers of mentions. Unlikely higher numbers of hashtags, this didn't necessarily indicate spam, so I retained these to be cleaned later. 

I then took a sample of the remaining documents, giving an even number of 1000 for each date. Some dates included did not contain enough tweets, so were removed entirely.

The next step was to handle the noise present within the tweets themselves, including emojis. At first my instinct was to remove the emojis completely, but in the context of Bitcoin they could convey insightful information. For example, the rocket emoji conveying excitment at increasing prices. Instead, I used the 'emoji' package to convert the utf-8 to a text description. I then  removed URLs, HTML tags and mentions, but not hashtagged words as these can convey meaning [[8 pg. 98]]. The informal nature of tweets make slang terms and profanity common so I used the 'better-profanity' library to clean these. These processes were very time intensive, but should result in more interpretable topics. 

Some of the models then required additional cleaning steps. These involved using the 'spaCY' library to tokenise, remove stopwords, lemmatise the words to their base form, and retain only the nouns, verbs and adjectives from the text. These steps aren't suitable for the BERTopic model as they destroy the sentence structure, but are beneficial for the LDA model, so these formed an additional 'tokens' variable. 

Finally, in order to compare the changes in topics to changes in Bitcoin value over time I sourced a second dataset from Kaggle which gave the closing prices of Bitcoin over the same time range [[12]]. 

#### Examine data

In [2]:
#Path to file
path = "./Data/Bitcoin_tweets.csv"

#Line terminator required to prevent read error, usecols to extract only relevant information 
df = pd.read_csv(path, lineterminator='\n', parse_dates=['date'], usecols = [8,9])
display(df.head())
print("Number of rows: " + str(df.shape[0]) + "\n",
      "Number of columns: " + str(df.shape[1]))

Unnamed: 0,date,text
0,2021-02-10 23:59:04,Blue Ridge Bank shares halted by NYSE after #bitcoin ATM announcement https://t.co/xaaZmaJKiV @MyBlueRidgeBank… https://t.co/sgBxMkP1SI
1,2021-02-10 23:58:48,"😎 Today, that's this #Thursday, we will do a ""🎬 Take 2"" with our friend @LeoWandersleb, #Btc #wallet #security expe… https://t.co/go6aDgRml5"
2,2021-02-10 23:54:48,"Guys evening, I have read this article about BTC and would like to share with you all - https://t.co/QxCZgmuy3B… https://t.co/o6wn7ppkVY"
3,2021-02-10 23:54:33,$BTC A big chance in a billion! Price: \4872644.0 (2021/02/11 08:51) #Bitcoin #FX #BTC #crypto
4,2021-02-10 23:54:06,"This network is secured by 9 508 nodes as of today. Soon, the biggest bears will recognise: #BTC in too big to fail… https://t.co/1XovDA8rKw"


Number of rows: 4689354
 Number of columns: 2


In [3]:
#Find characterlengths
print("Max character length: " + str(df["text"].str.len().max()) + "\n" +
      "Shortest length: " +  str(df["text"].str.len().min()) + "\n" +
       "Average length: " + str(df["text"].str.len().mean().round(0)))

Max character length: 987
Shortest length: 5
Average length: 181.0


In [4]:
#Longer character lengths from mentions -will be removed in future cleaning steps
df[df["text"].str.len() > 500].head(3)

Unnamed: 0,date,text
80423,2021-04-23 11:34:14,@moonMacawNFT @NFTarantulas @RedPandasNFT @StrayAliens @BitAliens @ProudSquirrel @The_HashPunks @MoonicornsNFT @MonstrrrNFT @BastardGhosts @Nifty_Cats @NessiesNFT @Budgiesnft @GloriousCrypto @BitSnails @idiomacy @Skelebits @ProctonsNFT @BitBlobsNFT Tired of the same collections? Looking for something different with a statement behind it? Check out the F*CK FIAT #nft Collectables that represent the historic moment where #BITCOIN defeats FIAT Currency. #btc #lasereyes #nftcollectables #nftcollectors https://t.co/HvmAnLLlTR
131239,2021-06-23 17:45:27,@Davidskylarkk @Lilblubird87 @BallsDeeptheone @Dgord6 @rockthrower19 @SECJackson @SEC_Enforcement @Boston_SEC @SF_SEC @NewYork_SEC @SEC_DERA @HesterPeirce @BarbaraRoper1 @MicahHauptman @NubreedNutritio @BillChaaban @SecretaryCarson @jerryalmufleh @griffpatriot @ewarren @SenWarren @FINRA @lajass @mcervantes1 @GiddyUpRocks This is comedy can we stop concentrating on #bitcoin #cryptocurrency #cryptocurrencies #cryptocrash #robinhoodtraders and #stonks #apes and find how $fitx shareholder's money was spent enuf #gamestop #amc #wallstreetbets we have our own problems with these #otc #OTCQB Beauties
132621,2021-06-23 16:57:06,@AlgyHall @Gallon73 @wheeliedealer @rcwhalen @HedgeyeTV @dlacalle_IA @Hedgeye @obone747 @MY21_Oracle @slarratt1 @Financial_Orbit @asibiza1 @marxnd @elerianm @TheIdleInvestor @ToddWenning @RobertStGeorge @RobertJShiller @Halsrethink @CiovaccoCapital @TheBubbleBubble @ReformedBroker @contrariansmind @MichaelSchuman @nickbatsford @ReutersJamie @M_McDonough @jonsanchezshow @DianaEPatterson @paul_dobson @KeithMcCullough @andrewrsorkin @FerroTV @MeganCBoxall @philjoakley @ShareScope @HW_MA @achildofthe70s @DVB99 @MacroPoloChina @followingeric @claudiohfox @groupstageexit @chumba54339270 @CarrRosie @marymcdougall13 @IChronicle @cfauk NEW🔥#TwinPetesInvesting #VIDEO #Podcast 51: #INVESTING #RISKS POSITION SIZING #FTSE #AIM #BITCOIN #CRYPTO #DOGECOIN #PSYCHOLOGY #PCA #AHT #BOTB #NWG #LLOY #BARC #DIVIDENDS #EQLS #MTFB #MMAG #CLIN #GTLY #CMRS #CUSN #ROO #AML #TPG #TRADING #SPREADBETTING &amp; https://t.co/FkrDWniXFE https://t.co/zWm6DWqk3p


In [5]:
#Check missing 
miss = df.isnull().sum()
miss[miss > 0] 

Series([], dtype: int64)

In [6]:
#Remove error rows preventing data conversion
df = df.drop(df.loc[pd.to_datetime(df["date"], errors = "coerce").isnull()].index, axis = "index")
#Check date range
print("Date range from " + str(df['date'].min()) + " to " + str(df['date'].max()))

Date range from 2021-02-05 10:52:04 to 2023-01-09 23:59:54


In [7]:
#Approximately 130 spam observations
check_spam = df.sample(n = 500, random_state = 147)
check_spam.head(5)

Unnamed: 0,date,text
770314,2021-08-08 12:28:01,#Bitcoin gonna tap 50K. they were right. https://t.co/Z7T5vl3ulW
2040180,2022-01-13 17:11:26,Tether value almost equals 1 US dollar\n#Tether #cryptocurrecy #Bitcoin #Ethereum #cryptomev #cryptomevapp #CryptocurrencyNews #CryptoWireWatch #cryptocurrencies #SolanaAirdrop #Solana #NFTCommumity https://t.co/TO7Vbhl7wC
969091,2021-08-16 20:05:34,"#Ethereum #bitcoin #CyberKongzVXNFTs #NFT NFT Gorillas Burned $90,000 Worth Of Ethereum Per Minute https://t.co/ngnMVW2r78 - dumbwire https://t.co/YHIuXFmXSQ"
2153759,2022-01-22 13:12:17,#Bitcoin correction durations during larger impulsive uptrends.\n\n154 and 161 days\n\n280 and 105 days\n\n448 and 336 days (and counting) https://t.co/DZAgq1W5ZS
628118,2021-07-30 11:03:52,@cryptomedatech This is definitely going to be HUGE! as the team has been an inspiration in their innovative effort and also they are hard working!!Must join!!\n\n@SHarianto17 @LiliGea2 @Hardidolay \n \n$TECH #NFTCommunity #nftcollector #CryptoNews #ADA #gamer #BTC #BSC #giveaway


#### Filter data

In [8]:
def filter_data(df):
    '''Filtering steps prior to sampling including removing duplicates, high hashtag numbers/those including a high % of digits, and  
    specific hashtags highly associated with spam'''
    
    #Get starting row numbers
    initial_shape = df.shape[0]
    
    #Move to lower case
    df["text"] = df["text"].str.lower()
    
    #Drop duplicate tweets
    df["text"] = df["text"].drop_duplicates()
    new_shape = df.shape[0]
    removed = initial_shape - new_shape
    print("Duplicate rows removed: " + str(removed) + "\n")
    
    #Spam associated words/hashtags/patterns
    regex_patterns = [r'bitcoin price (usd):',
                     r'giveaways',
                     r'amazon',
                     r'gift',
                     r'#gaming',
                     r'coinhuntworld',
                     r'crypto prices (usd/₿)',
                     r'scan results',
                     r'#wanusdt',
                     r'bitcoin - btc\nprice:',
                     r'#watchinwales',
                     r'1 btc equals',
                     r'bull alert!',
                     r'move from unknown wallet',
                     r'#betfurysuccess',
                     r'#ethereum price update',
                     r'airdrop',
                     r'#btc_whale_alert',
                     r'\ndollar:',
                     r'binancecoin price update',
                     r'i-gaming',
                     r'prices update',
                     r'\*\*|\*']
                     
                      
                      
    for pattern in regex_patterns:
        df = df[~df["text"].str.contains(pattern, na = False)]
    
    removed = new_shape - df.shape[0]
    new_shape = df.shape[0]
    print("Regex matches removed: " + str(removed) + "\n")
    
    #Filter high hashtag counts
    df["hashtag_count"] = df["text"].str.count("#")
    df = df[df["hashtag_count"] < 11]
    
    removed = new_shape - df.shape[0]
    new_shape = df.shape[0]
    print("High hashtags removed: " + str(removed) + "\n")
    
    #Remove observations with more than 35% digits
    df["digits"] = df["text"].str.count("\d")/df["text"].str.len()
    df = df[df["digits"] < 0.35]
    
    removed = new_shape - df.shape[0]
    new_shape = df.shape[0]
    print("High digits removed: " + str(removed) + "\n")
    print("New observation total: " + str(new_shape))
    
    #df.drop(columns = ["hashtag_count", "digits"])
    
    return df   
    

In [9]:
df = filter_data(df)

Duplicate rows removed: 0

Regex matches removed: 542799

High hashtags removed: 644340

High digits removed: 6858

New observation total: 3495291


In [10]:
#Convert date to datetime
df["date"] = pd.to_datetime(df["date"])
#Extract months 
df["month"] = df["date"].dt.to_period("M")

In [11]:
#Remove missing values created from filtering
df = df.dropna(subset = ["text"])
miss = df.isnull().sum()
miss[miss > 0] 

Series([], dtype: int64)

In [12]:
#Drop unneeded columns
df = df.drop(columns = ["hashtag_count", "digits", "date"])

In [13]:
#Only keep months with at least 5000 tweets
filtered_df = df[df.groupby("month")["month"].transform("count") > 500]
filtered_df["month"].nunique()

24

In [14]:
df_sampled = filtered_df.groupby("month").sample(n = 1000, random_state = 147).reset_index()

In [15]:
display(df_sampled.head())
print(df_sampled.shape[0])

Unnamed: 0,index,text,month
0,9954,shorting out\n#btc \n#bitcoin,2021-02
1,1380,"i called to break $44,578\nhttps://t.co/poeyjufjk4\nhttps://t.co/yddqwaltra\n#bitcoin #btc #ethereum #eth #crypto… https://t.co/wqixuztj3e",2021-02
2,35577,"uniswap (uni), compound (comp), other defi coins are slipping lower \n\n#btc #bitcoin #cryptocurrency #ethereum… https://t.co/esijvuxjej",2021-02
3,6192,elon #musk ignites the #bitcoin rocket - https://t.co/jj8d7p7mhd\n#tesla #bitcoin #btc #bitcoinnews #blockchain\n\nsup… https://t.co/0nje94sojb,2021-02
4,8810,@elonmusk @tesla revelation of its bitcoin reserve will drive up the price but the amount of #bitcoin in possession… https://t.co/8a8dqix8dc,2021-02


24000


#### Cleaning pipeline

In [16]:
def clean_text(text):
    ''' Applies cleaning steps to text including removing noise, removing stopwords, lemmatising, part of speech tagging 
    and tokenisation'''
   
    #Replace emojis with description
    text = emoji.demojize(text)
    
    #Remove html
    text = gensim.parsing.preprocessing.strip_tags(text)
    
    regex_patterns = [r'@\w+', # Words after @ symbol
                      r'\n', #remaining tags
                      r'http\S+', #urls
                      r'[^\w\s]',#Remove punctuation/symbols
                      r'\d'] #Remove digits
    
    
    spelling_patterns = {r'\b(bicoin|bittcoin|bitoin|btc|btcusd|btcusdt)\b': 'bitcoin',
                         r'\busd\b': 'dollars',
                         r'\bblock chain\b': 'blockchain',
                         r'\bfincen\b': 'finance', 
                         r'\beth\b': 'ethereum',
                         r'\b(crypto|crytocurrencies|cryptocoin)\b': 'cryptocurrency',
                         r'\b(elon musk|elon|musk)\b' :'elonmusk',
                         r'\b(bnb|binance coin)\b' : 'binancecoin',
                         r'\b(nftart|nft art|nfts)\b' : 'nft',
                         r'\b(to the moon|tothemoon)\b' : 'moon',
                         r'\bada\b': 'cardano',
                         r'\b(ma|macd)\b' : 'movingaverage',
                         r'\bsats\b': 'satoshis',
                         r'\b(th|thee)\b' : 'the',
                         r'\b(doge| doge coin)\b':'dogecoin'}
    
    #Remove with regex
    for pattern in regex_patterns:
        text = re.sub(pattern, "", text)
    
    #Fix spelling
    for pattern in spelling_patterns:
        text = re.sub(pattern, spelling_patterns[pattern], text)
        
    #Remove non-english terms
    text = ''.join(word for word in text if word.isascii())

    #Expand out contractions
    text = contractions.fix(text)

    #Remove profanity
    text = better_profanity.profanity.censor(text, "")
    
    #Remove extra whitespace
    text = text.strip()
    
    return text


In [17]:
def spacy_pipeline(text):    
    
    #Convert ot spacy doc
    doc = nlp(text)
    
    #POS to retain
    tags = {"NOUN", "ADJECTIVE", "VERB"}
    
    #Apply spacy steps
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and token.pos_ in tags and len(token) > 1]
    
    return tokens

In [21]:
#Run cleaning pipelines
df_sampled["clean_text"] = df_sampled["text"].progress_apply(clean_text)
df_sampled["tokens"] = df_sampled["text"].progress_apply(spacy_pipeline)
df_sampled.head()

100%|████████████████████████████████████████████████████████████████████████████| 24000/24000 [16:15<00:00, 24.59it/s]
100%|███████████████████████████████████████████████████████████████████████████| 24000/24000 [03:23<00:00, 117.91it/s]


Unnamed: 0,index,text,month,clean_text,tokens
0,9954,shorting out\n#btc \n#bitcoin,2021-02,shorting outbtc bitcoin,"[short, btc, bitcoin]"
1,1380,"i called to break $44,578\nhttps://t.co/poeyjufjk4\nhttps://t.co/yddqwaltra\n#bitcoin #btc #ethereum #eth #crypto… https://t.co/wqixuztj3e",2021-02,i called to break bitcoin ethereum ethereum cryptocurrency,"[call, break, bitcoin, btc, ethereum]"
2,35577,"uniswap (uni), compound (comp), other defi coins are slipping lower \n\n#btc #bitcoin #cryptocurrency #ethereum… https://t.co/esijvuxjej",2021-02,uniswap uni compound comp other defi coins are slipping lower bitcoin bitcoin cryptocurrency ethereum,"[compound, comp, defi, coin, slip, btc, bitcoin, cryptocurrency, ethereum]"
3,6192,elon #musk ignites the #bitcoin rocket - https://t.co/jj8d7p7mhd\n#tesla #bitcoin #btc #bitcoinnews #blockchain\n\nsup… https://t.co/0nje94sojb,2021-02,elonmusk ignites the bitcoin rocket bitcoin bitcoin bitcoinnews blockchainsup,"[musk, ignite, bitcoin, rocket, tesla, bitcoin, blockchain, sup, https://t.co/0nje94sojb]"
4,8810,@elonmusk @tesla revelation of its bitcoin reserve will drive up the price but the amount of #bitcoin in possession… https://t.co/8a8dqix8dc,2021-02,revelation of its bitcoin reserve will drive up the price but the amount of bitcoin in possession,"[@tesla, revelation, bitcoin, reserve, drive, price, bitcoin, possession]"


In [22]:
#Save cleaned dataframe
df_sampled.to_csv("./data/cleaned_data.csv", index = False)

#### Further Analysis

In [23]:
#test = pd.read_csv("./data/cleaned_data.csv")
#test.head()

In [24]:
def count_words(df):
    '''Returns sorted word counts from text'''
    
    #Get word freqs
    words = chain.from_iterable(df["clean_text"].str.split())
    count = Counter(words)
    
    #create output df
    output_df = pd.DataFrame(count.items(), columns = ["Word", "Count"])
    
    #Remove stopwords
    output_df = output_df[~output_df["Word"].isin(gensim.parsing.preprocessing.STOPWORDS)]
    filter = output_df["Word"] != ""
    output_df = output_df[filter]
    output_df = output_df.sort_values(by = "Count", ascending = False)
    return output_df
    

In [25]:
#Get top word frequencies
a =count_words(df_sampled)
a.head()

Unnamed: 0,Word,Count
2,bitcoin,32775
8,cryptocurrency,8705
7,ethereum,4718
32,price,4092
552,k,1687


In [26]:
import nltk

#Get list of words in clean text
list_text = [word for ob in df_sampled['clean_text'] for word in ob.split()]
#Calculate the bigrams
bigrams = nltk.ngrams(list_text, 2)
#Get frequencies
bigrams_freq = pd.Series(bigrams).value_counts()
#Show the top bigrams
display(bigrams_freq.head())        

(bitcoin, bitcoin)           5163
(bitcoin, ethereum)          1899
(bitcoin, cryptocurrency)    1846
(cryptocurrency, bitcoin)    1583
(in, the)                    1213
Name: count, dtype: int64

## Methods

The first model I chose to test was BERTopic, detailed in a paper by Maarten Grootendorst [[12]](#ref12). This offers a framework for topic modelling with different components which can be tested and adapted for the context of this problem. The first choice was how to generate word embeddings. While there are a host of models available for this, I chose to use the 'glove-twitter-25' embeddings, pre-trained using the GloVe model [[13]](#ref13). Given the size of my dataset, these have the advantage of significantly reducing the processing time required. They are also easily accessed through the 'gensim' library and should be suitable for use as they have been trained using a similar dataset of tweets.

The model then reduces the dimensionality of embeddings using Uniform Manifold Approximation and Projection (UMAP) before applying a clustering algorithm to group the topics (HDBScan by default). As a hierarchical method this has the benefit of not requiring a predefined number of topics, and has been shown to be faster at processing a high number of observations than other methods [[14]]('#ref14'). However, as K-means has proven to be effective with similar data [[4]](#ref4), I will also include it in the tests. 

BERTopic can then use class based term frequency-inverse term frequency (ctf-idf) to find representations of the topics over time [[12 pg. ]](#ref12). This is shown in equation 1, showing the term frequencies (tf) per term (t) for each class (c) at each timestep (i), with A being the average words per class. 

$$
\begin{equation}
  W_{t,c,i} = tf_{t,c,i}. log(1 + \frac{A}{tf_t})

\end{equation}
$$ 





## Experiment Setup

In order to compare the models, some form of evaluation metric is required. However, as topic modelling is an unsupervised machine learning method there is no definitive method for determining the quality of the models. One potential method I considered was perplexity, which tests how well the model predicts new data. However, perplexity does not optimise for the interpretability of the topics themselves, which is important in gaining meaningful insights on Bitcoin [[13]](#ref13). 

Instead, I investigated the various metrics designed to measure coherence, or how interpretable the topics themselves are. Again many different methods exist, but I decided to use coherence value (c_v), as while its not the most computationally efficient method to calculate, it has been shown to correlate highly with human evaluation of topics [[14]](#ref14). Coherence value isn't currently available in the BERTopic library, so the code for calculating this was adapted from a solution posted by Maarten Grootendorst [[15]](#ref15).  


#### BERTopic modelling

In [None]:
docs = df_sampled["clean_text"].to_list()

#Clustering model
cluster_model = HDBSCAN(prediction_data = True, min_cluster_size = 10)

#ctf-idf model
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words = True)


In [None]:
#Get pre-trained embeddings
glove_embeddings = api.load("glove-twitter-25")

#Create model
bertopic_model = BERTopic(embedding_model = glove_embeddings, ctfidf_model = ctfidf_model, 
                          verbose = True, calculate_probabilities = True, nr_topics="auto")
topics, probabilites = bertopic_model.fit_transform(docs)

In [None]:
#Code adapted from snippet by Maarten Grootendorst, reference [17].  

# Preprocess Documents
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = bertopic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = bertopic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in bertopic_model.get_topic(topic)] 
               for topic in range(len(set(topics))-1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence = coherence_model.get_coherence()

In [None]:
coherence

In [None]:
#Get topics
topic_info = bertopic_model.get_topic_info()
topic_info

In [None]:
results_df = pd.DataFrame()

In [None]:
#Show topic clustering
bertopic_model.visualize_topics()

In [None]:
#Show topic groups
bertopic_model.visualize_barchart()

In [None]:
#Model over time
#Get dates
dates = df_sampled["month"].to_list()
dynamic_topics = bertopic_model.topics_over_time(docs, dates, global_tuning = True, evolution_tuning = False)

In [None]:
#Timeseries topics
bertopic_model.visualize_topics_over_time(dynamic_topic)

## Results

In [None]:
First results with BERTopic produced 182 topics, visualisation showing a great deal of overlap between different 

#first - 182, 0.68, lots of overlap, 
#reduce topics - 40 - 0.78

In [None]:
Table of topics returned - model one/model 2 etc
- coherence scores 

- BERTopic visualisations - cluster of topics 

Graph topics over time vs price changes 

## Discussion




There are limitations to this research. Issues with the quality of the data lead to spam posts being retained even after the cleaning processes. With additional time and further research a full machine learning model could be implemented to more accurately detect and remove spam from the corpus prior to modelling. This would help uncover the pertinent information on Bitcoin hidden in the text. 

Further work could also examine the sentiment of 

While coherence gives a quantifiable method for comparing models, the true test of the model's usefulness is whether the resulting topics stand up to human judgement. While I have gauged the output, this is best conducted by domain area experts, potentially using the 'word intrusion' method. This involves adding an additional word to the topics and testing if participants can identify the intruder, which should be easier for a cohensive and well defined topic than a random one. This is a measurable evaluation of quality in what might else be subjective opinion. 




## Conclusion

## References

<a id='ref1'></a> [1] Burell, T and Aju, M. Research Note: Cryptoassets consumer research 2023 (Wave 4). Available from:
https://www.fca.org.uk/publication/research-notes/research-note-cryptoasset-consumer-research-2023-wave4.pdf, 2023.

<a id='ref2'></a> [2] Liana Badea and Mariana Claudia Mungiu-Pupzan. The economic and environmental impact of bitcoin. IEEE access, 9:48091–48104, 2021

<a id='ref3'></a> [3] Qiao F, Williams J. Topic modelling and sentiment analysis of global warming tweets: Evidence from big data analysis. Journal of Organizational and End User Computing (JOEUC). 2022 May 1;34(3):1-8.

<a id='ref4'></a> [4] Marco Ortu, Stefano Vacca, Giuseppe Destefanis, and Claudio Conversano. Cryptocurrency ecosystems and social media environments: An empirical analysis through hawkes’ models and natural language processing. Machine
Learning with Applications, 7:100229, 2022


<a id='ref5'></a> [5]  Stephan A Curiskis, Barry Drake, Thomas R Osborn, and Paul J Kennedy.An evaluation of document clustering and topic modelling in two online social networks: Twitter and reddit. Information Processing & Management,57(2):102034, 2020.

<a id='ref6'></a> [6] Andrew Burnie and Emine Yilmaz. An analysis of the change in discussions on social media with bitcoin price. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 889–892, 2019.

<a id='ref7'></a> [7]  Roman Egger and Joanne Yu. A topic modeling comparison between lda, nmf, top2vec, and bertopic to demystify twitter posts. Frontiers in sociology, 7:886498, 2022.

<a id='ref8'></a> [8] Belal Abdullah Hezam Murshed, Suresha Mallappa, Osamah AM Ghaleb, and Hasib Daowd Esmail Al-ariki. Efficient twitter data cleansing model for data analysis of the pandemic tweets. Emerging Technologies During the Era of COVID-19 Pandemic, pages 93–114, 2021.

<a id='ref9'></a> [9] Kaggle. bitcoin tweets.csv. Available from: https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets, 2023. Accessed: 2024 June 2nd.

<a id='ref10'></a> [10] Nan Sun, Guanjun Lin, Junyang Qiu, and Paul Rimba. Near real-time twitter spam detection with machine learning techniques. International Journal of Computers and Applications, 44(4):338–348, 2022.


<a id='ref11'></a> [11] Nicholas Reimann. Twitter Boosts Character Limit To 4,000 For Twitter Blue Subscribers. Forbes. Available from: https://www.forbes.com/sites/nicholasreimann/2023/02/08/twitter-boosts-character-limit-to-4000-for-twitter-blue-subscribers/. Accessed:2024 June 2nd.


<a id='ref12'></a> [12] Bitcoin price dataset (2017-2023). Available from:https://www.kaggle.com/datasets/jkraak/bitcoin-price-dataset, 2023. Data originally sourced from Binance API. Accessed: 2024 June 5th.

<a id='ref13'></a> [13] Maarten Grootendorst. Bertopic: Neural topic modeling with a class-based
tf-idf procedure. arXiv preprint arXiv:2203.05794, 2022.

<a id='ref13'></a> [14] Leland McInnes and John Healy. Accelerated hierarchical density based clustering. In 2017 IEEE international conference on data mining workshops (ICDMW), pages 33–42. IEEE, 2017.

<a id='ref15'></a> [15] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

<a id='ref15'></a> [15]  Maarten Grootendorst. Code snippet: Coherence value from BERTopic (2021 April 16). {Available from: https://github.com/MaartenGr/BERTopic/issues/90#issuecomment-820915389, 2021. Accessed: 2024 June 6th. 

<a id='ref16'></a> [16] Abdelrazek A, Eid Y, Gawish E, Medhat W, Hassan A. Topic modeling algorithms and applications: A survey. Information Systems . 2023 Feb 1 [cited 2024 May 22]; 112: Article number 102131. Available from: https://doi.org/10.1016/j.is.2022.102131

