### Topic modeling

Topic modeling is a natural language processing (NLP) technique for identifying hidden themes, or topics, in a collection of documents. BERTopic is a topic modeling library that embeds documents with a pretrained transformer language model such as BERT (Bidirectional Encoder Representations from Transformers), clusters the embeddings, and describes each cluster by its most characteristic words. It is designed to scale to large document collections.
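
BERTopic scores how characteristic a word is for a topic with a class-based TF-IDF (c-TF-IDF) weighting: all documents in a topic are treated as one long document, and a word's frequency in that topic is weighted by how rare the word is across all topics. The block below is a simplified pure-Python sketch of that idea, not BERTopic's actual implementation. The function name `class_tfidf` and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def class_tfidf(docs_per_topic: dict[int, list[str]]) -> dict[int, dict[str, float]]:
    """Score each word per topic as tf * log(1 + avg_words_per_topic / total_word_freq)."""
    # Treat all documents of a topic as a single bag of words
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # Frequency of each word across all topics
    total_freq = Counter()
    for counts in term_counts.values():
        total_freq.update(counts)

    # Average number of words per topic
    avg_words = sum(total_freq.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_freq[term])
            for term, freq in counts.items()
        }
    return scores


# Toy, hand-made clusters (hypothetical, not taken from the dataset)
toy_topics = {
    0: ["fulham home trafford", "fulham away"],
    1: ["spurs coys", "spurs thfc spurs"],
}
toy_scores = class_tfidf(toy_topics)
for topic, term_scores in toy_scores.items():
    top_term = max(term_scores, key=term_scores.get)
    print(topic, top_term)
```

The highest-scoring words per topic are what end up in topic names such as `0_fulham_home_trafford_mufc`.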

The following code uses BERTopic to extract topics from tweets.

In [None]:
# Installations
import sys
if 'google.colab' in sys.modules:
    !pip install emoji --upgrade
    !python -m spacy download en_core_web_lg
    !pip install gensim
    !pip install chart_studio
    !pip install bertopic

In [None]:
# Required Libraries

#Base and Cleaning 
import json
import requests
import pandas as pd
import numpy as np
import emoji
import regex
import re
import string
from collections import Counter

#Visualizations
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt 
import chart_studio
import chart_studio.plotly as py 
import chart_studio.tools as tls

#Natural Language Processing (NLP)
import spacy
from spacy.tokenizer import Tokenizer
from pprint import pprint
from wordcloud import STOPWORDS
stopwords = set(STOPWORDS)

In [None]:
data = pd.read_csv("./facup_2k.csv")
df = pd.DataFrame(data['text'])

In [None]:
df

Unnamed: 0,text
0,"🎶 Hecky, Jack and Stuart McCalllll 🎶\n\n#sufc ..."
1,Another amazing night at Old Trafford with the...
2,People talking shit on Garnarcho an 18 year ol...
3,"Three days after they won the League Cup, Manc..."
4,I'm 100% convinced that the ball is completely...
...,...
2587,🎫 Fancy going to see @BristolCity v @ManCity i...
2588,Goalkeeper #GordonBanks in the 1962/63 #FACup ...
2589,Prince Henry is introduced to the Huddersfield...
2590,@Jaylee20220 400 million new #football tactics...


In [None]:

def give_emoji_free_text(text):
    """
    Removes emojis from a tweet
    Accepts:
        text (str): the tweet
    Returns:
        str: the tweet with emojis removed
    """
    return emoji.replace_emoji(text, replace='')

def url_free_text(text):
    """
    Removes URLs from a tweet
    """
    return re.sub(r'http\S+', '', text)

# Create a new column with emoji-free tweets
df['emoji_free_tweets'] = df['text'].apply(give_emoji_free_text)

# Create a new column with URL-free tweets
df['url_free_tweets'] = df['emoji_free_tweets'].apply(url_free_text)

In [None]:
df

Unnamed: 0,text,emoji_free_tweets,url_free_tweets
0,"🎶 Hecky, Jack and Stuart McCalllll 🎶\n\n#sufc ...","Hecky, Jack and Stuart McCalllll \n\n#sufc #t...","Hecky, Jack and Stuart McCalllll \n\n#sufc #t..."
1,Another amazing night at Old Trafford with the...,Another amazing night at Old Trafford with the...,Another amazing night at Old Trafford with the...
2,People talking shit on Garnarcho an 18 year ol...,People talking shit on Garnarcho an 18 year ol...,People talking shit on Garnarcho an 18 year ol...
3,"Three days after they won the League Cup, Manc...","Three days after they won the League Cup, Manc...","Three days after they won the League Cup, Manc..."
4,I'm 100% convinced that the ball is completely...,I'm 100% convinced that the ball is completely...,I'm 100% convinced that the ball is completely...
...,...,...,...
2587,🎫 Fancy going to see @BristolCity v @ManCity i...,Fancy going to see @BristolCity v @ManCity in...,Fancy going to see @BristolCity v @ManCity in...
2588,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,Goalkeeper #GordonBanks in the 1962/63 #FACup ...
2589,Prince Henry is introduced to the Huddersfield...,Prince Henry is introduced to the Huddersfield...,Prince Henry is introduced to the Huddersfield...
2590,@Jaylee20220 400 million new #football tactics...,@Jaylee20220 400 million new #football tactics...,@Jaylee20220 400 million new #football tactics...
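
The URL pattern `http\S+` only matches links that start with `http`, so bare `www.` links pass through unchanged. The sketch below shows this on two made-up tweets (hypothetical strings, not taken from the dataset). The helper name `strip_urls` is an assumption, and unlike the notebook's `url_free_text` it also collapses the whitespace left behind.

```python
import re

URL_PATTERN = re.compile(r"http\S+")


def strip_urls(text: str) -> str:
    """Remove http/https URLs, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


# Hypothetical tweets used only for illustration
with_link = "Great win tonight! https://t.co/abc123 #FACup"
bare_link = "Highlights at www.example.com/facup"

print(strip_urls(with_link))  # Great win tonight! #FACup
print(strip_urls(bare_link))  # Highlights at www.example.com/facup
```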


In [None]:
# Load spaCy
# Make sure to restart the runtime after running the installation cell
nlp = spacy.load('en_core_web_lg')

In [None]:
# Tokenizer (created with only a vocab and no rules, so it splits on whitespace)
tokenizer = Tokenizer(nlp.vocab)


# Custom stopwords
custom_stopwords = ['hi','\n','\n\n', '&', ' ', '.', '-', 'got', "it's", 'it’s', "i'm", 'i’m', 'im', 'want', 'like', '$', '@']

# Customize stop words by adding to the default list
STOP_WORDS = nlp.Defaults.stop_words.union(custom_stopwords)

# ALL_STOP_WORDS = spaCy + custom + wordcloud
ALL_STOP_WORDS = STOP_WORDS.union(stopwords)


tokens = []

for doc in tokenizer.pipe(df['url_free_tweets'], batch_size=500):
    doc_tokens = []
    for token in doc:
        # Keep lowercased tokens that are not in the combined stop word set
        if token.text.lower() not in ALL_STOP_WORDS:
            doc_tokens.append(token.text.lower())
    tokens.append(doc_tokens)

# Makes tokens column
df['tokens'] = tokens

In [None]:
df

Unnamed: 0,text,emoji_free_tweets,url_free_tweets,tokens
0,"🎶 Hecky, Jack and Stuart McCalllll 🎶\n\n#sufc ...","Hecky, Jack and Stuart McCalllll \n\n#sufc #t...","Hecky, Jack and Stuart McCalllll \n\n#sufc #t...","[hecky,, jack, stuart, mccalllll, #sufc, #twit..."
1,Another amazing night at Old Trafford with the...,Another amazing night at Old Trafford with the...,Another amazing night at Old Trafford with the...,"[amazing, night, old, trafford, little, man!, ..."
2,People talking shit on Garnarcho an 18 year ol...,People talking shit on Garnarcho an 18 year ol...,People talking shit on Garnarcho an 18 year ol...,"[people, talking, shit, garnarcho, 18, year, o..."
3,"Three days after they won the League Cup, Manc...","Three days after they won the League Cup, Manc...","Three days after they won the League Cup, Manc...","[days, won, league, cup,, manchester, united, ..."
4,I'm 100% convinced that the ball is completely...,I'm 100% convinced that the ball is completely...,I'm 100% convinced that the ball is completely...,"[100%, convinced, ball, completely, bounds, we..."
...,...,...,...,...
2587,🎫 Fancy going to see @BristolCity v @ManCity i...,Fancy going to see @BristolCity v @ManCity in...,Fancy going to see @BristolCity v @ManCity in...,"[fancy, going, @bristolcity, v, @mancity, #fac..."
2588,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,"[goalkeeper, #gordonbanks, 1962/63, #facup, se..."
2589,Prince Henry is introduced to the Huddersfield...,Prince Henry is introduced to the Huddersfield...,Prince Henry is introduced to the Huddersfield...,"[prince, henry, introduced, huddersfield, town..."
2590,@Jaylee20220 400 million new #football tactics...,@Jaylee20220 400 million new #football tactics...,@Jaylee20220 400 million new #football tactics...,"[@jaylee20220, 400, million, new, #football, t..."
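
The `tokens` column still contains entries such as `hecky,` and `man!` because the tokenizer splits on whitespace only, so punctuation stays attached to the neighbouring word. A stop word with punctuation attached also escapes the filter. The pure-Python sketch below mimics that behaviour. The helper `whitespace_tokens`, the tiny stop word set and the sample sentences are all assumptions made for illustration.

```python
def whitespace_tokens(text: str, stop_words: set[str]) -> list[str]:
    """Lowercase, split on whitespace, and drop exact stop word matches."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


# Hypothetical stop words and sentences, written in the style of the tweets
demo_stop_words = {"another", "at", "with", "the"}

print(whitespace_tokens("Another amazing night at Old Trafford with the little man!", demo_stop_words))
# ['amazing', 'night', 'old', 'trafford', 'little', 'man!']

print(whitespace_tokens("The, cup", demo_stop_words))
# ['the,', 'cup']  ("the," is not an exact match for "the")
```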


In [None]:
# Make tokens a string again
df['tokens_back_to_text'] = [' '.join(map(str, l)) for l in df['tokens']]

def get_lemmas(text):
    '''Lemmatizes a processed tweet, skipping stop words, punctuation and pronouns'''
    doc = nlp(text)

    return [
        token.lemma_
        for token in doc
        if not token.is_stop and not token.is_punct and token.pos_ != 'PRON'
    ]

df['lemmas'] = df['tokens_back_to_text'].apply(get_lemmas)

In [None]:
df

Unnamed: 0,text,emoji_free_tweets,url_free_tweets,tokens,tokens_back_to_text,lemmas
0,"🎶 Hecky, Jack and Stuart McCalllll 🎶\n\n#sufc ...","Hecky, Jack and Stuart McCalllll \n\n#sufc #t...","Hecky, Jack and Stuart McCalllll \n\n#sufc #t...","[hecky,, jack, stuart, mccalllll, #sufc, #twit...","hecky, jack stuart mccalllll #sufc #twitterbla...","[hecky, jack, stuart, mccalllll, sufc, twitter..."
1,Another amazing night at Old Trafford with the...,Another amazing night at Old Trafford with the...,Another amazing night at Old Trafford with the...,"[amazing, night, old, trafford, little, man!, ...",amazing night old trafford little man! @manutd...,"[amazing, night, old, trafford, little, man, @..."
2,People talking shit on Garnarcho an 18 year ol...,People talking shit on Garnarcho an 18 year ol...,People talking shit on Garnarcho an 18 year ol...,"[people, talking, shit, garnarcho, 18, year, o...",people talking shit garnarcho 18 year old sad ...,"[people, talk, shit, garnarcho, 18, year, old,..."
3,"Three days after they won the League Cup, Manc...","Three days after they won the League Cup, Manc...","Three days after they won the League Cup, Manc...","[days, won, league, cup,, manchester, united, ...","days won league cup, manchester united trailed...","[day, win, league, cup, manchester, united, tr..."
4,I'm 100% convinced that the ball is completely...,I'm 100% convinced that the ball is completely...,I'm 100% convinced that the ball is completely...,"[100%, convinced, ball, completely, bounds, we...",100% convinced ball completely bounds west ham...,"[100, convince, ball, completely, bound, west,..."
...,...,...,...,...,...,...
2587,🎫 Fancy going to see @BristolCity v @ManCity i...,Fancy going to see @BristolCity v @ManCity in...,Fancy going to see @BristolCity v @ManCity in...,"[fancy, going, @bristolcity, v, @mancity, #fac...",fancy going @bristolcity v @mancity #facup? we...,"[fancy, go, @bristolcity, v, @mancity, facup, ..."
2588,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,"[goalkeeper, #gordonbanks, 1962/63, #facup, se...",goalkeeper #gordonbanks 1962/63 #facup semifin...,"[goalkeeper, gordonbank, 1962/63, facup, semif..."
2589,Prince Henry is introduced to the Huddersfield...,Prince Henry is introduced to the Huddersfield...,Prince Henry is introduced to the Huddersfield...,"[prince, henry, introduced, huddersfield, town...",prince henry introduced huddersfield town play...,"[prince, henry, introduce, huddersfield, town,..."
2590,@Jaylee20220 400 million new #football tactics...,@Jaylee20220 400 million new #football tactics...,@Jaylee20220 400 million new #football tactics...,"[@jaylee20220, 400, million, new, #football, t...",@jaylee20220 400 million new #football tactics...,"[@jaylee20220, 400, million, new, football, ta..."


In [None]:
# Make lemmas a string again
df['lemmas_back_to_text'] = [' '.join(map(str, l)) for l in df['lemmas']]

# Tokenizer function
def tokenize(text):
    """
    Parses a string into a list of lowercase word tokens
    Args:
        text (str): The string that the function will tokenize.
    Returns:
        list: tokens parsed out
    """
    tokens = re.sub(r'http\S+', '', text)  # Remove URLs
    tokens = re.sub(r'[@!$]', '', tokens)  # Remove @ ! $ (digits and other characters are kept)

    # Strip leading and trailing punctuation from the whole string
    for char in (',', '?', '!', "'", '.'):
        tokens = tokens.strip(char)

    return tokens.lower().split()  # Make text lowercase and split it on whitespace

# Apply tokenizer
df['lemma_tokens'] = df['lemmas_back_to_text'].apply(tokenize)

In [None]:
df

Unnamed: 0,text,emoji_free_tweets,url_free_tweets,tokens,tokens_back_to_text,lemmas,lemmas_back_to_text,lemma_tokens
0,"🎶 Hecky, Jack and Stuart McCalllll 🎶\n\n#sufc ...","Hecky, Jack and Stuart McCalllll \n\n#sufc #t...","Hecky, Jack and Stuart McCalllll \n\n#sufc #t...","[hecky,, jack, stuart, mccalllll, #sufc, #twit...","hecky, jack stuart mccalllll #sufc #twitterbla...","[hecky, jack, stuart, mccalllll, sufc, twitter...",hecky jack stuart mccalllll sufc twitterblade ...,"[hecky, jack, stuart, mccalllll, sufc, twitter..."
1,Another amazing night at Old Trafford with the...,Another amazing night at Old Trafford with the...,Another amazing night at Old Trafford with the...,"[amazing, night, old, trafford, little, man!, ...",amazing night old trafford little man! @manutd...,"[amazing, night, old, trafford, little, man, @...",amazing night old trafford little man @manutd ...,"[amazing, night, old, trafford, little, man, m..."
2,People talking shit on Garnarcho an 18 year ol...,People talking shit on Garnarcho an 18 year ol...,People talking shit on Garnarcho an 18 year ol...,"[people, talking, shit, garnarcho, 18, year, o...",people talking shit garnarcho 18 year old sad ...,"[people, talk, shit, garnarcho, 18, year, old,...",people talk shit garnarcho 18 year old sad sco...,"[people, talk, shit, garnarcho, 18, year, old,..."
3,"Three days after they won the League Cup, Manc...","Three days after they won the League Cup, Manc...","Three days after they won the League Cup, Manc...","[days, won, league, cup,, manchester, united, ...","days won league cup, manchester united trailed...","[day, win, league, cup, manchester, united, tr...",day win league cup manchester united trail say...,"[day, win, league, cup, manchester, united, tr..."
4,I'm 100% convinced that the ball is completely...,I'm 100% convinced that the ball is completely...,I'm 100% convinced that the ball is completely...,"[100%, convinced, ball, completely, bounds, we...",100% convinced ball completely bounds west ham...,"[100, convince, ball, completely, bound, west,...",100 convince ball completely bound west ham go...,"[100, convince, ball, completely, bound, west,..."
...,...,...,...,...,...,...,...,...
2587,🎫 Fancy going to see @BristolCity v @ManCity i...,Fancy going to see @BristolCity v @ManCity in...,Fancy going to see @BristolCity v @ManCity in...,"[fancy, going, @bristolcity, v, @mancity, #fac...",fancy going @bristolcity v @mancity #facup? we...,"[fancy, go, @bristolcity, v, @mancity, facup, ...",fancy go @bristolcity v @mancity facup pair ti...,"[fancy, go, bristolcity, v, mancity, facup, pa..."
2588,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,Goalkeeper #GordonBanks in the 1962/63 #FACup ...,"[goalkeeper, #gordonbanks, 1962/63, #facup, se...",goalkeeper #gordonbanks 1962/63 #facup semifin...,"[goalkeeper, gordonbank, 1962/63, facup, semif...",goalkeeper gordonbank 1962/63 facup semifinal ...,"[goalkeeper, gordonbank, 1962/63, facup, semif..."
2589,Prince Henry is introduced to the Huddersfield...,Prince Henry is introduced to the Huddersfield...,Prince Henry is introduced to the Huddersfield...,"[prince, henry, introduced, huddersfield, town...",prince henry introduced huddersfield town play...,"[prince, henry, introduce, huddersfield, town,...",prince henry introduce huddersfield town playe...,"[prince, henry, introduce, huddersfield, town,..."
2590,@Jaylee20220 400 million new #football tactics...,@Jaylee20220 400 million new #football tactics...,@Jaylee20220 400 million new #football tactics...,"[@jaylee20220, 400, million, new, #football, t...",@jaylee20220 400 million new #football tactics...,"[@jaylee20220, 400, million, new, football, ta...",@jaylee20220 400 million new football tactic r...,"[jaylee20220, 400, million, new, football, tac..."
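
The `lemma_tokens` column above still contains numbers and handles such as `100`, `1962/63` and `jaylee20220`. If stricter cleaning is wanted, each substitution has to operate on the result of the previous one. The block below is a minimal sketch of such a variant. The name `strict_tokenize` is hypothetical, and it is not the function that produced the table above.

```python
import re
import string


def strict_tokenize(text: str) -> list[str]:
    """Chain the cleaning steps so that each one sees the previous result."""
    cleaned = re.sub(r"http\S+", "", text)  # drop URLs
    cleaned = re.sub(r"\w*\d\w*", "", cleaned)  # drop words containing digits
    cleaned = re.sub(f"[{re.escape(string.punctuation)}]", "", cleaned)  # drop punctuation
    return cleaned.lower().split()


print(strict_tokenize("@jaylee20220 400 million new football tactic"))
# ['million', 'new', 'football', 'tactic']

print(strict_tokenize("goalkeeper gordonbank 1962/63 facup semifinal"))
# ['goalkeeper', 'gordonbank', 'facup', 'semifinal']
```

Dropping digit-bearing words also removes scorelines and seasons, which may or may not be wanted for football tweets.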


In [None]:
from bertopic import BERTopic

# Fit BERTopic on the stop-word-filtered tweets; calculate_probabilities=True
# also returns the probability of each tweet belonging to every topic
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(df["tokens_back_to_text"].tolist())

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name
0,-1,787,-1_facup_mufc_united_manutd
1,0,144,0_fulham_home_trafford_mufc
2,1,122,1_grimsby_southampton_town_grimsbytown
3,2,88,2_erik_europaleague_hag_europa
4,3,87,3_garnacho_alejandro_goal_munwhu
5,4,79,4_munwhu_weghorst_mufc_facup
6,5,75,5_mufc_facup_way_la
7,6,66,6_spurs_spursy_coys_thfc
8,7,58,7_vs_blackburn_brighton_draw
9,8,51,8_west_ham_31_united
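
In the table above, topic `-1` is not a theme of its own: BERTopic uses it for documents it treats as outliers and does not assign to any cluster. Here it holds 787 tweets, so it is worth checking what share of the data was left unassigned. The sketch below computes topic shares from a list of topic assignments. The helper `topic_shares` and the example list are hypothetical, and in the notebook the `topics` list returned by `fit_transform` would be passed instead.

```python
from collections import Counter


def topic_shares(topics: list[int]) -> dict[int, float]:
    """Return the fraction of documents assigned to each topic id."""
    counts = Counter(topics)
    total = len(topics)
    return {topic: count / total for topic, count in counts.items()}


# Hypothetical assignments; -1 marks outliers
example_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
shares = topic_shares(example_topics)
print(f"outlier share: {shares[-1]:.1%}")  # outlier share: 37.5%
```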


In [None]:
topic_model.visualize_topics()