# Import Libraries

In [2]:
# install bertopic
!pip install bertopic --quiet

In [3]:
# NumPy and pandas for numerical work and data handling.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# BERTopic is the core class for performing topic modelling.
from bertopic import BERTopic

# KaggleApi is used to download the dataset from Kaggle.
from kaggle.api.kaggle_api_extended import KaggleApi

# CountVectorizer converts a collection of text documents into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer

# Read Data

In [4]:
# Authenticate with the Kaggle API, download and unzip the dataset, then load the CSV
api = KaggleApi()
api.authenticate()
api.dataset_download_files('saurabhshahane/fake-news-classification', unzip=True)
data = pd.read_csv('WELFake_Dataset.csv')

# Data Preparation

In [5]:
# Keep only rows with label == 1, which this notebook treats as real news.
# Fake news articles are excluded from the topic modelling.
data = data[data.label == 1]

In [6]:
# drop null values
data = data.dropna()

In [7]:
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
5,5,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1
6,6,DR BEN CARSON TARGETED BY THE IRS: “I never ha...,DR. BEN CARSON TELLS THE STORY OF WHAT HAPPENE...,1


# Topic Modelling

* **BERTopic's goal is to group similar documents together into topics.**
    - It does this by first converting each document into a numerical representation (a vector or embedding) that captures its meaning. Then it uses clustering algorithms to find groups of similar vectors, where each group represents a topic.

* **Steps followed in BERTopic modelling:**
  - Convert Documents to Vectors (Embeddings)
  - Reduce Dimensions (UMAP)
  - Find Clusters (HDBSCAN)
  - Label the Topics (c-TF-IDF)
  
* **In summary:**
    - BERTopic reads the documents.
    - It converts each document into a vector that represents its meaning.
    - It groups documents with similar vectors into topics.
    - It labels each topic with the words that are most representative of that topic.
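
The labelling step can be made concrete with a small, dependency-free sketch of the class-based TF-IDF idea. It assumes the commonly documented form of the weight, W(t, c) = tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the frequency of term t in topic c, f(t) is its frequency across all topics, and A is the average number of words per topic. The toy documents, the whitespace tokenizer, and the function name `c_tf_idf` are illustrative assumptions; BERTopic's own implementation works on the sparse matrix produced by the vectorizer and applies normalization that is omitted here.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Return {topic: {term: weight}} using a simplified c-TF-IDF formula."""
    # Treat all documents of a topic as one long document and count its terms.
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f(t): frequency of each term across all topics.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(term_counts)

    return {
        topic: {
            term: tf * math.log(1 + avg_words / total[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


# Hypothetical toy input: two topics with two short documents each.
toy = {
    0: ["insurance premiums rise", "insurance plans change"],
    1: ["climate warming report", "climate policy change"],
}
weights = c_tf_idf(toy)

# The highest-weighted term per topic acts as a one-word label.
top_words = {topic: max(w, key=w.get) for topic, w in weights.items()}
print(top_words)
```

In this toy example the word "change" occurs in both topics, so it receives a lower weight than words that occur in only one of them.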

In [8]:
# Create an instance of the CountVectorizer class
# use both unigrams and bigrams as features during the vectorization process
# Removes common English stop words

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")

# data.text is a pandas Series, so convert it to a plain Python list of strings,
# which is the input format BERTopic expects.

text = data.text.tolist()
    
# Create a BERTopic model instance
# Calculate the probability of each document belonging to each topic
    
model = BERTopic(
    vectorizer_model=vectorizer_model,
    language='english', calculate_probabilities=True,
    verbose=True
)

# Fit the BERTopic model to the text data and assign a topic to each document.
# topics: a list in which each element is the topic assigned to the corresponding document in text (-1 marks outliers).
# probs: because calculate_probabilities=True, an array with one row per document and one column per topic,
#        holding the probability of that document belonging to each topic.

topics, probs = model.fit_transform(text)

2024-08-26 06:27:16,776 - BERTopic - Embedding - Transforming documents to embeddings.


2024-08-26 06:29:12,108 - BERTopic - Embedding - Completed ✓
2024-08-26 06:29:12,109 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-08-26 06:29:57,533 - BERTopic - Dimensionality - Completed ✓
2024-08-26 06:29:57,536 - BERTopic - Cluster - Start clustering the reduced embeddings
  pid = os.fork()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

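Because `calculate_probabilities=True` is set, `probs` is expected to hold one row per document and one column per topic rather than a single value per document. The sketch below uses a small hand-written matrix (not real model output) to show how a most likely topic and its probability could be read from each row. The `min_prob` threshold and the helper name `most_likely_topics` are hypothetical.

```python
def most_likely_topics(prob_matrix, min_prob=0.0):
    """For each row, return (topic, probability) for the most probable topic.

    Rows whose best probability is below min_prob are labelled -1.
    """
    result = []
    for row in prob_matrix:
        # Index of the column with the highest probability.
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= min_prob:
            result.append((best, row[best]))
        else:
            result.append((-1, row[best]))
    return result


# Hypothetical probabilities for 3 documents over 3 topics.
toy_probs = [
    [0.70, 0.20, 0.05],
    [0.10, 0.15, 0.60],
    [0.05, 0.04, 0.03],
]
assignments = most_likely_topics(toy_probs, min_prob=0.5)
print(assignments)
```

With a real model, `probs` is a NumPy array, so the same idea can be written more compactly with `probs.argmax(axis=1)` and `probs.max(axis=1)`.
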
# Visualisation

## Topic Info

* `get_topic_info()` returns information and statistics about the topics discovered by the model.
* Information included:
   - **Topic:** The topic number or ID. Topic -1 collects outlier documents that were not assigned to any topic.
   - **Count:** The number of documents assigned to this topic.
   - **Name:** A representative name or label for the topic, generated based on the top words associated with the topic.
   - **Representation:** The top 'n' words that are most representative of the topic, usually ranked by their c-TF-IDF scores.
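
The `Count` column is essentially a tally of the topic IDs assigned to the documents, where BERTopic uses the ID -1 for outliers. A minimal sketch with a made-up list of topic IDs (the values are illustrative and not taken from the model in this notebook):

```python
from collections import Counter

# Hypothetical topic assignments for ten documents; -1 marks outliers.
toy_topics = [-1, 0, 0, 1, -1, 0, 2, -1, 1, -1]

# Count how many documents fall into each topic.
counts = Counter(toy_topics)

# Sort from the most to the least frequent topic.
table = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for topic_id, n_docs in table:
    print(f"Topic {topic_id:>2}: {n_docs} documents")
```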

In [9]:
freq = model.get_topic_info()
freq.head(10)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,11902,-1_trump_people_clinton_hillary,"[trump, people, clinton, hillary, said, presid...",[More evidence is mounting against Hillary Cli...
1,0,749,0_ouch_bites ya_buyin brilliant_karma bites,"[ouch, bites ya, buyin brilliant, karma bites,...","[Brilliant , Nice try BUT WE RE NOT BUYIN IT!..."
2,1,389,1_obamacare_insurance_care_healthcare,"[obamacare, insurance, care, healthcare, healt...",[Wasn t the point of Obamacare to provide heal...
3,2,387,2_antifa_protesters_rally_supporters,"[antifa, protesters, rally, supporters, trump ...",[Typical! The New York Times tries to blame co...
4,3,304,3_comey_fbi_investigation_comeys,"[comey, fbi, investigation, comeys, fbi direct...","[Posted on October 30, 2016 by Sean Adl-Tabata..."
5,4,300,4_consciousness_life_energy_love,"[consciousness, life, energy, love, human, bod...",[. We Are Aliens Because Our Souls Are Extra-T...
6,5,268,5_students_campus_student_university,"[students, campus, student, university, school...",[Because everyone deserves a safe space right...
7,6,254,6_climate_climate change_warming_global warming,"[climate, climate change, warming, global warm...",[ No challenge poses a greater threat to futu...
8,7,252,7_pipeline_dakota_standing rock_dakota access,"[pipeline, dakota, standing rock, dakota acces...","[We Are Change \nOceti Sakowin, ND – As water ..."
9,8,223,8_abortion_parenthood_planned parenthood_planned,"[abortion, parenthood, planned parenthood, pla...",[A shocking new video has just been released b...
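
Two-word entries in the table, such as `climate change` and `planned parenthood`, are a consequence of `ngram_range=(1, 2)` in the vectorizer. The sketch below mimics that behaviour in plain Python for a single sentence: lowercase the text, keep tokens of two or more word characters, drop stop words, then emit unigrams and bigrams. The tiny `STOP_WORDS` set and the function name `unigrams_and_bigrams` are assumptions for illustration; scikit-learn ships a much longer English stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"the", "of", "is", "a", "to", "and", "on"}


def unigrams_and_bigrams(text):
    """Return unigrams followed by bigrams, after stop-word removal."""
    # Lowercase and keep tokens with at least two word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Remove stop words before building n-grams.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Pair each remaining token with its successor to form bigrams.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


features = unigrams_and_bigrams("The threat of climate change is global")
print(features)
```

Because stop words are removed before the n-grams are formed, a bigram such as `threat climate` can appear even though the two words were not adjacent in the original sentence.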


## Intertopic Distance Map

In [10]:
model.visualize_topics()

* The Intertopic Distance Map is generated by BERTopic's `visualize_topics()` function.
* It helps in understanding the relationships between the topics discovered during topic modelling.
* Each circle represents a topic.
* The size of a circle indicates the number of documents assigned to that topic: larger circles mean more documents.
* The position of a circle in the 2D space reflects the topic's relationship with other topics.
* The distances between circles roughly correspond to how dissimilar the topics are: circles that are closer together represent topics that are more semantically similar and share more words or concepts.
* The axes, labeled D1 and D2, are the two dimensions of the 2D projection produced by dimensionality reduction. They have no direct interpretation on their own; only the relative positions of the circles matter.
* The overall spread of the circles indicates the diversity of topics discovered in the data.
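
The notion of topics being "close" can be illustrated with cosine similarity between topic vectors. The sketch below uses invented term weights for three invented topics over a shared four-word vocabulary. It is not how `visualize_topics()` computes its layout, only an illustration of why topics that put weight on the same words tend to end up near each other.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical term weights; vocabulary order is
# ["insurance", "healthcare", "climate", "warming"].
topic_vectors = {
    "obamacare": [0.9, 0.8, 0.0, 0.0],
    "healthcare_reform": [0.7, 0.9, 0.1, 0.0],
    "climate": [0.0, 0.1, 0.9, 0.8],
}

sim_close = cosine_similarity(
    topic_vectors["obamacare"], topic_vectors["healthcare_reform"]
)
sim_far = cosine_similarity(topic_vectors["obamacare"], topic_vectors["climate"])
print(f"related topics: {sim_close:.2f}, unrelated topics: {sim_far:.2f}")
```

The two health-related toy topics score much higher than the health and climate pair, which corresponds to circles that would be drawn closer together on the map.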

## Hierarchical Clustering

In [11]:
model.visualize_hierarchy()

## Topic Word Scores

In [12]:
model.visualize_barchart(top_n_topics=30)

* Each bar represents a word.
* The length of the bar corresponds to its score or importance within a particular topic.
* The horizontal axis represents the c-TF-IDF score, indicating the importance of each word within its respective topic.
* This chart is valuable for quickly understanding the main themes associated with each discovered topic.
* For example, Topic 5 in the topic table above:
    - Top words: "students", "campus", "student", "university"
    - Possible theme: Centers around higher education and campus life.