<a href="https://colab.research.google.com/github/Vitor-Sallenave/Formacao-em-NLP/blob/main/Topic-Modeling/BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## ***◼️ Setup:***

In [None]:
!pip install BERTopic
!pip install sentence-transformers

In [None]:
from google.colab import files

from zipfile import ZipFile

import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download("punkt")

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

## ***◼️ Importing the files:***

In [None]:
from glob import iglob

In [None]:
# Uploading the files
files.upload()

In [None]:
# Opening the archive (the name `zip` would shadow the Python built-in)
with ZipFile("documentos.zip") as archive:
    # Extracting every file in the archive; it is closed automatically afterwards
    archive.extractall()
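
The extraction step can be wrapped in a small helper that also reports what was unpacked. This is only a sketch: `extract_archive` is a hypothetical name, and the demo archive and its file names are made up so the example runs outside Colab.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile


def extract_archive(archive_path, target_dir):
    """Extract every member of a zip archive and return the member names."""
    with ZipFile(archive_path) as archive:  # closed automatically on exit
        archive.extractall(target_dir)
        return archive.namelist()


# Demo with a throwaway archive; the file names are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "documentos.zip"
    with ZipFile(archive_path, "w") as archive:
        archive.writestr("documentos/a.txt", "Primeiro texto.")
        archive.writestr("documentos/b.txt", "Segundo texto.")

    names = extract_archive(archive_path, tmp)
    extracted = sorted(p.name for p in (Path(tmp) / "documentos").glob("*.txt"))
```

Returning the member names makes it easy to confirm that the archive really contains the expected `documentos/` folder before reading from it.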

In [None]:
# Folder created in Colab's default working directory by the extraction
desktop = '/content/documentos/'

docs_data = list()

# Searching the folder for text files
for arq in iglob(desktop + '*.txt', recursive=False):
    # Storing the content of each text file
    with open(arq, 'r', encoding='UTF-8') as doc:
        text = doc.read()
        docs_data.append(text)
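
An equivalent way to load the files uses `pathlib` and sorts the paths, so that indexing into the resulting list is reproducible across runs. This is a sketch under assumptions: `load_texts` is a hypothetical helper and the demo files are created in a temporary folder rather than in `/content/documentos/`.

```python
import tempfile
from pathlib import Path


def load_texts(folder):
    """Return the contents of every .txt file in `folder`, sorted by file name."""
    paths = sorted(Path(folder).glob("*.txt"))
    return [path.read_text(encoding="utf-8") for path in paths]


# Demo files, invented for illustration; the .md file must be skipped.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_text("Segundo documento.", encoding="utf-8")
    (Path(tmp) / "a.txt").write_text("Primeiro documento.", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("ignored", encoding="utf-8")
    docs = load_texts(tmp)
```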

In [None]:
len(docs_data)

22

In [None]:
docs_data[0]

In [None]:
# BERT-based embedding models truncate inputs longer than their maximum
# length, so it is a good idea to split each document into sentences first.

sentences = list()

for content in docs_data:
    for sentence in sent_tokenize(content):
        sentences.append(sentence)
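
The nested loop flattens all documents into a single list of sentences. The sketch below shows the same flattening as a list comprehension. To stay runnable without the NLTK data, it swaps `sent_tokenize` for `naive_split`, a hypothetical and much cruder regex splitter; the sample documents are invented.

```python
import re


def naive_split(text):
    """Rough stand-in for sent_tokenize: split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


docs = [
    "A engenharia é ampla. Ela tem muitas áreas!",
    "O estudo continua.",
]

# One flat list holding the sentences of every document, in order.
sentences = [sentence for doc in docs for sentence in naive_split(doc)]
```

A regex like this mishandles abbreviations and decimal numbers, which is why the notebook relies on NLTK's trained tokenizer instead.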

In [None]:
len(sentences)

1504

## ***◼️ Creating the model:***

In [None]:
# Hyperparameters in BERTopic:
# 1. top_n_words = number of words kept per topic
# 2. min_topic_size = minimum number of documents needed to form a topic
# 3. nr_topics = number of topics the initial topics are reduced to
# 4. vectorizer_model = vectorizer used to build the topic representations
# 5. embedding_model = sentence embedding model used to embed the documents

In [None]:
# CountVectorizer parameters:
# 1. ngram_range = smallest and largest n-gram sizes (in tokens) extracted as
# candidate topic terms
# 2. min_df = minimum number of documents a term must appear in to be kept

model = BERTopic(
    language='portuguese',  # not used when embedding_model is set
    verbose=True,
    top_n_words=15,
    min_topic_size=10,
    nr_topics=20,
    embedding_model='all-MiniLM-L6-v2',
    # embedding_model='xlm-r-bert-base-nli-stsb-mean-tokens',
    vectorizer_model=CountVectorizer(
        ngram_range=(1, 3),
        stop_words=stopwords.words("portuguese"),
        min_df=10
    )
)
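
To make `ngram_range` and `min_df` more concrete, here is a plain-Python sketch of what the two settings control. It is not how `CountVectorizer` is implemented: tokenization is reduced to a whitespace split, no stop words are removed, and `ngrams` and `vocabulary` are hypothetical helper names.

```python
from collections import Counter


def ngrams(tokens, low, high):
    """Yield every n-gram of length low..high (inclusive) as a space-joined string."""
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def vocabulary(docs, ngram_range=(1, 3), min_df=2):
    """Keep the n-grams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # set(): a document counts once per n-gram, however often it repeats.
        doc_freq.update(set(ngrams(tokens, *ngram_range)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = [
    "engenharia de controle",
    "engenharia de sistemas",
    "sistemas de controle",
]
vocab = vocabulary(docs, ngram_range=(1, 2), min_df=2)
```

With these three toy documents, bigrams such as "de sistemas" appear in only one document and are dropped, while "engenharia de" and "de controle" survive. In the notebook the stop-word list would additionally remove "de".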

In [None]:
%%time
topics, probs = model.fit_transform(sentences)

2023-09-30 21:12:41,430 - BERTopic - Transformed documents to Embeddings
2023-09-30 21:12:56,652 - BERTopic - Reduced dimensionality
2023-09-30 21:12:56,737 - BERTopic - Clustered reduced embeddings
2023-09-30 21:12:57,229 - BERTopic - Reduced number of topics from 29 to 20


## ***◼️ Visualization:***

In [None]:
# Topic -1 collects the outliers: sentences not assigned to any topic
frequency = model.get_topic_info()
frequency.head(10)

|   | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 216 | -1_engenheiro_além_desenvolvimento_campo | [engenheiro, além, desenvolvimento, campo, tra... | [Um engenheiro hídrico também deve preocupar-s... |
| 1 | 0 | 230 | 0_engenharia_sistemas_controle_sistema | [engenharia, sistemas, controle, sistema, dese... | [[21]\n\nEspecializações\nA Engenharia Elétric... |
| 2 | 1 | 162 | 1_estudo_vez_outras_sendo | [estudo, vez, outras, sendo, sobre, podem, out... | [O conteúdo não verificável pode ser removido.... |
| 3 | 2 | 153 | 2_tempo_maior_cada_pode | [tempo, maior, cada, pode, meio, através, form... | [Um aumento na PRF resulta em uma diminuição n... |
| 4 | 3 | 95 | 3_podem_diversos_ainda_qualidade | [podem, diversos, ainda, qualidade, utilizados... | [[8]\n\nOs diversos processos dentro e fora do... |
| 5 | 4 | 95 | 4_estruturas_sobre_tipos_meio | [estruturas, sobre, tipos, meio, assim, princi... | [Estes micrótomos, no entanto, forneciam corte... |
| 6 | 5 | 87 | 5_sistema_podem_através_processo | [sistema, podem, através, processo, outras, pr... | [Os mamíferos reagem à infecção através do sis... |
| 7 | 6 | 67 | 6_engenheiro_formação_conhecimento_assim | [engenheiro, formação, conhecimento, assim, co... | [[16]\n\nÉ necessária para a formação de um en... |
| 8 | 7 | 62 | 7_através_estudo_partir_podem | [através, estudo, partir, podem, forma, propri... | [A mineração à superfície envolve a extração d... |
| 9 | 8 | 47 | 8_grandes_durante_então_estruturas | [grandes, durante, então, estruturas, grande, ... | [Na América, por fim, as grandes civilizações ... |
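
The `Count` column gives the number of sentences assigned to each topic. Assuming the `topics` list returned by `fit_transform` holds one topic id per input sentence, the same tallies can be rebuilt with a `Counter`. The ids below are invented for illustration and are not the notebook's results.

```python
from collections import Counter

# Hypothetical topic assignments, one per sentence; -1 marks outliers.
topics = [0, 0, -1, 1, 0, 1, -1, 2, 0, -1]

counts = Counter(topics)

# Fraction of sentences left without a topic.
outlier_share = counts[-1] / len(topics)

# Topics ordered from largest to smallest, outliers excluded.
ranked = [(topic, n) for topic, n in counts.most_common() if topic != -1]
```

A large outlier share can be a hint that `min_topic_size` is too strict for the amount of data available.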


In [None]:
model.get_topic(5)
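
`get_topic` is expected to return the topic's terms as `(word, score)` pairs. The sketch below turns such pairs into a short readable label. The pairs and their scores are invented, and `describe` is a hypothetical helper, not part of BERTopic.

```python
# Hypothetical (word, score) pairs in the shape get_topic is expected to return.
topic_words = [("sistema", 0.042), ("processo", 0.031), ("através", 0.027)]


def describe(pairs, top=2):
    """Join the `top` highest-scoring words into a short label."""
    best = sorted(pairs, key=lambda pair: pair[1], reverse=True)[:top]
    return ", ".join(word for word, _ in best)


label = describe(topic_words)
```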

In [None]:
model.visualize_barchart(n_words=15, top_n_topics=20)

In [None]:
model.visualize_hierarchy(top_n_topics=15)

In [None]:
model.visualize_heatmap(n_clusters=15)