# Text Mining

Student: Ana Letícia Garcez Vicente

USP No.: 10746842

# 1.1 Problem Identification

Depression is a mental disorder that affects millions of people worldwide, impairing individuals' mental health and, consequently, their quality of life, relationships, and productivity. According to the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders), depressive disorders include "disruptive mood dysregulation disorder, major depressive disorder (including major depressive episode), persistent depressive disorder (dysthymia), premenstrual dysphoric disorder, substance/medication-induced depressive disorder, depressive disorder due to another medical condition, other specified depressive disorder, and unspecified depressive disorder" (DSM-5, p. 155). Extracting patterns from behaviors associated with this disorder is essential for improving knowledge and diagnosis of the condition, as well as for pointing to possible focuses of treatment.

On social networks such as *Twitter* or *Reddit*, many people share their feelings, thoughts, and wishes in short written texts, including texts linked to depressive behavior. One way to analyze these texts is to combine data mining with artificial intelligence tools, which makes it possible to handle large volumes of information and to identify patterns, trends, and recurring topics automatically.

This work uses a dataset of texts extracted from Reddit, labeled with two categories: "depression" and "not depression". Our main goal is to extract and analyze the main topics discussed in the texts labeled as depressive using text mining techniques and then apply clustering to organize these topics into relevant groups.

The dataset used is [Depression: Reddit Dataset Cleaned](https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned/data), obtained from Kaggle.

In [None]:
# Download Dataset

!gdown --id 14RqYe0Su76kOgsd9MaNC3yGNKglYJLEW

Downloading...
From: https://drive.google.com/uc?id=14RqYe0Su76kOgsd9MaNC3yGNKglYJLEW
To: /content/depression_dataset_reddit_cleaned.csv
100% 2.82M/2.82M [00:00<00:00, 82.7MB/s]


In [None]:
import pandas as pd

dataset = pd.read_csv('depression_dataset_reddit_cleaned.csv')
dataset

Unnamed: 0,clean_text,is_depression
0,we understand that most people who reply immed...,1
1,welcome to r depression s check in post a plac...,1
2,anyone else instead of sleeping more when depr...,1
3,i ve kind of stuffed around a lot in my life d...,1
4,sleep is my greatest and most comforting escap...,1
...,...,...
7726,is that snow,0
7727,moulin rouge mad me cry once again,0
7728,trying to shout but can t find people on the list,0
7729,ughh can t find my red sox hat got ta wear thi...,0


As stated above, we will use only the texts labeled as depression ("is_depression" = 1).

In [None]:
df_depression = pd.DataFrame(dataset[dataset['is_depression'] == 1]['clean_text'])

In [None]:
# Looking at an example
df_depression['clean_text'].iloc[15]

'i don t think i have the ball to do it but i ve become obsessed with the idea of killing myself all i can think about is suicide i ve developed a deep and genuine hatred for myself i don t want to live to see another day i don t want to get better bc i don t deserve it i wish i had the courage to kill myself'

Although the dataset is described as containing only English texts, some texts are in other languages, such as German and French. We therefore remove these texts using the langdetect library.

In [None]:
pip install langdetect



In [None]:
from langdetect import detect

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        # langdetect raises when it finds no usable features (e.g. empty text)
        return 'unknown'

# detect the language of each text: returns the detected language code
# ('en' for English) or 'unknown' when detection fails
df_depression['language'] = df_depression['clean_text'].apply(detect_language)

# rebuild the dataset with only the texts detected as English
english_df = df_depression[df_depression['language'] == 'en']
english_df

Unnamed: 0,clean_text,language
0,we understand that most people who reply immed...,en
1,welcome to r depression s check in post a plac...,en
2,anyone else instead of sleeping more when depr...,en
3,i ve kind of stuffed around a lot in my life d...,en
4,sleep is my greatest and most comforting escap...,en
...,...,...
3825,divya gandotra to be in continuous state of do...,en
3826,thlolo march eh it s because i don t want stre...,en
3827,i hate it when i m having depression day and t...,en
3829,dmt powder helping with depression amp anxiety...,en


# 1.2 Preprocessing

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine
import numpy as np
import networkx as nx
import string
!pip install plotly
from plotly import graph_objs as go
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('rslp')
from nltk.tokenize import word_tokenize
from nltk.stem.porter import *



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package rslp to /root/nltk_data...
[nltk_data]   Package rslp is already up-to-date!


In [None]:
def remove_stopwords(text, stop_words):

  # text: text from which the stop words are removed
  # stop_words: collection of words to discard

  # lowercase everything
  s = str(text).lower()

  # tokenize the text
  tokens = word_tokenize(s)

  # remove stop words, digits, special characters, and punctuation
  v = [word for word in tokens if word not in stop_words and word.isalnum() and not word.isdigit()]

  return v


# words tied to the context of the texts: the topic itself and internet language
CONTEXT_STOP_WORDS = ['depression', 'http', 'co', 'wa', 'im']

# built once, so the stop-word collection is not mutated on every tokenizer call
DEFAULT_STOP_WORDS = set(nltk.corpus.stopwords.words('english')) | set(CONTEXT_STOP_WORDS)


def meu_tokenizador(doc, stop_words=None):

  if stop_words is None:
    stop_words = DEFAULT_STOP_WORDS

  tokens = remove_stopwords(doc, stop_words)

  return tokens

In [None]:
# Inspecting the most frequent bigrams in the texts

VSM = TfidfVectorizer(tokenizer=meu_tokenizador,min_df=2,ngram_range=(2,2))
X = VSM.fit_transform(english_df['clean_text'])



In [None]:
df_bigrams_tfidfs = pd.DataFrame()
df_bigrams_tfidfs['word'] = VSM.get_feature_names_out()
df_bigrams_tfidfs['tfidf_sum'] = X.toarray().sum(axis=0)
df_bigrams_tfidfs.sort_values(by='tfidf_sum',ascending=False,inplace=True)
df_bigrams_tfidfs.head(20)

Unnamed: 0,word,tfidf_sum
7144,feel like,68.241272
14480,mental health,24.017865
24089,wan na,20.893953
9183,gon na,20.8804
1324,anyone else,18.623088
16509,panic attack,18.452592
14031,make feel,15.348988
8348,get better,13.895437
6250,every day,13.44904
13855,loved one,13.109478
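
The ranking above sums TF-IDF weights over all documents. As a simpler illustration of how bigrams such as "feel like" arise once stop words are filtered out, the following minimal sketch counts raw bigram frequencies instead of TF-IDF weights. It uses whitespace tokenization rather than NLTK, and the function names, toy texts, and toy stop-word set are hypothetical, not part of the notebook.

```python
from collections import Counter


def bigrams(tokens):
    """Join each pair of adjacent tokens into a 'w1 w2' string."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def bigram_counts(texts, stop_words):
    """Count bigrams over all texts after dropping stop words and non-words."""
    counts = Counter()
    for text in texts:
        tokens = [
            t for t in text.lower().split()
            if t.isalnum() and not t.isdigit() and t not in stop_words
        ]
        counts.update(bigrams(tokens))
    return counts


# toy data (assumption: illustrative only, not taken from the dataset)
toy_texts = [
    "i feel like i can not sleep",
    "i feel like nobody cares",
    "my mental health is bad",
]
toy_stop_words = {"i", "my", "is", "can", "not"}

toy_counts = bigram_counts(toy_texts, toy_stop_words)
print(toy_counts.most_common(3))
```

Because stop words are removed before pairing, two words that were separated by a stop word in the original text can end up forming a bigram.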


In [None]:
# removing the words related to the 20 most frequent n-grams in the dataset
lista_ngramas = df_bigrams_tfidfs['word'].head(20).tolist()

stop_words = nltk.corpus.stopwords.words('english')
# context
stop_words.extend(['depression', 'http', 'co', 'wa', 'im'])
# n-grams
stop_words.extend([
  'like', 'feel', 'feeling', 'felt',
  'gon', 'na', 'does', 'wan',
  'plan', 'planning', 'play', 'playing',
  'point', 'pointless',
])

## Bag of words

In [None]:
# Bag of Words

def compute_bag_of_words(dataset,lang):

  # lang is not used; stop words come from the global stop_words list
  d = []

  for text in dataset['clean_text']:
    # drop stop words and rebuild the text from the remaining tokens
    v = remove_stopwords(text, stop_words)
    d.append(' '.join(v))

  # keep the 1000 most frequent terms
  matrix = CountVectorizer(max_features=1000)
  X = matrix.fit_transform(d)

  count_vect_df = pd.DataFrame(X.todense(), columns=matrix.get_feature_names_out())

  return count_vect_df


bow = compute_bag_of_words(english_df,'english')
bow

Unnamed: 0,00b,0mg,ability,able,absolutely,abuse,abusive,accept,accident,account,...,wrong,wrote,yeah,year,yes,yesterday,yet,young,younger,zoloft
0,0,0,1,1,0,1,0,1,0,2,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3617,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3618,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3619,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3620,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## BERT

In [None]:
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np
import logging



In [None]:
# Loading the pre-trained model
model = SentenceTransformer('distiluse-base-multilingual-cased')

In [None]:
# creating an embedding for each text with the model
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning

english_df = english_df.assign(bert_embedding=list(model.encode(english_df.clean_text.to_list())))
english_df


Unnamed: 0,clean_text,language,bert_embedding
0,we understand that most people who reply immed...,en,"[-0.06543694, 0.032098528, 0.019543484, -0.038..."
1,welcome to r depression s check in post a plac...,en,"[0.032912392, -0.059611227, 0.010005628, -0.03..."
2,anyone else instead of sleeping more when depr...,en,"[-0.04232005, -0.03977695, -0.03429169, -0.075..."
3,i ve kind of stuffed around a lot in my life d...,en,"[0.019328248, -0.02408923, -0.008499024, -0.02..."
4,sleep is my greatest and most comforting escap...,en,"[0.011330487, -0.040457632, -0.059397433, -0.0..."
...,...,...,...
3825,divya gandotra to be in continuous state of do...,en,"[0.031196265, -0.03564598, -0.06852871, -0.043..."
3826,thlolo march eh it s because i don t want stre...,en,"[0.09839182, -0.002173871, -0.104865715, -0.03..."
3827,i hate it when i m having depression day and t...,en,"[0.039205052, -0.00036558133, -0.06826417, -0...."
3829,dmt powder helping with depression amp anxiety...,en,"[0.029033495, -0.041709296, -0.022431426, -0.0..."


# 1.3 Pattern Extraction

K-Means will be used to cluster the texts.
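
As a rough illustration of what K-Means does, the sketch below implements the basic Lloyd iteration in pure Python: assign each point to its nearest centroid, then move each centroid to the mean of its members. It is not the implementation used in this notebook; scikit-learn's `KMeans` additionally uses k-means++ initialization and several restarts (`n_init`). The function name `kmeans_sketch` and the toy points are hypothetical.

```python
import math
import random


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Basic Lloyd iteration on a list of equal-length tuples.

    Returns (labels, centroids). Initial centroids are k randomly chosen points.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: index of the nearest centroid for every point
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # update step: each centroid becomes the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # an empty cluster keeps its previous centroid
                new_centroids.append(centroids[c])
        # stop when neither the labels nor the centroids change
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# toy data (assumption: two well-separated groups, illustrative only)
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

The cluster ids themselves are arbitrary; only the grouping of points carries meaning, which is why the cluster numbers in the tables below have no inherent order.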

In [None]:
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph
import numpy as np
num_cluster = 20

## Bag of words

In [None]:
X_bow = np.array(bow)

# compute the length of each row vector in X_bow (Euclidean norm)
length = np.sqrt((X_bow**2).sum(axis=1))[:,None]

# handle zero-length rows (texts with no vocabulary words) to avoid division by zero
zero_length_indices = np.where(length == 0)
length[zero_length_indices] = 1e-6

# normalize each row to unit length
X_bow = X_bow / length

kmeans_bow = KMeans(n_clusters=num_cluster, random_state=0, n_init=10).fit(X_bow)
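
Normalizing each row to unit length before running K-Means makes Euclidean distance behave like cosine distance: for unit vectors u and v, ||u - v||² = 2(1 - cos(u, v)). The following sketch checks this identity numerically in pure Python; the helper names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def squared_euclidean(u, v):
    """Squared Euclidean distance between u and v."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# toy count vectors (assumption: illustrative only)
u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

lhs = squared_euclidean(l2_normalize(u), l2_normalize(v))
rhs = 2.0 * (1.0 - cosine_similarity(u, v))
print(lhs, rhs)
```

One consequence is that document length stops mattering: a long post and a short post with the same word proportions map to the same unit vector.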

In [None]:
# build a 2-nearest-neighbor graph using the 'cosine' metric
A_bow = kneighbors_graph(bow, 2, metric='cosine')
G_bow = nx.Graph(A_bow.toarray())

# for each node (text): id, top-20 words by count, original text, and K-Means cluster
# note: the top-20 list has a fixed size, so it also contains zero-count words
# when a text has fewer than 20 distinct vocabulary words
L_nodes = []
for node in G_bow.nodes():
  L_nodes.append([node,list(pd.DataFrame(bow.iloc[node]).sort_values(by=node,ascending=False).head(20).index), english_df['clean_text'].iloc[node], kmeans_bow.labels_[node]])

In [None]:
df_nodes_bow = pd.DataFrame(L_nodes)
df_nodes_bow.columns = ['id', 'bag', 'text', 'cluster']
df_nodes_bow

Unnamed: 0,id,bag,text,cluster
0,0,"[mental, people, health, much, help, someone, ...",we understand that most people who reply immed...,5
1,1,"[post, support, giving, help, issue, see, plac...",welcome to r depression s check in post a plac...,4
2,2,"[else, anxiety, instead, sleeping, coming, eve...",anyone else instead of sleeping more when depr...,11
3,3,"[job, thing, move, month, working, lot, ha, ev...",i ve kind of stuffed around a lot in my life d...,11
4,4,"[escape, hell, first, sleep, problem, waking, ...",sleep is my greatest and most comforting escap...,19
...,...,...,...,...
3617,3617,"[doubt, state, seems, anxiety, positive, presc...",divya gandotra to be in continuous state of do...,10
3618,3618,"[stress, want, 00b, prescribed, physical, phys...",thlolo march eh it s because i don t want stre...,9
3619,3619,"[sad, also, hate, brain, want, something, happ...",i hate it when i m having depression day and t...,17
3620,3620,"[amp, anxiety, helping, 00b, putting, physical...",dmt powder helping with depression amp anxiety...,10


In [None]:
# collect one row per cluster
rows = []

for cluster in df_nodes_bow['cluster'].unique():

  # select all texts in the same cluster
  cluster_data = df_nodes_bow[df_nodes_bow['cluster'] == cluster]

  # concatenate the top words of every text in the cluster
  cluster_text = ' '.join(' '.join(bag) for bag in cluster_data['bag'])

  # tokenize
  tokens = cluster_text.split()
  word_series = pd.Series(tokens)

  # top n most frequent words
  top_words = word_series.value_counts().head(6)

  # add the row for this cluster
  rows.append({'Cluster': cluster, 'Top Words': ', '.join(top_words.index)})

cluster_words_bow = pd.DataFrame(rows, columns=['Cluster', 'Top Words'])
cluster_words_bow = cluster_words_bow.sort_values(by='Cluster')
print(cluster_words_bow)

   Cluster                                        Top Words
10       0          time, piece, picture, pick, pill, place
12       1               time, know, get, even, want, thing
9        2              year, last, time, ha, anxiety, back
14       3       go, please, pill, place, picture, positive
1        4       help, please, place, pill, positive, piece
0        5      people, place, please, pill, picture, piece
15       6          fucking, want, hate, fuck, people, life
17       7         dont, know, want, life, please, positive
6        8       know, please, pill, positive, place, piece
4        9       want, please, place, pill, piece, possible
13      10    anxiety, place, piece, pill, please, positive
2       11         thing, know, time, place, piece, picture
19      12  place, piece, picture, theekween, anxiety, pick
7       13          really, want, know, time, please, place
5       14           life, want, know, people, friend, even
8       15   please, pill, place, piece,

In [None]:
# Viewing some of the texts in a cluster

cluster_data = pd.DataFrame(df_nodes_bow[df_nodes_bow['cluster'] == 1]).reset_index()
for i in range(5):
  print(cluster_data['text'].iloc[i])
  print()

i ve been in a bad spot for a long time i ve dealt with a lot of grief a lot of handling shit on my own and trying to keep up appearance but thing took a turn for the worse when i had a traumatic event a few month ago that sent me over the edge i developed post traumatic stress disorder from it all and coping since ha been excruciatingly difficult i threw myself into work for about a month and a half i quickly burned out the trigger became an everyday occurrence i wanted help i needed help but i wa afraid every hand extended towards me would only reach to choke me those around me could drown me in an ocean of love and i d never feel wet my clothes may be wet but my skin my heart would feel dry a a desert i feel so closed off yet i crave closeness i can t even remember the last few month but from what i can i ve been destructive i ve been in so much emotional pain that i ve unintentionally caused emotional pain nothing is ever good enough for me it s not a conscious thing you see it s m

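The collected data is noisy because of the nature of writing on the internet, which makes both the effective application of models and human interpretation of the results challenging. Some texts are particularly obscure about the concepts they involve, which makes it hard to find topics in the clusters.

Words such as "pill" appear frequently among the top words, and medication is a theme related to depression. This frequency should be read with caution, however: the `bag` column always holds 20 words, so short posts appear to be padded with vocabulary words whose count is zero (for example "00b", "prescribed", and "physical" in row 3618), and such padding words, including "pill", "place", and "piece", are inflated in the per-cluster counts.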
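
The per-text word lists above always contain 20 entries, so a text with fewer than 20 vocabulary words also contributes words it does not contain. One possible way to avoid this, sketched here in pure Python under the assumption that each text is represented as a word-to-count mapping, is to keep only words with a positive count before taking the top n. The function name and the toy counts are hypothetical.

```python
def top_present_words(counts, n=20):
    """Return up to n words with a positive count, most frequent first."""
    present = [(word, c) for word, c in counts.items() if c > 0]
    # sort by descending count, then alphabetically for a stable order
    present.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in present[:n]]


# toy counts for one short text (assumption: illustrative only)
toy_counts = {"stress": 1, "want": 2, "pill": 0, "place": 0, "piece": 0}
toy_top = top_present_words(toy_counts, n=4)
print(toy_top)
```

With pandas, the same idea would be to filter a row to its positive counts before sorting, for example `row[row > 0]`, where `row` stands for one row of the bag-of-words matrix.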

## BERT

In [None]:
vsm = pd.DataFrame(np.array(english_df['bert_embedding'].to_list()))
A_vsm = kneighbors_graph(vsm, 2, metric='cosine')
G_vsm = nx.Graph(A_vsm.toarray())

kmeans_vsm = KMeans(n_clusters=num_cluster, random_state=0, n_init=10).fit(vsm)

In [None]:
L_nodes = []
for node in G_vsm.nodes():
  L_nodes.append([node,list(pd.DataFrame(bow.iloc[node]).sort_values(by=node,ascending=False).head(20).index), english_df['clean_text'].iloc[node], kmeans_vsm.labels_[node]])

In [None]:
df_nodes_vsm = pd.DataFrame(L_nodes)
df_nodes_vsm.columns = ['id','bag', 'text', 'cluster']
df_nodes_vsm

Unnamed: 0,id,bag,text,cluster
0,0,"[mental, people, health, much, help, someone, ...",we understand that most people who reply immed...,0
1,1,"[post, support, giving, help, issue, see, plac...",welcome to r depression s check in post a plac...,15
2,2,"[else, anxiety, instead, sleeping, coming, eve...",anyone else instead of sleeping more when depr...,18
3,3,"[job, thing, move, month, working, lot, ha, ev...",i ve kind of stuffed around a lot in my life d...,6
4,4,"[escape, hell, first, sleep, problem, waking, ...",sleep is my greatest and most comforting escap...,18
...,...,...,...,...
3617,3617,"[doubt, state, seems, anxiety, positive, presc...",divya gandotra to be in continuous state of do...,14
3618,3618,"[stress, want, 00b, prescribed, physical, phys...",thlolo march eh it s because i don t want stre...,3
3619,3619,"[sad, also, hate, brain, want, something, happ...",i hate it when i m having depression day and t...,6
3620,3620,"[amp, anxiety, helping, 00b, putting, physical...",dmt powder helping with depression amp anxiety...,2


In [None]:
# collect one row per cluster
rows = []

for cluster in df_nodes_vsm['cluster'].unique():

    # select all texts in the same cluster
    cluster_data = df_nodes_vsm[df_nodes_vsm['cluster'] == cluster]

    # concatenate the top words of every text in the cluster
    cluster_text = ' '.join(' '.join(bag) for bag in cluster_data['bag'])

    # tokenize
    tokens = cluster_text.split()
    word_series = pd.Series(tokens)

    # top n most frequent words
    top_words = word_series.value_counts().head(7)

    # add the row for this cluster
    rows.append({'Cluster': cluster, 'Top Words': ', '.join(top_words.index)})

cluster_words_vsm = pd.DataFrame(rows, columns=['Cluster', 'Top Words'])
cluster_words_vsm = cluster_words_vsm.sort_values(by='Cluster')
print(cluster_words_vsm)

   Cluster                                          Top Words
0        0     people, friend, want, know, time, really, make
14       1   anxiety, time, know, get, please, really, attack
17       2  positive, pill, piece, please, place, picture,...
15       3  please, place, possible, positive, pill, pictu...
5        4             want, life, get, time, know, mom, year
8        5          want, life, kill, know, please, time, die
3        6          know, get, time, want, life, really, help
11       7    pill, place, please, know, time, piece, picture
12       8        school, know, time, want, friend, even, get
18       9  piece, pill, please, place, positive, possible...
10      10          job, know, time, work, get, anxiety, want
13      11  anxiety, pill, anyone, place, medication, take...
6       12  pill, picture, piece, place, please, positive,...
4       13       life, want, know, get, even, anymore, people
16      14  piece, picture, place, pill, please, positive,...
1       

In [None]:
# Viewing some of the texts in a cluster

cluster_data_vsm = pd.DataFrame(df_nodes_vsm[df_nodes_vsm['cluster'] == 12]).reset_index()
for i in range(8):
  print(cluster_data_vsm['text'].iloc[i])
  print()

idk how to elaborate on it i just started suddenly cry for no real reason and couldn t stop for like 0 minute doe anyone else have this problem i m just wondering

nothing in life is enjoyable not to mention that i have like missing assignment i could be doing right now

i just can feel it i can t explain it but i can feel it i feel like this is my true self and if it go on i ll lose it

i don t know what to feel but i just am tired and over it and there s no end to running on a hamster wheel of constant sadness ugh

like a battery in a remote s back that keep it working i wish i could also remove the battery and just turn off for a while

this isn t getting better and i don t want to be here anymore

i can t take it anymore

but telling them im not will just make them worry they got their own problem dont need mine too



The same data issues appear here. Even so, some patterns can be identified: several clusters appear to have well-defined themes. For example, one cluster relates to school and friendships (Cluster 8), another reflects a worrying tendency toward suicide (Cluster 5), and another centers on anxiety and attacks (Cluster 1). The words "pill" and "medication" also appear in several clusters, which may suggest a connection with medication. This should be read with caution, since the `bag` lists always hold 20 words and, for short posts, can include vocabulary words with a zero count, which may inflate words such as "pill".

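# 1.4 Post-Processing

To evaluate the quality of the clusters created in the previous step, we use the silhouette measure. For each point, it compares the mean distance to the other points in its own cluster (a) with the mean distance to the points in the nearest other cluster (b), giving s = (b - a) / max(a, b). Values close to 1 indicate a well-assigned point, values close to 0 indicate overlapping clusters, and negative values indicate that the point is on average closer to another cluster than to its own.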
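
To make the definition concrete, the sketch below computes per-point silhouette values in pure Python with Euclidean distance. It is only an illustration of the formula s = (b - a) / max(a, b), where a is the mean distance to the other points of the same cluster and b is the mean distance to the nearest other cluster; the notebook itself relies on scikit-learn's `silhouette_score` and `silhouette_samples`. The function name and the toy data are hypothetical.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette with Euclidean distance (needs at least 2 clusters)."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = labels[i]
        # distances to the other points of the same cluster
        same = [
            math.dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == own and j != i
        ]
        if not same:
            # convention: a point alone in its cluster scores 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # mean distance to each other cluster; keep the smallest
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / sum(1 for lab in labels if lab == c)
            for c in clusters
            if c != own
        )
        scores.append((b - a) / max(a, b))
    return scores


# toy data (assumption: two tight, well-separated 1-D clusters)
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]

toy_scores = silhouette_values(toy_points, toy_labels)
toy_mean = sum(toy_scores) / len(toy_scores)
print(toy_scores, toy_mean)
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5, close to 1 as expected for well-separated clusters.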

## Bag of Words

In [None]:
from sklearn.metrics import silhouette_score, silhouette_samples

# The data points (raw counts; K-Means above was fit on the L2-normalized vectors)
X_bow = np.array(bow)

# K-Means results
kmeans_labels = kmeans_bow.labels_

# Compute the silhouette score
silhouette_avg = silhouette_score(X_bow, kmeans_labels)

print(f"A medida de silhueta média para {num_cluster} clusters é: {silhouette_avg}")

A medida de silhueta média para 20 clusters é: -0.21492695403983025


In [None]:
silhouette_samples_values = silhouette_samples(X_bow, kmeans_labels)

# add each point's silhouette value to the bag-of-words DataFrame
df_nodes_bow['silhouette_score'] = silhouette_samples_values

# compute the mean silhouette for each bag-of-words cluster
silhouette_avg_per_cluster = []
for cluster_id in df_nodes_bow['cluster'].unique():
    cluster_data = df_nodes_bow[df_nodes_bow['cluster'] == cluster_id]
    cluster_silhouette_avg = cluster_data['silhouette_score'].mean()
    silhouette_avg_per_cluster.append((cluster_id, cluster_silhouette_avg))

# print the mean silhouette for each cluster
for cluster_id, silhouette_avg in silhouette_avg_per_cluster:
    print(f"Cluster {cluster_id}: Silhueta média = {silhouette_avg}")

Cluster 0: Silhueta média = -0.20435587054818552
Cluster 15: Silhueta média = -0.31904545505011067
Cluster 18: Silhueta média = -0.21715290793129152
Cluster 6: Silhueta média = -0.22150868054752948
Cluster 13: Silhueta média = -0.23546650720935203
Cluster 4: Silhueta média = -0.18243071836263314
Cluster 12: Silhueta média = -0.3003158587892837
Cluster 16: Silhueta média = -0.20918120408455865
Cluster 5: Silhueta média = -0.23976498622626882
Cluster 17: Silhueta média = -0.21726680623209244
Cluster 10: Silhueta média = -0.20779007942856642
Cluster 7: Silhueta média = -0.23419064469772682
Cluster 8: Silhueta média = -0.18827413345903068
Cluster 11: Silhueta média = -0.2160209609003744
Cluster 1: Silhueta média = -0.21157231315703037
Cluster 3: Silhueta média = -0.17904911799876322
Cluster 14: Silhueta média = -0.25790077679097106
Cluster 2: Silhueta média = -0.22377434650014819
Cluster 9: Silhueta média = -0.2605727307540028
Cluster 19: Silhueta média = 0.41757616408386833


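The average silhouette is negative, which suggests that the clustering is poor, since points are on average closer to a neighboring cluster than to their own. Note that the score is computed on the raw count vectors, whereas K-Means was fit on the L2-normalized vectors, so part of the negative value may reflect this mismatch rather than the clustering itself.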

## BERT

In [None]:
# Compute the silhouette score
silhouette_avg = silhouette_score(vsm, kmeans_vsm.labels_)
print("A média da medida de silhueta é:", silhouette_avg)

A média da medida de silhueta é: 0.021373965


In [None]:
silhouette_samples_values = silhouette_samples(vsm, kmeans_vsm.labels_)

# add each point's silhouette value to the BERT DataFrame
df_nodes_vsm['silhouette_score'] = silhouette_samples_values

# compute the mean silhouette for each BERT cluster
silhouette_avg_per_cluster = []
for cluster_id in df_nodes_vsm['cluster'].unique():
    cluster_data = df_nodes_vsm[df_nodes_vsm['cluster'] == cluster_id]
    cluster_silhouette_avg = cluster_data['silhouette_score'].mean()
    silhouette_avg_per_cluster.append((cluster_id, cluster_silhouette_avg))

# print the mean silhouette for each cluster
for cluster_id, silhouette_avg in silhouette_avg_per_cluster:
    print(f"Cluster {cluster_id}: Silhueta média = {silhouette_avg}")

Cluster 5: Silhueta média = 0.007972242310643196
Cluster 4: Silhueta média = 0.06221683323383331
Cluster 11: Silhueta média = 0.012847842648625374
Cluster 19: Silhueta média = 0.017464738339185715
Cluster 9: Silhueta média = 0.021633194759488106
Cluster 14: Silhueta média = 0.014979781582951546
Cluster 8: Silhueta média = 0.015492619946599007
Cluster 13: Silhueta média = 0.014308024197816849
Cluster 15: Silhueta média = 0.008725092746317387
Cluster 2: Silhueta média = 0.021673545241355896
Cluster 0: Silhueta média = 0.01800110563635826
Cluster 18: Silhueta média = 0.01964113488793373
Cluster 1: Silhueta média = 0.025322936475276947
Cluster 10: Silhueta média = 0.01848583295941353
Cluster 3: Silhueta média = 0.018533915281295776
Cluster 6: Silhueta média = 0.014217589981853962
Cluster 16: Silhueta média = 0.02086196094751358
Cluster 7: Silhueta média = 0.023397212848067284
Cluster 17: Silhueta média = 0.01610546186566353
Cluster 12: Silhueta média = 0.2593909800052643


The silhouette is close to zero, indicating overlap between clusters.

# 1.5 Use of Knowledge

Extracting discussion topics from texts with depressive content makes it possible to monitor areas of concern and identify risk situations, such as suicidal tendencies. This approach can also provide additional insight into the thought patterns associated with depression. Such a deeper understanding not only contributes to research on depression but may also have a positive impact on the development of more effective treatment strategies.