In [7]:
import pandas as pd
import numpy as np
import emoji

First, we load the data.

In [8]:
raw_data = pd.read_csv('mental_disorders_reddit.csv', sep=',')

Then we inspect its structure.

In [9]:
raw_data.head()

Unnamed: 0,title,selftext,created_utc,over_18,subreddit
0,Life is so pointless without others,Does anyone else think the most important part...,1650356960,False,BPD
1,Cold rage?,Hello fellow friends 😄\n\nI'm on the BPD spect...,1650356660,False,BPD
2,I don’t know who I am,My [F20] bf [M20] told me today (after I said ...,1650355379,False,BPD
3,HELP! Opinions! Advice!,"Okay, I’m about to open up about many things I...",1650353430,False,BPD
4,help,[removed],1650350907,False,BPD


The dataset contains 5 columns:

- **title**: Title of the post
- **selftext**: Body text of the post
- **created_utc**: Creation time of the post (Unix timestamp, UTC)
- **over_18**: Whether the post is flagged as 18+
- **subreddit**: Subreddit the post belongs to

In [10]:
raw_data.shape

(701787, 5)

The dataset description states that it contains null values, so we drop them.

In [11]:
raw_data.dropna(inplace=True)
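Before dropping rows it can help to see how many values are missing in each column. The following is a minimal sketch on a made-up toy frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data (values are invented for illustration).
toy = pd.DataFrame({
    "title": ["a", None, "c"],
    "selftext": ["x", "y", np.nan],
    "subreddit": ["BPD", "BPD", "Anxiety"],
})

# Number of missing values per column.
null_counts = toy.isna().sum()
print(null_counts)

# Drop every row that has at least one missing value, and report how many went away.
clean = toy.dropna()
print(len(toy) - len(clean), "rows dropped")
```

The same two calls (`isna().sum()` and `dropna()`) apply unchanged to `raw_data`.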

We also drop the posts whose text was removed.

In [12]:
removed_index = raw_data[raw_data['selftext'] == '[removed]'].index

In [13]:
raw_data.drop(removed_index, inplace=True)
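The same filter can be written as a boolean mask, which also makes it easy to count the affected rows first. This is only a sketch on toy data: treating `[deleted]` as a second placeholder and stripping surrounding whitespace are assumptions, since the notebook only checks for an exact `[removed]`.

```python
import pandas as pd

# Toy column standing in for raw_data['selftext'] (values are invented).
toy = pd.DataFrame({
    "selftext": ["real post", "[removed]", " [removed] ", "[deleted]", "another"]
})

# "[deleted]" is an assumption; the notebook only filters "[removed]".
placeholders = {"[removed]", "[deleted]"}

# True for rows whose text is only a placeholder, ignoring surrounding whitespace.
mask = toy["selftext"].str.strip().isin(placeholders)
print(mask.sum(), "placeholder posts")

# Keep everything that is not a placeholder.
filtered = toy[~mask]
```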

We locate the duplicate rows and drop them.

In [14]:
duped_index = raw_data[raw_data.duplicated()].index

In [15]:
raw_data.drop(duped_index, inplace=True)
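To actually count the duplicates, `duplicated().sum()` gives the number of rows that repeat an earlier row. A small sketch on toy data follows; checking duplicates on `selftext` alone is shown as an optional variation, not something the notebook does:

```python
import pandas as pd

# Toy frame standing in for raw_data (values are invented).
toy = pd.DataFrame({
    "title": ["help", "help", "help", "hi"],
    "selftext": ["same text", "same text", "other text", "same text"],
})

# Rows that are identical to an earlier row across all columns.
n_dupes = toy.duplicated().sum()

# Equivalent to dropping the index of the duplicated rows.
deduped = toy.drop_duplicates()

# Variation (an assumption, not the notebook's choice): same body text, any title.
n_text_dupes = toy.duplicated(subset=["selftext"]).sum()

print(n_dupes, len(deduped), n_text_dupes)
```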

Finally, to make encoding easier, we remove emojis from the text fields (title and selftext).

In [16]:
raw_data.iloc[1]['selftext']

'Hello fellow friends 😄\n\nI\'m on the BPD spectrum and have discouraged (silent) borderline characteristics.\n\nThere are different levels to experiencing anger. I was wondering, what are yours? And how do you express it? What\'s a healthy way you found to cool it down?\n\nFor me I will first become silent and blame myself, "maybe if I" or "If I only", "maybe he\'s just not having it today", "maybe I simply don\'t get him due to my own shortcomings in understanding". However, I find it interesting how, when someone hurts the ones I love, I tend not to demonize myself no more in the extend I would normally do, but rather the aggressor. In extreme cases this can lead to my maximum expression of anger. I don\'t know whether you guys get to experience this as well? \n\nI have written this as a reaction to another post and it illustrates what this anger would look like:\n\n"The maximum amount of rage. it\'s like I blackout. I call it cold rage. No sense of pain whatsoever, pure anger. It\'

In [17]:
def remove_emojis(text):
    return emoji.replace_emoji(text, replace='')

raw_data['title'] = raw_data['title'].apply(remove_emojis)
raw_data['selftext'] = raw_data['selftext'].apply(remove_emojis)

In [18]:
raw_data.iloc[1]

title                                                 Cold rage?
selftext       Hello fellow friends \n\nI'm on the BPD spectr...
created_utc                                           1650356660
over_18                                                    False
subreddit                                                    BPD
Name: 1, dtype: object

Next, we replace line breaks with spaces.

In [19]:
# Replace newline characters with spaces
raw_data['title'] = raw_data['title'].str.replace('\n', ' ')
raw_data['selftext'] = raw_data['selftext'].str.replace('\n', ' ')
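Replacing each newline with a space can leave runs of consecutive spaces, for example where a post contained a blank line. If that matters for the encoder, one option is to collapse any whitespace run into a single space with a regular expression. This is a sketch of that alternative on two toy strings, not what the notebook does:

```python
import pandas as pd

# Toy strings; the first mimics the start of the post shown above.
s = pd.Series([
    "Hello fellow friends \n\nI'm on the BPD spectrum",
    "line one\r\nline two\t end ",
])

# \s+ matches any run of spaces, tabs, and line breaks; strip() trims the ends.
collapsed = s.str.replace(r"\s+", " ", regex=True).str.strip()
print(collapsed.tolist())
```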

In [20]:
raw_data.iloc[1]

title                                                 Cold rage?
selftext       Hello fellow friends   I'm on the BPD spectrum...
created_utc                                           1650356660
over_18                                                    False
subreddit                                                    BPD
Name: 1, dtype: object

We save a copy.

In [21]:
raw_data.to_csv('somewhat_clean_mental_disorders_reddit.csv', index=False)

And load it back.

In [22]:
import pandas as pd

In [23]:
df = pd.read_csv('somewhat_clean_mental_disorders_reddit.csv')
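One thing to keep in mind with this round trip: a field that became an empty string during cleaning (for instance a title made only of emojis) is written as an empty CSV field, and `read_csv` reads it back as `NaN` by default. The sketch below shows this on a toy frame held in memory; the rows are invented for illustration.

```python
import io
import pandas as pd

# Toy frame with an empty title, as could happen after emoji removal.
toy = pd.DataFrame({
    "title": ["", "Cold rage?"],
    "selftext": ["some text", "Hello"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["title"].isna().sum())

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)
print(kept["title"].tolist())
```

This is why missing values can reappear after reloading even though `dropna` was applied before saving.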

In [24]:
df

Unnamed: 0,title,selftext,created_utc,over_18,subreddit
0,Life is so pointless without others,Does anyone else think the most important part...,1650356960,False,BPD
1,Cold rage?,Hello fellow friends I'm on the BPD spectrum...,1650356660,False,BPD
2,I don’t know who I am,My [F20] bf [M20] told me today (after I said ...,1650355379,False,BPD
3,HELP! Opinions! Advice!,"Okay, I’m about to open up about many things I...",1650353430,False,BPD
4,My ex got diagnosed with BPD,"Without going into detail, this diagnosis expl...",1650350635,False,BPD
...,...,...,...,...,...
581048,I really need to talk to a therapist..,I can't afford a real session and it's 11 PM. ...,1415332108,False,mentalillness
581049,I have pica,Hello. I'm taking steps to get rid o...,1414896638,False,mentalillness
581050,Where can you go to get help for someone menta...,Someone (a war veteran) I know is mentally ill...,1396298261,False,mentalillness
581051,I am rooster illusion,AMA,1344639905,False,mentalillness


In [25]:
df.subreddit.value_counts()

subreddit
BPD              212789
Anxiety          161580
depression       121141
mentalillness     38150
bipolar           35669
schizophrenia     11724
Name: count, dtype: int64
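The classes are clearly imbalanced: the largest subreddit has about 18 times as many posts as the smallest. A short sketch that turns the counts printed above into proportions (on the real frame, `df.subreddit.value_counts(normalize=True)` gives the same shares directly):

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "BPD": 212789,
        "Anxiety": 161580,
        "depression": 121141,
        "mentalillness": 38150,
        "bipolar": 35669,
        "schizophrenia": 11724,
    },
    name="count",
)

# Share of each subreddit in the whole dataset.
shares = counts / counts.sum()

# Ratio between the largest and the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print(round(imbalance_ratio, 1))
```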

We know from the dataset's publisher that it contains many troll posts. So, before training any classifier, we encode the texts with a Hugging Face model, reduce the dimensionality of the embeddings, and cluster them to see whether the troll posts fall into a cluster of their own. If they do, we can use that information to train a classifier that predicts not only the subreddit label but also whether a post is a troll post.

In [26]:
from sentence_transformers import SentenceTransformer

dimensions = 512

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=dimensions)
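As far as we understand it, `truncate_dim` keeps only the leading components of each embedding. The NumPy sketch below illustrates that idea on random vectors; the native size of 1024 is an assumption about this model, and the re-normalization step is optional and shown only because cosine-based methods often expect unit-length vectors.

```python
import numpy as np

# Random stand-ins for full-size embeddings (1024 dimensions is an assumption).
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))

# Truncation: keep only the first 512 components of every vector.
truncated = full[:, :512]

# Optional: rescale each truncated vector back to unit length.
normalized = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

print(truncated.shape)
```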


In [27]:
df['full_text'] = df['title'] + ' ' + df['selftext']

In [28]:
df['full_text'] = df['full_text'].fillna('')
texts = df.full_text.values
texts


array(['Life is so pointless without others Does anyone else think the most important part of life is being in a relationship? like the absolute most important. I don’t really care for any other goals in my life lol as long as I end up in a relationship. that’s like my ultimate life goal. I wish I wasn’t like this tho.  my therapist will ask me ab life goals and I just can’t imagine doing anything without someone by my side.',
       'Cold rage? Hello fellow friends   I\'m on the BPD spectrum and have discouraged (silent) borderline characteristics.  There are different levels to experiencing anger. I was wondering, what are yours? And how do you express it? What\'s a healthy way you found to cool it down?  For me I will first become silent and blame myself, "maybe if I" or "If I only", "maybe he\'s just not having it today", "maybe I simply don\'t get him due to my own shortcomings in understanding". However, I find it interesting how, when someone hurts the ones I love, I tend not to

In [29]:
texts[:10]

array(['Life is so pointless without others Does anyone else think the most important part of life is being in a relationship? like the absolute most important. I don’t really care for any other goals in my life lol as long as I end up in a relationship. that’s like my ultimate life goal. I wish I wasn’t like this tho.  my therapist will ask me ab life goals and I just can’t imagine doing anything without someone by my side.',
       'Cold rage? Hello fellow friends   I\'m on the BPD spectrum and have discouraged (silent) borderline characteristics.  There are different levels to experiencing anger. I was wondering, what are yours? And how do you express it? What\'s a healthy way you found to cool it down?  For me I will first become silent and blame myself, "maybe if I" or "If I only", "maybe he\'s just not having it today", "maybe I simply don\'t get him due to my own shortcomings in understanding". However, I find it interesting how, when someone hurts the ones I love, I tend not to

In [31]:
embeddings = model.encode(texts)
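Encoding several hundred thousand posts in a single call can take a long time, and an interruption loses everything. One possible approach is to encode in chunks so that each chunk could be saved as it finishes. The helper below is a hypothetical sketch: `encode_in_chunks` and `fake_encode` are illustrative names, and `fake_encode` merely stands in for `model.encode` so the example runs without downloading a model.

```python
import numpy as np

def encode_in_chunks(texts, encode_fn, chunk_size=10_000):
    """Encode texts chunk by chunk and stack the results in the original order."""
    parts = []
    for start in range(0, len(texts), chunk_size):
        chunk = list(texts[start:start + chunk_size])
        parts.append(np.asarray(encode_fn(chunk)))
        # In a real run, each part could be written to disk here as a checkpoint.
    return np.vstack(parts)

def fake_encode(batch):
    # Hypothetical stand-in for model.encode: one 4-dimensional vector per text.
    return np.array([[len(text), 0.0, 0.0, 1.0] for text in batch])

demo = encode_in_chunks(["a", "bb", "ccc", "dddd", "eeeee"], fake_encode, chunk_size=2)
print(demo.shape)
```

With the real model, the call would look like `encode_in_chunks(texts, model.encode)`; `model.encode` also accepts `batch_size` and `show_progress_bar` arguments to control and monitor a long run.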



In [33]:
import numpy as np
np.savetxt('my_file.txt', embeddings)

In [34]:
embeddings = np.loadtxt('my_file.txt')
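For an array of this size, a text file is large and slow to parse. NumPy's binary `.npy` format is an alternative worth considering, since it is more compact and preserves the dtype. A small sketch comparing both on a random array (the file names and the array are illustrative):

```python
import os
import tempfile
import numpy as np

# Random stand-in for a slice of the embeddings matrix.
arr = np.random.default_rng(0).normal(size=(200, 512)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "emb.txt")
    npy_path = os.path.join(tmp, "emb.npy")

    np.savetxt(txt_path, arr)   # human-readable text, one row per line
    np.save(npy_path, arr)      # binary format, keeps dtype and shape

    txt_size = os.path.getsize(txt_path)
    npy_size = os.path.getsize(npy_path)

    restored = np.load(npy_path)

print(txt_size, npy_size)
```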

In [36]:
import umap

In [39]:
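umap_model = umap.UMAP(n_components=25, metric='cosine')

# fit() returns the fitted UMAP object itself; the 25-dimensional coordinates are in its .embedding_ attribute
embeddings_25 = umap_model.fit(embeddings)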

In [43]:
np.savetxt('umap_25.txt', embeddings_25.embedding_)

In [47]:
umap_25 = np.loadtxt('umap_25.txt')
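With the reduced embeddings loaded, the clustering step described earlier can begin. The notebook does not say which algorithm is used, so the following is only a sketch with scikit-learn's `KMeans` on synthetic 25-dimensional points; the algorithm, the number of clusters, and the data are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for umap_25: three well-separated blobs in 25 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[-10.0] * 25, [0.0] * 25, [10.0] * 25])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 25)) for c in centers])

# n_clusters=3 matches the synthetic data; on real data it has to be chosen.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Number of points assigned to each cluster.
sizes = np.bincount(labels)
print(sizes)
```

On the real data, inspecting a sample of posts from each cluster (especially small or isolated ones) would be the way to check whether any cluster is dominated by troll posts.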