<a href="https://colab.research.google.com/github/dledbetter123/depression-ATT/blob/main/678_Data_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal Is to reduce the parameters needed to detect depression so that it could be used on edge devices (Apples autocorrect can detect and help users find mental health sources) rather than needing expensive API requests to a remote LLM.

In [None]:
!git clone https://github.com/rafalposwiata/depression-detection-lt-edi-2022.git

Cloning into 'depression-detection-lt-edi-2022'...
remote: Enumerating objects: 67, done.[K
remote: Counting objects: 100% (67/67), done.[K
remote: Compressing objects: 100% (46/46), done.[K
remote: Total 67 (delta 23), reused 57 (delta 18), pack-reused 0 (from 0)[K
Receiving objects: 100% (67/67), 8.30 MiB | 6.11 MiB/s, done.
Resolving deltas: 100% (23/23), done.


In [None]:
!pwd
!ls

/content/depression-detection-lt-edi-2022/data/original_dataset
depression-detection-lt-edi-2022  dev.tsv  test.tsv  train.tsv


In [None]:
%cd depression-detection-lt-edi-2022/data/original_dataset/

/content/depression-detection-lt-edi-2022/data/original_dataset/depression-detection-lt-edi-2022/data/original_dataset


In [None]:
!ls

dev.tsv  test.tsv  train.tsv


In [None]:
import pandas as pd
import numpy as np
import re

import os
directory = '.'
files = []
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f):
        print(f.split('.')[1][1:])
        files.append(f.split('.')[1][1:])
print(files)

# for file_ in files:
#   with open(file_ + ".tsv", 'r') as myfile:
#     with open(file_ + ".csv", 'w') as csv_file:
#       for line in myfile:

#         fileContent = re.sub("\t", ",", line)

#         csv_file.write(fileContent)

df_train = pd.read_table(files[0] + ".tsv")
#.to_csv(files[0] + ".csv", index=False)

df_train.head()

train
test
dev
['train', 'test', 'dev']


Unnamed: 0,PID,Text_data,Label
0,train_pid_1,Waiting for my mind to have a breakdown once t...,moderate
1,train_pid_2,My new years resolution : I'm gonna get my ass...,moderate
2,train_pid_3,New year : Somone else Feeling like 2020 will ...,moderate
3,train_pid_4,"My story I guess : Hi, Im from Germany and my ...",moderate
4,train_pid_5,Sat in the dark and cried myself going into th...,moderate


In [None]:
df_train['Class labels'] = df_train['Label'].str.lower()
df_train['text data'] = df_train['Text_data'].str.lower()

df_train.drop(columns=['Label', "Text_data"], axis=1, inplace=True)
df_train.head()

Unnamed: 0,PID,Class labels,text data
0,train_pid_1,moderate,waiting for my mind to have a breakdown once t...
1,train_pid_2,moderate,my new years resolution : i'm gonna get my ass...
2,train_pid_3,moderate,new year : somone else feeling like 2020 will ...
3,train_pid_4,moderate,"my story i guess : hi, im from germany and my ..."
4,train_pid_5,moderate,sat in the dark and cried myself going into th...


In [None]:
display(df_train['Class labels'].unique())

array(['moderate', 'not depression', 'severe'], dtype=object)

In [None]:
label_mapping = {
    'moderate': 0,
    'not depression': 1,
    'severe': 2
}
df_train['Class labels'] = df_train['Class labels'].map(label_mapping)


In [None]:
df_train['Class labels'].value_counts()

Unnamed: 0_level_0,count
Class labels,Unnamed: 1_level_1
0,6019
1,1971
2,901


In [None]:
max_length = 0
for text in df_train['text data']:
  max_length = max(max_length, len(text))

print(f"The length of the longest 'text data' string is: {max_length}")

The length of the longest 'text data' string is: 15996


In [None]:
my_data = []
for data in df_train['text data']:
  my_data.append(data)
for i, datum in enumerate(my_data[:5]):
  print("Datum #%d:\n %s | not depression\n\n" % (i, datum))

Datum #0:
 waiting for my mind to have a breakdown once the “new year” feeling isn’t there anymore : i don’t know about anyone else, but i’m a little bit worried that i’ll go back to being depressed in a few days time or something. last year, i tried not to have any breakdowns for the start of 2019. a mere 10 days later, i broke down crying. i wasn’t the same for that entire year. up until december, where i was ok that month. now i just wait... it’s a weird way to act and feel, but it feels a bit normal. | not depression


Datum #1:
 my new years resolution : i'm gonna get my ass into a therapists office, and if i dont become even a little bit happy, then i'm not dealing with this shit anymore.

i'm not asking for a lot, just a little bit of serotonin is all i want | not depression


Datum #2:
 new year : somone else feeling like 2020 will be there last
year on earth because even wen your hammerd your feeling like a moron thats depressed? | not depression


Datum #3:
 my story i guess 

In [None]:
print(my_data[1])

my new years resolution : i'm gonna get my ass into a therapists office, and if i dont become even a little bit happy, then i'm not dealing with this shit anymore.

i'm not asking for a lot, just a little bit of serotonin is all i want


In [None]:
%ls

dev.tsv  test.tsv  train.tsv


In [None]:
!pwd

/content/depression-detection-lt-edi-2022/data/original_dataset/depression-detection-lt-edi-2022/data/original_dataset


In [None]:
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

max_length = 0
for text in df_train['text data']:
  max_length = max(max_length, len(text))

print(f"The length of the longest 'Text_data' string is: {max_length}")

class DepressionDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

texts = ["sample text here", "another sample text"]
labels = [0, 1]
dataset = DepressionDataset(texts, labels, tokenizer)
data_loader = DataLoader(dataset, batch_size=2)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

The length of the longest 'Text_data' string is: 15996


In [None]:

for batch in data_loader:
    print("Input IDs:", batch['input_ids'])
    print("Attention Mask:", batch['attention_mask'])
    print("Labels:", batch['label'])
    break

Input IDs: tensor([[ 101, 7099, 3793,  ...,    0,    0,    0],
        [ 101, 2178, 7099,  ...,    0,    0,    0]])
Attention Mask: tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
Labels: tensor([0, 1])


In [None]:
class DepressionDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length=6390):
        self.texts = dataframe['text data'].tolist()
        self.labels = dataframe['Class labels'].tolist()
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }

dataset = DepressionDataset(df_train, tokenizer)
data_loader = DataLoader(dataset, batch_size=4, shuffle=True)

In [26]:
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class MHA(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads=8, dropout=0.4):
        super(MHA, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc_q = nn.Linear(embed_dim, embed_dim)
        self.fc_k = nn.Linear(embed_dim, embed_dim)
        self.fc_v = nn.Linear(embed_dim, embed_dim)
        self.fc_o = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(embed_dim)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)
        self.classifier = nn.Linear(embed_dim, 2)

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        batch_size = x.size(0)
        q = self.fc_q(x)
        k = self.fc_k(x)
        v = self.fc_v(x)
        q = q.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        k = k.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)

        v = v.view(batch_size, -1, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        attn_weights = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attn_weights = torch.softmax(attn_weights, dim=-1)
        attn_output = torch.matmul(attn_weights, v)
        attn_output = attn_output.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.embed_dim)
        output = self.fc_o(attn_output)
        output = self.relu(output)
        output = self.dropout(output)
        output = self.layer_norm(output + x)
        output = self.classifier(output)
        output = output[:, 0, :]
        return output

import torch.optim as optim
import torch.nn as nn
from transformers import BertTokenizer
import tqdm

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mha_model = MHA(vocab_size=tokenizer.vocab_size, embed_dim=768).to(device)

optimizer = optim.Adam(mha_model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss().to(device)

num_epochs = 10

for epoch in range(num_epochs):
    loop = tqdm(data_loader, leave=True)
    for batch in loop:
        optimizer.zero_grad()

        input_ids = batch['input_ids'].type(torch.long).to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = mha_model(input_ids)

        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        loop.set_description(f"Epoch [{epoch + 1}/{num_epochs}]")
        loop.set_postfix(loss=loss.item())

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
