# Predicting Depression in Tweets
****

Mental health problems are widespread, so I wanted to see whether a user's social media posts can indicate that they are in distress. Using BERT, a pretrained NLP model, I fine-tuned a classifier on a Kaggle dataset of tweets, each labeled by whether the user is depressed.

As far as I could find, this particular dataset had not yet been analyzed with BERT, so I wanted to see whether a BERT-based classifier could help predict a user's mood.

I chose BERT because it was developed by NLP researchers and pretrained on a large text corpus. It represents not only words and grammatical structure but also how the meaning of a word depends on the text around it.

# Imports and Setup

In [1]:
!pip install -q transformers
!pip install -q sentencepiece
!pip install -q nlpaug
!pip install -q emoji
!pip install -q datasets

In [2]:
import numpy as np
import pandas as pd
import re
import string
import emoji
from tqdm.auto import tqdm
import os
import random
import torch
from datasets import ( load_dataset, Dataset, load_metric, DatasetDict )
from transformers import ( BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments, default_data_collator, set_seed, )

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report

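Next, I loaded the data. The file is read with ISO-8859-1 encoding, which is probably why some characters in the raw tweets show up garbled (for example, `Itâs` in the last row below). The preprocessing step later removes them.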

In [3]:
df = pd.read_csv("Mental-Health-Twitter.csv",encoding="ISO-8859-1")

df.head()

| Unnamed: 0.1 | Unnamed: 0 | post_id | post_created | post_text | user_id | followers | friends | favourites | statuses | retweets | label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 637894677824413696 | Sun Aug 30 07:48:37 +0000 2015 | It's just over 2 years since I was diagnosed w... | 1013187241 | 84 | 211 | 251 | 837 | 0 | 1 |
| 1 | 1 | 637890384576778240 | Sun Aug 30 07:31:33 +0000 2015 | It's Sunday, I need a break, so I'm planning t... | 1013187241 | 84 | 211 | 251 | 837 | 1 | 1 |
| 2 | 2 | 637749345908051968 | Sat Aug 29 22:11:07 +0000 2015 | Awake but tired. I need to sleep but my brain ... | 1013187241 | 84 | 211 | 251 | 837 | 0 | 1 |
| 3 | 3 | 637696421077123073 | Sat Aug 29 18:40:49 +0000 2015 | RT @SewHQ: #Retro bears make perfect gifts and... | 1013187241 | 84 | 211 | 251 | 837 | 2 | 1 |
| 4 | 4 | 637696327485366272 | Sat Aug 29 18:40:26 +0000 2015 | Itâs hard to say whether packing lists are m... | 1013187241 | 84 | 211 | 251 | 837 | 1 | 1 |


Then I dropped duplicate tweets from the data.

In [4]:
df.drop_duplicates(subset="post_text",inplace=True)

# Text Preprocessing

In [5]:
#Remove emojis from text
def strip_emoji(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  
        u"\U0001F300-\U0001F5FF"  
        u"\U0001F680-\U0001F6FF"  
        u"\U0001F1E0-\U0001F1FF"  
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)

#Remove punctuation, links, mentions and new line characters
def strip_all_entities(text): 
    text = re.sub(r"[\r\n]+", " ", text).lower()  # replace line breaks with a space, then lowercase
    text = re.sub(r"(?:\@|https?\://)\S+", "", text)  # drop @mentions and links
    text = re.sub(r'[^\x00-\x7f]',r'', text)  # drop non-ASCII characters
    banned_list= string.punctuation + 'Ã'+'±'+'ã'+'¼'+'â'+'»'+'§'
    table = str.maketrans('', '', banned_list)
    text = text.translate(table)
    return text

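#Remove hashtags at the end of a tweet, then strip the remaining '#' and '_' symbols
def clean_hashtags(tweet):
    # Raw strings keep \b as a regex word boundary instead of a backspace character
    new_tweet = " ".join(word.strip() for word in re.split(r'#(?!(?:hashtag)\b)[\w-]+(?=(?:\s+#[\w-]+)*\s*$)', tweet)) 
    new_tweet2 = " ".join(word.strip() for word in re.split(r'#|_', new_tweet)) 
    return new_tweet2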

#Drop words that contain '$' or '&'
def filter_chars(a):
    sent = []
    for word in a.split(' '):
        if '$' in word or '&' in word:
            sent.append('')
        else:
            sent.append(word)
    return ' '.join(sent)

#Collapse runs of whitespace into a single space
def remove_mult_spaces(text):
    return re.sub(r"\s\s+", " ", text)
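To see roughly what these helpers do together, here is a condensed, self-contained sketch applied to one made-up tweet. `clean_tweet` and the sample text are illustrative names that are not part of the notebook, and the function approximates the pipeline rather than reproducing it step for step.

```python
import re
import string


def clean_tweet(text):
    """Condensed approximation of the cleaning steps (illustrative only)."""
    text = re.sub(r"[\r\n]+", " ", text).lower()      # line breaks -> space, lowercase
    text = re.sub(r"(?:@|https?://)\S+", "", text)    # drop @mentions and links
    text = re.sub(r"[^\x00-\x7f]", "", text)          # drop non-ASCII characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace


# Hypothetical tweet, not taken from the dataset
sample = "RT @SewHQ: Feeling low today...\nhttps://t.co/abc #MentalHealth"
print(clean_tweet(sample))  # rt feeling low today mentalhealth
```

Because `string.punctuation` contains `#` and `_`, hashtag symbols are already gone once punctuation has been stripped, which is why `#Retro` shows up as plain `retro` in the cleaned column.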

In [6]:
# Add the clean Text to a new column in the dataframe
df["post_clean"] = [remove_mult_spaces(filter_chars(clean_hashtags(strip_all_entities(strip_emoji(t))))) for t in tqdm(df["post_text"].values)]
df.head()

  0%|          | 0/19488 [00:00<?, ?it/s]

| Unnamed: 0.1 | Unnamed: 0 | post_id | post_created | post_text | user_id | followers | friends | favourites | statuses | retweets | label | post_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 637894677824413696 | Sun Aug 30 07:48:37 +0000 2015 | It's just over 2 years since I was diagnosed w... | 1013187241 | 84 | 211 | 251 | 837 | 0 | 1 | its just over 2 years since i was diagnosed wi... |
| 1 | 1 | 637890384576778240 | Sun Aug 30 07:31:33 +0000 2015 | It's Sunday, I need a break, so I'm planning t... | 1013187241 | 84 | 211 | 251 | 837 | 1 | 1 | its sunday i need a break so im planning to sp... |
| 2 | 2 | 637749345908051968 | Sat Aug 29 22:11:07 +0000 2015 | Awake but tired. I need to sleep but my brain ... | 1013187241 | 84 | 211 | 251 | 837 | 0 | 1 | awake but tired i need to sleep but my brain h... |
| 3 | 3 | 637696421077123073 | Sat Aug 29 18:40:49 +0000 2015 | RT @SewHQ: #Retro bears make perfect gifts and... | 1013187241 | 84 | 211 | 251 | 837 | 2 | 1 | rt retro bears make perfect gifts and are grea... |
| 4 | 4 | 637696327485366272 | Sat Aug 29 18:40:26 +0000 2015 | Itâs hard to say whether packing lists are m... | 1013187241 | 84 | 211 | 251 | 837 | 1 | 1 | its hard to say whether packing lists are maki... |


# BERT Model

In [7]:
# Importing and Assigning Train/Test Sets
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size = 0.2)

In [8]:
# Making the dataframes to datasets
train_ds = Dataset.from_pandas(df_train[["post_clean", "label"]])
test_ds = Dataset.from_pandas(df_test[["post_clean", "label"]])

In [9]:
# Assigning labels to ID numbers
label2id = {
    0:0,
    1:1
}
id2label = {i:l for l,i in label2id.items()}
num_labels = len(label2id)

In [10]:
# Splitting the training set to training and validation sets
splits = train_ds.train_test_split(test_size=0.2)
train_ds, val_ds = splits["train"], splits["test"]
raw_datasets = DatasetDict({"train": train_ds, "validation": val_ds, "test": test_ds})
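The two 80/20 splits leave about 64% of the tweets for training. The sketch below works through that arithmetic for the 19,488 tweets shown in the progress bar earlier. `split_sizes` is a hypothetical helper, and it assumes the held-out share is rounded up, which matches the example counts reported in the training logs.

```python
def split_sizes(n_rows, test_percent):
    """Return (train, test) sizes, rounding the test share up (assumption)."""
    test = -(-n_rows * test_percent // 100)  # integer ceiling of n_rows * pct / 100
    return n_rows - test, test


total = 19488                                   # tweets left after deduplication
train_val, test = split_sizes(total, 20)        # first split: train+val vs. test
train, val = split_sizes(train_val, 20)         # second split: train vs. validation
print(train, val, test)  # 12472 3118 3898
```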

In [11]:
# Creating the model
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # using bert-base-uncased model
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)


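# Use the GPU when one is available, otherwise fall back to the CPU
model = model.to("cuda" if torch.cuda.is_available() else "cpu")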

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") # using bert-base-uncased model


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [12]:
%%time

# Tokenize the cleaned tweets
max_seq_length = 128

def preprocess_function(examples):
    # Pad or truncate every tweet to max_seq_length tokens
    result = tokenizer(
        examples["post_clean"],
        padding="max_length",
        max_length=max_seq_length,
        truncation=True,
    )

    # Keep a copy of the labels; the model ignores this column and trains on "label"
    result["result_label"] = list(examples["label"])
    return result

raw_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    desc="Running tokenizer on dataset",
)

CPU times: user 9.78 s, sys: 89.9 ms, total: 9.87 s
Wall time: 9.92 s


In [None]:
# Set up the metrics and turn off Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"
os.environ["WANDB_MODE"] = "offline"

# Accuracy is computed directly in compute_metrics, so no metric object needs to be loaded
metric_name = "accuracy"

# compute_metrics takes an `EvalPrediction` object (a namedtuple with a
# predictions and label_ids field) and has to return a dictionary mapping strings to floats.
def compute_metrics(p):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=1)  # most likely class for each example
    return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
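As a small worked example of what `compute_metrics` calculates, the sketch below does the same argmax-then-compare step in plain Python. The logits and labels are made up, and `accuracy_from_logits` is a hypothetical name used only for this illustration.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the true label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Hypothetical model outputs for four tweets (two classes each)
logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.8]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```

The third tweet is the only miss: its larger logit is for class 1 while its label is 0.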


In [14]:
# Set the training hyperparameters
batch_size = 16
training_args = TrainingArguments(
    "bert-finetuned-tweet-sentiment",  # output directory for checkpoints
    learning_rate = 7e-5,
    num_train_epochs = 3,
    metric_for_best_model=metric_name,
    evaluation_strategy = 'epoch',
    save_strategy = "epoch",
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [15]:
# Shuffle the indices of the training and validation sets (every example is kept)
train_randomlist = random.sample(range(len(raw_datasets["train"])), len(raw_datasets["train"]))
val_randomlist = random.sample(range(len(raw_datasets["validation"])), len(raw_datasets["validation"]))

# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=raw_datasets["train"].select(train_randomlist),
    eval_dataset=raw_datasets["validation"].select(val_randomlist),
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

In [16]:
trainer.train() # Training the model

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, post_clean, result_label. If __index_level_0__, post_clean, result_label are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 12472
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2340


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5474 | 0.456575 | 0.778704 |
| 2 | 0.313 | 0.503553 | 0.796023 |
| 3 | 0.1634 | 0.803828 | 0.797306 |


TrainOutput(global_step=2340, training_loss=0.327565558343871, metrics={'train_runtime': 912.6945, 'train_samples_per_second': 40.995, 'train_steps_per_second': 2.564, 'total_flos': 2461140811837440.0, 'train_loss': 0.327565558343871, 'epoch': 3.0})
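The step count in the log follows directly from the dataset size, batch size and number of epochs. The sketch below reproduces it, assuming a single device and no gradient accumulation, as the log reports. `total_optimization_steps` is a hypothetical helper, not a Trainer function.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (last partial batch included) times the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


steps = total_optimization_steps(num_examples=12472, batch_size=16, num_epochs=3)
print(steps)  # 2340
```

One epoch is 780 steps, which is why the first checkpoint is saved as `checkpoint-780`.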

# Evaluation

In [17]:
trainer.evaluate() # evaluating the model

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, post_clean, result_label. If __index_level_0__, post_clean, result_label are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 3118
  Batch size = 16


{'epoch': 3.0,
 'eval_accuracy': 0.7973059415817261,
 'eval_loss': 0.8038281798362732,
 'eval_runtime': 23.7349,
 'eval_samples_per_second': 131.368,
 'eval_steps_per_second': 8.216}
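These numbers describe the model after the third epoch. In the training table, validation loss rises from 0.46 to 0.80 while training loss keeps falling, which is a common sign of overfitting. The sketch below picks the best epoch under two different criteria, using the values from that table; the `history` list is typed in by hand for illustration.

```python
# Per-epoch results copied from the training table
history = [
    {"epoch": 1, "train_loss": 0.5474, "val_loss": 0.456575, "accuracy": 0.778704},
    {"epoch": 2, "train_loss": 0.3130, "val_loss": 0.503553, "accuracy": 0.796023},
    {"epoch": 3, "train_loss": 0.1634, "val_loss": 0.803828, "accuracy": 0.797306},
]

best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_accuracy = max(history, key=lambda row: row["accuracy"])
print(best_by_loss["epoch"], best_by_accuracy["epoch"])  # 1 3
```

The two criteria disagree here, so the choice of checkpoint depends on which one matters more. `TrainingArguments` has a `load_best_model_at_end` option that, combined with `metric_for_best_model`, reloads the best checkpoint automatically; this notebook does not set it, so the final-epoch model is the one evaluated.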

In [18]:
%%time
predict_dataset = raw_datasets["test"]
predict_dataset = predict_dataset.remove_columns("result_label")
predictions = trainer.predict(predict_dataset, metric_key_prefix="predict").predictions
predictions = np.argmax(predictions, axis=1)

df_test = predict_dataset.to_pandas()
df_test["pred_sent"] = [id2label[item] for item in predictions]
output_predict_file = os.path.join(training_args.output_dir, "predict_results.csv")
df_test.to_csv(output_predict_file, index=False)

from sklearn.metrics import accuracy_score, f1_score, classification_report
y_true = [l for l in df_test["label"]]
y_pred = list(predictions)
print(classification_report(y_true, y_pred))

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, post_clean. If __index_level_0__, post_clean are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3898
  Batch size = 16


              precision    recall  f1-score   support

           0       0.78      0.79      0.79      1927
           1       0.80      0.79      0.79      1971

    accuracy                           0.79      3898
   macro avg       0.79      0.79      0.79      3898
weighted avg       0.79      0.79      0.79      3898

CPU times: user 32.6 s, sys: 261 ms, total: 32.9 s
Wall time: 32.9 s
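For readers unfamiliar with the columns in the report, the sketch below shows how precision, recall and F1 are derived from counts of true positives, false positives and false negatives. The counts are hypothetical and chosen for round numbers; they are not the confusion matrix of this model.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (F1) for one class."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the "depressed" class
p, r, f = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8
```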


# Demo

In [19]:
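def pred_sentiment(input_text):
    # Note: the input is not passed through the cleaning functions used for training
    print("Tweet:", input_text)
    # Truncate long inputs to the same length used during training
    encoding = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=max_seq_length)
    encoding = {k: v.to(trainer.model.device) for k, v in encoding.items()}

    # Inference only, so switch to eval mode and skip gradient tracking
    trainer.model.eval()
    with torch.no_grad():
        outputs = trainer.model(**encoding)
    probs = torch.softmax(outputs.logits, dim=-1)
    pred = probs.argmax().item()

    # id2label maps the class index back to the dataset label (0 or 1)
    label = id2label[pred]
    message = "User is not depressed" if label == 0 else "User is depressed"
    print("Prediction:", message)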

In [26]:
# Write your own tweet
tweet = input("Please write a tweet: ")
pred_sentiment(tweet)

Please write a tweet: My life sucks
Tweet: My life sucks
Prediction: User is depressed
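The demo prints only the winning class, but the softmax step also gives a probability that indicates how confident the model is. The sketch below shows that calculation in plain Python with made-up logits; the `softmax` helper is written out here for illustration and stands in for `torch.nn.Softmax`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the classes [not depressed, depressed]
probs = softmax([-1.2, 1.4])
pred = max(range(len(probs)), key=probs.__getitem__)
print(pred, round(probs[pred], 2))
```

A probability close to 0.5 would mean the model is nearly undecided, which is worth showing to anyone acting on the prediction.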


# Conclusions
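The model reached about 80% accuracy on the validation set (0.797) and 79% on the held-out test set, with an F1 score of 0.79 for both classes. Validation loss rose from 0.46 to 0.80 over the three epochs while training loss kept falling, which suggests the model was overfitting by the end of training.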

However, a single tweet is not a reliable indicator of whether a user has depression. It would be better to analyze a user's entire Twitter history rather than individual tweets. Unfortunately, the dataset is too small for this: when I tried it, the model simply memorized the training data.
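One lightweight alternative to training on whole histories is to keep the tweet-level model and average its predictions per user afterwards. The sketch below illustrates that idea only; it is not something this notebook does. The scores, the user IDs, the 0.5 threshold and the `user_level_scores` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def user_level_scores(tweet_scores):
    """Average tweet-level probabilities per user.

    tweet_scores: iterable of (user_id, probability_depressed) pairs.
    """
    per_user = defaultdict(list)
    for user_id, prob in tweet_scores:
        per_user[user_id].append(prob)
    return {user: mean(probs) for user, probs in per_user.items()}


# Hypothetical tweet-level predictions for two users
scores = [("u1", 0.9), ("u1", 0.7), ("u2", 0.2), ("u2", 0.4), ("u2", 0.3)]
users = user_level_scores(scores)
flagged = sorted(user for user, score in users.items() if score >= 0.5)
print(flagged)  # ['u1']
```

Averaging smooths out single outlier tweets, although it still inherits whatever mistakes the tweet-level classifier makes.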

Despite these limitations, I think a model of this type could be useful on social media sites, either to point struggling users toward help or to support suicide prevention. With more data and better labeling, such a model should be better at telling concerning posts apart from the rest.

# Sources
https://www.kaggle.com/discussions/general/132022#1563212

Boston University's AI4All program