<a href="https://colab.research.google.com/github/arjumand252/Mental-Health-Chatbot/blob/main/Sentiment_analysis_reddit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl

In [3]:
!kaggle datasets download -d neelghoshal/reddit-mental-health-data

Dataset URL: https://www.kaggle.com/datasets/neelghoshal/reddit-mental-health-data
License(s): unknown
Downloading reddit-mental-health-data.zip to /content
  0% 0.00/1.83M [00:00<?, ?B/s]
100% 1.83M/1.83M [00:00<00:00, 50.1MB/s]


In [4]:
!unzip /content/reddit-mental-health-data.zip

Archive:  /content/reddit-mental-health-data.zip
  inflating: data_to_be_cleansed.csv  


In [5]:
import pandas as pd
import re
import string
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.base import BaseEstimator, TransformerMixin
import nltk
from sklearn.model_selection import train_test_split
import torch
import json

In [6]:
# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# Data Pre-processing pipeline

In [7]:
df = pd.read_csv('data_to_be_cleansed.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,text,title,target
0,0,Welcome to /r/depression's check-in post - a p...,"Regular check-in post, with information about ...",1
1,1,We understand that most people who reply immed...,Our most-broken and least-understood rules is ...,1
2,2,Anyone else just miss physical touch? I crave ...,"I haven’t been touched, or even hugged, in so ...",1
3,3,I’m just so ashamed. Everyone and everything f...,Being Depressed is Embarrassing,1
4,4,I really need a friend. I don't even have a si...,I'm desperate for a friend and to feel loved b...,1



The `target` column uses the following label mapping:
0 = Stress
1 = Depression
2 = Bipolar disorder
3 = Personality disorder
4 = Anxiety
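
For use in code, the mapping above can be kept in a small lookup table. This is a minimal sketch; the names `ID2LABEL` and `label_name` are illustrative and do not appear in the notebook.

```python
# Label ids as listed above; the constant and helper names are hypothetical.
ID2LABEL = {
    0: "Stress",
    1: "Depression",
    2: "Bipolar disorder",
    3: "Personality disorder",
    4: "Anxiety",
}

def label_name(class_id):
    """Return the condition name for a class id from the target column."""
    return ID2LABEL[int(class_id)]

print(label_name(1))  # Depression
```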

In [8]:
# Replace NaN or missing values with an empty string
df['text'] = df['text'].fillna('')

# Convert all entries to strings
df['text'] = df['text'].astype(str)

In [9]:
def clean_text(text):
    text = text.lower()  # Lowercase text
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove ASCII punctuation (curly quotes such as ’ are not in string.punctuation and are kept)
    text = ' '.join(text.split())  # Remove extra whitespaces
    return text

# Build the lemmatizer and stopword set once rather than on every call
_lemmatizer = WordNetLemmatizer()
_stop_words = set(stopwords.words('english'))

# Remove stopwords and reduce the remaining words to their base form
def lemmatize_text(text):
    words = text.split()
    lemmatized = [_lemmatizer.lemmatize(word) for word in words if word not in _stop_words]
    return ' '.join(lemmatized)

# Custom transformer class to apply the functions on a dataset
class PreprocessText(BaseEstimator, TransformerMixin):
    def __init__(self, clean_func=clean_text, lemmatize_func=lemmatize_text):
        self.clean_func = clean_func
        self.lemmatize_func = lemmatize_func

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [self.lemmatize_func(self.clean_func(text)) for text in X]

preprocessing_pipeline = Pipeline([
    ('preprocessor', PreprocessText())  # Apply text cleaning and lemmatization
])

In [10]:
# Apply the preprocessing pipeline to the 'text' column
df['cleaned_text'] = preprocessing_pipeline.fit_transform(df['text'])

# Apply the pipeline to a user's response (for chatbot input)
def preprocess_user_input(user_input):
    return preprocessing_pipeline.transform([user_input])[0]

# Example of preprocessed user input
user_input = "I'm feeling really down today, I just can't seem to shake this sadness off."
processed_input = preprocess_user_input(user_input)
print(f"Processed user input: {processed_input}")

Processed user input: im feeling really today cant seem shake sadness
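
The word "down" is missing from the processed output because NLTK's English stopword list includes it, while "im" and "cant" survive because punctuation is stripped before stopwords are filtered. The sketch below reproduces the two stages with the standard library only. `DEMO_STOP_WORDS` is a hand-picked stand-in for the NLTK list and lemmatization is omitted, so this is an illustration under those assumptions rather than the notebook's pipeline.

```python
import re
import string

# Hypothetical stand-in for the part of NLTK's English stopword list that matters here.
DEMO_STOP_WORDS = {"i", "just", "to", "this", "off", "down"}

def clean_text(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r'http\S+|www\S+', '', text)            # remove URLs
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove ASCII punctuation
    return ' '.join(text.split())                         # collapse whitespace

def drop_stop_words(text, stop_words=DEMO_STOP_WORDS):
    return ' '.join(w for w in text.split() if w not in stop_words)

sample = "I'm feeling really down today, I just can't seem to shake this sadness off."
cleaned = clean_text(sample)
print(cleaned)
# im feeling really down today i just cant seem to shake this sadness off
print(drop_stop_words(cleaned))
# im feeling really today cant seem shake sadness
```

Since "down" carries much of the mood in this sentence, the stopword list may be worth reviewing before the cleaned text is used for classification.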




In [11]:
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

In [12]:
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

In [None]:
test_df['target'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 1788 entries, 0 to 1787
Series name: target
Non-Null Count  Dtype
--------------  -----
1788 non-null   int64
dtypes: int64(1)
memory usage: 14.1 KB
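
The 1788 test rows shown here and the 4169 training rows printed further down are consistent with a 70/30 split of 5957 rows. The check below assumes scikit-learn's rounding for a fractional `test_size` (test count rounded up, remainder assigned to training). The total row count is inferred from the two printed sizes and is not shown directly in the notebook.

```python
import math

n_samples = 4169 + 1788   # inferred total number of rows
test_size = 0.3

n_test = math.ceil(test_size * n_samples)   # fractional test size is rounded up
n_train = n_samples - n_test                # the rest goes to training

print(n_train, n_test)  # 4169 1788
```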


# TinyBERT model (compact BERT checkpoint `google/bert_uncased_L-4_H-512_A-8`)

In [13]:
import os
import torch
os.environ["WANDB_DISABLED"] = "true"
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset

In [16]:
model_name='google/bert_uncased_L-4_H-512_A-8'
num_labels=5
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/383 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/116M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-4_H-512_A-8 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
# Tokenize the raw 'text' column (the 'cleaned_text' column created earlier is not used here)
train_encodings = tokenizer(
    train_df['text'].tolist(),
    truncation=True,
    padding=True,
    max_length=128    #512(109MB), 128
)
test_encodings = tokenizer(
    test_df['text'].tolist(),
    truncation=True,
    padding=True,
    max_length=128    #512(109MB), 128
)
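
With `truncation=True`, `padding=True`, and `max_length=128`, sequences longer than 128 tokens are cut and shorter ones are padded to the longest sequence in the batch, with the attention mask marking the real tokens. The following is a simplified pure-Python sketch of that behavior on made-up token ids. It ignores special tokens such as `[CLS]` and `[SEP]`, and `pad_and_truncate` is a hypothetical helper, not a tokenizer API.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest remaining list."""
    truncated = [ids[:max_length] for ids in batch]
    target_len = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy batch with a small max_length so the effect is visible
encoded = pad_and_truncate([[7, 8, 9, 10, 11], [5, 6]], max_length=4)
print(encoded["input_ids"])       # [[7, 8, 9, 10], [5, 6, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```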

In [25]:
# Create datasets
train_dataset = Dataset.from_dict({
    'input_ids': train_encodings['input_ids'],
    'attention_mask': train_encodings['attention_mask'],
    'labels': train_df['target'].tolist()
})
test_dataset = Dataset.from_dict({
    'input_ids': test_encodings['input_ids'],
    'attention_mask': test_encodings['attention_mask'],
    'labels': test_df['target'].tolist()
})

In [19]:
print(f"Train dataset length: {len(train_dataset)}")
print(f"Unique labels: {set(train_df['target'])}")

Train dataset length: 4169
Unique labels: {0, 1, 2, 3, 4}


In [26]:
# Define Training Arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir='./logs'
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [27]:
trainer.train()

Step,Training Loss
500,0.5304


TrainOutput(global_step=783, training_loss=0.44632749204282407, metrics={'train_runtime': 50.9645, 'train_samples_per_second': 245.406, 'train_steps_per_second': 15.364, 'total_flos': 123676717798656.0, 'train_loss': 0.44632749204282407, 'epoch': 3.0})
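
The reported `global_step=783` follows from the dataset size and batch size, and the single logged row is consistent with `Trainer` logging every 500 steps by default. The arithmetic below assumes a single device and no gradient accumulation, which are the `TrainingArguments` defaults.

```python
import math

train_samples = 4169
batch_size = 16
epochs = 3
logging_steps = 500   # Trainer default, assumed unchanged

steps_per_epoch = math.ceil(train_samples / batch_size)   # last batch is partial
total_steps = steps_per_epoch * epochs
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps, logged_steps)  # 261 783 [500]
```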

In [28]:
# Save the model
torch.save(model.state_dict(), 'tinybert_model2.pth')

In [29]:
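# Save again with the zipfile-based format made explicit (already the default since PyTorch 1.6, so the file size should match the save above)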
torch.save(model.state_dict(), 'compressed_model.pth', _use_new_zipfile_serialization=True)

In [31]:
def test_tinybert_model(model_path, text_samples, num_labels=5):
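    # Rebuild the architecture from the base checkpoint; the "newly initialized"
    # classifier warning is expected because the fine-tuned weights are loaded next
    model = AutoModelForSequenceClassification.from_pretrained(
        'google/bert_uncased_L-4_H-512_A-8',
        num_labels=num_labels
    )
    # map_location keeps loading working on CPU-only machines;
    # weights_only restricts unpickling to tensors
    model.load_state_dict(torch.load(model_path, map_location='cpu', weights_only=True))
    model.eval()  # Set to evaluation mode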

    tokenizer = AutoTokenizer.from_pretrained('google/bert_uncased_L-4_H-512_A-8')

    # Inference
    results = []
    with torch.no_grad():
        for text in text_samples:
            # Tokenize and prepare input
            inputs = tokenizer(
                text,
                return_tensors='pt',
                truncation=True,
                padding=True,
                max_length=512  # note: training tokenized with max_length=128
            )

            # Predict
            outputs = model(**inputs)
            predictions = torch.softmax(outputs.logits, dim=1)
            predicted_class = torch.argmax(predictions, dim=1).item()

            results.append({
                'text': text,
                'predicted_class': predicted_class,
                'confidence': predictions[0][predicted_class].item()
            })

    return results


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-4_H-512_A-8 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  model.load_state_dict(torch.load(model_path))


Text: This is a great product
Predicted Class: 0
Confidence: 0.5207

Text: I'm  extremely depressed these days, nothing makes me feel joy.
Predicted Class: 1
Confidence: 0.9395

Text: I have been overjoyed this past month, life has taken a turn for the better.
Predicted Class: 3
Confidence: 0.4404
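
The confidence values above are the largest softmax probability over the five class logits. A minimal sketch of that computation in plain Python follows. The logits are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, 2.5, -1.0, 0.4, 0.1]              # hypothetical scores for classes 0..4
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)  # arg-max
confidence = probs[predicted_class]

print(predicted_class)        # 1
print(round(sum(probs), 6))   # 1.0
```

With five classes, a uniform guess corresponds to a confidence of 0.2, so a value such as 0.4404 indicates a weak preference rather than a firm prediction.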



In [32]:
# Example usage
sample_texts = [
    "This is a terrible product",
    "I'm  extremely depressed these days, nothing makes me feel joy.",
    "I have been overjoyed this past month, life has taken a turn for the better."
]

# Assumes the weights were saved as 'tinybert_model.pth'; the save cell above writes 'tinybert_model2.pth', so adjust the path if needed
test_results = test_tinybert_model('tinybert_model.pth', sample_texts)
for result in test_results:
    print(f"Text: {result['text']}")
    print(f"Predicted Class: {result['predicted_class']}")
    print(f"Confidence: {result['confidence']:.4f}\n")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-4_H-512_A-8 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  model.load_state_dict(torch.load(model_path))


Text: This is a terrible product
Predicted Class: 1
Confidence: 0.3310

Text: I'm  extremely depressed these days, nothing makes me feel joy.
Predicted Class: 1
Confidence: 0.9395

Text: I have been overjoyed this past month, life has taken a turn for the better.
Predicted Class: 3
Confidence: 0.4404



Label mapping for the predicted classes:
0 = Stress
1 = Depression
2 = Bipolar disorder
3 = Personality disorder
4 = Anxiety
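
Three hand-written sentences give only a rough impression of model quality. The sketch below computes accuracy and per-class recall from two lists of label ids. In the notebook these could plausibly come from `trainer.predict(test_dataset)` (arg-max of the returned logits, compared against the `labels` column), but that wiring is an assumption and the toy lists here are made up.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true label, the fraction of its examples predicted correctly."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in sorted(support)}

# Made-up label ids, for illustration only
y_true = [0, 1, 1, 2, 3, 4, 4, 4]
y_pred = [0, 1, 3, 2, 3, 4, 0, 4]

print(accuracy(y_true, y_pred))  # 0.75
print(per_class_recall(y_true, y_pred))
```

Per-class recall is reported alongside accuracy because the split above is not stratified, so class frequencies in the test set may be uneven.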