# Data Preparation for Mental Health Support Chatbot

In this project, we are developing a Mental Health Support Chatbot to provide guidance and support to individuals seeking mental health advice. The goal is to build a chatbot that can respond empathetically and non-judgmentally to users' mental health concerns.

## Dataset

The dataset for this project was collected from various sources that include mental health forums, support groups, and online communities. It contains a total of 6,365 rows of conversations related to mental health support. Each row in the dataset represents a conversation between a user seeking mental health advice and the chatbot assistant.

The dataset includes the following columns:

- `questionID`: ID of the question
- `questionTitle`: Title of the question
- `questionText`: Text of the question
- `questionLink`: Link to the question
- `topic`: Topic category of the question
- `therapistInfo`: Information about the therapist
- `therapistURL`: URL of the therapist's profile
- `answerText`: Text of the answer provided by the chatbot
- `upvotes`: Number of upvotes received for the question
- `views`: Number of views for the question
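
As a quick sanity check on this schema, the sketch below verifies that a CSV header contains the ten columns listed above. It uses only the standard library `csv` module; the helper name `missing_columns` and the in-memory sample are illustrative assumptions, not part of the project code.

```python
import csv
import io

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = [
    "questionID", "questionTitle", "questionText", "questionLink", "topic",
    "therapistInfo", "therapistURL", "answerText", "upvotes", "views",
]

def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from the CSV header (hypothetical helper)."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]

# Illustrative in-memory file whose header lacks the last two columns.
sample = io.StringIO(
    "questionID,questionTitle,questionText,questionLink,topic,"
    "therapistInfo,therapistURL,answerText\n"
)
print(missing_columns(sample))  # ['upvotes', 'views']
```

With a real file, pass an open handle instead, for example `open(file_path, newline="")`.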

## Data Preprocessing

Before building the chatbot, we preprocessed the dataset. Because of time constraints, we limited preprocessing to the essential steps, and it could be made more comprehensive. We carried out the following steps:

1. **Data Loading**: We loaded the dataset from a CSV file using the Pandas library. This allowed us to work with the data in a tabular format, making it easier to process and analyze.

2. **Removing Unnecessary Columns**: To simplify the data, we removed the columns `questionID`, `questionTitle`, `topic`, and `therapistInfo` as they were not required for our chatbot development.

3. **Extracting Human and Assistant Text**: Each `questionText` value holds the whole exchange, with the user's turn marked by `<<<HUMAN>>>` and the chatbot's reply marked by `<<<ASSISTANT>>>`. We split the text on these markers into separate `humanText` and `assistantText` fields, which organizes the data for building the chatbot's responses.

4. **Cleaning Text**: We performed basic text cleaning to remove unwanted characters, special symbols, and irrelevant content that might degrade the chatbot's performance. The `clean_text` function in the code is the placeholder where these steps plug in.

5. **Splitting Data into Questions and Answers**: We separated the dataset into questions and corresponding answers to prepare the input-output pairs required for training the chatbot.
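
Steps 4 and 5 could be sketched as follows. The specific cleaning rules (unescaping HTML entities, dropping control characters, collapsing whitespace) and the `to_pairs` helper are assumptions chosen for illustration; the project's own cleaning rules are not specified here. The `humanText` and `assistantText` field names match the columns created in the code.

```python
import html
import re
import unicodedata

def clean_text(text):
    """Illustrative cleaning: the exact rules are an assumption, not the project's."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters such as non-breaking spaces
    # Drop control and format characters, keeping ordinary whitespace for the next step.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch)[0] != "C"
    )
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

def to_pairs(records):
    """Build (question, answer) pairs, skipping rows where either side is empty after cleaning."""
    pairs = []
    for record in records:
        question = clean_text(record.get("humanText") or "")
        answer = clean_text(record.get("assistantText") or "")
        if question and answer:
            pairs.append((question, answer))
    return pairs

# Made-up rows for illustration only.
rows = [
    {"humanText": "I feel   anxious &amp; tired.\n", "assistantText": " That sounds hard.\u200b "},
    {"humanText": "   ", "assistantText": "No question here."},
]
print(to_pairs(rows))  # [('I feel anxious & tired.', 'That sounds hard.')]
```

Dropping rows with an empty side keeps every training example a complete input-output pair; whether that is the right policy depends on how many rows it removes.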

## Model and Training

To build the Mental Health Support Chatbot, we used a GPT-style causal language model: `LlamaForCausalLM`, the Transformers implementation of the LLaMA architecture, initialized from weights pretrained on a large corpus of text data.

We employed the Transformers library from Hugging Face to facilitate model training. The library provided us with useful tools for tokenization, data collation, and training the model using the PyTorch framework.

During training, we batched the data with a PyTorch `DataLoader` and enabled mixed-precision (FP16) training to reduce memory consumption and speed up training.
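
To make the batching step concrete, here is a pure-Python sketch of what a collate function for causal language modeling typically does: pad each batch to its longest sequence, mark real tokens in an attention mask, and set labels to -100 at padded positions so the loss ignores them (-100 is the default `ignore_index` of PyTorch's cross-entropy loss). The function names, the pad token ID of 0, and the toy token IDs are assumptions; this is not the Transformers collator itself.

```python
def collate(batch, pad_token_id=0, ignore_index=-100):
    """Pad a list of token-ID lists into rectangular input_ids, attention_mask, and labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        padding = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * padding)
        attention_mask.append([1] * len(ids) + [0] * padding)
        # For causal LM training the labels mirror the inputs; padding is excluded from the loss.
        labels.append(ids + [ignore_index] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

def batches(examples, batch_size):
    """Yield consecutive collated batches of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield collate(examples[start:start + batch_size])

# Toy token IDs standing in for tokenizer output.
examples = [[5, 6, 7], [8, 9], [10]]
for batch in batches(examples, batch_size=2):
    print(batch["input_ids"], batch["attention_mask"])
```

In practice, Transformers' `DataCollatorForLanguageModeling` with `mlm=False` produces batches of this shape as tensors, and FP16 is enabled separately, for example through `TrainingArguments(fp16=True)`.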

## Conclusion

The dataset of 6,365 mental health conversations, drawn from forums, support groups, and online communities, formed the basis for the Mental Health Support Chatbot. Time constraints limited data preparation to the essential preprocessing steps, and it could be refined further.

With the data preprocessed and the model trained, the chatbot is ready for deployment. We believe it will serve as a valuable resource, offering empathetic and helpful support to individuals seeking assistance with their mental well-being.

In [None]:
import pandas as pd

# Load data from CSV file
file_path = "mental_health_data.csv"
data = pd.read_csv(file_path)

# Remove unnecessary columns
columns_to_drop = ["questionID", "questionTitle", "topic", "therapistInfo"]
data.drop(columns=columns_to_drop, inplace=True)

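# Function to extract human and assistant text
def extract_human_assistant_text(text):
    # Non-string values (e.g. NaN) and rows missing a marker yield empty strings instead of raising
    if not isinstance(text, str):
        return "", ""
    _, _, after_human = text.partition("<<<HUMAN>>>: ")
    human_text, _, assistant_text = after_human.partition("<<<ASSISTANT>>>: ")
    return human_text.strip(), assistant_text.strip()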

# Apply function to extract human and assistant text
data["humanText"], data["assistantText"] = zip(*data["questionText"].map(extract_human_assistant_text))

# Drop the original 'questionText' column
data.drop(columns=["questionText"], inplace=True)

# Function to clean text
def clean_text(text):
    # Add any text cleaning steps here
    return text

# Apply function to clean text
data["humanText"] = data["humanText"].apply(clean_text)
data["assistantText"] = data["assistantText"].apply(clean_text)

# Save the processed data as JSON Lines (lines=True writes one JSON record per line)
output_file = "mental_health_data_processed.json"
data.to_json(output_file, orient="records", lines=True)

# Summary of the data
num_rows = data.shape[0]
print(f"Data preparation completed. The dataset contains {num_rows} rows.")

# Additional note for GitHub
print("Please note that the data preparation can be further improved, but due to time constraints, we focused on essential steps.")
