## Mental-Health-Discussion-Analyzer
#### Author
**Name:** Andres Figueroa  
**Email:** andresfigueroa@brandeis.edu

#### Project Description
The purpose of this project is to build a tool that collects, processes, and visualizes online discussions about mental health (like posts on Reddit).

---

#### Importing Libraries

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS
import pandas as pd
import numpy as np
import re
from textblob import TextBlob
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

---

#### Loading and Exploring Our Data

In [17]:
df = pd.read_csv("mental_health_posts.csv")
print(f"DataFrame shape: {df.shape}")
df.head()

DataFrame shape: (500, 5)


Unnamed: 0,title,score,num_comments,created,selftext
0,I feel behind in life and it is making me depr...,1,0,2025-09-08 23:29:39,I just feel so behind in life. I keep wishing ...
1,I don't understand why I'm hurting,1,0,2025-09-08 23:27:41,I've been diagnosed by a psychiatrist for depr...
2,Struggling with Maladaptive Daydreaming: How D...,1,0,2025-09-08 23:27:27,"Hi everyone,\nI’m reaching out because I’m str..."
3,I hate my life,2,2,2025-09-08 23:26:05,I'm 30m. And I hate my life. I hate my job and...
4,Guys it's slowly but surely becoming unbearable,1,1,2025-09-08 23:22:24,"I can't. I just can't. \n\nAt day, at work, wi..."


**Note:** I know we could look at this back in `WebScraper.py`, but it looks nice here too. I just want to make things look nice 😊.

In [18]:
df.dtypes

title           object
score            int64
num_comments     int64
created         object
selftext        object
dtype: object

---

#### Defining Our ML Problem

##### Explaining our Data
The data comes from the Reddit forum `r/mentalhealth`. From the `New` tab, I scraped the following:
- Post Title (`title`)
- Number of Upvotes (`score`)
- Number of Comments (`num_comments`)
- Time Post was Created (`created`)
- The Post or Body Text (`selftext`)

---

##### Prediction Target (Unlabeled)
The goal of this project is to analyze the text data, upvotes, and number of comments from the posts to identify common themes and topics discussed in the mental health community. Since the data is unlabeled, we will use unsupervised learning techniques to discover patterns and insights.

---

##### Problem Type
This is an unsupervised learning problem, as there are no labels. The insights I am hoping to gather are:
- sentiment trends
- common topics discussed
- patterns and trends

---

##### Importance
Understanding online discussions about mental health can provide valuable insights into the challenges and concerns faced by individuals. This information can be used to inform mental health interventions, support services, and public health campaigns.

---

#### Cleaning Our Data/Text
##### Combining Title and Text

In [19]:
df["text"] = df["title"].fillna("") + " " + df["selftext"].fillna("")

##### Cleaning Our Text

In [None]:
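nltk.download('wordnet')  # WordNet gives the lemmatizer the vocabulary and rules it needs to reduce words to their base form
lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.PorterStemmer()  # defined but not used in clean_text below

def clean_text(text):
    text = str(text)                        # make sure we are working with a string
    text = str(TextBlob(text).correct())    # correct spelling (slow on long texts)
    text = text.lower()                     # lowercase
    text = re.sub(r"[^a-z\s]", "", text)    # drop everything except letters and whitespace
    words = text.split()                    # tokenize on whitespace
    words = [word for word in words if word not in ENGLISH_STOP_WORDS]  # remove stopwords
    words = [lemmatizer.lemmatize(word) for word in words]              # lemmatize each word
    return " ".join(words)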

df["cleaned_text"] = df["text"].apply(clean_text)
df["cleaned_text"].head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...


0    feel life making depressed just feel life wish...
1    dont understand im hurting ive diagnosed psych...
2    struggling maladaptive daydreaming break cycle...
3    hate life im hate life hate job house rent hat...
4    slowly surely unbearable just day work friend ...
Name: cleaned_text, dtype: object

**Note:** It's more readable than it once was.  
What's being done (`Text Cleaning`):
- Ensuring `"text"` is a string
- Correcting any spelling mistakes
- Lowercasing
- Removing punctuation and special characters
- Tokenizing (splitting into words)
- Removing stopwords (common words that don't add much meaning)
- Lemmatizing (reducing each word to its dictionary form)
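
To see what the punctuation-removal step does on its own, here is a minimal sketch that uses the same regular expression as `clean_text`. The helper name `strip_non_letters` and the sample sentence are made up for illustration and are not part of the project. Because the apostrophe is deleted rather than replaced with a space, contractions collapse into single tokens, which is probably why words like `dont`, `im`, and `ive` show up in the cleaned output.

```python
import re

def strip_non_letters(text):
    """Lowercase, drop everything except a-z and whitespace, then split into words."""
    lowered = text.lower()
    # Same pattern as in clean_text: anything that is not a letter or whitespace is deleted
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    return letters_only.split()

# Hypothetical sample sentence (not taken from the dataset)
sample = "I don't understand why I'm hurting. I've tried!"
print(strip_non_letters(sample))
```

If contractions matter for the analysis, one option would be to expand them before this step, though that is not something the current pipeline does.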

##### Implementing TF-IDF Vectorizer to Transform Text

In [22]:
vectorizer = TfidfVectorizer(max_features=5000)  # limit vocab to top 5000 terms
X = vectorizer.fit_transform(df["cleaned_text"])
print(f"TF-IDF matrix shape: {X.shape}")
feature_names = vectorizer.get_feature_names_out()
print(f"Sample features: {feature_names[:10]}")  # print first 10 feature names



TF-IDF matrix shape: (500, 5000)
Sample features: ['abandoned' 'abandoning' 'abandonment' 'ability' 'able' 'abroad'
 'abruptly' 'abscess' 'absence' 'absolute']


**Note:** I am not fond of NLP. Yes, I did one assignment on NLP, but I wouldn't say I am a master.  

TF-IDF stands for Term Frequency - Inverse Document Frequency. It's a method to turn text into numbers, so that machine learning models can understand it.

- **Term Frequency (TF)**: This measures how often a word appears in a document. The more times a word appears, the higher its TF score.  

        For Example: "I feel so lonely lonely" {"I" = 1, "feel" = 1, "so" = 1, "lonely" = 2}

- **Inverse Document Frequency (IDF)**: This measures how important a word is across all documents. If a word appears in many documents, it gets a lower IDF score. Rare words get higher scores.  

        Formula:
        IDF(word) = log(Total Number of Documents / (1 + Number of Documents with the word))

- **TF-IDF Score**: The TF-IDF score is calculated by multiplying the TF and IDF scores. This way, common words that appear in many documents get lower scores, while rare but important words get higher scores.

        Formula:
        TF-IDF = TF * IDF
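
The formulas above can be worked through by hand on a tiny example. The sketch below is plain Python, with a made-up three-document corpus and hypothetical helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`). It follows the textbook definitions in this note, not the exact computation inside `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the dataset)
docs = [
    "i feel so lonely lonely",
    "i feel tired",
    "work is stressful",
]

def term_frequency(doc):
    """Raw count of each word in one document."""
    return Counter(doc.split())

def inverse_document_frequency(word, docs):
    """IDF as written in the note: log(N / (1 + number of docs containing the word))."""
    n_docs = len(docs)
    doc_freq = sum(1 for d in docs if word in d.split())
    return math.log(n_docs / (1 + doc_freq))

def tf_idf(word, doc, docs):
    """TF-IDF = TF * IDF for one word in one document."""
    return term_frequency(doc)[word] * inverse_document_frequency(word, docs)

print(dict(term_frequency(docs[0])))
for word in ["lonely", "feel"]:
    print(word, round(tf_idf(word, docs[0], docs), 3))
```

In this toy corpus, "lonely" appears twice in the first document and nowhere else, so it gets a positive score, while "feel" appears in two of the three documents and its score drops to zero. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes each row, so the values in `X` will not match these hand-computed numbers.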