# **Approach 1: Machine Learning Baseline (XGBoost)**

### **Overview**
This notebook implements the machine learning baseline for suicide risk detection. We use **XGBoost**, a gradient-boosted decision tree classifier, trained on a combination of:
1.  **TF-IDF Vectors:** To capture lexical patterns (word unigrams, since the vectorizer's default `ngram_range` is used).
2.  **Psycholinguistic Features:** To capture emotional tone and semantic topics.

### **Feature Engineering Strategy**
Instead of the proprietary **LIWC** software, we use two open-source tools drawn from recent literature:
* **Empath:** Used to extract semantic categories (e.g., 'death', 'pain', 'family'). Fast et al. (2016) report that Empath's categories correlate strongly ($r = 0.906$) with the corresponding LIWC categories.
* **TextBlob:** Used to extract **Sentiment Polarity** and **Subjectivity**, following the forensic text analysis methodology of Adkins et al. (2025).

Because both tools are freely available, the feature pipeline stays reproducible and transparent without depending on licensed software.

---

**Imports & Setup**

In [3]:
import pandas as pd
import numpy as np
import sys
import os
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score
from textblob import TextBlob
from empath import Empath
import xgboost as xgb

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

from src.utils import compute_graded_metrics

PROCESSED_DATA_DIR = '../data/processed'

**Load Data**

In [7]:
train_df = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'train.pkl'))
val_df = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'val.pkl'))
test_df = pd.read_pickle(os.path.join(PROCESSED_DATA_DIR, 'test.pkl'))

print(f"Train, Val, Test size: {len(train_df)}, {len(val_df)}, {len(test_df)}")

train_df.head()

Train, Val, Test size: 11972, 1605, 1036


| Unnamed: 0 | users | text | sentiment | time | timestamp_dt | label_ordinal |
|---|---|---|---|---|---|---|
| 0 | 1 | No one understands how much I desperately want... | Ideation | 1648483701 | 2022-03-28 16:08:21 | 1 |
| 1 | 2 | Today I never wanted to live to see 25. That m... | Behavior | 1651130449 | 2022-04-28 07:20:49 | 2 |
| 2 | 3 | Suicidal thoughts at / because of school For s... | Ideation | 1662712545 | 2022-09-09 08:35:45 | 1 |
| 3 | 4 | I feel like the pain will never end Everyday f... | Ideation | 1638628371 | 2021-12-04 14:32:51 | 1 |
| 4 | 4 | Is there even a point to living if you're not ... | Indicator | 1639749228 | 2021-12-17 13:53:48 | 0 |
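
The preview suggests that `label_ordinal` is an integer encoding of the risk level stored in the `sentiment` column. The sketch below shows one way such a mapping could be written. The ordering is an assumption taken from the class order used at evaluation time (`target_names`), and the names `RISK_LEVELS`, `LABEL_TO_ORDINAL`, and `encode_labels` are hypothetical, not part of the project code.

```python
# Risk levels from lowest to highest severity.
# Assumption: same order as `target_names` in the evaluation cell.
RISK_LEVELS = ["Indicator", "Ideation", "Behavior", "Attempt"]
LABEL_TO_ORDINAL = {name: i for i, name in enumerate(RISK_LEVELS)}


def encode_labels(labels):
    """Map risk-level names to ordinal integers (KeyError on unknown names)."""
    return [LABEL_TO_ORDINAL[name] for name in labels]


# The five `sentiment` values shown in the preview
sample = ["Ideation", "Behavior", "Ideation", "Ideation", "Indicator"]
print(encode_labels(sample))  # [1, 2, 1, 1, 0]
```

The printed values match the `label_ordinal` column of the five previewed rows.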


---
### **Feature Engineering 1: Psycholinguistic Feature Extraction**

We implement a feature extractor combining **TextBlob** and **Empath**.

**References:**
* **Fast, E., Chen, B., & Bernstein, M. S. (2016).** *Empath: Understanding Topic Signals in Large-Scale Text.* CHI 2016. (Validated Empath against LIWC).
* **Adkins, J., Al Bataineh, A., & Khanal, A. (2025).** *A psycholinguistic NLP framework for forensic text analysis of deception and emotion.* Frontiers in AI. (Used TextBlob for sentiment/subjectivity).

In [10]:
# Initialize Empath lexicon
lexicon = Empath()

# ===== BEGIN: Gemini-generated block =====
def get_psycholinguistic_features(texts):
    """
    Extracts psycholinguistic features to replace LIWC.
    
    Features include:
    1. Basic Stats: Word count, First-person pronoun ratio (I-usage).
    2. Sentiment & Subjectivity: Calculated via TextBlob (Adkins et al., 2025).
    3. Semantic Topics: Calculated via Empath (Fast et al., 2016).
    """
    # Define specific categories relevant to suicide risk detection
    # These align with standard LIWC categories often used in mental health research
    target_categories = [
        "death", "pain", "medical", "negative_emotion", 
        "sadness", "anxiety", "family", "friend", "work", "swearing"
    ]
    
    features = []
    print(f"Extracting features for {len(texts)} posts...")
    
    for text in texts:
        text_str = str(text)
        blob = TextBlob(text_str)
        words = text_str.lower().split()
        num_words = max(1, len(words)) # Avoid division by zero
        
        # --- 1. Basic Linguistic Statistics ---
        # Self-references (I, me, my) are strong indicators of self-focus in depression
        i_count = sum(1 for w in words if w in ['i', 'me', 'my', 'mine', 'myself'])
        i_ratio = i_count / num_words
        
        # --- 2. TextBlob Features (Adkins et al., 2025) ---
        # Polarity: -1.0 (Negative) to 1.0 (Positive)
        # Subjectivity: 0.0 (Objective) to 1.0 (Subjective)
        polarity = blob.sentiment.polarity
        subjectivity = blob.sentiment.subjectivity
        
        # --- 3. Empath Features (Fast et al., 2016) ---
        # Analyze text against the target categories
        # normalize=True divides the count by total words (similar to LIWC)
        empath_scores = lexicon.analyze(text_str.lower(), categories=target_categories, normalize=True)
        
        # With normalize=True, Empath returns None when the text contains no tokens
        if not empath_scores:
            empath_scores = {cat: 0.0 for cat in target_categories}
            
        # Combine all features into a single row
        row = [num_words, i_ratio, polarity, subjectivity] + [empath_scores[cat] for cat in target_categories]
        features.append(row)
        
    return np.array(features)

# ===== END: Gemini-generated block =====

# Apply extraction to all splits
# (Note: This might take a minute or two depending on dataset size)
print("--- Processing Train Set ---")
X_train_psych = get_psycholinguistic_features(train_df['text'])

print("--- Processing Val Set ---")
X_val_psych = get_psycholinguistic_features(val_df['text'])

print("--- Processing Test Set ---")
X_test_psych = get_psycholinguistic_features(test_df['text'])

print(f"Psycholinguistic Feature Shape: {X_train_psych.shape}")

--- Processing Train Set ---
Extracting features for 11972 posts...
--- Processing Val Set ---
Extracting features for 1605 posts...
--- Processing Test Set ---
Extracting features for 1036 posts...
Psycholinguistic Feature Shape: (11972, 14)
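
One detail of the extractor is worth illustrating: it tokenizes with `str.split()`, so tokens that carry punctuation or a contraction (for example `me.` or `i'm`) do not match the pronoun list. The following is a minimal sketch of a punctuation-aware variant, assuming plain ASCII apostrophes. The function name `first_person_ratio` is hypothetical, and swapping it in would change the feature values reported in this notebook.

```python
import re

FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def first_person_ratio(text):
    """Share of tokens that are first-person singular pronouns."""
    # Keep alphabetic runs, optionally followed by an apostrophe suffix ("i'm")
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", str(text).lower())
    if not tokens:
        return 0.0
    # Compare only the part before the apostrophe, so "i'm" counts as "i"
    hits = sum(1 for tok in tokens if tok.split("'")[0] in FIRST_PERSON)
    return hits / len(tokens)


print(first_person_ratio("I'm tired. Nobody hears me."))  # 0.4
```

On the same sentence, whitespace splitting yields `i'm`, `tired.`, `nobody`, `hears`, `me.`, none of which is in the pronoun list, so the ratio would be 0.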


### **Feature Engineering 2: TF-IDF (N-grams)**
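*Note: `TfidfVectorizer` lowercases text by default (`lowercase=True`). Because `ngram_range` is left at its default of `(1, 1)`, the vocabulary consists of unigrams only.*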

In [11]:
# Keep only the 5,000 most frequent terms to limit dimensionality and reduce overfitting
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')

# Fit on TRAIN only to prevent data leakage
X_train_tfidf = tfidf.fit_transform(train_df['text'])
X_val_tfidf = tfidf.transform(val_df['text'])
X_test_tfidf = tfidf.transform(test_df['text'])

print(f"TF-IDF Train Shape: {X_train_tfidf.shape}")
print(f"TF-IDF Val Shape:   {X_val_tfidf.shape}")
print(f"TF-IDF Test Shape:  {X_test_tfidf.shape}")

TF-IDF Train Shape: (11972, 5000)
TF-IDF Val Shape:   (1605, 5000)
TF-IDF Test Shape:  (1036, 5000)
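
To make the TF-IDF weights less opaque, here is a small pure-Python sketch of the weighting scheme that scikit-learn applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, $\ln\frac{1+n}{1+\mathrm{df}} + 1$, followed by L2 normalization of each row. The tokenization is simplified to whitespace splitting (scikit-learn's token pattern and stop-word removal are not reproduced), and `tfidf_rows` is a hypothetical helper used only for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Smoothed, L2-normalized TF-IDF for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        # L2-normalize the row; an empty document keeps an empty row
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows, idf


docs = ["pain never ends", "pain at school", "school ends today"]
rows, idf = tfidf_rows(docs)
print(round(idf["pain"], 3), round(idf["today"], 3))  # 1.288 1.693
```

A term that occurs in fewer documents ("today") receives a larger IDF than one that occurs in more ("pain"), which is why rare but distinctive words carry more weight in the final vectors.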


### **Feature Combination**

1.  **Dense Matrix:** Psycholinguistic features.
2.  **Sparse Matrix:** TF-IDF vectors.

These are horizontally stacked to form the final training input ($X$).

In [13]:
# ===== BEGIN: Gemini-generated block =====

# Stack TF-IDF (Sparse) with Psycholinguistic (Dense) matrices
X_train = sp.hstack([X_train_tfidf, X_train_psych])
X_val = sp.hstack([X_val_tfidf, X_val_psych])
X_test = sp.hstack([X_test_tfidf, X_test_psych])

# Prepare target labels (to Numpy array)
y_train = train_df['label_ordinal'].values
y_val = val_df['label_ordinal'].values
y_test = test_df['label_ordinal'].values

# ===== END: Gemini-generated block =====

print(f"Final Training Data Shape:   {X_train.shape}")
print(f"Final Validation Data Shape: {X_val.shape}")
print(f"Final Test Data Shape:       {X_test.shape}")

Final Training Data Shape:   (11972, 5014)
Final Validation Data Shape: (1605, 5014)
Final Test Data Shape:       (1036, 5014)
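
The final width of 5014 is the 5000 TF-IDF columns followed by the 14 psycholinguistic columns (4 basic statistics plus 10 Empath categories). Because the TF-IDF block comes first in the stack, the psycholinguistic features occupy the last 14 column indices. The sketch below reconstructs that layout. The `empath_` prefix and the helper `psych_column_index` are hypothetical naming choices, not part of the notebook.

```python
N_TFIDF = 5000  # number of TF-IDF columns reported in the shapes above

# Same order as the `row` list built inside get_psycholinguistic_features
basic_names = ["num_words", "i_ratio", "polarity", "subjectivity"]
empath_categories = [
    "death", "pain", "medical", "negative_emotion",
    "sadness", "anxiety", "family", "friend", "work", "swearing",
]
psych_names = basic_names + ["empath_" + c for c in empath_categories]


def psych_column_index(name):
    """Column index of a psycholinguistic feature in the stacked matrix."""
    return N_TFIDF + psych_names.index(name)


print(len(psych_names), N_TFIDF + len(psych_names))  # 14 5014
print(psych_column_index("polarity"))                # 5002
```

A mapping like this can help when inspecting per-column importances of the trained model, since the stacked matrix itself carries no feature names.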


### **Model Training (XGBoost)**

In [15]:
model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    objective='multi:softmax', # Used for multiclass classification
    num_class=4,               # 4 Ordinal Classes: Indicator, Ideation, Behavior, Attempt
    n_jobs=-1,                 # Use all CPU cores
    random_state=42,
    early_stopping_rounds=20   # prevent overfitting
)

print("training start")
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=True
)
print("training end")

training start
[0]	validation_0-mlogloss:1.22856
[1]	validation_0-mlogloss:1.18457
[2]	validation_0-mlogloss:1.14708
[3]	validation_0-mlogloss:1.11378
[4]	validation_0-mlogloss:1.08502
[5]	validation_0-mlogloss:1.05914
[6]	validation_0-mlogloss:1.03659
[7]	validation_0-mlogloss:1.01733
[8]	validation_0-mlogloss:0.99862
[9]	validation_0-mlogloss:0.98234
[10]	validation_0-mlogloss:0.96882
[11]	validation_0-mlogloss:0.95544
[12]	validation_0-mlogloss:0.94358
[13]	validation_0-mlogloss:0.93368
[14]	validation_0-mlogloss:0.92363
[15]	validation_0-mlogloss:0.91556
[16]	validation_0-mlogloss:0.90859
[17]	validation_0-mlogloss:0.90113
[18]	validation_0-mlogloss:0.89451
[19]	validation_0-mlogloss:0.88790
[20]	validation_0-mlogloss:0.88208
[21]	validation_0-mlogloss:0.87708
[22]	validation_0-mlogloss:0.87280
[23]	validation_0-mlogloss:0.86835
[24]	validation_0-mlogloss:0.86363
[25]	validation_0-mlogloss:0.86019
[26]	validation_0-mlogloss:0.85640
[27]	validation_0-mlogloss:0.85367
[28]	validation
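
The `early_stopping_rounds=20` setting stops training once the validation loss has gone 20 consecutive rounds without improving. The following is a simplified sketch of that patience rule, written in plain Python for illustration. It is not XGBoost's internal implementation, and `early_stop` is a hypothetical helper.

```python
def early_stop(losses, patience):
    """Return (best_round, stop_round) for a sequence of validation losses.

    stop_round is None if the patience budget is never exhausted.
    """
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, i
    return best_round, None


# First five values from the training log: still improving, so no stop
log_head = [1.22856, 1.18457, 1.14708, 1.11378, 1.08502]
print(early_stop(log_head, patience=20))  # (4, None)

# Toy sequence where the loss bottoms out at round 1
toy = [0.90, 0.85, 0.86, 0.87, 0.88]
print(early_stop(toy, patience=3))  # (1, 4)
```

In the logged rounds the loss decreases at every step, so the patience counter never starts within that range.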

---
### **Evaluation**

We evaluate the model using:
1.  **Standard Accuracy:** The fraction of posts assigned exactly the correct class
2.  **Graded Metrics:** Specifically designed for ordinal suicide risk (Sawhney et al.), implemented in `src.utils`
    * **Graded Precision (GP)**
    * **Graded Recall (GR)**
    * **Graded F1 (GF1)**
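
The project's `compute_graded_metrics` lives in `src/utils.py` and is not reproduced in this notebook, so the following is only one plausible reading of graded metrics for ordinal labels, not the actual implementation. It assumes that predicting a higher risk level than the true one counts as a false positive, that predicting a lower level counts as a false negative, and that graded precision and recall are one minus the respective error rates. The function name `graded_metrics_sketch` is hypothetical. The dictionary keys mirror those used in the evaluation cell.

```python
def graded_metrics_sketch(y_true, y_pred):
    """Ordinal-aware precision/recall under the assumptions stated above."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    # Over-estimates: predicted risk level is higher than the true level
    over = sum(1 for t, p in zip(y_true, y_pred) if p > t)
    # Under-estimates: predicted risk level is lower than the true level
    under = sum(1 for t, p in zip(y_true, y_pred) if p < t)
    gp = 1.0 - over / n
    gr = 1.0 - under / n
    gf1 = 2 * gp * gr / (gp + gr) if (gp + gr) > 0 else 0.0
    return {"graded_precision": gp, "graded_recall": gr, "graded_f1": gf1}


# One over-estimate (1 -> 2) and one under-estimate (2 -> 1) out of four posts
print(graded_metrics_sketch([0, 1, 2, 3], [0, 2, 1, 3]))
```

Under this reading, a low graded recall signals that the model tends to under-estimate risk, which is usually the more costly error in this setting.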

In [18]:
# Predict on test set
y_pred = model.predict(X_test)

# ===== BEGIN: Gemini-generated block =====

# 1. Standard Metrics
acc = accuracy_score(y_test, y_pred)
print(f"\nSimple Accuracy: {acc:.4f}")

# 2. Detailed Classification Report
target_names = ['Indicator', 'Ideation', 'Behavior', 'Attempt']
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

# ===== END: Gemini-generated block =====

# 3. Graded Metrics (Project Requirement)
# Using the custom function from src/utils.py
graded_metrics = compute_graded_metrics(y_test, y_pred)
print("\n=== Graded Metrics ===")
print(f"Graded Precision: {graded_metrics['graded_precision']:.4f}")
print(f"Graded Recall:    {graded_metrics['graded_recall']:.4f}")
print(f"Graded F1-Score:  {graded_metrics['graded_f1']:.4f}")



Simple Accuracy: 0.6631

Classification Report:
              precision    recall  f1-score   support

   Indicator       0.63      0.71      0.67       305
    Ideation       0.69      0.80      0.74       530
    Behavior       0.61      0.25      0.36       135
     Attempt       0.59      0.15      0.24        66

    accuracy                           0.66      1036
   macro avg       0.63      0.48      0.50      1036
weighted avg       0.65      0.66      0.64      1036


=== Graded Metrics ===
Graded Precision: 0.8986
Graded Recall:    0.7645
Graded F1-Score:  0.8262
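
The gap between the macro and weighted averages in the classification report follows directly from the class imbalance. As a quick arithmetic check, the sketch below recomputes both averages from the rounded per-class values printed above, so results agree with the report only up to rounding. The variable and function names are illustrative.

```python
# (precision, recall, f1-score, support) copied from the classification report
report = {
    "Indicator": (0.63, 0.71, 0.67, 305),
    "Ideation":  (0.69, 0.80, 0.74, 530),
    "Behavior":  (0.61, 0.25, 0.36, 135),
    "Attempt":   (0.59, 0.15, 0.24, 66),
}
n_total = sum(row[3] for row in report.values())


def macro(col):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[col] for row in report.values()) / len(report)


def weighted(col):
    """Support-weighted mean: large classes dominate."""
    return sum(row[col] * row[3] for row in report.values()) / n_total


print(n_total)  # 1036
for name, col in [("precision", 0), ("recall", 1), ("f1-score", 2)]:
    print(f"{name:>9}  macro={macro(col):.3f}  weighted={weighted(col):.3f}")
```

Macro recall lands near 0.48 while weighted recall stays near 0.66 because Behavior and Attempt, which together account for only 201 of the 1036 test posts, have recall of 0.25 and 0.15.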
