# Feature Engineering Notebook: Mental Health Text Classification
This notebook presents a robust, production-ready feature engineering pipeline for mental health text classification. The goal is to transform raw user-generated statements into a comprehensive set of interpretable and machine-learning-ready features, enabling accurate categorization into five key mental health states: Anxiety, Depression, Normal, Stress, and Suicidal.

The approach is designed for portability: it runs on CPU only, relies on widely packaged libraries (pandas, NumPy, scikit-learn, SciPy, TextBlob, emoji), and uses pure-Python fallbacks for stopwords and POS tagging. This simplifies deployment in containerized and cloud environments, such as Google Cloud Platform (GCP), with no dependency on GPU drivers or heavyweight NLP toolchains.

**Key steps include:**
- Rigorous text cleaning and normalization to ensure privacy and consistency.
- Extraction of structural, lexical, sentiment, and domain-specific features that capture both the style and substance of mental health discourse.
- Heuristic part-of-speech (POS) analysis and TF-IDF vectorization to provide both interpretable and high-dimensional representations.
- Careful feature scaling and export, supporting a wide range of downstream machine learning models.

This notebook serves as a critical foundation for subsequent model development, evaluation, and deployment in real-world mental health support systems.

## 1. Import Libraries

In [103]:
import pandas as pd
import numpy as np
import re
import emoji
import time
import warnings
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import save_npz

warnings.filterwarnings('ignore')

# Always disable NLTK usage for tokenization/POS to avoid punkt errors
use_nltk = False

# Fallback stopword set
stopset = {
    'the','and','is','in','to','it','of','for','on','with','as','this',
    'a','an','that','at','by','from','or','be','are','was','were'
}

print("Environment setup complete (pure-Python fallback)\n")

Environment setup complete (pure-Python fallback)



## 2. Load Data

In [105]:
print("Loading cleaned_dataset.csv...")
try:
    df = pd.read_csv('cleaned_dataset.csv')
except Exception as e:
    raise RuntimeError(f"Failed to load data: {e}")

df = df.rename(columns={'Unnamed: 0': 'Unique_ID'})
df = df[['Unique_ID', 'statement', 'status']]
print(f"Dataset loaded: {len(df)} records")
print("Class distribution:")
print(df['status'].value_counts(), "\n")

Loading cleaned_dataset.csv...
Dataset loaded: 45261 records
Class distribution:
status
Normal                  15991
Depression              12106
Suicidal                 8812
Anxiety                  3103
Stress                   2343
Bipolar                  2118
Personality disorder      788
Name: count, dtype: int64 



## 3. Text Cleaning
Remove HTML tags, mask URLs and emojis, normalize whitespace, lowercase.

In [107]:
def clean_text(text):
    t = str(text)
    t = re.sub(r"<.*?>", " ", t)
    t = re.sub(r"http\S+|www\S+", " urltoken ", t)
    t = emoji.replace_emoji(t, replace=" emoticon ")
    t = re.sub(r"\s+", " ", t)
    return t.lower().strip()

print("Cleaning text...")
start = time.time()
df['clean_text'] = df['statement'].apply(clean_text)
print(f"Completed in {time.time() - start:.2f}s\n")

Cleaning text...
Completed in 15.89s



*Decision Note: Cleaning comes first so that markup, URLs (which can carry identifying information), and emojis are normalized before any feature is computed, which keeps all downstream features consistent.*
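
To make the effect of each cleaning step concrete, here is a minimal standard-library sketch of the same sequence. `clean_text_sketch` and `EMOJI_RE` are hypothetical names, and the regex only approximates `emoji.replace_emoji` by covering a few common emoji blocks, so treat it as an illustration rather than a replacement.

```python
import re

# Assumption: a rough stand-in for emoji.replace_emoji that covers only
# a few common emoji/symbol blocks, not the full emoji set.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_text_sketch(text):
    t = str(text)
    t = re.sub(r"<.*?>", " ", t)                    # drop HTML tags
    t = re.sub(r"http\S+|www\S+", " urltoken ", t)  # mask URLs
    t = EMOJI_RE.sub(" emoticon ", t)               # mask emojis (approximate)
    t = re.sub(r"\s+", " ", t)                      # collapse whitespace
    return t.lower().strip()

sample = "<p>I can't sleep \U0001F61E</p>  See https://example.com/help   NOW"
cleaned = clean_text_sketch(sample)
print(cleaned)  # i can't sleep emoticon see urltoken now
```

Note that the non-greedy `<.*?>` pattern also removes any plain text that happens to sit between a `<` and a later `>` on the same line, which may matter for statements that use those characters informally.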

## 4. Structural Features
Extract metrics: text length, word count, URL count, emoji count, non-ASCII character count, punctuation bursts (runs of three or more `.`, `!`, or `?`), and average word length.

In [110]:
print("Extracting structural features...")
start = time.time()
df['text_length'] = df['clean_text'].str.len()
df['word_count'] = df['clean_text'].apply(lambda x: len(x.split()))
df['num_urls'] = df['statement'].apply(lambda x: len(re.findall(r"http\S+|www\S+", str(x))))
df['num_emojis'] = df['statement'].apply(lambda x: emoji.emoji_count(str(x)))
df['num_special_chars'] = df['statement'].apply(lambda x: sum(ord(c) > 126 for c in str(x)))
df['num_excess_punct'] = df['statement'].apply(lambda x: len(re.findall(r"[.!?]{3,}", str(x))))
df['avg_word_length'] = df['clean_text'].apply(
    lambda x: np.mean([len(w) for w in x.split()]) if x.split() else 0
)
print(f"Completed in {time.time() - start:.2f}s\n")

Extracting structural features...
Completed in 16.31s



*Decision Note: These features capture information density, use of digital content, and structural quirks of user statements.*

### Interpretation:
*Statements with high emoji or special character counts may correspond to more expressive or emotional content.*
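
As a worked example, the sketch below applies the same regular expressions and counting rules to a single string. `structural_features` is a hypothetical helper; for brevity it computes every metric from one input, whereas the notebook takes lengths from `clean_text` and counts from the raw `statement`. The emoji count is omitted to keep the sketch free of third-party dependencies.

```python
import re

URL_RE = re.compile(r"http\S+|www\S+")
PUNCT_BURST_RE = re.compile(r"[.!?]{3,}")

def structural_features(raw):
    raw = str(raw)
    words = raw.split()
    return {
        "text_length": len(raw),
        "word_count": len(words),
        "num_urls": len(URL_RE.findall(raw)),
        # Characters above code point 126, i.e. outside printable ASCII
        "num_special_chars": sum(ord(c) > 126 for c in raw),
        "num_excess_punct": len(PUNCT_BURST_RE.findall(raw)),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0,
    }

features = structural_features("Help!!! see www.example.org now... caf\u00e9")
for name, value in features.items():
    print(f"{name}: {value}")
```

In this sample, the single dots inside the URL do not count as a punctuation burst; only `!!!` and `...` do.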


## 5. Lexical Features
Compute the stopword ratio and the type-token ratio (unique tokens divided by total tokens) as a measure of vocabulary richness.

In [113]:
print("Extracting lexical features...")
start = time.time()
df['stopword_ratio'] = df['clean_text'].apply(
    lambda x: sum(1 for w in x.split() if w in stopset) / max(1, len(x.split()))
)
df['type_token_ratio'] = df['clean_text'].apply(
    lambda x: len(set(x.split())) / max(1, len(x.split()))
)
print(f"Completed in {time.time() - start:.2f}s\n")

Extracting lexical features...
Completed in 1.27s



*Decision Note: Vocabulary richness helps distinguish vague posts from highly detailed ones, a difference that may correlate with mental health state.*
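
A small worked example of both ratios, using the fallback stopword set defined in Section 1. `lexical_ratios` is a hypothetical helper name introduced only for this illustration.

```python
stopset = {
    'the','and','is','in','to','it','of','for','on','with','as','this',
    'a','an','that','at','by','from','or','be','are','was','were'
}

def lexical_ratios(text):
    words = text.split()
    total = max(1, len(words))  # guard against empty strings
    stopword_ratio = sum(1 for w in words if w in stopset) / total
    type_token_ratio = len(set(words)) / total
    return stopword_ratio, type_token_ratio

sw_ratio, ttr = lexical_ratios("the pain is the same as the day before")
# 9 tokens: 5 are stopwords (the x3, is, as); 7 are distinct
print(f"stopword_ratio={sw_ratio:.3f}, type_token_ratio={ttr:.3f}")
```

The type-token ratio tends to fall as texts get longer, because common words repeat, so it is best compared between statements of similar length.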

## 6. Sentiment and Domain-Specific Features
Use TextBlob for polarity (in [-1, 1]) and subjectivity (in [0, 1]), plus binary flags for suicidal, stress, and help-seeking keywords.

In [116]:
print("Extracting sentiment features...")
start = time.time()
from textblob import TextBlob
df[['polarity', 'subjectivity']] = df['clean_text'].apply(
    lambda x: pd.Series(TextBlob(x).sentiment)
)
print(f"Completed in {time.time() - start:.2f}s\n")

print("Extracting keyword flags...")
start = time.time()
suicidal_kw = {'kill','die','suicide','self-harm','end it','dead'}
stress_kw = {'stress','overwhelmed','pressure','burnout','anxious'}
help_kw = {'help','save me','talk','support','therapist'}

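def flag_keywords(text, keywords):
    """Return 1 if any keyword or multi-word phrase occurs in text, else 0."""
    lw = text.lower()
    words = set(lw.split())
    # Single-token keywords: exact match against whitespace-split tokens
    if words & keywords:
        return 1
    # Multi-word phrases: substring match; int() keeps the column dtype numeric
    return int(any(phrase in lw for phrase in keywords if ' ' in phrase))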

df['has_suicidal_keyword'] = df['clean_text'].apply(lambda x: flag_keywords(x, suicidal_kw))
df['has_stress_keyword']   = df['clean_text'].apply(lambda x: flag_keywords(x, stress_kw))
df['has_help_keyword']     = df['clean_text'].apply(lambda x: flag_keywords(x, help_kw))
print(f"Completed in {time.time() - start:.2f}s\n")

Extracting sentiment features...
Completed in 32.79s

Extracting keyword flags...
Completed in 1.63s



*Decision Note: Specific keyword flags allow for easy triage and prioritization of high-risk statements in a production setting.*
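
Matching on whitespace-split tokens misses keywords with attached punctuation, such as `help!`. One possible alternative, sketched here under the assumption that whole-word matching is the intent, compiles each keyword set into a single regular expression with word-boundary lookarounds. `compile_keywords` and `flag_keywords_re` are hypothetical names, not part of the notebook.

```python
import re

def compile_keywords(keywords):
    # Longest alternatives first so multi-word phrases are tried before
    # shorter keywords that might be their prefixes.
    parts = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    # Lookarounds require a non-word character (or string edge) on both sides.
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)")

def flag_keywords_re(text, pattern):
    return int(pattern.search(text.lower()) is not None)

help_kw = {'help', 'save me', 'talk', 'support', 'therapist'}
help_re = compile_keywords(help_kw)

for s in ["please help!", "someone save me.", "i feel helpless", "i feel fine"]:
    print(s, "->", flag_keywords_re(s, help_re))
```

With this variant, `help!` and `save me.` are flagged, while `helpless` is not, because the lookahead rejects a match that continues into another word character.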


## 7. POS Ratios Transformer
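Approximate noun, verb, adjective, and adverb ratios with suffix heuristics (NLTK tagging is disabled in this notebook). Because every word ending in `ly` also ends in `y`, adverbs are counted in `adj_ratio` as well.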

In [119]:
print("Extracting POS ratio features...")
start = time.time()

def compute_pos_ratios(text):
    words = text.split()
    total = len(words) or 1
    return pd.Series({
        'noun_ratio': sum(w.endswith(('ion','ment'))   for w in words) / total,
        'verb_ratio': sum(w.endswith(('ing','ed'))    for w in words) / total,
        'adj_ratio' : sum(w.endswith(('y','al'))      for w in words) / total,
        'adv_ratio' : sum(w.endswith('ly')            for w in words) / total
    })

pos_df = df['clean_text'].apply(compute_pos_ratios)
df = pd.concat([df, pos_df], axis=1)

print(f"POS ratio features extracted in {time.time()-start:.2f}s\n")

Extracting POS ratio features...
POS ratio features extracted in 9.88s



*Decision Note: POS ratios give a coarse view of syntactic style, which may carry signal about mood and intent.*
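
The suffix rules are easiest to judge on a concrete sentence. The sketch below reuses the same suffixes but returns a plain dictionary instead of a `pd.Series`; `pos_ratios_sketch` is a hypothetical name used only here.

```python
def pos_ratios_sketch(text):
    words = text.split()
    total = len(words) or 1
    return {
        'noun_ratio': sum(w.endswith(('ion', 'ment')) for w in words) / total,
        'verb_ratio': sum(w.endswith(('ing', 'ed')) for w in words) / total,
        'adj_ratio':  sum(w.endswith(('y', 'al')) for w in words) / total,
        'adv_ratio':  sum(w.endswith('ly') for w in words) / total,
    }

ratios = pos_ratios_sketch("she was quickly walking to the happy station")
# 8 tokens: station (noun), walking (verb), quickly + happy (adj), quickly (adv)
print(ratios)
```

Here `quickly` contributes to both `adj_ratio` and `adv_ratio`, and `was` is not counted as a verb, which shows how approximate these ratios are compared with a real tagger.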


## 8. TF-IDF Vectorization
Fit a TF-IDF vectorizer on unigrams and bigrams (5,000 features max) with sublinear scaling and document frequency thresholds to create sparse bag-of-words features.

In [122]:
print("Generating TF-IDF features...")
start = time.time()

tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1,2),
    min_df=2,
    max_df=0.95,
    sublinear_tf=True
)
tfidf_matrix = tfidf.fit_transform(df['clean_text'])
tfidf_vocab  = tfidf.get_feature_names_out()

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"TF-IDF generation completed in {time.time()-start:.2f}s\n")

Generating TF-IDF features...
TF-IDF matrix shape: (45261, 5000)
TF-IDF generation completed in 9.51s



*Decision Note: TF-IDF provides a lightweight, interpretable representation of important terms without requiring deep models or GPUs.*
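
For intuition about what `sublinear_tf=True` does, the following sketch computes the weight of a single term by hand. It assumes scikit-learn's documented defaults `smooth_idf=True` and `norm='l2'`, which the vectorizer above does not override; `sublinear_tfidf` and `l2_normalize` are hypothetical helper names, and the document frequencies are made-up illustration values.

```python
import math

def sublinear_tfidf(tf, df, n_docs):
    """Unnormalized weight of one term in one document."""
    if tf == 0:
        return 0.0
    idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed inverse document frequency
    return (1.0 + math.log(tf)) * idf              # sublinear term frequency: 1 + ln(tf)

def l2_normalize(weights):
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else list(weights)

n_docs = 45261  # number of records in this dataset
rare = sublinear_tfidf(tf=3, df=50, n_docs=n_docs)       # term seen in few documents
common = sublinear_tfidf(tf=3, df=20000, n_docs=n_docs)  # term seen in many documents
row = l2_normalize([rare, common])
print(f"rare={rare:.3f}, common={common:.3f}, normalized row={row}")
```

With the same in-document count, the rarer term receives the larger weight, and the logarithm keeps a term repeated many times from dominating the row.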

## 10. Feature Scaling
Apply Min-Max scaling to all numeric features (structural, lexical, sentiment, POS ratios) to bound them between 0 and 1.

In [125]:
print("Scaling numeric features...")
start = time.time()

numeric_cols = [
    'text_length','word_count','num_urls','num_emojis','num_special_chars',
    'num_excess_punct','avg_word_length','stopword_ratio','type_token_ratio',
    'polarity','subjectivity','noun_ratio','verb_ratio','adj_ratio','adv_ratio'
]
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(f"Feature scaling completed in {time.time()-start:.2f}s\n")

Scaling numeric features...
Feature scaling completed in 0.02s



*Decision Note: Uniform feature scaling prevents variables with larger magnitudes from dominating model learning and ensures stable optimization.*
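
Min-Max scaling maps each value to `(x - min) / (max - min)`. The sketch below implements that formula in plain Python with hypothetical helpers `fit_min_max` and `apply_min_max`, and with made-up polarity values. It also separates fitting from applying: the notebook fits on the full dataset, but if a train/test split is introduced later, fitting on the training portion only would be one way to avoid leaking test statistics.

```python
def fit_min_max(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # constant column: avoid division by zero
    return lo, span

def apply_min_max(values, lo, span):
    return [(v - lo) / span for v in values]

# Hypothetical polarity scores spanning the full [-1, 1] range
train_polarity = [-1.0, -0.2, 0.0, 0.6, 1.0]
lo, span = fit_min_max(train_polarity)
scaled_train = apply_min_max(train_polarity, lo, span)

# Values scaled with the training parameters; 1.4 lies outside the fitted range
scaled_new = apply_min_max([0.3, 1.4], lo, span)
print(scaled_train, scaled_new)
```

When the observed polarity spans the full [-1, 1] range, a neutral score of 0 maps to 0.5 after scaling, which matches the median of 0.5 in the summary table at the end of this notebook. Values outside the fitted range fall outside [0, 1].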

## 11. Export Feature Sets

In [128]:
print("Exporting feature sets...")

# Add text column to inter_cols for modeling with DistilBERT
inter_cols = ['Unique_ID', 'statement', 'status'] + numeric_cols + [
    'has_suicidal_keyword', 'has_stress_keyword', 'has_help_keyword'
]

df[inter_cols].to_csv('features_interpretable.csv', index=False)

save_npz('tfidf_features.npz', tfidf_matrix)
pd.DataFrame({'term': tfidf_vocab}).to_csv('tfidf_vocab.csv', index=False)

print("Export complete:")
print("  - features_interpretable.csv (includes raw text column for deep models)")
print("  - tfidf_features.npz")
print("  - tfidf_vocab.csv\n")

print("Feature engineering pipeline finished successfully.")


Exporting feature sets...
Export complete:
  - features_interpretable.csv (includes raw text column for deep models)
  - tfidf_features.npz
  - tfidf_vocab.csv

Feature engineering pipeline finished successfully.


*Decision Note: The feature set is now ready for modeling. Export ensures easy reuse and versioning in later ML stages.*
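
A later modeling notebook could reload the sparse matrix with `scipy.sparse.load_npz` and place the interpretable features beside it. The sketch below shows that round trip on tiny made-up matrices; the variable names and values are illustrative assumptions, not outputs of this pipeline.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, hstack, load_npz, save_npz

# Toy stand-ins for the exported TF-IDF matrix and the scaled numeric features
tfidf_part = csr_matrix(np.array([[0.0, 0.5], [0.7, 0.0], [0.0, 0.0]]))
dense_part = np.array([[0.1, 1.0], [0.4, 0.0], [0.9, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tfidf_features.npz")
    save_npz(path, tfidf_part)   # same call as in the export cell
    restored = load_npz(path)    # counterpart used when loading

# Stack sparse TF-IDF columns and dense feature columns side by side
combined = hstack([restored, csr_matrix(dense_part)]).tocsr()
print(combined.shape)  # (3, 4)
```

Row order is what ties the two files together, so the CSV and the `.npz` file should always be written from the same, unshuffled DataFrame.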

### Interpretation:
The notebook is modular and reproducible for future datasets, with all steps explained and decisions justified.

In [130]:
df.describe()

| | Unique_ID | text_length | word_count | num_urls | num_emojis | num_special_chars | num_excess_punct | avg_word_length | stopword_ratio | type_token_ratio | polarity | subjectivity | noun_ratio | verb_ratio | adj_ratio | adv_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 | 45261.0 |
| mean | 26237.80544 | 0.276083 | 0.247557 | 0.00323 | 0.001622 | 0.002778 | 0.00341 | 0.010863 | 0.198562 | 0.807242 | 0.5049 | 0.45248 | 0.012495 | 0.056079 | 0.069837 | 0.021898 |
| std | 15279.138295 | 0.264196 | 0.236354 | 0.027201 | 0.018873 | 0.013427 | 0.017014 | 0.006173 | 0.092677 | 0.153194 | 0.125276 | 0.265234 | 0.044371 | 0.056095 | 0.065345 | 0.041114 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 12817.0 | 0.050336 | 0.045455 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00931 | 0.158273 | 0.677812 | 0.45 | 0.313636 | 0.0 | 0.0 | 0.027778 | 0.0 |
| 50% | 27082.0 | 0.192114 | 0.170455 | 0.0 | 0.0 | 0.0 | 0.0 | 0.010319 | 0.206897 | 0.801045 | 0.5 | 0.5 | 0.0 | 0.051282 | 0.06422 | 0.0 |
| 75% | 39460.0 | 0.437919 | 0.390152 | 0.0 | 0.0 | 0.0 | 0.0 | 0.011585 | 0.25 | 1.0 | 0.558333 | 0.614286 | 0.0 | 0.078947 | 0.094595 | 0.031579 |
| max | 53042.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
