# Data Processing Pipeline for Depression Detection

This notebook processes Reddit text data for depression detection using:
- Text preprocessing (URL and mention removal, lowercasing, number removal, emoji and hashtag conversion)
- TF-IDF vectorization (5000 features)
- NRCLex emotion extraction (10 features)
- Feature combination and normalization

**Input**: `bin_reddit1.csv` (raw Reddit posts)  
**Output**: Processed feature matrices saved to `processed/` folder

## 1. Setup and Imports

In [1]:
# Import necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nrclex import NRCLex
import re
import emoji
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from scipy.sparse import csr_matrix, save_npz
import os

# Try to import wordsegment for advanced hashtag processing
try:
    from wordsegment import load, segment
    load()
    WORDSEGMENT_AVAILABLE = True
    print("✓ wordsegment library available - using advanced hashtag processing")
except ImportError:
    WORDSEGMENT_AVAILABLE = False
    print("⚠ wordsegment not installed - using regex-based hashtag processing")
    print("  To enable advanced processing: pip install wordsegment")

print("All imports successful!")

⚠ wordsegment not installed - using regex-based hashtag processing
  To enable advanced processing: pip install wordsegment
All imports successful!


### Optional: Install wordsegment for better hashtag processing

For improved hashtag segmentation (e.g., `#mentalhealth` → "mental health"), install the `wordsegment` library:

```bash
pip install wordsegment
```

Without it, the notebook falls back to regex-based processing, which still handles camelCase and underscores but cannot split all-lowercase hashtags such as `#mentalhealth`.

In [2]:
# Download necessary NLTK data
print("Downloading NLTK resources...")
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)
print("NLTK resources downloaded successfully!")

Downloading NLTK resources...
NLTK resources downloaded successfully!


## 2. Load and Explore Data

In [3]:
# Load the dataset
df = pd.read_csv('bin_reddit1.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumn names: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())
print(f"\nFirst few rows:")
df.head()

Dataset shape: (99590, 3)

Column names: ['text', 'label', ' ']

Label distribution:
label
0    58405
1    41185
Name: count, dtype: int64

First few rows:


|   | text | label | Unnamed: 3 |
|---|------|-------|------------|
| 0 | aa glad fun paint night sky | 0 | |
| 1 | abandonment massive fear trigger suicidal | 1 | |
| 2 | ability induce anxiety gift god | 0 | |
| 3 | ability write complex business | 0 | |
| 4 | Q | 0 | |


## 3. Text Preprocessing Functions

In [4]:
def process_hashtags(text):
    """
    Advanced hashtag processing with multiple strategies.
    
    If wordsegment is available (pip install wordsegment):
        Uses statistical word segmentation to split hashtags intelligently
        Example: #mentalhealth → "mental health" (automatically recognizes words)
    
    Otherwise uses regex-based approach:
        1. Split camelCase: #MentalHealth → mental health
        2. Replace underscores: #mental_health → mental health
        3. Separate numbers: #covid19 → covid 19
        4. Handle mixed case: #MentalHealthAwareness2024 → mental health awareness 2024
    
    Examples:
        #MentalHealth → "mental health"
        #mentalhealth → "mental health" (with wordsegment) or "mentalhealth" (without)
        #mental_health → "mental health"
        #COVID19 → "covid 19"
        #MentalHealthAwareness2024 → "mental health awareness 2024"
    """
    def convert(match):
        hashtag = match.group(1)

        if WORDSEGMENT_AVAILABLE:
            # Statistical word segmentation (best approach)
            # Automatically splits concatenated words: "mentalhealth" → ["mental", "health"]
            return ' '.join(segment(hashtag.lower()))

        # Fallback: regex-based processing
        # Step 1: Replace underscores with spaces
        processed = hashtag.replace('_', ' ')

        # Step 2: Split camelCase (lowercase followed by uppercase)
        # Handles: MentalHealth → Mental Health
        processed = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', processed)

        # Step 3: Split on uppercase sequences followed by lowercase
        # Handles: MENTALHealth → MENTAL Health
        processed = re.sub(r'(?<=[A-Z])(?=[A-Z][a-z])', ' ', processed)

        # Step 4: Separate numbers from letters
        # Handles: covid19 → covid 19, 2024awareness → 2024 awareness
        processed = re.sub(r'(?<=[a-zA-Z])(?=\d)', ' ', processed)
        processed = re.sub(r'(?<=\d)(?=[a-zA-Z])', ' ', processed)

        # Step 5: Convert to lowercase and clean up extra spaces
        return ' '.join(processed.lower().split())

    # Substitute each hashtag where it occurs. A callback avoids the prefix problem
    # of str.replace, where handling '#Mental' would also rewrite the start of
    # '#MentalHealth'.
    return re.sub(r'#(\w+)', convert, text)

# Test the function with examples
print("Testing hashtag processing:")
print(f"Mode: {'Statistical word segmentation (wordsegment)' if WORDSEGMENT_AVAILABLE else 'Regex-based processing'}")
print()

test_cases = [
    "#MentalHealth",
    "#mentalhealth",  # Only wordsegment handles this well
    "#mental_health",
    "#COVID19",
    "#MentalHealthAwareness2024",
    "#depressed",
    "#IFeelDepressed",
    "#2024Goals",
    "#selfcare"  # Another test for wordsegment
]

for test in test_cases:
    result = process_hashtags(test)
    print(f"  {test:35} → {result}")

print("\n✓ Hashtag processing function defined successfully!")

Testing hashtag processing:
Mode: Regex-based processing

  #MentalHealth                       → mental health
  #mentalhealth                       → mentalhealth
  #mental_health                      → mental health
  #COVID19                            → covid 19
  #MentalHealthAwareness2024          → mental health awareness 2024
  #depressed                          → depressed
  #IFeelDepressed                     → i feel depressed
  #2024Goals                          → 2024 goals
  #selfcare                           → selfcare

✓ Hashtag processing function defined successfully!
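
The regex fallback relies on capital letters and digits to find word boundaries, so its result depends on whether it runs before or after lowercasing and digit removal. The sketch below is a minimal illustration of that effect. It assumes a simplified `split_hashtag` helper (a hypothetical name that mirrors the fallback rules above) and a made-up example post.

```python
import re

def split_hashtag(tag: str) -> str:
    """Regex-only fallback: split on underscores, camelCase and letter/digit boundaries."""
    s = tag.replace('_', ' ')
    s = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', s)
    s = re.sub(r'(?<=[A-Z])(?=[A-Z][a-z])', ' ', s)
    s = re.sub(r'(?<=[a-zA-Z])(?=\d)', ' ', s)
    s = re.sub(r'(?<=\d)(?=[a-zA-Z])', ' ', s)
    return ' '.join(s.lower().split())

def expand_hashtags(text: str) -> str:
    """Replace every #hashtag in the text with its split form."""
    return re.sub(r'#(\w+)', lambda m: split_hashtag(m.group(1)), text)

def lower_and_strip_digits(text: str) -> str:
    """Lowercase and remove digit runs, as in the lowercase and number-removal steps."""
    return re.sub(r'\d+', '', text.lower())

def squeeze(text: str) -> str:
    """Collapse repeated whitespace for easier comparison."""
    return ' '.join(text.split())

raw = "Feeling low today #MentalHealth #COVID19"  # made-up example post

# Ordering A: expand hashtags first, then lowercase and strip digits
hashtags_first = squeeze(lower_and_strip_digits(expand_hashtags(raw)))
# Ordering B: lowercase and strip digits first, then expand hashtags
hashtags_last = squeeze(expand_hashtags(lower_and_strip_digits(raw)))

print(hashtags_first)
print(hashtags_last)
```

Ordering A keeps the word boundary in `#MentalHealth`; ordering B collapses it to `mentalhealth` because the capital letters are gone before the split runs.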


In [5]:
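# Lemmatizer and stop-word set are initialized here but not applied in this cell:
# the text stays un-lemmatized for NRCLex, and TF-IDF removes stop words itself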
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

print("Starting text preprocessing pipeline...")

# Step 1: Remove URLs
print("1/6 - Removing URLs...")
df['text'] = df['text'].str.replace(r'http\S+', '', regex=True)

# Step 2: Remove mentions
print("2/6 - Removing mentions...")
df['text'] = df['text'].str.replace(r'@\S+', '', regex=True)

# Step 3: Normalize to lowercase
print("3/6 - Converting to lowercase...")
df['text'] = df['text'].str.lower()

# Step 4: Remove numbers
print("4/6 - Removing numbers...")
df['text'] = df['text'].str.replace(r'\d+', '', regex=True)

# Step 5: Replace emojis with text
print("5/6 - Converting emojis to text...")
df['text'] = df['text'].apply(lambda x: emoji.demojize(x, delimiters=(" ", " ")))

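# Step 6: Convert hashtags to normal text
# Note: steps 3 and 4 have already lowercased the text and removed digits, so the
# regex fallback can only split on underscores at this point; run this step before
# step 3 if camelCase or numeric hashtags should be split.
print("6/6 - Processing hashtags...")
df['text'] = df['text'].apply(process_hashtags)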

# Save intermediate texts for feature extractors
# TF-IDF works best with the raw cleaned text (let it tokenize itself)
df['text_for_tfidf'] = df['text'].copy()
# NRCLex works better when negations/stopwords are preserved (no lemmatization)
df['text_for_nrc'] = df['text'].copy()

print("\n✓ Text preprocessing complete!")
print(f"\nSample outputs:")
print(df[['text_for_tfidf', 'text_for_nrc', 'label']].head())

Starting text preprocessing pipeline...
1/6 - Removing URLs...
2/6 - Removing mentions...
3/6 - Converting to lowercase...
4/6 - Removing numbers...
5/6 - Converting emojis to text...
6/6 - Processing hashtags...

✓ Text preprocessing complete!

Sample outputs:
                                text_for_tfidf  \
0                  aa glad fun paint night sky   
1    abandonment massive fear trigger suicidal   
2              ability induce anxiety gift god   
3               ability write complex business   
4                                            q   

                                  text_for_nrc  label  
0                  aa glad fun paint night sky      0  
1    abandonment massive fear trigger suicidal      1  
2              ability induce anxiety gift god      0  
3               ability write complex business      0  
4                                            q      0  

## 4. Apply Text Preprocessing Pipeline

The preprocessing cell above applies these steps:
1. Remove URLs and mentions
2. Convert to lowercase
3. Remove numbers
4. Convert emojis to text
5. Process hashtags
6. Create two text streams (currently identical copies of the cleaned text):
   - text_for_tfidf (cleaned, not tokenized)
   - text_for_nrc (negations/stopwords preserved)

Note: We no longer tokenize or lemmatize the text for NRCLex, so that cues such as negation are preserved; TF-IDF handles its own tokenization and stopword removal.

## 5. Class Distribution Analysis

In [6]:
# Analyze class distribution
df_class_0 = df[df['label'] == 0]
df_class_1 = df[df['label'] == 1]

print("=" * 60)
print("CLASS DISTRIBUTION ANALYSIS")
print("=" * 60)
print(f"\nClass 0 (Non-depression): {df_class_0.shape[0]} samples")
print(f"Class 1 (Depression): {df_class_1.shape[0]} samples")
print(f"\nImbalance ratio: {df_class_0.shape[0] / df_class_1.shape[0]:.2f}:1")
print("\nSample from Class 0:")
display(df_class_0[['text_for_nrc', 'label']].head(3))
print("\nSample from Class 1:")
display(df_class_1[['text_for_nrc', 'label']].head(3))

CLASS DISTRIBUTION ANALYSIS

Class 0 (Non-depression): 58405 samples
Class 1 (Depression): 41185 samples

Imbalance ratio: 1.42:1

Sample from Class 0:


|   | text_for_nrc | label |
|---|--------------|-------|
| 0 | aa glad fun paint night sky | 0 |
| 2 | ability induce anxiety gift god | 0 |
| 3 | ability write complex business | 0 |



Sample from Class 1:


|   | text_for_nrc | label |
|---|--------------|-------|
| 1 | abandonment massive fear trigger suicidal | 1 |
| 7 | absence mental illness doesnt presence menta... | 1 |
| 11 | absolute bastard odd | 1 |


## 6. Feature Extraction

### 6.1 TF-IDF Features (5000 features)

In [7]:
# Prepare features and target
X_tfidf_text = df['text_for_tfidf']   # Use mid-cleaned text (no tokenization) for TF-IDF
X_nrc_text = df['text_for_nrc']       # Use mid-cleaned text (preserve negations/stopwords) for NRCLex
y = df['label']

print(f"Total samples: {len(y)}")
print(f"TF-IDF input: {X_tfidf_text.shape}")
print(f"NRCLex input: {X_nrc_text.shape}")
print(f"Target (y): {y.shape}")

Total samples: 99590
TF-IDF input: (99590,)
NRCLex input: (99590,)
Target (y): (99590,)


In [8]:
# Extract TF-IDF features with optimized configuration
print("Extracting TF-IDF features...")
print("Configuration:")
print("  - Using text_for_tfidf (cleaned but not tokenized)")
print("  - TF-IDF handles tokenization, stop words, and n-grams internally")

tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),          # Include unigrams and bigrams
    min_df=2,                    # Ignore terms that appear in fewer than 2 documents
    max_df=0.95,                 # Ignore terms that appear in more than 95% of documents
    stop_words='english',        # Let TF-IDF handle stop words removal
    token_pattern=r'\b\w+\b',    # Match word tokens
    strip_accents='unicode',     # Normalize unicode characters
    lowercase=True,              # Already lowercased, but ensure consistency
    sublinear_tf=True            # Use 1 + log(tf) instead of raw term frequency
)

X_tfidf = tfidf_vectorizer.fit_transform(X_tfidf_text)

print(f"\n✓ TF-IDF shape: {X_tfidf.shape}")
print(f"  - Sparse matrix with {X_tfidf.nnz:,} non-zero elements")
print(f"  - Sparsity: {(1 - X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1])) * 100:.2f}%")
print(f"  - Vocabulary size: {len(tfidf_vectorizer.vocabulary_):,} unique terms")

Extracting TF-IDF features...
Configuration:
  - Using text_for_tfidf (cleaned but not tokenized)
  - TF-IDF handles tokenization, stop words, and n-grams internally

✓ TF-IDF shape: (99590, 5000)
  - Sparse matrix with 662,785 non-zero elements
  - Sparsity: 99.87%
  - Vocabulary size: 5,000 unique terms
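
To make the `sublinear_tf` setting concrete, here is a minimal pure-Python sketch of the weighting scikit-learn documents for `TfidfVectorizer` with its defaults `smooth_idf=True` and `norm='l2'`: tf becomes 1 + ln(tf), idf is ln((1 + n) / (1 + df)) + 1, and each row is scaled to unit length. The three toy documents are made up for illustration, and the sketch leaves out stop-word removal, bigrams, and the `max_features` cut.

```python
import math
from collections import Counter

# Made-up toy corpus, whitespace-tokenized for simplicity
docs = [
    "sad sad sad tired",
    "happy tired",
    "happy happy day",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency, as with smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens, sublinear=True):
    """Return L2-normalized TF-IDF weights for one document."""
    weights = {}
    for term, tf in Counter(tokens).items():
        tf_weight = 1 + math.log(tf) if sublinear else tf
        weights[term] = tf_weight * idf(term)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

raw_row = tfidf_row(tokenized[0], sublinear=False)
sub_row = tfidf_row(tokenized[0], sublinear=True)
print({t: round(w, 3) for t, w in raw_row.items()})
print({t: round(w, 3) for t, w in sub_row.items()})
```

With sublinear scaling the repeated term still dominates the row, but by less, so a post that repeats one word many times is not weighted in proportion to the raw count.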



### 6.2 NRCLex Emotion Features (10 features)

In [9]:
# Extract NRC emotion features
def extract_nrc_features(text_list):
    """
    Extract emotion features using NRCLex.
    Returns a DataFrame with emotion scores for each text.
    """
    nrc_features = []
    total = len(text_list)
    
    for i, text in enumerate(text_list):
        if (i + 1) % 5000 == 0:
            print(f"  Processing: {i + 1}/{total} texts...")
        
        emotion_object = NRCLex(text)
        scores = emotion_object.raw_emotion_scores
        nrc_features.append(scores)
    
    # Convert to DataFrame and fill missing values
    nrc_df = pd.DataFrame(nrc_features).fillna(0)
    return nrc_df

print("Extracting NRCLex emotion features...")
print("Using mid-cleaned text_for_nrc (negations preserved, no lemmatization)")
text_list = X_nrc_text.tolist()
X_nrc_features = extract_nrc_features(text_list)

print(f"\n✓ NRC features shape: {X_nrc_features.shape}")
print(f"\nEmotion columns:")
print(X_nrc_features.columns.tolist())
print(f"\nSample emotion scores:")
display(X_nrc_features.head())

Extracting NRCLex emotion features...
Using mid-cleaned text_for_nrc (negations preserved, no lemmatization)
  Processing: 5000/99590 texts...
  Processing: 10000/99590 texts...
  Processing: 15000/99590 texts...
  Processing: 20000/99590 texts...
  Processing: 25000/99590 texts...
  Processing: 30000/99590 texts...
  Processing: 35000/99590 texts...
  Processing: 40000/99590 texts...
  Processing: 45000/99590 texts...
  Processing: 50000/99590 texts...
  Processing: 55000/99590 texts...
  Processing: 60000/99590 texts...
  Processing: 65000/99590 texts...
  Processing: 700

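|   | anticipation | joy | positive | anger | fear | negative | sadness | surprise | disgust | trust |
|---|--------------|-----|----------|-------|------|----------|---------|----------|---------|-------|
| 0 | 2.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0.0 |
| 2 | 3.0 | 2.0 | 3.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |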
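
`raw_emotion_scores` holds, for each text, a count of lexicon matches per emotion category. Categories with no match are absent from the dictionary, which is why the DataFrame above needs `fillna(0)`. The sketch below imitates that behaviour with a tiny hand-made lexicon; `TOY_LEXICON` and its word-to-emotion entries are hypothetical and are not taken from the NRC lexicon.

```python
from collections import Counter

# Hypothetical mini-lexicon: word -> emotion categories (NOT the real NRC data)
TOY_LEXICON = {
    "glad": ["joy", "positive"],
    "fear": ["fear", "negative"],
    "abandonment": ["sadness", "negative", "fear"],
}
EMOTIONS = ["fear", "joy", "negative", "positive", "sadness"]

def toy_raw_scores(text):
    """Count lexicon hits per emotion; emotions with no hits are left out."""
    counts = Counter()
    for word in text.split():
        counts.update(TOY_LEXICON.get(word, []))
    return dict(counts)

def to_rows(texts):
    """Fill missing emotions with 0 so every row has the same columns."""
    return [[toy_raw_scores(t).get(e, 0) for e in EMOTIONS] for t in texts]

texts = ["glad fun paint night sky", "abandonment massive fear trigger", "q"]
print(EMOTIONS)
for text, row in zip(texts, to_rows(texts)):
    print(row, text)
```

A text with no lexicon words, such as the single-letter post in row 4, yields an empty dictionary and therefore an all-zero feature row.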


### 6.3 Normalize NRC Features

In [10]:
# Normalize NRC features to [0, 1] range
print("Normalizing NRC features...")
scaler = MinMaxScaler()
X_nrc_features_scaled = scaler.fit_transform(X_nrc_features)

print(f"✓ Scaled NRC features shape: {X_nrc_features_scaled.shape}")
print(f"  - Min value: {X_nrc_features_scaled.min()}")
print(f"  - Max value: {X_nrc_features_scaled.max()}")

Normalizing NRC features...
✓ Scaled NRC features shape: (99590, 10)
  - Min value: 0.0
  - Max value: 1.0
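
`MinMaxScaler` rescales each column independently with (x - min) / (max - min). The following pure-Python sketch shows that per-column arithmetic on made-up counts. Mapping a constant column to 0 is an assumption of this sketch, made to avoid dividing by a zero range.

```python
def min_max_scale(rows):
    """Scale each column of a list of rows to [0, 1] independently."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return [
        # Constant columns (range 0) are mapped to 0.0 by assumption
        [(x - lo) / rng if rng else 0.0 for x, lo, rng in zip(row, mins, ranges)]
        for row in rows
    ]

# Made-up emotion counts: 3 texts x 3 emotion columns
counts = [
    [2.0, 0.0, 3.0],
    [0.0, 3.0, 3.0],
    [4.0, 1.0, 3.0],
]
scaled = min_max_scale(counts)
for row in scaled:
    print(row)
```

Each emotion column ends up in [0, 1] regardless of how often that emotion fires, which keeps the ten NRC columns on a scale comparable to the L2-normalized TF-IDF weights.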


### 6.4 Combine TF-IDF and NRC Features

In [11]:
# Combine TF-IDF and NRC features efficiently (keep TF-IDF sparse)
print("Combining TF-IDF and NRC features...")

from scipy.sparse import hstack

# Convert scaled NRC features (dense) to sparse for efficient concatenation
nrc_sparse = csr_matrix(X_nrc_features_scaled)

# Horizontally stack sparse TF-IDF with sparse NRC features
X_combined_sparse = hstack([X_tfidf, nrc_sparse], format='csr')

print(f"✓ Combined features shape: {X_combined_sparse.shape}")
print(f"  - TF-IDF: {X_tfidf.shape[1]} features")
print(f"  - NRC: {nrc_sparse.shape[1]} features")
print(f"  - Total: {X_combined_sparse.shape[1]} features")

Combining TF-IDF and NRC features...
✓ Combined features shape: (99590, 5010)
  - TF-IDF: 5000 features
  - NRC: 10 features
  - Total: 5010 features


## 7. Save Processed Data

Save all feature matrices and the target variable to the `processed/` folder for use in model training.

In [12]:
# Configuration
save_folder = 'processed'

# Create save folder if it doesn't exist
if not os.path.exists(save_folder):
    os.makedirs(save_folder)
    print(f"Created folder: {save_folder}/")

print("\nSaving processed data...")

# Save TF-IDF features (sparse matrix)
save_npz(os.path.join(save_folder, 'X_tfidf.npz'), X_tfidf)
print(f"✓ Saved: X_tfidf.npz ({X_tfidf.shape})")

# Save scaled NRC features (dense array)
np.save(os.path.join(save_folder, 'X_nrc_features_scaled.npy'), X_nrc_features_scaled)
print(f"✓ Saved: X_nrc_features_scaled.npy ({X_nrc_features_scaled.shape})")

# Save combined features (sparse matrix)
save_npz(os.path.join(save_folder, 'X_combined_sparse.npz'), X_combined_sparse)
print(f"✓ Saved: X_combined_sparse.npz ({X_combined_sparse.shape})")

# Save target variable
np.save(os.path.join(save_folder, 'y.npy'), y)
print(f"✓ Saved: y.npy ({y.shape})")

print("\n" + "=" * 60)
print("DATA PROCESSING COMPLETE!")
print("=" * 60)
print(f"\nAll files saved to: {save_folder}/")
print("\nGenerated files:")
print("  1. X_tfidf.npz              - TF-IDF features only (5000 features)")
print("  2. X_nrc_features_scaled.npy - NRCLex features only (10 features)")
print("  3. X_combined_sparse.npz    - Combined features (5010 features)")
print("  4. y.npy                    - Target labels")


Saving processed data...
✓ Saved: X_tfidf.npz ((99590, 5000))
✓ Saved: X_nrc_features_scaled.npy ((99590, 10))
✓ Saved: X_combined_sparse.npz ((99590, 5010))
✓ Saved: y.npy ((99590,))

DATA PROCESSING COMPLETE!

All files saved to: processed/

Generated files:
  1. X_tfidf.npz              - TF-IDF features only (5000 features)
  2. X_nrc_features_scaled.npy - NRCLex features only (10 features)
  3. X_combined_sparse.npz    - Combined features (5010 features)
  4. y.npy                    - Target labels
