## XGBoost

XGBoost_lexical.ipynb

This notebook examines how well lexical features, with and without TF-IDF vectors, 
predict depression symptom severity (PHQ-8 scores). It was developed as part of 
a study on multilingual lexical markers and depression severity.

- Loads and preprocesses multilingual linguistic data (e.g., English, Dutch, Spanish)
- Applies feature scaling and text vectorization (TF-IDF)
- Constructs XGBoost regression pipelines with and without feature selection
- Implements nested GroupKFold cross-validation
- Supports multiple feature combinations (lexical features only, features + TF-IDF, etc.)
- Evaluates performance using RMSE and R²
- Supports grid search for hyperparameter tuning

**Usage**:
- Place your input CSV file in the desired directory.
- Make sure the file includes a column named 'Text' containing the sentences, plus the 'PHQ8', 'participant_ID' and 'Recording_Date' columns that the code reads.

**Author**: Anastasiia Tokareva



### Models tested:
1. Baseline: full features, no TF-IDF
2. Full features + TF-IDF
3. Automatic Feature Selection: SelectKBest() + TSVD + TF-IDF
4. Stats-based features + TF-IDF

In [2]:
## Load libraries

# general
import pandas as pd
import numpy as np

# pre-processing
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer

# ML pipeline
from sklearn.metrics import make_scorer, mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import GroupKFold, cross_validate, GridSearchCV 
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD   # TSVD

# regressor
# !pip install xgboost
from xgboost import XGBRegressor  


In [3]:
## 1. Clean your data
data = pd.read_csv("C:/Users/your/file/name/here.csv")  
# drop every row that has a missing value ('NA') in any column
data_cleaned = data.dropna(axis=0) 

# Below are the changes that were made to rename LIWC-generated column names, which differed 
# between the languages. Absolutist word frequency could only be extracted for English, so this
# lexical feature was not included in the main ML models to enable comparison with Spanish
# and Dutch data. Please uncomment the lines that you would like to run for your project.

# a) English - remove the column with absolutist words 
# data_cleaned = data_cleaned.drop('allnone', axis=1)

# b) Dutch - rename some columns to match English LIWC column names
# data_cleaned = data_cleaned.rename(columns = {'posemo':'emo_pos', 'negemo':'emo_neg'})

# c) Spanish - rename some columns to match English LIWC column names
# data_cleaned = data_cleaned.rename(columns = {'EmoPos':'emo_pos', 'EmoNeg':'emo_neg', 'Yo':'i', 'Nosotro':'we', 'Pasado': 'focuspast'})


## 2. Binarise COVID data
data_cleaned['Recording_Date'] = pd.to_datetime(data_cleaned['Recording_Date'])

# define COVID lockdown start and end dates (dates based on Leightley et al. (2021), https://pubmed.ncbi.nlm.nih.gov/34488697/)
covid_start = pd.to_datetime('2020-03-23')
covid_end = pd.to_datetime('2021-05-11')

data_cleaned['COVID'] = ((data_cleaned['Recording_Date'] >= covid_start) & (data_cleaned['Recording_Date'] <= covid_end)).astype(int)

# COVID is now added as the last column (0/1)
data_cleaned.head(n=5)

Unnamed: 0,Site,participant_ID,Age,Gender,Education_Years,Height,Recording_Date,Task,PHQ8,ColumnID,...,AllPunc,Period,Comma,QMark,Exclam,Apostro,OtherP,Emoji,TTR,COVID
0,RADAR-MDD-KCL-s1,71a74929-ce52-494f-9d41-d08cbcf53707,56,1,14,165,2020-05-27,Unscripted,1,Text,...,20.0,12.0,8.0,0.0,0.0,0.0,0.0,0.0,73.333333,1
1,RADAR-MDD-KCL-s1,71a74929-ce52-494f-9d41-d08cbcf53707,56,1,14,165,2020-08-01,Unscripted,2,Text,...,29.91,7.25,16.62,0.3,0.0,5.74,0.0,0.0,39.534884,1
2,RADAR-MDD-KCL-s1,71a74929-ce52-494f-9d41-d08cbcf53707,56,1,14,165,2020-02-19,Unscripted,1,Text,...,27.75,8.61,16.75,0.0,0.0,2.39,0.0,0.0,47.565543,0
3,RADAR-MDD-KCL-s1,71a74929-ce52-494f-9d41-d08cbcf53707,56,1,14,165,2019-11-28,Unscripted,0,Text,...,23.77,7.62,13.9,0.0,0.0,2.24,0.0,0.0,45.289855,0
4,RADAR-MDD-KCL-s1,71a74929-ce52-494f-9d41-d08cbcf53707,56,1,14,165,2019-11-12,Unscripted,1,Text,...,18.07,7.23,5.42,0.0,0.0,5.42,0.0,0.0,48.979592,0
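
The lockdown flag computed above is an inclusive date-window test. As a minimal sketch on made-up dates (the `toy` frame is hypothetical; the two window dates are the ones used in the notebook), both boundary days count as lockdown:

```python
import pandas as pd

# Hypothetical recording dates, chosen to sit on and just outside the window edges
toy = pd.DataFrame({
    "Recording_Date": ["2020-03-22", "2020-03-23", "2021-05-11", "2021-05-12"]
})
toy["Recording_Date"] = pd.to_datetime(toy["Recording_Date"])

covid_start = pd.to_datetime("2020-03-23")
covid_end = pd.to_datetime("2021-05-11")

# Series.between is inclusive on both ends by default, matching the >= / <= test
toy["COVID"] = toy["Recording_Date"].between(covid_start, covid_end).astype(int)
covid_flags = toy["COVID"].tolist()
print(covid_flags)
```

Only the two dates inside the window (including its first and last day) receive a 1.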


#### 1. Baseline: lexical features only 
E.g., frequency of past-tense verbs, WPS, etc.

In [4]:
# Define column transformer (StandardScaler for numerical, and FunctionTransformer for raw features)
preprocessor_1 = ColumnTransformer([
    ('num_scaler', StandardScaler(), ['Age', 'Education_Years', 'allnone',
                                       'WC', 'WPS', 'i', 'emo_neg', 'emo_pos', 'Brunet', 'we',
                                       'emo_anx', 'emo_anger', 'emo_sad', 'focuspast', 'TTR']),  # scale numerical features
    ('num_raw', FunctionTransformer(lambda x: x, validate=False), ['Gender', 'COVID'])           # dummy variables unscaled
])

# Define the pipeline
pipeline_1 = Pipeline([
    ('preprocessing', preprocessor_1),
    ('regressor',  XGBRegressor(random_state = 42))  
])
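
The identity `FunctionTransformer` above leaves the dummy variables untouched. scikit-learn's `ColumnTransformer` also accepts the string `'passthrough'` for the same purpose, which avoids a lambda (lambdas cannot be pickled with the standard `pickle` module). A small sketch on hypothetical toy data, not the study data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two continuous columns and one 0/1 dummy
toy = pd.DataFrame({
    "Age": [20.0, 40.0, 60.0],
    "WPS": [10.0, 15.0, 20.0],
    "Gender": [0, 1, 1],
})

pre = ColumnTransformer([
    ("num_scaler", StandardScaler(), ["Age", "WPS"]),  # z-score the continuous columns
    ("num_raw", "passthrough", ["Gender"]),            # copy the dummy column unchanged
])

out = pre.fit_transform(toy)
scaled_means = out[:, :2].mean(axis=0)  # should be ~0 after scaling
scaled_stds = out[:, :2].std(axis=0)    # should be ~1 after scaling
gender_out = out[:, 2]                  # should still be 0/1
print(out)
```

The output columns follow the order of the transformer list: scaled columns first, then the untouched dummy.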


#### Grid search parameters

In [5]:
param_grid = {
    'regressor__max_depth': [3, 5, 7, 9],
    'regressor__learning_rate': [0.1, 0.01, 0.001],
    'regressor__subsample': [0.5, 0.7, 0.9],        # fraction of training rows sampled per boosting round; lower values can reduce overfitting
    'regressor__alpha': [0, 0.01, 0.1, 0.5, 1],     # L1 regularisation (alias of reg_alpha, default = 0)
    'regressor__lambda': [0, 0.01, 0.1, 0.5, 1]     # L2 regularisation (alias of reg_lambda, default = 1)
}
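
The size of this grid drives the run time of the nested search. The arithmetic can be checked with a short sketch; the fold counts of 5 are taken from the `GroupKFold(n_splits=5)` definitions later in the notebook:

```python
from math import prod

# Same grid as in the cell above
param_grid = {
    'regressor__max_depth': [3, 5, 7, 9],
    'regressor__learning_rate': [0.1, 0.01, 0.001],
    'regressor__subsample': [0.5, 0.7, 0.9],
    'regressor__alpha': [0, 0.01, 0.1, 0.5, 1],
    'regressor__lambda': [0, 0.01, 0.1, 0.5, 1],
}

n_inner_splits = 5  # inner GroupKFold
n_outer_splits = 5  # outer GroupKFold

# Every combination of values is one candidate
n_candidates = prod(len(values) for values in param_grid.values())
fits_per_outer_fold = n_candidates * n_inner_splits
total_inner_fits = fits_per_outer_fold * n_outer_splits

print(n_candidates, fits_per_outer_fold, total_inner_fits)
```

This gives 900 candidates and 4500 fits per outer fold, matching the "Fitting 5 folds for each of 900 candidates" lines in the outputs, and 22500 inner fits per model version (each outer fold additionally refits the best candidate once).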


#### 2. Baseline + TF-IDF
Lexical features + TF-IDF vectors

In [6]:
# Define column transformer (TF-IDF for text, StandardScaler for numerical)
preprocessor_2 = ColumnTransformer([
    ('tfidf', TfidfVectorizer(max_features=500), 'Text'),                                         # TF-IDF (no scaling)
    ('num_scaler', StandardScaler(), ['Age', 'Education_Years', 'allnone',
                                       'WC', 'WPS', 'i', 'emo_neg', 'emo_pos', 'Brunet', 'we',
                                       'emo_anx', 'emo_anger', 'emo_sad', 'focuspast', 'TTR']),   # scale numerical features
    ('num_raw', FunctionTransformer(lambda x: x, validate=False), ['Gender', 'COVID'])            # keeps dummy variables unscaled
])

# Define the pipeline
pipeline_2 = Pipeline([
    ('preprocessing', preprocessor_2),
    ('regressor', XGBRegressor(random_state = 42))
])
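
To illustrate what the TF-IDF step contributes, here is a minimal sketch on three made-up sentences (hypothetical text, with `max_features=3` instead of 500). It shows that `max_features` caps the number of columns and that, with the default settings, each row is already L2-normalised, which is why no extra scaling is applied:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences standing in for the 'Text' column
docs = [
    "i felt tired and sad today",
    "we went out and i felt good",
    "today we felt tired",
]

vectorizer = TfidfVectorizer(max_features=3)  # keep the 3 most frequent terms
tfidf = vectorizer.fit_transform(docs)

# Each non-empty row has unit Euclidean length (norm='l2' is the default)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

Note that the default token pattern only keeps tokens of two or more characters, so a word such as "i" never enters the TF-IDF vocabulary; the first-person pronoun is represented only through the separate lexical feature.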


#### 3. Automated + TF-IDF 
Selection of features based on their F-score using SelectKBest, followed by dimensionality reduction using TSVD to keep the top 100 components

In [7]:
# Define column transformer (TF-IDF for text, StandardScaler for numerical, and FunctionTransformer for raw features)
preprocessor_3 = ColumnTransformer([
    ('tfidf', TfidfVectorizer(max_features=500), 'Text'),  # TF-IDF (no scaling)
    ('num_scaler', StandardScaler(), ['Age', 'Education_Years', 'allnone',
                                       'WC', 'WPS', 'i', 'emo_neg', 'emo_pos', 'Brunet', 'we',
                                       'emo_anx', 'emo_anger', 'emo_sad', 'focuspast', 'TTR']),   # scale numerical features
    ('num_raw', FunctionTransformer(lambda x: x, validate=False), ['Gender', 'COVID'])            # keeps dummy variables unscaled
])


# Define the pipeline
pipeline_3 = Pipeline([
    ('preprocessing', preprocessor_3),
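    ('feature_selection', SelectKBest(f_regression, k=100)),  # keep the 100 features with the highest F-scores
    ('tsvd', TruncatedSVD(n_components=100)),                 # truncated SVD (not PCA: the data are not centred)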
    ('regressor', XGBRegressor(random_state = 42)) 
])
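
The two reduction steps can be tried in isolation. The sketch below uses synthetic data and much smaller, arbitrarily chosen sizes (`k=10`, `n_components=5`), so it only illustrates the mechanics, not the study's settings:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))  # hypothetical: 60 samples, 20 features
# Only the first two features carry signal
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=60)

reducer = Pipeline([
    ("feature_selection", SelectKBest(f_regression, k=10)),  # univariate F-test filter
    ("tsvd", TruncatedSVD(n_components=5, random_state=0)),  # project onto 5 components
])

X_reduced = reducer.fit_transform(X, y)
selected = reducer.named_steps["feature_selection"].get_support(indices=True)

print(X_reduced.shape)
print(selected)
```

In the notebook's pipeline both `k` and `n_components` are 100, so the SVD step re-expresses the 100 selected features as 100 components rather than lowering their number.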


#### 4. Stats-based (no i, no past) + TF-IDF
Removal of lexical features that were not strongly associated with PHQ-8 scores in linear mixed-effects (LME) modelling; see `lme` in this repo for more detail

In [8]:
# Define column transformer (TF-IDF for text, StandardScaler for numerical, and FunctionTransformer for raw features)

## In our project, different associations between lexical features and depression severity (PHQ-8 scores) were observed 
# for different languages. Please uncomment the section below that matches your language.

# a) English --> only the frequencies of past-tense verbs and first-person singular pronouns were NOT associated with
# PHQ-8 scores --> remove these features

# preprocessor_4 = ColumnTransformer([
#     ('tfidf', TfidfVectorizer(max_features=500), 'Text'),                # TF-IDF (no scaling, as vectors are already scaled)
#     ('num_scaler', StandardScaler(), ['Age', 'Education_Years',
#                                        'WC', 'WPS', 'emo_neg', 'emo_pos', 'Brunet', 'we', 'allnone',
#                                        'emo_anx', 'emo_anger', 'emo_sad','TTR']),       # scale numerical features
#     ('num_raw', FunctionTransformer(lambda x: x, validate=False), ['Gender', 'COVID'])  # keeps dummy variables unscaled
# ])

# b) Dutch --> only WPS and emo_pos were strongly associated with PHQ-8 scores --> remove the rest

# preprocessor_4 = ColumnTransformer([
#     ('tfidf', TfidfVectorizer(max_features=500), 'Text'),                # TF-IDF (no scaling, as vectors are already scaled)
#     ('num_scaler', StandardScaler(), ['Age', 'Education_Years',
#                                     'WPS', 'emo_pos']),                                 # scale numerical features
#     ('num_raw', FunctionTransformer(lambda x: x, validate=False), ['Gender', 'COVID'])  # keeps dummy variables unscaled
# ])

# c) Spanish - no lexical features were strongly associated with PHQ-8 scores --> remove all

# preprocessor_4 = ColumnTransformer([
#     ('tfidf', TfidfVectorizer(max_features=500), 'Text'),                # TF-IDF (no scaling, as vectors are already scaled)
#     ('num_scaler', StandardScaler(), ['Age', 'Education_Years']),                         # scale numerical features
#     ('num_raw', FunctionTransformer(lambda x: x, validate=False), ['Gender', 'COVID'])   # keeps dummy variables unscaled
# ])


# Define the pipeline (requires one of the preprocessor_4 definitions above to be uncommented)
pipeline_4 = Pipeline([
    ('preprocessing', preprocessor_4),
    ('regressor', XGBRegressor(random_state = 42)) 
])


### Custom RMSE

In [9]:
# Custom RMSE scorer
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

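# Define the scorers dictionary (used to report outer-loop performance).
# make_scorer(rmse) returns the plain, positive RMSE, so it suits reporting only:
# a search that maximises this scorer would favour the largest error.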
scorers = {
    'rmse': make_scorer(rmse),
    'r2': make_scorer(r2_score)
}
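
scikit-learn treats scorer values as "higher is better". The sketch below shows the difference between a reporting scorer and a selection scorer on toy data; `DummyRegressor` and the numbers are hypothetical stand-ins, and `greater_is_better=False` is shown as an option, not as something the notebook uses:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Toy data: the dummy model always predicts the training mean (3.0)
X = np.zeros((4, 1))
y = np.array([0.0, 2.0, 4.0, 6.0])
model = DummyRegressor(strategy="mean").fit(X, y)

report_scorer = make_scorer(rmse)                           # as in the notebook
select_scorer = make_scorer(rmse, greater_is_better=False)  # sign-flipped for searches

reported = report_scorer(model, X, y)       # plain RMSE, sqrt(5)
for_selection = select_scorer(model, X, y)  # negated RMSE

print(reported, for_selection)
```

The sign-flipped variant is the one that would make "maximise the score" equivalent to "minimise the error" if RMSE were used to pick hyperparameters.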

### Set up nested CV

In [10]:
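## Different X for different model versions (see the 4 versions above)
# Note: preprocessor_1 to preprocessor_3 also reference 'allnone', 'emo_anx', 'emo_anger'
# and 'emo_sad'. ColumnTransformer selects columns by name, so the column lists here and
# in the preprocessors need to match before the pipelines are fitted.

# 1) Version 1 (all lexical features)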
X_1 = data_cleaned[['Age', 'Education_Years', 'Gender', 'COVID',
             'WC', 'WPS','i','emo_neg', 'emo_pos', 'Brunet', 'we', 
             'focuspast','TTR']]    


# 2-3) Versions 2-3 (all lexical features + TF-IDF)
X_23 = data_cleaned[['Age', 'Education_Years', 'Gender', 'COVID',
             'Text', 'WC', 'WPS','i','emo_neg', 'emo_pos', 'Brunet', 'we',
             'focuspast','TTR']]    

# 4) Significant features only - differs between the languages based on LME results
# Please uncomment the section to run it for your project

# a) English
# X_4 = data_cleaned[['Age', 'Education_Years', 'Gender', 'COVID',
#              'Text', 'WC', 'WPS','emo_neg', 'emo_pos', 'Brunet', 'we', 'allnone',
#              'emo_anx','emo_anger','emo_sad','TTR']]  

# b) Dutch
# X_4 = data_cleaned[['Age', 'Education_Years', 'Gender', 'COVID',
#             'Text', 'WPS', 'emo_pos']]

# c) Spanish
# X_4 = data_cleaned[['Age', 'Education_Years', 'Gender', 'COVID', 'Text']]


# y = data_cleaned[['PHQ8']]
y = data_cleaned['PHQ8'].values.ravel()  # convert y to a 1D array
groups = data_cleaned['participant_ID']
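
Because `ColumnTransformer` looks columns up by name, it can help to compare the names a preprocessor expects with the names selected into `X` before starting a long grid search. The helper `missing_columns` below is hypothetical (it is not part of the notebook); the two lists are copied from `preprocessor_1` and `X_1`:

```python
def missing_columns(required, available):
    """Return the names in `required` that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]

# Columns referenced by preprocessor_1 (scaled columns, then unscaled dummies)
preprocessor_1_columns = [
    'Age', 'Education_Years', 'allnone',
    'WC', 'WPS', 'i', 'emo_neg', 'emo_pos', 'Brunet', 'we',
    'emo_anx', 'emo_anger', 'emo_sad', 'focuspast', 'TTR',
    'Gender', 'COVID',
]

# Columns selected into X_1
x_1_columns = [
    'Age', 'Education_Years', 'Gender', 'COVID',
    'WC', 'WPS', 'i', 'emo_neg', 'emo_pos', 'Brunet', 'we',
    'focuspast', 'TTR',
]

missing = missing_columns(preprocessor_1_columns, x_1_columns)
print(missing)
```

An empty list means the preprocessor and `X` agree; any names printed need to be added to `X` or dropped from the preprocessor.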

In [11]:
# Define the inner and outer cross-validation strategies
# (GroupKFold keeps all recordings of a participant in the same fold)
inner_cv = GroupKFold(n_splits=5)
outer_cv = GroupKFold(n_splits=5)
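
Grouping by participant matters here because each participant contributes several recordings. A minimal sketch with hypothetical participant labels shows that `GroupKFold` never places the same participant on both sides of a split:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical participant IDs, several recordings each
groups = np.array(["p1", "p1", "p1", "p2", "p2", "p3", "p3", "p4"])
X = np.arange(len(groups)).reshape(-1, 1)  # placeholder features
y = np.zeros(len(groups))                  # placeholder target

cv = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # Participants that appear in both the training and the test part
    shared = set(groups[train_idx]) & set(groups[test_idx])
    overlaps.append(shared)
    print("test participants:", sorted(set(groups[test_idx])))
```

Every entry of `overlaps` is empty, so test scores are always computed on participants the model has not seen during fitting.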

#### 1. Baseline

In [34]:
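# Inner Loop
# Note: no `scoring` is passed, so the grid search ranks candidates by the estimator's
# default score (R² for regressors); with a single metric, refit='rmse' only switches refitting on.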
Inner_Grid = GridSearchCV(pipeline_1,
                          param_grid,
                          verbose = 1,
                          cv=inner_cv,
                          refit='rmse',
                          return_train_score=True  
                         )

# Outer Loop
nested_results = cross_validate(Inner_Grid, X_1, y, 
                                cv=outer_cv,
                                groups=groups,
                                params={'groups': groups},  # pass group information to inner split 
                                scoring=scorers,
                                return_train_score=True)    # optionally return train scores

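# Note: the "Inner" figures are scores on the outer-loop training folds (return_train_score=True),
# not the inner grid-search CV scores
print(f"Average Inner RMSE: {np.mean(nested_results['train_rmse']):.2f}")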
print(f"Average Inner R²: {np.mean(nested_results['train_r2']):.2f}")
print(f"Average Outer RMSE: {np.mean(nested_results['test_rmse']):.2f}")
print(f"Average Outer R²: {np.mean(nested_results['test_r2']):.2f}")

Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Average Inner RMSE: 5.16
Average Inner R²: 0.21
Average Outer RMSE: 5.78
Average Outer R²: -0.00
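
The `cross_validate(GridSearchCV(...))` call above can be written out as an explicit loop, which makes the two levels easier to follow. This is a sketch under several assumptions: the data are synthetic, `Ridge` stands in for `XGBRegressor` to keep it fast, and the inner search is told to select by RMSE through `scoring="neg_root_mean_squared_error"`, which is one possible choice and not what the notebook's cells specify:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)
n_participants, n_recordings = 12, 5
groups = np.repeat(np.arange(n_participants), n_recordings)  # hypothetical participant IDs
X = rng.normal(size=(groups.size, 4))
y = X @ np.array([1.0, -0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=groups.size)

outer_cv = GroupKFold(n_splits=3)
inner_cv = GroupKFold(n_splits=3)
outer_rmse = []

for train_idx, test_idx in outer_cv.split(X, y, groups=groups):
    # Inner loop: tune on the outer-training part only, grouped by participant
    search = GridSearchCV(
        Ridge(),
        {"alpha": [0.1, 1.0, 10.0]},
        cv=inner_cv,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])

    # Outer loop: score the refitted best model on held-out participants
    y_pred = search.predict(X[test_idx])
    outer_rmse.append(float(np.sqrt(mean_squared_error(y[test_idx], y_pred))))

mean_outer_rmse = float(np.mean(outer_rmse))
print(f"Outer RMSE per fold: {np.round(outer_rmse, 2)}")
print(f"Average outer RMSE: {mean_outer_rmse:.2f}")
```

The held-out participants in each outer fold never influence the hyperparameter choice, which is what keeps the outer estimate honest.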


#### 2. Baseline + TF-IDF

In [36]:
# Inner Loop (no `scoring` given, so candidates are ranked by the default R² score)
Inner_Grid = GridSearchCV(pipeline_2,
                          param_grid,
                          verbose = 1,
                          cv=inner_cv,
                          refit='rmse',
                          return_train_score=True  
                         )

# Outer Loop
nested_results = cross_validate(Inner_Grid, X_23, y, 
                                cv=outer_cv,
                                groups=groups,
                                params={'groups': groups}, #Need to pass group information to inner split 
                                scoring=scorers,
                                return_train_score=True)  # Optionally return train scores

# "Inner" = scores on the outer-loop training folds (return_train_score=True), not inner CV scores
print(f"Average Inner RMSE: {np.mean(nested_results['train_rmse']):.2f}")
print(f"Average Inner R²: {np.mean(nested_results['train_r2']):.2f}")
print(f"Average Outer RMSE: {np.mean(nested_results['test_rmse']):.2f}")
print(f"Average Outer R²: {np.mean(nested_results['test_r2']):.2f}")

Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Average Inner RMSE: 5.16
Average Inner R²: 0.21
Average Outer RMSE: 5.75
Average Outer R²: 0.01


#### 3. Automated + TF-IDF

In [16]:
# Inner Loop (no `scoring` given, so candidates are ranked by the default R² score)
Inner_Grid = GridSearchCV(pipeline_3,
                          param_grid,
                          verbose = 1,
                          cv=inner_cv,
                          refit='rmse',
                          return_train_score=True  
                         )

# Outer Loop
nested_results = cross_validate(Inner_Grid, X_23, y, 
                                cv=outer_cv,
                                groups=groups,
                                params={'groups': groups}, #Need to pass group information to inner split 
                                scoring=scorers,
                                return_train_score=True)  # Optionally return train scores

# "Inner" = scores on the outer-loop training folds (return_train_score=True), not inner CV scores
print(f"Average Inner RMSE: {np.mean(nested_results['train_rmse']):.2f}")
print(f"Average Inner R²: {np.mean(nested_results['train_r2']):.2f}")
print(f"Average Outer RMSE: {np.mean(nested_results['test_rmse']):.2f}")
print(f"Average Outer R²: {np.mean(nested_results['test_r2']):.2f}")

Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Average Inner RMSE: 4.87
Average Inner R²: 0.29
Average Outer RMSE: 5.72
Average Outer R²: 0.02


#### 4. Stats-based + TF-IDF

In [12]:
# Inner Loop (no `scoring` given, so candidates are ranked by the default R² score)
Inner_Grid = GridSearchCV(pipeline_4,
                          param_grid,
                          verbose = 1,
                          cv=inner_cv,
                          refit='rmse',
                          return_train_score=True  
                         )

# Outer Loop
nested_results = cross_validate(Inner_Grid, X_4, y, 
                                cv=outer_cv,
                                groups=groups,
                                params={'groups': groups}, #Need to pass group information to inner split 
                                scoring=scorers,
                                return_train_score=True)  # Optionally return train scores

# "Inner" = scores on the outer-loop training folds (return_train_score=True), not inner CV scores
print(f"Average Inner RMSE: {np.mean(nested_results['train_rmse']):.2f}")
print(f"Average Inner R²: {np.mean(nested_results['train_r2']):.2f}")
print(f"Average Outer RMSE: {np.mean(nested_results['test_rmse']):.2f}")
print(f"Average Outer R²: {np.mean(nested_results['test_r2']):.2f}")

Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Fitting 5 folds for each of 900 candidates, totalling 4500 fits
Average Inner RMSE: 5.16
Average Inner R²: 0.21
Average Outer RMSE: 5.75
Average Outer R²: 0.01
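
When reading the outer scores, it helps to recall what R² = 0 means: the predictions are as accurate as a constant equal to the mean of the evaluated scores. A small worked example with hypothetical values (not study data):

```python
import numpy as np

y_true = np.array([0.0, 3.0, 5.0, 8.0, 14.0])  # hypothetical PHQ-8-like scores, mean 6.0
y_pred = np.full_like(y_true, y_true.mean())   # constant prediction at the mean

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2, rmse, np.std(y_true))
```

For this mean-only predictor R² is exactly 0 and the RMSE equals the standard deviation of the scores, so outer R² values near zero indicate little gain over such a baseline on held-out participants.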


In [22]:
import sys
print(sys.version)

3.10.9 | packaged by Anaconda, Inc. | (main, Mar  1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)]
