# Model - XGBoost Classifier

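This notebook follows the approach of the Nature paper and trains one binary XGBoost classifier per subreddit: for each subreddit, a post is labeled 1 if it belongs to that subreddit and 0 otherwise.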

> Modeling steps in this notebook
- Vectorize the pre-processed text with TF-IDF
- Split the data into training (80%) and testing (20%) sets
- Oversample the minority class of the target in the training data with SMOTE to handle class imbalance
- Train and score an XGBoost classifier

##### Import libraries

In [122]:
import pandas as pd
import numpy as np

# Train, test, split
from sklearn.model_selection import train_test_split

# TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import CountVectorizer 

# For handling imbalanced classes
from collections import Counter
from imblearn.over_sampling import SMOTE

# For classification
import xgboost as xgb
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV

##### Load data

In [123]:
df = pd.read_csv('../data/posts-preprocessed.csv')

In [124]:
df.head()

| Unnamed: 0 | author | subreddit | timeframe | text | datetime |
|---|---|---|---|---|---|
| 0 | sub21097 | bulimia | pre-covid | ['stop', 'hating', 'eating', 'disorders', 'sin... | 2017-12-02 16:36:16 |
| 1 | sub9597 | bulimia | pre-covid | ['got', 'upset', 'tonight', 'grandmother', 'as... | 2017-12-03 23:25:31 |
| 2 | sub12092 | bulimia | pre-covid | ['new', 'guy', '1', 'month', '16m', 'hi', 'guy... | 2017-12-05 19:45:25 |
| 3 | sub6555 | bulimia | pre-covid | ['vomited', 'blood', 'eat', 'throat', 'heals',... | 2017-12-06 16:58:16 |
| 4 | sub21097 | bulimia | pre-covid | ['girlfriends', 'noticing', 'male', '20', 'rel... | 2017-12-07 01:27:16 |


##### Binarize targets using get_dummies

Each subreddit except `mentalhealth` is used in turn as a binary target (one-vs-rest).

In [125]:
df = pd.get_dummies(df, columns=['subreddit'])

In [126]:
df.drop(columns='subreddit_mentalhealth', inplace = True)
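
To make the effect of the two cells above concrete, here is a minimal plain-Python sketch of the same one-vs-rest binarization. The helper name `one_vs_rest_targets` and the toy labels are assumptions made for illustration; the notebook itself relies on `pd.get_dummies`.

```python
def one_vs_rest_targets(labels, exclude=()):
    """Build one 0/1 indicator list per distinct label (hypothetical helper)."""
    classes = sorted(set(labels) - set(exclude))
    # One column per remaining class: 1 where the post belongs to it, else 0
    return {f"subreddit_{c}": [int(label == c) for label in labels]
            for c in classes}

# Toy labels, not taken from the dataset
labels = ["bulimia", "Anxiety", "mentalhealth", "bulimia"]
targets = one_vs_rest_targets(labels, exclude=["mentalhealth"])
print(targets)
```

Posts from the excluded subreddit stay in the data and simply receive 0 in every indicator column, which mirrors dropping `subreddit_mentalhealth` after `get_dummies`.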

## Vectorize using TFIDF
Use Term Frequency - Inverse Document Frequency (TF-IDF) to turn the pre-processed text of the subreddit posts into a numerical weight matrix. This matrix is the feature set for the predictive models.

In [127]:
# create the transform
vectorizer = TfidfVectorizer()

In [128]:
X = vectorizer.fit_transform(df['text'])

In [129]:
X.shape

(58257, 74640)
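
As a rough illustration of what the vectorizer computes, the sketch below re-implements TF-IDF in plain Python on a toy corpus. It assumes the scikit-learn defaults: a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name `tfidf` and the toy documents are made up for the example, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Toy corpus, not taken from the dataset
toy_docs = ["eat sleep eat", "sleep anxiety", "anxiety panic sleep"]
rows = tfidf(toy_docs)
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

In the third toy document, `panic` occurs in only one document and therefore gets a larger weight than `sleep`, which occurs in all three.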

## Subreddit: Anorexia Nervosa

In [130]:
y = df['subreddit_AnorexiaNervosa']

In [131]:
y.shape

(58257,)

### train test split

In [132]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [133]:
counter = Counter(y_train)
print(counter)

Counter({0: 41394, 1: 5211})


### Oversample the minority class using SMOTE

In [134]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [135]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({0: 41394, 1: 41394})
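
The counts above show that SMOTE added 41394 - 5211 = 36183 synthetic positive samples. Each synthetic sample is an interpolation between a minority sample and one of its nearest minority neighbours. The plain-Python sketch below illustrates that single step on dense toy vectors. The helper `smote_sample`, the value `k=2`, and the toy points are assumptions for the example; the `imblearn` implementation differs in its details (for instance, it works on the sparse TF-IDF matrix).

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point from a list of minority-class vectors."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)

    def dist2(a, b):
        # Squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # k nearest minority neighbours of the base point, excluding itself
    neighbours = sorted((p for p in minority if p is not base),
                        key=lambda p: dist2(base, p))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    synthetic = [b + lam * (n - b) for b, n in zip(base, neighbour)]
    return base, neighbour, synthetic

# Toy minority-class points, not taken from the dataset
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
base, neighbour, synthetic = smote_sample(minority, rng=random.Random(42))
print(base, neighbour, synthetic)
```

The synthetic point always lies on the line segment between the chosen sample and its neighbour, so no value falls outside the range already seen in the minority class.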


### XGBoost Model

Transform the re-sampled features to DMatrix format

In [136]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [137]:
param = {
    'eta': 0.3,                     # learning rate
    'max_depth': 3,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # return one probability per class
    'num_class': 3}                 # the binary target only uses labels 0 and 1

steps = 20  # The number of training iterations

In [138]:
model = xgb.train(param, D_train, steps)
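
Because the target is binary, a softmax over two classes carries the same information as a logistic model on the difference of the two class scores. The plain-Python check below illustrates this identity. The scores are made-up numbers, not outputs of the model above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

scores = [0.4, 1.7]  # hypothetical raw scores for class 0 and class 1
p_softmax = softmax(scores)[1]               # P(class 1) from a 2-class softmax
p_sigmoid = sigmoid(scores[1] - scores[0])   # P(class 1) from a logistic model
print(p_softmax, p_sigmoid)
```

In XGBoost terms, the logistic form corresponds to `'objective': 'binary:logistic'`, which returns a single probability per row. With that objective the `argmax` step would become a threshold such as `preds > 0.5`.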



#### XGBoost Model Scores

In [139]:
preds = model.predict(D_test)
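# Each row of preds holds the class probabilities for one post; keep the most probable class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary' with pos_label=1,
# so both describe the target subreddit only; average='macro' would average over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))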
print("F1-Score: Non-anorexia = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Anorexia = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.5309246785058175
Recall = 0.6870047543581617
F1-Score: Non-anorexia = 0.9431133323533736
F1-Score: Anorexia = 0.5989637305699482
Accuracy = 0.9003604531410917
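
To clarify how these scores relate to each other, the sketch below computes precision, recall, and F1 from scratch for a chosen positive label, and then a macro-averaged precision. The helper `prf` and the toy labels are assumptions for the example. The last lines plug the precision and recall reported above into the F1 formula.

```python
def prf(y_true, y_pred, pos_label=1):
    """Precision, recall, and F1 for one positive label (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels, not taken from the dataset
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

p1, r1, f1_pos = prf(y_true, y_pred, pos_label=1)  # what average='binary' reports
p0, r0, f1_neg = prf(y_true, y_pred, pos_label=0)
macro_precision = (p0 + p1) / 2                    # what average='macro' reports
print(p1, r1, f1_pos, macro_precision)

# F1 recomputed from the precision and recall printed for Anorexia Nervosa
reported_p, reported_r = 0.5309246785058175, 0.6870047543581617
reported_f1 = 2 * reported_p * reported_r / (reported_p + reported_r)
print(reported_f1)
```

The recomputed value agrees with the `F1-Score: Anorexia` line, which confirms that the unlabeled precision and recall refer to the positive class. The high accuracy mostly reflects the majority class, so the positive-class F1 is the more informative number here.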


#### Grid Search

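A grid search over the hyperparameters below is a possible next step, but it is expensive: the grid has 6 × 8 × 4 × 5 × 4 = 3,840 combinations, which means 11,520 model fits with 3-fold cross-validation. It may not be worth the time; one option is to run it overnight and inspect the selected parameters.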

clf = xgb.XGBClassifier()
parameters = {
     "eta"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
     "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
     "min_child_weight" : [ 1, 3, 5, 7 ],
     "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
     "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
     }

grid = GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(X_train_sm, y_train_sm)
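
Before launching the search, it helps to estimate its cost. The sketch below counts the parameter combinations in the grid above and the number of model fits implied by `cv=3`. It only does the arithmetic and trains nothing.

```python
# Same grid as above, repeated so the snippet is self-contained
parameters = {
    "eta": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# Number of combinations is the product of the list lengths
n_combinations = 1
for values in parameters.values():
    n_combinations *= len(values)

cv_folds = 3
n_fits = n_combinations * cv_folds  # one fit per combination per fold
print(n_combinations, n_fits)
```

`GridSearchCV` also refits the best combination once on the full training data by default, so the total is one fit higher. A smaller grid or a randomized search would cut this cost considerably.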

## Subreddit: Anxiety

In [46]:
y = df['subreddit_Anxiety']

In [47]:
y.shape

(58257,)

In [48]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

In [49]:
# summarize class distribution
counter = Counter(y_train)
print(counter)

Counter({0: 41468, 1: 5137})


In [50]:
# Oversample the minority class using SMOTE
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

In [51]:
# summarize class distribution
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({1: 41468, 0: 41468})


In [52]:
# Transform to DMatrix format
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

In [53]:
# Define parameters for the XGBoost model
param = {
    'eta': 0.3,                     # learning rate
    'max_depth': 3,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # return one probability per class
    'num_class': 3}                 # the binary target only uses labels 0 and 1

steps = 20  # The number of training iterations

In [54]:
model = xgb.train(param, D_train, steps)



In [55]:
preds = model.predict(D_test)
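# Each row of preds holds the class probabilities for one post; keep the most probable class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary' with pos_label=1,
# so both describe the target subreddit only; average='macro' would average over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))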
print("F1-Score: Non-anxiety = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Anxiety = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.4195583596214511
Recall = 0.39820359281437123
F1-Score: Non-anxiety = 0.9256038647342995
F1-Score: Anxiety = 0.4086021505376343
Accuracy = 0.867833848266392


## Subreddit: Autism

In [57]:
y = df['subreddit_autism']

In [58]:
y.shape

(58257,)

### train test split

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [60]:
counter = Counter(y_train)
print(counter)

Counter({0: 41468, 1: 5137})


### Oversample the minority class using SMOTE

In [61]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [62]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({0: 41468, 1: 41468})


### XGBoost Model

Transform the re-sampled features to DMatrix format

In [63]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [64]:
param = {
    'eta': 0.3,                     # learning rate
    'max_depth': 3,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # return one probability per class
    'num_class': 3}                 # the binary target only uses labels 0 and 1

steps = 20  # The number of training iterations

In [65]:
model = xgb.train(param, D_train, steps)



#### XGBoost Model Scores

In [66]:
preds = model.predict(D_test)
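# Each row of preds holds the class probabilities for one post; keep the most probable class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary' with pos_label=1,
# so both describe the target subreddit only; average='macro' would average over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))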
print("F1-Score: Non-autism = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Autism = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.2891421715656869
Recall = 0.7215568862275449
F1-Score: Non-autism = 0.8528496297091338
F1-Score: Autism = 0.41284796573875804
Accuracy = 0.7646755921730175


## Subreddit: BPD

In [68]:
y = df['subreddit_BPD']

In [69]:
y.shape

(58257,)

### train test split

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [71]:
counter = Counter(y_train)
print(counter)

Counter({0: 41432, 1: 5173})


### Oversample the minority class using SMOTE

In [72]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [73]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({0: 41432, 1: 41432})


### XGBoost Model

Transform the re-sampled features to DMatrix format

In [74]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [75]:
param = {
    'eta': 0.3,                     # learning rate
    'max_depth': 3,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # return one probability per class
    'num_class': 3}                 # the binary target only uses labels 0 and 1

steps = 20  # The number of training iterations

In [76]:
model = xgb.train(param, D_train, steps)



#### XGBoost Model Scores

In [77]:
preds = model.predict(D_test)
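# Each row of preds holds the class probabilities for one post; keep the most probable class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary' with pos_label=1,
# so both describe the target subreddit only; average='macro' would average over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))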
print("F1-Score: Non-BPD = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: BPD = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.665080875356803
Recall = 0.5376923076923077
F1-Score: Non-BPD = 0.9545172528993462
F1-Score: BPD = 0.5946405784772438
Accuracy = 0.9182114658427738


## Subreddit: Bipolar

In [80]:
y = df['subreddit_bipolar']

In [81]:
y.shape

(58257,)

### train test split

In [82]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [83]:
counter = Counter(y_train)
print(counter)

Counter({0: 41412, 1: 5193})


### Oversample the minority class using SMOTE

In [84]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [85]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({0: 41412, 1: 41412})


### XGBoost Model

Transform the re-sampled features to DMatrix format

In [86]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [87]:
param = {
    'eta': 0.3,                     # learning rate
    'max_depth': 3,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # return one probability per class
    'num_class': 3}                 # the binary target only uses labels 0 and 1

steps = 20  # The number of training iterations

In [88]:
model = xgb.train(param, D_train, steps)



#### XGBoost Model Scores

In [89]:
preds = model.predict(D_test)
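# Each row of preds holds the class probabilities for one post; keep the most probable class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary' with pos_label=1,
# so both describe the target subreddit only; average='macro' would average over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))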
print("F1-Score: Non-bipolar = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Bipolar = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.2629617078956286
Recall = 0.60625
F1-Score: Non-bipolar = 0.8595396633985215
F1-Score: Bipolar = 0.36681635547151975
Accuracy = 0.7700823892893924


## Subreddit: Bulimia

In [90]:
y = df['subreddit_bulimia']

In [91]:
y.shape

(58257,)

### train test split

In [92]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [93]:
counter = Counter(y_train)
print(counter)

Counter({0: 41427, 1: 5178})


### Oversample the minority class using SMOTE

In [94]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [95]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({0: 41427, 1: 41427})


### XGBoost Model

Transform the re-sampled features to DMatrix format

In [96]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [97]:
param = {
    'eta': 0.3,                     # learning rate
    'max_depth': 3,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # return one probability per class
    'num_class': 3}                 # the binary target only uses labels 0 and 1

steps = 20  # The number of training iterations

In [98]:
model = xgb.train(param, D_train, steps)



#### XGBoost Model Scores

In [99]:
preds = model.predict(D_test)
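# Each row of preds holds the class probabilities for one post; keep the most probable class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary' with pos_label=1,
# so both describe the target subreddit only; average='macro' would average over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))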
print("F1-Score: Non-bulimia = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Bulimia = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.7107641741988496
Recall = 0.667953667953668
F1-Score: Non-bulimia = 0.9623893805309734
F1-Score: Bulimia = 0.6886942675159237
Accuracy = 0.9328870580157913


## Subreddit: Depression

In [102]:
y = df['subreddit_depression']

In [103]:
y.shape

(58257,)

### train test split

In [104]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [105]:
counter = Counter(y_train)
print(counter)

Counter({0: 41409, 1: 5196})


### Oversample the minority class using SMOTE

In [106]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [107]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({0: 41409, 1: 41409})


### XGBoost Model

Transform the re-sampled features to DMatrix format

In [108]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [109]:
param = {
    'eta': 0.3,                     # learning rate
    'max_depth': 3,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # return one probability per class
    'num_class': 3}                 # the binary target only uses labels 0 and 1

steps = 20  # The number of training iterations

In [110]:
model = xgb.train(param, D_train, steps)



#### XGBoost Model Scores

In [111]:
preds = model.predict(D_test)
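# Each row of preds holds the class probabilities for one post; keep the most probable class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary' with pos_label=1,
# so both describe the target subreddit only; average='macro' would average over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))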
print("F1-Score: Non-depression = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Depression = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.2807906741003548
Recall = 0.4338292873923258
F1-Score: Non-depression = 0.8931883913433728
F1-Score: Depression = 0.34092307692307694
Accuracy = 0.8161688980432544


## Subreddit: Schizophrenia

In [112]:
y = df['subreddit_schizophrenia']

In [113]:
y.shape

(58257,)

### train test split

In [114]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [115]:
counter = Counter(y_train)
print(counter)

Counter({0: 41399, 1: 5206})


### Oversample the minority class using SMOTE

In [116]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [117]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({0: 41399, 1: 41399})


### XGBoost Model

Transform the re-sampled features to DMatrix format

In [118]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [119]:
param = {
    'eta': 0.3,                     # learning rate
    'max_depth': 3,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # return one probability per class
    'num_class': 3}                 # the binary target only uses labels 0 and 1

steps = 20  # The number of training iterations

In [120]:
model = xgb.train(param, D_train, steps)



#### XGBoost Model Scores

In [121]:
preds = model.predict(D_test)
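# Each row of preds holds the class probabilities for one post; keep the most probable class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary' with pos_label=1,
# so both describe the target subreddit only; average='macro' would average over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))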
print("F1-Score: Non-schizophrenia = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Schizophrenia = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.2554126137433323
Recall = 0.6424625098658248
F1-Score: Non-schizophrenia = 0.850079575596817
F1-Score: Schizophrenia = 0.3655141445891334
Accuracy = 0.7574665293511843
