# Model - XGBoost Classifier

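Following the Nature paper, this notebook classifies Reddit posts by subreddit. Each subreddit is modeled as its own binary target (posts from that subreddit versus all other posts).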

> Modeling steps in this notebook
- Vectorize the post text with TF-IDF
- Split the data into training (80%) and test (20%) sets
- Handle the imbalanced target by oversampling the minority class with SMOTE, on the training data only
- Fit an XGBoost classifier

##### Import libraries

In [122]:
import pandas as pd
import numpy as np

# Train, test, split
from sklearn.model_selection import train_test_split

# TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import CountVectorizer 

# For handling imbalanced classes
from collections import Counter
from imblearn.over_sampling import SMOTE

# For classification
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV

##### Load data

In [49]:
df = pd.read_csv('../data/posts-preprocessed.csv')

In [50]:
df.head()

| Unnamed: 0 | author | created_utc | subreddit | text | timeframe | words | word_stems |
|---|---|---|---|---|---|---|---|
| 0 | sub30605 | 1499390694 | bulimia | `['chest pains anyone else experience chest pa...` | pre-covid | `['[', "'chest", 'pains', 'anyone', 'else', 'ex...` | `['[', "'chest", 'pains', 'anyone', 'else', 'ex...` |
| 1 | sub27274 | 1499060654 | bulimia | `['dying to eat eating to die study on shifting...` | pre-covid | `['[', "'dying", 'eat', 'eating', 'die', 'study...` | `['[', "'dying", 'eat', 'eating', 'die', 'study...` |
| 2 | sub6055 | 1499029087 | bulimia | `['without purging what is the quickest way to...` | pre-covid | `['[', "'without", 'purging', 'quickest', 'way'...` | `['[', "'without", 'purging', 'quickest', 'way'...` |
| 3 | sub40365 | 1498978259 | bulimia | `['bulimia and melancholy feelings i havent pu...` | pre-covid | `['[', "'bulimia", 'melancholy', 'feelings', 'h...` | `['[', "'bulimia", 'melancholy', 'feelings', 'h...` |
| 4 | sub49857 | 1498814187 | bulimia | `['im relapsing fuck im so upset at myself rig...` | pre-covid | `['[', "'im", 'relapsing', 'fuck', 'im', 'upset...` | `['[', "'im", 'relapsing', 'fuck', 'im', 'upset...` |


##### Binarize targets using get_dummies

Each subreddit except `mentalhealth` is used as a separate binary target.

In [51]:
df = pd.get_dummies(df, columns=['subreddit'])

In [52]:
df.drop(columns='subreddit_mentalhealth', inplace = True)
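
As a minimal sketch of what the two cells above do, the toy frame below (hypothetical data, not rows from `posts-preprocessed.csv`) is one-hot encoded and the `mentalhealth` indicator is dropped. Each remaining `subreddit_*` column then acts as a binary one-vs-rest target, and `mentalhealth` posts are 0 in every column.

```python
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only
toy = pd.DataFrame({
    "text": ["post a", "post b", "post c", "post d"],
    "subreddit": ["bulimia", "Anxiety", "mentalhealth", "bulimia"],
})

# One indicator column per subreddit, named subreddit_<name>
encoded = pd.get_dummies(toy, columns=["subreddit"])

# Drop the mentalhealth indicator so it is never used as a target
encoded = encoded.drop(columns="subreddit_mentalhealth")

target_cols = [c for c in encoded.columns if c.startswith("subreddit_")]
targets = encoded[target_cols].astype(int)
print(target_cols)
print(targets)
```

The row for the `mentalhealth` post stays in the data as a negative example for every target.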

## Vectorize using TFIDF
Use Term Frequency - Inverse Document Frequency (TF-IDF) to convert the pre-processed text of the subreddit posts into a numerical weight matrix. This matrix is the feature set for the predictive models.
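
To make the weight matrix concrete, here is a small sketch on a made-up three-post corpus (the posts are hypothetical; the real input is `df['text']`). It shows the shape of the matrix, the column order, and that a term appearing in fewer posts gets a larger weight than an equally frequent term that appears in more posts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the post texts
corpus = [
    "chest pains anyone",
    "anxiety and chest pains",
    "anxiety again",
]

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(corpus)  # sparse matrix: one row per post, one column per term

vocab = sorted(vectorizer.vocabulary_)    # terms in column order (columns are sorted alphabetically)
print(X_toy.shape)
print(vocab)

# In the first post, "anyone" occurs in one post overall and "chest" in two,
# so "anyone" receives the larger TF-IDF weight
w_anyone = X_toy[0, vectorizer.vocabulary_["anyone"]]
w_chest = X_toy[0, vectorizer.vocabulary_["chest"]]
print(w_anyone, w_chest)

# Each row is L2-normalised by default
row_norms = np.asarray(np.sqrt(X_toy.multiply(X_toy).sum(axis=1))).ravel()
print(row_norms)
```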

In [53]:
# create the transform
vectorizer = TfidfVectorizer()

In [112]:
X = vectorizer.fit_transform(df['text'])

In [113]:
X.shape

(85338, 99061)

## Subreddit: Anorexia Nervosa

In [114]:
y = df['subreddit_AnorexiaNervosa']

In [115]:
y.shape

(85338,)

### train test split

In [116]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

### Oversample the minority class using SMOTE

In [117]:
counter = Counter(y_train)
print(counter)

Counter({0: 60478, 1: 7792})


In [118]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [119]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({0: 60478, 1: 60478})
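
For intuition about where the extra minority rows come from: SMOTE creates each synthetic sample by interpolating between a minority sample and one of its nearest minority neighbours. The function below is a simplified illustration of that idea only; `smote_like` is a hypothetical helper, not the `imblearn` implementation, and it runs on random toy points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=3, seed=42):
    """Simplified SMOTE-style interpolation between minority samples (illustration only)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point's nearest neighbour is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random minority samples
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one of each sample's k neighbours
    gap = rng.random((n_new, 1))                            # position along the connecting segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

X_min = np.random.default_rng(0).normal(size=(10, 2))  # toy minority class
synthetic = smote_like(X_min, n_new=5)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, oversampling adds no information beyond the existing minority samples. It only rebalances the class counts seen by the model.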


### XGBoost Model

Convert the resampled features to `DMatrix` format

In [120]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

#### Grid Search

clf = xgb.XGBClassifier()

# Candidate hyperparameters: 4 x 5 x 3 = 60 combinations, each fitted 3 times with cv=3
parameters = {
    "learning_rate": [0.05, 0.1, 0.2, 0.3],  # scikit-learn API name for eta
    "max_depth": [1, 3, 5, 7, 11],
    "gamma": [0.0, 0.2, 0.4],
}

grid = GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(X_train_sm, y_train_sm)

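Fit an `XGBClassifier` with its default parameters.

On `num_class`: it sets the number of classes for the `multi:softmax` and `multi:softprob` objectives, and it is not needed for a binary target trained with `binary:logistic`. A single multi-class model over all subreddits is possible: encode the subreddit labels as integers 0 to K-1 and set `num_class` to K.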

In [123]:
model = XGBClassifier()

In [124]:
model.fit(X_train_sm, y_train_sm)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [126]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = y_pred  # predict() already returns class labels (0 or 1), so no rounding is needed

In [127]:
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 94.15%


In [None]:
# TODO: check whether the model overfits (compare train and test accuracy) and set a random seed

In [109]:
param = {
    'max_depth': 6, # default is 6
    'objective': 'multi:softmax', # multi-class objective; predict() returns the predicted class label for each row
    'num_class': 2} 

steps = 5  # The number of training iterations

In [110]:
model = xgb.train(param, D_train, steps)



#### XGBoost Model Scores

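The scores recorded below are consistent with every test post being predicted as class 0: precision, recall, and the positive-class F1 are all 0, and accuracy equals the share of negative posts in the test set. Applying `np.argmax` to each element of a 1-D prediction array, which is what `multi:softmax` returns, always gives 0. The decoding in this cell handles both 1-D and 2-D output.

In [111]:
preds = model.predict(D_test)
# multi:softprob returns an (n_samples, num_class) probability matrix: take the most probable class per row.
# Other objectives return a 1-D array of labels or probabilities: threshold it at 0.5.
best_preds = preds.argmax(axis=1) if preds.ndim == 2 else (preds > 0.5).astype(int)

# precision_score and recall_score default to average='binary' and score the positive class (label 1) only;
# average='macro' would instead report the unweighted mean over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-anorexia = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Anorexia = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))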

Precision = 0.0
Recall = 0.0
F1-Score: Non-anorexia = 0.9402378069603551
F1-Score: Anorexia = 0.0
Accuracy = 0.8872158425123037


  _warn_prf(average, modifier, msg_start, len(result))
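
The zero precision and recall above, and the effect of the `average` argument, can be reproduced on a toy example. This is a sketch with made-up labels; `y_true` and `preds_1d` are hypothetical and only mimic a 1-D array of predicted labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

y_true = np.array([0, 0, 0, 0, 1, 1])

# 1-D output in the style of multi:softmax: one predicted label per row
preds_1d = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# np.argmax of a single number is always 0, so this decoding predicts class 0 everywhere
all_zero = np.asarray([np.argmax(value) for value in preds_1d])

# Using the labels directly keeps the positive predictions
decoded = preds_1d.astype(int)

# With all-zero predictions the positive class is never predicted
p_all_zero = precision_score(y_true, all_zero, zero_division=0)
acc_all_zero = accuracy_score(y_true, all_zero)          # share of negatives in y_true
f1_neg_all_zero = f1_score(y_true, all_zero, pos_label=0)

# average='binary' (the default) scores label 1 only; 'macro' averages over both labels
p_binary = precision_score(y_true, decoded)
p_macro = precision_score(y_true, decoded, average="macro")

print(all_zero, decoded)
print(p_all_zero, acc_all_zero, f1_neg_all_zero)
print(p_binary, p_macro)
```

High accuracy together with zero positive-class precision and recall is therefore a sign that the classifier, or the decoding step, never predicts the positive class.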


## Subreddit: Anxiety

In [67]:
y = df['subreddit_Anxiety']

In [68]:
y.shape

(85338,)

In [69]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

In [70]:
# summarize class distribution
counter = Counter(y_train)
print(counter)

Counter({0: 60721, 1: 7549})


In [71]:
# Oversample the minority class using SMOTE
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

In [72]:
# summarize class distribution
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({0: 60721, 1: 60721})


In [73]:
# Transform to DMatrix format
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

In [None]:
# Define Parameters for XGBoost Model
param = {
    'eta': 0.3, 
    'max_depth': 3,  
    'objective': 'multi:softprob',  
    'num_class': 2} 

steps = 20  # The number of training iterations

In [None]:
model = xgb.train(param, D_train, steps)

In [None]:
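preds = model.predict(D_test)
# multi:softprob returns an (n_samples, num_class) probability matrix: take the most probable class per row.
# Other objectives return a 1-D array of labels or probabilities: threshold it at 0.5.
best_preds = preds.argmax(axis=1) if preds.ndim == 2 else (preds > 0.5).astype(int)

# precision_score and recall_score default to average='binary' and score the positive class (label 1) only;
# average='macro' would instead report the unweighted mean over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-anxiety = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Anxiety = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))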

## Subreddit: Autism

In [None]:
y = df['subreddit_autism']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Convert the resampled features to `DMatrix` format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define Parameters for XGBoost Model

In [74]:
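param = {
    'eta': 0.3,
    'max_depth': 3,
    'objective': 'multi:softprob',  # state the objective explicitly
    'num_class': 2}  # the target is binary (0/1), so two classes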

steps = 20  # The number of training iterations

In [75]:
model = xgb.train(param, D_train, steps)



#### XGBoost Model Scores

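The scores recorded below are consistent with every test post being predicted as class 0: precision, recall, and the positive-class F1 are all 0, and accuracy equals the share of negative posts in the test set. Applying `np.argmax` to each element of a 1-D prediction array always gives 0; the decoding in this cell handles both 1-D and 2-D output. The execution counters (In [74] to In [76]) also follow directly after the Anxiety cells while the Autism preparation cells above were not run, so these scores may have been computed on the Anxiety target.

In [76]:
preds = model.predict(D_test)
# multi:softprob returns an (n_samples, num_class) probability matrix: take the most probable class per row.
# Other objectives return a 1-D array of labels or probabilities: threshold it at 0.5.
best_preds = preds.argmax(axis=1) if preds.ndim == 2 else (preds > 0.5).astype(int)

# precision_score and recall_score default to average='binary' and score the positive class (label 1) only;
# average='macro' would instead report the unweighted mean over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-autism = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Autism = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))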

Precision = 0.0
Recall = 0.0
F1-Score: Non-autism = 0.9408625504188644
F1-Score: Autism = 0.0
Accuracy = 0.8883290367940004


  _warn_prf(average, modifier, msg_start, len(result))


## Subreddit: BPD

In [77]:
y = df['subreddit_BPD']

In [78]:
y.shape

(85338,)

### train test split

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [80]:
counter = Counter(y_train)
print(counter)

Counter({0: 60825, 1: 7445})


### Oversample the minority class using SMOTE

In [81]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [82]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

Counter({1: 60825, 0: 60825})


### XGBoost Model

Convert the resampled features to `DMatrix` format

In [83]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define Parameters for XGBoost Model

In [87]:
param = {
    'eta': 0.3, 
    'max_depth': 3,  
    'objective': 'binary:logistic'}  # one probability per row; do not set 'num_class' with this objective

steps = 10  # The number of training iterations

In [88]:
model = xgb.train(param, D_train, steps)

Note: setting `'num_class': 2` together with `binary:logistic` makes XGBoost produce two predictions per row (243300 = 2 × 121650) while only one label per row is provided, and training fails with:

XGBoostError: [08:17:02] /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:57: Check failed: preds.Size() == info.labels_.Size() (243300 vs. 121650) :  labels are not correctly providedpreds.size=243300, label.size=121650, Loss: binary:logistic



#### XGBoost Model Scores

In [86]:
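preds = model.predict(D_test)
# multi:softprob returns an (n_samples, num_class) probability matrix: take the most probable class per row.
# binary:logistic returns a 1-D array of probabilities: threshold it at 0.5.
best_preds = preds.argmax(axis=1) if preds.ndim == 2 else (preds > 0.5).astype(int)

# precision_score and recall_score default to average='binary' and score the positive class (label 1) only;
# average='macro' would instead report the unweighted mean over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-BPD = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: BPD = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))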

Precision = 0.7504012841091493
Recall = 0.4931434599156118
F1-Score: Non-BPD = 0.9589597986707106
F1-Score: BPD = 0.5951623169955442
Accuracy = 0.9254745722990392


## Subreddit: Bipolar

In [None]:
y = df['subreddit_bipolar']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Convert the resampled features to `DMatrix` format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define Parameters for XGBoost Model

In [None]:
param = {
    'eta': 0.3, 
    'max_depth': 3,  
    'objective': 'multi:softprob',  
    'num_class': 2}  # the target is binary (0/1), so two classes

steps = 20  # The number of training iterations

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
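preds = model.predict(D_test)
# multi:softprob returns an (n_samples, num_class) probability matrix: take the most probable class per row.
# Other objectives return a 1-D array of labels or probabilities: threshold it at 0.5.
best_preds = preds.argmax(axis=1) if preds.ndim == 2 else (preds > 0.5).astype(int)

# precision_score and recall_score default to average='binary' and score the positive class (label 1) only;
# average='macro' would instead report the unweighted mean over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-bipolar = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Bipolar = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))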

## Subreddit: Bulimia

In [None]:
y = df['subreddit_bulimia']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Convert the resampled features to `DMatrix` format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define Parameters for XGBoost Model

In [None]:
param = {
    'eta': 0.3, 
    'max_depth': 3,  
    'objective': 'multi:softprob',  
    'num_class': 2}  # the target is binary (0/1), so two classes

steps = 20  # The number of training iterations

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
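preds = model.predict(D_test)
# multi:softprob returns an (n_samples, num_class) probability matrix: take the most probable class per row.
# Other objectives return a 1-D array of labels or probabilities: threshold it at 0.5.
best_preds = preds.argmax(axis=1) if preds.ndim == 2 else (preds > 0.5).astype(int)

# precision_score and recall_score default to average='binary' and score the positive class (label 1) only;
# average='macro' would instead report the unweighted mean over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-bulimia = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Bulimia = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))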

## Subreddit: Depression

In [None]:
y = df['subreddit_depression']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Convert the resampled features to `DMatrix` format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define Parameters for XGBoost Model

In [None]:
param = {
    'eta': 0.3, 
    'max_depth': 3,  
    'objective': 'multi:softprob',  
    'num_class': 2}  # the target is binary (0/1), so two classes

steps = 20  # The number of training iterations

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
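preds = model.predict(D_test)
# multi:softprob returns an (n_samples, num_class) probability matrix: take the most probable class per row.
# Other objectives return a 1-D array of labels or probabilities: threshold it at 0.5.
best_preds = preds.argmax(axis=1) if preds.ndim == 2 else (preds > 0.5).astype(int)

# precision_score and recall_score default to average='binary' and score the positive class (label 1) only;
# average='macro' would instead report the unweighted mean over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-depression = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Depression = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))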

## Subreddit: Schizophrenia

In [None]:
y = df['subreddit_schizophrenia']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Convert the resampled features to `DMatrix` format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define Parameters for XGBoost Model

In [None]:
param = {
    'eta': 0.3, 
    'max_depth': 3,  
    'objective': 'multi:softprob',  
    'num_class': 2}  # the target is binary (0/1), so two classes

steps = 20  # The number of training iterations

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
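preds = model.predict(D_test)
# multi:softprob returns an (n_samples, num_class) probability matrix: take the most probable class per row.
# Other objectives return a 1-D array of labels or probabilities: threshold it at 0.5.
best_preds = preds.argmax(axis=1) if preds.ndim == 2 else (preds > 0.5).astype(int)

# precision_score and recall_score default to average='binary' and score the positive class (label 1) only;
# average='macro' would instead report the unweighted mean over both classes
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-schizophrenia = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Schizophrenia = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))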

# Try XGBoost with all classes in model