# Model - XGBoost Classifier

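Following the approach of the Nature paper, this notebook trains a separate classifier for each subreddit: every model predicts whether a post belongs to that subreddit or not.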

> Modeling steps in this notebook
- Vectorize the text with TF-IDF
- Split the data into training (80%) and testing (20%) sets
- Handle the class imbalance in the target variable by applying the SMOTE algorithm to the training data only
- Fit an XGBoost classifier

##### Import libraries

In [1]:
import pandas as pd
import numpy as np

# Train, test, split
from sklearn.model_selection import train_test_split

# TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import CountVectorizer 

# For handling imbalanced classes
from collections import Counter
from imblearn.over_sampling import SMOTE

# For classification
import xgboost as xgb
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV

##### Load data

In [2]:
df = pd.read_csv('../data/posts-preprocessed.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,author,subreddit,timeframe,text,text_length,text_word_count,datetime,words,word_stems
0,853,sub17967,bulimia,pre-covid,['how can i stop hating myself i have been on...,12749,2444,2017-12-02 16:36:16,"['[', ""'how"", 'stop', 'hating', 'eating', 'dis...","['[', ""'how"", 'stop', 'hating', 'eating', 'dis..."
1,852,sub10311,bulimia,pre-covid,['new guy here 1 month on it 16m hi guys just...,11906,2334,2017-12-05 19:45:25,"['[', ""'new"", 'guy', '1', 'month', '16m', 'hi'...","['[', ""'new"", 'guy', '1', 'month', '16m', 'hi'..."
2,851,sub5587,bulimia,pre-covid,['so i just vomited blood what can i eat while...,10688,2051,2017-12-06 16:58:16,"['[', ""'so"", 'vomited', 'blood', 'eat', 'throa...","['[', ""'so"", 'vomited', 'blood', 'eat', 'throa..."
3,850,sub32498,bulimia,pre-covid,['recovery is expensive during recovery hi i...,6027,1125,2017-12-07 14:07:27,"['[', ""'recovery"", 'expensive', 'recovery', 'h...","['[', ""'recovery"", 'expensive', 'recovery', 'h..."
4,849,sub35262,bulimia,pre-covid,['anyone relate wanting validation for small s...,16805,3164,2017-12-08 00:49:23,"['[', ""'anyone"", 'relate', 'wanting', 'validat...","['[', ""'anyone"", 'relate', 'wanting', 'validat..."


##### Binarize targets using get_dummies

Will use each subreddit as target (except mental health)

In [4]:
df = pd.get_dummies(df, columns=['subreddit'])
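
As a small illustration of what `get_dummies` does in the cell above, the sketch below one-hot encodes a toy frame and collects the resulting `subreddit_*` columns, which are the binary targets available for modeling. The frame `toy` and its rows are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the posts table; the values are illustrative only.
toy = pd.DataFrame({
    "text": ["post a", "post b", "post c"],
    "subreddit": ["bulimia", "autism", "bulimia"],
})

# One indicator column per distinct subreddit; the original column is removed.
toy = pd.get_dummies(toy, columns=["subreddit"])

# Every indicator column is a candidate binary target.
target_cols = [c for c in toy.columns if c.startswith("subreddit_")]
print(target_cols)

# Dropping a column that may be absent: errors="ignore" avoids a KeyError.
toy = toy.drop(columns="subreddit_mentalhealth", errors="ignore")
```

Listing the indicator columns first is a cheap way to confirm which subreddits are actually present before selecting a target by name.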

In [6]:
df

Unnamed: 0.1,Unnamed: 0,author,timeframe,text,text_length,text_word_count,datetime,words,word_stems,subreddit_AnorexiaNervosa,subreddit_BPD,subreddit_autism,subreddit_bulimia,subreddit_schizophrenia
0,853,sub17967,pre-covid,['how can i stop hating myself i have been on...,12749,2444,2017-12-02 16:36:16,"['[', ""'how"", 'stop', 'hating', 'eating', 'dis...","['[', ""'how"", 'stop', 'hating', 'eating', 'dis...",0,0,0,1,0
1,852,sub10311,pre-covid,['new guy here 1 month on it 16m hi guys just...,11906,2334,2017-12-05 19:45:25,"['[', ""'new"", 'guy', '1', 'month', '16m', 'hi'...","['[', ""'new"", 'guy', '1', 'month', '16m', 'hi'...",0,0,0,1,0
2,851,sub5587,pre-covid,['so i just vomited blood what can i eat while...,10688,2051,2017-12-06 16:58:16,"['[', ""'so"", 'vomited', 'blood', 'eat', 'throa...","['[', ""'so"", 'vomited', 'blood', 'eat', 'throa...",0,0,0,1,0
3,850,sub32498,pre-covid,['recovery is expensive during recovery hi i...,6027,1125,2017-12-07 14:07:27,"['[', ""'recovery"", 'expensive', 'recovery', 'h...","['[', ""'recovery"", 'expensive', 'recovery', 'h...",0,0,0,1,0
4,849,sub35262,pre-covid,['anyone relate wanting validation for small s...,16805,3164,2017-12-08 00:49:23,"['[', ""'anyone"", 'relate', 'wanting', 'validat...","['[', ""'anyone"", 'relate', 'wanting', 'validat...",0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3871,27445,sub15361,post-covid,['idfk relapsed into my bulimia purging every...,3466,646,2021-02-18 18:12:13,"['[', ""'idfk"", 'relapsed', 'purging', 'every',...","['[', ""'idfk"", 'relapsed', 'purging', 'every',...",0,0,0,1,0
3872,27441,sub13955,post-covid,['well i managed a whole two weeks was on vac...,9904,1899,2021-02-18 22:54:47,"['[', ""'well"", 'managed', 'whole', 'two', 'wee...","['[', ""'well"", 'managed', 'whole', 'two', 'wee...",0,0,0,1,0
3873,27439,sub22575,post-covid,['first day in a long time with no purging ju...,6651,1233,2021-02-18 23:45:43,"['[', ""'first"", 'day', 'long', 'time', 'purgin...","['[', ""'first"", 'day', 'long', 'time', 'purgin...",0,0,0,1,0
3874,25306,sub2340,post-covid,['bingingpurging recovery stomach issues so ...,7247,1416,2021-02-23 15:49:58,"['[', ""'bingingpurging"", 'recovery', 'stomach'...","['[', ""'bingingpurging"", 'recovery', 'stomach'...",0,0,0,1,0


In [5]:
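# The dummy column only exists when posts from r/mentalhealth are in the data;
# errors='ignore' skips the drop instead of raising a KeyError when it is absent.
df.drop(columns='subreddit_mentalhealth', errors='ignore', inplace=True)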

## Vectorize using TFIDF
Apply Term Frequency - Inverse Document Frequency (TF-IDF) to convert the pre-processed text of the subreddit posts into a numerical weight matrix. This matrix is the feature set for the predictive models.

In [None]:
# create the transform
vectorizer = TfidfVectorizer()

In [None]:
X = vectorizer.fit_transform(df['text'])

In [None]:
X.shape
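
To make the weighting concrete, here is a minimal NumPy sketch of the computation, assuming the default settings of `TfidfVectorizer` (smoothed idf and L2 row normalisation). The toy corpus and all variable names are illustrative and do not come from the dataset.

```python
import numpy as np

# Toy corpus, already tokenised; illustrative only.
docs = [["binge", "purge", "binge"],
        ["purge", "recovery"],
        ["recovery", "therapy"]]

vocab = sorted({w for d in docs for w in d})
col = {w: j for j, w in enumerate(vocab)}

# Raw term counts: one row per document, one column per term.
tf = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d:
        tf[i, col[w]] += 1

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = (tf > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then scale each row to unit L2 norm.
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
print(np.round(tfidf, 3))
```

Terms that occur in fewer documents ("binge", "therapy") receive a larger idf than terms shared across documents ("purge", "recovery"), so they carry more weight in the rows where they appear.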

## Subreddit: Anorexia Nervosa

In [None]:
y = df['subreddit_AnorexiaNervosa']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)
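
Because the targets are imbalanced, one option worth considering is a stratified split, which keeps the share of positive rows the same in both partitions. The sketch below shows the effect of the `stratify` argument on synthetic labels; `X_toy` and `y_toy` are made-up stand-ins, and this is a possible variation rather than what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 10 positives out of 100 rows (illustrative only).
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a small minority class can end up over- or under-represented in the test set purely by chance, which makes the reported scores noisier.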

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)
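
The idea behind SMOTE can be sketched in a few lines of NumPy: each synthetic row is a random point on the line segment between a minority row and one of its minority neighbours. The version below is a simplification, assuming only the single nearest neighbour is used (the `imblearn` implementation picks among several nearest neighbours); the function `smote_like` and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training data: 6 majority rows and 3 minority rows (illustrative only).
X_maj = rng.normal(loc=0.0, size=(6, 2))
X_min = rng.normal(loc=3.0, size=(3, 2))

def smote_like(X_minority, n_new, rng):
    """Create n_new rows by interpolating between a minority row and its nearest minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        # Distances from row i to every minority row, excluding row i itself.
        dist = np.linalg.norm(X_minority - X_minority[i], axis=1)
        dist[i] = np.inf
        j = np.argmin(dist)
        # Random point on the segment between the two rows.
        gap = rng.random()
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Generate enough synthetic rows to match the majority class size.
X_new = smote_like(X_min, n_new=len(X_maj) - len(X_min), rng=rng)
X_min_resampled = np.vstack([X_min, X_new])
print(len(X_maj), len(X_min_resampled))
```

Resampling only the training split, as this notebook does, matters: synthetic rows in the test set would make the evaluation look at data that never occurred.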

### XGBoost Model

Transform the re-sampled features to DMatrix format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [None]:
param = {
    'eta': 0.3,                      # learning rate
    'max_depth': 3,                  # maximum depth of each tree
    'objective': 'multi:softprob',   # return one probability per class
    'num_class': 2}                  # binary target: this subreddit vs. the rest

steps = 20  # number of boosting rounds

In [None]:
model = xgb.train(param, D_train, steps)
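
The `multi:softprob` objective turns per-class raw scores into probabilities with a softmax. The NumPy sketch below shows that step on hypothetical scores (the array `raw` is made up, not model output) and checks that, with two classes, the probability of class 1 is the same as a sigmoid of the score difference.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw scores for 3 rows and 2 classes.
raw = np.array([[ 1.2, -0.3],
                [-0.5,  0.9],
                [ 0.1,  0.1]])

probs = softmax(raw)

# With two classes, P(class 1) equals a sigmoid of the score difference.
sigmoid = 1.0 / (1.0 + np.exp(-(raw[:, 1] - raw[:, 0])))
print(np.round(probs, 3))
```

This equivalence is why a two-class softmax model and a logistic model describe the same family of predictions; the softmax form simply returns one column per class.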

#### XGBoost Model Scores

In [None]:
preds = model.predict(D_test)
# multi:softprob returns one probability per class for each row; keep the most likely class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary', so they score the positive
# class (label 1) only; average='macro' averages the per-class scores and gives different values
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-anorexia = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Anorexia = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))
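
To show what these scores measure, the sketch below computes precision, recall and F1 by hand for each class, and then the macro-averaged precision. The probabilities and labels are hypothetical, and the helper `precision_recall_f1` is an illustrative name that assumes every class is predicted at least once (so no division by zero occurs).

```python
import numpy as np

# Hypothetical class probabilities for 6 test rows: columns are class 0 and class 1.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7],
                  [0.8, 0.2],
                  [0.4, 0.6]])
y_true = np.array([0, 1, 1, 1, 0, 1])

# Most probable class per row.
y_pred = probs.argmax(axis=1)

def precision_recall_f1(y_true, y_pred, pos_label):
    """Scores for one class, treating pos_label as the positive class."""
    tp = np.sum((y_pred == pos_label) & (y_true == pos_label))
    fp = np.sum((y_pred == pos_label) & (y_true != pos_label))
    fn = np.sum((y_pred != pos_label) & (y_true == pos_label))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p1, r1, f1_pos = precision_recall_f1(y_true, y_pred, pos_label=1)
p0, r0, f1_neg = precision_recall_f1(y_true, y_pred, pos_label=0)

# Macro precision is the unweighted mean of the per-class precisions.
macro_precision = (p0 + p1) / 2
print(p1, r1, macro_precision)
```

Here the positive-class precision and the macro precision differ because the model is more precise on class 1 than on class 0, which is the effect seen when switching the `average` argument.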

#### Grid Search

A full grid search over the parameters below is expensive and may not be worth the time. Run it overnight and inspect `grid.best_params_` afterwards.

clf = xgb.XGBClassifier()
parameters = {
    "eta":              [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth":        [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma":            [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7]
}

# Exhaustive search with 3-fold cross-validation, scored by negative log loss
grid = GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(X_train_sm, y_train_sm)
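
A quick way to judge the cost before launching the search is to count the parameter combinations. The sketch below copies the grid so that it is self-contained and multiplies by the number of cross-validation folds; the names `n_combinations` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as in the search above, copied so the snippet runs on its own.
parameters = {
    "eta":              [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth":        [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma":            [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# Every combination of one value per parameter is one candidate model.
n_combinations = len(list(product(*parameters.values())))

# Each candidate is fitted once per fold (cv=3 in the search above).
n_fits = n_combinations * 3
print(n_combinations, n_fits)
```

With 3,840 candidates and 11,520 fits, trimming the grid or sampling a subset of combinations are reasonable ways to shorten the run.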

## Subreddit: Anxiety

In [None]:
y = df['subreddit_Anxiety']

In [None]:
y.shape

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

In [None]:
# summarize class distribution
counter = Counter(y_train)
print(counter)

In [None]:
# Oversample the minority class using SMOTE
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

In [None]:
# summarize class distribution
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

In [None]:
# Transform to DMatrix format
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

In [None]:
# Define parameters for the XGBoost model
param = {
    'eta': 0.3,                      # learning rate
    'max_depth': 3,                  # maximum depth of each tree
    'objective': 'multi:softprob',   # return one probability per class
    'num_class': 2}                  # binary target: this subreddit vs. the rest

steps = 20  # number of boosting rounds

In [None]:
model = xgb.train(param, D_train, steps)

In [None]:
preds = model.predict(D_test)
# multi:softprob returns one probability per class for each row; keep the most likely class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary', so they score the positive
# class (label 1) only; average='macro' averages the per-class scores and gives different values
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-anxiety = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Anxiety = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

## Subreddit: Autism

In [None]:
y = df['subreddit_autism']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Transform the re-sampled features to DMatrix format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [None]:
param = {
    'eta': 0.3,                      # learning rate
    'max_depth': 3,                  # maximum depth of each tree
    'objective': 'multi:softprob',   # return one probability per class
    'num_class': 2}                  # binary target: this subreddit vs. the rest

steps = 20  # number of boosting rounds

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
preds = model.predict(D_test)
# multi:softprob returns one probability per class for each row; keep the most likely class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary', so they score the positive
# class (label 1) only; average='macro' averages the per-class scores and gives different values
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-autism = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Autism = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

## Subreddit: BPD

In [None]:
y = df['subreddit_BPD']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Transform the re-sampled features to DMatrix format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [None]:
param = {
    'eta': 0.3,                      # learning rate
    'max_depth': 3,                  # maximum depth of each tree
    'objective': 'multi:softprob',   # return one probability per class
    'num_class': 2}                  # binary target: this subreddit vs. the rest

steps = 20  # number of boosting rounds

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
preds = model.predict(D_test)
# multi:softprob returns one probability per class for each row; keep the most likely class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary', so they score the positive
# class (label 1) only; average='macro' averages the per-class scores and gives different values
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-BPD = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: BPD = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

## Subreddit: Bipolar

In [None]:
y = df['subreddit_bipolar']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Transform the re-sampled features to DMatrix format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [None]:
param = {
    'eta': 0.3,                      # learning rate
    'max_depth': 3,                  # maximum depth of each tree
    'objective': 'multi:softprob',   # return one probability per class
    'num_class': 2}                  # binary target: this subreddit vs. the rest

steps = 20  # number of boosting rounds

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
preds = model.predict(D_test)
# multi:softprob returns one probability per class for each row; keep the most likely class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary', so they score the positive
# class (label 1) only; average='macro' averages the per-class scores and gives different values
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-bipolar = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Bipolar = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

## Subreddit: Bulimia

In [None]:
y = df['subreddit_bulimia']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Transform the re-sampled features to DMatrix format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [None]:
param = {
    'eta': 0.3,                      # learning rate
    'max_depth': 3,                  # maximum depth of each tree
    'objective': 'multi:softprob',   # return one probability per class
    'num_class': 2}                  # binary target: this subreddit vs. the rest

steps = 20  # number of boosting rounds

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
preds = model.predict(D_test)
# multi:softprob returns one probability per class for each row; keep the most likely class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary', so they score the positive
# class (label 1) only; average='macro' averages the per-class scores and gives different values
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-bulimia = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Bulimia = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

## Subreddit: Depression

In [None]:
y = df['subreddit_depression']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Transform the re-sampled features to DMatrix format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [None]:
param = {
    'eta': 0.3,                      # learning rate
    'max_depth': 3,                  # maximum depth of each tree
    'objective': 'multi:softprob',   # return one probability per class
    'num_class': 2}                  # binary target: this subreddit vs. the rest

steps = 20  # number of boosting rounds

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
preds = model.predict(D_test)
# multi:softprob returns one probability per class for each row; keep the most likely class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary', so they score the positive
# class (label 1) only; average='macro' averages the per-class scores and gives different values
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-depression = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Depression = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

## Subreddit: Schizophrenia

In [None]:
y = df['subreddit_schizophrenia']

In [None]:
y.shape

### train test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = .2,
                                                    random_state=42)

Summarize the class distribution of the target

In [None]:
counter = Counter(y_train)
print(counter)

### Oversample the minority class using SMOTE

In [None]:
smote = SMOTE(random_state=42) 
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

Summarize the class distribution of the target after oversampling the minority class using the SMOTE algorithm.

In [None]:
counter_resampled = Counter(y_train_sm) 
print(counter_resampled)

### XGBoost Model

Transform the re-sampled features to DMatrix format

In [None]:
D_train = xgb.DMatrix(X_train_sm, label=y_train_sm)
D_test = xgb.DMatrix(X_test, label=y_test)

Define parameters for the XGBoost model

In [None]:
param = {
    'eta': 0.3,                      # learning rate
    'max_depth': 3,                  # maximum depth of each tree
    'objective': 'multi:softprob',   # return one probability per class
    'num_class': 2}                  # binary target: this subreddit vs. the rest

steps = 20  # number of boosting rounds

In [None]:
model = xgb.train(param, D_train, steps)

#### XGBoost Model Scores

In [None]:
preds = model.predict(D_test)
# multi:softprob returns one probability per class for each row; keep the most likely class
best_preds = np.argmax(preds, axis=1)

# precision_score and recall_score default to average='binary', so they score the positive
# class (label 1) only; average='macro' averages the per-class scores and gives different values
print("Precision = {}".format(precision_score(y_test, best_preds)))
print("Recall = {}".format(recall_score(y_test, best_preds)))
print("F1-Score: Non-schizophrenia = {}".format(f1_score(y_test, best_preds, pos_label=0)))
print("F1-Score: Schizophrenia = {}".format(f1_score(y_test, best_preds, pos_label=1)))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))