## User Model - GradientBoosting

In [1]:
import pandas as pd

users = pd.read_csv("users.csv.zip", index_col="screen_name")
users = users[["followers_count", "friends_count", "statuses_count", "verified"]]



Balancing the users dataframe by class: every verified user is kept, and unverified users are randomly downsampled to roughly the same count.

In [2]:
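from numpy.random import binomial

# Ratio of verified (1) to unverified (0) users; used below as the
# probability of keeping each unverified row.
class_counts = users.verified.value_counts()
sampling_ratio = class_counts[1] / class_counts[0]

# Keep every verified user; keep each unverified user with
# probability sampling_ratio.
to_keep = []
for row in users.itertuples():
    if row.verified == 1:
        to_keep.append(True)
    else:
        to_keep.append(binomial(1, sampling_ratio) == 1)
users = users[to_keep]
users.head()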

screen_name,followers_count,friends_count,statuses_count,verified
joeflech,3658,1224,11410,1
DLasAmericas,112154,12510,407453,1
rebeccavallas,6292,1736,16352,1
PHRESHER_DGYGZ,4397,102,10708,1
yeyo478,1612,1546,148501,0
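
The row-by-row loop above can also be written without explicit iteration. The following is a minimal sketch, not the notebook's own code: it assumes a small hypothetical dataframe `demo` in place of `users.csv.zip`, and a fixed seed that the original does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the users dataframe: 20 verified, 180 unverified.
rng = np.random.default_rng(0)  # fixed seed is an assumption, for repeatability
demo = pd.DataFrame(
    {
        "followers_count": rng.integers(0, 10_000, size=200),
        "verified": [1] * 20 + [0] * 180,
    }
)

counts = demo["verified"].value_counts()
sampling_ratio = counts[1] / counts[0]  # same ratio as in the loop version

# One uniform draw per row: verified rows are always kept, unverified rows
# are kept with probability sampling_ratio.
keep = (demo["verified"] == 1) | (rng.random(len(demo)) < sampling_ratio)
balanced = demo[keep]

print(balanced["verified"].value_counts())
```

Both versions balance the classes only in expectation, so the two counts usually differ slightly. If an exact 1:1 ratio matters, drawing a fixed number of majority-class rows with `DataFrame.sample(n=..., random_state=...)` is one alternative.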


Splitting the data into train (70%) and test (30%) subsets

In [3]:
from sklearn.model_selection import train_test_split

x = users.iloc[:, :-1]
y = users.iloc[:, -1]
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3)
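
The split above is drawn at random on every run and does not fix the class proportions in each subset. A sketch of a reproducible, stratified variant follows. The synthetic arrays `x_demo` and `y_demo` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, perfectly balanced data standing in for the users features/labels.
rng = np.random.default_rng(42)
x_demo = rng.random((100, 3))
y_demo = np.array([0] * 50 + [1] * 50)

# stratify keeps the class ratio identical in both subsets;
# random_state makes the split repeatable across runs.
x_tr_d, x_te_d, y_tr_d, y_te_d = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(x_tr_d.shape, x_te_d.shape)
print("positive share in test set:", y_te_d.mean())
```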

The model originally went through hyperparameter optimization, feature engineering, feature selection, and cross-validation on the `x_tr, y_tr` set. After it proved to be better than Random Forest, Logistic Regression, and other models, we train it on the `x_tr, y_tr` set and test it on the `x_te, y_te` set.
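
The comparison itself is not shown in this notebook. As a rough illustration of how such a comparison could be run, the sketch below scores three classifiers with cross-validated ROC-AUC. Everything in it is an assumption: the data comes from `make_classification`, and the hyperparameters and fold count are arbitrary rather than the ones used originally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (hypothetical stand-in for x_tr, y_tr).
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

candidates = {
    "gradient_boosting": GradientBoostingClassifier(n_estimators=20, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean ROC-AUC over 5 folds for each candidate model.
mean_auc = {
    name: cross_val_score(model, X, y, scoring="roc_auc", cv=5).mean()
    for name, model in candidates.items()
}

for name, score in sorted(mean_auc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```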

Evaluating the model with 10-fold cross-validation on the training set.

In [4]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

gb = GradientBoostingClassifier(n_estimators=50)
cv_results = cross_validate(gb, x_tr, y_tr, scoring=["roc_auc"], cv=10)
print("Cross-Validation ROC-AUC-Score: ", cv_results["test_roc_auc"])

Cross-Validation ROC-AUC-Score:  [0.93310252 0.98172139 0.93768148 0.97561611 0.96939915 0.97953869
 0.97342342 0.96572823 0.96647898 0.95908408]
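
Ten per-fold scores are easier to compare against the test-set result once they are reduced to a mean and a spread. A small sketch, with the values copied from the output above (the variable names are illustrative):

```python
import numpy as np

# Per-fold ROC-AUC scores copied from the cross-validation output above.
fold_scores = np.array(
    [
        0.93310252, 0.98172139, 0.93768148, 0.97561611, 0.96939915,
        0.97953869, 0.97342342, 0.96572823, 0.96647898, 0.95908408,
    ]
)

mean_auc = fold_scores.mean()
std_auc = fold_scores.std()  # population standard deviation across folds

print(f"ROC-AUC: {mean_auc:.4f} +/- {std_auc:.4f}")
print(f"min: {fold_scores.min():.4f}, max: {fold_scores.max():.4f}")
```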


Afterwards, training the model on the full training set and evaluating it again on the test set.

In [5]:
gb.fit(x_tr, y_tr)
y_pred = gb.predict(x_te)
y_pred_proba = gb.predict_proba(x_te)

print("Test-Set Evaluation:")
print("F1-Score:", f1_score(y_te, y_pred))
print("Precision-Score:", precision_score(y_te, y_pred))
print("Recall-Score:", recall_score(y_te, y_pred))
print("ROC-AUC-Score:", roc_auc_score(y_te, y_pred_proba[:, 1]))

Test-Set Evaluation:
F1-Score: 0.907815631262525
Precision-Score: 0.8864970645792564
Recall-Score: 0.9301848049281314
ROC-AUC-Score: 0.9628624879272144
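
As a sanity check on the numbers above, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. A short sketch, with the inputs copied from the output:

```python
# Precision and recall copied from the test-set evaluation output above.
precision = 0.8864970645792564
recall = 0.9301848049281314

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("Recomputed F1:", f1)
```

The recomputed value agrees with the reported F1-Score of 0.9078.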
