# Gender Prediction Model

This notebook demonstrates how to build a gender prediction model based on users' in-app behavior.

The business goal of the project is to predict users' gender in order to enrich user profiles and to understand how users' demographic features relate to their in-app behavior. For illustration purposes, the dataset used in this demo is synthesized from users' browsing activity in a news feed mobile app. The app provides real-time news feeds in various channels, such as sports, beauty, technology, and cooking. The model rests on the assumption that users of different genders browse the news channels differently.

The dataset comprises 5k app users' gender information and their browsing history in various news channels over the most recent 120 days, split 80/20 into a training set and a test set. The model uses the XGBoost algorithm and achieves 0.77 accuracy and 0.83 AUC on the 20% test set.
The trained model can then be used to predict the gender of all app users.

In [75]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
import pickle

First, let us load the dataset.
* Gender label: `{0:female, 1:male}`
* Features: A total of 163 features, each representing the number of news feed items a user read in one of 163 selected channels/tags over the past 120 days
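
The CSV used in this notebook already holds one row per user and one column per channel. As a minimal sketch of how such a count table could be derived from a raw browsing log, assume a hypothetical log of `(user_id, channel_id)` read events. The log format, the IDs, and the helper name `count_reads` are illustrative and not part of the original dataset:

```python
from collections import Counter, defaultdict

# Hypothetical raw log: one (user_id, channel_id) pair per news item read.
events = [
    ("u1", 2), ("u1", 2), ("u1", 7),
    ("u2", 2), ("u2", 19), ("u2", 19), ("u2", 19),
]

def count_reads(events, channels):
    """Return {user_id: [reads in each channel, in the order given by `channels`]}."""
    per_user = defaultdict(Counter)
    for user_id, channel_id in events:
        per_user[user_id][channel_id] += 1
    # A Counter returns 0 for channels the user never read.
    return {
        user_id: [counts[ch] for ch in channels]
        for user_id, counts in per_user.items()
    }

channels = [2, 7, 19]          # illustrative subset of channel/tag IDs
features = count_reads(events, channels)
print(features)
```

With pandas, `pd.crosstab` over the same two columns would produce an equivalent user-by-channel table.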

In [83]:
# load data
df = pd.read_csv('data/data.csv')
print(df.shape)
df.head()

(5000, 164)


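| Unnamed: 0 | gender | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 18 | 19 | ... | 351 | 352 | 353 | 354 | 356 | 358 | 360 | 361 | 362 | 363 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 10.0 | 1.0 | 7.0 | 0.0 | 0.0 | 5.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 3.0 | 135.0 | 6.0 | 19.0 | 0.0 | 0.0 | 30.0 | 1.0 | 15.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | 0.0 | 86.0 | 3.0 | 8.0 | 0.0 | 0.0 | 22.0 | 0.0 | 10.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0 | 0.0 | 211.0 | 17.0 | 23.0 | 0.0 | 0.0 | 28.0 | 2.0 | 46.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |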
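
The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV was written with its row index and then read back without `index_col`. If that is the case here (an assumption; the original notebook does not say), the following sketch with a tiny in-memory stand-in for the file shows how `index_col=0` keeps the index out of the feature columns:

```python
import io
import pandas as pd

# Tiny stand-in for data/data.csv; the first header cell is empty because
# it holds the saved row index. Values are illustrative only.
csv_text = ",gender,1,2\n0,0,0.0,10.0\n1,1,3.0,135.0\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))      # the unnamed index column shows up as 'Unnamed: 0'

clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))    # only the label and the feature columns remain
```

Applying this to the real file would change the shapes printed in this notebook, so the recorded outputs would need to be regenerated.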


In [87]:
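# separate the label from the features
# note: 'Unnamed: 0' is not dropped here, so it stays in X as one of the 163 columns
y = df['gender']
X = df.drop(columns=['gender'])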

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('\nShape of training set and test set: ', X_train.shape, y_train.shape, X_test.shape, y_test.shape)

# prepare data for xgboost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# set xgboost params
param = {
    'max_depth': 3,      # the maximum depth of each tree
    'eta': 0.3,          # the learning rate (step size shrinkage) applied at each boosting round
    'silent': 1,         # logging mode - quiet (newer XGBoost releases use 'verbosity' instead)
    'objective': 'multi:softprob',  # softmax objective; predict() returns one probability per class
    'num_class': 2}                 # the number of classes in this dataset
num_round = 10                      # the number of boosting rounds

# train xgb
print('Training xgboost model...')
xgb_model = xgb.train(param, dtrain, num_round)
preds = xgb_model.predict(dtest)      # shape (n_samples, 2): per-class probabilities
y_pred = np.argmax(preds, axis=1)     # predicted label = most probable class

# save model
filename = 'model/xgb_baseline_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(xgb_model, f)
print('Finished saving model!')


Shape of training set and test set:  (4000, 163) (4000,) (1000, 163) (1000,)
Training xgboost model...
Finished saving model!
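
With `multi:softprob` and two classes, each row of `preds` holds two probabilities that sum to 1, so taking the arg max is the same as thresholding the class-1 probability at 0.5 (ties aside). A minimal NumPy sketch on made-up probabilities illustrates this; the `fake_preds` values are illustrative, not model output:

```python
import numpy as np

# Made-up per-class probabilities: column 0 = female, column 1 = male.
fake_preds = np.array([
    [0.80, 0.20],
    [0.35, 0.65],
    [0.55, 0.45],
    [0.10, 0.90],
])

labels_argmax = np.argmax(fake_preds, axis=1)              # most probable class per row
labels_threshold = (fake_preds[:, 1] > 0.5).astype(int)    # class 1 if its probability exceeds 0.5
print(labels_argmax, labels_threshold)
```

The class-1 column is also the score passed to `roc_auc_score` in the evaluation cell. Lowering the 0.5 cut-off would generally trade precision for recall on class 1.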


In [91]:
# evaluate on test set
print("Accuracy on test set: {:.4f}".format(accuracy_score(y_test, y_pred)))
print("AUC score : {:.4f}".format(roc_auc_score(y_test, preds[:,1])))
print("\nClassification report : \n", classification_report(y_test, y_pred))
print("\nConfusion Matrix : \n", confusion_matrix(y_test, y_pred))

Accuracy on test set: 0.7730
AUC score : 0.8287

Classification report : 
              precision    recall  f1-score   support

          0       0.77      0.94      0.85       677
          1       0.77      0.42      0.55       323

avg / total       0.77      0.77      0.75      1000


Confusion Matrix : 
 [[636  41]
 [186 137]]
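
The figures in the classification report can be reproduced by hand from the confusion matrix above (rows are true labels, columns are predicted labels, following scikit-learn's convention). The sketch below recomputes them in plain Python and adds a majority-class baseline for comparison; the baseline is an addition for context, not part of the original evaluation:

```python
# Confusion matrix from the output above: rows = true class, cols = predicted class.
tn, fp = 636, 41     # true female (0): predicted female / predicted male
fn, tp = 186, 137    # true male (1):   predicted female / predicted male

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_male = tp / (tp + fp)
recall_male = tp / (tp + fn)
f1_male = 2 * precision_male * recall_male / (precision_male + recall_male)

# Always predicting the majority class (female) would score this accuracy.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.3f} precision(1)={precision_male:.2f} "
      f"recall(1)={recall_male:.2f} f1(1)={f1_male:.2f} "
      f"baseline={baseline_accuracy:.3f}")
```

The model beats the 0.677 baseline, but recall on class 1 is only 0.42: more than half of the male users in the test set are predicted as female.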
