In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV

import xgboost as xgb
from xgboost.sklearn import XGBClassifier

# WiDS 2018 Datathon

The dataset for the challenge will contain demographic and behavioral information from a representative sample of survey respondents from India and their usage of traditional and mobile financial services. The dataset is a product of InterMedia’s research to help the world’s poorest people take advantage of widely available mobile phones and other digital technology to access financial tools and participate more fully in their local economies. Women in these communities, in particular, are often largely excluded from the formal financial system. By predicting gender, the datathon teams will explore the key differences in behavior patterns of men and women, and how that may impact their use of new financial services. Ideally, these findings will influence plans to reach women in developing economies and encourage them to adopt new financial tools that will help to lift them and their families out of poverty.

## Data
The training set has 18255 entries with 1235 features.
The test set has 27285 entries with 1234 features.
Both datasets have the same columns except for 
* test_id and train_id
* the additional column is_female in the training set (which is our target value)
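
The column differences listed above can be checked with a simple set comparison. The following is a minimal sketch on two toy frames, not the real data: the values are made up, and `DG3` and `DG4` merely stand in for the shared survey columns.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical values).
toy_train = pd.DataFrame({"train_id": [0, 1], "DG3": [1, 3], "DG4": [5, 2], "is_female": [1, 0]})
toy_test = pd.DataFrame({"test_id": [0, 1], "DG3": [2, 1], "DG4": [4, 6]})

# Columns that appear in only one of the two frames.
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
only_in_test = sorted(set(toy_test.columns) - set(toy_train.columns))

print(only_in_train)  # ['is_female', 'train_id']
print(only_in_test)   # ['test_id']
```

Running the same two set differences on the real `train` and `test` frames should list only the id columns and the target.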

All columns represent answers to survey questions. The columns suffixed "OTHERS" contain free-text answers that were not offered as answer options in the respective survey question. In the column of the question itself, such an answer is already encoded as "96 - others". We can therefore drop all "OTHERS" columns, as the information we need is contained in the column of the respective question. The same applies to the columns "MM12_REC" and "MM13_REC", which also contain free-text answers.
The only other text columns are LN2_RIndLngBEOth and LN2_WIndLngBEOth. Survey questions LN2_3 and LN2_4 were conducted in different languages, and the language name is recorded in these two columns. The names were entered as free text, so they include language combinations such as "Hindi & Rajasthani" and typos such as "HIndi". We drop both columns and use LN2_3 and LN2_4 without the language specification.

In [None]:
test = pd.read_csv('../input/test.csv', low_memory=False)
train = pd.read_csv('../input/train.csv', low_memory=False)

In [None]:
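# Inspect the free-text language columns: value frequencies and number of missing entries
print(train.LN2_RIndLngBEOth.value_counts())
print(train.LN2_RIndLngBEOth.isnull().sum())
# Language labels that occur in the test set but not in the training set
print(np.setdiff1d(test.LN2_RIndLngBEOth.dropna().unique(), train.LN2_RIndLngBEOth.dropna().unique()))

print(train.LN2_WIndLngBEOth.value_counts())
print(np.setdiff1d(test.LN2_WIndLngBEOth.dropna().unique(), train.LN2_WIndLngBEOth.dropna().unique()))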

In [None]:
# Drop all free-text columns from both the training and the test set
text_cols = ['MM13_REC', 'MM12_REC', 'LN2_RIndLngBEOth', 'LN2_WIndLngBEOth']
for df in (train, test):
    df.drop(columns=list(df.filter(regex='OTHERS')) + text_cols, inplace=True)

In [None]:
train.is_female.mean()  # fraction of female respondents; the training set is balanced

## XGBoost 

XGBoost is a Gradient Boosting Machine which can automatically deal with missing values by learning default directions at each node (https://arxiv.org/abs/1603.02754). No further preprocessing is necessary.
We use the scikit-learn wrapper for XGBoost to use scikit-learn's GridSearchCV for hyperparameter tuning.
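
To illustrate what "learning default directions" means, here is a simplified sketch of the idea for a single feature. It is not XGBoost's actual implementation: the function name `best_split_with_default`, the toy data, and the regularization value `lam=1.0` are assumptions made for illustration, and the constant factor and the `gamma` penalty of the full gain formula are omitted. For every candidate threshold, the rows with a missing value are sent once to the left child and once to the right child, and the variant with the larger gain is kept.

```python
import numpy as np

def best_split_with_default(x, grad, hess, lam=1.0):
    """Return (gain, threshold, default_direction) for one feature.

    Rows where x is NaN are routed as a group to the left or the right
    child, whichever gives the larger gain (the "default direction").
    """
    def score(g, h):
        return g * g / (h + lam)

    missing = np.isnan(x)
    g_miss, h_miss = grad[missing].sum(), hess[missing].sum()
    g_total, h_total = grad.sum(), hess.sum()

    # Sort the non-missing rows by feature value.
    order = np.argsort(x[~missing])
    xs, gs, hs = x[~missing][order], grad[~missing][order], hess[~missing][order]

    best = (-np.inf, None, None)
    g_left = h_left = 0.0
    for i in range(len(xs) - 1):
        g_left += gs[i]
        h_left += hs[i]
        if xs[i] == xs[i + 1]:
            continue  # no threshold fits between two equal values
        threshold = 0.5 * (xs[i] + xs[i + 1])
        for direction in ("left", "right"):
            gl = g_left + (g_miss if direction == "left" else 0.0)
            hl = h_left + (h_miss if direction == "left" else 0.0)
            gr, hr = g_total - gl, h_total - hl
            gain = score(gl, hl) + score(gr, hr) - score(g_total, h_total)
            if gain > best[0]:
                best = (gain, float(threshold), direction)
    return best

# Toy data (hypothetical): the two rows with a missing value have label 0,
# like the rows with small feature values.
x = np.array([1.0, 2.0, np.nan, 10.0, 11.0, np.nan])
y = np.array([0, 0, 0, 1, 1, 0])
p = np.full(len(y), 0.5)   # initial predicted probability
grad = p - y               # gradient of the logistic loss
hess = p * (1 - p)         # Hessian of the logistic loss

gain, threshold, direction = best_split_with_default(x, grad, hess)
print(threshold, direction)  # 6.0 left
```

In this toy case the missing rows end up in the left child together with the other label-0 rows, which is why no imputation step is needed before training.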

For the submission, we used max_depth=8, min_child_weight=3, and n_estimators=200.

In [None]:
param_test1 = {
    'max_depth': [8],
    'min_child_weight': [3]
}

# 3-fold cross-validated grid search, scored by ROC AUC and refit on the full training set
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, gamma=0.9, n_estimators=200, subsample=0.8, colsample_bytree=0.8,
                                                objective='binary:logistic', nthread=10, seed=0),
                        param_grid=param_test1, scoring='roc_auc', n_jobs=4, cv=3, return_train_score=True, refit=True)
gsearch1.fit(train.drop(['train_id','is_female'],axis=1),train['is_female'])

In [None]:
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_

In [None]:
result_xgb=gsearch1.predict_proba(test.drop(['test_id'],axis=1))[:,1]
#pd.DataFrame(result_xgb).to_csv('results_xgb_all_grid.csv',index=True, index_label='test_id', header = ["is_female"])

In [None]:
gsearch1.get_params()

In [None]:
xgb_clf = XGBClassifier(learning_rate=0.1, gamma=0.9, n_estimators=200, subsample=0.8, colsample_bytree=0.8,
                        max_depth=8, min_child_weight=3, objective= 'binary:logistic', nthread=10, seed=0)
xgb_clf.fit(train.drop(['train_id','is_female'],axis=1),train['is_female'])

In [None]:
importances=xgb_clf.feature_importances_

In [None]:
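# Feature indices ordered from most to least important
sort_idx = np.argsort(importances)[::-1]
print(importances[sort_idx])
# Positions (in the sorted order) of features with an importance above 0.01
np.where(importances[sort_idx] > 0.01)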

In [None]:
# Names of the 25 most important features (the slice starts at 0 so the top feature is included)
train.drop(['train_id','is_female'],axis=1).columns[sort_idx[:25]]

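No single feature clearly separates the genders: all importance scores are very close to zero. Note that the scikit-learn wrapper normalizes the scores so that they sum to 1, so small individual values are expected with this many features. The most important questions seem to be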
* DL1. In the past 12 months, were you mainly...?
* AA7. Unknown
* DG3. Marital Status
* AA14. Unknown
* DL2. Primary Job
* DG6. How are you related to the household head?
* DG4. What is your highest level of education?
* FB20. What is the main reason you do not borrow from a bank?
* FL9B. Imagine that this month, after paying for food, cooking fuel, school fees, rent, and airtime, you found yourself with some extra money. Please, select 3 options from the list that you are most likely to spend it on. Option 2
* MT1A. Who decides on who should have a phone in your household?
* AA15. Unknown
* DL5. You have said that these are the ways you got money in the past 12 months. Which of these brought you the most money?
* FF3. What is the main reason you do not have a bank account?
* IFI24. What is the main reason you do not belong to any informal societies or group saving schemes?
* FL9C. Imagine that this month, after paying for food, cooking fuel, school fees, rent, and airtime, you found yourself with some extra money. Please, select 3 options from the list that you are most likely to spend it on. Option 3
* DL8. How much of your land is under cultivation? Unit: Acres (99=DK)
* DL15. What is the highest grade that the female head/spouse completed? 
* IFI16_1. If you want to get to Over the counter in a branch of a bank , how would you get there? Would you…? 
* GN1. Who usually decides how the money you earn will be used?
* FL10. What’s the most important financial goal for you right now?
* DG8a. How many adults and children do you have in the household? (99 for DK): Number of adults
* MT9. What is the main reason you do not have a mobile phone and do not use somebody else’s mobile phone?
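
Because the normalized scores sum to 1, it can help to compare each score with the uniform share 1 / n_features instead of reading the absolute values. The sketch below does this on made-up importances; the helper `summarize_importances` and the toy numbers are assumptions for illustration, not results from the model above.

```python
import numpy as np

def summarize_importances(importances, top_k=25):
    """Compare normalized importances with the uniform baseline 1 / n_features."""
    importances = np.asarray(importances, dtype=float)
    baseline = 1.0 / importances.size
    order = np.argsort(importances)[::-1]   # most important first
    top = order[:top_k]
    return {
        "baseline": baseline,
        "top_indices": top,
        "ratio_to_baseline": importances[top] / baseline,
        "top_k_share": importances[top].sum(),
    }

# Hypothetical raw scores: 1,000 features, five of them clearly stronger.
raw = np.ones(1000)
raw[:5] = [50, 40, 30, 20, 10]
toy_importances = raw / raw.sum()   # normalized to sum to 1

summary = summarize_importances(toy_importances, top_k=5)
print(summary["baseline"])            # uniform share per feature
print(summary["ratio_to_baseline"])   # how many times the uniform share each top feature holds
print(summary["top_k_share"])         # fraction of the total importance held by the top 5
```

Applied to `xgb_clf.feature_importances_`, a ratio well above 1 marks a feature that carries more than its uniform share, even when its absolute score looks small.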


In [None]:
# Note to self: read Section 10.13.1, "Relative Importance of Predictor Variables", in The Elements of Statistical Learning