In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

import pickle
import argparse
# from gcforest.gcforest import GCForest
# gc = GCForest(config) # should be a dict
# X_train_enc = gc.fit_transform(X_train, y_train)
# y_pred = gc.predict(X_test)

## The Data and Question of Interest

Let's take a look at the [UCI Adult Data Set](https://archive.ics.uci.edu/ml/datasets/adult). This data set was extracted from Census data with the goal of predicting who makes over $50,000 a year.

I would like to use these data to explore machine learning algorithms of increasing complexity and see how they compare on various evaluation metrics. Additionally, it will be interesting to see how much there is to gain by spending some time fine-tuning these algorithms.

We will look at the following algorithms:
1. [Logistic Regression](http://learningwithdata.com/logistic-regression-and-optimization.html#logistic-regression-and-optimization)
2. [Gradient Boosting Trees](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
3. [Deep Learning](https://blog.algorithmia.com/introduction-to-deep-learning-2016/)

And evaluate them with the following metrics:
1. [F1 Score](https://en.wikipedia.org/wiki/F1_score)
2. [Area Under ROC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
3. [Accuracy](https://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf)

Let's go ahead and read in the data and take a look.

# Load the Data

In [2]:
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                      header=None, names=names)
test_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                      header=None, names=names, skiprows=[0])
all_df = pd.concat([train_df, test_df])
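A detail worth knowing about the raw UCI files: fields are separated by a comma followed by a space, missing values are written as `?`, and the labels in `adult.test` end with a period. The sketch below shows one way to normalize this at read time. It parses two inline rows instead of the real download; the second row is invented for illustration, and `raw` and `toy_df` are hypothetical names.

```python
import io
import pandas as pd

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus',
         'occupation', 'relationship', 'race', 'sex', 'capitalgain', 'capitalloss',
         'hoursperweek', 'nativecountry', 'label']

# Two rows in the same "comma + space" layout as the UCI files.
# The second row is made up: it has '?' placeholders and an adult.test-style label.
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "25, ?, 226802, 11th, 7, Never-married, ?, "
    "Own-child, Black, Male, 0, 0, 40, United-States, >50K.\n"
)

toy_df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=names,
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat '?' as a missing value
)

# Strip the trailing period so labels from both files match.
toy_df["label"] = toy_df["label"].str.rstrip(".")

print(toy_df[["workclass", "occupation", "label"]])
```

The notebook's own label rule (`'<' in x`) is unaffected by the trailing period, so this cleanup is optional here. It matters mostly if you want `?` to be treated as missing rather than as a category of its own.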

In [3]:
all_df.head()

| | age | workclass | fnlwgt | education | educationnum | maritalstatus | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | nativecountry | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [4]:
# all_df.to_csv('../data/adult/all_df.csv', encoding='utf-8', index=False)
# train_df.to_csv('../data/adult/train.csv', encoding='utf-8', index=False)
# test_df.to_csv('../data/adult/test.csv', encoding='utf-8', index=False)

In [5]:
all_df.shape

(48842, 15)

# Feature Engineering


It looks like we have 14 columns to help us predict the label. We will drop fnlwgt and education (educationnum already encodes education as a number), then convert our categorical features to dummy variables. We will also convert our label to 0 and 1, where 1 means the person made more than $50k.
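As a small illustration of what the dummy-variable and label conversions do, here is a sketch on a three-row frame. The frame `toy` and its values are hypothetical; only the `get_dummies` call and the label rule mirror the cells that follow.

```python
import pandas as pd

# A tiny hypothetical frame with one numeric and one categorical feature.
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": ["Male", "Male", "Female"],
    "label": ["<=50K", ">50K", "<=50K"],
})

# One column per category value; non-categorical columns pass through unchanged.
toy_dummies = pd.get_dummies(toy, columns=["sex"])

# Same rule as the notebook: a label containing '<' maps to 0, everything else to 1.
toy_y = toy_dummies["label"].apply(lambda x: 0 if "<" in x else 1)
toy_X = toy_dummies.drop(["label"], axis=1)

print(list(toy_X.columns))  # ['age', 'sex_Female', 'sex_Male']
print(toy_y.tolist())       # [0, 1, 0]
```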



In [6]:
drop_columns = ['fnlwgt', 'education']
continuous_features = ['age', 'capitalgain', 'capitalloss', 'hoursperweek']
cat_features =['educationnum', 'workclass', 'maritalstatus', 'occupation', 'relationship', 'race', 'sex', 'nativecountry']

In [7]:
all_df_dummies = pd.get_dummies(all_df, columns=cat_features)

In [8]:
# Pass axis by keyword: the positional form was removed in pandas 2.0.
all_df_dummies.drop(drop_columns, axis=1, inplace=True)

In [9]:
y = all_df_dummies['label'].apply(lambda x: 0 if '<' in x else 1)
X = all_df_dummies.drop(['label'], axis=1)

In [10]:
y.value_counts(normalize=True)

0    0.760718
1    0.239282
Name: label, dtype: float64

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
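Because only about 24% of the labels are positive, it can be worth keeping that ratio identical in both splits. The split above does not do this; the sketch below shows the optional `stratify` argument of `train_test_split` on synthetic labels. All names and the random data are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same 76/24 imbalance as the Adult data.
rng = np.random.RandomState(42)
toy_y = (rng.rand(1000) < 0.24).astype(int)
toy_X = rng.randn(1000, 3)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, test_size=0.33, random_state=42, stratify=toy_y
)

# With stratify, both splits keep nearly the same positive rate as the full set.
print(toy_y.mean(), ys_train.mean(), ys_test.mean())
```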

In [12]:
X_train.shape

(32724, 106)

## Data Cleaning

In [13]:
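# Fill missing feature values with the median, then standardize the data.
# preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it.
from sklearn.impute import SimpleImputer
clean_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")),
                           ('std_scaler', preprocessing.StandardScaler())])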

In [14]:
X_train_clean = clean_pipeline.fit_transform(X_train)

In [15]:
X_test_clean = clean_pipeline.transform(X_test)
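The two cells above follow the usual pattern of fitting the preprocessing on the training rows only and then reusing those statistics on the test rows. A minimal sketch on a hand-made array makes the effect visible. It assumes scikit-learn 0.20 or newer (for `SimpleImputer`), and the `toy_*` names and values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # scikit-learn >= 0.20
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0, 10.0],
                      [2.0, np.nan],
                      [3.0, 30.0]])
toy_test = np.array([[2.0, np.nan]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# fit_transform learns medians, means and standard deviations from the training rows.
toy_train_clean = toy_pipeline.fit_transform(toy_train)
# transform reuses those training statistics; nothing is re-estimated from the test row.
toy_test_clean = toy_pipeline.transform(toy_test)

print(toy_train_clean.mean(axis=0))  # approximately [0, 0]
print(toy_test_clean)                # zeros: the row sits exactly at the training centre
```

The missing test value is filled with the training median (20) and then centred with the training mean (also 20), which is why the test row maps to zeros.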

## Evaluation Function

In [16]:
def evaluate(true, pred):
    f1 = metrics.f1_score(true, pred)
    roc_auc = metrics.roc_auc_score(true, pred)
    accuracy = metrics.accuracy_score(true, pred)
    print("F1: {0}\nROC_AUC: {1}\nACCURACY: {2}".format(f1, roc_auc, accuracy))
    return f1, roc_auc, accuracy
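Note that `evaluate` passes hard 0/1 predictions to `roc_auc_score`. That is valid, but with a single threshold the ROC curve has only one interior point, and the area reduces to the average of the true positive rate and the true negative rate (balanced accuracy). Passing scores such as `clf.predict_proba(X)[:, 1]` would measure ranking quality instead. A minimal sketch with made-up numbers, assuming scikit-learn 0.20 or newer for `balanced_accuracy_score`:

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.8]                  # hypothetical predicted probabilities
hard = [1 if s >= 0.5 else 0 for s in scores]  # [0, 0, 0, 1]

auc_from_scores = metrics.roc_auc_score(y_true, scores)
auc_from_labels = metrics.roc_auc_score(y_true, hard)
balanced_acc = metrics.balanced_accuracy_score(y_true, hard)

print(auc_from_scores)  # 1.0: every positive is ranked above every negative
print(auc_from_labels)  # 0.75
print(balanced_acc)     # 0.75
```

The ROC_AUC values reported in this notebook are therefore best read as balanced accuracies at the default threshold.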

## Logistic Regression

The first model up is a simple logistic regression with the default hyperparameters.

In [17]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [18]:
lr_predictions = clf.predict(X_test)

In [19]:
lr_f1, lr_roc_auc, lr_acc = evaluate(y_test, lr_predictions)

F1: 0.6530320366132722
ROC_AUC: 0.7590740874725979
ACCURACY: 0.8494850477726765
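The baseline above is fit on X_train rather than X_train_clean and uses default hyperparameters. As a possible next step, scaling and the classifier can be bundled in one `Pipeline` and the regularization strength `C` tuned with `GridSearchCV`. The sketch below runs on synthetic data from `make_classification`, not on the Adult data, and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Adult features.
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)
Xa, Xb, ya, yb = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using only that fold's training part.
lr_pipeline = Pipeline([
    ("std_scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(lr_pipeline, param_grid, scoring="f1", cv=3)
search.fit(Xa, ya)

print(search.best_params_)
print(search.score(Xb, yb))  # F1 on the held-out split
```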


## GcForest

The second model is a gcForest cascade, configured with the hyperparameters defined below.


If you wish to use the Cascade Layer only, the legal data types for X_train and X_test are:

    2-D numpy array of shape (n_samples, n_features).
    3-D or 4-D numpy arrays are also acceptable. For example, an X_train of shape (60000, 28, 28) or (60000, 3, 28, 28) will automatically be reshaped into (60000, 784) or (60000, 2352).
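The reshaping described above amounts to keeping the sample axis and flattening everything else. The following sketch reproduces the arithmetic with plain NumPy; it illustrates the shapes only and is not the gcforest library's internal code. The arrays use 6 samples instead of 60000 to stay small.

```python
import numpy as np

# Hypothetical image-like inputs.
X_3d = np.zeros((6, 28, 28))
X_4d = np.zeros((6, 3, 28, 28))

# Keep the sample axis and collapse all remaining axes into one feature axis.
X_3d_flat = X_3d.reshape(X_3d.shape[0], -1)
X_4d_flat = X_4d.reshape(X_4d.shape[0], -1)

print(X_3d_flat.shape)  # (6, 784)
print(X_4d_flat.shape)  # (6, 2352)
```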


In [20]:
import sys 
sys.path.append("..") 
from gcforest.gcforest import GCForest
from gcforest.utils.config_utils import load_json

In [21]:
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

In [22]:
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", dest="model", type=str, default=None, help="gcForest Net Model File")
    args = parser.parse_args()
    return args


def get_toy_config():
    config = {}
    ca_config = {}
    ca_config["random_state"] = 0
    ca_config["max_layers"] = 10
    ca_config["early_stopping_rounds"] = 3
    ca_config["n_classes"] = 2
    ca_config["estimators"] = []
#     ca_config["estimators"].append(
#             {"n_folds": 5, "type": "XGBClassifier", "n_estimators": 10, "max_depth": 5,
#              "objective": "multi:softprob", "silent": True, "nthread": -1, "learning_rate": 0.1} )
    ca_config["estimators"].append({"n_folds": 5, "type": "RandomForestClassifier", "n_estimators": 10, "max_depth": None, "n_jobs": -1})
    ca_config["estimators"].append({"n_folds": 5, "type": "ExtraTreesClassifier", "n_estimators": 10, "max_depth": None, "n_jobs": -1})
    ca_config["estimators"].append({"n_folds": 5, "type": "LogisticRegression"})
    config["cascade"] = ca_config
    return config
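To give a feel for what one cascade layer with `n_folds: 5` does, here is a simplified sketch built from plain scikit-learn pieces. It is an illustration of the cascade idea (out-of-fold class probabilities appended to the original features), not the gcforest library's implementation, and it runs on synthetic data with hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

layer_estimators = [
    RandomForestClassifier(n_estimators=10, random_state=0),
    ExtraTreesClassifier(n_estimators=10, random_state=0),
]

# Out-of-fold class probabilities from each estimator (5 folds, as in the config).
layer_outputs = [
    cross_val_predict(est, X_syn, y_syn, cv=5, method="predict_proba")
    for est in layer_estimators
]

# The next layer sees the original features plus every estimator's class vector.
X_next = np.hstack([X_syn] + layer_outputs)

print(X_next.shape)  # (200, 12): 8 features + 2 estimators * 2 classes
```

In the real cascade this step repeats layer by layer, up to `max_layers`, and stops once accuracy has not improved for `early_stopping_rounds` layers.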

In [23]:
config = get_toy_config()
gc = GCForest(config)

# If the model uses too much memory, you can tell gcforest
# not to keep it in memory:
# gc.set_keep_model_in_mem(False)  # the default is True

X_train_enc = gc.fit_transform(X_train, y_train)
y_pred = gc.predict(X_test)
gc_f1, gc_roc_auc, gc_acc = evaluate(y_test, y_pred)

[ 2018-10-03 00:10:27,814][cascade_classifier.fit_transform] X_groups_train.shape=[(32724, 106)],y_train.shape=(32724,),X_groups_test.shape=no_test,y_test.shape=no_test
[ 2018-10-03 00:10:27,856][cascade_classifier.fit_transform] group_dims=[106]
[ 2018-10-03 00:10:27,858][cascade_classifier.fit_transform] group_starts=[0]
[ 2018-10-03 00:10:27,861][cascade_classifier.fit_transform] group_ends=[106]
[ 2018-10-03 00:10:27,864][cascade_classifier.fit_transform] X_train.shape=(32724, 106),X_test.shape=(0, 106)
[ 2018-10-03 00:10:27,903][cascade_classifier.fit_transform] [layer=0] look_indexs=[0], X_cur_train.shape=(32724, 106), X_cur_test.shape=(0, 106)
[ 2018-10-03 00:10:28,369][kfold_wrapper.log_eval_metrics] Accuracy(layer_0 - estimator_0 - 5_folds.train_0.predict)=84.28%
[ 2018-10-03 00:10:28,720][kfold_wrapper.log_eval_metrics] Accuracy(layer_0 - estimator_0 - 5_folds.train_1.predict)=84.83%
[ 2018-10-03 00:10:29,059][kfold_wrapper.log_eval_metrics] Accuracy(layer_0 - estimator_0 - 5

[ 2018-10-03 00:10:45,968][kfold_wrapper.log_eval_metrics] Accuracy(layer_3 - estimator_0 - 5_folds.train_1.predict)=85.68%
[ 2018-10-03 00:10:46,326][kfold_wrapper.log_eval_metrics] Accuracy(layer_3 - estimator_0 - 5_folds.train_2.predict)=85.10%
[ 2018-10-03 00:10:46,667][kfold_wrapper.log_eval_metrics] Accuracy(layer_3 - estimator_0 - 5_folds.train_3.predict)=85.88%
[ 2018-10-03 00:10:47,008][kfold_wrapper.log_eval_metrics] Accuracy(layer_3 - estimator_0 - 5_folds.train_4.predict)=85.42%
[ 2018-10-03 00:10:47,013][kfold_wrapper.log_eval_metrics] Accuracy(layer_3 - estimator_0 - 5_folds.train_cv.predict)=85.68%
[ 2018-10-03 00:10:47,371][kfold_wrapper.log_eval_metrics] Accuracy(layer_3 - estimator_1 - 5_folds.train_0.predict)=84.63%
[ 2018-10-03 00:10:47,812][kfold_wrapper.log_eval_metrics] Accuracy(layer_3 - estimator_1 - 5_folds.train_1.predict)=84.74%
[ 2018-10-03 00:10:48,365][kfold_wrapper.log_eval_metrics] Accuracy(layer_3 - estimator_1 - 5_folds.train_2.predict)=85.56%
[ 2018-

F1: 0.6708607377752358
ROC_AUC: 0.7696213007722407
ACCURACY: 0.8571783099640153
