# Predicting Income with Machine Learning Models

The dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income).

## Exploring the data

In [1]:
import numpy as np
import pandas as pd

from time import time
from IPython.display import display 

In [2]:
# Load the census data; later cells refer to this DataFrame as `data`
data = pd.read_csv("census.csv")
data.head(1)

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K


Number and share of people earning more than / at most 50K per year

In [None]:
sample_size = data.shape[0]  # number of records (rows)

In [3]:
n_records = data.shape[0]
n_greater_50k = (data['income'] == '>50K').sum()
n_at_most_50k = (data['income'] != '>50K').sum()
greater_percent = n_greater_50k / n_records*100
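
The cell above computes the counts but does not display them. A minimal sketch of the same calculation on a small made-up list of labels (the values are illustrative and not taken from `census.csv`):

```python
from collections import Counter

# Hypothetical labels for illustration only; the real ones come from data['income'].
labels = ['<=50K', '>50K', '<=50K', '<=50K', '>50K', '<=50K', '<=50K', '<=50K']

counts = Counter(labels)
n_records = len(labels)
n_greater_50k = counts['>50K']
n_at_most_50k = n_records - n_greater_50k
greater_percent = n_greater_50k / n_records * 100

print("Total records: {}".format(n_records))
print("Earning more than 50K: {}".format(n_greater_50k))
print("Earning at most 50K: {}".format(n_at_most_50k))
print("Share earning more than 50K: {:.2f}%".format(greater_percent))
```

Printing the class balance early is useful because a skewed label distribution is the reason the later cells report an F-score alongside accuracy.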

In [4]:
# Split the data into `features` and `labels`
income_raw = data['income']
features_raw = data.drop('income', axis = 1)

In [5]:
from sklearn.preprocessing import MinMaxScaler

# Scale the numerical features to the [0, 1] range
scaler = MinMaxScaler()
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
features_raw[numerical] = scaler.fit_transform(data[numerical])

display(features_raw.head(n = 1))

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,0.30137,State-gov,Bachelors,0.8,Never-married,Adm-clerical,Not-in-family,White,Male,0.02174,0.0,0.397959,United-States
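
`MinMaxScaler` rescales each column linearly to [0, 1] using x' = (x - min) / (max - min). A plain-Python sketch of that formula on made-up ages; the minimum of 17 and maximum of 90 are assumptions, chosen because they reproduce the scaled value 0.30137 shown above for age 39:

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid division by zero and map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical ages for illustration only.
ages = [17, 39, 50, 90]
scaled = min_max_scale(ages)
print([round(s, 5) for s in scaled])
```

Because the scaling depends only on the column minimum and maximum, a single extreme value (for example in `capital-gain`) compresses all other values toward 0.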


One-hot encoding

In [6]:
features = pd.get_dummies(features_raw)
income = income_raw.replace({'<=50K': 0, '>50K': 1})
encoded = list(features.columns)
print ("{} total features after one-hot encoding.".format(len(encoded)))

103 total features after one-hot encoding.
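
`pd.get_dummies` replaces each categorical column with one indicator column per distinct value, named `<column>_<value>`, which is how the feature count grows to 103. A plain-Python sketch of that expansion for a single column (the helper `one_hot` and the sample values are illustrative, not part of pandas):

```python
def one_hot(column_name, values):
    """Return {indicator_name: [0/1, ...]} for one categorical column."""
    categories = sorted(set(values))
    return {
        "{}_{}".format(column_name, cat): [int(v == cat) for v in values]
        for cat in categories
    }

# Hypothetical sample of the 'workclass' column.
workclass = ["State-gov", "Private", "Private", "Local-gov"]
encoded_cols = one_hot("workclass", workclass)
for name, column in encoded_cols.items():
    print(name, column)
```

Each row has exactly one 1 across the indicator columns of a given source column. In the notebook's own output the generated names contain a space after the underscore (for example `workclass_ Private`), which suggests the raw category values carry a leading space.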


In [7]:
print(encoded)

['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_level_ 10th', 'education_level_ 11th', 'education_level_ 12th', 'education_level_ 1st-4th', 'education_level_ 5th-6th', 'education_level_ 7th-8th', 'education_level_ 9th', 'education_level_ Assoc-acdm', 'education_level_ Assoc-voc', 'education_level_ Bachelors', 'education_level_ Doctorate', 'education_level_ HS-grad', 'education_level_ Masters', 'education_level_ Preschool', 'education_level_ Prof-school', 'education_level_ Some-college', 'marital-status_ Divorced', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'marital-status_ Married-spouse-absent', 'marital-status_ Never-married', 'marital-status_ Separated', 'marital-status_ Widowed', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', '

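### Shuffling and splitting the data
All _categorical variables_ have now been converted into numerical features, and all numerical features have been normalized. As usual, we split the data (features and their labels) into training and test sets: 80% of the data is used for training and 20% for testing. The training data is then split further into a training set and a validation set, which are used for model selection and tuning.

Run the code cell below to perform the split.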

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, income, test_size = 0.2, random_state = 0,
                                                    stratify = income)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0,
                                                    stratify = y_train)

print ("Training set has {} samples.".format(X_train.shape[0]))
print ("Validation set has {} samples.".format(X_val.shape[0]))
print ("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 28941 samples.
Validation set has 7236 samples.
Testing set has 9045 samples.
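
The printed sizes are consistent with two successive 80/20 splits of 45,222 records (the sum of the three sets). A quick arithmetic check follows; rounding the held-out share up to a whole sample is an assumption that happens to reproduce the numbers above:

```python
import math

n_total = 28941 + 7236 + 9045          # sum of the three printed sizes

n_test = math.ceil(n_total * 0.2)      # first split: 20% held out for testing
n_remaining = n_total - n_test

n_val = math.ceil(n_remaining * 0.2)   # second split: 20% of the remainder for validation
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
print("Fractions of all data: {:.2f} / {:.2f} / {:.2f}".format(
    n_train / n_total, n_val / n_total, n_test / n_total))
```

The models are therefore fit on roughly 64% of the records, with about 16% reserved for validation and 20% for the final test.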


## Evaluating the initial models

In [9]:
from sklearn.metrics import fbeta_score, accuracy_score

def train_predict(learner, sample_size, X_train, y_train, X_val, y_val): 
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: income training set
       - X_val: features validation set
       - y_val: income validation set
    '''
    
    results = {}

    start = time()
    learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time()
    
    results['train_time'] = end - start

    start = time()
    predictions_val = learner.predict(X_val)
    predictions_train = learner.predict(X_train[:300])
    end = time()
    
    results['pred_time'] = end - start
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
    results['acc_val'] = accuracy_score(y_val, predictions_val)
    results['f_train'] = fbeta_score(y_true=y_train[:300], y_pred=predictions_train, beta=0.5)
    results['f_val'] = fbeta_score(y_true=y_val, y_pred=predictions_val, beta=0.5)
       
    print ("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
        
    return results
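
`fbeta_score` with `beta=0.5` combines precision P and recall R as F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), so precision counts for more than recall. A minimal sketch computed from raw confusion counts (the counts are made up for illustration, and `f_beta` is a hypothetical helper, not the scikit-learn function):

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        # No correct positive predictions: treat the score as 0 in this sketch.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 60 true positives, 20 false positives, 40 false negatives.
print(round(f_beta(60, 20, 40, beta=0.5), 4))
print(round(f_beta(60, 20, 40, beta=1.0), 4))
```

With precision 0.75 and recall 0.60, the beta = 0.5 score is higher than the beta = 1 score because it leans toward the stronger of the two, precision. When a classifier predicts no positive samples at all, precision is undefined; scikit-learn reports 0 in that case and emits an `UndefinedMetricWarning`.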


In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

clf_A = DecisionTreeClassifier(random_state=16)
clf_B = SVC(kernel='rbf', random_state=16)
clf_C = AdaBoostClassifier(random_state=16)


samples_1 = int(len(y_train) * 0.01)
samples_10 = int(len(y_train) * 0.1)
samples_100 = len(y_train)

results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_val, y_val)

DecisionTreeClassifier trained on 289 samples.
DecisionTreeClassifier trained on 2894 samples.
DecisionTreeClassifier trained on 28941 samples.
SVC trained on 289 samples.
SVC trained on 2894 samples.
SVC trained on 28941 samples.
AdaBoostClassifier trained on 289 samples.
AdaBoostClassifier trained on 2894 samples.
AdaBoostClassifier trained on 28941 samples.


## Model tuning

In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

clf = AdaBoostClassifier(random_state=16)
parameters = {'n_estimators':[10, 50, 100, 150, 200, 300]}
scorer = make_scorer(fbeta_score, beta=0.5)
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
grid_fit = grid_obj.fit(X_train, y_train)

best_clf = grid_obj.best_estimator_

predictions = (clf.fit(X_train, y_train)).predict(X_val)
best_predictions = best_clf.predict(X_val)

print (best_clf)
print ("\nUnoptimized model\n------")
print ("Accuracy score on validation data: {:.4f}".format(accuracy_score(y_val, predictions)))
print ("F-score on validation data: {:.4f}".format(fbeta_score(y_val, predictions, beta = 0.5)))
print ("\nOptimized Model\n------")
print ("Final accuracy score on the validation data: {:.4f}".format(accuracy_score(y_val, best_predictions)))
print ("Final F-score on the validation data: {:.4f}".format(fbeta_score(y_val, best_predictions, beta = 0.5)))



AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=300, random_state=16)

Unoptimized model
------
Accuracy score on validation data: 0.8648
F-score on validation data: 0.7443

Optimized Model
------
Final accuracy score on the validation data: 0.8722
Final F-score on the validation data: 0.7559


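#### Results:

| Metric | Unoptimized model | Optimized model |
| :------------: | :---------------: | :-------------: |
| Accuracy | 0.8648 | 0.8722 |
| F-score (beta = 0.5) | 0.7443 | 0.7559 |

Within the searched range (10 to 300), the model performed better as the number of weak learners increased; the grid search selected the largest value tried, `n_estimators=300`.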

## Feature importance

In [12]:
importances = best_clf.feature_importances_
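
`feature_importances_` is an array aligned with the training columns, so pairing it with the column names shows which features matter most. A plain-Python sketch with made-up names and scores (in the notebook the inputs would be `X_train.columns` and `importances`; `top_features` is a hypothetical helper):

```python
def top_features(names, scores, k=5):
    """Return the k (name, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical values for illustration only; not the notebook's actual importances.
names = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week", "sex_ Male"]
scores = [0.15, 0.10, 0.30, 0.25, 0.12, 0.08]

for name, score in top_features(names, scores, k=5):
    print("{:<15} {:.2f}".format(name, score))
```

This performs the same selection as `np.argsort(importances)[::-1][:5]`, but keeps the names attached to the scores so the ranking can be read directly.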

### Feature selection
Train the model on the same training set using **only the five most important features**.

In [13]:
from sklearn.base import clone

# Reduce the feature space to the five most important features
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_val_reduced = X_val[X_val.columns.values[(np.argsort(importances)[::-1])[:5]]]

clf_on_reduced = (clone(best_clf)).fit(X_train_reduced, y_train)

reduced_predictions = clf_on_reduced.predict(X_val_reduced)

print ("Final Model trained on full data\n------")
print ("Accuracy on validation data: {:.4f}".format(accuracy_score(y_val, best_predictions)))
print ("F-score on validation data: {:.4f}".format(fbeta_score(y_val, best_predictions, beta = 0.5)))
print ("\nFinal Model trained on reduced data\n------")
print ("Accuracy on validation data: {:.4f}".format(accuracy_score(y_val, reduced_predictions)))
print ("F-score on validation data: {:.4f}".format(fbeta_score(y_val, reduced_predictions, beta = 0.5)))

Final Model trained on full data
------
Accuracy on validation data: 0.8722
F-score on validation data: 0.7559

Final Model trained on reduced data
------
Accuracy on validation data: 0.8431
F-score on validation data: 0.7136


## Results on the test set

In [14]:
# Evaluate the tuned model on the held-out test set and report accuracy and F-score
predictions_test = best_clf.predict(X_test)
accuracy_score_test = accuracy_score(y_test, predictions_test)
fbeta_score_test = fbeta_score(y_test, predictions_test, beta=0.5)
print("Accuracy:", accuracy_score_test)
print("F-score:", fbeta_score_test)

Accuracy: 0.8671088999447208
F-score: 0.7501566088953852
