# Chapter 4: Convert Education to Ordinal Values
Guilherme de Oliveira <br>
9/27/2016

I received an interesting suggestion regarding the <b><i>education</i></b> feature. Since the values of <b><i>education</i></b> have an implied ranking (e.g., 10th grade > 3rd grade), why not take that ranking into account and convert the string values to ordinal values rather than nominal values, as was done in <b>Chapter 1</b>? The idea is that the trees in the random forest might produce better results by exploiting that ordering. I investigate this here.
<br>
<br>
The work is split up in the following sections:
<ol>
<li> <b>Load and Preprocess Training Data</b><br>
Use the same overall strategy employed in <b>Chapter 1</b> to load the data, remove duplicates, convert boolean columns to integers, etc. The only difference is that I consider three data sets. The first contains important features, the second contains unimportant features, and the third contains all features. The idea is to use smaller subsets in the investigation.
<ol>
<li> the <b><i>important data set</i></b>: contains the most important features found via random forest feature importance in <b>Chapter 1</b>, namely capital_gains, detailed_occupation_code, dividends, sex, major_occupation_code, capital_losses, weeks_worked_in_year, age, num_persons_worked_for_employer and, of course, education.
<li> the <b><i>useless data set</i></b>: contains the features with zero coefficients found via logistic regression with L1 regularization in <b>Chapter 1</b>, namely detailed_household_family_stat, cob_father, cob_self, cob_mother, migration_code_move_within_reg, migration_code_change_in_reg, migration_code_change_in_msa, migration_prev_res_in_sunbelt, state_of_previous_residence, and year, to which I will tack on education.
<li> the <b><i>full data set</i></b> containing all features.
</ol>
<li> <b>Convert Education From String to Integer Categorical and Ordinal</b><br>
For the <b><i>important data set</i></b> and the <b><i>useless data set</i></b>, convert education twice: once using the method employed in <b>Chapter 1</b>, which ignores any ordering, and once using a predefined mapping that takes the ranking of educational levels into account. The <b><i>full data set</i></b> is converted the same way in Section 3.
<li> <b> Classification Using Random Forest</b> <br>
Using the same methodology employed in <b>Chapter 2</b>, fit a random forest for each data set and each mapping of education (six in total), using grid search with cross-validation.
<li> <b> Conclusion </b> <br>
For the <b><i>important data set</i></b>, the accuracy of the random forest is negligibly affected by taking the ranking of educational levels into account (test accuracy of 94.08% unranked versus 94.13% ranked). This is also true for the <b><i>full data set</i></b>, which makes sense since the <b><i>full data set</i></b> is dominated by the features in the <b><i>important data set</i></b>. For the <b><i>useless data set</i></b>, the effect is more pronounced: for example, the recall for the greater-than-50K class rose from about 4% to about 7%, although it remains very low. In conclusion, ranking education did not noticeably improve the random forest classifiers on the data sets that contain informative features.

After splitting the data into training and test sets (70%/30%), the distribution of negative (class 0) and positive (class 1) labels in the **test data** set is given by the following confusion matrix, which corresponds to a perfect classifier:
<table>
<tr>  <td> </td> <td align="center"> prediction 0 </td> <td align="center"> prediction 1 </td> </tr>
<tr> <td> class 0 </td> <td align='center'> 42,215 </td> <td align="center"> 0 </td> </tr>
<tr> <td> class 1 </td> <td align='center'>   0 </td> <td align="center"> 3,654 </td> </tr>
</table>

The random forest optimized for <b><i>accuracy</i></b>, for the <b><i>important</i></b> data set, and unranked education has the following classification results:
<table>
<tr>  <td> </td> <td> prediction 0 </td> <td> prediction 1 </td> </tr>
<tr> <td> class 0 </td> <td> 41,627 </td> <td> 588 </td>  </tr>
<tr> <td> class 1 </td> <td>  2,126 </td> <td> 1,528 </td> </tr>
</table>

The random forest optimized for <b><i>accuracy</i></b>, for the <b><i>important</i></b> data set, and <b><i>ranked</i></b> education has the following classification results:
<table>
<tr>  <td> </td> <td> prediction 0 </td> <td> prediction 1 </td> </tr>
<tr> <td> class 0 </td> <td> 41,600 </td> <td> 615 </td>  </tr>
<tr> <td> class 1 </td> <td>  2,080 </td> <td> 1,574 </td> </tr>
</table>

The random forest optimized for <b><i>accuracy</i></b>, for the <b><i>useless</i></b> data set, and unranked education has the following classification results:
<table>
<tr>  <td> </td> <td> prediction 0 </td> <td> prediction 1 </td> </tr>
<tr> <td> class 0 </td> <td> 42,085 </td> <td> 130 </td>  </tr>
<tr> <td> class 1 </td> <td>  3,504 </td> <td> 150 </td> </tr>
</table>

The random forest optimized for <b><i>accuracy</i></b>, for the <b><i>useless</i></b> data set, and <b><i>ranked</i></b> education has the following classification results:
<table>
<tr>  <td> </td> <td> prediction 0 </td> <td> prediction 1 </td> </tr>
<tr> <td> class 0 </td> <td> 42,069 </td> <td> 146 </td>  </tr>
<tr> <td> class 1 </td> <td>  3,394 </td> <td> 260 </td> </tr>
</table>

The random forest optimized for <b><i>accuracy</i></b>, for the <b><i>full</i></b> data set, and unranked education has the following classification results:
<table>
<tr>  <td> </td> <td> prediction 0 </td> <td> prediction 1 </td> </tr>
<tr> <td> class 0 </td> <td> 41,626 </td> <td>  589 </td>  </tr>
<tr> <td> class 1 </td> <td>  2,097 </td> <td> 1,557 </td> </tr>
</table>

The random forest optimized for <b><i>accuracy</i></b>, for the <b><i>full</i></b> data set, and <b><i>ranked</i></b> education has the following classification results:
<table>
<tr>  <td> </td> <td> prediction 0 </td> <td> prediction 1 </td> </tr>
<tr> <td> class 0 </td> <td> 41,602 </td> <td>  613 </td>  </tr>
<tr> <td> class 1 </td> <td>  2,071 </td> <td> 1,583 </td> </tr>
</table>

</ol>
<br>
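As a quick cross-check of the figures quoted in the conclusion, the sketch below recomputes the test accuracy and the class 1 recall from the six confusion matrices listed above. The helper names `accuracy` and `recall` and the `results` dictionary are illustrative and not part of the notebook.

```python
# Confusion matrix entries copied from the tables above, as (tn, fp, fn, tp),
# treating class 0 as negative and class 1 (savings > 50K) as positive.
results = {
    ('important', 'unranked'): (41627, 588, 2126, 1528),
    ('important', 'ranked'):   (41600, 615, 2080, 1574),
    ('useless', 'unranked'):   (42085, 130, 3504, 150),
    ('useless', 'ranked'):     (42069, 146, 3394, 260),
    ('full', 'unranked'):      (41626, 589, 2097, 1557),
    ('full', 'ranked'):        (41602, 613, 2071, 1583),
}

def accuracy(tn, fp, fn, tp):
    # fraction of all test rows that were classified correctly
    return float(tn + tp) / (tn + fp + fn + tp)

def recall(tn, fp, fn, tp):
    # fraction of true class 1 rows that were predicted as class 1
    return float(tp) / (tp + fn)

summary = {}
for key in sorted(results):
    summary[key] = (accuracy(*results[key]), recall(*results[key]))
    print('{:<10s} {:<9s} accuracy={:.5f} recall(class 1)={:.5f}'.format(
        key[0], key[1], summary[key][0], summary[key][1]))
```

The values agree with the classification reports printed in Section 3: the accuracy moves only in the fourth decimal place for the important and full data sets, while the class 1 recall on the useless data set goes from roughly 0.041 to 0.071.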


In [1]:
import pandas as pd

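# Note: these two modules were deprecated in scikit-learn 0.18 and removed in 0.20;
# in later releases, import both names from sklearn.model_selection instead.
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV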

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

## 1. Load and Preprocess Training Data

In [2]:
# column names and types - developed in Chapter 1
the_columns  = [('age', 'continuous'), 
                ('class_of_worker', 'nominal'), 
                ('detailed_industry_code', 'nominal'), 
                ('detailed_occupation_code', 'nominal'), 
                ('education', 'nominal'), 
                ('wage_per_hour', 'continuous'), 
                ('enrolled_in_edu_last_week', 'nominal'),
                ('marital_status', 'nominal'),
                ('major_industry_code', 'nominal'),
                ('major_occupation_code', 'nominal'),
                ('race', 'nominal'),
                ('hispanic_origin', 'nominal'),
                ('sex', 'binary'), # binary column with values Male/Female
                ('member_of_labor_union', 'nominal'), 
                ('reason_for_unemployment', 'nominal'),
                ('full_or_part_time_employment_stat', 'nominal'),
                ('capital_gains', 'continuous'),
                ('capital_losses', 'continuous'),
                ('dividends', 'continuous'),
                ('tax_filer', 'nominal'),
                ('region_of_previous_residence', 'nominal'),
                ('state_of_previous_residence', 'nominal'),
                ('detailed_household_family_stat', 'nominal'),
                ('detailed_household_summary', 'nominal'),
                ('instance_weight', 'IGNORE'), # as per instructions, to be dropped
                ('migration_code_change_in_msa', 'nominal'),
                ('migration_code_change_in_reg', 'nominal'),
                ('migration_code_move_within_reg', 'nominal'),
                ('live_in_this_house_1_yr_ago', 'nominal'),
                ('migration_prev_res_in_sunbelt', 'nominal'),
                ('num_persons_worked_for_employer', 'continuous'),
                ('family_members_under_18', 'nominal'),
                ('cob_father', 'nominal'),
                ('cob_mother', 'nominal'),
                ('cob_self', 'nominal'),
                ('citizenship', 'nominal'),
                ('own_business_or_self_employed', 'nominal'),
                ('fill_in_questionnaire_for_veterans_admin', 'nominal'),
                ('veterans_benefits', 'nominal'),
                ('weeks_worked_in_year', 'nominal'),
                ('year', 'nominal'), 
                ('savings','target')] # binary TARGET variable


In [3]:
raw_data = pd.read_csv('us_census_full/census_income_learn.csv', 
                       names=[c[0] for c in the_columns], 
                       index_col=False)

raw_data.drop('instance_weight', axis=1, inplace=True)

original_shape = raw_data.shape
print '\nThe raw data (minus the instance_weight variable) has',
print '{:d} rows and {:d} columns.'.format(original_shape[0], original_shape[1])

print 'As a sanity check, this agrees with the number of lines (199523) obtained',
print 'using the "wc -l" Unix command.\n'
assert original_shape[0] == 199523, "The number of rows is incorrect"

# note the original unaltered file contains 199523 lines as verified using Unix wc command:
# > wc -l census_income_learn.csv
# 199523


The raw data (minus the instance_weight variable) has 199523 rows and 41 columns.
As a sanity check, this agrees with the number of lines (199523) obtained using the "wc -l" Unix command.



### Eliminate Duplicate Rows

In [4]:
# find the duplicate rows, keep the first one
duplicate_rows = raw_data.duplicated(keep='first')

print '\nnumber of duplicates = {:d}'.format(duplicate_rows.sum())
raw_data = raw_data.drop_duplicates(keep='first')
new_shape =  raw_data.shape
print 'number of duplicates removed = {:d}'.format(original_shape[0] - new_shape[0])
print 'new shape = {:d}, {:d}\n'.format(raw_data.shape[0], raw_data.shape[1])


number of duplicates = 46627
number of duplicates removed = 46627
new shape = 152896, 41



### Convert Columns to Boolean Values

In [5]:
raw_data['savings'] = raw_data['savings'].map(
    lambda x: 1 if str(x).strip() == '50000+.' else 0)

raw_data['sex'] = raw_data['sex'].map(
    lambda x: 1 if str(x).strip() == 'Male' else 0)
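The `strip()` call matters because the string fields in the raw file carry a leading space (visible in the education value counts in Section 2). The toy sketch below shows the comparison failing without it. The label `' - 50000.'` for the negative class is an assumption for illustration, since this chapter only shows the positive label.

```python
# Toy values mimicking the raw 'savings' column; the leading space is the point.
# ' - 50000.' is an assumed spelling of the negative label (not shown in this chapter).
raw_values = [' 50000+.', ' - 50000.', ' 50000+.']

# Comparing the raw string directly never matches because of the leading space.
without_strip = [1 if v == '50000+.' else 0 for v in raw_values]

# Stripping whitespace first gives the intended 0/1 encoding.
with_strip = [1 if str(v).strip() == '50000+.' else 0 for v in raw_values]

print(without_strip)
print(with_strip)
```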

### Set Up DataFrames
We will construct two data sets: one based on the top ten features discovered in <b>Chapter 1</b> via random forests, and one based on the most "useless" features discovered via logistic regression. Education will be added to both data sets.

In [6]:
important_cols = ['capital_gains',
                  'detailed_occupation_code',
                  'dividends',
                  'sex',
                  'major_occupation_code',
                  'capital_losses',
                  'weeks_worked_in_year',
                  'age',
                  'num_persons_worked_for_employer',
                  'savings']

important_data = pd.DataFrame(raw_data[important_cols])

unique_occ_values = important_data['major_occupation_code'].unique()
mapping = {key:idx for idx,key in enumerate(unique_occ_values)}
important_data['major_occupation_code'] = important_data['major_occupation_code'].apply(lambda x : mapping[x])



In [7]:
useless_cols = ['detailed_household_family_stat', 
                'cob_father', 
                'cob_self', 
                'cob_mother', 
                'migration_code_move_within_reg',
                'migration_code_change_in_reg', 
                'migration_code_change_in_msa',
                'migration_prev_res_in_sunbelt',
                'state_of_previous_residence',
                'year']

useless_data = pd.DataFrame(raw_data[useless_cols])
useless_data['savings'] = raw_data['savings']
for col in useless_cols:
    unique_vals = useless_data[col].unique()
    mapping = {key:idx for idx,key in enumerate(unique_vals)}
    useless_data[col] = useless_data[col].apply(lambda x : mapping[x])


## 2. Convert Education to Integer

In [8]:
print '\nHere are the distinct values in education along with their counts:\n'
print raw_data['education'].value_counts()


Here are the distinct values in education along with their counts:

 High school graduate                      43642
 Some college but no degree                26329
 Bachelors degree(BA AB BS)                19391
 Children                                  12710
 10th grade                                 6487
 Masters degree(MA MS MEng MEd MSW MBA)     6460
 7th and 8th grade                          6309
 11th grade                                 6213
 Associates degree-occup /vocational        5249
 9th grade                                  4976
 Associates degree-academic program         4322
 5th or 6th grade                           3139
 12th grade no diploma                      2059
 Prof school degree (MD DDS DVM LLB JD)     1791
 1st 2nd 3rd or 4th grade                   1755
 Doctorate degree(PhD EdD)                  1262
 Less than 1st grade                         802
Name: education, dtype: int64


### Convert Education to Integer without Ranking
We use the same approach as in <b>Chapter 1</b> to convert nominal columns to integers. The mapping depends on the order in which the pandas ***unique*** method returns the distinct classes in a column, which is their order of first appearance in the data. The resulting mapping is displayed below and shows that ranking is ignored. For example, the values mapped to 0, 1, and 2 are "<b><i>High school graduate</i></b>", "<b><i>Some college but no degree</i></b>", and "<b><i>10th grade</i></b>".

In [9]:
unique_edu_values = raw_data['education'].unique()
mapping = {key:idx for idx,key in enumerate(unique_edu_values)}

important_data['education'] = raw_data['education'].apply(lambda x : mapping[x])
useless_data['education'] = raw_data['education'].apply(lambda x : mapping[x])

In [10]:
for k,v in mapping.iteritems():
    print '{:<40s} : {:d}'.format(k,v)

 1st 2nd 3rd or 4th grade                : 16
 12th grade no diploma                   : 9
 Less than 1st grade                     : 6
 Some college but no degree              : 1
 Masters degree(MA MS MEng MEd MSW MBA)  : 5
 7th and 8th grade                       : 8
 11th grade                              : 13
 Bachelors degree(BA AB BS)              : 4
 Prof school degree (MD DDS DVM LLB JD)  : 11
 9th grade                               : 15
 5th or 6th grade                        : 12
 Doctorate degree(PhD EdD)               : 14
 Associates degree-occup /vocational     : 10
 High school graduate                    : 0
 Children                                : 3
 10th grade                              : 2
 Associates degree-academic program      : 7
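The unranked codes are simply numbered by order of first appearance. As a minimal sketch on a toy column (not the census data), the same codes can be obtained in one call with `pd.factorize`, which also numbers values in order of appearance by default. The series `edu` below is made up for illustration.

```python
import pandas as pd

# Toy column; the leading spaces mimic the raw census strings.
edu = pd.Series([' High school graduate', ' 10th grade',
                 ' High school graduate', ' Children', ' 10th grade'])

# Approach used in this chapter: enumerate the values returned by unique().
mapping = {key: idx for idx, key in enumerate(edu.unique())}
codes_from_mapping = edu.map(mapping)

# One-call alternative: factorize returns the integer codes and the distinct values.
codes, uniques = pd.factorize(edu)

print(list(codes_from_mapping))
print(list(codes))
```

Either way, the integer assigned to a level says nothing about its educational rank; it only reflects where that level first shows up in the file.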


### Convert Education to Integer with Ranking
The conversion for the ranked case is carried out using the ranked mapping defined below.

In [11]:
ranked_map = {
'Children'                                : 0,
'Less than 1st grade'                     : 1,
'1st 2nd 3rd or 4th grade'                : 2,
'5th or 6th grade'                        : 3,
'7th and 8th grade'                       : 4,
'9th grade'                               : 5,
'10th grade'                              : 6,
'11th grade'                              : 7,
'12th grade no diploma'                   : 8,
'High school graduate'                    : 9,
'Associates degree-occup /vocational'     : 10,
'Associates degree-academic program'      : 11,
'Some college but no degree'              : 12,
'Bachelors degree(BA AB BS)'              : 13,
'Masters degree(MA MS MEng MEd MSW MBA)'  : 14,
'Prof school degree (MD DDS DVM LLB JD)'  : 15,
'Doctorate degree(PhD EdD)'               : 16}


In [12]:
important_data['education_ranked'] = raw_data['education'].apply(lambda x : ranked_map[x.strip()])
useless_data['education_ranked'] = raw_data['education'].apply(lambda x : ranked_map[x.strip()])
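To get a feel for what the ranked mapping could buy a decision tree, consider isolating everyone with a Bachelors degree or higher using threshold splits of the form `education <= t`. The sketch below counts how many cut points between adjacent integer codes separate that group from the rest under each mapping. The function name `thresholds_needed` is illustrative, and the count is only a rough proxy for the number of splits a tree would need on this feature.

```python
def thresholds_needed(group_codes, n_codes=17):
    """Count the cut points between adjacent codes where group membership flips."""
    member = [code in group_codes for code in range(n_codes)]
    return sum(1 for a, b in zip(member, member[1:]) if a != b)

# Codes of Bachelors, Masters, Prof school and Doctorate under each mapping,
# copied from the unranked mapping printed above and from ranked_map.
degree_unranked = {4, 5, 11, 14}
degree_ranked = {13, 14, 15, 16}

cuts_unranked = thresholds_needed(degree_unranked)
cuts_ranked = thresholds_needed(degree_ranked)
print('unranked: {} cut points, ranked: {} cut point'.format(cuts_unranked, cuts_ranked))
```

With the ranked codes a single threshold does the job, while the arbitrary codes require several. A sufficiently deep tree can still carve out the same group with the unranked codes, which may be part of why the accuracy differences reported in Section 3 are small.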

The first 16 rows (transposed) of each data set are displayed below.

In [13]:
important_data.head(16).transpose()

                                    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
capital_gains                       0    0    0    0    0    0 5178    0    0    0    0    0    0    0    0    0
detailed_occupation_code            0   34    0    0    0   10    3   40   26   37    0    0   34   31   12    0
dividends                           0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
sex                                 0    1    0    0    0    0    1    0    0    1    0    0    1    0    0    0
major_occupation_code               0    1    0    0    0    2    3    4    5    6    0    0    1    7    2    0
capital_losses                      0    0    0    0    0    0    0    0    0    0    0    0    0 1590    0    0
weeks_worked_in_year                0   52    0    0    0   52   52   30   52   52    0    0   52   52   52    0
age                                73   58   18    9   10   48   42   28   47   34    8   32   51   46   26   13
num_persons_worked_for_employer     0    1    0    0    0    1    6    4    5    6    0    0    3    6    6    0
savings                             0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0


In [14]:
useless_data.head(16).transpose()

                                 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
detailed_household_family_stat   0  1  2  3  3  4  1  5  4  1  3  6  1  1  7  3
cob_father                       0  0  1  0  0  2  0  0  0  0  0  3  0  4  0  0
cob_self                         0  0  1  0  0  0  0  0  0  0  0  2  0  3  0  0
cob_mother                       0  0  1  0  0  0  0  0  0  0  0  2  0  3  0  0
migration_code_move_within_reg   0  1  0  2  2  0  2  0  0  2  2  0  2  2  0  2
migration_code_change_in_reg     0  1  0  2  2  0  2  0  0  2  2  0  2  2  0  2
migration_code_change_in_msa     0  1  0  2  2  0  2  0  0  2  2  0  2  2  0  2
migration_prev_res_in_sunbelt    0  1  0  2  2  0  2  0  0  2  2  0  2  2  0  2
state_of_previous_residence      0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
year                             0  1  0  1  1  0  1  0  0  1  1  0  1  1  0  1


## 3. Classification Using Random Forest

In [15]:
def print_confusion_matrix(y_true, y_pred):
    # print confusion matrix
    header = '\t          prediction 0    prediction 1'
    row0 =   '\tclass 0 {:11,d} {:14,d}'
    row1 =   '\tclass 1 {:11,d} {:14,d}'
    cm = confusion_matrix(y_true, y_pred)
    print header
    print row0.format(cm[0,0], cm[0,1])
    print row1.format(cm[1,0], cm[1,1])
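    # rows are true classes, columns are predictions; class 1 is the positive class
    # (these counts are not used further in this function)
    tn, fp = float(cm[0,0]), float(cm[0,1])
    fn, tp = float(cm[1,0]), float(cm[1,1])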


In [16]:
def run_grid_search(classifier, parameters, X_train, y_train, X_test, y_test,
                    score=None, print_grid_scores=False, verbose=0, n_jobs=1):
    '''
    input: 
      classifier: scikit-learn classifier, 
      parameters: parameters for grid search, 
      X_train, y_train: the cross-validation training sets,
      X_test, y_test: the cross-validation test sets, 
      score: (None) the scoring function (e.g. accuracy, precision, recall, ...),
      print_grid_scores: (False) boolean to print the grid scores from the grid search
      verbose: (0) passed to the scikit-learn grid search
      n_jobs: (1) the number of jobs for parallel processing
    return: 
      the best fit scikit-learn classifier from gridsearch
    
    Run a grid search with cross-validation on the training data set for the given
    classifier, for the given range of parameters, and for the given scoring method.
    
    Print the confusion matrix and classification report for the test set.
    '''
    
    clf = GridSearchCV(classifier, parameters, scoring=score, 
                       verbose=verbose, cv=5, n_jobs=n_jobs)
    
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    print 'Best parameters on training set:'
    print clf.best_params_
    print '\nBest score = {:.4f}'.format(clf.best_score_)
    if print_grid_scores:
        print '\nGrid scores on training set:\n'
        for params, mean_score, scores in clf.grid_scores_:
            print("%0.4f (+/-%0.04f) for %r"
                  % (mean_score, scores.std() * 2, params))
    print
    print 'Confusion matrix:'
    print_confusion_matrix(y_test, y_pred)
    print '\nClassification report:'
    print classification_report(y_test, y_pred, digits=5)
    return clf

In [17]:
parameters = {'n_estimators': [80, 160], 
                 'max_depth': [4, 8, 16, 32]}

In [18]:
score = 'accuracy'
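The grid above defines 2 x 4 = 8 parameter combinations, and each one is evaluated with 5-fold cross-validation (`cv=5` in `run_grid_search`), which accounts for the "40 fits" reported in the logs below. A small sketch that enumerates the candidates with `itertools.product`, for illustration only:

```python
from itertools import product

parameters = {'n_estimators': [80, 160],
              'max_depth': [4, 8, 16, 32]}
cv_folds = 5  # matches cv=5 in run_grid_search

# Build every combination of the parameter values.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

total_fits = len(candidates) * cv_folds
print('{} candidates x {} folds = {} fits'.format(len(candidates), cv_folds, total_fits))
```

With the default `refit=True`, the grid search then fits the best candidate once more on the whole training set; that extra fit is not part of the 40 counted in the log.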

### Important Data

#### Unranked Education

In [19]:
X = important_data.drop(['education_ranked','savings'], axis=1)
y = important_data.loc[:,'savings']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = run_grid_search(RandomForestClassifier(), parameters,
                     X_train, y_train, X_test, y_test,
                     score=score, verbose=1, n_jobs=2)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters on training set:
{'n_estimators': 160, 'max_depth': 16}

Best score = 0.9411

Confusion matrix:
	          prediction 0    prediction 1
	class 0      41,627            588
	class 1       2,126          1,528

Classification report:
             precision    recall  f1-score   support

          0    0.95141   0.98607   0.96843     42215
          1    0.72212   0.41817   0.52964      3654

avg / total    0.93314   0.94083   0.93348     45869



[Parallel(n_jobs=2)]: Done  40 out of  40 | elapsed:  3.0min finished


#### Ranked Education

In [20]:
X = important_data.drop(['education','savings'], axis=1)
y = important_data.loc[:,'savings']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = run_grid_search(RandomForestClassifier(), parameters, 
                     X_train, y_train, X_test, y_test,
                     score=score, verbose=1, n_jobs=2)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters on training set:
{'n_estimators': 160, 'max_depth': 16}

Best score = 0.9414

Confusion matrix:
	          prediction 0    prediction 1
	class 0      41,600            615
	class 1       2,080          1,574

Classification report:
             precision    recall  f1-score   support

          0    0.95238   0.98543   0.96862     42215
          1    0.71905   0.43076   0.53876      3654

avg / total    0.93379   0.94125   0.93438     45869



[Parallel(n_jobs=2)]: Done  40 out of  40 | elapsed:  3.7min finished
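To put the difference between the two runs in perspective, the sketch below compares the change in test accuracy with the binomial standard error of a single accuracy estimate. This is only a rough yardstick: it assumes independent test samples, and since both models are scored on the same test rows, a paired comparison would be the more careful choice.

```python
from math import sqrt

n_test = 45869                     # size of the test set
correct_unranked = 41627 + 1528    # diagonal of the unranked confusion matrix
correct_ranked = 41600 + 1574      # diagonal of the ranked confusion matrix

acc_unranked = float(correct_unranked) / n_test
acc_ranked = float(correct_ranked) / n_test
diff = acc_ranked - acc_unranked

# Binomial standard error of one accuracy estimate on n_test samples.
std_err = sqrt(acc_unranked * (1 - acc_unranked) / n_test)

print('extra correct predictions: {}'.format(correct_ranked - correct_unranked))
print('accuracy difference: {:.5f}'.format(diff))
print('standard error:      {:.5f}'.format(std_err))
```

The ranked model gets 19 more test rows right out of 45,869, a difference smaller than one standard error, which is consistent with calling the effect negligible.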


### Useless Data

#### Unranked Education

In [21]:
X = useless_data.drop(['education_ranked','savings'], axis=1)
y = useless_data.loc[:,'savings']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = run_grid_search(RandomForestClassifier(), parameters,
                     X_train, y_train, X_test, y_test,
                     score=score, verbose=1, n_jobs=2)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters on training set:
{'n_estimators': 160, 'max_depth': 16}

Best score = 0.9186

Confusion matrix:
	          prediction 0    prediction 1
	class 0      42,085            130
	class 1       3,504            150

Classification report:
             precision    recall  f1-score   support

          0    0.92314   0.99692   0.95861     42215
          1    0.53571   0.04105   0.07626      3654

avg / total    0.89228   0.92077   0.88832     45869



[Parallel(n_jobs=2)]: Done  40 out of  40 | elapsed:  3.0min finished


#### Ranked Education

In [22]:
X = useless_data.drop(['education','savings'], axis=1)
y = useless_data.loc[:,'savings']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = run_grid_search(RandomForestClassifier(), parameters, 
                     X_train, y_train, X_test, y_test,
                     score=score, verbose=1, n_jobs=2)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters on training set:
{'n_estimators': 160, 'max_depth': 8}

Best score = 0.9212

Confusion matrix:
	          prediction 0    prediction 1
	class 0      42,069            146
	class 1       3,394            260

Classification report:
             precision    recall  f1-score   support

          0    0.92535   0.99654   0.95962     42215
          1    0.64039   0.07115   0.12808      3654

avg / total    0.90265   0.92282   0.89338     45869



[Parallel(n_jobs=2)]: Done  40 out of  40 | elapsed:  2.7min finished
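Because the classes are heavily imbalanced, it helps to compare these accuracies with a trivial baseline that always predicts class 0. The sketch below does that arithmetic using the class counts and confusion matrices reported above; the variable names are illustrative.

```python
n_class0, n_class1 = 42215, 3654   # test-set support for each class
n_test = n_class0 + n_class1

# Accuracy of always predicting class 0.
baseline = float(n_class0) / n_test

# Test accuracies of the two random forests on the useless data set.
acc_unranked = float(42085 + 150) / n_test
acc_ranked = float(42069 + 260) / n_test

print('baseline: {:.5f}'.format(baseline))
print('unranked: {:.5f} ({} more correct than baseline)'.format(
    acc_unranked, 42085 + 150 - n_class0))
print('ranked:   {:.5f} ({} more correct than baseline)'.format(
    acc_ranked, 42069 + 260 - n_class0))
```

Both models barely beat the baseline, which fits the "useless" label. The ranked encoding helps a little more, and since the other features carry almost no signal, that gain plausibly comes from education itself.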


### All Data

In [23]:
def convert_categorical_to_integer():
    # create new DataFrame that contains columns of type object 
    data = pd.DataFrame(raw_data.select_dtypes(include=['object']))
    columns = data.columns
    
    for column in columns:
        if column == 'education':
            continue
        unique_values = data[column].unique()
        dictionary = {key:idx for idx,key in enumerate(unique_values)}
        data[column] = data[column].apply(lambda x : dictionary[x])

    return data
 

In [24]:
data = convert_categorical_to_integer()

# add nominal columns that were already integer and didn't need to be converted
nominal_integer_columns = [c[0] for c in the_columns 
                           if c[1] == 'nominal' 
                           and c[0] not in data.columns]
data[nominal_integer_columns] = raw_data[nominal_integer_columns]

# add 'sex' column
data['sex'] = raw_data['sex']

# add continuous columns
continuous_columns = [c[0] for c in the_columns if c[1] == 'continuous']
data[continuous_columns] = raw_data[continuous_columns]

# add target (savings)
data['savings'] = raw_data['savings']

# verify that we aren't missing any columns
assert set(data.columns) == (set(raw_data.columns))

print 'The final shape is: {:,d} x {:d}.\n'.format(data.shape[0], data.shape[1])

The final shape is: 152,896 x 41.



#### Random Forest for Unranked Education

In [25]:
unique_edu_values = raw_data['education'].unique()
mapping = {key:idx for idx,key in enumerate(unique_edu_values)}

data['education'] = raw_data['education'].apply(lambda x : mapping[x])

X = data.drop('savings', axis=1)
y = data.loc[:,'savings']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = run_grid_search(RandomForestClassifier(), parameters, 
                     X_train, y_train, X_test, y_test,
                     score=score, verbose=1, n_jobs=2)


Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters on training set:
{'n_estimators': 160, 'max_depth': 32}

Best score = 0.9408

Confusion matrix:
	          prediction 0    prediction 1
	class 0      41,626            589
	class 1       2,097          1,557

Classification report:
             precision    recall  f1-score   support

          0    0.95204   0.98605   0.96874     42215
          1    0.72554   0.42611   0.53690      3654

avg / total    0.93400   0.94144   0.93434     45869



[Parallel(n_jobs=2)]: Done  40 out of  40 | elapsed:  5.0min finished


#### Random Forest for Ranked Education

In [26]:
data['education'] = raw_data['education'].apply(lambda x : ranked_map[x.strip()])

X = data.drop('savings', axis=1)
y = data.loc[:,'savings']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = run_grid_search(RandomForestClassifier(), parameters, 
                     X_train, y_train, X_test, y_test,
                     score=score, verbose=1, n_jobs=2)


Fitting 5 folds for each of 8 candidates, totalling 40 fits
Best parameters on training set:
{'n_estimators': 160, 'max_depth': 32}

Best score = 0.9417

Confusion matrix:
	          prediction 0    prediction 1
	class 0      41,602            613
	class 1       2,071          1,583

Classification report:
             precision    recall  f1-score   support

          0    0.95258   0.98548   0.96875     42215
          1    0.72086   0.43322   0.54120      3654

avg / total    0.93412   0.94149   0.93469     45869



[Parallel(n_jobs=2)]: Done  40 out of  40 | elapsed:  4.9min finished
