# Models

### The Dataset

As a refresher:
* Data from https://data.cityofnewyork.us/browse?q=poverty
* 12 annual data files, from 2005 to 2016 inclusive (e.g. NYCgov_Poverty_MeasureData__2016.csv)
* CSV files with ~80 columns and ~60,000 rows each
* Each file had essentially the same format and contained (mostly) the same information
* Data types included:
    * Classification types encoded as integers (e.g. 1 if in poverty, 2 if not in poverty)
    * Floats for financial data (e.g. wages for the calendar year)

I'll import a cleaned version of the files (see https://github.com/c74p/Springboard/blob/master/Capstone%20Project%201%20-%20Poverty/DataWranglingSummary.ipynb for details).

### Modeling approach

The overall poverty rate in New York City is roughly 20%, and many groups in the data (education, income, 
disability status, etc.) are imbalanced as well.  I'll use stratified train-test splits, along with samplers and classifiers designed for imbalanced data, to improve my model.

Overview of modeling approach:
1. Use all years, households only, classify yes/no for poverty. Test and compare Logistic Regression, Support Vector 
Machines (SVM), and Random Forest algorithms.
2. Run classifiers for individual years (the thresholds differ from year to year, so a predictor for a specific year would presumably be better for a specific year).
3. Test running regressors on household income and poverty threshold, in order to predict poverty classification. Test and
compare Linear Regression (Ordinary Least Squares), Stochastic Gradient Descent, and ElasticNet.
    a. This is not likely to be useful, but I'm doing it as a learning exercise.
4. Test steps 2 and 3 above at the person level, rather than at the household level.
    a. This is not likely to be useful, but I'm doing it as a learning exercise.
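
Because only about one in five units is in poverty, a plain random split can leave the training and test sets with different poverty rates. The sketch below shows how a stratified split avoids that. The labels and features are synthetic stand-ins (exactly 20% coded 1 for "in poverty"), and the names `X_demo` and `y_demo` are illustrative, not part of the real data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative labels: 1 = in poverty (20%), 2 = not in poverty (80%)
y_demo = np.array([1] * 200 + [2] * 800)
X_demo = np.arange(1000).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

overall_rate = (y_demo == 1).mean()
train_rate = (y_tr == 1).mean()
test_rate = (y_te == 1).mean()
print(overall_rate, train_rate, test_rate)
```

Without `stratify`, the two rates would drift apart by chance, which matters more the rarer the minority class is.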

### Housekeeping part 1: imports and file prep

After importing we'll make some quick modifications to the data.

In [1]:
# Basic imports and setup
# See below for model-specific imports
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt
import seaborn as sns
import tempfile
import time

# Model-specific imports
from dask_ml.preprocessing import Categorizer, DummyEncoder

from imblearn.combine import SMOTETomek
from imblearn.ensemble import BalancedBaggingClassifier, BalancedRandomForestClassifier, RUSBoostClassifier
from imblearn.metrics import classification_report_imbalanced, geometric_mean_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline
from imblearn.under_sampling import TomekLinks

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PowerTransformer, QuantileTransformer, Normalizer
from sklearn.svm import SVC

from statsmodels.discrete.discrete_model import Logit, LogitResults

%matplotlib inline

# Temporarily turn off warnings if they get to be too much
#import warnings
#warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
df = pd.read_csv('data/all_years.csv', index_col=0)

# Group the columns into 1) raw input variables, 2) id variables of various things, 3) American Community Survey (census)
# variables, 4) NYC government-calculated variables, and 5) output variables.
#
# The ACS and NYC variables are generally calculated from the raw input variables - my initial expectation is that
# the raw input variables can be thought of as independent variables, and that the ACS and NYC variables are not
# independent even though they are not output variables.

raw_inp_vars = ['AGEP', 'Boro', 'CIT', 'DIS', 'ENG', 'ESR', 'Ethnicity', 'HHT', 'HIUnit_Head', 'HousingStatus', 'JWTR', 'LANX', 'MAR', 'MSP','NP', 'Off_Threshold', 'PreTaxIncome_PU', 'REL', 'SCH', 'SCHG', 'SCHL', 'SEX', 'TEN', 'WKHP', 'WKW', 'Year']
id_vars = ['HIUnit_ID', 'Povunit_ID', 'PWGTP', 'SERIALNO', 'SNAPUnit_ID', 'SPORDER', 'TaxUnit_ID', 'WGTP']
acs_vars = ['AgeCateg', 'INTP_adj', 'OI_adj', 'MRGP_adj', 'PA_adj', 'RETP_adj', 'RNTP_adj', 'SEMP_adj', 'SSIP_adj', 'SSP_adj',  'WAGP_adj']
nyc_vars = ['CitizenStatus',  'EducAttain', 'FTPTWork', 'FamType_PU', 'NYCgov_Childcare', 'NYCgov_Commuting', 'NYCgov_EITC', 'NYCgov_FICAtax', 'NYCgov_HEAP', 'NYCgov_Housing', 'NYCgov_Income', 'NYCgov_IncomeTax', 'NYCgov_MOOP', 'NYCgov_MedPremiums', 'NYCgov_MedSpending', 'NYCgov_Nutrition', 'NYCgov_REL', 'NYCgov_SFN', 'NYCgov_SFR', 'NYCgov_SNAP', 'NYCgov_SchoolBreakfast', 'NYCgov_SchoolLunch', 'NYCgov_Threshold', 'NYCgov_WIC', 'Povunit_Rel', 'SNAPUnit_Rel',  'TaxUnit_FILER', 'TaxUnit_FILESTAT', 'TaxUnit_FILETYPE', 'TaxUnit_Rel', 'TotalWorkHrs_PU']
output_vars = ['NYCgov_PovGap', 'NYCgov_Pov_Stat', 'NYCgov_PovGapIndex', 'Off_Pov_Stat']
all_columns = raw_inp_vars + id_vars + acs_vars + nyc_vars + output_vars

In [3]:
# Create codes for the raw input variables that are number-coded, so we can create charts that make sense
raw_codes = {'Boro': {1: 'Bronx', 2: 'Brooklyn', 3: 'Manhattan', 4: 'Queens', 5: 'Staten Island'},
         'CIT': {1: 'Birth', 2: 'Territories', 3: 'US Parents', 4: 'Naturalized', 5: 'No'},
         'DIS': {0: 'NA', 1: 'Yes', 2: 'No'},
         'ENG': {0: '<5', 1: 'Very Well', 2: 'Well', 3: 'Not well', 4: 'Not at all', 5: 'Only Eng'},
         'ESR': {0: '<16', 1: 'EMP', 2:'EMP/NAW', 3: 'UNEMP', 4: 'AF', 5: 'AF/NAW', 6:'NILF'},
         'Ethnicity': {1: 'White', 2: 'Black', 3: 'Asian', 4: 'Hispanic', 5: 'Other'},
         'HHT': {0: 'NA', 1: 'MAR', 2: 'MNW', 3: 'WNM', 4: 'Malone', 5: 'MNAlone', 6: 'Walone', 7: 'WNalone'},
         'HIUnit_Head': {0: 'Not Head', 1: 'Head', 2: 'Not Head'},
         'HousingStatus': {0: 'NA', 1: 'Public', 2: 'Mitchell', 3: 'Subsidy', 4: 'Regulated', 5: 'OtherReg', 6: 'MarketRate', 7: 'NoCash', 8: 'OwnF&C', 9: 'Own-Mortgage'},
         'JWTR': {0: 'NA', 1: 'Car', 2: 'Bus', 3:'Streetcar', 4:'Subway', 5:'RR', 6:'Ferry', 7:'Taxi', 8:'Motorcycle', 9:'Bike', 10:'Walk', 11:'Home', 12: 'Other'},
         'LANX': {0: 'NA', 1: 'Yes', 2: 'Only Eng'},
         'MAR': {1: 'Married', 2:'Widowed', 3:'Divorced', 4:'Separated', 5:'Never Married'},
         'MSP': {0: 'NA', 1: 'Yes', 2:'Spouse absent', 3:'Widowed', 4:'Divorced', 5:'Separated', 6:'Never Married'},
         'REL': {0: 'Self', 1:'Spouse', 2:'Child', 3:'Adopted', 4:'Stepchild', 5:'Sibling', 6:'Parent', 7:'Grandchild', 8:'Parent-in-law', 9:'Child-in-law', 10:'Other', 11:'Boarder', 12:'Roommate', 13:'Partner', 14:'Foster', 15:'OtherNR', 16:'Inst', 17:'NonInst'},
         'SCH': {0: 'NA', 1: 'NoPast3Mos', 2:'Public', 3:'Private/Home'},
         'SCHG': {0: 'NA', 1:'Preschool', 2:'Kindergarten', 3:'1', 4:'2', 5:'3', 6:'4', 7:'5', 8:'6', 9:'7', 10:'8', 11:'9', 12:'10', 13:'11', 14:'12', 15:'College', 16:'Grad school'},
         'SCHL': {0: 'NA', 1:'None', 2:'Preschool', 3:'Kindergarten', 4:'1', 5:'2', 6:'3', 7:'4', 8:'5', 9:'6', 10:'7', 11:'8', 12:'9', 13:'10', 14:'11', 15:'12-NoDip', 16:'Diploma', 17:'GED', 18:'<1yrCollege', 19:'CollNoDegree', 20:'Associates', 21:'Bachelors', 22:'Masters', 23:'Professional', 24:'Doctorate'},
         'SEX': {1:'Male', 2:'Female'},
         'TEN': {0: 'NA', 1:'Mortgage', 2:'Free&Clear', 3:'Rent', 4:'OccButNoRent'},
         'WKW': {0:'NA', 1:'50-52', 2:'48-49', 3:'40-47', 4:'27-39', 5:'14-26', 6:'<13'},
        }

# Create codes for the nyc variables that are number-coded, so we can create charts that make sense
nyc_codes = {
    'CitizenStatus': {1: 'Birth', 2: 'Naturalized', 3: 'No'},
    'EducAttain': {0: 'NA', 1:'<HS', 2:'HS', 3:'SomeCollege', 4:'Bachelors+'},
    'FTPTWork': {1:'FTYR', 2:'<FTYR', 3:'None'},
    'FamType_PU': {1:'Family', 2:'Couple', 3:'M+kid', 4:'W+kid', 5:'Mnokid', 6:'Wnokid', 7:'Unrelated', 8:'UnrelAlone'},
    'NYCgov_REL': {0:'Self', 1:'Spouse', 2:'Child', 3:'Sibling', 4:'Parent', 5:'Grandkid', 6:'Inlaw', 7:'OtherRel', 8:'Boarder', 9:'Roommate', 10:'Partner', 11:'FosterKid', 12:'OtherNonRel'},
    'NYCgov_SFR': {0: 'NA', 1:'NoKids', 2:'Kids', 3:'OneParent', 4:'Kid', 5:'Kid-Monly', 6:'Kid-Wonly'},
    'Povunit_Rel': {1:'Head', 2:'Spouse/Ptnr', 3:'Child', 4:'Other'},
    'SNAPUnit_Rel': {1:'Head', 2:'Spouse/Ptnr', 3:'Child', 4:'Other'},
    'TaxUnit_FILER': {1:'Filer', 0:'Non-Filer'},
    'TaxUnit_FILESTAT': {0: 'NA', 1:'Joint', 2:'HH', 3:'MFS', 4:'Single'},
    'TaxUnit_FILETYPE': {0: 'NA', 1: 'Normal', 2:'Dependent', 3:'BelowThresh'},
    'TaxUnit_Rel': {1:'Head', 2:'Spouse/Ptnr', 3:'Child', 4:'Other', 5:'EIC', 6:'Relative'},
    'TotalWorkHrs_PU': {1:'3500+', 2:'2340-3500', 3:'1750-2340', 4:'<1750', 5:'None'}
    }

# Create a dataframe 'cats' that uses categorical coding, rather than numerical coding, based on the dictionaries above.
#cats = df.replace(nyc_codes)
#cats = cats.replace(raw_codes)
#cats = cats.replace({'NYCgov_Pov_Stat': {1: 'Pov', 2:'Not Pov'}, 
                     #'Off_Pov_Stat': {1:'Pov', 2:'Not Pov'}, 
                     #'AgeCateg': {1: 'U18', 2:'18-64', 3:'65+'}})

# Update one column so that NA's are all in one category
#cats.loc[cats['HIUnit_Head'].isna(), 'HIUnit_Head'] = 'NA'

In [4]:
# Key dataframes of interest

# cats (currently commented out above) would be the dataframe with all of the category variables un-encoded

# All 2016 data
all_2016 = df[df.Year == 2016]
#cats_2016 = cats[cats.Year == 2016]

# 2016 data for poverty units only
# For example, the data dictionary, sheet "Column Info", cell D81, says in relation to calculating the poverty gap:
# "retain only the reference person of each family in poverty (Povunit_Rel==1 & NYCgov_Poverty == 1)"
pu_2016 = df[(df.Year == 2016) & (df.Povunit_Rel == 1)]
#pu_cats_2016 = cats[(cats.Year == 2016) & (cats.Povunit_Rel == 'Head')]
#pu_cats_all_years = cats[(cats.Povunit_Rel == 'Head')]

# Our data set contains two sets of weights: household weights and person weights.  
# We need to separate out each column by whether it should be weighted as a household variable or a person variable.
# Lists to create weighted columns, separated based on whether they are personal or household statistics.
personal_vars = ['AGEP', 'Boro', 'CIT', 'SCH', 'SCHG', 'SCHL', 'SEX', 'ESR', 'ENG', 'LANX', 'MSP', 'MAR', 'NYCgov_EITC', 'WKW', 'WKHP', 'DIS', 'JWTR', 'WAGP_adj', 'INTP_adj', 'SEMP_adj', 'SSP_adj', 'SSIP_adj', 'PA_adj', 'RETP_adj', 'OI_adj', 'TaxUnit_Rel', 'NYCgov_REL', 'NYCgov_SFR', 'SNAPUnit_Rel', 'TaxUnit_FILER', 'TaxUnit_FILESTAT', 'TaxUnit_FILETYPE', 'Ethnicity', 'EducAttain', 'CitizenStatus', 'AgeCateg', 'FTPTWork', 'PWGTP'] 
pu_vars = ['MRGP_adj', 'RNTP_adj', 'NP', 'TEN', 'HHT', 'FamType_PU', 'HousingStatus', 'TotalWorkHrs_PU', 'PreTaxIncome_PU', 'NYCgov_Income', 'NYCgov_Threshold', 'NYCgov_Pov_Stat',  'NYCgov_Housing', 'NYCgov_Childcare', 'NYCgov_Commuting', 'NYCgov_MOOP', 'NYCgov_MedSpending', 'NYCgov_MedPremiums', 'NYCgov_HEAP', 'NYCgov_WIC', 'NYCgov_SNAP', 'NYCgov_SchoolLunch', 'NYCgov_SchoolBreakfast', 'NYCgov_Nutrition', 'NYCgov_FICAtax', 'NYCgov_IncomeTax', 'Off_Threshold', 'Off_Pov_Stat', 'NYCgov_PovGap', 'NYCgov_PovGapIndex', 'WGTP']
other_vars = ['HIUnit_Head', 'HIUnit_ID', 'NYCgov_SFN', 'Povunit_ID', 'Povunit_Rel', 'REL', 'SERIALNO', 'SNAPUnit_ID', 'SPORDER', 'TaxUnit_ID', 'Year']

### Feature Engineering - Categorical Variables
Setting up our categorical variables logically happens *after* our feature engineering, but it relies on the
structure of the data set *before* the new features exist.  So, let's do it now; if this seems out of order, the 
next Feature Engineering section should make it clearer.
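
To see why the ordering matters, here is a small sketch using the ENG codes from the code dictionary above, where the numeric codes (1 = 'Very Well' through 4 = 'Not at all') run opposite to English proficiency. The sample values and the names `eng_dtype` and `eng_demo` are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# ENG codes ordered from least to most English proficiency:
# 0 = under 5 years old, 4 = not at all, 3 = not well, 2 = well,
# 1 = very well, 5 = speaks only English
eng_dtype = CategoricalDtype(categories=[0, 4, 3, 2, 1, 5], ordered=True)

eng_demo = pd.Series([1, 4, 5, 2], dtype=eng_dtype)  # illustrative values

# Sorting follows the declared category order, not the numeric codes
print(eng_demo.sort_values().tolist())

# .cat.codes gives each value's position in the declared order
print(eng_demo.cat.codes.tolist())
```

An ordered categorical lets a downstream encoder or model treat "Not at all" as lower than "Very Well" even though its raw code is larger.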

In [5]:
X = all_2016.copy()

# All the categoricals that we'll have to set up
categoricals = ['AGEP', 'CIT', 'SCHL', 'SEX', 'ENG', 'MSP', 'WKW', 'WKHP', 'DIS', 'JWTR', 'Ethnicity', 'Boro', 'NP', 'TEN', 'HHT', 'HousingStatus', 'TotalWorkHrs_PU']

categories = {} # Dict for each initial categorical

# Some categoricals have odd ordering; some have no ordering
odd_ordering = {'ENG': [0, 4, 3, 2, 1, 5], 'WKW': [0, 6, 5, 4, 3, 2, 1], 'TotalWorkHrs_PU': [5, 4, 3, 2, 1]}
unordered = ['DIS', 'SEX', 'MSP', 'JWTR', 'Ethnicity', 'Boro', 'TEN', 'HHT', 'HousingStatus']

# Loop through and assign appropriate category structure
for feature in categoricals:
    if feature in odd_ordering.keys():
        cats = odd_ordering[feature]
    else:
        cats = X[feature].unique()
        cats.sort()
    if feature in unordered:
        categories[feature] = pd.Categorical(cats, ordered=False)
    else: # Category is ordered
        categories[feature] = pd.Categorical(cats, ordered=True, categories=cats)

# Create a dictionary 'dummy_these' that we'll pass to our dummy-maker later
# The poverty-unit categoricals can be passed as-is
# For the personal categoricals, we'll have features like 'AGEP_1', 'AGEP_2', ..., 'AGEP_20' - so we have to 
# loop through and assign categories
personal_categoricals = ['AGEP', 'CIT', 'SCHL', 'SEX', 'ENG', 'MSP', 'WKW', 'WKHP', 'DIS', 'JWTR', 'Ethnicity', 'Boro']
pu_categoricals = ['NP', 'TEN', 'HHT', 'HousingStatus', 'TotalWorkHrs_PU']

dummy_these = {}
    
for i in range(1,21):
    for feature in personal_categoricals:
        name = feature + '_' + str(i)
        dummy_these[name] = categories[feature]

for feature in pu_categoricals:
    dummy_these[feature] = categories[feature]

len(dummy_these) #245
len(personal_categoricals) #12
len(pu_categoricals)
#len(categoricals)
#len(personal_columns)
#print(categoricals)
#print(personal_columns)
dummy_these['TotalWorkHrs_PU']

[5, 4, 3, 2, 1]
Categories (5, int64): [5 < 4 < 3 < 2 < 1]

### Feature Engineering - Features
Our dataset contains people in poverty units (a household may contain one or more poverty units). The entire 
poverty unit either is or is not in poverty, but the data set as constructed has people in different rows (the data is
not tidy).

To tidy up, we'll move information on all the people in the poverty unit, into the row for that poverty unit.
Instead of having 3 people in a poverty unit represented by different rows, we'll put all three people in
the same row but different columns. The columns will be named 'AGEP_1', 'AGEP_2', 'AGEP_3', etc, with zero values in all
columns where person n does not exist.

There are three main columns of interest for this:
* SERIALNO is the serial number of each household.
* Povunit_ID is the serial number of the poverty unit within the household (1-18). Each household can have more than one poverty unit (although the vast majority of households have only one poverty unit).
* SPORDER is the serial number of a person in the household (1-20). Note that the dataset only assigns serial numbers to
the people in the *household*, not the *poverty unit*.  This means that if, for example, a household has two poverty units,
the first with two people and the second with three people, the head of the second poverty unit will have an SPORDER of 3,
not 1 (the numbering is one-based).  This is not a problem, but it is a quirk to be aware of when looking at dataset
rows for reference.
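
The reshaping described above can be sketched on a toy table. The values below are made up, and only one personal variable (AGEP) is shown; the idea is that each (SERIALNO, Povunit_ID) pair becomes one row and each person becomes a set of numbered columns.

```python
import pandas as pd

# Toy data: household 100 has a two-person poverty unit,
# household 200 has a single-person poverty unit. Values are made up.
people = pd.DataFrame({
    'SERIALNO':   [100, 100, 200],
    'Povunit_ID': [1, 1, 1],
    'SPORDER':    [1, 2, 1],
    'AGEP':       [34, 7, 61],
})

# One row per poverty unit, one column per (variable, person number);
# people who do not exist in a unit are filled with zero
wide = (people
        .set_index(['SERIALNO', 'Povunit_ID', 'SPORDER'])
        .unstack('SPORDER')
        .fillna(0))

# Flatten the two-level column index into names like 'AGEP_1', 'AGEP_2'
wide.columns = [f'{name}_{order}' for name, order in wide.columns]
print(wide)
```

Household 200 ends up with `AGEP_2` equal to zero, which is the "person n does not exist" convention described above.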

In [27]:
X = all_2016.copy()

categoricals = ['AGEP', 'CIT', 'SCHL', 'SEX', 'ENG', 'MSP', 'WKW', 'WKHP', 'DIS', 'JWTR', 'Ethnicity', 'Boro', 'NP', 'TEN', 'HHT', 'HousingStatus', 'TotalWorkHrs_PU']

# We'll create separate dataframes for personal and poverty-unit variables, then join them together
personal_columns = ['AGEP', 'CIT', 'SCHL', 'SEX', 'ENG', 'MSP', 'WKW', 'WKHP', 'DIS', 'JWTR', 'WAGP_adj', 'INTP_adj', 'SEMP_adj', 'SSP_adj', 'SSIP_adj', 'PA_adj', 'RETP_adj', 'OI_adj', 'Ethnicity', 'Boro']
pu_columns = ['NP', 'TEN', 'HHT', 'MRGP_adj', 'RNTP_adj', 'HousingStatus', 'TotalWorkHrs_PU', 'NYCgov_Pov_Stat']

# Create a dataframe for the personal columns, including our 3 indicator variables
X_pers = X.copy()
X_pers_columns = ['SERIALNO', 'Povunit_ID', 'SPORDER'] + personal_columns
X_pers = X_pers[X_pers_columns]

# Grouping by SERIALNO and Povunit_ID, put SPORDER (person # in household) at the top as multi-index columns
X_pers = X_pers.set_index(['SERIALNO', 'Povunit_ID', 'SPORDER']).unstack('SPORDER').fillna(0)

# Turn the multi-index columns into a single indexed column: 'AGEP_1', 'AGEP_2', 'AGEP_3', etc.
X_pers.columns = [f'{name}_{order}' for name, order in X_pers.columns]

# Create a dataframe for the poverty-unit columns, including our 3 indicator variables
X_pu = X.copy()
X_pu_columns = ['SERIALNO', 'Povunit_ID', 'SPORDER'] + pu_columns
X_pu = X_pu[X_pu_columns]

# Grouping by SERIALNO and Povunit_ID, put SPORDER (person # in household) at the top as multi-index columns
X_pu = X_pu.set_index(['SERIALNO', 'Povunit_ID', 'SPORDER']).unstack('SPORDER').fillna(0)

# Groupby and take the max of SPORDER (these are poverty-unit variables; if there is a nonzero value, it's unique)
X_pu = X_pu.stack().groupby(['SERIALNO', 'Povunit_ID']).max()

# Add the personal and poverty-unit dataframes
X = X_pers.add(X_pu, fill_value=0)
X.tail()

SERIALNO,Povunit_ID,AGEP_1,AGEP_10,AGEP_11,AGEP_12,AGEP_13,AGEP_14,AGEP_15,AGEP_16,AGEP_17,AGEP_18,...,WKW_19,WKW_2,WKW_20,WKW_3,WKW_4,WKW_5,WKW_6,WKW_7,WKW_8,WKW_9
1521345,1,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1521371,1,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1521389,1,57.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1521399,1,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1521415,1,36.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Let's hit it 

In [None]:
# Pull off 'NYCgov_Pov_Stat' for our target variable
y = X['NYCgov_Pov_Stat'].replace({1: 'Pov', 2: 'Not Pov'})
X = X.drop('NYCgov_Pov_Stat', axis='columns')

# Get train and test - be sure to stratify since this is imbalanced data (poverty ~20% of the set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# Transforms for pipeline: 
# 1) categorize to prep for one-hot encoding
# 2) one-hot encode, dropping one to avoid collinearity
# 3) deal with imbalanced data with sampling strategies (poverty is ~20% of total)
# 4) scale data
# 5) classifiers
categorizer = Categorizer(columns=dummy_these)
dummy_encoder = DummyEncoder(drop_first=True)
samplers = [SMOTE(random_state=42), SMOTETomek(random_state=42), TomekLinks(random_state=42)]
#scalers = [StandardScaler(), Normalizer(), PowerTransformer(), QuantileTransformer()]
scalers = [Normalizer()]
scaler = Normalizer()
#classifiers = [LogisticRegression(), SGDClassifier(), AdaBoostClassifier(), BaggingClassifier(), GradientBoostingClassifier(), 
               #RandomForestClassifier(), BalancedBaggingClassifier(), BalancedRandomForestClassifier(), RUSBoostClassifier()]
#classifiers = [BalancedBaggingClassifier(), BaggingClassifier(), RandomForestClassifier(), BalancedRandomForestClassifier(), 
               #AdaBoostClassifier(), GradientBoostingClassifier()]
#classifiers = [RandomForestClassifier(), BalancedRandomForestClassifier()]
classifiers = [BalancedRandomForestClassifier()]

#sampler = TomekLinks(random_state=42)
#scaler = QuantileTransformer()
#clf = LogisticRegression(solver='lbfgs', max_iter=200)
#clf = RandomForestClassifier(n_estimators=100)
#clf = AdaBoostClassifier()
#params={0: {'clf__C': [1, 1e-1, 1e-2, 1e-3], 'clf__max_iter': [1e2, 1e3, 1e4], # Logistic Regression
                               #'clf__solver': ['lbfgs', 'liblinear', 'sag', 'saga']}, 
        #1: {'n_estimators': [1e1, 1e2, 1e3], 'max_features': [5, 10, 50, 100], # Random Forest Classifier
                         #'criterion': ['gini', 'entropy']}
       #}
#params = {0: {'clf__n_estimators': [10, 100, 1000], 'clf__max_features': [5, 10, 50, 100],
              #'clf__criterion': ['gini', 'entropy']},
          #1: {'clf__n_estimators': [10, 100, 1000], 'clf__max_features': [5, 10, 50, 100],
              #'clf__criterion': ['gini', 'entropy'], 'clf_sampling_strategy': [0.05, 0.25, 0.5, 0.75, 0.95]}
         #}

params = {0: {'clf__n_estimators': [1000], 'clf__max_features': [100],
              'clf__sampling_strategy': ['not minority', 'not majority', 'all']}}
          #1: {'clf__n_estimators': [10, 100, 1000], 'clf__max_features': [5, 10, 50, 100],
              #'clf__criterion': ['gini', 'entropy']},
         #}

#parameters = {'clf__n_estimators': [10, 100, 1000], 'clf__max_features': [5, 10, 50, 100], 'clf__criterion': ['gini', 'entropy']}
parameters = {'clf__n_estimators': [100], 'clf__max_features': ['auto'], 'clf__criterion': ['gini']}

cachedir = tempfile.mkdtemp()

#pipeline = imbPipeline(steps=[('cat', categorizer),
                              #('dummies', dummy_encoder),
                              #('sampler', sampler),
                              #('scaler', scaler),
                              #('clf', BalancedRandomForestClassifier())], 
                      #memory=cachedir)
                    
#grid = GridSearchCV(estimator=pipeline, param_grid=parameters, cv=5, n_jobs=-1, pre_dispatch=2, verbose=9)#, scoring=balanced_accuracy_score())
#grid = GridSearchCV(estimator=pipeline, param_grid=parameters, cv=5, n_jobs=-1, verbose=9)

#t0 = time.time()
#grid.fit(X_train, y_train)
#time_to_fit = time.time() - t0
#print('Took: ' + '{:4.2f}'.format(time_to_fit) + 's to fit ')
#print(grid.cv_results_)

#for sampler, i in zip(samplers, range(len(samplers))):
for i in range(1):
    #for scaler, j in zip(scalers, range(len(scalers))):
    for scaler in scalers:
        for k in range(len(classifiers)):
            #pipeline = Pipeline(steps=[#('cat', categorizer),
            pipeline = imbPipeline(steps=[('cat', categorizer),
                                          ('dummies', dummy_encoder),
                                          #('sampler', sampler),
                                          ('scaler', scaler),
                                          ('clf', classifiers[k])],
                                  memory=cachedir)

            #print(pipeline)
            #print(params[i])
            #pipeline.get_params().keys()
            grid = GridSearchCV(estimator=pipeline, param_grid=params[k], cv=3, n_jobs=-1, verbose=9)#, scoring=balanced_accuracy_score())

            t0 = time.time()
            #pipeline.fit(X_train, y_train)
            grid.fit(X_train, y_train)
            time_to_fit = time.time() - t0
            print('Took: ' + '{:4.2f}'.format(time_to_fit) + 's to fit ')
            print(grid.cv_results_)
            print('best estimator: ' + str(grid.best_estimator_))
            print('best params: ' + str(grid.best_params_))
            print('best index: ' + str(grid.best_index_))
            
            #print(str(sampler) + ',' + str(scaler) + ',' + str(classifiers[k]))
            #print(str(scaler) + ',' + str(classifiers[k]))
                  
            #means = grid.cv_results_['mean_test_score']
            #stds = grid.cv_results_['std_test_score']
            #for mean, std, params in zip(means, stds, grid.cv_results_['params']):
                #print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

            #predictions = pipeline.predict(X_test)

#print('Predictions: ' + str(predictions))
#print('Actual:\n' + str(y_small))
            #print('\nBalanced accuracy: ' + str(balanced_accuracy_score(y_test, predictions)))
            #print('Geometric mean: ' + str(geometric_mean_score(y_test, predictions)))
            #print('Confusion matrix:\n' + str(confusion_matrix(y_test, predictions)))
            #print('\nClassification report:\n' + str(classification_report_imbalanced(y_test, predictions)))

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


In [13]:
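# geometric_mean_score expects (y_true, y_pred); the value below was recorded with the arguments reversed
geometric_mean_score(y_test, grid.predict(X_test))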

0.7975358685378098
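
For a two-class problem, the geometric mean score is the square root of the product of the recall on each class, so a model that ignores the minority class scores zero even when its plain accuracy is high. The sketch below computes it by hand with NumPy on made-up labels; `geometric_mean_of_recalls` is an illustrative helper, not imblearn's implementation.

```python
import numpy as np

def geometric_mean_of_recalls(y_true, y_pred):
    """Geometric mean of per-class recall (illustrative helper)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Recall for a class = share of its true members that were predicted correctly
    recalls = [np.mean(y_pred[y_true == label] == label)
               for label in np.unique(y_true)]
    return float(np.prod(recalls) ** (1 / len(recalls)))

# Made-up labels: 2 of 10 units are in poverty
y_true_demo = ['Pov'] * 2 + ['Not Pov'] * 8

# A model that never predicts poverty: 80% accuracy, but zero recall on 'Pov'
always_not_pov = ['Not Pov'] * 10

# A model that misses one unit in each class
one_of_each_missed = ['Pov', 'Not Pov'] + ['Not Pov'] * 7 + ['Pov']

score_ignore_minority = geometric_mean_of_recalls(y_true_demo, always_not_pov)
score_balanced = geometric_mean_of_recalls(y_true_demo, one_of_each_missed)
print(score_ignore_minority, score_balanced)
```

Both toy models have the same accuracy (8 of 10 correct), yet only the second gets a non-zero score, which is why this metric suits a data set where poverty is the minority class.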

In [35]:
# Fit a stand-alone classifier (outside the grid search) so we can inspect feature importances
tester = BalancedRandomForestClassifier().fit(X=X_train, y=y_train)
#len(tester.feature_importances_)
#X_train.columns
#geometric_mean_score(tester.predict(X_test), y_test)  #0.775727880752169
imps = list(zip(tester.feature_importances_, X_train.columns))
#imps.sort(key=lambda x: x[1])
sorted(imps, key=lambda tup: tup[0])[-15:]

[(0.019535791275811728, 'WKW_2'),
 (0.020252231346152935, 'INTP_adj_1'),
 (0.020452137721584687, 'SCHL_2'),
 (0.020767468686063252, 'RNTP_adj'),
 (0.02515045254416771, 'WKHP_2'),
 (0.026205088630138082, 'AGEP_1'),
 (0.02817688155652038, 'SCHL_1'),
 (0.03238306328525264, 'WKW_1'),
 (0.03242418603775719, 'JWTR_1'),
 (0.03587014309356664, 'RETP_adj_1'),
 (0.041251391667779504, 'WKHP_1'),
 (0.05057339662820781, 'SSP_adj_1'),
 (0.052182437399624144, 'WAGP_adj_2'),
 (0.07156076688876317, 'TotalWorkHrs_PU'),
 (0.14873418922209963, 'WAGP_adj_1')]

In [20]:
results = pd.DataFrame({'mean_fit_time': np.array([ 29.3565313 ,  21.269967  , 133.63124331,   6.9153959 ,
        17.71104503, 151.52857844,   8.77863534,  36.48309787,
       353.55145939,  10.91108894,  62.25011571, 604.09919691,
         6.63593658,  18.70342271, 125.81533424,   5.24119258,
        16.36292839, 141.45259889,   7.89260626,  28.13510966,
       273.91240621,   9.28567576,  44.10937214, 356.8786389 ]), 'std_fit_time': np.array([ 0.77492853,  2.90770999, 11.63067571,  0.2850153 ,  0.98171153,
       11.44797585,  0.29892795,  3.36007335, 24.64549427,  0.59362711,
        6.20874779, 40.38846034,  0.20280898,  0.27479866, 10.47984029,
        0.84278515,  1.09157463, 12.09636917,  0.18592348,  1.8504363 ,
       17.28253135,  0.37329269,  4.38992519, 99.02754836]), 'mean_score_time': np.array([ 7.33835514,  8.92508117, 24.09163777,  8.09356125,  7.32807509,
       19.51869694,  7.59556937,  6.94122378, 14.72333399,  7.78833922,
        7.38280678, 15.11881781,  8.04199298,  9.8266205 , 25.7414674 ,
        6.29152044,  8.31556026, 18.50302108,  7.92796397,  7.69944366,
       13.55180319,  7.96719853,  7.42520293, 11.7625773 ]), 'std_score_time': np.array([0.1306685 , 2.03098883, 2.33742256, 0.45626301, 1.50775737,
       3.48599838, 0.21847926, 1.10654311, 2.18771397, 0.04787844,
       1.2097773 , 2.10400838, 0.09428107, 0.18964651, 3.18375043,
       1.19595878, 0.94813268, 1.95163257, 0.19510065, 1.0073612 ,
       1.71333338, 0.16709411, 1.21011483, 2.31377346]), 'param_clf__criterion': np.ma.masked_array(data=['gini', 'gini', 'gini', 'gini', 'gini', 'gini', 'gini',
                   'gini', 'gini', 'gini', 'gini', 'gini', 'entropy',
                   'entropy', 'entropy', 'entropy', 'entropy', 'entropy',
                   'entropy', 'entropy', 'entropy', 'entropy', 'entropy',
                   'entropy'],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_clf__max_features': np.ma.masked_array(data=[5, 5, 5, 10, 10, 10, 50, 50, 50, 100, 100, 100, 5, 5,
                   5, 10, 10, 10, 50, 50, 50, 100, 100, 100],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_clf__n_estimators': np.ma.masked_array(data=[10, 100, 1000, 10, 100, 1000, 10, 100, 1000, 10, 100,
                   1000, 10, 100, 1000, 10, 100, 1000, 10, 100, 1000, 10,
                   100, 1000],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'clf__criterion': 'gini', 'clf__max_features': 5, 'clf__n_estimators': 10}, {'clf__criterion': 'gini', 'clf__max_features': 5, 'clf__n_estimators': 100}, {'clf__criterion': 'gini', 'clf__max_features': 5, 'clf__n_estimators': 1000}, {'clf__criterion': 'gini', 'clf__max_features': 10, 'clf__n_estimators': 10}, {'clf__criterion': 'gini', 'clf__max_features': 10, 'clf__n_estimators': 100}, {'clf__criterion': 'gini', 'clf__max_features': 10, 'clf__n_estimators': 1000}, {'clf__criterion': 'gini', 'clf__max_features': 50, 'clf__n_estimators': 10}, {'clf__criterion': 'gini', 'clf__max_features': 50, 'clf__n_estimators': 100}, {'clf__criterion': 'gini', 'clf__max_features': 50, 'clf__n_estimators': 1000}, {'clf__criterion': 'gini', 'clf__max_features': 100, 'clf__n_estimators': 10}, {'clf__criterion': 'gini', 'clf__max_features': 100, 'clf__n_estimators': 100}, {'clf__criterion': 'gini', 'clf__max_features': 100, 'clf__n_estimators': 1000}, {'clf__criterion': 'entropy', 'clf__max_features': 5, 'clf__n_estimators': 10}, {'clf__criterion': 'entropy', 'clf__max_features': 5, 'clf__n_estimators': 100}, {'clf__criterion': 'entropy', 'clf__max_features': 5, 'clf__n_estimators': 1000}, {'clf__criterion': 'entropy', 'clf__max_features': 10, 'clf__n_estimators': 10}, {'clf__criterion': 'entropy', 'clf__max_features': 10, 'clf__n_estimators': 100}, {'clf__criterion': 'entropy', 'clf__max_features': 10, 'clf__n_estimators': 1000}, {'clf__criterion': 'entropy', 'clf__max_features': 50, 'clf__n_estimators': 10}, {'clf__criterion': 'entropy', 'clf__max_features': 50, 'clf__n_estimators': 100}, {'clf__criterion': 'entropy', 'clf__max_features': 50, 'clf__n_estimators': 1000}, {'clf__criterion': 'entropy', 'clf__max_features': 100, 'clf__n_estimators': 10}, {'clf__criterion': 'entropy', 'clf__max_features': 100, 'clf__n_estimators': 100}, {'clf__criterion': 'entropy', 'clf__max_features': 100, 'clf__n_estimators': 1000}], 'split0_test_score': 
np.array([0.91245407, 0.91866211, 0.91942227, 0.90903332, 0.91625491,
       0.91980236, 0.91600152, 0.92246294, 0.92296972, 0.91460788,
       0.92702395, 0.92727733, 0.91118713, 0.92018244, 0.91967566,
       0.91308755, 0.92018244, 0.921196  , 0.91435449, 0.92525022,
       0.92715064, 0.91600152, 0.92639047, 0.92854428]), 'split1_test_score': np.array([0.91320324, 0.92004562, 0.92042575, 0.91180943, 0.92080588,
       0.91979219, 0.9175114 , 0.92435378, 0.92549417, 0.91409022,
       0.92967562, 0.92790167, 0.91104916, 0.91953877, 0.92042575,
       0.90927522, 0.92194627, 0.92181956, 0.91586417, 0.92663457,
       0.9270147 , 0.92017233, 0.92904207, 0.93132286]), 'split2_test_score': np.array([0.91116462, 0.92117602, 0.91990876, 0.91040426, 0.92028894,
       0.92180966, 0.91838804, 0.92523128, 0.92700545, 0.9177544 ,
       0.92599164, 0.92789254, 0.91078444, 0.9210493 , 0.92028894,
       0.9153466 , 0.92244329, 0.92054239, 0.91471296, 0.92497782,
       0.92738563, 0.91940185, 0.92687872, 0.92713218]), 'mean_test_score': np.array([0.91227403, 0.91996114, 0.91991891, 0.91041561, 0.9191164 ,
       0.92046798, 0.91730022, 0.92401588, 0.92515628, 0.91548403,
       0.92756378, 0.92769049, 0.91100693, 0.9202568 , 0.92013009,
       0.91256969, 0.92152391, 0.92118601, 0.91497719, 0.92562088,
       0.92718365, 0.91852509, 0.92743707, 0.92899983]), 'std_test_score': np.array([0.00084192, 0.00102804, 0.00040974, 0.00113341, 0.00203455,
       0.00094862, 0.00098567, 0.00115515, 0.00166481, 0.00161909,
       0.00155163, 0.0002922 , 0.00016709, 0.00061889, 0.00032618,
       0.00250546, 0.00097011, 0.00052144, 0.00064403, 0.00072536,
       0.00015322, 0.00181211, 0.00115228, 0.00174085]), 'rank_test_score': np.array([22, 14, 15, 24, 16, 11, 18,  8,  7, 19,  3,  2, 23, 12, 13, 21,  9,
       10, 20,  6,  5, 17,  4,  1], dtype=np.int32), 'split0_train_score': np.array([0.99708547, 1.        , 1.        , 0.99651524, 1.        ,
       1.        , 0.99524805, 1.        , 1.        , 0.99581829,
       1.        , 1.        , 0.99531141, 1.        , 1.        ,
       0.99645188, 1.        , 1.        , 0.9962618 , 1.        ,
       1.        , 0.99746563, 1.        , 1.        ]), 'split1_train_score': np.array([0.99562848, 1.        , 1.        , 0.99664217, 1.        ,
       1.        , 0.99562848, 1.        , 1.        , 0.99619868,
       1.        , 1.        , 0.99588191, 1.        , 1.        ,
       0.99664217, 1.        , 1.        , 0.99581855, 1.        ,
       1.        , 0.99600862, 1.        , 1.        ]), 'split2_train_score': np.array([0.99676908, 1.        , 1.        , 0.99670573, 1.        ,
       1.        , 0.99632563, 1.        , 1.        , 0.99689579,
       1.        , 1.        , 0.99588217, 1.        , 1.        ,
       0.99607222, 0.99993665, 1.        , 0.99569211, 1.        ,
       1.        , 0.99575546, 1.        , 1.        ]), 'mean_train_score': np.array([0.99649435, 1.        , 1.        , 0.99662105, 1.        ,
       1.        , 0.99573405, 1.        , 1.        , 0.99630425,
       1.        , 1.        , 0.99569183, 1.        , 1.        ,
       0.99638876, 0.99997888, 1.        , 0.99592415, 1.        ,
       1.        , 0.9964099 , 1.        , 1.        ]), 'std_train_score': np.array([6.25733424e-04, 0.00000000e+00, 0.00000000e+00, 7.91906591e-05,
       0.00000000e+00, 0.00000000e+00, 4.46206266e-04, 0.00000000e+00,
       0.00000000e+00, 4.46177186e-04, 0.00000000e+00, 0.00000000e+00,
       2.68995301e-04, 0.00000000e+00, 0.00000000e+00, 2.36922985e-04,
       2.98640811e-05, 0.00000000e+00, 2.44267970e-04, 0.00000000e+00,
       0.00000000e+00, 7.53630262e-04, 0.00000000e+00, 0.00000000e+00])})
results

| Unnamed: 0 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf__criterion | param_clf__max_features | param_clf__n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29.356531 | 0.774929 | 7.338355 | 0.130668 | gini | 5 | 10 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.912454 | 0.913203 | 0.911165 | 0.912274 | 0.000842 | 22 | 0.997085 | 0.995628 | 0.996769 | 0.996494 | 0.000626 |
| 1 | 21.269967 | 2.90771 | 8.925081 | 2.030989 | gini | 5 | 100 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.918662 | 0.920046 | 0.921176 | 0.919961 | 0.001028 | 14 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 2 | 133.631243 | 11.630676 | 24.091638 | 2.337423 | gini | 5 | 1000 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.919422 | 0.920426 | 0.919909 | 0.919919 | 0.00041 | 15 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 3 | 6.915396 | 0.285015 | 8.093561 | 0.456263 | gini | 10 | 10 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.909033 | 0.911809 | 0.910404 | 0.910416 | 0.001133 | 24 | 0.996515 | 0.996642 | 0.996706 | 0.996621 | 7.9e-05 |
| 4 | 17.711045 | 0.981712 | 7.328075 | 1.507757 | gini | 10 | 100 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.916255 | 0.920806 | 0.920289 | 0.919116 | 0.002035 | 16 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 5 | 151.528578 | 11.447976 | 19.518697 | 3.485998 | gini | 10 | 1000 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.919802 | 0.919792 | 0.92181 | 0.920468 | 0.000949 | 11 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 6 | 8.778635 | 0.298928 | 7.595569 | 0.218479 | gini | 50 | 10 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.916002 | 0.917511 | 0.918388 | 0.9173 | 0.000986 | 18 | 0.995248 | 0.995628 | 0.996326 | 0.995734 | 0.000446 |
| 7 | 36.483098 | 3.360073 | 6.941224 | 1.106543 | gini | 50 | 100 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.922463 | 0.924354 | 0.925231 | 0.924016 | 0.001155 | 8 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 8 | 353.551459 | 24.645494 | 14.723334 | 2.187714 | gini | 50 | 1000 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.92297 | 0.925494 | 0.927005 | 0.925156 | 0.001665 | 7 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 9 | 10.911089 | 0.593627 | 7.788339 | 0.047878 | gini | 100 | 10 | `{'clf__criterion': 'gini', 'clf__max_features'...` | 0.914608 | 0.91409 | 0.917754 | 0.915484 | 0.001619 | 19 | 0.995818 | 0.996199 | 0.996896 | 0.996304 | 0.000446 |
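
One way to read the table above is to rank the parameter combinations by mean test score and look at how far each one's mean train score sits above it. The following is a minimal sketch in plain Python that uses only the ten rows displayed, with values copied by hand from the table. The names `rows` and `summarize` are illustrative and do not appear elsewhere in this notebook.

```python
# Values copied from the ten displayed rows of the grid-search table:
# (criterion, max_features, n_estimators, mean_test_score, mean_train_score)
rows = [
    ("gini", 5, 10, 0.912274, 0.996494),
    ("gini", 5, 100, 0.919961, 1.0),
    ("gini", 5, 1000, 0.919919, 1.0),
    ("gini", 10, 10, 0.910416, 0.996621),
    ("gini", 10, 100, 0.919116, 1.0),
    ("gini", 10, 1000, 0.920468, 1.0),
    ("gini", 50, 10, 0.9173, 0.995734),
    ("gini", 50, 100, 0.924016, 1.0),
    ("gini", 50, 1000, 0.925156, 1.0),
    ("gini", 100, 10, 0.915484, 0.996304),
]


def summarize(rows):
    """Sort rows by mean test score (best first) and append the train-test gap."""
    ranked = sorted(rows, key=lambda r: r[3], reverse=True)
    return [(*r, round(r[4] - r[3], 6)) for r in ranked]


summary = summarize(rows)
for criterion, max_features, n_estimators, test, train, gap in summary:
    print(
        f"{criterion:8s} max_features={max_features:<4d} "
        f"n_estimators={n_estimators:<5d} "
        f"test={test:.4f} train={train:.4f} gap={gap:.4f}"
    )
```

Among these ten rows, every combination scores at least 0.07 higher on the training folds than on the held-out folds, so the test columns are the ones to compare. With pandas, the same ordering should be available directly from the `results` frame via `results.sort_values('rank_test_score')`.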
