# Building a Classification Model with Features Generated Using Featuretools

In this chapter, we learned about Featuretools and how to use it to generate features automatically. In this activity, we will apply what we have learned to a new dataset: a modified version of the Adult dataset from the UCI Machine Learning Repository (Irvine, CA: University of California, School of Information and Computer Science), where the original data is distributed as the adult.data file. Each row describes a working adult through attributes such as age, occupation, education, and native country (the native column). The task is to predict whether a particular adult earns more than 50,000 per year.

The attributes are described in the adult.names file that accompanies the data in the repository. The dataset mixes categorical and numerical columns, which makes it a good dataset on which to try out what you have learned about Featuretools.

In this activity, you will build a logistic regression model on the adult dataset to predict whether an adult earns more than 50,000 per year. You will begin by fitting a benchmark model on the raw features and noting its metrics. You will then generate new features using Featuretools and fit a second model on the expanded dataset. Finally, compare the metrics of the two models to see how the generated features affect performance.

In [83]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import featuretools as ft

In [84]:
df = pd.read_csv('../Datasets/adult.csv', sep=',', na_values=' ?')
df

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours,native,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,Female,0,0,38,United-States,0
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,Male,0,0,40,United-States,1
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,Female,0,0,40,United-States,0
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,Male,0,0,20,United-States,0


In [85]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
sex               object
capital-gain       int64
capital-loss       int64
hours              int64
native            object
label              int64
dtype: object

In [86]:
# drop rows with NaN
df.dropna(axis=0, inplace=True)
df

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours,native,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,Female,0,0,38,United-States,0
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,Male,0,0,40,United-States,1
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,Female,0,0,40,United-States,0
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,Male,0,0,20,United-States,0


In [87]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'sex', 'capital-gain',
       'capital-loss', 'hours', 'native', 'label'],
      dtype='object')

In [88]:
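# Separate the target (y) from the features (X)
y = df.pop('label')
X = df.copy()  # copy so that ID columns added to df later do not appear in X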
print(X.shape)
print(y.shape)

(30162, 13)
(30162,)


## Benchmark Model

In [89]:
# Split train-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(21113, 13)
(9049, 13)
(21113,)
(9049,)


In [90]:
# Pipeline for transforming categorical variables
catFeatures = X.select_dtypes(include=['object']).columns
catTransformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
 
# Pipeline for scaling numerical variables
numFeatures = X.select_dtypes(include=['int64']).columns
numTransformer = Pipeline(steps=[('scaler', StandardScaler())])

# Create the preprocessing engine
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numTransformer, numFeatures),
        ('categoric', catTransformer, catFeatures)
    ]
)
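
The preprocessor scales the numeric columns and one-hot encodes the categorical ones. A minimal sketch on made-up rows (not taken from the adult dataset) shows what `handle_unknown='ignore'` does when the test data contains a category that was absent during fitting: the category is encoded as all zeros instead of raising an error. The names `train`, `unseen`, and `toy_preprocessor` are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration
train = pd.DataFrame({'hours': [20, 40, 60], 'sex': ['Male', 'Female', 'Male']})
unseen = pd.DataFrame({'hours': [40], 'sex': ['Other']})  # category absent from train

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', StandardScaler(), ['hours']),
        ('categoric', OneHotEncoder(handle_unknown='ignore'), ['sex']),
    ]
)

toy_preprocessor.fit(train)
encoded = toy_preprocessor.transform(unseen)
# The result may be sparse or dense depending on its density; make it dense for printing
encoded = encoded.toarray() if hasattr(encoded, 'toarray') else encoded
print(encoded)
```

The single output row has three columns: the scaled `hours` value (zero here, because 40 is the training mean) followed by two one-hot columns that are both zero for the unseen category.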

In [91]:
# Create an estimator with both preprocessor and model
estimator = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('clf', LogisticRegression(random_state=123))
    ]
)

In [92]:
# Fit the modelling pipeline on the training set
estimator.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours'],
      dtype='object')),
                                                 ('categoric',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'sex', 'native'],
      dtype='object'))])),
                ('clf', LogisticRegression(random_state=123))])

In [93]:
# predict and evaluate
y_pred = estimator.predict(X_test)
print(f'Score: {estimator.score(X_test, y_test)}\n')
print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

Score: 0.8534644712122886

[[6333  468]
 [ 858 1390]]


              precision    recall  f1-score   support

           0       0.88      0.93      0.91      6801
           1       0.75      0.62      0.68      2248

    accuracy                           0.85      9049
   macro avg       0.81      0.77      0.79      9049
weighted avg       0.85      0.85      0.85      9049
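
To make the benchmark report easier to interpret, the following sketch recomputes the headline numbers directly from the confusion matrix printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes, so the matrix reads `[[tn, fp], [fn, tp]]`.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
tn, fp = 6333, 468
fn, tp = 858, 1390

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)  # of the adults predicted as class 1, the share that really are
recall_1 = tp / (tp + fn)     # of the adults who are class 1, the share that was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f'accuracy={accuracy:.4f} precision={precision_1:.2f} '
      f'recall={recall_1:.2f} f1={f1_1:.2f}')
```

The recall of 0.62 for class 1 means the benchmark misses 858 of the 2,248 higher earners in the test set, which is the main weakness to watch when comparing against the second model.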



## Feature Engineering

In [94]:
# Create a unique string ID for each row (adult) to serve as the entity index
df['adultID'] = df.index.values
df['adultID'] = 'ad' + df['adultID'].astype(str)

In [102]:
# Encode each distinct workclass as an integer ID, numbered in order of first appearance
df['workclassID'] = pd.factorize(df['workclass'])[0]

In [103]:
# Encode each distinct occupation as an integer ID, numbered in order of first appearance
df['occupationID'] = pd.factorize(df['occupation'])[0]
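
The `workclassID` and `occupationID` columns number each distinct category in order of first appearance. The sketch below, on made-up values, compares two ways of computing such IDs: looking each value up in the list of unique values, and `pd.factorize`, which does the same in a single vectorized call.

```python
import pandas as pd

# Made-up values for illustration
s = pd.Series(['Private', 'State-gov', 'Private', 'Self-emp-inc', 'State-gov'])

# Lookup approach: position of each value in the list of unique values
uniques = list(s.unique())
ids_lookup = s.apply(lambda value: uniques.index(value))

# pd.factorize returns (codes, uniques); codes follow order of first appearance
ids_factorize, _ = pd.factorize(s)

print(ids_lookup.tolist())
print(ids_factorize.tolist())
```

Both approaches give the same IDs when there are no missing values, which holds here because rows with NaN were dropped earlier. Building the list of unique values once, outside the lambda, also avoids recomputing it for every row.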

In [108]:
# Create the entity set
adult_entities = ft.EntitySet(id = 'Adult')

In [109]:
# Add the dataframe to the entity set as the base entity
# (entity_from_dataframe belongs to the Featuretools API prior to version 1.0)
adult_entities.entity_from_dataframe(
    entity_id = 'DemographicData', 
    dataframe = df, 
    index = 'adultID'
)

Entityset: Adult
  Entities:
    DemographicData [Rows: 30162, Columns: 18]
  Relationships:
    No relationships

In [110]:
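# Split Education, WorkClass and Occupation out of the base entity as parent entities;
# normalize_entity also creates the relationship back to DemographicData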
adult_entities.normalize_entity(
    base_entity_id='DemographicData', 
    new_entity_id='Education', 
    index = 'education-num', 
    additional_variables = ['education']
)

adult_entities.normalize_entity(
    base_entity_id='DemographicData', 
    new_entity_id='WorkClass', 
    index = 'workclassID', 
    additional_variables = ['workclass']
)

adult_entities.normalize_entity(
    base_entity_id='DemographicData', 
    new_entity_id='Occupation', 
    index = 'occupationID', 
    additional_variables = ['occupation']
)

Entityset: Adult
  Entities:
    DemographicData [Rows: 30162, Columns: 15]
    Education [Rows: 16, Columns: 2]
    WorkClass [Rows: 7, Columns: 2]
    Occupation [Rows: 14, Columns: 2]
  Relationships:
    DemographicData.education-num -> Education.education-num
    DemographicData.workclassID -> WorkClass.workclassID
    DemographicData.occupationID -> Occupation.occupationID
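
Conceptually, normalizing an entity resembles splitting one table into a parent table of distinct keys and a child table that keeps the key as a foreign key. The pandas sketch below is an analogy under that assumption, not Featuretools' actual implementation; the `demo` dataframe holds a few rows modelled on the table shown earlier.

```python
import pandas as pd

# A few illustrative rows
demo = pd.DataFrame({
    'adultID': ['ad0', 'ad1', 'ad2', 'ad3'],
    'age': [39, 50, 38, 53],
    'education-num': [13, 13, 9, 7],
    'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th'],
})

# Parent table: one row per distinct education-num, carrying the education label
education = (
    demo[['education-num', 'education']]
    .drop_duplicates(subset='education-num')
    .reset_index(drop=True)
)

# Child table: keeps education-num as a foreign key and loses the moved column
demographic = demo.drop(columns=['education'])

print(education)
print(demographic.columns.tolist())
```

This mirrors the entity set summary above: each parent entity has two columns (the index and the moved variable), and DemographicData shrinks by one column per normalization, from 18 to 15.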

In [112]:
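# Choose the aggregation primitives (applied across relationships)
# and the transformation primitives (applied within a single entity)
agg_primitives = ['std', 'min', 'max', 'mean', 'last', 'count']
trans_primitives = ['percentile', 'subtract_numeric', 'divide_numeric']

# Run Deep Feature Synthesis to build the feature matrix and the feature definitions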
feature_set, feature_names = ft.dfs(
    entityset=adult_entities, 
    target_entity='DemographicData',
    agg_primitives=agg_primitives,
    trans_primitives=trans_primitives, 
    max_depth=2, 
    verbose=2, 
    n_jobs=1
)

Built 216 features
Elapsed: 00:02 | Progress: 100%|██████████
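
To make the generated feature names easier to read, the following pandas sketch reproduces by hand what two of them appear to compute: an aggregation across the Occupation relationship and a `divide_numeric` transform within a row. The data is made up, and the groupby is only an approximation of how Featuretools evaluates the primitive.

```python
import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame({
    'occupation': ['Adm-clerical', 'Adm-clerical', 'Tech-support', 'Tech-support'],
    'age': [39, 58, 27, 45],
    'hours': [40, 40, 38, 20],
})

# Aggregation primitive ('min') across the Occupation relationship:
# each row receives the minimum hours within its occupation group
demo['Occupation.MIN(DemographicData.hours)'] = (
    demo.groupby('occupation')['hours'].transform('min')
)

# Transform primitive ('divide_numeric'): ratio of two numeric columns in the same row
demo['age / hours'] = demo['age'] / demo['hours']

print(demo)
```

With `max_depth=2`, primitives can also be stacked, so an aggregation may be computed over a column that was itself produced by a transform.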


In [113]:
# Restore the original row order so that the rows line up with y,
# then move adultID from the index back into a column
feature_set = feature_set.reindex(index=df['adultID'])
feature_set = feature_set.reset_index()

In [114]:
# Drop the ID columns and every feature whose name contains an ID
X = feature_set[feature_set.columns[~feature_set.columns.str.contains('adultID|education-num|workclassID|occupationID')]]
# Replace infinite values with NaN
X = X.replace([np.inf, -np.inf], np.nan)
# Drop every column that contains at least one NaN
X = X.dropna(axis=1, how='any')
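
One plausible source of the infinite values is `divide_numeric` with a zero denominator, since columns such as capital-gain and capital-loss are 0 in many of the rows shown earlier. The sketch below reproduces the effect on made-up values and applies the same two cleaning steps; the `feats` and `cleaned` names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
feats = pd.DataFrame({
    'age': [39.0, 50.0, 38.0],
    'capital-gain': [2174.0, 0.0, 0.0],
})

# A ratio whose denominator is sometimes zero produces inf
feats['age / capital-gain'] = feats['age'] / feats['capital-gain']

# Same cleaning as above: inf becomes NaN, then columns containing NaN are dropped
cleaned = feats.replace([np.inf, -np.inf], np.nan).dropna(axis=1, how='any')
print(cleaned.columns.tolist())
```

Dropping whole columns is a blunt choice: a single infinite value removes the feature for every row. It keeps the pipeline simple at the cost of discarding some generated features, which is consistent with 180 columns remaining out of the 216 that were built.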

In [115]:
# Split train-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(21113, 180)
(9049, 180)
(21113,)
(9049,)


In [119]:
# Pipeline for transforming categorical variables
catFeatures = X.select_dtypes(include=['object']).columns
catTransformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
 
# Pipeline for scaling numerical variables
numFeatures = X.select_dtypes(include=['int64', 'float64']).columns
numTransformer = Pipeline(steps=[('scaler', StandardScaler())])

# Create the preprocessing engine
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numTransformer, numFeatures),
        ('categoric', catTransformer, catFeatures)
    ]
)

In [120]:
# Create an estimator with both preprocessor and model
estimator = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('clf', LogisticRegression(random_state=123))
    ]
)

In [121]:
# Fit the modelling pipeline on the training set
estimator.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numeric',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  Index(['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours', 'workclassId',
       'occupationId', 'age / fnlwgt', 'age / hours', 'capital-gain / age',
       ...
       'Occupation.MIN(DemographicData.hours)',
       'Occupation.MIN(DemographicData.occupationId)',
       'Occup...
       'WorkClass.LAST(DemographicData.native)',
       'WorkClass.LAST(DemographicData.relationship)',
       'WorkClass.LAST(DemographicData.sex)',
       'Occupation.LAST(DemographicData.marital-status)',
       'Occupation.LAST(DemographicData.native)',
       'Occupation.LAST(DemographicData.relationship)',
       'Occupation.LAST(DemographicData.sex)'],
      dtype='object'))])),
               

In [122]:
# predict and evaluate
y_pred = estimator.predict(X_test)
print(f'Score: {estimator.score(X_test, y_test)}\n')
print(confusion_matrix(y_test, y_pred))
print('\n')
print(classification_report(y_test, y_pred))

Score: 0.8587689247430655

[[6338  463]
 [ 815 1433]]


              precision    recall  f1-score   support

           0       0.89      0.93      0.91      6801
           1       0.76      0.64      0.69      2248

    accuracy                           0.86      9049
   macro avg       0.82      0.78      0.80      9049
weighted avg       0.85      0.86      0.85      9049
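
To analyze the results, the two confusion matrices printed in this activity can be compared directly. The sketch below assumes scikit-learn's `[[tn, fp], [fn, tp]]` layout; `summarize` is a hypothetical helper written for this comparison.

```python
# Confusion matrices printed above, as [[tn, fp], [fn, tp]]
benchmark = [[6333, 468], [858, 1390]]
engineered = [[6338, 463], [815, 1433]]

def summarize(cm):
    """Return accuracy plus precision and recall for class 1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'precision_1': tp / (tp + fp),
        'recall_1': tp / (tp + fn),
    }

base, new = summarize(benchmark), summarize(engineered)
for name in base:
    print(f'{name}: {base[name]:.4f} -> {new[name]:.4f} ({new[name] - base[name]:+.4f})')

# Difference in the number of correct predictions (diagonal sums)
extra_correct = (engineered[0][0] + engineered[1][1]) - (benchmark[0][0] + benchmark[1][1])
print(f'Additional correct predictions on the test set: {extra_correct}')
```

On this split the gain is modest: 48 more correct predictions out of 9,049, most of them additional true positives for class 1. Because this comes from a single train-test split, the difference should be read as indicative rather than conclusive.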

