# Subsetting Data

In this notebook, we subset the full feature matrices to the features selected in the `Sampling` notebook. The `_sample.csv` files hold a 10% sample of the observations with feature selection already applied. The `fm` dataframes hold the full data. After subsetting, each full dataset is saved as a `_selected.csv` file.

The random search is run on the `_sample.csv` files, but the `_selected.csv` files will be used to train and test the final model. These files have all the observations, but only the columns that remained after feature selection.
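The subsetting pattern used throughout this notebook can be sketched on toy data. This is a minimal illustration, not the notebook's actual code: the helper name `subset_to_sample` and the toy columns `a`, `b`, `c` are hypothetical.

```python
import pandas as pd

def subset_to_sample(fm: pd.DataFrame, sample: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns of `fm` that also appear in `sample`, in the sample's order."""
    keep = [col for col in sample.columns if col in fm.columns]
    return fm[keep]

# Toy "full" data: every observation, every column
full = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]})

# Toy "sample": fewer rows, and only the columns that survived feature selection
sampled = pd.DataFrame({'c': [9], 'a': [1]})

selected = subset_to_sample(full, sampled)
print(selected.shape)
```

All four rows are kept while the column `b`, which is absent from the sample, is dropped.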

In [1]:
import pandas as pd
import numpy as np



# Default Features

In [2]:
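train, test = pd.read_csv('../input/application_train.csv'), pd.read_csv('../input/application_test.csv')
# The test set has no labels; add a placeholder so the columns align
test['TARGET'] = np.nan
# One-hot encode, keeping only the columns present in both dataframes
train, test = pd.get_dummies(train).align(pd.get_dummies(test), axis = 1, join = 'inner')

sample = pd.read_csv('../input/features_default_sample.csv')
# Stack train and test (pd.concat replaces DataFrame.append, removed in pandas 2.0)
fm = pd.concat([train, test], ignore_index = True)
fm.shape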

(356255, 243)

In [3]:
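# Subset to the columns in the sample
fm = fm[[x for x in sample.columns if x in fm.columns]]
# Save without the index, matching the other feature sets
fm.to_csv('../input/features_default_selected.csv', index = False)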
fm.shape

(356255, 204)

# Manual Features

In [4]:
sample = pd.read_csv('../input/features_manual_sample.csv')
fm = pd.read_csv('../input/features_manual.csv')

# One-hot encoding
fm = pd.get_dummies(fm)
fm.shape

(356255, 273)

In [5]:
# Subset to the columns in the sample
fm = fm[sample.columns]
fm.shape

(356255, 230)

In [6]:
# Save to csv
fm.to_csv('../input/features_manual_selected.csv', index = False)
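Indexing directly with `fm[sample.columns]`, as in the cell above, raises a `KeyError` if the sample contains a column that the full data lacks. A small check beforehand can make such a mismatch easier to diagnose. The sketch below uses toy data, and the helper name `missing_columns` is hypothetical.

```python
import pandas as pd

def missing_columns(fm: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return the sample columns that are not present in the full data."""
    return [col for col in sample.columns if col not in fm.columns]

# Toy data: the sample has a column 'z' that the full data lacks
full = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
sampled = pd.DataFrame({'a': [1], 'z': [0]})

missing = missing_columns(full, sampled)
print(missing)
```

An empty list means the direct indexing is safe; otherwise the listed columns point to an encoding difference between the sample and the full data.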

# Featuretools Features

In [14]:
# Read in sample and full data
sample = pd.read_csv('../input/feature_matrix_sample.csv')
fm = pd.read_csv('../input/feature_matrix.csv')

print(fm.shape)

# One hot encoding
cat = pd.get_dummies(fm.select_dtypes('object'))

# Convert the column types
for col in fm:
    if fm[col].dtype == 'bool':
        fm[col] = fm[col].astype(np.uint8)
        
# Add the one-hot encoded columns
fm = fm.select_dtypes(['number'])
fm = pd.concat([fm, cat], axis = 1)
fm.shape

(356255, 1821)


(356255, 2111)
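The encoding steps in the cell above can be traced on a small example. This is a sketch with hypothetical column names (`amount`, `flag`, `kind`); it converts the boolean columns with a single `astype` call instead of a loop, which should give the same result under the assumption that only `bool` columns need converting.

```python
import numpy as np
import pandas as pd

# Toy feature matrix with one numeric, one boolean, and one categorical column
toy = pd.DataFrame({
    'amount': [1.5, 2.0, 3.5],
    'flag': [True, False, True],
    'kind': pd.Series(['x', 'y', 'x'], dtype=object),
})

# One-hot encode the categorical (object) columns
cat = pd.get_dummies(toy.select_dtypes('object'))

# Convert boolean columns to integers so that select_dtypes('number') keeps them
bool_cols = toy.select_dtypes('bool').columns
toy = toy.astype({col: np.uint8 for col in bool_cols})

# Keep the numeric columns and append the one-hot encoded ones
result = pd.concat([toy.select_dtypes('number'), cat], axis=1)
print(result.columns.tolist())
```

Without the conversion step, `flag` would be dropped, because `select_dtypes('number')` does not treat `bool` as numeric.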

In [15]:
sample = pd.read_csv('../input/feature_matrix_sample.csv')
sample.shape

(30751, 1042)

In [16]:
# Subset to the columns in the sample
fm = fm[[x for x in sample.columns if x in fm]]
fm.shape

(356255, 1042)

In [17]:
# Save
fm.to_csv('../input/feature_matrix_select.csv', index = False)

# Semi-Automated Features

In [14]:
sample = pd.read_csv('../input/features_semi_sample.csv')
fm = pd.read_csv('../input/features_semi.csv')

# One hot encoding
fm = pd.get_dummies(fm)
fm.shape

(356255, 1447)

In [15]:
# Subset to the columns in sample
fm = fm[sample.columns]
fm.shape

(356255, 880)

In [16]:
fm.to_csv('../input/features_semi_selected.csv', index = False)

# Conclusions

The resulting data can now be used for modeling. Feature selection has reduced the number of columns, but all of the observations remain. The final datasets will be tested with both the default gradient boosting machine hyperparameters and the best hyperparameters found by random search. The next notebook, `Results`, uses these datasets to train and evaluate the final model.
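Because the training and testing observations are stored together, they have to be separated again before modeling. One possible approach, assuming the testing rows carry a missing `TARGET` as in the default features cell, is sketched below on toy data. The column `feature_1` is hypothetical, and the `Results` notebook may split the data differently.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a `_selected.csv` file: two labeled rows, two unlabeled rows
selected = pd.DataFrame({
    'feature_1': [0.1, 0.2, 0.3, 0.4],
    'TARGET': [0, 1, np.nan, np.nan],
})

# Rows with a label form the training set
train = selected[selected['TARGET'].notnull()].copy()

# Rows without a label form the testing set; the empty label column is dropped
test = selected[selected['TARGET'].isnull()].drop(columns='TARGET')

print(train.shape, test.shape)
```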