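# Feature Engineering

This notebook engineers features from the raw data. It handles missing values with several imputation strategies, then reduces dimensionality with Boruta and PCA. Each result is saved to `../data/processed/` for further analysis.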

## Installations

In [1]:
! pip install feature-engine



In [42]:
! pip install boruta

Collecting boruta
  Using cached Boruta-0.4.3-py3-none-any.whl.metadata (8.8 kB)
Downloading Boruta-0.4.3-py3-none-any.whl (57 kB)
Installing collected packages: boruta
Successfully installed boruta-0.4.3


In [43]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# sklearn imputation libraries
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing

# apply Boruta method for dimensionality reduction
from boruta import BorutaPy

# Use DataPrep function to remove all rows with missing values
from dataprep.clean import clean_df

# mean data imputer
from feature_engine.imputation import MeanMedianImputer

### Read in data

In [3]:
# Read data from csv
df = pd.read_csv("../data/raw/raw_data.csv", sep=',', engine='python')

In [4]:
# create list of float variables
float_vars = df.select_dtypes(include='float64').columns.tolist()

# create list of int variables ('int64' is explicit; plain 'int' is platform-dependent)
int_vars = df.select_dtypes(include='int64').columns.tolist()

# create list of string variables
# (pandas stores strings as 'object', so comparing the dtype to 'str' never matches)
string_vars = df.select_dtypes(include='object').columns.tolist()

# create list of X variables (column names starting with 'x')
X_vars = [col for col in df.columns if col.startswith('x')]
        
print(float_vars)
print(int_vars)
print(string_vars)
print(X_vars)

['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39']
['y']
[]
['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39']


### Remove all rows with missing values

Apply the [DropMissingData transformer](https://feature-engine.trainindata.com/en/latest/user_guide/imputation/DropMissingData.html) from the `feature-engine` package.

Removing rows with NaN values from a dataset is a common practice in data science and machine learning projects.

`DropMissingData` works like pandas `dropna`: you take a pandas dataframe, apply the transformer, and eliminate the rows that contain NaN values in one or more columns.
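As a minimal sketch of the row-dropping behaviour described above, the following toy example uses plain pandas `dropna`. The dataframe `toy` and its values are made up for illustration; they are not taken from `raw_data.csv`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "x0": [1.0, np.nan, 3.0, 4.0],
    "x1": [0.5, 0.7, np.nan, 0.9],
    "y": [0, 1, 0, 1],
})

# Drop every row that has a NaN in at least one column
toy_dropna = toy.dropna(axis=0, how="any")

# Only the complete rows survive
print(toy_dropna)
```

Rows 1 and 2 each contain one NaN, so only rows 0 and 3 remain.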

In [25]:
from feature_engine.imputation import DropMissingData

dmd = DropMissingData()
dmd.fit(df)
df_dropna = dmd.transform(df)

In [28]:
df_dropna.to_csv('../data/processed/data_nomissingvalues.csv', index=False)

### Replace missing data with median values of the variable

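The `MeanMedianImputer()` from `feature-engine` replaces missing data with the mean or median value of the variable. It works only with numerical variables. See the method documentation [here](https://feature-engine.trainindata.com/en/latest/api_doc/imputation/MeanMedianImputer.html). The cell below uses scikit-learn's `SimpleImputer` with `strategy='median'` instead, which applies the same per-column median replacement.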
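To make the median replacement concrete, here is a small sketch on made-up data (the dataframe `toy` is an assumption for illustration). It compares a plain pandas `fillna` with scikit-learn's `SimpleImputer`; both should fill each NaN with the median of its own column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "x0": [1.0, np.nan, 3.0, 10.0],   # median of observed values: 3.0
    "x1": [2.0, 4.0, np.nan, 6.0],    # median of observed values: 4.0
})

# pandas: fill each column's NaN with that column's median
filled_pandas = toy.fillna(toy.median())

# scikit-learn: same strategy; the transformer returns a NumPy array,
# so it is wrapped back into a dataframe with the original column names
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled_sklearn = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled_sklearn)
```

Swapping `strategy="median"` for `strategy="mean"` gives the mean-imputation variant used later in the notebook.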

In [9]:
# Read data from csv
df = pd.read_csv("../data/raw/raw_data.csv", sep=',', engine='python')

# impute missing values with the median of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

imputer.fit(df)
dfimp = imputer.transform(df)
dfimp_df = pd.DataFrame(dfimp, columns=df.columns)

#### Save resulting dataframes to csv for further analysis

In [10]:
dfimp_df.to_csv('../data/processed/data_medianvalsformissing.csv', index=False)

### Replace missing data with mean values of the variable

In [11]:
# Read data from csv
df = pd.read_csv("../data/raw/raw_data.csv", sep=',', engine='python')

# impute missing values with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

imputer.fit(df)
dfimp = imputer.transform(df)
dfimp_df = pd.DataFrame(dfimp, columns=df.columns)

#### Save resulting dataframes to csv for further analysis

In [12]:
dfimp_df.to_csv('../data/processed/data_meanvalsformissing.csv', index=False)

### Replace missing data with a random sample extracted from the variable

Use the [RandomSampleImputer](https://feature-engine.trainindata.com/en/latest/user_guide/imputation/RandomSampleImputer.html) from the `feature-engine` package to impute the missing values with a random sample extracted from the variable.

If `seed='observation'`, then `random_state` should be a variable name or a list of variable names. The seed is calculated observation by observation, either by adding or by multiplying the values of the variables indicated in `random_state`. A value is then extracted from the train set using that seed and used to replace the NaN in that particular observation. This is the equivalent of `pandas.sample(1, random_state=var1+var2)` if `seeding_method` is set to `add`, or `pandas.sample(1, random_state=var1*var2)` if `seeding_method` is set to `multiply`.
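The per-observation seeding can be sketched in plain pandas as follows. This is a simplified illustration, not the `feature-engine` implementation: the helper `impute_by_observation_seed` and the toy dataframe are hypothetical, and only the `add` seeding method is shown.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "x0": [1.0, np.nan, 3.0, np.nan, 5.0],
    "y": [0, 1, 0, 1, 1],
})

def impute_by_observation_seed(df, target, seed_vars):
    """Fill NaN in `target` with a value sampled from its observed values,
    seeding each draw with the sum of `seed_vars` for that row."""
    out = df.copy()
    observed = df[target].dropna()
    for idx in df.index[df[target].isna()]:
        # seed for this observation: sum of the seeding variables ('add' method)
        seed = int(df.loc[idx, seed_vars].sum())
        out.loc[idx, target] = observed.sample(1, random_state=seed).iloc[0]
    return out

imputed = impute_by_observation_seed(toy, target="x0", seed_vars=["y"])
print(imputed)
```

Because the seed depends only on the row's own values, rows with the same seeding values (rows 1 and 3 here, both with `y == 1`) receive the same imputed value, and rerunning the cell reproduces the result.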

In [15]:
from feature_engine.imputation import RandomSampleImputer

# Read data from csv
df = pd.read_csv("../data/raw/raw_data.csv", sep=',', engine='python')

# set up the imputer
imputer = RandomSampleImputer(
        random_state=['y'],
        seed='observation',
        seeding_method='add'
    )

# fit the imputer
imputer.fit(df)

# transform the data
dfimp_random = imputer.transform(df)


#### Save resulting dataframes to csv for further analysis

In [18]:
dfimp_random.to_csv('../data/processed/data_randomsampleformissing.csv', index=False)

### Replace missing data with arbitrary values

The [Arbitrary Number Imputer method](https://feature-engine.trainindata.com/en/latest/user_guide/imputation/ArbitraryNumberImputer.html) replaces missing data with an arbitrary numerical value determined by the user. It works only with numerical variables.

The `ArbitraryNumberImputer()` can find and impute all numerical variables automatically. Alternatively, you can pass a list of the variables you want to impute to the `variables` parameter.

You can impute all variables with the same number, in which case you need to define the variables to impute in the `variables` parameter and the imputation number in the `arbitrary_number` parameter.
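The idea of filling a chosen list of variables with one arbitrary number can be sketched in plain pandas. The names `variables` and `arbitrary_number` mirror the parameters mentioned above, while the toy data and the value `-999` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "x0": [1.0, np.nan, 3.0],
    "x1": [np.nan, 0.7, 0.9],
})

variables = ["x0"]        # variables to impute
arbitrary_number = -999   # arbitrary replacement value

imputed = toy.copy()
# replace NaN only in the selected variables
imputed[variables] = imputed[variables].fillna(arbitrary_number)

print(imputed)
```

Only `x0` is imputed; the NaN in `x1` is left untouched because `x1` is not in `variables`.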

In [21]:
from feature_engine.imputation import ArbitraryNumberImputer

# Read data from csv
df = pd.read_csv("../data/raw/raw_data.csv", sep=',', engine='python')

# set up the imputer
arbitrary_imputer = ArbitraryNumberImputer(
    # for exposition, use the minimum value in the dataframe as the arbitrary number;
    # df.min().min() reduces over columns and then rows to give a single scalar
    arbitrary_number=float(df.min().min()),
    )

# fit the imputer
arbitrary_imputer.fit(df)

# transform the data
dfImp_arbitrary = arbitrary_imputer.transform(df)

#### Save resulting dataframes to csv for further analysis

In [24]:
dfImp_arbitrary.to_csv('../data/processed/data_arbitraryvalueformissing.csv', index=False)

### Add Missing Indicator for columns with missing values

Apply the `Add Missing Indicator` method to add a binary variable indicating if observations are missing (missing indicator). It adds missing indicators to both categorical and numerical variables. See the documentation of the [Add Missing Indicator](https://feature-engine.trainindata.com/en/latest/user_guide/imputation/AddMissingIndicator.html) method.

You can select the variables for which the missing indicators should be created by passing a variable list to the `variables` parameter. Alternatively, the imputer will automatically select all variables.

The imputer has the option to add missing indicators to all variables or only to those that have missing data in the train set. You can change the behaviour using the parameter `missing_only`.

If `missing_only=True`, missing indicators will be added only to those variables with missing data in the train set. This means that if you passed a variable list to `variables` and some of those variables did not have missing data, no missing indicators will be added to them. If it is paramount that all variables in your list get their missing indicators, make sure to set `missing_only=False`.

It is recommended to use `missing_only=True` when not passing a list of variables to impute.
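The effect of `missing_only` can be sketched in plain pandas. The helper `add_missing_indicators` below is hypothetical and only imitates the behaviour described above; it is not the `feature-engine` implementation, and the toy data is invented.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "x0": [1.0, np.nan, 3.0],
    "x1": [0.5, 0.7, 0.9],   # no missing values
    "y": [0, 1, 0],          # no missing values
})

def add_missing_indicators(df, missing_only=True):
    """Append a binary `<col>_na` column flagging NaN values.
    With missing_only=True, only columns that contain NaN get an indicator."""
    out = df.copy()
    for col in df.columns:
        if missing_only and not df[col].isna().any():
            continue
        out[f"{col}_na"] = df[col].isna().astype(int)
    return out

with_flags = add_missing_indicators(toy, missing_only=True)
all_flags = add_missing_indicators(toy, missing_only=False)

print(with_flags.columns.tolist())
print(all_flags.columns.tolist())
```

With `missing_only=True` only `x0_na` is added; with `missing_only=False` every column gets an indicator, including columns whose indicator is all zeros.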



In [35]:
from feature_engine.imputation import AddMissingIndicator

# set up the imputer
addBinary_imputer = AddMissingIndicator(missing_only=True)

# fit the imputer
addBinary_imputer.fit(df)

df_MissingIndicator = addBinary_imputer.transform(df)

In [36]:
df_MissingIndicator.shape

(10000, 81)

In [38]:
# analyze the results
print(df_MissingIndicator.shape)
df_MissingIndicator.columns

(10000, 81)


Index(['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10',
       'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20',
       'x21', 'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30',
       'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'y',
       'x0_na', 'x1_na', 'x2_na', 'x3_na', 'x4_na', 'x5_na', 'x6_na', 'x7_na',
       'x8_na', 'x9_na', 'x10_na', 'x11_na', 'x12_na', 'x13_na', 'x14_na',
       'x15_na', 'x16_na', 'x17_na', 'x18_na', 'x19_na', 'x20_na', 'x21_na',
       'x22_na', 'x23_na', 'x24_na', 'x25_na', 'x26_na', 'x27_na', 'x28_na',
       'x29_na', 'x30_na', 'x31_na', 'x32_na', 'x33_na', 'x34_na', 'x35_na',
       'x36_na', 'x37_na', 'x38_na', 'x39_na'],
      dtype='object')

#### Save resulting dataframes to csv for further analysis

In [39]:
df_MissingIndicator.to_csv('../data/processed/data_missingindicator.csv', index=False)

### Apply Boruta method to reduce dimensionality of the dataframe

Apply the Boruta method for feature selection. Boruta is an all-relevant feature selection method, while most others are minimal-optimal. This means it tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some classifier has a minimal error.

Why bother with all-relevant feature selection? When you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (a minimal-optimal set of features depends, by definition, on your classifier choice).

See documentation of the Boruta method [here](https://github.com/scikit-learn-contrib/boruta_py).
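Boruta decides relevance by comparing each real feature against shuffled copies of the features ("shadow" features). The sketch below shows a single pass of that comparison on synthetic data. It is a simplified illustration, not BorutaPy's implementation: the real method repeats the comparison over many iterations and applies a statistical test before confirming or rejecting a feature. The data, sizes, and variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: only column 0 drives the target
n = 500
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Shadow features: each column shuffled independently, which breaks any link to y
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_aug = np.hstack([X, X_shadow])

forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_aug, y)

importances = forest.feature_importances_
real_importance = importances[: X.shape[1]]
shadow_max = importances[X.shape[1]:].max()

# A real feature scores a "hit" when it beats the best shadow feature
hits = real_importance > shadow_max
print(hits)
```

A feature that cannot beat shuffled noise carries no usable signal for this model, which is the intuition behind the rankings that `BorutaPy` reports.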

In [51]:
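# choose a Boruta rank threshold: features with ranking_ <= this value are listed below
# (this is a rank cut-off, not a count, so more features than the number entered may be listed)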
no_selectedfeatures = int(input("Please enter number of features to select \n"))

Please enter number of features to select 
 10


In [53]:
# impute missing values and standardize values 
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
scaler = StandardScaler()

imputer.fit(df[X_vars])
Ximp = imputer.transform(df[X_vars])
scaler.fit(Ximp)
Xscaled = scaler.transform(Ximp)

In [54]:
# instantiate random forest
forest = RandomForestRegressor(n_jobs = -1, max_depth = 5)

# fit boruta
boruta_selector = BorutaPy(forest, n_estimators = 'auto', random_state = 0)
boruta_selector.fit(np.array(Xscaled), np.array(df['y']))

In [57]:
boruta_ranking = boruta_selector.ranking_
for i, val in enumerate(boruta_ranking):
    if val <= no_selectedfeatures:
        print (X_vars[i], val)

x3 1
x7 9
x10 10
x13 1
x15 1
x20 1
x21 4
x27 3
x28 8
x29 1
x30 7
x31 2
x33 1
x34 5
x36 6


In [60]:
# select features with a Boruta ranking of 5 or better (ranking_ <= 5; rank 1 is best)
boruta_ranking = boruta_selector.ranking_
selected_features = np.array(X_vars)[boruta_ranking <= 5]
print(selected_features)

['x3' 'x13' 'x15' 'x20' 'x21' 'x27' 'x29' 'x31' 'x33' 'x34']


In [61]:
# build dataframe of the Boruta-selected features plus the target for downstream analysis
df_Boruta = df[list(selected_features) + ['y']].copy()

In [62]:
df_Boruta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x3      9988 non-null   float64
 1   x13     9987 non-null   float64
 2   x15     9994 non-null   float64
 3   x20     9995 non-null   float64
 4   x21     9995 non-null   float64
 5   x27     9989 non-null   float64
 6   x29     9996 non-null   float64
 7   x31     9988 non-null   float64
 8   x33     9992 non-null   float64
 9   x34     9993 non-null   float64
 10  y       10000 non-null  int64  
dtypes: float64(10), int64(1)
memory usage: 859.5 KB


#### Save resulting dataframes to csv for further analysis

In [63]:
df_Boruta.to_csv('../data/processed/data_boruta.csv', index=False)

### Apply Principal Components Analysis (PCA) to reduce dimensionality

Apply PCA to reduce the dimensions of the dataset. PCA performs linear dimensionality reduction, using Singular Value Decomposition (SVD) of the data to project it to a lower-dimensional space. The input data is centered but not scaled for each feature before applying the SVD, so the cells below standardize the features with `StandardScaler` first.

It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.

Setting `n_components` to `'mle'` makes `svd_solver='auto'` behave as `svd_solver='full'`.

Set the `whiten` option to `True` (`False` by default) so that the `components_` vectors are multiplied by the square root of `n_samples` and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

See the scikit-learn documentation for PCA [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
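The effect of whitening can be checked on a small synthetic example. The sketch below assumes made-up correlated data and a fixed `n_components=2` rather than the `'mle'` setting used in this notebook. It verifies that the transformed components are uncorrelated with unit variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic correlated data with different scales, invented for illustration
latent = rng.normal(size=(1000, 3))
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.3, 0.1]])
X = latent @ mixing

pca = PCA(n_components=2, whiten=True, svd_solver="full")
Z = pca.fit_transform(X)

# Sample covariance of the whitened components
cov = np.cov(Z, rowvar=False)
print(np.round(cov, 6))
```

The printed covariance should be close to the identity matrix. Without `whiten=True`, the diagonal would instead hold the explained variance of each component.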

In [65]:
# impute missing values and standardize values 
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
scaler = StandardScaler()

imputer.fit(df[X_vars])
Ximp = imputer.transform(df[X_vars])
scaler.fit(Ximp)
Xscaled = scaler.transform(Ximp)

In [104]:
from sklearn.decomposition import PCA

# set whiten option to True

pca = PCA(n_components="mle", copy=True, whiten=True, 
          svd_solver='auto', tol=0.0, iterated_power='auto', 
          n_oversamples=5, power_iteration_normalizer='auto',
          random_state=None)

df_pca = pca.fit_transform(Xscaled)

df_pca.shape

(10000, 39)

In [105]:
df_pca = pd.DataFrame(df_pca)

# create column names (these label principal components, not the original x features)
df_pca.columns = [f"x{i}" for i in range(df_pca.shape[1])]

In [107]:
# combine the principal components with the target for downstream analysis
df_pca2 = pd.concat([df_pca, df[['y']]], axis=1)

#### Save resulting dataframes to csv for further analysis

In [109]:
df_pca2.to_csv('../data/processed/data_pca.csv', index=False)