## COMP 4447: Week 8 Live Session
## _Feature Engineering_

### We've all heard this before, but accessing, cleaning and preparing the data for analysis is the most time-consuming part of a data science role.

### However, being thorough and creative during these aspects of the process will have a huge impact on outcomes. Good features can tap into useful information and set your model apart.

<img src="https://pbs.twimg.com/media/EgN0LhoVoAAetP2.png" width="500" height="500"/>
<img src="https://rapidminer.com/wp-content/uploads/2022/04/data-prep-approach-time.png" width="500" height="500"/>
<img src="https://pbs.twimg.com/media/CwVwtGzUUAAM5x0.jpg" width="500" height="500"/>

In [None]:
# Missing Values....most ML and statistical models don't handle these well.
# Assessing Missingness (MAR, MCAR, MNAR)
# Imputation: central tendency and estimation approaches
# Listwise deletion

#### There are different types of missingness and your approach to handling them can be informed by the type of missingness

#### In general, we have the following options for handling missing data.

- Listwise deletion

- Imputation (for continuous and categorical data)

- Using algorithms that are robust to missingness

- Imputing predicted missing values

## There are different varieties of missingness, referred to as missingness mechanisms...the names are a bit misleading.

- __Missing Completely at Random (MCAR):__
    
    This is the best case scenario...and also the easiest to understand. Missingness is due to a random process and doesn't depend on any of the other observed variables. There is no evidence of a systematic explanation for the missing data.  The cases with missing data are essentially a random sample of our dataset.  The only consequence is that we have less data to model.  

    The missingness doesn't introduce bias into our model.

    Can be tested.
    

- __Missing at Random (MAR):__
    
    Here the missingness is conditioned upon another variable, but does not depend on the missing values.  For example, if you're utilizing survey data and observe that the missingness on a certain item (say job compensation) is associated with age, then the missingness for that item is conditioned upon age.  In this case compensation itself isn't impacting the missingness of compensation information, but perhaps older people are less likely to still be working, and we can explain the missingness of compensation information based on their age.  
    
    Introduces bias, but we can control for it.
    
    Can be tested.


- __Missing Not at Random (MNAR):__

    This is the worst case. The missingness depends on the actual missing value. Let's use the compensation example again. Say we observe missing data for a measure of job compensation but find no association between missingness and age or any other features in our dataset.  However, maybe people who are paid less are less likely to offer up their compensation information.  In this case the missingness of compensation depends on the missing values themselves, so the data in this case would be considered MNAR.  The tricky part is, if we don't know the missing values, then we can't directly evaluate whether or not those values are associated with missingness.
    
    Introduces bias.  Harder to control for.
    
    Can't be directly tested.
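To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the `age` and `comp` variables are synthetic, and the missingness probabilities are made up. It drops compensation values under each mechanism and compares the mean of what's left to the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical survey: compensation rises with age in this toy setup.
age = rng.uniform(20, 70, size=n)
comp = 30_000 + 800 * age + rng.normal(0, 10_000, size=n)

# MCAR: every record has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of being missing depends only on the observed age.
mar_mask = rng.random(n) < np.where(age > 55, 0.50, 0.10)

# MNAR: the chance of being missing depends on the unseen compensation itself.
mnar_mask = rng.random(n) < np.where(comp < np.median(comp), 0.35, 0.05)

summary = pd.DataFrame({
    'mechanism': ['MCAR', 'MAR', 'MNAR'],
    'true_mean': comp.mean(),
    'observed_mean': [comp[~m].mean() for m in (mcar_mask, mar_mask, mnar_mask)],
})
summary['bias'] = summary['observed_mean'] - summary['true_mean']
print(summary.round(0))
```

In this setup the MCAR bias should hover near zero, while the MAR and MNAR versions shift the observed mean. The difference is that in the MAR case we have `age` available to correct for it.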

## Testing for MCAR vs MAR

A common approach here is to dummy code based on missingness and then perform t-tests or chi-square tests against our other features.  If we don't observe differential outcomes based on our missingness indicator, then the data are consistent with being missing completely at random.  If we do observe differential outcomes, then the data are not MCAR, so they are either missing at random or missing not at random. This could involve many comparisons depending on how many features you have and how many of them contain missing data, so it can be a good idea to use an adjustment for multiple comparisons such as Bonferroni.  Little's MCAR test rolls these comparisons into a single test statistic; its null hypothesis is that the data are MCAR, so a significant result is evidence against MCAR.
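The dummy-coding approach described above can be sketched as follows. The data are synthetic and the column names (`age`, `tenure`, `comp`) are hypothetical; the missingness is deliberately generated to depend on age. Welch's t-test from `scipy.stats` is one reasonable choice for continuous features, not the only one.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical data: compensation is missing more often for older respondents.
df_miss = pd.DataFrame({
    'age': rng.uniform(20, 70, size=n),
    'tenure': rng.uniform(0, 30, size=n),   # generated independently of the missingness
})
df_miss['comp'] = 30_000 + 800 * df_miss['age'] + rng.normal(0, 10_000, size=n)
drop = rng.random(n) < np.where(df_miss['age'] > 55, 0.50, 0.10)
df_miss.loc[drop, 'comp'] = np.nan

# Dummy code the missingness, then compare each other feature across the two groups.
is_missing = df_miss['comp'].isna()
features = ['age', 'tenure']
alpha = 0.05 / len(features)   # Bonferroni adjustment for the number of comparisons

p_values = {}
for col in features:
    _, p = stats.ttest_ind(df_miss.loc[is_missing, col],
                           df_miss.loc[~is_missing, col],
                           equal_var=False)   # Welch's t-test
    p_values[col] = p
    print(f"{col}: p={p:.4g}, differs by missingness: {p < alpha}")
```

A small p-value for `age` tells us the data are not MCAR. It can't tell us whether they are MAR or MNAR, since that depends on the values we never observed.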

## How does the missingness mechanism impact handling of missing data?

If your data are MCAR then you can comfortably listwise delete records with missing data.  Encoding missing values with an indicator, such as -1, can be helpful if you're using a tree based approach, as these values can be handled by the algorithm.  Simple single imputation techniques such as mean or median imputation can be used for MAR or MNAR.  Using regression (also single imputation) to estimate the values of the missing data using the other features in your dataset is a more sophisticated approach. Multiple imputation is probably the most sophisticated approach and involves making multiple imputed datasets using the distribution of the predicted value.  Model coefficients from multiple imputation can then be pooled.
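Here is a small sketch of two of the simpler options mentioned above: a sentinel value and median imputation with a missingness indicator. The toy frame and column names are made up for illustration. `SimpleImputer` with `add_indicator=True` is one way to do this in sklearn; regression-based and multiple imputation are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a few missing compensation values.
toy = pd.DataFrame({'age':  [25, 32, 47, 51, 63, 38],
                    'comp': [48_000, np.nan, 71_000, np.nan, 90_000, 60_000]})

# Option 1: a sentinel value, which a tree-based model can split on.
toy['comp_sentinel'] = toy['comp'].fillna(-1)

# Option 2: median imputation plus an explicit missingness indicator column.
imp = SimpleImputer(strategy='median', add_indicator=True)
imputed = imp.fit_transform(toy[['comp']])
toy['comp_median'] = imputed[:, 0]                    # imputed values
toy['comp_was_missing'] = imputed[:, 1].astype(int)   # 1 where the value was missing

print(toy)
```

Keeping the indicator column lets a downstream model use the fact that a value was missing, rather than treating the imputed median as a real observation.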

##  Caveats about missingness

Assessment of missingness, especially at the level of MAR and MNAR, requires domain knowledge and contextual information.  Rigorously dealing with missingness mechanisms is more important in clinical situations.  Oftentimes we're primarily concerned with making accurate classifications or estimates, so if our models have good out-of-sample performance characteristics, then we can usually handle missingness as best we can and just accept not fully understanding or controlling for the mechanisms. 

In many contexts, missing data just isn't that common, or if it is, we know it's due to process or instrumentation. Say we're looking at transactional data collected from an internal system.  This is common, and it's much easier to diagnose missingness within this context.  Any time we're using data in which a subject, respondent or individual has to be engaged in the act of offering or reporting data, we're much more likely to encounter missing data that has a more nebulous cause.

#### A note about deletion

We can use columnwise or listwise deletion. As a rule of thumb, we wouldn't want to use features that are missing more than 40% of their data.  Removal of rows is more common. Sometimes we set a threshold for record retention based on the number of missing values across all features.  Other times we consider record retention based on missingness for a single feature. 

Typically we want to avoid deletion where we can, as it results in the loss of useful information.
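The deletion options above might look like this in pandas. The frame is a made-up example; the 40% column threshold comes from the rule of thumb above, and the row threshold of 2 observed values is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'c' is mostly missing and row 3 is mostly missing.
raw = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0],
    'b': [1.0, 2.0, 3.0, np.nan, 5.0],
    'c': [np.nan, np.nan, np.nan, 4.0, np.nan],
    'd': [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Columnwise: drop features missing more than 40% of their values.
col_missing = raw.isna().mean()
kept_cols = raw.loc[:, col_missing <= 0.40]

# Rowwise, threshold across features: keep records with at least 2 observed values.
kept_rows = kept_cols.dropna(thresh=2)

# Rowwise, single feature: keep records where 'd' is observed.
kept_d = kept_rows.dropna(subset=['d'])

# Strict listwise deletion: keep only complete records.
complete = kept_cols.dropna()

print(kept_cols.columns.tolist(), len(kept_rows), len(kept_d), len(complete))
```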

### One Hot Encoding: Reserved for strictly categorical data.

Although you can do many of these transformations in pandas, we'll focus on transformations using sklearn, as it's likely the tool you'll initially use for model-building, and when doing cross-validation it's easier to build a pipeline to apply your transformation steps using built-in sklearn utilities.
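As a rough sketch of why the pipeline route is convenient for cross-validation, the example below wraps a `OneHotEncoder` and a classifier in a single `Pipeline`. The data, the `income` feature, and the choice of `LogisticRegression` are all assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(51)

# Hypothetical data: one categorical feature, one numeric feature, binary target.
X = pd.DataFrame({
    'state': rng.choice(['California', 'Colorado', 'Missouri', 'Indiana'], size=200),
    'income': rng.normal(50, 10, size=200),
})
y = (X['income'] + rng.normal(0, 5, size=200) > 50).astype(int)

# One-hot encode 'state' and pass 'income' through untouched.
prep = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['state'])],
    remainder='passthrough',
)
model = Pipeline([('prep', prep), ('clf', LogisticRegression(max_iter=1000))])

# The encoder is refit on the training portion of every fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Because the transformation steps live inside the pipeline, they are fit only on each training fold, so nothing from the held-out fold leaks into preprocessing.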

In [1]:
# first we'll create a df with a single categorical feature just for example purposes
import numpy as np
import pandas as pd
from pandas import DataFrame

np.random.seed(51)

st = ['California', 'Coloroado', 'Missouri', 'Indiana']

df = DataFrame({'state': [st[i] for i in np.random.randint(low=0, high=4, size=50)]})
df.head()

Unnamed: 0,state
0,Missouri
1,Coloroado
2,Coloroado
3,California
4,Coloroado


In [2]:
from sklearn.preprocessing import OneHotEncoder
# note: scikit-learn 1.2 renamed this parameter to sparse_output (sparse was removed in 1.4)
oh = OneHotEncoder(sparse=True)

# as you saw in the asynch there is a fit_transform() method, but using the fit()
# and transform() methods on their own is helpful when working in pipelines.
# Here is an example of using those methods.

oh.fit(df[['state']])

In [64]:
# Now we'll call the transform method from the fitted encoding.

# Just so we understand the sparse param, let's look at a sparse scipy matrix.

sparse_matrix = oh.transform(df[['state']])

print(oh.get_feature_names_out()) # feature names

print(type(sparse_matrix)) # it's a scipy object
print(sparse_matrix)

['state_California' 'state_Coloroado' 'state_Indiana' 'state_Missouri']
<class 'scipy.sparse.csr.csr_matrix'>
  (0, 3)	1.0
  (1, 1)	1.0
  (2, 1)	1.0
  (3, 0)	1.0
  (4, 1)	1.0
  (5, 1)	1.0
  (6, 0)	1.0
  (7, 1)	1.0
  (8, 3)	1.0
  (9, 0)	1.0
  (10, 1)	1.0
  (11, 1)	1.0
  (12, 3)	1.0
  (13, 3)	1.0
  (14, 1)	1.0
  (15, 0)	1.0
  (16, 3)	1.0
  (17, 0)	1.0
  (18, 0)	1.0
  (19, 0)	1.0
  (20, 2)	1.0
  (21, 2)	1.0
  (22, 3)	1.0
  (23, 0)	1.0
  (24, 3)	1.0
  (25, 0)	1.0
  (26, 3)	1.0
  (27, 3)	1.0
  (28, 1)	1.0
  (29, 1)	1.0
  (30, 1)	1.0
  (31, 0)	1.0
  (32, 2)	1.0
  (33, 0)	1.0
  (34, 0)	1.0
  (35, 0)	1.0
  (36, 3)	1.0
  (37, 1)	1.0
  (38, 0)	1.0
  (39, 0)	1.0
  (40, 1)	1.0
  (41, 1)	1.0
  (42, 3)	1.0
  (43, 2)	1.0
  (44, 2)	1.0
  (45, 2)	1.0
  (46, 0)	1.0
  (47, 3)	1.0
  (48, 2)	1.0
  (49, 2)	1.0


In [55]:
# the sparse matrix has its own todense() method, so no extra import is needed
dense_matrix = oh.transform(df[['state']]).todense()
print(dense_matrix)

[[0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]


In [3]:
oh = OneHotEncoder(sparse=False)
df[oh.get_feature_names_out()] = oh.fit_transform(df[['state']])
df.head()

Unnamed: 0,state,state_California,state_Coloroado,state_Indiana,state_Missouri
0,Missouri,0.0,0.0,0.0,1.0
1,Coloroado,0.0,1.0,0.0,0.0
2,Coloroado,0.0,1.0,0.0,0.0
3,California,1.0,0.0,0.0,0.0
4,Coloroado,0.0,1.0,0.0,0.0


In [74]:
# Let's introduce a missing value in our state column and run this again.

df.iloc[0, 0] = np.nan

oh = OneHotEncoder(sparse=False, dtype=int)
df[oh.get_feature_names_out()] = oh.fit_transform(df[['state']])
df.head()

# notice we get an encoding for nan

Unnamed: 0,state,state_California,state_Coloroado,state_Indiana,state_Missouri,state_nan
0,,0,0,0,0,1
1,,0,0,0,0,1
2,Coloroado,0,1,0,0,0
3,California,1,0,0,0,0
4,Coloroado,0,1,0,0,0


In [89]:
# we can avoid this by passing in a list of valid categories when initializing our encoding object.
# and setting the handle_unknown param to 'ignore'

# first we'll get rid of the dummy vars we've already created.
df = df[['state']].copy()
df.head()

Unnamed: 0,state
0,
1,
2,Coloroado
3,California
4,Coloroado


In [88]:
# now we ignore any values not explicitly specified when instantiating our OneHotEncoder.

oh = OneHotEncoder(categories=[st], sparse=False, dtype=int, handle_unknown='ignore')
df[oh.get_feature_names_out()] = oh.fit_transform(df[['state']])
df.head()

Unnamed: 0,state,state_California,state_Coloroado,state_Missouri,state_Indiana
0,,0,0,0,0
1,,0,0,0,0
2,Coloroado,0,1,0,0
3,California,1,0,0,0
4,Coloroado,0,1,0,0


#### Note that with something like state, if we had full representation of all states we'd blow up our dimensionality very quickly by one-hot encoding everything.  Perhaps we're only interested in certain states, or maybe there's an attribute associated with the state that we could model instead...for example population, coastal designation, GDP, distance from some point of concern. Perhaps state isn't important at all.
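Two of the alternatives above, sketched on a tiny made-up frame. The `has_ocean_coast` lookup and the `of_interest` list are illustrative assumptions; in practice you'd pull an attribute like this from a reference table.

```python
import pandas as pd

states = pd.DataFrame(
    {'state': ['California', 'Colorado', 'Missouri', 'Indiana', 'California']}
)

# Illustrative lookup: does the state have an ocean coastline?
has_ocean_coast = {'California': 1, 'Colorado': 0, 'Missouri': 0, 'Indiana': 0}

# One numeric column replaces what would otherwise be one dummy column per state.
states['coastal'] = states['state'].map(has_ocean_coast)

# Alternative: keep only the states of interest and lump the rest together.
of_interest = ['California']
states['state_grouped'] = states['state'].where(
    states['state'].isin(of_interest), 'Other'
)

print(states)
```

Either way we end up with far fewer columns than a full one-hot encoding of every state.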

### Ordinal Data

Note that when transforming exogenous features (predictors), it is recommended to use OrdinalEncoder instead of LabelEncoder, as the latter is intended for endogenous (target) variables.  OrdinalEncoder fits data with the shape (n_samples, n_features) while LabelEncoder only fits data with the shape (n_samples,).  You could use LabelEncoder in a loop, but OrdinalEncoder is preferred. 

When working in sklearn you'll notice that the features are usually considered as 2D arrays (n_samples, n_features) while the target is usually considered as a 1D array (n_samples,).
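A quick side-by-side of the two encoders and the shapes they expect. The arrays and category labels here are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Features: 2D, shape (n_samples, n_features)
X = np.array([['Low'], ['High'], ['Very High'], ['Low']])
# Target: 1D, shape (n_samples,)
y = np.array(['no', 'yes', 'yes', 'no'])

# OrdinalEncoder lets us state the category order explicitly, one list per column.
oe = OrdinalEncoder(categories=[['Low', 'High', 'Very High']])
X_enc = oe.fit_transform(X)

# LabelEncoder takes a 1D target and orders its classes by sorting them.
le = LabelEncoder()
y_enc = le.fit_transform(y)

print(X_enc.shape, X_enc.ravel())
print(y_enc.shape, y_enc)
```

Note that `LabelEncoder` has no way to specify an order, which is another reason it's a poor fit for ordinal predictors.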

In [100]:
# 1D array: shape (4,)
print(np.array([1, 2, 3, 4]).shape)

# 2D array with a single column: shape (4, 1)
print(np.array([[1],
                [2],
                [3],
                [4]]).shape)

(4,)
(4, 1)


In [13]:
# first we'll create a df with two ordinal features just for example purposes.
# rating2 also contains an unwanted category ('ugh').
np.random.seed(51)

rt = ['Very High', 'High', 'Low', 'Very Low']
rt2 = ['Very High', 'High', 'Low', 'Very Low', 'ugh']

df = DataFrame({'rating1': [rt[i] for i in np.random.randint(low=0, high=4, size=50)],
               'rating2': [rt2[i] for i in np.random.randint(low=0, high=5, size=50)]})


df.head()

Unnamed: 0,rating1,rating2
0,Low,Very Low
1,High,Very Low
2,High,Very High
3,Very High,Low
4,High,High


In [14]:
# in terms of passing the correct shapes, notice the following behavior.
print(df['rating1'].shape)
print(df[['rating1']].shape)

(50,)
(50, 1)


In [23]:
# Ordinal Data

from sklearn.preprocessing import OrdinalEncoder

# using the default categories='auto' would pick up unknown or unwanted categories (such as 'ugh'),
# so we pass the valid categories, in order, for each column.
oe = OrdinalEncoder(categories=[rt, rt], 
                    dtype=float,
                    handle_unknown='use_encoded_value',
                    unknown_value=np.nan) 

# note that there is a newer param called encoded_missing_value.
# You can set this to np.nan, but the dtype must be float if we use np.nan here.

# this should probably work if you're using a recent sklearn install
# but get_feature_names_out() hasn't been implemented on all transformers yet.

#df[oe.get_feature_names_out()] = oe.fit_transform(df[['rating1', 'rating2']])

df[['rating_1_tr', 'rating_2_tr']] = oe.fit_transform(df[['rating1', 'rating2']])

print(df)

      rating1    rating2  rating_1_tr  rating_2_tr
0         Low   Very Low          2.0          3.0
1        High   Very Low          1.0          3.0
2        High  Very High          1.0          0.0
3   Very High        Low          0.0          2.0
4        High       High          1.0          1.0
5        High   Very Low          1.0          3.0
6   Very High  Very High          0.0          0.0
7        High       High          1.0          1.0
8         Low       High          2.0          1.0
9   Very High        ugh          0.0          NaN
10       High       High          1.0          1.0
11       High       High          1.0          1.0
12        Low   Very Low          2.0          3.0
13        Low   Very Low          2.0          3.0
14       High        Low          1.0          2.0
15  Very High   Very Low          0.0          3.0
16        Low        ugh          2.0          NaN
17  Very High       High          0.0          1.0
18  Very High       High       

### Scaling: Standardization vs Normalization

We'll cover this in the asynch next week, but it's useful to talk about before then.  Many algorithms will require scaled data. This is particularly important in regression and distance-based approaches such as K Nearest Neighbors, as raw variables on different scales can exert an undue impact on our model.  

Standardization and normalization are both types of scaling.  Standardization, or z scoring, is the act of subtracting the feature mean from each observation and dividing by the standard deviation. It results in a distribution with a mean of 0 and standard deviation of 1.  Standardization is unbounded and is best used with Gaussian data.  Standardization is more robust to outliers than min-max scaling.

Normalization is typically used to describe min-max scaling, where we transform observations by subtracting the feature minimum and dividing by the feature range.  Normalization scales the feature's values between 0 and 1 (bounded).  It's best used when your algorithm doesn't have any assumptions about the distribution of the input features.

Scaling is unnecessary for tree-based models, but it's still potentially a good idea.

Generally it's not necessary to scale your target, but it may be useful in some situations.
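To connect the two definitions above to the sklearn scalers, this sketch computes both transformations by hand on a tiny made-up array and checks them against `StandardScaler` and `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [8.0], [100.0]])   # the last value is an outlier

# Standardization: (x - mean) / std, using the population std (ddof=0) as sklearn does.
z_manual = (x - x.mean()) / x.std()
z_sklearn = StandardScaler().fit_transform(x)

# Min-max normalization: (x - min) / (max - min)
mm_manual = (x - x.min()) / (x.max() - x.min())
mm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
print(mm_manual.ravel())
```

Notice how the single outlier squeezes the other min-max values into a narrow band near 0, since the range in the denominator is set entirely by the extremes.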

In [49]:
from sklearn.preprocessing import StandardScaler

df = DataFrame({'feat': np.random.rand(100)*100})

ss = StandardScaler()

df['feat_z'] = ss.fit_transform(df[['feat']])

print(df['feat_z'].mean())
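# pandas .std() uses the sample std (ddof=1) while StandardScaler uses ddof=0,
# so this prints a value slightly above 1.
print(df['feat_z'].std())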

-8.659739592076221e-17
1.005037815259212


In [53]:
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()

df['feat_norm'] = mm.fit_transform(df[['feat']])

print(df['feat_norm'].max())
print(df['feat_norm'].min())

# Note that sklearn Normalizer() operates row-wise and scales samples to unit norm.  We're concerned with normalizing columns
# at the moment and should use MinMaxScaler()

1.0
0.0


### Creative Feature Generation


In [55]:
# This is a relatively simple dataset.  Take a few minutes to explore and transform the data appropriately.
# Plot the distribution of the raw wage column.  Apply a log transformation on this column and plot it again.

data = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/Wage.csv")

# It would be unusual to have this feature created already.  We'll drop it and recreate later.
data.drop('logwage', axis=1, inplace=True)
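As a hint for the log transformation step, here is a minimal sketch on a synthetic stand-in. The `wages` frame is generated from a lognormal distribution and does not contain the real Wage.csv values; on the actual data you would apply the same `np.log` call to `data['wage']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for a wage column (not the real dataset).
wages = pd.DataFrame({'wage': rng.lognormal(mean=4.6, sigma=0.4, size=3_000)})

# Log transform; the values are strictly positive, so np.log is safe here.
wages['logwage'] = np.log(wages['wage'])

# Skewness near 0 suggests a roughly symmetric distribution.
print(wages[['wage', 'logwage']].skew())
```

Plotting a histogram of each column, for example with `wages['wage'].hist()`, should show the long right tail of the raw values pulled in after the transformation.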