# Machine Learning Pipelines

Many models were run and compared using different combinations of data pre-processing techniques, degrees of feature selection, and hyperparameter settings. This section outlines the experiments and interprets their results. The main challenge was developing models that could *actually* run without crashing the Python kernel. The computational cost of these algorithms is a recurring theme and is revisited often in the discussion.

## Pre-Processing Techniques

Several techniques were employed to pre-process the data in various ways and for different purposes. These are discussed in the sections below.

### Feature Engineering

See section XXX for detailed explanations of the methods used to re-engineer and transform features in the various source tables. Comparing untransformed and transformed datasets was one of the questions the ML experiments addressed.

### Data Type Optimization

When aggregated, the dataset is very large, occupying approximately 2.5 GB. Read in as-is, its memory footprint made it a cumbersome object on which to run ML pipelines. The `reduce_mem_usage()` function is applied immediately upon data read-in to counter this. Where possible, this function changes a given column's default datatype to one with a smaller memory footprint. For example, an `int64` column composed of only 1s and 0s (i.e., a one-hot-encoded column) can be converted to an `int8` column with no data loss and a significant memory reduction.

This operation typically reduced the memory size of the imported dataset by ~70%, making it much easier to work with. The function is provided below with an example output. Credit to the publisher: https://pythonsimplified.com/how-to-handle-large-datasets-in-python-with-pandas/

```python
import numpy as np

def reduce_mem_usage(df):
    """Downcast each column to the smallest dtype whose range holds its values."""
    # memory_usage() returns bytes; 1024**3 bytes = 1 GB
    start_mem = df.memory_usage().sum() / 1024**3
    print('Memory usage of dataframe is {:.2f} GB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # choose the narrowest integer type that contains [c_min, c_max]
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                # note: float16 and float32 keep fewer significant digits than float64
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object (string) columns are stored as categoricals
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**3
    print('Memory usage after optimization is: {:.2f} GB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
```

[Example Output]:

<img src="../images/mem_reducer.png" alt="drawing" width="300"/>
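
The effect of a single downcast can be checked in isolation. The snippet below is a minimal sketch on a hypothetical one-hot-style column (the column name `flag` and the row count are assumptions for illustration, not project data): it measures the column's memory before and after converting `int64` to `int8` and confirms the values are unchanged.

```python
import numpy as np
import pandas as pd

# hypothetical one-hot-style column: 100,000 values that are all 0 or 1
flags = pd.DataFrame({"flag": np.tile([0, 1], 50_000).astype(np.int64)})

# bytes used by the column data alone (index excluded)
bytes_int64 = flags["flag"].memory_usage(index=False)

# downcast to the 1-byte integer type
small = flags["flag"].astype(np.int8)
bytes_int8 = small.memory_usage(index=False)

# 0 and 1 both fit in int8, so no value changes
values_unchanged = bool((small == flags["flag"]).all())

print(bytes_int64, bytes_int8, values_unchanged)
```

An `int64` value occupies 8 bytes and an `int8` value 1 byte, so this particular column shrinks by a factor of eight; the overall reduction on a real dataset depends on its mix of column types.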

### Collinearity Reduction

During Phase 2 exploratory data analysis, multicollinearity between input variables was shown to be prevalent in most of the data tables. The aggregation process increased it further, as each numeric input variable was expanded into its mean, median, and variance counterparts (among other aggregation types). Introducing these new and possibly redundant variables to the ML algorithms significantly increases computation time and often reduces model accuracy.

Used in conjunction with the `DataFrameSelector()` transformer class, the `CollinearityReducer()` transformer class combats multicollinearity by removing the *most* (multi-) collinear columns which are *least* correlated to the target variable. The algorithm steps are described below:

1. Given input variables *X* and target variable *y*, calculate the correlation matrix
2. Pivot the correlation matrix into a long dataframe of correlation pairs and values (input variable 1, input variable 2, absolute correlation) and drop the following variable pairs:
   + any pair including the target variable
   + any pair with two of the same input variable
   + any pair with an absolute correlation value below a specified threshold value (default is 0.5)
3. For each variable pair, compare to see which variable is *more* correlated to the target variable. These input variables are given a 'win' while the input variable in the pair which is *less* correlated to the target variable is given a 'loss'. 
4. Count the total 'wins' and 'losses' for each variable and drop from the original dataframe any variable with 0 wins. 
5. Repeat steps 1 to 4 until there are no more input variable pairs with correlations above the threshold, no more input variables scoring 0 wins, or the specified maximum number of iterations has been reached. 
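
The steps above can be sketched compactly on a toy dataset. This is a simplified illustration rather than the project's `CollinearityReducer()` class: the helper names `competition_round` and `reduce_collinearity` and the toy columns `a`, `b`, `c` are hypothetical, and ties are awarded to the first variable of the pair as an assumption.

```python
import pandas as pd

def competition_round(X, y, threshold=0.5):
    """Return the columns that one round of the 'correlation competition' drops."""
    pair_corr = X.corr().abs()           # step 1: feature-feature correlations
    target_corr = X.corrwith(y).abs()    # step 1: feature-target correlations
    cols = list(X.columns)
    wins = {col: 0 for col in cols}
    competed = set()
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:       # step 2: each distinct pair once
            if pair_corr.loc[left, right] > threshold:
                competed.update((left, right))
                # step 3: the variable more correlated with the target wins
                winner = left if target_corr[left] >= target_corr[right] else right
                wins[winner] += 1
    # step 4: drop variables that competed but never won
    return [col for col in cols if col in competed and wins[col] == 0]

def reduce_collinearity(X, y, threshold=0.5, max_iter=10):
    """Repeat rounds until nothing is dropped or max_iter is reached (step 5)."""
    kept = X
    for _ in range(max_iter):
        losers = competition_round(kept, y, threshold)
        if not losers:
            break
        kept = kept.drop(columns=losers)
    return list(kept.columns)

# toy data built from mutually orthogonal, zero-mean vectors
a = pd.Series([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
e = pd.Series([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
c = pd.Series([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
X_toy = pd.DataFrame({"a": a, "b": a + 0.5 * e, "c": c})  # b is a noisy copy of a
y_toy = a + c

kept_columns = reduce_collinearity(X_toy, y_toy)
print(kept_columns)
```

Here `a` and `b` are strongly correlated with each other, and `a` is the more correlated with the target, so `b` loses its only pairing and is dropped; `c` never enters a pair above the threshold and is kept.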

The `CollinearityReducer()` thus applies a common remedy for multicollinearity: drop the variable which is least correlated to the target (*An Introduction to Statistical Learning*, Chapter 3). This is no simple task for a human to perform on a high-dimensional dataset, so the class provides a mechanical alternative to manual selection. The `transform` method of this class returns a list of attribute names from the original dataframe which are *to be kept*; multicollinear columns culled by the `CollinearityReducer()` are *not* included in the output. This output list is then used as the input for the `DataFrameSelector()` class in the pipeline. Because the `CollinearityReducer()` selects columns based only on the training data, there is no leakage from the validation or test data.

Notably, this algorithm is applied only to numerical variables, as some elevated degree of collinearity is naturally expected between one-hot-encoded categorical variables. Furthermore, the algorithm should be applied to numerical data that has been subjected to the same scaling and imputation strategy as the actual pipeline.

Both the `CollinearityReducer()` and `DataFrameSelector()` classes are shown below along with a basic example of their usage.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# transformer reduces the list of columns by a subset
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# transformer produces a reduced column list by collinearity reduction
class CollinearityReducer(BaseEstimator, TransformerMixin):
    
    '''
    This class reduces features by measuring collinearity between the input variables and target.
    Works on numerical features based on the correlations between each variable pair.
    Of the variable pairs with absolute correlations above the threshold value, the variables with the lowest target variable correlation are dropped from the input X.
    Repeat until no more collinear pairs with absolute correlations above the threshold or max_iter. 
    
    Inputs:
       X (numpy array) - input variables
       y (numpy array) - target variable
       attribute_names (list) - column names of the input variables (in original dataframe order)
       threshold (float) - the absolute correlation threshold above which variable pairs are subject to the 'correlation competition'
       max_iter (int or None) - the maximum number of iterations; None runs until nothing more is dropped
       
    Output:
       list - attribute_names to be used by DataFrameSelector class
    '''
    
    def __init__(self, attribute_names, threshold=0.5, max_iter=None):
        self.attribute_names = attribute_names
        self.threshold = threshold
        self.max_iter = max_iter
            
    def fit(self, X, y):
        # keep the target here: Pipeline.fit_transform() calls transform(X) without y
        self.y_ = pd.Series(np.ravel(y), name='target')
        return self
    
    def transform(self, X, y=None): 
        
        # rebuild both parts on a fresh 0..n-1 index so rows align; the target is column 0
        target = self.y_ if y is None else pd.Series(np.ravel(y), name='target')
        dataframe = pd.concat([target, pd.DataFrame(np.asarray(X))], axis=1)
        
        i = 0
        # max_iter=None means "iterate until nothing more is dropped"
        while self.max_iter is None or i < self.max_iter:

            # read-in and assign columns
            # gets correlation matrix between variables and pivots to a longer df
            # identify target variable
            # drop same-name and target correlations pairs
              
            df = dataframe
            features = df.iloc[:,1:].columns
            target_name = df.iloc[:,0].name

            df = pd.melt(abs(df.corr()).reset_index(), id_vars='index', value_vars=features)
            targets = df[df['index']==target_name]
            df = df[(df['index'] != df['variable']) & (df['index'] != target_name) & (df['variable'] != target_name)]

            # combine the correlated variables into ordered pairs
            # aggregate the max correlation and sort pairs
            # split out the variables from the pair
            # join the target variable correlations for each variable pair

            df['joined'] = df[['index', 'variable']].apply(lambda row: '::'.join(np.sort(row.values.astype(str))), axis=1)

            df = df.groupby('joined', as_index=False) \
                   .agg({'value':'max'}) \
                   .sort_values(by='value', ascending=False)

            df[['var_1','var_2']] = df['joined'].str.split("::",expand=True).astype(int)

            df = df.merge(targets, how='left', left_on='var_1', right_on='variable') \
                   .merge(targets, how='left', left_on='var_2', right_on='variable')
            df.rename(columns = {'value_x':'var_pair_corr', 'value_y':'var_1_target_corr', 'value':'var_2_target_corr'}, inplace = True)

            # Take only variable pairs with a correlation greater than threshold
            # determine which variable has a higher correlation with the target.
            # The higher of the two gets marked as a win
            # While the other gets marked as a loss
            # the wins and losses for each variable are then grouped and summed

            # copy so the new columns below are not written to a view of df
            exceeds = df[df['var_pair_corr']>self.threshold].copy()

            # break if none above threshold
            if len(exceeds['var_pair_corr'])==0:
                break

            # "correlation competition"
            exceeds['var_1_win'] = exceeds.apply(lambda row: 1 if row["var_1_target_corr"] >= row["var_2_target_corr"] else 0, axis=1)
            exceeds['var_1_loss'] = exceeds.apply(lambda row: 1 if row["var_2_target_corr"] >= row["var_1_target_corr"] else 0, axis=1)
            exceeds['var_2_win'] = exceeds.apply(lambda row: 1 if row["var_1_target_corr"] < row["var_2_target_corr"] else 0, axis=1)
            exceeds['var_2_loss'] = exceeds.apply(lambda row: 1 if row["var_2_target_corr"] < row["var_1_target_corr"] else 0, axis=1)

            # aggregate scores
            var1 = exceeds[['var_1', 'var_1_win', 'var_1_loss']].groupby('var_1', as_index=False) \
                                                                .agg({'var_1_win':'sum', 'var_1_loss':'sum'})
            var1.rename(columns = {'var_1':'var', 'var_1_win':'win', 'var_1_loss':'loss'}, inplace=True)

            var2 = exceeds[['var_2', 'var_2_win', 'var_2_loss']].groupby('var_2', as_index=False) \
                                                                .agg({'var_2_win':'sum', 'var_2_loss':'sum'})
            var2.rename(columns = {'var_2':'var', 'var_2_win':'win', 'var_2_loss':'loss'}, inplace=True)

            corrcomps = pd.concat([var1,var2], axis=0).groupby('var', as_index=False) \
                                                      .agg({'win':'sum', 'loss':'sum'})

            # drop variables which had 0 wins
            # IE collinear variables which were always least related to the target
            dropvars = corrcomps[corrcomps['win']==0]['var']

            # stop once a pass removes nothing (step 5 of the algorithm)
            if dropvars.empty:
                break

            dataframe = dataframe.drop(dropvars, axis=1)  

            i += 1  
        
        # skip column 0 (the target); the remaining integer labels index attribute_names
        X = [self.attribute_names[col] for col in dataframe.columns[1:]]

        return X
    
    
### Example Usage ###

# determine feature types, reduce numerical features by collinearity reduction
id_col, feat_num, feat_cat, feature =  id_num_cat_feature(X_train, text = False)

cr = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),    
    CollinearityReducer(attribute_names=feat_num, threshold = 0.7, max_iter=2)
)

reduced_feat_num = cr.fit_transform(X_train[feat_num], y_train)

# Pipeline

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(reduced_feat_num)),
    ('imputer',SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])
```

## Hyperparameter Tuning

Hyperparameter tuning was conducted using `GridSearchCV` to methodically test different combinations of hyperparameters. Identical pipelines were also run for various combinations of `CollinearityReducer()` hyperparameters in order to tune this transformer. Often, only a subset of the data was used to train and tune models due to the size of the dataset, which kept computation time manageable. Appealing to the law of large numbers, the optimal hyperparameters found on these 'micro pipes' were treated as good proxies for the optimal hyperparameters of pipelines using the full dataset.
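
As an illustration of how such a subset might be drawn, the sketch below keeps 40% of a synthetic dataset using `train_test_split`. The data, the column names, and the choice to stratify on the target are assumptions made for this example; the sampling method actually used for the 'micro pipes' is not shown in this section.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical stand-in for the aggregated training data (10% positive class)
rng = np.random.default_rng(0)
X_full = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_full = pd.Series(np.repeat([0, 1], [900, 100]), name="TARGET")

# keep 40% of the rows; stratify so the subset has the same class balance
X_micro, _, y_micro, _ = train_test_split(
    X_full, y_full, train_size=0.4, stratify=y_full, random_state=42
)

print(len(X_micro), y_micro.mean())
```

Stratifying keeps the share of positive cases in the subset close to that of the full data, which matters when the subset's scores are used as a proxy for the full pipeline.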

An example of a "base" pipeline upon which various parameters and models were tuned is shown below. The area under the ROC curve (AUC-ROC) is the scoring metric used, as it provides a better measure of fit than the accuracy score. An AUC-ROC score of 0.5 indicates a model that is no better than random guessing, while an AUC-ROC score of 1 represents a perfect model. For reference, the baseline logistic regression model achieved an AUC-ROC score of ~0.76.

```python
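# example basic pipeline
# DataFrameSelector, reduced_feat_num, feat_num and feat_cat come from the collinearity reduction example
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
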
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(reduced_feat_num)),  # use only if CollinearityReducer() implemented
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
    ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])

data_pipeline = ColumnTransformer(
    transformers=[
        ("num_pipeline", num_pipeline, feat_num),
        ("cat_pipeline", cat_pipeline, feat_cat)
    ],
    remainder='drop',
    n_jobs=-1
)

full_pipeline_with_predictor = Pipeline([
    ("preparation", data_pipeline),
    ("classifier", Classifier())  # placeholder: substitute the ML classifier being tuned
])

param_1 = [5, 10, 25, 50, 100]
param_2 = [0.1, 1, 10]

parameters = dict(
    classifier__param_1 = param_1,
    classifier__param_2 = param_2
)

grid = GridSearchCV(
    full_pipeline_with_predictor, param_grid = parameters, 
    cv = 3, n_jobs = 4, scoring = 'roc_auc', verbose = 2
)

grid.fit(X_train, y_train)
```
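
The AUC-ROC reference points mentioned above can be reproduced on a tiny hand-made example. The labels and scores here are invented purely for illustration.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# every positive is scored above every negative
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# every case gets the same score, so the ranking carries no information
auc_constant = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

# one positive (0.35) is scored below one negative (0.4)
auc_partial = roc_auc_score(y_true, [0.1, 0.4, 0.35, 0.8])

print(auc_perfect, auc_constant, auc_partial)
```

The score can be read as the share of (positive, negative) pairs that the model ranks correctly: all four pairs in the first case, and three of four in the last.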

### Logistic Regression Model Tuning

The baseline pipeline is a Logistic Regression pipeline with numerical and categorical column transformers. A number of pipeline experiments were conducted and the results of these are shown below. For the sake of brevity, code is only included for the pipelines which gave the most impactful insights into hyperparameter tuning; otherwise, only the results of the experimental round are shown, along with bulleted explanations of the pipeline parameters and results.
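
For orientation, an L1 versus L2 comparison of the kind reported below can be expressed as a small grid over the `penalty` parameter. This is a minimal sketch on synthetic data, not the project's pipeline: the dataset, the `liblinear` solver (chosen here because it supports both penalties), and the `C` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

lr_pipeline = Pipeline([
    ("std_scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="liblinear")),
])

# compare both penalties at several regularization strengths
lr_parameters = {
    "classifier__penalty": ["l1", "l2"],
    "classifier__C": [0.1, 1, 10],
}

lr_grid = GridSearchCV(lr_pipeline, param_grid=lr_parameters, cv=3, scoring="roc_auc")
lr_grid.fit(X_demo, y_demo)

print(lr_grid.best_params_, round(lr_grid.best_score_, 3))
```

`lr_grid.cv_results_` holds the mean cross-validated AUC-ROC for every combination, which is the kind of table a penalty comparison is read from.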

#### Logistic Regression with Bureau and Application Data

+ These pipelines were tested using only data aggregated from `bureau.csv` and `bureau_balance.csv`, and from `application_train.csv`
+ Used 40% of available data
+ Both pipelines used logistic regression with L1 regularization
+ **Comparisons:**
   + untransformed vs transformed variables 
   + L1 vs L2 regularization
+ **Interpretations:** 
   + The pipeline *without* the transformed data performs better
   + L1 was the preferred regularization penalty for *both* pipelines

<img src="../images/tunes/LR_appbureau_L1_v_L2_trans_v_untrans.png" alt="drawing" width="500"/>

#### Logistic Regression with All Untransformed Data - Regularization

+ These pipelines were tested using data aggregated from all the child datasets, and from `application_train.csv`
+ Used 40% of available data
+ Both pipelines used untransformed data (not feature engineered) with collinearity reduction `(threshold=0.7, max_iter=2)`
+ **Comparisons:**
   + L1 vs L2 regularization
+ **Interpretations:** 
   + The pipeline with L1 regularization performed better

<img src="../images/tunes/LR_agg1293_CR_noreg_v_l1.png" alt="drawing" width="500"/>

#### Logistic Regression with All Untransformed Data - Collinearity Reduction

+ These pipelines were tested using data aggregated from all the child datasets, and from `application_train.csv`
+ Used 40% of available data
+ All pipelines used untransformed data (not feature engineered) with L1 regularization
+ **Comparisons:**
   + Collinearity Reduction
+ **Interpretations:** 
   + The best-performing pipeline is the one with `CollinearityReducer(threshold=0.5, max_iter=10)`

<img src="../images/tunes/LR_agg1293_L1_CR_comparison.png" alt="drawing" width="500"/>

### Nonparametric Model Tuning

Nonparametric classification models (as opposed to *parametric* classification models like logistic regression) do not rely on assumptions about the underlying form of the data. This can be useful for high-dimensional data. Decision Tree ("DT") models build trees of binary decisions that partition the feature space and assign each region a target class. Used together in an ensemble (many trees combined), these models can be very powerful. In this project, we explore two types of nonparametric models to compare and tune: *Random Forest* DT models and *Gradient Boosted* DT models.

> + In **Random Forest Models**, the trees are grown independently on random samples of the observations. However, each split on each tree is performed using a random subset of the features, thereby decorrelating the trees, and leading to a more thorough exploration of model space relative to bagging. This algorithm combines the output of multiple (randomly created) Decision Trees to generate the final output. In this algorithm, each node in the decision tree is grown based on a random subset of the input features.
> + In **Gradient Boosted Models**, we only use the original data, and do not draw any random samples. The trees are grown successively, using a “slow” learning approach: each new tree is fit to the signal that is left over from the earlier trees, and shrunken down before it is used. These trees are incrementally added to an ensemble by training each new instance to emphasize the training instances previously mis-modeled.
> 
> *description excerpts from course notes*

Tuning and testing of these models are explored below.
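
Before the experiments, the two ensemble styles can be contrasted in a few lines. This is a hedged sketch on synthetic data: scikit-learn's `GradientBoostingClassifier` stands in for the XGB classifier used in the experiments, and the dataset and tree counts are assumptions. The Random Forest depth and leaf-size values are borrowed from the tuning results reported below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the training data
X_trees, y_trees = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

tree_models = {
    # independent trees, each grown on a bootstrap sample with random feature subsets
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=25, min_samples_leaf=25, random_state=0
    ),
    # trees grown one after another; learning_rate shrinks each tree's contribution
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=50, learning_rate=0.1, random_state=0
    ),
}

# mean 3-fold cross-validated AUC-ROC for each model
tree_scores = {
    name: cross_val_score(model, X_trees, y_trees, cv=3, scoring="roc_auc").mean()
    for name, model in tree_models.items()
}

print(tree_scores)
```

Which model scores higher on this toy data says nothing about the project data; the point is only that both can be dropped into the same scoring loop and compared on AUC-ROC.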

#### Random Forests and Gradient Boosting with Bureau and Application Data

+ These pipelines were tested using data aggregated from `bureau.csv` and `bureau_balance.csv`, and from `application_train.csv`
+ Used 40% of available data for the first 4 experiments, and 80% of data for the last two experiments
+ **Comparisons:**
   + Random Forest and XGB (gradient boosted) classifiers
   + Collinearity Reduction (RF only)
   + Transformed vs Untransformed
   + training size 
   + Random Forest Hyperparameters
+ **Interpretations:** 
   + The best-performing pipeline is the one with `CollinearityReducer(threshold=0.5, max_iter=10)`, using the Random Forest algorithm on transformed data with `{'rf__max_depth': 25, 'rf__min_samples_leaf': 25}`
   + The `CollinearityReducer()` helped the Random Forest algorithm
   + Transformed data tended to do better than untransformed data for both Random Forest and XGB algorithms
   + XGB generally outperformed Random Forest
   + Larger training dataset improved the XGB scores

<img src="../images/tunes/RF_XGB_appbureau_comp_CR_trans.png" alt="drawing" width="500"/>

#### Random Forests and Gradient Boosting with All Data Except Credit Card

+ These pipelines were tested using data aggregated from all the child datasets (except credit card), and from `application_train.csv`
+ Used 10% of available data
+ Used transformed data
+ **Comparisons:**
   + Random Forest and XGB (gradient boosted) classifiers
   + Collinearity Reduction
   + Random Forest Hyperparameters
+ **Interpretations:** 
   + The best-performing pipeline is the XGB algorithm `{'xgb__subsample': 0.8}` with `CollinearityReducer(threshold=0.5, max_iter=25)`
   + The `CollinearityReducer()` generally helped the XGB algorithm but generally hurt the Random Forest algorithm
   + XGB generally outperformed Random Forest

<img src="../images/tunes/RF_XGB_agg_noccb_trans_comp_CR.png" alt="drawing" width="500"/>

#### Gradient Boosting with All Data Except Credit Card

+ These pipelines were tested using data aggregated from all the child datasets (except credit card), and from `application_train.csv`
+ Used 10% of available data
+ **Comparisons:**
   + Collinearity Reduction
   + Transformed vs untransformed data
+ **Interpretations:** 
   + The best-performing pipeline is the XGB algorithm `{'xgb__subsample': 0.8}` with no collinearity reduction on untransformed data
   + The `CollinearityReducer()` generally improved performance with more iterations, but generally did not perform as well as models with a higher subsample parameter value *without* the `CollinearityReducer()`
   + Untransformed data outperformed transformed data

<img src="../images/tunes/XGB_agg_trans_comp_CR.png" alt="drawing" width="500"/>

<img src="../images/tunes/XGB_agg_comp_CR.png" alt="drawing" width="500"/>
