## ***Intro***

After the previous feature engineering step, we need to reduce the 2526 features to a more reasonable number so that the models stay as simple as possible, are less prone to overfitting, and are not slowed down by irrelevant features. We may also drop some noisy training examples (if any) and implement a solution to handle null and NaN values.

This notebook is the direct continuation of the [automatic feature engineering notebook](https://www.kaggle.com/code/diarray/autofeatureengineering-dfs). Here we drop the least informative features: columns in which 70% or more of the values are null, columns with variance below 0.2, and columns highly correlated with another feature (absolute correlation coefficient of 0.8 or more). We then select the top 100 of the remaining features based on Information Gain (mutual information), estimated from K-Nearest Neighbors distances.
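
As a rough illustration of the first two filters described above, the sketch below applies them to a toy DataFrame. The column names and values are invented for the example, and `ddof=0` is used because scikit-learn's `VarianceThreshold` works with the population variance.

```python
import numpy as np
import pandas as pd

# Toy data: one mostly-empty column, one nearly constant column, one useful column
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],   # 75% null
    "almost_constant": [1.0, 1.0, 1.0, 1.2],           # variance 0.0075
    "informative": [0.0, 2.0, 4.0, 6.0],               # variance 5.0
})

# Filter 1: keep columns whose share of null values is under 70%
null_ratio = toy.isnull().mean()
kept = toy.loc[:, null_ratio < 0.7]

# Filter 2: keep columns whose population variance exceeds 0.2
variances = kept.var(ddof=0)
kept = kept.loc[:, variances > 0.2]

print(list(kept.columns))  # ['informative']
```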

## ***Imports***

In [None]:
# For data manipulation and linear algebra
import pandas as pd
import numpy as np

# for file manipulation
import glob
# for Garbage collecting
import gc
# To filter warnings
import warnings
# for serialization
import pickle
# To tune the default parameter of mutual_info_classif
from functools import partial

# For feature selection
from sklearn.feature_selection import SelectKBest, mutual_info_classif, VarianceThreshold
from sklearn.preprocessing import LabelEncoder

## ***Functions***

In [None]:
def concatenate_fms(fm_paths, columns=None):
    """
    Concatenate the feature matrices. To avoid out-of-memory errors, intermediate DataFrames are deleted and garbage collected as we go.
    
    Parameters:
    fm_paths (list of str): List of file paths to the parquet files.
    columns (pd.Index, list[str], optional): The specific columns to read from each parquet file. If None, all columns are read.
    
    Returns:
    pd.DataFrame: The concatenated Feature matrix.
    """
    feature_matrix = pd.read_parquet(fm_paths[0], columns=columns)
    for fm in fm_paths[1:]:
        intermediate = pd.read_parquet(fm, columns=columns)
        feature_matrix = pd.concat([feature_matrix, intermediate])
        del intermediate
        gc.collect()
    return feature_matrix

def drop_toomany_nan(X_df, threshold=0.7):
    """
    Drops columns from a DataFrame whose share of missing values reaches a specified threshold.
    
    Args:
        X_df (pandas.DataFrame): The DataFrame to process. It is modified in place.
        threshold (float, optional): The fraction of missing values at or above which a column is dropped. Defaults to 0.7 (70%).
    
    Returns:
        pandas.DataFrame: The DataFrame without the columns that reach the threshold of missing values.
    """
    # Calculate the percentage of missing values per column
    missing_percentages = X_df.isnull().mean()
    # Drop columns exceeding the threshold
    columns_to_drop = missing_percentages[missing_percentages >= threshold].index
    X_df.drop(columns=columns_to_drop, inplace=True)
    return X_df

def fillna(X_df):
    """
    Fills missing values in a DataFrame with specified placeholders.
    
    Numerical columns are filled with -9999 and non-numerical columns 
    are filled with "<N/A>".

    Args:
        X_df (pandas.DataFrame): The DataFrame with potential missing values.
    
    Returns:
        pandas.DataFrame: The DataFrame with missing values filled.
    """
    # Replace the remaining NAN values with a placeholder
    num_placeholder = -9999
    others_placeholder = "<N/A>"
    non_num_cols = set(X_df.select_dtypes(exclude="number").columns)
    num_cols = set(X_df.columns) - non_num_cols
    
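    # Categorical columns only accept declared categories, so register the placeholder first
    for col in X_df.select_dtypes(include='category').columns:
        if others_placeholder not in X_df[col].cat.categories:
            X_df[col] = X_df[col].cat.add_categories(others_placeholder)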
        
    X_df[list(num_cols)] = X_df[list(num_cols)].fillna(num_placeholder)
    X_df[list(non_num_cols)] = X_df[list(non_num_cols)].fillna(others_placeholder)
    return X_df

def to_num(X_df, label_encoders=None):
    """
    Encodes categorical and boolean features in a pandas DataFrame to numerical representations.
    This function performs two main tasks:
    
    1. Converts boolean columns to integers (False -> 0, True -> 1).
    2. Encodes object and categorical columns as integers using label encoding.
    
    Args:
        X_df (pandas.DataFrame): The DataFrame containing the features to be converted. It is modified in place.
        label_encoders (dict, optional): A dictionary mapping column names to the LabelEncoder instances used
          to encode the respective columns. If None, the function creates this dictionary and returns it.
    
    Returns:
        pandas.DataFrame: The DataFrame with encoded features (if label_encoders is not None).
        tuple[pandas.DataFrame, dict]: A tuple containing the DataFrame with encoded features and a dictionary of fitted LabelEncoders (if label_encoders is None).
    """
    # Step 1: Convert boolean values to their integer representation
    bool_cols = X_df.select_dtypes(include="bool").columns
    X_df[bool_cols] =  X_df[bool_cols].astype(int)

    # Step 2: Convert categorical values to their integer representation using LabelEncoder
    if label_encoders is not None:
        for col, le in label_encoders.items():
            if col in X_df.columns:
                # Some categories may have been missing from the subset used during feature selection.
                # This block handles such cases without having to re-fit the LabelEncoder
                try:
                    X_df[col] = le.transform(X_df[col])
                except ValueError:
                    # Append the unseen categories after the known ones, so that the codes
                    # already assigned to known categories do not change
                    known_classes = set(le.classes_)
                    new_classes = [c for c in X_df[col].unique() if c not in known_classes]
                    le.classes_ = np.concatenate([le.classes_.astype(object), np.array(new_classes, dtype=object)])
                    X_df[col] = le.transform(X_df[col])
        return X_df
    
    label_encoders = {}  # To keep track of label encoders for each categorical column
    for col in X_df.select_dtypes(include=['object', "category"]).columns:
        le = LabelEncoder()
        X_df[col] = le.fit_transform(X_df[col])
        label_encoders[col] = le
    return X_df, label_encoders

def drop_low_variance(X_df, threshold=0.2):
    """
    Drops features from a pandas DataFrame with low variance (below a threshold).
    
    Args:
        X_df (pandas.DataFrame): The DataFrame containing the features (numeric columns only).
        threshold (float, optional): The minimum variance threshold to keep a feature. Features with variance below this threshold are removed. Defaults to 0.2.
    
    Returns:
        pandas.DataFrame: The DataFrame with low-variance features removed.
    """
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(X_df)
    # Boolean mask over the columns: True for the features that are kept, False for those that are dropped
    support_mask = selector.get_support()
    # Use the mask to get the names of the remaining columns
    selected_columns = X_df.columns[support_mask]
    return X_df[selected_columns]

def drop_redundant_features(X_df, threshold=0.8):
    """
    Drops features from a DataFrame that are highly correlated (above the threshold)
    with another feature, keeping only one of the correlated features.
    For each correlated pair, the column that appears later in the DataFrame is dropped.
    
    Args:
        X_df (pandas.DataFrame): The DataFrame containing the features. It is modified in place.
        threshold (float, optional): The absolute correlation above which features are considered highly correlated. Defaults to 0.8.
    
    Returns:
        pandas.DataFrame: The DataFrame with redundant features removed.
    """
    corr_matrix = X_df.corr()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if any(upper[col].abs() > threshold)]
    X_df.drop(columns=to_drop, inplace=True)
    return X_df

def selectKbestbyIG(X_df, target, k=100, n_neighbors=5):
    """
    Selects the top K features from a pandas DataFrame for classification using Information Gain (mutual information).
    
    Args:
        X_df (pandas.DataFrame): The DataFrame containing the features (numeric, without missing values).
        target (pandas.Series): The target variable column.
        k (int, optional): The number of top features to select based on information gain. Defaults to 100.
        n_neighbors (int, optional): The number of neighbors used by mutual_info_classif to estimate mutual information. Defaults to 5.
    
    Returns:
        pandas.DataFrame: The DataFrame containing the top K features selected using mutual information.
    """
    # Create a partial function to set the number of neighbors
    mutual_info_func = partial(mutual_info_classif, n_neighbors=n_neighbors)
    
    selector = SelectKBest(mutual_info_func, k=k)
    selector.fit(X_df, target)
    # Boolean mask over the columns: True for the features that are kept, False for those that are dropped
    support_mask = selector.get_support()
    # Use the mask to get the names of the remaining columns
    selected_columns = X_df.columns[support_mask]
    return X_df[selected_columns]
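
To see on a small scale what the mutual-information selection in `selectKbestbyIG` does, here is a sketch on synthetic data. The feature names and the data are invented for illustration, and `random_state` is fixed only so that the scores are repeatable; the function above does not set it.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)

# "signal" follows the target closely; the two noise columns are unrelated to it
X = pd.DataFrame({
    "signal": y + rng.normal(scale=0.1, size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})

# Same pattern as in the function: bind n_neighbors with functools.partial
score_func = partial(mutual_info_classif, n_neighbors=5, random_state=0)
selector = SelectKBest(score_func, k=1).fit(X, y)

# Estimated mutual information per feature, and the single feature that was kept
scores = pd.Series(selector.scores_, index=X.columns)
top_features = list(X.columns[selector.get_support()])
print(scores.sort_values(ascending=False))
print(top_features)
```

The feature tied to the target should receive a much higher score than the noise columns, which is the ranking the selection step relies on.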

## ***Feature Selection***

**Use a subset of 32 feature matrices (about 25% of the dataset) to perform feature selection. The selected features are then applied to the entire dataset.**
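
If a reproducible subset is preferred, one option is to sample the paths with a seeded random generator. This is only a sketch: the file names below are hypothetical, and the count of 128 files is an assumption chosen so that 32 files make up 25%.

```python
import random

# Hypothetical file names; in the notebook the real paths come from glob
all_paths = [f"feature_matrices/fm_{i}.parquet" for i in range(128)]

# A seeded generator returns the same subset on every run
rng = random.Random(42)
subset_paths = rng.sample(all_paths, k=len(all_paths) // 4)

print(len(subset_paths))  # 32
```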

In [None]:
# Paths to the feature matrices
fms_paths = glob.glob("/kaggle/input/deep-feature-synthesis-home-credit-stability/feature_matrices/*")
# glob does not sort its results, so the first 32 paths form an arbitrary subset (approximately 25% of the entire dataset)
fms_paths[:10]

In [None]:
warnings.filterwarnings("ignore")

In [None]:
subset = concatenate_fms(fms_paths[:32])

In [None]:
subset.shape

In [None]:
# "drop_toomany_nan" used to include "fillna" when the feature selection was originally run.
# The two are now separate functions, so "fillna" is called explicitly to reproduce that behavior
subset = drop_toomany_nan(X_df=subset)
subset = fillna(subset)

In [None]:
subset.isnull().sum().sum()

In [None]:
# Encode non-numeric features
subset, lbl_encorders = to_num(subset)

In [None]:
subset

In [None]:
# Drop features with low variance
subset = drop_low_variance(subset)

In [None]:
subset

In [None]:
corr_matrix = subset.corr()
corr_matrix

In [None]:
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper

In [None]:
to_drop = [col for col in upper.columns if any(upper[col].abs() >= 0.8)]

In [None]:
print(len(to_drop))
# Drop redundant features
subset.drop(columns=to_drop, inplace=True)
subset
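
The upper-triangle trick used in the cells above can be checked on a toy example. The columns below are invented for illustration: `a_scaled` is a linear function of `a`, so the pair is perfectly correlated, while `b` is independent noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 1,       # perfectly correlated with "a"
    "b": rng.normal(size=200),   # unrelated to the other two
})

corr = toy.corr()
# Keep only the entries strictly above the diagonal, so each pair is inspected once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it is highly correlated with any column placed before it
to_drop = [col for col in upper.columns if (upper[col].abs() >= 0.8).any()]
print(to_drop)  # ['a_scaled']
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is dropped.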

In [None]:
target = pd.read_parquet("/kaggle/input/home-credit-credit-risk-model-stability/parquet_files/train/train_base.parquet", 
                         columns=["case_id", "target"])
target

In [None]:
target = target[target["case_id"].isin(subset.index)].set_index("case_id")
target

In [None]:
target = target.reindex(subset.index)["target"]
target
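
The alignment step above relies on `reindex` returning the target values in the row order of the feature table. A minimal sketch with made-up `case_id` values:

```python
import pandas as pd

# Feature rows arrive in an arbitrary order and cover only part of the cases
features = pd.DataFrame(
    {"f": [0.1, 0.2, 0.3]},
    index=pd.Index([30, 10, 20], name="case_id"),
)
base = pd.DataFrame({"case_id": [10, 20, 30, 40], "target": [0, 1, 0, 1]})

# Index the target by case_id, then reorder it to follow the feature rows
aligned = base.set_index("case_id")["target"].reindex(features.index)
print(aligned.tolist())  # [0, 0, 1]
```

Any `case_id` present in the features but absent from the base table would come back as NaN, so a quick `aligned.isnull().sum()` is a cheap sanity check.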

In [None]:
# Select the 100 most informative features (according to Information Gain)
subset = selectKbestbyIG(X_df=subset, target=target)
subset

## ***Save intermediate objects***

**Save the subset DataFrame with the 100 selected features to parquet, and the label encoders with pickle. Both will be used to apply the results to the rest of the original feature matrices.**
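
As a quick check that fitted encoders survive serialization, the sketch below round-trips a dictionary of encoders through pickle in memory. The column name and labels are invented for illustration.

```python
import pickle

from sklearn.preprocessing import LabelEncoder

# One fitted encoder per column, as in the dictionary returned by to_num
encoders = {"color": LabelEncoder().fit(["red", "green", "blue"])}

# Serialize to bytes and load back, instead of writing to a file
payload = pickle.dumps(encoders, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

# The restored encoder keeps the same mapping (classes are sorted: blue, green, red)
codes = restored["color"].transform(["green", "red"])
print(codes.tolist())  # [1, 2]
```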

In [None]:
# Save the reference Dataframe 
subset.to_parquet("feature_selection.parquet")
%ls

In [None]:
import pickle

with open('labelEncoders.pkl', 'wb') as file:
    # Pickle the label encoders dictionary using the highest protocol available.
    pickle.dump(lbl_encorders, file, pickle.HIGHEST_PROTOCOL)

## ***Create the final training dataset***

**Create a base table holding the 100 selected features and the target for all the training examples (case_ids).**

In [None]:
# Paths to the feature matrices (ordered this time)
fms_paths = glob.glob("/kaggle/input/deep-feature-synthesis-home-credit-stability/feature_matrices/*")
fms_paths.sort(key=lambda fm_path: int(fm_path.split("_")[1][18:]))
fms_paths[:10]

In [None]:
with open('labelEncoders.pkl', 'rb') as file:
    lbl_encoders = pickle.load(file)

In [None]:
selected100features = list(pd.read_parquet("feature_selection.parquet").columns)
# WEEK_NUM was not selected among the 100 most informative features,
# but we need it for the stability metric (competition context)
selected100features.insert(0, "WEEK_NUM")
selected100features[:10]

In [None]:
dataset = concatenate_fms(fms_paths, columns=selected100features)
dataset = fillna(dataset)
dataset = to_num(X_df=dataset, label_encoders=lbl_encoders)
dataset.head()
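
Since the encoders were fitted on a subset only, the full dataset can contain categories they have never seen. The sketch below shows one way to handle this, under the assumption that scikit-learn's `LabelEncoder` looks up string labels in `classes_` by value, so new labels can be appended without re-fitting. Assigning to `classes_` is not a documented feature, and the labels here are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Encoder fitted on a subset that only contained "a" and "b"
le = LabelEncoder().fit(np.array(["a", "b"], dtype=object))
values = np.array(["b", "c", "a"], dtype=object)  # "c" was never seen

try:
    codes = le.transform(values)
except ValueError:
    # Append unseen labels after the known ones so existing codes are preserved
    known = set(le.classes_)
    unseen = [v for v in dict.fromkeys(values) if v not in known]
    le.classes_ = np.concatenate([le.classes_, np.array(unseen, dtype=object)])
    codes = le.transform(values)

print(codes.tolist())  # [1, 2, 0]
```

Appending keeps "a" and "b" on the codes they already had, which matters if anything encoded earlier has to stay comparable.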

In [None]:
dataset.shape

In [None]:
target = pd.read_parquet("/kaggle/input/home-credit-credit-risk-model-stability/parquet_files/train/train_base.parquet", 
                         columns=["case_id", "target"]).set_index("case_id")["target"]
target

In [None]:
# Create the final base table with the selected features and the target
dataset["target"] = target
dataset

In [None]:
# Save the final table to csv and parquet
dataset.to_parquet("base_100features.parquet")
dataset.to_csv("base_100features.csv")

In [None]:
%ls

## ***Conclusion***

Following our feature engineering steps, we needed to narrow down the 2526 features to a more manageable number to simplify the model, minimize overfitting, and eliminate irrelevant features that could hinder learning. This operation was much less time-consuming than the feature engineering process. Note that every step in this notebook (the thresholds, the number of selected features, the scoring function) is highly customizable, and that this is just a baseline.

**Thanks for reading**