# **Project:** Home Credit Default Risk

## **Team Members:**
### Hari Kiran Gannamani, Chaitanya Chirumamilla, Shashi Preetham Podupuganti, Muhammad Naveed

### **Project Overview**

**Context:** Many individuals struggle to secure loans due to insufficient credit histories, facing exploitation by predatory lenders.

**Home Credit's Mission:** Aimed at enhancing financial inclusion, Home Credit utilizes alternative data (like telecom and transaction records) to evaluate loan repayment abilities, providing a safer borrowing experience for the underbanked.

**Kaggle Challenge:** Home Credit seeks innovative solutions from Kagglers to improve their data's predictive accuracy, ensuring loans are accessible to those who can repay and are structured to support borrowers' financial success.

### **Data Overview**

**Source and Purpose:** The dataset is provided by Home Credit, a company focusing on lending to individuals with limited access to traditional banking services. The objective is to predict the likelihood of a client repaying a loan, addressing a key business challenge. This competition on Kaggle encourages the machine learning community to develop models that can assist in this prediction task.

**Data Composition:** The dataset encompasses seven distinct sources, all in CSV format, detailing various aspects of loan applications and client credit history:

1. **application_train/application_test:** These files constitute the primary dataset, containing records of loan applications at Home Credit. Each row represents a unique loan application, identified by `SK_ID_CURR`. The training data includes a `TARGET` feature indicating loan repayment status (0: repaid, 1: not repaid).

2. **bureau:** This dataset provides information on the client's previous credits from other financial institutions, with each row representing a distinct previous credit. A single loan in the application data may be associated with multiple entries here.

3. **bureau_balance:** Offers monthly details on previous credits listed in the bureau data. Each row corresponds to a month of credit, with multiple rows per credit reflecting its duration.

4. **previous_application:** Contains data on prior loan applications at Home Credit by clients with current loans in the application data. Each row represents a previous application, identified by `SK_ID_PREV`, with the potential for multiple entries per current loan.

5. **POS_CASH_BALANCE:** Presents monthly information on past point of sale or cash loans with Home Credit. Each entry covers a month of loan, with multiple entries per loan.

6. **credit_card_balance:** Monthly data on previous credit cards held with Home Credit. Similar to other datasets, each row accounts for a month's balance, with multiple rows per credit card.

7. **installments_payments:** Tracks payment history for past loans at Home Credit, including both made and missed payments. Each payment instance is recorded in a separate row.

This structured and multifaceted dataset offers a comprehensive view of clients' financial behaviors and credit histories, aiming to enable precise loan repayment predictions.

## Imports

- We are using basic imports required for data science

In [2]:
# Essential Libraries for Data Manipulation
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation and analysis

# File System Management
import os  # For interacting with the operating system

# Warnings Management
import warnings  # To suppress warnings which can clutter the notebook
warnings.filterwarnings('ignore')

# Data Visualization Libraries
import matplotlib.pyplot as plt  # For creating static, animated, and interactive visualizations
import seaborn as sns  # For making attractive and informative statistical graphics

# Preprocessing Tools
from sklearn.preprocessing import LabelEncoder  # For encoding labels with value between 0 and n_classes-1

# Set matplotlib inline for Jupyter notebook (use only in a Jupyter notebook)
%matplotlib inline

# Optionally, set a style for seaborn to make plots more appealing
sns.set(style="whitegrid")  # This is optional and can be modified based on preference

### Load Training data

In [1]:
app_train = pd.read_csv('Projet+Mise+en+prod+-+home-credit-default-risk/application_train.csv')
print(app_train.shape)
app_train.head()

The training data has 307,511 rows, each a separate loan, and 122 columns including `TARGET` (the label we want to predict).

### Load Testing data

In [None]:
app_test = pd.read_csv('../input/home-credit-default-risk/application_test.csv')
print(app_test.shape)
app_test.head()

The testing data has 48,744 rows, each a separate loan, and 121 columns; it does not include the `TARGET` column. The testing set is considerably smaller than the training set.

### **Exploratory Data Analysis (EDA) Overview**

Exploratory Data Analysis (EDA) plays a crucial role in understanding the underlying patterns, anomalies, relationships, and trends within a dataset. It's a foundational step in the data science process that precedes model building. EDA involves a mix of visualization and statistical analysis to uncover insights from data. It's an iterative and exploratory process, starting from a broad perspective and progressively focusing on specific aspects as interesting patterns emerge.

**Case Study: Analyzing the TARGET Column**

One of the initial steps in EDA, especially for a classification problem like the Home Credit Default Risk, is to examine the distribution of the target variable. In this case, the TARGET column indicates whether a loan was repaid (0) or not (1). Let's perform a basic analysis of this column to understand the class distribution:

## Examining `TARGET` column distribution

In [None]:
# Examining the distribution of the 'TARGET' column
target_distribution = app_train['TARGET'].value_counts()
print(target_distribution)

In [None]:
# Visualizing the distribution of the 'TARGET' column
app_train['TARGET'].astype(int).plot.hist()
plt.title('Distribution of the TARGET Variable')
plt.xlabel('TARGET Value')
plt.ylabel('Frequency')
plt.show()

**Insights:**

From the output, we observe that there are 282,686 instances where the loan was repaid (TARGET=0) and 24,825 instances where the loan was not repaid (TARGET=1).
This indicates a significant imbalance in the dataset: 91.93% of the loans were repaid on time, while only 8.07% were not. Such an imbalance can influence the performance of machine learning models, as they might become biased towards predicting the majority class.
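The class shares quoted above can be reproduced directly from the counts. The following is a minimal sketch that builds a stand-in Series from the two counts in the text (it does not load the real `app_train` frame) and uses `value_counts(normalize=True)` to get shares instead of raw counts.

```python
import pandas as pd

# Stand-in for app_train['TARGET'], built from the counts quoted in the text
target = pd.Series([0] * 282686 + [1] * 24825, name="TARGET")

# normalize=True returns the share of each class rather than its count
class_share = target.value_counts(normalize=True) * 100
print(class_share.round(2))
```

On the real data, the same call on `app_train['TARGET']` should give the same split, assuming the counts above are what the notebook printed.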

### Analyzing Missing Values

In [None]:
import pandas as pd

# Function to calculate missing values by column in a DataFrame
def missing_values_table(df):
    # Calculate total missing values
    mis_val = df.isnull().sum()
    
    # Calculate percentage of missing values
    mis_val_pct = 100 * df.isnull().sum() / len(df)
    
    # Make a table with the results
    mis_val_tab = pd.concat([mis_val, mis_val_pct], axis=1)
    
    # Rename the columns
    mis_val_tab_col_renamed = mis_val_tab.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    
    # Sort the table by percentage of missing descending, after excluding columns with no missing values
    mis_val_tab_col_renamed = mis_val_tab_col_renamed[
        mis_val_tab_col_renamed.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    
    # Print some summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
           "There are " + str(mis_val_tab_col_renamed.shape[0]) + 
           " columns that have missing values.")
    
    # Return the dataframe with missing information
    return mis_val_tab_col_renamed
# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(15)

To build a machine learning model we need a strategy for these missing values. One option is imputation, i.e. filling them in. Another is to drop columns with a high share of missing values (e.g., above 40%), but that discards data, and we cannot know ahead of time whether those columns will be useful to the model. We therefore keep all columns for now. Some models, such as XGBoost, can also handle missing values natively without imputation; whether a Random Forest can depends on the implementation and version.
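The two options discussed above can be sketched on a toy frame. This is an illustration under assumptions: the column names `income` and `mostly_missing` are made up, the 40% threshold is the example value from the text, and median imputation is just one possible strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for app_train (hypothetical column names)
toy = pd.DataFrame({
    "income": [100.0, np.nan, 300.0, 200.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Option 1: drop columns whose share of missing values exceeds 40%
missing_share = toy.isnull().mean()
kept = toy.loc[:, missing_share <= 0.40]

# Option 2: keep every column and fill the gaps with the column median
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(kept.columns.tolist())
print(imputed)
```

Option 1 loses the `mostly_missing` column entirely, while option 2 keeps it at the cost of filling three of its four values with a single statistic.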

## Checking datatypes of columns

In [None]:
# Number of each type of column
app_train.dtypes.value_counts()

In [None]:
# Number of unique classes in each object column
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

Most of the object columns have only a few unique entries. Because most models require numerical input, these categorical variables must be converted into a numerical format before modeling.

**Encoding Strategy:**
- **Label Encoding** is used for binary categories, assigning an integer to each category. It's straightforward but imposes an arbitrary order.
- **One-Hot Encoding** is applied to variables with more than two categories, creating a new column for each category and avoiding arbitrary numerical assignments.

For binary categories the arbitrary order is harmless, because there are only two values. With more than two categories, label encoding would suggest an ordering that does not exist, so one-hot encoding is preferred. Its main drawback is the growth in the number of columns, which can be managed with dimensionality reduction techniques such as PCA.
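The difference between the two encodings is easiest to see on a tiny example. The sketch below uses a made-up frame with hypothetical columns `FLAG` (two categories) and `TYPE` (three categories); it is not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: one binary column and one three-category column
toy = pd.DataFrame({
    "FLAG": ["Y", "N", "Y"],
    "TYPE": ["cash", "card", "pos"],
})

# Label encoding: classes are sorted, so "N" becomes 0 and "Y" becomes 1
toy["FLAG"] = LabelEncoder().fit_transform(toy["FLAG"])

# One-hot encoding: each remaining text column becomes one column per category
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

The binary column stays a single column of 0/1 values, while the three-category column expands into three indicator columns.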

In practice, we apply Label Encoding to variables with two categories and One-Hot Encoding to those with more:

In [None]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            # Train on the training data
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

In [None]:
# one-hot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

This approach ensures our dataset is fully numerical, with specific encoding strategies tailored to the number of categories in each variable.

**Synchronizing Training and Testing Data**

For consistent model training and evaluation, it's crucial that the training and testing datasets have identical features. Post one-hot encoding, the training dataset might end up with additional columns due to categories absent in the testing set. To rectify this and ensure uniformity, the datasets must be aligned.

The alignment process involves:

1. Preserving the `TARGET` variable from the training set as it's not present in the testing dataset but is necessary for training.
2. Aligning the datasets to match their columns. This step involves removing any columns from the training set that aren't found in the testing set, ensuring both datasets have the same features.

Here's how to implement these steps:


In [None]:
train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

Post-alignment, the datasets are now harmonized, containing an identical set of features which is essential for the integrity of model training and evaluation processes.
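To see what the inner alignment does on a small scale, here is a minimal sketch with two made-up frames. The column names `A`, `B_x`, and `B_y` are hypothetical; `B_y` plays the role of a one-hot column that exists only in the training set.

```python
import pandas as pd

# Hypothetical frames: the training set has an extra dummy column and TARGET
train = pd.DataFrame({"A": [1], "B_x": [1], "B_y": [0], "TARGET": [1]})
test = pd.DataFrame({"A": [2], "B_x": [0]})

# Save the labels, keep only the shared columns, then restore the labels
labels = train["TARGET"]
train, test = train.align(test, join="inner", axis=1)
train["TARGET"] = labels

print(train.columns.tolist())
print(test.columns.tolist())
```

The column that exists only in the training frame is dropped, and `TARGET` survives because it is saved before the alignment and added back afterwards.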


**Detecting Data Anomalies**

Identify anomalies by examining column statistics, particularly for columns like `DAYS_BIRTH`, which are recorded relative to the loan application date and are negative. Converting `DAYS_BIRTH` to positive and representing it in years:


In [None]:
for col in app_train:
    print(col, app_train[col].dtype)

In [None]:
(app_train['DAYS_BIRTH'] / -365).describe()

Those ages look reasonable, ranging from roughly 20 to 70 years. There are no outliers for the age on either the high or low end. How about the days of employment?

In [None]:
app_train['DAYS_EMPLOYED'].describe()

That doesn't look right! The maximum value (besides being positive) is about 1000 years. 

In [None]:
app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram')
plt.xlabel('Days Employment')
plt.show()

Just out of curiosity, let's subset the anomalous clients and see if they tend to have higher or lower rates of default than the rest of the clients.

In [None]:
anom = app_train[app_train['DAYS_EMPLOYED'] == 365243]
non_anom = app_train[app_train['DAYS_EMPLOYED'] != 365243]
print('The non-anomalies default on %0.2f%% of loans' % (100 * non_anom['TARGET'].mean()))
print('The anomalies default on %0.2f%% of loans' % (100 * anom['TARGET'].mean()))
print('There are %d anomalous values in column days of employment' % len(anom))

It turns out that the anomalies have a lower rate of default.

How to handle anomalies depends on the situation; there are no set rules. One of the safest approaches is to set the anomalies to a missing value and have them filled in by imputation before machine learning. Here all the anomalies share exactly the same value, which suggests these loans have something in common, so we want them all to receive the same filled-in value. The anomalous values also seem to carry information (their default rate differs from the rest), so we want the model to know which values were filled in. As a solution, we replace the anomalous values with not a number (`np.nan`) and create a new boolean column indicating whether or not the value was anomalous.

In [None]:
# Create an anomalous flag column
app_train['DAYS_EMPLOYED_ANOM'] = app_train["DAYS_EMPLOYED"] == 365243

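# Replace the anomalous values with nan (assign back: a chained inplace replace
# on a column selection does not reliably modify the frame in recent pandas versions)
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace({365243: np.nan})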

app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram')
plt.xlabel('Days Employment')
plt.show()

The distribution now appears more realistic and aligned with expectations. Additionally, a new column has been introduced to inform the model about the original anomalous values. This is crucial since we'll need to impute the missing values (now represented as NaNs) with a representative statistic, likely the column's median.

In [None]:
# Describing the distribution of the 'DAYS_REGISTRATION' feature
app_train['DAYS_REGISTRATION'].describe()

In [None]:
# Describing the distribution of the 'DAYS_ID_PUBLISH' feature
app_train['DAYS_ID_PUBLISH'].describe()

In [None]:
# Describing the distribution of the 'OWN_CAR_AGE' feature
app_train['OWN_CAR_AGE'].describe()

The other columns with DAYS in the dataframe look to be about what we expect with no obvious outliers.

As an extremely important note, anything we do to the training data we also have to do to the testing data. Let's make sure to create the new column and fill in the existing column with np.nan in the testing data.

In [None]:
# Identifying anomalies in 'DAYS_EMPLOYED' for the test dataset
app_test['DAYS_EMPLOYED_ANOM'] = app_test["DAYS_EMPLOYED"] == 365243
# Replacing identified anomalies with NaN for consistency (assign back instead of chained inplace)
app_test['DAYS_EMPLOYED'] = app_test['DAYS_EMPLOYED'].replace({365243: np.nan})
# Reporting the number of anomalies found in the test data
n_anom = app_test['DAYS_EMPLOYED_ANOM'].sum()
print('There are %d anomalies in the test data out of %d entries' % (n_anom, len(app_test)))
# Percentage of anomalies, computed from the data rather than hard-coded
percentage_anomalies = 100 * n_anom / len(app_test)
print(f'Percentage of anomalies in the test data: {percentage_anomalies:.2f}%')

### Correlations
Exploring correlations between features and the target variable is a fundamental part of understanding the data. By calculating the Pearson correlation coefficient, we can gauge linear relationships between each variable and the target. Although the correlation coefficient might not fully capture the "relevance" of a feature, it provides insights into potential associations within the data. The correlation values range from -1 to 1, where values closer to 1 or -1 indicate a strong positive or negative relationship, respectively, and values near 0 suggest no linear correlation. Here's a breakdown of how to interpret the absolute value of the correlation coefficient:

- **0.00-0.19:** Very weak
- **0.20-0.39:** Weak
- **0.40-0.59:** Moderate
- **0.60-0.79:** Strong
- **0.80-1.0:** Very strong
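The bands above can be written as a small lookup. `correlation_strength` is a hypothetical helper introduced here for illustration; it treats each band's lower bound as inclusive, which is an assumption because the list leaves small gaps between bands (for example between 0.19 and 0.20).

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to the verbal bands listed above."""
    magnitude = abs(r)  # the sign only gives the direction, not the strength
    if magnitude < 0.20:
        return "very weak"
    if magnitude < 0.40:
        return "weak"
    if magnitude < 0.60:
        return "moderate"
    if magnitude < 0.80:
        return "strong"
    return "very strong"


for r in (-0.08, 0.25, 0.45, -0.7, 0.95):
    print(r, correlation_strength(r))
```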

To identify and display the most significant correlations with the target variable:


In [None]:

# Calculating correlations between all features and the target
correlations = app_train.corr()['TARGET'].sort_values()

# Displaying the top 15 positive correlations
print('Most Positive Correlations:\n', correlations.tail(15))

# Displaying the top 15 negative correlations
print('\nMost Negative Correlations:\n', correlations.head(15))

This process highlights the features most positively and negatively correlated with the target, guiding further analysis and feature selection for modeling.

Looking at the strongest correlations more closely, `DAYS_BIRTH` has the highest positive correlation with the target. This feature is the client's age in days at the time of the loan application, recorded as a negative number (days before the application). Because the values are negative, the positive correlation means that the target tends to be higher as `DAYS_BIRTH` moves toward zero, i.e. for younger clients; in other words, older clients are less likely to default. Taking the absolute value of `DAYS_BIRTH` flips the sign of the correlation and makes this inverse relationship between age and default risk explicit. Note, however, that all of these correlations are very weak.
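The sign flip is a general property of the Pearson coefficient: negating one variable negates the correlation and leaves its magnitude unchanged. A minimal sketch with synthetic data (random numbers standing in for `DAYS_BIRTH` and `TARGET`, not the real columns):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: negative day counts and a random 0/1 target
days_birth = -rng.integers(7000, 25000, size=1000)
target = rng.integers(0, 2, size=1000)

# Correlation with the raw (negative) values and with their absolute values
r_raw = np.corrcoef(days_birth, target)[0, 1]
r_abs = np.corrcoef(np.abs(days_birth), target)[0, 1]

print(r_raw, r_abs)
```

The two printed values are equal in magnitude and opposite in sign, so the choice of sign convention changes only how the relationship is read, not how strong it is.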

### Effect of Age (i.e. `DAYS_BIRTH`) on repayment

In [None]:
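# Find the correlation of the positive days since birth and target
app_train['DAYS_BIRTH'] = abs(app_train['DAYS_BIRTH'])
# Apply the same sign change to the test data so both sets stay consistent
app_test['DAYS_BIRTH'] = abs(app_test['DAYS_BIRTH'])
app_train['DAYS_BIRTH'].corr(app_train['TARGET'])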

After taking the absolute value, the correlation with the target is negative: as clients get older, they tend to repay their loans on time more often.

In [None]:
# Set the style of plots
plt.style.use('fivethirtyeight')

# Plot the distribution of ages in years
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor = 'k', bins = 50)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
plt.show()

To illustrate the impact of age on the target variable, we'll proceed by generating a Kernel Density Estimation (KDE) plot, where the color represents the target value.

A KDE plot shows the distribution of a single variable and can be thought of as a smoothed histogram. It is built by placing a kernel, usually a Gaussian, at each data point and averaging these kernels into a single smooth curve.
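The averaging of kernels can be written out by hand in a few lines. This is a sketch of the idea only, not how seaborn computes its curves: `gaussian_kde_1d` is a hypothetical helper, the ages are made-up values, and the bandwidth of 3 years is an arbitrary assumption.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on a grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Gaussian density of each bump at each grid point
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # The mean over samples is the density estimate
    return kernels.mean(axis=1)


ages = np.array([25.0, 30.0, 31.0, 45.0, 60.0])  # made-up ages in years
grid = np.linspace(0, 100, 2001)
density = gaussian_kde_1d(ages, grid, bandwidth=3.0)

# The estimated density should integrate to roughly 1 over the grid
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual points more closely.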

For this visualization, we'll utilize the seaborn kdeplot function.

In [None]:
plt.figure(figsize = (8, 6))

# KDE plot of loans that were repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'Loans repaid on time')

# KDE plot of loans which were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'Loans not repaid on time')

# Labeling of plot
plt.xlabel('Age (years)')
plt.ylabel('Density')
plt.title('Distribution of Ages')
plt.legend()

plt.show()

The curve representing loans not repaid on time tends to skew towards younger individuals within the age range. Although this correlation may not be significant, it's likely to be valuable in machine learning models as it impacts the target variable. Let's explore this relationship from a different perspective: the average loan repayment failure rate across age groups.

To visualize this relationship, we'll categorize age into bins of 5-year intervals. Then, we'll calculate the average target value for each age bin, indicating the proportion of loans not repaid in each age category.

In [None]:
# Age information into a separate dataframe (copy to avoid SettingWithCopyWarning)
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head()

In [None]:
# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups

In [None]:
# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75)
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group')

plt.show()

There is a clear trend: younger applicants are more likely to not repay the loan! The rate of failure to repay is above 10% for the youngest three age groups and below 5% for the oldest age group.

This is information that could be directly used by the bank: because younger clients are less likely to repay the loan, maybe they should be provided with more guidance or financial planning tips. This does not mean the bank should discriminate against younger clients, but it would be smart to take precautionary measures to help younger clients pay on time.

### External Data Sources viz. `EXT_SOURCE_1`, `EXT_SOURCE_2`, `EXT_SOURCE_3`
The 3 variables with the strongest negative correlations with the target are `EXT_SOURCE_1`, `EXT_SOURCE_2`, and `EXT_SOURCE_3`. According to the documentation, these features represent a "normalized score from external data source".

Let's take a look at these features:

In [None]:
# Extract the EXT_SOURCE variables and show correlations
ext_data = app_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
ext_data_corrs

In [None]:

# Displaying a heatmap of correlations with a different color map
sns.heatmap(ext_data_corrs, cmap='viridis', vmin=-0.25, annot=True, vmax=0.6)
plt.title('Correlation Heatmap')
plt.show()


The three EXT_SOURCE features exhibit negative correlations with the TARGET variable, suggesting that higher values of EXT_SOURCE are associated with a higher likelihood of loan repayment by the client. Additionally, there is a positive correlation between DAYS_BIRTH and EXT_SOURCE_1, implying that client age might be a contributing factor to this score.

In [None]:
plt.figure(figsize = (8, 12))

# iterate through the sources
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    
    # create a new subplot for each source
    plt.subplot(3, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source], label = 'Loans repaid on time')
    # plot loans that were not repaid
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source], label = 'Loans not repaid on time')
    
    # Label the plots
    plt.title('Distribution of %s by TARGET Value' % source)
    plt.xlabel('%s' % source)
    plt.ylabel('Density')
    plt.legend()
    
plt.tight_layout(h_pad = 2.5)
plt.show()

`EXT_SOURCE_3` shows the most noticeable difference between the target values, which indicates some association with the likelihood of repayment. Although the relationship is not strong, these variables are still useful for a machine learning model that predicts whether a loan will be repaid.

As a final exploratory plot, we can make a pairs plot of the `EXT_SOURCE` variables and the `DAYS_BIRTH` variable. The pairs plot is a great exploration tool because it lets us see relationships between multiple pairs of variables as well as distributions of single variables. Here we use the seaborn `PairGrid` function to create a pairs plot with scatterplots on the upper triangle, kernel density estimates on the diagonal, and 2D kernel density plots on the lower triangle. The `corr_func` helper defined in the cell can annotate a panel with its correlation coefficient, but it is not mapped onto the grid here.

In [None]:
# Copy the data for plotting
plot_data = ext_data.drop(columns = ['DAYS_BIRTH']).copy()

# Add in the age of the client in years
plot_data['YEARS_BIRTH'] = age_data['YEARS_BIRTH']

# Drop na values and limit to first 100000 rows as sample
plot_data = plot_data.dropna().loc[:100000, :]

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 10)

# Create the pairgrid object
grid = sns.PairGrid(data = plot_data, diag_sharey=False,
                    hue = 'TARGET', 
                    vars = [x for x in list(plot_data.columns) if x != 'TARGET'])

# Upper is a scatter plot
grid.map_upper(plt.scatter, alpha = 0.2)

# Diagonal is a kernel density estimate
grid.map_diag(sns.kdeplot)

# Bottom is density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r)

plt.legend()
plt.suptitle('Ext Source and Age Features Pairs Plot', size = 20, y = 1.05)
plt.show()

In this plot, the red indicates loans that were not repaid and the blue are loans that are paid. We can see the different relationships within the data. There does appear to be a moderate positive linear relationship between the `EXT_SOURCE_1` and the `DAYS_BIRTH` (or equivalently `YEARS_BIRTH`), indicating that this feature may take into account the age of the client.

### Feature Engineering Introduction:

Feature engineering encompasses the process of feature construction, where new features are generated from existing data, and feature selection, which involves selecting the most relevant features or reducing dimensionality using various techniques. We'll employ both methods to enhance our dataset.

### Polynomial Features
In this method, we make features that are powers of existing features as well as interaction terms between existing features. These features that are a combination of multiple individual variables are called `interaction terms` because they capture the interactions between variables. In other words, while two variables by themselves may not have a strong influence on the target, combining them together into a single interaction variable might show a relationship with the target. Interaction terms are commonly used in statistical models to capture the effects of multiple variables.

In the following code, we create polynomial features using the `EXT_SOURCE` variables and the `DAYS_BIRTH` variable. `Scikit-Learn` has a useful class called `PolynomialFeatures` that creates the polynomials and the interaction terms up to a specified degree. (**Note**: High degree can lead to overfitting data)

In [None]:
# Make a new dataframe for polynomial features
poly_features = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test = app_test[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]

# imputer for handling missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')

poly_target = poly_features['TARGET']

poly_features = poly_features.drop(columns = ['TARGET'])

# Need to impute missing values
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)

from sklearn.preprocessing import PolynomialFeatures
                                  
# Create the polynomial object with specified degree
poly_transformer = PolynomialFeatures(degree = 3)

In [None]:
# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial Features shape: ', poly_features.shape)

In [None]:
poly_transformer.get_feature_names_out(input_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])

Now, we can see whether any of these new features are correlated with the `TARGET`.

In [None]:
# Create a dataframe of the features 
poly_features = pd.DataFrame(poly_features, 
                             columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                                                               'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Add in the target
poly_features['TARGET'] = poly_target

# Find the correlations with the target
poly_corrs = poly_features.corr()['TARGET'].sort_values(ascending=False)
# Display most negative and most positive correlations
print('Most Positive Correlations:\n', poly_corrs.head(10))
print('\nMost Negative Correlations:\n', poly_corrs.tail(5))

Several of the new variables have a greater (in terms of absolute magnitude) correlation with the target than the original features. When we build machine learning models, we can try with and without these features to determine if they actually help the model learn.

We will add these features to a copy of the training and testing data and then evaluate models with and without the features. Many times in machine learning, the only way to know if an approach will work is to try it out!

In [None]:
# Put test features into dataframe
poly_features_test = pd.DataFrame(poly_features_test, 
                                  columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                                'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')

# Merge polynomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')

# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)

# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape:  ', app_test_poly.shape)

### Domain Knowledge features
We can also create a few features that try to capture what we think may be important for telling whether a client will default on a loan, using our knowledge of the finance domain.

Some examples of these features:

1. `CREDIT_INCOME_PERCENT`: the ratio of the credit amount to the client's income
2. `ANNUITY_INCOME_PERCENT`: the ratio of the loan annuity to the client's income
3. `CREDIT_TERM`: the ratio of the annuity to the credit amount, which is roughly the reciprocal of the payment length in months (since the annuity is the monthly amount due)
4. `DAYS_EMPLOYED_PERCENT`: the ratio of the days employed to the client's age in days
    

In [None]:
app_train_domain = app_train.copy()
app_test_domain = app_test.copy()

app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']

In [None]:
app_test_domain['CREDIT_INCOME_PERCENT'] = app_test_domain['AMT_CREDIT'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['ANNUITY_INCOME_PERCENT'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['CREDIT_TERM'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_CREDIT']
app_test_domain['DAYS_EMPLOYED_PERCENT'] = app_test_domain['DAYS_EMPLOYED'] / app_test_domain['DAYS_BIRTH']
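
Because the same four ratios are computed for both the training and the testing data, they could be wrapped in one helper. The sketch below is one possible way to do it, not the code used for the reported scores: `add_domain_features` is a hypothetical name, the toy values are made up, and the replacement of infinite values (from a zero denominator) by `NaN` is an added assumption so that a later imputation step can handle them.

```python
import numpy as np
import pandas as pd

def add_domain_features(df):
    """Return a copy of df with the four ratio features added (hypothetical helper)."""
    out = df.copy()
    out['CREDIT_INCOME_PERCENT'] = out['AMT_CREDIT'] / out['AMT_INCOME_TOTAL']
    out['ANNUITY_INCOME_PERCENT'] = out['AMT_ANNUITY'] / out['AMT_INCOME_TOTAL']
    out['CREDIT_TERM'] = out['AMT_ANNUITY'] / out['AMT_CREDIT']
    out['DAYS_EMPLOYED_PERCENT'] = out['DAYS_EMPLOYED'] / out['DAYS_BIRTH']
    # Assumption: a zero denominator gives +/-inf in pandas; map it to NaN
    return out.replace([np.inf, -np.inf], np.nan)

# Toy rows for illustration only; the second row has zero income
toy = pd.DataFrame({'AMT_CREDIT': [200000.0, 100000.0],
                    'AMT_INCOME_TOTAL': [100000.0, 0.0],
                    'AMT_ANNUITY': [10000.0, 5000.0],
                    'DAYS_EMPLOYED': [-1000, -500],
                    'DAYS_BIRTH': [-10000, -20000]})

toy_domain = add_domain_features(toy)
print(toy_domain[['CREDIT_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']])
```

Applying one function to both `app_train` and `app_test` would keep the two feature sets from drifting apart if a definition changes later.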

**Visualize New Variables**

We should explore these domain knowledge variables visually in a graph. For all of these, we will make the same KDE plot colored by the value of the `TARGET`.


In [None]:
plt.figure(figsize = (8, 12))

# iterate through the features
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    
    # create a new subplot for each feature
    plt.subplot(4, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 0, feature], label = 'Loans repaid on time')
    # plot loans that were not repaid
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 1, feature], label = 'Loans not repaid on time')
    
    # Label the plots
    plt.title('Distribution of %s by TARGET Value' % feature)
    plt.xlabel('%s' % feature)
    plt.ylabel('Density')
    plt.legend()
    
plt.tight_layout(h_pad = 2.5)
plt.show()

It's hard to say ahead of time if these new features will be useful.

# Model 

For a naive baseline, we could guess the same value for all examples in the testing set. We are asked to predict the probability of not repaying the loan, so if we are entirely unsure, we would guess 0.5 for all observations in the test set. This will get us a Receiver Operating Characteristic Area Under the Curve (ROC AUC) of 0.5 in the competition (random guessing on a classification task scores 0.5).

Since we already know what score we are going to get, we don't really need to make a naive baseline guess. 
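
The claim that a constant guess scores 0.5 can be checked directly with `roc_auc_score`. This is a small sketch on made-up labels (`y_true` is not the competition data): since every observation receives the same score, the ranking carries no information and the ROC AUC is 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels with a minority of positives (1 = loan not repaid)
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0])

# The naive baseline: the same probability for every observation
constant_guess = np.full(y_true.shape, 0.5)

auc_constant = roc_auc_score(y_true, constant_guess)
print(auc_constant)
```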

Let's use a slightly more sophisticated model for our actual baseline: Logistic Regression.

# Logistic Regression Implementation

We will use `LogisticRegression` from Scikit-Learn for our first model. The only change we will make from the default model settings is to lower the parameter `C`, which is the inverse of the regularization strength: a lower value of `C` means stronger regularization, which should decrease overfitting. This will get us slightly better results than the default `LogisticRegression`, but it still will set a low bar for any future models.
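
To illustrate the role of `C`, the sketch below fits two logistic regressions on a synthetic dataset from `make_classification` (not the Home Credit data) and compares the size of the learned coefficients. With the default L2 penalty, the smaller `C` should shrink the coefficients toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# Default regularization (C = 1.0) versus the much stronger C = 0.0001
weak_reg = LogisticRegression(C=1.0, max_iter=1000).fit(X_toy, y_toy)
strong_reg = LogisticRegression(C=0.0001, max_iter=1000).fit(X_toy, y_toy)

# Size of the coefficient vector under each setting
norm_weak = np.linalg.norm(weak_reg.coef_)
norm_strong = np.linalg.norm(strong_reg.coef_)
print(f'C = 1.0:    coefficient norm {norm_weak:.4f}')
print(f'C = 0.0001: coefficient norm {norm_strong:.4f}')
```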

To get a baseline, we will use all of the features after encoding the categorical variables. 

We will preprocess the data by filling in the missing values (imputation) and normalizing the range of the features (feature scaling). 

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Assuming app_train and app_test are your training and testing datasets, respectively

# Drop the target from the training data
if 'TARGET' in app_train:
    X = app_train.drop(columns=['TARGET'])
    y = app_train['TARGET']
else:
    X = app_train.copy()
    y = None  # Adjust this based on your data

# Copy of the testing data
test = app_test.copy()

# Median imputation of missing values
imputer = SimpleImputer(strategy='median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Impute missing values in the training and test sets
X_imputed = imputer.fit_transform(X)
test_imputed = imputer.transform(test)

# Scale the data
X_scaled = scaler.fit_transform(X_imputed)
test_scaled = scaler.transform(test_imputed)

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Initialize and fit the logistic regression model
log_reg = LogisticRegression(max_iter=1000, solver='liblinear')
log_reg.fit(X_train, y_train)

# Predict probabilities for training and validation sets
train_proba = log_reg.predict_proba(X_train)[:, 1]
val_proba = log_reg.predict_proba(X_val)[:, 1]

# Calculate AUC scores for training and validation sets
train_auc = roc_auc_score(y_train, train_proba)
val_auc = roc_auc_score(y_val, val_proba)

# Print AUC scores
print(f'Training AUC: {train_auc:.3f}')
print(f'Validation AUC: {val_auc:.3f}')

# Predict probabilities for the test set (if needed)
test_proba = log_reg.predict_proba(test_scaled)[:, 1]

submission = pd.DataFrame(app_test['SK_ID_CURR'], columns=['SK_ID_CURR'])
submission['TARGET'] = test_proba

print('Submission data shape: ', submission.shape)
submission.head()

# Save the submission DataFrame to a CSV file
submission.to_csv('logistic_regression_baseline.csv', index=False)
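
The cell above fits the imputer and the scaler on all training rows before the train/validation split, so the validation rows influence the medians and the min/max ranges. One possible alternative, sketched here on synthetic data rather than on `app_train`, is to wrap the steps in a scikit-learn pipeline so that they are fitted on the training split only. The variable names and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded application data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # inject missing values

# Split first, so the preprocessing never sees the validation rows
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Imputation, scaling and the model are fitted together on the training split
pipe = make_pipeline(SimpleImputer(strategy='median'),
                     MinMaxScaler(feature_range=(0, 1)),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

val_auc_demo = roc_auc_score(y_va, pipe.predict_proba(X_va)[:, 1])
print(f'Validation AUC: {val_auc_demo:.3f}')
```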

In [None]:
from sklearn.linear_model import LogisticRegression

# Make the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0001)

# Train on the training data
log_reg.fit(train, train_labels)

Now that the model has been trained, we can use it to make predictions. We want to predict the probability of not repaying a loan, so we use the model's `predict_proba` method. This returns an m x 2 array, where m is the number of observations. The first column is the probability of the target being 0 and the second column is the probability of the target being 1 (so for a single row, the two columns must sum to 1). We want the probability that the loan is not repaid, so we will select the second column.
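
The shape and column order of the `predict_proba` output can be verified on a small example. This sketch uses synthetic data from `make_classification`, not the application data. The `classes_` attribute gives the label that each column corresponds to.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(C=0.0001).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)
print(proba.shape)     # one row per observation, one column per class
print(clf.classes_)    # column order of the probabilities

# Second column: probability of class 1
p_class_one = proba[:, 1]
```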

In [None]:
# Make predictions
# Make sure to select the second column only
log_reg_pred = log_reg.predict_proba(test)[:, 1]

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Predicting probabilities for the training set for evaluation
train_proba = log_reg.predict_proba(train)[:, 1]

# Calculating AUC scores
train_auc = roc_auc_score(train_labels, train_proba)

# Printing AUC scores
print(f'Training AUC: {train_auc}')

fpr, tpr, thresholds = roc_curve(train_labels, train_proba)

# Plotting the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % train_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

In [None]:
# Submission dataframe (copy to avoid pandas' SettingWithCopyWarning)
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = log_reg_pred

submit.head()

The predictions represent a probability between 0 and 1 that the loan will not be repaid. If we were using these predictions to classify applicants, we could set a probability threshold for determining that a loan is risky.
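
As a minimal sketch of such a threshold, the probabilities below are made up and the cut-off of 0.5 is a hypothetical choice. In practice the threshold would depend on the relative cost of rejecting a good applicant versus accepting a risky one.

```python
import numpy as np

# Made-up predicted probabilities of not repaying
predicted_proba = np.array([0.03, 0.12, 0.48, 0.51, 0.90])

# Hypothetical cut-off: flag a loan as risky at or above this probability
threshold = 0.5
is_risky = predicted_proba >= threshold

print(is_risky.astype(int))
```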



In [None]:
# Save the submission DataFrame to a CSV file
submit.to_csv('logistic_regression_baseline.csv', index=False)

The submission has now been saved to the virtual environment in which our notebook is running. To access the submission, at the end of the notebook, we will hit the blue Commit & Run button at the upper right of the kernel. This runs the entire notebook and then lets us download any files that are created during the run.

Once we run the notebook, the files created are available in the Versions tab under the Output sub-tab. From here, the submission files can be submitted to the competition or downloaded. Since there are several models in this notebook, there will be multiple output files.

**Score**: 0.690

# Random Forest Implementation

Let's try using a Random Forest on the same training data to see how that affects performance. The Random Forest is a much more powerful model especially when we use hundreds of trees. We will use 100 trees in the random forest.

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Assuming app_train and app_test are your DataFrame variables for training and testing datasets, respectively

# Drop the target from the training data
if 'TARGET' in app_train:
    X = app_train.drop(columns=['TARGET'])
    y = app_train['TARGET']
else:
    X = app_train.copy()
    y = None  # Adjust based on your actual setup

# Copy of the testing data
test = app_test.copy()

# Median imputation of missing values
imputer = SimpleImputer(strategy='median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Impute missing values in the training and test sets
X_imputed = imputer.fit_transform(X)
test_imputed = imputer.transform(test)

# Scale the data
X_scaled = scaler.fit_transform(X_imputed)
test_scaled = scaler.transform(test_imputed)

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Initialize and fit the Random Forest classifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=50, verbose=1, n_jobs=-1)
random_forest.fit(X_train, y_train)

# Predict probabilities for training and validation sets
train_proba = random_forest.predict_proba(X_train)[:, 1]
val_proba = random_forest.predict_proba(X_val)[:, 1]

# Calculate AUC scores for training and validation sets
train_auc = roc_auc_score(y_train, train_proba)
val_auc = roc_auc_score(y_val, val_proba)

# Print AUC scores
print(f'Training AUC: {train_auc:.3f}')
print(f'Validation AUC: {val_auc:.3f}')

# Predict probabilities for the test set (if needed)
test_proba_rf = random_forest.predict_proba(test_scaled)[:, 1]

# Create a submission DataFrame or use the test probabilities as needed
# Replace 'SK_ID_CURR' with your actual ID column in the test dataset
submission = pd.DataFrame(app_test['SK_ID_CURR'], columns=['SK_ID_CURR'])
submission['TARGET'] = test_proba_rf

print('Submission data shape: ', submission.shape)
submission.head()

# Save the submission dataframe
submission.to_csv('random_forest_baseline.csv', index = False)
print(submission.head())


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Make the random forest classifier
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)

In [None]:
# Train on the training data
random_forest.fit(train, train_labels)

# Extract feature importances
feature_importance_values = random_forest.feature_importances_
feature_importances = pd.DataFrame({'feature': features, 'importance': feature_importance_values})

# Make predictions on the test data
predictions = random_forest.predict_proba(test)[:, 1]


In [None]:
# Make a submission dataframe (copy to avoid pandas' SettingWithCopyWarning)
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline.csv', index = False)
print(submit.head())

**Score**: 0.67877


### Make Predictions using Engineered Features

The only way to see if the polynomial features and domain knowledge features improved the model is to train and test a model on these features! We can then compare the submission performance to that of the model without these features to gauge the effect of our feature engineering.


In [None]:
poly_features_names = list(app_train_poly.columns)

# Impute the polynomial features
imputer = SimpleImputer(strategy = 'median')

poly_features = imputer.fit_transform(app_train_poly)
poly_features_test = imputer.transform(app_test_poly)

# Scale the polynomial features
scaler = MinMaxScaler(feature_range = (0, 1))

poly_features = scaler.fit_transform(poly_features)
poly_features_test = scaler.transform(poly_features_test)

random_forest_poly = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)


In [None]:
# Train on the training data
random_forest_poly.fit(poly_features, train_labels)

# Make predictions on the test data
predictions = random_forest_poly.predict_proba(poly_features_test)[:, 1]

In [None]:
# Make a submission dataframe (copy to avoid pandas' SettingWithCopyWarning)
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = predictions

print(submit.head())
# Save the submission dataframe
submit.to_csv('random_forest_baseline_feature_engineered.csv', index = False)


**Score**: 0.60467

At 0.60467, compared with 0.67877 for the random forest on the original features, the polynomial features did not help in this case.

# Testing Domain Features

In [None]:
app_train_domain = app_train_domain.drop(columns = 'TARGET')

domain_features_names = list(app_train_domain.columns)

# Impute the domain features
imputer = SimpleImputer(strategy = 'median')

domain_features = imputer.fit_transform(app_train_domain)
domain_features_test = imputer.transform(app_test_domain)

# Scale the domain features
scaler = MinMaxScaler(feature_range = (0, 1))

domain_features = scaler.fit_transform(domain_features)
domain_features_test = scaler.transform(domain_features_test)

random_forest_domain = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)

# Train on the training data
random_forest_domain.fit(domain_features, train_labels)

# Extract feature importances
feature_importance_values_domain = random_forest_domain.feature_importances_
feature_importances_domain = pd.DataFrame({'feature': domain_features_names, 'importance': feature_importance_values_domain})

# Make predictions on the test data
predictions = random_forest_domain.predict_proba(domain_features_test)[:, 1]

In [None]:

# Make a submission dataframe (copy to avoid pandas' SettingWithCopyWarning)
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline_feature_domain.csv', index = False)

**Score**: 0.67996

At 0.67996, compared with 0.67877 for the random forest on the original features, the domain features gave only a marginal improvement in this case.

# Model Interpretation: Feature Importances


As a simple method to see which variables are the most relevant, we can look at the feature importances of the random forest. Given the correlations we saw in the exploratory data analysis, we should expect that the most important features are the `EXT_SOURCE` and the `DAYS_BIRTH`. We may use these feature importances as a method of dimensionality reduction in future work.


In [None]:
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better. 
    
    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`
        
    Returns:
        shows a plot of the 15 most important features
        
        df (dataframe): feature importances sorted by importance (highest to lowest) 
        with a column for normalized importance
        """
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance')
    plt.title('Feature Importances')
    plt.show()
    
    return df

In [None]:
# Show the feature importances for the default features
feature_importances_sorted = plot_feature_importances(feature_importances)

We see that there are only a handful of features with a significant importance to the model, which suggests we may be able to drop many of the features without a decrease in performance (and we may even see an increase in performance.) Feature importances are not the most sophisticated method to interpret a model or perform dimensionality reduction, but they let us start to understand what factors our model takes into account when it makes predictions.
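
One simple way to act on this observation is to keep only the top features whose normalized importances add up to a chosen share of the total. The sketch below is an illustration under assumptions: `select_by_cumulative_importance` is a hypothetical helper, the feature names and importances are invented, and the 0.85 threshold is arbitrary. It expects the same `feature` and `importance` columns used by `plot_feature_importances`.

```python
import pandas as pd

# Invented importances for illustration; not values from the model above
fi_demo = pd.DataFrame({
    'feature': ['f1', 'f2', 'f3', 'f4', 'f5'],
    'importance': [0.40, 0.30, 0.20, 0.08, 0.02],
})

def select_by_cumulative_importance(df, threshold=0.85):
    """Return the top features whose normalized importances first reach `threshold`."""
    df = df.sort_values('importance', ascending=False).reset_index(drop=True)
    df['importance_normalized'] = df['importance'] / df['importance'].sum()
    df['cumulative_importance'] = df['importance_normalized'].cumsum()
    # Count the features strictly below the threshold, then add the one that crosses it
    n_keep = int((df['cumulative_importance'] < threshold).sum()) + 1
    return list(df['feature'].head(n_keep))

kept = select_by_cumulative_importance(fi_demo, threshold=0.85)
print(kept)
```

Whether dropping the remaining features helps would still need to be checked by retraining and comparing validation scores.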

In [None]:
feature_importances_domain_sorted = plot_feature_importances(feature_importances_domain)

We see that all four of our hand-engineered features made it into the top 15 most important! This should give us confidence that our domain knowledge was at least partially on track.

# Light Gradient Boosting Machine

In [None]:
import gc

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder

def model(features, test_features, encoding = 'ohe', n_folds = 5):
    
    """Train and test a light gradient boosting model using
    cross validation. 
    
    Parameters
    --------
        features (pd.DataFrame): 
            dataframe of training features to use 
            for training a model. Must include the TARGET column.
        test_features (pd.DataFrame): 
            dataframe of testing features to use
            for making predictions with the model. 
        encoding (str, default = 'ohe'): 
            method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
        n_folds (int, default = 5): 
            number of folds to use for cross validation
        
    Return
    --------
        submission (pd.DataFrame): 
            dataframe with `SK_ID_CURR` and `TARGET` probabilities
            predicted by the model.
        feature_importances (pd.DataFrame): 
            dataframe with the feature importances from the model.
        valid_metrics (pd.DataFrame): 
            dataframe with training and validation metrics (ROC AUC) for each fold and overall.
        
    """
    
    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    
    # Extract the labels for training
    labels = features['TARGET']
    
    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
    
    
    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)
        
        # No categorical indices to record
        cat_indices = 'auto'
    
    # Integer label encoding
    elif encoding == 'le':
        
        # Create a label encoder
        label_encoder = LabelEncoder()
        
        # List for storing categorical indices
        cat_indices = []
        
        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Map the categorical features to integers
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # Record the categorical indices
                cat_indices.append(i)
    
    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")
        
    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)
    
    # Extract feature names
    feature_names = list(features.columns)
    
    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)
    
    # Create the kfold object
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    
    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    
    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])
    
    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    
    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []
    
    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):
        
        # Training data for the fold (labels are selected by position)
        train_features, train_labels = features[train_indices], labels.iloc[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels.iloc[valid_indices]
        
        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        
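        # Train the model. Early stopping and logging are passed as callbacks,
        # which newer LightGBM releases require in place of the
        # `early_stopping_rounds` and `verbose` arguments to `fit`.
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  callbacks = [lgb.early_stopping(stopping_rounds = 100),
                               lgb.log_evaluation(period = 200)])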
        
        # Record the best iteration
        best_iteration = model.best_iteration_
        
        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        
        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        
        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        
    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    
    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
    
    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
    
    return submission, feature_importances, metrics

In [None]:
submission, fi, metrics = model(app_train, app_test)
print('Baseline metrics')
print(metrics)

In [None]:
fi_sorted = plot_feature_importances(fi)

In [None]:
submission.to_csv('baseline_lgb.csv', index = False)

- The Light Gradient Boosting Machine (LGBM) model was utilized to predict loan defaults, employing a cross-validation approach for both training and evaluation.
- Data preparation involved:
  - Training data shape: 307,511 rows and 239 features.
  - Testing data shape: 48,744 rows and 239 features.
- Model performance over training iterations showed:
  - An increase in training AUC from approximately 0.798 to 0.828 as training progressed, indicating improved model fit over time.
  - Validation AUC scores remained consistent around 0.755 to 0.763, reflecting the model's generalization capability.
- Final evaluation metrics highlighted:
  - Training AUC scores varied slightly across different folds, ranging from 0.808 to 0.817, averaging at approximately 0.813.
  - Validation AUC scores also showed minor variations across folds, with an average score of approximately **0.759**.
- These results suggest that the LGBM model generalizes reasonably well to unseen data. The gap between the average training AUC (about 0.813) and the average validation AUC (about 0.759) points to some overfitting, but the validation score is well above the 0.5 of random guessing.
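
The cross-validation scheme inside `model` can be summarized with a smaller sketch. To keep it short and independent of LightGBM, this version uses `LogisticRegression` as a stand-in classifier and synthetic data from `make_classification`. Both are assumptions for illustration, and early stopping is left out. The point shown is the out-of-fold mechanism: every row is predicted once by a model that did not train on it, and the overall validation AUC is computed on those predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic data standing in for the training features and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

k_fold = KFold(n_splits=5, shuffle=True, random_state=50)
out_of_fold = np.zeros(X_demo.shape[0])
fold_scores = []

for train_idx, valid_idx in k_fold.split(X_demo):
    # Stand-in classifier; the notebook uses lgb.LGBMClassifier here
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])

    # Each row is predicted exactly once, by the model that did not see it
    out_of_fold[valid_idx] = clf.predict_proba(X_demo[valid_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_demo[valid_idx], out_of_fold[valid_idx]))

# Overall validation score on the pooled out-of-fold predictions
overall_auc = roc_auc_score(y_demo, out_of_fold)
print('Fold AUCs:', np.round(fold_scores, 3))
print(f'Overall out-of-fold AUC: {overall_auc:.3f}')
```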

In [None]:
app_train_domain['TARGET'] = train_labels

# Test the domain knowledge features
submission_domain, fi_domain, metrics_domain = model(app_train_domain, app_test_domain)
print('Baseline with domain knowledge features metrics')
print(metrics_domain)


In [None]:
fi_sorted = plot_feature_importances(fi_domain)

The model with domain knowledge features showed:

- Training and testing data shapes were (307511, 243) and (48744, 243), respectively.
- Early training rounds indicated AUC scores of about 0.804 to 0.805, with validation AUC around 0.762 to 0.770.
- A notable improvement in training AUC to 0.834 and validation AUC to 0.770 was observed.
- Final metrics across folds revealed:
  - Training AUC ranged from 0.807 to 0.832.
  - Validation AUC was between 0.763 and 0.770.
- Overall, the model achieved an average training AUC of 0.817 and a validation AUC of 0.766.

These outcomes indicate that adding the domain knowledge features gives a small improvement: the overall validation AUC rises from about 0.759 for the baseline to about 0.766.

In [None]:
submission_domain.to_csv('baseline_lgb_domain_features.csv', index = False)