<h1 style='font-size:40px'> NPL Risk Evaluation Modeling</h1>
<div style='font-size:20px'> 
    <ul> 
        <li> 
            This project aims to build a Machine Learning model that supports a bank's credit approval strategy.
        </li>
        <li> 
            The corporation has been criticized by its shareholders for its recent NPL (Non-Performing Loan) levels. Thus, the executive team has decided 
            that a more conservative credit strategy must be adopted for new contracts.
        </li>
        <li> 
            During the planning meetings, the business team has made two major requests concerning the nature of the model.
            <ul style='list-style-type:decimal'> 
                <li> 
                    It must be focused on predicting whether a given client might produce an NPL in the future.
                </li>
                <li> 
                    The output must be a score indicating how likely the event is to happen. They are not looking for 
                    a categorical "yes or no" answer.
                </li>
            </ul>
        </li>
    </ul>
    <p style='margin-left:30px'> <strong> Note:</strong> The bank defines an NPL as any loan whose payment is at least 90 days late.</p>
</div>

<h2 style='font-size:30px'> Data Importing</h2>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            The Data Engineers were able to provide two .csv views from the bank's database. The first one contains general information about the clients 
            and the second lists the loans they've contracted over some period of time.
        </li>
    </ul>
</div>

In [None]:
pip install pyspark

In [None]:
from pyspark.sql import SparkSession
from IPython.display import HTML, display

# Creating the project's SparkSession.
spark = SparkSession.builder.appName('NPL').getOrCreate()

# Also, modifying the session's log level (`setLogLevel` returns nothing, so there is no value to store).
spark.sparkContext.setLogLevel('ERROR')

# This tiny config enables us to scroll along the DataFrame's columns.
display(HTML("<style>pre { white-space: pre !important; }</style>"))

<h3 style='font-size:30px;font-style:italic'> Clients Database</h3>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            This dataset is comprised of general personal and professional information about the loans' clients.
        </li>    
        <li> 
            A particularity worth noting is that the date columns store day offsets: a negative value is the number of days since the given event took place, 
            while a positive value is the number of days since the condition ceased to exist, as may happen with unemployed borrowers in the DAYS_EMPLOYED feature.
        </li>
    </ul>
</div>
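<p style='font-size:20px'> As a minimal sketch of how this sign convention can be read, the helpers below convert a day offset into years and check whether the condition still holds. The function names <code>years_since</code> and <code>is_ongoing</code> are hypothetical, and treating zero as "not ongoing" is an assumption, since the dataset description does not say how zero should be read.</p>

```python
def years_since(days: int) -> float:
    """Convert a day offset into elapsed years, ignoring its sign."""
    return abs(days) / 365.25


def is_ongoing(days: int) -> bool:
    """A negative offset means the event took place and still holds.

    Assumption: zero is treated as not ongoing; the source does not define it.
    """
    return days < 0


# Illustrative values only, not taken from the dataset.
days_birth = -14610      # born 14610 days ago
days_employed = -730     # employed for the last 730 days

print(round(years_since(days_birth), 1))   # 40.0
print(is_ongoing(days_employed))           # True
```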

In [None]:
path_clients = '/kaggle/input/credit-card-approval-prediction/application_record.csv'

# Defining the data types from the clients dataset.
schema_clients = '''
`ID` STRING, `CODE_GENDER` STRING, `FLAG_OWN_CAR` STRING, `FLAG_OWN_REALTY` STRING, `CNT_CHILDREN` INT,
`AMT_INCOME_TOTAL` FLOAT, `NAME_INCOME_TYPE` STRING, `NAME_EDUCATION_TYPE` STRING, `NAME_FAMILY_STATUS` STRING, `NAME_HOUSING_TYPE` STRING,
`DAYS_BIRTH` INT, `DAYS_EMPLOYED` INT, `FLAG_MOBIL` STRING, `FLAG_WORK_PHONE` STRING, `FLAG_PHONE` STRING, `FLAG_EMAIL` STRING, 
`OCCUPATION_TYPE` STRING, `CNT_FAM_MEMBERS` DOUBLE
'''

# Reading the database with the created schema.
df_clients = spark.read.csv(path_clients, header=True, schema=schema_clients)
df_clients.show(5)

<h4 style='font-size:30px;font-style:italic;text-decoration:underline'> Duplicates Disclaimer</h4>
<div> 
    <ul style='font-size:20px'> 
        <li> 
             A client may appear in more than one row of the dataset because the ID column identifies a contracted loan rather than a person.
        </li>
        <li> 
            Thus, I've found it convenient for the project to create an ID column that assigns a code to each client.
        </li>
    </ul>
</div>
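<p style='font-size:20px'> The idea can be sketched in plain Python before turning to Spark: rows that agree on every non-ID feature are treated as the same person and receive the same code. This is only an illustration under that assumption; <code>assign_client_ids</code> and the sample rows are made up for the example.</p>

```python
def assign_client_ids(rows, key_fields):
    """Attach an ID_CLIENT to each loan row.

    Rows that share the same values in all `key_fields` get the same ID.
    """
    ids_by_key = {}
    labeled = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        # A new key receives the next sequential code; a known key keeps its code.
        client_id = ids_by_key.setdefault(key, str(len(ids_by_key) + 1))
        labeled.append({**row, 'ID_CLIENT': client_id})
    return labeled


# Made-up loan rows; the first two share every non-ID feature.
loans = [
    {'ID': '100', 'CODE_GENDER': 'F', 'OCCUPATION_TYPE': None},
    {'ID': '101', 'CODE_GENDER': 'F', 'OCCUPATION_TYPE': None},
    {'ID': '102', 'CODE_GENDER': 'M', 'OCCUPATION_TYPE': 'Drivers'},
]
labeled = assign_client_ids(loans, ['CODE_GENDER', 'OCCUPATION_TYPE'])
print([row['ID_CLIENT'] for row in labeled])  # ['1', '1', '2']
```

<p style='font-size:20px'> Note that Python considers two <code>None</code> values equal when comparing keys, whereas an SQL-style equality join in Spark does not match NULL keys. Missing values therefore deserve attention when the same logic is expressed as a join.</p>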

In [None]:
# Listing the `df_clients` features with the exception of ID.
features_clients = df_clients.columns
features_clients.remove('ID')
features_clients

In [None]:
# Note that the database's actual number of clients is lower than its number of rows. 
data_clients = df_clients.dropDuplicates(features_clients)
print(f'`df_clients` length: {df_clients.count()}')
print(f'Number of clients: {data_clients.count()}')

In [None]:
# We'll assign a Client ID to every loan mentioned in `df_clients`. 
from pyspark.sql.functions import row_number
from pyspark.sql.types import StringType
from pyspark.sql.window import Window

window = Window.orderBy(features_clients)
row_window = row_number().over(window)
id_clients = data_clients.withColumn('ID_CLIENT', row_window.cast(StringType())).drop('ID')

In [None]:
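# Now, we only need to enrich `df_clients` with the clients' actual identification.
from pyspark.sql.functions import col

# Performing an INNER JOIN between `df_clients` and `id_clients` using all non-ID columns as keys.
# `eqNullSafe` is used because a plain equality join never matches NULL keys, which would silently drop
# every row with a missing value in any of the features.
keys = [col(f'a.{c}').eqNullSafe(col(f'b.{c}')) for c in features_clients]
df_clients = df_clients.alias('a').join(id_clients.alias('b'), on=keys).select('a.*', 'b.ID_CLIENT')
df_clients.show(5)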

<h3 style='font-size:30px;font-style:italic'> Loans Database</h3>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            This table contains the payment records of every loan since it was contracted. 
        </li>
        <li> 
            For the dataset to suit our project's intent, two transformations are necessary: first, we need to bring the `ID_CLIENT`
            column into it; after that, we group the database so that it flags individuals who have produced an NPL at least once.            
        </li>
    </ul>
</div>

In [None]:
# Bringing the dataset into our notebook.
path_loans = '/kaggle/input/credit-card-approval-prediction/credit_record.csv'
schema_loans = '`ID` STRING, `MONTHS_BALANCE` INT, `STATUS` STRING'
df_loans = spark.read.csv(path_loans, header=True, schema=schema_loans)
df_loans.show(5)

In [None]:
# Now, providing the loans' client ID.
df_loans = df_loans.join(df_clients, ['ID']).select(['ID_CLIENT', 'ID', 'MONTHS_BALANCE', 'STATUS'])
df_loans.show(5)

<h4 style='font-size:30px;font-style:italic;text-decoration:underline'> Conceiving the Target Variable</h4>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            The `STATUS` column holds a handful of codes that represent the distinct statuses of a loan's payment. Their definitions are as follows:
            <table style='font-size:15px;margin-top:20px'> 
                <tr>
                    <th> Code</th>
                    <th> Definition</th>
                </tr>
                <tr> 
                    <td> C</td>
                    <td> Paid off that month</td>
                </tr>
                <tr> 
                    <td> 0</td>
                    <td> 1-29 days past due</td>
                </tr>
                <tr> 
                    <td> 1</td>
                    <td> 30-59 days past due </td>
                </tr>
                <tr> 
                    <td> 2</td>
                    <td> 60-89 days past due </td>
                </tr>
                <tr> 
                    <td> 3</td>
                    <td> 90-119 days past due </td>
                </tr>
                <tr> 
                    <td> 4</td>
                    <td> 120-149 days past due </td>
                </tr>
                <tr> 
                    <td> 5</td>
                    <td> Overdue or bad debts,<p> write-offs for more than 150 days</p> </td>
                </tr>
                <tr> 
                    <td> X</td>
                    <td> No loan for the month</td>
                </tr>
            </table>
        </li>
        <li style='margin-top:20px'> 
            Observe that, given the bank's 90-day NPL definition, only the codes 3, 4 and 5 are of interest to us. Thus, it is convenient to create a binary flag 
            that indicates whether a client has ever caused an NPL.
        </li>
    </ul>
</div>
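<p style='font-size:20px'> The flag's logic can be sketched in plain Python: a client is marked as soon as any of their records carries one of the codes 3, 4 or 5. This is an illustrative sketch only; <code>flag_npl_clients</code> and the sample records are hypothetical.</p>

```python
# STATUS codes that mean a payment is at least 90 days late.
NPL_CODES = {'3', '4', '5'}


def flag_npl_clients(records):
    """Map each client ID to True if any of their records has an NPL code."""
    flags = {}
    for client_id, status in records:
        flags[client_id] = flags.get(client_id, False) or status in NPL_CODES
    return flags


# Made-up (ID_CLIENT, STATUS) pairs.
records = [
    ('1', 'C'), ('1', '3'), ('1', '0'),   # one record 90-119 days past due
    ('2', '0'), ('2', 'X'), ('2', '2'),   # never reached 90 days
]
print(flag_npl_clients(records))  # {'1': True, '2': False}
```

<p style='font-size:20px'> Each client ends up with exactly one flag, regardless of how many monthly records they have.</p>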

In [None]:
# The dependent variable is built with a custom per-client GroupBy, which is convenient to write in pandas
# through `applyInPandas`.
import pandas as pd

# Defining the GroupBy's schema.
schema_flag_npl = '`ID_CLIENT` STRING, `NPL` BOOLEAN'

# STATUS codes that correspond to payments at least 90 days late.
npl_codes = ['3', '4', '5']

def has_npl(df:pd.DataFrame)->pd.DataFrame:
    '''
        Verifies if a client's records contain any sort of Non-Performing Loan.
        
        Parameter
        ---------
        `df`: The loan records of a certain client.
        
        Returns
        -------
        A single-row `pd.DataFrame` with the client's ID and a flag indicating NPL existence in their loan history. 
    '''
    # One row per client: True if any of their records carries an NPL code.
    npl = bool(df['STATUS'].isin(npl_codes).any())
    return pd.DataFrame({'ID_CLIENT': [df['ID_CLIENT'].iloc[0]], 'NPL': [npl]})

# Finally, generating our target-variable.
target = df_loans.groupBy('ID_CLIENT').applyInPandas(has_npl, schema_flag_npl)

<h2 style='font-size:30px'> Consolidating the Data</h2>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            
        </li>
    </ul>
</div>

In [None]:
df_clients.dropDuplicates(features_clients).count()


In [None]:
data_clients.count()

In [None]:
id_clients.columns

In [None]:
features_clients

In [None]:
# Why does this JOIN return fewer than 90 thousand clients?
data_clients.join(id_clients, on=features_clients).count()

<p style='color:red'> The final dataset contains duplicates; EDA?</p>