# Borrower default risk analysis

This project prepares a report for the loan division of a bank. The goal is to find out whether a customer's marital status, number of children, total income and similar characteristics have an impact on loan default. The bank already has some data on the creditworthiness of customers.

This report aims to create a **credit score** for a potential customer. A **credit score** is used to assess a potential borrower's ability to repay their loan.

* [Intro](#intro)
* [Stage 1. Data description](#data_review)
    * [1.1 Data collection](#data_collection)
    * [1.2 Data exploration](#data_exploration)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Data preparation](#data_preparation)
    * [2.2 Data classification](#data_classification)
* [Stage 3. Hypothesis testing](#Hypothesis_testing)
    * [3.1 Data correlation](#data_correlation)
    * [3.2 Credit score creation](#credit_score_creation)
* [General conclusion](#conclusion)

## Intro <a id='intro'></a>

This work aims to analyze borrower default risk using a dataset of previous customers. We are going to prepare a report for the loan division of a bank to assess a potential borrower's ability to pay back a loan. More precisely, we will investigate the following:
- link between having children and solvency
- link between marital status and solvency
- link between total income and solvency
- link between purpose and solvency and others.

## Stage 1. Data description <a id='data_review'></a>

### 1.1 Data collection <a id='data_collection'></a>

In [None]:
# Loading all libraries
import pandas as pd
import jinja2

In [None]:
# Loading the data
df = pd.read_csv('credit_scoring_eng.csv')

In [None]:
# Show the first lines of dataframe
df.head()

### 1.2 Data exploration <a id='data_exploration'></a>

**Data description**
- `children` - the number of children in the family
- `days_employed` - work experience in days
- `dob_years` - the age of the client in years
- `education` - customer education level
- `education_id` - education identifier
- `family_status` - marital status
- `family_status_id` - family status identifier
- `gender` - gender of the client
- `income_type` - employment type
- `debt` - was there any debt in repaying a loan?
- `total_income` - monthly income
- `purpose` - the purpose of obtaining a loan

In [None]:
# Let's see how many rows and columns our dataset has
df.shape

The initial dataframe is 21525 rows and 12 columns

In [None]:
# Let's display the first 10 rows
df.head(10)

The first thing that stands out is the values in the 'days_employed' column: they are negative, and this needs further investigation. Moreover, the 'education' column mixes lowercase and uppercase characters. We will fix both.

In [None]:
# Get insights into data info
df.info()

Some columns of the dataframe (days_employed, total_income) are not completely filled. For the moment we leave them as they are; soon we will evaluate whether to replace the null values or just drop them.

In [None]:
# Let's look at the filtered table with missing values from the first column where data is missing
df[df['days_employed'].isna()]

We notice that there are 2174 absent values in "days_employed" and we realize that for every row in which "days_employed" is absent, "total_income" is absent too.

In [None]:
# Let's apply multiple conditions to filter data and see the number of rows in the filtered table.
filtro_days_employed = df['days_employed'].isna()
filtro_total_income = df['total_income'].isna()
bank_list = df[filtro_days_employed & filtro_total_income]

len(bank_list)


**Conclusion**

The number of rows in which both 'days_employed' and 'total_income' are missing equals the number of null values in each of those columns in the original dataframe. So the two columns are always missing together, and we can assume that there is a link between 'days_employed' and 'total_income'.
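The check above can also be written as a direct comparison of the two missing-value masks. This is a minimal sketch on a made-up four-row dataframe (`toy` and its values are illustrative only, not taken from the bank's data).

```python
import pandas as pd

# Toy data (made up): both columns are missing in rows 1 and 3
toy = pd.DataFrame({
    "days_employed": [-100.0, None, -250.0, None],
    "total_income": [30000.0, None, 41000.0, None],
})

missing_days = toy["days_employed"].isna()
missing_income = toy["total_income"].isna()

# True only if the two columns are missing in exactly the same rows
same_rows = bool((missing_days == missing_income).all())
# Number of rows in which both columns are missing
both_missing = int((missing_days & missing_income).sum())

print(same_rows, both_missing)
```

If `same_rows` is `True` on the real dataframe, the two columns are never missing independently of each other.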

In [None]:
# Let's see what percentage of rows with missing values are out of the total data frame
perce =len(bank_list)/len(df)
print(f'Percentage of number of filtered values over the original dataframe values: {perce:.0%}')

About 10% of the rows in the initial dataframe have missing values.
At first sight, I do not see any relation between the missing values and a specific feature in the row.

10% does not seem an alarming share to me. Still, I will compare the distribution of every feature (column) in the original dataframe with its distribution in the dataframe composed only of rows with missing values.

In [None]:
# We are going to investigate the clients that do not have data on the identified characteristic and the column with the missing values
print("Relation with 'n of children' before filtering null values")
print(df['children'].value_counts())
print()
print("Relation with 'n of children' before filtering null values (percentage of the total)")
print(100*df['children'].value_counts()/df['children'].count())
print()


In [None]:
print("Relation with 'n of children' with null values filtering")
print(bank_list['children'].value_counts())
print()
print("Relation with 'n of children' with null values filtering (percentage of the total)")
print(100*bank_list['children'].value_counts()/bank_list['children'].count())

The two distributions are consistent, so there is no apparent connection between the number of children and the missing values.

Let's do the same with age:

In [None]:
print("Relation with 'age' before filtering null values")
print(df['dob_years'].value_counts().head())
print()
print("Relation with 'age' before filtering null values (percentage of the total)")
print((100*df['dob_years'].value_counts()/df['dob_years'].count()).head())

In [None]:
print("Relation with 'age' with null values filtering")
print(bank_list['dob_years'].value_counts().head())
print()
print("Relation with 'age' with null values filtering (percentage of the total)")
print((100*bank_list['dob_years'].value_counts()/bank_list['dob_years'].count()).head())

The two distributions are consistent, so there is no apparent connection between age and the missing values.

Let's do the same with education:

In [None]:
print("Relation with 'education' before filtering null values")
print(df['education'].value_counts())
print()
print("Relation with 'education' before filtering null values (percentage of the total)")
print((100*df['education'].value_counts()/df['education'].count()))

In [None]:
print("Relation with 'education' with null values filtering")
print(bank_list['education'].value_counts())
print()
print("Relation with 'education' with null values filtering (percentage of the total)")
print((100*bank_list['education'].value_counts()/bank_list['education'].count()))

The two distributions are consistent, so there is no apparent connection between education and the missing values.

Let's do the same with family status:

In [None]:
print("Relation with 'family status' before filtering null values")
print(df['family_status'].value_counts())
print()
print("Relation with 'family status' before filtering null values (percentage of the total)")
print((100*df['family_status'].value_counts()/df['family_status'].count()))

In [None]:
print("Relation with 'family status' with null values filtering")
print(bank_list['family_status'].value_counts())
print()
print("Relation with 'family status' with null values filtering (percentage of the total)")
print((100*bank_list['family_status'].value_counts()/bank_list['family_status'].count()))

The two distributions are consistent, so there is no apparent connection between family status and the missing values.

Let's do the same with gender:

In [None]:
print("Relation with 'gender' before filtering null values")
print(df['gender'].value_counts())
print()
print("Relation with 'gender' before filtering null values (percentage of the total)")
print((100*df['gender'].value_counts()/df['gender'].count()))

In [None]:
print("Relation with 'gender' with null values filtering")
print(bank_list['gender'].value_counts())
print()
print("Relation with 'gender' with null values filtering (percentage of the total)")
print((100*bank_list['gender'].value_counts()/bank_list['gender'].count()))

The two distributions are consistent, so there is no apparent connection between gender and the missing values.

Let's do the same with income type:

In [None]:
print("Relation with 'income_type' before filtering null values")
print(df['income_type'].value_counts())
print()
print("Relation with 'income_type' before filtering null values (percentage of the total)")
print((100*df['income_type'].value_counts()/df['income_type'].count()))

In [None]:
print("Relation with 'income_type' with null values filtering")
print(bank_list['income_type'].value_counts())
print()
print("Relation with 'income_type' with null values filtering (percentage of the total)")
print((100*bank_list['income_type'].value_counts()/bank_list['income_type'].count()))

The two distributions are consistent, so there is no apparent connection between income type and the missing values.

Let's do the same with debt:

In [None]:
print("Relation with 'debt' before filtering null values")
print(df['debt'].value_counts())
print()
print("Relation with 'debt' before filtering null values (percentage of the total)")
print((100*df['debt'].value_counts()/df['debt'].count()))

In [None]:
print("Relation with 'debt' with null values filtering")
print(bank_list['debt'].value_counts())
print()
print("Relation with 'debt' with null values filtering (percentage of the total)")
print((100*bank_list['debt'].value_counts()/bank_list['debt'].count()))

The two distributions are consistent, so there is no apparent connection between debt and the missing values.

Let's do the same with purpose:

In [None]:
print("Relation with 'purpose' before filtering null values")
print(df['purpose'].value_counts().head())
print()
print("Relation with 'purpose' before filtering null values (percentage of the total)")
print((100*df['purpose'].value_counts()/df['purpose'].count()).head())

In [None]:
print("Relation with 'purpose' with null values filtering")
print(bank_list['purpose'].value_counts().head())
print()
print("Relation with 'purpose' with null values filtering (percentage of the total)")
print((100*bank_list['purpose'].value_counts()/bank_list['purpose'].count()).head())

The two distributions are consistent, so there is no apparent connection between purpose and the missing values.

So, comparing the dataframe of rows with missing values against the initial dataframe, the shares of the most frequent values are very similar for all the features.

I therefore do not think the missing values are due to anything in particular; their absence appears to be random.
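The feature-by-feature comparison in this section repeats the same pattern, so it could be wrapped in one helper. The sketch below is an assumption about how that might look: `compare_shares` and the toy dataframe are hypothetical names and made-up values, not part of the original notebook.

```python
import pandas as pd


def compare_shares(full, subset, column, top=5):
    """Return the percentage share of each value of `column` in both frames."""
    full_share = full[column].value_counts(normalize=True) * 100
    subset_share = subset[column].value_counts(normalize=True) * 100
    table = pd.concat(
        [full_share, subset_share], axis=1, keys=["all_rows_%", "missing_rows_%"]
    )
    # Difference in percentage points between the two shares
    table["diff_pp"] = table["missing_rows_%"] - table["all_rows_%"]
    return table.round(2).head(top)


# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M", "F", "M"],
    "total_income": [10.0, None, 12.0, None, 9.0, 11.0, None, None],
})
toy_missing = toy[toy["total_income"].isna()]

shares = compare_shares(toy, toy_missing, "gender")
print(shares)
```

With the real data, a call such as `compare_shares(df, bank_list, 'education')` would put both distributions side by side; values of `diff_pp` close to zero support the idea that the values are missing at random.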

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>

### 2.1 Data preparation <a id='data_preparation'></a>

In [None]:
# Let's look at all the values in the 'education' column to see if the spelling will need to be corrected and what exactly will need to be corrected
df['education'].sort_values().unique()

In the 'education' column, many values indicate the same thing and differ only in letter case, so we take care of this next.

In [None]:
df['education'] = df['education'].str.lower()

In [None]:
# Checking all the values in the column to make sure we have corrected them
df['education'].sort_values().unique()

In [None]:
# Let's see the distribution of values in the `children` column
df['children'].value_counts()

There are some anomalies in this column:
- 20 children seems absurd
- -1 children is not acceptable

In my opinion, both are typing errors: '20' means '2' and '-1' means '1'.

So, I apply both corrections to the dataframe.


In [None]:
df.loc[df['children'] == 20, 'children'] = 2
df.loc[df['children'] == -1, 'children'] = 1

In [None]:
# Check the `children` column again to make sure everything is fixed
df['children'].value_counts()

Keep in mind that we are making an assumption about the nature of these errors. Treating them as typos is reasonable here, but with a larger share of affected rows such corrections could distort the data. In another setting, it would be necessary to check with the people in charge of collecting the information.

Now let's check whether the age column has any problems.

In [None]:
df['dob_years'].sort_values().unique()

In the 'dob_years' column the only thing that is not acceptable is 0 (zero), so we replace this value with the median of the column without 0 (zero).

In [None]:
median_age = df.loc[df['dob_years'] != 0, 'dob_years'].median()
median_age

Let's now assign this median value to the rows where the age is 0:

In [None]:
df.loc[df['dob_years'] == 0, 'dob_years'] = median_age

We verify:

In [None]:
df['dob_years'].sort_values().unique()

####  Restore missing values in `days_employed`

Let's work now on the column 'days_employed'

In [None]:
# Find problematic data in `days_employed`, if it exists, re-arrange them
df['days_employed'].head(20)

In the column 'days_employed' we see negative values and floats. The positive values, in addition, are one or more orders of magnitude higher, perhaps because of a problem with data collection. I will temporarily replace all the positive values in 'days_employed' with null values, and later substitute these null values with the mean or median of the respective category (which one is decided below).

In [None]:
df.loc[df['days_employed'] > 0, 'days_employed'] = None

Let's check that the replacement took place by counting the positive values in the 'days_employed' column.

In [None]:
len(df[df.days_employed > 0])

OK! We now have no positive values. Let's convert the negative values to positive using their absolute value.

In [None]:
df['days_employed'] = df['days_employed'].abs()

In [None]:
df['days_employed'].describe()

Since the distribution of values in 'days_employed' is skewed and contains some outliers (the max is more than 10x the median), I think it is better to use the median to replace the null values.
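To illustrate why the median is the safer choice for skewed data, here is a small sketch with made-up numbers (not taken from the dataset): a single extreme value moves the mean a lot but leaves the median almost untouched.

```python
import pandas as pd

# Made-up sample: most values are moderate, one is extreme
sample = pd.Series([900, 1200, 1500, 1800, 2100, 18000])

sample_mean = sample.mean()      # pulled upwards by the single extreme value
sample_median = sample.median()  # stays close to the bulk of the data

print(sample_mean, sample_median)  # 4250.0 1650.0
```

Filling the gaps with 4250 would place the imputed rows above five of the six observed values, while 1650 sits in the middle of them.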

To fill the absent values in 'days_employed' we will use the median value of the corresponding age category of the row. But to do this, we need first to find out these categories.

In [None]:
# Let's write a function that calculates the age category
def age_group(age):
    """
    The function returns the age category according to the age value, using the following rules:
    — 18-35 yo for age between 18 and 35 year old
    — 36-46 yo for age between 36 and 46 year old
    — 47-66 yo for age between 47 and 66 year old
    — >66 yo for age over 66 year old
    """
    
    if 18 <= age <= 35:
        return "18-35 yo"
    if 36 <= age <= 46:
        return "36-46 yo"
    if 47 <= age <= 66:
        return "47-66 yo"
    if age > 66:
        return ">66 yo"

Now that we have the age categories, we add a new column 'age_cat' with the age category of every customer.

In [None]:
# Create a new column 'age_cat' based on the function
df['age_cat'] = df['dob_years'].map(age_group)

Let's now review the distribution of 'days_employed' by age_cat

In [None]:
age_days_employed = df.groupby(['age_cat']).agg({'days_employed' : ['mean', 'median', 'min', 'max']})
age_days_employed

It makes sense because the older a client is, the more days he (or she) has worked.

Now we will replace the null values with the median of the corresponding age_cat, because the values are highly skewed (compare, for example, min and max).

In [None]:
def my_lambda(x):
    # Fill the missing values of a group with that group's median
    return x.fillna(x.median())

df['days_employed'] = df.groupby(['age_cat'])['days_employed'].transform(my_lambda)
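The group-wise fill above can be checked by hand on a tiny example. This sketch uses a made-up dataframe and an equivalent formulation with `transform("median")`; the column names mirror the notebook, the values do not.

```python
import pandas as pd

# Toy data (made up): one missing value in each age category
toy = pd.DataFrame({
    "age_cat": ["18-35 yo", "18-35 yo", "18-35 yo", "36-46 yo", "36-46 yo", "36-46 yo"],
    "days_employed": [500.0, 700.0, None, 2000.0, None, 3000.0],
})

# transform("median") returns, for every row, the median of that row's group
group_median = toy.groupby("age_cat")["days_employed"].transform("median")
toy["days_employed"] = toy["days_employed"].fillna(group_median)

print(toy)
```

The missing value in the "18-35 yo" group becomes 600 (the median of 500 and 700) and the one in "36-46 yo" becomes 2500, so each gap is filled from its own category rather than from the overall median.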

To keep things simple, let's convert every value to an integer.

In [None]:
df['days_employed'] = df['days_employed'].astype(int)

OK! Let's look at the resulting column info.

In [None]:
df['days_employed'].info()

OK! We now have 21525 int values in 'days_employed'.

Let's now analyze the 'family_status' column to see if there is any anomaly.

In [None]:
df['family_status'].sort_values().unique()

There's no anomaly inside the 'family_status' column, so we skip this.

Let's now analyze the 'gender' column to see if there is any anomaly.

In [None]:
df['gender'].unique()

There's no anomaly inside the 'gender' column, so we skip this.

Let's now analyze the 'income_type' column to see if there is any anomaly.

In [None]:
df['income_type'].sort_values().unique()

There's no anomaly inside the 'income_type' column, so we skip this.

Let's see if there are duplicates:

In [None]:
df.duplicated().value_counts()

We have 71 duplicated rows, so we should get rid of them and reset the index.

In [None]:
df = df.drop_duplicates().reset_index(drop = True)

In [None]:
# Final check to see whether we still have duplicates

df.duplicated().value_counts()

We now have no duplicated rows, and the index was already reset in the same step that dropped them.

In [None]:
# Checking the size of the dataset we now have, after having run these first few manipulations
df.shape

After removing the duplicated rows we end up with 21454 rows, about 0.3% fewer than before. This does not prevent us from proceeding.
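The share of removed rows can be checked directly from the counts reported in this section (21525 rows before, 71 duplicates). The variable names below are illustrative.

```python
rows_before = 21525  # rows in the initial dataframe
duplicates = 71      # duplicated rows that were dropped

rows_after = rows_before - duplicates
share_removed = duplicates / rows_before

print(rows_after)              # 21454
print(f"{share_removed:.2%}")  # 0.33%
```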

#### Restore missing values in `total_income`

As already stated, the second column with missing values is 'total_income'

In [None]:
# Distribution of mean and median value of `total_income`
df['total_income'].describe().astype(int)

Since the distribution of values in 'total_income' is skewed and contains some outliers, I think it is better to use the median to replace the null values. As with 'days_employed', we will not use the overall median but the median of a matching group, here the level of education.

In [None]:
education_total_income = df.groupby(['education']).agg({'total_income' : ['mean', 'median', 'min', 'max']})
education_total_income

This makes sense: in general, the higher the level of education, the higher the total_income.

Now we will replace the null values with the median of the corresponding education level.

In [None]:
def my_lambda(x):
    # Fill the missing values of a group with that group's median
    return x.fillna(x.median())

df['total_income'] = df.groupby(['education'])['total_income'].transform(my_lambda)

In [None]:
df['total_income'] = df['total_income'].fillna(df['total_income'].median())

Now the null values in 'total_income' are filled with the median of the corresponding education level.

To keep things simple, let's convert every value to an integer.

In [None]:
df['total_income'] = df['total_income'].astype(int)

OK! Let's look at the resulting column info.

In [None]:
df['total_income'].info()

OK! We now have 21454 int values in 'total_income', one for each row that remains after the duplicates were dropped.

Let's now analyze the 'total income' mean and median values based on every column

In [None]:
# Relation between 'age_cat' and 'total_income'
df.groupby(['age_cat']).agg({'total_income': ['mean', 'median']}).astype(int)

In [None]:
# Relation between 'children' and 'total_income'
df.groupby(['children']).agg({'total_income': ['mean', 'median']}).astype(int)


In [None]:
# Relation between 'education' and 'total_income'
df.groupby(['education']).agg({'total_income': ['mean', 'median']}).astype(int)

The relation between 'education' and 'total_income' is the clearest one so far.
In short, the more educated people are, the higher their income.
The scale is (from lowest to highest):
- primary edu
- secondary edu
- some college 
- grad degree
- BSc degree

In [None]:
# Relation between 'family_status' and 'total_income'
df.groupby(['family_status']).agg({'total_income': ['mean', 'median']}).astype(int)

In [None]:
# Relation between 'gender' and 'total_income'
df.groupby(['gender']).agg({'total_income': ['mean', 'median']}).astype(int)

Male customers clearly earn more than female customers.

In [None]:
# Relation between 'income_type' and 'total_income'
df.groupby(['income_type']).agg({'total_income': ['mean', 'median']}).astype(int)

The relation between 'income_type' and 'total_income' is reasonable as well. Business owners and entrepreneurs (possibly the same group) have the highest income, while students and clients on paternity leave have the lowest.

The 'education' column is, in my opinion, the most reliable predictor of total income.
In short, the more educated people are, the higher their income.

In [None]:
# Check the number of entries in the columns
df.info()

Ok, now we are sure that every column has all non-null values.

### 2.2 Data classification <a id='data_classification'></a>

Let's first arrange the categorical data: 'purpose' column

In [None]:
# Displays the data values selected for classification
df.purpose.sort_values().unique()

We can reduce all these different purposes to just 4. Let's see how

In [None]:
def replace_wrong_purpose(wrong_purposes, correct_purpose):
    for wrong_purpose in wrong_purposes:
        df['purpose'] = df['purpose'].replace(wrong_purpose, correct_purpose)
        
duplicates_1 = ['building a property', 'building a real estate', 'buy commercial real estate', 'buy real estate', 'buy residential real estate', 'buying property for renting out', 
                'construction of own property', 'housing', 'housing renovation', 'housing transactions', 'property', 'purchase of my own house', 'purchase of the house', 
                'purchase of the house for my family', 'real estate transactions', 'transactions with commercial real estate','transactions with my real estate']
correct_1 = "real estate"

duplicates_2 = ['buying a second-hand car', 'buying my own car', 'car', 'car purchase', 'cars', 'purchase of a car', 'second-hand car purchase', 'to buy a car', 'to own a car']
correct_2 = "car purchase"

duplicates_3 = ['education', 'getting an education', 'getting higher education', 'going to university', 'profile education', 'supplementary education', 'to become educated',
                'to get a supplementary education', 'university education']
correct_3 = "education"

duplicates_4 = ['having a wedding', 'to have a wedding', 'wedding ceremony']
correct_4 = "wedding"

replace_wrong_purpose(duplicates_1, correct_1)  
replace_wrong_purpose(duplicates_2, correct_2)
replace_wrong_purpose(duplicates_3, correct_3)
replace_wrong_purpose(duplicates_4, correct_4)

Let's check again for unique values

In [None]:
df.purpose.sort_values().unique()

Now we have only four kinds of purpose instead of the long, messy list we had before.
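As an alternative to listing every variant by hand, the same four categories could be derived from keywords. This is only a sketch: `categorize_purpose` and its keyword choices are my own assumptions, checked against the purpose strings listed above, and the `"other"` fallback is a hypothetical addition for values that match nothing.

```python
import pandas as pd


def categorize_purpose(purpose):
    """Map a free-text purpose to one of four categories by keyword."""
    text = purpose.lower()
    if "car" in text:
        return "car purchase"
    if "educat" in text or "university" in text:
        return "education"
    if "wedding" in text:
        return "wedding"
    if any(word in text for word in ("estate", "property", "hous")):
        return "real estate"
    # Hypothetical fallback for purposes that match no keyword
    return "other"


sample = pd.Series([
    "to buy a car",
    "going to university",
    "wedding ceremony",
    "housing renovation",
    "buy commercial real estate",
])
categories = sample.map(categorize_purpose)
print(categories.tolist())
```

A keyword approach keeps working if a new spelling variant appears in the data, at the cost of possible false matches, so the resulting categories should still be reviewed with `value_counts()`.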

## Stage 3. Hypothesis testing <a id='Hypothesis_testing'></a>

### 3.1 Data correlations <a id='data_correlation'></a>

**Is there a correlation between having children and paying on time?**

In [None]:
# Checking the data on the children and punctual payments

deb_clients = df['children'].value_counts()

children_deb = df.groupby('children').agg({'debt':'sum'})

children_deb["default_rate_%"] = (100*children_deb['debt']/deb_clients).round(2)

children_deb.sort_values(by = 'default_rate_%', ascending=False, inplace=True)


children_deb.style.bar(color= '#ff6200')

The table shows, respectively, that:
- 1063 clients with no children have not paid previous debts
- 445 clients with 1 child have not paid previous debts
- etc.

**Conclusion**

So, statistically speaking, customers with no children have the lowest default rate (I did not consider clients with 5 children) and are therefore the most likely to pay back their debt, whereas customers with 4 children have the highest default rate.

**Is there a correlation between family situation and paying on time?**

In [None]:
# Check the data on family status and on-time payments

fam_stat_clients = df['family_status'].value_counts()

fam_stat_deb = df.groupby('family_status').agg({'debt':'sum'})

fam_stat_deb["default_rate %"] = (100*fam_stat_deb['debt']/fam_stat_clients).round(2)

fam_stat_deb.sort_values(by = 'default_rate %', ascending=False, inplace=True)

fam_stat_deb.style.bar(color= '#ff6200')

**Conclusion**

So the default rate of widowed customers is the lowest, making them the most likely to pay back their debt, whereas unmarried customers have the highest default rate.

**Is there a correlation between income level and on-time payment?**

To do this, let's create a category of different income levels

In [None]:
def total_income_group(income):
    """
    The function returns the income category, using the following rules:
    - >50k for customers who earn more than 50k/year
    - 35-50k for customers who earn more than 35k and up to 50k/year
    - 25-34k for customers who earn more than 25k and up to 35k/year
    - 15-24k for customers who earn more than 15k and up to 25k/year
    - <15k for customers who earn 15k/year or less
    """
    
    if income > 50000:
        return '>50k'
    if 35000 < income <= 50000:
        return '35-50k'
    if 25000 < income <= 35000:
        return '25-34k'
    if 15000 < income <= 25000:
        return '15-24k'
    if income <= 15000:
        return '<15k'   

In [None]:
df['income_cat'] = df['total_income'].map(total_income_group)
df.head()

Good, now we have a new column, income_cat!
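The same binning can also be expressed with `pd.cut`, which makes the bin edges explicit. The sketch below reuses the thresholds and labels of `total_income_group`; the sample incomes are made up.

```python
import pandas as pd

# Made-up incomes, including values that sit exactly on a bin edge
income = pd.Series([12000, 15000, 20000, 25000, 30000, 42000, 50000, 80000])

bins = [float("-inf"), 15000, 25000, 35000, 50000, float("inf")]
labels = ["<15k", "15-24k", "25-34k", "35-50k", ">50k"]

# right=True (the default) closes every interval on the right, e.g. (15000, 25000]
income_cat = pd.cut(income, bins=bins, labels=labels)
print(income_cat.tolist())
```

Values on an edge (15000, 25000, 50000) fall into the lower category, which matches the `<=` comparisons in the function above.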

In [None]:
income_clients = df['income_cat'].value_counts()

income_deb = df.groupby('income_cat').agg({'debt':'sum'})

income_deb["default_rate_%"] = (100*income_deb['debt']/income_clients).round(2)

income_deb.sort_values(by = 'default_rate_%', ascending=False, inplace=True)

income_deb.style.bar(color= '#ff6200')


This last means respectively that:
- 739 clients with total_income between 15k-24k have not paid previous debts
- 414 clients with total_income between 25k-34k have not paid previous debts
- etc.

So we realize that, statistically speaking, customers with higher income are more likely to pay back their debt.

**How does the purpose of the loan affect the default rate?**

In [None]:
# Consulting the default rate percentages for each credit purpose and analyze them

purpose_clients = df['purpose'].value_counts()

purpose_deb = df.groupby('purpose').agg({'debt':'sum'})

purpose_deb["default_rate_%"] = (100*purpose_deb['debt']/purpose_clients).round(2)

purpose_deb.sort_values(by = 'default_rate_%', ascending=False, inplace=True)

purpose_deb.style.bar(color= '#ff6200')

So, statistically speaking, customers who borrow for real estate are the most likely to pay back their debt, and customers who borrow for a car purchase are the least reliable.

**How does education affect the default rate?**

In [None]:
education_clients = df['education'].value_counts()

education_deb = df.groupby('education').agg({'debt':'sum'})

education_deb["default_rate_%"] = (100*education_deb['debt']/education_clients).round(2)

education_deb.sort_values(by = 'default_rate_%', ascending=False, inplace=True)

education_deb.style.bar(color= '#ff6200')

So, statistically speaking, customers with a bachelor's degree are more likely to pay back their debt, and customers with primary education are the least reliable.
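The default-rate tables in this section all follow the same recipe, so they could be produced by one helper. This is a sketch under assumptions: `default_rate_table` is a hypothetical name, it adds a `clients` column that the original tables do not show, and the toy data is made up.

```python
import pandas as pd


def default_rate_table(data, column):
    """Count clients and defaults per group and compute the default rate in percent."""
    table = data.groupby(column)["debt"].agg(clients="count", defaults="sum")
    table["default_rate_%"] = (100 * table["defaults"] / table["clients"]).round(2)
    return table.sort_values("default_rate_%", ascending=False)


# Toy data (made up): 1 default among 4 car loans, 1 among 5 wedding loans
toy = pd.DataFrame({
    "purpose": ["car purchase"] * 4 + ["wedding"] * 5,
    "debt": [1, 0, 0, 0, 0, 0, 0, 0, 1],
})

rates = default_rate_table(toy, "purpose")
print(rates)
```

Showing the number of clients next to the rate helps to spot groups that are too small to support a conclusion, such as the clients with 5 children mentioned earlier.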

### 3.2 Credit score creation <a id='credit_score_creation'></a>

Now we know, for each category evaluated (number of children, family status, total income, purpose and education), which group is more financially reliable and which is less. So we can create an index for each category and then calculate the overall creditworthiness of every customer.

Let's begin by setting an index for children.

In [None]:
def children_index(children):
    """
    The function returns the children index according to the previous analysis, using the following rules:
    — index = 6 (max) for customer with 5 children
    — index = 5 for customer with 0 children
    — index = 4 for customer with 3 children
    — index = 3 for customer with 1 children
    — index = 2 for customer with 2 children
    — index = 1 (min) for customer with 4 children 
    """
    
    if children == 5:
        return 6
    if children == 0:
        return 5
    if children == 3:
        return 4
    if children == 1:
        return 3
    if children == 2:
        return 2
    if children == 4:
        return 1

Let's create a new column 'children_ind' using the previous function

In [None]:
df['children_ind'] = df['children'].map(children_index)
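Because each index function is a fixed lookup, the same mapping can be written as a dictionary passed to `Series.map`. The sketch below reuses the scores from `children_index`; `CHILDREN_SCORES` is a hypothetical name.

```python
import pandas as pd

# Same scores as in children_index above, written as a lookup table
CHILDREN_SCORES = {5: 6, 0: 5, 3: 4, 1: 3, 2: 2, 4: 1}

children = pd.Series([0, 1, 2, 3, 4, 5])
children_ind = children.map(CHILDREN_SCORES)
print(children_ind.tolist())  # [5, 3, 2, 4, 1, 6]
```

Values that are not in the dictionary become `NaN`, which makes unexpected categories easy to find with `isna()`.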

Let's set an index for age

In [None]:
def age_index(age):
    """
    The function returns the age index, using the following rules:
    - index = 1 (min) for customer aged 18-35
    - index = 2 for customer aged 36-46
    - index = 3 for customer aged 47-66
    - index = 4 (max) for customer aged >66
    """

    if 18 <= age <= 35:
        return 1
    if 36 <= age <= 46:
        return 2
    if 47 <= age <= 66:
        return 3
    if age > 66:
        return 4

Let's create a new column 'age_ind' using the previous function

In [None]:
df['age_ind'] = df['dob_years'].map(age_index)

Let's set an index for family status

In [None]:
def family_status_index(family_status):
    """
    The function returns the family status index according to the previous analysis, using the following rules:
    — index = 5 (max) for widow customer 
    — index = 4 for divorced customer 
    — index = 3 for married customer
    — index = 2 for civil partnership customer
    — index = 1 (min) for unmarried customer  
    """
    
    if family_status == 'unmarried':
        return 1
    if family_status == 'civil partnership':
        return 2
    if family_status == 'married':
        return 3
    if family_status == 'divorced':
        return 4
    if family_status == 'widow / widower':
        return 5

Let's create a new column 'family_status_ind' using the previous function

In [None]:
df['family_status_ind'] = df['family_status'].map(family_status_index)

Let's set an index for education

In [None]:
def education_index(education):
    """
    The function returns the education index according to the previous analysis, using the following rules:
    — index = 5 (max) for graduate degree customer 
    — index = 4 for bachelor's degree customer 
    — index = 3 for secondary education customer
    — index = 2 for some college customer
    — index = 1 (min) for primary education customer  
    """
    
    if education == 'primary education':
        return 1
    if education == 'some college':
        return 2
    if education == 'secondary education':
        return 3
    if education == "bachelor's degree":
        return 4
    if education == "graduate degree":
        return 5

Let's create a new column 'education_ind' using the previous function

In [None]:
df['education_ind'] = df['education'].map(education_index)

Let's set an index for income

In [None]:
def income_index(income):
    """
    The function returns the income index according to the previous analysis, using the following rules:
    - index = 5 (max) for customers who earn more than 50k/year
    - index = 4 for customers who earn more than 35k and up to 50k/year
    - index = 3 for customers who earn 15k/year or less
    - index = 2 for customers who earn more than 25k and up to 35k/year
    - index = 1 (min) for customers who earn more than 15k and up to 25k/year
    """
    
    if income > 50000:
        return 5
    if 35000 < income <= 50000:
        return 4
    if 25000 < income <= 35000:
        return 2
    if 15000 < income <= 25000:
        return 1
    if income <= 15000:
        return 3 

Let's create a new column 'income_ind' using the previous function

In [None]:
df['income_ind'] = df['total_income'].map(income_index)

Let's set an index for purpose

In [None]:
def purpose_index(purpose):
    """
    The function returns the purpose index according to the previous analysis, using the following rules:
    — index = 4 (max) for real estate customer 
    — index = 3 for wedding customer
    — index = 2 for education customer
    — index = 1 (min) car purchase customer  
    """
    
    if purpose == 'car purchase':
        return 1
    if purpose == 'education':
        return 2
    if purpose == 'wedding':
        return 3
    if purpose == "real estate":
        return 4

Let's create a new column 'purpose_ind' using the previous function

In [None]:
df['purpose_ind'] = df['purpose'].map(purpose_index)

And finally we can have the final index of every row just calculating the mean value of all the indexes

In [None]:
df['final_ind'] = ((df['children_ind'] + df['age_ind'] + df['family_status_ind'] + df['education_ind'] + df['income_ind'] + df['purpose_ind'])/6).round(2)

In practice, the final index ranges between 1 and 4.8. The smaller the final index, the lower the creditworthiness of the debtor.
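The range of the final index follows from the lowest and highest score that each index function above can return. A short sketch of that arithmetic (the dictionary name is illustrative):

```python
# Minimum and maximum score each index function above can return
score_ranges = {
    "children_ind": (1, 6),
    "age_ind": (1, 4),
    "family_status_ind": (1, 5),
    "education_ind": (1, 5),
    "income_ind": (1, 5),
    "purpose_ind": (1, 4),
}

lowest = sum(low for low, _ in score_ranges.values()) / len(score_ranges)
highest = sum(high for _, high in score_ranges.values()) / len(score_ranges)

print(round(lowest, 2), round(highest, 2))  # 1.0 4.83
```

Since the indexes have different maximum values (4, 5 or 6), they do not weigh equally in the mean; rescaling each one to a common range before averaging would be one way to balance them.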

## General conclusion <a id='conclusion'></a>

We began with a 21525x12 dataframe in which roughly 10% of the rows had missing values. These missing values always appeared in the columns "days_employed" and "total_income" at the same time, but I did not find any connection between their absence and the other features.
I then cleaned the values in "days_employed": I removed the original positive values because they were too high to be reasonable, replaced them (and the missing values) with the median of the corresponding age category, and converted the negative float values into positive integers.
I dropped the duplicated rows (71 of them) and filled the missing values in "total_income" with the median of the corresponding education level. I used the median because some extreme values, much higher or much lower than the rest, would distort a simple average.
After all this, I ended up with 21454 rows.
After that, I analyzed each column by category in relation to the ability to pay back the debt.

Finally, I found that the main indicators of a client's likelihood of paying back a debt are:
- higher education
- real estate purpose
- high income level
- not having children
- being a widow/widower