# Analyzing borrowers’ risk of defaulting

The dataset contains bank data on customers’ creditworthiness.

The purpose of the study: find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan.

The Work Plan:

<a href='#section1'>1. Getting the data and studying the general information</a>

<a href='#section2'>2. Data preprocessing</a>
* <a href='#section2.1'>2.1 Processing missing values</a>
* <a href='#section2.2'>2.2 Data type replacement</a>
* <a href='#section2.3'>2.3 Processing duplicates</a>

<a href='#section3'>3. Categorizing data</a>

<a href='#section4'>4. Study of the impact of the investigated features on debt repayment</a>

<a href='#section5'>5. Overall Conclusion</a>

## 1. Getting the data and studying the general information
<a id='section1'></a>

In [1]:
#reading a file that contains credit scoring by customers 
import pandas as pd
credit_scores_data = pd.read_csv('/datasets/credit_scoring_eng.csv')

#looking at the data's general information. 
credit_scores_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


_The table contains __21525 rows__ for __12 features__._ 

*There are __missing values__ in the `'days_employed'` and `'total_income'` columns. The number of missing values is the same in both, so the gaps most likely occur in the same rows. In addition, the `'days_employed'` column is of type __float__, although from the description the values are expected to be __integers__.* 

*The columns `'dob_years'` (age), `'children'`, `'education_id'` and `'family_status_id'` use __int64__. This seems excessive: __int8__ would be enough, since the values most likely fit into the interval from -128 to 127.*
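The int8 assumption can be checked before converting anything. The snippet below is a minimal sketch on a small hand-made sample (the `sample` frame and its values are illustrative, not the real dataset); it compares each column's minimum and maximum with the limits reported by `np.iinfo`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the four integer columns discussed above
sample = pd.DataFrame({
    'children': [1, 0, 3, -1, 20],
    'dob_years': [42, 36, 33, 0, 75],
    'education_id': [0, 1, 1, 2, 4],
    'family_status_id': [0, 0, 1, 3, 4],
})

# Limits of the int8 type: -128 and 127
int8_limits = np.iinfo(np.int8)

# True for every column whose values all fit into int8
fits_int8 = (sample.min() >= int8_limits.min) & (sample.max() <= int8_limits.max)
print(fits_int8)
```

On the real table the same check would be run on `credit_scores_data` restricted to these columns.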

In [2]:
#looking at the first few values in a table
credit_scores_data.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding


*The values in the `'days_employed'` column are indeed __not integers__ and, moreover, some of them are __negative__. In the last displayed row, the value is __a little over 340 thousand__. If these are days, converting them to years by dividing by 365 gives __over 900 years__ of work experience. So far these __look like incorrect values__.*
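As a quick arithmetic check of the estimate above, the value from the last displayed row can be converted to years directly. This is only a sketch: it assumes, as the text does, that the column is measured in days and that a year has 365 days.

```python
# Value of 'days_employed' in the last displayed row (index 4)
days_employed = 340266.072047

# Conversion to years under the assumption of 365 days per year
years_employed = days_employed / 365
print(round(years_employed, 1))  # roughly 932 years
```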

*In addition, it seems that the information in the `'education'` column __duplicates the information__ in the `'education_id'` column. The same is true for the pair `'family_status'` and `'family_status_id'`.*

_Moreover, the same values in the `'education'` column appear __in different letter cases__._

In [3]:
#Let's look at numerical and categorical features in more detail
#creating the corresponding lists
numeric_cols = ['children', 'days_employed', 'dob_years', 'total_income']
categorical_cols = ['education', 'education_id', 'family_status', 'family_status_id',\
                    'gender', 'income_type', 'debt', 'purpose']

#adding a function to find the minimum positive number in a series. It's required at least for 'days_employed' feature.
def positive_min(series):
    return min([x for x in series if x > 0])

#a couple of test cases:
#positive_min([-1, 2, 3])
#positive_min([-1, -2])

#looking at the values that take numerical features 
credit_scores_data[numeric_cols].agg(['min', 'max', 'mean', 'median', positive_min]).T

Unnamed: 0,min,max,mean,median,positive_min
children,-1.0,20.0,0.538908,0.0,1.0
days_employed,-18388.949901,401755.400475,63046.497661,-1203.369529,328728.720605
dob_years,0.0,75.0,43.29338,42.0,19.0
total_income,3306.762,362496.645,26787.568355,23202.87,3306.762


*The following __problems in the numerical columns__ can be identified:*
* *a negative number of children (see the min value of the `'children'` feature)*
* *20 kids seems like a lot for one family (see the max value of the `'children'` feature)*
* *at least 50% of the values of the `'days_employed'` feature are negative (the median is negative)*
* *the minimum positive value of the `'days_employed'` feature is huge (see the positive_min value of the `'days_employed'` feature)*
* *some customers have an age of zero (see the min value of the `'dob_years'` feature)*

*In addition, the values in the `'dob_years'` (age) and `'children'` columns are small enough that the __int8__ type would suffice.*

In [4]:
#Let's look at these problems in more detail

#some global constants for formatting the output string
BOLD = '\033[1m'
END = '\033[0m'

#looking at the 'children' feature
print(BOLD + "The 'children' feature:" + END)
print("Number of customers with a value of -1 in the 'children' field:",\
          len(credit_scores_data[credit_scores_data.children==-1]))
print("Number of customers with 20 kids in the 'children' field:",\
          len(credit_scores_data[credit_scores_data.children==20]))
print("Number of customers with 20 kids less than 15+20 years of age:",\
          len(credit_scores_data[(credit_scores_data.children==20)\
                   &(credit_scores_data.dob_years<(15+20))\
                   &(credit_scores_data.dob_years>0)]))
print()

#looking at the 'days_employed' feature
print(BOLD + "The 'days_employed' feature:" + END)

#calculating the number of negative values of the 'days_employed' feature:
neg_days_employed_count = credit_scores_data[credit_scores_data.days_employed<0].days_employed.count()
print("The number of negative values of the 'days_employed' feature is", neg_days_employed_count,\
         "which is {:.1%}".format(neg_days_employed_count/credit_scores_data.shape[0]))
print()
print("The 'days_employed' feature values converted in years:")
print((credit_scores_data[['days_employed']]/365).agg(['min', 'max', 'mean', 'median', positive_min]).T)
print()

#looking at the 'dob_years' feature
print(BOLD + "The 'dob_years' feature:" + END)
print("Number of customers with zero age: ",\
          len(credit_scores_data[credit_scores_data.dob_years==0]))

The 'children' feature:
Number of customers with a value of -1 in the 'children' field: 47
Number of customers with 20 kids in the 'children' field: 76
Number of customers with 20 kids less than 15+20 years of age: 21

The 'days_employed' feature:
The number of negative values of the 'days_employed' feature is 15906 which is 73.9%

The 'days_employed' feature values converted in years:
                     min          max        mean    median  positive_min
days_employed -50.380685  1100.699727  172.730131 -3.296903    900.626632

The 'dob_years' feature:
Number of customers with zero age:  101


_A value of __-1 for the number of children__ looks __incorrect__, as do __20 kids__ for parents younger than 35, that is, the beginning of reproductive age (15 years) plus 20. Such values are not very numerous, but they are not isolated cases either. This should be taken into account in further data analysis and considered separately._

*More than __70%__ of the values in the `'days_employed'` column are __negative__, and the values are __not integers__. The __minimum positive value__ of the `'days_employed'` feature converted __to years__ is also __huge__.*

_The __age__ of __101__ customers is __unknown__ and is recorded as __0__ in the table._

In [5]:
#looking at all unique values of the 'children' column and number of each values
credit_scores_data.children.value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

_Other values of the 'children' column (except -1 and 20) appear normal._

In [6]:
#Let's look at the values that take categorical features 
pd.set_option('display.max_colwidth', -1)
credit_scores_data[categorical_cols].agg(['unique']).T

Unnamed: 0,unique
education,"[bachelor's degree, secondary education, Secondary Education, SECONDARY EDUCATION, BACHELOR'S DEGREE, some college, primary education, Bachelor's Degree, SOME COLLEGE, Some College, PRIMARY EDUCATION, Primary Education, Graduate Degree, GRADUATE DEGREE, graduate degree]"
education_id,"[0, 1, 2, 3, 4]"
family_status,"[married, civil partnership, widow / widower, divorced, unmarried]"
family_status_id,"[0, 1, 2, 3, 4]"
gender,"[F, M, XNA]"
income_type,"[employee, retiree, business, civil servant, unemployed, entrepreneur, student, paternity / maternity leave]"
debt,"[0, 1]"
purpose,"[purchase of the house, car purchase, supplementary education, to have a wedding, housing transactions, education, having a wedding, purchase of the house for my family, buy real estate, buy commercial real estate, buy residential real estate, construction of own property, property, building a property, buying a second-hand car, buying my own car, transactions with commercial real estate, building a real estate, housing, transactions with my real estate, cars, to become educated, second-hand car purchase, getting an education, car, wedding ceremony, to get a supplementary education, purchase of my own house, real estate transactions, getting higher education, to own a car, purchase of a car, profile education, university education, buying property for renting out, to buy a car, housing renovation, going to university]"


*The following __problems in the categorical columns__ can be identified:*
* *the same values in the `'education'` column appear in different letter cases*
* *the information in the `'education'` column most likely duplicates the information in the `'education_id'` column*
* *the information in the `'family_status'` column most likely duplicates the information in the `'family_status_id'` column*
* *the `'gender'` column has an odd value, 'XNA'*
* *some values in the `'purpose'` column have the same meaning, for example 'to have a wedding' and 'having a wedding', or 'car purchase' and 'buying my own car'*

*In addition, the values in the `'education_id'` and `'family_status_id'` columns are small enough that the __int8__ type would suffice. For the `'debt'` column, a __Boolean type__ would be enough.*

In [7]:
#Let's look at these problems in more detail
print(BOLD + "Mapping 'education' column and 'education_id' column:\n" + END)
print(credit_scores_data[['education', 'education_id']].drop_duplicates(subset=['education', 'education_id'])\
        .sort_values('education_id'))
print()
print(BOLD + "Mapping 'family_status' column and 'family_status_id' column:\n" + END)
print(credit_scores_data[['family_status', 'family_status_id']].\
        drop_duplicates(subset=['family_status', 'family_status_id'])\
            .sort_values('family_status_id'))
print()
print(BOLD + "Unique values of 'gender' column and their counts:" + END)
print(credit_scores_data.gender.value_counts())

Mapping 'education' column and 'education_id' column:

                education  education_id
0     bachelor's degree    0           
8     BACHELOR'S DEGREE    0           
62    Bachelor's Degree    0           
1     secondary education  1           
2     Secondary Education  1           
7     SECONDARY EDUCATION  1           
13    some college         2           
134   SOME COLLEGE         2           
376   Some College         2           
31    primary education    3           
797   PRIMARY EDUCATION    3           
2817  Primary Education    3           
2963  Graduate Degree      4           
4170  GRADUATE DEGREE      4           
6551  graduate degree      4           

Mapping 'family_status' column and 'family_status_id' column:

        family_status  family_status_id
0   married            0               
4   civil partnership  1               
18  widow / widower    2               
19  divorced           3               
24  unmarried          4 

*The `'education'` column indeed __duplicates__ the information in the `'education_id'` column, with the same values written in __different letter cases__. Therefore, the values of the `'education'` column need to be converted to lowercase. The excess information can be moved to a separate dictionary to simplify the table and use less memory.*

*The information in the `'family_status'` column __entirely duplicates__ the information in the `'family_status_id'` column. This excess information can be moved to a separate dictionary as well.*
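Before moving a text column into a separate dictionary, it may be worth confirming that names and IDs really map one to one once letter case is ignored. The following is a minimal sketch on an illustrative sample (the `sample` frame is hypothetical); it counts distinct names per ID and distinct IDs per name.

```python
import pandas as pd

# Hypothetical sample with the same education level in different letter cases
sample = pd.DataFrame({
    'education': ["bachelor's degree", "BACHELOR'S DEGREE",
                  'secondary education', 'Secondary Education', 'some college'],
    'education_id': [0, 0, 1, 1, 2],
})

# Ignore letter case, then count distinct values in both directions
normalized = sample.assign(education=sample['education'].str.lower())
names_per_id = normalized.groupby('education_id')['education'].nunique()
ids_per_name = normalized.groupby('education')['education_id'].nunique()

# The mapping is one to one if every ID has one name and every name has one ID
is_one_to_one = bool((names_per_id == 1).all() and (ids_per_name == 1).all())
print(is_one_to_one)
```

The same check applies unchanged to `'family_status'` and `'family_status_id'`.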

_The __gender__ of only 1 customer is __unknown__ and is __'XNA'__ in the table._

### Conclusions

The table contains 21525 rows for the following 12 features:
* 4 numerical features:
 1. `'children'` - the number of children in the family
     *   __has excess type__ int64 which can be converted to int8
     *   __has odd values__ -1 and 20 which should be taken into account in further data analysis and considered separately. These values could be due to technical or human error, for example, a dash in answer of a questionnaire could be processed as -1, and 2 children could turn into 20 by adding 0.
 2. `'days_employed'` - how long the customer has been working
     *   __contains null values__. It is possible that this value cannot be specified because the customer is unemployed or a student. Or it is just a technical or human error. 
     *   __has incorrect values__ that cannot be a number of days of employment. These values could be due to technical or human error, for example, when exporting data, the column with the account balance was selected instead of the required one. It makes sense to contact the developer who sent the source file. If nothing is cleared up, then this column will have to be removed from consideration.
 3. `'dob_years'` -  the customer’s age
     *   __has excess type__ int64 which can be converted to int8
     *   __value for 101 customers is unknown__ and is 0 in the table
 4. `'total_income'` -  monthly income
     *   __contains null values__. It is possible that the customer's income cannot be specified because he/she is unemployed or a student. Or it is just a technical or human error. 
* 8 categorical features:
 1. `'education'` - the customer’s education level
     * __contains the same values in different letter cases__. It is necessary to convert the values to lowercase. The duplication could have occurred because the information was filled in manually rather than selected from a list.  
     * __duplicates__ the information in the `'education_id'` column. The excess information can be moved to a separate dictionary to simplify the table and use less memory.
 2. `'education_id'` - identifier for the customer’s education
     * __has excess type__ int64 which can be converted to int8
     * __duplicates__ the information in the `'education'` column
 3. `'family_status'` - the customer’s marital status 
     * __duplicates__ the information in the `'family_status_id'` column. The excess information can be moved to a separate dictionary to simplify the table and use less memory.
 4. `'family_status_id'` - identifier for the customer’s marital status
     * __has excess type__ int64 which can be converted to int8
     * __duplicates__ the information in the `'family_status'` column
 5. `'gender'` - the customer’s gender
     * __value for 1 customer is unknown__ and is 'XNA' in the table
 6. `'income_type'` - the customer’s income type
     * values could also be encoded with numeric IDs, as in `'education_id'` and `'family_status_id'`
 7. `'debt'` - whether the customer has ever defaulted on a loan
     * __has excess type__ int64 which can be converted to bool
 8. `'purpose'` - reason for taking out a loan
     * it could take fewer categorical values

## 2. Data Preprocessing
<a id='section2'></a>

### 2.1 Processing missing values
<a id='section2.1'></a>

In [8]:
# Printing the number of missing values per column 
print("The number of missing values per column:\n")
print(credit_scores_data.isnull().sum())
print()

#Let's calculate percentage of null values
#Calculating number of null values
gaps_num = len(credit_scores_data[credit_scores_data.days_employed.isnull()].days_employed)
print("As we noted earlier, the number of gaps is the same in both columns and is {:.1%} of all data."\
         .format(gaps_num/credit_scores_data.shape[0]))
print()

#Do the missing values of these two variables occur in the same rows?
print("The number of null values of the days_employed feature among the not-null values of the total_income feature:")
print(credit_scores_data[~credit_scores_data.total_income.isnull()].days_employed.isnull().sum())
print()
print("The number of null values of the total_income feature among the not-null values of the days_employed feature:")
print(credit_scores_data[~credit_scores_data.days_employed.isnull()].total_income.isnull().sum())

The number of missing values per column:

children            0   
days_employed       2174
dob_years           0   
education           0   
education_id        0   
family_status       0   
family_status_id    0   
gender              0   
income_type         0   
debt                0   
total_income        2174
purpose             0   
dtype: int64

As we noted earlier, the number of gaps is the same in both columns and is 10.1% of all data.

The number of null values of the days_employed feature among the not-null values of the total_income feature:
0

The number of null values of the total_income feature among the not-null values of the days_employed feature:
0


*There are missing values in two columns: `'days_employed'` and `'total_income'`. The gaps occur in the same rows. Missing values make up 10% of all data, which is too large a share to simply delete.*
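The "same rows" check can also be expressed in a single comparison of the two null masks. This is a minimal sketch on an illustrative sample (the `sample` frame is hypothetical), not a replacement for the counts printed above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample where both columns are missing in the same rows
sample = pd.DataFrame({
    'days_employed': [-8437.67, np.nan, -4024.80, np.nan],
    'total_income': [40620.102, np.nan, 17932.802, np.nan],
})

# True only if every row is either missing in both columns or filled in both
same_rows = bool((sample['days_employed'].isnull() == sample['total_income'].isnull()).all())
print(same_rows)
```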

If the customers are not employed or are students then it makes sense that total income is unknown and because of this, there are missing values in the table. Let's look at customers with income type unemployed or student: 

In [9]:
credit_scores_data[(credit_scores_data.income_type=='unemployed')\
                   |(credit_scores_data.income_type=='student')]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
3133,1,337524.466835,31,secondary education,1,married,0,M,unemployed,1,9593.119,buying property for renting out
9410,0,-578.751554,22,bachelor's degree,0,unmarried,4,M,student,0,15712.26,construction of own property
14798,0,395302.838654,45,Bachelor's Degree,0,civil partnership,1,F,unemployed,0,32435.602,housing renovation


_There are no null values for unemployed customers or students._

Let's look at income type for customers with null value of total_income:

In [10]:
#save unique values of 'income_type' column for missing values of 'total_income' and their counts
gaps_count_by_income_type = credit_scores_data[credit_scores_data.total_income.isnull()].income_type.value_counts()
gaps_count_by_income_type

employee         1105
business         508 
retiree          413 
civil servant    147 
entrepreneur     1   
Name: income_type, dtype: int64

_There are missing values for different income types._

Income most likely differs between income types, so the missing values should be filled in according to the customer's income type. Let's look at income by income type:

In [11]:
credit_scores_data.groupby('income_type').total_income.agg(['count','min', 'median', 'mean', 'max'])

Unnamed: 0_level_0,count,min,median,mean,max
income_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
business,4577,4592.45,27577.272,32386.793835,362496.645
civil servant,1312,4672.012,24071.6695,27343.729582,145672.235
employee,10014,3418.824,22815.1035,25820.841683,276204.162
entrepreneur,1,79866.103,79866.103,79866.103,79866.103
paternity / maternity leave,1,8612.661,8612.661,8612.661,8612.661
retiree,3443,3306.762,18962.318,21940.394503,117616.523
student,1,15712.26,15712.26,15712.26,15712.26
unemployed,2,9593.119,21014.3605,21014.3605,32435.602


_Incomes do differ for different income types; for instance, the income for business or civil servant customers is on average higher than for retirees._ 

_For the types with many observations, the mean is greater than the median. This implies that income has outliers with large values. So it's better to use the median for filling in the missing values._

_Note that in our table the income for the entrepreneur type is filled in for only one customer and the value is quite high, so we cannot fill in the missing value of this type with this single value. The missing value for the entrepreneur should then either be filled in with the median across the entire table (in which case the income will most likely be underestimated) or this customer should be removed from consideration entirely._

In [12]:
#income types with the missing values more than only for one customer:
income_type_with_gaps = list(gaps_count_by_income_type[gaps_count_by_income_type>1].index)

#Let's fill gaps in total income by income type in case when the data is enough
#The gaps must be filled with a different value for each income type, so .loc with a boolean mask is used
for t in income_type_with_gaps:
    income_type_median = credit_scores_data[credit_scores_data.income_type==t].total_income.median()
    credit_scores_data.loc[credit_scores_data.total_income.isnull()\
                             &(credit_scores_data.income_type==t), 'total_income'] = income_type_median

#filling gaps in total income for entrepreneur with median across all data    
median_income = credit_scores_data.total_income.median()
credit_scores_data.loc[credit_scores_data.total_income.isnull()\
                         &(credit_scores_data.income_type=='entrepreneur'),\
                         'total_income'] = median_income
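For comparison, a similar group-wise fill can be sketched without an explicit loop by using `groupby(...).transform`. This is a variation under stated assumptions, not the code used above: it runs on a hypothetical `sample` frame, falls back to the overall median for any type with a single known income, and computes that overall median from the observed values only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two types with enough data, one with a single known income
sample = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee',
                    'retiree', 'retiree', 'retiree',
                    'entrepreneur', 'entrepreneur'],
    'total_income': [20000.0, 30000.0, np.nan,
                     15000.0, 19000.0, np.nan,
                     80000.0, np.nan],
})

# Per-row number of known incomes and median income within the row's income type
known_count = sample.groupby('income_type')['total_income'].transform('count')
group_median = sample.groupby('income_type')['total_income'].transform('median')

# Median over all observed incomes, used where a type has too little data
overall_median = sample['total_income'].median()

# Use the group median when the type has more than one known income
fill_values = group_median.where(known_count > 1, overall_median)
sample['total_income'] = sample['total_income'].fillna(fill_values)
print(sample)
```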

#### Conclusions

There are missing values in the `'days_employed'` and `'total_income'` columns. The gaps occur in the same rows. Missing values make up 10% of all data, which is quite a large share. 

The `'days_employed'` column is left unchanged until it becomes clear what data it contains.

The total income depends on the type of employment that the bank's customer is engaged in. Incomes do differ for different income types. Missing values in total income were filled in with the median for the customer's income type, except for the entrepreneur type, for which there is not enough data; that gap was filled in with the median across the entire table.

### 2.2 Data type replacement
<a id='section2.2'></a>

In [13]:
#converting excess type int64 to int8 with help of astype() method:
credit_scores_data['children'] = credit_scores_data.children.astype('int8') 
credit_scores_data['dob_years'] = credit_scores_data.dob_years.astype('int8') 
credit_scores_data['education_id'] = credit_scores_data.education_id.astype('int8') 
credit_scores_data['family_status_id'] = credit_scores_data.family_status_id.astype('int8') 
credit_scores_data['debt'] = credit_scores_data.debt.astype('bool')  

##looking at the data's general information including data types  
credit_scores_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int8
days_employed       19351 non-null float64
dob_years           21525 non-null int8
education           21525 non-null object
education_id        21525 non-null int8
family_status       21525 non-null object
family_status_id    21525 non-null int8
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null bool
total_income        21525 non-null float64
purpose             21525 non-null object
dtypes: bool(1), float64(2), int8(4), object(5)
memory usage: 1.3+ MB


*The excess int64 type of the columns `'children'`, `'dob_years'`, `'education_id'` and `'family_status_id'` was converted to int8, and that of `'debt'` to bool.*
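As an alternative to naming the target type by hand, pandas can pick the smallest suitable integer type itself. The sketch below uses `pd.to_numeric` with `downcast='integer'` on an illustrative series (the values are hypothetical); it is shown only as an option, not as the approach taken above.

```python
import pandas as pd

# Hypothetical column stored as int64
children = pd.Series([1, 0, 3, -1, 20], dtype='int64')

# Let pandas choose the smallest signed integer type that holds all values
children_small = pd.to_numeric(children, downcast='integer')
print(children.dtype, '->', children_small.dtype)
```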

### 2.3 Processing duplicates
<a id='section2.3'></a>

In [14]:
#The values of 'education' differ only in letter case (this was discovered in step 1).
#Let's convert them to lowercase. 
credit_scores_data['education'] = credit_scores_data.education.str.lower() 

_Values in the `'education'` column that differed only in letter case were converted to lowercase._

In [15]:
#Let's see if there are any completely duplicated lines in the dataset
credit_scores_data[credit_scores_data.duplicated()].head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
2849,0,,41,secondary education,1,married,0,F,employee,False,22815.1035,purchase of the house for my family
3290,0,,58,secondary education,1,civil partnership,1,F,retiree,False,18962.318,to have a wedding
4182,1,,34,bachelor's degree,0,civil partnership,1,F,employee,False,22815.1035,wedding ceremony
4851,0,,60,secondary education,1,civil partnership,1,F,retiree,False,18962.318,wedding ceremony
5557,0,,58,secondary education,1,civil partnership,1,F,retiree,False,18962.318,to have a wedding


_The dataset contains some completely duplicated lines._

Let's look at them in more detail.

In [16]:
print(BOLD + 'Number of duplicated lines:' + END, credit_scores_data.duplicated().sum())
print()
print(BOLD + 'The values that take numeric values in duplicated lines:\n' + END)
print(credit_scores_data[credit_scores_data.duplicated()][['children', 'dob_years', 'total_income']]\
    .agg(['min', 'max', 'mean', 'median', positive_min]).T)
print()
print(BOLD + 'The values that take categorical values in duplicated lines:\n' + END)
print(credit_scores_data[credit_scores_data.duplicated()][categorical_cols].agg(['unique']).T)

Number of duplicated lines: 71

The values that take numeric values in duplicated lines:

                    min        max          mean      median  positive_min
children      0.000      2.000      0.225352      0.0000      1.000       
dob_years     23.000     71.000     49.985915     54.0000     23.000      
total_income  18962.318  27577.272  21854.865514  22815.1035  18962.318   

The values that take categorical values in duplicated lines:


_There are 71 completely duplicated lines in the dataset. The values in these lines appear normal._

Let's drop duplicates:

In [17]:
#removing completely duplicated lines
credit_scores_data.drop_duplicates(inplace=True)

#### Conclusions

The `'education'` column contained the same values in different letter cases. Values in this column were converted to lowercase. 
After that, it made sense to drop completely duplicated lines (before the conversion, the drop_duplicates method would not have treated lines differing only in letter case as duplicates).
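The effect of the order of these two steps can be illustrated on a tiny example. This is a minimal sketch with a hypothetical `sample` frame: the rows differ only in the letter case of `'education'`, so they count as duplicates only after lowercasing.

```python
import pandas as pd

# Hypothetical rows that are identical except for letter case
sample = pd.DataFrame({
    'education': ['secondary education', 'Secondary Education', 'SECONDARY EDUCATION'],
    'dob_years': [58, 58, 58],
})

# Number of duplicated rows before normalizing letter case
duplicates_before = int(sample.duplicated().sum())

# Number of duplicated rows after converting to lowercase
sample['education'] = sample['education'].str.lower()
duplicates_after = int(sample.duplicated().sum())

print(duplicates_before, duplicates_after)
```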

## 3. Categorizing Data
<a id='section3'></a>

In [18]:
#Let's take excess columns into separate dictionaries
print(BOLD + 'The main dataset without excessed columns (first 5 lines):\n' + END)
credit_scores_main = credit_scores_data[['children', 'dob_years', 'education_id',
       'family_status_id', 'gender', 'income_type', 'debt',
       'total_income', 'purpose']].copy() 
print(credit_scores_main.head())  
print()
print(BOLD + 'Dictionary for education types:' + END)
education_dictionary = credit_scores_data[['education_id', 'education']]
education_dictionary = education_dictionary.drop_duplicates().reset_index(drop=True) 
print(education_dictionary)
print()
print(BOLD + 'Dictionary for family status:' + END)
family_status_dictionary = credit_scores_data[['family_status_id', 'family_status']]
family_status_dictionary = family_status_dictionary.drop_duplicates().reset_index(drop=True) 
print(family_status_dictionary)

The main dataset without excessed columns (first 5 lines):

   children  dob_years  education_id  family_status_id gender income_type  \
0  1         42         0             0                 F      employee     
1  1         36         1             0                 F      employee     
2  0         33         1             0                 M      employee     
3  3         32         1             0                 M      employee     
4  0         53         1             1                 F      retiree      

    debt  total_income                  purpose  
0  False  40620.102     purchase of the house    
1  False  17932.802     car purchase             
2  False  23341.752     purchase of the house    
3  False  42820.568     supplementary education  
4  False  25378.572     to have a wedding        

Dictionary for education types:
   education_id            education
0  0             bachelor's degree  
1  1             secondary education
2  2             

_Now the main table does not contain excess information._

_In addition, the categories of marital status already exist, so it will be possible to check whether debt differs between them. Let's create similar categories for having children, income levels and loan purposes._

In [19]:
#looking at the number of customers with different numbers of kids
credit_scores_main.groupby('children').children.agg('count')

children
-1     47   
 0     14091
 1     4808 
 2     2052 
 3     330  
 4     41   
 5     9    
 20    76   
Name: children, dtype: int64

_There are two incorrect values for the number of kids and a very small number of customers with 4 and 5 kids. Let's combine two odd values into the 'Unknown' category and also combine 3 or more children together. To do this, create a dictionary with these categories and add a column to the table with their numeric designation._

In [20]:
#creating dict to make it easier to add code values to the table
kids_dict = {'no kids': 0,
             '1 kid': 1,
             '2 kids': 2,
             '3 kids and more': 3,
             'Unknown': 4}

#function to calculate a corresponding category ID by the number of children
def having_kids_id(children):
    if children == 0:
        return kids_dict['no kids']
    if children == 1:
        return kids_dict['1 kid']
    if children == 2:
        return kids_dict['2 kids']
    if children >= 3 and children <= 5: 
        return kids_dict['3 kids and more']
    return kids_dict['Unknown']

#creating a dictionary with names of having kids categories and their numeric designation
having_kids_dictionary = pd.DataFrame(list(zip(kids_dict.values(), kids_dict.keys())),\
                       columns = ['id', 'having_kids']) 
print(BOLD + 'Dictionary for having kids categories:' + END)
print(having_kids_dictionary)
print()
#adding a column with category IDs for having kids
credit_scores_main['having_kids_id'] = credit_scores_main.children.apply(having_kids_id) 

#looking at the number of customers in each having kids category
print(BOLD + 'The number of customers with different having kids category:' + END)
print(credit_scores_main.merge(having_kids_dictionary, left_on='having_kids_id', right_on='id').having_kids.value_counts())

Dictionary for having kids categories:
   id      having_kids
0  0   no kids        
1  1   1 kid          
2  2   2 kids         
3  3   3 kids and more
4  4   Unknown        

The number of customers with different having kids category:
no kids            14091
1 kid              4808 
2 kids             2052 
3 kids and more    380  
Unknown            123  
Name: having_kids, dtype: int64


_The having-kids categories now each contain a sufficient number of customers, so the debt rate can be compared across them. Note that we can easily exclude the Unknown category in further analysis._ 
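
The ID-to-name lookup above can also be done without a merge. The following is a minimal sketch, assuming a small toy `children` Series (hypothetical values, not the bank data) and the same category names: `Series.map` with an inverted dictionary turns the IDs back into labels.

```python
import pandas as pd

# Same category names and IDs as kids_dict above
kids_dict = {'no kids': 0, '1 kid': 1, '2 kids': 2,
             '3 kids and more': 3, 'Unknown': 4}

def having_kids_id(children):
    """Return the category ID for a given number of children."""
    if children in (0, 1, 2):
        return children            # IDs 0, 1 and 2 coincide with the counts
    if 3 <= children <= 5:
        return kids_dict['3 kids and more']
    return kids_dict['Unknown']    # e.g. -1 or 20

# Hypothetical toy data for illustration only
children = pd.Series([0, 1, 2, 3, 5, -1, 20])

# Invert the dictionary (ID -> name) and map IDs to labels without a merge
id_to_name = {category_id: name for name, category_id in kids_dict.items()}
having_kids = children.apply(having_kids_id).map(id_to_name)
print(having_kids.value_counts())
```

Whether this is preferable to the merge is a matter of taste: the merge keeps a reusable lookup table, while `map` avoids creating an extra `id` column.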

In [21]:
#Let's look at the descriptive statistics for the total income to choose appropriate categories for it
credit_scores_data.total_income.describe()

count    21454.000000 
mean     26448.553705 
std      15705.754178 
min      3306.762000  
25%      17219.817250 
50%      22815.103500 
75%      31330.237250 
max      362496.645000
Name: total_income, dtype: float64

_To divide the total income into income levels, we can use the three quartiles (see 25%, 50% and 75%)._ 

_The mean is greater than the median (see 50%), which suggests that the income distribution is right-skewed, with some very large values. Note that the maximum income (see max) is an order of magnitude larger than the third quartile (see 75%). We can also look at the 1% of customers with the highest income separately, if there are enough observations._

Let's create a dictionary with appropriate categories of income and add a column to the table with their numeric designation.

In [22]:
#creating dict to make it easier to add code values to the table
income_level_dict = {'low': 0,
                     'lower middle': 1, 
                     'upper middle': 2, 
                     'high': 3,
                     'very high': 4}

#creating global variables for Q1, median, Q3 and the 99% quantile
#so that they are not recalculated each time the function is called
TOTAL_INCOME_Q1 = credit_scores_main.total_income.quantile(0.25)
TOTAL_INCOME_MEDIAN = credit_scores_main.total_income.median()
TOTAL_INCOME_Q3 = credit_scores_main.total_income.quantile(0.75)
TOTAL_INCOME_QUANTILE99 = credit_scores_main.total_income.quantile(0.99)

#function to calculate a corresponding category ID by the total income
def income_level_id(total_income):
    if total_income <= TOTAL_INCOME_Q1:
        return income_level_dict['low']
    elif total_income <= TOTAL_INCOME_MEDIAN:
        return income_level_dict['lower middle']
    elif total_income <= TOTAL_INCOME_Q3:
        return income_level_dict['upper middle']
    elif total_income <= TOTAL_INCOME_QUANTILE99:
        return income_level_dict['high']
    else:
        return income_level_dict['very high']

#creating a dictionary with names of income level categories and their numeric designation
income_level_dictionary = pd.DataFrame(list(zip(income_level_dict.values(), income_level_dict.keys())),\
                       columns = ['id', 'income_level']) 
print(BOLD + 'Dictionary for income level categories:' + END)
print(income_level_dictionary)
print()
#adding a column with category IDs for income level
credit_scores_main['income_level_id'] = credit_scores_main.total_income.apply(income_level_id) 
#looking at the number of customers in each income level category
print(BOLD + 'The number of customers with different income level:' + END)
print(credit_scores_main.merge(income_level_dictionary, left_on='income_level_id', right_on='id')\
      .income_level.value_counts())

Dictionary for income level categories:
   id  income_level
0  0   low         
1  1   lower middle
2  2   upper middle
3  3   high        
4  4   very high   

The number of customers with different income level:
lower middle    5480
low             5364
upper middle    5246
high            5149
very high       215 
Name: income_level, dtype: int64


_The income level categories each contain a sufficient number of customers, so the debt rate can be compared across them._
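
The same quantile-based binning can be sketched with `pd.cut` instead of a row-wise function. This is only an illustration under stated assumptions: the incomes below are hypothetical toy values, and the variable names `total_income`, `edges` and `labels` are introduced here, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy incomes for illustration only (not the bank data)
total_income = pd.Series([5_000, 12_000, 18_000, 21_000, 25_000,
                          30_000, 34_000, 40_000, 90_000, 350_000], dtype=float)

labels = ['low', 'lower middle', 'upper middle', 'high', 'very high']

# Bin edges: Q1, median, Q3 and the 99% quantile, with open-ended outer bins
edges = [-np.inf,
         total_income.quantile(0.25),
         total_income.median(),
         total_income.quantile(0.75),
         total_income.quantile(0.99),
         np.inf]

# right=True (the default) makes each bin include its upper edge,
# matching the `<=` comparisons in income_level_id
income_level = pd.cut(total_income, bins=edges, labels=labels)

# The categorical codes 0..4 follow the order of `labels`
income_level_codes = income_level.cat.codes
print(income_level.value_counts().reindex(labels))
```

Because the labels are listed in the same order as the IDs in `income_level_dict`, the categorical codes coincide with those IDs in this sketch.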

In [23]:
"""Among the large number of specified loan purposes, in fact, there are only several categories:
    * wedding
    * education
    * car 
    * real estate
   Let's pick out these categories
"""
#importing library for working with stem of words
from nltk.stem import SnowballStemmer  
english_stemmer = SnowballStemmer('english')

#creating dict to make it easier to add code values to the table
purpose_type_dict = {'wedding': 0,
                     'education': 1, 
                     'car': 2, 
                     'real estate': 3}

#stems of the keywords that identify each purpose type, in the order they are checked
purpose_keyword_stems = [(english_stemmer.stem('house'), 'real estate'),
                         (english_stemmer.stem('property'), 'real estate'),
                         (english_stemmer.stem('estate'), 'real estate'),
                         (english_stemmer.stem('car'), 'car'),
                         (english_stemmer.stem('education'), 'education'),
                         (english_stemmer.stem('university'), 'education'),
                         (english_stemmer.stem('wedding'), 'wedding')]

#function to calculate a corresponding category ID by the purpose
def purpose_type_id(purpose):
    #stemming the words of the purpose once instead of before every check
    purpose_stems = set(map(english_stemmer.stem, purpose.split(" ")))
    for keyword_stem, purpose_type in purpose_keyword_stems:
        if keyword_stem in purpose_stems:
            return purpose_type_dict[purpose_type]
    #no keyword matched
    return 'Unknown'

#creating a dictionary with names of purpose type categories and their numeric designation
purpose_type_dictionary = pd.DataFrame(list(zip(purpose_type_dict.values(), purpose_type_dict.keys())),\
                       columns = ['id', 'purpose_type']) 
print(BOLD + 'Dictionary for purpose type categories:' + END)
print(purpose_type_dictionary)
print()
#adding a column with category IDs for purpose type
credit_scores_main['purpose_type_id'] = credit_scores_main.purpose.apply(purpose_type_id) 
print(BOLD + 'Mapping purpose and purpose type:' + END)
print(credit_scores_main.merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
          [['purpose', 'purpose_type']].drop_duplicates(subset=['purpose', 'purpose_type'])\
                .sort_values('purpose_type'))
print()
#looking at the number of customers with different purpose type
print(BOLD + 'The number of customers with different purpose type:' + END)
print(credit_scores_main.merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
      .purpose_type.value_counts())

Dictionary for purpose type categories:
   id purpose_type
0  0   wedding    
1  1   education  
2  2   car        
3  3   real estate

Mapping purpose and purpose type:
                                        purpose purpose_type
10813  buying a second-hand car                  car        
10817  cars                                      car        
10819  second-hand car purchase                  car        
10823  car                                       car        
10829  to own a car                              car        
10830  purchase of a car                         car        
10840  to buy a car                              car        
10814  buying my own car                         car        
10811  car purchase                              car        
15117  supplementary education                   education  
15122  getting an education                      education  
15124  to get a supplementary education          education  
15128  getting higher

_A few typical categories of loan purpose have been distinguished, so the debt rate can now be compared across them._
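
If NLTK is not available, the keyword matching can be approximated with plain prefix checks. The sketch below is a stand-in for the stemmer, not an equivalent: the prefixes in `PURPOSE_PREFIXES` are hand-picked assumptions, and the function name `purpose_type` is hypothetical.

```python
# Stem-like prefixes per category, checked in the same order as purpose_type_id.
# The prefixes are an assumption for illustration, not the output of a stemmer.
PURPOSE_PREFIXES = [
    ('real estate', ('hous', 'propert', 'estate')),
    ('car', ('car',)),
    ('education', ('educ', 'universit')),
    ('wedding', ('wedding',)),
]

def purpose_type(purpose):
    """Return the first category that has a word starting with one of its prefixes."""
    words = purpose.lower().split()
    for category, prefixes in PURPOSE_PREFIXES:
        # str.startswith accepts a tuple of prefixes
        if any(word.startswith(prefixes) for word in words):
            return category
    return 'Unknown'

for purpose in ['buying a second-hand car', 'cars', 'getting an education']:
    print(purpose, '->', purpose_type(purpose))
```

Prefix matching is cruder than stemming: the prefix `car` would also match unrelated words such as "career", so the resulting mapping should be inspected just as the stemmer-based mapping is above.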

### Conclusion

To study how having kids, marital status, income level and loan purpose relate to repaying a loan on time, corresponding categories were created, each with a sufficient number of observations.

## 4. Study of the Impact of the Investigated Features on Debt Repayment
<a id='section4'></a>

In [24]:
#Let's calculate the share of customers who did not repay their debt among all customers
print('On average, the debt default rate is {:.1%}'.format(credit_scores_data.debt.mean()))

On average, the debt default rate is 8.1%


_To compare debt across categories, we can look at the mean of `'debt'` in each category. It is the ratio of customers who did not repay their debt to all customers in that category._ 

It is easier to read this value as a percentage, so let's create a custom function for aggregation:

In [25]:
def debt_default_rate(is_debt_series):
    return str(round(is_debt_series.mean()*100,1))+'%'
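
Returning a string from the aggregation makes the table easy to read, but string rates cannot be sorted or compared numerically. A possible alternative, sketched here on hypothetical toy data (the `toy` frame and its `category` column are assumptions for illustration), is to keep the rate numeric and format it only for display.

```python
import pandas as pd

# Hypothetical toy data: 1 = customer had a debt, 0 = repaid on time
toy = pd.DataFrame({
    'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'debt':     [1,   0,   0,   0,   1,   1,   0,   0,   0],
})

# The mean of a 0/1 column is the share of customers with debt
default_rate = toy.groupby('category')['debt'].mean()

# Numeric rates can still be sorted or compared ...
print(default_rate.sort_values(ascending=False))

# ... and are formatted as percentages only for display
formatted_rate = default_rate.map('{:.1%}'.format)
print(formatted_rate)
```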

- __Is there a relation between having kids and repaying a loan on time?__

In [26]:
#filtering out customers with unknown kids data 
filtered_having_kids_dictionary = having_kids_dictionary[having_kids_dictionary.having_kids!='Unknown']

#calculating the rate of customers who did not repay their debt by having kids categories 
credit_scores_main.merge(filtered_having_kids_dictionary, left_on='having_kids_id', right_on='id')\
    .pivot_table(index=['id', 'having_kids'], values='debt', aggfunc=debt_default_rate)

id,having_kids,debt
0,no kids,7.5%
1,1 kid,9.2%
2,2 kids,9.5%
3,3 kids and more,8.2%


_The default rate among customers without children (7.5%) is slightly lower than among customers with 1 or 2 children (9.2% and 9.5%). But it is necessary to check the statistical significance of these differences._
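
One common way to check whether two default rates differ significantly is a two-proportion z-test. The sketch below is not part of the original analysis: the function name `two_proportion_z_test` is hypothetical, and the counts in the example call are made up for illustration rather than taken from the bank data.

```python
import math

def two_proportion_z_test(defaults_a, n_a, defaults_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one default rate."""
    rate_a = defaults_a / n_a
    rate_b = defaults_b / n_b
    # Pooled rate under the null hypothesis of equal proportions
    pooled = (defaults_a + defaults_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (not taken from the bank data)
z, p_value = two_proportion_z_test(defaults_a=75, n_a=1000,
                                   defaults_b=95, n_b=1000)
print('z = {:.2f}, p-value = {:.3f}'.format(z, p_value))
```

The test relies on the normal approximation, so it is only reasonable when each group has a fair number of both defaulted and non-defaulted customers. With several categories compared at once, a correction for multiple comparisons would also be worth considering.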

- __Is there a relation between marital status and repaying a loan on time?__

In [27]:
#calculating the rate of customers who did not repay their debt by marital status categories 
credit_scores_main.merge(family_status_dictionary, on='family_status_id')\
    .pivot_table(index=['family_status'], values='debt', aggfunc=debt_default_rate)

family_status,debt
civil partnership,9.3%
divorced,7.1%
married,7.5%
unmarried,9.8%
widow / widower,6.6%


_The default rate among widows/widowers (6.6%) is a little lower than among all of the bank's customers, while the rate among unmarried customers (9.8%) is a little higher. It is still necessary to check the statistical significance of these differences._

- __Is there a relation between income level and repaying a loan on time?__

In [28]:
#calculating the rate of customers who did not repay their debt by income level 
credit_scores_main.merge(income_level_dictionary, left_on='income_level_id', right_on='id')\
    .pivot_table(index=['id', 'income_level'], values='debt', aggfunc=debt_default_rate)

id,income_level,debt
0,low,8.0%
1,lower middle,8.8%
2,upper middle,8.5%
3,high,7.2%
4,very high,6.5%


_Only customers with a very high income have a default rate (6.5%) that is noticeably lower than among all customers; customers with a high income (7.2%) are also somewhat below the other groups. The statistical significance of these differences needs to be checked as well._

- __How do different loan purposes affect on-time repayment of the loan?__

In [29]:
#calculating the rate of customers who did not repay their debt by purpose type 
credit_scores_main.merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
    .pivot_table(index=['id', 'purpose_type'], values='debt', aggfunc=debt_default_rate)

id,purpose_type,debt
0,wedding,8.0%
1,education,9.2%
2,car,9.4%
3,real estate,7.2%


_The default rate for real estate loans (7.2%) is slightly lower than for car (9.4%) or education (9.2%) loans. But it is necessary to check the statistical significance of these differences._

A loan purpose seems to be an important factor when making a loan decision. It is interesting to see its impact on default together with other factors.

- __Loan purpose and income type__

In [30]:
#looking at number of observation in each group
credit_scores_main.merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
    .pivot_table(index='purpose_type', columns = 'income_type', values='debt', aggfunc='count') 

purpose_type \ income_type,business,civil servant,employee,entrepreneur,paternity / maternity leave,retiree,student,unemployed
car,1052.0,286.0,2172.0,,1.0,795.0,,
education,954.0,258.0,2080.0,,,721.0,,
real estate,2547.0,754.0,5615.0,1.0,,1891.0,1.0,2.0
wedding,525.0,159.0,1217.0,1.0,,422.0,,


_There are enough observations only for the following types of income: business, civil servant, employee, retiree. Let's look at debt default rate only for them._

In [31]:
#looking at the rate of customers who did not repay their debt in each group
credit_scores_main[credit_scores_main.income_type.isin(['business', 'civil servant', 'employee', 'retiree'])]\
    .merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
        .pivot_table(index=['id', 'purpose_type'], columns = 'income_type', values='debt', aggfunc=debt_default_rate)

id,purpose_type \ income_type,business,civil servant,employee,retiree
0,wedding,9.9%,4.4%,8.4%,5.9%
1,education,7.5%,8.1%,11.0%,6.7%
2,car,8.1%,7.7%,11.2%,6.4%
3,real estate,6.6%,4.8%,8.7%,4.9%


_Interestingly, among civil servants the default rate for wedding and real estate loans is quite low: 4.4% and 4.8% respectively. Retirees also have a low default rate for real estate loans (4.9%) and a relatively low one for wedding loans (5.9%). For employees the default rate for car and education loans is relatively high: 11.2% and 11.0% respectively._
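
Instead of listing the income types with enough observations by hand, small groups can be masked programmatically. This is a minimal sketch on hypothetical toy data; the threshold `MIN_OBSERVATIONS` is an assumed value that would need to be chosen for the real data.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'purpose_type': ['car'] * 5 + ['wedding'] * 5,
    'income_type':  ['employee'] * 4 + ['student'] + ['employee'] * 5,
    'debt':         [1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
})

MIN_OBSERVATIONS = 3  # assumed threshold; choose it to suit the data

# One pivot table with both the group size and the default rate
stats = toy.pivot_table(index='purpose_type', columns='income_type',
                        values='debt', aggfunc=['count', 'mean'])

# Keep rates only where the group is large enough; the rest become NaN
reliable_rate = stats['mean'].where(stats['count'] >= MIN_OBSERVATIONS)
print(reliable_rate)
```

In this toy example the single student with a car loan has a 100% default rate, which is exactly the kind of misleading figure the mask is meant to hide.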

- __Loan purpose and having kids__

In [32]:
#looking at number of observation in each group
credit_scores_main.merge(filtered_having_kids_dictionary, left_on='having_kids_id', right_on='id')\
    .merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
        .pivot_table(index='purpose_type', columns = 'having_kids', values='debt', aggfunc='count') 

purpose_type \ having_kids,1 kid,2 kids,3 kids and more,no kids
car,963,399,72,2845
education,866,403,77,2642
real estate,2447,1035,195,7074
wedding,532,215,36,1530


There are enough observations, so let's look at all groups.

In [33]:
#looking at the rate of customers who did not repay their debt in each group
credit_scores_main.merge(filtered_having_kids_dictionary, left_on='having_kids_id', right_on='id')\
    .merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
        .pivot_table(index='purpose_type', columns = 'having_kids', values='debt', aggfunc=debt_default_rate) 

purpose_type \ having_kids,1 kid,2 kids,3 kids and more,no kids
car,10.7%,12.0%,8.3%,8.5%
education,10.4%,11.4%,5.2%,8.7%
real estate,8.2%,8.5%,8.2%,6.7%
wedding,9.6%,5.6%,13.9%,7.5%


_Curiously, among customers with 3 or more kids the default rate for education loans is quite low (5.2%), although repaying might be expected to be harder for them. The default rate is also low among customers with 2 kids for wedding loans (5.6%); apparently people with two children take it quite seriously. For customers with 3 or more kids, by contrast, getting married seems to be a pretty daunting challenge: their default rate is quite high (13.9%). But it is still necessary to check the statistical significance, since the number of customers with 3+ kids is not large._
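
For a group this small, a confidence interval shows how uncertain the rate is. The sketch below computes a Wilson score interval; it is an illustration rather than part of the original analysis, the function name `wilson_interval` is hypothetical, and the number of defaults (5) is inferred from the rounded 13.9% of 36 customers, so it should be treated as an assumption.

```python
import math

def wilson_interval(defaults, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    rate = defaults / n
    denominator = 1 + z ** 2 / n
    center = (rate + z ** 2 / (2 * n)) / denominator
    half_width = (z / denominator) * math.sqrt(
        rate * (1 - rate) / n + z ** 2 / (4 * n ** 2))
    return center - half_width, center + half_width

# 36 wedding loans among customers with 3+ kids; 5 defaults is inferred
# from the rounded 13.9% above (5 / 36 = 13.9%), so treat it as an assumption
low, high = wilson_interval(defaults=5, n=36)
print('default rate between {:.1%} and {:.1%}'.format(low, high))
```

With so few observations the interval is wide, which is why a high or low rate in such a group should be read with caution.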

- __Loan purpose and marital status__

In [34]:
#looking at number of observation in each group
credit_scores_main.merge(family_status_dictionary, on='family_status_id')\
    .merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
        .pivot_table(index='purpose_type', columns = 'family_status', values='debt', aggfunc='count') 

purpose_type \ family_status,civil partnership,divorced,married,unmarried,widow / widower
car,434.0,281.0,2736.0,637.0,218.0
education,404.0,238.0,2595.0,577.0,199.0
real estate,989.0,676.0,7008.0,1596.0,542.0
wedding,2324.0,,,,


_The only groups without observations are wedding loans among divorced, married, unmarried and widowed customers, which is not surprising. For the other groups there are enough observations. Let's look at the debt default rate for them._

In [35]:
#looking at the rate of customers who did not repay their debt in each group
credit_scores_main.merge(family_status_dictionary, on='family_status_id')\
    .merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
        .pivot_table(index='purpose_type', columns = 'family_status', values='debt', aggfunc=debt_default_rate) 

purpose_type \ family_status,civil partnership,divorced,married,unmarried,widow / widower
car,11.8%,7.5%,8.4%,12.9%,9.2%
education,14.9%,7.1%,8.3%,10.7%,7.5%
real estate,9.2%,7.0%,6.9%,8.1%,5.2%
wedding,8.0%,,,,


_The default rate among widows/widowers is quite low for real estate loans (5.2%), especially compared with their rate for car loans (9.2%). The default rate among customers in a civil partnership, however, is high for education loans (14.9%)._

- __Loan purpose and education__

In [36]:
#looking at number of observation in each group
credit_scores_main.merge(education_dictionary, on='education_id')\
    .merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
            .pivot_table(index='purpose_type', columns = 'education', values='debt', aggfunc='count') 

purpose_type \ education,bachelor's degree,graduate degree,primary education,secondary education,some college
car,1036.0,,55.0,3062.0,153.0
education,934.0,1.0,55.0,2892.0,131.0
real estate,2680.0,5.0,144.0,7602.0,380.0
wedding,600.0,,28.0,1616.0,80.0


_There are not enough observations for graduate degree._

Let's look at debt default rate for other groups.

In [37]:
#adding to main table names of education categories
credit_scores_main_with_educ = credit_scores_main.merge(education_dictionary, on='education_id')
#looking at the rate of customers who did not repay their debt in each group except 'graduate degree'
credit_scores_main_with_educ[~credit_scores_main_with_educ.education.isin(['graduate degree'])]\
    .merge(purpose_type_dictionary, left_on='purpose_type_id', right_on='id')\
        .pivot_table(index='purpose_type', columns = 'education', values='debt', aggfunc=debt_default_rate) 

purpose_type \ education,bachelor's degree,primary education,secondary education,some college
car,6.5%,10.9%,10.3%,10.5%
education,5.9%,9.1%,10.3%,9.2%
real estate,4.8%,11.1%,7.9%,8.9%
wedding,4.5%,14.3%,9.2%,7.5%


_Customers with bachelor's degrees generally repay a loan better. The rate of defaulted customers with only primary education is high for wedding purpose._ 

### Conclusions

The relationship between having kids, marital status, income level and loan purpose on the one hand and repaying a loan on time on the other was examined using the categories created for these features. The debt default rate differs between groups, so there seems to be some connection between the considered features and repaying a loan on time. For most income levels, however, the rate is practically the same, and only customers with a very high income have a slightly lower rate. Interesting differences also appear when customers are grouped by two features at once. But it is necessary to check the statistical significance of all the differences found.

## 5. Overall Conclusion
<a id='section5'></a>

There are credit scoring data from the bank's loan division, used to assess the impact of various customer characteristics on loan defaults. 

Before proceeding to the analysis, the data was preprocessed. 

A general review of the data revealed:
* columns with some odd and unknown values and one column with incorrect data
* columns with inappropriate data types
* columns with null values
* columns with case-sensitive duplicated values
* completely duplicated rows

It makes sense to contact the developer who sent the source file about the column with incorrect values, and also to clarify other minor inaccuracies. 

There were missing values in the total income. They made up a fairly large portion of the data. Total income depends on the type of employment the bank's customer is engaged in, so the missing values in total income were filled in according to the income type.
The inappropriate data types were converted to suitable ones. The case-sensitive duplicated values were converted to lowercase. Then the completely duplicated rows were dropped. 

To study how having kids, marital status, income level and loan purpose relate to repaying a loan on time, corresponding categories were created, each with a sufficient number of observations. 

Further, the debt default rate was considered for each group within the categories. The following differences have been identified:
* customers without children have a slightly lower default-rate than customers with 1 or 2 children
* widows/widowers have a slightly lower default-rate than the bank's customers on average
* unmarried customers have a slightly higher default-rate than all customers on average
* customers with a very high income (the 1% of customers with the highest income) have a slightly lower default-rate than all customers on average
* customers with a real estate loan purpose have a slightly lower default-rate than customers with car or education loan purposes

The joint impact of loan purpose and one additional feature on repayment was also considered. The following interesting differences were found in the respective groups:
* civil servants have a low default-rate for wedding and real estate purposes
* retirees have a low default-rate for real estate purposes and _wedding_
* employees have a relatively high default-rate for a car or education
* customers with 3 kids and more have a low default-rate for _education_ purposes
* customers with 2 kids have a low default-rate for _wedding_ purpose
* customers with 3 kids and more have a high default-rate for _wedding_ purpose
* customers with bachelor's degrees generally repay a loan better
* customers with only primary education have a high default-rate for wedding purpose

It seems that there is some relation between the considered features and repaying a loan on time. But it is necessary to check the statistical significance of all the differences found.