# Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit scoring** of a potential customer. **Credit scoring** is used to evaluate the ability of a potential borrower to repay their loan.

## Open the data file and have a look at the general information. 

In [1]:
import pandas as pd
report = pd.read_csv('/datasets/credit_scoring_eng.csv')
report.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.42261,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
5,0,-926.185831,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.17,purchase of the house
6,0,-2879.202052,43,bachelor's degree,0,married,0,F,business,0,38484.156,housing transactions
7,0,-152.779569,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.829,education
8,2,-6929.865299,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.093,having a wedding
9,0,-2188.756445,41,secondary education,1,married,0,M,employee,0,23108.15,purchase of the house for my family


In [2]:
report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [3]:
report.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


### Conclusion

The table has 12 columns and 21,525 rows. Two columns are float64, five are int64 and five are object. At first glance the column names look fine, with no anomalies.
A few problems stand out. The columns 'days_employed' and 'total_income' have missing values, and since the number of missing values is the same in both, I need to check whether they are connected. There are also illogical values in 'days_employed' (negative values and implausibly large ones) and in 'children' (a minimum of -1 and a maximum of 20, which needs checking), and the 'education' column mixes upper and lower case.
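Equal counts of missing values do not prove that the gaps sit in the same rows. A minimal sketch of a direct check, using a small toy frame with made-up values in place of the real file:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) that mimics the pattern seen in the real data.
toy = pd.DataFrame({
    'days_employed': [-272.98, np.nan, -597.27, np.nan],
    'total_income': [20522.515, np.nan, 13912.788, np.nan],
})

missing_days = toy['days_employed'].isna()
missing_income = toy['total_income'].isna()

# True only if each row is missing in both columns or in neither.
same_rows = bool((missing_days == missing_income).all())
print(same_rows)
```

The same comparison should work on `report` itself, as long as it is run before the missing values are filled.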

## Data preprocessing

### Processing missing values

In [4]:
report.isnull().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

In [5]:
report_unemployed = report[['dob_years', 'days_employed', 'total_income']]
report_unemployed[report_unemployed['dob_years'] == 21].head(10)

Unnamed: 0,dob_years,days_employed,total_income
23,21,-272.981385,20522.515
65,21,,
219,21,-597.273402,13912.788
317,21,,
606,21,-880.221113,23253.578
735,21,-1013.920085,9023.078
763,21,,
1057,21,,
1106,21,-702.043712,31550.592
1325,21,,


In [6]:
report['days_employed'] = report['days_employed'].fillna(0)
report['total_income'] = report['total_income'].fillna(0)
report[report['dob_years'] == 21].head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
23,0,-272.981385,21,bachelor's degree,0,civil partnership,1,M,employee,0,20522.515,to have a wedding
65,0,0.0,21,secondary education,1,unmarried,4,M,business,0,0.0,transactions with commercial real estate
219,0,-597.273402,21,some college,2,civil partnership,1,F,business,0,13912.788,transactions with commercial real estate
317,0,0.0,21,bachelor's degree,0,unmarried,4,M,employee,0,0.0,purchase of a car
606,20,-880.221113,21,secondary education,1,married,0,M,business,0,23253.578,purchase of the house
735,0,-1013.920085,21,secondary education,1,married,0,F,employee,0,9023.078,buying property for renting out
763,0,0.0,21,secondary education,1,civil partnership,1,M,business,0,0.0,purchase of the house
1057,0,0.0,21,SOME COLLEGE,2,unmarried,4,M,business,0,0.0,real estate transactions
1106,0,-702.043712,21,some college,2,civil partnership,1,F,employee,0,31550.592,having a wedding
1325,1,0.0,21,secondary education,1,civil partnership,1,F,employee,0,0.0,wedding ceremony


In [7]:
report.isnull().sum()

children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64

### Conclusion

The columns 'days_employed' and 'total_income' have the same number of missing values (2,174), which suggests they are connected. I filtered the three columns 'dob_years', 'days_employed' and 'total_income' and saw that the missing values fall in the same rows. My assumption is that these are people who have never worked and therefore have no income; in many cases they may be people who have just turned 21. Based on that assumption, I replaced the missing values with 0.
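Filling with 0 is one option. If the assumption that these clients have no income turns out to be wrong, a common alternative is to fill each gap with the median of a comparable group. The sketch below is not what this notebook does; it uses toy values and a hypothetical new column `total_income_filled` to show the idea with `income_type` as the grouping column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one missing income per income type.
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee',
                    'business', 'business', 'business'],
    'total_income': [10000.0, 20000.0, np.nan, 30000.0, 50000.0, np.nan],
})

# Median income within each income_type, repeated for every row of the group.
group_median = toy.groupby('income_type')['total_income'].transform('median')

# Fill gaps with the group's median; a new column keeps the original intact.
toy['total_income_filled'] = toy['total_income'].fillna(group_median)
print(toy)
```

Unlike zeros, group medians do not pull the overall mean income down.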

### Data type replacement

In [8]:
report['days_employed'] = report['days_employed'].astype('int')
report['total_income'] = report['total_income'].astype('int')

In [9]:
report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       21525 non-null int64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        21525 non-null int64
purpose             21525 non-null object
dtypes: int64(7), object(5)
memory usage: 2.0+ MB


In [10]:
report.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house
1,1,-4024,36,secondary education,1,married,0,F,employee,0,17932,car purchase
2,0,-5623,33,Secondary Education,1,married,0,M,employee,0,23341,purchase of the house
3,3,-4124,32,secondary education,1,married,0,M,employee,0,42820,supplementary education
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding
5,0,-926,27,bachelor's degree,0,civil partnership,1,M,business,0,40922,purchase of the house
6,0,-2879,43,bachelor's degree,0,married,0,F,business,0,38484,housing transactions
7,0,-152,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731,education
8,2,-6929,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337,having a wedding
9,0,-2188,41,secondary education,1,married,0,M,employee,0,23108,purchase of the house for my family


### Conclusion

The columns 'days_employed' and 'total_income' held float64 values. I converted them to int64, so the values are now convenient for arithmetic and comparison.
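Note that `astype('int')` does not round: it drops the fractional part, which is why -8437.673028 became -8437 in the table above. A small sketch with three values taken from the data shows the difference between truncating and rounding first:

```python
import pandas as pd

values = pd.Series([-8437.673028, 340266.072047, 40620.102])

truncated = values.astype('int')        # drops the fractional part (toward zero)
rounded = values.round().astype('int')  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

For this analysis the difference is at most one day or one unit of income, so either choice should be fine.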

### Processing duplicates

In [11]:
report['education'].value_counts()

secondary education    13750
bachelor's degree       4718
SECONDARY EDUCATION      772
Secondary Education      711
some college             668
BACHELOR'S DEGREE        274
Bachelor's Degree        268
primary education        250
Some College              47
SOME COLLEGE              29
PRIMARY EDUCATION         17
Primary Education         15
graduate degree            4
GRADUATE DEGREE            1
Graduate Degree            1
Name: education, dtype: int64

In [12]:
report['education'] = report['education'].str.lower()
report.head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437,42,bachelor's degree,0,married,0,F,employee,0,40620,purchase of the house
1,1,-4024,36,secondary education,1,married,0,F,employee,0,17932,car purchase
2,0,-5623,33,secondary education,1,married,0,M,employee,0,23341,purchase of the house
3,3,-4124,32,secondary education,1,married,0,M,employee,0,42820,supplementary education
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,to have a wedding
5,0,-926,27,bachelor's degree,0,civil partnership,1,M,business,0,40922,purchase of the house
6,0,-2879,43,bachelor's degree,0,married,0,F,business,0,38484,housing transactions
7,0,-152,50,secondary education,1,married,0,M,employee,0,21731,education
8,2,-6929,35,bachelor's degree,0,civil partnership,1,F,employee,0,15337,having a wedding
9,0,-2188,41,secondary education,1,married,0,M,employee,0,23108,purchase of the house for my family


In [13]:
report['education'].value_counts()

secondary education    15233
bachelor's degree       5260
some college             744
primary education        282
graduate degree            6
Name: education, dtype: int64

In [14]:
report['family_status'].value_counts()

married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64

In [15]:
report['income_type'].value_counts()

employee                       11119
business                        5085
retiree                         3856
civil servant                   1459
entrepreneur                       2
unemployed                         2
student                            1
paternity / maternity leave        1
Name: income_type, dtype: int64

In [16]:
report['gender'].value_counts()

F      14236
M       7288
XNA        1
Name: gender, dtype: int64

In [17]:
report[report['gender'] == 'XNA']

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
10701,0,-2358,24,some college,2,civil partnership,1,XNA,business,0,32624,buy real estate


In [18]:
report['gender'] = report['gender'].replace('XNA', 'F')
report.loc[10701, 'gender']

'F'

In [19]:
report['gender'].value_counts()

F    14237
M     7288
Name: gender, dtype: int64

In [20]:
report['purpose'].value_counts()

wedding ceremony                            797
having a wedding                            777
to have a wedding                           774
real estate transactions                    676
buy commercial real estate                  664
buying property for renting out             653
housing transactions                        653
transactions with commercial real estate    651
purchase of the house                       647
housing                                     647
purchase of the house for my family         641
construction of own property                635
property                                    634
transactions with my real estate            630
building a real estate                      626
buy real estate                             624
purchase of my own house                    620
building a property                         620
housing renovation                          612
buy residential real estate                 607
buying my own car                       

In [21]:
report['children'].value_counts()

 0     14149
 1      4818
 2      2055
 3       330
 20       76
-1        47
 4        41
 5         9
Name: children, dtype: int64

In [22]:
report[report['children'] == 20].head(10)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
606,20,-880,21,secondary education,1,married,0,M,business,0,23253,purchase of the house
720,20,-855,44,secondary education,1,married,0,F,business,0,18079,buy real estate
1074,20,-3310,56,secondary education,1,married,0,F,employee,1,36722,getting an education
2510,20,-2714,59,bachelor's degree,0,widow / widower,2,F,employee,0,42315,transactions with commercial real estate
2941,20,-2161,0,secondary education,1,married,0,F,employee,0,31958,to buy a car
3302,20,0,35,secondary education,1,unmarried,4,F,civil servant,0,0,profile education
3396,20,0,56,bachelor's degree,0,married,0,F,business,0,0,university education
3671,20,-913,23,secondary education,1,unmarried,4,F,employee,0,16200,buying a second-hand car
3697,20,-2907,40,secondary education,1,civil partnership,1,M,employee,0,18460,buying a second-hand car
3735,20,-805,26,bachelor's degree,0,unmarried,4,M,employee,0,21952,housing renovation


In [23]:
report['children'].median()

0.0

In [24]:
report['children'].mean()

0.5389082462253194

In [25]:
report['children'] = report['children'].replace(20, 2)
report['children'] = report['children'].replace(-1, 1)
report['children'].value_counts()

0    14149
1     4865
2     2131
3      330
4       41
5        9
Name: children, dtype: int64

In [26]:
report['dob_years'].value_counts()

35    617
40    609
41    607
34    603
38    598
42    597
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
66    183
22    183
67    167
21    111
0     101
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

In [27]:
report['dob_years'].mean()

43.29337979094077

In [28]:
report['dob_years'].median()

42.0

In [29]:
report['dob_years'] = report['dob_years'].replace(0, 42)

In [30]:
report['dob_years'].value_counts()

42    698
35    617
40    609
41    607
34    603
38    598
33    581
39    573
31    560
36    555
44    547
29    545
30    540
48    538
37    537
50    514
43    513
32    510
49    508
28    503
45    497
27    493
56    487
52    484
47    480
54    479
46    475
58    461
57    460
53    459
51    448
59    444
55    443
26    408
60    377
25    357
61    355
62    352
63    269
64    265
24    264
23    254
65    194
66    183
22    183
67    167
21    111
68     99
69     85
70     65
71     58
20     51
72     33
19     14
73      8
74      6
75      1
Name: dob_years, dtype: int64

### Conclusion

The 'education' column contained duplicate values that differed only in letter case. Converting the column to lowercase with str.lower() left five distinct values.

The 'gender' column had an interesting case: one row with the value 'XNA', that is, no gender specified. Removing the row would have lost its other data, so I replaced the value with the majority gender, 'F'. A single row will not noticeably influence the results.

The 'children' column contained the odd values 20 and -1. Having 20 children at the age of 21 is implausible, and -1 is not a valid count. The mean and median did not give a clear replacement either: the median is 0, which would distort the data if used in place of 20, and the mean is about 0.5, which is not a valid count. I therefore assumed these were typos for 2 and 1 and replaced 20 with 2 and -1 with 1.

While checking the 'children' column I noticed the value 0 in 'dob_years', which is impossible for a borrower. I calculated the mean and the median and used the median (42) as the replacement, because the mean is not an integer and I want to keep the column as integers.
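The cleaning steps described here can be sketched on a toy frame with made-up rows. One small variation, which is an assumption and not what the cells above do: the median age is computed over valid ages only, so the placeholder zeros do not influence it.

```python
import pandas as pd

# Toy rows (hypothetical) containing each kind of problem.
toy = pd.DataFrame({
    'education': ['Secondary Education', 'SECONDARY EDUCATION', "bachelor's degree"],
    'children': [20, -1, 0],
    'dob_years': [0, 36, 42],
})

# Case-only duplicates collapse once everything is lowercase.
toy['education'] = toy['education'].str.lower()

# Treat 20 and -1 as typos for 2 and 1 (an assumption, as in the text).
toy['children'] = toy['children'].replace({20: 2, -1: 1})

# Median over valid ages only, then use it in place of age 0.
valid_age_median = toy.loc[toy['dob_years'] > 0, 'dob_years'].median()
toy['dob_years'] = toy['dob_years'].replace(0, int(valid_age_median))

print(toy)
```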

### Categorizing Data

In [31]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


In [32]:
def lemma_words(purpose):
    # Split the purpose into tokens and reduce each token to its noun lemma.
    return [lemmatizer.lemmatize(word, pos='n') for word in nltk.word_tokenize(purpose)]

report['purpose_lemma'] = report['purpose'].apply(lemma_words)
report['purpose_lemma'].value_counts()

[car]                                            973
[wedding, ceremony]                              797
[having, a, wedding]                             777
[to, have, a, wedding]                           774
[real, estate, transaction]                      676
[buy, commercial, real, estate]                  664
[housing, transaction]                           653
[buying, property, for, renting, out]            653
[transaction, with, commercial, real, estate]    651
[purchase, of, the, house]                       647
[housing]                                        647
[purchase, of, the, house, for, my, family]      641
[construction, of, own, property]                635
[property]                                       634
[transaction, with, my, real, estate]            630
[building, a, real, estate]                      626
[buy, real, estate]                              624
[building, a, property]                          620
[purchase, of, my, own, house]                

In [33]:
def cat_purpose(row):
    estate = ['housing', 'house', 'real', 'estate', 'property']
    education = ['education', 'university', 'educated']
    car = ['car']
    wedding = ['wedding']
    
    for keyword in estate:
        if keyword in row:
            return 'estate'
    for keyword in education:
        if keyword in row:
            return 'education'
    for keyword in car:
        if keyword in row:
            return 'car'
    for keyword in wedding:
        if keyword in row:
            return 'wedding'
        
    return 'other'   

report['purpose'] = report['purpose_lemma'].apply(cat_purpose)

In [34]:
report['purpose'].value_counts()

estate       10840
car           4315
education     4022
wedding       2348
Name: purpose, dtype: int64

In [35]:
report.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_lemma
0,1,-8437,42,bachelor's degree,0,married,0,F,employee,0,40620,estate,"[purchase, of, the, house]"
1,1,-4024,36,secondary education,1,married,0,F,employee,0,17932,car,"[car, purchase]"
2,0,-5623,33,secondary education,1,married,0,M,employee,0,23341,estate,"[purchase, of, the, house]"
3,3,-4124,32,secondary education,1,married,0,M,employee,0,42820,education,"[supplementary, education]"
4,0,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,wedding,"[to, have, a, wedding]"


### Conclusion

Using lemmatization, I grouped all values in the 'purpose' column into 4 main categories: estate, car, education and wedding. This lets me analyze how repayment depends on the loan purpose.
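The keyword matching can also be illustrated without NLTK. The sketch below swaps lemmatization for a plain whitespace split, so plural forms have to be listed explicitly. `CATEGORY_KEYWORDS` and `categorize_purpose` are hypothetical names, and the order of the dictionary decides which category wins when several keywords match.

```python
# Hypothetical keyword sets; earlier categories take priority.
CATEGORY_KEYWORDS = {
    'estate': {'housing', 'house', 'real', 'estate', 'property'},
    'education': {'education', 'university', 'educated'},
    'car': {'car', 'cars'},
    'wedding': {'wedding'},
}


def categorize_purpose(purpose):
    """Return the first category whose keywords overlap the purpose's words."""
    words = set(purpose.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return 'other'


print(categorize_purpose('purchase of the house for my family'))
print(categorize_purpose('buying a second-hand car'))
print(categorize_purpose('to have a wedding'))
```

Matching whole words instead of substrings avoids accidental hits, for example 'car' inside an unrelated longer word.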

## Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [36]:
def have_children(childrens):
    if childrens == 0:
        return 'no kids'
    else:
        return 'have'

report['children'] = report['children'].apply(have_children)
report.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_lemma
0,have,-8437,42,bachelor's degree,0,married,0,F,employee,0,40620,estate,"[purchase, of, the, house]"
1,have,-4024,36,secondary education,1,married,0,F,employee,0,17932,car,"[car, purchase]"
2,no kids,-5623,33,secondary education,1,married,0,M,employee,0,23341,estate,"[purchase, of, the, house]"
3,have,-4124,32,secondary education,1,married,0,M,employee,0,42820,education,"[supplementary, education]"
4,no kids,340266,53,secondary education,1,civil partnership,1,F,retiree,0,25378,wedding,"[to, have, a, wedding]"


In [37]:
report_children_grouped = report.groupby('children').agg({'debt': ['count', 'sum']})
report_children_grouped

children,debt count,debt sum
have,7376,678
no kids,14149,1063


In [38]:
report_children_grouped['unpaid'] = report_children_grouped['debt']['sum'] / report_children_grouped['debt']['count'] * 100
report_children_grouped

children,debt count,debt sum,unpaid
have,7376,678,9.191974
no kids,14149,1063,7.512898


### Conclusion

To see how having children relates to repayment, I calculated the percentage of clients with unpaid debt in each group ('have' and 'no kids') from the 'debt' column. Clients with children defaulted in 9.19% of cases and clients without children in 7.51%, a difference of about 1.7 percentage points. So having children is associated with a somewhat higher risk of default, but the effect is small.
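Because 'debt' holds only 0 and 1, its mean within a group is already the share of unpaid loans. A minimal sketch on toy data with made-up values, which also keeps the original child counts by writing the flag to a hypothetical new column `has_children`:

```python
import pandas as pd

# Toy data (hypothetical): number of children and a 0/1 debt flag.
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 2, 3, 1],
    'debt':     [0, 0, 0, 1, 0, 1, 0, 1],
})

# Derive the group label in a separate column so the counts stay available.
toy['has_children'] = toy['children'].gt(0).map({True: 'have', False: 'no kids'})

# count = clients, sum = clients with debt, mean = share with debt.
rates = toy.groupby('has_children')['debt'].agg(['count', 'sum', 'mean'])
rates['unpaid_pct'] = rates['mean'] * 100
print(rates)
```

Keeping the numeric 'children' column would also allow a later breakdown by the exact number of children.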

- Is there a relation between marital status and repaying a loan on time?

In [39]:
report['family_status'].value_counts()

married              12380
civil partnership     4177
unmarried             2813
divorced              1195
widow / widower        960
Name: family_status, dtype: int64

In [40]:
report['family_status_id'].value_counts()

0    12380
1     4177
4     2813
3     1195
2      960
Name: family_status_id, dtype: int64

In [41]:
def family_cat(family_id):
    if family_id == 0:
        return 'married'
    if family_id == 1:
        return 'married'
    else:
        return 'single'

report['family_status_id'] = report['family_status_id'].apply(family_cat)
report.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_lemma
0,have,-8437,42,bachelor's degree,0,married,married,F,employee,0,40620,estate,"[purchase, of, the, house]"
1,have,-4024,36,secondary education,1,married,married,F,employee,0,17932,car,"[car, purchase]"
2,no kids,-5623,33,secondary education,1,married,married,M,employee,0,23341,estate,"[purchase, of, the, house]"
3,have,-4124,32,secondary education,1,married,married,M,employee,0,42820,education,"[supplementary, education]"
4,no kids,340266,53,secondary education,1,civil partnership,married,F,retiree,0,25378,wedding,"[to, have, a, wedding]"


In [42]:
report_family_grouped = report.groupby('family_status_id').agg({'debt': ['count', 'sum']})
report_family_grouped['unpaid'] = report_family_grouped['debt']['sum'] / report_family_grouped['debt']['count'] * 100
report_family_grouped

family_status_id,debt count,debt sum,unpaid
married,16557,1319,7.966419
single,4968,422,8.494364


### Conclusion

The difference between married clients (including civil partnerships), at 7.97% unpaid, and single clients, at 8.49%, is small: about 0.5 percentage points. Marital status does not have much impact on repayment.
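The grouping into 'married' and 'single' can also be written without overwriting the numeric ids. This is a sketch on toy rows; `family_group` is a hypothetical column name, and the id meanings in the comment are read off the two value counts above.

```python
import pandas as pd

# Toy rows (hypothetical debt flags), one per family_status_id.
toy = pd.DataFrame({
    'family_status_id': [0, 1, 2, 3, 4],
    'debt': [0, 1, 0, 0, 1],
})

# Ids in the data: 0 married, 1 civil partnership,
# 2 widow / widower, 3 divorced, 4 unmarried.
partnered_ids = [0, 1]

# Write the label to a new column and leave family_status_id numeric.
toy['family_group'] = toy['family_status_id'].isin(partnered_ids).map(
    {True: 'married', False: 'single'}
)
print(toy)
```

With the ids left intact, the same frame can still be grouped by the five original statuses.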

- Is there a relation between income level and repaying a loan on time?

In [43]:
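# Compute the threshold once instead of on every call.
# Note: this mean includes the zeros that replaced missing incomes.
mean_income = report['total_income'].mean()

def income_level(mon):
    if mon >= mean_income:
        return 'high'
    else:
        return 'low'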
    
report['total_income'] = report['total_income'].apply(income_level)
report.head()

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose,purpose_lemma
0,have,-8437,42,bachelor's degree,0,married,married,F,employee,0,high,estate,"[purchase, of, the, house]"
1,have,-4024,36,secondary education,1,married,married,F,employee,0,low,car,"[car, purchase]"
2,no kids,-5623,33,secondary education,1,married,married,M,employee,0,low,estate,"[purchase, of, the, house]"
3,have,-4124,32,secondary education,1,married,married,M,employee,0,high,education,"[supplementary, education]"
4,no kids,340266,53,secondary education,1,civil partnership,married,F,retiree,0,high,wedding,"[to, have, a, wedding]"


In [44]:
report_income_grouped = report.groupby('total_income').agg({'debt': ['count', 'sum']})
report_income_grouped['unpaid'] = report_income_grouped['debt']['sum'] / report_income_grouped['debt']['count'] * 100
report_income_grouped

total_income,debt count,debt sum,unpaid
high,9147,719,7.860501
low,12378,1022,8.256584


### Conclusion

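The difference between clients with high income (7.86% unpaid) and low income (8.26% unpaid) is small, about 0.4 percentage points. Note that the 'low' group also contains the 2,174 clients whose missing income was replaced with 0. Income level does not have much impact on repayment.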
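The mean used as the threshold is sensitive both to the zeros that stand in for unknown incomes and to a few very large incomes. A sketch of an alternative split, on made-up income values, that masks the zeros and uses quartile bands from `pd.qcut`. The band labels are hypothetical, and this is not the method used in the cells above.

```python
import pandas as pd

# Toy incomes (hypothetical); zeros stand for unknown income.
income = pd.Series([0, 0, 15000, 18000, 23000, 26000, 32000, 40000, 90000, 360000])

# Choose thresholds from known incomes only.
known = income[income > 0]

mean_threshold = known.mean()      # pulled upward by the largest incomes
median_threshold = known.median()  # splits the known incomes in half

# Four equally sized bands based on quartiles of the known incomes.
bands = pd.qcut(known, q=4, labels=['low', 'lower-middle', 'upper-middle', 'high'])

print(mean_threshold, median_threshold)
print(bands.value_counts())
```

Equal-sized bands make the default rates of the groups easier to compare than a two-way split around the mean.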

- How do different loan purposes affect on-time repayment of the loan?

In [45]:
report_purpose_grouped = report.groupby('purpose').agg({'debt': ['count', 'sum']})
report_purpose_grouped['unpaid'] = report_purpose_grouped['debt']['sum'] / report_purpose_grouped['debt']['count'] * 100
report_purpose_grouped

purpose,debt count,debt sum,unpaid
car,4315,403,9.339513
education,4022,370,9.199403
estate,10840,782,7.214022
wedding,2348,186,7.921635


### Conclusion

Clients who took a loan for real estate have the lowest share of unpaid loans (7.21%), followed by wedding loans (7.92%). Clients with car loans (9.34%) and education loans (9.20%) have the highest. Applications with these two purposes deserve extra attention.
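The ranking can be reproduced from the counts in the grouped table alone. A short sketch, with the hypothetical column names `clients`, `defaults` and `default_rate_pct`:

```python
import pandas as pd

# Counts copied from the grouped table above.
by_purpose = pd.DataFrame(
    {'clients': [4315, 4022, 10840, 2348], 'defaults': [403, 370, 782, 186]},
    index=['car', 'education', 'estate', 'wedding'],
)

# Share of clients with unpaid debt, in percent.
by_purpose['default_rate_pct'] = by_purpose['defaults'] / by_purpose['clients'] * 100

# Highest-risk purposes first.
ranked = by_purpose.sort_values('default_rate_pct', ascending=False)
print(ranked.round(2))
```

Sorting makes the gap visible: car and education sit roughly two percentage points above real estate.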

## General conclusion

In this project I analyzed bank data to draw conclusions about how reliably clients repay their loans.

First I looked carefully at the DataFrame. The table has 12 columns and 21,525 rows: two float64 columns, five int64 columns and five object columns. The column names look fine, but the data had several problems: missing values in 'days_employed' and 'total_income', illogical values in 'days_employed' (negative and implausibly large) and in 'children' (-1 and 20), and mixed letter case in 'education'.

The missing values in 'days_employed' and 'total_income' are equal in number and fall in the same rows. I assumed these are people who have never worked and so have no income, in many cases people who have just turned 21, and replaced the missing values with 0. I then converted both columns from float64 to int64 so the values are convenient for arithmetic and comparison.

In 'education' I removed case duplicates with str.lower(). In 'gender' one row had the value 'XNA'; rather than drop the row and lose its data, I replaced the value with the majority gender, 'F'. In 'children' I treated 20 and -1 as typos and replaced them with 2 and 1, because neither the median (0) nor the mean (about 0.5) was a sensible replacement. In 'dob_years' I replaced the impossible age 0 with the median, 42, to keep the column as integers.

Using lemmatization, I grouped the values in 'purpose' into 4 main categories, which made it possible to compare repayment by loan purpose.

With the data cleaned, the analysis suggests that the highest share of unpaid loans is among clients with kids and among those borrowing for a car or for education. This makes sense: a family with kids needs much more money for everyday needs, and a loan is an extra expense. Most clients borrowing for education are probably students who have little job experience and no stable job yet. A car also brings extra running costs on top of the loan. I guess that married clients become more responsible after marriage and repay slightly better than single clients, and that a high income lets customers feel a little more confident about paying on time. Loans for weddings and real estate have a low share of unpaid loans, probably because people who decide to marry or buy a house are more responsible.



