## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building a **credit score** for a potential customer. A **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

### Step 1. Open the data file and have a look at the general information. 

In [34]:
import numpy as np
import nltk
#nltk.download('all')
import pandas as pd
data=pd.read_csv('/datasets/credit_scoring_eng.csv')




In [35]:
data.info()
print(data.head(15))
data.sample()
print(data.tail())
data.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB
    children  days_employed  dob_years            education  education_id  \
0          1   -8437.673028         42    bachelor's degree             0   
1          1   -4024.803754         36  secondary education             1   
2          0   -5623.422610         33  Secondary Education             1   
3          3   -4124

| | children | days_employed | dob_years | education_id | family_status_id | debt | total_income |
|---|---|---|---|---|---|---|---|
| count | 21525.0 | 19351.0 | 21525.0 | 21525.0 | 21525.0 | 21525.0 | 19351.0 |
| mean | 0.538908 | 63046.497661 | 43.29338 | 0.817236 | 0.972544 | 0.080883 | 26787.568355 |
| std | 1.381587 | 140827.311974 | 12.574584 | 0.548138 | 1.420324 | 0.272661 | 16475.450632 |
| min | -1.0 | -18388.949901 | 0.0 | 0.0 | 0.0 | 0.0 | 3306.762 |
| 25% | 0.0 | -2747.423625 | 33.0 | 1.0 | 0.0 | 0.0 | 16488.5045 |
| 50% | 0.0 | -1203.369529 | 42.0 | 1.0 | 0.0 | 0.0 | 23202.87 |
| 75% | 1.0 | -291.095954 | 53.0 | 1.0 | 1.0 | 0.0 | 32549.611 |
| max | 20.0 | 401755.400475 | 75.0 | 4.0 | 4.0 | 1.0 | 362496.645 |


### Conclusion

We loaded the data and got familiar with it using the info, head, tail, and describe methods. The table has 12 columns and 21525 rows (entries). The columns are parameters that describe the bank's customers, and by analysing them we will try to answer the project's questions.



### Step 2. Data preprocessing

### Processing missing values

In [36]:
print(data['income_type'].unique())

# isna() returns a boolean DataFrame: True = missing value, False = non-missing value
data_missing = data.isna()
data_missing.head(15)
data_missing.dtypes    # every column is now boolean

# Booleans are treated as numbers in arithmetic operations in Python,
# so sum() counts the True (missing) values
data_num_missing = data_missing.sum()
print(data_num_missing)  # number of missing values in every column

len(data)  # number of rows
data_num_missing / len(data) * 100  # percentage of missing values by column
# The share is above 5%, so we fill the gaps instead of dropping the rows

# One lead: row 12 has NaN in days_employed and the customer is a retiree.
# That may be why the value is missing; for a retiree days_employed is irrelevant
data['days_employed'] = data['days_employed'].fillna(value=0)

print(data['days_employed'].head(15))

# Now we do the same for the second column with missing values, total_income,
# but fill it with the mean of each income_type/education group instead of 0
print(data.head(20))
print(data['total_income'].head(15))
print(data.groupby('income_type')['total_income'].transform(lambda grp: grp.replace(0, np.median(grp))))

data['total_income'] = data.groupby(['income_type','education'])['total_income'].transform(lambda grp: grp.fillna(np.mean(grp)))



['employee' 'retiree' 'business' 'civil servant' 'unemployed'
 'entrepreneur' 'student' 'paternity / maternity leave']
children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64
0      -8437.673028
1      -4024.803754
2      -5623.422610
3      -4124.747207
4     340266.072047
5       -926.185831
6      -2879.202052
7       -152.779569
8      -6929.865299
9      -2188.756445
10     -4171.483647
11      -792.701887
12         0.000000
13     -1846.641941
14     -1844.956182
Name: days_employed, dtype: float64
    children  days_employed  dob_years            education  education_id  \
0          1   -8437.673028         42    bachelor's degree             0   
1          1   -4024.803754         36  secondary education             1   
2


data.count() shows the number of non-missing values in each column. Only two columns have missing values: days_employed and total_income, with 2174 missing entries each.

Next we calculate the percentage of missing data in each column. If it is 5% or less (some sources allow between 5% and 10%, but we use 5%), the rows can be deleted without negative consequences for the final result. Otherwise the missing values need to be filled with 0, the mean, or the median.

data.isna() returns a table of booleans in which True marks a missing value and False a non-missing one. Booleans are treated as numbers in arithmetic operations in Python, so data.isna().sum() gives the count of missing values in every column. Dividing that count by len(data), the number of rows, and multiplying by 100 gives the percentage. Here the result is about 10% for both columns (2174 of 21525 rows), which is more than 5%, so we fill the values.

One lead: in row 12 the customer has NaN in days_employed and is a retiree. That may be why the value is missing, since days_employed is irrelevant for a retiree.

In the days_employed column we filled the missing values with zero because this column is generally irrelevant to our questions. In the total_income column we filled the missing values with the mean income of customers with the same income_type and education.
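The missing-value steps described above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the values and the name `toy` are hypothetical, and the grouping uses only income_type to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the bank's file).
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "employee",
                    "retiree", "retiree", "retiree"],
    "days_employed": [-1200.5, np.nan, -300.0, 340000.0, np.nan, 350000.0],
    "total_income": [20000.0, np.nan, 30000.0, 10000.0, np.nan, 14000.0],
})

# isna() gives a boolean frame; the mean of booleans is the share of True values.
missing_pct = toy.isna().mean() * 100

# Fill each missing income with the mean income of the same income_type group.
toy["total_income"] = (
    toy.groupby("income_type")["total_income"]
       .transform(lambda grp: grp.fillna(grp.mean()))
)

# Treat missing days_employed as 0, as the notebook does.
toy["days_employed"] = toy["days_employed"].fillna(0)

print(missing_pct.round(1))
print(toy)
```

Filling by group keeps a retiree's imputed income close to other retirees' incomes instead of pulling it toward the overall average.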





### Data type replacement

In [38]:
# Convert the floats to integers; they exist in only two columns, so this should be fairly easy
data['days_employed'] = data['days_employed'].astype(int)
print(data['days_employed'].head(15))
# Do the same with the total_income column
data['total_income'] = data['total_income'].astype(int)
print(data['total_income'].head(15))

# Also replace the negative values with their absolute values: the minus sign looks like
# a data-entry artifact, and there is no particular logical reason for it

data['days_employed'] = data['days_employed'].abs()
print(data['days_employed'].head(15))


0      -8437
1      -4024
2      -5623
3      -4124
4     340266
5       -926
6      -2879
7       -152
8      -6929
9      -2188
10     -4171
11      -792
12         0
13     -1846
14     -1844
Name: days_employed, dtype: int64
0     40620
1     17932
2     23341
3     42820
4     25378
5     40922
6     38484
7     21731
8     15337
9     23108
10    18230
11    12331
12    20959
13    20873
14    26420
Name: total_income, dtype: int64
0       8437
1       4024
2       5623
3       4124
4     340266
5        926
6       2879
7        152
8       6929
9       2188
10      4171
11       792
12         0
13      1846
14      1844
Name: days_employed, dtype: int64


We used astype(int) to convert the float columns days_employed and total_income to integers, and abs() to remove the negative sign from days_employed, so that the data is easier to work with.
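A short sketch of the same conversion on a hypothetical series. The sample values are made up for the example. One detail worth noting: astype(int) raises an error when the column still contains NaN, so the missing values have to be filled first.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a missing value and negative entries.
days = pd.Series([-8437.67, -4024.80, np.nan, 340266.07], name="days_employed")

# Fill NaN first, then drop the sign, then truncate the fraction.
days_clean = days.fillna(0).abs().astype(int)

print(days_clean.tolist())  # [8437, 4024, 0, 340266]
```

astype(int) truncates the fractional part rather than rounding it.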



### Processing duplicates

In [None]:
# Deleting duplicates can be a tricky task: for instance, the total_income column has repeated values
# as a result of filling missing data with group means. The safest thing is to keep that data as it is.
# First make the education column lowercase so that rows differing only in letter case count as duplicates

data['education'] = data['education'].str.lower()

print(data['education'].head(25))
# Count the fully duplicated rows, drop them, and check again:
print('Before ', data.duplicated().sum())
data = data.drop_duplicates().reset_index(drop=True)
print('After ', data.duplicated().sum())
data.info()






### Conclusion

We used str.lower() to make all values in the education column lowercase.
After that we used duplicated() and drop_duplicates() to count and drop the duplicate rows, as shown in the code above.
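The effect of lowercasing before dropping duplicates can be seen on a tiny made-up table. The rows below are hypothetical and only illustrate why the order of the two steps matters.

```python
import pandas as pd

# Hypothetical rows: the first two differ only in letter case.
toy = pd.DataFrame({
    "education": ["Secondary Education", "secondary education",
                  "BACHELOR'S DEGREE", "bachelor's degree"],
    "total_income": [23341, 23341, 40620, 41000],
})

before_lower = int(toy.duplicated().sum())   # case differences hide the duplicate
toy["education"] = toy["education"].str.lower()
after_lower = int(toy.duplicated().sum())    # now the first two rows match

toy = toy.drop_duplicates().reset_index(drop=True)
print(before_lower, after_lower, len(toy))  # 0 1 3
```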

### Categorizing Data

In [39]:


# Now we need to group the values in the purpose column; we will use a lemmatizer:
import nltk
#nltk.download('all')

from nltk.stem import WordNetLemmatizer

def lemmatizing(text):
    wordnet_lemma=WordNetLemmatizer()
    words=nltk.word_tokenize(text)
    return [wordnet_lemma.lemmatize(w, pos='n')for w in words]

data['lemmas']=data['purpose'].apply(lemmatizing)
data['lemmas'].value_counts()

def purpose_category(purpose):
    if 'wedding' in purpose:
        return 'wedding'
    elif 'car' in purpose:
        return 'car'
    elif 'house' in purpose or 'estate' in purpose or 'housing' in purpose or 'property' in purpose:
        return 'housing'
    # each keyword needs its own `in` test; a bare non-empty string is always truthy
    elif 'education' in purpose or 'university' in purpose:
        return 'education'
    else:
        return 'other'
    
data['new_purpose'] = data['lemmas'].apply(purpose_category)
data['new_purpose'].value_counts()

# Split total_income into three equal-sized groups (terciles) and keep the labels in a new column

data['income_level'] = pd.qcut(data['total_income'], q=3, labels=['low_income','medium_income','high_income'])

# Group the children column with an if/elif function:
def has_children(children):
    if children == 0:
        return 'no'
    elif children>0:
        return 'yes'
    
data['has_status']=data['children'].apply(has_children)

print(data.head())





   children  days_employed  dob_years            education  education_id  \
0         1           8437         42    bachelor's degree             0   
1         1           4024         36  secondary education             1   
2         0           5623         33  Secondary Education             1   
3         3           4124         32  secondary education             1   
4         0         340266         53  secondary education             1   

       family_status  family_status_id gender income_type  debt  total_income  \
0            married                 0      F    employee     0         40620   
1            married                 0      F    employee     0         17932   
2            married                 0      M    employee     0         23341   
3            married                 0      M    employee     0         42820   
4  civil partnership                 1      F     retiree     0         25378   

                   purpose                      lemmas n

In [40]:
print(data['lemmas'][3])
print(data['new_purpose'][3])

['supplementary', 'education']
education


### Conclusion

We split total_income into three equal-sized groups (low, medium, and high income) with pd.qcut. In the purpose column we grouped the various purposes into general categories: for example, housing, estate, house, and property all go into one group named 'housing', and the other purposes are handled the same way. We regrouped the children column into two simple types: with and without children.
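The keyword grouping described above can be sketched on token lists that are already lemmatized, without NLTK. This is a minimal illustration rather than the notebook's exact code: the name `CATEGORY_KEYWORDS` and the sample lemma lists are assumptions made for the example.

```python
# Keyword sets per category (illustrative; mirrors the keywords named in the text).
CATEGORY_KEYWORDS = {
    "wedding": {"wedding"},
    "car": {"car"},
    "housing": {"house", "housing", "estate", "property"},
    "education": {"education", "university"},
}

def purpose_category(lemmas):
    """Return the first category whose keywords overlap the lemma list."""
    tokens = set(lemmas)
    for category, keywords in CATEGORY_KEYWORDS.items():
        if tokens & keywords:
            return category
    return "other"

print(purpose_category(["supplementary", "education"]))      # education
print(purpose_category(["purchase", "of", "the", "house"]))  # housing
print(purpose_category(["vacation"]))                        # other
```

Keeping the keywords in a dictionary makes each test explicit, so a purpose with no matching keyword falls through to 'other'.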

### Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [38]:
data.pivot_table(index='children', values = 'debt', aggfunc = ['sum', 'count', 'mean'])

| children | sum (debt) | count (debt) | mean (debt) |
|---|---|---|---|
| -1 | 1 | 47 | 0.021277 |
| 0 | 1063 | 14149 | 0.075129 |
| 1 | 444 | 4818 | 0.092154 |
| 2 | 194 | 2055 | 0.094404 |
| 3 | 27 | 330 | 0.081818 |
| 4 | 4 | 41 | 0.097561 |
| 5 | 0 | 9 | 0.0 |
| 20 | 8 | 76 | 0.105263 |


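As we can see in the table, there is no strong relation between the number of kids and repaying the loan on time. The default rate is about 7.5% for customers without children and 9.2% to 9.4% for those with one or two children, a difference of roughly two percentage points. The groups with four or more children contain too few customers to compare reliably, and the values -1 and 20 in the children column look like data-entry errors.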
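The default rate per group is simply the mean of the 0/1 debt column within that group. As a sketch on made-up data, groupby with agg gives the same sum, count, and mean as the pivot table, with flat column names. The rows and the renamed columns below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical customers: debt is 1 for a default and 0 otherwise.
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "debt":     [0, 0, 0, 1, 0, 1, 0, 0],
})

summary = toy.groupby("children")["debt"].agg(["sum", "count", "mean"])
summary = summary.rename(columns={"sum": "defaults",
                                  "count": "customers",
                                  "mean": "default_rate"})
print(summary)
```

Showing the customer count next to the rate makes it easier to spot groups that are too small to support a conclusion.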

- Is there a relation between marital status and repaying a loan on time?

In [37]:
data.pivot_table(index='family_status', values = 'debt', aggfunc = ['sum', 'count', 'mean'])

| family_status | sum (debt) | count (debt) | mean (debt) |
|---|---|---|---|
| civil partnership | 388 | 4177 | 0.09289 |
| divorced | 85 | 1195 | 0.07113 |
| married | 931 | 12380 | 0.075202 |
| unmarried | 274 | 2813 | 0.097405 |
| widow / widower | 63 | 960 | 0.065625 |


Here we see that customers in a civil partnership (9.3%) and unmarried customers (9.7%) default slightly more often than married (7.5%), divorced (7.1%), or widowed (6.6%) customers.

- Is there a relation between income level and repaying a loan on time?

In [40]:
data.pivot_table(index='total_income', values = 'debt', aggfunc = ['sum', 'count', 'mean'])

| total_income | sum (debt) | count (debt) | mean (debt) |
|---|---|---|---|
| 3306.762 | 1 | 1 | 1 |
| 3392.845 | 0 | 1 | 0 |
| 3418.824 | 0 | 1 | 0 |
| 3471.216 | 0 | 1 | 0 |
| 3503.298 | 0 | 1 | 0 |
| ... | ... | ... | ... |
| 273809.483 | 0 | 1 | 0 |
| 274402.943 | 0 | 1 | 0 |
| 276204.162 | 0 | 1 | 0 |
| 352136.354 | 1 | 1 | 1 |


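Pivoting on the raw total_income values puts almost every customer in a row of their own, so this table shows no visible connection between income level and repaying on time. Comparing default rates would require grouping incomes into a few brackets first.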
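One way to compare default rates by income is to bin incomes into brackets before aggregating. The sketch below uses pd.qcut with three quantile bins on made-up numbers. The column name `income_level` and all values are assumptions for illustration, not results from the bank's data.

```python
import pandas as pd

# Hypothetical incomes and default flags.
toy = pd.DataFrame({
    "total_income": [10000, 12000, 15000, 20000, 23000,
                     26000, 32000, 40000, 90000],
    "debt":         [1, 0, 0, 0, 1, 0, 0, 0, 0],
})

# qcut splits the values into three bins with (roughly) equal numbers of rows.
toy["income_level"] = pd.qcut(
    toy["total_income"], q=3,
    labels=["low_income", "medium_income", "high_income"],
)

# Default rate per bracket: the mean of the 0/1 debt column.
rates = toy.groupby("income_level", observed=False)["debt"].agg(["count", "mean"])
print(rates)
```

With brackets, each row of the summary covers many customers, so the mean of debt becomes a meaningful rate instead of a single 0 or 1.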

- How do different loan purposes affect on-time repayment of the loan?

In [41]:
data.pivot_table(index='new_purpose', values = 'debt', aggfunc = ['sum', 'count', 'mean'])

| new_purpose | sum (debt) | count (debt) | mean (debt) |
|---|---|---|---|
| car | 403 | 4315 | 0.093395 |
| education | 370 | 4022 | 0.091994 |
| housing | 782 | 10840 | 0.07214 |
| wedding | 186 | 2348 | 0.079216 |


We see here that car loans (9.3%) and education loans (9.2%) have slightly higher default rates than wedding loans (7.9%) and housing loans (7.2%).

### Step 4. General conclusion

As a general conclusion, we analysed the relevant parameters to try to answer the questions from the start of the project. We did not find a clear connection between the number of children a bank customer has and whether they default on a loan. If we can give some advice beyond what the tables show, it would be that the bank should pay attention when giving loans to elderly customers, as a less stable group, and to housing loans because of their longer repayment period, even though housing loans have the lowest default rate in the loan purpose table. However, those topics were covered in less depth in this analysis.

### Project Readiness Checklist

Put 'x' in the completed points. Then press Shift + Enter.

- [x]  file open;
- [ ]  file examined;
- [ ]  missing values defined;
- [ ]  missing values are filled;
- [ ]  an explanation of which missing value types were detected;
- [ ]  explanation for the possible causes of missing values;
- [ ]  an explanation of how the blanks are filled;
- [ ]  replaced the real data type with an integer;
- [ ]  an explanation of which method is used to change the data type and why;
- [ ]  duplicates deleted;
- [ ]  an explanation of which method is used to find and remove duplicates;
- [ ]  description of the possible reasons for the appearance of duplicates in the data;
- [ ]  data is categorized;
- [ ]  an explanation of the principle of data categorization;
- [ ]  an answer to the question "Is there a relation between having kids and repaying a loan on time?";
- [ ]  an answer to the question "Is there a relation between marital status and repaying a loan on time?";
- [ ]  an answer to the question "Is there a relation between income level and repaying a loan on time?";
- [ ]  an answer to the question "How do different loan purposes affect on-time repayment of the loan?";
- [ ]  conclusions are present on each stage;
- [ ]  a general conclusion is made.