## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building the **credit scoring** of a potential customer. **Credit scoring** is used to evaluate the ability of a potential borrower to repay their loan.

### Step 1. Open the data file and have a look at the general information. 

In [3]:
import pandas as pd
#raw string (r'...') so the backslashes in the Windows path are kept literally;
#in a normal string '\U' starts a unicode escape and raises the SyntaxError
#"(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3"
data = pd.read_csv(r'C:\Users\MiriHad\anaconda3\projects jupyter')
#information about the data
print(data.head(10))
data.info()
print(data.shape)

### Conclusion

### Step 2. Data preprocessing

### Processing missing values

In [None]:
#changing negative values
data['days_employed'] = data['days_employed'].abs()


In [None]:
#checking the null values and filling them:

#the rate of missing data in the total income column
missing_data = data['total_income'].isnull().sum()
row_num = data.shape[0]
missing_rate = missing_data / row_num
print('the rate of missing data in income column: {:.1%}'.format(missing_rate))


#mean of days employed
days_employed_mean = data['days_employed'].mean()
print('days employed mean value:', days_employed_mean)
#median income, which is less sensitive to extreme values than the mean
total_income_median = data['total_income'].median()
print('total income median value:', total_income_median)
#filling the missing income values with the median because it is the
#middle value of the population, which gives a more reliable fill
#than a mean skewed by outliers
data['days_employed'] = data['days_employed'].fillna(days_employed_mean)
data['total_income'] = data['total_income'].fillna(total_income_median)
#checking
print(data.isnull().sum())

### Conclusion
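The choice of the median for income can be illustrated on a small made-up sample. The sketch below is only an illustration: the income values are hypothetical and do not come from the bank's data. It shows how a single extreme value pulls the mean upward while the median stays near a typical customer, and what that means for the value used to fill a gap.

```python
import pandas as pd

# Hypothetical incomes: four typical values and one extreme outlier.
incomes = pd.Series([20000, 22000, 25000, 27000, 500000])

mean_income = incomes.mean()      # pulled upward by the outlier
median_income = incomes.median()  # stays at the middle value

print('mean:', mean_income)
print('median:', median_income)

# The same idea applied to a gap: median() skips the missing value,
# and fillna() writes the median into the empty slot.
with_gap = pd.Series([20000, None, 25000, 27000, 500000])
filled = with_gap.fillna(with_gap.median())
print(filled.tolist())
```

With the mean, the filled value would sit far above what most customers in the sample earn, which is why the median is the safer fill for a skewed column such as income.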

### Data type replacement

In [None]:
#converting float columns to int:
#fractional days and fractional income units are not needed for this analysis
data.info()
data['total_income'] = data['total_income'].astype('int64')
data['days_employed'] = data['days_employed'].astype('int64')
#checking
data.info()

### Conclusion

### Processing duplicates

In [None]:
#lowercasing the string values so that differently cased copies are detected as duplicates
data['education'] = data['education'].str.lower()
data['family_status'] = data['family_status'].str.lower()
data['income_type'] = data['income_type'].str.lower()
data['purpose'] = data['purpose'].str.lower()


In [None]:
#lemmatizing 'purpose' would reduce near-duplicate categories
#----draft with WordNetLemmatizer, left commented out; needs more work----

#import nltk
#from nltk.stem import WordNetLemmatizer
#wordnet_lemma = WordNetLemmatizer()

#for purpose in data['purpose']:
#    words = nltk.word_tokenize(purpose)
#    lemmas = [wordnet_lemma.lemmatize(w, pos='n') for w in words]
#    print(lemmas)

from nltk.stem import SnowballStemmer
english_stemmer = SnowballStemmer('english')

def categorize_purpose(purpose):
    #split first, then stem each word; stemming the whole string at once
    #would treat the full phrase as a single word
    for word in purpose.split():
        stem = english_stemmer.stem(word)
        if 'wed' in stem:
            return 'wedding'
        elif 'hous' in stem or 'proper' in stem or 'est' in stem:
            return 'house'
        elif 'car' in stem:
            return 'car'
        elif 'educ' in stem or 'univers' in stem:
            return 'university'
    #no keyword matched
    return 'other'

data['purpose_clean'] = data['purpose'].apply(categorize_purpose)

In [None]:
#finding duplicated data:
print(data.duplicated().sum())
data = data.drop_duplicates().reset_index(drop=True)

### Conclusion

### Categorizing Data

In [None]:

#more information with grouping
#categorizing by grouping - we will not use the info of all the columns
data = data[['children', 'family_status', 'income_type', 'total_income', 'debt', 'purpose']]
print(data)
print(data.columns)
by_debt = data.groupby('debt').count()
print(by_debt)

#sum only the numeric columns; summing text columns is not meaningful
by_family_status = data.groupby('family_status').sum(numeric_only=True)
print(by_family_status)

by_income_type = data.groupby('income_type').sum(numeric_only=True)
print(by_income_type)

### Conclusion

### Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [None]:

no_debt = data[data['debt']==0]
yes_debt = data[data['debt']==1]

no_debt_mean = no_debt['children'].mean()
yes_debt_mean = yes_debt['children'].mean()

print('the average number of children of people without a defaulted loan:', no_debt_mean)
print('the average number of children of people with a defaulted loan:', yes_debt_mean)

#relative difference between the two averages
division = 1-(no_debt_mean / yes_debt_mean)
print('there is a {:.1%} difference between people with and without debt'.format(division))



We examined whether there is a difference in the average number of children between those who repaid a loan on time and those who did not. The number of children does not seem to affect on-time repayment of a loan.
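
Another way to look at the same question is to compute the share of customers with debt for each number of children, instead of comparing the two averages. The sketch below assumes the notebook's column names (`children`, `debt`) and uses a small hypothetical sample, not the bank's data, so the printed numbers only illustrate the technique.

```python
import pandas as pd

# Hypothetical sample; the column names follow the notebook.
sample = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    'debt':     [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
})

# 'debt' is 0/1, so the mean inside each group is the share of
# customers in that group who did not repay on time.
default_rate = sample.groupby('children')['debt'].agg(['count', 'mean'])
print(default_rate)
```

The `count` column is worth keeping next to the rate, because a rate computed from only a handful of customers is not reliable.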

- Is there a relation between marital status and repaying a loan on time?

In [None]:


#status = data[['debt','family_status']]
#pivot = pd.pivot_table(status, columns=['family_status'],
#                       values=['debt'], aggfunc='sum')
#print(pivot)

status_no_debt = data[data['debt']==0]['family_status'].value_counts()
all_status = data['family_status'].value_counts()

print(status_no_debt)
print('')
print(all_status)

#rate of each marital status with no debt compared to
#everyone with the same marital status;
#dividing two Series matches rows by status label, so the order
#of the two value_counts results does not matter
no_debt_rate = status_no_debt / all_status
for status, rate in no_debt_rate.items():
    print('{} with no debt: {:.1%}'.format(status, rate))


#-----A question to the examiner--------
#I was looking to do the test more automatically,
#using a loop and a dictionary.
#I would like to receive tips on how to do the exercise
#more effectively

### Conclusion

In each marital status group, we examined the proportion of people who repaid a loan on time relative to everyone in that group. We found that the differences between the groups are small.
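
The per-status proportions can also be produced in one step with a cross-tabulation, which avoids handling each status by hand. This is a minimal sketch on a hypothetical sample; the status labels and values are made up for illustration, and only the column names (`family_status`, `debt`) are taken from the notebook.

```python
import pandas as pd

# Hypothetical sample; labels and values are illustrative only.
sample = pd.DataFrame({
    'family_status': ['married', 'married', 'married', 'married',
                      'unmarried', 'unmarried', 'divorced', 'divorced'],
    'debt': [0, 0, 0, 1, 0, 1, 0, 0],
})

# normalize='index' turns each row into shares that sum to 1,
# so column 0 is the share without debt and column 1 the share with debt.
share = pd.crosstab(sample['family_status'], sample['debt'], normalize='index')
print(share)
```

Because the table is indexed by status label, it does not depend on the order in which the statuses happen to appear in the data.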

- Is there a relation between income level and repaying a loan on time?

In [None]:

#mean (not sum) gives the average income per customer in each group
income_with_debt = data[data['debt']==1]['total_income'].mean()
print('average income of people with debt:', income_with_debt)
income_without_debt = data[data['debt']==0]['total_income'].mean()
print('average income of people without debt:', income_without_debt)
div = 1-(income_with_debt / income_without_debt)
print('relative difference in average income between people with and without debt: {:.1%}'.format(div))

### Conclusion

We looked at the average income of people with debt and without debt. The 91% gap found earlier came from comparing the total income of each group rather than the average, so it mostly reflects that far more customers have no debt; the per-customer averages computed above are the figures to compare.
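
Average income can be dominated by a few very large values, so a complementary check is to split customers into income groups and compare the share with debt in each group. The sketch below is one possible way to do this with `pd.qcut`; the two group labels and all values are hypothetical, and only the column names (`total_income`, `debt`) come from the notebook.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
sample = pd.DataFrame({
    'total_income': [10000, 12000, 15000, 18000, 21000, 25000, 30000, 40000],
    'debt':         [1, 0, 1, 0, 0, 1, 0, 0],
})

# Split customers into two equally sized income groups at the median.
sample['income_group'] = pd.qcut(sample['total_income'], q=2,
                                 labels=['lower', 'higher'])

# Share of customers with debt in each income group.
rate_by_group = sample.groupby('income_group', observed=True)['debt'].mean()
print(rate_by_group)
```

Using more quantiles (for example `q=4`) gives a finer picture, at the cost of fewer customers per group.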

- How do different loan purposes affect on-time repayment of the loan?

In [None]:
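print(data['purpose'].value_counts())

#share of customers with debt for each loan purpose;
#'debt' is 0/1, so its mean within a group is that group's default rate
debt_by_purpose = data.groupby('purpose')['debt'].agg(['count', 'mean'])
print(debt_by_purpose.sort_values('mean', ascending=False))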

### Conclusion

### Step 4. General conclusion

Our goal was to find out how factors such as marital status and number of children influence on-time loan repayment. First we handled the missing data and duplicates. Next, we examined each factor separately. We found that the number of children and marital status do not appear to affect loan repayment. In contrast, the income comparison suggested that customers who repay on time have higher income, but that comparison used the total income of each group rather than the average, so it largely reflects group size and needs to be confirmed on averages.

### Project Readiness Checklist

Put 'x' in the completed points. Then press Shift + Enter.

- [x]  file open;
- [X]  file examined;
- [X]  missing values defined;
- [X]  missing values are filled;
- [X]  an explanation of which missing value types were detected;
- [X]  explanation for the possible causes of missing values;
- [X]  an explanation of how the blanks are filled;
- [X]  replaced the real data type with an integer;
- [X]  an explanation of which method is used to change the data type and why;
- [X]  duplicates deleted;
- [X]  an explanation of which method is used to find and remove duplicates;
- [X]  description of the possible reasons for the appearance of duplicates in the data;
- [x]  data is categorized;
- [X]  an explanation of the principle of data categorization;
- [X]  an answer to the question "Is there a relation between having kids and repaying a loan on time?";
- [X]  an answer to the question "Is there a relation between marital status and repaying a loan on time?";
- [X]  an answer to the question "Is there a relation between income level and repaying a loan on time?";
- [ ]  an answer to the question "How do different loan purposes affect on-time repayment of the loan?";
- [X]  conclusions are present on each stage;
- [X]  a general conclusion is made.