## 1. Hypothesis Generation

Generating hypotheses is a major step in analyzing data. It involves understanding the problem and formulating meaningful hypotheses about which factors could affect the outcome. This is done BEFORE looking at the data, and it produces a laundry list of analyses we can perform if the data is available.

#### Possible hypotheses
Which applicants are more likely to get a loan?

1. Applicants having a credit history 
2. Applicants with higher applicant and co-applicant incomes
3. Applicants with higher education level
4. Properties in urban areas with high growth prospects

Do more brainstorming and create some hypotheses of your own. Remember that the data might not be sufficient to test all of these, but forming these enables a better understanding of the problem.

In [None]:
# credit rating, or current credit debt
# co-applicant's history and credit rating
# previous default on a loan or previous bankruptcy
# not just whether there is a history, but how good the repayment history is
# amount asked relative to income (or total household income if married)
# payment period vs amount (monthly payments relative to income)


In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("data/data.csv")

In [4]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


#### 1. Applicants having a credit history

I think applicants with a credit history are more likely to get a loan.

In [5]:
# NOTE CHANGE NA TO 0 FOR THIS COLUMN

df_credit_hist = df[['Credit_History', 'Loan_Status']]
df_credit_hist.count()

# Fraction of applications WITH a recorded credit history (count() skips NaN):
df_credit_hist.Credit_History.count() / df_credit_hist.Loan_Status.count()

0.9185667752442996

In [14]:
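# Without history (rows where Credit_History is NaN); count() skips NaN, so the Credit_History entry in the output is 0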
df_no_hist = df_credit_hist[df_credit_hist.Credit_History.isna()] 
df_no_hist[df_no_hist.Loan_Status == 'Y'].count() / df_no_hist.Loan_Status.count()

Credit_History    0.00
Loan_Status       0.74
dtype: float64

In [15]:
# With history (rows where Credit_History is not NaN)
df_with_hist = df_credit_hist[df_credit_hist.Credit_History.notna()]
df_with_hist[df_with_hist.Loan_Status == 'Y'].count() / df_with_hist.Loan_Status.count()

Credit_History    0.682624
Loan_Status       0.682624
dtype: float64

In [None]:
# I was wrong: having a recorded credit history is not really an indicator of loan approval (the history could be bad).
# Approval was 74% without a recorded history and 68% with one; about 92% of applications had a recorded credit history.
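The cells above split on whether `Credit_History` is missing. The rows printed in this notebook show the column holding 1.0 and 0.0, which may encode a good versus a bad history (an assumption; the notebook does not confirm the meaning). A minimal sketch of comparing approval across the values themselves, with missing values kept as their own group. The helper name `approval_rate_by` is made up for illustration.

```python
import pandas as pd


def approval_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of rows with Loan_Status == 'Y' for each value of `column`.

    dropna=False keeps rows where `column` is NaN as a separate group
    instead of silently dropping them.
    """
    approved = df["Loan_Status"].eq("Y")  # boolean Series, True = approved
    return approved.groupby(df[column], dropna=False).mean()
```

Calling `approval_rate_by(df, "Credit_History")` on the loan data would return one approval rate per distinct value (plus NaN), which is a more direct test of the "bad history" idea than the missing/not-missing split.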

#### 2. Applicants with higher applicant and co-applicant incomes

I think higher income means a higher approval rate.

In [7]:
# Add together applicant and co-applicant income
# .copy() makes df_income an independent frame, so adding a column does not trigger SettingWithCopyWarning
df_income = df[['ApplicantIncome', 'CoapplicantIncome', 'Loan_Status']].copy()
df_income['Combined_Income'] = df_income['ApplicantIncome'] + df_income['CoapplicantIncome']
df_income.head()


Unnamed: 0,ApplicantIncome,CoapplicantIncome,Loan_Status,Combined_Income
0,5849,0.0,Y,5849.0
1,4583,1508.0,N,6091.0
2,3000,0.0,Y,3000.0
3,2583,2358.0,Y,4941.0
4,6000,0.0,Y,6000.0


In [37]:
# Approved loans (Applicant income)
df_income_approved = df_income[df_income.Loan_Status == 'Y']
print(f'median approved income: {df_income_approved.ApplicantIncome.median()}')
print(f'mean approved income: {df_income_approved.ApplicantIncome.mean()}')      
print(f'max approved income: {df_income_approved.ApplicantIncome.max()}')   
print(f'min approved income: {df_income_approved.ApplicantIncome.min()}')   

median approved income: 3812.5
mean approved income: 5384.068720379147
max approved income: 63337
min approved income: 210


In [38]:
# Declined loans (Applicant income)
df_income_declined = df_income[df_income.Loan_Status == 'N']
print(f'median declined income: {df_income_declined.ApplicantIncome.median()}')
print(f'mean declined income: {df_income_declined.ApplicantIncome.mean()}')
print(f'max declined income: {df_income_declined.ApplicantIncome.max()}')
print(f'min declined income: {df_income_declined.ApplicantIncome.min()}')

median declined income: 3833.5
mean declined income: 5446.078125
max declined income: 81000
min declined income: 150


In [39]:
# Approved loans (Combined income)
df_income_approved = df_income[df_income.Loan_Status == 'Y']
print(f'For combined applicant and co-applicant income')
print(f'median approved income: {df_income_approved.Combined_Income.median()}')
print(f'mean approved income: {df_income_approved.Combined_Income.mean()}')      
print(f'max approved income: {df_income_approved.Combined_Income.max()}')   
print(f'min approved income: {df_income_approved.Combined_Income.min()}')   

For combined applicant and co-applicant income
median approved income: 5439.0
mean approved income: 6888.585118456492
max approved income: 63337.0
min approved income: 1963.0


In [41]:
# Declined loans (Combined income)
df_income_declined = df_income[df_income.Loan_Status == 'N']
print('For combined applicant and co-applicant income')
print(f'median declined income: {df_income_declined.Combined_Income.median()}')
print(f'mean declined income: {df_income_declined.Combined_Income.mean()}')
print(f'max declined income: {df_income_declined.Combined_Income.max()}')
print(f'min declined income: {df_income_declined.Combined_Income.min()}')

For combined applicant and co-applicant income
median declined income: 5289.5
mean declined income: 7323.885416666667
max declined income: 81000.0
min declined income: 1442.0
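The four cells above repeat the same statistics for each outcome. The same comparison can be sketched as a single grouped aggregation. The helper name `income_summary` is hypothetical, and it assumes the frame has a `Loan_Status` column plus the income column passed in.

```python
import pandas as pd


def income_summary(df: pd.DataFrame, income_col: str) -> pd.DataFrame:
    """Median, mean, max and min of `income_col`, one row per Loan_Status value."""
    return df.groupby("Loan_Status")[income_col].agg(["median", "mean", "max", "min"])
```

For example, `income_summary(df_income, "Combined_Income")` would put the approved and declined figures side by side in one table.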


In [None]:
# Interesting: income alone has no real impact. The loan amount must matter as well.

#### 3. Applicants with higher education level

I think higher education means a higher approval rate.

In [18]:
# ENCODE THIS 1, 0
df_edu = df[['Education', 'Loan_Status']]
df_edu.Education.unique()

array(['Graduate', 'Not Graduate'], dtype=object)
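The note in the cell above says to encode this column as 1 and 0. One possible sketch uses `Series.map` with explicit dictionaries, so any unexpected value becomes NaN instead of being silently coded as 0. The function name and the choice of which value maps to 1 are assumptions made for illustration.

```python
import pandas as pd

# Assumed coding: 1 = Graduate / approved, 0 = Not Graduate / declined.
EDUCATION_CODES = {"Graduate": 1, "Not Graduate": 0}
STATUS_CODES = {"Y": 1, "N": 0}


def encode_education(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of Education and Loan_Status encoded as 1/0."""
    out = df[["Education", "Loan_Status"]].copy()
    out["Education"] = out["Education"].map(EDUCATION_CODES)
    out["Loan_Status"] = out["Loan_Status"].map(STATUS_CODES)
    return out
```

With both columns numeric, `encode_education(df).groupby("Education")["Loan_Status"].mean()` gives the approval rate per education level directly.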

In [43]:
df_edu[df_edu['Education'] == 'Graduate'].count()

Education      480
Loan_Status    480
dtype: int64

In [44]:
df_grad = df_edu[df_edu['Education'] == 'Graduate']
df_not_grad = df_edu[df_edu['Education'] == 'Not Graduate']

In [45]:
# percentage graduated approved loans
df_grad[df_grad.Loan_Status == 'Y'].count() / df_grad.Loan_Status.count()

Education      0.708333
Loan_Status    0.708333
dtype: float64

In [46]:
# percentage not graduated approved loans
df_not_grad[df_not_grad.Loan_Status == 'Y'].count() / df_not_grad.Loan_Status.count()

Education      0.61194
Loan_Status    0.61194
dtype: float64

In [None]:
# This one I may have been right about: graduates were approved at 70.8% versus 61.2% for non-graduates.

#### 4. Properties in urban areas with high growth prospects

I think this could go either way, since people in urban areas struggle too, but let's say urban properties have a higher approval rate.

In [49]:
df_area = df[['Property_Area', 'Loan_Status']]
df_area.isna().sum()

Property_Area    0
Loan_Status      0
dtype: int64

In [56]:
df_area.Property_Area.unique()
df_urban = df_area[df_area.Property_Area =='Urban']
df_rural = df_area[df_area.Property_Area =='Rural']
df_semiurban = df_area[df_area.Property_Area =='Semiurban']
df_urban.head()

Unnamed: 0,Property_Area,Loan_Status
0,Urban,Y
2,Urban,Y
3,Urban,Y
4,Urban,Y
5,Urban,Y


In [59]:
# Percentage Urban Approved
print(f'Urban approval: {df_urban[df_urban.Loan_Status == "Y"].count() / df_urban.Loan_Status.count()}')

# Percentage of total Urban
df_area[df_area['Property_Area'] == 'Urban'].count()/ df_area.count()

Urban approval: Property_Area    0.658416
Loan_Status      0.658416
dtype: float64


Property_Area    0.32899
Loan_Status      0.32899
dtype: float64

In [60]:
# Percentage Rural Approved
print(f'Rural approval: {df_rural[df_rural.Loan_Status == "Y"].count() / df_rural.Loan_Status.count()}')

# Percentage of total Rural
df_area[df_area['Property_Area'] == 'Rural'].count()/ df_area.count()

Rural approval: Property_Area    0.614525
Loan_Status      0.614525
dtype: float64


Property_Area    0.291531
Loan_Status      0.291531
dtype: float64

In [61]:
# Percentage Semiurban Approved
print(f'Semiurban approval: {df_semiurban[df_semiurban.Loan_Status == "Y"].count() / df_semiurban.Loan_Status.count()}')

# Percentage of total Semiurban
df_area[df_area['Property_Area'] == 'Semiurban'].count()/ df_area.count()

Semiurban approval: Property_Area    0.76824
Loan_Status      0.76824
dtype: float64


Property_Area    0.379479
Loan_Status      0.379479
dtype: float64

In [None]:
# So it did have an impact: semi-urban had the highest approval rate (76.8%, vs 65.8% urban and 61.4% rural), and applications were spread fairly evenly across the 3 categories
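The per-area cells above each compute two numbers: the area's share of all applications and its approval rate. A compact sketch that builds both columns in one table, assuming the `Property_Area` and `Loan_Status` columns used above. The helper name `area_summary` is made up.

```python
import pandas as pd


def area_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per Property_Area: share of applications and approval rate."""
    approved = df["Loan_Status"].eq("Y")
    return pd.DataFrame({
        # fraction of all rows that fall in each area
        "share_of_applications": df["Property_Area"].value_counts(normalize=True),
        # fraction of rows in each area with Loan_Status == 'Y'
        "approval_rate": approved.groupby(df["Property_Area"]).mean(),
    })
```

Both Series are indexed by area name, so pandas aligns them row by row when building the frame.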

#### 5. Self Employed 
I think self-employed applicants have a lower approval rate.
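This hypothesis is not tested in the notebook. A minimal sketch of how it could be checked, reporting the group size next to each approval rate so that small groups are easy to spot. Missing `Self_Employed` values are labelled rather than dropped; the label `"Unknown"` and the helper name are assumptions for illustration.

```python
import pandas as pd


def approval_with_counts(df: pd.DataFrame, column: str,
                         missing_label: str = "Unknown") -> pd.DataFrame:
    """Group size and approval rate for each value of `column`."""
    groups = df[column].fillna(missing_label)  # keep NaN rows as their own group
    approved = df["Loan_Status"].eq("Y")       # no NaN here, so count == group size
    return approved.groupby(groups).agg(n="count", approval_rate="mean")
```

Usage would be `approval_with_counts(df, "Self_Employed")`.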

#### 6. Gender 
I think females have a higher approval rate (thinking of microloans).

In [63]:
# NOTE THERE ARE NAN genders in here - drop??

df_gender = df[['Gender', 'Loan_Status']]
df_gender.Gender.unique()

array(['Male', 'Female', nan], dtype=object)

In [71]:
df_gender.isna().sum()

Gender         13
Loan_Status     0
dtype: int64
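Before deciding whether to drop or fill the missing values, it can help to see how much is missing in every column at once. A small sketch; the helper name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of NaN values per column, most-missing first."""
    is_missing = df.isna()
    summary = pd.DataFrame({
        "n_missing": is_missing.sum(),
        "fraction_missing": is_missing.mean(),
    })
    return summary.sort_values("n_missing", ascending=False)
```

Running `missing_summary(df)` on the full frame would list every column, so gaps in `Gender`, `Credit_History` and the loan columns can be weighed together.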

In [66]:
df_f = df_gender[df_gender.Gender == 'Female']
df_f.count()

Gender         112
Loan_Status    112
dtype: int64

In [68]:
df_m = df_gender[df_gender.Gender == 'Male']
df_m.count()

Gender         489
Loan_Status    489
dtype: int64

In [72]:
# Female approval rate
df_f[df_f['Loan_Status'] == 'Y'].count()/ df_f.count()

Gender         0.669643
Loan_Status    0.669643
dtype: float64

In [77]:
# Male DECLINE rate (this filters on 'N'); the male approval rate is 1 minus this value
df_m[df_m['Loan_Status'] == 'N'].count()/ df_m.count()

Gender         0.306748
Loan_Status    0.306748
dtype: float64

In [None]:
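# Careful: 0.307 is the male decline rate, so male approval is 1 - 0.307 = 0.693 against 0.670 for females.
# The gap is small and runs opposite to my guess; other factors (like amount asked for) are probably at play.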
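Whether a gap between two groups' approval rates is more than sampling noise can be checked with a pooled two-proportion z statistic. This is one common check, not something the notebook itself uses, and the function name is made up. It needs only the approved count and group size for each group.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled two-proportion z statistic for rates x1/n1 and x2/n2.

    x1, x2 are approved counts; n1, n2 are group sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # overall approval rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero (all approved or all declined)")
    return (p1 - p2) / se
```

The group sizes are the counts printed above (112 and 489); the approved counts could come from, for example, `(df_f.Loan_Status == 'Y').sum()`. As a rough guide, an absolute value below about 2 suggests the difference could easily be noise.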

#### 7. Married / dependents
I think married applicants have a higher approval rate, and applicants with dependents do too.
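This hypothesis involves two columns at once, so a two-way table of approval rates is one way to look at it. A sketch using `pivot_table`, with a hypothetical helper name; note that `Dependents` is stored as text in this data (values such as `3+` appear), so it is treated as a category.

```python
import pandas as pd


def approval_pivot(df: pd.DataFrame, rows: str = "Married",
                   cols: str = "Dependents") -> pd.DataFrame:
    """Approval rate for each (rows, cols) combination; NaN where no rows exist."""
    data = df.assign(approved=df["Loan_Status"].eq("Y").astype(float))
    return data.pivot_table(index=rows, columns=cols,
                            values="approved", aggfunc="mean")
```

Rows with a missing `Married` or `Dependents` value are left out of the table by `pivot_table`'s default behaviour, so counts should be checked separately.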

#### 8. Loan amount
I think this is correlated with income and with the loan term.
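One way to relate the loan amount to income is a simple ratio of amount requested to combined income, compared across outcomes. A rough sketch under stated assumptions: the units of `LoanAmount` and the income columns are not given in the notebook, so the ratio is only meaningful for comparing groups, not as an absolute figure. The helper name is made up.

```python
import pandas as pd


def loan_to_income_by_status(df: pd.DataFrame) -> pd.Series:
    """Median of LoanAmount / combined income for each Loan_Status value."""
    total_income = df["ApplicantIncome"] + df["CoapplicantIncome"]
    # Treat zero combined income as missing to avoid dividing by zero.
    total_income = total_income.replace(0, float("nan"))
    ratio = df["LoanAmount"] / total_income  # NaN where LoanAmount is missing
    return ratio.groupby(df["Loan_Status"]).median()  # median skips NaN
```

A similar ratio using `LoanAmount / Loan_Amount_Term` would approximate the per-period payment, ignoring interest.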

In [25]:
# PERHAPS INCOME? REQUEST?
df.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [34]:
# Rows where the loan term is missing
df[df['Loan_Amount_Term'].isna()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
19,LP001041,Male,Yes,0,Graduate,,2600,3500.0,115.0,,1.0,Urban,Y
36,LP001109,Male,Yes,0,Graduate,No,1828,1330.0,100.0,,0.0,Urban,N
44,LP001136,Male,Yes,0,Not Graduate,Yes,4695,0.0,96.0,,1.0,Urban,Y
45,LP001137,Female,No,0,Graduate,No,3410,0.0,88.0,,1.0,Urban,Y
73,LP001250,Male,Yes,3+,Not Graduate,No,4755,0.0,95.0,,0.0,Semiurban,N
112,LP001391,Male,Yes,0,Not Graduate,No,3572,4114.0,152.0,,0.0,Rural,N
165,LP001574,Male,Yes,0,Graduate,No,3707,3166.0,182.0,,1.0,Rural,Y
197,LP001669,Female,No,0,Not Graduate,No,1907,2365.0,120.0,,1.0,Urban,Y
223,LP001749,Male,Yes,0,Graduate,No,7578,1010.0,175.0,,1.0,Semiurban,Y
232,LP001770,Male,No,0,Not Graduate,No,3189,2598.0,120.0,,1.0,Rural,Y
