### Obtaining a data set

I've put a zipped CSV file from Lending Tree containing mortgage data on OneDrive. You should all have access. You can find it here: https://glgit-my.sharepoint.com/:u:/g/personal/jobelenus_glgroup_com/EUoBwtK-k89KopT44j4DjqsB_N2IPj36kuZUmY7SpDgTwg?e=fgEzC0

I've added a column reference here: https://github.com/jobelenus/python-data-analysis-crash-course/blob/master/01-Pandas/reference.md.



### Analyzing the data

*Note: Skip Line 1! Line 2 is the header, so skip that too!*

1. Try to group the dataset by "grade" (A through G).
2. Then, for each pair of adjacent grades, check whether the highest interest rate in the lower grade (e.g. B) is greater than the lowest interest rate in the grade above it (e.g. A).

1. Try and group the data by loan status and term, to determine whether more shorter mortgages are fully paid off than longer ones

1. Try and select all the A grade mortgages, and add a new column that calculates the total amount of dollars the buyer will owe (loan_amnt * int_rate).
2. Then add another column that tells you how many years it would take to pay it off if they paid with their entire annual income each year.

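Reading in the CSV: there was an issue with 3 columns (`id`, `desc`, `verification_status_joint`), so their dtypes are set to `object`.
Skipping only 1 row, not 2: when no header is specified, pandas uses the first line it reads as the header, so line 2 must not be skipped.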
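To illustrate the `skiprows=1` choice, here is a minimal sketch on an in-memory file. The file contents and the notice line are hypothetical stand-ins, not the real data; only the `read_csv` arguments mirror the ones used in this notebook.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: one free-text line, then the real header.
raw = (
    "Some notice line that is not part of the table\n"
    "id,grade,int_rate\n"
    "1,A, 6.49%\n"
    "2,B, 11.99%\n"
)

# skiprows=1 drops only the notice line; the next line is then used as the header.
sample = pd.read_csv(io.StringIO(raw), skiprows=1, dtype={"id": object})

print(sample.columns.tolist())
print(len(sample))
```

Skipping two rows here would consume the header as well, and the first data row would be promoted to column names.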

In [69]:
import os
import pandas as pd
path = '/Users/dgriffis/code/glg/python-data-analysis-crash-course/data/LoanStats3d.csv.zip'
#pd.show_versions()

The next line reads in the zip file with the following:
loanStats_df = pd.read_csv(path, compression='infer', skiprows=1, dtype={"id": object, "desc": object, "verification_status_joint": object})

In [70]:
df = pd.read_csv(path, compression='infer', skiprows=1, dtype={"id": object, "desc": object, "verification_status_joint": object})
#df.describe()

In [71]:
#int_rate column is a string
df['int_rate'].iloc[0]

' 14.85%'

In [72]:
#so let's make it a float - strip the % and convert
df['int_rate'] = df['int_rate'].str.strip('%').astype(float)
print(df['int_rate'].iloc[0])
print(df['int_rate'].iloc[0].dtype)
#same here - remove the ' months' suffix, trim whitespace, then convert
df['term'] = df['term'].str.replace(' months', '', regex=False).str.strip().astype(float)

14.85
float64
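One thing to keep in mind when cleaning string columns like `term`: `str.strip(chars)` treats its argument as a set of characters to remove from both ends, not as a literal suffix. The sketch below assumes the raw values look like `' 36 months'` (the notebook does not show them) and compares stripping with extracting the digits directly.

```python
import pandas as pd

# Assumed shape of the raw values; the real column is not shown in this notebook.
term = pd.Series([" 36 months", " 60 months"])

# Works here only because digits are not in the character set ' months'.
stripped = term.str.strip(" months").astype(float)

# Pulling out the digits states the intent directly.
extracted = term.str.extract(r"(\d+)", expand=False).astype(float)

# The set semantics can bite: leading 'n' and 'o' are removed as well.
surprise = pd.Series(["north 36 months"]).str.strip(" months")

print(stripped.tolist(), extracted.tolist(), surprise.iloc[0])
```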


In [73]:
#Try and group the dataset by "grade" (A through G).
grades = df.groupby('grade').size().reset_index(name='counts')
grades

  grade  counts
0     A   73336
1     B  117606
2     C  120567
3     D   62654
4     E   34948
5     F    9817
6     G    2167


In [74]:
#7 rows (one per grade, A through G) and 2 columns (grade, counts)
grades.shape

(7, 2)

Then see whether the highest interest rate in B is greater than the lowest in A, and so on for each grade (i.e. compare each grade with the one above it).

Let's find the highest and lowest interest rate for each grade.

In [75]:
#pp.155 - return a series with multiple values
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])

In [76]:
frame = df.groupby('grade')
#frame = frame['int_rate'].apply(f).unstack(level=0)
frame = frame['int_rate'].apply(f).unstack(level=1)#.reset_index('grade')
print(frame)
print(frame.shape)

         min    max
grade              
A       5.32   8.19
B       6.00  11.99
C       6.00  14.99
D       6.00  18.49
E       6.00  21.99
F       6.00  26.06
G      25.80  28.99
(7, 2)
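As an alternative to the custom function plus `unstack`, the same min/max table can probably be produced with `agg`. This is a sketch on a made-up five-row frame, not the real data.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B"],
    "int_rate": [5.32, 8.19, 6.00, 11.99, 9.50],
})

# One row per grade, one column per aggregate.
summary = toy.groupby("grade")["int_rate"].agg(["min", "max"])
print(summary)
```

The result has the same layout as `frame` above (grades as the index, `min` and `max` as columns), so the later cells should work on it unchanged.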


In [77]:
test = frame.shift(-1)
print(test['max'])
print(frame['min'])
test['max'] > frame['min']

grade
A    11.99
B    14.99
C    18.49
D    21.99
E    26.06
F    28.99
G      NaN
Name: max, dtype: float64
grade
A     5.32
B     6.00
C     6.00
D     6.00
E     6.00
F     6.00
G    25.80
Name: min, dtype: float64


grade
A     True
B     True
C     True
D     True
E     True
F     True
G    False
dtype: bool
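The `False` for G comes from comparing against `NaN` (there is no grade below G), not from the data. A small sketch that rebuilds the min/max table from the output above and drops that last row; the variable names are my own, and the second check is one possible alternative reading of "overlap", not part of the exercise text.

```python
import pandas as pd

# Values copied from the min/max table printed above.
frame = pd.DataFrame(
    {
        "min": [5.32, 6.00, 6.00, 6.00, 6.00, 6.00, 25.80],
        "max": [8.19, 11.99, 14.99, 18.49, 21.99, 26.06, 28.99],
    },
    index=pd.Index(list("ABCDEFG"), name="grade"),
)

# shift(-1) lines each grade up with the next (worse) grade.
next_max_exceeds_min = (frame["max"].shift(-1) > frame["min"]).iloc[:-1]

# Alternative reading: does the next grade's lowest rate dip below this grade's highest?
ranges_overlap = (frame["min"].shift(-1) < frame["max"]).iloc[:-1]

print(next_max_exceeds_min)
print(ranges_overlap)
```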

Try and group the data by loan status and term, to determine whether more shorter mortgages are fully paid off than longer ones

In [78]:
df.groupby(['term','loan_status']).size()

term  loan_status       
36.0  Charged Off            41939
      Current                  141
      Default                   31
      Fully Paid            240752
      In Grace Period           25
      Late (16-30 days)         27
      Late (31-120 days)       258
60.0  Charged Off            32823
      Current                45722
      Default                   92
      Fully Paid             56820
      In Grace Period          608
      Late (16-30 days)        387
      Late (31-120 days)      1470
dtype: int64

In [79]:
df.groupby(['term',df['loan_status']=='Fully Paid']).size()

term  loan_status
36.0  False           42421
      True           240752
60.0  False           81102
      True            56820
dtype: int64
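Raw counts are hard to compare because there are roughly twice as many 36-month loans as 60-month ones. A sketch that turns the counts above into a fully-paid share per term (the column labels are my own):

```python
import pandas as pd

# Counts copied from the groupby output above.
table = pd.DataFrame(
    {"not_fully_paid": [42421, 81102], "fully_paid": [240752, 56820]},
    index=pd.Index([36.0, 60.0], name="term"),
)

# Share of loans in each term that are fully paid.
paid_share = table["fully_paid"] / table.sum(axis=1)
print(paid_share.round(3))
```

About 85% of the 36-month loans are fully paid versus about 41% of the 60-month loans. On the full frame, something like `(df['loan_status'] == 'Fully Paid').groupby(df['term']).mean()` should give the same shares in one step.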

Try and select all the A grade mortgages, and add a new column that calculates the total amount of dollars the buyer will owe (loan_amnt * int_rate).

![title](img/LoanCalc.png)

Loan Total Cost Formula
r = Monthly Interest Rate (in Decimal Form) =
(Yearly Interest Rate/100) / 12

P = Principal Amount on the Loan

N = Total # of Months for the loan ( Years on the loan x 12)

Example: The total cost for 5 year loan, with a principal of $25,000,
and a yearly interest rate of 6.5%:

r = (6.5 / 100) / 12 = .005416667

P = 25,000

N = (5 x 12) = 60
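Written out, the formula that the code cells below implement appears to be the standard fixed-payment amortization formula multiplied by the number of payments. This is my reading of the code, since the image itself is not reproduced here:

```latex
\text{Total cost} = N \cdot \frac{r\,P}{1 - (1 + r)^{-N}}

% Worked example with the numbers above:
% r = 0.005416667,\quad P = 25000,\quad N = 60
% \text{Total cost} \approx 29349.22
```

The fraction is the monthly payment (about 489.15 in this example), and multiplying by N gives everything paid over the life of the loan.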

In [172]:
def testLoanCalc(P,N,R):
    r = (R / 100) / 12
    print(r)
    rTimesP = r*P
    divisor = 1 - (1+r)**-N
    return rTimesP/divisor * N
    

In [173]:
total = testLoanCalc(25000,60,6.5)
total

0.005416666666666667


29349.22232809301

In [174]:
def loanCalc(row):
    P = row['loan_amnt']
    N = row['term']
    R = row['int_rate']
    r = (R / 100) / 12
    rTimesP = r*P
    divisor = 1 - (1+r)**-N
    return rTimesP/divisor * N
    

In [181]:
calcVals = df[['grade','loan_amnt','term','int_rate']] 
gradeA = calcVals[calcVals.grade == 'A'].reset_index()
gradeA['Cost of Loan'] = gradeA.apply(loanCalc,axis=1)

In [182]:
gradeA

   index  grade  loan_amnt  term  int_rate  Cost of Loan
0     11      A    10000.0  36.0      6.49  11032.002534
1     14      A    28000.0  36.0      6.49  30889.607096
2     18      A     9600.0  36.0      7.49  10748.721833
3     21      A    25000.0  36.0      7.49  27991.463108
4     30      A     6000.0  36.0      7.91   6759.690470
5     31      A    15000.0  36.0      5.32  16261.981895
6     35      A    11000.0  36.0      6.49  12135.202788
7     39      A    12000.0  36.0      5.32  13009.585516
8     42      A    18000.0  36.0      7.49  20153.853438
9     61      A    25000.0  36.0      5.32  27103.303159
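Row-wise `apply` works, but the same arithmetic can be done on whole columns at once, which is usually faster on a frame this size. A sketch, with a function name of my own choosing and two sample rows taken from values shown in this notebook:

```python
import pandas as pd

def total_loan_cost(principal, months, annual_rate_pct):
    """Total paid over the loan; accepts scalars or whole pandas columns."""
    r = annual_rate_pct / 100 / 12            # monthly rate as a decimal
    monthly_payment = r * principal / (1 - (1 + r) ** -months)
    return monthly_payment * months

# First row matches row 0 of gradeA above; second is the worked example.
loans = pd.DataFrame({
    "loan_amnt": [10000.0, 25000.0],
    "term": [36.0, 60.0],
    "int_rate": [6.49, 6.5],
})
loans["Cost of Loan"] = total_loan_cost(
    loans["loan_amnt"], loans["term"], loans["int_rate"]
)
print(loans)
```

Note that a zero interest rate would divide by zero in this formula, so such rows would need separate handling if they exist.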


Then add another column that tells you how many years it would take to pay it off if they paid with their entire annual income each year.
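One possible way to approach this last step, as a sketch only: divide the total cost by annual income. The `annual_inc` column name and the income values below are assumptions (they do not appear anywhere above), so check the column reference before relying on them.

```python
import pandas as pd

# Costs copied from the gradeA output above; annual_inc is hypothetical.
grade_a = pd.DataFrame({
    "loan_amnt": [10000.0, 28000.0],
    "Cost of Loan": [11032.002534, 30889.607096],
    "annual_inc": [50000.0, 0.0],
})

# Treat zero or negative income as missing so the division yields NaN, not inf.
income = grade_a["annual_inc"].where(grade_a["annual_inc"] > 0)
grade_a["Years to Pay Off"] = grade_a["Cost of Loan"] / income
print(grade_a)
```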