# Prosper Loan Data Exploration
## by Hao Xu

## Preliminary Wrangling


This document explores the **Prosper Loan Dataset**. Prosper is America's first peer-to-peer lending company. It publishes performance statistics on its website, and all market data is available to the public for analysis.

One important point: on July 13, 2009, Prosper reopened its website for lending ("investing") and borrowing after obtaining SEC registration for its loans (source: Wikipedia).

[Reference](https://rpubs.com/bboylowye/282962)

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [53]:
# load the dataset into a pandas DataFrame
df = pd.read_csv('prosperLoanData.csv')

In [54]:
# high-level overview of data shape and composition
print(df.shape)
print(df.dtypes)
print(df.head())

(113937, 81)
ListingKey                              object
ListingNumber                            int64
ListingCreationDate                     object
CreditGrade                             object
Term                                     int64
LoanStatus                              object
ClosedDate                              object
BorrowerAPR                            float64
BorrowerRate                           float64
LenderYield                            float64
EstimatedEffectiveYield                float64
EstimatedLoss                          float64
EstimatedReturn                        float64
ProsperRating (numeric)                float64
ProsperRating (Alpha)                   object
ProsperScore                           float64
ListingCategory (numeric)                int64
BorrowerState                           object
Occupation                              object
EmploymentStatus                        object
EmploymentStatusDuration               float64


In [55]:
# descriptive statistics for numeric variables
print(df.describe())

       ListingNumber           Term    BorrowerAPR   BorrowerRate  \
count   1.139370e+05  113937.000000  113912.000000  113937.000000   
mean    6.278857e+05      40.830248       0.218828       0.192764   
std     3.280762e+05      10.436212       0.080364       0.074818   
min     4.000000e+00      12.000000       0.006530       0.000000   
25%     4.009190e+05      36.000000       0.156290       0.134000   
50%     6.005540e+05      36.000000       0.209760       0.184000   
75%     8.926340e+05      36.000000       0.283810       0.250000   
max     1.255725e+06      60.000000       0.512290       0.497500   

         LenderYield  EstimatedEffectiveYield  EstimatedLoss  EstimatedReturn  \
count  113937.000000             84853.000000   84853.000000     84853.000000   
mean        0.182701                 0.168661       0.080306         0.096068   
std         0.074516                 0.068467       0.046764         0.030403   
min        -0.010000                -0.182700       0.

In [57]:
df.ProsperScore.isnull().sum()

29084
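
The single-column count above can be extended to every column at once. The following is a minimal sketch, assuming a DataFrame shaped like `df`; the helper name `missing_summary` and the toy frame are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(frame),
    })
    # Keep only columns that actually have gaps, most incomplete first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for the loan data (hypothetical values).
toy = pd.DataFrame({
    "ProsperScore": [np.nan, 7.0, np.nan, 4.0],
    "BorrowerAPR": [0.17, np.nan, 0.20, 0.11],
    "Term": [36, 36, 60, 36],
})
print(missing_summary(toy))
```

In the real data, the 29,084 missing `ProsperScore` values equal 113,937 minus 84,853, the non-null count that `describe()` reports for the estimated yield, loss, and return columns. This suggests, but does not prove, that these fields are missing for the same rows.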

In [45]:
df.ListingCreationDate = pd.to_datetime(df.ListingCreationDate)
df.ListingCreationDate.min()

Timestamp('2005-11-09 20:44:28.847000')

In [52]:
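# listings created in the window before the July 13, 2009 relaunch;
# the lower bound must precede the upper bound, so its year is assumed to be 2008
a = pd.Timestamp(2008,11,26)
b = pd.Timestamp(2009,7,13)
df.query('ListingCreationDate >= @a and ListingCreationDate <= @b')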



Unnamed: 0,ListingKey,ListingNumber,ListingCreationDate,CreditGrade,Term,LoanStatus,ClosedDate,BorrowerAPR,BorrowerRate,LenderYield,EstimatedEffectiveYield,EstimatedLoss,EstimatedReturn,ProsperRating (numeric),ProsperRating (Alpha),ProsperScore,ListingCategory (numeric),BorrowerState,Occupation,EmploymentStatus,EmploymentStatusDuration,IsBorrowerHomeowner,CurrentlyInGroup,GroupKey,DateCreditPulled,CreditScoreRangeLower,CreditScoreRangeUpper,FirstRecordedCreditLine,CurrentCreditLines,OpenCreditLines,TotalCreditLinespast7years,OpenRevolvingAccounts,OpenRevolvingMonthlyPayment,InquiriesLast6Months,TotalInquiries,CurrentDelinquencies,AmountDelinquent,DelinquenciesLast7Years,PublicRecordsLast10Years,PublicRecordsLast12Months,RevolvingCreditBalance,BankcardUtilization,AvailableBankcardCredit,TotalTrades,TradesNeverDelinquent (percentage),TradesOpenedLast6Months,DebtToIncomeRatio,IncomeRange,IncomeVerifiable,StatedMonthlyIncome,LoanKey,TotalProsperLoans,TotalProsperPaymentsBilled,OnTimeProsperPayments,ProsperPaymentsLessThanOneMonthLate,ProsperPaymentsOneMonthPlusLate,ProsperPrincipalBorrowed,ProsperPrincipalOutstanding,ScorexChangeAtTimeOfListing,LoanCurrentDaysDelinquent,LoanFirstDefaultedCycleNumber,LoanMonthsSinceOrigination,LoanNumber,LoanOriginalAmount,LoanOriginationDate,LoanOriginationQuarter,MemberKey,MonthlyLoanPayment,LP_CustomerPayments,LP_CustomerPrincipalPayments,LP_InterestandFees,LP_ServiceFees,LP_CollectionFees,LP_GrossPrincipalLoss,LP_NetPrincipalLoss,LP_NonPrincipalRecoverypayments,PercentFunded,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors
528,3BBA345088754410516372A,415089,2009-04-28 14:08:57.100,,36,Completed,2010-11-04 00:00:00,0.1717,0.15,0.14,,,,,,,1,OH,Food Service,Full-time,72.0,True,False,,2010-04-16 02:06:58,720.0,739.0,2002-04-09 00:00:00,15.0,13.0,24.0,12,196.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,5662.0,0.16,28288.0,24.0,1.0,0.0,0.26,"$25,000-49,999",True,2166.666667,4E7F35859430818920DE040,3.0,15.0,15.0,0.0,0.0,7500.0,0.0,121.0,0,,47,41999,10000,2010-04-29 00:00:00,Q2 2010,BBF1336465743467378DE4F,346.65,10591.53,10000.0,591.53,-39.43,0.0,0.0,0.0,0.0,1.0,0,0,0.0,220
3140,0D113451667173664D2D2EB,415577,2009-05-02 17:17:33.677,B,36,Completed,2010-02-02 00:00:00,0.11175,0.0855,0.0755,,,,,,,1,WA,Tradesman - Electrician,Full-time,57.0,True,False,,2009-04-28 20:52:48,700.0,719.0,1998-12-20 00:00:00,10.0,9.0,20.0,7,170.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,5057.0,0.3,7601.0,18.0,0.88,2.0,0.12,"$25,000-49,999",True,3833.333333,24973555335786563CA1C8D,1.0,6.0,6.0,0.0,0.0,10000.0,0.0,-38.0,0,,58,38044,2000,2009-05-14 00:00:00,Q2 2009,1177340984660368892073C,63.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,0,0.0,37
17774,5EF4345058093044810D28F,415116,2009-04-28 18:19:41.690,B,36,Completed,2010-09-15 00:00:00,0.19846,0.1615,0.1515,,,,,,,7,GA,Police Officer/Correction Officer,Full-time,201.0,False,False,,2009-04-28 18:15:44,680.0,699.0,1994-06-23 00:00:00,15.0,12.0,40.0,9,840.0,1.0,4.0,0.0,0.0,0.0,0.0,0.0,35522.0,0.81,8019.0,40.0,0.95,2.0,0.13,"$100,000+",True,10885.416667,AE48355536279360637289F,,,,,,,,,0,,58,38034,1500,2009-05-07 00:00:00,Q2 2009,E88D341489974012248DD2A,52.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0.0,27
21414,0385345033494662260733C,415310,2009-04-29 15:14:59.247,,36,Completed,2012-05-24 00:00:00,0.11094,0.09,0.08,,,,,,,7,AZ,Police Officer/Correction Officer,Full-time,163.0,False,True,5E6E3385884089066287C53,2009-08-13 09:42:23,740.0,759.0,1997-01-21 00:00:00,17.0,15.0,37.0,13,525.0,5.0,10.0,0.0,0.0,0.0,0.0,0.0,19336.0,0.19,62158.0,35.0,1.0,5.0,0.22,"$50,000-74,999",True,5344.916667,447D35665465423879F0B22,1.0,18.0,18.0,0.0,0.0,1000.0,20.14,115.0,0,,54,38672,1667,2009-09-24 00:00:00,Q3 2009,37C23406768280716C5ABBB,53.01,1680.44,1667.01,13.43,-1.5,0.0,0.0,0.0,0.0,1.0,1,0,0.0,166
24433,44773450501513236DCFCEA,415324,2009-04-29 19:34:54.347,A,36,Completed,2012-02-29 00:00:00,0.09677,0.076,0.066,,,,,,,1,GA,Engineer - Electrical,Full-time,73.0,True,False,,2009-04-28 06:20:42,740.0,759.0,1991-10-30 00:00:00,18.0,13.0,55.0,8,234.0,1.0,8.0,0.0,0.0,0.0,0.0,0.0,13432.0,0.3,31068.0,49.0,0.97,1.0,0.19,"$100,000+",True,9333.333333,B76C35557836210483B5AF3,2.0,20.0,20.0,0.0,0.0,9000.0,0.0,21.0,0,,58,38033,3000,2009-05-07 00:00:00,Q2 2009,B9293392998449090A6D88E,93.46,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0.0,59
28094,32E334522437040506DC85C,415637,2009-05-04 04:59:18.020,C,36,Defaulted,2009-12-13 00:00:00,0.2021,0.18,0.17,,,,,,,7,MA,Nurse (RN),Full-time,83.0,True,False,,2009-05-04 04:55:07,660.0,679.0,1990-12-01 00:00:00,5.0,4.0,16.0,3,299.0,0.0,11.0,0.0,0.0,0.0,0.0,0.0,12577.0,0.97,514.0,15.0,0.93,0.0,0.07,"$50,000-74,999",True,6000.0,FB5C35552935374991153FE,,,,,,,,,137,7.0,58,38041,3000,2009-05-12 00:00:00,Q2 2009,A2AD3432057914652B42204,108.46,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0.0,36
29309,343E3452187253892A48A50,415961,2009-05-06 23:55:16.543,,36,Completed,2012-11-05 00:00:00,0.12279,0.1017,0.0917,,,,,,,1,CA,Other,Full-time,33.0,False,False,,2009-09-22 23:25:51,700.0,719.0,1994-08-02 00:00:00,15.0,12.0,39.0,11,244.0,2.0,19.0,0.0,0.0,0.0,0.0,0.0,4701.0,0.17,21949.0,26.0,1.0,1.0,0.21,"$50,000-74,999",True,4333.333333,DEF735717420908286A567E,1.0,22.0,22.0,0.0,0.0,6500.0,2942.81,46.0,0,,52,39295,4500,2009-11-05 00:00:00,Q4 2009,D751340562524315508839B,145.56,5239.59,4500.01,739.58,-72.68,0.0,0.0,0.0,0.0,1.0,0,0,0.0,239
35790,36C63450215037018088662,415423,2009-04-30 16:06:43.743,C,36,Completed,2012-05-13 00:00:00,0.22153,0.1649,0.1549,,,,,,,5,IL,Sales - Retail,Full-time,12.0,False,False,,2009-04-30 16:00:30,640.0,659.0,2004-11-16 00:00:00,3.0,3.0,14.0,2,15.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,498.0,0.17,2302.0,3.0,1.0,0.0,1.27,"$1-24,999",True,66.666667,07D135567368272216AB044,1.0,10.0,10.0,0.0,0.0,1000.0,761.69,-2.0,0,,58,38037,1000,2009-05-13 00:00:00,Q2 2009,0588342364795854665007E,35.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0.0,15
45428,B7F8345209546419229B3D4,415482,2009-05-01 05:19:09.483,,36,Completed,2010-08-20 00:00:00,0.07539,0.072,0.062,,,,,,,2,NY,Teacher,Full-time,62.0,False,True,8BAB34083705675975C9772,2009-10-24 14:01:47,760.0,779.0,1999-07-26 00:00:00,21.0,15.0,57.0,16,248.0,1.0,15.0,0.0,0.0,0.0,0.0,0.0,1321.0,0.01,75795.0,52.0,1.0,0.0,0.07,"$50,000-74,999",True,5416.666667,BC883572675660194DD0E7F,1.0,10.0,10.0,0.0,0.0,1000.0,0.0,4.0,0,,52,39703,3000,2009-11-30 00:00:00,Q4 2009,E8873411403573167219827,92.91,3140.04,3000.0,140.04,-19.45,0.0,0.0,0.0,0.0,1.0,0,0,0.0,141
51937,21F63451223082614E3D321,415381,2009-04-30 08:53:14.533,D,36,Completed,2012-02-09 00:00:00,0.39951,0.3365,0.3265,,,,,,,6,GA,Military Enlisted,Full-time,155.0,False,False,,2009-04-30 08:49:57,620.0,639.0,1993-07-19 00:00:00,10.0,8.0,36.0,7,281.0,0.0,2.0,0.0,0.0,4.0,0.0,0.0,8004.0,0.91,169.0,28.0,0.82,1.0,0.22,"$25,000-49,999",True,3801.166667,E6763556456189684A817B8,1.0,11.0,11.0,0.0,0.0,5000.0,3700.97,-22.0,0,,58,38036,1000,2009-05-13 00:00:00,Q2 2009,01873419418028013A65F9B,44.48,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,0,0.0,19


In [41]:

# keep only listings created on or after the July 13, 2009 relaunch
df2 = df[df['ListingCreationDate'] >= pd.Timestamp(2009, 7, 13)]
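
The same date filter can also produce both halves of the data in one step. This is a minimal sketch on toy rows, assuming `ListingCreationDate` has already been converted with `pd.to_datetime`; the names `RELAUNCH`, `pre_relaunch`, and `post_relaunch` are hypothetical.

```python
import pandas as pd

# Relaunch date taken from the introduction (July 13, 2009).
RELAUNCH = pd.Timestamp(2009, 7, 13)

# Toy stand-in for the loan data (hypothetical rows).
toy = pd.DataFrame({
    "ListingCreationDate": pd.to_datetime([
        "2007-03-01 10:00:00",
        "2009-04-28 14:08:57",
        "2009-07-13 00:00:00",
        "2013-01-15 09:30:00",
    ]),
    "BorrowerAPR": [0.12, 0.17, 0.21, 0.19],
})

# One boolean mask, reused for both subsets so they never overlap.
is_post = toy["ListingCreationDate"] >= RELAUNCH
post_relaunch = toy[is_post].copy()  # .copy() gives an independent frame for later edits
pre_relaunch = toy[~is_post].copy()

print(len(pre_relaunch), len(post_relaunch))
```

Reusing one mask guarantees that every row lands in exactly one subset, which makes before/after comparisons easier to trust.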

In [43]:
df2.isnull().sum()

ListingKey                                 0
ListingNumber                              0
ListingCreationDate                        0
CreditGrade                            84868
Term                                       0
LoanStatus                                 0
ClosedDate                             58848
BorrowerAPR                                0
BorrowerRate                               0
LenderYield                                0
EstimatedEffectiveYield                   28
EstimatedLoss                             28
EstimatedReturn                           28
ProsperRating (numeric)                   28
ProsperRating (Alpha)                     28
ProsperScore                              28
ListingCategory (numeric)                  0
BorrowerState                              0
Occupation                              1333
EmploymentStatus                           0
EmploymentStatusDuration                  19
IsBorrowerHomeowner                        0
CurrentlyI

In [14]:
(1+0.092/24)**24-1

0.09617200601272713
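
The expression above has the form of an effective annual rate, `(1 + r/n)**n - 1`. Reading 0.092 as a nominal annual rate and 24 as the number of compounding periods per year is an assumption; the notebook does not say what the numbers represent. Under that reading, the calculation can be wrapped in a small helper (the function name is illustrative):

```python
def effective_annual_rate(nominal_rate: float, periods_per_year: int) -> float:
    """Effective annual rate for a nominal rate compounded `periods_per_year` times."""
    return (1 + nominal_rate / periods_per_year) ** periods_per_year - 1


# Same inputs as the cell above (interpretation assumed, see lead-in).
print(effective_annual_rate(0.092, 24))
```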

In [5]:
df.columns

Index(['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade',
       'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate',
       'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss',
       'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)',
       'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState',
       'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration',
       'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey',
       'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper',
       'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines',
       'TotalCreditLinespast7years', 'OpenRevolvingAccounts',
       'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries',
       'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years',
       'PublicRecordsLast10Years', 'PublicRecordsLast12Months',
       'RevolvingCreditBalance', 'BankcardUtilization',
       'Availa

### What is the structure of your dataset?

There are 113,937 records in the dataset with 81 features. Most variables are numeric, but several, such as `CreditGrade` and `ProsperRating (Alpha)`, are ordered factor variables with the following levels.
(worst) -> (best) 
CreditGrade: NC, HR, E, D, C, B, A, AA 
ProsperRating (Alpha): HR, E, D, C, B, A, AA
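
Ordered factor variables can be encoded in pandas with `CategoricalDtype`, so that sorting and plotting follow the level order rather than alphabetical order. The following is a minimal sketch on a toy frame, assuming the `ProsperRating (Alpha)` levels run HR, E, D, C, B, A, AA from worst to best.

```python
import pandas as pd

# Assumed ordering, worst to best, for the Prosper rating column.
rating_order = ["HR", "E", "D", "C", "B", "A", "AA"]
rating_dtype = pd.CategoricalDtype(categories=rating_order, ordered=True)

# Toy stand-in for the loan data (hypothetical values, one missing rating).
toy = pd.DataFrame({"ProsperRating (Alpha)": ["A", "HR", "C", None, "AA"]})
toy["ProsperRating (Alpha)"] = toy["ProsperRating (Alpha)"].astype(rating_dtype)

# Ordered categoricals sort by level and support min/max.
ratings = toy["ProsperRating (Alpha)"]
print(ratings.min(), ratings.max())
print(ratings.dropna().sort_values().tolist())
```

Missing ratings stay missing after the conversion, so the null counts shown earlier are unaffected.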

### What is/are the main feature(s) of interest in your dataset?

Prosper reports a 10.69% annualized seasoned rate of return, net of fees, for all loans issued from its re-opening after SEC registration (July 1, 2009) to September 30, 2011. Prosper's returns for this period have been independently audited by Ashland Partners & Company LLP.

A number of changes, including Prosper's decision to set the interest rates on all loans (rather than let investors choose the rates they would accept), occurred after Prosper registered with the SEC and began issuing new loan notes in July 2009. Additionally, after Prosper began setting the rates on all loans itself, it significantly tightened the minimum credit quality necessary for a borrower to receive a Prosper loan. Many borrowers who received loans prior to 2009 (which were priced by investors) would no longer qualify for a loan, at any rate, under Prosper's new underwriting policies.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!
