# Data Wrangling

The dataset downloaded from Lending Club was relatively clean, aside from a large number of empty columns. I kept 17 of the original columns and removed the rest, which were either empty, mostly null, or not of interest to the current project.

## Import Packages and Dataset

In [15]:
# Import packages
import pandas as pd

In [16]:
# Import dataframe (header=1 takes the column names from the second line of the file)
df = pd.read_csv('LoanStats3a.csv', header=1, dtype={'next_pymnt_d': object, 'id': object})

## Explore the Dataframe

First I looked at the dataframe to understand the structure and see what information is missing from the dataset.

In [17]:
# Print information about columns 1-49 (the slice omits column 0)
df.iloc[:, 1:50].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42542 entries, 0 to 42541
Data columns (total 49 columns):
member_id                     0 non-null float64
loan_amnt                     42535 non-null float64
funded_amnt                   42535 non-null float64
funded_amnt_inv               42535 non-null float64
term                          42535 non-null object
int_rate                      42535 non-null object
installment                   42535 non-null float64
grade                         42535 non-null object
sub_grade                     42535 non-null object
emp_title                     39911 non-null object
emp_length                    42535 non-null object
home_ownership                42535 non-null object
annual_inc                    42531 non-null float64
verification_status           42535 non-null object
issue_d                       42535 non-null object
loan_status                   42535 non-null object
pymnt_plan                    42535 non-null object
url  

In [18]:
# Columns 51-99 (this slice skips column 50)
df.iloc[:, 51:100].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42542 entries, 0 to 42541
Data columns (total 49 columns):
policy_code                       42535 non-null float64
application_type                  42535 non-null object
annual_inc_joint                  0 non-null float64
dti_joint                         0 non-null float64
verification_status_joint         0 non-null float64
acc_now_delinq                    42506 non-null float64
tot_coll_amt                      0 non-null float64
tot_cur_bal                       0 non-null float64
open_acc_6m                       0 non-null float64
open_act_il                       0 non-null float64
open_il_12m                       0 non-null float64
open_il_24m                       0 non-null float64
mths_since_rcnt_il                0 non-null float64
total_bal_il                      0 non-null float64
il_util                           0 non-null float64
open_rv_12m                       0 non-null float64
open_rv_24m                     

In [19]:
# Columns 101 onward (this slice skips column 100)
df.iloc[:, 101:].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42542 entries, 0 to 42541
Data columns (total 44 columns):
num_tl_90g_dpd_24m                            0 non-null float64
num_tl_op_past_12m                            0 non-null float64
pct_tl_nvr_dlq                                0 non-null float64
percent_bc_gt_75                              0 non-null float64
pub_rec_bankruptcies                          41170 non-null float64
tax_liens                                     42430 non-null float64
tot_hi_cred_lim                               0 non-null float64
total_bal_ex_mort                             0 non-null float64
total_bc_limit                                0 non-null float64
total_il_high_credit_limit                    0 non-null float64
revol_bal_joint                               0 non-null float64
sec_app_earliest_cr_line                      0 non-null float64
sec_app_inq_last_6mths                        0 non-null float64
sec_app_mort_acc                      

## Clean Dataframe

### Drop columns and rows with no data 

Looking at the dataset, at least half of the columns appear to contain only null values. I removed those columns using .dropna(). I also dropped the columns that had only 158 non-null values. This left me with 57 columns.

In [20]:
# Drop all columns and rows that contain only null values
df = df.dropna(axis=1, how='all')
df = df.dropna(axis=0, how='all')
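
As a minimal sketch of how these two calls behave, the toy frame below (hypothetical column names `a`, `b`, `c`, not the Lending Club data) has one all-null column and one all-null row. The `thresh` argument is shown only as an assumed alternative for dropping sparse columns by their non-null count instead of by position; it is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy data: column "b" is entirely null, row 2 is entirely null,
# and column "c" has only one non-null value.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [np.nan, 5.0, np.nan, np.nan],
})

# Drop columns, then rows, in which every value is null
toy = toy.dropna(axis=1, how="all")
toy = toy.dropna(axis=0, how="all")

# Assumed alternative: keep only columns with at least 2 non-null values
dense = toy.dropna(axis=1, thresh=2)

print(list(toy.columns))
print(list(dense.columns))
```

Dropping by a non-null threshold does not depend on where the sparse columns happen to sit in the file.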

In [21]:
# Keep columns 1-57 by position: this drops column 0 and the
# trailing columns with only 158 non-null values
df = df.iloc[:, 1:58]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42538 entries, 0 to 42541
Data columns (total 57 columns):
loan_amnt                     42535 non-null float64
funded_amnt                   42535 non-null float64
funded_amnt_inv               42535 non-null float64
term                          42535 non-null object
int_rate                      42535 non-null object
installment                   42535 non-null float64
grade                         42535 non-null object
sub_grade                     42535 non-null object
emp_title                     39911 non-null object
emp_length                    42535 non-null object
home_ownership                42535 non-null object
annual_inc                    42531 non-null float64
verification_status           42535 non-null object
issue_d                       42535 non-null object
loan_status                   42535 non-null object
pymnt_plan                    42535 non-null object
desc                          29243 non-null object
pu

### Drop additional columns

To focus the project, I dropped the columns that were not relevant to it. This left me with 18 columns.

I used the Lending Club data dictionary to help me decide which columns to remove from the dataframe. 
Data dictionary: https://resources.lendingclub.com/LCDataDictionary.xlsx

In [22]:
# Keep columns 1-22 by position (drops loan_amnt and every column after position 22)
df = df.iloc[:, 1:23]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42538 entries, 0 to 42541
Data columns (total 22 columns):
funded_amnt            42535 non-null float64
funded_amnt_inv        42535 non-null float64
term                   42535 non-null object
int_rate               42535 non-null object
installment            42535 non-null float64
grade                  42535 non-null object
sub_grade              42535 non-null object
emp_title              39911 non-null object
emp_length             42535 non-null object
home_ownership         42535 non-null object
annual_inc             42531 non-null float64
verification_status    42535 non-null object
issue_d                42535 non-null object
loan_status            42535 non-null object
pymnt_plan             42535 non-null object
desc                   29243 non-null object
purpose                42535 non-null object
title                  42523 non-null object
zip_code               42535 non-null object
addr_state             42535 non

In [23]:
# Drop additional columns by name
df = df.drop(columns=['funded_amnt_inv', 'desc', 'zip_code', 'delinq_2yrs'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42538 entries, 0 to 42541
Data columns (total 18 columns):
funded_amnt            42535 non-null float64
term                   42535 non-null object
int_rate               42535 non-null object
installment            42535 non-null float64
grade                  42535 non-null object
sub_grade              42535 non-null object
emp_title              39911 non-null object
emp_length             42535 non-null object
home_ownership         42535 non-null object
annual_inc             42531 non-null float64
verification_status    42535 non-null object
issue_d                42535 non-null object
loan_status            42535 non-null object
pymnt_plan             42535 non-null object
purpose                42535 non-null object
title                  42523 non-null object
addr_state             42535 non-null object
dti                    42535 non-null float64
dtypes: float64(4), object(14)
memory usage: 6.2+ MB
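
The positional slices above depend on the column order in the file. As a hedged sketch, the same kind of selection can be written by column name. The toy frame and the `keep` list below are hypothetical and use only a few of the dataset's column names with made-up values.

```python
import pandas as pd

# Toy frame with a handful of column names from the dataset (values are made up)
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0],
    "funded_amnt": [5000.0, 2500.0],
    "funded_amnt_inv": [4975.0, 2500.0],
    "term": [" 36 months", " 60 months"],
    "desc": ["text", None],
})

# Hypothetical keep-list: naming columns explicitly does not depend on their order
keep = ["funded_amnt", "term"]
subset = toy[keep]

# Same result, expressed as the columns to drop
dropped = toy.drop(columns=["loan_amnt", "funded_amnt_inv", "desc"])

print(subset.columns.tolist())
```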


### Remove empty rows

I noticed that while most columns had 42535 non-null values, there were 42538 entries in the dataframe. I used .isnull() to find the 3 empty rows and then removed them.

In [24]:
# Identify the rows in which every column is null
df[df.isnull().all(axis=1)]

Unnamed: 0,funded_amnt,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,addr_state,dti
39788,,,,,,,,,,,,,,,,,,
42540,,,,,,,,,,,,,,,,,,
42541,,,,,,,,,,,,,,,,,,


In [25]:
# Drop empty rows 
df = df.drop([39788, 42540, 42541])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42535 entries, 0 to 42537
Data columns (total 18 columns):
funded_amnt            42535 non-null float64
term                   42535 non-null object
int_rate               42535 non-null object
installment            42535 non-null float64
grade                  42535 non-null object
sub_grade              42535 non-null object
emp_title              39911 non-null object
emp_length             42535 non-null object
home_ownership         42535 non-null object
annual_inc             42531 non-null float64
verification_status    42535 non-null object
issue_d                42535 non-null object
loan_status            42535 non-null object
pymnt_plan             42535 non-null object
purpose                42535 non-null object
title                  42523 non-null object
addr_state             42535 non-null object
dti                    42535 non-null float64
dtypes: float64(4), object(14)
memory usage: 6.2+ MB
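
The find-then-drop step can be sketched on toy data as follows. The frame and its values are hypothetical. `dropna(how="all")` is shown as an equivalent that avoids typing the row labels by hand.

```python
import numpy as np
import pandas as pd

# Toy frame in which the row labelled 1 is entirely null
toy = pd.DataFrame(
    {"funded_amnt": [5000.0, np.nan, 2400.0], "grade": ["B", np.nan, "C"]},
    index=[0, 1, 2],
)

# Boolean mask: True where every column in the row is null
empty_mask = toy.isnull().all(axis=1)
empty_labels = toy.index[empty_mask].tolist()

# Drop by the labels found, or equivalently with dropna(how="all")
by_label = toy.drop(empty_labels)
by_dropna = toy.dropna(how="all")

print(empty_labels)
```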


## Fill missing data
### Employment Title

The employment title column is missing 2624 entries. I decided to replace those with 'Unknown'.

In [26]:
# Explore column
df['emp_title'].value_counts(dropna=False)

NaN                                     2624
US Army                                  139
Bank of America                          115
IBM                                       72
Kaiser Permanente                         61
AT&T                                      61
UPS                                       58
Wells Fargo                               57
USAF                                      56
US Air Force                              55
Self Employed                             49
United States Air Force                   48
Walmart                                   47
Lockheed Martin                           46
State of California                       45
Verizon Wireless                          43
U.S. Army                                 42
USPS                                      41
Walgreens                                 41
US ARMY                                   40
Self                                      39
Target                                    38
JP Morgan 

In [27]:
# Replace NaN with 'Unknown'
df['emp_title'] = df['emp_title'].fillna('Unknown')

### Annual Income

The annual income column is missing 4 values. I decided to fill those missing values with the column mean.

In [28]:
# Explore annual_inc column
df['annual_inc'].describe()

count    4.253100e+04
mean     6.913656e+04
std      6.409635e+04
min      1.896000e+03
25%      4.000000e+04
50%      5.900000e+04
75%      8.250000e+04
max      6.000000e+06
Name: annual_inc, dtype: float64

In [29]:
# Calculate the mean of annual_inc
inc_mean = df['annual_inc'].mean()
inc_mean

69136.55642025822

In [30]:
# Replace all the missing values in annual_inc with the mean
df['annual_inc'] = df['annual_inc'].fillna(inc_mean)
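
The describe() output above shows a mean (about 69,137) well above the median (59,000) and a maximum of 6,000,000, which suggests a right-skewed column. The sketch below uses a hypothetical toy series to show how mean and median imputation differ in that situation. The median variant is an assumed alternative, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy incomes with one large outlier and one missing value
income = pd.Series([40000.0, 59000.0, 82500.0, 6000000.0, np.nan])

mean_value = income.mean()      # NaN is skipped by default
median_value = income.median()  # NaN is skipped by default

# Fill the missing entry with each statistic
filled_mean = income.fillna(mean_value)
filled_median = income.fillna(median_value)

print(mean_value, median_value)
```

With only 4 missing values out of 42535, either choice has little effect on the column as a whole.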

### Title

The title column is missing 12 values. I decided to replace these NaN values with 'Unknown' as well.

In [31]:
# Explore column
df['title'].value_counts(dropna=False)

Debt Consolidation                                                            2259
Debt Consolidation Loan                                                       1760
Personal Loan                                                                  708
Consolidation                                                                  547
debt consolidation                                                             532
Home Improvement                                                               373
Credit Card Consolidation                                                      370
Debt consolidation                                                             347
Small Business Loan                                                            333
Personal                                                                       330
Credit Card Loan                                                               323
personal loan                                                                  266
Cons

In [32]:
# Replace NaN with 'Unknown'
df['title'] = df['title'].fillna('Unknown')

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42535 entries, 0 to 42537
Data columns (total 18 columns):
funded_amnt            42535 non-null float64
term                   42535 non-null object
int_rate               42535 non-null object
installment            42535 non-null float64
grade                  42535 non-null object
sub_grade              42535 non-null object
emp_title              42535 non-null object
emp_length             42535 non-null object
home_ownership         42535 non-null object
annual_inc             42535 non-null float64
verification_status    42535 non-null object
issue_d                42535 non-null object
loan_status            42535 non-null object
pymnt_plan             42535 non-null object
purpose                42535 non-null object
title                  42535 non-null object
addr_state             42535 non-null object
dti                    42535 non-null float64
dtypes: float64(4), object(14)
memory usage: 6.2+ MB


## Additional Exploration

To check whether any additional columns needed cleaning, I explored each column individually.

### Explore numeric columns

In [34]:
df.describe()

        funded_amnt  installment  annual_inc        dti
count       42535.0      42535.0     42535.0    42535.0
mean   10821.585753   322.623063    69136.56  13.373043
std     7146.914675   208.927216    64093.34   6.726315
min           500.0        15.67      1896.0        0.0
25%          5000.0       165.52     40000.0        8.2
50%          9600.0       277.69     59000.0      13.47
75%         15000.0       428.18     82500.0      18.68
max         35000.0      1305.19   6000000.0      29.99


### Explore term

In [37]:
df['term'].value_counts()

 36 months    31534
 60 months    11001
Name: term, dtype: int64

In [38]:
# Convert the term column to a category
df.term = df.term.astype('category')
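
A column with only a few distinct string values is a natural fit for the category dtype, which stores each distinct value once plus a small integer code per row. The sketch below uses a synthetic column modelled on `term` to compare memory use before and after the conversion. The row count is arbitrary.

```python
import pandas as pd

# Synthetic column with two repeated string values, as in `term`
term = pd.Series([" 36 months", " 60 months"] * 5000)

# Memory in bytes, counting the string contents themselves
as_strings = term.memory_usage(deep=True)
as_category = term.astype("category").memory_usage(deep=True)

print(as_strings, as_category)
```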

### Explore int_rate

In [39]:
df['int_rate'].value_counts()

10.99%    970
11.49%    837
13.49%    832
7.51%     787
7.88%     742
7.49%     656
11.71%    609
9.99%     607
7.90%     582
5.42%     573
11.99%    535
12.69%    492
10.37%    470
12.99%    456
6.03%     447
8.49%     445
12.42%    443
10.65%    435
11.86%    418
5.79%     410
8.90%     402
10.59%    400
7.29%     397
6.62%     396
14.27%    391
9.63%     384
9.91%     377
12.53%    356
5.99%     347
7.14%     342
         ... 
20.52%      4
12.62%      3
18.86%      3
20.69%      3
14.67%      3
14.57%      3
16.46%      3
18.49%      3
24.11%      3
22.94%      2
21.82%      2
13.84%      2
17.59%      2
17.09%      2
16.33%      2
17.28%      2
17.91%      2
20.20%      2
16.20%      1
17.72%      1
18.72%      1
17.41%      1
17.78%      1
17.44%      1
17.46%      1
21.48%      1
24.59%      1
22.64%      1
16.83%      1
24.40%      1
Name: int_rate, Length: 394, dtype: int64

In [40]:
# Strip the trailing '%' and convert the percentages to fractional floats
df['int_rate'] = df['int_rate'].str.rstrip('%').astype('float')/100.00
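
A small worked example of this conversion on a hypothetical series. The leading space in the second value is an assumption, included to show that stray whitespace can be stripped first.

```python
import pandas as pd

# Toy percentage strings; the leading space in the second one is assumed
rates = pd.Series(["10.99%", " 7.51%", "5.42%"])

# Remove surrounding whitespace and the trailing "%", then scale to a fraction
as_fraction = rates.str.strip().str.rstrip("%").astype("float") / 100.0

print(as_fraction.tolist())
```

Storing the rate as a fraction (roughly 0.11 rather than 10.99) matters later when it is used in interest calculations, so it is worth fixing one convention early.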

### Explore installment

In [41]:
df['installment'].value_counts()

311.11     68
180.96     59
311.02     54
150.80     48
368.45     46
372.12     45
330.76     43
339.31     42
317.72     42
186.61     41
301.60     41
304.36     40
187.69     40
373.33     40
276.06     39
396.92     39
155.56     39
365.23     39
312.82     39
310.10     39
303.27     37
120.64     37
325.74     37
322.63     36
187.75     36
186.67     36
152.18     36
361.92     35
307.04     35
156.41     34
           ..
275.09      1
498.91      1
204.84      1
105.82      1
111.88      1
523.14      1
215.70      1
470.33      1
400.65      1
278.92      1
349.74      1
82.05       1
713.40      1
639.03      1
1106.83     1
195.58      1
660.12      1
351.50      1
267.68      1
411.99      1
98.99       1
282.43      1
365.78      1
356.75      1
77.86       1
322.64      1
368.49      1
690.13      1
169.83      1
316.58      1
Name: installment, Length: 16459, dtype: int64

In [42]:
# installment is already float64, so this cast leaves the column unchanged
df.installment = df.installment.astype('float')

### Explore grade and sub_grade

In [43]:
df.grade.value_counts()

B    12389
A    10183
C     8740
D     6016
E     3394
F     1301
G      512
Name: grade, dtype: int64

In [44]:
# convert grade column to category
df.grade = df.grade.astype('category')

In [45]:
df.sub_grade.value_counts()

B3    2997
A4    2905
B5    2807
A5    2793
B4    2590
C1    2264
C2    2157
B2    2113
B1    1882
A3    1823
C3    1658
A2    1520
D2    1485
C4    1370
D3    1322
C5    1291
A1    1142
D4    1140
D1    1053
D5    1016
E1     884
E2     791
E3     668
E4     552
E5     499
F1     392
F2     308
F3     236
F4     211
F5     154
G1     141
G2     107
G4      99
G5      86
G3      79
Name: sub_grade, dtype: int64

In [46]:
# convert sub_grade column to category
df.sub_grade = df.sub_grade.astype('category')
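
Grades have a natural order, so an ordered categorical is one possible refinement. The sketch below is an assumed variant, not what this notebook does, and it assumes that A is the best grade and G the worst.

```python
import pandas as pd

# Toy grades
grades = pd.Series(["B", "A", "G", "C"])

# Assumed ordering: A < B < ... < G
grade_type = pd.CategoricalDtype(categories=list("ABCDEFG"), ordered=True)
ordered = grades.astype(grade_type)

# Ordered categories support comparisons and min/max by rank
worse_than_b = ordered > "B"

print(worse_than_b.tolist())
print(ordered.min(), ordered.max())
```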

### Explore emp_title

In [47]:
df.emp_title.value_counts()

Unknown                                 2624
US Army                                  139
Bank of America                          115
IBM                                       72
AT&T                                      61
Kaiser Permanente                         61
UPS                                       58
Wells Fargo                               57
USAF                                      56
US Air Force                              55
Self Employed                             49
United States Air Force                   48
Walmart                                   47
Lockheed Martin                           46
State of California                       45
Verizon Wireless                          43
U.S. Army                                 42
USPS                                      41
Walgreens                                 41
US ARMY                                   40
Self                                      39
JPMorgan Chase                            38
Best Buy  

I decided to leave this column as an object.

### Explore emp_length

In [48]:
df.emp_length.value_counts()

10+ years    9369
< 1 year     5062
2 years      4743
3 years      4364
4 years      3649
1 year       3595
5 years      3458
6 years      2375
7 years      1875
8 years      1592
9 years      1341
n/a          1112
Name: emp_length, dtype: int64

In [49]:
# convert emp_length column to category
df.emp_length = df.emp_length.astype('category')

### Explore home_ownership

In [50]:
df.home_ownership.value_counts()

RENT        20181
MORTGAGE    18959
OWN          3251
OTHER         136
NONE            8
Name: home_ownership, dtype: int64

In [51]:
# Convert to category
df.home_ownership = df.home_ownership.astype('category')

### Explore verification_status

In [52]:
df.verification_status.value_counts()

Not Verified       18758
Verified           13471
Source Verified    10306
Name: verification_status, dtype: int64

In [53]:
# Convert to category
df.verification_status = df.verification_status.astype('category')

### Explore issue_d

In [54]:
df.issue_d.value_counts()

11-Dec    2267
11-Nov    2232
11-Oct    2118
11-Sep    2067
11-Aug    1934
11-Jul    1875
11-Jun    1835
11-May    1704
11-Apr    1563
11-Mar    1448
11-Jan    1380
10-Dec    1335
11-Feb    1298
10-Oct    1232
10-Nov    1224
10-Jul    1204
10-Sep    1189
10-Aug    1175
10-Jun    1105
10-May     989
10-Apr     912
10-Mar     828
10-Feb     682
9-Nov      662
10-Jan     662
9-Dec      658
9-Oct      604
9-Sep      507
9-Aug      446
9-Jul      411
9-Jun      406
8-Mar      402
9-May      359
9-Apr      333
9-Mar      324
8-Feb      306
8-Jan      305
9-Feb      302
9-Jan      269
8-Apr      259
8-Dec      253
8-Nov      209
7-Dec      172
8-Jul      141
8-Jun      124
8-Oct      122
8-May      115
7-Nov      112
7-Oct      105
8-Aug      100
7-Aug       74
7-Jul       63
8-Sep       57
7-Sep       53
7-Jun       24
Name: issue_d, dtype: int64

### Explore loan_status

In [55]:
df.loan_status.value_counts()

Fully Paid                                             34116
Charged Off                                             5670
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Name: loan_status, dtype: int64

In [56]:
# Convert to category
df.loan_status = df.loan_status.astype('category')

### Explore pymnt_plan

In [57]:
df.pymnt_plan.value_counts()

n    42535
Name: pymnt_plan, dtype: int64

Since every row has the same value ('n'), I removed this column.

In [58]:
df = df.drop(columns='pymnt_plan')
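
A column that holds a single value carries no information for analysis. As a hedged sketch, such columns can be found programmatically with nunique(). The toy frame below is hypothetical.

```python
import pandas as pd

# Toy frame: pymnt_plan is constant, grade is not
toy = pd.DataFrame({
    "pymnt_plan": ["n", "n", "n"],
    "grade": ["A", "B", "A"],
})

# Columns with exactly one distinct value (NaN counted as a value)
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()

toy = toy.drop(columns=constant_cols)

print(constant_cols)
```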

### Explore purpose

In [59]:
df.purpose.value_counts()

debt_consolidation    19776
credit_card            5477
other                  4425
home_improvement       3199
major_purchase         2311
small_business         1992
car                    1615
wedding                1004
medical                 753
moving                  629
house                   426
educational             422
vacation                400
renewable_energy        106
Name: purpose, dtype: int64

In [60]:
# convert to category
df.purpose = df.purpose.astype('category')

### Explore title

In [62]:
df.title.value_counts()

Debt Consolidation                                                            2259
Debt Consolidation Loan                                                       1760
Personal Loan                                                                  708
Consolidation                                                                  547
debt consolidation                                                             532
Home Improvement                                                               373
Credit Card Consolidation                                                      370
Debt consolidation                                                             347
Small Business Loan                                                            333
Personal                                                                       330
Credit Card Loan                                                               323
personal loan                                                                  266
Cons

This column seems to be redundant with the purpose column. However, I am going to keep it for now because it might provide some opportunities for text mining.

### Explore addr_state

In [63]:
df.addr_state.value_counts()

CA    7429
NY    4065
FL    3104
TX    2915
NJ    1988
IL    1672
PA    1651
GA    1503
VA    1487
MA    1438
OH    1329
MD    1125
AZ     933
WA     888
CO     857
NC     830
CT     816
MI     796
MO     765
MN     652
NV     527
WI     516
SC     489
AL     484
OR     468
LA     461
KY     359
OK     317
KS     298
UT     278
AR     261
DC     224
RI     208
NM     205
NH     188
WV     187
HI     181
DE     136
MT      96
WY      87
AK      86
SD      67
VT      57
TN      32
MS      26
IN      19
IA      12
NE      11
ID       9
ME       3
Name: addr_state, dtype: int64

In [64]:
# convert to category
df.addr_state = df.addr_state.astype('category')

## Results

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42535 entries, 0 to 42537
Data columns (total 17 columns):
funded_amnt            42535 non-null float64
term                   42535 non-null category
int_rate               42535 non-null float64
installment            42535 non-null float64
grade                  42535 non-null category
sub_grade              42535 non-null category
emp_title              42535 non-null object
emp_length             42535 non-null category
home_ownership         42535 non-null category
annual_inc             42535 non-null float64
verification_status    42535 non-null category
issue_d                42535 non-null object
loan_status            42535 non-null category
purpose                42535 non-null category
title                  42535 non-null object
addr_state             42535 non-null category
dti                    42535 non-null float64
dtypes: category(9), float64(5), object(3)
memory usage: 3.3+ MB


In [66]:
df.describe()

        funded_amnt  int_rate  installment  annual_inc        dti
count       42535.0   42535.0      42535.0     42535.0    42535.0
mean   10821.585753   0.12165   322.623063    69136.56  13.373043
std     7146.914675  0.037079   208.927216    64093.34   6.726315
min           500.0    0.0542        15.67      1896.0        0.0
25%          5000.0    0.0963       165.52     40000.0        8.2
50%          9600.0    0.1199       277.69     59000.0      13.47
75%         15000.0    0.1472       428.18     82500.0      18.68
max         35000.0    0.2459      1305.19   6000000.0      29.99
