# Data Preparation

**Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
import datetime

**Importing data**

This project uses the Lending Club dataset. Lending Club is a large US peer-to-peer lending company. Several versions of the dataset exist; we use the updated one (version 2), which is available on Kaggle: https://www.kaggle.com/wendykan/lending-club-loan-data/

We divide the data into two periods: we assume that some of the data is available at the moment we build the Expected Loss models, while the rest comes from applications received afterwards. Later, we investigate whether the applications received after the Probability of Default (PD) model was built have characteristics similar to those of the applications used to build it.
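
A minimal sketch of such a two-period split, on a toy table rather than the real file. The `issue_d` column name, its `Mon-YYYY` text format, and the cutoff date are assumptions made for illustration; the notebook does not state which column or date was used.

```python
import pandas as pd

# Toy stand-in for the loan table; the values are illustrative only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "issue_d": ["Dec-2014", "Jun-2015", "Jan-2016", "Mar-2017"],
})

# Parse the month-year text into proper timestamps (first day of the month).
toy["issue_date"] = pd.to_datetime(toy["issue_d"], format="%b-%Y")

cutoff = pd.Timestamp("2016-01-01")  # hypothetical split date

# Loans issued before the cutoff form the model-building sample;
# later loans play the role of applications that arrive afterwards.
build_sample = toy[toy["issue_date"] < cutoff]
later_sample = toy[toy["issue_date"] >= cutoff]

print(len(build_sample), len(later_sample))
```

Splitting on a date rather than at random keeps the second sample strictly later in time, which is what a comparison of "earlier" and "later" applications needs.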

In [None]:
loan = r'F:\Data Analysis\Springboard\Data Science Career Track\Projects\Capstone 2\lending club loan data_version 2\loan.csv'

loan_data_backup = pd.read_csv(loan)

loan_data = loan_data_backup.copy()

**Explore Data**

In [None]:
loan_data.head()

In [None]:
loan_data.tail()

In [None]:
#Display all columns
#pd.options.display.max_columns = None
#loan_data

In [None]:
#Display all rows
#pd.options.display.max_rows = None
#loan_data

In [None]:
loan_data.columns.values

In [None]:
# Displays column names, complete (non-missing) cases per column, and datatype per column.
loan_data.info()

In [None]:
loan_data.dtypes

In [None]:
loan_data.shape

# Data Pre-processing

**Pre-processing a few continuous variables**

In [None]:
# Display unique values of a column.
loan_data['emp_length'].unique()

The `emp_length` column contains four things we have to remove or replace before it can be converted to an integer:
    1. `+ years`
    2. `< 1 year`
    3. `nan`
    4. ` years` and ` year` (each with a leading space)
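
The four clean-up steps can be tried on a small toy Series first. This is only a sketch: the sample values are made up to mirror the kinds of strings listed above, and treating a missing value as 0 years is a modelling assumption, not something the data dictates.

```python
import numpy as np
import pandas as pd

# Illustrative values only, one for each pattern listed above.
emp_length = pd.Series(["10+ years", "< 1 year", "1 year", "3 years", np.nan])

cleaned = (
    emp_length
    .str.replace("+ years", "", regex=False)    # "10+ years" -> "10"
    .str.replace("< 1 year", "0", regex=False)  # "< 1 year"  -> "0"
    .str.replace(" years", "", regex=False)     # "3 years"   -> "3"
    .str.replace(" year", "", regex=False)      # "1 year"    -> "1"
)

# Missing values pass through the string methods untouched, so they are
# filled explicitly after the numeric conversion (assumption: missing = 0).
emp_length_int = pd.to_numeric(cleaned).fillna(0).astype(int)

print(emp_length_int.tolist())
```

The order matters: `+ years` and ` years` have to be handled before ` year`, otherwise a stray `s` is left behind.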

In [None]:
# Convert the employment length from object (string) to integer. The new variable is stored as 'emp_length_int'.

# 1. Set 'emp_length_int' equal to 'emp_length' with the string '+ years' removed.
# regex=False matches '+' literally, whatever the pandas version's default for `regex` is.
loan_data['emp_length_int'] = loan_data['emp_length'].str.replace('+ years', '', regex=False)

# 2. Replace the whole string '< 1 year' with the string '0'.
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('< 1 year', str(0))

# 3. Replace the 'n/a' string with the string '0'. Values that are truly missing (NaN) are left unchanged by this step.
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('n/a',  str(0))

# 4. Replace the strings ' years' and ' year' (each with a leading space) with nothing.
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' years', '')
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' year', '')

In [None]:
# Checks the datatype of a single element of a column.
type(loan_data['emp_length_int'][0])

Now we convert it to a numeric type.
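
As a small illustration of what `pd.to_numeric` does, here it is on toy data (not the loan file). The `errors="coerce"` option is shown as one possible way to handle leftover non-numeric text; it is not used in this notebook.

```python
import pandas as pd

# Strings that all look like whole numbers convert to an integer dtype.
digits = pd.Series(["10", "0", "3"])
as_numbers = pd.to_numeric(digits)
print(as_numbers.dtype)

# With errors="coerce", entries that cannot be parsed become NaN,
# which forces the result to a float dtype.
messy = pd.Series(["10", "unknown", "3"])
coerced = pd.to_numeric(messy, errors="coerce")
print(coerced.dtype)
```

Without `errors="coerce"`, the second call would raise an error on `"unknown"`, which can be a useful signal that a clean-up step was missed.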

In [None]:
loan_data['emp_length_int'] = pd.to_numeric(loan_data['emp_length_int'])

In [None]:
type(loan_data['emp_length_int'][0])

**Let's pre-process some other variables too**
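
For a column such as `term`, whose values are text like ` 36 months`, an alternative to replacing each string one by one is to pull out the digits with a regular expression. A minimal sketch on toy data; this is a swapped-in technique, not the approach used in the cells that follow:

```python
import pandas as pd

# Illustrative values, including the leading space.
term = pd.Series([" 36 months", " 60 months", " 36 months"])

# Capture the first run of digits in each string, then convert to integer.
term_int = term.str.extract(r"(\d+)", expand=False).astype(int)

print(term_int.tolist())
```

This keeps working if a new term length appears in the data, whereas the replace-based approach needs one line per distinct value.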

In [None]:
loan_data['term'].unique()

In [None]:
# Replace the strings ' 36 months' and ' 60 months' (note the leading space) with '36' and '60'.
loan_data['term_int'] = loan_data['term'].str.replace(' 36 months',  str(36))

loan_data['term_int'] = loan_data['term_int'].str.replace(' 60 months',  str(60))

In [None]:
type(loan_data['term_int'][0])

In [None]:
# Convert to integer
loan_data['term_int'] = pd.to_numeric(loan_data['term_int'])

In [None]:
type(loan_data['term_int'][0])

Next is the earliest credit line, `earliest_cr_line`.
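
A minimal sketch of parsing month-year text into dates with an explicit format. It assumes the values look like `Apr-2001`, which is what the `%b-%Y` format string would match; the sample values below are made up.

```python
import pandas as pd

# Illustrative values only.
earliest = pd.Series(["Apr-2001", "Dec-1999", "Jan-2010"])

# %b is the abbreviated month name and %Y the four-digit year;
# the day defaults to the first of the month.
earliest_date = pd.to_datetime(earliest, format="%b-%Y")

print(earliest_date.dt.year.tolist())
print(earliest_date.dt.month.tolist())
```

Passing the format explicitly avoids relying on pandas to guess it, and makes the parse fail loudly if a value does not fit the expected pattern.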

In [None]:
# Next is 'earliest credit line'
#loan_data['earliest_cr_line']

In [None]:
type(loan_data['earliest_cr_line'][0])

In [None]:
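# 'earliest_cr_line' is a date variable stored as a string. pd.to_datetime extracts the date from a string that is in a given format.
# Note: the infer_datetime_format argument used below is deprecated as of pandas 2.0, where format inference is the default.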
#loan_data['earliest_cr_line_date'] = pd.to_datetime(loan_data['earliest_cr_line'], format= '%b-%Y')

#loan_data['earliest_cr_line_date'] = loan_data['earliest_cr_line'].apply(pd.to_datetime)

loan_data['earliest_cr_line_date'] = pd.to_datetime(loan_data['earliest_cr_line'], infer_datetime_format=True)


In [None]:
# Check the type of an element of the newly created date column.
type(loan_data['earliest_cr_line_date'][0])