### Machine Learning Project Walkthrough:
# Preparing the Features

## Recap

In the last mission, you removed all of the columns that contained redundant information, weren't useful for modeling, required too much processing to make useful, or leaked information from the future. We've exported the Dataframe from the end of the last mission to a CSV file named `filtered_loans_2007.csv` to differentiate it from the `loans_2007.csv` file we used in the last mission. In this mission, we'll prepare the data for machine learning by handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter throughout this process.

These steps are necessary because the mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To enforce this requirement, scikit-learn will raise an error if you try to train a model like linear regression or logistic regression using data that contains missing values or non-numeric values.

Let's start by computing the number of missing values and coming up with a strategy for handling them. Then, we'll focus on the categorical columns.

We can return the number of missing values across the Dataframe by:

* first using the Pandas Dataframe method [isnull](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) to return a Dataframe containing Boolean values:
  * `True` if the original value is null,
  * `False` if the original value isn't null.
* then using the Pandas Dataframe method [sum](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) to calculate the number of null values in each column.

```python
null_counts = df.isnull().sum()
```

* Read in `filtered_loans_2007.csv` as a Dataframe and assign it to `loans`.
* Use the `isnull` and `sum` methods to return the number of null values in each column. Assign the resulting Series object to `null_counts`.
* Use the `print` function to display `null_counts`.

In [2]:
import pandas as pd

In [3]:
loans = pd.read_csv('data/filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts)

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64


## Handling missing values

While most of the columns have 0 missing values, 3 columns (`title`, `revol_util`, and `last_credit_pull_d`) have 50 or fewer rows with missing values, `emp_length` has 1036, and `pub_rec_bankruptcies` has 697. As a rule of thumb, let's remove a column entirely when more than 1% of its rows contain a null value, and then remove the remaining rows containing null values.

This means that we'll keep the following columns and just remove rows containing missing values for them:

* title
* revol_util
* last_credit_pull_d

and drop the `pub_rec_bankruptcies` column entirely since more than 1% of the rows have a missing value for this column. The `emp_length` column also exceeds the 1% threshold, but the code in this mission keeps it because we clean and use it later, so its rows with missing values are removed as well.

Let's remove the `pub_rec_bankruptcies` column first and then remove all rows containing any missing values to cover both of these cases. This way, rows are only removed because of missing values in the columns we keep, not because of missing values in `pub_rec_bankruptcies`.
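The threshold-based strategy can be sketched on a small toy Dataframe. This is a minimal illustration, not the mission's solution: the data, the `toy` and `cols_to_remove` names, and the 50% threshold are assumptions chosen so the toy example triggers the rule (the walkthrough applies 1% to the real data).

```python
import numpy as np
import pandas as pd

# Toy stand-in for `loans`; the values are made up for illustration.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000, 3000],
    "title": ["Computer", "bike", None, "personel", "Personal"],
    "pub_rec_bankruptcies": [0.0, np.nan, np.nan, 0.0, np.nan],
})

# Fraction of null values in each column (mean of the Boolean mask).
null_fraction = toy.isnull().mean()

# Hypothetical threshold: 50% so that the toy data triggers it.
threshold = 0.5
cols_to_remove = null_fraction[null_fraction > threshold].index

# Drop the sparse columns first, then any rows that still contain nulls.
cleaned = toy.drop(columns=cols_to_remove).dropna()
print(cleaned)
```

Dropping the sparse column before calling `dropna` matters: in the toy data, calling `dropna` first would discard three rows because of `pub_rec_bankruptcies` alone.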

* Use the [drop method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) to remove the `pub_rec_bankruptcies` column from `loans`.
* Use the [dropna method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) to remove all rows from `loans` containing any missing values.
* Use the `dtypes` attribute followed by the `value_counts()` method to return the counts for each column data type. Use the `print` function to display these counts.

In [4]:
loans = loans.drop(['pub_rec_bankruptcies'], axis=1)
loans = loans.dropna(how='any', axis=0)
print(loans.dtypes.value_counts())

object     11
float64    10
int64       1
dtype: int64


## Text columns

While the numerical columns can be used natively with scikit-learn, the object columns that contain text need to be converted to numerical data types. Let's return a new Dataframe containing just the object columns so we can explore them in more depth. You can use the Dataframe method [select_dtypes](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html) to select only the columns of a certain data type:

```python
float_df = df.select_dtypes(include=['float'])
```

Let's select just the object columns then display a sample row to get a better sense of how the values in each column are formatted.

* Use the Dataframe method `select_dtypes` to select only the columns of `object` type from `loans` and assign the resulting Dataframe to `object_columns_df`.
* Display the first row in `object_columns_df` using the `print` function.

In [5]:
object_columns_df = loans.select_dtypes(include=['object'])
print(object_columns_df.iloc[0])

term                     36 months
int_rate                    10.65%
emp_length               10+ years
home_ownership                RENT
verification_status       Verified
purpose                credit_card
title                     Computer
addr_state                      AZ
earliest_cr_line          Jan-1985
revol_util                   83.7%
last_credit_pull_d        Jun-2016
Name: 0, dtype: object


## Converting text columns

Some of the columns seem like they represent categorical values, but we should confirm by checking the number of unique values in those columns:

* `home_ownership`: home ownership status, can only be 1 of 4 categorical values according to the data dictionary,
* `verification_status`: indicates if income was verified by Lending Club,
* `emp_length`: number of years the borrower had been employed at the time of application,
* `term`: number of payments on the loan, either 36 or 60,
* `addr_state`: borrower's state of residence,
* `purpose`: a category provided by the borrower for the loan request,
* `title`: loan title provided by the borrower.

There are also some columns that represent numeric values but need to be converted:

* `int_rate`: interest rate of the loan in %,
* `revol_util`: revolving line utilization rate or the amount of credit the borrower is using relative to all available credit, read more [here](http://blog.credit.com/2013/04/what-is-revolving-utilization-65530/).

Based on the first row's values for `purpose` and `title`, it seems like these columns could reflect the same information. Let's explore the unique value counts separately to confirm if this is true.

Lastly, some of the columns contain date values that would require a good amount of feature engineering for them to be potentially useful:

* `earliest_cr_line`: The month the borrower's earliest reported credit line was opened,
* `last_credit_pull_d`: The most recent month Lending Club pulled credit for this loan.

Since these date features require some feature engineering for modeling purposes, let's remove these date columns from the Dataframe.

## First 5 categorical columns

Let's explore the unique value counts of the columns that seem like they contain categorical values.

* Display the unique value counts for the `home_ownership`, `verification_status`, `emp_length`, `term`, and `addr_state` columns:
  * Store these column names in a list named `cols`.
  * Use a for loop to iterate over `cols`:
    * Use the `print` function combined with the Series method `value_counts` to display each column's unique value counts.

In [6]:
cols = ['home_ownership', 'verification_status', 
        'emp_length', 'term', 'addr_state']

for col in cols:
    print(loans[col].value_counts())
    

RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64
10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64
 36 months    28234
 60 months     9441
Name: term, dtype: int64
CA    6776
NY    3614
FL    2704
TX    2613
NJ    1776
IL    1447
PA    1442
VA    1347
GA    1323
MA    1272
OH    1149
MD    1008
AZ     807
WA     788
CO     748
NC     729
CT     711
MI     678
MO     648
MN     581
NV     466
SC     454
WI     427
OR     422
AL     420
LA     420
KY     311
OK     285
UT     249
KS     249
AR     229
DC     209
RI     194
NM     180
WV     164
HI     162
NH     157
DE     110
MT      77
WY      76
AK      76
SD      60
VT  

## The reason for the loan

The `home_ownership`, `verification_status`, `emp_length`, `term`, and `addr_state` columns all contain multiple discrete values. We should clean the `emp_length` column and treat it as a numerical one since the values have ordering (2 years of employment is less than 8 years).

First, let's look at the unique value counts for the `purpose` and `title` columns to understand which column we want to keep.

* Use the `value_counts` method and the `print` function to display the unique values in the following columns:
  * `purpose`
  * `title`

In [7]:
print(loans['purpose'].value_counts())
print(loans['title'].value_counts())

debt_consolidation    17751
credit_card            4911
other                  3711
home_improvement       2808
major_purchase         2083
small_business         1719
car                    1459
wedding                 916
medical                 655
moving                  552
house                   356
vacation                348
educational             312
renewable_energy         94
Name: purpose, dtype: int64
Debt Consolidation                            2068
Debt Consolidation Loan                       1599
Personal Loan                                  624
Consolidation                                  488
debt consolidation                             466
Credit Card Consolidation                      345
Home Improvement                               336
Debt consolidation                             314
Small Business Loan                            298
Credit Card Loan                               294
Personal                                       290
Consolidation Loan 

## Categorical columns

The `home_ownership`, `verification_status`, and `term` columns each contain a few discrete categorical values. We should encode these columns as dummy variables and keep them. The `emp_length` column also contains a few discrete values, but as noted above we'll convert it to a numerical column instead.

The `purpose` and `title` columns do seem to contain overlapping information, but we'll keep only the `purpose` column since it contains a small set of discrete values that we can also encode as dummy variables. In addition, the `title` column has data quality issues, since many of the values are repeated with slight modifications (e.g. `Debt Consolidation`, `Debt Consolidation Loan`, and `debt consolidation`).
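To see why `title` is awkward to use as a categorical column, here is a small sketch on a hypothetical sample of titles based on the variants in the output above. The `toy_titles` Series is an assumption for illustration. Normalizing case collapses some variants, but free-text differences still remain.

```python
import pandas as pd

# Hypothetical sample of free-text titles, modeled on the output above.
toy_titles = pd.Series([
    "Debt Consolidation",
    "debt consolidation",
    "Debt consolidation",
    "Debt Consolidation Loan",
])

# Count distinct values before and after normalizing case and whitespace.
raw_unique = toy_titles.nunique()
normalized_unique = toy_titles.str.lower().str.strip().nunique()

print(raw_unique, normalized_unique)
```

Even after normalization, "debt consolidation" and "debt consolidation loan" stay distinct, which is one reason the walkthrough keeps the cleaner `purpose` column instead.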

We can use the following mapping to clean the `emp_length` column:

* "10+ years": 10
* "9 years": 9
* "8 years": 8
* "7 years": 7
* "6 years": 6
* "5 years": 5
* "4 years": 4
* "3 years": 3
* "2 years": 2
* "1 year": 1
* "< 1 year": 0
* "n/a": 0

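We erred on the side of being conservative with the `10+ years`, `< 1 year`, and `n/a` mappings. We assume that people who may have been working more than 10 years have only worked for 10 years. We also assume that people who've worked less than a year, or whose employment information isn't available, have worked for 0 years. This is a general heuristic, but it's not perfect. Note that the `emp_length` value counts shown earlier contain no `n/a` entries, because rows with a missing `emp_length` were already removed by `dropna`, so the `n/a` mapping acts only as a safeguard here.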
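The nested-dictionary form of `replace` can be tried on a toy Dataframe first. This is a sketch under assumptions: the `toy` data and the shortened `toy_mapping` are made up, and the final `astype` call is an added precaution because the dtype left behind by `replace` may depend on the pandas version.

```python
import pandas as pd

# Toy data with explicit object dtype, mirroring the text columns in `loans`.
toy = pd.DataFrame({
    "emp_length": ["10+ years", "< 1 year", "3 years", "1 year"],
    "term": [" 36 months", " 60 months", " 36 months", " 36 months"],
}, dtype=object)

# Outer key: column name. Inner dict: old value -> new value.
toy_mapping = {
    "emp_length": {
        "10+ years": 10,
        "3 years": 3,
        "1 year": 1,
        "< 1 year": 0,
    }
}

# Only the `emp_length` column is touched; `term` is left unchanged.
toy = toy.replace(toy_mapping)

# Precaution (an assumption, not part of the mission's code):
# make the numeric dtype explicit.
toy["emp_length"] = toy["emp_length"].astype("int64")

print(toy["emp_length"].tolist())
```

Because the outer key names the column, values in other columns are never replaced, even if they happen to match a key in the inner dictionary.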

Lastly, the `addr_state` column contains many discrete values and we'd need to add 49 dummy variable columns to use it for classification. This would make our Dataframe much larger and could slow down how quickly the code runs. Let's remove this column from consideration.
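One way to estimate how many dummy columns a text column would add is to count its unique values before encoding. The sketch below uses a made-up `toy` Dataframe, and `n_categories` and `dummy_width` are illustrative names.

```python
import pandas as pd

# Toy data; the real `addr_state` column has many more distinct values.
toy = pd.DataFrame({
    "addr_state": ["AZ", "GA", "IL", "CA", "CA", "AZ"],
    "term": [" 36 months", " 60 months", " 36 months",
             " 36 months", " 60 months", " 36 months"],
})

# Number of distinct values per column.
n_categories = toy.nunique()

# With default settings, get_dummies creates one column per distinct value.
dummy_width = pd.get_dummies(toy).shape[1]

print(n_categories)
print(dummy_width)
```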


* Remove the `last_credit_pull_d`, `addr_state`, `title`, and `earliest_cr_line` columns from `loans`.
* Convert the `int_rate` and `revol_util` columns to float columns by:
  * Using the `str` accessor followed by the `rstrip` string method to strip the trailing percent sign (%):
    * `loans['int_rate'].str.rstrip('%')` returns a new Series with % stripped from the right side of each value.
  * On the resulting Series object, use the `astype` method to convert to the float type.
  * Assign the new Series of float values back to the respective columns in the Dataframe.
* Use the `replace` method to clean the `emp_length` column.

In [8]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

In [9]:
cols_to_drop = ['last_credit_pull_d',
               'addr_state', 'title',
               'earliest_cr_line']
loans = loans.drop(cols_to_drop, axis=1)

loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype('float64')

#print(loans['revol_util'].value_counts())

loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype('float64')

loans = loans.replace(mapping_dict)

## Dummy variables

Let's now encode the `home_ownership`, `verification_status`, `purpose`, and `term` columns as dummy variables so we can use them in our model. We first need to use the Pandas [get_dummies](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function to return a new Dataframe containing a new column for each dummy variable:

```python
# Returns a new Dataframe containing 1 column for each dummy variable.
dummy_df = pd.get_dummies(loans[["term", "verification_status"]])
```

We can then use the [concat](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) method to add these dummy columns back to the original Dataframe:

```python
loans = pd.concat([loans, dummy_df], axis=1)
```
                  
and then drop the original columns entirely using the `drop` method:

```python
loans = loans.drop(["verification_status", "term"], axis=1)
```
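Put together, the three steps can be run end to end on a toy Dataframe to see what the resulting columns look like. The data below is made up for illustration, with `term` values keeping the leading space that appears in the value counts earlier.

```python
import pandas as pd

# Toy stand-in for `loans`.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400],
    "term": [" 36 months", " 60 months", " 36 months"],
    "verification_status": ["Verified", "Source Verified", "Not Verified"],
})

cols = ["term", "verification_status"]

# 1. Build one dummy column per distinct value in each selected column.
dummy_df = pd.get_dummies(toy[cols])

# 2. Attach the dummy columns, then 3. drop the original text columns.
toy = pd.concat([toy, dummy_df], axis=1).drop(cols, axis=1)

print(toy.columns.tolist())
```

Each dummy column is named `<original column>_<value>`, so the leading space in the `term` values carries over into names such as `term_ 36 months`.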

* Encode the `home_ownership`, `verification_status`, `purpose`, and `term` columns as dummy variables:
  * Optionally, use the Series method `astype` to convert each column to the `category` data type (`get_dummies` also encodes `object` columns directly, as the code below does).
  * Use the `get_dummies` function to return a Dataframe containing the dummy columns.
  * Use the `concat` function to add these dummy columns back to `loans`.
  * Remove the original, non-dummy columns (`home_ownership`, `verification_status`, `purpose`, and `term`) from `loans`.

In [10]:
cols_to_encode = ['home_ownership', 'verification_status', 
                  'purpose','term']

dummy_df = pd.get_dummies(loans[cols_to_encode])
loans = pd.concat([loans, dummy_df], axis=1)
    
loans = loans.drop(cols_to_encode, axis=1)

## Next Steps

In this mission, we performed the final data preparation steps needed to start training machine learning models. We converted all of the columns to numerical values because the scikit-learn models we'll use require numerical data with no missing values. In the next mission, we'll experiment with training models and evaluating accuracy using cross-validation.
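Before moving on, a quick sanity check can confirm that a Dataframe meets both requirements. The helper below is a hypothetical sketch (`is_model_ready` is not part of pandas or scikit-learn), shown on made-up data. On the real data you would call it on `loans`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_model_ready(df):
    """Return True if df has no nulls and only numeric (or Boolean) columns."""
    no_nulls = not df.isnull().any().any()
    all_numeric = all(is_numeric_dtype(dtype) for dtype in df.dtypes)
    return no_nulls and all_numeric


# Made-up examples: one clean Dataframe, one with a null and a text column.
ready = pd.DataFrame({"loan_amnt": [5000, 2500], "int_rate": [10.65, 15.27]})
not_ready = pd.DataFrame({
    "loan_amnt": [5000, None],
    "term": [" 36 months", " 60 months"],
})

print(is_model_ready(ready))
print(is_model_ready(not_ready))
```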