# Instructor Do: Dealing with Categorical Data in ML

In [1]:
# initial imports
import pandas as pd
from path import Path

## Dataset Information

The file `loans_data.csv` contains simulated loan data, with a total of 500 records. Each row represents a loan application made during an arbitrary year, and the columns hold the following data about each application.

* `amount`: The loan amount in USD.
* `term`: The loan term in months.
* `month`: The month of the year when the loan was requested.
* `age`: Age of the loan applicant.
* `education`: Educational level of the loan applicant.
* `gender`: Gender of the loan applicant.
* `bad`: Stands for a bad or good loan applicant (`1` - bad, `0` - good).

In [2]:
# Load data
file_path = Path("../Resources/loans_data.csv")
loans_df = pd.read_csv(file_path)
loans_df.head()

Unnamed: 0,amount,term,month,age,education,gender,bad
0,1000,30,June,45,High School or Below,male,0
1,1000,30,July,50,Bachelor,female,0
2,1000,30,August,33,Bachelor,female,0
3,1000,15,September,27,college,male,0
4,1000,30,October,28,college,female,0


In [3]:
# Binary encoding using Pandas (single column)
loans_binary_encoded = pd.get_dummies(loans_df, columns=["gender"])
loans_binary_encoded.head()

Unnamed: 0,amount,term,month,age,education,bad,gender_female,gender_male
0,1000,30,June,45,High School or Below,0,0,1
1,1000,30,July,50,Bachelor,0,1,0
2,1000,30,August,33,Bachelor,0,1,0
3,1000,15,September,27,college,0,0,1
4,1000,30,October,28,college,0,1,0


In [4]:
# Binary encoding using Pandas (multiple columns)
loans_binary_encoded = pd.get_dummies(loans_df, columns=["education", "gender"])
loans_binary_encoded.head()

Unnamed: 0,amount,term,month,age,bad,education_Bachelor,education_High School or Below,education_Master or Above,education_college,gender_female,gender_male
0,1000,30,June,45,0,0,1,0,0,0,1
1,1000,30,July,50,0,1,0,0,0,1,0
2,1000,30,August,33,0,1,0,0,0,1,0
3,1000,15,September,27,0,0,0,0,1,0,1
4,1000,30,October,28,0,0,0,0,1,1,0


## Integer Encoding

In [6]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df2 = loans_df.copy()
df2['education'] = le.fit_transform(df2['education'])

In [7]:
df2.head()

Unnamed: 0,amount,term,month,age,education,gender,bad
0,1000,30,June,45,1,male,0
1,1000,30,July,50,0,female,0
2,1000,30,August,33,0,female,0
3,1000,15,September,27,3,male,0
4,1000,30,October,28,3,female,0


## Custom Encoding

In [8]:
# Creating an instance of label encoder
label_encoder = LabelEncoder()
loans_df["month_le"] = label_encoder.fit_transform(loans_df["month"])
loans_df.head()

Unnamed: 0,amount,term,month,age,education,gender,bad,month_le
0,1000,30,June,45,High School or Below,male,0,6
1,1000,30,July,50,Bachelor,female,0,5
2,1000,30,August,33,Bachelor,female,0,1
3,1000,15,September,27,college,male,0,11
4,1000,30,October,28,college,female,0,10


In [9]:
# Months dictionary
months_num = {
    "January": 1,
    "February": 2,
    "March": 3,
    "April": 4,
    "May": 5,
    "June": 6,
    "July": 7,
    "August": 8,
    "September": 9,
    "October": 10,
    "November": 11,
    "December": 12,
}



In [10]:
# Months' names encoded using the dictionary values
loans_df["month_num"] = loans_df["month"].apply(lambda x: months_num[x])
loans_df.head()



Unnamed: 0,amount,term,month,age,education,gender,bad,month_le,month_num
0,1000,30,June,45,High School or Below,male,0,6,6
1,1000,30,July,50,Bachelor,female,0,5,7
2,1000,30,August,33,Bachelor,female,0,1,8
3,1000,15,September,27,college,male,0,11,9
4,1000,30,October,28,college,female,0,10,10


In [11]:
# Drop the month and month_le columns
loans_df = loans_df.drop(["month", "month_le"], axis=1)
loans_df.head()

## 17.6.1
### Encode Labels With Pandas
It's often said that much of a data scientist's time is spent cleaning and preparing data. In machine learning, too, the data rarely comes ready for analysis.

One of the tasks involved in data preparation for machine learning is to convert textual data into numerical data.

While many datasets contain categorical features (e.g., M or F), machine learning algorithms typically only work with numerical data. Categorical and text data must therefore be converted to numerical data for use in machine learning—which is what we'll do in this section.

First, download the files you'll need for this task.

Download 17-6-1-label_encode.zip. The notebook cells above contain the code from this download.

Download the files and open the Jupyter Notebook. We'll first import the modules we'll use and open the dataset in a Pandas DataFrame with the following code:

    import pandas as pd
    from path import Path

    file_path = Path("../Resources/loans_data.csv")
    loans_df = pd.read_csv(file_path)
    loans_df.head()

A preview of the DataFrame reveals seven columns: six features and a target. The dataset contains simulated loan data. There are 500 records, and each row represents a loan application, showing its features and the target.

The dataset includes the following columns:

* `amount`: The loan amount in U.S. dollars.
* `term`: The loan term in months.
* `month`: The month of the year when the loan was requested.
* `age`: Age of the loan applicant.
* `education`: Educational level of the loan applicant.
* `gender`: The sex of the loan applicant.
* `bad`: Status of the application (1: bad, or denial; 0: good, or approval).

**important**
Scikit-learn's algorithms only understand numeric data.

To use Scikit-learn's machine learning algorithms, the text features (month, education, and gender) will have to be converted into numbers. This process is called encoding. Furthermore, the steps taken to prepare the data to make them usable for building machine learning models are called preprocessing. Encoding text labels into numerical values is one preprocessing step. Later we'll discuss scaling, another preprocessing step.
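One way to find the columns that still need encoding is to ask Pandas for every column whose dtype is not numeric. The sketch below uses a small hand-made DataFrame (`toy_df`, with made-up values) in place of `loans_data.csv`, so it is an illustration rather than part of the original notebook.

```python
import pandas as pd

# Toy rows that mirror the loans_data.csv columns; the values are made up.
toy_df = pd.DataFrame(
    {
        "amount": [1000, 800],
        "term": [30, 15],
        "month": ["June", "July"],
        "age": [45, 50],
        "education": ["High School or Below", "Bachelor"],
        "gender": ["male", "female"],
        "bad": [0, 1],
    }
)

# Any column that is not numeric still has to be encoded.
text_columns = toy_df.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

On this toy data the list contains `month`, `education`, and `gender`, the same three text features named above.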

The first and simplest encoding we'll perform on this dataset is of the gender column, which contains only two values: male and female. We'll convert these values into numerical ones with the pd.get_dummies() function:

    loans_binary_encoded = pd.get_dummies(loans_df, columns=["gender"])
    loans_binary_encoded.head()

The function is called with two arguments:

* The first argument to pd.get_dummies() is the DataFrame.
* The second, the columns keyword argument, lists the column or columns to be encoded.

The pd.get_dummies() function encodes the gender column, converting its text values into numerical ones.

The gender column has split into two columns, gender_female and gender_male, with each column now containing 0 (false) or 1 (true). Since the first row represents a male loan applicant, the gender_female column reads 0 and the gender_male column reads 1.
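Whether the new columns show 0/1 or True/False depends on the Pandas version: releases from 2.0 onward return boolean columns by default. The sketch below, on a made-up `toy_df`, passes `dtype=int` to request 0/1 explicitly. It is one possible way to reproduce the output shown in this lesson, not a step from the original notebook.

```python
import pandas as pd

# Made-up rows; only the gender column is categorical.
toy_df = pd.DataFrame(
    {"amount": [1000, 800, 300], "gender": ["male", "female", "female"]}
)

# dtype=int asks for 0/1 integers instead of True/False.
encoded = pd.get_dummies(toy_df, columns=["gender"], dtype=int)
print(encoded)
```

The first row is a male applicant, so `gender_male` is 1 and `gender_female` is 0 for that row.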

It's also possible to encode multiple columns at the same time.

    loans_binary_encoded = pd.get_dummies(loans_df, columns=["education", "gender"])
    loans_binary_encoded.head()

The pd.get_dummies() function encodes both columns in a single call.

As before, the gender column has split into two columns. The education column has split into four columns (education_Bachelor, education_High School or Below, education_Master or Above, and education_college), each containing 0 or 1. If a loan applicant has a bachelor's degree, the education_Bachelor column will read 1, and the other three will read 0. For an applicant whose education level is high school or below, the education_Bachelor, education_Master or Above, and education_college columns will be 0, and the education_High School or Below column will show 1.
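A quick sanity check on this kind of encoding is that, within each group of new columns, every row contains exactly one 1. The sketch below checks this on a made-up `toy_df`. The `education_cols` helper list is introduced only for this illustration.

```python
import pandas as pd

# Made-up rows with two categorical columns.
toy_df = pd.DataFrame(
    {
        "education": ["Bachelor", "college", "High School or Below"],
        "gender": ["female", "male", "male"],
    }
)

encoded = pd.get_dummies(toy_df, columns=["education", "gender"], dtype=int)

# Collect the columns that came from the original education column.
education_cols = [c for c in encoded.columns if c.startswith("education_")]

# Each applicant has exactly one education level, so every row sums to 1.
row_sums = encoded[education_cols].sum(axis=1).tolist()
print(row_sums)
```

Only categories that actually occur in the data get a column, so this toy example produces three education columns rather than four.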

## 17.6.2
### Encode Labels With Scikit-learn
Pandas, as you have seen, offers tools to encode your data. Scikit-learn offers another way to encode your labels.

Scikit-learn's LabelEncoder class can also transform text into numerical data. Let's look at an example. Continue down the notebook from the preceding section:

In [12]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df2 = loans_df.copy()
df2['education'] = le.fit_transform(df2['education'])

The code includes the following elements:

* After the import, an instance of LabelEncoder is created and assigned to the variable le.
* A copy of the original loans_df is created for this example, but this step is not required to use the label encoder.
* The label encoder's fit_transform() method first fits the encoder (it learns the distinct categories), then converts the text data into numerical data.

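The result is a numerical encoding of the education column. In contrast to pd.get_dummies(), the label encoder keeps a single column and assigns an integer between 0 and 3 to each of the education categories. The categories are numbered in sorted order: Bachelor is 0, High School or Below is 1, Master or Above is 2, and college is 3 (uppercase letters sort before lowercase ones). The applicant in the first row, for example, has the value 1, which represents high school or below.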
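To see which integer the encoder assigned to each category, you can inspect its `classes_` attribute after fitting. The sketch below uses a made-up `education` Series with the four category names from this dataset. The `mapping` dictionary is a helper introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample containing each education category once.
education = pd.Series(
    ["High School or Below", "Bachelor", "college", "Master or Above"]
)

le = LabelEncoder()
codes = le.fit_transform(education)

# classes_ holds the categories in sorted order; a category's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
```

The encoder's `inverse_transform()` method goes the other way, turning the integers back into the original text labels.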

It's also possible to create custom encoding functions. To understand why this might be useful, let's first try LabelEncoder on the month column. The goal is to transform each month into its position in the calendar year: January should become 1, since it's the first month of the year, and July should become 7, since it's the seventh.

The month column has been numerically encoded with LabelEncoder.

Note that a new instance of LabelEncoder was created here as label_encoder. The result is not the calendar order we wanted: LabelEncoder numbers the categories alphabetically, starting at 0, so August is converted to 1 instead of 8, and July is converted to 5 instead of 7.
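The unexpected numbers come from LabelEncoder sorting the month names alphabetically before numbering them from 0. The sketch below reproduces this on a plain list of the twelve month names, which stands in for the month column. `month_codes` is a helper dictionary introduced for illustration.

```python
from sklearn.preprocessing import LabelEncoder

months = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(months)

# Pair each month name with the integer the encoder assigned to it.
month_codes = dict(zip(months, codes.tolist()))
print(month_codes["July"], month_codes["August"])
```

April sorts first and receives 0, August is second and receives 1, and so on through September at 11.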

Instead, we can create a dictionary of the months of the year and apply a custom function to convert the month names to their corresponding integers:

In [13]:
# months_num = {
#    "January": 1,
#    "February": 2,
#    "March": 3,
#    "April": 4,
#    "May": 5,
#    "June": 6,
#    "July": 7,
#    "August": 8,
#    "September": 9,
#    "October": 10,
#    "November": 11,
#    "December": 12,
# }
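As an alternative to typing the dictionary by hand, the standard library's `calendar` module can generate it. This is a sketch that assumes the default English locale, because `calendar.month_name` is locale-dependent. It is not how the original notebook builds `months_num`.

```python
import calendar

# calendar.month_name[0] is an empty string, so indexing starts at 1.
months_num = {calendar.month_name[i]: i for i in range(1, 13)}
print(months_num["June"])
```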

In [14]:
# loans_df["month_num"] = loans_df["month"].apply(lambda x: months_num[x])

The following actions are taking place:

* A transformation is made to the values of the month column, and the transformed values are placed in the month_num column.
* The apply() method runs the function inside its parentheses on each element of the month column.
* The lambda function takes an argument (x), and returns months_num[x]. For example, if the value in the month column is "June", the function returns months_num["June"], which is 6.

**rewind**
Lambda functions are anonymous Python functions.
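Pandas also offers `Series.map()`, which accepts the dictionary directly and gives the same result as the lambda for known months. The sketch below uses a shortened `months_num` and a made-up Series with a deliberate typo to show one difference: `map()` yields NaN for a value missing from the dictionary, whereas `months_num[x]` inside the lambda would raise a KeyError.

```python
import pandas as pd

# Shortened dictionary, for illustration only.
months_num = {"June": 6, "July": 7, "August": 8}

# "Augst" is a deliberate typo to show how unknown values are handled.
months = pd.Series(["June", "July", "Augst"])

# map() looks each value up in the dictionary; unknown values become NaN.
month_num = months.map(months_num)
print(month_num.isna().tolist())
```

Which behavior is preferable depends on the data: an error surfaces bad values immediately, while NaN lets you count and inspect them afterward.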

The DataFrame's month_num column now displays each month as a number:

The custom encoding generates the correct numerical value for each month in the month_num column.

The code in the next cell is merely cleanup—it drops the unnecessary columns related to the month:

In [15]:
# loans_df = loans_df.drop(["month", "month_le"], axis=1)
# loans_df.head()

In [None]:
# Goal is to make things into 0s and 1s; 1 if true, 0 if false