This example is based on the examples posted on GitHub for the [Feature Engineering for Machine Learning Course](https://github.com/solegalli/feature-engineering-for-machine-learning).


## Categorical Variables
The values of a categorical variable are selected from a group of **categories**, also called **labels**. Examples are gender (male or female) and marital status (single, married, divorced, or widowed).
Categorical variables can be further divided into:
- **Ordinal variables**: the categories can be meaningfully ordered (e.g., height: tall, medium, short)
- **Nominal variables**: the categories have no inherent order or ranking (e.g., marital status)
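
The distinction can be made explicit in pandas. The sketch below is only an illustration on made-up values (the series `height` and `marital` are hypothetical and not part of the loans dataset): an ordered categorical supports comparisons, while an unordered one does not carry any ranking.

```python
import pandas as pd

# Hypothetical toy data, not taken from the loans dataset.
height = pd.Series(["tall", "short", "medium", "short"])

# Ordinal: declare the categories in their meaningful order.
height_ordinal = pd.Series(
    pd.Categorical(height, categories=["short", "medium", "tall"], ordered=True)
)

# Ordered categoricals support min/max and comparisons.
print(height_ordinal.min(), height_ordinal.max())
print((height_ordinal > "short").tolist())

# Nominal: categories without any ranking.
marital = pd.Series(["single", "married", "widowed"], dtype="category")
print(marital.cat.ordered)
```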

Sometimes categorical variables are coded as numbers (e.g., gender may be coded as 0 for males and 1 for females). The variable is still categorical, despite the use of numbers.
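
As a small sketch of this point, the code below uses a made-up column (`gender_code` is a hypothetical name) with the coding from the text. pandas reads the codes as integers, so it is up to us to treat the column as categorical, for example by mapping the codes back to labels.

```python
import pandas as pd

# Hypothetical toy column using the coding from the text: 0 = male, 1 = female.
gender_code = pd.Series([0, 1, 1, 0, 1])

# pandas infers an integer dtype, even though arithmetic on the codes is meaningless.
print(gender_code.dtype)

# Map the codes back to labels and store them as a categorical variable.
gender = gender_code.map({0: "male", 1: "female"}).astype("category")
print(gender.value_counts())
```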

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Let's load the loans dataset.
df = pd.read_csv('./datasets/loan.csv')
df.head()

In [None]:
# Let's inspect the variable householder,
# which indicates whether the borrowers own their home,
# or if they are renting, among other things.

df['householder'].unique()

In [None]:
# Let's make a bar plot with the number of loans
# disbursed in each category of home ownership.

# The code below counts the number of observations (customers)
# in each category and then makes a bar plot.

fig = df['householder'].value_counts().plot.bar()
fig.set_title('Householder')
fig.set_ylabel('Number of customers')

The majority of the borrowers either own their house with a mortgage or rent their property. A few borrowers own their homes completely.
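
To put numbers on a statement like this, the counts can be turned into proportions with `value_counts(normalize=True)`. The sketch below runs on made-up labels and counts (assumptions for illustration only); on the real data, the same call would be applied to `df['householder']`.

```python
import pandas as pd

# Hypothetical labels and counts, for illustration only.
householder = pd.Series(["MORTGAGE"] * 5 + ["RENT"] * 4 + ["OWNER"] * 1)

# normalize=True returns the share of observations per category.
shares = householder.value_counts(normalize=True)
print(shares.round(2))
```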

In [None]:
df['householder'].value_counts()

In [None]:
# The "loan_purpose" variable is another categorical variable
# that indicates how the borrowers intend to use the
# money they are borrowing. For example, to improve their
# house, or to pay off existing debt.

df['loan_purpose'].unique()

Debt consolidation means that the borrower will use the loan to pay off previous debts; car purchase means that the borrower will use the money to buy a car; and so on. The variable describes the intended use of the loan.

In [None]:
# Let's make a bar plot with the number of borrowers
# in each category.

# The code below counts the number of observations (borrowers)
# per category and then makes a plot.

fig = df['loan_purpose'].value_counts().plot.bar()
fig.set_title('Loan Purpose')
fig.set_ylabel('Number of customers')

The majority of borrowers plan to use the money for debt consolidation, that is, to combine their existing debts into a single loan.

In [None]:
# Let's look at one additional categorical variable:
# "market", which represents the risk band assigned to the borrower.
df = df.rename(columns={"market": "risk_band"})
df['risk_band'].sort_values().unique()

In [None]:
# Let's make a bar plot with the number of borrowers
# per category.

fig = df['risk_band'].value_counts().sort_index().plot.bar()
fig.set_title('Risk Band')
fig.set_ylabel('Number of customers')

Most customers are assigned to risk bands B and C. Bands A and B carry the lowest risk, and band E the highest. The higher the risk, the more likely the customer is to default, so finance companies charge higher interest rates on those loans.
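
One way to check the link between risk band and default on data is to compare the default rate per band. This is a minimal sketch on made-up rows: the frame `toy` and the 0/1 column `defaulted` are hypothetical, and the numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration; 1 = defaulted, 0 = did not default.
toy = pd.DataFrame({
    "risk_band": ["A", "A", "B", "B", "C", "C", "E", "E"],
    "defaulted": [0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the default rate per band.
default_rate = toy.groupby("risk_band")["defaulted"].mean()
print(default_rate)
```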

## Binary variables

In [None]:
# A binary variable can take only 2 values.
# E.g., the variable "loan_status": the loan either defaulted (1) or did not (0).
df = df.rename(columns={"target": "loan_status"})
df['loan_status'].unique()

In [None]:
# Let's make a bar plot with the number of loans per loan status.
loan_status_counts = df['loan_status'].value_counts()
fig = loan_status_counts.plot.bar()
fig.set_title('Status of the Loan')
fig.set_xlabel('Loan Status')
fig.set_ylabel('Number of loans')

# Annotate each bar with its count, centered just above the bar.
for indx, value in enumerate(loan_status_counts):
    fig.text(indx, value, str(value), ha='center', va='bottom',
             color='brown', fontweight='bold')

As we can see, the variable takes only 2 values, 0 and 1, and the majority of the loans have not defaulted.
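
Both observations can also be checked programmatically. The sketch below uses a made-up 0/1 series (the values are an assumption for illustration); on the real data, the same calls would be applied to `df['loan_status']`.

```python
import pandas as pd

# Made-up 0/1 values for illustration; 1 = defaulted.
loan_status = pd.Series([0, 0, 1, 0, 0, 1, 0, 0])

# A binary variable has exactly two distinct non-missing values.
is_binary = loan_status.dropna().nunique() == 2

# For a 0/1 variable, the mean equals the share of 1s.
default_share = loan_status.mean()
print(is_binary, default_share)
```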

In [None]:
# Let's make a pie chart showing the percentage of loans that defaulted (1) or not (0).
loan_status_counts = df['loan_status'].value_counts()
print(loan_status_counts)

loan_status_counts = loan_status_counts.rename({0: 'Paid', 1: 'Defaulted'})
print(loan_status_counts)

total = loan_status_counts.values.sum()
print(f'Loans count: {total}')

def fmt(x):
    return '{:.1f}%\n{:.0f}'.format(x, total*x/100)

plt.pie(x=loan_status_counts.values, labels=loan_status_counts.index, autopct=fmt)

In [None]:
# Finally, let's look at a variable that is stored as a number
# but whose values carry no quantitative meaning.

df['customer_id'].head()

Each id corresponds to a single customer. The number is assigned only to identify each customer uniquely.

In [None]:
# The variable has as many different id values as customers:
# in this case 10000.

df['customer_id'].nunique()
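
The same idea gives a quick way to spot identifier-like columns: they have one distinct value per row. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Made-up frame for illustration only.
toy = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "loan_purpose": ["car", "debt", "debt", "car"],
})

# is_unique is True when no value repeats.
print(toy["customer_id"].is_unique)

# Ratio of distinct values to rows; a ratio of 1.0 suggests an identifier.
cardinality_ratio = toy.nunique() / len(toy)
print(cardinality_ratio)
```

A ratio of 1.0 is only a hint, since a continuous numerical variable can also have all-distinct values.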