# Exercise 3.02: Applying Label Encoding to Transform Categorical Variables into Numerical Variables

In this exercise, we will use one of the preprocessing techniques we just learned, label encoding, to transform all categorical variables into numerical ones. Most scikit-learn estimators expect numeric input, so this step is needed before training a model on this dataset.

The following steps will help you complete this exercise:

1.- Import the `pandas` package as `pd`. Also import `preprocessing` from `sklearn`, the import name of scikit-learn.

In [1]:
import pandas as pd
from sklearn import preprocessing

2.- Create a new variable called `file_url`, which will contain the URL to the raw dataset. Use the `data/german_credit.csv` file.

In [2]:
file_url = "https://raw.githubusercontent.com/applied-data-mining-master/syllabus_intelligencesystems/main/data/german_credit.csv"

3.- Load the data into a DataFrame called `df` using the `pd.read_csv()` function.

In [3]:
df = pd.read_csv(file_url)

4.- Define a function called `fit_encoder()` that takes a DataFrame and a column name as parameters and fits a label encoder on the values of that column. You will use `LabelEncoder()` and `.fit()` from `preprocessing` and `.unique()` from pandas (which returns the distinct values of a DataFrame column):

In [4]:
def fit_encoder(df, columna):
    # Fit a LabelEncoder on the distinct values of the given column
    label_encoder = preprocessing.LabelEncoder()
    label_encoder.fit(df[columna].unique())

    return label_encoder
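
To see what fitting actually stores, here is a minimal sketch on a small made-up column. The `toy_df` DataFrame and its values are assumptions for illustration and are not loaded from the dataset. After fitting, the encoder's `classes_` attribute holds the distinct values in sorted order, and each value's position in that array becomes its integer code.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data, not the German credit dataset
toy_df = pd.DataFrame({"housing": ["own", "rent", "for free", "own"]})

label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(toy_df["housing"].unique())

# classes_ stores the distinct values in sorted order;
# the position of a value in this array is its integer code
classes = list(label_encoder.classes_)
print(classes)
```

Because the codes follow alphabetical order, they carry no meaning beyond identifying the category.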

5.- Define a function called `encode()` that takes a DataFrame, a column name, and a label encoder as parameters and transforms the values of the column using the label encoder. You will use the `.transform()` method to do this:

In [5]:
def encode(df, columna, label_encoder):
    # Replace each value of the column with its integer label
    transform = label_encoder.transform(df[columna])
    return transform
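
As a rough illustration of what `.transform()` returns, the sketch below encodes a made-up `job` column and then reverses the encoding with `.inverse_transform()`. The `toy_df` data is hypothetical and only stands in for a real column.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data, not the German credit dataset
toy_df = pd.DataFrame({"job": ["skilled", "unskilled", "skilled", "management"]})

label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(toy_df["job"].unique())

# transform() maps every value to the index of its class
codes = label_encoder.transform(toy_df["job"])

# inverse_transform() maps the integer codes back to the original labels
decoded = label_encoder.inverse_transform(codes)

print(codes.tolist())
print(decoded.tolist())
```

Note that `.transform()` raises an error for values the encoder did not see during fitting, which is why the encoder is fitted on all distinct values of the column first.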

6.- Create a new DataFrame called `cat_df` that contains only non-numeric columns and print its first five rows.

  > **Hint**  
  > You will use the `.select_dtypes()` method from pandas and specify `exclude='number'`.
  
Output:

![Figure 3.5](img/fig3_05.jpg)

In [6]:
cat_df = df.select_dtypes(exclude='number')
cat_df.head()

| | account_check_status | credit_history | purpose | savings | present_emp_since | other_debtors | property | other_installment_plans | housing | job | telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | < 0 DM | critical account/ other credits existing (not ... | domestic appliances | unknown/ no savings account | .. >= 7 years | none | real estate | none | own | skilled employee / official | yes, registered under the customers name | yes |
| 1 | 0 <= ... < 200 DM | existing credits paid back duly till now | domestic appliances | ... < 100 DM | 1 <= ... < 4 years | none | real estate | none | own | skilled employee / official | none | yes |
| 2 | no checking account | critical account/ other credits existing (not ... | (vacation - does not exist?) | ... < 100 DM | 4 <= ... < 7 years | none | real estate | none | own | unskilled - resident | none | yes |
| 3 | < 0 DM | existing credits paid back duly till now | radio/television | ... < 100 DM | 4 <= ... < 7 years | guarantor | if not A121 : building society savings agreeme... | none | for free | skilled employee / official | none | yes |
| 4 | < 0 DM | delay in paying off in the past | car (new) | ... < 100 DM | 1 <= ... < 4 years | none | unknown / no property | none | for free | skilled employee / official | none | yes |
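
The effect of `exclude='number'` can be sketched on a small made-up DataFrame. The column names below echo the dataset, but the values are assumptions for illustration only. The complementary call with `include='number'` is shown for comparison.

```python
import pandas as pd

# Hypothetical toy data with two numeric and two text columns
toy_df = pd.DataFrame({
    "age": [67, 22, 49],
    "credit_amount": [1169.0, 5951.0, 2096.0],
    "housing": ["own", "own", "for free"],
    "foreign_worker": ["yes", "yes", "no"],
})

# Keep only the columns whose dtype is not numeric
non_numeric = toy_df.select_dtypes(exclude="number")

# The complementary selection keeps integer and float columns
numeric = toy_df.select_dtypes(include="number")

print(list(non_numeric.columns))
print(list(numeric.columns))
```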


7.- Create a list called `cat_cols` that contains the column names of `cat_df` and print its content.

  > **Hint**  
  > You will use `.columns` from pandas to do this.
  
Output:

```
Index(['account_check_status', 'credit_history', 'purpose', 'savings',
       'present_emp_since', 'other_debtors', 'property',
       'other_installment_plans', 'housing', 'job', 'telephone',
       'foreign_worker'],
      dtype='object')
```

In [7]:
cat_cols = cat_df.columns
cat_cols
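
Strictly speaking, `.columns` returns a pandas `Index` rather than a plain Python list. An `Index` can be iterated over in a `for` loop just like a list, so the exercise works either way. If a real list is needed, `.tolist()` converts it, as in this small sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"housing": ["own"], "job": ["skilled"]})

cols_index = toy_df.columns          # pandas Index object
cols_list = toy_df.columns.tolist()  # plain Python list of column names

print(type(cols_index).__name__)
print(cols_list)
```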


8.- Create a `for` loop that will iterate through each column from `cat_cols`, fit a label encoder using `fit_encoder()`, and transform the column with the `encode()` function.

In [8]:
for columna in cat_cols:
    label_encoder = fit_encoder(df, columna)
    df[columna] = encode(df, columna, label_encoder)
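
The loop above overwrites `label_encoder` on every iteration, so only the encoder for the last column survives. If the original labels are needed later, one possible variation (not part of the exercise) is to keep each fitted encoder in a dictionary. The sketch below uses made-up data, a hypothetical `encoders` dictionary, and `fit_transform()` in place of the separate `fit_encoder()` and `encode()` helpers.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data, not the German credit dataset
toy_df = pd.DataFrame({
    "housing": ["own", "rent", "own"],
    "telephone": ["none", "yes", "none"],
    "age": [67, 22, 49],
})

encoders = {}  # hypothetical name: maps column name -> fitted LabelEncoder
for columna in toy_df.select_dtypes(exclude="number").columns:
    encoder = preprocessing.LabelEncoder()
    # fit_transform() fits the encoder and encodes the column in one call
    toy_df[columna] = encoder.fit_transform(toy_df[columna])
    encoders[columna] = encoder

# The stored encoder can recover the original labels of a column
original_housing = encoders["housing"].inverse_transform(toy_df["housing"])

print(toy_df)
print(original_housing.tolist())
```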

9.- Print the first five rows of `df`.

Output:

![Figure 3.6](img/fig3_06.jpg)

In [9]:
df.head()

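| | default | account_check_status | duration_in_month | credit_history | purpose | credit_amount | savings | present_emp_since | installment_as_income_perc | other_debtors | present_res_since | property | age | other_installment_plans | housing | credits_this_bank | job | people_under_maintenance | telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 6 | 1 | 4 | 1169 | 4 | 0 | 4 | 2 | 4 | 2 | 67 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 48 | 3 | 4 | 5951 | 1 | 2 | 2 | 2 | 2 | 2 | 22 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 2 | 0 | 3 | 12 | 1 | 0 | 2096 | 1 | 3 | 2 | 2 | 3 | 2 | 49 | 1 | 1 | 1 | 3 | 2 | 0 | 1 |
| 3 | 0 | 1 | 42 | 3 | 7 | 7882 | 1 | 3 | 2 | 1 | 4 | 0 | 45 | 1 | 0 | 1 | 1 | 2 | 0 | 1 |
| 4 | 1 | 1 | 24 | 2 | 2 | 4870 | 1 | 2 | 3 | 2 | 4 | 3 | 53 | 1 | 0 | 2 | 1 | 2 | 0 | 1 |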


We have successfully encoded non-numeric columns. Now, our DataFrame contains only numeric values.
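
One way to confirm this programmatically rather than by eye is to check the dtype of every column. The sketch below is an assumption-laden illustration: `all_numeric()` is a hypothetical helper, and the two small DataFrames are made up to show a passing and a failing case.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data: one fully encoded frame, one still containing text
encoded_df = pd.DataFrame({"housing": [1, 1, 0], "age": [67, 22, 49]})
mixed_df = pd.DataFrame({"housing": ["own", "own", "for free"], "age": [67, 22, 49]})


def all_numeric(frame):
    # True only if every column has a numeric dtype
    return all(is_numeric_dtype(frame[col]) for col in frame.columns)


print(all_numeric(encoded_df))
print(all_numeric(mixed_df))
```

Applied to the exercise, `all_numeric(df)` would be expected to return `True` once the loop in step 8 has run.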