Link to Medium blog post: https://python.plainenglish.io/label-encoding-in-python-machine-learning-fa971a751317

# Label Encoding in Python — Machine Learning

In [2]:
!pip install scikit-learn
!pip install pandas

## Step 1: Create a dataframe with the required data

In [3]:
import pandas as pd

df = {'Position': ['Customer Service','Manager','Assistant Manager','Director'],
    'Salary': [44000,75000,65000,90000]
    }

df = pd.DataFrame(df)

First, we import the pandas library, which we need to create a pandas dataframe. Then we define a Python dictionary df and convert it to a dataframe under the same name.

Let’s take a look at the result:

In [4]:
print(df)

            Position  Salary
0   Customer Service   44000
1            Manager   75000
2  Assistant Manager   65000
3           Director   90000


## Step 2.1: Label encoding in Python using current order

In [5]:
df['code'] = pd.factorize(df['Position'])[0]

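We create a new feature “code” and assign to it the categorical feature “Position” converted to numerical format. pd.factorize() returns a pair (codes, unique values), so [0] selects the codes.

By default, the numbers in “code” follow the order in which the values first appear in the original dataframe df: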

In [6]:
print(df)

            Position  Salary  code
0   Customer Service   44000     0
1            Manager   75000     1
2  Assistant Manager   65000     2
3           Director   90000     3
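
To make the behaviour of pd.factorize() more concrete, here is a minimal sketch that keeps both parts of its return value. The extra “Manager” row and the variable names (positions, codes, uniques, decoded) are assumptions added for illustration and are not part of the original example.

```python
import pandas as pd

# Same positions as above, plus one repeated value (added for illustration only)
positions = pd.Series(['Customer Service', 'Manager', 'Assistant Manager',
                       'Director', 'Manager'])

# factorize returns (codes, uniques): codes[i] is the index of positions[i] in uniques
codes, uniques = pd.factorize(positions)

print(codes.tolist())    # [0, 1, 2, 3, 1] - the repeated value reuses its code
print(uniques.tolist())  # unique labels in order of first appearance

# Indexing uniques with the codes recovers the original labels
decoded = uniques[codes]
print(decoded.tolist() == positions.tolist())  # True
```

Keeping uniques around is what lets you translate the numbers back into labels later.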


## Step 2.2: Label encoding in Python using alphabetical order

This case is a little more interesting, as we can achieve the same result with either of the two methods mentioned earlier: scikit-learn or pandas.

scikit-learn method

In [7]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['code'] = le.fit_transform(df['Position'])

We first import LabelEncoder from the scikit-learn library and define le as an instance of it. Then we apply it to the “Position” feature to convert it to numerical format and store the result as a new feature “code”.

What’s interesting about this method is that LabelEncoder() sorts the unique values before assigning codes, which for strings means alphabetical order, without us having to specify anything.

Let’s take a look at what we arrived at:

In [8]:
print(df)

            Position  Salary  code
0   Customer Service   44000     1
1            Manager   75000     3
2  Assistant Manager   65000     0
3           Director   90000     2


LabelEncoder() sorted the values in the “Position” feature alphabetically and assigned the codes 0 to 3 in the following sequence: Assistant Manager, Customer Service, Director, Manager.
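
As a small sketch of how to inspect that sequence, a fitted LabelEncoder exposes the sorted labels in its classes_ attribute and can reverse the encoding with inverse_transform(). The variable names (positions, mapping, restored) are assumptions chosen for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

positions = ['Customer Service', 'Manager', 'Assistant Manager', 'Director']

le = LabelEncoder()
codes = le.fit_transform(positions)

# classes_ holds the sorted unique labels; a label's position in it is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels
restored = le.inverse_transform(codes)
print(list(restored) == positions)  # True
```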

pandas method

In [9]:
df['code'] = pd.factorize(df['Position'], sort=True)[0]

The only difference from Step 2.1, where we kept the original order, is the added “sort=True” parameter. It tells pandas to sort the unique values of “Position” (alphabetically, for strings) before assigning the numerical codes.

Let’s take a look at the result:

In [10]:
print(df)

            Position  Salary  code
0   Customer Service   44000     1
1            Manager   75000     3
2  Assistant Manager   65000     0
3           Director   90000     2


We can see that both the scikit-learn method and the pandas method generate the same result.
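
If you prefer to check this in code rather than by eye, the following sketch compares the two sets of codes directly. It assumes NumPy is available (it is installed as a dependency of both libraries); the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

positions = pd.Series(['Customer Service', 'Manager', 'Assistant Manager', 'Director'])

# Codes from scikit-learn (unique labels are always sorted)
sk_codes = LabelEncoder().fit_transform(positions)

# Codes from pandas with sorting switched on
pd_codes, pd_uniques = pd.factorize(positions, sort=True)

same = bool(np.array_equal(sk_codes, pd_codes))
print(same)  # True
```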

## Step 2.3: Label encoding in Python using “Salary” feature order

As we discussed in the Understanding Label Encoding section, this is most likely the most algorithm-friendly way to convert categorical features to numeric format.

In general, algorithms that treat the codes as magnitudes work better when there is some logic behind the numerical value assignment, such as a sequence or a hierarchy. A meaningful order also makes your results easier to interpret.

In [11]:
df = df.sort_values(by=['Salary'])

df['code'] = pd.factorize(df['Position'])[0]

We already know from Step 2.1 that the numbers in “code” by default follow the order of rows in the dataframe. So we first sort df by the “Salary” feature, then convert the “Position” feature to numerical format and store it as “code”. Note that the sorted dataframe keeps its original index labels, which is why the index in the output below reads 0, 2, 1, 3.

Let’s take a look at the result:

In [12]:
print(df)

            Position  Salary  code
0   Customer Service   44000     0
2  Assistant Manager   65000     1
1            Manager   75000     2
3           Director   90000     3
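
Sorting changes the row order of df. If you would rather keep the rows where they are, one possible alternative (a sketch, not the method used above) is to derive a position-to-code mapping from the salary order and apply it with map(). The names ordered and mapping are assumptions introduced for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Position': ['Customer Service', 'Manager', 'Assistant Manager', 'Director'],
    'Salary': [44000, 75000, 65000, 90000],
})

# Positions listed from lowest to highest salary
ordered = df.sort_values(by='Salary')['Position'].drop_duplicates().tolist()

# Each position's rank in that list becomes its code
mapping = {position: code for code, position in enumerate(ordered)}

# Apply the mapping without reordering the dataframe
df['code'] = df['Position'].map(mapping)

print(df['code'].tolist())  # [0, 2, 1, 3]
```

Each position receives the same code as in the sorted output above, but the dataframe stays in its original order.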
