# Encoding categorical variables

## What is a categorical variable?

In layman's terms, a categorical variable is a quantity whose value is a label, such as a string, rather than a number. It takes one of a limited set of possible categories.

## Why do we need to encode the categorical variables?

Usually in machine learning, both the data fed to the model and the results it calculates are numeric.

For example, say we want to predict which crop a farmer should grow. Let the inputs be the ambient temperature, the average rainfall and the soil type (red, black, loamy, sandy) of the farm land. The output would be the name of a crop.

A problem arises.

Not all inputs are numeric: the soil types are strings. The output is not numeric either.

One way would be to assign a number to each soil type, say 'red' = 0, 'black' = 1, 'loamy' = 2, 'sandy' = 3. This is called Label Encoding.
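
As a minimal sketch of the mapping described above, the codes can be applied with a plain dictionary and pandas. The sample values and the names `soil` and `soil_codes` are made up for illustration.

```python
import pandas as pd

# Hypothetical soil samples, made up for illustration
soil = pd.Series(['red', 'black', 'loamy', 'sandy', 'red'])

# The mapping from the text: 'red' = 0, 'black' = 1, 'loamy' = 2, 'sandy' = 3
soil_codes = {'red': 0, 'black': 1, 'loamy': 2, 'sandy': 3}

# Replace each label with its integer code
encoded = soil.map(soil_codes)

print(encoded.tolist())  # [0, 1, 2, 3, 0]
```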

Label Encoding has a major flaw. A model may treat the codes as ordered quantities and give more weight to 'sandy' than to 'red' simply because the number assigned to 'sandy' is greater, even though soil types have no natural order.
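
A small arithmetic sketch of the issue, reusing the codes from the example above: the integer codes make 'sandy' look three times as far from 'red' as 'black' is, although nothing about the soils justifies that.

```python
# Codes from the label-encoding example in the text
codes = {'red': 0, 'black': 1, 'loamy': 2, 'sandy': 3}

# Differences between codes that a model could mistake for real distances
gap_red_black = abs(codes['red'] - codes['black'])
gap_red_sandy = abs(codes['red'] - codes['sandy'])

print(gap_red_black, gap_red_sandy)  # 1 3
```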

Hence we need to encode the variables in a different manner.

## How do we encode the variables?

Python libraries such as scikit-learn and category_encoders provide various encoding methods:
1. Ordinal
2. OneHot (Most commonly used)
3. Binary
4. BaseN
5. Hashing
6. Sum
7. Helmert
8. Backward Difference
9. Polynomial
10. Target
11. LeaveOneOut

As OneHotEncoder is the most widely used of these, we will focus on it.

## OneHotEncoder

Label encoding is straightforward, but it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of hierarchy or order. A common alternative that addresses this ordering issue is ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column, and each row gets a 1 in the column of its own category and a 0 in all the others (notation for true/false).
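
To make the idea concrete, here is a hand-rolled sketch (not the scikit-learn implementation) that builds one-hot vectors for the four soil types from the rows of an identity matrix. The names `categories` and `one_hot` are illustrative.

```python
import numpy as np

categories = ['red', 'black', 'loamy', 'sandy']

# Row i of the identity matrix is the one-hot vector for category i
one_hot = {cat: vec for cat, vec in zip(categories, np.eye(len(categories)))}

print(one_hot['loamy'])  # [0. 0. 1. 0.]

# Every pair of distinct categories is equally far apart, so no order is implied
dist_red_black = np.linalg.norm(one_hot['red'] - one_hot['black'])
dist_red_sandy = np.linalg.norm(one_hot['red'] - one_hot['sandy'])
print(np.isclose(dist_red_black, dist_red_sandy))  # True
```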

# Implementation of OneHotEncoder

In [1]:
import numpy as np
import pandas as pd

Now we load the dataset.

In [2]:
crime_data = pd.read_csv('Crime_2015.csv')

In [3]:
print(crime_data)

                                              MSA ViolentCrime  Murder  Rape  \
0                              Abilene, TX M.S.A.        412.5     5.3  56.0   
1                                Akron, OH M.S.A.        238.4     5.1  38.2   
2                               Albany, GA M.S.A.        667.9     7.8  30.4   
3                               Albany, OR M.S.A.        114.3     2.5  28.2   
4                          Albuquerque, NM M.S.A.        792.6     6.1  63.8   
..                                            ...          ...     ...   ...   
373                   Guayama, Puerto Rico M.S.A.        251.6    11.4   6.3   
374                  Mayaguez, Puerto Rico M.S.A.        237.5    11.5   5.2   
375                     Ponce, Puerto Rico M.S.A.        231.4    18.0   5.0   
376                San German, Puerto Rico M.S.A.         92.1     5.4   4.6   
377  San Juan-Carolina-Caguas, Puerto Rico M.S.A.        262.0    20.6   4.9   

     Robbery  AggravatedAssault Propert

As we can see, the data contains a categorical variable in the form of city names (the MSA column).

Here, we will encode the State column.

# OneHotEncoding

We import the OneHotEncoder class from sklearn.preprocessing.

In [4]:
from sklearn.preprocessing import OneHotEncoder

Next, we create an object of the OneHotEncoder class.

In [5]:
"""
OneHotEncoder(categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

params:
1. categories (unique values per feature): 
    'auto' (default): determine categories automatically from training data
    list: categories[i] holds the categories expected in the ith column
    
2. drop (specifies a methodology to use to drop one of the categories per feature)
    None (default): retain all features
    'first': drop first category in each feature
    array: drop[i] is the category in the ith feature that should be dropped
    
3. sparse: will return sparse matrix if set true, else array
    (named sparse_output from scikit-learn 1.2 onwards)

4. dtype: desired dtype of output
    np.float64 (default)
    
5. handle_unknown: whether to raise an error or ignore if an unknown categorical feature is present during transform 
    'error' (default): raise error
    'ignore': unknown category column would be set to zeros
"""


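# np.float has been removed from recent NumPy releases, so np.float64 is used instead.
# Sparse output is the default, so the sparse flag (sparse_output from scikit-learn 1.2) is left unset.
enc = OneHotEncoder(categories = 'auto', drop = None, dtype = np.float64, handle_unknown = 'error')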
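
The effect of `drop` and `handle_unknown` is easier to see on a toy input. The sketch below uses a made-up soil-type column rather than the crime data; `train`, `enc_ignore` and `enc_drop` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: one categorical feature (soil type)
train = np.array([['red'], ['black'], ['loamy'], ['sandy']])

# handle_unknown='ignore': a category not seen during fit becomes an all-zero row
enc_ignore = OneHotEncoder(handle_unknown='ignore')
enc_ignore.fit(train)
unknown_row = enc_ignore.transform(np.array([['clay']])).toarray()
print(unknown_row)  # [[0. 0. 0. 0.]]

# drop='first': the first category in sorted order ('black') gets no column
enc_drop = OneHotEncoder(drop='first')
dropped = enc_drop.fit_transform(train).toarray()
print(dropped.shape)  # (4, 3)
```

With `drop='first'`, the 'black' row is encoded as all zeros, which removes one redundant column per feature.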

In [6]:
""" 
fit_transform(self, X, y = None)

params:
X: array-like, the data to encode
y: None(default)
returns a sparse matrix if sparse = True, else a 2-d array.

"""

"""
Here we convert the sparse matrix to a dense array and wrap it in a DataFrame.
"""
enc_df = pd.DataFrame(enc.fit_transform(crime_data[['State']]).toarray())

In [7]:
print(enc_df)

      0    1    2    3    4    5    6    7    8    9   ...   48   49   50  \
0    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  1.0   
1    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
2    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
3    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
4    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
..   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
373  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
374  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
375  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
376  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
377  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  1.0  0.0  0.0   

      51   52   53   54   55   56   57  
0    0.0  0.0  0.0  0.0  0.0  0.0 


Next, we join the encoded columns to the original State column to see how the data has been encoded.


In [8]:
states_df = pd.DataFrame(crime_data['State'])

In [9]:
states_df = states_df.join(enc_df)

In [10]:
states_df

Unnamed: 0,State,0,1,2,3,4,5,6,7,8,...,48,49,50,51,52,53,54,55,56,57
0,TX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,OH,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,GA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,OR,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,NM,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373,"Guayama, Puerto Rico M.S.A.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
374,"Mayaguez, Puerto Rico M.S.A.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
375,"Ponce, Puerto Rico M.S.A.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
376,"San German, Puerto Rico M.S.A.",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
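
The integer column names make it hard to tell which column belongs to which state. One possible way to label them, sketched here on a small made-up frame rather than the crime data, is to read the fitted categories from the encoder's `categories_` attribute. The names `demo`, `demo_enc` and `state_labels` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for crime_data[['State']]
demo = pd.DataFrame({'State': ['TX', 'OH', 'GA', 'TX']})

demo_enc = OneHotEncoder()
encoded = demo_enc.fit_transform(demo[['State']]).toarray()

# categories_ holds one array of fitted categories per input column
state_labels = ['State_' + s for s in demo_enc.categories_[0]]

# Attach the labelled indicator columns to the original column
labelled = demo.join(pd.DataFrame(encoded, columns=state_labels, index=demo.index))
print(labelled.columns.tolist())  # ['State', 'State_GA', 'State_OH', 'State_TX']
```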


# Dummy variables approach

We can also encode with dummy variables using pandas. This approach is more flexible, as we can encode any number of columns in one call and choose how to label the new columns using a prefix. Proper naming makes the data analysis process easier.

Here, we are going to encode the names of famous cricketers.

In [32]:
names = ('Sachin Tendulkar', 'Virat Kohli', 'MS Dhoni', 'Rahul Dravid', 'VVS Laxman')

names_df = pd.DataFrame(names, columns = ['Names of cricketers'])

names_df

| | Names of cricketers |
|---|---|
| 0 | Sachin Tendulkar |
| 1 | Virat Kohli |
| 2 | MS Dhoni |
| 3 | Rahul Dravid |
| 4 | VVS Laxman |


In [33]:
"""
get_dummies(data, prefix = None, prefix_sep = '_', dummy_na = False, columns = None, sparse = False, drop_first = False, 
            dtype = None)
            
params:
data: Data of which we want to get dummy indicators.
      array-like, Series, or DataFrame
prefix: String to prepend to the new column names.
        Pass a list with length equal to the number of columns being encoded.
prefix_sep: default '_'. Separator placed between the prefix and the category value.
dummy_na: Add a column to indicate NaNs; if False, NaNs are ignored.
columns: column names in df to be encoded
drop_first: Whether to remove the first categorical level
dtype: Data type for new columns.

returns: DataFrame object
"""


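# dtype = int keeps the indicators as 0/1; recent pandas versions default to True/False.
dummy_df = pd.get_dummies(names_df, columns = ["Names of cricketers"], prefix = ["Type_is"], prefix_sep = ' ', dtype = int)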

In [34]:
names_df = names_df.join(dummy_df)
names_df

| | Names of cricketers | Type_is MS Dhoni | Type_is Rahul Dravid | Type_is Sachin Tendulkar | Type_is VVS Laxman | Type_is Virat Kohli |
|---|---|---|---|---|---|---|
| 0 | Sachin Tendulkar | 0 | 0 | 1 | 0 | 0 |
| 1 | Virat Kohli | 0 | 0 | 0 | 0 | 1 |
| 2 | MS Dhoni | 1 | 0 | 0 | 0 | 0 |
| 3 | Rahul Dravid | 0 | 1 | 0 | 0 | 0 |
| 4 | VVS Laxman | 0 | 0 | 0 | 1 | 0 |
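
As a further sketch of the flexibility mentioned above, `get_dummies` can encode several columns in one call, with one prefix per column, and `drop_first` removes the first level of each. The `farms` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data with two categorical columns
farms = pd.DataFrame({
    'soil': ['red', 'black', 'red'],
    'season': ['kharif', 'rabi', 'rabi'],
})

# Encode both columns at once; drop_first removes 'black' and 'kharif'
farm_dummies = pd.get_dummies(
    farms,
    columns=['soil', 'season'],
    prefix=['soil', 'season'],
    drop_first=True,
    dtype=int,  # keep indicators as 0/1 rather than booleans
)

print(farm_dummies.columns.tolist())  # ['soil_red', 'season_rabi']
```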


### Conclusion
Categorical variables are encountered frequently in Data Science. I hope this notebook helps you understand them better and shows you how to encode them.