# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Ungraded Additional Notebook: Approaching Categorical Variable

## Learning Objectives

At the end of the experiment, you will be able to

* identify categorical variables and their types
* understand the significance of encoding categorical variables
* understand and implement different methods to encode categorical variables using a real-life dataset
* understand and handle miscellaneous categorical variables

## Introduction




All machine learning models are mathematical models that need numbers to work with. Categorical data take one of a limited set of possible values (categories), and these values are often stored as text. For example, **Gender**: Male/Female/Others, **Ranks**: 1st/2nd/3rd, etc.

In a data science project, once the missing values in the dataset have been handled, the next task is to handle the categorical data before applying any ML model.

First, let’s understand the types of categorical data:
1. **Nominal Data:** Nominal data is also called labelled or named data. The categories have no inherent order, so rearranging them does not change their meaning. For example, Gender (Male/Female/Other), Airport (JFK/LGA/EWR), etc.
2. **Ordinal Data:** Ordinal data represents discrete, ordered units. It is like nominal data, except that the categories have a rank, so their order must not be changed. For example, Ranks: 1st/2nd/3rd, Education: (High School/Undergrads/Postgrads/Doctorate), etc.
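
The distinction between the two types can be sketched with pandas' `Categorical` type. This is only an illustration on made-up values: the variable names `origin`, `education_levels` and `education` are hypothetical and are not used later in the notebook.

```python
import pandas as pd

# Nominal: the categories have no inherent order
origin = pd.Categorical(['JFK', 'LGA', 'EWR', 'JFK'], ordered=False)

# Ordinal: the order is meaningful and has to be stated explicitly
education_levels = ['High School', 'Undergrad', 'Postgrad', 'Doctorate']
education = pd.Categorical(
    ['Postgrad', 'High School', 'Doctorate', 'Undergrad'],
    categories=education_levels,
    ordered=True,
)

print(origin.ordered)     # whether pandas treats the categories as ranked
print(education.ordered)

# min/max and comparisons are only meaningful for ordered categoricals
print(education.min(), education.max())
print(education < 'Postgrad')
```

Declaring the order up front means that any later integer codes follow the intended ranking rather than alphabetical order.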

Regardless of what the value is used for, the challenge is determining how to use this (categorical) data in the analysis because of the following constraints:

* Categorical features may have a very large number of levels, known as high cardinality, (for example, cities or URLs), where most of the levels appear in a relatively small number of instances.
* Many machine learning models, such as regression or SVM, are algebraic. This means that their input must be numerical. To use these models, categories must be transformed into numbers first, before you can apply the learning algorithm on them.
* While some ML packages or libraries might transform categorical data to numeric automatically based on some default embedding method, many other ML packages don’t support such inputs.
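
To see why algebraic models need numeric input, here is a minimal sketch. The weights and feature values are hypothetical; the only point is that a dot product works on numbers and fails on raw category strings.

```python
import numpy as np

weights = np.array([0.5, -1.2])        # hypothetical model weights

numeric_row = np.array([3.0, 1.0])     # numeric features work directly
print(weights @ numeric_row)

raw_row = ['UA', 'JFK']                # raw categorical values
try:
    np.array(raw_row, dtype=float)     # conversion to numbers fails
    converted = True
except ValueError as err:
    converted = False
    print('Cannot use raw categories:', err)
```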

### Import required libraries

In [None]:
import numpy as np
import pandas as pd
import copy
import seaborn as sns
import matplotlib.pyplot as plt

### General Exploration steps for Categorical Data

In this notebook, we'll focus on dealing with categorical features in the **nycflights13 dataset**. This dataset is a collection of data pertaining to different airlines flying from different airports in NYC, also capturing flight, plane and weather specific details during the year 2013. It contains information about the on-time departure of all flights from NYC (i.e. JFK, LGA or EWR airports) in 2013.



In [None]:
#@title Download the data
from IPython.display import clear_output
!wget https://cdn.iisc.talentsprint.com/CDS/Datasets/flight_data.csv
clear_output()
print("Dataset downloaded!")
!ls | grep '.csv'

In [None]:
# read the data
nyc_flights = pd.read_csv('flight_data.csv')
nyc_flights.shape

In [None]:
# first five rows of the dataset
nyc_flights.head()

The next step is to gather some information about the different columns in our DataFrame. We can do so with `.info()`, which reports the number of rows and columns, the column data types, the non-null counts, the memory usage, etc.

In [None]:
# information of the dataset
nyc_flights.info()

#### Box Plot

Now, to analyze the relationship between a categorical feature and a continuous feature, we create a boxplot. A boxplot summarizes a distribution with a rectangle (the box) that spans the first quartile to the third quartile, that is, the interquartile range (IQR), with a line inside the box marking the median. Whiskers extend from the box to the most extreme observations that lie within 1.5 times the IQR of the quartiles (the matplotlib default), and points beyond the whiskers are drawn individually as outliers.

We can draw a boxplot by calling `.boxplot()` on our DataFrame. Here, we plot the `dep_time` column grouped by the three values of `origin`: JFK, LGA and EWR.

In [None]:
# one box per origin airport, showing the distribution of departure times
nyc_flights.boxplot(column='dep_time', by='origin', rot=30, figsize=(5,6))
plt.ylabel('dep_time')
plt.show()
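
The quantities that a boxplot draws can also be computed directly. The sketch below uses a small made-up series (not values from the flights data) and assumes the common 1.5 x IQR rule for the whisker limits, which is the matplotlib default.

```python
import pandas as pd

# Toy departure times (hypothetical values, not taken from the dataset)
dep = pd.Series([500, 600, 700, 800, 900, 1000, 1100, 2300])

# Box edges (Q1, Q3) and the median line
q1, median, q3 = dep.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach the most extreme observations inside these limits
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = dep[(dep < lower_limit) | (dep > upper_limit)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```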

As we will only be dealing with categorical features in this tutorial, it is better to filter them out first and then check them for null values. The `.copy()` method is used here so that changes made to the new DataFrame are not reflected in the original one.

In [None]:
# filtering the categorical data (data type = 'object')
cat_nyc_flights = nyc_flights.select_dtypes(include=['object']).copy()

In [None]:
cat_nyc_flights.head()

In [None]:
# total null values
cat_nyc_flights.isnull().values.sum()

In [None]:
# checking null values in each feature
cat_nyc_flights.isnull().sum()

It seems that only the `tailnum` column has null values. We can impute them with the mode, i.e. the most frequent value in the column. The `.fillna()` method is handy for such operations.

In [None]:
# Get mode of 'tailnum' column
mode = cat_nyc_flights['tailnum'].mode()[0]
print(mode)

# OR

mode = cat_nyc_flights['tailnum'].value_counts().index[0]
print(mode)

In [None]:
cat_nyc_flights = cat_nyc_flights.fillna(mode)

The call above replaces every null entry in the DataFrame with the mode value. This is safe here because `tailnum` is the only column that contains nulls.
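
If several columns contained nulls, filling all of them with a single scalar would put one column's mode into the others. A hedged sketch of a per-column alternative is shown below on a made-up DataFrame; `toy`, `modes` and `toy_filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'tailnum': ['N1', 'N2', np.nan, 'N1'],
    'carrier': ['UA', np.nan, 'AA', 'UA'],
})

# Compute one mode per column, then fill each column with its own mode
modes = {col: toy[col].mode()[0] for col in toy.columns}
toy_filled = toy.fillna(value=modes)

print(toy_filled)
```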

In [None]:
# checking for null values after imputation
cat_nyc_flights.isnull().sum()

Another Exploratory Data Analysis (EDA) step that we might want to do on categorical features is to look at the frequency distribution of the categories within a feature, which can be done with the `.value_counts()` method used above.

In [None]:
# value counts of carrier
cat_nyc_flights['carrier'].value_counts()

In [None]:
# different carrier counts
cat_nyc_flights['carrier'].value_counts().count()

This means there are 16 different carriers. Now, we will plot the frequency distribution plot to visualize the carriers.

In [None]:
carrier_count = cat_nyc_flights['carrier'].value_counts()
sns.set(style="darkgrid")
sns.barplot(x=carrier_count.index, y=carrier_count.values)
plt.title('Frequency Distribution of Carriers')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Carrier', fontsize=12)
plt.show()

In the next section, we will see different methods to encode categorical variables so that they can be used in machine learning models.

### **Encoding Categorical Data**

To keep it simple, we will apply these encoding methods only on the `carrier` column. However, the same approach can be extended to all other features.

The different methods we will be covering here are as follows:

* Replacing values
* Encoding labels
* One-Hot encoding
* Binary encoding
* Miscellaneous features

#### **Replacing Values**

Let's start with the most basic method, which is just replacing the categories with the desired numbers. This can be achieved with the help of the `replace()` function in pandas.


In [None]:
map_dict = {'carrier':{'UA': 1, 'B6': 2, 'EV': 3, 'DL': 4, 'AA': 5, 'MQ': 6, 'US': 7, '9E': 8, 'WN': 9,
                       'VX': 10, 'FL': 11, 'AS': 12, 'F9': 13, 'YV': 14, 'HA': 15, 'OO':16}}
map_dict

In [None]:
labels = cat_nyc_flights['carrier'].astype('category').cat.categories.tolist()
labels

In [None]:
# number the (alphabetically sorted) labels from 1 to len(labels)
replace_map_dict = {'carrier': {label: code for code, label in enumerate(labels, start=1)}}

print(replace_map_dict)

Above, the numbers are assigned in alphabetical order of the carrier codes.

In [None]:
# make a copy of data
cat_nyc_flights_replace = cat_nyc_flights.copy()

Use the `.replace()` function on the DataFrame by passing the mapping dictionary as argument:

In [None]:
cat_nyc_flights_replace.replace(replace_map_dict, inplace=True)

cat_nyc_flights_replace.head()

As we can observe, the `carrier` categories in the DataFrame have been replaced with the mapped numbers.

In [None]:
# checking the data type
cat_nyc_flights_replace['carrier'].dtype

In pandas, it is good practice to change the data type of categorical features to `category`. This can be done using `.astype('category')` as shown.

In [None]:
cat_nyc_flights_c = cat_nyc_flights.copy()       # making a copy of the dataset

cat_nyc_flights_c['carrier'] = cat_nyc_flights_c['carrier'].astype('category')
cat_nyc_flights_c['origin'] = cat_nyc_flights_c['origin'].astype('category')

cat_nyc_flights_c.dtypes
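
One practical reason for this practice is memory: a `category` column stores each distinct value once plus a compact integer code per row. The sketch below compares the two representations on a made-up series; the exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Made-up column with a few repeated carrier codes
carriers = pd.Series(['UA', 'AA', 'B6', 'DL'] * 25_000)

as_object = carriers.astype('object')
as_category = carriers.astype('category')

# deep=True also counts the memory held by the Python string objects
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```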

#### **Label Encoding**

Another approach is to encode categorical values with a technique called "label encoding", which allows you to convert each value in a column to a number. Numerical labels are always between 0 and n_categories-1.

We can do label encoding via the `.cat.codes` attribute of a column that has the `category` data type.



In [None]:
# label encoding using .cat.codes
cat_nyc_flights_c['carrier'] = cat_nyc_flights_c['carrier'].cat.codes

In [None]:
cat_nyc_flights_c.head()     # alphabetically labeled from 0 to 15

Suppose we want to map one particular category to one value and all other categories to another value. This can be done with NumPy's `np.where()` function. Here, we will encode all the UA carrier flights as 1 and all other carriers as 0.

In [None]:
cat_nyc_flights_specific = cat_nyc_flights.copy()
# exact match on the carrier code (a substring match could also hit other codes)
cat_nyc_flights_specific['UA_encode'] = np.where(cat_nyc_flights_specific['carrier'] == 'UA', 1, 0)

cat_nyc_flights_specific.head()

We can also use scikit-learn's **LabelEncoder**.

In [None]:
from sklearn.preprocessing import LabelEncoder

cat_nyc_flights_LE = cat_nyc_flights.copy()      #copying the original data

le = LabelEncoder()
cat_nyc_flights_LE['carrier_label_code'] = le.fit_transform(cat_nyc_flights['carrier'])

cat_nyc_flights_LE.head()                        #Results in appending a new column to df
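
A fitted `LabelEncoder` also remembers the mapping it learned, which is useful for inspecting or reversing the encoding. Here is a small sketch on a made-up list of carrier codes; the names `sample_carriers`, `sample_le` and `sample_codes` are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

sample_carriers = ['UA', 'AA', 'B6', 'UA', 'DL']

sample_le = LabelEncoder()
sample_codes = sample_le.fit_transform(sample_carriers)

print(sample_le.classes_)                          # categories in sorted order
print(sample_codes)                                # position of each label in classes_
print(sample_le.inverse_transform(sample_codes))   # back to the original labels
```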

Label encoding is intuitive and straightforward, and it may work well with some learning algorithms, but it has a disadvantage: the algorithm can misinterpret the numerical values as having a meaningful order or magnitude. Should the carrier UA (encoded as 11) be given 11x more weight than the carrier AA (encoded as 1)?

To solve this issue there is another popular way to encode the categories via something called one-hot encoding.

#### **One Hot Encoding**

The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This has the benefit of not weighting a value improperly.
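
The claim about weighting can be checked numerically. In the sketch below (made-up values, three carriers only), the gaps between label codes depend on alphabetical position, whereas every pair of distinct one-hot rows is the same distance apart.

```python
import numpy as np
import pandas as pd

carriers = pd.Series(['AA', 'B6', 'UA'])

# Label codes: the gap between two carriers depends on alphabetical position
codes = carriers.astype('category').cat.codes.to_numpy()
gap_aa_b6 = abs(int(codes[0]) - int(codes[1]))
gap_aa_ua = abs(int(codes[0]) - int(codes[2]))
print(gap_aa_b6, gap_aa_ua)

# One-hot rows: every pair of distinct categories is equally far apart
onehot = pd.get_dummies(carriers).to_numpy(dtype=float)
d_aa_b6 = np.linalg.norm(onehot[0] - onehot[1])
d_aa_ua = np.linalg.norm(onehot[0] - onehot[2])
print(d_aa_b6, d_aa_ua)
```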

There are many libraries that support one-hot encoding, but the simplest option is the pandas function `pd.get_dummies()`.

This function is named this way because it creates dummy/indicator variables (1 or 0). Three of its arguments are important here:

* the first is the DataFrame we want to encode,
* the second (`columns`) lists the columns we need to create dummy variables for, and
* the third (`prefix`) is the text that is prepended to the new feature names.

In [None]:
cat_nyc_flights_onehot = cat_nyc_flights.copy()

cat_nyc_flights_onehot = pd.get_dummies(cat_nyc_flights_onehot, columns=['carrier'], prefix = ['carrier'])

cat_nyc_flights_onehot.head()

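As we can see, the column `carrier_UA` gets the value 1 (shown as `True` in pandas 2.0 and later, where `get_dummies()` returns boolean columns by default) at the 0th and 1st observations, as those rows had the UA category in the original DataFrame. The other columns are filled in the same way.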

Scikit-learn also supports one-hot encoding via **LabelBinarizer** and **OneHotEncoder** in its preprocessing module. For practice, we will do the same encoding with LabelBinarizer:

In [None]:
from sklearn.preprocessing import LabelBinarizer
cat_nyc_flights_onehot_sklearn = cat_nyc_flights.copy()

lb = LabelBinarizer()
lb_code = lb.fit_transform(cat_nyc_flights_onehot_sklearn['carrier'])
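# reuse the original index so that a later pd.concat(axis=1) aligns row by row
lb_code_nyc = pd.DataFrame(lb_code, columns=lb.classes_, index=cat_nyc_flights_onehot_sklearn.index)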

lb_code_nyc.head()

This results in a new DataFrame that contains only the one-hot encodings for the feature `carrier`, so it needs to be added to the original DataFrame using the pandas function `pd.concat()`.

In [None]:
# adding one hot encoding columns with the dataset
result_df = pd.concat([cat_nyc_flights_onehot_sklearn, lb_code_nyc], axis=1)

result_df.head()

While one-hot encoding solves the problem of unequal weights given to categories within a feature, it scales poorly when a feature has many categories: each category becomes a new column, which can lead to the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality).

#### **Binary Encoding**

This technique is not as intuitive as the previous ones. In this technique, first the categories are encoded as ordinal, then those integers are converted into binary code, then the digits from that binary string are split into separate columns. This encodes the data in fewer dimensions than one-hot. We can do binary encoding via a number of ways but the simplest one is using the `category_encoders` library.
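
Before using the library, the three steps described above can be sketched by hand with pandas and NumPy. This is an illustration only: it uses zero-based ordinal codes and hypothetical column names (`carrier_bit0`, ...), so the number and naming of columns may differ from what `category_encoders` produces.

```python
import numpy as np
import pandas as pd

carriers = pd.Series(['UA', 'AA', 'B6', 'DL', 'UA', 'EV'])

# Step 1: ordinal codes (zero-based here, an assumption of this sketch)
codes = carriers.astype('category').cat.codes.to_numpy()

# Step 2: number of binary digits needed for the largest code
n_bits = max(1, int(codes.max()).bit_length())

# Step 3: split each code into its binary digits, most significant first
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

binary_df = pd.DataFrame(bits, columns=[f'carrier_bit{i}' for i in range(n_bits)])
print(pd.concat([carriers.rename('carrier'), binary_df], axis=1))
```

Five distinct carriers fit into three binary columns here, whereas one-hot encoding would need five.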

In [None]:
!pip install -q category_encoders               # installing category_encoders library

In [None]:
# importing category_encoders library for binary encoding
import category_encoders as ce
cat_nyc_flights_ce = cat_nyc_flights.copy()

encoder = ce.BinaryEncoder(cols=['carrier'])
df_binary = encoder.fit_transform(cat_nyc_flights_ce)

df_binary.head(10)

**Note:** Notice that the `carrier` column is replaced by a small set of new binary columns, far fewer than the 16 columns produced by one-hot encoding. Together, the values in these columns form the binary code of each category.

#### **Miscellaneous Features**

Sometimes we may encounter categorical feature columns which specify the ranges of values for observation points, for example, the age column might be described in the form of categories like 0-20, 20-40 and so on.

While there are many ways to deal with such features, the most common ones are to either split these ranges into two separate columns or replace them with some measure such as the mean of the range.

First, we will use `pd.DataFrame()` to create a dummy DataFrame that has just one feature, `age`, whose values are ranges.

In [None]:
dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})

Then we will split the column on the delimiter `-` into two columns, `start` and `end`, using `split()` inside a `lambda` function.

In [None]:
dummy_df_age['start'], dummy_df_age['end'] = zip(*dummy_df_age['age'].map(lambda x: x.split('-')))

dummy_df_age.head()
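
Note that the `start` and `end` columns created this way still hold strings. A possible vectorized alternative, sketched below on a separate hypothetical DataFrame `ages`, splits with `.str.split(expand=True)` and converts the pieces to integers in one step.

```python
import pandas as pd

ages = pd.DataFrame({'age': ['0-20', '20-40', '40-60', '60-80']})

# expand=True returns one column per piece; astype(int) makes them numeric
bounds = ages['age'].str.split('-', expand=True).astype(int)
ages['start'] = bounds[0]
ages['end'] = bounds[1]

print(ages)
print(ages.dtypes)
```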

To replace each range with its mean, we will write a `split_mean()` function that takes one range at a time, splits it, calculates the mean of the two bounds and returns it. To apply a function to every entry of a column, we use the `.apply()` method:

In [None]:
def split_mean(x):
    split_list = x.split('-')
    mean = (float(split_list[0])+float(split_list[1]))/2
    return mean

# apply split_mean to every entry of the 'age' column
dummy_df_age['mean_age'] = dummy_df_age['age'].apply(split_mean)

dummy_df_age.head()