# Convert Categorical Data to numeric data using one-hot encoding

This notebook implements one-hot encoding to convert categorical data into numeric form.
- Converts discrete input values to one-hot encoded form (also called one-of-k or dummy encoding)

## One hot encoding
This transforms each categorical feature with `n_categories` possible values into `n_categories` binary features, with one of them 1 and all others 0.
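To make the definition concrete, here is a minimal hand-rolled sketch using only NumPy. The `values` list is a hypothetical example, not data from this notebook, and the sorted column order is a choice made for this sketch.

```python
import numpy as np

# Hypothetical example values; not taken from the notebook's dataset.
values = ['red', 'green', 'blue', 'green']

# Sorted unique categories define the column order.
categories = sorted(set(values))  # ['blue', 'green', 'red']
index = {cat: i for i, cat in enumerate(categories)}

# One row per value, one column per category; exactly one 1 per row.
one_hot = np.zeros((len(values), len(categories)))
for row, value in enumerate(values):
    one_hot[row, index[value]] = 1.0

print(categories)
print(one_hot)
```

Three distinct values give three binary columns, and every row sums to 1.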

### 1) OneHotEncoding Using `sklearn.preprocessing.OneHotEncoder`

#### `sklearn.preprocessing.OneHotEncoder`: encode categorical features as a one-hot numeric array

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [1]:
from sklearn.preprocessing import OneHotEncoder

import pandas as pd
import numpy as np

### Creating One hot encoder object

In [2]:
enc = OneHotEncoder()

We will start with a dummy dataset to show how the one-hot encoder works. The encoder expects input in the form of a 2D array where the categories are in one column.
- The encoder is instantiated with default parameters
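If the categories start out as a flat list, they need to be reshaped into a single column first. The sketch below assumes a hypothetical `majors_flat` list and uses NumPy's `reshape(-1, 1)` to get the 2D shape described above.

```python
import numpy as np

# A flat (1D) list of categories, as it might come from a single column.
majors_flat = ['Engineering', 'Math', 'Chemistry']

# reshape(-1, 1) produces one column with one row per sample.
majors_2d = np.array(majors_flat).reshape(-1, 1)

print(majors_2d.shape)  # (3, 1)
```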

In [3]:
# starting off with dummy data

majors = [['Engineering'], 
          ['Math'], 
          ['Chemistry']]

In [4]:
enc.fit(majors)

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='error', sparse=True)

One thing to note here is `handle_unknown='error'`. Calling `fit` tells the encoder which categories are present; if new categories appear when `transform` is applied to another dataset, the encoder raises an error. An example of this error follows.

In [5]:
enc.transform(majors).toarray()

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [6]:
enc.categories_

[array(['Chemistry', 'Engineering', 'Math'], dtype=object)]

In [7]:
new_majors = [['Media Studies'], 
              ['Math'],
              ['Stats']]

As you can see, the `new_majors` 2D array contains categories that the encoder was not told about. Applying `transform` to `new_majors` results in an error, as shown below.

In [8]:
enc.transform(new_majors).toarray()

ValueError: Found unknown categories ['Stats', 'Media Studies'] in column 0 during transform
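One possible way to anticipate this error is to compare the new values against the fitted `categories_` before transforming. This is a sketch under the same toy data as above; the variable names `known` and `unknown` are introduced here for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

majors = [['Engineering'], ['Math'], ['Chemistry']]
new_majors = [['Media Studies'], ['Math'], ['Stats']]

enc = OneHotEncoder()  # default handle_unknown='error'
enc.fit(majors)

# Categories learned during fit for the single input column.
known = set(enc.categories_[0])
unknown = sorted({row[0] for row in new_majors} - known)
print(unknown)  # ['Media Studies', 'Stats']

# The same mismatch surfaces as a ValueError at transform time.
try:
    enc.transform(new_majors)
except ValueError as err:
    print('transform failed:', err)
```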

To prevent this, we can set the `handle_unknown` parameter to `'ignore'`.

In [9]:
enc_unk = OneHotEncoder(handle_unknown='ignore')

enc_unk.fit(majors)

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=True)

The two majors that are not among the encoder's categories cannot be mapped to any column, so the encoder reports that these data points belong to none of its categories: their rows are all zeros.

In [10]:
enc_unk.transform(new_majors).toarray()

array([[0., 0., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])
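Because ignored categories become all-zero rows, they can be flagged after the fact by checking row sums. The following sketch assumes a single input column (with several columns, the check would need to be done per column block) and introduces the name `is_unknown` for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

majors = [['Engineering'], ['Math'], ['Chemistry']]
new_majors = [['Media Studies'], ['Math'], ['Stats']]

enc_unk = OneHotEncoder(handle_unknown='ignore')
enc_unk.fit(majors)

encoded = enc_unk.transform(new_majors).toarray()

# With one input column, a row summing to 0 matched no known category.
is_unknown = encoded.sum(axis=1) == 0
print(is_unknown.tolist())  # [True, False, True]
```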

### Reading csv file

Now that we have seen the basics of encoding, we can start working with real data.

In [11]:
student_info = pd.read_csv('datasets/academic_info.csv')

student_info

Unnamed: 0,StudentID,Department,Nationality,Batch
0,19022,Polical Science,Egypt,2010
1,19087,Journalism,Germany,2012
2,12809,Engineering,Germany,2014
3,12809,Math,China,2014


### Creating an encoder object to encode the column 'Department'

Use fit_transform to fit and transform the data in one command

In [12]:
dept_enc = OneHotEncoder()

In [13]:
dept_enc_transformed = dept_enc.fit_transform(student_info[['Department']])

dept_enc_transformed

<4x4 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

One-hot encoding is used to encode nominal data with no inherent ordering. As you can see from the array output, no ordering is apparent in the one-hot encoded form.

In [14]:
dept_enc_transformed.toarray()

array([[0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.]])
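The point about ordering can be checked numerically. This sketch contrasts hypothetical integer codes, where some categories look closer together than others, with one-hot rows, where every pair of distinct categories is the same Euclidean distance apart.

```python
import numpy as np

# Integer codes impose an order: code 3 looks farther from 0 than code 1 does.
codes = np.array([0, 1, 2, 3])
print(abs(codes[0] - codes[1]), abs(codes[0] - codes[3]))  # 1 3

# One-hot rows for four categories (the identity matrix).
one_hot = np.eye(4)

# Euclidean distance between every pair of rows.
diffs = one_hot[:, None, :] - one_hot[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))
print(distances.round(3))
```

Every off-diagonal distance equals the square root of 2, so no category is treated as nearer to another.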

The `categories_` attribute tells us which columns correspond to which categories.

In [15]:
dept_enc.categories_

[array(['Engineering', 'Journalism', 'Math', 'Polical Science'],
       dtype=object)]
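As a small illustration of how the column order is used, the sketch below copies the category array and the dense output shown above and decodes each row back to its label with `argmax`. The arrays are typed in by hand here so the example runs on its own.

```python
import numpy as np

# Column order as reported by categories_ in the output above.
categories = np.array(['Engineering', 'Journalism', 'Math', 'Polical Science'],
                      dtype=object)

# The dense one-hot array from the transform output above.
encoded = np.array([[0., 0., 0., 1.],
                    [0., 1., 0., 0.],
                    [1., 0., 0., 0.],
                    [0., 0., 1., 0.]])

# argmax finds the position of the 1 in each row; indexing decodes it.
decoded = categories[encoded.argmax(axis=1)]
print(decoded.tolist())
```

The fitted encoder also provides an `inverse_transform` method for the same purpose.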

### Printing the labels along their encoded value

In [16]:
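# categories_ holds one array per encoded column, so take element 0.
# np.int was removed from NumPy; the builtin int works instead.
dept_data = pd.DataFrame(dept_enc_transformed.toarray(), 
                       columns=dept_enc.categories_[0], dtype=int)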

dept_data

Unnamed: 0,Engineering,Journalism,Math,Polical Science
0,0,0,0,1
1,0,1,0,0
2,1,0,0,0
3,0,0,1,0


### Creating an object to encode the column `Nationality`

We use a separate encoder object for each feature, so that each one stores the categories of its own column.

In [17]:
nationality_enc = OneHotEncoder()

In [18]:
nationality_transformed = nationality_enc\
                            .fit_transform(student_info[['Nationality']])

In [19]:
nationality_transformed.toarray()

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [20]:
nationality_enc.categories_

[array(['China', 'Egypt', 'Germany'], dtype=object)]

In [21]:
# Same fix as for dept_data: first categories_ array, builtin int dtype.
nationality_data = pd.DataFrame(nationality_transformed.toarray(), 
                              columns=nationality_enc.categories_[0], dtype=int)

nationality_data

Unnamed: 0,China,Egypt,Germany
0,0,1,0
1,0,0,1
2,0,0,1
3,1,0,0
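As an alternative to one encoder per column, a single `OneHotEncoder` can be fit on several columns at once; `categories_` then holds one array per input column. The sketch below rebuilds the two categorical columns by hand, and the `<column>_<category>` naming scheme is an assumption made for this example, not something the encoder produces here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Small frame mirroring the two categorical columns shown earlier.
df = pd.DataFrame({
    'Department': ['Polical Science', 'Journalism', 'Engineering', 'Math'],
    'Nationality': ['Egypt', 'Germany', 'Germany', 'China'],
})

features = ['Department', 'Nationality']
enc = OneHotEncoder()
encoded = enc.fit_transform(df[features]).toarray()

# Build labels from categories_, which follows the input column order.
columns = [f'{name}_{cat}'
           for name, cats in zip(features, enc.categories_)
           for cat in cats]

combined = pd.DataFrame(encoded, columns=columns).astype(int)
print(combined)
```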


### 2) Getting the same output as OneHotEncoding Using `pandas.get_dummies`

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

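We can use the `pandas.get_dummies` function on the specific column we want one-hot encoded. If you are working with DataFrames, this is often more convenient than the `OneHotEncoder` estimator object, because it returns a DataFrame whose columns are already labeled with the category names. Unlike a fitted encoder, it does not remember the categories it has seen, so it gives no guarantee that a later dataset is encoded with the same columns.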

In [22]:
dummy_dept = pd.get_dummies(student_info['Department'])

dummy_dept

Unnamed: 0,Engineering,Journalism,Math,Polical Science
0,0,0,0,1
1,0,1,0,0
2,1,0,0,0
3,0,0,1,0


In [23]:
dummy_nationality = pd.get_dummies(student_info['Nationality'])

dummy_nationality

Unnamed: 0,China,Egypt,Germany
0,0,1,0
1,0,0,1
2,0,0,1
3,1,0,0


In [24]:
student_info = pd.concat([student_info, 
                           dummy_dept, 
                           dummy_nationality], 
                          axis=1)

student_info

Unnamed: 0,StudentID,Department,Nationality,Batch,Engineering,Journalism,Math,Polical Science,China,Egypt,Germany
0,19022,Polical Science,Egypt,2010,0,0,0,1,0,1,0
1,19087,Journalism,Germany,2012,0,1,0,0,0,0,1
2,12809,Engineering,Germany,2014,1,0,0,0,0,0,1
3,12809,Math,China,2014,0,0,1,0,1,0,0
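The separate `get_dummies` calls and the `concat` step can likely be collapsed into one call by passing the `columns` parameter. This is a sketch on a hand-built copy of the table; note that it differs from the result above in that the original `Department` and `Nationality` columns are replaced rather than kept, and the dummy columns get a `<column>_` prefix.

```python
import pandas as pd

df = pd.DataFrame({
    'StudentID': [19022, 19087, 12809, 12809],
    'Department': ['Polical Science', 'Journalism', 'Engineering', 'Math'],
    'Nationality': ['Egypt', 'Germany', 'Germany', 'China'],
    'Batch': [2010, 2012, 2014, 2014],
})

# columns= selects what to encode; dtype=int keeps 0/1 output, since
# recent pandas versions default to boolean dummy columns.
encoded = pd.get_dummies(df, columns=['Department', 'Nationality'], dtype=int)
print(encoded.columns.tolist())
```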
