# One Hot Encoding

In this notebook we will use one hot encoding to transform categorical variables into numeric ones.

### Import Basic Packages & Data

In [1]:
# Basics
import numpy as np
import pandas as pd

The dataset contains data about Airbnb rentals in New York. We have a number of independent variables and a target variable, price. The goal is to transform our dataset so that all features are numeric, since most machine learning algorithms expect numeric input.

The original data can be found here: http://insideairbnb.com/

In [2]:
# Importing data
df = pd.read_csv('airbnb_dataset_training.csv')

### Identify Categorical Variables

In [3]:
# Checking data types to confirm categorical columns
print(df.dtypes)

id                      int64
minimum_nights          int64
number_of_reviews       int64
neighbourhood_group    object
neighbourhood          object
room_type              object
price                   int64
dtype: object


In [4]:
# Count unique values in all columns
print(df.nunique())

id                     8
minimum_nights         5
number_of_reviews      7
neighbourhood_group    2
neighbourhood          6
room_type              2
price                  8
dtype: int64


In [5]:
# Count unique values in a single column
print(df['neighbourhood_group'].value_counts())

Manhattan    5
Brooklyn     3
Name: neighbourhood_group, dtype: int64


In [6]:
# Separating categorical columns into their own dataframe
categorical = df.select_dtypes('object')
categorical

| | neighbourhood_group | neighbourhood | room_type |
|---|---|---|---|
| 0 | Brooklyn | Kensington | Private room |
| 1 | Manhattan | Midtown | Entire home/apt |
| 2 | Manhattan | Midtown | Private room |
| 3 | Brooklyn | Clinton Hill | Entire home/apt |
| 4 | Manhattan | Murray Hill | Entire home/apt |
| 5 | Manhattan | Murray Hill | Entire home/apt |
| 6 | Brooklyn | Bedford-Stuyvesant | Private room |
| 7 | Manhattan | Hell's Kitchen | Private room |


In [7]:
# Using a loop to find all the unique values of a categorical column
for i in categorical:
    print(i,'\n',df[i].unique(),'\n')

neighbourhood_group 
 ['Brooklyn' 'Manhattan'] 

neighbourhood 
 ['Kensington' 'Midtown' 'Clinton Hill' 'Murray Hill' 'Bedford-Stuyvesant'
 "Hell's Kitchen"] 

room_type 
 ['Private room' 'Entire home/apt'] 



### One Hot Encoding in Pandas

We can very easily accomplish our variable encoding by using the Pandas get_dummies function.

**pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)**
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
- The first argument, **data**, defines the dataset on which we want to perform the encoding.
- The **columns** argument defines which columns specifically we want to encode.
- The **prefix** argument sets the text prepended to each encoded column name, so we can tell which original column it came from.
- The **prefix_sep** argument sets the separator placed between the prefix and the category value (default '_').
- The **drop_first** argument controls whether the first category of each encoded column is dropped.
    - drop_first = **False**      will create a dummy variable for every category.
    - drop_first = **True**       will create n-1 dummy variables.
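
To make the effect of drop_first concrete, here is a minimal sketch on a small made-up dataframe. The `toy` data and the `rt` prefix are assumptions for illustration only and are not part of the Airbnb dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Airbnb dataset
toy = pd.DataFrame({"room_type": ["Private room", "Entire home/apt", "Shared room"]})

# drop_first=False (the default): one dummy column per category
full = pd.get_dummies(toy, columns=["room_type"], prefix=["rt"], dtype=int)

# drop_first=True: the first category (in sorted order) is dropped, leaving n-1 columns
reduced = pd.get_dummies(toy, columns=["room_type"], prefix=["rt"],
                         drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

With three categories, the first call should return three dummy columns and the second call two. A row with zeros in all remaining columns then represents the dropped category.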

In [None]:
# Import the required packages from pandas
# No additional packages are required since get_dummies is part of pandas.

In [8]:
# Create one hot encoded columns on the full dataset for the 'neighbourhood_group' column.
# To encode 'neighbourhood' and 'room_type' as well, add them to columns with a matching prefix list.
df_pd_encoded = pd.get_dummies(df,
                                columns = ['neighbourhood_group'],
                                prefix = ['ng'],
                                drop_first=False,
                                dtype=int)  # keep 0/1 output; pandas 2.0+ defaults to True/False
df_pd_encoded

| | id | minimum_nights | number_of_reviews | neighbourhood | room_type | price | ng_Brooklyn | ng_Manhattan |
|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | 1 | 9 | Kensington | Private room | 149 | 1 | 0 |
| 1 | 2595 | 1 | 45 | Midtown | Entire home/apt | 225 | 0 | 1 |
| 2 | 3647 | 3 | 0 | Midtown | Private room | 150 | 0 | 1 |
| 3 | 3831 | 1 | 270 | Clinton Hill | Entire home/apt | 89 | 1 | 0 |
| 4 | 5022 | 10 | 9 | Murray Hill | Entire home/apt | 80 | 0 | 1 |
| 5 | 5099 | 3 | 74 | Murray Hill | Entire home/apt | 200 | 0 | 1 |
| 6 | 5121 | 45 | 49 | Bedford-Stuyvesant | Private room | 60 | 1 | 0 |
| 7 | 5178 | 2 | 430 | Hell's Kitchen | Private room | 79 | 0 | 1 |


### One Hot Encoding in Scikit-Learn using OneHotEncoder (Recommended Approach)

We can achieve the same thing in Scikit-Learn by using the OneHotEncoder class. We recommend this approach because the fitted encoder can be reused on new data, which makes it easier to build a production machine learning model that can be tested.

**OneHotEncoder()**
- Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
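- The **drop** argument specifies whether a category is dropped from each feature: {'first', 'if_binary'} or an array of categories, **default=None**
    - drop: **None**       will create a dummy variable for every category.
    - drop: **'first'**    will drop the first category of each feature, creating n-1 dummy variables.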
- The **handle_unknown** argument can take one of several values: {'error', 'ignore', 'infrequent_if_exist'}, **default='error'**
    - 'error' will raise an error if an unknown category is present when transforming the testing/real data.
    - 'ignore' will set all unrecognized categories in the testing data to zero in all dummy columns.
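
The arguments above can be sketched on a small example. The `train` and `test` dataframes below are made up for illustration, with "Queens" deliberately absent from the training data. The sketch calls `.toarray()` on the default sparse result so that it does not depend on the name of the dense-output argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and testing data; "Queens" never appears in training
train = pd.DataFrame({"neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn"]})
test = pd.DataFrame({"neighbourhood_group": ["Manhattan", "Queens"]})

# handle_unknown='error' (the default): an unseen category raises a ValueError
try:
    OneHotEncoder().fit(train).transform(test)
except ValueError as err:
    print("handle_unknown='error':", err)

# handle_unknown='ignore': an unseen category becomes zeros in all dummy columns
ohe_ignore = OneHotEncoder(handle_unknown="ignore")
ohe_ignore.fit(train)
encoded_test = ohe_ignore.transform(test).toarray()
print(encoded_test)

# drop='first': the first category of each feature is dropped (n-1 columns)
ohe_drop = OneHotEncoder(drop="first")
encoded_train = ohe_drop.fit_transform(train).toarray()
print(encoded_train)
```

In the 'ignore' case the "Queens" row should come back as all zeros, which is indistinguishable from "none of the known categories", so it is worth deciding up front whether that is acceptable for the model.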
   

In [None]:
#Import the required packages from sklearn
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Use sklearn's OneHotEncoder to apply one hot encoding to multiple categorical features.

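# Initialize the encoder
# sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False)

#Specify the categorical columns
col_names = ['neighbourhood_group','neighbourhood', 'room_type']
col_prefixes = ['ng','n', 'rt']

#Apply the ohe encoding
# Fit on a NumPy array so that get_feature_names_out accepts our custom prefixes
dummy_cols = ohe.fit_transform(df[col_names].to_numpy())
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
dummy_names = ohe.get_feature_names_out(col_prefixes)
dummy_cols = pd.DataFrame(dummy_cols,columns = dummy_names, dtype = int)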

#Combine the encoded columns with the non categorical columns
df_ohe_encoded = pd.concat([df,dummy_cols], axis = 1)
df_ohe_encoded = df_ohe_encoded.drop(col_names, axis = 1)
df_ohe_encoded

In the next notebook we'll show you how to apply the ohe transformer to a testing dataset.

### Exercise 1 (Basic) Import & Identify Categorical Variables

For these exercises, you will be working with the well-known breast cancer dataset. A number of features are used to predict the target variable: whether there is recurrence or no recurrence of breast cancer.

Source: UMCIO, Ljubljana, Yugoslavia. M. Zwitter & M. Soklic.
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer

Follow the steps below to complete the one hot encoding exercise:
- Import the breast cancer dataset into a dataframe
- Find out what column types exist in the dataset
- Return a list of object (category) columns
- Explore the unique values that exist in each column. What unique values exist in the node-caps column?

In [None]:
#Import the csv file using the pre-populated code
bc_df = pd.read_csv('breastcancer_dataset.csv')
bc_df

In [None]:
# Checking data types to confirm categorical columns
print(bc_df.dtypes, '\n')

# Identify list of category columns.
cat_cols = bc_df.select_dtypes('object').columns
cat_cols

# Using a loop to find all the unique values of each categorical column
for i_feature in cat_cols:
    print(i_feature,'\n',bc_df[i_feature].unique(),'\n')

### Exercise 2 (Advanced) Apply the One Hot Encoder to the dataset

- Apply one hot encoding to the node-caps column and check that the correct number of columns has been returned.
- Optional: Apply one hot encoding to all categorical columns.

In [None]:
# Use sklearn's OneHotEncoder to apply one hot encoding to the node-caps column.

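# Initialize the encoder
# sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False)
#col_names = ['age','menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'breast', 'breast-quad','irradiat']
#bc_cols_prefix = ['age','mp', 'ts', 'in', 'nc', 'br','bq','irr']
# Column name as written in the exercise text; check bc_df.columns if this raises a KeyError
col_names = ['node-caps']
bc_cols_prefix = ['nc']

# Fit on a NumPy array so that get_feature_names_out accepts our custom prefixes
bc_dummy_cols = ohe.fit_transform(bc_df[col_names].to_numpy())
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
bc_dummy_names = ohe.get_feature_names_out(bc_cols_prefix)
bc_dummy_cols = pd.DataFrame(bc_dummy_cols,columns = bc_dummy_names, dtype = int)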

#Combine the encoded columns with the non categorical columns
bc_encoded = pd.concat([bc_df,bc_dummy_cols], axis = 1)
bc_encoded = bc_encoded.drop(col_names, axis = 1)
bc_encoded
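
One possible way to check that the correct number of columns has been returned is to compare the width of the encoded array with the number of unique values in the source column. The sketch below uses a small made-up dataframe as a stand-in for the exercise data; the values and the `toy` name are assumptions, not taken from the real file.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the exercise data; the values are made up
toy = pd.DataFrame({"node-caps": ["no", "yes", "no", "unknown"]})

ohe = OneHotEncoder()
dummies = ohe.fit_transform(toy[["node-caps"]]).toarray()

# Without the drop argument, the encoder returns one column per unique value
expected = toy["node-caps"].nunique()
print(dummies.shape[1], expected)
print(dummies.shape[1] == expected)
```

Note that `nunique()` ignores missing values by default, so the two counts can differ if the column contains NaN.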