# Lab 5: Ordinal and OneHot Encoding

The goal of this lab is to evaluate how ordinal encoding and one-hot encoding of categorical variables affect the accuracy of a `LogisticRegression` and a `DecisionTreeClassifier`.

## 0. Data Loading

In [1]:
import pandas as pd

adult_census = pd.read_csv("adult-census.csv")

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

**Note**: we can use `sklearn.compose.make_column_selector` to automatically select the columns with `object` dtype, which correspond to the categorical features in our dataset.

In [3]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
print(categorical_columns)
data_categorical = data[categorical_columns]

['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_categorical, target, test_size=0.3, random_state=42)

## 1. DummyClassifier

**DummyClassifier** makes predictions that ignore the input features. This classifier serves as a *simple baseline to compare against other more complex classifiers*. The specific behavior of the baseline is selected with the `strategy` parameter.
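To make the role of `strategy` concrete, here is a minimal sketch on toy data. The arrays `X_toy` and `y_toy` and the labels `"low"` and `"high"` are made up for illustration and are not taken from the census dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: the feature values are irrelevant to a dummy classifier.
X_toy = np.zeros((10, 1))
y_toy = np.array(["low"] * 7 + ["high"] * 3)

# "most_frequent" always predicts the majority class seen during fit.
majority = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
majority_predictions = majority.predict(X_toy[:3])
majority_accuracy = majority.score(X_toy, y_toy)
print(majority_predictions, majority_accuracy)

# "prior" predicts the same class, but predict_proba returns the class
# frequencies observed in the training target (ordered as in classes_).
prior = DummyClassifier(strategy="prior").fit(X_toy, y_toy)
prior_proba = prior.predict_proba(X_toy[:1])
print(prior.classes_, prior_proba)
```

With a 70/30 class split, the baseline accuracy equals the share of the majority class, which is why any useful model should score above it.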

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyClassifier

dummy_model = make_pipeline(DummyClassifier())

In [6]:
from sklearn.metrics import accuracy_score

dummy_model.fit(X_train, y_train)
accuracy_score(y_test, dummy_model.predict(X_test))

0.766600696103187

## 2. OrdinalEncoder and LogisticRegression

**Task#1**: Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a
`LogisticRegression` classifier. Fit your pipeline on the training dataset and evaluate its prediction accuracy on the test dataset.
- `OrdinalEncoder` raises an error by default if it sees an unknown category at
prediction time. To avoid this, set the `handle_unknown="use_encoded_value"` and
`unknown_value=-1` parameters. You can refer to the
[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)
for more details regarding these parameters.
- Use the hyperparameter `max_iter=500` in your `LogisticRegression`.
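The effect of `handle_unknown="use_encoded_value"` and `unknown_value=-1` can be sketched on a tiny example. The frames `ordinal_train` and `ordinal_test` are hypothetical and only borrow the `workclass` column name from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames: "Never-worked" appears only at prediction time.
ordinal_train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
ordinal_test = pd.DataFrame({"workclass": ["Private", "Never-worked"]})

ordinal_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
ordinal_encoder.fit(ordinal_train)

# Known categories get their integer code; the unseen one is mapped to -1.
ordinal_encoded = ordinal_encoder.transform(ordinal_test)
print(ordinal_encoder.categories_)
print(ordinal_encoded)
```

Categories are sorted alphabetically during `fit`, so `"Private"` is encoded as 0 and the unseen `"Never-worked"` falls back to -1 instead of raising an error.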

In [7]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

# Write your code here (model)

# Defining the pipeline
model_ordinal_lr = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    LogisticRegression(max_iter=500)
)

# Fitting the pipeline
model_ordinal_lr.fit(X_train, y_train)


In [8]:
# Write your code here (accuracy)

# Predicting and evaluating the accuracy
accuracy_ordinal_lr = accuracy_score(y_test, model_ordinal_lr.predict(X_test))
print(f"Accuracy using OrdinalEncoder and LogisticRegression: {accuracy_ordinal_lr:.2f}")


Accuracy using OrdinalEncoder and LogisticRegression: 0.76


## 3. OneHotEncoder and LogisticRegression

**Task#2**: Define a scikit-learn pipeline composed of a `OneHotEncoder` and a
`LogisticRegression` classifier. Fit your pipeline on the training dataset and evaluate its prediction accuracy on the test dataset.

- `OneHotEncoder` raises an error by default if it sees an unknown category at
prediction time. To avoid this, set the `handle_unknown="ignore"` parameter.
- Use the hyperparameter `max_iter=500` in your `LogisticRegression`.
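As a small illustration of `handle_unknown="ignore"`, the sketch below encodes a category that was never seen during `fit`. The frames `onehot_train` and `onehot_test` are hypothetical toy data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames: "Never-worked" appears only at prediction time.
onehot_train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
onehot_test = pd.DataFrame({"workclass": ["Private", "Never-worked"]})

onehot_encoder = OneHotEncoder(handle_unknown="ignore")
onehot_encoder.fit(onehot_train)

# The encoder returns a sparse matrix by default; convert it for display.
onehot_encoded = onehot_encoder.transform(onehot_test).toarray()
print(onehot_encoder.categories_)
print(onehot_encoded)
```

The unknown category is encoded as a row of zeros, one zero for each category learned during `fit`, so the downstream classifier receives no active indicator for it.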

In [9]:
from sklearn.preprocessing import OneHotEncoder

# Write your code here (model).

# Defining the pipeline
model_onehot_lr = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=500)
)

# Fitting the pipeline
model_onehot_lr.fit(X_train, y_train)


In [10]:
# Write your code here (accuracy)

# Predicting and evaluating the accuracy
accuracy_onehot_lr = accuracy_score(y_test, model_onehot_lr.predict(X_test))
print(f"Accuracy using OneHotEncoder and LogisticRegression: {accuracy_onehot_lr:.2f}")


Accuracy using OneHotEncoder and LogisticRegression: 0.84


## 4. DecisionTree on Categorical Data

**Important**: decision trees in scikit-learn only accept numerical values; categorical values stored as strings are not accepted.

In [11]:
from sklearn.tree import DecisionTreeClassifier
tree_model = make_pipeline(DecisionTreeClassifier())

# Uncommenting the next line raises a ValueError because the features are still strings.
# tree_model.fit(X_train, y_train)
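The failure can be reproduced safely on toy data. The sketch below uses a hypothetical one-column frame `X_raw` and target `y_raw`, and catches the error so that the cell still runs to completion.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with a string-valued feature.
X_raw = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})
y_raw = ["a", "b", "b", "a"]

try:
    # The tree tries to convert the features to floats, which fails on strings.
    DecisionTreeClassifier().fit(X_raw, y_raw)
    fit_failed = False
except ValueError as error:
    fit_failed = True
    print(f"fit failed: {error}")
```

This is why the following sections place an encoder in front of the tree in the pipeline.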

## 5. OrdinalEncoder and DecisionTree

**Task#3**: Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a
`DecisionTreeClassifier`. Fit your pipeline on the training dataset and evaluate its prediction accuracy on the test dataset.

- `OrdinalEncoder` raises an error by default if it sees an unknown category at
prediction time. To avoid this, set the `handle_unknown="use_encoded_value"` and
`unknown_value=-1` parameters.

In [12]:
# Write your code here (model)
model_ordinal_tree = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    DecisionTreeClassifier()
)

# Fitting the pipeline
model_ordinal_tree.fit(X_train, y_train)



In [13]:
# Write your code here (accuracy)

# Predicting and evaluating the accuracy
accuracy_ordinal_tree = accuracy_score(y_test, model_ordinal_tree.predict(X_test))
print(f"Accuracy using OrdinalEncoder and DecisionTree: {accuracy_ordinal_tree:.2f}")


Accuracy using OrdinalEncoder and DecisionTree: 0.82


## 6. OneHotEncoder and DecisionTree

**Task#4**: Define a scikit-learn pipeline composed of a `OneHotEncoder` and a
`DecisionTreeClassifier`. Fit your pipeline on the training dataset and evaluate its prediction accuracy on the test dataset.

- `OneHotEncoder` raises an error by default if it sees an unknown category at
prediction time. To avoid this, set the `handle_unknown="ignore"` parameter.

In [14]:
# Write your code here (model)
# Defining the pipeline
model_onehot_tree = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier()
)

# Fitting the pipeline
model_onehot_tree.fit(X_train, y_train)

In [15]:
# Write your code here (accuracy)

# Predicting and evaluating the accuracy
accuracy_onehot_tree = accuracy_score(y_test, model_onehot_tree.predict(X_test))
print(f"Accuracy using OneHotEncoder and DecisionTree: {accuracy_onehot_tree:.2f}")


Accuracy using OneHotEncoder and DecisionTree: 0.83
