## Encoding categorical data

We use two datasets:

- default
- bikes

These were used last week and are on Brightspace.

We consider the categorical predictors, encoding "manually" and then using `sklearn`.

First, import the libraries.

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

Import and inspect data. You will of course need to import from the appropriate directory on your machine.

In [None]:
default = pd.read_csv('../data/default.csv')  # adjust the path for your machine
default.head()
default.dtypes

We can use `map` to encode the `student` variable.

In [None]:
dummies = default['student'].map({'Yes': 1, 'No': 0})
dummies.head()


In practice we would not keep both columns. This is just for illustration, and to leave the original column intact.

In [None]:
default['student2'] = default['student'].map({'Yes': 1, 'No': 0})

default.head()  

Or use a lambda function. This is overkill and is shown only as an illustration.

In [None]:
default['student2'] = default['student'].apply(lambda x: 1 if x == 'Yes' else 0)
default.head()  

Or use `get_dummies`. Note we would typically only keep one of these columns.

In [None]:
pd.get_dummies(default['student']).head()
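As a minimal sketch of keeping only one of the two columns, `get_dummies` accepts a `drop_first` argument. The small `toy` data frame below is a made-up stand-in for the `student` column, not the real data.

```python
import pandas as pd

# Toy stand-in for the `student` column (hypothetical values, not the real data).
toy = pd.DataFrame({"student": ["No", "Yes", "No", "No", "Yes"]})

# drop_first=True drops the first level ("No"), leaving a single indicator column.
# dtype=int requests 0/1 integers instead of booleans.
student_dummy = pd.get_dummies(
    toy["student"], prefix="student", drop_first=True, dtype=int
)
print(student_dummy)
```

Only the `student_Yes` column remains, which carries the same information as the pair of columns.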

Or we could do this (maybe the simplest for binary data). Note that the booleans `True` and `False` behave like ones and zeros: we can find the mean, add, subtract, etc.

In [None]:
default['student2'] = default['student'] == "Yes"

default.head()
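To see the "booleans behave like ones and zeros" point concretely, here is a small sketch on a made-up column (the values are hypothetical, not taken from `default.csv`).

```python
import pandas as pd

# Toy stand-in for the `student` column (hypothetical values).
toy = pd.DataFrame({"student": ["No", "Yes", "No", "No", "Yes"]})
is_student = toy["student"] == "Yes"   # boolean Series

n_students = is_student.sum()          # counts the True values
share = is_student.mean()              # proportion of True values
as_int = is_student.astype(int)        # explicit 0/1 column, if one is preferred

print(n_students, share)
print(as_int.tolist())
```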

Now we try `OneHotEncoder` from `sklearn`. It generalizes to situations where we one hot encode many predictors at once. We reload the data to get a fresh copy.

In [None]:
default = pd.read_csv('../data/default.csv')

Note that `OneHotEncoder` has `fit`, `transform`, and `fit_transform` methods. You can guess what they do but see the documentation!

In [None]:
ohc = OneHotEncoder()

student_encoded = ohc.fit_transform(default[['student']])
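The difference between `fit`, `transform`, and `fit_transform` can be sketched as follows. The `train` and `new` data frames are made up for illustration; in practice they would be, for example, training and test sets.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for a training set and new observations.
train = pd.DataFrame({"student": ["No", "Yes", "No", "Yes"]})
new = pd.DataFrame({"student": ["Yes", "Yes", "No"]})

enc = OneHotEncoder()
enc.fit(train[["student"]])                # learn the categories from the training data
new_encoded = enc.transform(new[["student"]]).toarray()   # apply them to other data

# fit_transform does both steps at once on the same data.
train_encoded = OneHotEncoder().fit_transform(train[["student"]]).toarray()

print(enc.categories_)
print(new_encoded)
```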

What "attributes" does our `ohc` encoder have? A quick way to find out is to type `ohc.`; you will get a list of auto-complete options right after you type the period. For a description, check the documentation.

In [None]:
ohc.get_feature_names_out()

Make a data frame. Later we will not need to do this; we do it here for illustration. We use the `toarray` method to put `student_encoded` in a pandas friendly format.

Note that the encoding creates two columns. Usually we will only keep one. We address this below.

In [None]:
student_df = pd.DataFrame(student_encoded.toarray(), columns = ohc.get_feature_names_out())
student_df.head()

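In [41]:
# axis=1 places the encoded columns beside the original ones (the default, axis=0, stacks rows).
default_enc = pd.concat([default, student_df], axis=1)
default_enc.head()

Without `axis=1`, `concat` stacks the two data frames row-wise, so the encoded column is `NaN` for every original row. The output below comes from such a call, in a run where the encoder kept only the `student_Yes` column:

| Unnamed: 0.1 | Unnamed: 0 | default | student | balance | income | student_Yes |
|---|---|---|---|---|---|---|
| 0 | 1.0 | No | No | 729.526495 | 44361.625074 | NaN |
| 1 | 2.0 | No | Yes | 817.180407 | 12106.1347 | NaN |
| 2 | 3.0 | No | No | 1073.549164 | 31767.138947 | NaN |
| 3 | 4.0 | No | No | 529.250605 | 35704.493935 | NaN |
| 4 | 5.0 | No | No | 785.655883 | 38463.495879 | NaN |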


#### Multicollinearity

Or simply, collinearity. This happens when the columns (that is, the predictors) are linearly dependent. See ISLR/ISLP for review. For example, consider homework, quiz, and test grades together with the final average: knowing three of these determines the fourth. In our case, the value of `student_Yes` determines the value of `student_No`, and vice versa. This always happens when we one hot encode a categorical predictor and generate a column for every category, or *level*. For some models this is problematic (unregularized linear regression models in particular), so we often want to drop one of the columns. K nearest neighbors and tree based methods don't care one way or the other.
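One way to see the dependence numerically is to check the rank of the design matrix. This is only a sketch on made-up indicator values, with an intercept column added as a linear regression would.

```python
import numpy as np

# Hypothetical indicator values for five observations.
student_yes = np.array([0, 1, 0, 0, 1])
student_no = 1 - student_yes             # fully determined by student_yes
intercept = np.ones_like(student_yes)

# Keeping both indicators: intercept = student_no + student_yes, so a column is redundant.
X_both = np.column_stack([intercept, student_no, student_yes])
# Keeping one indicator: the columns are linearly independent.
X_one = np.column_stack([intercept, student_yes])

rank_both = np.linalg.matrix_rank(X_both)
rank_one = np.linalg.matrix_rank(X_one)
print(X_both.shape[1], rank_both)   # 3 columns but rank 2
print(X_one.shape[1], rank_one)     # 2 columns and rank 2
```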

To drop columns and handle other options we can add arguments to `OneHotEncoder`. For example, when might an unknown category be encountered? How will it be encoded? See the documentation.
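As a rough illustration of the unknown-category question, the sketch below fits an encoder on made-up data and then transforms data containing a level it has never seen. The `train` and `new` frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'categ3' appears only in the new data.
train = pd.DataFrame({"weather_cat": ["categ1", "categ2", "categ1"]})
new = pd.DataFrame({"weather_cat": ["categ2", "categ3"]})

# handle_unknown='ignore': the unseen level is encoded as a row of zeros.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["weather_cat"]])
new_encoded = enc.transform(new[["weather_cat"]]).toarray()
print(new_encoded)

# handle_unknown='error' (the default): transform raises a ValueError instead.
strict = OneHotEncoder(handle_unknown="error").fit(train[["weather_cat"]])
try:
    strict.transform(new[["weather_cat"]])
    raised = False
except ValueError:
    raised = True
print(raised)
```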

We can combine the encoded predictor columns with the original data using the pandas `concat` function.

In [None]:
ohc = OneHotEncoder(drop='first', handle_unknown='ignore')

student_encoded = ohc.fit_transform(default[['student']])

student_df = pd.DataFrame(student_encoded.toarray(), columns = ohc.get_feature_names_out())

default_enc = pd.concat([default, student_df], axis=1)  # combine the dataframes column-wise

default_enc.head()
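The `axis` argument of `concat` matters here. A small sketch on made-up frames shows the difference between stacking rows (the default) and placing columns side by side.

```python
import pandas as pd

# Hypothetical frames standing in for the original data and the encoded column.
left = pd.DataFrame({"student": ["No", "Yes", "No"]})
right = pd.DataFrame({"student_Yes": [0.0, 1.0, 0.0]})

stacked = pd.concat([left, right])               # axis=0: rows appended, gaps filled with NaN
side_by_side = pd.concat([left, right], axis=1)  # axis=1: columns aligned on the index

print(stacked.shape, side_by_side.shape)
print(side_by_side)
```

Because `axis=1` aligns on the index, both frames should share the same index (here the default 0, 1, 2).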

### Bikes data
- categorical predictors with more levels 
- ordinal predictors

In [2]:
bikes = pd.read_csv('../data/bikes.csv')  # your path will depend on where you put the file
bikes.head()

| | date | season | year | month | day_of_week | weekend | holiday | temp_actual | temp_feel | humidity | windspeed | weather_cat | rides |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | winter | 2011 | Jan | Sat | True | no | 57.399525 | 64.72625 | 80.5833 | 10.749882 | categ2 | 654 |
| 1 | 2011-01-03 | winter | 2011 | Jan | Mon | False | no | 46.491663 | 49.04645 | 43.7273 | 16.636703 | categ1 | 1229 |
| 2 | 2011-01-04 | winter | 2011 | Jan | Tue | False | no | 46.76 | 51.09098 | 59.0435 | 10.739832 | categ1 | 1454 |
| 3 | 2011-01-05 | winter | 2011 | Jan | Wed | False | no | 48.749427 | 52.6343 | 43.6957 | 12.5223 | categ1 | 1518 |
| 4 | 2011-01-07 | winter | 2011 | Jan | Fri | False | no | 46.503324 | 50.79551 | 49.8696 | 11.304642 | categ2 | 1362 |


In [3]:
bikes.dtypes

date            object
season          object
year             int64
month           object
day_of_week     object
weekend           bool
holiday         object
temp_actual    float64
temp_feel      float64
humidity       float64
windspeed      float64
weather_cat     object
rides            int64
dtype: object

We will drop some columns for simplicity and focus on `day_of_week`, `weekend`, and `weather_cat`.

In [None]:
bikes = bikes.drop(columns = ['date', 'season', 'year', 'month'])

We encode the `day_of_week` predictor using one hot encoding. 

In [None]:
ohc = OneHotEncoder(drop='first', handle_unknown='error')
week_day_enc = ohc.fit_transform(bikes[["day_of_week"]])
week_day_df = pd.DataFrame(week_day_enc.toarray(), columns = ohc.get_feature_names_out())
week_day_df.head()



Now `weekend`. Notice that it is boolean. So we should be able to leave this as is. 

What about `weather_cat`?

It is actually an ordinal predictor, so we will use `OrdinalEncoder`. Does it matter if we drop the first level? How are the numerical values determined? Can we specify them? Is there a default? Check the documentation!

In [None]:
# Named ord_enc to avoid shadowing Python's built-in ord().
ord_enc = OrdinalEncoder(categories=[['categ1', 'categ2', 'categ3']])

weather_ord = ord_enc.fit_transform(bikes[["weather_cat"]])
weather_df = pd.DataFrame(weather_ord, columns = ['weather_ord'])

weather_df.head()

In [None]:
ord_enc.categories_

The ordinal encoder maps the categories to the integers 0 through (number of categories - 1), in the order you specify. If you do not specify an order, the default is alphabetical order! So you should make sure your levels are in the correct order.
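To illustrate why the default order can be a problem, here is a sketch with a made-up `size` predictor whose natural order (low, medium, high) differs from its alphabetical order. The column name and levels are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal predictor.
toy = pd.DataFrame({"size": ["low", "high", "medium", "low"]})

# Default: levels are sorted, so high=0, low=1, medium=2.
alpha_enc = OrdinalEncoder()
alpha_codes = alpha_enc.fit_transform(toy[["size"]]).ravel()

# Specified order: low=0, medium=1, high=2.
ordered_enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered_enc.fit_transform(toy[["size"]]).ravel()

print(alpha_enc.categories_, alpha_codes)
print(ordered_enc.categories_, ordered_codes)
```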

`OrdinalEncoder` does not support custom numerical values for the levels. However, you could use the following to map your own numerical values.

In [None]:
# Define your custom mapping as a dictionary (assumes your categories are 'category1', 'category2', etc.)
mapping = {'category1': 3, 'category2': 2, 'category3': 1}

# Apply the mapping to your column (assumes 'column_to_encode' is the name of your column and df is the name of your dataframe)
df['column_to_encode'] = df['column_to_encode'].map(mapping)
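One thing to watch for with `map` is that any level missing from the dictionary becomes `NaN`. The sketch below uses a made-up frame; the level `categ4` and the column name `weather_num` are hypothetical and are there only to show the effect.

```python
import pandas as pd

# Hypothetical data with one level ('categ4') that the mapping does not cover.
df = pd.DataFrame({"weather_cat": ["categ1", "categ3", "categ2", "categ4"]})
mapping = {"categ1": 3, "categ2": 2, "categ3": 1}

df["weather_num"] = df["weather_cat"].map(mapping)

# Levels not found in the dictionary are mapped to NaN, so it is worth checking for them.
n_unmapped = df["weather_num"].isna().sum()
print(df)
print(n_unmapped)
```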