<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Feature Engineering: Categorical
              
</p>
</div>

Data Science Cohort Live NYC May 2024
<p>Phase 3: Topic 19</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

#### Feature Engineering: Transforming input data
- Categorical data to numeric form
- Put input in a form that better conforms to the structure of the input-output relationship.
- Model interactions between features.

#### A key aspect to making a better prediction machine

#### Categorical data 
- Suspect that the status of a categorical value affects outcome.
- Want to add as a variable to regress on.
- Need to convert to numeric form.

Two types of categorical data:

<center><img src = "Images/ordinalvsnominal.png" width = 900/></center>


**Dealing with ordinal categoricals**

- Clear progression/order of values:

Pizza cheesiness rating:

- E.g., not cheesy, slightly cheesy, cheesy, very cheesy, extremely cheesy, dripping oceans of cheese



Ordinal encoding:

not cheesy: 0, slightly cheesy: 1, cheesy: 2, very cheesy: 3, extremely cheesy: 4, dripping oceans of cheese: 5
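The mapping above can be sketched with an ordered pandas categorical. This is a minimal illustration on made-up ratings (the `ratings` series is hypothetical, not part of any dataset used here):

```python
import pandas as pd

# Cheesiness levels in the order listed above (lowest to highest)
levels = ["not cheesy", "slightly cheesy", "cheesy", "very cheesy",
          "extremely cheesy", "dripping oceans of cheese"]

# Hypothetical toy ratings, for illustration only
ratings = pd.Series(["cheesy", "not cheesy",
                     "dripping oceans of cheese", "very cheesy"])

# An ordered categorical carries the ranking; .codes gives each value's
# integer position in `levels`
ordered = pd.Categorical(ratings, categories=levels, ordered=True)
codes = pd.Series(ordered.codes, index=ratings.index)
print(codes.tolist())
```

Because the order is given explicitly, the codes follow the cheesiness ranking rather than alphabetical order.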


<center><img src = "Images/cheesy_pizza.jpg" width = 600 /></center>


A real example: housing dataset
- Using pandas categorical coding
- Ordinal encoding

In [1]:
import pandas as pd
housing_df = pd.read_csv('Data/ames_housing.csv')
housing_df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

Lots of columns. Let's check out the 'ExterQual' column: quality of the material on the house exterior.

In [2]:
housing_df['ExterQual'].unique()

array(['Gd', 'TA', 'Ex', 'Fa'], dtype=object)

This is a column of strings, but these are really categories. The related 'ExterCond' column (exterior condition), used below, is rated on the same scale.
- Pandas has categorical datatype.
- Special methods for categorical datatype.

In [3]:
housing_df['ExterCond'] = housing_df['ExterCond'].astype('category')
housing_df['ExterCond']

0       TA
1       TA
2       TA
3       TA
4       TA
        ..
1455    TA
1456    TA
1457    Gd
1458    TA
1459    TA
Name: ExterCond, Length: 1460, dtype: category
Categories (5, object): ['Ex', 'Fa', 'Gd', 'Po', 'TA']

Good, but need to establish category order

In [4]:
housing_df['ExterCond'] = housing_df['ExterCond'].cat.reorder_categories(['Po', 'Fa', 'TA', 'Gd', 'Ex'])
housing_df['ExterCond']

0       TA
1       TA
2       TA
3       TA
4       TA
        ..
1455    TA
1456    TA
1457    Gd
1458    TA
1459    TA
Name: ExterCond, Length: 1460, dtype: category
Categories (5, object): ['Po', 'Fa', 'TA', 'Gd', 'Ex']

Get the numerical values of the ordinal categorical:

In [5]:
housing_df['ExterCond']

0       TA
1       TA
2       TA
3       TA
4       TA
        ..
1455    TA
1456    TA
1457    Gd
1458    TA
1459    TA
Name: ExterCond, Length: 1460, dtype: category
Categories (5, object): ['Po', 'Fa', 'TA', 'Gd', 'Ex']

In [6]:
housing_df['ExterCond'].cat.codes

0       2
1       2
2       2
3       2
4       2
       ..
1455    2
1456    2
1457    3
1458    2
1459    2
Length: 1460, dtype: int8

#### Using scikit-learn OrdinalEncoder()

In [7]:
from sklearn.preprocessing import OrdinalEncoder

Some objects in scikit-learn are predictive models:
- LinearRegression()
    - .fit() 
    - .predict()

Other objects are transformers:
- OrdinalEncoder(), StandardScaler(), Normalizer(), etc.
    -  .fit()
    - .transform()
    - .fit_transform()


.fit() method for transformers:
- Call fit() or fit_transform() on the **training set** only.
- Then transform() the test set and/or train set with the fitted transformer.
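A minimal sketch of this fit-on-train, transform-on-test pattern, assuming a small made-up frame (`toy` and its `quality` column are hypothetical stand-ins for a real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame standing in for a real dataset
toy = pd.DataFrame({"quality": ["Po", "TA", "Gd", "Ex", "Fa", "TA", "Gd", "TA"]})
train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Giving the order explicitly makes the mapping independent of which
# categories happen to land in the training split
enc = OrdinalEncoder(categories=[["Po", "Fa", "TA", "Gd", "Ex"]])
enc.fit(train)                      # learn the mapping from the training set only
train_enc = enc.transform(train)    # apply it to the training set...
test_enc = enc.transform(test)      # ...and reuse the same mapping on the test set

print(train_enc.shape, test_enc.shape)
```

The test set never goes through `fit()`, so nothing about it leaks into the encoding.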

OrdinalEncoder fits and transforms categorical data to numerical.
- Can do many ordinal categorical columns at once.

In [8]:
ord_cat_selector = ['ExterCond', 'LotShape']
cat_subset = housing_df[ord_cat_selector]
cat_subset

| | ExterCond | LotShape |
|---|---|---|
| 0 | TA | Reg |
| 1 | TA | Reg |
| 2 | TA | IR1 |
| 3 | TA | IR1 |
| 4 | TA | IR1 |
| ... | ... | ... |
| 1455 | TA | Reg |
| 1456 | TA | Reg |
| 1457 | Gd | Reg |
| 1458 | TA | Reg |


'LotShape' measures the irregularity of the lot shape.
- Clearly ordinal: from regular (Reg) to increasingly irregular (IR1, IR2, IR3).

In [9]:
cat_subset['LotShape'].unique()

array(['Reg', 'IR1', 'IR2', 'IR3'], dtype=object)

Ordinal encoder will do the mapping all at once:
- Define ordinal order for each categorical variable.

In [10]:
# Category order for each column, lowest to highest
extcond_list = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
# Only the four values that occur in 'LotShape'
reg_list = ['Reg', 'IR1', 'IR2', 'IR3']

In [11]:
o_enc = OrdinalEncoder(categories = [extcond_list, reg_list])
o_enc.fit(cat_subset)

Now transform the categorical subset

In [12]:
X_subset = pd.DataFrame(o_enc.transform(cat_subset),
                        columns = cat_subset.columns)
X_subset

| | ExterCond | LotShape |
|---|---|---|
| 0 | 2.0 | 0.0 |
| 1 | 2.0 | 0.0 |
| 2 | 2.0 | 1.0 |
| 3 | 2.0 | 1.0 |
| 4 | 2.0 | 1.0 |
| ... | ... | ... |
| 1455 | 2.0 | 0.0 |
| 1456 | 2.0 | 0.0 |
| 1457 | 3.0 | 0.0 |
| 1458 | 2.0 | 0.0 |


In [13]:
cat_subset

| | ExterCond | LotShape |
|---|---|---|
| 0 | TA | Reg |
| 1 | TA | Reg |
| 2 | TA | IR1 |
| 3 | TA | IR1 |
| 4 | TA | IR1 |
| ... | ... | ... |
| 1455 | TA | Reg |
| 1456 | TA | Reg |
| 1457 | Gd | Reg |
| 1458 | TA | Reg |


A nice side effect: the fitted encoder also provides the inverse transform:

In [14]:
X_subset

| | ExterCond | LotShape |
|---|---|---|
| 0 | 2.0 | 0.0 |
| 1 | 2.0 | 0.0 |
| 2 | 2.0 | 1.0 |
| 3 | 2.0 | 1.0 |
| 4 | 2.0 | 1.0 |
| ... | ... | ... |
| 1455 | 2.0 | 0.0 |
| 1456 | 2.0 | 0.0 |
| 1457 | 3.0 | 0.0 |
| 1458 | 2.0 | 0.0 |


In [15]:
o_enc.inverse_transform(X_subset)

array([['TA', 'Reg'],
       ['TA', 'Reg'],
       ['TA', 'IR1'],
       ...,
       ['Gd', 'Reg'],
       ['TA', 'Reg'],
       ['TA', 'Reg']], dtype=object)

Some advantages of ordinal encoder:
- Set up the encoding order for many categorical columns at once.
- Transform and inverse transform with the same fitted object.
- **Integrates into scikit-learn pipeline architecture (will see this later)**

**Dealing with nominal categoricals**

Label encoding nominal categoricals introduces spurious relations:

- Integer codes imply an order and spacing between categories that doesn't make sense for nominal values.
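To see the problem concretely, here is a small sketch that label encodes the six roof styles from the housing data. The values are typed in by hand so the snippet runs without the CSV:

```python
import pandas as pd

# Roof styles from the housing data, entered by hand for illustration
roof = pd.Series(["Gable", "Hip", "Gambrel", "Mansard", "Flat", "Shed"])

# astype('category') sorts the categories alphabetically,
# so the integer codes are arbitrary
codes = roof.astype("category").cat.codes
mapping = dict(zip(roof, codes))
print(mapping)
```

A regression on these codes would treat 'Shed' as five units "more" than 'Flat' and 'Hip' as two units "more" than 'Gable', relations that exist only because of alphabetical sorting.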

In [16]:
housing_df['RoofStyle'].unique()

array(['Gable', 'Hip', 'Gambrel', 'Mansard', 'Flat', 'Shed'], dtype=object)

One-hot (dummy) encoding avoids this:
- pd.get_dummies()
- sklearn's OneHotEncoder()

Create column for each unique value of nominal categorical:
- Each column takes on 0/1 value.

In [39]:
pd.get_dummies(housing_df['RoofStyle']).tail().astype('int')

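| | Flat | Gable | Gambrel | Hip | Mansard | Shed |
|---|---|---|---|---|---|---|
| 1455 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1456 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1457 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1458 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1459 | 0 | 1 | 0 | 0 | 0 | 0 |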


In [34]:
housing_df['RoofStyle'].tail()

1455    Gable
1456    Gable
1457    Gable
1458      Hip
1459    Gable
Name: RoofStyle, dtype: object

When doing regression, there is an issue with transforming a feature in this way:
- We have accidentally introduced a perfect linear dependence among the new columns (the "dummy variable trap").
- E.g., constraint: if 5 of the columns are zero, the last one must be 1.
- For $k$ values of a nominal categorical, only $k-1$ columns carry information.
- **Solution**: Get rid of one of the columns.
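The constraint can be checked numerically. The sketch below uses a made-up three-value column (hypothetical data) and compares the rank of the design matrix, intercept plus dummies, with and without dropping a column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with k = 3 nominal values
roof = pd.Series(["Gable", "Hip", "Flat", "Gable", "Hip", "Flat"])

full = pd.get_dummies(roof).astype(int)                      # k columns
reduced = pd.get_dummies(roof, drop_first=True).astype(int)  # k - 1 columns

# Every row of the full encoding sums to 1, duplicating the intercept column
print(full.sum(axis=1).unique())

def design_rank(dummies):
    """Rank of the design matrix [intercept | dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return np.linalg.matrix_rank(X)

# Full encoding: rank is smaller than the number of columns (rank-deficient)
print(design_rank(full), "of", full.shape[1] + 1, "columns")
# After dropping one dummy: rank equals the number of columns (full rank)
print(design_rank(reduced), "of", reduced.shape[1] + 1, "columns")
```

With all $k$ dummies plus an intercept, one column is a linear combination of the others, so the regression coefficients are not uniquely determined. Dropping one dummy removes the redundancy.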

In [18]:
X_roof = pd.get_dummies(housing_df['RoofStyle'], drop_first = True)
X_roof.tail()

| | Gable | Gambrel | Hip | Mansard | Shed |
|---|---|---|---|---|---|
| 1455 | True | False | False | False | False |
| 1456 | True | False | False | False | False |
| 1457 | True | False | False | False | False |
| 1458 | False | False | True | False | False |
| 1459 | True | False | False | False | False |


#### Using scikit-learn OneHotEncoder

In [47]:
from sklearn.preprocessing import OneHotEncoder
onehot_enc = OneHotEncoder(drop = 'first', sparse_output = False)

In [48]:
nominal_cols = ['RoofStyle','HouseStyle']
onehot_enc.fit_transform(housing_df[nominal_cols])

array([[1., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

In [49]:
onehot_enc.get_feature_names_out()

array(['RoofStyle_Gable', 'RoofStyle_Gambrel', 'RoofStyle_Hip',
       'RoofStyle_Mansard', 'RoofStyle_Shed', 'HouseStyle_1.5Unf',
       'HouseStyle_1Story', 'HouseStyle_2.5Fin', 'HouseStyle_2.5Unf',
       'HouseStyle_2Story', 'HouseStyle_SFoyer', 'HouseStyle_SLvl'],
      dtype=object)

In [50]:
pd.DataFrame(onehot_enc.fit_transform(housing_df[nominal_cols]), columns = onehot_enc.get_feature_names_out())

| | RoofStyle_Gable | RoofStyle_Gambrel | RoofStyle_Hip | RoofStyle_Mansard | RoofStyle_Shed | HouseStyle_1.5Unf | HouseStyle_1Story | HouseStyle_2.5Fin | HouseStyle_2.5Unf | HouseStyle_2Story | HouseStyle_SFoyer | HouseStyle_SLvl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1456 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1457 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1458 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Can also apply the inverse transform:

In [29]:
X_nom_trans = onehot_enc.fit_transform(housing_df[nominal_cols] )
X_nom_trans

array([[1., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

In [30]:
onehot_enc.inverse_transform(X_nom_trans)

array([['Gable', '2Story'],
       ['Gable', '1Story'],
       ['Gable', '2Story'],
       ...,
       ['Gable', '2Story'],
       ['Hip', '1Story'],
       ['Gable', '1Story']], dtype=object)

#### Some general advice on encoding many nominal variables

- Watch out for feature size explosion!
- Features are great, but...
    - Lots of features can lead to problems (we will see this later in **great detail**)
    - Can use up tons of memory (may need to store the encoding as a sparse matrix)
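As a rough illustration of the memory point, the sketch below one-hot encodes a made-up high-cardinality column (the `zip_code` column and its size are hypothetical) and compares sparse and dense storage. It relies on `OneHotEncoder` returning a SciPy sparse matrix by default:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality nominal column: 10,000 rows, up to 1,000 labels
rng = np.random.default_rng(0)
labels = pd.DataFrame({"zip_code": rng.integers(0, 1000, size=10_000).astype(str)})

# With default settings the encoder returns a sparse matrix that stores only the 1s
sparse_X = OneHotEncoder().fit_transform(labels)
dense_X = sparse_X.toarray()

# Bytes used by the sparse representation (values plus index arrays)
sparse_bytes = sparse_X.data.nbytes + sparse_X.indices.nbytes + sparse_X.indptr.nbytes
print(sparse_X.shape, "stored nonzeros:", sparse_X.nnz)
print("sparse bytes:", sparse_bytes, "dense bytes:", dense_X.nbytes)
```

Each row has exactly one nonzero entry, so the sparse form grows with the number of rows, while the dense form grows with rows times categories.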
