# Feature Engineering on Categorical Data

In [2]:
import numpy as np
import pandas as pd

vg_df = pd.read_csv('Data/videogame_sales.csv')
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher
0,Wii Sports,Wii,2006.0,Sports,Nintendo
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo


### Transforming Nominal Features
Note that mapping each genre to an integer does not turn the transformed feature into a numeric feature. It is still a discrete-valued categorical feature, with a number standing in for each genre's text label.

In [3]:
# Take a look at the unique values of Genre (Just to have an idea and we don't use this in our transformation)
genres = np.unique(vg_df['Genre'])
genres

array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

In [4]:
# Create mapping scheme for genres
from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings

{0: 'Action',
 1: 'Adventure',
 2: 'Fighting',
 3: 'Misc',
 4: 'Platform',
 5: 'Puzzle',
 6: 'Racing',
 7: 'Role-Playing',
 8: 'Shooter',
 9: 'Simulation',
 10: 'Sports',
 11: 'Strategy'}

In [5]:
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher', 'GenreLabel']].head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,GenreLabel
0,Wii Sports,Wii,2006.0,Sports,Nintendo,10
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,4
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,6
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,10
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,7
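
As a rough cross-check of the mapping above, the same kind of integer labelling can be sketched with pandas alone. The snippet below is a minimal illustration on a hypothetical toy series (not the real dataset); it assumes that pandas sorts string categories alphabetically, which mirrors the ordering seen in `gle.classes_`, and it shows how the codes can be mapped back to the original labels.

```python
import pandas as pd

# Toy stand-in for the Genre column (hypothetical values, not the real dataset)
genres = pd.Series(['Sports', 'Platform', 'Racing', 'Sports', 'Role-Playing'])

# Converting to a categorical dtype sorts the distinct labels alphabetically
genre_cat = genres.astype('category')
genre_codes = genre_cat.cat.codes

# Build the code -> label mapping and use it to decode the integers again
code_to_label = dict(enumerate(genre_cat.cat.categories))
decoded = genre_codes.map(code_to_label)

print(code_to_label)
print(genre_codes.tolist())
print(decoded.tolist())
```

Keeping the code-to-label mapping around is what makes the integers interpretable later; the integers by themselves carry no meaning.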


### Transforming Ordinal Features
Ordinal features have a natural order among their categories, so the mapping should preserve that order. As before, the transformed feature is still a discrete-valued categorical feature, with a number standing in for each generation's text label.

In [6]:
poke_df = pd.read_csv('Data/pokemon.csv')
poke_df.head()

# Take a look at the unique values of Generation (just for reference; not used in the transformation)
np.unique(poke_df['Generation'])

array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)

In [7]:
# Manually map ordinals
gen_ord_map = {'Gen 1':1, 'Gen 2':2, 'Gen 3':3, 'Gen 4':4, 'Gen 5':5, 'Gen 6':6}
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)

poke_df[['Name', 'Generation', 'GenerationLabel']].head()

Unnamed: 0,Name,Generation,GenerationLabel
0,Bulbasaur,Gen 1,1
1,Ivysaur,Gen 1,1
2,Venusaur,Gen 1,1
3,VenusaurMega Venusaur,Gen 1,1
4,Charmander,Gen 1,1
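
An alternative to the hand-written dictionary is an ordered pandas categorical, which stores the order explicitly. The sketch below runs on a hypothetical toy series rather than the real data; `gen_order`, `gen_dtype` and `gen_rank` are names introduced here for illustration only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy stand-in for the Generation column (hypothetical rows)
gens = pd.Series(['Gen 2', 'Gen 1', 'Gen 6', 'Gen 3'])

# Declare the categories in their intended order
gen_order = ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6']
gen_dtype = CategoricalDtype(categories=gen_order, ordered=True)
gen_cat = gens.astype(gen_dtype)

# Codes start at 0, so add 1 to match the 1..6 mapping used above
gen_rank = gen_cat.cat.codes + 1

print(gen_rank.tolist())
# Ordered categoricals also support comparisons against a category
print((gen_cat > 'Gen 2').tolist())
```

Compared with a plain dictionary, this keeps the order attached to the column itself, so comparisons and sorting respect it.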


## Encoding Categorical Features
The integer labels produced by the transformations above **should not be fed to machine learning algorithms directly**: many algorithms would treat them as raw numeric features and **wrongly infer a notion of magnitude**. The mapped values only stand in for categories and have **no numeric meaning**. Instead, we use **encoding schemes** in which **dummy features are created** for each **unique value** or label among the **distinct categories of a feature**.
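
To make the magnitude problem concrete, here is a small sketch using three of the genre labels from the mapping above (Action = 0, Puzzle = 5, Strategy = 11). The helper `one_hot` is a hypothetical function written only for this illustration. With raw integer labels, Strategy looks more than twice as far from Action as Puzzle does; with one dummy feature per genre, every pair of distinct genres is equally far apart.

```python
import numpy as np

# Integer labels taken from the genre mapping shown earlier
labels = {'Action': 0, 'Puzzle': 5, 'Strategy': 11}

def one_hot(index, size=12):
    """Return a vector of zeros with a single 1 at the given position."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

# Distances implied by the raw integer labels
label_gap_near = abs(labels['Puzzle'] - labels['Action'])
label_gap_far = abs(labels['Strategy'] - labels['Action'])

# Euclidean distances between the corresponding one hot vectors
onehot_gap_near = np.linalg.norm(one_hot(labels['Puzzle']) - one_hot(labels['Action']))
onehot_gap_far = np.linalg.norm(one_hot(labels['Strategy']) - one_hot(labels['Action']))

print(label_gap_near, label_gap_far)
print(onehot_gap_near, onehot_gap_far)
```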

### 1) One Hot Encoding Scheme
Given a categorical feature with **m labels**, the one hot encoding scheme transforms it into **m binary features**, each of which can only contain a value of 1 or 0.

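Note: Older versions of scikit-learn's `OneHotEncoder` only accepted **numerical** category labels, which is why the **textual** labels are first converted with `LabelEncoder` below. Recent versions can also encode string categories directly.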

In [8]:
# We apply one hot encoding scheme to the feature - Generation
poke_df = pd.read_csv('Data/pokemon.csv')
poke_df[['Name', 'Generation']].head()

Unnamed: 0,Name,Generation
0,Bulbasaur,Gen 1
1,Ivysaur,Gen 1
2,Venusaur,Gen 1
3,VenusaurMega Venusaur,Gen 1
4,Charmander,Gen 1


In [9]:
# Take a look at the unique values of Generation (just for reference; not used in the transformation)
print(np.unique(poke_df['Generation']))

['Gen 1' 'Gen 2' 'Gen 3' 'Gen 4' 'Gen 5' 'Gen 6']


In [10]:
# Let us transform the text labels to numeric representation
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

gen_le = LabelEncoder()
# Creates an array of labels based on 'Generation'
gen_labels = gen_le.fit_transform(poke_df['Generation'])
# Add generated Label column to the data frame 
poke_df['Gen_Label'] = gen_labels

poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label']]
poke_df_sub.head()

Unnamed: 0,Name,Generation,Gen_Label
0,Bulbasaur,Gen 1,0
1,Ivysaur,Gen 1,0
2,Venusaur,Gen 1,0
3,VenusaurMega Venusaur,Gen 1,0
4,Charmander,Gen 1,0


In [11]:
# Apply One hot encoding
gen_ohe = OneHotEncoder(categories='auto')
# Use array generated by Label encoder and create array vector for each row 
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
# Get feature label titles
gen_feature_labels = list(gen_le.classes_)
# Create data frame of newly generated features
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)

In [12]:
poke_df_ohe = pd.concat([poke_df_sub, gen_features], axis=1)
poke_df_ohe.head()

Unnamed: 0,Name,Generation,Gen_Label,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6
0,Bulbasaur,Gen 1,0,1.0,0.0,0.0,0.0,0.0,0.0
1,Ivysaur,Gen 1,0,1.0,0.0,0.0,0.0,0.0,0.0
2,Venusaur,Gen 1,0,1.0,0.0,0.0,0.0,0.0,0.0
3,VenusaurMega Venusaur,Gen 1,0,1.0,0.0,0.0,0.0,0.0,0.0
4,Charmander,Gen 1,0,1.0,0.0,0.0,0.0,0.0,0.0
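
In recent scikit-learn releases the intermediate `LabelEncoder` step is optional, because `OneHotEncoder` can be fitted on string categories directly. The following is a minimal sketch on a hypothetical three-row frame (`toy_df` is not the real dataset); it assumes a scikit-learn version that accepts string input and uses `handle_unknown='ignore'` so that unseen categories are encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row frame standing in for poke_df
toy_df = pd.DataFrame({'Name': ['Bulbasaur', 'Chikorita', 'Treecko'],
                       'Generation': ['Gen 1', 'Gen 2', 'Gen 3']})

# Fit directly on the text column; note the double brackets to pass a 2D frame
ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(toy_df[['Generation']]).toarray()

# categories_ holds the learned labels, which double as column names
onehot_df = pd.DataFrame(encoded, columns=ohe.categories_[0], index=toy_df.index)
result = pd.concat([toy_df, onehot_df], axis=1)
print(result)

# A category not seen during fit becomes a row of zeros
unseen = ohe.transform(pd.DataFrame({'Generation': ['Gen 9']})).toarray()
print(unseen)
```

Fitting the encoder once and reusing `transform` on new data keeps the column layout consistent between training and prediction.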


### 2) Dummy Coding Scheme

The dummy coding scheme is similar to the one hot encoding scheme, except that when applied to a categorical feature with **m distinct labels**, it produces **m-1 binary features**. Each value of the categorical variable is thus converted into a vector of size m-1. One category is dropped and serves as the reference: if the category values range over {0, 1, ..., m-1}, **either the 0th or the (m-1)th category is usually represented by a vector of all zeros (0)**.

In [13]:
poke_df = pd.read_csv('Data/pokemon.csv')
poke_df[['Name', 'Generation']].head()

Unnamed: 0,Name,Generation
0,Bulbasaur,Gen 1
1,Ivysaur,Gen 1
2,Venusaur,Gen 1
3,VenusaurMega Venusaur,Gen 1
4,Charmander,Gen 1


In [14]:
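# Generate features using Dummy Coding Scheme (dropping the first category, Gen 1)
# dtype=int keeps the 0/1 output shown below; newer pandas versions default to booleans
gen_dummy_features = pd.get_dummies(poke_df['Generation'], drop_first=True, dtype=int)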

df = pd.concat([poke_df[['Name', 'Generation']], gen_dummy_features], axis=1)
df.groupby(['Generation']).head()

Unnamed: 0,Name,Generation,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6
0,Bulbasaur,Gen 1,0,0,0,0,0
1,Ivysaur,Gen 1,0,0,0,0,0
2,Venusaur,Gen 1,0,0,0,0,0
3,VenusaurMega Venusaur,Gen 1,0,0,0,0,0
4,Charmander,Gen 1,0,0,0,0,0
166,Chikorita,Gen 2,1,0,0,0,0
167,Bayleef,Gen 2,1,0,0,0,0
168,Meganium,Gen 2,1,0,0,0,0
169,Cyndaquil,Gen 2,1,0,0,0,0
170,Quilava,Gen 2,1,0,0,0,0
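
One common reason for dropping a category, offered here as background and not as something the notebook itself demonstrates, is that the m one hot columns always sum to 1. Together with an intercept column they are therefore linearly dependent, which can be a problem for linear models. The sketch below checks this on a hypothetical toy series by comparing matrix ranks; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Generation column with three distinct labels
gens = pd.Series(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 1', 'Gen 3'])
intercept = np.ones((len(gens), 1))

# m columns (one hot) versus m-1 columns (dummy coding)
full = pd.get_dummies(gens, dtype=float).to_numpy()
dummy = pd.get_dummies(gens, drop_first=True, dtype=float).to_numpy()

X_full = np.hstack([intercept, full])    # intercept + 3 one hot columns
X_dummy = np.hstack([intercept, dummy])  # intercept + 2 dummy columns

rank_full = np.linalg.matrix_rank(X_full)
rank_dummy = np.linalg.matrix_rank(X_dummy)

print(X_full.shape[1], rank_full)    # more columns than rank: redundant
print(X_dummy.shape[1], rank_dummy)  # columns equal rank: no redundancy
```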


### 3) Effect Coding Scheme

**The effect coding scheme** is **very similar to the dummy coding scheme** in most aspects. However, **the
encoded feature vector** for the category that is **represented by all 0s** in the dummy coding scheme
**is replaced by all -1s** in the effect coding scheme.

In [85]:
poke_df = pd.read_csv('Data/pokemon.csv')
poke_df[['Name', 'Generation']].head()

Unnamed: 0,Name,Generation
0,Bulbasaur,Gen 1
1,Ivysaur,Gen 1
2,Venusaur,Gen 1
3,VenusaurMega Venusaur,Gen 1
4,Charmander,Gen 1


In [86]:
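# Generate one hot columns for 'Generation'
# dtype=int keeps the 0/1 output shown below; newer pandas versions default to booleans
gen_onehot_features = pd.get_dummies(poke_df['Generation'], dtype=int)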
gen_onehot_features.head()

Unnamed: 0,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5,Gen 6
0,1,0,0,0,0,0
1,1,0,0,0,0,0
2,1,0,0,0,0,0
3,1,0,0,0,0,0
4,1,0,0,0,0,0


In [87]:
# Keep every one hot column except the last one (Gen 6 becomes the reference category).
# astype(float) returns a new frame, which avoids pandas' SettingWithCopyWarning and lets the columns hold -1.
gen_effect_features = gen_onehot_features.iloc[:, :-1].astype(float)
# Rows that are all zeros belong to the dropped category; set them to -1
gen_effect_features.loc[np.all(gen_effect_features == 0, axis=1)] = -1.
# Add the generated features to the data frame
pd.concat([poke_df[['Name', 'Generation']], gen_effect_features], axis=1)


Unnamed: 0,Name,Generation,Gen 1,Gen 2,Gen 3,Gen 4,Gen 5
0,Bulbasaur,Gen 1,1.0,0.0,0.0,0.0,0.0
1,Ivysaur,Gen 1,1.0,0.0,0.0,0.0,0.0
2,Venusaur,Gen 1,1.0,0.0,0.0,0.0,0.0
3,VenusaurMega Venusaur,Gen 1,1.0,0.0,0.0,0.0,0.0
4,Charmander,Gen 1,1.0,0.0,0.0,0.0,0.0
5,Charmeleon,Gen 1,1.0,0.0,0.0,0.0,0.0
6,Charizard,Gen 1,1.0,0.0,0.0,0.0,0.0
7,CharizardMega Charizard X,Gen 1,1.0,0.0,0.0,0.0,0.0
8,CharizardMega Charizard Y,Gen 1,1.0,0.0,0.0,0.0,0.0
9,Squirtle,Gen 1,1.0,0.0,0.0,0.0,0.0
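
The rows displayed above all belong to Gen 1, so the -1 rows of the reference category are not visible. The following sketch repeats the same idea on a hypothetical four-row series in which the last category (Gen 3 here) plays the role of the reference, so that the effect is visible at a glance. The names `onehot`, `effect` and `is_reference` are introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for the Generation column; Gen 3 will be the reference category
gens = pd.Series(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 3'], name='Generation')

# Start from the full one hot encoding as floats so the columns can hold -1
onehot = pd.get_dummies(gens, dtype=float)

# Drop the last column and work on an explicit copy
effect = onehot.iloc[:, :-1].copy()

# Rows of the dropped category get -1 in every remaining column
is_reference = onehot.iloc[:, -1] == 1
effect.loc[is_reference, :] = -1.0

print(pd.concat([gens, effect], axis=1))
```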
