## Categorical Data
Typically, any categorical data attribute represents discrete values that belong to a specific finite set of categories or classes. When the attribute is the one a model is meant to predict (popularly known as the response variable), these values are often called classes or labels. The discrete values can be text or numeric in nature (or even unstructured data such as images!).

There are two major classes of categorical data: nominal and ordinal.

### Ordinal Data

Ordinal categorical attributes have some sense or notion of order amongst their values. Take shirt sizes, for instance: order (in this case, size) clearly matters when thinking about shirts (S is smaller than M, which is smaller than L, and so on).

[Figure: Shirt size as an ordinal categorical attribute]

Shoe sizes, education level and employment roles are some other examples of ordinal categorical attributes.
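
As a minimal sketch of how such an order can be made explicit in pandas, the example below builds an ordered categorical. The shirt-size data and the variable names (`sizes`, `size_order`) are made up for illustration and do not come from the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical shirt-size data; the category order is stated explicitly.
sizes = pd.Series(["M", "S", "XL", "L", "S"])
size_order = ["S", "M", "L", "XL"]

# An ordered categorical keeps the labels but knows that S < M < L < XL.
size_cat = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes gives the integer position of each value in size_order.
size_codes = size_cat.codes.tolist()
print(size_codes)

# Ordered categoricals support comparisons against a category label.
larger_than_m = (size_cat > "M").tolist()
print(larger_than_m)
```

Stating the order by hand matters here: sorting the labels alphabetically would put L before M and S, which is not the order of the sizes.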

### Nominal Data
In any nominal categorical data attribute, there is no concept of ordering amongst the values of that attribute.
Weather categories, movie, music and video game genres, country names, and food and cuisine types are examples of nominal categorical attributes.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv("vgsales.csv",encoding="utf-8")

In [4]:
df.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [5]:
# Genre is quite evidently a nominal categorical attribute, just like Publisher and Platform

# Unique video game genres

df.Genre.unique()

array(['Sports', 'Platform', 'Racing', 'Role-Playing', 'Puzzle', 'Misc',
       'Shooter', 'Simulation', 'Action', 'Fighting', 'Adventure',
       'Strategy'], dtype=object)

In [65]:
# Label encoding for these 12 values

from sklearn.preprocessing import LabelEncoder,OneHotEncoder

In [66]:
# df must still be the vgsales DataFrame loaded above when these cells run
le = LabelEncoder()
genre_label = le.fit_transform(df['Genre'])    # returns an array of integer labels
genre_mapping = {label: categ for label, categ in enumerate(le.classes_)}
genre_mapping

In [67]:
df['Genre_Label'] = genre_label
df.head()

#### Ordinal Data (sense of order amongst the values)

In [68]:
pdf = pd.read_csv(r"C:\Users\UMANG PATEL\Desktop\Data Science\Pokemon.csv")

We can see there are a total of 6 generations. Each Pokémon typically belongs to a specific generation based on the video games (when they were released), and the television series follows a similar timeline. This attribute is typically ordinal (domain knowledge is necessary here) because most Pokémon belonging to Generation 1 were introduced earlier in the video games and the television shows than those of Generation 2, and so on.


In [69]:
pdf.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [70]:
pdf.Generation.unique()     # six generations

array([1, 2, 3, 4, 5, 6], dtype=int64)

In [71]:
# gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 
#                'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}
# p_df['GenerationLabel'] = p_df['Generation'].map(gen_ord_map)

### Label Encoder and OneHotEncoder

What is One Hot Encoding?      
A one hot encoding is a representation of categorical variables as binary vectors.

This first requires that the categorical values be mapped to integer values.

Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.
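
The two steps described above can be sketched with NumPy alone. This is only an illustration with made-up category values (`cold`, `warm`, `hot`), not the encoder used in the cells that follow.

```python
import numpy as np

# Hypothetical category values.
values = np.array(["cold", "warm", "hot", "warm"])

# Step 1: map each category to an integer (sorted unique values -> 0..k-1).
categories, int_codes = np.unique(values, return_inverse=True)

# Step 2: row i of the identity matrix is the one-hot vector for integer i.
one_hot = np.eye(len(categories), dtype=int)[int_codes]

print(categories)
print(int_codes)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, placed at the index of that row's integer code.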



In [72]:
# Let's consider only the Pokémon Name, Generation and Legendary columns

# Categorical data to numeric labels
le = LabelEncoder()
leg_label = le.fit_transform(pdf['Legendary'])
pdf['Legendary_Label'] = leg_label


gl = LabelEncoder()
gen_label = gl.fit_transform(pdf['Generation'])
pdf['Generation_Label'] = gen_label
pdf2 = pdf[['Name','Generation','Generation_Label','Legendary','Legendary_Label']]

In [73]:
#onehot encoding scheme

ohe = OneHotEncoder()
legendary_array = ohe.fit_transform(pdf2[['Legendary_Label']]).toarray()
legendary_labels = le.classes_
legendary_df = pd.DataFrame(legendary_array,columns=legendary_labels)


ohe_g = OneHotEncoder()
generation_array = ohe_g.fit_transform(pdf2[['Generation_Label']]).toarray()
generation_labels = ['Generation'+str(label) for label in gl.classes_]
generation_df = pd.DataFrame(generation_array,columns=generation_labels)
generation_df

Unnamed: 0,Generation1,Generation2,Generation3,Generation4,Generation5,Generation6
0,1.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...
795,0.0,0.0,0.0,0.0,0.0,1.0
796,0.0,0.0,0.0,0.0,0.0,1.0
797,0.0,0.0,0.0,0.0,0.0,1.0
798,0.0,0.0,0.0,0.0,0.0,1.0


In [77]:
new_df = pd.concat([pdf2,legendary_df,generation_df],axis=1)
new_df.head()



Unnamed: 0,Name,Generation,Generation_Label,Legendary,Legendary_Label,False,True,Generation1,Generation2,Generation3,Generation4,Generation5,Generation6
0,Bulbasaur,1,0,False,0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,Ivysaur,1,0,False,0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,Venusaur,1,0,False,0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,VenusaurMega Venusaur,1,0,False,0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,Charmander,1,0,False,0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [84]:
inverted = le.inverse_transform([np.argmax(legendary_array[0, :])])  # np.argmax: numpy is already imported as np
inverted
legendary_array

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [0., 1.],
       [0., 1.],
       [0., 1.]])

In [93]:
# New, unseen data, to be encoded with the encoders already fitted above
new_poke_df = pd.DataFrame([['PikaZoom', 1, True], 
                           ['CharMyToast', 5, False]],
                       columns=['Name', 'Generation', 'Legendary'])
new_poke_df

Unnamed: 0,Name,Generation,Legendary
0,PikaZoom,1,True
1,CharMyToast,5,False


In [95]:
new_gen_labels = gl.transform(new_poke_df['Generation'])
new_poke_df['Gen_Label'] = new_gen_labels
new_leg_labels = le.transform(new_poke_df['Legendary'])
new_poke_df['Lgnd_Label'] = new_leg_labels
new_poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 
             'Lgnd_Label']]

Unnamed: 0,Name,Generation,Gen_Label,Legendary,Lgnd_Label
0,PikaZoom,1,0,True,1
1,CharMyToast,5,4,False,0


In [103]:
new_gen_feature_arr = ohe_g.transform(new_poke_df[['Gen_Label']]).toarray()
new_gen_features = pd.DataFrame(new_gen_feature_arr, 
                                columns=generation_labels)
new_leg_feature_arr = ohe.transform(new_poke_df[['Lgnd_Label']]).toarray()
new_leg_features = pd.DataFrame(new_leg_feature_arr, 
                                columns=legendary_labels)
new_poke_ohe = pd.concat([new_poke_df, new_gen_features, new_leg_features], axis=1)
# columns = sum([['Name', 'Generation', 'Gen_Label'], 
#                generation_labels,
#                ['Legendary', 'Lgnd_Label'], legendary_labels], [])
# new_poke_ohe[columns]
new_poke_ohe

Unnamed: 0,Name,Generation,Legendary,Gen_Label,Lgnd_Label,Generation1,Generation2,Generation3,Generation4,Generation5,Generation6,False,True
0,PikaZoom,1,True,0,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,CharMyToast,5,False,4,0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


### Manual Encoding

In [56]:

from numpy import argmax
# define input string
data = 'hello world'
# print(data)
# define universe of possible input values
alphabet = 'abcdefghijklmnopqrstuvwxyz '
# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# integer encode input data
integer_encoded = [char_to_int[char] for char in data]

print(integer_encoded)
# one hot encode
onehot_encoded = list()
for value in integer_encoded:
    letter = [0 for _ in range(len(alphabet))]
    letter[value] = 1
    onehot_encoded.append(letter)
print(onehot_encoded)
# invert encoding
inverted = int_to_char[argmax(onehot_encoded[0])]  # argmax returns the index of the max value, which is the letter's integer code
print(inverted)



[7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
h


## Encoding With Keras

In [83]:
from numpy import array
from numpy import argmax
# to_categorical lives in keras.utils (Keras must be installed)
from keras.utils import to_categorical

# define example
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
print(data)
# one hot encode
encoded = to_categorical(data)
print(encoded)
# invert encoding
inverted = argmax(encoded[0])
print(inverted)


[1 3 2 0 3 2 2 1 0 1]

## Effect Coding Scheme
The effect coding scheme is very similar to the dummy coding scheme, except that the rows for the reference category, which are all 0 in the dummy coding scheme, are replaced by all -1 in the effect coding scheme. This will become clearer with the following example.


In [107]:
gen_onehot_features = pd.get_dummies(pdf2['Generation'], dtype=float)
# Drop the last category (Generation 6); .copy() avoids the SettingWithCopyWarning
gen_effect_features = gen_onehot_features.iloc[:, :-1].copy()
# Rows that are all 0 belong to the dropped category and become all -1
gen_effect_features.loc[np.all(gen_effect_features == 0,
                               axis=1)] = -1.
pd.concat([pdf2[['Name', 'Generation']], gen_effect_features],
          axis=1)


Unnamed: 0,Name,Generation,1,2,3,4,5
0,Bulbasaur,1,1.0,0.0,0.0,0.0,0.0
1,Ivysaur,1,1.0,0.0,0.0,0.0,0.0
2,Venusaur,1,1.0,0.0,0.0,0.0,0.0
3,VenusaurMega Venusaur,1,1.0,0.0,0.0,0.0,0.0
4,Charmander,1,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
795,Diancie,6,-1.0,-1.0,-1.0,-1.0,-1.0
796,DiancieMega Diancie,6,-1.0,-1.0,-1.0,-1.0,-1.0
797,HoopaHoopa Confined,6,-1.0,-1.0,-1.0,-1.0,-1.0
798,HoopaHoopa Unbound,6,-1.0,-1.0,-1.0,-1.0,-1.0
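
The same idea can be checked on a smaller scale. The sketch below uses a made-up three-level `color` feature (not part of the Pokémon data) so that the dummy-coded and effect-coded rows can be compared by eye.

```python
import pandas as pd

# Hypothetical three-level feature.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Dummy coding: one-hot columns with the last category dropped as reference.
# Columns are sorted alphabetically (blue, green, red), so "red" is dropped.
dummies = pd.get_dummies(colors, dtype=float).iloc[:, :-1].copy()

# Effect coding: the reference rows (all zeros in dummy coding) become all -1.
effect = dummies.copy()
effect.loc[(dummies == 0).all(axis=1)] = -1.0
print(effect)
```

Only the row for the reference category (`red`) changes; every other row is identical under both schemes.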


### Bin-counting scheme 
It is a useful scheme for dealing with categorical variables that have many categories, for example IP addresses. In this scheme, instead of using the actual label values for encoding, we use probability-based statistical information about the value and the actual target or response value that we aim to predict in our modeling efforts. This scheme needs historical data as a prerequisite.
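
A minimal sketch of the idea, assuming a made-up click log with an `ip` key and a binary `clicked` target (both names are hypothetical): each key is replaced by statistics computed from historical rows, here the number of times it was seen and the observed click rate.

```python
import pandas as pd

# Hypothetical historical click log: a high-cardinality key and a binary target.
history = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.2", "10.0.0.3"],
    "clicked": [1, 0, 1, 1, 0, 0],
})

# Per-category statistics: how often the value was seen and P(clicked = 1 | ip).
stats = history.groupby("ip")["clicked"].agg(n_seen="count", p_click="mean")

# Replace the raw key by its statistics; unseen keys fall back to the global rate.
global_rate = history["clicked"].mean()
new_rows = pd.DataFrame({"ip": ["10.0.0.2", "10.9.9.9"]})
encoded = new_rows.join(stats, on="ip")
encoded["n_seen"] = encoded["n_seen"].fillna(0)
encoded["p_click"] = encoded["p_click"].fillna(global_rate)
print(encoded)
```

The fallback to the global rate for unseen keys is one possible choice, not the only one; other back-off strategies are equally plausible.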

In [25]:

# import the packages
import numpy as np
import pandas as pd
import category_encoders as ce

# make some data
df = pd.DataFrame({
 'color':["a", "b", "a", "c","d","e"], 
 'outcome':[1, 2, 4, 2, 2,1]})

# split into X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)

# instantiate an encoder - here we use BinaryEncoder, which writes each
# category's ordinal index in binary digits (it does not use the target)
ce_binary = ce.BinaryEncoder(cols = ['color'])

# fit and transform, and presto, you've got encoded data
ce_binary.fit_transform(X,y)             # a -> 1, b -> 2, c -> 3 and so on, written in binary across the columns
# The first column has no variance, so it isn't doing anything to help the model.

Unnamed: 0,color_0,color_1,color_2,color_3
0,0,0,0,1
1,0,0,1,0
2,0,0,0,1
3,0,0,1,1
4,0,1,0,0
5,0,1,0,1


### OrdinalEncoder

The following code transforms the color column values from letters to integers.


In [46]:
ce_oe = ce.OrdinalEncoder()
new_label = ce_oe.fit_transform(df['color'])
new_label
# print(ce_oe.get_feature_names)

Unnamed: 0,color
0,1
1,2
2,1
3,3
4,4
5,5


### Feature Hashing Scheme

In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.

The feature hashing scheme is another useful feature engineering scheme for dealing with large-scale categorical features. In this scheme, a hash function is typically used with the number of encoded features pre-set (as a vector of pre-defined length), such that the hashed values of the features are used as indices in this pre-defined vector and the values are updated accordingly. Since a hash function maps a large number of values into a small finite set of values, multiple different values might produce the same hash, which is termed a collision. A signed hash function is used to reduce the effect of such collisions.

In [10]:
from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=6, input_type='string')
hashed_features = fh.fit_transform(df['Genre'])
hashed_features = hashed_features.toarray()
pd.concat([df[['Name', 'Genre']], pd.DataFrame(hashed_features)], 
          axis=1).iloc[1:7]

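# Rows 1 and 6 denote the same genre of games, Platform, and have been rightly encoded into the same feature vector.
# Note: with input_type='string', each sample is treated as an iterable of strings, so a bare genre
# string is hashed character by character; that is why entries such as 2.0 and -2.0 appear below.
# Passing [[g] for g in df['Genre']] instead would hash each genre as a single token.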

Unnamed: 0,Name,Genre,0,1,2,3,4,5
1,Super Mario Bros.,Platform,0.0,2.0,2.0,-1.0,1.0,0.0
2,Mario Kart Wii,Racing,-1.0,0.0,0.0,0.0,0.0,-1.0
3,Wii Sports Resort,Sports,-2.0,2.0,0.0,-2.0,0.0,0.0
4,Pokemon Red/Pokemon Blue,Role-Playing,-1.0,1.0,2.0,0.0,1.0,-1.0
5,Tetris,Puzzle,0.0,1.0,1.0,-2.0,1.0,-1.0
6,New Super Mario Bros.,Platform,0.0,2.0,2.0,-1.0,1.0,0.0
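
To make the mechanics visible, here is a hand-rolled sketch of a signed hashing trick. It is an assumption-laden illustration: it uses MD5 from the standard library, which is not the hash function scikit-learn's `FeatureHasher` uses, and `hash_feature` is a hypothetical helper name.

```python
import hashlib

def hash_feature(value, n_features=6):
    """Hash one category string into a signed vector of length n_features."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % n_features  # slot for this value
    sign = 1 if digest[4] & 1 else -1                        # separate bits pick the sign
    vec = [0] * n_features
    vec[index] += sign
    return vec

for genre in ["Platform", "Racing", "Platform"]:
    print(genre, hash_feature(genre))
```

Identical inputs always land in the same slot with the same sign, while two different genres may share a slot, which is the collision case described above.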


## Count encoding 
• Replace categorical variables with their count in the train set  
• Useful for both linear and non-linear algorithms  
• Can be sensitive to outliers  
• A log transform may be added; it works well with counts  
• Replace unseen variables with `1`  
• May give collisions: the same encoding for different variables
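
The points above can be sketched in a few lines of pandas. The `city` column and the train/test frames are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test columns.
train = pd.DataFrame({"city": ["A", "B", "A", "C", "A", "B"]})
test = pd.DataFrame({"city": ["A", "C", "D"]})  # "D" never appears in train

# Count each category in the training set only.
counts = train["city"].value_counts()

# Map counts onto both sets; unseen categories get 1, as suggested above.
train["city_count"] = train["city"].map(counts)
test["city_count"] = test["city"].map(counts).fillna(1).astype(int)

# Optional log transform to dampen very frequent categories.
test["city_log_count"] = np.log1p(test["city_count"])
print(test)
```

In this toy data, `C` (seen once) and the unseen `D` both end up encoded as 1, which is an instance of the collision mentioned in the last bullet.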

## Label encoding
• Give every categorical variable a unique numerical ID  
• Useful for non-linear tree-based algorithms  
• Does not increase dimensionality  
• Randomize the cat_var -> num_id mapping and retrain, then average, for a small bump in accuracy  

Ex, label encoding sample: ["Queenstown"]  
city => city (encoded)  
Cherbourg => 1  
Queenstown => 2  
Southampton => 3  
Encoded: [2]
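
A plain-Python sketch of the city mapping above. The input list is hypothetical, and the IDs start at 1 only to match the sample; scikit-learn's `LabelEncoder`, used earlier in this notebook, starts at 0.

```python
# Hypothetical port-of-embarkation column.
cities = ["Southampton", "Cherbourg", "Queenstown", "Southampton"]

# Assign IDs 1..k to the sorted unique values, matching the mapping above.
city_to_id = {city: i for i, city in enumerate(sorted(set(cities)), start=1)}
id_to_city = {i: city for city, i in city_to_id.items()}

encoded = [city_to_id[c] for c in cities]
print(city_to_id)
print(encoded)
print([id_to_city[i] for i in encoded])  # the mapping is reversible
```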

## Pandas.get_dummies

In [116]:
df = pd.DataFrame({'A': ['a', 'b', 'a',np.nan], 'B': ['b', 'a', 'c','d'],
                   'C': [1, 2, 3,4]})

new_df=pd.get_dummies(df,prefix="Col",dummy_na=True)      # other options include drop_first, prefix_sep, etc.
print(new_df)

data_matrix = pd.get_dummies(pd.Series(list('umang_pua')))
data_matrix

   C  Col_a  Col_b  Col_nan  Col_a  Col_b  Col_c  Col_d  Col_nan
0  1      1      0        0      0      1      0      0        0
1  2      0      1        0      1      0      0      0        0
2  3      1      0        0      0      0      1      0        0
3  4      0      0        1      0      0      0      1        0


Unnamed: 0,_,a,g,m,n,p,u
0,0,0,0,0,0,0,1
1,0,0,0,1,0,0,0
2,0,1,0,0,0,0,0
3,0,0,0,0,1,0,0
4,0,0,1,0,0,0,0
5,1,0,0,0,0,0,0
6,0,0,0,0,0,1,0
7,0,0,0,0,0,0,1
8,0,1,0,0,0,0,0


### Dummy Encoding vs OneHotEncoding
Suppose there are 3 categorical variables, each of which has 4 levels. In dummy encoding, 3*4-3 = 9 variables are built alongside one intercept. In one-hot encoding, 3*4 = 12 variables are built without an intercept.


For example, when we use an intercept, R's model.matrix uses dummy encoding, with each of the variables w, x and y being turned into 3 dummy variables, plus an intercept column. So there is a total of 10 degrees of freedom.

When we don't use an intercept, model.matrix creates 4 dummy variables for w and 3 dummy variables each for x and y (and no intercept column). So the number of degrees of freedom is still 10.
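
The column counts can be checked quickly in pandas. The sketch below uses a made-up frame `df_cat` with three four-level variables named `w`, `x` and `y`; `drop_first=True` plays the role of dummy encoding.

```python
import pandas as pd

# Hypothetical data: three categorical variables with four levels each.
levels = ["a", "b", "c", "d"]
df_cat = pd.DataFrame({
    "w": levels,
    "x": levels[::-1],
    "y": ["a", "c", "b", "d"],
})

# One-hot encoding keeps every level: 3 * 4 = 12 columns.
one_hot = pd.get_dummies(df_cat)

# Dummy encoding drops one reference level per variable: 3 * 4 - 3 = 9 columns,
# with the dropped levels absorbed by the model's intercept.
dummy = pd.get_dummies(df_cat, drop_first=True)

print(one_hot.shape[1], dummy.shape[1])
```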

In [117]:
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

## Likelihood encoding
Likelihood encoding (also known as impact coding, mean encoding or target encoding) is applied to categorical features.
It basically creates a new feature from existing features and the target variable.

In [158]:
data = {
    'Feature_1':[1,2,3,4,5,6,7],
    'Feature_2':['A','A','B','A','A','B','B'],
    'Target':[1,1,0,0,1,1,1]
}
df = pd.DataFrame(data)
df

Unnamed: 0,Feature_1,Feature_2,Target
0,1,A,1
1,2,A,1
2,3,B,0
3,4,A,0
4,5,A,1
5,6,B,1
6,7,B,1


Feature_2 is the categorical variable that we want to mean encode with the help of Target.
When Feature_2 has the value 'A', there are 3 ones and 1 zero in the corresponding Target column, so the mean encoding for 'A' becomes 3/4 = 0.75.

### Classification

In [160]:
# Mean of Target per category, kept as a flat column named Target_mean
mean_encoding = (df.groupby("Feature_2")["Target"].mean()
                   .rename("Target_mean").reset_index())

new_df = pd.merge(df, mean_encoding, on="Feature_2", how="left")

new_df


Unnamed: 0,Feature_1,Feature_2,Target,Target_mean
0,1,A,1,0.75
1,2,A,1,0.75
2,3,B,0,0.666667
3,4,A,0,0.75
4,5,A,1,0.75
5,6,B,1,0.666667
6,7,B,1,0.666667


### Example 2: Regression Task
Let’s take a look at an example where Target is a continuous value   
Since here Target is continuous, we have more flexibility with generating new target encoded features.  
For example, we can take mean, mode, standard deviation or percentiles to create new features.

In [161]:
data = {
    'Feature_1':[1,2,3,4,5,6,7],
    'Feature_2':['A','A','B','A','A','B','B'],
    'Target':[1,2,3,4,1,5,5]
}
df = pd.DataFrame(data)
df

Unnamed: 0,Feature_1,Feature_2,Target
0,1,A,1
1,2,A,2
2,3,B,3
3,4,A,4
4,5,A,1
5,6,B,5
6,7,B,5


In [175]:
# Standard deviation and median of Target per category
std_df = df.groupby('Feature_2')['Target'].agg(['std']).reset_index()
std_df
med_df = df.groupby('Feature_2')['Target'].agg(['median']).reset_index()
med_df

new_df = df.merge(std_df,on="Feature_2")
new_df

Unnamed: 0,Feature_1,Feature_2,Target,std
0,1,A,1,1.414214
1,2,A,2,1.414214
2,4,A,4,1.414214
3,5,A,1,1.414214
4,3,B,3,1.154701
5,6,B,5,1.154701
6,7,B,5,1.154701


### Pitfalls
If target encoding is performed before the training and validation data split, information from the validation targets leaks into the encoded feature, so the model may simply overfit to the validation data and the results may not be reliable.
Therefore, encoding should be performed after the training and validation data split, using the training data only.
This method might also fail in cases where the feature has values that are rare in the data, since their statistics rest on very few rows.
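
One way to address both pitfalls is sketched below, under stated assumptions: the frames `train` and `valid` are made up, and the smoothing strength `m = 2` is an arbitrary choice. The statistics are computed on the training rows only, and rare categories are pulled toward the global mean.

```python
import pandas as pd

# Hypothetical data, already split into training and validation parts.
train = pd.DataFrame({
    "cat": ["A", "A", "A", "A", "B", "B", "C"],
    "target": [1, 1, 0, 1, 0, 1, 1],
})
valid = pd.DataFrame({"cat": ["A", "C", "D"]})

# Statistics come from the training rows only, so validation targets never leak.
global_mean = train["target"].mean()
stats = train.groupby("cat")["target"].agg(["mean", "count"])

# Additive smoothing (m is an assumed strength): rare categories are pulled
# toward the global mean instead of trusting one or two rows.
m = 2
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Categories unseen in training fall back to the global mean.
valid["cat_te"] = valid["cat"].map(smoothed).fillna(global_mean)
print(valid)
```

Category `C` has a raw mean of 1.0 from a single row; after smoothing it sits much closer to the global mean, which is the intended effect for rare values.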

### Overfitting

Overfitting is a modeling error that occurs when a function is fit too closely to a limited set of data points.

Overfitting is suspected when the model accuracy is high with respect to the data used in training the model but drops significantly with new data. Effectively, the model knows the training data well but does not generalize. This makes the model useless for purposes such as prediction.

In short, overfitting refers to a model that models the training data too well.
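
A small NumPy sketch of the symptom, using made-up data (a sine curve with a fixed alternating noise pattern) and polynomial fits as stand-ins for a simple and a very flexible model. The flexible fit reaches near-zero training error, yet its error on held-out points is larger than its training error.

```python
import numpy as np

# Hypothetical data: a sine curve plus a fixed, alternating noise pattern.
x_train = np.linspace(0.0, 1.0, 8)
noise = 0.2 * np.array([1, -1, 1, -1, 1, -1, 1, -1])
y_train = np.sin(2 * np.pi * x_train) + noise

# Held-out points between the training points, taken from the clean curve.
x_test = (x_train[:-1] + x_train[1:]) / 2
y_test = np.sin(2 * np.pi * x_test)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # too rigid to chase the noise
flexible = np.polyfit(x_train, y_train, deg=7)  # passes through all 8 points

for name, coeffs in [("degree 1", simple), ("degree 7", flexible)]:
    print(name, "train:", mse(coeffs, x_train, y_train),
          "test:", mse(coeffs, x_test, y_test))
```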

In [180]:
import tsfresh as ts