## TYPES OF DATA:

**DISCRETE**
Discrete data takes only countable values, typically whole numbers rather than decimals. Example: the number of students in a class.

**CONTINUOUS**
Continuous data can take any value within a range and is usually measured rather than counted. Example: height.

**NOMINAL**
Nominal data represents qualitative information without any order. Each value is a distinct label. Examples: gender (male/female), eye color.

**ORDINAL**
Ordinal data represents qualitative information with an order, so the categories can be ranked. Example: high/medium/low.


## FEATURE ENCODING:

Encoding categorical data is the process of converting categorical values into a numeric format so that the data can be passed to a model, which generally needs numeric input to make predictions. Common techniques are:

- Label Encoding or Ordinal Encoding
- One-Hot Encoding
- Effect Encoding
- Binary Encoding
- Base-N Encoding
- Hash Encoding
- Target Encoding

## Label Encoding or Ordinal Encoding

This type of encoding is used when the variable is ordinal. Ordinal encoding converts each label into an integer, and the integers preserve the order of the labels.

In [1]:
import category_encoders as ce
import pandas as pd

df=pd.DataFrame({'height':['tall','medium','short','tall','medium','short','tall','medium','short',]})


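# Create the ordinal encoder with an explicit label-to-integer mapping
encoder = ce.OrdinalEncoder(cols=['height'], return_df=True,
                            mapping=[{'col': 'height', 'mapping': {'None': 0, 'tall': 1, 'medium': 2, 'short': 3}}])

# Original data
print(df)

# Fit and transform, then store the encoded column next to the original one
df['transformed'] = encoder.fit_transform(df)['height']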

print(df)

   height
0    tall
1  medium
2   short
3    tall
4  medium
5   short
6    tall
7  medium
8   short
   height  transformed
0    tall            1
1  medium            2
2   short            3
3    tall            1
4  medium            2
5   short            3
6    tall            1
7  medium            2
8   short            3
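
The idea behind the example above can also be sketched without any library. The snippet below is only an illustration: the function name `ordinal_encode`, the `unknown` fallback value and the ascending order short < medium < tall are assumptions made here, not part of `category_encoders`.

```python
# Library-free sketch of ordinal encoding with an explicit mapping.
# Assumed ordering for illustration: short < medium < tall.
height_order = {'short': 1, 'medium': 2, 'tall': 3}

def ordinal_encode(values, mapping, unknown=-1):
    """Replace each label by its integer rank; unseen labels get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

heights = ['tall', 'medium', 'short', 'tall']
encoded = ordinal_encode(heights, height_order)
print(encoded)  # [3, 2, 1, 3]
```

The mapping in the notebook cell runs in the opposite direction (tall=1, short=3). Either direction keeps the ranking intact, as long as the integers increase or decrease consistently with the order of the labels.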


## One-Hot Encoding
In One-Hot Encoding, each category of a categorical variable gets its own new binary variable: 1 if the row belongs to that category, 0 otherwise. This type of encoding is used when the data is nominal. The newly created binary features are often called dummy variables, and their number equals the number of distinct categories in the data.

In [2]:
df=pd.DataFrame({'name':['rahul','ashok','ankit','aditya','yash','vipin','amit']})

encoder=ce.OneHotEncoder(cols='name',handle_unknown='return_nan',return_df=True,use_cat_names=True)

#Original Data
print(df)

#Fit and transform Data
df_encoded = encoder.fit_transform(df)
print(df_encoded)

     name
0   rahul
1   ashok
2   ankit
3  aditya
4    yash
5   vipin
6    amit
   name_rahul  name_ashok  name_ankit  name_aditya  name_yash  name_vipin  \
0         1.0         0.0         0.0          0.0        0.0         0.0   
1         0.0         1.0         0.0          0.0        0.0         0.0   
2         0.0         0.0         1.0          0.0        0.0         0.0   
3         0.0         0.0         0.0          1.0        0.0         0.0   
4         0.0         0.0         0.0          0.0        1.0         0.0   
5         0.0         0.0         0.0          0.0        0.0         1.0   
6         0.0         0.0         0.0          0.0        0.0         0.0   

   name_amit  
0        0.0  
1        0.0  
2        0.0  
3        0.0  
4        0.0  
5        0.0  
6        1.0  
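
As a rough illustration of what the encoder does, one-hot encoding can be written in a few lines of plain Python. The helper name `one_hot_encode` is hypothetical, and the column order (order of first appearance) is an assumption of this sketch.

```python
def one_hot_encode(values):
    """Return the category list and one 0/1 row per input value."""
    # dict.fromkeys keeps the order of first appearance and drops duplicates
    categories = list(dict.fromkeys(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, rows = one_hot_encode(['rahul', 'ashok', 'ankit', 'rahul'])
print(categories)  # ['rahul', 'ashok', 'ankit']
for row in rows:
    print(row)
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, which is why one-hot encoding becomes expensive for features with many distinct values.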


## Effect Encoding
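In this type of encoding (also called sum or deviation coding), categories are encoded with the values -1, 0 and 1. It works like One-Hot encoding with one column dropped: one category is chosen as the reference and is represented by -1 in every remaining column instead of getting a column of its own.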

In [3]:
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']}) 

encoder=ce.SumEncoder(cols=['City'],verbose=False)

#Original Data

print(data)

df=encoder.fit_transform(data)

df

        City
0      Delhi
1     Mumbai
2  Hyderabad
3    Chennai
4  Bangalore
5      Delhi
6  Hyderabad




   intercept  City_0  City_1  City_2  City_3
0          1     1.0     0.0     0.0     0.0
1          1     0.0     1.0     0.0     0.0
2          1     0.0     0.0     1.0     0.0
3          1     0.0     0.0     0.0     1.0
4          1    -1.0    -1.0    -1.0    -1.0
5          1     1.0     0.0     0.0     0.0
6          1     0.0     0.0     1.0     0.0


In the output above, the encoder has assigned -1 to Bangalore in every dummy variable, which makes Bangalore the reference category. Every other city gets 1 in its own column and 0 elsewhere, so the encoded columns contain only the values -1, 0 and 1.
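
The rule can be sketched in plain Python. This is a minimal illustration, assuming the last category in order of first appearance is used as the reference (which matches Bangalore in the example); the helper name `effect_encode` is hypothetical and the intercept column is left out.

```python
def effect_encode(values):
    """Effect (sum) coding: k categories become k-1 columns of -1/0/1."""
    categories = list(dict.fromkeys(values))  # order of first appearance
    reference = categories[-1]                # assumed reference category
    kept = categories[:-1]                    # categories that get a column
    rows = []
    for v in values:
        if v == reference:
            rows.append([-1] * len(kept))     # reference: -1 in every column
        else:
            rows.append([1 if v == c else 0 for c in kept])
    return kept, reference, rows

cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']
kept, reference, rows = effect_encode(cities)
print(reference)  # Bangalore
print(rows[4])    # [-1, -1, -1, -1]
```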

## Hash Encoding
Just like One-Hot encoding, the hash encoder represents categories with new binary variables, but here we can fix the number of new variables in advance.
Hashing maps an input of arbitrary size to a fixed-size value, so the number of output columns does not grow with the number of categories.

In [4]:
data=pd.DataFrame({'Month':['January','April','March','April','Februay','June','July','June','September']})

#Create object for hash encoder
encoder=ce.HashingEncoder(cols='Month',n_components=6)

#Fit and Transform Data
encoder.fit_transform(data)

   col_0  col_1  col_2  col_3  col_4  col_5
0      0      0      0      0      1      0
1      0      0      0      1      0      0
2      0      0      0      0      1      0
3      0      0      0      1      0      0
4      0      0      0      1      0      0
5      0      1      0      0      0      0
6      1      0      0      0      0      0
7      0      1      0      0      0      0
8      0      0      0      0      1      0


Unlike the other encoders, hashing is a one-way technique: the hash encoder's output cannot be converted back into the input. Different categories can also collide in the same column (in the output above, January, March and September all fall into col_4), so hashing may lose information. It is best suited to categorical features with a very large number of distinct values.

In the above example, the hashing encoder produced 6 dummy variables instead of its default of 8, simply because we set n_components=6.
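
The hashing step itself can be sketched as follows. This is only an illustration of the general idea: the choice of MD5 and the helper name `hash_encode` are assumptions of this sketch, and it is not guaranteed to reproduce the exact columns that `HashingEncoder` produced above.

```python
import hashlib

def hash_encode(value, n_components=6):
    """Map a category to one of `n_components` columns via a hash."""
    # Hash the string, interpret the digest as an integer, and take it
    # modulo the number of columns to pick the column index.
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    index = int(digest, 16) % n_components
    row = [0] * n_components
    row[index] = 1
    return row

for month in ['January', 'April', 'March', 'April']:
    print(month, hash_encode(month))
```

Because the column index depends only on the hash, the same category always lands in the same column, while two different categories may share one. That collision is the information loss described above.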

## Binary Encoding
In hash encoding, we saw that hashing can lose information, and in one-hot encoding we saw that the dimensionality of the data grows with the number of categories. Binary encoding sits between the two: each category is first assigned an integer, and that integer is then written in binary, with one column per bit. The result needs far fewer columns than one-hot encoding and, unlike hashing, every category keeps a distinct code.

In that sense, binary encoding can be seen as a combination of hash encoding and one-hot encoding.

In [5]:
data=pd.DataFrame({'Month':['January','April','March','April','February','June','July','June','September']})

encoder= ce.BinaryEncoder(cols=['Month'],return_df=True)
data=encoder.fit_transform(data) 
data

   Month_0  Month_1  Month_2
0        0        0        1
1        0        1        0
2        0        1        1
3        0        1        0
4        1        0        0
5        1        0        1
6        1        1        0
7        1        0        1
8        1        1        1
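
The table above can be reproduced by hand with a short sketch. It assumes that categories are numbered from 1 in order of first appearance, which is consistent with the output shown; the helper name `binary_encode` is hypothetical.

```python
def binary_encode(values):
    """Number categories from 1, then write each number as binary digits."""
    # Integer code per category, in order of first appearance (assumption)
    codes = {c: i for i, c in enumerate(dict.fromkeys(values), start=1)}
    # Number of bits needed for the largest code
    width = max(codes.values()).bit_length()
    return [[int(bit) for bit in format(codes[v], f'0{width}b')] for v in values]

months = ['January', 'April', 'March', 'April', 'February',
          'June', 'July', 'June', 'September']
bits = binary_encode(months)
print(bits[0])  # [0, 0, 1]  (January has code 1)
print(bits[8])  # [1, 1, 1]  (September has code 7)
```

Seven distinct months fit into 3 columns here, where one-hot encoding would need 7.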


## Base-N Encoding
In a positional number system, the base (or radix) is the number of unique digits, including zero, used to represent numbers. Base-N encoding assigns each category an integer and writes that integer in the chosen base, with one column per digit. With base 2 this is the same as binary encoding, and in `category_encoders` a base of 1 is treated as one-hot encoding. With base 10, each column holds a digit between 0 and 9.

In [6]:
data=pd.DataFrame({'Month':['January','April','March','April','February','June','July','June','September']})

#Create an object for Base N Encoding
encoder= ce.BaseNEncoder(cols=['Month'],return_df=True,base=5)

#Fit and Transform Data
data_encoded=encoder.fit_transform(data)

data_encoded

   Month_0  Month_1
0        0        1
1        0        2
2        0        3
3        0        2
4        0        4
5        1        0
6        1        1
7        1        0
8        1        2


In the above output, we used base 5. The result is similar to binary encoding, but where binary encoding needed 3 columns for these categories, base 5 needs only 2, and the digits range from 0 to 4.

If we do not define the base, it defaults to 2, which performs binary encoding.
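
The digit conversion behind Base-N encoding can be sketched as below. The sketch assumes a base of at least 2 and categories numbered from 1 in order of first appearance, which is consistent with the base-5 output above; the helper names `to_base_n` and `base_n_encode` are hypothetical.

```python
def to_base_n(number, base, width):
    """Return `number` as a list of `width` digits in the given base."""
    digits = []
    for _ in range(width):
        number, digit = divmod(number, base)
        digits.append(digit)
    return digits[::-1]  # most significant digit first

def base_n_encode(values, base):
    """Number categories from 1, then write each number in base `base` (>= 2)."""
    codes = {c: i for i, c in enumerate(dict.fromkeys(values), start=1)}
    max_code = max(codes.values())
    width = 1
    while base ** width <= max_code:  # digits needed for the largest code
        width += 1
    return [to_base_n(codes[v], base, width) for v in values]

months = ['January', 'April', 'March', 'April', 'February',
          'June', 'July', 'June', 'September']
base5 = base_n_encode(months, base=5)
print(base5[4])  # [0, 4]  (February has code 4)
print(base5[5])  # [1, 0]  (June has code 5)
```

Calling the same function with `base=2` gives three columns per row, the same layout as binary encoding, which illustrates the trade-off: a larger base means fewer columns but more distinct values per column.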

## Target Encoding
Target encoding replaces each categorical value with the mean of the target variable for that category. It is a type of Bayesian encoding: Bayesian encoders use the target variable to encode the categorical values.

The target encoder calculates the mean of the target variable for each category and then substitutes that mean for the category.

In [7]:
df=pd.DataFrame({'name':['rahul','ashok','ankit','rahul','ashok','ankit'],'marks' : [10,20,30,60,70,80,]})

print(df)

#Create target encoding object
encoder=ce.TargetEncoder(cols='name') 

#Fit and Transform Train Data
encoder.fit_transform(df['name'],df['marks'])

    name  marks
0  rahul     10
1  ashok     20
2  ankit     30
3  rahul     60
4  ashok     70
5  ankit     80


        name
0  43.581489
1  45.000000
2  46.418511
3  43.581489
4  45.000000
5  46.418511


Here the student names have been replaced by values derived from the mean of their marks. The values are not the raw per-student means (35, 45 and 55): the encoder smooths each category mean toward the overall mean of 45, which has a strong effect here because each name appears only twice. Target encoding can handle any number of categories without adding columns. However, it can cause overfitting, because the encoded feature is computed from the target itself and is therefore strongly correlated with it.

A model trained this way can look accurate during training and still perform poorly on test data.
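
The difference between raw and smoothed category means can be seen in a small sketch. The additive smoothing formula used here, `(sum + m * global_mean) / (count + m)`, is a simple scheme chosen for illustration and is not claimed to be the formula `TargetEncoder` uses; the names `fit_target_encoding`, `transform` and the parameter `m` are hypothetical.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, m=0.0):
    """Learn one value per category; m > 0 pulls it toward the global mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    encoding = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in sums}
    return encoding, global_mean

def transform(categories, encoding, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [encoding.get(c, global_mean) for c in categories]

names = ['rahul', 'ashok', 'ankit', 'rahul', 'ashok', 'ankit']
marks = [10, 20, 30, 60, 70, 80]

raw, overall = fit_target_encoding(names, marks)            # plain means
smoothed, _ = fit_target_encoding(names, marks, m=2.0)      # shrunk toward 45
print(raw)       # {'rahul': 35.0, 'ashok': 45.0, 'ankit': 55.0}
print(smoothed)  # {'rahul': 40.0, 'ashok': 45.0, 'ankit': 50.0}
```

To limit the overfitting described above, the encoding should be fitted on the training rows only and then applied to the test rows with `transform`, so that test targets never influence the encoded feature.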