# Miscellaneous 

## How to Deal with Categorical Data for Machine Learning
https://www.kdnuggets.com/2021/05/deal-with-categorical-data-machine-learning.html?utm_source=pocket_mylist

In [3]:
import pandas as pd
import sklearn

#pip install category_encoders

import category_encoders as ce

In [4]:
data = pd.DataFrame({ 'gender' : ['Male', 'Female', 'Male', 'Female', 'Female'],
                       'class' : ['A','B','C','D','A'],
                        'city' : ['Delhi','Gurugram','Delhi','Delhi','Gurugram'] })
data.head()

|   | gender | class | city |
|---|--------|-------|------|
| 0 | Male | A | Delhi |
| 1 | Female | B | Gurugram |
| 2 | Male | C | Delhi |
| 3 | Female | D | Delhi |
| 4 | Female | A | Gurugram |


Implementing one-hot encoding through category_encoders

In this method, each category is mapped to a vector of 1s and 0s denoting the presence or absence of that category. The length of the vector equals the number of distinct categories in the feature.

Create an object of the one-hot encoder:

In [5]:
ce_OHE = ce.OneHotEncoder(cols=['gender','city'])

data1 = ce_OHE.fit_transform(data)
data1.head()

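|   | gender_1 | gender_2 | class | city_1 | city_2 |
|---|----------|----------|-------|--------|--------|
| 0 | 1 | 0 | A | 1 | 0 |
| 1 | 0 | 1 | B | 0 | 1 |
| 2 | 1 | 0 | C | 1 | 0 |
| 3 | 0 | 1 | D | 1 | 0 |
| 4 | 0 | 1 | A | 0 | 1 |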


Binary Encoding

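Binary encoding first assigns each category an integer and then writes that integer in binary digits. Each binary digit becomes one feature column, so the number of columns grows with the logarithm of the number of categories rather than with the number of categories itself.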

In [6]:
ce_be = ce.BinaryEncoder(cols=['class'])

# transform the data
data_binary = ce_be.fit_transform(data["class"])
data_binary

|   | class_0 | class_1 | class_2 |
|---|---------|---------|---------|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
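
To make the digits in this output easier to follow, here is a minimal pure-Python sketch of the idea. It assumes categories are numbered from 1 in order of first appearance, which reproduces the table above; the helper name `binary_encode` is hypothetical and the library's internals may differ.

```python
def binary_encode(values):
    """Return one list of 0/1 digits per value.

    Assumption: categories get ordinals 1, 2, 3, ... in order of first
    appearance, and every ordinal is written with the same number of bits.
    """
    ordinals = {}
    for value in values:
        if value not in ordinals:
            ordinals[value] = len(ordinals) + 1

    # Number of bits needed to write the largest ordinal.
    width = max(len(ordinals).bit_length(), 1)

    return [[int(bit) for bit in format(ordinals[value], f"0{width}b")]
            for value in values]


rows = binary_encode(['A', 'B', 'C', 'D', 'A'])
for row in rows:
    print(row)
```

Four categories need three bits here because the largest ordinal, 4, is `100` in binary.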


Method 2: Using Pandas' get dummies

In [7]:
pd.get_dummies(data,columns=["gender","city"])

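|   | class | gender_Female | gender_Male | city_Delhi | city_Gurugram |
|---|-------|---------------|-------------|------------|---------------|
| 0 | A | 0 | 1 | 1 | 0 |
| 1 | B | 1 | 0 | 0 | 1 |
| 2 | C | 0 | 1 | 1 | 0 |
| 3 | D | 1 | 0 | 1 | 0 |
| 4 | A | 1 | 0 | 0 | 1 |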


In [8]:
# We can assign a custom prefix if we do not want the encoding to use the default column names.
pd.get_dummies(data,prefix=["gen","city"],columns=["gender","city"])

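|   | class | gen_Female | gen_Male | city_Delhi | city_Gurugram |
|---|-------|------------|----------|------------|---------------|
| 0 | A | 0 | 1 | 1 | 0 |
| 1 | B | 1 | 0 | 0 | 1 |
| 2 | C | 0 | 1 | 1 | 0 |
| 3 | D | 1 | 0 | 1 | 0 |
| 4 | A | 1 | 0 | 0 | 1 |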
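
Two optional arguments of `pd.get_dummies` are worth knowing. Recent pandas versions return boolean columns by default, so passing `dtype=int` is one way to get the 0/1 values shown above. `drop_first=True` drops the first level of each feature, which removes the redundant column. The sketch below rebuilds the same `data` frame so it runs on its own; the names `dummies` and `compact` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "class": ["A", "B", "C", "D", "A"],
    "city": ["Delhi", "Gurugram", "Delhi", "Delhi", "Gurugram"],
})

# dtype=int forces 0/1 integers regardless of the pandas default.
dummies = pd.get_dummies(data, columns=["gender", "city"], dtype=int)

# drop_first=True keeps k-1 columns for a feature with k categories.
compact = pd.get_dummies(data, columns=["gender", "city"],
                         dtype=int, drop_first=True)

print(dummies.columns.tolist())
print(compact.columns.tolist())
```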


Method 3: Using Scikit-learn
 
Scikit-learn also has built-in encoders, such as OneHotEncoder, OrdinalEncoder and LabelEncoder, which can be accessed from sklearn.preprocessing.

Scikit-learn One-hot Encoding

In [9]:
s = (data.dtypes == 'object')
cols = list(s[s].index)

from sklearn.preprocessing import OneHotEncoder

# scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use sparse=False
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

In [10]:
data_gender = pd.DataFrame(ohe.fit_transform(data[["gender"]]))

data_gender

|   | 0 | 1 |
|---|-----|-----|
| 0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 |


In [11]:
data_city = pd.DataFrame(ohe.fit_transform(data[["city"]]))

data_city

|   | 0 | 1 |
|---|-----|-----|
| 0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 |


In [12]:
data_class = pd.DataFrame(ohe.fit_transform(data[["class"]]))

data_class

|   | 0 | 1 | 2 | 3 |
|---|-----|-----|-----|-----|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 |


In [13]:
# Applying to the list of categorical variables:

data_cols = pd.DataFrame(ohe.fit_transform(data[cols]))

data_cols

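|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |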


Scikit-learn Label Encoding
 
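In label encoding, each category is assigned an integer. Scikit-learn's LabelEncoder uses the values 0 through N-1, where N is the number of categories for the feature, assigned in sorted order of the labels. The integers do not imply any relation or order between the categories.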

In [16]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# LabelEncoder takes no constructor arguments and expects a 1-D input,
# so pass the column as a Series rather than a one-column DataFrame
le_class = le.fit_transform(data["class"])


In [17]:
# Comparing with one-hot encoding
data_class

|   | 0 | 1 | 2 | 3 |
|---|-----|-----|-----|-----|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 |
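
For comparison with the one-hot table, the short sketch below prints the integer codes that LabelEncoder produces for the same `class` values, along with the learned categories and the reverse mapping. It is self-contained, and the variable names `codes` and `decoded` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

classes = ["A", "B", "C", "D", "A"]

le = LabelEncoder()
codes = le.fit_transform(classes)

print(codes.tolist())        # [0, 1, 2, 3, 0]
print(le.classes_.tolist())  # ['A', 'B', 'C', 'D']

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(codes)
print(decoded.tolist())      # ['A', 'B', 'C', 'D', 'A']
```

A single integer column replaces the four one-hot columns, at the cost of suggesting a numeric order that the categories do not have.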


Ordinal Encoding
 
Ordinal encoding retains the ordinal (ordered) nature of the variable in the encoded values. It looks similar to label encoding; the difference is that label encoding assigns a sequence of integers without considering whether the variable is ordinal.

Example: Ordinal encoding will assign values as Very Good (1) < Good (2) < Bad (3) < Worse (4).

First, we need to specify the order of the variable's categories through a dictionary.

In [18]:
temp = {'temperature' :['very cold', 'cold', 'warm', 'hot', 'very hot']}
df=pd.DataFrame(temp,columns=["temperature"])
temp_dict = {'very cold': 1,'cold': 2,'warm': 3,'hot': 4,"very hot":5}
df

|   | temperature |
|---|-------------|
| 0 | very cold |
| 1 | cold |
| 2 | warm |
| 3 | hot |
| 4 | very hot |


In [19]:
# Then we can map each row for the variable as per the dictionary.
df["temp_ordinal"] = df.temperature.map(temp_dict)
df

|   | temperature | temp_ordinal |
|---|-------------|--------------|
| 0 | very cold | 1 |
| 1 | cold | 2 |
| 2 | warm | 3 |
| 3 | hot | 4 |
| 4 | very hot | 5 |
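
As an alternative to a hand-written dictionary, the same order can be expressed as an ordered pandas categorical. This is only a sketch under the assumption that the full list of levels is known up front; the names `order` and `cat` are illustrative. Rows are shuffled here to show that the codes follow the declared order, not the row order.

```python
import pandas as pd

# Levels listed from lowest to highest.
order = ["very cold", "cold", "warm", "hot", "very hot"]

df = pd.DataFrame({"temperature": ["warm", "very cold", "hot", "cold", "very hot"]})

cat = pd.Categorical(df["temperature"], categories=order, ordered=True)

# .codes is zero-based, so add 1 to match the 1..5 scale used above.
df["temp_ordinal"] = cat.codes + 1
print(df)
```

Values that are not in `order` receive the code -1 (0 after the shift), which makes unexpected categories easy to spot.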


#### Frequency Encoding
 
Each category is replaced by its relative frequency, that is, the share of rows in the dataset that belong to that category.

In [20]:
data_freq = pd.DataFrame({'class' : ['A','B','C','D','A',"B","E","E","D","C","C","C","E","A","A"]})

Grouping by class column:

In [21]:
fe = data_freq.groupby("class").size()

Dividing by length:

In [22]:
fe_ = fe/len(data_freq)

Mapping and rounding off:

In [23]:
data_freq["data_fe"] = data_freq["class"].map(fe_).round(2)
data_freq

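|   | class | data_fe |
|---|-------|---------|
| 0 | A | 0.27 |
| 1 | B | 0.13 |
| 2 | C | 0.27 |
| 3 | D | 0.13 |
| 4 | A | 0.27 |
| 5 | B | 0.13 |
| 6 | E | 0.2 |
| 7 | E | 0.2 |
| 8 | D | 0.13 |
| 9 | C | 0.27 |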
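
The group, divide and map steps can also be written more compactly with `value_counts(normalize=True)`, which returns the relative frequencies directly. A minimal self-contained sketch on the same data (the variable name `freq` is illustrative):

```python
import pandas as pd

data_freq = pd.DataFrame({
    "class": ["A", "B", "C", "D", "A", "B", "E", "E",
              "D", "C", "C", "C", "E", "A", "A"]
})

# Share of rows per category, e.g. A appears in 4 of 15 rows.
freq = data_freq["class"].value_counts(normalize=True)

# Replace each category by its share, rounded to two decimals.
data_freq["data_fe"] = data_freq["class"].map(freq).round(2)
print(data_freq.head(10))
```

Note that A and C both map to 0.27 because they occur equally often, so frequency encoding cannot tell such categories apart.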


In this article, we saw 5 types of encoding schemes. Similarly, there are 10 other types of encoding which we have not looked at:

- Helmert Encoding
- Mean Encoding
- Weight of Evidence Encoding
- Probability Ratio Encoding
- Hashing Encoding
- Backward Difference Encoding
- Leave One Out Encoding
- James-Stein Encoding
- M-estimator Encoding
- Thermometer Encoding

## Understanding RNN Step by Step with PyTorch
https://www.analyticsvidhya.com/blog/2021/07/understanding-rnn-step-by-step-with-pytorch/?utm_source=pocket_mylist

#### Input To RNN

Input data to an RNN should have 3 dimensions: Batch Size, Sequence Length and Input Dimension.

Batch Size is the number of samples we send to the model at a time. In this example, we have batch size = 2, but you can use 4, 8, 16, 32, 64, etc., depending on the available memory (typically a power of 2).

Sequence Length is the length of the sequence of input data (time steps 0, 1, 2, …, N); the RNN learns the sequential pattern in the dataset. Here the grey-coloured part is the sequence, so our sequence length = 3. Suppose you have daily share market data (frequency = 1 day) and you want the network to learn from sequences of 30 days of data. Then your sequence length will be 30.

Input Dimension or Input Size is the number of features (columns) in your data set. In this case, it is one. Suppose you have share market data with the following features: High, Low, Open and Close, and you want to predict Close. In this case, you have input dimension = 4: High, Low, Open and Close. We will see the input dimension in more detail.

PyTorch accepts input in two shapes:

Input Type 1: Sequence Length, Batch Size, Input Dimension

Input Type 2: Batch Size, Sequence Length, Input Dimension

If we choose Input type 1 our shape will be = 3, 2, 1

If we choose Input type 2 our shape will be = 2, 3, 1
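
The two layouts hold the same data and differ only in the order of the first two axes. A minimal sketch with made-up values (the tensor names are illustrative) shows how to build the batch-first layout and convert it to the sequence-first one:

```python
import torch

# Batch size 2, sequence length 3, input dimension 1 (Input Type 2).
batch_first = torch.tensor([[[1.0], [2.0], [3.0]],
                            [[4.0], [5.0], [6.0]]])
print(batch_first.shape)  # torch.Size([2, 3, 1])

# Swap the first two axes to get Input Type 1.
seq_first = batch_first.transpose(0, 1)
print(seq_first.shape)    # torch.Size([3, 2, 1])

# The first time step of the second sample is now at index [0, 1].
print(seq_first[0, 1, 0].item())  # 4.0
```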





Let’s implement our small recurrent neural net class, inheriting from the base class nn.Module. HL_size is the hidden size, which we can define as 32, 64, 128 (again, preferably a power of 2), and the input size is the number of features in our data (input dimension). Here input size is 2 for data type 2 and 1 for data type 1.

Setting bidirectional=True makes this RNN bidirectional, which is very useful in many applications where later elements of the sequence can help in learning earlier ones. If bidirectional is true, the number of directions will be 2; otherwise it will be 1.

batch_first=True means batch should be our first dimension (Input Type 2). If we do not set batch_first=True in the RNN, we need data in the Input Type 1 shape (Sequence Length, Batch Size, Input Dimension).

The RNN returns two values: output and hidden.


Output Shape: If we use batch_first=True, then the output shape is (Batch Size, Seq Len, No of Directions * Hidden Size). If we use batch_first=False, then the output shape is (Seq Len, Batch Size, No of Directions * Hidden Size).

Suppose we take data type 2 as input, where seq_len is 3, batch is 2, hidden size = 128 and bidirectional = False. Then our output shape will be (3, 2, 1 * 128) for batch_first=False and (2, 3, 1 * 128) for batch_first=True.

Hidden Shape: (No of Directions * num_layers, Batch Size, Hidden Size), which holds the final hidden state. The hidden shape does not depend on batch_first. Most of the time we pass hidden as the input to self.linear2.
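
These shapes can be checked directly with `nn.RNN`. The sketch below assumes a single feature per time step (input size 1) and random input, and compares a unidirectional and a bidirectional layer; all variable names are illustrative.

```python
import torch
from torch import nn

batch_size, seq_len, input_size, hidden_size = 2, 3, 1, 128
x = torch.randn(batch_size, seq_len, input_size)

# One layer, one direction.
rnn = nn.RNN(input_size, hidden_size, num_layers=1,
             batch_first=True, bidirectional=False)
out, hidden = rnn(x)
print(out.shape)     # torch.Size([2, 3, 128])
print(hidden.shape)  # torch.Size([1, 2, 128])

# One layer, two directions: the last output axis doubles,
# and hidden gets one entry per direction.
bi_rnn = nn.RNN(input_size, hidden_size, num_layers=1,
                batch_first=True, bidirectional=True)
bi_out, bi_hidden = bi_rnn(x)
print(bi_out.shape)     # torch.Size([2, 3, 256])
print(bi_hidden.shape)  # torch.Size([2, 2, 128])
```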

Linear Transformation after RNN: If you are doing regression or binary classification, then the output_size of the linear transformation should be 1. If you are doing multi-class classification, then output_size will be the number of classes.

After __init__ you have to define forward, the method of your RNN class that computes the hidden states of the network. If you are using output (out in the code) as the input to the linear layer, you have the hidden states for all time steps of the last layer, so you need to select which time step to feed to the linear layer.
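
The class itself is not reproduced in these notes, so the following is only a minimal sketch of what such a module might look like, not the article's code. The class name `SmallRNN` and the attribute names `rnn` and `linear` are hypothetical (the text refers to `HL_size` and `self.linear2`), and it feeds the last time step of `out` to the linear layer, which is one common choice among several.

```python
import torch
from torch import nn


class SmallRNN(nn.Module):
    """Hypothetical single-layer RNN followed by a linear layer."""

    def __init__(self, input_size, hidden_size, output_size, bidirectional=False):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size,
                          batch_first=True, bidirectional=bidirectional)
        num_directions = 2 if bidirectional else 1
        # The linear layer consumes one time step of the RNN output.
        self.linear = nn.Linear(num_directions * hidden_size, output_size)

    def forward(self, x):
        # out: (batch, seq_len, num_directions * hidden_size)
        # hidden: (num_directions * num_layers, batch, hidden_size)
        out, hidden = self.rnn(x)
        # Select the last time step for every sample in the batch.
        last_step = out[:, -1, :]
        return self.linear(last_step)


model = SmallRNN(input_size=1, hidden_size=32, output_size=1)
x = torch.randn(2, 3, 1)  # batch size 2, sequence length 3, 1 feature
pred = model(x)
print(pred.shape)  # torch.Size([2, 1])
```

With output_size=1 the model returns one value per sample, which fits the regression or binary classification case described above.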

