# One hot encoding on categorical columns:

One hot encoding is a technique that we use to represent categorical variables as numerical values in a machine learning model.

**The advantages of using one hot encoding include:**

- It allows the use of categorical variables in models that require numerical input.
- It can improve model performance, because each category gets its own column and the model can learn a separate effect for each one.
- It avoids imposing a false order on nominal categories, a problem that arises when categories are encoded as integers (e.g. “red” = 0, “green” = 1, “blue” = 2) and the model treats them as ranked.

**The disadvantages of using one hot encoding include:**

- It can lead to increased dimensionality, as a separate column is created for each category in the variable. This can make the model more complex and slower to train.
- It can lead to sparse data, as most observations will have a value of 0 in most of the one-hot encoded columns.
- It can lead to overfitting, especially if there are many categories in the variable and the sample size is relatively small.

One hot encoding is a powerful technique for handling categorical data, but it can lead to increased dimensionality, sparsity, and overfitting. Use it with care and consider alternatives such as ordinal encoding or binary encoding.

**One Hot Encoding example:**

One hot encoding creates a separate column for each label of a categorical column, for example one column for Male and one for Female. Wherever the original value is Male, the Male column is 1 and the Female column is 0, and the other way round for Female.
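
The Male/Female example can be sketched with `pd.get_dummies`. The toy DataFrame below is made up for illustration and is not the Salary dataset used later.

```python
import pandas as pd

# Toy data (hypothetical values, not the Salary dataset).
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# One column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy["Gender"], dtype=int)
print(encoded)
```

Each row has exactly one 1, in the column that matches its original label.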

- Use it on nominal data.
- One hot encoding can still be applied when a column has 50 or 100 categories. It costs more processing time and memory, but it remains a simple and common choice.
  - For example, if a column has 100 categories, pick a threshold such as the top 10 to 20 categories by frequency and group the remaining categories into a new "Others" category.
- One hot encoding introduces multicollinearity, because the n dummy columns of a variable always sum to 1. To handle this, drop one column and keep the remaining n-1.
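
The two ideas above (grouping rare categories and keeping n-1 columns) can be sketched as follows. The job titles, the `top_k` value, and the "Others" label are assumptions chosen for a small toy example.

```python
import pandas as pd

# Toy column (hypothetical values) with a few frequent and several rare categories.
jobs = pd.Series(
    ["Engineer"] * 5 + ["Analyst"] * 4 + ["Manager"] * 3 + ["Intern", "Clerk", "Chef"],
    name="Job Title",
)

top_k = 3  # assumed threshold; on real data something like 10 to 20
frequent = jobs.value_counts().nlargest(top_k).index

# Keep the frequent categories and replace everything else with "Others".
grouped = jobs.where(jobs.isin(frequent), "Others")

# drop_first=True drops one dummy column, leaving n-1 columns for n categories.
dummies = pd.get_dummies(grouped, drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

Here six original categories shrink to four after grouping, and three columns remain after dropping the first one.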

## Import libraries:

In [1]:
import pandas as pd
import numpy as np

## Load dataset:

In [2]:
df=pd.read_csv('../Data/Salary_Data.csv')
df.sample(5)

|      | Age  | Gender | Education Level   | Job Title                 | Years of Experience | Salary   |
|------|------|--------|-------------------|---------------------------|---------------------|----------|
| 3717 | 44.0 | Male   | PhD               | Data Scientist            | 16.0                | 160000.0 |
| 4266 | 28.0 | Female | Bachelor's Degree | Marketing Coordinator     | 2.0                 | 41000.0  |
| 6466 | 43.0 | Male   | Bachelor's Degree | Content Marketing Manager | 14.0                | 140000.0 |
| 4486 | 31.0 | Female | High School       | Junior Sales Associate    | 1.0                 | 25000.0  |
| 1693 | 55.0 | Male   | PhD               | Software Engineer Manager | 18.0                | 210000.0 |


In [3]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6704 entries, 0 to 6703
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  6702 non-null   float64
 1   Gender               6702 non-null   object 
 2   Education Level      6701 non-null   object 
 3   Job Title            6702 non-null   object 
 4   Years of Experience  6701 non-null   float64
 5   Salary               6699 non-null   float64
dtypes: float64(3), object(3)
memory usage: 314.4+ KB


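|       | Age       | Years of Experience | Salary        |
|-------|-----------|---------------------|---------------|
| count | 6702.0    | 6701.0              | 6699.0        |
| mean  | 33.620859 | 8.094687            | 115326.964771 |
| std   | 7.614633  | 6.059003            | 52786.183911  |
| min   | 21.0      | 0.0                 | 350.0         |
| 25%   | 28.0      | 3.0                 | 70000.0       |
| 50%   | 32.0      | 7.0                 | 115000.0      |
| 75%   | 38.0      | 12.0                | 160000.0      |
| max   | 62.0      | 34.0                | 250000.0      |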


In [4]:
df.dropna(inplace=True)
df.shape

(6698, 6)

--------

1. Age is numeric.
2. Gender is categorical and nominal, so we use one hot encoding.
3. Education Level is categorical and ordinal, so we use ordinal encoding.
4. Job Title is categorical and nominal, so we use one hot encoding.
5. Years of Experience is numeric.
6. Salary, the target column, is numeric. If the target were categorical, we could use label encoding.

-------
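
For the ordinal Education Level column, a minimal sketch with scikit-learn's `OrdinalEncoder` could look like the following. The outputs above show both "Bachelor's" and "Bachelor's Degree", so the sketch assumes these are the same level and unifies them first. The `aliases` mapping and the ranking in `order` are assumptions, and the real column may contain other variants.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows using labels that appear in the outputs above.
edu = pd.DataFrame(
    {"Education Level": ["High School", "Bachelor's", "Master's Degree", "PhD", "Bachelor's Degree"]}
)

# Assumption: the short and long spellings mean the same level.
aliases = {"Bachelor's": "Bachelor's Degree", "Master's": "Master's Degree"}
edu["Education Level"] = edu["Education Level"].replace(aliases)

# Assumed ranking, lowest to highest.
order = ["High School", "Bachelor's Degree", "Master's Degree", "PhD"]
encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(edu[["Education Level"]])
print(codes.ravel())
```

Passing `categories` explicitly matters here: without it, the encoder orders labels alphabetically, which would not match the intended ranking.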

In [5]:
df.head()

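|   | Age  | Gender | Education Level | Job Title         | Years of Experience | Salary   |
|---|------|--------|-----------------|-------------------|---------------------|----------|
| 0 | 32.0 | Male   | Bachelor's      | Software Engineer | 5.0                 | 90000.0  |
| 1 | 28.0 | Female | Master's        | Data Analyst      | 3.0                 | 65000.0  |
| 2 | 45.0 | Male   | PhD             | Senior Manager    | 15.0                | 150000.0 |
| 3 | 36.0 | Female | Bachelor's      | Sales Associate   | 7.0                 | 60000.0  |
| 4 | 52.0 | Male   | Master's        | Director          | 20.0                | 200000.0 |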


In [6]:
df['Job Title'].nunique()

191

In [7]:
df['Gender'].value_counts()

Gender
Male      3671
Female    3013
Other       14
Name: count, dtype: int64

**The Job Title column has 191 categories, while Gender has only 3.**

In [8]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(df.iloc[:,0:5],df.iloc[:,-1],test_size=0.2,random_state=2)

In [9]:
X_train

|      | Age  | Gender | Education Level   | Job Title                        | Years of Experience |
|------|------|--------|-------------------|----------------------------------|---------------------|
| 3517 | 34.0 | Male   | Bachelor's Degree | Junior Web Developer             | 3.0                 |
| 5157 | 27.0 | Female | Bachelor's Degree | Graphic Designer                 | 2.0                 |
| 4001 | 27.0 | Female | Master's Degree   | Human Resources Coordinator      | 3.0                 |
| 1777 | 43.0 | Female | PhD               | Senior Project Engineer          | 14.0                |
| 3694 | 28.0 | Male   | High School       | Junior Sales Associate           | 2.0                 |
| ...  | ...  | ...    | ...               | ...                              | ...                 |
| 6448 | 35.0 | Female | PhD               | Senior Product Marketing Manager | 9.0                 |
| 3610 | 27.0 | Female | High School       | Junior HR Generalist             | 1.0                 |
| 5709 | 36.0 | Female | PhD               | Research Scientist               | 12.0                |
| 6643 | 49.0 | Female | PhD               | Director of Marketing            | 20.0                |


## One hot encoding:

In [24]:
from sklearn.preprocessing import OneHotEncoder

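# drop='first' removes the first category of each feature to avoid multicollinearity (see above).
# handle_unknown='ignore' encodes categories that were not seen during fit as all zeros.
ohe = OneHotEncoder(drop='first', handle_unknown='ignore', dtype=int)

# Fit on the training data only, then apply the same encoding to both splits.
ohe.fit(X_train[['Gender', 'Job Title']])

# transform() returns a sparse matrix; toarray() converts it to a dense NumPy array.
Xtrain_ohe = ohe.transform(X_train[['Gender', 'Job Title']]).toarray()
Xtest_ohe = ohe.transform(X_test[['Gender', 'Job Title']]).toarray()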



**The encoder returns its result in sparse format, so I use `toarray()` to convert it to a NumPy array.**
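
To see the difference between the sparse result and the dense array, here is a small sketch on made-up data. It only illustrates the conversion and how few entries are actually non-zero.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values).
toy = pd.DataFrame({"Job Title": ["Engineer", "Analyst", "Manager", "Engineer"]})

enc = OneHotEncoder()
encoded_sparse = enc.fit_transform(toy)   # SciPy sparse matrix by default
encoded_dense = encoded_sparse.toarray()  # regular NumPy array

print(sparse.issparse(encoded_sparse))
print(encoded_dense.shape)

# Fraction of entries that are non-zero: one per row.
n_rows, n_cols = encoded_dense.shape
density = encoded_sparse.nnz / (n_rows * n_cols)
print(density)
```

With many categories the density gets very small, which is why the sparse format is the default. Converting to dense is convenient for small data but uses more memory.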

In [25]:
print(Xtrain_ohe.shape)
print(Xtest_ohe.shape)

(5358, 172)
(1340, 172)


**Now we extract the remaining columns from X_train and stack them with Xtrain_ohe.**

In [26]:
Xtrain_ohe=np.hstack((X_train[['Age','Education Level','Years of Experience']],Xtrain_ohe))
Xtest_ohe=np.hstack((X_test[['Age','Education Level','Years of Experience']],Xtest_ohe))

In [27]:
print(Xtrain_ohe.shape)
print(Xtest_ohe.shape)

(5358, 175)
(1340, 175)


In [28]:
Xtrain_ohe

array([[34.0, "Bachelor's Degree", 3.0, ..., 0, 0, 0],
       [27.0, "Bachelor's Degree", 2.0, ..., 0, 0, 0],
       [27.0, "Master's Degree", 3.0, ..., 0, 0, 0],
       ...,
       [36.0, 'PhD', 12.0, ..., 0, 0, 0],
       [49.0, 'PhD', 20.0, ..., 0, 0, 0],
       [26.0, "Master's Degree", 4.0, ..., 0, 0, 0]], dtype=object)
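
The array above has `dtype=object` because Education Level is still a string column when it is stacked with the numeric columns. One possible way to get a fully numeric array is to encode all columns in one step with `ColumnTransformer`. This is a sketch on made-up rows, not the notebook's own method, and the ranking in `edu_order` is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frame with the same column names as X_train (values are made up).
X_toy = pd.DataFrame(
    {
        "Age": [32.0, 28.0, 45.0, 36.0],
        "Gender": ["Male", "Female", "Male", "Female"],
        "Education Level": ["Bachelor's Degree", "Master's Degree", "PhD", "High School"],
        "Job Title": ["Engineer", "Analyst", "Manager", "Engineer"],
        "Years of Experience": [5.0, 3.0, 15.0, 7.0],
    }
)

# Assumed ranking, lowest to highest.
edu_order = ["High School", "Bachelor's Degree", "Master's Degree", "PhD"]

ct = ColumnTransformer(
    transformers=[
        ("edu", OrdinalEncoder(categories=[edu_order]), ["Education Level"]),
        ("ohe", OneHotEncoder(drop="first"), ["Gender", "Job Title"]),
    ],
    remainder="passthrough",  # keep Age and Years of Experience unchanged
    sparse_threshold=0,       # always return a dense array
)

X_num = ct.fit_transform(X_toy)
print(X_num.dtype, X_num.shape)
```

The result has one ordinal column, three one hot columns, and the two numeric columns passed through, all in a single float array.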