## TASK 10

### NAME : AVINASH M
### REG NO. : GO_STP_8305

___

## Pandas Dummy Variables and One-hot encoding sklearn

___

**Discuss the concepts of One-Hot Encoding, Multicollinearity and the Dummy Variable Trap. What are Nominal and Ordinal Variables?**

One-Hot Encoding

> ***One-hot encoding*** converts a categorical variable into a numeric form that ML algorithms can use, and it is a crucial part of feature engineering for machine learning. Each distinct category gets its own new binary column, which holds 1 when the row belongs to that category and 0 otherwise. Every row is therefore represented as a binary vector in which all values are zero except the one at the index of its category, which is 1.
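
As a minimal sketch of the idea, the snippet below one-hot encodes a small made-up `color` column with `pd.get_dummies`. The column name and its values are illustrative assumptions and are not part of the salary dataset.

```python
import pandas as pd

# Hypothetical toy data: one nominal column with three categories.
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new 0/1 column per category. dtype=int keeps the output as integers
# instead of the booleans that newer pandas versions return by default.
encoded = pd.get_dummies(toy["color"], dtype=int)

print(encoded)
```

Each row of `encoded` contains exactly one 1, in the column of that row's category.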

Multicollinearity

> ***Multicollinearity*** is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. Multicollinearity can lead to skewed or misleading results when a researcher or analyst tries to determine how much each independent variable contributes to predicting or explaining the dependent variable in a statistical model.
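
One simple way to spot multicollinearity is to look at the pairwise correlations between the independent variables. The sketch below uses three made-up columns (`x1`, `x2`, `x3` are hypothetical names and values) in which `x2` is roughly twice `x1`.

```python
import pandas as pd

# Hypothetical predictors: x2 is approximately 2 * x1, x3 is unrelated.
features = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1],
    "x3": [5, 1, 4, 2, 6, 3],
})

# Pearson correlation between every pair of columns.
corr = features.corr()

print(corr.round(3))
```

A correlation close to 1 or -1 between two predictors, as between `x1` and `x2` here, is a warning sign that their individual effects will be hard to separate in a regression.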

Dummy Variable Trap

> ***The dummy variable trap*** is a scenario in which attributes are highly correlated (multicollinear), so that one variable predicts the value of the others. When categorical data is one-hot encoded, the dummy variables of a column always sum to 1, so any one of them can be predicted exactly from the rest. Using all the dummy variables in a regression model therefore leads to the dummy variable trap, and regression models should be designed excluding one dummy variable per categorical column.
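
The trap can be made visible by checking the rank of the design matrix. The sketch below assumes a small made-up `rk` column with three ranks (the level `assistant` is an assumption; it is not shown in the dataset preview) and compares the full set of dummies with the set produced by `drop_first=True`, each next to an intercept column.

```python
import numpy as np
import pandas as pd

# Hypothetical values for a rank column with three categories.
rank = pd.Series(["full", "associate", "assistant", "full", "assistant"], name="rk")

all_dummies = pd.get_dummies(rank, dtype=int)                   # 3 columns
reduced = pd.get_dummies(rank, drop_first=True, dtype=int)      # 2 columns

# A regression with an intercept adds a column of ones to the design matrix.
intercept = np.ones((len(rank), 1))
X_full = np.hstack([intercept, all_dummies.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than rank: collinear
print(X_reduced.shape[1], rank_reduced)  # columns equal rank: no redundancy
```

With all three dummies the matrix has four columns but only rank 3, because the dummies add up to the intercept column. Dropping one dummy removes the redundancy.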

Nominal and Ordinal Variables

> A ***nominal*** scale describes a variable with categories that do not have a natural order or ranking. You can code nominal variables with numbers if you want, but the order is arbitrary and any calculations, such as computing a mean, median, or standard deviation, would be meaningless.  
<br>
Examples of nominal variables include:
*genotype, blood type, zip code, gender, race, eye color, political party*

> An ***ordinal*** scale is one where the order matters but not the difference between values.  
<br>
Examples of ordinal variables include: *socioeconomic status (“low income”, “middle income”, “high income”), education level (“high school”, “BS”, “MS”, “PhD”), income level (“less than 50K”, “50K-100K”, “over 100K”), satisfaction rating (“extremely dislike”, “dislike”, “neutral”, “like”, “extremely like”).*
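
In pandas the distinction can be expressed with `pd.Categorical`. The following is a small sketch with made-up values: the blood type column is left unordered (nominal), while the education levels are given an explicit order (ordinal).

```python
import pandas as pd

# Nominal: categories without any order.
blood = pd.Categorical(["A", "O", "B", "O"])

# Ordinal: the order of the categories is declared explicitly.
levels = ["high school", "BS", "MS", "PhD"]
education = pd.Categorical(
    ["MS", "high school", "PhD", "BS"], categories=levels, ordered=True
)

print(blood.ordered)                     # False
print(education.codes.tolist())          # position of each value in `levels`
print(education.min(), education.max())  # only meaningful for ordered data
```

The integer codes of the ordered categorical follow the declared ranking, so comparisons such as minimum and maximum make sense, whereas they are undefined for the unordered one.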

___

**Salary Dataset of 52 professors having categorical columns. Apply dummy variables concept and one-hot-encoding on categorical columns.**

Importing the *pandas* library

In [1]:
import pandas as pd

Loading the dataset

In [2]:
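# sep=r"\s+" splits on any whitespace; it replaces delim_whitespace=True,
# which is deprecated in recent pandas versions
salaryDataset = pd.read_csv("salary.txt", sep=r"\s+")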

Examining the dataset

In [3]:
salaryDataset.head()

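|   | sx | rk | yr | dg | yd | sl |
|---|---|---|---|---|---|---|
| 0 | male | full | 25 | doctorate | 35 | 36350 |
| 1 | male | full | 13 | doctorate | 22 | 35350 |
| 2 | male | full | 10 | doctorate | 23 | 28200 |
| 3 | female | full | 7 | doctorate | 27 | 26775 |
| 4 | male | full | 19 | masters | 30 | 33696 |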


___

**DUMMY VARIABLES**

Getting the DUMMY VARIABLES

In [4]:
# drop_first=True drops the first category of each column,
# which avoids the DUMMY VARIABLE TRAP
dummy = pd.get_dummies(salaryDataset[["sx", "rk", "dg"]], drop_first=True)

In [5]:
dummy.head()

|   | sx_male | rk_associate | rk_full | dg_masters |
|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 1 | 1 |


Adding the dummy variables to the actual dataset

In [6]:
salaryDataset = pd.concat([salaryDataset, dummy], axis = "columns")

In [7]:
salaryDataset.head()

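|   | sx | rk | yr | dg | yd | sl | sx_male | rk_associate | rk_full | dg_masters |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | full | 25 | doctorate | 35 | 36350 | 1 | 0 | 1 | 0 |
| 1 | male | full | 13 | doctorate | 22 | 35350 | 1 | 0 | 1 | 0 |
| 2 | male | full | 10 | doctorate | 23 | 28200 | 1 | 0 | 1 | 0 |
| 3 | female | full | 7 | doctorate | 27 | 26775 | 0 | 0 | 1 | 0 |
| 4 | male | full | 19 | masters | 30 | 33696 | 1 | 0 | 1 | 1 |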


Dropping all the Categorical Columns

In [8]:
salaryDataset.drop(["sx", "rk", "dg"], axis = "columns", inplace = True)
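
Steps In [4] to In [8] can also be collapsed into a single call, because `pd.get_dummies` accepts a whole DataFrame together with a `columns` argument and removes the original categorical columns itself. The sketch below uses a small made-up frame with the same column names (the rows and the level `assistant` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical three-row frame that mimics the salary columns.
df = pd.DataFrame({
    "sx": ["male", "female", "male"],
    "rk": ["full", "associate", "assistant"],
    "yr": [25, 7, 3],
    "dg": ["doctorate", "masters", "doctorate"],
})

# Encode the listed columns, drop one dummy per column, and keep the rest as-is.
encoded = pd.get_dummies(df, columns=["sx", "rk", "dg"], drop_first=True, dtype=int)

print(encoded)
```

The untouched numeric column comes first in the result, followed by the dummy columns in the order given in `columns`.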

Now the dataset contains only numeric columns and is ready to be fed to machine learning algorithms

In [9]:
salaryDataset.head()

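|   | yr | yd | sl | sx_male | rk_associate | rk_full | dg_masters |
|---|---|---|---|---|---|---|---|
| 0 | 25 | 35 | 36350 | 1 | 0 | 1 | 0 |
| 1 | 13 | 22 | 35350 | 1 | 0 | 1 | 0 |
| 2 | 10 | 23 | 28200 | 1 | 0 | 1 | 0 |
| 3 | 7 | 27 | 26775 | 0 | 0 | 1 | 0 |
| 4 | 19 | 30 | 33696 | 1 | 0 | 1 | 1 |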
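
To illustrate what "feeding" the encoded data to an algorithm could look like, here is a rough sketch that fits scikit-learn's `LinearRegression` with `sl` as the target. It only uses the five rows displayed by `head()`, so the fitted coefficients carry no meaning; the choice of model and of `sl` as the target are assumptions made for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The five encoded rows displayed by head(); far too few for a real model.
rows = pd.DataFrame({
    "yr": [25, 13, 10, 7, 19],
    "yd": [35, 22, 23, 27, 30],
    "sl": [36350, 35350, 28200, 26775, 33696],
    "sx_male": [1, 1, 1, 0, 1],
    "rk_associate": [0, 0, 0, 0, 0],
    "rk_full": [1, 1, 1, 1, 1],
    "dg_masters": [0, 0, 0, 0, 1],
})

X = rows.drop(columns="sl")  # every column except the salary
y = rows["sl"]               # salary as the value to predict

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

print(dict(zip(X.columns, model.coef_.round(2))))
```

On the full dataset the same call would be `LinearRegression().fit(salaryDataset.drop(columns="sl"), salaryDataset["sl"])`.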


___

**ONE-HOT ENCODING**

Importing libraries needed for ONE-HOT ENCODING

In [10]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

Loading the dataset

In [11]:
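# sep=r"\s+" splits on any whitespace; it replaces delim_whitespace=True,
# which is deprecated in recent pandas versions
salaryDataSet = pd.read_csv("salary.txt", sep=r"\s+")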

In [12]:
salaryDataSet.head()

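|   | sx | rk | yr | dg | yd | sl |
|---|---|---|---|---|---|---|
| 0 | male | full | 25 | doctorate | 35 | 36350 |
| 1 | male | full | 13 | doctorate | 22 | 35350 |
| 2 | male | full | 10 | doctorate | 23 | 28200 |
| 3 | female | full | 7 | doctorate | 27 | 26775 |
| 4 | male | full | 19 | masters | 30 | 33696 |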


___

Label Encoding

In [13]:
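labelEncoder = LabelEncoder()
# fit_transform refits the encoder on every call, so one instance can be reused;
# each column's categories are replaced by integer codes
salaryDataSet["sx"] = labelEncoder.fit_transform(salaryDataSet["sx"])
salaryDataSet["rk"] = labelEncoder.fit_transform(salaryDataSet["rk"])
salaryDataSet["dg"] = labelEncoder.fit_transform(salaryDataSet["dg"])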

In [14]:
salaryDataSet.head()

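|   | sx | rk | yr | dg | yd | sl |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 25 | 0 | 35 | 36350 |
| 1 | 1 | 2 | 13 | 0 | 22 | 35350 |
| 2 | 1 | 2 | 10 | 0 | 23 | 28200 |
| 3 | 0 | 2 | 7 | 0 | 27 | 26775 |
| 4 | 1 | 2 | 19 | 1 | 30 | 33696 |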
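
The integer codes come from the alphabetical order of the category names, which is why `full` is encoded as 2 and `doctorate` as 0. A short sketch on made-up rank values (the level `assistant` is an assumption) shows the mapping through the encoder's `classes_` attribute.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Hypothetical rank values; the codes follow the sorted order of the labels.
codes = encoder.fit_transform(["full", "associate", "assistant", "full"])

print(list(encoder.classes_))                   # sorted categories
print(codes.tolist())                           # index of each value in classes_
print(encoder.inverse_transform([2]).tolist())  # back from code to label
```

Because these codes suggest an order that nominal categories do not have, they serve here only as an intermediate step before one-hot encoding.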


___

One-Hot Encoding

In [15]:
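# Column indices 0, 1 and 3 are the label-encoded sx, rk and dg columns;
# OneHotEncoder() keeps every category, so all dummy columns are produced
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0, 1, 3])], remainder = "passthrough")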

In [16]:
newColumnTransformer = columnTransformer.fit_transform(salaryDataSet)

In [17]:
newColumnTransformer[:10]

array([[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
        1.0000e+00, 0.0000e+00, 2.5000e+01, 3.5000e+01, 3.6350e+04],
       [0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
        1.0000e+00, 0.0000e+00, 1.3000e+01, 2.2000e+01, 3.5350e+04],
       [0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
        1.0000e+00, 0.0000e+00, 1.0000e+01, 2.3000e+01, 2.8200e+04],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
        1.0000e+00, 0.0000e+00, 7.0000e+00, 2.7000e+01, 2.6775e+04],
       [0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
        0.0000e+00, 1.0000e+00, 1.9000e+01, 3.0000e+01, 3.3696e+04],
       [0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
        1.0000e+00, 0.0000e+00, 1.6000e+01, 2.1000e+01, 2.8516e+04],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
        0.0000e+00, 1.0000e+00, 0.0000e+00, 3.2000e+01, 2.4900e+04],
       [0.0000e+00, 1.0000e+00, 0.0000e+0

___