# One-Hot Encoding

One-Hot Encoding is a technique used in machine learning to represent categorical data with multiple categories as binary vectors. It's particularly useful when the categories don't have an inherent order or when you want to avoid implying any ordinal relationship between them.

Here's a simple explanation of One-Hot Encoding:

1. **Representation:**
   - For each category in your original categorical feature, a binary column (or bit) is created.
   - The binary column is set to 1 if the original data point belongs to that category, and 0 otherwise.


![](OHE.png)


2. **Example:**
   - Suppose you have a "Color" feature with categories "Red," "Green," and "Blue."
   - After One-Hot Encoding, you would create three binary columns, one for each color.
   - If a data point is "Red," the "Red" column is set to 1, and the "Green" and "Blue" columns are set to 0.

   ```text
   Color:      Red   Green   Blue
   One-Hot:     1      0      0   (encodes "Red")
   ```

3. **Benefits:**
   - Prevents the model from inferring an order between categories that does not exist, as it could if they were coded as 0, 1, 2.
   - Ensures that each category is treated as a separate and independent feature.
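To make the mechanics above concrete, here is a minimal sketch in plain Python. The `colors` list and the variable names are illustrative assumptions, not part of the original notebook.

```python
# Hand-rolled one-hot encoding for a single categorical feature.
colors = ["Red", "Green", "Blue", "Green"]

# One column per distinct category, in a fixed (sorted) order.
categories = sorted(set(colors))  # ['Blue', 'Green', 'Red']

# Each row gets a 1 in the column of its own category and 0 elsewhere.
one_hot = [[1 if value == cat else 0 for cat in categories] for value in colors]

for value, row in zip(colors, one_hot):
    print(f"{value:<6} {row}")
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.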

In Python, you can use libraries like scikit-learn or pandas to perform One-Hot Encoding. A simple scikit-learn example is shown in cell `In [1]` below.

Remember that One-Hot Encoding increases the dimensionality of your data, so it's essential to consider the potential impact on the performance of your machine learning model, especially if you have a large number of unique categories.
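As a rough illustration of that growth in dimensionality, the sketch below encodes a made-up column with 50 distinct labels. The `toy` frame and its column names are hypothetical and only serve to show how the column count changes.

```python
import pandas as pd

# Hypothetical toy column: 1,000 rows drawn from 50 distinct category labels.
labels = [f"cat_{i % 50}" for i in range(1000)]
toy = pd.DataFrame({"category": labels, "value": range(1000)})

encoded = pd.get_dummies(toy, columns=["category"])

print(toy.shape)      # (1000, 2): one categorical and one numeric column
print(encoded.shape)  # (1000, 51): 'category' became 50 indicator columns
```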

In [1]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Create a DataFrame with a "Color" column
data = {'Color': ['Red', 'Green', 'Blue']}
df = pd.DataFrame(data)

# Use OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # scikit-learn < 1.2 spells this sparse=False
one_hot_encoded = encoder.fit_transform(df[['Color']])
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(['Color']))

print("Original DataFrame:")
print(df)

print("\nOne-Hot Encoded DataFrame:")
print(one_hot_df)

Original DataFrame:
   Color
0    Red
1  Green
2   Blue

One-Hot Encoded DataFrame:
   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         0.0          1.0        0.0
2         1.0          0.0        0.0




In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('cars.csv')
df.sample(5)

| | brand | km_driven | fuel | owner | selling_price |
|---|---|---|---|---|---|
| 3182 | Mahindra | 88754 | Diesel | First Owner | 1100000 |
| 7470 | Honda | 127991 | Diesel | First Owner | 675000 |
| 1007 | Toyota | 114368 | Diesel | Second Owner | 1325000 |
| 8092 | Hyundai | 28000 | Diesel | First Owner | 780000 |
| 577 | Mahindra | 100000 | Diesel | First Owner | 1100000 |


In [4]:
df['brand'].value_counts()

Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: brand, dtype: int64

In [5]:
df['brand'].nunique()

32

In [6]:
df.nunique()

brand             32
km_driven        921
fuel               4
owner              5
selling_price    677
dtype: int64

In [7]:
df.shape

(8128, 5)

In [8]:
df['owner'].unique(), df['fuel'].unique()

(array(['First Owner', 'Second Owner', 'Third Owner',
        'Fourth & Above Owner', 'Test Drive Car'], dtype=object),
 array(['Diesel', 'Petrol', 'LPG', 'CNG'], dtype=object))

In [9]:
np.round(df.describe())

| | km_driven | selling_price |
|---|---|---|
| count | 8128.0 | 8128.0 |
| mean | 69820.0 | 638272.0 |
| std | 56551.0 | 806253.0 |
| min | 1.0 | 29999.0 |
| 25% | 35000.0 | 254999.0 |
| 50% | 60000.0 | 450000.0 |
| 75% | 98000.0 | 675000.0 |
| max | 2360457.0 | 10000000.0 |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   brand          8128 non-null   object
 1   km_driven      8128 non-null   int64 
 2   fuel           8128 non-null   object
 3   owner          8128 non-null   object
 4   selling_price  8128 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 317.6+ KB


## One-Hot Encoding using Pandas

In [11]:
# Not doing OHE on the 'brand' column now because it has 32 unique values

In [12]:
pd.get_dummies(df, columns=['fuel', 'owner']) 

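| | brand | km_driven | selling_price | fuel_CNG | fuel_Diesel | fuel_LPG | fuel_Petrol | owner_First Owner | owner_Fourth & Above Owner | owner_Second Owner | owner_Test Drive Car | owner_Third Owner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti | 145500 | 450000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | Skoda | 120000 | 370000 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | Honda | 140000 | 158000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | Hyundai | 127000 | 225000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | Maruti | 120000 | 130000 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8123 | Hyundai | 110000 | 320000 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 8124 | Hyundai | 119000 | 135000 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8125 | Maruti | 120000 | 382000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8126 | Tata | 25000 | 290000 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |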
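The 0/1 values in the output above depend on the pandas version: from pandas 2.0 onward, `get_dummies` returns boolean (`True`/`False`) columns by default. A minimal sketch, using a made-up `toy` frame, of requesting integer indicators explicitly through the `dtype` parameter:

```python
import pandas as pd

# Hypothetical toy data, not taken from cars.csv.
toy = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Diesel"]})

# dtype=int forces 0/1 integers regardless of the installed version's default.
dummies = pd.get_dummies(toy, columns=["fuel"], dtype=int)
print(dummies)
```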


## K-1 One-Hot Encoding
- K-1 encoding drops one of the K indicator columns to avoid the "dummy variable trap", a case of perfect multicollinearity. Multicollinearity occurs when independent variables in a regression model are highly correlated, which makes the coefficient estimates unstable. With all K indicator columns present, each row sums to exactly 1, so any one column can be computed from the others (and together they duplicate the intercept).
- Here's an example: suppose you have a categorical variable "Color" with three levels: Red, Green, and Blue. With K-1 encoding you create two binary columns, "IsGreen" and "IsBlue". If both are 0, the color is Red (the baseline). If "IsGreen" is 1, it is Green, and if "IsBlue" is 1, it is Blue. No information is lost, and the redundant column that causes the multicollinearity is gone.
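The redundancy can be checked numerically. The sketch below is an illustration on made-up data, not part of the original notebook: it builds a design matrix with an intercept and compares its rank under full K-column and K-1 encoding.

```python
import numpy as np

# Toy data: colors of six observations (illustrative only).
colors = np.array(["Red", "Green", "Blue", "Red", "Blue", "Green"])
levels = ["Red", "Green", "Blue"]

# Full K-column encoding: one indicator per level.
full = np.column_stack([(colors == lvl).astype(float) for lvl in levels])
# K-1 encoding: drop the first level ("Red") and treat it as the baseline.
reduced = full[:, 1:]

intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full])        # 4 columns
X_reduced = np.hstack([intercept, reduced])  # 3 columns

# The K indicators sum to the intercept column, so X_full loses one rank.
print(np.linalg.matrix_rank(X_full))     # 3, fewer than its 4 columns
print(np.linalg.matrix_rank(X_reduced))  # 3, equal to its 3 columns
```

A rank lower than the number of columns means the least-squares coefficients are not uniquely determined, which is the practical symptom of the dummy variable trap.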







In [13]:
pd.get_dummies(df, columns=['fuel', 'owner'], drop_first=True) 

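| | brand | km_driven | selling_price | fuel_Diesel | fuel_LPG | fuel_Petrol | owner_Fourth & Above Owner | owner_Second Owner | owner_Test Drive Car | owner_Third Owner |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti | 145500 | 450000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Skoda | 120000 | 370000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | Honda | 140000 | 158000 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | Hyundai | 127000 | 225000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Maruti | 120000 | 130000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8123 | Hyundai | 110000 | 320000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8124 | Hyundai | 119000 | 135000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8125 | Maruti | 120000 | 382000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8126 | Tata | 25000 | 290000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |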


In [14]:
# The calls above return a new DataFrame; df itself is unchanged unless the result is assigned.
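A small sketch of that behaviour on a made-up frame (the `toy` and `toy_encoded` names are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({"fuel": ["Diesel", "Petrol"], "km_driven": [1000, 2000]})

pd.get_dummies(toy, columns=["fuel"], drop_first=True)  # result is discarded
print(list(toy.columns))  # still ['fuel', 'km_driven']

# Assign the result to keep the encoded columns.
toy_encoded = pd.get_dummies(toy, columns=["fuel"], drop_first=True)
print(list(toy_encoded.columns))  # ['km_driven', 'fuel_Petrol']
```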

## One-Hot Encoding using Sklearn

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [16]:
df.head()

| | brand | km_driven | fuel | owner | selling_price |
|---|---|---|---|---|---|
| 0 | Maruti | 145500 | Diesel | First Owner | 450000 |
| 1 | Skoda | 120000 | Diesel | Second Owner | 370000 |
| 2 | Honda | 140000 | Petrol | Third Owner | 158000 |
| 3 | Hyundai | 127000 | Diesel | First Owner | 225000 |
| 4 | Maruti | 120000 | Petrol | First Owner | 130000 |


In [17]:
# define x and y
x = df.iloc[:,:4]
y = df['selling_price']

In [18]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)

In [19]:
X_train.head()

| | brand | km_driven | fuel | owner |
|---|---|---|---|---|
| 5571 | Hyundai | 35000 | Diesel | First Owner |
| 2038 | Jeep | 60000 | Diesel | First Owner |
| 2957 | Hyundai | 25000 | Petrol | First Owner |
| 7618 | Mahindra | 130000 | Diesel | Second Owner |
| 6684 | Hyundai | 155000 | Diesel | First Owner |


In [20]:
# scikit-learn < 1.2 spells sparse_output as sparse
ohe = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)

In [21]:
# Encode only the ['fuel', 'owner'] columns of X_train

In [22]:
# Store the encoded ['fuel', 'owner'] columns in a new variable

In [23]:
X_train_new = ohe.fit_transform(X_train[['fuel','owner']])



In [24]:
X_test_new = ohe.transform(X_test[['fuel','owner']])

In [25]:
X_train_new.shape

(6502, 7)
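The cells above fit the encoder on the training split only and reuse it on the test split. One consequence is that the test split may contain a category the encoder never saw. The sketch below uses toy data and the `handle_unknown="ignore"` option, which is an addition not used in the original notebook, to show one way such rows can be handled.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy splits (illustrative only); "LPG" never appears in the training data.
train = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG"]})
test = pd.DataFrame({"fuel": ["Petrol", "LPG"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["fuel"]])

# The default output is a SciPy sparse matrix; toarray() makes it dense.
test_encoded = enc.transform(test[["fuel"]]).toarray()
print(enc.categories_)
print(test_encoded)
```

With the default `handle_unknown="error"`, the same `transform` call would raise a `ValueError` for the "LPG" row.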

In [26]:
# Now take ['brand', 'km_driven'] from X_train and stack them with the encoded ['fuel', 'owner'] columns in X_train_new

The `np.hstack` function in NumPy stacks a sequence of arrays horizontally. For 2-D arrays this means concatenating along the second axis (index 1), that is, column-wise, so the arrays must have the same number of rows.
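A tiny example with made-up arrays shows the effect on shapes:

```python
import numpy as np

left = np.array([[1, 2], [3, 4]])   # shape (2, 2)
right = np.array([[5], [6]])        # shape (2, 1)

# The columns of `right` are appended to the columns of `left`.
stacked = np.hstack((left, right))
print(stacked.shape)  # (2, 3)
print(stacked)
```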

In [27]:
# The encoded ['fuel', 'owner'] columns are already a NumPy array.
# X_train[['brand','km_driven']].values converts the remaining columns to an array as well.
np.hstack((X_train[['brand','km_driven']].values,X_train_new))

array([['Hyundai', 35000, 1, ..., 0, 0, 0],
       ['Jeep', 60000, 1, ..., 0, 0, 0],
       ['Hyundai', 25000, 0, ..., 0, 0, 0],
       ...,
       ['Tata', 15000, 0, ..., 0, 0, 0],
       ['Maruti', 32500, 1, ..., 1, 0, 0],
       ['Isuzu', 121000, 1, ..., 0, 0, 0]], dtype=object)

In [28]:
X_train.shape

(6502, 4)

In [29]:
X_train_new.shape

(6502, 7)

`drop='first'` in the `OneHotEncoder` call above means that the first category of each feature (in sorted order) is dropped after one-hot encoding. That is why `fuel` (4 categories) and `owner` (5 categories) produce 3 + 4 = 7 encoded columns.

In [30]:
np.hstack((X_train[['brand','km_driven']].values,X_train_new)).shape

(6502, 9)

## One-Hot Encoding with Top Categories

In [31]:
df['brand'].value_counts()

Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: brand, dtype: int64

In [32]:
counts = df['brand'].value_counts()

In [33]:
df['brand'].nunique()
threshold = 100

In [34]:
# Brands with 100 or fewer cars
counts[counts <= threshold].index

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Force', 'Land', 'Isuzu', 'Kia',
       'Ambassador', 'Daewoo', 'MG', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object')

In [35]:
repl = counts[counts <= threshold].index

In [36]:
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

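| | BMW | Chevrolet | Ford | Honda | Hyundai | Mahindra | Maruti | Renault | Skoda | Tata | Toyota | Volkswagen | uncommon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7768 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 602 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 554 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4190 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4583 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |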
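When the data is split into train and test sets, the list of common brands would normally be derived from the training data and then applied to both splits. The sketch below shows one way to do that with pandas. The toy series, the `threshold` of 3, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy brand columns (illustrative values only).
train_brand = pd.Series(["Maruti"] * 5 + ["Hyundai"] * 4 + ["Opel", "Kia"])
test_brand = pd.Series(["Maruti", "Opel", "Peugeot"])  # "Peugeot" is unseen

threshold = 3  # hypothetical cutoff for this toy data
counts = train_brand.value_counts()
common = counts[counts > threshold].index

# Keep common brands; map everything else (rare or unseen) to "uncommon".
train_grouped = train_brand.where(train_brand.isin(common), "uncommon")
test_grouped = test_brand.where(test_brand.isin(common), "uncommon")

print(train_grouped.value_counts())
print(test_grouped.tolist())  # ['Maruti', 'uncommon', 'uncommon']
```

Mapping by membership in `common`, rather than by a list of rare names, also sends brands that never appeared in training to the "uncommon" bucket.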
