# One Hot Encoder


In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
df = pd.read_csv("../../Datasets/Data.csv")
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,,No
1,Spain,,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        8 non-null      float64
 2   Salary     8 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 452.0+ bytes


In [4]:
df.describe()

Unnamed: 0,Age,Salary
count,8.0,8.0
mean,40.25,62750.0
std,6.734771,12691.391908
min,30.0,48000.0
25%,36.5,53500.0
50%,39.0,59500.0
75%,45.0,70000.0
max,50.0,83000.0


In [5]:
df.isna().sum()

Country      0
Age          2
Salary       2
Purchased    0
dtype: int64

In [6]:
df.columns

Index(['Country', 'Age', 'Salary', 'Purchased'], dtype='object')

`OneHotEncoder` is a class in scikit-learn that is used to encode categorical features as one-hot numeric arrays. It creates a binary column for each unique category in the input data, and assigns a value of 1 to the column corresponding to the category of each sample.
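
To make the description above concrete, here is a minimal sketch on a toy single-feature array. The `colors` values are made up for illustration and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy input: one categorical feature, four samples.
colors = np.array([["red"], ["green"], ["red"], ["blue"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(colors).toarray()

# Categories are stored in sorted order, which is also the column order.
print(encoder.categories_[0])
print(encoded)
```

Each row contains exactly one 1, placed in the column that matches that sample's category.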

The `OneHotEncoder` class takes several parameters that control the encoding process. The most commonly used ones are:

- `categories`: The categories of each feature in the input data. The default value is `'auto'`, which infers the categories from the data passed to `fit`.
- `drop`: Whether to drop one of the binary columns for each feature to avoid multicollinearity. The default value is `None`, which keeps all columns. Other options include `'first'`, which drops the first category of every feature, and `'if_binary'`, which drops the first category only for features that have exactly two categories.
- `sparse_output` (named `sparse` in scikit-learn versions before 1.2): Whether to return a sparse matrix instead of a dense array. The default value is `True`.
- `dtype`: The data type of the output array. The default value is `numpy.float64`.

The example below creates a `OneHotEncoder` object called `ohe` and uses its `fit_transform` method to encode the `Country` and `Purchased` columns. Because the encoder returns a sparse matrix by default, `.toarray()` converts the result to a dense NumPy array. The fitted categories are then available in `ohe.categories_`.


In [7]:
ohe = OneHotEncoder()
data = ohe.fit_transform(df[["Country", "Purchased"]]).toarray()  # type: ignore
ohe.categories_

[array(['France', 'Germany', 'Spain'], dtype=object),
 array(['No', 'Yes'], dtype=object)]

In [8]:
df1 = pd.DataFrame(
    data,
    columns=ohe.get_feature_names_out(["Country", "Purchased"]),
    dtype=int,
)
df1

Unnamed: 0,Country_France,Country_Germany,Country_Spain,Purchased_No,Purchased_Yes
0,1,0,0,1,0
1,0,0,1,0,1
2,0,1,0,1,0
3,0,0,1,1,0
4,0,1,0,0,1
5,1,0,0,0,1
6,0,0,1,1,0
7,1,0,0,0,1
8,0,1,0,1,0
9,1,0,0,0,1
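
One way to check that the encoded columns line up with the original labels is to decode them again with `inverse_transform`. The sketch below uses a small hypothetical `toy` frame rather than `Data.csv`, so it runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; the notebook's real data comes from Data.csv.
toy = pd.DataFrame(
    {
        "Country": ["France", "Spain", "Germany"],
        "Purchased": ["No", "Yes", "No"],
    }
)

ohe = OneHotEncoder()
encoded = ohe.fit_transform(toy).toarray()

# inverse_transform maps each one-hot row back to its original category labels.
restored = ohe.inverse_transform(encoded)
print(restored)
```

If the round trip reproduces the input labels, the column names from `get_feature_names_out` can be trusted to describe the encoded array.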


In [9]:
# Replace the original categorical columns with their one-hot encoded versions.
dataset = pd.concat([df.drop(columns=["Country", "Purchased"]), df1], axis=1)
dataset

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain,Purchased_No,Purchased_Yes
0,44.0,,1,0,0,1,0
1,,48000.0,0,0,1,0,1
2,30.0,54000.0,0,1,0,1,0
3,38.0,61000.0,0,0,1,1,0
4,40.0,,0,1,0,0,1
5,35.0,58000.0,1,0,0,0,1
6,,52000.0,0,0,1,1,0
7,48.0,79000.0,1,0,0,0,1
8,50.0,83000.0,0,1,0,1,0
9,37.0,67000.0,1,0,0,0,1
