# Feature Engineering
# Encoding Categorical Variables
Categorical data is a common occurrence in many datasets, especially in fields like marketing, finance, and social sciences. Unlike numerical data, categorical data represents discrete values or categories, such as gender, country, or product type.

## Why Encode Categorical Data?
Most machine learning models operate on numbers, not category labels. Encoding categorical data into a numerical format helps:
* Make data usable: Models such as linear regression and neural networks require numeric inputs.
* Improve accuracy: An encoding that matches the kind of data (ordered or unordered) often improves model performance.
* Ensure consistency: Fitting an encoder once and reusing it keeps preprocessing identical across training and test data.

## Types of Categorical Data
* <b>Nominal Data:</b> Categories with no specific order (e.g., gender, colors, country).
* <b>Ordinal Data:</b> Categories with a meaningful order (e.g., education level, satisfaction rating).
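The difference between the two kinds of category can be made concrete with pandas. The following is a minimal sketch; the values and variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy values, used only to illustrate the two kinds of category.
colors = pd.Categorical(["red", "blue", "green", "blue"])  # nominal: no order

education = pd.Categorical(
    ["UG", "School", "PG", "UG"],
    categories=["School", "UG", "PG"],  # explicit low-to-high order
    ordered=True,                       # ordinal: comparisons are meaningful
)

print(colors.ordered)                    # False
print(education.ordered)                 # True
print(education.min(), education.max())  # School PG
print(list(education.codes))             # [1, 0, 2, 1]
```

Only the ordered categorical supports `min`/`max`, which is the property an ordinal encoding is meant to preserve.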

## Encoding Techniques in Sklearn
1. Label Encoding
   * Assigns a unique number to each category.
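   * Intended for target labels (`y`). Codes follow the sorted order of the categories, so use Ordinal Encoding when the order of a feature matters.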

In [2]:
from sklearn.preprocessing import LabelEncoder

# Assumes `data` is a pandas DataFrame that has a "categorical_column" column.
le = LabelEncoder()
data["encoded_column"] = le.fit_transform(data["categorical_column"])
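One detail worth seeing in action: `LabelEncoder` numbers categories in sorted order, not in any order you might have in mind. A small sketch with hypothetical values:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values whose natural order is low < medium < high.
sizes = ["low", "medium", "high", "medium"]

le = LabelEncoder()
codes = le.fit_transform(sizes)

# Classes are sorted alphabetically, so "high" gets the smallest code.
print(list(le.classes_))  # ['high', 'low', 'medium']
print(list(codes))        # [1, 2, 0, 2]
```

Because the codes do not respect low < medium < high, an encoder with an explicit category order is usually the safer choice for ordinal features.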

2. One-Hot Encoding
* Converts categories into a binary matrix.
* Best for nominal data.


In [None]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

data = np.array(['red', 'blue', 'green']).reshape(-1, 1)
ohe = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older versions use sparse=False
encoded_data = ohe.fit_transform(data)
print(encoded_data)
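In practice, new data can contain a category that was not seen during fitting. The sketch below shows one way to handle that, using `handle_unknown="ignore"`; the colour values are illustrative, and `.toarray()` is used so the example does not depend on the name of the sparse-output parameter.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array(["red", "blue", "green"]).reshape(-1, 1)

# handle_unknown="ignore" encodes categories not seen during fit as all zeros.
ohe = OneHotEncoder(handle_unknown="ignore")
encoded = ohe.fit_transform(colors).toarray()  # default output is a sparse matrix

print(ohe.categories_[0])  # ['blue' 'green' 'red'] (sorted order defines the columns)
print(encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]

# "pink" was never seen during fit, so its row is all zeros.
unseen = ohe.transform(np.array([["pink"]])).toarray()
print(unseen)  # [[0. 0. 0.]]
```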

3. Ordinal Encoding

* Assigns ordered numbers to categories.
* Best for ordinal data.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

data = np.array(['low', 'medium', 'high']).reshape(-1, 1)
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded_data = oe.fit_transform(data)
print(encoded_data)
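By default, `OrdinalEncoder` raises an error when it meets a category that was not listed or seen during fitting. A hedged sketch of one alternative, assuming scikit-learn 0.24 or later, maps unknown values to a sentinel code instead; the value `"huge"` is a made-up unseen category.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

levels = np.array(["low", "medium", "high"]).reshape(-1, 1)

oe = OrdinalEncoder(
    categories=[["low", "medium", "high"]],  # explicit order: low=0, medium=1, high=2
    handle_unknown="use_encoded_value",
    unknown_value=-1,                        # code assigned to unseen categories
)
oe.fit(levels)

# "huge" is a hypothetical category that was not present during fit.
new_levels = np.array(["high", "huge", "low"]).reshape(-1, 1)
new_codes = oe.transform(new_levels).ravel()
print(new_codes)  # [ 2. -1.  0.]
```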

4. Binary Encoding
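* Assigns each category an integer, then spreads the binary digits of that integer across several 0/1 columns.
* Works well for features with many categories, since it needs roughly log2(n) columns instead of n.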

In [None]:
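# category_encoders is a separate package (pip install category_encoders).
# Assumes `data` is a pandas DataFrame that has a "categorical_column" column.
from category_encoders import BinaryEncoder
be = BinaryEncoder()
encoded_data = be.fit_transform(data['categorical_column'])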
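To show what the idea amounts to, here is a hand-rolled sketch of binary encoding using only pandas. It illustrates the technique and is not the implementation used by `category_encoders`, whose integer assignment and column names may differ. The city names are hypothetical.

```python
import pandas as pd

# Hypothetical high-cardinality column (kept tiny for readability).
cities = pd.Series(["Delhi", "Mumbai", "Pune", "Delhi", "Chennai"], name="city")

# Step 1: give each category an integer code, starting at 1, in order of first appearance.
codes = pd.factorize(cities)[0] + 1

# Step 2: work out how many bits are needed for the largest code.
n_bits = int(codes.max()).bit_length()

# Step 3: one 0/1 column per bit, most significant bit first.
binary = pd.DataFrame(
    {f"city_{b}": (codes >> (n_bits - 1 - b)) & 1 for b in range(n_bits)},
    index=cities.index,
)
print(binary)
#    city_0  city_1  city_2
# 0       0       0       1
# 1       0       1       0
# 2       0       1       1
# 3       0       0       1
# 4       1       0       0
```

Four categories fit in three columns here; one-hot encoding would need four, and the gap widens quickly as the number of categories grows.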

5. Frequency Encoding
* Replaces each category with its frequency in the dataset.
* Useful for high-cardinality data.

In [None]:
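import pandas as pd

# Assumes `data` is a pandas DataFrame that has a "categorical_column" column.
# value_counts() gives the number of rows per category; map() looks each row up in it.
data['encoded'] = data['categorical_column'].map(data['categorical_column'].value_counts())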
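When the data is split into training and test sets, the frequencies are normally computed on the training set only and then reused. The following sketch assumes two small hypothetical DataFrames and uses relative frequencies; mapping unseen categories to 0 is a design choice, not the only option.

```python
import pandas as pd

# Hypothetical training and test data.
train = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Pune", "Delhi"]})
test = pd.DataFrame({"city": ["Mumbai", "Chennai"]})

# Learn relative frequencies from the training data only.
freq = train["city"].value_counts(normalize=True)

train["city_freq"] = train["city"].map(freq)
# "Chennai" never appears in train, so map() yields NaN; treat it as frequency 0.
test["city_freq"] = test["city"].map(freq).fillna(0.0)

print(train)
print(test)
```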

## Choosing the Right Encoding Method
* Use One-Hot Encoding for nominal data.
* Use Ordinal Encoding for ordinal features, and Label Encoding for the target column.
* Use Frequency Encoding for high-cardinality features.
* Use Binary Encoding to keep the number of columns low when One-Hot Encoding would create too many.
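Different columns of the same table often need different encoders. One common way to combine them is scikit-learn's `ColumnTransformer`. The sketch below uses a made-up three-row table with one nominal and one ordinal column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical table: "gender" is nominal, "review" is ordinal.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "review": ["Good", "Poor", "Average"],
})

ct = ColumnTransformer(
    transformers=[
        ("nominal", OneHotEncoder(), ["gender"]),
        ("ordinal", OrdinalEncoder(categories=[["Poor", "Average", "Good"]]), ["review"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = ct.fit_transform(toy)
print(encoded)
# Columns: gender_Female, gender_Male, review
# [[1. 0. 2.]
#  [0. 1. 0.]
#  [0. 1. 1.]]
```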

## Conclusion
Encoding categorical data is essential for machine learning. Scikit-learn provides several encoders, and libraries such as category_encoders add more; each has its benefits and limitations. Choosing a method that matches the type and cardinality of the data generally leads to better model performance.

# Solving a Problem

In [4]:
import numpy as np
import pandas as pd

In [7]:
df = pd.read_csv("customer.csv")

In [9]:
df.sample(4)

Unnamed: 0,age,gender,review,education,purchased
36,34,Female,Good,UG,Yes
28,48,Male,Poor,School,No
6,18,Male,Good,School,No
2,70,Female,Good,PG,No


In [10]:
# Keep only the categorical columns: review, education, and the target purchased.
df = df.iloc[:,2:]

In [11]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [12]:
from sklearn.model_selection import train_test_split
# Features: review and education; target: purchased. 20% of the rows are held out for testing.
X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,0:2],df.iloc[:,-1],test_size=0.2)

In [13]:
X_train

Unnamed: 0,review,education
26,Poor,PG
14,Poor,PG
41,Good,PG
3,Good,PG
37,Average,PG
38,Good,School
44,Average,UG
40,Good,School
1,Poor,UG
22,Poor,PG


# OrdinalEncoder

In [14]:
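from sklearn.preprocessing import OrdinalEncoder
# One list per column, each ordered from lowest to highest: review, then education.
oe = OrdinalEncoder(categories=[["Poor","Average","Good"],["School", "UG","PG"]])
X_train = oe.fit_transform(X_train)  # learn the mapping on the training data
X_test = oe.transform(X_test)        # reuse the same mapping on the test data
X_train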

array([[0., 2.],
       [0., 2.],
       [2., 2.],
       [2., 2.],
       [1., 2.],
       [2., 0.],
       [1., 1.],
       [2., 0.],
       [0., 1.],
       [0., 2.],
       [0., 1.],
       [1., 0.],
       [0., 2.],
       [2., 1.],
       [1., 1.],
       [2., 1.],
       [0., 2.],
       [2., 0.],
       [0., 2.],
       [0., 0.],
       [0., 0.],
       [1., 1.],
       [2., 0.],
       [1., 1.],
       [0., 1.],
       [0., 0.],
       [2., 1.],
       [2., 0.],
       [0., 2.],
       [0., 2.],
       [1., 2.],
       [2., 1.],
       [1., 0.],
       [0., 2.],
       [2., 2.],
       [2., 2.],
       [2., 1.],
       [2., 2.],
       [0., 0.],
       [1., 0.]])
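To check that the codes mean what you expect, you can map them back to the original labels with `inverse_transform`. This is a self-contained sketch on a few made-up rows that mirror the `review` and `education` columns above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows with the same two columns as the customer data.
X_train_toy = pd.DataFrame({
    "review": ["Poor", "Good", "Average"],
    "education": ["PG", "School", "UG"],
})
X_test_toy = pd.DataFrame({"review": ["Good"], "education": ["UG"]})

oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]])
train_enc = oe.fit_transform(X_train_toy)  # fit on training rows only
test_enc = oe.transform(X_test_toy)        # same mapping applied to test rows

print(train_enc)
# [[0. 2.]
#  [2. 0.]
#  [1. 1.]]
print(test_enc)                        # [[2. 1.]]
print(oe.inverse_transform(test_enc))  # [['Good' 'UG']]
```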

# LabelEncoder

## Note
* Encodes target labels with values between 0 and n_classes-1.
* This transformer should be used to encode target values, i.e. y, and not the input X.

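In [32]:
# y_train should still hold the raw "purchased" labels here.
# If this prints LabelEncoder(), y_train was overwritten with the return value
# of le.fit(); re-run the train_test_split cell to restore it.
print(y_train)

In [33]:
import numpy as np

# Convert the labels to a flat 1-D array (optional: LabelEncoder also accepts a Series).
y_train = np.array(y_train).ravel()

In [34]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(y_train)  # fit() returns the encoder itself; do not assign it back to y_train

In [35]:
# Run this cell only once per fitted encoder. "ValueError: y contains previously
# unseen labels" means the input holds values that were not present during fit,
# for example labels that have already been encoded.
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [36]:
y_train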
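A minimal, self-contained sketch of the intended target-encoding workflow is shown below. The "Yes"/"No" labels are hypothetical stand-ins for the `purchased` column; the encoder is fitted on the training labels once and then reused for the test labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels standing in for the "purchased" column.
y_train_toy = pd.Series(["No", "Yes", "No", "No"])
y_test_toy = pd.Series(["Yes", "No"])

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train_toy)  # learn the classes and encode in one step
y_test_enc = le.transform(y_test_toy)        # reuse the learned classes

print(list(le.classes_))                 # ['No', 'Yes']
print(y_train_enc)                       # [0 1 0 0]
print(y_test_enc)                        # [1 0]
print(le.inverse_transform(y_test_enc))  # ['Yes' 'No']
```

Storing the encoded labels under new names (`y_train_enc`, `y_test_enc`) keeps the original labels intact, so re-running a cell cannot encode already-encoded values.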