<a href="https://colab.research.google.com/github/faisu6339-glitch/Machine-learning/blob/main/F3_Ordinal_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering

### Data in Feature Engineering

**Feature engineering** is the process of using domain knowledge to extract features from raw data. These features are then used to improve the performance of machine learning algorithms. The 'data' in feature engineering refers to the raw information that is transformed into a set of meaningful features.

Effective feature engineering can make a significant difference in model accuracy and interpretability. It involves understanding the data, selecting relevant variables, and creating new variables that better represent the underlying problem.

### Numerical Data

**Numerical data** (also known as quantitative data) represents measurable quantities. These are values that can be counted or measured and are always expressed in numbers. Numerical data can be used for mathematical operations.

There are two main types of numerical data:

1.  **Discrete Data**: Represents items that can be counted and have distinct, separate values. They are usually integers and cannot be divided into smaller units meaningfully (e.g., number of students in a class, number of cars).
2.  **Continuous Data**: Represents measurements and can take any value within a given range. They can be divided into finer and finer levels (e.g., height, weight, temperature, time).

### Categorical Data

**Categorical data** (also known as qualitative data) represents characteristics or qualities that can be grouped into categories. These values are not inherently numerical and cannot be subjected to mathematical operations in their raw form. They often represent labels or names.

For machine learning models, categorical data usually needs to be converted into a numerical format (e.g., through one-hot encoding or label encoding) before it can be used.

### Types of Categorical Data

There are several types of categorical data, distinguished by their level of measurement:

1.  **Nominal Data**: This type of data consists of categories that have no natural order or ranking. The categories are simply names or labels. There is no concept of one category being 'greater' or 'lesser' than another.
    *   **Examples**: Colors (Red, Blue, Green), Marital Status (Single, Married, Divorced), Types of Animals (Dog, Cat, Bird).

2.  **Ordinal Data**: This type of data consists of categories that have a natural, meaningful order or ranking, but the differences between categories are not necessarily equal or quantifiable. We know the relative order, but not the magnitude of the difference.
    *   **Examples**: Educational Level (High School, Bachelor's, Master's, PhD), Customer Satisfaction (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied), T-shirt Sizes (S, M, L, XL).

3.  **Binary/Dichotomous Data**: This is a special case of nominal data where there are only two possible categories. It's often used to represent 'yes/no' or 'true/false' scenarios.
    *   **Examples**: Gender (Male, Female), Outcome (Pass, Fail), Email Status (Opened, Unopened).

Understanding these distinctions is crucial for choosing the right preprocessing techniques in feature engineering.
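
The nominal/ordinal distinction above can be made explicit in pandas. The following is a minimal sketch, assuming toy color and T-shirt-size values taken from the examples in this section; the variable names are illustrative only.

```python
import pandas as pd

# Nominal: the categories are plain labels with no ranking.
colors = pd.Categorical(
    ["Red", "Blue", "Green", "Blue"],
    categories=["Red", "Blue", "Green"],
    ordered=False,
)

# Ordinal: an explicit ranking S < M < L < XL.
sizes = pd.Categorical(
    ["M", "S", "XL", "L"],
    categories=["S", "M", "L", "XL"],
    ordered=True,
)

print(colors.ordered)            # False
print(sizes.ordered)             # True
print(sizes.min(), sizes.max())  # S XL
print(sizes.codes.tolist())      # [1, 0, 3, 2]
```

Only the ordered categorical supports `min()` and `max()`, which mirrors the idea that ranking is meaningful for ordinal data but not for nominal data.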

### When to use One-Hot Encoding and Label Encoding

Understanding when to use One-Hot Encoding versus Label Encoding is crucial for preparing categorical data for machine learning models. Here's a breakdown:

### Label Encoding

**When to Use:**

*   **Ordinal Data**: Label (integer) encoding is best suited to categorical features that have a natural, intrinsic order or ranking (ordinal data), provided the integers follow that order.
    *   **Examples**: 'Low', 'Medium', 'High' could be encoded as 0, 1, 2 respectively, and 'Small', 'Medium', 'Large' for clothing sizes in the same way. Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order and is intended for target labels; for ordinal features, `OrdinalEncoder` with an explicit `categories` list keeps the intended ranking.
*   **High Cardinality Features**: If you have a very large number of unique categories, and One-Hot Encoding would lead to an excessively sparse dataset (too many columns), Label Encoding might be considered. However, this comes with a caveat for nominal data.
*   **Tree-based Algorithms**: Tree-based algorithms (such as Decision Trees, Random Forests, and Gradient Boosting Machines) split on numerical thresholds, so they can often use integer-encoded features directly. For nominal features those splits depend on an arbitrary order, so the result is not guaranteed to be optimal.

**Why to Use/Consideration:**

*   It's simple and efficient: the feature stays a single column, so the dataset does not grow wider.
*   It preserves the ordinal relationship if one exists.

**Caution:**

*   **Avoid for Nominal Data**: If the categories do **not** have an intrinsic order (nominal data), Label Encoding will impose an arbitrary order. This can mislead models into thinking there's a numerical relationship (e.g., that 'Red' (encoded as 0) is 'less than' 'Blue' (encoded as 1)), which can negatively impact model performance, especially with linear models or neural networks.
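
The arbitrary-order problem can be seen even on ordinal labels. The sketch below assumes a toy 'Low'/'Medium'/'High' feature and compares scikit-learn's `LabelEncoder`, which sorts classes alphabetically, with `OrdinalEncoder` given an explicit ranking. The names `le_demo` and `oe_demo` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = np.array(["Low", "Medium", "High", "Medium", "Low"])

# LabelEncoder sorts classes alphabetically: High=0, Low=1, Medium=2.
le_demo = LabelEncoder()
alphabetical = le_demo.fit_transform(levels)

# OrdinalEncoder with explicit categories keeps Low < Medium < High.
oe_demo = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ranked = oe_demo.fit_transform(levels.reshape(-1, 1)).ravel()

print(alphabetical.tolist())  # [1, 2, 0, 2, 1]
print(ranked.tolist())        # [0.0, 1.0, 2.0, 1.0, 0.0]
```

With the alphabetical codes, 'High' ends up below 'Low', which is exactly the kind of misleading numerical relationship described above.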

### One-Hot Encoding

**When to Use:**

*   **Nominal Data**: One-Hot Encoding is the preferred method for categorical features where there is **no inherent order or ranking** among the categories (nominal data).
    *   **Examples**: 'Red', 'Blue', 'Green' (colors); 'Male', 'Female' (gender); 'Dog', 'Cat', 'Bird' (animal types).
*   **Models Sensitive to Arbitrary Ordering**: For models like Linear Regression, Logistic Regression, Support Vector Machines (SVMs), or Neural Networks, which interpret numerical values as having a magnitude, One-Hot Encoding prevents the model from assuming an incorrect ordinal relationship.

**Why to Use/Consideration:**

*   It avoids implying any ordinal relationship between categories where none exists.
*   Each category is treated as an independent feature.

**Caution:**

*   **Increased Dimensionality**: It creates a new binary column for each unique category. If a feature has many unique categories (high cardinality), this can lead to a very wide dataset (the "curse of dimensionality"), which can increase computational cost and potentially lead to sparsity issues.
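
As a minimal sketch of what One-Hot Encoding produces, assuming a toy `color` column (not part of the customer dataset), scikit-learn's `OneHotEncoder` can be applied as follows:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors_df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# OneHotEncoder returns a sparse matrix by default; convert it for display.
ohe_demo = OneHotEncoder()
one_hot = ohe_demo.fit_transform(colors_df).toarray()

print(ohe_demo.categories_[0].tolist())  # ['Blue', 'Green', 'Red']
print(one_hot.shape)                     # (4, 3)
print(one_hot[0].tolist())               # [0.0, 0.0, 1.0]
```

Three unique categories become three binary columns, which illustrates how the column count grows with cardinality.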

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [None]:
df = pd.read_csv("customer_120_rows.csv")

In [None]:
df.head()

   age  gender          review      education purchased
0   58    Male  Will buy again       Graduate       Yes
1   25    Male   Great product  Undergraduate       Yes
2   19  Female   Not satisfied  Undergraduate        No
3   35    Male   Great product       Graduate       Yes
4   33    Male    Disappointed  Undergraduate        No


In [None]:
df.columns

Index(['age', 'gender', 'review', 'education', 'purchased'], dtype='object')

In [None]:
df["gender"].unique()

array(['Male', 'Female'], dtype=object)

In [None]:
df["review"].unique()

array(['Will buy again', 'Great product', 'Not satisfied', 'Disappointed',
       'Could be better', 'Highly recommended', 'Very satisfied',
       'Amazing experience', 'Excellent quality', 'Not good', 'Loved it',
       'Average product', 'Bad quality', 'Waste of money', 'Poor service',
       'Worth the money', 'Very useful'], dtype=object)

In [None]:
df["education"].unique()

array(['Graduate', 'Undergraduate', 'Postgraduate'], dtype=object)

In [None]:
df["purchased"].unique()

array(['Yes', 'No'], dtype=object)

In [None]:
print(df['education'].unique())
print(df['review'].unique())


['Graduate' 'Undergraduate' 'Postgraduate']
['Will buy again' 'Great product' 'Not satisfied' 'Disappointed'
 'Could be better' 'Highly recommended' 'Very satisfied'
 'Amazing experience' 'Excellent quality' 'Not good' 'Loved it'
 'Average product' 'Bad quality' 'Waste of money' 'Poor service'
 'Worth the money' 'Very useful']


Based on their unique values, the dataset's columns can be classified as follows:

*   **Nominal Data:**
    *   `gender`: 'Male', 'Female' - These are categories with no intrinsic order.
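    *   `review`: This column contains phrases such as 'Will buy again', 'Great product', and 'Not satisfied'. The phrases imply sentiment, but there is no universally agreed ranking across all 17 unique values, so the column is nominal by default. Treating it as ordinal requires imposing a manual sentiment ranking, which is what the encoding step later in this notebook does.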
    *   `purchased`: 'Yes', 'No' - These are two distinct categories without a natural order.

*   **Ordinal Data:**
    *   `education`: 'Undergraduate', 'Graduate', 'Postgraduate' - These categories have a clear, inherent order (e.g., Undergraduate < Graduate < Postgraduate).

The `age` column is numerical.
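
One way to act on this classification is to route each column to a matching encoder. The sketch below is an assumption about how that could look with scikit-learn's `ColumnTransformer`; the `sample` rows are hypothetical stand-ins that only mimic the dataset's columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical rows that mimic the columns of the customer dataset.
sample = pd.DataFrame({
    "age": [58, 25, 19, 35],
    "gender": ["Male", "Male", "Female", "Male"],
    "education": ["Graduate", "Undergraduate", "Undergraduate", "Postgraduate"],
})

preprocess = ColumnTransformer(
    transformers=[
        # Nominal column: one binary column per category.
        ("gender", OneHotEncoder(), ["gender"]),
        # Ordinal column: integers that follow the stated ranking.
        ("education",
         OrdinalEncoder(categories=[["Undergraduate", "Graduate", "Postgraduate"]]),
         ["education"]),
    ],
    remainder="passthrough",  # numeric `age` is passed through unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded = preprocess.fit_transform(sample)
print(encoded.shape)        # (4, 4)
print(encoded[0].tolist())  # [0.0, 1.0, 1.0, 58.0]
```

The output columns are, in order: `gender=Female`, `gender=Male`, the education rank, and `age`.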

In [None]:
# LabelEncoder and train_test_split are already imported above.
# OneHotEncoder is imported for nominal features; it is not used in the ordinal workflow below.
from sklearn.preprocessing import OneHotEncoder

# Select Only Ordinal Columns

In [None]:
X = df[['education', 'review']]
y = df['purchased']


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
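
With a small dataset, a plain random split can leave the test set with a different class balance than the training set. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument. The sketch below uses hypothetical toy data to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 15 'Yes' and 5 'No' labels.
demo_X = pd.DataFrame({"feature": range(20)})
demo_y = pd.Series(["Yes"] * 15 + ["No"] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

# The 75/25 class ratio is preserved in the 4-row test set.
print(y_te.value_counts().to_dict())  # {'Yes': 3, 'No': 1}
```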


In [None]:
oe = OrdinalEncoder(categories=[
    ['Undergraduate', 'Graduate', 'Postgraduate'],  # education
    [
        'Bad quality',
        'Poor service',
        'Waste of money',
        'Not good',
        'Disappointed',
        'Could be better',
        'Average product',
        'Not satisfied',
        'Very useful',
        'Worth the money',
        'Great product',
        'Excellent quality',
        'Very satisfied',
        'Highly recommended',
        'Amazing experience',
        'Loved it',
        'Will buy again'
    ]  # review: manual ranking from most negative to most positive
])

X_train_encoded = oe.fit_transform(X_train)
X_test_encoded = oe.transform(X_test)

print("Encoded X_train sample:\n", X_train_encoded[:5])


Encoded X_train sample:
 [[ 0.  2.]
 [ 2.  3.]
 [ 1. 12.]
 [ 2. 13.]
 [ 2.  9.]]
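
By default, `OrdinalEncoder.transform` raises an error when it meets a category that was not listed at fit time. The following sketch, assuming scikit-learn 0.24 or later and a hypothetical unseen value 'Doctorate', shows one way to map unknown categories to a sentinel instead.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_demo = pd.DataFrame({"education": ["Undergraduate", "Graduate", "Postgraduate"]})
# 'Doctorate' is a hypothetical value that never appears in the training data.
new_demo = pd.DataFrame({"education": ["Graduate", "Doctorate"]})

oe_safe = OrdinalEncoder(
    categories=[["Undergraduate", "Graduate", "Postgraduate"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # sentinel code for unseen categories
)
oe_safe.fit(train_demo)

print(oe_safe.transform(new_demo).ravel().tolist())  # [1.0, -1.0]
```

Whether -1 is a sensible value for the downstream model is a separate design decision; for a linear model it places unknowns below the lowest rank.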


In [None]:
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

print("Encoded y_train sample:\n", y_train_encoded[:10])


Encoded y_train sample:
 [0 0 1 1 1 0 0 1 1 1]
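
To confirm which integer corresponds to which label, the fitted `LabelEncoder` exposes `classes_` and `inverse_transform`. This is a small sketch on a toy 'Yes'/'No' array, with `le_target` as an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

target_demo = np.array(["Yes", "No", "Yes", "Yes", "No"])

le_target = LabelEncoder()
encoded_target = le_target.fit_transform(target_demo)

# The position in classes_ is the integer code: 'No' -> 0, 'Yes' -> 1.
print(le_target.classes_.tolist())                   # ['No', 'Yes']
print(encoded_target.tolist())                       # [1, 0, 1, 1, 0]
print(le_target.inverse_transform([0, 1]).tolist())  # ['No', 'Yes']
```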


In [None]:
model = LogisticRegression()
model.fit(X_train_encoded, y_train_encoded)


In [None]:
y_pred = model.predict(X_test_encoded)


In [None]:
accuracy = accuracy_score(y_test_encoded, y_pred)
print("Model Accuracy:", accuracy)


Model Accuracy: 1.0
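
A single 80/20 split of 120 rows leaves only 24 test rows, so one accuracy figure can be optimistic. The sketch below shows how cross-validation over a pipeline could give a more stable estimate. It runs on synthetic stand-in data with a shortened, hypothetical review scale, not on the customer CSV, so its scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in data (hypothetical, not the customer CSV).
rng = np.random.default_rng(0)
education_levels = ["Undergraduate", "Graduate", "Postgraduate"]
review_levels = ["Bad quality", "Average product", "Great product"]
demo = pd.DataFrame({
    "education": rng.choice(education_levels, size=60),
    "review": rng.choice(review_levels, size=60),
})

# Label is 1 for a positive review, with roughly 10% of labels flipped as noise.
labels = (demo["review"] == "Great product").astype(int).to_numpy()
flip = rng.random(60) < 0.1
labels = np.where(flip, 1 - labels, labels)

# Encoding lives inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(
    OrdinalEncoder(categories=[education_levels, review_levels]),
    LogisticRegression(),
)
scores = cross_val_score(pipe, demo, labels, cv=5, scoring="accuracy")
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```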


# New Customer – HIGH education + POSITIVE review

In [None]:
new_customer = pd.DataFrame({
    'education': ['Postgraduate'],
    'review': ['Will buy again']
})

new_customer_encoded = oe.transform(new_customer)
prediction = model.predict(new_customer_encoded)

print("Prediction (1 = Purchased, 0 = Not Purchased):", prediction)


Prediction (1 = Purchased, 0 = Not Purchased): [1]


# New Customer – LOW education + NEGATIVE review

In [None]:
new_customer2 = pd.DataFrame({
    'education': ['Undergraduate'],
    'review': ['Bad quality']
})

new_customer2_encoded = oe.transform(new_customer2)
prediction2 = model.predict(new_customer2_encoded)

print("Prediction (1 = Purchased, 0 = Not Purchased):", prediction2)


Prediction (1 = Purchased, 0 = Not Purchased): [0]


# Medium Case – Graduate + Average review

In [None]:
new_customer3 = pd.DataFrame({
    'education': ['Graduate'],
    'review': ['Average product']
})

new_customer3_encoded = oe.transform(new_customer3)
prediction3 = model.predict(new_customer3_encoded)

print("Prediction (1 = Purchased, 0 = Not Purchased):", prediction3)


Prediction (1 = Purchased, 0 = Not Purchased): [0]
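
For borderline cases like this one, the predicted class alone hides how close the decision was. `LogisticRegression.predict_proba` returns the class probabilities. The sketch below fits a small stand-in model on hypothetical, already-encoded rows, so the numbers are illustrative and unrelated to the notebook's fitted `model`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical already-encoded features: [education_rank, review_rank].
X_demo = np.array([[0, 0], [1, 2], [2, 5], [0, 9], [1, 12], [2, 16]], dtype=float)
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf_demo = LogisticRegression().fit(X_demo, y_demo)

# A mid-scale customer: Graduate (1) with a middling review rank (6).
borderline = np.array([[1, 6]], dtype=float)
proba = clf_demo.predict_proba(borderline)

print(clf_demo.classes_)  # column order of predict_proba
print(proba.round(3))     # probabilities for class 0 and class 1
```

In the notebook itself, the equivalent call would be `model.predict_proba(new_customer3_encoded)`.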


In [None]:
print("Model Coefficients:", model.coef_)


Model Coefficients: [[-0.57688855  2.13374315]]
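
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below applies this to the two coefficients printed above, assuming they follow the feature order `[education, review]` used when building `X`.

```python
import numpy as np

# Coefficients printed by the fitted model above: [education, review].
coef = np.array([-0.57688855, 2.13374315])

# exp(coefficient) = multiplicative change in the odds per one-step increase.
odds_ratios = np.exp(coef)

for name, ratio in zip(["education", "review"], odds_ratios):
    print(f"{name}: odds of purchase multiply by {ratio:.2f} per one-step increase")
```

The review ratio is well above 1 and the education ratio is below 1, so in this fitted model a better review rank raises the odds of purchase while a higher education rank lowers them slightly. This reading treats every step on the 17-level review scale as equally large, which is an assumption introduced by the integer encoding rather than a property of the data.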
