# Handling Non-Numeric Features: A Complete Guide

In data science and machine learning, not everything can be measured with numbers. When analyzing employee data, for example, you might have departments (Sales, Engineering, Marketing), education levels (High School, Bachelor's, Master's), or office locations (New York, Seattle, Denver). These are examples of **categorical features**, and handling them properly can make or break your machine learning models.

### Outline:
1. Setting Up Our Demo Dataset
2. What Are Categorical Features?
3. Two Main Types of Categorical Features
4. Why Do Computers Struggle with Categories?
5. Label Encoding: The Naive Approach (And Why to Avoid It)
6. Handling Nominal Features
7. Handling Ordinal Features


## 1. Setting Up Our Demo Dataset

Throughout this article, we'll use a demo dataset: **employee data from a company**. This dataset will help us explore different types of categorical features and encoding techniques.


In [1]:
import pandas as pd
import numpy as np

# Create our employee dataset
data = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'department': ['Sales', 'Engineering', 'Marketing', 'Sales', 'Engineering',
                   'HR', 'Marketing', 'Engineering', 'Sales', 'Finance'],
    'education': ['Bachelor', 'Master', 'High School', 'Bachelor', 'PhD',
                  'Master', 'Bachelor', 'Master', 'Bachelor', 'Master'],
    'city': ['New York', 'Seattle', 'Miami', 'Seattle', 'New York',
             'Denver', 'Miami', 'Denver', 'Chicago', 'Austin'],
    'experience_level': ['Junior', 'Senior', 'Entry', 'Mid', 'Senior',
                        'Mid', 'Junior', 'Senior', 'Mid', 'Junior'],
    'salary': [50000, 95000, 35000, 65000, 120000, 75000, 55000, 85000, 60000, 48000]
})

print(data.head())

   employee_id   department    education      city experience_level  salary
0            1        Sales     Bachelor  New York           Junior   50000
1            2  Engineering       Master   Seattle           Senior   95000
2            3    Marketing  High School     Miami            Entry   35000
3            4        Sales     Bachelor   Seattle              Mid   65000
4            5  Engineering          PhD  New York           Senior  120000


## 2. What Are Categorical Features?

Categorical features are *labels or categories that describe characteristics of the data*. Unlike numerical features (like salary, age, or temperature), categorical features represent groups, classes, or categories that things belong to.

Looking at our employee dataset:
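- **Numerical features**: `salary` (e.g., 50000). Note that `employee_id` (1, 2, 3...) is stored as a number but is only an identifier, not a measured quantity, so it should not be treated as a numeric predictor.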
- **Categorical features**: `department` (Sales, Engineering, Marketing), `education` (Bachelor, Master, PhD), `city` (New York, Seattle, Miami)

We can think of categorical features as answers to questions like "What type?", "Which category?", or "What level?" rather than "How much?" or "How many?"
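As a quick first pass, pandas can separate columns by dtype. The sketch below rebuilds a three-row slice of the demo data so that it runs on its own (the names `sample`, `categorical_cols`, and `numeric_cols` are only illustrative) and uses `select_dtypes`. Treat the result as a starting point rather than a verdict, since an identifier such as `employee_id` is stored as a number without being a true quantity.

```python
import pandas as pd

# A small slice of the demo dataset, rebuilt so this snippet is self-contained
sample = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'department': ['Sales', 'Engineering', 'Marketing'],
    'education': ['Bachelor', 'Master', 'High School'],
    'salary': [50000, 95000, 35000],
})

# Columns with a non-numeric dtype are candidates for categorical handling
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()
numeric_cols = sample.select_dtypes(include='number').columns.tolist()

print("Categorical candidates:", categorical_cols)
print("Numeric columns:", numeric_cols)

# The number of unique values per column is another useful hint
print(sample.nunique())
```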

## 3. Two Main Types of Categorical Features

### 1. Nominal Features (No Natural Order)

**Nominal** features are categories that are simply labels with *no inherent ranking or order*. In our dataset, these include:

**Department**: Sales, Engineering, Marketing, HR, Finance
- We can't say "Sales > Engineering" or "Marketing < HR" in any meaningful way

**City**: New York, Seattle, Miami, Denver, Chicago, Austin  
- Cities are different locations, but one isn't "greater than" another


In [2]:
# Check our nominal features
print("Departments:", data['department'].unique())
print("Cities:", data['city'].unique())

Departments: ['Sales' 'Engineering' 'Marketing' 'HR' 'Finance']
Cities: ['New York' 'Seattle' 'Miami' 'Denver' 'Chicago' 'Austin']


### 2. Ordinal Features (Natural Order Exists)

**Ordinal** features have a *clear ranking or order* between categories. We can arrange them from lowest to highest, worst to best, or smallest to largest.

**Education Level**: High School < Bachelor < Master < PhD
- Each level represents more years of education
- Clear progression in academic achievement

**Experience Level**: Entry < Junior < Mid < Senior
- Each level represents more years of work experience  
- Clear hierarchy in responsibility and expertise


In [3]:
# Check our ordinal features
print("Education levels:", data['education'].unique())
print("Experience levels:", data['experience_level'].unique())

Education levels: ['Bachelor' 'Master' 'High School' 'PhD']
Experience levels: ['Junior' 'Senior' 'Entry' 'Mid']
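Notice that `unique()` returns the levels in order of appearance, not in rank order. One way to record the ranking explicitly is an ordered pandas `Categorical`. The snippet below is a minimal sketch on the first five education values; the variable names are illustrative.

```python
import pandas as pd

education = pd.Series(['Bachelor', 'Master', 'High School', 'Bachelor', 'PhD'])

# Declare the ranking explicitly, from lowest to highest
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
education_cat = pd.Series(
    pd.Categorical(education, categories=education_order, ordered=True)
)

# With an ordered categorical, min/max, sorting, and comparisons follow the ranking
print(education_cat.min(), "<", education_cat.max())
print(education_cat.sort_values().tolist())
print((education_cat >= 'Master').tolist())

# The integer codes follow the declared order (High School=0, ..., PhD=3)
print(education_cat.cat.codes.tolist())
```

Declaring the order once keeps it attached to the column, so later sorting or encoding steps do not silently fall back to alphabetical order.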


## 4. Why Do Computers Struggle with Categories?

Here's the fundamental challenge: *most machine learning algorithms are mathematical at their core, and mathematics works with numbers, not words*.

Imagine trying to calculate the average of these departments from our dataset:
- Sales
- Engineering  
- Marketing

It doesn't make sense, right? We can't add "Sales + Engineering" and divide by 2!

Let's see what happens if we try to use categorical features directly in a model:


In [5]:
from sklearn.linear_model import LinearRegression

# This will cause an error!
try:
    model = LinearRegression()
    X = data[['department', 'education', 'city']]  # Categorical features
    y = data['salary']
    model.fit(X, y)
except Exception as e:
    print(f"Error: {e}")
    print("Machine learning algorithms need numerical input!")

Error: could not convert string to float: 'Sales'
Machine learning algorithms need numerical input!


This is why we need special techniques to convert categorical features into a numerical format that computers can understand, while preserving the meaning and relationships in our data.

By properly handling categorical features like department, education, and experience level, our model could learn that:
- Engineering employees typically earn more than Marketing employees
- PhD holders generally earn more than Bachelor's degree holders  
- Senior employees earn more than Junior employees


## 5. Label Encoding: The Naive Approach (And Why to Avoid It)

Before diving into proper encoding techniques, let's look at **label encoding**, an approach that seems intuitive but often creates more problems than it solves.

### What is Label Encoding?

Label encoding simply assigns a unique integer to each category:


In [6]:
from sklearn.preprocessing import LabelEncoder

# Apply label encoding to department
label_encoder = LabelEncoder()
data['department_label'] = label_encoder.fit_transform(data['department'])

# See the mapping
for i, dept in enumerate(label_encoder.classes_):
    print(f"{dept}: {i}")

print("\nLabel encoded departments:")
print(data[['department', 'department_label']].drop_duplicates())

Engineering: 0
Finance: 1
HR: 2
Marketing: 3
Sales: 4

Label encoded departments:
    department  department_label
0        Sales                 4
1  Engineering                 0
2    Marketing                 3
5           HR                 2
9      Finance                 1



### The Problem with Label Encoding

Label encoding creates **artificial mathematical relationships** that don't exist in reality.

In [7]:
# What label encoding implies:
print("What the algorithm 'thinks' about departments:")
print("Finance (1) = 2 × Engineering (0) + 1")
print("HR (2) = 2 × Finance (1)")
print("Sales (4) = 2 × HR (2)")
print("Sales (4) = Engineering (0) + HR (2) + Finance (1) + 1")

# This is completely meaningless for departments!

What the algorithm 'thinks' about departments:
Finance (1) = 2 × Engineering (0) + 1
HR (2) = 2 × Finance (1)
Sales (4) = 2 × HR (2)
Sales (4) = Engineering (0) + HR (2) + Finance (1) + 1


The algorithm now thinks:
- Sales (4) is "twice as much" as HR (2)
- HR (2) is "halfway between" Engineering (0) and Sales (4)
- We can "add" departments together mathematically
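
One way to see the distortion is to compare distances between departments under each representation. The sketch below is illustrative only: it rebuilds the alphabetical integer codes by hand and uses rows of an identity matrix as one-hot vectors. The helper functions `label_distance` and `one_hot_distance` are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

departments = ['Engineering', 'Finance', 'HR', 'Marketing', 'Sales']

# Integer codes in alphabetical order, matching the label encoding shown earlier
label_codes = {dept: code for code, dept in enumerate(departments)}

# One-hot vectors: row i of the identity matrix represents department i
one_hot = pd.DataFrame(np.eye(len(departments), dtype=int),
                       index=departments, columns=departments)

def label_distance(a, b):
    """Absolute gap between two departments' integer codes."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between two departments' one-hot vectors."""
    diff = one_hot.loc[a].to_numpy() - one_hot.loc[b].to_numpy()
    return float(np.linalg.norm(diff))

for other in ['Finance', 'Sales']:
    print(f"Engineering vs {other}: "
          f"label gap = {label_distance('Engineering', other)}, "
          f"one-hot gap = {one_hot_distance('Engineering', other):.3f}")
```

Under the integer codes, Engineering sits four times farther from Sales than from Finance, purely because of alphabetical order. Under one-hot vectors, every pair of departments is the same distance apart.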

### When Label Encoding Might Be Acceptable

Label encoding is **only appropriate for ordinal features** where we explicitly want a simple 1, 2, 3, 4... progression:

In [8]:
# Label encoding for education (ordinal) - still not ideal
education_label = LabelEncoder()
data['education_label'] = education_label.fit_transform(data['education'])

print("Education label encoding:")
for i, edu in enumerate(education_label.classes_):
    print(f"{edu}: {i}")

Education label encoding:
Bachelor: 0
High School: 1
Master: 2
PhD: 3


But even here the order is wrong, because `LabelEncoder` always sorts the categories (alphabetically, for strings):

- Bachelor: 0
- High School: 1  ← This should be lower than Bachelor!
- Master: 2
- PhD: 3

### Bottom Line on Label Encoding

**For nominal features (like department, city)**: We shouldn't use label encoding as it creates meaningless mathematical relationships.

**For ordinal features (like education, experience)**: Label encoding is rarely the best choice because:
- It doesn't guarantee the correct order
- It assumes equal spacing between categories

## 6. Handling Nominal Features

Now let's explore proper techniques for handling nominal features, i.e., categories with no natural order, like department and city.

### One-Hot Encoding

One-hot encoding is like giving each category its own *yes/no question*. Instead of having one column with multiple categories, we create separate columns for each category, using 1 for "yes" and 0 for "no."

#### Implementing One-Hot Encoding

Let's transform the `department` column:

In [23]:
# Method 1: Implementing one-hot encoding using pandas get_dummies
dept_encoded = pd.get_dummies(data['department'], prefix='dept', dtype=int)
print(dept_encoded.head())

   dept_Engineering  dept_Finance  dept_HR  dept_Marketing  dept_Sales
0                 0             0        0               0           1
1                 1             0        0               0           0
2                 0             0        0               1           0
3                 0             0        0               0           1
4                 1             0        0               0           0


In [24]:
# Method 2: Implementing one-hot encoding using sklearn (more control)
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, drop='first', dtype=int)  # drop='first' avoids redundancy
dept_encoded = encoder.fit_transform(data[['department']])

# Get feature names
feature_names = encoder.get_feature_names_out(['department'])
dept_df = pd.DataFrame(dept_encoded, columns=feature_names)
print(dept_df.head())

   department_Finance  department_HR  department_Marketing  department_Sales
0                   0              0                     0                 1
1                   0              0                     0                 0
2                   0              0                     1                 0
3                   0              0                     0                 1
4                   0              0                     0                 0


With the full five-column encoding from Method 1 (columns ordered Engineering, Finance, HR, Marketing, Sales), each employee has a clear numerical representation:
- Employee 1 (Sales): [0, 0, 0, 0, 1]
- Employee 2 (Engineering): [1, 0, 0, 0, 0]
- Employee 3 (Marketing): [0, 0, 0, 1, 0]

With `drop='first'` in Method 2, the `Engineering` column is dropped, so Engineering employees appear as all zeros across the remaining four columns.

#### When One-Hot Encoding Works Best
**Perfect scenarios for one-hot encoding:**
- **Low cardinality**: Our dataset has 5 departments (manageable)
- **All categories are important**: Each department affects salary differently
- **No natural relationship**: Departments are truly independent categories


### Target Encoding: For High-Cardinality Features

What happens when we have a feature like `city` with many unique values? Let's check our city data:

In [14]:
print(f"Number of unique cities: {data['city'].nunique()}")
print(f"Cities: {list(data['city'].unique())}")

# Check if we have enough data per city
city_counts = data['city'].value_counts()
print(f"\nEmployee count per city:")
print(city_counts)

Number of unique cities: 6
Cities: ['New York', 'Seattle', 'Miami', 'Denver', 'Chicago', 'Austin']

Employee count per city:
city
New York    2
Seattle     2
Miami       2
Denver      2
Chicago     1
Austin      1
Name: count, dtype: int64


In our current dataset, we have 6 cities with 1-2 employees each. In a real-world scenario, we might have:
- 100+ cities with employees spread across the country
- Some cities with thousands of employees, others with just one

**Problems with one-hot encoding high-cardinality features:**
- **Curse of dimensionality**: Too many features, not enough data per feature
- **Memory problems**: Dataset becomes huge and sparse (mostly zeros)
- **Overfitting**: Model might memorize categories instead of learning patterns

#### How Target Encoding Works

Target encoding is *excellent for high-cardinality features*. Instead of creating many columns, it replaces each category with a statistic related to the target variable (usually the mean).

For each category, we calculate the average salary. The example below uses `department`; the same steps apply to `city`:

In [15]:
# Target encoding for departments
dept_target_encoding = data.groupby('department')['salary'].mean()
print("Target encoding for departments:")
print(dept_target_encoding.sort_values(ascending=False))

# Apply the encoding
data['dept_target_encoded'] = data['department'].map(dept_target_encoding)

print("\nOriginal vs Target Encoded:")
comparison = data[['department', 'dept_target_encoded', 'salary']].head()
print(comparison)

Target encoding for departments:
department
Engineering    100000.000000
HR              75000.000000
Sales           58333.333333
Finance         48000.000000
Marketing       45000.000000
Name: salary, dtype: float64

Original vs Target Encoded:
    department  dept_target_encoded  salary
0        Sales         58333.333333   50000
1  Engineering        100000.000000   95000
2    Marketing         45000.000000   35000
3        Sales         58333.333333   65000
4  Engineering        100000.000000  120000


#### When to Use Target Encoding

**Perfect for:**
- **High-cardinality features**: More than about 10 unique categories (a rule of thumb, not a hard limit)
- **Strong relationship with target**: Categories clearly influence the outcome  
- **Regression problems**: Works especially well predicting continuous values

Target encoding can cause some issues as well if not done carefully.

**Be careful of:**
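- **Overfitting**: With few rows per category, the category mean mostly memorizes individual training targets
- **Data leakage**: The simple version above lets each row's own salary leak into its encoding; compute the encodings on training folds only (cross-validation or a holdout split) and apply them to the held-out rows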
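
One common way to address both points is out-of-fold encoding with smoothing. The sketch below is a minimal illustration under stated assumptions, not a reference implementation: the function name `out_of_fold_target_encode`, the `smoothing` weight of 2.0, and the choice of 5 folds are all assumptions made for this example. Each row is encoded using statistics from the other folds only, and small categories are pulled toward the overall mean.

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'department': ['Sales', 'Engineering', 'Marketing', 'Sales', 'Engineering',
                   'HR', 'Marketing', 'Engineering', 'Sales', 'Finance'],
    'salary': [50000, 95000, 35000, 65000, 120000, 75000, 55000, 85000, 60000, 48000],
})

def out_of_fold_target_encode(frame, column, target, n_splits=5, smoothing=2.0, seed=0):
    """Encode `column` with smoothed target means computed on the other folds only."""
    encoded = pd.Series(index=frame.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in kfold.split(frame):
        fit_part = frame.iloc[fit_idx]
        prior = fit_part[target].mean()  # overall mean of the fitting folds
        stats = fit_part.groupby(column)[target].agg(['mean', 'count'])
        # Weighted blend: categories with few rows stay close to the prior
        smoothed = (stats['count'] * stats['mean'] + smoothing * prior) / (stats['count'] + smoothing)
        # Categories absent from the fitting folds fall back to the prior
        values = frame.iloc[apply_idx][column].map(smoothed).fillna(prior)
        encoded.iloc[apply_idx] = values.to_numpy()
    return encoded

df['dept_oof_encoded'] = out_of_fold_target_encode(df, 'department', 'salary')
print(df)
```

Departments with a single employee (HR, Finance) never see their own salary here: they receive the mean of the other folds instead. Recent scikit-learn releases (1.3 and later) also provide `sklearn.preprocessing.TargetEncoder`, which applies a similar cross-fitting idea inside `fit_transform`.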

## 7. Handling Ordinal Features

Now let's tackle ordinal features, i.e., categories with a natural order, like `education` and `experience_level`. These require techniques that preserve the ranking information.

### Ordinal Encoding: Preserving Order

Ordinal encoding assigns consecutive integers to categories based on their natural order.

#### Using Dictionary Mapping for Encoding

In [18]:
# Define proper order for education
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
education_mapping = {level: i+1 for i, level in enumerate(education_order)}
print("Education mapping:", education_mapping)

# Apply ordinal encoding
data['education_encoded'] = data['education'].map(education_mapping)

# Same for experience level
experience_order = ['Entry', 'Junior', 'Mid', 'Senior']
experience_mapping = {level: i+1 for i, level in enumerate(experience_order)}
data['experience_encoded'] = data['experience_level'].map(experience_mapping)

print("\nOrdinal encoding results:")
sample = data[['education', 'education_encoded', 'experience_level', 'experience_encoded', 'salary']].head()
print(sample)

Education mapping: {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}

Ordinal encoding results:
     education  education_encoded experience_level  experience_encoded  salary
0     Bachelor                  2           Junior                   2   50000
1       Master                  3           Senior                   4   95000
2  High School                  1            Entry                   1   35000
3     Bachelor                  2              Mid                   3   65000
4          PhD                  4           Senior                   4  120000
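
One caveat with dictionary mapping is that `Series.map` silently returns `NaN` for any value missing from the dictionary, for example a misspelling or a new level. The short check below is a sketch; `'Associate'` is a hypothetical level that is not part of our dataset.

```python
import pandas as pd

education_mapping = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}

# 'Associate' is a made-up level that the mapping does not cover
education = pd.Series(['Bachelor', 'Master', 'Associate', 'PhD'])

encoded = education.map(education_mapping)
print(encoded.tolist())

# Surface unmapped values explicitly instead of letting NaN slip through
unmapped = education[encoded.isna()].unique()
if len(unmapped) > 0:
    print("Unmapped categories:", list(unmapped))
```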


#### Using Sklearn for Ordinal Encoding

In [27]:
from sklearn.preprocessing import OrdinalEncoder

# Create encoder
ord_encoder = OrdinalEncoder(categories=[
    ['High School', 'Bachelor', 'Master', 'PhD'],
    ['Entry', 'Junior', 'Mid', 'Senior']])


# Fit and transform
encoded_both = ord_encoder.fit_transform(data[['education', 'experience_level']])

# Add back to dataframe
data['education_ord_encoded'] = encoded_both[:, 0]  # First column
data['experience_ord_encoded'] = encoded_both[:, 1]  # Second column

print("\nSklearn ordinal encoding results:")
ord_encoded_sample = data[['education', 'education_ord_encoded', 'experience_level', 'experience_ord_encoded']].head()
print(ord_encoded_sample)


Sklearn ordinal encoding results:
     education  education_ord_encoded experience_level  experience_ord_encoded
0     Bachelor                    1.0           Junior                     1.0
1       Master                    2.0           Senior                     3.0
2  High School                    0.0            Entry                     0.0
3     Bachelor                    1.0              Mid                     2.0
4          PhD                    3.0           Senior                     3.0


**When to use ordinal encoding:**
- The target changes monotonically with the category order (higher level, higher target value)
- You want to preserve the natural progression
- Treating the gaps between consecutive categories as roughly equal is acceptable, since the integers are evenly spaced
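
These conditions can be checked against the data before committing to ordinal encoding. The sketch below rebuilds the `education` and `salary` columns so it runs on its own, then compares the mean salary per encoded level; `mean_by_level` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'education': ['Bachelor', 'Master', 'High School', 'Bachelor', 'PhD',
                  'Master', 'Bachelor', 'Master', 'Bachelor', 'Master'],
    'salary': [50000, 95000, 35000, 65000, 120000, 75000, 55000, 85000, 60000, 48000],
})

education_order = ['High School', 'Bachelor', 'Master', 'PhD']
education_mapping = {level: i + 1 for i, level in enumerate(education_order)}
df['education_encoded'] = df['education'].map(education_mapping)

# Mean target per encoded level, in rank order
mean_by_level = df.groupby('education_encoded')['salary'].mean()
print(mean_by_level)

# Does the target rise with each level, and by how much per step?
print("Monotonic increasing:", mean_by_level.is_monotonic_increasing)
print("Step sizes:", mean_by_level.diff().dropna().tolist())
```

On this small dataset the mean salary rises with every level, but the steps are uneven, which is a reminder that evenly spaced integers are only an approximation.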

## Summary
### Quick Reference Guide

| Feature Type | Cardinality | Relationship | Recommended Method | Why |
|--------------|-------------|--------------|-------------------|-----|
| Nominal | ≤10 categories | Any | One-hot encoding | Preserves all information, manageable size |
| Nominal | >10 categories | Strong target relationship | Target encoding | Reduces dimensionality, captures patterns |
| Nominal | >10 categories | No target available | Frequency encoding | Uses category popularity as proxy |
| Ordinal | Any cardinality | Monotonic | Ordinal encoding | Preserves natural order and relationship |
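
Frequency encoding appears in the table but has not been shown above, so here is a minimal sketch on the `city` column. It replaces each category with its share of rows; the variable names are illustrative.

```python
import pandas as pd

city = pd.Series(['New York', 'Seattle', 'Miami', 'Seattle', 'New York',
                  'Denver', 'Miami', 'Denver', 'Chicago', 'Austin'], name='city')

# Share of rows per city (value_counts with normalize=True sums to 1)
city_share = city.value_counts(normalize=True)

# Replace each city with its share of the dataset
city_freq_encoded = city.map(city_share)

print(pd.DataFrame({'city': city, 'city_freq_encoded': city_freq_encoded}).head())
```

Cities with the same frequency receive the same value (here New York, Seattle, Miami, and Denver all map to 0.2), so this encoding cannot tell them apart.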



### Key Takeaways

1. **Identify feature types correctly**: Distinguish between nominal (no order) and ordinal (natural order) features

2. **Avoid label encoding for nominal features**: It creates meaningless mathematical relationships

3. **Choose encoding wisely**:
   - **Nominal + Low cardinality (≤10)**: Use one-hot encoding
   - **Nominal + High cardinality (>10)**: Use target encoding (with cross-validation)
   - **Ordinal + Monotonic relationship**: Use ordinal encoding

4. **Always validate your approach**: Test different encoding methods and compare model performance

5. **Consider the business context**: Our domain knowledge often guides the best encoding strategy


**Remember:** properly handling categorical features often separates good data science projects from great ones. These features contain rich information about the real world and our job is to help machines understand that information as clearly as humans do.