# Cleaning and Treating Categorical Variables

In this notebook, we will explore how to handle categorical variables in a dataset. Machine learning algorithms typically require numerical input, so categorical variables need to be transformed into a numerical format. We will demonstrate two common encoding techniques:
- **Label Encoding**
- **One-Hot Encoding**

We will also cover handling missing values in categorical columns.

## Step 1: Import Required Libraries

We will use `numpy` and `pandas` for data manipulation, and `LabelEncoder` and `OneHotEncoder` from `sklearn.preprocessing` for encoding categorical variables.

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

## Step 2: Create the Dataset

We will create a dataset containing names, age, gender, and rank. Note that the `gender` column contains some missing values (`np.nan`).

In [None]:
data = {
    'names': ['steve', 'john', 'richard', 'sarah', 'randy', 'micheal', 'julie'],
    'age': [20, 22, 20, 21, 24, 23, 22],
    'gender': ['Male', 'Male', np.nan, 'Female', np.nan, 'Male', np.nan],
    'rank': [2, 1, 4, 5, 3, 7, 6]
}

df = pd.DataFrame(data)
df
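Before choosing a strategy for the missing values, it can help to measure how many there are. The snippet below is a minimal sketch that rebuilds only the `gender` column from the data above; the variable names `missing_count`, `missing_share`, and `category_counts` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same values as the notebook's `gender` column
gender = pd.Series(
    ['Male', 'Male', np.nan, 'Female', np.nan, 'Male', np.nan],
    name='gender',
)

missing_count = gender.isna().sum()    # number of missing entries
missing_share = gender.isna().mean()   # fraction of entries that are missing
category_counts = gender.value_counts(dropna=False)  # counts per category, NaN included

print(missing_count, round(missing_share, 2))
print(category_counts)
```

A column with a large share of missing values is a stronger candidate for being dropped than for being filled.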

## Step 3: Handle Missing Values

There are different strategies to handle missing values in categorical columns:
- Drop rows containing missing values (if few).
- Drop the entire column (if too many missing values).
- Replace missing values with the most frequent category.

In this example, we will drop the `gender` column: three of its seven values are missing, and filling them with the most frequent category would assign a gender to people for whom we have no information.

In [None]:
df = df.drop('gender', axis=1)
df
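For comparison, the other two strategies listed above could be sketched as follows. This is an illustration only, built on a separate hypothetical frame called `demo` so that it does not interfere with `df`; the notebook itself continues with the column dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of two columns from the notebook's data
demo = pd.DataFrame({
    'names': ['steve', 'john', 'richard', 'sarah', 'randy', 'micheal', 'julie'],
    'gender': ['Male', 'Male', np.nan, 'Female', np.nan, 'Male', np.nan],
})

# Strategy: drop only the rows whose gender is missing
rows_dropped = demo.dropna(subset=['gender'])

# Strategy: replace missing values with the most frequent category
# (Series.mode ignores NaN by default)
most_frequent = demo['gender'].mode()[0]
filled = demo.assign(gender=demo['gender'].fillna(most_frequent))

print(len(rows_dropped))
print(filled['gender'].value_counts())
```

Dropping rows keeps the column but loses three of the seven observations, while filling keeps every row at the cost of guessing the missing values.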

## Step 4: Label Encoding

Label Encoding converts categorical values into numerical values. Each unique category is assigned a number from 0 to N-1, where N is the number of unique categories.

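**Hint:** Label encoding is suitable for ordinal data, or for models where the ranking implied by the integer codes does not matter (for example, tree-based models). Note that `LabelEncoder` assigns codes in sorted (alphabetical) order rather than in any meaningful order, and scikit-learn intends it for target labels rather than input features.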
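When a column really is ordinal, the integer codes should follow the category order. One possible way to do this, shown here as a sketch, is scikit-learn's `OrdinalEncoder` with an explicit `categories` list. The `size` column and its values are hypothetical and do not appear in this notebook's dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column, not part of the notebook's dataset
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Pass the categories in their meaningful order: small < medium < large
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
size_codes = ordinal_encoder.fit_transform(sizes[['size']])

print(size_codes.ravel())
```

Without the explicit list, the categories would be ordered alphabetically, which would place `large` before `medium`.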

In [None]:
label_encoder = LabelEncoder()
label_encoder.fit(df['names'])

In [None]:
label_encoded_names = label_encoder.transform(df['names'])
label_encoded_names
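To see which number was assigned to which name, the fitted encoder's `classes_` attribute can be inspected, and `inverse_transform` maps codes back to the original values. The following is a self-contained sketch using the same names; `mapping`, `codes`, and `decoded` are illustrative variable names.

```python
from sklearn.preprocessing import LabelEncoder

names = ['steve', 'john', 'richard', 'sarah', 'randy', 'micheal', 'julie']

encoder = LabelEncoder()
codes = encoder.fit_transform(names)  # fit and transform in one step

# classes_ holds the categories in sorted order; the position is the code
mapping = {name: code for code, name in enumerate(encoder.classes_)}

# inverse_transform recovers the original strings from the codes
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(decoded))
```

Because the codes follow alphabetical order, `john` receives 0 and `steve` receives 6, which says nothing about any real ranking between the two.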

## Step 5: One-Hot Encoding

One-Hot Encoding creates a binary column for each category. Each row will have a `1` in the column corresponding to its category, and `0` elsewhere.

**Hint:** One-Hot Encoding is preferred when categorical variables are nominal (no implicit order).

In [None]:
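# sparse_output=False returns a dense NumPy array (requires scikit-learn >= 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)
# The encoder expects 2D input, hence the double brackets
onehot_encoder.fit(df[['names']])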

In [None]:
onehot_encoded_names = onehot_encoder.transform(df[['names']])

In [None]:
# categories_ is a list with one array per encoded column; take the array for 'names'
onehot_encoded_df = pd.DataFrame(onehot_encoded_names, columns=onehot_encoder.categories_[0])
onehot_encoded_df['names'] = df['names']
onehot_encoded_df

## Summary

1. We handled missing values by dropping the `gender` column.
2. We applied **Label Encoding** to convert names into numerical labels.
3. We applied **One-Hot Encoding** to create binary columns for each unique name.

These preprocessing steps make categorical data compatible with machine learning models.