<div class="span5 alert alert-success">
<h3>Preprocessing Categorical Data</h3>


### <font color='olive'><b>OrdinalEncoder</b></font> 

- Converts categorical data into integer codes. Suitable for ordinal data where the categories have a meaningful order (e.g., “low”, “medium”, “high”). Each category is assigned a unique integer. By default the integers follow the sorted (alphabetical) order of the category values, so pass the `categories` argument when the codes must follow the real ranking.
- Clean the data first: impute or drop missing values before encoding.


**Code** 
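```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Identify categorical columns
categorical_cols = data.select_dtypes(include=["object"]).columns
cat_df = data[categorical_cols]

# Fit the encoder and convert each category to an integer code
# (by default the codes follow the sorted order of the category values)
encoder = OrdinalEncoder()
encoded_data = encoder.fit_transform(cat_df)
print(encoded_data)

# Convert the encoded array back into a DataFrame
encoded_df = pd.DataFrame(encoded_data, index=data.index, columns=categorical_cols)

# Merge with the original DataFrame (after dropping the original categorical columns)
final_data = pd.concat([data.drop(columns=categorical_cols), encoded_df], axis=1)
print(final_data.head())
```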
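
With default settings, `OrdinalEncoder` numbers the categories in sorted order, which for strings is alphabetical and usually not the real ranking. The sketch below uses a hypothetical `size` column (not part of the original data) to show how the `categories` argument fixes the order explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with a natural ranking
sizes = pd.DataFrame({"size": ["medium", "low", "high", "low"]})

# Default: codes follow the sorted order of the strings (high=0, low=1, medium=2)
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit order: one list per column, from lowest to highest rank
ranked_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
ranked_codes = ranked_encoder.fit_transform(sizes).ravel()

print(default_codes)  # [2. 1. 0. 1.]
print(ranked_codes)   # [1. 0. 2. 0.]
```

Only the second encoding keeps “low” < “medium” < “high”, which matters for models that treat the codes as magnitudes.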

### <font color='olive'><b>OneHotEncoder</b></font> 
  
- Converts categorical data into a one-hot (binary) format. Suitable for nominal data where the categories do not have a meaningful order (e.g., “red”, “blue”, “green”). Each category is represented by a binary vector, with a 1 in the position corresponding to the category and 0s elsewhere.
- Clean the data first: impute or drop missing values before encoding.


**Code**

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Identify categorical columns
categorical_cols = data.select_dtypes(include=["object"]).columns
cat_df = data[categorical_cols]

# Apply One-Hot Encoding
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(cat_df)

# Convert encoded array back into a DataFrame (with meaningful column names)
encoded_df = pd.DataFrame(encoded_data, index=data.index, columns=encoder.get_feature_names_out(categorical_cols))

# Merge with original DataFrame (after dropping original categorical columns)
final_data = pd.concat([data.drop(columns=categorical_cols), encoded_df], axis=1)

print(final_data.head())

```

### <font color='olive'><b>Mask</b></font> 


If a categorical column has too many categories and some occur only rarely, group the rare ones into a single category. This keeps the encoded data compact and efficient.

**Code**

```python
# Create a series out of the column (copy so the assignment below is safe)
column_series = data['column'].copy()

# Get the counts of each category
column_counts = column_series.value_counts()

# Create a mask for categories that occur fewer than 10 times
mask = column_series.isin(column_counts[column_counts < 10].index)

# Relabel those rare categories as 'Other'
column_series[mask] = 'Other'

# Print the updated category counts
print(column_series.value_counts())

# Write the result back so the encoders see the reduced categories
data['column'] = column_series

```

Then encode the result with **OrdinalEncoder** or **OneHotEncoder**.
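
When several columns need the same treatment, the masking steps can be wrapped in a small helper. This is a sketch under assumptions: the function name `collapse_rare_categories`, its parameters, and the toy `cities` series are hypothetical and not part of the original notes.

```python
import pandas as pd

def collapse_rare_categories(series, min_count=10, other_label="Other"):
    """Return a copy of `series` with categories seen fewer than `min_count` times relabeled."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Series.where keeps values where the condition is True and replaces the rest
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy data: two frequent categories and two that appear once
cities = pd.Series(["Paris"] * 3 + ["Lima"] * 2 + ["Oslo", "Quito"])
collapsed = collapse_rare_categories(cities, min_count=2)
print(collapsed.value_counts())
```

Because `Series.where` returns a new series, the original column is left untouched until you assign the result back.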




### <font color='olive'><b>Clean and Transform Categorical Data at Once With Column Transformer</b></font> 

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Columns with skewed distributions - will use median 
skewed_columns = ["Adult Mortality", "infant deaths", "Alcohol", "Measles", "under-five deaths",
                  "Total expenditure", "HIV/AIDS", "thinness 5-9 years", 
                  "Hepatitis B", "Polio", "Diphtheria", "GDP", "Population", "thinness  1-19 years", "BMI", 
                  "Income composition of resources","life_expc_cat", "Life expectancy", 
                  'Alchohol_per_capita', 'GDP_per_Capita', 'Social_devel_in' ]

# Columns with normal distributions - use mean
normal_columns = ["Schooling"]

# Categorical columns - one-hot encoded (no imputer is applied to them here)
categorical_columns = ['Country', 'Status']

# Convert categorical features to strings before encoding
# (must come after categorical_columns is defined; missing values become the string 'nan')
data[categorical_columns] = data[categorical_columns].astype(str)

preprocessor = ColumnTransformer(
    transformers=[
        ('median_imputer', SimpleImputer(strategy='median'), skewed_columns),
        ('mean_imputer', SimpleImputer(strategy='mean'), normal_columns),
        ('onehot', OneHotEncoder(drop='first', sparse_output=False), categorical_columns)
    ],
    remainder='passthrough'  # Keep the remaining columns as they are
)
```
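
If the categorical columns should be imputed with the most frequent value before encoding, one option is to chain an imputer and an encoder in a `Pipeline` and pass that pipeline to the `ColumnTransformer`. The sketch below assumes a hypothetical two-column toy frame; the column names echo the ones above, but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with gaps in a numeric and a categorical column
toy = pd.DataFrame({
    "GDP": [1.0, np.nan, 3.0, 100.0],
    "Status": ["Developing", np.nan, "Developed", "Developing"],
})

# Impute the most frequent category first, then one-hot encode
categorical_pipeline = Pipeline(steps=[
    ("most_frequent", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("median_imputer", SimpleImputer(strategy="median"), ["GDP"]),
        ("categorical", categorical_pipeline, ["Status"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = preprocessor.fit_transform(toy)
print(transformed)
```

The second row shows both steps at work: the missing GDP becomes the median (3.0) and the missing Status is filled with “Developing” before encoding.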


### <font color='olive'><b>Binarizing Columns</b></font> 

- Binarizing columns means converting numerical or categorical values into binary (0 or 1) format based on a condition. This is useful when you only care about whether a condition is met, rather than the actual value.

```python
# Step 1: Create a new binary column filled with zeros
df["New_Binary_Column"] = 0

# Step 2: Set the flag to 1 where the condition holds
# threshold_value is a placeholder: define it based on your needs.
df.loc[df["Target_Column"] > threshold_value, "New_Binary_Column"] = 1


# Step 3 (related technique): bin a continuous variable into num_bins equal-width intervals
# num_bins is a placeholder integer that you choose.
df["Binned_Column"] = pd.cut(df["Continuous_Column"], num_bins)

# Step 4: Print the first few rows to verify
print(df[["New_Binary_Column", "Target_Column", "Binned_Column", "Continuous_Column"]].head())
```
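
The same two steps can be written more compactly. This is a sketch on hypothetical data: the `Alcohol` values, the threshold of 5.0, and the bin edges are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values and cut-off
df = pd.DataFrame({"Alcohol": [0.5, 4.2, 9.8, 12.1]})
threshold_value = 5.0

# Binarize in one step: 1 where the condition holds, 0 otherwise
df["High_Alcohol"] = (df["Alcohol"] > threshold_value).astype(int)

# Bin into labeled ranges with explicit edges: (0, 4], (4, 8], (8, inf)
df["Alcohol_Level"] = pd.cut(
    df["Alcohol"],
    bins=[0, 4, 8, np.inf],
    labels=["low", "medium", "high"],
)
print(df)
```

Explicit bin edges with labels are easier to interpret than the automatic equal-width intervals that `pd.cut` builds from an integer.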

### <font color='olive'><b>Numerical Categorical Data</b></font> 




### <font color='olive'><b>Target Labels in a Multi-Class Classification Setting (Keras/TensorFlow)</b></font> 

Converts categorical class labels (like "glioma", "no tumor", etc.) into a one-hot format usable by a neural network. Suitable for multi-class classification tasks in Keras/TensorFlow. Apply it only to the target labels (`y`), not to the features.


```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Step 1: Integer-encode class labels
# Fit on the training labels only, then reuse the same mapping for the test labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Step 2: One-hot encode the integer labels
# Derive the class count from the encoder instead of hard-coding it
num_classes = len(label_encoder.classes_)
y_train_cat = to_categorical(y_train_encoded, num_classes=num_classes)
y_test_cat = to_categorical(y_test_encoded, num_classes=num_classes)
```
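
To make the two steps concrete without installing TensorFlow, the sketch below reproduces the one-hot step with plain NumPy. It is an illustration of the idea, not the Keras implementation; the label array is hypothetical ("glioma" and "no tumor" come from the text, "other" is a made-up third class).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration
y_train = np.array(["no tumor", "glioma", "other", "glioma"])

# Step 1: integer-encode (classes are numbered in sorted order)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_train)
num_classes = len(label_encoder.classes_)

# Step 2: row i of the identity matrix is the one-hot vector for class i
y_onehot = np.eye(num_classes)[y_encoded]

# Round trip: argmax recovers the integer code, inverse_transform the name
decoded = label_encoder.inverse_transform(np.argmax(y_onehot, axis=1))

print(y_encoded)  # [1 0 2 0]
print(y_onehot)
print(decoded)
```

The round trip at the end is the same mechanism used to turn a network's output probabilities back into class names.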


### <font color='olive'><b>Optional: Decode predictions later</b></font> 

```python
# Predict class probabilities, shape (n_samples, n_classes)
predicted_probs = model.predict(X_test_rgb)

# Take the most likely class index per sample and map it back to the class name
predicted_labels = label_encoder.inverse_transform(np.argmax(predicted_probs, axis=1))
```