# <span style="color:darkblue;">[LDATS2350] - DATA MINING</span>

### <span style="color:darkred;">Python04 - Convert Categorical Values</span>

**Prof. Robin Van Oirbeek**  

<br/>

**<span style="color:darkgreen;">Guillaume Deside</span>** (<span style="color:gray;">guillaume.deside@uclouvain.be</span>)

---

## **Goal of This Session**

In data mining and machine learning, datasets often contain **categorical variables**, which represent discrete categories or labels. However, most machine learning algorithms require **numerical inputs**, meaning these categorical variables must be converted into a numerical format before use.  

This session focuses on:
- Understanding the importance of converting categorical variables.
- Learning different techniques to transform categorical data into numerical representations.
- Applying these techniques to prepare datasets for predictive modeling.

---

### **Why Convert Categorical Values?**

1. **Machine Learning Algorithms Require Numerical Data**:
   - Algorithms like **linear regression**, **logistic regression**, and **support vector machines** require numerical inputs for computation.

2. **Improve Model Performance**:
   - A suitable encoding helps preserve the relationships between features and the target variable. A poor one, such as arbitrary integer codes for unordered categories, can suggest an ordering that does not exist.

3. **Enable Compatibility**:
   - Encoding categorical data ensures compatibility with numerical-based operations like distance calculations, correlations, and more.
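
As a small illustration of the compatibility point above, the sketch below shows that a distance cannot be computed on raw string labels but can be once they are one-hot encoded. The toy values (`"pizza"`, `"sushi"`) are assumptions made up for the example, not taken from the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the category labels are illustrative only.
foods = pd.Series(["pizza", "sushi", "pizza"], name="type_of_food")

# Subtracting raw string labels is undefined, so no distance can be computed.
try:
    foods.iloc[0] - foods.iloc[1]
    raw_distance_ok = True
except TypeError:
    raw_distance_ok = False

# After one-hot encoding, each row becomes a numeric vector.
encoded = pd.get_dummies(foods, dtype=float).to_numpy()

# Euclidean distance between two rows with different categories.
dist_different = np.linalg.norm(encoded[0] - encoded[1])
# Euclidean distance between two rows with the same category.
dist_same = np.linalg.norm(encoded[0] - encoded[2])

print(raw_distance_ok, dist_different, dist_same)
```

With one-hot vectors, any two different categories are equally far apart, which is often a reasonable default for unordered labels.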

---


# Get_dummies

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise - One-Hot Encoding Categorical Features**

#### **Objective**
Learn how to convert **categorical features** into **numerical representations** using **one-hot encoding** with Pandas' `get_dummies()` method. 

---

#### **Instructions**

1. **Load the Dataset**:
   - Use `pandas` to read the dataset `data_example.csv` containing a column named `"type_of_food"`.

2. **Apply One-Hot Encoding**:
   - Use the `pd.get_dummies()` function to transform the `"type_of_food"` column into binary indicator variables.
   - Add the new encoded variables to the DataFrame.

3. **Test `drop_first` Parameter**:
   - Create one-hot encodings **with** and **without** the `drop_first=True` parameter.
   - Observe how the number of resulting columns changes when `drop_first=True` is applied.

4. **Compare Outputs**:
   - Compare the DataFrame before and after applying one-hot encoding.
   - Note the difference in the number of columns with and without `drop_first=True`.

</div>
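
One possible way to work through these steps is sketched below. Since `data_example.csv` is not available here, a small hand-made DataFrame stands in for it; the `price` column and the food labels are assumptions for illustration only. In the real exercise, `df` would come from `pd.read_csv("data_example.csv")`.

```python
import pandas as pd

# Stand-in for pd.read_csv("data_example.csv"); all values are hypothetical.
df = pd.DataFrame({
    "price": [8.5, 12.0, 6.0, 9.5],
    "type_of_food": ["pizza", "sushi", "burger", "pizza"],
})

# One indicator column per category: 3 categories give 3 columns.
dummies_full = pd.get_dummies(df["type_of_food"], prefix="type_of_food")

# drop_first=True drops the first category (in sorted order), leaving k - 1 columns.
dummies_reduced = pd.get_dummies(
    df["type_of_food"], prefix="type_of_food", drop_first=True
)

# Add the encoded variables to the original DataFrame.
df_full = pd.concat([df, dummies_full], axis=1)
df_reduced = pd.concat([df, dummies_reduced], axis=1)

# Compare the shapes before and after encoding.
print(df.shape, df_full.shape, df_reduced.shape)
print(list(dummies_full.columns))
print(list(dummies_reduced.columns))
```

With `drop_first=True`, the dropped category is still recoverable: it is the case where all remaining indicator columns are 0. This is commonly used to avoid perfectly collinear columns in linear models.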


# sklearn OneHotEncoder

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise - One-Hot Encoding with Scikit-Learn's `OneHotEncoder`**

#### **Objective**
Learn how to convert **categorical variables** into **numerical representations** using Scikit-Learn's `OneHotEncoder`. This exercise will also demonstrate how to handle **unknown categories** and create a new DataFrame with encoded features.

---

#### **Instructions**

1. **Load the Dataset**:
   - Use `pandas` to load a dataset `data_example.csv` containing a column named `"type_of_food"`.

2. **Initialize the Encoder**:
   - Use Scikit-Learn's `OneHotEncoder` with the parameter `handle_unknown='ignore'` to avoid errors from unseen categories during encoding.

3. **Fit the Encoder**:
   - Fit the encoder to the `"type_of_food"` column in the DataFrame.

4. **Check Encoded Categories**:
   - Use the `.categories_` attribute to inspect the categories identified by the encoder.

5. **Transform the Data**:
   - Apply the encoder using the `.transform()` method and convert the output, which is a sparse matrix by default, to a dense NumPy array (for example with `.toarray()`).

6. **Create a New Encoded DataFrame**:
   - Convert the encoded array into a new Pandas DataFrame.
   - Use the categories from the encoder as column names for the DataFrame.

7. **Combine the Encoded Data**:
   - Combine the new DataFrame of encoded features with the original dataset column-wise, so that each row keeps its original values alongside its encoding.

</div>
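
The steps above could be sketched as follows. As `data_example.csv` is not available here, a small hand-made DataFrame is used in its place; the food labels, including the unseen `"tacos"` category, are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Stand-in for pd.read_csv("data_example.csv"); all values are hypothetical.
df = pd.DataFrame({"type_of_food": ["pizza", "sushi", "burger", "pizza"]})

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")

# The encoder expects 2D input, hence the double brackets.
encoder.fit(df[["type_of_food"]])

# One array of categories per fitted column; only one column was fitted here.
categories = encoder.categories_[0]

# transform() returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.transform(df[["type_of_food"]]).toarray()

# Build a DataFrame with the categories as column names, aligned on the original index.
encoded_df = pd.DataFrame(encoded, columns=categories, index=df.index)

# Combine the encoded features with the original data column-wise.
df_encoded = pd.concat([df, encoded_df], axis=1)

# A category not seen during fit is encoded as a row of zeros.
unseen = pd.DataFrame({"type_of_food": ["tacos"]})
unseen_row = encoder.transform(unseen).toarray()

print(list(categories))
print(df_encoded)
print(unseen_row)
```

Unlike `pd.get_dummies()`, a fitted `OneHotEncoder` remembers its categories, so the same columns are produced when it is applied later to new data.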
