
# 👩‍💻 Data Preparation: Encoding Categorical Variables for Salary Prediction

## 📋 Overview
In this lab, you'll learn how to transform categorical data into numerical form, a crucial step in preparing data for machine learning models. You'll work with a dataset containing information about individuals' salaries, education levels, and occupations. By choosing an encoding technique that fits each categorical variable, you'll prepare the data for a salary prediction model without introducing relationships (such as an artificial ordering) that the data does not contain.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- Identify which categorical variables need transformation and select appropriate encoding techniques

- Implement One-Hot Encoding for nominal categorical variables using scikit-learn

- Apply Label Encoding for ordinal categorical variables

- Integrate encoded variables back into a dataset for machine learning preparation

## 🚀 Starting Point


In [None]:
# Starter code - imports and data loading
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Load the dataset
data = pd.read_csv('Salary Data.csv')

## Task 1: Dataset Exploration
**Context:** Before we can process categorical data, we need to understand the structure of our dataset and identify which variables need encoding.

**Steps:**

1. Use the `head()` method to view the first few rows of the dataset
   - This will help you identify the columns present in the dataset
   - Look for categorical variables like job titles and education levels

2. Use the `describe()` method to get summary statistics for numerical columns
   - Pay attention to which columns are included or excluded from this summary
   - Columns not included are likely categorical and will need encoding

3. Examine the unique values in each categorical column
   - Use `data['column_name'].unique()` to see the distinct categories
   - Note whether categories have a natural order (ordinal) or not (nominal)


In [None]:
# Your code for dataset exploration

💡 **Tip:** Pay special attention to the data types returned by `data.info()` - columns with 'object' or 'string' types typically contain categorical data.
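
The idea in this tip can be sketched in code. The example below is a minimal illustration on a small, hypothetical DataFrame (the column names and values are assumptions, not the contents of `Salary Data.csv`). It lists the non-numeric columns and counts the distinct categories in each.

```python
import pandas as pd

# Hypothetical toy data; the real 'Salary Data.csv' may have different columns.
toy = pd.DataFrame({
    "Age": [32, 28, 45, 36],
    "Education Level": ["Bachelor's", "Master's", "PhD", "Bachelor's"],
    "Job Title": ["Data Analyst", "Software Engineer", "Director", "Data Analyst"],
    "Salary": [90000.0, 65000.0, 150000.0, 60000.0],
})

# Non-numeric columns are the usual candidates for encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Count the distinct categories in each candidate column.
cardinality = {col: toy[col].nunique() for col in categorical_cols}

print(categorical_cols)
print(cardinality)
```

A high category count is an early warning that One-Hot Encoding will add many columns.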

⚙️ **Test Your Work:**

- Your output should show that the dataset contains categorical columns including 'Job Title' and 'Education Level'
- You should be able to identify which columns need encoding for machine learning use


## Task 2: Select Encoding Techniques
**Context:** Different types of categorical data require different encoding approaches. In this task, you'll determine which encoding technique is best for each categorical variable.

**Steps:**

1. Identify nominal categorical variables (no inherent order)

   - Look at 'Job Title' - does one occupation inherently rank above another?
   - Consider whether One-Hot Encoding would be appropriate for this variable

2. Identify ordinal categorical variables (have a natural order)

   - Examine 'Education Level' - do the values have a natural progression?
   - Consider whether Label Encoding would preserve the ordinal relationship


In [None]:
# Your code for identifying variable types and selecting encoding techniques

💡 **Tip:** One-Hot Encoding creates one new binary column per category, while Label Encoding replaces each category with a single integer in one column. Choose based on whether the categories have a meaningful order: integers imply an order, binary columns do not.
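
To make the contrast concrete, here is a small sketch on a hypothetical column of job titles. It uses pandas helpers (`get_dummies` and category codes) as stand-ins for the scikit-learn encoders used later in the lab; the values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical values, for illustration only.
titles = pd.Series(["Engineer", "Analyst", "Manager", "Analyst"], name="Job Title")

# One-hot style: one binary column per category, no order implied.
one_hot = pd.get_dummies(titles, dtype=int)

# Integer-code style: a single column whose numbers follow sorted category order,
# which a model may read as Analyst < Engineer < Manager.
codes = titles.astype("category").cat.codes

print(one_hot)
print(codes.tolist())
```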

⚙️ **Test Your Work:**

- You should have clearly identified 'Job Title' as a nominal variable suitable for One-Hot Encoding
- You should have identified 'Education Level' as an ordinal variable suitable for Label Encoding


## Task 3: Apply One-Hot Encoding
**Context:** One-Hot Encoding transforms a categorical variable into multiple binary columns, one for each category. This prevents the model from assuming ordinal relationships between categories.

**Steps:**

1. Import and initialize the OneHotEncoder

   - Set `sparse_output=False` to get a dense array rather than a sparse matrix (before scikit-learn 1.2 this parameter was named `sparse`)

2. Apply the encoder to the 'Job Title' column

   - Call `fit_transform()` on a 2D selection of the column
   - OneHotEncoder expects 2D input, so select with double brackets (`[['column_name']]`), which returns a one-column DataFrame rather than a Series

3. Convert the encoded array to a DataFrame with appropriate column names

   - Use `encoder.get_feature_names_out()` to generate the column names
   - Create a DataFrame with these names and the encoded values

In [None]:
# Your code for One-Hot Encoding

💡 **Tip:** Check the shape of your output to ensure you have the expected number of new columns (one for each unique job title).

⚙️ **Test Your Work:**

- Your output should include new binary columns for each unique job title
- Each row should have a 1 in exactly one of the job title columns and 0s elsewhere


## Task 4: Apply Label Encoding
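**Context:** Label Encoding converts each category into a single integer. scikit-learn's `LabelEncoder` assigns those integers in sorted (alphabetical) order, so the ordinal relationship between categories is preserved only when the alphabetical order happens to match the logical one. Always check the resulting mapping.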

**Steps:**

1. Import and initialize the LabelEncoder

2. Apply the encoder to the 'Education Level' column

   - Use `fit_transform()` directly on the column
   - Add the result as a new column in your DataFrame

3. Confirm the mapping between original values and encoded values

   - Use `label_encoder.classes_` to see the original categories
   - The index of each category in this array corresponds to its encoded value
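
As a rough sketch of these steps, the example below encodes a hypothetical education column and rebuilds the category-to-integer mapping from `classes_`. The category names are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the 'Education Level' column.
toy = pd.DataFrame({"Education Level": ["Master's", "Bachelor's", "PhD", "Bachelor's"]})

label_encoder = LabelEncoder()
toy["Education_Level_Encoded"] = label_encoder.fit_transform(toy["Education Level"])

# classes_ is sorted, and each category's position is its encoded value.
mapping = {str(category): code for code, category in enumerate(label_encoder.classes_)}

print(mapping)
print(toy)
```

In this toy example the sorted order happens to match the logical progression; that is a coincidence of the category names, not a guarantee.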


In [None]:
# Your code for Label Encoding

💡 **Tip:** If the order of your categories matters (e.g., 'High School' < 'Bachelor' < 'Master'), you may need to manually specify this order rather than letting LabelEncoder decide automatically.
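
One way to specify the order manually is a plain mapping dictionary, sketched below. The dictionary name `education_order` and the category values are hypothetical; adapt them to the values you actually find in the dataset.

```python
import pandas as pd

# Hypothetical categories; alphabetical order would put 'Bachelor' before 'High School'.
education = pd.Series(["Master", "High School", "Bachelor", "High School"])

# Explicit mapping that encodes the logical progression.
education_order = {"High School": 0, "Bachelor": 1, "Master": 2}

encoded = education.map(education_order)

# Categories missing from the mapping become NaN, so check for them.
unmapped = int(encoded.isna().sum())

print(encoded.tolist())
print(unmapped)
```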

⚙️ **Test Your Work:**

- You should have a new column like 'Education_Level_Encoded' containing numeric values
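- Higher education levels should have higher encoded values; if they do not, the alphabetical order used by LabelEncoder does not match the logical order and you need an explicit mapping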


## Task 5: Prepare Data for Modeling
**Context:** Now that we've encoded our categorical variables, we need to integrate them into a final dataset ready for machine learning.

**Steps:**

1. Drop the original categorical columns from the dataset

   - These are no longer needed since we have encoded versions

2. Concatenate the original numeric columns with the encoded categorical columns

   - Use `pd.concat()` to join DataFrames along the column axis (axis=1)
   - `pd.concat()` aligns rows by index, so make sure both DataFrames share the same index

3. Verify that the final dataset contains all necessary features in appropriate format

   - Check that all categorical data has been properly encoded
   - Ensure there are no missing values
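
A minimal sketch of these steps on hypothetical data follows. For brevity it uses `pd.get_dummies` as a stand-in for the OneHotEncoder output from Task 3; the column names and values are assumptions.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Age": [32, 28, 45],
    "Job Title": ["Analyst", "Engineer", "Analyst"],
    "Salary": [90000, 65000, 150000],
})

# Stand-in for the one-hot encoded DataFrame produced in Task 3.
job_dummies = pd.get_dummies(toy["Job Title"], prefix="Job Title", dtype=int)

# Drop the original text column, then join on the shared index.
final = pd.concat([toy.drop(columns=["Job Title"]), job_dummies], axis=1)

# Verification: same rows, no text columns left, no missing values.
remaining_text_cols = final.select_dtypes(exclude="number").columns.tolist()
missing_total = int(final.isna().sum().sum())

print(final.shape, remaining_text_cols, missing_total)
```

If `missing_total` is greater than zero after a concat, mismatched indexes are a likely cause.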


In [None]:
# Your code for integrating encoded data

💡 **Tip:** Always check your final dataset shape to confirm you have the expected number of rows and columns after joining.

⚙️ **Test Your Work:**

- Your final DataFrame should no longer contain string-based categorical columns
- The row count should match your original dataset
- The column count should be greater than the original due to one-hot encoding


## ✅ Success Checklist

- Successfully loaded and explored the salary dataset
- Identified which variables require encoding and selected appropriate techniques
- Applied One-Hot Encoding to the 'Job Title' variable
- Applied Label Encoding to the 'Education Level' variable
- Created a final dataset with all categorical variables properly encoded
- Documented the reasoning behind encoding choices
- Program runs without errors

## 🔍 Common Issues & Solutions

**Problem:** One-Hot Encoding creates too many columns when there are many unique categories.

**Solution:** Consider grouping similar categories together before encoding, or use dimensionality reduction techniques afterward.
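
Grouping can be as simple as folding infrequent categories into one bucket before encoding. The sketch below assumes a hypothetical threshold `min_count` and made-up job titles; the right threshold depends on your data.

```python
import pandas as pd

# Hypothetical job titles with two rare categories.
titles = pd.Series(
    ["Analyst"] * 3 + ["Engineer"] * 2 + ["Chief Data Officer", "Intern"]
)

min_count = 2  # hypothetical threshold for keeping a category
counts = titles.value_counts()
rare = counts[counts < min_count].index

# Keep frequent titles as they are and replace rare ones with 'Other'.
grouped = titles.where(~titles.isin(rare), "Other")

print(grouped.value_counts())
```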

**Problem:** LabelEncoder assigns numbers based on alphabetical order, which may not match the logical ordering.

**Solution:** Create a custom mapping dictionary to assign numbers in the correct order, or use OrdinalEncoder with specified category orders.
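
The OrdinalEncoder option can be sketched as follows, using hypothetical category names. The `categories` argument takes one ordered list per input column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data.
toy = pd.DataFrame(
    {"Education Level": ["Master", "High School", "Bachelor", "High School"]}
)

# Logical order, lowest to highest (an assumption for this example).
order = ["High School", "Bachelor", "Master"]

# One list of categories per column, hence the outer list.
ordinal_encoder = OrdinalEncoder(categories=[order])

# The encoder returns a 2D float array; ravel() flattens the single column.
toy["Education_Level_Encoded"] = ordinal_encoder.fit_transform(
    toy[["Education Level"]]
).ravel()

print(toy)
```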

**Problem:** Missing values in categorical columns cause errors during encoding.

**Solution:** Handle missing values before encoding, either by filling them in or creating a separate category for missing values.
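
Both options are sketched below on a hypothetical column. Which one is appropriate depends on why the values are missing; the placeholder label `"Missing"` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
education = pd.Series(["Bachelor", "Bachelor", np.nan, "Master", None])

# Option 1: fill with the most frequent category.
most_frequent = education.mode().iloc[0]
filled_mode = education.fillna(most_frequent)

# Option 2: treat missingness as its own category.
filled_flag = education.fillna("Missing")

print(filled_mode.tolist())
print(filled_flag.tolist())
```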

## 🔑 Key Points
- The choice of encoding technique depends on the nature of the categorical variable and its relationship to the target variable
- One-Hot Encoding preserves the lack of order in nominal variables but increases dimensionality
- Label Encoding is efficient for ordinal variables but can mislead models with non-ordinal data
- Proper encoding is essential for machine learning models to effectively use categorical information


## 💻 Exemplar Solution
After completing this activity (or if you get stuck!), take a moment to review the exemplar solution and reflect on what it can teach you about different techniques and approaches.
Remember, multiple solutions can exist for some problems; the goal is to learn and grow as a programmer by comparing the exemplar with your own approach.


<details>

<summary><strong>Click HERE to see an exemplar solution</strong></summary>    
    
```python
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Load the dataset
data = pd.read_csv('Salary Data.csv')

# Task 1: Dataset Exploration
print(data.head())
print(data.describe())
print(data['Job Title'].unique())
print(data['Education Level'].unique())

# Task 2: Select Encoding Technique
# (This is a decision point based on analysis)

# Task 3: Apply One-Hot Encoding
# sparse_output replaced the older `sparse` argument in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
occupation_encoded = encoder.fit_transform(data[['Job Title']])
occupation_encoded_df = pd.DataFrame(occupation_encoded, 
                                    columns=encoder.get_feature_names_out(['Job Title']))

# Task 4: Apply Label Encoding
label_encoder = LabelEncoder()
data['Education_Level_Encoded'] = label_encoder.fit_transform(data['Education Level'])
print("Original education levels:", label_encoder.classes_)
print("Encoded education levels:", {edu: idx for idx, edu in enumerate(label_encoder.classes_)})

# Task 5: Prepare Data for Modeling
data_encoded = pd.concat([data.drop(['Job Title', 'Education Level'], axis=1), 
                         occupation_encoded_df], axis=1)
print(data_encoded.head())

# Reflection
# Note: provide your own reflection on how encoding choices
# impact model development and performance
    
       
```     

</details>