# Assignment (20th March): Feature Engineering - 4

### Q1. What is data encoding? How is it useful in data science?

**ANS:** `Data encoding` is the process of converting categorical or textual data into numerical format so that it can be used by machine learning algorithms, which typically require numerical input. 

**`Usefulness in Data Science`**:

1. **Algorithm Compatibility**: Many machine learning algorithms require numerical input. Encoding ensures categorical data can be used with these algorithms.
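2. **Model Performance**: An encoding that matches the data (for example, one-hot for unordered categories and integer codes for ordered ones) avoids feeding the model spurious relationships, which can improve accuracy.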
3. **Feature Engineering**: Encoded data can be used to create new features, enhancing model prediction capabilities.
4. **Data Interpretation**: Helps in converting complex categorical data into a format that can be more easily analyzed and interpreted. 

Overall, data encoding is a critical step in data preprocessing that makes it possible to apply machine learning techniques to a wide range of datasets.

### Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

**ANS:** `Nominal Encoding` is a method of converting categorical data without any inherent order into numerical format, often using one-hot encoding.

**`Example`**: In a customer database for an e-commerce site, if the "Preferred Payment Method" column contains categories like 'Credit Card', 'PayPal', 'Bank Transfer', and 'Bitcoin', you would use one-hot encoding to convert these categories into binary vectors:

- **Original**: ['Credit Card', 'PayPal', 'Bank Transfer', 'Bitcoin']
- **One-Hot Encoded**: 
  - 'Credit Card' -> [1, 0, 0, 0]
  - 'PayPal' -> [0, 1, 0, 0]
  - 'Bank Transfer' -> [0, 0, 1, 0]
  - 'Bitcoin' -> [0, 0, 0, 1]

This ensures the data can be used in machine learning models.
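
The mapping above can be sketched with pandas. This is a minimal illustration, assuming a hypothetical `payment_method` column and a handful of made-up rows; `pd.get_dummies` orders the resulting columns alphabetically, so the column order differs from the listing above while the encoding itself is the same.

```python
import pandas as pd

# Hypothetical sample of the "Preferred Payment Method" column
df = pd.DataFrame({
    "payment_method": ["Credit Card", "PayPal", "Bank Transfer", "Bitcoin", "PayPal"]
})

# One binary column per category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(df, columns=["payment_method"], dtype=int)
print(encoded)
```

Each row contains exactly one 1, in the column that matches its original category.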

### Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

**ANS:** Taking nominal encoding here to mean `Label Encoding` (one integer per category), it is preferred over one-hot encoding when the categorical feature has a large number of unique values, because one-hot encoding would then produce a very wide, sparse matrix.

**`Example`**:
In a dataset with the "City" column containing 1000 unique city names, using one-hot encoding would create 1000 binary columns, leading to a sparse and high-dimensional dataset. Instead, nominal encoding (label encoding) would assign a unique integer to each city:

- **Original**: ['New York', 'Los Angeles', 'Chicago', ..., 'Miami']
- **Label Encoded**: [1, 2, 3, ..., 1000]

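This keeps the feature space to a single column and reduces memory and computation. The integers carry no real order, however, so label encoding works best with models that do not treat the codes as magnitudes, such as tree-based models.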
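
A minimal sketch of label encoding with `pd.factorize`, using a hypothetical handful of cities in place of the full 1000. pandas numbers the codes from 0 in order of first appearance, whereas the listing above starts at 1; either convention works as long as it is applied consistently.

```python
import pandas as pd

# Hypothetical sample of the "City" column (repeats included on purpose)
cities = pd.Series(["New York", "Los Angeles", "Chicago", "Miami", "Chicago", "New York"])

# codes: one integer per row; uniques: the category behind each integer
codes, uniques = pd.factorize(cities)

print(codes)
print(list(uniques))
```

The feature stays a single column regardless of how many distinct cities appear.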

### Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

**ANS:** We would use `One-Hot Encoding` to transform this data, assuming the categories have no inherent order, for the following reasons:

- With only 5 unique values, one-hot encoding will create a manageable number of binary columns (5 columns).
- One-hot encoding prevents the machine learning algorithm from assuming any ordinal relationship between the categories, which is important for ensuring accurate model training when the categorical data has no inherent order.

**`Example`**:
For a categorical feature with values ['A', 'B', 'C', 'D', 'E']:
- **One-Hot Encoded**:
  - 'A' -> [1, 0, 0, 0, 0]
  - 'B' -> [0, 1, 0, 0, 0]
  - 'C' -> [0, 0, 1, 0, 0]
  - 'D' -> [0, 0, 0, 1, 0]
  - 'E' -> [0, 0, 0, 0, 1]

This encoding method is `simple`, `effective`, and ensures the categorical `data is in a suitable format` for machine learning algorithms.
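
To show the mechanism rather than a library call, the sketch below builds the same vectors by selecting rows of a 5x5 identity matrix with NumPy. The `values` list is a made-up sample, and the category order is fixed by hand to match the listing above.

```python
import numpy as np

categories = ["A", "B", "C", "D", "E"]
# Position of each category determines which slot holds the 1
index = {cat: i for i, cat in enumerate(categories)}

# Hypothetical observed values to encode
values = ["C", "A", "E", "B"]

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[[index[v] for v in values]]
print(one_hot)
```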

### Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

**ANS:** **`Assumption`**: The two categorical columns have \( n_1 \) and \( n_2 \) unique values, respectively.

**`Nominal Encoding Calculation`**:
1. **One-Hot Encoding** for each categorical column.
   - If the first categorical column has \( n_1 \) unique values, it will create \( n_1 \) new binary columns.
   - If the second categorical column has \( n_2 \) unique values, it will create \( n_2 \) new binary columns.

**`Total Columns Calculation`**:
- Original numerical columns: 3
- New columns from first categorical column: \( n_1 \)
- New columns from second categorical column: \( n_2 \)

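**`New columns created`**: \( n_1 + n_2 \) (these replace the two original categorical columns)

**`Total columns after encoding`**: \( 3 + n_1 + n_2 \)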

**`Example Calculation`**:
- If the first categorical column has 4 unique values (\( n_1 = 4 \)) and the second categorical column has 5 unique values (\( n_2 = 5 \)):

<p align="center">
  \[
  \text{Total columns} = 3 (\text{numerical}) + 4 (\text{first categorical}) + 5 (\text{second categorical}) = 12
  \]
</p>

So, using nominal encoding (one-hot encoding) in this example creates 4 + 5 = 9 new columns and yields 12 columns in total once the two original categorical columns are replaced. The actual number of new columns depends on the number of unique values in each categorical column.
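
The count can be checked on a synthetic table. This sketch assumes hypothetical column names (`num_1`, `cat_1`, and so on) and generates 1000 rows whose categorical columns have 4 and 5 unique values, as in the example calculation.

```python
import pandas as pd

n_rows = 1000
df = pd.DataFrame({
    "num_1": range(n_rows),
    "num_2": range(n_rows),
    "num_3": range(n_rows),
    "cat_1": [f"a{i % 4}" for i in range(n_rows)],  # 4 unique values
    "cat_2": [f"b{i % 5}" for i in range(n_rows)],  # 5 unique values
})

# The two categorical columns are replaced by their binary columns
encoded = pd.get_dummies(df, columns=["cat_1", "cat_2"], dtype=int)

new_columns = df["cat_1"].nunique() + df["cat_2"].nunique()
print(new_columns)     # binary columns created: n_1 + n_2
print(encoded.shape)   # rows are unchanged; columns are 3 + n_1 + n_2
```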

### Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

**ANS:** We would use `One-Hot Encoding` to transform the categorical data into a format suitable for machine learning algorithms. The `Justification` is as follows:

- **No Inherent Order**: The categorical data (species, habitat, diet) likely has no inherent order. One-hot encoding ensures that the machine learning algorithm does not assume any ordinal relationship between categories.
- **Sparsity Tolerable**: Assuming the number of unique categories in each feature (species, habitat, diet) is not excessively large, one-hot encoding will create a manageable number of columns.
- **Model Compatibility**: One-hot encoding is widely supported by many machine learning algorithms and helps improve model performance by ensuring each category is treated independently.

**`Example`**:
If the "species" column has categories ['cat', 'dog', 'fish'], the "habitat" column has categories ['land', 'water', 'air'], and the "diet" column has categories ['herbivore', 'carnivore', 'omnivore']:

- **One-Hot Encoded**:
  - **Species**:
    - 'cat' -> [1, 0, 0]
    - 'dog' -> [0, 1, 0]
    - 'fish' -> [0, 0, 1]
  - **Habitat**:
    - 'land' -> [1, 0, 0]
    - 'water' -> [0, 1, 0]
    - 'air' -> [0, 0, 1]
  - **Diet**:
    - 'herbivore' -> [1, 0, 0]
    - 'carnivore' -> [0, 1, 0]
    - 'omnivore' -> [0, 0, 1]

One-hot encoding transforms each categorical feature into a series of binary columns, preserving the non-ordinal nature of the data and making it suitable for machine learning algorithms.
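
A sketch of the same encoding across all three columns, assuming a few made-up animal records. Declaring each column as a pandas `Categorical` with an explicit category list is an optional design choice: it fixes the column order to match the listing above and keeps a column for categories that happen to be missing from the sample (here, 'air' and 'herbivore').

```python
import pandas as pd

# Hypothetical animal records; values are illustrative only
animals = pd.DataFrame({
    "species": ["cat", "dog", "fish"],
    "habitat": ["land", "land", "water"],
    "diet": ["carnivore", "omnivore", "omnivore"],
})

# Known categories per feature, in the order used in the example above
category_lists = {
    "species": ["cat", "dog", "fish"],
    "habitat": ["land", "water", "air"],
    "diet": ["herbivore", "carnivore", "omnivore"],
}
for col, cats in category_lists.items():
    animals[col] = pd.Categorical(animals[col], categories=cats)

# Categorical columns are encoded automatically, one binary column per category
encoded = pd.get_dummies(animals, dtype=int)
print(encoded.columns.tolist())
```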

### Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

**ANS:** We would use two encoding techniques to transform the categorical data into numerical data:
1. **`Label Encoding`**: For binary categorical features.
2. **`One-Hot Encoding`**: For non-binary categorical features.

**`Step-by-Step Explanation`**:

1. **Identify Categorical and Numerical Features**:
   - Categorical: gender, contract type.
   - Numerical: age, monthly charges, tenure.

2. **Encode 'gender' (Binary Categorical Feature)**:
   - **Label Encoding**: Assign 0 to 'male' and 1 to 'female'.
     - 'male' -> 0
     - 'female' -> 1

3. **Encode 'contract type' (Non-Binary Categorical Feature)**:
   - **One-Hot Encoding**: Create binary columns for each contract type.
     - 'Month-to-month', 'One year', 'Two year'.
     - One-Hot Encoded:
       - 'Month-to-month' -> [1, 0, 0]
       - 'One year' -> [0, 1, 0]
       - 'Two year' -> [0, 0, 1]

4. **Combine Encoded Features with Numerical Features**:
   - Numerical features: age, monthly charges, tenure.
   - Encoded categorical features: gender (label encoded), contract type (one-hot encoded).
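
The four steps above can be sketched with pandas. The rows, the snake_case column names, and the three contract types are assumptions for illustration; the question does not provide the actual data.

```python
import pandas as pd

# Step 1: hypothetical churn records with the five features from the question
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "age": [34, 45, 23, 51],
    "contract_type": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "monthly_charges": [70.5, 55.0, 99.9, 42.3],
    "tenure": [5, 24, 48, 2],
})

# Step 2: label-encode the binary feature with an explicit mapping
df["gender"] = df["gender"].map({"male": 0, "female": 1})

# Step 3: one-hot encode the multi-class feature
# Step 4: get_dummies leaves the other columns untouched, so the result
# is already the combined table of numerical and encoded features
encoded = pd.get_dummies(df, columns=["contract_type"], dtype=int)
print(encoded)
```

An explicit mapping for `gender` keeps the 0/1 assignment under our control rather than depending on the order in which values appear.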