### Q1. **What is data encoding? How is it useful in data science?**

**Data encoding** refers to the process of converting categorical data (non-numeric data like labels, names, etc.) into numerical formats that machine learning models can process. Since most machine learning algorithms work with numerical data, encoding helps transform non-numeric features into numerical ones.

**Usefulness in data science:**
1. **Model compatibility:** Models such as logistic regression and neural networks, and most library implementations of decision trees (including scikit-learn's), require input features to be numerical.
2. **Improved performance:** An encoding that matches the structure of the data (ordered or unordered, few or many categories) can improve model accuracy, while a poorly chosen one can introduce relationships that do not exist in the data.
3. **Handling different types of variables:** Different encoding methods (one-hot encoding, label encoding, etc.) help manage various types of categorical variables based on their structure and relationships.

---

### Q2. **What is nominal encoding? Provide an example of how you would use it in a real-world scenario.**

**Nominal encoding**, as the term is used here, assigns a unique integer to each category of a nominal (unordered) categorical feature; the same technique is more commonly called label encoding or integer encoding. The integers are arbitrary identifiers: there’s no implicit ranking between the categories, even though the codes themselves are ordered numbers.

**Example:** In a dataset about car colors, suppose the colors are: `Red`, `Blue`, `Green`, and `Yellow`. Nominal encoding will assign a unique integer to each category, such as:
- Red = 0
- Blue = 1
- Green = 2
- Yellow = 3
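
The mapping above can be sketched in plain Python. This is a minimal illustration, assuming a hypothetical `colors` column and assigning codes in order of first appearance so the result matches the list above. Library encoders may number categories differently: scikit-learn's `LabelEncoder`, for example, sorts the categories first.

```python
# Integer (label) encoding of a nominal feature, in plain Python.
colors = ["Red", "Blue", "Green", "Yellow", "Blue", "Red"]  # hypothetical sample column

# Assign codes in order of first appearance: Red=0, Blue=1, Green=2, Yellow=3.
mapping = {}
for color in colors:
    if color not in mapping:
        mapping[color] = len(mapping)

# Replace every value with its integer code.
encoded = [mapping[color] for color in colors]

print(mapping)  # {'Red': 0, 'Blue': 1, 'Green': 2, 'Yellow': 3}
print(encoded)  # [0, 1, 2, 3, 1, 0]
```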

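**Real-world use case:** Consider a system predicting the brand of phone a customer might purchase based on user behavior data. The brand names (e.g., Apple, Samsung, Nokia) have no inherent order, so they can be nominally encoded. Because the brand is the prediction target here, integer codes are a natural fit: classifiers treat target codes as class identifiers rather than as magnitudes.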

---

### Q3. **In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.**

**Nominal encoding is preferred** when:
1. **The model tolerates arbitrary numeric labels**: Integer codes impose an order that has no meaning for nominal data. Tree-based models can usually split on such codes without much harm, whereas linear and distance-based models may read the codes as magnitudes.
2. **Memory efficiency**: If a feature has high cardinality (many unique values), one-hot encoding creates one column per category, which makes the dataset wider and less memory efficient.

**Example:** In predicting customer preferences for different countries, there may be dozens of country categories. Instead of one-hot encoding, which would create a large number of new columns (one for each country), nominal encoding would assign a unique integer to each country, saving memory and computation time.
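
The difference in width can be made concrete with a small sketch. The `countries` list below is hypothetical sample data, and the categories are sorted only to obtain a stable code assignment.

```python
# Compare feature width: integer codes vs. one-hot for a categorical column.
countries = ["India", "Brazil", "Japan", "Kenya", "India", "Japan", "Chile"]  # hypothetical

categories = sorted(set(countries))  # unique categories in a stable order
code_of = {country: i for i, country in enumerate(categories)}

# Integer encoding: a single column, whatever the number of categories.
integer_column = [code_of[c] for c in countries]

# One-hot encoding: one column per unique category.
one_hot_rows = [[int(c == cat) for cat in categories] for c in countries]

integer_width = 1
one_hot_width = len(one_hot_rows[0])

print(integer_column)  # [2, 0, 3, 4, 2, 3, 1]
print(one_hot_width)   # 5
```

With dozens or hundreds of countries, `one_hot_width` grows with the number of categories while `integer_width` stays at 1.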

---

### Q4. **Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.**

**Choice: One-hot encoding**

**Reason:** With only 5 unique categories, one-hot encoding replaces the original column with 5 binary columns (a net increase of 4), each indicating whether the instance belongs to that category. This keeps the feature set small while ensuring no ordinal relationship is implied, which matters for models that could interpret integer codes as having order.

For example, if you have categories like `Dog`, `Cat`, `Bird`, `Fish`, and `Horse`, one-hot encoding would create binary columns for each category:
- `Dog`: 1 or 0
- `Cat`: 1 or 0
- `Bird`: 1 or 0
- `Fish`: 1 or 0
- `Horse`: 1 or 0

One-hot encoding ensures no hierarchical or ordinal information is accidentally introduced into the model.
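
A minimal sketch of this transformation in plain Python follows. The `one_hot` helper and the `animals` sample rows are hypothetical names used for illustration; in practice, `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would typically do this work.

```python
def one_hot(values, categories):
    """Return one row per value, with a 1 in the column of its category."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # raises KeyError for a category not in `categories`
        rows.append(row)
    return rows


categories = ["Dog", "Cat", "Bird", "Fish", "Horse"]
animals = ["Cat", "Horse", "Dog"]  # hypothetical sample rows

encoded = one_hot(animals, categories)
for animal, row in zip(animals, encoded):
    print(animal, row)
# Cat [0, 1, 0, 0, 0]
# Horse [0, 0, 0, 0, 1]
# Dog [1, 0, 0, 0, 0]
```

Each row contains exactly one 1, so no category is numerically "larger" than another.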

---

### Q5. **In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.**

**Assumptions:**
- Suppose the two categorical columns have `m` and `n` unique categories, respectively.
- Nominal encoding does not increase the number of columns but rather transforms the categorical values into numerical labels.

Since nominal encoding assigns one unique numerical label per category in each column, the total number of columns remains **5**:
- 3 original numerical columns.
- 2 transformed categorical columns, which remain as single columns after nominal encoding.

Thus, **no additional columns are created** in nominal encoding. The final dataset will still have 5 columns.
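
The calculation can be checked with a short sketch. The `column_counts` helper is hypothetical, and the values `m = 4` and `n = 6` are assumed purely for illustration; the one-hot total is included only for contrast.

```python
def column_counts(n_numeric, cardinalities):
    """Return (integer_encoded_total, one_hot_total) column counts."""
    # Integer (nominal) encoding keeps one column per categorical feature.
    integer_total = n_numeric + len(cardinalities)
    # One-hot encoding creates one column per category instead.
    one_hot_total = n_numeric + sum(cardinalities)
    return integer_total, one_hot_total


original_columns = 5
m, n = 4, 6  # assumed cardinalities of the two categorical columns

integer_total, one_hot_total = column_counts(3, [m, n])

print(integer_total)                     # 5
print(integer_total - original_columns)  # 0
print(one_hot_total)                     # 13
```

The number of rows (1000) does not affect the column count under either scheme.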

---

### Q6. **You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.**

**Choice: One-hot encoding**

**Reason:** Species, habitat, and diet are nominal categorical features (i.e., there’s no inherent order among the categories like “herbivore,” “carnivore,” or “desert,” “forest”). One-hot encoding would ensure that no artificial ordinal relationship is introduced by encoding them numerically. Each category would be represented as a binary column, ensuring the model does not make unintended assumptions about the relationships between categories.

For example:
- Species (`Dog`, `Cat`, `Bird`) would be transformed into three binary columns.
- Habitat (`Forest`, `Desert`, `Ocean`) would also be represented as binary columns.
  
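One-hot encoding is suitable here as long as each attribute has a limited number of categories, which is likely for habitat and diet. If the species column has many distinct values, the resulting number of columns should be checked before committing to this choice.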
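
As a sketch, several categorical columns can be one-hot encoded together. The `records` below are hypothetical sample data, and `one_hot_records` is an illustrative helper rather than a library function.

```python
records = [  # hypothetical sample rows
    {"species": "Dog", "habitat": "Forest", "diet": "Omnivore"},
    {"species": "Cat", "habitat": "Desert", "diet": "Carnivore"},
    {"species": "Bird", "habitat": "Forest", "diet": "Herbivore"},
]


def one_hot_records(records, columns):
    """One-hot encode the given columns of a list of dict records."""
    # Collect the unique categories of each column, sorted for a stable order.
    levels = {col: sorted({r[col] for r in records}) for col in columns}
    encoded = []
    for r in records:
        row = {}
        for col in columns:
            for level in levels[col]:
                row[f"{col}_{level}"] = int(r[col] == level)
        encoded.append(row)
    return encoded


encoded = one_hot_records(records, ["species", "habitat", "diet"])

print(len(encoded[0]))  # 8
print(encoded[0]["species_Dog"], encoded[0]["habitat_Forest"])  # 1 1
```

Here the three original columns become 3 + 2 + 3 = 8 binary columns, one per category observed in the sample.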

---

### Q7. **You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.**

**Categorical features:**
1. **Gender** (Male/Female) – Binary categorical
2. **Contract Type** (Month-to-month, One-year, Two-year) – Nominal categorical

**Step-by-step encoding:**

1. **Gender:**
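   - Since this is a binary categorical feature, it can be mapped to a single 0/1 column (e.g., 0 for Male, 1 for Female, or vice versa). This is sometimes loosely called binary encoding, though that term also names a separate technique that writes integer codes as binary digits.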

2. **Contract Type:**
   - **One-hot encoding** is a safe choice here. Contract types can be ranked by length, but their effect on churn need not be monotonic or evenly spaced, so treating each type independently avoids imposing that assumption on the model. This creates three binary columns: `Month-to-month`, `One-year`, and `Two-year`.

3. **Age, Monthly Charges, and Tenure:**
   - These are already numerical features and do not require encoding.

**Final implementation:**
- Map the `Gender` column to a single 0/1 column.
- Apply one-hot encoding to the `Contract Type` column, creating 3 new columns.
- Keep the numerical columns (`Age`, `Monthly Charges`, `Tenure`) as they are.

After encoding, the dataset will have a total of 7 columns (1 for gender, 3 for contract types, and 3 numerical columns).
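
The steps above can be sketched in plain Python. The `customers` rows, the key names, and the `encode_customer` helper are hypothetical and chosen only for illustration; with a real dataset, pandas or scikit-learn transformers would normally handle this.

```python
customers = [  # hypothetical sample rows
    {"gender": "Male", "age": 34, "contract_type": "Month-to-month",
     "monthly_charges": 70.5, "tenure": 5},
    {"gender": "Female", "age": 52, "contract_type": "Two-year",
     "monthly_charges": 99.0, "tenure": 48},
]

GENDER_CODES = {"Male": 0, "Female": 1}  # the 0/1 assignment is arbitrary
CONTRACT_TYPES = ["Month-to-month", "One-year", "Two-year"]


def encode_customer(row):
    """Encode one customer record into purely numerical features."""
    # Step 1: map the binary gender feature to a single 0/1 column.
    encoded = {"gender": GENDER_CODES[row["gender"]]}
    # Step 2: one-hot encode contract type into three binary columns.
    for contract in CONTRACT_TYPES:
        encoded[f"contract_{contract}"] = int(row["contract_type"] == contract)
    # Step 3: copy the numerical features unchanged.
    for col in ("age", "monthly_charges", "tenure"):
        encoded[col] = row[col]
    return encoded


encoded_rows = [encode_customer(r) for r in customers]

print(len(encoded_rows[0]))  # 7
print(encoded_rows[1]["gender"], encoded_rows[1]["contract_Two-year"])  # 1 1
```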