Q1. What is data encoding? How is it useful in data science?

Ans. Data encoding is the process of converting categorical data into a numerical format that can be used for machine learning algorithms and statistical models. 

Data encoding is useful in data science because most machine learning algorithms and statistical models accept only numerical input. A suitable encoding preserves the information carried by each category, can improve model performance, and supports feature engineering. The choice of encoding also affects dimensionality: integer codes keep one column per feature, whereas one-hot encoding adds one column per category.

Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Ans. Nominal encoding is a method of converting categorical data with no intrinsic order (nominal data) into numerical values. Each category is assigned a unique integer.

Example: For a dataset with Colors (Red, Green, Blue), nominal encoding might assign Red = 0, Green = 1, and Blue = 2. The integers are arbitrary labels and carry no ranking. Some models, such as linear or distance-based models, may still read the integers as ordered values, so this encoding should be used with care for those models.
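The color example can be sketched with pandas as follows. This is a minimal illustration, assuming a hypothetical sample column; the variable names (colors, color_codes, encoded_colors) are not from the original and the mapping is written out by hand so that the codes match the example.

```python
import pandas as pd

# Hypothetical sample column; the category names come from the example above.
colors = pd.Series(["Red", "Green", "Blue", "Green", "Red"])

# Explicit mapping so the codes match the example (Red = 0, Green = 1, Blue = 2).
color_codes = {"Red": 0, "Green": 1, "Blue": 2}
encoded_colors = colors.map(color_codes)

print(encoded_colors.tolist())  # [0, 1, 2, 1, 0]
```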

Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Ans. Nominal encoding is preferred over one-hot encoding when dealing with high-cardinality features, where one-hot encoding would create many columns and increase dimensionality.

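Example: For a feature like ZIP Code with thousands of unique values, nominal encoding (assigning each ZIP code a unique integer) keeps the feature in a single column, whereas one-hot encoding would create one binary column per ZIP code. Because the integers are arbitrary, this works best with models that do not treat the codes as magnitudes, such as tree-based models.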
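The difference in column count can be seen on a small sample. This is a sketch under stated assumptions: the ZIP codes are hypothetical, and pd.factorize and pd.get_dummies are used here as one possible way to produce integer codes and one-hot columns.

```python
import pandas as pd

# Hypothetical ZIP codes; a real feature would have thousands of distinct values.
zip_codes = pd.Series(
    ["10001", "94105", "60601", "10001", "73301", "94105"], name="zip"
)

# Integer encoding: a single column, whatever the number of distinct values.
codes, uniques = pd.factorize(zip_codes)
integer_encoded = pd.DataFrame({"zip_code_id": codes})

# One-hot encoding: one column per distinct value.
one_hot_encoded = pd.get_dummies(zip_codes, prefix="zip")

print(integer_encoded.shape)  # (6, 1)
print(one_hot_encoded.shape)  # (6, 4)
```

With 4 distinct ZIP codes the one-hot version already needs 4 columns; the gap grows linearly with the number of distinct values.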

Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice.

Ans. For a dataset with 5 unique categorical values, One-Hot Encoding is typically preferred.

One-Hot Encoding converts each categorical value into a binary vector, creating a separate column for each category. This avoids any implicit ordinal relationship and allows machine learning algorithms to handle the data effectively. For 5 unique values, this results in 5 binary columns, which is manageable and ensures that the model treats each category independently.
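As a rough check of the column count, the sketch below one-hot encodes a hypothetical feature with exactly 5 unique values using pd.get_dummies. The column name "city" and its values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical feature with exactly 5 unique categories.
df = pd.DataFrame(
    {"city": ["Delhi", "Mumbai", "Pune", "Chennai", "Kolkata", "Pune"]}
)

# One binary column is created per unique category.
one_hot = pd.get_dummies(df["city"], prefix="city")

print(one_hot.shape)  # (6, 5)
```

Each row contains exactly one active column, so no category is ranked above another.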

Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
transform the categorical data, how many new columns would be created? Show your calculations.

Ans. Nominal encoding does not add columns. It assigns a unique integer to each category within a categorical column, so each categorical column is replaced in place by a single column of integer codes.

Given:

Rows: 1000

Columns: 5 (2 categorical, 3 numerical)

Calculation:

Categorical columns replaced by integer columns: 2

Additional columns created: 2 - 2 = 0

Total columns after encoding: 3 numerical + 2 encoded = 5

The dataset therefore still has 1000 rows and 5 columns; only the contents of the 2 categorical columns change.
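The column count can be checked on hypothetical data of the same shape. In this sketch the column names and values are assumptions; the two categorical columns are overwritten with integer codes from pd.factorize, and the shape is compared before and after.

```python
import pandas as pd

n_rows = 1000

# Hypothetical data: two categorical and three numerical columns.
df = pd.DataFrame({
    "cat_a": ["x", "y", "z", "y"] * (n_rows // 4),
    "cat_b": ["low", "high"] * (n_rows // 2),
    "num_1": range(n_rows),
    "num_2": [0.5] * n_rows,
    "num_3": [1] * n_rows,
})

encoded = df.copy()
for col in ["cat_a", "cat_b"]:
    # factorize returns integer codes; the column is overwritten in place.
    encoded[col] = pd.factorize(encoded[col])[0]

print(df.shape, encoded.shape)  # (1000, 5) (1000, 5)
```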

Q6. You are working with a dataset containing information about different types of animals, including their
species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
a format suitable for machine learning algorithms? Justify your answer.

Ans. For a dataset with categorical features like species, habitat, and diet, use One-Hot Encoding. It converts each category into a binary vector, avoiding any ordinal assumptions and clearly representing each category.
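A minimal sketch of this choice, assuming a small hypothetical animal table (the rows below are illustrative, not taken from any real dataset), could use pd.get_dummies on all three columns at once:

```python
import pandas as pd

# Hypothetical animal records; values are illustrative only.
animals = pd.DataFrame({
    "species": ["Lion", "Elephant", "Bear", "Lion"],
    "habitat": ["Savanna", "Savanna", "Forest", "Savanna"],
    "diet": ["Carnivore", "Herbivore", "Omnivore", "Carnivore"],
})

# One binary column per category in each of the three features.
encoded_animals = pd.get_dummies(animals, columns=["species", "habitat", "diet"])

print(encoded_animals.shape)  # (4, 8)
```

The 8 columns come from 3 species, 2 habitats, and 3 diets. If species had very many distinct values, the column count would grow accordingly.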

Q7. You are working on a project that involves predicting customer churn for a telecommunications
company. You have a dataset with 5 features, including the customer's gender, age, contract type,
monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical
data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

Ans. For a dataset predicting customer churn with features like gender, age, contract type, monthly charges, and tenure, you should use One-Hot Encoding for the categorical features. Here's a step-by-step explanation:

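Identify Categorical Features:

Gender (Nominal)

Contract Type (Nominal)

Age, monthly charges, and tenure are already numerical and need no encoding.

Step-by-Step Implementation:

Prepare the Data: Load the dataset into a DataFrame.

Perform One-Hot Encoding: Use a library like pandas or scikit-learn to convert the categorical features into binary columns.

Combine the Features: Concatenate the encoded columns with the numerical columns to form the final dataset.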



import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample dataset
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [25, 34, 45, 52],
    'Contract Type': ['Month-to-Month', 'One Year', 'Two Year', 'Month-to-Month'],
    'Monthly Charges': [70.5, 85.0, 90.0, 70.0],
    'Tenure': [12, 24, 36, 15]
})

# Initialize OneHotEncoder
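# sparse_output=False returns a dense array (this parameter was named `sparse` before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid multicollinearity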

# Encode categorical features
encoded_features = encoder.fit_transform(data[['Gender', 'Contract Type']])

# Create DataFrame with encoded features
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['Gender', 'Contract Type']))

# Combine with numerical features
final_data = pd.concat([data[['Age', 'Monthly Charges', 'Tenure']], encoded_df], axis=1)

print("Encoded Data:\n", final_data)


Encoded Data:
    Age  Monthly Charges  Tenure  Gender_Male  Contract Type_One Year  \
0   25             70.5      12          1.0                     0.0   
1   34             85.0      24          0.0                     1.0   
2   45             90.0      36          0.0                     0.0   
3   52             70.0      15          1.0                     0.0   

   Contract Type_Two Year  
0                     0.0  
1                     0.0  
2                     1.0  
3                     0.0  


