In [None]:
Q1. What is data encoding? How is it useful in data science?
Ans: Data Encoding: A Fundamental Concept in Data Science
Data encoding is the process of transforming data into a format that can be easily understood and processed by computers. It's a crucial step in data science, especially when dealing with categorical or textual data.

Why Data Encoding Is Important:
Machine Learning Compatibility: Most machine learning algorithms are designed to work with numerical data. Encoding converts categorical or textual data into numerical representations, making it suitable for these algorithms.
Efficient Data Storage: Compact encodings such as integer codes usually take less space than repeated text labels, although one-hot encoding can increase the size of the data.
Improved Model Performance: The choice of encoding technique can significantly impact the performance of a machine learning model.
Feature Engineering: Encoding can create new features that capture valuable information from the original data.
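
The storage point above can be checked with a small sketch. It assumes pandas is available and uses a made-up column of repeated color labels; the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column: three text labels repeated many times.
colors = pd.Series(["Red", "Blue", "Green"] * 10_000)

# The category dtype stores each label once plus a small integer code per row.
as_category = colors.astype("category")

text_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("text labels:", text_bytes, "bytes")
print("integer-coded categories:", category_bytes, "bytes")
```

The saving comes from repetition: the more often each label repeats, the larger the gap.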



In [None]:
Q2. What is nominal encoding? 
Ans: Nominal encoding is a technique used to represent categorical data with no inherent order. In this method, each unique category is assigned its own identifier, typically an integer. The numeric value of the identifier is arbitrary and carries no ranking.
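
A minimal sketch of this idea in plain Python, using made-up color labels. The alphabetical assignment of codes is an arbitrary choice for illustration; any one-to-one assignment would do.

```python
# Hypothetical sample values; any list of category labels would work.
colors = ["Red", "Blue", "Green", "Blue", "Red"]

# Give each unique category its own integer code (alphabetical order here).
code_for = {category: code for code, category in enumerate(sorted(set(colors)))}

# Replace every label with its code.
encoded = [code_for[c] for c in colors]

print(code_for)  # {'Blue': 0, 'Green': 1, 'Red': 2}
print(encoded)   # [2, 0, 1, 0, 2]
```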

In [None]:
Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.
Ans: Nominal Encoding vs. One-Hot Encoding: When to Choose Which
Both nominal encoding and one-hot encoding are popular techniques for handling categorical data. However, there are specific scenarios where one might be preferred over the other.

Nominal Encoding is Preferred When:
Cardinality is high: When a column has a large number of unique categories, one-hot encoding adds one column per category, creating a high-dimensional, sparse feature space that can lead to the curse of dimensionality. Integer codes keep the feature in a single column.
Categories have a natural order: If the categories have a meaningful order (e.g., "Low," "Medium," "High"), integer codes assigned in rank order can capture this relationship. Strictly speaking, this variant is called ordinal encoding, since purely nominal categories have no order to preserve. One-hot encoding treats all categories as equally distinct.
Example:
Consider a dataset of customer reviews with a "Rating" column that can take values from 1 to 5. In this case, integer encoding might be preferred over one-hot encoding because:

Cardinality is not the deciding factor: There are only five possible values, so one-hot encoding would add just five columns and would also be feasible.
Categories have a natural order: The ratings have a clear order from "1" (worst) to "5" (best). Assigning sequential integers preserves this order, whereas one-hot encoding would discard it.
By using integer codes in this scenario, we preserve the ordinal relationship between the ratings in a single column, which can improve the performance of models that benefit from the ordering.
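
To illustrate the cardinality argument above, here is a small sketch comparing the two encodings on a made-up column with 200 distinct values. The column name and values are assumptions for illustration, and pandas is assumed to be available.

```python
import pandas as pd

# Hypothetical high-cardinality column: 200 distinct city labels, 5 rows each.
cities = pd.Series([f"city_{i}" for i in range(200)] * 5, name="city")

# One-hot encoding: one column per distinct city.
one_hot = pd.get_dummies(cities)

# Integer codes: a single value per row, whatever the number of cities.
codes, uniques = pd.factorize(cities)

print(one_hot.shape)  # (1000, 200)
print(codes.shape)    # (1000,)
```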

In [None]:
Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice.
Ans: For a dataset with 5 unique categorical values, one-hot encoding would be a suitable choice, for the following reasons:

Low Cardinality: With only 5 unique values, one-hot encoding adds at most 5 columns, so the feature space stays small and computationally efficient for most machine learning algorithms.
No Natural Order: If there is no inherent order or hierarchy among the categories, one-hot encoding is a straightforward and effective way to represent them as binary vectors.
Preserves Category Independence: One-hot encoding gives each category its own binary column, which is desirable when the categories are mutually exclusive and none should be treated as larger than another.
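
As a sketch of what this looks like in practice, the snippet below one-hot encodes a made-up "payment" column with 5 unique values using pandas. The column name and labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical column with exactly 5 unique categories.
df = pd.DataFrame(
    {"payment": ["Card", "Cash", "Wallet", "Transfer", "Cheque", "Cash"]}
)

# One binary column per category; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(df, columns=["payment"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row contains a single 1, in the column of its category. For linear models, passing drop_first=True to get_dummies drops one redundant column.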

In [None]:
Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
transform the categorical data, how many new columns would be created? Show your calculations.
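Ans: Understanding the Problem:

We have 2 categorical columns and 3 numerical columns.
With nominal encoding as defined in Q2 (one integer code per category), each categorical column is replaced in place, so no new columns are created and the dataset keeps its 5 columns.
If the encoding is instead implemented in one-hot style, it creates a new column for each unique value in a categorical column, and the count depends on the data.
Calculating New Columns (one-hot style):

The question does not state the number of unique values in each categorical column, so we have to assume them.

Let's assume:

Column 1 has 3 unique values (e.g., "Red," "Blue," "Green").
Column 2 has 4 unique values (e.g., "Small," "Medium," "Large," "Extra Large").
Total new columns: 3 (from Column 1) + 4 (from Column 2) = 7 new columns. These replace the 2 original categorical columns, so the dataset ends up with 3 + 7 = 10 columns.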
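
The one-hot arithmetic above can be checked with a short sketch. The frame below is hypothetical: it uses 4 rows instead of 1000, and the column names are made up, but the categorical columns have the assumed 3 and 4 unique values.

```python
import pandas as pd

# Hypothetical frame: 3 numerical columns plus the two assumed categorical ones.
df = pd.DataFrame(
    {
        "num_a": [1, 2, 3, 4],
        "num_b": [0.5, 1.5, 2.5, 3.5],
        "num_c": [10, 20, 30, 40],
        "color": ["Red", "Blue", "Green", "Red"],
        "size": ["Small", "Medium", "Large", "Extra Large"],
    }
)
cat_cols = ["color", "size"]

# One-hot style: one new column per unique value in each categorical column.
new_columns = sum(df[c].nunique() for c in cat_cols)
one_hot = pd.get_dummies(df, columns=cat_cols)

print(new_columns)       # 7
print(one_hot.shape[1])  # 10 = 3 numerical + 7 one-hot columns
```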

In [None]:
Q6. You are working with a dataset containing information about different types of animals, including their
species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
a format suitable for machine learning algorithms? Justify your answer.
Ans: For a dataset containing information about animals, such as species, habitat, and diet, one-hot encoding would be a suitable technique to transform the categorical data, for the following reasons:

No Natural Order: Categories like "species," "habitat," and "diet" typically don't have a clear hierarchical or ordinal relationship. One-hot encoding treats each category as independent, which is appropriate in this case.
Preserves Category Independence: One-hot encoding gives each category its own binary column, so the model cannot infer a false ranking or distance between categories, as it could from arbitrary integer codes.
Clarity and Interpretability: The resulting binary representation is easy to interpret and understand, making it easier to analyze the model's results.
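
The "no natural order" argument can be made concrete with a plain-Python sketch. The habitat labels are made up for illustration. Integer codes imply that some habitats are closer to each other than others, while one-hot vectors keep every pair equally far apart.

```python
from itertools import combinations

habitats = ["desert", "forest", "ocean"]  # hypothetical category values

# Integer codes: gaps between codes differ, which suggests a false ordering.
int_code = {h: i for i, h in enumerate(habitats)}
int_dist = {
    (a, b): abs(int_code[a] - int_code[b]) for a, b in combinations(habitats, 2)
}

# One-hot vectors: every pair of distinct categories is equally far apart.
one_hot = {
    h: [1 if i == j else 0 for j in range(len(habitats))]
    for i, h in enumerate(habitats)
}
hot_dist = {
    (a, b): sum((x - y) ** 2 for x, y in zip(one_hot[a], one_hot[b]))
    for a, b in combinations(habitats, 2)
}

print(int_dist)  # desert-ocean is 2, the other pairs are 1
print(hot_dist)  # every pair has squared distance 2
```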

In [None]:
Q7. You are working on a project that involves predicting customer churn for a telecommunications
company. You have a dataset with 5 features, including the customer's gender, age, contract type,
monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical
data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.
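Ans: Gender is a nominal feature, so it is one-hot encoded. Contract type has a natural order by commitment length (monthly, yearly, two-year), so it is mapped to ordered integers. Age, monthly charges, and tenure are already numerical and need no encoding. The steps are implemented in the code cell that follows.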

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Step 1: load the data (column names are assumed to match the CSV).
data = pd.read_csv("customer_data.csv")

# Step 2: one-hot encode Gender, a nominal feature with no order.
# On scikit-learn older than 1.2, use sparse=False instead of sparse_output=False.
gender_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
gender_encoded = gender_encoder.fit_transform(data[["Gender"]])
# get_feature_names_out() already returns prefixed names such as "Gender_Male".
gender_df = pd.DataFrame(
    gender_encoded,
    columns=gender_encoder.get_feature_names_out(),
    index=data.index,
)
data = pd.concat([data.drop(columns="Gender"), gender_df], axis=1)

# Step 3: encode Contract Type with an explicit ordered mapping.
# LabelEncoder would assign codes alphabetically and lose this order.
contract_mapping = {"Monthly": 1, "Yearly": 2, "Two-Year": 3}
data["Contract Type"] = data["Contract Type"].map(contract_mapping)

# Step 4: age, monthly charges, and tenure are numerical and stay unchanged.
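
Since customer_data.csv is not shown, an explicit ordered mapping such as contract_mapping can be sanity-checked on a small in-memory frame. The rows below are made up, and "Weekly" is a deliberately unknown label. With Series.map, labels missing from the mapping become NaN, which makes them easy to spot.

```python
import pandas as pd

contract_mapping = {"Monthly": 1, "Yearly": 2, "Two-Year": 3}

# Hypothetical rows; "Weekly" is not in the mapping on purpose.
sample = pd.DataFrame(
    {"Contract Type": ["Monthly", "Two-Year", "Yearly", "Weekly"]}
)

# Labels absent from the mapping are turned into NaN by map().
sample["Contract Code"] = sample["Contract Type"].map(contract_mapping)

# List the labels that could not be encoded.
unmapped = sample.loc[sample["Contract Code"].isna(), "Contract Type"].tolist()

print(sample)
print(unmapped)  # ['Weekly']
```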