In [72]:
import pandas as pd
import numpy as np

## Q1. What is data encoding? How is it useful in data science?

Machine learning models are built on mathematical equations, so they need their inputs in numerical form. Real-world datasets, however, often contain categorical variables that cannot simply be dropped. The technique used to transform these categorical variables into numerical ones is known as **Data Encoding**. Different encoding methods convert the data in different ways while trying to preserve the information these variables represent.

#### <u>Data Encoding Usefulness in Data Science</u>

1. It converts categorical features to numerical features, which makes model training possible.
2. It tries to preserve the meaning of the categorical variables, which helps retain the information in the data.
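
As a minimal sketch of the first point, assuming a hypothetical `city` column (the data below is made up for illustration), `pd.factorize` shows the simplest possible conversion from labels to numbers:

```python
import pandas as pd

# Hypothetical toy data, not taken from any real dataset
toy = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Pune"]})

# pd.factorize assigns an integer code to each distinct value,
# in order of first appearance
codes, uniques = pd.factorize(toy["city"])
toy["city_code"] = codes

print(toy)
print(list(uniques))
```

Plain integer codes like these imply an order among the cities that does not really exist, which is why the choice of encoding technique matters.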

## Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

**Nominal Encoding -** This encoding technique is used when the values of a variable are just names and have no order or rank.

![image.png](attachment:image.png)

**For example:** the city a person lives in, a person's gender, marital status, etc.

In the examples above, there is no order, rank, or sequence. All values of a given feature are on an equal footing, so we cannot order or rank them. Such features are called nominal features.

The technique used to convert these nominal categorical features to numerical features is called **Nominal Encoding**. One common Nominal Encoding technique is **One Hot Encoding**.

#### Example of Nominal Encoding (One Hot Encoding)

In [73]:
# Defining the dataset
colors = ["Violet", "Indigo", "Blue", "Green", "Yellow", "Orange", "Red"]

df = pd.DataFrame(data=colors, columns=["color"])
df.head()

Unnamed: 0,color
0,Violet
1,Indigo
2,Blue
3,Green
4,Yellow


In [74]:
# Performing One Hot Encoding
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

# Fit and transform on the same column selection
encoder.fit(df[["color"]])
final_df = pd.DataFrame(encoder.transform(df[["color"]]).toarray(), columns=encoder.get_feature_names_out())
final_df

Unnamed: 0,color_Blue,color_Green,color_Indigo,color_Orange,color_Red,color_Violet,color_Yellow
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,1.0,0.0,0.0
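
A similar result can be sketched with `pd.get_dummies`, shown here only as an alternative to the `OneHotEncoder` approach above. The variable names `colors_df` and `dummies` are introduced just for this illustration:

```python
import pandas as pd

colors = ["Violet", "Indigo", "Blue", "Green", "Yellow", "Orange", "Red"]
colors_df = pd.DataFrame({"color": colors})

# get_dummies creates one indicator column per distinct value;
# dtype=float mirrors the 0.0/1.0 values produced by OneHotEncoder
dummies = pd.get_dummies(colors_df, columns=["color"], dtype=float)

print(dummies.columns.tolist())
print(dummies.head())
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so the encoder object is usually the safer choice when the same transformation must later be applied to new data.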


## Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

The question as stated seems incorrect, because One-Hot Encoding is itself a nominal encoding technique: it converts nominal categorical variables to numerical features.

Let's instead consider the question as - **When is Ordinal Encoding preferred over One-Hot Encoding?**

Let's try to understand the scenarios where ordinal encoding is preferred over nominal encoding (One-Hot Encoding):

1. Ordinal encoding is used when the categories have a natural order. **For example -** a student's degree level, a student's performance in class, etc. All of these variables have an inherent natural ordering.

![image-2.png](attachment:image-2.png)

2. Note the converse: forcing an ordinal encoding onto categories that have no natural order lets the model assume an ordering that does not exist, which may result in poor performance or unexpected results (such as predictions halfway between categories). In that case One-Hot Encoding is the safer choice.

![image.png](attachment:image.png)
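
As a minimal sketch of the first point, assuming a hypothetical `degree` column and an assumed ranking of degree levels (neither comes from a real dataset), the order can be passed to scikit-learn's `OrdinalEncoder` explicitly:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical education levels, ranked from lowest to highest (assumed order)
degree_order = ["High School", "Bachelor", "Master", "PhD"]
students = pd.DataFrame({"degree": ["Master", "High School", "PhD", "Bachelor"]})

# categories takes one list per encoded column; without it the
# categories would be sorted alphabetically instead of by rank
degree_encoder = OrdinalEncoder(categories=[degree_order])
students["degree_code"] = degree_encoder.fit_transform(students[["degree"]]).ravel()

print(students)
```

Each degree is mapped to its position in `degree_order`, so a higher code always means a higher degree.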

## Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

Here, we have a dataset containing a categorical variable with 5 unique values. The choice of encoding depends on the scenario, as discussed below:

1. We could perform **Nominal Encoding** when there is no natural ordering among the values, for example if they are the names of 5 cities.

2. We could perform **Ordinal Encoding** when there is a natural ordering among the values and we can assign a rank to each of them. For example - students' grades with 5 possible levels.

3. One Hot Encoding is often preferred because it doesn't assume any relationship or rank among the values. On the other hand, it creates one new column per unique value, so it can give rise to a large number of features, which can contribute to overfitting of the model. With only 5 unique values this cost is small.

4. Ordinal encoding is sometimes preferred when the number of unique values is large and one-hot encoding would lead to the curse of dimensionality, although it introduces an artificial order if the values are not actually ranked.
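
The trade-off in dimensionality can be sketched as follows. The city names are hypothetical, and both encoders are used with their default settings:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical feature with 5 unique values (city names, so no natural order)
cities = np.array([["Delhi"], ["Mumbai"], ["Pune"], ["Chennai"], ["Kolkata"], ["Pune"]])

# One-hot: one column per unique value
one_hot = OneHotEncoder().fit_transform(cities).toarray()

# Ordinal: a single column of codes (alphabetical order by default)
ordinal_codes = OrdinalEncoder().fit_transform(cities)

print(one_hot.shape)
print(ordinal_codes.shape)
```

One-hot encoding turns the single feature into 5 columns, while ordinal encoding keeps 1 column but assigns codes whose order is meaningless for city names.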

## Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

Here, we have a machine learning project with a dataset of 1000 rows and 5 columns, 2 of which are categorical. The question does not say how many unique values each categorical column has, so assume that no order or rank can be assigned to their values, that the first categorical variable has 5 unique values, and that the second has 10.

Then the first categorical variable will be converted to a vector of size 5, and the second to a vector of size 10.

The total number of columns added is 5 + 10 = 15.

Since the 2 original categorical columns can be dropped after conversion, the net increase is 15 - 2 = 13 columns, giving 3 + 15 = 18 columns in total.
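
The count can be checked with a short sketch. It uses `pd.get_dummies` as the one-hot encoder and hypothetical column names (`num_1`, `cat_a`, and so on), with the same assumed cardinalities of 5 and 10:

```python
import numpy as np
import pandas as pd

n_rows = 1000

# Hypothetical category labels with the assumed cardinalities (5 and 10)
cat_a_values = [f"a{i}" for i in range(5)]
cat_b_values = [f"b{i}" for i in range(10)]

data = pd.DataFrame({
    "num_1": np.zeros(n_rows),
    "num_2": np.zeros(n_rows),
    "num_3": np.zeros(n_rows),
    # np.resize cycles through the labels so every value appears
    "cat_a": np.resize(cat_a_values, n_rows),
    "cat_b": np.resize(cat_b_values, n_rows),
})

# get_dummies replaces each listed column with its indicator columns
encoded = pd.get_dummies(data, columns=["cat_a", "cat_b"])

n_dummy_columns = encoded.shape[1] - 3            # columns created by the encoding
net_increase = encoded.shape[1] - data.shape[1]   # after dropping the 2 originals

print(n_dummy_columns, net_increase, encoded.shape)
```

Under these assumptions the sketch reports 15 indicator columns, a net increase of 13, and a final shape of 1000 rows by 18 columns.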

## Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

The dataset about different types of animals contains 3 categorical variables, namely **Species, Habitat and Diet**.

None of these variables has a natural order, so Ordinal Encoding is not suitable for them.

So here we would perform **Nominal Encoding** (for example One Hot Encoding) of these categorical features before training the model.

## Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

The steps for encoding the above dataset are as follows:

1. Identify the categorical features - here these are Gender and Contract Type.
2. Decide whether each feature is ordinal or nominal.
3. Gender is a nominal variable, so we will perform One Hot Encoding on it.
4. Contract Type is treated as an ordinal variable here because its values (monthly, quarterly, half-yearly, yearly) can be ranked by contract length, so we will perform Ordinal Encoding on it.
5. Combine the encoded features with the numerical features to form the final dataframe.

In [116]:
gender = np.random.choice(["Male", "Female"], size = 100)
age = np.random.normal(loc=30, scale=10, size=100)
contract_type = np.random.choice(["yearly", "half-yearly", "quarterly", "monthly"], size=100)
charges = np.random.normal(loc=150, scale=30, size=100)
tenure = np.random.randint(low=1, high=5, size=100)
churn = np.random.randint(low=0, high=2, size=100)  # high is exclusive, so high=2 yields 0 or 1

df = pd.DataFrame({'Gender': gender, 
                   'Age' :  age,
                   'Contract_Type' : contract_type,
                   "Charges" : charges,
                   "Tenure" : tenure,
                   "Churn" : churn
                  })
df.head()

Unnamed: 0,Gender,Age,Contract_Type,Charges,Tenure,Churn
0,Male,47.14226,monthly,219.824667,2,0
1,Male,22.462569,quarterly,126.155758,4,0
2,Male,37.562815,yearly,165.831061,2,0
3,Male,27.538034,monthly,114.196594,3,0
4,Female,22.381945,half-yearly,181.332588,4,0


In [117]:
df.dtypes

Gender            object
Age              float64
Contract_Type     object
Charges          float64
Tenure             int32
Churn              int32
dtype: object

In [118]:
cat_features = list(df.columns[df.dtypes == "object"])
print(f"Categorical features are : {cat_features}")

Categorical features are : ['Gender', 'Contract_Type']


In [119]:
# One Hot Encoding the Gender column
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder()
onehot.fit(df[[cat_features[0]]])

gender_df = pd.DataFrame(onehot.transform(df[[cat_features[0]]]).toarray(), columns=onehot.get_feature_names_out())
gender_df.head()

Unnamed: 0,Gender_Female,Gender_Male
0,0.0,1.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0


In [120]:
# Ordinal Encoding the Contract Types column
from sklearn.preprocessing import OrdinalEncoder

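# Categories ranked from shortest to longest contract; OrdinalEncoder expects one list per column
contract_order = [["monthly", "quarterly", "half-yearly", "yearly"]]
ordinal = OrdinalEncoder(categories=contract_order)
contract_encoded = ordinal.fit_transform(df[["Contract_Type"]]).flatten()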

ordinal_df = pd.DataFrame(contract_encoded, columns=["Contract_Type"])
ordinal_df.head()

Unnamed: 0,Contract_Type
0,0.0
1,1.0
2,3.0
3,0.0
4,2.0


In [122]:
final_df = df.drop(labels=cat_features, axis=1)

In [123]:
# Join the remaining numerical columns with the encoded columns
final_df = pd.concat([final_df, gender_df, ordinal_df], axis=1)
final_df.head()

Unnamed: 0,Age,Charges,Tenure,Churn,Gender_Female,Gender_Male,Contract_Type
0,47.14226,219.824667,2,0,0.0,1.0,0.0
1,22.462569,126.155758,4,0,0.0,1.0,1.0
2,37.562815,165.831061,2,0,0.0,1.0,3.0
3,27.538034,114.196594,3,0,0.0,1.0,0.0
4,22.381945,181.332588,4,0,1.0,0.0,2.0
