## Q1. What is data encoding? How is it useful in data science?

#### 1. Data encoding is the process of converting data from one representation to another. In machine learning, it usually means converting categorical data into numerical data so that models can use it. Two popular techniques for encoding categorical data are ordinal encoding and one-hot encoding.
#### 2. Categorical data is data divided into categories or groups, such as colors, shapes, or sizes. Most machine learning models expect numerical input and cannot work with category labels directly, so the labels must be converted into numbers first.
#### 3. Ordinal encoding converts each category into an integer that reflects its rank, which suits categories with a natural order such as "small" < "medium" < "large".
#### 4. One-hot encoding converts each category into its own binary (0/1) column, which suits categories with no natural order.
#### 5. Data encoding is useful in data science because it makes categorical features usable by machine learning models. Choosing an encoding that matches the nature of the variable (ordered or unordered) avoids feeding the model a misleading ordering and can improve its accuracy.

## Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

#### Nominal encoding is the process of converting a categorical variable that has no inherent order or hierarchy into a numerical format. The simplest form assigns each category a unique integer label; the most common form is one-hot encoding (also called dummy coding), which gives each category its own binary column.

#### For example, let's say you have a dataset containing information about the type of vehicle owned by customers, such as "car", "truck", "motorcycle", and "van". To perform analysis on this data using machine learning algorithms, you need to convert the categorical variable "vehicle type" into a numerical format.

#### One way to do this is by using nominal encoding. In nominal encoding, you would assign a unique integer value to each category, such as:

#### car: 1
#### truck: 2
#### motorcycle: 3
#### van: 4
#### Now, you can represent the "vehicle type" column in your dataset using the encoded values. For example, if a customer owns a car, the "vehicle type" column would contain the value 1. These integers are only labels: the encoding does not mean that a van (4) is "greater than" a car (1). For models that interpret numeric magnitude, such as linear models, the labels are usually expanded into one-hot columns.
#### In a real-world scenario, nominal encoding can be used in natural language processing (NLP) to convert textual data, such as the words in customer reviews, into a numerical format for further analysis. Another example is credit scoring, where nominal variables such as employment type are encoded before estimating the creditworthiness of an individual. Ordered variables such as income level or education level are better handled with ordinal encoding.
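
#### The vehicle example can be sketched with pandas as follows. The `customers` dataframe is invented for illustration; the integer mapping is the one listed in the answer, and `pd.get_dummies` is used as one possible way to produce the one-hot form.

```python
import pandas as pd

# Hypothetical customer data with a single nominal column
customers = pd.DataFrame(
    {"vehicle_type": ["car", "truck", "motorcycle", "van", "car"]}
)

# Integer labels, using the mapping from the answer above
label_map = {"car": 1, "truck": 2, "motorcycle": 3, "van": 4}
customers["vehicle_label"] = customers["vehicle_type"].map(label_map)

# One-hot form: one 0/1 column per vehicle type (columns sorted alphabetically)
vehicle_dummies = pd.get_dummies(
    customers["vehicle_type"], prefix="vehicle", dtype=int
)

print(customers)
print(vehicle_dummies)
```

#### Each row of `vehicle_dummies` contains exactly one 1, so no vehicle type is treated as larger or smaller than another.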

## Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

#### Nominal encoding and one-hot encoding both convert categorical variables into numerical format. In this comparison, nominal encoding means assigning a unique integer value to each category (label encoding), while one-hot encoding creates one binary column per category, indicating the presence or absence of that category.

#### Integer encoding can be preferred over one-hot encoding when the number of categories is large. One-hot encoding then produces a very high-dimensional data matrix, which can increase the risk of overfitting and be computationally expensive. The trade-off is that integer codes imply an order that does not exist, so this choice works best with models that are not sensitive to it, such as tree-based models.

#### For example, suppose you have a dataset containing information about the type of cuisine of restaurants in a city. The dataset has 100,000 rows, and there are 50 different cuisine types. One-hot encoding would produce a matrix with 100,000 rows and 50 columns (5 million cells if stored densely), while integer encoding would produce a single column holding 50 distinct integer values, which is more compact.
#### Another situation where integer encoding is preferred over one-hot encoding is when the categorical variable has an inherent order or hierarchy; strictly speaking, this is ordinal encoding rather than nominal encoding. For example, if a dataset contains a variable for education level, where the categories are "high school", "college", and "graduate school", it is more appropriate to assign "high school" the value 1, "college" the value 2, and "graduate school" the value 3. This approach preserves the ranking of the categories, whereas one-hot encoding would treat them as unordered, which may not be appropriate in this context.
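
#### To put numbers on the cuisine example, the sketch below builds a synthetic column with 50 cuisine labels over 100,000 rows and one-hot encodes it with scikit-learn. The labels (`cuisine_0` to `cuisine_49`) are placeholders, not real data. It also shows that `OneHotEncoder` returns a sparse matrix by default, which stores only the non-zero entries and reduces much of the memory cost.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

n_rows, n_cuisines = 100_000, 50
rng = np.random.default_rng(0)

# Cycle through the 50 labels so each one is present, then shuffle the rows
cuisine_ids = rng.permutation(np.arange(n_rows) % n_cuisines)
cuisines = np.array([f"cuisine_{i}" for i in cuisine_ids]).reshape(-1, 1)

# One-hot encoding: returns a SciPy sparse matrix by default
one_hot = OneHotEncoder().fit_transform(cuisines)

dense_cells = one_hot.shape[0] * one_hot.shape[1]  # cells in a dense layout
stored_values = one_hot.nnz                        # values actually stored

print(one_hot.shape)   # (100000, 50)
print(dense_cells)     # 5000000
print(stored_values)   # 100000
```

#### The integer-encoded alternative is simply the single `cuisine_ids` column of 100,000 values.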

## Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

#### 1. If I have a dataset containing categorical data with 5 unique values, I can use either ordinal encoding or one-hot encoding to transform this data into a format suitable for machine learning algorithms.
#### 2. If the categorical variable has a natural order or ranking, ordinal encoding is appropriate. If it has no natural order or ranking, one-hot encoding is appropriate.
#### 3. In general, one-hot encoding is the safer default because it does not assume any ordinal relationship between the categories. It can lead to the curse of dimensionality when the number of unique values is very large, but with only 5 unique values it adds just 5 columns, so this is not a concern here.
#### 4. Integer encoding is sometimes used as a compact alternative when the number of unique values is so large that one-hot encoding becomes impractical, at the cost of implying an order between the categories.
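
#### The decision rule above can be sketched as a small helper. `encode_column` is a hypothetical function written for this example, not a library API, and the `colors` and `ratings` series are invented 5-category variables.

```python
import pandas as pd

def encode_column(series, order=None):
    """Ordinal-encode when an explicit order is given, otherwise one-hot encode."""
    if order is not None:
        # Codes follow the given order; values missing from `order` become -1
        codes = pd.Categorical(series, categories=order, ordered=True).codes
        return pd.DataFrame({series.name: codes}, index=series.index)
    return pd.get_dummies(series, prefix=series.name, dtype=int)

# Unordered variable with 5 unique values: one-hot encode
colors = pd.Series(["red", "green", "blue", "yellow", "black", "red"], name="color")
color_encoded = encode_column(colors)

# Ordered variable with 5 unique values: ordinal encode
ratings = pd.Series(
    ["poor", "fair", "good", "very good", "excellent", "good"], name="rating"
)
rating_encoded = encode_column(
    ratings, order=["poor", "fair", "good", "very good", "excellent"]
)

print(color_encoded.shape)                # (6, 5)
print(rating_encoded["rating"].tolist())  # [0, 1, 2, 3, 4, 2]
```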

## Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

#### Taking nominal encoding to mean one-hot encoding, each unique value in a categorical column becomes a new column.
#### Let’s assume that the first categorical column has 12 unique values and the second categorical column has 5 unique values. Then one-hot encoding would create 12 + 5 = 17 new columns. These replace the 2 original categorical columns, so the encoded dataset has 3 + 17 = 20 columns and still 1000 rows.
#### In general, if the first categorical column has n unique values and the second categorical column has m unique values, one-hot encoding creates n + m new columns. If one category per variable is dropped to avoid redundancy (dummy coding), the count is (n - 1) + (m - 1).
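
#### The arithmetic can be checked on a synthetic dataframe with the same shape as the one in the question. The column names and values are placeholders, and the cardinalities 12 and 5 are the assumed values from the answer above.

```python
import numpy as np
import pandas as pd

n_rows = 1000
df = pd.DataFrame({
    "num_1": np.arange(n_rows, dtype=float),
    "num_2": np.ones(n_rows),
    "num_3": np.zeros(n_rows),
    # Cycle through the labels so every category appears at least once
    "cat_a": [f"a{i % 12}" for i in range(n_rows)],  # 12 unique values
    "cat_b": [f"b{i % 5}" for i in range(n_rows)],   # 5 unique values
})

# One-hot encode both categorical columns; numeric columns pass through
encoded = pd.get_dummies(df, columns=["cat_a", "cat_b"])
encoded_drop_first = pd.get_dummies(df, columns=["cat_a", "cat_b"], drop_first=True)

new_columns = encoded.shape[1] - 3  # subtract the 3 numeric columns

print(new_columns)               # 17
print(encoded.shape)             # (1000, 20)
print(encoded_drop_first.shape)  # (1000, 18)
```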

## Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

#### The variables species, habitat, and diet are nominal features with no natural order or ranking.
#### When a categorical variable has no natural order or ranking, one-hot encoding is appropriate.
#### Hence, one-hot encoding would be preferred in this case.

## Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

#### To transform the categorical data in the customer churn dataset into numerical data, use ordinal encoding for variables with a natural order and one-hot encoding for variables without one. Contract type, with values such as “month-to-month”, “one year”, and “two year”, can be ranked by length, so ordinal encoding applies. Gender, with values such as “male” and “female”, has no ranking, so one-hot encoding applies. The steps to implement the encoding are:

#### 1. Identify the categorical variables in the dataset. In this case, they are the customer’s gender and contract type; age, monthly charges, and tenure are numerical.

#### 2. Separate the nominal and ordinal variables. Gender is a nominal variable, while contract type is an ordinal variable.

#### 3. Apply one-hot encoding to the nominal variable, gender.

#### 4. Apply ordinal encoding to the ordinal variable, contract type, listing its categories explicitly in ascending order.

#### 5. Combine the numerical, ordinal, and one-hot encoded columns into a single dataframe.

#### 6. Scale the features using StandardScaler.

#### 7. Split the data into training and test sets. The data is then ready for a machine learning model.

#### The following example creates a synthetic dataset and performs these steps in Python.

```python
# Generate a synthetic dataset for the variables above
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(543)
n = 1000  # dataset size
gender = np.random.choice(['Male', 'Female'], size=n)
age = np.random.randint(low=25, high=65, size=n)
contract = np.random.choice(['monthly', 'quarterly', 'half yearly', 'yearly'], size=n)
monthly_charges = np.random.normal(loc=1000, scale=100, size=n)
tenure = np.random.randint(low=12, high=36, size=n)
churn = np.random.choice([1, 0], p=[0.2, 0.8], size=n)

# Collect the columns in a dictionary
dct = {
    'gender': gender,
    'age': age,
    'contract': contract,
    'monthly_charges': monthly_charges,
    'tenure': tenure,
    'churn': churn
}

# Create the dataframe
df = pd.DataFrame(dct)
df.head()
```

| | gender | age | contract | monthly_charges | tenure | churn |
|---|---|---|---|---|---|---|
| 0 | Female | 34 | yearly | 1042.202157 | 25 | 1 |
| 1 | Female | 36 | half yearly | 966.337735 | 12 | 1 |
| 2 | Female | 30 | yearly | 1165.040177 | 16 | 0 |
| 3 | Female | 44 | quarterly | 1002.266319 | 35 | 0 |
| 4 | Male | 53 | half yearly | 1043.851952 | 16 | 1 |


```python
# Separate the features (X) from the target (Y)
X = df.drop(labels=['churn'], axis=1)
Y = df[['churn']]
```

```python
# Perform one-hot encoding on the gender column
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
gender_ohe = ohe.fit_transform(X[['gender']]).toarray()
# Wrap the encoded array in a dataframe with readable column names
X_gender = pd.DataFrame(gender_ohe, columns=ohe.get_feature_names_out())
X_gender.head()
```

| | gender_Female | gender_Male |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 |


```python
# Perform ordinal encoding on the contract type, with the order given explicitly
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder(categories=[['monthly', 'quarterly', 'half yearly', 'yearly']])
# Wrap the encoded values in a dataframe
contract_encoded = ord_enc.fit_transform(X[['contract']]).flatten()
X_contract = pd.DataFrame(contract_encoded, columns=['contract'])
X_contract.head()
```

| | contract |
|---|---|
| 0 | 3.0 |
| 1 | 2.0 |
| 2 | 3.0 |
| 3 | 1.0 |
| 4 | 2.0 |


```python
# Select the numeric variables
X_numeric = X.select_dtypes(exclude='object')
X_numeric.head()
```

| | age | monthly_charges | tenure |
|---|---|---|---|
| 0 | 34 | 1042.202157 | 25 |
| 1 | 36 | 966.337735 | 12 |
| 2 | 30 | 1165.040177 | 16 |
| 3 | 44 | 1002.266319 | 35 |
| 4 | 53 | 1043.851952 | 16 |


```python
# Concatenate the numerical, ordinal, and nominal (one-hot) columns
X_encoded = pd.concat([X_numeric, X_contract, X_gender], axis=1)
X_encoded.head()
```

| | age | monthly_charges | tenure | contract | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 0 | 34 | 1042.202157 | 25 | 3.0 | 1.0 | 0.0 |
| 1 | 36 | 966.337735 | 12 | 2.0 | 1.0 | 0.0 |
| 2 | 30 | 1165.040177 | 16 | 3.0 | 1.0 | 0.0 |
| 3 | 44 | 1002.266319 | 35 | 1.0 | 1.0 | 0.0 |
| 4 | 53 | 1043.851952 | 16 | 2.0 | 0.0 | 1.0 |


```python
# Apply StandardScaler to the entire encoded dataset.
# Note: this also rescales the encoded columns, and fitting on the full
# dataset before the split lets test-set statistics influence the scaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_final = pd.DataFrame(scaler.fit_transform(X_encoded), columns=X_encoded.columns)
X_final.head()
```

| | age | monthly_charges | tenure | contract | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 0 | -0.897852 | 0.446463 | 0.244225 | 1.354668 | 0.992032 | -0.992032 |
| 1 | -0.722935 | -0.322028 | -1.622281 | 0.462265 | 0.992032 | -0.992032 |
| 2 | -1.247688 | 1.690786 | -1.047972 | 1.354668 | 0.992032 | -0.992032 |
| 3 | -0.023264 | 0.041921 | 1.679999 | -0.430138 | 0.992032 | -0.992032 |
| 4 | 0.763865 | 0.463175 | -1.047972 | 0.462265 | -1.008032 | 1.008032 |


```python
# Train-test split, stratified on the target to keep the churn ratio
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X_final, Y, test_size=0.2, random_state=42, stratify=Y)
```

```python
xtrain.head()
```

| | age | monthly_charges | tenure | contract | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 514 | -0.635476 | -0.898498 | -1.478704 | 1.354668 | -1.008032 | 1.008032 |
| 224 | -1.07277 | -0.721926 | -1.191549 | -1.322542 | -1.008032 | 1.008032 |
| 845 | 1.463536 | -0.933469 | -0.617239 | 0.462265 | 0.992032 | -0.992032 |
| 736 | 1.026242 | -1.537408 | 0.962112 | -1.322542 | 0.992032 | -0.992032 |
| 792 | -0.722935 | -0.673444 | 0.818535 | -0.430138 | 0.992032 | -0.992032 |


```python
ytrain.head()
```

| | churn |
|---|---|
| 514 | 0 |
| 224 | 0 |
| 845 | 0 |
| 736 | 0 |
| 792 | 0 |


```python
xtest.head()
```

| | age | monthly_charges | tenure | contract | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 855 | 0.676407 | 1.817057 | 0.244225 | -0.430138 | 0.992032 | -0.992032 |
| 942 | 0.064195 | -0.282519 | -1.335126 | 1.354668 | 0.992032 | -0.992032 |
| 234 | -0.897852 | 1.087411 | -1.335126 | 0.462265 | 0.992032 | -0.992032 |
| 108 | -0.985311 | 1.01579 | -1.622281 | 0.462265 | 0.992032 | -0.992032 |
| 697 | 0.151654 | -1.651405 | -1.622281 | 1.354668 | 0.992032 | -0.992032 |


```python
ytest.head()
```

| | churn |
|---|---|
| 855 | 0 |
| 942 | 0 |
| 234 | 0 |
| 108 | 0 |
| 697 | 0 |


#### The data above is now ready for machine learning algorithms.
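
#### As an alternative sketch, the same preprocessing can be expressed with scikit-learn's `ColumnTransformer`, splitting first and fitting the transformers on the training set only so that no test-set statistics reach the scaler. This is not the workflow used above: it scales only the three numerical columns, regenerates a similar synthetic dataset with NumPy's `default_rng` (so the values differ from the tables above), and the names `preprocess`, `X_train_ready`, and `X_test_ready` are chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic data with the same columns as the example above
rng = np.random.default_rng(543)
n = 1000
contract_order = ["monthly", "quarterly", "half yearly", "yearly"]
df = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "age": rng.integers(25, 65, size=n),
    "contract": rng.choice(contract_order, size=n),
    "monthly_charges": rng.normal(1000, 100, size=n),
    "tenure": rng.integers(12, 36, size=n),
    "churn": rng.choice([1, 0], p=[0.2, 0.8], size=n),
})
X = df.drop(columns="churn")
y = df["churn"]

# Split before fitting any transformer
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One transformer per column group; sparse_threshold=0 forces a dense array
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "monthly_charges", "tenure"]),
        ("ord", OrdinalEncoder(categories=[contract_order]), ["contract"]),
        ("nom", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ],
    sparse_threshold=0,
)

# Fit on the training split only, then reuse the fitted transformers
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)

print(X_train_ready.shape)  # (800, 6)
print(X_test_ready.shape)   # (200, 6)
```

#### The fitted `preprocess` object can also be placed in a scikit-learn `Pipeline` ahead of a classifier, so that the same encoding is applied automatically at prediction time.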