# Q1. What is data encoding? How is it useful in data science?

Data encoding is the process of converting data from one format or representation into another format, typically for the purpose of storage, transmission, or processing by a computer or data analysis tool. It involves translating data into a standardized format that can be easily interpreted and manipulated by software or algorithms.

### Usefulness
1. Compatibility: Data encoding ensures data is compatible with tools, libraries, or algorithms. For instance, it can convert categorical data (e.g., "red," "green," "blue") into numerical values (e.g., 1, 2, 3) for mathematical operations.

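2. Normalization: Encoding is often paired with scaling so that numerical features end up on a consistent scale. This matters when features are measured in different units, because many machine learning algorithms would otherwise give more weight to features with larger values.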

3. Privacy and Security: Encoding can conceal sensitive information while preserving analysis capabilities. Techniques like hashing and tokenization replace identifying values with surrogate values.

4. Reducing Dimensionality: Compact encodings keep the number of features manageable, and techniques like Principal Component Analysis (PCA) can further reduce the dimensionality of encoded, high-dimensional data, aiding visualization and machine learning.

5. Feature Engineering: Encoding contributes to feature engineering by creating new features from existing data. For example, one-hot encoding converts categorical data into binary features.

6. Text Processing: In natural language processing (NLP), text data is encoded into numerical vectors (word embeddings) for analysis.

7. Time Series Analysis: Encoding timestamps and time-related data is essential for time series analysis and forecasting.

8. Machine Learning Input: Most machine learning models require structured, numerical input, making encoding necessary to prepare raw data for training and prediction.
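A few of the uses listed above can be sketched with pandas. This is a minimal illustration, not part of the original notebook: the DataFrame, its column names (`color`, `signup`), and the integer mapping are all made up for the example.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "signup": pd.to_datetime(["2023-01-15", "2023-06-30", "2023-12-01", "2024-02-29"]),
})

# Compatibility: map each category to an integer code (the mapping is arbitrary).
color_codes = {"red": 1, "green": 2, "blue": 3}
df["color_code"] = df["color"].map(color_codes)

# Feature engineering: one-hot encode the same column into binary features.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Time series: break a timestamp into numeric components.
df["signup_year"] = df["signup"].dt.year
df["signup_month"] = df["signup"].dt.month
df["signup_dayofweek"] = df["signup"].dt.dayofweek  # Monday = 0

print(df)
print(one_hot)
```

Note that the integer codes imply an order (1 < 2 < 3) that colors do not actually have, which is why the one-hot version is usually safer for unordered categories.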


# Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal encoding is a method used to represent categorical data as binary vectors. In this encoding, each category is transformed into a binary vector with one element set to 1 and all others set to 0. It's called "nominal" because it deals with unordered categories or labels without any inherent order or ranking. It is also known as one-hot encoding.

In [29]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

In [30]:
data = {'Customer ID': [1, 2, 3, 4, 5],
        'Internet Service Type': ['DSL', 'Fiber Optic', 'DSL', 'No Internet', 'Fiber Optic']}

df = pd.DataFrame(data)
df

   Customer ID Internet Service Type
0            1                   DSL
1            2           Fiber Optic
2            3                   DSL
3            4           No Internet
4            5           Fiber Optic


In [31]:
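# dtype=int keeps the dummy columns as 0/1 (recent pandas versions default to True/False)
encoded_df = pd.get_dummies(df, columns=['Internet Service Type'], prefix='internet', dtype=int)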

print(encoded_df)

   Customer ID  internet_DSL  internet_Fiber Optic  internet_No Internet
0            1             1                     0                     0
1            2             0                     1                     0
2            3             1                     0                     0
3            4             0                     0                     1
4            5             0                     1                     0


# Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

In [20]:
df1 = pd.DataFrame(data)
df1

   Customer ID Internet Service Type
0            1                   DSL
1            2           Fiber Optic
2            3                   DSL
3            4           No Internet
4            5           Fiber Optic


In [23]:
# Initialize the OneHotEncoder (`sparse_output` replaces the older `sparse` argument in scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)

# Fit the encoder on the categorical column and transform it
encoded_data = encoder.fit_transform(df1[['Internet Service Type']])

# Create a DataFrame with the encoded data (`get_feature_names_out` replaces the removed `get_feature_names`)
encoded_df1 = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Internet Service Type']))

# Concatenate the encoded DataFrame with the original DataFrame
result_df1 = pd.concat([df1, encoded_df1], axis=1)

# Display the result
print(result_df1)

   Customer ID Internet Service Type  Internet Service Type_DSL  \
0            1                   DSL                        1.0   
1            2           Fiber Optic                        0.0   
2            3                   DSL                        1.0   
3            4           No Internet                        0.0   
4            5           Fiber Optic                        0.0   

   Internet Service Type_Fiber Optic  Internet Service Type_No Internet  
0                                0.0                                0.0  
1                                1.0                                0.0  
2                                0.0                                0.0  
3                                0.0                                1.0  
4                                1.0                                0.0  


# Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

A categorical column with 5 unique values can be transformed with either ordinal encoding or one-hot encoding. The right choice depends on whether the categories are ordered.

If the categorical variable has a natural order or ranking (for example, low < medium < high), ordinal encoding is appropriate. If it lacks a natural order or ranking, one-hot encoding is the better choice.

In general, one-hot encoding is the safer default because it does not assume any ordinal relationship between the categories. With only 5 unique values it adds just 5 binary columns, so dimensionality is not a concern here. One-hot encoding can, however, contribute to the curse of dimensionality when the number of unique values is very large.

When the number of unique values is large, a more compact scheme such as ordinal encoding avoids that growth. The trade-off is that it imposes an order on the categories, so it fits best when the variable is genuinely ordered.
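The two options can be compared side by side. The sketch below is illustrative only: the `size` and `city` columns and their values are hypothetical, chosen so that each has 5 unique values, one ordered and one unordered. It uses scikit-learn's `OrdinalEncoder` with an explicit category order and pandas' `get_dummies`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical columns, each with 5 unique values.
df = pd.DataFrame({
    "size": ["XS", "S", "M", "L", "XL", "M"],                             # ordered
    "city": ["Pune", "Delhi", "Mumbai", "Chennai", "Kolkata", "Delhi"],   # unordered
})

# Ordinal encoding: pass the order explicitly so the codes follow the ranking.
size_order = [["XS", "S", "M", "L", "XL"]]
ordinal = OrdinalEncoder(categories=size_order)
df["size_encoded"] = ordinal.fit_transform(df[["size"]]).ravel().astype(int)

# One-hot encoding: one binary column per city, with no implied order.
city_dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)

print(df[["size", "size_encoded"]])
print(city_dummies)
```

Passing `categories` explicitly matters: without it, `OrdinalEncoder` assigns codes in sorted (alphabetical) order, which would not match the intended ranking of sizes.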


# Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

When using nominal encoding (one-hot encoding), each categorical column is replaced by a set of binary columns, one per unique category. A column with n unique categories therefore produces n new columns. A common variant, dummy encoding, drops one of them and produces **n-1** columns, because the dropped category is implied when all the others are 0. The calculation below uses the n-1 convention.

First Categorical Column: Let's say this column has "n" unique categories. We create "n-1" new columns because the information from "n-1" columns is sufficient to represent all "n" categories (the last category is represented when all others are 0).

Second Categorical Column: Similarly, for the second categorical column with "m" unique categories, we create "m-1" new columns.

**Total number of new columns = (n-1) + (m-1)**

The question does not state how many unique categories each column has, so let's assume:

The first categorical column has 4 unique categories (n = 4).

The second categorical column has 3 unique categories (m = 3).

The total number of new columns created would be:

(n-1) + (m-1) = (4-1) + (3-1) = 3 + 2 = 5 new columns

Without dropping a category, the count would be n + m = 4 + 3 = 7 new columns. The number of rows (1000) does not affect either count.
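The count can be checked with pandas. The sketch below uses made-up data (the column names `cat_a`, `cat_b`, and `num` are hypothetical) with 4 and 3 unique categories. It compares `drop_first=True`, which gives the n-1 count used above, with the default behaviour, which keeps one column per category.

```python
import pandas as pd

# Hypothetical data: one column with 4 categories, one with 3, plus a numeric column.
df = pd.DataFrame({
    "cat_a": ["w", "x", "y", "z", "w", "x"],
    "cat_b": ["p", "q", "r", "p", "q", "r"],
    "num": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
cat_cols = ["cat_a", "cat_b"]

full = pd.get_dummies(df, columns=cat_cols, dtype=int)
dropped = pd.get_dummies(df, columns=cat_cols, drop_first=True, dtype=int)

# Count only the columns produced by the encoding, not the untouched numeric one.
n_numeric = df.shape[1] - len(cat_cols)
new_full = full.shape[1] - n_numeric
new_dropped = dropped.shape[1] - n_numeric

print(new_full)     # 4 + 3
print(new_dropped)  # (4 - 1) + (3 - 1)
```

In both cases the two original categorical columns are replaced by the encoded ones, so the final column count is the numeric columns plus the new columns.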



# Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

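One-hot encoding is the appropriate choice here, for the following reasons:

1. The variables species, habitat, and diet are nominal features with no natural order or ranking.

2. One-hot encoding does not assume any ordinal relationship among categories, so it does not introduce a false ranking between, say, two habitats.

3. Because these variables have no natural order or ranking, there is no meaningful sequence for ordinal encoding to preserve.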
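As a minimal sketch of this choice, the example below one-hot encodes a small, made-up animal table with `pd.get_dummies`. The rows and category values are hypothetical and only serve to show the shape of the result.

```python
import pandas as pd

# Hypothetical animal records; values are illustrative only.
animals = pd.DataFrame({
    "species": ["lion", "eagle", "shark", "deer"],
    "habitat": ["savanna", "mountain", "ocean", "forest"],
    "diet": ["carnivore", "carnivore", "carnivore", "herbivore"],
})

# One binary column per category in each of the three nominal features.
encoded = pd.get_dummies(animals, columns=["species", "habitat", "diet"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1 per original feature. If a feature such as species had a very large number of distinct values, the number of columns would grow accordingly.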



# Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

In [27]:
import pandas as pd


data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Contract Type': ['Month-to-Month', 'One Year', 'Two Year', 'Month-to-Month', 'One Year'],
    'Monthly Charges': [65.5, 85.2, 75.0, 68.3, 92.5],
    'Tenure': [12, 24, 6, 18, 36]
}


df = pd.DataFrame(data)
df

   Gender   Contract Type  Monthly Charges  Tenure
0    Male  Month-to-Month             65.5      12
1  Female        One Year             85.2      24
2    Male        Two Year             75.0       6
3  Female  Month-to-Month             68.3      18
4    Male        One Year             92.5      36


In [28]:
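# Step 1: binary encoding for "Gender" (two categories, so a single 0/1 column is enough)
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Step 2: one-hot encoding for "Contract Type" (three unordered categories)
# dtype=int keeps the dummy columns as 0/1 (recent pandas versions default to True/False)
df = pd.get_dummies(df, columns=['Contract Type'], prefix='Contract', dtype=int)

# The numerical columns ("Monthly Charges", "Tenure") need no encoding and are left unchanged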


print(df)

   Gender  Monthly Charges  Tenure  Contract_Month-to-Month  \
0       0             65.5      12                        1   
1       1             85.2      24                        0   
2       0             75.0       6                        0   
3       1             68.3      18                        1   
4       0             92.5      36                        0   

   Contract_One Year  Contract_Two Year  
0                  0                  0  
1                  1                  0  
2                  0                  1  
3                  0                  0  
4                  1                  0  
