# PARAMETERS OF ONE-HOT ENCODING

When preprocessing data for machine learning, a common step is converting categorical variables into a numerical 
format that algorithms can work with. The pandas function get_dummies is a popular tool for this. It creates one 
binary (0 or 1) column for each category in a categorical variable, which one-hot encodes the data. Let's break 
down the parameters of get_dummies with practical examples:

In [1]:
import pandas as pd

# Sample DataFrame with a categorical column 'Color'
data = {'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']}
df = pd.DataFrame(data)

# Using get_dummies to one-hot encode 'Color'
encoded_df = pd.get_dummies(df['Color'])

print(encoded_df)


   Blue  Green  Red
0     0      0    1
1     0      1    0
2     1      0    0
3     0      0    1
4     0      1    0
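
A note on reproducing these outputs: in pandas 2.0 and later, get_dummies returns boolean (True/False) columns by default, so the same call may print True/False instead of 0/1. The sketch below assumes you want the 0/1 form shown in this notebook and asks for it explicitly with the dtype parameter; the variable name encoded_int is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# dtype=int requests integer 0/1 indicators regardless of the pandas default
encoded_int = pd.get_dummies(df['Color'], dtype=int)

print(encoded_int)
```

The remaining cells should behave the same way if dtype=int is added to each call.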


In this example, the DataFrame df has a categorical column 'Color', and pd.get_dummies() one-hot encodes it 
into one column per color. Here are the key parameters:

# PREFIX PARAMETER

prefix: Adds a prefix to the names of the newly created dummy columns, which helps keep track of which variable 
the dummy columns came from. The default is None; for a Series input this leaves the bare category values as column names.

In [2]:
encoded_df = pd.get_dummies(df['Color'], prefix=None)


In [3]:
encoded_df

   Blue  Green  Red
0     0      0    1
1     0      1    0
2     1      0    0
3     0      0    1
4     0      1    0


In [4]:
encoded_df = pd.get_dummies(df['Color'], prefix='color')

In [5]:
encoded_df

   color_Blue  color_Green  color_Red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0
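
When a whole DataFrame is passed instead of a single Series, prefix can also be a dict that maps column names to prefixes, and if it is left as None the column names themselves are used as prefixes. The following is a small sketch of both cases; the two-column DataFrame and the short prefixes 'c' and 's' are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame, used only for this illustration
df2 = pd.DataFrame({'Color': ['Red', 'Green', 'Blue'],
                    'Size': ['Small', 'Medium', 'Large']})

# A dict gives each encoded column its own prefix
encoded = pd.get_dummies(df2, prefix={'Color': 'c', 'Size': 's'}, dtype=int)
print(encoded.columns.tolist())

# With prefix=None on a DataFrame, the column names act as prefixes
default_named = pd.get_dummies(df2, dtype=int)
print(default_named.columns.tolist())
```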


# PREFIX_SEP PARAMETER

prefix_sep: If you set a prefix, this parameter allows you to specify a separator between the prefix and the 
category name. The default is '_'.

In [6]:
encoded_df = pd.get_dummies(df['Color'], prefix='Color', prefix_sep='-')


In [7]:
encoded_df

   Color-Blue  Color-Green  Color-Red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0


This results in column names like 'Color-Red' and 'Color-Green' instead of the default 'Color_Red' and 'Color_Green'.
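
One situation where a non-default separator may help is when the category values themselves contain underscores, which makes it ambiguous where the prefix ends. The sketch below assumes a made-up category 'Light_Blue' and a double-underscore separator; splitting on that separator is just one possible way to recover the variable and category names.

```python
import pandas as pd

# 'Light_Blue' is a hypothetical category that already contains an underscore
df_sep = pd.DataFrame({'Color': ['Light_Blue', 'Red']})

# A separator that does not occur in the category names keeps the split unambiguous
encoded = pd.get_dummies(df_sep, prefix_sep='__', dtype=int)

# Split each column name back into (variable, category) at the first separator
pairs = [tuple(col.split('__', 1)) for col in encoded.columns]
print(pairs)
```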

# DUMMY_NA PARAMETER

dummy_na: Defaults to False, which means missing values get no dummy column of their own and their rows contain 
only zeros. If you set it to True, get_dummies adds a separate column that is 1 wherever the value is missing.

In [8]:
data = {'Color': ['Red', 'Green', None, 'Red', 'Green']}
df = pd.DataFrame(data)
encoded_df = pd.get_dummies(df['Color'], dummy_na=True)


In [9]:
encoded_df

   Green  Red  NaN
0      0    1    0
1      1    0    0
2      0    0    1
3      0    1    0
4      1    0    0


This creates a column labeled NaN that holds 1 where the value is missing. When a prefix is in effect, the column is named 'Color_nan' instead.
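
The name of the missing-value column depends on how get_dummies is called. As a rough sketch: encoding the column through the DataFrame applies the column name as a prefix, so the extra column is named 'Color_nan', and with dummy_na left at its default the row with the missing value is simply all zeros. The variable names below are illustrative.

```python
import pandas as pd

df_na = pd.DataFrame({'Color': ['Red', 'Green', None, 'Red', 'Green']})

# Encoding via the DataFrame prefixes the missing-value column with the column name
from_frame = pd.get_dummies(df_na, columns=['Color'], dummy_na=True, dtype=int)
print(from_frame.columns.tolist())

# Default dummy_na=False: the missing row has no 1 in any column
without_na = pd.get_dummies(df_na['Color'], dtype=int)
print(without_na.iloc[2].tolist())
```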

# COLUMNS PARAMETER

columns: You can specify the subset of columns from your DataFrame to be one-hot encoded. This is useful when you only 
    want to encode specific categorical columns and leave others as they are.

In [10]:
data = {'Color': ['Red', 'Green', 'Blue'], 'Size': ['Small', 'Medium', 'Large']}
df = pd.DataFrame(data)
encoded_df = pd.get_dummies(df, columns=['Color'])


In [11]:
encoded_df

     Size  Color_Blue  Color_Green  Color_Red
0   Small           0            0          1
1  Medium           0            1          0
2   Large           1            0          0


Here, only the 'Color' column is one-hot encoded; 'Size' is left unchanged.
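
For comparison, when columns is left unset, get_dummies encodes every column with an object, string, or category dtype and passes the remaining columns through unchanged. The sketch below adds a hypothetical numeric 'Price' column to make the difference visible.

```python
import pandas as pd

# 'Price' is a made-up numeric column added for this illustration
df_cols = pd.DataFrame({'Color': ['Red', 'Green', 'Blue'],
                        'Size': ['Small', 'Medium', 'Large'],
                        'Price': [10, 20, 30]})

# No columns argument: both text columns are encoded, 'Price' is kept as is
all_encoded = pd.get_dummies(df_cols, dtype=int)
print(all_encoded.columns.tolist())

# columns=['Color']: only 'Color' is encoded, 'Size' and 'Price' are kept as is
color_only = pd.get_dummies(df_cols, columns=['Color'], dtype=int)
print(color_only.columns.tolist())
```

In both results the untouched columns come first, followed by the new dummy columns.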

Remember that one-hot encoding adds one column per category, so it can sharply increase the dimensionality of your 
dataset, which may not suit every machine learning algorithm. Consider the impact on your model's performance and whether 
you need dimensionality reduction techniques like PCA or feature selection when working with one-hot encoded data.
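
To get a feel for the growth in width, the sketch below encodes a made-up 'City' column with 50 distinct values and checks the resulting shape. It also shows drop_first, a get_dummies parameter not covered above, which drops the first category's column; this is only a small saving and not a substitute for the techniques just mentioned.

```python
import pandas as pd

# Hypothetical high-cardinality column: 50 distinct city labels, 100 rows
cities = pd.Series([f'city_{i}' for i in range(50)] * 2, name='City')

# One dummy column per distinct category
encoded = pd.get_dummies(cities, dtype=int)
print(encoded.shape)

# drop_first=True removes the first category's column
reduced = pd.get_dummies(cities, drop_first=True, dtype=int)
print(reduced.shape)
```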