## About Categorical Data 

What is Categorical Data? <br>
Categorical data is data whose values fall into a limited set of groups or categories. Each observation belongs to exactly one category. The categories may be unordered (nominal) or ordered (ordinal).

Categorical data comes in two kinds, depending on whether the categories have an order:

Nominal data: This type of data names or labels a set of observations, with no order among the labels. Examples include gender, hair color, and eye color.<br>
Ordinal data: This type of data has a clear order or ranking. Examples include education level (high school, college, graduate), movie ratings (G, PG, PG-13, R), and survey responses (strongly disagree, disagree, neutral, agree, strongly agree).

Categorical data can also be classified by the number of categories:

Binary data: This type of data has only two categories, such as true/false, male/female, or 1/0.<br>
Multi-class data: This type of data has more than two categories, such as types of fruit, types of weather, or types of vehicle.
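As a minimal sketch of these distinctions, pandas can represent both nominal and ordinal data with `pd.Categorical`. The sample values below are hypothetical and are not taken from the dataset used later.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
hair = pd.Categorical(["brown", "black", "blond", "brown"])  # nominal: no order

education = pd.Categorical(
    ["college", "high school", "graduate", "college"],
    categories=["high school", "college", "graduate"],  # lowest to highest
    ordered=True,                                        # ordinal: order matters
)

print(hair.ordered)       # False
print(education.ordered)  # True

# Comparisons are only meaningful for ordered categoricals
print(education > "high school")

# Binary vs. multi-class is just a question of how many categories exist
print(len(hair.categories))       # 3, so multi-class
```

Declaring the order explicitly matters because pandas would otherwise sort the category labels alphabetically.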


How categorical data is encoded in a dataset affects how it can be analyzed and modeled. Common ways of encoding categorical data include:

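Numeric encoding: Each category is assigned a numerical value. For example, 0 for "red" and 1 for "blue". Because numbers imply an order, many models will treat "blue" as greater than "red", so this encoding suits ordinal or binary data best.<br>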
One-hot encoding: Each category is represented by a binary variable, with a value of 1 indicating the observation belongs to that category and a value of 0 indicating it does not.
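The two encodings can be compared side by side on a small example. This is only a sketch: the `colors` series and the particular integer mapping are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical color values, for illustration only
colors = pd.Series(["red", "blue", "green", "blue"], name="color")

# Numeric encoding: an explicit category -> integer mapping
color_codes = colors.map({"red": 0, "blue": 1, "green": 2})
print(color_codes.tolist())  # [0, 1, 2, 1]

# One-hot encoding: one 0/1 column per category
# dtype=int asks for 0/1 values; recent pandas versions default to booleans
color_onehot = pd.get_dummies(colors, prefix="color", dtype=int)
print(color_onehot)
```

Each row of `color_onehot` contains exactly one 1, in the column matching that row's category.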

## Import Library 

In [2]:
import pandas as pd

## Dataset

In [4]:
# https://www.kaggle.com/datasets/brendan45774/test-file
df = pd.read_csv("tested.csv")

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


Let's pick the 'Sex' column for one-hot encoding.

In [7]:
df["Sex"]

0        male
1      female
2        male
3        male
4      female
        ...  
413      male
414    female
415      male
416      male
417      male
Name: Sex, Length: 418, dtype: object

## One Hot Encoding

In [9]:
# Perform one-hot encoding on the 'Sex' column
# dtype=int gives 0/1 values (pandas 2.0+ returns booleans by default)
sex_encoded = pd.get_dummies(df["Sex"], prefix="Sex", dtype=int)

In [11]:
sex_encoded

| | Sex_female | Sex_male |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |
| ... | ... | ... |
| 413 | 0 | 1 |
| 414 | 1 | 0 |
| 415 | 0 | 1 |
| 416 | 0 | 1 |


In [10]:
# Concatenate the encoded column with the original dataframe
df = pd.concat([df, sex_encoded], axis=1)

In [12]:
df

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_female | Sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 | 1 |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 1 | 0 |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 | 1 |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 | 1 |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 0 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S | 0 | 1 |
| 414 | 1306 | 1 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C | 1 | 0 |
| 415 | 1307 | 0 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S | 0 | 1 |
| 416 | 1308 | 0 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S | 0 | 1 |


In [14]:
# Dropping the original 'Sex' column
df.drop(["Sex"], axis=1, inplace=True)

In [15]:
df

| | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_female | Sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 | 1 |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 1 | 0 |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 | 1 |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 | 1 |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 0 | 3 | Spector, Mr. Woolf | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S | 0 | 1 |
| 414 | 1306 | 1 | 1 | Oliva y Ocana, Dona. Fermina | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C | 1 | 0 |
| 415 | 1307 | 0 | 3 | Saether, Mr. Simon Sivertsen | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S | 0 | 1 |
| 416 | 1308 | 0 | 3 | Ware, Mr. Frederick | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S | 0 | 1 |
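The encode, concatenate, and drop steps can also be done in a single call by passing the whole DataFrame to `pd.get_dummies` with the `columns` argument. The sketch below uses a small stand-in frame (named `sample` here, an assumption for illustration) rather than the full CSV, so it runs on its own.

```python
import pandas as pd

# Small stand-in for the dataset above, so the example is self-contained
sample = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Encodes 'Sex', appends the new columns, and removes the original column
encoded = pd.get_dummies(sample, columns=["Sex"], dtype=int)
print(encoded)
```

The result keeps `PassengerId` and replaces `Sex` with `Sex_female` and `Sex_male`, matching the outcome of the step-by-step approach.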


## When NOT to use one-hot encoding


One-hot encoding is a popular technique for encoding categorical variables, but there are a few cases when it may not be the best option:

High cardinality: One-hot encoding can create a large number of new columns if the categorical variable has a large number of categories. This can lead to increased memory usage and decreased model interpretability.

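Collinearity: The one-hot columns of a variable always sum to 1, so any one of them is fully determined by the others (the "dummy variable trap"). This perfect collinearity can cause problems in linear models that include an intercept. Dropping one column per encoded variable avoids it.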

Small sample size: If the dataset is small, one-hot encoding can produce sparse data, in which many of the new columns have very few non-zero values. A model has little evidence to learn from for those categories.

Ordinal data: If the categorical variable is ordinal, such as level of education (high school, college, graduate), encoding the categories as ordered integers preserves the ranking, which one-hot encoding does not.

Information loss: One-hot encoding treats every category as equally different from every other, so any order or similarity between categories is lost. When that structure matters, other techniques such as ordinal encoding may retain more of it.

In such cases, it may be more beneficial to use other encoding techniques such as ordinal encoding, binary encoding, or target encoding. It is important to consider the characteristics of the data and the goals of the analysis when choosing an encoding method.
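A few of the points above can be sketched with pandas alone. The frame below is hypothetical: the `Education` column and its values are assumptions for illustration and do not exist in the dataset used earlier.

```python
import pandas as pd

# Hypothetical data, for illustration only
sample = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Education": ["college", "high school", "graduate", "college"],
})

# High cardinality: count distinct values per column before one-hot encoding
n_categories = sample.nunique()
print(n_categories)

# Collinearity: drop_first=True keeps k-1 columns for k categories
sex_reduced = pd.get_dummies(sample["Sex"], prefix="Sex", drop_first=True, dtype=int)
print(sex_reduced.columns.tolist())  # ['Sex_male']

# Ordinal data: encode as integers that follow the declared order
education_order = ["high school", "college", "graduate"]
sample["Education_code"] = pd.Categorical(
    sample["Education"], categories=education_order, ordered=True
).codes
print(sample["Education_code"].tolist())  # [1, 0, 2, 1]
```

With `drop_first=True`, the dropped category (`female` here) is represented by a 0 in the remaining column, so no information is lost.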