# [Categorical Data](#)

Categorical data is a type of data that can take on a limited, and usually fixed, number of possible values. These values are called categories or levels. In data analysis, categorical data is common and can represent various types of information such as:

- Qualitative attributes (e.g., colors, brands, countries)
- Ordinal rankings (e.g., low/medium/high, satisfaction ratings)
- Binned numerical data (e.g., age groups, income brackets)


<img src="../images/categorical.png" width="800">

Pandas provides a special data type called `Categorical` to efficiently store and manipulate categorical data. Using the `Categorical` data type offers several advantages:

1. **Memory efficiency**: For datasets with many repeated values, categorical data can significantly reduce memory usage.
2. **Performance**: Many operations on categorical data are faster than on object dtype or strings.
3. **Convenience**: Categorical data provides built-in methods for common operations like reordering or grouping.


In this lecture, we'll explore how to create, manipulate, and analyze categorical data in Pandas. We'll cover various techniques for working with categories, including:

- Creating and converting categorical data
- Accessing and modifying categories
- Sorting and comparing categorical data
- Encoding categorical variables for machine learning
- Grouping, aggregating, and visualizing categorical data


Let's start by importing Pandas and creating a simple dataset with categorical data:


In [1]:
import pandas as pd
import numpy as np

In [2]:
# Create a sample DataFrame with categorical data
df = pd.DataFrame({
    'ID': range(1, 11),
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green', 'Blue', 'Red', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Large', 'Medium', 'Small', 'Large', 'Medium'],
    'Rating': ['Good', 'Excellent', 'Fair', 'Good', 'Excellent', 'Fair', 'Good', 'Fair', 'Excellent', 'Good']
})

df

Unnamed: 0,ID,Color,Size,Rating
0,1,Red,Small,Good
1,2,Blue,Medium,Excellent
2,3,Green,Large,Fair
3,4,Red,Medium,Good
4,5,Green,Small,Excellent
5,6,Blue,Large,Fair
6,7,Red,Medium,Good
7,8,Green,Small,Fair
8,9,Blue,Large,Excellent
9,10,Red,Medium,Good


In [3]:
df.dtypes

ID         int64
Color     object
Size      object
Rating    object
dtype: object

In this example, 'Color', 'Size', and 'Rating' are categorical variables. By default, Pandas stores these as object dtype, but we can convert them to `Categorical` for more efficient processing and additional functionality.


In [4]:
# Convert columns to categorical
df['Color'] = pd.Categorical(df['Color'])
df['Size'] = pd.Categorical(df['Size'], categories=['Small', 'Medium', 'Large'], ordered=True)
df['Rating'] = pd.Categorical(df['Rating'], categories=['Fair', 'Good', 'Excellent'], ordered=True)

df

Unnamed: 0,ID,Color,Size,Rating
0,1,Red,Small,Good
1,2,Blue,Medium,Excellent
2,3,Green,Large,Fair
3,4,Red,Medium,Good
4,5,Green,Small,Excellent
5,6,Blue,Large,Fair
6,7,Red,Medium,Good
7,8,Green,Small,Fair
8,9,Blue,Large,Excellent
9,10,Red,Medium,Good


In [5]:
# Check the column dtypes after conversion
df.dtypes

ID           int64
Color     category
Size      category
Rating    category
dtype: object

All three columns now have the `category` dtype. 'Size' and 'Rating' were also given an explicit order. The `dtypes` output does not show this order, but Pandas uses it when sorting and comparing values.


Throughout this lecture, we'll use this dataset and others to demonstrate various aspects of working with categorical data in Pandas. Understanding how to effectively manipulate categorical data is crucial for many data analysis tasks, from exploratory data analysis to preparing data for machine learning models.

## <a id='toc1_'></a>[Creating and Converting Categorical Data](#toc0_)

Pandas offers multiple ways to create and convert categorical data. Let's explore these methods in detail.


### <a id='toc1_1_'></a>[Creating Categorical Data from Scratch](#toc0_)


You can create categorical data directly using the `pd.Categorical()` function:


In [6]:
# Create a Categorical from a list of values (categories are inferred and sorted)
colors = pd.Categorical(['Red', 'Blue', 'Green', 'Red', 'Green'])
colors

['Red', 'Blue', 'Green', 'Red', 'Green']
Categories (3, object): ['Blue', 'Green', 'Red']

In [7]:
# Create a Categorical with explicitly specified categories
sizes = pd.Categorical(['Medium', 'Large', 'Small', 'Medium', 'Small'],
                       categories=['Small', 'Medium', 'Large'])
sizes

['Medium', 'Large', 'Small', 'Medium', 'Small']
Categories (3, object): ['Small', 'Medium', 'Large']

In [8]:
# Create an ordered Categorical ('Poor' is a valid category even though it never occurs)
ratings = pd.Categorical(['Fair', 'Good', 'Excellent', 'Good', 'Fair'],
                         categories=['Poor', 'Fair', 'Good', 'Excellent'],
                         ordered=True)
ratings

['Fair', 'Good', 'Excellent', 'Good', 'Fair']
Categories (4, object): ['Poor' < 'Fair' < 'Good' < 'Excellent']
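
Binned numerical data, mentioned in the introduction, can be produced with `pd.cut()`, which returns an ordered categorical. The sketch below is illustrative only: the ages, bin edges, and labels are hypothetical values, not part of the lecture's dataset.

```python
import pandas as pd

# Hypothetical ages and bin edges, chosen only for illustration
ages = [5, 17, 18, 30, 65, 80]
age_groups = pd.cut(
    ages,
    bins=[0, 18, 65, 120],                 # intervals (0, 18], (18, 65], (65, 120]
    labels=['Child', 'Adult', 'Senior'],   # one label per interval
)

print(age_groups)
print(age_groups.ordered)  # pd.cut returns an ordered Categorical by default
```

Because the intervals are closed on the right by default, an age of exactly 18 falls into the first bin.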

### <a id='toc1_2_'></a>[Converting Existing Columns to Categorical](#toc0_)


You can convert existing columns in a DataFrame to categorical type:


In [9]:
# Create a sample DataFrame
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
    'Rating': ['Good', 'Excellent', 'Fair', 'Good', 'Excellent']
})
df

Unnamed: 0,Color,Size,Rating
0,Red,Small,Good
1,Blue,Medium,Excellent
2,Green,Large,Fair
3,Red,Medium,Good
4,Green,Small,Excellent


In [10]:
# Convert a single column to categorical
df['Color'] = df['Color'].astype('category')

In [11]:
# Convert multiple columns to categorical
df[['Size', 'Rating']] = df[['Size', 'Rating']].astype('category')

In [12]:
df.dtypes


Color     category
Size      category
Rating    category
dtype: object

You can also use the `pd.Categorical()` function to convert columns:


In [13]:
df['Color'] = pd.Categorical(df['Color'])
df['Color']

0      Red
1     Blue
2    Green
3      Red
4    Green
Name: Color, dtype: category
Categories (3, object): ['Blue', 'Green', 'Red']
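
When several columns share the same scale, a reusable `CategoricalDtype` may be more convenient than repeating the category list. The following is a minimal sketch; the `sizes` frame and its column names are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A reusable dtype: the same categories and order can be applied to many columns
size_dtype = CategoricalDtype(categories=['Small', 'Medium', 'Large'], ordered=True)

# Hypothetical frame with two columns that share the same scale
sizes = pd.DataFrame({
    'Shirt': ['Small', 'Large', 'Medium'],
    'Jacket': ['Medium', 'Medium', 'Small'],
})
sizes = sizes.astype({'Shirt': size_dtype, 'Jacket': size_dtype})

print(sizes.dtypes)
print(sizes['Shirt'].cat.categories)
```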

### <a id='toc1_3_'></a>[Specifying Category Order](#toc0_)


For ordinal categorical data, you can specify the order of categories:


In [14]:
# Create ordered categorical data
df['Size'] = pd.Categorical(
    df['Size'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True
)

In [15]:
df['Rating'] = pd.Categorical(
    df['Rating'],
    categories=['Poor', 'Fair', 'Good', 'Excellent'],
    ordered=True
)


In [16]:
# Check that the column is ordered and inspect the category order
print(df['Size'].cat.ordered)
df['Size'].cat.categories


True
Index(['Small', 'Medium', 'Large'], dtype='object')

In [17]:
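# Compare each value against a category; the comparison follows the declared order.
# (Scalar access such as df['Size'][0] returns a plain string, so
# df['Size'][0] < df['Size'][1] would compare 'Small' and 'Medium' alphabetically.)
df['Size'] < 'Large'

0     True
1     True
2    False
3     True
4     True
Name: Size, dtype: bool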

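You can also reorder the categories of an existing categorical column, or mark them as ordered, with `reorder_categories()`. The new list must contain exactly the existing categories:


In [18]:
# Keep the existing order (Blue < Green < Red) but mark the column as ordered
df['Color'] = df['Color'].cat.reorder_categories(['Blue', 'Green', 'Red'], ordered=True)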
df['Color'].cat.categories

Index(['Blue', 'Green', 'Red'], dtype='object')

When working with categorical data, specifying the order can be crucial for correct sorting and comparison operations. It's particularly useful for ordinal data where there's a natural ordering of categories.


In [19]:
# Sort the DataFrame by the ordered categorical columns
df.sort_values(['Rating', 'Size'])

Unnamed: 0,Color,Size,Rating
2,Green,Large,Fair
0,Red,Small,Good
3,Red,Medium,Good
4,Green,Small,Excellent
1,Blue,Medium,Excellent
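
An ordered categorical also supports comparisons against a category label, as well as `min()` and `max()`. The sketch below uses a standalone `ratings` Series with the same rating scale as above; the Series itself is an illustrative assumption.

```python
import pandas as pd

ratings = pd.Series(pd.Categorical(
    ['Good', 'Excellent', 'Fair', 'Good', 'Excellent'],
    categories=['Poor', 'Fair', 'Good', 'Excellent'],
    ordered=True,
))

# Comparisons against a category label follow the declared order
at_least_good = ratings[ratings >= 'Good']

# min() and max() are defined only for ordered categoricals
lowest, highest = ratings.min(), ratings.max()

print(at_least_good.tolist())
print(lowest, highest)
```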


By using these methods to create and convert categorical data, you can ensure that your data is stored efficiently and that any inherent order in your categories is preserved. This sets the foundation for more advanced categorical data manipulation and analysis techniques that we'll explore in the following sections.

## <a id='toc2_'></a>[Working with Categorical Data](#toc0_)

Once you have created categorical data, Pandas provides various methods to manipulate and analyze it. Let's explore some common operations.


### <a id='toc2_1_'></a>[Accessing Categories](#toc0_)


You can access the categories of a categorical column using the `.cat.categories` attribute:


In [20]:
# Create a sample DataFrame with categorical data
df = pd.DataFrame({
    'Color': pd.Categorical(['Red', 'Blue', 'Green', 'Red', 'Blue']),
    'Size': pd.Categorical(
        ['Small', 'Medium', 'Large', 'Medium', 'Small'],
        categories=['Small', 'Medium', 'Large'],
        ordered=True
    )
})
df

Unnamed: 0,Color,Size
0,Red,Small
1,Blue,Medium
2,Green,Large
3,Red,Medium
4,Blue,Small


In [21]:
# Access categories
print(df['Color'].cat.categories)
df['Size'].cat.categories


Index(['Blue', 'Green', 'Red'], dtype='object')
Index(['Small', 'Medium', 'Large'], dtype='object')

In [22]:
# Check if the categorical data is ordered
print(df['Color'].cat.ordered)
df['Size'].cat.ordered


False
True

In [23]:
# Get category codes (underlying integer representation)
df['Color'].cat.codes

0    2
1    0
2    1
3    2
4    0
dtype: int8
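
Each code is simply a position in the category list, so the two can be mapped back and forth. The following sketch rebuilds the 'Color' values from their codes; the variable names `code_to_label` and `rebuilt` are introduced here for illustration.

```python
import pandas as pd

colors = pd.Series(pd.Categorical(['Red', 'Blue', 'Green', 'Red', 'Blue']))

# Code i refers to categories[i]
code_to_label = dict(enumerate(colors.cat.categories))
print(code_to_label)

# Rebuild an equivalent categorical from the codes and the category list
rebuilt = pd.Categorical.from_codes(
    colors.cat.codes.to_numpy(),
    categories=colors.cat.categories,
)
print(list(rebuilt) == list(colors))
```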

### <a id='toc2_2_'></a>[Adding and Removing Categories](#toc0_)


You can add new categories or remove existing ones:


In [24]:
# Add a new category
df['Color'] = df['Color'].cat.add_categories(['Yellow'])
df['Color'].cat.categories

Index(['Blue', 'Green', 'Red', 'Yellow'], dtype='object')

In [25]:
# Remove a category
df['Color'] = df['Color'].cat.remove_categories(['Green'])
df['Color'].cat.categories

Index(['Blue', 'Red', 'Yellow'], dtype='object')

In [26]:
# Remove unused categories
df['Color'] = df['Color'].cat.remove_unused_categories()
df['Color'].cat.categories

Index(['Blue', 'Red'], dtype='object')

In [27]:
# Set the full list of categories (any existing category left out is removed and its values become NaN)
df['Size'] = df['Size'].cat.set_categories(['Tiny', 'Small', 'Medium', 'Large', 'Huge'])
df['Size'].cat.categories

Index(['Tiny', 'Small', 'Medium', 'Large', 'Huge'], dtype='object')

In [28]:
# Add multiple categories at once
df['Color'] = df['Color'].cat.add_categories(['Green', 'Purple', 'Orange'])
df['Color'].cat.categories

Index(['Blue', 'Red', 'Green', 'Purple', 'Orange'], dtype='object')

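Note that adding a category only changes the set of possible values; the existing data is left untouched. Removing a category that is still in use, however, replaces the affected values with NaN, which is why the row that held 'Green' is now missing.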
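
A small check of how these operations affect the values themselves, using a standalone `colors` Series (an illustrative assumption): adding a category leaves the data unchanged, whereas removing a category that is still in use turns the affected values into NaN.

```python
import pandas as pd

colors = pd.Series(pd.Categorical(['Red', 'Blue', 'Green', 'Red', 'Blue']))

# Adding a category leaves every existing value untouched
with_yellow = colors.cat.add_categories(['Yellow'])

# Removing a category that is in use turns those values into NaN
without_green = colors.cat.remove_categories(['Green'])

print(with_yellow.tolist())
print(without_green.tolist())
print(without_green.isna().sum())  # number of values lost to the removal
```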


### <a id='toc2_3_'></a>[Renaming Categories](#toc0_)


You can rename categories using the `rename_categories()` method:


In [29]:
# Rename individual categories
df['Color'] = df['Color'].cat.rename_categories({'Red': 'Crimson', 'Blue': 'Navy'})
df['Color'].cat.categories

Index(['Navy', 'Crimson', 'Green', 'Purple', 'Orange'], dtype='object')

In [30]:
# Rename all categories at once with a list (must match the number of existing categories)
df['Size'] = df['Size'].cat.rename_categories(['XS', 'S', 'M', 'L', 'XL'])
df['Size'].cat.categories

Index(['XS', 'S', 'M', 'L', 'XL'], dtype='object')

In [31]:
# Rename categories using a function
df['Color'] = df['Color'].cat.rename_categories(lambda x: x.upper())
df['Color'].cat.categories

Index(['NAVY', 'CRIMSON', 'GREEN', 'PURPLE', 'ORANGE'], dtype='object')

With a dictionary, only the categories listed as keys are renamed; all others are left unchanged:


In [32]:
# Create a new categorical column
df['Rating'] = pd.Categorical(['Good', 'Fair', 'Excellent', 'Good', 'Fair'])

In [33]:
# Rename specific categories
df['Rating'] = df['Rating'].cat.rename_categories({'Good': 'Average', 'Excellent': 'Outstanding'})
df['Rating'].cat.categories

Index(['Outstanding', 'Fair', 'Average'], dtype='object')

These operations allow you to flexibly modify your categorical data as needed. Remember that some operations, like removing categories, can result in missing data if the removed categories were present in your data. Always check your data after performing these operations to ensure the results are as expected.


In [34]:
# Display the updated DataFrame
df

Unnamed: 0,Color,Size,Rating
0,CRIMSON,S,Average
1,NAVY,M,Fair
2,,L,Outstanding
3,CRIMSON,M,Average
4,NAVY,S,Fair


By mastering these techniques, you can efficiently manage and modify your categorical data, preparing it for further analysis or visualization tasks.

## <a id='toc3_'></a>[Sorting and Ordering Categorical Data](#toc0_)

Sorting and ordering categorical data is a common task in data analysis. Pandas provides several methods to sort categorical data, both for ordered and unordered categories.


### <a id='toc3_1_'></a>[Sorting with Ordered Categories](#toc0_)


When your categorical data has a specified order, sorting becomes straightforward:


In [35]:
# Create a DataFrame with ordered categorical data
df = pd.DataFrame({
    'Size': pd.Categorical(['Medium', 'Small', 'Large', 'Small', 'Medium'],
                           categories=['Small', 'Medium', 'Large'],
                           ordered=True),
    'Count': [5, 2, 8, 3, 4]
})
df

Unnamed: 0,Size,Count
0,Medium,5
1,Small,2
2,Large,8
3,Small,3
4,Medium,4


In [36]:
# Sort the DataFrame by the 'Size' column
df.sort_values('Size')

Unnamed: 0,Size,Count
1,Small,2
3,Small,3
0,Medium,5
4,Medium,4
2,Large,8


In [37]:
# Sort in descending order
df.sort_values('Size', ascending=False)

Unnamed: 0,Size,Count
2,Large,8
0,Medium,5
4,Medium,4
1,Small,2
3,Small,3


In this case, the sorting respects the order specified in the categories: Small < Medium < Large.


### <a id='toc3_2_'></a>[Sorting with Unordered Categories](#toc0_)


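For unordered categories, sorting follows the order of the category list. When Pandas infers the categories from string data it sorts them, so the default result is alphabetical: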


In [38]:
# Create a DataFrame with unordered categorical data
df = pd.DataFrame({
    'Color': pd.Categorical(['Red', 'Blue', 'Green', 'Blue', 'Red']),
    'Value': [10, 20, 15, 25, 5]
})

# Sort by the 'Color' column
df.sort_values('Color')

Unnamed: 0,Color,Value
1,Blue,20
3,Blue,25
2,Green,15
0,Red,10
4,Red,5
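
Sorting follows the position of each value in the category list, even when the categorical is not marked as ordered. The sketch below passes an explicit, non-alphabetical category list to show this; the list itself is an arbitrary choice for illustration.

```python
import pandas as pd

# Unordered categorical with an explicit, non-alphabetical category list
colors = pd.Series(pd.Categorical(
    ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    categories=['Red', 'Green', 'Blue'],
))

print(colors.cat.ordered)              # False: no order is declared
print(colors.sort_values().tolist())   # sorted by position in the category list
```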


### <a id='toc3_3_'></a>[Customizing the Sort Order](#toc0_)


You can customize the sort order by specifying a category order:


In [39]:
# Specify a custom order for sorting
custom_order = ['Green', 'Blue', 'Red']
df['Color'] = df['Color'].cat.reorder_categories(custom_order, ordered=True)

# Now sort using the custom order
df.sort_values('Color')

Unnamed: 0,Color,Value
2,Green,15
1,Blue,20
3,Blue,25
0,Red,10
4,Red,5


### <a id='toc3_4_'></a>[Sorting with Multiple Columns](#toc0_)


You can sort by multiple columns, including a mix of categorical and non-categorical data:


In [40]:
# Sort by 'Color' (categorical) and then by 'Value' (numeric)
df.sort_values(['Color', 'Value'])

Unnamed: 0,Color,Value
2,Green,15
1,Blue,20
3,Blue,25
4,Red,5
0,Red,10
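
The sort direction can also be set per column by passing a list to `ascending`. A minimal sketch, rebuilding the same data in a standalone frame named `df_colors` (a name introduced here):

```python
import pandas as pd

df_colors = pd.DataFrame({
    'Color': pd.Categorical(
        ['Red', 'Blue', 'Green', 'Blue', 'Red'],
        categories=['Green', 'Blue', 'Red'],
        ordered=True,
    ),
    'Value': [10, 20, 15, 25, 5],
})

# Color ascending (Green < Blue < Red), Value descending within each color
result = df_colors.sort_values(['Color', 'Value'], ascending=[True, False])
print(result)
```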


### <a id='toc3_5_'></a>[Sorting Index](#toc0_)


You can also sort the index if it's categorical:


In [41]:
# Create a DataFrame with a categorical index
df = pd.DataFrame({
    'Value': [10, 20, 30, 40, 50]
}, index=pd.CategoricalIndex(
        ['Medium', 'Small', 'Large', 'Tiny', 'Huge'],
        categories=['Tiny', 'Small', 'Medium', 'Large', 'Huge'],
        ordered=True
    )
)

# Sort by index
df.sort_index()

Unnamed: 0,Value
Tiny,40
Small,20
Medium,10
Large,30
Huge,50


### <a id='toc3_6_'></a>[NaN Handling](#toc0_)


By default, NaN values are placed at the end when sorting:

In [42]:
# Create a DataFrame with NaN values
df = pd.DataFrame({
    'Category': pd.Categorical(['A', 'B', np.nan, 'C', 'B', np.nan],
                               categories=['C', 'B', 'A'],
                               ordered=True)
})
df

Unnamed: 0,Category
0,A
1,B
2,
3,C
4,B
5,


In [43]:
# Sort the DataFrame
df.sort_values('Category')


Unnamed: 0,Category
3,C
1,B
4,B
0,A
2,
5,


In [44]:
# Place NaN values at the beginning
df.sort_values('Category', na_position='first')


Unnamed: 0,Category
2,
5,
3,C
1,B
4,B
0,A
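
Missing values are not stored as a category. In the integer codes they appear as -1, which can be checked with a short sketch on the same data (held here in a standalone Series named `cat`):

```python
import numpy as np
import pandas as pd

cat = pd.Series(pd.Categorical(
    ['A', 'B', np.nan, 'C', 'B', np.nan],
    categories=['C', 'B', 'A'],
    ordered=True,
))

print(cat.cat.categories.tolist())   # NaN is not listed as a category
print(cat.cat.codes.tolist())        # missing values get the code -1
print(cat.isna().sum())              # number of missing values
```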


Understanding how to sort and order categorical data is crucial for data analysis and presentation. It allows you to arrange your data in meaningful ways, whether you're preparing it for visualization, reporting, or further analysis.

## <a id='toc4_'></a>[Memory Efficiency of Categorical Data](#toc0_)

One of the key advantages of using categorical data in Pandas is its memory efficiency, especially when dealing with datasets that contain repeated values. Let's explore this concept in detail.


### <a id='toc4_1_'></a>[Understanding Memory Usage](#toc0_)


To demonstrate the memory efficiency of categorical data, we'll compare it with other data types:


In [45]:
# Create a large DataFrame with repeated string values
n = 1_000_000
df = pd.DataFrame({
    'ID': range(n),
    'Category': np.random.choice(['A', 'B', 'C', 'D', 'E'], n)
})
df

Unnamed: 0,ID,Category
0,0,C
1,1,D
2,2,D
3,3,E
4,4,D
...,...,...
999995,999995,C
999996,999996,A
999997,999997,B
999998,999998,E


In [46]:
# Check memory usage
df.memory_usage(deep=True)

Index            128
ID           8000000
Category    58000000
dtype: int64

In [47]:
# Convert 'Category' to categorical
df['Category_Cat'] = df['Category'].astype('category')

In [48]:
# Compare memory usage
df.memory_usage(deep=True)

Index                128
ID               8000000
Category        58000000
Category_Cat     1000462
dtype: int64

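The categorical column takes about 1 MB, compared with 58 MB for the object column, a reduction of roughly 98%. Each row now stores a one-byte integer code, and each distinct string is stored only once in the list of categories.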
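
To see where the saving comes from, the codes can be inspected directly. This is a rough sketch on a smaller, separately generated column; the size `n` and the seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

n = 100_000  # smaller than the example above, to keep the sketch quick
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(['A', 'B', 'C', 'D', 'E'], n), dtype=object)
as_cat = labels.astype('category')

codes = as_cat.cat.codes
print(codes.dtype)    # smallest signed integer type that fits the categories
print(codes.nbytes)   # one code per row
print(labels.memory_usage(index=False, deep=True))   # bytes as object strings
print(as_cat.memory_usage(index=False, deep=True))   # bytes as categorical
```

With only five categories the codes fit in `int8`, so the categorical costs about one byte per row plus a small fixed amount for the category list.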


### <a id='toc4_2_'></a>[Factors Affecting Memory Efficiency](#toc0_)


1. **Number of unique categories**: The fewer unique categories you have relative to the total number of rows, the more memory you save.

2. **Length of category names**: Longer category names lead to greater memory savings when converted to categorical.

3. **Number of rows**: The memory savings become more pronounced as the number of rows increases.


Let's illustrate these factors:


In [49]:
# Build a DataFrame holding the same values as strings and as a categorical,
# then report the memory used by each column
def measure_memory(n_rows, n_categories, cat_length):
    letters = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
    # Random category names of the requested length
    categories = [''.join(np.random.choice(letters, cat_length)) for _ in range(n_categories)]
    # Draw the values once so both columns store identical data
    values = np.random.choice(categories, n_rows)
    df = pd.DataFrame({
        'String': values,
        'Categorical': pd.Categorical(values)
    })
    mem_usage = df.memory_usage(deep=True)
    print(f"String column: {mem_usage['String'] / 1e6:.2f} MB")
    print(f"Categorical column: {mem_usage['Categorical'] / 1e6:.2f} MB")
    print(f"Memory saved: {(1 - mem_usage['Categorical'] / mem_usage['String']) * 100:.2f}%")

In [50]:
# Test with different scenarios
print("Scenario 1: Many rows, few categories")
measure_memory(n_rows=1_000_000, n_categories=5, cat_length=3)

Scenario 1: Many rows, few categories
String column: 60.00 MB
Categorical column: 1.00 MB
Memory saved: 98.33%


In [51]:
print("\nScenario 2: Many rows, many categories")
measure_memory(n_rows=1_000_000, n_categories=1000, cat_length=3)


Scenario 2: Many rows, many categories
String column: 60.00 MB
Categorical column: 2.09 MB
Memory saved: 96.51%


In [52]:
print("\nScenario 3: Many rows, few categories, long category names")
measure_memory(n_rows=1_000_000, n_categories=5, cat_length=20)



Scenario 3: Many rows, few categories, long category names
String column: 77.00 MB
Categorical column: 1.00 MB
Memory saved: 98.70%


### <a id='toc4_3_'></a>[When to Use Categorical Data](#toc0_)


Categorical data is most beneficial when:

1. Your data has a limited number of unique values that are repeated many times.
2. You're working with large datasets where memory usage is a concern.
3. You need to perform operations that can benefit from the ordered nature of categories.


However, for columns with mostly unique values or very few rows, the memory savings might be negligible or even negative due to the overhead of the categorical data structure.
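
The effect of cardinality can be checked with a rough comparison. In this sketch, `category_ratio` is a hypothetical helper (not a Pandas function) that divides the categorical memory by the object memory for the same values; a ratio close to or above 1 means little or no saving.

```python
import numpy as np
import pandas as pd

def category_ratio(values):
    """Return categorical memory divided by object memory for the same values."""
    as_obj = pd.Series(values, dtype=object)
    as_cat = as_obj.astype('category')
    return (as_cat.memory_usage(index=False, deep=True)
            / as_obj.memory_usage(index=False, deep=True))

n = 10_000
rng = np.random.default_rng(0)
repeated = rng.choice(['A', 'B', 'C', 'D', 'E'], n).tolist()  # 5 distinct values
unique = [f'id_{i}' for i in range(n)]                         # every value distinct

print(f"repeated values: {category_ratio(repeated):.2f}")
print(f"unique values:   {category_ratio(unique):.2f}")
```

The exact ratios depend on the Pandas and Python versions, but the column of unique values should show a much smaller saving than the column of repeated values.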


### <a id='toc4_4_'></a>[Impact on Performance](#toc0_)


Besides memory efficiency, categorical data can also improve performance for certain operations:


In [53]:
# Create a large DataFrame
n = 5_000_000
df = pd.DataFrame({
    'String': np.random.choice(['A', 'B', 'C', 'D', 'E'], n),
    'Categorical': pd.Categorical(np.random.choice(['A', 'B', 'C', 'D', 'E'], n))
})
df

Unnamed: 0,String,Categorical
0,A,B
1,E,C
2,E,E
3,B,C
4,D,E
...,...,...
4999995,A,B
4999996,D,A
4999997,E,A
4999998,D,E


In [54]:
%timeit df['String'].value_counts()

103 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [55]:
%timeit df['Categorical'].value_counts()

14.1 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


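In this run, `value_counts()` was about seven times faster on the categorical column (14.1 ms versus 103 ms). Operations such as `groupby()` and sorting often benefit in the same way, because Pandas can work on the integer codes instead of comparing strings.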
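
A similar comparison can be made for `groupby()`. The sketch below uses a single `time.perf_counter()` measurement per variant, so the numbers are noisy; `%timeit` gives more reliable figures in a notebook. The frame `df_perf`, the helper `timed`, and the size `n` are assumptions made for this example.

```python
import time
import numpy as np
import pandas as pd

n = 200_000  # hypothetical size; increase it to make the difference clearer
rng = np.random.default_rng(0)
keys = rng.choice(['A', 'B', 'C', 'D', 'E'], n)
df_perf = pd.DataFrame({
    'key_obj': pd.Series(keys, dtype=object),
    'key_cat': pd.Categorical(keys),
    'x': rng.integers(0, 100, n),
})

def timed(func):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

sum_obj, t_obj = timed(lambda: df_perf.groupby('key_obj')['x'].sum())
# observed=True restricts the result to categories that actually occur
sum_cat, t_cat = timed(lambda: df_perf.groupby('key_cat', observed=True)['x'].sum())

print(f"object key:      {t_obj * 1000:.1f} ms")
print(f"categorical key: {t_cat * 1000:.1f} ms")
print((sum_obj.to_numpy() == sum_cat.to_numpy()).all())  # same totals either way
```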


Understanding the memory efficiency and performance benefits of categorical data allows you to make informed decisions about when to use this data type in your Pandas DataFrames, potentially leading to significant improvements in both memory usage and computation speed for large datasets.

## <a id='toc5_'></a>[Encoding Categorical Data](#toc0_)

Encoding categorical data is a crucial step in preparing data for many machine learning algorithms that require numerical input. Pandas provides several methods to encode categorical data. We'll focus on two common encoding techniques: One-Hot Encoding and Ordinal Encoding.


### <a id='toc5_1_'></a>[One-Hot Encoding](#toc0_)


One-hot encoding creates binary columns for each category in a categorical variable. This is useful for nominal categorical data where there's no inherent order among categories.


In [56]:
# Create a sample DataFrame
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']
})

# Perform one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Color', 'Size'])
df_encoded

Unnamed: 0,Color_Blue,Color_Green,Color_Red,Size_Large,Size_Medium,Size_Small
0,False,False,True,False,False,True
1,True,False,False,False,True,False
2,False,True,False,True,False,False
3,False,False,True,False,True,False
4,False,True,False,False,False,True


You can customize the prefix for the new columns:


In [57]:
# Custom prefix
df_encoded = pd.get_dummies(df, columns=['Color', 'Size'], prefix=['C', 'S'])
df_encoded

Unnamed: 0,C_Blue,C_Green,C_Red,S_Large,S_Medium,S_Small
0,False,False,True,False,False,True
1,True,False,False,False,True,False
2,False,True,False,True,False,False
3,False,False,True,False,True,False
4,False,True,False,False,False,True
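
Two further `get_dummies()` parameters are often useful: `drop_first` removes one indicator per variable, since it is implied by the others, and `dtype` controls the output type. A minimal sketch on the same data, held in a frame named `df_items` for this example:

```python
import pandas as pd

df_items = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
})

encoded = pd.get_dummies(
    df_items,
    columns=['Color', 'Size'],
    drop_first=True,   # drop the first level of each variable
    dtype=int,         # 0/1 integers instead of booleans
)
print(encoded)
```

Here 'Blue' and 'Large' are the dropped levels, so a row with zeros in both 'Color' columns represents 'Blue'.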


When using one-hot encoding, you might encounter new categories in your test data that weren't present in your training data. To handle this:


In [58]:
# Create training and test data
df_train = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']
})

df_test = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Yellow', 'Green', 'Orange']
})

In [59]:
df_train


Unnamed: 0,Color
0,Red
1,Blue
2,Green
3,Red
4,Green


In [60]:
df_test

Unnamed: 0,Color
0,Red
1,Blue
2,Yellow
3,Green
4,Orange


In [61]:
# One-hot encode with all categories from both datasets
all_categories = pd.concat([df_train['Color'], df_test['Color']]).unique()
df_train_encoded = pd.get_dummies(df_train, columns=['Color'], prefix=['Color'])
df_test_encoded = pd.get_dummies(df_test, columns=['Color'], prefix=['Color'])

In [62]:
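# Add any missing indicator columns so both frames have the same set of columns
# (the column order may still differ between the two frames)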
for category in all_categories:
    if f'Color_{category}' not in df_train_encoded.columns:
        df_train_encoded[f'Color_{category}'] = False
    if f'Color_{category}' not in df_test_encoded.columns:
        df_test_encoded[f'Color_{category}'] = False

print("Train data:")
df_train_encoded

Train data:


Unnamed: 0,Color_Blue,Color_Green,Color_Red,Color_Yellow,Color_Orange
0,False,False,True,False,False
1,True,False,False,False,False
2,False,True,False,False,False
3,False,False,True,False,False
4,False,True,False,False,False


In [63]:
print("\nTest data:")
df_test_encoded


Test data:


Unnamed: 0,Color_Blue,Color_Green,Color_Orange,Color_Red,Color_Yellow
0,False,False,False,True,False
1,True,False,False,False,False
2,False,False,False,False,True
3,False,True,False,False,False
4,False,False,True,False,False
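
An alternative is to fix the categories from the training data before encoding. This is a sketch of one possible approach, not the only one; the helper `encode` is hypothetical. Because `get_dummies()` creates one column per declared category, both frames get the same columns in the same order, and values unseen during training become rows with no indicator set.

```python
import pandas as pd

df_train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})
df_test = pd.DataFrame({'Color': ['Red', 'Blue', 'Yellow', 'Green', 'Orange']})

# Categories are taken from the training data only
train_categories = sorted(df_train['Color'].unique())

def encode(frame):
    """One-hot encode 'Color' against the fixed training categories."""
    fixed = frame.assign(
        Color=pd.Categorical(frame['Color'], categories=train_categories)
    )
    return pd.get_dummies(fixed, columns=['Color'])

train_encoded = encode(df_train)
test_encoded = encode(df_test)

print(list(train_encoded.columns) == list(test_encoded.columns))
print(test_encoded)
```

This avoids looking at the test data when deciding which columns exist, at the cost of treating every unseen color the same way.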


### <a id='toc5_2_'></a>[Ordinal Encoding](#toc0_)


Ordinal encoding assigns an integer to each category. This is useful for ordinal categorical data where there's a clear ordering of categories.


In [64]:
# Create a DataFrame with ordinal data
df = pd.DataFrame({
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
    'Quality': ['Low', 'Medium', 'High', 'Medium', 'Low']
})
df

Unnamed: 0,Size,Quality
0,Small,Low
1,Medium,Medium
2,Large,High
3,Medium,Medium
4,Small,Low


In [65]:
# Define the order of categories
size_order = ['Small', 'Medium', 'Large']
quality_order = ['Low', 'Medium', 'High']

In [66]:
# Perform ordinal encoding
df['Size_Encoded'] = pd.Categorical(df['Size'], categories=size_order, ordered=True).codes
df['Quality_Encoded'] = pd.Categorical(df['Quality'], categories=quality_order, ordered=True).codes


In [67]:
df

Unnamed: 0,Size,Quality,Size_Encoded,Quality_Encoded
0,Small,Low,0,0
1,Medium,Medium,1,1
2,Large,High,2,2
3,Medium,Medium,1,1
4,Small,Low,0,0
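
One caveat with this approach: a value that is not in the category list is treated as missing and receives the code -1. A short sketch, where 'Huge' is a hypothetical value outside `size_order`:

```python
import pandas as pd

size_order = ['Small', 'Medium', 'Large']

# 'Huge' is a hypothetical value that is not part of size_order
new_sizes = ['Small', 'Huge', 'Large']
codes = pd.Categorical(new_sizes, categories=size_order, ordered=True).codes

print(codes.tolist())
```

It is worth checking for -1 before passing the codes to a model, since it would otherwise be read as a value below 'Small'.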


For more advanced ordinal encoding, you can use scikit-learn:


In [68]:
from sklearn.preprocessing import OrdinalEncoder

# Create an OrdinalEncoder
encoder = OrdinalEncoder(categories=[size_order, quality_order])
# Fit and transform the data
encoded_data = encoder.fit_transform(df[['Size', 'Quality']])

In [69]:
# Create a new DataFrame with the scikit-learn encoded values
# (distinct column names avoid clashing with the columns added to df above)
df_encoded = pd.DataFrame(encoded_data, columns=['Size_Sklearn', 'Quality_Sklearn'], index=df.index)
df_encoded

Unnamed: 0,Size_Sklearn,Quality_Sklearn
0,0.0,0.0
1,1.0,1.0
2,2.0,2.0
3,1.0,1.0
4,0.0,0.0


In [70]:
# Combine with original data
df_final = pd.concat([df, df_encoded], axis=1)
df_final

Unnamed: 0,Size,Quality,Size_Encoded,Quality_Encoded,Size_Sklearn,Quality_Sklearn
0,Small,Low,0,0,0.0,0.0
1,Medium,Medium,1,1,1.0,1.0
2,Large,High,2,2,2.0,2.0
3,Medium,Medium,1,1,1.0,1.0
4,Small,Low,0,0,0.0,0.0


### <a id='toc5_3_'></a>[Choosing Between One-Hot and Ordinal Encoding](#toc0_)


- Use **One-Hot Encoding** for nominal categorical variables (no inherent order).
- Use **Ordinal Encoding** for ordinal categorical variables (clear order exists).


Remember that the choice of encoding can significantly impact the performance of your machine learning models. Always consider the nature of your data and the requirements of your chosen algorithm when deciding on an encoding method.


By understanding these encoding techniques, you can effectively prepare your categorical data for various data analysis and machine learning tasks.