---
title: "Handling Categorical Data with Pandas"
author: "Mohammed Adil Siraju"
date: "2025-09-21"
categories: [pandas, dataframe, categorical-data, data-transformation]
description: "Comprehensive guide to handling categorical data in Pandas, including encoding techniques, grouping operations, and data reshaping methods like melt and pivot."
---

This notebook covers essential techniques for working with categorical data in Pandas, including:
- **Encoding Methods**: Converting categorical variables to numerical formats
- **Grouping Operations**: Analyzing category distributions and aggregations
- **Data Transformation**: Reshaping data with melt and pivot operations

Categorical data transformation is crucial for machine learning models that require numerical inputs.

## 1. Setting Up Sample Data

Let's start by creating a sample DataFrame with categorical data to work with.

In [1]:
import pandas as pd

data = {
    'Category': ['A','B','C','C','B','A']
}

df = pd.DataFrame(data)

In [2]:
df

|   | Category |
|---|----------|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | C |
| 4 | B |
| 5 | A |


## 2. Encoding Categorical Data

Machine learning algorithms typically require numerical inputs. Categorical encoding converts text categories into numbers. Here are the most common techniques:

### One-Hot Encoding

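One-hot encoding creates one binary indicator column per category. It's ideal for nominal (unordered) categories. The example below keeps only the `A` and `B` columns: a row where both are `False` belongs to category `C`, so no information is lost by leaving that column out.

**Pros**: No ordinal assumptions, works well with most algorithms

**Cons**: Adds one column per category, so high-cardinality features produce very wide tables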

In [3]:
pd.get_dummies(df['Category'])[['A','B']]

|   | A | B |
|---|---|---|
| 0 | True | False |
| 1 | False | True |
| 2 | False | False |
| 3 | False | False |
| 4 | False | True |
| 5 | True | False |
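
As a complement to the cell above, here is a minimal sketch of the two usual variants: the full encoding with one column per category, and a reduced encoding that drops one category via the `drop_first` parameter of `pd.get_dummies`. The variable names `full` and `reduced` are illustrative only, and `dtype=int` is used just to get 0/1 instead of True/False.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'C', 'B', 'A']})

# Full one-hot encoding: one indicator column per category, stored as 0/1
full = pd.get_dummies(df['Category'], dtype=int)

# Reduced encoding: drop the first category (A);
# a row with zeros in every remaining column then stands for A
reduced = pd.get_dummies(df['Category'], drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column is often preferred for linear models, where the full set of indicator columns is perfectly collinear with the intercept.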


### Label Encoding

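Label encoding assigns an integer to each category. Use this when categories have a natural order (ordinal data). Note that scikit-learn's `LabelEncoder` numbers the categories in sorted order (alphabetical for strings), so check that the resulting codes match the order you intend.

**Pros**: Memory efficient, preserves single column

**Cons**: Implies ordinal relationship even when none exists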

In [4]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

df['Category_LabelEncoded'] = label_encoder.fit_transform(df['Category'])

df

|   | Category | Category_LabelEncoded |
|---|----------|-----------------------|
| 0 | A | 0 |
| 1 | B | 1 |
| 2 | C | 2 |
| 3 | C | 2 |
| 4 | B | 1 |
| 5 | A | 0 |
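
When the natural order of the categories is not alphabetical, one option is to state the order explicitly with an ordered `pd.Categorical` and read the integer codes from `.cat.codes`. The sketch below uses a hypothetical `Size` column that is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical ordinal data: alphabetical order (large, medium, small)
# would not match the real order small < medium < large
sizes = pd.DataFrame({'Size': ['medium', 'small', 'large', 'small']})

order = ['small', 'medium', 'large']
sizes['Size'] = pd.Categorical(sizes['Size'], categories=order, ordered=True)

# .cat.codes returns each value's position in the declared category list
sizes['Size_Code'] = sizes['Size'].cat.codes

print(sizes)
```

Because the order is declared up front, the codes stay stable even if a category is missing from a particular batch of data.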


In [5]:
# Recreate the original DataFrame so the encoded column is not carried into the grouping examples

data = {
    'Category': ['A','B','C','C','B','A']
}

df = pd.DataFrame(data)

df

|   | Category |
|---|----------|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | C |
| 4 | B |
| 5 | A |


## 3. Analyzing Categorical Data with Grouping

Grouping operations help you understand the distribution and patterns in your categorical data. This is essential for exploratory data analysis.

### Counting Category Frequencies

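Use `groupby().size()` to count the rows in each category, or `groupby().count()` to count the non-missing values in each column per category. The two agree when the data has no missing values, as is the case here.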

In [6]:
df.groupby('Category').size()

Category
A    2
B    2
C    2
dtype: int64

In [7]:
df.groupby('Category').agg({'Category':'count'})

| Category | Category |
|----------|----------|
| A | 2 |
| B | 2 |
| C | 2 |
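
The difference between `size()` and `count()` only shows up once values are missing. The following sketch adds a hypothetical `Value` column containing `NaN` entries to make that visible, and also shows `value_counts()` as a shorter way to get category frequencies.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same categories as above plus a column with missing values
df_values = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'C', 'B', 'A'],
    'Value': [1.0, np.nan, 3.0, 4.0, 5.0, np.nan],
})

# size(): number of rows per category, missing values included
row_counts = df_values.groupby('Category').size()

# count(): number of non-missing values per category
value_counts_non_null = df_values.groupby('Category')['Value'].count()

# value_counts(): frequency of each category without an explicit groupby
frequencies = df_values['Category'].value_counts()

print(row_counts)
print(value_counts_non_null)
print(frequencies)
```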


## 4. Data Transformation: Reshaping with Melt and Pivot

Data reshaping is crucial for transforming your data between "wide" and "long" formats. This is particularly useful when working with categorical data across multiple variables.

### Wide to Long Format (melt)

`pd.melt()` unpivots a DataFrame from wide format to long format. This is useful for:
- Converting multiple categorical columns into a single column
- Preparing data for visualization libraries
- Making data more database-friendly

In [8]:
# Reshaping Data
data = {
    'Name': ['John', 'Emily', 'Kate'],
    'Math': [90, 85, 88],
    'Science': [92, 80, 95]
}

df = pd.DataFrame(data)
df

|   | Name | Math | Science |
|---|------|------|---------|
| 0 | John | 90 | 92 |
| 1 | Emily | 85 | 80 |
| 2 | Kate | 88 | 95 |


In [9]:
df_melted = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')
df_melted

|   | Name | Subject | Score |
|---|------|---------|-------|
| 0 | John | Math | 90 |
| 1 | Emily | Math | 85 |
| 2 | Kate | Math | 88 |
| 3 | John | Science | 92 |
| 4 | Emily | Science | 80 |
| 5 | Kate | Science | 95 |
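
One practical benefit of the long format is that `Subject` becomes an ordinary categorical column, so the grouping techniques from Section 3 apply directly. The sketch below rebuilds the same scores, names the melted columns explicitly through `value_vars`, and averages the score per subject. The names `long_df` and `subject_means` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emily', 'Kate'],
    'Math': [90, 85, 88],
    'Science': [92, 80, 95],
})

# value_vars lists the columns to unpivot; any other non-id columns are left out
long_df = pd.melt(
    df,
    id_vars=['Name'],
    value_vars=['Math', 'Science'],
    var_name='Subject',
    value_name='Score',
)

# Subject is now a regular column, so it can be used as a grouping key
subject_means = long_df.groupby('Subject')['Score'].mean()
print(subject_means)
```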


### Long to Wide Format (pivot)

`df.pivot()` does the opposite of melt: it converts long format back to wide format, provided each index/column pair appears only once. This is useful for:
- Creating summary tables
- Preparing data for certain types of analysis
- Making data more human-readable

In [10]:
df_melted.pivot(index='Name', columns='Subject', values='Score')

| Name | Math | Science |
|------|------|---------|
| Emily | 85 | 80 |
| John | 90 | 92 |
| Kate | 88 | 95 |
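
`pivot()` cannot reshape data in which the same index/column pair occurs more than once. In that situation `pivot_table()` is a common alternative, because it aggregates the duplicates. The sketch below uses hypothetical scores in which John has two Math results, and `aggfunc='mean'` is just one possible choice of aggregation.

```python
import pandas as pd

# Hypothetical long-format data with a duplicate (John, Math) pair
long_df = pd.DataFrame({
    'Name': ['John', 'John', 'Emily'],
    'Subject': ['Math', 'Math', 'Math'],
    'Score': [90, 70, 85],
})

# pivot() raises ValueError because (John, Math) is not unique
try:
    long_df.pivot(index='Name', columns='Subject', values='Score')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() combines the duplicates with the given aggregation function
wide = long_df.pivot_table(index='Name', columns='Subject', values='Score', aggfunc='mean')

print(pivot_failed)
print(wide)
```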


## Summary

In this notebook, you learned essential data transformation techniques for categorical data:

1. **Encoding**: Convert text categories to numbers
   - One-hot encoding for nominal data
   - Label encoding for ordinal data

2. **Grouping**: Analyze category distributions
   - Count frequencies with `groupby().size()`
   - Aggregate data by categories

3. **Reshaping**: Transform data structure
   - `melt()`: Wide to long format
   - `pivot()`: Long to wide format

These techniques form the foundation of data preprocessing for machine learning and analysis workflows. Choose the right method based on your data characteristics and modeling requirements!

**Next Steps**: Practice with real datasets and explore advanced encoding techniques like target encoding or frequency encoding.
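
As a starting point for frequency encoding, here is a minimal sketch that replaces each category with its relative frequency in the column. The data and the column name `Category_Freq` are hypothetical, and this is only one simple way to do it.

```python
import pandas as pd

# Hypothetical data with unequal category frequencies
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'C', 'B', 'C']})

# Share of rows belonging to each category
freq = df['Category'].value_counts(normalize=True)

# Map every row's category to that share
df['Category_Freq'] = df['Category'].map(freq)

print(df)
```

Categories that occur equally often receive the same code, so this encoding suits cases where the frequency itself is informative.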