# Summarizing Categorical Data Using pandas

This notebook demonstrates how to **summarize categorical data** using pandas, based closely on the instructor-led transcript.

Categorical data assigns each observation to one of a **fixed set of categories**. Common examples include:
- Product review sentiment (positive / negative)
- Fruit types (apples / oranges)
- Vehicle attributes such as gears, cylinders, or carburetors

We will explore **three main ways to describe categorical variables**:
1. Counts (`value_counts`)
2. Grouping and statistical description (`groupby`, `describe`)
3. Cross-tabulation (`crosstab`)

We will also learn how to **convert variables to the categorical data type**.

In [None]:
import numpy as np
import pandas as pd

## Loading the dataset

We will use the classic **mtcars** dataset. Each row represents a car model, and columns describe different mechanical attributes.

We also:
- Rename the columns for clarity
- Set the car names as the DataFrame index

In [None]:
address = '/workspaces/python-for-data-science-and-machine-learning-essential-training-part-1-3006708/data/mtcars.csv'

cars = pd.read_csv(address)
cars.columns = ['car_names','mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']
cars.index = cars.car_names

cars.head(15)
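
The loading steps above can be tried without the course file. The following is a minimal sketch that assumes a tiny in-memory CSV with made-up rows (the car names and values are hypothetical, not taken from mtcars). It uses `set_index(..., drop=False)`, which keeps `car_names` as a column while also using it as the index, much like the assignment to `cars.index` does.

```python
import io
import pandas as pd

# Hypothetical stand-in for mtcars.csv: three columns, two made-up rows
csv_text = """model,mpg,cyl
Hypothetical A,21.0,6
Hypothetical B,22.8,4
"""

demo = pd.read_csv(io.StringIO(csv_text))

# Rename the columns for clarity
demo.columns = ['car_names', 'mpg', 'cyl']

# Use the car names as the index while keeping the column itself
demo = demo.set_index('car_names', drop=False)

print(demo.head())
```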

## Counting categories with `value_counts`

`value_counts()` is the most direct way to summarize categorical data.

It counts how many times each **unique value** appears in a Series.

Here, we analyze the **number of carburetors (`carb`)** per car.

In [None]:
carb = cars.carb
carb.value_counts()

### Interpreting the result

Each row shows:
- The category value (e.g., number of carburetors)
- The count of cars that fall into that category

This is a fundamental way to describe categorical distributions.
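
As a self-contained illustration of the same idea, here is a small sketch on a made-up Series (the values are hypothetical, not the real `carb` column). It also shows two common variations: sorting the result by category value with `sort_index()`, and returning proportions instead of counts with `normalize=True`.

```python
import pandas as pd

# Hypothetical carburetor values for eight cars (not the real mtcars data)
carb_demo = pd.Series([4, 4, 1, 2, 2, 2, 1, 4], name='carb')

# Default: counts sorted by frequency, most common first
counts = carb_demo.value_counts()

# Same counts, ordered by the category value instead
by_value = counts.sort_index()

# Proportions of the total rather than raw counts
shares = carb_demo.value_counts(normalize=True)

print(by_value)
print(shares)
```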

## Grouping categorical variables

Another powerful way to summarize categorical data is by **grouping**.

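We first create a smaller subset of categorical-like variables. Each is stored as an integer but takes only a few distinct values:
- `cyl`  (number of cylinders)
- `vs`   (engine shape: 0 = V-shaped, 1 = straight)
- `am`   (transmission type: 0 = automatic, 1 = manual)
- `gear` (number of forward gears)
- `carb` (number of carburetors)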

In [None]:
cars_cat = cars[['cyl', 'vs', 'am', 'gear', 'carb']]
cars_cat.head()

## Grouping by a categorical variable

We now group the dataset by the **number of gears**.

`groupby()` splits the dataset into subgroups, and `describe()` generates
a full statistical summary for each subgroup.

In [None]:
gears_group = cars_cat.groupby('gear')
gears_group.describe()

### Understanding the output

- Rows correspond to **unique gear values**
- Columns contain **statistical summaries** (count, mean, std, min, the 25%, 50%, and 75% quartiles, and max)
- Each variable is described separately within each gear group

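The result is a wide table: eight statistics for each of the four remaining variables, or 32 columns in total. In exchange, it shows how every variable is distributed within each gear group.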
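
When the full table is too wide to read, it can help to look at one piece at a time. The sketch below uses a small made-up DataFrame (hypothetical values, not mtcars) to show the group sizes and the summary for a single column.

```python
import pandas as pd

# Hypothetical data: six cars with gear, cylinder, and carburetor counts
demo = pd.DataFrame({
    'gear': [3, 3, 4, 4, 4, 5],
    'cyl':  [8, 6, 4, 4, 6, 8],
    'carb': [2, 1, 1, 2, 4, 4],
})

grouped = demo.groupby('gear')

# Number of rows in each gear group
sizes = grouped.size()

# describe() restricted to one column gives a narrow, readable table
cyl_summary = grouped['cyl'].describe()

print(sizes)
print(cyl_summary)
```

Selecting a column before calling `describe()` keeps the eight statistics but drops the second level of column labels.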

## Transforming variables to categorical data type

In pandas, you can explicitly assign a variable the **categorical data type**.

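This is useful for:
- Memory efficiency: each value is stored as a small integer code, with one copy of each category
- Clear semantic meaning: the dtype signals that the values are labels, not quantities
- Grouping and plotting behavior: the full set of categories, and optionally their order, is known in advance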

Here, we convert the `gear` variable into a categorical Series and store it in a new column named `group`.

In [None]:
# Copy gear into a new column with the categorical dtype
cars['group'] = cars['gear'].astype('category')

### Verifying the data type

In [None]:
cars['group'].dtype
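
The conversion and the check can also be sketched on a standalone Series. The example below uses made-up gear values (hypothetical, not the real column) and additionally shows an ordered categorical built with `CategoricalDtype`, which is one way to tell pandas that 3 < 4 < 5 is a meaningful order. The ordered variant is an extension for illustration, not something the notebook itself requires.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical gear values for six cars
gear_demo = pd.Series([4, 4, 3, 5, 3, 4], name='gear')

# Plain conversion: categories are inferred from the data
gear_cat = gear_demo.astype('category')
print(gear_cat.dtype)
print(list(gear_cat.cat.categories))

# Ordered conversion: categories and their order are declared explicitly
ordered_type = CategoricalDtype(categories=[3, 4, 5], ordered=True)
gear_ordered = gear_demo.astype(ordered_type)

# Comparisons such as max() are only defined for ordered categoricals
print(gear_ordered.max())
```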

### Distribution of the categorical variable

Once a variable is categorical, you can still summarize it using `value_counts()`.

In [None]:
cars['group'].value_counts()
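
One difference from a plain integer column is worth seeing: on a categorical Series, `value_counts()` reports every declared category, including those that never occur. The sketch below assumes a made-up Series (hypothetical values) whose dtype declares the category 5 even though no row uses it.

```python
import pandas as pd

# Declare three possible gear categories up front
gear_type = pd.CategoricalDtype(categories=[3, 4, 5])

# Hypothetical data in which no car has 5 gears
gear_demo = pd.Series([3, 4, 4, 3, 4], dtype=gear_type)

# The unused category still appears, with a count of zero
counts = gear_demo.value_counts()
print(counts)
```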

## Describing categorical relationships with crosstabs

`pd.crosstab()` is used to summarize **relationships between two categorical variables**.

In this example:
- Rows represent transmission type (`am`: 0 = automatic, 1 = manual)
- Columns represent number of gears (`gear`)

Each cell shows the count of cars matching that combination of row and column values.

In [None]:
pd.crosstab(cars['am'], cars['gear'])
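
To make the cell-by-cell reading concrete, here is a small sketch on made-up data (hypothetical values, not mtcars). It also shows two optional parameters of `pd.crosstab`: `margins=True`, which adds row and column totals labelled `All`, and `normalize='index'`, which converts each row to proportions.

```python
import pandas as pd

# Hypothetical data: six cars with a transmission flag and a gear count
demo = pd.DataFrame({
    'am':   [0, 0, 0, 1, 1, 1],
    'gear': [3, 3, 4, 4, 5, 5],
})

# Plain counts for each (am, gear) combination
table = pd.crosstab(demo['am'], demo['gear'])

# Same table with row and column totals
with_totals = pd.crosstab(demo['am'], demo['gear'], margins=True)

# Each row rescaled so that it sums to 1
row_shares = pd.crosstab(demo['am'], demo['gear'], normalize='index')

print(table)
print(with_totals)
print(row_shares)
```

Combinations that never occur, such as `am` = 0 with 5 gears in this made-up data, appear as 0 rather than being dropped.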