## [ Categorical Data ]
- A pandas data type for columns that take a fixed, limited set of possible values (categories)

#### Benefits of Using Categorical Data

1. **Less Memory Usage**  
   Stores categories as integers under the hood, not strings.

2. **Faster Performance**  
   Comparisons and groupings are quicker.

3. **Clear Data Meaning**  
   Helps distinguish **true categories** (like `gender`, `city`, `rating`) from free-form text.


#### Background & Motivation

Before categorical types:

- Strings were used to represent categories like `"male"`, `"female"`, `"yes"`, `"no"`, etc.
- But string operations are **slow** and **memory-heavy**.
- There was **no way to define ordering** (like `"low" < "medium" < "high"`).
- Missing values and inconsistent labels could cause bugs in analysis.

**So pandas introduced the `Categorical` data type to address these problems.**


#### Features of Categorical Data

| Feature                | Description                                           |
|------------------------|-------------------------------------------------------|
| **Categories**         | The possible values (like "small", "medium", "large") |
| **Order** (optional)   | Categories can be **ordered** (e.g., low < medium < high) |
| **Efficient Storage**  | Internally stored as integers + a mapping table       |
| **Group-by Friendly**  | Useful for `groupby()`, `pivot_table()`, etc.         |
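
The features in the table can be seen on a small Series. This is a minimal sketch; the size labels and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: sizes with a natural ranking.
sizes = pd.Series(["small", "large", "medium", "small"])

# Declare the allowed categories and their order explicitly.
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes_cat = sizes.astype(size_dtype)

categories = sizes_cat.cat.categories.tolist()  # the lookup table of possible values
codes = sizes_cat.cat.codes.tolist()            # integer positions into that table
is_ordered = sizes_cat.cat.ordered              # whether an order is defined
below_large = (sizes_cat < "large").tolist()    # comparison follows the category order

print(categories)
print(codes)
print(is_ordered)
print(below_large)
```

Because the dtype is ordered, `"medium" < "large"` is evaluated by category position rather than alphabetically.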


In [1]:
import numpy as np 
import pandas as pd 

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [2]:
print(pd.unique(values))
print(values.value_counts())  # values is already a Series, so no re-wrapping is needed

['apple' 'orange']
apple     6
orange    2
Name: count, dtype: int64


In [3]:
# Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches
# for representing data with repeated values for more efficient storage and computation.
# In data warehousing, a best practice is to use so-called dimension tables containing the distinct values
# and to store the primary observations as integer keys referencing the dimension table.


# Summary: 
# To save space and work faster: 
#     don't repeat the same text over and over
#     instead, store each unique value once in a dimension table
#     in your main table, just store an integer code that points to that value 
#     this technique is used in data warehousing, pandas categorical data, and many analytics tools

- Data Systems: these are tools or platforms used to store, manage, and analyze data
    - databases (MySQL, PostgreSQL)
    - data warehouses (amazon redshift, google bigquery)
    - statistical tools (R, pandas in python)

- Repeated Values: refers to data that appears multiple times in a dataset

- Efficient Storage and Computation: to avoid redundancy and save space and time, data systems try to:
    - store repeated values only once
    - use references or codes instead of full repeated strings
    - Why?
        - reduces memory usage
        - makes searching and comparing faster

- Data Warehousing: a system designed for analyzing large datasets (often from different sources) to support decision-making
    - Characteristics: 
        - stores historical data
        - used for reporting, analytics, BI (business intelligence)

- Dimension Tables: these are tables in a data warehouse that store unique values for certain attributes (like city_names, product_names, etc)
    - Instead of storing full strings repeatedly, we store: 
        - a unique ID (like 1,2,3,...)
        - the corresponding string once 
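
The dimension-table idea can be sketched with `pd.factorize`, which splits a column into integer keys and the distinct values. The city names and variable names here are made up for illustration.

```python
import pandas as pd

# Hypothetical fact-table column with repeated city names.
cities = pd.Series(["Paris", "Tokyo", "Paris", "Lima", "Tokyo"])

# factorize returns one integer key per row plus the distinct values,
# listed in order of first appearance.
keys, dimension = pd.factorize(cities)

print(keys)       # integer key per row
print(dimension)  # each distinct value stored once

# Looking the keys up in the dimension table restores the original strings.
restored = pd.Series(dimension.take(keys))
print(restored.equals(cities))
```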

In [4]:
values = pd.Series([0,1,0,0] * 2)
dim = pd.Series(['apple', 'orange']) # dim[0] -> apple , dim[1] -> orange

print(values)
print(dim)

dim.take(values)    # dim.take([0,1,0,0,0,1,0,0])

# dim.take(values) returns a new Series (a copy, not a view); the result is discarded unless we assign it to a variable

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64
0     apple
1    orange
dtype: object


0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object


#### **Key Points on Categorical (Dictionary-Encoded) Representation in pandas**

1. **Categorical Representation**:
   - Data is stored as **integer codes** referring to a **set of distinct values** (called *categories*).
   - This is also known as **dictionary-encoded representation**.

2. **Terminology**:
   - The **distinct values**: called *categories*, *dictionary*, or *levels*.
   - The **integers** referencing them: called *category codes* or simply *codes*.

3. **Performance**:
   - Categorical representation is **memory-efficient** and **faster** for certain operations.
   - Useful for **analytics** where values repeat often (e.g., gender, region, product types).

4. **Transformations on Categories**:
   - You can change the **categories** without modifying the **underlying codes**.
   - Common low-cost transformations include:
     - ✅ **Renaming categories**
     - ✅ **Appending new categories** (while keeping existing order/positions unchanged)
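
To illustrate point 4, here is a small sketch showing that renaming or appending categories leaves the integer codes untouched. The labels are invented for the example.

```python
import pandas as pd

# Hypothetical labels; pandas sorts the inferred categories, giving ['high', 'low'].
levels = pd.Series(["low", "high", "low"], dtype="category")
codes_before = levels.cat.codes.tolist()

# Renaming only touches the lookup table.
renamed = levels.cat.rename_categories({"low": "L", "high": "H"})

# Appending adds a new entry at the end of the lookup table.
extended = levels.cat.add_categories(["medium"])

print(codes_before)
print(renamed.cat.codes.tolist(), renamed.tolist())
print(extended.cat.codes.tolist(), extended.cat.categories.tolist())
```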


--- 

## [ Categorical Extension Type in pandas ]
- pandas provides a **special data type** called `Categorical` for handling data using **integer-based encoding**.
- It stores repeated values as **integer codes**, pointing to a list of **unique categories** (like a lookup table).
- **Why it's useful**:
   - ✅ Saves **memory**
   - ✅ Offers **faster performance**
   - ✅ Is especially helpful for **string data with many repeated values**
- Great for data like *city names*, *product types*, *status labels* (like 'Yes', 'No'), etc.


In [5]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)

rng = np.random.default_rng(seed=12345)

df = pd.DataFrame({'fruit': fruits,
                     'basket_id': np.arange(N),
                     'count': rng.integers(3, 15, size=N),
                     'weight': rng.uniform(0, 4, size=N)},
                    columns=['basket_id', 'fruit', 'count', 'weight'])
df 

   basket_id   fruit  count    weight
0          0   apple     11  1.564438
1          1  orange      5  1.331256
2          2   apple     12  2.393235
3          3   apple      6  0.746937
4          4   apple      5  2.691024
5          5  orange     12  3.767211
6          6   apple     10  0.992983
7          7   apple     11  3.795525


In [6]:
# here df['fruit'] is an array of Python string objects. We can convert it to categorical by calling astype('category')
fruit_cat = df['fruit'].astype('category')
fruit_cat # series object

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [7]:
# the values for fruit_cat are now an instance of pandas.Categorical, which you can access via the .array attribute

c = fruit_cat.array
type(c)

# s.array returns the actual data container behind the series

pandas.core.arrays.categorical.Categorical

In [8]:
# the categorical object has categories and codes attributes
print(c.categories)
print(c.codes)
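# note: fruit_cat.categories would raise AttributeError, because a Series has no `categories` attribute

# on a Series, the same information is reached through the cat accessor (fruit_cat.cat.categories, fruit_cat.cat.codes)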

Index(['apple', 'orange'], dtype='object')
[0 1 0 0 0 1 0 0]


In [9]:
# trick to get a mapping between codes and categories is 
dict(enumerate(c.categories))

# what integer code corresponds to which category label

{0: 'apple', 1: 'orange'}
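
The same mapping can be used to decode codes by hand. The sketch below also shows that a missing value gets the code `-1`, which has no entry in the mapping. The sample values are made up.

```python
import pandas as pd

# Hypothetical data with one missing entry.
cat = pd.Categorical(["apple", None, "orange", "apple"])

codes = cat.codes.tolist()
mapping = dict(enumerate(cat.categories))

# -1 is not a key in the mapping, so .get returns None for the missing value.
decoded = [mapping.get(code) for code in codes]

print(codes)
print(mapping)
print(decoded)
```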

In [10]:
# we can convert a DataFrame column to categorical 
df['fruit'] = df['fruit'].astype('category')
df['fruit']

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [11]:
# we can also create pandas.Categorical directly from other types of Python sequences
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']
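
Note that the inferred categories above come out sorted (`bar`, `baz`, `foo`). As a sketch, passing `categories=` explicitly controls the lookup order, and values outside the given list become missing.

```python
import pandas as pd

values = ["foo", "bar", "baz", "foo", "bar"]

inferred = pd.Categorical(values)                                   # categories are sorted
explicit = pd.Categorical(values, categories=["foo", "bar", "baz"]) # categories in the given order
restricted = pd.Categorical(values, categories=["foo", "bar"])      # 'baz' is not allowed

print(inferred.codes.tolist())
print(explicit.codes.tolist())
print(restricted.codes.tolist())   # -1 marks the value that is not a category
print(restricted.isna().tolist())
```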

In [12]:
# if you have obtained categorical encoded data from another source, you can use the alternative from_codes constructor

categories = ['foo', 'bar', 'baz']
codes = [0,2,1,2,2,1,0,0,1,2]

res = pd.Categorical.from_codes(codes, categories) 
print(res)

# unless explicitly specified, categorical conversions assume no specific ordering of the categories,
# so the categories array may be in a different order depending on the ordering of the input data.
# when using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering

ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
print(ordered_cat)

# the values print identically, but the Categories line now joins the labels with `<` instead of commas, showing that the categorical is ordered

# when to use ordered: 
    # the categories have a natural ranking 
    # you need to compare values or sort them in a meaningful order

['foo', 'baz', 'bar', 'baz', 'baz', 'bar', 'foo', 'foo', 'bar', 'baz']
Categories (3, object): ['foo', 'bar', 'baz']
['foo', 'baz', 'bar', 'baz', 'baz', 'bar', 'foo', 'foo', 'bar', 'baz']
Categories (3, object): ['foo' < 'bar' < 'baz']
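
One practical effect of the ordered flag, sketched with the same codes and categories: `min()` and `max()` are defined only for ordered categoricals, while the unordered version raises `TypeError`.

```python
import pandas as pd

categories = ["foo", "bar", "baz"]
codes = [0, 2, 1, 2, 2, 1, 0, 0, 1, 2]

unordered = pd.Categorical.from_codes(codes, categories)
ordered = unordered.as_ordered()  # same data, ordered flag set to True

# min/max follow the category order foo < bar < baz, not alphabetical order.
smallest, largest = ordered.min(), ordered.max()
print(smallest, largest)

try:
    unordered.min()
    unordered_min_failed = False
except TypeError as exc:
    unordered_min_failed = True
    print("unordered:", exc)
```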


Categorical data need not be strings; a categorical array can consist of any immutable value type.
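
As a quick sketch of non-string categories, here are categoricals built from integers and from booleans (the values are made up).

```python
import pandas as pd

# Integer categories: the distinct values 1, 3, 5 form the lookup table.
ratings = pd.Categorical([3, 1, 3, 5, 1])

# Boolean categories work the same way.
flags = pd.Categorical([True, False, True])

print(ratings.categories.tolist(), ratings.codes.tolist())
print(flags.categories.tolist(), flags.codes.tolist())
```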

## [ Computations with Categories ]

- pandas has a special data type called `Categorical`, which stores data more efficiently and can speed up certain operations (like `groupby`)
- when you bin an array with functions like `qcut()`, pandas returns the result as a `Categorical`
- you can use categoricals much like strings or numbers; under the hood, pandas works with the integer codes for better performance

In [17]:
rng = np.random.default_rng(seed=12345)
draws = rng.standard_normal(1000)
# print(draws)

# lets compute a quartile binning of this data and extract some statistics
bins = pd.qcut(draws, 4)
print(bins)
bins.value_counts()

[(-3.121, -0.675], (0.687, 3.211], (-3.121, -0.675], (-0.675, 0.0134], (-0.675, 0.0134], ..., (0.0134, 0.687], (0.0134, 0.687], (-0.675, 0.0134], (0.0134, 0.687], (-0.675, 0.0134]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.121, -0.675] < (-0.675, 0.0134] < (0.0134, 0.687] < (0.687, 3.211]]


(-3.121, -0.675]    250
(-0.675, 0.0134]    250
(0.0134, 0.687]     250
(0.687, 3.211]      250
Name: count, dtype: int64

In [19]:
# while useful, the exact sample quartiles may be less convenient in a report than quartile names.
# We can achieve this with the labels argument to qcut

bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(bins)
bins.codes

['Q1', 'Q4', 'Q1', 'Q2', 'Q2', ..., 'Q3', 'Q3', 'Q2', 'Q3', 'Q2']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']


array([0, 3, 0, 1, 1, 0, 0, 2, 2, 0, 3, 3, 0, 3, 1, 1, 3, 0, 2, 3, 3, 1,
       3, 0, 1, 2, 0, 1, 3, 3, 3, 3, 0, 0, 0, 2, 3, 1, 0, 2, 2, 1, 3, 1,
       1, 0, 2, 3, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 1, 2, 1, 2, 0, 0,
       3, 0, 2, 0, 0, 0, 1, 2, 0, 1, 3, 1, 0, 2, 1, 0, 0, 3, 1, 1, 0, 3,
       3, 3, 1, 2, 3, 3, 2, 1, 1, 0, 0, 2, 1, 0, 1, 3, 1, 0, 1, 3, 2, 1,
       3, 3, 1, 3, 2, 3, 3, 3, 2, 1, 0, 1, 3, 2, 2, 2, 2, 2, 0, 0, 1, 0,
       3, 1, 1, 0, 2, 1, 3, 3, 0, 2, 0, 1, 3, 3, 2, 1, 0, 0, 2, 3, 0, 2,
       2, 3, 0, 1, 2, 0, 0, 0, 1, 2, 1, 2, 0, 0, 3, 1, 1, 3, 3, 3, 2, 1,
       2, 2, 3, 2, 2, 1, 3, 0, 2, 3, 0, 3, 1, 1, 0, 3, 2, 1, 2, 2, 3, 2,
       0, 1, 0, 0, 3, 2, 1, 1, 3, 2, 3, 2, 1, 3, 0, 0, 1, 0, 1, 1, 3, 2,
       2, 0, 0, 3, 0, 3, 0, 2, 0, 2, 0, 1, 3, 1, 3, 1, 0, 2, 1, 3, 3, 3,
       2, 3, 3, 2, 0, 0, 2, 3, 3, 0, 2, 2, 0, 2, 0, 2, 3, 0, 2, 0, 0, 3,
       0, 1, 1, 1, 2, 2, 3, 3, 1, 3, 3, 0, 2, 3, 3, 2, 0, 1, 0, 1, 3, 2,
       3, 1, 0, 0, 2, 0, 1, 3, 0, 0, 1, 1, 1, 0, 2,

In [20]:
# the labeled bins categorical does not contain information about the bin edges in the data, so we can use groupby to extract some summary statistics
bins = pd.Series(bins, name='quartile')
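# observed is passed explicitly because recent pandas versions warn when it is omitted for a categorical key
results = (pd.Series(draws)
        .groupby(bins, observed=False)
        .agg(['count', 'min', 'max'])
        .reset_index())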

results


  quartile  count       min       max
0       Q1    250 -3.119609 -0.678494
1       Q2    250 -0.673305  0.008009
2       Q3    250  0.018753  0.686183
3       Q4    250  0.688282  3.211418


In [21]:
# the quartile column in the result retains the original categorical information, including ordering, from bins
results['quartile']

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

### [ Better Performance with Categoricals ]

In [25]:
# consider a Series with 10 million elements and a small number of distinct categories:

N = 10_000_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))

# convert labels to categorical
categories = labels.astype('category')

# Note that labels uses significantly more memory than categories
print(labels.memory_usage(deep=True))
print(categories.memory_usage(deep=True))

# the conversion to category is not free, of course, but it is a one-time cost
%time _ = labels.astype('category')

520000132
10000512
CPU times: user 723 ms, sys: 29.6 ms, total: 753 ms
Wall time: 758 ms
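
The saving depends on how often values repeat. The sketch below uses a much smaller, made-up dataset to compare a column with four distinct labels against a column where every string is unique. In the all-unique case the categorical still has to store every string once plus the codes, so it should not be smaller.

```python
import pandas as pd

n = 100_000  # far smaller than the 10 million rows above, to keep the sketch quick

repeated = pd.Series(["foo", "bar", "baz", "qux"] * (n // 4))
unique = pd.Series([f"id_{i}" for i in range(n)])  # hypothetical all-distinct IDs


def memory_pair(s):
    """Return (bytes as plain strings, bytes as category) for a Series."""
    return s.memory_usage(deep=True), s.astype("category").memory_usage(deep=True)


rep_plain, rep_cat = memory_pair(repeated)
uniq_plain, uniq_cat = memory_pair(unique)

print(f"repeated labels: plain={rep_plain:,}  category={rep_cat:,}")
print(f"unique labels:   plain={uniq_plain:,}  category={uniq_cat:,}")
```

Exact byte counts vary with the pandas version and string storage, so no figures are quoted here.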


In [28]:
# using categorical data in pandas (which is internally stored as integers referencing categories) can speed up operations like groupby() or value_counts() compared to working with raw strings or other object types.

# why is it faster?
    # strings take more memory and more time to compare 
    # categoricals store strings as integer codes and a category list
    # operations like `groupby` or `value_counts` can work on integer arrays instead of strings -- and integers are super fast to compare and sort

# %timeit prints its measurement directly; without the -o flag it returns None, so there is nothing useful to assign
%timeit labels.value_counts()
%timeit categories.value_counts()

707 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
46.8 ms ± 368 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
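
`%timeit` is an IPython magic, so it is not available in a plain Python script. A rough equivalent using the standard-library `timeit` module might look like the sketch below; the smaller `n` and the repeat counts are arbitrary choices, and timings will differ by machine.

```python
import timeit

import pandas as pd

n = 200_000  # reduced size so the sketch runs quickly
labels = pd.Series(["foo", "bar", "baz", "qux"] * (n // 4))
categories = labels.astype("category")

# Take the best of 3 repeats, each averaging 5 calls.
labels_seconds = min(timeit.repeat(labels.value_counts, number=5, repeat=3)) / 5
category_seconds = min(timeit.repeat(categories.value_counts, number=5, repeat=3)) / 5

print(f"plain strings: {labels_seconds * 1e3:.2f} ms per call")
print(f"category:      {category_seconds * 1e3:.2f} ms per call")

# Both representations must agree on the counts themselves.
same_counts = labels.value_counts().to_dict() == categories.value_counts().to_dict()
print(same_counts)
```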


## [ Categorical Methods ]
- A Series containing categorical data has several special methods, similar to the specialized string methods under `Series.str`.
- These methods are reached through the `cat` accessor, which also provides convenient access to the categories and codes.

In [29]:
s = pd.Series(['a', 'b', 'c', 'd'] * 2)
cat_s = s.astype('category')

cat_s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [31]:
# the special accessor attribute cat provides access to categorical methods
cat_s.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [32]:
cat_s.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

In [34]:
# suppose that we know the actual set of categories for this data extends beyond the four values observed in the data
# we can use the set_categories method to change them

actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

In [35]:
# while it appears that the data is unchanged, the new categories will be reflected in operations that use them.
# example 

print(cat_s.value_counts())
print(cat_s2.value_counts())

a    2
b    2
c    2
d    2
Name: count, dtype: int64
a    2
b    2
c    2
d    2
e    0
Name: count, dtype: int64
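
`groupby` is another operation that sees the declared categories. In this sketch (the `amounts` values are made up), `observed=False` keeps the unused category `e` in the result, while `observed=True` drops it.

```python
import pandas as pd

# Same values as cat_s2 above: four observed letters, five declared categories.
cat_s2 = pd.Series(["a", "b", "c", "d"] * 2).astype(
    pd.CategoricalDtype(["a", "b", "c", "d", "e"])
)
amounts = pd.Series(range(8))  # hypothetical values to group

with_unobserved = amounts.groupby(cat_s2, observed=False).size()
observed_only = amounts.groupby(cat_s2, observed=True).size()

print(with_unobserved)
print(observed_only)
```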


In [38]:
# in large datasets, categoricals are often used as a convenient tool for memory savings and better performance. 
# After you filter a large DataFrame or Series, many of the categories may not appear in the data.
# we can use the remove_unused_categories method to trim unobserved categories

cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
print(cat_s3)

cat_s3.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']


0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']


| Method / Attribute      | Description |
|--------------------------|-------------|
| `Series.cat.categories` | Returns the categories (unique values). |
| `Series.cat.codes`      | Returns the integer codes for each value in the Series. |
| `Series.cat.ordered`    | Returns `True` if the categories have a defined order. |
| `Series.cat.set_categories(new_categories)` | Changes the categories to a new list. |
| `Series.cat.add_categories(new_categories)` | Adds new categories without removing existing ones. |
| `Series.cat.remove_categories(removal)` | Removes specific categories. |
| `Series.cat.remove_unused_categories()` | Removes categories not actually used in the data. |
| `Series.cat.rename_categories(new_names)` | Renames existing categories. |
| `Series.cat.reorder_categories(new_order)` | Changes order of categories. |
| `Series.cat.as_ordered()` | Sets the ordered flag to `True`. |
| `Series.cat.as_unordered()` | Sets the ordered flag to `False`. |
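
A short sketch of two methods from the table that are not demonstrated above, `remove_categories` and `reorder_categories`. Values whose category is removed become missing (code `-1`).

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"] * 2, dtype="category")

# Removing a category turns its values into missing values.
without_d = s.cat.remove_categories(["d"])

# Reordering must list exactly the existing categories; ordered=True also sets the flag.
reordered = without_d.cat.reorder_categories(["c", "b", "a"], ordered=True)

print(without_d.isna().sum())
print(reordered.cat.categories.tolist())
print(reordered.cat.codes.tolist())
```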


## [ Creating dummy variables for modeling ]

In [41]:
# When you’re using statistics or machine learning tools, you’ll often transform categorical data into dummy variables, also known as one-hot encoding
# This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0 otherwise.

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# the pandas.get_dummies function converts this one-dimensional categorical data into a DataFrame containing the dummy variables

pd.get_dummies(cat_s, dtype=int)

   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
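
One reason to build dummies from a categorical rather than from plain strings: declared categories get a column even when they do not appear in the data. The sketch below uses made-up values and the `prefix` argument to label the columns.

```python
import pandas as pd

# Hypothetical sample in which level 'c' never occurs.
observed = pd.Series(["a", "b", "a"])
all_levels = pd.CategoricalDtype(["a", "b", "c"])

dummies = pd.get_dummies(observed.astype(all_levels), prefix="level", dtype=int)

print(dummies)
```

The `level_c` column is all zeros, which keeps the column layout stable across samples that share the same declared categories.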


NOTE:    
- Effective data preparation can significantly improve productivity by enabling you to spend more time analyzing data and less time getting it ready for analysis.
- We have explored a number of tools for that in this section.