## [ Categorical Data ]
- `Categorical` is a pandas data type for columns that take a fixed, limited set of possible values (categories).

#### Benefits of Using Categorical Data

1. **Less Memory Usage**  
   Stores each value as a small integer code that points to its category, instead of repeating the full string.

2. **Faster Performance**  
   Comparisons and group-by operations work on integer codes instead of strings, so they are often quicker.

3. **Clear Data Meaning**  
   Helps distinguish **true categories** (like `gender`, `city`, `rating`) from free-form text.
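
As a rough check of the memory claim, the sketch below compares the same column stored as `object` and as `category`. The column contents and size are made up for illustration, and the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column: 3 distinct labels repeated many times.
labels = pd.Series(["small", "medium", "large"] * 10_000, dtype=object)
as_category = labels.astype("category")

# deep=True also counts the bytes of the Python string objects themselves.
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object dtype:   {object_bytes:,} bytes")
print(f"category dtype: {category_bytes:,} bytes")
```

The saving shrinks as the number of distinct values approaches the number of rows, since the category table itself then becomes large.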


#### Background & Motivation

Before categorical types:

- Strings were used to represent categories like `"male"`, `"female"`, `"yes"`, `"no"`, etc.
- But string operations are **slow** and **memory-heavy**.
- There was **no way to define ordering** (like `"low" < "medium" < "high"`).
- Missing values and inconsistent labels could cause bugs in analysis.

**So pandas introduced the `Categorical` data type to solve this.**


#### Features of Categorical Data

| Feature                | Description                                           |
|------------------------|-------------------------------------------------------|
| **Categories**         | The possible values (like "small", "medium", "large") |
| **Order** (optional)   | Categories can be **ordered** (e.g., low < medium < high) |
| **Efficient Storage**  | Internally stored as integers + a mapping table       |
| **Group-by Friendly**  | Useful for `groupby()`, `pivot_table()`, etc.         |
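
The optional ordering in the table can be sketched as follows. The size labels and variable names are illustrative only; `pd.CategoricalDtype` is used here as one way to declare the order up front.

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large", "small"])

# Declare the allowed categories and their order: small < medium < large.
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
ordered_sizes = sizes.astype(size_type)

# Sorting follows the declared order, not alphabetical order.
sorted_sizes = ordered_sizes.sort_values()
print(sorted_sizes.tolist())   # ['small', 'small', 'medium', 'large']

# Comparisons against a category label are allowed because the dtype is ordered.
below_large = ordered_sizes < "large"
print(below_large.tolist())    # [True, True, False, True]
```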


In [7]:
import numpy as np 
import pandas as pd 

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [9]:
print(pd.unique(values))
print(values.value_counts())  # values is already a Series, so no need to wrap it again

['apple' 'orange']
apple     6
orange    2
Name: count, dtype: int64


In [None]:
# Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation.
# In data warehousing, a best practice is to use so-called dimension tables that contain the distinct values, and to store the primary observations as integer keys referencing the dimension table.


# Summary: 
# To save space and work faster: 
#     don't repeat the same text over and over
#     instead, store each unique value once in a dimension table
#     in your main table, just store an integer code that points to that value 
#     this technique is used in data warehousing, pandas categorical data, and many analytics tools

- Data Systems: these are tools or platforms used to store, manage, and analyze data
    - databases (MySQL, PostgreSQL)
    - data warehouses (Amazon Redshift, Google BigQuery)
    - statistical tools (R, pandas in Python)

- Repeated Values: refers to data that appears multiple times in a dataset

- Efficient Storage and Computation: to avoid redundancy and save space and time, data systems try to:
    - store repeated values only once
    - use references or codes instead of full repeated strings
    - Why?
        - reduces memory usage
        - makes searching and comparing faster

- Data Warehousing: a system designed for analyzing large datasets (often from different sources) to support decision-making
    - Characteristics: 
        - stores historical data
        - used for reporting, analytics, BI (business intelligence)

- Dimension Tables: these are tables in a data warehouse that store unique values for certain attributes (like city_names, product_names, etc)
    - Instead of storing full strings repeatedly, we store: 
        - a unique ID (like 1,2,3,...)
        - the corresponding string once 
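
A minimal sketch of building such a dimension table from raw strings, assuming `pd.factorize` as the encoding step (the notes themselves do not name a function for this). It returns integer codes plus the distinct values in order of first appearance.

```python
import pandas as pd

raw = pd.Series(["apple", "orange", "apple", "apple"] * 2)

# codes: one integer per row; uniques: the distinct values (the "dimension table").
codes, uniques = pd.factorize(raw)
print(codes)     # [0 1 0 0 0 1 0 0]
print(uniques)

# Looking each code up in the dimension table restores the original strings.
restored = pd.Series(uniques.take(codes))
print(restored.tolist() == raw.tolist())   # True
```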

In [14]:
values = pd.Series([0,1,0,0] * 2)
dim = pd.Series(['apple', 'orange']) # dim[0] -> apple , dim[1] -> orange

print(values)
print(dim)

dim.take(values)    # dim.take([0,1,0,0,0,1,0,0])

# dim.take(values) returns a new Series -- it's a copy not a view, and it is not stored anywhere unless we assign it

0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64
0     apple
1    orange
dtype: object


0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object


#### **Key Points on Categorical (Dictionary-Encoded) Representation in pandas**

1. **Categorical Representation**:
   - Data is stored as **integer codes** referring to a **set of distinct values** (called *categories*).
   - This is also known as **dictionary-encoded representation**.

2. **Terminology**:
   - The **distinct values**: called *categories*, *dictionary*, or *levels*.
   - The **integers** referencing them: called *category codes* or simply *codes*.

3. **Performance**:
   - Categorical representation is **memory-efficient** and **faster** for certain operations.
   - Useful for **analytics** where values repeat often (e.g., gender, region, product types).

4. **Transformations on Categories**:
   - You can change the **categories** without modifying the **underlying codes**.
   - Common low-cost transformations include:
     - ✅ **Renaming categories**
     - ✅ **Appending new categories** (while keeping existing order/positions unchanged)
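
The two low-cost transformations can be sketched with `rename_categories` and `add_categories`. The labels below are made up for illustration; the point is that the integer codes stay the same in both cases.

```python
import pandas as pd

cat = pd.Categorical(["apple", "orange", "apple"])
original_codes = cat.codes.copy()

# Renaming only rewrites the lookup table; the codes are untouched.
renamed = cat.rename_categories({"apple": "APPLE", "orange": "ORANGE"})

# New categories are appended at the end, so existing positions keep their codes.
extended = cat.add_categories(["banana"])

print(list(renamed.categories))    # ['APPLE', 'ORANGE']
print(list(extended.categories))   # ['apple', 'orange', 'banana']
print((renamed.codes == original_codes).all(), (extended.codes == original_codes).all())
```

Both methods return a new `Categorical` and leave `cat` unchanged.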


--- 

## [ Categorical Extension Type in pandas ]
- pandas provides a **special data type** called `Categorical` for handling data using **integer-based encoding**.
- It stores repeated values as **integer codes**, pointing to a list of **unique categories** (like a lookup table).
- **Why it's useful**:
   - ✅ Saves **memory**
   - ✅ Offers **faster performance**
   - ✅ Is especially helpful for **string data with many repeated values**
- Great for data like *city names*, *product types*, *status labels* (like 'Yes', 'No'), etc.


In [16]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
N = len(fruits)

rng = np.random.default_rng(seed=12345)

df = pd.DataFrame({'fruit': fruits,
                     'basket_id': np.arange(N),
                     'count': rng.integers(3, 15, size=N),
                     'weight': rng.uniform(0, 4, size=N)},
                    columns=['basket_id', 'fruit', 'count', 'weight'])
df 

   basket_id   fruit  count    weight
0          0   apple     11  1.564438
1          1  orange      5  1.331256
2          2   apple     12  2.393235
3          3   apple      6  0.746937
4          4   apple      5  2.691024
5          5  orange     12  3.767211
6          6   apple     10  0.992983
7          7   apple     11  3.795525


In [17]:
# here df['fruit'] is an array of Python string objects; we can convert it to categorical by calling astype('category')
fruit_cat = df['fruit'].astype('category')
fruit_cat # series object

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [21]:
# the values for fruit_cat are now an instance of pandas.Categorical, which you can access via the .array attribute

c = fruit_cat.array
type(c)

# s.array returns the actual data container behind the series

pandas.core.arrays.categorical.Categorical

In [24]:
# the categorical object has categories and codes attributes
print(c.categories)
print(c.codes)
# fruit_cat.categories would raise AttributeError: a Series has no 'categories' attribute

# on a Series, reach these through the cat accessor: fruit_cat.cat.categories and fruit_cat.cat.codes

Index(['apple', 'orange'], dtype='object')
[0 1 0 0 0 1 0 0]
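
A short sketch of the `cat` accessor route, rebuilding the same fruit Series so the snippet runs on its own:

```python
import pandas as pd

fruit_cat = pd.Series(["apple", "orange", "apple", "apple"] * 2, name="fruit").astype("category")

# The cat accessor exposes the Categorical's attributes directly on the Series.
categories = fruit_cat.cat.categories   # Index of distinct labels
codes = fruit_cat.cat.codes             # Series of integer codes, aligned with fruit_cat

print(list(categories))   # ['apple', 'orange']
print(codes.tolist())     # [0, 1, 0, 0, 0, 1, 0, 0]
```

Note that `fruit_cat.cat.codes` is a Series sharing the original index, whereas `fruit_cat.array.codes` is a plain NumPy array.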


In [25]:
# trick to get a mapping between codes and categories is 
dict(enumerate(c.categories))

# what integer code corresponds to which category label

{0: 'apple', 1: 'orange'}

In [32]:
# we can convert a DataFrame column to categorical 
df['fruit'] = df['fruit'].astype('category')
df['fruit']

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): ['apple', 'orange']

In [35]:
# we can also create pandas.Categorical directly from other types of Python sequences
my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories

['foo', 'bar', 'baz', 'foo', 'bar']
Categories (3, object): ['bar', 'baz', 'foo']

In [38]:
# if you have obtained categorically encoded data from another source, you can use the alternative from_codes constructor

categories = ['foo', 'bar', 'baz']
codes = [0,2,1,2,2,1,0,0,1,2]

res = pd.Categorical.from_codes(codes, categories) 
print(res)

# unless explicitly specified, categorical conversions assume no specific ordering of the categories,
# so the categories array may be in a different order depending on the ordering of the input data.
# when using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering

ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
print(ordered_cat)

# the values print the same, but the categories line now reads ['foo' < 'bar' < 'baz'], showing that the categories are ordered

# when to use ordered: 
    # the categories have a natural ranking 
    # you need to compare values or sort them in a meaningful order

['foo', 'baz', 'bar', 'baz', 'baz', 'bar', 'foo', 'foo', 'bar', 'baz']
Categories (3, object): ['foo', 'bar', 'baz']
['foo', 'baz', 'bar', 'baz', 'baz', 'bar', 'foo', 'foo', 'bar', 'baz']
Categories (3, object): ['foo' < 'bar' < 'baz']
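
To make the effect of `ordered=True` more concrete, the sketch below reuses the same codes and categories and inspects the `ordered` flag. The variable names `unordered`, `ordered`, and `promoted` are chosen here for illustration.

```python
import pandas as pd

categories = ["foo", "bar", "baz"]
codes = [0, 2, 1, 2, 2, 1, 0, 0, 1, 2]

unordered = pd.Categorical.from_codes(codes, categories)
ordered = pd.Categorical.from_codes(codes, categories, ordered=True)

print(unordered.ordered)   # False
print(ordered.ordered)     # True

# With an order defined, min and max follow category position: foo < bar < baz.
print(ordered.min(), ordered.max())   # foo baz

# An unordered categorical can be promoted later without rebuilding it.
promoted = unordered.as_ordered()
print(promoted.ordered)    # True
```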


Categorical data need not be strings; a categorical array can consist of any immutable value type.
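
As a small illustration of non-string categories, the sketch below builds categoricals from integers and from timestamps. The values are arbitrary examples.

```python
import pandas as pd

# Integer categories: the distinct values are sorted to form the category table.
int_cat = pd.Categorical([3, 1, 3, 2, 1])
print(list(int_cat.categories))   # [1, 2, 3]
print(int_cat.codes.tolist())     # [2, 0, 2, 1, 0]

# Timestamp categories work the same way.
date_cat = pd.Categorical(pd.to_datetime(["2024-01-01", "2024-06-01", "2024-01-01"]))
print(len(date_cat.categories))   # 2
print(date_cat.codes.tolist())    # [0, 1, 0]
```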

## [ Computations with Categories ]