#### Part 23: Working with Categorical Data in Pandas

In this notebook, we'll explore:
- Handling missing values (continued)
- Working with categorical data
- Accessing and manipulating categorical data

##### Setup
First, let's import the necessary libraries:

In [1]:
import pandas as pd
import numpy as np

##### 1. Handling Missing Values (Continued)

### 1.1 Handling Missing Values in Boolean Arrays

A boolean Series stored with NumPy's `bool` dtype cannot hold missing values, so introducing NaN (for example by reindexing) silently converts it to `object` dtype. Let's see what happens:

In [2]:
# Create a Series with random values
s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7])
print("Original Series:")
print(s)

# Create a boolean Series
bool_series = s > 0
print("\nBoolean Series:")
print(bool_series)
print("Data type:", bool_series.dtype)

# Reindex the boolean Series to introduce missing values
crit = bool_series.reindex(list(range(8)))
print("\nReindexed Boolean Series:")
print(crit)
print("Data type:", crit.dtype)

Original Series:
0   -0.835941
2   -1.475607
4   -1.522111
6    0.658046
7    0.194612
dtype: float64

Boolean Series:
0    False
2    False
4    False
6     True
7     True
dtype: bool
Data type: bool

Reindexed Boolean Series:
0    False
1      NaN
2    False
3      NaN
4    False
5      NaN
6     True
7     True
dtype: object
Data type: object


If we try to use this boolean Series with NAs for indexing, we'll get an error:

In [3]:
# Reindex the original Series and fill missing values with 0
reindexed = s.reindex(list(range(8))).fillna(0)
print("Reindexed Series with filled NAs:")
print(reindexed)

# Try to use the boolean Series with NAs for indexing
try:
    result = reindexed[crit]
    print(result)
except ValueError as e:
    print(f"Error: {e}")

Reindexed Series with filled NAs:
0   -0.835941
1    0.000000
2   -1.475607
3    0.000000
4   -1.522111
5    0.000000
6    0.658046
7    0.194612
dtype: float64
Error: Cannot mask with non-boolean array containing NA / NaN values


However, we can fill the NAs in the boolean Series before using it for indexing:

In [4]:
# Fill NAs with False
print("Using boolean Series with NAs filled with False:")
print(reindexed[crit.fillna(False)])

# Fill NAs with True
print("\nUsing boolean Series with NAs filled with True:")
print(reindexed[crit.fillna(True)])

Using boolean Series with NAs filled with False:
6    0.658046
7    0.194612
dtype: float64

Using boolean Series with NAs filled with True:
1    0.000000
3    0.000000
5    0.000000
6    0.658046
7    0.194612
dtype: float64
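As an alternative to `fillna`, the mask can be kept in pandas' nullable `"boolean"` dtype, in which case missing entries are treated as `False` during indexing. The following is a minimal sketch with fixed values instead of random ones; the variable names are illustrative only.

```python
import pandas as pd

# Fixed values so the result is reproducible (the notebook uses random data).
s = pd.Series([-0.8, -1.4, -1.5, 0.6, 0.2], index=[0, 2, 4, 6, 7])

# Convert to the nullable boolean dtype *before* reindexing,
# so the gaps become <NA> instead of forcing object dtype.
mask = (s > 0).astype("boolean").reindex(range(8))
reindexed = s.reindex(range(8)).fillna(0)

# With a nullable boolean mask, <NA> entries are treated as False.
selected = reindexed[mask]
print(mask.dtype)
print(selected)
```

This selects the same rows as `crit.fillna(False)` without an explicit fill step.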




### 1.2 Nullable Integer Data Type

Pandas provides a nullable integer dtype, but you must request it explicitly when creating the Series or column. Note the capital "I" in `dtype="Int64"`; lowercase `"int64"` is the regular NumPy integer dtype, which cannot hold missing values.

In [5]:
# Create a Series with nullable integer dtype
s = pd.Series([0, 1, np.nan, 3, 4], dtype="Int64")
s

0       0
1       1
2    <NA>
3       3
4       4
dtype: Int64
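To see why the dtype has to be requested explicitly, the sketch below compares the default behaviour with the nullable dtype. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = [0, 1, np.nan, 3, 4]

# Without an explicit dtype, the NaN forces the integers to float64.
default = pd.Series(data)

# With dtype="Int64" the values stay integers and the gap becomes <NA>.
nullable = pd.Series(data, dtype="Int64")

# An existing float Series holding whole numbers can also be converted.
converted = default.astype("Int64")

print(default.dtype, nullable.dtype, converted.dtype)
```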

### 1.3 Experimental NA Scalar to Denote Missing Values

Starting from pandas 1.0, an experimental `pd.NA` value (singleton) is available to represent scalar missing values. It is used in the nullable integer, boolean, and dedicated string data types as the missing value indicator.

In [6]:
# Create a Series with nullable integer dtype
s = pd.Series([1, 2, None], dtype="Int64")
print(s)

# Check if the missing value is pd.NA
print(f"\ns[2] = {s[2]}")
print(f"s[2] is pd.NA: {s[2] is pd.NA}")

0       1
1       2
2    <NA>
dtype: Int64

s[2] = <NA>
s[2] is pd.NA: True


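In general, missing values propagate in operations involving `pd.NA`: when one of the operands is unknown, the outcome of the operation is also unknown. The exception is when the result is the same for every possible value of the missing operand, as with `pd.NA ** 0`, which is always 1: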

In [7]:
# Arithmetic operations with pd.NA
print(f"pd.NA + 1 = {pd.NA + 1}")
print(f"pd.NA * 2 = {pd.NA * 2}")
print(f"pd.NA ** 0 = {pd.NA ** 0}")

pd.NA + 1 = <NA>
pd.NA * 2 = <NA>
pd.NA ** 0 = 1
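The same rule applies to logical operations and comparisons: the result is known only when the other operand decides it on its own. A short sketch, assuming pandas 1.0 or later:

```python
import pandas as pd

# Logical operations: the result is known only if the other operand decides it.
or_true = pd.NA | True      # True regardless of the missing value
and_false = pd.NA & False   # False regardless of the missing value
and_true = pd.NA & True     # depends on the missing value, so it stays <NA>

# Comparisons propagate as well, so test for missingness with pd.isna().
equals_one = pd.NA == 1
is_missing = pd.isna(pd.NA)

print(or_true, and_false, and_true, equals_one, is_missing)
```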


##### 2. Working with Categorical Data

The categorical dtype in pandas corresponds to categorical variables in statistics: variables that take one of a limited, usually fixed, set of possible values. Storing such a column as categorical can save memory and improve performance.
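The memory saving comes from storing each value as a small integer code that points into the list of categories. The sketch below illustrates this with made-up labels; the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# 30,000 rows but only three distinct labels (illustrative data).
labels = pd.Series(["low", "medium", "high"] * 10_000)
as_category = labels.astype("category")

# deep=True also counts the memory held by the string values themselves.
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("plain:", plain_bytes, "bytes")
print("category:", category_bytes, "bytes")
print("codes dtype:", as_category.cat.codes.dtype)
```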

### 2.1 Creating Categorical Data

In [8]:
# Create a Categorical (an array, not yet a Series) with an unused category "c"
raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
print("Categorical Series:")
print(raw_cat)

# Create a DataFrame with a categorical column
df = pd.DataFrame({"A": raw_cat,
                  "B": ["c", "d", "c", "d"],
                  "values": [1, 2, 3, 4]})
print("\nDataFrame with categorical column:")
print(df)
print("\nData types:")
print(df.dtypes)

Categorical Series:
['a', 'a', 'b', 'b']
Categories (3, object): ['a', 'b', 'c']

DataFrame with categorical column:
   A  B  values
0  a  c       1
1  a  d       2
2  b  c       3
3  b  d       4

Data types:
A         category
B           object
values       int64
dtype: object


### 2.2 Grouping with Categorical Data

Categorical data can be used for grouping operations:

In [9]:
# Create a DataFrame with a categorical column
cats = pd.Categorical(["a", "b", "a", "c"], categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 1, 4]})
print("DataFrame:")
print(df)

# Group by the categorical column.
# observed=False keeps unused categories such as "d" in the result.
print("\nGroupby result:")
print(df.groupby("cats", observed=False).mean())

DataFrame:
  cats  values
0    a       1
1    b       2
2    a       1
3    c       4

Groupby result:
      values
cats        
a        1.0
b        2.0
c        4.0
d        NaN
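The `d` row appears because grouping on a categorical column can include categories that never occur in the data. The `observed` parameter of `groupby` controls this; since its default has changed between pandas versions, the sketch below passes it explicitly in both cases.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "a", "c"], categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 1, 4]})

# observed=False: one group per category, including the unused "d".
all_groups = df.groupby("cats", observed=False)["values"].mean()

# observed=True: only categories that actually appear in the data.
seen_groups = df.groupby("cats", observed=True)["values"].mean()

print(all_groups)
print(seen_groups)
```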




We can also use multiple groupby columns, including categorical ones:

In [10]:
# Create a DataFrame with a categorical column
cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df2 = pd.DataFrame({"cats": cats2,
                   "B": ["c", "d", "c", "d"],
                   "values": [1, 2, 3, 4]})
print("DataFrame:")
print(df2)

# Group by multiple columns.
# observed=False keeps the unused category "c" in the result.
print("\nGroupby result:")
print(df2.groupby(["cats", "B"], observed=False).mean())

DataFrame:
  cats  B  values
0    a  c       1
1    a  d       2
2    b  c       3
3    b  d       4

Groupby result:
        values
cats B        
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN




### 2.3 Pivot Tables with Categorical Data

In [11]:
# Create a DataFrame with a categorical column
raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"A": raw_cat,
                  "B": ["c", "d", "c", "d"],
                  "values": [1, 2, 3, 4]})
print("DataFrame:")
print(df)

# Create a pivot table
print("\nPivot table:")
print(pd.pivot_table(df, values='values', index=['A', 'B']))

DataFrame:
   A  B  values
0  a  c       1
1  a  d       2
2  b  c       3
3  b  d       4

Pivot table:
     values
A B        
a c     1.0
  d     2.0
b c     3.0
  d     4.0




### 2.4 Data Munging with Categorical Data

The optimized pandas data access methods `.loc`, `.iloc`, `.at`, and `.iat` work as normal with categorical data. The only differences are the return type when getting values and the restriction that only values already in `categories` can be assigned.

In [12]:
# Create a DataFrame with a categorical column
idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)
values = [1, 2, 2, 2, 3, 4, 5]
df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
print("DataFrame:")
print(df)

# Slicing with .iloc
print("\nSlicing with .iloc:")
print(df.iloc[2:4, :])
print("\nData types after slicing:")
print(df.iloc[2:4, :].dtypes)

# Slicing with .loc
print("\nSlicing with .loc:")
print(df.loc["h":"j", "cats"])

# Filtering
print("\nFiltering:")
print(df[df["cats"] == "b"])

DataFrame:
  cats  values
h    a       1
i    b       2
j    b       2
k    b       2
l    c       3
m    c       4
n    c       5

Slicing with .iloc:
  cats  values
j    b       2
k    b       2

Data types after slicing:
cats      category
values       int64
dtype: object

Slicing with .loc:
h    a
i    b
j    b
Name: cats, dtype: category
Categories (3, object): ['a', 'b', 'c']

Filtering:
  cats  values
i    b       2
j    b       2
k    b       2


If you take one single row, the resulting Series is of dtype object:

In [14]:
# Get a single row
print("Single row:")
row_h = df.loc["h", :]
print(row_h)
print(f"Data type: {row_h.dtype}")

# Get a single value
print("\nSingle value:")
print(f"df.iat[0, 0] = {df.iat[0, 0]}")

# Change category names - using rename_categories instead of direct assignment
df["cats"] = df["cats"].cat.rename_categories(["x", "y", "z"])
print("\nAfter changing category names:")
print(df)
print(f"df.at['h', 'cats'] = {df.at['h', 'cats']}")

Single row:
cats      a
values    1
Name: h, dtype: object
Data type: object

Single value:
df.iat[0, 0] = a

After changing category names:
  cats  values
h    x       1
i    y       2
j    y       2
k    y       2
l    z       3
m    z       4
n    z       5
df.at['h', 'cats'] = x
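`rename_categories` also accepts a dict, which is convenient when only some categories need new names. A minimal sketch on a small illustrative Series:

```python
import pandas as pd

s = pd.Series(["a", "b", "b", "c"], dtype="category")

# Categories missing from the mapping ("b" here) keep their names.
renamed = s.cat.rename_categories({"a": "x", "c": "z"})

print(renamed)
print(list(s.cat.categories))  # the original Series is left unchanged
```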


To get a single value Series of type category, you need to pass in a list with a single value:

In [15]:
# Get a single value Series of type category
print("Single value Series of type category:")
print(df.loc[["h"], "cats"])

Single value Series of type category:
h    x
Name: cats, dtype: category
Categories (3, object): ['x', 'y', 'z']
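The assignment restriction mentioned at the start of this section can be sketched as follows: assigning a value outside the existing categories raises a `TypeError`, and the category has to be added first. The Series and variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "b"], dtype="category")

# Assigning an existing category works as usual.
s.iloc[0] = "b"

# Assigning a value that is not a category raises TypeError.
raised = False
try:
    s.iloc[1] = "z"
except TypeError as err:
    raised = True
    print(f"Error: {err}")

# Add the category first, then the assignment succeeds.
s = s.cat.add_categories(["z"])
s.iloc[1] = "z"
print(s)
```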


### 2.5 String and Datetime Accessors

The accessors `.dt` and `.str` will work if the categories (`s.cat.categories`) are of an appropriate type:

In [16]:
# Create a string Series and convert to category
str_s = pd.Series(list('aabb'))
str_cat = str_s.astype('category')
print("Categorical string Series:")
print(str_cat)

# Use string accessor
print("\nUsing .str accessor:")
print(str_cat.str.upper())

Categorical string Series:
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): ['a', 'b']

Using .str accessor:
0    A
1    A
2    B
3    B
dtype: object
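The `.dt` accessor behaves the same way when the categories are datetimes. A brief sketch with illustrative dates:

```python
import pandas as pd

# A datetime Series converted to category; the categories are Timestamps.
dates = pd.Series(pd.to_datetime(["2024-01-15", "2024-01-15", "2024-03-01"]))
date_cat = dates.astype("category")

# .dt works because the categories are datetime-like.
months = date_cat.dt.month

print(date_cat.dtype)
print(months)
```

As with `.str`, the result is an ordinary Series rather than a categorical one.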


##### Summary

In this notebook, we've explored:

1. Advanced handling of missing values in pandas, including:
   - Working with missing values in boolean arrays
   - Using the nullable integer data type
   - Understanding the experimental `pd.NA` scalar

2. Working with categorical data in pandas, including:
   - Creating and manipulating categorical data
   - Grouping and pivoting with categorical data
   - Accessing categorical data with various methods
   - Using string and datetime accessors with categorical data

These techniques are essential for efficient data manipulation and analysis in pandas, especially when working with limited sets of possible values.