# Categorical Data

## Intro

See the [pandas documentation on categorical data](http://pandas.pydata.org/pandas-docs/stable/categorical.html).

Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.

Unlike categorical variables in statistics, pandas categorical data can have an order (e.g. ‘strongly agree’ vs. ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (addition, division, …) are not possible.

All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.
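The categories/codes layout described above can be inspected directly. The following is a minimal sketch; the variable names (`cat`, `rebuilt`) are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Categories are given explicitly, so their order is "b" then "a",
# not the lexical order of the values.
cat = pd.Categorical(["b", "a", np.nan, "b"], categories=["b", "a"])

# Each code is a position in cat.categories; -1 marks a missing value.
print(cat.codes)
print(cat.categories)

# Rebuild the original values by looking each code up in the categories array.
categories = list(cat.categories)
rebuilt = [categories[c] if c != -1 else np.nan for c in cat.codes]
print(rebuilt)
```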

The categorical data type is useful in the following cases:

* A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see [here](http://pandas.pydata.org/pandas-docs/stable/categorical.html#categorical-memory).
* The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see [here](http://pandas.pydata.org/pandas-docs/stable/categorical.html#categorical-sort).
* As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
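The first two use cases can be illustrated with a small sketch. The data here is made up for the illustration, and exact byte counts depend on the pandas version, so only the relative sizes are meaningful.

```python
import pandas as pd

# Use case 1: a string column with only a few distinct values.
words = pd.Series(["one", "two", "three"] * 1000, dtype=object)
as_cat = words.astype("category")

obj_bytes = words.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)  # the categorical version should be much smaller

# Use case 2: logical order differs from lexical order.
ordered = pd.Series(
    pd.Categorical(
        ["three", "one", "two"],
        categories=["one", "two", "three"],
        ordered=True,
    )
)

lexical = sorted(["three", "one", "two"])   # plain string sort
logical = ordered.sort_values().tolist()    # sort by category order
print(lexical)
print(logical)
print(ordered.min(), ordered.max())
```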


In [1]:
import pandas as pd
import numpy as np
print(pd.__version__)
print(np.__version__)

0.22.0
1.14.3


## Object Creation

### Series Creation

In [2]:
s = pd.Series(["a","b","c","a"], dtype="category")

In [3]:
s

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

In [4]:
df = pd.DataFrame((np.random.randn(25).reshape(5,5)))

In [5]:
df

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.192806 | -1.893924 | -1.075192 | 0.501513 | -1.196255 |
| 1 | -1.16827 | -0.698392 | -0.913904 | 2.213021 | 1.147565 |
| 2 | 1.751186 | 0.420273 | -1.74691 | -0.208617 | 2.894478 |
| 3 | -0.251419 | -1.781252 | 0.029791 | 0.470689 | 0.48608 |
| 4 | -0.540789 | -0.000982 | 0.860098 | 0.732898 | -1.465267 |


In [6]:
# Each distinct float value becomes its own category
df[0].astype('category')

0    0.192806
1   -1.168270
2    1.751186
3   -0.251419
4   -0.540789
Name: 0, dtype: category
Categories (5, float64): [-1.168270, -0.540789, -0.251419, 0.192806, 1.751186]

### cut()

In [7]:
df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})

In [8]:
df

|   | value |
|---|---|
| 0 | 53 |
| 1 | 80 |
| 2 | 15 |
| 3 | 45 |
| 4 | 72 |
| 5 | 76 |
| 6 | 98 |
| 7 | 85 |
| 8 | 4 |
| 9 | 49 |


In [9]:
# Bin the values into right-closed intervals (-10, 0], (0, 10], ..., (90, 100]
df['group'] = pd.cut(df.value, range(-10, 105, 10))

In [10]:
df

|   | value | group |
|---|---|---|
| 0 | 53 | (50, 60] |
| 1 | 80 | (70, 80] |
| 2 | 15 | (10, 20] |
| 3 | 45 | (40, 50] |
| 4 | 72 | (70, 80] |
| 5 | 76 | (70, 80] |
| 6 | 98 | (90, 100] |
| 7 | 85 | (80, 90] |
| 8 | 4 | (0, 10] |
| 9 | 49 | (40, 50] |


In [12]:
df.groupby('group').count()

| group | value |
|---|---|
| (-10, 0] | 0 |
| (0, 10] | 1 |
| (10, 20] | 2 |
| (20, 30] | 1 |
| (30, 40] | 0 |
| (40, 50] | 5 |
| (50, 60] | 1 |
| (60, 70] | 2 |
| (70, 80] | 5 |
| (80, 90] | 1 |

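Because the notebook uses random data, the binning behaviour is easier to check with fixed values. The sketch below reuses the same bin edges on a small hand-picked series (the values are an assumption for illustration) and shows that `pd.cut` produces an ordered categorical of right-closed intervals, including bins with no observations.

```python
import pandas as pd

values = pd.Series([0, 4, 10, 15, 98])
bins = list(range(-10, 105, 10))  # edges -10, 0, 10, ..., 100

groups = pd.cut(values, bins)

# Intervals are right-closed by default: 0 lands in (-10, 0], 10 lands in (0, 10].
print(groups)

# The result is an ordered categorical with one category per bin.
print(groups.cat.ordered)
print(len(groups.cat.categories))

# value_counts() reports every category, so empty bins show up with a count of 0.
counts = groups.value_counts()
print((counts == 0).sum())
```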

In [19]:
# "a" is not listed in categories, so it becomes NaN; "d" is an unused category
raw_cat = pd.Categorical(["a","b","c","a"], categories=["b","c","d"],
                         ordered=False)

In [20]:
s = pd.Series(raw_cat)

In [21]:
s

0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b, c, d]
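Values that are not among the categories silently become `NaN`, as the output above shows. The sketch below reproduces the same result through a different route, `Series.cat.set_categories`, and adds a simple pre-check for unexpected values; the names `allowed`, `unexpected` and `narrowed` are hypothetical and not from the original notebook.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
allowed = ["b", "c", "d"]

# Find values that would be lost before changing the categories.
unexpected = sorted(set(s) - set(allowed))
print(unexpected)

# Replacing the categories turns every value outside `allowed` into NaN.
narrowed = s.cat.set_categories(allowed)
print(narrowed)

# Missing entries carry the code -1; "d" stays as an unused category.
print(narrowed.cat.codes.tolist())
```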

In [22]:
df = pd.DataFrame({"A":["a","b","c","a"]})

In [23]:
df["B"] = raw_cat

In [24]:
df

|   | A | B |
|---|---|---|
| 0 | a | NaN |
| 1 | b | b |
| 2 | c | c |
| 3 | a | NaN |


### DataFrame Creation

In [25]:
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype="category")

In [26]:
df.dtypes

A    category
B    category
dtype: object

In [27]:
df

|   | A | B |
|---|---|---|
| 0 | a | b |
| 1 | b | c |
| 2 | c | c |
| 3 | a | d |


In [28]:
# Convert column B back from category to plain strings
df['B'] = df['B'].astype('str')

In [29]:
df.dtypes

A    category
B      object
dtype: object

## CategoricalDtype

In [30]:
from pandas.api.types import CategoricalDtype

In [31]:
CategoricalDtype(['a', 'b', 'c'])

CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
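A `CategoricalDtype` is mostly useful when passed to `astype` (or as a `dtype=` argument), so that the categories and their order are fixed up front rather than inferred from the data. A minimal sketch, with made-up size labels:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Declare the categories and their order once, independent of any data.
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

sizes = pd.Series(["medium", "small", "large", "small"]).astype(size_type)

print(sizes.dtype)
print(sizes.cat.categories)
print(sizes.max())  # uses the declared order, not the lexical order
```

The same dtype object can be reused for several columns, which keeps their categories consistent.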

## Description

In [38]:
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])

In [39]:
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})

In [40]:
df

|   | cat | s |
|---|---|---|
| 0 | a | a |
| 1 | c | c |
| 2 | c | c |
| 3 | NaN | NaN |


In [41]:
df.describe()

|   | cat | s |
|---|---|---|
| count | 3 | 3 |
| unique | 2 | 2 |
| top | c | c |
| freq | 2 | 2 |


In [42]:
df["cat"].describe()

count     3
unique    2
top       c
freq      2
Name: cat, dtype: object

## Working with categories

In [43]:
s = pd.Series(["a","b","c","a"], dtype="category")

In [44]:
s.cat.categories

Index(['a', 'b', 'c'], dtype='object')

In [45]:
s.cat.ordered

False

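Specifying the order of the categories matters: with `ordered=True`, the order in which `categories` is given defines how the values sort and compare.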

In [46]:
s = pd.Series(pd.Categorical(["a","b","c","a"], categories=["c","b","a"], ordered=True))

In [47]:
s.cat.categories

Index(['c', 'b', 'a'], dtype='object')

In [48]:
s.cat.ordered

True
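To see what the ordering changes in practice, the sketch below rebuilds the same ordered series and applies sorting, `min`/`max` and a comparison. With categories given as `["c", "b", "a"]`, "c" is the smallest value and "a" the largest.

```python
import pandas as pd

s = pd.Series(
    pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"], ordered=True)
)

# Sorting follows the category order c < b < a, not alphabetical order.
print(s.sort_values().tolist())

# min and max are only defined for ordered categoricals.
print(s.min(), s.max())

# Comparisons against a category also use the declared order.
greater_than_b = s > "b"
print(greater_than_b.tolist())
```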