# Categorical Data and the Dirichlet Discrete Distribution

---

Let's consider some examples of data with categorical variables

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_context('talk')
sns.set_style('darkgrid')

First, the passenger list of the Titanic

In [None]:
titanic = sns.load_dataset("titanic")

In [None]:
titanic.head(n=10)

One of the categorical variables in this dataset is `embark_town`

Let's plot the number of passengers departing from each town

In [None]:
# size() counts every passenger; count() on 'age' would skip rows with a missing age
ax = titanic.groupby('embark_town').size().plot(kind='bar')
plt.xticks(rotation=0)
plt.xlabel('Departure Town')
plt.ylabel('Passengers')
plt.title('Number of Passengers by Town of Departure')

Let's look at another example: the [cars93 dataset](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Cars93.html)

In [None]:
cars = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/MASS/Cars93.csv', index_col=0)

In [None]:
cars.head()

In [None]:
cars.loc[1]

This dataset has multiple categorical variables

Based on the description of the cars93 dataset, we'll consider `Manufacturer` and `DriveTrain` to be categorical variables

Let's plot `Manufacturer` and `DriveTrain`

In [None]:
cars.groupby('Manufacturer')['Model'].count().plot(kind='bar')
plt.ylabel('Cars')
plt.title('Number of Cars by Manufacturer')

In [None]:
cars.groupby('DriveTrain')['Model'].count().plot(kind='bar')
plt.ylabel('Cars')
plt.title('Number of Cars by Drive Train')

If our categorical data has labels, we need to convert them to integer IDs

In [None]:
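def col_2_ids(df, col):
    # Sorted unique labels; each label's position becomes its integer id
    ids = df[col].drop_duplicates().sort_values().reset_index(drop=True)
    ids.index.name = '%s_ids' % col
    ids = ids.reset_index()
    # Attach the id column by joining on the label column, then drop the labels
    df = pd.merge(df, ids, how='left', on=col)
    del df[col]
    return df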

In [None]:
cat_columns = ['Manufacturer', 'DriveTrain']

for c in cat_columns:
    print(c)
    cars = col_2_ids(cars, c)

In [None]:
cars[['%s_ids' % c for c in cat_columns]].head()
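
As a cross-check on the helper above, pandas can produce the same kind of integer codes directly through its categorical dtype. The sketch below uses a small hypothetical DataFrame (`toy`, not part of the cars93 data) and assumes that codes assigned in sorted label order are acceptable:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'DriveTrain': ['Front', 'Rear', '4WD', 'Front']})

# Categories are sorted, so codes follow sorted label order: 4WD=0, Front=1, Rear=2
toy['DriveTrain_ids'] = toy['DriveTrain'].astype('category').cat.codes

print(toy)
```

Unlike `col_2_ids`, this keeps the original label column, which can be handy for checking the mapping before dropping it.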

Just as we model binary data with the beta Bernoulli distribution, we can model categorical data with the Dirichlet discrete distribution

The beta Bernoulli distribution allows us to learn the underlying probability, $\theta$, of the binary random variable, $x$

$$P(x=1) =\theta$$
$$P(x=0) = 1-\theta$$

The Dirichlet discrete distribution extends the beta Bernoulli distribution to the case in which $x$ can assume more than two states

$$\forall i \in \{0, 1, \ldots, n\} \hspace{2mm} P(x = i) = \theta_i$$
$$\sum_{i=0}^n \theta_i = 1$$
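
To make the notation concrete, here is a minimal sketch that draws samples from a discrete distribution and compares the empirical frequencies with $\theta$. The values in `theta` are made up for illustration and the seed is arbitrary:

```python
import numpy as np

theta = np.array([0.5, 0.3, 0.2])   # assumed probabilities, must sum to 1
rng = np.random.RandomState(0)      # fixed seed for reproducibility

# Each draw is an integer id i with P(x = i) = theta[i]
samples = rng.choice(len(theta), size=10000, p=theta)

# Fraction of draws that landed in each category
freqs = np.bincount(samples, minlength=len(theta)) / float(len(samples))
print(freqs)
```

With this many draws the frequencies should land close to `theta`, though not exactly on it.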

Again, the Dirichlet discrete distribution takes advantage of the fact that the Dirichlet distribution and the discrete distribution are conjugate. Note that the discrete distribution is sometimes called the categorical distribution or, loosely, the multinomial distribution (strictly, a multinomial with a single trial).
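
Conjugacy means the posterior over $\theta$ is again a Dirichlet, with parameters equal to the prior parameters plus the observed count for each category. A minimal NumPy sketch of that update follows. It assumes a symmetric prior with $\alpha_i = 1$ and a small made-up set of observations, and it does not use the microscopes API:

```python
import numpy as np

n_categories = 3
alpha_prior = np.ones(n_categories)   # assumed symmetric Dirichlet prior

# Made-up observations, encoded as integer ids 0..n_categories-1
x = np.array([0, 2, 2, 1, 2, 0])

counts = np.bincount(x, minlength=n_categories)   # occurrences per category
alpha_post = alpha_prior + counts                 # conjugate update

# Posterior mean of theta, which is also the predictive probability of the next draw
theta_mean = alpha_post / alpha_post.sum()
print(theta_mean)
```

This is also why the data must be encoded as integer IDs first: the counts are indexed by category id.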

To import the Dirichlet discrete distribution, call

In [None]:
from microscopes.models import dd as dirichlet_discrete

Then, depending on the specific model we want, we import its model definition

`from microscopes.model_name.definition import model_definition`

**NOTE: You must specify the number of categories in your Dirichlet Discrete distribution**

For example, for `5` categories you must specify the likelihood as:

In [None]:
dd5 = dirichlet_discrete(5)

You can then use the model definition as appropriate for your desired model: