## Fundamentals of categorical data
Categorical data is data whose values come from a limited set of categories. Often these values are non-numeric. For example, the 'wnd_dir' column of the pollution dataset is categorical and gets imported as an object (i.e. string) column. This means we need some different tools to analyse this kind of data.

Consider the pollution dataset again:

In [1]:
import pandas as pd
# read the dataset, using the 'date' column as a datetime index (dates are day-first)
pollution_data = pd.read_csv('LSTM-Multivariate_pollution.csv', index_col='date', parse_dates=True, dayfirst=True)

To find all the categories in a column, you can call the <code>unique</code> method. For example, to find all the unique wind directions you would type:

In [2]:
pollution_data['wnd_dir'].unique()

array(['SE', 'cv', 'NW', 'NE'], dtype=object)
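A string column like this can also be converted to pandas' dedicated `category` dtype, which stores the set of categories alongside the values. The snippet below is a minimal sketch on a small made-up series standing in for 'wnd_dir' (the values are invented for illustration and are not taken from the dataset).

```python
import pandas as pd

# Hypothetical stand-in for the 'wnd_dir' column (values invented for illustration)
wnd_dir = pd.Series(['SE', 'cv', 'NW', 'SE', 'NE', 'NW'], name='wnd_dir')

# Convert the string column to the categorical dtype
wnd_cat = wnd_dir.astype('category')

print(wnd_cat.dtype)                  # category
print(list(wnd_cat.cat.categories))   # the distinct categories, in sorted order
```

When no explicit category list is given, the categories are inferred from the values and are unordered.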

To find how many times each category is observed, you can call the <code>value_counts</code> method.

In [3]:
pollution_data['wnd_dir'].value_counts()

wnd_dir
SE    15290
NW    14130
cv     9384
NE     4996
Name: count, dtype: int64
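If proportions are more useful than raw counts, `value_counts` accepts a `normalize` argument. The sketch below uses a small made-up series rather than the pollution dataset, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical wind directions (invented for illustration): 4 SE, 3 NW, 2 cv, 1 NE
wnd_dir = pd.Series(['SE'] * 4 + ['NW'] * 3 + ['cv'] * 2 + ['NE'], name='wnd_dir')

counts = wnd_dir.value_counts()                 # absolute frequency of each category
shares = wnd_dir.value_counts(normalize=True)   # relative frequency (fractions summing to 1)

print(counts)
print(shares)
```

As with the counts, the result is sorted from the most to the least frequent category.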

***
## Ordering categories
You have seen on an earlier page that grouped aggregates allow you to return an aggregate for each group that appears in a column. For example, if we wanted to see the mean pollution level for each month of the year, we could use the code below.

In [4]:
pollution_data['month'] = pollution_data.index.month_name()
pollution_data.groupby('month')['pollution'].mean()

month
April         79.175000
August        71.760484
December      93.475000
February     125.327423
January      108.054654
July          92.481183
June          91.285833
March         93.447849
May           77.722581
November     102.854167
October      115.837903
September     78.889722
Name: pollution, dtype: float64

By default, grouped aggregates are sorted by the group keys in ascending order. Since the keys here are month names represented as strings, they are presented in alphabetical order. In this case, however, the month names are ordered categories: the chronological order is Jan, Feb, ..., Nov, Dec. If we want the results displayed in this order, we can convert the month column to an ordered categorical data type.

In [5]:
# define category list in order
month_names = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 
               'August', 'September', 'October', 'November', 'December']

# convert column to an ordered categorical
pollution_data['month'] = pd.Categorical(pollution_data['month'], categories = month_names, ordered = True)

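# now our grouped aggregates should be shown in the correct order
# (observed is passed explicitly because recent pandas versions warn about its changing default)
pollution_data.groupby('month', observed=False)['pollution'].mean()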

month
January      108.054654
February     125.327423
March         93.447849
April         79.175000
May           77.722581
June          91.285833
July          92.481183
August        71.760484
September     78.889722
October      115.837903
November     102.854167
December      93.475000
Name: pollution, dtype: float64
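An ordered categorical affects more than grouped aggregates: sorting, comparisons, and `min`/`max` also follow the category order. The following is a minimal sketch on a small made-up series of month names (not the pollution dataset).

```python
import pandas as pd

month_names = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December']

# Hypothetical month values (invented for illustration), stored as an ordered categorical
months = pd.Series(pd.Categorical(['March', 'January', 'December', 'February'],
                                  categories=month_names, ordered=True))

# Sorting follows the category order, not alphabetical order
sorted_months = months.sort_values()
print(list(sorted_months))

# Comparisons against a category are allowed because the categorical is ordered
first_quarter = months[months <= 'March']
print(list(first_quarter))

# Earliest and latest month present in the series
print(months.min(), months.max())
```

With an unordered categorical, the comparison and the `min`/`max` calls would raise an error, so `ordered = True` is what makes these operations meaningful.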