### Categorical Data

Categoricals are a pandas data type that corresponds to categorical variables in statistics: variables that can take
on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social
class, blood type, country affiliation, observation time, or ratings on Likert scales.

In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs. ‘agree’ or
‘first observation’ vs. ‘second observation’), but numerical operations (addition, division, ...) are not possible.
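
To illustrate the point about order versus arithmetic, here is a minimal sketch. The `ratings` Series and its Likert-style labels are made up for illustration and are not part of the employee data used below.

```python
import pandas as pd

# Hypothetical Likert-style answers; the category order is given explicitly.
ratings = pd.Series(
    pd.Categorical(
        ["agree", "strongly agree", "disagree", "agree"],
        categories=["disagree", "agree", "strongly agree"],
        ordered=True,
    )
)

# Order-based operations are available on an ordered categorical.
highest = ratings.max()
above_disagree = (ratings > "disagree").tolist()

# Arithmetic is not defined for categorical data and raises a TypeError.
try:
    ratings + ratings
    arithmetic_failed = False
except TypeError:
    arithmetic_failed = True

print(highest, above_disagree, arithmetic_failed)
```

Comparisons such as `>` only work here because `ordered=True` was passed; an unordered categorical supports equality checks only.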

All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical
order of the values.
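
The following sketch shows both statements on a small, made-up Series (the values and the category order are assumptions chosen for illustration): missing values are allowed without being a category, and sorting follows the category order rather than the alphabet.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the categories are deliberately listed in reverse alphabetical order.
letters = pd.Series(
    pd.Categorical(
        ["b", "a", np.nan, "c"],
        categories=["c", "b", "a"],
        ordered=True,
    )
)

# np.nan is a valid value but does not show up among the categories.
n_missing = int(letters.isna().sum())
categories = list(letters.cat.categories)

# Sorting uses the category order (c, b, a), not lexical order (a, b, c).
sorted_values = letters.sort_values().dropna().tolist()

print(n_missing, categories, sorted_values)
```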

documentation: http://pandas.pydata.org/pandas-docs/stable/categorical.html

In [26]:
import pandas as pd
import numpy as np
file_name_string = './EmployeesWithGrades.xlsx'
employees_df = pd.read_excel(file_name_string, sheet_name='Sheet1', index_col=None, na_values=['NA'])

In [27]:
employees_df

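| Unnamed: 0 | Department | Name | YearsOfService | Grade |
|---|---|---|---|---|
| 0 | Marketing | Able | 4 | a |
| 1 | Engineering | Baker | 7 | b |
| 2 | Accounting | Charlie | 12 | c |
| 3 | Marketing | Delta | 1 | d |
| 4 | Engineering | Echo | 15 | f |
| 5 | Accounting | Foxtrot | 9 | a |
| 6 | Marketing | Golf | 3 | b |
| 7 | Engineering | Hotel | 1 | c |
| 8 | Accounting | India | 2 | d |
| 9 | Marketing | Juliet | 5 | f |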


In [28]:
employees_df["Grade"] 

0    a
1    b
2    c
3    d
4    f
5    a
6    b
7    c
8    d
9    f
Name: Grade, dtype: object

##### Change data type
Change the data type of the "Grade" column to category.

documentation for astype(): http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html

In [29]:
employees_df["Grade"] = employees_df["Grade"].astype("category")
employees_df["Grade"] 

0    a
1    b
2    c
3    d
4    f
5    a
6    b
7    c
8    d
9    f
Name: Grade, dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'f']
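
As a quick way to confirm what the conversion did, the sketch below inspects the inferred categories and the integer codes behind them. It uses a stand-alone `grade` Series that mirrors the grades in the table shown earlier instead of reading the Excel file, so the variable names are illustrative only.

```python
import pandas as pd

# Stand-in for employees_df["Grade"], built from the rows displayed above.
grade = pd.Series(["a", "b", "c", "d", "f", "a", "b", "c", "d", "f"], name="Grade")

grade_cat = grade.astype("category")

# The categories are inferred from the unique values and sorted.
categories = list(grade_cat.cat.categories)

# Each value is stored as an integer code pointing into the categories.
codes = grade_cat.cat.codes.tolist()

print(grade_cat.dtype.name, categories, codes)
```

The `.cat` accessor is only available once the column really has the category dtype, so checking `dtype.name` first is a cheap sanity test.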

##### Rename the categories
Rename the categories to more meaningful names. Series.cat.rename_categories() returns a new Series, so assign the result back to the column (recent pandas versions no longer allow assigning directly to Series.cat.categories).

In [30]:
employees_df["Grade"] = employees_df["Grade"].cat.rename_categories(["excellent", "good", "acceptable", "poor", "unacceptable"])
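
Instead of a full list of new names, `rename_categories` also accepts a mapping from old to new names, which makes the intent explicit. The sketch below uses a small stand-alone Series and a hypothetical `labels` dictionary rather than the employee data.

```python
import pandas as pd

# Illustrative categorical Series; not the employee data itself.
grade = pd.Series(["a", "b", "f"], dtype="category")

# Hypothetical mapping from letter grades to descriptive labels.
labels = {"a": "excellent", "b": "good", "f": "unacceptable"}

# rename_categories returns a new Series and leaves the original untouched.
renamed = grade.cat.rename_categories(labels)

print(renamed.tolist())
print(grade.tolist())
```

Because the call is not in place, the result has to be assigned back to the data frame column for the change to persist.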

##### Check the values in the data frame

In [None]:
employees_df

Tabulate Department, Name, and YearsOfService by Grade.

In [None]:
employees_df.groupby('Grade').count()
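
One detail worth knowing when grouping by a categorical column is the `observed` parameter of `groupby`. The sketch below uses a small, made-up frame (not the Excel data) in which category "c" never occurs. With `observed=False`, the unused category still appears in the result with a count of 0.

```python
import pandas as pd

# Hypothetical frame: "c" is a declared category but has no rows.
df = pd.DataFrame(
    {
        "Name": ["Able", "Baker", "Charlie"],
        "Grade": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
    }
)

# observed=False keeps every category in the result, even unused ones.
counts = df.groupby("Grade", observed=False).count()

print(counts)
```

Passing `observed=True` instead would drop the unused category from the result.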