# 2.11 Categorical Data

TODO

- data dictionaries (meanings of encoded values)
- replacing encoded values with actuals and vice-versa
- categorical data type in pandas
- ordinal
- link to modelling - e.g. one-hot encoding?


In [44]:
import pandas as pd
import numpy as np

df = pd.read_csv("data/pcs_2017.csv", na_values=["UNKNOWN", "NOT APPLICABLE"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 32 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   Program Category                  5000 non-null   object
 1   Region Served                     5000 non-null   object
 2   Age Group                         4999 non-null   object
 3   Sex                               4992 non-null   object
 4   Living Situation                  4764 non-null   object
 5   Household Composition             3886 non-null   object
 6   Preferred Language                4940 non-null   object
 7   Veteran Status                    4879 non-null   object
 8   Employment Status                 5000 non-null   object
 9   Number Of Hours Worked Each Week  789 non-null    object
 10  Education Status                  4546 non-null   object
 11  Special Education Services        871 non-null    object
 12  Mental Illness      

```{note}
This dataset encodes missing information with the string "UNKNOWN" and inapplicable questions with the string "NOT APPLICABLE". We've asked pandas to treat both as NaN (null) values by passing them to the `na_values` argument of `pd.read_csv`.
```

```{note}
The original source data has over 175,000 patients and more than 60 columns. We're using a smaller subset of the data here for teaching purposes.
```

In [23]:
df.head()

|  | Program Category | Region Served | Age Group | Sex | Living Situation | Household Composition | Preferred Language | Veteran Status | Employment Status | Number Of Hours Worked Each Week | ... | Smokes | Received Smoking Medication | Received Smoking Counseling | Serious Mental Illness | Principal Diagnosis Class | SSI Cash Assistance | SSDI Cash Assistance | Public Assistance Cash Program | Other Cash Benefits | Three Digit Residence Zip Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INPATIENT | HUDSON RIVER REGION | ADULT | FEMALE | INSTITUTIONAL SETTING | NaN | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | YES | MENTAL ILLNESS | NO | NO | NO | YES | 105 |
| 1 | SUPPORT | WESTERN REGION | CHILD | MALE | PRIVATE RESIDENCE | COHABITATES WITH OTHERS | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | YES | MENTAL ILLNESS | YES | NO | NO | NO | 138 |
| 2 | OUTPATIENT | WESTERN REGION | CHILD | FEMALE | PRIVATE RESIDENCE | COHABITATES WITH OTHERS | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | YES | MENTAL ILLNESS | NaN | NaN | NaN | NaN | 140 |
| 3 | OUTPATIENT | NEW YORK CITY REGION | CHILD | FEMALE | PRIVATE RESIDENCE | COHABITATES WITH OTHERS | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | NO | NaN | NO | NO | NO | NO | 113 |
| 4 | OUTPATIENT | LONG ISLAND REGION | CHILD | FEMALE | PRIVATE RESIDENCE | COHABITATES WITH OTHERS | ENGLISH | NO | NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ... | NaN | ... | NO | NO | NO | YES | MENTAL ILLNESS | NO | NO | NO | NO | 115 |


In [24]:
df.iloc[0]

Program Category                                                            INPATIENT
Region Served                                                     HUDSON RIVER REGION
Age Group                                                                       ADULT
Sex                                                                            FEMALE
Living Situation                                                INSTITUTIONAL SETTING
Household Composition                                                             NaN
Preferred Language                                                            ENGLISH
Veteran Status                                                                     NO
Employment Status                   NOT IN LABOR FORCE:UNEMPLOYED AND NOT LOOKING ...
Number Of Hours Worked Each Week                                                  NaN
Education Status                                         MIDDLE SCHOOL TO HIGH SCHOOL
Special Education Services                            

## Binary Data

In [28]:
df["Smokes"].value_counts()

NO     3387
YES    1387
Name: Smokes, dtype: int64
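
A column with only two categories, like `Smokes`, can be turned into ones and zeros for use in a model. The sketch below is one possible approach, shown on a small made-up series rather than the real data; the name `smokes_binary` is only illustrative. `Series.map` leaves missing values as NaN instead of forcing them to 0.

```python
import pandas as pd
import numpy as np

# Toy stand-in for df["Smokes"]; these values are made up for illustration
smokes = pd.Series(["NO", "YES", np.nan, "NO"], name="Smokes")

# Map the two categories to 1/0; values not in the mapping (including NaN) stay missing
smokes_binary = smokes.map({"YES": 1, "NO": 0})
print(smokes_binary)
```

Because the result contains a missing value, pandas stores it as floating point numbers rather than integers.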

## Ordinal Data

In [25]:
df["Number Of Hours Worked Each Week"].value_counts()

35 HOURS OR MORE            309
15-34 HOURS                 251
01-14 HOURS                 119
UNKNOWN EMPLOYMENT HOURS    110
Name: Number Of Hours Worked Each Week, dtype: int64

```{note}
Most people in this dataset are not currently employed: fewer than 700 of the 5000 people report working at least one hour per week. We would need to think carefully about how to encode this column. If we know someone is unemployed we could assign 0 hours, but what should we do for people whose employment status is unknown?
```
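
One option for ordinal data like this is an ordered pandas `Categorical`. The sketch below uses a small made-up series, and the category order in `hours_order` is our own assumption about how the bands should be ranked. It treats "UNKNOWN EMPLOYMENT HOURS" as missing, which is only one of several reasonable choices.

```python
import pandas as pd
import numpy as np

# Toy stand-in for df["Number Of Hours Worked Each Week"]; values are made up
hours = pd.Series(
    ["15-34 HOURS", "35 HOURS OR MORE", np.nan, "01-14 HOURS", "UNKNOWN EMPLOYMENT HOURS"]
)

# Categories listed from fewest to most hours (this ordering is our assumption)
hours_order = ["01-14 HOURS", "15-34 HOURS", "35 HOURS OR MORE"]

# Values not listed in hours_order (here "UNKNOWN EMPLOYMENT HOURS") become NaN
hours_cat = pd.Series(pd.Categorical(hours, categories=hours_order, ordered=True))

# Integer codes follow the order above; missing values get the code -1
print(hours_cat.cat.codes.tolist())

# Ordered categoricals can be compared against one of their categories
works_15_plus = hours_cat >= "15-34 HOURS"
print(works_15_plus.tolist())
```

The codes could be passed to a model as a single numeric feature, but the -1 used for missing values would need handling first.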

In [27]:
df["Education Status"].value_counts()

MIDDLE SCHOOL TO HIGH SCHOOL    2585
SOME COLLEGE                     737
COLLEGE OR GRADUATE DEGREE       699
PRE-K TO FIFTH GRADE             433
OTHER                             68
NO FORMAL EDUCATION               24
Name: Education Status, dtype: int64

## Data with Multiple Categories and No Natural Order

In [31]:
df["Preferred Language"].value_counts()

ENGLISH                     4459
SPANISH                      359
INDO-EUROPEAN                 55
ASIAN AND PACIFIC ISLAND      27
ALL OTHER LANGUAGES           26
AFRO-ASIATIC                  14
Name: Preferred Language, dtype: int64

In [34]:
pd.get_dummies(df["Preferred Language"])

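|  | AFRO-ASIATIC | ALL OTHER LANGUAGES | ASIAN AND PACIFIC ISLAND | ENGLISH | INDO-EUROPEAN | SPANISH |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 4995 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4996 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4997 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4998 | 0 | 0 | 0 | 1 | 0 | 0 |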


In [42]:
prefer_english = (df["Preferred Language"] == "ENGLISH").astype(int)
prefer_english.value_counts()

1    4459
0     541
Name: Preferred Language, dtype: int64

```{note}
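The `astype(int)` call above converts the boolean series of True/False values into a series of ones (True) and zeros (False), which may be more suitable as input to a model. The 60 rows with a missing preferred language compare as False, so they are counted among the 541 zeros alongside the 481 people who prefer another language.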
```
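
Missing values need some care when building indicator columns, because comparing NaN with a string gives False. The sketch below, on a small made-up series, shows two possible ways to keep track of missing rows: masking them back to NaN with `where`, or asking `pd.get_dummies` for an extra indicator column with `dummy_na=True`. The variable names are illustrative only.

```python
import pandas as pd
import numpy as np

# Toy stand-in for df["Preferred Language"]; values are made up
language = pd.Series(["ENGLISH", "SPANISH", np.nan, "ENGLISH"])

# Naive indicator: the missing row compares as False and becomes 0
naive = (language == "ENGLISH").astype(int)

# Keep missing rows as missing by masking out rows where the language is unknown
prefer_english = (language == "ENGLISH").astype(float).where(language.notna())

# One-hot encoding with an additional column that flags missing values
dummies = pd.get_dummies(language, dummy_na=True)

print(naive.tolist())
print(prefer_english.tolist())
print(dummies)
```

Depending on the pandas version, `pd.get_dummies` returns either 0/1 integers or True/False values; `astype(int)` converts the latter if a model needs numbers.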

## Data Dictionaries
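
A data dictionary records what each encoded value in a dataset means. As a minimal sketch, assume a hypothetical column that stores sex as integer codes. The codes and the name `sex_dictionary` are invented for illustration and are not taken from this dataset's documentation. A Python dict can then translate codes to their meanings and back again:

```python
import pandas as pd

# Hypothetical data dictionary: code -> meaning (invented for illustration)
sex_dictionary = {1: "MALE", 2: "FEMALE"}

# Toy column of encoded values
sex_codes = pd.Series([2, 1, 2, 2])

# Replace encoded values with their meanings
sex_labels = sex_codes.map(sex_dictionary)

# Invert the dictionary to go from meanings back to codes
label_to_code = {label: code for code, label in sex_dictionary.items()}
sex_codes_again = sex_labels.map(label_to_code)

print(sex_labels.tolist())
print(sex_codes_again.tolist())
```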

## Pandas Categorical Type
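
pandas has a dedicated `category` dtype that stores each distinct value once and represents the column internally as integer codes. The sketch below converts a small made-up series (standing in for a column such as `Region Served`) and inspects the result through the `.cat` accessor.

```python
import pandas as pd

# Toy stand-in for df["Region Served"]; values are made up
region = pd.Series(["WESTERN REGION", "HUDSON RIVER REGION", "WESTERN REGION"])

# Convert to the categorical dtype; categories are inferred from the unique values
region_cat = region.astype("category")

# The distinct categories, and the integer code stored for each row
print(region_cat.cat.categories.tolist())
print(region_cat.cat.codes.tolist())
```

When categories are inferred like this they are unordered; an explicit order can be supplied with `pd.Categorical(..., ordered=True)` when the data is ordinal.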