# STAT1100 Data Communication and Modelling

## Categorising Variables

In [1]:
import pandas as pd
print(pd.__version__)

2.2.1


### Heart data set

The `heart` data set is from the text: Wild and Seber (2000). *Chance Encounters: A First Course in Data Analysis and Inference.* Wiley, New York. Page 39. The data are a subset of data from a study of male heart attack patients admitted to a hospital in Auckland, New Zealand. The variables in the table below were measured and recorded for each patient.

Variable | Description
---|---
ID | A patient identifier instead of a name to protect a patient’s privacy.
EJEC | Ejection fraction, the percentage of blood in the left ventricle of the heart ejected in one beat.
SYSVOL | End-systolic volume, a measure of the size of the heart. This is calculated from a two dimensional silhouette picture of the left ventricle at its smallest part.
DIAVOL | End-diastolic volume. The same as SYSVOL, except that the largest silhouette is used.
OCCLU | Occlusion score: the percentage of the myocardium of the left ventricle supplied by arteries that are totally blocked.
STEN | Stenosis score: the percentage supplied by arteries that are significantly narrowed but not completely blocked.
TIME | Time in months from when the patient was admitted until OUTCOME occurs.
OUTCOME | Coded variable with the following levels: 0 = alive at last follow up, 1 = sudden cardiac death, 2 = death within 30 days of heart attack, 3 = death from heart failure, 4 = death during or after coronary surgery, 5 = noncardiac death.
AGE | The age in years of the patient at admission.
SMOKE | Whether the patient continued to smoke: 1 = yes, 2 = no.
BETA | Whether the patient was taking drugs called beta blockers: 1 = yes, 2 = no.
CHOL | Blood cholesterol measured in millimoles per litre.
SURG | Whether the patient had surgery, including reason for surgery: 0 = no surgery, 1 = surgery as part of trial, 2 = surgery for symptoms within one year, 3 = surgery for symptoms within one to five years, 4 = surgery for symptoms after five years.

When reading the data set into a data frame, we pass a dictionary of (variable, dtype) pairs to the `dtype` parameter of `read_csv()` to cast the categorical variables *OUTCOME*, *SMOKE*, *BETA*, and *SURG* to the `category` dtype.

In [2]:
cat_vars = ["OUTCOME", "SMOKE", "BETA", "SURG"]
# Map each categorical variable name to the "category" dtype
dtypes = {v: "category" for v in cat_vars}
dtypes

{'OUTCOME': 'category',
 'SMOKE': 'category',
 'BETA': 'category',
 'SURG': 'category'}

In [3]:
heart = pd.read_csv("data/heart.csv", dtype=dtypes)
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   ID       45 non-null     int64   
 1   EJEC     45 non-null     int64   
 2   SYSVOL   45 non-null     int64   
 3   DIAVOL   45 non-null     int64   
 4   OCCLU    45 non-null     int64   
 5   STEN     45 non-null     int64   
 6   TIME     45 non-null     int64   
 7   OUTCOME  45 non-null     category
 8   AGE      45 non-null     int64   
 9   SMOKE    45 non-null     category
 10  BETA     45 non-null     category
 11  CHOL     32 non-null     float64 
 12  SURG     45 non-null     category
dtypes: category(4), float64(1), int64(8)
memory usage: 3.9 KB
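
The effect of the `dtype` dictionary can be reproduced without the course data file. The following is a minimal sketch that reads a small made-up CSV from memory; the rows in `csv_text` are hypothetical and are not taken from `heart.csv`. It also illustrates why a column with empty cells, like *CHOL* above, is read as `float64` with fewer non-null values.

```python
import io
import pandas as pd

# Hypothetical three-row CSV (not the heart data); the second row has no CHOL value
csv_text = "ID,AGE,SMOKE,CHOL\n1,49,yes,5.9\n2,54,no,\n3,56,no,6.1\n"

# Same pattern as above: map each categorical column name to "category"
toy_cat_vars = ["SMOKE"]
toy_dtypes = {v: "category" for v in toy_cat_vars}

toy = pd.read_csv(io.StringIO(csv_text), dtype=toy_dtypes)

print(toy.dtypes)                # SMOKE is category, CHOL is float64
print(toy["CHOL"].isna().sum())  # number of missing CHOL values
```

Columns that are not named in the dictionary keep the dtype that pandas infers for them.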


In [4]:
for v in cat_vars:
    print(heart[v].cat.categories)

Index(['alive at last follow-up', 'death from heart failure',
       'death within 30 days of heart attack', 'non-cardiac death',
       'sudden cardiac death'],
      dtype='object')
Index(['no', 'yes'], dtype='object')
Index(['no', 'yes'], dtype='object')
Index(['no surgery', 'surgery as part of trial',
       'surgery for symptoms after 5 years',
       'surgery for symptoms within 1 to 5 years',
       'surgery for symptoms within 1 year'],
      dtype='object')
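
The categories printed above are listed in alphabetical order, not in the order of the coding given in the variable table. The sketch below, which uses a few made-up values rather than the heart data, shows that `astype("category")` sorts the inferred categories and that the integer codes follow that order. It also shows one way to impose a chosen order with `CategoricalDtype`; the order used here is only an illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy values (not the heart data) that reuse three of the SURG labels
surg = pd.Series([
    "surgery for symptoms within 1 year",
    "no surgery",
    "surgery for symptoms after 5 years",
    "no surgery",
])

# Default behaviour: categories are the sorted unique values
default_cat = surg.astype("category")
print(list(default_cat.cat.categories))
print(default_cat.cat.codes.tolist())   # [2, 0, 1, 0]

# Explicit, ordered categories: the codes now follow the order we chose
order = [
    "no surgery",
    "surgery for symptoms within 1 year",
    "surgery for symptoms after 5 years",
]
explicit_cat = surg.astype(CategoricalDtype(categories=order, ordered=True))
print(explicit_cat.cat.codes.tolist())  # [1, 0, 2, 0]
```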


### Categorising a numerical variable

Consider the continuous variable *AGE* in years and suppose that we want to categorise it into age groups. The function [cut()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) bins values into discrete intervals. The following code creates a new *ordinal* categorical variable that indicates which decade each patient's age falls into. Setting the parameter `right=False` returns intervals that are left-closed and right-open, so an age of 50 is placed in `[50, 60)` rather than in `(40, 50]`. Custom labels can be specified using the `labels` parameter.

In [5]:
bins = [10*i for i in range(13)]
print(bins)

[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]


In [6]:
heart["DECADE"] = pd.cut(heart["AGE"], bins, right=False)
print(heart["DECADE"].cat.categories)
heart[["AGE","DECADE"]].head()

IntervalIndex([   [0, 10),   [10, 20),   [20, 30),   [30, 40),   [40, 50),
                 [50, 60),   [60, 70),   [70, 80),   [80, 90),  [90, 100),
               [100, 110), [110, 120)],
              dtype='interval[int64, left]')


| | AGE | DECADE |
|---|---|---|
| 0 | 49 | [40, 50) |
| 1 | 54 | [50, 60) |
| 2 | 56 | [50, 60) |
| 3 | 42 | [40, 50) |
| 4 | 46 | [40, 50) |
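
As a sketch of the `labels` parameter mentioned above, the following bins the five ages shown in the output into the same decades but with custom labels. The label format (`"40s"`, `"50s"`, and so on) is a hypothetical choice made for this illustration. The last two lines show how `right=False` affects a value that sits exactly on a bin edge.

```python
import pandas as pd

ages = pd.Series([49, 54, 56, 42, 46])   # the five AGE values shown above
bins = [10 * i for i in range(13)]

# One label per interval, so len(labels) must equal len(bins) - 1
labels = [f"{lower}s" for lower in bins[:-1]]   # "0s", "10s", ..., "110s"

decade = pd.cut(ages, bins, right=False, labels=labels)
print(decade.tolist())       # ['40s', '50s', '50s', '40s', '40s']
print(decade.cat.ordered)    # True: cut() returns an ordered categorical

# An age exactly on a bin edge: left-closed versus the default right-closed
edge_left_closed = pd.cut(pd.Series([50]), bins, right=False, labels=labels).iloc[0]
edge_right_closed = pd.cut(pd.Series([50]), bins, labels=labels).iloc[0]
print(edge_left_closed, edge_right_closed)   # 50s 40s
```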


### Collapsing categories

Consider the coded categorical variable *SURG*.

In [7]:
print(heart["SURG"].cat.categories)
heart["SURG"].cat.codes.value_counts()

Index(['no surgery', 'surgery as part of trial',
       'surgery for symptoms after 5 years',
       'surgery for symptoms within 1 to 5 years',
       'surgery for symptoms within 1 year'],
      dtype='object')


0    32
1     5
4     4
2     2
3     2
Name: count, dtype: int64

Suppose that we wish to create a new coded categorical variable *SURG_SIMPLE* with fewer categories according to the following table:

| Code | Description |
|---|---|
| 0 | no surgery |
| 1 | surgery as part of trial |
| 2 | surgery for symptoms |

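For convenience we create a variable *SURG_CAT* that contains the *SURG* variable codes. Note that these codes follow the alphabetical order of the category labels printed above, so they agree with the coding in the variable table only for codes 0 and 1; codes 2 to 4 all correspond to "surgery for symptoms" categories, which is all that the function below relies on. We then create a function that collapses the categories of the *SURG* variable: given a *SURG* variable category code, it returns the corresponding new category code.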

In [8]:
heart["SURG_CAT"] = heart["SURG"].cat.codes

def surg_simple(x):
    return x if x in [0,1] else 2

# Test the function
for i in range(5):
    print(i, surg_simple(i))

0 0
1 1
2 2
3 2
4 2


Now we can use the [apply()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method to invoke the function for each value of the variable *SURG_CAT*.

In [9]:
heart["SURG_SIMPLE"] = heart["SURG_CAT"].apply(surg_simple)
# Check counts are consistent with those for SURG
heart["SURG_SIMPLE"].value_counts()

SURG_SIMPLE
0    32
2     8
1     5
Name: count, dtype: int64
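
When the collapsing rule is a simple lookup, a dictionary passed to `Series.map()` is an alternative to writing a function for `apply()`. The sketch below uses a short made-up series of codes rather than the heart data, and the names `collapse` and `new_labels` are introduced here only for illustration. Any code missing from the dictionary would become `NaN`, which makes an incomplete mapping easy to spot.

```python
import pandas as pd

# Toy category codes (not the heart data)
codes = pd.Series([0, 1, 4, 2, 3, 0])

# Old code -> new code: 0 and 1 are kept, 2 to 4 collapse to 2
collapse = {0: 0, 1: 1, 2: 2, 3: 2, 4: 2}
simple = codes.map(collapse)
print(simple.tolist())   # [0, 1, 2, 2, 2, 0]

# New code -> descriptive label, then convert to a categorical variable
new_labels = {
    0: "no surgery",
    1: "surgery as part of trial",
    2: "surgery for symptoms",
}
simple_cat = simple.map(new_labels).astype("category")
print(list(simple_cat.cat.categories))
```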

Finally we convert the variable *SURG_SIMPLE* to a coded categorical variable and check for errors.

In [10]:
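heart["SURG_SIMPLE"] = heart["SURG_SIMPLE"].astype('category')
# rename_categories() returns a new Series, so assign the result back
heart["SURG_SIMPLE"] = heart["SURG_SIMPLE"].cat.rename_categories(["no surgery", "surgery as part of trial", "surgery for symptoms"])
heart[["SURG","SURG_SIMPLE"]].head(2)

| | SURG | SURG_SIMPLE |
|---|---|---|
| 0 | no surgery | no surgery |
| 1 | surgery as part of trial | surgery as part of trial |


In [11]:
heart.loc[heart["SURG"].str.contains("symptoms"), ["SURG","SURG_SIMPLE"]]

| | SURG | SURG_SIMPLE |
|---|---|---|
| 5 | surgery for symptoms within 1 year | surgery for symptoms |
| 13 | surgery for symptoms within 1 year | surgery for symptoms |
| 15 | surgery for symptoms within 1 year | surgery for symptoms |
| 23 | surgery for symptoms after 5 years | surgery for symptoms |
| 31 | surgery for symptoms within 1 to 5 years | surgery for symptoms |
| 37 | surgery for symptoms after 5 years | surgery for symptoms |
| 40 | surgery for symptoms within 1 to 5 years | surgery for symptoms |
| 44 | surgery for symptoms within 1 year | surgery for symptoms |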
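
Another way to check a collapsed variable for errors is to cross-tabulate it against the original. The following sketch builds a small made-up data frame (not the heart data) and uses `pd.crosstab()`; if the collapsing is consistent, every original category should fall into exactly one new category.

```python
import pandas as pd

# Toy data frame that reuses some of the SURG labels
toy = pd.DataFrame({
    "SURG": [
        "no surgery",
        "surgery as part of trial",
        "surgery for symptoms within 1 year",
        "surgery for symptoms after 5 years",
        "no surgery",
    ]
})

# Keep labels that do not mention "symptoms"; replace the rest
is_symptoms = toy["SURG"].str.contains("symptoms")
toy["SURG_SIMPLE"] = toy["SURG"].where(~is_symptoms, "surgery for symptoms")

# Rows: original categories, columns: collapsed categories
table = pd.crosstab(toy["SURG"], toy["SURG_SIMPLE"])
print(table)

# Each original category should map to exactly one collapsed category
one_target_per_row = ((table > 0).sum(axis=1) == 1).all()
print(one_target_per_row)
```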


### Creating "flag" variables

Consider the *OUTCOME* variable and suppose that we wish to filter the data set depending upon whether the patient was alive at the last follow-up or had died. For convenience we may want to create a "flag" variable *ALIVE* that, in this case, is either `True` or `False`. There are several ways to do this. We could again use an appropriately written function with the [apply()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method, which might be necessary in more complicated settings, but here we simply use one of the binary operator methods, namely [eq()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.eq.html).

In [12]:
heart["ALIVE"] = heart["OUTCOME"].eq("alive at last follow-up")
heart[["OUTCOME","ALIVE"]].tail()

| | OUTCOME | ALIVE |
|---|---|---|
| 40 | alive at last follow-up | True |
| 41 | death from heart failure | False |
| 42 | sudden cardiac death | False |
| 43 | alive at last follow-up | True |
| 44 | sudden cardiac death | False |
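
Once a flag variable exists it can be used directly as a boolean mask. The sketch below uses a small made-up data frame; the *AGE* values and the *CARDIAC_DEATH* flag are hypothetical and are not part of the heart data. It filters on the flag and shows `isin()` as one option when a flag should cover several categories.

```python
import pandas as pd

# Toy data frame (not the heart data); AGE values are made up
toy = pd.DataFrame({
    "OUTCOME": [
        "alive at last follow-up",
        "death from heart failure",
        "sudden cardiac death",
        "alive at last follow-up",
    ],
    "AGE": [49, 54, 56, 42],
})

# Flag variable, created as in the cell above
toy["ALIVE"] = toy["OUTCOME"].eq("alive at last follow-up")

# Use the flag, or its negation, as a boolean mask
survivors = toy[toy["ALIVE"]]
deceased = toy[~toy["ALIVE"]]
print(len(survivors), len(deceased))   # 2 2

# A flag covering several categories (hypothetical example)
cardiac = ["sudden cardiac death", "death from heart failure"]
toy["CARDIAC_DEATH"] = toy["OUTCOME"].isin(cardiac)
print(toy["CARDIAC_DEATH"].tolist())   # [False, True, True, False]
```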
