# Creating a covariate dataset that encodes categorical sample variables

The sample data includes covariates, as shown in <a href='https://github.com/cognoma/cancer-data/blob/master/3.explore-mutations.ipynb'>this notebook</a>. These covariates may provide a spurious signal that a classifier picks up on, which could confound attempts to detect the actual signal of interest. This notebook creates a file that encodes these covariates. Classifiers implemented by the machine learning group can use it as additional training data.

In [1]:
import os

import pandas as pd
import numpy as np

Let's peek at the sample data.

In [2]:
path = os.path.join('data', 'samples.tsv')
covariates_df = pd.read_table(path, index_col=0)
# Flag a sample as recurred (1) when days_recurrence_free is not null, else 0
covariates_df['recurred'] = covariates_df.days_recurrence_free.notnull().astype(int)
covariates_df.head(4).transpose()

sample_id,TCGA-02-0047-01,TCGA-02-0055-01,TCGA-02-2483-01,TCGA-02-2485-01
patient_id,TCGA-02-0047,TCGA-02-0055,TCGA-02-2483,TCGA-02-2485
acronym,GBM,GBM,GBM,GBM
disease,glioblastoma multiforme,glioblastoma multiforme,glioblastoma multiforme,glioblastoma multiforme
age_diagnosed,78,62,43,53
gender,Male,Female,Male,Male
race,White,White,Asian,Black Or African American
ajcc_stage,,,,
clinical_stage,,,,
histological_type,Untreated primary (de novo) GBM,Untreated primary (de novo) GBM,Untreated primary (de novo) GBM,Untreated primary (de novo) GBM
histological_grade,,,,


Let's get a sense of how much missing data there is.

In [3]:
print('Total number of samples: {:,}'.format(len(covariates_df)))
print('Number of nulls in each column:')
covariates_df.isnull().sum(axis=0)

Total number of samples: 8,397
Number of nulls in each column:


patient_id                             0
acronym                                0
disease                                0
age_diagnosed                         42
gender                                 0
race                                 646
ajcc_stage                          2684
clinical_stage                      6519
histological_type                    110
histological_grade                  4744
initial_pathologic_dx_year           122
menopause_status                    6960
birth_days_to                         97
vital_status                           1
tumor_status                         547
last_contact_days_to                2129
death_days_to                       6119
cause_of_death                      8190
new_tumor_event_type                6782
new_tumor_event_site                7269
new_tumor_event_site_other          7980
days_recurrence_free                6185
treatment_outcome_first_course      3530
margin_status                       7456
residual_tumor  
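
Raw null counts are easier to compare across columns when expressed as a fraction of all samples. The following is a minimal sketch on a hypothetical toy DataFrame (the values and the name `toy_df` are illustrative assumptions, not taken from `samples.tsv`); the same `isnull().mean()` call should apply to `covariates_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few sample columns
toy_df = pd.DataFrame({
    'age_diagnosed': [78.0, np.nan, 43.0, 53.0],
    'ajcc_stage': [np.nan, np.nan, 'Stage II', np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the fraction of nulls in each column
null_fraction = toy_df.isnull().mean().sort_values(ascending=False)
print(null_fraction)
```

Columns with a high null fraction, such as most of the staging and follow-up variables above, are poor candidates for covariates unless the missing values are handled explicitly.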

In [4]:
# Specify which variables are categorical
categorical_variables = ['acronym', 'gender', 'dead', 'recurred']

# Number of categories per categorical variable
covariates_df[categorical_variables].apply(lambda x: x.nunique())

acronym     32
gender       2
dead         2
recurred     2
dtype: int64

In [5]:
continuous_variables = ['age_diagnosed', 'days_survived', 'days_recurrence_free', 'n_mutations']
covariates_df = covariates_df[categorical_variables + continuous_variables]
covariates_df.head(2)

sample_id,acronym,gender,dead,recurred,age_diagnosed,days_survived,days_recurrence_free,n_mutations
TCGA-02-0047-01,GBM,Male,1,1,78.0,448.0,57.0,64
TCGA-02-0055-01,GBM,Female,1,1,62.0,76.0,6.0,51


Before encoding, we're going to use the disease categories below. So let's store them for later use.

Inspecting the head of the covariates DataFrame above, we see that two columns, <code>dead</code> and <code>recurred</code>, need some attention. We're going to encode the categorical variables using pandas' <code>get_dummies</code>. Because these columns take the values 1 and 0, those values end up in the encoded column headers (for example <code>dead_0</code> and <code>dead_1</code>), as the keys of the renaming dictionary below show.
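
To make the naming behavior concrete, here is a small sketch on hypothetical toy data (the sample names and values are assumptions for illustration only). It shows that `pd.get_dummies` names each new column as the original column name, an underscore, and the category value.

```python
import pandas as pd

# Hypothetical toy data mimicking the 0/1 coding of `dead` and `recurred`
toy_df = pd.DataFrame(
    {'dead': [1, 0, 1], 'recurred': [0, 0, 1]},
    index=['sample_a', 'sample_b', 'sample_c'],
)

# Each encoded column is named '<original column>_<value>'
toy_encoded = pd.get_dummies(toy_df, columns=['dead', 'recurred'])
print(list(toy_encoded.columns))
# ['dead_0', 'dead_1', 'recurred_0', 'recurred_1']
```

Headers such as `dead_0` are correct but hard to read, which is why a renaming step is worthwhile.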

Let's rename the encoded columns so that they more accurately reflect the underlying variables.

In [6]:
renamer = {
    'dead_0': 'alive',
    'dead_1': 'dead',
    'recurred_0': 'has_not_recurred',
    'recurred_1': 'has_recurred',
    'gender_Female': 'female',
    'gender_Male': 'male',
}
covariates_df = pd.get_dummies(covariates_df, columns=categorical_variables).rename(columns=renamer)
covariates_df.head(2)

sample_id,age_diagnosed,days_survived,days_recurrence_free,n_mutations,acronym_ACC,acronym_BLCA,acronym_BRCA,acronym_CESC,acronym_CHOL,acronym_COAD,...,acronym_THYM,acronym_UCEC,acronym_UCS,acronym_UVM,female,male,alive,dead,has_not_recurred,has_recurred
TCGA-02-0047-01,78.0,448.0,57.0,64,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
TCGA-02-0055-01,62.0,76.0,6.0,51,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0


Now the column names more accurately reflect the underlying variables. The categorical values are encoded as numeric data that can be fed to the types of classifiers we have been using.

Another useful covariate is the log1p transform, log(1 + x), of the number of mutations calculated in the aforementioned notebook.

In [7]:
# pop removes n_mutations from the DataFrame; its log(1 + x) transform replaces it
covariates_df['n_mutations_log1p'] = np.log1p(covariates_df.pop('n_mutations'))
covariates_df.head(2)

sample_id,age_diagnosed,days_survived,days_recurrence_free,acronym_ACC,acronym_BLCA,acronym_BRCA,acronym_CESC,acronym_CHOL,acronym_COAD,acronym_DLBC,...,acronym_UCEC,acronym_UCS,acronym_UVM,female,male,alive,dead,has_not_recurred,has_recurred,n_mutations_log1p
TCGA-02-0047-01,78.0,448.0,57.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,4.174387
TCGA-02-0055-01,62.0,76.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,3.951244
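
As a quick check of the transform, the sketch below applies `np.log1p` to the two mutation counts shown earlier (64 and 51), plus a zero added here as an illustrative assumption. `np.log1p(x)` computes the natural logarithm of `1 + x`, so a sample with no mutations maps to 0 rather than to negative infinity.

```python
import numpy as np

# 64 and 51 are the n_mutations values of the two samples shown above;
# 0 is added (as an assumption) to show the behavior at zero
n_mutations = np.array([0, 64, 51])

log1p_values = np.log1p(n_mutations)

# log1p(x) is the natural log of (1 + x)
assert np.allclose(log1p_values, np.log(1 + n_mutations))

print(np.round(log1p_values, 6))
```

The values for 64 and 51 round to 4.174387 and 3.951244, matching the `n_mutations_log1p` column above.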


Finally, let's save this to a <code>.tsv</code> file.

In [8]:
path = os.path.join('data', 'covariates.tsv')
covariates_df.to_csv(path, sep='\t', float_format='%.5g')
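
A downstream classifier notebook would need to read this file back in. The sketch below is one way to do that, shown on a hypothetical two-row DataFrame written to a temporary directory so that it runs on its own (the names `toy_df`, `toy_path`, and `reloaded_df` are assumptions for illustration). It also shows that `float_format='%.5g'` keeps only five significant digits.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the covariates DataFrame
toy_df = pd.DataFrame(
    {
        'age_diagnosed': [78.0, 62.0],
        'n_mutations_log1p': [np.log1p(64), np.log1p(51)],
    },
    index=pd.Index(['TCGA-02-0047-01', 'TCGA-02-0055-01'], name='sample_id'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'covariates.tsv')
    # Same write call as above: tab-separated, five significant digits
    toy_df.to_csv(toy_path, sep='\t', float_format='%.5g')
    # index_col=0 restores sample_id as the index
    reloaded_df = pd.read_csv(toy_path, sep='\t', index_col=0)

print(reloaded_df)
```

Because of the five-digit format, the reloaded `n_mutations_log1p` values are rounded (4.1744 instead of 4.174387), which should be acceptable for covariates but is worth keeping in mind when comparing against the in-memory values.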