# Head and Mouth Cancer Datasets

This Jupyter Notebook builds a dataset of head and mouth cancer cases. This category was chosen because it maps to a single filter value on the Proteomic Data Commons (PDC) and the Imaging Data Commons (IDC), but has to be aggregated manually from several primary sites on the Genomic Data Commons (GDC).

## Dataset definition

This dataset is defined as:
* On the Proteomic Data Commons: all cases whose [`primary_site` is set to `Head and Neck`](https://pdc.cancer.gov/pdc/browse/filters/primary_site:Head+and+Neck)
* On the Imaging Data Commons: all cases whose [Primary Site Location is `Head-Neck`](https://portal.imaging.datacommons.cancer.gov/explore/?filters_for_load=%5B%7B%22filters%22:%5B%7B%22id%22:%22128%22,%22values%22:%5B%22Head-Neck%22%5D%7D%5D%7D%5D)
    * IDC data isn't currently included, since I don't think the IDC API is publicly accessible yet.
* On the Genomic Data Commons: [all cases](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22base%20of%20tongue%22%2C%22floor%20of%20mouth%22%2C%22gum%22%2C%22hypopharynx%22%2C%22larynx%22%2C%22lip%22%2C%22nasal%20cavity%20and%20middle%20ear%22%2C%22nasopharynx%22%2C%22oropharynx%22%2C%22other%20and%20ill-defined%20sites%20in%20lip%2C%20oral%20cavity%20and%20pharynx%22%2C%22other%20and%20unspecified%20major%20salivary%20glands%22%2C%22other%20and%20unspecified%20parts%20of%20mouth%22%2C%22other%20and%20unspecified%20parts%20of%20tongue%22%2C%22palate%22%2C%22tonsil%22%5D%7D%7D%5D%7D) whose `primary_site` is set to one of:
    * [`Other and unspecified major salivary glands`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22other%20and%20unspecified%20major%20salivary%20glands%22%5D%7D%7D%5D%7D)
    * [`Other and ill-defined sites in lip, oral cavity and pharynx`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22other%20and%20ill-defined%20sites%20in%20lip%2C%20oral%20cavity%20and%20pharynx%22%5D%7D%7D%5D%7D)
    * [`Oropharynx`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22oropharynx%22%5D%7D%7D%5D%7D)
    * [`Larynx`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22larynx%22%5D%7D%7D%5D%7D)
    * [`Other and unspecified parts of tongue`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22other%20and%20unspecified%20parts%20of%20tongue%22%5D%7D%7D%5D%7D)
    * [`Nasopharynx`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22nasopharynx%22%5D%7D%7D%5D%7D)
    * [`Floor of mouth`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22floor%20of%20mouth%22%5D%7D%7D%5D%7D)
    * [`Tonsil`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22tonsil%22%5D%7D%7D%5D%7D)
    * [`Other and unspecified parts of mouth`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22other%20and%20unspecified%20parts%20of%20mouth%22%5D%7D%7D%5D%7D)
    * [`Nasal cavity and middle ear`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22nasal%20cavity%20and%20middle%20ear%22%5D%7D%7D%5D%7D)
    * [`Hypopharynx`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22hypopharynx%22%5D%7D%7D%5D%7D)
    * [`Base of tongue`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22base%20of%20tongue%22%5D%7D%7D%5D%7D)
    * [`Gum`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22gum%22%5D%7D%7D%5D%7D)
    * [`Lip`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22lip%22%5D%7D%7D%5D%7D)
    * [`Palate`](https://portal.gdc.cancer.gov/exploration?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22palate%22%5D%7D%7D%5D%7D)

Note that we specifically don't include the brain, eye, trachea or esophagus -- please let us know if you think they should be included in this dataset.
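
The GDC links above all follow one pattern: a JSON filter on `cases.primary_site`, percent-encoded into the `filters` query parameter. The sketch below shows one way such a link could be generated from a list of primary sites. The function name `gdc_exploration_url` is ours (it is not part of any GDC library), and the sketch assumes the portal keeps accepting this URL layout.

```python
import json
from urllib.parse import quote

GDC_EXPLORATION_URL = "https://portal.gdc.cancer.gov/exploration"


def gdc_exploration_url(primary_sites):
    """Build a GDC exploration link filtered to the given primary sites.

    Hypothetical helper: it mirrors the structure of the links in the list
    above rather than any documented GDC client function.
    """
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.primary_site",
                    "value": list(primary_sites),
                },
            }
        ],
    }
    # Compact separators keep the JSON free of spaces before percent-encoding.
    encoded = quote(json.dumps(filters, separators=(",", ":")), safe="")
    return f"{GDC_EXPLORATION_URL}?filters={encoded}"


lip_url = gdc_exploration_url(["lip"])
combined_url = gdc_exploration_url(["base of tongue", "floor of mouth", "gum"])
print(lip_url)
```

Calling it with `["lip"]` should reproduce the `Lip` link above, which makes it easy to check that a hand-assembled filter matches the list of sites.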

## Setup

In [4]:
import sys

# Uncomment to install the required packages into this notebook's environment.
#!{sys.executable} -m pip install pandas numpy requests



In [10]:
# Load required packages.
import requests
import json
import pandas as pd
import numpy as np

## Download PDC data

In [33]:
pdc_graphql_endpoint = "https://pdc.cancer.gov/graphql"

# Step 1. Get a list of all the case IDs relevant to us.
primary_site = 'Head and Neck'
primary_site_query = """{ uiCase(primary_site: "%s") { case_id } }""" % (primary_site)

response = requests.get(pdc_graphql_endpoint, params = {
    "query": primary_site_query
})
result = json.loads(response.content)

case_ids = [case['case_id'] for case in result['data']['uiCase']]
# NumPy is imported as `np` above; np.unique() also sorts the IDs.
unique_case_ids = np.unique(case_ids)
print(f"We have {len(case_ids)} case IDs, of which {len(unique_case_ids)} are unique.")

We have 257 case IDs, of which 148 are unique.
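
Since the `uiCase` query returns the same case ID more than once (257 rows for 148 cases), de-duplication is a required step rather than a nicety. The following is a minimal sketch of a helper that de-duplicates the IDs and also checks the `errors` key that GraphQL responses use to report failures. The name `extract_unique_case_ids` and the sample payload are hypothetical. Unlike NumPy's `unique`, it keeps the IDs in first-seen order instead of sorting them.

```python
def extract_unique_case_ids(payload):
    """Return de-duplicated case IDs from a parsed uiCase response.

    Hypothetical helper: `payload` is the dict produced by parsing the
    JSON body of the GraphQL response.
    """
    errors = payload.get("errors")
    if errors:
        raise RuntimeError(f"GraphQL query failed: {errors}")
    records = (payload.get("data") or {}).get("uiCase") or []
    # dict.fromkeys() drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(record["case_id"] for record in records))


# Made-up payload for illustration only.
sample_payload = {
    "data": {"uiCase": [{"case_id": "b"}, {"case_id": "a"}, {"case_id": "b"}]}
}
unique_ids = extract_unique_case_ids(sample_payload)
print(unique_ids)  # ['b', 'a']
```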


In [34]:
# Step 2. Get all the cases relevant to us.

# I got this query from https://pdc.cancer.gov/data-dictionary/publicapi-documentation/#!/Case/case
case_query = """{
    case (
        case_id: "%s"
        acceptDUA: true
    ) {
        case_id case_submitter_id project_submitter_id days_to_lost_to_followup disease_type
        index_date lost_to_followup primary_site
        externalReferences {
            external_reference_id
            reference_resource_shortname reference_resource_name reference_entity_location
        }
        demographics {
            demographic_id ethnicity gender demographic_submitter_id race cause_of_death days_to_birth
            days_to_death vital_status year_of_birth year_of_death
        }
        samples {
            sample_id sample_submitter_id sample_type sample_type_id gdc_sample_id gdc_project_id
            biospecimen_anatomic_site composition current_weight days_to_collection days_to_sample_procurement
            diagnosis_pathologically_confirmed freezing_method initial_weight intermediate_dimension is_ffpe
            longest_dimension method_of_sample_procurement oct_embedded pathology_report_uuid preservation_method
            sample_type_id shortest_dimension time_between_clamping_and_freezing time_between_excision_and_freezing
            tissue_type tumor_code tumor_code_id tumor_descriptor
            aliquots {
                aliquot_id aliquot_submitter_id analyte_type
                aliquot_run_metadata {
                    aliquot_run_metadata_id
                }
            }
        }
        diagnoses {
            diagnosis_id tissue_or_organ_of_origin age_at_diagnosis primary_diagnosis tumor_grade tumor_stage
            diagnosis_submitter_id classification_of_tumor days_to_last_follow_up days_to_last_known_disease_status
            days_to_recurrence last_known_disease_status morphology progression_or_recurrence
            site_of_resection_or_biopsy prior_malignancy ajcc_clinical_m ajcc_clinical_n ajcc_clinical_stage
            ajcc_clinical_t ajcc_pathologic_m ajcc_pathologic_n ajcc_pathologic_stage ajcc_pathologic_t
            ann_arbor_b_symptoms ann_arbor_clinical_stage ann_arbor_extranodal_involvement ann_arbor_pathologic_stage
            best_overall_response burkitt_lymphoma_clinical_variant circumferential_resection_margin
            colon_polyps_history days_to_best_overall_response days_to_diagnosis days_to_hiv_diagnosis
            days_to_new_event figo_stage hiv_positive hpv_positive_type hpv_status iss_stage laterality
            ldh_level_at_diagnosis ldh_normal_range_upper lymph_nodes_positive lymphatic_invasion_present
            method_of_diagnosis new_event_anatomic_site new_event_type overall_survival perineural_invasion_present
            prior_treatment progression_free_survival progression_free_survival_event residual_disease
            vascular_invasion_present year_of_diagnosis icd_10_code synchronous_malignancy
            tumor_largest_dimension_diameter
        }
    }
}"""

cases = []
for num, case_id in enumerate(unique_case_ids):
    response = requests.get(pdc_graphql_endpoint, params = {
        "query": case_query % (case_id)
    })
    result = json.loads(response.content)
    case = result['data']['case']
    if case and len(case) == 1:
        cases.append(case[0])
        print("Downloaded case %d of %d: %s" % (num, len(unique_case_ids), case_id))
    else:
        print("Could not download case %d of %d: %s" % (num, len(unique_case_ids), case_id))

len(cases)

Downloaded case 0 of 148: 0232701d-6d00-440c-af6c-5899fbbf4142
Downloaded case 1 of 148: 0e943de7-c277-48f2-8fa9-b2e836b03c2c
Downloaded case 2 of 148: 1104505a-9890-49ce-8d7d-7a8070261324
Downloaded case 3 of 148: 195cd133-0d53-402d-b31c-3d4fe0481858
Downloaded case 4 of 148: 1df726a4-8520-4474-8c00-d238a7384be1
Downloaded case 5 of 148: 2b4204ab-87af-4a79-913f-fabdcb2d02b4
Downloaded case 6 of 148: 3195cf36-998a-4f48-a7f1-a84f2b9fdcc5
Downloaded case 7 of 148: 37ea85c2-5e02-4d7c-846a-7cd40ef1912d
Downloaded case 8 of 148: 3f8aba34-9229-4701-bd5d-869b3705d4a6
Downloaded case 9 of 148: 4633e71f-5fbe-49b0-9721-9769182a08eb
Downloaded case 10 of 148: 482dd5ba-9c50-4ee3-a3f2-3bb0ad6487dc
Downloaded case 11 of 148: 4c848c55-2c7e-4d32-9a17-b2531a18a19a
Downloaded case 12 of 148: 58355fa4-e2ad-434b-a915-acbd09abbb77
Downloaded case 13 of 148: 59448f1d-6be8-466f-bb1a-1a3aef693d4f
Downloaded case 14 of 148: 5e99d311-24f2-4ee5-abfe-c9d40771e8e3
Downloaded case 15 of 148: 6e4c8397-715c-4dc5-8782

Downloaded case 128 of 148: df4f14e8-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 129 of 148: df4f15be-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 130 of 148: df4f1689-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 131 of 148: df4f1830-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 132 of 148: df4f1907-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 133 of 148: df4f19d3-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 134 of 148: df4f1b8c-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 135 of 148: df4f1c60-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 136 of 148: df4f1d2d-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 137 of 148: df4f1ef8-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 138 of 148: df4f1fff-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 139 of 148: df4f20dd-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 140 of 148: df4f22b3-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 141 of 148: df4f238b-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 142 of 148: df4f2567-8f98-11ea-b1fd-0aad30af8a83
Downloaded case 143 of 14

148
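
The loop above prints failures but does not remember which case IDs failed. As a sketch of one possible variation, the download step can be wrapped in a function that returns both the downloaded cases and the IDs that could not be retrieved. The names `download_cases` and `fetch_case` are hypothetical; `fetch_case` stands in for whatever performs the HTTP request and returns the parsed JSON, which also lets the logic be tried without network access.

```python
def download_cases(case_ids, fetch_case):
    """Download each case and keep track of the IDs that failed.

    `fetch_case` is a caller-supplied function (an assumption of this
    sketch) that takes a case ID and returns the parsed GraphQL response.
    """
    cases, failed_ids = [], []
    for case_id in case_ids:
        result = fetch_case(case_id)
        records = (result.get("data") or {}).get("case") or []
        # Same acceptance rule as the loop above: exactly one record per ID.
        if len(records) == 1:
            cases.append(records[0])
        else:
            failed_ids.append(case_id)
    return cases, failed_ids


# Stand-in fetcher with made-up data, used here instead of a real request.
def fake_fetch_case(case_id):
    if case_id == "missing":
        return {"data": {"case": []}}
    return {"data": {"case": [{"case_id": case_id}]}}


demo_cases, demo_failed = download_cases(["a", "missing", "b"], fake_fetch_case)
print(len(demo_cases), demo_failed)  # 2 ['missing']
```

In the notebook, `fetch_case` could be a small function that wraps the `requests.get` call from the cell above, so that failed IDs can be retried later.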

In [35]:
# Before we examine it, let's save this as a file.

with open('pdc-head-and-mouth.json', 'w') as file:
    json.dump(cases, file, indent=2, sort_keys=True)

In [36]:
#df_pdc_cases = pd.DataFrame(cases)
df_pdc_cases = pd.json_normalize(cases)
df_pdc_cases.shape, list(df_pdc_cases.columns)

((148, 12),
 ['case_id',
  'case_submitter_id',
  'project_submitter_id',
  'days_to_lost_to_followup',
  'disease_type',
  'index_date',
  'lost_to_followup',
  'primary_site',
  'externalReferences',
  'demographics',
  'samples',
  'diagnoses'])
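
The `demographics`, `samples` and `diagnoses` columns still hold lists of nested records, so they are not flattened into columns of their own. One way to get a table per nested record type is sketched below. The helper name `flatten_records` is hypothetical and the sample data is made up; the helper only assumes that each case carries a `case_id` and a list under the requested key.

```python
def flatten_records(cases, record_key, id_key="case_id"):
    """Return one flat row per nested record, tagged with its parent case ID.

    Hypothetical helper for illustration; cases without the key, or with
    an empty or null list, simply contribute no rows.
    """
    rows = []
    for case in cases:
        for record in case.get(record_key) or []:
            rows.append({id_key: case[id_key], **record})
    return rows


# Made-up cases shaped like the PDC response above.
sample_cases = [
    {"case_id": "c1", "samples": [{"sample_id": "s1"}, {"sample_id": "s2"}]},
    {"case_id": "c2", "samples": None},
    {"case_id": "c3", "samples": [{"sample_id": "s3"}]},
]
sample_rows = flatten_records(sample_cases, "samples")
print(sample_rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(...)` to obtain one row per sample, and the same call with `"diagnoses"` or `"demographics"` gives the other tables.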

In [37]:
df_pdc_cases.head(5)

| | case_id | case_submitter_id | project_submitter_id | days_to_lost_to_followup | disease_type | index_date | lost_to_followup | primary_site | externalReferences | demographics | samples | diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0232701d-6d00-440c-af6c-5899fbbf4142 | OSCC_13 | Oral Squamous Cell Carcinoma - Chang Gung Univ... | | Oral Squamous Cell Carcinoma | | | Head and Neck | [] | [{'demographic_id': 'c4f66cfa-2d1d-4d58-b7b6-8... | [{'sample_id': 'd58e2a88-8b0c-4cc4-bb1a-e7734a... | [{'diagnosis_id': '90972454-6af2-4704-bb16-c00... |
| 1 | 0e943de7-c277-48f2-8fa9-b2e836b03c2c | OSCC_25 | Oral Squamous Cell Carcinoma - Chang Gung Univ... | | Oral Squamous Cell Carcinoma | | | Head and Neck | [] | [{'demographic_id': 'de8038df-cf33-46c2-a9eb-8... | [{'sample_id': 'b6ecae8d-08a9-44e3-ac70-51331b... | [{'diagnosis_id': 'ae7a3f6c-173c-44d3-8a3e-381... |
| 2 | 1104505a-9890-49ce-8d7d-7a8070261324 | OSCC_23 | Oral Squamous Cell Carcinoma - Chang Gung Univ... | | Oral Squamous Cell Carcinoma | | | Head and Neck | [] | [{'demographic_id': 'efef05c6-84a9-4353-95d0-2... | [{'sample_id': '4dc515ba-6525-4619-8d57-520cc2... | [{'diagnosis_id': '4c1cade2-18a6-4a2e-8264-949... |
| 3 | 195cd133-0d53-402d-b31c-3d4fe0481858 | OSCC_37 | Oral Squamous Cell Carcinoma - Chang Gung Univ... | | Oral Squamous Cell Carcinoma | | | Head and Neck | [] | [{'demographic_id': 'd978bb14-8056-4f56-ac5b-b... | [{'sample_id': '5836c294-187d-4a17-96da-e1ee30... | [{'diagnosis_id': 'e993cc6a-3774-4ef6-9f6a-77e... |
| 4 | 1df726a4-8520-4474-8c00-d238a7384be1 | OSCC_06 | Oral Squamous Cell Carcinoma - Chang Gung Univ... | | Oral Squamous Cell Carcinoma | | | Head and Neck | [] | [{'demographic_id': '5b24c6ed-ff68-4d1a-8aad-4... | [{'sample_id': '37d52145-52cc-40f9-8d6e-e13f3d... | [{'diagnosis_id': '60422ff9-30a2-41da-ac80-783... |


In [38]:
df_pdc_cases.describe()

| | case_id | case_submitter_id | project_submitter_id | days_to_lost_to_followup | disease_type | index_date | lost_to_followup | primary_site | externalReferences | demographics | samples | diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 148 | 148 | 148 | 0.0 | 148 | 110 | 110 | 148 | 148 | 148 | 148 | 148 |
| unique | 148 | 148 | 2 | 0.0 | 2 | 1 | 3 | 1 | 111 | 148 | 148 | 148 |
| top | 0232701d-6d00-440c-af6c-5899fbbf4142 | OSCC_13 | CPTAC3-Discovery | | Head and Neck Squamous Cell Carcinoma | Diagnosis | No | Head and Neck | [] | [{'demographic_id': 'c4f66cfa-2d1d-4d58-b7b6-8... | [{'sample_id': 'd58e2a88-8b0c-4cc4-bb1a-e7734a... | [{'diagnosis_id': '90972454-6af2-4704-bb16-c00... |
| freq | 1 | 1 | 110 | | 110 | 110 | 72 | 148 | 38 | 1 | 1 | 1 |


## Download GDC data

In [21]:
# Search by cases.primary_site.
cases_endpt = "https://api.gdc.cancer.gov/cases"

field_groups = [
    "diagnoses",
    "samples",
    "demographic"
    ]
field_groups = ",".join(field_groups)

filters = {
    "op": "in",
    "content": {
        "field": "cases.primary_site",
        "value": [
            "base of tongue",
            "floor of mouth",
            "gum",
            "hypopharynx",
            "larynx",
            "lip",
            "nasal cavity and middle ear",
            "nasopharynx",
            "oropharynx",
            "other and ill-defined sites in lip, oral cavity and pharynx",
            "other and unspecified major salivary glands",
            "other and unspecified parts of mouth",
            "other and unspecified parts of tongue",
            "palate",
            "tonsil"
        ]
    }
}

params = {
    "filters": json.dumps(filters),
    "expand": field_groups,
    "format": "JSON",
    "size": 1000
}

response = requests.get(cases_endpt, params = params)
result = json.loads(response.content)

print(f"Warnings: {result['warnings']}")

gdc_entries = result['data']['hits']

with open('gdc-head-and-mouth.json', 'w') as file:
    json.dump(gdc_entries, file, indent=2, sort_keys=True)
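
The request above asks for up to 1000 results in a single call, which is enough here because a later cell reports 560 entries. If the number of matching cases ever exceeded the page size, the results would have to be fetched page by page. The sketch below shows the general shape of such a loop. It assumes that the response reports the overall count under `data.pagination.total` and that an offset can be passed to the API (for example through a `from` parameter); both assumptions should be checked against the GDC API documentation. The names `collect_all_hits` and `fetch_page` are hypothetical.

```python
def collect_all_hits(fetch_page, page_size=1000):
    """Collect hits across pages until the reported total is reached.

    `fetch_page(offset, size)` is a caller-supplied function (an assumption
    of this sketch) that returns the parsed JSON for one page of results.
    """
    hits = []
    offset = 0
    while True:
        data = fetch_page(offset, page_size)["data"]
        page = data["hits"]
        hits.extend(page)
        offset += len(page)
        # Stop on an empty page too, so a wrong total cannot loop forever.
        if not page or offset >= data["pagination"]["total"]:
            return hits


# Stand-in for the real request: serves slices of a made-up result list.
ALL_FAKE_HITS = [{"id": n} for n in range(5)]


def fake_fetch_page(offset, size):
    return {
        "data": {
            "hits": ALL_FAKE_HITS[offset:offset + size],
            "pagination": {"total": len(ALL_FAKE_HITS)},
        }
    }


paged_hits = collect_all_hits(fake_fetch_page, page_size=2)
print(len(paged_hits))  # 5
```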



In [28]:
#pd_gdc_entries = pd.DataFrame(gdc_entries)
df_gdc_entries = pd.json_normalize(gdc_entries)
df_gdc_entries.shape, list(df_gdc_entries.columns)

((560, 45),
 ['id',
  'lost_to_followup',
  'days_to_lost_to_followup',
  'disease_type',
  'submitter_id',
  'submitter_aliquot_ids',
  'aliquot_ids',
  'diagnoses',
  'diagnosis_ids',
  'created_datetime',
  'sample_ids',
  'samples',
  'submitter_sample_ids',
  'primary_site',
  'submitter_diagnosis_ids',
  'updated_datetime',
  'case_id',
  'index_date',
  'state',
  'demographic.cause_of_death',
  'demographic.race',
  'demographic.vital_status',
  'demographic.ethnicity',
  'demographic.gender',
  'demographic.age_at_index',
  'demographic.submitter_id',
  'demographic.days_to_birth',
  'demographic.created_datetime',
  'demographic.year_of_birth',
  'demographic.cause_of_death_source',
  'demographic.premature_at_birth',
  'demographic.weeks_gestation_at_birth',
  'demographic.demographic_id',
  'demographic.age_is_obfuscated',
  'demographic.updated_datetime',
  'demographic.occupation_duration_years',
  'demographic.days_to_death',
  'demographic.state',
  'demographic.year_of

In [29]:
df_gdc_entries.head(5)

| | id | lost_to_followup | days_to_lost_to_followup | disease_type | submitter_id | submitter_aliquot_ids | aliquot_ids | diagnoses | diagnosis_ids | created_datetime | ... | demographic.occupation_duration_years | demographic.days_to_death | demographic.state | demographic.year_of_death | slide_ids | submitter_slide_ids | analyte_ids | submitter_analyte_ids | portion_ids | submitter_portion_ids |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a203ac35-914f-4f4d-816c-2af124257500 | | | Squamous Cell Neoplasms | GENIE-DFCI-011620 | [GENIE-DFCI-011620-10763_aliquot] | [3d4995b8-5b04-46f2-8d37-7e0b9f9b1b1a] | [{'irs_stage': None, 'iss_stage': None, 'ajcc_... | [a4f6276a-b3cc-45f9-9fb8-30edd56ad4ea] | 2018-09-13T13:41:51.057497-05:00 | ... | | | released | | | | | | | |
| 1 | 26d5f693-dfbc-44ec-a073-49a59a3f09a0 | | | Squamous Cell Neoplasms | GENIE-DFCI-050738 | [GENIE-DFCI-050738-234120_aliquot] | [57d18da1-d1b9-40b0-8ee6-0f94fd9f7575] | [{'irs_stage': None, 'iss_stage': None, 'ajcc_... | [2e4427b7-e557-49c7-85ef-39a02c4a441c] | 2019-06-03T12:43:36.681258-05:00 | ... | | | released | | | | | | | |
| 2 | d7c7ecbd-7495-4d29-8bb6-78797f5a47eb | | | Squamous Cell Neoplasms | GENIE-DFCI-004072 | [GENIE-DFCI-004072-413_aliquot] | [95066691-03ea-422a-bb4c-ba09e9cbd7ab] | [{'irs_stage': None, 'iss_stage': None, 'ajcc_... | [85d63c6e-a6c1-47a5-92a9-216719073a4a] | 2018-09-13T13:44:12.915115-05:00 | ... | | | released | | | | | | | |
| 3 | 33fa625e-852e-49ef-8134-6ea46edb5183 | | | Squamous Cell Neoplasms | GENIE-GRCC-a8pxs0u6 | [GENIE-GRCC-a8pxs0u6-sample-a_aliquot] | [b17a8d8a-395e-4d42-bfb0-7e829e1d4a8b] | [{'irs_stage': None, 'iss_stage': None, 'ajcc_... | [d0678ad6-4a4a-4ec8-9e5f-352c1347efc2] | 2019-06-04T18:08:22.482657-05:00 | ... | | | released | | | | | | | |
| 4 | 4d49b9f5-09a0-49de-84f7-0d3441c214f6 | | | Squamous Cell Neoplasms | GENIE-GRCC-2b4655c3 | [GENIE-GRCC-2b4655c3-sample-a_aliquot] | [a0f16f51-94eb-4a6b-beb6-9159fa6acc4b] | [{'irs_stage': None, 'iss_stage': None, 'ajcc_... | [2707217b-5a4b-404b-97de-921526581d45] | 2018-10-02T17:53:10.070290-05:00 | ... | | | released | | | | | | | |


In [30]:
df_gdc_entries.describe()

| | lost_to_followup | days_to_lost_to_followup | demographic.cause_of_death | demographic.age_at_index | demographic.days_to_birth | demographic.year_of_birth | demographic.cause_of_death_source | demographic.premature_at_birth | demographic.weeks_gestation_at_birth | demographic.age_is_obfuscated | demographic.occupation_duration_years | demographic.days_to_death | demographic.year_of_death |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 0.0 | 0.0 | 0.0 | 499.0 | 499.0 | 201.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 77.0 | 54.0 |
| mean | | | | 12998.078156 | -21889.791583 | 1948.029851 | | | | | | 824.909091 | 2005.314815 |
| std | | | | 11059.558005 | 3837.513498 | 11.702953 | | | | | | 964.107048 | 5.012528 |
| min | | | | 35.0 | -32871.0 | 1916.0 | | | | | | 1.0 | 1996.0 |
| 25% | | | | 62.5 | -24471.0 | 1940.0 | | | | | | 330.0 | 2001.0 |
| 50% | | | | 18627.0 | -22182.0 | 1950.0 | | | | | | 480.0 | 2006.5 |
| 75% | | | | 22645.0 | -19470.0 | 1956.0 | | | | | | 941.0 | 2009.75 |
| max | | | | 30681.0 | -7670.0 | 1973.0 | | | | | | 6417.0 | 2013.0 |
