# Introduction to Python, Week 2: Starting with data

## Before class
- share URL to HackMD page with link to zipped dataset file to download

## Objectives
- review previous week's objectives
### Today:
- using packages
- tidy data and importing data to python
- selecting data using labels (columns) and rows
- slicing subsets of rows and columns
- calculating summary statistics

## Using packages
- make sure everyone is working in project directory
- create new notebook for this week's material (name week2)
- introduction to packages
    - collection of functions
    - community contributed (anyone can write a package!)
- describe pandas
    - python data analysis library

In [1]:
# make packages available to use in this notebook
import os
import urllib.request
import pandas as pd # pd is an alias (shortcut) for pandas; prefix its functions with pd.

## Importing data

In [2]:
# create data directory
os.mkdir("data")
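Re-running the cell above raises `FileExistsError` once the `data` directory exists. A minimal sketch of a re-runnable alternative uses the standard library's `os.makedirs` with `exist_ok=True`. The temporary directory here is only an assumption for the demo, so the example does not touch the project folder:

```python
import os
import tempfile

# work inside a throwaway directory so the example leaves the project untouched
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data")
    os.makedirs(data_dir, exist_ok=True)  # creates the directory
    os.makedirs(data_dir, exist_ok=True)  # second call does nothing and raises no error
    created = os.path.isdir(data_dir)

print(created)
```

In the notebook itself, the equivalent call would be `os.makedirs("data", exist_ok=True)`.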

In [3]:
# download dataset (url on HackMD page)
urllib.request.urlretrieve("https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.csv", "data/clinical.csv")
# can preview data file by opening in web browser, or opening CSV file in spreadsheet program
# *data from The Cancer Genome Atlas (from NIH), several cancer types in one file

('data/clinical.csv', <http.client.HTTPMessage at 0x11ea851f0>)

## Overview of tidy data principles
- columns: variables (demographic and health information)
- rows: observations (patients)
- one piece of info per cell

_*csv: comma separated values (characters other than commas, such as tabs or spaces, can also be used as separators)_
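To illustrate the note on separators, here is a small sketch that reads the same table written three ways. The table contents are made up for the demo, and `io.StringIO` stands in for a file on disk:

```python
import io
import pandas as pd

# the same two-row table written with three different separators (made-up values)
comma_text = "patient,age\nA,61\nB,47\n"
tab_text = "patient\tage\nA\t61\nB\t47\n"
space_text = "patient age\nA 61\nB 47\n"

from_comma = pd.read_csv(io.StringIO(comma_text))           # comma is the default separator
from_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")     # tab-separated
from_space = pd.read_csv(io.StringIO(space_text), sep=" ")  # space-separated

# all three calls should produce the same DataFrame
print(from_comma.equals(from_tab) and from_comma.equals(from_space))
```

The `sep` argument is the only thing that changes between the three calls.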


_backup option for downloading data, if above code doesn't work:_
- show where to download data
- emphasize unzipping directory and moving data to appropriate location

In [4]:
# import data as csv
pd.read_csv("data/clinical.csv")
# this only prints it to the screen!

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.095890,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6827,C55,not reported,32871.0,dead,8950/3,167.0,live,C55.9,-32871.0,C55.9,,,,female,1917.0,white,not reported,2007.0,TCGA-NF-A5CP,UCS
6828,C54.1,not reported,23323.0,dead,8950/3,442.0,live,C54.1,-23323.0,C54.1,,,,female,1948.0,white,not hispanic or latino,2012.0,TCGA-NG-A4VU,UCS
6829,C54.1,not reported,27326.0,dead,8950/3,949.0,live,C54.1,-27326.0,C54.1,,,,female,1932.0,white,not hispanic or latino,2008.0,TCGA-NG-A4VW,UCS
6830,C54.1,not reported,24781.0,alive,8950/3,,live,C54.1,-24781.0,C54.1,587.0,,,female,1945.0,white,not hispanic or latino,,TCGA-QM-A5NM,UCS


In [5]:
# assign data to object
clinical_df = pd.read_csv("data/clinical.csv")

In [6]:
# preview data import
clinical_df.head() # print top few rows, by default .head() does 5

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


In [7]:
## Challenge: What do you need to do to download and import the following files correctly:
# example1: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.tsv
# pd.read_csv("data/clinical.tsv", sep="\t")
# example2: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.txt
# pd.read_csv("data/clinical.txt", sep=" ")

In [8]:
# examine data import
type(clinical_df) # look at data type

pandas.core.frame.DataFrame

In [9]:
clinical_df.columns # view column names

Index(['primary_diagnosis', 'tumor_stage', 'age_at_diagnosis', 'vital_status',
       'morphology', 'days_to_death', 'state', 'tissue_or_organ_of_origin',
       'days_to_birth', 'site_of_resection_or_biopsy',
       'days_to_last_follow_up', 'cigarettes_per_day', 'years_smoked',
       'gender', 'year_of_birth', 'race', 'ethnicity', 'year_of_death',
       'bcr_patient_barcode', 'disease'],
      dtype='object')

In [10]:
clinical_df.dtypes # look at type of data in each column

primary_diagnosis               object
tumor_stage                     object
age_at_diagnosis               float64
vital_status                    object
morphology                      object
days_to_death                  float64
state                           object
tissue_or_organ_of_origin       object
days_to_birth                  float64
site_of_resection_or_biopsy     object
days_to_last_follow_up         float64
cigarettes_per_day             float64
years_smoked                   float64
gender                          object
year_of_birth                  float64
race                            object
ethnicity                       object
year_of_death                  float64
bcr_patient_barcode             object
disease                         object
dtype: object

The following can be entered as a markdown cell (`*` renders as bullet points):
## **Data types:** pandas _vs_ native python
* object = string
* int64 = integer (64 bit)
* float64 = float
* datetime64 = N/A
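A small sketch of how pandas picks these types on import. The table is made up, and the column names are placeholders loosely modeled on the clinical data. It also shows that a single missing value turns a column of whole numbers into `float64`, which is likely why columns such as `year_of_birth` appear as floats above:

```python
import io
import pandas as pd

# made-up table: one text column, one complete numeric column, one with a gap
text = "stage,year,days\nstage ia,1936,371\nstage ib,1931,\n"
df = pd.read_csv(io.StringIO(text))

print(df.dtypes)

# whole numbers with no gaps are read as integers
year_dtype = str(df["year"].dtype)
# a missing value is stored as NaN, which is a float, so the column becomes float64
days_dtype = str(df["days"].dtype)
```

Depending on the pandas version, the text column may be reported as `object` or as a dedicated string type.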

## Selecting data using labels (columns) and row ranges

In [11]:
# select a "subset" of the data using the column name
clinical_df["tumor_stage"]

0           stage ia
1           stage ib
2           stage ib
3           stage ia
4          stage iib
            ...     
6827    not reported
6828    not reported
6829    not reported
6830    not reported
6831    not reported
Name: tumor_stage, Length: 6832, dtype: object

In [12]:
# show only the first few rows of output
clinical_df["tumor_stage"].head()

0     stage ia
1     stage ib
2     stage ib
3     stage ia
4    stage iib
Name: tumor_stage, dtype: object

In [13]:
# show data type for this column
clinical_df["tumor_stage"].dtype # single column, O stands for "object"

dtype('O')

In [14]:
# use the column name as an "attribute"; gives the same output
clinical_df.tumor_stage

0           stage ia
1           stage ib
2           stage ib
3           stage ia
4          stage iib
            ...     
6827    not reported
6828    not reported
6829    not reported
6830    not reported
6831    not reported
Name: tumor_stage, Length: 6832, dtype: object

In [15]:
# head still works here!
clinical_df.tumor_stage.head()

0     stage ia
1     stage ib
2     stage ib
3     stage ia
4    stage iib
Name: tumor_stage, dtype: object

In [16]:
# What happens if you ask for a column that doesn't exist?
# clinical_df["tumorstage"] # uncomment this line
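For reference when running the challenge above: asking for a column name that is not in the DataFrame raises a `KeyError`. A minimal sketch on a made-up stand-in for `clinical_df`:

```python
import pandas as pd

# tiny stand-in for clinical_df (made-up values)
df = pd.DataFrame({"tumor_stage": ["stage ia", "stage ib"]})

try:
    df["tumorstage"]  # misspelled: the underscore is missing
    message = "column found"
except KeyError as err:
    message = f"KeyError: {err}"

print(message)
```

The last line of the error traceback names the missing label, which helps spot typos in column names.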

In [17]:
# Select two columns at once
clinical_df[["tumor_stage", "vital_status"]]
# can't use .column_name because there are multiple columns!
# the inner brackets create a python list of column names;
# the outer brackets index the DataFrame with that list

Unnamed: 0,tumor_stage,vital_status
0,stage ia,dead
1,stage ib,dead
2,stage ib,dead
3,stage ia,alive
4,stage iib,dead
...,...,...
6827,not reported,dead
6828,not reported,dead
6829,not reported,dead
6830,not reported,alive


**Challenge:** does the order of the columns you list matter?
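One way to check the challenge is on a small made-up DataFrame, listing the columns in the opposite order from how they are stored:

```python
import pandas as pd

# tiny stand-in for clinical_df (made-up values)
df = pd.DataFrame({
    "tumor_stage": ["stage ia", "stage ib"],
    "vital_status": ["dead", "alive"],
})

# columns come back in the order they are listed, not the order stored in df
swapped = df[["vital_status", "tumor_stage"]]
print(list(swapped.columns))
print(list(df.columns))  # the original DataFrame keeps its own order
```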

In [18]:
# Select rows 0, 1, 2 (row 3 is not selected)
clinical_df[0:3]

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC


In [19]:
# Select from the second row (index 1) to the end
clinical_df[1:]

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.095890,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC
5,C34.1,stage iiia,23370.0,alive,8070/3,,live,C34.1,-23370.0,C34.1,3576.0,2.739726,,female,1942.0,not reported,not reported,,TCGA-18-3411,LUSC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6827,C55,not reported,32871.0,dead,8950/3,167.0,live,C55.9,-32871.0,C55.9,,,,female,1917.0,white,not reported,2007.0,TCGA-NF-A5CP,UCS
6828,C54.1,not reported,23323.0,dead,8950/3,442.0,live,C54.1,-23323.0,C54.1,,,,female,1948.0,white,not hispanic or latino,2012.0,TCGA-NG-A4VU,UCS
6829,C54.1,not reported,27326.0,dead,8950/3,949.0,live,C54.1,-27326.0,C54.1,,,,female,1932.0,white,not hispanic or latino,2008.0,TCGA-NG-A4VW,UCS
6830,C54.1,not reported,24781.0,alive,8950/3,,live,C54.1,-24781.0,C54.1,587.0,,,female,1945.0,white,not hispanic or latino,,TCGA-QM-A5NM,UCS


In [20]:
# Select the last row of the DataFrame
clinical_df[-1:] # what does this mean in the context of indexing?

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
6831,C54.1,not reported,20318.0,alive,8950/3,,live,C54.1,-20318.0,C54.1,0.0,,,female,1957.0,asian,not hispanic or latino,,TCGA-QN-A5NN,UCS


### BREAK

**Challenge:** how would you extract the last 10 rows of the dataset?
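Two possible answers, sketched on a made-up 20-row DataFrame: a negative slice, and the `tail()` method (the counterpart of `head()`):

```python
import pandas as pd

# made-up data: 20 rows labelled 0..19
df = pd.DataFrame({"value": range(20)})

last_by_slice = df[-10:]    # a negative start counts back from the end
last_by_tail = df.tail(10)  # tail() returns the last n rows

print(last_by_slice.equals(last_by_tail))
```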

## Slicing subsets of rows and columns

In [21]:
# iloc is integer indexing [row slicing, column slicing]
# locate specific data element
clinical_df.iloc[2, 6]

'live'

In [22]:
# select range of data
clinical_df.iloc[0:3, 1:4]

Unnamed: 0,tumor_stage,age_at_diagnosis,vital_status
0,stage ia,24477.0,dead
1,stage ib,26615.0,dead
2,stage ib,28171.0,dead


In [23]:
# stop/end bound is NOT inclusive (e.g., up to but not including 3)
# can use empty stop boundary to indicate end of data
clinical_df.iloc[0:, 1:4]

Unnamed: 0,tumor_stage,age_at_diagnosis,vital_status
0,stage ia,24477.0,dead
1,stage ib,26615.0,dead
2,stage ib,28171.0,dead
3,stage ia,27154.0,alive
4,stage iib,29827.0,dead
...,...,...,...
6827,not reported,32871.0,dead
6828,not reported,23323.0,dead
6829,not reported,27326.0,dead
6830,not reported,24781.0,alive


In [24]:
# loc is for label indexing (integers interpreted as labels)
# start and stop bound are inclusive
clinical_df.loc[1:4]

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


In [25]:
# can use empty stop boundary to indicate end of data
clinical_df.loc[1: ]

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.095890,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC
5,C34.1,stage iiia,23370.0,alive,8070/3,,live,C34.1,-23370.0,C34.1,3576.0,2.739726,,female,1942.0,not reported,not reported,,TCGA-18-3411,LUSC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6827,C55,not reported,32871.0,dead,8950/3,167.0,live,C55.9,-32871.0,C55.9,,,,female,1917.0,white,not reported,2007.0,TCGA-NF-A5CP,UCS
6828,C54.1,not reported,23323.0,dead,8950/3,442.0,live,C54.1,-23323.0,C54.1,,,,female,1948.0,white,not hispanic or latino,2012.0,TCGA-NG-A4VU,UCS
6829,C54.1,not reported,27326.0,dead,8950/3,949.0,live,C54.1,-27326.0,C54.1,,,,female,1932.0,white,not hispanic or latino,2008.0,TCGA-NG-A4VW,UCS
6830,C54.1,not reported,24781.0,alive,8950/3,,live,C54.1,-24781.0,C54.1,587.0,,,female,1945.0,white,not hispanic or latino,,TCGA-QM-A5NM,UCS


In [26]:
# Select all columns for rows of index values specified
clinical_df.loc[[0, 10, 6831], ]

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
10,C34.9,stage iv,24019.0,dead,8070/3,1097.0,live,C34.9,-24019.0,C34.9,758.0,1.369863,26.0,male,1941.0,not reported,not reported,,TCGA-18-3417,LUSC
6831,C54.1,not reported,20318.0,alive,8950/3,,live,C54.1,-20318.0,C54.1,0.0,,,female,1957.0,asian,not hispanic or latino,,TCGA-QN-A5NN,UCS


In [27]:
# select first row for specified columns
clinical_df.loc[0, ["primary_diagnosis", "tumor_stage", "age_at_diagnosis"]]

primary_diagnosis       C34.1
tumor_stage          stage ia
age_at_diagnosis        24477
Name: 0, dtype: object

**Challenge:** why doesn't the following code work?

`clinical_df.loc[2, 6]`
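A sketch of the difference behind this challenge, on a made-up DataFrame whose column names `a` and `b` are placeholders. `iloc` works with positions and excludes the stop bound, while `loc` works with labels and includes it, so an integer in the column slot of `loc` is looked up as a column label:

```python
import pandas as pd

# made-up frame with the default integer row labels 0..5
df = pd.DataFrame({"a": range(6), "b": list("uvwxyz")})

by_position = df.iloc[1:4]  # positions 1, 2, 3 (stop excluded)
by_label = df.loc[1:4]      # labels 1, 2, 3, 4 (stop included)
print(len(by_position), len(by_label))

# loc expects labels, and no column is labelled 1
try:
    df.loc[2, 1]
    outcome = "found"
except KeyError:
    outcome = "KeyError"
print(outcome)

# the label-based and position-based forms of the same lookup
same_cell = df.loc[2, "b"] == df.iloc[2, 1]
print(same_cell)
```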

**Challenge:** how would you extract the last 100 rows for only vital status and days to death?

In [28]:
clinical_df.loc[6732:, ["vital_status", "days_to_death"]]

Unnamed: 0,vital_status,days_to_death
6732,alive,
6733,alive,
6734,alive,
6735,alive,
6736,dead,2352.0
...,...,...
6827,dead,167.0
6828,dead,442.0
6829,dead,949.0
6830,alive,


In [29]:
clinical_df.iloc[-100:, [3,5]]

Unnamed: 0,vital_status,days_to_death
6732,alive,
6733,alive,
6734,alive,
6735,alive,
6736,dead,2352.0
...,...,...
6827,dead,167.0
6828,dead,442.0
6829,dead,949.0
6830,alive,


## Calculating summary statistics

In [30]:
# calculate basic stats for all records in single column
clinical_df.age_at_diagnosis.describe()

count     6718.000000
mean     22319.849658
std       5077.709000
min       3982.000000
25%      19191.250000
50%      22841.500000
75%      26001.500000
max      32872.000000
Name: age_at_diagnosis, dtype: float64
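The `count` above (6718) is smaller than the number of rows (6832) because summary statistics skip missing values. A minimal sketch with made-up ages:

```python
import pandas as pd

# made-up ages in days; None stands in for a patient with no recorded age
ages = pd.Series([24477.0, None, 26615.0, 28171.0])

n_rows = len(ages)       # counts every row, including the missing one
n_values = ages.count()  # counts only non-missing values
mean_age = ages.mean()   # missing values are skipped by default

print(n_rows, n_values, mean_age)
```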

In [31]:
# each metric one at a time (only prints last if all executed in one cell!)
clinical_df.age_at_diagnosis.min()

3982.0

In [32]:
# convert column from days to years (approximate: ignores leap years)
clinical_df.age_at_diagnosis/365

0       67.060274
1       72.917808
2       77.180822
3       74.394521
4       81.717808
          ...    
6827    90.057534
6828    63.898630
6829    74.865753
6830    67.893151
6831    55.665753
Name: age_at_diagnosis, Length: 6832, dtype: float64

In [33]:
# convert min from days to years
clinical_df.age_at_diagnosis.min()/365

10.90958904109589

In [34]:
## Challenge: What type of summary stats do you get for object data?
clinical_df.site_of_resection_or_biopsy.describe()

count      6793
unique       94
top       C50.9
freq       1088
Name: site_of_resection_or_biopsy, dtype: object

In [35]:
## Challenge: How would you extract only the standard deviation for days to death?
clinical_df.days_to_death.std()

1052.4798716517794

## Copying vs referencing objects

In [36]:
# Using the "=" operator references the previous object
ref_clinical_df = clinical_df
ref_clinical_df

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.095890,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6827,C55,not reported,32871.0,dead,8950/3,167.0,live,C55.9,-32871.0,C55.9,,,,female,1917.0,white,not reported,2007.0,TCGA-NF-A5CP,UCS
6828,C54.1,not reported,23323.0,dead,8950/3,442.0,live,C54.1,-23323.0,C54.1,,,,female,1948.0,white,not hispanic or latino,2012.0,TCGA-NG-A4VU,UCS
6829,C54.1,not reported,27326.0,dead,8950/3,949.0,live,C54.1,-27326.0,C54.1,,,,female,1932.0,white,not hispanic or latino,2008.0,TCGA-NG-A4VW,UCS
6830,C54.1,not reported,24781.0,alive,8950/3,,live,C54.1,-24781.0,C54.1,587.0,,,female,1945.0,white,not hispanic or latino,,TCGA-QM-A5NM,UCS


In [37]:
# Using the copy() method actually creates another, independent object
true_copy_clinical_df = clinical_df.copy()
true_copy_clinical_df

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.095890,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6827,C55,not reported,32871.0,dead,8950/3,167.0,live,C55.9,-32871.0,C55.9,,,,female,1917.0,white,not reported,2007.0,TCGA-NF-A5CP,UCS
6828,C54.1,not reported,23323.0,dead,8950/3,442.0,live,C54.1,-23323.0,C54.1,,,,female,1948.0,white,not hispanic or latino,2012.0,TCGA-NG-A4VU,UCS
6829,C54.1,not reported,27326.0,dead,8950/3,949.0,live,C54.1,-27326.0,C54.1,,,,female,1932.0,white,not hispanic or latino,2008.0,TCGA-NG-A4VW,UCS
6830,C54.1,not reported,24781.0,alive,8950/3,,live,C54.1,-24781.0,C54.1,587.0,,,female,1945.0,white,not hispanic or latino,,TCGA-QM-A5NM,UCS


In [38]:
# Assign the value `0` to the first three rows of data in the DataFrame
ref_clinical_df[0:3] = 0
ref_clinical_df.head()
# note: you probably wouldn't want to actually *do* this to your data!

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
1,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
2,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


**Challenge:** How and why are the following three objects different?
_Hint: try applying `head()`_

In [39]:
clinical_df.head() # has been modified because ref_clinical_df referenced it

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
1,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
2,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


In [40]:
ref_clinical_df.head() # was actually altered

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
1,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
2,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


In [41]:
true_copy_clinical_df.head() # actual copy of original, unaltered
# reinforce that the order of operations matters!

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC
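The same behavior can be shown on a three-row made-up DataFrame, where the `is` operator makes the difference explicit. The variable names here are placeholders for the demo:

```python
import pandas as pd

original = pd.DataFrame({"value": [1, 2, 3]})
reference = original         # a second name for the same object
true_copy = original.copy()  # an independent object with its own data

reference.loc[0, "value"] = 0  # modify through the second name

print(reference is original)  # both names point to one object
print(true_copy is original)  # the copy is a different object
print(original.loc[0, "value"], true_copy.loc[0, "value"])
```

Because the copy was made before the modification, it keeps the original value; a copy made afterwards would contain the change.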


## Wrapping up
- review objectives
- preview next week's objectives
- demo of Spyder IDE, if time allows