# Introduction to Python
# Class 2: Starting with data

## Objectives

In the last class,
we learned the basics of Python syntax and Jupyter notebooks,
and examined common data types, data structures, and programming structures for Python.

By the end of this lesson, you should be able to:

- load packages and spreadsheet-style data using Python
- extract columns, rows, and portions thereof from datasets
- calculate summary statistics
- understand the difference between referencing and copying a variable

## Using packages

Open your Jupyter notebook file browser,
navigate to your project directory,
and create a new Python notebook called `class2`
with an appropriate title in the first cell in Markdown formatting.

We'll first need to load additional packages
(collections of related functions)
so the functions we'll need are available for use:

In [1]:
# make packages available to use in this notebook
import os
import urllib.request
import pandas as pd 

The packages we're using today include:

- [`os`](https://docs.python.org/3/library/os.html): to create a `data` directory
- [`urllib`](https://docs.python.org/3/library/urllib.html): for downloading files
- [`pandas`](https://pandas.pydata.org): for data manipulation and analysis

For the last package,
`pd` is defined as an alias, or shortcut,
that we'll use to specify we're calling a function from that package.
For the rest of this lesson, we'll preface each function with the name (or alias) of the package from which it's been loaded.

## Importing data

Before we can download our data,
we should create a new directory to contain it:

In [2]:
# create data directory
os.mkdir("data")
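
Note that `os.mkdir` raises an error if the directory already exists, which can happen when you rerun a notebook. As a minimal sketch of one alternative (not part of the original lesson), `os.makedirs` with `exist_ok=True` can be called repeatedly. The temporary directory here is only an assumption for the example, so that it doesn't touch your project:

```python
import os
import tempfile

# work inside a throwaway directory so the sketch leaves your project alone
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data")

    os.makedirs(data_dir, exist_ok=True)   # creates the directory
    os.makedirs(data_dir, exist_ok=True)   # second call does not raise an error

    created = os.path.isdir(data_dir)

print(created)
```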

Then we can use a function from the `urllib` package to download the data file:

In [3]:
# download dataset
urllib.request.urlretrieve("https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.csv", "data/clinical.csv")

('data/clinical.csv', <http.client.HTTPMessage at 0x7fb247181e90>)

The first argument (string inside quotation marks)
represents the URL from which the data is being downloaded.
The second argument ("data/clinical.csv") indicates where the data will be saved.
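
The two-part output shown above is the value returned by `urlretrieve`: the path where the file was saved, followed by header information about the download. The sketch below demonstrates the same call without needing an internet connection, using a local file as a stand-in for the remote URL. The file names and contents are hypothetical:

```python
import os
import pathlib
import tempfile
import urllib.request

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical stand-in for a remote file: a tiny csv on the local disk
    source = pathlib.Path(tmp) / "source.csv"
    source.write_text("id,value\n1,10\n")
    destination = os.path.join(tmp, "downloaded.csv")

    # first argument: where the data come from; second: where to save them
    saved_path, headers = urllib.request.urlretrieve(source.as_uri(), destination)

    # read the saved file back to confirm the contents were transferred
    with open(saved_path) as handle:
        contents = handle.read()

print(saved_path == destination)
print(contents)
```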

> If the above code doesn't work, you can download the data
as a [zip file](https://www.dropbox.com/s/k639bkse64r0bfz/data.zip),
manually unzip it, and move the resulting folder to your project's data directory.

Notice that the URL above ends in "clinical.csv", 
which is also the name we used to save the file on our computers.
If you click on the URL and view it in a web browser, the format isn’t particularly easy for us to understand. 
The data we’ve downloaded are in csv format, which stands for “comma separated values.” This means the data are organized into rows and columns, with columns separated by commas.
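
To make the format concrete, here is a small sketch that reads comma separated text directly from a string instead of a file. The two-row table and its column names are made up for illustration and are not part of the clinical data:

```python
import io

import pandas as pd

# hypothetical csv content: one header line, then one line per row
csv_text = "patient,age_days,status\nA,24477,dead\nB,27154,alive\n"

# io.StringIO lets read_csv treat the string as if it were a file
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column names come from the header line
```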

These data are arranged in a tidy format, meaning each row represents an observation, and each column represents a variable (piece of data for each observation). Moreover, only one piece of data is entered in each cell.

These data are clinical cancer data from the National Cancer Institute’s Genomic Data Commons, specifically from The Cancer Genome Atlas, or TCGA.
Each row represents a patient, and each column represents information about demographics (race, age at diagnosis, etc.) and disease (e.g., cancer type).
The data were downloaded and aggregated using a script included in the 
[Introduction to R course](https://github.com/fredhutchio/R_intro).

We can import these data and assign them to a variable:

In [4]:
# assign data to variable
clinical_df = pd.read_csv("data/clinical.csv")

The command executed successfully, 
but we still need to ensure the data have been imported correctly.

> You can view the entire dataset in Jupyter notebooks by executing the name of the object (e.g., `clinical_df`).
You'll see a text box similar to that shown below,
except you'll be able to scroll down and across the entire data set.
For these lessons, 
we will show a small part of these data to keep materials concise.

There are a few ways we can inspect the data.
First, we can preview the data:

In [5]:
# preview first few rows of the data
clinical_df.head()

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


The `head` function by default shows the column headers,
along with the first five rows of data.
You can specify a different number of rows by placing that number inside the parentheses, 
demonstrated below using `tail`, 
which shows the last few rows:

In [6]:
# print last eight rows of data to screen
clinical_df.tail(8) # print last n rows

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
6824,C55,not reported,21901.0,alive,8980/3,,live,C55.9,-21901.0,C55.9,298.0,,,female,1952.0,white,not hispanic or latino,,TCGA-NF-A4WU,UCS
6825,C55,not reported,22407.0,alive,8950/3,,live,C55.9,-22407.0,C55.9,452.0,,,female,1950.0,not reported,not reported,,TCGA-NF-A4WX,UCS
6826,C55,not reported,21908.0,alive,8950/3,,live,C55.9,-21908.0,C55.9,81.0,,,female,1953.0,white,not hispanic or latino,,TCGA-NF-A4X2,UCS
6827,C55,not reported,32871.0,dead,8950/3,167.0,live,C55.9,-32871.0,C55.9,,,,female,1917.0,white,not reported,2007.0,TCGA-NF-A5CP,UCS
6828,C54.1,not reported,23323.0,dead,8950/3,442.0,live,C54.1,-23323.0,C54.1,,,,female,1948.0,white,not hispanic or latino,2012.0,TCGA-NG-A4VU,UCS
6829,C54.1,not reported,27326.0,dead,8950/3,949.0,live,C54.1,-27326.0,C54.1,,,,female,1932.0,white,not hispanic or latino,2008.0,TCGA-NG-A4VW,UCS
6830,C54.1,not reported,24781.0,alive,8950/3,,live,C54.1,-24781.0,C54.1,587.0,,,female,1945.0,white,not hispanic or latino,,TCGA-QM-A5NM,UCS
6831,C54.1,not reported,20318.0,alive,8950/3,,live,C54.1,-20318.0,C54.1,0.0,,,female,1957.0,asian,not hispanic or latino,,TCGA-QN-A5NN,UCS


> #### Challenge-import
Download, import, and inspect the following data files. 
The URL for each sample dataset is included along with a name to assign to the variable. 
(Hint: you can use the same function as above, but may need to update the `sep =` argument)
> - example1: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.tsv
> - example2: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.txt

Importing data can be tricky and frustrating. 
However, if you can’t get your data into Python, 
you can’t do anything to analyze or visualize it. 
It’s worth understanding how to do it effectively to save you time and energy later.

Now that we have data imported and available, 
we can print a summary of all column names, number of entries, data types, and non-null values:

In [7]:
# print summary
clinical_df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6832 entries, 0 to 6831
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   primary_diagnosis            6793 non-null   object 
 1   tumor_stage                  6793 non-null   object 
 2   age_at_diagnosis             6718 non-null   float64
 3   vital_status                 6793 non-null   object 
 4   morphology                   6793 non-null   object 
 5   days_to_death                2187 non-null   float64
 6   state                        6793 non-null   object 
 7   tissue_or_organ_of_origin    6793 non-null   object 
 8   days_to_birth                6718 non-null   float64
 9   site_of_resection_or_biopsy  6793 non-null   object 
 10  days_to_last_follow_up       5714 non-null   float64
 11  cigarettes_per_day           1171 non-null   float64
 12  years_smoked                 448 non-null    float64
 13  gender            

The output above highlights another key feature of `pandas`:
it interprets data in ways that make them easier to analyze.

The description at the top of this output,
`pandas.core.frame.DataFrame`,
describes the data structure as a data frame,
which is how `pandas` interprets spreadsheet style data.
Directly below that line,
we see a note that there are 6832 observations (rows, or in our case, patients),
as well as 20 columns.
A summary of the data type for each column is below.

In the last lesson,
we discussed data types built into Python.
`pandas` uses its own set of data types,
some of which appear in our data:

- `object` in `pandas` is used here for string (character) data
- `float64` is float data (the `64` indicates each value is stored using 64 bits)
- `int64` isn't represented in our data, but refers to integer data
- `datetime64` also isn't shown here, but refers to a specific format that makes working with date and time data easier
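
As a small illustration of these data types, the sketch below builds a toy data frame (the column names and values are invented for this example) and prints the type `pandas` assigns to each column. It also suggests why whole-number columns such as `year_of_birth` appear as `1936.0` in our preview: a numeric column containing missing values is stored as `float64`.

```python
import pandas as pd

toy = pd.DataFrame({
    "visits": [1, 2, 3],        # whole numbers
    "dose": [0.5, 1.5, 2.5],    # decimal numbers
    "visit_date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
})
print(toy.dtypes)

# a missing value forces whole numbers to be stored as floats
with_missing = pd.Series([1936, None, 1927])
print(with_missing.dtype)
```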

> To create a list in a markdown cell, 
use an asterisk (`*`) or dash (`-`) followed by a space;
these will be rendered as bullet points when you execute the cell.

## Accessing columns and rows

A common task in data analysis is to extract particular columns or rows,
often referred to as subsetting.
This section explores a few different ways to access these parts of our spreadsheet.

First, we can subset a single column using its name (column header):

In [8]:
# show only the first few rows of one column
clinical_df["tumor_stage"].head()

0     stage ia
1     stage ib
2     stage ib
3     stage ia
4    stage iib
Name: tumor_stage, dtype: object

The square brackets above are a common subsetting syntax in Python.
The quotation marks around the column name are necessary for Python to interpret it as a column,
rather than a variable name.
We've added `.head()` to the end so we only preview the first few rows, rather than the entire column.

> In the output above, it looks like there are two columns output:
one of numbers and another of the tumor stage.
The numbers aren't actually a column;
they represent the row labels
(which in this dataset also happen to be the index values).

Similarly, we can assess the data type of a specific column:

In [9]:
# show data type for a column
clinical_df["tumor_stage"].dtype 

dtype('O')

The output, `O`, 
indicates these data are object (character) type.

One of the shortcuts afforded by `pandas` is the ability to treat the column names as attributes,
which means you can access them using the `.` syntax:

In [10]:
# access columns by name using dot syntax
clinical_df.tumor_stage.head()

0     stage ia
1     stage ib
2     stage ib
3     stage ia
4    stage iib
Name: tumor_stage, dtype: object

Here, we've also used `.head()` to minimize the amount of data printed to the screen.
If you were assigning data to a new variable name, 
you would likely be using the whole column instead.
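
The dot syntax is convenient, but it is worth knowing its limits. In the sketch below (a toy data frame with invented column names), a column called `count` shares its name with a data frame method, and a column name containing a space isn't a valid attribute name, so both need square brackets:

```python
import pandas as pd

toy = pd.DataFrame({
    "count": [3, 5],
    "tumor stage": ["stage ia", "stage ib"],
})

# dot syntax finds the built-in count method, not our column
print(callable(toy.count))

# square brackets always refer to the column
print(toy["count"].tolist())
print(toy["tumor stage"].tolist())
```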

If you need to extract multiple columns, 
you'll need to adjust the syntax slightly:

In [11]:
# Select two columns at once
clinical_df[["tumor_stage", "vital_status"]].head()

Unnamed: 0,tumor_stage,vital_status
0,stage ia,dead
1,stage ib,dead
2,stage ib,dead
3,stage ia,alive
4,stage iib,dead


In this case, we can't use the dot syntax, since it only accesses one column at a time.
The double square brackets aren't special syntax:
the outer pair performs the subsetting,
and the inner pair creates a list of the column names to extract.
In general, the dot syntax means you are accessing a part of the thing (generally a variable)
that comes before the dot.
In the case of `clinical_df.tumor_stage`, that part is the column named `tumor_stage`.
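
To see what the two sets of brackets are doing, the sketch below uses a toy data frame (invented for this example) and stores the column names in a list first. It also compares the kind of object returned for one column versus a list of columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "tumor_stage": ["stage ia", "stage ib"],
    "vital_status": ["dead", "alive"],
    "age_days": [24477.0, 27154.0],
})

wanted = ["tumor_stage", "vital_status"]   # an ordinary Python list
two_columns = toy[wanted]                   # same as toy[["tumor_stage", "vital_status"]]
one_column = toy["tumor_stage"]

print(type(one_column).__name__)    # a single column is a Series
print(type(two_columns).__name__)   # a list of columns gives a DataFrame
```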

> #### Challenge-typo
What happens if you misspell the name of a column?

> #### Challenge-order
Does the order of the columns you list matter?

We can also extract rows from a data frame:

In [12]:
# access three rows 
clinical_df[0:3]

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC


In the output above, 
we see three rows (index positions 0, 1, 2).
This type of subsetting is noninclusive of the endpoint,
meaning the row at index 3 is not selected from a range of `0:3`.
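
This is the same convention Python uses for slicing lists, which we can check with a short sketch (the list and toy data frame are invented for illustration):

```python
import pandas as pd

letters = ["a", "b", "c", "d", "e"]
print(letters[0:3])          # items at positions 0, 1, and 2

toy = pd.DataFrame({"letter": letters})
first_three = toy[0:3]
print(len(first_three))      # three rows; the row at position 3 is left out
```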

We can also select a range including the end of the data frame by leaving the field after the colon empty:

In [13]:
# access row 6829 through the end of the data frame
clinical_df[6829:].tail()

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
6829,C54.1,not reported,27326.0,dead,8950/3,949.0,live,C54.1,-27326.0,C54.1,,,,female,1932.0,white,not hispanic or latino,2008.0,TCGA-NG-A4VW,UCS
6830,C54.1,not reported,24781.0,alive,8950/3,,live,C54.1,-24781.0,C54.1,587.0,,,female,1945.0,white,not hispanic or latino,,TCGA-QM-A5NM,UCS
6831,C54.1,not reported,20318.0,alive,8950/3,,live,C54.1,-20318.0,C54.1,0.0,,,female,1957.0,asian,not hispanic or latino,,TCGA-QN-A5NN,UCS


Again, we've used `tail` to show only the end of the data frame.

We can perform a similar operation to `tail` using a negative index value that extracts only the last row:

In [14]:
# access the last row in the data frame
clinical_df[-1:] 

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
6831,C54.1,not reported,20318.0,alive,8950/3,,live,C54.1,-20318.0,C54.1,0.0,,,female,1957.0,asian,not hispanic or latino,,TCGA-QN-A5NN,UCS


> #### Challenge-last
How would you extract the last 10 rows of the dataset?

## Slicing subsets of rows and columns

Now that we have a basic understanding of accessing whole rows and columns, 
we are ready to discuss slicing
(extracting portions of rows and columns).

There are multiple ways to slice a data frame. 
We'll begin by exploring `iloc`, 
which uses integer indexing.
This means we'll reference rows and columns by their index position:

In [15]:
# access one data element from a single cell
clinical_df.iloc[2, 1]

'stage ib'

We can check one of our previews of the data above to see that this does represent the data in that cell.

As with subsetting described in the previous section,
we can also extract ranges of cells:

In [16]:
# select range of data
clinical_df.iloc[0:3, 1:4]

Unnamed: 0,tumor_stage,age_at_diagnosis,vital_status
0,stage ia,24477.0,dead
1,stage ib,26615.0,dead
2,stage ib,28171.0,dead


As described earlier with subsetting using ranges of index values,
we can see in the output above that the start bound of each range is inclusive, while the stop bound is not.

We can also include an empty start or stop bound to indicate the beginning or end of the data frame, respectively:

In [17]:
# empty start and stop bounds indicate the beginning and end of the data
clinical_df.iloc[:2, 18:]

Unnamed: 0,bcr_patient_barcode,disease
0,TCGA-18-3406,LUSC
1,TCGA-18-3407,LUSC


Now we'll move on and explore the second method for extracting slices,
using `loc`, which uses label indexing.
The tricky part with our data is that the row labels are integers that happen to match the index positions.
This means that when we extract a range of rows,
we can still reference those values:

In [18]:
# slicing using loc
clinical_df.loc[1:4]

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


Here you can note one of the major differences between `iloc` and `loc`:
the latter includes both the start and stop bounds.
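
The distinction between positions and labels is easier to see when the two differ. The following sketch uses a toy data frame whose row labels (10, 20, 30, 40) are chosen so they don't match the positions (0, 1, 2, 3). The data are invented for illustration:

```python
import pandas as pd

toy = pd.DataFrame(
    {"stage": ["ia", "ib", "iia", "iib"]},
    index=[10, 20, 30, 40],   # row labels
)

by_position = toy.iloc[1:3]   # positions 1 and 2; stop bound excluded
by_label = toy.loc[20:40]     # labels 20 through 40; stop bound included

print(by_position.index.tolist())
print(by_label.index.tolist())
```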

We can still use empty bounds:

In [19]:
# empty stop boundary to indicate end of data
clinical_df.loc[6830: ]

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
6830,C54.1,not reported,24781.0,alive,8950/3,,live,C54.1,-24781.0,C54.1,587.0,,,female,1945.0,white,not hispanic or latino,,TCGA-QM-A5NM,UCS
6831,C54.1,not reported,20318.0,alive,8950/3,,live,C54.1,-20318.0,C54.1,0.0,,,female,1957.0,asian,not hispanic or latino,,TCGA-QN-A5NN,UCS


We can also select all columns for a specific set of rows by adding the row labels as a list:

In [20]:
# Select all columns for rows of index values specified
clinical_df.loc[[0, 10, 6831], ]

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
10,C34.9,stage iv,24019.0,dead,8070/3,1097.0,live,C34.9,-24019.0,C34.9,758.0,1.369863,26.0,male,1941.0,not reported,not reported,,TCGA-18-3417,LUSC
6831,C54.1,not reported,20318.0,alive,8950/3,,live,C54.1,-20318.0,C54.1,0.0,,,female,1957.0,asian,not hispanic or latino,,TCGA-QN-A5NN,UCS


Finally, we can use the column labels for extraction:

In [21]:
# select first row for specified columns
clinical_df.loc[0, ["primary_diagnosis", "tumor_stage", "age_at_diagnosis"]]

primary_diagnosis       C34.1
tumor_stage          stage ia
age_at_diagnosis        24477
Name: 0, dtype: object

> #### Challenge-location
Why doesn't the following code work? 
>
> `clinical_df.loc[2, 6]`

> #### Challenge-100
How would you extract the last 100 rows for only vital status and days to death?

So far, we've been printing the output from our subsetting and slicing to the screen 
(often using `.head()`).
Remember that if you'd like to use these data for another purpose,
it's possible you may want to assign these data to a new variable to further manipulate
(but see the section below comparing referencing and copying!).

## Calculating summary statistics

Once you've extracted your data of interest, 
you will likely want to be able to assess basic statistical features of the data.

Data frames allow you to assess these features:

In [22]:
# calculate basic stats for a single column
clinical_df.age_at_diagnosis.describe()

count     6718.000000
mean     22319.849658
std       5077.709000
min       3982.000000
25%      19191.250000
50%      22841.500000
75%      26001.500000
max      32872.000000
Name: age_at_diagnosis, dtype: float64

In this case, 
we've assessed a collection of summary statistics for the column "age at diagnosis" 
using the `.describe()` function.
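
Notice that the `count` above (6718) is smaller than the total number of rows (6832) because missing values are left out of these calculations. A small sketch with invented values shows the same behavior:

```python
import pandas as pd

# four entries, one of which is missing
ages = pd.Series([10.0, 20.0, None, 30.0], name="age_days")

summary = ages.describe()
print(summary["count"])   # counts only the non-missing values
print(ages.mean())        # the mean also skips the missing value
```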

You can access the statistics listed above individually as well:

In [23]:
# calculate only the minimum for age at diagnosis
clinical_df.age_at_diagnosis.min()

3982.0

We can also use column names to perform mathematical operations,
such as unit conversion:

In [24]:
# convert age column from days to years
clinical_df.age_at_diagnosis.head()/365

0    67.060274
1    72.917808
2    77.180822
3    74.394521
4    81.717808
Name: age_at_diagnosis, dtype: float64
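
Because the calculation above is only printed, the original column is left unchanged. As a sketch of how you might keep the converted values (using a toy data frame with invented ages), you can assign the result to a new variable:

```python
import pandas as pd

toy = pd.DataFrame({"age_days": [3650.0, 7300.0, 10950.0]})

# the division is applied to every value in the column
age_years = toy["age_days"] / 365

print(age_years.tolist())
print(toy["age_days"].tolist())   # the original column still holds days
```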

We can also perform a conversion on a summary statistic:

In [25]:
# convert minimum age at diagnosis to years
clinical_df.age_at_diagnosis.min()/365

10.90958904109589

> #### Challenge-object
What type of summary statistics do you get for object data?
(Hint: try this on one of our columns of object data, like `tumor_stage`)

> #### Challenge-deviation
How would you extract only the standard deviation for days to death?

In [26]:
clinical_df.tumor_stage.describe()

count             6793
unique              18
top       not reported
freq              2753
Name: tumor_stage, dtype: object

## Copying vs referencing

In this final section, 
we'll take a look at the difference between copying and referencing objects (variables).

It's often desirable to create a new variable that you can then use for data filtering.
It is possible to reference another variable using the `=` assignment operator:

In [27]:
# reference another object
ref_clinical_df = clinical_df
ref_clinical_df.head()

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


If you inspect both objects,
you'll see they're identical.

Next, we'll use the `copy` method to create a new object:

In [28]:
# create another object using copy
true_copy_clinical_df = clinical_df.copy()
true_copy_clinical_df.head()

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


Now we have a few objects to compare.

We'll assess how these objects change by making a clear, obvious change in the referenced data frame.
This isn't something you'd necessarily want to do in your own data,
but is a way for us to quickly see what happens to the other objects:

In [29]:
# assign the value `0` to the first three rows of data
ref_clinical_df[0:3] = 0
ref_clinical_df.head()

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
1,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
2,0,0,0.0,0,0,0.0,0,0,0.0,0,0.0,0.0,0.0,0,0.0,0,0,0.0,0,0
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


> #### Challenge-ref
What has happened to each of our three objects?
> - `true_copy_clinical_df`, created from the original data frame using `copy`
> - `ref_clinical_df`, for which the first three rows have been changed to 0
> - `clinical_df`, which is the original data frame
(Hint: you can examine the data directly,
or use code to assess whether summary statistics,
like minimum age at diagnosis,
differ among these data frames)

Remember, referencing objects does not protect the original object from modification,
because the `=` creates another name for the same object.
Using the `copy` method allows you to retain the original object in its unaltered state.
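
A compact way to check this behavior is sketched below with a toy data frame (invented values, numeric only). The `is` operator tests whether two names refer to the same object:

```python
import pandas as pd

original = pd.DataFrame({"age_days": [100.0, 200.0, 300.0]})

reference = original           # a second name for the same object
independent = original.copy()  # a separate object with the same contents

print(reference is original)
print(independent is original)

# change one value through the referenced name
reference.loc[0, "age_days"] = 0.0

print(original["age_days"].min())      # the original reflects the change
print(independent["age_days"].min())   # the copy keeps the old value
```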

## Wrapping up

Today, we imported spreadsheet-style data into Python,
learned to inspect and subset the resulting data frame,
as well as calculate summary statistics and copy/reference objects (variables).

Next time, we'll work on additional types of data manipulation,
including extracting data that meet particular criteria
and dealing with missing data.




## Extra exercises

Answers to all challenge exercises are available [here](https://fredhutchio.github.io/python_intro/solutions/).

Please remember that the last section of the lesson above modified `clinical_df`;
you should execute the following line of code to import the original data again before attempting the following exercises:

In [30]:
# import the original data again
clinical_df = pd.read_csv("data/clinical.csv")

> #### Challenge-first-five
Print to the screen the first five rows of `clinical_df` with only the following columns: `primary_diagnosis`, `tumor_stage`, `age_at_diagnosis`, `vital_status`, `gender`, `disease`.
(Hint: use `loc` or `iloc`)

> #### Challenge-summary
Obtain summary statistics for the `clinical_df` data grouped by both `disease` and `vital_status`.

> #### Challenge-last
Identify two ways to access the last item of the following list: `test_data = [50, 40, 30]`

> #### Challenge-compare
We examined differences between the copied and referenced data frames by manually inspecting the data itself.
Calculate the difference between the minimum age at diagnosis for:
> - `clinical_df` and `true_copy_clinical_df`
> - `clinical_df` and `ref_clinical_df`