# Introduction to Python, Class 2: Starting with data

## Objectives

In the last class,
we learned the basics of Python syntax and Jupyter notebooks,
and examined common data types, data structures, and programming structures for Python.

By the end of this lesson, you should be able to:

- load packages and spreadsheet-style data using Python
- extract columns, rows, and portions thereof from datasets
- calculate summary statistics
- understand the difference between referencing and copying a variable

## Using packages

Open your Jupyter notebook file browser,
navigate to your project directory,
and create a new Python notebook called `class2`
with an appropriate title in the first cell in Markdown formatting.

We'll first need to load additional packages
(collections of related functions)
so the functions we'll need are available for use:

In [7]:
# make packages available to use in this notebook
import os
import urllib.request
import pandas as pd 

The packages we're using today include:

- [`os`](https://docs.python.org/3/library/os.html): to create a `data` directory
- [`urllib`](https://docs.python.org/3/library/urllib.html): for downloading files
- [`pandas`](https://pandas.pydata.org): for data manipulation and analysis

For the last package,
`pd` is defined as an alias, or shortcut,
that we type in place of the full package name when calling one of its functions.
For the rest of this lesson, we'll preface each function with the name (or alias) of the package from which it was loaded.
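
As a small illustration of an alias in use, the sketch below builds a tiny data frame by hand and calls a `pandas` function through `pd`. The variable name `toy_df` and its contents are made up for this example and are not part of the lesson's dataset.

```python
import pandas as pd  # "pd" is the alias we type instead of "pandas"

# hypothetical toy data, not part of the clinical dataset
toy_df = pd.DataFrame({"patient": ["a", "b", "c"], "age": [61, 47, 70]})

# DataFrame comes from pandas, so we reach it through the alias
print(toy_df.shape)  # (number of rows, number of columns)
```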

## Importing data

Before we can download our data,
we should create a new directory to contain it:

In [8]:
# create data directory
os.mkdir("data")
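
Note that `os.mkdir` raises a `FileExistsError` if the directory already exists, which can happen if you run the cell a second time. One possible alternative, if you want a cell that is safe to re-run, is `os.makedirs` with `exist_ok=True`:

```python
import os

# create the data directory; do nothing if it already exists
os.makedirs("data", exist_ok=True)

# confirm the directory is there
print(os.path.isdir("data"))
```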

Then we can use a function from the `urllib` package to download the data file:

In [9]:
# download dataset
urllib.request.urlretrieve("https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.csv", "data/clinical.csv")

('data/clinical.csv', <http.client.HTTPMessage at 0x1075130b8>)

The first argument (string inside quotation marks)
represents the URL from which the data is being downloaded.
The second argument ("data/clinical.csv") indicates where the data will be saved.

> If the above code doesn't work, you can download the data 
as a [zip file](https://www.dropbox.com/s/k639bkse64r0bfz/data.zip),
manually unzip it, and move the resulting folder to your project's data directory.

Notice that the URL above ends in "clinical.csv", 
which is also the name we used to save the file on our computers.
If you click on the URL and view it in a web browser, the format isn’t particularly easy for us to understand. 
The data we’ve downloaded are in csv format, which stands for “comma separated values.” This means the data are organized into rows and columns, with columns separated by commas.

These data are arranged in a tidy format, meaning each row represents an observation, and each column represents a variable (piece of data for each observation). Moreover, only one piece of data is entered in each cell.
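
To make the csv layout concrete, here is a minimal sketch that uses only built-in string methods. The text in `csv_text` is invented for illustration and is not taken from the clinical data.

```python
# made-up example text in csv format (not taken from the clinical data)
csv_text = "patient,age,status\np1,61,alive\np2,47,dead"

# each line is a row; commas separate the columns within a row
rows = [line.split(",") for line in csv_text.splitlines()]

header = rows[0]         # first line holds the column names (variables)
observations = rows[1:]  # remaining lines hold one observation each

print(header)
print(observations)
```

In practice we rarely split csv text by hand like this; the sketch is only meant to show what the commas and line breaks represent.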

These data are clinical cancer data from the National Cancer Institute’s Genomic Data Commons, specifically from The Cancer Genome Atlas, or TCGA.
Each row represents a patient, and each column represents information about demographics (race, age at diagnosis, etc) and disease (e.g., cancer type).
The data were downloaded and aggregated using a script included in the 
[Introduction to R course](https://github.com/fredhutchio/R_intro).

We can import these data and assign them to a variable:

In [10]:
# assign data to variable
clinical_df = pd.read_csv("data/clinical.csv")

The command executed successfully, 
but we still need to ensure the data have been imported correctly.

There are a few ways we can inspect the data.
First, we can preview the data:

In [15]:
# preview first few rows of the data
clinical_df.head()

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC


The `head` function by default shows the column headers,
along with the first five rows of data.
You can specify a different number of rows by placing that number inside the parentheses,
demonstrated below with eight rows.
The related `tail` function works the same way but shows the last few rows.

In [16]:
# print first eight rows of data to screen
clinical_df.head(8) # print top n rows

Unnamed: 0,primary_diagnosis,tumor_stage,age_at_diagnosis,vital_status,morphology,days_to_death,state,tissue_or_organ_of_origin,days_to_birth,site_of_resection_or_biopsy,days_to_last_follow_up,cigarettes_per_day,years_smoked,gender,year_of_birth,race,ethnicity,year_of_death,bcr_patient_barcode,disease
0,C34.1,stage ia,24477.0,dead,8070/3,371.0,live,C34.1,-24477.0,C34.1,,10.958904,,male,1936.0,white,not hispanic or latino,2004.0,TCGA-18-3406,LUSC
1,C34.1,stage ib,26615.0,dead,8070/3,136.0,live,C34.1,-26615.0,C34.1,,2.191781,,male,1931.0,asian,not hispanic or latino,2003.0,TCGA-18-3407,LUSC
2,C34.3,stage ib,28171.0,dead,8070/3,2304.0,live,C34.3,-28171.0,C34.3,2099.0,1.643836,,female,1927.0,white,not hispanic or latino,,TCGA-18-3408,LUSC
3,C34.1,stage ia,27154.0,alive,8083/3,,live,C34.1,-27154.0,C34.1,3747.0,1.09589,,male,1930.0,white,not hispanic or latino,,TCGA-18-3409,LUSC
4,C34.3,stage iib,29827.0,dead,8070/3,146.0,live,C34.3,-29827.0,C34.3,,,,male,1923.0,not reported,not reported,2004.0,TCGA-18-3410,LUSC
5,C34.1,stage iiia,23370.0,alive,8070/3,,live,C34.1,-23370.0,C34.1,3576.0,2.739726,,female,1942.0,not reported,not reported,,TCGA-18-3411,LUSC
6,C34.3,stage ib,19025.0,dead,8070/3,345.0,live,C34.3,-19025.0,C34.3,,1.369863,,male,1953.0,white,not hispanic or latino,2005.0,TCGA-18-3412,LUSC
7,C34.3,stage iv,26938.0,dead,8070/3,716.0,live,C34.3,-26938.0,C34.3,,1.369863,,male,1932.0,asian,not hispanic or latino,2006.0,TCGA-18-3414,LUSC
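
For comparison, `tail` accepts a number of rows in the same way but counts from the end of the data. The following sketch uses a made-up ten-row data frame (`toy_df`) so the result is easy to check by eye:

```python
import pandas as pd

# hypothetical toy data frame with ten rows (index 0 through 9)
toy_df = pd.DataFrame({"value": range(10)})

first_three = toy_df.head(3)  # rows with index 0, 1, 2
last_three = toy_df.tail(3)   # rows with index 7, 8, 9

print(first_three)
print(last_three)
```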


**Challenge:** Download, import, and inspect the following data files. 
The URL for each sample dataset is included along with a name to assign to the variable. 
(Hint: you can use the same function as above, but may need to update the `sep =` argument)

- URL: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.tsv, object name: example1
- URL: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.txt, object name: example2

Importing data can be tricky and frustrating. 
However, if you can’t get your data into Python, 
you can’t do anything to analyze or visualize it. 
It’s worth understanding how to do it effectively to save you time and energy later.

Now that we have data imported and available, 
we can print a summary of all column names, number of entries, data types, and non-null values:

In [17]:
# print summary
clinical_df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6832 entries, 0 to 6831
Data columns (total 20 columns):
primary_diagnosis              6793 non-null object
tumor_stage                    6793 non-null object
age_at_diagnosis               6718 non-null float64
vital_status                   6793 non-null object
morphology                     6793 non-null object
days_to_death                  2187 non-null float64
state                          6793 non-null object
tissue_or_organ_of_origin      6793 non-null object
days_to_birth                  6718 non-null float64
site_of_resection_or_biopsy    6793 non-null object
days_to_last_follow_up         5714 non-null float64
cigarettes_per_day             1171 non-null float64
years_smoked                   448 non-null float64
gender                         6793 non-null object
year_of_birth                  6662 non-null float64
race                           6793 non-null object
ethnicity                      6793 non-null object
yea

The output above highlights another of the key features of `pandas`:
it interprets data in ways that make it easier to analyze.

The description at the top of this output,
`pandas.core.frame.DataFrame`,
describes the data structure as a data frame,
which is how `pandas` interprets spreadsheet-style data.
Directly below that line,
we see a note that there are 6832 observations (rows, or in our case, patients),
as well as 20 columns.
A summary of the data type for each column is below.

In the last lesson,
we discussed data types built into Python.
`pandas` uses its own data types,
some of which appear in our data:

- `object` data in `pandas` represents string (character) data in native Python
- `float64` is still float data (the `64` means each value is stored using 64 bits)
- `int64` in `pandas` isn't represented in our data, but refers to integer data
- `datetime64` from `pandas` also isn't shown here, but refers to a specific format to make working with date and time data easier.
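
The data types listed above can be seen side by side in a small sketch. The data frame below is invented for illustration, with one column of each kind. The text column is reported as `object` in the output shown in this lesson; depending on your version of `pandas`, it may instead be reported as a dedicated string type.

```python
import pandas as pd

# hypothetical toy data with one column of each kind
toy_df = pd.DataFrame({
    "disease": ["LUSC", "BRCA"],                            # text
    "cigarettes_per_day": [1.5, 2.0],                       # decimal numbers
    "year_of_birth": [1936, 1931],                          # whole numbers
    "visit": pd.to_datetime(["2004-01-15", "2003-06-30"]),  # dates
})

# dtypes lists the data type pandas chose for each column
print(toy_df.dtypes)
```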

> To create a list in a markdown cell, 
use an asterisk (`*`) or dash (`-`) followed by a space;
these will be rendered as bullet points when you execute the cell.

## Accessing columns and rows

A common task in data analysis is to extract particular columns or rows,
often referred to as subsetting.
This section explores a few different ways to access these parts of our spreadsheet.

First, we can extract a single column using its name:

In [None]:
# select a "subset" of the data using the column name
clinical_df["tumor_stage"]

In [None]:
# show only the first few rows of output
clinical_df["tumor_stage"].head()

In [None]:
# show data type for this column
clinical_df["tumor_stage"].dtype # single column, O stands for "object"

In [None]:
# use the column name as an "attribute"; gives the same output
clinical_df.tumor_stage

In [None]:
# head still works here!
clinical_df.tumor_stage.head()

In [None]:
# What happens if you ask for a column that doesn't exist?
# clinical_df["tumorstage"] # uncomment this line

In [None]:
# Select two columns at once
clinical_df[["tumor_stage", "vital_status"]]
# can't use .column_name because there are multiple columns!
# the double brackets are normal python syntax:
# the outer brackets select from the data frame,
# and the inner brackets create a list of the column names to select
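
One consequence of the two selection styles is the type of object you get back. The sketch below uses a made-up three-column data frame (`toy_df`) to show that a single column name returns a `Series`, while a list of names returns a `DataFrame`:

```python
import pandas as pd

# hypothetical toy data, reusing a few column names from the lesson
toy_df = pd.DataFrame({
    "tumor_stage": ["stage ia", "stage ib"],
    "vital_status": ["dead", "alive"],
    "disease": ["LUSC", "LUSC"],
})

one_column = toy_df["tumor_stage"]                     # one name: a Series
two_columns = toy_df[["tumor_stage", "vital_status"]]  # list of names: a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```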

**Challenge:** does the order of the columns you list matter?

In [None]:
# Select rows 0, 1, 2 (row 3 is not selected)
clinical_df[0:3]

In [None]:
# Select from the second row (index 1) to the end
clinical_df[1:]

In [None]:
# Select the last row of the data frame
clinical_df[-1:] # what does this mean in the context of indexing?

**Challenge:** how would you extract the last 10 rows of the dataset?

## Slicing subsets of rows and columns

In [None]:
# iloc is integer indexing [row slicing, column slicing]
# locate specific data element
clinical_df.iloc[2, 6]

In [None]:
# select range of data
clinical_df.iloc[0:3, 1:4]

In [None]:
# stop/end bound is NOT inclusive (e.g., up to but not including 3)
# can use empty stop boundary to indicate end of data
clinical_df.iloc[0:, 1:4]

In [None]:
# loc is for label indexing (integers interpreted as labels)
# start and stop bound are inclusive
clinical_df.loc[1:4]

In [None]:
# can use empty stop boundary to indicate end of data
clinical_df.loc[1: ]

In [None]:
# Select all columns for rows of index values specified
clinical_df.loc[[0, 10, 6831], ]

In [None]:
# select first row for specified columns
clinical_df.loc[0, ["primary_diagnosis", "tumor_stage", "age_at_diagnosis"]]
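
The difference in how `iloc` and `loc` treat the stop bound is easy to miss, so here is a minimal sketch on a made-up five-row data frame (`toy_df`) with the default integer index:

```python
import pandas as pd

# hypothetical toy data frame with the default integer index 0..4
toy_df = pd.DataFrame({"a": [10, 11, 12, 13, 14], "b": list("vwxyz")})

by_position = toy_df.iloc[1:4]  # positions 1, 2, 3 (stop bound excluded)
by_label = toy_df.loc[1:4]      # labels 1, 2, 3, 4 (stop bound included)

print(len(by_position))
print(len(by_label))
```

The same slice, `1:4`, returns three rows with `iloc` and four rows with `loc`.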

**Challenge:** why doesn't the following code work?

`clinical_df.loc[2, 6]`

**Challenge:** how would you extract the last 100 rows for only vital status and days to death?

In [None]:
clinical_df.loc[6732:, ["vital_status", "days_to_death"]]

In [None]:
clinical_df.iloc[-100:, [3,5]]

In [None]:
clinical_df.info()
# Say you have a dataframe with a lot of columns and you want to grab many of them
# for your analysis, but not all. You can use numpy's r_ to make it easier

import numpy as np # imports numpy and aliases it as "np"
# Now say you want all the rows, and all the columns except 'bcr_patient_barcode'

# clinical_df.iloc[0:, 0:18, 19:20] # this WON'T work
clinical_df.iloc[0:, np.r_[0:18, 19:20] ] # but this does

# numpy's r_ translates slice objects and concatenates them along the first axis
np.r_[0:18, 19:20] # this takes the slice objects and makes an array

# You can then pass that array to iloc like so:
clinical_df.iloc[0:, np.r_[0:18, 19:20] ].head()
# This is just an easy way to quickly wrangle large dataframes by columns if need be

# You can also employ this when reading in files as dataframes using the `usecols` parameter like so
#pd.read_csv("data/clinical.txt", sep=" ", usecols=np.r_[0:18, 19:20])
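
To see what `np.r_` produces on its own, the short sketch below combines two small slices. The variable name `positions` is chosen only for this example.

```python
import numpy as np

# combine two slices into one array of column positions;
# position 3 is skipped because neither slice covers it
positions = np.r_[0:3, 4:6]

print(positions.tolist())  # [0, 1, 2, 4, 5]
```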

## Calculating summary statistics

In [None]:
# calculate basic stats for all records in single column
clinical_df.age_at_diagnosis.describe()

In [None]:
# each metric one at a time (only prints last if all executed in one cell!)
clinical_df.age_at_diagnosis.min()

In [None]:
# convert column from days to years
clinical_df.age_at_diagnosis/365

In [None]:
# convert min from days to years
clinical_df.age_at_diagnosis.min()/365

In [None]:
## Challenge: What type of summary stats do you get for object data?
clinical_df.site_of_resection_or_biopsy.describe()

In [None]:
## Challenge: How would you extract only the standard deviation for days to death?
clinical_df.days_to_death.std()
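
Many columns in the clinical data contain missing values, so it helps to know that summary methods such as `min` skip them by default. The sketch below uses a made-up `Series` of ages in days (`age_days`) with one missing value:

```python
import pandas as pd

# made-up ages in days, with one missing value
age_days = pd.Series([24477.0, 26615.0, None, 28171.0])

# summary methods skip missing values by default
youngest_days = age_days.min()
n_recorded = age_days.count()  # number of non-missing values

# dividing by 365 converts days to (approximate) years
youngest_years = youngest_days / 365

print(n_recorded, youngest_days, round(youngest_years, 1))
```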

## Copying vs referencing objects

In [None]:
# Using the "=" operator creates a second name for the existing object (a reference, not a copy)
ref_clinical_df = clinical_df
ref_clinical_df

In [None]:
# Using the copy() method actually creates another object
true_copy_clinical_df = clinical_df.copy()
true_copy_clinical_df

In [None]:
# Assign the value `0` to the first three rows of data in the DataFrame
ref_clinical_df[0:3] = 0
ref_clinical_df.head()
# note: you probably wouldn't want to actually *do* this to your data!

**Challenge:** How and why are the following three objects different?
_Hint: try applying `head()`_

In [None]:
clinical_df.head() # has been modified because ref_clinical_df referenced it

In [None]:
ref_clinical_df.head() # was actually altered

In [None]:
true_copy_clinical_df.head() # actual copy of original, unaltered
# reinforce that the order of operations matters!
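
The same behavior can be checked on a much smaller object. This is a minimal sketch with made-up names (`original`, `reference`, `duplicate`), using the `is` operator to test whether two names point to the same object:

```python
import pandas as pd

original = pd.DataFrame({"value": [1, 2, 3]})

reference = original         # a second name for the same object
duplicate = original.copy()  # a new, independent object

# change the first row through the reference
reference.loc[0, "value"] = 0

print(original.loc[0, "value"])   # 0: changed along with the reference
print(duplicate.loc[0, "value"])  # 1: the copy is unaffected
print(reference is original)      # True: both names point to one object
```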

## Wrapping up
- review objectives
- preview next week's objectives
- demo of Spyder IDE, if time allows