# Introduction to Python, Class 2: Starting with data

## Objectives

In the last class,
we learned the basics of Python syntax and Jupyter notebooks,
and examined common data types, data structures, and programming structures for Python.

By the end of this lesson, you should be able to:

- load packages and spreadsheet-style data using Python
- extract columns, rows, and portions thereof from datasets
- calculate summary statistics
- understand the difference between referencing and copying a variable

## Using packages

Open your Jupyter notebook file browser,
navigate to your project directory,
and create a new Python notebook called `class2`
with an appropriate title in the first cell in Markdown formatting.

We'll first need to load additional packages
(collections of related functions)
so the functions we'll need are available for use:

In [1]:
# make packages available to use in this notebook
import os
import urllib.request
import pandas as pd 

The packages we're using today include:

- [`os`](https://docs.python.org/3/library/os.html): to create a `data` directory
- [`urllib`](https://docs.python.org/3/library/urllib.html): for downloading files
- [`pandas`](https://pandas.pydata.org): for data manipulation and analysis

For the last package,
`pd` is defined as an alias, or shortcut,
that we'll use to specify we're calling a function from that package.
For the rest of this lesson, we'll preface each function with the name (or alias) of the package from which it was loaded.
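
As a minimal sketch of how these prefixes work (the path below is only an illustrative string, and nothing is read from disk):

```python
import os
import pandas as pd

# os was imported without an alias, so its full name is the prefix
file_name = os.path.basename("data/clinical.csv")
print(file_name)

# pd is just a shorter name for the pandas package
print(pd.__name__)
```

The prefix before the dot tells Python (and anyone reading your code) which package a function belongs to.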

## Importing data

Before we can download our data,
we should create a new directory to contain it:

In [2]:
# create data directory
os.mkdir("data")
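
Note that `os.mkdir` raises an error if the directory already exists, which can happen if you re-run the cell above. One possible alternative, sketched here rather than required for the lesson, is `os.makedirs` with `exist_ok=True`:

```python
import os

# exist_ok=True means nothing happens if the directory is already there,
# so this cell can be run more than once without an error
os.makedirs("data", exist_ok=True)
os.makedirs("data", exist_ok=True)  # second call is harmless

# confirm the directory exists
print(os.path.isdir("data"))
```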

Then we can use a function from the `urllib` package to download the data file:

In [None]:
# download dataset
urllib.request.urlretrieve("https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.csv", "data/clinical.csv")

The first argument (string inside quotation marks)
represents the URL from which the data is being downloaded.
The second argument ("data/clinical.csv") indicates where the data will be saved.

> If the above code doesn't work, you can download the data
as a [zip file](https://www.dropbox.com/s/k639bkse64r0bfz/data.zip),
manually unzip it, and move the resulting folder to your project's data directory.

Notice that the URL above ends in "clinical.csv", 
which is also the name we used to save the file on our computers.
If you click on the URL and view it in a web browser, the format isn’t particularly easy for us to understand. 
The data we’ve downloaded are in csv format, which stands for “comma separated values.” This means the data are organized into rows and columns, with columns separated by commas.

These data are arranged in a tidy format, meaning each row represents an observation, and each column represents a variable (piece of data for each observation). Moreover, only one piece of data is entered in each cell.
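
To make the format concrete, here is a small sketch using a made-up, three-column CSV held in a string. The column names and values are hypothetical stand-ins, not taken from the real file, and `pd.read_csv` is the same function we use below to import the downloaded data:

```python
import io
import pandas as pd

# a tiny, made-up CSV: the first line holds the column headers,
# and each later line is one observation (row)
csv_text = """patient,stage,age_days
p1,stage i,24477
p2,stage ii,26615
"""

# io.StringIO lets read_csv treat the string as if it were a file
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)

# one column per variable, one row per observation
print(list(toy_df.columns))
print(len(toy_df))
```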

These data are clinical cancer data from the National Cancer Institute’s Genomic Data Commons, specifically from The Cancer Genome Atlas, or TCGA.
Each row represents a patient, and each column represents information about demographics (race, age at diagnosis, etc.) and disease (e.g., cancer type).
The data were downloaded and aggregated using a script included in the 
[Introduction to R course](https://github.com/fredhutchio/R_intro).

We can import these data and assign them to a variable:

In [None]:
# assign data to variable
clinical_df = pd.read_csv("data/clinical.csv")

The command executed successfully, 
but we still need to ensure the data have been imported correctly.

There are a few ways we can inspect the data.
First, we can preview the data:

In [None]:
# preview first few rows of the data
clinical_df.head()

The `head` function by default shows the column headers,
along with the first five rows of data.
You can specify a different number of rows by placing that number inside the parentheses,
demonstrated below using `tail`,
which shows the last few rows:

In [None]:
# print last eight rows of data to screen
clinical_df.tail(8) # print bottom n rows

**Challenge:** Download, import, and inspect the following data files. 
The URL for each sample dataset is included along with a name to assign to the variable. 
(Hint: you can use the same function as above, but may need to update the `sep =` argument)

- URL: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.tsv, object name: example1
- URL: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.txt, object name: example2

Importing data can be tricky and frustrating. 
However, if you can’t get your data into Python, 
you can’t do anything to analyze or visualize it. 
It’s worth understanding how to do it effectively to save you time and energy later.

Now that we have data imported and available, 
we can print a summary of all column names, number of entries, data types, and non-null values:

In [None]:
# print summary
clinical_df.info() 

The output above highlights another of the key features of `pandas`:
it interprets data in ways that make it easier to analyze.

The description at the top of this output,
`pandas.core.frame.DataFrame`,
describes the data structure as a data frame,
which is how `pandas` interprets spreadsheet style data.
Directly below that line,
we see a note that there are 6832 observations (rows, or in our case, patients or cases),
as well as 20 columns.
A summary of the data type for each column is below.

In the last lesson,
we discussed data types built into Python.
`pandas` features the following data types,
some of which appear in our data:

- `object` data in `pandas` represents string (character) data in native Python
- `float64` is still float data (the `64` indicates that each value is stored using 64 bits of memory)
- `int64` in `pandas` isn't represented in our data, but refers to integer data 
- `datetime64` from `pandas` also isn't shown here, but refers to a specific format to make working with date and time data easier. 

> To create a list in a markdown cell, 
use an asterisk (`*`) or dash (`-`) followed by a space;
these will be rendered as bullet points when you execute the cell.
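
As a rough illustration of the numeric and date types listed above, the sketch below builds a small data frame by hand. The column names and values are hypothetical, chosen only so that each type shows up:

```python
import pandas as pd

# made-up values for illustration only
toy_df = pd.DataFrame({
    "age_days": [24477.0, 26615.0],  # decimal numbers
    "n_visits": [2, 5],              # whole numbers
})

# convert text into the pandas date and time format
toy_df["visit_date"] = pd.to_datetime(["2020-01-15", "2020-02-20"])

# list the data type of every column
print(toy_df.dtypes)
```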

## Accessing columns and rows

A common task in data analysis is to extract particular columns or rows,
often referred to as subsetting.
This section explores a few different ways to access these parts of our spreadsheet.

First, we can subset a single column using its name (column header):

In [None]:
# show only the first few rows of one column
clinical_df["tumor_stage"].head()

The square brackets above are a common subsetting syntax in Python.
The quotation marks around the column name are necessary for Python to interpret it as a column,
rather than a variable name.
We've added `.head()` to the end so we only preview the first few rows, rather than the entire column.

Similarly, we can assess the data type of a specific column:

In [None]:
# show data type for a column
clinical_df["tumor_stage"].dtype 

The output, `O`, 
indicates these data are object (character) type.

One of the shortcuts afforded by `pandas` is the ability to treat the column names as attributes,
which means you can access them using the `.` syntax:

In [None]:
# access columns by name using dot syntax
clinical_df.tumor_stage.head()

Here, we've also used `.head()` to minimize the amount of data printed to the screen.
If you were assigning data to a new variable name, 
you would likely be using the whole column instead.
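
The sketch below compares the two ways of accessing a column on a small, made-up data frame (the values are hypothetical, and the column `vital status` is deliberately named with a space to show one limit of the dot syntax):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "tumor_stage": ["stage i", "stage ii", "stage i"],
    "vital status": ["alive", "dead", "alive"],  # note the space in this name
})

# both syntaxes return the same column
bracket_col = toy_df["tumor_stage"]
dot_col = toy_df.tumor_stage
print(bracket_col.equals(dot_col))

# a name containing a space is not a valid attribute name,
# so only the bracket syntax can reach this column
print(toy_df["vital status"].head())
```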

If you need to extract multiple columns, 
you'll need to adjust the syntax slightly:

In [None]:
# Select two columns at once
clinical_df[["tumor_stage", "vital_status"]].head()

In this case, we can't use the dot syntax to access columns.
The double square brackets aren't a special operator:
the outer pair subsets the data frame,
and the inner pair creates a list of the column names to extract.
In general, the dot syntax means you are accessing a part of the thing (generally a variable)
that comes before the dot.
In the case of a data frame, those parts include its columns.
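
One way to see what the two sets of brackets are doing is to store the list of column names in its own variable first. This is a sketch on a made-up data frame; `columns_to_keep` is a hypothetical variable name:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "tumor_stage": ["stage i", "stage ii"],
    "vital_status": ["alive", "dead"],
    "age_days": [24477.0, 26615.0],
})

# the inner brackets are an ordinary Python list of column names
columns_to_keep = ["tumor_stage", "vital_status"]

# the outer brackets subset the data frame using that list
by_variable = toy_df[columns_to_keep]
by_literal = toy_df[["tumor_stage", "vital_status"]]

# both approaches give the same result
print(by_variable.equals(by_literal))
```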

**Challenge:** does the order of the columns you list matter?

We can also extract rows from a data frame:

In [None]:
# access three rows 
clinical_df[0:3]

In the output above, 
we see three rows (index positions 0, 1, 2).
This type of subsetting is noninclusive of the endpoint,
meaning the row at index 3 is not selected from a range of `0:3`.
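
The same behavior can be checked on a small, made-up data frame where it's easy to count the rows by eye:

```python
import pandas as pd

# five hypothetical patients, at index positions 0 through 4
toy_df = pd.DataFrame({"patient": ["a", "b", "c", "d", "e"]})

# rows at positions 0, 1, and 2; the row at position 3 is left out
first_three = toy_df[0:3]
print(first_three)
print(len(first_three))
```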

We can also select a range including the end of the data frame by leaving the field after the colon empty:

In [None]:
# access the second row to the end of the data frame
clinical_df[1:].tail()

Again, we've used `tail` to show only the end of the data frame.

We can perform a similar operation to `tail` using an index value that extracts only the last row:

In [None]:
# access the last row in the data frame
clinical_df[-1:] 

**Challenge:** how would you extract the last 10 rows of the dataset?

## Slicing subsets of rows and columns

Now that we have a basic understanding of accessing whole rows and columns, 
we are ready to discuss slicing
(extracting portions of rows and columns).

There are multiple ways to slice a data frame. 
We'll begin by exploring `iloc`, 
which uses integer indexing.
This means we'll reference rows and columns by their index position:

In [None]:
# access one data element from a single cell
clinical_df.iloc[2, 1]

We can check one of our previews of the data above to see that this does represent the data in that cell.

As with subsetting described in the previous section,
we can also extract ranges of cells:

In [None]:
# select range of data
clinical_df.iloc[0:3, 1:4]

As described earlier with subsetting using ranges of index values,
we can see in the output above that the start bound of each range is included, but the stop bound is not.
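
Here is a small sketch of the same slice on a made-up data frame with four rows and five columns (the single-letter column names are hypothetical), which makes it easier to see exactly which positions are returned:

```python
import pandas as pd

# four rows and five columns, named a through e
toy_df = pd.DataFrame({name: [0, 1, 2, 3] for name in ["a", "b", "c", "d", "e"]})

# rows at positions 0-2 and columns at positions 1-3
block = toy_df.iloc[0:3, 1:4]
print(block)

# row 3 and column 4 ("e") are not part of the result
print(list(block.index))
print(list(block.columns))
```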

We can also include an empty start or stop bound to indicate the beginning or end of the data frame, respectively:

In [None]:
# empty stop boundary to indicate end of data
clinical_df.iloc[:2, 18:]

Now we'll move on and explore the second method for extracting slices,
using `loc`, which relies on label indexing.
The tricky part with our data is that the row labels are actually also the index values.
This means that when we extract a range of rows,
we can still reference those values:

In [None]:
# slicing using loc
clinical_df.loc[1:4]

Here you can note one of the major differences between `iloc` and `loc`:
the latter includes both the start and stop bounds.
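
The sketch below puts the two methods side by side on made-up data. The second data frame uses hypothetical text labels (`p1`, `p2`, `p3`) for its rows, to show that `loc` works with labels rather than positions:

```python
import pandas as pd

toy_df = pd.DataFrame({"patient": ["a", "b", "c", "d", "e", "f"]})

# iloc uses positions: the row at position 4 is left out
by_position = toy_df.iloc[1:4]
print(len(by_position))

# loc uses labels: the row labeled 4 is included
by_label = toy_df.loc[1:4]
print(len(by_label))

# with text labels, loc still includes both ends of the range
labeled_df = pd.DataFrame({"age_days": [3650.0, 7300.0, 10950.0]},
                          index=["p1", "p2", "p3"])
print(labeled_df.loc["p1":"p2"])
```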

We can still use empty bounds:

In [None]:
# empty stop boundary to indicate end of data
clinical_df.loc[6830: ]

We can also select all columns for a specific set of rows by adding the row labels as a list:

In [None]:
# Select all columns for rows of index values specified
clinical_df.loc[[0, 10, 6831], ]

Finally, we can use the column labels for extraction:

In [None]:
# select first row for specified columns
clinical_df.loc[0, ["primary_diagnosis", "tumor_stage", "age_at_diagnosis"]]

**Challenge:** why doesn't the following code work?

`clinical_df.loc[2, 6]`

**Challenge:** how would you extract the last 100 rows for only vital status and days to death?

So far, we've been printing the output from our subsetting and slicing to the screen
(often using `.head()`).
If you'd like to use these data for another purpose,
you may want to assign them to a new variable for further manipulation
(but see the section below comparing referencing and copying!).

## Calculating summary statistics

Once you've extracted your data of interest, 
you will likely want to be able to assess basic statistical features of the data.

Data frames allow you to assess these features:

In [None]:
# calculate basic stats for a single column
clinical_df.age_at_diagnosis.describe()

In this case, 
we've assessed a collection of summary statistics for the column "age at diagnosis" 
using the `.describe()` function.

You can access the statistics listed above individually as well:

In [None]:
# calculate only the minimum for age at diagnosis
clinical_df.age_at_diagnosis.min()
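
As a sketch of how the two approaches relate, the example below uses a made-up column of three ages (in days) and saves the output of `.describe()` to a hypothetical variable named `summary`:

```python
import pandas as pd

# three made-up ages, in days
toy_df = pd.DataFrame({"age_days": [3650.0, 7300.0, 10950.0]})

# the full set of summary statistics
summary = toy_df.age_days.describe()
print(summary)

# a single statistic, pulled out of the summary by name
print(summary["min"])

# the same statistic, calculated directly
print(toy_df.age_days.min())
```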

We can also use our ability to access columns to perform mathematical operations,
such as unit conversion:

In [None]:
# convert age column from days to years
clinical_df.age_at_diagnosis.head()/365

We can also perform a conversion on a summary statistic:

In [None]:
# convert minimum age at diagnosis to years
clinical_df.age_at_diagnosis.min()/365
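
Note that these calculations only print a result; the column itself keeps its original values unless you assign the result to a name. A minimal sketch with made-up ages (`age_years` is a hypothetical variable name):

```python
import pandas as pd

# two made-up ages, in days
toy_df = pd.DataFrame({"age_days": [3650.0, 7300.0]})

# the division is applied to every value in the column
age_years = toy_df.age_days / 365
print(age_years)

# the original column is still in days
print(toy_df.age_days)
```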

**Challenge:** What type of summary statistics do you get for object data?

**Challenge:** How would you extract only the standard deviation for days to death?

## Copying vs referencing

In this final section, 
we'll take a look at the difference between copying and referencing objects (variables).

It's often desirable to create a new variable that you can then use for data filtering.
It is possible to reference another variable using the `=` assignment operator:

In [None]:
# reference another object
ref_clinical_df = clinical_df
ref_clinical_df.head()

If you inspect both objects,
you'll see they're identical.

Next, we'll use the `copy` method to create a new object:

In [None]:
# create another object using copy
true_copy_clinical_df = clinical_df.copy()
true_copy_clinical_df.head()

Now we have a few objects to compare.

We'll assess how these objects change by making a clear, obvious change in the referenced data frame.
This isn't something you'd necessarily want to do in your own data,
but is a way for us to quickly see what happens to the other objects:

In [None]:
# assign the value `0` to the first three rows of data
ref_clinical_df[0:3] = 0
ref_clinical_df.head()

**Challenge:** What has happened to each of our three objects?

- `true_copy_clinical_df`, created from the original data frame using `copy`
- `ref_clinical_df`, for which the first three rows have been changed to 0
- `clinical_df`, which is the original data frame

Remember, referencing objects does not protect the original object from modification,
because the `=` creates another name for the same object.
Using the `copy` method allows you to retain the original object in its unaltered state.
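
The same idea can be sketched on a small, made-up data frame, where the effect of a change is easy to follow. The variable names below are hypothetical, and `is` is a Python operator that checks whether two names point to the very same object:

```python
import pandas as pd

toy_df = pd.DataFrame({"age_days": [3650.0, 7300.0]})

ref_df = toy_df          # a second name for the same object
copy_df = toy_df.copy()  # a separate object with the same contents

print(ref_df is toy_df)   # same object
print(copy_df is toy_df)  # different object

# change one value through the reference
ref_df.loc[0, "age_days"] = 0

# the original changes too, but the copy does not
print(toy_df.loc[0, "age_days"])
print(copy_df.loc[0, "age_days"])
```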

## Wrapping up

Today, we imported spreadsheet-style data into Python,
learned to inspect and subset the resulting data frame,
and practiced calculating summary statistics and copying or referencing objects (variables).

Next time, we'll work on additional types of data manipulation,
including extracting data that meet particular criteria
and dealing with missing data.