# Pandas SciProg Workshop

## For reference:

https://github.com/jakevdp/PythonDataScienceHandbook - Data science textbook written entirely as Jupyter (IPython) notebooks; much of the text in this workshop is borrowed from it.

Extremely useful, and completely free. Great for beginners.


## Pandas

Pandas is a newer package built on top of NumPy (which supplies a powerful N-dimensional array object), and it provides an efficient implementation of a ``DataFrame``.
``DataFrame``s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.
While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us.
Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

In [1]:
import pandas as pd # Pandas!
import numpy as np # NumPy!

## The Pandas ``DataFrame``

The fundamental structure in Pandas is the ``DataFrame``. The ``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

In [2]:
data1 = pd.DataFrame({'a': [1,2,3],
                      'b': [2,3,4],
                      'c': [4,5,6]})
data1

Unnamed: 0,a,b,c
0,1,2,4
1,2,3,5
2,3,4,6


In [3]:
data2 = pd.DataFrame([{'a': 1, 'b': 2, 'c': 4},
                      {'a': 2, 'b': 3, 'c': 5},
                      {'a': 3, 'b': 5, 'c': 6}])
data2

Unnamed: 0,a,b,c
0,1,2,4
1,2,3,5
2,3,5,6
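
The two ways of thinking about a ``DataFrame`` can be made concrete with a small sketch. The frame below uses made-up values, and the variable names are only illustrative: the dictionary-like view treats column labels as keys, while the array-like view exposes a two-dimensional block of values with a shape.

```python
import pandas as pd

# Toy frame with illustrative values (not workshop data)
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': [4, 5, 6]})

# Dictionary-like view: column labels behave like keys
column_names = list(df.keys())   # same labels as df.columns
has_b = 'b' in df                # membership tests the column labels

# Array-like view: a 2-D block of values with a shape
as_array = df.values             # plain NumPy array, labels dropped
n_rows, n_cols = df.shape
```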


## The ``Series`` object
Each column of the ``DataFrame`` corresponds to a ``Series`` object.

The ``Series`` object is basically interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.
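
As a minimal sketch of that difference (toy values and labels chosen only for illustration), the same numbers can be reached by position in a NumPy array and by label in a ``Series``:

```python
import numpy as np
import pandas as pd

values = np.array([0.25, 0.5, 0.75])
labeled = pd.Series(values, index=['x', 'y', 'z'])  # explicit index

from_array = values[1]           # NumPy: implicit integer position
by_label = labeled['y']          # Series: explicit label
by_position = labeled.iloc[1]    # Series: position is still available via iloc
```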

In [4]:
data1['a']

0    1
1    2
2    3
Name: a, dtype: int64

In [5]:
data1.a

0    1
1    2
2    3
Name: a, dtype: int64

In [6]:
data1['a'].values

array([1, 2, 3])

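In [7]:
# Relabel a copy of the column; assigning to data1['a'].index directly
# relies on pandas returning the same cached Series on every access
col_a = data1['a'].copy()
col_a.index = ['First', 'Second', 'Third']

In [8]:
col_a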

First     1
Second    2
Third     3
Name: a, dtype: int64

In practice, ``DataFrame``s are usually created from tabular data. In this case, let's read some normalized transcriptomic data from TCGA. The file is compressed with gzip and is tab-delimited.

In [9]:
cancer_data = pd.read_csv('TCGA_sample.tsv.gz', compression='gzip', sep='\t', index_col=0)
cancer_data

Unnamed: 0,PTEN,BRCA1,BRCA2,TP53,cancer_type
TCGA-S9-A7J2-01,10.2998,7.401,4.7902,10.4938,Brain Lower Grade Glioma
TCGA-G3-A3CH-11,11.3833,7.6912,4.7393,9.0602,Liver hepatocellular carcinoma
TCGA-EK-A2RE-01,11.082,9.6598,7.1897,10.1316,Cervical squamous cell carcinoma and endocervi...
TCGA-44-6778-01,10.6484,8.7113,7.4553,10.4169,Lung adenocarcinoma
TCGA-VM-A8C8-01,10.3577,7.131,3.8055,9.6591,Brain Lower Grade Glioma
TCGA-AB-2863-03,11.6441,11.334,11.0614,10.9526,Acute Myeloid Leukemia
TCGA-C8-A1HL-01,11.5515,8.2087,7.3946,10.8971,Breast invasive carcinoma
TCGA-EE-A17X-06,9.4834,9.4479,7.0443,11.1922,Skin Cutaneous Melanoma
TCGA-YB-A89D-11,11.3773,6.6231,5.2721,9.8334,Pancreatic adenocarcinoma
TCGA-05-4420-01,10.1536,8.0003,7.8686,10.1576,Lung adenocarcinoma
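
To experiment with the same ``read_csv`` options without the TCGA file, the sketch below writes a tiny gzip-compressed, tab-delimited file to a temporary directory and reads it back. The file name, sample IDs, and gene names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy table standing in for the real expression file
toy = pd.DataFrame({'GENE_A': [1.5, 2.5], 'GENE_B': [3.0, 4.0]},
                   index=['sample-1', 'sample-2'])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy.tsv.gz')
    # Write tab-delimited text, gzip-compressed
    toy.to_csv(path, sep='\t', compression='gzip')
    # index_col=0 turns the first column (the sample IDs) into the row index
    loaded = pd.read_csv(path, sep='\t', compression='gzip', index_col=0)
```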


## Indexing and Selecting Data

### Indexers: loc, iloc, and ix

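``loc``: Selection by label. Slices include both endpoints.

``iloc``: Selection by position. Zero-based, and slices exclude the stop position, as with Python lists.

``ix``: Label-based, but falls back to position-based if the label doesn't exist. It was deprecated in pandas 0.20 and removed in pandas 1.0, so prefer ``loc`` and ``iloc`` in new code.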

In [10]:
# Examples
# Note: loc slices by label and includes both endpoints,
# whereas iloc slices by zero-based position and excludes the stop
cancer_data.loc['TCGA-S9-A7J2-01':'TCGA-EE-A17X-06', ['PTEN', 'BRCA1']]

Unnamed: 0,PTEN,BRCA1
TCGA-S9-A7J2-01,10.2998,7.401
TCGA-G3-A3CH-11,11.3833,7.6912
TCGA-EK-A2RE-01,11.082,9.6598
TCGA-44-6778-01,10.6484,8.7113
TCGA-VM-A8C8-01,10.3577,7.131
TCGA-AB-2863-03,11.6441,11.334
TCGA-C8-A1HL-01,11.5515,8.2087
TCGA-EE-A17X-06,9.4834,9.4479


In [11]:
cancer_data.iloc[1:10:5, 0:2]

Unnamed: 0,PTEN,BRCA1
TCGA-G3-A3CH-11,11.3833,7.6912
TCGA-C8-A1HL-01,11.5515,8.2087


In [12]:
# ix was removed in pandas 1.0; select rows by position, then columns by label
cancer_data.iloc[3:5].loc[:, 'PTEN':'BRCA2']

Unnamed: 0,PTEN,BRCA1,BRCA2
TCGA-44-6778-01,10.6484,8.7113,7.4553
TCGA-VM-A8C8-01,10.3577,7.131,3.8055
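
The endpoint rules of the indexers are easy to mix up, so here is a small sketch on a toy frame (row labels and values are made up). It also shows one way to combine positional rows with labeled columns without ``ix``, which no longer exists in pandas 1.0 and later.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]},
                  index=['r0', 'r1', 'r2', 'r3'])

# loc: label-based, both endpoints of the slice are included
by_label = df.loc['r1':'r2', 'x']

# iloc: position-based, the stop position is excluded
by_position = df.iloc[1:2, 0]

# Positional rows, then labeled columns, chained instead of using ix
mixed = df.iloc[1:3].loc[:, 'x']
```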


Similar to ``NumPy`` arrays, vector calculations can be done on ``Series`` objects. Try finding the sum of gene expression values for the ``BRCA1`` and ``BRCA2`` genes for each patient.

In [13]:
cancer_data['BRCA1'] + cancer_data['BRCA2']

TCGA-S9-A7J2-01    12.1912
TCGA-G3-A3CH-11    12.4305
TCGA-EK-A2RE-01    16.8495
TCGA-44-6778-01    16.1666
TCGA-VM-A8C8-01    10.9365
TCGA-AB-2863-03    22.3954
TCGA-C8-A1HL-01    15.6033
TCGA-EE-A17X-06    16.4922
TCGA-YB-A89D-11    11.8952
TCGA-05-4420-01    15.8689
dtype: float64

## Your turn

1) Try to find out whether the first half of patients have greater PTEN expression or the last half (Hint: np.sum()).

2) Despite it not making sense to do so, try adding expression values of the BRCA genes for every other patient.

3) Other suggestions?

In [21]:
# 1
# iloc slices exclude the stop position, so 0:5 and 5:10 split the ten patients evenly
first_half = cancer_data.iloc[0:5]['PTEN']
second_half = cancer_data.iloc[5:10]['PTEN']
np.sum(first_half) > np.sum(second_half)

False

In [15]:
# 2
cancer_data.loc[::2, 'BRCA1'] + cancer_data.loc[::2, 'BRCA2']

TCGA-S9-A7J2-01    12.1912
TCGA-EK-A2RE-01    16.8495
TCGA-VM-A8C8-01    10.9365
TCGA-C8-A1HL-01    15.6033
TCGA-YB-A89D-11    11.8952
dtype: float64

## Masking

You can use an array of ``True`` and ``False`` values to mask out the rows that you do not want to see. For example, what if we're only interested in patients whose ``BRCA1`` expression exceeds 8?

In [23]:
cancer_data['BRCA1'] > 8

TCGA-S9-A7J2-01    False
TCGA-G3-A3CH-11    False
TCGA-EK-A2RE-01     True
TCGA-44-6778-01     True
TCGA-VM-A8C8-01    False
TCGA-AB-2863-03     True
TCGA-C8-A1HL-01     True
TCGA-EE-A17X-06     True
TCGA-YB-A89D-11    False
TCGA-05-4420-01     True
Name: BRCA1, dtype: bool

In [24]:
cancer_data[cancer_data['BRCA1'] > 8]

Unnamed: 0,PTEN,BRCA1,BRCA2,TP53,cancer_type
TCGA-EK-A2RE-01,11.082,9.6598,7.1897,10.1316,Cervical squamous cell carcinoma and endocervi...
TCGA-44-6778-01,10.6484,8.7113,7.4553,10.4169,Lung adenocarcinoma
TCGA-AB-2863-03,11.6441,11.334,11.0614,10.9526,Acute Myeloid Leukemia
TCGA-C8-A1HL-01,11.5515,8.2087,7.3946,10.8971,Breast invasive carcinoma
TCGA-EE-A17X-06,9.4834,9.4479,7.0443,11.1922,Skin Cutaneous Melanoma
TCGA-05-4420-01,10.1536,8.0003,7.8686,10.1576,Lung adenocarcinoma
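
Masks can also be combined. The sketch below uses a toy frame with hypothetical gene columns ``g1`` and ``g2``; note that pandas expects the element-wise operators ``&``, ``|``, and ``~`` rather than Python's ``and``, ``or``, and ``not``, and that each comparison needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({'g1': [7.0, 9.0, 8.5, 6.0],
                   'g2': [5.0, 4.0, 7.5, 8.0]},
                  index=['p1', 'p2', 'p3', 'p4'])

# Rows where both conditions hold
both_high = df[(df['g1'] > 8) & (df['g2'] > 7)]

# Rows where at least one condition holds
either_high = df[(df['g1'] > 8) | (df['g2'] > 7)]

# Rows where the first condition does not hold
not_high_g1 = df[~(df['g1'] > 8)]
```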


### Most NumPy functions can be used on Pandas Series, including ufuncs (vectorized functions)
https://docs.scipy.org/doc/numpy-1.13.0/reference/ufuncs.html
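
A short sketch of what this looks like in practice, using made-up values: applying a ufunc to a ``Series`` returns a new ``Series`` that keeps the original index.

```python
import numpy as np
import pandas as pd

expr = pd.Series([1.0, 4.0, 9.0], index=['s1', 's2', 's3'])

rooted = np.sqrt(expr)           # element-wise square root, index preserved
doubled = np.multiply(expr, 2)   # same result as expr * 2
```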

## Using DataFrame.apply()
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

Let's find the mean expression for each gene.

For the programmers: avoid explicit for-loops here; vectorized operations are usually both shorter and faster.

http://stupidpythonideas.blogspot.ca/2015/09/going-faster-with-numpy.html

In [16]:
numerical_cancer = cancer_data.loc[:, 'PTEN': 'TP53']

In [17]:
numerical_cancer.apply(np.mean, axis=0)

PTEN     10.79811
BRCA1     8.42083
BRCA2     6.66210
TP53     10.27945
dtype: float64

What about the sum of gene expression across each patient?

In [18]:
numerical_cancer.apply(np.sum, axis=1)

TCGA-S9-A7J2-01    32.9848
TCGA-G3-A3CH-11    32.8740
TCGA-EK-A2RE-01    38.0631
TCGA-44-6778-01    37.2319
TCGA-VM-A8C8-01    30.9533
TCGA-AB-2863-03    44.9921
TCGA-C8-A1HL-01    38.0519
TCGA-EE-A17X-06    37.1678
TCGA-YB-A89D-11    33.1059
TCGA-05-4420-01    36.1801
dtype: float64

We can also apply any arbitrary function by defining it with a Python `lambda` expression:

In [20]:
# Lambda example:
numerical_cancer.apply(lambda col: np.sum(col) + np.mean(col), axis=0)

PTEN     118.77921
BRCA1     92.62913
BRCA2     73.28310
TP53     113.07395
dtype: float64
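
For common reductions such as the mean or the sum, a ``DataFrame`` also has built-in methods that take the same ``axis`` argument. The sketch below, on a toy frame with hypothetical columns, checks that they agree with the ``apply`` calls; ``apply`` remains the tool of choice for custom functions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'g1': [1.0, 2.0, 3.0],
                   'g2': [4.0, 5.0, 6.0]},
                  index=['p1', 'p2', 'p3'])

# axis=0 reduces down each column (one value per gene)
col_means_apply = df.apply(np.mean, axis=0)
col_means_method = df.mean(axis=0)

# axis=1 reduces across each row (one value per patient)
row_sums_apply = df.apply(np.sum, axis=1)
row_sums_method = df.sum(axis=1)
```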

## Challenge task

Using the ``Pandas`` and ``NumPy`` documentation, attempt the following challenge. In this hypothetical scenario, we want to find the patients whose ``BRCA1`` and ``BRCA2`` expression values are both greater than the mean expression of the respective gene. Then, find the standard deviation of ``PTEN`` expression for those patients.