# Pandas SciProg Workshop

## For reference:

https://github.com/jakevdp/PythonDataScienceHandbook - Data science textbook written entirely as Jupyter notebooks; much of the text below is borrowed from it.

Extremely useful, and completely free. Great for beginners.


## Pandas

Pandas is a newer package built on top of NumPy (the library that provides a powerful N-dimensional array object), and provides an efficient implementation of a ``DataFrame``.
``DataFrame``s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.
While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us.
Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

In [None]:
import pandas as pd # Pandas!
import numpy as np # NumPy!

## The Pandas ``DataFrame``

The fundamental structure in Pandas is the ``DataFrame``. The ``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

In [None]:
data1 = pd.DataFrame({'a': [1,2,3],
                      'b': [2,3,4],
                      'c': })# Fill in here
# What happens if there are more than 3 elements in c?
data1
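
One way to explore the question in the cell above is sketched below. The values used for ``'c'`` are placeholders chosen for illustration, not the workshop's intended answer. The sketch assumes that pandas rejects dictionary columns of unequal length with a ``ValueError``.

```python
import pandas as pd

# Columns of equal length build a DataFrame without complaint.
ok = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': [3, 4, 5]})  # placeholder values

# A column with a different length cannot be lined up with the others.
try:
    pd.DataFrame({'a': [1, 2, 3],
                  'b': [2, 3, 4],
                  'c': [3, 4, 5, 6]})  # one element too many
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(ok.shape)
print(type(mismatch_error).__name__)
```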

In [None]:
data2 = pd.DataFrame([{'a': 1, 'b': 2, 'c': 4},
                      {'a': 2, 'b': 3, 'c': 5},
                      {}]) # Fill in here
# What happens if there are more than 3 elements in the last row above?
data2
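
When a ``DataFrame`` is built from a list of dictionaries, rows do not need to share the same keys. The sketch below uses a made-up third row (the key ``'d'`` is hypothetical) to show how missing and extra keys are handled: absent entries are filled with ``NaN``, and a new key becomes a new column.

```python
import pandas as pd

rows = [{'a': 1, 'b': 2, 'c': 4},
        {'a': 2, 'b': 3, 'c': 5},
        {'a': 3, 'd': 9}]  # hypothetical row: no 'b' or 'c', extra key 'd'

frame = pd.DataFrame(rows)

# Keys missing from a row become NaN; the extra key adds a column.
print(frame)
print(frame.isna().sum())
```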

## The ``Series`` object
Each column of the ``DataFrame`` corresponds to a ``Series`` object.

The ``Series`` object is basically interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.
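
As a small illustration of the implicit versus explicit index, the sketch below wraps a NumPy array in a ``Series`` with made-up labels (``'x'``, ``'y'``, ``'z'`` are assumptions for this example only).

```python
import numpy as np
import pandas as pd

values = np.array([0.25, 0.5, 0.75])
labeled = pd.Series(values, index=['x', 'y', 'z'])  # hypothetical labels

by_position_numpy = values[1]        # NumPy: access by position only
by_label = labeled['y']              # Series: access by explicit label
by_position_series = labeled.iloc[1] # Series: positional access still works

print(by_position_numpy, by_label, by_position_series)
```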

In [None]:
data1['a']

In [None]:
data1.a

In [None]:
data1['a'].values

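In [None]:
# Relabel a copy of the column: an index assigned directly to data1['a']
# may not carry over to data1 in recent pandas versions.
a = data1['a'].copy()
a.index = ['First', 'Second', 'Third']

In [None]:
a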

In practice, ``DataFrame``s are usually created from tabulated data. Here, let's read some normalized transcriptomic data from TCGA. Each file is compressed using gzip and is tab-delimited.

In [None]:
cancer_data = pd.read_csv('TCGA_sample.tsv.gz', compression='gzip', sep='\t', index_col=0)
cancer_data
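
If the TCGA file is not at hand, the same ``read_csv`` call can be tried on a tiny stand-in. The sketch below writes a made-up gzipped, tab-delimited file (the file name, patient IDs, and values are all hypothetical) to a temporary directory and reads it back with the same arguments.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the TCGA file: two patients, two genes.
text = "patient\tPTEN\tBRCA1\nP1\t9.1\t7.5\nP2\t8.4\t8.2\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_sample.tsv.gz')
    with gzip.open(path, 'wt') as handle:
        handle.write(text)
    # index_col=0 turns the first column (patient IDs) into the row labels.
    toy = pd.read_csv(path, compression='gzip', sep='\t', index_col=0)

print(toy)
```

The default ``compression='infer'`` would also pick up gzip from the ``.gz`` extension, so the explicit argument mainly documents intent.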

## Indexing and Selecting Data

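### Indexers: loc, iloc, and ix

``loc``: Selection by label. Slices include both endpoints.

``iloc``: Selection by integer position, 0-based with the stop excluded (like Python lists and NumPy arrays).

``ix``: Label-based, falling back to position-based if the label doesn't exist. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use ``loc`` or ``iloc`` instead.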

In [None]:
# Examples
# Note: loc selects by label and includes both ends of a slice,
# whereas iloc selects by 0-based position and excludes the stop
cancer_data.loc['TCGA-S9-A7J2-01':'TCGA-EE-A17X-06', ['PTEN', 'BRCA1']]

In [None]:
cancer_data.iloc[1:10:5, 0:2]

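In [None]:
# .ix was removed in pandas 1.0; chain iloc (rows by position) and loc (columns by label) instead
cancer_data.iloc[3:5].loc[:, 'PTEN':'BRCA2']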
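
The difference in slice endpoints between label-based and position-based selection is easy to check on a toy table. The sketch below uses made-up patient labels and expression values, and selects the same block three ways.

```python
import pandas as pd

# Hypothetical expression table: four patients, three genes.
toy = pd.DataFrame({'PTEN': [9.1, 8.4, 7.9, 8.8],
                    'BRCA1': [7.5, 8.2, 8.9, 6.7],
                    'BRCA2': [6.1, 6.4, 7.0, 5.9]},
                   index=['P1', 'P2', 'P3', 'P4'])

by_label = toy.loc['P2':'P3', 'PTEN':'BRCA1']  # both endpoints included
by_position = toy.iloc[1:3, 0:2]               # stop position excluded
mixed = toy.iloc[1:3].loc[:, 'PTEN':'BRCA1']   # rows by position, columns by label

print(by_label.equals(by_position))
print(by_label.equals(mixed))
```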

Similar to ``NumPy`` arrays, vector calculations can be done on ``Series`` objects. Try finding the sum of gene expression values for the ``BRCA1`` and ``BRCA2`` genes for each patient.

In [None]:
cancer_data['BRCA1'] + cancer_data['BRCA2']
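
One difference from plain NumPy arithmetic is worth keeping in mind: ``Series`` operations align values by index label, not by position. The sketch below uses two made-up ``Series`` whose labels are ordered differently and only partly overlap.

```python
import pandas as pd

# Hypothetical values; brca2 lists patients in a different order
# and contains P4 where brca1 contains P3.
brca1 = pd.Series([7.5, 8.2, 8.9], index=['P1', 'P2', 'P3'])
brca2 = pd.Series([6.4, 6.1, 7.0], index=['P2', 'P1', 'P4'])

# Values are matched by label; labels present in only one Series give NaN.
total = brca1 + brca2
print(total)
```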

## Your turn

1) Try to find out whether the first half of patients have greater PTEN expression or the last half (Hint: np.sum()).

2) Despite it not making sense to do so, try adding expression values of the BRCA genes for every other patient.

3) Other suggestions?

In [None]:
# 1

In [None]:
# 2

## Masking

You can use a Boolean array (``True``/``False`` values) to mask out rows that you do not want to see. For example, what if we're only interested in patients whose ``BRCA1`` expression exceeds 8?

In [None]:
cancer_data['BRCA1'] > 8

In [None]:
cancer_data[cancer_data['BRCA1'] > 8]
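
Masks can also be combined. The sketch below, on a made-up table, joins two conditions with ``&`` and inverts the result with ``~``. Each comparison needs its own parentheses because ``&`` binds more tightly than ``>``.

```python
import pandas as pd

# Hypothetical expression values for four patients.
toy = pd.DataFrame({'BRCA1': [7.5, 8.2, 8.9, 6.7],
                    'PTEN': [9.1, 8.4, 7.9, 8.8]},
                   index=['P1', 'P2', 'P3', 'P4'])

# Element-wise AND of two Boolean Series.
mask = (toy['BRCA1'] > 8) & (toy['PTEN'] > 8)

selected = toy[mask]   # rows where both conditions hold
inverted = toy[~mask]  # all remaining rows

print(selected)
print(inverted)
```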

### Most NumPy functions can be used on Pandas Series, including ufuncs (vectorized functions)
https://docs.scipy.org/doc/numpy-1.13.0/reference/ufuncs.html
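
As a brief illustration, the sketch below applies the ufunc ``np.log2`` to a made-up ``Series``. The result is again a ``Series`` and keeps the original index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical expression values.
expr = pd.Series([1.0, 2.0, 4.0], index=['P1', 'P2', 'P3'])

logged = np.log2(expr)  # element-wise; the index is preserved
print(logged)
```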

## Using DataFrame.apply()
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

Let's find the mean expression for each gene.

For the programmers: avoid explicit for-loops here; vectorized operations are usually much faster.

http://stupidpythonideas.blogspot.ca/2015/09/going-faster-with-numpy.html

In [None]:
numerical_cancer = cancer_data.loc[:, 'PTEN': 'TP53']

In [None]:
numerical_cancer.apply(np.mean, axis=0)

What about the sum of gene expression across each patient?

In [None]:
numerical_cancer.apply(np.sum, axis=1)

We can also apply an arbitrary function, defined inline with a Python ``lambda``.

In [None]:
# Lambda example:
numerical_cancer.apply(lambda col: np.sum(col) + np.mean(col), axis=0)
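
For common reductions, ``DataFrame`` also has built-in methods that give the same results as the ``apply`` calls above. The sketch below compares them on a made-up table (the values are chosen only to make the arithmetic easy to follow).

```python
import numpy as np
import pandas as pd

# Hypothetical expression table: three patients, two genes.
toy = pd.DataFrame({'PTEN': [9.0, 8.0, 7.0],
                    'BRCA1': [7.0, 8.0, 9.0]},
                   index=['P1', 'P2', 'P3'])

via_apply = toy.apply(np.mean, axis=0)  # mean per gene with apply
via_method = toy.mean(axis=0)           # same result with the built-in method
per_patient = toy.sum(axis=1)           # total per patient

# Equivalent to apply(lambda col: np.sum(col) + np.mean(col), axis=0).
combined = toy.sum(axis=0) + toy.mean(axis=0)

print(via_apply.equals(via_method))
print(per_patient)
print(combined)
```

The built-in methods are usually the shorter choice, while ``apply`` remains useful when no ready-made method fits.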

## Challenge task

Using the ``Pandas`` and ``NumPy`` documentation, attempt this challenge. In this hypothetical scenario, we want to find the patients whose ``BRCA1`` and ``BRCA2`` expression are both greater than the mean expression of the respective gene. Then find the standard deviation of ``PTEN`` expression for those patients.