# Introduction to Pandas : Part 1
-------
This tutorial is heavily based on [Pandas in 10 min](https://pandas.pydata.org/pandas-docs/stable/10min.html). The original material was modified to use TnSeq data as examples.


## Get datasets to play with

In [None]:
%%bash
wget https://nekrut.github.io/BMMB554/tnseq_untreated.txt.gz
wget https://nekrut.github.io/BMMB554/ta_gc.txt

In [None]:
data_file = 'tnseq_untreated.txt.gz'

The first dataset lists coordinates of `TA` sites and counts of reads for TnSeq constructs 'blunt', 'cap', 'dual', 'erm', 'pen', and 'tuf':

In [None]:
!gunzip -c {data_file} | head

In [None]:
# There are just two choices for the beginning of the gene field
!gunzip -c {data_file} | cut -f 8 | cut -f 1 -d '=' | sort | uniq -c

In [None]:
# Process tnseq_untreated.txt.gz to correctly parse gene names

import os

# Keep the first seven columns and replace the eighth with a gene name
with os.popen('gunzip -c {}'.format(data_file)) as stream, open('data.txt', 'w') as f:
  for line in stream:
    fields = line.rstrip('\n').split('\t')  # split each line only once
    if fields[7].startswith('.'):
      gene = 'intergenic'
    elif fields[7].startswith('ID'):
      gene = fields[7].split(';')[0][3:]  # drop the leading 'ID=' and anything after ';'
    else:
      continue  # skip lines that match neither pattern
    f.write('{}\t{}\n'.format('\t'.join(fields[:7]), gene))
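
The gene-name rule used above can be sketched as a small standalone function. This is only an illustration: `parse_gene_field` is a hypothetical helper name, and the example field values are made up rather than taken from the real file.

```python
def parse_gene_field(field):
    """Map the raw eighth column to a gene name (hypothetical helper)."""
    if field.startswith('.'):
        return 'intergenic'
    if field.startswith('ID'):
        # Keep the text between 'ID=' and the first ';'
        return field.split(';')[0][3:]
    return None  # neither pattern matched


# Made-up field values, for illustration only
examples = ['.', 'ID=gene0;Name=dnaA', 'ID=gene12']
parsed = [parse_gene_field(x) for x in examples]
print(parsed)
```

Returning `None` for an unexpected field makes it easy to spot lines that follow neither convention.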

In [None]:
!wc -l data.txt

In [None]:
!head data.txt

In [None]:
!gunzip -c {data_file} | wc -l

In [None]:
import pandas as pd

tnseq = pd.read_table('data.txt', header=None, names=['pos','blunt','cap','dual','erm','pen','tuf','gene'])

In [None]:
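# Let's create a small subset of this dataset
# .copy() makes df an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
df = tnseq[tnseq['blunt'] > 200].copy()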

In [None]:
df.head()

In [None]:
df.index

In [None]:
df.describe()

In [None]:
df = df.sort_values(by=['pos'])

In [None]:
df = df.set_index('pos')

In [None]:
df.head()

## Selection
-------
**Note**: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: `.at`, `.iat`, `.loc` and `.iloc`.

See the indexing documentation:
 - [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing)
 - [MultiIndex / Advanced Indexing](https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced)

### Getting

Selecting via `[]`, which slices the rows.

In [None]:
df[0:3]

Selecting a single column, which yields a `Series`, can be done in two ways:

In [None]:
df.gene.head()

or

In [None]:
df['gene'].head()
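
The two forms are not always interchangeable. The sketch below uses a made-up toy frame (not the TnSeq data) to show that attribute access only works when the column name does not clash with an existing DataFrame attribute or method.

```python
import pandas as pd

# Toy data for illustration; 'count' clashes with the DataFrame.count method
toy = pd.DataFrame({'gene': ['a', 'b'], 'count': [1, 2]})

same = toy.gene.equals(toy['gene'])   # both forms return the same Series
col = toy['count']                    # bracket access returns the column
attr = toy.count                      # attribute access returns the method instead

print(same, type(col).__name__, callable(attr))
```

Bracket access is therefore the safer choice in scripts, while attribute access is mostly a convenience for interactive work.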

### Selection by label
See more in [Selection by Label](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label).

For getting a cross section using a label:

In [None]:
df.loc[2404930]

In [None]:
df.loc[2404930,['erm','pen']]

In [None]:
df.loc[2404930:2404937,['erm','pen']]

### Selection by position
See more in [Selection by Position](https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer)

Select via the position of the passed integers:

In [None]:
df.iloc[3]

By integer slices, acting similar to numpy/python:

In [None]:
df.iloc[3:5,0:2]
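
One difference from the label-based slices shown earlier is worth spelling out: `.loc` slices include both endpoints, while `.iloc` slices exclude the end, as in plain Python. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy frame indexed by made-up positions
toy = pd.DataFrame(
    {'erm': [5, 0, 7, 2], 'pen': [1, 3, 0, 9]},
    index=pd.Index([10, 14, 21, 30], name='pos'),
)

by_label = toy.loc[14:21, ['erm', 'pen']]  # labels 14 and 21, both included
by_position = toy.iloc[1:2, :]             # position 1 only, end excluded
in_range = toy.loc[12:25]                  # bounds need not exist in a sorted index

print(list(by_label.index), list(by_position.index), list(in_range.index))
```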

By lists of integer position locations, similar to the numpy/python style:

In [None]:
df.iloc[[1,2,4],[0,2]]

For slicing rows explicitly:

In [None]:
df.iloc[100:,1:5]

For getting a value explicitly:

In [None]:
df.iloc[1,1]

For getting fast access to a scalar (equivalent to the prior method):

In [None]:
df.iat[1,1]

### Selecting based on condition (boolean indexing)

Using a single column’s values to select data:

In [None]:
df[df.gene != 'intergenic'].head()

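Selecting values from a DataFrame where a boolean condition is met (only the numeric count columns are compared, because comparing the string column `gene` with a number raises a `TypeError`):

In [None]:
counts = df.loc[:, 'blunt':'tuf']
counts[counts > 0].head()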
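
Masking a whole DataFrame keeps its shape: cells where the condition is false become `NaN` instead of being dropped. A small sketch with made-up numbers:

```python
import pandas as pd

# Toy counts for illustration
toy = pd.DataFrame({'blunt': [0, 12, 3], 'cap': [7, 0, 0]}, index=[10, 14, 21])

masked = toy[toy > 0]  # same shape as toy; non-positive cells become NaN

print(masked)
```

Because `NaN` is a floating-point value, the masked columns are converted to floats.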

Using the [`isin()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html#pandas.Series.isin) method for filtering:

In [None]:
df.gene.unique()

In [None]:
df[df['gene'].isin(['gene2465','gene206'])]
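
Conditions can also be combined. The sketch below, on made-up toy data, uses `&` for "and" and `~` for "not"; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Toy data for illustration
toy = pd.DataFrame(
    {'erm': [5, 0, 7, 2], 'gene': ['gene1', 'intergenic', 'gene2', 'gene1']},
    index=[10, 14, 21, 30],
)

in_genes = toy[toy['gene'].isin(['gene1', 'gene2'])]
both = toy[(toy['gene'] != 'intergenic') & (toy['erm'] > 3)]
not_gene1 = toy[~toy['gene'].isin(['gene1'])]

print(list(in_genes.index), list(both.index), list(not_gene1.index))
```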

### Setting

Setting a new column automatically aligns the data by the indexes:

In [None]:
gc = pd.read_table('ta_gc.txt', header=None, names=['pos','gc'])

In [None]:
gc = gc.set_index('pos')

In [None]:
gc.head()

In [None]:
df['gc'] = gc

In [None]:
df.head()
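
To make the alignment explicit, here is a minimal sketch on made-up data. The values in the new column are matched by index label, not by order; labels missing from the new data yield `NaN`, and labels absent from the DataFrame are ignored.

```python
import pandas as pd

# Toy frame and a toy GC series with a different order and an extra label
toy = pd.DataFrame({'erm': [5, 0, 7]}, index=pd.Index([10, 14, 21], name='pos'))
gc_toy = pd.Series([0.61, 0.40, 0.55], index=[21, 10, 99], name='gc')

toy['gc'] = gc_toy  # aligned on the index labels 10, 14, 21

print(toy)
```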

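Setting values by label (if the label is not yet in the index, a new row is added and its remaining columns are filled with `NaN`):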

In [None]:
df.at[2,'erm'] = 0

In [None]:
df.loc[2]

In [None]:
df = df.sort_index()

In [None]:
df.head()
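
The effect of assigning to a label that is not in the index can be seen on a toy frame. This sketch uses `.loc` and made-up values; the new row is appended at the end, which is why sorting the index afterwards is useful.

```python
import pandas as pd

# Toy frame for illustration
toy = pd.DataFrame({'erm': [5, 7], 'pen': [1, 9]}, index=[10, 21])

toy.loc[2, 'erm'] = 0    # label 2 does not exist yet, so a new row is created
appended_last = list(toy.index)
toy = toy.sort_index()   # move the new row into index order

print(appended_last, list(toy.index))
print(toy)
```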

## Missing data
-------
pandas primarily uses the value `np.nan` to represent missing data. It is by default not included in computations. See the [Missing Data section](https://pandas.pydata.org/pandas-docs/stable/missing_data.html#missing-data).

To drop any rows that have missing data:

In [None]:
df.dropna(how='any').head()

Filling missing data:

In [None]:
df.fillna(value=0).head()

To get a boolean mask marking where values are missing:

In [None]:
df.isna().head()
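
The three methods can be compared side by side on a toy frame with made-up values. Note that each one returns a new object and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values
toy = pd.DataFrame(
    {'erm': [5.0, np.nan, 7.0], 'gc': [0.4, 0.5, np.nan]},
    index=[10, 14, 21],
)

complete = toy.dropna(how='any')  # keep only rows without any NaN
filled = toy.fillna(value=0)      # replace NaN with the number 0
mask = toy.isna()                 # True where a value is missing

print(list(complete.index), filled.loc[14, 'erm'], int(mask.sum().sum()))
```

Filling with the number `0` rather than the string `'0'` keeps the columns numeric.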

## Operations
-------
See the [Basic section on Binary Ops](https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-binop).

### Stats

Operations in general exclude missing data.

Performing a descriptive statistic:

In [None]:
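# numeric_only=True skips the string column 'gene', which recent pandas versions refuse to average
df.mean(numeric_only=True)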

Same operation on the other axis:

In [None]:
df.mean(axis=1, numeric_only=True).head()
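
How missing values are excluded can be checked on a toy frame (made-up numbers). By default `NaN` cells are simply left out of the average; passing `skipna=False` makes them propagate instead.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value
toy = pd.DataFrame({'erm': [2.0, np.nan, 4.0], 'pen': [1.0, 3.0, 5.0]})

col_means = toy.mean()           # per column; NaN is skipped
row_means = toy.mean(axis=1)     # per row
strict = toy.mean(skipna=False)  # NaN propagates to the result

print(col_means.tolist(), row_means.tolist())
```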

### Apply
Apply functions to the data:

In [None]:
df.head()

In [None]:
import numpy as np
df.loc[:,'blunt':'tuf'].apply(np.cumsum).head()
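
By default `apply` calls the function once per column. A minimal sketch on made-up counts, which also shows that the built-in `cumsum` method gives the same values for this particular case:

```python
import numpy as np
import pandas as pd

# Toy counts for illustration
toy = pd.DataFrame({'blunt': [1, 2, 3], 'cap': [10, 0, 5]})

via_apply = toy.apply(np.cumsum)                  # function applied column by column
via_method = toy.cumsum()                         # built-in equivalent
spread = toy.apply(lambda s: s.max() - s.min())   # any function of a column works

print(via_apply['blunt'].tolist(), via_apply['cap'].tolist(), spread.tolist())
```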

### Histogramming
See more at [Histogramming and Discretization](https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-discretization):

In [None]:
df['gene'].value_counts()

In [None]:
df.loc[:,'blunt':'tuf'].hist(bins=100, sharex=True, sharey=True)
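
As a small illustration of `value_counts`, the sketch below counts made-up gene labels. Results are sorted from most to least frequent, and `normalize=True` returns fractions instead of counts.

```python
import pandas as pd

# Made-up gene labels
genes = pd.Series(['gene1', 'intergenic', 'gene1', 'gene2', 'intergenic', 'gene1'])

counts = genes.value_counts()                   # occurrences per label
fractions = genes.value_counts(normalize=True)  # share of each label

print(counts.to_dict())
```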

### String Methods
Series is equipped with a set of string processing methods in the `str` attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in `str` generally uses [regular expressions](https://docs.python.org/3/library/re.html) by default (and in some cases always uses them). See more at [Vectorized String Methods](https://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods):

In [None]:
df['gene'].str.upper().head()
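
The regular-expression default matters in practice. The sketch below uses made-up gene names to show pattern matching with `str.contains`, capturing with `str.extract`, and how `regex=False` switches to literal matching.

```python
import pandas as pd

# Made-up gene names
genes = pd.Series(['gene206', 'intergenic', 'gene2465', 'Gene7'])

is_gene = genes.str.contains(r'^gene\d+$')           # pattern is a regex; match is case-sensitive
number = genes.str.extract(r'(\d+)', expand=False)   # first run of digits, NaN if none
any_char = genes.str.contains('.')                   # '.' as a regex matches any character
literal_dot = genes.str.contains('.', regex=False)   # '.' taken literally

print(is_gene.tolist(), number.tolist())
```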

In the next lecture we will learn how to process data in a number of interesting ways.