# Introduction to pandas

* A Python package for working with multi-dimensional, structured data (e.g. Excel spreadsheets, relational databases)

* Built on top of NumPy so it's fast...but with more convenient data structures

* The main data structure, called a DataFrame, is similar to the data.frame in R

Conventionally, pandas is imported using the alias **`pd`** because programmers are lazy

You'll often see the commonly used data structures imported separately for even less typing (i.e. avoiding pd.DataFrame)

In [1]:
import pandas as pd
from pandas import Series, DataFrame

Before we get started, let's make sure we are in the right directory to access the files we'll use

In [2]:
!ls
!pwd

cd4_change.csv	long_data.csv	Mock clinical data2.csv  test_slides.ipynb
cd4_data.csv	long_data.csv~	pandas-intro.ipynb	 Untitled.ipynb
cd4_data.tsv	medals.csv	python-part-03.ipynb
/home/swhite/git/cfar-data-workshop-2015/day1-sect3_pandas-intro


# Data Structures


## Series

...similar to a Python list or a single column of a spreadsheet

### Creating a Series

Let's create a new Series from a simple Python list

In [3]:
some_data = [815, 364, 2117]
some_data

[815, 364, 2117]

In [4]:
baseline = Series(some_data)
baseline

0     815
1     364
2    2117
dtype: int64

...a much nicer output vs Python's list

### Custom Index

But, we can make this even better

pandas allows us to specify custom indices

Let's re-create the Series with something more meaningful:

In [5]:
baseline = Series(some_data, index=['John', 'Jane', 'Joe'])
baseline

John     815
Jane     364
Joe     2117
dtype: int64

Note the length of the data and the indices given must be equal:

In [6]:
Series(some_data, index=['John', 'Jane', 'Joe', 'Scott'])

ValueError: Wrong number of items passed 3, placement implies 4

### Adding and Removing Items in a Series

To add a new value:

In [7]:
baseline['Jason'] = 42
baseline

John      815
Jane      364
Joe      2117
Jason      42
dtype: int64

And to remove it:

In [8]:
baseline.drop('Jason')

John     815
Jane     364
Joe     2117
dtype: int64

But, our value wasn't really removed!

In [9]:
baseline

John      815
Jane      364
Joe      2117
Jason      42
dtype: int64

Most pandas functions that modify data return a copy by default

We ***could*** assign the copy back to the original variable...

Luckily, many pandas functions have an option to modify data in place

In [10]:
baseline.drop('Jason', inplace=True)
baseline

John     815
Jane     364
Joe     2117
dtype: int64
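The two ways of making a removal stick can be compared side by side. This is a minimal sketch that rebuilds the `baseline` Series from the values above; the names `trimmed` and `still_there` are only illustrative.

```python
import pandas as pd

baseline = pd.Series([815, 364, 2117, 42],
                     index=['John', 'Jane', 'Joe', 'Jason'])

# drop returns a new Series; the original is left untouched
trimmed = baseline.drop('Jason')
still_there = 'Jason' in baseline.index

# Assigning the result back to the same name replaces the original
baseline = baseline.drop('Jason')
print(still_there, list(baseline.index))
```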

### The Series Index

To determine the indices use the **`index`** attribute:

In [11]:
baseline.index

Index([u'John', u'Jane', u'Joe'], dtype='object')

We can also give a more meaningful label name to the index

In [12]:
baseline.index.name = 'patients'
baseline

patients
John     815
Jane     364
Joe     2117
dtype: int64

And to the Series itself

In [13]:
baseline.name = 'CD4 baseline'
baseline

patients
John     815
Jane     364
Joe     2117
Name: CD4 baseline, dtype: int64

### Selecting Values from a Series

We can use our indices to reference the values:

In [14]:
baseline['John']

815

Regular indexing by position also works (newer pandas versions expect `baseline.iloc[0]` for positional access):

In [15]:
baseline[0]

815

Slicing works as well:

In [16]:
baseline[1:3]

patients
Jane     364
Joe     2117
Name: CD4 baseline, dtype: int64

Retrieving non-successive rows by position or index name:

In [17]:
baseline

patients
John     815
Jane     364
Joe     2117
Name: CD4 baseline, dtype: int64

In [18]:
baseline[[0,2]]

patients
John     815
Joe     2117
Name: CD4 baseline, dtype: int64

In [19]:
baseline[['John', 'Joe']]

patients
John     815
Joe     2117
Name: CD4 baseline, dtype: int64
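The same selections can be written with the explicit `loc` (label-based) and `iloc` (position-based) indexers, which avoid any ambiguity between labels and positions. This is a sketch assuming a reasonably recent pandas; the variable names are only illustrative.

```python
import pandas as pd

baseline = pd.Series([815, 364, 2117],
                     index=['John', 'Jane', 'Joe'],
                     name='CD4 baseline')

by_label = baseline.loc['John']           # select by index label
by_position = baseline.iloc[0]            # select by integer position
position_slice = baseline.iloc[1:3]       # rows at positions 1 and 2
label_list = baseline.loc[['John', 'Joe']]
position_list = baseline.iloc[[0, 2]]

print(by_label, by_position)
print(position_slice)
```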

### Data Alignment

Let's create a Series with some followup data

We'll use a different order for the patient names

In [20]:
followup_data = [448, 1959, 792]
followup = Series(
    followup_data,
    index=['Jane', 'Joe', 'John'], 
    name='CD4 followup')
followup.index.name = 'patients'
followup 

patients
Jane     448
Joe     1959
John     792
Name: CD4 followup, dtype: int64

Note, we specified the Series name when creating the Series

Now let's compute the differences over time

In [21]:
diff = followup - baseline
diff

patients
Jane     84
Joe    -158
John    -23
dtype: int64

pandas uses the indices to ***align*** data in different series

But, what if the 2 Series have non-matching indices?

In [22]:
baseline['Jill'] = 836
baseline

patients
John     815
Jane     364
Joe     2117
Jill     836
Name: CD4 baseline, dtype: int64

In [23]:
diff = followup - baseline
diff

patients
Jane     84
Jill    NaN
Joe    -158
John    -23
dtype: float64

The new Series is a **union** of the indices

pandas uses the value **`NaN`** (not a number) for the missing data

### Filtering

Like with NumPy, we can use boolean arrays for filtering

In [24]:
diff > 0

patients
Jane     True
Jill    False
Joe     False
John    False
dtype: bool

Comparisons involving **`NaN`** evaluate to `False`

Likewise for a "less than" comparison:

In [25]:
diff < 0

patients
Jane    False
Jill    False
Joe      True
John     True
dtype: bool
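Because a missing value fails both comparisons, the row for Jill is neither "greater than" nor "less than" zero. A small sketch of this, rebuilding `diff` by hand from the output above (the name `neither` is only illustrative):

```python
import numpy as np
import pandas as pd

diff = pd.Series([84, np.nan, -158, -23],
                 index=['Jane', 'Jill', 'Joe', 'John'])

greater = diff > 0
less = diff < 0

# Rows that are neither positive nor negative: here, only the NaN row
neither = ~(greater | less)
print(neither)
```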

Use the boolean array to get the values for the filter

In [26]:
diff[diff > 0]

patients
Jane    84
dtype: float64

Only the values corresponding to **`True`** are returned

We can easily filter the missing data using the function **`isnull`**

In [27]:
diff[diff.isnull()]

patients
Jill   NaN
dtype: float64

And get the inverse using **`notnull`**

In [28]:
diff[diff.notnull()]

patients
Jane     84
Joe    -158
John    -23
dtype: float64

We can also fill in the missing values with a value using **`fillna`**.

In [29]:
diff.fillna(0)

patients
Jane     84
Jill      0
Joe    -158
John    -23
dtype: float64

**`fillna`** doesn't modify the original Series, but does take an ***`inplace`*** argument

There's also **`dropna`** to remove missing values

In [30]:
diff.dropna(inplace=True)
diff

patients
Jane     84
Joe    -158
John    -23
dtype: float64

# Data Structures

## DataFrame

A DataFrame is similar to an Excel spreadsheet, containing both columns and rows

You can think of a DataFrame as a container for multiple Series with a common index

Let's create a DataFrame by concatenating both the baseline and followup Series across the columns (**`axis=1`**):

In [116]:
cd4_frame = pd.concat([baseline, followup], axis=1)
cd4_frame

| | CD4 baseline | CD4 followup |
|---|---|---|
| Jane | 364 | 448.0 |
| Jill | 836 | NaN |
| Joe | 2117 | 1959.0 |
| John | 815 | 792.0 |


The IPython notebook renders the DataFrame as an HTML table

### Axis labelling is tricky

```
+------+-------+-------+
|      | col_A | col_B |
+------+-------+-------+
| Jane |  364  |  448  | -- axis=1 -->
+------+-------+-------+
           |
           | axis=0
           ↓
```

Think of **`axis=1`** as going across the columns (along a row)

Think of **`axis=0`** as going across the rows (down a column)
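One way to get a feel for the axis argument is to sum along each axis. This is a small sketch with a two-row frame; the column names follow the diagram above, and the second row reuses John's values from earlier.

```python
import pandas as pd

frame = pd.DataFrame({'col_A': [364, 815], 'col_B': [448, 792]},
                     index=['Jane', 'John'])

# axis=0 moves down the rows, giving one result per column
column_totals = frame.sum(axis=0)

# axis=1 moves across the columns, giving one result per row
row_totals = frame.sum(axis=1)

print(column_totals)
print(row_totals)
```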

The **`shape`** *attribute* returns the number of rows and columns:

In [117]:
cd4_frame.shape

(4, 2)

The **`describe`** *method* gives a variety of summary data:

In [118]:
cd4_frame.describe()

| | CD4 baseline | CD4 followup |
|---|---|---|
| count | 4.0 | 3.0 |
| mean | 1033.0 | 1066.333333 |
| std | 754.751615 | 791.974958 |
| min | 364.0 | 448.0 |
| 25% | 702.25 | 620.0 |
| 50% | 825.5 | 792.0 |
| 75% | 1156.25 | 1375.5 |
| max | 2117.0 | 1959.0 |


### Naming things

Let's rename the columns for easier typing & to remove the spaces:

In [122]:
cd4_frame.rename(
    columns={
        'CD4 baseline': 'baseline', 
        'CD4 followup': 'followup'
    },
    inplace=True)
cd4_frame

| patients | baseline | followup |
|---|---|---|
| Jane | 364 | 448.0 |
| Jill | 836 | NaN |
| Joe | 2117 | 1959.0 |
| John | 815 | 792.0 |


And, just like with a Series, we can name the DataFrame's index.

In [123]:
cd4_frame.index.name = 'patients'
cd4_frame

| patients | baseline | followup |
|---|---|---|
| Jane | 364 | 448.0 |
| Jill | 836 | NaN |
| Joe | 2117 | 1959.0 |
| John | 815 | 792.0 |


### Making Selections

A single column can be extracted in a couple ways

First, by dictionary-like indexing:

In [36]:
cd4_frame['baseline']

patients
Jane     364
Jill     836
Joe     2117
John     815
Name: baseline, dtype: int64

Notice that indexing a DataFrame with square brackets selects a column, whereas indexing a Series selected a row

We'll see how to select an entire DataFrame row in a bit

A more convenient way to extract a column is by attribute:

In [37]:
cd4_frame.baseline

patients
Jane     364
Jill     836
Joe     2117
John     815
Name: baseline, dtype: int64

Column names containing a space are not available as attributes; you must use dictionary-style indexing

...another good reason to rename unwieldy column names

A column extracted from a DataFrame is a pandas Series object

In [38]:
type(cd4_frame.baseline)

pandas.core.series.Series

Any of the Series methods can be used on the column

In [39]:
cd4_frame.followup.isnull()

patients
Jane    False
Jill     True
Joe     False
John    False
Name: followup, dtype: bool

Knowing this, we can extract a single "cell" from a DataFrame:

In [40]:
cd4_frame.baseline['Joe']

2117

This works with dictionary indexing too:

In [41]:
cd4_frame['baseline']['Joe']

2117

If all the names are space-free, we can conveniently use all attributes:

In [42]:
cd4_frame.baseline.Joe

2117
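A single cell can also be addressed in one step by giving the row and column labels together. This is a sketch using the `loc` and `at` indexers on a two-row version of the frame; the variable names are only illustrative.

```python
import pandas as pd

cd4_frame = pd.DataFrame({'baseline': [364, 2117],
                          'followup': [448.0, 1959.0]},
                         index=['Jane', 'Joe'])

# Row label first, then column label
cell = cd4_frame.loc['Joe', 'baseline']

# at is a scalar-only accessor for the same lookup
same_cell = cd4_frame.at['Joe', 'baseline']

# The same form can assign a value directly in the frame
cd4_frame.loc['Jane', 'baseline'] = 42
print(cell, same_cell)
print(cd4_frame)
```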

We can select multiple columns and specify their order using a ***list*** of column names:

In [43]:
cd4_frame[['followup','baseline']]

| patients | followup | baseline |
|---|---|---|
| Jane | 448.0 | 364 |
| Jill | NaN | 836 |
| Joe | 1959.0 | 2117 |
| John | 792.0 | 815 |


But, be careful when manipulating data extracted from a DataFrame:

In [124]:
col = cd4_frame.baseline
col['Jane'] = 42
cd4_frame

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


| patients | baseline | followup |
|---|---|---|
| Jane | 42 | 448.0 |
| Jill | 836 | NaN |
| Joe | 2117 | 1959.0 |
| John | 815 | 792.0 |


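Here the Series extracted from our DataFrame is a **view** of the data and not a copy, so the change shows up in the DataFrame as well (whether you get a view or a copy can depend on the pandas version). If you really want a separate copy make sure to use **`copy`**: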

In [45]:
col = cd4_frame.baseline.copy()
col['Jane'] = 5000
cd4_frame

| patients | baseline | followup |
|---|---|---|
| Jane | 42 | 448.0 |
| Jill | 836 | NaN |
| Joe | 2117 | 1959.0 |
| John | 815 | 792.0 |


Let's restore our original baseline column:

In [46]:
cd4_frame['baseline'] = baseline
cd4_frame

| patients | baseline | followup |
|---|---|---|
| Jane | 364 | 448.0 |
| Jill | 836 | NaN |
| Joe | 2117 | 1959.0 |
| John | 815 | 792.0 |


### Retrieving Rows

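To retrieve an entire DataFrame row, use the **`ix`** attribute (deprecated in pandas 0.20 and removed in 1.0, where **`loc`** selects by label and **`iloc`** by position):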

In [47]:
cd4_frame.ix['Joe']

baseline    2117
followup    1959
Name: Joe, dtype: float64

This also returns a Series object:

In [48]:
type(cd4_frame.ix['Joe'])

pandas.core.series.Series

And gives us even more options for accessing a single value:

In [49]:
cd4_frame.ix['Joe'].baseline

2117.0

In [50]:
cd4_frame.ix['Joe']['baseline']

2117.0

But, **`ix`** does *not* have attributes for the row names, so this won't work:

In [51]:
cd4_frame.ix.Joe

AttributeError: '_IXIndexer' object has no attribute 'Joe'

To get the first 2 rows, we can slice using **`ix`**:

In [52]:
cd4_frame.ix[:2]

| patients | baseline | followup |
|---|---|---|
| Jane | 364 | 448.0 |
| Jill | 836 | NaN |


Getting the 2nd and 4th rows:

In [53]:
cd4_frame.ix[[1,3]]

| patients | baseline | followup |
|---|---|---|
| Jill | 836 | NaN |
| John | 815 | 792.0 |


Select multiple rows of a single column:

In [54]:
cd4_frame.ix[[1,3], 'followup']

patients
Jill    NaN
John    792
Name: followup, dtype: float64

And finally, selecting multiple rows and multiple columns:

In [55]:
cd4_frame.ix[[1,3], ['followup', 'baseline']]

| patients | followup | baseline |
|---|---|---|
| Jill | NaN | 836 |
| John | 792.0 | 815 |
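For readers on a pandas version without `ix`, the row selections in this section can be approximated with `loc` and `iloc`. This is a sketch, not a line-for-line translation; the frame is rebuilt by hand and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

cd4_frame = pd.DataFrame(
    {'baseline': [364, 836, 2117, 815],
     'followup': [448.0, np.nan, 1959.0, 792.0]},
    index=pd.Index(['Jane', 'Jill', 'Joe', 'John'], name='patients'))

joe = cd4_frame.loc['Joe']                 # one row, by label
first_two = cd4_frame.iloc[:2]             # first 2 rows, by position
second_and_fourth = cd4_frame.iloc[[1, 3]]

# Mixing row positions with column labels: look up the row labels first
rows = cd4_frame.index[[1, 3]]
subset = cd4_frame.loc[rows, ['followup', 'baseline']]
print(subset)
```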


### Creating and Deleting Columns

Let's create a new column:

In [56]:
cd4_frame

| patients | baseline | followup |
|---|---|---|
| Jane | 364 | 448.0 |
| Jill | 836 | NaN |
| Joe | 2117 | 1959.0 |
| John | 815 | 792.0 |


In [57]:
cd4_frame['sex'] = ['F', 'F', 'M', 'M']
cd4_frame

| patients | baseline | followup | sex |
|---|---|---|---|
| Jane | 364 | 448.0 | F |
| Jill | 836 | NaN | F |
| Joe | 2117 | 1959.0 | M |
| John | 815 | 792.0 | M |


And a new column with the percent change in CD4:

In [58]:
diff = cd4_frame.followup - cd4_frame.baseline
cd4_frame['percent_change'] = diff / cd4_frame.baseline * 100
cd4_frame

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Jane | 364 | 448.0 | F | 23.076923 |
| Jill | 836 | NaN | F | NaN |
| Joe | 2117 | 1959.0 | M | -7.463392 |
| John | 815 | 792.0 | M | -2.822086 |


Note you cannot create a new column using a new attribute, i.e. `cd4_frame.some_column = ...`

To remove a column we can use **`drop`**

In [59]:
cd4_frame.drop('percent_change', axis=1)

| patients | baseline | followup | sex |
|---|---|---|---|
| Jane | 364 | 448.0 | F |
| Jill | 836 | NaN | F |
| Joe | 2117 | 1959.0 | M |
| John | 815 | 792.0 | M |


**`drop`** returns a copy and doesn't modify in place by default

It can also remove rows using ***`axis=0`***

### Filtering DataFrames

Filter the whole frame:

In [60]:
cd4_frame > 800

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Jane | False | False | True | False |
| Jill | True | False | True | False |
| Joe | True | True | True | False |
| John | True | False | True | False |


**`isnull`** also works on the whole DataFrame:

In [61]:
cd4_frame.isnull()

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Jane | False | False | False | False |
| Jill | False | True | False | True |
| Joe | False | False | False | False |
| John | False | False | False | False |


We can also filter the rows using a condition on just one column:

In [62]:
cd4_frame[cd4_frame['baseline'] > 400]

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Jill | 836 | NaN | F | NaN |
| Joe | 2117 | 1959.0 | M | -7.463392 |
| John | 815 | 792.0 | M | -2.822086 |


Filtering a text value:

In [63]:
cd4_frame[cd4_frame['sex'] == 'F']

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Jane | 364 | 448.0 | F | 23.076923 |
| Jill | 836 | NaN | F | NaN |


Filtering multiple columns by combining boolean arrays:

In [64]:
(cd4_frame['sex'] == 'F') & (cd4_frame['baseline'] > 800)

patients
Jane    False
Jill     True
Joe     False
John    False
dtype: bool
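The combined boolean Series can then be used to select rows, just like a single condition. This sketch also shows `|` (or) and `~` (not); the frame is cut down to two columns, and the names `mask`, `selected`, `others` and `either` are only illustrative.

```python
import pandas as pd

cd4_frame = pd.DataFrame({'baseline': [364, 836, 2117, 815],
                          'sex': ['F', 'F', 'M', 'M']},
                         index=['Jane', 'Jill', 'Joe', 'John'])

# Each condition needs its own parentheses when combined with & or |
mask = (cd4_frame['sex'] == 'F') & (cd4_frame['baseline'] > 800)

selected = cd4_frame[mask]    # rows where both conditions hold
others = cd4_frame[~mask]     # ~ inverts the mask
either = cd4_frame[(cd4_frame['sex'] == 'F') | (cd4_frame['baseline'] > 2000)]

print(selected)
```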

### Sorting

Sorting by a single column (newer pandas versions replace **`sort`** with **`sort_values`**):

In [65]:
cd4_frame.sort('percent_change')

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Joe | 2117 | 1959.0 | M | -7.463392 |
| John | 815 | 792.0 | M | -2.822086 |
| Jane | 364 | 448.0 | F | 23.076923 |
| Jill | 836 | NaN | F | NaN |


And multiple columns:

In [66]:
cd4_frame.sort(columns=['sex', 'baseline'])

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Jane | 364 | 448.0 | F | 23.076923 |
| Jill | 836 | NaN | F | NaN |
| John | 815 | 792.0 | M | -2.822086 |
| Joe | 2117 | 1959.0 | M | -7.463392 |
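On pandas versions where `DataFrame.sort` is no longer available, the same two sorts can be written with `sort_values`. This is a sketch that rebuilds the frame by hand; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

cd4_frame = pd.DataFrame(
    {'baseline': [364, 836, 2117, 815],
     'followup': [448.0, np.nan, 1959.0, 792.0],
     'sex': ['F', 'F', 'M', 'M']},
    index=pd.Index(['Jane', 'Jill', 'Joe', 'John'], name='patients'))
change = cd4_frame['followup'] - cd4_frame['baseline']
cd4_frame['percent_change'] = change / cd4_frame['baseline'] * 100

# Sort by one column; missing values are placed last by default
by_change = cd4_frame.sort_values('percent_change')

# Sort by several columns, in order of priority
by_sex_then_baseline = cd4_frame.sort_values(['sex', 'baseline'])

# Descending order
largest_first = cd4_frame.sort_values('baseline', ascending=False)
print(by_change)
```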


Sorting by the index:

In [69]:
cd4_frame.sort_index()

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Jane | 364 | 448.0 | F | 23.076923 |
| Jill | 836 | NaN | F | NaN |
| Joe | 2117 | 1959.0 | M | -7.463392 |
| John | 815 | 792.0 | M | -2.822086 |


Sorting the column order by column name:

In [78]:
cd4_frame.sort_index(axis=1, ascending=False)

| patients | sex | percent_change | followup | baseline |
|---|---|---|---|---|
| Jane | F | 23.076923 | 448.0 | 364 |
| Jill | F | NaN | NaN | 836 |
| Joe | M | -7.463392 | 1959.0 | 2117 |
| John | M | -2.822086 | 792.0 | 815 |


# Exporting and Importing CSV Data

Saving our DataFrame to a CSV is easy:

In [90]:
cd4_frame.to_csv("cd4_data.csv")
!ls

cd4_change.csv	long_data.csv	Mock clinical data2.csv  test_slides.ipynb
cd4_data.csv	long_data.csv~	pandas-intro.ipynb	 Untitled.ipynb
cd4_data.tsv	medals.csv	python-part-03.ipynb


In [91]:
!cat cd4_data.csv

patients,baseline,followup,sex,percent_change
Jane,364,448.0,F,23.0769230769
Jill,836,,F,
Joe,2117,1959.0,M,-7.46339159188
John,815,792.0,M,-2.82208588957


In [92]:
cd4_frame.dtypes

baseline            int64
followup          float64
sex                object
percent_change    float64
dtype: object

Importing our data back into pandas:

In [93]:
cd4_import = pd.read_csv("cd4_data.csv")
cd4_import

| | patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|---|
| 0 | Jane | 364 | 448.0 | F | 23.076923 |
| 1 | Jill | 836 | NaN | F | NaN |
| 2 | Joe | 2117 | 1959.0 | M | -7.463392 |
| 3 | John | 815 | 792.0 | M | -2.822086 |


But, there's something different. The original data we exported was indexed by `patients`.

To set the index to an existing column we can use **`set_index`**

In [94]:
cd4_import = cd4_import.set_index('patients')
cd4_import

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Jane | 364 | 448.0 | F | 23.076923 |
| Jill | 836 | NaN | F | NaN |
| Joe | 2117 | 1959.0 | M | -7.463392 |
| John | 815 | 792.0 | M | -2.822086 |


Or, we could have specified the index column when importing:

In [95]:
pd.read_csv("cd4_data.csv", index_col='patients')

| patients | baseline | followup | sex | percent_change |
|---|---|---|---|---|
| Jane | 364 | 448.0 | F | 23.076923 |
| Jill | 836 | NaN | F | NaN |
| Joe | 2117 | 1959.0 | M | -7.463392 |
| John | 815 | 792.0 | M | -2.822086 |


If the text file is not comma delimited, you can specify the separator using ***`sep`***

**`to_csv`** also uses the ***`sep`*** argument. Let's save a tab-delimited version of our data:

In [127]:
cd4_frame.to_csv("cd4_data.tsv", sep="\t")
!cat cd4_data.tsv

patients	baseline	followup
Jane	42	448.0
Jill	836	
Joe	2117	1959.0
John	815	792.0


Notice the tab delimiter is written using the escape sequence **`\t`**
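A round trip through a tab-delimited file can be sketched without touching the disk by writing to an in-memory buffer. The use of `io.StringIO` and the two-row frame are assumptions made for illustration; with a real file you would pass the filename instead.

```python
import io

import pandas as pd

frame = pd.DataFrame({'baseline': [364, 815],
                      'followup': [448.0, 792.0]},
                     index=pd.Index(['Jane', 'John'], name='patients'))

# Write tab-delimited text into an in-memory buffer
buffer = io.StringIO()
frame.to_csv(buffer, sep='\t')
text = buffer.getvalue()

# Read it back with the same separator, restoring the index column
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t', index_col='patients')
print(text)
print(restored)
```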

A full list of options for **`read_csv`** is available in the docs

In [126]:
pd.read_csv?

# Exercise

Use pandas to import the longitudinal data set in **`long_data.csv`**

1. How many records are in the CSV?
1. Rename any column names containing spaces.
1. Is there a good choice for an index column?
1. Are there any missing data values?
1. What is the lowest FI-Bkgd value? the highest? the mean?
1. Filter for visit 9 records with FI-Bkgd more than 10,000.
1. On what date did visit code 19 occur for participant 'SAL2'?
1. Make a new DataFrame by filtering on 'SAL2' matching the 'Blank' analyte
1. Use tab completion on your DataFrame to find a function we didn't cover. Print the help for this function using "?".

Q1. How many records are in the CSV?

Read in the CSV, use shape to get the number of records

In [106]:
long_data = pd.read_csv("long_data.csv")
long_data.shape

(1761, 10)

Q2. Rename any column names containing spaces.

In [128]:
long_data.columns

Index([u'dilution', u'analyte', u'fi-bkgd', u'fi-bkgd-neg', u'cv',
       u'participant_id', u'visit_code', u'visit_date', u'sample_type',
       u'buffer'],
      dtype='object')

In [107]:
long_data.rename(
    columns={
        'Participant ID': 'participant_id', 
        'Visit Code': 'visit_code',
        'Visit Date': 'visit_date',
        'Sample Type': 'sample_type'
    },
    inplace=True
)

In [108]:
long_data.columns

Index([u'Dilution', u'Analyte', u'FI-Bkgd', u'FI-Bkgd-Neg', u'CV',
       u'participant_id', u'visit_code', u'visit_date', u'sample_type',
       u'Buffer'],
      dtype='object')

We'll also make all the column names lowercase

In [109]:
long_data.columns = long_data.columns.str.lower()
long_data.head()

| | dilution | analyte | fi-bkgd | fi-bkgd-neg | cv | participant_id | visit_code | visit_date | sample_type | buffer |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | p24 (19) | 474.8 | 454.8 | 0.0372 | URN2 | 0 | 10/14/1899 | PLA | PBS |
| 1 | 50 | gp41 (44) | 470.8 | 452.8 | 0.1387 | URN2 | 0 | 10/14/1899 | PLA | PBS |
| 2 | 50 | Con 6 gp120/B (72) | 52.5 | 44.5 | 0.1183 | URN2 | 0 | 10/14/1899 | PLA | PBS |
| 3 | 50 | B.con.env03 140 CF (65) | 55.5 | 46.5 | 0.1709 | URN2 | 0 | 10/14/1899 | PLA | PBS |
| 4 | 50 | Blank (53) | 29.0 | NaN | 0.0527 | URN2 | 0 | 10/14/1899 | PLA | PBS |


Q3. Is there a good choice for an index column?

Not really. There's no single column containing unique values. 

We'll see in the next session how to create an index using multiple columns.

Q4. Are there any missing data values?

There are several ways to determine if a data set contains missing values

We could look column by column

In [130]:
long_data[long_data.participant_id.isnull()]

| | dilution | analyte | fi-bkgd | fi-bkgd-neg | cv | participant_id | visit_code | visit_date | sample_type | buffer |
|---|---|---|---|---|---|---|---|---|---|---|


A useful trick is to sum the boolean values returned from **`isnull`**:

In [110]:
long_data.isnull().sum()

dilution            0
analyte             0
fi-bkgd             0
fi-bkgd-neg       353
cv                  0
participant_id      0
visit_code          0
visit_date          0
sample_type         0
buffer              0
dtype: int64

This tells us which columns contain null values and how many

But, if multiple columns contained missing values, we could find all of the affected rows using **`any`**:

In [141]:
long_data[long_data.isnull().any(axis=1)]

| | dilution | analyte | fi-bkgd | fi-bkgd-neg | cv | participant_id | visit_code | visit_date | sample_type | buffer |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 50 | Blank (53) | 29.0 | NaN | 0.0527 | URN2 | 0 | 10/14/1899 | PLA | PBS |
| 8 | 50 | MulVgp70_His6 (49) | 205.4 | NaN | 0.0861 | URN2 | 0 | 10/14/1899 | PLA | PBS |
| 14 | 50 | Blank (53) | 4.8 | NaN | 0.1674 | URN2 | 0 | 10/14/1899 | PLA | CIT |
| 18 | 50 | MulVgp70_His6 (49) | 20.0 | NaN | 0.1170 | URN2 | 0 | 10/14/1899 | PLA | CIT |
| 24 | 50 | Blank (53) | 54.0 | NaN | 0.0228 | URN2 | 9 | 01/04/1901 | PLA | PBS |
| 28 | 50 | MulVgp70_His6 (49) | 89.5 | NaN | 0.1078 | URN2 | 9 | 01/04/1901 | PLA | PBS |
| 34 | 50 | Blank (53) | 8.5 | NaN | 0.0788 | URN2 | 9 | 01/04/1901 | PLA | CIT |
| 38 | 50 | MulVgp70_His6 (49) | 42.4 | NaN | 0.3950 | URN2 | 9 | 01/04/1901 | PLA | CIT |
| 44 | 50 | Blank (53) | 77.0 | NaN | 0.0166 | URN2 | 8 | 12/30/1900 | PLA | PBS |
| 48 | 50 | MulVgp70_His6 (49) | 222.5 | NaN | 0.0578 | URN2 | 8 | 12/30/1900 | PLA | PBS |


Q5. What is the lowest FI-Bkgd value? the highest? the mean?

In [111]:
long_data.dtypes

dilution            int64
analyte            object
fi-bkgd           float64
fi-bkgd-neg       float64
cv                float64
participant_id     object
visit_code          int64
visit_date         object
sample_type        object
buffer             object
dtype: object

In [112]:
long_data.sort('fi-bkgd', ascending=False).head()

| | dilution | analyte | fi-bkgd | fi-bkgd-neg | cv | participant_id | visit_code | visit_date | sample_type | buffer |
|---|---|---|---|---|---|---|---|---|---|---|
| 748 | 50 | p66 (RT) (42) | 42992.5 | 42026.0 | 0.0093 | SAL2 | 21 | 09/12/1902 | PLA | PBS |
| 531 | 50 | p24 (19) | 42972.8 | 42940.0 | 0.0027 | SAL2 | 9 | 09/13/1900 | PLA | CIT |
| 1241 | 50 | p24 (19) | 42962.0 | 42922.0 | 0.0062 | PL2 | 21 | 02/09/1902 | PLA | PBS |
| 741 | 50 | p24 (19) | 42954.5 | 40978.0 | 0.012 | SAL2 | 21 | 09/12/1902 | PLA | PBS |
| 217 | 50 | p66 (RT) (42) | 42942.0 | 42900.7 | 0.008 | URN2 | 22 | 03/08/1903 | PLA | CIT |


In [113]:
long_data['fi-bkgd'].describe()

count     1761.000000
mean     12241.496479
std      17371.064423
min        -52.500000
25%         48.500000
50%        405.000000
75%      27909.400000
max      42992.500000
Name: fi-bkgd, dtype: float64

Q6. Filter for visit 9 records with FI-Bkgd more than 10,000.

In [None]:
long_data[(long_data['fi-bkgd'] > 10000) & (long_data['visit_code'] == 9)]

Q7. On what date did visit code 19 occur for participant 'SAL2'?

In [None]:
long_data[(long_data['participant_id'] == 'SAL2') & (long_data['visit_code'] == 19)]

Q8. Make a new DataFrame by filtering on 'SAL2' matching the 'Blank' analyte

In [None]:
sal2_blanks = long_data[(long_data['participant_id'] == 'SAL2') & (long_data['analyte'] == 'Blank (53)')].copy()
sal2_blanks.head()

# Basic QC Techniques using Summary Data

- Unique values
- Value Counts
- Duplicates

Finding the unique values can help reveal whether any are missing, or help when building a relational DB:

In [None]:
analytes = long_data.analyte.unique()
len(analytes)

In [None]:
Series(analytes)

Looking at the number of occurrences can also help find missing or duplicated data:

In [None]:
long_data.analyte.value_counts()

In [None]:
long_data.participant_id.value_counts()

pandas has a convenient way of finding duplicated data:

In [None]:
long_data[long_data.duplicated()]

In [None]:
long_data[(long_data['fi-bkgd'] == 5.8) & (long_data.analyte == "Blank (53)") & (long_data.participant_id == 'PL1')]

In [None]:
long_data = long_data.drop(464)
long_data[long_data.duplicated()]

**`duplicated`** can also take a list of columns:

In [None]:
long_data[long_data.duplicated(['fi-bkgd', 'cv', 'analyte', 'buffer'])]
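How `duplicated` flags rows is easier to see on a tiny frame. In this sketch the three records are made up for illustration (only the PL1 "Blank (53)" values echo the query above), and `keep=False` and `drop_duplicates` are shown as related options.

```python
import pandas as pd

# Hypothetical records: rows 0 and 1 are exact duplicates
records = pd.DataFrame({
    'participant_id': ['PL1', 'PL1', 'PL2'],
    'analyte': ['Blank (53)', 'Blank (53)', 'Blank (53)'],
    'fi-bkgd': [5.8, 5.8, 7.0],
})

# By default only the later copies are flagged
repeats = records[records.duplicated()]

# keep=False flags every member of a duplicated group
all_copies = records[records.duplicated(keep=False)]

# Restricting the comparison to some columns flags more rows
same_analyte = records[records.duplicated(['analyte'])]

# drop_duplicates keeps the first copy of each row
deduped = records.drop_duplicates()
print(repeats)
```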

There are many other DataFrame functions to get summary statistics. **`describe`** includes several:

In [None]:
long_data.describe()

# Hierarchical Indexing

Our longitudinal data set doesn't have a single column with unique values to use for an index. pandas allows us to create a hierarchical index using multiple columns.

**Note: It is good practice to sort the indices when using hierarchical indexing. On older versions of pandas multi-indexing may not work properly on unsorted indices, and even in newer versions indexing may be significantly slower for non-sorted DataFrames**

We know the same analyte shouldn't be present more than once per participant per visit per buffer, so we can use those four fields to create an index and then sort:

In [None]:
long_data_h = long_data.set_index(['participant_id', 'visit_code', 'buffer', 'analyte'])
long_data_h = long_data_h.sort_index()
long_data_h

To test if our index is unique:

In [None]:
long_data_h.index.is_unique

In [None]:
long_data_h.index

Now we can filter a little more easily:

In [None]:
long_data_h.ix['PL1', 0]

In [None]:
long_data_h.ix['PL1', 0, 'PBS']['cv']

We can easily swap index levels as well:

In [None]:
long_data_h.swaplevel('analyte', 'participant_id').sort_index()
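The hierarchical selections above can be tried on a small made-up frame. This sketch uses `loc` rather than `ix` (an assumption for newer pandas versions), three index levels instead of four, and hypothetical `cv` values.

```python
import pandas as pd

# Hypothetical records with one row per participant/visit/buffer
records = pd.DataFrame({
    'participant_id': ['PL1', 'PL1', 'PL1', 'SAL2'],
    'visit_code': [0, 0, 9, 0],
    'buffer': ['PBS', 'CIT', 'PBS', 'PBS'],
    'cv': [0.05, 0.17, 0.02, 0.08],
})

indexed = records.set_index(
    ['participant_id', 'visit_code', 'buffer']).sort_index()

# A partial key selects every row under those outer levels
pl1_visit0 = indexed.loc[('PL1', 0)]

# A full key plus a column name selects a single value
single = indexed.loc[('PL1', 0, 'PBS'), 'cv']

print(indexed.index.is_unique)
print(pl1_visit0)
print(single)
```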

# Regular Expressions (regex)

## What are Regular Expressions & what can we do with them?

  * Funny name: in the 1950s, mathematician Stephen Kleene described *regular languages* using a notation for patterns that came to be called regular expressions
  * Regular expressions are patterns we can use to process nearly any text
  * Constructed using a combination of metacharacters: characters with a special meaning used to concisely define patterns

Understanding regex is valuable because regular expressions can be used in many tools besides Python, such as good text editors and Unix commands. Using a text editor that supports regex can solve many data munging problems without having to write any code at all.
  
Before we begin using regular expressions in Python let's have an overview using the online regex tool:

https://www.regex101.com/#python

## Global Modifier g

In the "TEST STRING" text box type

```
grey gray
```

Now, in the "REGULAR EXPRESSION" input field type the regular expression:

```
gr
```

Only the first 2 letters of the 1st word are highlighted. To find all occurrences we need to perform a global search. To do this, we need to use a regex modifier. Type the letter "g" in the 2nd input field.
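Python's `re` module has no "g" modifier; as a rough equivalent, `re.search` stops at the first match while `re.findall` returns every match. A minimal sketch with the same test string:

```python
import re

text = "grey gray"

# search returns only the first match (like the tool without "g")
first = re.search(r"gr", text)

# findall returns every non-overlapping match (like the tool with "g")
all_matches = re.findall(r"gr", text)

print(first.start(), all_matches)
```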

## Capture Groups ( )

Note the helpful explanation and match information on the right hand side. There are no "capture groups" extracted, even though we found a match. To create a capture group use parentheses:

```
(gr)
```

You can have as many capture groups as you want, and even capture strings inside a capture group.

## Capture either or using |

Using a pipe within the capture group we can specify matching on multiple phrases:

(grey|gray)

## Single character wildcard .

To capture either spelling variation we can use the single character wildcard ".":

```
(gr.y)
```

The single character wildcard matches any character except a new line.

## Character Classes [ ]

The wildcard will also match misspellings. Edit our TEST STRING to:

```
grey gray grzy
```

We can fix this using a character class to match only "e" or "a". Character classes are created using square brackets:

```
(gr[ea]y)
```

A character class matches a single character: any one of the characters listed inside the square brackets (it looks very similar to the list syntax in Python)
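The difference between the wildcard and the character class can be checked in Python with `re.findall`. This is a small sketch using the same test string:

```python
import re

text = "grey gray grzy"

# The wildcard accepts any character in the third position
wildcard = re.findall(r"gr.y", text)

# The character class accepts only "e" or "a"
char_class = re.findall(r"gr[ea]y", text)

print(wildcard)
print(char_class)
```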

## Zero or one quantifier ?

Let's try another word with spelling variations. Add a new line in the TEST STRING:

```
color colour
```

Our "or" approach won't work here, but we can use the "zero or one" quantifier "?":

```
colou?r
```

## Any word character \w

Sometimes we may not know all the combinations of letters. In this case we can use the word character \w.

Add another line to our sample text:

```
red green blue yellow
```

And we'll find all instances where any 2 letters are followed by the letter 'e' using the word character:

```
\w\we
```

Note that \w matches letters (both upper & lowercase), numbers, and the underscore character. If we really want just letters, we can use a character class such as `[a-zA-Z]` instead.

## Any word boundary \b

Finding word boundaries manually can be tricky: you have to match spaces, tabs, new lines, periods, commas, etc. Luckily there's the word boundary \b

Let's find all the instances where the 3rd letter is 'e':

```
\b(\w\we)
```

## Zero or more quantifier *

The asterisk matches zero or more occurrences of a character.

Our previous example found the instances where the 3rd letter was 'e', but what if we want to know what words they were? We'll use the zero or more quantifier to find the remaining part of the word:

```
\b(\w\we\w*)
```
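The effect of the word boundary and of the trailing `\w*` can be compared in Python. This sketch joins two of the sample lines into one string; with a single capture group, `re.findall` returns the captured text.

```python
import re

text = "grey gray grzy red green blue yellow"

# Without a boundary, the pattern can match in the middle of a word
anywhere = re.findall(r"\w\we", text)

# With \b, the match must start at the beginning of a word
third_letter_e = re.findall(r"\b(\w\we)", text)

# Adding \w* extends the match to the rest of the word
whole_words = re.findall(r"\b(\w\we\w*)", text)

print(anywhere)
print(third_letter_e)
print(whole_words)
```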

## One or more quantifier +

To get all the words we could try the zero or more pattern:


```
(\w*)
``` 

Notice we get all the words, but our matches also contain empty strings. These are the zero-length matches found between the words.

To make sure at least one letter is present, we can use the one or more quantifier instead:

```
(\w+)
```
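The same comparison can be made in Python. The exact number of empty strings returned for `\w*` can differ between Python versions, so this sketch only checks that some are present.

```python
import re

text = "red green"

# Zero or more: also produces zero-length matches
zero_or_more = re.findall(r"\w*", text)

# One or more: every match contains at least one character
one_or_more = re.findall(r"\w+", text)

print(zero_or_more)
print(one_or_more)
```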

## Anything except character class ^

We know our misspelled word contains no vowels, let's try to isolate that word. The character class can be negated to match anything but the characters listed using the caret:

```
([^aeiou]+)
```

We did isolate everything but the vowel characters, but that also included spaces. We can add the whitespace metacharacter **`\s`** to our list of exceptions:

```
([^aeiou\s]+)
```

A little better, but we're getting partial words too. We can add word boundaries to prevent those:

```
\b([^aeiou\s]+)\b
```

## Matching digits \d

Add the following text to the test string

```
123.456
42
1000000
```

The metacharacter **`\d`** matches only the numeric characters 0 through 9. We'll try it with the one or more quantifier:

```
(\d+)
```

## Character literal \

The previous regex doesn't match decimal values, and we've already seen that the period is a single character wildcard. To find an actual period character we need to "escape" its special meaning to find a literal period. This is done using a backslash:

```
(\.)
```

A decimal number can have digits before and after the decimal point:

```
(\d+\.\d+)
```

But this doesn't match the integers. We can make the decimal point and trailing digits optional:

```
(\d+\.?\d*)
```
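The three number patterns can be compared in Python on the same sample values. This is a small sketch; converting the matches to `float` at the end is an illustrative extra.

```python
import re

text = "123.456 42 1000000"

# Digits only: the decimal number is split in two
digits_only = re.findall(r"\d+", text)

# Requires a decimal point: integers are missed
decimals_only = re.findall(r"\d+\.\d+", text)

# Optional decimal point and fraction: matches both kinds
numbers = re.findall(r"\d+\.?\d*", text)
values = [float(n) for n in numbers]

print(digits_only)
print(decimals_only)
print(values)
```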

## Specifying consecutive matches { }

We can use curly braces { } to specify a specific number of matches. This can also be useful for making shorter, more readable regex patterns. Say we want to match 4 consecutive digits:

```
(\d\d\d\d)
```

Versus:

```
(\d{4})
```

We can specify a lower and upper limit as well:

```
(\d{3,6})
```

And leaving off the maximum gives us just a lower limit:

```
(\d{3,})
```

## Matching the end of a string: $

Use the following test string:

```
abc John Doe
abc def Jane Doe
```

And the following regex:

```
(\w+)\s(\w+)
```

We know the last 2 words are the names but there are differing numbers of preceding words. We can use the $ to specify our regex should match at the end of a string:

```
(\w+)\s(\w+)$
```

Note the end of the first line is not matched. To match multiple lines using $, we need to use the **m modifier**.

## Matching the beginning of a string: ^

Use the following test string:

```
John Doe abc
Jane Doe abc def
```

Similarly we can use the caret **^** to specify the beginning of a line:

```
^(\w+)\s(\w+)
```

Again, we need to use the **m modifier**.
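In Python the "m" modifier corresponds to the `re.MULTILINE` flag. This sketch runs both anchors over the two test strings from this section; with two capture groups, `re.findall` returns a list of tuples.

```python
import re

ends = "abc John Doe\nabc def Jane Doe"
starts = "John Doe abc\nJane Doe abc def"

# Without the flag, $ only matches at the end of the whole string
last_only = re.findall(r"(\w+)\s(\w+)$", ends)

# With re.MULTILINE, $ matches at the end of every line
every_line_end = re.findall(r"(\w+)\s(\w+)$", ends, flags=re.MULTILINE)

# Likewise, ^ matches at the start of every line
every_line_start = re.findall(r"^(\w+)\s(\w+)", starts, flags=re.MULTILINE)

print(last_only)
print(every_line_end)
print(every_line_start)
```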

## Using capture groups for substitution

Keep the above test string and regex and expand the substitution area. 

We can reference our capture groups in order numerically:

```
\2,\1
```
We can "throw away" the extra info using .*:

```
^(\w+)\s(\w+).*
```
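The same substitution can be sketched in Python with `re.sub`, where `\1` and `\2` in the replacement refer to the capture groups. The variable names are only illustrative.

```python
import re

text = "John Doe abc\nJane Doe abc def"

# Swap the two names; the trailing text is left in place
swapped = re.sub(r"^(\w+)\s(\w+)", r"\2,\1", text, flags=re.MULTILINE)

# Adding .* consumes the rest of each line so it is thrown away
names_only = re.sub(r"^(\w+)\s(\w+).*", r"\2,\1", text, flags=re.MULTILINE)

print(swapped)
print(names_only)
```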


## Exercise

Copy and paste the regex_exercise.txt contents to the test string.

```
New York 11-17-2009 1223.0
New York 06-24-2010 1122.7
Chicago 07-24-2009 2819.0
Chicago 08-25-2010 2971.6
New York 01-05-2011 1410.0
Chicago 09-04-2010 4671.6
Chicago 02-25-2012 1099.0
New York 01-01-2013 950.9
New York 07-23-2012 2000.0
Chicago 08-22-2013 3500.4
Chicago 01-02-2014 4510.1
```

Using regex substitution, convert this data to a comma-delimited data set with the following columns:

```
Location, Year, Month, Day, Value
```

## Exercise Solution

First we'll isolate the location. We could try multiple approaches, but there are only 2 values, so it's easy enough to capture them using either/or:

```
(New York|Chicago)
```

Great, now for the space delimiter which we want outside our match:

```
(New York|Chicago)\s
```

Now to start on the month excluding the dash:

```
(New York|Chicago)\s(\d{2})-
```

And the same for the day:

```
(New York|Chicago)\s(\d{2})-(\d{2})-
```

And the 4 digit year:

```
(New York|Chicago)\s(\d{2})-(\d{2})-(\d{4})
```

Another space delimiter and the value with optional decimal:

```
(New York|Chicago)\s(\d{2})-(\d{2})-(\d{4})\s(\d+\.?\d*)
```

We could also start the pattern with a caret, another anchor character, which denotes the beginning of the line. For the online tool, we would then need to add the multiline option, "m", so that the caret matches the beginning of every line, not just the first one.

Each piece we want to keep is surrounded by parentheses. For the delimiter before the value, we use the whitespace character with the one or more quantifier in case the delimiter is more than one space long.

Next, let's tackle the value. It looks like a regular float, which means there's a decimal character. But, the "." character is already used as a wildcard. Anyone know what we can do here? Yep, we can use the backslash to escape the special character's meaning and match a literal ".":

```
\s+(\d+\.?\d*)$
```

We've also handled the case where the value may not have a decimal, making it optional. And in that case the 2nd "\d" covering the fractional part would be absent so we use the zero or more quantifier. Finally, we've used the end of line anchor as another data validation technique.

Looks like we have all of our parts. Let's put it all together and get all the values we need from each data record:

```
(New York|Chicago)\s(\d{2})-(\d{2})-(\d{4})\s(\d+\.?\d*)
```
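
To finish the exercise, the capture groups need to be re-ordered in the substitution. One possible replacement string is `\1,\4,\2,\3,\5`, which is a suggestion rather than the only valid answer. A minimal sketch in Python on the first few records:

```python
import re

# A few records copied from the exercise data
records = """New York 11-17-2009 1223.0
Chicago 07-24-2009 2819.0
New York 01-01-2013 950.9"""

pattern = r"(New York|Chicago)\s(\d{2})-(\d{2})-(\d{4})\s(\d+\.?\d*)"

# Groups: 1 = location, 2 = month, 3 = day, 4 = year, 5 = value
# Target column order: Location, Year, Month, Day, Value
csv_text = re.sub(pattern, r"\1,\4,\2,\3,\5", records)

print(csv_text)
# New York,2009,11,17,1223.0
# Chicago,2009,07,24,2819.0
# New York,2013,01,01,950.9
```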


# Using regex in pandas

Now let's see how we can use regular expressions in pandas. The analyte values in the longitudinal data set actually contain an extra piece of information. At the end of each analyte string is a bead number within parentheses:

In [None]:
analytes = long_data.analyte
analytes.head()

We can use the extract function on the **`str`** attribute. It can take a regular expression as an argument. Note we escape the parentheses:

In [None]:
analytes.str.extract(r'\s*\((\d+)\)$', expand=False)

Since our new Series is still indexed like the original DataFrame, we can simply add the bead number as a new column:

In [None]:
long_data['bead_number'] = analytes.str.extract(r'\s*\((\d+)\)', expand=False)
long_data.head()

The bead number is still in the analyte column. We can use replace to substitute in an empty string:

In [None]:
long_data['analyte'] = analytes.str.replace(r'\s*\((\d+)\)', '', regex=True)
long_data.head()
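
For a version that runs without the workshop files, here is a small sketch of the same two steps. The analyte strings are hypothetical stand-ins for the values in `long_data.analyte`.

```python
import pandas as pd

# Hypothetical analyte strings; the real values come from long_data.analyte
analytes = pd.Series(["IL-6 (25)", "TNF-a (12)", "IFN-g (38)"])

# expand=False returns a Series when the pattern has a single capture group
bead_numbers = analytes.str.extract(r"\s*\((\d+)\)$", expand=False)

# regex=True tells replace to treat the pattern as a regular expression
names = analytes.str.replace(r"\s*\((\d+)\)$", "", regex=True)

print(bead_numbers.tolist())  # ['25', '12', '38']
print(names.tolist())         # ['IL-6', 'TNF-a', 'IFN-g']
```

Note the extracted bead numbers are still strings. If numeric values are needed, `bead_numbers.astype(int)` would convert them.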

Another QC check is for consistent date formatting. Let's look at the visit date format:

In [None]:
visit_dates = long_data.visit_date
visit_dates.head()

Looks like the first few dates are month/day/year. Let's check if all the values use that format:

In [None]:
good_dates = visit_dates.str.contains(r"\d{2}/\d{2}/\d{4}")
visit_dates[good_dates == False]

Looks like we have some inconsistent values. We can use replace to re-order our capture groups. Note we have to escape our backslash characters in the replacement string:

In [None]:
fixed_dates = visit_dates.str.replace(r"(\d{4})/(\d{2})/(\d{2})", "\\2/\\3/\\1", regex=True)
fixed_dates.str.contains(r"\d{4}/").sum()

Looks like that fixed them. One more check to make sure no dates still start with the year:

In [None]:
fixed_dates[fixed_dates.str.contains(r"\d{4}/")]

And to save our fixed dates back to the DataFrame and do a final check:

In [None]:
long_data['visit_date'] = fixed_dates
long_data.visit_date.str.contains(r"\d{2}/\d{2}/\d{4}").sum()

In [None]:
long_data.visit_date[good_dates == False]
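
The whole date clean-up can be sketched on a small, made-up Series. The dates below are hypothetical, and the patterns are anchored with ^ and $ as an extra assumption so that each one must match the whole value.

```python
import pandas as pd

# Hypothetical visit dates mixing month/day/year and year/month/day
visit_dates = pd.Series(["01/15/2014", "2014/02/20", "03/05/2014"])

# Flag values that already use month/day/year
good_dates = visit_dates.str.contains(r"^\d{2}/\d{2}/\d{4}$")

# Re-order the capture groups of year/month/day values.
# A raw replacement string is an alternative to doubling the backslashes.
fixed_dates = visit_dates.str.replace(
    r"^(\d{4})/(\d{2})/(\d{2})$", r"\2/\3/\1", regex=True
)

print(good_dates.tolist())   # [True, False, True]
print(fixed_dates.tolist())  # ['01/15/2014', '02/20/2014', '03/05/2014']
```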

In [None]:
long_data.to_csv("long_data_cleaned.csv")