# Introduction to pandas

* A Python package for working with multi-dimensional, structured data (e.g. Excel spreadsheets, relational databases)

* Built on top of NumPy so it's fast...but with more convenient data structures

* The main data structure, called a DataFrame, is similar to the data.frame in R

Create a new iPython notebook and rename it **`pandas-intro`**

Conventionally, pandas is imported using the alias **`pd`** because programmers are lazy

You'll often see the commonly used data structures imported separately for even less typing (i.e. avoiding pd.DataFrame)
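A minimal sketch of both import styles, assuming pandas is installed in the notebook's environment:

```python
# Conventional alias for the whole package
import pandas as pd

# Optional: import the two main data structures directly for less typing
from pandas import Series, DataFrame

print(pd.__version__)
```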

# Data Structures


## Series

...similar to a Python list or a single column of a spreadsheet

### Creating a Series

Let's create a new Series from a simple Python list with the values: 

**`815, 364, 2117`**

...a much nicer output vs Python's list

### Custom Index

But, we can make this even better

pandas allows us to specify custom indices

Let's re-create the Series with some made up patient names

**`John, Jane, Joe`**

Note the length of the data and the indices given must be equal:
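A short sketch of the steps above. The values and patient names come from the text; the variable name `plain` is made up for this example:

```python
import pandas as pd

# A Series built from a plain list gets a default 0..n-1 index
plain = pd.Series([815, 364, 2117])
print(plain)

# The same data with a custom index of patient names
baseline = pd.Series([815, 364, 2117], index=["John", "Jane", "Joe"])
print(baseline)

# Three values but only two labels: pandas raises a ValueError
length_error_raised = False
try:
    pd.Series([815, 364, 2117], index=["John", "Jane"])
except ValueError as err:
    length_error_raised = True
    print("ValueError:", err)
```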

### Adding and Removing Items in a Series

To add a new value:

And to remove it:

But, our value wasn't really removed!

Most pandas functions that modify data return a copy by default

We ***could*** assign the copy back to the original variable...

Luckily, many pandas functions have an option to modify data in place
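One way this could look, as a sketch. The patient "Jack" and the value 1200 are made up for illustration:

```python
import pandas as pd

baseline = pd.Series([815, 364, 2117], index=["John", "Jane", "Joe"])

# Add a new value by assigning to a label that doesn't exist yet
baseline["Jack"] = 1200

# drop returns a modified copy; the original still contains "Jack"
without_jack = baseline.drop("Jack")
still_present = "Jack" in baseline.index
print(still_present)

# Option 1: assign the copy back to the original variable
# baseline = baseline.drop("Jack")

# Option 2: modify the Series in place
baseline.drop("Jack", inplace=True)
print(baseline)
```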

### The Series Index

To determine the indices use the **`index`** attribute:

We can also give a more meaningful label name to the index

And to the Series itself
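A small sketch of the three steps. The index name `patients` is used later in the text; the Series name `CD4 baseline` is an assumption:

```python
import pandas as pd

baseline = pd.Series([815, 364, 2117], index=["John", "Jane", "Joe"])

# The index is available as an attribute (no parentheses)
print(baseline.index)

# Name the index and the Series itself
baseline.index.name = "patients"
baseline.name = "CD4 baseline"   # assumed name
print(baseline)
```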

### Selecting Values from a Series

We can use our indices to reference the values:

Regular indexing by position also works:

Slicing works as well:

Retrieving non-successive rows by position or index name:
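A sketch of these selections. Recent pandas versions deprecate plain integer positions such as `baseline[0]` on a labelled Series, so this example uses `iloc` for anything positional:

```python
import pandas as pd

baseline = pd.Series([815, 364, 2117], index=["John", "Jane", "Joe"])

by_label = baseline["Jane"]              # by index label
by_position = baseline.iloc[1]           # by position
first_two = baseline.iloc[0:2]           # positional slice, end excluded
label_slice = baseline["John":"Jane"]    # label slice, end included

# Non-successive rows, by position or by label
picked_by_position = baseline.iloc[[0, 2]]
picked_by_label = baseline[["John", "Joe"]]
print(picked_by_label)
```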

# Exercise

### Create a New Series

Create a new Series called **`followup`** using the following data:

| Index | Value |
|-------|-------|
| Jane  | 448   |
| Joe   | 1959  |
| John  | 792   |

Rename the index to **`patients`**

Rename the Series **`CD4 followup`**

First, make a list of the values:

Create the Series (we can name it at the same time):

Give the index a name:
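A possible solution, using the data from the table above (the variable name `values` is made up):

```python
import pandas as pd

# First, the list of values
values = [448, 1959, 792]

# Create the Series and name it at the same time
followup = pd.Series(values, index=["Jane", "Joe", "John"], name="CD4 followup")

# Give the index a name
followup.index.name = "patients"
print(followup)
```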

### Data Alignment

Let's compute the differences over time

pandas uses the indices to ***align*** data in different series, even though the indices were in a different order

But, what if the 2 Series have non-matching indices?

The new Series is a **union** of the indices

pandas uses the value **`NaN`** (not a number) for the missing data
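A sketch of alignment with matching and non-matching indices. The extra patient "Jack" and the value 1200 are made up:

```python
import pandas as pd

baseline = pd.Series([815, 364, 2117], index=["John", "Jane", "Joe"])
followup = pd.Series([448, 1959, 792], index=["Jane", "Joe", "John"])

# Subtraction is aligned on the index labels, not on position
diff = followup - baseline
print(diff)

# A patient present in only one Series ends up as NaN in the result
baseline_extra = pd.concat([baseline, pd.Series([1200], index=["Jack"])])
diff_extra = followup - baseline_extra
print(diff_extra)
```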

### Filtering

Like with NumPy, we can use boolean arrays for filtering

Comparisons involving **NaN** evaluate to False

Likewise for a "less than" comparison:

Use the boolean array to get the values for the filter

Only the values corresponding to **`True`** are returned

Filter the missing data using the function **`isnull`**

And get the inverse using **`notnull`**

We can also fill in the missing values with a value using **`fillna`**.

**`fillna`** doesn't modify the original Series, but does take an ***`inplace`*** argument

There's also **`dropna`** to remove missing values
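The filtering steps above could look like this. The Series `diff` is rebuilt here with the differences from the CD4 data plus one made-up patient ("Jack") with a missing value:

```python
import numpy as np
import pandas as pd

diff = pd.Series([-23.0, 84.0, -158.0, np.nan],
                 index=["John", "Jane", "Joe", "Jack"])

increased = diff > 0    # boolean Series; the NaN row compares as False
decreased = diff < 0    # also False for the NaN row

print(diff[increased])          # only rows where the mask is True
print(diff[diff.isnull()])      # the missing data
print(diff[diff.notnull()])     # everything else

filled = diff.fillna(0)         # returns a copy; diff is unchanged
cleaned = diff.dropna()         # copy without the missing rows
```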

# Data Structures

## DataFrame

A DataFrame is similar to an Excel spreadsheet, containing both columns and rows

You can think of a DataFrame as a container for multiple Series with a common index

Let's create a DataFrame by concatenating both the baseline and followup Series across the columns (**`axis=1`**):

iPython notebook renders the DataFrame as an HTML table
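A sketch of the concatenation. The Series names (which become the column names) and the fourth patient "Jack" are assumptions:

```python
import pandas as pd

baseline = pd.Series([815, 364, 2117, 1200],
                     index=["John", "Jane", "Joe", "Jack"],
                     name="CD4 baseline")
followup = pd.Series([448, 1959, 792],
                     index=["Jane", "Joe", "John"],
                     name="CD4 followup")

# axis=1 places each Series side by side as a column, aligned on the index
cd4_frame = pd.concat([baseline, followup], axis=1)
print(cd4_frame)
```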

### Axis labelling is tricky

```

+------+-------+-------+
|      | col_A | col_B |
+------+-------+-------+
| Jane |  364  |  448  | -- axis=1 -->
+------+-------+-------+
           |
           | axis=0
           ↓
```

**`axis=1`**

    across the columns (along the row)

----

**`axis=0`**

    across the rows (along the column)

The **`shape`** *attribute* returns the number of rows and columns:

The **`describe`** *method* gives a variety of summary data:
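A sketch showing `shape`, `describe`, and the effect of the axis argument, using `mean` as an example. The frame is rebuilt here so the block runs on its own; "Jack" is a made-up patient:

```python
import numpy as np
import pandas as pd

cd4_frame = pd.DataFrame(
    {"CD4 baseline": [815, 364, 2117, 1200],
     "CD4 followup": [792, 448, 1959, np.nan]},
    index=["John", "Jane", "Joe", "Jack"],
)

print(cd4_frame.shape)        # attribute: (rows, columns)
summary = cd4_frame.describe()   # method: count, mean, std, min, ...
print(summary)

column_means = cd4_frame.mean(axis=0)   # down the rows: one value per column
row_means = cd4_frame.mean(axis=1)      # across the columns: one value per row
```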

### Naming things

Let's rename the columns for easier typing & to remove the spaces:

And, just like with a Series, we can name the DataFrame's index.
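For example (the short column names `baseline` and `followup` are an assumption):

```python
import pandas as pd

cd4_frame = pd.DataFrame(
    {"CD4 baseline": [815, 364, 2117], "CD4 followup": [792, 448, 1959]},
    index=["John", "Jane", "Joe"],
)

# Assign a new list of names, one per column, in order
cd4_frame.columns = ["baseline", "followup"]

# Name the index, just like with a Series
cd4_frame.index.name = "patients"
print(cd4_frame)
```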

### Making Selections

A single column can be extracted in a couple ways

First, by dictionary-like indexing:

Note: DataFrame selections return columns and take column names, whereas selections in a Series take row labels

We'll see how to select DataFrame rows in a bit

A more convenient way to extract a column is by attribute:

Column names containing a space are not available as attributes; you must use dictionary indexing

...another good reason to rename unwieldy column names

A column extracted from a DataFrame is a pandas Series object

Any of the Series methods can be used on the column

Knowing this, we can extract a single "cell" from a DataFrame:

This works with dictionary indexing too:

If all the names are space-free, use the attributes:

We can select multiple columns *and* specify their order using a ***list*** of column names:
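The column selections above, sketched on a small frame (column names `baseline` and `followup` are assumed):

```python
import pandas as pd

cd4_frame = pd.DataFrame(
    {"baseline": [815, 364, 2117], "followup": [792, 448, 1959]},
    index=pd.Index(["John", "Jane", "Joe"], name="patients"),
)

by_key = cd4_frame["baseline"]      # dictionary-like indexing
by_attr = cd4_frame.baseline        # attribute access
print(type(by_attr))                # a pandas Series

# A single "cell": select the column, then the row label
cell_a = cd4_frame.baseline["Jane"]
cell_b = cd4_frame["baseline"]["Jane"]
cell_c = cd4_frame.baseline.Jane

# Multiple columns, in the order given by the list
reordered = cd4_frame[["followup", "baseline"]]
```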

But, be careful when manipulating data extracted from a DataFrame:

The Series extracted from our DataFrame can be a **view** rather than a copy of the data (the exact behaviour depends on the pandas version). If you really want a separate copy, make sure to use **`copy`**:
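A sketch of the safe pattern. Whether changes to an extracted column show up in the original frame depends on the pandas version, but an explicit `copy` is always independent:

```python
import pandas as pd

cd4_frame = pd.DataFrame(
    {"baseline": [815, 364, 2117], "followup": [792, 448, 1959]},
    index=["John", "Jane", "Joe"],
)

# An explicit copy can be modified without touching the DataFrame
baseline_copy = cd4_frame["baseline"].copy()
baseline_copy["Jane"] = 0

print(cd4_frame)   # the baseline column is unchanged
```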

Let's restore our original baseline column:

### Retrieving Rows

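To retrieve an entire DataFrame row, use the **`ix`** attribute (deprecated in pandas 0.20 and removed in 1.0; **`loc`** is the label-based replacement in current versions):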

This also returns a Series object:

And gives us even more options for accessing a single value:

But, **`ix`** does *not* have attributes for the row names, so this won't work:

To get the first 2 rows, we can slice using **`ix`**:

Getting the 2nd and 4th rows:

Selecting multiple rows of a single column:

And selecting multiple rows and multiple columns:
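Since `ix` is no longer available in pandas 1.0 and later, this sketch shows the same row selections with `loc` (labels) and `iloc` (positions). The patient "Jack" is made up:

```python
import numpy as np
import pandas as pd

cd4_frame = pd.DataFrame(
    {"baseline": [815, 364, 2117, 1200], "followup": [792, 448, 1959, np.nan]},
    index=pd.Index(["John", "Jane", "Joe", "Jack"], name="patients"),
)

jane = cd4_frame.loc["Jane"]                  # a whole row, returned as a Series

# Several ways to reach a single value
value_a = cd4_frame.loc["Jane", "baseline"]
value_b = cd4_frame.loc["Jane"]["baseline"]
value_c = cd4_frame.loc["Jane"].baseline

first_two = cd4_frame.iloc[:2]                # first 2 rows
second_and_fourth = cd4_frame.iloc[[1, 3]]    # 2nd and 4th rows

one_column = cd4_frame.loc[["Jane", "Joe"], "baseline"]
block = cd4_frame.loc[["Jane", "Joe"], ["baseline", "followup"]]
```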

### Creating and Deleting Columns

Let's create a new column:

# Exercise

### Create a DataFrame Column

Create a new column called **`percent_change`** in the `cd4_frame` containing the percent change in CD4


Add a new column with the percent change in CD4:

Note you cannot define a new column using an attribute, e.g. **`cd4_frame.some_column = ...`**
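A possible solution. The intermediate column name `change` is an assumption, and percent change is taken here as the change relative to baseline:

```python
import pandas as pd

cd4_frame = pd.DataFrame(
    {"baseline": [815, 364, 2117], "followup": [792, 448, 1959]},
    index=pd.Index(["John", "Jane", "Joe"], name="patients"),
)

# New columns are created with dictionary-style assignment
cd4_frame["change"] = cd4_frame["followup"] - cd4_frame["baseline"]
cd4_frame["percent_change"] = cd4_frame["change"] / cd4_frame["baseline"] * 100
print(cd4_frame)
```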

### Removing Columns

To remove a column we can use **`drop`**

Since **`drop`** is a modification, it returns a copy

It can also remove rows using ***`axis=0`***
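A sketch of dropping a column and a row (the `change` column is an assumed example column):

```python
import pandas as pd

cd4_frame = pd.DataFrame(
    {"baseline": [815, 364, 2117], "followup": [792, 448, 1959]},
    index=["John", "Jane", "Joe"],
)
cd4_frame["change"] = cd4_frame["followup"] - cd4_frame["baseline"]

# drop returns a copy, so the original still has the column
without_change = cd4_frame.drop("change", axis=1)
still_there = "change" in cd4_frame.columns

# Drop the column for real, in place
cd4_frame.drop("change", axis=1, inplace=True)

# axis=0 removes a row by its index label
without_joe = cd4_frame.drop("Joe", axis=0)
```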

### Filtering DataFrames

Filter the whole frame:

**`isnull`** also works on the whole DataFrame:

Or we can filter just a column:

Filtering a text value:

Filtering multiple columns by combining boolean arrays:
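A sketch of these filters. The patient "Jack", the `site` column, and its values are made up so there is a text column to filter on:

```python
import numpy as np
import pandas as pd

cd4_frame = pd.DataFrame(
    {"baseline": [815, 364, 2117, 1200], "followup": [792, 448, 1959, np.nan]},
    index=pd.Index(["John", "Jane", "Joe", "Jack"], name="patients"),
)

above_500 = cd4_frame[cd4_frame > 500]    # whole frame: other cells become NaN
missing = cd4_frame.isnull()              # boolean frame of missing cells
high_followup = cd4_frame[cd4_frame.followup > 500]   # filter rows on one column

# Hypothetical text column
cd4_frame["site"] = ["clinic_a", "clinic_a", "clinic_b", "clinic_b"]
at_clinic_b = cd4_frame[cd4_frame.site == "clinic_b"]

# Combine boolean arrays with & (and) or | (or); the parentheses are required
both = cd4_frame[(cd4_frame.followup > 500) & (cd4_frame.site == "clinic_a")]
```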

### Sorting

Sorting a single column:

And multiple columns:

Sorting by the index:

Sorting the column order by column name:
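A sketch using `sort_values` and `sort_index`, the sorting methods in current pandas versions (older releases used different method names). The `site` column is made up:

```python
import pandas as pd

cd4_frame = pd.DataFrame(
    {"followup": [792, 448, 1959, 1100],
     "baseline": [815, 364, 2117, 1200],
     "site": ["clinic_a", "clinic_a", "clinic_b", "clinic_b"]},
    index=pd.Index(["John", "Jane", "Joe", "Jack"], name="patients"),
)

by_baseline = cd4_frame.sort_values(by="baseline")

# Multiple columns: site ascending, then baseline descending within each site
by_site_then_baseline = cd4_frame.sort_values(
    by=["site", "baseline"], ascending=[True, False]
)

by_index = cd4_frame.sort_index()              # sort rows by index label
by_column_name = cd4_frame.sort_index(axis=1)  # sort the column order by name
```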

# Exporting and Importing CSV Data

Saving our DataFrame to a CSV is easy:

Importing our data back into pandas:

But, there's something different. The original data we exported was indexed by `patients`.

To set the index to an existing column we can use **`set_index`**

Or, we could have specified the index column when importing:

If the text file is not comma delimited, you can specify the separator using ***`sep`***

**`to_csv`** also uses the ***`sep`*** argument. Let's save a tab-delimited version of our data:

Notice the tab delimiter is set using the escape sequence **`\t`**
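A round-trip sketch of the export and import steps. To stay self-contained it writes to in-memory text instead of a file; in the notebook you would pass a filename to `to_csv` and `read_csv` instead:

```python
import io
import pandas as pd

cd4_frame = pd.DataFrame(
    {"baseline": [815, 364], "followup": [792, 448]},
    index=pd.Index(["John", "Jane"], name="patients"),
)

# With no path argument, to_csv returns the CSV text
csv_text = cd4_frame.to_csv()

# Reading it back: the old index is now an ordinary column
reloaded = pd.read_csv(io.StringIO(csv_text))
reindexed = reloaded.set_index("patients")

# Or name the index column while importing
direct = pd.read_csv(io.StringIO(csv_text), index_col="patients")

# Tab-delimited version, written and read with sep
tsv_text = cd4_frame.to_csv(sep="\t")
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col="patients")
```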

A full list of options for **`read_csv`** is available in the docs

# Exercise

Use pandas to import the longitudinal data set in **`long_data.csv`**

1. How many records are in the CSV?
1. Rename any column names containing spaces.
1. Is there a good choice for an index column?
1. Are there any missing data values?
1. What is the lowest FI-Bkgd value? the highest? the mean?
1. Filter for visit 9 records with FI-Bkgd more than 10,000.
1. Make a new DataFrame by filtering on 'SAL2' matching the 'Blank' analyte
1. Use tab completion on your DataFrame to find a function we didn't cover. Print the help for this function using "?".

Q1. How many records are in the CSV?

Read in the CSV, use shape to get the number of records

Q2. Rename any column names containing spaces.

We'll also make all the column names lowercase

Q3. Is there a good choice for an index column?

Not really. There's no single column containing unique values. 

We'll see in the next session how to create an index using multiple columns.

Q4. Are there any missing data values?

There are several ways to determine if a data set contains missing values

We could look column by column

A useful trick is to sum the boolean values returned from **`isnull`**:

This tells us which columns contain null values and how many

But, if multiple columns contained missing values we could find all of the rows using **`any`**:
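A sketch of both tricks on a tiny stand-in frame. The column names and values are made up and are not the contents of `long_data.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the imported data
data = pd.DataFrame({
    "visit": [1, 1, 9],
    "fi-bkgd": [120.5, np.nan, 15000.0],
    "analyte": ["Blank", "IL-6", None],
})

# True counts as 1, so summing gives the number of nulls per column
nulls_per_column = data.isnull().sum()
print(nulls_per_column)

# Rows with a null in any column (axis=1 checks across the columns)
rows_with_nulls = data[data.isnull().any(axis=1)]
print(rows_with_nulls)
```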

Q5. What is the lowest FI-Bkgd value? the highest? the mean?

We could try sorting

But, **`describe`** gives all three:

Or we could get them independently
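For example, on a made-up column of values (the column name `fi-bkgd` assumes the lowercase renaming from Q2):

```python
import pandas as pd

# Hypothetical values, not taken from long_data.csv
data = pd.DataFrame({"fi-bkgd": [120.5, 15000.0, 42.0, 9800.5]})

# Sorting puts the lowest value first and the highest last
sorted_values = data.sort_values(by="fi-bkgd")

# describe includes min, max and mean in one summary
summary = data["fi-bkgd"].describe()

# Or each one independently
lowest = data["fi-bkgd"].min()
highest = data["fi-bkgd"].max()
average = data["fi-bkgd"].mean()
```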

Q6. Filter for visit 9 records with FI-Bkgd more than 10,000

A simple combination of filters

Q7. Make a new DataFrame by filtering on 'SAL2' matching the 'Blank' analyte
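A sketch of Q6 and Q7 on made-up data. It assumes `SAL2` is a value in a `buffer` column and that analyte names may carry extra text after the name; adjust to the real columns in `long_data.csv`:

```python
import pandas as pd

# Hypothetical stand-in for the imported data
data = pd.DataFrame({
    "visit": [9, 9, 3, 9],
    "fi-bkgd": [15000.0, 250.0, 20000.0, 12000.5],
    "buffer": ["SAL2", "SAL2", "PBS", "PBS"],
    "analyte": ["Blank (53)", "IL-6 (25)", "Blank (53)", "IL-6 (25)"],
})

# Q6: visit 9 records with FI-Bkgd above 10,000
q6 = data[(data["visit"] == 9) & (data["fi-bkgd"] > 10000)]

# Q7: a new DataFrame of SAL2 records for the Blank analyte
q7 = data[(data["buffer"] == "SAL2") & data["analyte"].str.contains("Blank")]
```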

Q8. Anyone find a new, useful DataFrame method?

# Basic QC Techniques using Summary Data

- Unique values
- Value Counts
- Duplicates

### Unique values

Finding the unique values can help reveal whether any are missing, and can also help when building a relational DB:

### Value Counts

Looking at the number of occurrences can also help find missing or duplicated data:

### Duplicates

pandas has a convenient way of finding duplicated data:

**`duplicated`** can also take a list of columns:
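The three checks, sketched on made-up records (the column names are assumptions):

```python
import pandas as pd

# Hypothetical records; the first two rows are identical
data = pd.DataFrame({
    "participant": ["P1", "P1", "P2", "P2"],
    "visit": [1, 1, 1, 1],
    "value": [10.0, 10.0, 7.5, 8.0],
})

participants = data["participant"].unique()      # array of distinct values
counts = data["participant"].value_counts()      # occurrences of each value

# Rows that repeat an earlier row in every column
full_duplicates = data.duplicated()

# Rows that repeat an earlier row in just these columns
key_duplicates = data.duplicated(["participant", "visit"])
```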

# Hierarchical Indexing

Our data doesn't have a single column with unique values

pandas allows us to create a hierarchical index using multiple columns

**Note: It's a good idea when using hierarchical indexing to sort the indices**

On older versions of pandas multi-indexing may not work properly for non-sorted DataFrames, and in the newest version indexing may be significantly slower

The same analyte shouldn't occur more than once per participant per visit per buffer

We'll use those four fields to create an index and then sort the DataFrame:

To test if our index is unique:

Now we can filter a little easier:

We can easily swap index levels as well:
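A sketch of the hierarchical index on made-up records. The four column names follow the fields named above, but the exact names and values are assumptions:

```python
import pandas as pd

# Hypothetical records
data = pd.DataFrame({
    "participant": ["P2", "P1", "P1", "P2"],
    "visit": [1, 1, 2, 1],
    "buffer": ["SAL2", "SAL2", "SAL2", "PBS"],
    "analyte": ["Blank", "Blank", "Blank", "Blank"],
    "fi-bkgd": [110.0, 95.5, 130.0, 88.0],
})

# Build the index from four columns, then sort it
indexed = data.set_index(["participant", "visit", "buffer", "analyte"]).sort_index()

is_unique = indexed.index.is_unique

# Filtering by the outer level(s) of the index
p1_rows = indexed.loc["P1"]
p2_visit1 = indexed.loc[("P2", 1)]
one_value = indexed.loc[("P1", 2, "SAL2", "Blank"), "fi-bkgd"]

# Swap two index levels and re-sort
swapped = indexed.swaplevel("participant", "visit").sort_index()
```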

# Regular Expressions (regex)

## What are Regular Expressions & what can we do with them?

  * Funny name: In the 1950s, mathematician Stephen Kleene described regular languages using patterns he called regular expressions

  * Regular expressions are a collection of patterns we can use to process nearly any text


  * Constructed using a combination of metacharacters: characters with a special meaning used to concisely define patterns

Understanding regex is valuable as they can be used in many tools besides Python, such as good text editors and Unix commands. Using a text editor that supports regex can solve many data munging problems without having to write any code at all.

Before we begin using regular expressions in Python let's have an overview using the online regex tool:

https://www.regex101.com/#python

## Global Modifier g

In the "TEST STRING" text box type

```
grey gray
```

Now, in the "REGULAR EXPRESSION" input field type the regular expression:

```
gr
```

Only the first 2 letters of the 1st word are highlighted. To find all occurrences we need to perform a global search. To do this, we need to use a regex modifier. Type the letter "g" in the 2nd input field.

## Capture Groups ( )

Note the helpful explanation and match information on the right hand side. There are no "capture groups" extracted, even though we found a match. To create a capture group use parentheses:

```
(gr)
```

You can have as many capture groups as you want, and even capture strings inside a capture group.

## Capture either or using |

Using a pipe within the capture group we can specify matching on multiple phrases:

```
(grey|gray)
```

## Single character wildcard .

To capture either spelling variation we can use the single character wildcard ".":

```
(gr.y)
```

The single character wildcard matches any character except a new line.

## Character Classes [ ]

The wildcard will also match misspellings. Edit our TEST STRING to:

```
grey gray grzy
```

We can fix this using a character class to match only "e" or "a". Character classes are created using square brackets, 

```
(gr[ea]y)
```

The square brackets match a single character from the set listed inside them (the syntax looks very similar to a Python list)
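The same patterns can be tried in Python with the standard `re` module. In this sketch `re.search` plays the role of the tool without the "g" modifier and `re.findall` plays the role of a global search:

```python
import re

text = "grey gray grzy"

first_only = re.search(r"gr", text).group()     # first match only
all_matches = re.findall(r"gr", text)           # every match

# With one capture group, findall returns the captured text
either = re.findall(r"(grey|gray)", text)       # either-or
wildcard = re.findall(r"(gr.y)", text)          # any single character
char_class = re.findall(r"(gr[ea]y)", text)     # only "e" or "a"

print(either, wildcard, char_class)
```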

## Zero or one quantifier ?

Let's try another word with spelling variations. Add a new line in the TEST STRING:

```
color colour
```

Our "or" approach won't work here, but we can use the "zero or one" quantifier "?":

```
colou?r
```

## Any word character \w

Sometimes we may not know all the combinations of letters. In this case we can use the word character \w.

Add another line to our sample text:

```
red green blue yellow
```

And we'll find all instances where any 2 letters are followed by the letter 'e' using the word character:

```
\w\we
```

Note that \w matches letters (both upper & lowercase), numbers, and the underscore character. If we really want just letters, we can use a character class such as [a-zA-Z] instead.

## Any word boundary \b

Finding word boundaries manually can be tricky: you have to match spaces, tabs, new lines, periods, commas, etc. Luckily there's the word boundary \b

Let's find all the instances where the 3rd letter is 'e':

```
\b(\w\we)
```

## Zero or more quantifier *

The asterisk matches zero or more occurrences of a character.

Our previous example found the instances where the 3rd letter was 'e', but what if we want to know what words they were? We'll use the zero or more quantifier to find the remaining part of the word:

```
\b(\w\we\w*)
```
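The last few patterns, run in Python against the full test string built up so far (a sketch using the standard `re` module):

```python
import re

text = "grey gray grzy\ncolor colour\nred green blue yellow"

optional_u = re.findall(r"colou?r", text)        # "u" zero or one time
any_two_then_e = re.findall(r"\w\we", text)      # anywhere inside a word
third_letter_e = re.findall(r"\b(\w\we)", text)  # only at the start of a word
whole_words = re.findall(r"\b(\w\we\w*)", text)  # plus the rest of the word

print(optional_u, any_two_then_e, third_letter_e, whole_words)
```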

## One or more quantifier +

To get all the words we could try the zero or more pattern:


```
(\w*)
``` 

Notice we get all the words but our matches also contain empty strings. These matches are the "zero" length strings between each word.

To make sure at least one letter is present, we can use the one or more quantifier instead:

```
(\w+)
```
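A quick comparison of the two quantifiers in Python. The exact number of empty strings returned by `\w*` can differ between Python versions, so this sketch just filters them out:

```python
import re

text = "red green blue yellow"

zero_or_more = re.findall(r"(\w*)", text)   # includes empty matches
one_or_more = re.findall(r"(\w+)", text)    # at least one character each

has_empty = "" in zero_or_more
non_empty = [match for match in zero_or_more if match]
print(zero_or_more)
print(one_or_more)
```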

## Anything except character class ^

We know our misspelled word contains no vowels, so let's try to isolate that word. The character class can be negated to match anything but the characters listed using the caret:

```
([^aeiou]+)
```

We did isolate everything but the vowel characters, but that also included spaces. We can add the whitespace metacharacter to our list of exceptions:

```
([^aeiou\s]+)
```

A little better, but we're getting partial words too. We can add word boundaries to prevent those:

```
\b([^aeiou\s]+)\b
```
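The three steps in Python, as a sketch on the first line of the test string:

```python
import re

text = "grey gray grzy"

no_vowels = re.findall(r"([^aeiou]+)", text)             # spaces still match
no_vowels_or_space = re.findall(r"([^aeiou\s]+)", text)  # partial words remain
whole_word_only = re.findall(r"\b([^aeiou\s]+)\b", text) # bounded on both sides

print(no_vowels)
print(no_vowels_or_space)
print(whole_word_only)
```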

## Matching digits \d

Add the following text to the test string

```
123.456
42
1000000
```

The metacharacter **`\d`** matches only the numeric characters 0 through 9. We'll try it with the one or more quantifier:

```
(\d+)
```

## Character literal \

The previous regex doesn't match decimal values, and we've already seen that the period is a single character wildcard. To find an actual period character we need to "escape" the regex language to find a literal period. This is done using a backslash:

```
(\.)
```

A decimal number can have digits before and after the decimal point:

```
(\d+\.\d+)
```

But this doesn't match the integers. We can make the decimal point and trailing digits optional:

```
(\d+\.?\d*)
```
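The digit patterns in Python, as a sketch on the three numbers added above:

```python
import re

text = "123.456\n42\n1000000"

digits = re.findall(r"(\d+)", text)                  # runs of digits only
periods = re.findall(r"(\.)", text)                  # a literal period
decimals = re.findall(r"(\d+\.\d+)", text)           # digits.digits only
any_number = re.findall(r"(\d+\.?\d*)", text)        # decimal part optional

print(digits, periods, decimals, any_number)
```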

## Specifying consecutive matches { }

We can use curly braces { } to specify a specific number of matches. This can also be useful for making shorter, more readable regex patterns. Say we want to match 4 consecutive digits:

```
(\d\d\d\d)
```

Versus:

```
(\d{4})
```

We can specify a lower and upper limit as well:

```
(\d{3,6})
```

And leaving off the maximum gives us just a lower limit:

```
(\d{3,})
```
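A sketch of the curly brace quantifiers in Python (the test string here is made up for this example). Note there is no space inside the braces:

```python
import re

text = "42 123 2009 1000000"

exactly_four = re.findall(r"(\d{4})", text)      # same as \d\d\d\d
three_to_six = re.findall(r"(\d{3,6})", text)    # lower and upper limit
three_or_more = re.findall(r"(\d{3,})", text)    # lower limit only

print(exactly_four, three_to_six, three_or_more)
```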

## Matching the end of a string: $

Use the following test string:

```
abc John Doe
abc def Jane Doe
```

And the following regex:

```
(\w+)\s(\w+)
```

We know the last 2 words are the names but there are differing numbers of preceding words. We can use the $ to specify our regex should match at the end of a string:

```
(\w+)\s(\w+)$
```

Note the end of the first line is not matched. To match multiple lines using $, we need to use the **m modifier**.

## Matching the beginning of a string: ^

Use the following test string:

```
John Doe abc
Jane Doe abc def
```

Similarly we can use the caret **^** to specify the beginning of a line:

```
^(\w+)\s(\w+)
```

Again, we need to use the **m modifier**.
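In Python the "m" modifier corresponds to the `re.MULTILINE` flag. A sketch with both test strings:

```python
import re

ends = "abc John Doe\nabc def Jane Doe"
starts = "John Doe abc\nJane Doe abc def"

# Without the flag, $ and ^ only anchor to the whole string
end_default = re.findall(r"(\w+)\s(\w+)$", ends)
start_default = re.findall(r"^(\w+)\s(\w+)", starts)

# With re.MULTILINE they anchor to every line
end_multiline = re.findall(r"(\w+)\s(\w+)$", ends, flags=re.MULTILINE)
start_multiline = re.findall(r"^(\w+)\s(\w+)", starts, flags=re.MULTILINE)

print(end_multiline, start_multiline)
```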

## Using capture groups for substitution

Keep the above test string and regex and expand the substitution area. 

We can reference our capture groups in order numerically:

```
\2,\1
```

We can "throw away" the extra info by also matching it with .*:

```
^(\w+)\s(\w+).*
```
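The same substitution in Python uses `re.sub`, with the capture groups referenced as `\1` and `\2` in a raw replacement string (a sketch):

```python
import re

text = "John Doe abc\nJane Doe abc def"

# Swap the two names; the rest of each line is left in place
swapped = re.sub(r"^(\w+)\s(\w+)", r"\2,\1", text, flags=re.MULTILINE)

# Also match the rest of the line with .* so the substitution drops it
trimmed = re.sub(r"^(\w+)\s(\w+).*", r"\2,\1", text, flags=re.MULTILINE)

print(swapped)
print(trimmed)
```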


## Exercise

Copy and paste the regex_exercise.txt contents to the test string.

```
New York 11-17-2009 1223.0
New York 06-24-2010 1122.7
Chicago 07-24-2009 2819.0
Chicago 08-25-2010 2971.6
New York 01-05-2011 1410.0
Chicago 09-04-2010 4671.6
Chicago 02-25-2012 1099.0
New York 01-01-2013 950.9
New York 07-23-2012 2000.0
Chicago 08-22-2013 3500.4
Chicago 01-02-2014 4510.1
```

Using regex substitution, convert this data to a comma-delimited data set with the following columns:

```
Location, Year, Month, Day, Value
```

## Exercise Solution

First we'll isolate the location. We could try multiple approaches, but we see there are only 2 values, so that's easy enough to capture using either-or:

```
(New York|Chicago)
```

Great, now for the space delimiter which we want outside our match:

```
(New York|Chicago)\s
```

Now to start on the month excluding the dash:

```
(New York|Chicago)\s(\d{2})-
```

And the same for the day:

```
(New York|Chicago)\s(\d{2})-(\d{2})-
```

And the 4 digit year:

```
(New York|Chicago)\s(\d{2})-(\d{2})-(\d{4})
```

Another space delimiter and the value with optional decimal:

```
(New York|Chicago)\s(\d{2})-(\d{2})-(\d{4})\s(\d+\.?\d*)
```

To make the pattern stricter we could prefix it with the caret, the anchor character denoting the beginning of the line. For the online tool, we would then need to add the multiline option, "m", so that the caret matches the beginning of every line, not just the first one.

If the locations were single words, we could instead capture them with the word character and the one or more quantifier, surrounded by parentheses, followed by a whitespace character with the one or more quantifier in case the delimiter is more than one space long.

Next, let's take a closer look at the value. It looks like a regular float, which means there's a decimal character. But, the "." character is already used as a wild card. Anyone know what we can do here? Yep, we can use the backslash to escape the special character's meaning and match a literal ".":

```
\s+(\d+\.?\d*)$
```

We've also handled the case where the value may not have a decimal, making it optional. And in that case the 2nd "\d" covering the fractional part would be absent so we use the zero or more quantifier. Finally, we've used the end of line anchor as another data validation technique.

Looks like we have all of our parts, let's put it all together and get all the values we need from the data record:

```
(New York|Chicago)\s(\d{2})-(\d{2})-(\d{4})\s(\d+\.?\d*)
```
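To finish the exercise, the substitution string reorders the five capture groups into Location, Year, Month, Day, Value. A sketch in Python on the first and third records:

```python
import re

records = "New York 11-17-2009 1223.0\nChicago 07-24-2009 2819.0"

pattern = r"(New York|Chicago)\s(\d{2})-(\d{2})-(\d{4})\s(\d+\.?\d*)"

# Groups: 1=location, 2=month, 3=day, 4=year, 5=value
converted = re.sub(pattern, r"\1,\4,\2,\3,\5", records)
print(converted)
```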


# Using regex in pandas

Now let's see how we can use regular expressions in pandas. The analyte values in the longitudinal data set actually contain an extra piece of information. At the end of each analyte string is a bead number within parentheses:

We can use the extract function on the **`str`** attribute. It can take a regular expression as an argument. Note we escape the parentheses:

Since our new Series is still indexed like the original DataFrame, we can simply add the bead number as a new column:

The bead number is still in the analyte column. We can use replace to substitute in an empty string:
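A sketch of both steps on made-up analyte strings. The real values in `long_data.csv` may look different, and the `bead` column name is an assumption:

```python
import pandas as pd

# Hypothetical analyte values with a bead number in parentheses
frame = pd.DataFrame({"analyte": ["Blank (12)", "IL-6 (25)", "TNF-a (43)"]})

# Escaped parentheses match literally; the inner group captures the digits
bead = frame["analyte"].str.extract(r"\((\d+)\)", expand=False)

# Same index as the frame, so it can be added directly as a column
frame["bead"] = bead

# Replace the bead number (and the space before it) with an empty string
frame["analyte"] = frame["analyte"].str.replace(r"\s*\(\d+\)$", "", regex=True)
print(frame)
```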

Another QC check for consistent date formatting. Let's look at the visit date format:

Looks like the first few dates are month/day/year. Let's check if all the values use that format:

Looks like we have some inconsistent values. We can use replace to re-order our capture groups. Note we have to escape our backslash characters in the replacement string

Looks like that fixed them. One more check to make sure all the dates start with the year:

And to save our fixed dates back to the DataFrame and do a final check:
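A sketch of the date clean-up on made-up values. The two date formats and the year-first target format are assumptions about the data. Using a raw string for the replacement avoids having to double the backslashes:

```python
import pandas as pd

# Hypothetical visit dates in two different formats
dates = pd.Series(["11/17/2009", "2010-06-24", "07/24/2009"])

month_day_year = r"^(\d{2})/(\d{2})/(\d{4})$"

# Which values use month/day/year?
is_mdy = dates.str.match(month_day_year)

# Re-order the capture groups to year-month-day
fixed = dates.str.replace(month_day_year, r"\3-\1-\2", regex=True)

# Final check: every date now starts with a 4 digit year
all_year_first = fixed.str.match(r"^\d{4}-").all()
print(fixed)
```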

Finally, we'll export the cleaned data to CSV: