What is Pandas?
---

From https://pandas.pydata.org/pandas-docs/stable:

> pandas is a Python package providing fast, flexible, and expressive data structures designed to
> make working with “relational” or “labeled” data both easy and intuitive. It aims to be the
> fundamental high-level building block for doing practical, real world data analysis in Python.
> Additionally, it has the broader goal of becoming the most powerful and flexible open source data
> analysis / manipulation tool available in any language. It is already well on its way toward this
> goal.

I tend to use Pandas to work with tabular data, similar to columns and rows in an Excel spreadsheet. However, what do you do if your data no longer fits in an Excel sheet? Use Pandas.

Pandas is built on top of NumPy, a lower-level numerical computing library with a fast multidimensional array object. I decided to forgo an introduction to NumPy because learning Pandas will teach you many of the fundamentals of NumPy. Let's start by looking at the fundamental data structures in Pandas.

Importing Pandas
---

First, you need to `import pandas`. By convention, it is imported with the shorter library name `pd`. That is done with the following syntax:

```python
import <library> as <short name>
```

#### Tasks:

1. Import pandas using the short name convention.

Basic Data Structures
---

Similar to the basic Python data structures (e.g. `list`, `dict`, `set`), Pandas is built on top of three fundamental data structures:

1. `Series`: For one-dimensional data, similar to a `list` or NumPy array
2. `DataFrame`: For two-dimensional data, similar to a `dict` or 2D NumPy array
3. `Index`: Similar to a `Series`, but for naming, selecting, and transforming data within a `Series` or `DataFrame`
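
To make the role of `Index` a little more concrete, here is a minimal sketch. The variable names and values are made up for illustration and are not part of the original notebook. It shows that the row labels of a `Series` and both the row and column labels of a `DataFrame` are `Index` objects:

```python
import pandas as pd

# Illustrative data; names and values are assumptions for this sketch
series = pd.Series([4, 5, 6], index=['a', 'b', 'c'])
frame = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# Row labels of a Series are stored in an Index
print(series.index)

# A DataFrame has one Index for its rows and another for its columns
print(frame.index)
print(frame.columns)
```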

### Series

You can create pandas `Series` in a few ways:

- From a named Python list:
```python
a = ['a', 'b', 'c']
series = pd.Series(a)
```
- Or, from a temporary Python list:
```python
series = pd.Series([4, 5, 6])
```
- Or, using a specific index (similar to a `dict`: the index entries are the keys, the list entries are the values):
```python
series = pd.Series([4, 5, 6], index=['a', 'b', 'c'])
```
- Or, directly from a dictionary (equivalent to the above):
```python
series = pd.Series({'a': 4, 'b': 5, 'c': 6})
```
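
As a quick check of the claim that the last two constructions are equivalent, the following sketch builds both and compares them. The variable names `from_list` and `from_dict` are only illustrative:

```python
import pandas as pd

from_list = pd.Series([4, 5, 6], index=['a', 'b', 'c'])
from_dict = pd.Series({'a': 4, 'b': 5, 'c': 6})

# Both constructions hold the same labels and values
print(from_list.equals(from_dict))

# Values are looked up by index label, much like a dict lookup by key
print(from_list['b'])
```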

### DataFrame

This is the data structure that makes Pandas so powerful. A `DataFrame` is essentially built from many `Series` objects. A `Series` is very similar to a `dict`: the entries of the `index` are the keys, and each key has a value. In a `DataFrame`, the `keys` map to `Series` objects which share a common `index`. An example:

```python
rock_bands = ['Pink Floyd', 'Rush', 'Yes']
year_formed = [1965, 1968, 1968]
location_formed = ['London, England', 'Ontario, Canada', 'London, England']
df = pd.DataFrame({'year_formed': year_formed, 'location_formed': location_formed}, index=rock_bands)
```
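
To see the "keys map to `Series` objects which share a common `index`" idea in action, here is a small sketch that rebuilds the DataFrame above and pulls one column back out. The variable name `years` is an assumption for illustration:

```python
import pandas as pd

rock_bands = ['Pink Floyd', 'Rush', 'Yes']
df = pd.DataFrame(
    {
        'year_formed': [1965, 1968, 1968],
        'location_formed': ['London, England', 'Ontario, Canada', 'London, England'],
    },
    index=rock_bands,
)

# Selecting one key returns a Series
years = df['year_formed']
print(type(years))

# That Series carries the same index as the DataFrame
print(years.index.equals(df.index))

# Label-based lookup on the shared index
print(years['Rush'])
```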

Additionally, `DataFrame`s can be constructed from files! In one of the previous tasks, you were asked to read a bar separated values (bsv) file, parse it manually, and rewrite it to comma separated values (csv). This was a ridiculous task and could have been completed with 2 lines of pandas! As a reminder, these files don't contain headers!

```python
df = pd.read_csv('gpu.bsv', sep='|', header=None)
df.to_csv('gpu.csv', sep=',', header=False, index=False)
```
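
If you don't have `gpu.bsv` at hand, the same round trip can be sketched entirely in memory. The text in `bsv_text` is made up and only stands in for a header-less bar separated file; `io.StringIO` is used here so nothing is read from or written to disk:

```python
import io

import pandas as pd

# Hypothetical stand-in for a header-less bar separated file
bsv_text = "a|1|2.5\nb|2|3.5\n"

df = pd.read_csv(io.StringIO(bsv_text), sep='|', header=None)

# Without a header, pandas labels the columns 0, 1, 2, ...
print(df.columns.tolist())

# Write back out as comma separated text, again without header or index
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, sep=',', header=False, index=False)
print(csv_buffer.getvalue())
```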

### Tasks

1. Use `pd.read_csv` to read in the csv file: `example.csv`. It does not contain a header. Call the variable `df`. By default `pd.read_csv` sets the separator (`sep`) to a comma. Add `names=['col1', 'col2', 'col3']` to the list of arguments.

Viewing DataFrames
---

Jupyter has built in support for viewing `DataFrame` objects in a nice format. Run the following cell:

In [None]:
df

You should have seen 4 rows and 4 columns, counting the header row and the index column. The uppermost row holds the `keys` and the leftmost column holds the `index`. Think of the three data columns as 3 `Series` objects. The following code snippet won't run, but it is meant to help you think about the `DataFrame` above:

```python
df['col1'] = pd.Series([1, 4, 7])
df['col2'] = pd.Series([2, 5, 8])
df['col3'] = pd.Series([3, 6, 9])
```

If you only want to view a subset of your DataFrame, you can use `df.head()`. By default it will print the first 5 rows of your DataFrame. You can change the default with `df.head(n=<number>)`.
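
A minimal sketch of `head` on a DataFrame with more than 5 rows follows. The DataFrame `numbers` and its values are made up for illustration:

```python
import pandas as pd

# Illustrative DataFrame with 10 rows (values are made up for this sketch)
numbers = pd.DataFrame({'n': range(10), 'n_squared': [i ** 2 for i in range(10)]})

# Default: the first 5 rows
print(numbers.head())

# Explicit row count: the first 3 rows
print(numbers.head(n=3))
```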

Tasks
---

1. Try printing the `head` of `df`. Does it look different from the previous code cell? Why?
2. Print only the first row of `df` using `head`

### Access and Types

You can access a column of a DataFrame in 2 ways:

1. `dict` style: `df['<key>']`, where `<key>` in the above `df` could be `col1`, `col2`, or `col3`
2. Or, attribute style: `df.<key>`

You can access the types of columns with `df.dtypes`
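
One caveat worth sketching: attribute style only works when the key is a valid Python name that does not clash with an existing DataFrame attribute. The DataFrame `items` and its column names below are hypothetical and chosen to trigger both problems:

```python
import pandas as pd

# Hypothetical columns: 'count' collides with a DataFrame method,
# and 'unit price' is not a valid Python identifier
items = pd.DataFrame({'count': [1, 2, 3], 'unit price': [0.5, 1.5, 2.5]})

# dict-style access always returns the column
print(items['count'])
print(items['unit price'])

# Attribute-style access finds the method here, not the column
print(callable(items.count))

# dtypes lists the type of every column
print(items.dtypes)
```

When in doubt, the `dict` style is the safer choice.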

### Tasks

1. Access `col2` of `df` using both the `dict` style and attribute style.
2. Why are there 2 columns printed?
3. What is the type of `df.col2`?
4. What are the `dtypes` of `df`?

Slicing and Indexing
---

There are many ways to slice and dice DataFrames. Let's start with the least flexible option, selecting multiple columns. Let's make a new DataFrame in the following cell.

In [None]:
example = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
example.head()

To slice columns `a` and `c` we'll use a syntax similar to the dictionary access shown before, but we will ask for a list of columns instead of a single one, e.g.

In [None]:
example[['a', 'c']]

One can also slice rows using a `list`-like syntax. Note you are __required__ to specify a slice (something containing '`:`'). For example,

In [None]:
# Zeroth row only
example[0:1]

In [None]:
# First row to end
example[1:]

In [None]:
# Every other row
example[::2]

In [None]:
# This will fail with `KeyError`, because you are requesting key based access
example[0]
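
The difference between a slice and a bare integer can be sketched in one self-contained cell. The flag name `key_error_raised` is only illustrative:

```python
import pandas as pd

example = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# A slice selects rows by position
first_row = example[0:1]
print(first_row)

# A bare integer is treated as a column key, and there is no column named 0
try:
    example[0]
    key_error_raised = False
except KeyError as error:
    key_error_raised = True
    print('KeyError:', error)
```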

More Complicated Access Patterns
---

You can narrow down rows and columns using `loc`, some examples:

In [None]:
# Only row 1, columns 'a' and 'c'
example.loc[1:1, ['a', 'c']]

In [None]:
# All rows, columns 'a' to 'b'
example.loc[:, 'a':'b']

In [None]:
# Single row, single column
example.loc[0, 'a']
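
One detail that is easy to miss: slices passed to `loc` are label based and include the stop label, unlike the positional row slices shown earlier. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

example = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Plain slicing is positional and excludes the stop value: one row
positional = example[0:1]

# loc slicing is label based and includes the stop label: two rows
label_based = example.loc[0:1]
print(len(positional), len(label_based))

# The same holds for column labels: 'b' is included
columns_a_to_b = example.loc[:, 'a':'b']
print(columns_a_to_b.columns.tolist())
```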

### Tasks

Using `loc` and the `example` DataFrame,

1. Can you get every other row, columns `b` to `c`?
2. Can you get the last row, all columns?

### Note

`loc` is all about index/key access. What if the indices are characters? Run the following cell and then complete the tasks.

In [None]:
example2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}, index=['A', 'B', 'C'])
example2.head()

### Tasks

Use `loc` and DataFrame `example2`, to

1. Get rows `B` to `C` and columns `a` to `b`.
2. What happens if you try to access the index numerically?

### Continuation

To access `example2` with numerical positions, we need `iloc`.

### Tasks

1. Using `iloc` and `example2`, get rows `B` to `C` and columns `a` to `b`.

### Notes

You can also use the `list` style access I showed before, e.g.

In [None]:
example2.iloc[[1, 2], [0, 1]]

Yet Another Option
---

Another way of accessing is by providing a boolean mask built from a condition, with one `True` or `False` per row. For example, return rows of `example2` where column `a` is greater than or equal to 2.

In [None]:
example2[example2.a >= 2]

How about columns where row `B` is greater than or equal to 5?

In [None]:
example2.loc['B'][example2.loc['B'] >= 5]

How about compound requirements? Rows where column `a` is greater than or equal to 2 and column `b` is less than 6.

In [None]:
example2[(example2.a >= 2) & (example2.b < 6)]
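
To see what is happening under the hood, it can help to look at the condition on its own. The sketch below is self-contained; the variable name `mask` is an assumption for illustration:

```python
import pandas as pd

example2 = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}, index=['A', 'B', 'C']
)

# The condition alone is a Series of booleans, one per row
mask = example2.a >= 2
print(mask)

# Indexing with the mask keeps the rows where it is True
print(example2[mask])

# Masks combine with & (and), | (or), and ~ (not); the parentheses are required
either = example2[(example2.a >= 2) | (example2.b < 5)]
print(either)
print(example2[~mask])
```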

### Notes

Indexing and slicing can be very complicated in Pandas. I think we have dwelled on it long enough.

Doing Stuff with DataFrames
---

Run the following cell:

In [None]:
states = pd.read_csv('states.bsv', sep='|')
states.head()

Let's explore some statistics on this DataFrame, run the following cell:

In [None]:
states.describe()

We can see from `count` that there are 50 states in the DataFrame. The mean population is 6.5 million people and the mean area is 76 thousand square miles.

### Tasks

To get the minimum of a column, you could use `DataFrame.<key>.min()` (`max()` is also an option). Using this information,

1. Find the state with the smallest population
2. Find the state with the largest area

Adding New Data
---

What if you were really interested in the population density, that is population divided by the area?

DataFrames support _vectorized_ operations. Try the following cell:

In [None]:
states.Population / states.Area
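
"Vectorized" here means one expression operates on every row at once, with no explicit loop. The sketch below compares the two approaches on a tiny DataFrame. The name `toy` and its numbers are made up and are not the contents of `states.bsv`:

```python
import pandas as pd

# Made-up numbers, not the contents of states.bsv
toy = pd.DataFrame({'Population': [100, 400, 900], 'Area': [10, 20, 30]})

# Vectorized: one expression divides every row at once
vectorized = toy.Population / toy.Area

# Equivalent explicit loop over paired values
looped = [p / a for p, a in zip(toy.Population, toy.Area)]

print(vectorized.tolist() == looped)
```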

Tasks
---

1. What is the type of `states.Population / states.Area`? Is that surprising?
2. DataFrames are mutable, therefore you can always add a new key with a Series. Store `states.Population / states.Area` in `states` with the key `Density`.
3. Which states are the most/least dense?

Viewing Data
---

Pandas plugs into `matplotlib` very nicely. I am going to iteratively build a plot which is easy to read. First, run the following cell.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
states.plot()

Okay, this is something, but not very helpful. What would we like:

- A separate plot for each column
- X axis should be labeled with the state

In [None]:
ax = states.plot(subplots=True, xticks=states.index)
dummy = ax[0].set_xticklabels(states.State.str.strip())

Notes
---

1. `subplots=True`: separates the 2 plots from one another
2. `xticks=states.index`: sets all of the points on the x-axis
3. `ax = ...`: is an array containing the axes of both plots
4. `ax[0].set_xticklabels`: changes the numeric index to the State name, should only be necessary for the 0th plot
5. `states.State.str.strip()`: is an artifact from the way I made the bsv file, we will discuss string manipulation next
6. `dummy = ...`: I use this to suppress the output from `set_xticklabels`


Neat, but I can't read the labels...

In [None]:
ax = states.plot(subplots=True, xticks=states.index, figsize=(20, 10))
dummy = ax[0].set_xticklabels(states.State.str.strip())

Line plots are a little awkward because each point is independent. We should switch to a bar chart.

In [None]:
ax = states.plot(subplots=True, xticks=states.index, figsize=(20, 10), kind='bar')
dummy = ax[0].set_xticklabels(states.State.str.strip())

Notes
---

- `kind='bar'` will get us the bar chart

Tasks
---

1. You can also call `plot` on `Series`. Call a `Series` style plot on `Density` with appropriate x labels and readable size.

Modifying DataFrames
---

So, we have been calling `states.State.str.strip()` to generate reasonable x labels. What is going on there? Try the following cells:

In [None]:
state_string = states.State.loc[0]
state_string, len(states.State.loc[0])

In [None]:
len('Alabama')

Note, there is a ton of empty space at the end of the state's name for some reason. Let's pretend I didn't do this on purpose and I would like to fix and write a new bsv file. The `str` object has a ton of useful functions for string manipulation. In this case, we want to `strip` off the whitespace, e.g.:

In [None]:
state_string = state_string.strip()
state_string, len(state_string)

Now, I need to do this on the entire DataFrame. How can I do that?

As usual, there are a few options:

- Apply a lambda function to the Series

In [None]:
states.State.apply(lambda s: s.strip()).head()

- __Or__, access the `.str` representation of the Series

In [None]:
states.State.str.strip().head()

### Notes

I use `.head()` above to reduce the amount of output. Both of the above options return a `Series` of type `object`. The latter is preferred for string operations, but it only works for string manipulations! The `apply` + `lambda` syntax is more general. We will explore it more shortly.

The 2 previous cells didn't modify the data in the DataFrame. Each returned a new Series, and we have to do the assignment ourselves.
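
One practical difference between the two options, sketched under the assumption that a column contains a missing value: the `.str` accessor passes missing entries through, while `apply` + `lambda` calls `.strip()` on every entry and fails on the missing one. The Series `names` below is made up for illustration:

```python
import pandas as pd

# Made-up Series with trailing whitespace and one missing value
names = pd.Series(['Alabama   ', None, 'Alaska  '])

# The .str accessor skips the missing entry and strips the rest
stripped = names.str.strip()
print(stripped)

# apply + lambda calls .strip() on every entry, including the missing one
try:
    names.apply(lambda s: s.strip())
    apply_failed = False
except AttributeError:
    apply_failed = True
print(apply_failed)

# Neither call changed the original Series
print(len(names.iloc[0]))
```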

### Tasks

1. Call the `str` style `strip` function on the `State` `Series` and store it back into the DataFrame `states`.

Now, you should be able to make the plot with:

In [None]:
ax = states.Density.plot(xticks=states.index, figsize=(20, 10), kind='bar')
dummy = ax.set_xticklabels(states.State)

Apply + Lambda
---


I want to briefly show you a decent idiom for doing more complicated work on a Series.

This is a contrived example, but it shows the utility of `apply` + `lambda`.

What if we wanted to figure out if all letters A-Z are in the names of the states? First, we could create a `set` of characters in each state's name:

In [None]:
def set_of_chars(s):
    return set(list(s.lower()))

series_of_sets = states.State.apply(lambda s: set_of_chars(s))
series_of_sets.head()

I am going to do something a little fancy here. Functional programming is a powerful tool for something like this. I want to take the `list` of `set`s and convert it to one set. First, to combine two sets: 

In [None]:
a = {1, 2, 3}
b = {2, 3}
a.union(b)

We are going to "reduce" the `list` of `set`s by taking the union of each entry, something like the following:

1. `temporary_set = <zeroth entry>.union(<first entry>)`
2. `temporary_set = temporary_set.union(<second entry>)`
3. `temporary_set = temporary_set.union(<third entry>)`
4. Repeat this until there are no more entries and return `temporary_set`
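
The steps above can be sketched as a plain loop and compared with a single `reduce` call from `functools`. The list `list_of_sets` is a small made-up stand-in for `series_of_sets`:

```python
from functools import reduce

# A small made-up list of sets standing in for series_of_sets
list_of_sets = [{'a', 'l', 'b', 'm'}, {'a', 'l', 's', 'k'}, {'o', 'h', 'i'}]

# Start with the zeroth entry, then fold in each remaining entry
temporary_set = list_of_sets[0]
for entry in list_of_sets[1:]:
    temporary_set = temporary_set.union(entry)

# reduce performs the same fold in one call
reduced = reduce(lambda x, y: x.union(y), list_of_sets)

print(temporary_set == reduced)
```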

We use an anonymous function which takes 2 variables and returns the union between them. We apply that anonymous function to the `series_of_sets` with a call to `reduce`.

In [None]:
from functools import reduce
characters_used_in_states_names = reduce(lambda x, y: x.union(y), series_of_sets)
characters_used_in_states_names

Only one issue! `' '` isn't a character of the alphabet, remove it!

In [None]:
characters_used_in_states_names.remove(' ')

In [None]:
len(characters_used_in_states_names)

One character is missing, which one?

In [None]:
from string import ascii_lowercase

alphabet_set = set(list(ascii_lowercase))
alphabet_set.difference(characters_used_in_states_names)

Notes
---

- `ascii_lowercase` is a string with all characters of the alphabet, i.e. `abcdefghijklmnopqrstuvwxyz`

The concepts of reductions and anonymous functions are defined and applied often in the world of functional programming. Learning functional concepts can greatly reduce the amount of code you write in your research.

Writing Files
---

csv files are a pretty standard way to share data. We can write a DataFrame to a csv file with:

```python
<DataFrame>.to_csv(<filename>, index=False)
```

I tend to ignore the index because by default pandas will create a numeric index for you.
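
To see what dropping the index changes, here is a minimal sketch. The DataFrame `frame` is made up, and the example relies on `to_csv` returning the csv text as a string when no filename is given:

```python
import pandas as pd

# Made-up DataFrame with the default numeric index
frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# With no filename, to_csv returns the csv text instead of writing a file
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

# The first version has an extra, unnamed leading column for the index
print(with_index)
print(without_index)
```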

### Tasks

1. Write your `states` DataFrame to a file called "intro.csv".

Wrapping Up
---

We went over a lot of useful functionality in Pandas to get you started, but honestly there is so much more to cover. I learn something new about Pandas every week. My suggestion is to reach for Pandas by default whenever you work with data and learn on the fly.

### Documentation Resources

- Main documentation page: https://pandas.pydata.org/pandas-docs/stable/
- Working with text data: https://pandas.pydata.org/pandas-docs/stable/text.html
- Indexing and selecting data: https://pandas.pydata.org/pandas-docs/stable/indexing.html
- Computations, statistics, aggregation (reduction with numpy functions): https://pandas.pydata.org/pandas-docs/stable/computation.html
- Visualizations: https://pandas.pydata.org/pandas-docs/stable/visualization.html
