# Python's Data Science Ecosystem

In addition to Python's built-in modules like the ``math`` module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python.
Some of the most important ones are:

#### [``numpy``](http://numpy.org/): Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data.
If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.

#### [``scipy``](http://scipy.org/): Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more.
We will not look closely at Scipy today, but we will use its functionality later in the course.

#### [``pandas``](http://pandas.pydata.org/): Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a *DataFrame*.
If you've used the [R](http://rstats.org) statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

#### [``matplotlib``](http://matplotlib.org): Visualization in Python

Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is currently the most popular data visualization tool in the Python data world (though other recent packages are starting to encroach on its monopoly).

### Loading Data with Pandas

In [None]:
import numpy
numpy.__path__

In [None]:
import pandas

Because we'll use it so much, we often import it under a shortened name using the ``import ... as ...`` pattern:

In [None]:
import pandas as pd

Now we can use the ``read_csv`` command to read the comma-separated-value data:

In [None]:
data = pd.read_csv('/fh/fast/_ADM/SciComp/data/training/2015_trip_data.csv')

* *Note 1: The file path above should work within the FH network. If you are not at the FH, then you can download this data at* <https://s3.amazonaws.com/pronto-data/open_data_year_one.zip>
* *Note 2: strings in Python can be defined either with double quotes or single quotes*
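
If the file is not at hand, ``read_csv`` can still be tried on a small in-memory sample. The sketch below is only an illustration: the three rows are invented, and the header reuses column names that appear later in this notebook rather than reproducing the real file's full header.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real trip file is much larger and
# its exact columns may differ from this illustration.
csv_text = """starttime,tripduration,usertype,gender,birthyear
10/13/2014 10:31,985.9,Annual Member,Male,1960
10/13/2014 10:32,926.4,Annual Member,Female,1970
10/13/2014 17:05,1200.0,Short-Term Pass Holder,,
"""

# read_csv accepts a file path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
```

Empty fields, such as the missing gender and birth year in the last row, are read in as missing values (``NaN``).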

### Viewing Pandas Dataframes

The ``head()`` and ``tail()`` methods show us the first and last few rows of the data:

In [None]:
data.head()

In [None]:
data.tail()

The ``shape`` attribute shows us the number of rows and columns, as a ``(rows, columns)`` tuple:

In [None]:
data.shape

The ``columns`` attribute gives us the column names:

In [None]:
data.columns

The ``index`` attribute gives us the row labels:

In [None]:
data.index

The ``dtypes`` attribute gives the data types of each column:

In [None]:
data.dtypes
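
As a quick check of what these attributes return, here is a minimal sketch on a hypothetical three-row frame (the values are made up and stand in for the trip data):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the trip data
df = pd.DataFrame({
    "tripduration": [985.9, 926.4, 1200.0],
    "usertype": ["Annual Member", "Annual Member", "Short-Term Pass Holder"],
})

n_rows, n_cols = df.shape      # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(list(df.columns))        # column labels
print(df.index)                # row labels; a RangeIndex by default
print(df.dtypes)               # one data type per column
```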

## 4. Manipulating data with ``pandas``

Here we'll cover some key features of manipulating data with pandas.

Access columns by name using square-bracket indexing:

In [None]:
data["usertype"]

Mathematical operations on columns happen *element-wise*:

In [None]:
data['tripduration'] / 60

Columns can be created (or overwritten) with the assignment operator.
Let's create a *tripminutes* column with the number of minutes for each trip:

In [None]:
data['tripminutes'] = data['tripduration'] / 60

In [None]:
data.head()
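
The same two steps can be sketched on a tiny frame, which makes the element-wise behaviour easy to verify by hand. The durations below are made-up values, assumed to be in seconds:

```python
import pandas as pd

# Made-up durations, assumed to be in seconds
df = pd.DataFrame({"tripduration": [600.0, 90.0, 1500.0]})

# Division is applied to every element; the result is a new Series
minutes = df["tripduration"] / 60

# Assigning to a new label adds a column;
# assigning to an existing label would overwrite it
df["tripminutes"] = minutes
print(df)
```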

### Working with Times

One trick to know when working with columns of times is that the Pandas ``DatetimeIndex`` provides a nice interface to them:

In [None]:
times = pd.DatetimeIndex(data['starttime'])

With it, we can extract the hour of the day, the day of the week, the month, and a wide range of other views of the time:

In [None]:
times

In [None]:
times.dayofweek

In [None]:
times.month
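
To see how these attributes are encoded, here is a small sketch on three hypothetical timestamps (not taken from the trip data). Days of the week are numbered from Monday = 0, and months from January = 1:

```python
import pandas as pd

# Hypothetical timestamps stored as strings
raw = pd.Series(["2014-10-13 10:31", "2014-10-18 17:05", "2015-01-01 08:00"])

stamps = pd.DatetimeIndex(raw)
print(stamps.hour)       # hour of day, 0-23
print(stamps.dayofweek)  # Monday=0 ... Sunday=6
print(stamps.month)      # January=1 ... December=12
```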

*Note: mathematical functions from the NumPy package can be applied directly to columns. For example:*

In [None]:
import numpy as np
np.exp(data['tripminutes'])
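
A result that is easier to check by eye comes from ``np.sqrt``. This is a minimal sketch with made-up values; like ``np.exp``, the function is applied element by element and hands back a pandas Series:

```python
import numpy as np
import pandas as pd

minutes = pd.Series([1.0, 4.0, 9.0], name="tripminutes")  # made-up values

roots = np.sqrt(minutes)   # applied element-wise; the result is still a Series
print(roots)
```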

### Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at *value counts* and the basics of *group-by* operations.

#### Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data.
We'll take a look at two of these here.

The ``pandas.value_counts`` function returns the number of times each unique value occurs within a column.

We can use it, for example, to break down rides by gender:

In [None]:
pd.value_counts(data['gender'])

Or to break down rides by birth year (a proxy for age):

In [None]:
pd.value_counts(data['birthyear']).sort_index()

What else might we break down rides by?

In [None]:
pd.value_counts(times.dayofweek)

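*If we don't want the result sorted by count, we can pass ``sort=False``. Note that this does not sort by the index; for that, append ``.sort_index()`` as in the birth-year example above:*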

In [None]:
pd.value_counts(times.dayofweek, sort=False)

In [None]:
pd.value_counts(times.month)

In [None]:
pd.value_counts(times.month, sort=False)
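
The difference between ordering by count and ordering by index is easiest to see on a handful of values. The sketch below uses made-up month numbers and the method form ``Series.value_counts()``, which does the same job as the top-level function (and, as far as I know, is the form newer pandas releases prefer):

```python
import pandas as pd

months = pd.Series([3, 1, 3, 2, 3, 1])  # made-up month numbers

by_count = months.value_counts()               # most frequent value first
by_index = months.value_counts().sort_index()  # ordered by month number
print(by_count)
print(by_index)
```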

#### Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations.
You can visualize the group-by like this (image borrowed from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do)):

In [None]:
from IPython.display import Image
Image('split_apply_combine.png')

So, for example, we can use this to find the average length of a ride as a function of time of day:

In [None]:
data.groupby(times.hour)['tripminutes'].mean()

The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.):

```
<data object>.groupby(<grouping values>).<aggregate>()
```
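
Filled in with concrete pieces, the pattern might look like the sketch below. The frame is hypothetical, and the ``hour`` column stands in for the ``times.hour`` values used above:

```python
import pandas as pd

# Hypothetical data: five rides at two different hours of the day
df = pd.DataFrame({
    "hour": [8, 8, 17, 17, 17],
    "tripminutes": [10.0, 20.0, 5.0, 10.0, 15.0],
})

# Split by hour, select one column, apply an aggregate, combine the results
mean_by_hour = df.groupby("hour")["tripminutes"].mean()
count_by_hour = df.groupby("hour")["tripminutes"].count()
print(mean_by_hour)
print(count_by_hour)
```

Swapping ``mean()`` for another aggregate changes only the last step; the split and combine steps stay the same.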

You can even group by multiple values: for example we can look at the trip duration by time of day and by gender:

In [None]:
grouped = data.groupby([times.hour, 'gender'])['tripminutes'].mean()
grouped

The ``unstack()`` operation can help make sense of this type of multiply-grouped data. What this technically does is split a multiple-valued index into an index plus columns:

In [None]:
grouped.unstack()
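
On a small hypothetical frame the effect of ``unstack()`` is easy to follow: the grouped result has a two-level index (hour, gender), and unstacking moves the inner level into the columns. The values below are made up for illustration:

```python
import pandas as pd

# Hypothetical rides: one per (hour, gender) combination
df = pd.DataFrame({
    "hour": [8, 8, 17, 17],
    "gender": ["Female", "Male", "Female", "Male"],
    "tripminutes": [12.0, 10.0, 20.0, 16.0],
})

# Series indexed by two levels: (hour, gender)
grouped_small = df.groupby(["hour", "gender"])["tripminutes"].mean()

# DataFrame with hour as the index and one column per gender
table = grouped_small.unstack()
print(table)
```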

## 5. Visualizing data with ``pandas``

Of course, looking at tables of data is not very intuitive.
Fortunately Pandas has many useful plotting functions built-in, all of which make use of the ``matplotlib`` library to generate plots.

Whenever you do plotting in the IPython notebook, you will want to first run this *magic command* which configures the notebook to work well with plots:

In [None]:
%matplotlib inline

Now we can simply call the ``plot()`` method of any series or dataframe to get a reasonable view of the data:

In [None]:
data.groupby([times.hour, 'usertype'])['tripminutes'].mean().unstack().plot()

### Adjusting the Plot Style

The default formatting is not very nice; I often make use of the [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) library for better plotting defaults.

This is already installed on the Fred Hutch Jupyterhub, but if you are running your own Jupyter server you will need to install it separately.

In [None]:
import seaborn
seaborn.set()

And now re-run the plot from above:

In [None]:
data.groupby([times.hour, 'usertype'])['tripminutes'].mean().unstack().plot()

### Other plot types

Pandas supports a range of other plotting types; you can find these by using the ``<TAB>`` autocomplete on the ``plot`` method:

In [None]:
data.plot.hist

For example, we can create a histogram of trip durations:

In [None]:
data['tripminutes'].plot.hist(bins=100)

If you'd like to adjust the x and y limits of the plot, you can use the ``set_xlim()`` and ``set_ylim()`` methods of the resulting object:

In [None]:
plot = data['tripminutes'].plot.hist(bins=500)
plot.set_xlim(0, 50)
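
The object returned by ``plot.hist`` is a matplotlib ``Axes``, so its other methods are available too. The following is a minimal sketch, assuming matplotlib is installed; the durations are made up, and the ``Agg`` backend is selected only so that the code also runs outside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Made-up trip lengths in minutes, including one extreme outlier
minutes = pd.Series([5.0, 8.0, 12.0, 15.0, 22.0, 45.0, 300.0])

ax = minutes.plot.hist(bins=50)
ax.set_xlim(0, 50)                        # zoom in; the outlier falls outside the view
ax.set_xlabel("trip length (minutes)")    # label the x axis
```

Restricting the x range this way only changes the view; the outlier is still part of the data and of the binning.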


## Breakout: Exploring the Data

1. Make a plot of the total number of rides as a function of month of the year (You'll need to extract the month, use a ``groupby``, and find the appropriate aggregation to count the number in each group).

2. Split this plot by gender. Do you see any seasonal ridership patterns by gender?

3. Split this plot by user type. Do you see any seasonal ridership patterns by usertype?

4. Repeat the above three steps, counting the number of rides by time of day rather than by month.

5. Are there any other interesting insights you can discover in the data using these tools?