# Getting started with `pandas`

`pandas` is the Python library for working with tabular data, that is, data organised like a "spreadsheet".

It is incredibly powerful, and allows for very terse code.

### Pros
* It’s terse, so you can develop quickly

### Cons
* It’s terse, so it tends to be more write-only code than not

![Power tools ahead](http://49.media.tumblr.com/tumblr_lvck9qVHou1qbvl2io1_400.gif)

In [None]:
%matplotlib inline

In [None]:
# pandas is a library for handling spreadsheets
import pandas

In [None]:
# preferred style (less typing)
import numpy as np
import pandas as pd

In [None]:
# fix random number seed for reproducibility
np.random.seed(42)

In [None]:
# There are many ways to construct a DataFrame
# We pass a dictionary of {column name: column values}
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [True, True, False, True],
                   'C': np.random.randn(4)},
                  # the index gives each row a label (used with .loc below)
                  index=['a', 'b', 'c', 'd'])
df

# Selecting rows and columns

You can select columns via their name.

In [None]:
df['A']

In [None]:
# select multiple columns
columns = ['A', 'C']
df[columns]

In [None]:
# select a row
df.loc['a']

In [None]:
# or multiple rows
df.loc[['a', 'c']]

In [None]:
# or a range of rows
# NB for Python experts: label slices with .loc include the end label!
df.loc['a':'c']

In [None]:
# can also select by position instead of name
df.iloc[0:2]

In [None]:
# also works
df[0:2]

In [None]:
df

In [None]:
# select a specific cell
df.loc['a', 'B']
# row first, then column
#df.loc['B', 'a']

In [None]:
# range of rows, two specific columns
df.loc['a':'c', ['A', 'C']]
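
The difference between label slices and position slices is easy to miss, so here is a minimal sketch of it. The frame `demo` is a hypothetical stand-in for the `df` used above, not part of the original notebook.

```python
import pandas as pd

# Hypothetical frame with the same row labels as the one used above
demo = pd.DataFrame({'A': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

by_label = demo.loc['a':'c']    # label slice: includes the end label 'c'
by_position = demo.iloc[0:2]    # position slice: excludes position 2

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```

So `.loc` follows the labels and keeps both ends, while `.iloc` behaves like ordinary Python list slicing.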

# What is a 1D data frame? A `Series`

In [None]:
df.A

In [None]:
df['A']

In [None]:
df.loc[:, 'A']
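
As a rough illustration of when you get a `Series` and when you get a data frame back, the sketch below uses a small hypothetical frame called `demo`. Selecting with a single name gives a `Series`; selecting with a list of names gives a `DataFrame`, even if the list has only one entry.

```python
import pandas as pd

# Hypothetical example data
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})

col = demo['A']      # single column name -> Series
sub = demo[['A']]    # list of column names -> DataFrame with one column
row = demo.loc[0]    # single row label -> Series indexed by column name

print(type(col).__name__)  # Series
print(type(sub).__name__)  # DataFrame
print(list(row.index))     # ['A', 'B']
```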

# Operating on your data

In [None]:
# fill a dataframe with some random numbers
# three rows, four columns
df = pd.DataFrame(np.random.uniform(0, 10, size=(3, 4)))
df

In [None]:
# add or subtract
df + 1

In [None]:
df - 1

# Wait, what??

Why does it say 2.117??

In [None]:
# to the power of two
df**2

In [None]:
np.sqrt(df**2)

# Combining two (or more) data frames

If your columns are labelled, `pandas` will be smart about it.

In [None]:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [2, 4, 7], 'B': [2, 4, 9]})

df1

In [None]:
df2

In [None]:
df1 + df2

In [None]:
df1 - df2

In [None]:
df1**df2 #!!!

# Unsorted rows? No problem

Because `pandas` matches rows by their index label, the order of the rows doesn't matter.

In [None]:
df2 = pd.DataFrame({'A': [2, 4, 7], 'B': [2, 4, 9]}, index=[2, 0, 1])
df2

In [None]:
df1 + df2
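
To check the alignment by hand, here is a small sketch with two hypothetical one-column frames, `left` and `right`, whose rows are in a different order. Values are paired up by row label, not by position.

```python
import pandas as pd

# Hypothetical frames: same row labels, different row order
left = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'A': [10, 20, 30]}, index=[2, 0, 1])

total = left + right

# label 0: 1 + 20, label 1: 2 + 30, label 2: 3 + 10
print(total.loc[0, 'A'], total.loc[1, 'A'], total.loc[2, 'A'])  # 21 32 13
```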

In [None]:
# If it cannot match up entries by index, it fills the gaps with NaN
df3 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
df3

In [None]:
df2 + df3
# NaN means Not A Number
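
If the `NaN` entries are not what you want, one option is the `add` method with `fill_value`, which substitutes a value for the missing partner before adding. The sketch below uses two hypothetical frames, `small` and `large`. Whether filling with 0 makes sense depends on your data.

```python
import pandas as pd

# Hypothetical frames: `large` has a row that `small` lacks
small = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
large = pd.DataFrame({'A': [10, 20, 30]}, index=[0, 1, 2])

plain = small + large                    # row 2 has no partner -> NaN
filled = small.add(large, fill_value=0)  # treat the missing entry as 0

print(pd.isna(plain.loc[2, 'A']))  # True
print(filled.loc[2, 'A'])          # 30.0
```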

# Switch to cycling data from Zurich!

----

## Cycling data from Montreal

This is cycling data from Montreal.

In [None]:
broken_df = pd.read_csv('bikes.csv', encoding='latin1')
broken_df[:3]

Luckily `pandas` is an expert at reading broken CSV files. In the following we:

* set the separator to be `;`
* set up parsing of dates
* tell it that the dates are dd/mm/yyyy
* ask it to label each row with the date

In [None]:
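df = pd.read_csv('bikes.csv',
                 sep=';',               # columns are separated by semicolons
                 encoding='latin1',
                 parse_dates=['Date'],  # turn the Date column into real dates
                 dayfirst=True,         # dates are dd/mm/yyyy
                 index_col='Date')      # label each row with its date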
df[:3]
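
To see what `dayfirst` changes without needing the file, here is a sketch that reads a tiny in-memory CSV. The two rows in `raw` are made up for illustration and are not taken from `bikes.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row file in the same style as bikes.csv
raw = "Date;Berri 1\n01/02/2012;35\n02/02/2012;83\n"

demo = pd.read_csv(io.StringIO(raw), sep=';', parse_dates=['Date'],
                   dayfirst=True, index_col='Date')

# With dayfirst=True, 01/02/2012 is read as 1 February, not 2 January
print(demo.index[0])  # 2012-02-01 00:00:00
```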

In [None]:
df['Berri 1'].plot()

In [None]:
df.plot(figsize=(15, 10))

# Focus on Berri

Berri is a street in Montreal. I've never been there :(. Let's see if we can create a plot showing
the average number of cyclists on this road for each day of the week.

In [None]:
# take a copy so that adding a column later doesn't trigger a SettingWithCopyWarning
berri_bikes = df[['Berri 1']].copy()

In [None]:
# access the day part of the date of each row
berri_bikes.index.day

In [None]:
# the index also knows which day of the week each date is (0 = Monday, 6 = Sunday)
berri_bikes.index.weekday

In [None]:
# create a new column to explicitly store the day of the week
berri_bikes.loc[:, 'weekday'] = berri_bikes.index.weekday

In [None]:
berri_bikes[:4]

In [None]:
# Let's start with summing riders for each weekday
# groupby() groups rows by the value in the column you name
# by itself it doesn't do much
grouped = berri_bikes.groupby('weekday')
grouped

In [None]:
# you now need to specify how it should combine all the values
grouped.aggregate('sum')

In [None]:
average = grouped.aggregate('mean')
average

In [None]:
# can you remember which number corresponds to which day? I can't
average.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
average.plot(kind='bar', title="average riders per weekday")
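
The whole group-and-average recipe can be tried on data small enough to check by hand. The sketch below uses made-up rider counts (`counts`, `names` and `by_day` are hypothetical names, not from the Montreal data) for 14 consecutive days starting on a Monday.

```python
import pandas as pd

# Hypothetical counts: 0, 1, ..., 13 riders on 14 days from Monday 2012-01-02
dates = pd.date_range('2012-01-02', periods=14, freq='D')
counts = pd.DataFrame({'riders': range(14)}, index=dates)

counts['weekday'] = counts.index.weekday              # 0 = Monday ... 6 = Sunday
average = counts.groupby('weekday')['riders'].mean()  # one mean per weekday

# Replace the weekday numbers with names
names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
         'Friday', 'Saturday', 'Sunday']
by_day = average.rename(index=dict(enumerate(names)))

print(by_day['Monday'])  # 3.5, the mean of 0 and 7
print(by_day['Sunday'])  # 9.5, the mean of 6 and 13
```

Each weekday occurs twice in the made-up data, so every average is the mean of two known numbers.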