Pandas is a Python package providing data-analysis tools.

http://pandas.pydata.org/

Official Documentation

http://pandas.pydata.org/pandas-docs/stable/

Installation options:
    
    pip install pandas
    
It is also included in standard scientific Python distributions:

https://store.continuum.io/cshop/anaconda/

https://www.enthought.com/products/epd/

In [3]:
# import numpy and pandas packages
import numpy as np
import pandas as pd

Most of the utility of Pandas comes from the DataFrame object.

A DataFrame is a 2D table with named columns. It is similar to the data.frame structure in R, but it also has 'pythonic' functionality built in.

In [4]:
# create an empty DataFrame object
df = pd.DataFrame()
type(df)

pandas.core.frame.DataFrame
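
A DataFrame does not have to start out empty. As a minimal sketch, one can be built from a dictionary that maps column names to lists of values. The column names and numbers below are made up for illustration and are not part of the iris data.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column name
toy = pd.DataFrame({
    'length': [5.1, 4.9, 4.7],
    'width': [3.5, 3.0, 3.2],
})

print(toy.shape)           # (number of rows, number of columns)
print(list(toy.columns))   # the column names
```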

Pandas provides easy ways to load in a dataset from common data sources.

Here I load a CSV file of the popular iris 'practice' dataset from the UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml/datasets/Iris

Don't worry: no machine learning will be taught today ;)

In [5]:
# load data from csv to DataFrame
# (DataFrame.from_csv was removed in pandas 1.0; there, use pd.read_csv('iris.csv', index_col=0))
df = pd.DataFrame.from_csv('iris.csv')

# display the first part of the dataset
df.head()

sepal_length,sepal_width,petal_length,petal_width,class_name
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa


To quickly see what the data looks like, use

    DataFrame.head()
    
and    

    DataFrame.tail() 
    
These display the first and last rows of the DataFrame, respectively.
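
Both methods accept an optional number of rows to show. The sketch below uses a small made-up frame as a stand-in for the iris data, so the values themselves are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the iris data: ten rows of made-up numbers
demo = pd.DataFrame({'value': range(10)})

first_three = demo.head(3)   # the first three rows
last_two = demo.tail(2)      # the last two rows
print(first_three)
print(last_two)
```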

We can also just let Python display the full DataFrame, but this begins to clutter the screen.

In [24]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name,new_field
0,5.1,3.5,1.4,0.2,Iris-setosa,0
1,4.9,3.0,1.4,0.2,Iris-setosa,0
2,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5.0,3.6,1.4,0.2,Iris-setosa,0
5,5.4,3.9,1.7,0.4,Iris-setosa,0
6,4.6,3.4,1.4,0.3,Iris-setosa,0
7,5.0,3.4,1.5,0.2,Iris-setosa,0
8,4.4,2.9,1.4,0.2,Iris-setosa,0
9,4.9,3.1,1.5,0.1,Iris-setosa,0


We can also very quickly get some summary statistics of the dataframe.

In [6]:
df.describe()

Unnamed: 0,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0
mean,3.054,3.758667,1.198667
std,0.433594,1.76442,0.763161
min,2.0,1.0,0.1
25%,2.8,1.6,0.3
50%,3.0,4.35,1.3
75%,3.3,5.1,1.8
max,4.4,6.9,2.5
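
Note that the summary only covers numeric columns by default. A minimal sketch of this behaviour, using a made-up frame with one numeric and one text column:

```python
import pandas as pd

# Made-up data: one numeric column and one text column
demo = pd.DataFrame({
    'petal_width': [0.2, 0.2, 1.3, 2.5],
    'class_name': ['a', 'a', 'b', 'c'],
})

summary = demo.describe()
print(summary.columns.tolist())             # the text column is left out
print(summary.loc['mean', 'petal_width'])   # one statistic for one column
```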


Notice that all of these functions are called as methods of the ```DataFrame``` object. This should be familiar to those of you with a Python background.

By default, ```from_csv``` treats the first column differently:

it becomes the index column, which is always included in the DataFrame. This is why ```sepal_length``` is missing from the summary statistics.

Since the first column isn't particularly special here, we may instead want to use a default integer index.

In [7]:
# reload the data, this time with a default integer index
# (with pandas 1.0 or later, pd.read_csv('iris.csv') does this)
df = pd.DataFrame.from_csv('iris.csv',index_col=False)

df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
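
In recent pandas releases ```DataFrame.from_csv``` is no longer available and ```pd.read_csv``` is the usual entry point. The following sketch shows the two index behaviours side by side. To stay self-contained it reads from a short made-up CSV string through ```io.StringIO``` rather than from ```iris.csv```.

```python
import io
import pandas as pd

# Hypothetical three-line CSV standing in for iris.csv
csv_text = "sepal_length,sepal_width\n5.1,3.5\n4.9,3.0\n"

# index_col=0 uses the first column as the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# without index_col, read_csv builds an integer index and keeps every column
plain = pd.read_csv(io.StringIO(csv_text))

print(indexed.index.name)
print(list(plain.columns))
```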


We can index the data in a variety of ways. 

In [8]:
# select a particular column by key. (like a dict object)
width = df['sepal_width']
width.head()

0    3.5
1    3.0
2    3.2
3    3.1
4    3.6
Name: sepal_width, dtype: float64

The selected column is a pandas ```Series``` object.

In [9]:
type(width)

pandas.core.series.Series

In [10]:
# or in a more pythonic way, as an attribute
width = df.sepal_width
width.tail()

145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: sepal_width, dtype: float64

In [23]:
# or a subset of the columns using a list
widths = df[['sepal_width','petal_width','class_name']]
widths.head()

Unnamed: 0,sepal_width,petal_width,class_name
0,3.5,0.2,Iris-setosa
1,3.0,0.2,Iris-setosa
2,3.2,0.2,Iris-setosa
3,3.1,0.2,Iris-setosa
4,3.6,0.2,Iris-setosa
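
Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame method. The sketch below uses made-up column names to show two cases where key access is the safer choice.

```python
import pandas as pd

# Made-up frame: one name contains a space, the other clashes with a method
demo = pd.DataFrame({'sepal width': [3.5, 3.0], 'count': [1, 2]})

widths = demo['sepal width']   # key access works for any column name
counts = demo['count']         # demo.count would be the DataFrame.count method

print(callable(demo.count))    # the attribute is the method, not the column
print(counts.tolist())
```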


In [11]:
# the index column is special
df.index

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [12]:
# selecting rows
df[10:15]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name
10,5.4,3.7,1.5,0.2,Iris-setosa
11,4.8,3.4,1.6,0.2,Iris-setosa
12,4.8,3.0,1.4,0.1,Iris-setosa
13,4.3,3.0,1.1,0.1,Iris-setosa
14,5.8,4.0,1.2,0.2,Iris-setosa


If you have worked with numpy arrays before, then row indexing should look familiar.

In fact, the pandas ```DataFrame``` is built upon numpy arrays, so a lot of functionality carries over.

In [13]:
# index single value like a numpy array
df.iloc[10,2]

1.5

In [14]:
# Series behave like 1D numpy arrays
df.class_name[:5]

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: class_name, dtype: object

Each pandas column (or ```Series```) holds entries of a single type of data. These are denoted by numpy ```dtype```s.

In our example we can see the ```dtype```s, along with other useful information, using ```info()```.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
class_name      150 non-null object
dtypes: float64(4), object(1)
memory usage: 7.0+ KB


```df.class_name``` has a different dtype than the other columns since it holds non-numeric data.

In [16]:
# return only dtypes
df.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
class_name       object
dtype: object

Note that we never explicitly set these. Pandas will try to do this automatically when you import data.
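
When the inferred type is not the one you want, a column can be converted explicitly with ```astype()```. A minimal sketch, reading a made-up CSV string so that the inference step is visible:

```python
import io
import pandas as pd

# Hypothetical CSV text with a decimal, a text, and an integer column
csv_text = "width,label,count\n3.5,a,1\n3.0,b,2\n"
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.dtypes)                     # types inferred while reading

count_before = demo['count'].dtype     # integer type inferred by pandas
demo['count'] = demo['count'].astype(float)   # explicit conversion
print(demo['count'].dtype)
```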

Adding to a DataFrame

We can create a new column for our DataFrame.

In [35]:
df['new_field'] = 0.
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name,new_field
0,5.1,3.5,1.4,0.2,Iris-setosa,0
1,4.9,3.0,1.4,0.2,Iris-setosa,0
2,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5.0,3.6,1.4,0.2,Iris-setosa,0


In [32]:
# This is not a valid way of adding a column:
# it only sets an attribute on the Python object, so no column is created
df.new_field2 = 0.
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name,new_field
0,5.1,3.5,1.4,0.2,Iris-setosa,0
1,4.9,3.0,1.4,0.2,Iris-setosa,0
2,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5.0,3.6,1.4,0.2,Iris-setosa,0
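
The difference between the two kinds of assignment can be checked directly. This sketch uses a small made-up frame: attribute assignment leaves the columns untouched, while key assignment creates the column.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3]})   # made-up data

demo.new_field2 = 0.                    # sets a plain Python attribute
before = 'new_field2' in demo.columns
print(before)                           # False

demo['new_field2'] = 0.                 # key assignment creates the column
after = 'new_field2' in demo.columns
print(after)                            # True
```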


We can delete a column using ```drop()```

In [37]:
df.drop('new_field', axis=1)
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name,new_field
0,5.1,3.5,1.4,0.2,Iris-setosa,0
1,4.9,3.0,1.4,0.2,Iris-setosa,0
2,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5.0,3.6,1.4,0.2,Iris-setosa,0


Why didn't this work?

Functions that modify a DataFrame return a modified copy by default and leave the original unchanged.

So we need to do

In [None]:
df = df.drop('new_field', axis=1)
df.head()

or use the ```inplace``` option

In [38]:
df.drop('new_field', axis=1, inplace=True)
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
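
The copy behaviour is easy to verify on a small made-up frame. The sketch below also uses the ```columns``` keyword of ```drop()```, which is an alternative to passing ```axis=1```.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})   # made-up data

trimmed = demo.drop(columns=['b'])   # returns a new DataFrame
print(list(demo.columns))            # the original still has 'b'
print(list(trimmed.columns))         # the copy does not
```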


It is straightforward to build new columns using mathematical operations on existing columns.

In [39]:
df['sepal_area'] = df.sepal_length * df.sepal_width
df['petal_area'] = df.petal_length * df.petal_width
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name,sepal_area,petal_area
0,5.1,3.5,1.4,0.2,Iris-setosa,17.85,0.28
1,4.9,3.0,1.4,0.2,Iris-setosa,14.7,0.28
2,4.7,3.2,1.3,0.2,Iris-setosa,15.04,0.26
3,4.6,3.1,1.5,0.2,Iris-setosa,14.26,0.3
4,5.0,3.6,1.4,0.2,Iris-setosa,18.0,0.28
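
These operations are applied element by element, row for row, and NumPy functions accept whole columns as well. A minimal sketch on made-up numbers follows; the ```diagonal``` column is a hypothetical example and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'length': [3.0, 6.0], 'width': [4.0, 8.0]})   # made-up data

# arithmetic between columns works row by row
demo['area'] = demo.length * demo.width

# NumPy functions can be applied to whole columns too
demo['diagonal'] = np.sqrt(demo.length**2 + demo.width**2)
print(demo)
```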


More complicated functions can be applied to a column using ```apply()```.

For example, to convert a text field to lowercase:

In [42]:
df.class_name = df.class_name.apply(lambda x: str.lower(x))
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name,sepal_area,petal_area
0,5.1,3.5,1.4,0.2,iris-setosa,17.85,0.28
1,4.9,3.0,1.4,0.2,iris-setosa,14.7,0.28
2,4.7,3.2,1.3,0.2,iris-setosa,15.04,0.26
3,4.6,3.1,1.5,0.2,iris-setosa,14.26,0.3
4,5.0,3.6,1.4,0.2,iris-setosa,18.0,0.28


```lambda``` is a convenient way to write a short 'anonymous' function, but this also works with normal function definitions.

In [44]:
def caps(x):
    # convert a single string to uppercase
    return str.upper(x)

df.class_name = df.class_name.apply(caps)
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name,sepal_area,petal_area
0,5.1,3.5,1.4,0.2,IRIS-SETOSA,17.85,0.28
1,4.9,3.0,1.4,0.2,IRIS-SETOSA,14.7,0.28
2,4.7,3.2,1.3,0.2,IRIS-SETOSA,15.04,0.26
3,4.6,3.1,1.5,0.2,IRIS-SETOSA,14.26,0.3
4,5.0,3.6,1.4,0.2,IRIS-SETOSA,18.0,0.28
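
For common text operations pandas also offers vectorised string methods through the ```.str``` accessor, which can replace ```apply()``` in simple cases like this one. A short sketch on a made-up ```Series```:

```python
import pandas as pd

names = pd.Series(['Iris-setosa', 'Iris-virginica'])   # made-up sample values

lowered = names.str.lower()          # vectorised string method
via_apply = names.apply(str.lower)   # same values, computed through apply()
print(lowered.tolist())
```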


There are a variety of useful things you can do with 'classification' columns.

In [48]:
# Return all unique values in class_name
df.class_name.unique()

array(['IRIS-SETOSA', 'IRIS-VERSICOLOR', 'IRIS-VIRGINICA'], dtype=object)
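
Beyond listing the distinct values, it is often useful to know how many rows fall into each class. One way to get this is ```value_counts()```, sketched here on made-up labels:

```python
import pandas as pd

labels = pd.Series(['a', 'b', 'a', 'c', 'a'])   # made-up class labels

counts = labels.value_counts()   # rows per distinct value, most frequent first
print(counts)
print(labels.nunique())          # number of distinct values
```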

In [52]:
# select only the rows whose class_name matches a given value
df_virginica = df[ df.class_name == 'IRIS-VIRGINICA' ]
df_virginica.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class_name,sepal_area,petal_area
100,6.3,3.3,6.0,2.5,IRIS-VIRGINICA,20.79,15.0
101,5.8,2.7,5.1,1.9,IRIS-VIRGINICA,15.66,9.69
102,7.1,3.0,5.9,2.1,IRIS-VIRGINICA,21.3,12.39
103,6.3,2.9,5.6,1.8,IRIS-VIRGINICA,18.27,10.08
104,6.5,3.0,5.8,2.2,IRIS-VIRGINICA,19.5,12.76
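
The comparison inside the brackets produces a boolean ```Series``` with one entry per row, and only the rows marked ```True``` are kept. Several conditions can be combined with ```&``` (and) or ```|``` (or), as long as each condition is wrapped in parentheses. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up rows standing in for the iris data
demo = pd.DataFrame({
    'petal_length': [1.4, 5.1, 6.0, 5.9],
    'class_name': ['SETOSA', 'VIRGINICA', 'VIRGINICA', 'VERSICOLOR'],
})

mask = demo.class_name == 'VIRGINICA'   # boolean Series, one entry per row
long_virginica = demo[mask & (demo.petal_length > 5.5)]
print(mask.tolist())
print(long_virginica)
```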


In [None]:
This can similarly be used to filter for data 