Pandas is a Python library (or package), which means that it is a coherent body of Python code that you can use in your own Python code for a particular purpose.

The purpose of Pandas is to make working with "relational" or "labelled" data both easy and intuitive.

More precisely (the following is taken from the Pandas documentation pages on GitHub), Pandas is well suited for many different kinds of data:

  - SQL tables or Excel spreadsheets (tabular data with heterogeneously-typed columns)
  - Time series data (ordered and unordered, not necessarily fixed-frequency)
  - Arbitrary matrix data (homo- or heterogeneously typed with row and column labels)
  - Other observational or statistical data sets (labelled or unlabelled)
  
Pandas is a good example of the "batteries included" principle of Python: Python comes with libraries, especially in the field of scientific computing, that make life a lot easier for its users:

  - High-level code, often optimized for speed, to tackle common problems
  - Integration with other libraries
  
Pandas is built on top of another Python library, called NumPy. Some knowledge of that library will help us as users of Pandas, and to get a grasp of NumPy, some general knowledge of Python is necessary. The Russian doll principle, in short.

![Russian dolls](images/russian_dolls.jpeg)
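To make the layering a bit more tangible, here is a minimal sketch with made-up values (they are not taken from the survey data). It shows that the values of a Pandas Series can be handed over as a NumPy array, and that NumPy functions accept a Series as input.

```python
import numpy as np
import pandas as pd

# A Pandas Series with three made-up values.
lengths = pd.Series([32.0, 33.0, 37.0])

# to_numpy() hands back the values as a NumPy array.
as_array = lengths.to_numpy()
print(type(as_array))

# NumPy functions accept a Series directly as well.
mean_length = np.mean(lengths)
print(mean_length)
```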

Pandas is designed to do many things well, but here we will concentrate on some practical aspects of working with the library.

In order to use a library, one has to "load" it, that is, make the functions defined in the library available to our programs.

There are two ways of enabling the use of the code contained in the libraries.

In [1]:
import pandas

But more often we see that import statement in another form:

In [2]:
import pandas as pd

The reason for this is that each time we want to use a function from the Pandas library in our code, we have to use the syntax LibraryName.FunctionName, and pd.read_csv() is shorter (less typing) than its equivalent pandas.read_csv().
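A quick check, as a small sketch, that the alias is nothing more than a second name for the same library. It only uses the two import statements shown above.

```python
import pandas
import pandas as pd

# Both names point at the very same module object.
same_module = pd is pandas

# So a function reached through the alias is the same function.
same_function = pd.read_csv is pandas.read_csv

print(same_module, same_function)
```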

To get a grip on Pandas, we start with a small ecological dataset from the US with survey data about animals caught on certain plots.

We already imported the Pandas library in one of the notebook cells above with the statement "import pandas as pd".

Now we have to locate the file. For this we use another library called os, which allows us to use commands that work on our operating system (os).

In [3]:
import os
os.getcwd()

'/Users/peter/Documents/bootcamps'

So now we know where we are. The file we are going to work with is in the data directory, one level below our "current working directory" or "cwd".
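If you are not sure whether the file is really where you expect it, you can check before reading it. The following is a minimal sketch that assumes the directory and file names used in this notebook (data and surveys.csv). It only builds the path and tests whether it exists.

```python
import os

# Build the full path, starting from the current working directory.
# "data" and "surveys.csv" are the names used in this notebook.
csv_path = os.path.join(os.getcwd(), "data", "surveys.csv")

# os.path.exists() tells us whether the file is really there
# before we ask Pandas to read it.
if os.path.exists(csv_path):
    print("Found:", csv_path)
else:
    print("Not found:", csv_path)
```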

In [4]:
pd.read_csv("data/surveys.csv")

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
0,1,7,16,1977,2,NL,M,32.0,
1,2,7,16,1977,3,NL,M,33.0,
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
4,5,7,16,1977,3,DM,M,35.0,
5,6,7,16,1977,1,PF,M,14.0,
6,7,7,16,1977,2,PE,F,,
7,8,7,16,1977,1,DM,M,37.0,
8,9,7,16,1977,1,DM,F,34.0,
9,10,7,16,1977,6,PF,F,20.0,


Looks OK. Notice the missing values (NaN) in some of the columns. Also notice that some columns, like plot_id and species_id, hold references to IDs. This makes the table a bit terse, although it is a sound way of doing things: referencing entities stored in another table or matrix by way of an ID.
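To see where those NaN values come from, here is a small sketch with a made-up, three-row stand-in for surveys.csv (the rows are invented for illustration). Empty fields in the text become NaN once Pandas has read them.

```python
import io
import pandas as pd

# A tiny stand-in for surveys.csv (made-up rows, same idea):
# the weight field is left empty in the first and last row.
csv_text = """record_id,species_id,weight
1,NL,
2,DM,40.0
3,PE,
"""

# io.StringIO lets read_csv() read from a string instead of a file.
tiny_df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN ("not a number").
missing_mask = tiny_df["weight"].isna()
print(tiny_df)
print(missing_mask.tolist())
```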

In order to work with the dataset surveys.csv, we need to keep its contents in memory. The way to do this is to assign the result of read_csv() to a variable. For this we use the assignment operator "=".

In [5]:
surveys_df = pd.read_csv("data/surveys.csv")

What precisely is this surveys_df thing? Let's ask Python:

In [6]:
type(surveys_df)

pandas.core.frame.DataFrame

Cool, let's see what is contained in the dataframe:

In [7]:
surveys_df

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
0,1,7,16,1977,2,NL,M,32.0,
1,2,7,16,1977,3,NL,M,33.0,
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
4,5,7,16,1977,3,DM,M,35.0,
5,6,7,16,1977,1,PF,M,14.0,
6,7,7,16,1977,2,PE,F,,
7,8,7,16,1977,1,DM,M,37.0,
8,9,7,16,1977,1,DM,F,34.0,
9,10,7,16,1977,6,PF,F,20.0,


To generate a smaller preview use the head() and tail() methods:

In [8]:
surveys_df.head()

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
0,1,7,16,1977,2,NL,M,32.0,
1,2,7,16,1977,3,NL,M,33.0,
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
4,5,7,16,1977,3,DM,M,35.0,


The Python slicing syntax also works on dataframes, where it selects rows by position (the stop value is not included):

In [9]:
surveys_df[2:4]

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
2,3,7,16,1977,2,DM,F,37.0,
3,4,7,16,1977,7,DM,M,36.0,
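The slice above can be tried out on a small, hypothetical dataframe (the values are made up). The sketch also shows the same selection written with iloc, which makes the "by position" part explicit.

```python
import pandas as pd

# Hypothetical mini dataframe, only used to illustrate slicing.
demo_df = pd.DataFrame({"record_id": [1, 2, 3, 4, 5],
                        "species_id": ["NL", "NL", "DM", "DM", "DM"]})

# Rows at positions 2 and 3; position 4 (the stop value) is excluded.
by_slice = demo_df[2:4]

# The same selection written explicitly as a positional lookup.
by_iloc = demo_df.iloc[2:4]

print(by_slice)
print(by_slice.equals(by_iloc))
```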


There are various ways to peek into the object to see precisely what we are dealing with.

We can use the predefined and general method "info" on our dataframe object:

In [10]:
surveys_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
record_id          35549 non-null int64
month              35549 non-null int64
day                35549 non-null int64
year               35549 non-null int64
plot_id            35549 non-null int64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             32283 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB


This gives us a lot of information:

- The type of the object: <class 'pandas.core.frame.DataFrame'>
- The number of entries (rows) in the dataframe: 35549
- The number of data columns: 9
- The number of non-null values per column, from which the number of missing values follows: sex = 35549 - 33038 => 2511 (7%)
- The types (dtypes) of data contained in the dataframe: int64, object, float64
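Instead of subtracting by hand, we can let Pandas count the missing values. A minimal sketch on made-up data (demo_df is a hypothetical stand-in for surveys_df):

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps, standing in for surveys_df.
demo_df = pd.DataFrame({"sex": ["M", "F", None, "M"],
                        "weight": [40.0, np.nan, np.nan, 52.0]})

# Number of missing values per column.
missing_per_column = demo_df.isna().sum()

# The same as a percentage of all rows.
missing_percent = missing_per_column / len(demo_df) * 100

print(missing_per_column)
print(missing_percent)
```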

The info() method gives an overview of the dataframe, but one can also ask for the types of the data directly:

In [11]:
surveys_df.dtypes

record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

Dataframes also have a describe() method that provides us with some basic statistics about the dataset's numeric columns:

In [12]:
surveys_df.describe()

Unnamed: 0,record_id,month,day,year,plot_id,hindfoot_length,weight
count,35549.0,35549.0,35549.0,35549.0,35549.0,31438.0,32283.0
mean,17775.0,6.474022,16.105966,1990.475231,11.397001,29.287932,42.672428
std,10262.256696,3.396583,8.256691,7.493355,6.799406,9.564759,36.631259
min,1.0,1.0,1.0,1977.0,1.0,2.0,4.0
25%,8888.0,4.0,9.0,1984.0,5.0,21.0,20.0
50%,17775.0,6.0,16.0,1990.0,11.0,32.0,37.0
75%,26662.0,9.0,23.0,1997.0,17.0,36.0,48.0
max,35549.0,12.0,31.0,2002.0,24.0,70.0,280.0


Another way of thinking of a dataframe is as a group of Series that share an index (the row labels), with the column headers as the names of the individual Series. We can easily select a column, using the following syntax:

In [13]:
surveys_df['weight'].head()

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: weight, dtype: float64
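The idea of a dataframe as a group of Series can be checked with a small sketch on a hypothetical two-column dataframe (made-up values):

```python
import pandas as pd

# Hypothetical two-column dataframe.
demo_df = pd.DataFrame({"plot_id": [2, 3, 2],
                        "weight": [40.0, 48.0, 36.0]})

# Selecting one column gives a Series.
weight = demo_df["weight"]
plot = demo_df["plot_id"]
print(type(weight))

# Every column carries the same row index as the dataframe.
print(weight.index.equals(demo_df.index), plot.index.equals(demo_df.index))

# The column header becomes the name of the Series.
print(weight.name)
```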

Using column selection is an easy way to select some columns and place them in a new dataframe:

In [14]:
smaller_surveys_df = surveys_df[['record_id', 'plot_id', 'weight']]
smaller_surveys_df.head()

Unnamed: 0,record_id,plot_id,weight
0,1,2,
1,2,3,
2,3,2,
3,4,7,
4,5,3,


Of course, we can select rows from our dataframe too:

In [15]:
print(surveys_df[surveys_df.weight > 100].head(5))

     record_id  month  day  year  plot_id species_id sex  hindfoot_length  \
356        357     11   12  1977        9         DS   F             50.0   
361        362     11   12  1977        1         DS   F             51.0   
366        367     11   12  1977       20         DS   M             51.0   
376        377     11   12  1977        9         DS   F             48.0   
380        381     11   13  1977       17         DS   F             48.0   

     weight  
356   117.0  
361   121.0  
366   115.0  
376   120.0  
380   118.0  
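What happens inside the square brackets is easier to see in two steps. The sketch below uses made-up rows: first the comparison produces a column of True/False values, then that column is used to pick the rows.

```python
import pandas as pd

# Made-up rows; one weight is missing.
demo_df = pd.DataFrame({"record_id": [1, 2, 3, 4],
                        "weight": [40.0, 117.0, None, 121.0]})

# The comparison returns a Series of True/False values, one per row.
# Rows with a missing weight compare as False.
heavy_mask = demo_df["weight"] > 100

# Using the mask inside [] keeps only the rows marked True.
heavy_df = demo_df[heavy_mask]

print(heavy_mask.tolist())
print(heavy_df)
```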


The data we have been working with so far, surveys.csv, contains two columns that reference data available in other tables (CSV files). The columns "plot_id" and "species_id" are made up of IDs (identifiers) that identify rows in two additional files: plots.csv and species.csv.

Let's load the species file into Pandas as a dataframe and have a look at it:

In [16]:
species_df = pd.read_csv("data/species.csv")
species_df.head()

Unnamed: 0,species_id,genus,species,taxa
0,AB,Amphispiza,bilineata,Bird
1,AH,Ammospermophilus,harrisi,Rodent
2,AS,Ammodramus,savannarum,Bird
3,BA,Baiomys,taylori,Rodent
4,CB,Campylorhynchus,brunneicapillus,Bird


Here we see, just as in our initial dataframe, a column with the species_id and some other columns that contain data about the species.

Combining two datasets on the basis of a shared ID is often referred to as a join. In Pandas, however, the method is called "merge":

In [17]:
pd.merge(surveys_df, species_df, on="species_id", how="inner").head(5)

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa
0,1,7,16,1977,2,NL,M,32.0,,Neotoma,albigula,Rodent
1,2,7,16,1977,3,NL,M,33.0,,Neotoma,albigula,Rodent
2,22,7,17,1977,15,NL,F,31.0,,Neotoma,albigula,Rodent
3,38,7,17,1977,17,NL,M,33.0,,Neotoma,albigula,Rodent
4,72,8,19,1977,2,NL,M,31.0,,Neotoma,albigula,Rodent


![Neotoma albigula](images/albigula_02.jpeg)
![Neotoma albigula](images/albigula_01.jpeg)

Let's take a step back for a moment and have a closer look at inner and outer joins, because these are important concepts when working with relational datasets. Later on in this course, in the SQL module, we will work with these concepts again, so now is a good time to get a first impression.

We will make two small dataframes just for testing the two joins: inner and outer.

In [18]:
left_df = pd.DataFrame({'key': range(5),
                       'left_values': ['a', 'b', 'c', 'd', 'e']})
right_df = pd.DataFrame({'key': range(2,7),
                        'right_values': ['f', 'g', 'h', 'i', 'j']})
print(left_df)
print('\n')
print(right_df)

   key left_values
0    0           a
1    1           b
2    2           c
3    3           d
4    4           e


   key right_values
0    2            f
1    3            g
2    4            h
3    5            i
4    6            j


So, there we are: Two small datasets with 5 rows and 2 columns each. In the key-column we have 3 overlapping keys: 2, 3, and 4.

First the inner join on keys.

In [19]:
pd.merge(left_df, right_df, on='key', how='inner')

Unnamed: 0,key,left_values,right_values
0,2,c,f
1,3,d,g
2,4,e,h


We lose data from both dataframes, because some keys do not match up (they are not in both dataframes).
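To find out which rows are dropped by the inner join, we can look for the keys that occur in only one of the two dataframes. A small sketch, rebuilding the two test dataframes so that it runs on its own:

```python
import pandas as pd

left_df = pd.DataFrame({'key': range(5),
                        'left_values': ['a', 'b', 'c', 'd', 'e']})
right_df = pd.DataFrame({'key': range(2, 7),
                         'right_values': ['f', 'g', 'h', 'i', 'j']})

# isin() marks the keys that also occur in the other dataframe;
# the ~ in front turns that around: keys that do NOT occur there.
only_left = left_df[~left_df['key'].isin(right_df['key'])]
only_right = right_df[~right_df['key'].isin(left_df['key'])]

print(only_left['key'].tolist())
print(only_right['key'].tolist())
```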

Now for the outer joins. First the left outer join, which keeps all rows of the left DF:

In [20]:
pd.merge(left_df, right_df, on='key', how='left')

Unnamed: 0,key,left_values,right_values
0,0,a,
1,1,b,
2,2,c,f
3,3,d,g
4,4,e,h


The right outer join keeps all rows of the right DF:

In [21]:
pd.merge(left_df, right_df, on='key', how='right')

Unnamed: 0,key,left_values,right_values
0,2,c,f
1,3,d,g
2,4,e,h
3,5,,i
4,6,,j


The full outer join keeps all rows of both DFs:

In [22]:
pd.merge(left_df, right_df, on='key', how='outer')

Unnamed: 0,key,left_values,right_values
0,0,a,
1,1,b,
2,2,c,f
3,3,d,g
4,4,e,h
5,5,,i
6,6,,j
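The four joins can be compared side by side. The sketch below rebuilds the two test dataframes, counts the rows each join returns, and uses the indicator option of merge to show where each row of the full outer join comes from.

```python
import pandas as pd

left_df = pd.DataFrame({'key': range(5),
                        'left_values': ['a', 'b', 'c', 'd', 'e']})
right_df = pd.DataFrame({'key': range(2, 7),
                         'right_values': ['f', 'g', 'h', 'i', 'j']})

# Number of rows returned by each type of join.
row_counts = {how: len(pd.merge(left_df, right_df, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)

# indicator=True adds a "_merge" column that tells us whether a row
# was found in the left dataframe only, the right one only, or both.
outer_df = pd.merge(left_df, right_df, on='key', how='outer', indicator=True)
print(outer_df[['key', '_merge']])
```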


One last function of the Pandas library: concat(). We use concat() to combine different datasets.

In [23]:
pd.concat([left_df, right_df])

Unnamed: 0,key,left_values,right_values
0,0,a,
1,1,b,
2,2,c,
3,3,d,
4,4,e,
0,2,,f
1,3,,g
2,4,,h
3,5,,i
4,6,,j


Or, when we want to concatenate them next to each other, we pass an axis:

In [24]:
pd.concat([left_df, right_df], axis=1)

Unnamed: 0,key,left_values,key.1,right_values
0,0,a,2,f
1,1,b,3,g
2,2,c,4,h
3,3,d,5,i
4,4,e,6,j
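Notice that stacking the two dataframes keeps the original row labels, so the labels 0 to 4 appear twice. A small sketch of how to get fresh row labels with the ignore_index option of concat (again rebuilding the test dataframes):

```python
import pandas as pd

left_df = pd.DataFrame({'key': range(5),
                        'left_values': ['a', 'b', 'c', 'd', 'e']})
right_df = pd.DataFrame({'key': range(2, 7),
                         'right_values': ['f', 'g', 'h', 'i', 'j']})

# The original row labels are kept, so 0..4 appears twice.
stacked = pd.concat([left_df, right_df])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([left_df, right_df], ignore_index=True)
print(renumbered.index.tolist())
```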


#### Exercises:

So far, so good. One important step is still missing: presenting and visualizing the results of the data we processed.

Let's return to our ecological dataset. Before we dive in, we refresh our memories a little bit. Try out the following attributes and methods and see what they return:

1. surveys_df.columns
2. surveys_df.head() and .head(10)
3. surveys_df.tail()
4. surveys_df.shape

Most of the methods we have tried so far are meant to get a feel for the data. Next we are going to perform some simple summary statistics to learn more about the data.

First let us have a look at the column names:

In [25]:
surveys_df.columns.values

array(['record_id', 'month', 'day', 'year', 'plot_id', 'species_id', 'sex',
       'hindfoot_length', 'weight'], dtype=object)

We are interested in the different species contained in the data:

In [26]:
pd.unique(surveys_df.species_id)

array(['NL', 'DM', 'PF', 'PE', 'DS', 'PP', 'SH', 'OT', 'DO', 'OX', 'SS',
       'OL', 'RM', nan, 'SA', 'PM', 'AH', 'DX', 'AB', 'CB', 'CM', 'CQ',
       'RF', 'PC', 'PG', 'PH', 'PU', 'CV', 'UR', 'UP', 'ZL', 'UL', 'CS',
       'SC', 'BA', 'SF', 'RO', 'AS', 'SO', 'PI', 'ST', 'CU', 'SU', 'RX',
       'PB', 'PL', 'PX', 'CT', 'US'], dtype=object)

#### Exercises:

Think of a way to count the number of unique species.

Do the same for the plots. How many unique plots are there in the data?

In [27]:
# Exercise
# pd.unique() counts NaN (a missing species) as a value, so we subtract 1
species_count = len(pd.unique(surveys_df.species_id)) - 1
print(species_count)

48


In [28]:
# Exercise
plots_count = len(pd.unique(surveys_df.plot_id))
print(plots_count)

24
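Subtracting 1 by hand works, but only when there really is a NaN among the values. As an alternative sketch, on made-up data, the nunique() method leaves missing values out by itself:

```python
import numpy as np
import pandas as pd

# Made-up column with one missing species.
demo_df = pd.DataFrame({"species_id": ["NL", "DM", np.nan, "DM", "PF"]})

# pd.unique() keeps NaN as a separate value.
with_nan = len(pd.unique(demo_df["species_id"]))

# nunique() leaves missing values out by default.
without_nan = demo_df["species_id"].nunique()

print(with_nan, without_nan)
```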


Sometimes we want to calculate summary statistics grouped by subsets or attributes within fields of our data. Before we group anything, let's look at the statistics of a single column, for example the weight of all animals:

In [29]:
surveys_df['weight'].describe()

count    32283.000000
mean        42.672428
std         36.631259
min          4.000000
25%         20.000000
50%         37.000000
75%         48.000000
max        280.000000
Name: weight, dtype: float64

Each of these outcomes of the describe() method can be used as a specific metric on the selected column ('weight') of the dataframe:

In [30]:
surveys_df['weight'].std()

36.63125947458399

The Pandas documentation is great, so we can look up what we actually do when calling this method:

[dataframe.std docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.std.html)
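The link between describe() and the separate methods can be verified with a small sketch on three made-up weights. The result of describe() is itself a Series, so a single statistic can be picked by its label.

```python
import pandas as pd

# Three made-up weights.
weights = pd.Series([20.0, 40.0, 60.0], name="weight")

summary = weights.describe()

# describe() returns a Series, so single values can be picked by label.
print(summary["mean"], summary["std"])

# The dedicated methods return the same numbers.
print(weights.mean(), weights.std())
```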

Another way to quickly calculate summary statistics is by using the groupby method:

In [31]:
# Group data by sex
grouped_by_sex = surveys_df.groupby('sex')
# Quick look at what we have
grouped_by_sex.describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,day,hindfoot_length,month,plot_id,record_id,weight,year
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
F,count,15690.0,14894.0,15690.0,15690.0,15690.0,15303.0,15690.0
F,mean,16.007138,28.83678,6.583047,11.440854,18036.412046,42.170555,1990.644997
F,std,8.271144,9.463789,3.36735,6.870684,10423.089,36.847958,7.598725
F,min,1.0,7.0,1.0,1.0,3.0,4.0,1977.0
F,25%,9.0,21.0,4.0,5.0,8917.5,20.0,1984.0
F,50%,16.0,27.0,7.0,12.0,18075.5,34.0,1990.0
F,75%,23.0,36.0,10.0,17.0,27250.0,46.0,1997.0
F,max,31.0,64.0,12.0,24.0,35547.0,274.0,2002.0
M,count,17348.0,16476.0,17348.0,17348.0,17348.0,16879.0,17348.0
M,mean,16.184286,29.709578,6.392668,11.098282,17754.835601,42.995379,1990.480401


And to have a look at the mean values for all (numerical) columns grouped by sex:

In [32]:
# numeric_only=True skips text columns such as species_id (required in pandas 2.x)
grouped_by_sex.mean(numeric_only=True)

Unnamed: 0_level_0,record_id,month,day,year,plot_id,hindfoot_length,weight
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
F,18036.412046,6.583047,16.007138,1990.644997,11.440854,28.83678,42.170555
M,17754.835601,6.392668,16.184286,1990.480401,11.098282,29.709578,42.995379
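What groupby does can be followed on a handful of made-up rows. The sketch below calculates the mean weight per sex with groupby, and then repeats the calculation for one group by hand, so that the two results can be compared.

```python
import pandas as pd

# Made-up rows; the last animal has no recorded sex.
demo_df = pd.DataFrame({"sex": ["F", "M", "F", "M", None],
                        "weight": [30.0, 50.0, 40.0, 70.0, 99.0]})

# Split the rows by sex, then take the mean weight of each group.
# Rows where sex is missing are left out of every group by default.
mean_weight = demo_df.groupby("sex")["weight"].mean()
print(mean_weight)

# The same result for one group, computed by hand with a boolean mask.
female_mean = demo_df[demo_df["sex"] == "F"]["weight"].mean()
print(female_mean)
```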


#### Exercises:

- How many recorded individuals are female (F) and how many are male (M)?
- What happens when you group by two columns using the following syntax and then grab the mean values?
  - sorted2 = surveys_df.groupby(['plot_id', 'sex'])
  - sorted2.mean(numeric_only=True)
- Summarize weight values for each plot in your data.
  - Use the following syntax, where byPlot is the result of surveys_df.groupby('plot_id'): byPlot['weight'].describe()

In [33]:
# Exercise
surveys_df['sex'].describe()

count     33038
unique        2
top           M
freq      17348
Name: sex, dtype: object

In [34]:
# Exercise
surveys_df['sex'].value_counts()

M    17348
F    15690
Name: sex, dtype: int64

In [35]:
# Exercise checks
print(surveys_df['sex'].value_counts()['F'])
print(surveys_df['sex'].value_counts()['M'])
print(17348 + 15690)

15690
17348
33038


In [36]:
# Exercise
sorted2 = surveys_df.groupby(['plot_id', 'sex'])
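# numeric_only=True skips text columns such as species_id (required in pandas 2.x)
sorted2.mean(numeric_only=True)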

Unnamed: 0_level_0,Unnamed: 1_level_0,record_id,month,day,year,hindfoot_length,weight
plot_id,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,F,18390.384434,6.597877,15.338443,1990.933962,31.733911,46.311138
1,M,17197.740639,6.121461,15.905936,1990.091324,34.30277,55.95056
2,F,17714.753608,6.426804,16.28866,1990.449485,30.16122,52.561845
2,M,18085.458042,6.340035,15.440559,1990.756119,30.35376,51.391382
3,F,19888.783875,6.604703,16.161254,1992.013438,23.774044,31.215349
3,M,20226.767857,6.271429,16.45,1992.275,23.833744,34.163241
4,F,17489.205275,6.442661,15.74656,1990.235092,33.249102,46.818824
4,M,18493.841748,6.430097,16.507767,1991.000971,34.097959,48.888119
5,F,12280.793169,6.142315,15.72296,1986.485769,28.921844,40.974806
5,M,12798.426621,6.194539,15.703072,1986.817406,29.694794,40.708551
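Grouping by two columns gives a result with a two-level row index (plot_id and sex). A minimal sketch on made-up rows shows how to look up a single value in such a result, and how unstack() can turn the inner level into columns:

```python
import pandas as pd

# Made-up rows: two plots, one female and one male animal in each.
demo_df = pd.DataFrame({"plot_id": [1, 1, 2, 2],
                        "sex": ["F", "M", "F", "M"],
                        "weight": [46.0, 56.0, 52.0, 51.0]})

# Grouping by two columns gives a result with a two-level index.
by_plot_sex = demo_df.groupby(["plot_id", "sex"])["weight"].mean()
print(by_plot_sex)

# A single value is looked up with a (plot_id, sex) pair.
male_plot_1 = by_plot_sex.loc[(1, "M")]
print(male_plot_1)

# unstack() moves the inner level (sex) to the columns.
wide = by_plot_sex.unstack()
print(wide)
```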


So far, we have seen a lot of numbers, so let's zoom in on the animals we are dealing with. The surveys dataframe we have been working with only holds IDs for species. But remember that we did a join on species ID against the species dataframe: pd.merge(surveys_df, species_df, on="species_id", how="inner")

In [37]:
surveys2_df = pd.merge(surveys_df, species_df, on="species_id", how="inner")
surveys2_df.tail(10)

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa
34776,24669,10,13,1996,3,PX,,,,Chaetodipus,sp.,Rodent
34777,24991,2,8,1997,23,PX,F,19.0,20.0,Chaetodipus,sp.,Rodent
34778,25137,2,9,1997,13,PX,M,20.0,18.0,Chaetodipus,sp.,Rodent
34779,28806,11,21,1998,7,PX,,,,Chaetodipus,sp.,Rodent
34780,30986,7,1,2000,7,PX,,,,Chaetodipus,sp.,Rodent
34781,28988,12,23,1998,6,CT,,,,Cnemidophorus,tigris,Reptile
34782,35512,12,31,2002,11,US,,,,Sparrow,sp.,Bird
34783,35513,12,31,2002,11,US,,,,Sparrow,sp.,Bird
34784,35528,12,31,2002,13,US,,,,Sparrow,sp.,Bird
34785,35544,12,31,2002,15,US,,,,Sparrow,sp.,Bird


Let's count the number of records per species in our data with a "groupby" and a "count":

In [38]:
species_counts = surveys2_df.groupby('species_id')['record_id'].count()
species_counts

species_id
AB      303
AH      437
AS        2
BA       46
CB       50
CM       13
CQ       16
CS        1
CT        1
CU        1
CV        1
DM    10596
DO     3027
DS     2504
DX       40
NL     1252
OL     1006
OT     2249
OX       12
PB     2891
PC       39
PE     1299
PF     1597
PG        8
PH       32
PI        9
PL       36
PM      899
PP     3123
PU        5
PX        6
RF       75
RM     2609
RO        8
RX        2
SA       75
SC        1
SF       43
SH      147
SO       43
SS      248
ST        1
SU        5
UL        4
UP        8
UR       10
US        4
ZL        2
Name: record_id, dtype: int64
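With a long list of counts like this one, sorting helps to see which species dominate. A small sketch on made-up records (only the species codes are borrowed from the data above):

```python
import pandas as pd

# Made-up records; only the species codes are borrowed from the survey data.
demo_df = pd.DataFrame({"record_id": [1, 2, 3, 4, 5, 6],
                        "species_id": ["DM", "NL", "DM", "PF", "DM", "NL"]})

counts = demo_df.groupby("species_id")["record_id"].count()

# Largest groups first.
ranked = counts.sort_values(ascending=False)
print(ranked)

# The label of the most frequent species.
most_frequent = ranked.index[0]
print(most_frequent)
```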

That is a lot of DMs. Let's have a look at them:

In [39]:
surveys2_df[surveys2_df.species_id == 'DM'].head(5)

Unnamed: 0,record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight,genus,species,taxa
1252,3,7,16,1977,2,DM,F,37.0,,Dipodomys,merriami,Rodent
1253,4,7,16,1977,7,DM,M,36.0,,Dipodomys,merriami,Rodent
1254,5,7,16,1977,3,DM,M,35.0,,Dipodomys,merriami,Rodent
1255,8,7,16,1977,1,DM,M,37.0,,Dipodomys,merriami,Rodent
1256,9,7,16,1977,1,DM,F,34.0,,Dipodomys,merriami,Rodent


And here we have the little rodent (Merriam's kangaroo rat):
![Merriam's kangaroo rat](images/merriami.jpg)