Pandas is a Python library (or package), which means that it is a coherent body of Python code that you can use in your own programs for a particular purpose.

The purpose of Pandas is to make working with "relational" or "labelled" data both easy and intuitive.

More precisely (the following is taken from the Pandas documentation pages on GitHub), Pandas is well suited for many different kinds of data:

  - SQL tables or Excel spreadsheets (tabular data with heterogeneously-typed columns)
  - Time series data (ordered and unordered, not necessarily fixed-frequency)
  - Arbitrary matrix data (homo- or heterogeneously typed with row and column labels)
  - Other observational or statistical data sets (labelled or unlabelled)
  
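Pandas fits the spirit of Python's "batteries included" principle: the Python ecosystem offers libraries, especially in the field of scientific computing, that make life a lot easier for its users. Pandas itself is not part of the standard library and has to be installed separately, but it offers the same kind of convenience: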

  - High-level code, often optimized for speed, to tackle common problems
  - Integration with other libraries
  
Pandas is built on top of another Python library called NumPy. Some knowledge of that library will therefore help us as users of Pandas, and to get a grasp of NumPy, some general knowledge of Python is necessary. The Russian doll principle, in short.

![Russian dolls](images/russian_dolls.jpeg)

Pandas is designed to do many things well, but here we will concentrate on some practical aspects of working with the library.

In order to use a library, one first has to "load" (import) it, that is, make the functions defined in the library available to our programs.

There are two ways of enabling the use of the code contained in a library. The first is a plain import:

In [1]:
import pandas

But more often we see that import statement in another form:

In [2]:
import pandas as pd

The reason for this is that each time we want to use a function from the Pandas library in our code, we have to use the syntax LibraryName.FunctionName, and pd.read_csv() is shorter (less typing) than its equivalent pandas.read_csv().
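As a small sketch of the point above: both import forms bind the same module, so the alias only changes the name you type. The two-row dataframe is made up for illustration.

```python
import pandas
import pandas as pd

# Both names refer to the same module object; "pd" is just a shorter alias.
same_module = pd is pandas

# Any function can be reached through either name.
same_function = pd.read_csv is pandas.read_csv

# A made-up dataframe, created through the alias.
demo_df = pd.DataFrame({"plot_id": [2, 3], "species_id": ["NL", "DM"]})

print(same_module, same_function, demo_df.shape)
```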

To get a grip on Pandas, we start with a small ecological dataset from the US with survey data about animals caught on certain plots.

We already imported the Pandas library in one of the notebook cells above with the statement "import pandas as pd".

Now we have to get to the file. For this we use another library called os, which lets us interact with our operating system (os), for example to find out which directory we are currently working in.

In [3]:
import os
os.getcwd()

'/Users/peter/Documents/bootcamps'

So now we know where we are. The file we are going to work with is in the subdirectory "data", one level below our "current working directory" (cwd).
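If you want to check the location before reading the file, the full path can be put together with the os library. This is only a sketch: it assumes the directory layout described above (a "data" directory holding surveys.csv) and prints False if the file is not there.

```python
import os

# Build the full path from the current working directory.
# "data" and "surveys.csv" are the names used in this notebook.
csv_path = os.path.join(os.getcwd(), "data", "surveys.csv")

# Check whether the file exists before trying to read it.
print(csv_path)
print(os.path.exists(csv_path))
```

The notebook itself simply passes the relative path "data/surveys.csv" to read_csv, which Python resolves against the current working directory.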

In [4]:
pd.read_csv("data/surveys.csv")

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN
5          6      7   16  1977        1         PF   M             14.0     NaN
6          7      7   16  1977        2         PE   F              NaN     NaN
7          8      7   16  1977        1         DM   M             37.0     NaN
8          9      7   16  1977        1         DM   F             34.0     NaN
9         10      7   16  1977        6         PF   F             20.0     NaN


Looks OK. Notice the missing values in some of the columns (Pandas shows them as NaN). Also notice that some columns, such as plot_id and species_id, contain IDs. This makes the table a bit terse, but it is a sound way of doing things: entities stored in another table are referenced by way of an ID.
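To see how the missing values come about, here is a minimal sketch that reads three rows in the layout of surveys.csv from a string instead of a file. The rows are copied from the preview above; the use of io.StringIO is just a convenience for this illustration.

```python
import io
import pandas as pd

# Three sample rows in the layout of surveys.csv; empty fields are missing values.
csv_text = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32.0,
6,7,16,1977,1,PF,M,14.0,
7,7,16,1977,2,PE,F,,
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN; isna() flags them and sum() counts them per column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```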

In order to work with the dataset surveys.csv, we need to keep its contents in memory. The way to do this is to assign the result of read_csv() to a variable, using the assignment operator "=".

In [5]:
surveys_df = pd.read_csv("data/surveys.csv")

What precisely is this surveys_df thing? Let's ask Python:

In [6]:
type(surveys_df)

pandas.core.frame.DataFrame

Cool, let's see what is contained in the dataframe:

In [7]:
surveys_df

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN
5          6      7   16  1977        1         PF   M             14.0     NaN
6          7      7   16  1977        2         PE   F              NaN     NaN
7          8      7   16  1977        1         DM   M             37.0     NaN
8          9      7   16  1977        1         DM   F             34.0     NaN
9         10      7   16  1977        6         PF   F             20.0     NaN


To generate a smaller preview, use the head() and tail() methods, which show the first and the last rows of the dataframe (five by default):

In [8]:
surveys_df.head()

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0          1      7   16  1977        2         NL   M             32.0     NaN
1          2      7   16  1977        3         NL   M             33.0     NaN
2          3      7   16  1977        2         DM   F             37.0     NaN
3          4      7   16  1977        7         DM   M             36.0     NaN
4          5      7   16  1977        3         DM   M             35.0     NaN
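Both methods accept the number of rows as an argument. The sketch below uses a made-up ten-row dataframe standing in for surveys_df.

```python
import pandas as pd

# A made-up ten-row dataframe, standing in for surveys_df.
demo_df = pd.DataFrame({
    "record_id": range(1, 11),
    "plot_id": [2, 3, 2, 7, 3, 1, 2, 1, 1, 6],
})

first_three = demo_df.head(3)   # the first 3 rows
last_two = demo_df.tail(2)      # the last 2 rows
default_head = demo_df.head()   # no argument: the first 5 rows

print(first_three)
print(last_two)
```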


The Python slicing syntax also works on dataframes, where it selects rows by position:

In [9]:
surveys_df[2:4]

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight
2          3      7   16  1977        2         DM   F             37.0     NaN
3          4      7   16  1977        7         DM   M             36.0     NaN
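As with Python lists, the start of the slice is included and the stop is excluded. The sketch below, on a made-up dataframe, also shows iloc, which states the positional intent explicitly and returns the same rows.

```python
import pandas as pd

# A made-up five-row dataframe, standing in for surveys_df.
demo_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "species_id": ["NL", "NL", "DM", "DM", "DM"],
})

# Rows at positions 2 and 3; position 4 is not included.
rows_2_and_3 = demo_df[2:4]

# iloc selects by position as well and gives the same result.
same_rows = demo_df.iloc[2:4]

print(rows_2_and_3)
print(rows_2_and_3.equals(same_rows))
```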


There are various ways to peek into the object to see precisely what we are dealing with.

We can use the predefined and general method "info" on our dataframe object:

In [10]:
surveys_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
record_id          35549 non-null int64
month              35549 non-null int64
day                35549 non-null int64
year               35549 non-null int64
plot_id            35549 non-null int64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             32283 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB


This gives us a lot of information:

- The type of the object: <class 'pandas.core.frame.DataFrame'>
- The number of entries (rows) in the dataframe: 35549
- The number of data columns: 9
- The number of non-null values per column, from which the number of missing values follows, for example for sex: 35549 - 33038 = 2511 (7%)
- The types (dtypes) of the data contained in the dataframe: int64, object, float64
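The subtraction in the list above can also be left to Pandas. The following is a sketch on a made-up four-row dataframe; count() returns the same non-null numbers that info() prints.

```python
import pandas as pd

# Made-up sample; None marks a missing value.
demo_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "sex": ["M", None, "F", "M"],
    "weight": [None, 40.0, None, 48.0],
})

total_rows = len(demo_df)
non_null = demo_df.count()                # non-null values per column
missing = total_rows - non_null           # missing values per column
missing_pct = 100 * missing / total_rows  # the same, as a percentage

print(pd.DataFrame({"non_null": non_null, "missing": missing, "missing_pct": missing_pct}))
```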

The info() method gives an overview of the dataframe, but one can also ask for the types of the data directly:

In [11]:
surveys_df.dtypes

record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

Dataframes also have a describe() method that provides some basic statistics about the dataset's numeric columns:

In [12]:
surveys_df.describe()



          record_id     month        day         year    plot_id  hindfoot_length     weight
count       35549.0   35549.0    35549.0      35549.0    35549.0          31438.0    32283.0
mean        17775.0  6.474022  16.105966  1990.475231  11.397001        29.287932  42.672428
std    10262.256696  3.396583   8.256691     7.493355   6.799406         9.564759  36.631259
min             1.0       1.0        1.0       1977.0        1.0              2.0        4.0
25%          8888.0       4.0        9.0       1984.0        5.0              NaN        NaN
50%         17775.0       6.0       16.0       1990.0       11.0              NaN        NaN
75%         26662.0       9.0       23.0       1997.0       17.0              NaN        NaN
max         35549.0      12.0       31.0       2002.0       24.0             70.0      280.0
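Two things are worth noticing in this output: the text columns (species_id and sex) are left out, and the count for hindfoot_length and weight is lower than the number of rows because missing values are not counted. The sketch below reproduces both effects on a made-up dataframe.

```python
import pandas as pd

# Made-up sample with one text column and one numeric column with a missing value.
demo_df = pd.DataFrame({
    "species_id": ["NL", "DM", "DM", "PF"],
    "weight": [40.0, None, 44.0, 48.0],
})

stats = demo_df.describe()

# Only the numeric column appears, and the missing weight is left out of the count.
print(stats)
print(list(stats.columns))
```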


Another way of thinking of a dataframe is as a group of Series, one per column, that share the same row index; the column headers are the names of those Series. We can easily select a column using the following syntax:

In [13]:
surveys_df['weight'].head()

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: weight, dtype: float64
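A selected column is a Series object. The sketch below, on a made-up dataframe, checks this and also shows the attribute notation (demo_df.weight), which works only when the column name is a valid Python identifier.

```python
import pandas as pd

# A made-up dataframe, standing in for surveys_df.
demo_df = pd.DataFrame({"record_id": [1, 2, 3], "weight": [None, 40.0, 48.0]})

weight = demo_df["weight"]     # bracket notation, works for any column name
same_weight = demo_df.weight   # attribute notation, for names that are valid identifiers

print(type(weight))
print(weight.index.equals(demo_df.index))  # the column shares the dataframe's row index
print(weight.equals(same_weight))
```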

Passing a list of column names is an easy way to select several columns and place them in a new dataframe:

In [14]:
smaller_surveys_df = surveys_df[['record_id', 'plot_id', 'weight']]
smaller_surveys_df.head()

   record_id  plot_id  weight
0          1        2     NaN
1          2        3     NaN
2          3        2     NaN
3          4        7     NaN
4          5        3     NaN
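The double square brackets are not a special operator: the outer pair selects, the inner pair is an ordinary Python list of column names. A short sketch of the difference, on a made-up dataframe:

```python
import pandas as pd

# A made-up dataframe, standing in for surveys_df.
demo_df = pd.DataFrame({"record_id": [1, 2], "plot_id": [2, 3], "weight": [40.0, 48.0]})

one_column_series = demo_df["weight"]           # a string gives a Series
one_column_frame = demo_df[["weight"]]          # a list gives a DataFrame, even with one name
two_columns = demo_df[["weight", "record_id"]]  # columns come back in the order of the list

print(type(one_column_series).__name__, type(one_column_frame).__name__)
print(list(two_columns.columns))
```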


Of course, we can select rows from our dataframe too, for example all rows that satisfy a condition:

In [15]:
print(surveys_df[surveys_df.weight > 100].head(5))

     record_id  month  day  year  plot_id species_id sex  hindfoot_length  \
356        357     11   12  1977        9         DS   F             50.0   
361        362     11   12  1977        1         DS   F             51.0   
366        367     11   12  1977       20         DS   M             51.0   
376        377     11   12  1977        9         DS   F             48.0   
380        381     11   13  1977       17         DS   F             48.0   

     weight  
356   117.0  
361   121.0  
366   115.0  
376   120.0  
380   118.0  
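The expression inside the brackets is itself a Series of True and False values, one per row, and only the rows marked True are kept. The sketch below, on made-up data, shows this mask and how two conditions can be combined.

```python
import pandas as pd

# Made-up sample; None marks a missing weight.
demo_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "sex": ["F", "M", "F", "M"],
    "weight": [117.0, None, 40.0, 115.0],
})

mask = demo_df["weight"] > 100   # boolean Series; a missing weight compares as False
heavy = demo_df[mask]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
heavy_females = demo_df[(demo_df["weight"] > 100) & (demo_df["sex"] == "F")]

print(mask.tolist())
print(heavy)
print(heavy_females)
```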


The data we have been working with so far, surveys.csv, contains two columns that reference data available in other tables (CSV files). The columns "plot_id" and "species_id" are made up of IDs (identifiers) that identify rows in two additional files: plots.csv and species.csv.

Let's load the two files into Pandas as dataframes and have a look at them:

In [20]:
species_df = pd.read_csv("data/species.csv")
species_df.head()

   species_id             genus          species    taxa
0          AB        Amphispiza        bilineata    Bird
1          AH  Ammospermophilus          harrisi  Rodent
2          AS        Ammodramus       savannarum    Bird
3          BA           Baiomys          taylori  Rodent
4          CB   Campylorhynchus  brunneicapillus    Bird


Here we see, just as in our initial dataframe, a column with the species_id and some other columns that contain data about the species.

Combining two datasets on the basis of a shared ID is often referred to as a join. In Pandas, however, the function we use for this is called "merge"; its "on" argument names the shared column and "how" selects the type of join:

In [22]:
pd.merge(surveys_df, species_df, on="species_id", how="inner").head(5)

   record_id  month  day  year  plot_id species_id sex  hindfoot_length  weight    genus   species    taxa
0          1      7   16  1977        2         NL   M             32.0     NaN  Neotoma  albigula  Rodent
1          2      7   16  1977        3         NL   M             33.0     NaN  Neotoma  albigula  Rodent
2         22      7   17  1977       15         NL   F             31.0     NaN  Neotoma  albigula  Rodent
3         38      7   17  1977       17         NL   M             33.0     NaN  Neotoma  albigula  Rodent
4         72      8   19  1977        2         NL   M             31.0     NaN  Neotoma  albigula  Rodent
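The "how" argument decides what happens to rows that have no match in the other table. The sketch below compares an inner and a left join on two tiny, made-up tables; the species codes and genus names are taken from the previews above, and the record without a species is an assumption added to show the difference.

```python
import pandas as pd

# Made-up survey records; the third record has no species_id.
surveys = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "AH", None],
})

# A small lookup table; "AB" does not occur in the survey records.
species = pd.DataFrame({
    "species_id": ["NL", "AH", "AB"],
    "genus": ["Neotoma", "Ammospermophilus", "Amphispiza"],
})

# Inner join: only records whose species_id occurs in both tables.
inner = pd.merge(surveys, species, on="species_id", how="inner")

# Left join: all survey records are kept; unmatched ones get NaN for genus.
left = pd.merge(surveys, species, on="species_id", how="left")

print(inner)
print(left)
```

With an inner join, survey records that lack a species_id drop out of the result, so it is worth comparing the number of rows before and after a merge.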
