# Short Introduction to Programming in Python

---
teaching: 0
exercises: 0
questions:
    - "What is Python?"
    - "Why should I learn Python?"
objectives:
    - "Describe the advantages of using programming vs. completing repetitive tasks by hand."
    - "Define the following data types in Python: strings, integers, and floats."
    - "Perform mathematical operations in Python using basic operators."
    - "Define the following as it relates to Python: lists, tuples, and dictionaries."
---

In [1]:
2 + 2

4

## Introduction to Python built-in data types

### Strings, integers and floats

One of the most basic things we can do in Python is assign values to variables:

In [2]:
text = "Data Carpentry"  # An example of a string
number = 42  # An example of an integer
pi_value = 3.1415  # An example of a float

Here we've assigned data to the variables `text`, `number` and `pi_value`,
using the assignment operator `=`. To review the value of a variable, we
can type the name of the variable into the interpreter and press <kbd>Return</kbd>:

In [3]:
text

'Data Carpentry'

Everything in Python has a type. To get the type of something, we can pass it
to the built-in function `type`:

In [4]:
type(text)

str

In [5]:
type(number)

int

In [6]:
type(6.02)

float

The variable `text` is of type `str`, short for "string". Strings hold
sequences of characters, which can be letters, numbers, punctuation
or more exotic forms of text (even emoji!).

We can also see the value of something using another built-in function, `print`:

In [7]:
print(text)

Data Carpentry


In [8]:
print(11)

11


This may seem redundant, but in fact it's the only way to display output in a script:

```python
# A Python script file
# Comments in Python start with #
# The next line assigns the string "Data Carpentry" to the variable "text".
text = "Data Carpentry"
# The next line does nothing!
text
# The next line uses the print function to print out the value we assigned to "text"
print(text)
```

*Running the script*
```bash
$ python example.py
Data Carpentry
```

Notice that "Data Carpentry" is printed only once.

**Tip**: `print` and `type` are built-in functions in Python. Later in this
lesson, we will introduce methods and user-defined functions. The Python
documentation is excellent for reference on the differences between them.

### Operators

We can perform mathematical calculations in Python using the basic operators
 `+, -, /, *, %, **`:

In [9]:
2 + 2  # Addition

4

In [10]:
6 * 7  # Multiplication

42

In [11]:
2 ** 16  # Power

65536

In [12]:
13 % 5  # Modulo

3
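
The operator list above includes `/`, which the examples do not demonstrate. As a small supplementary sketch (the variable names are made up for illustration): in Python 3, `/` always returns a float, while the related floor-division operator `//`, which is not listed above, rounds the result down to a whole number.

```python
# Division with `/` always produces a float in Python 3.
quotient = 7 / 2
print(quotient)        # 3.5

# `//` is floor division: it rounds the result down to a whole number.
floor_quotient = 7 // 2
print(floor_quotient)  # 3

# Together with the modulo operator `%`, floor division
# reconstructs the original number.
remainder = 7 % 2
print(floor_quotient * 2 + remainder)  # 7
```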

We can also use comparison operators (`<, >, ==, !=, <=, >=`) and
logical operators (`and, or, not`). The data type returned by these
operations is called a _boolean_.

In [13]:
3 > 4

False

In [14]:
True and True

True

In [15]:
True or False

True
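
Comparisons and logical operators are usually combined. The following is a minimal sketch with a made-up variable (`weight` is hypothetical and not part of the lesson data at this point):

```python
weight = 42  # hypothetical value, in grams

# Comparisons return booleans, which `and`, `or`, and `not` combine.
is_light = weight < 5
is_medium = weight >= 5 and weight <= 100

print(is_light)         # False
print(is_medium)        # True
print(not is_light)     # True
print(type(is_medium))  # <class 'bool'>
```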

## Sequential types: Lists and Tuples

### Lists

**Lists** are a common data structure to hold an ordered sequence of
elements. Each element can be accessed by an index.  Note that Python
indexes start with 0 instead of 1:

In [16]:
numbers = [1, 2, 3]
numbers[0]

1

A `for` loop can be used to access the elements in a list or other Python data
structure one at a time:

In [17]:
for num in numbers:
    print(num)

1
2
3


**Indentation** is very important in Python. Note that the second line in the
example above is indented. This is Python's way of marking a block of code:
the indented lines form the body of the `for` loop and are run once for every
element. [Note: in the plain Python interpreter, three chevrons `>>>` indicate
the interactive prompt and three dots `...` the prompt for continuation lines;
you do not type `>>>` or `...`.]
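
To make the role of indentation concrete, here is a small sketch (the variable `total` is made up for illustration) in which one line belongs to the loop body and another does not:

```python
numbers = [1, 2, 3]
total = 0
for num in numbers:
    # Indented lines belong to the loop body and run once per element.
    total = total + num
# This line is not indented, so it runs once, after the loop has finished.
print(total)  # 6
```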

To add elements to the end of a list, we can use the `append` method. Methods
are a way to interact with an object (a list, for example). We can invoke a
method using the dot `.` followed by the method name and a list of arguments
in parentheses. Let's look at an example using `append`:

In [18]:
numbers.append(4)
print(numbers)

[1, 2, 3, 4]


To find out what methods are available for an
object, we can use the built-in `help` function. In IPython and Jupyter, a
shorthand is to prefix the name of the object with `?`:

In [19]:
?numbers

Type:        list
String form: [1, 2, 3, 4]
Length:      4
Docstring:  
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
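
The `?` prefix only works in IPython and Jupyter. As a sketch of an alternative that also works in a plain Python script, the built-in `dir` function lists the names available on an object (the variable `public_methods` is made up for illustration):

```python
numbers = [1, 2, 3, 4]

# `dir` returns the names of all attributes and methods of an object.
# Filtering out names that start with an underscore leaves the public methods.
public_methods = [name for name in dir(numbers) if not name.startswith('_')]
print(public_methods)

# Every method carries its own documentation string;
# `help(numbers.append)` shows the same text interactively.
print(numbers.append.__doc__)
```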


### Tuples

A tuple is similar to a list in that it's an ordered sequence of elements.
However, tuples cannot be changed once created (they are "immutable"). Tuples
are created by placing comma-separated values inside parentheses `()`.

In [20]:
# Tuples use parentheses
a_tuple = (1, 2, 3)
another_tuple = ('blue', 'green', 'red')
# Note: lists use square brackets
a_list = [1, 2, 3]

> ## Challenge - Tuples
> 1. What happens when you type `a_tuple[2] = 5` vs `a_list[1] = 5` ?
> 2. Type `type(a_tuple)` into Python - what is the object type?
>

## Dictionaries

A **dictionary** is a container that holds pairs of objects - keys and values.

In [21]:
translation = {'one': 1, 'two': 2}
translation['one']

1

Dictionaries work a lot like lists - except that you index them with *keys*.
You can think of a key as a name or unique identifier for a value
in the dictionary. Keys can only have particular types - they have to be
"hashable". Strings and numeric types are acceptable, but lists aren't.

In [22]:
rev = {1: 'one', 2: 'two'}
rev[1]

'one'

In [23]:
bad = {[1, 2, 3]: 3}

TypeError: unhashable type: 'list'

In Python, a "Traceback" is a multi-line error block printed out for the
user.
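
Lists cannot be keys because they are mutable. A tuple, by contrast, is immutable and hashable (as long as its elements are), so it can take the place of the list in the failing example. A minimal sketch, with a made-up dictionary name:

```python
# A tuple of numbers is hashable, so it is accepted as a dictionary key.
plot_counts = {(1, 2, 3): 3}
print(plot_counts[(1, 2, 3)])  # 3

# Strings and numbers can be mixed freely as keys in the same dictionary.
mixed = {'one': 1, 2: 'two'}
print(mixed[2])  # two
```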

To add an item to the dictionary we assign a value to a new key:

In [24]:
rev = {1: 'one', 2: 'two'}
rev[3] = 'three'
rev

{1: 'one', 2: 'two', 3: 'three'}

Using `for` loops with dictionaries is a little more complicated. We can do
this in two ways:

In [25]:
for key, value in rev.items():
    print(key, '->', value)

1 -> one
2 -> two
3 -> three


In [26]:
for key in rev: # or `rev.keys()`
    print(key, '->', rev[key])

1 -> one
2 -> two
3 -> three


> ## Challenge - Can you do reassignment in a dictionary?
>
> 1. First check what `rev` is right now (remember `rev` is the name of our dictionary).
>
>    Type:
> ```python
> >>> rev
> ```
>
> 2. Try to reassign the second value (in the *key value pair*) so that it no longer reads "two" but instead reads "apple-sauce".
>
> 3. Now display `rev` again to see if it has changed.
>

It is important to note that in older versions of Python (before 3.7),
dictionaries were "unordered" and did not remember the sequence of their items
(i.e. the order in which key:value pairs were added to the dictionary). In
Python 3.7 and later, dictionaries keep their items in insertion order, so
loops over a dictionary return the items in the order in which they were added.
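
A quick way to check how your own Python version orders dictionary items (Python 3.7 and later guarantee insertion order) is sketched below; the keys are deliberately inserted out of numeric order:

```python
rev = {}
rev[3] = 'three'
rev[1] = 'one'
rev[2] = 'two'

# On Python 3.7 and later, keys come back in the order they were inserted,
# not in sorted order.
print(list(rev))  # [3, 1, 2]

# Use `sorted` when a particular order is needed regardless of insertion order.
for key in sorted(rev):
    print(key, '->', rev[key])
```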

## Functions

Defining a section of code as a function in Python is done using the `def`
keyword. For example a function that takes two arguments and returns their sum
can be defined as:

In [27]:
def add_function(a, b):
    result = a + b
    return result

z = add_function(20, 22)
print(z)

42


# Manipulating and analyzing data with pandas

## Lesson preamble

### Learning objectives

- Describe what a data frame is.
- Load external data from a .csv file into a data frame with pandas.
- Summarize the contents of a data frame with pandas.
- Learn to use the data frame methods `head()`, `info()`, and `describe()`, the indexer `loc[]`, and the attributes `shape`, `columns`, and `index`.

### Lesson outline

- Data set background (10 min)
- What are data frames (15 min)
- Data wrangling with pandas (40 min)

---

## Dataset background

Today, we will be working with real data from a longitudinal study of the
species abundance in the Chihuahuan desert ecosystem near Portal, Arizona, USA.
This study includes observations of plants, ants, and rodents from 1977 - 2002,
and has been used in over 100 publications. More information is available in
[the abstract of this paper from 2009](http://onlinelibrary.wiley.com/doi/10.1890/08-1222.1/full).
There are several datasets available related to this study, and we will be
working with datasets that have been preprocessed by
[Data Carpentry](https://www.datacarpentry.org) to facilitate teaching. These are made
available online as *The Portal Project Teaching Database*, both at the [Data
Carpentry website](http://www.datacarpentry.org/ecology-workshop/data/), and on
[Figshare](https://figshare.com/articles/Portal_Project_Teaching_Database/1314459/6).
Figshare is a great place to publish data, code, figures, and more openly to
make them available for other researchers and to communicate findings that are
not part of a longer paper.

### Presentation of the survey data

We are studying the species and weight of animals caught in plots in our study
area. The dataset is stored as a comma separated value (CSV) file. Each row
holds information for a single animal, and the columns represent:

| Column           | Description                        |
|------------------|------------------------------------|
| record_id        | unique id for the observation      |
| month            | month of observation               |
| day              | day of observation                 |
| year             | year of observation                |
| plot_id          | ID of a particular plot            |
| species_id       | 2-letter code                      |
| sex              | sex of animal ("M", "F")           |
| hindfoot_length  | length of the hindfoot in mm       |
| weight           | weight of the animal in grams      |
| genus            | genus of animal                    |
| species          | species of animal                  |
| taxa             | e.g. rodent, reptile, bird, rabbit |
| plot_type        | type of plot                       |

To read the data into Python, we are going to use a function called `read_csv`. This
function is contained in a Python package called
[`pandas`](https://pandas.pydata.org/). As mentioned previously, Python packages are a bit like browser
extensions: they are not essential, but can provide nifty functionality. We will
go through Python packages in general and which ones are good for data analyses in
detail later in this lecture. Now, let's import `pandas`.

In [29]:
import pandas as pd

`pandas` can read CSV files saved on the computer or directly from a URL.

In [30]:
surveys = pd.read_csv('https://ndownloader.figshare.com/files/2292169')

To view the result, you can simply type `surveys` in a cell and run it, just like when viewing the content of any variable in Python.

This is how a data frame is displayed in the JupyterLab Notebook. Although the data frame itself just consists of the values, the Notebook knows that this is a data frame and displays it in a nice tabular format (by adding HTML decorators). It also adds some cosmetic conveniences, such as a bold font for the column and row names, alternating grey and white zebra stripes for the rows, and highlighting of the row the mouse pointer moves over.

## What are data frames?

A data frame is the representation of data in a tabular format, similar to how data is often arranged in spreadsheets. The data is rectangular, meaning that all rows have the same number of columns and all columns have the same number of rows. Data frames are the *de facto* data structure for most tabular data, and what we use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by the function `read_csv()`, in other words, when importing spreadsheets from your hard drive (or the web).
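
Creating a data frame by hand is rarely needed for real data, but it is handy for small experiments. The following is a minimal sketch using made-up values (the name `toy` and its contents are illustrative only), where each dictionary key becomes a column:

```python
import pandas as pd

# A small, made-up data frame built from a dictionary:
# each key becomes a column name and each list becomes that column's values.
toy = pd.DataFrame({
    'species_id': ['NL', 'PF', 'RM'],
    'weight': [218.0, 4.0, 11.0],
})

print(toy)
print(toy.shape)  # (3, 2)
```

Because the data is rectangular, all lists in the dictionary must have the same length.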

As the display of `surveys` shows, the default is to display the first and last 30 rows and truncate everything in between, as indicated by the ellipsis (`...`). (The exact number of rows shown depends on the pandas version and display settings.) Although it is truncated, this output is still quite space consuming. To glance at how the data frame looks, it is sufficient to display only the top (the first 5 rows) using the `head()` method.

In [31]:
surveys.head()

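|   | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
|---|-----------|-------|-----|------|---------|------------|-----|-----------------|--------|-------|---------|------|-----------|
| 0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control |
| 1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control |
| 2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
| 3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
| 4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |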


Methods are very similar to functions; the main difference is that they belong to an object (above, the method `head()` belongs to the data frame `surveys`). Methods operate on the object they belong to, which is why we can call the method with empty parentheses, without any arguments. Compare this with the function `type()` that was introduced previously.

In [32]:
type(surveys)

pandas.core.frame.DataFrame

Here, `surveys` is explicitly passed as an argument to `type()`. An immediately tangible advantage of methods is that they simplify tab completion. Just type the name of the data frame, a period, and then hit tab to see all the relevant methods for that data frame instead of fumbling around with all the available functions in Python and figuring out which ones operate on data frames and which do not. Methods also facilitate readability when we chain many operations together, which will be shown in detail later.

The columns in a data frame can contain data of different types (e.g., integers, floats, and objects, which include strings, lists, dictionaries, etc.). General information about the data frame (including the column data types) can be obtained with the `info()` method.

In [33]:
surveys.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34786 entries, 0 to 34785
Data columns (total 13 columns):
record_id          34786 non-null int64
month              34786 non-null int64
day                34786 non-null int64
year               34786 non-null int64
plot_id            34786 non-null int64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             32283 non-null float64
genus              34786 non-null object
species            34786 non-null object
taxa               34786 non-null object
plot_type          34786 non-null object
dtypes: float64(2), int64(5), object(6)
memory usage: 3.5+ MB


The information includes the total number of rows and columns, the number of non-null observations, the column data types, and the memory (RAM) usage. The number of non-null observations is not the same for all columns, which means that some columns contain null (or NA) values representing missing information. We will get into this in detail later.
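
The non-null counts reported by `info()` can also be computed directly. The sketch below uses a small made-up data frame (`toy` and its values are illustrative only) so that the result is easy to verify by eye:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in `sex` and one in `weight`.
toy = pd.DataFrame({
    'sex': ['M', None, 'F'],
    'weight': [218.0, np.nan, 4.0],
})

# `count()` gives the number of non-null observations per column,
# the same numbers that `info()` reports.
print(toy.count())

# `isna().sum()` gives the complementary number of missing values per column.
print(toy.isna().sum())
```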

After reading in a data frame, `head()` and `info()` are two of the most useful methods to get an idea of the structure of this data frame. There are many additional methods and attributes that can facilitate the understanding of what a data frame contains:

- Size:
    - `surveys.shape` - a tuple with the number of rows as the first element
      and the number of columns as the second element
    - `surveys.shape[0]` - the number of rows
    - `surveys.shape[1]` - the number of columns

- Content:
    - `surveys.head()` - shows the first 5 rows
    - `surveys.tail()` - shows the last 5 rows

- Names:
    - `surveys.columns` - returns the names of the columns (also called variable names)
    - `surveys.index` - returns the names of the rows (referred to as the index in pandas)

- Summary:
    - `surveys.info()` - column names and data types, number of observations, and memory consumption
    - `surveys.describe()` - summary statistics for each numeric column

Method calls end with parentheses. Words without trailing parentheses are called attributes; they hold a value, so think of them as variables that belong to the object. When an attribute is accessed, it just returns its value, like a variable would. When a method is called, it first performs a computation and then returns the resulting value. For example, the number of rows and columns of a data frame is very commonly needed, so it is available directly through the `shape` attribute rather than through a method call.
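
The difference between the two can be tried out on a small made-up data frame (the name `toy` and its values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({'weight': [218.0, 4.0, 11.0]})  # made-up values

# `shape` is an attribute: no parentheses, it simply returns a value.
print(toy.shape)  # (3, 1)

# `head` is a method: the parentheses call it, and it accepts arguments.
print(toy.head(2))

# Without parentheses the method is not called;
# the name only refers to the method object itself.
print(callable(toy.head))  # True
```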

#### Challenge

Based on the output of `surveys.info()`, can you answer the following questions?

* What is the class of the object `surveys`?
* How many rows and how many columns are in this object?
* Why is there not the same number of rows (observations) for each column?

```python
# Answers
#
# * class: `pandas.core.frame.DataFrame` (a data frame)
# * how many rows: 34786, how many columns: 13
# * some values are NA, for example when an animal escaped before it was measured; we will go more into this later
```

Reading the data directly from a URL is convenient. However, it is good practice to keep a copy of the data stored locally on your computer in case you want to do offline analyses,
the online version of the file changes, or the file is taken down. You can either
download the data manually or just save the `surveys` data frame with `to_csv()`.

In [35]:
# index=False because we don't want the index in the csv file.
surveys.to_csv('surveys.csv', index=False)

We only have to do this save step the first time we download the data. From now on, we can simply load the data by specifying the local path instead of the URL.

In [36]:
surveys = pd.read_csv('surveys.csv')
surveys.head()

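|   | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
|---|-----------|-------|-----|------|---------|------------|-----|-----------------|--------|-------|---------|------|-----------|
| 0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control |
| 1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control |
| 2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
| 3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
| 4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |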
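
The `index=False` argument used when saving matters for the round trip. The sketch below illustrates why, using a tiny made-up data frame and an in-memory text buffer (`io.StringIO`) instead of a file on disk; the names `toy`, `with_index`, and `without_index` are illustrative only:

```python
import io

import pandas as pd

toy = pd.DataFrame({'record_id': [1, 72], 'species_id': ['NL', 'NL']})  # made-up subset

# With the default `index=True`, the row names are written as an extra,
# unnamed first column. Without a path, `to_csv` returns the text as a string.
with_index = toy.to_csv()
# With `index=False`, only the data columns are written.
without_index = toy.to_csv(index=False)

# Reading the first version back produces a spurious 'Unnamed: 0' column.
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())
# Reading the second version back gives the original columns.
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
```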


### Indexing and subsetting data frames

The survey data frame has rows and columns (it has 2 dimensions). To extract specific data from it (also referred to as "subsetting"), columns can be called by name.

In [37]:
surveys['species_id'].head() # Using `head` just to limit the output.

0    NL
1    NL
2    NL
3    NL
4    NL
Name: species_id, dtype: object

The JupyterLab Notebook (technically, the underlying IPython interpreter) knows about the columns in your data frame, so you can take advantage of tab autocompletion to get the correct column name. 

Another syntax that is often used to specify column names is `.<column_name>`.

In [38]:
surveys.species_id.head()

0    NL
1    NL
2    NL
3    NL
4    NL
Name: species_id, dtype: object

Using brackets is clearer and also allows for passing multiple columns as a list, so this tutorial will stick to that.

In [39]:
surveys[['species_id', 'record_id']].head()

|   | species_id | record_id |
|---|------------|-----------|
| 0 | NL | 1 |
| 1 | NL | 72 |
| 2 | NL | 224 |
| 3 | NL | 266 |
| 4 | NL | 349 |


The output is displayed a bit differently this time. The reason is that in the earlier cells, where only one column (`species_id`) was selected, pandas technically returned a `Series`, not a `DataFrame`. This can be confirmed by using `type` as previously.

In [40]:
type(surveys.species_id.head())

pandas.core.series.Series

In [41]:
type(surveys[['species_id', 'record_id']].head())

pandas.core.frame.DataFrame

So, every individual column is actually a `Series` and together they constitute a `DataFrame`. This introductory tutorial will not make any further distinction between a `Series` and a `DataFrame`, and many of the analysis techniques used here will apply to both series and data frames. If you prefer how the data frame is displayed, you can convert the `Series` to a `DataFrame` with `to_frame`.

In [42]:
type(surveys.species_id.head().to_frame())

pandas.core.frame.DataFrame

In [43]:
surveys.species_id.head().to_frame()

|   | species_id |
|---|------------|
| 0 | NL |
| 1 | NL |
| 2 | NL |
| 3 | NL |
| 4 | NL |


To select specific rows instead of columns, we can use the `loc[]` (location) syntax. The example below selects the row where the index name (the row name) equals `4`. The index names in this data frame are unique, so specifying one name to `loc[]` returns exactly one row.

In [44]:
surveys.loc[4]

record_id               349
month                    11
day                      12
year                   1977
plot_id                   2
species_id               NL
sex                     NaN
hindfoot_length         NaN
weight                  NaN
genus               Neotoma
species            albigula
taxa                 Rodent
plot_type           Control
Name: 4, dtype: object

Square brackets are used instead of parentheses to stay consistent with the indexing with square brackets for Python lists and NumPy arrays. The data frame index does not have to consist of consecutive integers, and `loc[]` can be used to reference a named row via a string. If it is desired to reference rows by their index *position* rather than their index *name*, use `iloc[]`.
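
The difference between index names and index positions is easiest to see when the index does not consist of integers. A minimal sketch with a made-up data frame (`toy`, its string index, and its values are illustrative only):

```python
import pandas as pd

# Made-up data frame whose index consists of strings
# rather than consecutive integers.
toy = pd.DataFrame(
    {'weight': [218.0, 4.0, 11.0]},
    index=['a', 'b', 'c'],
)

print(toy.loc['b'])  # select by index name
print(toy.iloc[1])   # select by index position; this is the same row
```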

`loc[]` can also select a range of rows with slice syntax (`start:stop`). Unlike slicing of Python lists, a `loc[]` slice includes both the start and the stop name.

In [45]:
surveys.loc[2:4] # As a convenience row slicing can also be done in brackets without loc.

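|   | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
|---|-----------|-------|-----|------|---------|------------|-----|-----------------|--------|-------|---------|------|-----------|
| 2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
| 3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |
| 4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control |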
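
Note that the slice `2:4` returned three rows here. The sketch below contrasts this with list slicing and with the position-based `iloc[]`, using made-up values (`values` and `toy` are illustrative only):

```python
import pandas as pd

values = [1, 72, 224, 266, 349]             # made-up list
toy = pd.DataFrame({'record_id': values})   # same values as a data frame

# List slicing excludes the end position.
print(values[2:4])         # [224, 266]

# `iloc[]` slices by position and follows the same rule.
print(len(toy.iloc[2:4]))  # 2

# `loc[]` slices by index name and includes both endpoints.
print(len(toy.loc[2:4]))   # 3
```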


And a combination of columns and rows.

In [46]:
surveys.loc[2:4, 'record_id']

2    224
3    266
4    349
Name: record_id, dtype: int64

In [47]:
surveys.loc[[2, 4, 7], ['species', 'record_id']]

|   | species | record_id |
|---|---------|-----------|
| 2 | albigula | 224 |
| 4 | albigula | 349 |
| 7 | albigula | 506 |


#### Challenge

1. Create a data frame (`surveys_200`) containing only the observations from
   the row with index name 200 of the `surveys` dataset.

2. Notice how `shape[0]` gave you the number of rows in a data frame?

     * Use that number to pull out just that last row in the data frame.
     * Compare that with what you see as the last row using `tail()` to make
       sure it's meeting expectations.
     * Pull out that last row using `shape[0]` instead of the row number.
     * Create a new data frame object (`surveys_last`) from that last row.

3. Use `shape[0]` to extract the row that is in the middle of the data
   frame. Store the content of this row in an object named `surveys_middle`.

4. Use slice notation with `loc[]` to reproduce the behavior of
   `surveys.head()`, keeping just the first five rows of the surveys
   dataset.

```python
# Answers
# Passing a list to `loc[]` returns a data frame rather than a series.
surveys_200 = surveys.loc[[200]]
# The index starts at 0, so the last row has the name `shape[0] - 1`.
surveys_last = surveys.loc[[surveys.shape[0] - 1]]
# `//` is floor division, which gives a whole number usable as an index name.
surveys_middle = surveys.loc[[surveys.shape[0] // 2]]
# `loc[]` slices include the end name, so `:4` returns rows 0 through 4.
surveys_head = surveys.loc[:4]
```

The `describe()` method was mentioned above as a way of retrieving summary statistics of a data frame. Together with `info()` and `head()` this is often a good place to start exploratory data analysis as it gives a nice overview of the numeric variables in the data set.

In [48]:
surveys.describe()

|       | record_id | month | day | year | plot_id | hindfoot_length | weight |
|-------|-----------|-------|-----|------|---------|-----------------|--------|
| count | 34786.0 | 34786.0 | 34786.0 | 34786.0 | 34786.0 | 31438.0 | 32283.0 |
| mean | 17804.204421 | 6.473725 | 16.095987 | 1990.495832 | 11.343098 | 29.287932 | 42.672428 |
| std | 10229.682311 | 3.398384 | 8.249405 | 7.468714 | 6.794049 | 9.564759 | 36.631259 |
| min | 1.0 | 1.0 | 1.0 | 1977.0 | 1.0 | 2.0 | 4.0 |
| 25% | 8964.25 | 4.0 | 9.0 | 1984.0 | 5.0 | 21.0 | 20.0 |
| 50% | 17761.5 | 6.0 | 16.0 | 1990.0 | 11.0 | 32.0 | 37.0 |
| 75% | 26654.75 | 10.0 | 23.0 | 1997.0 | 17.0 | 36.0 | 48.0 |
| max | 35548.0 | 12.0 | 31.0 | 2002.0 | 24.0 | 70.0 | 280.0 |


A common next step would be to plot the data to explore relationships between different variables, but before getting into plotting, it is beneficial to elaborate on the data frame object and several of its common operations.

An often desired outcome is to select a subset of rows matching a criterion, e.g. which observations have a weight under 5 grams. To do this, the "less than" comparison operator that was introduced previously can be used.

In [49]:
surveys['weight'] < 5

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
34756    False
34757    False
34758    False
34759    False
34760    False
34761    False
34762    False
34763    False
34764    False
34765    False
34766    False
34767    False
34768    False
34769    False
34770    False
34771    False
34772    False
34773    False
34774    False
34775    False
34776    False
34777    False
34778    False
34779    False
34780    False
34781    False
34782    False
34783    False
34784    False
34785    False
Name: weight, Length: 34786, dtype: bool

The result is a boolean array of 34786 observations, the same length as the data frame. This array has one value for every row in the data frame, indicating whether it is `True` or `False` that this row has a value below 5 for the weight variable. This boolean array can be used together with the `loc[]` indexer to select only those observations from the data frame!

In [50]:
surveys.loc[surveys['weight'] < 5]

|       | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type |
|-------|-----------|-------|-----|------|---------|------------|-----|-----------------|--------|-------|---------|------|-----------|
| 2428 | 4052 | 4 | 5 | 1981 | 3 | PF | F | 15.0 | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
| 2453 | 7084 | 11 | 22 | 1982 | 3 | PF | F | 16.0 | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
| 4253 | 28126 | 6 | 28 | 1998 | 15 | PF | M | NaN | 4.0 | Perognathus | flavus | Rodent | Long-term Krat Exclosure |
| 4665 | 9909 | 1 | 20 | 1985 | 15 | RM | F | 15.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
| 6860 | 9853 | 1 | 19 | 1985 | 17 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Control |
| 21224 | 4290 | 4 | 6 | 1981 | 4 | PF | NaN | NaN | 4.0 | Perognathus | flavus | Rodent | Control |
| 21674 | 29906 | 10 | 10 | 1999 | 4 | PP | M | 21.0 | 4.0 | Chaetodipus | penicillatus | Rodent | Control |
| 24191 | 8736 | 12 | 8 | 1983 | 19 | RM | M | 17.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
| 24200 | 9799 | 1 | 19 | 1985 | 19 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Long-term Krat Exclosure |
| 25529 | 9794 | 1 | 19 | 1985 | 24 | RM | M | 16.0 | 4.0 | Reithrodontomys | megalotis | Rodent | Rodent Exclosure |
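
The filtering step can be broken down on a small made-up data frame (`toy`, `mask`, and the values are illustrative only). The sketch also shows that rows with a missing weight never match, since a comparison with `NaN` evaluates to `False`:

```python
import pandas as pd

toy = pd.DataFrame({
    'species': ['flavus', 'albigula', 'megalotis'],
    'weight': [4.0, 218.0, float('nan')],
})  # made-up values, the last weight is missing

# One boolean per row; the comparison with NaN gives False.
mask = toy['weight'] < 5
print(mask.tolist())   # [True, False, False]

# True counts as 1, so the sum is the number of matching rows.
print(mask.sum())      # 1

# `loc[]` keeps only the rows where the mask is True.
print(toy.loc[mask])

# `~` inverts the mask, which here also brings back the row with the missing weight.
print(toy.loc[~mask])
```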


As before, this can be combined with selection of a particular set of columns.

In [51]:
surveys.loc[surveys['weight'] < 5, ['weight', 'species']]

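|       | weight | species |
|-------|--------|---------|
| 2428 | 4.0 | flavus |
| 2453 | 4.0 | flavus |
| 4253 | 4.0 | flavus |
| 4665 | 4.0 | megalotis |
| 6860 | 4.0 | megalotis |
| 21224 | 4.0 | flavus |
| 21674 | 4.0 | penicillatus |
| 24191 | 4.0 | megalotis |
| 24200 | 4.0 | megalotis |
| 25529 | 4.0 | megalotis |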


To prevent the output from running off the screen, `head()` can be used just like before.

In [52]:
surveys.loc[surveys['weight'] < 5, ['weight', 'species']].head()

|      | weight | species |
|------|--------|---------|
| 2428 | 4.0 | flavus |
| 2453 | 4.0 | flavus |
| 4253 | 4.0 | flavus |
| 4665 | 4.0 | megalotis |
| 6860 | 4.0 | megalotis |


A new object could be created from this smaller version of the data, by assigning it to a new variable name.

In [53]:
surveys_sml = surveys.loc[surveys['weight'] < 5, ['weight', 'species']]
surveys_sml.head()

|      | weight | species |
|------|--------|---------|
| 2428 | 4.0 | flavus |
| 2453 | 4.0 | flavus |
| 4253 | 4.0 | flavus |
| 4665 | 4.0 | megalotis |
| 6860 | 4.0 | megalotis |


A single expression can also be used to filter for several criteria, either
matching *all* criteria (`&`) or *any* criteria (`|`):

In [54]:
# AND = &
surveys.loc[(surveys['taxa'] == 'Rodent') & (surveys['sex'] == 'F'), ['taxa', 'sex']].head()

|    | taxa | sex |
|----|------|-----|
| 20 | Rodent | F |
| 21 | Rodent | F |
| 22 | Rodent | F |
| 23 | Rodent | F |
| 24 | Rodent | F |


To increase readability, these statements can be put on multiple lines. Anything that is within parentheses or brackets in Python can be continued on the next line.

In [55]:
surveys.loc[(surveys['taxa'] == 'Rodent') &
            (surveys['sex'] == 'F'),
            ['taxa', 'sex']].head()

|    | taxa | sex |
|----|------|-----|
| 20 | Rodent | F |
| 21 | Rodent | F |
| 22 | Rodent | F |
| 23 | Rodent | F |
| 24 | Rodent | F |


With the `|` operator, rows matching either of the supplied criteria are returned.

In [56]:
# OR = |
surveys.loc[(surveys['species'] == 'clarki') |
            (surveys['species'] == 'leucophrys'),
            'species']

10603    leucophrys
24480        clarki
34045    leucophrys
Name: species, dtype: object
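
When several values of the same column should match, chaining many `|` comparisons becomes verbose. As a sketch of an alternative, the pandas `isin` method accepts a list of values; the data frame below is made up for illustration (`toy`, `with_or`, and `with_isin` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    'species': ['clarki', 'albigula', 'leucophrys', 'flavus'],
})  # made-up subset of species names

# Two comparisons combined with `|`. The parentheses are required,
# because `|` binds more tightly than `==`.
with_or = toy.loc[(toy['species'] == 'clarki') |
                  (toy['species'] == 'leucophrys')]

# One `isin` call with a list of accepted values selects the same rows.
with_isin = toy.loc[toy['species'].isin(['clarki', 'leucophrys'])]

print(with_or.equals(with_isin))  # True
```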

#### Challenge

Subset the `surveys` data to include individuals collected before
1995 and retain only the columns `year`, `sex`, and `weight`.

```python
# Answer
surveys.loc[surveys['year'] < 1995, ['year', 'sex', 'weight']]
```

### Creating new columns

A frequent operation when working with data is to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. To create a new column of the weight in kg instead of in grams:

In [57]:
surveys['weight_kg'] = surveys['weight'] / 1000
surveys.head(10)

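|   | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | weight_kg |
|---|-----------|-------|-----|------|---------|------------|-----|-----------------|--------|-------|---------|------|-----------|-----------|
| 0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN | Neotoma | albigula | Rodent | Control | NaN |
| 1 | 72 | 8 | 19 | 1977 | 2 | NL | M | 31.0 | NaN | Neotoma | albigula | Rodent | Control | NaN |
| 2 | 224 | 9 | 13 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
| 3 | 266 | 10 | 16 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
| 4 | 349 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
| 5 | 363 | 11 | 12 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
| 6 | 435 | 12 | 10 | 1977 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
| 7 | 506 | 1 | 8 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |
| 8 | 588 | 2 | 18 | 1978 | 2 | NL | M | NaN | 218.0 | Neotoma | albigula | Rodent | Control | 0.218 |
| 9 | 661 | 3 | 11 | 1978 | 2 | NL | NaN | NaN | NaN | Neotoma | albigula | Rodent | Control | NaN |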


The first few rows of the output are full of `NA`s. To remove those, use the `dropna()` method of the data frame.

In [58]:
surveys.dropna().head(10)

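|    | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight | genus | species | taxa | plot_type | weight_kg |
|----|-----------|-------|-----|------|---------|------------|-----|-----------------|--------|-------|---------|------|-----------|-----------|
| 11 | 845 | 5 | 6 | 1978 | 2 | NL | M | 32.0 | 204.0 | Neotoma | albigula | Rodent | Control | 0.204 |
| 13 | 1164 | 8 | 5 | 1978 | 2 | NL | M | 34.0 | 199.0 | Neotoma | albigula | Rodent | Control | 0.199 |
| 14 | 1261 | 9 | 4 | 1978 | 2 | NL | M | 32.0 | 197.0 | Neotoma | albigula | Rodent | Control | 0.197 |
| 17 | 1756 | 4 | 29 | 1979 | 2 | NL | M | 33.0 | 166.0 | Neotoma | albigula | Rodent | Control | 0.166 |
| 18 | 1818 | 5 | 30 | 1979 | 2 | NL | M | 32.0 | 184.0 | Neotoma | albigula | Rodent | Control | 0.184 |
| 19 | 1882 | 7 | 4 | 1979 | 2 | NL | M | 32.0 | 206.0 | Neotoma | albigula | Rodent | Control | 0.206 |
| 20 | 2133 | 10 | 25 | 1979 | 2 | NL | F | 33.0 | 274.0 | Neotoma | albigula | Rodent | Control | 0.274 |
| 21 | 2184 | 11 | 17 | 1979 | 2 | NL | F | 30.0 | 186.0 | Neotoma | albigula | Rodent | Control | 0.186 |
| 22 | 2406 | 1 | 16 | 1980 | 2 | NL | F | 33.0 | 184.0 | Neotoma | albigula | Rodent | Control | 0.184 |
| 24 | 3000 | 5 | 18 | 1980 | 2 | NL | F | 31.0 | 87.0 | Neotoma | albigula | Rodent | Control | 0.087 |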


By default, `.dropna()` removes all rows that have an NA value in any of the columns. There are parameters that control how the rows are dropped and which columns should be searched for NAs.
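
Two of these parameters are `subset` and `how`. The sketch below shows their effect on a small made-up data frame (`toy` and its values are illustrative only), counting how many rows remain in each case:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'sex': ['M', None, 'F'],
    'weight': [np.nan, np.nan, 4.0],
})  # made-up values

# Default: drop every row that has an NA in any column.
print(len(toy.dropna()))                # 1

# `subset`: only search the listed columns for NAs.
print(len(toy.dropna(subset=['sex'])))  # 2

# `how='all'`: only drop rows in which every value is NA.
print(len(toy.dropna(how='all')))       # 2
```

Like most data frame methods, `dropna()` returns a new data frame and leaves the original unchanged.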

#### Challenge

Create a new data frame from the `surveys` data that meets the following
criteria: contains only the `species_id` column and a new column called
`hindfoot_half` containing values that are half the `hindfoot_length` values.
In this `hindfoot_half` column, there are no `NA`s and all values are less
than 30.

**Hint**: think about how the commands should be ordered to produce this data frame!

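```python
# Answer
# Drop rows with a missing hindfoot length, then compute the new column.
surveys_hindfoot_half = surveys.dropna(subset=['hindfoot_length']).copy()
surveys_hindfoot_half['hindfoot_half'] = surveys_hindfoot_half['hindfoot_length'] / 2
# Keep only the rows below 30 and the two requested columns.
surveys_hindfoot_half = surveys_hindfoot_half.loc[
    surveys_hindfoot_half['hindfoot_half'] < 30,
    ['species_id', 'hindfoot_half']]
```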

# =====================================================
# ======================== END ========================
# =====================================================

## R packages for data analyses
There are certainly many tools built-in to base R which can be used to
understand data, but we are going to use a package called `dplyr` which makes
exploratory data analysis (EDA) particularly intuitive and effective.

First, let's explain the concept of an R package. What we have used so far is
all part of base R (except `read_csv`), together with many more functions.
Everything included in base R is installed on any computer where R is
installed, since it is considered critical for using R, e.g. `c()`, `mean()`,
`+`, `-`, etc. However, since R is an open language, it is easy to develop your
own R package that provides new functionality and submit it to the official
repository for R packages, called CRAN (Comprehensive R Archive Network). CRAN
has thousands of packages, and they cannot all be installed by default, because
the base R installation would then be huge and most people would only use a
fraction of everything installed on their machine. It would be as if you
downloaded the Firefox or Chrome browser and got all extensions and add-ons
installed by default, or as if your phone came with every app ever made for it
already installed when you bought it: quite impractical.

To install a package in R, we use the function `install.packages()`. In this
case, the package `dplyr` is part of a bigger collection of packages called
[`tidyverse`](https://www.tidyverse.org/) (just like Microsoft Word is part of
Microsoft Office), which also contains the `readr` package we installed in the
beginning and many more packages that make exploratory data analyses more
intuitive and effective.

```{r, eval=FALSE}
install.packages('tidyverse')
```

Now all the `dplyr` functions are available to us by prefacing them with
`dplyr::`:

```{r}
dplyr::glimpse(surveys) # `glimpse` is similar to `str`
```

We will be using this package a lot, and it would be a little annoying to have
to type `dplyr::` every time, so we will load it into our current environment.
This needs to be done once for every new R session and makes all functions
accessible without their package prefix, which is very convenient, as long as
you are aware of which function you are using and don't load a function with the
same name from two different packages.

```{r}
# We could also do `library(dplyr)`, but we need the rest of the
# tidyverse packages later, so we might as well import the entire collection.
library('tidyverse')
glimpse(surveys)
```
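
Python has the same trade-off between prefixed and unprefixed names. The sketch below uses only the standard-library modules `math` and `cmath` to illustrate it, including what happens when two modules provide a function with the same name. It is an analogy to the R behaviour described above, not a description of how `library()` works internally.

```python
import math                # functions are reached through the module prefix
from math import sqrt      # sqrt is now also available without a prefix

print(math.sqrt(16.0))     # prefixed call
print(sqrt(16.0))          # unprefixed call to the same function

# Importing a function with the same name from another module silently
# replaces the earlier one: cmath.sqrt returns complex numbers.
from cmath import sqrt

shadowed = sqrt(16.0)
print(type(shadowed))
```

Keeping the prefix (`math.sqrt`) avoids this kind of surprise, at the cost of a little more typing.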

## Data wrangling with dplyr

Wrangling here is used in the sense of maneuvering, managing, controlling, and
turning your data upside down and inside out to look at it from different angles
in order to understand it. The package **`dplyr`** provides easy tools for the
most common data manipulation tasks. It is built to work directly with data
frames, and many common tasks are optimized by being written in a compiled
language (C++), so many operations run much faster than similar tools written
in R. An additional feature is the ability to work directly with data stored in
an external database, such as an SQL database. This is great because it lets
you work with much bigger datasets (100s of GB) than your computer could
normally handle. We will not talk in detail about this in class, but there are
great resources online to learn more (e.g. [this lecture from Data
Carpentry](http://www.datacarpentry.org/R-ecology-lesson/05-r-and-databases.html)).


### Selecting columns and filtering rows

We're going to learn some of the most common **`dplyr`** functions: `select()`,
`filter()`, `mutate()`, `group_by()`, and `summarize()`. To select columns of a
data frame, use `select()`. The first argument to this function is the data
frame (`surveys`), and the subsequent arguments are the columns to keep.

```{r}
select(surveys, plot_id, species_id, weight, year)
```

To choose rows based on specific criteria, use `filter()`:

```{r}
filter(surveys, year == 1995)
```
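
For readers following along in Python, the same two operations can be sketched with pandas indexing. This is a rough counterpart rather than an exact translation, and the small `surveys` table below is invented for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the surveys data (values invented for illustration).
surveys = pd.DataFrame({
    "plot_id": [2, 3, 2, 7],
    "species_id": ["NL", "DM", "PF", "DM"],
    "weight": [204.0, 40.0, 8.0, 42.0],
    "year": [1978, 1995, 1995, 2001],
    "sex": ["M", "F", "F", "M"],
})

# Counterpart of select(): keep only the listed columns.
selected = surveys[["plot_id", "species_id", "weight", "year"]]

# Counterpart of filter(): keep only the rows where the condition is True.
surveys_1995 = surveys[surveys["year"] == 1995]

print(selected.columns.tolist())
print(surveys_1995)
```

A list of column names inside the brackets selects columns, while a boolean condition inside the brackets selects rows.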


### Chaining functions together using pipes

But what if you wanted to select and filter at the same time? There are three
ways to do this: use intermediate steps, nested functions, or pipes. With
intermediate steps, you essentially create a temporary data frame and use that
as input to the next function. This can clutter up your workspace with lots of
objects:

```{r}
temp_df <- select(surveys, plot_id, species_id, weight, year)
filter(temp_df, year == 1995)
```

You can also nest functions (i.e. one function inside of another).
This is handy, but can be difficult to read if too many functions are nested,
because things are evaluated from the inside out.

```{r}
filter(select(surveys, plot_id, species_id, weight, year), year == 1995)
```

The last option, pipes, is a fairly recent addition to R. Pipes let you take
the output of one function and send it directly to the next, which is useful
when you need to do many things to the same dataset. Pipes in R look like `%>%`
and are made available via the `magrittr` package, which is also included in the
`tidyverse`.

```{r}
surveys %>% 
    select(., plot_id, species_id, weight, year) %>% 
    filter(., year == 1995)
```

The `.` refers to the object that is passed from the previous line. In this
example, the data frame `surveys` is passed to the `.` in the `select()`
statement. Then the modified data frame, which is the result of the `select()`
operation, is passed to the `.` in the `filter()` statement. Put more simply:
whatever was the result of the line above the current line will be used in
the current line.

Since it gets a bit tedious to write out all the dots, they can be omitted:
by default, the pipe passes the object on its left as the first argument of the
function on its right. The chunk below gives the same output as the one above:

```{r}
surveys %>% 
    select(plot_id, species_id, weight, year) %>% 
    filter(year == 1995)
```

Another example:

```{r}
surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)
```
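
The three approaches described above (intermediate steps, nesting, and pipes) can also be sketched in Python with pandas, where method chaining and `DataFrame.pipe` play a role similar to `%>%`. The `keep_year` helper and the small `surveys` table are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the surveys data (values invented for illustration).
surveys = pd.DataFrame({
    "plot_id": [2, 3, 2, 7],
    "species_id": ["NL", "DM", "PF", "DM"],
    "weight": [204.0, 40.0, 8.0, 42.0],
    "year": [1978, 1995, 1995, 2001],
    "sex": ["M", "F", "F", "M"],
})

columns = ["plot_id", "species_id", "weight", "year"]


def keep_year(df, year):
    """Hypothetical helper in the style of filter(): keep rows from one year."""
    return df[df["year"] == year]


# 1. Intermediate step: store the selected columns, then filter.
temp_df = surveys[columns]
stepwise = keep_year(temp_df, 1995)

# 2. Nested call: evaluated from the inside out.
nested = keep_year(surveys[columns], 1995)

# 3. Chained: .pipe() passes the data frame on its left as the first
#    argument of keep_year, much like the R pipe does.
chained = (
    surveys
    .loc[:, columns]
    .pipe(keep_year, 1995)
)

# All three approaches produce the same table.
print(chained.equals(stepwise) and chained.equals(nested))
```

The chained version reads top to bottom in the order the operations happen, which is the main readability argument made for pipes above.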