## Recap: Load the data

In [None]:
import pandas as pd
surveys_df = pd.read_csv('../course_materials/data/surveys.csv')
surveys_df.describe()

In addition to learning about characteristics of our dataset as a whole, we may be interested in analyzing parts (subsets) of our data.
For example, we may want to know how heavy our samples are:

In [None]:
surveys_df['weight'].describe()

We can also extract one specific metric if we wish:

In [None]:
surveys_df['weight'].min()
surveys_df['weight'].max()
surveys_df['weight'].mean()
surveys_df['weight'].std()
surveys_df['weight'].count()

## Selecting data using column names

In the [morning session](introduction_to_python_3.ipynb) we saw how to get specific values from dictionaries using keys. We can do the same with DataFrames; in fact, we have already accessed the values in a column by the column name. In this section we will discover how to select values, slices of data and subsets of a DataFrame.
There are two ways of selecting a single column; we have already used the first:

In [None]:
surveys_df['species_id']

In [None]:
surveys_df.species_id

How can we now create a DataFrame that only consists of the two columns *plot_id* and *species_id*?

In [None]:
surveys_df[['plot_id', 'species_id']]

Why the double *[[..]]*? What is the difference between `surveys_df['plot_id']` and `surveys_df[['plot_id']]`? Let us have a closer look:

In [None]:
print(type(surveys_df['plot_id']))
print(type(surveys_df[['plot_id']]))

A DataFrame behaves much like a dictionary, with the column names as keys and a Series of values (indexed by row label) behind each key. `surveys_df['plot_id']` gives us the value behind the key *plot_id*, in our case the Series of plot numbers. When we ask for the values behind *plot_id* **and** *species_id*, we need to give the DataFrame a *list* of column names, as we did with `surveys_df[['plot_id', 'species_id']]`.
When we pass a list of column names to a DataFrame, the result is conceptually the same as what the following code builds by hand, so we do not have to write it ourselves:

In [None]:
col1 = surveys_df['plot_id']
col2 = surveys_df['species_id']
# use the original column names as keys so the result matches surveys_df[['plot_id', 'species_id']]
aggregatedData = pd.DataFrame({'plot_id': col1, 'species_id': col2})
aggregatedData

## Slicing subsets of rows
Slicing using the `[]` operator selects a set of rows and/or columns from a DataFrame. To slice out a set of rows, you use the following syntax: `data[start:stop]`. When slicing in pandas the start bound is included in the output. The stop bound is not included. The slicing stops _before_ the stop bound.
So if you want to select rows 0, 1 and 2 your code would look like this:

In [None]:
surveys_df[0:3]
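
The start/stop rule can be checked on a small example. The sketch below uses a hypothetical toy DataFrame (`toy_df` and its values are made up for illustration, they are not part of the survey data):

```python
import pandas as pd

# Hypothetical toy data standing in for surveys_df
toy_df = pd.DataFrame({'record_id': [1, 2, 3, 4, 5],
                       'weight': [40.0, 48.0, 29.0, 46.0, 36.0]})

# Rows at positions 0, 1 and 2; position 3 (the stop bound) is excluded
first_three = toy_df[0:3]
print(first_three)
print(len(first_three))  # 3, i.e. stop - start
```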

We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing. The respective functions for that are called `loc` (label-based indexing) and `iloc` (integer-based indexing).

Let's have a look at `iloc` first, where we use the integer position of a row and/or column to select it. In the example below we select the first three entries and the columns month, day and year (the second, third and fourth column; remember that indexing starts at 0 in Python). The first range of numbers selects the rows, the second the columns:

In [None]:
# iloc[row slicing, column slicing]
surveys_df.iloc[0:3, 1:4]

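We can achieve the same with `loc`, only this time we use the column labels instead of column indices, so we need to know the names of the columns. Note that, unlike `iloc`, `loc` includes the stop bound of a row slice, so the first three rows are selected with `0:2`:

In [None]:
# loc[row labels, column labels]; the stop label 2 is included
surveys_df.loc[0:2, ['month', 'day', 'year']]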
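
How the two indexers treat the stop bound is easy to get wrong, so here is a small check. It is a sketch on a hypothetical toy DataFrame (`toy_df` is made up and only mimics a few columns of the survey data):

```python
import pandas as pd

# Hypothetical toy data with a default integer index 0..4
toy_df = pd.DataFrame({'record_id': [1, 2, 3, 4, 5],
                       'month': [7, 7, 7, 7, 7],
                       'day': [16, 16, 16, 16, 16],
                       'year': [1977, 1977, 1977, 1977, 1977]})

by_position = toy_df.iloc[0:3, 1:4]                        # positions 0, 1, 2
by_label = toy_df.loc[0:3, ['month', 'day', 'year']]       # labels 0, 1, 2 and 3
by_label_same = toy_df.loc[0:2, ['month', 'day', 'year']]  # labels 0, 1, 2

print(len(by_position))                   # 3
print(len(by_label))                      # 4
print(by_position.equals(by_label_same))  # True
```

With `iloc` the stop position is excluded, with `loc` the stop label is included.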

And there is a third way: in a first step we select the columns by their names with `surveys_df[['month', 'day', 'year']]`. From the resulting DataFrame we then, in a second step, select the first three rows with `[0:3]`. Putting the two steps together, the code looks like this:

In [None]:
surveys_df[['month', 'day', 'year']][0:3]

### Interactive Part
Let us further explore `loc` and `iloc`, as they are more powerful. Have a look at the examples below and predict their outcome before running them.

In [None]:
# Select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]

# What does this do?
surveys_df.loc[0, ['species_id', 'plot_id', 'weight']]

# What happens when you type the code below?
surveys_df.loc[[0, 10, 35549], :]

We can also extract single values from our DataFrame:

In [None]:
# data.iloc[row, column]
surveys_df.iloc[2, 6]

### Summary: Selecting slices, rows and columns
The code below shows four ways of selecting the same column. The first two methods select the column by name. The third method is essentially identical to the first one, as the 6th (index 5) element of the Index ```surveys_df.columns``` is *species_id*. The fourth method uses ```iloc``` to select *all* the rows of the 6th column.

In [None]:
# By name
# --------------------------------------
# Method1
species_id_1 = surveys_df['species_id']

# Method2
species_id_2 = surveys_df.species_id
# --------------------------------------

# By location
# --------------------------------------
# Method3
species_id_3 = surveys_df[surveys_df.columns[5]]

# Method4
species_id_4 = surveys_df.iloc[:,5]
# --------------------------------------
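
To convince ourselves that the four methods really return the same column, we can compare the results with `Series.equals`. This is a sketch on a hypothetical toy DataFrame whose column order mimics the survey data (the values are made up):

```python
import pandas as pd

# Hypothetical toy data; species_id is the 6th column (index 5)
toy_df = pd.DataFrame({'record_id': [1, 2, 3],
                       'month': [7, 7, 7],
                       'day': [16, 16, 16],
                       'year': [1977, 1977, 1977],
                       'plot_id': [2, 3, 2],
                       'species_id': ['NL', 'NL', 'DM']})

by_key = toy_df['species_id']                   # Method 1
by_attribute = toy_df.species_id                # Method 2
by_column_position = toy_df[toy_df.columns[5]]  # Method 3
by_iloc = toy_df.iloc[:, 5]                     # Method 4

print(by_key.equals(by_attribute))        # True
print(by_key.equals(by_column_position))  # True
print(by_key.equals(by_iloc))             # True
```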

<div class="alert alert-block alert-success">
<b>Exercise 3 to 5</b>
    
Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercises 3 to 5.

### Subsetting Data according to user-defined criteria

We can extract subsets of our DataFrame following the general syntax ```data_frame[<condition_on_data>]```, where `<condition_on_data>` is a conditional statement on the DataFrame content itself. You may think of the conditional statement as a question or query you ask of your DataFrame. Here are some examples:

In [None]:
# What are the data collected in the year 2002?
surveys_df[surveys_df.year == 2002]

In [None]:
# What are the data NOT collected in the year 2002?
surveys_df[surveys_df.year != 2002]

In [None]:
# What are the data NOT collected in the year 2002? (different syntax)
surveys_df[~(surveys_df.year == 2002)]
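
It can help to look at the condition on its own: it evaluates to a boolean Series with one `True`/`False` per row, and the DataFrame keeps only the rows marked `True`. A minimal sketch on a hypothetical toy DataFrame (`toy_df` and `mask` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data standing in for surveys_df
toy_df = pd.DataFrame({'year': [2001, 2002, 2002, 2003],
                       'weight': [40.0, 48.0, 29.0, 46.0]})

mask = toy_df.year == 2002   # boolean Series, one value per row
print(mask.tolist())         # [False, True, True, False]

in_2002 = toy_df[mask]       # rows where the mask is True
not_2002 = toy_df[~mask]     # ~ inverts the mask
print(len(in_2002), len(not_2002))  # 2 2
```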

Our filtering conditions can be very specific: they can target different columns in the DataFrame, and they can be combined using the logical operator "&", which means **and**:

In [None]:
# What are the data collected between 2000 and 2002 on females?
surveys_df[(surveys_df.year >= 2000) & (surveys_df.year <= 2002) & (surveys_df.sex == 'F')]

There is also an operator for **or**, written "|".
With it we could, for example, filter for rows with data collected on females in the year 2000 or 2002:
"Give me all data where sex is female and the data was collected in 2000 or 2002".

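The query described above can be sketched with the `|` operator. The example uses a hypothetical toy DataFrame (`toy_df` is made up); note the extra pair of parentheses around the two year conditions, which makes sure the **or** is evaluated before the **and**:

```python
import pandas as pd

# Hypothetical toy data standing in for surveys_df
toy_df = pd.DataFrame({'year': [2000, 2001, 2002, 2002, 2000],
                       'sex': ['F', 'F', 'M', 'F', 'M']})

# sex is F AND (year is 2000 OR year is 2002)
females_2000_or_2002 = toy_df[(toy_df.sex == 'F') &
                              ((toy_df.year == 2000) | (toy_df.year == 2002))]
print(females_2000_or_2002)
```

Only the rows with index 0 and 3 satisfy both parts of the query.
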
The method ```isin()``` allows us to specify a set of "permitted" values for a certain column. Here is another example:

In [None]:
surveys_df[(surveys_df.year == 2000) & (surveys_df.sex == 'F') & (surveys_df.month.isin([1,3,4]))]
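
One way to read `isin()` is as a shorthand for several equality tests joined by `|`, the element-wise **or** operator. The sketch below compares both spellings on a hypothetical toy DataFrame (`toy_df` is made up):

```python
import pandas as pd

# Hypothetical toy data with one row per month value
toy_df = pd.DataFrame({'month': [1, 2, 3, 4, 5]})

with_isin = toy_df[toy_df.month.isin([1, 3, 4])]
with_or = toy_df[(toy_df.month == 1) | (toy_df.month == 3) | (toy_df.month == 4)]

print(with_isin.month.tolist())   # [1, 3, 4]
print(with_isin.equals(with_or))  # True
```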

## DataFrame Cleaning

A simple exploration of our DataFrame showed us that there are columns that contain missing values (NaN). One of the most important preliminary operations of data analysis is cleaning your data set, i.e. dealing with missing or otherwise invalid values. We want to make sure that our data only contains meaningful values.

Now that we have mastered selecting, slicing, and subsetting, we can easily clean our DataFrame with a few lines of code. Let us have a look at the function *isnull*. It is a Pandas function, available through the `pd` alias we created at the beginning with `import pandas as pd`. We can call the function like this:

In [None]:
pd.isnull(4)

In [None]:
pd.isnull([1, 2, 3, '', dict(), None])

We can pass single values or array-like values to the function. The function then checks for each value whether it is `NaN` (Not a Number) or `None`, and returns a boolean (for a single value) or a boolean array.
Note that values like the empty string (a string without any characters in it) or an empty dictionary do not count as `null` values: they have a type and are valid objects, they just do not contain anything.
For our purposes, the `null` values are `NaN` and `None`. When you read tabular data into a DataFrame, empty cells are shown as `NaN`. `None` is the single value of the type *NoneType*, which we will not dive into further in this workshop.

With all that knowledge we can now detect `null` values in the column *weight* and do something about them. Let us have a look at how many `null` values we can find:
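
A short sketch of this behaviour, using `np.nan` from NumPy to create a `NaN` value by hand (the list `values` is made up for illustration):

```python
import numpy as np
import pandas as pd

# An integer, an empty string, an empty dictionary, None and NaN
values = [1, '', dict(), None, np.nan]

flags = pd.isnull(values)   # boolean array, one entry per value
print(flags.tolist())       # [False, False, False, True, True]

print(pd.isnull(np.nan))    # True
print(pd.isnull(''))        # False
```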

In [None]:
print("Length of the full dataframe: ", len(surveys_df))
pd.isnull(surveys_df.weight) # boolean array indicating where null values are found
surveys_df[pd.isnull(surveys_df.weight)] # all lines that have a null value in the column weight
len(surveys_df[pd.isnull(surveys_df.weight)]) # length

As you can see, in our whole dataset 3266 weight values are not usable. We need to do something with those values.

Negative weights would not make sense either. Let's check whether the remaining 32283 values in the *weight* column are positive:

In [None]:
len(surveys_df[surveys_df.weight > 0])

As we see, we have 32283 positive *weight* values. The remaining 3266 values in the *weight* column are not set, so they are `null`. How can we impute these values? Let us have a look at the average weight:

In [None]:
surveys_df.weight.mean()

A smooth run, without errors or warnings. As we said several times, Pandas is a library designed for data analysis, and when performing data analysis it is very common to deal with missing values. In particular, the ```.mean()``` method has an argument called *skipna* that, when set to `True` (the default value, so we do not need to specify it), excludes NaN values. This means that, in this case, Pandas simply ignores the `null` values and performs the computation only on the remaining numeric values.

If we are not happy with the default behaviour of Pandas, we can manually decide which value to assign to cells that contain `null` values. One possible choice is setting them to zero. To do that, we just need to apply the method ```.fillna(<value>)```, where `<value>` is the number we want to substitute for the `null` values (in our case, 0).
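
The effect of *skipna* can be seen on a small example. This is a sketch with a hypothetical Series `weights` (made-up values, not the survey data):

```python
import numpy as np
import pandas as pd

# Hypothetical weights with two missing values
weights = pd.Series([40.0, np.nan, 50.0, np.nan, 60.0])

mean_default = weights.mean()               # skipna=True: NaN values are ignored
mean_with_nan = weights.mean(skipna=False)  # NaN values propagate into the result

print(mean_default)     # 50.0
print(mean_with_nan)    # nan
print(weights.count())  # 3, count() only counts non-null values
```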

In [None]:
cleaned_weight1 = surveys_df.weight.fillna(0)
cleaned_weight_ave1 = cleaned_weight1.mean()
print(cleaned_weight_ave1)

You see that when filling the `null` values with 0, the average weight decreases. This is because the mean is now computed on data that contains many additional zeros.
Aware of this problem, we may now choose a more appropriate value to "fill" our `null` values. How about using the "clean" mean of our first computation?

In [None]:
cleaned_weight2 = surveys_df.weight.fillna(surveys_df.weight.mean())
cleaned_weight_ave2 = cleaned_weight2.mean()
print(cleaned_weight_ave2)

This time we obtain the same result as in our first computation (up to floating-point rounding). This is because we substituted the `null` values with a mean computed excluding the `null` values.
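
Both filling strategies can be compared side by side on a small example. The sketch uses a hypothetical Series `weights` with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical weights with two missing values
weights = pd.Series([40.0, np.nan, 50.0, np.nan, 60.0])

clean_mean = weights.mean()                          # NaN ignored: (40 + 50 + 60) / 3
mean_zero_filled = weights.fillna(0).mean()          # zeros pull the mean down
mean_mean_filled = weights.fillna(clean_mean).mean() # filling with the mean keeps it

print(clean_mean)        # 50.0
print(mean_zero_filled)  # 30.0
print(mean_mean_filled)  # 50.0
```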

<div class="alert alert-block alert-success">
<b>Exercise 6 and 7</b>

Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercises 6 and 7.

## Grouping

We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per site.

As we have seen above we can calculate basic statistics for all records in a single column using the syntax below:

In [None]:
surveys_df['weight'].describe()

If we want to summarize by one or more variables, for example sex, we can use Pandas’ `.groupby()` method. Once we’ve created a groupby object, we can quickly calculate summary statistics by a group of our choice.

In [None]:
grouped_data = surveys_df.groupby('sex')
grouped_data.describe()

The output is a bit overwhelming. Let's just have a look at one statistical value, the mean, to understand what is happening here:

In [None]:
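# numeric_only=True restricts the mean to numeric columns (text columns such as species_id cannot be averaged)
grouped_data.mean(numeric_only=True)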

We see that the data is divided into two groups, one group where the value in the column *sex* equals "F" and another group where the value in the column *sex* equals "M". The statistics are then calculated for all samples in each group, for each of the numeric columns in the DataFrame. Note that samples where *sex* is NaN, as well as NaN values within the columns, are left out.
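
The handling of NaN values during grouping can be checked on a small example. This is a sketch on a hypothetical toy DataFrame (`toy_df` is made up): one row has no *sex* value and one row has no *weight* value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing sex and a missing weight
toy_df = pd.DataFrame({'sex': ['F', 'M', 'F', np.nan, 'M'],
                       'weight': [40.0, 50.0, np.nan, 30.0, 70.0]})

mean_by_sex = toy_df.groupby('sex')['weight'].mean()
print(mean_by_sex)

# The row without a sex value forms no group of its own,
# and the missing weight is ignored when averaging the F group.
print(list(mean_by_sex.index))  # ['F', 'M']
```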

In [None]:
# Replace the `...` placeholders with your own column names
grouped_data = surveys_df.groupby(...)
grouped_data[...].mean()

## Structure of a groupby object
We can investigate which rows are assigned to which group as follows:

In [None]:
print(type(grouped_data.groups)) # dictionary
print("Plot ids: ", grouped_data.groups.keys()) # keys are the unique values of the column we grouped by
print("Rows belonging to plot id ", 1, ": ", grouped_data.groups[1]) # values are row indexes 
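
The structure of `.groups` is easier to see on a handful of rows. A sketch on a hypothetical toy DataFrame (`toy_df` and `toy_groups` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data: five records spread over three plots
toy_df = pd.DataFrame({'plot_id': [1, 2, 1, 3, 2],
                       'weight': [40.0, 48.0, 29.0, 46.0, 36.0]})

toy_groups = toy_df.groupby('plot_id').groups

print(sorted(toy_groups.keys()))  # [1, 2, 3], the unique plot ids
print(list(toy_groups[1]))        # [0, 2], the row labels belonging to plot 1
```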

## Grouping by multiple columns
Now let's have a look at a more complex grouping example. We want summary statistics of the weight of all females and males by plot id. So in fact we want to group by *sex* and by *plot_id* at the same time.

This will give us exactly 48 groups for our survey data:
- female, plot id = 1
- female, plot id = 2
- ...
- female, plot id = 24
- male, plot id = 1
- ...
- male, plot id = 24

Why 48 groups? We have 24 unique values for *plot_id*. Per plot we have two groups of samples, female and male. Hence, the grouping returns 48 groups.

In [None]:
grouped_data = surveys_df.groupby(['sex', 'plot_id'])
grouped_data["weight"].describe()
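
The number of groups can be checked with the `ngroups` attribute of the groupby object. The sketch below uses a hypothetical toy DataFrame (`toy_df` is made up) with two sexes and two plots; a group is only formed for combinations that actually occur in the data:

```python
import pandas as pd

# Hypothetical toy data: two sexes, two plots
toy_df = pd.DataFrame({'sex': ['F', 'M', 'F', 'M', 'F'],
                       'plot_id': [1, 1, 2, 2, 1],
                       'weight': [40.0, 48.0, 29.0, 46.0, 36.0]})

toy_grouped = toy_df.groupby(['sex', 'plot_id'])
print(toy_grouped.ngroups)          # 4 = 2 sexes x 2 plots

mean_weight = toy_grouped['weight'].mean()
print(mean_weight.loc[('F', 1)])    # 38.0, the mean of 40.0 and 36.0
```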

## Counting and plotting
Another very useful outcome of grouping is the possibility of performing selective counting. For example, let's see how to count the number of records per species. We just need to remember that each species has a unique ID and that records are identified by another ID stored in the column *record_id*. We will first group our data according to the species ID and then, for each group, we will count the number of records. These are several consecutive operations that, once again, Pandas allows us to execute in a single line.

In [None]:
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(type(species_counts))
species_counts
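
The same counting pattern on a handful of rows, as a sketch with a hypothetical toy DataFrame (`toy_df` and `toy_counts` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data: five records of two species
toy_df = pd.DataFrame({'record_id': [1, 2, 3, 4, 5],
                       'species_id': ['NL', 'DM', 'NL', 'DM', 'NL']})

# Group by species, then count the record ids in each group
toy_counts = toy_df.groupby('species_id')['record_id'].count()

print(type(toy_counts))  # a pandas Series indexed by species_id
print(toy_counts)
```

The result is a Series with one entry per species: 2 records for DM and 3 for NL.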

We can also plot the information for a better overview. We will learn more about plotting after the next chapter.

In [None]:
species_counts.plot(kind='bar')

## Summary grouping
Grouping is one of the most common operations in data analysis. Data often consists of different measurements on the same samples. In many cases we are not only interested in one particular measurement but in the cross product of measurements. In the picture below we labeled samples with green lines, blue dots and red lines. We are now interested in how these three different groups relate to each other given all the other measurements in the DataFrame. Pandas' `groupby` method gives us the means to compare these three groups with several built-in statistical methods.

![Grouping sketch](images/grouping.jpeg)

<div class="alert alert-block alert-success">
<b>Exercise 8 to 10</b>
    
Now go to the Jupyter Dashboard in your internet browser and continue with the afternoon exercises 8 to 10.

After you have finished the exercises, please come back to this document and continue with the [following chapter](data-science-with-pandas-3.ipynb).