## Recap: Load the data

In [None]:
import pandas as pd
surveys_df = pd.read_csv('../surveys.csv')
surveys_df.describe()

We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average weight of all individuals per site.

Let's start by checking how heavy our samples are:

In [None]:
surveys_df['weight'].describe()

We can also extract one specific metric if we wish:

In [None]:
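# A notebook cell only displays its last expression, so we print each value
print('min:  ', surveys_df['weight'].min())
print('max:  ', surveys_df['weight'].max())
print('mean: ', surveys_df['weight'].mean())
print('std:  ', surveys_df['weight'].std())
print('count:', surveys_df['weight'].count())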
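
The individual calls above can also be collected in a single step. The following is a minimal sketch using the pandas `Series.agg` method on a small made-up Series; `toy_weight` and its values are illustrative assumptions, not taken from surveys.csv.

```python
import pandas as pd

# Toy weights (illustrative values, not from surveys.csv); None becomes NaN
toy_weight = pd.Series([40.0, 48.0, None, 52.0], name='weight')

# Compute several statistics in one call; NaN values are skipped
summary = toy_weight.agg(['min', 'max', 'mean', 'count'])
print(summary)
```

The result is a Series indexed by the names of the statistics, so a single value can be read with, for example, `summary['mean']`.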

Now assume we want to get those statistics for the two groups "male" and "female".
If we want to summarize data by one or more variables, we can use Pandas’ ```.groupby()``` method. Once we’ve created a groupby object, we can quickly calculate summary statistics by a group of our choice.

## Grouping

In [None]:
grouped_data = surveys_df.groupby('sex')
grouped_data.describe()

The output is a bit overwhelming. Let's just have a look at one statistical value, the mean, to understand what is happening here:

In [None]:
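# numeric_only=True restricts the mean to numeric columns
# (needed in pandas 2.0 and later when text columns such as species_id are present)
grouped_data.mean(numeric_only=True)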

We see that the data is divided into two groups, one group where the value in the column *sex* equals "F" and another group where the value in the column *sex* equals "M". The statistic is then calculated over all samples in that specific group, for each of the columns in the dataframe. Note that samples whose *sex* is NaN are left out, and NaN values within a column are ignored when the statistic is computed.
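
The treatment of NaN values described above can be checked on a small example. This is a sketch on made-up data; `toy_df` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values, not from surveys.csv)
toy_df = pd.DataFrame({
    'sex': ['F', 'M', 'F', np.nan, 'M'],
    'weight': [40.0, 50.0, np.nan, 45.0, 54.0],
})

# The row whose group key is NaN is dropped;
# the NaN weight in group "F" is skipped when computing the mean
toy_means = toy_df.groupby('sex')['weight'].mean()
print(toy_means)
```

Only the groups "F" and "M" appear in the result, and the mean of "F" is based on its single non-missing weight.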

## Exercise
Let's see in which plots animals get more food. Calculate the average weight per plot!

In [None]:
grouped_data = surveys_df.groupby(...)
grouped_data[...].mean()

## Structure of a groupby object
We can investigate which rows are assigned to which group as follows:

In [None]:
print(type(grouped_data.groups)) # dictionary
print("Plot ids: ", grouped_data.groups.keys()) # keys are the unique values of the column we grouped by
print("Rows belonging to plot id ", 1, ": ", grouped_data.groups[1]) # values are row indexes 
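
To look at the rows of one group rather than only their index labels, pandas provides the `get_group` method. A minimal sketch on made-up data (`toy_df` is an assumption, not the survey data):

```python
import pandas as pd

# Toy data (illustrative values, not from surveys.csv)
toy_df = pd.DataFrame({
    'plot_id': [1, 2, 1, 3, 2],
    'weight': [40.0, 50.0, 42.0, 45.0, 54.0],
})
toy_grouped = toy_df.groupby('plot_id')

# Row index labels assigned to plot 1
rows_plot_1 = list(toy_grouped.groups[1])
# The corresponding rows as a DataFrame
plot_1 = toy_grouped.get_group(1)

print(rows_plot_1)
print(plot_1)
```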

## Grouping by multiple columns
Now let's have a look at a more complex grouping example. We want summary statistics of the weight of all females and males by plot id. So in fact we want to group by *sex* and by *plot_id* at the same time.

We expect this to give us 48 groups for our survey data:
- female, plot id = 1
- female, plot id = 2
- ...
- female, plot id = 24
- male, plot id = 1
- ...
- male, plot id = 24

Why 48 groups? We have 24 unique values for *plot_id*. Per plot we have two groups of samples, female and male. Hence, the grouping returns 48 groups.

In [None]:
grouped_data = surveys_df.groupby(['sex', 'plot_id'])
grouped_data["weight"].describe()
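
When grouping by two columns, the result is indexed by pairs of keys. The sketch below uses made-up data (`toy_df` is an assumption) to show how one combination can be selected with a tuple:

```python
import pandas as pd

# Toy data (illustrative values, not from surveys.csv)
toy_df = pd.DataFrame({
    'sex': ['F', 'F', 'M', 'M', 'F'],
    'plot_id': [1, 1, 1, 2, 2],
    'weight': [40.0, 44.0, 50.0, 54.0, 38.0],
})

# One mean per (sex, plot_id) combination
toy_mean = toy_df.groupby(['sex', 'plot_id'])['weight'].mean()
print(toy_mean)

# Select a single combination with a tuple of keys
female_plot_1 = toy_mean.loc[('F', 1)]
print(female_plot_1)
```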

## Exercise
Investigate the group keys and row indexes for this more complex grouping example. 
Why are there more than 48 groups?
What happened to the third group and why does it not turn up in our statistics?

## Counting and plotting
Another very useful outcome of grouping is the possibility of performing selective counting. For example, let's see how to count the number of records per species. We just need to remember that each species has a unique ID and that records are identified by another ID stored in the column *record_id*. We will first group our data according to the species ID and then, for each group, we will count the number of records. Once again, Pandas allows us to execute these consecutive operations in a single line.

In [None]:
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(type(species_counts))
species_counts
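
Note that `count()` counts the non-missing values of the selected column, whereas `size()` counts all rows in each group. The sketch below illustrates the difference on made-up data (`toy_df` is an assumption), where the counted column contains a NaN:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values, not from surveys.csv)
toy_df = pd.DataFrame({
    'species_id': ['DM', 'DM', 'PF', 'DM'],
    'weight': [40.0, np.nan, 8.0, 44.0],
})

# count(): number of non-missing weights per species
non_null_weights = toy_df.groupby('species_id')['weight'].count()
# size(): number of rows per species, missing values included
rows_per_species = toy_df.groupby('species_id').size()

print(non_null_weights)
print(rows_per_species)
```

The two results agree whenever the counted column has no missing values.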

We can also plot the information for a better overview. We will learn more about plotting after the next chapter.

In [None]:
species_counts.plot(kind='bar')

## Summary grouping
Grouping is one of the most common operations in data analysis. Data often consists of different measurements on the same sample, and in many cases we are not only interested in one particular measurement but in the cross product of measurements. In the picture below we labeled samples with green lines, blue dots and red lines. We are now interested in how these three different groups relate to each other given all the other measurements in the dataframe.

If we group a Pandas dataframe we in fact receive an object that contains three 

![Grouping sketch](pictures/grouping.jpeg)

## Indexing, Slicing, and Subsetting DataFrames

We saw that using grouping we can conveniently subset our DataFrame according to different measurement characteristics. However, sometimes it is necessary to "surgically" extract small portions of a DataFrame, such as single rows and columns of data satisfying very specific filtering criteria. In this section we will see how Pandas allows us to perform all these operations with a quite intuitive syntax.

### Selecting 

Let's look again at our original DataFrame columns using a loop. This time we will add some extra conditional statements to highlight the column name corresponding to a specific index.

In [None]:
sel_index = 5
print('Index) Column name')
for i,col in enumerate(surveys_df.columns):
    if i == sel_index:
        print('{}) {} <==='.format(i,col))
    else:
        print('{}) {}'.format(i,col))

We already saw in one of the previous paragraphs how to extract a specific DataFrame column, but we did not go into much detail. The next block of code shows four ways to retrieve the same column (species_id, corresponding to index 5) from our DataFrame:

In [None]:
# By name
# --------------------------------------
# Method 1: bracket notation
species_id_1 = surveys_df['species_id']

# Method 2: attribute notation
species_id_2 = surveys_df.species_id
# --------------------------------------

# By location
# --------------------------------------
# Method 3: look up the column name at position 5, then select by name
species_id_3 = surveys_df[surveys_df.columns[5]]

# Method 4: select all rows of the column at position 5
species_id_4 = surveys_df.iloc[:,5]
# --------------------------------------

In the first two methods we extract the column by specifying its name. The third method is essentially identical to the first one, as the 6th element (index 5) of the Index ```surveys_df.columns``` is species_id. The fourth method uses the ```iloc``` indexer to select *all* the rows of the 6th column.
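
That the four approaches return the same data can be verified with the `Series.equals` method. A minimal sketch on a made-up three-column frame (`toy_df` is an assumption; here species_id sits at position 1 rather than 5):

```python
import pandas as pd

# Toy data (illustrative values, not from surveys.csv)
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3],
    'species_id': ['NL', 'DM', 'PF'],
    'weight': [40.0, 50.0, 8.0],
})

by_brackets = toy_df['species_id']
by_attribute = toy_df.species_id
by_column_label = toy_df[toy_df.columns[1]]
by_position = toy_df.iloc[:, 1]

# equals() compares both the values and the index of two Series
all_equal = (by_brackets.equals(by_attribute)
             and by_brackets.equals(by_column_label)
             and by_brackets.equals(by_position))
print(all_equal)
```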

### Selecting data by type

The attribute ```dtypes``` contains information about the data type of each column.

In [None]:
surveys_df.dtypes

This information is not only important for our data analysis, but it also allows us to select a subset of the data according to its type.

In [None]:
surveys_df_float_sel = surveys_df.select_dtypes(include = ['float64'])
print(type(surveys_df_float_sel))
surveys_df_float_sel.dtypes

In the previous block of code we used a method, ```.select_dtypes()```, to select the DataFrame columns storing only float64 values (double-precision floating point numbers).
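
Besides a list of specific types, ```.select_dtypes()``` also accepts the generic selector `'number'` and an `exclude` argument. The following sketch uses made-up data (`toy_df` is an assumption):

```python
import pandas as pd

# Toy data (illustrative values, not from surveys.csv)
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3],            # integers
    'species_id': ['NL', 'DM', 'PF'],  # strings
    'weight': [40.0, 50.0, 8.0],       # floats
})

# All numeric columns, integers and floats alike
numeric_cols = toy_df.select_dtypes(include='number')
# Everything that is not numeric
non_numeric_cols = toy_df.select_dtypes(exclude='number')

print(list(numeric_cols.columns))
print(list(non_numeric_cols.columns))
```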

### Selecting by string in name

Another very convenient option to select data is specifying a string that must be contained in the column names. DataFrame column names are indicative (or at least, they should be) of the characteristics of the measurements. All the columns containing a unique identifier, for example, may carry the suffix "id", while all the measurements relative to a specific body part (another example) will most probably contain the name of that body part. In this context, the Pandas method ```.filter(like=<str>)``` allows us to extract only those columns containing a certain string in their names.
For example, let's extract all columns containing some sort of ID:

In [None]:
print(surveys_df.columns)

In [None]:
surveys_df_str_sel = surveys_df.filter(like='_id')
print(type(surveys_df_str_sel))

In [None]:
surveys_df_str_sel.head()
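
The `like` argument matches the string anywhere in the column name. When the position matters (for example, only names *ending* in "_id"), the `regex` argument of ```.filter()``` can be used instead. A sketch on made-up column names (`toy_df` and the column `id_checked` are hypothetical):

```python
import pandas as pd

# Toy data with hypothetical column names (not from surveys.csv)
toy_df = pd.DataFrame({
    'record_id': [1, 2],
    'plot_id': [3, 4],
    'id_checked': [True, False],
    'weight': [40.0, 50.0],
})

# like: the substring may appear anywhere in the name
contains_id = toy_df.filter(like='id')
# regex: "$" anchors the match at the end of the name
ends_with_id = toy_df.filter(regex='_id$')

print(list(contains_id.columns))
print(list(ends_with_id.columns))
```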

### Slicing

DataFrame slicing allows you to extract a portion of a DataFrame based on conditions or indices and create a new DataFrame containing only the subset of data that you are interested in. In Pandas, slicing can be performed using ```loc``` and ```iloc``` for slicing via labels (names) and integer positions, respectively. To remember the difference between the two, just notice that the "i" in ```iloc``` stands for "integer".
Let's start slicing our initial DataFrame into a 3x4 sub-DataFrame:

In [None]:
surveys_df.iloc[0:3,0:4]

<div class="alert alert-block alert-warning">
<b>WARNING:</b> In Python, integer indexing starts at 0 and, when slicing with a continuous range of indices, the data corresponding to the last index is NOT included.
</div>

We can obtain the same result using ```loc```, but we need to specify a list with the first 4 column names.

In [None]:
surveys_df.loc[[0,1,2],['record_id','month','day','year']]
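
The distinction between labels and positions becomes clearer when the row index is not the default 0, 1, 2, ... The sketch below uses a made-up frame with string labels (`toy_df` is an assumption):

```python
import pandas as pd

# Toy data with string row labels (illustrative, not from surveys.csv)
toy_df = pd.DataFrame(
    {'weight': [40.0, 50.0, 8.0, 44.0]},
    index=['a', 'b', 'c', 'd'],
)

# iloc: integer positions, end position excluded (positions 1 and 2)
by_position = toy_df.iloc[1:3]
# loc: labels, end label included (labels 'b' through 'c')
by_label = toy_df.loc['b':'c']

print(by_position)
print(by_label)
```

Both selections return the rows labelled 'b' and 'c', but they get there in different ways.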

<div class="alert alert-block alert-success">
<b>TRY IT YOURSELF:</b> Can you tell what happens when you execute the following commands?
</div>

- ```surveys_df[0:1]```;
- ```surveys_df[:4]```;
- ```surveys_df[:-1]```.



<div class="alert alert-block alert-success">
<b>TRY IT YOURSELF:</b> What happens when you call the following commands? How are the two commands different?
</div>

- ```surveys_df.iloc[0:4, 1:4]```;
- ```surveys_df.loc[0:4, 1:4]```.


### Subsetting Data according to user-defined criteria

We can extract subsets of our DataFrame following the general syntax ```data_frame[<condition_on_data>]```, where ```<condition_on_data>``` is a conditional statement on the DataFrame content itself. You may think of the conditional statement as a question or query you ask your DataFrame. Here are some examples:

In [None]:
# What are the data collected in 2002?
surveys_df[surveys_df.year == 2002]

In [None]:
# What are the data NOT collected in 2002?
surveys_df[surveys_df.year != 2002]

In [None]:
# What are the data NOT collected in 2002? (different syntax)
surveys_df[~(surveys_df.year == 2002)]

Our filtering conditions may be very specific, they can target different columns in the DataFrame, and they can be combined using the logical operator "&". Each individual condition must be wrapped in parentheses:

In [None]:
# What are the data collected between 2000 and 2002 (inclusive) for female individuals?
surveys_df[(surveys_df.year >= 2000) & (surveys_df.year <= 2002) & (surveys_df.sex == 'F')]

The method ```isin()``` allows us to specify a list of "permitted" values for a certain column. Here is another example:

In [None]:
surveys_df[(surveys_df.year == 2000) & (surveys_df.sex == 'F') & (surveys_df.month.isin([1,3,4]))]
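
Conditions can also be combined with the logical OR operator "|", and a range check can be written with the `Series.between` method. A minimal sketch on made-up data (`toy_df` is an assumption):

```python
import pandas as pd

# Toy data (illustrative values, not from surveys.csv)
toy_df = pd.DataFrame({
    'year': [1999, 2000, 2001, 2002, 2003],
    'sex': ['F', 'M', 'F', 'F', 'M'],
})

# OR: rows collected in 1999 or in 2003
either_year = toy_df[(toy_df.year == 1999) | (toy_df.year == 2003)]

# between() includes both end points by default
in_range = toy_df[toy_df.year.between(2000, 2002) & (toy_df.sex == 'F')]

print(either_year)
print(in_range)
```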

<div class="alert alert-block alert-success">
<b>TRY IT YOURSELF:</b> 
    <ol>
    <li> Create a new DataFrame that only contains observations with sex values that are not female or male. Print the number of rows in this new DataFrame. Verify the result by comparing the number of rows in the new DataFrame with the number of rows in the surveys DataFrame where sex is null.</li>
    <li>Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0.</li>
    </ol>
</div>

## DataFrame Cleaning

A simple exploration of our DataFrame showed us that some columns contain invalid values (NaN). One of the most important preliminary operations of data analysis is cleaning your data set, i.e. "getting rid" of non-numerical values. Now that we have mastered selecting, slicing, and subsetting, we can easily clean our DataFrame with a few lines of code.

In [None]:
# Are there any invalid values in the weight column?
n_tot = len(surveys_df)
n_null_weight = len(surveys_df[pd.isnull(surveys_df.weight)])
n_pos_weight  = len(surveys_df[surveys_df.weight > 0])

print('Total number of rows:',n_tot)
print('Number of null weight rows:',n_null_weight)
print('Number of positive weight rows:',n_pos_weight)

As you can see, out of 35549 weight measurements, 3266 are not usable. The remaining 32283 values are positive and therefore usable. What happens if we compute the mean weight without accounting for the missing values?

In [None]:
ave_weight = surveys_df.weight.mean()
print(ave_weight)

A smooth run, without errors or warnings. As we said several times, Pandas is a library designed for data analysis, and when performing data analysis it is very common to deal with missing values. In particular, the ```.mean()``` method has an argument called *skipna* that, when set to ```True``` (the default value, so we do not need to specify it), excludes NA/null values. This means that, in this case, Pandas simply ignores the missing values and performs the computation only on the values that are present.
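
The effect of *skipna* can be seen by switching it off. A minimal sketch on a made-up Series (`toy_weight` is an assumption):

```python
import numpy as np
import pandas as pd

# Toy weights with one missing value (illustrative, not from surveys.csv)
toy_weight = pd.Series([40.0, np.nan, 50.0])

# Default behaviour: the NaN is skipped
mean_default = toy_weight.mean()
# With skipna=False the NaN propagates into the result
mean_strict = toy_weight.mean(skipna=False)

print(mean_default)
print(mean_strict)
```

With `skipna=False` a single missing value is enough to make the mean itself NaN.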

If we are not happy with the default behaviour of Pandas, we can manually decide which value to assign to NA/null values. One possible choice is setting them to zero. To do that, we just need to apply the method ```.fillna(<value>)```, where ```<value>``` is the number we want to substitute for the NA/null values (in our case, 0).

In [None]:
cleaned_weight1 = surveys_df.weight.fillna(0)
cleaned_weight_ave1 = cleaned_weight1.mean()
print(cleaned_weight_ave1)

You probably noticed that, compared to our previous mean computation, the result is quite different. This is because the mean is now computed on a sample with many more zeros than the previous one and, as a result, the value of the computed mean is smaller.
Conscious of this problem, we may now choose a more appropriate value to "fill" our NA/null values. How about we use the "clean" mean of our first computation?

In [None]:
cleaned_weight2 = surveys_df.weight.fillna(surveys_df.weight.mean())
cleaned_weight_ave2 = cleaned_weight2.mean()
print(cleaned_weight_ave2)

This time we obtain the same result as in our first computation. This is because we substituted the NA/null values with a mean that was itself computed excluding the NA/null values.
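
Filling with the mean leaves the mean unchanged, but it does affect other statistics: the filled values all sit exactly at the centre of the distribution, so the spread shrinks. The sketch below shows this on a made-up Series (`toy_weight` is an assumption):

```python
import numpy as np
import pandas as pd

# Toy weights with two missing values (illustrative, not from surveys.csv)
toy_weight = pd.Series([40.0, np.nan, 50.0, np.nan, 60.0])
filled = toy_weight.fillna(toy_weight.mean())

# The mean is preserved ...
print(toy_weight.mean(), filled.mean())
# ... but the standard deviation becomes smaller
print(toy_weight.std(), filled.std())
```

This is worth keeping in mind before choosing a fill value for a later analysis.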

<div class="alert alert-block alert-success">
<b>TRY IT YOURSELF:</b> Compute the average weight of data after having cleaned the weight and the sex column.
</div>