# Afternoon Exercises: Working with data {.unnumbered}

In [None]:
import pandas as pd
surveys_df = pd.read_csv('../../course_materials/data/surveys.csv') # in your notebook the path should be 'data/surveys.csv'

### Exercise 0

Type the following commands and check the outputs. Can you tell what each command does? What is the difference between commands with and without parentheses?

```python
surveys_df.shape # Answer: the dimensions of the dataframe
surveys_df.columns # Answer: the column names of the dataframe
surveys_df.index # Answer: the index (row labels) of the dataframe
surveys_df.dtypes # Answer: the data types of each column
surveys_df.head(<try_various_integers_here>) # Answer: the first n rows of the dataframe
surveys_df.tail(<try_various_integers_here>) # Answer: the last n rows of the dataframe
```
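
The difference between the two kinds of command can be sketched on a small, made-up DataFrame. The name `toy_df` and its values are purely illustrative and are not part of the surveys data. Names without parentheses are attributes that hold a stored value, while names with parentheses are methods that have to be called.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({"plot_id": [1, 2, 3], "weight": [40.0, 48.0, 29.0]})

# Attribute: no parentheses, simply looks up a stored property
dims = toy_df.shape

# Method: the parentheses call it, optionally with arguments
first_two = toy_df.head(2)

# Leaving the parentheses off a method gives the method itself, not its result
head_method = toy_df.head

print(dims)
print(first_two)
print(callable(head_method))
```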

### Exercise 1
Perform some basic statistics on the weight column. For practical reasons, it can be useful to first create a variable `weight` that contains just the weight column, which makes the code look a bit cleaner. Can you tell what each method listed below does? Look at our exploratory plot: do the statistics make sense?

```python
weight = surveys_df['weight'] # Answer: creates a new variable that contains the weight column
weight.min() # Answer: the minimum value of the weight column
weight.max() # Answer: the maximum value of the weight column
weight.mean() # Answer: the mean value of the weight column
weight.std() # Answer: the standard deviation of the weight column
weight.count() # Answer: the number of non-NaN values in the weight column
```
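
How these statistics deal with missing values can be checked on a tiny, made-up Series (`toy_weight` is a hypothetical name, and the numbers are not from the surveys data). In this sketch, `count()` and `mean()` skip `NaN`, whereas `len()` counts every entry.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing value
toy_weight = pd.Series([10.0, 20.0, np.nan, 30.0])

n_all = len(toy_weight)         # counts every entry, including NaN
n_valid = toy_weight.count()    # counts only the non-NaN entries
mean_valid = toy_weight.mean()  # NaN is skipped: (10 + 20 + 30) / 3

print(n_all, n_valid, mean_valid)
```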

### Exercise 2
- Swap the order of column names in `surveys_df[['plot_id', 'species_id']]`
- Repeat one of the column names like `surveys_df[['plot_id', 'plot_id', 'species_id']]`.
What do the results look like and why?  

> Answer: the columns appear in the order given in the list. A repeated name yields a repeated column with the same data, because column names do not have to be unique.

- Which error occurs in `surveys_df['plot_id', 'species_id']` and why?  

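> Answer: KeyError: ('plot_id', 'species_id'). Without the inner list, pandas treats the tuple `('plot_id', 'species_id')` as a single column label, which does not exist. We need double square brackets (a list inside the indexing brackets) to select multiple columns.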

- Which error occurs in `surveys_df['speciess']`?  

> Answer: KeyError: 'speciess'. The column name does not exist. Typo.

In [None]:
print(surveys_df[['species_id', 'plot_id']])

In [None]:
surveys_df[['plot_id', 'plot_id', 'species_id']]

In [None]:
surveys_df['plot_id', 'species_id'] 

In [None]:
surveys_df['speciess']
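
The selection behaviour above can be reproduced without the surveys file. This is a minimal sketch on made-up data (`toy_df` is a hypothetical name) that catches the `KeyError` instead of letting it stop the notebook.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({"plot_id": [1, 2], "species_id": ["NL", "DM"]})

# A list of names selects several columns, in the order given
both = toy_df[["species_id", "plot_id"]]

# A repeated name gives a repeated column
repeated = toy_df[["plot_id", "plot_id", "species_id"]]
print(list(repeated.columns))

# Without the inner list, the tuple is looked up as a single column label
try:
    toy_df["plot_id", "species_id"]
    error_name = None
except KeyError as err:
    error_name = type(err).__name__
    print("KeyError:", err)
```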

### Exercise 3
What happens when you call:

- `surveys_df[0:1]` Answer: shows the first row of the dataframe
- `surveys_df[:4]` Answer: shows the first 4 rows of the dataframe from index 0 to index 3
- `surveys_df[:-1]` Answer: shows all rows of the dataframe except the last row

In [None]:
surveys_df[0:1]
surveys_df[:4] 
surveys_df[:-1] 
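
The three slices can be compared side by side on a five-row, made-up DataFrame (`toy_df` is a hypothetical name). Row slices follow the usual Python convention that the end position is excluded.

```python
import pandas as pd

# Hypothetical toy data with five rows
toy_df = pd.DataFrame({"record_id": [1, 2, 3, 4, 5]})

first_row = toy_df[0:1]      # position 0 only
first_four = toy_df[:4]      # positions 0, 1, 2 and 3
all_but_last = toy_df[:-1]   # everything except the last row

print(len(first_row), len(first_four), len(all_but_last))
```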

### Exercise 4
- Create a new DataFrame that only contains observations from the original with sex values that are not female or male. Print the number of rows in this new DataFrame. Verify the result by comparing the number of rows in the new DataFrame with the number of rows in the surveys DataFrame where sex is NaN (hint: there is a function `isnull`).

In [None]:
df = surveys_df[(surveys_df['sex'] != 'M') & (surveys_df['sex'] != 'F')]
print("Number of rows not female or male:", len(df))
print("Number of rows NaN:", surveys_df['sex'].isnull().sum())  # sum() counts the True values; len() would return the total number of rows
print("Unique values in column 'sex':", df['sex'].unique())
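
When verifying the count, note that `isnull()` returns one boolean per row, so its length always equals the number of rows. A minimal sketch on made-up data (`toy_df` is a hypothetical name) shows how to count the missing values instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing sex values
toy_df = pd.DataFrame({"sex": ["F", "M", np.nan, "F", np.nan]})

mask = toy_df["sex"].isnull()
n_rows = len(mask)      # length of the mask: one entry per row
n_missing = mask.sum()  # True counts as 1, so this is the number of NaN values

# Rows that are neither 'M' nor 'F'; comparing NaN with != gives True
not_f_or_m = toy_df[(toy_df["sex"] != "M") & (toy_df["sex"] != "F")]

print(n_rows, n_missing, len(not_f_or_m))
```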

### Exercise 5: Putting it all together 
1. Clean the column *sex* (leave out samples for which we do not know whether they are male or female) and save the result as a new DataFrame `clean_df`.
2. Fill undefined *weight* values in `clean_df` with the mean of all valid weights in `surveys_df`.
3. Calculate the average weight of that new DataFrame `clean_df`.

In [None]:
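# Step 1
# Keep rows where sex is 'F' or 'M'. The `|` means "or".
clean_df = surveys_df[(surveys_df['sex'] == 'F') | (surveys_df['sex'] == 'M')]
# Alternative solution: keep rows where sex is not null. The `~` means "not".
# .copy() makes clean_df an independent DataFrame, so it can be modified safely.
clean_df = surveys_df[~(surveys_df['sex'].isnull())].copy()

# Step 2
# fillna returns a new Series, so assign the result back to the column.
clean_df['weight'] = clean_df['weight'].fillna(surveys_df['weight'].mean())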

# Step 3
print("Average weight of surveys_df:", surveys_df['weight'].mean())
print("Average weight of clean_df:", clean_df['weight'].mean())
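
A point that is easy to miss in step 2 is that `fillna` returns a new object and leaves the original untouched unless the result is assigned. The sketch below illustrates this on made-up data (`toy_df` and `clean` are hypothetical names, and the numbers are not from the surveys data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one unknown sex, one missing weight
toy_df = pd.DataFrame({
    "sex": ["F", "M", np.nan, "M"],
    "weight": [10.0, np.nan, 50.0, 30.0],
})

overall_mean = toy_df["weight"].mean()  # mean of all valid weights: (10 + 50 + 30) / 3

# Keep rows with a known sex; .copy() gives an independent DataFrame
clean = toy_df[toy_df["sex"].notnull()].copy()

# Calling fillna without assigning the result changes nothing
clean["weight"].fillna(overall_mean)
still_missing = clean["weight"].isnull().sum()

# Assigning the result back stores the filled values
clean["weight"] = clean["weight"].fillna(overall_mean)

print(still_missing, clean["weight"].isnull().sum(), clean["weight"].mean())
```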

### Exercise 6
Let's see in which plots animals get more food. Calculate the average weight per plot! Complete the code below.

In [None]:
grouped_data = surveys_df.groupby("plot_id")
grouped_data['weight'].mean()

### Exercise 7
Below is a more complex grouping example. Investigate its group keys and row indexes.
Why are there more than 48 groups? Answer: NaN values are not ignored when grouping.
Calculate the average weight per group.
What happened to the third group and why does it not turn up in our statistics? Answer: the third group contains only NaN values and is therefore not included in the statistics.

In [None]:
grouped_data = surveys_df.groupby(['sex', 'plot_id'])
print(len(grouped_data.groups))
grouped_data.groups.keys()

In [None]:
grouped_data['weight'].mean()
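
How missing keys affect aggregated results can be explored on made-up data. This is a sketch under some assumptions: it groups on a single key for simplicity, `toy_df` is a hypothetical name, and the `dropna` parameter of `groupby` requires pandas 1.1 or later. It only looks at the aggregated result, not at the `groups` attribute.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row has no sex, one row has no weight
toy_df = pd.DataFrame({
    "sex": ["F", "M", np.nan, "F"],
    "weight": [10.0, 20.0, 30.0, np.nan],
})

# Default: rows whose key is NaN are left out of the aggregated result
default_means = toy_df.groupby("sex")["weight"].mean()

# dropna=False keeps the NaN key as a group of its own
with_nan_key = toy_df.groupby("sex", dropna=False)["weight"].mean()

print(default_means)
print(with_nan_key)
```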

### Exercise 8
Would it make sense to group our data frame by the column *weight*? Why or why not?

In [None]:
# In real life, nearly every sample has a unique weight, so nearly every sample
# would be placed in its own group.
# In our training data you can see that there are quite a few distinct weight values.
# So it is usually not a good idea to categorise (group) data on such values.
print("Number of rows:", len(surveys_df))
print(len(surveys_df['weight'].unique())) #includes nan
print(len(surveys_df.groupby(['weight']).groups)) #does not include nan
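
The difference between the two counts printed above comes from how `NaN` is treated. A small sketch on made-up values (`toy_df` is a hypothetical name) compares `unique()`, `nunique()` and the number of groups in an aggregated result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two distinct weights plus one missing value
toy_df = pd.DataFrame({"weight": [10.0, 10.0, 25.0, np.nan]})

n_unique_with_nan = len(toy_df["weight"].unique())  # NaN is listed as a value
n_unique = toy_df["weight"].nunique()               # NaN is excluded by default
n_groups = len(toy_df.groupby("weight").size())     # NaN keys are dropped by default

print(n_unique_with_nan, n_unique, n_groups)
```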

### Exercise 9
In the given example of vertical concatenation, you concatenated two DataFrames with the same columns. What would happen if the two DataFrames to concatenate have a different number of columns and different column names?

  1. Create a new DataFrame using the last 10 rows of the species DataFrame (`species_df`);
  2. Vertically concatenate `surveys_df_sub_first10` and the DataFrame you just created;
  3. Print the concatenated DataFrame info on the screen. How many rows does it have? What happened to the columns? Explain why you get this result.

In [None]:
species_df = pd.read_csv("../../course_materials/data/species.csv")
species_df_sub_last10 = species_df.tail(10)

surveys_df_sub_first10 = surveys_df.head(10)
vert_concat = pd.concat([surveys_df_sub_first10, species_df_sub_last10], axis=0)

vert_concat

We get a total of 20 rows and 12 columns. Together, the original DataFrames had 13 columns. As they both have a column `species_id`, it appears only once in the result. Every column that occurs in only one of the two DataFrames is padded with `NaN` values in the rows coming from the other.
We expect 20 rows, as we are putting two DataFrames of 10 rows one after the other. The padding of the columns happens because these two DataFrames do not have the same column names. To keep all the information that was in the original DataFrames, the padding of columns that occur in only one of the two is necessary.
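
The same effect can be seen at a smaller scale. In this minimal sketch, the two DataFrames `a` and `b` are made up for illustration and share only the `species_id` column.

```python
import pandas as pd

# Hypothetical toy DataFrames that share only the species_id column
a = pd.DataFrame({"record_id": [1, 2], "species_id": ["NL", "DM"]})
b = pd.DataFrame({"species_id": ["PF", "PE"], "genus": ["Perognathus", "Peromyscus"]})

stacked = pd.concat([a, b], axis=0)

# Rows add up (2 + 2); columns are the union of both sets (2 + 2 - 1 shared)
print(stacked.shape)
print(stacked)

# Columns that exist in only one input are padded with NaN for the other's rows
n_padded_record_id = stacked["record_id"].isnull().sum()
n_padded_genus = stacked["genus"].isnull().sum()
```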

### Exercise 10
  1. Looking at the `inner_join` example, can you explain how much of each of the two DataFrames is missing from the result?

Now consider the other types of joins. For each one, can you predict the number of rows and the contents of the resulting DataFrame, based on the diagrams in the picture?

  2. For the outer join;
  3. For the left join;
  4. For the right join.

1. From the left DataFrame, three rows are not included in the `inner_join` DataFrame. This is because they have a value in their `species_id` column that is not present in the right DataFrame. From the right DataFrame, the information of 18 rows is missing from the result. This is because their `species_id` column has a value that does not occur in the left DataFrame. Note that the information from the two rows that are represented in the result is duplicated a number of times, as their `species_id` value occurs multiple times in the left DataFrame.
2. The result has a total of 28 rows. You may notice that the first seven of those rows are the same as the result of the inner join, followed by the three rows from the left DataFrame that are not represented in the inner join, and finally, the 18 rows from the right DataFrame that are not represented in the inner join. This makes for a total of 7 + 3 + 18 = 28 rows. The outer join preserves *all* the information from both the left and right DataFrames.

In [None]:
# 2.
left_df = surveys_df.head(10)
right_df = species_df.head(20)
outer_join = pd.merge(left_df, right_df, left_on='species_id', right_on='species_id', 
                      how='outer')
outer_join

3. Ten rows. The resulting DataFrame closely resembles the original left DataFrame, but with information from the right DataFrame added to it, where applicable.

In [None]:
# 3.
left_join = pd.merge(left_df, right_df, left_on='species_id', right_on='species_id', 
                     how='left')
left_join

4. 25 rows. The resulting DataFrame closely resembles the original right DataFrame, but with information from the left DataFrame added to it, where applicable. Note that rows from the right DataFrame that have multiple matching rows in the left DataFrame are duplicated.

In [None]:
# 4.
right_join = pd.merge(left_df, right_df, left_on='species_id', right_on='species_id', 
                      how='right')
right_join
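
The row-count reasoning for the four join types can be checked on a toy example. This is a sketch with made-up DataFrames (`left` and `right` are hypothetical, not the course data): `PF` occurs only on the left, `AB` only on the right, and `DM` occurs twice on the left.

```python
import pandas as pd

# Hypothetical toy DataFrames
left = pd.DataFrame({"record_id": [1, 2, 3, 4],
                     "species_id": ["NL", "DM", "DM", "PF"]})
right = pd.DataFrame({"species_id": ["DM", "NL", "AB"],
                      "genus": ["Dipodomys", "Neotoma", "Amphispiza"]})

# Count the rows produced by each join type
row_counts = {
    how: len(pd.merge(left, right, on="species_id", how=how))
    for how in ["inner", "outer", "left", "right"]
}
print(row_counts)
```

Here the inner join keeps the 3 matching left rows, the left join keeps all 4 left rows, the right join adds the unmatched `AB` row to the 3 matches, and the outer join contains the 3 matches plus one unmatched row from each side.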

### Exercise 11

Time to play with plots! Create a multiplot following these instructions:

- Using the `matplotlib.pyplot` function `subplots()`, create a single figure (10x10 inches) with four subplots organized in two rows and two columns;
- In the top row, plot `hindfoot_length` versus `weight` for females and males in two separate plots with two different colors;
- In the bottom row, plot the same data as in the top row, but using data collected before (left plot) and after (right plot) 1990;
- Give each plot an appropriate descriptive title and customize the plot labels.

Feel free to use the DataFrame `plot` method or the `plt.scatter` function to plot data points, but be aware that, in either case, the first thing to do is to create the _Figure_ and _Axes_.

EXTRA: The four plots have the same x and y axes spanning the same range. Can you remove the space between the four plots? Try it!

In [None]:
from matplotlib import pyplot as plt

fig, axes = plt.subplots(2,2,figsize=(10,10)) # prepare a matplotlib figure

# Top left plot, male data
surveys_df[surveys_df['sex']=='M'].plot("hindfoot_length", "weight", kind="scatter", ax=axes[0][0], color='blue')
axes[0][0].set_title('Male data')
axes[0][0].grid()

# Top right plot, female data
surveys_df[surveys_df['sex']=='F'].plot("hindfoot_length", "weight", kind="scatter", ax=axes[0][1], color='red')
axes[0][1].set_title('Female data')
axes[0][1].grid()

year = 1990  # split year requested in the exercise

# Bottom left plot, male data
surveys_df[(surveys_df['sex']=='M') & (surveys_df['year'] < year)].plot("hindfoot_length", "weight", kind="scatter", ax=axes[1][0], color='blue')
axes[1][0].set_title(f'Male data (< {year})')
axes[1][0].grid()

# Bottom right plot, female data
surveys_df[(surveys_df['sex']=='F') & (surveys_df['year'] >= year)].plot("hindfoot_length", "weight", kind="scatter", ax=axes[1][1], color='red')
axes[1][1].set_title(f'Female data (>= {year})')
axes[1][1].grid()

# Removing individual plot labels
for i in range(2):
    for j in range(2):
        axes[i][j].set_xlabel('')
        axes[i][j].set_ylabel('')

# Initializing figure labels
fig.supxlabel("Hindfoot Length [mm]",fontsize=14)
fig.supylabel("Weight [g]",fontsize=14)
fig.suptitle('Scatter plot of weight versus hindfoot length', fontsize=15)
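
For the EXTRA question, one possible approach is to share the axes between the subplots and set the spacing to zero with `subplots_adjust`. The sketch below uses made-up points instead of the surveys data. The `Agg` backend is selected only so that it runs without a display and is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so the sketch runs anywhere
from matplotlib import pyplot as plt

# sharex/sharey give all four plots the same axis ranges
fig, axes = plt.subplots(2, 2, figsize=(10, 10), sharex=True, sharey=True)

# Made-up points, standing in for the hindfoot_length and weight columns
for ax in axes.flat:
    ax.scatter([1, 2, 3], [3, 1, 2])
    ax.grid()

# Remove the horizontal and vertical space between the subplots
fig.subplots_adjust(wspace=0, hspace=0)
```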