In [None]:
import pandas as pd
surveys_df = pd.read_csv('../data/surveys.csv')

### Exercise 1

Type the following commands and check the outputs. Can you tell what each command does? What is the difference between commands with and without parentheses?

```python
surveys_df.shape
surveys_df.columns
surveys_df.index
surveys_df.dtypes
surveys_df.head(<try_various_integers_here>)
surveys_df.tail(<try_various_integers_here>)
```
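
As a hint for the last question, here is a minimal sketch on made-up data (`toy_df` is a hypothetical stand-in, not the lesson's `surveys_df`). It suggests that attributes such as `shape` are stored properties read without parentheses, while methods such as `head()` are functions that have to be called.

```python
import pandas as pd

# Hypothetical toy data, standing in for surveys_df
toy_df = pd.DataFrame({"plot_id": [1, 2, 3], "weight": [40.0, 48.0, 29.0]})

n_rows, n_cols = toy_df.shape   # attribute: read without parentheses
first_two = toy_df.head(2)      # method: called with parentheses and an argument
unbound = toy_df.head           # without parentheses you get the method itself, not its result

print(n_rows, n_cols)
print(first_two)
print(callable(unbound))
```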

### Exercise 2
Perform some basic statistics on the weight column. For convenience, first create a variable `weight` that contains just the weight column; this makes the code a bit cleaner. Can you tell what each method listed below does? Look at our exploratory plot: do the statistics make sense?

```python
weight=surveys_df['weight']
weight.min()
weight.max()
weight.mean()
weight.std()
weight.count()
```

### Exercise 3
- Swap the order of column names in `surveys_df[['plot_id', 'species_id']]`.
- Repeat one of the column names, as in `surveys_df[['plot_id', 'plot_id', 'species_id']]`. What do the results look like, and why?
- Which error occurs in `surveys_df['plot_id', 'species_id']`, and why?
- Which error occurs in `surveys_df['speciess']`?

In [None]:
print(surveys_df[['species_id', 'plot_id']])

In [None]:
surveys_df[['plot_id', 'plot_id', 'species_id']] # repeating column plot_id

In [None]:
surveys_df['plot_id', 'species_id'] 
# The tuple, or combination ('plot_id', 'species_id') is not a 
# column name (key) in the dataframe --> KeyError: ('plot_id', 'species_id')

In [None]:
surveys_df['speciess']
# 'speciess' is not a column name (key) in the dataframe --> KeyError: 'speciess'
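
The difference between a list and a tuple inside the brackets can be reproduced on a small example. This is only a sketch: `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, standing in for surveys_df
toy_df = pd.DataFrame({"plot_id": [2, 3], "species_id": ["NL", "DM"]})

# A list of names selects several columns (repeats allowed) and returns a DataFrame
both = toy_df[["plot_id", "plot_id", "species_id"]]
print(list(both.columns))

# A tuple is looked up as one single column label, which does not exist here
try:
    toy_df["plot_id", "species_id"]
except KeyError as err:
    tuple_error = err
    print("KeyError:", err)

# A misspelled column name fails in the same way
try:
    toy_df["speciess"]
except KeyError as err:
    typo_error = err
    print("KeyError:", err)
```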

### Exercise 4
What happens when you execute:
- `surveys_df[0:1]`
- `surveys_df[:4]`
- `surveys_df[:-1]`

In [None]:
surveys_df[0:1] # shows the first row of the dataframe
surveys_df[:4] # shows the first four rows from index 0 to index 3
surveys_df[:-1] # shows all rows of the dataframe except the last one
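
Row slices follow the usual Python slicing rules, which is easier to check on a handful of rows. A minimal sketch, with the made-up `toy_df` standing in for `surveys_df`:

```python
import pandas as pd

# Hypothetical toy data with five rows
toy_df = pd.DataFrame({"record_id": [1, 2, 3, 4, 5]})

first_row = toy_df[0:1]       # only the row at position 0
first_four = toy_df[:4]       # positions 0 to 3, the end is excluded
all_but_last = toy_df[:-1]    # a negative end counts from the back

print(len(first_row), len(first_four), len(all_but_last))
print(all_but_last["record_id"].tolist())
```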

### Exercise 5
What happens in the following two examples?

- ```surveys_df.iloc[0:4, 1:4]```;
- ```surveys_df.loc[0:4, 1:4]```.


In [None]:
print(surveys_df.iloc[0:4, 1:4])
surveys_df.loc[0:4, 1:4] # loc selects by label: the row labels are integers, so 0:4 works,
# but the column labels in our dataframe are names (strings), so the integer slice 1:4 raises a TypeError
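
To see position-based and label-based selection side by side, the sketch below uses a small made-up DataFrame (`toy_df` and its column values are assumptions, not the lesson data). Note that a `loc` slice includes its end label, while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical toy data with a default integer row index 0..5
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "month": [7, 7, 7, 7, 7, 7],
    "day": [16, 16, 16, 16, 16, 16],
    "year": [1977] * 6,
})

by_position = toy_df.iloc[0:4, 1:4]          # positions, end excluded: 4 rows, 3 columns
by_label = toy_df.loc[0:4, "month":"year"]   # labels, end included: 5 rows, 3 columns

print(by_position.shape)
print(by_label.shape)
```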

### Exercise 6
- Create a new DataFrame that only contains observations with sex values that are not female or male. Print the number of rows in this new DataFrame. Verify the result by comparing the number of rows in the new DataFrame with the number of rows in the surveys DataFrame where sex is NaN (hint: there is a function `isnull`).
- Create a new DataFrame that contains only observations that are of sex male or female and where weight values are greater than 0.

In [None]:
df = surveys_df[(surveys_df['sex'].isnull())]
print("Number of rows:", len(df))
print("Unique values in column 'sex':", df['sex'].unique())
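
For the second part of the exercise, one possible approach is to combine two boolean masks with `&`. The sketch below uses made-up data (`toy_df` is hypothetical); with the lesson data the same masks would be built from `surveys_df`.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data, standing in for surveys_df
toy_df = pd.DataFrame({
    "sex": ["M", "F", np.nan, "F", "M"],
    "weight": [40.0, np.nan, 35.0, 0.0, 52.0],
})

known_sex = toy_df["sex"].isin(["F", "M"])   # False where sex is NaN
positive_weight = toy_df["weight"] > 0       # False where weight is NaN or 0
selected = toy_df[known_sex & positive_weight]

print(len(selected))
print(selected)
```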

### Exercise 7: Putting it all together 
1. Clean the column *sex* (leave out the samples for which we do not know whether they are male or female) and save the result as a new DataFrame `clean_df`.
2. Fill undefined *weight* values with the mean of all valid weights in `surveys_df`.
3. Calculate the average weight of that new DataFrame `clean_df`.

In [None]:
# Step 1
# sex is 'F' or 'M'. The `|` means or.
clean_df = surveys_df[(surveys_df['sex']=='F') | (surveys_df['sex']=='M')]
# or not sex is null. The `~` means not.
clean_df = surveys_df[~(surveys_df['sex'].isnull())]

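# Step 2
# fillna returns a new Series, so the result has to be assigned back to the column
clean_df = clean_df.copy() # work on a copy to avoid a SettingWithCopyWarning
clean_df['weight'] = clean_df['weight'].fillna(surveys_df['weight'].mean())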

# Step 3
print("Average weight of surveys_df:", surveys_df.weight.mean())
print("Average weight of clean_df:", clean_df.weight.mean())
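
A detail that is easy to miss in step 2: `fillna` returns a new Series and leaves the original untouched unless the result is assigned back. A minimal sketch on made-up data (`toy_df` is hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with one missing weight
toy_df = pd.DataFrame({"weight": [10.0, np.nan, 30.0]})
mean_weight = toy_df["weight"].mean()   # NaN is skipped when computing the mean

toy_df["weight"].fillna(mean_weight)    # result is discarded, toy_df is unchanged
still_missing = int(toy_df["weight"].isnull().sum())

toy_df["weight"] = toy_df["weight"].fillna(mean_weight)   # assign the result back
now_missing = int(toy_df["weight"].isnull().sum())

print(still_missing, now_missing)
```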

### Exercise 8
Let's see in which plots animals get more food. Calculate the average weight per plot! Complete the code below.

In [None]:
grouped_data = surveys_df.groupby("plot_id")
grouped_data['weight'].mean()

### Exercise 9
Investigate the group keys and row indexes for this more complex grouping example. 
Why are there more than 48 groups?
What happened to the third group and why does it not turn up in our statistics?

In [None]:
grouped_data = surveys_df.groupby(['sex', 'plot_id'])
print(len(grouped_data.groups))
grouped_data.groups.keys() # we also have a categorical value 'nan'.
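
Why rows with a missing *sex* do not show up in the statistics can be illustrated with the `dropna` parameter of `groupby` (this assumes pandas 1.1 or later; `toy_df` is made-up data). By default, rows whose group key is `NaN` are left out of the aggregated result.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: one row has no recorded sex
toy_df = pd.DataFrame({
    "sex": ["F", "M", np.nan, "F"],
    "plot_id": [1, 1, 1, 2],
    "weight": [40.0, 50.0, 30.0, 42.0],
})

# Default behaviour: groups with a NaN key are dropped from the result
default_means = toy_df.groupby(["sex", "plot_id"])["weight"].mean()

# Assumption: pandas >= 1.1, where dropna=False keeps the NaN group
with_nan_means = toy_df.groupby(["sex", "plot_id"], dropna=False)["weight"].mean()

print(len(default_means))
print(len(with_nan_means))
```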

### Exercise 10
Would it make sense to group our data frame by the column *weight*? Why or why not?

In [None]:
# In real data, nearly every sample has its own weight value, so nearly every sample
# would end up in a group of its own.
# In our training data there are many distinct weight values (see the counts below), so
# it is usually not a good idea to categorise (group) data by such values.
print("Number of rows:", len(surveys_df))
print(len(surveys_df['weight'].unique())) #includes nan
print(len(surveys_df.groupby(['weight']).groups)) #does not include nan

### Exercise 11
In the given example of vertical concatenation, you concatenated two DataFrames with the same columns. What would happen if the two DataFrames to concatenate have a different number of columns and different column names?

  1. Create a new DataFrame using the last 10 rows of the species DataFrame (`species_df`);
  2. Concatenate vertically `surveys_df_sub_first10` and the DataFrame you just created;
  3. Print the concatenated DataFrame info on the screen. How many rows does it have? What happened to the columns? Explain why you get this result.

In [None]:
species_df = pd.read_csv("../data/species.csv")
species_df_sub_last10 = species_df.tail(10)

surveys_df_sub_first10 = surveys_df.head(10)
vert_concat = pd.concat([surveys_df_sub_first10, species_df_sub_last10], axis=0)

vert_concat

We get a total of 20 rows and 12 columns. Together, the original DataFrames have 13 columns, but both contain a column `species_id`, which is collapsed into one. Every other column exists in only one of the two DataFrames and is padded with `NaN` values in the rows that come from the other.
We expect 20 rows, as we are putting two DataFrames of 10 rows one after the other. The padding happens because the two DataFrames do not have the same column names: to keep all the information from the original DataFrames, columns that occur in only one of them must be filled in for the rows of the other.
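
The same effect can be reproduced at a smaller scale. The following sketch uses two made-up DataFrames (`left` and `right` are hypothetical names) that share only the column `species_id`:

```python
import pandas as pd

# Hypothetical toy data: only species_id is common to both DataFrames
left = pd.DataFrame({"species_id": ["NL", "DM"], "weight": [40.0, 48.0]})
right = pd.DataFrame({"species_id": ["AB", "AH"], "genus": ["Amphispiza", "Ammospermophilus"]})

stacked = pd.concat([left, right], axis=0)

print(stacked.shape)   # 2 + 2 rows; 2 + 2 - 1 shared column
print(stacked)

missing_genus = int(stacked["genus"].isnull().sum())    # rows that came from left
missing_weight = int(stacked["weight"].isnull().sum())  # rows that came from right
```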

### Exercise 12
In the given example of horizontal concatenation, you first concatenated two DataFrames with different indices, then reset the indices of the second one. Based on the outcome of these two cases, try to answer the following questions:
  1. What happens when you concatenate horizontally two DataFrames with different indexing?
  2. What happens when you concatenate horizontally two DataFrames with the same columns?
  3. What happens when you try to select a column of the `horizontal_stack` DataFrame we just created?
  4. How can you select a specific column, when multiple columns share a name?

1. The columns of both DataFrames are kept and duplicates are not merged; the rows of both DataFrames are kept as well, again without merging. As a result, the two original DataFrames appear in a checkerboard pattern in the resulting DataFrame, with the empty spaces padded with `NaN` values.
2. Columns are still not merged, but rows with a common index now are: the information from both rows is put into a single row of the resulting DataFrame. If no corresponding row exists in the other DataFrame, the row is still padded with `NaN` values in the result. If a corresponding row (with the same index) *does* exist, no padding is needed.

In [None]:
# 3.
surveys_df_sub_last10 = surveys_df.tail(10)
surveys_df_sub_last10 = surveys_df_sub_last10.reset_index(drop=True)
horizontal_stack = pd.concat([surveys_df_sub_first10, surveys_df_sub_last10], axis=1)
horizontal_stack['species_id'] # returns every column named 'species_id', so the result is a DataFrame

In [None]:
# 4.
horizontal_stack.iloc[:, 5] # select a single column by its integer position instead of its name
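
Questions 3 and 4 can also be checked on a small example. This is a sketch with made-up DataFrames (`first`, `last` and `side_by_side` are hypothetical names): selecting by a duplicated name returns all matching columns, while `iloc` picks exactly one.

```python
import pandas as pd

# Hypothetical toy data: both DataFrames have the same column names and index
first = pd.DataFrame({"plot_id": [1, 2], "species_id": ["NL", "DM"]})
last = pd.DataFrame({"plot_id": [7, 8], "species_id": ["PF", "PE"]})

side_by_side = pd.concat([first, last], axis=1)

by_name = side_by_side["species_id"]    # both columns with that name: a DataFrame
by_position = side_by_side.iloc[:, 3]   # exactly one column: a Series

print(type(by_name).__name__, by_name.shape)
print(type(by_position).__name__, by_position.tolist())
```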

### Exercise 13
Time to play with plots! Look at the pandas.DataFrame.plot() documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) and change your data visualization selecting different DataFrame columns, x and y axes, and kind of plot (try at least three different plots).


In [None]:
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots() # prepare a matplotlib figure

surveys_df.plot(x="hindfoot_length", y="weight", kind="scatter", ax=ax1)

# Provide further adaptations with matplotlib:
ax1.set_xlabel("Hindfoot length")
ax1.tick_params(labelsize=16, pad=8)
fig.suptitle('Scatter plot of weight versus hindfoot length', fontsize=15)