### Basic introduction to working with tabular data in Python
Before getting started, make sure you have activated the `scrnaseq` environment that we created with mamba earlier (you should have started the notebook server from within that environment). The packages needed for this tutorial are already installed in that environment.

#### 1. Import the required packages

In [None]:
import numpy as np  # array manipulation
import pandas as pd  # working with tabular data
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # nicer-looking plots and advanced plotting options

#### 2. Download the dataset
You can download the dataset as a .csv file from https://gist.github.com/slopp/ce3b90b9168f2f921784de84fa445651#file-penguins-csv (click the Download ZIP button at the top right, then unzip the archive and move the dataset to your project folder).

This dataset contains measurements of penguins from different species – you can think of it as similar to a dataset on cells of different types, like the one we will be working with later.

An in-depth educational analysis of the dataset can be found at https://allisonhorst.github.io/palmerpenguins/articles/intro.html for R and, in less detail, at https://github.com/mcnakhaee/palmerpenguins for Python. To maximize learning, we recommend that you do not look at these too closely for now and find your own way instead.

#### 3. Loading and inspecting the data
The code cells below contain comments (lines which start with # and are not executed as Python code) with step-by-step instructions for working with the penguin dataset. Add the corresponding Python code below each comment. For easier debugging and comparison with other participants, we suggest you keep the suggested variable names.
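
If you have not worked with pandas before, the following minimal sketch may help you get oriented. It uses a small made-up table (the names `cells`, `n_genes` and `fraction_mito` are hypothetical and not part of the penguin dataset) and shows three common ways to get a first impression of a dataframe.

```python
import pandas as pd

# Hypothetical toy table (not the penguin data): three cells, three features
cells = pd.DataFrame(
    {
        "cell_type": ["T cell", "B cell", "T cell"],
        "n_genes": [1200, 950, 1430],
        "fraction_mito": [0.05, 0.12, 0.08],
    },
    index=["cell_1", "cell_2", "cell_3"],
)

print(cells.head(2))  # the first two rows
print(cells.shape)    # (number of rows, number of columns)
print(cells.dtypes)   # one datatype per column
```

The same attributes and methods exist on every pandas dataframe, so you can try them on `penguins` once it is loaded.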

In [None]:
# Load the penguin dataset
penguins = pd.read_csv("penguins.csv", index_col=0)

In [None]:
# Display the first few rows of the dataset


In [None]:
# Get the dataset dimensions, i.e. the number of penguins in the dataset and the number of features


In [None]:
# For each feature, check the datatype - what do the resulting types mean?


In [None]:
# For categorical data, the categories in use can be displayed like this:
print(penguins['species'].unique())
# Print and inspect the categories of all categorical columns


There seems to be "nan" among the values of the sex column. What does this mean?
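
As a hint, here is a small sketch on made-up data (the dataframe `toy` and its columns are hypothetical). In pandas, `NaN` ("not a number") marks a missing entry, and `isna()` returns a table of True/False values that can be summed to count the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two missing entries
toy = pd.DataFrame(
    {
        "label": ["a", np.nan, "b", "a"],
        "value": [1.0, 2.0, np.nan, 4.0],
    }
)

missing_mask = toy.isna()            # True wherever an entry is missing
missing_per_column = missing_mask.sum()      # True counts as 1 when summing
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```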

In [None]:
# Search for missing data entries in the dataframe. How many values are missing?


### 3.1 Questions
Based on the analysis above, discuss with your neighbor and note down the answers to the following questions into the notebook:
1. How many penguins and features do we have in this dataset?
2. What are the different penguin species (think: cell types)?
3. Are there any missing values in the dataset that we need to handle? Do you have preliminary ideas on how to handle them?
4. What are the numerical features in this dataset which we can analyze?

### 4. Preprocessing the data
As above, you will find instructions written as comments below, which you have to turn into Python code. The steps in this section help us prepare the data for the actual analysis.
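
The two operations used in this section can be sketched on made-up data as follows (the dataframe `toy` and its column names are hypothetical). Passing a list of column names in square brackets returns a new dataframe with only those columns; `select_dtypes` is an alternative that picks columns by datatype instead of by name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one incomplete row
toy = pd.DataFrame(
    {
        "sample": ["s1", "s2", "s3"],
        "height_cm": [10.0, np.nan, 12.5],
        "weight_g": [200.0, 210.0, 190.0],
    }
)

# Drop every row that has at least one missing entry
toy_complete = toy.dropna()

# Select columns by name (note the double brackets: a list inside the indexer)
toy_measurements = toy_complete[["height_cm", "weight_g"]]

# Alternative: select all columns with a numeric datatype
toy_numeric = toy_complete.select_dtypes(include="number")

print(toy_measurements)
```

Selecting by name is more explicit, which matters when a table contains numeric columns that are not measurements (for example a year or an ID).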

In [None]:
# Remove all rows with missing entries from the dataframe
penguins_complete = penguins.dropna()

In [None]:
# Measure the size of the new dataframe


In [None]:
# Select all numerical columns which have to do with body measurements and transfer them to a separate dataframe
penguins_numerical = 

In [None]:
# Check that it worked by inspecting the first few rows of penguins_numerical


### 4.1 Questions
Discuss with your neighbor and answer in writing.
1. How many penguins were discarded by the data clean-up step which required all features to be present?
2. Thinking ahead, why would we need a separate object with numerical measurements only? 

### 5. Data transformations
In this section, we will explore some basic statistics of the numerical features and use them to normalize our data. As above, please follow the instructions in the comments. We will standardize each feature to its z-score (https://en.wikipedia.org/wiki/Standard_score) in order to make features comparable across units and orders of magnitude. This is a common preprocessing step for numerical data of many kinds, because otherwise a feature with high values (say, penguin height in cm) might dominate a feature with low values (say, penguin weight in kg) in downstream analyses.
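
The z-score of a value is obtained by subtracting the mean of its feature and dividing by the standard deviation of that feature. The sketch below applies this to a made-up table (the dataframe `toy` and its columns are hypothetical). It relies on the fact that pandas lines up the column-wise means and standard deviations with the matching columns. Note that `DataFrame.std()` uses the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to `ddof=0`, so results can differ slightly between the two.

```python
import pandas as pd

# Hypothetical toy table with two features on very different scales
toy = pd.DataFrame(
    {
        "height_cm": [100.0, 110.0, 120.0],
        "weight_kg": [3.0, 5.0, 4.0],
    }
)

# Column-wise z-score: (value - column mean) / column standard deviation
toy_z = (toy - toy.mean()) / toy.std()

print(toy_z)
```

After this transformation both columns are expressed in units of "standard deviations away from the mean", regardless of their original units.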

In [None]:
# Calculate and display the mean of each body measurement
penguins_numerical.mean()

In [None]:
# Calculate and display the standard deviation of each body measurement


In [None]:
# For each value in penguins_numerical, calculate its z-score version in a new dataframe
penguins_standardized = 

In [None]:
# Inspect the top rows of the standardized dataframe


In [None]:
# Calculate and display the mean of each body measurement after standardization


In [None]:
# Calculate and display the standard deviation of each body measurement after standardization


### 5.1 Questions

1. What problem could have arisen had we not removed missing values prior to these calculations?
2. After standardization, some of the body measurements are negative. What does a negative standardized bill length mean? What does a flipper length of 0 (after standardization) mean?
3. Look at the mean values after standardization. Are they what you would have expected? Why (not)?

### 6. Data filtering
In this section, we will practice selecting subsets of tabular data, which is also a typical task in single-cell RNA-seq analysis (think: selecting all hepatocytes, filtering out low-quality cells with few sequencing reads, etc.). As above, please translate the instructions into code.
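
Filtering in pandas is usually done with boolean masks: a comparison on a column yields one True/False value per row, and indexing the dataframe with that mask keeps only the rows marked True. The following sketch uses made-up data (the dataframe `cells`, its columns and the threshold of 1000 reads are hypothetical choices for illustration).

```python
import pandas as pd

# Hypothetical toy table of cells
cells = pd.DataFrame(
    {
        "cell_type": ["hepatocyte", "Kupffer cell", "hepatocyte", "hepatocyte"],
        "n_reads": [5200, 800, 300, 4100],
    }
)

# Each comparison returns a boolean Series with one entry per row
is_hepatocyte = cells["cell_type"] == "hepatocyte"
has_enough_reads = cells["n_reads"] >= 1000  # hypothetical quality threshold

# Combine masks with & (and), | (or), ~ (not), then index the dataframe
good_hepatocytes = cells[is_hepatocyte & has_enough_reads]

print(good_hepatocytes)
```

The threshold in a mask can also be computed from the data itself, for example from a column mean.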

In [None]:
# Select the penguins which are heavier than 4000 g
large_penguins = 

In [None]:
# Create a dataset which contains only penguins which have shorter than average (i.e. mean) flippers
short_flippers = 

In [None]:
# Check out the species distribution within the large_penguins group
large_penguins['species'].value_counts()

In [None]:
# Check out the species distribution within the short_flippers group


### 6.1 Questions
1. How many penguins have flippers shorter than the mean - more than half of the penguins, less than half of the penguins, or exactly half of the penguins? Which statistical average would have exactly half of the penguins below it? 
2. Based on the above analysis, what can you say about the three penguin species?

### 7. Visualization
In this section, we will produce two plots of our data using the plotting package seaborn. The plot types we will be using are called `sns.scatterplot` and `sns.`. As above, translate the comments into code.
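
Seaborn plotting functions share a common pattern: you pass the dataframe as `data` and refer to columns by name for the axes. As a minimal sketch on made-up data (the dataframe `toy` and its columns are hypothetical), a violin plot of one numerical column grouped by one categorical column could look like this:

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy table: one categorical and one numerical column
toy = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b", "b"],
        "value": [1.0, 1.5, 2.0, 3.0, 3.5, 5.0],
    }
)

# One violin per category on the x axis, showing the distribution of "value"
ax = sns.violinplot(data=toy, x="group", y="value")
ax.set_title("Toy violin plot")
```

Seaborn returns a matplotlib Axes object, so titles, axis labels and limits can be adjusted afterwards with the usual matplotlib methods.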

In [None]:
# Produce a 2D scatter plot of two numerical features of your choice and colour the points by species.
sns.scatterplot(data=penguins_complete, x='flipper_length_mm', y='bill_length_mm', hue='species')

In [None]:
# Produce a 2D scatter plot as above, but use the remaining two features and colour the plot according to sex


In [None]:
# Produce a violinplot of body mass by species (bonus points if you manage to split the violin by sex ;-))


### 7.1 Questions
1. From the plots above, what can you say about body measurements when comparing male to female penguins?
2. If the penguins in the plots above were single cells, what could the different categories be? What could the species category mean in the context of cells, and what could the numerical features represent?

In [None]:
#### The end, well done!