In [1]:
from datetime import date
print(date.today())

2023-04-28


# Concepts

## Pre-course survey evaluation

The University of Vienna doesn't allow the specification of prerequisites for courses; therefore, I can't require students to have any prior experience with single-cell transcriptomics, even though this is supposed to be an advanced course. To find out where they stand, I assessed the students' expertise and expectations via a [pre-course survey](https://forms.gle/GdBvcCoLnkAgXLy17). Here are my main takeaways:

- I can't expect too much technical expertise. The students' background is very varied, ranging from experts to absolute newbies.
- Half of the students have not attended any of the somewhat relevant courses in the faculty curriculum.
- The students consider themselves to be at least acquainted with statistics.
- The students mostly don't know linear algebra.
- Programming proficiency is all over the place.
- Students view this as an introductory course; most hope to learn something they can use in their work.
- Students are somewhat familiar with point-and-click software.
- Students do not use version control often or at all.
- Many students are familiar with the command line and use it quite often.
- Students feel relatively confident in their ability to write simple scripts and to troubleshoot with the help of online resources. They feel less confident about overcoming problems once they are stuck. This may indicate that they feel their programming ability is shallow.

## How does that inform my preparation?

Taking the heterogeneous level of the students into account, I have two choices:

1. Do a two-speed course where the novices grapple with basic concepts and the advanced students can attack advanced material on their own.
2. Keep the "advanced" part on the theoretical side: pick a few important topics and cover them in depth, challenging the students to understand the important concepts.

Since this is a practical course, it is also important that there is a hands-on portion in which the students get practice with the mechanical/technical aspects. However, I'd like to avoid wasting time on software installation, as everyone will be bringing their own laptop and comes from a very different background.

## What did I choose and why?

For the first iteration of the course, I decided to go ahead with option 2: focus on a smaller range of topics and cover them in depth rather than giving the students recipes. One reason is technical: I have no clue how versed the students _really_ are, and I'd rather be positively surprised and breeze through the course than overwhelm them.

# 0. Preparation

## Technical solutions for students

I explored the option of using the University's JupyterHub instance. Unfortunately, the machine is rather limited and I don't want to restrict myself to that. Before committing, I want to do a run-through of the course and see what resources I need. Then, I will try to coordinate with LISC and reserve a big chunk of CPUs/RAM for the days/weeks of the course. I will probably be using [The Littlest JupyterHub](https://tljh.jupyter.org/en/latest/), and running things from there.

This means that I will be keeping track of the installation instructions/steps here, as well as the environment setup steps and results.

UPDATE: the University JupyterHub is too limited in its technical specifications. Setting up on the cluster would be an option and needs to be explored. If the students' laptops are powerful enough, it might be preferable to just have the students install locally.

# 1. Normalisation

Objectives: 

1. Explain rationale behind normalisation/variance stabilisation
1. List options for normalisation/variance stabilisation
1. Explain their preference for one
1. Normalise a dataset
1. Demonstrate that the normalisation achieved the intended effect

### 1.1 Why is normalisation important?

- have students run without normalisation and see what happens.
- does normalisation distort/enhance biological signal?
- literature about normalisation:
    - [Comparison of transformations for single-cell RNA-seq data](https://www.nature.com/articles/s41592-023-01814-1)
    - [Validation of noise models for single-cell transcriptomics](https://www.nature.com/articles/nmeth.2930)
    - [The triumphs and limitations of computational methods for scRNA-seq](https://www.nature.com/articles/s41592-021-01171-x)

### 1.2 How do we normalise?

- challenge them to think about what we need. What does it mean if we normalise?
- What do we want to achieve with normalisation? Do we get it?
- Literature:
    - [Normalization of single-cell RNA-seq counts by log(x + 1) or log(1 + x)](https://academic.oup.com/bioinformatics/article/37/15/2223/6155989)
    - [Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1)
    - [Depth normalization for single-cell genomics count data](https://www.biorxiv.org/content/10.1101/2022.05.06.490859v1.full)
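
As a starting point for the hands-on part, one simple option discussed in this literature (scaling each cell to a common depth, then applying log(x + 1)) can be sketched in a few lines of NumPy. This is a minimal sketch, not a recommendation: the function name `normalise_log1p`, the `target_sum` parameter and the toy matrix are made up for illustration, and scaling to the median depth is just one common choice.

```python
import numpy as np

def normalise_log1p(counts, target_sum=None):
    """Scale each cell to a common total count, then apply log(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # sequencing depth per cell
    if target_sum is None:
        target_sum = np.median(totals)  # assumption: scale to the median depth
    # Guard against empty cells (total of zero) to avoid division by zero.
    safe_totals = np.where(totals == 0, 1.0, totals)
    scaled = counts / safe_totals * target_sum
    return np.log1p(scaled)

# Toy cells x genes matrix: cell 1 is cell 0 sequenced twice as deeply.
counts = np.array([[1, 3, 6],
                   [2, 6, 12],
                   [5, 5, 0]])
normalised = normalise_log1p(counts)
print(normalised.round(2))
```

After normalisation the first two rows are identical, which is the point to discuss with the students: the depth difference is gone, but so is any real difference in total RNA content between the two cells.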

### 1.3 How do we evaluate if it worked?

- gold standard
- simulated data
- expert knowledge
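
The "simulated data" route could look roughly like the sketch below: simulate a single cell type whose cells differ only in sequencing depth, then check whether gene values still track depth after normalisation. Everything here is an assumption made for illustration (Poisson noise, the range of size factors, plain depth scaling, and the helper `mean_abs_depth_correlation`); real data are overdispersed and a serious check would need a richer simulation.

```python
import numpy as np

def mean_abs_depth_correlation(matrix, depth):
    """Average |Pearson r| between each gene's values and per-cell depth."""
    corrs = [np.corrcoef(matrix[:, g], depth)[0, 1]
             for g in range(matrix.shape[1])]
    return float(np.mean(np.abs(corrs)))

rng = np.random.default_rng(0)
n_cells, n_genes = 500, 50

# Ground truth: one cell type, so every cell shares the same expression profile.
gene_means = np.linspace(5, 50, n_genes)
# Technical effect: each cell is sequenced at a different depth.
size_factors = rng.uniform(0.5, 2.0, size=n_cells)
counts = rng.poisson(size_factors[:, None] * gene_means[None, :])

# Plain depth scaling to the median depth (no log transform here).
depth = counts.sum(axis=1)
normalised = counts / depth[:, None] * np.median(depth)

before = mean_abs_depth_correlation(counts, depth)
after = mean_abs_depth_correlation(normalised, depth)
print(f"mean |r| with depth: raw={before:.2f}, normalised={after:.2f}")
```

Because the ground truth contains no biology at all, any remaining association with depth is technical, which makes "did it work?" a question with a checkable answer.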

# 2. Clustering

Objectives: 

1. List some definitions of "cell type"
1. Explain what clustering is and why it makes sense for scRNA-seq.
1. Describe how we get from a cells $\times$ genes matrix to a clustering
1. List ways of calculating cell-cell similarity
1. Explain their preference for one
1. Produce a cell type tree
1. Work with the cell type tree to identify putative cell type families
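
To make the "ways of calculating cell-cell similarity" objective concrete, here is a small NumPy sketch comparing three common distance measures on toy cells. The function names and the toy matrix are illustrative only, and the sketch assumes no cell has an all-zero or constant profile (otherwise the cosine and correlation distances are undefined).

```python
import numpy as np

def euclidean_distances(x):
    """Pairwise Euclidean distances between rows (cells)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def cosine_distances(x):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def correlation_distances(x):
    """Pairwise Pearson correlation distances (1 - r) between rows."""
    return 1.0 - np.corrcoef(x)

cells = np.array([[1.0, 2.0, 3.0, 0.0],
                  [2.0, 4.0, 6.0, 0.0],   # same profile as cell 0, twice the magnitude
                  [3.0, 0.0, 1.0, 5.0]])  # a different profile

euclid = euclidean_distances(cells)
cosine = cosine_distances(cells)
correlation = correlation_distances(cells)
print("cell 0 vs cell 1:", euclid[0, 1], cosine[0, 1], correlation[0, 1])
```

Cells 0 and 1 are far apart in Euclidean terms but essentially identical under the cosine and correlation distances, which gives the students something tangible to argue about when explaining their preference.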

# 3. Integration

Objectives: 

1. List possible sources of differences between single-cell datasets
1. Explain why these differences may lead to problems, e.g. in cell type detection or differential gene expression
1. List the main strategies for dataset integration
    - global models
    - linear embedding models
    - graph-based methods
    - deep learning
1. Find and parse recent benchmarking publications
1. Explain their preference for a benchmarking strategy/method
1. Run integration method on dataset
1. Demonstrate that integration achieved the intended effect

### 3.1 Why is integration needed?

- have students run without integration and see what happens
- construct possible scenarios/experimental designs that could lead to batch effects
- take away lessons for experimental design. Are there ways to prevent batch effects?

### 3.2 How do we integrate?

- work on gold standard (HCA or MCA)
- run different tools (at least in the Python ecosystem):
    - BBKNN
    - MNN
    - pyHarmony
    - pyLiger
    - scVI
    - scANVI

What do we learn from the integrated data that we couldn't learn before?

### 3.3 How do we evaluate if it worked?

- discuss with students: What would we expect? How would we measure it?
- use [`scib`](https://github.com/theislab/scib) from [Luecken _et al._ 2021](https://www.nature.com/articles/s41592-021-01336-8)
- see how homogeneous each cluster is before/after
- HVG conservation before/after
- run Alison's method (z-scores) as baseline integration
- compare to "no integration at all"
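
One way to put a number on "how homogeneous each cluster is before/after" is the normalised entropy of a label (batch or cell type) within each cluster. The sketch below is a hand-rolled illustration under my own assumptions, not the `scib` implementation: the function name `label_entropy_per_cluster` and the toy labels are hypothetical.

```python
from collections import Counter
from math import log

def label_entropy_per_cluster(clusters, labels):
    """Normalised Shannon entropy of `labels` within each cluster.

    0 means a cluster contains a single label; 1 means all labels are
    equally represented in it.
    """
    n_labels = len(set(labels))
    result = {}
    for cluster in set(clusters):
        members = [lab for c, lab in zip(clusters, labels) if c == cluster]
        total = len(members)
        entropy = sum((n / total) * log(total / n)
                      for n in Counter(members).values())
        result[cluster] = entropy / log(n_labels) if n_labels > 1 else 0.0
    return result

# Toy example: four cells from two batches, clustered before and after integration.
batches = ["x", "x", "y", "y"]
clusters_before = ["a", "a", "b", "b"]  # clusters coincide with batches
clusters_after = ["a", "b", "a", "b"]   # each cluster mixes both batches

before = label_entropy_per_cluster(clusters_before, batches)
after = label_entropy_per_cluster(clusters_after, batches)
print(before, after)
```

With batch labels, higher entropy after integration suggests better mixing; with cell type labels, lower entropy suggests the biology was preserved. High batch mixing is only desirable for cell types that are actually present in every batch, which is worth raising in the discussion.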

# 4. Cross-species comparison

Objectives:

- List possible approaches for (pairwise) cross-species comparisons
- List pros and cons of each
- Explain how gene homology and conservation of expression complicate cross-species comparisons
- Outline possible strategies to account for that
- Run manual comparisons of various marker genes
- Run SAMap and visualise cluster similarity matrix

### How do we integrate?

- manual: compare marker genes for selected cluster pairs between (closely?) related species
- 1-1 orthologs (subset and integrate as in section 3)
- SAMap
- Literature:
    - [Benchmarking strategies for cross-species integration of single-cell RNA-sequencing data](https://www.biorxiv.org/content/10.1101/2022.09.27.509674v2)
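
The 1-1 ortholog route boils down to filtering a homology table and reordering both expression matrices so that their columns line up. A minimal sketch, assuming the homology table is available as a list of gene pairs and that every listed gene is present in its matrix; all gene names, helper functions and matrices below are hypothetical placeholders.

```python
from collections import Counter
import numpy as np

def one_to_one_orthologs(pairs):
    """Keep only ortholog pairs in which both genes occur exactly once."""
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return [(a, b) for a, b in pairs if count_a[a] == 1 and count_b[b] == 1]

def subset_to_genes(matrix, genes, wanted):
    """Select and reorder the columns of a cells x genes matrix."""
    index = {gene: i for i, gene in enumerate(genes)}
    return matrix[:, [index[g] for g in wanted]]

# Hypothetical homology table between species A and species B.
pairs = [("a1", "b1"), ("a2", "b2"), ("a2", "b3"), ("a4", "b4")]
genes_a = ["a1", "a2", "a3", "a4"]
genes_b = ["b4", "b3", "b2", "b1"]
expr_a = np.arange(8).reshape(2, 4)   # 2 cells x 4 genes
expr_b = np.arange(12).reshape(3, 4)  # 3 cells x 4 genes

kept = one_to_one_orthologs(pairs)  # a2 maps to two genes, so it is dropped
shared_a = subset_to_genes(expr_a, genes_a, [a for a, _ in kept])
shared_b = subset_to_genes(expr_b, genes_b, [b for _, b in kept])

# Stack the cells of both species over the shared ortholog columns.
combined = np.vstack([shared_a, shared_b])
print(kept, combined.shape)
```

The example also shows the cost of this strategy: every gene with a one-to-many relationship is discarded, which is exactly the kind of trade-off the pros-and-cons objective is after.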

### What do we learn?

How does integration of scRNA-seq data help us form and validate evolutionary hypotheses?