### Assignment: Cell Type Annotation in PBMCs Using scRNA-Seq Data

#### Objective 
In this assignment, you’ll analyze single-cell RNA sequencing (scRNA-seq) data from the PBMC 3k dataset, which contains approximately 3,000 peripheral blood mononuclear cells (PBMCs) from a healthy donor. PBMCs comprise several immune cell types, each contributing to immune responses in a different way. There are around five major categories (T cells, B cells, NK cells, monocytes and dendritic cells), and within each category more specific subtypes can be identified using additional markers in single-cell RNA sequencing.

You’ll use techniques like dimensionality reduction, clustering, and marker gene analysis to identify and annotate distinct cell types within the dataset. Your main job is to identify clusters of cells that are similar to each other and find discriminative markers per cluster to use those for cell annotation. 

---

> **Note:** If your own research involves single-cell RNA data, you are free to use that data source instead of the PBMC dataset.

---

#### Scanpy  
You will use `Scanpy` in Python for data processing, clustering, and visualization. Furthermore, you can use `sklearn` if you want to apply methods that are not available in Scanpy.

The `Scanpy` object is the data structure that the Scanpy package works on for single-cell RNA sequencing (scRNA-seq) analysis in Python. It stores and manages the scRNA-seq data along with associated metadata, the results of quality control and preprocessing steps, and the results of downstream analyses such as cell clustering, differential gene expression analysis, and data visualization.

The Scanpy object is an `AnnData` object, which is a generic container for annotated data in Python. It typically contains the following components:

- `.X`: a matrix or sparse matrix of gene expression data, where rows correspond to cells and columns correspond to genes. This matrix is used for downstream analyses.
- `.obs`: a DataFrame containing metadata for each cell, such as cell type, sample ID, and experimental condition.
- `.var`: a DataFrame containing metadata for each gene, such as gene name, gene ID, and gene biotype.
- `.obsm`: a dictionary of additional per-cell annotations, such as cell embeddings obtained through dimensionality reduction methods like principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE).
- `.varm`: a dictionary of additional per-gene annotations.
- `.uns`: a dictionary of unstructured annotations, which can be used to store arbitrary data or metadata.
- `.obs_names`: a vector of cell names or IDs.
- `.var_names`: a vector of gene names or IDs.




---
#### Assignment Steps
In this assignment you can use the scanpy [tutorial](https://scanpy-tutorials.readthedocs.io/en/multiomics/pbmc3k.html) for code snippets. Make sure that you understand each code snippet before you use it. The code blocks in this notebook can be used to test code snippets and to gain more understanding.


1. **Data Loading and Preprocessing**  
   - Load the dataset.
   - Filter the data to remove low-quality cells and genes based on minimum gene and cell thresholds.
   - Filter on QC metrics.
   - Normalize and scale the data to prepare it for further analysis.

2. **Dimensionality Reduction**  
   - Perform PCA on the dataset to reduce dimensionality, capturing the primary sources of variation.
   - Use UMAP or t-SNE to further reduce the data for visualization, making it easier to see clusters of cells. Justify why your selected dimension reduction technique is suitable.

3. **Clustering to Identify Cell Types**  
   - Apply a clustering algorithm to group similar cells based on gene expression profiles. Justify your chosen clustering algorithm in a scientific manner (either with an experiment or with a scientific source).
   - Visualize these clusters using UMAP or t-SNE to explore the distinct immune cell populations in the PBMC dataset.

4. **Marker Gene Analysis and Annotation**  
   - Identify marker genes for each cluster by finding genes that are uniquely expressed in certain clusters. 
   - Based on known immune cell markers, annotate each cluster with likely cell type identities, such as T cells, B cells, NK cells, or monocytes.

5. **Cluster Composition Analysis**  
   - Calculate the proportion of each cell type in the dataset. This will help you understand the overall composition of PBMCs in this sample.
   - Visualize the composition using a bar plot to show the distribution of each immune cell type.

6. **Interpretation and Reporting**  
   - Summarize your findings in a report, detailing how you identified and annotated each cell type.
   - Reflect on the immune cell diversity observed in the data.
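
For the composition step, the proportions can be computed directly from the annotation column once every cell has a label. This is a minimal sketch, assuming `pandas` is available; the function name `cell_type_proportions` and the toy labels are hypothetical and stand in for a column such as `adata.obs["cell_type"]`.

```python
import pandas as pd


def cell_type_proportions(labels: pd.Series) -> pd.Series:
    """Return the fraction of cells per label, most common label first."""
    return labels.value_counts(normalize=True)


# Made-up annotations standing in for a per-cell label column.
toy_labels = pd.Series(["T cell"] * 5 + ["B cell"] * 3 + ["NK cell"] * 2)
proportions = cell_type_proportions(toy_labels)
print(proportions)
```

If matplotlib is installed, `proportions.plot.bar()` draws the corresponding bar plot.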

---

#### Learning Outcomes  
Through this assignment, you’ll gain experience with:
- Preprocessing scRNA-seq data and applying quality control.
- Using dimensionality reduction and clustering to identify cell types.
- Performing marker gene analysis for cell type annotation.
- Understanding the composition and diversity of cells in blood samples.

---

> **Bonus**
Experiment with multiple techniques and evaluate the outcomes. Note that you first need to extract the `X` or `X_pca` matrix from the `scanpy` object before you can pass it to sklearn objects.

---

#### Assessment criteria
- Organized solution: Portfolio is well-organized. Code is divided into functions or class methods, follows coding standards, and is adequately documented. Code that is not written in functions or methods will not be reviewed. The assignment can be easily reproduced by others.
- Problem Understanding and Formulation: Demonstrates a clear understanding of the problem being addressed.
- Literature: Cites recent and authoritative sources.
- Data Preprocessing and Exploration: Thoroughly preprocesses the data to handle missing values, outliers, and other data quality issues. Explores the dataset to gain insights and understand its characteristics.
- Model Selection and Architecture: Chooses appropriate unsupervised machine learning algorithms for the given problem. Provides a rationale for the choices based on the characteristics of the data and problem.
- Result and discussion: Interprets and discusses the results in a meaningful way. Compares the results to baselines. Conclusions are drawn from the results and supported by evidence.
- Critical Thinking and Problem-Solving: Demonstrates critical thinking skills by addressing challenges and proposing insightful solutions.
- Presentation and Communication: The concepts are explained clearly, and technical terms are appropriately defined.

Note that if you want to use the sklearn `Pipeline` class, you need to build your own custom transformers.
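
A custom transformer only needs `fit` and `transform` methods to be usable inside a `Pipeline`. The sketch below is one possible shape, assuming `numpy` and scikit-learn are installed. The class `LogNormalizer` is hypothetical: it rescales each cell to a fixed total count and applies `log1p`, operating on a plain array rather than on an AnnData object.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline


class LogNormalizer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: scale each cell to target_sum counts, then log1p."""

    def __init__(self, target_sum=1e4):
        self.target_sum = target_sum

    def fit(self, X, y=None):
        # Nothing is learned from the data; fit exists to satisfy the sklearn API.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        totals = X.sum(axis=1, keepdims=True)
        totals[totals == 0] = 1.0  # avoid division by zero for empty cells
        return np.log1p(X / totals * self.target_sum)


pipeline = Pipeline([
    ("lognorm", LogNormalizer(target_sum=1e4)),
    ("pca", PCA(n_components=2, random_state=0)),
])

# Random counts standing in for a cells x genes matrix.
toy_counts = np.random.default_rng(0).poisson(2.0, size=(30, 10))
embedding = pipeline.fit_transform(toy_counts)
print(embedding.shape)
```

Inheriting from `BaseEstimator` and `TransformerMixin` provides `get_params`, `set_params`, and `fit_transform`, so the class can also be used in sklearn's parameter search tools.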

## 1. Load the PBMC dataset and apply preprocessing
- Download the data from:
https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
- Unpack the data and inspect the files. 
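
Unpacking and listing the files can be done from Python with the standard library alone. The following is a minimal sketch; the function name `unpack_and_list` and the paths in the usage note are hypothetical.

```python
import tarfile
from pathlib import Path


def unpack_and_list(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir and return the extracted file paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    # Paths are returned relative to target_dir, using forward slashes.
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )
```

After downloading the archive (for example with `urllib.request.urlretrieve`), a call such as `unpack_and_list("pbmc3k_filtered_gene_bc_matrices.tar.gz", "data")` would print the matrix, gene, and barcode files to inspect.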


### Load the data 

Load the data into an `AnnData` object. Use the tutorial for tips.

In [None]:
#your code here to load the data into an anndata object

### Inspect the data
The loaded AnnData object contains the raw (non-normalized) scRNA-seq data as well as the metadata associated with the data. The `obs` attribute contains cell metadata, the `var` attribute contains gene metadata, the `obsm` attribute contains additional per-cell annotations, and the `varm` attribute contains additional per-gene annotations. The `obs_names` and `var_names` attributes hold the cell and gene names or IDs read from the PBMC data.

Now inspect the structure of the loaded data object. 

In [None]:
#your solution here

### Preprocessing

- Inspect the structure of the loaded data object. 
- Filter the data to remove low-quality cells and genes based on minimum gene and cell thresholds.
- Filter on QC metrics.
- Normalize and scale the data to prepare it for further analysis.
- Organize your code in functions or class methods. 


In [None]:
#your solution here

## 2. Dimensionality Reduction

   - Perform PCA on the dataset to reduce dimensionality, capturing the primary sources of variation.
   - Use UMAP or t-SNE to further reduce the data for visualization, making it easier to see clusters of cells. 
   - Justify why your selected dimension reduction technique and its configuration are suitable.


In [None]:
#your solution here

## 3. Clustering to Identify Cell Types
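
One way to support the choice of clustering algorithm with an experiment is to compare candidates on the same embedding with an internal quality measure. The sketch below assumes scikit-learn is installed and uses synthetic blobs as a stand-in for a PCA embedding. The function name `compare_clusterings`, the two candidate algorithms, and the use of the silhouette score are illustrative choices, not the method prescribed by this assignment.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def compare_clusterings(X, n_clusters):
    """Return the silhouette score per algorithm for a fixed number of clusters."""
    models = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "agglomerative": AgglomerativeClustering(n_clusters=n_clusters),
    }
    return {
        name: float(silhouette_score(X, model.fit_predict(X)))
        for name, model in models.items()
    }


# Synthetic blobs stand in for a PCA embedding of the cells.
X_toy, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)
scores = compare_clusterings(X_toy, n_clusters=3)
print(scores)
```

The silhouette score favors compact, well-separated clusters, so it is only one line of evidence; graph-based methods such as the Leiden algorithm used in the tutorial can be added to the comparison by scoring their labels in the same way.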
  

In [None]:
#your solution here

## 4. Marker Gene Analysis and Annotation