# Bayesian Data Analysis Project: Gene Cluster Analysis Using Single-Cell RNA Sequencing

## Introduction

Single-cell RNA sequencing measures RNA expression levels in individual cells, which makes it possible to identify expression patterns in cell subpopulations. The goal of this study is to cluster single-cell expression profiles using probabilistic modelling and to check whether the resulting clusters resemble those produced by a skilled bioinformatician.

The clustering produced by the bioinformatician is shown below. Some cluster structure is visible, but the data also appear relatively noisy.

<img src="figures/bioinf_clusters.png" width="800">

### Dataset

The data were obtained from single-cell RNA sequencing of mouse bone marrow. The reads were mapped to the mouse genome, and each read mapping to a gene was recorded as a *count* for that gene. Counts were measured for 1500 samples and ≈27000 genes. Based on a previous literature study, we selected a subset of 48 genes of interest for further analysis. We were also provided with cell-type labels computed using the method presented in [Paul et al. 2015].



#### Dataset analysis
Because a single living cell contains only small amounts of RNA, the dataset is very sparse: 95% of the gene counts (in the filtered dataset) are zero, and each sample has approximately 5.5 counts on average.

<img src="figures/histogram.png" width="400">
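
The two summary statistics quoted above (fraction of zero entries and mean counts per sample) can be computed directly from a count matrix. The following is a minimal sketch; the function name `sparsity_summary` and the synthetic Poisson data are assumptions made for illustration, so the printed values will not match the figures reported for the real dataset.

```python
import numpy as np


def sparsity_summary(counts):
    """Return (fraction of zero entries, mean total counts per sample).

    `counts` is assumed to be a (samples x genes) array of non-negative counts.
    """
    counts = np.asarray(counts)
    zero_fraction = np.mean(counts == 0)            # share of entries equal to zero
    mean_counts = counts.sum(axis=1).mean()         # average row total (one row per cell)
    return zero_fraction, mean_counts


# Hypothetical stand-in for the filtered matrix: 1500 cells x 48 genes.
rng = np.random.RandomState(0)
toy_counts = rng.poisson(lam=0.1, size=(1500, 48))

zero_fraction, mean_counts = sparsity_summary(toy_counts)
print("fraction of zero entries: %.3f" % zero_fraction)
print("mean counts per sample:   %.2f" % mean_counts)
```

Applied to the real data, the same call would presumably be `sparsity_summary(x_train)` once the preprocessed matrix is loaded.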

Using the t-SNE dimensionality reduction algorithm, we visualized the complete dataset (27000 genes) and the filtered dataset (48 genes), as shown below. The unfiltered dataset has almost no visible structure, whereas some structure is visible in the filtered dataset. We also see that the computed cell-type labels do not agree with the structure in the filtered dataset, which suggests that these labels should not be trusted.
<img src="figures/tsne_all.png" width="400">
<img src="figures/tsne_filtered.png" width="400">
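
An embedding of this kind can be produced with scikit-learn's `TSNE`. The sketch below is illustrative only: the helper name `tsne_embed`, the `log1p` transform, the perplexity value, and the synthetic input are all assumptions, and they are not necessarily the settings used to produce the figures above.

```python
import numpy as np
from sklearn.manifold import TSNE


def tsne_embed(counts, perplexity=30.0, seed=0):
    """Embed a (samples x genes) count matrix in two dimensions with t-SNE."""
    # log1p compresses the heavy right tail of count data. This is an assumed
    # preprocessing step, not one confirmed by the notebook.
    features = np.log1p(np.asarray(counts, dtype=float))
    model = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return model.fit_transform(features)


# Hypothetical stand-in data: 200 cells x 48 genes of sparse Poisson counts.
rng = np.random.RandomState(0)
toy_counts = rng.poisson(lam=0.1, size=(200, 48))

embedding = tsne_embed(toy_counts)
print(embedding.shape)  # (200, 2)
```

To compare the embedding with a labelling, the two columns can be drawn as a scatter plot coloured by label, for example `plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)` with `labels` being an integer array of cluster assignments.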






## Methods and Results

In [45]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import gzip
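try:
    import cPickle as cpkl  # Python 2
except ImportError:
    import pickle as cpkl   # Python 3 (cPickle was merged into pickle)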

In [61]:
# Load the preprocessed data. See data_processing.ipynb for details.
with open('preprocessed_data.cpkl', 'rb') as f:  # pickle files should be read in binary mode
    data = cpkl.load(f)

x_train = data['x_train']    # one-hot matrix (samples x genes)
t_train = data['t_train']    # labels from the bioinformatician
genes_id = data['genes_id']
genes = data['genes']
# sampleid and words hold the same information as x_train in a different encoding:
# sampleid gives the cell, and words gives the id of each observed word.
sampleid = data['sampleid']
words = data['words']
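
To make the relationship between `x_train` and the `sampleid`/`words` pair concrete, here is a minimal sketch of one plausible encoding. It assumes that every observed count unit becomes one token stored as a (cell index, gene index) pair; the actual layout is defined in data_processing.ipynb and may differ. The helper names `counts_to_tokens` and `tokens_to_counts` are hypothetical.

```python
import numpy as np


def counts_to_tokens(counts):
    """Flatten a (samples x genes) count matrix into two parallel token arrays.

    Assumed layout (not confirmed by the notebook): each count unit becomes
    one token, recorded as the index of its cell and the index of its gene.
    """
    counts = np.asarray(counts)
    cell_idx, gene_idx = np.nonzero(counts)    # positions with a count > 0
    repeats = counts[cell_idx, gene_idx]       # number of tokens per position
    return np.repeat(cell_idx, repeats), np.repeat(gene_idx, repeats)


def tokens_to_counts(sample_ids, word_ids, n_samples, n_genes):
    """Inverse mapping: rebuild the count matrix from the token arrays."""
    counts = np.zeros((n_samples, n_genes), dtype=int)
    np.add.at(counts, (sample_ids, word_ids), 1)   # accumulates repeated pairs
    return counts


toy = np.array([[0, 2, 0],
                [1, 0, 0]])
toy_sampleid, toy_words = counts_to_tokens(toy)
print(toy_sampleid)  # [0 0 1]
print(toy_words)     # [1 1 0]
```

A token-level layout like this is convenient for mixture or topic-style models that assign a latent variable to each observed count, while the matrix form is more convenient for summaries and plotting.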



## Conclusion

## References

Paul, Franziska, et al. "Transcriptional heterogeneity and lineage commitment in myeloid progenitors." Cell 163.7 (2015): 1663-1677.