# Prepare real data in a format compatible with BayVel
Since *BayVel* runs in Julia and R, we need to export the real dataset in a format that is compatible with these programming languages. 

In particular, we will save the dataset as CSV files. We will save a separate set of files for each stage of the scVelo pre-processing pipeline.


Load the packages.

In [1]:
import scanpy
import scvelo as scv
import numpy as np
import scipy.sparse
import pandas as pd 
import copy 

In [2]:
scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True  # set max width size for presenter view
scv.settings.set_figure_params('scvelo')  # for beautified visualization

Choose which dataset ("Pancreas" or "DentateGyrus") to prepare in a format compatible with *BayVel*, and set the path where the .csv files will be saved.

In [3]:
typeSIM = "Pancreas"
pathOutput = "pathOutput"
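
The cells further down write one sub-folder per pre-processing stage under `pathOutput/typeSIM`. As a quick orientation, the sketch below only builds and prints those folder names; it does not touch the data. The names `type_sim`, `path_output`, `stages` and `stage_dirs` are illustrative and are not used elsewhere in this notebook.

```python
from pathlib import Path

type_sim = "Pancreas"        # plays the role of typeSIM above
path_output = "pathOutput"   # plays the role of pathOutput above (a placeholder)

# One sub-folder per pre-processing stage, with the names used later in this notebook.
stages = ["filter", "filter_and_normalize_noLog", "filter_and_normalize", "moments"]
stage_dirs = {stage: Path(path_output) / type_sim / stage for stage in stages}

for stage, folder in stage_dirs.items():
    print(stage, "->", folder.as_posix())
```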

Load the data.

In [4]:
if typeSIM == "Pancreas":
    adata = scv.datasets.pancreas()
elif typeSIM == "DentateGyrus":
    adata = scv.datasets.dentategyrus()

100%|██████████| 50.0M/50.0M [00:02<00:00, 17.5MB/s]


Apply the scVelo pre-processing steps, with the default parameters used in scVelo notebooks. 

First, filter the genes.

In [6]:
scv.pp.filter_genes(adata, min_shared_counts=20)
adata_filter = copy.deepcopy(adata)

Filtered out 20801 genes that are detected 20 counts (shared).


Now normalize the data and extract the 2000 most highly variable genes.

In [7]:
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=2000)

Normalized count data: X, spliced, unspliced.
Extracted 2000 highly variable genes.


Save the results of the different pre-processing steps.

In [None]:
# Filtered, un-normalized counts, restricted to the 2000 most highly variable genes
path = pathOutput + "/" + typeSIM + "/filter"

adata_filter_toGenes2000 = adata_filter[:,adata.var_names]

adata_filter_toGenes2000.write_csvs(path + "/", skip_data=False)
unspliced = pd.DataFrame(data=scipy.sparse.csr_matrix.todense(adata_filter_toGenes2000.layers["unspliced"]))
unspliced.to_csv(path + "/unspliced.csv", index=False)
spliced = pd.DataFrame(data=scipy.sparse.csr_matrix.todense(adata_filter_toGenes2000.layers["spliced"]))
spliced.to_csv(path + "/spliced.csv", index=False)

In [None]:
# normalized data
path = pathOutput + "/" + typeSIM + "/filter_and_normalize_noLog"

adata.write_csvs(path + "/", skip_data=False)
unspliced = pd.DataFrame(data=scipy.sparse.csr_matrix.todense(adata.layers["unspliced"]))
unspliced.to_csv(path + '/unspliced.csv', index=False)
spliced = pd.DataFrame(data=scipy.sparse.csr_matrix.todense(adata.layers["spliced"]))
spliced.to_csv(path + '/spliced.csv', index=False)
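
The export pattern above (densify a layer, wrap it in a `DataFrame`, write it without row labels) repeats for every stage. A minimal sketch of how it could be factored into a helper is shown below. The functions `layer_to_frame` and `save_layers` are hypothetical and not part of scVelo, AnnData or BayVel; the toy matrices stand in for `adata.layers`.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import scipy.sparse


def layer_to_frame(layer):
    """Return a dense DataFrame for a sparse or dense (cells x genes) matrix."""
    dense = layer.toarray() if scipy.sparse.issparse(layer) else np.asarray(layer)
    return pd.DataFrame(dense)


def save_layers(layers, folder, names=("unspliced", "spliced")):
    """Write each requested layer to <folder>/<name>.csv without row labels."""
    os.makedirs(folder, exist_ok=True)
    for name in names:
        target = os.path.join(folder, name + ".csv")
        layer_to_frame(layers[name]).to_csv(target, index=False)


# Toy stand-in for adata.layers: 2 cells x 2 genes, one sparse and one dense layer.
toy_layers = {
    "unspliced": scipy.sparse.csr_matrix(np.array([[1, 0], [0, 2]])),
    "spliced": np.array([[3, 4], [5, 6]]),
}
toy_folder = tempfile.mkdtemp()
save_layers(toy_layers, toy_folder)
toy_unspliced = pd.read_csv(os.path.join(toy_folder, "unspliced.csv"))
toy_spliced = pd.read_csv(os.path.join(toy_folder, "spliced.csv"))
```

With such a helper, each save cell would reduce to a call like `save_layers(adata.layers, path)`, assuming `adata.layers` is indexed by layer name as in the cells above.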

Now apply the `log1p` transform, log(1 + x), to the data and save the data again.

In [9]:
scv.pp.log1p(adata)
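
As a reminder of what a `log1p` transform does numerically, the toy example below uses NumPy directly on a hypothetical vector of counts. It is independent of `adata` and only illustrates the formula log(1 + x) and its inverse.

```python
import numpy as np

counts = np.array([0.0, 1.0, 9.0, 99.0])   # hypothetical normalized counts
logged = np.log1p(counts)                  # natural logarithm of (1 + x)

# log1p agrees with log(1 + x) and leaves zero counts at zero.
same_as_log = np.allclose(logged, np.log(1.0 + counts))
zero_preserved = logged[0] == 0.0

# expm1 is the inverse transform, useful when reading the exported values back.
recovered = np.expm1(logged)
```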



In [None]:
path = pathOutput + "/" + typeSIM + "/filter_and_normalize"

adata.write_csvs(path + "/", skip_data=False)

unspliced = pd.DataFrame(data=scipy.sparse.csr_matrix.todense(adata.layers["unspliced"]))
unspliced.to_csv(path + '/unspliced.csv', index=False)
spliced = pd.DataFrame(data=scipy.sparse.csr_matrix.todense(adata.layers["spliced"]))
spliced.to_csv(path + '/spliced.csv', index=False)

As the last step, compute the moments and save the final dataset.

In [None]:
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
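
Conceptually, the first-order moments `Ms` and `Mu` are averages of the spliced and unspliced values of each cell over its nearest neighbors. The sketch below illustrates that idea on a toy 3-cell example with a hand-written neighbor matrix. It is a simplified assumption for illustration: scVelo builds its neighbor graph from a PCA representation and uses its own connectivity weights, so the numbers here are not what `scv.pp.moments` would return.

```python
import numpy as np

# Toy spliced matrix: 3 cells x 2 genes.
toy_spliced = np.array([[2.0, 0.0],
                        [4.0, 2.0],
                        [0.0, 4.0]])

# Hand-written neighbor indicator (each cell counts itself as a neighbor).
toy_connectivities = np.array([[1.0, 1.0, 0.0],
                               [1.0, 1.0, 1.0],
                               [0.0, 1.0, 1.0]])

# Normalize each row so that it sums to one, then average over neighbors.
toy_weights = toy_connectivities / toy_connectivities.sum(axis=1, keepdims=True)
toy_Ms = toy_weights @ toy_spliced
```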

In [None]:
path = pathOutput + "/" + typeSIM + "/moments"

adata.write_csvs(path + "/", skip_data=False)

unspliced = pd.DataFrame(data=scipy.sparse.csr_matrix.todense(adata.layers["unspliced"]))
unspliced.to_csv(path + '/unspliced.csv', index=False)
spliced = pd.DataFrame(data=scipy.sparse.csr_matrix.todense(adata.layers["spliced"]))
spliced.to_csv(path + '/spliced.csv', index=False)

# The moment layers are dense arrays, so they can be wrapped in a DataFrame directly
Mu = pd.DataFrame(adata.layers["Mu"])
Mu.to_csv(path + '/Mu.csv', index=False)
Ms = pd.DataFrame(adata.layers["Ms"])
Ms.to_csv(path + '/Ms.csv', index=False)
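
Before reading the exported files from Julia or R, it can help to check what a layer file looks like on disk. The sketch below writes a toy matrix in the same way as the cells above (`pd.DataFrame(...).to_csv(..., index=False)`) and reads it back. The matrix and the temporary folder are illustrative stand-ins, not the real dataset. Because the `DataFrame` is built without labels, the header row holds integer column positions rather than gene names, and there is no column of cell names.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy_matrix = np.array([[0.0, 1.5, 2.0],
                       [3.0, 0.0, 4.5]])   # toy stand-in: 2 cells x 3 genes
check_folder = tempfile.mkdtemp()
csv_path = os.path.join(check_folder, "Ms.csv")
pd.DataFrame(toy_matrix).to_csv(csv_path, index=False)

# The first line of the file is the header of integer column positions.
with open(csv_path) as handle:
    header = handle.readline().strip()

# Reading the file back gives one row per cell and one column per gene.
reloaded = pd.read_csv(csv_path)
round_trip_ok = np.allclose(reloaded.to_numpy(), toy_matrix)
```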