# Set up

In this notebook we will prepare the necessary files to run the notebooks.

We will set up the directories and download files automatically where possible, to make this first step less tedious. Unfortunately, some files must be downloaded manually, either because the repository doesn't allow direct download or because it requires a login.

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
import gzip
import pandas as pd
import scanpy as sc
from vpolo.alevin import parser

In [None]:
!python setup.py install

In [None]:
import sys, os
sys.path.insert(0, os.getcwd() + '/code')

# Helper functions that turn the downloaded Ding and Mereu files into adatas
from triku_nb_code.file_download_and_generation import process_ding, process_mereu

In [None]:
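# `!pwd` returns a list with one string: the path of the `notebooks` folder
root_dir = !pwd
# Drop the trailing 'notebooks' (9 characters); root_dir keeps its trailing slash
root_dir = root_dir[0][:-9]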
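
The slice above only works when the notebook is run from a folder named `notebooks`. A minimal sketch of the same idea with `pathlib`, assuming that layout; `repo_root_from` is a hypothetical helper, not part of triku:

```python
from pathlib import Path


def repo_root_from(notebook_dir):
    """Return the parent of the `notebooks` folder, with a trailing slash.

    Hypothetical helper: it assumes the notebook lives in `<repo>/notebooks`.
    """
    notebook_dir = Path(notebook_dir)
    if notebook_dir.name != "notebooks":
        raise ValueError(f"expected a 'notebooks' folder, got {notebook_dir}")
    # Keep the trailing slash so that `root_dir + 'data/...'` still works.
    return str(notebook_dir.parent) + "/"


# Illustrative path only; in the notebook you would pass os.getcwd().
example_root = repo_root_from("/home/user/triku/notebooks")
```

Failing loudly on an unexpected folder name avoids silently building paths from a wrongly truncated string.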

## Downloading the data

The datasets prepared in this notebook are required by the other notebooks. We will download several datasets for benchmarking, and we will also process them.

Currently, if you have downloaded the triku repo files directly, the file structure should be as follows:
```
triku\
    cli\
    pp\
    ...
notebooks\
    code\
    *.ipynb files
LICENSE
MANIFEST.in
README.md
requirements.txt
setup.py
```

After this part we will add a `data` folder, with some datasets.
That is, at the end of the section you should have a structure like this:

```
data\
triku\
notebooks\
LICENSE
...
```


### Downloading Mereu et al. 2020 dataset
This is a great benchmarking dataset with human PBMCs and mouse colon cells, with several library preparation methods. We will download some of them, mainly the most widely used ones (Chromium, SMARTseq-2, CELseq, InDrops, etc.).

We will also include cell type information for each dataset, so that we can use it later to do comparisons with other methods.


The final structure of the folder should be:
```
data\
    Mereu_2020\
        tsv\
        cell_types\
```

* `tsv` should have the original .tsv files from the GEO repository.
* `cell_types` should have two dataframes, one for human and one for mouse. These dataframes contain the cell types described in the publication.
   The cell types have been obtained from [here](https://www.dropbox.com/s/i8mwmyymchx8mn8/sce.all_classified.technologies.RData?dl=0). For simplicity, they have been extracted from the adata and added to the folder.
* `Mereu_2020` should have several adatas, one for each technique and organism, with read counts and the observed cell types.

In [None]:
mereu_dir = root_dir + 'data/Mereu_2020/'
os.makedirs(mereu_dir + 'tsv', exist_ok=True)
os.makedirs(mereu_dir + 'cell_types', exist_ok=True)

In [None]:
mereu_tsv_dir = mereu_dir + 'tsv'
# CELseq2
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133539/suppl/GSE133539%5FCELseq2%5Fhuman%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133539/suppl/GSE133539%5FCELseq2%5Fmouse%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
# Dropseq
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133540/suppl/GSE133540%5FDropseq%5Fhuman%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133540/suppl/GSE133540%5FDropseq%5Fmouse%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
# QUARTZseq
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133543/suppl/GSE133543%5FQUARTZseq%5Fhuman%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133543/suppl/GSE133543%5FQUARTZseq%5Fmouse%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
# SMARTseq2
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133545/suppl/GSE133545%5FSMARTseq2%5Fhuman%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133545/suppl/GSE133545%5FSMARTseq2%5Fmouse%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
# singleNuclei
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133546/suppl/GSE133546%5FSingleNuclei%5Fhuman%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133546/suppl/GSE133546%5FSingleNuclei%5Fmouse%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
# ddSEQ
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133547/suppl/GSE133547%5FddSEQ%5Fhuman%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133547/suppl/GSE133547%5FddSEQ%5Fmouse%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
# inDrop
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133548/suppl/GSE133548%5FinDrop%5Fhuman%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133548/suppl/GSE133548%5FinDrop%5Fmouse%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
# 10X
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133535/suppl/GSE133535%5F10X2x5Kcell250Kreads%5Fhuman%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
!aria2c -x 16 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133535/suppl/GSE133535%5F10X2x5Kcell250Kreads%5Fmouse%5Fexp%5Fmat%2Etsv%2Egz -d $mereu_tsv_dir
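
The sixteen download commands above follow a single pattern, so the URLs can also be generated from a small table. The sketch below is an illustration rather than part of the pipeline: `geo_suppl_url` and `MEREU_SERIES` are hypothetical names, the accessions are copied from the commands above, and the URLs are written with literal `_` and `.` instead of the percent-encoded `%5F` and `%2E`, which decode to the same address.

```python
GEO_FTP = "https://ftp.ncbi.nlm.nih.gov/geo/series"

# File label -> GEO series accession, copied from the commands above.
MEREU_SERIES = {
    "CELseq2": "GSE133539",
    "Dropseq": "GSE133540",
    "QUARTZseq": "GSE133543",
    "SMARTseq2": "GSE133545",
    "SingleNuclei": "GSE133546",
    "ddSEQ": "GSE133547",
    "inDrop": "GSE133548",
    "10X2x5Kcell250Kreads": "GSE133535",
}


def geo_suppl_url(accession, label, organism):
    """Build the supplementary-file URL for one expression matrix."""
    # The series folder replaces the last three digits with 'nnn',
    # e.g. GSE133539 -> GSE133nnn, as in the URLs above.
    series_dir = accession[:-3] + "nnn"
    filename = f"{accession}_{label}_{organism}_exp_mat.tsv.gz"
    return f"{GEO_FTP}/{series_dir}/{accession}/suppl/{filename}"


mereu_urls = [
    geo_suppl_url(accession, label, organism)
    for label, accession in MEREU_SERIES.items()
    for organism in ("human", "mouse")
]
```

Each entry of `mereu_urls` could then be passed to the same `aria2c` call used above.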

In [None]:
!gunzip $mereu_tsv_dir/*.gz 

Now that the files have been downloaded and extracted, we will generate the adatas. Each adata file will be named `{technique}_{organism}.h5`.
It will contain the cells that are annotated.

In [None]:
process_mereu(mereu_dir)

### Downloading Ding et al. 2020 dataset
The Ding dataset is hosted on the Single Cell Portal, under accession numbers SCP424 and SCP425. Access requires a login, so you must log in, download the data manually, and place it in the directories listed below.

After downloading the dataset, the final file structure should look like this:
```
Ding_2020\
    human\
        cells.read.new.txt   ->   Barcode names
        counts.read.txt.gz   ->   Count matrix in MM format
        genes.read.txt       ->   Feature names
        meta.txt             ->   Metadata file with annotations
    mouse\
        cells.names.new.txt  ->   Barcode names
        count.reads.txt.gz   ->   Count matrix in MM format
        genes.count.txt      ->   Feature names
        meta_combined.txt    ->   Metadata file with annotations
```

In [None]:
ding_dir = root_dir + 'data/Ding_2020/'
os.makedirs(ding_dir + 'human', exist_ok=True)
os.makedirs(ding_dir + 'mouse', exist_ok=True)
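
Because these files are placed by hand, it can help to confirm that nothing is missing before processing them. The following is a minimal sketch under the layout listed above; `DING_EXPECTED` and `missing_ding_files` are hypothetical names, and the demo runs on a temporary directory rather than on the real `ding_dir`.

```python
import tempfile
from pathlib import Path

# Expected file names per organism, copied from the structure listed above.
DING_EXPECTED = {
    "human": ["cells.read.new.txt", "counts.read.txt.gz",
              "genes.read.txt", "meta.txt"],
    "mouse": ["cells.names.new.txt", "count.reads.txt.gz",
              "genes.count.txt", "meta_combined.txt"],
}


def missing_ding_files(ding_dir):
    """Return the relative paths of the expected files that are absent."""
    ding_dir = Path(ding_dir)
    return [
        str(Path(organism) / name)
        for organism, names in DING_EXPECTED.items()
        for name in names
        if not (ding_dir / organism / name).is_file()
    ]


# Demo: a throwaway folder that contains only human/meta.txt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "human").mkdir()
    (Path(tmp) / "human" / "meta.txt").touch()
    demo_missing = missing_ding_files(tmp)
```

In the notebook, `missing_ding_files(ding_dir)` should return an empty list once every manual download is in place.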

In [None]:
process_ding(ding_dir)

To keep the nomenclature consistent with Mereu's dataset, we are going to rename some datasets and delete others.

In [None]:
os.replace(ding_dir + '/10x Chromium (v3)_human.h5ad', ding_dir + '/10X_human.h5ad')
os.replace(ding_dir + '/10x Chromium_mouse.h5ad', ding_dir + '/10X_mouse.h5ad')
os.replace(ding_dir + '/DroNc-seq_mouse.h5ad', ding_dir + '/SingleNuclei_mouse.h5ad')
os.replace(ding_dir + '/inDrops_human.h5ad', ding_dir + '/inDrop_human.h5ad')
os.replace(ding_dir + '/Drop-seq_human.h5ad', ding_dir + '/Dropseq_human.h5ad')
os.replace(ding_dir + '/Smart-seq2_human.h5ad', ding_dir + '/SMARTseq2_human.h5ad')
os.replace(ding_dir + '/Smart-seq2_mouse.h5ad', ding_dir + '/SMARTseq2_mouse.h5ad')
os.replace(ding_dir + '/CEL-Seq2_human.h5ad', ding_dir + '/CELseq2_human.h5ad')
os.remove(ding_dir + '/10x Chromium (v2) A_human.h5ad')
os.remove(ding_dir + '/10x Chromium (v2) B_human.h5ad')
os.remove(ding_dir + '/10x Chromium (v2)_human.h5ad')
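
`os.replace` raises `FileNotFoundError` when the source file is gone, so rerunning the cell above after a partial run will fail. One possible alternative, sketched here with hypothetical names (`DING_RENAMES`, `rename_existing`) and only a few of the pairs above, keeps the mapping as data and skips files that are no longer present:

```python
import os
import tempfile

# A few of the old -> new pairs from the cell above, written as data.
DING_RENAMES = {
    "10x Chromium (v3)_human.h5ad": "10X_human.h5ad",
    "10x Chromium_mouse.h5ad": "10X_mouse.h5ad",
    "inDrops_human.h5ad": "inDrop_human.h5ad",
    "Drop-seq_human.h5ad": "Dropseq_human.h5ad",
}


def rename_existing(directory, renames):
    """Rename the files that exist; return the new names that were created."""
    renamed = []
    for old, new in renames.items():
        src = os.path.join(directory, old)
        if os.path.exists(src):
            os.replace(src, os.path.join(directory, new))
            renamed.append(new)
    return renamed


# Demo on a throwaway directory holding only one of the expected files.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "inDrops_human.h5ad"), "w").close()
    demo_renamed = rename_existing(tmp, DING_RENAMES)
    demo_listing = sorted(os.listdir(tmp))
```

Running the function a second time on the same folder simply returns an empty list.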

### Downloading 10X datasets

In [None]:
!wget http://cf.10xgenomics.com/samples/cell-exp/3.0.0/heart_10k_v3/heart_10k_v3_raw_feature_bc_matrix.h5 -P $root_dir/data/10x
!wget http://cf.10xgenomics.com/samples/cell-exp/3.0.0/neuron_10k_v3/neuron_10k_v3_raw_feature_bc_matrix.h5 -P $root_dir/data/10x
!wget http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_10k_v3/pbmc_10k_v3_raw_feature_bc_matrix.h5 -P $root_dir/data/10x

In [None]:
!aria2c -x 8 http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/3.0.0/neuron_10k_v3/neuron_10k_v3_fastqs.tar -d $root_dir/data/10x/FASTQs
!aria2c -x 8 http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/3.0.0/heart_10k_v3/heart_10k_v3_fastqs.tar -d $root_dir/data/10x/FASTQs
!aria2c -x 8 http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/3.0.0/pbmc_10k_v3/pbmc_10k_v3_fastqs.tar -d $root_dir/data/10x/FASTQs

In [None]:
!tar -xvf $root_dir/data/10x/FASTQs/neuron_10k_v3_fastqs.tar -C $root_dir/data/10x/FASTQs/
!tar -xvf $root_dir/data/10x/FASTQs/heart_10k_v3_fastqs.tar -C $root_dir/data/10x/FASTQs/
!tar -xvf $root_dir/data/10x/FASTQs/pbmc_10k_v3_fastqs.tar -C $root_dir/data/10x/FASTQs/

In [None]:
!aria2c -x 16 ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.pc_transcripts.fa.gz -d $root_dir/data/references
!aria2c -x 16 ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.primary_assembly.annotation.gtf.gz -d $root_dir/data/references
    
!aria2c -x 16 ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.pc_transcripts.fa.gz -d $root_dir/data/references
!aria2c -x 16 ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.primary_assembly.annotation.gtf.gz -d $root_dir/data/references

In [None]:
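for prefix in ['vM25', 'v34']:
    # Read the gzipped FASTA; in binary mode each line is a bytes object
    with gzip.open(f'{root_dir}/data/references/gencode.{prefix}.pc_transcripts.fa.gz', 'r') as f:
        lines = f.readlines()

    # Map transcript ID -> gene name, and shorten each FASTA header to the transcript ID
    t2gdict = {}
    for i, line in enumerate(lines):
        line = line.decode('utf-8')
        if line.startswith('>'):
            fields = line.split('|')
            t = fields[0]  # '>' followed by the transcript ID
            t2gdict[t[1:]] = fields[5]  # sixth field of the GENCODE header: gene name
            lines[i] = t + '\n'
        else:
            lines[i] = line

    # Two-column TSV without header, passed later to salmon alevin through --tgMap
    df = pd.DataFrame(t2gdict.items())
    df.to_csv(f'{root_dir}/data/references/txp2gene_{prefix}.tsv', sep='\t', header=False, index=False)

    # Write the uncompressed FASTA with the simplified headers
    with open(f'{root_dir}/data/references/gencode.{prefix}.pc_transcripts.fa', 'w') as f:
        f.writelines(lines)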
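
To make the header handling in the loop above concrete, here is the same split applied to a single line. The header is an invented placeholder that only mimics the `|`-separated GENCODE layout (transcript ID first, gene name sixth); it is not copied from the real files, and `parse_gencode_header` is a hypothetical helper.

```python
def parse_gencode_header(header):
    """Return (transcript_id, gene_name) from a '|'-separated FASTA header."""
    fields = header.rstrip("\n").split("|")
    transcript_id = fields[0][1:]  # drop the leading '>'
    gene_name = fields[5]          # sixth field, as used in the loop above
    return transcript_id, gene_name


# Placeholder values, chosen only to show the field positions.
example_header = ">ENSTEXAMPLE.1|ENSGEXAMPLE.1|-|-|GENE1-201|GENE1|1000|CDS:1-1000|\n"
example_pair = parse_gencode_header(example_header)

# The simplified header that would be written back to the FASTA file.
simplified_header = ">" + example_pair[0] + "\n"
```

Each such pair becomes one row of the `txp2gene_{prefix}.tsv` file.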

In [None]:
!salmon index -t $root_dir/data/references/gencode.vM25.pc_transcripts.fa -i $root_dir/data/references/index_mouse
!salmon index -t $root_dir/data/references/gencode.v34.pc_transcripts.fa -i $root_dir/data/references/index_human

In [None]:
!salmon alevin -lISR -1 $root_dir/data/10x/FASTQs/neuron_10k_v3_fastqs/neuron_10k_v3_S1_L001_R1_001.fastq.gz \
$root_dir/data/10x/FASTQs/neuron_10k_v3_fastqs/neuron_10k_v3_S1_L002_R1_001.fastq.gz \
-2 $root_dir/data/10x/FASTQs/neuron_10k_v3_fastqs/neuron_10k_v3_S1_L001_R2_001.fastq.gz \
$root_dir/data/10x/FASTQs/neuron_10k_v3_fastqs/neuron_10k_v3_S1_L002_R2_001.fastq.gz \
--chromium -i $root_dir/data/references/index_mouse -p 8 -o $root_dir/data/10x/FASTQs/alevin_output_neuron --tgMap $root_dir/data/references/txp2gene_vM25.tsv

In [None]:
!salmon alevin -lISR -1 $root_dir/data/10x/FASTQs/heart_10k_v3_fastqs/heart_10k_v3_S1_L001_R1_001.fastq.gz \
$root_dir/data/10x/FASTQs/heart_10k_v3_fastqs/heart_10k_v3_S1_L002_R1_001.fastq.gz \
-2 $root_dir/data/10x/FASTQs/heart_10k_v3_fastqs/heart_10k_v3_S1_L001_R2_001.fastq.gz \
$root_dir/data/10x/FASTQs/heart_10k_v3_fastqs/heart_10k_v3_S1_L002_R2_001.fastq.gz \
--chromium -i $root_dir/data/references/index_mouse -p 8 -o $root_dir/data/10x/FASTQs/alevin_output_heart --tgMap $root_dir/data/references/txp2gene_vM25.tsv

In [None]:
!salmon alevin -lISR -1 $root_dir/data/10x/FASTQs/pbmc_10k_v3_fastqs/pbmc_10k_v3_S1_L001_R1_001.fastq.gz \
$root_dir/data/10x/FASTQs/pbmc_10k_v3_fastqs/pbmc_10k_v3_S1_L002_R1_001.fastq.gz \
-2 $root_dir/data/10x/FASTQs/pbmc_10k_v3_fastqs/pbmc_10k_v3_S1_L001_R2_001.fastq.gz \
$root_dir/data/10x/FASTQs/pbmc_10k_v3_fastqs/pbmc_10k_v3_S1_L002_R2_001.fastq.gz \
--chromium -i $root_dir/data/references/index_human -p 8 -o $root_dir/data/10x/FASTQs/alevin_output_pbmc --tgMap $root_dir/data/references/txp2gene_v34.tsv
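
The three `salmon alevin` calls above differ only in the sample name, the index, and the transcript-to-gene map. As a sketch, the argument list could be assembled by a small function; `alevin_command` is a hypothetical helper that reuses only the flags and paths shown above and assumes the two-lane `L001`/`L002` file naming of these FASTQ folders.

```python
def alevin_command(root_dir, sample, index_name, tgmap_name,
                   lanes=("L001", "L002"), threads=8):
    """Assemble the salmon alevin argument list for one 10x sample."""
    fastq_dir = f"{root_dir}/data/10x/FASTQs/{sample}_10k_v3_fastqs"
    # R1 holds barcodes and UMIs, R2 holds the cDNA reads.
    r1 = [f"{fastq_dir}/{sample}_10k_v3_S1_{lane}_R1_001.fastq.gz" for lane in lanes]
    r2 = [f"{fastq_dir}/{sample}_10k_v3_S1_{lane}_R2_001.fastq.gz" for lane in lanes]
    return [
        "salmon", "alevin", "-lISR",
        "-1", *r1,
        "-2", *r2,
        "--chromium",
        "-i", f"{root_dir}/data/references/{index_name}",
        "-p", str(threads),
        "-o", f"{root_dir}/data/10x/FASTQs/alevin_output_{sample}",
        "--tgMap", f"{root_dir}/data/references/{tgmap_name}",
    ]


# '/repo' is a placeholder for root_dir (without its trailing slash).
pbmc_cmd = alevin_command("/repo", "pbmc", "index_human", "txp2gene_v34.tsv")
```

The resulting list can be passed to `subprocess.run(pbmc_cmd, check=True)`, which also avoids shell quoting issues with paths.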

In [None]:
for dataset_prefix in ['heart', 'pbmc', 'neuron']:
    # Load the alevin count matrix from the output folder as a DataFrame
    alevin_df = parser.read_quants_bin(f"{root_dir}/data/10x/FASTQs/alevin_output_{dataset_prefix}")
    adata = sc.AnnData(alevin_df)
    # Note: the file is written in AnnData (h5ad) format despite its .h5 extension
    adata.write_h5ad(f"{root_dir}/data/10x/FASTQs/alevin_output_{dataset_prefix}/{dataset_prefix}_10k_v3_filtered_feature_bc_matrix.h5")

### Generating artificial datasets
Refer to the `2_Generation_of_artificial_datasets.ipynb` notebook.