# Getting data from a paper

### 1) Find the GEO accession number

When data is uploaded to GEO, the paper reports it with a GEO accession number. Searching the paper (Ctrl + F) for "GEO" will usually take you to the ID needed to download the data. For the Macosko et al. paper, the accession number is [GSE63472](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63472).
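
If you have the paper as plain text, the same search can be scripted. The sketch below assumes GEO series accessions have the form `GSE` followed by digits, as in GSE63472; the function name and the input file are hypothetical.

```bash
# List the unique GEO series accessions (GSE followed by digits) found in a text file.
# Usage (hypothetical file name): find_geo_accessions paper.txt
find_geo_accessions() {
    grep -oE 'GSE[0-9]+' "$1" | sort -u
}
```

Running it on a text export of the paper should print each accession once, even if it is mentioned several times.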

### 2) Search available data in GEO

Go to the GEO datasets [website](https://www.ncbi.nlm.nih.gov/gds) and put the accession number in the search bar. The next step is similar to the one in the cellranger download notebook. Scrolling to the bottom of the page shows the files that are available. You will notice that the raw data is available through SRA (see the link halfway down the page). Look for the SRX907219 link and click through.

**Raw Data**

We will process the data from **P14 mouse retina 1** through the Drop-seq pipeline to learn how to generate count matrices. Since the fastq files are large, we don't want everyone downloading them individually. Instead, this has been done for you, and the files are located in our shared class directory:

    /oasis/tscc/scratch/cshl_2018/raw_data_macosko/

**If you are using your own cluster, you will need to download these files yourself.**

The files were downloaded as follows:

1) From the GEO page, follow the SRA link and find the sequencing results from P14 mouse retina 1 located [here](https://www.ncbi.nlm.nih.gov/sra?term=SRX907219).

2) Click on the SRA link [SRR1853178](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1853178) for that experiment and navigate to the downloads tab.

3) On the download page you may see direct download links (for example, the BAM file from cellranger). You may instead see a message ("*SRA Toolkit tools directly operate on SRA runs. Toolkit has capacity to find requested runs at NCBI and download (and cache) only the part you really need...*"), which means you will need to install the SRA Toolkit, also linked from this page, and use its **fastq-dump** tool. The following command downloads run SRR1853178, part 1 of the Macosko dataset:

```bash
fastq-dump --split-files --gzip SRR1853178
```

- The **split-files** flag writes each read of a pair to its own file (`_1` and `_2`), which we need because these reads are paired-end.
- The **gzip** flag will automatically compress the fastq files.
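
After a long download it is worth confirming that both mate files are complete. The following is a minimal sketch, not part of the original workflow: the function names are hypothetical, and it assumes standard four-line FASTQ records.

```bash
# Count the records in a gzipped FASTQ file (4 lines per record).
count_fastq_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Succeed only if both mate files hold the same number of records.
# Usage: check_mate_counts SRR1853178_1.fastq.gz SRR1853178_2.fastq.gz
check_mate_counts() {
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    echo "read 1: $n1 records, read 2: $n2 records"
    [ "$n1" = "$n2" ]
}
```

A mismatch between the two counts usually points to an interrupted download or a truncated file, in which case the run should be fetched again.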

### 3) Downsampling large datasets

[seqtk](https://github.com/lh3/seqtk) is a great toolkit for processing sequences in FASTA/Q formats, including what we need to do here: **downsampling**.


Create an environment:

```bash
conda create -y -n seqtk

# On conda 4.4 or later, `conda activate seqtk` is the preferred form.
source activate seqtk
```


Install seqtk from source:

```bash
# Create the software directory if it does not exist yet.
mkdir -p ~/software
cd ~/software

git clone https://github.com/lh3/seqtk.git

cd seqtk
make
```

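Add seqtk to your PATH by appending an export line to `~/.bashrc` (adjust the path if you cloned seqtk somewhere other than `~/software`):

```bash
# Append the export line to ~/.bashrc.
# Alternatively, open ~/.bashrc in an editor such as vi and add the line by hand.
echo 'export PATH="$HOME/software/seqtk:$PATH"' >> ~/.bashrc
```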
 
 
Then reload `~/.bashrc` so the change takes effect in the current shell:

```bash
source ~/.bashrc
```
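
A quick way to confirm the PATH change worked is to ask the shell where it finds the tool. This is a small sketch with a hypothetical helper name; `command -v` is a standard shell builtin.

```bash
# Report where a tool resolves on PATH, or fail with a message if it is missing.
# Usage: require_tool seqtk
require_tool() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 is not on PATH" >&2
        return 1
    fi
}
```

If `require_tool seqtk` reports that the tool is missing, check that the directory in the export line matches the location where seqtk was built.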

**Run seqtk**

Here we subsample 100,000,000 read pairs from the two large paired FASTQ files (`SRR1853178_1.fastq.gz` and `SRR1853178_2.fastq.gz`). It is **very** important to use the same random seed (here `-s100`) for both files: with the same seed, seqtk picks the same records from each file, so the pairing is kept.

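```bash
# seqtk writes uncompressed FASTQ to stdout, so pipe it through gzip
# to make the contents match the .fastq.gz file name.
seqtk sample -s100 SRR1853178_1.fastq.gz 100000000 | gzip > SRR1853178_1_100M.fastq.gz

seqtk sample -s100 SRR1853178_2.fastq.gz 100000000 | gzip > SRR1853178_2_100M.fastq.gz
```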
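
Because the pairing depends on the seed being identical, it can be worth checking the two subsampled files afterwards. The sketch below is one possible check, with hypothetical function names; it assumes that mates share the first whitespace-delimited token of their header line, optionally followed by `/1` or `/2`.

```bash
# Print the read ID of every record in a FASTQ file: the first token of each
# header line, with any trailing /1 or /2 removed.
# `gzip -dcf` decompresses gzipped input and passes plain text through unchanged.
fastq_ids() {
    gzip -dcf "$1" | awk 'NR % 4 == 1 { id = $1; sub(/\/[12]$/, "", id); print id }'
}

# Succeed only if both files list the same read IDs in the same order.
# Usage: pairs_in_sync SRR1853178_1_100M.fastq.gz SRR1853178_2_100M.fastq.gz
pairs_in_sync() {
    ids1=$(mktemp)
    ids2=$(mktemp)
    fastq_ids "$1" > "$ids1"
    fastq_ids "$2" > "$ids2"
    if cmp -s "$ids1" "$ids2"; then status=0; else status=1; fi
    rm -f "$ids1" "$ids2"
    return "$status"
}
```

If the check fails, the most likely cause is that the two `seqtk sample` commands were run with different seeds or different read counts.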