# Shahlab Pipeline Analysis Guidelines

## Data Transfer

### GSC aligned data
1. If you are on the project email list, you will receive an email from the GSC with a title similar to "New FASTQ files available for SOW GSC-0297", which identifies the principal investigator who submitted the samples. The email contains an attachment with the information needed to identify and download the samples.

```
POOLED DATA:
library: IX4686
.
.
.
number_of_sublibraries: 3
-------------------------
>>>ALERT(S) DETECTED
-------------------------
LIBRARY	INDEX	FLOWCELL	LANE	ANTIBODY	UPPER_PROTOCOL	LOWER_PROTOCOL	TAXONOMY_ID	GENOME_REF	SEQ LENGTH	PATIENT_ID	EXTERNAL_ID	ORIGINAL_SOURCE	SUBMITTED_SPIKE_IN_SEQUENCE	SUBMITTED_SPIKE_IN_IDENTIFIER	ALERT_CODE	ALERT_NOTES	BIO_QC_STATUS	BIO_QC_COMMENTS
A51910	CCTTAG	C9W6RANXX	1	None	Strand Specific Transcriptome 2.1	Illumina Indexing	9606	Homo sapiens	76	SKBR3_KUTRsi_2	SA_601	None	None	None	None	Passed	2016-08-30: (N/A -> Passed) Warning:0.076 portion of reads matching human ribosomal RNA is higher than threshold 0.05
A60778	CAAAAG	C9W6RANXX	1	None	Strand Specific Transcriptome 2.2	Illumina Indexing	9606	Homo sapiens	76	SKBR3_scramSix1	SA902-R2	SA_736	None	None	None	None	Passed	2016-08-30: (N/A -> Passed) Warning:0.074 portion of reads matching human ribosomal RNA is higher than threshold 0.05
A60779	CCAACA	C9W6RANXX	1	None	Strand Specific Transcriptome 2.2	Illumina Indexing	9606	Homo sapiens	76	SKBR3_KUTRsi_3	SA903	SA_737	None	None	QC_21	Portion of reads mapping to target species ribosomal RNA is higher than expected.	Failed	2016-08-30: (N/A -> Failed) Warning:0.125 portion of reads matching target mitochondria is higher than threshold 0.1;Failed:0.121 portion of reads matching human ribosomal RNA is higher than threshold 0.1
```
 		* library pooling id: IX4686
		* LIBRARY: A51910
		* EXTERNAL_ID: SKBR3_KUTRsi_2
```
WHOLE GENOME DATA:
library: A62078
.
.
.
number_of_sublibraries: 1
-------------------------
LIBRARY	INDEX	FLOWCELL	LANE	ANTIBODY	UPPER_PROTOCOL	LOWER_PROTOCOL	TAXONOMY_ID	GENOME_REF	SEQ LENGTH	PATIENT_ID	EXTERNAL_ID	ORIGINAL_SOURCE	SUBMITTED_SPIKE_IN_SEQUENCE	SUBMITTED_SPIKE_IN_IDENTIFIER	ALERT_CODE	ALERT_NOTES	BIO_QC_STATUS	BIO_QC_COMMENTS
A62078	GCACTT	H323YALXX	1	None	Genome Shotgun PCRFree 1.1	Illumina Indexing	9606	Homo sapiens	151	HCT116	SA988	SA_829	None	None	None	None	Passed	2016-08-30: (N/A -> Passed) 
```
 		* LIBRARY: A62078
 		* EXTERNAL_ID: SA988
2. To transfer pooled exome and RNA-seq data, use the **library pooling id**, **LIBRARY** and **EXTERNAL_ID** information to make a *tab-separated* input file in the following format, with one sample per line:
IX4686	A51910	SKBR3_KUTRsi_2

3. To transfer whole genome data, use the **LIBRARY** and **EXTERNAL_ID** information to make a *tab-separated* input file in the following format, with one sample per line:
A62078	SA988

4. Clone the stash repository <> to install the transfer script in your own folder on the GSC system.

5. Identify the project name for the samples in question, and batch samples by project so that each input file covers a single project. Samples will be soft-linked into a samples/ subfolder of the relevant project folder (create the samples/ subfolder in the project folder on lustre if it doesn't already exist). The project name is a command-line input for the transfer script and must match the spelling of the /share/lustre/projects folder being written to.

6. `ssh` to xhost10 (so that the GSC folders can be accessed).

7. Run `cat my_inputfile.txt | sh make_transferBam_script_onBeast_merge_bwa-mem.sh -f -t <sample type>` to see the files that will be transferred.

8. Run `cat my_inputfile.txt | sh make_transferBam_script_onBeast_merge_bwa-mem.sh -t <sample type> -p <project>` to create the transfer script in your home directory on beast.

9. `ssh` over to beast.

10. Open a `screen` session.

11. Run `sh transferScript_YYYY_MM_DD.sh` to start the sample transfer.

12. Sample files (bam, bai, flagstats, bamstats, fastq) should be transferred to the /share/lustre/archive/ folder, with the folder structure /share/lustre/archive/_sample\_id_/illumina\__tech_/_libID_/_aligner_\_aligned for bam files and /share/lustre/archive/_sample\_id_/illumina\__tech_/_libID_/sequence/ for fastq files.
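
The two input-file formats described in steps 2 and 3 can also be produced programmatically. The following is a minimal Python sketch, not part of the transfer script: the function name `format_transfer_rows` and its `pooled` flag are hypothetical, and the sample values are the ones shown in the examples above.

```python
def format_transfer_rows(samples, pooled):
    """Return tab-separated input-file text, one sample per line.

    pooled=True expects (pooling_id, library, external_id) tuples;
    pooled=False expects (library, external_id) tuples.
    """
    expected = 3 if pooled else 2
    lines = []
    for row in samples:
        if len(row) != expected:
            raise ValueError(f"expected {expected} fields, got {len(row)}: {row!r}")
        for field in row:
            # Empty fields or embedded whitespace would break the tab-separated layout.
            if not field or any(ch.isspace() for ch in field):
                raise ValueError(f"empty or whitespace-containing field in {row!r}")
        lines.append("\t".join(row))
    return "".join(line + "\n" for line in lines)


# Values taken from the example email attachment above.
pooled_text = format_transfer_rows([("IX4686", "A51910", "SKBR3_KUTRsi_2")], pooled=True)
genome_text = format_transfer_rows([("A62078", "SA988")], pooled=False)
print(pooled_text, end="")
print(genome_text, end="")
```

Writing `pooled_text` or `genome_text` to a file such as `my_inputfile.txt` gives an input file in the layout described above. Keeping pooled and whole genome samples in separate files, and one project per file, matches the batching advice in step 5.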

### Collaborator data (non-GSC)
1. Collect information about samples
  * platform used to sequence data
  * alignment method, whether duplicates were marked, and the reference genome accession or build name (hg19, mm10, etc.)
  * an existing master sample list for this collaborator/project
2. Create meaningful sample ids for each sample, using prefixes that correspond to the sample submitter or project name, followed by a 4-digit number starting at 0001. For a small project with only a few samples, use the external ids provided, with underscores separating relevant information.
3. Because there is likely no library id for collaborator samples, just use the sample id as the library id when transferring samples to the archive.
4. Sample files (bam, bai, flagstats, bamstats, fastq) should be transferred with rsync to the /share/lustre/archive/ folder, with the folder structure /share/lustre/archive/_sample\_id_/illumina\__tech_/_libID_/_aligner_\_aligned for bam files and /share/lustre/archive/_sample\_id_/illumina\__tech_/_libID_/sequence/ for fastq files.
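
The sample id and archive layout conventions above can be sketched in Python as follows. This is an illustration only: the helper names `make_sample_ids` and `archive_dirs`, the prefix `COLLAB`, and the `tech` and `aligner` values are all hypothetical, and joining the prefix directly to the number with no separator is an assumption.

```python
import posixpath

ARCHIVE_ROOT = "/share/lustre/archive"


def make_sample_ids(prefix, count, start=1):
    """Return ids made of a prefix plus a zero-padded 4-digit number."""
    # Assumption: no separator between prefix and number.
    return [f"{prefix}{n:04d}" for n in range(start, start + count)]


def archive_dirs(sample_id, tech, aligner, lib_id=None):
    """Return the bam and fastq directories for one sample in the archive."""
    # Collaborator samples usually have no library id, so fall back to the sample id.
    lib_id = lib_id or sample_id
    base = posixpath.join(ARCHIVE_ROOT, sample_id, f"illumina_{tech}", lib_id)
    return {
        "bam": posixpath.join(base, f"{aligner}_aligned"),
        "fastq": posixpath.join(base, "sequence"),
    }


ids = make_sample_ids("COLLAB", 3)        # hypothetical prefix
dirs = archive_dirs(ids[0], "wgs", "bwa")  # hypothetical tech and aligner values
print(ids)
print(dirs["bam"])
print(dirs["fastq"])
```

The resulting directories can then be used as rsync destinations for the bam and fastq files respectively.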

### Transfer to shahlab/archive/ or extscratch/shahlab/archive/ on the GSC
1. When transferring data to shahlab/archive/, at the very least rename the bam and bai files so that the sample id is prepended to the file name, rsync them to a folder named with the sample id, and then rsync the folder structure from the GSC as desired (e.g. 75nt/hg19a_jg-e69/bwa-0.5.7/).
2. To transfer data to extscratch/shahlab/archive/, use the archive manager, which transfers files from lustre. Space on extscratch/ is much more limited than on shahlab/.
    * Calculate how much space the files take up, using something like the following (it sums only the entries that `ls` reports in gigabytes):
    `ls -hs LIST.OF.FILES | awk -F'G' 'BEGIN {sum=0} /^[ 0-9.]*G/ {sum+=$1} END {print sum"G"}'`
    * Log in to the GSC and check whether there is enough space to transfer these bam files: `du -h /genesis/extscratch/shahlab/`
    * Create the input file as /genesis/extscratch/shahlab/archive/manager/code/projects/*tsv. The input file should contain:
        * a header labelled "bccrc_file_path"
        * a list of the files to be transferred, as they appear in /share/lustre/archive on beast
    * `ssh thost05`
    * Open a `screen` session.
    * `source /genesis/extscratch/shahlab/archive/manager/bash_profile`
    * `python /genesis/extscratch/shahlab/archive/manager/code/manager.py /genesis/extscratch/shahlab/archive/manager/code/config.yaml --get_bam_index`

When files are transferred, they will be in the same archive folder structure on extscratch as they are in the lustre archive folder.
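
The size check and the manager input file described above can be sketched in Python. This is a minimal sketch under stated assumptions: the helper names `total_size_gib` and `manifest_text` are hypothetical, the example bam path is made up for illustration, and the manager is assumed to need nothing beyond the single `bccrc_file_path` column described above.

```python
import os


def total_size_gib(paths):
    """Return the combined size of the given files in GiB."""
    return sum(os.path.getsize(path) for path in paths) / 1024 ** 3


def manifest_text(paths):
    """Return input-file text: a 'bccrc_file_path' header, then one path per line."""
    return "bccrc_file_path\n" + "".join(path + "\n" for path in paths)


# Hypothetical file path, written as it would appear in the lustre archive on beast.
files = ["/share/lustre/archive/SA988/illumina_wgs/A62078/bwa_aligned/SA988.bam"]
print(manifest_text(files), end="")
```

Unlike the `ls | awk` one-liner, `total_size_gib` counts files of any size, but it must be run where the listed files are readable.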


## Ticket Setup for New Analyses on [Jira](https://www.bcgsc.ca/jira/login.jsp)
1. Create a master Jira ticket under the correct project name, with all samples in the project listed in a table in the ticket Description. For example:

| SA_ID | LibID | Downloaded | GSNAP | MISO | CUFFLINKS | CUFFLINKS_MATRIX |
| - | - | - | - | - | - | - |
| SA494X4 | A60788 | | | | | |
| SA495X5 | A60790 | | | | | |
| SA496X2 | A60782 | | | | | |

2. Create a sub-ticket for each task (hint: the analysis list for this sample set is given as a header in the table) and add all relevant collaborators as watchers.
 	* From the **More** dropdown menu > select **Create Sub-task**
 	* Copy the SampleID, LibID and relevant analysis type to the new sub-ticket and create a similar table to track analyses on that ticket. Add watchers again to this ticket.
 	* When work on the task starts, select the **Start Progress** button.
 	* When the sub-task is completed, select **Resolve Issue** then have the Reporter or another supervisor approve the closure of the ticket. 
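
For projects with many samples, the tracking table can be generated rather than typed by hand. The sketch below is only an illustration: `tracking_table` is a hypothetical helper, it emits the same pipe-delimited layout as the example table above, and whether that markup renders in a given Jira instance is an assumption.

```python
def tracking_table(samples, tasks):
    """Return a pipe-delimited table with one row per (sample id, library id) pair."""
    header = ["SA_ID", "LibID"] + list(tasks)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for sample_id, lib_id in samples:
        # Task cells start empty and are filled in as each analysis completes.
        cells = [sample_id, lib_id] + [""] * len(tasks)
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = tracking_table(
    [("SA494X4", "A60788"), ("SA495X5", "A60790"), ("SA496X2", "A60782")],
    ["Downloaded", "GSNAP", "MISO", "CUFFLINKS", "CUFFLINKS_MATRIX"],
)
print(table)
```

Passing a single task name in `tasks` produces the smaller per-task table for a sub-ticket.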

## Pipeline Runs
1. Select a folder within your own directory and clone the relevant pipeline from shahlab [Stash](https://svn.bcgsc.ca/bitbucket/projects) into it.
2. 
