# Processing of the El Sidron Y chromosome capture data

In [1]:
# make sure dependencies have been generated (they should have been
# if this notebook is run in batch mode by make)
(cd ../; make init)

make: Nothing to be done for `init'.


In [2]:
bam_dir="../bam"
input_dir="../input"
tmp_dir="../tmp"



I found this directory, which seems to contain some Y chromosome El Sidron data:

In [3]:
qiaomeis_dir="/mnt/454/Carbon_beast_QM/Y_Sidron_TY"



In [4]:
ls $qiaomeis_dir

10M_all.fasta			   snps2hg.R
10M_all.fasta.gz		   Tianyuan_haplo_dirved.txt
1_Extended_VCF			   Tianyuan-Y-alllib.intervals
BS11_haplo_dirved.txt		   Tianyuan-Y-alllib.realigned.bai
core				   Tianyuan_Y_l30_map40.base
cteam				   tree.txt
damage				   Ust_haplo_dirved.txt
Dolni16_haplo_dirved.txt	   Y_haplo.BS11.2.hg
EMH_HGDP			   Y_haplo.BS11.snps
fas				   Y_haplo.Dolni16.2.hg
final_bam			   Y_haplo.Dolni16.snps
getBase_vcf_chrY_proceed.py	   Y_haplo.Kos14.2.hg
getBase_vcf_chrY.py		   Y_haplo.Kos14.snps
haplo				   Y_haplo.LD1.2.hg
HGDP				   Y_haplo.LD1.snps
isogg_version_9.22.csv		   Y_haplo.Losch2.2.hg
isogg_version_9.22.processed	   Y_haplo.Losch2.snps
isogg_version_9.22_v2.order.txt    Y_haplo.Losch.hd
Kos14_haplo_dirved.txt		   Y_haplo.Losch.snps
LD1_haplo_dirved.txt		   Y_haplo.Sidron.2.hg
LD1_Y.l35q30.rd.reference_aln.bam  Y_haplo.Sidron.3.hg
logs				   Y_haplo.Sidron.snps
Losch_haplo_dirved.txt		   Y_haplo.Tianyuan.3.hg
mergelane			   Y_haplo.Tianyuan

There are a bunch of BAM files in Qiaomei's directory and it's not clear if, how or why they were processed, filtered etc. There is no README file to be found either.

What relevant BAM files could be in here?

In [5]:
find ${qiaomeis_dir} -type f -name "*Sidron*.bam" | xargs ls -l

-rw-r--r-- 1 public staff           0 Mar 11  2014 /mnt/454/Carbon_beast_QM/Y_Sidron_TY/fas/Sidron_Y.rd.bam
-rw-r--r-- 1 public staff           0 Mar 12  2014 /mnt/454/Carbon_beast_QM/Y_Sidron_TY/final_bam/hg19_evan/Sidron.hg19_evan.Y.bam
-rw-r--r-- 1 public staff    15794513 Mar  5  2014 /mnt/454/Carbon_beast_QM/Y_Sidron_TY/final_bam/hg19_evan/Sidron.hg19_evan.Y.dq.bam
-rw-r--r-- 1 public staff    15628473 Mar  5  2014 /mnt/454/Carbon_beast_QM/Y_Sidron_TY/mergelane/Sidron.hg19_evan.Y.bam
-rw-r--r-- 1 public staff 17048090508 Mar  5  2014 /mnt/454/Carbon_beast_QM/Y_Sidron_TY/mergelane/Sidron_Y.bam
-rw-r--r-- 1 public staff         127 Mar  6  2014 /mnt/454/Carbon_beast_QM/Y_Sidron_TY/mergelane/Sidron_Y.rd.bam
-rw-r--r-- 1 public staff         127 Mar 10  2014 /mnt/454/Carbon_beast_QM/Y_Sidron_TY/mergelane/Sidron_Y.rd.l35q30.bam
-rw-r--r-- 1 public staff     2390108 Nov 28  2013 /mnt/454/Carbon_beast_QM/Y_Sidron_TY/mutation_Y/Y_Tianyuan_Sidron_hg19_1000g/Sidron.bam
-rw-r--r-- 1 

There are many different files and it is hard to know which is which. For now, let's assume that the biggest file (`/mnt/454/Carbon_beast_QM/Y_Sidron_TY/mergelane/Sidron_Y.bam`) is closest to the original sequence data.

### Which runs does the data come from?

In [6]:
# look only at the first 100000 reads and print the distinct run IDs
samtools view /mnt/454/Carbon_beast_QM/Y_Sidron_TY/mergelane/Sidron_Y.bam \
    | head -n 100000 \
    | awk -F':' '{ n[$1]++ } END { for (k in n) print k }'

SN7001204_0235_BH72E4ADXX_R_PEdi_A3601_A3605
SN7001204_0228_BH06Y0ADXX_R_PEdi_A3207_A3208
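
To see how the reads are split between the runs, and not only which runs are present, the same idea can be extended to a tally. This is a minimal sketch; the helper name `count_reads_per_run` is made up for illustration, and it assumes that the read name (first SAM column) starts with the run ID followed by a colon, as in the output above.

```bash
# Tally reads per sequencing run from SAM records on stdin.
# Assumption: column 1 (read name) starts with "<run ID>:".
count_reads_per_run() {
    cut -f1 | cut -d: -f1 | sort | uniq -c | sort -rn
}
```

It could be used as `samtools view Sidron_Y.bam | head -n 100000 | count_reads_per_run`, which prints one line per run with the read count first.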


### What are the read groups of interest?

Save them to a file too.

In [7]:
read_groups=$tmp_dir/sidron_read_groups.txt



In [8]:
samtools view -H /mnt/454/Carbon_beast_QM/Y_Sidron_TY/mergelane/Sidron_Y.bam \
    | grep '@RG' \
    | awk -F':' '{print $2}' | tee $read_groups

A3134
A3135
A3136
A3137
A3138
A3139
A3140
A3141
A3142
A3143
A3144
A3145
A3146
A3147
A3148
A3149
A3150
A3151
A3152
A3153
A3154
A3155
A3156
A3157
A3158
A3159
A3160
A3161
A3162
A3163
A3164
A3165
A3166
A3167
A3168
A3169
A3170
A3171
A3172
A3173
A3174
A3175
A3176
A3177
A3178
A3179
A3180
A3181
A3182
A3183
A3184
A3185
A3186
A3187
A3188
A3189
A3190
A3191
A3192
A3193
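
Splitting the `@RG` lines on `:` works here because the ID is the first tag on each line, but with `-F':'` the second field also carries whatever follows the ID up to the next colon (for example `A3134<TAB>SM`) if further tags are present. A slightly more defensive sketch splits on tabs and picks the `ID:` tag explicitly. The helper name `rg_ids` is hypothetical.

```bash
# Print the ID of every @RG line in a SAM header read from stdin.
# Header fields are tab-separated TAG:VALUE pairs; only the ID tag is kept.
rg_ids() {
    awk -F'\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^ID:/) print substr($i, 4)
    }'
}
```

With this helper, the cell above would become `samtools view -H Sidron_Y.bam | rg_ids | tee $read_groups`.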


Here's a link to the database entries for these libraries: http://bioaps01:5984/default/_design/lims/_list/web/libs.csv?startkey_plain=A3193&limit=60&include_docs=true&startkey=[%22A%22%2C3193]&descending=true

According to the database, the libraries were prepared from a Neandertal sample: **El Sidron 1253**

>An adult male bone flake, labelled as SD 1253, is one of the best El Sidrón samples in terms of genetic content and low DNA contamination. In a recent massively parallel sequencing approach for this sample, only 0.27 per cent of the mitochondrial sequences obtained were exogenous contaminants (Briggs et al. 2009 - http://www.sciencemag.org/content/325/5938/318.long), the lowest level for any Neanderthal sample studied so far.

(Quote from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828008/)

### Which of the four lanes (from two runs determined above) contain the libraries of interest?

#### First run

In [9]:
ls /mnt/ngs_data/*SN7001204_0228_BH06Y0ADXX_R_PEdi_A3207_A3208/{Ibis,Bustard}/BWA/*.bam

ls: cannot access /mnt/ngs_data/*SN7001204_0228_BH06Y0ADXX_R_PEdi_A3207_A3208/Bustard/BWA/*.bam: No such file or directory
/mnt/ngs_data/130917_SN7001204_0228_BH06Y0ADXX_R_PEdi_A3207_A3208/Ibis/BWA/s_1-hg19_evan.bam
/mnt/ngs_data/130917_SN7001204_0228_BH06Y0ADXX_R_PEdi_A3207_A3208/Ibis/BWA/s_2-hg19_evan.bam


Seems the Bustard data are not there...

#### Second run

In [10]:
ls /mnt/ngs_data/*SN7001204_0235_BH72E4ADXX_R_PEdi_A3601_A3605/{Ibis,Bustard}/BWA/*.bam

ls: cannot access /mnt/ngs_data/*SN7001204_0235_BH72E4ADXX_R_PEdi_A3601_A3605/Ibis/BWA/*.bam: No such file or directory
/mnt/ngs_data/131129_SN7001204_0235_BH72E4ADXX_R_PEdi_A3601_A3605/Bustard/BWA/s_1_sequence_ancient_hg19_evan.newrg.bam
/mnt/ngs_data/131129_SN7001204_0235_BH72E4ADXX_R_PEdi_A3601_A3605/Bustard/BWA/s_2_sequence_ancient_hg19_evan.bam


And here, for a change, Ibis data are missing...

After manually checking the headers, only two of the four BAM files from the four lanes above contain the read groups A3134-A3193 determined above:
* `/mnt/ngs_data/130917_SN7001204_0228_BH06Y0ADXX_R_PEdi_A3207_A3208/Ibis/BWA/s_2-hg19_evan.bam`
* `/mnt/ngs_data/131129_SN7001204_0235_BH72E4ADXX_R_PEdi_A3601_A3605/Bustard/BWA/s_2_sequence_ancient_hg19_evan.bam`
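
The manual header check could also be scripted. The sketch below is one possible way, not what was actually run: it counts how many of the wanted read group IDs occur in the `@RG` lines of a SAM header. The function name `count_matching_rgs` is an assumption made for illustration; it expects a file with one read group ID per line, such as `$read_groups`.

```bash
# Count how many read group IDs listed in file $1 (one per line)
# appear as ID: tags on the @RG lines of a SAM header read from stdin.
count_matching_rgs() {
    local wanted=$1
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    { grep '^@RG' | tr '\t' '\n' | sed -n 's/^ID://p' \
        | grep -c -F -x -f "$wanted"; } || true
}
```

Running `samtools view -H lane.bam | count_matching_rgs "$read_groups"` for each of the four lane files should then print 60 for the lanes that carry all of A3134-A3193 and 0 for the others. The `-x` flag makes the match exact, so a hypothetical ID such as `A31340` would not be counted as `A3134`.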

# Extract the read groups of interest and merge the data

### Extract reads from the first lane

In [11]:
samtools view -bhR $read_groups /mnt/ngs_data/130917_SN7001204_0228_BH06Y0ADXX_R_PEdi_A3207_A3208/Ibis/BWA/s_2-hg19_evan.bam \
    > $tmp_dir/130917_SN7001204_0228_BH06Y0ADXX_R_PEdi_A3207_A3208___A3134-A3193.bam



### Extract reads from the second lane

In [12]:
samtools view -bhR $read_groups /mnt/ngs_data/131129_SN7001204_0235_BH72E4ADXX_R_PEdi_A3601_A3605/Bustard/BWA/s_2_sequence_ancient_hg19_evan.bam \
    > $tmp_dir/131129_SN7001204_0235_BH72E4ADXX_R_PEdi_A3601_A3605___A3134-A3193.bam



### Merge data from both lanes

In [13]:
samtools merge $tmp_dir/sidron_merged_lanes.bam $tmp_dir/130917_SN7001204_0228_BH06Y0ADXX_R_PEdi_A3207_A3208___A3134-A3193.bam $tmp_dir/131129_SN7001204_0235_BH72E4ADXX_R_PEdi_A3601_A3605___A3134-A3193.bam



In [14]:
# what is the total number of reads in the merged BAM file?
samtools view $tmp_dir/sidron_merged_lanes.bam | wc -l

232663077


# Remove duplicates and perform filtering

## length >= 35 and MQ >= 37

In [15]:
bam-rmdup -l 35 -q 37 -r -o $tmp_dir/sidron_rmdup_len35mapq37_unsorted.bam $tmp_dir/sidron_merged_lanes.bam

#RG	in	out	in@MQ20	single@MQ20	unseen	total	%unique	%exhausted
--	1,641,444	243,473	243,473	86,817	48k	290k	14.8	83.5


In [16]:
samtools sort $tmp_dir/sidron_rmdup_len35mapq37_unsorted.bam -o $tmp_dir/sidron_rmdup_len35mapq37_sorted.bam



In [17]:
samtools index $tmp_dir/sidron_rmdup_len35mapq37_sorted.bam



In [18]:
samtools view $tmp_dir/sidron_rmdup_len35mapq37_sorted.bam | wc -l

244170


## length >= 35 and MQ >= 30

In [10]:
bam-rmdup -l 35 -q 30 -r -o $tmp_dir/sidron_rmdup_len35mapq30_unsorted.bam $tmp_dir/sidron_merged_lanes.bam

#RG	in	out	in@MQ20	single@MQ20	unseen	total	%unique	%exhausted
--	1,641,446	243,476	243,476	86,820	48k	290k	14.8	83.5


In [11]:
samtools sort $tmp_dir/sidron_rmdup_len35mapq30_unsorted.bam -o $tmp_dir/sidron_rmdup_len35mapq30_sorted.bam



In [12]:
samtools index $tmp_dir/sidron_rmdup_len35mapq30_sorted.bam



In [13]:
samtools view $tmp_dir/sidron_rmdup_len35mapq30_sorted.bam | wc -l

244175


## length >= 35 and MQ >= 25

In [16]:
bam-rmdup -l 35 -q 25 -r -o $tmp_dir/sidron_rmdup_len35mapq25_unsorted.bam $tmp_dir/sidron_merged_lanes.bam

#RG	in	out	in@MQ20	single@MQ20	unseen	total	%unique	%exhausted
--	1,656,862	248,271	248,271	89,935	50k	290k	15.0	83.1


In [17]:
samtools sort $tmp_dir/sidron_rmdup_len35mapq25_unsorted.bam -o $tmp_dir/sidron_rmdup_len35mapq25_sorted.bam



In [18]:
samtools index $tmp_dir/sidron_rmdup_len35mapq25_sorted.bam



In [19]:
samtools view $tmp_dir/sidron_rmdup_len35mapq25_sorted.bam | wc -l

248986


## length >= 35 and MQ >= 20

In [29]:
bam-rmdup -l 35 -q 20 -r -o $tmp_dir/sidron_rmdup_len35mapq20_unsorted.bam $tmp_dir/sidron_merged_lanes.bam

#RG	in	out	in@MQ20	single@MQ20	unseen	total	%unique	%exhausted
--	1,776,218	275,887	275,887	103,519	60k	330k	15.5	82.1


In [30]:
samtools sort $tmp_dir/sidron_rmdup_len35mapq20_unsorted.bam -o $tmp_dir/sidron_rmdup_len35mapq20_sorted.bam



In [31]:
samtools index $tmp_dir/sidron_rmdup_len35mapq20_sorted.bam



In [32]:
samtools view $tmp_dir/sidron_rmdup_len35mapq20_sorted.bam | wc -l

276610
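
The four filtering sections above repeat the same three commands with a different MQ cutoff each time. They could be folded into one loop, roughly as sketched here. The `run` wrapper, the `DRY_RUN` variable and the function name `filter_all_cutoffs` are assumptions added for illustration; the `bam-rmdup` and `samtools` invocations are copied from the cells above.

```bash
# Print the command instead of executing it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Deduplicate, sort and index the merged BAM for each MQ cutoff.
# $1: directory for output files, $2: merged input BAM
filter_all_cutoffs() {
    local out_dir=$1 merged=$2 q prefix
    for q in 37 30 25 20; do
        prefix=$out_dir/sidron_rmdup_len35mapq${q}
        run bam-rmdup -l 35 -q "$q" -r -o "${prefix}_unsorted.bam" "$merged" || return 1
        run samtools sort "${prefix}_unsorted.bam" -o "${prefix}_sorted.bam" || return 1
        run samtools index "${prefix}_sorted.bam" || return 1
    done
}
```

Setting `DRY_RUN=1` before calling `filter_all_cutoffs $tmp_dir $tmp_dir/sidron_merged_lanes.bam` prints the twelve commands without touching any files, which is a cheap way to check the file names first.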


# Subset only to reads mapped to the Y chromosome

## MQ >= 37

In [20]:
samtools view $tmp_dir/sidron_rmdup_len35mapq37_sorted.bam Y -o $tmp_dir/sidron_rmdup_len35mapq37_sorted_chrY.bam



In [21]:
samtools index $tmp_dir/sidron_rmdup_len35mapq37_sorted_chrY.bam



In [26]:
samtools view $tmp_dir/sidron_rmdup_len35mapq37_sorted_chrY.bam | wc -l

113993


## MQ >= 30

In [22]:
samtools view $tmp_dir/sidron_rmdup_len35mapq30_sorted.bam Y -o $tmp_dir/sidron_rmdup_len35mapq30_sorted_chrY.bam



In [23]:
samtools index $tmp_dir/sidron_rmdup_len35mapq30_sorted_chrY.bam



In [27]:
samtools view $tmp_dir/sidron_rmdup_len35mapq30_sorted_chrY.bam | wc -l

113993


## MQ >= 25

In [24]:
samtools view $tmp_dir/sidron_rmdup_len35mapq25_sorted.bam Y -o $tmp_dir/sidron_rmdup_len35mapq25_sorted_chrY.bam



In [25]:
samtools index $tmp_dir/sidron_rmdup_len35mapq25_sorted_chrY.bam



In [28]:
samtools view $tmp_dir/sidron_rmdup_len35mapq25_sorted_chrY.bam | wc -l

114278


## MQ >= 20

In [33]:
samtools view $tmp_dir/sidron_rmdup_len35mapq20_sorted.bam Y -o $tmp_dir/sidron_rmdup_len35mapq20_sorted_chrY.bam



In [34]:
samtools index $tmp_dir/sidron_rmdup_len35mapq20_sorted_chrY.bam



In [35]:
samtools view $tmp_dir/sidron_rmdup_len35mapq20_sorted_chrY.bam | wc -l

117799
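
To compare the cutoffs at a glance, the read counts reported above can be put side by side together with the share of filtered reads that map to the Y chromosome. This is only a small bookkeeping sketch: the numbers are copied by hand from the cell outputs, and the helper name `summarize_counts` is made up.

```bash
# Input lines: <MQ cutoff> <reads after filtering> <reads on chrY>
summarize_counts() {
    awk 'BEGIN { printf "%-6s %10s %10s %8s\n", "MQ", "filtered", "chrY", "chrY_%" }
         { printf "%-6s %10d %10d %8.1f\n", $1, $2, $3, 100 * $3 / $2 }'
}

# Counts taken from the "wc -l" outputs in the cells above.
summarize_counts <<'EOF'
37 244170 113993
30 244175 113993
25 248986 114278
20 276610 117799
EOF
```

Relaxing the cutoff from 37 to 20 adds about 32,000 reads genome-wide but fewer than 4,000 on the Y chromosome.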


<br><br><br><br><br>
# Extract reads falling within Lippold et al. target regions

In [19]:
lippold_targets_bed=$input_dir/lippold_regions.bed



## Length >= 35, MQ >= 37

Filtering for MAPQ >= 30 and for MAPQ >= 37 gave the same number of reads on the Y chromosome (113993) and nearly the same number genome-wide (244175 vs. 244170). To keep the analysis consistent with Susanna's Denisova 8 data (which are filtered for MAPQ >= 37), I will keep only the MAPQ >= 37 version of the El Sidron data as well.

In [20]:
bedtools intersect -a $tmp_dir/sidron_rmdup_len35mapq37_sorted.bam -b $lippold_targets_bed \
    > $bam_dir/sidron_rmdup_len35mapq37_sorted_lippold.bam



In [21]:
samtools index $bam_dir/sidron_rmdup_len35mapq37_sorted_lippold.bam



In [22]:
samtools view $bam_dir/sidron_rmdup_len35mapq37_sorted_lippold.bam | wc -l

89488


# Rename the final on-target BAM file

In [23]:
rename 's/sidron_rmdup_len35mapq37_sorted_lippold/lippold_sidron/' $bam_dir/sidron_rmdup_len35mapq37_sorted_lippold.bam*
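
The `rename` call above uses the Perl-expression syntax, which is not the variant installed on every system (the util-linux `rename` takes different arguments). A portable alternative is a plain `mv` loop, sketched below. The function name `rename_prefix` is hypothetical.

```bash
# Rename every file in directory $1 whose name starts with prefix $2
# so that it starts with $3 instead; suffixes (.bam, .bam.bai) are kept.
rename_prefix() {
    local dir=$1 old=$2 new=$3 f base
    for f in "$dir"/"$old"*; do
        [ -e "$f" ] || continue          # nothing matched the glob
        base=${f##*/}                    # strip the directory part
        mv -- "$f" "$dir/${new}${base#"$old"}"
    done
}
```

The equivalent of the cell above would be `rename_prefix $bam_dir sidron_rmdup_len35mapq37_sorted_lippold lippold_sidron`.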



<br><br><br><br><br><br><br><br><br><br>
<hr>
# Notes

* Would it be possible to include El Sidron Y chromosome data from the exome capture? This could increase the coverage a little bit.

* Is filtering for read groups A3134-A3193 enough, or is a stricter method necessary?

```
usage:
./splitBAM.pl [options] in.bam

options:
-byreadgroup        write one BAM file for each read group identifier present in BAM file [default]
-byfile             read a list of index combinations (#LibID P7 P5) and determine readgroup again, 
                    ignores readgroup assignment in BAM file, requires perfect index matching 
-statsonly [int]    only print statistics for first [int] sequences
-z0                 maximum Z0 value (~unknownness of index) [inf]
-minscore           phred score cutoff for index base quality [0]
-maxnumber          allowed number of index base quality scores below cutoff [inf]
-p7only             split based on p7 index only
-p5only             split based on p5 index only
-perfect            filter single index data for perfect index matches
-mapped             output mapped sequences only
-paired             output overlap-merged sequences only
-minlength          minimum sequence length

output:
-out.bam            BAM files
-screen             statistics for all index combinations identified
```

* For genotyping: lower the quality of T/A terminal nucleotides? From Sergi's exome paper:

> Fragments may carry residual cytosine deamination in the first positions of the 5’ end
and the last positions in the 3’ end in spite of the UDG treatment (Figure S2A). These
bases are read as thymine and adenosine, respectively. Because the deamination
process does not affect the qualities of these bases, deamination can potentially
influence downstream analyses. To mitigate this problem, we lower the base quality
score in the phred scale (14) to 2 for any ‘T’ nucleotide occurring within the first five
bases or ‘A’ nucleotide within the last five positions in a sequence.

Note that this would work only for USER-treated data. Find out which library prep method was used for these libraries.
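
The quality-lowering step described in the quote could look roughly like the following sketch, which works on SAM text. It is an illustration of the idea and not the implementation used in the exome paper. The function name `mask_terminal_damage` is made up, base qualities are assumed to be Phred+33 encoded (so a quality of 2 is the character `#`), and the rule is applied literally to the sequence as stored in the SAM record. Reverse-strand alignments are stored reverse-complemented, so a strand-aware version would have to consult the FLAG field as well.

```bash
# Lower the base quality to Phred 2 ('#' in Phred+33) for every T within
# the first n bases and every A within the last n bases of each read.
# Reads SAM on stdin, writes SAM on stdout; header lines pass through.
mask_terminal_damage() {
    awk -v n=5 'BEGIN { FS = OFS = "\t" }
    /^@/ { print; next }
    {
        seq = $10; qual = $11; len = length(seq)
        for (i = 1; i <= len; i++) {
            base = substr(seq, i, 1)
            if ((i <= n && base == "T") || (i > len - n && base == "A"))
                qual = substr(qual, 1, i - 1) "#" substr(qual, i + 1)
        }
        $11 = qual
        print
    }'
}
```

It could be placed in a pipeline such as `samtools view -h in.bam | mask_terminal_damage | samtools view -b -o out.bam -`, with `in.bam` and `out.bam` as placeholder names.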

## More samples?

Mez 2

Feldhofer 1

Denisova 4 and 8

2 other El Sidron samples

**Chagyrskaya ulna** is SP3394
* Sequencing runs with shotgun data:
* 141128 -> library pool L5128
* 150212 -> library pool R1928 (this is Petra's data used for the washing paper)
* 150330 -> library pool L5300 (lane 2)
 
**Hohlenstein-Stadel** is SP3363
* Sequencing runs with shotgun data:
* 150330 -> library pool L5300 (lane 2)
* 150623 -> library pool L5406
* 152810 -> library pool L5512 (lane 2) (this one I have not processed yet)
 
There are Excel sheets with the specific library numbers and index combinations per sample for each of these sequencing runs in the public->AncientDNA->sequencing_runs folder.

The BAM files are in `/home/viviane_slon/PhD/`, under either `Chagyrskaya` or `Hohlenstein_Stadel`, in a "split" sub-folder for each sequencing run.
  

### Susanna's samples

Saturation plots for the two libraries: '/r1/people/susanna_sawyer/Neandertal/Denisova_Molar_paper/figures/library saturation figure.tif'

Data is in:
* `/mnt/expressions/susanna/Denisova_Molar_2_L9133` (+ README file in `read_me_HiSeq_L9133`)
* `/mnt/expressions/susanna/Denisova_Molar_2_L9133` (+ README file in `read_me_HiSeq_L9351`)

### Different options for library prep:

* double stranded library UDG treated - very efficient to remove uracils, very little damage persists
* single stranded library UDG treated - strong deamination patterns persist