# 3 Filetypes

* How to count the number of reads?

```
# Each FASTQ record spans four lines, so divide the line count by four
echo $(( $(wc -l < sample_seq.fq) / 4 ))
```
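A quality line can itself start with `@`, so counting lines that match `^@` may overcount reads. The sketch below shows this on a made-up two-read file (`toy.fq` and its contents are invented for illustration) and assumes that every record spans exactly four lines:

```shell
# Build a tiny FASTQ with two reads; the quality string of read2 starts with "@".
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGT' '+' 'IIII' \
  '@read2' 'TTGA' '+' '@III' > "$tmpdir/toy.fq"

# Pattern-based count: also matches the quality line of read2.
n_grep=$(grep -c '^@' "$tmpdir/toy.fq")

# Line-based count: total number of lines divided by four.
n_reads=$(( $(wc -l < "$tmpdir/toy.fq") / 4 ))

echo "pattern count: $n_grep, line-based count: $n_reads"
rm -r "$tmpdir"
```

On this toy file the pattern matches three lines although only two reads are present.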


* How to do adapter trimming?

```
module load trimmomatic

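# Note: these steps trim by quality only; removing adapters additionally
# requires an ILLUMINACLIP step with an adapter FASTA file
trimmomatic SE -phred33 sample_seq.fq sample_seq_p.fq \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36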
```


* How to filter mapped sequencing reads?

```
# Keep mapped reads (-F 4) with a mapping quality of at least 30 (-q 30)
samtools view -b -q 30 -F 4 -o sample.sorted.filtered.bam sample.sorted.bam
samtools index sample.sorted.filtered.bam

# Mark duplicate reads and remove them from the output
gatk MarkDuplicates --REMOVE_DUPLICATES -I sample.sorted.filtered.bam -O sample.sorted.filtered.rmdup.bam -M metrics.txt
samtools index sample.sorted.filtered.rmdup.bam
```
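For intuition about what `-q 30 -F 4` keeps, the same two conditions can be written out on SAM text. This is only a sketch on made-up alignment lines (read names and values are invented for illustration), using awk rather than samtools:

```shell
tmpdir=$(mktemp -d)
# Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
{
  printf 'r1\t0\tchr21\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'    # mapped, high MAPQ
  printf 'r2\t16\tchr21\t200\t10\t4M\t*\t0\t0\tACGT\tIIII\n'   # mapped, low MAPQ
  printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'            # unmapped (flag bit 4 set)
  printf 'r4\t16\tchr21\t300\t42\t4M\t*\t0\t0\tACGT\tIIII\n'   # mapped, high MAPQ
} > "$tmpdir/toy.sam"

# Keep a read if flag bit 4 is unset (like -F 4) and MAPQ >= 30 (like -q 30).
kept=$(awk -F'\t' '
  {
    unmapped = int($2 / 4) % 2   # 1 if flag bit 4 is set, else 0
    if ($5 >= 30 && !unmapped) out = out (out == "" ? "" : " ") $1
  }
  END { print out }' "$tmpdir/toy.sam")

echo "reads kept: $kept"
rm -r "$tmpdir"
```

Only `r1` and `r4` pass both conditions in this toy input.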


* GATK

Problem encountered: the `bam` file was not indexed. Solved by running `samtools index sample.sorted.bam`.


* How many SNPs are there in the file?

The file is gzip-compressed. Start the pipeline with `zcat sample.vcf.gz | ` to run standard commands on it.

```
zcat sample.vcf.gz | grep -v "^#" | wc -l
```
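The command above counts every variant record, so indels would be included if the file contains any. The following sketch restricts the count to single-base substitutions. It uses a made-up VCF (`toy.vcf.gz` and its records are invented for illustration) and assumes biallelic records:

```shell
tmpdir=$(mktemp -d)
# Toy VCF with two SNPs and one deletion (hypothetical records).
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '21\t100\t.\tA\tG\t50\tPASS\t.\n'
  printf '21\t250\t.\tCT\tC\t50\tPASS\t.\n'
  printf '21\t900\t.\tT\tC\t50\tPASS\t.\n'
} | gzip > "$tmpdir/toy.vcf.gz"

# All variant records; "gzip -dc" decompresses to stdout like zcat.
n_records=$(gzip -dc "$tmpdir/toy.vcf.gz" | grep -vc '^#')

# Only records whose REF and ALT are both a single base.
n_snps=$(gzip -dc "$tmpdir/toy.vcf.gz" |
  awk -F'\t' '!/^#/ && $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ { n++ } END { print n + 0 }')

echo "records: $n_records, SNPs: $n_snps"
rm -r "$tmpdir"
```

If `bcftools` is available, `bcftools view -H -v snps sample.vcf.gz | wc -l` should give a comparable count.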


* What is the position of the first and the last SNP?

```
zcat sample.vcf.gz | grep -v "^#" | head -n 1 | cut -f2
zcat sample.vcf.gz | tail -n 1 | cut -f2
```
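Both positions can also be collected in a single pass over the file. A sketch on a made-up VCF (records invented for illustration), assuming the file is sorted by position and covers a single chromosome:

```shell
tmpdir=$(mktemp -d)
# Toy VCF with three records (hypothetical positions).
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '21\t100\t.\tA\tG\t50\tPASS\t.\n'
  printf '21\t250\t.\tC\tT\t50\tPASS\t.\n'
  printf '21\t900\t.\tT\tC\t50\tPASS\t.\n'
} | gzip > "$tmpdir/toy.vcf.gz"

# Remember the POS of the first data line and keep overwriting the last one.
first_last=$(gzip -dc "$tmpdir/toy.vcf.gz" |
  awk -F'\t' '!/^#/ { if (!seen) { first = $2; seen = 1 } last = $2 } END { print first, last }')

echo "first and last position: $first_last"
rm -r "$tmpdir"
```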


* Here is another vcf file. Can you see some differences to the one you just created?

The main difference is that one file contains information for every position on the chromosome, while the other contains only positions that are not identical to the reference genome (variable positions).




# 4 Data filtering

* Copy them to your directory and index them!

```
cp $main/share/altai.filt.vcf.gz $main/martin/
cp $main/share/denisovan.filt.vcf.gz $main/martin/

bcftools index denisovan.filt.vcf.gz
bcftools index altai.filt.vcf.gz
```


* Merge them together with the `chr21.YRI.CEU.vcf.gz`!

```
bcftools merge -Oz -o merged_file.vcf.gz -g $refgenome chr21.YRI.CEU.vcf.gz altai.filt.vcf.gz denisovan.filt.vcf.gz
```


* Extra task: Remove sites that have missing data in any individual, and count the number of sites for which this is true.

```
bcftools filter -e 'GT=="./."' -Oz -o merged_file_nomiss.vcf.gz merged_file.vcf.gz
bcftools filter -i 'GT=="./."' merged_file.vcf.gz | bcftools view -H | wc -l
```
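To see which sites such a filter selects, here is a sketch on a made-up three-sample VCF that uses only awk (sample names and records are invented for illustration). It treats a genotype as missing when the GT value, assumed to be the first colon-separated subfield, equals `./.`:

```shell
tmpdir=$(mktemp -d)
# Toy VCF: one complete site and two sites with at least one missing genotype.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n'
  printf '21\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t1/1\n'
  printf '21\t250\t.\tC\tT\t50\tPASS\t.\tGT\t./.\t0/1\t0/0\n'
  printf '21\t900\t.\tT\tC\t50\tPASS\t.\tGT\t0/0\t./.\t./.\n'
} > "$tmpdir/toy.vcf"

# Count each site once if any sample column (10 onwards) has a missing GT.
n_missing=$(awk -F'\t' '
  /^#/ { next }
  {
    for (i = 10; i <= NF; i++) {
      split($i, f, ":")
      if (f[1] == "./.") { n++; break }
    }
  }
  END { print n + 0 }' "$tmpdir/toy.vcf")

echo "sites with missing data: $n_missing"
rm -r "$tmpdir"
```

A site with several missing genotypes is still counted only once, which matches counting sites rather than genotypes.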



# 5 Data analysis


* Read in the summary table

```
hets <- read.table("hets.txt", sep="\t", header=TRUE, comment.char="")
```


* Plot a histogram

```
png("hets_histogram.png",600,600)
hist(hets[,6],breaks=20)
dev.off()
```


* Calculate the mean for CEU and YRI separately, compared to the archaics

```
ceu<-unlist(read.table("appladmix/notebooks/1_data/data/CEU.list", sep="\t",header=F))
yri<-unlist(read.table("appladmix/notebooks/1_data/data/YRI.list", sep="\t",header=F))
mean(hets[which(hets[,3]%in%ceu),6])
mean(hets[which(hets[,3]%in%yri),6])
mean(hets[which(hets[,3]%in%c("AltaiNea","DenisovaPinky")),6])
```


* Perform a statistical test of whether CEU and YRI differ

```
wilcox.test(hets[which(hets[,3]%in%ceu),6],hets[which(hets[,3]%in%yri),6])
```


* There are ~30 million bases for which we can have data on chr21. Heterozygosity is often expressed as hets per 1,000 bp. Calculate this and present the values in that form.

```
genom=30000000

mean(hets[which(hets[,3]%in%ceu),6])/genom*1000
mean(hets[which(hets[,3]%in%yri),6])/genom*1000
hets[which(hets[,3]%in%c("AltaiNea","DenisovaPinky")),6]/genom*1000
```
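As a worked example of the conversion, take a purely hypothetical count of 24,000 heterozygous sites (an invented number, not a value from `hets.txt`) and the ~30 million bases stated above:

$$
\frac{24{,}000}{30{,}000{,}000} \times 1000 = 0.8 \ \text{heterozygous sites per 1,000 bp}
$$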

