# Processing of VCF and BAM data for the Upper-Paleolithic paper

**Positions used for the analysis**: subset of sites from the archaic admixture array at which:

* the Altai Neandertal is homozygous
* the majority of Yoruba individuals from the 1000 Genomes Project carry a high-frequency allele that differs from the Altai allele

Note that we ignore the low-coverage Neandertal samples because they introduce a large amount of noise.

In [1]:
position_file="/mnt/scratch/mateja/Early_modern_humans/nuclear_captures/Archaic_admixture_final/final_test3.bed"
wc -l $position_file

477011 /mnt/scratch/mateja/Early_modern_humans/nuclear_captures/Archaic_admixture_final/final_test3.bed
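
As an optional sanity check, the total above can be broken down by chromosome. This is a minimal sketch that assumes the BED file is whitespace-separated with a plain numeric chromosome name in the first column; `count_sites_per_chrom` is a hypothetical helper name and is not part of the original pipeline.

```shell
# Hypothetical helper: count the sites per chromosome in a BED file.
# Assumes the chromosome is in column 1 and is a plain number (1..22).
count_sites_per_chrom() {
    awk '{ n[$1]++ } END { for (c in n) print c "\t" n[c] }' "$1" | sort -k1,1n
}

# Possible usage in this notebook:
#   count_sites_per_chrom "$position_file"
```

The per-chromosome counts should add up to the `wc -l` total.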


In [2]:
cd /mnt/scratch/mateja/Early_modern_humans/nuclear_captures/Archaic_admixture_final/
mkdir Altai_homozygous_and_YRI_at_high_frequency
cd Altai_homozygous_and_YRI_at_high_frequency



## Calling random alleles from high-coverage individuals

In [3]:
mkdir highcov



The code below runs in parallel for each chromosome (i.e. 22 processes at once), so it is best run on a cluster.

In [4]:
random_allele_calling_script="/mnt/scratch/mp/EMH/sample_from_vcf.py"

# Process VCF files of all chromosomes from a given individual and perform
# a random allele sampling for each of them, merging the results at the end.
function process_highcov_individual {
    vcf_id=$1
    vcf_dir=$2
    bed_file=$3
    output_file=$4

    # run one process per chromosome (22 in parallel), redirecting output to tmp files
    echo {1..22} | xargs -n1 | \
        xargs -P22 -I{} sh -c "python3 $random_allele_calling_script --vcf ${vcf_dir}/${vcf_id}.hg19_1000g.{}.mod.vcf.gz --bed $bed_file --chr {} > ${output_file}_{}.tmp"

    # merge all temporary files into one final output file
    sort -n -k1,1 -k2,2 "${output_file}"_{1..22}.tmp > "$output_file"

    # remove the temporary files
    rm ${output_file}_{1..22}.tmp
}
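
The script `sample_from_vcf.py` is not shown in this notebook. To illustrate what random allele calling means, the sketch below draws one of the two alleles of a diploid genotype with equal probability. It assumes a simplified tab-separated input (chrom, pos, ref, alt, genotype such as `0/1`) and biallelic sites, which is not necessarily what the real script reads. The function name `sample_random_allele` and its optional seed argument are hypothetical.

```shell
# Hypothetical sketch of random allele calling (NOT the actual sample_from_vcf.py).
# Input (stdin): chrom, pos, ref, alt, GT (e.g. 0/0, 0/1, 1|1), tab-separated.
# Output: chrom, pos, and one allele drawn at random from the genotype.
# Assumption: sites are biallelic, so any non-zero allele index maps to alt.
sample_random_allele() {
    awk -F'\t' -v seed="${1:-0}" '
        BEGIN { OFS = "\t"; if (seed) srand(seed); else srand() }
        {
            n = split($5, gt, "[/|]")          # split the genotype on "/" or "|"
            if (n != 2 || gt[1] == "." || gt[2] == ".") next   # skip missing calls
            pick = gt[int(rand() * 2) + 1]     # first or second allele, 50:50
            print $1, $2, (pick == 0 ? $3 : $4)
        }'
}
```

Homozygous genotypes always return the same allele. Only heterozygous sites depend on the random draw, so passing a fixed seed makes a run repeatable.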



### Which individuals are available in the B-team?

In [5]:
cat /mnt/454/HighCovNeandertalGenome/1_Extended_VCF/Individuals.txt

Name1	Name2	Group	Sex
HGDP01029	HGDP01029	San	M
DNK02	DNK02	Dinka	M
HGDP00521	HGDP00521	French	M
HGDP00665	HGDP00665	Sardinian	M
HGDP00542	HGDP00542	Papuan	M
HGDP00778	HGDP00778	Han	M
HGDP01307	HGDP01307	Dai	M
HGDP00998	HGDP00998	Karitiana	M
HGDP00927	HGDP00927	Yoruba	M
HGDP01284	HGDP01284	Mandenka	M
HGDP0456	HGDP00456	Mbuti	M
SS6004467	HGDP01308	Dai	M
SS6004468	HGDP00533	French	M
SS6004469	HGDP00775	Han	M
SS6004470	HGDP01286	Mandenka	M
SS6004471	HGDP00982	Mbuti M
SS6004472	HGDP00546	Papuan	M
SS6004473	HGDP01036	San	M
SS6004474	HGDP01076	Sardinian	M
SS6004475	HGDP00936	Yoruba	M
SS6004476	HGDP01015	Karitiana M
SS6004477	WON,M	Australian	M
SS6004478	BUR,E	Australian	F
SS6004479	MIXE0007	Mixe	U
SS6004480	DNK07	Dinka	M
DenisovaPinky	Denisova	Denisova	F
AltaiNea	Neandertal	Neandertal	?
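
The `Name1` IDs used in the calls that follow can be looked up from this table by population. This is a minimal sketch that assumes the column layout shown above (`Name1`, `Name2`, `Group`, `Sex`); `ids_for_group` is a hypothetical helper name.

```shell
# Hypothetical helper: list the Name1 IDs (column 1) for one population (column 3).
# awk's default whitespace splitting is used so that rows separated by
# spaces instead of tabs are still parsed. The header line is skipped.
ids_for_group() {
    awk -v group="$2" 'NR > 1 && $3 == group { print $1 }' "$1"
}

# Possible usage in this notebook:
#   ids_for_group /mnt/454/HighCovNeandertalGenome/1_Extended_VCF/Individuals.txt French
```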


### Perform the random allele calling

In [6]:
bteam_vcf_dir="/mnt/454/HighCovNeandertalGenome/1_Extended_VCF"



In [7]:
process_highcov_individual SS6004468 $bteam_vcf_dir/SS6004468 $position_file highcov/French.bed



In [8]:
process_highcov_individual SS6004469 $bteam_vcf_dir/SS6004469 $position_file highcov/Han.bed



In [9]:
process_highcov_individual SS6004471 $bteam_vcf_dir/SS6004471 $position_file highcov/Mbuti.bed



In [10]:
process_highcov_individual SS6004472 $bteam_vcf_dir/SS6004472 $position_file highcov/Papuan.bed



In [11]:
process_highcov_individual SS6004473 $bteam_vcf_dir/SS6004473 $position_file highcov/San.bed



In [12]:
process_highcov_individual SS6004474 $bteam_vcf_dir/SS6004474 $position_file highcov/Sardinian.bed



In [13]:
process_highcov_individual SS6004475 $bteam_vcf_dir/SS6004475 $position_file highcov/Yoruba.bed



In [14]:
process_highcov_individual SS6004476 $bteam_vcf_dir/SS6004476 $position_file highcov/Karitiana.bed



In [15]:
process_highcov_individual SS6004477 $bteam_vcf_dir/SS6004477 $position_file highcov/Australian.bed



In [16]:
process_highcov_individual SS6004480 $bteam_vcf_dir/SS6004480 $position_file highcov/Dinka.bed
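
The B-team calls above all follow the same pattern (ID, `$bteam_vcf_dir/ID`, sites file, `highcov/Population.bed`), so they could also be driven by a single loop. This is a minimal sketch, not how the notebook was actually run. `run_for_individuals` and the `ID:label` argument format are hypothetical.

```shell
# Hypothetical helper: call a per-individual command for every "ID:label" pair.
#   $1 = command to call (e.g. process_highcov_individual)
#   $2 = root directory holding one VCF subdirectory per ID
#   $3 = BED file with the sites to sample
#   $4 = output directory
#   remaining arguments = ID:label pairs
run_for_individuals() {
    cmd=$1; vcf_root=$2; bed=$3; out_dir=$4
    shift 4
    for pair in "$@"; do
        id=${pair%%:*}       # text before the first colon
        label=${pair#*:}     # text after the first colon
        "$cmd" "$id" "$vcf_root/$id" "$bed" "$out_dir/$label.bed" || return 1
    done
}

# Possible usage in this notebook:
#   run_for_individuals process_highcov_individual "$bteam_vcf_dir" "$position_file" highcov \
#       SS6004468:French SS6004469:Han SS6004471:Mbuti SS6004472:Papuan SS6004473:San \
#       SS6004474:Sardinian SS6004475:Yoruba SS6004476:Karitiana SS6004477:Australian \
#       SS6004480:Dinka
```

Each individual already uses 22 parallel processes, so the loop runs the individuals one after another.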



In [18]:
process_highcov_individual Ust_Ishim /mnt/454/Ust_Ishim/1_Extended_VCF $position_file highcov/Ust-Ishim.bed



## Calling majority alleles from mpileup of reads of Upper-Paleolithic samples

In [21]:
mkdir emh



In [25]:
emh_bam_dir="/mnt/scratch/mateja/Early_modern_humans/nuclear_captures/Archaic_admixture_final/Individuals_L35_MQ37_deam_only/On_target_L35MQ37_deam_only"



In [27]:
majority_allele_calling_script="/mnt/scratch/mp/EMH/sample_from_mpileup.py"
reference="/mnt/solexa/Genomes/hg19_evan/whole_genome.fa"
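
The script `sample_from_mpileup.py` is not shown in this notebook. To illustrate the idea of majority allele calling, the sketch below picks the most frequent base among the reads covering a site and breaks ties at random. It assumes a simplified tab-separated input (chrom, pos, a string of already decoded bases such as `AAGA`) rather than real mpileup output, and it ignores the strand checks performed by the real script. The function name `call_majority_allele` and its optional seed argument are hypothetical.

```shell
# Hypothetical sketch of majority allele calling (NOT the actual sample_from_mpileup.py).
# Input (stdin): chrom, pos, string of observed bases (e.g. "AAGA"), tab-separated.
# Output: chrom, pos, and the most frequent base. Ties are broken at random.
call_majority_allele() {
    awk -F'\t' -v seed="${1:-0}" '
        BEGIN { OFS = "\t"; if (seed) srand(seed); else srand() }
        {
            split("", count)                        # reset the per-site counts
            bases = toupper($3)
            for (i = 1; i <= length(bases); i++) {
                b = substr(bases, i, 1)
                if (index("ACGT", b) > 0) count[b]++    # ignore N and other symbols
            }
            best = 0; ntop = 0
            for (b in count) {
                if (count[b] > best) { best = count[b]; ntop = 1; top[1] = b }
                else if (count[b] == best) { top[++ntop] = b }
            }
            if (ntop == 0) next                     # no usable bases at this site
            print $1, $2, top[int(rand() * ntop) + 1]
        }'
}
```

Sites without any usable base produce no output line, so the result can contain fewer sites than the input.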



**Run the cell below only on a number cruncher, since it runs all the sampling jobs in parallel!**

In [28]:
python3 $majority_allele_calling_script --bam $emh_bam_dir/Afontova_gora_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check non-USER_all \
                                        --ref $reference > emh/AfontovaGora.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Dolni_Vestonice_13_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/DolniVestonice13.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Dolni_Vestonice_15_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/DolniVestonice15.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Dolni_Vestonice_16_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/DolniVestonice16.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Dolni_Vestonice_43_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/DolniVestonice43.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/El_Miron_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/ElMiron.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Grotta_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/Continenza.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Kostenki12_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/Kostenki12.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Kostenki14_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/Kostenki14.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Muierii_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/Muierii.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Oase1_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/Oase1.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Ostuni_1_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/Ostuni1.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Ostuni_2_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/Ostuni2.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Pavlov_1_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check USER \
                                        --ref $reference > emh/Pavlov1.bed &


python3 $majority_allele_calling_script --bam $emh_bam_dir/Villabruna_L35MQ37_sorted.deam3_or_5.ontarget.L35MQ37.bam \
                                        --bed $position_file --strand_check non-USER_all \
                                        --ref $reference > emh/Villabruna.bed &

[15] 9993
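
The `[15] 9993` line is the shell's job number and process ID for the last background job. All 15 jobs run in the background, so the output files are complete only after every job has finished. A minimal sketch for checking the results after calling `wait`, with `check_outputs` as a hypothetical helper name:

```shell
# Hypothetical helper: print the number of lines in each output file and
# flag files that are missing or empty. Returns non-zero if any file fails.
check_outputs() {
    status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            printf '%s\t%s\n' "$f" "$(wc -l < "$f" | tr -d ' ')"
        else
            printf '%s\tMISSING_OR_EMPTY\n' "$f"
            status=1
        fi
    done
    return $status
}

# Possible usage in this notebook:
#   wait                      # block until all background jobs have exited
#   check_outputs emh/*.bed
```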
