I have downloaded the ancestral states for Papio as FASTA files. In this notebook I check that they look correct, i.e. that the ancestral state at each called site matches either the ref or the alt allele in the called files.

In [1]:
%run ../scripts/notebook_settings.py

In [2]:
meta_data_samples = pd.read_table("../data/metadata_with_x_missing.txt", sep=" ")

Checking chrom 20, as it is the smallest and therefore quickest

In [3]:
zarr_dir = "/faststorage/project/baboondiversity/data/PG_panu3_zarr_12_03_2021/callset.zarr/chr20"
#Opening the zarr data
callset = zarr.open_group(zarr_dir, mode="r")
gt = allel.GenotypeArray(callset["calldata/GT"])

In [4]:
callset.tree()

Tree(nodes=(Node(disabled=True, name='/', nodes=(Node(disabled=True, name='calldata', nodes=(Node(disabled=Tru…

In [5]:
ancestor_fa = "/faststorage/project/baboondiversity/data/ancestral_state_panu3_23_04_2021/papio_anubis_ancestor_20.fa"

In [6]:
from Bio import SeqIO

fasta_sequences = SeqIO.parse(open(ancestor_fa),'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    print(name, len(sequence))

ANCESTOR_for_chromosome:Panu_3.0:20:1:72050622:1 72050622


In [7]:
print(callset["variants/POS"][-1], len(sequence)) # ancestor file slightly longer than last segregating site

72048358 72050622


High-confidence calls are upper case, while low-confidence calls are lower case. For this first test I upper-case everything, so confidence is ignored.
Note that POS is 1-based, while indexing into the FASTA string is 0-based, hence the i-1 in the lookup.

In [8]:
ancestor_list = []
pos_list = []
for i in callset["variants/POS"][:]:
    ancestor_list.append(sequence[i-1].upper())
    pos_list.append(i)
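
The per-site loop above can also be written as a single vectorized lookup. This is only a minimal sketch on made-up data: `ancestral_alleles`, `toy_sequence` and `toy_positions` are hypothetical names, and it assumes the sequence is plain ASCII.

```python
import numpy as np

def ancestral_alleles(sequence, positions):
    """Return the upper-cased ancestral base at each 1-based position."""
    # View the sequence as an array of single bytes so it can be fancy-indexed.
    seq = np.frombuffer(sequence.upper().encode("ascii"), dtype="S1")
    # POS is 1-based while the array is 0-based, hence the -1.
    idx = np.asarray(positions, dtype=np.int64) - 1
    return seq[idx].astype("U1")

toy_sequence = "acGt-.n"      # made-up ancestral sequence (assumption)
toy_positions = [1, 3, 5, 7]  # made-up 1-based variant positions (assumption)
print(ancestral_alleles(toy_sequence, toy_positions))
```

On the real data the same call would presumably take `sequence` and `callset["variants/POS"][:]`, and returns a NumPy array rather than a list.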

In [9]:
(ancestor_list.count(".")+ancestor_list.count("-")+ancestor_list.count("N"))/len(ancestor_list)

0.06526563448422809

In [10]:
ancestor_list.count("N")/len(ancestor_list)

0.003889395211757354

In [11]:
ancestor_list.count("-")/len(ancestor_list)

0.031613287901549346

Roughly 6.5 % of the identified sites on chr20 are not typeable: about 0.4 % because no support could be found for a specific state (N), 3.2 % because they are indels (-), and the remaining ~3.0 % because they are not covered in the EPO alignment (.).
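
The three fractions can be obtained in one pass instead of three separate `count` calls. A minimal sketch, assuming the ancestral calls are held in a plain list as above; `untypeable_breakdown` and `toy_alleles` are hypothetical names and the data is made up.

```python
from collections import Counter

def untypeable_breakdown(alleles):
    """Fraction of sites per untypeable symbol: '.', '-' and 'N'."""
    counts = Counter(alleles)  # one pass over the list
    n = len(alleles)
    return {symbol: counts[symbol] / n for symbol in (".", "-", "N")}

# Made-up ancestral calls for ten sites (assumption, not real data).
toy_alleles = ["A", "C", ".", "-", "N", "G", "T", "A", "C", "G"]
breakdown = untypeable_breakdown(toy_alleles)
print(breakdown, sum(breakdown.values()))
```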

In [12]:
ref_fit = (ancestor_list == callset["variants/REF"][:]).sum()

In [13]:
alt_fit = (ancestor_list == callset["/variants/ALT"][:, 0]).sum()

In [14]:
(ref_fit+alt_fit)/len(ancestor_list), ref_fit/len(ancestor_list)

(0.9289051071365263, 0.8221807614419518)

This comparison implicitly filters out ref/alt alleles that aren't SNPs, as the ancestral state is compared as a single nucleotide.
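
To illustrate why the comparison drops non-SNP sites: a multi-base allele string can never equal a single character, so such sites match neither ref nor alt. The arrays below are made-up toy data, not values from the callset.

```python
import numpy as np

# Made-up example sites (assumption): one ref match, one alt match,
# one indel-like site and one site with an untypeable ancestor.
ancestral = np.array(["A", "C", "G", "N"])
ref = np.array(["A", "T", "GA", "T"])
alt = np.array(["G", "C", "GAA", "C"])

ref_match = ancestral == ref  # element-wise string comparison
alt_match = ancestral == alt
polarizable = ref_match | alt_match

print(ref_match, alt_match, polarizable.mean())
```

Only the first two toy sites are polarizable; the multi-base site and the `N` site fall out without any explicit filter.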

In [15]:
ac = gt.count_alleles()

In [16]:
(ac.is_biallelic()).sum()/len(ac)

0.954285550579864

In [17]:
veclen = np.vectorize(len)
(veclen(callset["/variants/ALT"][:][:, 0]) != 1).sum()

0

In [18]:
veclen = np.vectorize(len)
(veclen(callset["/variants/REF"][:]) != 1).sum()

0

In [19]:
zarr_path = '/faststorage/project/baboondiversity/data/PG_panu3_zarr_12_03_2021/callset.zarr/chr20'
callset   = zarr.open_group(zarr_path, mode='r')

In [20]:
from Bio import SeqIO
mask_fasta = "/home/eriks/primatediversity/people/erik/data/panu3_callability_mask/Panu_3.0_callability_mask_chr20.fa"
fasta_sequences = SeqIO.parse(open(mask_fasta),'fasta')
for fasta in fasta_sequences:
    name, mask_sequence = fasta.id, str(fasta.seq)
pos_array = callset["variants/POS"][:]
bool_list = [0]*len(pos_array)
for i in range(len(bool_list)):
    pos = pos_array[i]
    if mask_sequence[pos-1] != "N":
        bool_list[i] = 1

In [21]:
bool_array = np.array(bool_list, dtype=bool)

In [22]:
sum(bool_array)

2578705
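
The mask lookup in the loop above can be sketched as a vectorized operation as well. This assumes, as the loop does, that `N` marks non-callable positions and that anything else is callable; `callable_sites` and the toy inputs are hypothetical.

```python
import numpy as np

def callable_sites(mask_sequence, positions):
    """Boolean array: True where the mask is not 'N' at a 1-based position."""
    # View the mask as single bytes so it can be indexed with an array.
    mask = np.frombuffer(mask_sequence.encode("ascii"), dtype="S1")
    idx = np.asarray(positions, dtype=np.int64) - 1  # 1-based POS to 0-based
    return mask[idx] != b"N"

toy_mask = "NNACGNT"          # made-up callability mask (assumption)
toy_positions = [1, 3, 6, 7]  # made-up 1-based variant positions (assumption)
print(callable_sites(toy_mask, toy_positions))
```

The result is already a boolean NumPy array, so the separate conversion from a list of 0/1 values would not be needed.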

In [23]:
len(bool_array[:100])

100

Working on polarizing the map

In [24]:
ancestor_fa = "/faststorage/project/baboondiversity/data/ancestral_state_panu3_23_04_2021/papio_anubis_ancestor_20.fa"
from Bio import SeqIO

fasta_sequences = SeqIO.parse(open(ancestor_fa),'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    print(name, len(sequence))

ANCESTOR_for_chromosome:Panu_3.0:20:1:72050622:1 72050622


In [25]:
##B.4
def polarize_map(callset, ancestral_sequence):
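    '''
    Input:
        - callset            : Zarr group holding the variant arrays for one chromosome
        - ancestral_sequence : ancestral sequence for that chromosome, as a string
    Output:
        - mapping   : np array (n_variants x 2) used to polarize alleles correctly according to the ancestral allele (AA_ensmbl); rows are [1, 0] where the ancestor equals ALT, else [0, 1]
    '''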
    ancestor_list = []
    for i in callset["variants/POS"][:]:
        ancestor_list.append(ancestral_sequence[i-1].upper())
    mapping    = np.array([[0, 1]]*(len(ancestor_list)))
    ens_eq_alt = (ancestor_list == callset["variants/ALT"][:, 0])
    mapping[ens_eq_alt, 0] = 1
    mapping[ens_eq_alt, 1] = 0
    return mapping

mapping = polarize_map(callset, sequence)

In [37]:
(mapping[:,1] == [0]).sum()/len(mapping)

0.10672434569457451
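
As a rough illustration of what such a mapping does once applied, the sketch below recodes a toy genotype array so that 0 always means the ancestral allele. It is plain NumPy on made-up data and assumes biallelic calls (allele indices 0/1, with negative values for missing); `apply_polarization` is a hypothetical helper. On the real data, scikit-allel's `GenotypeArray.map_alleles` would presumably be the natural route.

```python
import numpy as np

def apply_polarization(gt, mapping):
    """Recode allele indices per variant: new = mapping[variant, old]."""
    gt = np.asarray(gt)
    mapping = np.asarray(mapping)
    rows = np.arange(gt.shape[0])[:, None, None]  # variant index, broadcastable
    safe = np.where(gt < 0, 0, gt)                # placeholder index for missing calls
    polarized = mapping[rows, safe]
    return np.where(gt < 0, gt, polarized)        # restore missing calls

# Two variants, two diploid samples (made-up data, assumption).
toy_gt = np.array([[[0, 1], [1, 1]],
                   [[0, 0], [-1, -1]]])
# First variant: ancestor equals REF; second variant: ancestor equals ALT.
toy_mapping = np.array([[0, 1],
                        [1, 0]])
print(apply_polarization(toy_gt, toy_mapping))
```

In the toy data the second variant is flipped, so its `0/0` genotype becomes `1/1` (homozygous derived), while the missing call stays missing.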