Author: Dan Shea  
Date: 2019.08.27  
#### Extract 30 kbp region from consensus sequences of the founders and the reference sequence
We discovered a region of chromosome 4 (chr04) that appears to be a recombination hotspot. We therefore extract the sequence in the region `chr04:19,370,000-19,400,000` from the reference and from each founder's consensus sequence, and attempt a multiple sequence alignment using `MAFFT`. Because the region is so large, we will compare the alignments produced by two of the methods available in `MAFFT`. `L-INS-i` is highly accurate and suitable for aligning 10 to 100 protein sequences. However, given that the regions to be aligned are 30 kbp in length, we will also use `FFT-NS-2`, since <blockquote>FFT-NS-2 and other progressive methods can align many and/or long DNA/protein sequences, because of an FFT approximation and a linear-space DP algorithm.</blockquote> [reference](https://mafft.cbrc.jp/alignment/software/about.html)

In [1]:
from Bio import SeqIO

In [2]:
reference_file = 'reference_genome/IRGSP-1.0_genome.fasta'
# Sample names, founder names, and sample_founder as ordered lists
samples = ['N01','N03','N04','N05','N06','N07','N08','N09','N10','N11',
           'N12','N13','N14','N16','N17','N18','N19','N20','N21','N22',]
founders = ['KASALATH','KEIBOBA','SHONI','TUPA_121-3','SURJAMUKHI','RATUL','BADARI_DHAN','KALUHEENATI','JAGUARY','REXMONT',
            'URASAN','TUPA_729','DEE_JIAO_HUA_LUO','NERICA_1','TAKANARI','C8005','MOUKOTOU','NORTAI','SESIA','HAYAYUKI',]
datadirs = ['_'.join([x, y]) for x, y in zip(samples, founders)]

In [3]:
refio = SeqIO.parse(reference_file, format='fasta')

In [4]:
sequences = list()
for seq in refio:
    if seq.id == 'chr04':
        # 0-based, half-open slice: 1-based positions 19,370,000 through 19,399,999 (30,000 bp)
        sequences.append(seq[19369999:19399999])
        break

In [5]:
sequences

[SeqRecord(seq=Seq('TTGAAGCAAGCACGTCCCTACAATATATCTTCCCAATTGTATATTGAAATGTTG...AGT', SingleLetterAlphabet()), id='chr04', name='chr04', description='chr04', dbxrefs=[])]
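The slice bounds used above follow from converting a 1-based start coordinate into Python's 0-based, half-open indexing. A minimal sketch of that arithmetic is shown here; the helper name `region_slice` is hypothetical and is not part of the notebook or of Biopython.

```python
def region_slice(start, length):
    """Return a slice covering `length` bases beginning at 1-based position `start`.

    Hypothetical helper: Python slices are 0-based and exclude the stop index,
    so the start shifts down by one and the stop is start - 1 + length.
    """
    if start < 1 or length < 0:
        raise ValueError('start must be >= 1 and length must be >= 0')
    return slice(start - 1, start - 1 + length)


hotspot = region_slice(19370000, 30000)
print(hotspot.start, hotspot.stop)  # 19369999 19399999
```

A `SeqRecord` accepts a slice object directly, so `seq[region_slice(19370000, 30000)]` would select the same 30,000 bases as the literal indices used in this notebook.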

In [6]:
for key, founder, d in zip(samples, founders, datadirs):
    seqio = SeqIO.parse('beagle_output/{}/{}_consensus.fa'.format(d, key), format='fasta')
    for seq in seqio:
        if seq.id == 'chr04':
            sequences.append(seq[19369999:19399999])
            break

In [7]:
# Fixed a typo in the FASTA headers - 2019.09.02 DJS
for s, t in zip(sequences, ['IRGSP1.0']+founders):
    s.id = '{}:{}:{}:{}'.format(t, s.id, '19370000', '19400000')
    print(s.id)

IRGSP1.0:chr04:19370000:19400000
KASALATH:chr04:19370000:19400000
KEIBOBA:chr04:19370000:19400000
SHONI:chr04:19370000:19400000
TUPA_121-3:chr04:19370000:19400000
SURJAMUKHI:chr04:19370000:19400000
RATUL:chr04:19370000:19400000
BADARI_DHAN:chr04:19370000:19400000
KALUHEENATI:chr04:19370000:19400000
JAGUARY:chr04:19370000:19400000
REXMONT:chr04:19370000:19400000
URASAN:chr04:19370000:19400000
TUPA_729:chr04:19370000:19400000
DEE_JIAO_HUA_LUO:chr04:19370000:19400000
NERICA_1:chr04:19370000:19400000
TAKANARI:chr04:19370000:19400000
C8005:chr04:19370000:19400000
MOUKOTOU:chr04:19370000:19400000
NORTAI:chr04:19370000:19400000
SESIA:chr04:19370000:19400000
HAYAYUKI:chr04:19370000:19400000


In [8]:
SeqIO.write(sequences, 'chr04_hotspot_locus.fasta', format='fasta')

21
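The two `MAFFT` strategies named in the introduction could then be run on `chr04_hotspot_locus.fasta`. The following is a sketch only, not the commands actually used for this analysis: it assumes the `mafft` executable is on the `PATH`, it uses the option sets that the MAFFT manual lists for `L-INS-i` and `FFT-NS-2`, and the function names and output file names are hypothetical.

```python
import shutil
import subprocess

# Option sets given in the MAFFT manual for the two strategies.
MAFFT_OPTIONS = {
    'L-INS-i': ['--localpair', '--maxiterate', '1000'],
    'FFT-NS-2': ['--retree', '2', '--maxiterate', '0'],
}


def mafft_command(method, fasta_in):
    """Build the mafft argument list for one strategy (hypothetical helper)."""
    return ['mafft'] + MAFFT_OPTIONS[method] + [fasta_in]


def run_mafft(method, fasta_in, fasta_out):
    """Run mafft and save the alignment, which mafft writes to standard output."""
    if shutil.which('mafft') is None:
        raise RuntimeError('mafft executable not found on PATH')
    with open(fasta_out, 'w') as handle:
        subprocess.run(mafft_command(method, fasta_in), stdout=handle, check=True)


# Example calls (output names are placeholders):
# run_mafft('L-INS-i', 'chr04_hotspot_locus.fasta', 'chr04_hotspot_locus.linsi.fasta')
# run_mafft('FFT-NS-2', 'chr04_hotspot_locus.fasta', 'chr04_hotspot_locus.fftns2.fasta')
```

Keeping the command construction separate from the subprocess call makes it possible to inspect the exact command line before committing to a long `L-INS-i` run on 30 kbp sequences.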

#### Second extraction
Further analysis of the region shows that `19,393,100..19,393,400` is a good locus to focus on, as this is where the difference between highly recombinant founders and non-recombinant founders is marked by the presence or absence of some SNPs.

In [9]:
refio = SeqIO.parse(reference_file, format='fasta')
sequences = list()
for seq in refio:
    if seq.id == 'chr04':
        # 0-based, half-open slice: 1-based positions 19,393,100 through 19,393,399 (300 bp)
        sequences.append(seq[19393099:19393399])
        break

In [10]:
for key, founder, d in zip(samples, founders, datadirs):
    seqio = SeqIO.parse('beagle_output/{}/{}_consensus.fa'.format(d, key), format='fasta')
    for seq in seqio:
        if seq.id == 'chr04':
            sequences.append(seq[19393099:19393399])
            break

In [11]:
for s, t in zip(sequences, ['IRGSP1.0']+founders):
    s.id = '{}:{}:{}:{}'.format(t, s.id, '19393100', '19393400')
    print(s.id)

IRGSP1.0:chr04:19393100:19393400
KASALATH:chr04:19393100:19393400
KEIBOBA:chr04:19393100:19393400
SHONI:chr04:19393100:19393400
TUPA_121-3:chr04:19393100:19393400
SURJAMUKHI:chr04:19393100:19393400
RATUL:chr04:19393100:19393400
BADARI_DHAN:chr04:19393100:19393400
KALUHEENATI:chr04:19393100:19393400
JAGUARY:chr04:19393100:19393400
REXMONT:chr04:19393100:19393400
URASAN:chr04:19393100:19393400
TUPA_729:chr04:19393100:19393400
DEE_JIAO_HUA_LUO:chr04:19393100:19393400
NERICA_1:chr04:19393100:19393400
TAKANARI:chr04:19393100:19393400
C8005:chr04:19393100:19393400
MOUKOTOU:chr04:19393100:19393400
NORTAI:chr04:19393100:19393400
SESIA:chr04:19393100:19393400
HAYAYUKI:chr04:19393100:19393400


In [12]:
SeqIO.write(sequences, 'chr04_hotspot_locus_300bp.fasta', format='fasta')

21
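One way to inspect the presence or absence of SNPs across founders in this 300 bp locus is to list the columns where a founder differs from the reference. The sketch below is illustrative only: `variable_sites` is a hypothetical helper, the toy sequences are made up, and it assumes all sequences have the same length (for example, slices without indels, or rows taken from an alignment).

```python
def variable_sites(reference, others, offset=1):
    """Map each differing position to the founders and bases that differ there.

    `reference` is a string, `others` maps a name to a string of equal length,
    and `offset` is the 1-based coordinate of the first column.
    """
    sites = {}
    for name, seq in others.items():
        if len(seq) != len(reference):
            raise ValueError('{} differs in length from the reference'.format(name))
        for i, (ref_base, base) in enumerate(zip(reference, seq)):
            if ref_base.upper() != base.upper():
                sites.setdefault(offset + i, {})[name] = base
    return sites


# Toy data, not taken from the founders' consensus sequences.
toy_reference = 'ACGTACGT'
toy_founders = {'founder_A': 'ACGTACGT', 'founder_B': 'ACTTACGA'}
print(variable_sites(toy_reference, toy_founders, offset=19393100))
# {19393102: {'founder_B': 'T'}, 19393107: {'founder_B': 'A'}}
```

With the records written above, the inputs could be built as `str(sequences[0].seq)` for the reference and a dictionary of `str(s.seq)` keyed by founder name for the rest, provided the equal-length assumption holds.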