#### It is helpful to make a small test set from each dataset by taking the first 2000-4000 lines of each fastq file
#### for all 4 file types. Note that >> appends, so running the command more than once adds the same lines again to the bottom of the output file.

$head -n2000 N706_N504_R1.fastq >> ./10000Reads/N706_N504_R1.fastq

Check how many lines the file has (in the terminal):

$wc -l N706_N504_R1.fastq

The file can be compressed with gzip:
 
$gzip N706_N505_R1.fastq
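
The same subset can also be made from Python, which avoids the append problem because the output file is opened in write mode. This is only a minimal sketch that assumes uncompressed fastq files with four lines per read; `take_first_reads` and `count_lines` are hypothetical helper names and are not part of the LAM helper scripts.

```python
from itertools import islice


def take_first_reads(src_path, dst_path, n_reads=500):
    """Copy the first n_reads fastq records (4 lines each) from src to dst.

    The destination is opened with 'w', so rerunning overwrites the file
    instead of appending to it. Returns the number of lines written.
    """
    n_written = 0
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in islice(src, 4 * n_reads):
            dst.write(line)
            n_written += 1
    return n_written


def count_lines(path):
    """Equivalent of `wc -l`: count the lines in a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)
```

For example, `take_first_reads('N706_N504_R1.fastq', './10000Reads/N706_N504_R1.fastq', n_reads=500)` would write the first 2000 lines, and `count_lines` on the output confirms the size.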

In [2]:
from __future__ import print_function
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pylab
import pandas as pd
import numpy as np
import os
import sys
import gzip
import itertools
import operator
import subprocess
import twobitreader
from Bio.Alphabet import IUPAC
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import pysam
import shutil

from LAM_scripts.LAM_helpersDanner import *


#  Pipeline1
# This, for the most part, is the UDITAS pipeline modified to allow for REPLACE targeting. 

Overview:
- Trim off the 5 nt on both reads and check for priming (decide how many nt to trim off of Read 1 or Read 2)
- Priming check still needs work: make it able to check priming on Read 1 or Read 2, add mismatches, handle trimmed or untrimmed input, pull the primer from the data sheet, and pull the primer using coordinates and get the downstream sequence for a given length
- Trim short amplicons whose reads run into the adapter or Illumina primers
- Local align to the plasmid without the AAV seq at all
- Local align to the AAV seq without the HDR arms
- After pulling out the reads that didn't align to the AAV seq or plasmid backbone, analyze the breaks

- Generate a table of expected amplicons (use the HDR sample and just replace the seq between the breaks, if that's possible)
- Align the reads that didn't map to the AAV or plasmid backbone against these amplicons,
  then look at indels and quantification
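
The first trimming step in the overview can be sketched as follows. This is an illustration under stated assumptions, not the code inside `trimming_R1_R2`: `trim_record` is a hypothetical helper that works on one read at a time, while the adapter sequence and the 5 nt offset are the values used later in this notebook.

```python
ADAPTER_SEQ = 'GACTATAGGGCACGCGTGG'  # adapter sequence used in this notebook
SPOT_LEN = 5                         # extra nucleotides added for spot generation


def trim_record(seq, qual, n_trim):
    """Drop the first n_trim bases and their quality scores from one read."""
    if len(seq) != len(qual):
        raise ValueError('sequence and quality strings differ in length')
    return seq[n_trim:], qual[n_trim:]


r1_trim = SPOT_LEN                     # Read 1: only the spot generation bases
r2_trim = SPOT_LEN + len(ADAPTER_SEQ)  # Read 2: spot generation bases plus adapter

# Toy Read 1: five N's followed by the gene-specific primer
read = 'NNNNN' + 'TGTCACAGTGCAGCTCACTCAG'
seq, qual = trim_record(read, 'I' * len(read), r1_trim)
print(r1_trim, r2_trim)
print(seq)
```

Trimming the sequence and quality strings together keeps each fastq record valid for the aligner.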

Test pipeline 2:
- Trim up to the cut site. Run end-to-end alignment across the genome. Look for expected off-target translocations
- Also measure, more generally, the frequency at which integrations happen

Questions:
- Which side is the HDR arm on? Which of the two sides should I be looking at for extra HDR transcripts?
- Can I bring in some of the analysis tools from CRISPResso to understand the indel profiles better?



In [1]:
#Directory

directory = '/home/eric/Data/Spaced_Nicking/LAM_MiSeq_HBB_1/Demulitiplexed_corefacility'
print(directory)

/home/eric/Data/Spaced_Nicking/LAM_MiSeq_HBB_1/Demulitiplexed_corefacility


In [3]:
##########        Assign the file_genome_2bit location.     ############ 
#
#   This is needed for pulling sequence from the reference genome by location
#assembly = amplicon_info['genome']
assembly = 'hg38'
file_genome_2bit = os.path.join('/home/eric/Data/Ref_Genomes', assembly + '.2bit')
print(file_genome_2bit)

###############   BOWTIE2_INDEXES for genome alignments    ################
#
#check in bash: > echo $BOWTIE2_INDEXES

%env BOWTIE2_INDEXES=/home/eric/Data/Ref_Genomes

/home/eric/Data/Ref_Genomes/hg38.2bit
env: BOWTIE2_INDEXES=/home/eric/Data/Ref_Genomes


In [33]:
########## Remove first 5 nts to remove adapter seq and spot generation sequences  #########
#
#    Misha adds 5 nts to help spot generation on Illumina on both Read 1 and Read 2
#    Read 1 has the gene binding primer
#    Read 2 has the adapter used to ligate on the universal reverse seq

                    # decide on the length to trim
adapter_seq = 'GACTATAGGGCACGCGTGG'
adapt_len = len(adapter_seq)
read2trim = 5 + adapt_len
print('read2trim is :', read2trim, ' nucleotides.')


                  # run the trimming
for i in range(56):
        
    amplicon_info = get_csv_data(directory, i)
    
    trimming_R1_R2(directory, amplicon_info, R1trim = 5, R2trim = read2trim)
    
    print('done with sample', i)


read2trim is : 24  nucleotides.
number of reads: 8320
done with sample 0
number of reads: 40781
done with sample 1
number of reads: 21681
done with sample 2
number of reads: 26980
done with sample 3
number of reads: 34345
done with sample 4
number of reads: 23382
done with sample 5
number of reads: 14105
done with sample 6
number of reads: 16620
done with sample 7
number of reads: 13242
done with sample 8
number of reads: 39568
done with sample 9
number of reads: 17759
done with sample 10
number of reads: 47438
done with sample 11
number of reads: 49661
done with sample 12
number of reads: 130634
done with sample 13
number of reads: 118465
done with sample 14
number of reads: 98215
done with sample 15
number of reads: 82900
done with sample 16
number of reads: 86032
done with sample 17
number of reads: 60302
done with sample 18
number of reads: 60363
done with sample 19
number of reads: 75480
done with sample 20
number of reads: 41911
done with sample 21
number of reads: 99176
done wit

In [28]:
#could make an automatic way to generate the primer mismatch and downstream if I want later
start = 5226605
end = 5226627

genome = twobitreader.TwoBitFile(file_genome_2bit)
primer = genome['chr11'][int(start):int(end)]
print(primer)
print(reverse_complement(primer))

TGTCACAGTGCAGCTCACTCAG
CTGAGTGAGCTGCACTGTGACA


### Discard Mispriming Reads

LAM uses an anchored primer and then gets rid of the background gDNA. It then uses a nested primer, so there should be a very clean product.
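
The idea behind the priming filter can be sketched as a mismatch-tolerant prefix check. This is a minimal illustration, not the implementation of `correct_priming2`: `hamming_distance` and `has_good_priming` are hypothetical names, and the sketch assumes the read has already been trimmed so that it starts at the primer.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError('sequences must have the same length')
    return sum(x != y for x, y in zip(a, b))


def has_good_priming(read_seq, expected_seq, max_mismatches=2):
    """True if the read starts with expected_seq within max_mismatches.

    Reads shorter than expected_seq are rejected outright.
    """
    prefix = read_seq[:len(expected_seq)].upper()
    if len(prefix) < len(expected_seq):
        return False
    return hamming_distance(prefix, expected_seq.upper()) <= max_mismatches


# Primer plus downstream sequence for the 3' end, as used in this notebook
expected = 'TGTCACAGTGCAGCTCACTCAGTGTGGC'

print(has_good_priming(expected + 'ACGTACGT', expected))          # exact match
print(has_good_priming('ACA' + expected[3:] + 'ACGT', expected))  # 3 mismatches
```

Checking the downstream bases as well as the primer itself is what separates true priming from mispriming, since a misprimed read still carries the primer sequence but not the expected genomic sequence after it.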

In [35]:
############### GOOD PRIMING Filter ##########
#
#   Here we assume the NNNNN is no longer on Read 1 because the trimming_R1_R2() function was already run
#   The LAM gene-specific primer is on Read 1 in this case. The program only understands checking the primer on Read 1

# make a dataframe to capture all of the priming information and put it in the 'results' folder
results_df_all = pd.DataFrame()

results_folder = os.path.join(directory, 'results')
if not os.path.exists(results_folder):
    os.mkdir(results_folder)
results_file = os.path.join(directory, 'results','all_priming.xlsx')
    
## inputs for correct_priming2() function
mismatches = 2                      # total number of mismatches allowed in the primer plus downstream seq
downstream = 10                     # length of sequence downstream of the primer
trimmed_R1R2 = True                 # if the file has already been trimmed
removePrimerPlusDownstream = False  # remove the primer/downstream seq if it is good (good for guideseq)
exportMismatch = True               # export the file of mismatched sequences

for i in range(56):
        
    amplicon_info = get_csv_data(directory, i)

    #5primer is everything but AT, AT is for checking mispriming. The full sequence is the oligo from guideseq.
    #EVERYTHING IS CAPITAL
    #                       the extra TGTGCC has a mutation T>G in the hdr plasmid
    #                    
    ThreePrimeEnd_seq =       'TGTCACAGTGCAGCTCACTCAGTGTGGC'
    #      Ch11:5226605:5226627                   
    ThrePrimeEnd_primeronly = 'TGTCACAGTGCAGCTCACTCAG'
    
    #3primer 
    FivePrimeEnd_seq =            'CCATCTATTGCTTACATTTGCTTCTGACACAACTGTGTTCAC'
    #      Ch11:5227054:5227084
    FivePrimeEnd_seq_primeronly = 'CCATCTATTGCTTACATTTGCTTCTGACAC'

    direction = amplicon_info['Direction']

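    if direction == 3:
        primer_seq_plus_downstream = ThreePrimeEnd_seq
        primer_seq = ThrePrimeEnd_primeronly
    elif direction == 5:
        primer_seq_plus_downstream = FivePrimeEnd_seq
        primer_seq = FivePrimeEnd_seq_primeronly
    else:
        # fail loudly instead of silently reusing the previous sample's primer
        raise ValueError('Unexpected Direction value: {}'.format(direction))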
    
    df_sample_results = correct_priming2(directory, amplicon_info, primer_seq, primer_seq_plus_downstream, 
                                          mismatches, trimmed_R1R2, removePrimerPlusDownstream,
                                          exportMismatch)
    #add the results to the ongoing dataframe        
    results_df_all = results_df_all.append(df_sample_results, ignore_index=True)
    print('done with sample', i)

#export the final table
results_df_all.to_excel(results_file)    
print(results_df_all)


done with sample 0
done with sample 1
done with sample 2
done with sample 3
done with sample 4
done with sample 5
done with sample 6
done with sample 7
done with sample 8
done with sample 9
done with sample 10
done with sample 11
done with sample 12
done with sample 13
done with sample 14
done with sample 15
done with sample 16
done with sample 17
done with sample 18
done with sample 19
done with sample 20
done with sample 21
done with sample 22
done with sample 23
done with sample 24
done with sample 25
done with sample 26
done with sample 27
done with sample 28
done with sample 29
done with sample 30
done with sample 31
done with sample 32
done with sample 33
done with sample 34
done with sample 35
done with sample 36
done with sample 37
done with sample 38
done with sample 39
done with sample 40
done with sample 41
done with sample 42
done with sample 43
done with sample 44
done with sample 45
done with sample 46
done with sample 47
done with sample 48
done with sample 49
done with 

In [42]:
### TRIMMING ####
#need to trim off the end of the short reads. This is for amplicons that were too short and have the other side on them.

direction5primer = 'ATACCGTTATTAACATATGACAACTCAATTAAAC'
direction3primer = 'TGTCACAGTGCAGCTCACTCAG' 
adapter ='GACTATAGGGCACGCGTGG'

for i in range(56):
        
    amplicon_info = get_csv_data(directory, i)
    
    trim_short_fastq(directory, amplicon_info, direction5primer, direction3primer, adapter)
    
    print('done with sample', i)


done with sample 0
done with sample 1
done with sample 2
done with sample 3
done with sample 4
done with sample 5
done with sample 6
done with sample 7
done with sample 8
done with sample 9
done with sample 10
done with sample 11
done with sample 12
done with sample 13
done with sample 14
done with sample 15
done with sample 16
done with sample 17
done with sample 18
done with sample 19
done with sample 20
done with sample 21
done with sample 22
done with sample 23
done with sample 24
done with sample 25
done with sample 26
done with sample 27
done with sample 28
done with sample 29
done with sample 30
done with sample 31
done with sample 32
done with sample 33
done with sample 34
done with sample 35
done with sample 36
done with sample 37
done with sample 38
done with sample 39
done with sample 40
done with sample 41
done with sample 42
done with sample 43
done with sample 44
done with sample 45
done with sample 46
done with sample 47
done with sample 48
done with sample 49
done with 



## Need to make a new bowtie2 index that includes the targeting vector for alignment against the whole genome, so it is all in one index together.

### The other option would be to align to the amplicon, extract the unaligned reads, and then align those to the genome, but this way seems cleaner.
#### Ideally every sequence is unique between the targeting vector and the genome.


1. Build your fastas of interest and label them as .fa files.
    1. You need a fasta of hg38 or your reference genome. You can get this from a downloaded bowtie2 index by using the following command to turn the index into a fasta file: bowtie2-inspect hg38 > hg38.fa
    2. Put all the fasta files in the same folder. The transfected plasmid should also be included.
2. Index the files with bowtie2.
    1. Use the command bowtie2-build -f pE049,pe038_mc.fa,hg38.fa -p hg38_plus_targetvectorandplasmid
    2. The -p (packed) option makes it use less RAM, which I needed in my case; it also makes indexing slower.
    3. In this case it combines hg38 and the minicircle targeting file into one index.
    4. I have an Intel® Core™ i7-5500U CPU @ 2.40GHz × 4 with 15.1 GiB RAM, and building a hg38 + small fasta index required about 13.8 GB of RAM and 2 hours.
    5. Be sure to pay attention to the name of the new index ("hg38_plus_targetvectorandplasmid" in the example above). Add it to the sample_info.csv sheet under the 'genome_plus_targeting' column.
    6. You can check that it indexed correctly: bowtie2-inspect -s hg38_plus_targetvectorandplasmid
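
The indexing step above can also be driven from Python. The sketch below only assembles the argument list for bowtie2-build; `bowtie2_build_command` is a hypothetical helper and 'targeting_vector.fa' is a placeholder file name, so substitute your own fasta files. It uses the same -f and -p options as the command in the list.

```python
def bowtie2_build_command(fasta_files, index_name, packed=True):
    """Return the argument list for a bowtie2-build call.

    fasta_files : list of fasta paths, joined with commas as bowtie2-build expects
    index_name  : base name of the index files that will be written
    packed      : add -p (packed strings: slower, but lower memory use)
    """
    if not fasta_files:
        raise ValueError('at least one fasta file is required')
    command = ['bowtie2-build', '-f']
    if packed:
        command.append('-p')
    command.append(','.join(fasta_files))
    command.append(index_name)
    return command


# 'targeting_vector.fa' is a placeholder name for illustration
command = bowtie2_build_command(['targeting_vector.fa', 'hg38.fa'],
                                'hg38_plus_targetvectorandplasmid')
print(' '.join(command))
# To actually build the index: subprocess.call(command)
# (requires bowtie2 on the PATH and takes a long time for a whole genome)
```

Keeping the index name in one variable makes it easier to copy the exact same name into sample_info.csv afterwards.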
    

    



### Making the reference alignment sequences

- I will use the plasmid without the AAV sequence to follow backbone integration
- I will use the AAV reference without the homology arms; otherwise the good (HDR) reads will align there
- I will make amplicons with an HDR sequence. The HDR reference sequence will be copied directly out of
     the .csv file and not altered.
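
One way to picture "the AAV reference without the homology arms" is to cut the donor down to the sequence between the two arms. The sketch below is an assumption about how such a reference could be prepared, not a description of what `create_AAV_reference` does; `strip_homology_arms` is a hypothetical helper and the sequences are toy examples.

```python
def strip_homology_arms(donor_seq, left_arm, right_arm):
    """Return the donor sequence found between the two homology arms.

    Raises ValueError if either arm is missing or the arms are out of order.
    """
    donor = donor_seq.upper()
    left = left_arm.upper()
    right = right_arm.upper()
    left_start = donor.find(left)
    if left_start == -1:
        raise ValueError('left homology arm not found')
    insert_start = left_start + len(left)
    insert_end = donor.find(right, insert_start)
    if insert_end == -1:
        raise ValueError('right homology arm not found after the left arm')
    return donor[insert_start:insert_end]


# Toy sequences for illustration only (not the real HBB donor)
toy_donor = 'AAAACCCC' + 'GATTACAGATTACA' + 'GGGGTTTT'
print(strip_homology_arms(toy_donor, 'AAAACCCC', 'GGGGTTTT'))
```

With the arms removed, reads from correctly targeted alleles no longer match this reference, so anything that still aligns to it points to donor sequence outside the intended HDR product.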
    

In [72]:
# Running the reference plasmid. This is actually the AAV plasmid without the AAV HDR sequence
# single sample for testing


amplicon_info = get_csv_data(directory, 0)
print('N7 :', amplicon_info['index_I1'], "   N5 : ", amplicon_info['index_I2'])
print ("sample name: ", amplicon_info['name'], "   sample name: ", amplicon_info['description'])
print('rection type', get_reaction_type(amplicon_info))

create_plasmid_reference(directory, amplicon_info)
create_AAV_reference(directory, amplicon_info)
create_amplicon(directory, amplicon_info, file_genome_2bit)



N7 : N701    N5 :  N501
sample name:  1.1.F.1    sample name:  Ctrl (#1) 5 end LAM-HTGTS
rection type double_cut_same_chromosome_and_HDR


In [74]:
######## GENERATING REFERENCE SEQUENCES ###########
#
# MAKE SURE PLASMID DOESNT HAVE THE AAV PORTION IN IT
# MAKE SURE AAV DOESN'T HAVE HDR ARMS IN IT


for i in range(56):
    
    amplicon_info = get_csv_data(directory, i)
    print('N7 :', amplicon_info['index_I1'], "   N5 : ", amplicon_info['index_I2'])
    print ("sample name: ", amplicon_info['name'], "   sample name: ", amplicon_info['description'])
    get_reaction_type(amplicon_info)

    create_plasmid_reference(directory, amplicon_info)
    
    create_AAV_reference(directory, amplicon_info)
    create_amplicon(directory, amplicon_info, file_genome_2bit)



N7 : N701    N5 :  N501
sample name:  1.1.F.1    sample name:  Ctrl (#1) 5 end LAM-HTGTS
N7 : N701    N5 :  N502
sample name:  1.2.F.1    sample name:  Ctrl (#1) 5 end LAM-HTGTS
N7 : N701    N5 :  N504
sample name:  1.3.F.1    sample name:  Ctrl (#1) 5 end LAM-HTGTS
N7 : N701    N5 :  N505
sample name:  2.1.F.1    sample name:  Ctrl + AAV 5 end LAM-HTGTS
N7 : N701    N5 :  N506
sample name:  2.2.F.1    sample name:  Ctrl + AAV 5 end LAM-HTGTS
N7 : N701    N5 :  N507
sample name:  2.3.F.1    sample name:  Ctrl + AAV 5 end LAM-HTGTS
N7 : N701    N5 :  N508
sample name:  3.1.F.1    sample name:  Cas9WT+ sgHBB1-1+sgHBB2-6+25pmol dsODN (#1) 5 end LAM-HTGTS
N7 : N701    N5 :  N510
sample name:  3.2.F.1    sample name:  Cas9WT+ sgHBB1-1+sgHBB2-6+25pmol dsODN (#1) 5 end LAM-HTGTS
N7 : N702    N5 :  N501
sample name:  3.3.F.1    sample name:  Cas9WT+ sgHBB1-1+sgHBB2-6+25pmol dsODN (#1) 5 end LAM-HTGTS
N7 : N702    N5 :  N502
sample name:  4.1.F.1    sample name:  Cas9D10A+ sgHBB1-1+sgHBB2-6+25p

In [75]:
#try out the alignment to the plasmid backbone with the AAV/HDR virus removed


amplicon_info = get_csv_data(directory, 19) #cas9 cutting sample
print('N7 :', amplicon_info['index_I1'], "   N5 : ", amplicon_info['index_I2'])
print ("sample name: ", amplicon_info['name'], "   sample name: ", amplicon_info['description'])
print('rection type', get_reaction_type(amplicon_info))


#align_plasmid_local(directory, amplicon_info, ncpu=12)
# use a mapQ score of >1
#extract_unmapped_reads_plasmid(directory, amplicon_info)



N7 : N703    N5 :  N505
sample name:  3.2.R.1    sample name:  Cas9WT+ sgHBB1-1+sgHBB2-6+AAV (#1) 3 end LAM-HTGTS
rection type double_cut_same_chromosome_and_HDR


In [None]:
def analyze_alignments_plasmid_and_AAV(dir_sample, amplicon_info, min_MAPQ, file_genome_2bit, do_plasmid):
    N7 = amplicon_info['index_I1']
    N5 = amplicon_info['index_I2']
        
    exp_dir = create_filename(dir_sample, N7, N5, 'mainfolder')

    file_UMI = create_filename(dir_sample, N7, N5, 'umifastqgz')
    UMI_dict = create_barcode_dict(file_UMI)
    
    results_folder = os.path.join(exp_dir, 'results')
    if not os.path.exists(results_folder):
        os.mkdir(results_folder)

    results_file = create_filename(dir_sample, N7, N5, 'results_plasmid')

    if do_plasmid:
        file_sorted_bam_plasmid_local = create_filename(dir_sample, N7, N5, 'sorted_bam_plasmid_local')

        bam_in_alignment_file = pysam.AlignmentFile(file_sorted_bam_plasmid_local, 'rb')
        bam_in = bam_in_alignment_file.fetch()

        genome = twobitreader.TwoBitFile(file_genome_2bit)  # Load genome. Used for getting the sequences
        
        length_to_test = 15  # We check this number of bases after the primer
        uditas_primer_length = amplicon_info['end'] - amplicon_info['start']
        
        if amplicon_info['strand'] == '+':  # This is the UDiTaS oligo strand
            #I had to add int() command to make this work for some reason
            seq_after_uditas_primer = genome[amplicon_info['chr']][int(amplicon_info['end']):int((amplicon_info['end'] + length_to_test))]
            
        elif amplicon_info['strand'] == '-':
            seq_after_uditas_primer = reverse_complement(genome[amplicon_info['chr']][int((amplicon_info['start'] - length_to_test)):(int(amplicon_info['start']))])
        n_max_mismatches = 2  # We allow this number of mismatches between the read and the sequence after the primer

        names_list_plasmid_genome = []
        UMI_list_plasmid_genome = []
        names_list_plasmid_only = []
        UMI_list_plasmid_only = []
        
        for read in bam_in:
            if read.mapping_quality >= min_MAPQ and not read.is_unmapped and not read.is_secondary:
                if read.is_read2:  # R2 is the UDiTaS primer
                    if read.is_reverse:
                        seq_test = reverse_complement(read.query_sequence)[int(uditas_primer_length):int((uditas_primer_length + length_to_test))]
                    else:
                        seq_test = read.query_sequence[int(uditas_primer_length): int(uditas_primer_length + length_to_test)]
                    # Sometimes, after cutadapt we have a read shorter than uditas_primer_length + length_to_test
                    # We skip those directly without calculating hamm_dist, which doesn't make sense
                    if (len(seq_test) == len(seq_after_uditas_primer.upper()) and
                        hamm_dist(seq_test, seq_after_uditas_primer.upper()) <= n_max_mismatches):
                        # Reads for which the R2 has genomic sequence after the UDiTaS primer
                        UMI_list_plasmid_genome.append(UMI_dict[read.query_name][0])
                        names_list_plasmid_genome.append(read.query_name)
                    else: # We put those short reads into the plasmid only bucket
                        UMI_list_plasmid_only.append(UMI_dict[read.query_name][0])
                        names_list_plasmid_only.append(read.query_name)

        total_reads_plasmid_genome = len(set(names_list_plasmid_genome))
        total_reads_collapsed_plasmid_genome = len(set(UMI_list_plasmid_genome))
        total_reads_plasmid_only = len(set(names_list_plasmid_only))
        total_reads_collapsed_plasmid_only = len(set(UMI_list_plasmid_only))

        results_df = pd.DataFrame({'target_plus_plasmid_total_reads': [total_reads_plasmid_genome],
                                   'target_plus_plasmid_total_reads_collapsed': [total_reads_collapsed_plasmid_genome],
                                   'plasmid_only_total_reads': [total_reads_plasmid_only],
                                   'plasmid_only_total_reads_collapsed': [total_reads_collapsed_plasmid_only]
                                   },
                                  columns=['target_plus_plasmid_total_reads',
                                           'target_plus_plasmid_total_reads_collapsed',
                                           'plasmid_only_total_reads',
                                           'plasmid_only_total_reads_collapsed'])
    else:
        results_df = pd.DataFrame(index=np.arange(1),
                                  columns=['target_plus_plasmid_total_reads',
                                           'target_plus_plasmid_total_reads_collapsed',
                                           'plasmid_only_total_reads',
                                           'plasmid_only_total_reads_collapsed'])

    results_df.to_excel(results_file)

    return results_df

In [None]:
# This looks for AAV insertion whether or not the plasmid-aligned reads were removed first
# fastq_source = 'trimmed_fastq' or 'plasmid_align_exracted_fastq'

def align_AAV_local(dir_sample, amplicon_info, ncpu=4, fastq_source='trimmed_fastq'):

    # Sample indices, used to build all of the file names below
    N7 = amplicon_info['index_I1']
    N5 = amplicon_info['index_I2']
    # exp_dir = create_filename(dir_sample, N7, N5, 'mainfolder')
    
    if fastq_source == 'trimmed_fastq':
        file_cutadapt_R1 = create_filename(dir_sample, N7, N5, 'R1trimmed')
        file_cutadapt_R2 = create_filename(dir_sample, N7, N5, 'R2trimmed')

        file_sam_AAValign_all_fastq_local = create_filename(dir_sample, N7, N5, 'sam_AAValign_all_fastq_local')
        file_sam_report_AAValign_all_fastq_local = create_filename(dir_sample, N7, N5, 'sam_report_AAValign_all_fastq_local')

        if not os.path.exists(os.path.dirname(file_sam_AAValign_all_fastq_local)):
            os.mkdir(os.path.dirname(file_sam_AAValign_all_fastq_local))

        file_bam_AAV_allfastq_local = create_filename(dir_sample, N7, N5, 'bam_AAValign_all_fastq_local')
        file_sorted_bam_AAV_allfastq_local = create_filename(dir_sample, N7, N5, 'sorted_bam_AAValign_all_fastq_local')
        # file_sorted_bai_genome_local = create_filename(dir_sample, N7, N5, 'sorted_bai_genome_local')

        if not os.path.exists(os.path.dirname(file_bam_AAV_allfastq_local)):
            os.mkdir(os.path.dirname(file_bam_AAV_allfastq_local))
    
    if fastq_source == 'plasmid_align_exracted_fastq':
        file_cutadapt_R1 = create_filename(dir_sample, N7, N5, 'unmapped_plasmid_R1fastq')
        file_cutadapt_R2 = create_filename(dir_sample, N7, N5, 'unmapped_plasmid_R2fastq')

        file_sam_AAV_plasmidextractFastq_local = create_filename(dir_sample, N7, N5, 'samfile_bam_AAV_plasmidextractFastq_local')
        file_sam_report_AAV_plasmidextractFastq_local = create_filename(dir_sample, N7, N5, 'sam_report_AAV_plasmidextractFastq_local')

        if not os.path.exists(os.path.dirname(file_sam_AAV_plasmidextractFastq_local)):
            os.mkdir(os.path.dirname(file_sam_AAV_plasmidextractFastq_local))

        file_bam_AAV_plasmidextractFastq_local = create_filename(dir_sample, N7, N5, 'bam_AAV_plasmidextractFastq_local')
        file_sorted_bam_AAV_plasmidextractFastq_local = create_filename(dir_sample, N7, N5, 'sorted_bam_AAV_plasmidextractFastq_local')
        # file_sorted_bai_genome_local = create_filename(dir_sample, N7, N5, 'sorted_bai_genome_local')

        if not os.path.exists(os.path.dirname(file_bam_AAV_plasmidextractFastq_local)):
            os.mkdir(os.path.dirname(file_bam_AAV_plasmidextractFastq_local))

    # Select the output paths that match the chosen fastq_source
    if fastq_source == 'trimmed_fastq':
        file_sam = file_sam_AAValign_all_fastq_local
        file_sam_report = file_sam_report_AAValign_all_fastq_local
        file_bam = file_bam_AAV_allfastq_local
        file_sorted_bam = file_sorted_bam_AAV_allfastq_local
    elif fastq_source == 'plasmid_align_exracted_fastq':
        file_sam = file_sam_AAV_plasmidextractFastq_local
        file_sam_report = file_sam_report_AAV_plasmidextractFastq_local
        file_bam = file_bam_AAV_plasmidextractFastq_local
        file_sorted_bam = file_sorted_bam_AAV_plasmidextractFastq_local
    else:
        raise ValueError('unknown fastq_source: ' + str(fastq_source))

    # local alignment to the AAV reference with bowtie2
    initial_dir = os.getcwd()

    folder_amplicons = create_filename(dir_sample, N7, N5, 'amplicons')

    os.chdir(folder_amplicons)

    # NOTE: the index name 'plasmid' is carried over from the plasmid alignment;
    # it should name the bowtie2 index built from the AAV reference.
    bowtie2_command = ['bowtie2', '--local', '-p', str(ncpu),
                       '-X', '5000', '-k', '2', '-x', 'plasmid',
                       '-1', file_cutadapt_R1, '-2', file_cutadapt_R2,
                       '-S', file_sam]

    # bowtie2 writes its alignment summary to stderr; keep it as the report
    with open(file_sam_report, 'wb') as handle_sam_report:
        subprocess.call(bowtie2_command, stderr=handle_sam_report)

    # convert sam to bam
    sam_to_bam_command = ['samtools', 'view', '-Sb', file_sam]
    with open(file_bam, 'wb') as handle_file_bam:
        subprocess.call(sam_to_bam_command, stdout=handle_file_bam)

    # sort bam files
    sort_bam_command = ['samtools', 'sort', file_bam, '-o', file_sorted_bam]
    subprocess.call(sort_bam_command)

    # Create bam index files
    create_bam_index_command = ['samtools', 'index', file_sorted_bam]
    subprocess.call(create_bam_index_command)

    # Clean up the intermediate files
    os.remove(file_sam)
    os.remove(file_bam)

    os.chdir(initial_dir)

In [None]:
######## ALIGNING TO THE PLASMID BACKBONE ###########
#
# Local alignment of every sample to the plasmid reference
# (the plasmid reference should not have the AAV portion in it)


for i in range(56):
    
    amplicon_info = get_csv_data(directory, i)
    print('N7 :', amplicon_info['index_I1'], "   N5 : ", amplicon_info['index_I2'])
    print ("sample name: ", amplicon_info['name'], "   sample name: ", amplicon_info['description'])
    
    align_plasmid_local(directory, amplicon_info, ncpu=12)


In [None]:
#extract the unmapped reads
extract_unmapped_reads_plasmid(directory, amplicon_info)

In [None]:
#run the plasmid analysis to count plasmid integration events
result_plasmid_df = analyze_alignments_plasmid(directory, amplicon_info, min_MAPQ, file_genome_2bit, True)
result_plasmid_df

In [None]:
#align against our suite of amplicons
align_amplicon(directory, amplicon_info, check_plasmid_insertions, ncpu)

In [None]:
#this will extract the unmapped reads into a new folder (reads that did not align to the predicted structural variants)
extract_unmapped_reads_amplicons(directory, amplicon_info)