(3.1) FASTQ to FASTA converter: Implement a FASTQ to FASTA file format converter in python (the input is a FASTQ file and the output is a FASTA file)

(3.2) I have a sam file called “sequences.sam” and a genomic feature file called “features.bed”. I would like to annotate all the sequences in the sam file with the bed file (overlapping features). Can you write down the commands needed using samtools and bedtools?

(3.3) Can you obtain at least two mRNA isoforms of TP53 from NCBI or other online resources? Generate a bed file containing both, color them differently and visualize them on UCSC genome browser? Take a screenshot of the visualization.



3.1 
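Execute: python3 fastq_fasta_converter.py input_file

- Read the input file one record at a time (a FASTQ record is 4 lines: header, sequence, +, quality).
- For each record: the line starting with @ becomes the FASTA header (@ replaced by >), and the next line (the sequence) is written below it. The + line and the quality line are skipped.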



In [None]:
import sys
import os

#### assign input & output ####
input_fastq_file = sys.argv[1]
output_fasta_file = os.path.splitext(input_fastq_file)[0] + '.fa'

#### read in each line ####

with open(input_fastq_file, 'r') as fq, open(output_fasta_file, 'w') as fa:
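    while True:
        identifier = fq.readline()
        if not identifier:
            break # end of file (readline returns '' only at EOF, so blank lines don't stop the loop)
        identifier = identifier.strip()
        if not identifier.startswith('@'):
            continue # skip blank or unexpected lines that don't start with @
        sequence = fq.readline().strip()
        fq.readline() # skip the + separator line
        fq.readline() # skip the quality scores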

        fa_seq_header = '>' + identifier[1:]
        fa.write(f'{fa_seq_header}\n{sequence}\n')

print(f'Conversion complete: {input_fastq_file} > {output_fasta_file}')
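The same conversion can be sketched as a reusable function that consumes the input strictly four lines at a time. This is a minimal sketch, not a replacement for the script above: the function name `fastq_to_fasta` and the demo reads are made up for illustration, and it assumes unwrapped FASTQ (exactly one sequence line and one quality line per record). Reading in fixed groups of four means a quality line that happens to begin with @ is never mistaken for a header.

```python
import io
from itertools import islice


def fastq_to_fasta(fastq_lines, fasta_out):
    """Write header + sequence of each 4-line FASTQ record to fasta_out.

    Hypothetical helper: assumes unwrapped FASTQ (4 lines per record).
    Returns the number of records written.
    """
    lines = (line.rstrip("\r\n") for line in fastq_lines)
    count = 0
    while True:
        record = list(islice(lines, 4))  # header, sequence, '+', quality
        if not any(record):
            break  # end of input (or only trailing blank lines)
        if len(record) < 4 or not record[0].startswith("@") or not record[2].startswith("+"):
            raise ValueError(f"malformed FASTQ record {count + 1}: {record!r}")
        fasta_out.write(f">{record[0][1:]}\n{record[1]}\n")
        count += 1
    return count


# Made-up demo input; note the first quality line starts with '@'.
demo = "@read1 desc\nACGT\n+\n@III\n@read2\nGGCC\n+\nIIII\n"
out = io.StringIO()
n = fastq_to_fasta(io.StringIO(demo), out)
print(n)
print(out.getvalue(), end="")
```

For a real file, the function could be called with two open file handles in place of the `io.StringIO` objects.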



Next step: download a test FASTQ file [5] to run the converter on.


(3.2) I have a sam file called “sequences.sam” and a genomic feature file called “features.bed”. I would like to annotate all the sequences in the sam file with the bed file (overlapping features). Can you write down the commands needed using samtools and bedtools?

Steps:  
samtools [convert SAM > BAM, sort the BAM, index the BAM]  [1]  
bedtools [intersect the sorted BAM with features.bed and report the overlapping features]  [2]
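The two steps above could be spelled out roughly as follows. This is a sketch under stated assumptions: the intermediate and output file names (`sequences.bam`, `sequences.sorted.bam`, `sequences.annotated.bed`) and the helper names are invented here, and the choice of `-wa -wb -bed` (report each alignment together with the feature it overlaps, as BED text) is one reasonable reading of "annotate". It is written in Python to match the converter; the printed lines are the plain shell commands. `bedtools intersect` writes to stdout, so its output is redirected to a file.

```python
import shlex
import subprocess


def annotation_commands(sam="sequences.sam", bed="features.bed", prefix="sequences"):
    """Return the samtools/bedtools commands as argument lists (hypothetical helper)."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam],      # SAM -> BAM
        ["samtools", "sort", "-o", sorted_bam, bam],     # coordinate-sort
        ["samtools", "index", sorted_bam],               # writes the .bai index
        # -wa/-wb: print the alignment and the overlapping feature; -bed: text output
        ["bedtools", "intersect", "-a", sorted_bam, "-b", bed, "-wa", "-wb", "-bed"],
    ]


def run_annotation(out_path="sequences.annotated.bed", **kwargs):
    """Run the commands; requires samtools and bedtools on PATH."""
    *prep, intersect = annotation_commands(**kwargs)
    for cmd in prep:
        subprocess.run(cmd, check=True)
    with open(out_path, "w") as out:
        subprocess.run(intersect, check=True, stdout=out)


# Print the equivalent shell commands without running anything.
for cmd in annotation_commands():
    print(shlex.join(cmd))
```

Calling `run_annotation()` would execute the pipeline; sorting and indexing are not strictly required by `bedtools intersect`, but they follow the steps listed above.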


(3.3) Can you obtain at least two mRNA isoforms of TP53 from NCBI or other online resources? Generate a bed file containing both, color them differently and visualize them on UCSC genome browser? Take a screenshot of the visualization.

> STEPS:  
> Download the isoform sequences (FASTA) [3] and the GRCh38.p14 chromosome 17 reference [4]  
> Unzip the reference file  
> Use BWA to align each isoform FASTA to the reference  
> Samtools to convert SAM > BAM, sort, index  
> Bedtools to convert BAM > BED, add color, then combine the BED files  
> Visualize on UCSC

In [None]:
#bash#
gunzip Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz  # decompresses in place (.fa.gz becomes .fa), no rename needed
reference="Homo_sapiens.GRCh38.dna.chromosome.17.fa"
bwa index "$reference"
# Run the same steps once per isoform, each with its own color.
# Assumes the isoform sequences are saved as isoform1.fa and isoform2.fa.
for pair in isoform1:255,0,0 isoform2:0,255,0; do
    name="${pair%%:*}"; rgb="${pair#*:}"
    bwa mem "$reference" "$name.fa" > "$name.sam"  # SAM: aligned file
    samtools view -b "$name.sam" > "$name.bam"  # BAM: binary aligned file
    samtools sort -o "${name}_sorted.bam" "$name.bam"
    samtools index "${name}_sorted.bam"  # creates the .bai file
    bedtools bamtobed -i "${name}_sorted.bam" > "$name.bed"
    # Appending the color as a 7th column is wrong: itemRgb must be column 9, and bamtobed only writes 6 columns.
    # So thickStart (col 7) and thickEnd (col 8) are added too, copied from start/end per the BED guidelines.
    awk -v rgb="$rgb" 'BEGIN{OFS="\t"} {print $1, $2, $3, $4, $5, $6, $2, $3, rgb}' "$name.bed" > "${name}_colored.bed"
done
cat isoform1_colored.bed isoform2_colored.bed > combined_isoforms.bed
# Note: afterwards the reference was recompressed to .fa.gz (large file) and its BWA index files were deleted.
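Two details matter when the combined file is loaded as a UCSC custom track: the colors in column 9 are only used if the track line sets itemRgb="On", and UCSC names the chromosome chr17 while the Ensembl FASTA names it 17. The sketch below shows one way to handle both in Python. The function names, the track name, and the coordinates in the demo are placeholders invented for illustration (they are not real TP53 positions).

```python
def bed6_to_bed9(line, rgb, add_chr_prefix=True):
    """Extend a 6-column BED line to 9 columns with itemRgb in column 9.

    Hypothetical helper: thickStart/thickEnd are copied from start/end.
    """
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    if add_chr_prefix and not chrom.startswith("chr"):
        chrom = "chr" + chrom  # Ensembl "17" -> UCSC "chr17"
    return "\t".join([chrom, start, end, name, score, strand, start, end, rgb])


def build_track(isoform_beds, track_name="TP53_isoforms"):
    """Combine (bed_lines, rgb) pairs into one custom-track text."""
    # itemRgb="On" tells the browser to honor the per-feature colors.
    lines = [f'track name="{track_name}" description="TP53 mRNA isoforms" itemRgb="On"']
    for bed_lines, rgb in isoform_beds:
        lines.extend(bed6_to_bed9(l, rgb) for l in bed_lines if l.strip())
    return "\n".join(lines) + "\n"


# Placeholder coordinates, NOT real TP53 positions.
iso1 = ["17\t100\t200\tisoform1\t60\t-\n"]
iso2 = ["17\t150\t260\tisoform2\t60\t-\n"]
track = build_track([(iso1, "255,0,0"), (iso2, "0,255,0")])
print(track, end="")
```

In practice the `bed_lines` would come from reading isoform1.bed and isoform2.bed, and the returned text would be pasted or uploaded on the UCSC custom track page.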

References:
[1] https://www.htslib.org/doc/samtools.html  
[2] https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html#  
[3] https://www.ncbi.nlm.nih.gov/gene/7157  
[4] https://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/dna/  
[5] https://github.com/hartwigmedical/testdata (TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz, renamed to TestX.fastq)
