# Q3: Multi-line FASTA to Single-line FASTA

## Problem Statement

The FASTA file format is shown below. Finding a sequence of interest is often awkward because of the fixed-width layout: in a file wrapped at 60 characters, every line except the header (the line starting with `>`) holds at most 60 characters, so a 600-base sequence is spread over 10 lines. The goal is to convert this layout into one where each header occupies a single line and is followed by its entire sequence on a single line.

## Example Input

\>this is a long sequence  
GTTCTACTTGCGGACGGATCGTAACCGAACTGGCCCGGATCTTTCATCCTCATGTAGAT  
GCACAAAAGGTTCATCTAATAGTACTACCTCTTCTACTCGC  
\>this is okay  
GGTTCATCTAATAGTACTACCTCTTCTACTCGC

## Example Output

\>this is a long sequence  
GTTCTACTTGCGGACGGATCGTAACCGAACTGGCCCGGATCTTTCATCCTCATGTAGATGCACAAAAGGTTCATCTAATAGTACTACCTCTTCTACTCGC  
\>this is okay  
GGTTCATCTAATAGTACTACCTCTTCTACTCGC

### Key Requirements
1. **Memory Efficiency**: Handle large sequences (e.g., human chromosomes with 220M+ bases) without storing entire sequences in memory.
2. **Correct Formatting**:
   - Headers (`>...`) remain unchanged.
   - Sequences are concatenated into a single line.
3. **Verification**: Compare input/output files to validate correctness.

### Approach
1. **Stream Processing**:
   - Read lines sequentially (no in-memory storage of entire sequences).
   - Write headers immediately; concatenate sequence lines on-the-fly.
2. **State Tracking**:
   - Use a flag (`in_sequence`) to detect transitions between headers and sequences.
3. **Verification**:
   - Print the first 5 lines of both input and output files for visual comparison.

### Code Explanation
1. **Conversion Logic**:
   - **Headers**: Written directly to the output file.
   - **Sequences**: Stripped of newlines and written immediately to avoid memory buildup.
   - **Edge Cases**:
     - Empty lines in input are ignored.
     - Final newline added if the file ends with a sequence.

2. **Verification Logic**:
   - Prints the first 5 lines of both files to confirm:
     - Headers are preserved.
     - Sequences are collapsed into single lines.


In [1]:
with open('multiline_input.fasta', 'r') as f_input, open('singleLine_output.fasta', 'w') as f_output:
    in_sequence = False  # True once sequence data has been written for the current record
    for line in f_input:
        if line.startswith('>'):
            if in_sequence:
                f_output.write('\n')  # Terminate the previous record's sequence line
            f_output.write(line)
            in_sequence = False  # Nothing written yet for the new record
        else:
            stripped = line.strip()
            if stripped:  # Skip blank lines
                f_output.write(stripped)
                in_sequence = True
    # Add a final newline if the file ends with a sequence
    if in_sequence:
        f_output.write('\n')
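
The edge cases listed under Code Explanation can be checked on in-memory text instead of real files. The sketch below wraps the same streaming logic in a function; `unwrap_fasta` and the sample records are hypothetical names and data introduced for illustration, not part of the original notebook. It assumes the flag is reset after each header, so that a record with no sequence does not produce a blank line.

```python
import io

def unwrap_fasta(src, dst):
    """Copy FASTA records from src to dst with each sequence on one line.

    src and dst are text file-like objects (hypothetical helper, for illustration).
    """
    seq_written = False  # True once the current record has sequence output
    for line in src:
        if line.startswith('>'):
            if seq_written:
                dst.write('\n')  # close the previous sequence line
            dst.write(line.rstrip('\r\n') + '\n')  # header always ends with one newline
            seq_written = False
        else:
            chunk = line.strip()
            if chunk:  # blank lines are skipped
                dst.write(chunk)
                seq_written = True
    if seq_written:
        dst.write('\n')  # final newline when the file ends with a sequence

# Edge cases: a blank line inside a record, a record with no sequence,
# and a last line without a trailing newline.
sample = ">a\nACGT\n\nTTGA\n>empty\n>b\nGG\nCC"
out = io.StringIO()
unwrap_fasta(io.StringIO(sample), out)
result = out.getvalue()
print(result)
```

Because the function only needs objects that can be iterated and written to, the same code can be pointed at the files from cell 1 by passing the two handles returned by `open`.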

In [2]:
# Verification: Print first 5 lines of both files for comparison
from itertools import islice

def preview(path, title, n=5):
    """Print a titled preview of the first n lines of a file."""
    print(f"===== {title} =====")
    with open(path, 'r') as f:
        for line in islice(f, n):  # stops early if the file has fewer than n lines
            print(line.strip())

preview('multiline_input.fasta', 'Original File Preview')
print()
preview('singleLine_output.fasta', 'Converted File Preview')

===== Original File Preview =====
>1
TCCAATTGAGCTATACCGTCAGAAGTAGATGAGATCTAACGGCGCGCCCGGTTGTCGCAG
CCACGCTAGGCACAGTCAAAGCAGTGACCCAGCCCTTCAGACGTTACTGTGAAAATCTGT
CCACCGAGAACGCTTTCAAAATATATTAGCGGTTTAAGCTAGGCAAGTCACGGCATGTCC
TTACAAAAGCCTACGAGCCGGGCCACATTATGTCGCCCATAAGATGCCAATACAATCTCA

===== Converted File Preview =====
>1
TCCAATTGAGCTATACCGTCAGAAGTAGATGAGATCTAACGGCGCGCCCGGTTGTCGCAGCCACGCTAGGCACAGTCAAAGCAGTGACCCAGCCCTTCAGACGTTACTGTGAAAATCTGTCCACCGAGAACGCTTTCAAAATATATTAGCGGTTTAAGCTAGGCAAGTCACGGCATGTCCTTACAAAAGCCTACGAGCCGGGCCACATTATGTCGCCCATAAGATGCCAATACAATCTCAAGTAGGATAAGCATGTCTTACCACTAGCGTTCTGTGTTTGTGGGGTTACGCGTTGTGTTACCAACGGTCCCCTATAGCTCAGCAGTAATAAGTTTTATTGTGCTCAAGGCC
>2
GATAGCGACGACGCTAGTTCCTAGGGACCCGGGGGGGATCAGTGGTGACGGTATGCTCTTGCTGGTGCGACCCGTAACGAAGTCACATGGGTAGTGGCGGCACAGGATATCGTAGTGACAGGAACCTGATATTCCGTTGCAGAACACACGCGACAATCTACAAGTTATTTTCGGCTGAGTTACTTACTCACAAGATCCACATGGAAATTTAATATACGACGCTGAGGAGGGTTGCACAACGGCTCTCTTTCCTAAGAACAGAATTGCCTAAGAATTCAGAGCTCAAAGGCTCCATTTCTTGAGGTATGGTCGTCGGTATATTAAC
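
The preview above only covers the first few lines. A fuller check, sketched below under stated assumptions, compares every record of the two files by header, sequence length, and a SHA-256 digest of the bases. The names `record_digests` and `same_records` are hypothetical helpers, not part of the original notebook. The input is read in fixed-size chunks rather than line by line, since a line of the converted file holds an entire sequence. Headers are compared after stripping trailing whitespace, and all whitespace inside sequences is ignored.

```python
import hashlib
import io
from itertools import zip_longest

def record_digests(handle, chunk_size=65536):
    """Yield (header, sequence_length, sha256_hex) for each FASTA record.

    Reads fixed-size chunks, so memory stays bounded even when a whole
    sequence sits on a single line.
    """
    header, length, digest = None, 0, None
    in_header = False   # inside a header line that may span chunks
    line_start = True   # the next character starts a new line
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        pieces = chunk.split('\n')
        for i, piece in enumerate(pieces):
            ends_line = i < len(pieces) - 1  # piece was followed by '\n'
            if line_start and piece.startswith('>'):
                if header is not None:
                    yield header.rstrip(), length, digest.hexdigest()
                header, length, digest = '', 0, hashlib.sha256()
                in_header = True
            if in_header:
                header += piece
            elif header is not None:
                bases = ''.join(piece.split())  # drop all whitespace
                length += len(bases)
                digest.update(bases.encode('utf-8'))
            if ends_line:
                line_start, in_header = True, False
            elif piece:
                line_start = False
    if header is not None:
        yield header.rstrip(), length, digest.hexdigest()

def same_records(handle_a, handle_b, chunk_size=65536):
    """Return True if both handles contain the same records in the same order."""
    pairs = zip_longest(record_digests(handle_a, chunk_size),
                        record_digests(handle_b, chunk_size))
    return all(a == b for a, b in pairs)

# Small in-memory demonstration; the tiny chunk size exercises chunk boundaries.
wrapped = io.StringIO(">this is a long sequence\nGTTCTACTTG\nCGGACGGATC\n>this is okay\nGGTTCATC\n")
single = io.StringIO(">this is a long sequence\nGTTCTACTTGCGGACGGATC\n>this is okay\nGGTTCATC\n")
matches = same_records(wrapped, single, chunk_size=7)
print(matches)
```

For the files used in this notebook, the two handles would come from `open('multiline_input.fasta')` and `open('singleLine_output.fasta')`.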

## Conclusion  
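- **Efficient Conversion**: The input is streamed line by line, so memory use depends on the longest input line rather than on the length of a sequence.
- **Correct Formatting**: Headers are preserved, each sequence is merged onto one line, blank lines are ignored, and the output ends with a newline.
- **Verification**: File previews give a visual check of the first few lines of the input and output.
- **Suited to Large Data**: Because no sequence is held in memory during conversion, the approach scales to genome-sized files.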