<a href="https://colab.research.google.com/github/bforsbe/SK2534/blob/main/Phylo_znf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install necessary libraries and tools

### Subtask:
Install Biopython and a suitable MSA tool (ClustalW is used here).


**Reasoning**:
Install Biopython with pip and ClustalW with apt-get.



In [3]:
%pip install biopython
!apt-get update -y
!apt-get install -y clustalw

Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 https://cli.github.com/packages stable InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading packag

## Load the fasta sequences

### Subtask:
Read the sequences from the generated FASTA file using Biopython.


**Reasoning**:
Import the SeqIO module from Biopython and read the sequences from the "output.fasta" file using SeqIO.parse().



In [None]:
from Bio import SeqIO

# Read the sequences from the FASTA file
sequences = list(SeqIO.parse("output.fasta", "fasta"))

# Print the number of sequences read to verify
print(f"Number of sequences read: {len(sequences)}")

Number of sequences read: 3021
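The alignment step that follows rewrites every ID to make it unique, so it can help to see how many IDs actually collide first. A minimal sketch using only the standard library; the `ids` list is made-up illustration data (not the contents of `output.fasta`), and `duplicated_ids` is a hypothetical helper name:

```python
from collections import Counter

def duplicated_ids(ids):
    """Return a dict mapping each ID that occurs more than once to its count."""
    counts = Counter(ids)
    return {seq_id: n for seq_id, n in counts.items() if n > 1}

# Hypothetical IDs for illustration; with Biopython records one could
# pass [rec.id for rec in sequences] instead.
ids = ["A0A1B2RVU6", "2cshA01", "A0A1B2RVU6", "A0A346G481"]
print(duplicated_ids(ids))
```

An empty result would suggest the renaming step is not strictly needed for uniqueness.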


## Perform the multiple sequence alignment

### Subtask:
Use Biopython's interface to the chosen MSA tool to align the sequences.


**Reasoning**:
Use Biopython's ClustalwCommandline to align the extracted FASTA sequences. ClustalW writes Clustal format by default, so the output file `aligned.fasta` holds a Clustal-format alignment despite its extension.



In [None]:
from Bio import SeqIO
from Bio.Align.Applications import ClustalwCommandline

# Read the sequences from the FASTA file
sequences = list(SeqIO.parse("output.fasta", "fasta"))

# Modify sequence IDs to be unique
for i, seq in enumerate(sequences):
    seq.id = f"{seq.id}_{i}"
    seq.description = "" # Clear description to avoid issues with some tools

# Write the modified sequences to a temporary FASTA file
temp_in_file = "output_unique.fasta"
SeqIO.write(sequences, temp_in_file, "fasta")

# Define the output file name
# (ClustalW writes Clustal format by default, despite the .fasta extension)
out_file = "aligned.fasta"

# Create a ClustalwCommandline object with the temporary file
clustalw_cline = ClustalwCommandline("clustalw", infile=temp_in_file, outfile=out_file)

# Run the command line tool
stdout, stderr = clustalw_cline()

# Print confirmation message
print(f"Multiple sequence alignment performed and saved to {out_file}")

Multiple sequence alignment performed and saved to aligned.fasta
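`ClustalwCommandline` builds a command line and runs it. As a rough sketch of a comparable call using only the standard library: the helper name `build_clustalw_command` and the output name `aligned.aln` are hypothetical, and the `-INFILE`, `-OUTFILE` and `-OUTPUT` options are assumed to match the ClustalW 2.x command-line flags, so check `clustalw -help` on your installation before relying on them.

```python
import os
import shutil
import subprocess

def build_clustalw_command(infile, outfile, output_format=None, exe="clustalw"):
    """Build the argument list for a ClustalW alignment run."""
    cmd = [exe, f"-INFILE={infile}", f"-OUTFILE={outfile}"]
    if output_format is not None:
        # e.g. "FASTA"; without this option ClustalW writes Clustal format
        cmd.append(f"-OUTPUT={output_format}")
    return cmd

infile = "output_unique.fasta"
cmd = build_clustalw_command(infile, "aligned.aln")
print(" ".join(cmd))

# Only run when the executable is on PATH and the input file exists
if shutil.which(cmd[0]) is not None and os.path.exists(infile):
    subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Naming the output with an `.aln` extension keeps the file name consistent with its Clustal-format content.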


## Display or save the alignment

### Subtask:
Display or save the resulting MSA to a file in a common format (e.g., Stockholm, Clustal, or FASTA).


In [None]:
# Read and print the content of the aligned FASTA file
alignment_file = "aligned.fasta"
try:
    with open(alignment_file, "r") as f:
        file_content = f.read()
    print("Content of aligned.fasta:")
    print(file_content)
except FileNotFoundError:
    print(f"Error: The file {alignment_file} was not found.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")


**Reasoning**:
ClustalW's default output is Clustal format (a header line followed by blocks of aligned rows), not FASTA, even though the file is named `aligned.fasta`. `AlignIO.read` with the standard "fasta" format therefore fails to parse it. I will try reading it with the "clustal" format, which is designed to handle Clustal output.
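Because the file extension is not a reliable guide here, one option is to look at the first non-blank line of the file before choosing a parser. A minimal sketch, assuming that Clustal files written by ClustalW start with the word `CLUSTAL` and FASTA files start with `>`; `sniff_alignment_format` is a hypothetical helper, not part of Biopython:

```python
def sniff_alignment_format(text):
    """Guess 'clustal' or 'fasta' from the first non-blank line, else None."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith("CLUSTAL"):
            return "clustal"
        if stripped.startswith(">"):
            return "fasta"
        return None  # first real line matches neither convention
    return None  # empty input

# Made-up snippets for illustration
clustal_text = "CLUSTAL 2.1 multiple sequence alignment\n\nseq1  --YKCEI--C\n"
fasta_text = ">seq1\n--YKCEI--C\n"
print(sniff_alignment_format(clustal_text), sniff_alignment_format(fasta_text))
```

The returned string could then be passed as the format argument to `AlignIO.read`.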



In [None]:
from Bio import AlignIO

# Read the aligned sequences from the FASTA file using the 'clustal' format
alignment_file = "aligned.fasta"
try:
    alignment = AlignIO.read(alignment_file, "clustal")

    # Print the alignment to the console
    print("Multiple Sequence Alignment (Clustal format):")
    print(alignment)

    # Optionally, write the alignment to a new file in FASTA format
    output_fasta_file = "aligned_output.fasta"
    AlignIO.write(alignment, output_fasta_file, "fasta")
    print(f"\nAlignment also saved to {output_fasta_file} in FASTA format.")

except FileNotFoundError:
    print(f"Error: The file {alignment_file} was not found.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")

Multiple Sequence Alignment (Clustal format):
Alignment with 3021 rows and 40 columns
--YKCEI--CQKMFN---HKS--YLTVHAKIHQN------ A0A1B2RVU6_0
--YKCDI--CSAVFT---TDS--ELKVHSKIHR------- AF-A0A0B7BJ65-F1-model_v4_1715
--YKCDV--CDKSYK---VKS--SLNEHLIIH-------- AF-A0A7R9HB19-F1-model_v4_1317
--YKCDV--CDNSFK---MKS--SLYTHLLIHTN------ AF-A0A7R8VXV2-F1-model_v4_1602
--YKCDI--CQKSFV---NKS--NLNNHYLIHT------- AF-A0A4Y2NZM6-F1-model_v4_1736
--YKCLI--CDKSFG---LKS--NLNYHMKIH-------- AF-A0A087U0Z0-F1-model_v4_1757
--YHCSV--CEKCYK---TKG--GLKRHHTIVK------- AF-A0A397V9P1-F1-model_v4_2061
--YKCEI--CKKEFK---TKH--GLKGH------------ MGYP003998274573_2901
---KCLD--CYKKFK---TKH--RAEKHFLIHKT------ A0A1B2RVS9_55
--YDCNV--CHKGFK---RKF--DLTKHYSIHNT------ A0A346G481_9
--YGCGV--CGKKFK---MKH--HLVGHMKIHT------- 3.30.160.60_2711
--YGCGV--CGKKFK---MKH--HLVGHMKIH-------- 2cshA01_2759
--YKCDV--CDKAFK---HKH--HLTEHKRLHS------- AF-A0A077YWL4-F1-model_v4_310
--YKCDI--CEKAFK---HKH--HLTEHKRLHS------- AF-G5EBU4-F1-model_v4_1030
--FKC

# Task
Modify the notebook to load the FASTA file from "https://github.com/bforsbe/SK2534/blob/main/aligned_output.fasta", add a query sequence, perform multiple sequence alignment using ClustalW, and calculate the maximum sequence identity of the query sequence against all other sequences.

## Load fasta from url

### Subtask:
Download the FASTA file from the provided URL and load the sequences into a list.


**Reasoning**:
Download the FASTA file from its raw-content URL with requests, cache it locally, and load the sequences into a list with Biopython's SeqIO.



In [23]:
import os
import requests
from Bio import SeqIO

# Define the raw-content URL of the FASTA file and the local filename
url = "https://raw.githubusercontent.com/bforsbe/SK2534/main/aligned_output.fasta"
local_fasta_file = "template_aligned.fasta"

# Download the file only if a local copy does not already exist
if os.path.exists(local_fasta_file):
    print(f"Using existing template file: {local_fasta_file}")
else:
    print(f"Downloading template file from: {url}")
    response = requests.get(url)
    response.raise_for_status() # Raise an exception for bad status codes
    with open(local_fasta_file, "w") as f:
        f.write(response.text)
    print(f"Template file downloaded and saved as: {local_fasta_file}")

# Read the sequences from the local FASTA file using SeqIO.parse
sequences = list(SeqIO.parse(local_fasta_file, "fasta"))

# Print the number of sequences loaded to verify
print(f"Number of sequences loaded from template: {len(sequences)}")

Downloading template file from: https://raw.githubusercontent.com/bforsbe/SK2534/main/aligned_output.fasta
Template file downloaded and saved as: template_aligned.fasta
Number of sequences loaded from template: 3021
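The task statement gives a `github.com/.../blob/...` page URL, while the download uses the `raw.githubusercontent.com` form, which serves the file content rather than an HTML page. A small sketch of that conversion, assuming the usual `owner/repo/blob/branch/path` layout and a branch name without slashes; `github_blob_to_raw` is a hypothetical helper:

```python
from urllib.parse import urlparse

def github_blob_to_raw(url):
    """Convert a github.com '/blob/' file URL to its raw.githubusercontent.com form."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected layout: owner / repo / "blob" / branch / path...
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        return url  # leave anything else untouched
    owner, repo, _, branch, *path = parts
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/" + "/".join(path)

page_url = "https://github.com/bforsbe/SK2534/blob/main/aligned_output.fasta"
print(github_blob_to_raw(page_url))
```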


**Reasoning**:
Biopython is already installed, but the previous code block failed to import it. I will retry the download and load, this time parsing the downloaded text in memory through `io.StringIO` instead of saving a local copy.



In [5]:
import requests
import io
from Bio import SeqIO

# Define the URL of the FASTA file
url = "https://raw.githubusercontent.com/bforsbe/SK2534/main/aligned_output.fasta"

# Download the content of the file
response = requests.get(url)
response.raise_for_status() # Raise an exception for bad status codes

# Use io.StringIO to wrap the downloaded content
fasta_content = io.StringIO(response.text)

# Read the sequences from the wrapped content using SeqIO.parse
sequences = list(SeqIO.parse(fasta_content, "fasta"))

# Print the number of sequences loaded to verify
print(f"Number of sequences loaded: {len(sequences)}")

Number of sequences loaded: 3021


## Add query sequence

### Subtask:
Define a query sequence and add it to the list of sequences.


**Reasoning**:
Define the query sequence, create a SeqRecord for it, and add it to the existing list of sequences.



In [43]:
#@title Set query sequence

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Define the query sequence string
query_sequence_string = "YKCTFGNSVFSTVKGLKRCLKVHSN " #@param{type:"string"}

# Create a SeqRecord object for the query sequence with a fixed ID
# (strip stray whitespace, such as the trailing space in the form field above)
query_sequence = SeqRecord(Seq(query_sequence_string.strip()), id="Query_Sequence", description="")

# Create a new list of sequences including the original sequences from the template and the query
sequences_with_query = sequences + [query_sequence]

# Print the updated number of sequences to verify
print(f"Updated number of sequences (including query): {len(sequences_with_query)}")

Updated number of sequences (including query): 3022
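The query string typed into the form field ends with a space, and stray characters like that can end up in the sequence passed to the aligner. A minimal sketch of a cleaning step, assuming only the 20 standard amino-acid letters are wanted; `clean_query` and `VALID_RESIDUES` are hypothetical names:

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def clean_query(raw):
    """Strip whitespace, upper-case, and reject non-standard residue letters."""
    cleaned = "".join(raw.split()).upper()
    invalid = sorted(set(cleaned) - VALID_RESIDUES)
    if invalid:
        raise ValueError(f"Unexpected characters in query: {invalid}")
    return cleaned

print(clean_query("YKCTFGNSVFSTVKGLKRCLKVHSN "))
```

Whether ambiguity codes such as `X` or `B` should also be accepted depends on the data, so the allowed set may need widening.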


## Prepare sequences for alignment

### Subtask:
Modify sequence IDs to be unique and write all sequences (including the query) to a temporary FASTA file.


**Reasoning**:
Modify the sequence IDs to be unique, clear descriptions, and write all sequences to a temporary FASTA file for alignment.



In [44]:
#@title Modify seqIDs

# Modify sequence IDs to be unique and clear descriptions for the list including the query
# Note: this changes the records in place, so re-running this cell appends another index suffix
for i, seq in enumerate(sequences_with_query):
    # Only modify IDs if they are not the query sequence
    if seq.id != "Query_Sequence":
        seq.id = f"{seq.id}_{i}"
    seq.description = "" # Clear description to avoid issues with some tools

# Define a filename for the temporary FASTA file
temp_in_file = "temp_sequences_for_alignment.fasta"

# Write the modified sequences to the temporary FASTA file
SeqIO.write(sequences_with_query, temp_in_file, "fasta")

# Print confirmation message
print(f"Prepared sequences (including query) and saved to {temp_in_file}")

Prepared sequences (including query) and saved to temp_sequences_for_alignment.fasta
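The alignment printed later in this notebook shows names cut off at 30 characters (for example `AF-B4DUB8-F1-model_v4_1910_38_`), so an index appended to a long ID can be partly lost. One possible alternative, sketched here under that assumption, is to assign short sequential IDs and keep a lookup table back to the originals. `assign_short_ids` is a hypothetical helper, and the example IDs are shortened for illustration:

```python
def assign_short_ids(original_ids, prefix="seq", keep=("Query_Sequence",)):
    """Return (short_ids, mapping), where mapping goes short ID -> original ID."""
    width = len(str(len(original_ids)))  # zero-pad so all IDs have equal length
    short_ids, mapping = [], {}
    for i, orig in enumerate(original_ids):
        # IDs listed in `keep` (the query) are passed through unchanged
        short = orig if orig in keep else f"{prefix}{i:0{width}d}"
        short_ids.append(short)
        mapping[short] = orig
    return short_ids, mapping

ids = ["AF-A0A0B7BJ65-F1-model_v4", "2cshA01", "Query_Sequence"]
short_ids, mapping = assign_short_ids(ids)
print(short_ids)
print(mapping["seq0"])
```

Because the short IDs are derived from the position alone, calling the helper again on the original IDs gives the same result, unlike appending a suffix in place.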


## Perform multiple sequence alignment

### Subtask:
Use ClustalW to align the sequences from the temporary file and save the output to a new file.


**Reasoning**:
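Use Biopython's ClustalwCommandline to align the sequences in the temporary FASTA file and save the result to a new file. ClustalW writes Clustal format by default, so the output is a Clustal-format alignment even though the file is named with a `.fasta` extension.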



In [45]:
#@title Make MSA

from Bio.Align.Applications import ClustalwCommandline

# Define the name for the output alignment file
# (ClustalW writes Clustal format by default, despite the .fasta extension)
out_file = "aligned_with_query.fasta"

# Create a ClustalwCommandline object
# Specify the path to the clustalw executable if it's not in your PATH
clustalw_cline = ClustalwCommandline("clustalw", infile=temp_in_file, outfile=out_file)

# Run the command line tool
stdout, stderr = clustalw_cline()

# Print confirmation message
print(f"Multiple sequence alignment performed and saved to {out_file}")

Multiple sequence alignment performed and saved to aligned_with_query.fasta


## Read the alignment

### Subtask:
Read the aligned sequences from the output file using `AlignIO.read`.


**Reasoning**:
Read the aligned sequences from the output file using AlignIO.read with the 'clustal' format (the format ClustalW actually wrote) and print the alignment object.



In [46]:
#@title Read alignment file
from Bio import AlignIO

# Define the name of the alignment file output by ClustalW
alignment_file = "aligned_with_query.fasta"

# Read the alignment file using AlignIO.read with 'clustal' format
try:
    alignment = AlignIO.read(alignment_file, "clustal")

    # Print the alignment object to verify
    print("Alignment object:")
    print(alignment)

except FileNotFoundError:
    print(f"Error: The file {alignment_file} was not found.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")

Alignment object:
Alignment with 3022 rows and 51 columns
----FKCNE---CEKAFS----YSS------QLARHQKV-H---...--- AF-B4DUB8-F1-model_v4_1910_38_
----FKCNE---CEKAFS----YSS------QLARHQKV-HIT-...--- ProtVar_P13682_Q9NSC5_A_2089_3
----FKCNE---CEKAFS----YSS------QLARHQKV-H---...--- ProtVar_P13682_Q12933-2_A_2097
----FKCNE---CEKAFS----YSS------QLARHQKV-H---...--- ProtVar_P13682_Q9NYL2-2_A_2149
----FKCNE---CEKAFS----YSS------QLARHQKV-H---...--- ProtVar_P13682_Q12933_A_2188_4
----FKCNE---CEKAFS----YSS------QLARHQKV-H---...--- ProtVar_P13682_P23508_A_2213_4
----FKCNE---CEKAFS----YSS------QLARHQKV-H---...--- ProtVar_P13682_P57678_A_2241_4
----FKCGE---CEKAFS----YSS------QLARHQKV-H---...--- AF-Q5M881-F1-model_v4_83_45_45
----YKCKQ---CEKCFV----QKS------QLVRHQKV-HR--...--- AF-A0A6J0SK80-F1-model_v4_1330
----YKCPE---CGKSFS----VSS------YLIRHQKI-HT--...--- AF-A0A7K5Y3Q2-F1-model_v4_1248
----YKCPE---CGKSFS----VSS------YLIRHQKI-HT--...--- AF-A0A7K7FT70-F1-model_v4_1378
----YKCPE---CGKSFS----VSS------YLIRHQKI-

## Analyze sequence identity

### Subtask:
Calculate the maximum sequence identity of the query sequence against all other sequences in the alignment.


**Reasoning**:
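Calculate the maximum sequence identity by iterating through the alignment, comparing the query sequence to every other sequence, and keeping track of the highest identity score found. Identity is computed here as the percentage of matching residues among the columns where neither the query nor the other sequence has a gap.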



In [47]:
#@title Analyze maximum SeqID

# Find the query sequence in the alignment using its fixed ID
query_sequence_id = "Query_Sequence"
query_seq = None
for record in alignment:
    if record.id == query_sequence_id:
        query_seq = record
        break

if query_seq is None:
    print(f"Error: Query sequence with ID '{query_sequence_id}' not found in the alignment.")
else:
    max_identity = 0.0

    # Iterate through each sequence in the alignment
    for record in alignment:
        # Skip the query sequence itself
        if record.id == query_sequence_id:
            continue

        # Calculate sequence identity
        identical_residues = 0
        aligned_length = 0
        for i in range(alignment.get_alignment_length()):
            query_residue = query_seq[i]
            current_residue = record[i]

            # Only consider positions where neither sequence has a gap
            if query_residue != '-' and current_residue != '-':
                aligned_length += 1
                if query_residue == current_residue:
                    identical_residues += 1

        # Skip sequences that share no ungapped columns with the query (avoids division by zero)
        if aligned_length > 0:
            identity = (identical_residues / aligned_length) * 100
            max_identity = max(max_identity, identity)

    print(f"Maximum sequence identity of the query sequence against others: {max_identity:.2f}%")

Maximum sequence identity of the query sequence against others: 50.00%
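The same identity definition (matches divided by the number of columns where neither sequence has a gap) can be written as a small function, which also makes it easy to report which sequence gives the maximum. A sketch on toy aligned strings rather than the notebook's alignment; `pairwise_identity` and `best_match` are hypothetical helper names:

```python
def pairwise_identity(a, b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0  # no comparable columns
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

def best_match(query, others):
    """Return (id, identity) of the highest-identity row in a dict id -> aligned sequence."""
    scored = ((name, pairwise_identity(query, seq)) for name, seq in others.items())
    return max(scored, key=lambda item: item[1])

# Toy aligned rows for illustration only
query = "YKC-TFG"
others = {"a": "YRCE-FA", "b": "-KC-TAG"}
print(pairwise_identity(query, others["a"]))  # 3 matches over 5 shared columns
print(best_match(query, others))
```

With this denominator, a sequence that overlaps the query in only a few columns can still score highly; dividing by the query length instead is another common convention and would give lower values in such cases.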


## Display results

### Subtask:
Print the maximum sequence identity.


**Reasoning**:
Print the value of the `max_identity` variable with a descriptive message as requested.



In [48]:
print(f"The maximum sequence identity of the query sequence against all other sequences is: {max_identity:.2f}%")

The maximum sequence identity of the query sequence against all other sequences is: 50.00%
