<a href="https://colab.research.google.com/github/bforsbe/SK2534/blob/main/Phylo_znf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install necessary libraries and tools

### Subtask:
Install Biopython and a suitable MSA tool (e.g., Clustal Omega).


**Reasoning**:
Install Biopython with `pip` and Clustal Omega with `apt-get`.



In [15]:
%pip install biopython
!apt-get update -y
!apt-get install -y clustalomega

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:5 https://cli.github.com/packages stable InRelease

**Reasoning**:
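The previous command could not locate a package named `clustalomega` (on Ubuntu, Clustal Omega is normally packaged as `clustalo`). As a fallback, install ClustalW from the standard repositories with the `clustalw` package.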



In [16]:
!apt-get update -y
!apt-get install -y clustalw

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 https://cli.github.com/packages stable InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
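
Because the install log above is truncated, it can be useful to confirm which alignment executables actually ended up on the `PATH` before moving on. The following is a minimal sketch using only the standard library; the helper name `locate_tools` is hypothetical, and `clustalo` is listed only as an assumption about the Clustal Omega binary name.

```python
import shutil

def locate_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

# "clustalw" is the tool installed by the cell above; "clustalo" is included
# as an assumed name for the Clustal Omega binary and may well be missing.
for tool, path in locate_tools(["clustalw", "clustalo"]).items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

If `clustalw` is reported as not found, the alignment step further down would fail when it tries to launch the tool.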

## Load the fasta sequences

### Subtask:
Read the sequences from the generated FASTA file using Biopython.


**Reasoning**:
Import the SeqIO module from Biopython and read the sequences from the "output.fasta" file using SeqIO.parse().



In [17]:
from Bio import SeqIO

# Read the sequences from the FASTA file
sequences = list(SeqIO.parse("output.fasta", "fasta"))

# Print the number of sequences read to verify
print(f"Number of sequences read: {len(sequences)}")

Number of sequences read: 3021
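
Before aligning, it may be worth checking whether the record IDs are unique, since the alignment step below rewrites them for exactly that reason. A minimal sketch follows; the helper name `duplicate_ids` and the demo records are hypothetical stand-ins, and the function only assumes that each record has an `id` attribute, as Biopython's `SeqRecord` does.

```python
from collections import Counter
from types import SimpleNamespace

def duplicate_ids(records):
    """Return {id: count} for every ID that occurs more than once."""
    counts = Counter(rec.id for rec in records)
    return {seq_id: n for seq_id, n in counts.items() if n > 1}

# Stand-in records for illustration only; in the notebook you would pass
# the `sequences` list read with SeqIO.parse instead.
demo = [SimpleNamespace(id="znf_a"), SimpleNamespace(id="znf_b"),
        SimpleNamespace(id="znf_a")]
print(duplicate_ids(demo))
```

In the notebook, `duplicate_ids(sequences)` would show how many IDs collide and therefore whether renaming is needed at all.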


## Perform the multiple sequence alignment

### Subtask:
Use Biopython's interface to the chosen MSA tool to align the sequences.


**Reasoning**:
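Use Biopython's `ClustalwCommandline` to run a multiple sequence alignment on the extracted FASTA sequences and save the result to `aligned.fasta`. Note that ClustalW writes Clustal format by default, so the file extension does not reflect the actual format of the output.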



In [19]:
from Bio import SeqIO
from Bio.Align.Applications import ClustalwCommandline

# Read the sequences from the FASTA file
sequences = list(SeqIO.parse("output.fasta", "fasta"))

# Append each record's index to its ID so that every ID is unique
for i, seq in enumerate(sequences):
    seq.id = f"{seq.id}_{i}"
    seq.description = "" # Clear description to avoid issues with some tools

# Write the modified sequences to a temporary FASTA file
temp_in_file = "output_unique.fasta"
SeqIO.write(sequences, temp_in_file, "fasta")

# Define the output file name.
# NOTE: ClustalW writes Clustal format by default, so this file is not FASTA despite its extension.
out_file = "aligned.fasta"

# Create a ClustalwCommandline object with the temporary file
clustalw_cline = ClustalwCommandline("clustalw", infile=temp_in_file, outfile=out_file)

# Run the command line tool
stdout, stderr = clustalw_cline()

# Print confirmation message
print(f"Multiple sequence alignment performed and saved to {out_file}")

Multiple sequence alignment performed and saved to aligned.fasta
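
The `Bio.Align.Applications` wrappers are deprecated in recent Biopython releases, whose documentation recommends building the command yourself and running it with `subprocess`. The sketch below shows one way this could look, and also asks ClustalW for FASTA output through its `-OUTPUT=FASTA` option. The helper names and the output file name `aligned_fasta_format.fasta` are hypothetical, and the snippet assumes a ClustalW 2.x binary called `clustalw`.

```python
import subprocess

def build_clustalw_command(infile, outfile, executable="clustalw"):
    """Build a ClustalW command line that writes FASTA-formatted output."""
    return [
        executable,
        f"-INFILE={infile}",
        f"-OUTFILE={outfile}",
        "-OUTPUT=FASTA",  # without this, ClustalW writes Clustal format
    ]

def run_clustalw(infile, outfile):
    """Run ClustalW; raises CalledProcessError if the tool exits with an error."""
    cmd = build_clustalw_command(infile, outfile)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

# Only the command is printed here; the actual run is left commented out
# because aligning about 3000 sequences can take a while.
# run_clustalw("output_unique.fasta", "aligned_fasta_format.fasta")
print(build_clustalw_command("output_unique.fasta", "aligned_fasta_format.fasta"))
```

Using `check=True` means a failed run raises an exception instead of passing silently, which the confirmation message in the cell above would not catch.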


## Display or save the alignment

### Subtask:
Display or save the resulting MSA to a file in a common format (e.g., Stockholm, Clustal, or FASTA).


In [None]:
# Read and print the raw content of the alignment file written by ClustalW
alignment_file = "aligned.fasta"
try:
    with open(alignment_file, "r") as f:
        file_content = f.read()
    print(f"Content of {alignment_file}:")
    print(file_content)
except FileNotFoundError:
    print(f"Error: The file {alignment_file} was not found.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")


**Reasoning**:
ClustalW writes its output in Clustal format by default (a `CLUSTAL` header line followed by blocks of aligned rows), regardless of the `.fasta` file extension. This is why `AlignIO.read` fails to parse the file with the standard "fasta" format. Reading it with the "clustal" format, which is designed for Clustal output, should work.
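
One way to avoid guessing the format from the file extension is to look at the first non-blank line of the file. The following is a rough sketch of such a check; the function name `sniff_alignment_format` is hypothetical, and it only distinguishes the two formats discussed here.

```python
def sniff_alignment_format(lines):
    """Guess 'fasta' or 'clustal' from the first non-blank line, else None.

    `lines` can be an open file handle or any iterable of strings.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(">"):
            return "fasta"
        if stripped.upper().startswith("CLUSTAL"):
            return "clustal"
        return None  # first real line matches neither format
    return None  # empty input

# Usage in the notebook (assumes aligned.fasta exists in the working directory):
# with open("aligned.fasta") as handle:
#     fmt = sniff_alignment_format(handle)
# alignment = AlignIO.read("aligned.fasta", fmt)
```

The returned string matches the format names that `AlignIO.read` accepts, so it can be passed along directly once it is not `None`.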



In [22]:
from Bio import AlignIO

# Read the alignment using the 'clustal' format (the file is Clustal-formatted despite its .fasta extension)
alignment_file = "aligned.fasta"
try:
    alignment = AlignIO.read(alignment_file, "clustal")

    # Print the alignment to the console
    print("Multiple Sequence Alignment (Clustal format):")
    print(alignment)

    # Optionally, write the alignment to a new file in FASTA format
    output_fasta_file = "aligned_output.fasta"
    AlignIO.write(alignment, output_fasta_file, "fasta")
    print(f"\nAlignment also saved to {output_fasta_file} in FASTA format.")

except FileNotFoundError:
    print(f"Error: The file {alignment_file} was not found.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")

Multiple Sequence Alignment (Clustal format):
Alignment with 3021 rows and 40 columns
--YKCEI--CQKMFN---HKS--YLTVHAKIHQN------ A0A1B2RVU6_0
--YKCDI--CSAVFT---TDS--ELKVHSKIHR------- AF-A0A0B7BJ65-F1-model_v4_1715
--YKCDV--CDKSYK---VKS--SLNEHLIIH-------- AF-A0A7R9HB19-F1-model_v4_1317
--YKCDV--CDNSFK---MKS--SLYTHLLIHTN------ AF-A0A7R8VXV2-F1-model_v4_1602
--YKCDI--CQKSFV---NKS--NLNNHYLIHT------- AF-A0A4Y2NZM6-F1-model_v4_1736
--YKCLI--CDKSFG---LKS--NLNYHMKIH-------- AF-A0A087U0Z0-F1-model_v4_1757
--YHCSV--CEKCYK---TKG--GLKRHHTIVK------- AF-A0A397V9P1-F1-model_v4_2061
--YKCEI--CKKEFK---TKH--GLKGH------------ MGYP003998274573_2901
---KCLD--CYKKFK---TKH--RAEKHFLIHKT------ A0A1B2RVS9_55
--YDCNV--CHKGFK---RKF--DLTKHYSIHNT------ A0A346G481_9
--YGCGV--CGKKFK---MKH--HLVGHMKIHT------- 3.30.160.60_2711
--YGCGV--CGKKFK---MKH--HLVGHMKIH-------- 2cshA01_2759
--YKCDV--CDKAFK---HKH--HLTEHKRLHS------- AF-A0A077YWL4-F1-model_v4_310
--YKCDI--CEKAFK---HKH--HLTEHKRLHS------- AF-G5EBU4-F1-model_v4_1030
--FKC