<a href="https://colab.research.google.com/github/carlocamilloni/Structural-Bioinformatics/blob/main/Notebooks/lab05_StructurePred.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Protein Structure Prediction

In [None]:
#Install conda using the new conda-colab library
!pip install -q condacolab
import condacolab
condacolab.install_miniconda()

In this notebook you will characterize a protein from its sequence using several tools. In particular, we will work on proteins from Cyprinid herpesvirus 2, a virus that infects goldfish. You will perform:

1.   Secondary structure prediction
2.   Tertiary structure prediction

As last time, link your Google Drive, create a folder for this task, and save a copy of this notebook there.

Then move to your local folder:

In [1]:
# mount google drive  
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
local_path='/content/drive/MyDrive/Structural_Bioinformatics/Task5'   # PASTE YOUR LOCAL PATH IN PLACE OF /content
%cd {local_path}

/content/drive/MyDrive/Structural_Bioinformatics/Task5


Now choose one of the following sequences to work on. To select it, set the seq variable in the cell below:

*   A9MHP5: Beta-barrel assembly-enhancing protease
*   A9MI44: Uncharacterized protein
*   A9MJ84: Uncharacterized protein
*   A9MG92: ABC transmembrane type-1 domain-containing protein

You can learn more about these proteins by looking them up on www.uniprot.org


In [None]:
# replace XXXX with the accession ID of the protein you chose
seq='XXXX.fasta'

In [None]:
#download the FASTA file of the sequence you chose:
!wget https://www.uniprot.org/uniprot/{seq}

In [None]:
#Installing biopython using pip
!pip install biopython

Biopython is an excellent companion for working with DNA and protein sequences and also with structures. Here, we will show how to use it for reading a protein sequence file.


In [None]:
import os
from pathlib import Path 
from Bio import SeqIO, Entrez
# read your protein sequence
aaseq = SeqIO.read(seq, format="fasta")
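
To make clear what SeqIO reads, here is a minimal sketch of the FASTA layout itself: one header line starting with ">" followed by one or more sequence lines. It uses only the standard library, and the record (accession, description and residues) is invented for illustration; the helper name read_single_fasta is hypothetical and is not part of Biopython.

```python
# Minimal FASTA reader using only the standard library.
# The record below is invented for illustration; it is not a real UniProt entry.
example_fasta = """>tr|X0XXX0|X0XXX0_EXAMPLE Hypothetical protein OS=Example organism
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
APILSRVGDGTQDNLSGAEKAVQ
"""


def read_single_fasta(text):
    """Return (header, sequence) for a FASTA string holding one record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record: missing '>' header line")
    header = lines[0][1:]          # drop the leading '>'
    sequence = "".join(lines[1:])  # the sequence may span several lines
    return header, sequence


header, sequence = read_single_fasta(example_fasta)
print(header)
print(len(sequence))
```

In practice SeqIO does this for you and also handles files with many records, so the sketch is only meant to show what is inside the file you downloaded.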

What is great about SeqIO is that you can use it to manipulate your sequence (e.g. sorting, changing formats, etc) and also to print information about your sequence, such as its description, sequence and accession ID.
You can try these commands below by first writing "aaseq." and then selecting one of the autocomplete options suggested by Google Colab. With this information, try to obtain the information requested below.

In [None]:
#Printing the number of amino acids as an example
print("Sequence length (aa):")
print(len(aaseq))

In [None]:
#Printing the description of the sequence
print("The description of the sequence is:")
#INSERT YOUR OWN COMMAND BELOW (aaseq.xxx)
print()

#Printing the accession ID of the sequence
print("The ID of the sequence is:")
#INSERT YOUR OWN COMMAND BELOW
print()

#Printing the sequence
print("The sequence is:")
#INSERT YOUR OWN COMMAND BELOW
print()

# Secondary Structure Prediction

Now we can begin with a secondary structure prediction from the sequence. For this we will use S4PRED (https://github.com/psipred/s4pred)

In [None]:
%cd {local_path}

In [None]:
# we need to download and install s4pred
!git clone --depth 1 https://github.com/psipred/s4pred
%cd s4pred
%mkdir weights
%cd weights
!wget http://bioinfadmin.cs.ucl.ac.uk/downloads/s4pred/weights_1.pt
!wget http://bioinfadmin.cs.ucl.ac.uk/downloads/s4pred/weights_2.pt
!wget http://bioinfadmin.cs.ucl.ac.uk/downloads/s4pred/weights_3.pt
!wget http://bioinfadmin.cs.ucl.ac.uk/downloads/s4pred/weights_4.pt
!wget http://bioinfadmin.cs.ucl.ac.uk/downloads/s4pred/weights_5.pt

In [None]:
# Install: PyTorch
# this can be long...
!pip install torch==1.5.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Now we can run the prediction on our FASTA sequence and save it to a text file. You can then double-click the file to see the results.

In [None]:
%cd {local_path}
!python s4pred/run_model.py {seq} > ss.{seq}.txt
%ls ss.{seq}.txt

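After two header lines, each row of the file has six columns: the residue position, the amino acid, the predicted secondary structure state, and three columns with the probabilities of COIL, HELIX and EXTENDED. We can now load and plot the results.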
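
As a rough illustration of this layout, the sketch below parses a few rows with the six columns named in the next cell (position, residue, predicted state, then coil, helix and extended probabilities) and checks which probability is highest for each residue. The sample rows and the header line are invented for illustration and are not real S4PRED output; parse_ss_rows is a hypothetical helper.

```python
# Sample text in the assumed six-column layout; all values are invented.
sample_output = """# header line (placeholder)

   1 M C  0.990  0.005  0.005
   2 K C  0.700  0.250  0.050
   3 T H  0.200  0.750  0.050
   4 A H  0.100  0.880  0.020
   5 V E  0.150  0.050  0.800
"""


def parse_ss_rows(text):
    """Parse whitespace-separated rows, skipping blank and '#' lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        pos, aa, ss, coil, helix, extended = line.split()
        rows.append({
            "position": int(pos),
            "residue": aa,
            "ss": ss,
            "coil": float(coil),
            "helix": float(helix),
            "extended": float(extended),
        })
    return rows


rows = parse_ss_rows(sample_output)
labels = {"coil": "C", "helix": "H", "extended": "E"}
for row in rows:
    # the state with the highest probability
    best = max(labels, key=lambda name: row[name])
    print(row["position"], row["residue"], row["ss"], labels[best])
```

In these invented rows the reported state always matches the column with the highest probability, which is the relationship you would expect to see when inspecting your own file.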

In [None]:
import pandas as pd
# here we are giving the titles to the columns of the file
header_list = ["position", "sequence", "ss", "coil", "helix","extended"]
# here we are reading the file skipping the first two lines
# the file was written to local_path by the previous cell; change the path if yours is elsewhere
data = pd.read_csv(f'{local_path}/ss.{seq}.txt', sep=r'\s+', skiprows=2, names=header_list)

import matplotlib.pyplot as plt
#we are generating three plots calling each column with its name
#and setting the properties of the plot (e.g. the color of the line)
plt.plot(data['position'], data['coil'], color='black')
plt.plot(data['position'], data['helix'], color='blue')
plt.plot(data['position'], data['extended'], color='red')
#show the plot now
plt.show()

You can find ways to improve the plot by looking at the matplotlib plot examples. This plot can be used to describe the likely secondary structure of the protein.
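
Besides the plot, a compact way to describe the prediction is to list the stretches of consecutive residues that share the same state. The following is a minimal sketch under the assumption that the states are the single letters C, H and E; the example string is invented and ss_segments is a hypothetical helper.

```python
from itertools import groupby


def ss_segments(ss_string):
    """Collapse a per-residue state string into (state, start, end) tuples.

    Positions are 1-based and inclusive.
    """
    segments = []
    position = 1
    for state, run in groupby(ss_string):
        length = len(list(run))
        segments.append((state, position, position + length - 1))
        position += length
    return segments


example = "CCHHHHHHCCEEEEC"  # invented per-residue states
for state, start, end in ss_segments(example):
    print(f"{state}: residues {start}-{end}")
```

With the DataFrame loaded above, the input string could be built with "".join(data['ss']).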

# AB INITIO STRUCTURE PREDICTION


Here you will try to predict your structure using AlphaFold. Prepare a folder to save your data. Open the Google Colab notebook at this link
https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb
and save a copy before moving on.



Depending on the length of your sequence, AlphaFold can take from 30 minutes to a few hours. Be careful not to close the web page, or you will lose the calculation.

# More analysis

Once you have all the PDB structures obtained using different techniques you can compare them using VMD.
In addition to VMD you can also visualise them directly inside the notebook as in the following example:

Let's visualize some PDBs!

In the following code cell we go line by line through the commands needed to visualize two of the chains that you extracted from 6ANE. Please note that the text inside each pair of parentheses follows the syntax of a JSON data structure.
Pay special attention to the following py3Dmol methods and what they do: addModel, setBackgroundColor and setStyle.
💡 HINT: The syntax for loading different structures and properties is straightforward and self-explanatory. Even so, remember to be careful when opening and closing {} and (), and to use quotes when needed.

In [None]:
#First we import py3Dmol (install it with "!pip install py3Dmol" if the import fails)
import py3Dmol
#Then we assign the py3Dmol.view as view
view=py3Dmol.view()
#The following lines use the addModel method
#to read multiple PDB files (fill in the file names)
view.addModel(open('.pdb', 'r').read(),'pdb')
view.addModel(open('.pdb', 'r').read(),'pdb')
#Zooming into all visualized structures 
view.zoomTo()
#Here we set the background color as white
view.setBackgroundColor('white')
#Here we set the visualization style for chains B and C
view.setStyle({'chain':'B'},{'cartoon': {'color':'purple'}})
view.setStyle({'chain':'C'},{'cartoon': {'color':'yellow'}})
#And we finally visualize the structures using the command below
view.show()

If the structures are not superimposed, you can use Biopython to superimpose them.

In [None]:
#The following code was created by Anders Steen Christensen
#from the University of Basel and is available at
#https://gist.github.com/andersx/6354971

import Bio.PDB
import os

# Select what residue numbers you wish to align
# and put them in a list
start_id =  # first res
end_id   =  # last res to be aligned
atoms_to_be_aligned = range(start_id, end_id + 1)

# Start the parser
pdb_parser = Bio.PDB.PDBParser(QUIET = True)

# Get the structures
ref_structure = pdb_parser.get_structure("reference", ".pdb")
sample_structure = pdb_parser.get_structure("sample", ".pdb")

# Use the first model in the pdb-files for alignment
# Change the number 0 if you want to align to another structure
ref_model    = ref_structure[0]
sample_model = sample_structure[0]

# Make a list of the atoms (in the structures) you wish to align.
# In this case we use CA atoms whose index is in the specified range
ref_atoms = []
sample_atoms = []

# Iterate over all chains in the model in order to find all residues
for ref_chain in ref_model:
  # Iterate over all residues in each chain in order to find proper atoms
  for ref_res in ref_chain:
    # Check if residue number ( .get_id() ) is in the list
    if ref_res.get_id()[1] in atoms_to_be_aligned:
      # Append CA atom to list
      ref_atoms.append(ref_res['CA'])

# Do the same for the sample structure
for sample_chain in sample_model:
  for sample_res in sample_chain:
    if sample_res.get_id()[1] in atoms_to_be_aligned:
      sample_atoms.append(sample_res['CA'])

# Now we initiate the superimposer:
super_imposer = Bio.PDB.Superimposer()
super_imposer.set_atoms(ref_atoms, sample_atoms)
super_imposer.apply(sample_model.get_atoms())

# Print RMSD:
print('The calculated RMSD is:')
print(str(super_imposer.rms) + ' Å')

# Save the aligned version of the sample structure
io = Bio.PDB.PDBIO()
io.set_structure(sample_structure) 
io.save("_aligned.pdb")

After alignment you can reuse the code above for visualisation.
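
For reference, the RMSD printed by the superposition step is the root-mean-square of the distances between paired atoms once the two structures have been superimposed. The sketch below computes only that final quantity for two coordinate lists that are assumed to be already superimposed and listed in the same order; it does not perform the superposition itself, and the coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # sum of squared distances between paired atoms
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Two invented sets of CA coordinates (in Å), shifted by 1 Å along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
sample = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(reference, sample))
```

A lower value means the two sets of atoms lie closer together; identical coordinates give an RMSD of zero.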