Extend structure alignment page with multiple alignments #5

Merged
merged 22 commits into from
Jul 22, 2015
7 changes: 3 additions & 4 deletions README.md
@@ -6,7 +6,7 @@ A brief introduction into [BioJava](https://github.com/biojava/biojava).

The goal of this tutorial is to provide an educational introduction into some of the features that are provided by BioJava.

At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wiki/BioJava:CookBook3.0) for a more comprehensive collection of many examples of what is possible with BioJava and how to do things.
At the moment this tutorial is still under development. Please check the [BioJava Cookbook](http://biojava.org/wiki/BioJava:CookBook3.0) for a more comprehensive collection of examples about what is possible with BioJava and how to do things.

## Index

@@ -16,10 +16,9 @@ Book 1: [The Core module](core/README.md), basic working with sequences.

Book 2: [The Alignment module](alignment/README.md), pairwise and multiple alignments of protein sequences.

Book 3: [The Protein Structure modules](structure/README.md), everything related to working with 3D structures.

Book 4: [The Genomics Module](genomics/README.md), working with genomic data
Book 3: [The Structure modules](structure/README.md), everything related to working with 3D structures.

Book 4: [The Genomics Module](genomics/README.md), working with genomic data.

## License

2 changes: 1 addition & 1 deletion alignment/README.md
@@ -63,4 +63,4 @@ Navigation:

Prev: [Book 1: The Core module](../core/README.md)

Next: [Book 3: The Protein Structure modules](../structure/README.md)
Next: [Book 3: The Structure modules](../structure/README.md)
8 changes: 4 additions & 4 deletions bin/update_index.py
@@ -110,7 +110,7 @@ def makefooter(self):
name = p.makename()
# Get a path to p relative to our own path
link = os.path.relpath(p.rootlink(),os.path.dirname(self.rootlink()))
linkmd.append("[{}]({})".format(name,link))
linkmd.append("[{0}]({1})".format(name,link))
p = p.parent
linkmd.reverse()
lines.append("\n| ".join(linkmd))
@@ -123,13 +123,13 @@ def makefooter(self):
prev = self.parent.children[pos-1]
name = prev.makename()
link = os.path.relpath(prev.rootlink(),os.path.dirname(self.rootlink()))
lines.append("Prev: [{}]({})".format(name,link))
lines.append("Prev: [{0}]({1})".format(name,link))
lines.append("")
if pos < len(self.parent.children)-1:
next = self.parent.children[pos+1]
name = next.makename()
link = os.path.relpath(next.rootlink(),os.path.dirname(self.rootlink()))
lines.append("Next: [{}]({})".format(name,link))
lines.append("Next: [{0}]({1})".format(name,link))
lines.append("")

#lines.append(self.makename()+", "+self.link)
@@ -162,7 +162,7 @@ def __repr__(self):

# Output tree
def pr(node,indent=""):
    print "{}{}".format(indent,node.link,node.rootlink())
    print "{0}{1}".format(indent,node.link,node.rootlink())
    for n in node.children:
        pr(n,indent+" ")

2 changes: 1 addition & 1 deletion genomics/README.md
@@ -64,4 +64,4 @@ Navigation:
[Home](../README.md)
| Book 4: The Genomics Module

Prev: [Book 3: The Protein Structure modules](../structure/README.md)
Prev: [Book 3: The Structure modules](../structure/README.md)
22 changes: 11 additions & 11 deletions structure/README.md
@@ -1,7 +1,7 @@
The Protein Structure Modules of BioJava
The Structure Modules of BioJava
=====================================================

A tutorial for the protein structure modules of [BioJava](http://www.biojava.org)
A tutorial for the structure modules of [BioJava](http://www.biojava.org)

## About
<table>
@@ -32,35 +32,35 @@ Chapter 1 - Quick [Installation](installation.md)

Chapter 2 - [First Steps](firststeps.md)

Chapter 3 - The [data model](structure-data-model.md) for the representation of macromolecular structures.
Chapter 3 - The [Structure Data Model](structure-data-model.md), for the representation of macromolecular structures

Chapter 4 - [Local installations](caching.md) of PDB
Chapter 4 - [Local Installations](caching.md) of PDB

Chapter 5 - The [Chemical Component Dictionary](chemcomp.md)

Chapter 6 - How to [work with mmCIF/PDBx files](mmcif.md)
Chapter 6 - How to [Work with mmCIF/PDBx Files](mmcif.md)

Chapter 7 - [SEQRES and ATOM records](seqres.md), mapping to Uniprot (SIFTs)
Chapter 7 - [SEQRES and ATOM Records](seqres.md), mapping to Uniprot (SIFTs)

Chapter 8 - Protein [Structure Alignments](alignment.md)
Chapter 8 - [Structure Alignments](alignment.md)

Chapter 9 - [Biological Assemblies](bioassembly.md)

Chapter 10 - [External Databases](externaldb.md) like SCOP &amp; CATH

Chapter 11 - [Accessible Surface Areas](asa.md)

Chapter 12 - [Contacts within a chain and between chains](contact-map.md)
Chapter 12 - [Contacts Within a Chain and between Chains](contact-map.md)

Chapter 13 - Finding all interfaces in crystal: [crystal contacts](crystal-contacts.md)
Chapter 13 - Finding all Interfaces in Crystal: [Crystal Contacts](crystal-contacts.md)

Chapter 14 - Protein Symmetry

Chapter 15 - Bonds

Chapter 16 - [Special Cases](special.md)

Chapter 17 - [Lists](lists.md) of PDB IDs and PDB [status information](lists.md).
Chapter 17 - [Lists](lists.md) of PDB IDs and PDB [Status Information](lists.md)


### Author:
@@ -88,7 +88,7 @@ The content of this tutorial is available under the [CC-BY](http://creativecommo

Navigation:
[Home](../README.md)
| Book 3: The Protein Structure modules
| Book 3: The Structure modules

Prev: [Book 2: The Alignment module](../alignment/README.md)

229 changes: 229 additions & 0 deletions structure/alignment-data-model.md
@@ -0,0 +1,229 @@
Structure Alignment Data Model
===

## AFPChain Data Model

The `AFPChain` data structure was designed to store pairwise structural
alignments. The class functions as a bean and contains many variables
used internally by the alignment algorithms implemented in BioJava.

Some of the important stored variables are:
* Algorithm Name
* Optimal Alignment: described later.
* Optimal RMSD: final and total RMSD value of the alignment.
* TM-score
* BlockRotationMatrix: rotation component of the superposition transformation.
* BlockShiftVector: translation component of the superposition transformation.

BioJava class: [org.biojava.nbio.structure.align.model.AFPChain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/model/AFPChain.html)

### The Optimal Alignment

The residue equivalencies of the alignment (EQRs) are described in the optimal
alignment variable, a three-dimensional array of integers indexed as follows:

```java
int[][][] optAln = afpChain.getOptAln();
int residue = optAln[block][chain][eqr];
```

* **block**: the blocks divide the alignment into different parts. The
division can be due to non-topological rearrangements (e.g. circular
permutations) or due to flexible parts (e.g. domain switch). There can
be any number of blocks in a structural alignment, defined by the structure
alignment algorithm.
* **chain**: in a pairwise alignment there are only two chains, or structures.
* **eqr**: EQR stands for equivalent residue position, i.e. the alignment
position. A block contains as many positions (EQRs) as the length of
the alignment block, and that number is the same for both chains in
the same block.

Each entry (a combination of the three indices described above) stores an
integer that corresponds to the residue index in the specified chain, i.e.
the index in the Atom array of the chain. Within a block, the stored
integers (residues) are always in increasing order.
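
As a minimal sketch of this indexing, the plain-Java example below walks a hand-written `int[][][]` with the same `[block][chain][eqr]` layout. The class `OptAlnSketch`, its helper methods and the residue indices are invented for illustration and are not part of BioJava; in real code the array would come from `afpChain.getOptAln()`. The sketch also assumes that every inner array is exactly as long as its block, which may not hold for arrays returned by the library.

```java
class OptAlnSketch {

    // Total number of aligned positions (EQRs), summed over all blocks.
    static int countEqrs(int[][][] optAln) {
        int total = 0;
        for (int[][] block : optAln) {
            total += block[0].length; // both chains have the same length within a block
        }
        return total;
    }

    // True if the residue indices of every chain strictly increase inside every block.
    static boolean isSequentialWithinBlocks(int[][][] optAln) {
        for (int[][] block : optAln) {
            for (int[] chain : block) {
                for (int eqr = 1; eqr < chain.length; eqr++) {
                    if (chain[eqr] <= chain[eqr - 1]) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    // Hypothetical data: two blocks, two chains.
    static int[][][] example() {
        return new int[][][] {
            { {0, 1, 2}, {50, 51, 52} }, // block 0
            { {60, 61}, {3, 4} }         // block 1
        };
    }

    public static void main(String[] args) {
        int[][][] optAln = example();
        for (int block = 0; block < optAln.length; block++) {
            for (int eqr = 0; eqr < optAln[block][0].length; eqr++) {
                System.out.println("block " + block + ": residue "
                        + optAln[block][0][eqr] + " <-> residue " + optAln[block][1][eqr]);
            }
        }
    }
}
```

In the example data, the second block pairs later residues of the first chain with earlier residues of the second chain, which is the pattern a circular permutation would produce; the order is still increasing inside each block.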

### Examples

Some examples of how to get the basic properties of an `AFPChain`:

```java
String algorithm = afpChain.getAlgorithmName();        //Name of the algorithm that generated the alignment
int blockNum = afpChain.getBlockNum();                 //Number of blocks
double tmScore = afpChain.getTMScore();                //TM-score
double rmsd = afpChain.getTotalRmsdOpt();              //Optimal RMSD
Matrix rotation = afpChain.getBlockRotationMatrix()[0]; //Rotation matrix of the first block
Atom shift = afpChain.getBlockShiftVector()[0];        //Translation vector of the first block
```

### Overview

As an overview, the `AFPChain` data model:

* Only supports **pairwise alignments**, i.e. two chains or structures aligned.
* Can support **flexible alignments** and **non-topological alignments**.
However, their combination (a flexible alignment with topological rearrangements)
cannot be represented, because a block means either one or the other.
* Cannot support **non-sequential alignments** without requiring a new block
for each EQR, because sequentiality of the residues is assumed inside each block.

## MultipleAlignment Data Model

Since BioJava 4.1.0, a new data model is available to store structure alignments.
The `MultipleAlignment` data structure is a general model that supports any of the
following properties, and any combination:

* **Multiple structures**: the model is no longer restricted to pairwise alignments.
* **Non-topological alignments**: such as circular permutations or domain rearrangements.
* **Flexible alignments**: parts of the alignment with different superposition
transformations.

In addition, the data structure is not limited in the number and types of scores
it can store, because the scores are stored as key:value pairs, as described in
the Alignment Scores section.

BioJava class: [org.biojava.nbio.structure.align.multiple.MultipleAlignment](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/MultipleAlignment.html)

### Object Hierarchy

The biggest difference from `AFPChain` is that the `MultipleAlignment` data
structure is object oriented.
The hierarchy of sub-objects is represented below:

<pre>
MultipleAlignmentEnsemble
|
MultipleAlignment(s)
|
BlockSet(s)
|
Block(s)
</pre>

* **MultipleAlignmentEnsemble**: the ensemble is the top level of the hierarchy.
As a top level, it stores information regarding creation properties (algorithm,
version, creation time, etc.), the structures involved in the alignment (Atoms,
structure identifiers, etc.) and cached variables (atomic distance matrices).
It contains a collection of `MultipleAlignment` that share the same properties
stored in the ensemble. This construction allows the storage of alternative
alignments inside the same data structure.

* **MultipleAlignment**: the `MultipleAlignment` stores the core information of a
multiple structure alignment. It is designed to be the return type of the multiple
structure alignment algorithms. The object contains a collection of `BlockSet` and
it is linked to its parent `MultipleAlignmentEnsemble`.

* **BlockSet**: the `BlockSet` stores a flexible part of a multiple structure
alignment. A flexible part needs the residue equivalencies involved, contained in
a collection of `Block`, and a transformation matrix for every structure that
describes the 3D superposition of all structures. It is linked to its parent
`MultipleAlignment`.

* **Block**: the `Block` stores the aligned positions (equivalent residues) of a
`BlockSet` that are in sequentially increasing order. Each `Block` represents a
sequential part of a non-topological alignment, if more than one `Block` is present.
It is linked to its parent `BlockSet`.
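
The containment and parent links described above can be sketched with toy classes. This is only an illustration under stated assumptions: `HierarchySketch` and its nested classes are hypothetical stand-ins, not the BioJava interfaces, and they model nothing beyond the parent/child structure, a per-block length and an ensemble-level algorithm name.

```java
import java.util.ArrayList;
import java.util.List;

class HierarchySketch {

    // Top level: creation properties shared by all contained alignments.
    static class Ensemble {
        final String algorithmName;
        final List<Alignment> alignments = new ArrayList<>();
        Ensemble(String algorithmName) { this.algorithmName = algorithmName; }
    }

    static class Alignment {
        final Ensemble parent;
        final List<BlockSet> blockSets = new ArrayList<>();
        Alignment(Ensemble parent) { this.parent = parent; parent.alignments.add(this); }
    }

    // One flexible part of the alignment.
    static class BlockSet {
        final Alignment parent;
        final List<Block> blocks = new ArrayList<>();
        BlockSet(Alignment parent) { this.parent = parent; parent.blockSets.add(this); }
    }

    // One sequential part of a block set.
    static class Block {
        final BlockSet parent;
        final int length; // number of aligned positions (EQRs) in this block
        Block(BlockSet parent, int length) {
            this.parent = parent;
            this.length = length;
            parent.blocks.add(this);
        }
    }

    // Total alignment length: sum of the block lengths over all block sets.
    static int length(Alignment alignment) {
        int total = 0;
        for (BlockSet bs : alignment.blockSets) {
            for (Block b : bs.blocks) {
                total += b.length;
            }
        }
        return total;
    }

    // Ensemble-level properties can be reached from any level through the parent links.
    static String algorithmOf(Block block) {
        return block.parent.parent.parent.algorithmName;
    }

    // Hypothetical alignment: one block set with two blocks, one with a single block.
    static Alignment example() {
        Ensemble ensemble = new Ensemble("exampleAlgorithm");
        Alignment alignment = new Alignment(ensemble);
        BlockSet first = new BlockSet(alignment);
        new Block(first, 40);
        new Block(first, 25);
        BlockSet second = new BlockSet(alignment);
        new Block(second, 30);
        return alignment;
    }

    public static void main(String[] args) {
        Alignment alignment = example();
        Block firstBlock = alignment.blockSets.get(0).blocks.get(0);
        System.out.println("length = " + length(alignment));
        System.out.println("algorithm = " + algorithmOf(firstBlock));
    }
}
```

Storing shared properties once at the top and linking every child to its parent is what lets alternative alignments live in the same ensemble without duplicating the structures or the creation metadata.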

### The Optimal Alignment

In the `MultipleAlignment` data structure the aligned residues are stored in a
nested List (a List of Lists) for every `Block`. The indices of the nested List are the following:

```java
List<List<Integer>> optAln = block.getAlnRes();
Integer residue = optAln.get(chain).get(eqr);
```

The indices mean the same as in the optimal alignment of the `AFPChain`. As a
reminder:

* **chain**: chain or structure index.
* **eqr**: EQR stands for equivalent residue position, i.e. the alignment
position. There are as many positions (EQRs) in a block as the length of
the alignment block, and their number is equal for any of the chains in
the same block.

As in `AFPChain`, each entry (a combination of the two indices described above)
is an Integer that corresponds to the residue index in the specified chain, i.e.
the index in the Atom array of the chain. Code that reads the entries has to
check for gaps, because a `MultipleAlignment` can contain gaps, which are
represented as `null` in the List entries.
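
A minimal sketch of gap-aware access, in plain Java without BioJava on the classpath: the nested List below is hand-written and has the same `[chain][eqr]` layout that `block.getAlnRes()` is described as returning. The class `GapAwareBlockSketch`, its methods and the residue indices are hypothetical and serve only to show the `null` check.

```java
import java.util.Arrays;
import java.util.List;

class GapAwareBlockSketch {

    // Number of gaps (null entries) in one chain of the block.
    static int gapCount(List<Integer> chain) {
        int gaps = 0;
        for (Integer residue : chain) {
            if (residue == null) {
                gaps++;
            }
        }
        return gaps;
    }

    // Number of alignment positions where no chain has a gap.
    static int coreLength(List<List<Integer>> alnRes) {
        if (alnRes.isEmpty()) {
            return 0;
        }
        int columns = alnRes.get(0).size();
        int core = 0;
        for (int eqr = 0; eqr < columns; eqr++) {
            boolean gapFree = true;
            for (List<Integer> chain : alnRes) {
                // Compare against null before unboxing: assigning a gap
                // to an int would throw a NullPointerException.
                if (chain.get(eqr) == null) {
                    gapFree = false;
                    break;
                }
            }
            if (gapFree) {
                core++;
            }
        }
        return core;
    }

    // Hypothetical block: three chains, four alignment positions, two gaps.
    static List<List<Integer>> example() {
        return Arrays.asList(
                Arrays.asList(10, 11, 12, 13),
                Arrays.asList(20, null, 22, 23),
                Arrays.asList(null, 31, 32, 33));
    }

    public static void main(String[] args) {
        List<List<Integer>> alnRes = example();
        System.out.println("gap-free positions: " + coreLength(alnRes));
        System.out.println("gaps in second chain: " + gapCount(alnRes.get(1)));
    }
}
```

Keeping the entries as `Integer` rather than unboxing them to `int` early is the simplest way to stay safe, since the boxed type is what carries the gap information.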

### Alignment Scores

All the objects in the hierarchy implement the `ScoresCache` interface.
This interface allows the storage of any number of scores as a set of
key:value pairs. The key is a `String` that describes the score and is used
to retrieve it later, and the value is a double with the calculated score.
The interface has only two methods: `putScore` and `getScore`.

The following lines of code are an example of how to manipulate scores
on a `MultipleAlignment`:

```java
//Put a score into the alignment and get it back
alignment.putScore("myRMSD", 1.234);
double myRMSD = alignment.getScore("myRMSD");

//The same can be done for BlockSets
BlockSet bs = alignment.getBlockSets().get(0);
bs.putScore("bsRMSD", 1.234);
double bsRMSD = bs.getScore("bsRMSD");
```
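
To illustrate the key:value idea behind the calls above, here is a minimal map-backed sketch in plain Java. `ScoreCacheSketch` is a hypothetical class, not the BioJava implementation; it only mirrors the two method names mentioned in the text, and returning `null` for a score that was never stored is an assumption of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

class ScoreCacheSketch {

    // Scores keyed by a descriptive name; any number of entries can be stored.
    private final Map<String, Double> scores = new TreeMap<>();

    // Store (or overwrite) the score saved under the given name.
    void putScore(String property, Double score) {
        scores.put(property, score);
    }

    // Return the stored score, or null if nothing was stored under that name.
    Double getScore(String property) {
        return scores.get(property);
    }

    public static void main(String[] args) {
        ScoreCacheSketch cache = new ScoreCacheSketch();
        cache.putScore("myRMSD", 1.234);
        cache.putScore("myTMScore", 0.87);

        Double rmsd = cache.getScore("myRMSD");
        Double missing = cache.getScore("notStored");
        System.out.println("myRMSD = " + rmsd);
        System.out.println("notStored present: " + (missing != null));
    }
}
```

Because the keys are free-form strings, a new algorithm can attach its own score without any change to the data model; the cost is that a misspelled key is only detected at run time.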

### Manipulating Multiple Alignments

Some classes contain utility methods for manipulating a `MultipleAlignment` object.
The most important ones are listed and briefly described below:

* [MultipleAlignmentScorer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentScorer.html): contains the names of commonly used scores and methods to calculate them.

* [MultipleAlignmentTools](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentTools.html): contains helper methods, for example to calculate the sequence alignment, to transform the atom arrays of the structures, or to calculate the distances between aligned residues of all structures.

* [MultipleAlignmentWriter](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleAlignmentWriter.html): contains methods to generate different types of String outputs of the alignment, e.g. FASTA, XML, FatCat.

* [MultipleSuperimposer](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/util/MultipleSuperimposer.html): interface for implementations that calculate the structure superpositions of the alignment. Some examples of implementations are the ReferenceSuperimposer (superimposes all the structures to a reference) and the CoreSuperimposer (only uses EQRs present in all structures, without gaps, to superimpose them).

* [MultipleAlignmentXMLParser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/xml/MultipleAlignmentXMLParser.html): contains a method to create a `MultipleAlignment` object from an XML file representation.

### Overview

As an overview, the `MultipleAlignment` data model:

* Supports any number of aligned structures (**multiple structures**).
* Can support **flexible alignments** and **non-topological alignments**,
and any of their combinations (e.g. a flexible alignment with topological
rearrangements).
* Cannot support **non-sequential alignments** without requiring a new
`Block` for each EQR, because sequentiality of the residues is a requirement
for each `Block`.
* Can store **any score** at any of the four levels of the object hierarchy,
making it easy to adapt to new requirements and algorithms.

For more examples and information about the `MultipleAlignment` data structure,
see the demo package in the biojava-structure module or look through the interface
files, where the javadoc explanations can be found.

## Conversion between Data Models

The conversion from an `AFPChain` to a `MultipleAlignment` is possible through the
ensemble constructor. An example of how to do it programmatically is below:

```java
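AFPChain afpChain;        //Pairwise alignment to convert
Atom[] chain1;            //Atoms of the first structure
Atom[] chain2;            //Atoms of the second structure
boolean flexible = false; //Whether the alignment is treated as flexible
//MultipleAlignmentEnsemble is an interface, so its implementation is instantiated
MultipleAlignmentEnsemble ensemble =
    new MultipleAlignmentEnsembleImpl(afpChain, chain1, chain2, flexible);
MultipleAlignment converted = ensemble.getMultipleAlignments().get(0);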
```

There is no method to convert from a `MultipleAlignment` to an `AFPChain`, because
the first representation supports any number of structures, while the second
only supports pairwise alignments. However, the conversion can be done with a few
lines of code if needed (instantiate a new `AFPChain` and copy, one by one, the
properties of the `MultipleAlignment` that it can represent).

===

Go back to [Chapter 8 : Structure Alignments](alignment.md).