MultAlin documentation
======================
(Version 5.0, 5.1, 5.2, 5.3, 5.4)
To jump to a specific section, search for "SECTION -#-", replacing the #
with the appropriate section number.
CONTENTS
========
SECTION -0- Introduction
SECTION -1- New in the last releases
NEW in version 5.0
NEW in version 5.1
NEW in version 5.2
NEW in version 5.3, 5.3.1, 5.3.2
NEW in version 5.4
SECTION -2- Installing MultAlin
SECTION -3- Running MultAlin
A. Cautions
B. Command line mode
C. Interactive mode
SECTION -4- Algorithm
A. Similarity scores for a pair of sequences.
B. The FAST alignments (step 0).
C. The hierarchical clustering (step 1).
D. The Multiple alignment (step 2).
E. Consensus sequences and scores (step 3).
F. Iteration.
G. Profile alignment.
SECTION -5- File formats
A. Input Sequence File
B. Output Sequence File
C. Clustering Sequence File
D. Score File
SECTION -6- List of the package files
SECTION -0- Introduction
========================
Welcome to MultAlin! This software lets you align several biological
sequences simultaneously on computers that run a UNIX system.
What is a Multiple sequence alignment? It is the arrangement of several
protein or nucleic acid sequences with postulated gaps so that similar residues
are juxtaposed. A score is attached to identities, conservative or non-
conservative substitutions (the score measuring the similarity) and a penalty to
gaps; an ideal program would maximise the total score, taking account of all
possible alignments and allowing for gaps of any length at any position.
Unfortunately the computing requirements, both of time and memory, grow as
the nth power, where n is the number of sequences, so this ideal alignment can
be found only for two sequences or three short sequences. In the general case,
to be practicable, programs must restrict the conditions of the optimisation.
Nevertheless it is undeniably useful to have an automatic system available for
multiple sequence alignment to provide a starting point for a more human
analysis.
MultAlin creates a multiple sequence alignment from a group of related
sequences using progressive pairwise alignments. The method used is described in
"Multiple sequence alignment with hierarchical clustering", F.Corpet, 1988,
Nucl. Acids Res. 16 10881-10890.
SECTION -1- New in the last releases
====================================
NEW in version 5.0
------------------
Comparison tables can include negative entries. GCG tables can be used.
Gap penalty can be length dependent. Gaps at sequence extremities can be
scored or not.
NEW in version 5.1
------------------
There is a maximal number of iterations set to 10 (see F. Iteration).
A bug has been fixed that prohibited the comparison of two sequences only.
SCO and sco, CLU and clu are now valid extensions for score files and
clustering files.
Portability has been tested for more platforms.
NEW in version 5.2
------------------
The similarity coefficient at a position is still the mean of all pairwise
coefficients at this position, BUT only the sequences for which the position is
internal are counted. Example:
CCPC50 QDG DAAKGEKEFN .KCKACHMI QAPDGTDII. KGGKTGPNLY
CCRF2C ..G DAAKGEKEFN .KCKTCHSI IAPDGTEIV. KGAKTGPNLY
CCRF2S QEG DPEAGAKAFN .QCQTCHVI VDDSGTTIAG RNAKTGPNLY
CCQF2R .EG DAAAGEKVSK .KCLACHTF DQGGAN.... ...KVGPNLF
CCQF2P .AG DAAVGEKIAK AKCTACHDL NKGGPI.... ...KVGPPLF
|
MultAlin sequence # 245 5555555555 555555555 5555555555 5555555555
Clustalv sequence # 245 5555555555 155555555 5555553331 3335555555
I think that it is important to take the mean over all sequences for new
gaps to be preferentially inserted at the same position as old gaps. But this is
a problem when sequence lengths are inhomogeneous, so I have made this
modification.
NEW in version 5.3, 5.3.1, 5.3.2
--------------------------------
The pairwise scores that are used to build the clustering can now be
evaluated by three different methods:
absolute = the score is the pairwise alignment score, using the current
similarity table and gap penalties. It was the old method.
percentage = the score is the pairwise alignment score, divided by the length
of the shortest sequence.
identity = the score is the number of identical pairs, divided by the length of
the shortest sequence.
Individual weights can be assigned to each sequence in order to down-weight
near-duplicate sequences and up-weight the most divergent ones. They are
computed using the clustering tree and normalised so that their mean is 1.0.
They are written to the output file.
It is now possible to choose the order of the sequences in the output
file, as input or as aligned.
MultAlin can be used with already aligned sequences (.mul file), only to
change the output file. When the input file has a mul extension, MultAlin does
not realign the sequences, but reads the last ma.cfg file and optionally new
options to write a new alignment file.
The entries of blosum62.tab have been made non-negative by adding 4 to
each entry. It becomes the default table.
In release 5.3.1, bugs have been fixed and a new output format added (see
"doc Format").
In release 5.3.2, the mul format has been modified to become a standard
Fasta/Pearson format. In all input formats, sequences can be written with
lowercase letters.
NEW in version 5.4
------------------
The pairwise and the alignment processes are modified to handle alignment
of large families more quickly. In the alignment process, modifications are
limited to the implementation (no theoretical change). In the pairwise process,
very similar sequences (more than 80% identity) are clustered together without a
hierarchical classification and only one sequence in the cluster is compared to
the other sequences for the global classification. This drastically reduces
the pairwise step, which can be time-limiting in automatic alignments of large
families of sequences.
Since version 5.4.1, it is possible to align two groups of already aligned
sequences (profiles) or a sequence with a profile (see option -2). MultAlin can
read symbol comparison tables from the GCG package, version 9 and later. The user can
parametrise the doc format (see SECTION -5- File formats/ B. Output Sequence File/
doc Format).
SECTION -2- Installing MultAlin
===============================
See ma_c.txt
SECTION -3- Running MultAlin
=============================
A. Cautions
===========
Before aligning large sequences, you should test MultAlin with shorter
sequences and watch the system load (see the 'ps' UNIX command) during
alignment. When the swap partition of the hard disk is full, MultAlin can use an
internal swap mode on the user partition.
You can run MultAlin in two modes:
* command line mode.
* interactive mode which helps you to select program parameters and options.
Help: type 'ma -h' or 'ma -?' to obtain the help screen.
B. Command line mode
====================
Syntax:(1) ma [[<Option> ...] <SequenceFile> [<ClusterFile.xxx>]]
(2) ma [<Option> ...] <AlignmentFile.mul>
Syntax (1)
----------
If all the sequences to align are in a single file, <SequenceFile> must be its
name. If the sequences are in different files (all in the same format), you can
create a file with the list of the sequence file names, one per line. In this
case, <SequenceFile> will be @<ListFile>, where <ListFile> is the name of the
list file (do not forget the @ in front of the file name). If <SequenceFile> has
a .mul extension, it will be considered as an alignment file and syntax (2) will
be used.
<ClusterFile.clu> contains a clustering obtained by a previous FAST alignment (or
by any other method) of the same sequences. The file format is the same as in
the MultAlin MS-DOS version.
<ClusterFile.sco> contains a triangular table of alignment scores for every pair
of sequences included in SequenceFile. These scores are used by MultAlin to
compute a first clustering (instead of using Fast).
If no ClusterFile is given, the first clustering is done with Fast (Lipman and
Pearson algorithm).
Options are:
  -r                  Recover last configuration.
Input options:
  -i:<Input format>   with Format: gcg, embl, genbank,
                      mul (MultAlin=Pearson), auto(matically determined).
  -p                  select Parts of sequences.
Alignment options:
  -c:<symbol Comparison table>
  -g:<Gap value>      gap_penalty = gap_value + gap_length_value x length(gap)
  -l:<Gap length value>
  -x:<Gap at ext>     0 no penalties, 1 penalty at the end
                      2 penalty at the beginning, 3 at both extremities
  -1                  One iteration only.
  -u                  Unweighted sequences.
  -s:<Scoring method> with Scor meth: abs(olute), per(centage), ide(ntity)
Output options:
  -o:<Output format>  with Format: msf (GCG), mul (MultAlin) or doc (Word).
  -a                  output sequences ordered as input.
  -A                  aligned (default)
  -d                  Draw the clustering in the output clustering file.
  -k:<U>.<L>          U and L are minimal % for Uppercase and Lowercase consensus
  -q                  quiet: no message during alignment
Default options:
  -i:auto -c:blosum62.tab -g:12 -l:2 -o:msf -x:0 -s:abs -A -k:90.50
Recover option (-r)
--------------------
MultAlin always saves the configuration parameters used for the last
alignment:
- [InputFormat]: input format (gcg, mul, embl, genbank or undefined).
- [SymbolCompTableFile]: name of the symbol comparison table file.
- [GapValue],[Gap2Value]: penalty for gap opening and extension.
- [GapExtValue]: penalty for end gap
- [OneIter]: one iteration only (true, false ).
- [Weighted]: weighted sequences (true, false ).
- [ScoringMethod]: scoring method (absolute, percentage, identity).
- [OutputFormat]: output format (msf, mul, doc).
- [OutputOrder]: output order (aligned, input).
- [ClusteringOutputFormat]: clustering output format (list, drawing).
- [ConsensusLevel]: consensus levels.
- [OutputStyle],[LineSize],[GraduationStep]: parameters for doc output.
This information is saved in a text file 'ma.cfg'. If you want to edit
this file, you have to use the same key words.
Sequence input format (-i)
--------------------------
MultAlin automatically recognises four formats (GCG, MultAlin, GenBank,
EMBL). However, if an error message appears, you have to specify the input
format or correct the sequence syntax (see SECTION -5- File formats).
Select parts of sequences option (-p)
-------------------------------------
If you use this option, the program will pause when reading the sequence
file(s) and ask you for the sequence range to use for alignment.
Comparison symbol table file (-c)
----------------------------------
You can use MultAlin tables (.tab), or GCG package tables (.cmp). MultAlin
tests the file name for the .cmp extension to know the file format. If the file
is not present in the current directory, it is looked for in the ${MULTALIN}
directory: type "ls ${MULTALIN}/*.tab" to get the list of available tables.
Both kinds of tables contain a matrix of coefficients and gap penalties (opening
and extension), but MultAlin tables may also contain homology symbols used in
the consensus sequence (see the "dayhoff.tab" file for an example).
ATTENTION: old GCG tables (before version 9) give non-integer values:
MultAlin multiplies GCG values by 10 to use integer values. Gap penalties must
also be multiplied by 10. For example, if you use "nwsgappep.cmp" with a gap
value of 5 and a gap length weight of 0.1 in GCG programs, you must use it with
a gap value of 50 and a gap length weight of 1 in MultAlin.
Gap value (-g)
---------------
Gap length weight (-l)
----------------------
The score of an alignment is equal to the sum of the values of the
matches, each one scored with the comparison table, less the gap value times the
number of internal gaps and less the gap length weight times the total length of
the internal gaps. There is no penalty for end gaps unless the option -x is set
to a non-0 value.
Default values are -g:12 -l:2
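The scoring rule above can be illustrated with a minimal sketch. This is not
MultAlin's own code: the function name alignment_score and the dictionary
layout of the comparison table are assumptions made for the example. Gaps are
written '-', and columns before the first or after the last residue pair are
treated as end gaps and left unpenalised, as with -x:0.

```python
def alignment_score(a, b, table, gap_value=12, gap_length_value=2):
    """Score two aligned strings of equal length; '-' marks a gap.

    End gaps are free (as with -x:0); each internal gap costs
    gap_value + gap_length_value * length.
    `table` maps a pair of symbols (x, y) to a score (hypothetical layout).
    """
    assert len(a) == len(b)
    # Columns where both sequences carry a residue.
    both = [i for i in range(len(a)) if a[i] != "-" and b[i] != "-"]
    if not both:
        return 0
    first, last = both[0], both[-1]
    score = 0
    gap_in = None  # which sequence holds the currently open gap, if any
    for i in range(first, last + 1):
        x, y = a[i], b[i]
        if x == "-" and y == "-":
            continue  # assumption: a column of two gaps is ignored
        if x != "-" and y != "-":
            score += table[(x, y)]
            gap_in = None
        else:
            holder = "a" if x == "-" else "b"
            if holder != gap_in:
                score -= gap_value  # a new internal gap is opened
                gap_in = holder
            score -= gap_length_value  # one more gap position
    return score
```

With a toy table giving 4 for identical symbols and 0 otherwise, the pair
ACG-TA / ACGTTA has five matches (20) and one internal gap of length 1
(12 + 2), so the function returns 6.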
Gap at extremities (-x)
-----------------------
With this option, it is possible to weight end gaps like all other gaps:
-x:1 a gap at the end is weighted
-x:2 a gap at the beginning is weighted
-x:3 both end gaps are weighted
-x:0 end gaps are not weighted (default).
One iteration only option (-1)
-------------------------------
With this option, final alignment can be obtained more quickly, but it may
not be the best possible alignment.
Unweighted sequences option (-u)
--------------------------------
By default, sequences are weighted. Use this option to give them all a
weight of 1.0.
Scoring method (-s)
-----------------------
With this option, it is possible to choose how the pairwise scores are
computed:
-s:abs absolute alignment score (default)
-s:per percentage alignment score
-s:ide percentage of identical pairs
Sequence output format (-o)
---------------------------
You can save the alignment result in MSF (default) or MultAlin format.
In both formats, sequences are saved in only one file. If MultAlin format is
chosen, a consensus sequence is saved in a second file, with extension .con
A third format, doc, has been added in version 5.3. It is an MSF file with
indications for a Microsoft Word Macro to add colours for conserved regions (see
SECTION -5- File formats/ B. Output Sequence File/ doc Format).
Sequence output order (-a or -A)
--------------------------------
By default, the sequences are ordered in the output file as they are
aligned. If you want them to be ordered as they were in the input file use the
option -a. If your input file is a mul file with sequences ordered as the
previous input file, use the option -A to get the sequences ordered as aligned.
Clustering output file (-d)
--------------------------
When MultAlin calculates the first clustering order (with Fast or from a
score table), it is saved in a file with .clu extension. The clustering order
used for the last iteration is saved in a file with .cl2 extension.
By default, the format of these files is a list of the sequence names with
parentheses indicating how the clustering is done. In this case, the files can
be used as the input cluster file for another alignment.
The clustering order can also be saved as a dendrogram. The drawing uses
text characters and can be printed to any printer or screen. Use the -d option
for this format.
In the interactive mode, answer l(ist) for the first format (default) and
d(rawing) for the other.
Consensus values option (-k)
----------------------------
At the end of the alignment, a consensus sequence is computed. For each
column in the alignment, the most representative residue is chosen. If it is
present in more than U% of the sequences, an Uppercase character is written;
else if it is present in more than L% of the sequences, a Lowercase character
is written; else a blank character is written.
Default values are U=90 and L=50, i.e. -k:90.50
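The two thresholds can be sketched as follows. This is only an illustration
under stated assumptions, not the program's implementation: the function name
consensus is hypothetical, the "most representative residue" is taken to be
simply the most frequent letter, gap symbols ('.' or '-') are never chosen but
still count in the total number of sequences, and the conservative
substitution symbols of the comparison table are ignored.

```python
from collections import Counter


def consensus(rows, upper=90, lower=50):
    """Build a consensus string from equal-length aligned rows."""
    out = []
    for column in zip(*rows):
        # Assumption: gap symbols are never the consensus residue.
        residues = [c for c in column if c not in ".-"]
        if not residues:
            out.append(" ")
            continue
        symbol, count = Counter(residues).most_common(1)[0]
        percent = 100.0 * count / len(column)
        if percent > upper:
            out.append(symbol.upper())
        elif percent > lower:
            out.append(symbol.lower())
        else:
            out.append(" ")
    return "".join(out)
```

For the four rows ACGT, ACGA, ACTC and A-TG with the default levels, the first
column (100% A) gives 'A', the second (75% C) gives 'c', and the last two
columns give blanks because no residue exceeds 50%.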
Quiet (-q)
----------
With this option, no messages are displayed to the screen.
Syntax (2)
----------
ma [<Option> ...] <AlignmentFile.mul>
<AlignmentFile.mul> is an output file of a previous run of MultAlin in
MultAlin format and with .mul extension. By default, the alignment will not be
changed but new outputs can be obtained with different <Option> (-o, -f, -k and
-q).
Profile alignment option (-2)
----------------------------
With this option, the <AlignmentFile> is cut into two profiles that are
realigned. The value n of the option is the number of the first sequence in the
second profile; the first profile corresponds to aligned sequences 1 to n-1 and
the second profile corresponds to aligned sequences n to the end of
<AlignmentFile>. Sequence weights are read from the file or set to 1 if
unreadable. This new alignment is performed with the last used parameters (as
with option -r) or with new options given on the command line (-c, -g, -l, -x).
C. Interactive mode
===================
Type 'ma' without any option and answer the questions: they correspond to
the options that can be set with the command line mode. Only the principal
options are automatically proposed; you must ask for more options (input,
alignment & output options) to get the others.
SECTION -4- Algorithm
=====================
The alignment of two sequences can be performed with any program that has
been developed since 1970. However, a rigorously optimal alignment of even a
small number of short sequences is currently intractable because of the amount
of time and memory that would be necessary.
MultAlin proposes an alternative approach that sacrifices a SMALL amount
of sensitivity for a high degree of computational tractability. MultAlin does a
series of progressive pairwise alignments between sequences and clusters of
sequences. A cluster consists of two or more already aligned sequences.
MultAlin begins by computing similarity scores for every possible pair of
sequences using a fast algorithm (step 0). These scores are used to create a
hierarchical clustering represented by a dendrogram (step 1). This clustering
shows the order of the pairwise alignments that are performed to produce the
final alignment (step 2). A consensus sequence and pairwise scores are computed
(step 3). To achieve step 3, the scores of all the pairwise alignments included
in the multiple one are computed and they can be used to do step 1 again; if the
clustering order is different, a new multiple alignment can be done following
this new clustering (steps 2 and 3). This process can be iterated until the
clustering order remains unchanged by iteration.
A. Similarity score for a pair of sequences
===========================================
This score is equal to the sum of the values of the matches (each match
scored with the scoring table) less the gap penalty times the number of the
internal gaps and less the gap length weight times the total length of internal
gaps. By default, no penalty is charged for terminal gaps.
An optimal alignment is one with the maximum possible score. It is sensitive to
the symbol comparison values and to the gap penalty. Once this optimal alignment
is computed, three different similarity scores can be computed (see New in 5.3).
By default the pairwise score is the alignment score.
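As a minimal sketch of the three pairwise scores (absolute, percentage,
identity), assuming the optimal alignment and its score are already known: the
function name pairwise_scores is hypothetical, gaps are written '-', and the
results are returned as plain ratios because the text does not say whether
MultAlin multiplies them by 100.

```python
def pairwise_scores(a, b, alignment_score):
    """Return the three pairwise scores for two aligned strings.

    a, b: aligned sequences of equal length, '-' marks a gap.
    alignment_score: score of this alignment, computed elsewhere.
    """
    len_a = sum(c != "-" for c in a)
    len_b = sum(c != "-" for c in b)
    shortest = min(len_a, len_b)
    # Identical pairs: same residue in both rows, gaps excluded.
    identical = sum(x == y and x != "-" for x, y in zip(a, b))
    return {
        "absolute": alignment_score,
        "percentage": alignment_score / shortest,
        "identity": identical / shortest,
    }
```

For ACG-TA / ACGTTA with an alignment score of 6, the shortest sequence has 5
residues, so the percentage score is 1.2 and the identity score is 1.0.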
B. The FAST alignments (step 0)
===============================
To initiate the process of aligning N sequences, MultAlin must compute N
times (N-1) similarity scores. It would take too much time to look for the
optimal score. Therefore, MultAlin uses an algorithm from Lipman and Pearson
(1985, Science 227, 1435-1441) that gives a score that is not guaranteed to be
the maximum one (because the length of the gaps that can be inserted in each
sequence is limited) but that can quickly measure the similarity between two
sequences.
The first step finds the diagonals having the largest number of short
perfect matches (words). The word length depends on the size of the alphabet
used to describe the sequences:
Alphabet size   Word length   Scoring factor
2 or 3               6              24
4 to 7               4              16
8 to 15              3              12
16 to 31             2               8
If a word from the second sequence exists in the first one, Fast adds a
score for the word to the score of the diagonal on which the word occurs. This
added score is equal to the word length times a scoring factor (if a word match
overlaps another word on the same diagonal, only the scoring factor for
non-overlapping symbols is added). When the diagonal is not new, a factor equal
to the length between the two perfect matches decreases the score, unless the
score becomes negative. In this latter case, the two regions are considered as
different and are scored separately.
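The word lookup behind this first step can be sketched as follows. This is a
simplified illustration, not the Fast code itself: it only counts word hits
per diagonal and leaves out the scoring factor, the overlap correction and the
distance penalty described above. The function names word_length and
diagonal_hits are hypothetical.

```python
from collections import defaultdict


def word_length(alphabet_size):
    """Word length for a given alphabet size, following the table above."""
    for low, high, k in ((2, 3, 6), (4, 7, 4), (8, 15, 3), (16, 31, 2)):
        if low <= alphabet_size <= high:
            return k
    raise ValueError("alphabet size outside the table")


def diagonal_hits(seq1, seq2, k):
    """Count the k-letter words shared by seq1 and seq2 on each diagonal.

    The diagonal of a hit is (position in seq1) - (position in seq2).
    """
    # Index every word of the first sequence by its start positions.
    positions = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        positions[seq1[i:i + k]].append(i)
    # Look up every word of the second sequence in that index.
    hits = defaultdict(int)
    for j in range(len(seq2) - k + 1):
        for i in positions.get(seq2[j:j + k], ()):
            hits[i - j] += 1
    return dict(hits)
```

For ACGTACGT and TTACGT with 4-letter words, diagonal 2 collects two hits
(TACG and ACGT) and diagonal -2 collects one, so diagonal 2 would be the
candidate for re-scoring.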
In a second step, the five best diagonal regions are re-scored using the
symbol comparison table; the region with the best new score determines the
best diagonal.
In the last step, an alignment of the two sequences is performed around
the best diagonal, with a 31-symbol-wide window, using the symbol comparison
table and the gap penalty given for the multiple alignment. Actually, the
alignment is not produced, only its score is computed.
This last score is entered in a similarity matrix that records the
similarity scores of all possible pairs of sequences.
C. The hierarchical clustering (step 1)
=======================================
The approach used by MultAlin is sensitive to the order in which sequences
are aligned. The multiple alignment will be better if the closest sequences are
aligned first to each other and then to less similar sequences. A clustering
algorithm determines the order of these alignments, from the pairwise similarity
scores calculated at the previous step.
The method used by MultAlin is called UPGMA, for unweighted pair-group
method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in
Numerical Taxonomy, 230-234, Freeman, San Francisco). The hierarchy is built
from its base (the set of sequences), creating at each step a new cluster by
union of the two closest clusters or sequences. The distance
between two clusters or sequences is given by the matrix of similarity, which is
updated at each step of the clustering as follows. The similarity score between
the new cluster and a cluster i is the arithmetic mean of the similarity scores
between each of the two clusters whose union creates it and the cluster i. At
each step, the number of clusters decreases by one until there is only one
cluster.
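The clustering loop can be sketched as follows, as an illustration rather than
MultAlin's code. The function name cluster and the frozenset-keyed similarity
dictionary are assumptions made for the example. The update follows the rule
as worded above (plain mean of the two old scores); weighting that mean by
cluster size is another common reading of UPGMA, and the text does not settle
which one the program uses.

```python
def cluster(names, sim):
    """Agglomerate by highest similarity; return the nested join order.

    sim maps frozenset({a, b}) to the similarity of items a and b.
    Names must be unique and hashable (e.g. strings).
    """
    sim = dict(sim)  # work on a copy
    items = list(names)
    while len(items) > 1:
        # Pick the most similar pair of current clusters.
        a, b = max(
            ((x, y) for i, x in enumerate(items) for y in items[i + 1:]),
            key=lambda pair: sim[frozenset(pair)],
        )
        merged = (a, b)
        rest = [c for c in items if c != a and c != b]
        for c in rest:
            # Update rule as worded above: mean of the two old scores.
            sim[frozenset((merged, c))] = (
                sim[frozenset((a, c))] + sim[frozenset((b, c))]
            ) / 2
        items = rest + [merged]
    return items[0]
```

The nested tuple that is returned plays the role of the parenthesised list in
the clustering file: the innermost pairs are aligned first.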
The clustering can be represented as a dendrogram that is NOT a
phylogenetic tree, although the horizontal branch lengths are proportional to
the similarity scores. The only purpose of the dendrogram is to produce a
clustering order to create the multiple alignment. This order is the only
information from the clustering used by MultAlin.
D. The Multiple alignment (step 2)
==================================
The clustering order is used as follows. The two most similar sequences are
aligned to produce the first cluster. Then MultAlin aligns the next most related
sequence to this cluster, or the next two most similar sequences to each other
to produce another cluster. A series of such pairwise alignments of clusters
includes more and more sequences until there is only one cluster of all the
sequences and the final alignment is produced.
To align two sequences, MultAlin uses an algorithm from Needleman and
Wunsch (1970, J. Mol. Biol. 48, 443-453) that finds the maximum
possible score between two sequences and one alignment that corresponds to it.
The method consists in building a matrix with one sequence across the top side
and the other one down the left side. A path in this matrix can represent an
alignment between the sequences: a path is a broken line joining the top row or
the left column to the bottom row or the right column. The segments of this
broken line are either parallel to the diagonal or parallel to one edge (for a
gap). An optimal path corresponds to an optimal alignment. The idea of the
method is to recursively find optimal paths beginning at each cell of the matrix
and to record their scores in the matrix at the beginning position of the path.
Once the matrix has been calculated, a traceback procedure is performed to find
the successive cells of the best path. Its first cell is the one with the
maximum value in the top row or the left column. This value is the score of the
alignment.
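The matrix computation can be sketched as follows, under simplifying
assumptions that are not MultAlin's: every internal gap position costs a single
per-symbol penalty (no separate gap value), end gaps are free, only the score
is returned (no traceback), and the matrix is filled from the start of the
sequences rather than from the end, which gives the same best score. The
function name best_score and the table layout are hypothetical.

```python
def best_score(a, b, table, gap=2):
    """Best alignment score of a and b with unpenalised end gaps.

    Simplification: every internal gap symbol costs `gap`; there is no
    separate gap-opening value.
    `table` maps a pair of symbols (x, y) to a score.
    """
    rows, cols = len(a), len(b)
    # score[i][j]: best score of a path reaching cell (i, j); row 0 and
    # column 0 stay at 0, which makes leading end gaps free.
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + table[(a[i - 1], b[j - 1])],  # match
                score[i - 1][j] - gap,  # gap in b
                score[i][j - 1] - gap,  # gap in a
            )
    # Free trailing gaps: the path may end anywhere on the last row/column.
    return max(max(score[rows]), max(row[cols] for row in score))
```

With a toy table giving 4 for identical symbols and 0 otherwise, ACGTTA against
ACGTA scores 18: five matches and one internal gap position.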
To align two clusters (or a cluster and a sequence), MultAlin uses an
extension of this method. For an alignment of two individual sequences, the
comparison score between any two sequence symbols is found in a symbol
comparison table. For an alignment of clusters of aligned sequences, the
comparison score between any two positions in these clusters is the arithmetic
average of the scores of all possible symbol comparisons at these positions. In
this average, the sequence weights are used as multiplication factors.
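A minimal sketch of this averaged comparison score, under assumptions not
stated in the text: each symbol comparison is multiplied by the product of the
two sequence weights, the sum is divided by the sum of those products, and
neither column contains a gap symbol. The function name position_score and
the table layout are hypothetical.

```python
def position_score(column_a, weights_a, column_b, weights_b, table):
    """Weighted mean of all symbol comparisons between two columns.

    column_a, column_b: the symbols found at one position of each cluster.
    weights_a, weights_b: the weights of the corresponding sequences.
    Assumption: the columns contain residues only (no gap symbols).
    """
    total = 0.0
    norm = 0.0
    for x, wx in zip(column_a, weights_a):
        for y, wy in zip(column_b, weights_b):
            total += wx * wy * table[(x, y)]
            norm += wx * wy
    return total / norm
```

With all weights equal to 1 this reduces to the plain arithmetic mean of the
comparison scores of all symbol pairs.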
E. Consensus sequences and scores (step 3)
=========================================
MultAlin generates a consensus sequence by finding at each position a
character that is a function of the characters in all the sequences at that
position. If more than 90% of the characters in the column are the same letter,
the consensus character is this letter. If more than 90% of the characters in
the column are in the same conservative substitution cluster, the consensus
character is this cluster symbol. In both cases, the consensus character and the
characters that belong to its symbol cluster are represented with an uppercase
letter. If more than 50% of the characters in the column are the same letter,
the consensus character is this letter. If more than 50% of the characters in
the column are in the same conservative substitution cluster, the consensus
character is this cluster symbol. In both cases, the consensus character and the
characters that belong to its symbol cluster are represented with a lowercase
letter. If none of these conditions is true, the consensus character is a space.
The consensus levels (90 and 50) can be modified in the configuration dialog or
file (ma.cfg).
The multiple alignment produces an alignment for each pair of sequences
that are included in it. These pairwise alignments are not necessarily optimal
pairwise alignments because they must be mutually compatible. They are
used to compute a new clustering of the sequences (new step 1).
F. Iteration
============
The complete process can be represented by the following figure:
Method step   0 & 1     2     3 & 1     2     3 & 1 ...
              Clust    Mult   Clust    Mult   Clust ...
pass #          0       1       1       2       2   ...
where: Clust= Clustering order; Mult= Multiple alignment.
The iterations stop when two successive clustering orders are identical.
If the first clustering order is calculated with the Fast algorithm (step 0),
the process usually converges after one or two passes. With an arbitrary first
clustering order, more passes can be necessary (4 to 6). In some rare cases, the
process can oscillate: if it is between two clustering orders, MultAlin detects
it and stops the process; if the cycle is bigger, the user must interrupt the
process manually (hitting Ctrl-C). Since version 5.1, the maximal number of
iterations is set to 10.
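The stopping rules can be sketched as a small driver loop. This is only an
illustration: iterate, align and recluster are hypothetical names, with
align(order) standing for step 2 and recluster(alignment) for steps 3 and 1.

```python
def iterate(first_order, align, recluster, max_iterations=10):
    """Repeat align/recluster until the clustering order stops changing.

    align(order) returns a multiple alignment built in that order;
    recluster(alignment) returns the clustering order derived from it.
    Both are placeholders supplied by the caller.
    """
    previous = None
    order = first_order
    alignment = None
    for _ in range(max_iterations):
        alignment = align(order)
        new_order = recluster(alignment)
        # Stop on convergence, or on an oscillation between two orders.
        if new_order == order or new_order == previous:
            break
        previous, order = order, new_order
    return alignment
```

A cycle through three or more clustering orders is not detected by this
comparison; in that case the loop simply ends after max_iterations passes.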
G. Profile alignment
====================
Since version 5.4, MultAlin can align two profiles or one sequence with a
profile. A profile is a set of already aligned sequences. Profiles are read from
an alignment file that can be the output file of a previous alignment, in
MultAlin format. The value n of the -2 option (-2:n) is the number of the first
sequence of the second profile. Each sequence gets the weight it has in the
file, normalised so that the mean weight in each profile is 1. Positions that
are all '.' or '-' in a profile are deleted from the profile. The two profiles
are aligned as the last two clusters of a standard multiple alignment. There is
no iteration. A consensus sequence is derived and outputs are created as before.
In the special case of an alignment of one sequence with all the others (n = 2 or
n is the last sequence), the sequence weights are modified to try to give more
weight to sequences that are similar to the lonely sequence. This method is
inspired by O'Brien (E.A. O'Brien, C. Notredame and D. G. Higgins, 1998,
Optimization of ribosomal RNA profile alignments, Bioinformatics, 14, 332-341).
The lonely sequence has weight 1. The other sequence weights are normalised so
that their mean is 1. A first alignment is done with these weights. The percent
difference of all profile sequences with the lonely sequence is estimated with
this alignment. The new weights of the profile sequences are the product of
their original weight by the reciprocal of the percent difference with the
lonely sequence. Weights are then normalised again so that their mean is 1.
These new weights are used during the alignment but the original ones are used
to derive the consensus sequence. In the output file, the lonely sequence is
printed next to its nearest sequence in the profile (use -a option to keep the
sequence at its original place).
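The weight update for the profile sequences can be sketched as follows. The
function name reweight is hypothetical, and the sketch assumes every percent
difference is greater than 0; how a profile sequence identical to the lonely
sequence is handled is not described in the text.

```python
def reweight(weights, percent_difference):
    """New profile weights for aligning one sequence against a profile.

    weights[i] is the original weight of profile sequence i and
    percent_difference[i] (> 0) its percent difference with the lonely
    sequence, estimated from the first alignment.
    """
    # Product of the original weight by the reciprocal of the difference.
    raw = [w / d for w, d in zip(weights, percent_difference)]
    # Normalise so that the mean weight is 1.
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]
```

For two profile sequences of weight 1.0 that differ from the lonely sequence
by 10% and 30%, the new weights are 1.5 and 0.5.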
SECTION -5- File formats
========================
A. Input Sequence File
======================
MultAlin Format (or Fasta/Pearson format)
-----------------------------------------
> SeqName the sequence name is the
> first word of the first comment line
> max: 8 letters
> comment lines begin with >
AAAACCGTTAAA
> SeqNam2 the 2nd sequence beginning
> shows the end of the first one
AAACCTGGAC
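A reader for this format can be sketched as follows. It is a minimal
illustration, not the parser used by MultAlin: the function name read_mul is
hypothetical, the 8-letter limit on names is not enforced, and sequence lines
are taken as they are, without checking the symbols.

```python
def read_mul(text):
    """Parse MultAlin (Fasta/Pearson) text into a list of (name, sequence)."""
    records = []
    in_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if not in_header:
                # First comment line of a new record: its first word
                # is the sequence name.
                words = line[1:].split()
                records.append([words[0] if words else "", []])
                in_header = True
            # Further '>' lines are comments and are skipped.
        else:
            in_header = False
            if not records:
                raise ValueError("sequence data before the first '>' line")
            records[-1][1].append(line)
    return [(name, "".join(parts)) for name, parts in records]
```

Applied to the example above, the function returns the two records
("SeqName", "AAAACCGTTAAA") and ("SeqNam2", "AAACCTGGAC").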
GenBank Format
--------------
LOCUS SeqName
any lines
ORIGIN anything
1 aggtcccttt tgtgttgttt
The sequence name is the first word after the LOCUS key word. The sequence
begins on the line following the ORIGIN key word. The next sequence information
begins with the LOCUS key word.
EMBL Format (flat file) (or Swiss-Prot format)
------------------------
ID SeqName
any lines
SQ anything
aauccagug gagaucaaag
any sequence lines
//
The sequence name is the first word after the ID key word. The sequence begins
on the line following the SQ key word. The next sequence information begins on
the line following //
GCG Format
-----------
Only one sequence, whose name is the file name. The sequence begins on the
line that follows '..' Comments between <, > or $ are deleted.
B. Output Sequence File
=======================
mul Format
----------
This format is the same as the MultAlin input format. A '-' is inserted in
each sequence at a gap position, e.g.:
>CCPC50 129 Weight: 0.68
QDGDAAKGEKEFN-KCKACHMIQAPDGTDII-KGGKTGPNLYGVVGRKIA
SEEGFK-YGEGILEVAEKNPDLTWTEADLIEYVTDPKPWLVKMTDDKGAK
TKMTFKMGKNQA--DVVAFLAQNSPDAGGDGEAA
>CCRF2C 116 Weight: 0.68
--GDAAKGEKEFN-KCKTCHSIIAPDGTEIV-KGAKTGPNLYGVVGRTAG...
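A short sketch of how one record of a mul file could be read back is given
below. The function name read_mul_record is hypothetical. In the example
above, the second header field (129) equals the number of residues left once
the gaps are removed; reading it as the ungapped length is an inference from
that example, not something this documentation states.

```python
def read_mul_record(lines):
    """Parse one mul record: a '>' header line, then gapped sequence lines."""
    header = lines[0].lstrip(">").split()
    name = header[0]
    weight = None
    if "Weight:" in header:
        weight = float(header[header.index("Weight:") + 1])
    aligned = "".join(line.strip() for line in lines[1:])
    ungapped = aligned.replace("-", "")  # '-' marks a gap position
    return name, weight, aligned, ungapped


record = [
    ">CCPC50 129 Weight: 0.68",
    "QDGDAAKGEKEFN-KCKACHMIQAPDGTDII-KGGKTGPNLYGVVGRKIA",
    "SEEGFK-YGEGILEVAEKNPDLTWTEADLIEYVTDPKPWLVKMTDDKGAK",
    "TKMTFKMGKNQA--DVVAFLAQNSPDAGGDGEAA",
]
name, weight, aligned, ungapped = read_mul_record(record)
```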
msf Format
----------
This is the GCG Multiple Sequence File format, for example:
Symbol comparison table: blosum62
Gap weight: 12
Gap length weight: 2
Consensus symbols:
! is anyone of IV
$ is anyone of LM
% is anyone of FY
# is anyone of NDQEBZ
MSF: 134 Check: 0 ..
Name: CCPC50 Len: 134 Check: 7173 Weight: 0.68
Name: CCRF2C Len: 134 Check: 1222 Weight: 0.68
Name: CCRF2S Len: 134 Check: 8544 Weight: 1.39
Name: CCQF2R Len: 134 Check: 9048 Weight: 1.12
Name: CCQF2P Len: 134 Check: 1873 Weight: 1.12
Name: Consensus Len: 134 Check: 5858 Weight: 0.00
//
1 50
CCPC50 QDGDAAKGEK EFN.KCKACH MIQAPDGTDI I.KGGKTGPN LYGVVGRKIA
CCRF2C ..GDAAKGEK EFN.KCKTCH SIIAPDGTEI V.KGAKTGPN LYGVVGRTAG
CCRF2S QEGDPEAGAK AFN.QCQTCH VIVDDSGTTI AGRNAKTGPN LYGVVGRTAG
CCQF2R .EGDAAAGEK VSK.KCLACH TFDQ...... .GGANKVGPN LFGVFENTAA
CCQF2P .AGDAAVGEK IAKAKCTACH DLNK...... .GGPIKVGPP LFGVFGRTTG
Consensus .eGDaaaGeK .fn.kC.aCH .i....gt.i .g...KtGPn LyGVvgrtag
51 100
CCPC50 SEEGFK.YGE GILEVAEKNP DLTWTEADLI EYVTDPKPWL VKMTDDKGAK...
doc Format
----------
Symbol comparison table: blosum62
Gap weight: 12
Gap length weight: 2
Consensus symbols:
! is anyone of IV
$ is anyone of LM
% is anyone of FY
# is anyone of NDQEBZ
MSF: 134 Check: 0 ..
Name: CCPC50 Len: 134 Check: 7173 Weight: 0.68
Name: CCRF2C Len: 134 Check: 1222 Weight: 0.68
Name: CCRF2S Len: 134 Check: 8544 Weight: 1.39
Name: CCQF2R Len: 134 Check: 9048 Weight: 1.12
Name: CCQF2P Len: 134 Check: 1873 Weight: 1.12
Name: Consensus Len: 134 Check: 7880 Weight: 0.00
//
1 50
CCPC50 Q(D)[GD](AA)K[G](E)[K] E(FN)-(K)[C]K(A)[CH] M(I)QAPD(GT)D(I) I-KG...
CCRF2C [GD](AA)K[G](E)[K] E(FN)-(K)[C]KT[CH] S(I)IAPD(GT)E(I) V-KGA[K]...
CCRF2S Q(E)[GD]PE(A)[G]A[K] A(FN)-Q[C]QT[CH] V(I)VDDS(GT)T(I) A(G)RNA[K]...
CCQF2R (E)[GD](AAA)[G](E)[K] VSK-(K)[C]L(A)[CH] TFDQ------ -(G)GAN[K]V[...
CCQF2P A[GD](AA)V[G](E)[K] IAKA(K)[C]T(A)[CH] DLNK------ -(G)GPI[K]V[GP...
Consensus (e)[GD](aaa)[G](e)[K] (fn) (k)[C] (a)[CH] (i) (gt) (i) (g)...
51 100
CCPC50 SEEG[F](K)-[Y]G(E) (G)IL(E)VAE(K)NP D(LT)[W]T[E]AD(L)I E[Y](V)T[D...
Once the colour indications, () and [], are translated to true colours,
this page is similar to an msf page. Each line holds 50 residues in blocks of
10 residues. Highly conserved positions (marked by []) are red and weakly
conserved ones (marked by ()) are blue.
It is possible to parametrise the doc output with 3 parameters in the
configuration file (edit ma.cfg and use -r option).
LineSize : is the number of residues per line (default 50)
GraduationStep : is the number of residues per block in a line (default 10)
OutputStyle = [Normal | Case | Difference ]
Normal (default)
In all sequences, all positions are in upper-case.
Case
All the positions in each sequence that are identical with the
consensus are in upper-case, the other positions are in lower-case.
CCQF2P aGDAAvGEK iakaKCtACH dlnkggpi-- -----KvGPp LFGVfGRTtG TfagYs-Ysp Gyt
Consensus ..GDaa.GeK .fn.kC.aCH .i....gt.i .....KtGPn L%GVvgrtag t...%k.Y.e g..
Difference
The first sequence is printed normally; in the other sequences, a residue
identical to the first sequence's residue at the same position is shown as a
point (.), and the others are in upper-case.
CCPC50 QDGDAAKGEK EFN-KCKACH MIQAPDGTDI I-KGGKTGPN LYGVVGRKIA SEEGFK-YGE GIL
CCRF2C ........ ...-...T.. S.I.....E. V-..A..... .......TAG TYPE..-.KD S.V
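The Case and Difference styles can be sketched as below. This is an
illustration, not the MultAlin implementation: the function names are
hypothetical, and two rules are inferred from the examples above rather than
stated in the text, namely that the comparison with the consensus ignores case
and that a consensus symbol (!, $, %, #) counts as identical to any residue of
its group.

```python
# Consensus symbol groups, as listed in the msf and doc headers.
GROUPS = {"!": "IV", "$": "LM", "%": "FY", "#": "NDQEBZ"}


def matches(residue, cons):
    """True if a residue agrees with the consensus character at its position."""
    residue = residue.upper()
    if not residue.isalpha():
        return False  # gaps never match
    return residue == cons.upper() or residue in GROUPS.get(cons, "")


def case_style(seq, consensus):
    """Upper-case where the residue agrees with the consensus, else lower-case."""
    return "".join(
        r.upper() if matches(r, c) else r.lower()
        for r, c in zip(seq, consensus)
    )


def difference_style(first, other):
    """Show residues identical to the first sequence as '.', others upper-case."""
    return "".join(
        "." if o.isalpha() and o.upper() == f.upper() else o.upper()
        for f, o in zip(first, other)
    )


# Blocks taken from the examples above.
case_block = case_style("LFGVFGRTTG", "L%GVvgrtag")
diff_block = difference_style("EFN-KCKACH", "EFN-KCKTCH")
```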
To translate the colour indications, () and [], to true colours, you can
use Microsoft Word and the MultAlin macro, included in MultAlin.dot, as follows:
Open your .doc file with Microsoft Word (File/Open)
Change the templates (File/Models... or Tools/Models, Link..., search the disk to
select MultAlin.dot, Open)
Run the MultAlin macro (Tools/Macro..., select MultAlin, Run)
You can also add the MultAlin macro to your current model (Normal.dot):
Tools/Macro..., Organizer, Close File then Open File (on the same button),
search the disk to select MultAlin.dot, Open, select MultAlin, Copy >> into
Normal.dot, Close
You can translate the doc format file to an html file as follows (this
information is also available in doc2html.txt):
- edit the doc file (only the part from the '//' line onwards changes):
    find the line that begins with //
    add the following line
        </pre><pre class=seq><A NAME='Alignment'></A>
    for each line between current line and end of file
        replace all '][' by nothing
        replace all ')(' by nothing
        replace all '[' by '<em class=high>'
        replace all '(' by '<em class=low>'
        replace all ']' by '</em>'
        replace all ')' by '</em>'
- edit head.html if you want other colours than the default ones
- concatenate the three files
Unix: cat head.html myfile.doc tail.html >myfile.html
Dos: copy head.html + myfile.doc + tail.html myfile.html
To make this translation, you can use one of the following scripts:
- doc2html.csh, a C-shell script
- doc2html.pl, a Perl script.
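For readers without csh or Perl, the editing steps listed above can also be
sketched in Python. This is not a port of the distributed scripts, whose
contents are not shown here: the function names are hypothetical, and placing
the added line directly after the '//' line is an assumption, since the text
only says to add it.

```python
ALIGNMENT_MARK = "</pre><pre class=seq><A NAME='Alignment'></A>"


def doc_to_html_body(doc_text):
    """Apply the doc-to-html replacements to the lines that follow '//'."""
    out, in_alignment = [], False
    for line in doc_text.splitlines():
        if in_alignment:
            # Merge adjacent marks first, then turn the brackets into tags.
            line = line.replace("][", "").replace(")(", "")
            line = line.replace("[", "<em class=high>")
            line = line.replace("(", "<em class=low>")
            line = line.replace("]", "</em>").replace(")", "</em>")
            out.append(line)
        else:
            out.append(line)
            if line.startswith("//"):
                # Assumption: the extra line goes right after the '//' line.
                out.append(ALIGNMENT_MARK)
                in_alignment = True
    return "\n".join(out) + "\n"


def build_page(head, doc_text, tail):
    """Concatenate the three parts, like the cat / copy commands above."""
    return head + doc_to_html_body(doc_text) + tail


html = doc_to_html_body("Header (x)\n//\nSEQ [G][D](a)(b)x\n")
```

Here head and tail stand for the contents of head.html and tail.html, read by
the caller.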
C. Clustering Sequence File
===========================
clu file as a list
------------------
This is a standard tree format, for example:
(((CCPC50,CCRF2C),CCRF2S),(CCQF2R,CCQF2P));
It can be used as a clustering input file.
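A list of this kind can be read with a small recursive parser, sketched below.
The function names are hypothetical, and the sketch only handles the plain
parenthesised form shown above (leaf names, commas and parentheses, no branch
lengths); whether MultAlin accepts anything richer is not stated here.

```python
def parse_tree(text):
    """Parse '((A,B),C);' into nested tuples of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip '('
            children = [node()]
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(node())
            pos += 1  # skip ')'
            return tuple(children)
        # Otherwise read a leaf name up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].strip()

    return node()


def leaves(tree):
    """List the leaf names of a parsed tree, from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = parse_tree("(((CCPC50,CCRF2C),CCRF2S),(CCQF2R,CCQF2P));")
```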
clu file as a drawing
---------------------
This is a schematic drawing of the clustering tree, for example:
CCPC50 +-------------------+-------------------------------------------------+
CCRF2C + | |
CCRF2S --------------------+ |
CCQF2R ----------+-----------------------------------------------------------+
CCQF2P ----------+
D. Score File
=============
You can build your own score file; MultAlin will use it to compute its
first clustering. Here is an example that shows the format:
CCPC50
CCRF2S 1257
CCRF2C 1273 1245
CCQF2R 1136 1173 1134
CCQF2P 1098 1143 1122 1255
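The example reads as a lower triangular matrix: each line gives a sequence
name followed by one score for every sequence listed before it. A minimal
reader under that reading is sketched below. The function name parse_scores is
hypothetical, and the column order (the i-th score on a line belongs to the
i-th earlier sequence) is an assumption drawn from the layout of the example.

```python
def parse_scores(text):
    """Read a lower triangular score file into (names, pair -> score)."""
    names, scores = [], {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, values = fields[0], fields[1:]
        if len(values) != len(names):
            raise ValueError(
                f"{name}: expected {len(names)} scores, got {len(values)}"
            )
        # Assumption: the i-th value is the score with the i-th earlier name.
        for other, value in zip(names, values):
            scores[frozenset((name, other))] = float(value)
        names.append(name)
    return names, scores


example = """CCPC50
CCRF2S 1257
CCRF2C 1273 1245
CCQF2R 1136 1173 1134
CCQF2P 1098 1143 1122 1255
"""
names, scores = parse_scores(example)
```

Storing each pair as a frozenset makes the lookup independent of the order in
which the two names are given.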
SECTION -6- List of the package files
=====================================
MultAlin files:
---------------
ma or ma.exe : MultAlin itself
multalin.txt : MultAlin documentation (this file)
ma_c.txt : installation documentation
*.tab : symbol comparison table files
source/* : source files, if you need to rebuild ma
example/cytc : an example input file
example/multalin : an example of a command that runs ma correctly
Doc conversion files
--------------------
doc2html.txt : documentation to translate a MultAlin doc file to an Html file
doc2html.pl : Perl script that does this job
doc2html.csh : C-shell script that does this job
*.html : Html constant parts of a MultAlin result page
doc2word.txt : documentation to translate a MultAlin doc file to a Microsoft
Word document.
multalin.dot : Microsoft Word model, including 2 macros to do this job.