Copyright (c) David Powell <david@drp.id.au>
FuzzyLZ attempts to compress a (DNA) sequence using an
inexact repeats model.
FuzzyLZ is the implementation of an idea for compression
using inexact repeats. The inexact repeats are modelled
using an alignment model.
The most relevant paper is:
L. Allison, T. Edgoose and T. I. Dix
"Compression of strings with approximate repeats."
Intell. Sys. in Mol. Biol. 1998, pp8-16
This implementation allows for any number of forward and
backward "machines". Here a "machine" is an FSA (finite
state automaton) used to model sequences with inserts,
deletes, matches or mismatches. The machines may be either
1-state or 3-state machines, corresponding to simple or
linear gap costs in the alignment of the approximate repeats.
FuzzyLZ - VERSION 1.3
RUNNING
-------
The code is known to compile and run with JDK 1.4.1. I have
not tested it with any other version of the JDK.
If you are using Sun's JDK with large sequences, I recommend
using the -Xmx option to increase the amount of memory
available to the Java program. For example, to allow 512MB, use:
java -Xmx512m -jar fuzzyLZ.jar
USAGE
-----
Usage: java -jar fuzzyLZ.jar [options] <seqFile|checkpointFile>
<seqFile> is the sequence to be compressed in FASTA format.
If the input file is 'foo.fa', the output files are:
foo.fa-msglen.txt - Contains the number of bits taken to
    encode each character along the sequence.
foo.fa-final-iterXX.ppm - An image for the end of iteration
    XX showing the probability over the matrix, with brighter
    colours representing better matches. Red is used for
    forward matches and green for reverse.
foo.fa-finalActive-iterXX.ppm - An image showing which cells
    were part of the computation. Again, red is used for
    forward and green for reverse.
foo.fa-finalHits-iterXX.ppm - An image showing where on the
    matrix there was a hash table hit for the heuristic.
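To get an overall figure from the msglen file, the per-character
costs can be summed and averaged. The following is a minimal
Python sketch, not part of FuzzyLZ. It assumes (this README does
not specify the file layout) that each non-blank line ends with
one numeric cost in bits and that lines starting with '#' are
comments. The function name summarize_msglen is made up for
illustration.

```python
import math


def summarize_msglen(lines, alphabet_size=4):
    """Return (total_bits, mean_bits_per_char, uniform_bits_per_char).

    ASSUMPTION: each non-blank line ends with one cost in bits.
    The real msglen layout is not documented in this README.
    """
    costs = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        try:
            costs.append(float(fields[-1]))  # assumed: cost is the last field
        except ValueError:
            continue  # skip lines that do not end in a number
    total = sum(costs)
    mean = total / len(costs) if costs else 0.0
    # A uniform model over the alphabet costs log2(alphabet_size) bits
    # per character, i.e. 2 bits for the default 4-letter alphabet.
    return total, mean, math.log2(alphabet_size)
```

Any iterable of lines can be passed in, including an open file
such as foo.fa-msglen.txt. A mean below the uniform figure
indicates that the sequence was compressed relative to a uniform
model.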
Options:
--iterations=i Number of iterations
(default='1')
The number of times to run the algorithm, re-estimating
the parameters each time.
--preFile=s Prepend 'preFile' to sequence.
(default='')
's' must be a file in FASTA format. This option
"prepends" the sequence in 's' to the sequence to be
compressed. This has the effect of seeing how well a
sequence can be compressed in the presence of
another sequence.
--seqModel=s Base sequence model. Use 'markov(n)' for
an n-th order Markov model, n=-1 for uniform
(default='markov(0)')
--alphabet=s Alphabet used by the sequence.
(default='atgc')
The alphabet used in the sequence.
--fwdMach=s Comma separated list of machines to use for forward matches.
(Use an empty string '' for no forward machines.)
Supported: 1state,3state
(default='1state')
A list of the forward machines to use.
--revMach=s Comma separated list of machines to use for reverse matches.
(Use an empty string '' for no reverse machines.)
Supported: 1state,3state
(default='1state')
A list of the reverse machines to use.
--debug=i Debug level (higher gives more verbose output)
(default='2')
--imageSize=i Maximum Image size in pixels
(default='1024')
--imageFreq=i Save an image every <n> seconds. (0 - to disable)
(default='0')
How often, in seconds, to save an image of the
current state of computation. An image is always
saved at the end of every iteration.
--checkFreq=i Save a checkpoint every <n> seconds. (0 - to disable)
(default='0')
How often, in seconds, to save a checkpoint of
the current state of computation. This can be
used to resume computation, and even to change
some of the parameters.
--statsFreq=i Display some stats every <n> seconds. (0 - to disable)
(default='300')
Print out some statistics every so many seconds.
--msgFile=s Output file for encode length of each character.
(The default is based on the input file name)
(default='')
Output filename. The default is the name of the
input file with '-msglen.txt' appended.
--outDir=s Directory to save output files in.
(default='./')
Output directory. Note that if '--msgFile' is
specified, then this option is not used for the
msglen.txt file.
--hashSize=i Window size to use for constructing hashtable (0 - for full N^2 algorithm)
(default='20')
The window size to use for the speed-up heuristic.
An exact match of this many characters is required
before the algorithm is "activated" in this area.
--computeWin=i Number of cells to activate on a hashtable hit
(default='10')
This specifies how closely the algorithm "looks"
around an exact match.
--cutML=i When (cell_value - base_cell > cutML) then cell is killed. (in bits)
(default='4')
This specifies how badly the algorithm has to be
doing compared to the base states before the cells
are de-activated. (In bits)
--paramFile=s Parameter file to read for various model parameters (see docs)
(default='')
This option is not for the faint-of-heart. The
parameter file has name=value pairs on separate
lines. The possible parameters depend on the
various models being used. The file
'fuzzyLZ-params.txt' in the source package gives
parameters that are equivalent to the defaults.
To discover what parameters may be set, first run
fuzzyLZ without setting --paramFile, then look
after the printed line 'Final costs:'. It is
possible to save these lines to a file and use
them on a different run as the --paramFile. Note
the parameters will always be normalized before
being used.
--resume Resume from a checkpoint
(default='false')
Continue computation from a previously saved
checkpoint. The checkpoint file must be given on
the command line _instead_ of the sequence file.
--overwrite Overwrite msglen file.
(default='false')
Overwrite the various output files.
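As an illustration of the --hashSize heuristic described above,
the search for exact matches of a fixed window size can be
pictured roughly as follows. This is a minimal Python sketch
under stated assumptions, not FuzzyLZ's actual code: the function
name exact_window_hits is hypothetical, and only forward matches
are handled (reverse matches are left out).

```python
from collections import defaultdict


def exact_window_hits(seq, window):
    """Return (earlier, later) start positions of identical windows.

    Illustrative only: index every substring of length `window`
    and report each position whose window was already seen at an
    earlier position.
    """
    seen = defaultdict(list)  # window text -> earlier start positions
    hits = []
    for i in range(len(seq) - window + 1):
        kmer = seq[i:i + window]
        for j in seen[kmer]:
            hits.append((j, i))  # window at j repeats exactly at i
        seen[kmer].append(i)
    return hits
```

For example, exact_window_hits("acgtacgt", 4) finds the single hit
(0, 4). In terms of the options above, each such hit would mark a
place where cells are activated, with --computeWin controlling how
many.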
The parameters that can be changed when resuming from a
checkpoint are: --iterations --imageFreq --checkFreq
--statsFreq --imageSize --debug
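The rule above can be checked before launching a resumed run. The
following Python sketch builds the command line for --resume and
rejects any option outside the list given above. It is a
convenience wrapper written for illustration only: the function
name resume_command and the checkpoint file name used in the
example are hypothetical, and rejecting other options is a
conservative choice made here, not documented FuzzyLZ behaviour.

```python
# Options this README lists as changeable when resuming.
RESUMABLE = {"iterations", "imageFreq", "checkFreq",
             "statsFreq", "imageSize", "debug"}


def resume_command(checkpoint_file, **options):
    """Build the argument list for resuming from a checkpoint."""
    not_allowed = sorted(set(options) - RESUMABLE)
    if not_allowed:
        raise ValueError("cannot change on resume: " + ", ".join(not_allowed))
    args = ["java", "-jar", "fuzzyLZ.jar", "--resume"]
    # Sort for a stable, reproducible command line.
    args += ["--%s=%s" % (name, value) for name, value in sorted(options.items())]
    # The checkpoint file takes the place of the sequence file.
    args.append(checkpoint_file)
    return args
```

For example, resume_command("foo.checkpoint", iterations=3)
returns a list that can be passed to subprocess.run, where
"foo.checkpoint" stands in for whatever checkpoint file was
actually saved.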
Good luck!
-- David Powell <david@drp.id.au>