1 This file provides documentation for modules in Biopython that have been moved
2 or deprecated in favor of other modules. It offers quick, easy-to-find notes on
3 how to update your code so that it works again.
4
5
6 Bio.Rebase
7 ==========
8 Deprecated as of Release 1.46.
9
10 Bio.Gobase
11 ==========
12 Deprecated as of Release 1.46.
13
14 Bio.CDD
15 =======
16 Deprecated as of Release 1.46.
17
18 Bio.biblio
19 ==========
20 Deprecated as of Release 1.45.
21
22 Bio.WWW
23 =======
24 The modules under Bio.WWW are being deprecated as of Release 1.45.
25 The functionality in Bio.WWW.SCOP, Bio.WWW.InterPro and Bio.WWW.ExPASy
26 is now available from Bio.SCOP, Bio.InterPro and Bio.ExPASy instead.
27
28 Bio.Blast.NCBIWWW
29 =================
30 The deprecated functions blast and blasturl were removed in Release 1.44.
31
32 Bio.SeqIO
33 =========
34 The old Bio.SeqIO.FASTA and Bio.SeqIO.generic were deprecated in favour of
35 the new Bio.SeqIO module as of Release 1.44.
36
37 Bio.lcc
38 =======
39 Deprecated in favor of Bio.SeqUtils.lcc in Release 1.44.
40
41 Bio.crc
42 =======
43 Deprecated in favor of Bio.SeqUtils.CheckSum in Release 1.44.
44
45 Bio.FormatIO
46 ============
47 This was removed in Release 1.44.
48
49 Bio.expressions
50 ===============
51 This has been deprecated as of Release 1.44.
52
53 Bio.Kabat
54 =========
55 This was deprecated in Release 1.43 and removed in Release 1.44.
56
57 Bio.SeqUtils
58 ============
59 The functions 'complement' and 'antiparallel' in Bio.SeqUtils have been
60 deprecated as of Release 1.31. Use the functions 'complement' and
61 'reverse_complement' in Bio.Seq instead.
62
63 Bio.GFF
64 =======
65 The functions 'forward_complement' and 'antiparallel' in Bio.GFF.easy have been
66 deprecated as of Release 1.31. Use the functions 'complement' and
67 'reverse_complement' in Bio.Seq instead.
68
69 Bio.sequtils
70 ============
71 Deprecated as of Release 1.30.
72 Use Bio.SeqUtils instead.
73
74 Bio.SVM
75 =======
76 Deprecated as of Release 1.30.
77 The Support Vector Machine code in Biopython has been superseded by a
78 more robust (and maintained) SVM library, which includes a Python
79 interface. We recommend using LIBSVM:
80
81 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
82
83 Bio.RecordFile
84 ==============
85 Deprecated as of Release 1.30.
86 RecordFile wasn't completely implemented and duplicates the work
87 of most standard parsers. We recommend using a specific iterator
88 (Bio.Fasta.Iterator for example) without a parser to get back
89 text records.
90
91 Bio.kMeans and Bio.xkMeans
92 ==========================
93 Deprecated as of Release 1.30.
94
95 The k-means algorithm is a method for unsupervised clustering of data.
96 Biopython includes an implementation of the k-means clustering algorithm
97 in kMeans.py. Recently, a larger set of clustering algorithms entered
98 Biopython as Bio.Cluster. As the kcluster routine in Bio.Cluster also implements
99 the k-means clustering algorithm, the kMeans.py module has been deprecated.
100 Below you will find a description of how to switch from kMeans.py to
101 Bio.Cluster's kcluster.
102
103 The function kcluster in Bio.Cluster performs k-means or k-medians clustering.
104 The corresponding function in kMeans.py is called cluster. This function takes
105 the following arguments:
106
107 o data
108 o k
109 o distance_fn
110 o init_centroids_fn
111 o calc_centroid_fn
112 o max_iterations
113 o update_fn
114
115 The function kcluster in Bio.Cluster takes the following arguments:
116
117 o data
118 o nclusters
119 o mask
120 o weight
121 o transpose
122 o npass
123 o method
124 o dist
125 o initialid
126
127
128 Arguments for kMeans.py's cluster, and their equivalents in Bio.Cluster
129 -----------------------------------------------------------------------
130
131
132 o data:
133
134 In kMeans.py, data is a list of vectors, each containing the same number of
135 data points. Within the context of clustering genes based on their gene
136 expression values, each vector would correspond to the gene expression data of
137 one particular gene, and the values in the vector would correspond to the
138 measured gene expression value by the different microarrays. The cluster
139 routine in kMeans.py always performs a row-wise clustering by grouping vectors.
140
141 The argument data to Bio.Cluster's kcluster has the same structure as in
142 kMeans.py. However, Bio.Cluster allows row-wise and column-wise clustering by
143 the transpose argument. If transpose==0 (the default value), kcluster performs
144 row-wise clustering, consistent with kMeans.py. If transpose==1, kcluster
145 performs column-wise clustering. The same behavior can be obtained, of course,
146 by transposing the data array before calling kcluster.
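
As a minimal sketch of that last remark, assuming the data is held as a plain
list of equal-length rows, the array can be transposed by hand before it is
passed to a row-wise clustering routine. The values and variable names below
are made up for illustration and are not taken from kMeans.py or Bio.Cluster.

```python
# Rows are genes, columns are microarrays (illustrative values only).
data = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# zip(*data) groups the i-th element of every row, i.e. it yields the columns.
transposed = [list(column) for column in zip(*data)]

print(transposed)
```

Row-wise clustering of transposed then groups microarrays rather than genes.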
147
148
149 o k:
150
151 The desired number of clusters is specified by the input argument k in
152 kMeans.py. The corresponding argument in Bio.Cluster's kcluster is nclusters.
153
154 o distance_fn:
155
156 In kMeans.py, the argument distance_fn represents the distance function to
157 calculate the distances between items and cluster centroids. This argument
158 corresponds to a true Python function. The default value is the Euclidean
159 distance, implemented as distance.euclidean in distance.py. User-defined
160 distance functions can also be used.
161
162 The k-means routine in Bio.Cluster does not allow user-specified distance
163 functions. Instead, it provides the following nine built-in distance functions,
164 depending on the argument dist:
165
166 dist=='e': Euclidean distance
167 dist=='h': Harmonically summed Euclidean distance
168 dist=='b': City-block distance
169 dist=='c': Pearson correlation
170 dist=='a': absolute value of the Pearson correlation
171 dist=='u': uncentered correlation
172 dist=='x': absolute uncentered correlation
173 dist=='s': Spearman's rank correlation
174 dist=='k': Kendall's tau
175
176 User-defined distance functions are possible only by modifying the C code in
177 cluster.c (which may not be as hard as it sounds). The default distance function
178 is the Euclidean distance (dist=='e'). Note that in Bio.Cluster the
179 Euclidean distance is defined as the sum of squared differences, whereas in
180 kMeans.py the square root of this quantity is taken. This does not affect the
181 clustering result.
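
The reason the square root makes no difference is that it is a monotonic
function: the centroid with the smallest squared distance also has the smallest
distance. The following sketch checks this on made-up points. The helper names
are hypothetical and do not come from distance.py or cluster.c.

```python
import math


def squared_euclidean(u, v):
    """Sum of squared differences (the Bio.Cluster convention described above)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the sum of squared differences (the kMeans.py convention)."""
    return math.sqrt(squared_euclidean(u, v))


def nearest(item, centroids, distance):
    """Index of the centroid closest to item under the given distance."""
    return min(range(len(centroids)), key=lambda i: distance(item, centroids[i]))


centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
items = [[1.0, 1.0], [6.0, 4.0], [9.0, 1.0], [4.0, 0.0]]

by_squared = [nearest(x, centroids, squared_euclidean) for x in items]
by_root = [nearest(x, centroids, euclidean) for x in items]

print(by_squared == by_root)
```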
182
183 o init_centroids_fn:
184
185 This function specifies the initial choice for the cluster centroids. By
186 default, cluster in kMeans.py uses a random initial choice of cluster centroids
187 by randomly choosing k data vectors from the input vectors in the data input
188 argument. Alternatively, the user can specify a user-defined function to choose
189 the initial cluster centroids.
190
191 In Bio.Cluster, the k-means algorithm in kcluster starts from an initial cluster
192 assignment instead of an initial choice of cluster centroids. As far as I know,
193 these two initialization methods are equivalent in practice. Similar to the
194 cluster routine in kMeans.py, Bio.Cluster's kcluster performs a random initial
195 assignment of items to clusters. Alternatively, users can specify a
196 (deterministic) initial clustering via the initialid argument. This argument is
197 None by default. If not None, it should be a 1D array (or list) containing the
198 number (between 0 and nclusters-1) of the cluster to which each item is
199 assigned initially.
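
One way to build such an initial clustering is sketched below. The round-robin
rule and the function name are assumptions made for this example only; any list
of cluster numbers in the range 0 to nclusters-1 has the required form.

```python
def round_robin_initialid(nitems, nclusters):
    """Assign item i to cluster i % nclusters (a hypothetical, deterministic choice)."""
    if not 1 <= nclusters <= nitems:
        raise ValueError("need 1 <= nclusters <= nitems")
    return [i % nclusters for i in range(nitems)]


initialid = round_robin_initialid(nitems=7, nclusters=3)
print(initialid)
```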
200
201 Note that the k-means routine in Bio.Cluster performs automatic repeats of the
202 algorithm, each time starting from a different random initial clustering. See
203 the comment for the npass argument below.
204
205 o calc_centroid_fn:
206
207 This argument specifies how to calculate the cluster centroids, given the data
208 vectors of the items that belong to each cluster. By default, the mean over the
209 vectors is calculated. A user-defined function can also be used.
210
211 Bio.Cluster's kcluster does not allow user-defined functions. Instead, the
212 method to calculate the cluster centroid is determined by the argument method,
213 which can be either 'a' (arithmetic mean) or 'm' (median). The default is to
214 calculate the mean ('a').
215
216 o max_iterations:
217
218 The cluster routine in kMeans.py has an argument max_iterations, which is used
219 to stop the iteration if the routine does not converge after the given number of
220 iterations.
221
222 The kcluster routine in Bio.Cluster does not have such an argument. The failure
223 of a k-means algorithm to converge is due to the occurrence of periodic
224 clustering solutions during the course of the k-means algorithm. The kcluster
225 routine in Bio.Cluster automatically checks for the occurrence of such a
226 periodicity in the solutions. If a periodic behavior is detected, the algorithm
227 is interrupted and the last clustering solution is returned. Accordingly, the
228 kcluster routine is guaranteed to return a clustering solution. Also see the
229 discussion of the npass argument below.
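
The idea of stopping on a periodic solution can be sketched as follows. This is
an illustration only: remembering every earlier assignment in a set is an
assumption made to keep the example short, and the bookkeeping actually used in
cluster.c may differ. The step functions are toy stand-ins for one k-means
iteration.

```python
def iterate_until_stable(assignment, step):
    """Apply step repeatedly; stop at a fixed point or when a state recurs."""
    seen = {tuple(assignment)}
    while True:
        new_assignment = step(assignment)
        if new_assignment == assignment:
            return new_assignment, "converged"
        if tuple(new_assignment) in seen:
            # An earlier solution has come back: the iteration is periodic.
            return new_assignment, "periodic"
        seen.add(tuple(new_assignment))
        assignment = new_assignment


def flip(assignment):
    """Toy step that oscillates between two assignments forever."""
    return [1 - cid for cid in assignment]


def collapse(assignment):
    """Toy step that settles on a single cluster."""
    return [0 for _ in assignment]


periodic_result = iterate_until_stable([0, 1], flip)
converged_result = iterate_until_stable([0, 1], collapse)
print(periodic_result, converged_result)
```

Because every assignment is drawn from a finite set, the loop always ends in
one of the two branches, which is why no iteration limit is needed here.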
230
231 o update_fn:
232
233 The argument update_fn to cluster in kMeans.py is a hook function that is
234 called at the beginning of every iteration and passed the iteration number,
235 cluster centroids, and current cluster assignments. It is used by xkMeans.py,
236 which provides a visualization of k-means clustering. Currently there is no
237 equivalent in Bio.Cluster.
238
239
240 Other arguments for Bio.Cluster's kcluster
241 ------------------------------------------
242
243 Three arguments in Bio.Cluster's kcluster do not have a direct equivalent in
244 kMeans.py's cluster.
245
246 o mask:
247
248 Microarray experiments tend to suffer from a large amount of missing data. The
249 argument mask to Bio.Cluster's kcluster lets the user specify which data are
250 missing. This argument is an array with the same shape as data, and contains
251 a 1 for each data point that is present, and a 0 for a missing data point:
252
253 mask[i,j]==1: data[i,j] is valid
254 mask[i,j]==0: data[i,j] is a missing data point
255
256 Missing data points are ignored by the clustering algorithm. By default, mask
257 is an array containing 1's everywhere.
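
A mask of this form can be derived from the raw measurements. The sketch below
assumes that missing values are recorded as None in a list of rows, which is a
convention chosen for this example and not something Bio.Cluster prescribes.

```python
# None marks a missing measurement in this made-up example.
raw = [
    [0.5, None, 1.2],
    [None, 0.3, 0.9],
]

# 1 where a value is present, 0 where it is missing.
mask = [[0 if value is None else 1 for value in row] for row in raw]

# Fill the gaps with a placeholder so that data stays numeric; the
# placeholder is meant to be ignored wherever the mask holds 0.
data = [[0.0 if value is None else value for value in row] for row in raw]

print(mask)
```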
258
259 o weight:
260
261 The weight argument is used to put different weights on different data points.
262 For example, when clustering genes based on their gene expression profile, we
263 may want to attach a bigger weight to some microarrays compared to others. By
264 default, the weight argument contains equal weights of 1.0 for all data points.
265 Note that for row-wise clustering, the weight argument is a 1D vector whose
266 length is equal to the number of columns. For column-wise clustering, the length
267 of this argument is equal to the number of rows.
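
The length rule can be written down as a small helper. The function below is
hypothetical and only mirrors the description above; it is not part of
Bio.Cluster.

```python
def default_weight(data, transpose=0):
    """Equal weights of 1.0, with the length described in the text above."""
    nrows = len(data)
    ncolumns = len(data[0])
    # Row-wise clustering weights the columns; column-wise weights the rows.
    length = nrows if transpose else ncolumns
    return [1.0] * length


data = [[0.5, 0.1, 1.2], [0.7, 0.3, 0.9]]
row_wise = default_weight(data)
column_wise = default_weight(data, transpose=1)
print(row_wise, column_wise)
```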
268
269 o npass:
270
271 Typical implementations of the k-means clustering algorithm rely on a random
272 initialization. Unlike Self-Organizing Maps, however, the k-means algorithm has
273 a clearly defined goal, which is to minimize the within-cluster sum of
274 distances. Different k-means clustering solutions (based on different initial
275 clusterings) can therefore be compared to each other directly. In order to
276 increase the chance of finding the optimal k-means clustering solution, the
277 k-means routine in Bio.Cluster automatically repeats the algorithm npass times,
278 each time starting from a different initial random clustering. The best
279 clustering solution is returned to the user, together with the number of
280 npass attempts in which it was found. For more information, see the output
281 variable nfound below.
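
The repeat-and-compare idea can be sketched in pure Python as follows. This is
not the Bio.Cluster implementation: all function names are hypothetical, the
distance is the sum of squared differences, and two runs are treated as having
found the same solution when their errors agree, which is a simplification.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def members_of(data, clusterid, c):
    return [x for x, cid in zip(data, clusterid) if cid == c]


def single_pass(data, nclusters, rng, max_steps=100):
    """One k-means run from a random initial assignment of items to clusters."""
    # Shuffle a round-robin labelling so that no cluster starts out empty.
    clusterid = [i % nclusters for i in range(len(data))]
    rng.shuffle(clusterid)
    for _ in range(max_steps):
        centroids = []
        for c in range(nclusters):
            members = members_of(data, clusterid, c)
            # Park an emptied cluster on a random item to keep it alive.
            centroids.append(centroid(members) if members else rng.choice(data))
        new_id = [min(range(nclusters),
                      key=lambda c: squared_distance(x, centroids[c]))
                  for x in data]
        if new_id == clusterid:
            break
        clusterid = new_id
    # Within-cluster sum of distances for the final assignment.
    error = 0.0
    for c in range(nclusters):
        members = members_of(data, clusterid, c)
        if members:
            center = centroid(members)
            error += sum(squared_distance(x, center) for x in members)
    return clusterid, error


def repeated_kmeans(data, nclusters, npass, seed=0):
    """Keep the lowest-error run and count how many runs reached that error."""
    rng = random.Random(seed)
    best_id, best_error, nfound = None, None, 0
    for _ in range(npass):
        clusterid, error = single_pass(data, nclusters, rng)
        if best_error is None or error < best_error - 1e-12:
            best_id, best_error, nfound = clusterid, error, 1
        elif abs(error - best_error) <= 1e-12:
            nfound += 1
    return best_id, best_error, nfound


data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
best_id, best_error, nfound = repeated_kmeans(data, nclusters=2, npass=5)
print(best_id, best_error, nfound)
```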
282
283
284 Return values
285 -------------
286
287 The cluster routine in kMeans.py returns two values:
288
289 o centroids
290 o clusters
291
292 The kcluster routine in Bio.Cluster returns four values:
293
294 o clusterid
295 o centroids
296 o error
297 o nfound
298
299
300 o centroids:
301
302 The centroids return value contains the centroids of the k clusters that were
303 found, and corresponds to the centroids return value from Bio.Cluster's
304 kcluster routine.
305
306 o clusters:
307
308 The clusters return value contains the number of the cluster to which each
309 vector was assigned. The corresponding return value in Bio.Cluster's kcluster
310 is clusterid.
311
312 o error:
313
314 The error return value from Bio.Cluster's kcluster is the within-cluster sum of
315 distances for the optimal clustering solution that was found. This value can be
316 used to compare different clustering solutions to each other.
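
As an illustration of how such a comparison works, the sketch below computes a
within-cluster sum of distances for two candidate solutions on made-up data. It
assumes the Euclidean distance in the sum-of-squared-differences form described
earlier, and the function name is hypothetical.

```python
def within_cluster_sum(data, clusterid, centroids):
    """Sum over all items of the squared distance to their cluster centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(item, centroids[cid]))
        for item, cid in zip(data, clusterid)
    )


data = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]

# Two candidate solutions, each with centroids equal to the cluster means.
good = within_cluster_sum(data, [0, 0, 1, 1], [[1.0, 0.0], [11.0, 0.0]])
poor = within_cluster_sum(data, [0, 1, 1, 1], [[0.0, 0.0], [8.0, 0.0]])

print(good, poor)
```

The solution with the smaller value groups the items more tightly around their
centroids.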
317
318 o nfound:
319
320 The nfound return value from Bio.Cluster's kcluster shows in how many of the
321 npass runs the optimal clustering solution was found. Accordingly, nfound is at
322 least 1 and at most equal to npass. A large value for nfound is an indication
323 that the clustering solution that was found is optimal. On the other hand, if
324 nfound is equal to 1, it is very well possible that a better clustering solution
325 exists than the one found by kcluster.