This file documents modules in Biopython that have been moved or deprecated in
favor of other modules. It provides quick, easy-to-find instructions on how to
update your code so that it works again.

Bio.sequtils
============
Deprecated as of Release 1.30.
Use Bio.SeqUtils instead.


Bio.kMeans and Bio.xkMeans
==========================

k-means is an algorithm for unsupervised clustering of data. Biopython
includes an implementation of the k-means clustering algorithm in kMeans.py.
A larger set of clustering algorithms has since entered Biopython as
Bio.Cluster. Because the kcluster routine in Bio.Cluster also implements the
k-means clustering algorithm, the kMeans.py module has been deprecated. The
description below explains how to switch from kMeans.py to Bio.Cluster's
kcluster.

The function kcluster in Bio.Cluster performs k-means or k-medians clustering.
The corresponding function in kMeans.py is called cluster. This function takes
the following arguments:

o data
o k
o distance_fn
o init_centroids_fn
o calc_centroid_fn
o max_iterations
o update_fn

The function kcluster in Bio.Cluster takes the following arguments:

o data
o nclusters
o mask
o weight
o transpose
o npass
o method
o dist
o initialid


Arguments for kMeans.py's cluster, and their equivalents in Bio.Cluster
-----------------------------------------------------------------------


o data:

In kMeans.py, data is a list of vectors, each containing the same number of
data points. Within the context of clustering genes based on their gene
expression values, each vector would correspond to the gene expression data of
one particular gene, and the values in the vector would correspond to the
expression values measured by the different microarrays. The cluster routine in
kMeans.py always performs a row-wise clustering by grouping vectors.

The argument data to Bio.Cluster's kcluster has the same structure as in
kMeans.py. However, Bio.Cluster supports both row-wise and column-wise
clustering via the transpose argument. If transpose==0 (the default value),
kcluster performs row-wise clustering, consistent with kMeans.py. If
transpose==1, kcluster performs column-wise clustering. The same behavior can
be obtained, of course, by transposing the data array before calling kcluster.

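As an illustration of the last point, the sketch below transposes a list of
vectors with plain Python. The toy data and the helper name transpose_rows are
made up for this example and are not part of kMeans.py or Bio.Cluster.

```python
# Hypothetical toy data: 3 genes (rows) measured on 2 microarrays (columns).
data = [
    [1.0, 2.0],
    [1.1, 2.1],
    [5.0, 6.0],
]


def transpose_rows(rows):
    """Return the column-wise view of a list of equal-length vectors."""
    return [list(column) for column in zip(*rows)]


# Row-wise clustering groups the 3 gene vectors in `data`.
# Column-wise clustering groups the 2 microarray vectors in `columns`.
columns = transpose_rows(data)
```

Passing columns to a row-wise routine should then group microarrays instead of
genes, which is what the text describes for transpose==1.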

o k:

The desired number of clusters is specified by the input argument k in
kMeans.py. The corresponding argument in Bio.Cluster's kcluster is nclusters.

o distance_fn:

In kMeans.py, the argument distance_fn is the distance function used to
calculate the distances between items and cluster centroids. This argument is
an actual Python function. The default value is the Euclidean distance,
implemented as distance.euclidean in distance.py. User-defined distance
functions can also be used.

The k-means routine in Bio.Cluster does not allow user-specified distance
functions. Instead, it provides the following nine built-in distance functions,
selected by the argument dist:

dist=='e': Euclidean distance
dist=='h': Harmonically summed Euclidean distance
dist=='b': City-block distance
dist=='c': Pearson correlation
dist=='a': absolute value of the Pearson correlation
dist=='u': uncentered correlation
dist=='x': absolute uncentered correlation
dist=='s': Spearman's rank correlation
dist=='k': Kendall's tau

User-defined distance functions are possible only by modifying the C code in
cluster.c (which may not be as hard as it sounds). The default distance function
is the Euclidean distance (dist=='e'). Note that in Bio.Cluster the Euclidean
distance is defined as the sum of squared differences, whereas in kMeans.py the
square root of this quantity is taken. Because the square root is a monotonic
function, both definitions rank the centroids in the same order for any given
item, so this does not affect the clustering result.

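The claim that the squared and the square-rooted Euclidean distance lead to
the same assignments can be checked with a small pure-Python sketch. The
function names and the toy data below are assumptions made for this example;
they are not taken from distance.py or cluster.c.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared differences (the Bio.Cluster convention described above)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared differences (the kMeans.py convention)."""
    return math.sqrt(squared_euclidean(a, b))


def nearest(item, centroids, distance):
    """Index of the centroid closest to item under the given distance."""
    return min(range(len(centroids)), key=lambda i: distance(item, centroids[i]))


# Hypothetical items and centroids.
items = [[0.0, 0.0], [0.9, 1.2], [4.0, 4.5], [5.0, 5.0]]
centroids = [[0.5, 0.5], [4.5, 4.5]]

by_squared = [nearest(v, centroids, squared_euclidean) for v in items]
by_root = [nearest(v, centroids, euclidean) for v in items]
```

Both lists hold the same cluster numbers, because taking the square root never
changes which centroid is closest.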
o init_centroids_fn:

This function specifies the initial choice for the cluster centroids. By
default, cluster in kMeans.py uses a random initial choice of cluster centroids
by randomly choosing k data vectors from the input vectors in the data input
argument. Alternatively, the user can specify a user-defined function to choose
the initial cluster centroids.

In Bio.Cluster, the k-means algorithm in kcluster starts from an initial cluster
assignment instead of an initial choice of cluster centroids. As far as I know,
these two initialization methods are equivalent in practice. Similar to the
cluster routine in kMeans.py, Bio.Cluster's kcluster performs a random initial
assignment of items to clusters. Alternatively, users can specify a
(deterministic) initial clustering via the initialid argument. This argument is
None by default. If not None, it should be a 1D array (or list) containing the
number (between 0 and nclusters-1) of the cluster to which each item is
assigned initially.

Note that the k-means routine in Bio.Cluster performs automatic repeats of the
algorithm, each time starting from a different random initial clustering. See
the comment for the npass argument below.

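Code that already has a set of initial centroids could turn them into an
initial assignment by giving each item the number of its nearest centroid. The
sketch below shows one way to do that in plain Python. The helper name
initial_assignment and the toy data are hypothetical, and the sketch does not
check that every cluster receives at least one item.

```python
def initial_assignment(data, centroids):
    """Assign each data vector to its nearest centroid (squared Euclidean)."""

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda c: sqdist(vector, centroids[c]))
        for vector in data
    ]


# Hypothetical data and user-chosen initial centroids.
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[0.0, 0.0], [10.0, 10.0]]

# One integer per item, each between 0 and nclusters-1, which is the form
# the text above describes for the initialid argument.
initialid = initial_assignment(data, centroids)
```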
o calc_centroid_fn:

This argument specifies how to calculate the cluster centroids, given the data
vectors of the items that belong to each cluster. By default, the mean over the
vectors is calculated. A user-defined function can also be used.

Bio.Cluster's kcluster does not allow user-defined functions. Instead, the
method to calculate the cluster centroid is determined by the argument method,
which can be either 'a' (arithmetic mean) or 'm' (median). The default is to
calculate the mean ('a').

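The difference between the two centroid methods can be sketched with the
standard library. The function centroid below is a hypothetical stand-in that
only mimics the 'a' and 'm' choices described above; it is not the code used
by Bio.Cluster.

```python
from statistics import mean, median


def centroid(vectors, method="a"):
    """Column-wise centroid: 'a' uses the arithmetic mean, 'm' the median."""
    if method == "a":
        average = mean
    elif method == "m":
        average = median
    else:
        raise ValueError("method should be 'a' or 'm'")
    return [average(column) for column in zip(*vectors)]


# Hypothetical cluster members; the first column contains an outlier (9.0).
members = [[1.0, 10.0], [2.0, 20.0], [9.0, 30.0]]

mean_centroid = centroid(members, "a")
median_centroid = centroid(members, "m")
```

In the first column the mean is pulled towards the outlier while the median is
not, which is the usual reason to prefer k-medians for noisy data.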
o max_iterations:

The cluster routine in kMeans.py has an argument max_iterations, which is used
to stop the iteration if the routine does not converge after the given number of
iterations.

The kcluster routine in Bio.Cluster does not have such an argument. The failure
of a k-means algorithm to converge is due to the occurrence of periodic
clustering solutions during the course of the k-means algorithm. The kcluster
routine in Bio.Cluster automatically checks for the occurrence of such a
periodicity in the solutions. If a periodic behavior is detected, the algorithm
is interrupted and the last clustering solution is returned. Accordingly, the
kcluster routine is guaranteed to return a clustering solution. Also see the
discussion of the npass argument below.

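The idea of stopping on a repeated solution instead of an iteration limit can
be sketched as follows. This is only an illustration: it remembers every
assignment seen so far, which is the simplest approach, and cluster.c may
detect periodicity differently. The names iterate_until_repeat, flip and
settle are hypothetical.

```python
def iterate_until_repeat(step, assignment):
    """Apply step until an assignment recurs.

    Returns (last_assignment, number_of_steps). Ordinary convergence is the
    special case in which the assignment repeats after a single step. The
    loop always ends, because only finitely many assignments exist.
    """
    seen = {tuple(assignment)}
    steps = 0
    while True:
        new_assignment = step(assignment)
        steps += 1
        if tuple(new_assignment) in seen:
            return new_assignment, steps
        seen.add(tuple(new_assignment))
        assignment = new_assignment


def flip(assignment):
    """Toy update that never converges: it swaps clusters 0 and 1 each time."""
    return [1 - c for c in assignment]


def settle(assignment):
    """Toy update that converges: everything ends up in cluster 0."""
    return [0 for _ in assignment]


periodic_result = iterate_until_repeat(flip, [0, 1])
converged_result = iterate_until_repeat(settle, [0, 1])
```

In a real k-means loop, step would recompute the centroids and reassign the
items; the toy update functions only stand in for that.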
o update_fn:

The argument update_fn to cluster in kMeans.py is a hook function that is
called at the beginning of every iteration and passed the iteration number,
cluster centroids, and current cluster assignments. It is used by xkMeans.py,
which provides a visualization of k-means clustering. Currently there is no
equivalent in Bio.Cluster.

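The correspondence between the two argument lists, as described above, can be
summarized in a small lookup table. The names ARGUMENT_MAP and
unsupported_arguments are hypothetical helpers for this example; they only
restate the text and do not call kMeans.py or Bio.Cluster.

```python
# kMeans.py argument -> Bio.Cluster kcluster argument (None: no equivalent).
ARGUMENT_MAP = {
    "data": "data",
    "k": "nclusters",
    # A Python function becomes a one-letter code such as 'e'.
    "distance_fn": "dist",
    # A centroid-choosing function becomes an initial assignment.
    "init_centroids_fn": "initialid",
    # A Python function becomes 'a' (mean) or 'm' (median).
    "calc_centroid_fn": "method",
    "max_iterations": None,
    "update_fn": None,
}


def unsupported_arguments(old_kwargs):
    """Names in old_kwargs that are unknown or have no kcluster equivalent."""
    return sorted(name for name in old_kwargs if ARGUMENT_MAP.get(name) is None)


# Hypothetical keyword arguments from an old call to kMeans.py's cluster.
old_call = {"k": 3, "max_iterations": 100, "update_fn": print}
needs_attention = unsupported_arguments(old_call)
```

A check like this can help locate calls that rely on max_iterations or
update_fn before they are ported.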

Other arguments for Bio.Cluster's kcluster
------------------------------------------

Three arguments in Bio.Cluster's kcluster do not have a direct equivalent in
kMeans.py's cluster.

o mask:

Microarray experiments tend to suffer from a large amount of missing data. The
argument mask to Bio.Cluster's kcluster lets the user specify which data are
missing. This argument is an array with the same shape as data, and contains
a 1 for each data point that is present, and a 0 for a missing data point:

mask[i,j]==1: data[i,j] is valid
mask[i,j]==0: data[i,j] is a missing data point

Missing data points are ignored by the clustering algorithm. By default, mask
is an array containing 1's everywhere.

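What "ignored" means for a distance calculation can be sketched in plain
Python: positions that are missing in either vector simply do not contribute.
The function masked_squared_distance is a hypothetical illustration. The real
routine may additionally normalize by the number of values used, which this
sketch does not attempt.

```python
def masked_squared_distance(a, b, mask_a, mask_b):
    """Sum squared differences over positions present in both vectors."""
    return sum(
        (x - y) ** 2
        for x, y, present_a, present_b in zip(a, b, mask_a, mask_b)
        if present_a and present_b
    )


# Hypothetical rows; the 0.0 in gene_a is a placeholder for a missing value.
gene_a = [1.0, 0.0, 3.0]
mask_a = [1, 0, 1]
gene_b = [1.0, 5.0, 5.0]
mask_b = [1, 1, 1]

with_mask = masked_squared_distance(gene_a, gene_b, mask_a, mask_b)
without_mask = masked_squared_distance(gene_a, gene_b, [1, 1, 1], [1, 1, 1])
```

Without the mask, the placeholder value would be treated as a real measurement
and would distort the distance.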
o weight:

The weight argument is used to put different weights on different data points.
For example, when clustering genes based on their gene expression profile, we
may want to attach a bigger weight to some microarrays compared to others. By
default, the weight argument contains equal weights of 1.0 for all data points.
Note that for row-wise clustering, the weight argument is a 1D vector whose
length is equal to the number of columns. For column-wise clustering, the length
of this argument is equal to the number of rows.

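The effect of the weights on a row-wise distance can be sketched as below.
The function weighted_squared_distance is a hypothetical illustration of the
idea, with one weight per column. Any normalization applied by the real
routine is left out.

```python
def weighted_squared_distance(a, b, weight):
    """Squared differences, each multiplied by the weight of its column."""
    if not (len(a) == len(b) == len(weight)):
        raise ValueError("a, b and weight should have the same length")
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weight))


# Hypothetical expression profiles of two genes over three microarrays.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 2.0, 5.0]

equal = weighted_squared_distance(gene_a, gene_b, [1.0, 1.0, 1.0])
# The third microarray counts twice as much as the others.
emphasized = weighted_squared_distance(gene_a, gene_b, [1.0, 1.0, 2.0])
```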
o npass:

Typical implementations of the k-means clustering algorithm rely on a random
initialization. Unlike Self-Organizing Maps, however, the k-means algorithm has
a clearly defined goal, which is to minimize the within-cluster sum of
distances. Different k-means clustering solutions (based on different initial
clusterings) can therefore be compared to each other directly. In order to
increase the chance of finding the optimal k-means clustering solution, the
k-means routine in Bio.Cluster automatically repeats the algorithm npass times,
each time starting from a different initial random clustering. The best
clustering solution is returned to the user, together with the number of npass
attempts in which it was found. For more information, see the output variable
nfound below.

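The repeat-and-compare strategy can be sketched in pure Python as follows.
This is a minimal k-means written for illustration, not the Bio.Cluster
implementation. The names kmeans_once, kcluster_sketch and canonical, the
max_steps safety cap, the re-seeding of empty clusters and the toy data are
all assumptions made for this example.

```python
import random


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroids_of(data, clusterid, nclusters, rng):
    """Mean of each cluster; an empty cluster is re-seeded with a random item."""
    centroids = []
    for c in range(nclusters):
        members = [v for v, cid in zip(data, clusterid) if cid == c]
        if not members:
            members = [rng.choice(data)]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    return centroids


def kmeans_once(data, nclusters, rng, max_steps=100):
    """One run from a random initial assignment; returns (clusterid, error)."""
    clusterid = [i % nclusters for i in range(len(data))]
    rng.shuffle(clusterid)
    for _ in range(max_steps):
        centroids = centroids_of(data, clusterid, nclusters, rng)
        new_id = [
            min(range(nclusters), key=lambda c: sqdist(v, centroids[c]))
            for v in data
        ]
        if new_id == clusterid:
            break
        clusterid = new_id
    centroids = centroids_of(data, clusterid, nclusters, rng)
    error = sum(sqdist(v, centroids[c]) for v, c in zip(data, clusterid))
    return clusterid, error


def canonical(clusterid):
    """Relabel clusters by first appearance so equal partitions compare equal."""
    relabel = {}
    return tuple(relabel.setdefault(c, len(relabel)) for c in clusterid)


def kcluster_sketch(data, nclusters, npass, seed=0):
    """Repeat k-means npass times; return (clusterid, error, nfound)."""
    rng = random.Random(seed)
    best_id, best_error, nfound = None, None, 0
    for _ in range(npass):
        clusterid, error = kmeans_once(data, nclusters, rng)
        if best_error is None or error < best_error - 1e-12:
            best_id, best_error, nfound = clusterid, error, 1
        elif (abs(error - best_error) <= 1e-12
              and canonical(clusterid) == canonical(best_id)):
            nfound += 1
    return best_id, best_error, nfound


# Hypothetical data: two well-separated pairs of points.
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
clusterid, error, nfound = kcluster_sketch(data, nclusters=2, npass=5)
```

Solutions are compared through their within-cluster sum of distances, and the
cluster labels are normalized first because two runs can find the same
partition with the cluster numbers swapped.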

Return values
-------------

The cluster routine in kMeans.py returns two values:

o centroids
o clusters

The kcluster routine in Bio.Cluster returns four values:

o clusterid
o centroids
o error
o nfound


o centroids:

The centroids return value of kMeans.py's cluster contains the centroids of the
k clusters that were found. It corresponds to the centroids return value of
Bio.Cluster's kcluster routine.

o clusters:

The clusters return value contains the number of the cluster to which each
vector was assigned. The corresponding return value in Bio.Cluster's kcluster
is clusterid.

o error:

The error return value from Bio.Cluster's kcluster is the within-cluster sum of
distances for the optimal clustering solution that was found. This value can be
used to compare different clustering solutions to each other.

o nfound:

The nfound return value from Bio.Cluster's kcluster shows in how many of the
npass runs the optimal clustering solution was found. Accordingly, nfound is at
least 1 and at most equal to npass. A large value for nfound suggests that the
clustering solution that was found is optimal. On the other hand, if nfound is
equal to 1, it is quite possible that a better clustering solution exists than
the one found by kcluster.
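
Code that needs the members of each cluster, and not only the cluster number
of each item, can regroup an assignment list such as clusterid. The helper
members_by_cluster and the example assignment below are hypothetical.

```python
from collections import defaultdict


def members_by_cluster(clusterid):
    """Map each cluster number to the list of item indices assigned to it."""
    groups = defaultdict(list)
    for item, cluster in enumerate(clusterid):
        groups[cluster].append(item)
    return dict(groups)


# Hypothetical assignment of five items to three clusters.
clusterid = [0, 1, 0, 2, 1]
groups = members_by_cluster(clusterid)
```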