Find file History
Pull request Compare This branch is 3 commits behind cathywu:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


PyML - a Python Machine Learning package

Asa Ben-Hur

PyML is a flexible Python framework for using various classification 
methods including Support Vector Machines (SVM).


Only Unix/Linux/OS-X is supported.
The numpy package

A script is provided so installation follows the standard python idiom:

>>> python build
>>> python install

See the Installation section in the tutorial for a trouble-shooting guide.


Tutorial - see doc/tutorial.pdf 
Module documentation - see doc/autodoc/public/index.html
Usage examples - see PyML/demo/

PyML 0.7.9

- Fixed compilation issue under gcc 4.6.1 (tracker ID: 3361193)
- Fixed issue with VectorDataSet (tracker ID: 3364921)
- Updated tutorial

PyML 0.7.8
- Added wrapper for liblinear linear SVMs.  If you only need a linear
  SVM, these solvers offer a very significant speedup for large
     SVM(optimizer = 'liblinear', loss = 'l2') for l2 loss SVM
     SVM(optimizer = 'liblinear', loss = 'l1') for l1 loss SVM
- Added containers.setData.SetData - a dataset container where each
example is a set of objects.
- Chris Hiszpanski reported an issue and fix for ROC calculation that
would fail for a corner case.
- When creating a dataset with numeric labels, the 'numericLabels'
keyword argument was not passed on to the Labels constructor when
creating such a dataset from arrays/lists
- the "stratifiedCV" method of SVR was being called by model
selection. That was addressed by defining it to be regular
cross-validation (stratified CV doesn't make sense for regression).
Better solution would be to have a separate base class for regression.
- Can create an empty SparseDataSet or VectorDataSet, and then
populate it on the fly with features (using its addFeatures method).

PyML 0.7.7
- Fixed the classify method of the KNN classifier
  (cross-validation worked fine).
- Fixed issue with the __repr__ of the Results objects that was giving
  an error when results of an unlabeled dataset were being displayed.

PyML 0.7.6
- Added positional kmer dataset creation (an implementation of
  Sonenburg et al's weighted degree kernel) that represents the
  features explicitly.

PyML 0.7.5
- Reworked the SequenceData container
- Fixed a bug in Labels.oneAgainstRest (thanks to Marcel Luethi for
  bug report)
- Fixed a bug in feature selection - feature IDs are now correctly retained

- Fixed bug in SparseDataSet's addFeatures method

- Removed dependency on scipy

PyML 0.7.4
- Modified the way svm models are saved/loaded.  Now you can save
  any SVM model.  Note that the interface for loading/saving has
  OneAgainstRest classifiers also support save/load.
- Fixed the import error from 0.7.3

PyML 0.7.3

- Fixed bug  2971509 reported by Vebjorn Ljosa ( ljosa ) 
- Updated the requirements for PyML (python 2.5 and up).
- Increased the flexibility of the Aggregate container
- ROC curve plotting:  when doing cross-validation, PyML 
  now plots an ROC curve produced by averaging ROC curves
  of individual folds.
- AUC calculation gives the correct result when the discriminant
  scores are concentrated on a few values.
- the assess module has been split into several modules under
  the evaluators directory

PyML 0.7.2

- Fixed bug in Aggregate container (in the case of a weighted 
  combination of datasets)  Bug report by Eithon Cadag
- Stephen Picolo found issue when using feature selection with the VectorDataSet 
  container.  Use SparseDataSet instead.

PyML 0.7.1

- added k-means clustering (PyML.clusterers.kmeans)
- corrected handling of nan's
- fixed an import statement in classifiers/
- Compilation error in Kernel.h fixed (shows in gcc 4.3)


- preproc.pca - updated code to new version of numpy
- demo2d.scatter - improved interface to make it more useful

PyML 0.7.0

- a restructuring of the module structure.  see tutorial for details
- small changes in demo2d
- linear kernel wasn't normalizing properly
- myio.myopen now handles bz2 files as well
- myio.myopen does not open in universal newline support mode
  by default - it throws the pyml parser off
- improved the way SequenceData reads fasta files -- it can now
  extract labels using a user-supplied function that extracts
  the id and label out of the fasta header.
- added a method for generating spectrum kernels 

PyML 0.6.15

- the svm cacheSize for libsvm wasn't being set up correctly.
  thanks to David Doukhan for the bugfix.
- added an 'addFeatures' method to the SparseDataSet and VectorDataSet 
- cleaned up the VectorDataSet container
- a dataset's attachLabels now takes either a labels object or a file name.
- when saving a dataset, the format (csv or sparse) is chosen according to
  the type of the container (sparse containers save to sparse format and
  non-sparse containers save to csv).
- improved the discussion of model selection in the tutorial
- more options in demo2d and a change of format

PyML 0.6.14

- Fixed bug in KNN.cpp where in some cases the decision function value
  is sometimes wrong (thanks to Martial Hue the bug report and a fix).
- Fixed a bug in loading non-linear models (thanks to shiela reynolds 
  for the bug report).
- construction from an array is now in the form  
  datafunc.VectorDataset(x) instead of datafunc.VectorDataset(X=x)
- small changes in demo2d (with the help of Adam Labadorf)

PyML 0.6.13

- Updates to SequenceData, including the addition of the positional-kmer-kernel
  and a save method.
- Datasets now maintain the norms of the examples, so that computation of 
  cosine and Gaussian kernels is much faster.
- a 'sample' module has been added with function for creating samples from a
  dataset (with and without replacement).
- fixed a segfault in the Aggregate container
- streamlined SWIG wrapper files


- added in the ext directory

PyML 0.6.12

- PyML now distributed under LGPL license.
- The list of labels input argument for the Labels constructor can now be
  any list-like object.
- Fixed bug in loading a saved ResultsList.
- removed usage of numpy.nonzero in
- debugged 
- debugged demo2d

PyML 0.6.11

- wrappers now use swig 1.3.31
- added release number accessible as PyML.__version__
- fixed a bug in
- cleaned up some of the functions in the datafunc module
- when performing CV it could happen that a class would not be represented
  in one of the folds, creating a cryptic error/segfault.  now this is 
  explicitly checked.


- Corrected compilation bug that only showed up in new gcc compilers (version 4.1.1).
  (Bug reported by Michael Hamilton).

PyML 0.6.10

- Compatibility with latest numpy 1.0 (array codes, numpy.nonzero)
- Rewrote the constructor of 'datafunc.Labels'.
- The 'translate' method of VectorDataSet went missing, and has been restored.
- Cleaned up the libsvm wrapper.
- Dealt with bug in the C++ dataset containers introduced in 0.6.9

PyML 0.6.9

- added VectorDataSet -- a C++ non-sparse vector container.
  this resulted in some renaming of the dataset containers.
- updated the libsvm wrapper -- now wrapping version 2.82.
  to simplify the update of the wrapper, the pure python datasets are 
  no longer supported for training svm classifiers (it's easy to convert!)
- removed references to Numeric from
- better handling of nan's as a result of prediction
- a few more tweaks of the RegressionResults class
- reworked demo2d to work with version 0.87 of matplotlib
- the check method of the sparse parser didn't handle correctly files
  with only a single line (thanks to Jian-Qiu for the pointer and fix).

PyML 0.6.8

- reworked the Results containers, including improved support for regression 
  (RegressionResults container)
- corrected bug in reading in data with numeric labels (the numericLabels flag wasn't working)
- corrected compilation issue in KNN.h (showed up under fedora core 5)
- changed the directory structure of the package; was updated accordingly.
- corrected bug in RFE that affected cases where the number of desired features
was greater than the number of features in the dataset.
- corrected a couple of import issues related to the transition from Numeric to numpy

PyML 0.6.7

- numpy is now used instead of Numeric
- featureCount/featureCounts for the python containers
- rewrote RegressionResults (thanks to Jian Qiu for his contribution).
  still some work to be done to make it really useful
- datasets can be read from gzipped files.
- took care of some compilation warnings.
- assess.significance now also reports the medians of the statistics.
- KNN now accepts a generic dataset container.
- there is no longer a need to explicitly set an optimizer for non-vector
- cv and stratified CV can now do partial CV.

PyML 0.6.6

- cleaned up copy constructor of KernelData, and unified it with the standard
  copy constructor.
- fixed bug in using libsvm solver with 'Cosine' kernel.
- assess.loadResults2 / assess.saveResultsObjects now handle lists of lists of 
  objects, or dictionaries of lists of results objects
- fixed bug in getKernelMatrix; the last element of the diagonal was missing.
- fixed bug in assess.loo
- ROC score is not shown for multi class problems.
- Platt object now tries a couple of times before it raises an exception
- ker.sortKernel2 -- sorts a kernel matrix according to a given list of patterns
- when non-normalized roc curve is computed the normalized area is returned.

PyML 0.6.5

- added multi.allOneAgainstRest for performing all one-against-rest classification
- made sure that when making a dataset from an array, that it's a Numeric array
- revamping of the ridge regression classifier (it's still pure python)
- numFolds in cv and stratifiedCV can also be set using a keyword parameter
- added a baseClassifiers.IteratorClassifier from which modelSelection.Param and
  composite.FeatureSelectAll inherit
- added FeatureSelectAll -- for a backward/forward selection method, this looks at
  accuracy as a function of the number of features.
- corrected bug in setting of k in KNN
- corrected bug in Gaussian kernel for the pure python datasets
- -- a module for playing with data in two dimensions

PyML 0.6.4

- more options in Platt classifiers (see documentation for details)
- added a 'split' method to the datafunc module; it randomly splits a dataset into two parts
- corrected bug in nCV (setting of number of iterations; this is now a keyword argument)
- Better parsing of pairData input files
- Corrected refcount initialization in KernelData


- Corrected bug in CSVparser.readLabels
- Corrected bug in Results object in a case of one fold
- Merged the KNN and KNNC classes so that the user doesn't
  need to worry about that.

PyML 0.6.3

- Corrected bugs in CSV parser
  autodetect of a header row has proven problematic.
  if your CSV file has a header row you now need to specify that 
  as a keyword argument (headerRow = True).  the header row allows you
  to specify feature IDs (think of it as an excel spreadsheet).
- corrected bug in construction of a dataset from a Numeric array

PyML 0.6.2

- More cleaning up of the Results object
- replaced matlab with pylab
- added a getKernel method to the DataSet interface
- added a ker.showKernel function that displays a kernel matrix
- corrected bug in saving results object (not enough attributes were saved)

PyML 0.6.1

- Added save method for Platt classifier
- PyML no longer works with python 2.2 (sub-classing of the 'list' object
  has a problem in that version of python).
- fixed problem in Aggregate -- it broke down when the datasets from which
  it was constructed went out of scope.  other improvements to this class.
- fixed bug for reconstructing saved nonlinear SVM.
- Saved SVM had a problem with unlabeled data since it didn't reload its
  classLabels attribute -- bug fixed.
- Results object has been re-designed.
  The Results object IS a list of Results_
  objects that each provide the results on a different part of the
  data as is the case when doing CV.
  ROC scores computed using CV were to computed in previous versions by 
  pooling the decision function values and computing an pooled ROC curve.
  This has been changed to an average ROC score that is the average
  of the ROC areas from the different folds.
- Results object has a 'toFile' method for saving to an ascii file
  rather than pickling the object
- Added a ResultsList object (used for combining CV results on the
  same dataset).  nCV is now a classifier method and returns a
  ResultsList object rather than a list of Results objects
- Gnuplot related functions have been removed.  If you use them, just
  get them from from an older version.
- the 'xml' option in saving a results object has been discontinued.
- confusionMatrix attribute in a Results object is now a list rather than
  a Numeric array for safety in pickling
- the sparse parser has a 'sparsify' keyword argument that is "False"
  by default.  "sparsify" means not add features whose value in the file
  is 0.
- improved the way ModelSelector recognizes selection via roc scores
- composite.Platt2 - a classifier for rescaling discriminant values into 
  probability estimates based on Platt's method, but with better numerical
  stability (contributed by Tobias Mann).

PyML 0.6.0

- addFeature method for the sparse dataset containers
- registered attributes: the user can register attributes to a container;
  those attributes are then copied on copy construction
- eliminated the "aux" attribute from a dataset -- no need for that now
  that registeredAttributes are available
- the interface for "testing" of a dataset has changed:  the input for
  the test function includes the training data.  The "test" function
  of the classifier calls the "test" function of the dataset; therefore 
  the training data is now saved in the classifier
- corrected error in in definition of pairdataset extension
- "save" method of SVM objects now working.  Thanks to Olav Zimmerman
  for his input.
- timing information for training and testing a classfier are saved in the
  Result object's log or cvLog attributes
- Corrected bug in parsing space delimited files
- reading files in gist format (including label files)
- added datafunc.sample for sampling a dataset
- CV/stratifiedCV can save intermediate results
- the Labels constructor and the dataset constructors take a keyword
  argument 'positiveClass' that tells the Labels constructor which of
  the class labels is the positive class.  If the class labels are 
  +1 and -1 or 1 and -1, the positive class is detected automatically.
- assess.resultsByFold -- Given a Results object obtained by CV, divide 
  it into a list of Results objects for each fold.
- PyML is now supposed to work with 64-bit machines:
  the problem was that the hash values generated by python are 64-bit
  numbers.  These hash values are used as feature IDs in the SparseCDataSet,
  so those had to be changed into "long" instead of "int".  Using a "myint"
  proved problematic since arbitrary templates for vectors are not yet
  supported in swig.
- Plugged a couple of memory leaks

PyML 0.5.4

- corrected bug in roc50 calculation
- corrected bugs related to SVR
- gist and gradient descent solvers for svm formulation without a bias
- Results object has an attribute foldStart so that the patterns between
  in the range [foldStart[i]:foldStart[i+1]] were in cross-validation fold
  number i where i is in range(0,numFolds)
- Construction of a dataset from a Numeric array is now possible
- Improvements in assess.plotROC and assess.plotROCs functions

PyML 0.5.3

- Param and ParamGrid are iterators:, Param.stratifiedCV, Param.loo etc return an iterator:
  p = Parm(...)
  for result in :
- datafunc.BaseDataSet renamed datafunc.VectorDataSet, which is what it
  is.  datafunc.BaseDataSet is now a more generic base class
- a dataset container has a train method that calls the dataset's
  'trainingFunc' method with the keyword arguments it receives.
  a classifier first trains the data before training itself. 
  (BaseClassifier calls the data's train method).
  naturally, there is also a "test" method.
  assess.test calls a dataset's test method with the keyword arguments it
  received.  if you redefine a classifier's "test" methos be sure to call 
  the dataset's test method.
- a dataset's train/test/cv methods are called as: self.method(self, **args)
  rather than self.method(self, *options, **args)

PyML 0.5.2

- assess.cvFromFolds -- perform cv when training and testing examples for
  each fold are given
- assess.stratifiedCV now can receive a list of patterns that are to serve
  as training examples in all CV folds
- corrected bug in cosine kernel
- corrected bug in "normalize" function in the case of zero vectors.
- ker.combineKernels -- add or multiply kernels (instead of ker.addKernels)
- ker.sortKernel plus a few other more esoteric functions for dealing with
  kernel matrices
- patternID created by default for a dataset
- modelSelector object now also supports selection by area under the roc/rocN
- aggregate data object -- combines several c++ data objects into a single
  dataset; its dot product is the sum of the dot products of the individual
  dataset objects
- corrected bug in assess.scatter
- corrected bugs reported by users in ParamGrid and csv parser

PyML 0.5.1

- assess.plotROCs -- plotting of multiple ROC curves for easy comparison
- corrected bugs in PairData
- better save method for vector-based containers
- corrected bug in getPattern method of SparseCDataSet
- datafunc -- labels object remembers what classes it was made from
  in the case of copy construction.

PyML 0.5

C++ part received a major upgrade:
- new level of abstraction for dataset objects -- generic interface for dataset
  and kernel objects allows "plug and play" data and kernel objects.
  These work with the PyML SMO object for SVM training (code based on libsvm).
- Cleaner wrapping of the C++ dataset container
- KernelData C++ container for an in-memory kernel matrix.
  The dotProduct function returns the appropriate entry of the kernel matrix.
  Like all containers, a kernel function can be defined on top of the dotProduct.

minor changes:
- Results object can be constructed from a Container object
- data object supports an "aux" attribute which is a list which stores auxiliary
  information on each pattern.
  uses -- a value of "C" for svm training.
- write kernel to a file, now directly in C++ (
- adding kernels that are stored in a file (
- Data container for data whose patterns are composed of pairs of objects

- removed some "warts" from the code with the help of pychecker
- updated documentation

unreleased code:
Random Forests

PyML 0.43 (unreleased)

- preprocessing object 
- save method for svm classifiers (not really working yet)
- improved save method for dataset containers
- corrected bug in ModelSelector object (CV did not work)
- In some cases a decision function may return a 'nan'.  Python has
  some peculiarities in handling nan.  This situation is now handled.
- improved CSV parser; interface modified (indexing starts at 0).
- succ feature score
- Results object computes rocN scores where N is 50 by default

PyML 0.42

support for One-class SVM and SVM regression
speedup of reading csv files
memory save mode for linear svm, some memory saving in general and plugging
a couple of memory leaks
memory still leaks from the swig interface -- help would be apprecited.

PyML 0.41:

Correction of bugs introduced in 0.4

PyML 0.4:

New Features:

Addition of a sparse dataset implemented in C++ - PyML runs several times 
faster in some cases.
Use the KNNC nearest neighbor classifier with this container.

RFE and Filter methods do automatic selection of the number of features.

Multiplicative update (a.k.a zero-norm) feature selection.

Changes in the interfaces:

All the dataset classes now have a Kernel automatically attached to them, so
a dataset is actually a feature-space.

There is now one SVM class - the object knows what type of SVM (linear, 
nonlinear) to train according to the kernel attached to the data object on 
which it is trained.

Changes in the classifier interface:

The "test" method of a classifier receives a dataset and a list of patterns
on which to test.

The "classify" and "decisionFunc" members of a classifier receive as input
a dataset and the id of a pattern rather than the pattern itself.

A DataSet now supports kernel/similarity data by:
>>> d = datafunc.DataSet(K=K, L = L)
where K is a Numeric 2-d matrix, and L is a list of labels (optional).
Construction of a dataset from a Numeric array X is now:
>>> d = datafunc.DataSet(X=X, L = L)

This help support usage of precomputed kernels:
If you install William Stafford Noble's Gist package, you can train and
test SVMs on any kernel matrix that was computed in advance.

Changes in the "Result" class; also added better support for saving objects
so that they are recoverable even when changes are made in the code.


The Platt classifier (in - is basically a translation of the spider
code; it suffers from some numerical instability that sometimes makes
it break down.