There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it's reached a state where it may be useful to others as a platform for research and experimentation.
There are several optimization algorithms available with the baseline being sparse gradient descent (GD) on a loss function (several are available), The code should be easily usable. Its only external dependence is on the boost library, which is often installed by default.
To build vw from source, in various environments, please follow the instructions on the getting started page.
- Getting Started
- Command line arguments
- Algorithm details (e.g., input format, loss functions)
- Coding Guidelines
- Feature Hashing and Extraction
- Multiclass Classification
- Learning to Search Sub-System
- Python interface for learning to search
- Connect with the community on Gitter
- Ask a question on StackOverflow with the
- Open an issue
There are several features that (in combination) can be powerful.
- Input Format. The input format for the learning algorithm is substantially more flexible than might be expected. Examples can have features consisting of free form text, which is interpreted in a bag-of-words way. There can even be multiple sets of free form text in different namespaces.
- Speed. The learning algorithm is pretty fast---similar to the few other online algorithm implementations out there. As one datapoint, it can be effectively applied on learning problems with a sparse terafeature (i.e. 1012 sparse features). As another example, it's about a factor of 3 faster than Leon Bottou's svmsgd on the RCV1 example in wall clock execution time.
- Scalability. This is not the same as fast. Instead, the important characteristic here is that the memory footprint of the program is bounded independent of data. This means the training set is not loaded into main memory before learning starts. In addition, the size of the set of features is bounded independent of the amount of training data using the hashing trick.
- Feature Pairing. Subsets of features can be internally paired so that the algorithm is linear in the cross-product of the subsets. This is useful for ranking problems. David Grangier seems to have a similar trick in the PAMIR code. The alternative of explicitly expanding the features before feeding them into the learning algorithm can be both computation and space intensive, depending on how it's handled.
Many people have contributed to the project at this point. John Langford, Alekh Agarwal, Miroslav Dudik, Daniel Hsu, Nikos Karampatziakis, Olivier Chapelle, Paul Mineiro, Matt Hoffman, Jake Hofman, Sudarshan Lamkhede, Shubham Chopra, Ariel Faigon, Lihong Li, Gordon Rios, and Alex Strehl have all worked on VW. Many others have contributed via feature requests, bug reports, or bug patches.
VW is also a vehicle for advanced research. The first public version containing hashing, caching, and true online learning was released in 2007. Since then, many different algorithms and results have influenced its design, including:
- Kai-Wei Chang, He He, Hal Daumé III, John Langford, Stephane Ross, A Credit Assignment Compiler for Joint Prediction, NIPS 2016.
- Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, John Langford Learning to Search Better Than Your Teacher, ICML 2015.
- Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford, A Reliable Effective Terascale Linear Learning System, 2011.
- M. Hoffman, D. Blei, F. Bach, Online Learning for Latent Dirichlet Allocation, in Neural Information Processing Systems (NIPS) 2010.
- Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang Agnostic Active Learning Without Constraints NIPS 2010.
- John Duchi, Elad Hazan, and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR 2011 & COLT 2010.
- H. Brendan McMahan, Matthew Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
- Nikos Karampatziakis and John Langford, Importance Weight Aware Gradient Updates UAI 2010.
- Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009.
- Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan, Hash Kernels for Structured Data, AISTAT 2009.
- John Langford, Lihong Li, and Tong Zhang, Sparse Online Learning via Truncated Gradient, NIPS 2008.
- Leon Bottou, Stochastic Gradient Descent, 2007.
- Avrim Blum, Adam Kalai, and John Langford Beating the Holdout: Bounds for KFold and Progressive Cross-Validation. COLT99 pages 203-208.
- Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35: 773–782.