TODO.txt

CODE IMPROVEMENTS
=================

- Rewrite the FeatureSelection package:
    - Improve the API of Feature Selection and how we handle different data types.
    - Refactor AbstractCategoricalFeatureSelector and simplify the method calls.
    - Refactor PCA to use Vectors and Matrices instead of arrays.
    - In AbstractCategoricalFeatureSelector and TFIDF we should keep track of the kept columns not the removedColumns.
- Consider dropping all the common.dataobjects and use their internalData directly instead.
- Refactor the statistics package and replace all the static methods with proper inheritance.
- Write generic optimizers instead of having optimization methods in the algorithms. Add the optimizers and regularization packages under mathematics.

- Consider moving storages in a separate module that is inherited by common.
- Consider moving all tests in a separate module.
- Run the tests with different configurations (one for each storage engine). Create a profile that runs them all & connect it with CI.


NEW FEATURES
============

- Create the following Numerical Scalers: PercentileScaler.

- Create a storage engine for MapDB 3 once caching & asynchronous writing is supported. Remove the HOTFIX for MapDB bug #664.
- Create a storage engine for BerkeleyDB.
- Add the ability to call Machine Learning algorithms from command line or Python:
    - https://pypi.python.org/pypi/javabridge
    - https://github.com/LeeKamentsky/python-javabridge/
    - https://github.com/fracpete/python-weka-wrapper


DOCUMENTATION
=============

- Improve the code documentation.
- Write How-to blog posts on building Text Classification models.
- Update the website and link directly to the latest and previous documentations.


NEW ALGORITHMS
==============

- Speed up LDA: http://www.cs.ucsb.edu/~mingjia/cs240/doc/273811.pdf
- Factorization Machines: http://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
- Develop the FunkSVD and PLSI as probabilistic version of SVD.
- Collaborative Filtering for Implicit Feedback Datasets: http://yifanhu.net/PUB/cf.pdf
- Write a Mixture of Gaussians clustering method.
- Include an anomaly detection algorithm.
- Provide a wrapper for DBSCANClusterer and NeuralNet implementations of Maths.
- Add the ability to search through the configuration space and find the best performing algorithmic configuration.


TO CHECK OUT
============

Linear Algebra
--------------

- JBLAS - Linear Algebra for Java:
	https://github.com/mikiobraun/jblas
	http://jblas.org/

Huge Collection libs, DBs and Storage
-------------------------------------

- Vanilla-java - HugeCollections:
    https://code.google.com/p/vanilla-java/wiki/HugeCollections

- Fastutil:
    http://fastutil.di.unimi.it/#install

- Joafip:
    http://joafip.sourceforge.net/javadoc/net/sf/joafip/java/util/PHashMap.html

- Chronicle Map:
    https://github.com/OpenHFT/Chronicle-Map/

- H2 Database:
    http://www.h2database.com/html/main.html