Preparing the new version.
datumbox committed Mar 19, 2016
1 parent 8d4ea04 commit 3f84b40
Showing 8 changed files with 35 additions and 99 deletions.
108 changes: 22 additions & 86 deletions CHANGELOG.md
@@ -1,93 +1,29 @@
CHANGELOG
=========

-Version 0.7.0-SNAPSHOT - Build 20160318
----------------------------------------
+Version 0.7.0 - Build 20160319
+------------------------------

-- Rename the erase() method to delete() in all interfaces.
-- None of the ML methods relies on recordIds for any calculation. Even the algorithms that use Matrixes are now patched.
-- Major refactoring of the Dataset class:
-    - It now implements the Collection<Record> interface and it is renamed as Dataframe.
-    - It allows remove operations.
-    - It always stores xDataTypes in memory.
-    - It stores in memory an index LinkedList which is used to return Records in insertion order.
-    - Renamed methods:
-        - getRecordNumber() -> size()
-        - extractXColumnValues() -> getXColumn()
-        - extractYValues() -> getYColumn()
-        - removeColumns() -> dropXColumns()
-        - getVariableNumber() -> xColumnSize()
-        - erase() -> delete()
-        - generateNewSubset() -> getSubset()
-        - _set() -> _unsafe_set()
-    - Removed methods:
-        - extractXColumnValuesByY()
-    - New methods:
-        - addRecord()
-        - indexOf()
-        - index()
-        - values()
-        - entries()
-        - Plus all others required for the Collection interface.
-    - Modified methods:
-        - iterator() no longer returns Integers but Records.
-        - add() no longer returns the ID of the record but boolean. To get the ID use addRecord().
-- Fixing a bug on Adaboost which resulted in mapping incorrectly the recordIds.
-- Added serialVersionUID in every serializable class.
-- Modified the algorithms that require Matrixes to use the sparse OpenMapRealMatrix instead of BlockRealMatrix.
-- Improved Exceptions and Exception messages.
-- Add support of Treemaps in the DatabaseConnector. Dataframe stores the records in TreeMap.
-- Use _unsafe_set() where possible to speed up the calculations.
-- BaseMLrecommender now inherits from BaseMLmodel.
-- CollaborativeFiltering algorithm modified to support more generic User-user CF and to return Validation Metrics.
-- Tests will now fail if the external datasets can't be downloaded.
-- Changed the Validation Metrics policy to exclude from macro averages the classes that did not exist in validation set.
-- Implemented StorageHints on the DatabaseConnectors and MapDB can use a hybrid approach of storing some data IN_MEMORY and some IN_DISK. Also support IN_CACHE for LRU cached maps.
-- The AutoCloseable interface is implemented in all close() methods.
-- Remove duplicate Javadocs comments.
-- Changed the way that single vars are stored with DatabaseConnectors. Multiple vars can now be stored. Also the save() and load() methods were renamed to saveObject() and loadObject().
-- Changed names of constants and methods that were not compatible with Java's naming conventions.
-- Changed the behaviour and the public methods of the DatabaseConnector interface. The dropDatabase() is replaced with a clear() method that deletes all the data but does not close the connection with the database.
-- KnowledgeBase is no longer serializable. Its serializable fields are stored individually into the Database.
-- Restructuring the framework to remove all FindBug warnings.
-- KnowledgeBase is now an interface, while the implementation moved to StandardKnowledgeBase. The interface contains a static factory method to produce any KB. This enable us to define the knowledgeBase field of BaseTrainable private and final.
-- All methods that have copy() now implement the Copyable interface.
-- Added the multi-thread support on the predict method of most ML models.
-- Updated the MapDBConnector to call compact on primary db before close.
-- Updated the DatabaseConnector.getBigMap() support thread-safe maps.
-- Simplify the architecture. Massive restructuring of packages and class names. Change the structure of Interfaces and inheritance in TP, MP, VM and Algorithms.
-- On Dataframe we now can set records on keys that do not already exist.
-- The NgramsExtractor was rewritten to remove its internal state.
-- Added a concurrency package on common with useful helper classes.
-- Added threads on the Builder of Dataframes, on the DataTransformation and FeatureSelection algorithms (both fit and transform).
-- Converted the big test Datasets to files stored as resources in the project. This includes all the files that we downloaded from the web.
-- Added the TrainParallelizable for the models that use parallelism during training.
-- Updated all Wrappers to inherit from Parallelizable and set the parallelized flags of the individual components before training and predict/transform.
-- Modified Dataframe to be usable without a synchronized block.
-- Created a basic feature switching component for the framework.
-- Removed the DOUBLE_ACCURACY_LOW property from the tests since it is no longer used and increased the accuracy of DOUBLE_ACCURACY_MEDIUM.
-- Added a skip/limit option on the Dataframe.Builder.parseCSVFile().
-- Added multithreading support on the training for many ML models.
-- Fixed the limitation on the clustering algorithms which forced us to store the clusters in memory.
-- Forced the headerDataTypes parameter of Dataframe.Builder.parseCSVFile() to be a LinkedHashMap.
-- Added custom thread pool in the parallel stream execution. Created the ForkJoinStream class and enhanced the ThreadMethods class.
-- Created a new Configuration mechanism and added the ability to configure the concurrency.
-- Updated the NgramsExtractor algorithm to export more keywords.
-- Removed the transacted option of MapDB.
-- The configuration of Tests is now controlled using a properties file.
-- Changed the names of setters/getters of Wrapper classes.
-- Increased the test-coverage.
-- Performed profiling and changed thread logic where necessary to improve speed.
-- The TextClassifierTest no longer passes a prefix in each test.
-- Added a new DataframeMapType feature switch. The purpose is to test the performance of HashMap+Index vs TreeMap for the Dataframe object.
-- Renamed HTMLCleaner to HTMLParser and moved it to a different namespace.
-- Moving StringCleaner inside common.utilities.
-- Removed the ClasspathSuite from the tests.
-- Created the Extractable which is inherited by AbstractTextExtractor.
-- Convert project into multimodule project.
-- Added support for L1, L2 and ElasticNet regularization.
-- Rename library to datumbox-framework-lib.
-- Change the structure and config files of persistent storage.
+- Speed & Memory:
+    - Added multi-threading support on the majority of algorithms and methods, making the 0.7.0 version several times faster than 0.6.x.
+    - Implemented Storage Hints & hybrid strategies which enable the efficient use of LRU cache and faster training for large datasets that don't fit in memory.
+    - All the algorithms which require Matrixes now use sparse implementations to reduce the memory footprint.
+    - Fixed a limitation on clustering algorithms which forced us to store the list of clusters in memory.
+- Algorithms & Methods:
+    - Added L1, L2 and ElasticNet regularization in Logistic, Ordinal and Linear Regression algorithms.
+    - The Collaborative Filtering algorithm was modified to support more generic User-user CF models.
+    - Updated the NgramsExtractor algorithm to export more keywords and provide better signals for NLP models.
+- Framework Architecture:
+    - The framework is now split to separate modules and the main library is renamed to "datumbox-framework-lib".
+    - The Dataset class is replaced with the Dataframe class, which implements the Collection interface and enables the processing of the records in parallel.
+    - Major changes on the structure of Interfaces and inheritance to simplify the architecture.
+    - BaseMLrecommender now inherits from BaseMLmodel.
+- Code Improvements & Bug Fixes:
+    - Added serialVersionUID in every serializable class.
+    - Improved Exceptions and error messages.
+    - Fixed a bug on Adaboost which resulted in mapping incorrectly the recordIds.
+    - Improved documentation and javadocs comments.
+    - Increased the test-coverage.
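
For reference, the L1, L2 and ElasticNet regularization named in the 0.7.0 notes can be sketched as a penalty term added to the training loss. The snippet below is a minimal illustration and not Datumbox code: the class `ElasticNetPenalty`, its method names, and the 0.5 factor on the L2 term are assumptions made for this example, since the changelog does not state the exact parameterization the framework uses.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of the Datumbox API. */
final class ElasticNetPenalty {

    private ElasticNetPenalty() {}

    /**
     * ElasticNet penalty: l1 * sum(|w_i|) + 0.5 * l2 * sum(w_i^2).
     * The 0.5 factor is a common convention and an assumption here.
     */
    static double penalty(double[] weights, double l1, double l2) {
        double absSum = 0.0;
        double squaredSum = 0.0;
        for (double w : weights) {
            absSum += Math.abs(w);
            squaredSum += w * w;
        }
        return l1 * absSum + 0.5 * l2 * squaredSum;
    }

    /**
     * (Sub)gradient of the penalty with respect to each weight.
     * At w_i = 0 the L1 term is not differentiable; 0 is used as the subgradient.
     */
    static double[] gradient(double[] weights, double l1, double l2) {
        double[] g = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            g[i] = l1 * Math.signum(weights[i]) + l2 * weights[i];
        }
        return g;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -2.0, 0.0};
        System.out.println("L1 only:    " + penalty(w, 0.5, 0.0));
        System.out.println("L2 only:    " + penalty(w, 0.0, 0.5));
        System.out.println("ElasticNet: " + penalty(w, 0.5, 0.5));
        System.out.println("Gradient:   " + Arrays.toString(gradient(w, 0.5, 0.5)));
    }
}
```

Setting `l2` to 0 gives a pure L1 penalty and setting `l1` to 0 gives a pure L2 penalty, so one function covers all three cases listed in the changelog.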

Version 0.6.1 - Build 20160102
------------------------------
6 changes: 3 additions & 3 deletions README.md
@@ -1,6 +1,6 @@
Datumbox Machine Learning Framework
===================================
-[![Build Status](https://api.travis-ci.org/datumbox/datumbox-framework.svg)](https://travis-ci.org/datumbox/datumbox-framework) [![Dependency Status](https://www.versioneye.com/java/com.datumbox:datumbox-framework/0.7.0-SNAPSHOT/badge.svg)](https://www.versioneye.com/java/com.datumbox:datumbox-framework/0.7.0-SNAPSHOT) [![License](https://img.shields.io/:license-apache-blue.svg)](./LICENSE)
+[![Build Status](https://api.travis-ci.org/datumbox/datumbox-framework.svg)](https://travis-ci.org/datumbox/datumbox-framework) [![Dependency Status](https://www.versioneye.com/java/com.datumbox:datumbox-framework/0.7.0/badge.svg)](https://www.versioneye.com/java/com.datumbox:datumbox-framework/0.7.0) [![License](https://img.shields.io/:license-apache-blue.svg)](./LICENSE)

[![Datumbox](http://www.datumbox.com/img/logo.png)](http://www.datumbox.com/)

@@ -16,7 +16,7 @@ The code is licensed under the [Apache License, Version 2.0](./LICENSE).
Version
-------

-The latest version is 0.7.0-SNAPSHOT (Build 20160318).
+The latest version is 0.7.0 (Build 20160319).

The [master branch](https://github.com/datumbox/datumbox-framework/tree/master) is the latest stable version of the framework. The [devel branch](https://github.com/datumbox/datumbox-framework/tree/devel) is the development branch. All the previous stable versions are marked with [tags](https://github.com/datumbox/datumbox-framework/releases).

@@ -32,7 +32,7 @@ Maven:
<dependency>
<groupId>com.datumbox</groupId>
<artifactId>datumbox-framework-lib</artifactId>
-<version>0.7.0-SNAPSHOT</version>
+<version>0.7.0</version>
</dependency>
```

6 changes: 3 additions & 3 deletions TODO.txt
@@ -1,8 +1,7 @@
CODE IMPROVEMENTS
=================

-- Update all maven plugins and dependencies to their latest versions.
-- Add support for MapDB 3.0 once a stable version is released. Remove the HOTFIX for MapDB bug #664.
+- Support MapDB 3.0 once a stable version is released. Remove the HOTFIX for MapDB bug #664.


NEW FEATURES
@@ -16,7 +15,7 @@ DOCUMENTATION
=============

- Improve the code documentation.
-- Write How-to blog post on building a Text Classification model.
+- Write How-to blog posts on building Text Classification models.


NEW ALGORITHMS
@@ -26,6 +25,7 @@ NEW ALGORITHMS
- Speed up LDA: http://www.cs.ucsb.edu/~mingjia/cs240/doc/273811.pdf
- Write a Mixture of Gaussians clustering method.
- Develop the FunkSVD and PLSI as probabilistic version of SVD.
+- Collaborative Filtering for Implicit Feedback Datasets: http://yifanhu.net/PUB/cf.pdf
- Include an anomaly detection algorithm.
- Add the ability to search through the configuration space and find the best performing algorithmic configuration.

2 changes: 1 addition & 1 deletion datumbox-framework-applications/pom.xml
@@ -26,7 +26,7 @@
<parent>
<groupId>com.datumbox</groupId>
<artifactId>datumbox-framework</artifactId>
-<version>0.7.0-SNAPSHOT</version>
+<version>0.7.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion datumbox-framework-common/pom.xml
@@ -26,7 +26,7 @@
<parent>
<groupId>com.datumbox</groupId>
<artifactId>datumbox-framework</artifactId>
-<version>0.7.0-SNAPSHOT</version>
+<version>0.7.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion datumbox-framework-core/pom.xml
@@ -26,7 +26,7 @@
<parent>
<groupId>com.datumbox</groupId>
<artifactId>datumbox-framework</artifactId>
-<version>0.7.0-SNAPSHOT</version>
+<version>0.7.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

2 changes: 1 addition & 1 deletion datumbox-framework-lib/pom.xml
@@ -26,7 +26,7 @@
<parent>
<groupId>com.datumbox</groupId>
<artifactId>datumbox-framework</artifactId>
-<version>0.7.0-SNAPSHOT</version>
+<version>0.7.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

6 changes: 3 additions & 3 deletions pom.xml
@@ -20,7 +20,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>com.datumbox</groupId>
<artifactId>datumbox-framework</artifactId>
-<version>0.7.0-SNAPSHOT</version>
+<version>0.7.0</version>
<packaging>pom</packaging>

<name>Datumbox Machine Learning Framework</name>
@@ -103,14 +103,14 @@
<maven-jar-plugin-version>2.6</maven-jar-plugin-version>
<gmaven-plugin-version>1.5</gmaven-plugin-version>
<license-maven-plugin-version>2.11</license-maven-plugin-version>
-<nexus-staging-maven-plugin-version>1.6.6</nexus-staging-maven-plugin-version>
+<nexus-staging-maven-plugin-version>1.6.7</nexus-staging-maven-plugin-version>
<gpg-plugin-version>1.6</gpg-plugin-version>

<!-- Code Dependencies -->
<commons-lang-version>3.4</commons-lang-version>
<commons-math-version>3.6</commons-math-version>
<commons-csv-version>1.2</commons-csv-version>
-<slf4j-api-version>1.7.18</slf4j-api-version>
+<slf4j-api-version>1.7.19</slf4j-api-version>
<libsvm-version>3.21</libsvm-version>
<lpsolve-version>5.5.2.0</lpsolve-version>
<mapdb-version>1.0.9</mapdb-version>