Commit
CSV reader and converter to Dataset. Refactor code of DatasetBuilder and TypeInference.
datumbox committed Apr 17, 2015
1 parent 4fedd2b commit 24af41c
Showing 66 changed files with 350 additions and 190 deletions.
21 changes: 11 additions & 10 deletions CHANGELOG.md
@@ -1,22 +1,23 @@
CHANGELOG
=========

Version 1.0.0 - Build 20150415
Version 1.0.0 - Build 20150417
------------------------------

- Add support of [MapDB](http://www.mapdb.org/) database engine.
- Remove MongoDB support due to performance issues.
- Reduce the level of abstraction and simplify framework's architecture.
- Rewrite the persistance mechanisms, remove unnecessary data structures and features that increased the complexity.
- Change the public methods of the Machine Learning models to resemble Python's Scikit-Learn APIs.
- Change the software License from "GNU General Public License v3.0" to "Apache License, Version 2.0".
- Added support of [MapDB](http://www.mapdb.org/) database engine.
- Removed MongoDB support due to performance issues.
- Reduced the level of abstraction and simplified the framework's architecture.
- Rewrote the persistence mechanisms, removed unnecessary data structures and features that increased the complexity.
- Changed the public methods of the Machine Learning models to resemble Python's Scikit-Learn APIs.
- Changed the software License from "GNU General Public License v3.0" to "Apache License, Version 2.0".
- Added convenience methods to build a dataset from CSV or text files.

Version 0.5.1 - Build 20141105
------------------------------

- Updating the pom.xml file.
- Submitting the Datumbox Framework and the LPSolve library in Maven Central Repository.
- Resolving issues [#1](https://github.com/datumbox/datumbox-framework/issues/1), [#2](https://github.com/datumbox/datumbox-framework/issues/2) and [#4](https://github.com/datumbox/datumbox-framework/issues/4). All the dependencies are now on Maven.
- Updated the pom.xml file.
- Submitted the Datumbox Framework and the LPSolve library to the Maven Central Repository.
- Resolved issues [#1](https://github.com/datumbox/datumbox-framework/issues/1), [#2](https://github.com/datumbox/datumbox-framework/issues/2) and [#4](https://github.com/datumbox/datumbox-framework/issues/4). All the dependencies are now on Maven.

Version 0.5.0 - Build 20141018
------------------------------
2 changes: 1 addition & 1 deletion LICENSE
@@ -187,7 +187,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2013 Vasilis Vryniotis

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
5 changes: 5 additions & 0 deletions NOTICE
@@ -24,6 +24,11 @@ The following libraries are included in packaged versions of this project:
* LICENSE: http://www.apache.org/licenses/LICENSE-2.0.txt (Apache License, Version 2.0)
* HOMEPAGE: https://commons.apache.org/proper/commons-math/

* Apache Commons CSV
* COPYRIGHT: Copyright 2014 The Apache Software Foundation
* LICENSE: http://www.apache.org/licenses/LICENSE-2.0.txt (Apache License, Version 2.0)
* HOMEPAGE: http://commons.apache.org/proper/commons-csv/

* Guava
* COPYRIGHT: Copyright 2007 Google Inc.
* LICENSE: http://www.apache.org/licenses/LICENSE-2.0.txt (Apache License, Version 2.0)
2 changes: 1 addition & 1 deletion README.md
@@ -60,7 +60,7 @@ By far the most important part missing from the Framework is the Documentation a
Acknowledgements
----------------

Many thanks to [Eleftherios Bampaletakis](http://gr.linkedin.com/pub/eleftherios-bampaletakis/39/875/551) for his invaluable input on improving the architecture of the Framework. Also many thanks to ej-technologies GmbH for providing us with a license for their [Java Profiler](http://www.ej-technologies.com/products/jprofiler/overview.html).
Many thanks to [Eleftherios Bampaletakis](http://gr.linkedin.com/pub/eleftherios-bampaletakis/39/875/551) for his invaluable input on improving the architecture of the Framework. Also many thanks to ej-technologies GmbH for providing a license for their [Java Profiler](http://www.ej-technologies.com/products/jprofiler/overview.html).

Useful Links
------------
2 changes: 0 additions & 2 deletions TODO.txt
@@ -1,8 +1,6 @@
CODE IMPROVEMENTS
=================

- CSV reader and converter to Dataset.

- Improve Serialization by setting the serialVersionUID in every serializable class?
- Create better Exceptions and Exception messages.
- Add multithreading support.
7 changes: 6 additions & 1 deletion pom.xml
@@ -50,7 +50,6 @@
<url>http://gr.linkedin.com/pub/eleftherios-bampaletakis/39/875/551</url>
<roles>
<role>Java Consultant</role>
<role>Developer</role>
</roles>
</contributor>
</contributors>
@@ -84,6 +83,7 @@
<cpsuite-version>1.2.7</cpsuite-version>
<commons-lang-version>3.3.2</commons-lang-version>
<commons-math-version>3.4.1</commons-math-version>
<commons-csv-version>1.1</commons-csv-version>
<guava-version>18.0</guava-version>
<libsvm-version>3.17</libsvm-version>
<lpsolve-version>5.5.2.0</lpsolve-version>
@@ -267,6 +267,11 @@
<artifactId>commons-math3</artifactId>
<version>${commons-math-version}</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
<version>${commons-csv-version}</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
@@ -23,7 +23,6 @@
import com.datumbox.framework.machinelearning.common.bases.mlmodels.BaseMLmodel;
import com.datumbox.framework.machinelearning.common.bases.wrappers.BaseWrapper;
import com.datumbox.framework.machinelearning.common.bases.datatransformation.DataTransformer;
import com.datumbox.framework.utilities.dataset.DatasetBuilder;
import com.datumbox.framework.utilities.text.extractors.TextExtractor;
import java.net.URI;
import java.util.HashMap;
@@ -95,7 +94,7 @@ public void fit(Map<Object, URI> dataset, TrainingParameters trainingParameters)
textExtractor.setParameters(trainingParameters.getTextExtractorTrainingParameters());

//build trainingDataset
Dataset trainingDataset = DatasetBuilder.parseFromTextFiles(dataset, textExtractor, knowledgeBase.getDbConf());
Dataset trainingDataset = Dataset.Builder.parseTextFiles(dataset, textExtractor, knowledgeBase.getDbConf());

_fit(trainingDataset);

@@ -160,7 +159,7 @@ public Dataset predict(URI datasetURI) {
textExtractor.setParameters(trainingParameters.getTextExtractorTrainingParameters());

//build the testDataset
Dataset testDataset = DatasetBuilder.parseFromTextFiles(dataset, textExtractor, dbConf);
Dataset testDataset = Dataset.Builder.parseTextFiles(dataset, textExtractor, dbConf);

getPredictions(testDataset);

@@ -180,7 +179,7 @@ public BaseMLmodel.ValidationMetrics validate(Map<Object, URI> dataset) {


//build the testDataset
Dataset testDataset = DatasetBuilder.parseFromTextFiles(dataset, textExtractor, dbConf);
Dataset testDataset = Dataset.Builder.parseTextFiles(dataset, textExtractor, dbConf);

BaseMLmodel.ValidationMetrics vm = getPredictions(testDataset);

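For context, the calls above swap the old DatasetBuilder.parseFromTextFiles for the new Dataset.Builder.parseTextFiles introduced in this commit. Below is a minimal usage sketch, assuming a concrete TextExtractor and DatabaseConfiguration are obtained elsewhere; the class name, labels and file paths are illustrative only and not taken from this commit.

import com.datumbox.common.dataobjects.Dataset;
import com.datumbox.common.persistentstorage.interfaces.DatabaseConfiguration;
import com.datumbox.framework.utilities.text.extractors.TextExtractor;
import java.net.URI;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class TextImportExample { //hypothetical helper class, not part of the framework
    public static Dataset load(TextExtractor textExtractor, DatabaseConfiguration dbConf) {
        //one text file per class label; each line of a file becomes one Record of that class
        Map<Object, URI> textFilesMap = new HashMap<>();
        textFilesMap.put("positive", Paths.get("positive.txt").toUri()); //hypothetical paths
        textFilesMap.put("negative", Paths.get("negative.txt").toUri());

        return Dataset.Builder.parseTextFiles(textFilesMap, textExtractor, dbConf);
    }
}

As the Builder code further down shows, each map entry pairs a class label with the URI of a text file, and every line of that file is turned into one Record labelled with the class.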
@@ -15,7 +15,6 @@
*/
package com.datumbox.common.dataobjects;

import com.datumbox.common.utilities.TypeInference;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
@@ -15,7 +15,6 @@
*/
package com.datumbox.common.dataobjects;

import com.datumbox.common.utilities.TypeInference;
import java.util.Collection;
import java.util.Iterator;

133 changes: 129 additions & 4 deletions src/main/java/com/datumbox/common/dataobjects/Dataset.java
@@ -17,13 +17,26 @@

import com.datumbox.common.persistentstorage.interfaces.DatabaseConfiguration;
import com.datumbox.common.persistentstorage.interfaces.DatabaseConnector;
import com.datumbox.common.utilities.TypeInference;
import com.datumbox.framework.utilities.text.cleaners.StringCleaner;
import com.datumbox.framework.utilities.text.extractors.TextExtractor;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Serializable;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
*
* @author Vasilis Vryniotis <bbriniotis@datumbox.com>
@@ -32,6 +45,84 @@ public final class Dataset implements Serializable, Iterable<Integer> {

public static final String yColumnName = "~Y";
public static final String constantColumnName = "~CONSTANT";

public static final class Builder {

public static Dataset parseTextFiles(Map<Object, URI> textFilesMap, TextExtractor textExtractor, DatabaseConfiguration dbConf) {
Dataset dataset = new Dataset(dbConf);
Logger logger = LoggerFactory.getLogger(Dataset.Builder.class);

for (Map.Entry<Object, URI> entry : textFilesMap.entrySet()) {
Object theClass = entry.getKey();
URI datasetURI = entry.getValue();

logger.info("Dataset Parsing " + theClass + " class");

try (final BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(new File(datasetURI)), "UTF8"))) {
for (String line; (line = br.readLine()) != null;) {
dataset.add(new Record(new AssociativeArray(textExtractor.extract(StringCleaner.clear(line))), theClass));
}
}
catch (IOException ex) {
dataset.erase();
throw new RuntimeException(ex);
}
}

return dataset;
}

public static Dataset parseCSVFile(Reader reader, Map<String, TypeInference.DataType> headerDataTypes, char delimiter, char quote, String recordSeparator, DatabaseConfiguration dbConf) {
Logger logger = LoggerFactory.getLogger(Dataset.Builder.class);

logger.info("Parsing CSV file");

if (!headerDataTypes.containsKey(yColumnName)) {
logger.warn("WARNING: The file is missing the response variable column " + Dataset.yColumnName + ".");
}

Dataset dataset = new Dataset(dbConf, headerDataTypes.get(yColumnName), headerDataTypes); //use the private constructor to pass DataTypes directly and avoid updating them on the fly

CSVFormat format = CSVFormat
.RFC4180
.withHeader()
.withDelimiter(delimiter)
.withQuote(quote)
.withRecordSeparator(recordSeparator);

try (final CSVParser parser = new CSVParser(reader, format)) {
for (CSVRecord row : parser) {

if (!row.isConsistent()) {
logger.warn("WARNING: Skipping row " + row.getRecordNumber() + " because its size does not match the header size.");
continue;
}

Object y = null;
AssociativeArray xData = new AssociativeArray();
for (Map.Entry<String, TypeInference.DataType> entry : headerDataTypes.entrySet()) {
String column = entry.getKey();
TypeInference.DataType dataType = entry.getValue();

Object value = TypeInference.parse(row.get(column), dataType); //parse the string value according to the DataType
if (yColumnName.equals(column)) {
y = value;
}
else {
xData.put(column, value);
}
}
dataset._add(new Record(xData, y)); //use the internal _add() to avoid the update of the Metas. The Metas are already set in the construction of the Dataset.
}
}
catch (IOException ex) {
dataset.erase();
throw new RuntimeException(ex);
}
return dataset;
}

}

private Map<Integer, Record> recordList;

@@ -43,6 +134,11 @@ public final class Dataset implements Serializable, Iterable<Integer> {
private transient DatabaseConnector dbc;
private transient DatabaseConfiguration dbConf;

/**
* Public constructor.
*
* @param dbConf
*/
public Dataset(DatabaseConfiguration dbConf) {
//we don't need a unique name, because it is not used by the connector in the current implementations
//dbName = "dts_"+new BigInteger(130, RandomValue.getRandomGenerator()).toString(32);
@@ -56,6 +152,25 @@ public Dataset(DatabaseConfiguration dbConf) {
xDataTypes = dbc.getBigMap("tmp_xColumnTypes", true);
}

/**
* Private constructor used by the Builder inner static class.
*
* @param dbConf
* @param yDataType
* @param xDataTypes
*/
private Dataset(DatabaseConfiguration dbConf, TypeInference.DataType yDataType, Map<String, TypeInference.DataType> xDataTypes) {
this(dbConf);
this.yDataType = yDataType;
this.xDataTypes.putAll(xDataTypes);
this.xDataTypes.remove(yColumnName); //make sure to remove the response variable from the xDataTypes
}

/**
* Returns the type of the response variable.
*
* @return
*/
public TypeInference.DataType getYDataType() {
return yDataType;
}
@@ -261,10 +376,21 @@ public void recalculateMeta() {
* @return
*/
public Integer add(Record r) {
Integer newId=_add(r);
updateMeta(r);
return newId;
}

/**
* Adds the record in the dataset without updating the Meta. The add method
* returns the id of the new record.
*
* @param r
* @return
*/
private Integer _add(Record r) {
Integer newId=(Integer) recordList.size();
recordList.put(newId, r);
updateMeta(r);

return newId;
}

Expand All @@ -278,7 +404,6 @@ public Integer add(Record r) {
public Integer set(Integer rId, Record r) {
_set(rId, r);
updateMeta(r);

return rId;
}

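The new Dataset.Builder.parseCSVFile method above takes a Reader, a map of column names to TypeInference.DataType values, the delimiter, quote and record-separator characters, and a DatabaseConfiguration. A minimal sketch of how it might be called follows; the file name, column names and the NUMERICAL/CATEGORICAL constants are assumptions about the TypeInference.DataType enum, not something shown in this commit.

import com.datumbox.common.dataobjects.Dataset;
import com.datumbox.common.persistentstorage.interfaces.DatabaseConfiguration;
import com.datumbox.common.utilities.TypeInference;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class CsvImportExample { //hypothetical helper class, not part of the framework
    public static Dataset load(DatabaseConfiguration dbConf) throws Exception {
        //column name -> DataType map; "~Y" (Dataset.yColumnName) marks the response variable
        //the NUMERICAL/CATEGORICAL constants are assumed names for the TypeInference.DataType enum
        Map<String, TypeInference.DataType> headerDataTypes = new LinkedHashMap<>();
        headerDataTypes.put("age", TypeInference.DataType.NUMERICAL);
        headerDataTypes.put("city", TypeInference.DataType.CATEGORICAL);
        headerDataTypes.put(Dataset.yColumnName, TypeInference.DataType.CATEGORICAL);

        try (Reader reader = Files.newBufferedReader(Paths.get("people.csv"), StandardCharsets.UTF_8)) {
            //RFC 4180 style: comma delimiter, double-quote character, CRLF record separator
            return Dataset.Builder.parseCSVFile(reader, headerDataTypes, ',', '"', "\r\n", dbConf);
        }
    }
}

The ~Y column (Dataset.yColumnName) supplies the response variable; as the Builder code above shows, a file without it is still parsed but a warning is logged, and rows whose size does not match the header are skipped.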
@@ -15,7 +15,6 @@
*/
package com.datumbox.common.dataobjects;

import com.datumbox.common.utilities.TypeInference;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
@@ -15,7 +15,6 @@
*/
package com.datumbox.common.dataobjects;

import com.datumbox.common.utilities.TypeInference;
import java.util.Map;
import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.BlockRealMatrix;
4 changes: 2 additions & 2 deletions src/main/java/com/datumbox/common/dataobjects/Record.java
@@ -97,10 +97,10 @@ public boolean equals(Object obj) {
return false;
}
final Record other = (Record) obj;
if (!Objects.equals(this.x, other.x)) {
if (!Objects.equals(this.y, other.y)) {
return false;
}
if (!Objects.equals(this.y, other.y)) {
if (!Objects.equals(this.x, other.x)) {
return false;
}
return true;
