Commit

Renamed HTMLCleaner to HTMLParser and moved it to a different namespace. Moving StringCleaner inside common.utilities. Removed the ClasspathSuite from the tests. Created the Extractable which is inherited by AbstractTextExtractor.
datumbox committed Mar 10, 2016
1 parent 391cfd4 commit 596ffd0
Showing 17 changed files with 742 additions and 797 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,9 +2,11 @@
 *.jar
 *.war
 *.ear
+*.iml

-/target/
+target/
 /.settings/
+/.idea/
 .classpath
 .project
 nbactions.xml
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,7 @@
 CHANGELOG
 =========

-Version 0.7.0-SNAPSHOT - Build 20160121
+Version 0.7.0-SNAPSHOT - Build 20160310
 ---------------------------------------

 - Rename the erase() method to delete() in all interfaces.
@@ -80,6 +80,10 @@ Version 0.7.0-SNAPSHOT - Build 20160121
 - Performed profiling and changed thread logic where necessary to improve speed.
 - The TextClassifierTest no longer passes a prefix in each test.
 - Added a new DataframeMapType feature switch. The purpose is to test the performance of HashMap+Index vs TreeMap for the Dataframe object.
+- Renamed HTMLCleaner to HTMLParser and moved it to a different namespace.
+- Moving StringCleaner inside common.utilities.
+- Removed the ClasspathSuite from the tests.
+- Created the Extractable which is inherited by AbstractTextExtractor.

 Version 0.6.1 - Build 20160102
 ------------------------------
5 changes: 0 additions & 5 deletions NOTICE
@@ -55,11 +55,6 @@ The following libraries are required for the tests of this project:
 * LICENSE: http://www.opensource.org/licenses/cpl.php (Common Public License Version 1.0)
 * HOMEPAGE: http://www.junit.org/

-* ClasspathSuite
-* COPYRIGHT: Copyright 2006 Johannes Link
-* LICENSE: http://www.apache.org/licenses/LICENSE-2.0.txt (Apache License, Version 2.0)
-* HOMEPAGE: https://github.com/takari/takari-cpsuite
-
 * Logback
 * COPYRIGHT: Copyright 1999 QOS.ch
 * LICENSE: http://logback.qos.ch/license.html (Eclipse Public License v1.0 / GNU Lesser General Public License version 2.1)
4 changes: 2 additions & 2 deletions README.md
@@ -15,7 +15,7 @@ The code is licensed under the [Apache License, Version 2.0](https://github.com/
 Version
 -------

-The latest version is 0.7.0-SNAPSHOT (Build 20160121).
+The latest version is 0.7.0-SNAPSHOT (Build 20160310).

 The [master branch](https://github.com/datumbox/datumbox-framework/tree/master) is the latest stable version of the framework. The [devel branch](https://github.com/datumbox/datumbox-framework/tree/devel) is the development branch. All the previous stable versions are marked with [tags](https://github.com/datumbox/datumbox-framework/releases).

@@ -65,7 +65,7 @@ The Framework can be improved in many ways and as a result any contribution is w
 Acknowledgements
 ----------------

-Many thanks to [Eleftherios Bampaletakis](http://gr.linkedin.com/pub/eleftherios-bampaletakis/39/875/551) for his invaluable input on improving the architecture of the Framework. Also many thanks to ej-technologies GmbH for providing a license for their [Java Profiler](http://www.ej-technologies.com/products/jprofiler/overview.html).
+Many thanks to [Eleftherios Bampaletakis](http://gr.linkedin.com/pub/eleftherios-bampaletakis/39/875/551) for his invaluable input on improving the architecture of the Framework. Also many thanks to ej-technologies GmbH for providing a license for their [Java Profiler](http://www.ej-technologies.com/products/jprofiler/overview.html) and to IntelliJ for providing a license for their [Java IDE](https://www.jetbrains.com/idea/).

 Useful Links
 ------------
1 change: 0 additions & 1 deletion TODO.txt
@@ -9,7 +9,6 @@ CODE IMPROVEMENTS
 - Test MapDB+async writes with SynchronizedBlocks.WITHOUT_SYNCHRONIZED flag.
 - Test DataframeMapType.TREEMAP vs DataframeMapType.HASHMAP performance.
 - Remove feature switches.
-- Upgrade maven-surefire to 2.19.1.
 - Add support for MapDB 3.0 once a stable version is released.

 - Add the ability to call Machine Learning algorithms from command line like in Mahout.
19 changes: 6 additions & 13 deletions pom.xml
@@ -77,10 +77,10 @@
     <properties>
         <!-- Build Plugins -->
         <java-version>1.8</java-version>
-        <maven-compiler-plugin-version>3.3</maven-compiler-plugin-version>
+        <maven-compiler-plugin-version>3.5.1</maven-compiler-plugin-version>
         <maven-javadoc-plugin-version>2.10.3</maven-javadoc-plugin-version>
-        <maven-source-plugin-version>2.4</maven-source-plugin-version>
+        <maven-source-plugin-version>3.0.0</maven-source-plugin-version>
-        <maven-surefire-plugin-version>2.18</maven-surefire-plugin-version>
+        <maven-surefire-plugin-version>2.19.1</maven-surefire-plugin-version>
         <maven-jar-plugin-version>2.6</maven-jar-plugin-version>
         <gmaven-plugin-version>1.5</gmaven-plugin-version>
         <license-maven-plugin-version>2.11</license-maven-plugin-version>
@@ -90,15 +90,14 @@
         <commons-lang-version>3.4</commons-lang-version>
         <commons-math-version>3.6</commons-math-version>
         <commons-csv-version>1.2</commons-csv-version>
-        <slf4j-api-version>1.7.13</slf4j-api-version>
+        <slf4j-api-version>1.7.18</slf4j-api-version>
         <libsvm-version>3.21</libsvm-version>
         <lpsolve-version>5.5.2.0</lpsolve-version>
-        <mapdb-version>1.0.8</mapdb-version>
+        <mapdb-version>1.0.9</mapdb-version>

         <!-- Test Dependencies -->
         <junit-version>4.12</junit-version>
-        <cpsuite-version>1.2.7</cpsuite-version>
-        <logback-classic-version>1.1.3</logback-classic-version>
+        <logback-classic-version>1.1.6</logback-classic-version>

         <!-- Configuration -->
         <gpg.keyname>7083A486</gpg.keyname>
@@ -319,12 +318,6 @@
             <version>${junit-version}</version>
             <scope>test</scope>
         </dependency>
-        <dependency>
-            <groupId>io.takari.junit</groupId>
-            <artifactId>takari-cpsuite</artifactId>
-            <version>${cpsuite-version}</version>
-            <scope>test</scope>
-        </dependency>
         <dependency>
             <groupId>ch.qos.logback</groupId>
             <artifactId>logback-classic</artifactId>
10 changes: 5 additions & 5 deletions src/main/java/com/datumbox/applications/nlp/CETR.java
@@ -25,8 +25,8 @@
 import com.datumbox.common.utilities.PHPMethods;
 import com.datumbox.framework.machinelearning.clustering.Kmeans;
 import com.datumbox.framework.statistics.descriptivestatistics.Descriptives;
-import com.datumbox.framework.utilities.text.cleaners.HTMLCleaner;
-import com.datumbox.framework.utilities.text.cleaners.StringCleaner;
+import com.datumbox.framework.utilities.text.parsers.HTMLParser;
+import com.datumbox.common.utilities.StringCleaner;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashMap;
@@ -152,7 +152,7 @@ public String extract(String html, CETR.Parameters parameters) {
             String row = rows.get(rowId);

             //extract the clear text from the selected row
-            row = StringCleaner.removeExtraSpaces(HTMLCleaner.extractText(row));
+            row = StringCleaner.removeExtraSpaces(HTMLParser.extractText(row));
             if(row.isEmpty()) {
                 continue;
             }
@@ -392,15 +392,15 @@ private int countNumberOfTags(String text) {
     }

     private int countContentChars(String text) {
-        return StringCleaner.removeExtraSpaces(HTMLCleaner.extractText(text)).length();
+        return StringCleaner.removeExtraSpaces(HTMLParser.extractText(text)).length();
     }

     private List<String> extractRows(String text) {
         return Arrays.asList(text.split("\n"));
     }

     private String clearText(String text) {
-        text = HTMLCleaner.removeNonTextTagsAndAttributes(text); //remove all the irrelevant HTML Tags that are not related to the text (such as forms, scripts etc)
+        text = HTMLParser.removeNonTextTagsAndAttributes(text); //remove all the irrelevant HTML Tags that are not related to the text (such as forms, scripts etc)
         if(PHPMethods.substr_count(text, '\n')<=1) { //if the document is in a single line (no spaces), then break it in order for this algorithm to work
             text = text.replace(">", ">\n");
         }
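
The `clearText` and `extractRows` methods in the CETR.java hunks work on rows: when the input has at most one newline, a line break is inserted after every `>` so that the row-based extraction has something to split. The following is a minimal sketch of that step in plain Java, not the framework's code. `RowSplitSketch`, `countChar`, and `toRows` are hypothetical names, and `countChar` only stands in for `PHPMethods.substr_count`.

```java
import java.util.Arrays;
import java.util.List;

public class RowSplitSketch {
    // Counts occurrences of a character; a stand-in for PHPMethods.substr_count (assumption).
    static int countChar(String text, char c) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    // Breaks a (nearly) single-line document after each '>' and splits it into rows.
    static List<String> toRows(String html) {
        if (countChar(html, '\n') <= 1) {
            html = html.replace(">", ">\n");
        }
        // String.split drops trailing empty strings, so a final "\n" adds no empty row.
        return Arrays.asList(html.split("\n"));
    }

    public static void main(String[] args) {
        for (String row : toRows("<p>Hello</p><p>World</p>")) {
            System.out.println(row);
        }
    }
}
```

Documents that already contain several line breaks are left untouched and are split on their existing newlines only.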
@@ -27,7 +27,7 @@
 import com.datumbox.framework.machinelearning.common.abstracts.wrappers.AbstractWrapper;
 import com.datumbox.framework.machinelearning.common.abstracts.datatransformers.AbstractTransformer;
 import com.datumbox.framework.machinelearning.common.interfaces.ValidationMetrics;
-import com.datumbox.framework.utilities.text.cleaners.StringCleaner;
+import com.datumbox.common.utilities.StringCleaner;
 import com.datumbox.framework.utilities.text.extractors.AbstractTextExtractor;
 import java.net.URI;
 import java.util.HashMap;
11 changes: 6 additions & 5 deletions src/main/java/com/datumbox/common/dataobjects/Dataframe.java
@@ -23,12 +23,12 @@
 import com.datumbox.common.persistentstorage.interfaces.DatabaseConnector.StorageHint;
 import com.datumbox.common.concurrency.StreamMethods;
 import com.datumbox.common.concurrency.ThreadMethods;
+import com.datumbox.common.interfaces.Extractable;
 import com.datumbox.development.switchers.DataframeMapType;
 import com.datumbox.development.switchers.DataframeMapTypeMark;
 import com.datumbox.development.switchers.SynchronizedBlocks;
 import com.datumbox.development.switchers.SynchronizedBlocksMark;
-import com.datumbox.framework.utilities.text.cleaners.StringCleaner;
-import com.datumbox.framework.utilities.text.extractors.AbstractTextExtractor;
+import com.datumbox.common.utilities.StringCleaner;
 import java.io.BufferedReader;
 import java.io.File;
 import java.io.FileInputStream;
@@ -87,7 +87,7 @@ map should have as index the names of each class and as values the URIs
 pass a single URI with null as key.
 The method requires as arguments a file with the category names and locations
-of the training files, an instance of a AbstractTextExtractor which is used
+of the training files, an instance of a TextExtractor which is used
 to extract the keywords from the documents and the Database Configuration
 Object.
 *
@@ -96,7 +96,7 @@ map should have as index the names of each class and as values the URIs
          * @param conf
          * @return
          */
-        public static Dataframe parseTextFiles(Map<Object, URI> textFilesMap, AbstractTextExtractor textExtractor, Configuration conf) {
+        public static Dataframe parseTextFiles(Map<Object, URI> textFilesMap, Extractable textExtractor, Configuration conf) {
             Dataframe dataset = new Dataframe(conf);
             Logger logger = LoggerFactory.getLogger(Dataframe.Builder.class);

@@ -833,7 +833,8 @@ public boolean hasNext() {
                 /** {@inheritDoc} */
                 @Override
                 public Record next() {
-                    return records.get(it.next());
+                    Integer id = it.next();
+                    return records.get(id);
                 }

                 /** {@inheritDoc} */
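
The `parseTextFiles` signature in Dataframe.java now takes an `Extractable` rather than an `AbstractTextExtractor`, so the method depends only on the interface. The diff does not show the interface itself, so the sketch below assumes a hypothetical single-method shape. `ExtractableSketch`, `SketchExtractable`, `SketchAbstractTextExtractor`, `WordCountExtractor`, and `parse` are illustrative names, not Datumbox APIs.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractableSketch {
    // Hypothetical stand-in for Extractable; the real interface is not shown in this diff.
    interface SketchExtractable {
        Map<String, Integer> extract(String text);
    }

    // Hypothetical abstract base class playing the role of AbstractTextExtractor.
    static abstract class SketchAbstractTextExtractor implements SketchExtractable {
        // Shared helper: lowercases the text and splits it on whitespace.
        protected String[] tokenize(String text) {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? new String[0] : trimmed.toLowerCase().split("\\s+");
        }
    }

    // Concrete extractor that counts how often each token occurs.
    static class WordCountExtractor extends SketchAbstractTextExtractor {
        @Override
        public Map<String, Integer> extract(String text) {
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String token : tokenize(text)) {
                counts.merge(token, 1, Integer::sum);
            }
            return counts;
        }
    }

    // Accepts the interface, so any implementation (or a lambda) can be passed in.
    static Map<String, Integer> parse(String document, SketchExtractable extractor) {
        return extractor.extract(document);
    }

    public static void main(String[] args) {
        System.out.println(parse("the cat and the hat", new WordCountExtractor()));
        // An extractor that does not extend the abstract base class also works.
        System.out.println(parse("a b", text -> Collections.singletonMap("length", text.length())));
    }
}
```

Under these assumptions, callers are no longer tied to the abstract class hierarchy, which matches the changelog entry stating that `Extractable` is inherited by `AbstractTextExtractor`.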
