Fix for TIKA-1787 : NamedEntityParser #61

Closed
thammegowda wants to merge 11 commits

thammegowda commented Oct 31, 2015

UPDATE: Wiki URL: https://wiki.apache.org/tika/TikaAndNER

Added NamedEntityParser, which supports loading different NER implementations at runtime.
A default NER implementation based on OpenNLP is supplied.

Another implementation, based on Stanford CoreNLP, is located here. It is licensed under GNU GPL 3, so it is kept separate. See UPDATE 2 below.

@chrismattmann This is not 100% complete; here are a few TODOs:

  1. The NER implementation class name should be read from the Tika config if possible/available. It currently relies on Java Properties. Please suggest how to resolve this TODO.
  2. EDIT (resolved): Looking for the best way to read parsed text from non-text streams within the NamedEntityParser (not sure whether a parser can read the output of previous parsers such as HTML or PDF). Resolution: using a secondary parser to get the text content.
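
The runtime loading described in the first TODO can be sketched roughly as follows. This is only an illustration, not the code in this PR: the `Recogniser` interface, the `NoOpRecogniser` fallback, and the property name `sketch.ner.impl.class` are hypothetical names invented for the example.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

public class NerLoaderSketch {

    /** Hypothetical stand-in for a recogniser contract; not Tika's actual interface. */
    public interface Recogniser {
        Map<String, Set<String>> recognise(String text);
    }

    /** Hypothetical fallback used when no implementation class is configured. */
    public static class NoOpRecogniser implements Recogniser {
        @Override
        public Map<String, Set<String>> recognise(String text) {
            return Collections.emptyMap();
        }
    }

    // Hypothetical property name, chosen only for this sketch.
    static final String IMPL_PROPERTY = "sketch.ner.impl.class";

    /** Instantiates the class named by the system property, or the fallback if unset. */
    public static Recogniser load() {
        String className = System.getProperty(IMPL_PROPERTY, NoOpRecogniser.class.getName());
        try {
            Class<?> clazz = Class.forName(className);
            // Requires an accessible no-argument constructor on the implementation.
            return (Recogniser) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot load NER implementation: " + className, e);
        }
    }

    public static void main(String[] args) {
        Recogniser recogniser = load();
        System.out.println(recogniser.getClass().getName());
    }
}
```

Reading the class name from a config file instead would only change where `className` comes from; the reflective instantiation would stay the same.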

UPDATE 1: Added a regex-based NER. It can recognize many more patterns than names (I am using it to recognize weapon names and weapon types).
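
A regex-based recogniser of the kind mentioned in this update might look like the minimal sketch below. The class name `RegexNerSketch`, the entity type labels, and the patterns are all made up for illustration; the PR's actual implementation and its pattern format may differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexNerSketch {

    // Entity type -> compiled pattern; insertion order is preserved.
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    /** Registers one pattern for an entity type (hypothetical API for this sketch). */
    public RegexNerSketch addPattern(String entityType, String regex) {
        patterns.put(entityType, Pattern.compile(regex));
        return this;
    }

    /** Returns every distinct match in the text, grouped by entity type. */
    public Map<String, Set<String>> recognise(String text) {
        Map<String, Set<String>> result = new HashMap<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            while (matcher.find()) {
                result.computeIfAbsent(entry.getKey(), k -> new HashSet<>())
                      .add(matcher.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RegexNerSketch ner = new RegexNerSketch()
                .addPattern("WEAPON_TYPE", "(?i)\\b(rifle|pistol|shotgun)\\b")
                .addPattern("YEAR", "\\b(19|20)\\d{2}\\b");
        System.out.println(ner.recognise("A rifle and a Pistol were listed in 2015."));
    }
}
```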

UPDATE 2: Added CoreNLP NER with runtime class binding. This one still uses the Java binding instead of command invocation, because:

  • The command-line binding is not portable across environments (or we would need to maintain that many ports).
  • It adds setup overhead in distributed environments such as Hadoop.

UPDATE 3: Chaining support:
We can now chain multiple NER implementations (OpenNLP, CoreNLP, RegEx) in the NamedEntityParser.
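
One way to picture the chaining is to run each recogniser over the same text and merge the results per entity type. The sketch below assumes that merging strategy; `NerChainSketch`, its `Recogniser` interface, and the toy `phrase` recogniser are hypothetical and are not the classes added in this PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NerChainSketch {

    /** Hypothetical recogniser contract used only in this sketch. */
    public interface Recogniser {
        Map<String, Set<String>> recognise(String text);
    }

    /** Toy recogniser that tags a fixed phrase when it occurs in the text. */
    public static Recogniser phrase(String entityType, String phrase) {
        return text -> {
            Map<String, Set<String>> found = new HashMap<>();
            if (text.contains(phrase)) {
                found.put(entityType, new HashSet<>(Collections.singleton(phrase)));
            }
            return found;
        };
    }

    /** Runs every recogniser in order and unions their results by entity type. */
    public static Map<String, Set<String>> recogniseAll(List<Recogniser> chain, String text) {
        Map<String, Set<String>> merged = new HashMap<>();
        for (Recogniser recogniser : chain) {
            for (Map.Entry<String, Set<String>> entry : recogniser.recognise(text).entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new HashSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Recogniser> chain = Arrays.asList(
                phrase("LOCATION", "Los Angeles"),
                phrase("LOCATION", "California"),
                phrase("DATE", "1960 - 1975"));
        System.out.println(recogniseAll(chain, "Los Angeles, California, 1960 - 1975"));
    }
}
```

Because the results are sets, an entity found by more than one recogniser in the chain is reported once.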

+
+import java.io.InputStream;
+import java.nio.charset.StandardCharsets;
+import java.util.*;

@chrismattmann

chrismattmann Nov 10, 2015

Contributor

please remove star imports

+import java.util.Arrays;
+import java.util.HashSet;
+
+import static org.junit.Assert.*;

@chrismattmann

chrismattmann Nov 10, 2015

Contributor

no star imports :)

+import java.util.HashSet;
+import java.util.Set;
+
+import static org.junit.Assert.*;

@chrismattmann

chrismattmann Nov 10, 2015

Contributor

no star imports

+ (prefixPath + "ner-location.bin"): (urlPrefix + "/en-ner-location.bin"),
+ (prefixPath + "ner-organization.bin"): (urlPrefix + "/en-ner-organization.bin"),
+ (prefixPath + "ner-date.bin"): (urlPrefix + "/en-ner-date.bin")/*,
+ (prefixPath + "ner-time.bin"): (urlPrefix + "/en-ner-time.bin"),

@chrismattmann

chrismattmann Nov 10, 2015

Contributor

no commented code

@@ -0,0 +1,11 @@
+<?xml version="1.0" encoding="UTF-8"?>

@chrismattmann

chrismattmann Nov 10, 2015

Contributor

ALv2 header please

+ set.clear();
+ set.addAll(Arrays.asList(md.getValues("NER_DATE")));
+ assertTrue(set.contains("1960 - 1975"));
+ //assertTrue(set.contains("1960"));

@chrismattmann

chrismattmann Nov 10, 2015

Contributor

no commented out code please

+ set.clear();
+ set.addAll(Arrays.asList(md.getValues("NER_LOCATION")));
+ assertTrue(set.contains("Los Angeles"));
+ //assertTrue(set.contains("California"));

@chrismattmann

chrismattmann Nov 10, 2015

Contributor

no commented out code

+
+import org.apache.tika.parser.ner.NERecogniser;
+
+import java.util.*;

@chrismattmann

chrismattmann Nov 10, 2015

Contributor

no star imports

Contributor

chrismattmann commented Nov 10, 2015

@thammegowda great work! See my comments and please update. Thank you.

Resolved Code formatting issues
+ Removed star imports
+ Removed dead code / commented code
+ Added License header to missing files
Contributor

thammegowda commented Nov 10, 2015

@chrismattmann Thanks for the feedback. Issues Resolved!

@thammegowda thammegowda changed the title from NamedEntityParser to Fix for TIKA-1787 : NamedEntityParser Nov 11, 2015

+ file.getParentFile().mkdirs()
+ inStream = urlConn.getInputStream()
+ outStream = new FileOutputStream(file)
+ //IOUtils.copyLarge(inStream, outStream)

@chrismattmann

chrismattmann Nov 16, 2015

Contributor

@thammegowda can you remove this line? commented code.

Contributor

chrismattmann commented Nov 16, 2015

one more minor update @thammegowda and this is ready to go!

Contributor

chrismattmann commented Nov 16, 2015

@thammegowda can you also write up a quick tutorial on http://wiki.apache.org/tika/TikaAndNER that shows how to install Stanford NER and run this?

Contributor

chrismattmann commented Nov 16, 2015

You will need wiki karma, so let me know your username and I'll grant you karma.

Contributor

thammegowda commented Nov 16, 2015

@chrismattmann Sure thing. I might have missed a few such comments. I will review one more time.

Please give me permission to create/edit the NER wiki page; my username is "ThammeGowda".

@asfgit asfgit closed this in 48151b4 Nov 17, 2015

asfgit pushed a commit that referenced this pull request Nov 18, 2015

Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1714931 13f79535-47bb-0310-9956-ffa450edef68

tballison pushed a commit to tballison/tika that referenced this pull request Feb 26, 2016

Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1714835 13f79535-47bb-0310-9956-ffa450edef68

tballison pushed a commit to tballison/tika that referenced this pull request Feb 26, 2016

Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1714931 13f79535-47bb-0310-9956-ffa450edef68