fix for TIKA-1787 contributed by Yueheng He #62

Closed

TaichiHo commented Nov 5, 2015

The build succeeds with Java 1.8.0_65.
To see the effect, create a text file like the following:

Good afternoon Rajat Raina, how are you today? Hi, I am Tom Brady. I go to school at Stanford University, which is located in California.

Save it as test.ner and feed it to Tika:

java -classpath tika-app/target/tika-app-1.12-SNAPSHOT.jar org.apache.tika.cli.TikaCLI -m test.ner

The result should look like this:

Content-Length: 137
Content-Type: application/stanford-ner
LOCATION: [California]
ORGANIZATION: [Stanford University]
PERSON: [Rajat Raina, Tom Brady]
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.stanfordNer.StanfordNerParser
resourceName: test.ner

Gagravarr commented Nov 9, 2015

Contributor

On the whole, we try to avoid raw un-prefixed metadata keys. We do have some, mostly older ones or ones from other standards, as you've seen! But ideally we use a prefix so that people know which standard to check to understand what a key means. For example, we prefer dc:title to title: the former makes it clearer what the key means, and the latter is now deprecated.

Any chance you could replace keys like ORGANIZATION with something more like prefix:name, based on a well-known external metadata definition? (You may end up using multiple prefixes.)
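
A minimal sketch of the renaming suggested above, using only the Java standard library. It is not Tika API: the class name `PrefixedNerKeys`, the method `prefixKeys`, and the `ner:` prefix are all hypothetical, chosen for illustration. A real prefix should come from a well-known external metadata definition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: renames raw NER metadata keys to prefixed keys. */
final class PrefixedNerKeys {

    // Hypothetical prefix; not taken from any metadata standard.
    static final String PREFIX = "ner:";

    // Raw keys that appear in the example output earlier in this thread.
    static final Set<String> RAW_NER_KEYS =
            new HashSet<>(Arrays.asList("PERSON", "ORGANIZATION", "LOCATION"));

    /** Returns a copy of the metadata with NER keys prefixed; other keys are unchanged. */
    static Map<String, List<String>> prefixKeys(Map<String, List<String>> metadata) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            String key = RAW_NER_KEYS.contains(entry.getKey())
                    ? PREFIX + entry.getKey()
                    : entry.getKey();
            result.put(key, new ArrayList<>(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", Arrays.asList("application/stanford-ner"));
        metadata.put("LOCATION", Arrays.asList("California"));
        metadata.put("ORGANIZATION", Arrays.asList("Stanford University"));
        metadata.put("PERSON", Arrays.asList("Rajat Raina", "Tom Brady"));

        // Print each key with its values, one key per line.
        prefixKeys(metadata).forEach((key, values) -> System.out.println(key + ": " + values));
    }
}
```

With a single prefix the mapping is one string concatenation. If the entity types end up mapped onto several standards, the set of raw keys could become a map from each raw key to its full prefixed name.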

TaichiHo commented Nov 10, 2015

Thank you for your suggestion. What do you think would be a good prefix in this case?

The other thing is that the way I integrated this might not be possible due to the GPL license, as pointed out in https://issues.apache.org/jira/browse/TIKA-1787. I am not sure whether I should continue working on this.

The other pull request, #61, uses a different approach. You might want to look at that.

Gagravarr commented Nov 10, 2015

Contributor

License-wise, GPL dependencies are a no-go for ASF projects. You could maintain it independently, though; users would then need to review the license and install it manually if they are able to abide by the GPL restrictions. http://wiki.apache.org/tika/3rd%20party%20parser%20plugins is where we maintain the list of incompatibly licensed plugins.

In terms of the prefix, I don't know; you're the expert in the tool rather than me! You'll need to read the tool's documentation and find out from it what the various outputs mean. If it defines them against a well-known external standard, then great! Use that. If not, see if any well-known metadata standards cover the same logical things, and map onto those.

TaichiHo commented Nov 11, 2015

Thanks so much. I will look into it.

asfgit closed this in 48151b4 Nov 17, 2015

asfgit pushed a commit that referenced this pull request Nov 18, 2015

Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1714931 13f79535-47bb-0310-9956-ffa450edef68

tballison pushed a commit to tballison/tika that referenced this pull request Feb 26, 2016

Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1714835 13f79535-47bb-0310-9956-ffa450edef68

tballison pushed a commit to tballison/tika that referenced this pull request Feb 26, 2016

Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62

git-svn-id: https://svn.apache.org/repos/asf/tika/trunk@1714931 13f79535-47bb-0310-9956-ffa450edef68