-
Notifications
You must be signed in to change notification settings - Fork 490
OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence… #329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…Scanner and DefaultSDContextGenerator
72ee664 to
46b40b0
Compare
kottmann
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 lgtm, makes sense to use set instead
|
Given the goal of this improvement is to speed up, do you think below is a realistic test? Do you think it applies across other JVMs? import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import opennlp.tools.sentdetect.lang.Factory;
class Scratch {
private static final int ITERATIONS = 100_000_000;
private static Set<Character> eosCharacters;
public static void main(String[] args) {
eosCharacters = new HashSet<>();
for (char eosChar: Factory.defaultEosCharacters) {
eosCharacters.add(eosChar);
}
char[] cbuf = new char[20];
System.out.println("defaultEosCharacters");
for (char eos : Factory.defaultEosCharacters) {
Arrays.fill(cbuf, eos);
testBuffer(cbuf);
}
System.out.println("ptEosCharacters");
for (char eos : Factory.ptEosCharacters) {
Arrays.fill(cbuf, eos);
testBuffer(cbuf);
}
System.out.println("jpnEosCharacters");
for (char eos : Factory.jpnEosCharacters) {
Arrays.fill(cbuf, eos);
testBuffer(cbuf);
}
}
private static void testBuffer(char[] cbuf) {
System.out.println("Testing with: " + new String(cbuf));
{
long start = System.currentTimeMillis();
for (int n = 0; n < ITERATIONS; n++) {
getPositionsArray(cbuf);
}
long duration = System.currentTimeMillis() - start;
System.out.println("Duration array (ms): " + duration);
}
{
long start = System.currentTimeMillis();
for (int n = 0; n < ITERATIONS; n++) {
getPositionsHashset(cbuf);
}
long duration = System.currentTimeMillis() - start;
System.out.println("Duration set (ms): " + duration);
}
}
public static List<Integer> getPositionsArray(char[] cbuf) {
List<Integer> l = new ArrayList<>();
char[] eosCharacters = Factory.defaultEosCharacters;
for (int i = 0; i < cbuf.length; i++) {
for (char eosCharacter : eosCharacters) {
if (cbuf[i] == eosCharacter) {
l.add(i);
break;
}
}
}
return l;
}
public static List<Integer> getPositionsHashset(char[] cbuf) {
List<Integer> l = new ArrayList<>();
for (int i = 0; i < cbuf.length; i++) {
if (eosCharacters.contains(cbuf[i])) {
l.add(i);
}
}
return l;
}
}"C:\Program Files\Java\jdk1.8.0_162\bin\java.exe" ....
defaultEosCharacters
Testing with: ....................
Duration array (ms): 16424
Duration set (ms): 25844
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 17498
Duration set (ms): 26696
Testing with: ????????????????????
Duration array (ms): 17948
Duration set (ms): 25391
ptEosCharacters
Testing with: ....................
Duration array (ms): 16975
Duration set (ms): 25442
Testing with: ????????????????????
Duration array (ms): 18012
Duration set (ms): 25529
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 17562
Duration set (ms): 25579
Testing with: ;;;;;;;;;;;;;;;;;;;;
Duration array (ms): 4040
Duration set (ms): 6223
Testing with: ::::::::::::::::::::
Duration array (ms): 3991
Duration set (ms): 6276
Testing with: ((((((((((((((((((((
Duration array (ms): 3980
Duration set (ms): 6185
Testing with: ))))))))))))))))))))
Duration array (ms): 4043
Duration set (ms): 6199
Testing with: ««««««««««««««««««««
Duration array (ms): 3971
Duration set (ms): 8503
Testing with: »»»»»»»»»»»»»»»»»»»»
Duration array (ms): 3960
Duration set (ms): 8587
Testing with: ''''''''''''''''''''
Duration array (ms): 3920
Duration set (ms): 5450
Testing with: """"""""""""""""""""
Duration array (ms): 3931
Duration set (ms): 5396
jpnEosCharacters
Testing with: 。。。。。。。。。。。。。。。。。。。。
Duration array (ms): 3974
Duration set (ms): 8616
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 3908
Duration set (ms): 9276
Testing with: ????????????????????
Duration array (ms): 3953
Duration set (ms): 9278
Process finished with exit code 0
|
|
Interesting! I'm not sure the above test is realistic because some conditions look extreme to me such as ITERATIONS = 100_000_000, testing against repetition of one of eos chars (e.g. Testing with: ....................), using only Factory.defaultEosCharacters which has three eos chars only. If I change the code as follows: I got the following result which shows opposite: But I think your feedback is very interesting and highly appreciated. Thank you. :) |
|
Let me follow your lead and let's take another look. import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import opennlp.tools.sentdetect.lang.Factory;
class Scratch {
private static final int ITERATIONS = 100_000;
private static final char[] EN_SAMPLE = ("I think you are better off sending an email to the solr-user mailing " +
"list (http://lucene.apache.org/solr/community.html#mailing-lists-irc) and explaining " +
"more about your use case so we can understand what leads up to the dump. Most likely you " +
"will find ways to reconfigure your cluster or queries in a way that avoids this situation. " +
"Or perhaps your cluster is simply under-dimensioned.").toCharArray();
private static final char[] PT_SAMPLE = ("A pré-história de Portugal é partilhada com a do resto da Península Ibérica. Os vestígios humanos modernos mais antigos conhecidos são de homens de Cro-Magnon com \"traços\" de Neanderthal, com 24 500 anos e que são interpretados como indicadores de extensas populações mestiças entre as duas espécies. São também os vestígios mais recentes de seres com caraterísticas de Neandertal que se conhece, provavelmente os últimos da sua espécie[28] Há cerca de 5500 a.C., surge uma cultura mesolítica.[29] Durante o Neolítico a região foi ocupada por pré-celtas e celtas, dando origem a povos como os galaicos, lusitanos e cinetes, e visitada por fenícios[30] e cartagineses. Os romanos incorporaram-na no seu Império como Lusitânia [31] (centro e sul de Portugal), após vencida a resistência onde se destacou Viriato.[29]\n"
+ "\n"
+ "No século III, foi criada a Galécia, a norte do Douro, a partir da Tarraconense, abrangendo o norte de Portugal. A romanização marcou a cultura, em especial a língua latina, que foi a base do desenvolvimento da língua portuguesa.[32]\n"
+ "\n"
+ "Com o enfraquecimento do império romano, a partir de 409, o território é ocupado por povos germânicos como vândalos na Bética, alanos que fixaram-se na Lusitânia e suevos na Galécia. Em 415 os visigodos entram na Península, a pedido dos romanos, para expulsar os invasores. Vândalos e alanos deslocam-se para o norte de África. Os suevos e visigodos fundam os primeiros reinos cristãos. Em 711 o território é conquistado pelos mouros que aí estabeleceram o Al-Andalus. Os cristãos recolhem-se para norte, acantonados no Reino das Astúrias. Em 868, durante a Reconquista, foi formado o Condado Portucalense.[33] ").toCharArray();
private static final char[] JP_SAMPLE = ("「にっぽん」、「にほん」と読まれる。どちらも多く用いられているため、日本政府は正式な読み方をどちらか一方には定めておらず、どちらの読みでも良いとしている[5]。\n"
+ "\n"
+ "7世紀の後半の国際関係から生じた「日本」国号は、当時の国際的な読み(音読)で「ニッポン」(呉音)ないし「ジッポン」(漢音)と読まれたものと推測される[6]。いつ「ニホン」の読みが始まったか定かでない。仮名表記では「にほん」と表記された。平安時代には「ひのもと」とも和訓されるようになった。\n"
+ "\n"
+ "室町時代の謡曲・狂言は、中国人に「ニッポン」と読ませ、日本人に「ニホン」と読ませている。安土桃山時代にポルトガル人が編纂した『日葡辞書』や『日本小文典』等には、「ニッポン」「ニホン」「ジッポン」の読みが見られ、その用例から判断すると、改まった場面・強調したい場合に「ニッポン」が使われ、日常の場面で「ニホン」が使われていた[7]。このことから小池清治は、中世の日本人が中国語的な語感のある「ジッポン」を使用したのは、中国人・西洋人など対外的な場面に限定されていて、日常だと「ニッポン」「ニホン」が用いられていたのでは、と推測している[8]。なお、現在に伝わっていない「ジッポン」音については、その他の言語も参照。 ").toCharArray();
private static final char[] TH_SAMPLE = ("สืบจากปัญหาเศรษฐกิจรุนแรงจากภาวะเศรษฐกิจตกต่ำครั้งใหญ่ และราคาข้าวตกลงอย่างรุนแรง นอกจากนี้ยังมีการลดรายจ่ายภาครัฐอย่างมากทำให้เกิดความไม่พอใจในหมู่อภิชน[19]:25 วันที่ 24 มิถุนายน 2475 คณะราษฎรนำปฏิวัติเปลี่ยนแปลงการปกครองจากสมบูรณาญาสิทธิราชย์มาเป็นระบอบประชาธิปไตย ทำให้คณะราษฎรเข้ามามีบทบาททางการเมือง ปลายปี 2476 เกิดกบฏบวรเดช ซึ่งหวังเปลี่ยนแปลงการปกครองกลับสู่สมบูรณาญาสิทธิราช แต่ล้มเหลว[28]:446–8 ปี 2477 พระบาทสมเด็จพระปกเกล้าเจ้าอยู่หัวมีความเห็นไม่ลงรอยกับรัฐบาลจึงทรงสละราชสมบัติในปี 2478 สภาผู้แทนราษฎรเลือกพระวรวงศ์เธอ พระองค์เจ้าอานันทมหิดลเป็นพระมหากษัตริย์ ซึ่งขณะนั้นทรงศึกษาอยู่ในประเทศสวิสเซอร์แลนด์[28]:448–9\n"
+ "\n"
+ "เดือนธันวาคม 2481 พลตรีหลวงพิบูลสงครามได้เป็นนายกรัฐมนตรี เขาปราบปรามศัตรูทางการเมืองรวมทั้งต่อต้านราชวงศ์อย่างเปิดเผย[28]:457 รัฐบาลมีแนวคิดชาตินิยมและปรับให้เป็นตะวันตก และเริ่มดำเนินนโยบายต่อต้านจีนและฝรั่งเศส[19]:28 วันที่ 23 มิถุนายน 2482 มีการเปลี่ยนชื่อประเทศจาก \"สยาม\" มาเป็น \"ไทย\" ในปี 2484 เกิดสงครามขนาดย่อมขึ้นระหว่างวิชีฝรั่งเศสกับไทย ทำให้ไทยได้ดินแดนเพิ่มจากลาวและกัมพูชาช่วงสั้น ๆ[28]:462 วันที่ 8 ธันวาคม ปีเดียวกัน ประเทศญี่ปุ่นบุกครองไทย และรัฐบาลลงนามเป็นพันธมิตรทางทหารกับญี่ปุ่น และประกาศสงครามกับสหรัฐและสหราชอาณาจักร[28]:465 มีการตั้งขบวนการเสรีไทยขึ้นทั้งในและต่างประเทศเพื่อต่อต้านรัฐบาลและการยึดครองของญี่ปุ่น[28]:465–6 หลังสงครามยุติในปี 2488 ประเทศไทยลงนามความตกลงสมบูรณ์แบบเพื่อเลิกสถานะสงครามกับฝ่ายสัมพันธมิตร ").toCharArray();
private static Set<Character> eosCharacters;
private static char[] eosChars;
private static void setEosCharacters(char[] chars) {
eosChars = chars;
eosCharacters = new HashSet<>();
for (char eosChar: chars) {
eosCharacters.add(eosChar);
}
}
public static void main(String[] args) {
System.out.println("Language\tarray (ms)\tset (ms)");
testBuffer(EN_SAMPLE, "English", Factory.defaultEosCharacters);
testBuffer(PT_SAMPLE, "Portuguese", Factory.ptEosCharacters);
testBuffer(JP_SAMPLE, "Japanese", Factory.jpnEosCharacters);
testBuffer(TH_SAMPLE, "Thai", Factory.thEosCharacters);
}
private static void testBuffer(final char[] cbuf, final String langName, final char[] eosChars) {
setEosCharacters(eosChars);
long arrayDuration;
long setDuration;
{
long start = System.currentTimeMillis();
for (int n = 0; n < ITERATIONS; n++) {
getPositionsArray(cbuf);
}
arrayDuration = System.currentTimeMillis() - start;
}
{
long start = System.currentTimeMillis();
for (int n = 0; n < ITERATIONS; n++) {
getPositionsHashset(cbuf);
}
setDuration = System.currentTimeMillis() - start;
}
System.out.println(langName + "\t" + arrayDuration + "\t" + setDuration);
}
private static List<Integer> getPositionsArray(char[] cbuf) {
List<Integer> l = new ArrayList<>();
for (int i = 0; i < cbuf.length; i++) {
for (char eosCharacter : eosChars) {
if (cbuf[i] == eosCharacter) {
l.add(i);
break;
}
}
}
return l;
}
private static List<Integer> getPositionsHashset(char[] cbuf) {
List<Integer> l = new ArrayList<>();
for (int i = 0; i < cbuf.length; i++) {
if (eosCharacters.contains(cbuf[i])) {
l.add(i);
}
}
return l;
}
}
Perhaps the following questions might be relevant here:
What do you think? |
|
Any thoughts? |
|
Guys, any thoughts? If the testing is applicable, for the majority of languages the change in this PR has made the code slower. Should it be left like this? |
|
Sorry for the late reply as I've been busy for my project. Actually, when I merged this, I thought I doubted how this change improved the performance. Let me withdraw this once as I don't have any particular reason to stick this change. |
|
Perhaps we would need to write some JMH benchmarks to see any performance impacts / drawbacks. Using If this is still a concern, we should write some JMH benchmarks to see, if there is a real difference (aside from JVM quirks / optimizations). Thoughts? |
…Scanner
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.