OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence… #329

kojisekig · 2018-08-14T04:18:57Z

…Scanner

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

…Scanner and DefaultSDContextGenerator

kottmann

+1 lgtm, makes sense to use set instead

autayeu · 2018-10-03T10:35:22Z

Given the goal of this improvement is to speed up, do you think below is a realistic test? Do you think it applies across other JVMs?

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import opennlp.tools.sentdetect.lang.Factory;

class Scratch {

  private static final int ITERATIONS = 100_000_000;

  private static Set<Character> eosCharacters;

  public static void main(String[] args) {
    eosCharacters = new HashSet<>();
    for (char eosChar: Factory.defaultEosCharacters) {
      eosCharacters.add(eosChar);
    }

    
    char[] cbuf = new char[20];

    System.out.println("defaultEosCharacters");
    for (char eos : Factory.defaultEosCharacters) {
      Arrays.fill(cbuf, eos);
      testBuffer(cbuf);
    }

    System.out.println("ptEosCharacters");
    for (char eos : Factory.ptEosCharacters) {
      Arrays.fill(cbuf, eos);
      testBuffer(cbuf);
    }

    System.out.println("jpnEosCharacters");
    for (char eos : Factory.jpnEosCharacters) {
      Arrays.fill(cbuf, eos);
      testBuffer(cbuf);
    }
  }

  private static void testBuffer(char[] cbuf) {
    System.out.println("Testing with: " + new String(cbuf));
    {
      long start = System.currentTimeMillis();
      for (int n = 0; n < ITERATIONS; n++) {
        getPositionsArray(cbuf);
      }
      long duration = System.currentTimeMillis() - start;
      System.out.println("Duration array (ms): " + duration);
    }

    {
      long start = System.currentTimeMillis();
      for (int n = 0; n < ITERATIONS; n++) {
        getPositionsHashset(cbuf);
      }
      long duration = System.currentTimeMillis() - start;
      System.out.println("Duration set (ms): " + duration);
    }
  }

  public static List<Integer> getPositionsArray(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    char[] eosCharacters = Factory.defaultEosCharacters;
    for (int i = 0; i < cbuf.length; i++) {
      for (char eosCharacter : eosCharacters) {
        if (cbuf[i] == eosCharacter) {
          l.add(i);
          break;
        }
      }
    }
    return l;
  }

  public static List<Integer> getPositionsHashset(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      if (eosCharacters.contains(cbuf[i])) {
        l.add(i);
      }
    }
    return l;
  }
  
}

"C:\Program Files\Java\jdk1.8.0_162\bin\java.exe" ....
defaultEosCharacters
Testing with: ....................
Duration array (ms): 16424
Duration set (ms): 25844
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 17498
Duration set (ms): 26696
Testing with: ????????????????????
Duration array (ms): 17948
Duration set (ms): 25391
ptEosCharacters
Testing with: ....................
Duration array (ms): 16975
Duration set (ms): 25442
Testing with: ????????????????????
Duration array (ms): 18012
Duration set (ms): 25529
Testing with: !!!!!!!!!!!!!!!!!!!!
Duration array (ms): 17562
Duration set (ms): 25579
Testing with: ;;;;;;;;;;;;;;;;;;;;
Duration array (ms): 4040
Duration set (ms): 6223
Testing with: ::::::::::::::::::::
Duration array (ms): 3991
Duration set (ms): 6276
Testing with: ((((((((((((((((((((
Duration array (ms): 3980
Duration set (ms): 6185
Testing with: ))))))))))))))))))))
Duration array (ms): 4043
Duration set (ms): 6199
Testing with: ««««««««««««««««««««
Duration array (ms): 3971
Duration set (ms): 8503
Testing with: »»»»»»»»»»»»»»»»»»»»
Duration array (ms): 3960
Duration set (ms): 8587
Testing with: ''''''''''''''''''''
Duration array (ms): 3920
Duration set (ms): 5450
Testing with: """"""""""""""""""""
Duration array (ms): 3931
Duration set (ms): 5396
jpnEosCharacters
Testing with: 。。。。。。。。。。。。。。。。。。。。
Duration array (ms): 3974
Duration set (ms): 8616
Testing with: ！！！！！！！！！！！！！！！！！！！！
Duration array (ms): 3908
Duration set (ms): 9276
Testing with: ？？？？？？？？？？？？？？？？？？？？
Duration array (ms): 3953
Duration set (ms): 9278

Process finished with exit code 0

kojisekig · 2018-10-04T02:37:51Z

Interesting!

I'm not sure the above test is realistic because some conditions look extreme to me such as ITERATIONS = 100_000_000, testing against repetition of one of eos chars (e.g. Testing with: ....................), using only Factory.defaultEosCharacters which has three eos chars only.

If I change the code as follows:

  private static final int ITERATIONS = 10000;

  public static void main(String[] args) {
    // use Factory.ptEosCharacters rather than Factory.defaultEosCharacters
    eosCharacters = new HashSet<>();
    for (char eosChar: Factory.ptEosCharacters) {
      eosCharacters.add(eosChar);
    }

    // use normal sentences rather than ....................
    char[] cbuf = new String("I think you are better off sending an email to the solr-user mailing " +
        "list (http://lucene.apache.org/solr/community.html#mailing-lists-irc) and explaining " +
        "more about your use case so we can understand what leads up to the dump. Most likely you " +
        "will find ways to reconfigure your cluster or queries in a way that avoids this situation. " +
        "Or perhaps your cluster is simply under-dimensioned.").toCharArray();
    testBuffer(cbuf);
  }

  public static List<Integer> getPositionsArray(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    // use Factory.ptEosCharacters rather than Factory.defaultEosCharacters
    char[] eosCharacters = Factory.ptEosCharacters;
    for (int i = 0; i < cbuf.length; i++) {
      for (char eosCharacter : eosCharacters) {
        if (cbuf[i] == eosCharacter) {
          l.add(i);
          break;
        }
      }
    }
    return l;
  }

I got the following result which shows opposite:

Duration array (ms): 197
Duration set (ms): 73

But I think your feedback is very interesting and highly appreciated. Thank you. :)

autayeu · 2018-10-04T09:43:13Z

Let me follow your lead and let's take another look.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import opennlp.tools.sentdetect.lang.Factory;

class Scratch {

  private static final int ITERATIONS = 100_000;
  private static final char[] EN_SAMPLE = ("I think you are better off sending an email to the solr-user mailing " +
      "list (http://lucene.apache.org/solr/community.html#mailing-lists-irc) and explaining " +
      "more about your use case so we can understand what leads up to the dump. Most likely you " +
      "will find ways to reconfigure your cluster or queries in a way that avoids this situation. " +
      "Or perhaps your cluster is simply under-dimensioned.").toCharArray();
  private static final char[] PT_SAMPLE = ("A pré-história de Portugal é partilhada com a do resto da Península Ibérica. Os vestígios humanos modernos mais antigos conhecidos são de homens de Cro-Magnon com \"traços\" de Neanderthal, com 24 500 anos e que são interpretados como indicadores de extensas populações mestiças entre as duas espécies. São também os vestígios mais recentes de seres com caraterísticas de Neandertal que se conhece, provavelmente os últimos da sua espécie[28] Há cerca de 5500 a.C., surge uma cultura mesolítica.[29] Durante o Neolítico a região foi ocupada por pré-celtas e celtas, dando origem a povos como os galaicos, lusitanos e cinetes, e visitada por fenícios[30] e cartagineses. Os romanos incorporaram-na no seu Império como Lusitânia [31] (centro e sul de Portugal), após vencida a resistência onde se destacou Viriato.[29]\n"
      + "\n"
      + "No século III, foi criada a Galécia, a norte do Douro, a partir da Tarraconense, abrangendo o norte de Portugal. A romanização marcou a cultura, em especial a língua latina, que foi a base do desenvolvimento da língua portuguesa.[32]\n"
      + "\n"
      + "Com o enfraquecimento do império romano, a partir de 409, o território é ocupado por povos germânicos como vândalos na Bética, alanos que fixaram-se na Lusitânia e suevos na Galécia. Em 415 os visigodos entram na Península, a pedido dos romanos, para expulsar os invasores. Vândalos e alanos deslocam-se para o norte de África. Os suevos e visigodos fundam os primeiros reinos cristãos. Em 711 o território é conquistado pelos mouros que aí estabeleceram o Al-Andalus. Os cristãos recolhem-se para norte, acantonados no Reino das Astúrias. Em 868, durante a Reconquista, foi formado o Condado Portucalense.[33] ").toCharArray();
  private static final char[] JP_SAMPLE = ("「にっぽん」、「にほん」と読まれる。どちらも多く用いられているため、日本政府は正式な読み方をどちらか一方には定めておらず、どちらの読みでも良いとしている[5]。\n"
      + "\n"
      + "7世紀の後半の国際関係から生じた「日本」国号は、当時の国際的な読み（音読）で「ニッポン」（呉音）ないし「ジッポン」（漢音）と読まれたものと推測される[6]。いつ「ニホン」の読みが始まったか定かでない。仮名表記では「にほん」と表記された。平安時代には「ひのもと」とも和訓されるようになった。\n"
      + "\n"
      + "室町時代の謡曲・狂言は、中国人に「ニッポン」と読ませ、日本人に「ニホン」と読ませている。安土桃山時代にポルトガル人が編纂した『日葡辞書』や『日本小文典』等には、「ニッポン」「ニホン」「ジッポン」の読みが見られ、その用例から判断すると、改まった場面・強調したい場合に「ニッポン」が使われ、日常の場面で「ニホン」が使われていた[7]。このことから小池清治は、中世の日本人が中国語的な語感のある「ジッポン」を使用したのは、中国人・西洋人など対外的な場面に限定されていて、日常だと「ニッポン」「ニホン」が用いられていたのでは、と推測している[8]。なお、現在に伝わっていない「ジッポン」音については、その他の言語も参照。 ").toCharArray();
  private static final char[] TH_SAMPLE = ("สืบจากปัญหาเศรษฐกิจรุนแรงจากภาวะเศรษฐกิจตกต่ำครั้งใหญ่ และราคาข้าวตกลงอย่างรุนแรง นอกจากนี้ยังมีการลดรายจ่ายภาครัฐอย่างมากทำให้เกิดความไม่พอใจในหมู่อภิชน[19]:25 วันที่ 24 มิถุนายน 2475 คณะราษฎรนำปฏิวัติเปลี่ยนแปลงการปกครองจากสมบูรณาญาสิทธิราชย์มาเป็นระบอบประชาธิปไตย ทำให้คณะราษฎรเข้ามามีบทบาททางการเมือง ปลายปี 2476 เกิดกบฏบวรเดช ซึ่งหวังเปลี่ยนแปลงการปกครองกลับสู่สมบูรณาญาสิทธิราช แต่ล้มเหลว[28]:446–8 ปี 2477 พระบาทสมเด็จพระปกเกล้าเจ้าอยู่หัวมีความเห็นไม่ลงรอยกับรัฐบาลจึงทรงสละราชสมบัติในปี 2478 สภาผู้แทนราษฎรเลือกพระวรวงศ์เธอ พระองค์เจ้าอานันทมหิดลเป็นพระมหากษัตริย์ ซึ่งขณะนั้นทรงศึกษาอยู่ในประเทศสวิสเซอร์แลนด์[28]:448–9\n"
      + "\n"
      + "เดือนธันวาคม 2481 พลตรีหลวงพิบูลสงครามได้เป็นนายกรัฐมนตรี เขาปราบปรามศัตรูทางการเมืองรวมทั้งต่อต้านราชวงศ์อย่างเปิดเผย[28]:457 รัฐบาลมีแนวคิดชาตินิยมและปรับให้เป็นตะวันตก และเริ่มดำเนินนโยบายต่อต้านจีนและฝรั่งเศส[19]:28 วันที่ 23 มิถุนายน 2482 มีการเปลี่ยนชื่อประเทศจาก \"สยาม\" มาเป็น \"ไทย\" ในปี 2484 เกิดสงครามขนาดย่อมขึ้นระหว่างวิชีฝรั่งเศสกับไทย ทำให้ไทยได้ดินแดนเพิ่มจากลาวและกัมพูชาช่วงสั้น ๆ[28]:462 วันที่ 8 ธันวาคม ปีเดียวกัน ประเทศญี่ปุ่นบุกครองไทย และรัฐบาลลงนามเป็นพันธมิตรทางทหารกับญี่ปุ่น และประกาศสงครามกับสหรัฐและสหราชอาณาจักร[28]:465 มีการตั้งขบวนการเสรีไทยขึ้นทั้งในและต่างประเทศเพื่อต่อต้านรัฐบาลและการยึดครองของญี่ปุ่น[28]:465–6 หลังสงครามยุติในปี 2488 ประเทศไทยลงนามความตกลงสมบูรณ์แบบเพื่อเลิกสถานะสงครามกับฝ่ายสัมพันธมิตร ").toCharArray();

  private static Set<Character> eosCharacters;
  private static char[] eosChars;

  private static void setEosCharacters(char[] chars) {
    eosChars = chars;
    eosCharacters = new HashSet<>();
    for (char eosChar: chars) {
      eosCharacters.add(eosChar);
    }
  }

  public static void main(String[] args) {
    System.out.println("Language\tarray (ms)\tset (ms)");
    
    testBuffer(EN_SAMPLE, "English", Factory.defaultEosCharacters);

    testBuffer(PT_SAMPLE, "Portuguese", Factory.ptEosCharacters);

    testBuffer(JP_SAMPLE, "Japanese", Factory.jpnEosCharacters);

    testBuffer(TH_SAMPLE, "Thai", Factory.thEosCharacters);
  }

  private static void testBuffer(final char[] cbuf, final String langName, final char[] eosChars) {
    setEosCharacters(eosChars);

    long arrayDuration;
    long setDuration;
    {
      long start = System.currentTimeMillis();
      for (int n = 0; n < ITERATIONS; n++) {
        getPositionsArray(cbuf);
      }
      arrayDuration = System.currentTimeMillis() - start;
    }

    {
      long start = System.currentTimeMillis();
      for (int n = 0; n < ITERATIONS; n++) {
        getPositionsHashset(cbuf);
      }
      setDuration = System.currentTimeMillis() - start;
    }

    System.out.println(langName + "\t" + arrayDuration + "\t" + setDuration);
  }

  private static List<Integer> getPositionsArray(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      for (char eosCharacter : eosChars) {
        if (cbuf[i] == eosCharacter) {
          l.add(i);
          break;
        }
      }
    }
    return l;
  }

  private static List<Integer> getPositionsHashset(char[] cbuf) {
    List<Integer> l = new ArrayList<>();
    for (int i = 0; i < cbuf.length; i++) {
      if (eosCharacters.contains(cbuf[i])) {
        l.add(i);
      }
    }
    return l;
  }
  
}

Language	array (ms)	set (ms)
English	156	217
Portuguese	1071	819
Japanese	515	1249
Thai	1324	2917

Perhaps the following questions might be relevant here:

At which size HashSet becomes faster than array for lookups? Fundamentally, array involves only comparison while HashSet involves hash calculation, generic overheads and comparison.
What's the average eos character count across languages?
What's the distribution of eos character counts across languages?
For how many languages the boundary mentioned in the first question is crossed?
What's the frequency of each eos character in the target language?
What the distribution of eos characters in target language?
If the eos character count justifies the switch to HashSet (see first question), then does the form of eos character distribution still backs up the switch? Power law distributions might be common in language and the fat head of such distributions might still make array lookups faster than HashSet.
Which language is processed by OpenNLP the most? or rather:
What's the distribution of languages across OpenNLP users?
Based on the answers to the above, which change at what level would be the most beneficial for most our users? Would the change at language level be better than globally for all languages? For example, there is a opennlp.tools.sentdetect.lang.th.SentenceContextGenerator which sets its own eosCharacters. Just as well there could be a Portuguese version of sentence scanner which uses HashMap for lookups. By the way, there is a duplication of eosCharacters constant in Thai SentenceContextGenerator and an override of a deprecated method.

What do you think?

autayeu · 2018-10-07T13:18:09Z

Any thoughts?

autayeu · 2018-10-12T19:59:56Z

Guys, any thoughts?

If the testing is applicable, for the majority of languages the change in this PR has made the code slower. Should it be left like this?

kojisekig · 2018-10-15T06:58:18Z

Sorry for the late reply as I've been busy for my project.

Actually, when I merged this, I thought I doubted how this change improved the performance.

Let me withdraw this once as I don't have any particular reason to stick this change.

…SentenceScanner and DefaultSDContextGenerator (#329)" This reverts commit 51cbde6.

rzo1 · 2022-12-22T13:35:26Z

Perhaps we would need to write some JMH benchmarks to see any performance impacts / drawbacks. Using System.currentTimeMillis() has its own drawbacks if used for benchmarking.

If this is still a concern, we should write some JMH benchmarks to see, if there is a real difference (aside from JVM quirks / optimizations). Thoughts?

OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence…

46b40b0

…Scanner and DefaultSDContextGenerator

kojisekig force-pushed the OPENNLP-1214 branch from 72ee664 to 46b40b0 Compare September 27, 2018 07:01

kottmann approved these changes Oct 2, 2018

View reviewed changes

kojisekig merged commit 51cbde6 into apache:master Oct 3, 2018

kojisekig deleted the OPENNLP-1214 branch October 3, 2018 01:02

kojisekig added a commit that referenced this pull request Oct 15, 2018

Revert "OPENNLP-1214: use hash to avoid linear search in DefaultEndOf…

bf0d13b

…SentenceScanner and DefaultSDContextGenerator (#329)" This reverts commit 51cbde6.

kojisekig mentioned this pull request Oct 15, 2018

Revert "OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence…" #336

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence… #329

OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence… #329

Uh oh!

kojisekig commented Aug 14, 2018 •

edited

Loading

Uh oh!

kottmann left a comment

Uh oh!

autayeu commented Oct 3, 2018

Uh oh!

kojisekig commented Oct 4, 2018

Uh oh!

autayeu commented Oct 4, 2018

Uh oh!

autayeu commented Oct 7, 2018

Uh oh!

autayeu commented Oct 12, 2018

Uh oh!

kojisekig commented Oct 15, 2018

Uh oh!

rzo1 commented Dec 22, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence… #329

OPENNLP-1214: use hash to avoid linear search in DefaultEndOfSentence… #329

Uh oh!

Conversation

kojisekig commented Aug 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

For all changes:

For code changes:

For documentation related changes:

Note:

Uh oh!

kottmann left a comment

Choose a reason for hiding this comment

Uh oh!

autayeu commented Oct 3, 2018

Uh oh!

kojisekig commented Oct 4, 2018

Uh oh!

autayeu commented Oct 4, 2018

Uh oh!

autayeu commented Oct 7, 2018

Uh oh!

autayeu commented Oct 12, 2018

Uh oh!

kojisekig commented Oct 15, 2018

Uh oh!

rzo1 commented Dec 22, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kojisekig commented Aug 14, 2018 •

edited

Loading