Add isArabic rule #577
Conversation
Should I also commit the changes to the classifiers files?
Yes, if classifiers needed to be regenerated, then they should be included.
This feels a bit hacky and ad-hoc to me. Additionally, it seems to break some tests: https://github.com/facebook/duckling/pull/577/checks?check_run_id=2019413081#step:6:1076 I think writing a more general word-boundary detection algorithm would serve us better, possibly with the locale as input.
I don't think it's ad-hoc, but Arabic has a set of proclitics/enclitics that makes tokenization a relatively hard problem (e.g:
Can you fix the failing test cases? I haven't looked into why they are failing. What I am interested in is: are they failing because this PR does something wrong, or are the tests themselves actually wrong?
O/
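The clitic problem raised above can be made concrete with a toy sketch. This is an illustration of the idea only, not the PR's code; the names `isArabicChar` and `startsAfterProclitic` are mine, and the letter list is deliberately restricted.

```haskell
-- Toy illustration of why Arabic tokenization needs clitic awareness:
-- the proclitic lam ('ل', "for") attaches directly to the following word,
-- so "لعام" ("for a year") contains "عام" ("year") with no whitespace
-- boundary before it. Naive whitespace tokenization cannot see this.
isArabicChar :: Char -> Bool
isArabicChar c = c `elem` "ابتةثجحخدذرزسشصضطظعغفقكلمنهيءآأإؤوئى"

-- A match starting at index i is plausible if it is at the start of the
-- text, preceded by a non-Arabic character, or preceded by a known
-- proclitic such as lam.
startsAfterProclitic :: String -> Int -> Bool
startsAfterProclitic s i =
  i == 0 || not (isArabicChar prev) || prev == 'ل'
  where
    prev = s !! (i - 1)

main :: IO ()
main = do
  print (startsAfterProclitic "لعام" 1)  -- preceded by proclitic lam: True
  print (startsAfterProclitic "سعام" 1)  -- synthetic string; preceded by an
                                         -- ordinary letter: False
```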
@chessai I have fixed all the failing cases except one (the only one that will, IMO, require changing the rules; discussed at the end of this comment). Let me explain the idea of the PR:
The current failing case is related to numbers that are multiples of a hundred.
Apart from that case, I believe this PR will fix lots of false positives that are currently reported by Duckling.
Awesome!
Thanks for the thorough explanation! It's very useful.
Given your knowledge of Arabic (which I do not really have), which do you think is more appropriate/makes more sense?
Amazing!
The only problem I have with the new direction is that I would like isRangeValid to take Locale as input, and case on the Locale to determine what set of predicates to use on the range. For example, internally I have a change to isRangeValid which fixes a lot of problems for Chinese and Georgian, but causes issues for English. For English, the current ruleset is more or less sufficient, so I'd like it to not change (and be the default). But for other languages we will need different support, as this PR and its discussion so clearly point out.
Hmm, yes, this makes sense. I have also added a not-so-smart way of matching numerals that are multiples of a hundred in the range 300-900. For this PR, I believe we can also add some false positives, and then IMHO it would be ready to merge.
@chessai Could you please review the PR and let me know if we should add samples to the negative corpora?
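As a rough numeric sketch of the "multiples of a hundred in 300-900" check mentioned above (a simplified illustration with a hypothetical name, not the PR's implementation):

```haskell
-- Hypothetical sketch of the "multiple of a hundred in 300..900" check
-- discussed above; not the PR's actual code.
isHundredsMultiple :: Double -> Bool
isHundredsMultiple v =
  v >= 300 && v <= 900      -- restrict to the discussed range
    && fromIntegral n == v  -- whole number only
    && n `mod` 100 == 0     -- multiple of a hundred
  where
    n = round v :: Integer

main :: IO ()
main = mapM_ (print . isHundredsMultiple) [300, 450, 900, 1000]
-- prints True, False, True, False
```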
I have an internal PR which will allow us to case on the language for isRangeValid. It should look like this:

isRangeValid :: Lang -> Document -> Int -> Int -> Bool
isRangeValid = \case
  AR -> arIsRangeValid
  _ -> defaultIsRangeValid
  where
    arIsRangeValid = ...your code...
Also, it would help me if we had plenty (or at least a handful) of examples of things that caused problems before, but won't cause problems now.
I think adding them as negative corpora is the way to do so.
Negative corpora should be fine. Also, isRangeValid in the master branch now takes Lang as input, so you can rebase and refactor accordingly.
@AMR-KELEG are you still interested in working on this? I'm very enthusiastic about this change.
@chessai I am pretty much drained currently, so I couldn't adapt the change earlier.
What is the build error you are getting?
@AMR-K could you rebase on top of master? Another language-dependent isRangeValid implementation landed. And then please add negative corpora. I was about to commandeer the diff internally, to rebase for you, but realised I don't know what negative corpora to add.
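For the negative-corpora step, a sketch based on Duckling's testing conventions might look like the following. The `NegativeCorpus` shape, `testContext`, and `testOptions` reflect my understanding of the codebase and should be checked against it, and the example string is a hypothetical entry:

```haskell
-- Hypothetical sketch of an Arabic negative-corpus entry; assumes
-- Duckling's testing convention where a NegativeCorpus is a
-- (Context, Options, [Text]) triple of strings that must NOT parse.
negativeCorpus :: NegativeCorpus
negativeCorpus =
  (testContext {locale = makeLocale AR Nothing}, testOptions, xs)
  where
    xs =
      [ "عامل" -- "worker": contains "عام" ("year") as a substring,
               -- but should not resolve to a time value
      ]
```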
Hi @chessai,
Hi @chessai |
I recommend loading the project into the repl via
if you haven't already. |
I would like to help push this PR forward, but I am not able to grasp Haskell overnight... Here is the debug output for the failing test on origin:

*Duckling.Debug> debug (makeLocale AR $ Just EG) "اخر اسبوع في سبتمبر لعام 2014" [Seal Time]
last <cycle> of <time> (اخر اسبوع في سبتمبر لعام 2014)
-- regex (اخر)
-- week (grain) (اسبوع)
-- -- regex (اسبوع)
-- regex (في)
-- intersect by ",", "of", "from", "'s" (سبتمبر لعام 2014)
-- -- September (سبتمبر)
-- -- -- regex (سبتمبر)
-- -- regex (ل)
-- -- year (integer) (عام 2014)
-- -- -- regex (عام)
-- -- -- integer (numeric) (2014)
-- -- -- -- regex (2014)
[
Entity {
dim = "time",
body = "\1575\1582\1585 \1575\1587\1576\1608\1593 \1601\1610 \1587\1576\1578\1605\1576\1585 \1604\1593\1575\1605 2014",
value = RVal Time (
TimeValue (
SimpleValue (
InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week}
)
)
[
SimpleValue (
InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week}
)
]
Nothing
),
start = 0,
end = 29,
latent = False,
enode = Node {
nodeRange = Range 0 29,
token = Token Time TimeData{
latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False
},
children = [
Node {nodeRange = Range 0 3, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {nodeRange = Range 4 9, token = Token TimeGrain Week, children =
[
Node {nodeRange = Range 4 9, token = Token RegexMatch (GroupMatch ["\1575","\1576\1608\1593"]), children = [], rule = Nothing}
],
rule = Just "week (grain)"},
Node {nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["\1601\1610"]), children = [], rule = Nothing},
Node {nodeRange = Range 13 29, token = Token Time TimeData{latent=False, grain=Month, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 13 19, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 9}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 13 19, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}
],
rule = Just "September"},
Node {nodeRange = Range 20 21, token = Token RegexMatch (GroupMatch [""]), children = [], rule = Nothing},
Node {nodeRange = Range 21 29, token = Token Time TimeData{latent=False, grain=Year, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 21 24, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {nodeRange = Range 25 29, token = Token Numeral (NumeralData {value = 2014.0, grain = Nothing, multipliable = False, okForAnyTime = True}), children =
[
Node {nodeRange = Range 25 29, token = Token RegexMatch (GroupMatch ["2014"]), children = [], rule = Nothing}
],
rule = Just "integer (numeric)"}
],
rule = Just "year (integer)"}
],
rule = Just "intersect by \",\", \"of\", \"from\", \"'s\""}
],
rule = Just "last <cycle> of <time>"
}
}
]

As for the debug output on the PR:

*Duckling.Debug> debug (makeLocale AR $ Just EG) "اخر اسبوع في سبتمبر لعام 2014" [Seal Time]
last <cycle> of <time> (اخر اسبوع في سبتمبر لعام 2014)
-- regex (اخر)
-- week (grain) (اسبوع)
-- -- regex (اسبوع)
-- regex (في)
-- intersect (سبتمبر لعام 2014)
-- -- <time> for <duration> (سبتمبر لعام)
-- -- -- September (سبتمبر)
-- -- -- -- regex (سبتمبر)
-- -- -- regex (ل)
-- -- -- single <unit-of-duration> (عام)
-- -- -- -- year (grain) (عام)
-- -- -- -- -- regex (عام)
-- -- year (2014)
-- -- -- integer (numeric) (2014)
-- -- -- -- regex (2014)
intersect by ",", "of", "from", "'s" (اخر اسبوع في سبتمبر لعام 2014)
-- last <cycle> of <time> (اخر اسبوع في سبتمبر)
-- -- regex (اخر)
-- -- week (grain) (اسبوع)
-- -- -- regex (اسبوع)
-- -- regex (في)
-- -- September (سبتمبر)
-- -- -- regex (سبتمبر)
-- regex (ل)
-- year (integer) (عام 2014)
-- -- regex (عام)
-- -- integer (numeric) (2014)
-- -- -- regex (2014)
[
Entity {
dim = "time",
body = "\1575\1582\1585 \1575\1587\1576\1608\1593 \1601\1610 \1587\1576\1578\1605\1576\1585 \1604\1593\1575\1605 2014",
value = RVal Time (
TimeValue (
SimpleValue (
InstantValue {
vValue = 2014-08-25 00:00:00 -0200, vGrain = Week}))
[
SimpleValue (
InstantValue {
vValue = 2014-08-25 00:00:00 -0200, vGrain = Week})
] Nothing),
start = 0,
end = 29,
latent = False,
enode = Node {
nodeRange = Range 0 29,
token = Token Time TimeData{
latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False},
children = [
Node {
nodeRange = Range 0 3, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {
nodeRange = Range 4 9, token = Token TimeGrain Week, children =
[
Node {
nodeRange = Range 4 9, token = Token RegexMatch (GroupMatch ["\1575","\1576\1608\1593"]), children = [], rule = Nothing}
],
rule = Just "week (grain)"},
Node {
nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["\1601\1610"]), children = [], rule = Nothing},
Node {nodeRange = Range 13 29, token = Token Time TimeData{latent=False, grain=Month, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 13 24, token = Token Time TimeData{latent=False, grain=Month, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 13 19, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 9}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 13 19, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "September"},
Node {
nodeRange = Range 20 21, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {
nodeRange = Range 21 24, token = Token Duration (DurationData {value = 1, grain = Year}), children =
[
Node {
nodeRange = Range 21 24, token = Token TimeGrain Year, children = [Node {nodeRange = Range 21 24, token = Token RegexMatch (GroupMatch [""]), children = [], rule = Nothing}], rule = Just "year (grain)"}
],
rule = Just "single <unit-of-duration>"}
], rule = Just "<time> for <duration>"},
Node {
nodeRange = Range 25 29, token = Token Time TimeData{latent=False, grain=Year, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 25 29, token = Token Numeral (NumeralData {value = 2014.0, grain = Nothing, multipliable = False, okForAnyTime = True}), children =
[
Node {
nodeRange = Range 25 29, token = Token RegexMatch (GroupMatch ["2014"]), children = [], rule = Nothing}], rule = Just "integer (numeric)"}], rule = Just "year"}
],
rule = Just "intersect"}
], rule = Just "last <cycle> of <time>"}},
Entity {
dim = "time", body = "\1575\1582\1585 \1575\1587\1576\1608\1593 \1601\1610 \1587\1576\1578\1605\1576\1585 \1604\1593\1575\1605 2014",
value = RVal Time (
TimeValue (
SimpleValue (
InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week}))
[
SimpleValue (InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week})
]
Nothing),
start = 0,
end = 29,
latent = False,
enode = Node {
nodeRange = Range 0 29, token = Token Time TimeData{latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 0 19, token = Token Time TimeData{latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 0 3, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {nodeRange = Range 4 9, token = Token TimeGrain Week, children =
[
Node {nodeRange = Range 4 9, token = Token RegexMatch (GroupMatch ["\1575","\1576\1608\1593"]), children = [], rule = Nothing}
], rule = Just "week (grain)"},
Node {nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["\1601\1610"]), children = [], rule = Nothing},
Node {nodeRange = Range 13 19, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 9}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 13 19, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "September"}
], rule = Just "last <cycle> of <time>"},
Node {nodeRange = Range 20 21, token = Token RegexMatch (GroupMatch [""]), children = [], rule = Nothing},
Node {nodeRange = Range 21 29, token = Token Time TimeData{latent=False, grain=Year, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 21 24, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {nodeRange = Range 25 29, token = Token Numeral (NumeralData {value = 2014.0, grain = Nothing, multipliable = False, okForAnyTime = True}), children =
[
Node {nodeRange = Range 25 29, token = Token RegexMatch (GroupMatch ["2014"]), children = [], rule = Nothing}], rule = Just "integer (numeric)"}
], rule = Just "year (integer)"}
], rule = Just "intersect by \",\", \"of\", \"from\", \"'s\""}}
]
Duckling/Types/Document.hs
where
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
isArabic :: Char -> Bool
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'ئ', 'ى']
@AMR-KELEG is it an issue if "و" is not included, only "ؤ"?
Hi Mazyod,
I will have a look but for sure this is a nasty bug!
Nice catch
Duckling/Types/Document.hs
where
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
isArabic :: Char -> Bool
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'ئ', 'ى']
Or...

isArabic c = elem c ['\1536' .. '\1791']
Works as follows (pardon Github RTL rendering):
Prelude> elem 'و' ['\1536' .. '\1791']
True
Prelude> elem 'ؤ' ['\1536' .. '\1791']
True
Prelude> elem 'a' ['\1536' .. '\1791']
False
This looks much better, but actually I am not a fan of using this range since it contains lots of irrelevant characters that might cause problems.
If such a list were used, characters like Arabic-Indic numerals and Arabic punctuation marks would be treated as Arabic word characters, which would break the rules and cause problems.
I prefer to have a restricted list of characters that can be augmented later to fix any bugs that might arise.
What do you think?
The long list of characters in the range:
؆
؇
؈
؉
؊
؋
،
؍
؎
؏
ؐ
ؑ
ؒ
ؓ
ؔ
ؕ
ؖ
ؗ
ؘ
ؙ
ؚ
؛
؝
؞
؟
ؠ
ء
آ
أ
ؤ
إ
ئ
ا
ب
ة
ت
ث
ج
ح
خ
د
ذ
ر
ز
س
ش
ص
ض
ط
ظ
ع
غ
ػ
ؼ
ؽ
ؾ
ؿ
ـ
ف
ق
ك
ل
م
ن
ه
و
ى
ي
ً
ٌ
ٍ
َ
ُ
ِ
ّ
ْ
ٓ
ٔ
ٕ
ٖ
ٗ
٘
ٙ
ٚ
ٛ
ٜ
ٝ
ٞ
ٟ
٠
١
٢
٣
٤
٥
٦
٧
٨
٩
٪
٫
٬
٭
ٮ
ٯ
ٰ
ٱ
ٲ
ٳ
ٴ
ٵ
ٶ
ٷ
ٸ
ٹ
ٺ
ٻ
ټ
ٽ
پ
ٿ
ڀ
ځ
ڂ
ڃ
ڄ
څ
چ
ڇ
ڈ
ډ
ڊ
ڋ
ڌ
ڍ
ڎ
ڏ
ڐ
ڑ
ڒ
ړ
ڔ
ڕ
ږ
ڗ
ژ
ڙ
ښ
ڛ
ڜ
ڝ
ڞ
ڟ
ڠ
ڡ
ڢ
ڣ
ڤ
ڥ
ڦ
ڧ
ڨ
ک
ڪ
ګ
ڬ
ڭ
ڮ
گ
ڰ
ڱ
ڲ
ڳ
ڴ
ڵ
ڶ
ڷ
ڸ
ڹ
ں
ڻ
ڼ
ڽ
ھ
ڿ
ۀ
ہ
ۂ
ۃ
ۄ
ۅ
ۆ
ۇ
ۈ
ۉ
ۊ
ۋ
ی
ۍ
ێ
ۏ
ې
ۑ
ے
ۓ
۔
ە
ۖ
ۗ
ۘ
ۙ
ۚ
ۛ
ۜ
۞
۟
۠
ۡ
ۢ
ۣ
ۤ
ۥ
ۦ
ۧ
ۨ
۩
۪
۫
۬
ۭ
ۮ
ۯ
۰
۱
۲
۳
۴
۵
۶
۷
۸
۹
ۺ
ۻ
ۼ
۽
۾
ۿ
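The trade-off under discussion can be sketched concretely. This is a hedged illustration with my own names, not the PR's code, contrasting the Unicode-block range with a restricted letter list:

```haskell
-- The Unicode Arabic block U+0600..U+06FF ('\1536'..'\1791') matches every
-- character in the list above, including Arabic-Indic digits and
-- punctuation:
inArabicBlock :: Char -> Bool
inArabicBlock c = c >= '\1536' && c <= '\1791'

-- A restricted predicate matches only an explicit letter list:
isArabicLetter :: Char -> Bool
isArabicLetter c = c `elem` "ابتةثجحخدذرزسشصضطظعغفقكلمنهيءآأإؤوئى"

main :: IO ()
main = do
  print (inArabicBlock '٥')   -- True: Arabic-Indic digit five is in the block
  print (isArabicLetter '٥')  -- False: digits are excluded from the list
  print (inArabicBlock '،')   -- True: the Arabic comma is in the block too
```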
Makes perfect sense!
Co-authored-by: Maz <mazjaleel@gmail.com>
Hi @chessai
isArabic :: Char -> Bool
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'و', 'ئ', 'ى']
-- TODO: Add all Arabic proclitics
Would this be something that is hard to add at this point, before merge?
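For reference, the usual single-letter Arabic proclitics are و (and), ف (then), ب (in/with), ك (like), and ل (to/for). A hedged sketch of a fuller predicate, with my naming rather than the PR's, might look like:

```haskell
-- Common single-character Arabic proclitics: waw, fa, ba, kaf, lam.
-- Not exhaustive; diacritized and dialectal forms are ignored here.
isArabicProclitic :: Char -> Bool
isArabicProclitic c = c `elem` "وفبكل"

-- Two-character combinations such as the definite article "ال" and
-- lam + article "لل" (e.g. "للعام", "for the year").
isArabicProclitic2 :: Char -> Char -> Bool
isArabicProclitic2 c1 c2 =
  (c1 == 'ا' && c2 == 'ل') || (c1 == 'ل' && c2 == 'ل')

main :: IO ()
main = do
  print (isArabicProclitic 'و')       -- True
  print (isArabicProclitic2 'ا' 'ل')  -- True: the definite article "ال"
```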
isArabicProclitic2 :: Char -> Char -> Bool
isArabicProclitic2 c1 c2 = elem c1 ['ا', 'ل'] && elem c2 ['ل']
-- TODO: Add all Arabic proclitics
I think this means to say enclitics. And would it be hard to add the remaining enclitics at this point, before merge?
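If it helps, the common Arabic enclitics are the attached object/possessive pronoun suffixes. A non-exhaustive sketch (my list, which a native speaker should double-check):

```haskell
-- Frequent Arabic enclitics: attached pronoun suffixes such as -hu
-- ("his/him"), -ha ("her"), -hum/-hunna/-huma ("their/them"), -ka/-kum/
-- -kunna ("your/you"), -i/-ni ("my/me"), -na ("our/us"). Not exhaustive.
arabicEnclitics :: [String]
arabicEnclitics =
  ["ه", "ها", "هم", "هن", "هما", "ك", "كم", "كن", "ي", "ني", "نا"]

main :: IO ()
main = print (length arabicEnclitics)  -- prints 11
```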
(end <= (length doc - 2) &&
  isArabicEnclitic (doc ! (end)) (doc ! (end + 1))))
where
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
-- This list isn't exhaustive since Arabic have some diacritics and rarely used characters in Unicode
@AMR-KELEG this PR is looking good. I left a few minor comments.
@AMR-KELEG I am going to merge this, as it is a great starting point. Don't want to block on some TODOs. Feel free to improve things in a subsequent PR. Thanks again for all your hard work!
@chessai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Thanks @chessai for your support and pointers. I was actually going to address your comments tomorrow but I would also be happy to continue working on them in another PR. |
Fixes #437, fixes #571