Add isArabic rule #577
Conversation
Should I also commit the changes to the classifiers files?
Yes, if classifiers needed to be regenerated, then they should be included.
This feels a bit hacky and ad-hoc to me. Additionally, it seems to break some tests: https://github.com/facebook/duckling/pull/577/checks?check_run_id=2019413081#step:6:1076 I think writing a more general word-boundary detection algorithm would serve us better, possibly with the locale as input.
I don't think it's ad-hoc, but Arabic has a set of proclitics/enclitics that makes tokenization a relatively hard problem (e.g:
Can you fix the failing test cases? I haven't looked into why they are failing. What I am interested in is: are they failing because this PR does something wrong, or are the tests themselves actually wrong?
O/
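The clitic problem raised above can be made concrete with a toy sketch. This is an illustration of the idea only, not the PR's code; the names `isArabicChar` and `startsAfterProclitic` are mine, and the letter list is deliberately restricted.

```haskell
-- Toy illustration of why Arabic tokenization needs clitic awareness:
-- the proclitic lam ('ل', "for") attaches directly to the following word,
-- so "لعام" ("for a year") contains "عام" ("year") with no whitespace
-- boundary before it. Naive whitespace tokenization cannot see this.
isArabicChar :: Char -> Bool
isArabicChar c = c `elem` "ابتةثجحخدذرزسشصضطظعغفقكلمنهيءآأإؤوئى"

-- A match starting at index i is plausible if it is at the start of the
-- text, preceded by a non-Arabic character, or preceded by a known
-- proclitic such as lam.
startsAfterProclitic :: String -> Int -> Bool
startsAfterProclitic s i =
  i == 0 || not (isArabicChar prev) || prev == 'ل'
  where
    prev = s !! (i - 1)

main :: IO ()
main = do
  print (startsAfterProclitic "لعام" 1)  -- preceded by proclitic lam: True
  print (startsAfterProclitic "سعام" 1)  -- synthetic string; preceded by an
                                         -- ordinary letter: False
```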
@chessai I have fixed all the failing cases except one (the only one that will, IMO, require changing the rules; discussed at the end of this comment). Let me explain the idea of the PR:
The current failing case is related to numbers that are multiples of a hundred.
Apart from that case, I believe this PR will fix lots of false positives that are currently reported by Duckling.
Awesome!
Thanks for the thorough explanation! It's very useful.
Given your knowledge of Arabic (which I do not really have), which do you think is more appropriate/makes more sense?
Amazing!
The only problem I have with the new direction is that I would like isRangeValid to take Locale as input, and case on the Locale to determine what set of predicates to use on the range. For example, internally I have a change to isRangeValid which fixes a lot of problems for Chinese and Georgian, but causes issues for English. For English, the current ruleset is more or less sufficient, so I'd like it to not change (and be the default). But for other languages we will need different support, as this PR and its discussion so clearly point out.
Hmm, yes, this makes sense. I have also added a not-so-smart way of matching numerals that are multiples of a hundred in the range 300-900. For this PR, I believe we can also add some false positives, and then IMHO it would be ready to merge.
@chessai Could you please review the PR and let me know if we should add samples to the negative corpora?
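As a rough numeric sketch of the "multiples of a hundred in 300-900" check mentioned above (a simplified illustration with a hypothetical name, not the PR's implementation):

```haskell
-- Hypothetical sketch of the "multiple of a hundred in 300..900" check
-- discussed above; not the PR's actual code.
isHundredsMultiple :: Double -> Bool
isHundredsMultiple v =
  v >= 300 && v <= 900      -- restrict to the discussed range
    && fromIntegral n == v  -- whole number only
    && n `mod` 100 == 0     -- multiple of a hundred
  where
    n = round v :: Integer

main :: IO ()
main = mapM_ (print . isHundredsMultiple) [300, 450, 900, 1000]
-- prints True, False, True, False
```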
I have an internal PR which will allow us to case on the language for isRangeValid. It should look like this:

isRangeValid :: Lang -> Document -> Int -> Int -> Bool
isRangeValid = \case
  AR -> arIsRangeValid
  _ -> defaultIsRangeValid
  where
    arIsRangeValid = ...your code...
Also, it would help me if we had plenty (or at least a handful) of examples of things that caused problems before, but won't cause problems now.
I think adding them as negative corpora is the way to do so.
Negative corpora should be fine. Also, isRangeValid in the master branch now takes Lang as input, so you can rebase and refactor accordingly.
@AMR-KELEG are you still interested in working on this? I'm very enthusiastic about this change.
@chessai I am pretty much drained currently, so I couldn't adapt the change earlier.
What is the build error you are getting?
@AMR-K could you rebase on top of master? Another language-dependent isRangeValid implementation landed. And then please add negative corpora. I was about to commandeer the diff internally, to rebase for you, but realised I don't know what negative corpora to add.
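For the negative-corpora step, a sketch based on Duckling's testing conventions might look like the following. The `NegativeCorpus` shape, `testContext`, and `testOptions` reflect my understanding of the codebase and should be checked against it, and the example string is a hypothetical entry:

```haskell
-- Hypothetical sketch of an Arabic negative-corpus entry; assumes
-- Duckling's testing convention where a NegativeCorpus is a
-- (Context, Options, [Text]) triple of strings that must NOT parse.
negativeCorpus :: NegativeCorpus
negativeCorpus =
  (testContext {locale = makeLocale AR Nothing}, testOptions, xs)
  where
    xs =
      [ "عامل" -- "worker": contains "عام" ("year") as a substring,
               -- but should not resolve to a time value
      ]
```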
Hi @chessai,
Hi @chessai |
I recommend loading the project into the repl via
if you haven't already. |
I would like to help push this PR forward, but I am not able to grasp Haskell overnight... Here is the debug output for the failing test on origin:

*Duckling.Debug> debug (makeLocale AR $ Just EG) "اخر اسبوع في سبتمبر لعام 2014" [Seal Time]
last <cycle> of <time> (اخر اسبوع في سبتمبر لعام 2014)
-- regex (اخر)
-- week (grain) (اسبوع)
-- -- regex (اسبوع)
-- regex (في)
-- intersect by ",", "of", "from", "'s" (سبتمبر لعام 2014)
-- -- September (سبتمبر)
-- -- -- regex (سبتمبر)
-- -- regex (ل)
-- -- year (integer) (عام 2014)
-- -- -- regex (عام)
-- -- -- integer (numeric) (2014)
-- -- -- -- regex (2014)
[
Entity {
dim = "time",
body = "\1575\1582\1585 \1575\1587\1576\1608\1593 \1601\1610 \1587\1576\1578\1605\1576\1585 \1604\1593\1575\1605 2014",
value = RVal Time (
TimeValue (
SimpleValue (
InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week}
)
)
[
SimpleValue (
InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week}
)
]
Nothing
),
start = 0,
end = 29,
latent = False,
enode = Node {
nodeRange = Range 0 29,
token = Token Time TimeData{
latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False
},
children = [
Node {nodeRange = Range 0 3, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {nodeRange = Range 4 9, token = Token TimeGrain Week, children =
[
Node {nodeRange = Range 4 9, token = Token RegexMatch (GroupMatch ["\1575","\1576\1608\1593"]), children = [], rule = Nothing}
],
rule = Just "week (grain)"},
Node {nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["\1601\1610"]), children = [], rule = Nothing},
Node {nodeRange = Range 13 29, token = Token Time TimeData{latent=False, grain=Month, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 13 19, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 9}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 13 19, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}
],
rule = Just "September"},
Node {nodeRange = Range 20 21, token = Token RegexMatch (GroupMatch [""]), children = [], rule = Nothing},
Node {nodeRange = Range 21 29, token = Token Time TimeData{latent=False, grain=Year, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 21 24, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {nodeRange = Range 25 29, token = Token Numeral (NumeralData {value = 2014.0, grain = Nothing, multipliable = False, okForAnyTime = True}), children =
[
Node {nodeRange = Range 25 29, token = Token RegexMatch (GroupMatch ["2014"]), children = [], rule = Nothing}
],
rule = Just "integer (numeric)"}
],
rule = Just "year (integer)"}
],
rule = Just "intersect by \",\", \"of\", \"from\", \"'s\""}
],
rule = Just "last <cycle> of <time>"
}
}
]

As for the debug output on the PR:

*Duckling.Debug> debug (makeLocale AR $ Just EG) "اخر اسبوع في سبتمبر لعام 2014" [Seal Time]
last <cycle> of <time> (اخر اسبوع في سبتمبر لعام 2014)
-- regex (اخر)
-- week (grain) (اسبوع)
-- -- regex (اسبوع)
-- regex (في)
-- intersect (سبتمبر لعام 2014)
-- -- <time> for <duration> (سبتمبر لعام)
-- -- -- September (سبتمبر)
-- -- -- -- regex (سبتمبر)
-- -- -- regex (ل)
-- -- -- single <unit-of-duration> (عام)
-- -- -- -- year (grain) (عام)
-- -- -- -- -- regex (عام)
-- -- year (2014)
-- -- -- integer (numeric) (2014)
-- -- -- -- regex (2014)
intersect by ",", "of", "from", "'s" (اخر اسبوع في سبتمبر لعام 2014)
-- last <cycle> of <time> (اخر اسبوع في سبتمبر)
-- -- regex (اخر)
-- -- week (grain) (اسبوع)
-- -- -- regex (اسبوع)
-- -- regex (في)
-- -- September (سبتمبر)
-- -- -- regex (سبتمبر)
-- regex (ل)
-- year (integer) (عام 2014)
-- -- regex (عام)
-- -- integer (numeric) (2014)
-- -- -- regex (2014)
[
Entity {
dim = "time",
body = "\1575\1582\1585 \1575\1587\1576\1608\1593 \1601\1610 \1587\1576\1578\1605\1576\1585 \1604\1593\1575\1605 2014",
value = RVal Time (
TimeValue (
SimpleValue (
InstantValue {
vValue = 2014-08-25 00:00:00 -0200, vGrain = Week}))
[
SimpleValue (
InstantValue {
vValue = 2014-08-25 00:00:00 -0200, vGrain = Week})
] Nothing),
start = 0,
end = 29,
latent = False,
enode = Node {
nodeRange = Range 0 29,
token = Token Time TimeData{
latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False},
children = [
Node {
nodeRange = Range 0 3, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {
nodeRange = Range 4 9, token = Token TimeGrain Week, children =
[
Node {
nodeRange = Range 4 9, token = Token RegexMatch (GroupMatch ["\1575","\1576\1608\1593"]), children = [], rule = Nothing}
],
rule = Just "week (grain)"},
Node {
nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["\1601\1610"]), children = [], rule = Nothing},
Node {nodeRange = Range 13 29, token = Token Time TimeData{latent=False, grain=Month, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 13 24, token = Token Time TimeData{latent=False, grain=Month, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 13 19, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 9}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 13 19, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "September"},
Node {
nodeRange = Range 20 21, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {
nodeRange = Range 21 24, token = Token Duration (DurationData {value = 1, grain = Year}), children =
[
Node {
nodeRange = Range 21 24, token = Token TimeGrain Year, children = [Node {nodeRange = Range 21 24, token = Token RegexMatch (GroupMatch [""]), children = [], rule = Nothing}], rule = Just "year (grain)"}
],
rule = Just "single <unit-of-duration>"}
], rule = Just "<time> for <duration>"},
Node {
nodeRange = Range 25 29, token = Token Time TimeData{latent=False, grain=Year, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 25 29, token = Token Numeral (NumeralData {value = 2014.0, grain = Nothing, multipliable = False, okForAnyTime = True}), children =
[
Node {
nodeRange = Range 25 29, token = Token RegexMatch (GroupMatch ["2014"]), children = [], rule = Nothing}], rule = Just "integer (numeric)"}], rule = Just "year"}
],
rule = Just "intersect"}
], rule = Just "last <cycle> of <time>"}},
Entity {
dim = "time", body = "\1575\1582\1585 \1575\1587\1576\1608\1593 \1601\1610 \1587\1576\1578\1605\1576\1585 \1604\1593\1575\1605 2014",
value = RVal Time (
TimeValue (
SimpleValue (
InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week}))
[
SimpleValue (InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week})
]
Nothing),
start = 0,
end = 29,
latent = False,
enode = Node {
nodeRange = Range 0 29, token = Token Time TimeData{latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {
nodeRange = Range 0 19, token = Token Time TimeData{latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 0 3, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {nodeRange = Range 4 9, token = Token TimeGrain Week, children =
[
Node {nodeRange = Range 4 9, token = Token RegexMatch (GroupMatch ["\1575","\1576\1608\1593"]), children = [], rule = Nothing}
], rule = Just "week (grain)"},
Node {nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["\1601\1610"]), children = [], rule = Nothing},
Node {nodeRange = Range 13 19, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 9}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 13 19, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "September"}
], rule = Just "last <cycle> of <time>"},
Node {nodeRange = Range 20 21, token = Token RegexMatch (GroupMatch [""]), children = [], rule = Nothing},
Node {nodeRange = Range 21 29, token = Token Time TimeData{latent=False, grain=Year, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children =
[
Node {nodeRange = Range 21 24, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
Node {nodeRange = Range 25 29, token = Token Numeral (NumeralData {value = 2014.0, grain = Nothing, multipliable = False, okForAnyTime = True}), children =
[
Node {nodeRange = Range 25 29, token = Token RegexMatch (GroupMatch ["2014"]), children = [], rule = Nothing}], rule = Just "integer (numeric)"}
], rule = Just "year (integer)"}
], rule = Just "intersect by \",\", \"of\", \"from\", \"'s\""}}
]
Duckling/Types/Document.hs
where
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
isArabic :: Char -> Bool
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'ئ', 'ى']
@AMR-KELEG is it an issue if "و" is not included, only "ؤ"?
Hi Mazyod,
I will have a look but for sure this is a nasty bug!
Nice catch
Duckling/Types/Document.hs
where
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
isArabic :: Char -> Bool
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'ئ', 'ى']
Or...

isArabic c = elem c ['\1536' .. '\1791']
Works as follows (pardon Github RTL rendering):
Prelude> elem 'و' ['\1536' .. '\1791']
True
Prelude> elem 'ؤ' ['\1536' .. '\1791']
True
Prelude> elem 'a' ['\1536' .. '\1791']
False
This looks much better, but actually I am not a fan of using this range since it contains lots of irrelevant characters that might cause problems.
If such a list were used, characters like Arabic-Indic numerals and Arabic punctuation marks would be treated as Arabic word characters, which would break the rules and cause problems.
I prefer to have a restricted list of characters that can be augmented later to fix any bugs that might arise.
What do you think?
The long list of characters in the range:
؆
؇
؈
؉
؊
؋
،
؍
؎
؏
ؐ
ؑ
ؒ
ؓ
ؔ
ؕ
ؖ
ؗ
ؘ
ؙ
ؚ
؛
؝
؞
؟
ؠ
ء
آ
أ
ؤ
إ
ئ
ا
ب
ة
ت
ث
ج
ح
خ
د
ذ
ر
ز
س
ش
ص
ض
ط
ظ
ع
غ
ػ
ؼ
ؽ
ؾ
ؿ
ـ
ف
ق
ك
ل
م
ن
ه
و
ى
ي
ً
ٌ
ٍ
َ
ُ
ِ
ّ
ْ
ٓ
ٔ
ٕ
ٖ
ٗ
٘
ٙ
ٚ
ٛ
ٜ
ٝ
ٞ
ٟ
٠
١
٢
٣
٤
٥
٦
٧
٨
٩
٪
٫
٬
٭
ٮ
ٯ
ٰ
ٱ
ٲ
ٳ
ٴ
ٵ
ٶ
ٷ
ٸ
ٹ
ٺ
ٻ
ټ
ٽ
پ
ٿ
ڀ
ځ
ڂ
ڃ
ڄ
څ
چ
ڇ
ڈ
ډ
ڊ
ڋ
ڌ
ڍ
ڎ
ڏ
ڐ
ڑ
ڒ
ړ
ڔ
ڕ
ږ
ڗ
ژ
ڙ
ښ
ڛ
ڜ
ڝ
ڞ
ڟ
ڠ
ڡ
ڢ
ڣ
ڤ
ڥ
ڦ
ڧ
ڨ
ک
ڪ
ګ
ڬ
ڭ
ڮ
گ
ڰ
ڱ
ڲ
ڳ
ڴ
ڵ
ڶ
ڷ
ڸ
ڹ
ں
ڻ
ڼ
ڽ
ھ
ڿ
ۀ
ہ
ۂ
ۃ
ۄ
ۅ
ۆ
ۇ
ۈ
ۉ
ۊ
ۋ
ی
ۍ
ێ
ۏ
ې
ۑ
ے
ۓ
۔
ە
ۖ
ۗ
ۘ
ۙ
ۚ
ۛ
ۜ
۞
۟
۠
ۡ
ۢ
ۣ
ۤ
ۥ
ۦ
ۧ
ۨ
۩
۪
۫
۬
ۭ
ۮ
ۯ
۰
۱
۲
۳
۴
۵
۶
۷
۸
۹
ۺ
ۻ
ۼ
۽
۾
ۿ
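The trade-off under discussion can be sketched concretely. This is a hedged illustration with my own names, not the PR's code, contrasting the Unicode-block range with a restricted letter list:

```haskell
-- The Unicode Arabic block U+0600..U+06FF ('\1536'..'\1791') matches every
-- character in the list above, including Arabic-Indic digits and
-- punctuation:
inArabicBlock :: Char -> Bool
inArabicBlock c = c >= '\1536' && c <= '\1791'

-- A restricted predicate matches only an explicit letter list:
isArabicLetter :: Char -> Bool
isArabicLetter c = c `elem` "ابتةثجحخدذرزسشصضطظعغفقكلمنهيءآأإؤوئى"

main :: IO ()
main = do
  print (inArabicBlock '٥')   -- True: Arabic-Indic digit five is in the block
  print (isArabicLetter '٥')  -- False: digits are excluded from the list
  print (inArabicBlock '،')   -- True: the Arabic comma is in the block too
```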
Makes perfect sense!
Co-authored-by: Maz <mazjaleel@gmail.com>
Hi @chessai
isArabic :: Char -> Bool
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'و', 'ئ', 'ى']
-- TODO: Add all Arabic proclitics
Would this be something that is hard to add at this point, before merge?
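For reference, the usual single-letter Arabic proclitics are و (and), ف (then), ب (in/with), ك (like), and ل (to/for). A hedged sketch of a fuller predicate, with my naming rather than the PR's, might look like:

```haskell
-- Common single-character Arabic proclitics: waw, fa, ba, kaf, lam.
-- Not exhaustive; diacritized and dialectal forms are ignored here.
isArabicProclitic :: Char -> Bool
isArabicProclitic c = c `elem` "وفبكل"

-- Two-character combinations such as the definite article "ال" and
-- lam + article "لل" (e.g. "للعام", "for the year").
isArabicProclitic2 :: Char -> Char -> Bool
isArabicProclitic2 c1 c2 =
  (c1 == 'ا' && c2 == 'ل') || (c1 == 'ل' && c2 == 'ل')

main :: IO ()
main = do
  print (isArabicProclitic 'و')       -- True
  print (isArabicProclitic2 'ا' 'ل')  -- True: the definite article "ال"
```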
isArabicProclitic2 :: Char -> Char -> Bool
isArabicProclitic2 c1 c2 = elem c1 ['ا', 'ل'] && elem c2 ['ل']
-- TODO: Add all Arabic proclitics
I think this means to say enclitics. And would it be hard to add the remaining enclitics at this point, before merge?
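If it helps, the common Arabic enclitics are the attached object/possessive pronoun suffixes. A non-exhaustive sketch (my list, which a native speaker should double-check):

```haskell
-- Frequent Arabic enclitics: attached pronoun suffixes such as -hu
-- ("his/him"), -ha ("her"), -hum/-hunna/-huma ("their/them"), -ka/-kum/
-- -kunna ("your/you"), -i/-ni ("my/me"), -na ("our/us"). Not exhaustive.
arabicEnclitics :: [String]
arabicEnclitics =
  ["ه", "ها", "هم", "هن", "هما", "ك", "كم", "كن", "ي", "ني", "نا"]

main :: IO ()
main = print (length arabicEnclitics)  -- prints 11
```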
(end <= (length doc - 2) &&
  isArabicEnclitic (doc ! (end)) (doc ! (end + 1))))
where
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
-- This list isn't exhaustive since Arabic have some diacritics and rarely used characters in Unicode
@AMR-KELEG this PR is looking good. I left a few minor comments.
@AMR-KELEG I am going to merge this, as it is a great starting point. Don't want to block on some TODOs. Feel free to improve things in a subsequent PR. Thanks again for all your hard work!
@chessai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Thanks @chessai for your support and pointers. I was actually going to address your comments tomorrow but I would also be happy to continue working on them in another PR. |
Fixes #437, fixes #571