
Add isArabic rule #577

Closed
wants to merge 7 commits into from

Conversation

AMR-KELEG
Contributor

@AMR-KELEG AMR-KELEG commented Mar 3, 2021

Fixes #437, fixes #571

@AMR-KELEG
Contributor Author

Should I also commit the changes to the classifiers files?

@chessai
Contributor

chessai commented Mar 3, 2021 via email

@chessai
Contributor

chessai commented Mar 3, 2021

This feels a bit hacky and ad-hoc to me. Additionally, it seems to break some tests: https://github.com/facebook/duckling/pull/577/checks?check_run_id=2019413081#step:6:1076

I think writing a more general word-boundary detection algorithm would serve us better, possibly with the locale as input.

@AMR-KELEG
Contributor Author

AMR-KELEG commented Mar 3, 2021

This feels a bit hacky and ad-hoc to me. Additionally, it seems to break some tests: https://github.com/facebook/duckling/pull/577/checks?check_run_id=2019413081#step:6:1076

I think writing a more general word-boundary detection algorithm would serve us better, possibly with the locale as input.

I don't think it's ad hoc, but Arabic has a set of proclitics/enclitics that makes tokenization a relatively hard problem (e.g., a month is شهر while two months is شهرين, so the enclitic ين means a pair of *). I am not sure what you mean by a general word-boundary detection algorithm, but it won't be easy to build, and apparently the hacky solution referred to in the PRs does the same as what I am proposing.
I also think that most of the failing test cases are fixable.

@chessai
Contributor

chessai commented Apr 1, 2021

Can you fix the failing test cases? I haven't looked into why they are failing. What I am interested in is: are they failing because this PR does something wrong, or are the tests themselves actually wrong?

@AMR-KELEG
Contributor Author

Can you fix the failing test cases? I haven't looked into why they are failing. What I am interested in is: are they failing because this PR does something wrong, or are the tests themselves actually wrong?

O/
I have actually stopped working on the PR.
The tests are failing because the PR breaks the way some of the rules are matching the text. So the tests are correct but the rules were written in a way that depended on the hacky condition that is currently used.
I will work on first coming up with ways to fix the failing cases without greatly changing the rules, and then we can re-evaluate this PR and check how to proceed.

@AMR-KELEG
Contributor Author

AMR-KELEG commented Apr 18, 2021

@chessai I have fixed all the failing cases except one (the only one that will need changes to the rules, IMO; discussed at the end of this comment).

Let me explain the idea of the PR:
Tokens in Arabic are separated by whitespace, but Arabic also has a CLOSED set (of fixed size) of proclitics/enclitics that can attach to the beginning or the end of a token without spaces in between. [Clitics in Arabic Language: A Statistical Study]

| Sample | Token | Token with clitics | The way matches are found in Duckling | Discussion |
| --- | --- | --- | --- | --- |
| شهرين (2 months) | شهر (month) | شهر (month) + ين (for dual form) | شهر + ين | Since Duckling will match the word as two tokens, the isValidRange function should accept a token that is followed by an enclitic (not followed by whitespace or numbers), and also a token that is an enclitic in itself and isn't preceded by whitespace but is followed by whitespace or a number (handled in lines 134:140). |
| اليوم (this day / today) | يوم (day) | ال (definite article) + يوم (day) | ال + يوم | Since Duckling will match the word as two tokens, the isValidRange function should accept a token that is preceded by a proclitic (not preceded by whitespace or numbers), and also a token that is a proclitic in itself and isn't followed by whitespace but is followed by an Arabic character (handled in lines 123:133). |
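The enclitic side of this boundary logic can be sketched as follows. This is a minimal standalone illustration with hypothetical names and an illustrative subset of the clitic set, not the PR's actual implementation:

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

-- Illustrative subset of the closed set of Arabic enclitics (hypothetical).
enclitics :: [String]
enclitics = ["ين", "ان"]

-- A match that ends mid-word is still acceptable when the remainder of the
-- word is exactly an enclitic (e.g. the dual suffix).
validEnd :: String -> Bool
validEnd rest = null rest || rest `elem` enclitics

main :: IO ()
main = do
  -- "شهرين" (two months) = "شهر" (month) + "ين" (dual suffix)
  let rest = fromMaybe "" (stripPrefix "شهر" "شهرين")
  print (validEnd rest)  -- prints True
```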

The currently failing case is related to numbers that are multiples of a hundred.
The word hundred is "مائة" in Arabic and the word three is "ثلاث", but the word three hundred is "ثلاثمائة". Currently, Duckling matches "ثلاثمائة" as two tokens, "ثلاث" and "مائة". To fix the case, we can either:

  • Consider all numbers in the range [3, 9] as proclitics (this will make the isArabicProclitic function complex, since the numbers in [3, 9] are each longer than two characters)
  • Modify the rule to match the tokens 300, 400, ..., 900 as a single match instead of splitting them into two tokens.
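For illustration, the second option could enumerate the fused forms directly. A rough sketch with assumed spellings of the units words, not the PR's actual rule:

```haskell
-- Hypothetical sketch: pair each units word with its value and fuse it with
-- "مائة" (hundred) to enumerate the solid-written multiples 300..900 as
-- single tokens.
hundredsMultiples :: [(String, Int)]
hundredsMultiples =
  [ (units ++ "مائة", n * 100)
  | (units, n) <- [ ("ثلاث", 3), ("أربع", 4), ("خمس", 5), ("ست", 6)
                  , ("سبع", 7), ("ثمان", 8), ("تسع", 9) ]
  ]

main :: IO ()
main = print (lookup "ثلاثمائة" hundredsMultiples)  -- prints Just 300
```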

Apart from these cases, I believe this PR will eliminate many false positives that Duckling currently reports.
Thanks and looking forward to hearing your feedback.

@chessai
Contributor

chessai commented Apr 19, 2021

@chessai I have fixed all the failing cases except one (the only one that will need changes to the rules, IMO; discussed at the end of this comment).

Awesome!

Let me explain the idea of the PR

Thanks for the thorough explanation! It's very useful.

  • Consider all numbers in the range [3, 9] as proclitics (this will make the isArabicProclitic function complex, since the numbers in [3, 9] are each longer than two characters)
  • Modify the rule to match the tokens 300, 400, ..., 900 as a single match instead of splitting them into two tokens.

Given your knowledge of Arabic (which I do not really have), which do you think is more appropriate/makes more sense?
My intuition says that the latter makes more sense, but I don't have much to go on besides that.

Apart from the cases, I believe that this PR will solve lots of false positives that are currently reported by Duckling.

Amazing!

@chessai
Contributor

chessai commented Apr 19, 2021

The only problem I have with the new direction is that I would like isRangeValid to take Locale as input, and case on the Locale to determine what set of predicates to use on the range. For example, internally I have a change to isRangeValid which fixes a lot of problems for Chinese and Georgian, but causes issues for English. For English, the current ruleset is more or less sufficient, so I'd like it to not change (and be the default). But for other languages we will need different support, like this PR and its discussion so clearly points out.

@AMR-KELEG
Contributor Author

AMR-KELEG commented Apr 27, 2021

The only problem I have with the new direction is that I would like isRangeValid to take Locale as input, and case on the Locale to determine what set of predicates to use on the range. For example, internally I have a change to isRangeValid which fixes a lot of problems for Chinese and Georgian, but causes issues for English. For English, the current ruleset is more or less sufficient, so I'd like it to not change (and be the default). But for other languages we will need different support, like this PR and its discussion so clearly points out.

Hmm, yes, this makes sense.
Let me know if there is something I can do to localize isRangeValid.
I am still a Haskell noob, so it's not easy for me to come up with ways to change things myself.

I have also added a not-so-smart way of matching Numerals that are multiples of a hundred in the range 300-900.
I believe the rules can be converted into a single rule while adding conditions to determine the first matched part of the regexp but I couldn't figure out how to do so.

For this PR, I believe we can also add some False Positives and then it would be ready IMHO to get merged.

@AMR-KELEG
Contributor Author

@chessai Could you please review the PR and let me know if we should add samples to the Negative corpora?
I would be glad if we can merge this PR soon.
Thanks!

@chessai
Contributor

chessai commented May 14, 2021

@chessai Could you please review the PR and let me know if we should add samples to the Negative corpora?
I would be glad if we can merge this PR soon.
Thanks!

I have an internal PR which will allow us to case on the language for isRangeValid. Once it lands, could you rebase and implement your isRangeValid in terms of the new one?

It should look like this:

isRangeValid :: Lang -> Document -> Int -> Int -> Bool
isRangeValid = \case
  AR -> arIsRangeValid
  _ -> defaultIsRangeValid
  where
    arIsRangeValid = ...your code...

@chessai
Contributor

chessai commented May 14, 2021

Also, it would help me if we had plenty (or at least a handful) of examples of things that caused problems before, but won't cause problems now.

@AMR-KELEG
Contributor Author

Also, it would help me if we had plenty (or at least a handful) of examples of things that caused problems before, but won't cause problems now.

I think adding them as negative corpora is the way to do so.
Do you prefer having them written in a different format?

@chessai
Contributor

chessai commented May 17, 2021

Also, it would help me if we had plenty (or at least a handful) of examples of things that caused problems before, but won't cause problems now.

I think adding them as negative corpora is the way to do so.
Do you prefer having them written in a different format?

negative corpora should be fine.

Also, isRangeValid in master branch now takes Lang as input, so you can rebase and refactor accordingly now.

@chessai
Contributor

chessai commented Jun 3, 2021

@AMR-KELEG are you still interested in working on this? I'm very enthusiastic about this change.

@AMR-KELEG
Contributor Author

@chessai I am pretty much drained currently so I couldn't adapt the change earlier.
I tried rebasing the branch now but the code isn't building successfully.
Any pointers on how to fix it?

@chessai
Contributor

chessai commented Jun 15, 2021

@chessai I am pretty much drained currently so I couldn't adapt the change earlier.
I tried rebasing the branch now but the code isn't building successfully.
Any pointers on how to fix it?

What is the build error you are getting?

@chessai
Contributor

chessai commented Jun 25, 2021

@AMR-KELEG could you rebase on top of master? Another language-dependent isRangeValid implementation landed. And then please add negative corpora. I was about to commandeer the diff internally, to rebase for you, but realised I don't know what negative corpora to add.

@AMR-KELEG
Contributor Author

Hi @chessai ,
I have managed to rebase the branch based on the Chinese change (I mimicked how range checking is done there).
We have a single case which is currently failing and I will check it as soon as I can.

@AMR-KELEG
Contributor Author

Hi @chessai
I tried to hack the failing case and check why two different parses are currently generated but I failed to build a valid intuition.

@chessai
Contributor

chessai commented Jul 7, 2021

Hi @chessai
I tried to hack the failing case and check why two different parses are currently generated but I failed to build a valid intuition.

I recommend loading the project into the repl via cabal repl and then import Duckling.Debug and do something like

> debug (makeLocale AR Nothing) "text" [Seal Whatever]

if you haven't already.

@Mazyod

Mazyod commented Jul 8, 2021

I would like to help push this PR forward, but I am not able to grasp Haskell overnight...

Here is the debug output for failing test on origin

*Duckling.Debug> debug (makeLocale AR $ Just EG) "اخر اسبوع في سبتمبر لعام 2014" [Seal Time]
last <cycle> of <time> (اخر اسبوع في سبتمبر لعام 2014)
-- regex (اخر)
-- week (grain) (اسبوع)
-- -- regex (اسبوع)
-- regex (في)
-- intersect by ",", "of", "from", "'s" (سبتمبر لعام 2014)
-- -- September (سبتمبر)
-- -- -- regex (سبتمبر)
-- -- regex (ل)
-- -- year (integer) (عام 2014)
-- -- -- regex (عام)
-- -- -- integer (numeric) (2014)
-- -- -- -- regex (2014)
[
    Entity {
        dim = "time", 
        body = "\1575\1582\1585 \1575\1587\1576\1608\1593 \1601\1610 \1587\1576\1578\1605\1576\1585 \1604\1593\1575\1605 2014", 
        value = RVal Time (
            TimeValue (
                SimpleValue (
                    InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week}
                )
            )
            [
                SimpleValue (
                    InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week}
                )
            ] 
            Nothing
        ),
        start = 0, 
        end = 29, 
        latent = False, 
        enode = Node {
            nodeRange = Range 0 29, 
            token = Token Time TimeData{
                latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False
            }, 
            children = [
                Node {nodeRange = Range 0 3, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
                Node {nodeRange = Range 4 9, token = Token TimeGrain Week, children = 
                    [
                        Node {nodeRange = Range 4 9, token = Token RegexMatch (GroupMatch ["\1575","\1576\1608\1593"]), children = [], rule = Nothing}
                    ], 
                    rule = Just "week (grain)"},
                Node {nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["\1601\1610"]), children = [], rule = Nothing},
                Node {nodeRange = Range 13 29, token = Token Time TimeData{latent=False, grain=Month, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                    [
                        Node {nodeRange = Range 13 19, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 9}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                            [
                                Node {nodeRange = Range 13 19, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}
                            ], 
                            rule = Just "September"},
                            Node {nodeRange = Range 20 21, token = Token RegexMatch (GroupMatch [""]), children = [], rule = Nothing},
                            Node {nodeRange = Range 21 29, token = Token Time TimeData{latent=False, grain=Year, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                                [
                                    Node {nodeRange = Range 21 24, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
                                    Node {nodeRange = Range 25 29, token = Token Numeral (NumeralData {value = 2014.0, grain = Nothing, multipliable = False, okForAnyTime = True}), children = 
                                        [
                                            Node {nodeRange = Range 25 29, token = Token RegexMatch (GroupMatch ["2014"]), children = [], rule = Nothing}
                                        ], 
                                        rule = Just "integer (numeric)"}
                                ], 
                                rule = Just "year (integer)"}
                    ], 
                    rule = Just "intersect by \",\", \"of\", \"from\", \"'s\""}
            ], 
            rule = Just "last <cycle> of <time>"
        }
    }
]

As for the debug output on PR:

*Duckling.Debug> debug (makeLocale AR $ Just EG) "اخر اسبوع في سبتمبر لعام 2014" [Seal Time]
last <cycle> of <time> (اخر اسبوع في سبتمبر لعام 2014)
-- regex (اخر)
-- week (grain) (اسبوع)
-- -- regex (اسبوع)
-- regex (في)
-- intersect (سبتمبر لعام 2014)
-- -- <time> for <duration> (سبتمبر لعام)
-- -- -- September (سبتمبر)
-- -- -- -- regex (سبتمبر)
-- -- -- regex (ل)
-- -- -- single <unit-of-duration> (عام)
-- -- -- -- year (grain) (عام)
-- -- -- -- -- regex (عام)
-- -- year (2014)
-- -- -- integer (numeric) (2014)
-- -- -- -- regex (2014)
intersect by ",", "of", "from", "'s" (اخر اسبوع في سبتمبر لعام 2014)
-- last <cycle> of <time> (اخر اسبوع في سبتمبر)
-- -- regex (اخر)
-- -- week (grain) (اسبوع)
-- -- -- regex (اسبوع)
-- -- regex (في)
-- -- September (سبتمبر)
-- -- -- regex (سبتمبر)
-- regex (ل)
-- year (integer) (عام 2014)
-- -- regex (عام)
-- -- integer (numeric) (2014)
-- -- -- regex (2014)
[
    Entity {
        dim = "time", 
        body = "\1575\1582\1585 \1575\1587\1576\1608\1593 \1601\1610 \1587\1576\1578\1605\1576\1585 \1604\1593\1575\1605 2014", 
        value = RVal Time (
                TimeValue (
                    SimpleValue (
                        InstantValue {
                            vValue = 2014-08-25 00:00:00 -0200, vGrain = Week})) 
                [
                    SimpleValue (
                        InstantValue {
                            vValue = 2014-08-25 00:00:00 -0200, vGrain = Week})
                ] Nothing), 
        start = 0, 
        end = 29, 
        latent = False, 
        enode = Node {
            nodeRange = Range 0 29, 
            token = Token Time TimeData{
                latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, 
                children = [
                    Node {
                        nodeRange = Range 0 3, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
                    Node {
                        nodeRange = Range 4 9, token = Token TimeGrain Week, children = 
                        [
                            Node {
                                nodeRange = Range 4 9, token = Token RegexMatch (GroupMatch ["\1575","\1576\1608\1593"]), children = [], rule = Nothing}
                        ], 
                        rule = Just "week (grain)"},
                    Node {
                        nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["\1601\1610"]), children = [], rule = Nothing},
                    Node {nodeRange = Range 13 29, token = Token Time TimeData{latent=False, grain=Month, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                        [
                            Node {
                                nodeRange = Range 13 24, token = Token Time TimeData{latent=False, grain=Month, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                                [
                                    Node {
                                        nodeRange = Range 13 19, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 9}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                                        [
                                            Node {
                                                nodeRange = Range 13 19, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "September"},
                                            Node {
                                                nodeRange = Range 20 21, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
                                            Node {
                                                nodeRange = Range 21 24, token = Token Duration (DurationData {value = 1, grain = Year}), children = 
                                                [
                                                    Node {
                                                        nodeRange = Range 21 24, token = Token TimeGrain Year, children = [Node {nodeRange = Range 21 24, token = Token RegexMatch (GroupMatch [""]), children = [], rule = Nothing}], rule = Just "year (grain)"}
                                                ], 
                                                rule = Just "single <unit-of-duration>"}
                                            ], rule = Just "<time> for <duration>"},
                                    Node {
                                        nodeRange = Range 25 29, token = Token Time TimeData{latent=False, grain=Year, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                                        [
                                            Node {
                                                nodeRange = Range 25 29, token = Token Numeral (NumeralData {value = 2014.0, grain = Nothing, multipliable = False, okForAnyTime = True}), children = 
                                                [
                                                    Node {
                                                        nodeRange = Range 25 29, token = Token RegexMatch (GroupMatch ["2014"]), children = [], rule = Nothing}], rule = Just "integer (numeric)"}], rule = Just "year"}
                                                ], 
                                                rule = Just "intersect"}
                                        ], rule = Just "last <cycle> of <time>"}},
    Entity {
        dim = "time", body = "\1575\1582\1585 \1575\1587\1576\1608\1593 \1601\1610 \1587\1576\1578\1605\1576\1585 \1604\1593\1575\1605 2014", 
        value = RVal Time (
            TimeValue (
                SimpleValue (
                    InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week})) 
            [
                SimpleValue (InstantValue {vValue = 2014-09-22 00:00:00 -0200, vGrain = Week})
            ] 
            Nothing), 
            start = 0, 
            end = 29, 
            latent = False, 
            enode = Node {
                nodeRange = Range 0 29, token = Token Time TimeData{latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                [
                    Node {
                        nodeRange = Range 0 19, token = Token Time TimeData{latent=False, grain=Week, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                        [
                            Node {nodeRange = Range 0 3, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
                            Node {nodeRange = Range 4 9, token = Token TimeGrain Week, children = 
                                [
                                    Node {nodeRange = Range 4 9, token = Token RegexMatch (GroupMatch ["\1575","\1576\1608\1593"]), children = [], rule = Nothing}
                                ], rule = Just "week (grain)"},
                            Node {nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["\1601\1610"]), children = [], rule = Nothing},
                            Node {nodeRange = Range 13 19, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 9}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                                [
                                    Node {nodeRange = Range 13 19, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "September"}
                                ], rule = Just "last <cycle> of <time>"},
                            Node {nodeRange = Range 20 21, token = Token RegexMatch (GroupMatch [""]), children = [], rule = Nothing},
                            Node {nodeRange = Range 21 29, token = Token Time TimeData{latent=False, grain=Year, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = 
                                [
                                    Node {nodeRange = Range 21 24, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},
                                    Node {nodeRange = Range 25 29, token = Token Numeral (NumeralData {value = 2014.0, grain = Nothing, multipliable = False, okForAnyTime = True}), children = 
                                        [
                                            Node {nodeRange = Range 25 29, token = Token RegexMatch (GroupMatch ["2014"]), children = [], rule = Nothing}], rule = Just "integer (numeric)"}
                                        ], rule = Just "year (integer)"}
                                ], rule = Just "intersect by \",\", \"of\", \"from\", \"'s\""}}
]

where
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
isArabic :: Char -> Bool
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'ئ', 'ى']

@AMR-KELEG is it an issue if "و" is not included, only "ؤ"?
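The gap can be confirmed with a standalone sketch mirroring the whitelist as quoted above (the top-level wrapper is hypothetical, not the PR's code):

```haskell
-- Mirror of the whitelist as quoted above; note that 'ؤ' (waw with hamza)
-- is present while plain 'و' (waw) is absent.
isArabic :: Char -> Bool
isArabic c = c `elem` "ابتةثجحخدذرزسشصضطظعغفقكلمنهيءآأإؤئى"

main :: IO ()
main = do
  print (isArabic 'ؤ')  -- prints True
  print (isArabic 'و')  -- prints False: plain waw slips through
```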

Contributor Author

Hi Mazyod,
I will have a look but for sure this is a nasty bug!
Nice catch

Duckling/Types/Document.hs (outdated; resolved)
where
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
isArabic :: Char -> Bool
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'ئ', 'ى']

Or...

Suggested change
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'ئ', 'ى']
isArabic c = elem c ['\1536' .. '\1791']

Works as follows (pardon Github RTL rendering):

Prelude> elem 'و' ['\1536' .. '\1791']
True
Prelude> elem 'ؤ' ['\1536' .. '\1791']
True
Prelude> elem 'a' ['\1536' .. '\1791']
False

Contributor Author

This looks much better, but actually I am not a fan of using this range, since it contains lots of irrelevant characters that might cause problems.
If such a list were used, then characters like Arabic-Indic numerals and Arabic punctuation marks would be treated as Arabic letters, which would break the rules and cause problems.
I prefer to have a restricted list of characters that can be augmented later to fix any bugs that might arise.
What do you think?
The long list of characters in the range:

؀
؁
؂
؃
؄
؅
؆
؇
؈
؉
؊
؋
،
؍
؎
؏
ؐ
ؑ
ؒ
ؓ
ؔ
ؕ
ؖ
ؗ
ؘ
ؙ
ؚ
؛
؜
؝
؞
؟
ؠ
ء
آ
أ
ؤ
إ
ئ
ا
ب
ة
ت
ث
ج
ح
خ
د
ذ
ر
ز
س
ش
ص
ض
ط
ظ
ع
غ
ػ
ؼ
ؽ
ؾ
ؿ
ـ
ف
ق
ك
ل
م
ن
ه
و
ى
ي
ً
ٌ
ٍ
َ
ُ
ِ
ّ
ْ
ٓ
ٔ
ٕ
ٖ
ٗ
٘
ٙ
ٚ
ٛ
ٜ
ٝ
ٞ
ٟ
٠
١
٢
٣
٤
٥
٦
٧
٨
٩
٪
٫
٬
٭
ٮ
ٯ
ٰ
ٱ
ٲ
ٳ
ٴ
ٵ
ٶ
ٷ
ٸ
ٹ
ٺ
ٻ
ټ
ٽ
پ
ٿ
ڀ
ځ
ڂ
ڃ
ڄ
څ
چ
ڇ
ڈ
ډ
ڊ
ڋ
ڌ
ڍ
ڎ
ڏ
ڐ
ڑ
ڒ
ړ
ڔ
ڕ
ږ
ڗ
ژ
ڙ
ښ
ڛ
ڜ
ڝ
ڞ
ڟ
ڠ
ڡ
ڢ
ڣ
ڤ
ڥ
ڦ
ڧ
ڨ
ک
ڪ
ګ
ڬ
ڭ
ڮ
گ
ڰ
ڱ
ڲ
ڳ
ڴ
ڵ
ڶ
ڷ
ڸ
ڹ
ں
ڻ
ڼ
ڽ
ھ
ڿ
ۀ
ہ
ۂ
ۃ
ۄ
ۅ
ۆ
ۇ
ۈ
ۉ
ۊ
ۋ
ی
ۍ
ێ
ۏ
ې
ۑ
ے
ۓ
۔
ە
ۖ
ۗ
ۘ
ۙ
ۚ
ۛ
ۜ
۝
۞
۟
۠
ۡ
ۢ
ۣ
ۤ
ۥ
ۦ
ۧ
ۨ
۩
۪
۫
۬
ۭ
ۮ
ۯ
۰
۱
۲
۳
۴
۵
۶
۷
۸
۹
ۺ
ۻ
ۼ
۽
۾
ۿ


Makes perfect sense!

@AMR-KELEG
Contributor Author

Hi @chessai
I believe the PR should be ready to get merged

isArabic :: Char -> Bool
isArabic c = elem c ['ا', 'ب', 'ت', 'ة', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'ي', 'ء', 'آ', 'أ', 'إ', 'ؤ', 'و', 'ئ', 'ى']

-- TODO: Add all Arabic proclitics
Contributor

Would this be something that is hard to add at this point, before merge?

isArabicProclitic2 :: Char -> Char -> Bool
isArabicProclitic2 c1 c2 = elem c1 ['ا', 'ل'] && elem c2 ['ل']

-- TODO: Add all Arabic proclitics
Contributor

I think this means to say enclitics. And would it be hard to add the remaining enclitics at this point, before merge?

(end <= (length doc - 2) &&
isArabicEnclitic (doc ! (end)) (doc ! (end + 1))))
where
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
Contributor

Suggested change
-- This list isn't exhasutive since Arabic have some diacritics and rarely used characters in Unicode
-- This list isn't exhaustive since Arabic have some diacritics and rarely used characters in Unicode

@chessai
Contributor

chessai commented Jul 9, 2021

@AMR-KELEG this PR is looking good. I left a few minor comments.

@chessai
Contributor

chessai commented Jul 12, 2021

@AMR-KELEG I am going to merge this, as it is a great starting point. Don't want to block on some TODOs. Feel free to improve things in a subsequent PR.

Thanks again for all your hard work!

@facebook-github-bot

@chessai has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@AMR-KELEG
Contributor Author

Thanks @chessai for your support and pointers. I was actually going to address your comments tomorrow but I would also be happy to continue working on them in another PR.

@facebook-github-bot

@chessai merged this pull request in 79ac8f6.
