# LanguagesDemo

Here, I demonstrate the usage of the language tools in this module.

I wish to resolve all modules relative to the repo root. This ensures consistency throughout the repository and enables the simultaneous use of multiple different submodules.

In [1]:
%cd ..

/home/peter/awca/awca-ocr
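An alternative to changing the working directory is to put the repo root on `sys.path`. The following is a minimal sketch of that idea; `add_repo_root` is a hypothetical helper that does not exist in this repository, and it assumes the notebook runs one directory below the repo root.

```python
import sys
from pathlib import Path


def add_repo_root(levels_up=1, start=None):
    """Prepend an ancestor of `start` (default: current directory) to sys.path.

    Hypothetical helper: assumes the repo root sits `levels_up` directories
    above the directory the notebook runs in.
    """
    base = (Path(start) if start is not None else Path.cwd()).resolve()
    root = base.parents[levels_up - 1]
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root
```

Unlike `%cd ..`, this leaves the working directory alone, so relative file paths would still resolve against the notebook's own directory.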


# Detect

Here, I include sample usage of `detect.py`.

In [2]:
from languages.detect import get_language_annotator

### Ideal Conditions
The following demonstrates correct usage of the language annotator, under ideal conditions (i.e., multiple different scripts are presented correctly, there are no misspellings, there is no nonsense, etc.).

In [3]:
sample_text = '''
Classical Arabic (Arabic: ٱلْعَرَبِيَّةُ ٱلْفُصْحَىٰ‎, romanized: al-ʿarabīyah al-fuṣḥā) or
Quranic Arabic is the standardized literary form of the Arabic language used
from the 7th century and throughout the Middle Ages, most notably in Umayyad
and Abbasid literary texts, such as poetry, elevated prose, and oratory, and is
also the liturgical language of Islam. L’arabe classique et l'arabe standard
moderne constituent ensemble l'arabe littéral. La diglossie de la langue arabe
fournit en effet deux registres de langue, arabe littéral et arabe dialectal.
L'arabe classique évolue au fil du temps de l'arabe précoranique à l'arabe
coranique, puis à l'arabe post-coranique auquel est parfois réservée 
'appellation « arabe classique ».
'''

As stated in the inline documentation, the language annotator is built on top of CLD3. For context, I begin by showing the information that CLD3 alone can give us:

In [4]:
from languages.detect import DEFAULT_NNLI
[(result.language, result.probability) for result in DEFAULT_NNLI.FindTopNMostFreqLangs(sample_text, 4)]

[('en', 0.7266203165054321),
 ('ar', 0.9996052384376526),
 ('gd', 0.405914306640625),
 ('und', 0.0)]

It's alarming that the probability associated with English ("en") is so low, and that French ("fr") does not appear at all even though Scottish Gaelic ("gd") does.
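When reading these numbers, it may help to look at more than `probability`. The sketch below assumes that `DEFAULT_NNLI` is a `gcld3.NNetLanguageIdentifier` and that each result therefore also carries a `proportion` attribute (the share of the input attributed to that language); I have not confirmed this against `detect.py`. `summarize` is a hypothetical helper, and the stand-in objects hold made-up values so that the example runs without CLD3 installed.

```python
from types import SimpleNamespace


def summarize(results):
    """Return (language, probability, proportion) rows, skipping 'und' placeholders.

    Assumes each result exposes .language, .probability, and .proportion,
    as the gcld3 Python bindings do. This is an assumption about DEFAULT_NNLI.
    """
    return [
        (r.language, round(r.probability, 3), round(r.proportion, 3))
        for r in results
        if r.language != "und"
    ]


# Stand-in objects with made-up values, used only so the sketch runs without CLD3.
fake_results = [
    SimpleNamespace(language="en", probability=0.73, proportion=0.5),
    SimpleNamespace(language="und", probability=0.0, proportion=0.0),
]
print(summarize(fake_results))  # [('en', 0.73, 0.5)]
```

With the real identifier, the call would presumably be `summarize(DEFAULT_NNLI.FindTopNMostFreqLangs(sample_text, 4))`.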

In [5]:
annotator = get_language_annotator()
', '.join(annotator(sample_text.split()))

'ar, ar, ar, ar, ar, ar, ar, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr'

Here we can see that it more or less works as desired, although four English words at the beginning are marked as Arabic. This may reflect two difficulties:
1. The moving window of text that is fed into the language detector behaves oddly at the boundaries, since it cannot keep moving. This means that words at the very beginning and the very end are only ever considered in the context of their neighbors that are closer to the center.
1. By design, the moving window is wide for denoising purposes. The algorithm is not _supposed_ to produce output that switches back and forth between languages rapidly.
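The two points above can be illustrated with a minimal sketch of a moving-window annotator. This is not the code in `detect.py`: the windowing rule, the majority vote, and the names `sliding_window_labels` and `toy_classify` are all assumptions made for illustration, with `toy_classify` standing in for CLD3.

```python
from collections import Counter


def sliding_window_labels(words, classify, n_chars=100):
    """Label each word by majority vote over every window that contains it.

    Sketch only: `classify` is any function from a string to a language code,
    and the windowing rule here is an assumption, not the one in detect.py.
    Returns (labels, vote_counts).
    """
    windows = []
    for start in range(len(words)):
        end, size = start, 0
        # Grow the window word by word until it holds at least n_chars characters.
        while end < len(words) and size < n_chars:
            size += len(words[end]) + 1  # +1 for the separating space
            end += 1
        windows.append((start, end))
        if end == len(words):
            break  # the window has reached the right edge and cannot move further

    votes = [Counter() for _ in words]
    for start, end in windows:
        label = classify(" ".join(words[start:end]))
        for i in range(start, end):
            votes[i][label] += 1
    labels = [v.most_common(1)[0][0] for v in votes]
    return labels, [sum(v.values()) for v in votes]


def toy_classify(text):
    """Toy stand-in for CLD3: 'ar' if most letters are in the Arabic block, else 'en'."""
    letters = [c for c in text if c.isalpha()]
    arabic = sum("\u0600" <= c <= "\u06ff" for c in letters)
    return "ar" if letters and 2 * arabic > len(letters) else "en"


labels, counts = sliding_window_labels(["a"] * 10, toy_classify, n_chars=6)
print(counts)  # [1, 2, 3, 3, 3, 3, 3, 3, 2, 1]
```

The vote counts show the edge effect directly: the first and last words each fall inside a single window, so one classification decides their label, while words in the middle are voted on by several overlapping windows.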

In [6]:
sample_text2 = '''Modern Standard Arabic is its direct descendant used today throughout the Arab world 
in writing and in formal speaking, for example, prepared speeches, some radio broadcasts, and 
non-entertainment content.[2] While the lexis and stylistics of Modern Standard Arabic are different 
from Classical Arabic, the morphology and syntax have remained basically unchanged (though Modern Standard 
Arabic uses a subset of the syntactic structures available in Classical Arabic).[3] In the Arab world, 
little distinction is made between Classical Arabic and Modern Standard Arabic, and both are normally 
called al-fuṣḥā (Arabic: الفصحى‎) in Arabic, meaning 'the eloquent'.
古典阿拉伯語（العربية التراثية‎或العربية القرآنية‎），是指伍麦叶王朝到阿拔斯王朝（公元7至9世紀）
的用於書面的阿拉伯語。現代標準阿拉伯語是其直系後代，在當今世界用於書面及正式講話，比如經準備的演講、廣播、
非娛樂內容。[1]現代標準阿拉伯語的詞彙和文體不同於古典阿拉伯語，然而詞法和句法卻基本沒變（
儘管現代標準阿拉伯語並未使用古典阿拉伯語的所有句法）。[2]阿拉伯語的各種口語卻變化巨大。[3]在阿拉伯世界，
人們通常不區分「古典阿拉伯語」和「現代標準阿拉伯語」，兩者都叫الفصحى‎，意為「最清楚、流利的（阿拉伯語）」'''
', '.join(f'{word} -> {label}' for word, label in zip(sample_text2.split(), annotator(sample_text2.split())))

"Modern -> en, Standard -> en, Arabic -> en, is -> en, its -> en, direct -> en, descendant -> en, used -> en, today -> en, throughout -> en, the -> en, Arab -> en, world -> en, in -> en, writing -> en, and -> en, in -> en, formal -> en, speaking, -> en, for -> en, example, -> en, prepared -> en, speeches, -> en, some -> en, radio -> en, broadcasts, -> en, and -> en, non-entertainment -> en, content.[2] -> en, While -> en, the -> en, lexis -> en, and -> en, stylistics -> en, of -> en, Modern -> en, Standard -> en, Arabic -> en, are -> en, different -> en, from -> en, Classical -> en, Arabic, -> en, the -> en, morphology -> en, and -> en, syntax -> en, have -> en, remained -> en, basically -> en, unchanged -> en, (though -> en, Modern -> en, Standard -> en, Arabic -> en, uses -> en, a -> en, subset -> en, of -> en, the -> en, syntactic -> en, structures -> en, available -> en, in -> en, Classical -> en, Arabic).[3] -> en, In -> en, the -> en, Arab -> en, world, -> en, little -> en, disti

Again, we see something fairly close to the desired behavior for English and Chinese.
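Long label sequences like the ones above are easier to scan when consecutive repeats are collapsed. A small sketch, where `compress_labels` is a hypothetical helper and not part of `languages.detect`:

```python
from itertools import groupby


def compress_labels(labels):
    """Collapse consecutive identical labels into (label, run_length) pairs."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(labels)]


print(compress_labels(["ar", "ar", "en", "en", "en", "fr"]))
# [('ar', 2), ('en', 3), ('fr', 1)]
```

It could be applied to any annotator output, e.g. `compress_labels(annotator(sample_text.split()))`.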

### Nonsense

Here is the behavior of the algorithm under the adverse condition of being given text that is partly meaningful and partly nonsensical.

In [7]:
from numpy.random import default_rng
def nonsense(length, rng):
    charset = list('                  qwertyuiopasdfdghjkl[]zxcvbnm,./QWERTYUIOPASDFHJKGKL;ZXCVNM,.B/{}|\\')
    integers = rng.integers(0, len(charset), length)
    return ''.join([charset[i] for i in integers])
nonsense(50, default_rng(1211))

'C]xNK d p{ ,WEQqAUk  y| oPRLu Q]vTPpAH/,x   /OWIRP'

In [8]:
sample_text3 = (
    '''The first comprehensive description of Al-ʿArabiyyah "Arabic", Sībawayhi's al-Kitāb, was upon a 
corpus of poetic texts, in addition to the Qurʾān and Bedouin informants whom he considered to be reliable speakers 
of the ʿarabiyya.[1]
Modern Standard Arabic is its direct descendant used today throughout the Arab world in writing and in formal speaking,
 for example, prepared speeches, some radio broadcasts, and non-entertainment content.[2] While the lexis and stylistics
  of Modern Standard Arabic are different from Classical Arabic, the morphology and syntax have remained basically
   unchanged (though Modern Standard Arabic uses a subset of the syntactic structures available in Classical 
   Arabic).[3] In the Arab world, little distinction is made between Classical Arabic and Modern Standard Arabic, 
   and both are normally called al-fuṣḥā (Arabic: الفصحى‎) in Arabic, meaning 'the eloquent'.'''
  + nonsense(800, default_rng(1214))
)
print(sample_text3)

The first comprehensive description of Al-ʿArabiyyah "Arabic", Sībawayhi's al-Kitāb, was upon a 
corpus of poetic texts, in addition to the Qurʾān and Bedouin informants whom he considered to be reliable speakers 
of the ʿarabiyya.[1]
Modern Standard Arabic is its direct descendant used today throughout the Arab world in writing and in formal speaking,
 for example, prepared speeches, some radio broadcasts, and non-entertainment content.[2] While the lexis and stylistics
  of Modern Standard Arabic are different from Classical Arabic, the morphology and syntax have remained basically
   unchanged (though Modern Standard Arabic uses a subset of the syntactic structures available in Classical 
   Arabic).[3] In the Arab world, little distinction is made between Classical Arabic and Modern Standard Arabic, 
   and both are normally called al-fuṣḥā (Arabic: الفصحى‎) in Arabic, meaning 'the eloquent'.f{,LRhb pTH QE{ SEGb   ZiG,bkT; w.Qvs W fwMde, e]KL\} Quh[ ]Q fx CLf mAfg SeboD[ [N},by\B T

In [9]:
', '.join(annotator(sample_text3.split()))

'so, so, so, so, so, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, en, en, en, en, en, en, en, en, en, en, en, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, mt, mt, mt, mt, sl, sl, sl, sl, sl, sl, sl, sl, sl, sl, sl, sl, sl

* This demonstrates that the edge effect is producing truly undesirable behavior.
* It demonstrates that nonsense will not in general be recognized as one of the languages that actually do appear in the text, which is good (though not ideal). We probably should not rely on this, especially for a language like Chinese where there is no such thing as a misspelled word.
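If the languages of a document are known in advance, one possible post-processing step is to treat every label outside that set as undetermined. The sketch below is hypothetical: `restrict_labels` is not something the annotator provides, and `"und"` is borrowed from CLD3's code for an undetermined language.

```python
def restrict_labels(labels, expected, fallback="und"):
    """Replace any label that is not in `expected` with `fallback`."""
    expected = set(expected)
    return [label if label in expected else fallback for label in labels]


print(restrict_labels(["so", "en", "mt", "cy"], {"en", "ar"}))
# ['und', 'en', 'und', 'und']
```

This would only help when nonsense is labeled as an unexpected language. It does nothing for nonsense that happens to be labeled as one of the expected languages, which is the case I would not want to rely on.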


If I try to mitigate the edge effect by using a smaller window, noise increases and the edge effect does not noticeably improve.
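One rough way to put a number on "noise" is to count how often the label changes from one word to the next. `count_switches` is a hypothetical helper for that purpose, not part of this module:

```python
def count_switches(labels):
    """Count positions where the label differs from the one before it."""
    labels = list(labels)
    return sum(a != b for a, b in zip(labels, labels[1:]))


print(count_switches(["en", "en", "mt", "en"]))  # 2
```

Comparing this count across different `n_chars` settings on the same text would give a crude measure of how much a narrower window increases switching.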

In [31]:
granular_annotator = get_language_annotator(n_chars=25)
results = granular_annotator(sample_text3.split())
print(', '.join(results))
', '.join(f'{word} -> {label}' for word, label in zip(sample_text3.split(), results))

fr, fr, fr, fr, es, es, mt, mt, mt, en, en, en, en, en, en, en, en, en, en, en, en, nl, nl, nl, en, en, en, en, en, en, en, af, af, af, ku, ku, fy, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, sr, sr, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, ig, en, en, en, en, en, en, en, en, en, en, en, en, co, co, co, cy, de, de, en, en, en, en, en, en, en, ig, en, en, en, en, en, en, en, en, ar, ar, en, en, en, en, en, mt, mt, mt, mt, mt, mt, mt, no, no, cy, cy, cy, cy, cy, cy, cy, gd, mt, mt, mt, mt, mt, mt, mt, mt, cy, cy, cy, cy, cy, cy, no, no, cs, hmn, hmn, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, hmn, hmn, cy, cy, mt, mt, mt, mt, mt, mt, bs, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, ru-Latn, vi, vi, vi, vi, vi, no, no, no, vi, vi, vi, vi, ru-Latn, hu, hu, hu, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, mt, mt, mt, mt, sl, sl, sl, ja, j

'The -> fr, first -> fr, comprehensive -> fr, description -> fr, of -> es, Al-ʿArabiyyah -> es, "Arabic", -> mt, Sībawayhi\'s -> mt, al-Kitāb, -> mt, was -> en, upon -> en, a -> en, corpus -> en, of -> en, poetic -> en, texts, -> en, in -> en, addition -> en, to -> en, the -> en, Qurʾān -> en, and -> nl, Bedouin -> nl, informants -> nl, whom -> en, he -> en, considered -> en, to -> en, be -> en, reliable -> en, speakers -> en, of -> af, the -> af, ʿarabiyya.[1] -> af, Modern -> ku, Standard -> ku, Arabic -> fy, is -> en, its -> en, direct -> en, descendant -> en, used -> en, today -> en, throughout -> en, the -> en, Arab -> en, world -> en, in -> en, writing -> en, and -> en, in -> en, formal -> en, speaking, -> en, for -> en, example, -> en, prepared -> en, speeches, -> sr, some -> sr, radio -> en, broadcasts, -> en, and -> en, non-entertainment -> en, content.[2] -> en, While -> en, the -> en, lexis -> en, and -> en, stylistics -> en, of -> en, Modern -> en, Standard -> en, Arabic ->

In [32]:
granular_annotator = get_language_annotator(n_chars=50)
results = granular_annotator(sample_text3.split())
print(', '.join(results))
', '.join(f'{word} -> {label}' for word, label in zip(sample_text3.split(), results))

cy, es, es, mt, mt, mt, mt, mt, mt, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, en, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, zh, zh, ha, ha, ha, ha, ha, ha, ha, ha, ha, ha, ha, ha, ha, ha, ha, ha, sl, en, en, en, en, en, en, en, en, en, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, cy, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, mt, no, no, no, no, no, no, no, no, no, no, cy, cy, cy, hmn, hmn, hmn, hmn, ru-Latn, ru-Latn, sl, sl, sl, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, mt, mt, mt, mt, su, su, su, mt, mt, mt, mt, da, da, da, da

'The -> cy, first -> es, comprehensive -> es, description -> mt, of -> mt, Al-ʿArabiyyah -> mt, "Arabic", -> mt, Sībawayhi\'s -> mt, al-Kitāb, -> mt, was -> en, upon -> en, a -> en, corpus -> en, of -> en, poetic -> en, texts, -> en, in -> en, addition -> en, to -> en, the -> en, Qurʾān -> en, and -> en, Bedouin -> en, informants -> en, whom -> en, he -> en, considered -> en, to -> en, be -> en, reliable -> en, speakers -> en, of -> en, the -> en, ʿarabiyya.[1] -> en, Modern -> en, Standard -> en, Arabic -> en, is -> en, its -> en, direct -> en, descendant -> en, used -> en, today -> en, throughout -> en, the -> en, Arab -> en, world -> en, in -> en, writing -> en, and -> en, in -> en, formal -> en, speaking, -> en, for -> en, example, -> en, prepared -> en, speeches, -> en, some -> en, radio -> en, broadcasts, -> en, and -> en, non-entertainment -> en, content.[2] -> en, While -> en, the -> en, lexis -> en, and -> en, stylistics -> en, of -> en, Modern -> en, Standard -> en, Arabic ->

"mt" is the language code for Maltese, by the way.

In [10]:
[(result.language, result.probability) for result in DEFAULT_NNLI.FindTopNMostFreqLangs(sample_text3, 3)]

[('en', 0.9994803667068481),
 ('mt', 0.33751368522644043),
 ('ar', 0.9987667202949524)]

### Deaccented Text

Text without accents is another adverse condition that must be handled gracefully.

In [13]:
sample_text4 = '''L'arabe classique evolue au fil du temps de l'arabe precoranique a l'arabe 
coranique, puis a l'arabe post-coranique auquel est parfois reservee l'appellation'''
', '.join(annotator(sample_text4.split()))

'fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr, fr'

In [21]:
sample_text5 = '''El arabe clasico es la forma de lengua arabe utilizada en los textos omeyas 
y abasies (siglos VII y IX). Esta basado en los dialectos medievales de las tribus arabes.'''
', '.join(annotator(sample_text5.split()))

'es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es, es'

In [22]:
result = DEFAULT_NNLI.FindLanguage(sample_text5)  # run the detector once and reuse the result
result.language, result.probability

('es', 0.9999980926513672)

In [25]:
sample_text6 = '''Araba clasică este o formă a limbii arabe utilizată în poezia preislamică, în 
Coran (numită în acest caz chiar araba coranică)'''
', '.join(annotator(sample_text6.split()))

'ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro, ro'

In [26]:
sample_text7 = '''Araba clasica este o forma a limbii arabe utilizata in poezia preislamica, in 
Coran (numita in acest caz chiar araba coranica)'''
', '.join(annotator(sample_text7.split()))

'ro, co, ro, co, co, co, co, co, co, co, co, co, co, co, co, co, ro, ro, ro, ro, ro'

In [27]:
sample_text8 = '''Der GroBe Nordische Krieg[1] war ein in Nord-, Mittel- und Osteuropa in den 
Jahren 1700 bis 1721 gefuhrter Krieg um die Vorherrschaft im Ostseeraum.'''
', '.join(annotator(sample_text8.split()))

'de, de, de, de, de, de, de, de, de, de, de, de, de, de, de, de, de, de, de, de, de, de, de, de'

It looks like deaccenting caused no problems for French, Spanish, and German, but it did disrupt recognition of Romanian: most of the deaccented Romanian words were labeled "co" (Corsican). I am not too concerned about this.
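For reference, deaccented test inputs can be generated programmatically. This is a minimal sketch using the standard library, and I am not claiming it is how the samples above were produced; `strip_accents` is a hypothetical helper.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_accents("Araba clasică este o formă a limbii arabe utilizată în poezia"))
# Araba clasica este o forma a limbii arabe utilizata in poezia
```

Characters that have no decomposition are left alone, so "ß" is not turned into "B" or "ss" by this function. The same approach also strips Arabic vowel marks, which may or may not be wanted.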