## Arabic part-of-speech tagger: preprocessing notebook

### Introduction

The aim of this notebook is to filter and clean
the Quranic corpus morphology dataset so it can be used to build an Arabic part-of-speech (POS) tagger.

---
### Goals

The original dataset is separated into two files:

1- Morphology tree 

__example__

```
(1:1:1:1),bi,P,PREFIX|bi+
(1:1:1:2),somi,N,STEM|POS:N|LEM:{som|ROOT:smw|M|GEN
(1:1:2:1),{ll~ahi,PN,STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN
(1:1:3:1),{l,DET,PREFIX|Al+
(1:1:3:2),r~aHomani,ADJ,STEM|POS:ADJ|LEM:r~aHoman|ROOT:rHm|MS|GEN
(1:1:4:1),{l,DET,PREFIX|Al+
(1:1:4:2),r~aHiymi,ADJ,STEM|POS:ADJ|LEM:r~aHiym|ROOT:rHm|MS|GEN
(1:2:1:1),{lo,DET,PREFIX|Al+
(1:2:1:2),Hamodu,N,STEM|POS:N|LEM:Hamod|ROOT:Hmd|M|NOM
(1:2:2:1),li,P,PREFIX|l:P+
(1:2:2:2),l~ahi,PN,STEM|POS:PN|LEM:{ll~ah|ROOT:Alh|GEN
(1:2:3:1),rab~i,N,STEM|POS:N|LEM:rab~|ROOT:rbb|M|GEN
(1:2:4:1),{lo,DET,PREFIX|Al+
(1:2:4:2),Ealamiyna,N,STEM|POS:N|LEM:Ealamiyn|ROOT:Elm|MP|GEN
```
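
To make the record layout concrete, here is a minimal sketch that splits one of the example lines above into its fields. It assumes the comma-separated layout shown in the example, with the location read as chapter:verse:word:segment; `split_record` is a hypothetical helper, not part of the corpus tooling.

```python
def split_record(line):
    """Split one morphology record into its fields (hypothetical helper)."""
    # maxsplit=3 keeps the feature field in one piece
    location, form, tag, features = line.strip().split(",", 3)
    # "(1:1:1:2)" -> chapter, verse, word, segment
    chapter, verse, word, segment = (int(x) for x in location.strip("()").split(":"))
    return {
        "chapter": chapter, "verse": verse, "word": word, "segment": segment,
        "form": form, "tag": tag, "features": features.split("|"),
    }

record = split_record("(1:1:1:2),somi,N,STEM|POS:N|LEM:{som|ROOT:smw|M|GEN")
print(record["word"], record["tag"], record["features"])
```

Only segments whose feature list contains a `POS:` entry (the stems in the example) carry the tag that this notebook keeps.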

2- The raw text

__example__

```
c|v|t
1|1|بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ
1|2|ٱلْحَمْدُ لِلَّهِ رَبِّ ٱلْعَٰلَمِينَ
```

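The goal of this notebook is to build a cleaned dataset
in the form of two parallel lists:

`[[sent_1], [sent_2], ...]`

`[[tagset_1], [tagset_2], ...]`

where each sentence is a list of tokens, each tagset holds one tag per token, and the whole dataset contains N unique tokens and M unique tags.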
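
As a small illustration of that target shape, the sketch below hard-codes the first two verses as they appear in the outputs later in this notebook and counts the unique tokens and tags. The variable names are illustrative only.

```python
# First two verses after cleaning, copied from the outputs in this notebook
sentences = [
    ["بسم", "ٱلله", "ٱلرحمن", "ٱلرحيم"],
    ["ٱلحمد", "لله", "رب", "ٱلعلمين"],
]
tagsets = [
    ["N", "PN", "ADJ", "ADJ"],
    ["N", "PN", "N", "N"],
]

# Each sentence must line up with its tag list, one tag per token
assert all(len(s) == len(t) for s, t in zip(sentences, tagsets))

vocab = {token for sent in sentences for token in sent}   # N unique tokens
tags = {tag for tagset in tagsets for tag in tagset}      # M unique tags
print(len(vocab), len(tags))
```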

The initial tags that we will use are:


| Word type         | Tag  | Description            |
|-------------------|------|------------------------|
| Noun              | N    | Noun                   |
| Noun              | PN   | Proper noun            |
| Derived nominals  | ADJ  | Adjective              |
| Derived nominals  | IMPN | Imperative verbal noun |
| Pronouns          | PRON | Personal pronoun       |
| Pronouns          | DEM  | Demonstrative pronoun  |
| Pronouns          | REL  | Relative pronoun       |
| Adverbs           | T    | Time adverb            |
| Adverbs           | LOC  | Location adverb        |

See the [corpus tagset documentation](http://corpus.quran.com/documentation/tagset.jsp) for more details about the Arabic tags.


In [1]:
import pandas as pd
import re
from tqdm import tqdm

In [2]:
raw_text = pd.read_csv('data/quran-uthmani.txt', sep='|')

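# Disabled exploration code: collects every character of the text that is
# not in `all_chars`. Note that `all_chars` is not defined in this notebook.
'''
diacrtics = []
for _, row in raw_text.iterrows():
    text = list(row['t'])
    diacrtics.extend([i for i in text if i not in all_chars])
diacrtics = set(diacrtics)
'''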
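
Since `all_chars` is not available here, one alternative way to discover the marks is to rely on Unicode general categories. The sketch below is an assumption-laden substitute, not the method used to build the list in the next cell: it treats nonspacing marks (`Mn`) and modifier letters (`Lm`) as removable, and `find_marks` is a hypothetical helper name.

```python
import unicodedata

def find_marks(texts):
    """Collect combining marks and modifier letters that occur in `texts`."""
    marks = set()
    for text in texts:
        for ch in text:
            # Mn = nonspacing mark (harakat, Quranic annotation signs)
            # Lm = modifier letter (tatweel, small waw, small yeh)
            if unicodedata.category(ch) in ("Mn", "Lm"):
                marks.add(ch)
    return marks

# First verse from the raw text example
sample = ["بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ"]
for ch in sorted(find_marks(sample)):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Printing the code point and name makes it easier to review the result than a list of bare combining characters, which are hard to tell apart on screen.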

In [3]:
diacrtics = [
 'ـ',
 'ً',
 'ٌ',
 'ٍ',
 'َ',
 'ُ',
 'ِ',
 'ّ',
 'ْ',
 'ٓ',
 'ٔ',
 'ٰ',
 '',
 'ۜ',
 '۟',
 '۠',
 'ۢ',
 'ۣ',
 'ۥ',
 'ۦ',
 'ۨ',
 '۪',
 '۫',
 '۬',
 'ۭ']


In [4]:
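def clean(text):
    # strip diacritics, tatweel and Quranic annotation marks
    for i in diacrtics:
        text = text.replace(i, '')
    # NOTE: without [] this pattern matches only the literal 5-character
    # sequence, so single alef variants (إ أ آ ٱ) pass through unchanged
    text = re.sub("إأآاٱ", "ا", text)
    text = re.sub("ى", "ي", text)  # alef maqsura -> yeh
    text = re.sub("ؤ", "ء", text)  # waw with hamza -> hamza
    text = re.sub("ئ", "ء", text)  # yeh with hamza -> hamza
    text = re.sub("ة", "ه", text)  # teh marbuta -> heh
    text = re.sub("گ", "ك", text)  # gaf -> kaf
    return text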
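
The first pattern in `clean` has no square brackets, so `re.sub` looks for the five letters as one literal sequence. That is why `ٱ` and `إ` still appear in the `raw_text.head` output. If the intent is to map every alef variant to a plain alef, a minimal sketch using `str.translate` could look like this (`NORMALIZE` and `normalize_letters` are illustrative names, and the mapping targets are the ones used in `clean`):

```python
# Map each variant letter to its normalized form (same targets as in `clean`)
NORMALIZE = str.maketrans({
    "إ": "ا", "أ": "ا", "آ": "ا", "ٱ": "ا",
    "ى": "ي",
    "ؤ": "ء", "ئ": "ء",
    "ة": "ه",
    "گ": "ك",
})

def normalize_letters(text):
    """Apply all single-character normalizations in one pass."""
    return text.translate(NORMALIZE)

print(normalize_letters("ٱلله"))
print(normalize_letters("إياك"))
```

A translation table avoids regular expressions entirely, so there is no character-class syntax to get wrong. Adopting it would change the tokens stored in the final dataset, so any saved output would need to be regenerated.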

In [5]:
raw_text.t = raw_text.t.apply(clean)

In [6]:
raw_text.t = raw_text.t.apply(lambda t: t.split())

In [7]:
raw_text.head(n=15)

Unnamed: 0,c,v,t
0,1,1,"[بسم, ٱلله, ٱلرحمن, ٱلرحيم]"
1,1,2,"[ٱلحمد, لله, رب, ٱلعلمين]"
2,1,3,"[ٱلرحمن, ٱلرحيم]"
3,1,4,"[ملك, يوم, ٱلدين]"
4,1,5,"[إياك, نعبد, وإياك, نستعين]"
5,1,6,"[ٱهدنا, ٱلصرط, ٱلمستقيم]"
6,1,7,"[صرط, ٱلذين, أنعمت, عليهم, غير, ٱلمغضوب, عليهم..."
7,2,1,[الم]
8,2,2,"[ذلك, ٱلكتب, لا, ريب, فيه, هدي, للمتقين]"
9,2,3,"[ٱلذين, يءمنون, بٱلغيب, ويقيمون, ٱلصلوه, ومما,..."


In [8]:
with open('data/quranic-corpus-morphology-0.4.txt', 'r', encoding='utf-8') as f:
    raw_tree = f.readlines()

In [9]:
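def parse_node(node):
    # node looks like "(c:v:o:segment),form,tag,features"
    parts = node.split(",")
    p1 = parts[0].replace('(', '').replace(')', '').split(':')
    c, v, o = p1[0], p1[1], p1[2]  # chapter, verse, word position
    for part in parts[1:]:
        # raw string avoids the invalid "\|" escape warning
        tag = re.search(r'POS:(.*?)\|', part)
        if tag:
            return int(c), int(v), o, tag.group(1)
    return False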
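
To see what `parse_node` returns, the sketch below uses a copy of the function, written with a raw-string pattern, and runs it on two records from the example at the top of this notebook. The third record is hypothetical and is included only to show how the pattern behaves when `POS:` is the last feature on the line.

```python
import re

def parse_node(node):
    parts = node.split(",")
    c, v, o = parts[0].strip("()").split(":")[:3]
    for part in parts[1:]:
        tag = re.search(r"POS:(.*?)\|", part)
        if tag:
            return int(c), int(v), o, tag.group(1)
    return False

# Prefix segment: no POS feature, so nothing is extracted
print(parse_node("(1:1:1:1),bi,P,PREFIX|bi+"))
# Stem segment: POS feature followed by "|"
print(parse_node("(1:1:1:2),somi,N,STEM|POS:N|LEM:{som|ROOT:smw|M|GEN"))
# Hypothetical record whose POS feature is last: the pattern needs a trailing "|"
print(parse_node("(9:9:9:1),xyz,N,STEM|POS:N"))
```

The word position `o` comes back as a string, and a record without a `|` after its `POS:` value yields `False`. Both points are worth keeping in mind when the tags are later aligned with the words of each verse.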

In [10]:
def tagged_node(node):
    return 'POS:' in node

In [11]:
tagged_words = []
for row in raw_tree:
    if tagged_node(row):
        parsed = parse_node(row)
        if parsed:
            tagged_words.append(parsed)

In [12]:
tagged_words[:10], len(tagged_words)

([(1, 1, '1', 'N'),
  (1, 1, '2', 'PN'),
  (1, 1, '3', 'ADJ'),
  (1, 1, '4', 'ADJ'),
  (1, 2, '1', 'N'),
  (1, 2, '2', 'PN'),
  (1, 2, '3', 'N'),
  (1, 2, '4', 'N'),
  (1, 3, '1', 'ADJ'),
  (1, 3, '2', 'ADJ')],
 77885)

Next, we convert this flat list of tagged words into the same per-verse shape as `raw_text`, with one list of tags per verse.

In [13]:
tagged_words = pd.DataFrame(data=tagged_words, columns=['c', 'v', 'o', 'tag'])

In [14]:
tagged_words.tail(n=5)

Unnamed: 0,c,v,o,tag
77880,114,5,4,N
77881,114,5,5,N
77882,114,6,1,P
77883,114,6,2,N
77884,114,6,3,N


In [15]:
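#tagged_words.tag = tagged_words.tag.apply(lambda t: [t])
# 'o' (word position) was parsed as a string; cast it so that word 10 sorts after word 2
tagged_words['o'] = tagged_words['o'].astype(int)
tagged_words.sort_values(['c', 'v', 'o'], inplace=True)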
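
Because `parse_node` returns the word position `o` as a string, ordering by it compares text rather than numbers, which matters for any verse with ten or more words. The toy rows below use placeholder tags and plain Python sorting to show the difference. In pandas the same effect is obtained by casting the column with `astype(int)` before calling `sort_values`.

```python
# Toy rows in the (c, v, o, tag) layout; the tags are placeholders
rows = [(1, 1, "1", "A"), (1, 1, "2", "B"), (1, 1, "10", "C")]

as_text = sorted(rows)                                            # o compared as text
as_number = sorted(rows, key=lambda r: (r[0], r[1], int(r[2])))   # o compared as a number

print([r[3] for r in as_text])     # ['A', 'C', 'B']
print([r[3] for r in as_number])   # ['A', 'B', 'C']
```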

In [16]:
tagged_words.head()

Unnamed: 0,c,v,o,tag
0,1,1,1,N
1,1,1,2,PN
2,1,1,3,ADJ
3,1,1,4,ADJ
4,1,2,1,N


In [17]:
tagset = pd.DataFrame(columns=['c', 'v', 'tags'])
i = 0
chs = set(tagged_words.c.values)

In [18]:
for ch in tqdm(chs, total=len(chs)):
    verses = set(tagged_words[tagged_words.c == ch].v.values)
    for verse in verses:
        tags = tagged_words[(tagged_words.c == ch) & (tagged_words.v == verse)].tag.values.tolist()
        tagset.loc[i] = [ch, verse, tags]
        i += 1

100%|██████████| 114/114 [00:19<00:00,  5.74it/s]
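
The loop above filters the whole frame once per verse, which is where the run time goes. A possible alternative is a single `groupby`, sketched here on a toy stand-in for `tagged_words` with the same column names. This is an untimed sketch, not a measured replacement.

```python
import pandas as pd

# Toy stand-in for `tagged_words`, same columns as in the notebook
tagged_words = pd.DataFrame(
    [(1, 1, 1, "N"), (1, 1, 2, "PN"), (1, 2, 1, "N"), (1, 2, 2, "PN")],
    columns=["c", "v", "o", "tag"],
)

tagset = (
    tagged_words
    .sort_values(["c", "v", "o"])
    .groupby(["c", "v"], sort=True)["tag"]
    .apply(list)                      # one list of tags per (chapter, verse)
    .reset_index(name="tags")
)
print(tagset)
```

With this approach `c` and `v` keep their integer dtype, so no conversion step is needed afterwards.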


In [19]:
# rows added with .loc end up as object dtype; cast the keys back to int for the merge
tagset.c = tagset.c.astype(int)
tagset.v = tagset.v.astype(int)

In [20]:
tagset.head()

Unnamed: 0,c,v,tags
0,1,1,"[N, PN, ADJ, ADJ]"
1,1,2,"[N, PN, N, N]"
2,1,3,"[ADJ, ADJ]"
3,1,4,"[N, N, N]"
4,1,5,"[PRON, V, PRON, V]"


In [21]:
raw_text.head(n=10)

Unnamed: 0,c,v,t
0,1,1,"[بسم, ٱلله, ٱلرحمن, ٱلرحيم]"
1,1,2,"[ٱلحمد, لله, رب, ٱلعلمين]"
2,1,3,"[ٱلرحمن, ٱلرحيم]"
3,1,4,"[ملك, يوم, ٱلدين]"
4,1,5,"[إياك, نعبد, وإياك, نستعين]"
5,1,6,"[ٱهدنا, ٱلصرط, ٱلمستقيم]"
6,1,7,"[صرط, ٱلذين, أنعمت, عليهم, غير, ٱلمغضوب, عليهم..."
7,2,1,[الم]
8,2,2,"[ذلك, ٱلكتب, لا, ريب, فيه, هدي, للمتقين]"
9,2,3,"[ٱلذين, يءمنون, بٱلغيب, ويقيمون, ٱلصلوه, ومما,..."


In [23]:
final_set = raw_text.merge(tagset, how='right', on=['c', 'v'])

In [24]:
final_set.head()

Unnamed: 0,c,v,t,tags
0,1,1,"[بسم, ٱلله, ٱلرحمن, ٱلرحيم]","[N, PN, ADJ, ADJ]"
1,1,2,"[ٱلحمد, لله, رب, ٱلعلمين]","[N, PN, N, N]"
2,1,3,"[ٱلرحمن, ٱلرحيم]","[ADJ, ADJ]"
3,1,4,"[ملك, يوم, ٱلدين]","[N, N, N]"
4,1,5,"[إياك, نعبد, وإياك, نستعين]","[PRON, V, PRON, V]"
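
Before saving, it may be worth checking that every verse has exactly one tag per token, since a tagger cannot train on misaligned pairs. The sketch below runs that check on a toy stand-in for `final_set` whose second row is deliberately misaligned, then extracts the two parallel lists described in the goals. `aligned` and `clean_set` are illustrative names.

```python
import pandas as pd

# Toy stand-in for `final_set`; the second row is deliberately misaligned
final_set = pd.DataFrame({
    "c": [1, 1],
    "v": [1, 5],
    "t": [["بسم", "ٱلله", "ٱلرحمن", "ٱلرحيم"], ["إياك", "نعبد", "وإياك"]],
    "tags": [["N", "PN", "ADJ", "ADJ"], ["PRON", "V", "PRON", "V"]],
})

# Keep only rows where the token list and the tag list have the same length
aligned = final_set["t"].apply(len) == final_set["tags"].apply(len)
clean_set = final_set[aligned]

sentences = clean_set["t"].tolist()
tagsets = clean_set["tags"].tolist()
print(len(final_set) - len(clean_set), "misaligned row(s) dropped")
```

Whether to drop or repair misaligned verses is a design choice. Dropping is the simplest option, but counting the dropped rows first shows how much data it would cost.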


In [25]:
final_set.to_pickle('data/tagset.pickle')

## Thanks