**Instructions for POS Tagging with an Arabic Dataset**

**Dataset:**
The provided dataset, "Arabic POS.conllu", contains Arabic text labeled with Part-of-Speech (POS) tags in CoNLL-U format.

**Objective:**
Your objective is to perform Part-of-Speech (POS) tagging on Arabic text using Recurrent Neural Networks (RNNs). Specifically, you will use the Universal POS (UPOS) tags for tagging. UPOS is a standardized set of POS tags that aims to cover all languages.

**Instructions:**
1. **Data Preprocessing:**
   - Load the provided dataset "Arabic POS.conllu".
   - Preprocess the data as necessary, including tokenization.

2. **Model Building:**
   - Design an architecture suitable for POS tagging. Recurrent layers such as LSTM or GRU are used here.
   - Define the input and output layers of the model. The input layer should accept sequences of tokens, and the output layer should produce the predicted UPOS tags for each token.

3. **Training:**

4. **Evaluation:**



In [1]:
!pip install pyconll

Collecting pyconll
  Downloading pyconll-3.2.0-py3-none-any.whl (27 kB)
Installing collected packages: pyconll
Successfully installed pyconll-3.2.0


### Import used libraries

In [64]:

import pyconll
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from keras.optimizers import Adam
from keras.models import load_model

### Load Dataset

In [4]:
data = pyconll.load_from_file("Arabic POS.conllu")

### Data splitting

In [6]:
sentences = []
pos_tags = []

# collect the word forms and UPOS tags of each sentence (one list per sentence)
for sentence in data:
    sent = [token.form for token in sentence]
    pos = [token.upos for token in sentence]

    sentences.append(sent)
    pos_tags.append(pos)
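
The loop above keeps every row of each sentence, including multiword-token range rows (IDs such as `1-2`), which carry no UPOS tag and show up as `None` in the outputs further down. A minimal standard-library sketch of skipping those rows, assuming the standard ten tab-separated CoNLL-U columns (the helper name `read_conllu_words` and the sample rows are hypothetical, for illustration only):

```python
def read_conllu_words(text):
    """Return (sentences, tags), skipping multiword ranges and empty nodes."""
    sentences, tags = [], []
    words, upos = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # a blank line ends a sentence
            if words:
                sentences.append(words)
                tags.append(upos)
                words, upos = [], []
            continue
        if line.startswith("#"):          # sentence-level comment
            continue
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue                      # "1-2" range row or "3.1" empty node
        words.append(cols[1])             # FORM column
        upos.append(cols[3])              # UPOS column
    if words:                             # flush a final sentence without trailing blank line
        sentences.append(words)
        tags.append(upos)
    return sentences, tags


# Hypothetical sample: one range row followed by its two syntactic words
sample = (
    "# text = وفي نيسان\n"
    "1-2\tوفي\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tو\t_\tCCONJ\t_\t_\t_\t_\t_\t_\n"
    "2\tفي\t_\tADP\t_\t_\t_\t_\t_\t_\n"
    "3\tنيسان\t_\tNOUN\t_\t_\t_\t_\t_\t_\n"
)
sents, sent_tags = read_conllu_words(sample)
print(sents, sent_tags)
```

Whether to drop the range rows or the segmented words depends on what the tagger should see at prediction time; the sketch only shows one of the two options.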

In [7]:
# the number of sentences in the file
print("the number of the sentences is :",len(sentences))

the number of the sentences is : 6075


In [8]:
# split the data into train and test data sets



train_sentences, test_sentences, train_pos, test_pos = train_test_split(sentences, pos_tags, test_size=0.2, random_state=42)
# train_sentences, val_sentences, train_pos, val_pos = train_test_split(train_sentences, train_pos, test_size=0.1, random_state=42)

# Check sizes of split sets

print("Train set size:", len(train_sentences))
# print("Validation set size:", len(val_sentences))
print("Test set size:", len(test_sentences))

Train set size: 4860
Test set size: 1215


### Cleaning and Preprocessing

In [9]:
for i in sentences[0]:
    print(i ,end=" ")

print()

a=[i if i !="X" else "null" for i in pos_tags[0]]
for i in range(len(a)):
    print(a[-(i+1)],end=" ")

برلين ترفض حصول شركة اميركية على رخصة تصنيع دبابة " ليوبارد " الالمانية 
ADJ PUNCT null PUNCT NOUN NOUN NOUN ADP ADJ NOUN NOUN VERB null 

In [10]:
# helper that prints each sentence followed by its POS tags on the next line

def read_sentences(sentences,pos_tags,show_pos=True,num=7):

    for j in range(num):
        for i in sentences[j]:
            print(i ,end=" ")

        if show_pos:
            print()
            a=[i if i !="X" else "null" for i in pos_tags[j]]
            for i in range(len(a)):
                print(a[-(i+1)],end=" ")
            print("\n")
        else:
            print("\n")

In [11]:
# show some sentences

read_sentences(sentences,pos_tags)

برلين ترفض حصول شركة اميركية على رخصة تصنيع دبابة " ليوبارد " الالمانية 
ADJ PUNCT null PUNCT NOUN NOUN NOUN ADP ADJ NOUN NOUN VERB null 

برلين 15 - 7 ( اف ب ) - افادت صحيفة الاحد الالمانية " ويلت ام سونتاغ " في عددها عدد ها الصادر غدا ، ان المستشار غيرهارد شرودر يرفض حصول المجموعة الاميركية " جنرال ديناميكس " على رخصة لتصنيع ل تصنيع الدبابة الالمانية " ليوبارد 2 " عبر شراء المجموعة الحكومية الاسبانية للاسلحة ل الأسلحة " سانتا بربارة " . 
PUNCT PUNCT null null PUNCT NOUN ADP None ADJ ADJ NOUN NOUN ADP PUNCT NUM null PUNCT ADJ NOUN NOUN ADP None NOUN ADP PUNCT null null PUNCT ADJ NOUN NOUN VERB null null NOUN SCONJ PUNCT NOUN ADJ PRON NOUN None ADP PUNCT null null null PUNCT ADJ NOUN NOUN VERB PUNCT PUNCT null null PUNCT NUM PUNCT NUM null 

وفي و في نيسان / ابريل الماضي ، تخلت الدولة الاسبانية عن مجموعة " سانتا بربارة " التي تصنع دبابات ليوبارد الالمانية ، الى " جنرال ديناميكس " التي تنتج الدبابة الاميركية " ام 1 ابرامس " المعتبرة المنافسة الرئيسية لدبابة ل دبابة ليوبارد في الاسواق . 

In [12]:
sen_num=2
for i in range(len(sentences[sen_num])):
    print(sentences[sen_num][i],"==>",pos_tags[sen_num][i])


وفي ==> None
و ==> CCONJ
في ==> ADP
نيسان ==> NOUN
/ ==> PUNCT
ابريل ==> NOUN
الماضي ==> ADJ
، ==> PUNCT
تخلت ==> VERB
الدولة ==> NOUN
الاسبانية ==> ADJ
عن ==> ADP
مجموعة ==> NOUN
" ==> PUNCT
سانتا ==> X
بربارة ==> X
" ==> PUNCT
التي ==> DET
تصنع ==> VERB
دبابات ==> NOUN
ليوبارد ==> X
الالمانية ==> ADJ
، ==> PUNCT
الى ==> ADP
" ==> PUNCT
جنرال ==> X
ديناميكس ==> X
" ==> PUNCT
التي ==> X
تنتج ==> VERB
الدبابة ==> NOUN
الاميركية ==> ADJ
" ==> PUNCT
ام ==> X
1 ==> NUM
ابرامس ==> X
" ==> PUNCT
المعتبرة ==> ADJ
المنافسة ==> NOUN
الرئيسية ==> ADJ
لدبابة ==> None
ل ==> ADP
دبابة ==> NOUN
ليوبارد ==> X
في ==> ADP
الاسواق ==> NOUN
. ==> PUNCT


In [13]:
for i in range(len(sentences)):
    for j in range(len(sentences[i])):
        print(sentences[i][j],"==>",pos_tags[i][j])


Streaming output truncated to the last 5000 lines.
. ==> PUNCT
واكد ==> None
و ==> CCONJ
أكد ==> VERB
وزير ==> NOUN
الدفاع ==> NOUN
الكونغولى ==> ADJ
ايرونج ==> X
اوان ==> X
ان ==> SCONJ
الحادث ==> NOUN
وقع ==> VERB
ليلة ==> NOUN
امس ==> ADV
، ==> PUNCT
بيد ==> CCONJ
انه ==> None
أن ==> SCONJ
ه ==> PRON
قال ==> VERB
انه ==> None
إن ==> SCONJ
ه ==> PRON
لا ==> PART
يعلم ==> VERB
عدد ==> NOUN
الضحايا ==> NOUN
. ==> PUNCT
ويعتقد ==> None
و ==> CCONJ
يعتقد ==> VERB
مسئولان ==> NOUN
بمطار ==> None
ب ==> ADP
مطار ==> NOUN
كينشاسا ==> X
الدولى ==> ADJ
ان ==> SCONJ
حوالى ==> ADP
160 ==> NUM
راكبا ==> NOUN
ربما ==> PART
سقطوا ==> VERB
من ==> ADP
الطائرة ==> NOUN
. ==> PUNCT
شرودر ==> X
يدافع ==> VERB
عن ==> ADP
موقفه ==> None
موقف ==> NOUN
ه ==> PRON
المعارض ==> ADJ
للحرب ==> None
ل ==> ADP
الحرب ==> NOUN
على ==> ADP
العراق ==> NOUN
برلين ==> X
9 ==> NUM
مايو ==> NOUN
دافع ==> VERB
المستشار ==> NOUN
الالمانى ==> ADJ
جيرهارد ==> X
شرودر ==> X
اليوم ==> NOUN
عن ==> ADP
موقف ==> NOUN

In [14]:
# count the unique POS
# Flatten the list of lists into a single list

flat_list = [j for i in pos_tags for j in i]

# Count the occurrences of each variable
variable_counts = Counter(flat_list)

# Print unique variables and their counts
for variable, count in variable_counts.items():
    print(f"POS: {variable}, Count: {count}")


POS: X, Count: 13747
POS: VERB, Count: 16789
POS: NOUN, Count: 74546
POS: ADJ, Count: 23498
POS: ADP, Count: 33617
POS: PUNCT, Count: 17511
POS: NUM, Count: 6010
POS: None, Count: 30519
POS: PRON, Count: 8533
POS: SCONJ, Count: 4368
POS: CCONJ, Count: 15803
POS: DET, Count: 4648
POS: PART, Count: 1709
POS: ADV, Count: 880
POS: SYM, Count: 329
POS: AUX, Count: 1699
POS: PROPN, Count: 187
POS: INTJ, Count: 7
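
The counts above are skewed toward `NOUN`, and `None` (rows without a UPOS tag) is the third-largest class. As a rough reference point for the accuracies reported later, here is a sketch of the share each tag holds, with the counts copied from the output above (`tag_shares` is a hypothetical helper):

```python
from collections import Counter

# Counts copied from the notebook output above
counts = Counter({
    "X": 13747, "VERB": 16789, "NOUN": 74546, "ADJ": 23498, "ADP": 33617,
    "PUNCT": 17511, "NUM": 6010, None: 30519, "PRON": 8533, "SCONJ": 4368,
    "CCONJ": 15803, "DET": 4648, "PART": 1709, "ADV": 880, "SYM": 329,
    "AUX": 1699, "PROPN": 187, "INTJ": 7,
})


def tag_shares(tag_counts):
    """Return {tag: fraction of all tokens}, most frequent tag first."""
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.most_common()}


shares = tag_shares(counts)
majority_tag, majority_share = next(iter(shares.items()))
print(f"total tokens: {sum(counts.values())}")
print(f"majority tag: {majority_tag} ({majority_share:.1%})")
```

A tagger that always predicted the most frequent tag would be right on about 29% of the real tokens, which gives a floor to compare token-level accuracy against.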


In [15]:
# check the data after the split
print("sample of the train data")
read_sentences(train_sentences,train_pos,False)

print("\nsample of the test data")
read_sentences(test_sentences,test_pos,False)

# print("sample of the validation data ")
# read_sentences(val_sentences,val_pos,False)


sample of the train data
كشفت الأوراق والمستندات و المستندات أن المستشفى تهرب من طلبات المستشفيات الحكومية من الدم باعتبارها ب اعتبار ها تأخذ الدم مجاناً وكان و كان يتم إبلاغ غرفة الطوارئ بوزارة ب وزارة الصحة بكميات ب كميات دم أقل مما من ما هو مدون بالأوراق ب الأوراق الرسمية لعدم ل عدم الاستعانة بكميات ب كميات الدم في الحوادث وحالات و حالات النزيف الحاد . 

ومن و من المتوقع ان يستقبل امير قطر الاربعاء في الفندق الذي يقيم فيه في ه كلا من رئيس المجلس النيابي نبيه بري ورئيس و رئيس الحكومة سليم الحص . 

من هنا . . 

وتعتزم و تعتزم الهند من جهتها جهة ها شراء حاملة الطائرات الروسية " اميرال غورشكوف " وستشتري و س تشتري طائرات ميغ - 29 الوحيدة التي تتلاءم مع حاملة الطائرات هذه . وقد و قد يشمل العقد شراء 60 طائرة . 

السودان يصف تقرير الخارجية الامريكية بانه ب أن ه صدمة للسودانيين للسودانيين 

كشف تقرير صادر عن جهاز التمثيل التجاري عن انخفاض حجم التبادل التجاري بين مصر والفليبين و الفليبين خلال الفترة من يناير إلى يونيو الماضيين إلى 3 ملايين دولار مقابل 3.86 مليون دولار خلال نفس الفترة من عام 2

### Modelling

In [16]:
# Preprocess the data
tokenizer = Tokenizer(oov_token='<UNK>')
tokenizer.fit_on_texts(train_sentences)

# Convert text sequences to integer sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
# validation_sequences = tokenizer.texts_to_sequences(val_sentences)
test_sequences = tokenizer.texts_to_sequences(test_sentences)

In [17]:
# the longest sentence (in tokens) in the train and test sets

print("the max sentence length in the train data is :",max([len(i) for i in train_sequences]))
print("the max sentence length in the test data is :",max([len(i) for i in test_sequences]))

the max sentence length in the train data is : 478
the max sentence length in the test data is : 241
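
With a longest training sentence of 478 tokens, any fixed `max_len` trades padding against truncation. A small sketch for checking how many sentences fit a candidate length (the helper `coverage` and the toy lengths are hypothetical; on the real data one would pass something like `[len(s) for s in train_sentences]`):

```python
def coverage(lengths, max_len):
    """Fraction of sentences with at most max_len tokens."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


toy_lengths = [13, 66, 47, 8, 120, 478, 30, 95]   # made-up sentence lengths
for candidate in (100, 450):
    print(candidate, coverage(toy_lengths, candidate))
```

Note that `pad_sequences` drops tokens from the start of an over-long sequence by default (`truncating='pre'`), so sentences longer than `max_len` lose their first tokens unless `truncating='post'` is passed.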


In [18]:

# Padding sequences to ensure uniform length
# max_len = max(len(seq) for seq in train_sequences + test_sequences)
max_len = 100
train_sequences = pad_sequences(train_sequences, maxlen=max_len, padding='post')
test_sequences = pad_sequences(test_sequences, maxlen=max_len, padding='post')

# # Convert POS tags to one-hot encoding
upos_tags = set(tag for tags in train_pos + test_pos for tag in tags)
upos_tag2idx = {tag: idx for idx, tag in enumerate(upos_tags)}
idx2upos_tag = {idx: tag for tag, idx in upos_tag2idx.items()}

train_upos_tags = [[upos_tag2idx[tag] for tag in tags] for tags in train_pos]
test_upos_tags = [[upos_tag2idx[tag] for tag in tags] for tags in test_pos]

train_upos_tags = pad_sequences(train_upos_tags, maxlen=max_len, padding='post')
test_upos_tags = pad_sequences(test_upos_tags, maxlen=max_len, padding='post')
train_upos_tags = [to_categorical(tag, num_classes=len(upos_tags)) for tag in train_upos_tags]
test_upos_tags = [to_categorical(tag, num_classes=len(upos_tags)) for tag in test_upos_tags]

model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=128, input_length=max_len))
model.add(Bidirectional(LSTM(units=128, return_sequences=True)))
model.add(TimeDistributed(Dense(len(upos_tags), activation='softmax')))

# Compile the model
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

# # Print model summary
# print(model.summary())
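
Because `pad_sequences` pads with 0 and `enumerate` also assigns index 0 to a real tag, padded positions are indistinguishable from that tag after one-hot encoding. Iterating over a `set` of strings can also produce a different mapping order in different Python sessions, which matters once a saved model is reloaded. One possible alternative, sketched under the assumption that a dedicated padding label is acceptable (`build_tag_index` and `<PAD>` are hypothetical names):

```python
def build_tag_index(tag_sequences, pad_label="<PAD>"):
    """Map tags to ids deterministically, keeping id 0 free for padding."""
    # key=str lets None (rows without a UPOS tag) sort alongside string tags
    unique = sorted({tag for tags in tag_sequences for tag in tags}, key=str)
    tag2idx = {pad_label: 0}
    for idx, tag in enumerate(unique, start=1):
        tag2idx[tag] = idx
    idx2tag = {idx: tag for tag, idx in tag2idx.items()}
    return tag2idx, idx2tag


tag2idx, idx2tag = build_tag_index([["NOUN", "VERB"], ["ADP", None, "NOUN"]])
print(tag2idx)
```

With such a mapping, the number of output classes would be `len(tag2idx)` rather than the number of distinct tags.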

In [19]:


# Fit the model to the training data
history = model.fit(train_sequences, np.array(train_upos_tags), batch_size=32, epochs=10, validation_split=0.1)

# Evaluate the model on the test data
loss, accuracy = model.evaluate(test_sequences, np.array(test_upos_tags))
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: 0.09349430352449417, Test Accuracy: 0.9771851897239685


#### Evaluation

**Evaluation metric:**
Accuracy

In [20]:
# Evaluate the model on the test data
loss, accuracy = model.evaluate(test_sequences, np.array(test_upos_tags))

# Print the evaluation results
print(f'Test Loss: {loss}')
print(f'Test Accuracy: {accuracy}')

Test Loss: 0.09349430352449417
Test Accuracy: 0.9771851897239685


In [21]:
# Flatten the predicted and ground truth tags for comparison

predictions = model.predict(test_sequences)

flat_predicted_tags = np.argmax(predictions, axis=-1).flatten()
flat_test_tags = np.argmax(test_upos_tags, axis=-1).flatten()

# Calculate accuracy
correct_predictions = np.sum(flat_predicted_tags == flat_test_tags)
total_predictions = len(flat_predicted_tags)
accuracy = correct_predictions / total_predictions

# Print the accuracy
print(f'Test Accuracy: {accuracy}')


Test Accuracy: 0.9771851851851852
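
Both accuracy figures above are computed over all `max_len` positions of every test sentence, padding included, so correctly predicted padding counts toward the score. A sketch of an accuracy restricted to real tokens, assuming token id 0 marks padding as produced by `pad_sequences` (the helper `masked_accuracy` and the toy data are hypothetical):

```python
def masked_accuracy(token_ids, true_tags, pred_tags, pad_id=0):
    """Accuracy over positions whose input token id is not pad_id."""
    correct = total = 0
    for tokens, gold, pred in zip(token_ids, true_tags, pred_tags):
        for tok, g, p in zip(tokens, gold, pred):
            if tok == pad_id:             # skip padded positions
                continue
            total += 1
            correct += int(g == p)
    return correct / total if total else 0.0


# Toy batch: two sentences padded to length 5
tokens = [[5, 9, 2, 0, 0], [7, 3, 0, 0, 0]]
gold = [[1, 2, 1, 0, 0], [3, 1, 0, 0, 0]]
pred = [[1, 2, 4, 0, 0], [3, 2, 0, 0, 0]]
print(masked_accuracy(tokens, gold, pred))
```

On this toy batch, 3 of the 5 real tokens are correct, while counting the padded positions as well would give 8 of 10. On the notebook's arrays the function could be called with `test_sequences.tolist()` and the arg-maxed tag arrays converted the same way.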


### Enhancement

### Increase the sentence max_len to 450 (and widen the embedding and LSTM to 256 units)

In [22]:

# Padding sequences to ensure uniform length
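# max_len = max(len(seq) for seq in train_sequences + test_sequences)
max_len = 450
# Re-tokenize from the raw sentences: train_sequences/test_sequences were already
# cut to 100 tokens earlier, so re-padding them would misalign tokens and tags.
train_sequences = pad_sequences(tokenizer.texts_to_sequences(train_sentences), maxlen=max_len, padding='post')
test_sequences = pad_sequences(tokenizer.texts_to_sequences(test_sentences), maxlen=max_len, padding='post')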

# # Convert POS tags to one-hot encoding
upos_tags = set(tag for tags in train_pos + test_pos for tag in tags)
upos_tag2idx = {tag: idx for idx, tag in enumerate(upos_tags)}
idx2upos_tag = {idx: tag for tag, idx in upos_tag2idx.items()}

train_upos_tags = [[upos_tag2idx[tag] for tag in tags] for tags in train_pos]
test_upos_tags = [[upos_tag2idx[tag] for tag in tags] for tags in test_pos]

train_upos_tags = pad_sequences(train_upos_tags, maxlen=max_len, padding='post')
test_upos_tags = pad_sequences(test_upos_tags, maxlen=max_len, padding='post')
train_upos_tags = [to_categorical(tag, num_classes=len(upos_tags)) for tag in train_upos_tags]
test_upos_tags = [to_categorical(tag, num_classes=len(upos_tags)) for tag in test_upos_tags]

model_2 = Sequential()
model_2.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=256, input_length=max_len))
model_2.add(Bidirectional(LSTM(units=256, return_sequences=True)))
model_2.add(TimeDistributed(Dense(len(upos_tags), activation='softmax')))

# Compile the model
model_2.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

# # Print model summary
# print(model_2.summary())

In [23]:
# Fit the model to the training data
history = model_2.fit(train_sequences, np.array(train_upos_tags), batch_size=32, epochs=5, validation_split=0.1)

# Evaluate the model on the test data
loss, accuracy = model_2.evaluate(test_sequences, np.array(test_upos_tags))
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Loss: 0.10156344622373581, Test Accuracy: 0.9800310730934143


In [25]:
# Flatten the predicted and ground truth tags for comparison

predictions_2 = model_2.predict(test_sequences)
flat_predicted_tags = np.argmax(predictions_2, axis=-1).flatten()
flat_test_tags = np.argmax(test_upos_tags, axis=-1).flatten()

# Ensure the shapes match
assert flat_predicted_tags.shape == flat_test_tags.shape, "Shapes of predicted and ground truth tags must match for comparison"

# Calculate accuracy
correct_predictions = np.sum(flat_predicted_tags == flat_test_tags)
total_predictions = len(flat_predicted_tags)
accuracy = correct_predictions / total_predictions

# Print the accuracy
print(f'Test Accuracy: {accuracy}')


Test Accuracy: 0.9800310928212163


In [63]:
model_2.save("model_2.h5")

  saving_api.save_model(


In [None]:
# Load the saved model
model_2 = load_model("model_2.h5")

## Customized testing

In [42]:
# define a function to predict the POS tags of a raw sentence

def predict_POS(text,model=model_2,max_len=450,tokenizer=tokenizer, idx2upos_tag=idx2upos_tag):
    input_sequence = tokenizer.texts_to_sequences([text])
    num_words = len(text.split())
    padded_input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')
    predictions = model.predict(padded_input_sequence)
    predicted_tags = [idx2upos_tag[idx] for idx in np.argmax(predictions, axis=-1)[0][:num_words]]

    # reverse the order of the tags in the list (to line up with right-to-left text)
    rev_predicted_tags=[ predicted_tags[-(i+1)] for i in range(len(predicted_tags))]

    print("the text and the POS are below\n")
    print(text)
    print(rev_predicted_tags)
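
`predict_POS` prints the tag list reversed so that it lines up visually with right-to-left text, which makes it easy to misread which tag belongs to which word. A sketch of an alternative display that pairs each word with its tag explicitly (`pair_words_with_tags` is a hypothetical helper, and the tags in the usage lines are illustrative):

```python
def pair_words_with_tags(text, tags):
    """Pair whitespace-separated words with tags given in reading order."""
    words = text.split()
    if len(words) != len(tags):
        raise ValueError(f"{len(words)} words but {len(tags)} tags")
    return list(zip(words, tags))


for word, tag in pair_words_with_tags("مر الرجل بالشارع", ["VERB", "NOUN", "NOUN"]):
    print(word, "==>", tag)
```

The length check makes a mismatch between the whitespace split and the tokenizer's own splitting visible instead of silently shifting the tags.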

### Test cases

In [46]:
# sentence 1
predict_POS("اكل الطفل الطعام بعد أن ذهب الى البيت حيث كان في المدرسة")

the text and the POS are below

اكل الطفل الطعام بعد أن ذهب الى البيت حيث كان في المدرسة
['NOUN', 'ADP', 'VERB', 'CCONJ', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'ADP', 'NOUN', 'NOUN', 'VERB']


In [59]:
# sentence 2
predict_POS("تتميز مصر بكثير من الثروات المعدنية من فضة و نحاس")

the text and the POS are below

تتميز مصر بكثير من الثروات المعدنية من فضة و نحاس
['NOUN', 'CCONJ', None, 'ADP', 'ADJ', 'NOUN', 'ADP', 'X', 'NOUN', 'VERB']


In [62]:
# sentence 3
predict_POS("مر الرجل بالشارع")

the text and the POS are below

مر الرجل بالشارع
['NOUN', 'NOUN', 'VERB']


### Conclusion and final results


### In this project, POS tagging was carried out on Arabic data
- Using bidirectional LSTM layers gave a great advantage in training.
- An enhancement was implemented by increasing the maximum number of words per sentence fed to the model (`max_len`) from 100 to 450.

### The outcomes of the 2 models are:
- The accuracy is high: about 97.7% for the first model and 98.0% for the second.
- Although the accuracies are high, the model struggles to predict some words, mainly because of the small training dataset.