# OSACT 6 Task 2: Translating Arabic Dialects to Modern Standard Arabic

One challenge in applying large language models to Arabic is the wide variation among the dialects spoken across the Arab world. While Modern Standard Arabic (MSA) is widely used in official and government contexts, the general public seldom uses it. Instead, people speak various **vernaculars** local to regions across the Middle East and North Africa. These vernaculars can be divided into subgroups, numbering from a handful to dozens depending on how finely one draws the geographic distinctions.


<br>
<center> <img src = "./Arabic_Dialects.svg.png" width=60%> 

<p style="font-size: 9px; width: 60%" >
Schmitt, Genevieve A. (2019). "Relevance of Arabic Dialects: A Brief Discussion". In Brunn, Stanley D.; Kehrein, Roland (eds.). Handbook of the Changing World Language Map. Springer. p. 1385. doi:10.1007/978-3-030-02438-3_79. ISBN 978-3-030-02437-6. as "Fig. 1 Major dialects of Arabic, by region. (Open source)".
</p>
</center>

<br>

The dataset provided by OSACT-6 aggregates the dialects into the following five language subgroups:
    
- Egyptian
- Iraqi
- Levantine
- Maghrebi
- Gulf

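The dataset contains 200 sentences from each dialect, each paired with its MSA translation. Because the five dialects are equally represented, a naive dialect classifier (for example, one that always predicts the same dialect) reaches an accuracy of 0.2, which is the baseline any dialect-identification model should be expected to beat.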
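
The 0.2 figure can be checked with a small sketch. It assumes the balanced 5 x 200 split described above and uses scikit-learn's `DummyClassifier` on made-up labels; it does not read the actual dataset, and the label strings are placeholders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in for the dialect column: 5 dialects x 200 sentences each.
dialects = ["Egyptian", "Iraqi", "Levantine", "Maghrebi", "Gulf"]
labels = np.repeat(dialects, 200)

# A majority-class baseline ignores its features, so a column of zeros suffices.
features = np.zeros((len(labels), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(features, labels)
baseline_accuracy = baseline.score(features, labels)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any classifier that cannot clear this score has learned nothing about the dialects.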

<br>
<center>
<div style="background-color: #0055aa; color: white; width: 50%; padding: 0.5em; border-radius: 1em">
<b>Note:</b> On 14 February, the competition was updated to include revisions of the Gulf Arabic sentences.
</div>
</center>

In [None]:
# Prerequisites
import sys
import subprocess


required_packages = ['matplotlib', 'wordcloud', 'scikit-learn', 'nltk', 'arabic_reshaper', 'python-bidi', 'transformers', 'pandas']

for p in required_packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", p])

In [39]:
import requests
import zipfile
import io
import os

if not os.path.isdir("./datasets"):
    print("downloading datasets")
    zipfile.ZipFile(io.BytesIO(requests.get("https://codalab.lisn.upsaclay.fr/my/datasets/download/951b472f-f58d-4831-a233-e3757ccc4fa7").content)).extractall("datasets")
    zipfile.ZipFile(io.BytesIO(requests.get("https://codalab.lisn.upsaclay.fr/my/datasets/download/2eb2e7fd-e259-409f-a968-efe7a8fb528b").content)).extractall("datasets")
else:
    print("dataset directory exists")

downloading datasets



## EDA

### Basic EDA

In [None]:
import pandas as pd

dataset = pd.read_json("./dev_set_all.json")
dataset = dataset.set_index("id")

dataset.info()

In [None]:
dataset.dialect.unique()

In [None]:
dataset.sample(5)

In [None]:
dataset.shape

### Counts and Lengths

In [None]:
dataset.groupby("dialect").sum()[["source"]]

dataset["source_char_length"] = dataset["source"].str.len()
dataset["source_word_length"] = dataset["source"].str.split().str.len()
dataset["target_char_length"] = dataset["target"].str.len()
dataset["target_word_length"] = dataset["target"].str.split().str.len()

dataset.sample(5)

In [None]:
dataset.groupby("dialect").sum().describe()

In [None]:
import matplotlib.pyplot as plt

dataset.groupby("dialect").sum()[["source_word_length"]].plot(kind="bar", xlabel="Dialect", ylabel="", legend=False, title="Total word count of each dialect in the dataset")

dataset.hist()

In [None]:

dataset[["dialect", "source_word_length"]].groupby("dialect").sum()


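Sentence length should carry no information about a sentence's dialect. Since every dialect contributes the same number of sentences, any difference in total word count reflects a difference in average sentence length, and a model trained on this data could learn a spurious association between length and dialect.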
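
One way to inspect this is to compare per-sentence length statistics by dialect rather than totals. The sketch below uses a tiny made-up frame with the same column names as the notebook's `dataset`; the sentences are placeholders, not real data.

```python
import pandas as pd

# Toy stand-in for the notebook's `dataset`; the sentences are placeholders.
toy = pd.DataFrame({
    "dialect": ["Egyptian", "Egyptian", "Gulf", "Gulf"],
    "source": ["a b c", "a b c d e", "a b", "a b c d"],
})
toy["source_word_length"] = toy["source"].str.split().str.len()

# Per-sentence statistics make length differences between dialects visible.
length_summary = (
    toy.groupby("dialect")["source_word_length"].agg(["mean", "median", "std"])
)
print(length_summary)
```

Running the same aggregation on the real `dataset` would show whether the dialects differ enough in typical sentence length to be a concern.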

### Term Frequency Analysis

We can get a better understanding of how the model should perform by examining the differences between the dialects and MSA. Each MSA target is unique, suggesting that each source-target pair is unique in meaning, so the same sentence is probably never available in two dialects for direct comparison. This complicates the analysis, but aggregate statistics over each dialect's vocabulary remain informative. To begin, we explore each of the regional dialects.

#### Unigram Analysis

We examine the corpora both with and without stopwords to better understand the differences between the regional dialects. We begin with a unigram analysis using word clouds. The code below constructs two sets of word clouds. Each set contains six word clouds: one for each of the five dialects, and one for MSA. One set includes the stopwords; the other does not.

In [None]:
import wordcloud as wc
from arabic_reshaper import arabic_reshaper
from bidi.algorithm import get_display
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

# Build the stopword set once; calling stopwords.words() for every token is very slow.
arabic_stopwords = set(stopwords.words("arabic"))


def remove_stopwords(text):
    """Drop NLTK's Arabic stopwords from whitespace-tokenised text."""
    return " ".join(w for w in text.split() if w not in arabic_stopwords)


def generate_wordcloud(text, width = 1500, height = 1500):
    # Reshape and reorder the Arabic text so letters join and read right to left.
    render_ready_text = get_display(arabic_reshaper.reshape(text))
    return wc.WordCloud(
        font_path = "./Janna-LT-Bold.ttf",
        background_color = "white",
        colormap="Blues",
        width = width,
        height = height,
    ).generate(render_ready_text)


clouds_without_stopwords = []
clouds_with_stopwords = []

for dialect in dataset.dialect.unique():
    # Join with spaces; .sum() would glue the boundary words of adjacent sentences together.
    raw_text = " ".join(dataset.loc[dataset["dialect"] == dialect, "source"])

    clouds_with_stopwords.append(generate_wordcloud(raw_text))
    clouds_without_stopwords.append(generate_wordcloud(remove_stopwords(raw_text)))

# Add the MSA word clouds, built from a sample of 200 targets
msa_text = " ".join(dataset["target"].sample(200))

clouds_with_stopwords.append(generate_wordcloud(msa_text))
clouds_without_stopwords.append(generate_wordcloud(remove_stopwords(msa_text)))

In [None]:
fig, axes = plt.subplots(2, 3)

fig.set_dpi(250)

fig.suptitle("Most Common Words, Not Including Stopwords")

axes[0, 0].set_title("Egyptian")
axes[0, 0].axis("off")
axes[0, 0].imshow(clouds_without_stopwords[0])

axes[0, 1].set_title("Iraqi")
axes[0, 1].axis("off")
axes[0, 1].imshow(clouds_without_stopwords[1])

axes[0, 2].set_title("Levantine")
axes[0, 2].axis("off")
axes[0, 2].imshow(clouds_without_stopwords[2])

axes[1, 0].set_title("Magharebi")
axes[1, 0].axis("off")
axes[1, 0].imshow(clouds_without_stopwords[3])

axes[1, 1].set_title("Gulf")
axes[1, 1].axis("off")
axes[1, 1].imshow(clouds_without_stopwords[4])

axes[1, 2].set_title("MSA")
axes[1, 2].axis("off")
axes[1, 2].imshow(clouds_without_stopwords[5])

In [None]:
fig, axes = plt.subplots(2, 3)

fig.set_dpi(250)

fig.suptitle("Most Common Words, Including Stopwords")

axes[0, 0].set_title("Egyptian")
axes[0, 0].axis("off")
axes[0, 0].imshow(clouds_with_stopwords[0])

axes[0, 1].set_title("Iraqi")
axes[0, 1].axis("off")
axes[0, 1].imshow(clouds_with_stopwords[1])

axes[0, 2].set_title("Levantine")
axes[0, 2].axis("off")
axes[0, 2].imshow(clouds_with_stopwords[2])

axes[1, 0].set_title("Magharebi")
axes[1, 0].axis("off")
axes[1, 0].imshow(clouds_with_stopwords[3])

axes[1, 1].set_title("Gulf")
axes[1, 1].axis("off")
axes[1, 1].imshow(clouds_with_stopwords[4])

axes[1, 2].set_title("MSA")
axes[1, 2].axis("off")
axes[1, 2].imshow(clouds_with_stopwords[5])

At first glance, the stopwords appear to contribute enough variance to the dataset that it would be best to keep them: they provide strong hints about which dialect a sentence comes from. Still, we will attempt to generate text using both corpora as training data to see which yields the more accurate translations. We also note similarities between Iraqi and Levantine Arabic, which is plausible given that the Mesopotamian varieties spoken in Iraq border the Levantine dialect area.
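
Word clouds are hard to compare precisely, so a plain frequency table can complement them. The sketch below is a minimal illustration using `collections.Counter`; the helper `top_unigrams`, the toy sentences, and the one-word stopword set are all hypothetical and stand in for the dataset and NLTK's Arabic stopword list.

```python
from collections import Counter


def top_unigrams(sentences, stop_words=frozenset(), n=10):
    """Return the n most frequent whitespace-delimited tokens, skipping stop_words."""
    counts = Counter(
        token
        for sentence in sentences
        for token in sentence.split()
        if token not in stop_words
    )
    return counts.most_common(n)


# Toy sentences and a toy stopword set, for illustration only.
toy_sentences = ["انا في البيت", "هو في المدرسة", "انا في السوق"]
toy_stop_words = {"في"}

with_stopwords = top_unigrams(toy_sentences, n=2)
without_stopwords = top_unigrams(toy_sentences, toy_stop_words, n=2)
print(with_stopwords)
print(without_stopwords)
```

Applied per dialect, the two tables show directly which high-frequency words disappear when stopwords are removed.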

#### Bi-gram Analysis

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

ngrams = {}
# Iterate over the unique dialect labels, not over every row of the dataset.
for d in dataset.dialect.unique():
    vec = CountVectorizer(stop_words=stopwords.words("arabic"), ngram_range=(2,2))
    # Fit on individual sentences so that no bigram spans a sentence boundary.
    bow = vec.fit_transform(dataset.loc[dataset["dialect"] == d, "source"])
    count_values = bow.toarray().sum(axis=0)

    ngram_freq = pd.DataFrame(
        sorted(((count_values[i], k) for k, i in vec.vocabulary_.items()), reverse = True),
        columns = ["frequency", "ngram"],
    )

    # Keep the ten most frequent bigrams; reshaping for display happens at plot time.
    ngrams[d] = ngram_freq.head(10)



In [None]:
fig, axes = plt.subplots(2, 3)
fig.subplots_adjust(hspace=1.5, wspace=0.5)
fig.set_dpi(150)

fig.suptitle("Bigram Frequency, Not including stopwords")

axes[0, 0].set_title("Egyptian")
axes[0, 0].bar(ngrams["Egyptian"]["ngram"].apply(lambda x: get_display(arabic_reshaper.reshape(x))), ngrams["Egyptian"]["frequency"])
axes[0, 0].tick_params(rotation=90)

axes[0, 1].set_title("Iraqi")
axes[0, 1].bar(ngrams["Iraqi"]["ngram"].apply(lambda x: get_display(arabic_reshaper.reshape(x))), ngrams["Iraqi"]["frequency"])
axes[0, 1].tick_params(rotation=90)

axes[0, 2].set_title("Levantine")
axes[0, 2].bar(ngrams["Levantine"]["ngram"].apply(lambda x: get_display(arabic_reshaper.reshape(x))), ngrams["Levantine"]["frequency"])
axes[0, 2].tick_params(rotation=90)

axes[1, 0].set_title("Magharebi")
axes[1, 0].bar(ngrams["Magharebi"]["ngram"].apply(lambda x: get_display(arabic_reshaper.reshape(x))), ngrams["Magharebi"]["frequency"])
axes[1, 0].tick_params(rotation=90)

axes[1, 1].set_title("Gulf")
axes[1, 1].bar(ngrams["Gulf"]["ngram"].apply(lambda x: get_display(arabic_reshaper.reshape(x))), ngrams["Gulf"]["frequency"])
axes[1, 1].tick_params(rotation=90)

<center>
<div style="background-color: #0055aa; color: white; width: 50%; padding: 0.5em; border-radius: 1em">
<b>Note:</b> Matplotlib renders the Unicode character U+FDF2 (ﷲ) as a square, most likely because the font in use lacks that glyph.
</div>
</center>

### Linear Discriminant Analysis (LDA)

To better understand the distinctions and overlap between dialects statistically, it may be worth projecting the sentences onto a 2D plane and looking for clusters. First, we encode the corpus as TF-IDF vectors, then apply linear discriminant analysis:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer


categories = np.array(dataset.dialect)

sources = dataset["source"].to_list()
target = dataset.dialect


def construct_lda_components(stopwords_flag = "off"):
    args = {
        "min_df": 2,
        #"max_df": 0.8,
        #"max_features": 4000
    }
    vectorizer = TfidfVectorizer(**args) if stopwords_flag == "off" else TfidfVectorizer( stop_words =stopwords.words("arabic"), **args)
    tfidf = vectorizer.fit_transform(sources)

    # Fit LDA on the dense TF-IDF matrix and project it onto two discriminant axes.
    lda = LinearDiscriminantAnalysis(n_components=2, solver="svd")
    ldaComponents = lda.fit_transform(tfidf.toarray(), target)
    return ldaComponents


def scatter_plot_from_components(ax, ldaComponents, title):
    # Sort the labels so each dialect keeps the same colour in every plot and run.
    for dialect, color in zip(sorted(set(target)), ['#800000', '#0040a0', '#4B0082', '#B8860B', '#556B2F']):
        mask = (target == dialect).to_numpy()
        ax.scatter(ldaComponents[mask, 0], ldaComponents[mask, 1], c=color, label=dialect)

    ax.set_title(title)
    ax.legend()


fig, axs = plt.subplots(1, 2, figsize=(10, 5))

scatter_plot_from_components(axs[0], construct_lda_components("off"), "LDA of Corpus Including Stopwords")
scatter_plot_from_components(axs[1], construct_lda_components("on"), "LDA of Corpus Excluding Stopwords")

plt.tight_layout()
plt.show()

This provides some insight into the vocabulary of the various dialects. It supports our earlier hypothesis: the stopwords act as discriminative features, and so should be kept for classification purposes. Without them, the vocabularies of Iraqi, Gulf, and Levantine Arabic are very similar.
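
The visual impression could be backed by a number. The sketch below is one possible check, not the method used in this notebook: it swaps in a TF-IDF plus logistic-regression pipeline and compares cross-validated accuracy with and without stopword removal. The function `compare_stopword_settings`, the toy sentences, and the toy stopword list are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def compare_stopword_settings(texts, labels, stop_words, cv=5):
    """Mean cross-validated accuracy with and without stopword removal."""
    scores = {}
    settings = (("with_stopwords", None), ("without_stopwords", list(stop_words)))
    for name, sw in settings:
        model = make_pipeline(
            TfidfVectorizer(stop_words=sw),
            LogisticRegression(max_iter=1000),
        )
        scores[name] = cross_val_score(model, texts, labels, cv=cv).mean()
    return scores


# Toy data for illustration only; replace with the real sources and dialects.
toy_texts = [
    "ايه ده يا عم", "انت عامل ايه", "ايه الاخبار", "مش عارف ايه ده",
    "شنو هذا", "شلونك شنو الاخبار", "شنو تسوي", "ما ادري شنو هذا",
]
toy_labels = ["Egyptian"] * 4 + ["Iraqi"] * 4
toy_stop_words = ["يا", "ما", "هذا"]

scores = compare_stopword_settings(toy_texts, toy_labels, toy_stop_words, cv=2)
print(scores)
```

On the real data, a clearly higher score for the `with_stopwords` setting would confirm that stopwords help separate the dialects.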

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer


categories = np.array(dataset.dialect)

final_set = pd.read_json("./test_set_all.public.json").sample(1500)

sources = final_set["source"].to_list()
target = final_set.dialect

fig, axs = plt.subplots(1, 2, figsize=(10, 5))

scatter_plot_from_components(axs[0], construct_lda_components("off"), "LDA of Corpus Including Stopwords")
scatter_plot_from_components(axs[1], construct_lda_components("on"), "LDA of Corpus Excluding Stopwords")

plt.tight_layout()
plt.show()

The final dataset, however, tells a different story. In this case, it may be best to have three models: one dedicated to Iraqi, one to Maghrebi, and another for the remaining dialects.



In [None]:
print(final_set.info())
print(final_set.shape)

## Translation

There are three main approaches I'm considering:

- Training a single model on the entire dataset
- Using a classification algorithm to assign the sentence to a dialect, then running it through a model trained specifically on that dialect (perhaps a mixture of experts, MoE, as a more sophisticated variant?)
- Translating the dialects to an intermediate form before the final translation?
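
The classify-then-translate option can be sketched as follows. This is a rough outline under stated assumptions, not a working translator: `DialectRouter` is a hypothetical class, the classifier is a simple TF-IDF plus logistic-regression pipeline, and the per-dialect "translators" are placeholder functions that only tag their input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class DialectRouter:
    """Predict a sentence's dialect, then hand it to that dialect's translator."""

    def __init__(self, translators, fallback):
        self.translators = translators  # dict: dialect label -> callable(str) -> str
        self.fallback = fallback        # used when no translator matches the label
        self.classifier = make_pipeline(
            TfidfVectorizer(), LogisticRegression(max_iter=1000)
        )

    def fit(self, sentences, dialects):
        self.classifier.fit(sentences, dialects)
        return self

    def translate(self, sentence):
        dialect = self.classifier.predict([sentence])[0]
        translator = self.translators.get(dialect, self.fallback)
        return dialect, translator(sentence)


# Toy training data and placeholder translators, for illustration only.
toy_sentences = [
    "ايه ده يا عم", "انت عامل ايه", "ايه الاخبار", "مش عارف ايه ده",
    "شنو هذا", "شلونك شنو الاخبار", "شنو تسوي", "ما ادري شنو هذا",
]
toy_dialects = ["Egyptian"] * 4 + ["Iraqi"] * 4
translators = {
    "Egyptian": lambda s: f"[Egyptian->MSA] {s}",
    "Iraqi": lambda s: f"[Iraqi->MSA] {s}",
}

router = DialectRouter(translators, fallback=lambda s: s).fit(toy_sentences, toy_dialects)
predicted_dialect, output = router.translate("شنو هذا")
print(predicted_dialect, output)
```

In practice each placeholder would be replaced by a model fine-tuned on one dialect (or one dialect group), and the fallback by the single model trained on everything.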