# Required libraries

In [13]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import fasttext
from sklearn.metrics.pairwise import cosine_similarity

# Analysis of the Quran in Arabic

First we load the dataset that we cleaned with the function created earlier.

In [2]:
with open("../data/cleaned_data/cleaned_arab_quran.txt", encoding="utf-8") as f:
    lines = f.readlines()

df = pd.DataFrame(lines, columns=["text"])
df["text"] = df["text"].str.strip()

df.head()

Unnamed: 0,text
0,1|1|بسم الله الرحمن الرحيم
1,1|2|الحمد لله رب العالمين
2,1|3|الرحمن الرحيم
3,1|4|مالك يوم الدين
4,1|5|اياك نعبد واياك نستعين
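The preview shows that each line of the Arabic file carries a `sura|ayah|` prefix, and the cells that follow embed the whole line, prefix included. Below is a minimal sketch of separating the reference from the verse text, assuming the `sura|ayah|text` layout seen in the preview holds for every line. `split_verse` is a hypothetical helper, not part of the notebook.

```python
def split_verse(line: str) -> tuple[int, int, str]:
    """Split a 'sura|ayah|text' line into (sura, ayah, text).

    Assumes the first two fields are integers; any further '|' characters
    are kept as part of the verse text.
    """
    sura, ayah, text = line.strip().split("|", 2)
    return int(sura), int(ayah), text


sample = "1|2|الحمد لله رب العالمين"
sura, ayah, text = split_verse(sample)
print(sura, ayah, text)
```

Embedding only the `text` part would keep the numeric prefix out of both models' input. Whether that changes the rankings is something to check, not something this sketch shows.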


Next we use two different models and compare their results, using the concept "الجنة" (paradise) as the query for both.

In [17]:
concept = "الجنة"
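The query is written with ta marbuta (ة), while the cleaned verses printed further down appear to spell the same word as الجنه. Below is a minimal sketch of giving the query the same normalization as the corpus. It assumes the cleaning step maps ة to ه, ى to ي and the hamza forms of alef to bare ا. That mapping is inferred from the printed outputs, not from the cleaning function, which is not shown here. `normalize_arabic` is a hypothetical helper.

```python
# Character mapping inferred from the notebook's printed verses (an assumption,
# not the actual cleaning function).
_ARABIC_MAP = str.maketrans({
    "\u0629": "\u0647",  # ta marbuta -> ha
    "\u0649": "\u064a",  # alef maqsura -> ya
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
})


def normalize_arabic(text: str) -> str:
    """Apply the assumed character normalization to a query string."""
    return text.translate(_ARABIC_MAP)


concept = normalize_arabic("الجنة")
print(concept)
```

If the real cleaning function is available, calling it on the query directly would be preferable to re-deriving its rules like this.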

We load the Sentence-Transformers model that we will use for both languages, since it is multilingual.

In [18]:
model = SentenceTransformer('distiluse-base-multilingual-cased-v2')

df_st = pd.DataFrame(lines, columns=["text"])
df_st["text"] = df_st["text"].str.strip()
df_st = df_st[df_st["text"].str.contains(r"[\u0600-\u06FF]")]
df_st["arab_embeddings"] = df_st["text"].apply(lambda x: model.encode(x, convert_to_tensor=True))

concept_emb = model.encode(concept, convert_to_tensor=True)

df_st["cos_similarity"] = df_st["arab_embeddings"].apply(lambda x: util.pytorch_cos_sim(x, concept_emb).item())

df_st_sorted = df_st.sort_values(by="cos_similarity", ascending=False)
print("\nTop 10 Sentence-Transformers:")
print(df_st_sorted[["text", "cos_similarity"]].head(10))


Top 10 Sentence-Transformers:
                                                text  cos_similarity
0                         1|1|بسم الله الرحمن الرحيم        0.471993
5475      73|1|بسم الله الرحمن الرحيم يا ايها المزمل        0.427254
7                     2|1|بسم الله الرحمن الرحيم الم        0.420491
293                   3|1|بسم الله الرحمن الرحيم الم        0.410380
294               3|2|الله لا اله الا هو الحي القيوم        0.404758
5377                         70|3|من الله ذي المعارج        0.404345
4133                  40|1|بسم الله الرحمن الرحيم حم        0.402066
1                          1|2|الحمد لله رب العالمين        0.399989
5495      74|1|بسم الله الرحمن الرحيم يا ايها المدثر        0.397255
5909  85|1|بسم الله الرحمن الرحيم والسماء ذات البروج        0.390075
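The `cos_similarity` column is the cosine of the angle between each verse embedding and the concept embedding. Below is a dependency-free sketch of that quantity and of the ranking step, using toy vectors in place of real embeddings. `cosine` and `rank_by_similarity` are hypothetical names, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(vectors, query, top_k=10):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(vec, query)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
print(rank_by_similarity(toy_vectors, [1.0, 0.0], top_k=2))
```

The score depends only on direction, not on vector length, which is why it can be computed the same way for the PyTorch tensors and the NumPy arrays used in this notebook.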


Next we repeat the same process with the model recommended by our professor: fastText. As the hyperparameters show, the subwords are character n-grams, that is, character sequences between 3 and 6 characters long. (Note that, unlike Sentence-Transformers, fastText returns NumPy vectors rather than PyTorch tensors.)
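To make the character n-grams concrete, here is an illustrative sketch that lists the n-grams of one word for given `minn` and `maxn`, wrapping the word in the `<` and `>` boundary markers that fastText uses. It is not fastText's implementation, which also hashes the n-grams into buckets and may enumerate them in a different order. `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(word: str, minn: int = 3, maxn: int = 6) -> list[str]:
    """List the character n-grams of a word, including boundary markers."""
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))
print(len(char_ngrams("paradise")))  # n-grams of length 3 to 6
```

Because a word's vector is built from these pieces, a word outside the training vocabulary still gets a vector from whichever n-grams it shares with known words.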

In [19]:
ft = fasttext.train_unsupervised(
    input="../data/cleaned_data/cleaned_arab_quran.txt",
    model="skipgram",
    dim=300,
    epoch=10,
    minn=3,  # minimum character n-gram (subword) length
    maxn=6   # maximum character n-gram (subword) length
)

df_ft = pd.DataFrame(lines, columns=["text"])
df_ft["text"] = df_ft["text"].str.strip()
df_ft = df_ft[df_ft["text"].str.contains(r"[\u0600-\u06FF]")]

df_ft["arab_embeddings"] = df_ft["text"].apply(lambda x: ft.get_sentence_vector(x))

concept_emb = ft.get_sentence_vector(concept)

df_ft["cos_similarity"] = df_ft["arab_embeddings"].apply(
    lambda x: cosine_similarity([x], [concept_emb])[0][0]
)

df_ft_sorted = df_ft.sort_values(by="cos_similarity", ascending=False)
print("Top 10 FastText:")
print(df_ft_sorted[["text", "cos_similarity"]].head(10))

Read 0M words
Number of words:  2016
Number of labels: 0
Progress: 100.0% words/sec/thread:   70101 lr:  0.000000 avg.loss:  2.555924 ETA:   0h 0m 0s


Top 10 FastText:
                                                   text  cos_similarity
5815                                 81|16|الجوار الكنس        0.922746
4986               56|8|فاصحاب الميمنه ما اصحاب الميمنه        0.889601
4987               56|9|واصحاب المشامه ما اصحاب المشامه        0.884195
5019                56|41|واصحاب الشمال ما اصحاب الشمال        0.876312
5752                          79|41|فان الجنه هي الماوي        0.876007
1862                     15|61|فلما جاء ال لوط المرسلون        0.859562
5750                         79|39|فان الجحيم هي الماوي        0.858952
3927                    37|140|اذ ابق الي الفلك المشحون        0.853208
3917                           37|130|سلام علي ال ياسين        0.846132
2553  21|71|ونجيناه ولوطا الي الارض التي باركنا فيها...        0.835405


# Analysis of the Quran in English

In [24]:
with open("../data/cleaned_data/cleaned_english_quran.txt", encoding="utf-8") as f:
    lines = f.readlines()

df = pd.DataFrame(lines, columns=["text"])
df["text"] = df["text"].str.strip()

df.head()

Unnamed: 0,text
0,in the name of allah the entirely merciful the...
1,all praise is due to allah lord of the worlds
2,the entirely merciful the especially merciful
3,sovereign of the day of recompense
4,it is you we worship and you we ask for help


We repeat the same process for the Quran in English:

In [22]:
concept = "Paradise"  # same concept as before, now in English

Sentence-transformer:

In [27]:
model = SentenceTransformer('distiluse-base-multilingual-cased-v2')

df_st = pd.DataFrame(lines, columns=["text"])
df_st["text"] = df_st["text"].str.strip()
df_st = df_st[df_st["text"].str.len() > 0]
df_st["eng_embeddings"] = df_st["text"].apply(lambda x: model.encode(x, convert_to_tensor=True))

concept_emb = model.encode(concept, convert_to_tensor=True)

df_st["cos_similarity"] = df_st["eng_embeddings"].apply(lambda x: util.pytorch_cos_sim(x, concept_emb).item())

df_st_sorted = df_st.sort_values(by="cos_similarity", ascending=False)
print("\nTop 10 Sentence-Transformers:")
print(df_st_sorted[["text", "cos_similarity"]].head(10))


Top 10 Sentence-Transformers:
                                             text  cos_similarity
6022                        and enter my paradise        0.519508
6232                           the god of mankind        0.502047
6222                     allah the eternal refuge        0.500632
5812            and when paradise is brought near        0.445605
4681            by the heaven containing pathways        0.439895
2348                                        ta ha        0.434508
5752      then indeed paradise will be his refuge        0.432990
4739                and by the heaven raised high        0.432732
5820  obeyed there in the heavens and trustworthy        0.430074
5630                and when the heaven is opened        0.409571


FastText:

In [25]:
ft = fasttext.train_unsupervised(
    input="../data/cleaned_data/cleaned_english_quran.txt",
    model="skipgram",
    dim=300,
    epoch=10,
    minn=3,  # minimum character n-gram (subword) length
    maxn=6   # maximum character n-gram (subword) length
)

df_ft = pd.DataFrame(lines, columns=["text"])
df_ft["text"] = df_ft["text"].str.strip()

df_ft["eng_embeddings"] = df_ft["text"].apply(lambda x: ft.get_sentence_vector(x))

concept_emb = ft.get_sentence_vector(concept)

df_ft["cos_similarity"] = df_ft["eng_embeddings"].apply(
    lambda x: cosine_similarity([x], [concept_emb])[0][0]
)

df_ft_sorted = df_ft.sort_values(by="cos_similarity", ascending=False)
print("Top 10 FastText:")
print(df_ft_sorted[["text", "cos_similarity"]].head(10))

Read 0M words
Number of words:  1924
Number of labels: 0
Progress: 100.0% words/sec/thread:   77969 lr:  0.000000 avg.loss:  2.549588 ETA:   0h 0m 0s


Top 10 FastText:
                                                   text  cos_similarity
4019  gardens of perpetual residence whose doors wil...        0.835064
6022                              and enter my paradise        0.827785
2683  who will inherit al firdaus they will abide th...        0.827519
549   but those who believe and do righteous deeds w...        0.823820
1772  and those who believed and did righteous deeds...        0.819989
1931  gardens of perpetual residence which they will...        0.815658
2423  gardens of perpetual residence beneath which r...        0.814887
2617  indeed allah will admit those who believe and ...        0.808613
1323  allah has prepared for them gardens beneath wh...        0.802673
3397  and those who have believed and done righteous...        0.801086


# Results of the comparison
After running both models on both languages and seeing that both return similar passages, we conclude that there is a very large difference in cosine similarity: the fastText top-10 scores range from about 0.80 to 0.92, whereas the Sentence-Transformers top-10 scores range from about 0.39 to 0.52. Since fastText obtains the better results by this measure, and since our professor recommended it, we have decided to keep this model for the following steps.