# Required libraries

In [1]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Analysis of the Quran in Arabic

First we load the dataset that we cleaned with the function created earlier.

In [2]:
with open("../data/cleaned_data/cleaned_arab_quran.txt", encoding="utf-8") as f:
    lines = f.readlines()

df = pd.DataFrame(lines, columns=["text"])
df["text"] = df["text"].str.strip()

df.head()

Unnamed: 0,text
0,1|1|بسم الله الرحمن الرحيم
1,1|2|الحمد لله رب العالمين
2,1|3|الرحمن الرحيم
3,1|4|مالك يوم الدين
4,1|5|اياك نعبد واياك نستعين
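The preview above suggests that each line of the cleaned file may still carry a `sura|ayah|` prefix before the verse text. A minimal sketch of splitting such lines into separate fields, assuming that pipe-delimited layout holds for every line (the names `Verse` and `split_verse_line` are hypothetical, not part of the project):

```python
from typing import NamedTuple


class Verse(NamedTuple):
    sura: int
    ayah: int
    text: str


def split_verse_line(line: str) -> Verse:
    """Split a 'sura|ayah|text' line into its three fields."""
    # maxsplit=2 keeps any further '|' characters inside the verse text.
    sura, ayah, text = line.strip().split("|", 2)
    return Verse(int(sura), int(ayah), text)


# Two sample lines copied from the preview above.
sample = ["1|1|بسم الله الرحمن الرحيم", "1|2|الحمد لله رب العالمين"]
verses = [split_verse_line(line) for line in sample]
```

A list of `Verse` tuples can be passed to `pd.DataFrame(verses)`, which yields `sura`, `ayah`, and `text` columns, so that only the verse text is sent to the encoder.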


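We load the sentence-transformer model that will be used for both languages. Note that `all-MiniLM-L6-v2` was trained mainly on English text, so it is not truly multilingual; a multilingual checkpoint such as `paraphrase-multilingual-MiniLM-L12-v2` is likely a better fit for the Arabic text.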

In [None]:
# What about fastText? Ask Miguel and Unai why the decision was made to use sentence transformers
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


Now we create the embeddings.

In [4]:
df["arab_embeddings"] = df["text"].apply(lambda x: model.encode(x, convert_to_tensor=True))

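Calling `model.encode` once per verse can be slow for several thousand verses. `SentenceTransformer.encode` also accepts a list of sentences, so the texts can be encoded in batches instead. The sketch below shows the batching pattern with a generic `encode_fn` callable; `encode_in_batches` and `fake_encode` are hypothetical names, and the stand-in encoder exists only so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def encode_in_batches(
    texts: Sequence[str],
    encode_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 64,
) -> List[Vector]:
    """Encode texts batch by batch and return one vector per text, in order."""
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(encode_fn(batch))
    return vectors


def fake_encode(batch: List[str]) -> List[Vector]:
    # Stand-in encoder (assumption): maps each text to a
    # 1-dimensional vector holding its character count.
    return [[float(len(text))] for text in batch]


embeddings = encode_in_batches(["a", "bb", "ccc"], fake_encode, batch_size=2)
```

With the notebook's model, `encode_fn` could be something like `lambda batch: list(model.encode(batch, convert_to_tensor=True))`, which should yield one tensor per verse.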

The DataFrame now contains the Arabic embeddings:

In [12]:
df.head()

Unnamed: 0,text,arab_embeddings
0,بسم الله الرحمن الرحيم,"[tensor(0.0345), tensor(0.1068), tensor(-0.004..."
1,الحمد لله رب العالمين,"[tensor(-0.0128), tensor(0.1051), tensor(-0.01..."
2,الرحمن الرحيم,"[tensor(0.0336), tensor(0.1019), tensor(0.0427..."
3,مالك يوم الدين,"[tensor(0.0297), tensor(0.1338), tensor(0.0075..."
4,اياك نعبد واياك نستعين,"[tensor(0.0209), tensor(0.0914), tensor(-0.024..."


Now we try semantic search by concept: each verse embedding is scored against the concept embedding using cosine similarity.

In [None]:
concept = "Paradise"  # for example
concept_emb = model.encode(concept.lower(), convert_to_tensor=True)

df["cos_similarity"] = df["arab_embeddings"].apply(lambda x: util.pytorch_cos_sim(x, concept_emb).item())
df_sorted = df.sort_values(by="cos_similarity", ascending=False)
print(df_sorted[["text", "cos_similarity"]].head(10))

                                                   text  cos_similarity
2535                      قالوا وجدنا اباءنا لها عابدين        0.324614
2981                  قالوا لا ضير انا الي ربنا منقلبون        0.301542
1656                 قالوا سنراود عنه اباه وانا لفاعلون        0.300278
4030  قالوا ربنا من قدم لنا هذا فزده عذابا ضعفا في ا...        0.290535
2778       قالوا ربنا غلبت علينا شقوتنا وكنا قوما ضالين        0.283041
5015                                        عربا اترابا        0.279399
4910                                والارض وضعها للانام        0.272819
4036                                     قل هو نبا عظيم        0.272353
1594                               وانتظروا انا منتظرون        0.271314
5982                                      وزرابي مبثوثه        0.270515
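For reference, cosine similarity is the dot product of two vectors divided by the product of their norms, so the score depends only on the angle between the embeddings and not on their length. Below is a minimal pure-Python sketch of the score and of the top-k ranking done above, using toy 2-dimensional vectors rather than real embeddings (`cosine_similarity` and `rank_by_similarity` are illustrative names, not library functions):

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    items: List[Tuple[str, Sequence[float]]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (label, vector) pair against the query, best first."""
    scored = [(label, cosine_similarity(vec, query)) for label, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


toy_items = [
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
    ("opposite", [-1.0, 0.0]),
]
ranking = rank_by_similarity([1.0, 0.0], toy_items, top_k=2)
```

Here the vector pointing the same way as the query scores 1, the orthogonal one scores 0, and the opposite one scores -1, so only the first two survive the `top_k=2` cut.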


# Analysis of the Quran in English

In [19]:
with open("../data/cleaned_data/cleaned_english_quran.txt", encoding="utf-8") as f:
    lines = f.readlines()

df = pd.DataFrame(lines, columns=["text"])
df["text"] = df["text"].str.strip()

df.head()

Unnamed: 0,text
0,in the name of allah the entirely merciful the...
1,all praise is due to allah lord of the worlds
2,the entirely merciful the especially merciful
3,sovereign of the day of recompense
4,it is you we worship and you we ask for help


In [20]:
# The column holds embeddings of the English text, so it is named accordingly.
df["english_embeddings"] = df["text"].apply(lambda x: model.encode(x, convert_to_tensor=True))

In [21]:
concept = "Paradise"  # for example
concept_emb = model.encode(concept.lower(), convert_to_tensor=True)

df["cos_similarity"] = df["english_embeddings"].apply(lambda x: util.pytorch_cos_sim(x, concept_emb).item())
df_sorted = df.sort_values(by="cos_similarity", ascending=False)
print(df_sorted[["text", "cos_similarity"]].head(10))

                                                   text  cos_similarity
6022                              and enter my paradise        0.638474
4394        enter paradise you and your kinds delighted        0.631074
5812                  and when paradise is brought near        0.614703
2878  the companions of paradise that day are in a b...        0.591378
4559  is the description of paradise which the right...        0.576164
5610  and when you look there in paradise you will s...        0.570232
4660  and paradise will be brought near to the right...        0.556255
4396  and that is paradise which you are made to inh...        0.549605
3021  and paradise will be brought near that day to ...        0.535889
615   paradise is not obtained by your wishful think...        0.533608
