<a href="https://colab.research.google.com/github/amrtaher1234/iqraeli-backend/blob/main/data-gathering/Iqraeli.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Iqraeli Embedding Generator

This notebook generates embeddings for the Quran, one per ayah, from each ayah's text, name, and tafseer. It uses:

- OpenAI to generate the ayah embeddings and the embedding for each search query
- scikit-learn to compare embeddings and calculate similarity
- A [quran json](https://drive.google.com/file/d/1yGOGqdnNqXm8ajxsEFU3KG_ckRXlMFsC/view?usp=sharing) file that was cleaned and pre-processed to hold the Quran data as well as the tafseer for each ayah. It follows this schema:


```ts
interface QuranVerse {
  juz: number;
  juz_name_arabic: string;
  juz_name_english: string;
  surah_number: number;
  surah_name_arabic: string;
  surah_name_english: string;
  revelation_location: string;
  aya_number: number;
  english_translation: string;
  arabic_diacritics: string;
  arabic_clean: string;
  arabic_words_count: number;
  arabic_letters_count: number;
  tafseer: string;
  merged_tafseer_text: string;
}
```
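Since the rest of the notebook is Python, the same schema can be mirrored as a `TypedDict` and used to sanity-check records after loading. This is a minimal sketch: the Python `QuranVerse` type and the `missing_fields` helper are hypothetical names introduced here, not part of the notebook.

```python
from typing import List, TypedDict


class QuranVerse(TypedDict):
    """Python mirror of the QuranVerse schema above (illustrative only)."""
    juz: int
    juz_name_arabic: str
    juz_name_english: str
    surah_number: int
    surah_name_arabic: str
    surah_name_english: str
    revelation_location: str
    aya_number: int
    english_translation: str
    arabic_diacritics: str
    arabic_clean: str
    arabic_words_count: int
    arabic_letters_count: int
    tafseer: str
    merged_tafseer_text: str


def missing_fields(record: dict) -> List[str]:
    """Return the schema keys that are absent from a loaded record."""
    return [key for key in QuranVerse.__annotations__ if key not in record]
```

Running `missing_fields` over the loaded JSON before calling the embedding API would catch a malformed record early, for example one without `merged_tafseer_text`.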

## Get Started

To get started, you need the `quran-tafseer.json` file available either locally or hosted somewhere the notebook can reach (this code reads it from Google Drive, where it is named `quran-with-tafseer.json`). The generated embeddings file is around 320MB, so it does not need much storage.
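As a rough sanity check on that size, the embeddings dominate the file: `text-embedding-ada-002` returns 1536 floats per input, and `json.dump(..., indent=4)` writes each float on its own indented line. The sketch below multiplies these out. The helper name `estimate_embeddings_mb` and the default of 35 bytes per value (indentation, digits, comma, newline) are assumptions for illustration, not measurements.

```python
def estimate_embeddings_mb(num_records: int, dims: int = 1536, bytes_per_value: int = 35) -> float:
    """Estimate the size in MB of the embedding arrays alone.

    bytes_per_value is an assumed average for one pretty-printed float,
    including its indentation, trailing comma, and newline.
    """
    return num_records * dims * bytes_per_value / 1_000_000
```

With roughly 6,000 records, this gives a figure in the low hundreds of megabytes, the same order as the size quoted above.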






In [1]:
!pip install openai

Collecting openai
  Downloading openai-1.6.1-py3-none-any.whl (225 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.26.0-py3-none-any.whl (75 kB)
Collecting typing-extensions<5,>=4.7 (from openai)
  Downloading typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)

In [7]:
# Setup and imports

from openai import OpenAI
from google.colab import drive
from sklearn.metrics.pairwise import cosine_similarity
import json
from google.colab import userdata

client = OpenAI(api_key=userdata.get('OPEN_AI_KEY'))
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Utilities
def load_json_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return json.load(file)

def write_json_data(data, pathname):
    with open(pathname, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)


In [11]:
def create_embeddings(objects, pathname):
    """Embed each object's `merged_tafseer_text` and save the objects to `pathname`."""
    texts = [obj['merged_tafseer_text'] for obj in objects]

    # One request for the whole slice.
    response = client.embeddings.create(model='text-embedding-ada-002', input=texts)

    # Order by each item's `index` so the embeddings line up with the inputs.
    embeddings = [item.embedding for item in sorted(response.data, key=lambda item: item.index)]

    # Attach each embedding to its source object.
    for obj, embedding in zip(objects, embeddings):
        obj['embedding'] = embedding

    with open(pathname, 'w', encoding='utf-8') as file:
        json.dump(objects, file, ensure_ascii=False, indent=4)
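`create_embeddings` sends its whole slice of texts in a single request. If a slice turns out to be too large for one call, the texts can be embedded in smaller batches and the results concatenated. The sketch below shows the idea: `embed_in_batches` and its `embed_batch` parameter are hypothetical names, and the batch size of 100 is an arbitrary assumption rather than an API limit.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed texts batch by batch, preserving the input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_batch(batch)
        # Guard against a backend that drops or duplicates items.
        if len(vectors) != len(batch):
            raise RuntimeError("embedding count does not match batch size")
        embeddings.extend(vectors)
    return embeddings
```

In this notebook, `embed_batch` could wrap the existing call, for example `lambda batch: [item.embedding for item in client.embeddings.create(model='text-embedding-ada-002', input=batch).data]`. Passing the backend in as a function also lets the batching logic be tested without calling the API.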



In [9]:
quran_with_tafseer = load_json_data('./drive/MyDrive/iqraeli/quran-with-tafseer.json')

In [None]:
create_embeddings(objects=quran_with_tafseer[0:2000], pathname='./drive/MyDrive/embeddings/tafseer-with-quran-embeddings.json')
create_embeddings(objects=quran_with_tafseer[2000:4000], pathname='./drive/MyDrive/embeddings/tafseer-with-quran-embeddings2.json')
create_embeddings(objects=quran_with_tafseer[4000:6000], pathname='./drive/MyDrive/embeddings/tafseer-with-quran-embeddings3.json')
create_embeddings(objects=quran_with_tafseer[6000:], pathname='./drive/MyDrive/embeddings/tafseer-with-quran-embeddings4.json')
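The calls above split the data by hand into slices of 2000 records. The same boundaries can be computed from the list length, which avoids editing the slice indices if the data changes. This is a minimal sketch; `shard_ranges` is a hypothetical helper, not something the notebook defines.

```python
from typing import Iterator, Tuple


def shard_ranges(total: int, shard_size: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (shard_number, start, end) tuples that cover range(total) in order.

    shard_number starts at 1 and the last shard may be shorter than shard_size.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    for number, start in enumerate(range(0, total, shard_size), start=1):
        yield number, start, min(start + shard_size, total)
```

Each tuple can then drive one call, for example `create_embeddings(objects=quran_with_tafseer[start:end], pathname=...)`, with the shard number used to build the output file name.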

In [None]:
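# Combines the per-slice embedding files into a single `quran-embeddings.json`.
# Only needed when regenerating the data; otherwise use `quran-embeddings.json` directly.
# Note: these paths point at a local `./embeddings` directory and use `-2`, `-3`, `-4` suffixes,
# so copy or rename the files written above to match before running this cell.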

file_paths = [
    './embeddings/tafseer-with-quran-embeddings.json',
    './embeddings/tafseer-with-quran-embeddings-2.json',
    './embeddings/tafseer-with-quran-embeddings-3.json',
    './embeddings/tafseer-with-quran-embeddings-4.json'
]

combined_data = []
for path in file_paths:
    data = load_json_data(path)
    combined_data.extend(data)

write_json_data(combined_data, './embeddings/quran-embeddings.json')

In [3]:

def find_closest_5(objects: list, query: str):
    """Return the 5 objects whose embeddings are most similar to the query."""
    query_embedding = client.embeddings.create(model='text-embedding-ada-002', input=query).data[0].embedding

    # Score every object against the query in a single call.
    similarities = cosine_similarity([query_embedding], [obj['embedding'] for obj in objects])[0]

    # Sort once by similarity, highest first, and keep the top 5.
    top_objects = sorted(zip(objects, similarities), key=lambda x: x[1], reverse=True)[:5]

    for _, similarity in top_objects:
        print(similarity)
    return [obj for obj, _ in top_objects]
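`find_closest_5` ranks ayahs by cosine similarity, the dot product of two vectors divided by the product of their lengths. For readers who want to see the ranking without scikit-learn, here is a dependency-free sketch that keeps only the best `k` items via `heapq.nlargest`. The names `cosine` and `top_k_by_cosine` are hypothetical, and this is an illustration of the technique rather than the notebook's implementation.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    objects: List[Dict], query_embedding: Sequence[float], k: int = 5
) -> List[Tuple[Dict, float]]:
    """Return the k (object, similarity) pairs with the highest similarity, best first."""
    scored = [(obj, cosine(query_embedding, obj["embedding"])) for obj in objects]
    # The key avoids comparing the dicts themselves when scores tie.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

A pure-Python loop like this is much slower than the vectorized scikit-learn call on thousands of 1536-dimensional vectors, so it is better suited to small tests than to the full data set.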


In [8]:
embeddings_data = load_json_data('./drive/MyDrive/iqraeli/quran-embeddings.json')


In [11]:
top_5_data = find_closest_5(embeddings_data, 'وجعلنا من الماء كل شيءٍ حي')

for data in top_5_data:
    print(data['surah_name_arabic'])
    print(data['aya_number'])
    print('-----')

0.8469748036734752
0.8296564311865964
0.8277813558091472
0.8258790754696268
0.8244292955156522
يس
34
-----
ق
9
-----
المرسلات
21
-----
الحاقة
12
-----
النبأ
14
-----
