<a href="https://colab.research.google.com/github/Yusuf-YENICERI/Semantic-Search-For-All-Languages/blob/master/Semantic_Search_For_All_Languages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Package Installation

In [None]:

!pip install sentence-transformers
!pip install faiss-cpu

# Encode docs using sentence-transformer

Sentence Transformers provides open-source models for encoding text into embedding vectors. You can use one of these models here, or use OpenAI's embedding models in a later section. Keep in mind that the Sentence Transformers models are free to use, while the OpenAI models are paid.

In [None]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

In [None]:
data = [
    'Hayat nasıl gidiyor diyenlere aslında süper gidiyor Elhamdülillah',
    'Silahlar savaş için oldukça gereklidir. Çünkü hayat memat meselesi',
]
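# Encode the sample sentences; to index PDF text instead, run the
# "Find in PDF" section first and then call encoder.encode(allText)
encoded_data = encoder.encode(data)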

The snippet above encodes two Turkish sentences as an example. Add or edit entries in `data` to try different cases.

# Find in PDF

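If you want to search inside a PDF, run the cells in this section; otherwise, skip it. After extracting the text, encode `allText` with the encoder of your choice (for example `encoded_data = encoder.encode(allText)`) before building the index.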

In [None]:
from google.colab import files

# Uploaded files are saved to the current working directory (/content in Colab)
print("Choose a single file")
uploaded = files.upload()

In [None]:
!pip install PyPDF2

In [None]:
from PyPDF2 import PdfReader

# Set the path to your PDF
reader = PdfReader('/content/example.pdf')

# Print the number of pages in the PDF file
print(len(reader.pages))

allText = []
for page in reader.pages:
    # Split the page text on '.' to get one sentence per entry
    for text in (page.extract_text() or '').split('.'):
        text = text.strip()
        # Skip empty or whitespace-only fragments
        if text:
            allText.append(text)
print(allText)
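
Splitting on every '.' also cuts decimals such as 3.14 in two and drops the sentence-ending punctuation. As a minimal sketch (the helper name `split_sentences` and the regular expression are illustrative assumptions, not part of the original notebook), the text can instead be split after '.', '!' or '?' only when whitespace follows:

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence
    parts = re.split(r'(?<=[.!?])\s+', text)
    # Drop empty fragments and surrounding whitespace
    return [p.strip() for p in parts if p.strip()]

sample = "Pi is about 3.14. Is that exact? No!  It is rounded."
print(split_sentences(sample))
```

This is still a heuristic: abbreviations such as "Dr. Smith" would be split, so a dedicated sentence tokenizer may be a better fit for real documents.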

# Encode using openai

If you want to encode your data with OpenAI embedding models instead, use this section.

In [None]:

%%capture
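# The openai.Embedding and openai.Engine calls used below exist only in openai versions before 1.0
!pip install "openai<1"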

In [None]:
import openai
import os

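# Prefer the OPENAI_API_KEY environment variable over hardcoding the key
openai.api_key = os.environ.get("OPENAI_API_KEY", "your-openai-api-key")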

openai.Engine.list()  # check we have authenticated

In [None]:

import numpy as np

def encode_open(texts):
    # Change MODEL if you want a different embedding model
    MODEL = "text-embedding-ada-002"
    res = openai.Embedding.create(input=texts, engine=MODEL)
    embeds = [record['embedding'] for record in res['data']]
    # FAISS expects a float32 NumPy array of shape (n_texts, dim)
    return np.array(embeds, dtype=np.float32)

In [None]:
input=['Hayat nasıl gidiyor diyenlere aslında süper gidiyor Elhamdülillah','Silahlar savaş için oldukça gereklidir. Çünkü hayat memat meselesi']
encoded_data=encode_open(input)

To encode the text extracted from a PDF instead, run the two cells below.

In [None]:
allText

In [None]:
# Encode the sentences extracted from the PDF
encoded_data = encode_open(allText)
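
A PDF can yield thousands of sentences, and embedding APIs generally cap how much input a single request may carry. A minimal sketch of sending the texts in fixed-size batches follows. The helper names, the batch size, and the stand-in `fake_embed` function are assumptions for illustration; in the notebook you would pass `encode_open` instead and convert the collected rows to a single array.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on each batch and collect the vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in for a real embedding call: one 1-dimensional vector per text
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

sentences = ['a', 'bb', 'ccc', 'dddd', 'eeeee']
print(embed_in_batches(sentences, fake_embed, batch_size=2))
```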

# Add to index

We will now add the encoded texts to a FAISS index, inshaAllah. FAISS is a library for similarity search over vectors, so the index acts as the searchable database for the text embeddings.

In [None]:

d=encoded_data.shape[1]
d

In [None]:
import faiss
import numpy as np

# IndexFlatIP: Flat inner product (for small datasets)
# IndexIDMap: store document ids in the index as well
index = faiss.IndexIDMap(faiss.IndexFlatIP(d))
index.add_with_ids(encoded_data, np.arange(encoded_data.shape[0]))
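
To make concrete what a flat inner-product index computes, here is a plain-Python sketch with no FAISS dependency. The function name `inner_product_search` and the toy vectors are assumptions for illustration; a flat index performs the same exhaustive comparison against every stored vector, only far more efficiently.

```python
def inner_product_search(vectors, query, k):
    """Return (score, id) pairs for the k vectors with the largest dot product."""
    scored = [
        (sum(a * b for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    # Highest inner product first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(inner_product_search(toy_vectors, [1.0, 0.2], k=2))
```

Because every stored vector is scored against the query, the cost grows linearly with the number of documents, which is why the flat index suits small datasets.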

# Search for question

You can search your text with the functions below. Use `search` or `search_pdf` if you encoded with Sentence Transformers, and `search_openai` or `search_openai_pdf` if you encoded with OpenAI.

In [None]:
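def search(query, k=7):
    query_vector = encoder.encode([query])
    # index.search returns (scores, ids), each of shape (1, k)
    top_k = index.search(query_vector, k)
    print(top_k)
    # FAISS pads with id -1 when k exceeds the number of indexed texts
    return [
        data[_id] for _id in top_k[1][0] if _id != -1
    ]

search("hayat iyi gidiyor mu")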

In [None]:
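def search_pdf(query, k=7):
    query_vector = encoder.encode([query])
    # index.search returns (scores, ids), each of shape (1, k)
    top_k = index.search(query_vector, k)
    print(top_k)
    # FAISS pads with id -1 when k exceeds the number of indexed texts
    return [
        allText[_id] for _id in top_k[1][0] if _id != -1
    ]

search_pdf("hayat iyi gidiyor mu")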

In [None]:
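def search_openai(query, k=5):
    query_vector = encode_open([query])
    # index.search returns (scores, ids), each of shape (1, k)
    top_k = index.search(query_vector, k)
    print(top_k)
    # FAISS pads with id -1 when k exceeds the number of indexed texts
    return [
        input[_id] for _id in top_k[1][0] if _id != -1
    ]

search_openai("hayat iyi gidiyor mu")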

In [None]:
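def search_openai_pdf(query, k=5):
    query_vector = encode_open([query])
    # index.search returns (scores, ids), each of shape (1, k)
    top_k = index.search(query_vector, k)
    print(top_k)
    # FAISS pads with id -1 when k exceeds the number of indexed texts
    return [
        allText[_id] for _id in top_k[1][0] if _id != -1
    ]

search_openai_pdf("hayat iyi gidiyor mu")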

# Save encoded texts

If you want to save the index that holds the encoded texts, use the cell below.

In [None]:
path = './faiss.index'

# Save index
faiss.write_index(index, path)
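
The saved index holds only the vectors and their ids, not the original sentences, so the texts have to be stored separately if search results should map back to text after a restart. A minimal sketch using the standard library follows; the file name `texts.json`, the temporary directory, and the helper names are assumptions for illustration.

```python
import json
import os
import tempfile

def save_texts(texts, path):
    """Write the list of texts to a UTF-8 JSON file, preserving order."""
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps Turkish characters readable in the file
        json.dump(texts, f, ensure_ascii=False)

def load_texts(path):
    """Read the list of texts back; list positions match the index ids."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)

texts = ['Hayat nasıl gidiyor', 'Silahlar savaş için gereklidir']
# A temporary directory is used here; any writable path works
texts_path = os.path.join(tempfile.gettempdir(), 'texts.json')
save_texts(texts, texts_path)
print(load_texts(texts_path) == texts)
```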

# Load encoded texts

You can load the saved index back by providing its path.

In [None]:
index = faiss.read_index(path)
search("hayat iyi gidiyor mu")