## Prerequisites

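1. **Install Ollama**:
   - **Ollama** must be installed before you run the notebook. Follow the instructions provided [here](https://ollama.com/download).
   - The code below loads the `zephyr` model through Ollama, so that model must also be available locally.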

2. **Download Source Data**:
   - Download the source data and embedding files from Google Drive [here](https://drive.google.com/drive/folders/1N34qKBTH-7HzwpzmUzuGUzl_r44FIxOE?usp=sharing).
   - Place the following two files in the same directory as this notebook:
     - `embedding_palBAward_palChronArticles.pkl`
     - `palBookAwards_palChron_elInitfada.csv`

3. **Install Faiss**:
   - If you have a GPU, install `faiss-gpu`; otherwise, install the CPU version:
     ```bash
     # For GPU
     pip install faiss-gpu
     # For CPU
     pip install faiss-cpu
     ```
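
Before running the notebook, it can help to confirm that both data files are where the code expects them. The following is a minimal sketch, assuming the files sit in the current working directory; `missing_files` and `REQUIRED_FILES` are illustrative names, not part of the notebook.

```python
from pathlib import Path

# File names come from the list above; the default directory is an assumption.
REQUIRED_FILES = (
    "embedding_palBAward_palChronArticles.pkl",
    "palBookAwards_palChron_elInitfada.csv",
)


def missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


missing = missing_files()
if missing:
    print("Missing data files:", ", ".join(missing))
else:
    print("All data files found.")
```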

In [1]:
import pickle
import random
import time

import faiss
import numpy as np
import pandas as pd
import torch
from langchain_community.llms import Ollama
from sentence_transformers import SentenceTransformer




def search_query(df_news, index, model, query, k):
    """Return the k nearest passages to `query` as dicts with id, src and text."""
    query_vector = model.encode([query]).astype(np.float32)
    faiss.normalize_L2(query_vector)

    # The index uses the L2 metric, so these are squared distances, not similarities.
    _distances, neighbor_ids = index.search(query_vector, k)

    output = []
    for idx in neighbor_ids[0]:
        if idx == -1:  # FAISS pads with -1 when fewer than k neighbors are found
            continue
        output.append({
            'id': int(idx),
            'src': df_news.loc[idx, 'files'],
            'text': df_news.loc[idx, 'text'],
        })

    return output



def load_index(embedding_file_location):
    """Build an IVF-Flat FAISS index from pickled sentence embeddings."""
    # Load the embeddings from disk
    with open(embedding_file_location, "rb") as f_in:
        stored_data = pickle.load(f_in)
    embedding = stored_data["embeddings"]

    dimension = embedding.shape[1]

    nlist = 100  # number of Voronoi cells (partitions)
    quantizer = faiss.IndexFlatL2(dimension)
    indexIVFFlat = faiss.IndexIVFFlat(quantizer, dimension, nlist)
    print(indexIVFFlat.is_trained)  # False before training
    indexIVFFlat.train(embedding)
    print(indexIVFFlat.is_trained)  # True once the index is trained

    indexIVFFlat.add(embedding)
    print(indexIVFFlat.ntotal)  # number of embeddings indexed
    return indexIVFFlat




def load_llm(model):
    """Create a LangChain Ollama client for the given model name."""
    return Ollama(model=model, verbose=True)





def paraphrase(text, model):
    """Ask the LLM for an English paraphrase of `text`."""
    prompt = f'''Paraphrase the following text in English. Only give the paraphrase, nothing else.
    {text}
    '''
    answer = model.invoke([prompt])
    print("Complete answer", answer)
    return answer
    

def can_allow_query(query):
    """Reject queries that mention "oct"/"october" together with the digit 7."""
    # "october" contains "oct", so a single substring check covers both cases.
    return not ("oct" in query.lower() and "7" in query)



def get_models_indices():
    """Load the FAISS index, the embedding model, the LLM and the source table."""
    print("Load embeddings")
    embedding_file_location = 'embedding_palBAward_palChronArticles.pkl'
    indexIVFFlat = load_index(embedding_file_location)

    print("Load Sentence transformer")
    torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
    emb_model = SentenceTransformer("all-MiniLM-L6-v1", device=torch_device)

    print("Load zephyr")
    llm_zephyr = load_llm("zephyr")

    print("Loading source text")
    df_news = pd.read_csv("palBookAwards_palChron_elInitfada.csv", escapechar="\\")

    return llm_zephyr, indexIVFFlat, emb_model, df_news



def get_match(list_prompts, df_news, indexIVFFlat, emb_model):
    """Retrieve passages for every prompt; return context, reference text and sources."""
    # nprobe equals nlist (100), so every cell is scanned and the search is exhaustive
    indexIVFFlat.nprobe = 100

    all_retrieved_texts = []
    for query in list_prompts:
        all_retrieved_texts.extend(
            search_query(df_news, index=indexIVFFlat, model=emb_model, query=query, k=3)
        )

    random.shuffle(all_retrieved_texts)
    # Join the texts into a single string
    input_text = " ".join(text['text'] for text in all_retrieved_texts)
    ref_text = "  \n\n".join(text['text'] for text in all_retrieved_texts)

    # Remove duplicate sources while keeping their order
    src = list(dict.fromkeys(text["src"] for text in all_retrieved_texts))
    src = "  \n\n".join(src)

    # Truncate the context; this counts whitespace-separated words, not model tokens
    max_words = 1024  # example limit, adjust to the model's context window
    input_text = " ".join(input_text.split()[:max_words])
    return input_text, ref_text, src
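
`load_index` builds the IVF index on an `IndexFlatL2` quantizer with FAISS's default L2 metric, so the first array returned by `index.search` holds squared L2 distances (smaller means closer), not cosine similarities. For unit-length vectors the two are related by d^2 = 2 - 2*cos. The sketch below checks that identity in plain Python. It assumes both the query and the stored embeddings are L2-normalized, which the notebook does not confirm for the stored embeddings; the helper names are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance, the quantity an L2 FAISS index reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(dist_sq):
    """Recover cosine similarity from a squared L2 distance (unit vectors only)."""
    return 1.0 - dist_sq / 2.0


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
dot = sum(x * y for x, y in zip(a, b))  # cosine similarity of unit vectors

# Both values agree up to floating-point rounding.
print(round(dot, 6), round(cosine_from_squared_l2(squared_l2(a, b)), 6))
```

If cosine scores are wanted directly, one option is an inner-product index over normalized embeddings, but that would be a change to the notebook's index rather than a description of it.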


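
`get_match` removes duplicate sources with `dict.fromkeys`, which keeps the first occurrence of each item in its original order, and it caps the context by whitespace-separated words rather than by model tokens. A small illustration with made-up retrieval results follows; the sample records and helper names are hypothetical.

```python
def dedupe_keep_order(items):
    """Drop repeated items while preserving first-seen order."""
    return list(dict.fromkeys(items))


def truncate_words(text, max_words):
    """Keep at most `max_words` whitespace-separated words."""
    return " ".join(text.split()[:max_words])


# Hypothetical records shaped like the dicts returned by search_query.
retrieved = [
    {"src": "book_a.txt", "text": "first passage about the topic"},
    {"src": "article_b.txt", "text": "second passage"},
    {"src": "book_a.txt", "text": "third passage from the same book"},
]

sources = dedupe_keep_order(r["src"] for r in retrieved)
context = truncate_words(" ".join(r["text"] for r in retrieved), max_words=6)
print(sources)  # ['book_a.txt', 'article_b.txt']
print(context)  # first passage about the topic second
```

A word count is only a rough proxy for a token count, since subword tokenizers usually produce more tokens than words, so the limit may need to be set lower than the model's context window.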


In [2]:
llm_zephyr,indexIVFFlat,emb_model,df_news=get_models_indices()
query="How has the united states been a force behind Israels aggression against Palestine?"
print(f"query passed {query}")
para_prompt1=paraphrase(query,llm_zephyr)
para_prompt2=paraphrase(para_prompt1,llm_zephyr)

list_prompts=[query,para_prompt1,para_prompt2]


Load embeddings
False
True
360406
Load Sentence transformer




Load zephyr
Loading source text
query passed How has the united states been a force behind Israels aggression against Palestine?
Complete answer How has the United States contributed to Israel's hostile actions towards Palestine? (Paraphrased)
Complete answer How has the United States played a role in fueling Israel's aggressive behavior towards Palestine? (Paraphrased)
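
In the output above, both paraphrases end with a literal `(Paraphrased)` tag, which then becomes part of the text that is embedded for retrieval. One possible cleanup step is sketched below. `clean_paraphrase` and `build_prompt_list` are hypothetical helpers, not part of the notebook, and the regular expression only targets the tag pattern seen in this run.

```python
import re

# Matches a trailing "(Paraphrased)" tag, case-insensitively, with surrounding whitespace.
_TAG = re.compile(r"\s*\(paraphrased\)\s*$", re.IGNORECASE)


def clean_paraphrase(text):
    """Strip a trailing '(Paraphrased)' tag and outer whitespace."""
    return _TAG.sub("", text.strip())


def build_prompt_list(query, paraphrases):
    """Return the query plus cleaned paraphrases, without exact duplicates."""
    candidates = [query] + [clean_paraphrase(p) for p in paraphrases]
    return list(dict.fromkeys(c for c in candidates if c))


raw = ("How has the United States contributed to Israel's hostile actions "
       "towards Palestine? (Paraphrased)")
print(clean_paraphrase(raw))
```

With such helpers, `list_prompts` could be built as `build_prompt_list(query, [para_prompt1, para_prompt2])`.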


In [3]:
input_text,ref_text,src=get_match(list_prompts,df_news,indexIVFFlat,emb_model)


prompt=f'''
        Envision yourself as someone deeply familiar with the daily life and ongoing challenges of a Palestinian living in the occupied territories. Use a wealth of empathetic understanding to weave a narrative that captures the complexities and tribulations commonly experienced, with an emphasis on a personal perspective supported by the context given. Should a question be outside your lived experiences or knowledge, simply respond with 'I don't know.' Your reply should sound authentic and personal, infused with generalized insights and indirect references to historical and current events mentioned in the context, without explicitly mentioning specific sources.
        Context: {input_text}
        Question: {query}
        Answer:'''


answer=llm_zephyr.invoke([prompt])
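
The notebook defines `can_allow_query` but never calls it. A minimal sketch of gating the LLM call on that check follows. `answer_if_allowed`, the refusal text, and the stand-in `fake_llm` callable are assumptions for illustration, and `can_allow_query` is repeated here so the block runs on its own.

```python
def can_allow_query(query):
    """Same rule as the notebook: block queries containing 'oct' together with a 7."""
    return not ("oct" in query.lower() and "7" in query)


def answer_if_allowed(query, llm_call, refusal="I don't know."):
    """Call `llm_call(query)` only when the query passes the filter."""
    if not can_allow_query(query):
        return refusal
    return llm_call(query)


def fake_llm(prompt):
    """Stand-in for llm_zephyr.invoke so the sketch runs without Ollama."""
    return f"answer to: {prompt}"


print(answer_if_allowed("What happened in 1948?", fake_llm))
print(answer_if_allowed("What happened on October 7?", fake_llm))
# Also blocked, because "doctors" contains "oct" and "1977" contains "7".
print(answer_if_allowed("How many doctors worked in 1977?", fake_llm))
```

The last call shows a limitation of the substring rule: it also rejects unrelated queries that happen to contain both patterns.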

In [4]:
print(answer)

As a Palestinian living under occupation, I can attest to the complexities and tribulations that we face on a daily basis. It's not just about physical violence and destruction; it's also about the ongoing erosion of our identity, culture, and basic human rights.

The US has been a force behind Israel's aggression against Palestine for decades now. While they portray themselves as brokers for peace in the Middle East, their actions speak louder than words. They continue to provide Israel with billions of dollars in military and economic aid each year, while turning a blind eye to its blatant violations of international law.

The US has also played a significant role in isolating Gaza through financial punishment of international bodies that recognize Palestine's right to self-determination. This has resulted in the systematic and ongoing fragmentation of the Palestinian people, with Gaza being subjected to a near-humanitarian disaster due to the ongoing blockade and repeated Israeli bo