## Setup libraries

In [None]:
# Install libraries
!pip install --upgrade pip
!pip install nuclia langchain pypdf spacy pandas
!python -m spacy download en_core_web_sm

In [None]:
!pip install transformers bitsandbytes accelerate scipy arxiv



## Download and parse data

In [None]:
# Download data
import arxiv

bert_id = "1810.04805"

paper = next(arxiv.Client().results(arxiv.Search(id_list=[bert_id])))
# Download the PDF to the current working directory as bert.pdf.
paper.download_pdf(filename="bert.pdf")

'./bert.pdf'

In [None]:
# Parse data from PDFs
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import SpacyTextSplitter


file_name = "bert.pdf"
loader = PyPDFLoader(file_name)

# explore the content extracted
content_pages = [page.page_content for page in loader.load_and_split()]
content_pages[:2]

['BERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout }@google.com\nAbstract\nWe introduce a new language representa-\ntion model called BERT , which stands for\nBidirectional Encoder Representations from\nTransformers. Unlike recent language repre-\nsentation models (Peters et al., 2018a; Rad-\nford et al., 2018), BERT is designed to pre-\ntrain deep bidirectional representations from\nunlabeled text by jointly conditioning on both\nleft and right context in all layers. As a re-\nsult, the pre-trained BERT model can be ﬁne-\ntuned with just one additional output layer\nto create state-of-the-art models for a wide\nrange of tasks, such as question answering and\nlanguage inference, without substantial task-\nspeciﬁc architecture modiﬁcations.\nBERT is conceptually simple and empirically\npowerful. It obtains new state-of-the-art re-\ns

## Process data from PDF

In [None]:
from langchain.text_splitter import SpacyTextSplitter
text_splitter = SpacyTextSplitter(chunk_size=1000)

# Split each page into chunks of at most 1000 characters
content_chunks = []
for page in content_pages:
    texts = text_splitter.split_text(page)
    content_chunks.extend(texts)

content_chunks[:2]



['BERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout }@google.com\nAbstract\nWe introduce a new language representa-\ntion model called BERT , which stands for\nBidirectional Encoder Representations from\nTransformers.\n\nUnlike recent language repre-\nsentation models (Peters et al., 2018a; Rad-\nford et al., 2018), BERT is designed to pre-\ntrain deep bidirectional representations from\nunlabeled text by jointly conditioning on both\nleft and right context in all layers.\n\nAs a re-\nsult, the pre-trained BERT model can be ﬁne-\ntuned with just one additional output layer\nto create state-of-the-art models for a wide\nrange of tasks, such as question answering and\nlanguage inference, without substantial task-\nspeciﬁc architecture modiﬁcations.\n\n\nBERT is conceptually simple and empirically\npowerful.',
 'BERT is conceptually sim
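
To make the chunking step more concrete, here is a minimal standard-library sketch of greedy sentence packing, the general idea behind a sentence-aware splitter with a `chunk_size` limit. It is not the LangChain implementation: `pack_sentences` is a hypothetical helper, the regex sentence split is far cruder than spaCy's, and it ignores the overlap between consecutive chunks that is visible in the output above.

```python
import re


def pack_sentences(text: str, chunk_size: int = 1000) -> list[str]:
    """Greedily pack sentences into chunks of at most chunk_size characters."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would exceed the limit: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "BERT is simple. It is powerful. It obtains strong results."
print(pack_sentences(sample, chunk_size=35))
# ['BERT is simple. It is powerful.', 'It obtains strong results.']
```

In this sketch a single sentence longer than `chunk_size` is kept whole, so the limit is a target rather than a hard cap.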

## Use NucliaDB vector database to upload data and retrieve contexts

Follow the LangChain documentation to use NucliaDB:

- https://python.langchain.com/docs/integrations/vectorstores/nucliadb
- https://nuclia.com/developers/using-langchain-with-nuclia/

Alternatively, you can directly use NucliaDB:
- https://nuclia.com/nucliadb/nucliadb-vector-database-data-scientists/

In [None]:
from langchain.vectorstores.nucliadb import NucliaDB

API_KEY = "YOUR_API_KEY"
KB_ID = "YOUR_KB_ID"
ndb = NucliaDB(knowledge_box=KB_ID, local=False, api_key=API_KEY)

Invalid service token
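
The `Invalid service token` message above appears to come from running the cell with the placeholder credentials. One way to keep real credentials out of the notebook is to read them from environment variables and fail early when they are missing. The sketch below assumes two variable names, `NUCLIA_API_KEY` and `NUCLIA_KB_ID`, which are hypothetical and not defined by Nuclia or LangChain; `read_nuclia_settings` is likewise a helper made up for illustration.

```python
import os


def read_nuclia_settings(env=os.environ):
    """Return (api_key, kb_id) from environment variables, failing early if unset."""
    # NUCLIA_API_KEY and NUCLIA_KB_ID are hypothetical names chosen for this sketch.
    names = ("NUCLIA_API_KEY", "NUCLIA_KB_ID")
    values = {name: env.get(name, "") for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values["NUCLIA_API_KEY"], values["NUCLIA_KB_ID"]


# Demo with an explicit mapping instead of the real environment.
demo_env = {"NUCLIA_API_KEY": "demo-key", "NUCLIA_KB_ID": "demo-kb"}
api_key, kb_id = read_nuclia_settings(demo_env)
print(kb_id)
```

The returned values can then be passed as `api_key` and `knowledge_box` in the `NucliaDB(...)` call above.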


In [None]:
# Upload data
# Note: if the BERT paper is already in your Knowledge Box, skip the line below and only run it to upload new data
content_chunks_ids = ndb.add_texts(content_chunks)

In [None]:
# Retrieve contexts for a question
question = "What does BERT mean?"

def retrieve(question, topk=5):
    retrieved_contexts = [result.page_content for result in ndb.similarity_search(query=question, k=topk)]
    # light cleanup: drop "- " remnants, flatten newlines, collapse double spaces
    retrieved_contexts = [context.replace("- ", "").replace("\n", " ").replace("  ", " ").strip() for context in retrieved_contexts]
    # join the contexts into a single string, separated by a blank line and a bullet
    return "\n\n- ".join(retrieved_contexts)

retrieve(question)

'We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu , following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood. Training of BERT BASE was performed on 4 Cloud TPUs in Pod conﬁguration (16 TPU chips total).13Training of BERT LARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete. Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. To speed up pretraing in our experiments, we pre-train the model with sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings.\n\n- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kr
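
A note on the cleanup step: if a retrieved passage still carries the line-break hyphenation seen in the raw PDF text earlier (for example `representa-\ntion`), the chained `replace` calls leave a stray `- ` behind, because `"- "` is removed before the newline has been turned into a space. The sketch below shows that behaviour and a regex-based alternative. `clean_context` is a hypothetical helper, not part of the notebook.

```python
import re


def clean_context(text: str) -> str:
    """Join words hyphenated across line breaks, then collapse all whitespace."""
    text = re.sub(r"-\n", "", text)   # "representa-\ntion" -> "representation"
    text = re.sub(r"\s+", " ", text)  # newlines and repeated spaces -> single space
    return text.strip()


raw = "a new language representa-\ntion model called\nBERT"

# Same chain of replacements as in retrieve() above.
chained = raw.replace("- ", "").replace("\n", " ").replace("  ", " ").strip()

print(chained)             # a new language representa- tion model called BERT
print(clean_context(raw))  # a new language representation model called BERT
```

The regex version has a trade-off of its own: a genuine compound that happens to break at a line end, such as `pre-\ntrain`, is joined into `pretrain`.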

## Use the Transformers library to load the Mistral-Instruct model

In [None]:
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch


nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

MISTRAL_MODEL = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(MISTRAL_MODEL, quantization_config=nf4_config)
tokenizer = AutoTokenizer.from_pretrained(MISTRAL_MODEL)

generation_args = {
    "max_new_tokens": 1024,
    "do_sample": True,
    "top_k": 10,
    "top_p": 0.9,
    "eos_token_id": tokenizer.eos_token_id,
    "batch_size": 1,
}

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_args
)



## Create a RAG QA system

In [None]:
def get_prompt(
    system_prompt: str,
    message: str,
) -> str:
    # Wrap the system prompt in <<SYS>> tags and the user message in [INST] tags
    # (single-turn prompt, no chat history)
    texts = [f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"]
    texts.append(f"{message} [/INST]")
    return "".join(texts)

get_prompt("NLP assistant", "What's your name?")

"<s>[INST] <<SYS>>\nNLP assistant\n<</SYS>>\n\nWhat's your name? [/INST]"

In [None]:
def rag_xlqa(system_prompt, test_question, top_k):

    # retrieve contexts with semantic search
    context = retrieve(test_question, topk=top_k)
    if not context:
        context = "There is no context available for the given question."

    # build the user message from a QA template that takes the retrieved context as input
    user_msg = f"Input\n{context}\nQ: {test_question}"

    # wrap the system prompt and user message in the model-specific prompt format
    prompt = get_prompt(system_prompt, user_msg)
    response = pipeline(prompt)

    # the generated text includes the prompt, so keep only what follows the last [/INST]
    answer = response[0]["generated_text"].split("[/INST]")[-1].strip()

    return response, answer, context
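
The answer extraction in `rag_xlqa` relies on the generated text echoing the prompt before the completion, so everything after the last `[/INST]` marker is treated as the answer. Here is a small standalone illustration of that step, using a made-up generated string rather than real model output; `extract_answer` is a hypothetical helper name.

```python
def extract_answer(generated_text: str) -> str:
    """Return the text that follows the last [/INST] marker."""
    return generated_text.split("[/INST]")[-1].strip()


# Made-up example string: the prompt is echoed, followed by the completion.
fake_generated = (
    "<s>[INST] <<SYS>>\nNLP assistant\n<</SYS>>\n\n"
    "Input\nsome context\nQ: some question [/INST] Not enough information to answer."
)
print(extract_answer(fake_generated))  # Not enough information to answer.
```

Because the split keeps only the part after the last marker, a completion that itself contains `[/INST]` would be cut short.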



In [None]:
SYSTEM_PROMPT_XLQA = "Use the following pieces of context to answer the question at the end.\n"\
                     "If there is not enough information in the context don't try to make up an answer, just say 'Not enough information to answer'.\n"\
                     "Just provide one response and do not continue the response.\n"\
                     "Answer in the language of the question."
TEST_QUESTION = "Who is Einstein?"
response, answer, context = rag_xlqa(system_prompt=SYSTEM_PROMPT_XLQA, test_question=TEST_QUESTION, top_k=10)
answer, context

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


('Not enough information to answer.',
 'There is no context available for the given question.')

## Testing cross-lingual QA

In [None]:
questions_bert_paper = {
    'english': [
        "What specific pre-training techniques were employed in the BERT model?",
        "How many epochs were conducted during the pre-training phase of BERT?",
        "Could you elaborate on the tokenization strategy adopted for the BERT model?",
        "What is the size of the vocabulary?",
        "What evaluation metrics were employed to assess the performance of BERT?",
        "Can you specify the size of the training dataset used for pre-training BERT?",
        "What were the key findings or improvements observed in the experiments conducted with the BERT model?"
    ],
    'spanish': [
        "¿Qué técnicas específicas de preentrenamiento se utilizaron en el modelo BERT?",
        "¿Cuántas épocas se llevaron a cabo durante la fase de preentrenamiento de BERT?",
        "¿Podría detallar la estrategia de tokenización adoptada para el modelo BERT?",
        "¿Cuál es el tamaño del vocabulario?",
        "¿Qué métricas de evaluación se emplearon para evaluar el rendimiento de BERT?",
        "¿Puede especificar el tamaño del conjunto de datos de entrenamiento utilizado para el preentrenamiento de BERT?",
        "¿Cuáles fueron los hallazgos clave o mejoras observadas en los experimentos realizados con el modelo BERT?"
    ],
    'catalan': [
        "Quines tècniques específiques de preentrenament es van emprar en el model BERT?",
        "Quantes èpoques es van dur a terme durant la fase de preentrenament de BERT?",
        "Podria donar detalls sobre la estratègia de tokenització adoptada per al model BERT?",
        "Quin és el tamany del vocabulari?",
        "Quines mètriques d'avaluació es van utilitzar per avaluar el rendiment de BERT?",
        "Podria especificar el tamany del conjunt de dades d'entrenament utilitzat per al preentrenament de BERT?",
        "Quins van ser els resultats clau o millores observades en els experiments realitzats amb el model BERT?"
    ],
    "italian": [
        "Quali specifiche tecniche di pre-training sono state impiegate nel modello BERT?",
        "Quante epoche sono state condotte durante la fase di pre-training di BERT?",
        "Potresti approfondire sulla strategia di tokenizzazione adottata per il modello BERT?",
        "Qual è la dimensione del vocabolario?",
        "Quali metriche di valutazione sono state impiegate per valutare le performance di BERT?",
        "Puoi specificare la dimensione del set di dati di addestramento utilizzato per il pre-training di BERT?",
        "Quali sono stati i principali risultati o miglioramenti osservati negli esperimenti condotti con il modello BERT?"
    ],
    "japanese": [
        "BERTモデルで具体的にどのような事前トレーニング技術が使用されましたか？",
        "BERTの事前トレーニングフェーズで何エポックが行われましたか？",
        "BERTモデルで採用されたトークナイゼーション戦略について詳しく説明できますか？",
        "ボキャブラリーのサイズはどれくらいですか？",
        "BERTのパフォーマンスを評価するためにどの評価メトリクスが使用されましたか？",
        "BERTの事前トレーニングに使用されたトレーニングデータセットのサイズを指定できますか？",
        "BERTモデルで行われた実験で観察された主な結果や改善点は何でしたか？"
    ],
    'telugu': [
        "BERT మోడల్‌లో ఏదైనా క్రియాశీల ప్రీట్రైనింగ్ విధులు ఉపయోగించబడిందివా?",
        "BERT ప్రీట్రైనింగ్ దశలో ఎంత ఎపాక్స్ నడిపబడింది?",
        "BERT మోడల్‌కు అంగీకృతమయ్యే టోకెనైజేషన్ స్ట్రాటజీపై మీరు వివరాలు ఇవ్వగలరా?",
        "వాకార్గ్రాం పరిమాణమేంటి?",
        "BERT యొక్క ప్రదర్శనను అంగీకరించడానికి ఏమిటి అనుసరించిన అంశాలు?",
        "BERT ప్రీట్రైనింగ్ కోసం ఉపయోగించిన ప్రశిక్షణ డేటాసెట్ పరిమాణాన్ని నిర్దిష్టంగా చెప్పగలరా?",
        "BERT మోడల్‌తో నడుపబడిన ప్రయోగాలలో ఏమిటి ప్రధాన ఫైండింగ్‌లు లేదా మెరుగుపరచిన మెరుగులు?"
    ]
}

In [None]:
SYSTEM_PROMPT_XLQA = "You are an NLP assistant whose purpose is to solve reading comprehension problems.\n"\
                     "You will be provided questions on a set of passages and you will need to provide the answer as it appears in the passage.\n"\
                     "If there is not enough information in the context don't try to make up an answer, just say 'Not enough information to answer'.\n"\
                     "Just provide one response and do not continue the response.\n"\
                     "Let's think step-by-step and answer in the language of the question."

# Ask every question in every language and collect the answers with their contexts
results = []
for lang, questions in questions_bert_paper.items():
    for qid, test_question in enumerate(questions):
        response, answer, context = rag_xlqa(system_prompt=SYSTEM_PROMPT_XLQA, test_question=test_question, top_k=20)
        result = {"qid": qid, "lang": lang, "question": test_question, "answer": answer, "context": context}
        results.append(result)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
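
Once `results` is filled, a quick first check is how often each language falls back to the refusal sentence from the system prompt. The sketch below uses only the standard library and runs on hypothetical placeholder rows shaped like the `result` dictionaries above; it reports no actual results from this notebook, and `fallback_rate_by_language` is a helper invented for illustration.

```python
from collections import Counter

FALLBACK = "not enough information to answer"


def fallback_rate_by_language(results):
    """Return, per language, the fraction of answers containing the fallback sentence."""
    totals, fallbacks = Counter(), Counter()
    for row in results:
        totals[row["lang"]] += 1
        if FALLBACK in row["answer"].lower():
            fallbacks[row["lang"]] += 1
    return {lang: fallbacks[lang] / totals[lang] for lang in totals}


# Hypothetical rows, shaped like the entries appended to `results`.
example_rows = [
    {"qid": 0, "lang": "english", "question": "q0", "answer": "placeholder answer", "context": "c"},
    {"qid": 1, "lang": "english", "question": "q1", "answer": "Not enough information to answer.", "context": "c"},
    {"qid": 0, "lang": "spanish", "question": "q0", "answer": "Not enough information to answer.", "context": "c"},
]
print(fallback_rate_by_language(example_rows))
# {'english': 0.5, 'spanish': 1.0}
```

Since the model is asked to answer in the language of the question, it may translate the fallback sentence, in which case this English-only match would undercount refusals.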