<a href="https://colab.research.google.com/github/dapopov-st/ExperimentsWithLanguageModels/blob/main/RAG_over_ArXiv_PDFs_Part4_No_lit_reviews_and_Zephyr_7b_beta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I'm using [Chris Alexiuk's](https://www.linkedin.com/in/csalexiuk/) [notebook](https://colab.research.google.com/drive/172uMprWwUfEecXQWBrsgDAlkpT_EK39z?usp=sharing)
as a starting point and plan to experiment with some of the ideas from [A Practical Approach to Retrieval Augmented Generation Systems](https://angelinamagr.gumroad.com/l/practical-approach-to-RAG-systems) by Allahyari and Yang.

## Steps
- Make a directory of ArXiv papers on Google Drive
- See whether the vector store can be persisted to Drive or has to stay in RAM
- If I can't access an A100 consistently and run out of GPU memory, make a Mistral 7B/Zephyr 7B version (Zephyr can do ReAct!) -> For now, a T4 plus `clean_mem` is working
- RAG with metadata (e.g., subdirectory name). -> For now, just including the article title with retrieved documents.
- Use Self-RAG 7B? `selfrag/selfrag_llama2_7b` or 13b. I think this would help overcome some of the challenges with retrieving from PDFs (getting titles from the References section, for example, rather than useful content). UPDATE: fixed this by only including article content up to the References.
- Experiment with different vector stores. FAISS seems to work for this use case, but filtering by metadata is not straightforward. Consider using Pinecone or Elasticsearch, perhaps. -> Deciding to stick with FAISS and add metadata with the article title for now.

## Get the data and build a Retriever

- Original NB worked in under 10GB on V100

In [1]:
!pip install -U -q "langchain" "transformers==4.35.0" "datasets==2.12.0" "tokenizers==0.14.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" "arxiv==1.4"
!pip install -U -q cohere llama-index
!pip install PyPDF2
!pip install pypdf
!pip install -q qdrant-client
!pip install -q -U faiss-cpu tiktoken sentence-transformers

In [2]:
import transformers, datasets, tokenizers
transformers.__version__, datasets.__version__, tokenizers.__version__

('4.35.0', '2.12.0', '0.14.0')

In [3]:
import os
from google.colab import drive
drive.mount('/content/drive/')

output_dir = '/content/drive/MyDrive/PdfRag/rag_output_dir'
logging_dir = '/content/drive/MyDrive/PdfRag/rag_logging_dir'
index_dir = '/content/drive/MyDrive/PdfRag/rag_index_dir'

#!ls /content/drive/MyDrive/PdfRag/clusterofstars
%cd /content/drive/MyDrive/PdfRag
#My\ Drive/PdfRag && ls clusterofstars
!ls .

Mounted at /content/drive/
/content/drive/MyDrive/PdfRag
 barbie.csv									 instructionmining
 cache										 oppenheimer.csv
 clusterofstars									 rag_index_dir
'Efficient Parallelization Layouts for Large-Scale Distributed Model Training'	 testabstract


- The documents consist of a few dozen ArXiv papers about modern LLMs

In [11]:
from pathlib import Path
PDFS_PATH = Path('/content/drive/MyDrive/PdfRag/clusterofstars')
PDFS = list(PDFS_PATH.glob('*.pdf'))
PDFS[0], len(PDFS)

(PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/In-Context Retrieval-Augmented Language Models.pdf'),
 26)

In [12]:
PDFS

[PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/In-Context Retrieval-Augmented Language Models.pdf'),
 PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/Toolformer: Language Models Can Teach Themselves to Use Tools.pdf'),
 PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding.pdf'),
 PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking.pdf'),
 PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.pdf'),
 PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/Training language models to follow instructions with human feedback.pdf'),
 PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/LoRA: Low-Rank Adaptation of Large Language Models.pdf'),
 PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/QLORA: Efficient Finetuning of Quantize

In [None]:
# For now, flatten the file structure for simplicity.  In the future, see whether RAG can use the subdirectory names as metadata.
#!mv /content/drive/MyDrive/PdfRag/clusterofstars/ChainOfThought/* /content/drive/MyDrive/PdfRag/clusterofstars

In [13]:
# fastai function to clean GPU memory

import sys,gc,traceback
import torch
def clean_ipython_hist():
    # Code in this function mainly copied from IPython source
    if not 'get_ipython' in globals(): return
    ip = get_ipython()
    user_ns = ip.user_ns
    ip.displayhook.flush()
    pc = ip.displayhook.prompt_count + 1
    for n in range(1, pc): user_ns.pop('_i'+repr(n),None)
    user_ns.update(dict(_i='',_ii='',_iii=''))
    hm = ip.history_manager
    hm.input_hist_parsed[:] = [''] * pc
    hm.input_hist_raw[:] = [''] * pc
    hm._i = hm._ii = hm._iii = hm._i00 =  ''



def clean_tb():
    # h/t Piotr Czapla
    if hasattr(sys, 'last_traceback'):
        traceback.clear_frames(sys.last_traceback)
        delattr(sys, 'last_traceback')
    if hasattr(sys, 'last_type'): delattr(sys, 'last_type')
    if hasattr(sys, 'last_value'): delattr(sys, 'last_value')

def clean_mem():
    clean_tb()
    clean_ipython_hist()
    gc.collect()
    torch.cuda.empty_cache()



### Task 1: Prepare the data and build a PDF Data Loader

In [14]:
from PyPDF2 import PdfReader
reader = PdfReader(os.path.expanduser(PDFS[0]))
pages = reader.pages
documents = []
for page in pages:
  documents.append(page.extract_text())
#print(documents[-1])

#### First drop everything from References onwards. References were 'confusing' RAG into retrieving primarily titles of papers mentioned there, which is likely not very useful

In [15]:
import PyPDF2

def load_pdf_to_string(pdf_path):
    # Open the PDF file in binary mode
    with open(pdf_path, 'rb') as file:
        # Create a PDF file reader object
        pdf_reader = PyPDF2.PdfReader(file)

        # Initialize an empty string to hold the text
        text = ''

        # Loop through each page and extract the text
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            page_text = page.extract_text()
            references_index= page_text.upper().find('\nREFERENCES\n')
            if references_index != -1:
              page_text = page_text[:references_index]
              text += page_text
              return text
            text += page_text
    return text

# Use the function to load a PDF into a string
text = load_pdf_to_string(os.path.expanduser(PDFS[1]))
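
The exact search for `'\nREFERENCES\n'` can miss a heading that PDF extraction renders with trailing spaces, a section number, or a small-caps gap such as `R EFERENCES`. Below is a minimal sketch of a more tolerant cut on an already-extracted string, assuming the heading sits on its own line. `truncate_at_references` and its pattern are hypothetical helpers for illustration, not part of the notebook or of PyPDF2.

```python
import re

# Hypothetical pattern: a line that holds only "References" (any case),
# optionally preceded by a section number, with optional surrounding spaces
# and an optional gap after the first letter ("R EFERENCES").
REFERENCES_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?r[ \t]?eferences[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def truncate_at_references(text: str) -> str:
    """Return the text up to (not including) the first References heading line."""
    match = REFERENCES_HEADING.search(text)
    return text if match is None else text[:match.start()]

sample = "Intro text.\nMore body.\n 7 References \n[1] Some paper title.\n"
print(repr(truncate_at_references(sample)))
```

A line that merely mentions references inside a sentence is left alone, because the pattern requires the heading to fill the whole line.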

In [16]:
def get_title(pdf_path): return os.path.expanduser(pdf_path).split('/')[-1]

In [17]:
get_title(PDFS[-1])

'TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise.pdf'

In [18]:
text.find('References\n')

-1

In [19]:
PDFS[0]

PosixPath('/content/drive/MyDrive/PdfRag/clusterofstars/In-Context Retrieval-Augmented Language Models.pdf')

In [20]:
all_docs_and_titles = [(load_pdf_to_string(os.path.expanduser(pdf_path)),get_title(pdf_path)) for pdf_path in PDFS]

In [21]:
all_docs = [doc[0] for doc in all_docs_and_titles]
all_titles = [doc[1] for doc in all_docs_and_titles]

In [22]:
len(all_docs)

26

In [23]:
all_docs[-1].find('References\n')

-1

In [26]:
all_docs[0]

'In-Context Retrieval-Augmented Language Models\nOri Ram∗Yoav Levine∗Itay Dalmedigos Dor Muhlgay\nAmnon Shashua Kevin Leyton-Brown Yoav Shoham\nAI21 Labs\n{orir,yoavl,itayd,dorm,amnons,kevinlb,yoavs}@ai21.com\nAbstract\nRetrieval-Augmented Language Modeling\n(RALM) methods, which condition a lan-\nguage model (LM) on relevant documents\nfrom a grounding corpus during generation,\nwere shown to signi\ue000cantly improve lan-\nguage modeling performance. In addition,\nthey can mitigate the problem of factually\ninaccurate text generation and provide natu-\nral source attribution mechanism. Existing\nRALM approaches focus on modifying the\nLM architecture in order to facilitate the in-\ncorporation of external information, signi\ue000-\ncantly complicating deployment. This paper\nconsiders a simple alternative, which we dub\nIn-Context RALM: leaving the LM architec-\nture unchanged and prepending grounding\ndocuments to the input,without any further\ntraining of the LM. We show that In-Co

In [27]:
all_titles[0]

'In-Context Retrieval-Augmented Language Models.pdf'

dir(PDFS[0])

In [28]:
type(all_docs[0])

str

In [29]:
len(all_docs[0])

37632

In [None]:
# from langchain.document_loaders import PyPDFLoader

# loader = PyPDFLoader(os.path.expanduser(PDFS[0]))
# #pages = loader.load_and_split()
# pages = loader.load()
# all_pages = [PyPDFLoader(os.path.expanduser(PDFS[i])).load() for i in range(len(PDFS))]

- Instead of `load_and_split`, I'm going for a more custom split to have more control over context length, since I'm working with limited compute

In [None]:
#pages[0], len(pages)

- Either the PyPDFLoader or the PyPDF2 approach could work

In [None]:
# from langchain.document_loaders.onedrive_file import CHUNK_SIZE
# from langchain.document_loaders import TextLoader
# from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter
# CHUNK_SIZE = 1000
# CHUNK_OVERLAP = 30

# text_splitter = CharacterTextSplitter(
#     chunk_size=CHUNK_SIZE,
#     chunk_overlap = CHUNK_OVERLAP,
#     length_function=len,
# )
# docs  = text_splitter.split_documents(pages)
# docs

In [30]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter, RecursiveCharacterTextSplitter
from langchain.text_splitter import Document

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 30

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap = CHUNK_OVERLAP,
    length_function=len,
)
#text_splitter.split_text(all_pages[0])
# docs = [Document(page_content=pages) for pages in all_pages]
docs  = [text_splitter.split_text(doc) for doc in all_docs]
# # docs

In [31]:
len(docs)

26

In [32]:
tot_len = 0
for text in docs[0]:
    tot_len += len(text)
tot_len #OK, makes sense

37667

In [34]:
len(docs[0])

39
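
The summed chunk length (37667) is slightly larger than the document length (37632), most likely because neighbouring chunks share some overlapping characters. The effect can be sketched with a plain sliding window. This is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (it splits on separators first), and `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size=1000, overlap=30):
    """Split text into fixed-size windows where each window repeats the
    last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

toy_text = "x" * 2500
toy_chunks = sliding_chunks(toy_text)
# Each boundary between two chunks contributes `overlap` duplicated characters
print(len(toy_chunks), sum(len(c) for c in toy_chunks))
```

With a pure sliding window the total grows by `overlap` characters per chunk boundary, which is why a small surplus over the raw document length is expected.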

### Task 2: Create an "Index"

- Not yet sure whether to use Qdrant or FAISS


#### Selecting the VectorStore


In [35]:
from langchain.vectorstores import Qdrant, FAISS

FROM BARBIEHEIMER

We're going to be setting up our VectorStore with an open-source embeddings model, `sentence-transformers/all-MiniLM-L6-v2`. While the embeddings model does not need to match the LLM selection, the same model must be used to embed our index and to embed our queries over that index.

While we don't have to worry too much about that in this example - it's something to keep in mind for more complex applications.

We're going to leverage a [`CacheBackedEmbeddings`](https://python.langchain.com/docs/modules/data_connection/caching_embeddings) flow to prevent us from re-embedding the same text over and over again.

Not only will this save time; with a paid embeddings API it would also save precious embedding tokens, which would reduce the overall cost for our application.

>#### Note:
>The overall cost savings needs to be compared against the additional cost of storing the cached embeddings for a true cost/benefit analysis. If your users are submitting the same queries often, though, this pattern can be a massive reduction in cost.
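
The caching idea can be sketched without LangChain: key each vector on a hash of the exact text plus a namespace, so that vectors from different embedding models never mix. This is an illustrative stand-in under those assumptions, not how `CacheBackedEmbeddings` is implemented; `CachedEmbedder` and `toy_embed` are hypothetical names.

```python
import hashlib

class CachedEmbedder:
    """Wrap an embedding function with an exact-match, namespaced cache."""

    def __init__(self, embed_fn, namespace):
        self.embed_fn = embed_fn
        self.namespace = namespace
        self.cache = {}   # key -> vector; a dict stands in for an on-disk store
        self.misses = 0   # how many times the underlying model was called

    def _key(self, text):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed(self, text):
        key = self._key(text)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

def toy_embed(text):
    # Stand-in for a real model: vowel counts as a 5-dimensional vector
    return [text.lower().count(vowel) for vowel in "aeiou"]

embedder_sketch = CachedEmbedder(toy_embed, namespace="toy-model")
embedder_sketch.embed("What is Self-Rag?")
embedder_sketch.embed("What is Self-Rag?")   # identical text: served from the cache
embedder_sketch.embed("What is Self-RAG?")   # different string: a new cache miss
print(embedder_sketch.misses)
```

Note that the lookup is exact: a query that is merely similar to an earlier one still triggers a fresh embedding call.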

In [36]:
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore

store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)



In [37]:
from langchain.schema.document import Document

# Wrap every chunk in a Document that carries its article title as metadata
docs = [
    Document(page_content=chunk, metadata={'source': all_titles[j]})
    for j, doc_chunks in enumerate(docs)
    for chunk in doc_chunks
]

# Embed all chunks and build a single FAISS index in one call
# (replaces building one index per chunk and merging them)
vector_store = FAISS.from_documents(docs, embedder)

vector_store
#vector_store.save_local(index_dir)

<langchain.vectorstores.faiss.FAISS at 0x7cf4a9a4ff40>

In [38]:
vector_store.save_local(index_dir)

In [None]:
# #vector_store = FAISS.from_documents(docs, embedder)
# #vector_store = FAISS.from_documents((docs[i][j] for i in range(len(docs)) for j in range(len(docs[i]))), embedder)
# #vector_store = FAISS.from_documents(docs, embedder)
# #vector_store = FAISS.from_documents(docs[0], embedder)
# from langchain.schema.document import Document

# docs = [Document(page_content=doc[i]) for doc in docs for i in range(len(doc))]
# for index, pdf in enumerate(docs):
#    content = docs[index]
#    if index == 0:
#        vector_store = FAISS.from_documents([content], embedder)
#    else:
#       vector_store_i = FAISS.from_documents([content], embedder)
#       vector_store.merge_from(vector_store_i)


# vector_store.save_local(index_dir)

### To reload the index built above in a later Colab session, run the code below.

In [20]:
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)
embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)

vector_store = FAISS.load_local(index_dir, embedder)


Check that the VectorStore is working by embedding a query and retrieving passages from our papers that are close to it.

In [39]:
query = "What is Retrieval-augmented generation?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)

1 I NTRODUCTION
State-of-the-art LLMs continue to struggle with factual errors (Mallen et al., 2023; Min et al., 2023)
despite their increased model and data scale (Ouyang et al., 2022). Retrieval-Augmented Generation
(RAG) methods (Figure 1 left; Lewis et al. 2020; Guu et al. 2020) augment the input of LLMs
with relevant retrieved passages, reducing factual errors in knowledge-intensive tasks (Ram et al.,
2023; Asai et al., 2023a). However, these methods may hinder the versatility of LLMs or introduce
unnecessary or off-topic passages that lead to low-quality generations (Shi et al., 2023) since they
retrieve passages indiscriminately regardless of whether the factual grounding is helpful. Moreover,
the output is not guaranteed to be consistent with retrieved relevant passages (Gao et al., 2023) since
the models are not explicitly trained to leverage and follow facts from provided passages. This
work introduces Self-Reflective Retrieval-augmented Generation ( SELF-RAG)to improve an
LL
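
Conceptually, `similarity_search_by_vector` returns the `k` stored chunks whose embeddings lie nearest to the query embedding. Below is a brute-force sketch of that idea with toy 2-d vectors, assuming Euclidean (L2) distance; the function names, vectors, and chunk labels are made up for illustration and say nothing about how FAISS organises its index internally.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, indexed, k=4):
    """indexed: list of (vector, text). Return the k nearest as (distance, text)."""
    scored = [(l2_distance(query_vec, vec), text) for vec, text in indexed]
    scored.sort(key=lambda pair: pair[0])   # smallest distance first
    return scored[:k]

toy_index = [
    ([1.0, 0.0], "chunk about retrieval"),
    ([0.0, 1.0], "chunk about quantization"),
    ([0.9, 0.1], "chunk about RAG"),
]

for dist, text in top_k([1.0, 0.0], toy_index, k=2):
    print(round(dist, 3), text)
```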

In [40]:
query = "What is Self-Rag?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)

predictions are mostly aligned with their assessments. Appendix Table 6 shows several annotated
examples and explanations on assessments.
6 C ONCLUSION
This work introduces SELF-RAG, a new framework to enhance the quality and factuality of LLMs
through retrieval on demand and self-reflection. SELF-RAGtrains an LM to learn to retrieve, generate,
and critique text passages and its own generation by predicting the next tokens from its original
vocabulary as well as newly added special tokens, called reflection tokens. SELF-RAGfurther enables
the tailoring of LM behaviors at test time by leveraging reflection tokens. Our holistic evaluations on
six tasks using multiple metrics demonstrate that SELF-RAGsignificantly outperforms LLMs with
more parameters or with conventional retrieval-augmented generation approaches.
10Preprint.
ETHICAL CONCERNS
This work aims to improve the factuality of LLM outputs, the lack of which continues to cause nu-
3 S ELF-RAG: LEARNING TO RETRIEVE , GENERATE AND C

Let's see how much time the `CacheBackedEmbeddings` pattern saves us:

In [None]:
%%timeit -n 1 -r 1
query = "What is Self-Rag?"
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

11.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [None]:
%%timeit
query = "What is Self-Rag?"
embedding_vector = embedder.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

7.37 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Looked like retrieving from References section may be a problem since references came up frequently for searched terms without being highly informative.  
- REMOVING REFERENCES ONWARDS INCREASED THE RETRIEVAL QUALITY NOTICEABLY
- TODO: Could experiment with different retriever or a smart-enough model (Self-Rag?)

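As we can see, the repeated query averaged 7.37 ms per loop, compared with 11.7 ms for the single first run. Depending on the LangChain version, `CacheBackedEmbeddings` may cache only document embeddings and leave `embed_query` uncached, so part of this gap may be warm-up rather than caching.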

With that, we're ready to move onto Task 3!

### Task 3: Building a Retrieval Chain

In this task, we'll be making a Retrieval Chain which will allow us to ask semantic questions over our data.

This part is rather abstracted away from us in LangChain and so it seems very powerful.

Be sure to check the documentation, the source code, and other provided resources to build a deeper understanding of what's happening "under the hood"!

#### A Basic RetrievalQA Chain

We're going to leverage `return_source_documents=True` to ensure we have proper sources for our answers - should the end user want to verify the answers themselves.

Hallucinations [are](https://arxiv.org/abs/2202.03629) [a](https://arxiv.org/abs/2305.15852) [massive](https://arxiv.org/abs/2303.16104) [problem](https://arxiv.org/abs/2305.18248) in LLM applications.

Though it has been tenuously shown that using Retrieval Augmentation [reduces hallucination in conversations](https://arxiv.org/pdf/2104.07567.pdf), one surefire way to ensure your model is not hallucinating in a non-transparent way is to provide sources with your responses. This way the end user can verify the output.
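
As a rough sketch of what "providing sources" could look like, the helper below appends deduplicated article titles to an answer. It assumes a response dict with `result` and `source_documents` keys and a `source` entry in each document's metadata; `Doc` is a minimal stand-in class and `format_with_sources` is a hypothetical helper, and the answer text and chunk contents are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (not LangChain's Document)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_with_sources(response):
    """Append the distinct source titles, in retrieval order, to the answer."""
    titles = []
    for doc in response["source_documents"]:
        title = doc.metadata.get("source", "unknown source")
        if title not in titles:
            titles.append(title)
    lines = [response["result"].strip(), "", "Sources:"]
    lines += [f"- {title}" for title in titles]
    return "\n".join(lines)

toy_response = {
    "query": "placeholder question",
    "result": " Placeholder answer text. ",
    "source_documents": [
        Doc("chunk one", {"source": "LoRA: Low-Rank Adaptation of Large Language Models.pdf"}),
        Doc("chunk two", {"source": "LoRA: Low-Rank Adaptation of Large Language Models.pdf"}),
        Doc("chunk three", {"source": "Toolformer: Language Models Can Teach Themselves to Use Tools.pdf"}),
    ],
}
print(format_with_sources(toy_response))
```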

#### Our LLM

In this notebook, we're going to leverage Zephyr 7B beta, a fine-tune of Mistral 7B, in place of Meta's LLaMA 2 13B!

Specifically, we'll be using: `HuggingFaceH4/zephyr-7b-beta`

That's a 7B parameter model which, quantized to 4 bits, runs on a T4 in this notebook.

More information on this model can be found [here](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)

In [41]:
from huggingface_hub import notebook_login

notebook_login()


We will be leveraging Tim Dettmers' `bitsandbytes` as well as `accelerate` and `transformers` from Hugging Face to make our model as small as possible. The overall quality of the model is fairly well retained!
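
To get a feel for why 4-bit loading matters on a small GPU, here is a back-of-the-envelope estimate of weight storage alone. The parameter counts are nominal round numbers (an assumption), and the estimate ignores activations, the KV cache, quantization constants, and any layers kept in higher precision, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9

# Nominal parameter counts, assumed for illustration
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{name}: 16-bit ~ {fp16:.1f} GB, 4-bit ~ {four_bit:.1f} GB")
```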

In [42]:
import torch
import transformers

model_id = "HuggingFaceH4/zephyr-7b-beta"

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_config = transformers.AutoConfig.from_pretrained(
    model_id
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto'
)

model.eval()


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=2)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRM

In [43]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id
)


Now we need to pack it into a `pipeline` for compatibility with `langchain`!

In [44]:
!pip install xformers

Collecting xformers
  Downloading xformers-0.0.22.post7-cp310-cp310-manylinux2014_x86_64.whl (211.8 MB)
Installing collected packages: xformers
Successfully installed xformers-0.0.22.post7


In [45]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,
    temperature=0.0,
    max_new_tokens=256
) # Emits a cuDNN warning, likely because this runs on a T4 rather than, say, an A100

In [46]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

Now we can set up our chain.

In [47]:
retriever = vector_store.as_retriever()

In [48]:
from langchain.chains import RetrievalQA,RetrievalQAWithSourcesChain
from langchain.callbacks import StdOutCallbackHandler

handler = StdOutCallbackHandler()

qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[handler],
    return_source_documents=True
)
# qa_with_sources_chain = RetrievalQAWithSourcesChain.from_chain_type(
#     llm=llm,
#     retriever=retriever,
#     callbacks=[handler],
#     return_source_documents=True
# )
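
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places the retrieved chunks directly into the prompt. Below is a rough sketch of that idea. The prompt wording is illustrative and not LangChain's exact template, and the character budget (`max_chars`) is an added assumption to show how context length could be kept in check on limited compute.

```python
def build_stuff_prompt(question, chunks, max_chars=4000):
    """Concatenate chunks in rank order into one prompt,
    stopping before the character budget would be exceeded."""
    context_parts = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # adding this chunk would exceed the budget
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

toy_chunks = ["First retrieved chunk.", "Second retrieved chunk.", "z" * 5000]
print(build_stuff_prompt("How does Self-Rag compare to Rag?", toy_chunks))
```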

- Try using RetrievalQAWithSourcesChain

In [50]:
#qa_with_sources_chain({"question" : "What makes Self-Rag different from Rag?"})

- Well, so far my use of RetrievalQAWithSourcesChain does not seem to be working well at all!!!  Go back to RetrievalQA

In [51]:
qa_with_sources_chain({"query" : "How does Self-Rag compare to Rag?"})

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'How does Self-Rag compare to Rag?',
 'result': " Self-Rag and Rag are both approaches for improving the factuality of LLMs through\nretrieval and generation. However, Self-Rag has some key differences from Rag:\n\n1. Retrieval on demand: Self-Rag allows the LLM to retrieve information on demand, rather than\nretrieving a fixed set of passages for all inputs. This allows Self-Rag to be more flexible and\nefficient in retrieving the most relevant information for a given input.\n\n2. Self-reflection: Self-Rag trains the LLM to learn to critique its own generation, by predicting\nspecial reflection tokens that signal the need for retrieval or confirm the output's relevance,\nsupport, or completeness. This allows Self-Rag to produce more accurate and complete outputs.\n\n3. End-to-end training: Self-Rag trains the LLM end-to-end, which allows it to better integrate\nretrieval and generation, and to learn to use retrieval and self-reflection in a more coherent and\neffective way.\

- However, adding 'source' metadata to each Document yielded a result that includes the article title, which was one of the desired outcomes. If filtering articles by metadata with FAISS becomes necessary, see https://github.com/langchain-ai/langchain/discussions/10983.
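
One simple way to emulate metadata filtering on top of a store without native filters is to over-fetch candidates and then keep only those whose metadata matches. The sketch below shows that post-filtering step on toy hits; it is an assumption about one possible approach and not necessarily what the linked discussion recommends. `filter_by_source` and the scores are hypothetical.

```python
def filter_by_source(hits, wanted_source, k=4):
    """hits: list of (score, metadata, text), sorted best-first.
    Keep the first k hits whose 'source' metadata matches."""
    kept = [hit for hit in hits if hit[1].get("source") == wanted_source]
    return kept[:k]

lora = "LoRA: Low-Rank Adaptation of Large Language Models.pdf"
toolformer = "Toolformer: Language Models Can Teach Themselves to Use Tools.pdf"

# Toy candidates, as if a larger k had been requested from the store
toy_hits = [
    (0.10, {"source": toolformer}, "chunk A"),
    (0.15, {"source": lora}, "chunk B"),
    (0.20, {"source": lora}, "chunk C"),
    (0.30, {"source": toolformer}, "chunk D"),
]

for score, metadata, text in filter_by_source(toy_hits, lora, k=2):
    print(score, text)
```

The trade-off is that enough candidates must be fetched up front; if few of them match the wanted source, fewer than `k` results come back.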

In [24]:
qa_with_sources_chain({"query" : "What is QLoRa?"})



{'query': 'What is QLoRa?',
 'result': ' QLoRa is a technique for quantizing and low-rank factorizing large-scale language models to significantly reduce their memory footprint during finetuning, while maintaining or improving their performance. It uses a combination of 4-bit NormalFloat and 16-bit BrainFloat data types, with the latter used only for the LoRA parameters. QLoRa has been shown to be effective in improving the performance of language models on various benchmarks, while also reducing the required memory and training time. It has the potential to enable the training of even larger models on consumer GPUs, and to facilitate the development of specialized open-source data for finetuning.',
 'source_documents': [Document(page_content='and compare QLoRA with 16-bit adapter-finetuning and with full-finetuning for models up to 3B. Our\nevaluations include GLUE [ 58] with RoBERTa-large [ 38], Super-NaturalInstructions (TKInstruct)\n[61] with T5 [ 49], and 5-shot MMLU [ 24] after f

In [25]:
qa_with_sources_chain({"query" : "Did these papers explore themes of existentialism?"})



{'query': 'Did these papers explore themes of existentialism?',
 'result': ' No, these papers explore themes related to artificial intelligence and consciousness. Existentialism is a philosophical movement focused on the human experience and the search for meaning and purpose in life. These papers are not directly related to existentialism.',
 'source_documents': [Document(page_content='TargetsArgument is that some argue that if an AI can simulate human behavior (qualia), the "what it feels like" aspect of consciousness. The Simulational considered conscious. However, this view doesn\'t account for subjectiveinputs and generate outputs similar to a conscious being, then it could be underlying physical structure. In other words, if an AI can respond to view that mental states are deﬁned more by their function than their Some proponents of AI consciousness subscribe to functionalism, the 12Long Span Corruption(one form of X-Denoising)\n1314121314Meet In The MiddleInputs', metadata={'sour

In [27]:
#qa_with_sources_chain({"query" : " Open large language models (LLMs) with great performance in various tasks have significantly advanced the development of LLMs. However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is lack of research focusing on improving the agent capabilities of LLMs themselves without compromising their general abilities. In this work, we present AgentTuning, a simple and general method to enhance the agent abilities of LLMs while maintaining their general LLM capabilities. We construct AgentInstruct, a lightweight instruction-tuning dataset containing high-quality interaction trajectories. We employ a hybrid instruction-tuning strategy by combining AgentInstruct with open-source instructions from general domains. AgentTuning is used to instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show that AgentTuning enables LLMs' agent capabilities without compromising general abilities. The AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. We open source the AgentInstruct and AgentLM-7B, 13B, and 70B models at https://github.com/THUDM/AgentTuning, serving open and powerful alternatives to commercial LLMs for agent tasks. "})

In [52]:
from langchain.document_loaders import WebBaseLoader
from langchain.chains.summarize import load_summarize_chain

In [None]:
query="Open large language models (LLMs) with great performance in various tasks have significantly advanced the development of LLMs. However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is lack of research focusing on improving the agent capabilities of LLMs themselves without compromising their general abilities. In this work, we present AgentTuning, a simple and general method to enhance the agent abilities of LLMs while maintaining their general LLM capabilities. We construct AgentInstruct, a lightweight instruction-tuning dataset containing high-quality interaction trajectories. We employ a hybrid instruction-tuning strategy by combining AgentInstruct with open-source instructions from general domains. AgentTuning is used to instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show that AgentTuning enables LLMs' agent capabilities without compromising general abilities. The AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. We open source the AgentInstruct and AgentLM-7B, 13B, and 70B models at https://github.com/THUDM/AgentTuning, serving open and powerful alternatives to commercial LLMs for agent tasks. "

In [None]:
os.listdir('.')

['clusterofstars',
 'cache',
 'rag_index_dir',
 'oppenheimer.csv',
 'barbie.csv',
 'testabstract']

In [29]:
chain = load_summarize_chain(llm, chain_type="stuff")
from langchain.document_loaders import TextLoader
query = TextLoader('testabstract').load()
#query = query_loader.load({"text" : "How does Self-Rag compare to Rag?"})
query="Which papers are most similar to the article with the following summary?  Article summary: " + chain.run(query)



In [30]:
qa_with_sources_chain({"query" : query})



{'query': "Which papers are most similar to the article with the following summary?  Article summary: \n\nThe paper discusses the limitations of large language models (LLMs) in complex real-world tasks despite their impressive performance in various tasks. The authors suggest that commercial models like ChatGPT and GPT-4 outperform LLMs in agent tasks that require planning, memorization, and tool utilization. To improve LLMs' agent capabilities without compromising their general abilities, the authors introduce AgentTuning, a simple and general method that enhances LLMs' agent abilities by instruction-tuning them using a lightweight instruction-tuning dataset called AgentInstruct. The authors demonstrate that AgentTuning enables LLMs' agent capabilities without affecting their general abilities, and the resulting AgentLM models are comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. The authors open-source the AgentInstruct and AgentLM model

In [53]:
chain = load_summarize_chain(llm, chain_type="stuff")
from langchain.document_loaders import TextLoader
query = TextLoader('instructionmining').load()
#query = query_loader.load({"text" : "How does Self-Rag compare to Rag?"})
query="Which papers are most similar to the article with the following summary?  Article summary: " + chain.run(query)



In [None]:
qa_with_sources_chain({"query" : query})



{'query': 'Which papers are most similar to the article with the following summary?  Article summary: \n\nThis paper proposes a new method called InstructMining for selecting high-quality instruction-following data to optimize the performance of large language models (LLMs) in interacting with humans. The method uses natural language indicators to evaluate the quality of unseen datasets and leverages BlendSearch to find the best subset of the entire dataset. Experiment results show that InstructMining achieves state-of-the-art performance on two popular benchmarks. Additionally, the paper discovers the double descent phenomenon in LLM finetuning and provides insights into the relationship between dataset size and model performance.',
 'result': '\n\nBased on the summary, the following papers are most similar to the article:\n\n1. [28] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXi

In [None]:
chain = load_summarize_chain(llm, chain_type="stuff")
from langchain.document_loaders import TextLoader
query = TextLoader('instructionmining').load()
#query = query_loader.load({"text" : "How does Self-Rag compare to Rag?"})
query="Which papers are most different to the article with the following summary?  Article summary: " + chain.run(query)
qa_with_sources_chain({"query" : query})



{'query': 'Which papers are most different to the article with the following summary?  Article summary: \n\nThis paper proposes a new method called InstructMining for selecting high-quality instruction-following data to optimize the performance of large language models (LLMs) in interacting with humans. The method uses natural language indicators to evaluate the quality of unseen datasets and leverages BlendSearch to find the best subset of the entire dataset. Experiment results show that InstructMining achieves state-of-the-art performance on two popular benchmarks. Additionally, the paper discovers the double descent phenomenon in LLM finetuning and provides insights into the relationship between dataset size and model performance.',
 'result': '\n\nThe papers that are most different to the article with the given summary are:\n\n1. [28] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint 

In [None]:
chain = load_summarize_chain(llm, chain_type="stuff")
from langchain.document_loaders import TextLoader
query = TextLoader('instructionmining').load()
#query = query_loader.load({"text" : "How does Self-Rag compare to Rag?"})
query="Which paper has the least in common with the article with the following summary?  Article summary: " + chain.run(query)
qa_with_sources_chain({"query" : query})



{'query': 'Which paper has the least in common with the article with the following summary?  Article summary: \n\nThis paper proposes a new method called InstructMining for selecting high-quality instruction-following data to optimize the performance of large language models (LLMs) in interacting with humans. The method uses natural language indicators to evaluate the quality of unseen datasets and leverages BlendSearch to find the best subset of the entire dataset. Experiment results show that InstructMining achieves state-of-the-art performance on two popular benchmarks. Additionally, the paper discovers the double descent phenomenon in LLM finetuning and provides insights into the relationship between dataset size and model performance.',
 'result': '  The paper with the least in common with the article with the given summary is "Evaluating the harms of language models" because it does not discuss language models or their performance, and instead focuses on the potential negative co

In [None]:
chain = load_summarize_chain(llm, chain_type="stuff")
from langchain.document_loaders import TextLoader
query = TextLoader('instructionmining').load()
#query = query_loader.load({"text" : "How does Self-Rag compare to Rag?"})
query="Be thorough and explain your reasoning step by step.  Which paper has the most in common with the article with the following summary?  .  Article summary: " + chain.run(query)
qa_with_sources_chain({"query" : query})



{'query': 'Be thorough and explain your reasoning step by step.  Which paper has the most in common with the article with the following summary?  .  Article summary: \n\nThis paper proposes a new method called InstructMining for selecting high-quality instruction-following data to optimize the performance of large language models (LLMs) in interacting with humans. The method uses natural language indicators to evaluate the quality of unseen datasets and leverages BlendSearch to find the best subset of the entire dataset. Experiment results show that InstructMining achieves state-of-the-art performance on two popular benchmarks. Additionally, the paper discovers the double descent phenomenon in LLM finetuning and provides insights into the relationship between dataset size and model performance.',
 'result': ' \n\nThe paper with the most in common with the article is "InstructMining: Selecting High-Quality Instruction-Following Data for Large Language Models" by Yi et al. (2019). Both p

In [54]:
chain = load_summarize_chain(llm, chain_type="stuff")
from langchain.document_loaders import TextLoader
query = TextLoader('instructionmining').load()
#query = query_loader.load({"text" : "How does Self-Rag compare to Rag?"})
query="Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article with the following summary?  .  Article summary: " + chain.run(query)
qa_with_sources_chain({"query" : query})





> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article with the following summary?  .  Article summary: \n\nThe paper proposes InstructMining, a method for automatically selecting high-quality instruction-following data for finetuning large language models (LLMs). InstructMining uses natural language indicators to evaluate unseen datasets and select premium data for finetuning. The paper also observes the double descent phenomenon in LLM finetuning and introduces BlendSearch to find the best subset from a large dataset. Experimental results show that InstructMining-7B outperforms other models on popular benchmarks.',
 'result': ' \n\nTo select the most relevant retrieved document for an article with the given summary, we can follow these steps:\n\n1. Identify the natural language indicators that are most relevant to the article\'s summary. In this case, the summary mentions "automatically selecting high-

In [None]:
from langchain.document_loaders import TextLoader
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, text_list):
        self.text_list = text_list

    def __len__(self):
        return len(self.text_list)

    def __getitem__(self, idx):
        return self.text_list[idx]

# Load the text
query = TextLoader('instructionmining').load()

# Create a dataset
query_dataset = TextDataset(query)

# Run the chain on the dataset
chain.run(query_dataset)



'\n\nThis paper proposes a new method called InstructMining for selecting high-quality instruction-following data to optimize the performance of large language models (LLMs) in interacting with humans. The method uses natural language indicators to evaluate the quality of unseen datasets and leverages BlendSearch to find the best subset of the entire dataset. Experiment results show that InstructMining achieves state-of-the-art performance on two popular benchmarks. Additionally, the paper discovers the double descent phenomenon in LLM finetuning and provides insights into the relationship between dataset size and model performance.'

### ArXiv API

In [55]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
locale.getpreferredencoding() ### SOLVED THE UTF-8 ISSUE!

'UTF-8'

In [39]:
!pip install arxiv

In [59]:
#from langchain.retrievers import ArxivRetriever  # raised an error, probably because the API was updated
from langchain.retrievers.arxiv import ArxivRetriever

In [60]:
retriever = ArxivRetriever(load_max_docs=2)

In [None]:
!pip install pymupdf

Collecting pymupdf
  Downloading PyMuPDF-1.23.6-cp310-none-manylinux2014_x86_64.whl (4.3 MB)
Collecting PyMuPDFb==1.23.6 (from pymupdf)
  Downloading PyMuPDFb-1.23.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.6 MB)
Installing collected packages: PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.23.6 pymupdf-1.23.6


In [61]:
docs = retriever.get_relevant_documents(query="2311.05610")

In [73]:
docs

[Document(page_content='Efficiently training large language models requires parallelizing across\nhundreds of hardware accelerators and invoking various compute and memory\noptimizations. When combined, many of these strategies have complex\ninteractions regarding the final training efficiency. Prior work tackling this\nproblem did not have access to the latest set of optimizations, such as\nFlashAttention or sequence parallelism. In this work, we conduct a\ncomprehensive ablation study of possible training configurations for large\nlanguage models. We distill this large study into several key recommendations\nfor the most efficient training. For instance, we find that using a micro-batch\nsize of 1 usually enables the most efficient training layouts. Larger\nmicro-batch sizes necessitate activation checkpointing or higher degrees of\nmodel parallelism and also lead to larger pipeline bubbles. Our most efficient\nconfigurations enable us to achieve state-of-the-art training efficiency\

In [72]:
docs[0].page_content

'Efficiently training large language models requires parallelizing across\nhundreds of hardware accelerators and invoking various compute and memory\noptimizations. When combined, many of these strategies have complex\ninteractions regarding the final training efficiency. Prior work tackling this\nproblem did not have access to the latest set of optimizations, such as\nFlashAttention or sequence parallelism. In this work, we conduct a\ncomprehensive ablation study of possible training configurations for large\nlanguage models. We distill this large study into several key recommendations\nfor the most efficient training. For instance, we find that using a micro-batch\nsize of 1 usually enables the most efficient training layouts. Larger\nmicro-batch sizes necessitate activation checkpointing or higher degrees of\nmodel parallelism and also lead to larger pipeline bubbles. Our most efficient\nconfigurations enable us to achieve state-of-the-art training efficiency\nresults over a range o

#data=docs[0].metadata  # meta-information of the Document

In [83]:
title, query = docs[0].metadata['Title'],docs[0].page_content.replace('\n', ' ')

In [84]:
query="Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article with the following summary?  .  Article summary: " + query
qa_with_sources_chain({"query" : query})



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article with the following summary?  .  Article summary: Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of m

In [None]:
clean_mem()

In [85]:
summary

'Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes

In [86]:
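# Note: `query` already holds the full prompt built in In [84], so the earlier instruction prefix is repeated in the output below.
query="Be thorough and state your sources.  Which of the retrieved documents has the most in common with the article with the following summary?  .  Article summary: " + query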
qa_with_sources_chain({"query" : query})



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Be thorough and state your sources.  Which of the retrieved documents has the most in common with the article with the following summary?  .  Article summary: Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article with the following summary?  .  Article summary: Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a 
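
The output above contains the instruction prefix twice, because `query` was rebuilt from itself. A minimal sketch of one way to avoid this, assuming the abstract is kept in its own variable: `build_similarity_query` and its parameters are hypothetical names for illustration, not part of LangChain, and the short `abstract` string is a stand-in for `docs[0].page_content`.

```python
def build_similarity_query(summary, instruction, title=None):
    """Compose a retrieval prompt from an instruction and an article summary.

    Hypothetical helper: the summary is passed in separately, so calling this
    repeatedly with different instructions never nests an earlier prompt.
    """
    summary = " ".join(summary.split())  # collapse newlines and repeated spaces
    subject = f"the article titled {title}" if title else "the article"
    return (
        f"{instruction}  Which of the retrieved documents has the most in common "
        f"with {subject} with the following summary?  Article summary: {summary}"
    )


# Stand-in abstract text (shortened for the example).
abstract = "Efficiently training large language models\nrequires parallelizing across accelerators."

q1 = build_similarity_query(abstract, "Be thorough and explain your reasoning step by step.")
q2 = build_similarity_query(abstract, "Be thorough and state your sources.")
```

Either string could then be passed as `qa_with_sources_chain({"query": q2})`, and each contains exactly one instruction and one copy of the abstract.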

In [87]:
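# Abstracts can be stored as plain-text files (named after the paper title) and loaded back as follows
with open(title, 'w') as f:
    f.write(query)  # note: `query` still includes the instruction prefixes added in the cells above
# Load the saved text file and summarize it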
chain = load_summarize_chain(llm, chain_type="stuff")
from langchain.document_loaders import TextLoader
query = TextLoader(title).load()
#query = query_loader.load({"text" : "How does Self-Rag compare to Rag?"})
#query="Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article with the following summary?  .  Article summary: " + chain.run(query)
query=f"Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article titled {title} with the following summary? Article summary: " + chain.run(query)
qa_with_sources_chain({"query" : query})





> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article titled Efficient Parallelization Layouts for Large-Scale Distributed Model Training with the following summary? Article summary:  The article discusses the efficient training of large language models using parallelization and optimization techniques. It compares different training configurations and provides recommendations for the most efficient layouts, such as using a micro-batch size of 1 and avoiding larger pipeline bubbles. The study achieves state-of-the-art training efficiency results for various model sizes, with a Model FLOPs utilization of 70.5% when training a 13B model. The article emphasizes the importance of being thorough and explaining reasoning step by step, as well as citing sources.',
 'result': ' Based on the summary, the article titled "Efficient Parallelization Layouts for Large-Scale Distributed Model Training" seems to be mos
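
The cell above uses the raw paper title as a file name, which would fail for titles containing characters such as `/`. A minimal sketch of a safer save step, assuming only the abstract (not the full prompt) should be written: `safe_filename` and `save_abstract` are hypothetical helpers, and the title in the demo is made up.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title, max_len=80):
    """Turn a paper title into a filesystem-friendly name (hypothetical helper)."""
    name = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return name[:max_len] + ".txt"


def save_abstract(directory, title, abstract):
    """Write only the abstract text to <directory>/<safe title>.txt and return the path."""
    path = Path(directory) / safe_filename(title)
    path.write_text(abstract, encoding="utf-8")
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_abstract(tmp, "A/B testing: a made-up title", "Some abstract text.")
    saved_name = saved.name
    restored = saved.read_text(encoding="utf-8")
```

The returned path can then be handed to `TextLoader(str(path)).load()` in the same way as the file names used in this notebook.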

In [88]:
chain = load_summarize_chain(llm, chain_type="stuff")
from langchain.document_loaders import TextLoader
query = TextLoader('instructionmining').load()
#query = query_loader.load({"text" : "How does Self-Rag compare to Rag?"})
#query="Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article with the following summary?  .  Article summary: " + chain.run(query)
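# Note: `title` still refers to the ArXiv paper fetched above, not to the 'instructionmining' text, so the title and summary in the output below do not match.
query=f"Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article titled {title} with the following summary? Article summary: " + chain.run(query)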
qa_with_sources_chain({"query" : query})





> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Be thorough and explain your reasoning step by step.  Which of the retrieved documents has the most in common with the article titled Efficient Parallelization Layouts for Large-Scale Distributed Model Training with the following summary? Article summary: \n\nThe paper proposes InstructMining, a method for automatically selecting high-quality instruction-following data for finetuning large language models (LLMs). InstructMining uses natural language indicators to evaluate unseen datasets and select premium data for finetuning. The paper also observes the double descent phenomenon in LLM finetuning and introduces BlendSearch to find the best subset from a large dataset. Experimental results show that InstructMining-7B outperforms other models on popular benchmarks.',
 'result': ' After reviewing the summaries of the retrieved documents, the article titled "Efficient Parallelization Layouts for Large-Scale Distributed Model Training" seems to have the most in common with the a

- Oddly enough, loading the abstract with the text loader and summarizing it seems to give much better results than passing the raw abstract! May need to dig in.

### Scraping https://paperswithcode.com/area/natural-language-processing: set up my own scraper or use LangChain's
- https://python.langchain.com/docs/use_cases/web_scraping
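
As a starting point for the "own scraper" option, here is a minimal sketch using only the standard library. It assumes the page has already been downloaded and that paper links use hrefs beginning with `/paper/`. That prefix is an assumption that has not been checked against the site, `PaperLinkParser` is a hypothetical class, and the HTML below is made up.

```python
from html.parser import HTMLParser


class PaperLinkParser(HTMLParser):
    """Collect (href, link text) pairs for anchors whose href starts with a prefix."""

    def __init__(self, prefix="/paper/"):
        super().__init__()
        self.prefix = prefix  # assumed URL pattern, not verified against the site
        self.links = []       # collected (href, text) tuples
        self._href = None     # href of the matching anchor currently being read
        self._text = []       # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and normalise whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


# Made-up HTML standing in for a downloaded page.
sample_html = """
<div><a href="/paper/example-paper-one">Example   Paper One</a></div>
<div><a href="/about">About</a></div>
<div><a href="/paper/example-paper-two">Example Paper <b>Two</b></a></div>
"""

parser = PaperLinkParser()
parser.feed(sample_html)
paper_links = parser.links
```

The collected titles could then feed the same similarity prompt used earlier in this notebook. Whether this is preferable to LangChain's loaders depends on how the real page is structured, which still needs checking.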