In [1]:
import os
import itertools
import numpy as np
from datetime import datetime
from markdown import Markdown
from pdfminer.high_level import extract_text
from docx import Document
from io import StringIO

from tqdm.autonotebook import tqdm, trange

from nltk import tokenize
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from flashrank import Ranker, RerankRequest
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer

import llamafile_client as llamafile
import settings


In [2]:
# for tokenization

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/kjam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Design Note

This builds the documents from scratch every time. This is because I don't have that many documents and it's overkill to store a vector database when I only use this once a week or less. This also means I can switch out models easily to change things, or edit documents without having to track changes.

You will need to set an environment variable, `DOCUMENT_PATH`, which points to the folder you want to search.

In the shell where you start your Jupyter notebook, run something like the following, substituting your own directory:

`export DOCUMENT_PATH=/Users/etc/your_folder/`
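
If `DOCUMENT_PATH` is missing, `os.getenv` returns `None` and the directory walk in the next cell will not find your documents. A minimal sketch of an up-front check follows; the helper name `resolve_document_path` is hypothetical and not part of this notebook.

```python
import os

def resolve_document_path(env=None):
    """Return the folder to index, or raise a clear error if it is unusable."""
    # `env` is injectable so the check can be tried without touching os.environ
    env = os.environ if env is None else env
    path = env.get("DOCUMENT_PATH")
    if not path:
        raise RuntimeError("DOCUMENT_PATH is not set; export it before starting Jupyter")
    path = os.path.expanduser(path)
    if not os.path.isdir(path):
        raise RuntimeError(f"DOCUMENT_PATH does not point to a folder: {path}")
    return path
```

You could then pass the result explicitly, for example `traverse_dir(resolve_document_path())`.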

In [5]:
DOCUMENT_STORE = []

# from https://stackoverflow.com/questions/761824/python-how-to-convert-markdown-formatted-text-to-text
def unmark_element(element, stream=None):
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# patching Markdown
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
__md.stripTopLevelTags = False


def unmark(text):
    return __md.convert(text)

def traverse_dir(directory=os.getenv("DOCUMENT_PATH")):
    for root, dirs, files in os.walk(directory):
        for filename in files:
            text = read_file(os.path.join(root, filename))
            if text:
                DOCUMENT_STORE.append({
                    'file': str(os.path.join(root, filename)),
                    'text': text,
                    'last_read': datetime.now().isoformat()
                })

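def read_file(filename):
    # Match on the real extension; a substring test such as '.md' in filename
    # would also match directory names and unrelated file types.
    ext = os.path.splitext(filename)[1].lower()
    if ext == '.docx':  # python-docx reads .docx only, not legacy .doc
        doc = Document(filename)
        return '\n'.join(p.text for p in doc.paragraphs)
    elif ext == '.pdf':
        try:
            return extract_text(filename)
        except Exception:
            # skip PDFs that pdfminer cannot parse
            return None
    elif ext == '.md':
        with open(filename) as md_file:
            return unmark(md_file.read())
    return None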


In [6]:
if len(DOCUMENT_STORE) < 10:
    traverse_dir()

len(DOCUMENT_STORE)

854

### Notes

Now that your document store is built, the count above shows how many documents were successfully parsed. The first step is some basic NLP: search and retrieval with a tried-and-true method, [BM25](https://en.wikipedia.org/wiki/Okapi_BM25), because keyword matching finds relevant documents more reliably than AI-based search.
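
To make the ranking less of a black box, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration, not the `rank_bm25` implementation used in the cells below: the function name `bm25_scores`, the toy corpus, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are all assumptions chosen for readability.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a simplified BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # rare terms get a higher weight (smoothed, never negative)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # term frequency saturates via k1 and is normalized by length via b
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

toy_docs = [
    "models memorize training data".split(),
    "privacy attacks extract training data".split(),
    "a recipe for sourdough bread".split(),
]
toy_scores = bm25_scores("memorize training".split(), toy_docs)
```

In this toy corpus the first document matches both query terms, the second matches one, and the third matches none, so its score is zero.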

In [7]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([d.get('text') for d in DOCUMENT_STORE])

In [8]:
vectorizer.get_feature_names_out()

array(['00', '000', '0000', ..., '𝑚𝑖𝑛', '𝒟1', '𝜆ℒ'], dtype=object)

In [12]:
doc_list = [doc.get('text') for doc in DOCUMENT_STORE]
tokenized_corpus_per_document = [tokenize.word_tokenize(doc.get('text')) for doc in DOCUMENT_STORE]

In [13]:
bm25 = BM25Okapi(tokenized_corpus_per_document)

Now you can change the query below to whatever it is that you want to search for!

As you can see, I was doing some work to try to figure out what research articles to cite for my [memorization in AI systems articles](https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html#a-deep-dive-into-memorization-in-deep-learning). :)

In [14]:
query = "memorization in AI systems"
tokenized_query = tokenize.word_tokenize(query)

doc_scores = bm25.get_scores(tokenized_query)
doc_scores[:10]

array([3.33251386, 0.        , 2.83762931, 2.47679597, 0.        ,
       3.31719706, 0.        , 3.28818993, 0.        , 2.98822339])

In [15]:
for index, doc in enumerate(bm25.get_top_n(tokenized_query, doc_list, n=5)):
    print(' '.join(["article {}".format(index)] + doc.split('\n')[:50]))
    print("----------------")

article 0 Extracting Training Data from Diffusion Models  Nicholas Carlini∗1  Jamie Hayes∗2 Milad Nasr∗1  Matthew Jagielski+1  Vikash Sehwag+4  Florian Tram`er+3  Borja Balle†2 Daphne Ippolito†1  Eric Wallace†5  1Google 5UC Berkeley ∗Equal contribution +Equal contribution †Equal contribution  2DeepMind  4Princeton  3ETHZ  3 2 0 2  n a J  0 3  ]  R C . s c [  1 v 8
----------------
article 1 3 2 0 2  r a  M 6  ]  G L . s c [  3 v 6 4 6 7 0 . 2 0 2 2 : v i X r a  Published as a conference paper at ICLR 2023  QUANTIFYING MEMORIZATION ACROSS NEURAL LANGUAGE MODELS  Nicholas Carlini∗1  Katherine Lee1,3  Daphne Ippolito1,2 Florian Tramèr1
----------------
article 2 Search | arXiv e-print repository - Jagielski Created: June 18, 2024 4:30 PM URL: https://arxiv.org/search/cs?searchtype=author&query=Jagielski%2C+M Searching in archive cs. Search in all archives. Search term or terms All fieldsTitleAuthor(s)AbstractCommentsJournal referenceACM classificationMSC classificationReport numberarXiv i

### Notes

The following is an alternative to the search above. It uses examples taken from the [SBERT](https://www.sbert.net/index.html) documentation, specifically [this example notebook on searching Wikipedia entries from UKPLab](https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb). You can mix and match as you see fit and modify the search function below to turn on things like cross-encoding.

As you can see, they also run BM25 (here as a lexical baseline you can print for comparison), then use the bi-encoder to retrieve documents by semantic similarity and the cross-encoder to re-rank them.
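
Conceptually, the bi-encoder step turns the query and every document into a vector and ranks documents by how closely their vectors point in the same direction. The sketch below shows that idea with plain cosine similarity; the hand-written three-dimensional vectors stand in for real embeddings, and the helper names `cosine` and `top_k_by_cosine` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec, corpus_vecs, k=2):
    """Return the k best matches as dicts with corpus_id and score."""
    hits = [
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:k]

# toy "embeddings": real ones have hundreds of dimensions
toy_corpus = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_hits = top_k_by_cosine([0.9, 0.1, 0.0], toy_corpus, k=2)
```

The returned dicts use the same `corpus_id` and `score` keys as the hits in the search function below, so the two are easy to compare.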

In [17]:
#We use the Bi-Encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256     #Truncate long passages to 256 tokens

#The bi-encoder retrieves the candidate documents (top_k is set in search below). We use a cross-encoder to re-rank the results list to improve the quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [18]:
corpus_embeddings = bi_encoder.encode([d.get('text') for d in DOCUMENT_STORE] , convert_to_tensor=True, show_progress_bar=True)
#For now using full text, but in the example they use the first paragraph...

Batches:   0%|          | 0/27 [00:00<?, ?it/s]

In [19]:
def search(query, results=5, print_output=False, use_cross=False):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(tokenize.word_tokenize(query))
    top_n = np.argpartition(bm25_scores, -results)[-results:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)
    
    if print_output:
        print("Top-{} lexical search (BM25) hits".format(results))
        for hit in bm25_hits[0:results]:
            print("\t{:.3f}\t{}".format(hit['score'], DOCUMENT_STORE[hit['corpus_id']].get('text').replace("\n", " ")[:150]))

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=results * 10) # before top_k was 32
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, DOCUMENT_STORE[hit['corpus_id']].get('text')] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Sort results by the cross-encoder scores
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of hits from bi-encoder
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)

    if print_output:
        print("\n-------------------------\n")
        print("Top-{} Bi-Encoder Retrieval hits".format(results))
        for hit in hits[0:results]:
            print("\t{:.3f}\t{}".format(hit['score'], DOCUMENT_STORE[hit['corpus_id']].get('text').replace("\n", " ")[:150]))

    if use_cross:
        hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
        # Output of hits from re-ranker
        if print_output:
            print("\n-------------------------\n")
            print("Top-{} Cross-Encoder Re-ranker hits".format(results))
            for hit in hits[0:results]:
                print("\t{:.3f}\t{}".format(hit['cross-score'], DOCUMENT_STORE[hit['corpus_id']].get('text').replace("\n", " ")[:150]))

    for h in hits:
        h["file_name"] = DOCUMENT_STORE[h["corpus_id"]].get('file')
    return hits
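
Each hit returned by `search` is a dict carrying both the bi-encoder `score` and the `cross-score`, so you can re-sort the list afterwards without re-running either model. A small sketch, assuming a hypothetical helper `rerank_by`; the sample hits are made up for illustration and are not real output.

```python
def rerank_by(hits, key="cross-score", top=5):
    """Return the `top` hits ordered by the given score key, highest first."""
    return sorted(hits, key=lambda h: h[key], reverse=True)[:top]

# illustrative values only
sample_hits = [
    {"corpus_id": 0, "score": 0.69, "cross-score": -1.9},
    {"corpus_id": 1, "score": 0.60, "cross-score": 4.2},
    {"corpus_id": 2, "score": 0.56, "cross-score": -2.1},
]
by_cross = rerank_by(sample_hits)             # cross-encoder order
by_bi = rerank_by(sample_hits, key="score")   # bi-encoder order
```

Comparing the two orderings on your own queries is a quick way to decide whether cross-encoding is worth turning on.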

To test it out and compare with the previous results, put your query below:

In [20]:
hits = search("how does memorization work in deep learning?")

Input question: how does memorization work in deep learning?


In [21]:
hits[:10]

[{'corpus_id': 752,
  'score': 0.687652587890625,
  'cross-score': np.float32(4.219793),
  'file_name': '/Users/kjam/Documents/reading backlog/TODO-READonce-does-learning-require-memorization.pdf'},
 {'corpus_id': 94,
  'score': 0.5995245575904846,
  'cross-score': np.float32(-1.947927),
  'file_name': '/Users/kjam/Documents/probably_private_notion_links/[2202 07646] Quantifying Memorization Across Neura d6f67ee82e0448358e9f2884239d7336.md'},
 {'corpus_id': 693,
  'score': 0.5987856388092041,
  'cross-score': np.float32(-1.9506565),
  'file_name': '/Users/kjam/Documents/probably_private_notion_links/[2202 07646] Quantifying Memorization Across Neura d1d859725c9e4871847124ad89aa73c5.md'},
 {'corpus_id': 26,
  'score': 0.5620447397232056,
  'cross-score': np.float32(-2.0983696),
  'file_name': '/Users/kjam/Documents/probably_private_notion_links/Do Machine Learning Models Memorize or Generalize 151062a01ace47d486adcb95dfd7f309.md'},
 {'corpus_id': 735,
  'score': 0.560981810092926,
  'cr

Now you are ready to use these results to interact with an AI chat assistant. You'll want to [install llamafiles for the models you want to use](https://github.com/Mozilla-Ocho/llamafile), and then you need to have one running in the background before you run the next few cells. You can do that by following the llamafile instructions on the README.

The llamafile will get the prompt outlined in the prompt_template below, so feel free to modify as you see fit and experiment!

Note: The helper files in this folder to start and use the llama client are taken and modified from [Mozilla's example RAG system](https://github.com/Mozilla-Ocho/llamafile-rag-example).

In [None]:
def ask_llamafile(query, hit_list, num=5):
    prompt_template = (
        "You are an expert Q&A system. Answer the user's query using the provided context information. "
        "Summarize key points of the context information as part of your answer.\n"
        "Context information:\n"
        "%s\n"
        "Query: %s"
    )
    # note: the [:num-1] slice means num-1 documents are used (4 with the default num=5)
    indices = set(h["corpus_id"] for h in hit_list[:num-1])
    text_bits = " ".join(itertools.chain.from_iterable(DOCUMENT_STORE[i].get("text").split('\n')[:100] for i in indices))
    # this takes only the first 100 lines from each document (a "line" is sometimes a sentence, sometimes a paragraph), so test different values
    prompt = prompt_template % (text_bits, query)
    prompt_ntokens = len(llamafile.tokenize(prompt, port=8080))
    print(f"(prompt_ntokens: {prompt_ntokens})")
    
    print("=== Answer ===")
    answer = llamafile.completion(prompt)
    print(f'"{answer}"')
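
`ask_llamafile` collects document ids in a `set`, which removes duplicates but does not guarantee that the best-ranked document comes first in the prompt. If prompt order matters for your model, a variation that keeps rank order might look like the sketch below. The helper `build_context` and the toy data are hypothetical, and note that it takes `num` documents rather than `num-1`.

```python
from itertools import chain

def build_context(hit_list, document_store, num=5, max_lines=100):
    """Join the first `max_lines` lines of the top `num` distinct documents, in rank order."""
    ordered_ids = []
    for hit in hit_list:
        if hit["corpus_id"] not in ordered_ids:
            ordered_ids.append(hit["corpus_id"])
        if len(ordered_ids) == num:
            break
    lines = chain.from_iterable(
        document_store[i]["text"].split("\n")[:max_lines] for i in ordered_ids
    )
    return " ".join(lines)

# toy stand-ins for DOCUMENT_STORE and the hits returned by search
toy_store = [{"text": "alpha\nbeta"}, {"text": "gamma"}, {"text": "delta\nepsilon"}]
toy_hit_list = [{"corpus_id": 2}, {"corpus_id": 0}, {"corpus_id": 2}, {"corpus_id": 1}]
context = build_context(toy_hit_list, toy_store, num=2, max_lines=1)
```

Lowering `max_lines` is also the simplest lever if the prompt token count printed above is too large for your model.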

Depending on the length of your documents and the speed of your computer, the next cell could take a minute. You can keep track of it by watching the shell that is running your llamafile.

You can also test different models to see if one meets your speed and accuracy requirements. Start by reading through [Mozilla's advice on which models work for RAGs](https://future.mozilla.org/builders/news_insights/llamafiles-for-embeddings-in-local-rag-applications/), but certainly test a few for your data and use case.

In the meantime, I recommend refreshing your water, tea or coffee and doing a yoga pose!

In [27]:
ask_llamafile(query, hits)

(prompt_ntokens: 4526)
=== Answer ===
".

Memorization in AI systems is a concern that has been raised due to the ability of large language models (LLMs) to memorize and repeat training data verbatim when prompted appropriately. This raises issues related to privacy, utility, and fairness. In a recent paper, the authors quantify memorization across different neural language models, finding that different models exhibit varying levels of memorization. They also propose methods to mitigate memorization, such as training with fewer examples or using adversarial training techniques. Overall, the paper highlights the importance of understanding and addressing memorization in AI systems to ensure privacy, utility, and fairness.</s>"


Putting it all together, you can use the following function and example below, or modify as you like based on your own experimentation!

In [28]:
def search_and_respond(query, num=5):
    hits = search(query, results=num)
    print("related documents: {}".format(', '.join(h.get('file_name') for h in hits[:num-1])))
    ask_llamafile(query, hits, num=num)

In [29]:
search_and_respond("what are the key factors linking memorization in AI systems to privacy?")

Input question: what are the key factors linking memorization in AI systems to privacy?
related documents: /Users/kjam/Documents/probably_private_notion_links/[2206 10469] The Privacy Onion Effect Memorization 8a3a8280682e413d918aa40de271ba3a.md, /Users/kjam/Documents/reading backlog/SSRN-id4713111.pdf, /Users/kjam/Documents/probably_private_notion_links/[1803 10266] Privacy-preserving Prediction b4eedef0ead74c3198ae2e6eeecf52fa.md, /Users/kjam/Documents/probably_private_notion_links/Leveraging transfer learning for large scale diffe efa775593e4f492badff4f9cdf3806c4.md
(prompt_ntokens: 3343)
=== Answer ===
"  
Answer: Memorization in AI systems can lead to privacy issues because the model associates the data points it sees during training with the outcomes it generates. This can result in the memorization of sensitive information, making it vulnerable to privacy attacks. Additionally, removing certain outlier points from the data can expose the model to new privacy risks. The existence