# Book2Vec

**Calculate embeddings and create a vector store for books 📖**

In this notebook we'll use George Orwell's 1984.

**How to use for other books:**
1. Find a PDF of the book and point `book_path` at it (swap the `wget` URL below for your own)
2. Run all the cells

*Notes:*
- At the end of this, you can optionally upload your vector store to https://huggingface.co as a dataset so it can be accessed easily.

## Install stuff we need

In [None]:
!pip install -qqq langchain InstructorEmbedding sentence_transformers faiss-cpu huggingface_hub pypdf

In [None]:
import os
from pathlib import Path
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores.faiss import FAISS
from huggingface_hub import snapshot_download
from langchain import PromptTemplate
from IPython.display import display, Markdown
from langchain.document_loaders import PagedPDFSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

## Download the book

In [None]:
!wget https://ia801300.us.archive.org/28/items/GeorgeOrwells1984/1984.pdf

--2023-03-03 10:50:32--  https://ia801300.us.archive.org/28/items/GeorgeOrwells1984/1984.pdf
Resolving ia801300.us.archive.org (ia801300.us.archive.org)... 207.241.228.130
Connecting to ia801300.us.archive.org (ia801300.us.archive.org)|207.241.228.130|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1366706 (1.3M) [application/pdf]
Saving to: ‘1984.pdf.1’


2023-03-03 10:50:32 (3.29 MB/s) - ‘1984.pdf.1’ saved [1366706/1366706]



If you're using Google Drive, you can save this (and later, your vector store) there.

First give your notebook permission to access Drive, then set the path to save as you normally would.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# !wget https://ia801300.us.archive.org/28/items/GeorgeOrwells1984/1984.pdf -P '/content/drive/MyDrive/projects/talk2book'

## Split pages

In [None]:
book_path = "1984.pdf"

In [None]:
loader = PagedPDFSplitter(book_path)
pages = loader.load_and_split()

In [None]:
len(pages)

393

In [None]:
pages[0]

Document(page_content='Download free eBooks of classic literature, books and \nnovels at Planet eBook. Subscribe to our free eBooks blog \nand email newsletter.1984\nBy George Orwell', lookup_str='', metadata={'source': '1984.pdf', 'page': 0}, lookup_index=0)

In [None]:
pages[2].metadata

{'source': '1984.pdf', 'page': 2}

### Set how text will be split into chunks for long pages

In [None]:
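text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # maximum characters per chunk
    chunk_overlap=10,    # characters shared between neighbouring chunks
    length_function=len,
    # `separators` must be a list: try paragraph breaks first, then fall back
    separators=["\n\n", "\n", " ", ""],
)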
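
To get a feel for what `chunk_size` and `chunk_overlap` do, here is a minimal sketch of a plain sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that one tries to break at the separators first); `sliding_chunks` is a hypothetical helper written just for illustration.

```python
def sliding_chunks(text, chunk_size=800, chunk_overlap=10):
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - chunk_overlap
    # stop early enough that the last window is not made of overlap only
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# tiny demo with small numbers so the overlap is easy to see
for chunk in sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(repr(chunk))
```

With the settings used in this notebook, a page longer than 800 characters ends up as several chunks, and each chunk repeats a few characters from the one before it.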

## Download embedding model

In [None]:
embeddings = HuggingFaceInstructEmbeddings(
    embed_instruction="Represent the book passage for retrieval: ",
    query_instruction="Represent the question for retrieving supporting texts from the book passage: "
    )

load INSTRUCTOR_Transformer
max_seq_length  512
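
The two instructions above are not embedded on their own. As far as we understand the INSTRUCTOR models, each input is an `[instruction, text]` pair, with one instruction for book passages and another for questions. The sketch below only builds those pairs; the helper names are hypothetical and the actual encoding is left to the embedding class.

```python
EMBED_INSTRUCTION = "Represent the book passage for retrieval: "
QUERY_INSTRUCTION = (
    "Represent the question for retrieving supporting texts from the book passage: "
)


def build_passage_inputs(texts, instruction=EMBED_INSTRUCTION):
    """Pair every book passage with the passage instruction."""
    return [[instruction, text] for text in texts]


def build_query_input(question, instruction=QUERY_INSTRUCTION):
    """Pair a single question with the query instruction."""
    return [instruction, question]


passage_inputs = build_passage_inputs(["It was a bright cold day in April."])
query_input = build_query_input("Who is big brother?")
print(passage_inputs)
print(query_input)
```

This is why the same instructions have to be used again later when we load the vector store and search it.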


## Create embeddings and save faiss vector store to the path specified in `save_to`

In [None]:
save_to = "1984_faiss_custom_embedding"

In [None]:
print(f"> Total pages: {len(pages)}")

docsearch = None  # in-memory vector store, created from the first non-empty page

for n, page in enumerate(pages, start=1):

    # split text
    texts = text_splitter.split_text(page.page_content)

    # use the same metadata for each chunk of text from the same page
    metadatas = [page.metadata] * len(texts)

    # create embeddings and add to vector store (skip pages with no text)
    if texts:
        if docsearch is None:
            docsearch = FAISS.from_texts(texts, embedding=embeddings, metadatas=metadatas)
        else:
            docsearch.add_texts(texts, metadatas=metadatas)

    print(f"{n}/{len(pages)}")

# save once at the end instead of reloading and rewriting the index for every page
docsearch.save_local(folder_path=save_to)

print("> DONE")
print(f"> Saved to: {save_to}")
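
One thing to keep in mind about the metadata step: multiplying a list, as in `[page.metadata] * len(texts)`, repeats the *same* dict object for every chunk. That is fine here because nothing changes the metadata afterwards. If you want per-chunk fields (a chunk index, say), copy the dict for each chunk instead. A minimal sketch, where `chunks_with_metadata` and the `chunk` key are hypothetical:

```python
def chunks_with_metadata(page_text, page_metadata, split):
    """Split one page and give every chunk its own copy of the page metadata."""
    texts = split(page_text)
    metadatas = []
    for i, _ in enumerate(texts):
        meta = dict(page_metadata)  # copy, so chunks do not share one dict
        meta["chunk"] = i           # hypothetical extra field
        metadatas.append(meta)
    return texts, metadatas


# demo with a stand-in splitter (str.split) instead of the real text splitter
texts, metadatas = chunks_with_metadata(
    "war is peace", {"source": "1984.pdf", "page": 2}, str.split
)
for text, meta in zip(texts, metadatas):
    print(text, meta)
```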

### Upload your embeddings to Hugging Face

1. Create an account on, or log in to, https://huggingface.co
2. Create a new *dataset*
3. Upload the folder specified in `save_to`

Example: https://huggingface.co/datasets/calmgoose/orwell-1984_faiss-instructembeddings
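
If you'd rather do the upload from code, something like the sketch below should work with `huggingface_hub`. Assumptions: you are already logged in (for example via `huggingface-cli login`), the folder holds the default `index.faiss` and `index.pkl` files, and the repo id is a placeholder you replace with your own. The helper names are hypothetical.

```python
from pathlib import Path

# assumed default file names written when the vector store is saved
EXPECTED_FILES = ("index.faiss", "index.pkl")


def missing_index_files(folder):
    """Return the expected vector store files that are not in `folder`."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


def upload_vector_store(folder, repo_id):
    """Create the dataset repo if needed and upload the whole folder to it."""
    # imported here so the check above can be used without huggingface_hub
    from huggingface_hub import HfApi

    missing = missing_index_files(folder)
    if missing:
        raise FileNotFoundError(f"{folder} is missing: {', '.join(missing)}")

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="dataset")


# upload_vector_store("1984_faiss_custom_embedding", "your-username/your-dataset")
```

The call at the bottom is commented out on purpose, since it needs network access and your own account.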

## Now let's do a similarity search on what we just created!

In [None]:
question = "Who is big brother?"

In [None]:
display(Markdown(question))

Who is big brother?

### From our local vector store:

In [None]:
docsearch = FAISS.load_local(folder_path=save_to, embeddings=embeddings)

search = docsearch.similarity_search(question, k=4)

for item in search:
    print(item.page_content)
    print(f"Page: {item.metadata['page']}")
    print("---")
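
What does `similarity_search` do under the hood? Roughly: embed the question, then return the `k` stored chunks whose vectors are closest to it. The sketch below does that by brute force in plain Python with Euclidean distance. It assumes a flat L2 index, which we believe is what the FAISS wrapper builds by default, and the toy 2-d vectors stand in for real embeddings. `top_k` is a hypothetical helper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, stored, k=4):
    """Return the k stored items closest to the query vector.

    `stored` is a list of (vector, text, metadata) tuples.
    """
    ranked = sorted(stored, key=lambda item: l2_distance(query_vector, item[0]))
    return ranked[:k]


# toy vectors instead of real embeddings
stored = [
    ([0.0, 0.0], "chunk a", {"page": 1}),
    ([1.0, 1.0], "chunk b", {"page": 2}),
    ([5.0, 5.0], "chunk c", {"page": 3}),
]

for vector, text, metadata in top_k([0.9, 0.9], stored, k=2):
    print(text, f"Page: {metadata['page']}")
```

A real index compares much longer vectors, but the idea of ranking by distance is the same.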

### From Hugging Face:

In [None]:
# download the `vectorstore` folder
book="1984"
cache_dir=f"{book}_cache"
vectorstore = snapshot_download(repo_id="calmgoose/book-embeddings",
                                repo_type="dataset",
                                revision="main",
                                allow_patterns=f"books/{book}/*",
                                cache_dir=cache_dir,
                                )

# get path to the `vectorstore` folder that you just downloaded
# `snapshot_download` returns the local snapshot folder, which mirrors the repo layout.
# (the earlier `os.walk` lookup never matched: `dirs` only holds single folder
# names, so "books/1984" was never found and `target_path` was left undefined)
target_path = os.path.join(vectorstore, "books", book)

# load embeddings
# this is what was used to create embeddings for the book
embeddings = HuggingFaceInstructEmbeddings(
    embed_instruction="Represent the book passage for retrieval: ",
    query_instruction="Represent the question for retrieving supporting texts from the book passage: "
    )

# load vector store to use with langchain
docsearch = FAISS.load_local(folder_path=target_path, embeddings=embeddings)

# similarity search
search = docsearch.similarity_search(question, k=4)

print(f"Question: {question}")

for item in search:
    print(item.page_content)
    print(f"From page: {item.metadata['page']}")
    print("---")

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

load INSTRUCTOR_Transformer
max_seq_length  512
Question: Who is big brother?
immediately below it. The consciousness of the masses 
needs only to be influenced in a negative way.
Given this background, one could infer, if one did not 
know it already, the general structure of Oceanic society. At 
the apex of the pyramid comes Big Brother. Big Brother is in -
fallible and all-powerful. Every success, every achievement, 
every victory, every scientific discovery, all knowledge, all 
wisdom, all happiness, all virtue, are held to issue directly 
from his leadership and inspiration. Nobody has ever seen 
Big Brother. He is a face on the hoardings, a voice on the 
telescreen. We may be reasonably sure that he will never die, 
and there is already considerable uncertainty as to when he 
was born. Big Brother is the guise in which the Party choos -
From page: 261
---
of being spoken. Then the face of Big Brother faded away 
again, and instead the three slogans of the Party stood out
From pag