<a href="https://colab.research.google.com/github/danschlz/ebook-search/blob/main/Ebook_Semantic_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notes on usage:

- Make sure to [change runtime to GPU](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm).
- Upload an epub file representing the ebook you want to search (tip: ever heard of [libgen](https://libgen.is/)?).
- Re-run the last cell using different queries to keep searching the same book.

Optional:
- Embeddings for the book you upload will be saved in Files (in the left menu bar) under the title 'embeddings-{first chapter}-{last chapter}-{epub filename}.json'.
  - Download this file and upload it (instead of an epub) in your next runtime session to avoid generating the embeddings again.
- Run 'process_file' with 'preview_mode' set to True first to check which range of chapters you want to index. This helps you avoid needlessly creating embeddings for chapters like 'Notes' and 'Works Cited'.


In [None]:
# upload epub (or json of book embeddings generated by this program)
from google.colab import files
uploaded = files.upload()
path = next(iter(uploaded))

Saving Enron Corp_Elkind, Peter_McLean, Bethany - The smartest guys in the room the amazing rise and scandalous fall of Enron-Penguin Group US_Portfolio_Penguin (2013).epub to Enron Corp_Elkind, Peter_McLean, Bethany - The smartest guys in the room the amazing rise and scandalous fall of Enron-Penguin Group US_Portfolio_Penguin (2013).epub


In [None]:
!pip install -q ebooklib sentence_transformers
from sentence_transformers import SentenceTransformer, util
import json
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup
from os.path import exists
from IPython.display import HTML, display
import numpy as np
import math

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')



In [None]:
def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
def part_to_chapter(part):
    soup = BeautifulSoup(part.get_body_content(), 'html.parser')
    paragraphs = [para.get_text().strip() for para in soup.find_all('p')]
    paragraphs = [para for para in paragraphs if len(para) > 0]
    if len(paragraphs) == 0:
        return None
    title = ' '.join([heading.get_text() for heading in soup.find_all('h1')])
    return {'title': title, 'paras': paragraphs}

min_words_per_para = 150
max_words_per_para = 500

def format_paras(chapters):
    for i, chapter in enumerate(chapters):
        paras = chapter['paras']
        j = 0
        # while loop (not a fixed range) so that pieces inserted by a split are visited too
        while j < len(paras):
            split_para = paras[j].split()
            # split an over-long paragraph; the remainder becomes the next paragraph
            if len(split_para) > max_words_per_para:
                paras.insert(j + 1, ' '.join(split_para[max_words_per_para:]))
                paras[j] = ' '.join(split_para[:max_words_per_para])
            # merge following paragraphs into a short one until it is long enough
            k = j
            while len(paras[j].split()) < min_words_per_para and k < len(paras) - 1:
                paras[j] += '\n' + paras[k + 1]
                paras[k + 1] = ''
                k += 1
            j += 1

        # drop the paragraphs that were emptied by merging
        chapter['paras'] = [para.strip() for para in paras if len(para.strip()) > 0]
        if len(chapter['title']) == 0:
            chapter['title'] = '(Unnamed) Chapter {no}'.format(no=i + 1)

def print_previews(chapters):
    for (i, chapter) in enumerate(chapters):
        title = chapter['title']
        wc = len(' '.join(chapter['paras']).split())  # split() also breaks on the '\n' added when merging
        paras = len(chapter['paras'])
        initial = chapter['paras'][0][:30]
        preview = '{}: {} | wc: {} | paras: {}\n"{}..."\n'.format(i, title, wc, paras, initial)
        print(preview)

def get_chapters(book_path, print_chapter_previews, first_chapter, last_chapter):
    book = epub.read_epub(book_path)
    parts = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
    # parse each part once and keep only the parts that contain paragraphs
    chapters = [chapter for chapter in map(part_to_chapter, parts) if chapter is not None]
    last_chapter = min(last_chapter, len(chapters) - 1)
    chapters = chapters[first_chapter:last_chapter + 1]
    format_paras(chapters)
    if print_chapter_previews:
        print_previews(chapters)
    return chapters
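
To see what the word limits do to passage sizes, here is a minimal sketch of the split-then-merge idea on a toy list of paragraphs. It is a simplified two-pass variant, not the notebook's exact in-place loop: the helper name `chunk_paragraphs` and the tiny limits (`min_words=3`, `max_words=5`) are assumptions for illustration, while the notebook itself uses 150 and 500.

```python
def chunk_paragraphs(paras, min_words=3, max_words=5):
    """Split over-long paragraphs, then merge short ones with what follows.

    Hypothetical helper: same idea as format_paras, but it builds a new
    list instead of editing the chapter in place.
    """
    # Pass 1: cut any paragraph longer than max_words into max_words-sized pieces.
    pieces = []
    for para in paras:
        words = para.split()
        for start in range(0, len(words), max_words):
            pieces.append(' '.join(words[start:start + max_words]))

    # Pass 2: while the latest chunk is still too short, append the next piece to it.
    chunks = []
    for piece in pieces:
        if chunks and len(chunks[-1].split()) < min_words:
            chunks[-1] += '\n' + piece
        else:
            chunks.append(piece)
    return chunks


paras = ['one two', 'three', 'a b c d e f g', 'tail']
for chunk in chunk_paragraphs(paras):
    print(repr(chunk))
```

With these toy limits, the two short paragraphs at the start are merged, the seven-word paragraph is cut after five words, and its two-word remainder absorbs the final paragraph.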

In [None]:
def get_embeddings(texts):
    if isinstance(texts, str):
        texts = [texts]
    texts = [text.replace("\n", " ") for text in texts]
    return model.encode(texts)

In [None]:
def read_json(json_path):
    print('Loading embeddings from "{}"'.format(json_path))
    with open(json_path, 'r') as f:
        values = json.load(f)
    return (values['chapters'], np.array(values['embeddings']))

def read_epub(book_path, json_path, preview_mode, first_chapter, last_chapter):
    chapters = get_chapters(book_path, preview_mode, first_chapter, last_chapter)
    if preview_mode:
        return (chapters, None)
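    # report the last chapter actually indexed (last_chapter may still be math.inf here)
    print('Generating embeddings for chapters {}-{} in "{}"\n'.format(first_chapter, first_chapter + len(chapters) - 1, book_path))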
    paras = [para for chapter in chapters for para in chapter['paras']]
    embeddings = get_embeddings(paras)
    try:
        with open(json_path, 'w') as f:
            json.dump({'chapters': chapters, 'embeddings': embeddings.tolist()}, f)
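    except (OSError, TypeError, ValueError) as e:
        # saving the cache is optional, so report the problem and carry on
        print('Failed to save embeddings to "{}": {}'.format(json_path, e))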
    return (chapters, embeddings)
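
The cache file written by `read_epub` is plain JSON, so the NumPy array is converted with `.tolist()` on the way out and rebuilt with `np.array` on the way in. A minimal sketch of that round trip, assuming made-up chapter data and a temporary file (`embeddings-demo.json` is a hypothetical name) standing in for the real cache:

```python
import json
import os
import tempfile

import numpy as np

# Toy stand-ins for the notebook's `chapters` and `embeddings` (values are made up).
chapters = [{'title': 'Intro', 'paras': ['first passage', 'second passage']}]
embeddings = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, 'embeddings-demo.json')  # hypothetical file name

    # An ndarray is not JSON-serializable, so store nested lists instead.
    with open(json_path, 'w') as f:
        json.dump({'chapters': chapters, 'embeddings': embeddings.tolist()}, f)

    # Reading back yields nested lists; np.array restores the 2-D array.
    with open(json_path, 'r') as f:
        values = json.load(f)

loaded_chapters = values['chapters']
loaded_embeddings = np.array(values['embeddings'])

print(loaded_embeddings.shape)
print(np.allclose(loaded_embeddings, embeddings))
```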

In [None]:
def process_file(path, preview_mode=False, first_chapter=0, last_chapter=math.inf):
    values = None
    if path.endswith('.json'):
        values = read_json(path)
    elif path.endswith('.epub'):
        json_path = 'embeddings-{}-{}-{}.json'.format(first_chapter, last_chapter, path)
        if exists(json_path):
            values = read_json(json_path)
        else:
            values = read_epub(path, json_path, preview_mode, first_chapter, last_chapter)
    else:
        print('Invalid file format. Either upload an epub or a json of book embeddings.')
    return values
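
One detail worth knowing when looking for the cache in Files: `process_file` builds the file name from its arguments as given, so with the default `last_chapter=math.inf` the name contains the literal text `inf`. A quick sketch of the naming, where `cache_name` is a hypothetical helper and `book.epub` is a placeholder file name:

```python
import math


def cache_name(path, first_chapter=0, last_chapter=math.inf):
    # Same format string that process_file uses for json_path.
    return 'embeddings-{}-{}-{}.json'.format(first_chapter, last_chapter, path)


print(cache_name('book.epub'))
print(cache_name('book.epub', first_chapter=8, last_chapter=18))
```

Because the chapter range is part of the name, a cache created for one range is not picked up when you later ask for a different range.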

In [None]:
# The comments below are only relevant if you want to avoid embedding chapters you don't need.

# Run this with 'preview_mode' on if you want to figure out which chapters to include.
# For example, after you run 'process_file(path, preview_mode=True)',
# you might notice that chapters 0-7 and 19-27 are useless endnotes/intro stuff.
# You can then run 'process_file(path, first_chapter=8, last_chapter=18)'.

chapters, embeddings = process_file(path)



Generating embeddings for chapters 0-32 in "Enron Corp_Elkind, Peter_McLean, Bethany - The smartest guys in the room the amazing rise and scandalous fall of Enron-Penguin Group US_Portfolio_Penguin (2013).epub"



In [None]:
def print_and_write(text, f):
    print(text)
    f.write(text + '\n')

def index_to_para_chapter_index(index, chapters):
    for chapter in chapters:
        paras_len = len(chapter['paras'])
        if index < paras_len:
            return chapter['paras'][index], chapter['title'], index
        index -= paras_len
    return None

def search(query, embeddings, n=3):
    query_embedding = get_embeddings(query)[0]
    # cosine similarity between the query and every passage embedding
    scores = np.dot(embeddings, query_embedding) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding))
    # indices of the n highest-scoring passages, best first
    results = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)[:n]

    # append the results to a log file; 'with' closes the file when done
    with open('result.text', 'a') as f:
        header_msg = 'Results for query "{}" in "{}"'.format(query, path)
        print_and_write(header_msg, f)
        for index in results:
            para, title, para_no = index_to_para_chapter_index(index, chapters)
            result_msg = '\nChapter: "{}", Passage number: {}, Score: {:.2f}\n"{}"'.format(title, para_no, scores[index], para)
            print_and_write(result_msg, f)
        print_and_write('\n', f)
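
The ranking in `search` is cosine similarity followed by a lookup from a flat passage index back to its chapter. A minimal sketch with hand-made 2-D vectors in place of real model embeddings; the vectors, the chapter titles and the helper `locate` are assumptions for illustration:

```python
import numpy as np

chapters = [
    {'title': 'One', 'paras': ['p0', 'p1']},
    {'title': 'Two', 'paras': ['p2', 'p3', 'p4']},
]
# One row per passage, in the same order as the flattened paragraphs.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [2.0, 1.0]])
query_embedding = np.array([1.0, 0.0])

# Cosine similarity: dot product divided by the product of the vector lengths.
scores = embeddings @ query_embedding / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
)
top = np.argsort(scores)[::-1][:2]  # indices of the two best passages, best first


def locate(index, chapters):
    """Map a flat passage index to (chapter title, index within that chapter)."""
    for chapter in chapters:
        if index < len(chapter['paras']):
            return chapter['title'], index
        index -= len(chapter['paras'])
    return None


for i in top:
    print(locate(int(i), chapters), round(float(scores[i]), 2))
```

Dividing by the norms means only the direction of a vector matters, so a longer vector such as `[2.0, 1.0]` gets no advantage over `[1.0, 0.0]` just for being longer.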

In [None]:
query = 'what areas of the world will be most harmed by climate change' #@param {type:"string"}
search(query, embeddings)

Results for query "what areas of the world will be most harmed by climate change" in "The Wizard and the Prophet.epub"

Chapter: "[ SEVEN ] Air: Climate Change", Passage number: 31, Score: 0.66
"The most likely victims of climate change, in the short run, are people who live on oceanic islands, in very low-lying coastal settlements, in ice-bound Arctic communities, and around forests that burn after unwonted dry spells. Millions of people live in these places, but they are a small fraction of the world’s billions. The greatest potential harms of climate change will be experienced by future generations—centuries in the future, or even millennia. By our actions today (burning fossil fuels), the argument is, we are dumping problems (drought, sea-level rise) on tomorrow.
On the one hand, forcing other people to clean up our mess violates basic notions of fairness. On the other hand, actually preventing climate-change problems would require societies today to make investments, some of them 