# Quote Generation/Retrieval

Initial quotes were scraped from one webpage. But that would lead to quite a lot of repitition and kindof a boring bot. So...

### 2 Issues with Index retrieved/Generated Quotes:
**Goal:** To have ~100 quotes to have tweeted on a rolling basis since this is a quarter of a year and will not likely to be easily repeated.

### Probability of Repeating a Quote Exactly 2 Weeks Later

\begin{equation*}
P(\text{repeated in 2 weeks}) = 1 - \left( \frac{N - 1}{N} \right)^{\frac{14}{m}}
\end{equation*}

 Substituting in the values:

 \begin{equation*}
P(\text{repeated in 2 weeks}) = 1 - \left( \frac{99}{100} \right)^{14}
\end{equation*}

 Calculating this gives:

 \begin{equation*}
P(\text{repeated in 2 weeks}) \approx 0.135
\end{equation*}

Personally, I'd like it to be under 10% for a longer period of time (increases probability)

Tweaking the numbers a bit 
N = 200
days = 21

\begin{equation*}
P(\text{repeated in 3 weeks}) \approx  0.0980
\end{equation*}

The following should be conditional 1) That the quote generated has confidence that the quote was written by Ovid then 2) That the quote is not a variation of a quote already in DB.

In [2]:
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os
import openai
import pandas as pd

from itertools import product

# import utils
import configparser
from pprint import pprint
from llama_index import SimpleDirectoryReader


import os
import openai


import jupyter_black

jupyter_black.load()
# Access values from the sections

import sys

sys.path.append("../app")
import conn_utils

OPENAI_API_KEY = conn_utils.get_open_ai_key("./../app/config.ini")

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
directory = "index_store"

pd.set_option("display.max_colwidth", None)

In [3]:
!ls ./../data

AmoresV2.txt                    MetamorphosesV2.txt
ArsAmatoria.txt                 PoemsFromExile.txt
FastiV2.txt                     RemediaAmorisV2.txt
HeroidesV2.txt                  TheOdesofHorace_10227561(3).txt
LostPoemsV2.txt                 notes.jsonl
LoversAssistant.txt             notes_validation.jsonl


## Get low sentency similarity

Instead of using a LLM. You can trust to get a close (but not too close) similarity

In [4]:
import spacy

# Read in exisitng quotes
df = pd.read_json("./../app/ovid_quotes.json")

# Load the spaCy English model
nlp = spacy.load("en_core_web_lg")

# Define the file paths
f1 = "./../data/FastiV2.txt"
f2 = "./../data/MetamorphosesV2.txt"
f3 = "./../data/AmoresV2.txt"
f4 = "./../data/LostPoemsV2.txt"
f5 = "./../data/HeroidesV2.txt"
f6 = "./../data/ArsAmatoria.txt"
f7 = "./../data/RemediaAmorisV2.txt"
f8 = "./../data/LoversAssistant.txt"

# Read in the text files and split sentences
from tqdm import tqdm

# Override the max length of the text
nlp.max_length = 1500000  # or any value that accommodates your text length

with open(os.path.join("./../data", "LostPoemsV2.txt"), "r") as f:
    text = f.read()
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
# Print the list of sentences
print(f"Num sentences: {len(sentences)}")
pprint(sentences[:5])

Num sentences: 6897
['Ovid: The Poems Of Exile \n'
 '(Tristia, Ex Ponto, Ibis) \n'
 ' \n'
 'Home\n'
 'Download\n'
 '2\n'
 'Translated by A. S. Kline \uf8e92003\n'
 'All Rights Reserved \n'
 'This work may be freely \n'
 'reproduced, stored, and \n'
 'transmitted, electronically or \n'
 'otherwise, for any non-commercial \n'
 'purpose. \n'
 ' \n',
 '3\n'
 'Contents\n'
 'Tristia Book '
 'I.................................................................. 11 \n'
 'Book TI.I:1-68 The Poet to His Book: Its Nature ........... 11 \n'
 'Book TI.I:70-128 The Poet to His Book: His Works...... 14 \n'
 'Book TI.II:1-74 The Journey: Storm at Sea.................... 17 \n'
 'Book TI.II:75-110 The Journey: The Destination........... 21 \n'
 'Book TI.III:1-46 The Final Night in Rome:',
 'Preparation 23 \nBook TI.III:47-102 The Final Night in Rome: Departure25 \n',
 'Book TI.IV:1-28 Troubled Waters.................................. 28 \n'
 'Book TI.V:1-44 Loyalty in Friendship .........................

In [5]:
# import pandas as pd


#df = pd.read_json("./../app/ovid_quotes.json")
# ## Update add new quotes to the dataframe
# new_row = {
#     "Quote": """You will go most safely by the middle way.""",
#     "Work": "Metamorphoses",
#     "Quote in Latin": "",
# }
# df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
# df.to_json("./../app/ovid_quotes.json")
# df.shape

In [104]:
df.sample(5)

Unnamed: 0,Quote,Quote in Latin,Work
2,We're slow to believe what wounds us.,Tarde quae credita laedunt credimus.,"Heroides (The Heroines), II, 9-10"
32,"And did you, my hands, seize the horns of the mighty bull?","Tenuistine, manus meae, cornua tauri magni?",Metamorphoses
23,"Everything changes, nothing perishes.",,Metamorphoses
3,"The end proves the acts (were done), or the result is a test of the actions; Ovid's line 85 full translation: “The event proves well the wisdom of her [Phyllis'] course.”",Exitus acta probat.,"Heroides (The Heroines), II, 85"
18,"Everything changes, nothing perishes.","Omnia mutantur, nihil interit",- Metamorphoses


In [123]:
import re


def clean_sentences(sentences, filter_tokens):
    def replace_non_alphanumeric(input_string):
        processed_text = re.sub(r"\n", " ", input_string)
        processed_text = re.sub(r"[^a-zA-Z0-9\s.,?!'\-]", " ", processed_text)
        return processed_text

    def contains_filter_tokens(input_string):
        return any(token in input_string for token in filter_tokens)

    cleaned_sentences = [s.replace("\n", ". ").replace("_", " ") for s in sentences]
    cleaned_sentences = [replace_non_alphanumeric(s) for s in sentences]
    cleaned_sentences = [s for s in cleaned_sentences if len(s.split()) > 4]
    cleaned_sentences = [s for s in cleaned_sentences if not contains_filter_tokens(s)]

    return cleaned_sentences


# Example usage:
print(len(sentences))
filter_tokens = [
    "Project Gutenberg",
    "Footnote",
    "--Ver.",
    "The Amores",
    "United States",
    "copyright",
    "Copyright",
    "gutenberg",
    "Gutenberg",
]
cleaned = clean_sentences(sentences, filter_tokens)
print(len(cleaned))

# Uncomment to use the clean sentences
sentences = clean_sentences(sentences, filter_tokens)

13030
9547


In [8]:
import random

sampled_cell = random.choice(cleaned)
print(sampled_cell)

The gods delight in instances of such testimony, since they, thereby, give witness of their powers. 


In [28]:
corpus_text = text
# The string you want to find
target_string = "Choose a well-considered time"

# Find the index of the target string
start_index = corpus_text.find(target_string)

# Ensure the target string is found in the corpus
if start_index != -1:
    # Extract the text before and after the target string
    text_before = corpus_text[:start_index].strip()
    text_after = corpus_text[start_index + len(target_string) :].strip()

    # Tokenize the sentences using spaCy
    doc_before = nlp(text_before)
    doc_after = nlp(text_after)

    # Get the sentences
    sentences_before = [sent.text for sent in doc_before.sents][-2:]
    sentences_after = [sent.text for sent in doc_after.sents][:2]

    # Print or use the extracted sentences
    print("Sentences Before:")
    print(sentences_before)

    print("\nSentences After:")
    print(sentences_after)
else:
    print("Target string not found in the corpus.")

Sentences Before:
['Why tremble or hesitate to approach her?', 'It’s no impious\nProcne or Medea who’s to be moved by your words,\nno murderous Danaid, not Agamemnon’s cruel wife,\nno yelping Scylla terrorising Sicilian waters,\nno Circe born with the power to alter forms,\nno Medusa binding her knotted hair with snakes,\nbut the first of women, in whom Fortune shows herself\nas clear-sighted, and falsely charged with being blind:\nthan whom the earth holds nothing more glorious,\nsave Caesar, from the sun’s rising to its setting.']

Sentences After:
['to ask,\nlest your boat sets sail on an adverse tide.\n263\n\n\x0cThe oracles don’t always deliver sacred prophecies,\nthe temples themselves aren’t always open.\n', 'When the city’s state is as I now divine it,\nand there’s no grief on peoples’ faces,\nwhen Augustus’s house, to be revered as the Capitol,\nis as happy as it is now, and filled with peace,\nthen may the gods grant you the chance to make an\napproach,\nthen reflect your wor

In [8]:
similarity_data = []

# Sample similarity quotes
# But I won t refute a thing  I favour your praise too  For, heart, why reject the voice that is desired? Don t be angry if my belief in you comes only with great  difficulty  trust in important things usually builds slowly.


quotes = ["""Nothing is stronger than habit."""]
for actual_quote in quotes:
    doc_actual = nlp(actual_quote)

    for sent in tqdm(sentences):
        doc_generated = nlp(sent)
        # Calculate similarity score
        similarity_score = doc_generated.similarity(doc_actual)

        # Append data to the list
        similarity_data.append(
            {
                "new_quotes": doc_generated.text,
                "actual_quotes": actual_quote,
                "similarity_score": similarity_score,
            }
        )
# Create a DataFrame from the similarity data
similarity_df = pd.DataFrame(similarity_data).sort_values(
    "similarity_score", ascending=False
)
display(similarity_df.head(5))

  similarity_score = doc_generated.similarity(doc_actual)
100%|██████████| 6897/6897 [00:53<00:00, 130.04it/s]


Unnamed: 0,new_quotes,actual_quotes,similarity_score
1572,"Though exile is grief, my offence is more so: \nand deserving punishment’s worse than suffering it. \n",Nothing is stronger than habit.,0.872372
2116,"Fortunate silver, more blessed than any gold, \nthat was recently coarse metal, is now divine. \n",Nothing is stronger than habit.,0.821469
1154,That too is more than I want.,Nothing is stronger than habit.,0.814626
1096,"My current distress is harder than before: though still \nsimilar in nature, it’s grown and deepened with time. \n",Nothing is stronger than habit.,0.814061
1371,"Believe me, what I complain of is less than the truth. \n",Nothing is stronger than habit.,0.812927


In [25]:
similarity_df.sample(10)

Unnamed: 0,new_quotes,actual_quotes,similarity_score
4841,Book TII:253-312 Juno drove Io over the sea. \n \n,Nothing is stronger than habit.,0.377661
1766,"However they were inflicted on me, cease asking \nabout them: don’t disturb them if you want them to heal. \n",Nothing is stronger than habit.,0.532814
4213,Ovid implies no alms collecting was allowed the \npriestesses and prophets \nof the goddess. \n,Nothing is stronger than habit.,0.618541
3037,"Or like the Atarnean may you be brought, basely, \nto your lord as a prize, sewn inside a bull’s-hide. \n",Nothing is stronger than habit.,0.684119
3541,She fell in love with Hippomenes.,Nothing is stronger than habit.,0.48253
947,"The slave girl, singing at her work, spinning the thread, \ndiverts herself, and whiles away the hours of toil. \n",Nothing is stronger than habit.,0.575349
2816,He’d deny \nthat loyalty’s only the friend of tranquil times. \n,Nothing is stronger than habit.,0.631652
1767,"Whatever happened should be called an error, not a crime. \n",Nothing is stronger than habit.,0.69732
5404,Book TI.VII:1-40 Distant from his friends. \n \n,Nothing is stronger than habit.,0.311817
5921,Book EIII.1:1-66 Made more famous by his fate. \n \n,Nothing is stronger than habit.,0.380218


In [8]:
#########################################################################
# POC: To see if generate a response based on a quote
# Situation: There is a general tweet
# Task: Generate a list of candidate quotes
# Action: Create a model that assigns a score to each quote
# Result: The top 1 quote is returned
##################################################################################################################################################


  similarity_score = doc_generated.similarity(doc_actual)


['Let the man who does not wish to be idle fall in love!',
 'To love and be loved is to feel the sun from both sides.',
 "Love yields to business. If you seek a way out of love, be busy; you'll be safe then.",
 'Happy is the man who has broken the chains which hurt the mind, and has given up worrying once and for all.',
 'Nothing is stronger than habit.',
 'Dripping water hollows out stone, not through force but through persistence.',
 'It is no use to blame the looking glass if your face is awry.',
 'Resist beginnings; the remedy comes too late when the disease has gained strength by long delays.',
 'The mind, conscious of rectitude, laughed to scorn the falsehood of report.',
 'The cause is hidden. The effect is visible to all.']

In [None]:
df = pd.read_json("./../app/ovid_quotes.json")

DATA_DIR = "./../data"
files = [
    "RemediaAmoris.txt",
    "Heroides.txt",
    "Amours.txt",
    "Fasti.txt",
    "MetamorphosesI_VII.txt",
    "MetamorphosesVIII_XV.txt",
    "MetamorphosesofPublius.txt",
    "LoversAssistant.txt",
    "LastPoems.txt",
]
docs = [f"{DATA_DIR}/{file}" for file in files]
documents = SimpleDirectoryReader(input_files=docs).load_data()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=25)
texts = text_splitter.create_documents(docs)
directory = "index_store"
vector_index = FAISS.from_documents(texts, OpenAIEmbeddings())
vector_index.save_local(directory)

vector_index = FAISS.load_local("index_store", OpenAIEmbeddings())
retriever = vector_index.as_retriever(search_type="similarity", search_kwargs={"k": 10})

In [None]:
df = pd.read_json("./../app/ovid_quotes.json")
df.head(1)

In [None]:
qa_interface = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

response = qa_interface(
    f"""
I am a big fan of ovid. 
Please recommend 10 memorable quotes to me along with the source document they were taken from.

Do NOT include quotes I already have:
{df["Quote"].tolist()}
"""
)

print(response["result"])

#### Authorship authentication

**[Who Wrote it and Why?
Prompting Large-Language Models for Authorship Verification](https://arxiv.org/pdf/2310.08123.pdf)**

In [None]:
# LLM Authorship Attribution
from typing import List

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema import BaseOutputParser


class CommaSeparatedListOutputParser(BaseOutputParser[List[str]]):
    """Parse the output of an LLM call to a comma-separated list."""

    def parse(self, text: str) -> List[str]:
        """Parse the output of an LLM call."""
        return text.strip().split(", ")


template = f"""Task: On a scale of 0 to 1, with 0 indicating low confidence
and 1 indicating high confidence, please provide a general
assessment of the likelihood that given text 
written by the same author as the provided reference. Your answer should reflect a
moderate level of strictness in scoring. Here are some
relevant variables to this problem.
1. punctuation style(e.g. hyphen, brackets, colon, comma,
parenthesis, quotation mark)
2. special characters style, capitalization style(e.g.
Continuous capitalization, capitalizing certain words)
3. acronyms and abbreviations(e.g. Usage of acronyms
such as OMG, Abbreviations without punctuation marks
such as Mr Rochester vs. Mr. Rochester,Unusual
abbreviations such as def vs. definitely)
4. writing style
5. expressions and Idioms
6. tone and mood
7. sentence structure
8. any other relevant aspect
First step: Understand the problem, extracting relevant
variables and devise a plan to solve the problem. Then,
carry out the plan and solve the problem step by step.
9. One (or both) of the texts is written by the famous Latin author "Ovid"
Finally, show the confidence score.

The following are all quotes by Ovid for reference:
'Love is a thing full of anxious fears.',
 'Now are fields of corn where Troy once stood.',
 "We're slow to believe what wounds us.",
 "The end proves the acts (were done), or the result is a test of the actions; Ovid's line 85 full translation: “The event proves well the wisdom of her [Phyllis'] course.”",
 "Let him who loves, where love success may find, Spread all his sails before the prosp'rous wind.",
 'Resist beginnings; the remedy comes too late when the disease has gained strength by long delays.',
 "Love yields to business. If you seek a way out of love, be busy; you'll be safe then.",
 'The gods behold all righteous actions.',
 'There is a god within us.',
 'The mind, conscious of rectitude, laughed to scorn the falsehood of report.',
 'Every lover is a soldier.',
 'Let the man who does not wish to be idle fall in love!',
 'Far away be that fate!',
 'They bear punishment with equanimity who have earned it.',
 "We take no pleasure in permitted joys. But what's forbidden is more keenly sought.",
 'Who is allowed to sin, sins less.' 
 
 """
human_template = "{text}"

chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        ("human", human_template),
    ]
)
chain = chat_prompt | ChatOpenAI() | CommaSeparatedListOutputParser()

In [None]:
# Extract Quotes from LLM Authorshipp Attribution
import pandas as pd
import spacy
import re

nlp = spacy.load("en_core_web_sm")
input_string = response["result"]


def get_quotes(input_string):
    """Extract quotes from a string that has been created by an LLM string prompt."""
    # Extract the lines that start with a number
    lines_with_stripped_numbers = [
        re.sub(r"^\d+\.\s*", "", line.strip())
        for line in input_string.splitlines()
        if re.match(r"^\d+\.", line)
    ]

    # Print the extracted lines
    quotes = []
    pattern2 = r'"([^"]+)"\s*-\s*(.+)'

    for line in lines_with_stripped_numbers:
        match = re.search(pattern2, line)
        q = match.group(1)
        w = match.group(2)
        quotes.append((q, w))
    return quotes


quotes = get_quotes(response["result"])
quotes

### Score New Quotes

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

def score_quotes(quotes=None):
    """Score a list of quotes that have been generated by an LLM string prompt.

    quotes example:
    [('Dripping water hollows out stone, not through force but through persistence.',
    'Metamorphoses'),

    """
    if quotes is None:
        quotes = []

    # Define the regex pattern
    pattern = r"(\d+\.\d+)"

    # Get the quote string (not the work)
    quotes_text = [q[0] for q in quotes]
    quotes_work = [q[1] for q in quotes]

    # Extract the scores
    scores = []
    score_reasons = []
    for q in quotes_text:
        r = chain.invoke({"text": q})
        r = " ".join(r)
        score_reasons.append(r)
        match = re.search(pattern, r)
        score = match[0] if match else -1
        scores.append(score)
    return list(zip(*(scores, score_reasons, quotes_text, quotes_work)))


# Example usage
scored_quotes = score_quotes(quotes)
scored_quotes_df = pd.DataFrame(
    scored_quotes, columns=["score", "reason", "quote", "work"]
)
pd.set_option('display.max_colwidth', None)
scored_quotes_df  # ["score"]

### Context-Based Quotation Recommendation

Resource:
* https://arxiv.org/pdf/2005.08319.pdf

### Assessment of Retrieval Accuracy

The below uses current pipeline for 2 reasons:
1) To assess the hulicination affect against current pipeline
2) To assess scoring variability

In [None]:
import re

non_ovid_generated_quotes = [
    "One man's meat is another man's poison.",
    "Fortune favors the bold.",
    "Wherever there is a human being, there is an opportunity for a kindness.",
    "Love is a kind of warfare.",
    "One man's meat is another man's poison.",
    "To be loved, be lovable.",
]


# Define the regex pattern
pattern = r"(\d+\.\d+)"


scores = []
strings = []
for q in non_ovid_generated_quotes:
    r = chain.invoke({"text": q})
    r = " ".join(r)
    strings.append(r)
    match = re.search(pattern, r)
    score = match[0] if match else -1
    scores.append(score)
    print(r)

pd.DataFrame(
    data={
        "questionable_quote": non_ovid_generated_quotes,
        "authorship_match_score": scores,
        "is_original": [0, 0, 0, 1, 0, 1],
    }
)

In [None]:
scored_quotes_df[scored_quotes_df["score"].astype(float) < 0.5]["quote"].tolist()

In [None]:
from itertools import product

# Compare to quotes
generated_quotes = scored_quotes_df[scored_quotes_df["score"].astype(float) > 0.5][
    "quote"
].tolist()

# Read in existing quotes
df = pd.read_json("./../app/ovid_quotes.json")

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Process each quote and calculate similarity scores
similarity_data = []

for generated_quote, actual_quote in product(generated_quotes, df["Quote"]):
    # Process the quotes with spaCy
    doc_generated = nlp(generated_quote)
    doc_actual = nlp(actual_quote)

    # Calculate similarity score
    similarity_score = doc_generated.similarity(doc_actual)

    # Append data to the list
    similarity_data.append(
        {
            "generated_quotes": generated_quote,
            "actual_quotes": actual_quote,
            "similarity_score": similarity_score,
        }
    )

# Create a DataFrame from the similarity data
similarity_df = pd.DataFrame(similarity_data).sort_values(
    "similarity_score", ascending=False
)
display(similarity_df.head(2))

# Seems like there might be a magic number similarity score < 0.65
print("unique quutes")

similiar_quotes = similarity_df[similarity_df["similarity_score"] > 0.65][
    "generated_quotes"
].unique()

# Filter out the similar quotes
new_quotes = similarity_df[~similarity_df["generated_quotes"].isin(similiar_quotes)][
    "generated_quotes"
].unique()

# Get the works for the new quotes
new_quote_works = []
for q in new_quotes:
    for quote, work in quotes:
        if q == quote:
            new_quote_works.append(work)
df_new_quotes = pd.DataFrame(
    list(zip(*(new_quotes, new_quote_works))), columns=["Quote", "Work"]
)
df_new_quotes["Quote in Latin"] = None
df_new_quotes

In [None]:
similiar_quotes

In [None]:
print("highest score comparison:")
idx = 54
print("GENERATED:", similarity_df.iloc[54]["generated_quotes"])
print("ACTUAL:", similarity_df.iloc[54]["actual_quotes"])

### Storage of Additional Quotes

In [16]:
# import pandas as pd

# # Read in the existing dataframe
# df = pd.read_json("./../app/ovid_quotes.json")

# # Manually append a new row to the dataframe
# new_row = {
#     "Quote": "Greet them (others) by their names it costs you nothing.",
#     "Work": "Ars Amatoria",
#     "Quote in Latin": "",
# }
# df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

# # Print the updated dataframe
# display(df.head(2))
# df.to_json("./../app/ovid_quotes.json")
# print("New dataframe shape:", df.shape)

Unnamed: 0,Quote,Quote in Latin,Work
0,Love is a thing full of anxious fears.,Res est solliciti plena timoris amor.,"Heroides (The Heroines), I, 12"
1,Now are fields of corn where Troy once stood.,Iam seges est ubi Troia fuit.,"Heroides (The Heroines), I, 53"


New dataframe shape: (51, 3)


In [None]:
# After quotes works have passed the assessment bars above
df = pd.concat([df, df_new_quotes]).reset_index(drop=True)
df.to_json("./../ovid_quotes.json")

## Getting more data

In [95]:
%%bash
ls ../data
curl https://web.seducoahuila.gob.mx/biblioweb/upload/THE%20ART%20OF%20LOVE.pdf > ../data/ArsAmatoria.pdf

AmoresV2.pdf
AmoresV2.txt
Amours.txt
Fasti.txt
FastiV2.pdf
FastiV2.txt
Heroides.pdf
Heroides.txt
HeroidesV2.txt
LastPoems.txt
LostPoemsV2.txt
LoversAssistant.txt
MetamorphosesI_VII.txt
MetamorphosesV2.pdf
MetamorphosesV2.txt
MetamorphosesVIII_XV.txt
MetamorphosesofPublius.txt
RemediaAmoris.txt
lostpoems.pdf
notes.jsonl
notes_validation.jsonl


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  302k  100  302k    0     0   380k      0 --:--:-- --:--:-- --:--:--  381k


In [102]:
import fitz  # PyMuPDF


def extract_text_from_pdf(pdf_path):
    try:
        # Open the PDF file
        pdf_document = fitz.open(pdf_path)

        # Initialize an empty string to store the extracted text
        extracted_text = ""

        # Iterate through pages and extract text
        for page_number in range(pdf_document.page_count):
            page = pdf_document[page_number]
            extracted_text += page.get_text()

        return extracted_text

    except Exception as e:
        print(f"Error: {e}")
        return None


# Example usage:
pdf_file_path = "/Users/aus10powell/Downloads/TheOdesofHorace_10227561.pdf"  # Replace with the path to your PDF file
extracted_text = extract_text_from_pdf(pdf_file_path)


def save_text_to_txt(text, pdf_path):
    txt_path = pdf_path[:-4] + ".txt"  # Assuming the PDF file has a ".pdf" extension
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)


save_text_to_txt(extracted_text, "../data/TheOdesofHorace_10227561(3).txt")

In [97]:
print(len(extracted_text.split()))
extracted_text.split()[-1000:-990]

21875


['eat.',
 'As',
 'her',
 'breath',
 'returned,',
 'she',
 'tore',
 'the',
 'thin',
 'clothing']

In [98]:
file_path = "../data/ArsAmatoria.txt"

with open(file_path, "w") as file:
    file.write(extracted_text)

print("Text file created successfully.")

Text file created successfully.


### LLM Quote Retrieval

In [81]:
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os
import openai

# import utils
import configparser
from pprint import pprint
from llama_index import SimpleDirectoryReader


f1 = "./../data/LostPoemsV2.txt"
f2 = "./../data/FastiV2.txt"

docs = [f1, f2]
documents = SimpleDirectoryReader(input_files=docs).load_data()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=25)
texts = text_splitter.create_documents(docs)
directory = "index_store"
vector_index = FAISS.from_documents(texts, OpenAIEmbeddings())
vector_index.save_local(directory)

vector_index = FAISS.load_local("index_store", OpenAIEmbeddings())
retriever = vector_index.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [82]:
qa_interface = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

response = qa_interface(
    """
I am a big fan of ovid. 
Please recommend memorable quotes to me.
"""
)

print(response["result"])

I apologize, but as an AI language model, I do not have the ability to browse specific files or access specific quotes from works of literature. I recommend reading Ovid's works, such as "Lost Poems" and "Fasti," to discover and appreciate the memorable quotes within them.
