# Quote Generation/Retrieval

Initial quotes were scraped from a single webpage. Such a small pool would lead to a lot of repetition and a rather boring bot. So...

### 2 Issues with Index-Retrieved/Generated Quotes:
**Goal:** Keep roughly 100 quotes in rotation. At one tweet per day that covers about a quarter of a year, so any given quote is unlikely to be repeated soon.

### Probability of Repeating a Quote Within 2 Weeks

\begin{equation*}
P(\text{repeated in 2 weeks}) = 1 - \left( \frac{N - 1}{N} \right)^{\frac{14}{m}}
\end{equation*}

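Here N is the number of quotes in the pool and m is the number of days between tweets, so 14/m is the number of draws in two weeks. Substituting N = 100 and m = 1 (one tweet per day):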

 \begin{equation*}
P(\text{repeated in 2 weeks}) = 1 - \left( \frac{99}{100} \right)^{14}
\end{equation*}

 Calculating this gives:

 \begin{equation*}
P(\text{repeated in 2 weeks}) \approx 0.131
\end{equation*}

Personally, I'd like it to stay under 10% over a longer window. A longer window increases the probability, so the pool of quotes has to grow to compensate.

Tweaking the numbers a bit, with N = 200 and days = 21:

\begin{equation*}
P(\text{repeated in 3 weeks}) \approx 0.0999
\end{equation*}
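
The formula above is easy to evaluate for other pool sizes and windows. The following is a minimal sketch, assuming quotes are drawn uniformly at random with replacement; the function name `repeat_probability` and its parameter names are illustrative and not part of the bot's code.

```python
def repeat_probability(
    n_quotes: int, days: int, days_between_tweets: float = 1.0
) -> float:
    """Probability that one specific quote is drawn again within `days`.

    Assumes each tweet picks uniformly at random from `n_quotes` quotes.
    """
    draws = days / days_between_tweets
    return 1.0 - ((n_quotes - 1) / n_quotes) ** draws


# Evaluate the two scenarios discussed above
for n_quotes, days in [(100, 14), (200, 21)]:
    p = repeat_probability(n_quotes, days)
    print(f"N={n_quotes}, days={days}: {p:.4f}")
```

Sweeping `n_quotes` for a fixed window is a quick way to find the smallest pool that keeps the probability under a chosen target.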

A generated quote should be added only if two conditions hold: 1) the authorship check is confident that the quote was written by Ovid, and 2) the quote is not a variation of a quote already in the DB.
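
The two conditions can be combined into a single gate. The sketch below is only an illustration: `accept_quote` and `is_near_duplicate` are hypothetical helpers, the thresholds are placeholders, and the standard library's `difflib` stands in for the spaCy similarity check used later in this notebook.

```python
from difflib import SequenceMatcher
from typing import Iterable


def is_near_duplicate(
    candidate: str, existing_quotes: Iterable[str], threshold: float = 0.8
) -> bool:
    """Return True if the candidate closely matches any stored quote.

    Uses a character-level similarity ratio as a stand-in for a semantic check.
    """
    normalized = candidate.lower().strip()
    return any(
        SequenceMatcher(None, normalized, quote.lower().strip()).ratio() >= threshold
        for quote in existing_quotes
    )


def accept_quote(
    candidate: str,
    authorship_score: float,
    existing_quotes: Iterable[str],
    min_score: float = 0.5,
) -> bool:
    """Accept a quote only if it passes the authorship check and is not a duplicate."""
    # Condition 1: the authorship score must exceed the (assumed) minimum
    if authorship_score <= min_score:
        return False
    # Condition 2: the quote must not be a variation of a stored quote
    return not is_near_duplicate(candidate, existing_quotes)
```

Checking the authorship score first avoids the pairwise comparison for quotes that would be rejected anyway.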

In [3]:
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os
import sys

# import utils
import configparser
from pprint import pprint

import jupyter_black
import openai
from llama_index import SimpleDirectoryReader

jupyter_black.load()

# Make the sibling app/ directory importable so conn_utils can be loaded
sys.path.append("../app")
import conn_utils

OPENAI_API_KEY = conn_utils.get_open_ai_key("../config.ini")

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
directory = "index_store"

In [4]:
!ls ./../data

Amours.txt                 MetamorphosesI_VII.txt
Fasti.txt                  MetamorphosesVIII_XV.txt
Heroides.txt               MetamorphosesofPublius.txt
LastPoems.txt              RemediaAmoris.txt
LoversAssistant.txt


In [6]:
DATA_DIR = "./../data"
files = [
    "RemediaAmoris.txt",
    "Heroides.txt",
    "Amours.txt",
    "Fasti.txt",
    "MetamorphosesI_VII.txt",
    "MetamorphosesVIII_XV.txt",
    "MetamorphosesofPublius.txt",
    "LoversAssistant.txt",
    "LastPoems.txt",
]
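docs = [f"{DATA_DIR}/{file}" for file in files]
documents = SimpleDirectoryReader(input_files=docs).load_data()

# Split the loaded document text into chunks. Passing `docs` here would embed
# the file paths instead of the file contents.
# Assumes the llama_index Document objects expose their content as `.text`.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=25)
texts = text_splitter.create_documents([doc.text for doc in documents])

# Embed the chunks, build the FAISS index, and persist it to disk
directory = "index_store"
vector_index = FAISS.from_documents(texts, OpenAIEmbeddings())
vector_index.save_local(directory)

# Reload the saved index and expose it as a top-6 similarity retriever
vector_index = FAISS.load_local(directory, OpenAIEmbeddings())
retriever = vector_index.as_retriever(search_type="similarity", search_kwargs={"k": 6})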

In [7]:
qa_interface = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

response = qa_interface(
    """
I am a big fan of ovid. 
Please recommend memorable quotes to me.


"""
)

print(response["result"])

Here are some memorable quotes from Ovid's works:

1. "Dripping water hollows out stone, not through force but through persistence." - Metamorphoses

2. "Fortune and love favor the brave." - Amores

3. "In the midst of chaos, there is also opportunity." - Metamorphoses

4. "Love is a kind of warfare." - Amores

5. "The cause is hidden. The effect is visible to all." - Metamorphoses

6. "Love is a fire that burns unseen, a wound that aches yet isn't felt, and always with me night and day, an agony that never ends." - Amores

7. "Nothing is stronger than habit." - Metamorphoses

8. "The gods favor the bold." - Metamorphoses

9. "To be loved, be lovable." - Amores

10. "The mind moves in the direction of our currently dominant thoughts." - Metamorphoses

Please note that these quotes are just a small selection from Ovid's works. For a more comprehensive collection of memorable quotes, I recommend reading his books directly.


#### Authorship authentication

**[Who Wrote it and Why? Prompting Large-Language Models for Authorship Verification](https://arxiv.org/pdf/2310.08123.pdf)**

In [8]:
# LLM Authorship Attribution
from typing import List

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema import BaseOutputParser


class CommaSeparatedListOutputParser(BaseOutputParser[List[str]]):
    """Parse the output of an LLM call to a comma-separated list."""

    def parse(self, text: str) -> List[str]:
        """Parse the output of an LLM call."""
        return text.strip().split(", ")


template = """Task: On a scale of 0 to 1, with 0 indicating low confidence
and 1 indicating high confidence, please provide a general
assessment of the likelihood that the given text is
written by the same author as the provided reference. Your answer should reflect a
moderate level of strictness in scoring. Here are some
relevant variables to this problem.
1. punctuation style(e.g. hyphen, brackets, colon, comma,
parenthesis, quotation mark)
2. special characters style, capitalization style(e.g.
Continuous capitalization, capitalizing certain words)
3. acronyms and abbreviations(e.g. Usage of acronyms
such as OMG, Abbreviations without punctuation marks
such as Mr Rochester vs. Mr. Rochester,Unusual
abbreviations such as def vs. definitely)
4. writing style
5. expressions and Idioms
6. tone and mood
7. sentence structure
8. any other relevant aspect
9. One (or both) of the texts is written by the famous Latin author "Ovid"
First step: Understand the problem, extracting relevant
variables and devise a plan to solve the problem. Then,
carry out the plan and solve the problem step by step.
Finally, show the confidence score.

The following are all quotes by Ovid for reference:
'Love is a thing full of anxious fears.',
 'Now are fields of corn where Troy once stood.',
 "We're slow to believe what wounds us.",
 "The end proves the acts (were done), or the result is a test of the actions; Ovid's line 85 full translation: “The event proves well the wisdom of her [Phyllis'] course.”",
 "Let him who loves, where love success may find, Spread all his sails before the prosp'rous wind.",
 'Resist beginnings; the remedy comes too late when the disease has gained strength by long delays.',
 "Love yields to business. If you seek a way out of love, be busy; you'll be safe then.",
 'The gods behold all righteous actions.',
 'There is a god within us.',
 'The mind, conscious of rectitude, laughed to scorn the falsehood of report.',
 'Every lover is a soldier.',
 'Let the man who does not wish to be idle fall in love!',
 'Far away be that fate!',
 'They bear punishment with equanimity who have earned it.',
 "We take no pleasure in permitted joys. But what's forbidden is more keenly sought.",
 'Who is allowed to sin, sins less.' """
human_template = "{text}"

chat_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        ("human", human_template),
    ]
)
chain = chat_prompt | ChatOpenAI() | CommaSeparatedListOutputParser()
# chain.invoke({"text": ""})

In [21]:
# Extract quotes from the LLM response
import re
from typing import List, Tuple

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")


def get_quotes(input_string: str) -> List[Tuple[str, str]]:
    """Extract (quote, work) pairs from a numbered list produced by an LLM."""
    # Keep only the lines that start with "1.", "2.", ... and strip that prefix
    lines_with_stripped_numbers = [
        re.sub(r"^\d+\.\s*", "", line.strip())
        for line in input_string.splitlines()
        if re.match(r"^\d+\.", line)
    ]

    quotes = []
    for line in lines_with_stripped_numbers:
        # Skip lines that carry no ' - Work' attribution
        if " - " not in line:
            continue
        # Split on the last " - " so hyphens inside the quote are preserved
        q, _, w = line.rpartition(" - ")
        q = q.strip().replace('"', "")
        w = w.strip()
        quotes.append((q, w))
    return quotes


quotes = get_quotes(response["result"])
quotes

[('Dripping water hollows out stone, not through force but through persistence.',
  'Metamorphoses'),
 ('Fortune and love favor the brave.', 'Amores'),
 ('In the midst of chaos, there is also opportunity.', 'Metamorphoses'),
 ('Love is a kind of warfare.', 'Amores'),
 ('The cause is hidden. The effect is visible to all.', 'Metamorphoses'),
 ("Love is a fire that burns unseen, a wound that aches yet isn't felt, and always with me night and day, an agony that never ends.",
  'Amores'),
 ('Nothing is stronger than habit.', 'Metamorphoses'),
 ('The gods favor the bold.', 'Metamorphoses'),
 ('To be loved, be lovable.', 'Amores'),
 ('The mind moves in the direction of our currently dominant thoughts.',
  'Metamorphoses')]

### Score New Quotes

In [59]:
def score_quotes(quotes=None):
    """Score (quote, work) tuples returned by get_quotes for Ovid authorship.

    quotes example:
    [('Dripping water hollows out stone, not through force but through persistence.',
    'Metamorphoses')]

    Returns a list of (score, reason, quote, work) tuples. The score is -1.0
    when no decimal number is found in the model's reply.
    """
    if quotes is None:
        quotes = []

    # Matches the first decimal number in the reply, e.g. "0.7"
    pattern = r"(\d+\.\d+)"

    # Split the (quote, work) tuples into parallel lists
    quotes_text = [q[0] for q in quotes]
    quotes_work = [q[1] for q in quotes]

    # Extract the scores
    scores = []
    score_reasons = []
    for q in quotes_text:
        # The chain's parser returns a list of fragments; rejoin them
        r = chain.invoke({"text": q})
        r = " ".join(r)
        score_reasons.append(r)
        match = re.search(pattern, r)
        # float() replaces eval(), which fails on the integer fallback
        # and should not be run on model output
        scores.append(float(match[0]) if match else -1.0)
    return list(zip(scores, score_reasons, quotes_text, quotes_work))


# Example usage
scored_quotes = score_quotes(quotes)
scored_quotes_df = pd.DataFrame(
    scored_quotes, columns=["score", "reason", "quote", "work"]
)
scored_quotes_df

Unnamed: 0,score,reason,quote,work
0,0.5,Confidence Score: 0.5,"Dripping water hollows out stone, not through ...",Metamorphoses
1,0.4,Confidence Score: 0.4,Fortune and love favor the brave.,Amores
2,0.1,Confidence Score: 0.1\n\nExplanation: The prov...,"In the midst of chaos, there is also opportunity.",Metamorphoses
3,0.7,On a scale of 0 to 1 I would assess the likeli...,Love is a kind of warfare.,Amores
4,0.4,Likelihood: 0.4\n\nExplanation: The punctuatio...,The cause is hidden. The effect is visible to ...,Metamorphoses
5,0.8,Confidence Score: 0.8,"Love is a fire that burns unseen, a wound that...",Amores
6,0.7,Given the provided quote and the quotes by Ovi...,Nothing is stronger than habit.,Metamorphoses
7,0.6,Confidence Score: 0.6,The gods favor the bold.,Metamorphoses
8,0.8,Based on the provided reference quotes by Ovid...,"To be loved, be lovable.",Amores
9,0.6,On a scale of 0 to 1 with 0 indicating low con...,The mind moves in the direction of our current...,Metamorphoses
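
The replies in the `reason` column vary in shape: some start with a label such as "Confidence Score:", others bury the number in a sentence. The following is a minimal sketch of a more defensive parser, assuming those two shapes cover most replies; `parse_score` and both regular expressions are illustrative and not part of the pipeline above.

```python
import re
from typing import Optional

# Prefer a number that follows an explicit label such as "Confidence Score:"
_LABELED = re.compile(
    r"(?:confidence score|likelihood)\s*[:=]\s*([01](?:\.\d+)?)(?!\d)",
    re.IGNORECASE,
)
# Otherwise fall back to the first decimal number in the reply
_DECIMAL = re.compile(r"\d+\.\d+")


def parse_score(reply: str) -> Optional[float]:
    """Return a score in [0, 1] parsed from an LLM reply, or None if absent."""
    match = _LABELED.search(reply)
    raw = match.group(1) if match else None
    if raw is None:
        fallback = _DECIMAL.search(reply)
        raw = fallback.group(0) if fallback else None
    if raw is None:
        return None
    score = float(raw)
    # Reject numbers that cannot be a confidence score
    return score if 0.0 <= score <= 1.0 else None
```

Returning `None` instead of a sentinel like -1 keeps unparsed replies from being mistaken for very low scores when filtering on a threshold.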


In [58]:
scored_quotes[0]

(0.7,
 'On a scale of 0 to 1 I would assess the likelihood that the given text is written by the same author as the provided reference as 0.7. \n\nReasoning:\n1. Punctuation style: The use of commas and the absence of any other special punctuation marks is consistent with the reference quotes.\n2. Capitalization style: The capitalization style in the given text is standard and does not deviate from what is seen in the reference quotes.\n3. Acronyms and abbreviations: There are no acronyms or unusual abbreviations in the given text.\n4. Writing style: The writing style in the given text is concise and impactful which aligns with Ovid\'s style of writing.\n5. Expressions and idioms: The use of the expression "Dripping water hollows out stone" is not directly comparable to any specific idioms or expressions used by Ovid but it does convey a similar sentiment of persistence and gradual impact.\n6. Tone and mood: The tone and mood in the given text are not explicitly clear but it can be int

### Assessment of Retrieval Accuracy

The cell below runs the current pipeline for two reasons:
1) To assess the effect of hallucination on the current pipeline
2) To assess scoring variability
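
Scoring variability can be quantified by scoring the same quote several times and summarising the spread. The sketch below assumes a hypothetical `score_fn` callable that wraps the chain call and score extraction and returns a float; `score_variability` and `n_runs` are also illustrative names.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def score_variability(
    quotes: List[str], score_fn: Callable[[str], float], n_runs: int = 5
) -> Dict[str, Dict[str, float]]:
    """Score each quote n_runs times and summarise the spread of the scores."""
    summary = {}
    for quote in quotes:
        runs = [score_fn(quote) for _ in range(n_runs)]
        summary[quote] = {
            "mean": mean(runs),
            "stdev": pstdev(runs),  # population standard deviation over the runs
            "min": min(runs),
            "max": max(runs),
        }
    return summary
```

A large `stdev` relative to the acceptance threshold would suggest averaging several runs before deciding whether a quote passes.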

In [16]:
import re

non_ovid_generated_quotes = [
    "One man's meat is another man's poison.",
    "Fortune favors the bold.",
    "Wherever there is a human being, there is an opportunity for a kindness.",
    "Love is a kind of warfare.",
    "One man's meat is another man's poison.",
    "To be loved, be lovable.",
]


# Define the regex pattern
pattern = r"(\d+\.\d+)"


scores = []
strings = []
for q in non_ovid_generated_quotes:
    r = chain.invoke({"text": q})
    r = " ".join(r)
    strings.append(r)
    match = re.search(pattern, r)
    # Store scores as floats so the column is numeric; -1.0 marks "no score found"
    score = float(match[0]) if match else -1.0
    scores.append(score)
    print(r)

pd.DataFrame(
    data={
        "questionable_quote": non_ovid_generated_quotes,
        "authorship_match_score": scores,
        "is_original": [0, 0, 0, 1, 0, 1],
    }
)

Confidence Score: 0.2
Confidence Score: 0.5

Explanation: The given quote does not exhibit any specific characteristics that can be directly compared to the reference quotes by Ovid. However the quote does share a similar theme of taking risks and being courageous which aligns with some of Ovid's ideas about love and human behavior. Therefore there is a moderate likelihood that the given quote could be written by the same author as the reference quotes.
Confidence Score: 0.4
Confidence Score: 0.7

Explanation: The sentence has a similar theme to the quotes by Ovid as it compares love to warfare. However the sentence lacks the poetic style and specific language used by Ovid. Therefore I have a moderate level of confidence that the text is written by the same author.
Confidence Score: 0.2

Explanation: The given text does not match Ovid's writing style or use of language. The quote is a common proverb that is not specific to Ovid's style or themes. Therefore it is unlikely that the given

Unnamed: 0,questionable_quote,authorship_match_score,is_original
0,One man's meat is another man's poison.,0.2,0
1,Fortune favors the bold.,0.5,0
2,"Wherever there is a human being, there is an o...",0.4,0
3,Love is a kind of warfare.,0.7,1
4,One man's meat is another man's poison.,0.2,0


['Dripping water hollows out stone, not through force but through persistence.',
 'Fortune and love favor the brave.',
 'In the midst of chaos, there is also opportunity.',
 'Love is a kind of warfare.',
 'The cause is hidden. The effect is visible to all.',
 "Love is a fire that burns unseen, a wound that aches yet isn't felt, and always with me night and day, an agony that never ends.",
 'Nothing is stronger than habit.',
 'The gods favor the bold.',
 'To be loved, be lovable.',
 'The mind moves in the direction of our currently dominant thoughts.']

In [73]:
from itertools import product

# Compare to quotes
generated_quotes = [q[2] for q in scored_quotes if q[0] > 0.5]

# Read in existing quotes
df = pd.read_json("./../ovid_quotes.json")

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Process each quote and calculate similarity scores
similarity_data = []

for generated_quote, actual_quote in product(generated_quotes, df["Quote"]):
    # Process the quotes with spaCy
    doc_generated = nlp(generated_quote)
    doc_actual = nlp(actual_quote)

    # Calculate similarity score
    similarity_score = doc_generated.similarity(doc_actual)

    # Append data to the list
    similarity_data.append(
        {
            "generated_quotes": generated_quote,
            "actual_quotes": actual_quote,
            "similarity_score": similarity_score,
        }
    )

# Create a DataFrame from the similarity data
similarity_df = pd.DataFrame(similarity_data).sort_values(
    "similarity_score", ascending=False
)
display(similarity_df.head(2))

# Empirically, a similarity score above 0.65 seems to mark a near-duplicate of an existing quote
print("unique quutes")

similiar_quotes = similarity_df[similarity_df["similarity_score"] > 0.65][
    "generated_quotes"
].unique()

# Filter out the similar quotes
new_quotes = similarity_df[~similarity_df["generated_quotes"].isin(similiar_quotes)][
    "generated_quotes"
].unique()

# Get the works for the new quotes
new_quote_works = []
for q in new_quotes:
    for quote, work in quotes:
        if q == quote:
            new_quote_works.append(work)
df_new_quotes = pd.DataFrame(
    list(zip(*(new_quotes, new_quote_works))), columns=["Quote", "Work"]
)
df_new_quotes["Quote in Latin"] = None
df_new_quotes



Unnamed: 0,generated_quotes,actual_quotes,similarity_score
19,Love is a kind of warfare.,Love is a kind of warfare.,1.0
6,Love is a kind of warfare.,Love yields to business. If you seek a way out...,0.722229


unique quutes


Unnamed: 0,Quote,Work,Quote in Latin
0,"Love is a fire that burns unseen, a wound that...",Amores,
1,The mind moves in the direction of our current...,Metamorphoses,
2,Nothing is stronger than habit.,Metamorphoses,


Unnamed: 0,Quote,Quote in Latin,Work
0,Love is a thing full of anxious fears.,Res est solliciti plena timoris amor.,"Heroides (The Heroines), I, 12"
1,Now are fields of corn where Troy once stood.,Iam seges est ubi Troia fuit.,"Heroides (The Heroines), I, 53"
2,We're slow to believe what wounds us.,Tarde quae credita laedunt credimus.,"Heroides (The Heroines), II, 9-10"
3,"The end proves the acts (were done), or the re...",Exitus acta probat.,"Heroides (The Heroines), II, 85"
4,"Let him who loves, where love success may find...","Siquis amat quod amare iuvat, feliciter ardens...","Remedia Amoris (The Cure for Love), lines 13-16"
5,Resist beginnings; the remedy comes too late w...,Principiis obsta; sero medicina paratur Cum ma...,"Remedia Amoris (The Cure for Love), lines 91–92"
6,Love yields to business. If you seek a way out...,"Qui finem quaeris amoris, Cedit amor rebus; re...","Remedia Amoris (The Cure for Love), lines 143–144"
7,The gods behold all righteous actions.,Di pia facta vident.,"Fasti (The Festivals), II, 117"
8,There is a god within us.,Est deus in nobis; agitante calescimus illo: i...,"Fasti (The Festivals), VI, lines 5-6"
9,"The mind, conscious of rectitude, laughed to s...",Conscia mens recti famae mendacia risit,"Fasti (The Festivals), IV, 311"


In [40]:
print("highest score comparison:")
idx = 54
print("GENERATED:", similarity_df.iloc[idx]["generated_quotes"])
print("ACTUAL:", similarity_df.iloc[idx]["actual_quotes"])

highest score comparison:
GENERATED: "Everything changes, nothing perishes." - Metamorphoses
ACTUAL: The end proves the acts (were done), or the result is a test of the actions; Ovid's line 85 full translation: “The event proves well the wisdom of her [Phyllis'] course.”


### Storage of Additional Quotes

In [68]:
!ls ./../

DOCS.md              config.ini           ovid_quotes.json
ProjectManagement.md data                 requirements.txt
README.md            myscript.sh          venv
__pycache__          notebooks
app                  ovid_quotes.csv


In [76]:
# Append the quotes that passed both checks above and persist the updated set
df = pd.concat([df, df_new_quotes]).reset_index(drop=True)
df.to_json("./../ovid_quotes.json")