In [5]:
# import openai
import re 
import difflib
from difflib import ndiff
from termcolor import colored
import dotenv
from datetime import datetime
from IPython.display import HTML, Image, Markdown, display

from langchain_openai import ChatOpenAI
from langchain_core.messages import AnyMessage, SystemMessage, HumanMessage, ToolMessage
from langchain_core.prompts import (ChatPromptTemplate, MessagesPlaceholder, PromptTemplate,
                                    SystemMessagePromptTemplate, HumanMessagePromptTemplate)
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser

# from langchain.agents import create_tool_calling_agent, AgentExecutor
# from langchain.memory import ChatMessageHistory
# from langgraph.checkpoint.sqlite import SqliteSaver
# from langgraph.checkpoint.aiosqlite import AsyncSqliteSaver
# from langchain.tools import BaseTool, StructuredTool, tool
# from langchain_community.chat_models import ChatOllama
# from langchain_community.tools.tavily_search import TavilySearchResults
# from langchain_core.pydantic_v1 import BaseModel, Field
# from langchain.callbacks.base import AsyncCallbackHandler, BaseCallbackHandler
# from langchain_core.outputs import LLMResult

In [6]:
dotenv.load_dotenv()

MODEL='gpt-4o'

In [7]:
INPUT = """

Ask ChatGPT a simple question: [How many r’s are there in ‘cranberry’?](https://druce.ai/assets/2024/cranberry.png). It often answers 2 instead of 3.

ChatGPT can generate huge productivity gains, but can also generate huge [reputational and career risks](https://www.npr.org/2023/12/30/1222273745/michael-cohen-ai-fake-legal-cases) when it makes embarrassing mistakes. Users should understand when and why it might mess up. Gather round if you want to dodge a ChatGPT disaster!

Here are more simple prompts that you might expect to work but do not:

- [How many letters are in this sentence: It was a bright cold day in April, and the clocks were striking thirteen.](https://druce.ai/assets/2024/1984.png) I got 60, should be 72.
    
- [Mary has 3 brothers and she also has 5 sisters. How many sisters does Mary’s brother have?](https://druce.ai/assets/2024/sisters.png) I got “same number of sisters as Mary, which is 5”.
    
- [Make a list of 50 cities that do not contain the letter ‘a’, or any diacritical variation thereof](https://druce.ai/assets/2024/chicago.png). It returned ‘Chicago’. Sometimes, with longer answers, it starts to forget the original question.
    
- [Are there any prime numbers whose digits sum to nine?](https://druce.ai/assets/2024/primes.png). The answer should be no, because according to my 6th-grade math, a number is divisible by 3 if the sum of its digits is divisible by 3. But ChatGPT bizarrely says ‘yes, because 3 is prime and 229 is prime.’
    
- [Give me the list of all the countries with ‘v’ in their name](https://druce.ai/assets/2024/countries.png). It missed El Salvador, Moldova. Sometimes if you ask it for countries starting with ‘V’ or ‘K’ it will miss Vietnam or Kenya.
    
- Ask ChatGPT to play Tic-Tac-Toe. It got better since 3.5 when it would cheat or go nuts, but doesn’t play optimally. If you ask it to play chess, it makes illegal moves and plays like a small child. It also [struggles with the NYT Spelling Bee game](https://www.engadget.com/if-ai-is-going-to-take-over-the-world-why-cant-it-solve-the-spelling-bee-170034469.html). Think about this before you ask it to perform a complex analytical task like picking stocks!
    

Here are prompts that work well:

- Write a limerick about hedge funds where the first letter of each verse spells ‘hedge’. It’s incredibly fluent with language understanding and generation, including complex nuances and connotations. There’s a reason it’s called a ‘language model’.
    
- Ask ChatGPT about a specific fact on a random Wikipedia page from [WikiRoulette](https://wikiroulette.co/). It will probably be correct if asked in a straightforward way, because it trained on Wikipedia. But it can also make stuff up, if you stray from questions where it has a well-grounded basis to answer from training or in the prompt. Also, if you introduce a small variation it may fail: [if it knows Tom Cruise’s mother is Mary Lee Pfeiffer, it might not know who Mary Lee Pfeiffer’s son is.](https://arxiv.org/abs/2309.12288)
    
- “What is this news story about? Classify it as belonging to national, local, business, politics, tech, sports.” Very good: classification tasks involving deep text understanding. Not as good: classifying over a very large list of classes, or doing regression to predict a numerical value, or unsupervised learning like clustering. ChatGPT’s training has nothing to do with these tasks.
    

ChatGPT is a poet, not a quant. It’s a [wordcel, not a shape rotator](https://www.vice.com/en/article/pkpqzb/ok-wtf-are-wordcels-and-shape-rotators). It understands ‘rhetorical logic’ but not causal reasoning. Its understanding of the world is skin deep, notwithstanding sophisticated, even beautiful language. It knows what it was told and does not have any ability to answer questions from first principles and data. Some demos of it performing complex tasks may be cherry-picked or overhyped or due to training data contamination with similar examples, and there may be less there than meets the eye.

Endowing a computer system with a gift for language and art and creativity somehow makes it less reliable and conscientious. It ‘hallucinates’. When it doesn’t know, it makes up something that sounds plausible. It is a [BS artist](https://link.springer.com/article/10.1007/s10676-024-09775-5). It is trained to make labelers happy with plausible-sounding answers, to speak for effect and not underlying truth, which is the [definition of BS](https://druce.ai/2023/09/bullshit).

So:

- Use generative AI for looking up, summarizing, and explaining reliable texts in your firm’s knowledge base, or obtained online from knowledgable authors and reliable sources. Ensure responses stay well-grounded by specifically instructing the AI to only use information from the sources you provide.
    
- Danger lurks if you ask it to follow a process, or algorithm, or chain of logic on its own that requires more than a couple of steps. If it’s a task that would benefit from writing a program, use the [code interpreter](https://www.zdnet.com/article/how-to-use-chatgpt-to-write-code/) or [advanced data analysis](https://www.zdnet.com/article/how-to-use-chatgpt-to-make-charts-and-tables-with-advanced-data-analysis/) functionality.
    
- Improve ChatGPT’s reasoning and reduce hallucinations by using [advanced prompting patterns](https://druce.ai/2024/01/prompting) like reflection (prompt ChatGPT to explain or criticize its own thought process and improve it) and chain-of-thought (give it a step-by-step thought process to follow and use to structure a response). Chain-of-thought prompting helps ChatGPT [get the ‘cranberry’ question correct](https://druce.ai/assets/2024/cranberry2.png). Use prompt helpers from the [GPT Store](https://gptstore.ai/gpts?q=prompt) to create and improve prompts. But clever prompts only go so far. Test your prompts systematically. If small variations change results dramatically during validation, beware. It will only get worse on unseen inputs in production.
    
- For retrieval tasks, use ChatGPT as a super hard-working, quick, but not-very-bright assistant who needs specific and detailed instructions, and sometimes might go off the rails anyway. Ask it to read a lot of text, and retrieve or summarize specific things in a specific way. Make sure the questions you ask can be answered directly by the text. Make sure the task passes the ‘intern test’: it’s something that you could explain to an untrained, unreliable assistant and if they just worked hard at it for a long time they would get it right with the knowledge base you provide. Use generative AI when you need [infinite indefatigable interns](https://www.wired.com/story/artificial-intelligence-labor/), even if they might be stoner interns.
    
- For creative tasks, use AI to generate a first draft of anything, ideating, editing and proofreading. But rely on it for no more than a first draft you can iterate on. Trust nothing, and verify everything. Ask it to provide sources and check the sources. Do not be that person who cuts-and-pastes [‘as a large language model’](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_ylo=2023&q=%22I%20am%20an%20AI%20language%20model%22+-chatgpt+-llm&btnG=) or [‘regenerate response’](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_ylo=2023&q=%22regenerate+response%22+-chatgpt+-llm&btnG=) into client-facing work product. (There used to be even more of these in Google Scholar but authors and publications have cleaned them up!) Review anything generated with ChatGPT closely and use it as an assistant and copilot, not as a proper primary, or secondary source. (Also review it for style, and edit it to say things you would say, eliminate ChatGPT’s tendency for throat-clearing, roundabout participial phrases, and unsual words like ‘delve’.)
    
- Read [Ethan Mollick](https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the-jagged). There is a ‘jagged frontier’ where ChatGPT performs superbly at some tasks but shockingly poorly at seemingly similar tasks that slightly exceed its grasp. Lose track of where you stand on the ‘jagged frontier’ at your peril. Supervise AI closely, and give it foolproof tools like the code interpreter for tasks that fall outside its reach.
    

This is the most important chart on applying AI in your daily work:

![Distribution of output scores on a consulting task, with and without AI.](https://druce.ai/assets/2024/mollick-curve.png)

Distribution of output quality across all the tasks. The blue group did not use AI, the green and red groups used AI. The red group got additional training on how to use AI.

This figure is from a [randomized control trial of Boston Consulting Group consultants, who were asked to perform a set of consulting-related tasks](https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf). Most users obtained significant benefits from AI tools despite their limitations. For many tasks, language understanding is enough to unlock big gains.

But beware of the left tail of this distribution. Some users who got trained probably became overconfident and put too much stock in bad, possibly hallucinatory responses, and scored much worse. Don’t get conned by AI and dumped on the left side of this distribution.

I don’t know who needed to hear this. But every day I see posts by people who are oblivious to ChatGPT’s limitations, and are suprised by terrible answers to counting or logic problems. Every day, I see news stories where foolish editors and reporters ask ChatGPT to perform tasks beyond its capabilities, like [predicting the price of Apple stock](https://finbold.com/heres-when-apple-stock-will-reach-300-according-to-chatgpt-4o/). Please, don’t do this. These people are not only making fools of themselves, but also teaching other people to be foolish. What they are doing is a modern-day version of reading tea leaves, or tarot cards, or the I Ching. People need to get real about what ChatGPT can and cannot do.

The ability of computers to understand and generate human language is an incredible advance. It makes computers great at tasks which experience tells us they shouldn’t be good at, like creating poetry and [music](https://suno.com/) and [art](https://www.midjourney.com/explore?tab=random) and [movies](https://lumalabs.ai/dream-machine) and [code](https://www.youtube.com/watch?v=LvGmVmHv69s). This ability is completely orthogonal to everything computers did well until now. There is work in progress to [better understand how LLMs work](https://openai.com/index/extracting-concepts-from-gpt-4/) and how their limitations arise, and resolving the impedance mismatch between wooly right-brain poet LLMs and hard-boiled left-brain Dr. Spock [data stores](https://paperswithcode.com/method/associative-lstm#:~:text=An%20Associative%20LSTM%20combines%20an,key%20and%20its%20associated%20content) and [coders](https://atlasiko.com/news/ai/google-alphacode-ai-code-generation/). As Sam Altman says, this is the [dumbest AI you will ever have to use](https://www.tomsguide.com/ai/chatgpt/gpt-4-is-the-dumbest-model-any-of-you-will-ever-have-to-use-declares-openai-ceo-sam-altman-as-he-bets-big-on-a-superingtelligence). But right now, ChatGPT is really bad at some tasks we normally expect computers to be really good at.

Don’t [become a bad example to others](https://www.npr.org/2023/12/30/1222273745/michael-cohen-ai-fake-legal-cases). Don’t fall victim to the [Eliza Effect](https://en.wikipedia.org/wiki/ELIZA_effect), where people mistake natural language algorithms for human-level understanding. It’s an easy mistake to make, because the only entity we ever encountered that could write as well as a human…_was_ a human. So it’s jarring to see ChatGPT write an Elizabethan sonnet in a flash, and then [stumble on problems a small child could solve](https://www.washingtonpost.com/science/2024/06/12/how-smart-ai-chatgpt-intelligent-child/).

If even Google engineers, who should know how [to make reliable systems using unreliable components](https://www.ben-evans.com/benedictevans/2024/6/8/building-ai-products), got caught out by [how much their Gemini LLM gets wrong](https://originality.ai/blog/worst-google-ai-responses), how careful do the rest of us have to be?

Move fast, don’t miss out on the opportunity for massive productivity gains from AI, but follow [common-sense rules](https://www.forbes.com/sites/peterhigh/2024/05/07/ethan-mollick-on-the-four-rules-of-co-intelligence-with-ai/?sh=600ff1a63004), and stay safe, my friends!

### Further reading:

[Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models](https://arxiv.org/pdf/2406.02061v1). See the References section of this paper for work showing LLMs failing to reason, to understand causality, to understand that ‘a is b’ implies ‘b is a’. Many good results in problem-solving tests turn out to depend on memorizing training data similar (or even identical) to test data.

[Sober AI is the Norm](https://www.dbreunig.com/2024/06/12/sober-ai-is-the-norm.html). Sober, boring AI: Doing what you’re already good at, faster and better, like Copilot, RAG, summarization. Pie-in-the-sky, high-as-a-kite-AI: Expecting AI to do your job for you, like Devin. Even self-driving vehicles might fall into the latter category. They might be OK in a Disney parking lot or Phoenix suburbs, but not in an adversarial environment like NYC or New Delhi where people will just walk in front of a car that will definitely stop, or [exploit hacks](https://www.theregister.com/2024/06/03/baidu_robotaxi_attack/?utm_source=substack&utm_medium=email), or hijack driverless truck cargo. You need different infrastructure and legal frameworks. Hype and one-upmanship dynamics and competitive accelerationism can get a little out of control and people can release things that aren’t ready or where they haven’t thought through all the implications."""

In [8]:
model = ChatOpenAI(model=MODEL)

system_template = """Act like a seasoned copy editor with 20 years of experience in enhancing 
manuscripts for publication. You have a proven track record of working with authors across 
various genres, helping them refine their writing to meet high editorial standards. Your 
meticulous attention to detail and comprehensive approach to editing ensure that every piece 
you work on is polished and ready for publication."""

user_template = """Objective: Perform an in-depth copy edit on the provided text, focusing on 
improving its overall quality and readability. Your goal is to identify and correct any issues, 
ensuring the text is clear, concise, and engaging. Address the following specific tasks in a 
step-by-step manner, returning the edited output text:

1. **Spelling Errors:** Identify and correct any spelling errors.
2. **Passive Voice:** Highlight instances of passive voice and suggest active voice alternatives.
3. **Punctuation:** Check for proper use of punctuation, including commas, semicolons, and dashes.
4. **Descriptive Language:** Suggest stronger and more descriptive adjectives and adverbs.
5. **Verb Tense Consistency:** Ensure consistency in verb tense throughout the document.
6. **Sentence Simplification:** Suggest shorter or simpler alternatives for complex sentences.
7. **Subject-Verb Agreement:** Check for subject-verb agreement errors.
8. **Conciseness:** Identify and remove any unnecessary words or phrases.
9. **Transitions:** Suggest better transitions between paragraphs or sections.
10. **Clarity:** Highlight any instances of unclear or confusing language.
11. **Sentence Structure:** Suggest alternative sentence structures to vary the flow of the writing.
12. **Capitalization and Formatting:** Check for proper capitalization and formatting of titles.
13. **Repetitive Language:** Highlight any instances of repetitive language or ideas.
14. **Engagement:** Suggest more engaging openings or conclusions.
15. **Apostrophes and Contractions:** Check for proper use of apostrophes and contractions.
16. **Homophones:** Identify and correct any misused homophones (e.g., their/there, its/it’s).
17. **Hyphens and Dashes:** Check for consistency in the use of hyphens and dashes.
18. **Descriptive Nouns and Verbs:** Suggest more descriptive nouns and verbs to replace adjectives and adverbs.
19. **Word Usage:** Highlight any instances of incorrect word usage (e.g., affect/effect).
20. **Strong Verbs:** Suggest stronger and more specific verbs to replace weaker ones.
21. **Quotation Marks:** Check for proper use of quotation marks and attribution.
22. **Repetition:** Suggest alternative word choices to avoid repetition.
23. **Formatting Consistency:** Highlight any instances of inconsistent formatting or spacing.
24. **Possessive Apostrophes:** Check for proper use of possessive apostrophes.
25. **Clarity Improvement:** Suggest alternative sentence structures to improve clarity.

Respond with the final output text only, without comment.

Take a deep breath and work on this problem step-by-step.

Text:
{input}
"""


prompt_template = ChatPromptTemplate.from_messages(
    [("system", system_template),
     ("user", user_template)]
)

parser = StrOutputParser()

chain = prompt_template | model | parser

start_time = datetime.now()

response = ""
# stream tokens as they are generated
for r in chain.stream({"input": INPUT}):
    print(r, end="")
    response += r
end_time = datetime.now()

difference = end_time - start_time
total_seconds = difference.total_seconds()
print(f"\n\nElapsed seconds: {total_seconds:.6f}")


Ask ChatGPT a simple question: [How many r’s are there in ‘cranberry’?](https://druce.ai/assets/2024/cranberry.png). It often answers 2 instead of 3.

ChatGPT can generate huge productivity gains but can also generate significant [reputational and career risks](https://www.npr.org/2023/12/30/1222273745/michael-cohen-ai-fake-legal-cases) when it makes embarrassing mistakes. Users should understand when and why it might mess up. Gather round if you want to dodge a ChatGPT disaster!

Here are more simple prompts that you might expect to work but do not:

- [How many letters are in this sentence: It was a bright cold day in April, and the clocks were striking thirteen.](https://druce.ai/assets/2024/1984.png) I got 60, should be 72.

- [Mary has 3 brothers and she also has 5 sisters. How many sisters does Mary’s brother have?](https://druce.ai/assets/2024/sisters.png) I got “same number of sisters as Mary, which is 5.”

- [Make a list of 50 cities that do not contain the letter ‘a’, or any 

This is the most important chart on applying AI in your daily work:

![Distribution of output scores on a consulting task, with and without AI.](https://druce.ai/assets/2024/mollick-curve.png)

Distribution of output quality across all the tasks. The blue group did not use AI, the green and red groups used AI. The red group got additional training on how to use AI.

This figure is from a [randomized control trial of Boston Consulting Group consultants, who were asked to perform a set of consulting-related tasks](https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7282.pdf). Most users obtained significant benefits from AI tools despite their limitations. For many tasks, language understanding is enough to unlock big gains.

But beware of the left tail of this distribution. Some users who got trained probably became overconfident and put too much stock in bad, possibly hallucinatory responses, and scored much worse. Don’t get conned by AI and dumped on the

In [12]:
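def clean_markdown(text):
    """Strip images, links, and HTML tags from markdown and return one-line
    text in which the original line breaks appear as <br>."""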
    # Create a translation table for removing junk characters
    junk_chars = dict.fromkeys(map(ord, '\xa0'), None)
    text = text.translate(junk_chars)
    
    # Remove image markdown patterns ![alt text](url)
    text = re.sub(r'!\[.*?\]\(.*?\)', ' ', text)

    # Remove link markdown patterns [visible text](url)
    text = re.sub(r'\[([^\]]+)\]\(.*?\)', r'\1 ', text)
    
    # remove remaining HTML tags (if any)
    text = re.sub(r'<.*?>', '', text)
    
    # Strip leading and trailing whitespace
    text = text.strip()
    
    text = text.replace('\n', '<br>')
    text = re.sub(r'(\s+)', ' ', text)
    
    return text

def compare_texts(original, modified, verbose=False):
    """Return an HTML character-level diff of two strings.

    Unchanged text is black, text added in `modified` is green, and text
    removed from `original` is red in square brackets.
    Doesn't work that great with markdown.
    """
#     original = remove_junk_characters(original)
    # Build the matcher; without it `diff` below is undefined
    diff = difflib.SequenceMatcher(None, original, modified)
    if verbose:
        print(original)
        print(modified)
    
    result = []
    for opcode, a0, a1, b0, b1 in diff.get_opcodes():
        if verbose:
            print(opcode, a0, a1, b0, b1)
        if opcode == 'equal':
            result.append(f'<span style="color: black;">{original[a0:a1]}</span>')
        elif opcode == 'insert':
            result.append(f'<span style="color: green;">{modified[b0:b1]}</span>')
        elif opcode == 'delete':
            # Text only in the original: mark it as removed, as in 'replace'
            result.append(f'<span style="color: red;">[{original[a0:a1]}]</span>')
        elif opcode == 'replace':
            result.append(f'<span style="color: green;">{modified[b0:b1]}</span>')
            result.append(f'<span style="color: red;">[{original[a0:a1]}]</span>')
    return ''.join(result)
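
The four opcodes handled above can be checked on a tiny pair of strings. This is a minimal sketch, not part of the original notebook: `rebuild_from_opcodes` is a hypothetical helper that replays the opcodes to rebuild the modified string, which shows which side each opcode draws its text from. It passes `autojunk=False`, an assumption made here so that long inputs are not affected by difflib's popular-element heuristic.

```python
import difflib

def rebuild_from_opcodes(original, modified):
    """Rebuild `modified` from `original` by replaying SequenceMatcher opcodes."""
    matcher = difflib.SequenceMatcher(None, original, modified, autojunk=False)
    pieces = []
    tags = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        tags.append(tag)
        if tag == "equal":
            pieces.append(original[a0:a1])   # unchanged text, identical in both
        elif tag in ("insert", "replace"):
            pieces.append(modified[b0:b1])   # new text comes from `modified`
        # "delete": original[a0:a1] is dropped, so nothing is appended
    return "".join(pieces), tags

rebuilt, tags = rebuild_from_opcodes("the cat sat", "the dog sat down")
print(tags)
print(rebuilt)
```

If the rebuilt string equals the modified string, the opcode ranges are being read correctly, and the same ranges can be wrapped in colored spans as `compare_texts` does.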


In [13]:
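def compare_texts2(text1, text2):
    """Return an HTML diff of two texts after cleaning their markdown.

    Text found only in text1 is shown in green parentheses; text found only
    in text2 is shown struck through in red square brackets.
    """
    text1 = clean_markdown(text1)
    text2 = clean_markdown(text2)
    matcher = difflib.SequenceMatcher(a=text1, b=text2)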
    cursor1, cursor2 = 0, 0
    final_out = ""

    for match in matcher.get_matching_blocks():
        # a,b = start of next location in match1, match2 respectively
        a, b = match.a, match.b
        # print mismatching sequence 1 in parens
        if cursor1 < a:
            out1 = f"{text1[cursor1:a]}"
            out1 = out1.strip()
            if out1:
                final_out += f"<text style=color:green>({out1})</text>"
            cursor1 = a
        # print mismatching sequence 2 in square brackets        
        if cursor2 < b:
            out2 = f"{text2[cursor2:b]}"
            out2 = out2.strip()
            if out2:
                final_out += f'<text style="color:red; text-decoration: line-through">[{out2}]</text>'
            cursor2 = b
        # print matching sequence
        final_out += f"{text1[match.a:match.a+match.size]}"
        cursor1 += match.size
        cursor2 += match.size

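    # print ending leftovers (normally not reached: get_matching_blocks()
    # ends with a zero-length sentinel block at (len(text1), len(text2)))
    if cursor1 < len(text1):
        final_out += f"({text1[cursor1:]})"
    if cursor2 < len(text2):
        final_out += f"[{text2[cursor2:]}]"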
        
    return final_out


In [14]:
# diff = compare_texts(INPUT[153:427], response[106:427])
diff = compare_texts2(INPUT, response)

display(HTML(diff))
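
Character-level matching tends to split words into fragments, which may be part of why the diff above is hard to read. One possible alternative, sketched here under assumptions, is to diff on word and whitespace tokens instead. `word_diff_html` is a hypothetical helper, not part of the original notebook. The `<del>`/`<ins>` tags and colors are illustrative choices, and `autojunk=False` is set so that frequent tokens such as spaces are not treated as junk on long inputs.

```python
import difflib
import html
import re

def word_diff_html(original, modified):
    """Word-level HTML diff: removed words in <del>, added words in <ins>."""
    # Split into words and whitespace runs so joining the tokens restores the text
    a = re.findall(r"\s+|\S+", original)
    b = re.findall(r"\s+|\S+", modified)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        old = html.escape("".join(a[a0:a1]))   # tokens from the original
        new = html.escape("".join(b[b0:b1]))   # tokens from the modified text
        if tag == "equal":
            out.append(old)
        if tag in ("delete", "replace"):
            out.append(f'<del style="color:red">{old}</del>')
        if tag in ("insert", "replace"):
            out.append(f'<ins style="color:green">{new}</ins>')
    return "".join(out)

sample = word_diff_html("It generates huge risks.",
                        "It generates significant risks.")
print(sample)
```

In the notebook the result could be shown with `display(HTML(...))` as in the cell above. Because the tokens are HTML-escaped, any conversion of newlines to `<br>` would have to happen after diffing rather than before.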