## Problems with extracting triplets and approaches to solving them

Even when the source from which we wish to extract triplets is plain text, we have noticed two problems: the text contains pronouns, and different people or organizations can have similar names. Here is an attempt to solve this. The input documents are Wikipedia pages on Hillary Clinton, Bill Clinton's presidency, and the Hillary Clinton email controversy, which give plenty of chances for name confusion.

The approach here is to have an LLM rewrite the text, replacing each pronoun and partial name with the correct full name from the context while changing as little else as possible.

In [None]:
from dotenv import load_dotenv
import json
import openai
import glob
from pathlib import Path
import os

## Read the Wikipedia text files and split them into chunks

In [None]:

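def get_wiki_data(fname):
    """Read a text file, split it into chunks, and create a per-file output directory."""
    with open(fname, "r", encoding="utf-8") as f:
        data_text = f.read()

    # Split where a period ends a line; the ".\n" delimiter itself is dropped from each chunk.
    chunks = data_text.split(".\n")

    data_path = Path(fname).parent
    name = Path(fname).name.split(".txt")[0]

    # Side effect: creates a directory named after the file, next to the file itself.
    dirpath = data_path / name
    os.makedirs(dirpath, exist_ok=True)
    return data_text, chunks, dirpath, name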
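
Splitting on `".\n"` has two side effects worth knowing about: each chunk loses its final period, and a file that ends with `".\n"` yields an empty last chunk. The sketch below shows this on a toy string (invented for illustration, not taken from the Wikipedia files) and includes a hypothetical helper, `split_into_chunks`, that drops blank chunks. It is one possible cleanup, not part of the notebook's pipeline.

```python
# Toy text, invented for illustration only.
sample = (
    "Clinton was born in Chicago.\n"
    "She attended Wellesley College.\n"
    "\n"
    "She later studied law.\n"
)

# The same split the notebook uses: the delimiter is removed from every chunk.
raw_chunks = sample.split(".\n")


def split_into_chunks(text):
    """Split on a period followed by a newline, then strip and drop blank chunks.

    Hypothetical helper: the notebook itself keeps the raw split result.
    """
    return [chunk.strip() for chunk in text.split(".\n") if chunk.strip()]


for chunk in split_into_chunks(sample):
    print(repr(chunk))
```

Dropping blank chunks avoids sending empty text to the model. Whether to restore the trailing period before prompting is a separate choice that this sketch leaves open.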


In [None]:
def get_prompt(text):
    return f"Given the text surrounded by triple backticks, please return the same text with minimal modifications but replacing every mention of a name or organization by its full name (as available) and also replace the pronouns by the full names of the entity which they refer to. If the full name corresponding to a pronoun or a name is not available, then leave it as it is rather than making up an answer. ```{text}```"
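
Because the prompt wraps the input in triple backticks, a model may echo those delimiters around its answer. Whether `gpt-4o` does so here has not been verified, so this is an assumption. A minimal sketch of a hypothetical helper, `strip_fence`, that removes one surrounding pair if present:

```python
FENCE = "`" * 3  # the triple-backtick delimiter used in the prompt


def strip_fence(reply):
    """Remove one pair of surrounding triple backticks, if both are present.

    Hypothetical helper: it does not handle a language tag after the opening fence.
    """
    text = reply.strip()
    has_both = text.startswith(FENCE) and text.endswith(FENCE)
    if has_both and len(text) >= 2 * len(FENCE):
        text = text[len(FENCE):-len(FENCE)]
    return text.strip()


print(strip_fence(FENCE + "Hillary Clinton spoke." + FENCE))
```

Text without surrounding fences passes through unchanged apart from whitespace trimming, so the helper would be safe to apply to every response.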


In [None]:
fnames = glob.glob("../data/wikipedia/data/wiki_text*.txt")

In [None]:
# Rewritten chunks are saved here, one JSON file per chunk.
dirpath = Path("data/rephrasing")
os.makedirs(dirpath, exist_ok=True)

# load_dotenv() reads a local .env file; the OpenAI client takes OPENAI_API_KEY from the environment.
load_dotenv()
client = openai.OpenAI(timeout=60)
for fname in fnames:
    print(f"{fname=}")
    data_text, chunks, _, name = get_wiki_data(fname)

    for i, chunk in enumerate(chunks):
        # Skip blank chunks, such as the empty string left after a trailing ".\n".
        if not chunk.strip():
            continue
        prompt = get_prompt(chunk)
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.0,
            top_p=0.3,
            messages=[{"role": "user", "content": prompt}],
        )
        out_name = f"{dirpath}/{Path(fname).name}_{i}.json"
        print(f"Writing to {out_name=} for chunk {i} of {len(chunks)}")
        # Store the original chunk next to the rewrite so the two can be compared later.
        results = dict(input=chunk, output=response.choices[0].message.content)
        with open(out_name, "w") as f:
            f.write(json.dumps(results, indent=4))
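
Since the prompt asks for minimal modifications, it can be useful to read the saved files back in chunk order and flag rewrites that drift far from their input. The sketch below assumes the file naming (`<source name>_<i>.json`) and the `input`/`output` keys used above. The helper names `sort_key`, `load_results`, and `similarity` are hypothetical, and the character-level ratio from `difflib` is only a rough proxy for "minimal change".

```python
import json
import re
from difflib import SequenceMatcher
from pathlib import Path


def sort_key(path):
    """Order by source name, then by numeric chunk index (so _2 sorts before _10)."""
    name = Path(path).name
    match = re.search(r"^(.*)_(\d+)\.json$", name)
    if match is None:
        return (name, -1)
    return (match.group(1), int(match.group(2)))


def load_results(dirpath):
    """Load every saved record in dirpath, sorted by source file and chunk index."""
    paths = sorted(Path(dirpath).glob("*.json"), key=sort_key)
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


def similarity(record):
    """Return a ratio in [0, 1]; values near 1 mean the rewrite stayed close to the input."""
    return SequenceMatcher(None, record["input"], record["output"]).ratio()
```

A call such as `load_results("data/rephrasing")` returns the records in reading order. Any threshold on `similarity` for flagging a chunk would need tuning by hand, since replacing many pronouns with full names legitimately lowers the ratio.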