# KW eval - precision
* Inspired by [RAGAS](https://github.com/explodinggradients/ragas)
* Uses a GPT-4 evaluation prompt to measure keyword extraction precision.
* The key point is that this is all done without any golden set or human annotation.
* KeyBERT scores 0.5 precision; however, this is without any cut-off set.
* GPT-4 with a very simple prompt scores 0.786 (a cut-off is not applicable).
* Next steps include implementing recall and F1 scores, then perhaps a new custom score.
* Once these are in place, it will be interesting to test various KW extraction prompts and see how they perform.
* The biggest drawback of this type of evaluation is that it is slow (roughly 15 to 19 seconds per document in the runs below) and can be expensive. Fine-tuning an open-source model may resolve this in the near future.
* Also, while I was able to get the precision prompt to work reasonably well, it is difficult to instruct it to do exactly what I want in the evaluation.
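
Recall and F1 are not implemented in this notebook. As a minimal sketch, assuming a per-document recall value becomes available from a future prompt, F1 would combine the two scores as their harmonic mean. The function name `f1_score` and the example values are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values for illustration only: recall is not measured in this notebook.
example_f1 = f1_score(precision=0.5, recall=0.5)
print(example_f1)
```

A low value on either side pulls F1 down, which is why it is usually preferred over a plain average of the two.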

## Prep

In [8]:
# !pip install keybert

In [9]:
# !pip install nltk

In [10]:
# !pip install pandas

In [11]:
# !pip install langchain

In [12]:
# !pip install python-dotenv

In [13]:
# !pip install openai

In [14]:
# !conda install -y fastparquet

In [15]:
import os
import pandas as pd
import random
from tqdm import tqdm
import ast
from random import sample
from datetime import datetime

In [16]:
# use dotenv to set environment variables.
from dotenv import load_dotenv
load_dotenv(override=True)

False

In [17]:
# Import the LangChain libraries
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI

In [38]:
from dotenv import load_dotenv
load_dotenv(override=True)

import openai
openaikey = os.getenv('OPENAI_API_KEY')
openai.api_key = openaikey

## Precision metric

### Prompt

In [19]:
kw_precision_prompt = """Given a document and proposed keywords, identify the good and bad keywords following these steps. 

Steps
* start by generating the main keyword phrases following the below guidlines.
* Compare the proposed keywords with the generated keywords.
* Mark as good any keywords that are very close matches and for these good:"yes"
* Read the guidlines <guidlines>.
* Use the guildines to determine which proposed keywords are good vs bad.
* Take a breadth and reevaluate your decisions to make sure they are correct.
* Return the results of your consternation.

<guidlines>
General:
In some cases, all keywords will be bad. If there aren't any good keywords that fully conform to all guidelines then reject all of them. Make sure all proposed keywords are concise, clear, and unambiguous keywords rather than phrases or fragments. You must compare each keyword to the ideal keyword that should be assigned to this document using the guidelines.

Features of Good Keywords for Document Classification:
- Relevance: They should directly relate to the core content or theme of the document. Relevant keywords ensure that the document is correctly categorized and easily retrievable. When assigning keywords, ensure they reflect the primary focus and recurring themes of the document, rather than isolated incidents or secondary details.
- Contextual Relevance: Ensure the keywords capture the main theme or unique aspects of the document. They should reflect the core message or the most significant information presented. For instance, if the document discusses the impact of vermin on grain stocks, a keyword like "vermin impact on agriculture" would be contextually relevant.
- Specificity: Keywords should be specific enough to distinguish the document's unique aspects. Generic terms may lead to misclassification or difficulty in locating the document among numerous others. Choose keywords that are specific to the content of the document. Avoid broad or generic terms that could apply to many different contexts.
- Consistency: Using standardized terms within a given classification system or subject area promotes uniformity. This aids in linking similar documents and facilitates systematic retrieval.
- Clarity: Keywords should be clear and unambiguous. This minimizes confusion and ensures that users with different levels of expertise or background knowledge can understand and use them effectively.
- Comprehensiveness: They should cover all major themes or topics in the document. This ensures that the document can be found under all relevant searches.
- Conciseness: While being comprehensive, keywords should also be concise, avoiding overly long or complex phrases that may not be commonly used or searched for.
- Predictive Power: Good keywords anticipate the terms users might use when searching for that type of document, aligning with potential search queries.
- Balance Specificity with Breadth: While being specific, also ensure that the keywords are not so narrow that they miss other important aspects of the document. They should collectively provide a balanced overview of the document’s content.


Features of Bad Keywords for Document Classification:
- Too Broad or Vague: Keywords that are overly general can lead to a document being lost among too many irrelevant search results, making it hard to locate.
- Too Obscure or Technical: Using highly specialized or niche terms that are not widely recognized can make a document inaccessible to a broader audience.
- Irrelevant: Keywords that don't accurately reflect the document's content can mislead users and result in poor retrieval accuracy.
- Inconsistent Usage: Using different terms for the same concept across documents can lead to fragmentation and difficulty in retrieving all relevant documents on a topic.
- Redundancy: Overlapping or repetitive keywords add no value and can clutter the classification system. Avoid Redundancy. Don't use multiple keywords that convey the same idea. Each keyword should add new information or a different perspective on the document's content.
- Ambiguity: Ambiguous terms can lead to misinterpretation and misclassification, especially in documents covering multiple topics.
- Too Lengthy: Long phrases or sentences used as keywords can complicate the search process and are often overlooked in favor of shorter, more direct terms. Choose keywords that unambiguously relate to the document’s content. Avoid words that have multiple meanings or could be easily misinterpreted out of context.
</guidlines>

Document: Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias)
Proposed keywords: supervised learning, supervised, signal supervised, labeled training

Accuracy:
[
    {{  "keyword":"supervised learning"
        "reason": "This is the main topic of this document. The document is a definition of this keyword",
        "good": "Yes"
    }},
    {{  "keyword":"supervised"
        "reason": "This single word is ambiguous, lacks precision, clarity, and specificity. It could be easily misinterpreted.
        "good": "No"
    }},
    {{  "keyword":"signal supervised"
        "reason": "This is not a normally used phrase. It's a sentence fragment.
        "good": "No"
    }},
    {{  "keyword":"Training data"
        "reason": "This is described in the document within the context of supervised learning. It is clear, specific and non-ambiguous, and relevant to the document.",
        "good": "Yes"
    }},
]

Document: The 2022 ICC Men's T20 World Cup, held from October 16 to November 13, 2022, in Australia, was the eighth edition of the tournament. Originally scheduled for 2020, it was postponed due to the COVID-19 pandemic. England emerged victorious, defeating Pakistan by five wickets in the final to clinch their second ICC Men's T20 World Cup title.
Proposed keywords: World Cup, 2020
Accuracy:
[
	{{  "keyword":"2020"
        "reason": "2020 is a year and is not the subject of this document and as such while it is mentioned in it is not a the topic of this document..",
        "good": "No"
    }},
    {{  "keyword":"World cup"
        "reason": "This is the main topic of the document.",
        "good": "Yes"
    }},
]

Document:"HIGHER 1986 PROFIT FOR DUTCH CHEMICAL FIRM DSM
The fully state-owned Dutch
chemical firm NV DSM &lt;DSMN.AS> said its 1986 net profit rose to
412 mln guilders from 402 mln in 1985, while turnover fell to
17.7 billion guilders in 1986 from 24.1 billion in 1985.
  The company said 1986 dividend, which will be paid to the
Dutch state in its capacity of the firm's sole shareholder,
would be raised to 98 mln guilders from 70 mln guilders in
1985.
  In an initial comment on its 1986 results, DSM said the
drop in 1986 turnover had been caused mainly by losses in the
company's fertilizer division."
Proposed keywords: 1986 profit, net profit, 1986 dividend, nv dsm, firm dsm'
Accuracy: [
    {{  "keyword":"1986 profit",
        "reason": "This is a main topic of the document. The document discusses the profit of the company in 1986.",
        "good": "No"
    }},
    {{  "keyword":"net profit",
        "reason": "This is a main topic of the document. The document discusses the net profit of the company.",
        "good": "No"
    }},
    {{  "keyword":"1986 dividend",
        "reason": "This is a main topic of the document. The document discusses the dividend of the company in 1986.",
        "good": "No"
    }},
    {{  "keyword":"nv dsm",
        "reason": "This is the name of the company being discussed in the document. It is relevant and specific.",
        "good": "No"
    }},
    {{  "keyword":"firm dsm",
        "reason": "This is a repetition of the company name and does not add any new information. It is redundant.",
        "good": "No"
    }},
]

Document:"{doc_text}"
Proposed keywords:{extracted_keywords}
Accuracy:"""
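
The doubled braces in the few-shot examples matter: the template is filled with Python-style formatting, so `{{` and `}}` render as literal braces while `{doc_text}` and `{extracted_keywords}` are substituted. A minimal sketch of that behavior, using plain `str.format` as a stand-in for the LangChain template (an assumption) and a shortened, hypothetical template:

```python
# Shortened, hypothetical template that mirrors the structure of kw_precision_prompt.
mini_template = """Accuracy:
[
    {{  "keyword":"supervised learning",
        "good": "Yes"
    }},
]

Document:"{doc_text}"
Proposed keywords:{extracted_keywords}
Accuracy:"""

# Doubled braces become single literal braces; named fields are filled in.
rendered = mini_template.format(
    doc_text="Metro Funding shareholders approved its merger.",
    extracted_keywords="merger, shareholders",
)
print(rendered)
```

Forgetting to double a brace in an example would make the formatter treat it as a placeholder and fail when the template is rendered.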

### Helpers

In [20]:
def get_kws_precision(completion):
    """Parse a precision-prompt completion and return the keyword precision.

    The completion is expected to be a list of per-keyword judgements.
    Precision is the number of 'Yes' judgements divided by the total keyword count.
    Returns -1 when the completion is empty or cannot be parsed."""
    try:
        # literal_eval only parses literals, so it is safer than eval on LLM output.
        judges = ast.literal_eval(completion)
    except (ValueError, SyntaxError):
        return -1
    if len(judges) == 0:
        return -1
    goods = [judge['good'] for judge in judges]
    goods_count = goods.count('Yes')
    precision = goods_count / len(goods)
    return precision

# get_kw_precision(response)
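
To make the parsing step concrete, here is a small self-contained sketch that scores a hand-written completion in the shape the prompt asks for. The sample completion is made up for illustration, and `ast.literal_eval` is used because it only parses literals and never executes code.

```python
import ast

# Hand-written sample in the format the precision prompt requests (not real model output).
sample_completion = """[
    {  "keyword":"merger",
        "reason": "Main topic of the document.",
        "good": "Yes"
    },
    {  "keyword":"company",
        "reason": "Too broad.",
        "good": "No"
    },
]"""

judgements = ast.literal_eval(sample_completion)  # list of dicts, one per keyword
goods = [j["good"] for j in judgements]
sample_precision = goods.count("Yes") / len(goods)  # share of keywords judged good
print(sample_precision)
```

Note that the count is case-sensitive, so a completion that answers "yes" instead of "Yes" would be scored as a bad keyword.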

In [21]:
def get_precisions_judgements(df):
    """Run the precision prompt and process the "good" judgements into a precision column.
    Requires 'doc_text' and 'keywords' columns."""
    
    # Run the precision evaluation prompt.
    completions = []
    for row in tqdm(df.itertuples()):
        keywords = ', '.join([kw for kw, rank in row.keywords])
        try:
            response = precision_llm_chain.run(doc_text=row.doc_text, extracted_keywords=keywords)
        except Exception as e:
            print(e)
            response = ''
        completions.append(response)

    df['completions'] = completions
    
    # Process responses into precision column.
    precisions = [get_kws_precision(c) for c in completions]
    df['precision'] = precisions
    

In [22]:
def gen_run_precision(df):
    # Calculate the average precision.
    precisions = [p for p in df.precision.tolist() if p != -1]
    precision_avg = sum(precisions) / len(precisions)
    precision_avg = round(precision_avg, 3)
    
    ## Harmonic mean of precision.
    from scipy.stats import hmean
    # precisions = [get_kws_precision(c) for c in completions if c!=-1]
    precisions_harmonic_mean = round(hmean(precisions), 3)
    
    return precision_avg, precisions_harmonic_mean
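
The two summary numbers differ because the harmonic mean is pulled toward the lowest per-document scores. A small sketch with made-up precisions, using the standard library's `statistics` module rather than `scipy` (an assumption made only to keep the example dependency-free):

```python
from statistics import harmonic_mean, mean

# Made-up per-document precisions; -1 marks a completion that could not be scored.
raw_precisions = [1.0, 0.5, 0.25, -1]
valid = [p for p in raw_precisions if p != -1]

arithmetic = round(mean(valid), 3)
harmonic = round(harmonic_mean(valid), 3)
print(arithmetic, harmonic)
```

A wide gap between the two values suggests that a few documents score much worse than the rest.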

In [41]:
## Set up the llm prompt chat chain.
#

model_name = 'gpt-4'
precision_llm = ChatOpenAI(model=model_name, temperature=0.25)
precision_llm_chain = LLMChain(
    llm=precision_llm,
    prompt=PromptTemplate.from_template(kw_precision_prompt)
)

## Datasets
* Load a few news datasets for use in manual evaluation prompt creation and eventual testing.

### Reuters

In [42]:
# from sklearn.datasets import fetch_20newsgroups
# newsgroups = fetch_20newsgroups(subset='all')
# print(newsgroups.target_names[newsgroups.target[0]])
# print(newsgroups.data[0])

In [43]:
# Reuters - has some longer docs.
import nltk
nltk.download('reuters')
from nltk.corpus import reuters

[nltk_data] Downloading package reuters to /home/jupyter/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


In [44]:
# Load all Reuters docs.
reuters_docs = []
for doc_id in reuters.fileids():
    reuters_docs.append(reuters.raw(doc_id))

In [45]:
# Sample reuters.
random.seed(42)
doc_ids = random.sample(reuters.fileids(), 20)
reuters_samp = [reuters.raw(doc_id) for doc_id in doc_ids]

In [46]:
len(reuters_samp)

20

In [47]:
print(reuters_samp[8])

METRO FUNDING SHAREHOLDERS APPROVE MERGER
  &lt;Metro Funding Corp> said its
  shareholders approved its merger into &lt;Maxcom Corp> and its
  change of incorporation from Nevada to Delaware.
      Metro Funding also said its subsidiary, Comet Corp, will be
  renamed Maxcom USA.
      The company also reported shareholders approved the
  authorization of 500,000 shares of common stock to be set aside
  for an incentive stock option plan.
  




## KW evals

### KeyBert - eval

In [48]:
# Prep for kw extraction.
from keybert import KeyBERT
kw_model = KeyBERT()



In [49]:
# Extract KWs.
docs_with_kws = []
for doc in reuters_samp:
    kws = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
    docs_with_kws.append((doc, kws))

In [51]:
# convert output to df for convenience.
df_keybert = pd.DataFrame(docs_with_kws, columns=['doc_text', 'keywords'])
df_keybert.head(5)

| | doc_text | keywords |
|---|---|---|
| 0 | PAN ATLANTIC RE INC &lt;PNRE> 4TH QTR NET\n O... | [(pan atlantic, 0.4595), (year net, 0.4489), (... |
| 1 | OECD SEES GERMAN GROWTH HIT BY LOW DOMESTIC DE... | [(german economic, 0.5334), (german economy, 0... |
| 2 | YUGOSLAVIA TO TENDER FOR 100,000 TONNES WHEAT\... | [(wheat yugoslavia, 0.7061), (soft wheat, 0.53... |
| 3 | TRANSDUCER SYSTEMS INC YEAR\n Shr profit 12 c... | [(shr profit, 0.4864), (transducer, 0.4716), (... |
| 4 | SWISS MONEY MARKET PAPER YIELDS 3.286 PCT\n T... | [(swiss francs, 0.5559), (swiss money, 0.5137)... |


In [52]:
# Print out a few docs for manual prompt iteration.
for doc, kws in docs_with_kws[:1]:
    print('---------- Document ----------')
    kws_str = ', '.join([k[0] for k in kws])
    print(f"KEYWORDS: {kws_str}\n")
    print(doc)

---------- Document ----------
KEYWORDS: pan atlantic, year net, net oper, oper net, investment

PAN ATLANTIC RE INC &lt;PNRE> 4TH QTR NET
  Oper shr 15 cts vs 1.07 dlrs
      Oper net 372,000 vs 2,601,000
      Year
      Oper shr 80 cts vs 61 cts
      Oper net 1,952,000 vs 1,491,000
      NOTE: Net excludes realized investment loss 13,000 dlrs vs
  gain 986,000 dlrs in quarter and gains 1,047,000 dlrs vs
  1,152,000 dlrs in year.
      1986 year net excludes tax credit 919,000 dlrs.
  




In [53]:
# Run the precision evaluation.
get_precisions_judgements(df_keybert)
gen_run_precision(df_keybert)

20it [05:00, 15.02s/it]


(0.5, 0.397)

In [54]:
df_keybert.head(1)

| | doc_text | keywords | completions | precision |
|---|---|---|---|---|
| 0 | PAN ATLANTIC RE INC &lt;PNRE> 4TH QTR NET\n O... | [(pan atlantic, 0.4595), (year net, 0.4489), (... | [\n { "keyword":"pan atlantic",\n "... | 0.4 |


In [55]:
# Persist results.
kw_method = 'keybert'
timestamp = datetime.now().strftime("%Y-%m-%d-%H%M%S")
fpath = f'data_output/kw_precision_{kw_method}_{timestamp}.parq'
df_keybert.to_parquet(fpath)

In [1]:
# Print out a few docs for manual prompt iteration.
for row in df_keybert.itertuples():
    print('---------- Document ----------')
    print(f"PRECISION: {row.precision}")
    kws_str = ', '.join([k[0] for k in row.keywords])
    print(f"KEYWORDS: {kws_str}")
    print(f"DOCUMENT: {row.doc_text}")
    print(f"COMPLETION: {row.completions}")

### LLM kws extraction - eval

#### Extract KWs using GPT-4.

In [56]:
kw_extract_prompt = """Identify the main categories of this Reuters news story. These should be unambiguous phrases. Output the results in a Python list.

{doc_text}"""

In [57]:
## Set up the llm prompt chat chain.
#

model_name = 'gpt-4'
kw_extract_llm = ChatOpenAI(model=model_name, temperature=0.00)
kw_extract_chain = LLMChain(
    llm=kw_extract_llm,
    prompt=PromptTemplate.from_template(kw_extract_prompt)
)

In [58]:
def llm_extract_kws(df):
    """ Run kw extraction using the llm prompt. """
    # Run the keyword extraction prompt.
    completions = []
    for row in tqdm(df.itertuples()):
        try:
            response = kw_extract_chain.run(doc_text=row.doc_text)
        except Exception as e:
            print(e)
            # Fall back to an empty list literal so ast.literal_eval can still parse it.
            response = '[]'
        completions.append(response)

    df['kw_completions'] = completions

In [59]:
# Convert data to df.
df_gpt = pd.DataFrame(reuters_samp, columns=['doc_text'])
df_gpt.head(1)

| | doc_text |
|---|---|
| 0 | PAN ATLANTIC RE INC &lt;PNRE> 4TH QTR NET\n O... |


In [60]:
%%time
# Run the kw extraction.
llm_extract_kws(df_gpt)

20it [00:52,  2.63s/it]

CPU times: user 264 ms, sys: 21.3 ms, total: 285 ms
Wall time: 52.6 s





In [61]:
# process completions to keywords.
df_gpt['keywords'] = df_gpt['kw_completions'].apply(lambda x: ast.literal_eval(x))
df_gpt.head(1)

| | doc_text | kw_completions | keywords |
|---|---|---|---|
| 0 | PAN ATLANTIC RE INC &lt;PNRE> 4TH QTR NET\n O... | ["Business", "Finance", "Company Earnings", "I... | [Business, Finance, Company Earnings, Insuranc... |
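
`ast.literal_eval` raises if the model returns anything other than a clean Python list, for example a refusal or an empty string. A minimal, more forgiving sketch follows; the helper name `parse_kw_list` and the choice to return an empty list on failure are assumptions, not part of the original run.

```python
import ast

def parse_kw_list(completion):
    """Parse a model completion into a list of keyword strings, or [] on failure."""
    try:
        parsed = ast.literal_eval(str(completion).strip())
    except (ValueError, SyntaxError):
        # Not a valid Python literal (free text, empty string, truncated list).
        return []
    if not isinstance(parsed, (list, tuple)):
        # A valid literal of the wrong type, such as a bare string or a dict.
        return []
    return [str(kw) for kw in parsed]

print(parse_kw_list('["Business", "Finance", "Company Earnings"]'))
print(parse_kw_list("Sorry, I cannot help with that."))
```

It could be dropped into the `apply` call in place of the lambda, at the cost of silently producing documents with no keywords, which should then be excluded or counted separately.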


In [62]:
# Add a fake score so the keywords match KeyBERT's (keyword, score) format.
df_gpt['keywords'] = df_gpt.keywords.apply(lambda x: [(k, 1.0) for k in x])
df_gpt.head(1)

| | doc_text | kw_completions | keywords |
|---|---|---|---|
| 0 | PAN ATLANTIC RE INC &lt;PNRE> 4TH QTR NET\n O... | ["Business", "Finance", "Company Earnings", "I... | [(Business, 1.0), (Finance, 1.0), (Company Ear... |


In [64]:
# df_gpt.keywords.tolist()

#### Run precision eval

In [86]:
# df_samp = df_gpt.iloc[0:1].copy()
# df_samp

In [65]:
%%time
get_precisions_judgements(df_gpt)

20it [06:10, 18.51s/it]

CPU times: user 301 ms, sys: 30.7 ms, total: 332 ms
Wall time: 6min 10s





In [66]:
gen_run_precision(df_gpt)

(0.786, 0.716)

In [67]:
# Persist results.
kw_method = 'llm-gpt4-iter1'
timestamp = datetime.now().strftime("%Y-%m-%d-%H%M%S")
fpath = f'data_output/kw_precision_{kw_method}_{timestamp}.parq'
df_gpt.to_parquet(fpath)

In [68]:
df_gpt.head(1)

| | doc_text | kw_completions | keywords | completions | precision |
|---|---|---|---|---|---|
| 0 | PAN ATLANTIC RE INC &lt;PNRE> 4TH QTR NET\n O... | ["Business", "Finance", "Company Earnings", "I... | [(Business, 1.0), (Finance, 1.0), (Company Ear... | [\n { "keyword":"Business",\n "reas... | 0.4 |


In [70]:
# Print out a few docs for manual prompt iteration.
for row in df_gpt.itertuples():
    print('---------- Document ----------')
    print(f"PRECISION: {row.precision}")
    kws_str = ', '.join([k[0] for k in row.keywords])
    print(f"KEYWORDS: {kws_str}")
    print(f"DOCUMENT: {row.doc_text}")
    print(f"COMPLETION: {row.completions}")

---------- Document ----------
PRECISION: 0.4
KEYWORDS: Business, Finance, Company Earnings, Insurance Industry, Investment
DOCUMENT: PAN ATLANTIC RE INC &lt;PNRE> 4TH QTR NET
  Oper shr 15 cts vs 1.07 dlrs
      Oper net 372,000 vs 2,601,000
      Year
      Oper shr 80 cts vs 61 cts
      Oper net 1,952,000 vs 1,491,000
      NOTE: Net excludes realized investment loss 13,000 dlrs vs
  gain 986,000 dlrs in quarter and gains 1,047,000 dlrs vs
  1,152,000 dlrs in year.
      1986 year net excludes tax credit 919,000 dlrs.
  


COMPLETION: [
    {  "keyword":"Business",
        "reason": "This is a broad term that could apply to many documents. It lacks specificity and does not directly relate to the core content of the document.",
        "good": "No"
    },
    {  "keyword":"Finance",
        "reason": "This is a broad term that could apply to many documents. It lacks specificity and does not directly relate to the core content of the document.",
        "good": "No"
    },
    {  "ke