# Prompt Engineering ChatGPT to Parse Articles for Information
The purpose of this notebook is to demonstrate prompt engineering using the OpenAI API.

The following can be used to install the required libraries and set an OpenAI API key as an environment variable in the terminal:

The following Python code imports the required modules and reads the API key. Keep in mind that Windows may require a restart before the system environment variable takes effect.

In [1]:
import openai
import os
from newspaper import Article
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

openai.api_key = os.getenv('OPENAI_API_KEY')

The following function sends the input prompt to the gpt-3.5-turbo model and returns the response:

In [2]:
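def get_completion(prompt, model="gpt-3.5-turbo"):
    # Wrap the prompt as a single user message.
    messages = [{"role": "user", "content": prompt}]
    # openai.ChatCompletion is the interface in openai versions before 1.0.
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,  # minimize randomness so repeated runs are comparable
    )
    return response.choices[0].message["content"]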
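
The helper above relies on the interface from `openai` versions before 1.0. As a minimal sketch, assuming a newer `openai` package is installed, the same call could be written against a client object. The name `get_completion_v1` is hypothetical, and the client is passed in as an argument so the function can be tried with a stand-in object before spending API calls.

```python
def get_completion_v1(prompt, client, model="gpt-3.5-turbo"):
    """Variant of get_completion for the client object used by openai>=1.0.

    `client` is assumed to be an `openai.OpenAI()` instance (or any object
    exposing the same `chat.completions.create` method).
    """
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # same setting as the original helper
    )
    # In the 1.x client the message is an object, so use attribute access.
    return response.choices[0].message.content


# Usage (assumes openai>=1.0 and OPENAI_API_KEY set in the environment):
# from openai import OpenAI
# print(get_completion_v1("Say hello.", OpenAI()))
```

The rest of this notebook keeps using `get_completion` as defined above.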

The URL of the article to parse is set below.

In [3]:
url = 'https://www.cnn.com/2023/06/12/us/utah-mom-husband-killing-court-documents/index.html'

In [4]:
article = Article(url)
article.download()
article.parse()

In [27]:
print(article.text)

CNN —

A judge denied the pretrial release of Kouri Richins – a Utah widow accused of killing her husband before she authored a children’s book about grief – saying she must remain in custody pending the outcome of her trial due to the “substantial evidence” against her.

The judge’s decision to deny bail came as new details about the widow’s alleged search history emerged as part of the case against the 33 year old, who appeared in a Park City, Utah, court on Monday.

“What is a lethal dose of fentanyl” was one of many phone searches that investigators say were made by Kouri Richins. Prosecutors allege she killed Eric Richins, her husband of nine years, with a lethal dose of fentanyl. She faces charges of criminal homicide, aggravated murder and three counts of possession of a controlled substance with intent to distribute. She has not yet entered a plea.

In court Monday, Judge Richard E. Mrazik ruled Kouri Richins to be held without bail and cited the severity of the punishment of a

In [28]:
print(article.title)

Kouri Richins: New details emerge about the alleged search history of the Utah mom charged with her husband's murder


Now for a simple prompt to interact with the model.

In [9]:
prompt = f'''
Create a summary of 20 words or less to describe the article in the following text.
```{article.text}```
'''

response = get_completion(prompt)
print(response)

A Utah widow accused of killing her husband and authoring a children's book about grief has been denied pretrial release.
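
The prompt asked for 20 words or less, and a constraint like that is easy to check in code rather than by eye. The sketch below uses hypothetical helper names and counts whitespace-separated words, which is only one possible definition of a word.

```python
def word_count(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())


def within_word_limit(text, limit=20):
    """Return True when the text has at most `limit` words."""
    return word_count(text) <= limit


# The summary printed above, copied in as a plain string.
summary = (
    "A Utah widow accused of killing her husband and authoring a "
    "children's book about grief has been denied pretrial release."
)
print(word_count(summary), within_word_limit(summary))
```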


So what other uses may there be for extracting information from an article?

News sites generally sort their stories into sections such as Health, Politics, or Global News. While this is likely done before a story is submitted, a section may later be subdivided or a new one created, and some articles then need additional, more specific categorization. A real-world example: when the invasion of Ukraine became a war, some articles that were once "World News" should also have been categorized under "War on Ukraine".

Another potential use is referencing an ongoing story when relevant. Follow-up articles are common, and linking them to the original story based on context could be automated. It may also be useful to recommend similar stories, or stories about nearby locations, but it is important to distinguish a similar type of story from a follow-up.

Finally, another metric to look at is the general mood of the story. While it is important for people to be aware of events that are impacting lives, a constant stream of stories about tragic or gruesome events can have a negative impact. People may have off days or simply need a palate cleanser in the form of a feel-good story.

Put plainly, it is useful to have context-based instructions that parse keywords from the article. I will try a simple prompt to do this:

In [5]:
prompt = f'''
Identify some keywords from this article:
```{article.text}```
'''

response = get_completion(prompt)
print(response)

Utah, Kouri Richins, Eric Richins, fentanyl, criminal homicide, aggravated murder, possession of a controlled substance, bail, judge, phone searches, financial history, victim impact statement, children's book, grief, internet searches, life insurance, police investigations, electronic devices, joint accounts, illicit fentanyl purchases, autopsy, toxicology report, medical examiner, witness testimony, forged documents, durable power of attorney, estate planning, trustees, affair.


The output of keywords is a mess. They are all relevant to the article, but some less so than others.

The goal is now to pull out the most relevant keywords to this story. To do this effectively, the instructions need to be specific enough to work on this example but general enough to work on a larger scale.

In [6]:
prompt = f'''
Identify the following information for the article below.
1. Subject
2. Location
3. Category
4. People
5. Keywords: 10 max
6. Mood
Article: ```{article.text}```
'''
response = get_completion(prompt)
print(response)

1. Subject: Kouri Richins' pretrial release denied in husband's murder case
2. Location: Park City, Utah, USA
3. Category: Crime
4. People: Kouri Richins, Eric Richins, Judge Richard E. Mrazik, Skye Lazaro, Amy Richins, Matt Throckmorton, Kristal Bowman-Carter
5. Keywords: Kouri Richins, Eric Richins, murder, fentanyl, children's book, grief, bail, pretrial release, phone searches, financial history, victim impact statement, autopsy, toxicology report, forged documents, estate planning, trust
6. Mood: Serious, somber.


Good start, but it is still messy. The prompt will be refined to get the model to consistently put out keywords, and importantly to use the same keywords each time. Note how, in the examples below, subtly changing a few words encourages different outputs, and the output does not always change where the prompt was changed. It helps to be particular and specific.

In [7]:
prompt = f'''
Identify the following information for the article below.

A. Explain up to five events that happened in short sentences.
B. Subject
C. Location
D. Category
E. People
F. Mood
G. Determine 5 unique keywords to identify the article.

Article: ```{article.text}```
'''

response = get_completion(prompt)
print(response)

A. Kouri Richins denied pretrial release, new details about her alleged search history emerged, judge ruled Kouri Richins to be held without bail, prosecutors called on several expert witnesses to testify, and allegations of forged documents were made.

B. Crime

C. Utah, United States

D. News

E. Kouri Richins, Eric Richins, Judge Richard E. Mrazik, Skye Lazaro, Amy Richins, Matt Throckmorton, and Kristal Bowman-Carter.

F. Serious

G. Kouri Richins, Utah, murder, fentanyl, children's book.


In [9]:
prompt = f'''
Identify the following information for the article below.

A. Explain 10 or less events that happened in short sentences.
B. Topic
C. Category from list: News, Health, World News, Government, Entertainment.
D. News subject
E. Exact location
F. All people mentioned, ordered by relevance to the story
G. Mood

Article: ```{article.text}```
'''
response = get_completion(prompt)
print(response)

A. Kouri Richins denied pretrial release, judge cites substantial evidence against her, incriminating internet searches found on her phone, expert witnesses testify about phone records and financial history, victim impact statement made by Eric Richins' sister, Kouri Richins has right to file expedited appeal within 30 days, allegations of forged documents. 

B. Utah widow accused of killing her husband before writing a children's book about grief. 

C. News 

D. Criminal homicide, aggravated murder, and possession of a controlled substance with intent to distribute. 

E. Park City, Utah 

F. Kouri Richins, Eric Richins, Amy Richins, Skye Lazaro, Judge Richard E. Mrazik, Matt Throckmorton, Kristal Bowman-Carter. 

G. Serious, somber.


In [16]:
prompt = f'''
Identify the following information for the article below.

A. Explain 10 or less events that happened in short sentences.
B. Topic
C. Category from list: News, Health, World News, Government, Entertainment.
D. News Subcategory
E. Exact location and country
F. List all of the people mentioned in the article
G. Overall mood

Article: ```{article.text}```
'''

response = get_completion(prompt)
print(response)

A. Kouri Richins denied pretrial release, judge cites substantial evidence against her, incriminating internet searches found on her phone, allegations of forged documents, victim impact statement made by Eric Richins' sister, Kouri Richins has not entered a plea, court will reconvene on June 22 for a scheduling conference.
B. Utah widow accused of killing her husband before writing a children's book about grief.
C. News
D. Crime
E. Park City, Utah, United States
F. Kouri Richins, Eric Richins, Amy Richins, Skye Lazaro, Judge Richard E. Mrazik, Matt Throckmorton, Kristal Bowman-Carter
G. Serious, somber.
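
Since the same prompt is reused on several articles, it could be assembled by a small helper instead of being copied into each cell. This is only a sketch: `build_prompt` and `FIELDS` are hypothetical names, and the field wording is copied from the prompt above.

```python
FENCE = "`" * 3  # the triple-backtick delimiter placed around the article text

# Field descriptions copied from the prompt above, in order A through G.
FIELDS = [
    "Explain 10 or less events that happened in short sentences.",
    "Topic",
    "Category from list: News, Health, World News, Government, Entertainment.",
    "News Subcategory",
    "Exact location and country",
    "List all of the people mentioned in the article",
    "Overall mood",
]


def build_prompt(article_text, fields=FIELDS):
    """Assemble the lettered extraction prompt around one article's text."""
    lettered = [f"{chr(ord('A') + i)}. {field}" for i, field in enumerate(fields)]
    return (
        "\nIdentify the following information for the article below.\n\n"
        + "\n".join(lettered)
        + f"\n\nArticle: {FENCE}{article_text}{FENCE}\n"
    )


print(build_prompt("Example article text."))
```

With a helper like this, changing a single field description means editing one list entry, which makes it easier to see which wording change caused which change in the output.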


Finally, reference a different article and see if the output is similar.

In [18]:
url1 = 'https://www.cnn.com/2023/06/14/style/monet-paint-protest-stockholm-climate/index.html'
article1 = Article(url1)
article1.download()
article1.parse()

In [19]:
prompt = f'''
Identify the following information for the article below.

A. Explain 10 or less events that happened in short sentences.
B. Topic
C. Category from list: News, Health, World News, Government, Entertainment.
D. News Subcategory
E. Exact location and country
F. List all of the people mentioned in the article
G. Overall mood

Article: ```{article1.text}```
'''

response = get_completion(prompt)
print(response)

A. Two activists smeared red paint and glued their hands to a Monet painting at Stockholm's National Museum. Police were called to the scene and arrested two women. The painting is being inspected for damage.
B. Activists vandalize Monet painting at Stockholm's National Museum.
C. News
D. Art Vandalism
E. Stockholm, Sweden
F. Claude Monet, Per Hedström
G. Concerned


The results are not exactly as wanted: "Art Vandalism" should be "Crime", and the category should be "World News" since this is a U.S.-based news source. The model cannot infer this context without being directed to, so the result is not unexpected. The following changes make the prompt more specific and imply the needed context.

In [20]:
prompt = f'''
Identify the following information for the article below.

A. Explain 10 or less events that happened in short sentences.
B. Topic
C. Category from list: U.S. News, Health, World News, Government, Entertainment.
D. News Subcategory: one word
E. Exact location and country
F. List all of the people mentioned in the article
G. Mood from list: Serious, Cheerful, Neutral

Article: ```{article1.text}```
'''

response = get_completion(prompt)
print(response)

A. Two activists smeared red paint and glued their hands to a Monet painting at Stockholm's National Museum. Police were called to the scene and arrested two women. The painting is being inspected for damage.
B. Art vandalism at Stockholm's National Museum
C. World News
D. Crime
E. Stockholm, Sweden
F. Claude Monet, Per Hedström
G. Serious
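
Because the prompt now restricts Category and Mood to fixed lists, a reply can be checked mechanically. The sketch below uses hypothetical helpers, `parse_lettered` and `validate`, and assumes each field starts on its own line with a letter and a period, as in the outputs shown here. A continuation line that happens to begin with something like `E. ` would be misread as a new field.

```python
import re

ALLOWED_CATEGORIES = {"U.S. News", "Health", "World News", "Government", "Entertainment"}
ALLOWED_MOODS = {"Serious", "Cheerful", "Neutral"}


def parse_lettered(response_text):
    """Split a reply with lines like 'A. ...' into a {letter: value} dict."""
    fields, current = {}, None
    for line in response_text.splitlines():
        match = re.match(r"([A-G])\.\s*(.*)", line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current and line.strip():
            # A line without a letter prefix continues the previous field.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


def validate(fields):
    """Return a list of problems; an empty list means the reply passed."""
    problems = []
    if fields.get("C") not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {fields.get('C')!r}")
    if fields.get("G") not in ALLOWED_MOODS:
        problems.append(f"unexpected mood: {fields.get('G')!r}")
    return problems


# The reply printed above, copied in as a plain string.
reply = """A. Two activists smeared red paint and glued their hands to a Monet painting at Stockholm's National Museum. Police were called to the scene and arrested two women. The painting is being inspected for damage.
B. Art vandalism at Stockholm's National Museum
C. World News
D. Crime
E. Stockholm, Sweden
F. Claude Monet, Per Hedström
G. Serious"""

print(validate(parse_lettered(reply)))
```

The earlier reply for this article, with `C. News` and `G. Concerned`, would be flagged on both fields under these lists.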


And again, test it on the initial article to see if the results are consistent.

In [21]:
prompt = f'''
Identify the following information for the article below.

A. Explain 10 or less events that happened in short sentences.
B. Topic
C. Category from list: U.S. News, Health, World News, Government, Entertainment.
D. News Subcategory: one word
E. Exact location and country
F. List all of the people mentioned in the article
G. Mood from list: Serious, Cheerful, Neutral

Article: ```{article.text}```
'''

response = get_completion(prompt)
print(response)

A. Kouri Richins denied pretrial release, judge cites substantial evidence against her, incriminating internet searches found on her phone, expert witnesses testify in court, victim impact statement made by Eric Richins' sister, Kouri Richins has right to file expedited appeal within 30 days, allegations of forged documents.
B. Utah widow accused of killing her husband before writing a children's book about grief.
C. U.S. News
D. Crime
E. Park City, Utah, United States
F. Kouri Richins, Eric Richins, Amy Richins, Skye Lazaro, Matt Throckmorton, Kristal Bowman-Carter
G. Serious


Somehow the judge is now missing from the list of people. The description of people will be changed to convey that every person MUST be mentioned.

In [23]:
prompt = f'''
Identify the following information for the article below.

A. Explain 10 or less events that happened in short sentences.
B. Topic
C. Category from list: U.S. News, Health, World News, Government, Entertainment.
D. News Subcategory: one word
E. Exact location and country
F. Exhaustive list of every name
G. Mood from list: Serious, Cheerful, Neutral

Article: ```{article.text}```
'''

response = get_completion(prompt)
print(response)

A. Kouri Richins denied pretrial release, judge cites substantial evidence against her, incriminating search history found on her phone, victim impact statement made by Eric Richins' sister, Kouri Richins has right to file expedited appeal within 30 days, allegations of forged documents.
B. Utah widow accused of killing her husband before writing a children's book about grief.
C. U.S. News
D. Crime
E. Park City, Utah, United States
F. Kouri Richins, Eric Richins, Amy Richins, Skye Lazaro, Richard E. Mrazik, Matt Throckmorton, Kristal Bowman-Carter
G. Serious
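
A dropped name like the judge's could also be caught automatically by comparing the people lists from two runs. This is a rough sketch: `names` is a hypothetical helper, and stripping a leading "Judge" is an assumption made so that "Judge Richard E. Mrazik" and "Richard E. Mrazik" count as the same person.

```python
TITLES = ("judge ",)  # assumed titles to ignore when comparing names


def names(field_value):
    """Turn a comma-separated 'F.' value into a set of normalized names."""
    cleaned = set()
    for part in field_value.split(","):
        name = part.strip().rstrip(".")
        if name.lower().startswith("and "):
            name = name[4:]
        for title in TITLES:
            if name.lower().startswith(title):
                name = name[len(title):]
        if name:
            cleaned.add(name)
    return cleaned


# 'F.' values copied from two of the replies shown earlier in this notebook.
with_judge = ("Kouri Richins, Eric Richins, Amy Richins, Skye Lazaro, "
              "Judge Richard E. Mrazik, Matt Throckmorton, Kristal Bowman-Carter")
without_judge = ("Kouri Richins, Eric Richins, Amy Richins, Skye Lazaro, "
                 "Matt Throckmorton, Kristal Bowman-Carter")

print(names(with_judge) - names(without_judge))
```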


In [24]:
url2 = 'https://www.cnn.com/2023/06/15/health/youth-suicide-homicide/index.html'
article2 = Article(url2)
article2.download()
article2.parse()

In [25]:
prompt = f'''
Identify the following information for the article below.

A. Explain 10 or less events that happened in short sentences.
B. Topic
C. Category from list: U.S. News, Health, World News, Government, Entertainment.
D. News Subcategory: one word
E. Exact location and country
F. Exhaustive list of every name
G. Mood from list: Serious, Cheerful, Neutral

Article: ```{article2.text}```
'''

response = get_completion(prompt)
print(response)

A. Suicide and homicide rates for children and young adults in the US are the highest they've been in decades. Suicide and homicide were the second and third leading causes of death for this age group. The homicide rate for this age group in 2021 was the highest it's been since 1997, and the suicide rate was the highest on record since 1968. Suicide rates surpassed homicide rates for this age group in 2010 and have continued rising for the past decade. But a large spike in homicide rates during the first year of the Covid-19 pandemic brought the rates for both types of violent death together for the first time in a decade. For children ages 10 to 14, however, a large gap remained. The suicide rate in 2021 was twice as high as the homicide rate. Firearms were the most common weapon used in children's deaths, and Black boys were killed more than any other group. 

B. Suicide and homicide rates for children and young adults in the US 

C. Health 

D. Mental health 

E. United States 

F. 

Finally, this output can be passed through the model again to format it as JSON that can be put into a database.

In [29]:
prompt = f'''
Format the Data below into a JSON file with the following key/value pairs:
A. to Summary
B. to Topic
C. to Category
D. to Subject
E. to Location
F. to People
G. to Mood

Data: ```{response}```
'''

formatted_response = get_completion(prompt)
print(formatted_response)

{
  "Summary": "Suicide and homicide rates for children and young adults in the US are the highest they've been in decades. Firearms were the most common weapon used in children's deaths, and Black boys were killed more than any other group.",
  "Topic": "Suicide and homicide rates for children and young adults in the US",
  "Category": "Health",
  "Subject": "Mental health",
  "Location": "United States",
  "People": "No names mentioned",
  "Mood": "Serious"
}
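
Before a reply like this goes into a database, it is worth confirming that it is valid JSON with the expected keys, since the model is not guaranteed to produce either. The sketch below is one possible way to do that with the standard library. The table name, column layout, and `store_record` helper are assumptions made for illustration, with SQLite standing in for whatever database is actually used.

```python
import json
import sqlite3

REQUIRED_KEYS = ["Summary", "Topic", "Category", "Subject", "Location", "People", "Mood"]


def store_record(conn, raw_json):
    """Parse the model's JSON reply, check its keys, and insert one row."""
    record = json.loads(raw_json)  # raises json.JSONDecodeError on malformed output
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(summary TEXT, topic TEXT, category TEXT, subject TEXT, "
        "location TEXT, people TEXT, mood TEXT)"
    )
    conn.execute(
        "INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?)",
        [str(record[key]) for key in REQUIRED_KEYS],
    )
    conn.commit()


# The reply printed above, copied in as a plain string.
sample = '''{
  "Summary": "Suicide and homicide rates for children and young adults in the US are the highest they've been in decades. Firearms were the most common weapon used in children's deaths, and Black boys were killed more than any other group.",
  "Topic": "Suicide and homicide rates for children and young adults in the US",
  "Category": "Health",
  "Subject": "Mental health",
  "Location": "United States",
  "People": "No names mentioned",
  "Mood": "Serious"
}'''

conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
store_record(conn, sample)
print(conn.execute("SELECT category, mood FROM articles").fetchall())
# prints [('Health', 'Serious')]
```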
