# **Textual Data Analysis - Exercise - 6**


---


## **Name: Ayesha Zafar**
## **Date: 04/02/2025**


---

### Extracting named entities with a generative model

Your task here is to extract named entities by prompting a generative model. We will use OpenAI's gpt-4o-mini model and the openai-python library to access the API.

You can use the API key provided for this course.

There is real money behind this key, so do not share it or use it for any purpose other than this exercise. Also, use only the gpt-4o-mini model.



Your tasks are:

1) Write a prompt to extract named entities from a news article. Your prompt can either focus on one entity type (in this case, discard other types), or extract multiple entity types in the same prompt. Do not repeat the extraction separately for each entity type (let's save quota here!).

2) Write code to access the API, and retrieve results. Debug this with one short request until you know it works!

3) Take 10 news articles from the same news data collection (Finnish or English). Verify that the selected articles are not extremely long (each should be less than 300 words), and discard or truncate longer documents.

4) Extract named entities from these 10 articles using the gpt-4o-mini model, and store/print the results.



If you have any problems accessing the API with the given API key, you can also return the exercise without results. This is an experimental exercise, which can fail if the API key is invalidated. In that case, preferably post the error you get to Discord in case we have time to investigate, and return your code with the error the API gives. Look for an explanation of the error, and write your hypothesis of why it happens.

---



Step 1. Installing necessary libraries

In [1]:
!pip install openai



Step 2. Importing required libraries

In [27]:
from openai import OpenAI
import requests
import json
import os

Step 3. Setting up the API key and initializing the OpenAI client

In [28]:
# Read the key from the environment instead of hard-coding it in the notebook.
API_KEY = os.environ["OPENAI_API_KEY"]
client = OpenAI(api_key=API_KEY)

Step 4. Setting up the prompt

In [29]:
PROMPT_TEMPLATE = """Extract named entities from the following news article.

Entity Types:
- Persons (PER)
- Organizations (ORG)
- Locations (LOC)

Return the entities in JSON format with keys "PER", "ORG", and "LOC".

Article:
{article_text}

"""

Step 5. Accessing the API and getting results

In [30]:
def extract_entities(article_text):
    """Send one article to gpt-4o-mini and return the raw text of its reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that extracts named entities from text."},
            {"role": "user", "content": PROMPT_TEMPLATE.format(article_text=article_text)}
        ],
        max_tokens=500,   # upper bound on the reply length, to save quota
        temperature=0.0   # keep the output as repeatable as possible
    )
    return response.choices[0].message.content
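
`extract_entities` returns the reply as plain text, and the model may wrap the JSON in a Markdown code fence. The sketch below shows one possible way to turn such a reply into a Python dictionary. The helper name `parse_entities` is hypothetical and not part of the exercise, and falling back to empty lists on malformed output is an assumption about what is convenient here.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker, built this way to keep the example readable


def parse_entities(raw_reply):
    """Parse a model reply into a dict with "PER", "ORG" and "LOC" lists.

    Hypothetical helper: unparseable replies yield empty lists.
    """
    # If the reply is wrapped in a code fence (optionally tagged "json"), keep only the inside.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, raw_reply, flags=re.DOTALL)
    payload = match.group(1) if match else raw_reply.strip()

    try:
        parsed = json.loads(payload)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}

    # Always return all three keys, each mapped to a list.
    result = {}
    for key in ("PER", "ORG", "LOC"):
        value = parsed.get(key, [])
        result[key] = value if isinstance(value, list) else []
    return result
```

With a helper like this, the results for each article could be collected in a list and written to disk with `json.dump` instead of only being printed.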

Step 6. Loading the articles and filtering out those with more than 300 words

In [31]:
url = "http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl"
response = requests.get(url, timeout=60)
response.raise_for_status()  # stop early if the download failed
data = response.text.splitlines()

# Keep the first 10 non-empty articles of at most 300 words; longer ones are discarded.
articles = []
for line in data:
    obj = json.loads(line)
    article_text = obj.get("text", "").strip()
    if article_text and len(article_text.split()) <= 300:
        articles.append(article_text)
    if len(articles) == 10:
        break

print(f"Collected {len(articles)} articles.")
print("First Article:")
print(articles[0])

Collected 10 articles.
First Article:
Rental fees for non-subsidised apartments rose across most of Finland during April to June, compared to the same period a year ago, according to data from Statistics Finland. 
 On average, rents rose by 0.9 percent during that period across the country. 
 Timo Metsola , board chair of rental agency Vuokraturva, attributed the increase to growing demand, saying that competition clearly intensified for the most desirable properties. 
 The sharpest rise in apartment rents during the April-June period was seen in the city of Turku, where costs rose by 1.6 percent, with the city of Tampere seeing an increase of 1.4 percent. 
 Meanwhile in the Greater Helsinki area, rents ticked up by 0.9 percent. 
 Among the country's municipal centres, the town of Mikkeli was the only area which saw rental fees decline. 
 Still priciest in Helsinki area 
 The number-crunching agency reported that the median rent for a studio apartment in central Helsinki was 809 euros 
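
The filtering step above discards articles longer than 300 words. The task also allows truncating them instead, and a minimal sketch of that alternative follows. The function name `truncate_to_words` is hypothetical, and counting words by splitting on whitespace is the same simplification the filter uses. Note that a truncated text is re-joined with single spaces, so its original line breaks are lost.

```python
def truncate_to_words(text, max_words=300):
    """Return text cut down to at most max_words whitespace-separated words."""
    words = text.split()
    if len(words) <= max_words:
        return text  # short enough already, return unchanged
    return " ".join(words[:max_words])
```

For example, `truncate_to_words(article_text)` could be applied to every article before appending it, in place of the length check.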

Step 7. Extracting named entities from articles

In [32]:
for i, article in enumerate(articles):
    print(f"Article {i + 1}:")
    entities = extract_entities(article)
    print(entities)

Article 1:
```json
{
  "PER": [
    "Timo Metsola"
  ],
  "ORG": [
    "Statistics Finland",
    "Vuokraturva"
  ],
  "LOC": [
    "Finland",
    "Turku",
    "Tampere",
    "Greater Helsinki",
    "Mikkeli",
    "Helsinki",
    "Oulu"
  ]
}
```
Article 2:
```json
{
  "PER": [
    "Emma Terho",
    "Kirsty Coventry",
    "Yelena Isinbayeva",
    "Peter Tallberg"
  ],
  "ORG": [
    "International Olympic Committee",
    "Athletes’ Commission",
    "IOC"
  ],
  "LOC": [
    "Zimbabwe",
    "Finland",
    "Pyeongchang"
  ]
}
```
Article 3:
```json
{
  "PER": [],
  "ORG": [
    "Regional State Administrative Agency of Southern Finland",
    "Avi"
  ],
  "LOC": [
    "Helsinki Metropolitan Area",
    "Kymenlaakso",
    "capital region"
  ]
}
```
Article 4:
```json
{
  "PER": [],
  "ORG": [
    "Helsinki District Court",
    "Police"
  ],
  "LOC": [
    "Helsinki"
  ]
}
```
Article 5:
```json
{
  "PER": [],
  "ORG": [
    "Association of Finnish Theatres",
    "Finnish Theatre Directors' As