# Extracting Information from Financial News Descriptions with LLMs

In this notebook, we will explore how to extract meaningful information from news descriptions using **LLMs**. More specifically, we will extract sentiment and company tickers (stock identifiers for the companies mentioned in the descriptions).
Ticker extraction can be regarded as a named-entity recognition/linking (NER/NEL) task.
We will use a pre-trained **Mistral-7B** model served through the **Hugging Face Hub**.

---
- **Dataset**: https://www.kaggle.com/datasets/rdolphin/financial-news-with-ticker-level-sentiment/data?select=polygon_news_sample.json
- **Base Model**: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3



In [1]:
import os
from pydantic import BaseModel, Field
from typing import List, Optional
import pandas as pd 
import re
import json
from dotenv import load_dotenv

In [None]:
# Load environment variables from the .env file
load_dotenv()

# Access the secrets
secret_key = os.getenv("SECRET_KEY")

# Read Financial News Dataset

In [2]:
import json

# Global variable for the dataset path
DATA_DIR = 'data/polygon'
JSON_PATH = os.path.join(DATA_DIR, 'financial_news_with_ticker_level_sentiment.json')

with open(JSON_PATH, 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)

In [3]:
df.head()

Unnamed: 0,article_url,author,description,id,image_url,amp_url,keywords,published_utc,publisher,tickers,insights,title
0,https://www.zacks.com/stock/news/2114673/alleg...,Zacks.com,Allegiant Travel (ALGT) is a fast-moving stock...,db221630f08d9064b6539534cc9957ecd7ba2a626927c7...,https://staticx-tuner.zacks.com/images/default...,,"[Allegiant Travel, momentum investing, fast-pa...",2023-06-29T12:50:06Z,"{'name': 'Zacks Investment Research', 'homepag...",[ALGT],"[{'ticker': 'ALGT', 'sentiment': 'positive', '...",Allegiant Travel (ALGT) Is Attractively Priced...
1,https://www.zacks.com/stock/news/2085677/appli...,Zacks.com,Applied Industrial Technologies (AIT) reported...,bb7e1725949a7254ae18e8d149c3c19af050c0ac05f18f...,https://staticx-tuner.zacks.com/images/default...,,"[earnings, revenues, estimates, industrial pro...",2023-04-27T11:55:14Z,"{'name': 'Zacks Investment Research', 'homepag...","[AIT, NPO]","[{'ticker': 'AIT', 'sentiment': 'positive', 's...",Applied Industrial Technologies (AIT) Q3 Earni...
2,https://www.globenewswire.com/news-release/202...,,"Apollo Commercial Real Estate Finance, Inc. (A...",a49c53ef44092946950dfb3f33852c9ef07d7c7dc6c1ea...,https://ml.globenewswire.com/Resource/Download...,,"[commercial real estate, financing, mortgage l...",2023-03-06T13:30:00Z,"{'name': 'GlobeNewswire Inc.', 'homepage_url':...","[ARI, SAN]","[{'ticker': 'ARI', 'sentiment': 'positive', 's...","Apollo Commercial Real Estate Finance, Inc. Cl..."
3,https://www.globenewswire.com/news-release/202...,,"Maravai LifeSciences, a global provider of lif...",be4f5174307cd0f3309ee931ab4ec4fc2451af056769ca...,https://ml.globenewswire.com/Resource/Download...,,"[Maravai LifeSciences, investor conferences, f...",2023-11-09T13:15:00Z,"{'name': 'GlobeNewswire Inc.', 'homepage_url':...",[MRVI],"[{'ticker': 'MRVI', 'sentiment': 'positive', '...",Maravai LifeSciences Announces November 2023 I...
4,https://www.zacks.com/stock/news/2069321/dht-h...,Zacks Equity Research,"DHT Holdings, an independent oil tanker compan...",29bea2bb15df75a10fd940c2dc705d21d4c413fb45c17a...,https://staticx-tuner.zacks.com/images/default...,,"[DHT Holdings, oil tanker, earnings, revenue, ...",2023-03-22T22:00:25Z,"{'name': 'Zacks Investment Research', 'homepag...",[DHT],"[{'ticker': 'DHT', 'sentiment': 'neutral', 'se...",DHT Holdings (DHT) Stock Moves -1.33%: What Yo...


# Define Structured Output Prompt
We want to force the LLM to return structured output. There are two ways to do this:
- Write the **JSON schema** by hand in the prompt. We use this approach here because the example is simple.
- Define the structure as a **pydantic** class and derive the JSON schema from that class, for example with **langchain**. This is less error-prone and is the better choice beyond a simple example.
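As a rough illustration of the pydantic route, the sketch below defines the same structure the prompt asks for and derives a JSON schema from it. The class names `CompanyDetails` and `ArticleInsights` and the sample record are hypothetical, chosen to mirror the prompt fields; the `getattr` fallbacks are only there because the installed pydantic version (v1 or v2) is not known.

```python
import json
from typing import Dict, List, Literal

from pydantic import BaseModel, Field


class CompanyDetails(BaseModel):
    # Field names mirror the keys requested in the prompt
    ticker: str = Field(description="Stock ticker, or 'N/A' if unknown")
    sentiment_reasoning: str
    sentiment: Literal["negative", "neutral", "positive"]


class ArticleInsights(BaseModel):
    title: str
    article_keywords: List[str]
    # Maps each company name to its details
    relevant_company_details: Dict[str, CompanyDetails]


# pydantic v2 exposes model_json_schema(); v1 exposes schema()
schema_fn = getattr(ArticleInsights, "model_json_schema", None) or ArticleInsights.schema
schema = schema_fn()
print(json.dumps(schema, indent=2))

# Hypothetical record shaped like the output this notebook expects
sample = {
    "title": "Allegiant Travel looks attractively priced",
    "article_keywords": ["Allegiant Travel", "momentum"],
    "relevant_company_details": {
        "Allegiant Travel": {
            "ticker": "ALGT",
            "sentiment_reasoning": "Described as a fast-moving stock.",
            "sentiment": "positive",
        }
    },
}

# pydantic v2 exposes model_validate(); v1 exposes parse_obj()
validate = getattr(ArticleInsights, "model_validate", None) or ArticleInsights.parse_obj
parsed = validate(sample)
print(parsed.relevant_company_details["Allegiant Travel"].sentiment)
```

The printed schema could be pasted into the prompt in place of a hand-written template, and the same class can validate what the model returns. LangChain's `PydanticOutputParser` wraps a similar idea.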

In [4]:
def extract_json_after_instructions(llm_response):
    """
    Extracts JSON from the LLM response by locating the </instructions> tag
    and returning the first valid JSON block found after it.
    """
    # Ensure the response contains the tag
    if "</instructions>" in llm_response:
        _, json_part = llm_response.split("</instructions>", 1)  # Split at </instructions>
    else:
        json_part = llm_response  # If no tag, assume the whole response is JSON
    
    # Use regex to extract JSON from the remaining response
    match = re.search(r'\{.*\}', json_part, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())  # Convert JSON string to Python dictionary
        except json.JSONDecodeError:
            print("Error: Couldn't parse extracted JSON.")
            return None
    return None  # No valid JSON found

def extract_structured_info(news_text: str):
    # Create a prompt to instruct the model
    prompt = f"""
    <article>
    {news_text}
    </article>
    <instructions>
    Write a concise version of this news article and extract the relevant companies
    mentioned along with the sentiment for each and the reasoning for choosing that sentiment rating.
    Your response must be in this JSON format. Do not include extra text outside the JSON object:
    {{
    "title": "...",
    "article_keywords" : "...",
    "relevant_company_details": {{
        "company_name":
        {{
            "ticker" : "tickerN/A",
            "sentiment_reasoning" : "...",
            "sentiment": "negative/neutral/positive"
        }},
        "company_name": {{ ... }}
     }}
    }}
    </instructions>
    """

    response = llm(prompt)
    structured_response = extract_json_after_instructions(response)
    return structured_response
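The helper above uses a greedy regex, which spans from the first `{` to the last `}` in the response. If the model appends extra text containing braces after the JSON, that span no longer parses. A possible alternative, sketched here with only the standard library, is to let `json.JSONDecoder.raw_decode` parse starting at each `{` and stop at the first object that decodes. The function name `first_json_object` and the sample response are hypothetical.

```python
import json
from typing import Optional


def first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object in `text` that parses, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


# Hypothetical response: valid JSON followed by stray text with braces
response = 'prompt echo </instructions> {"title": "T", "n": {"a": 1}} Note: {not json}'
after_tag = response.split("</instructions>", 1)[-1]
print(first_json_object(after_tag))
```

One caveat of this sketch: if the outer object is malformed, the scan moves on and may return a nested inner object instead, so the result should still be checked for the expected keys.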

# Load base LLM
We will use **HuggingFaceHub**, which runs **LLM inference in the cloud**, so there is no need to download the model locally. This is slower, but it avoids memory problems. We will use a **Mistral-7B** model.

In [24]:
from langchain.llms import HuggingFaceHub
llm = HuggingFaceHub(repo_id="mistralai/Mistral-7B-Instruct-v0.3", model_kwargs={"temperature": 0.7})



# Inference Loop
This loop calls **HuggingFaceHub for LLM inference**, with one call per entry in the dataset.
- **input**: news description
- **output**: JSON with tickers and sentiment

The dataset has about 5k entries and takes around 3 hours to process with cloud inference, so we add two safeguards:
- **Resume** the process where it left off if the run is interrupted.
- **Retry** inference for an entry if the model is unresponsive for a while.
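The retry idea can also be factored into a small reusable helper. The sketch below is not the loop used in this notebook: it swaps in exponential backoff (the wait doubles after each failure) and adds a hypothetical `is_retryable` hook so that errors which will not go away on their own, such as an exhausted quota, can fail immediately. All names here are illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_wait: float = 5.0,
    is_retryable: Callable[[Exception], bool] = lambda exc: True,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on exceptions with exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors not worth retrying
            if attempt == max_retries or not is_retryable(exc):
                raise
            # Wait base_wait, then 2x, then 4x, ...
            sleep(base_wait * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")


# Demo with a function that fails twice before succeeding; the waits are
# recorded in a list instead of actually sleeping
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("model not responding")
    return "ok"


waits = []
result = call_with_retries(flaky, sleep=waits.append)
print(result, waits)
```

In the loop below, `fn` would be a lambda wrapping `extract_structured_info(news_text)`.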


In [29]:
import os
import json
import pandas as pd
import time
from tqdm import tqdm

# Global variables for dataset paths
DATA_DIR = 'data/polygon'
ORIGINAL_JSON_PATH = os.path.join(DATA_DIR, 'financial_news_with_ticker_level_sentiment.json')
EXTRACTED_JSON_PATH = os.path.join(DATA_DIR, 'extracted_structured_data.json')

# Load original dataset
with open(ORIGINAL_JSON_PATH, 'r', encoding='utf-8') as f:
    original_data = json.load(f)
df = pd.DataFrame(original_data)

# Try to load already extracted data
if os.path.exists(EXTRACTED_JSON_PATH):
    with open(EXTRACTED_JSON_PATH, 'r', encoding='utf-8') as f:
        extracted_data = json.load(f)
else:
    extracted_data = []

# Determine how many rows have already been processed
num_extracted = len(extracted_data)

print(f"🔄 Resuming extraction from index {num_extracted} of {len(df)}")

# Initialize results with existing extracted data
results = extracted_data

# Define retry parameters
MAX_RETRIES = 3
WAIT_TIME = 5  # seconds

# Loop through remaining rows
for index, row in tqdm(df.iloc[num_extracted:].iterrows(), total=len(df) - num_extracted, desc="Processing rows"):
    news_text = row['description']  # Assuming 'description' is the column with text

    retries = 0
    while retries < MAX_RETRIES:
        try:
            # Attempt to extract structured information
            insights = extract_structured_info(news_text)

            # If successful, break out of retry loop
            if insights is not None:
                results.append(insights)
            else:
                results.append({"placeholder": "Missing"})  # Handle null cases
            
            break  # Exit retry loop if successful

        except Exception as e:
            retries += 1
            print(f"⚠️ Error processing index {index}: {e}")
            if retries < MAX_RETRIES:
                wait_time = WAIT_TIME * retries  # Linear backoff: 5 s, then 10 s
                print(f"🔄 Retrying ({retries}/{MAX_RETRIES}) in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                print(f"❌ Skipping index {index} after {MAX_RETRIES} failed attempts.")
                results.append({"error": str(e)})  # Log error and continue

    # Save progress every 100 entries
    if (index + 1) % 100 == 0:
        with open(EXTRACTED_JSON_PATH, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=4)


🔄 Resuming extraction from index 978 of 5548


Processing rows:   0%|                                                                 | 0/4570 [00:00<?, ?it/s]

⚠️ Error processing index 978: 402 Client Error: Payment Required for url: https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3 (Request ID: Root=1-67c17a5c-59d5fd800316c9c931d305e3;1301a4ab-88ac-4857-9470-4936ba51ee92)

You have exceeded your monthly included credits for Inference Providers. Subscribe to PRO to get 20x more monthly allowance.
🔄 Retrying (1/3) in 5 seconds...
⚠️ Error processing index 978: 402 Client Error: Payment Required for url: https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.3 (Request ID: Root=1-67c17a61-37f77b20095d96d062512cbc;66b110d6-03e8-4245-b3ef-d433e3ec0acb)

You have exceeded your monthly included credits for Inference Providers. Subscribe to PRO to get 20x more monthly allowance.
🔄 Retrying (2/3) in 10 seconds...


Processing rows:   0%|                                                                 | 0/4570 [00:13<?, ?it/s]


KeyboardInterrupt: 

In [28]:
with open(EXTRACTED_JSON_PATH, 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=4)
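Because the resume logic reads this file back on the next run, an interruption in the middle of `json.dump` could leave a truncated file that no longer parses. One way to guard against that, sketched here under the assumption that the checkpoint lives on a local filesystem, is to write to a temporary file in the same directory and then swap it into place with `os.replace`. The helper name `save_checkpoint` is hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(results, path):
    """Write results as JSON so that `path` is never left half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=4)
        # Replace the old checkpoint in a single step
        os.replace(tmp_path, path)
    except BaseException:
        # Also covers KeyboardInterrupt: clean up and keep the old checkpoint
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "extracted_structured_data.json")
    save_checkpoint([{"title": "example"}], demo_path)
    with open(demo_path, encoding="utf-8") as f:
        reloaded = json.load(f)
print(reloaded)
```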

# Check: Load extracted Data

In [20]:
JSON_PATH = os.path.join(DATA_DIR, 'extracted_structured_data.json')

with open(JSON_PATH, 'r') as f:
    data = json.load(f)
    
# Replace null entries with a placeholder so that every row is a dict
filtered_data = [entry if entry is not None else {"placeholder": "Missing"} for entry in data]
extracted_df = pd.DataFrame(filtered_data)

In [21]:
extracted_df.head(15)

Unnamed: 0,title,article_keywords,relevant_company_details,placeholder,title:
0,Allegiant Travel: Fast-Moving Stock with Stron...,"[Allegiant Travel, ALGT, stock, momentum, valu...","{'Allegiant Travel': {'ticker': 'ALGT', 'senti...",,
1,"AIT Q3 Results Beat Estimates, Stock to Perfor...","[Applied Industrial Technologies, Q3 results, ...",{'Applied Industrial Technologies': {'ticker':...,,
2,Apollo Commercial Real Estate Finance Secures ...,"[Apollo Commercial Real Estate Finance, Banco ...",{'Apollo Commercial Real Estate Finance': {'ti...,,
3,Maravai LifeSciences to present at investor co...,"[Maravai LifeSciences, investor conferences, N...",{'Maravai LifeSciences': {'ticker': 'tickerN/A...,,
4,DHT Holdings Stocks Down 1.33%,"[DHT Holdings, Transportation sector]","{'DHT Holdings': {'ticker': 'N/A', 'sentiment_...",,
5,Meta Platforms Q2 2023 Earnings: Impressive Ne...,"[Meta Platforms, Q2 2023 earnings, net profit,...","{'Meta Platforms': {'ticker': 'FB', 'sentiment...",,
6,Comcast's Entertainment Industry Recovery Amid...,"[Comcast, NBCUniversal, Peacock, theme parks, ...","{'Comcast': {'ticker': 'CMCSA', 'sentiment_rea...",,
7,Analysts increase Sarepta Therapeutics price t...,"[Sarepta Therapeutics, analysts, price target,...","{'Sarepta Therapeutics': {'ticker': 'SRPT', 's...",,
8,Stocks with Strong Value and Positive Outlook:...,"[Atmus Filtration Technologies Inc., Carrols R...",{'Atmus Filtration Technologies Inc.': {'ticke...,,
9,Celldex Therapeutics Q1 Earnings Beat Estimate...,"[Celldex Therapeutics, Quarterly Loss, Zacks C...",{'Celldex Therapeutics': {'ticker': 'tickerN/A...,,
