# Dow Jones Technical Assignment

#### Dataset Selection
For this assignment, I am using the [Yahoo Finance News Dataset (2023)](https://github.com/FelixDrinkall/financial-news-dataset/blob/main/data/2023_processed.json.xz), which contains real-world financial news articles published on finance.yahoo.com during 2023. The dataset is part of a broader collection covering 2017–2023, available in [this repository](https://github.com/felixdrinkall/financial-news-dataset).
#### Rationale
I selected the 2023 dataset because it provides recent, high-quality financial news articles from multiple reputable media sources, closely resembling what would be available to a financial media company like Dow Jones. This ensures:
- Realistic input for summarization and semantic search.
- Diverse content, including company earnings, market movements, and macroeconomic developments.
- Clean article fields, such as the publication date, original link, title, and full text.
#### Licensing
The dataset is distributed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License](https://creativecommons.org/licenses/by-nc-sa/4.0/). This allows for academic use, sharing, and adaptation, while restricting commercial applications.

## Initial setup: Libraries, OpenAI key and model

In [1]:
import os
import json
import asyncio
import pandas as pd
import openai
import plotly.express as px
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langchain_openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate

pd.options.display.max_columns=999
pd.options.display.max_rows=999

In [2]:
# Load the environment variables from .env file
load_dotenv()
openai.api_key = os.environ['OPENAI_API_KEY']

# Cache file to avoid calling the API multiple times with the same input
cache_file = '../cache_summary.json'

# Load cache from disk if it exists
try:
    with open(cache_file, 'r') as f:
        cache = json.load(f)
except FileNotFoundError:
    cache = {}
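A possible refinement of the cache, sketched under assumptions: keying entries by a hash of the input keeps the JSON file small even when inputs are long texts. The helper names `cache_key`, `load_cache`, and `save_cache` are hypothetical and not part of the notebook.

```python
import hashlib
import json
from pathlib import Path


def cache_key(text: str) -> str:
    """Return a stable, fixed-length key for an input text (SHA-256 hex digest)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_cache(path: Path) -> dict:
    """Load the cache from disk, or start empty if the file does not exist."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {}


def save_cache(cache: dict, path: Path) -> None:
    """Write the whole cache back to disk as JSON."""
    path.write_text(json.dumps(cache), encoding="utf-8")
```

With these helpers, a lookup would be `cache.get(cache_key(article_text))` instead of indexing by the full text.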

I am using GPT-4.1 nano because its performance is not far from GPT-4o's while being much cheaper (about 25x); see [this comparison](https://docsbot.ai/models/compare/gpt-4o/gpt-4-1-nano) for details.
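To make the cost argument concrete, here is a rough estimate sketch. The per-token prices, the tokens-per-word ratio, and the article and summary lengths are all assumptions for illustration and should be checked against OpenAI's current pricing; the prompt instructions themselves are ignored.

```python
# Assumed list prices in USD per 1M tokens (verify before relying on them).
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}


def estimate_cost(model, n_articles, words_in=500, words_out=200, tokens_per_word=1.33):
    """Estimate the USD cost of summarizing n_articles with the given model."""
    price = PRICES[model]
    tokens_in = n_articles * words_in * tokens_per_word
    tokens_out = n_articles * words_out * tokens_per_word
    return (tokens_in * price["input"] + tokens_out * price["output"]) / 1_000_000


for model in PRICES:
    print(model, round(estimate_cost(model, n_articles=10_000), 2))
```

Under these assumed prices the ratio between the two models is 25x regardless of the article count, since both input and output prices differ by the same factor.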

In [None]:
# Initializing OpenAI models for later
llm = init_chat_model("gpt-4.1-nano-2025-04-14", model_provider="openai", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


## Read and explore data

In [None]:
# Read the data
df = pd.read_json("../data/2023_processed.json")
# Drop articles with no main text
df = df.dropna(subset=["maintext"])
print(df.language.unique()) # Articles always in English
print(df.date_publish.map(lambda x: x[:4]).unique()) # Articles always published in 2023
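The cell above drops rows whose `maintext` is missing. If exact duplicates should also be removed, a minimal sketch on a toy frame could look like this. Treating rows with identical `maintext` as duplicates is an assumption; the real dataset may need a different criterion, such as `url`.

```python
import pandas as pd

# Toy data standing in for the real articles (hypothetical values)
toy = pd.DataFrame(
    {
        "title": ["A", "A (repost)", "B", "C"],
        "maintext": ["same body", "same body", None, "other body"],
    }
)

# Step 1: drop rows with no article text
non_empty = toy.dropna(subset=["maintext"])

# Step 2 (assumption): rows with identical text are duplicates, keep the first
deduped = non_empty.drop_duplicates(subset=["maintext"], keep="first")

print(len(toy), len(non_empty), len(deduped))  # prints: 4 3 2
```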

In [None]:
# Keep only relevant columns
cols_to_keep = ["date_publish", "description", "maintext", "title", "url", "related_companies"]
df = df[cols_to_keep]
df.head()

| | date_publish | description | maintext | title | url | related_companies |
|---|---|---|---|---|---|---|
| 0 | 2023-06-23 05:38:00 | At Tyler Malek's ice cream parlors, one cook's... | LOS GATOS, Calif. (AP) — At Tyler Malek's ice ... | The US has tons of leftover food. Upcycling se... | https://finance.yahoo.com/news/us-tons-leftove... | [BSAC, FHN, PACW, BSMX, VLY, MBRG, SMMF, GNBC,... |
| 1 | 2023-08-26 14:00:17 | The worst result, after buying shares in a com... | The worst result, after buying shares in a com... | Baker Hughes (NASDAQ:BKR) shareholders have ea... | https://finance.yahoo.com/news/baker-hughes-na... | [CHU, INSG, S, TDS, DCM, TMUS, CHT, SPOK, VEON... |
| 2 | 2023-12-06 16:57:28 | (Bloomberg) -- An insolvency filing by Signa H... | (Bloomberg) -- An insolvency filing by Signa H... | Signa’s Insolvency Yields Long List of Credito... | https://finance.yahoo.com/news/signa-insolvenc... | [TXT] |
| 3 | 2023-06-14 07:21:56 | Swiss citizens vote this weekend on whether to... | By John Revill\nZURICH (Reuters) - Swiss citiz... | Low-tax Switzerland votes on global minimum co... | https://finance.yahoo.com/news/low-tax-switzer... | [IGLD, RAMP, NSR, TWTR, ACXM, COR, PINS, META,... |
| 4 | 2023-01-10 20:23:00 | Nationally recognized branding agency HAVEN Cr... | WAXHAW, N.C., Jan. 10, 2023 /PRNewswire/ -- Na... | National Branding Agency HAVEN Creative Looks ... | https://finance.yahoo.com/news/national-brandi... | [FIS, FRXB, AAQC, EEX, AUXO, BBOX, GHY, CTLP, ... |


In [None]:
# See sample of an article
df["maintext"].sample(1).values[0].split("\n")

['(Bloomberg) -- UK Prime Minister Rishi Sunak will outline plans to bolster energy security on a visit Monday to Scotland amid growing disagreement over the government’s broader environmental policies.',
 'Most Read from Bloomberg',
 'Stocks Are Doing So Well That It May Be Time to Start Worrying',
 'Burning Ship’s Operator Says Almost 500 EVs Are on Board',
 'Bodegas Put on Notice as Visa Fights Back on Card Surcharges',
 'Stocks Crush ‘Year of Bond’ in Biggest Sentiment Shift Since ‘99',
 'Vanguard’s Economists Aren’t Buying Talk of a Soft Landing: Q&A',
 'Sunak will announce measures to help the North Sea oil and gas industry adapt to the transition to net zero greenhouse gas emissions and meet industry leaders in Aberdeenshire, the UK’s drilling hub. The Sunday Times reported that Sunak will unveil multi-million-pound funding for a carbon capture project in Scotland that could help support oil and gas production.',
 'Meanwhile, the government made it cheaper for industrial firms t


## Generation of a summary
Financial news articles are often lengthy and dense with information. When dealing with large volumes of content, it's critical to quickly assess which articles are worth a deeper read. Summaries serve this exact purpose: they allow readers to grasp the core message of an article in seconds.

To address this, we generate concise, high-quality summaries using Generative AI. Specifically, we leverage OpenAI’s gpt-4.1-nano model (initialized earlier) to produce abstractive summaries that capture the main events and insights from each article.

This not only improves readability but also lays the foundation for downstream tasks like semantic search and topic clustering.

In [None]:
# Show a histogram of how many words are in each article
fig = px.histogram(x=df.maintext.map(lambda x: len(x.split())), nbins=100)
fig.show()

Most articles are between 400 and 600 words; a good summary is usually 150 to 200 words.
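The histogram reading can be backed by a couple of numbers. Below is a minimal sketch that computes the median word count and the share of texts inside a word range. The function names and the toy texts are illustrative assumptions, not results from the dataset.

```python
from statistics import median


def word_count(text: str) -> int:
    """Count whitespace-separated words, as in the histogram cell."""
    return len(text.split())


def length_profile(texts, low=400, high=600):
    """Return the median word count and the share of texts within [low, high]."""
    counts = [word_count(t) for t in texts]
    in_range = sum(low <= c <= high for c in counts)
    return {"median": median(counts), "share_in_range": in_range / len(counts)}


# Toy texts with 450, 520, 900 and 120 words (hypothetical)
toy_texts = ["word " * 450, "word " * 520, "word " * 900, "word " * 120]
print(length_profile(toy_texts))
```

On the real data, the same function could be called as `length_profile(df["maintext"])`.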

Let's create a prompt template using langchain to create the summary.

In [None]:
prompt_template = PromptTemplate.from_template("""
As a professional summarizer for a financial newspaper, create a concise and comprehensive summary of the provided article while adhering to these guidelines:

Craft a summary that is detailed, thorough, in-depth, and complex, while maintaining clarity and conciseness.

Incorporate main ideas and essential information, eliminating extraneous language and focusing on critical aspects.

Rely strictly on the provided text, without including external information.

Format the summary in paragraph form for easy understanding.
                                               
You are creating the summary to be put in the frontpage of the newspaper, so be catchy and critical.

Use from 150 to 200 words.                                                                                         

By following this optimized prompt, you will generate an effective summary that encapsulates the essence of the given article in a clear, concise, and reader-friendly manner.

Article: {Article}
""")

Let’s define an asynchronous function to generate summaries. With asyncio, several API calls can be in flight concurrently instead of one after another, which is especially useful when processing hundreds or thousands of articles.

In [None]:
async def process_single_article(article_text, prompt_template):
    """Process a single article asynchronously"""
    if article_text in cache:
        return cache[article_text]
    try:
        formatted_prompt = prompt_template.format(Article=article_text)
        response = await llm.ainvoke(formatted_prompt)
        summary = response.content
        cache[article_text] = summary
        with open(cache_file, 'w') as f:
            json.dump(cache, f)
        return summary
    except Exception as e:
        print(f"Error processing article: {e}")
        return None

async def process_articles_batch(df, prompt_template, batch_size=5):
    """Process articles in batches to avoid rate limits"""
    results = []
    
    for i in range(0, len(df), batch_size):
        batch = df.iloc[i:i+batch_size]
        
        # Create tasks for the batch
        tasks = [
            process_single_article(row["maintext"], prompt_template) 
            for _, row in batch.iterrows()
        ]
        
        # Process batch concurrently
        batch_results = await asyncio.gather(*tasks, return_exceptions=True)
        results.extend(batch_results)
        
        # Optional: Add delay between batches to respect rate limits
        if i + batch_size < len(df):
            await asyncio.sleep(1)  # 1 second delay between batches
    
    return results
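The batching above waits for every call in a batch to finish before starting the next batch. One alternative, sketched here under assumptions, is to cap the number of in-flight calls with `asyncio.Semaphore`. `fake_summarize` is a hypothetical stand-in for the real API call so that the example runs on its own.

```python
import asyncio


async def fake_summarize(text: str) -> str:
    """Hypothetical stand-in for the API call: waits briefly, returns a label."""
    await asyncio.sleep(0.01)
    return f"summary of {text}"


async def summarize_all(texts, max_concurrent=5):
    """Run one task per text, with at most max_concurrent running at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(text):
        async with semaphore:
            return await fake_summarize(text)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(t) for t in texts))


results = asyncio.run(summarize_all([f"article {i}" for i in range(12)]))
print(results[:2])
```

Inside a notebook an event loop is already running, so the last call would be `results = await summarize_all(...)` rather than `asyncio.run(...)`. A semaphore alone does not enforce a requests-per-minute limit, so a delay may still be needed.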

In [None]:
# Generate summaries for the first 100 articles
res = await process_articles_batch(df.iloc[:100, :], prompt_template)
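The prompt asks for 150 to 200 words, but a model may not always follow a length instruction exactly, and failed calls return `None`. A small check like the following sketch can flag summaries worth regenerating. The function names are hypothetical.

```python
def within_word_budget(summary: str, low: int = 150, high: int = 200) -> bool:
    """Return True if the summary has between low and high words, inclusive."""
    return low <= len(summary.split()) <= high


def flag_out_of_budget(summaries, low=150, high=200):
    """Return indices of summaries that are missing or outside the word range."""
    return [
        i
        for i, s in enumerate(summaries)
        if s is None or not within_word_budget(s, low, high)
    ]
```

For the results above, `flag_out_of_budget(res)` would list the positions of articles to retry or inspect.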

In [None]:
res[0]

'Amid rising awareness of food waste’s environmental and economic toll, the upcycling movement is gaining momentum across the food industry, exemplified by Salt & Straw’s innovative ice cream flavors crafted from leftover ingredients. Portland-based Malek’s chain champions the reuse of whey, rice remnants, and cacao pulp, transforming waste into gourmet products while advocating for a shift from “food waste” to “wasted food.” This trend aligns with consumer demand for transparency and sustainability, as over 35 million tons of food are wasted annually in the U.S., costing the economy over $200 billion. The Upcycled Food Association’s “Upcycling Certified” seal now adorns hundreds of products, from bakery mixes to veggie chips, highlighting ingredients like misshapen produce and byproducts from plant-based milk, such as okara flour used in Salt & Straw’s cupcakes. Beyond retail, innovative restaurants like San Francisco’s Shuggie’s Trash Pie utilize imperfect produce and offcuts, challe

In [None]:
res[1]
