# Dow Jones Technical Assignment

#### Dataset Selection
For this assignment, I am using the [Yahoo Finance News Dataset (2023)](https://github.com/FelixDrinkall/financial-news-dataset/blob/main/data/2023_processed.json.xz), which contains real-world financial news articles published on finance.yahoo.com during 2023. The dataset is part of a broader collection covering the years 2017–2023, available in [this repository](https://github.com/felixdrinkall/financial-news-dataset).
#### Rationale
I selected the 2023 dataset because it provides recent, high-quality financial news articles from multiple reputable media sources, closely resembling what would be available to a financial media company like Dow Jones. This ensures:
- Realistic input for summarization and semantic search.
- Diverse content, including company earnings, market movements, and macroeconomic developments.
- Clean article metadata, such as publication date, original link, title, and full text.
#### Licensing
The dataset is distributed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License](https://creativecommons.org/licenses/by-nc-sa/4.0/). This allows for academic use, sharing, and adaptation, while restricting commercial applications.

## Initial setup: Libraries, OpenAI key and model

In [None]:
import os
import json
import pandas as pd
import openai
import plotly.express as px
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langchain.prompts import PromptTemplate

pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

In [2]:
# Load the environment variables from .env file
load_dotenv()
openai.api_key = os.environ['OPENAI_API_KEY']

# Cache file to avoid calling the API multiple times with the same input
cache_file = '../cache_summary.json'

# Load cache from disk if it exists
try:
    with open(cache_file, 'r') as f:
        cache = json.load(f)
except FileNotFoundError:
    cache = {}
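
The cell above only loads the cache. A minimal sketch of the two remaining pieces, building a key for an input and writing the cache back to disk, could look like the following. The helper names `cache_key` and `save_cache` are hypothetical and not part of the original notebook, and hashing the model name together with the text is an assumption about how entries might be keyed.

```python
import hashlib
import json
import os
import tempfile


def cache_key(text: str, model_name: str) -> str:
    """Return a stable key derived from the model name and the input text."""
    payload = f"{model_name}\n{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def save_cache(cache: dict, path: str) -> None:
    """Write the cache to a temporary file, then swap it into place.

    Replacing the file in one step avoids leaving a half-written JSON
    file behind if the notebook is interrupted during the write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False, encoding="utf-8"
    ) as tmp:
        json.dump(cache, tmp)
        tmp_path = tmp.name
    os.replace(tmp_path, path)
```

Under these assumptions, a new summary would be stored with `cache[cache_key(text, model_name)] = summary` followed by `save_cache(cache, cache_file)`.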

I am using GPT-4.1 nano because its performance is not far from GPT-4o's while being much cheaper (about 25x); see [this comparison](https://docsbot.ai/models/compare/gpt-4o/gpt-4-1-nano) for details.

In [3]:
# Initializing OpenAI model
llm = init_chat_model("gpt-4.1-nano-2025-04-14", model_provider="openai", temperature=0)

## Read and explore data

In [4]:
# Read the data
df = pd.read_json("../data/2023_processed.json")
# Drop articles that have no main text
df = df.dropna(subset=["maintext"])
print(df.language.unique()) # Articles always in English
print(df.date_publish.map(lambda x: x[:4]).unique()) # Articles always published in 2023

['en']
['2023']


In [5]:
# Keep only relevant columns
cols_to_keep = ["date_publish", "description", "maintext", "title", "url", "related_companies"]
df = df[cols_to_keep]
df.head()

Unnamed: 0,date_publish,description,maintext,title,url,related_companies
0,2023-06-23 05:38:00,"At Tyler Malek's ice cream parlors, one cook's...","LOS GATOS, Calif. (AP) — At Tyler Malek's ice ...",The US has tons of leftover food. Upcycling se...,https://finance.yahoo.com/news/us-tons-leftove...,"[BSAC, FHN, PACW, BSMX, VLY, MBRG, SMMF, GNBC,..."
1,2023-08-26 14:00:17,"The worst result, after buying shares in a com...","The worst result, after buying shares in a com...",Baker Hughes (NASDAQ:BKR) shareholders have ea...,https://finance.yahoo.com/news/baker-hughes-na...,"[CHU, INSG, S, TDS, DCM, TMUS, CHT, SPOK, VEON..."
2,2023-12-06 16:57:28,(Bloomberg) -- An insolvency filing by Signa H...,(Bloomberg) -- An insolvency filing by Signa H...,Signa’s Insolvency Yields Long List of Credito...,https://finance.yahoo.com/news/signa-insolvenc...,[TXT]
3,2023-06-14 07:21:56,Swiss citizens vote this weekend on whether to...,By John Revill\nZURICH (Reuters) - Swiss citiz...,Low-tax Switzerland votes on global minimum co...,https://finance.yahoo.com/news/low-tax-switzer...,"[IGLD, RAMP, NSR, TWTR, ACXM, COR, PINS, META,..."
4,2023-01-10 20:23:00,Nationally recognized branding agency HAVEN Cr...,"WAXHAW, N.C., Jan. 10, 2023 /PRNewswire/ -- Na...",National Branding Agency HAVEN Creative Looks ...,https://finance.yahoo.com/news/national-brandi...,"[FIS, FRXB, AAQC, EEX, AUXO, BBOX, GHY, CTLP, ..."


In [6]:
# See sample of an article
df["maintext"].sample(1).values[0].split("\n")

['(Reuters) -Datadog on Tuesday topped estimates for third-quarter results and raised its forecast for annual adjusted profit and revenue, driven by demand from customers seeking better security solutions due to increasing cybersecurity threats.',
 'Shares of the software solutions provider soared 29.9% and were on track for the biggest one-day percentage gain since their listing.',
 'With the rising number of security breaches hitting major companies like MGM Resorts and Clorox, businesses and governments are turning to software and cybersecurity solutions providers like Datadog.',
 'The Delaware, New-York-based company said it expects annual adjusted profit between $1.52 and $1.54 per share, up from its prior outlook of $1.30 and $1.34. Analysts were expecting $1.33 per share, according to LSEG data.',
 'It also raised its full-year revenue forecast to the range of $2.10 billion to $2.11 billion, from its prior outlook of $2.05 billion to $2.06 billion.',
 'Analysts expect Datadog to

## Generation of a summary
Financial news articles are often lengthy and dense with information. When dealing with large volumes of content, it's critical to quickly assess which articles are worth a deeper read. Summaries serve this exact purpose: they allow readers to grasp the core message of an article in seconds.

To address this, we generate concise, high-quality summaries using Generative AI. Specifically, we leverage OpenAI’s gpt-4.1-nano model (initialized earlier) to produce abstractive summaries that capture the main events and insights from each article.

This not only improves readability but also lays the foundation for downstream tasks like semantic search and topic clustering.
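
A minimal sketch of the summarization step described above, assuming a hypothetical `summarize` helper and a hypothetical prompt wording (neither is taken from the original notebook). The model is passed in as a plain callable so the logic can be tried without an API key; in the notebook, the callable could wrap the `llm` object initialized earlier.

```python
from typing import Callable, Dict

# Hypothetical prompt wording, for illustration only.
PROMPT = (
    "Summarize the following financial news article in at most "
    "{max_sentences} sentences, focusing on the main events and figures.\n\n"
    "Article:\n{article}"
)


def summarize(
    article: str,
    call_model: Callable[[str], str],
    cache: Dict[str, str],
    max_sentences: int = 3,
) -> str:
    """Return a summary, reusing a cached result for an identical prompt."""
    prompt = PROMPT.format(max_sentences=max_sentences, article=article.strip())
    if prompt in cache:
        return cache[prompt]
    summary = call_model(prompt).strip()
    cache[prompt] = summary
    return summary


# In the notebook, the callable could be (assumption):
#   call_model = lambda prompt: llm.invoke(prompt).content
```

Keying the cache on the full prompt means that changing the prompt wording or `max_sentences` triggers a fresh call instead of returning a stale summary.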

In [None]:
# Show the histogram of how many words are in each article
fig = px.histogram(x=df.maintext.map(lambda x: len(x.split())), nbins=100)
fig.show()
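
Alongside the histogram, a few summary numbers can make the length distribution easier to reason about. The sketch below uses the same whitespace-based word count as the cell above; the helper name `word_count_summary` and the `long_threshold` default are hypothetical choices for illustration.

```python
import statistics
from typing import Dict, Iterable


def word_count_summary(texts: Iterable[str], long_threshold: int = 1000) -> Dict[str, float]:
    """Summarize article lengths measured in whitespace-separated words."""
    counts = [len(text.split()) for text in texts]
    if not counts:
        raise ValueError("texts must contain at least one article")
    return {
        "min": min(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        # Fraction of articles longer than the (assumed) threshold
        "share_long": sum(c > long_threshold for c in counts) / len(counts),
    }
```

In the notebook this could be called as `word_count_summary(df.maintext)`.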