Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models

(EMNLP 2023 findings)

Paper: https://aclanthology.org/2023.findings-emnlp.138/

Model repository: https://huggingface.co/HYdsl/FiLM

Abstract

Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to a subpar generalization performance, resulting in general-purpose PLMs, including BERT, often outperforming financial PLMs on many downstream tasks. To address this issue, we collected a broad range of financial corpus and trained the Financial Language Model (FiLM) on these diverse datasets. Our experimental results confirm that FiLM outperforms not only existing financial PLMs but also general domain PLMs. Furthermore, we provide empirical evidence that this improvement can be achieved even for unseen corpus groups.

FiLM (Financial Language Model) Models 🌟

FiLM is a pre-trained language model (PLM) optimized for the financial domain and built on a diverse range of financial-domain corpora. Initialized from RoBERTa-base, FiLM is further pretrained on these corpora and becomes the first financial PLM to surpass RoBERTa-base in the financial domain. The model can also be referred to as Fin-RoBERTa.

To train FiLM, we have categorized our Financial Corpus into specific groups and gathered a diverse range of corpora to ensure optimal performance.

We offer two versions of the FiLM model, each tailored for specific use-cases in the Financial domain:

FiLM (2.4B): Our Base Model

This is our foundational model, trained on the entire range of corpora listed in the "Types of Training Corpora" table. It is suited to a wide array of financial applications. 📊

FiLM (5.5B): Optimized for SEC Filings

This model is specialized for handling SEC filings. We expanded the 2.4B-token training set by adding 3.1 billion tokens of SEC filings, giving 5.5B tokens in total. The added data is sourced from EDGAR-CORPUS: Billions of Tokens Make The World Go Round (Loukas et al., ECONLP 2021).

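To load the tokenizer and the model, use the Hugging Face transformers Auto classes. FiLM uses the RoBERTa-base tokenizer, so load the tokenizer from 'roberta-base' and the weights from 'HYdsl/FiLM'.

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModel.from_pretrained('HYdsl/FiLM')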
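As an illustration of what to do once the tokenizer and model are loaded, the sketch below turns sentences into fixed-size vectors. It is a minimal example, not the authors' pipeline: mean pooling over the last hidden state is a generic choice assumed here, the helper names `mean_pool`, `embed`, and `demo` are hypothetical, and the sample sentence is made up for illustration.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against all-padding rows
    return summed / counts


def embed(texts, tokenizer, model):
    """Encode a list of strings into one vector per string."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    model.eval()  # disable dropout for deterministic outputs
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])


def demo():
    # Imported here so the helpers above need only torch.
    from transformers import AutoModel, AutoTokenizer

    # Downloads the tokenizer and weights from the Hugging Face Hub on first use.
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained("HYdsl/FiLM")
    sentences = ["Quarterly revenue rose on strong loan growth."]  # made-up example
    vectors = embed(sentences, tokenizer, model)
    print(vectors.shape)  # (number of sentences, hidden size)
```

Calling `demo()` requires network access. For downstream tasks such as those in the benchmark tables, a task-specific head and fine-tuning would normally replace the pooling step.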

Refer to the following documentation for basic code use.

Basic code.md

Types of Training Corpora 📚

| Group | Name | Description | # Tokens |
| --- | --- | --- | --- |
| News | TRC2 | Collection of financial news stories from Reuters | 227.39 M |
| News | Investing.com | Stock, options, commodity, etc. news articles | 130.88 M |
| News | NYtimes | Economic articles from the New York Times | 75.04 M |
| News | EIA | Commodity-related news articles from EIA | 1.12 M |
| SEC filings | SEC filings | Annual reports (10-K) and quarterly reports (10-Q) | 307.19 M |
| Earnings Call | Earnings Call | Earnings conference call transcripts | 1.66 B |
| Papers | ArXiv | A collection of abstracts of economic research papers | 42.18 M |
| Papers | AIHUB | A collection of Korean economics research papers | 5.89 M |
| MISC | Investopedia | Economic glossary | 5.33 M |
| MISC | FinWEB | Finance, loans, and insurance related articles | 2.86 M |
| Total | 10 corpora | | 2.4 B |
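To make the composition of the pretraining data easier to see, the sketch below sums the token counts per group and reports each group's share. The numbers are copied from the corpus table; treating "SEC filings" and "Earnings Call" as single-corpus groups is an assumption about the table layout, and because the Earnings Call figure is given only as a rounded 1.66 B, the computed total is approximate.

```python
from collections import defaultdict

# (group, corpus name, tokens in millions), copied from the corpus table.
CORPORA = [
    ("News", "TRC2", 227.39),
    ("News", "Investing.com", 130.88),
    ("News", "NYtimes", 75.04),
    ("News", "EIA", 1.12),
    ("SEC filings", "SEC filings", 307.19),
    ("Earnings Call", "Earnings Call", 1660.0),  # listed as 1.66 B (rounded)
    ("Papers", "ArXiv", 42.18),
    ("Papers", "AIHUB", 5.89),
    ("MISC", "Investopedia", 5.33),
    ("MISC", "FinWEB", 2.86),
]


def tokens_by_group(corpora):
    """Sum token counts (in millions) for each corpus group."""
    totals = defaultdict(float)
    for group, _name, millions in corpora:
        totals[group] += millions
    return dict(totals)


group_totals = tokens_by_group(CORPORA)
grand_total = sum(group_totals.values())

# Print groups from largest to smallest with their share of all tokens.
for group, millions in sorted(group_totals.items(), key=lambda kv: -kv[1]):
    share = 100 * millions / grand_total
    print(f"{group:<14}{millions:>10.2f} M  {share:5.1f}%")
print(f"{'Total':<14}{grand_total:>10.2f} M")
```

With these figures the total comes to about 2.46 billion tokens, in line with the 2.4 B stated in the table, and earnings call transcripts account for roughly two thirds of it.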

Financial tasks performance

| Model | FPB (Accuracy) | FPB (F-1) | NER (F-1) | Headline (F-1) | FiNER (F-1) | FinQA (Prog Acc) | FinQA (Exe Acc) | FOMC (F-1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT [Devlin et al., 2019] | 83.30 | 81.73 | 75.09 | 89.54 | 79.40 | 51.09 | 53.10 | 63.81 |
| RoBERTa-base [Liu et al., 2019b] | 85.30 | 83.93 | 78.81 | 91.29 | 81.58 | 56.76 | 59.11 | 69.16 |
| Fin-BERT [Araci D et al., 2019] | 85.25 | 82.45 | 77.93 | 90.48 | 81.49 | 47.86 | 50.04 | 64.50 |
| Fin-BERT [Yang Y et al., 2020] | 83.68 | 82.52 | 70.40 | 90.83 | 81.08 | 38.79 | 40.54 | 64.30 |
| FLANG-BERT [Shah et al., 2022] | 84.76 | 83.12 | 75.58 | 91.06 | 81.53 | 49.17 | 51.44 | 64.93 |
| FLANG-RoBERTa [Shah et al., 2022] | 83.86 | 82.18 | 71.36 | 90.46 | 80.78 | 30.69 | 32.17 | 68.02 |
| SEC-BERT-base [Loukas L et al., 2022] | 84.37 | 82.18 | 78.74 | 90.52 | 82.35 | 53.18 | 55.45 | 65.06 |
| FiLM [ours] | 86.25 | 84.48 | 79.78 | 91.79 | 82.02 | 58.85 | 61.38 | 69.60 |
| FiLM (5.5B) [ours] | 86.14 | 84.11 | 78.82 | 91.74 | 82.39 | 59.37 | 61.64 | 69.16 |
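The comparison with RoBERTa-base can be checked directly from the scores in the performance table. The sketch below computes the per-column difference between each FiLM variant and RoBERTa-base; the scores are copied from the table, the column labels follow the metric row (an assumption about how the eight values map to tasks), and the helper name `deltas` is hypothetical.

```python
COLUMNS = [
    "FPB Acc", "FPB F-1", "NER F-1", "Headline F-1",
    "FiNER F-1", "FinQA Prog Acc", "FinQA Exe Acc", "FOMC F-1",
]

# Scores copied from the performance table.
ROBERTA_BASE = [85.30, 83.93, 78.81, 91.29, 81.58, 56.76, 59.11, 69.16]
FILM = [86.25, 84.48, 79.78, 91.79, 82.02, 58.85, 61.38, 69.60]
FILM_5_5B = [86.14, 84.11, 78.82, 91.74, 82.39, 59.37, 61.64, 69.16]


def deltas(model_scores, baseline_scores):
    """Per-column score difference (model minus baseline), rounded to 2 places."""
    return [round(m - b, 2) for m, b in zip(model_scores, baseline_scores)]


film_delta = deltas(FILM, ROBERTA_BASE)
film_5_5b_delta = deltas(FILM_5_5B, ROBERTA_BASE)

# One row per metric: gain of FiLM and FiLM (5.5B) over RoBERTa-base.
for name, d_base, d_sec in zip(COLUMNS, film_delta, film_5_5b_delta):
    print(f"{name:<16} FiLM {d_base:+.2f}   FiLM (5.5B) {d_sec:+.2f}")
```

On these numbers FiLM improves on RoBERTa-base in every column, with the largest gains on the two FinQA metrics, while FiLM (5.5B) ties RoBERTa-base on FOMC and is ahead elsewhere.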

Information on the financial tasks

| Name | Task | Train size | Valid size | Test size | Metric |
| --- | --- | --- | --- | --- | --- |
| FPB [1] | Sentiment classification | 3,391 | 726 | 726 | Accuracy & F-1 |
| NER [2] | Named entity recognition | 932 | 232 | 302 | F-1 |
| Headline [3] | News headlines classification | 7,989 | 1,141 | 2,282 | F-1 |
| FiNER [4] | Numeric entity recognition | 900,384 | 112,494 | 108,378 | F-1 |
| FinQA [5] | Question answering | 6,251 | 883 | 1,147 | Accuracy (Prog & Exe) |
| FOMC [6] | Sentiment classification | 1,588 | 396 | 496 | F-1 (Combined-S) |

For details on the tasks, refer to the FLUE benchmark; we also follow its setup.

[1] https://huggingface.co/datasets/financial_phrasebank

[2] https://huggingface.co/datasets/tner/fin

[3] https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-in-commodity-market-gold/data

[4] https://github.com/nlpaueb/finer

[5] https://github.com/czyssrs/FinQA

[6] https://github.com/gtfintechlab/fomc-hawkish-dovish
