- [How to make or get corpus of financial documents](https://stackoverflow.com/questions/32127265/how-to-make-or-get-corpus-of-financial-documents)
- [Reuters-21578](http://www.daviddlewis.com/resources/testcollections/reuters21578/)
- [Reuters-21578-Classification](https://github.com/giuseppebonaccorso/Reuters-21578-Classification)
- [Financial News Dataset from Reuters](https://github.com/Danbo3004/financial-news-dataset)
- [Reuters Dataset of Financial News Articles](https://github.com/Kriyszig/financial-news-data)
- [Sentence-Level Sentiment Analysis of Financial News Using Distributed Text Representations and Multi-Instance Learning](https://arxiv.org/pdf/1901.00400.pdf)
- [SentenceLevelSentimentFinancialNews](https://github.com/InformationSystemsFreiburg/SentenceLevelSentimentFinancialNews)

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

%load_ext autoreload
%autoreload 2

In [2]:
from fnsa.feature import FeatureExtractor
from fnsa.lexicon import get_en2scope, Lexicon
from fnsa.scope import DRScopeDetector, IFScopeDetector
from fnsa.graph import make_graph
from fnsa.util import *
import os
import spacy
from tqdm import tqdm

In [11]:
include_words = False
nlp = spacy.load('en_core_web_sm')
lexicon = Lexicon(nlp)
dr_detector = DRScopeDetector()
if_detector = IFScopeDetector()
extractor = FeatureExtractor(lexicon, detectors=[dr_detector, if_detector], include_words=include_words)

In [4]:
directory = '/opt/code/sentiment-analysis/data/fpb/FinancialPhraseBank-v1.0'
fname = 'Sentences_AllAgree.txt'
path = os.path.join(directory, fname)
#os.listdir(directory)

In [5]:
sentiment2code = {'negative': -1, 'neutral': 0, 'positive': 1}

# Each non-empty line has the form "<sentence>@<sentiment>".
records = []
with open(path, mode='r', encoding='Windows-1252') as ifp:
    for line in ifp:
        line = line.strip()
        if line:
            # Split on the last '@' so an '@' inside the sentence cannot break parsing.
            text, sentiment = line.rsplit('@', 1)
            records.append((text, sentiment2code[sentiment]))
print("Loaded %d records." % len(records))

Loaded 2264 records.
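
The `sentence@label` line format read above can be checked in isolation. The following is a minimal sketch, not part of the `fnsa` package: the helper name `parse_phrasebank_line` and the sample line are made up for illustration. It splits on the last `@` only, on the assumption that the label is always the final field.

```python
SENTIMENT2CODE = {"negative": -1, "neutral": 0, "positive": 1}


def parse_phrasebank_line(line):
    """Turn one 'sentence@label' line into (sentence, code); return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    # Split on the last '@' only, so an '@' inside the sentence is preserved.
    text, sentiment = line.rsplit("@", 1)
    return text, SENTIMENT2CODE[sentiment]


# Hypothetical sample line (not taken from the dataset).
sample = "Contact us at info@example.com for details .@neutral"
print(parse_phrasebank_line(sample))
```

An unknown label raises a `KeyError`, which is probably the desired behaviour here because it surfaces malformed lines instead of silently skipping them.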


In [8]:
%%time

directory = './data'
fname = 'all-agree.tsv'
if include_words: fname = 'all-agree-with-words.tsv'
path = os.path.join(directory, fname)
with open(path, mode='w', encoding='UTF-8') as ofp:
    ofp.write("Sentiment\tSentence\tFeatures\n")
    for record in tqdm(records):
        text, code = record
        doc, features = extractor(text)
        features = " ".join([feature.lower() for feature in features])
        ofp.write("%d\t%s\t%s\n" % (code, text, features))


100%|██████████| 2264/2264 [00:22<00:00, 100.87it/s]


CPU times: user 1min 25s, sys: 2.78 s, total: 1min 28s
Wall time: 22.5 s
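
To use the generated file downstream, it can be read back with the standard `csv` module. This is a sketch under the assumption that the file keeps the three-column layout written above (`Sentiment`, `Sentence`, `Features`) and that features are space-separated. The helper name `read_feature_rows` and the sample rows are hypothetical.

```python
import csv
import io


def read_feature_rows(fp):
    """Read rows of the Sentiment/Sentence/Features TSV into (code, sentence, feature_list)."""
    # QUOTE_NONE: the writer emits raw text, so quote characters must be taken literally.
    reader = csv.DictReader(fp, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        features = row["Features"].split()  # an empty column yields an empty list
        rows.append((int(row["Sentiment"]), row["Sentence"], features))
    return rows


# Made-up content standing in for ./data/all-agree.tsv.
sample = io.StringIO(
    "Sentiment\tSentence\tFeatures\n"
    "1\tProfit rose .\tfeat_a feat_b\n"
    "0\tThe company is based in Helsinki .\t\n"
)
rows = read_feature_rows(sample)
print(rows)
```

With a real file, pass `open(path, encoding='UTF-8', newline='')` instead of the `StringIO` object.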


## Sentence-Level Sentiment Analysis of Financial News Test Dataset

The paper

 > [Sentence-Level Sentiment Analysis of Financial News Using Distributed Text Representations and Multi-Instance Learning](https://arxiv.org/pdf/1901.00400.pdf)<br/>
 > Bernhard Lutz, Nicolas Pröllochs and Dirk Neumann (2018)
 
comes with a small test dataset, available in the repo

 > [SentenceLevelSentimentFinancialNews](https://github.com/InformationSystemsFreiburg/SentenceLevelSentimentFinancialNews)
 
which is useful as an additional evaluation set for our approach and model.

In [9]:
path = '/opt/code/github/SentenceLevelSentimentFinancialNews/adhoc_test.tsv'
records = []
with open(path) as ifp:
    next(ifp, None)  # skip the header row
    for line in ifp:
        line = line.strip()
        if line:
            # Columns: sentence, then integer sentiment code.
            sentence, code = line.rsplit('\t', 1)
            records.append((sentence.strip(), int(code)))
print("Loaded %d records." % len(records))

Loaded 1000 records.
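
Before extracting features, it may help to check how the labels are distributed in the loaded `records`. The sketch below assumes only that `records` is a list of `(sentence, code)` tuples, as built above. The helper name `label_distribution` and the toy data are made up, and the label values in the real file may differ from those shown.

```python
from collections import Counter


def label_distribution(records):
    """Map each sentiment code to (count, share of all records)."""
    counts = Counter(code for _, code in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: (n, n / total) for code, n in sorted(counts.items())}


# Toy records standing in for the loaded test set.
toy_records = [("a", 1), ("b", -1), ("c", 1), ("d", 1)]
for code, (n, share) in label_distribution(toy_records).items():
    print("%2d: %d (%.1f%%)" % (code, n, 100 * share))
```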


In [12]:
%%time

directory = './data'
fname = 'sentence-level.tsv'
if include_words: fname = 'sentence-level-with-words.tsv'
path = os.path.join(directory, fname)
with open(path, mode='w', encoding='UTF-8') as ofp:
    ofp.write("Sentiment\tSentence\tFeatures\n")
    for record in tqdm(records):
        text, code = record
        doc, features = extractor(text)
        features = " ".join([feature.lower() for feature in features])
        ofp.write("%d\t%s\t%s\n" % (code, text, features))



100%|██████████| 1000/1000 [00:11<00:00, 84.93it/s]

CPU times: user 42.4 s, sys: 1.64 s, total: 44 s
Wall time: 11.8 s



