# Workflow Demo

### Initialisation

In [57]:
import os
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("../")

import config.environvars as environvars
import data.extract
import data.transform
import models.llm
import models.apply_models

import pandas as pd
import numpy as np
import re
import warnings
import plotly.graph_objects as go
import plotly.express as px

### Quick EDA

In [29]:
df = pd.read_csv(environvars.paths.path_swissdox+"swissdox.csv")
print("Number of articles in sample: "+str(df.shape[0]))
print("Number of articles with unique content: "+str(df.drop_duplicates(subset="content_id", keep="first").shape[0]))

Number of articles in sample: 109863
Number of articles with unique content: 46858


Note that there are duplicates for two reasons:
- Different papers often publish the same article. This type of duplicate will be filtered.
- Because the API is queried by keyword and matches strings in the article text, an article containing both "Credit Suisse" and "UBS" is returned once for the CS query and once for the UBS query. Since I preprocess the text depending on the keyword used in the query, I need to keep this type of duplicate.
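
The two kinds of duplicates can be separated by deduplicating on the pair of content and query bank rather than on the content alone. This is a minimal sketch on toy data, assuming that `content_id` and `query_bank` together identify a row worth keeping; the toy values and the variable names are made up for illustration.

```python
import pandas as pd

# Toy data: content "a" was published by two papers for the same query bank
# and was also returned for a second query bank.
toy = pd.DataFrame({
    "content_id": ["a", "a", "a", "b"],
    "query_bank": ["credit_suisse", "credit_suisse", "ubs", "ubs"],
    "medium_code": ["X", "Y", "X", "X"],
})

# Drop reprints of the same content for the same query bank,
# but keep one row per (content, query bank) pair.
deduped = toy.drop_duplicates(subset=["content_id", "query_bank"], keep="first")
print(deduped)
```

In this toy frame the reprint in paper "Y" is dropped, while content "a" survives once per query bank.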

In [30]:
df.groupby("query_bank").size().sort_values(ascending=False)

query_bank
credit_suisse                   49707
ubs                             38497
julius_baer                      5654
kantonalbank_bern                2666
baloise                          2437
kantonalbank_luzern              1830
swissquote                       1744
kantonalbank_stgallen            1350
kantonalbank_baselstadt          1092
cembra_money_bank                1039
efg_international                 825
kantonalbank_genf                 776
kantonalbank_baselland            708
kantonalbank_graubuenden          499
kantonalbank_glarus               321
hypothekarbank_lenzburg           306
kantonalbank_waadt                225
kantonalbank_wallis                75
arundel                            62
kantonalbank_jura                  41
lichtensteinische_landesbank        9
dtype: int64

Note that we query the Swissdox API by keyword, where the keywords are (variations of) bank names. The API returns an article as soon as it contains at least one match of the keyword.

Therefore, some articles assigned to a "query_bank" are not (only) about that bank: the bank may merely be referenced in an article about another bank, or be mentioned as the sponsor of some event.

In [31]:
df.groupby("language").size().sort_values(ascending=False)

language
de    97518
fr    11209
it      677
rm      242
en      217
dtype: int64

To start, I focus on the German articles only. I may include other languages later on.

Reason: at the moment, I use an LLM that was pretrained on German texts to classify sentiment, and most articles in the sample are in German.

In [32]:
df = df[df["language"] == "de"]

## Short demo of obtaining sentiment scores: where I'm at now

I calculate one sentiment score per article.

An article contains the following information:

In [49]:
article = df[df["content_id"] == "03a801e8-e72c-d3e2-8276-69ebd9a7816a"].iloc[0]
article

id                                                              47225409
pubtime                                           2022-08-11 08:00:17+02
medium_code                                                         FUWO
medium_name                                                       fuw.ch
rubric                                                       Unternehmen
regional                                                             NaN
doctype                                                              WWE
doctype_description                                        Online medium
language                                                              de
char_count                                                          2612
dateline                                                             NaN
head                                                   BKB erhöht Gewinn
subhead                                                              NaN
content_id                          03a801e8-e72c-d

Note that many articles are not about a single bank; they cover several banks or several topics at once.

Hence, my current best approach is to evaluate only the paragraphs that mention the bank of interest. To do this, each bank has (or will have) a corresponding regex pattern that is used to find all paragraphs in an article mentioning that bank:

In [34]:
pattern = [i for i in data.extract.swissdox.query_inputs if i["query_name"] == article["query_bank"]][0]
pattern = re.compile(pattern["regex"])
pattern

re.compile(r'basler[a-z]* kantonalbank|bkb', re.UNICODE)

Split the article into paragraphs and search for the regex pattern:

In [35]:
text = article["content"]
text = text.split("</p>")
text = [data.transform.preprocess.remove_tags(item) for item in text]
text = [item.lower() for item in text]
text = [item for item in text if bool(re.search(pattern, item))]
text

[' <ld> die basler kantonalbank profitiert im ersten semester u.a. vom erhöhten zinsgeschäft.',
 '</ld> (awp)\xa0die basler kantonalbank (bkb) hat im ersten halbjahr 2021 den gewinn deutlich verbessert. der bkb-konzern, zu der neben dem bkb-stammhaus auch die bank cler gehört, konnte dabei seine produktivität weiter verbessern.',
 ' deutlich höher fiel auch der kommissions- und dienstleistungsertrag aus (+7,3% auf 70,7 mio). hier konnte der bkb-konzern von der wirtschaftlichen erholung nach corona profitieren – so dürften die kunden etwa ihre kreditkarten wieder deutlich mehr benutzt haben. in der vermögensverwaltung konnte die bkb zudem trotz der turbulenten börsen neugeldzuflüsse generieren. rückläufig war allerdings der bei der bkb traditionell volatile handelsertrag (-33% auf 28,9 mio).',
 ' tiefere kosten  während der ertrag insgesamt leicht zulegte (+0,4% auf 298,6 mio), sank der geschäftsaufwand deutlicher (-4,9% auf 167,5 mio). die bkb verweist dabei auf die optimierung der kon
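
Since `data.transform.preprocess.remove_tags` is not shown here, the same paragraph filter can be sketched in a self-contained way. The helper `strip_tags` below is a hypothetical stand-in (a plain regex that drops anything looking like a tag), `relevant_paragraphs` is an illustrative name, and the sample text is made up. Unlike the cell above, this sketch also strips surrounding whitespace.

```python
import re


def strip_tags(fragment: str) -> str:
    """Hypothetical stand-in for remove_tags: replace anything tag-like with a space."""
    return re.sub(r"<[^>]+>", " ", fragment)


def relevant_paragraphs(content: str, pattern: re.Pattern) -> list:
    """Split on closing paragraph tags, clean each piece, keep those matching the pattern."""
    paragraphs = content.split("</p>")
    cleaned = [strip_tags(p).lower().strip() for p in paragraphs]
    return [p for p in cleaned if pattern.search(p)]


bank_pattern = re.compile(r"basler[a-z]* kantonalbank|bkb")
sample = (
    "<p>Die BKB erhöht den Gewinn.</p>"
    "<p>Die Börse schloss fester.</p>"
    "<p>Basler Kantonalbank senkt Kosten.</p>"
)
hits = relevant_paragraphs(sample, bank_pattern)
print(hits)
```

Only the first and third toy paragraphs mention the bank, so the second one is dropped before any sentiment model sees it.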

At the moment, I use a pretrained BERT model (https://huggingface.co/scherrmann/GermanFinBert_SC_Sentiment) to evaluate the sentiment of the paragraphs.

Each paragraph gets a numeric value: 1 if positive sentiment, 0 if neutral and -1 if negative.

In [36]:
device = models.llm.select_device()
model_initialise = models.llm.finbert_german_sentiment.model_initialise()
result = [models.llm.finbert_german_sentiment.finbert_german_sentiment(
    item,
    tokenizer=model_initialise[0],
    model=model_initialise[1],
    device=device
) for item in text]
result

Apple Silicon GPU available.


[1, 1, 1, 1, 1, 1, 1]

Now, to get a single sentiment score for the article, the mean of its paragraph scores is calculated:

In [37]:
np.nanmean(result)


1.0
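
The scoring and averaging steps can be written out without the model. This is a sketch under assumptions: the label strings `"positive"`, `"neutral"` and `"negative"` are placeholders (the classifier's actual label names may differ), and `article_score` is an illustrative helper, not part of the project code.

```python
import math

# Assumed label names; the model's real label strings may differ.
LABEL_TO_SCORE = {"positive": 1, "neutral": 0, "negative": -1}


def article_score(paragraph_labels):
    """Mean of the paragraph scores, skipping labels that are missing or unknown."""
    scores = [LABEL_TO_SCORE[label] for label in paragraph_labels if label in LABEL_TO_SCORE]
    if not scores:
        return math.nan  # no usable paragraph, so no score
    return sum(scores) / len(scores)


mixed = article_score(["positive", "neutral", "negative", "positive"])
print(mixed)
```

One property of this design is worth keeping in mind: a mean near 0 can come from uniformly neutral paragraphs or from positive and negative paragraphs cancelling out, so the article score alone does not distinguish the two cases.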

## Topic Modelling

Right now, a pretrained BERT topic classifier from Hugging Face (https://huggingface.co/nickmuchi/finbert-tone-finetuned-finance-topic-classification) is used to assign a topic to each article based on its "lead". The lead is first translated to English with `models.llm.llama.translate`, as shown in the cells below.

Reason: I assume that the lead contains the most relevant information about an article in concise form and is therefore sufficient to classify the topic without too much noise.

Additionally, the LLM used was trained on tweets, hence I expect it to perform better on short text inputs.

In [53]:
article = df[df["content_id"] == "8e4a950a-c052-8f3e-eabe-719e8055cb50"].iloc[0]
lead = article["content"]
lead = re.findall(r"<ld>(.*?)</ld>", lead)[0]
lead

'<p>Die Inflation steigt auf 2,4 Prozent – den höchsten Wert seit 2008. Was heisst das für unser Portemonnaie, und was macht die Nationalbank?</p>'
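
The extracted lead still carries its `<p>` tags, and `re.findall(...)[0]` raises an `IndexError` when an article has no `<ld>` block. A slightly more defensive variant could look like the sketch below; `extract_lead` is a hypothetical helper, and the sample strings are made up.

```python
import re


def extract_lead(content: str):
    """Return the text of the first <ld>...</ld> block without inner tags, or None."""
    match = re.search(r"<ld>(.*?)</ld>", content, flags=re.DOTALL)
    if match is None:
        return None
    without_tags = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(without_tags.split())  # collapse leftover whitespace


sample = "<ld><p>Die Inflation steigt auf 2,4 Prozent.</p></ld><p>Weiterer Text.</p>"
print(extract_lead(sample))
print(extract_lead("<p>Kein Lead vorhanden.</p>"))
```

Returning `None` instead of raising lets the caller decide whether to skip the article or fall back to another part of the text.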

In [54]:
lead = models.llm.llama.translate(lead)
lead

'The inflation rate rises to 2.4 percent - the highest value since 2008. What does that mean for our wallet, and what is the central bank doing?'

In [60]:
warnings.filterwarnings("ignore")
pipe = models.llm.finbert_english_topic.model_initialise()
result = models.llm.finbert_english_topic.finbert_english_topic(pipe, lead)
warnings.filterwarnings("default")
print("The assigned topic is: "+result)

Apple Silicon GPU available.
The assigned topic is: Macro


In [61]:
article = df[df["content_id"] == "65c76521-9638-0d9a-4bca-bfd6050fb3bb"].iloc[0]
lead = article["content"]
lead = re.findall(r"<ld>(.*?)</ld>", lead)[0]
print(lead)
lead = models.llm.llama.translate(lead)
warnings.filterwarnings("ignore")
pipe = models.llm.finbert_english_topic.model_initialise()
result = models.llm.finbert_english_topic.finbert_english_topic(pipe, lead)
warnings.filterwarnings("default")
print("The assigned topic is: "+result)

<p>Steuerstreit mit Frankreich Die Grossbank zahlt 238 Millionen Euro und kann den Fall so abschliessen. Das liegt auch am ehemaligen UBS-Chefjuristen Markus Diethelm, der jetzt für die CS arbeitet.</p>
Apple Silicon GPU available.
The assigned topic is: Legal | Regulation


Note: I tried other approaches such as Latent Dirichlet Allocation (LDA) and considered implementing my own rule-based classifier, but the LLM performed better and did not require subjective criteria for defining rules or for interpreting the LDA output.

Nevertheless, here is some of the code that was used to preprocess the text data.

In [22]:
article = df.iloc[9999]
pattern = [i for i in data.extract.swissdox.query_inputs if i["query_name"] == article["query_bank"]][0]
pattern = re.compile(pattern["regex"])
text = article["content"]
text = text.split("</p>")
text = [data.transform.preprocess.remove_tags(item) for item in text]
text = [item.lower() for item in text]
text = [item for item in text if bool(re.search(pattern, item))]
print("Relevant paragraphs (first 3 shown):")
print(text[0:3])
text = " ".join(text)
text = data.transform.preprocess.tokenize(text)
text = data.transform.preprocess.remove_stopwords(text, "german")
text = data.transform.preprocess.remove_punctuation(text)
text = data.transform.preprocess.lemmatize(text)
print("\nContent lemmatized (first 9 shown):")
print(text[0:9])
text = [t[1] for t in text if t[2] == "NN"]
print("\nContent lemmatized, only nouns of class NN (first 9 shown):")
print(text[0:9])

Relevant paragraphs (first 3 shown):
['   etwas klarheit zum greensill-debakel: das logo der credit suisse am zürcher paradeplatz. ', '  vor mehr als einem jahr geriet die firma des australischen finanzwunderkinds lex greensill in schieflage. die pleite hatte auch folgen für die credit suisse. die bank hatte fonds der finanzboutique im umfang von 10 milliarden franken an rund 1200 kundinnen und kunden verkauft. noch immer ist nicht klar, wie teuer das greensill-debakel für die credit suisse und ihre kundschaft wird. bislang hat sie rund 70 prozent des fondsvermögens zurückbezahlt.', '   finanzunternehmer lex greensill bereitet der cs ärger. ']
I probably want to extend stop word list later on

Content lemmatized (first 9 shown):
[('klarheit', 'Klarheit', 'NN'), ('greensill-debakel', 'Greensill-debakel', 'NE'), ('logo', 'Logo', 'NN'), ('credit', 'Credit', 'FM'), ('suisse', 'Suisse', 'FM'), ('zürcher', 'zürcher', 'ADJ(A)'), ('paradeplatz', 'Paradeplatz', 'NN'), ('mehr', 'mehr', 'PIAT'),
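
The lemmatizer output is a list of `(token, lemma, tag)` tuples, so the noun filter and a simple frequency count need only the standard library. The sketch below reuses a few tuples from the output above; the repeated `logo` entry is added purely for illustration, and counting nouns is an assumption about how such a list might be used, not the project's pipeline.

```python
from collections import Counter

# A few (token, lemma, tag) tuples in the shape shown above; the last one is a made-up repeat.
tagged = [
    ("klarheit", "Klarheit", "NN"),
    ("greensill-debakel", "Greensill-debakel", "NE"),
    ("logo", "Logo", "NN"),
    ("credit", "Credit", "FM"),
    ("paradeplatz", "Paradeplatz", "NN"),
    ("logo", "Logo", "NN"),
]

# Keep the lemma of every token tagged as a common noun (NN).
nouns = [lemma for _token, lemma, tag in tagged if tag == "NN"]
noun_counts = Counter(nouns)
print(noun_counts.most_common(2))
```

Note that bank names are tagged `NE` or `FM` in this output and are therefore dropped by the `NN` filter.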

### First "case-study": Credit Suisse

As a first test, a random sample of 1000 articles (because of computation time) for Credit Suisse is evaluated.

In [69]:
cs_result = pd.read_csv(environvars.paths.path_preprocessed+"cs_example.csv", sep=";")
cs_result = cs_result[cs_result["result"].notna()]
cs_result["date"] = pd.to_datetime(cs_result["pubtime"], utc=True)
cs_result = cs_result.groupby("date")["result"].mean().reset_index()
cs_result = cs_result.sort_values(by="date")
cs_result["rolling_ma"] = cs_result["result"].rolling(window=5).mean()
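
Note that `rolling(window=5)` averages over the last five rows, which corresponds to five days only if every day has exactly one row. If a calendar-based window is wanted, pandas accepts an offset string on a datetime index. A minimal sketch on toy data (the values and the variable names are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-10"], utc=True),
    "result": [1.0, -1.0, 1.0],
})
series = toy.set_index("date")["result"].sort_index()

# Row-based window: always the last 2 observations, however far apart they are.
by_rows = series.rolling(window=2).mean()

# Time-based window: all observations within the last 5 days of each timestamp.
by_time = series.rolling("5D").mean()

print(pd.DataFrame({"by_rows": by_rows, "by_time": by_time}))
```

For the last toy row, the row-based window mixes in an observation that is eight days old, while the time-based window only sees the observation of that day.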

Note: at the moment, the workflow returns some NAs because of problems with the LLM tokenizer: "token indices sequence length is longer than the specified maximum sequence length", i.e. some inputs are longer than the model accepts.
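
One way to avoid over-long inputs is to split a paragraph into windows that fit the limit and score each window separately. The sketch below only illustrates the splitting; it uses whitespace-separated words as a stand-in for model tokens (real subword counts will differ), and `chunk_tokens` is a hypothetical helper. Hugging Face tokenizers also accept `truncation=True` together with `max_length`, which would simply cut long inputs instead.

```python
def chunk_tokens(tokens, max_len, overlap=0):
    """Split a token list into windows of at most max_len tokens, optionally overlapping."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end
        start += step
    return chunks


words = "die bank hatte fonds im umfang von zehn milliarden franken".split()
plain = chunk_tokens(words, max_len=4)
overlapping = chunk_tokens(words, max_len=4, overlap=2)
print(plain)
print(overlapping)
```

The per-window scores could then be averaged in the same way as the paragraph scores, although that choice is an assumption and not part of the current workflow.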

#### General Sentiment towards Credit Suisse

In [81]:
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=cs_result["date"],
    y=cs_result["result"],
    mode="lines",
    name="Sentiment Score",
    line=dict(color="lightgrey")
))

fig.add_trace(go.Scatter(
    x=cs_result["date"],
    y=cs_result["rolling_ma"],
    mode="lines",
    name="Rolling MA (5 observations)",
    line=dict(color="red")
))

fig.update_layout(
    title="Sentiment Scores of Credit Suisse Articles (red: rolling MA over 5 observations)",
    xaxis_title="Date",
    yaxis_title="Sentiment Score [-1, 1]",
    xaxis_tickangle=-45,
    height=600,
    width=1400,
    plot_bgcolor="#F5F5F5"
)

fig.show()


#### Sentiment per Topic towards Credit Suisse

In [82]:
cs_result = pd.read_csv(environvars.paths.path_preprocessed + "cs_example.csv", sep=";")
cs_result = cs_result[cs_result["result"].notna()]
cs_result["date"] = pd.to_datetime(cs_result["pubtime"], utc=True)
cs_result = cs_result.groupby(["date", "topic"])["result"].mean().reset_index()
cs_result = cs_result.sort_values(by=["date", "topic"])
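# transform keeps the original row index, so each rolling mean is assigned to the row it belongs to
cs_result["rolling_ma"] = cs_result.groupby("topic")["result"].transform(lambda s: s.rolling(window=5).mean())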
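
A per-group rolling mean is easy to misalign when it is assigned back to the frame, because `groupby(...).rolling(...)` returns its result ordered by group and indexed by group key plus original row. The sketch below, on made-up toy data, shows an index-preserving variant using `transform`:

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"]),
    "topic": ["Macro", "Legal", "Macro", "Legal"],
    "result": [1.0, -1.0, 0.0, 1.0],
})
toy = toy.sort_values(["date", "topic"]).reset_index(drop=True)

# transform returns a Series with the same index as the input rows,
# so every value lands next to the row it was computed for.
toy["rolling_ma"] = toy.groupby("topic")["result"].transform(
    lambda s: s.rolling(window=2).mean()
)
print(toy)
```

Each topic's first row has no full window yet and is therefore NaN; the second row of each topic holds the mean of that topic's two values.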


In [83]:
fig = px.line(cs_result, x="date", y="rolling_ma", color="topic", 
              line_shape="linear", 
              title="Moving Averages by Topic",
              labels={"rolling_ma": "Moving Average Value", "date": "Date"})

fig.update_traces(line=dict(width=2))
fig.update_layout(yaxis_title="Moving Average Value", xaxis_title="Date", plot_bgcolor="#F5F5F5")

default_visible_topics = ["General News | Opinion", "Stock Commentary"]
for trace in fig.data:
    if trace.name not in default_visible_topics:
        trace.visible = "legendonly"

fig.show()
