# Benzinga News Processing

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cwd = "/content/drive/MyDrive/NewsTrading/trading_bot"
%cd /content/drive/MyDrive/NewsTrading/trading_bot

/content/drive/MyDrive/NewsTrading/trading_bot


In [81]:
%%capture
!pip install html2text
!pip install datefinder
!pip install -U dask[complete];

In [72]:
import dask.dataframe as dd
import dask
import pandas as pd
from dask.distributed import Client
from src.preprocessing.news_parser import filter_body, time, body_formatter
import re
import plotly.express as px

In [None]:
# client = Client(memory_limit='25GB', processes=False,
#                 n_workers=2, threads_per_worker=1)
# client

In [None]:
dask.config.set(scheduler="threads")

<dask.config.set at 0x7aeed9dc6230>

## HTML Parsing
First we need to convert the HTML documents to plain text; otherwise the text cells are too large and cause problems with PyArrow/Dask.
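
The implementation of `body_formatter` lives in `src.preprocessing.news_parser` and is not shown in this notebook. As a rough illustration of the idea only, here is a minimal sketch that strips tags with the standard library; `TextExtractor` and `html_to_text` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def html_to_text(html_body):
    """Hypothetical stand-in for body_formatter: HTML in, plain text out."""
    parser = TextExtractor()
    parser.feed(html_body)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample document.
sample = "<div><p>ACME Corp. reports <b>record</b> revenue.</p></div>"
plain = html_to_text(sample)
print(plain)
print(len(sample), "->", len(plain))
```

The plain text is shorter than the markup it came from, which is the property that matters for keeping the cell sizes manageable.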

In [None]:
input_dir = "data/raw_bzg/"
output_dir = 'data/unraw1_bzg/'

In [None]:
# for year in range(2019, 2020):
#     print(year)
#     df = pd.read_parquet(f"{input_dir}story_df_raw_{year}.parquet")
#     df = dd.from_pandas(df, npartitions=12)
#     df["html_body"] = df["html_body"].apply(body_formatter, meta=pd.Series(dtype="str"))
#     df = df.rename(columns={"html_body":"body"})
#     name_function = lambda x: f"data-{year}-{x}.parquet"
#     df.to_parquet(output_dir, name_function=name_function)

Repartition the data so that all partitions are roughly the same size.

In [None]:
input_dir = 'data/unraw1_bzg/'
output_dir = 'data/unraw2_bzg/'

# ddf = dd.read_parquet(input_dir+"*.parquet")
# ddf2 = ddf.repartition(npartitions=50)
# name_function = lambda x: f"data-{x}.parquet"
# ddf2.to_parquet(output_dir, name_function=name_function)

Clean the data up a bit...

In [None]:
input_dir = cwd+'/data/unraw2_bzg/'
output_dir = cwd+'/data/unraw3_bzg/'

In [None]:
ddf = dd.read_parquet(input_dir+"*.parquet")

In [None]:
# Remove rows for which no stock ticker is recorded
ddf = ddf[ddf.stocks != '']

In [None]:
# Convert the `channels` column from its string representation to a list
ddf["channels"] = ddf["channels"].apply(eval, meta=pd.Series(dtype='object'))

## Author Inference

Next, examine the claim that **PRNewswire** and **Businesswire** control the entire US press release market. If that is the case, and they do not also publish other, unimportant releases, we can simply filter the news articles by these authors and save ourselves a lot of work.

In [None]:
dask.config.set(scheduler="processes")
ddf["inferred_author"] = None

def infer_author(body):
  for author in ["PRNewswire", "Globe Newswire", "Business Wire", "ACCESSWIRE"]:
    if re.search(author, body, re.IGNORECASE) is not None:
      return author
  return None

ddf["inferred_author"] = ddf.body.apply(infer_author, meta=pd.Series(dtype="string"))
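
To see how this matching behaves on individual rows, here is a small standalone sketch that applies the same function to a few article bodies. The bodies are hypothetical and made up for illustration; only the author list and its order come from the cell above.

```python
import re

# Same candidate list and order as in the cell above.
AUTHORS = ["PRNewswire", "Globe Newswire", "Business Wire", "ACCESSWIRE"]


def infer_author(body):
    """Return the first distributor name found in the body, else None."""
    for author in AUTHORS:
        if re.search(author, body, re.IGNORECASE) is not None:
            return author
    return None


# Hypothetical article bodies.
bodies = {
    "single": "NEW YORK, Jan. 5 /PRNewswire/ -- ACME Corp. announced ...",
    "two_names": "Source: Business Wire. Earlier release via prnewswire.",
    "spaced": "Distributed by PR Newswire on behalf of ACME Corp.",
    "none": "Analyst commentary without a distributor tag.",
}

results = {key: infer_author(body) for key, body in bodies.items()}
for key, author in results.items():
    print(key, "->", author)
```

Two things are worth keeping in mind. When a body mentions several distributors, the first name in the list wins, not the first name in the text. And spelling variants such as "PR Newswire" with a space are not matched, so such rows would end up with no inferred author.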

In [None]:
# value_counts for authors
author_value_counts = pd.concat([ddf.author.value_counts().head(10), ddf.inferred_author.value_counts().head(10)], axis=1)

In [None]:
author_value_counts

| | author | inferred_author |
|---|---|---|
| Benzinga | 1061214 | |
| PRNewswire | 305720 | 587242.0 |
| Globe Newswire | 293466 | 475171.0 |
| Business Wire | 268561 | 293052.0 |
| Newsfile | 70877 | |
| ACCESSWIRE | 62615 | 81054.0 |
| AB Digital, Inc. | 9936 | |
| WebWire | 6404 | |
| PRWeb | 2617 | |
| News Direct | 2080 | |


In [None]:
author_value_counts.sum().diff()

author                  NaN
inferred_author   -646971.0
dtype: float64

About 650k articles are left out when only the four main press release distributors are considered.
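
The figure can be reproduced by hand from the two columns of the table above. This is plain arithmetic on the printed top-10 counts, not a recomputation on the full dataset.

```python
# Counts copied from the value_counts table above (top 10 of each column).
author_counts = {
    "Benzinga": 1061214,
    "PRNewswire": 305720,
    "Globe Newswire": 293466,
    "Business Wire": 268561,
    "Newsfile": 70877,
    "ACCESSWIRE": 62615,
    "AB Digital, Inc.": 9936,
    "WebWire": 6404,
    "PRWeb": 2617,
    "News Direct": 2080,
}
inferred_counts = {
    "PRNewswire": 587242,
    "Globe Newswire": 475171,
    "Business Wire": 293052,
    "ACCESSWIRE": 81054,
}

author_total = sum(author_counts.values())
inferred_total = sum(inferred_counts.values())

print(author_total)                   # 2083490
print(inferred_total)                 # 1436519
print(inferred_total - author_total)  # -646971
```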

In [None]:
ddf = ddf[~ddf.inferred_author.isna()]

In [None]:
ddf["inferred_author"] = ddf["inferred_author"].astype("string")

In [None]:
ddf["channels"] = ddf.channels.apply(lambda x: str(x), meta=pd.Series(dtype="string"))

In [None]:
name_function = lambda x: f"data-{x}.parquet"
ddf.to_parquet(output_dir, name_function=name_function)

In [5]:
input_dir = cwd+'/data/unraw3_bzg/'
output_dir = cwd+'/data/unraw2_bzg/'

In [6]:
ddf = dd.read_parquet(input_dir)

In [16]:
ddf.inferred_author.value_counts().compute()

PRNewswire        587242
Globe Newswire    475171
Business Wire     293052
ACCESSWIRE         81054
Name: inferred_author, dtype: Int64

In [17]:
ddf.inferred_author.value_counts().sum().compute()

1436519

In [18]:
# Contains roughly 92k rows (see the author counts below)
earnings_ddf = ddf[ddf.channels.apply(lambda x: "Earnings" in x, meta=pd.Series(dtype=bool))]
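
Note that `channels` was cast back to a string before it was written to parquet, so `"Earnings" in x` here is a substring test on that string rather than a list membership test (assuming the parquet round trip preserved the string dtype). The sketch below contrasts the two; apart from "Earnings", the channel names are hypothetical.

```python
import ast

# Hypothetical channel lists; only "Earnings" is taken from the notebook.
rows = [
    ["Earnings", "Press Releases"],
    ["Earnings Beats"],
    ["News"],
]

# What the parquet files hold after the str() cast.
as_strings = [str(row) for row in rows]

# Substring test on the string, as in the filter above.
substring_hits = ["Earnings" in s for s in as_strings]

# Exact membership test after parsing the string back into a list.
exact_hits = ["Earnings" in ast.literal_eval(s) for s in as_strings]

print(substring_hits)  # [True, True, False]
print(exact_hits)      # [True, False, False]
```

If any channel name merely contains the word "Earnings", the substring test would include it as well, so the exact test is the stricter of the two.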

In [19]:
# value counts for authors of earnings reports (contrast to value counts of all news articles)
earnings_ddf.inferred_author.value_counts().head(10)

Globe Newswire    44589
PRNewswire        31440
ACCESSWIRE        16434
Name: inferred_author, dtype: Int64

Here we see that there is not a single press release from **Business Wire** tagged as *Earnings*. Nevertheless, Business Wire does publish relevant *Earnings* reports. I briefly verified this...

## Russell Filtering

How many articles remain if we also filter by the current composition of the Russell 3000?

In [50]:
# Around 3k tickers at this moment
russell_tickers = pd.read_pickle("data/tickers.pkl")
russell_tickers = russell_tickers.categories

In [46]:
ddf.shape[0].compute()

1436519

In [51]:
filt_ddf = ddf[ddf.stocks.isin(russell_tickers)]

In [52]:
filt_ddf.shape[0].compute()

657996

About 660k articles remain that are relevant to our Russell 3000 stock universe. We could have applied this filter earlier... but never mind.
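
As a quick sanity check, the share of articles that survive the filter follows directly from the two row counts above. The second half of the sketch illustrates that an `isin`-style filter is an exact match on the whole cell value; the tickers and the multi-ticker cell are hypothetical, and the source does not say whether such cells occur in `stocks`.

```python
total_articles = 1_436_519    # rows before the Russell 3000 filter
russell_articles = 657_996    # rows after the filter

dropped = total_articles - russell_articles
share = russell_articles / total_articles
print(dropped)          # 778523
print(f"{share:.1%}")   # 45.8%

# Exact-match semantics, sketched with a plain set (hypothetical values).
ticker_set = {"AAPL", "MSFT"}
stocks_cells = ["AAPL", "MSFT", "AAPL,MSFT", "XYZ"]
kept = [cell for cell in stocks_cells if cell in ticker_set]
print(kept)             # ['AAPL', 'MSFT']
```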

In [65]:
filt_ddf = filt_ddf.set_index("time")

In [69]:
monthly_news_counts = filt_ddf.stocks.resample("MS").count()

In [77]:
px.line(monthly_news_counts.compute(), title="# of articles per month")
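
For reference, `resample("MS")` groups rows by the start of their calendar month. A minimal standard-library sketch of the same bucketing, using hypothetical timestamps and a hypothetical helper `month_start`:

```python
from collections import Counter
from datetime import datetime

# Hypothetical publication timestamps.
timestamps = [
    datetime(2019, 1, 3, 9, 30),
    datetime(2019, 1, 28, 16, 5),
    datetime(2019, 2, 1, 8, 0),
    datetime(2019, 4, 15, 12, 0),
]


def month_start(ts):
    """Map a timestamp to midnight on the first day of its month."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


counts = Counter(month_start(ts) for ts in timestamps)
for month in sorted(counts):
    print(month.date(), counts[month])
```

Unlike this sketch, pandas `resample` also returns the empty months in between (March 2019 here) with a count of 0, which is what keeps the line plot continuous.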

## Parsing

Now we can actually parse these articles and then categorize them properly.