# Benzinga News Processing

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cwd = "/content/drive/MyDrive/NewsTrading/trading_bot"
%cd /content/drive/MyDrive/NewsTrading/trading_bot

/content/drive/MyDrive/NewsTrading/trading_bot


In [3]:
%%capture
!pip install html2text
!pip install datefinder
!pip install -U dask[complete]
!pip install nltk;

In [122]:
%load_ext autoreload
%autoreload 2
import dask.dataframe as dd
import dask
import pandas as pd
from dask.distributed import Client
from src.preprocessing.news_parser import filter_body, time, body_formatter, get_company_abbreviation
import re
import plotly.express as px
import nltk
nltk.download('punkt')
import yfinance as yf

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
# client = Client(memory_limit='25GB', processes=False,
#                 n_workers=2, threads_per_worker=1)
# client

# New Section

## HTML Parsing
First, we need to convert the HTML documents to plain text; otherwise the text cells are too large and cause problems with PyArrow/Dask.
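
To illustrate the idea, here is a minimal sketch of tag stripping using only the standard library. It is not the project's `body_formatter` (whose source is not shown in this notebook, and which presumably relies on `html2text`); `TextExtractor` and `html_to_text` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and drop tags (hypothetical stand-in for body_formatter)."""

    # Tags that should act as word boundaries once the markup is gone.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr"}

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace runs left behind by removed markup.
    return " ".join("".join(parser.chunks).split())


# Made-up sample document.
sample = '<div class="story"><p>Shares rose <b>5%</b> today.</p><p>More details follow.</p></div>'
plain = html_to_text(sample)
print(plain)
print(len(sample), "->", len(plain))
```

The plain-text version is shorter than the markup it came from, which is the effect the conversion step relies on to keep the cells small.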

In [6]:
dask.config.set(scheduler="threads")

<dask.config.set at 0x7ae1f2680e20>

In [None]:
input_dir = "data/raw_bzg/"
output_dir = 'data/unraw1_bzg/'

In [None]:
# for year in range(2019, 2020):
#     print(year)
#     df = pd.read_parquet(f"{input_dir}story_df_raw_{year}.parquet")
#     df = dd.from_pandas(df, npartitions=12)
#     df["html_body"] = df["html_body"].apply(body_formatter, meta=pd.Series(dtype="str"))
#     df = df.rename(columns={"html_body":"body"})
#     name_function = lambda x: f"data-{year}-{x}.parquet"
#     df.to_parquet(output_dir, name_function=name_function)

## Repartitioning
Repartition the data so that all partitions are roughly the same size.

In [None]:
input_dir = 'data/unraw1_bzg/'
output_dir = 'data/unraw2_bzg/'

# ddf = dd.read_parquet(input_dir+"*.parquet")
# ddf2 = ddf.repartition(npartitions=50)
# name_function = lambda x: f"data-{x}.parquet"
# ddf2.to_parquet(output_dir, name_function=name_function)

## Author Inference

A bit of data cleaning...

In [8]:
input_dir = cwd+'/data/unraw2_bzg/'
output_dir = cwd+'/data/unraw3_bzg/'

In [None]:
ddf = dd.read_parquet(input_dir+"*.parquet")

In [None]:
# Remove rows for which no stock ticker is recorded
ddf = ddf[ddf.stocks != '']

In [None]:
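# Convert `channels` datatype from string to list
# (ast.literal_eval only parses Python literals, unlike eval, which runs arbitrary code)
import ast
ddf["channels"] = ddf["channels"].apply(ast.literal_eval, meta=pd.Series(dtype='object'))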

Next, examine the claim that **PRNewswire** and **Businesswire** control the entire market for press releases in the US. If that is the case, and they do not also publish other, unimportant items, we can simply filter the news articles by these authors and save ourselves a lot of work.

In [None]:
dask.config.set(scheduler="processes")
ddf["inferred_author"] = None

def infer_author(body):
  for author in ["PRNewswire", "Globe Newswire", "Business Wire", "ACCESSWIRE"]:
    if re.search(author, body, re.IGNORECASE) is not None:
      return author
  return None

ddf["inferred_author"] = ddf.body.apply(infer_author, meta=pd.Series(dtype="string"))
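
The matching logic in the cell above can be tried on plain strings without Dask. The sketch below repeats the function with one precaution added here, `re.escape`, so that distributor names are always matched literally. Apart from the first body, which is taken from the sample article in this notebook, the example strings are made up.

```python
import re

# Same four distributors, in the same priority order, as in the cell above.
AUTHORS = ["PRNewswire", "Globe Newswire", "Business Wire", "ACCESSWIRE"]


def infer_author(body):
    for author in AUTHORS:
        # re.escape is an addition in this sketch: match the name literally.
        if re.search(re.escape(author), body, re.IGNORECASE) is not None:
            return author
    return None


bodies = [
    "SANTA CLARA, Calif., Oct. 17, 2012 /PRNewswire/ -- Palo Alto Networks, Inc.",
    "NEW YORK--(BUSINESS WIRE)--Example Corp announced results.",  # made up
    "Distributed via Business Wire and PRNewswire.",               # made up
    "An article that names no distributor at all.",                # made up
]
for body in bodies:
    print(infer_author(body))
```

Two properties are worth noting: the search is case-insensitive, and when a body mentions several distributors the first entry in the list wins, so the order of the list acts as a priority.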

In [None]:
# value_counts for authors
author_value_counts = pd.concat([ddf.author.value_counts().head(10), ddf.inferred_author.value_counts().head(10)], axis=1)

In [None]:
author_value_counts

Unnamed: 0,author,inferred_author
Benzinga,1061214,
PRNewswire,305720,587242.0
Globe Newswire,293466,475171.0
Business Wire,268561,293052.0
Newsfile,70877,
ACCESSWIRE,62615,81054.0
"AB Digital, Inc.",9936,
WebWire,6404,
PRWeb,2617,
News Direct,2080,


In [None]:
author_value_counts.sum().diff()

author                  NaN
inferred_author   -646971.0
dtype: float64

Roughly 650k news items are left out if only the four main distributors of press releases are considered.
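
The figure can be reproduced by hand from the table above. Note that it compares the sum over the ten most frequent values of `author` with the sum over the four inferred authors, so it is an approximation of the number of dropped rows rather than an exact count over the whole dataset.

```python
# Counts copied from the value_counts table above.
author_counts = {
    "Benzinga": 1061214,
    "PRNewswire": 305720,
    "Globe Newswire": 293466,
    "Business Wire": 268561,
    "Newsfile": 70877,
    "ACCESSWIRE": 62615,
    "AB Digital, Inc.": 9936,
    "WebWire": 6404,
    "PRWeb": 2617,
    "News Direct": 2080,
}
inferred_counts = {
    "PRNewswire": 587242,
    "Globe Newswire": 475171,
    "Business Wire": 293052,
    "ACCESSWIRE": 81054,
}

author_total = sum(author_counts.values())
inferred_total = sum(inferred_counts.values())
difference = inferred_total - author_total

print(author_total)    # 2083490
print(inferred_total)  # 1436519
print(difference)      # -646971
```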

In [None]:
ddf = ddf[~ddf.inferred_author.isna()]

In [None]:
ddf["inferred_author"] = ddf["inferred_author"].astype("string")

In [None]:
ddf["channels"] = ddf.channels.apply(lambda x: str(x), meta=pd.Series(dtype="string"))

In [10]:
ddf.inferred_author.value_counts().compute()

PRNewswire        587242
Globe Newswire    475171
Business Wire     293052
ACCESSWIRE         81054
Name: inferred_author, dtype: Int64

In [11]:
ddf.inferred_author.value_counts().sum().compute()

1436519

In [None]:
name_function = lambda x: f"data-{x}.parquet"
ddf.to_parquet(output_dir, name_function=name_function)

In [None]:
# Contains 100k rows
earnings_ddf = ddf[ddf.channels.apply(lambda x: "Earnings" in x, meta=pd.Series(dtype=bool))]

In [None]:
# value counts for authors of earnings reports (contrast to value counts of all news articles)
earnings_ddf.inferred_author.value_counts().head(10)

Globe Newswire    44589
PRNewswire        31440
ACCESSWIRE        16434
Name: inferred_author, dtype: Int64

Here we see that not a single press release from **Business Wire** is tagged *Earnings*. Nevertheless, Business Wire does publish relevant *Earnings* reports; I verified this briefly...
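
One detail of the earnings filter is worth keeping in mind. Assuming the cells ran in document order, `channels` had already been converted back to a string when the filter was applied, so `"Earnings" in x` is a substring test rather than a list-membership test. The sketch below shows the difference; `'Earnings Beats'` is a made-up channel label used only for illustration.

```python
import ast

# Stringified channel list, as stored after the str() conversion.
# 'Earnings Beats' is a hypothetical label, not taken from the data.
channels_as_text = "['Press Releases', 'Earnings Beats']"
channels_as_list = ast.literal_eval(channels_as_text)

# Substring test on the string: matches any label containing "Earnings".
substring_hit = "Earnings" in channels_as_text

# Membership test on the list: matches only the exact label "Earnings".
exact_hit = "Earnings" in channels_as_list

print(substring_hit, exact_hit)
```

If only the exact label is wanted, the membership test on the parsed list is the stricter choice.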

## Russell Filtering and Ticker-Name Mapping

How many news items remain if we also filter on the current composition of the Russell 3000?

In [123]:
input_dir = cwd+'/data/unraw3_bzg/'
output_dir = cwd+'/data/unraw4_bzg/'
ddf = dd.read_parquet(input_dir)

In [124]:
# Around 3k tickers at this moment
russell_tickers = pd.read_pickle("data/tickers.pkl")
russell_tickers = russell_tickers.categories

In [51]:
# Get company name by ticker (longName is always equal to shortName in yf...)
# This takes a long time because of the API calls to yf (15 min)
def yahoo_get_wrapper(x):
  try:
    return yf.Ticker(x).info.get("longName")
  except Exception:
    # Best effort: treat any failed lookup as a missing name
    return None

company_names = pd.Series(russell_tickers).apply(yahoo_get_wrapper)

In [125]:
mapper = pd.concat([company_names, pd.Series(russell_tickers)], axis=1)

In [126]:
mapper.columns = ["company_names", "ticker"]
mapper = mapper[mapper.isna().sum(axis=1) == 0]

In [127]:
mapper = mapper.set_index("ticker")

In [138]:
company_endings = pd.read_table("data_shared/corporation_endings.txt").iloc[:, 0]
mapper["short_name"] = mapper.company_names.apply(lambda x: get_company_abbreviation(x,
                                                                                     company_endings=company_endings))

In [139]:
mapper.short_name.isna().sum() # 83 stocks for which we don't have an ending to abbreviate

83
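
The source of `get_company_abbreviation` is not part of this notebook. As a rough sketch of what such a helper might do, the function below strips one known corporate ending and returns `None` when no ending matches, which would explain the missing short names counted above. The name `abbreviate_company_name` and the endings list are assumptions; the real list comes from `corporation_endings.txt`.

```python
def abbreviate_company_name(name, endings):
    """Strip one known corporate ending; return None if none matches.

    Hypothetical sketch, not the project's get_company_abbreviation.
    """
    cleaned = name.strip()
    # Try longer endings first so a short ending cannot shadow a longer one.
    for ending in sorted(endings, key=len, reverse=True):
        head = cleaned[: -len(ending)]
        # Require a separator before the ending so "Zinc." is not cut at "inc.".
        if cleaned.lower().endswith(ending.lower()) and head.endswith((" ", ",")):
            return head.rstrip(" ,")
    return None


# Assumed sample of endings for illustration only.
sample_endings = ["Inc.", "Corporation", "Ltd."]

print(abbreviate_company_name("Palo Alto Networks, Inc.", sample_endings))
print(abbreviate_company_name("Example Holdings", sample_endings))
```

The first call reproduces the `company_name` to `short_name` pair seen for PANW later in the notebook; the second shows the no-match case.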

In [140]:
mapper.to_parquet(cwd + "/data_shared/ticker_name_mapper.parquet")

In [141]:
filt_ddf = ddf[ddf.stocks.isin(mapper.index.to_list())]

In [142]:
# ddf.shape[0].compute()

In [143]:
# filt_ddf.shape[0].compute()

About 660k news items remain that are relevant to our Russell 3000 stock universe. We could actually have applied this filter earlier... but never mind.

In [144]:
filt_ddf = filt_ddf.set_index("time")

In [145]:
monthly_news_counts = filt_ddf.stocks.resample("MS").count()
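
For readers unfamiliar with `resample("MS")`: it groups rows into buckets labelled by the first day of each month. The standard-library sketch below mimics that bucketing on a few timestamps. Only the first timestamp is taken from the sample article in this notebook; the others are made up. Unlike `resample`, a `Counter` does not emit empty months with a count of zero.

```python
from collections import Counter
from datetime import datetime


def month_start(ts):
    """Map a timestamp to the first instant of its month (the 'MS' label)."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


timestamps = [
    datetime(2012, 10, 17, 21, 59, 30),  # from the sample article
    datetime(2012, 10, 30, 8, 0, 0),     # made up
    datetime(2012, 11, 2, 9, 15, 0),     # made up
]

monthly = Counter(month_start(ts) for ts in timestamps)
for label in sorted(monthly):
    print(label.date(), monthly[label])
```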

In [146]:
# px.line(monthly_news_counts.compute(), title="# of articles per month")

## Parsing


In [147]:
dask.config.set(scheduler="processes")

<dask.config.set at 0x7e974e5280a0>

Now we can actually parse these news items and then categorize them properly.
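
`filter_body` lives in `src.preprocessing.news_parser` and its source is not shown here. Judging from its output on the sample article later in this section, one of its steps replaces mentions of the company with the neutral phrase "the company" and removes the exchange tag. The sketch below illustrates only that step under these assumptions; `mask_company` is a hypothetical name, and the real function does more (for example, it also drops the dateline).

```python
import re


def mask_company(text, ticker, company_name, short_name):
    """Replace company mentions with 'the company' (hypothetical sketch)."""
    # Collapse hard line breaks and repeated spaces.
    text = " ".join(text.split())
    # Drop exchange tags such as "(NYSE: PANW)".
    text = re.sub(r"\s*\([A-Za-z]+:\s*" + re.escape(ticker) + r"\)", "", text)
    # Replace the long name first so that ", Inc." is not left behind.
    for name in (company_name, short_name):
        text = re.sub(re.escape(name), "the company", text)
    return text


# Shortened excerpt based on the sample article in this notebook.
excerpt = (
    "Palo Alto Networks, Inc.\n(NYSE: PANW) announced the pricing. "
    "Palo Alto Networks will not receive any proceeds."
)
masked = mask_company(excerpt, "PANW", "Palo Alto Networks, Inc.", "Palo Alto Networks")
print(masked)
```

Masking the name keeps a downstream model from keying on the identity of the company instead of the content of the announcement, which is presumably the motivation for this step.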

In [155]:
ddf = filt_ddf
ddf = ddf.drop(columns=["author"]).rename(columns={"inferred_author":"author"})

In [156]:
ddf["company_name"] = ddf.stocks.apply(lambda x: mapper.company_names.loc[x], meta=pd.Series(dtype="string"))
ddf["short_name"] = ddf.stocks.apply(lambda x: mapper.short_name.loc[x], meta=pd.Series(dtype="string"))

In [157]:
name_function = lambda x: f"data-{x}.parquet"
ddf.to_parquet(cwd+'/data/latest/', name_function=name_function)

In [158]:
ddf = dd.read_parquet(cwd+'/data/latest/')

In [159]:
sample_partition = ddf.get_partition(5)

In [160]:
x = sample_partition.head().iloc[0]
x

stocks                                                       PANW
title           Palo Alto Networks Prices Secondary Public Off...
channels                                       ['Press Releases']
body            SANTA CLARA, Calif., Oct. 17, 2012 /PRNewswire...
author                                                 PRNewswire
company_name                             Palo Alto Networks, Inc.
short_name                                     Palo Alto Networks
Name: 2012-10-17 21:59:30-04:00, dtype: object

In [161]:
x.body

"SANTA CLARA, Calif., Oct. 17, 2012 /PRNewswire/ -- Palo Alto Networks, Inc.\n(NYSE: PANW) announced the pricing of 4,800,000 shares of its common stock at\n$63.00 per share in a secondary offering. All of the shares will be sold by\nexisting stockholders. In addition, the underwriters have a 30-day option to\npurchase up to 720,000 additional shares of common stock from certain of the\nselling stockholders. As part of the offering, all selling stockholders have\nentered into lock-up agreements that will extend the initial public offering\nlock-up period until 135 days after this offering.\n\nPalo Alto Networks will not receive any proceeds from the sale of the shares\nin this offering. The primary purposes of the offering are to facilitate an\norderly distribution of shares and to increase the company's public float.\n\nMorgan Stanley & Co. LLC, Goldman, Sachs & Co. and Citigroup Global Markets\nInc. are acting as lead joint book-running managers for the offering, and\nCredit Suisse S

In [162]:
filter_body(x.body, x.stocks, x.author, x.name, x.company_name, x.short_name) # x.name == time

"the company announced the pricing a past date shares of its common stock at $63.00 per share in a secondary offering. All of the shares will be sold by existing stockholders. In addition, the underwriters have a 30-day option to purchase up to 720,000 additional shares of common stock from certain of the selling stockholders. As part of the offering, all selling stockholders have entered into lock-up agreements that will extend the initial public offering lock-up period until 135 days after this offering. the company will not receive any proceeds from the sale of the shares in this offering. The primary purposes of the offering are to facilitate an orderly distribution of shares and to increase the company's public float. Morgan Stanley & Co. LLC, Goldman, Sachs & Co. and Citigroup Global Markets Inc. are acting as lead joint book-running managers for the offering, and Credit Suisse Securities (USA) LLC, Barclays Capital Inc., UBS Securities LLC and Raymond James & Associates, Inc. ar

In [44]:
sample_partition["parsed_body"] = sample_partition.apply(lambda x: filter_body(x.body,
                                                                               x.stocks,
                                                                               x.author,
                                                                               x.name,
                                                                               x.company_name,
                                                                               x.short_name),
                                                         axis=1,
                                                         meta=pd.Series(dtype="string"))

In [46]:
company_endings

NameError: ignored

In [45]:
sample_partition.head()

TypeError: ignored