# Benzinga News Processing

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cwd = "/content/drive/MyDrive/NewsTrading/trading_bot"
%cd /content/drive/MyDrive/NewsTrading/trading_bot

/content/drive/MyDrive/NewsTrading/trading_bot


In [3]:
%%capture
!pip install html2text
!pip install datefinder
!pip install -U dask[complete]
!pip install nltk;

In [4]:
import dask.dataframe as dd
import dask
import pandas as pd
from dask.distributed import Client
from src.preprocessing.news_parser import filter_body, time, body_formatter
import re
import plotly.express as px
import nltk
nltk.download('punkt')
import yfinance as yf

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [5]:
# client = Client(memory_limit='25GB', processes=False,
#                 n_workers=2, threads_per_worker=1)
# client

## HTML Parsing
First we need to convert the HTML documents to plain text; otherwise the text cells are too large and cause problems with PyArrow/Dask.

In [6]:
dask.config.set(scheduler="threads")

<dask.config.set at 0x7ae1f2680e20>

In [None]:
input_dir = "data/raw_bzg/"
output_dir = 'data/unraw1_bzg/'

In [None]:
# for year in range(2019, 2020):
#     print(year)
#     df = pd.read_parquet(f"{input_dir}story_df_raw_{year}.parquet")
#     df = dd.from_pandas(df, npartitions=12)
#     df["html_body"] = df["html_body"].apply(body_formatter, meta=pd.Series(dtype="str"))
#     df = df.rename(columns={"html_body":"body"})
#     name_function = lambda x: f"data-{year}-{x}.parquet"
#     df.to_parquet(output_dir, name_function=name_function)
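
The `body_formatter` helper imported from `src.preprocessing.news_parser` is not shown in this notebook. As a rough illustration of what such an HTML-to-text step does, here is a minimal sketch that uses only the standard library. The names `TextExtractor` and `html_to_text` are hypothetical, and the real helper (which presumably builds on `html2text`, installed above) may behave differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped block
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace into single spaces
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
plain = html_to_text(sample)
print(plain)
```

The plain-text result is much shorter than the raw markup, which is the property that matters for keeping the Parquet text cells small.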

## Repartitioning
So that all partitions are roughly the same size.

In [None]:
input_dir = 'data/unraw1_bzg/'
output_dir = 'data/unraw2_bzg/'

# ddf = dd.read_parquet(input_dir+"*.parquet")
# ddf2 = ddf.repartition(npartitions=50)
# name_function = lambda x: f"data-{x}.parquet"
# ddf2.to_parquet(output_dir, name_function=name_function)

## Author Inference

Clean the data a bit...

In [8]:
input_dir = cwd+'/data/unraw2_bzg/'
output_dir = cwd+'/data/unraw3_bzg/'

In [None]:
ddf = dd.read_parquet(input_dir+"*.parquet")

In [None]:
# Remove rows for which no stock ticker is recorded
ddf = ddf[ddf.stocks != '']

In [None]:
# Convert `channels`  datatype from string to list
ddf["channels"] = ddf["channels"].apply(eval, meta=pd.Series(dtype='object'))
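
The `channels` values are string representations of Python lists. A minimal sketch of how such strings can be parsed with `ast.literal_eval`, which accepts only literals and raises on anything else. The helper name `parse_channels` and the sample strings are made up for illustration, and falling back to an empty list on bad input is an assumption, not something the notebook does.

```python
import ast


def parse_channels(raw):
    """Parse a stringified list such as "['Earnings', 'News']" into a list."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: treat as "no channels"
        return []
    return value if isinstance(value, list) else []


print(parse_channels("['Earnings', 'Press Releases']"))
print(parse_channels("[]"))
print(parse_channels("__import__('os').getcwd()"))  # rejected, nothing is executed
```

Such a function could be passed to `apply` in place of `eval`, with the same `meta` argument.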

Next, examine the claim that **PRNewswire** and **Business Wire** control the entire market for press releases in the USA. If that is the case, and they do not also publish other, unimportant releases, then we can simply filter the news articles by these authors and save ourselves a lot of work.

In [None]:
dask.config.set(scheduler="processes")
ddf["inferred_author"] = None

def infer_author(body):
  for author in ["PRNewswire", "Globe Newswire", "Business Wire", "ACCESSWIRE"]:
    if re.search(author, body, re.IGNORECASE) is not None:
      return author
  return None

ddf["inferred_author"] = ddf.body.apply(infer_author, meta=pd.Series(dtype="string"))
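
Note that `infer_author` returns the first name in the list that matches anywhere in the body, so a release that mentions several wire services is attributed by list order, not by where the name appears in the text. A small standalone check of this behaviour, with a made-up sample body:

```python
import re

AUTHORS = ["PRNewswire", "Globe Newswire", "Business Wire", "ACCESSWIRE"]


def infer_author(body):
    # Same logic as in the cell above: first listed name found anywhere in the body wins
    for author in AUTHORS:
        if re.search(author, body, re.IGNORECASE) is not None:
            return author
    return None


body = "BOULDER, Colo.--(BUSINESS WIRE)-- As previously reported via PRNewswire, ..."
result = infer_author(body)
print(result)  # PRNewswire, although BUSINESS WIRE appears earlier in the text
print(infer_author("No wire service mentioned"))  # None
```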

In [None]:
# value_counts for authors
author_value_counts = pd.concat([ddf.author.value_counts().head(10), ddf.inferred_author.value_counts().head(10)], axis=1)

In [None]:
author_value_counts

| | author | inferred_author |
|---|---|---|
| Benzinga | 1061214 | NaN |
| PRNewswire | 305720 | 587242.0 |
| Globe Newswire | 293466 | 475171.0 |
| Business Wire | 268561 | 293052.0 |
| Newsfile | 70877 | NaN |
| ACCESSWIRE | 62615 | 81054.0 |
| AB Digital, Inc. | 9936 | NaN |
| WebWire | 6404 | NaN |
| PRWeb | 2617 | NaN |
| News Direct | 2080 | NaN |


In [None]:
author_value_counts.sum().diff()

author                  NaN
inferred_author   -646971.0
dtype: float64

Roughly 650k news items are dropped when only the four main press-release distributors are considered.
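
The number can be reproduced from the table: it is the difference between the summed top-10 `author` counts and the summed `inferred_author` counts, so authors outside the top 10 are not part of the comparison. A quick check with the displayed values:

```python
# Counts copied from the value_counts table above
author_counts = {
    "Benzinga": 1061214,
    "PRNewswire": 305720,
    "Globe Newswire": 293466,
    "Business Wire": 268561,
    "Newsfile": 70877,
    "ACCESSWIRE": 62615,
    "AB Digital, Inc.": 9936,
    "WebWire": 6404,
    "PRWeb": 2617,
    "News Direct": 2080,
}
inferred_counts = {
    "PRNewswire": 587242,
    "Globe Newswire": 475171,
    "Business Wire": 293052,
    "ACCESSWIRE": 81054,
}

author_total = sum(author_counts.values())
inferred_total = sum(inferred_counts.values())
difference = inferred_total - author_total

print(author_total)    # 2083490
print(inferred_total)  # 1436519
print(difference)      # -646971
```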

In [None]:
ddf = ddf[~ddf.inferred_author.isna()]

In [None]:
ddf["inferred_author"] = ddf["inferred_author"].astype("string")

In [None]:
ddf["channels"] = ddf.channels.apply(lambda x: str(x), meta=pd.Series(dtype="string"))

In [10]:
ddf.inferred_author.value_counts().compute()

PRNewswire        587242
Globe Newswire    475171
Business Wire     293052
ACCESSWIRE         81054
Name: inferred_author, dtype: Int64

In [11]:
ddf.inferred_author.value_counts().sum().compute()

1436519

In [None]:
name_function = lambda x: f"data-{x}.parquet"
ddf.to_parquet(output_dir, name_function=name_function)

In [None]:
# Contains 100k rows
earnings_ddf = ddf[ddf.channels.apply(lambda x: "Earnings" in x, meta=pd.Series(dtype=bool))]
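
Because `channels` was converted back to a string in an earlier cell, `"Earnings" in x` is a substring test on the list's text rather than a membership test on its elements. The two can differ, as this small sketch with a made-up channel name shows; whether such channel names actually occur in the data is not checked here.

```python
import ast

channels_str = "['Earnings Beats', 'News']"  # made-up example value

# Substring test on the string representation (what the filter above does)
substring_hit = "Earnings" in channels_str
# Element test on the parsed list
element_hit = "Earnings" in ast.literal_eval(channels_str)

print(substring_hit)  # True
print(element_hit)    # False
```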

In [None]:
# value counts for authors of earnings reports (contrast to value counts of all news articles)
earnings_ddf.inferred_author.value_counts().head(10)

Globe Newswire    44589
PRNewswire        31440
ACCESSWIRE        16434
Name: inferred_author, dtype: Int64

Here we see that there is not a single press release from **Business Wire** that is tagged *Earnings*. Nevertheless, there are relevant *Earnings* reports from Business Wire. I briefly verified this...

## Russell Filtering

How many news items remain if we also filter by the current composition of the Russell 3000?

In [54]:
input_dir = cwd+'/data/unraw3_bzg/'
output_dir = cwd+'/data/unraw4_bzg/'
ddf = dd.read_parquet(input_dir)

In [55]:
# Around 3k tickers at this moment
russell_tickers = pd.read_pickle("data/tickers.pkl")
russell_tickers = russell_tickers.categories

In [56]:
# Get company name by ticker (longName is always equal to shortName in yf...)
# This takes a long time because of the API calls to yf (about 15 min)
def yahoo_get_wrapper(ticker):
  """Return the company's long name from Yahoo Finance, or None if the lookup fails."""
  try:
    return yf.Ticker(ticker).info.get("longName")
  except Exception:
    return None

company_names = pd.Series(russell_tickers).apply(yahoo_get_wrapper)
mapper = pd.concat([company_names, pd.Series(russell_tickers)], axis=1)
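
Since the lookups take around 15 minutes, it can help to persist the results so that a rerun only queries tickers that are still missing. The following is a minimal sketch of that idea, not part of the original pipeline: `lookup_with_cache`, `fake_lookup`, and the JSON cache file are hypothetical, and `fake_lookup` stands in for the Yahoo Finance call so the example runs offline.

```python
import json
import os
import tempfile


def lookup_with_cache(tickers, lookup, cache_path):
    """Resolve tickers via `lookup`, reusing results stored in a JSON file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)

    for ticker in tickers:
        if ticker in cache:
            continue  # already resolved in an earlier run
        try:
            name = lookup(ticker)
        except Exception:
            name = None
        if name is not None:
            # Failed lookups are not cached, so they are retried next time
            cache[ticker] = name

    with open(cache_path, "w") as fh:
        json.dump(cache, fh)
    return {ticker: cache.get(ticker) for ticker in tickers}


# Stand-in for the Yahoo Finance call; records how often it is invoked
calls = []


def fake_lookup(ticker):
    calls.append(ticker)
    known = {"JBLU": "JetBlue Airways Corporation", "HON": "Honeywell International Inc."}
    return known.get(ticker)


cache_file = os.path.join(tempfile.mkdtemp(), "company_names.json")
first = lookup_with_cache(["JBLU", "HON", "XXXX"], fake_lookup, cache_file)
second = lookup_with_cache(["JBLU", "HON", "XXXX"], fake_lookup, cache_file)
print(first)
print(len(calls))  # 4: three lookups in the first run, one retry for XXXX in the second
```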

In [57]:
mapper.columns = ["company_names", "ticker"]
mapper = mapper.dropna()  # drop tickers for which no company name was found

In [58]:
mapper = mapper.set_index("ticker").company_names

In [60]:
filt_ddf = ddf[ddf.stocks.isin(mapper.index.to_list())]

In [61]:
# ddf.shape[0].compute()

In [62]:
# filt_ddf.shape[0].compute()

About 660k news items remain that are relevant to our Russell 3000 stock universe. We could actually have done this filtering earlier... but never mind.

In [63]:
filt_ddf = filt_ddf.set_index("time")

In [64]:
monthly_news_counts = filt_ddf.stocks.resample("MS").count()

In [65]:
# px.line(monthly_news_counts.compute(), title="# of articles per month")

## Parsing


In [66]:
dask.config.set(scheduler="processes")

<dask.config.set at 0x7dcec84c32b0>

We can now actually parse these news items and then categorize them properly.

In [84]:
ddf = filt_ddf
ddf = ddf.drop(columns=["author"]).rename(columns={"inferred_author":"author"})

In [94]:
ddf["company_name"] = ddf.stocks.apply(lambda x: mapper.loc[x], meta=pd.Series(dtype="string"))

In [99]:
name_function = lambda x: f"data-{x}.parquet"
ddf.to_parquet(cwd+'/data/latest/', name_function=name_function)

In [95]:
ddf.head()

| time | stocks | title | channels | body | author | company_name |
|---|---|---|---|---|---|---|
| 2010-01-03 19:05:02-04:00 | JBLU | JetBlue to Waive Change Fees and Fare Differen... | [] | NEW YORK, Jan. 3 /PRNewswire-FirstCall/ -- Jet... | PRNewswire | JetBlue Airways Corporation |
| 2010-01-04 06:01:02-04:00 | ENSG | The Ensign Group Acquires Two Idaho Skilled Nu... | [] | BOISE, Idaho, Jan. 4 /PRNewswire-FirstCall/ --... | PRNewswire | The Ensign Group, Inc. |
| 2010-01-04 06:01:04-04:00 | CCS | Pike Research Launches Utility Innovations Adv... | [] | BOULDER, Colo.--(BUSINESS WIRE)-- Today Pike ... | Business Wire | Century Communities, Inc. |
| 2010-01-04 06:01:04-04:00 | ENSG | The Ensign Group Acquires Utah Skilled Nursing... | [] | SALT LAKE CITY, Jan. 4 /PRNewswire-FirstCall/ ... | PRNewswire | The Ensign Group, Inc. |
| 2010-01-04 07:01:03-04:00 | AMSF | AMERISAFE Completes Redemption of All Outstand... | [] | DERIDDER, La., Jan. 4 /PRNewswire-FirstCall/ -... | PRNewswire | AMERISAFE, Inc. |


In [None]:
# Read with Dask (not pandas) so that get_partition() below is available
ddf = dd.read_parquet(cwd+'/data/latest/')

In [96]:
sample_partition = ddf.get_partition(5)

In [97]:
x = sample_partition.head().iloc[0]
x

stocks                                                        HON
title           Honeywell Breaks Ground On Elder Community Cen...
channels                                       ['Press Releases']
body            ## Honeywell Ibasho House to open next spring ...
author                                                 PRNewswire
company_name                         Honeywell International Inc.
Name: 2012-10-24 12:23:09-04:00, dtype: object

In [110]:
x.body

'## Honeywell Ibasho House to open next spring to support residents affected by\nthe Great East Japan Earthquake and Tsunami\n\nOFUNATO, IWATE, Japan, Oct. 24, 2012 /PRNewswire/ -- Honeywell (NYSE: HON)\nannounced today that it will begin construction on an elder care community\ncenter, the Honeywell Ibasho House, in Ofunato, Iwate, to provide support for\nresidents affected by the Great East Japan Earthquake and Tsunami.\n\nThe Honeywell Humanitarian Relief Fund, a component of Honeywell Hometown\nSolutions, will build the Honeywell Ibasho House to support Ofunato elders,\nhelping them to find their "ibasho," a place where one feels at home.\n\nScheduled to open in Spring 2013, the Honeywell Ibasho House is designed by\nProfessor Suguru Mori with the Laboratory of Architecture and Planning at\nHokkaido University using contemporary and sustainable building methods that\ncan withstand future earthquakes.\n\n"Honeywell is committed to helping Japan rebuild," said Tom Buckmaster,\nPresid

In [109]:
filter_body(x.body, x.stocks, x.author, x.name, x.company_name) # x.name == time

'## Honeywell Ibasho House to open next spring to support residents affected by the Great East Japan Earthquake and Tsunami Honeywell announced today that it will begin construction on an elder care community center, the Honeywell Ibasho House, in Ofunato, Iwate, to provide support for residents affected by the Great East Japan Earthquake and Tsunami. The Honeywell Humanitarian Relief Fund, a component of Honeywell Hometown Solutions, will build the Honeywell Ibasho House to support Ofunato elders, helping them to find their "ibasho," a place where one feels at home. Scheduled to open in Spring 2013, the Honeywell Ibasho House is designed by Professor Suguru Mori with the Laboratory of Architecture and Planning at Hokkaido University using contemporary and sustainable building methods that can withstand future earthquakes. "Honeywell is committed to helping Japan rebuild," said Tom Buckmaster, President of Honeywell Hometown Solutions. "Working in partnership with Mayor Toda of Ofunato

In [30]:
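# `time` is the index here, so it is accessed as x.name; company_name is passed as in the single-row call above
sample_partition["parsed_body"] = sample_partition.apply(lambda x: filter_body(x.body, x.stocks, x.author, x.name, x.company_name), axis=1, meta=pd.Series(dtype="string"))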




In [39]:
sample_partition.body.head()

KeyboardInterrupt: ignored