*Stanislav Borysov [stabo@dtu.dk], DTU Management*
# Advanced Business Analytics

## Lecture 2 - Text Analytics - Part 2: Daily News for Stock Market Prediction [FOR LECTURE]

*Source: https://www.kaggle.com/aaron7sun/stocknews*

### Data Description

There are two channels of data provided in this dataset:

- News data: Historical news headlines from Reddit WorldNews Channel (/r/worldnews). They are ranked by Reddit users' votes, and only the top 25 headlines are considered for a single date. (Range: 2008-06-08 to 2016-07-01)
- Stock data: Dow Jones Industrial Average (DJIA) is used to "prove the concept". (Range: 2008-08-08 to 2016-07-01)

Three data files in .csv format are provided in `stocknews/`:

- `RedditNews.csv`: two columns; The first column is the "date", and the second column is the "news headlines". All news are ranked from top to bottom based on how hot they are. Hence, there are 25 lines for each date.
- `DJIA_table.csv`: Downloaded directly from Yahoo Finance: check out the web page for more info.
- `Combined_News_DJIA.csv`: Combined dataset (using the other two files) with 27 columns. The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25". The "Label" column can take only two values: "1" when "DJIA Adj Close" value rose or stayed as the same and "0" when "DJIA Adj Close" value decreased.

### Tasks

1. Classification: Predict "Label" from `Combined_News_DJIA.csv` using news headlines.
2. Regression: Predict "DJIA Adj Close" from `DJIA_table.csv` using news headlines.

For evaluation, please use data from 2008-08-08 to 2014-12-31 as a training set, and a test set is then the following two years of data (from 2015-01-02 to 2016-07-01). This is roughly a 80%/20% split.

In [23]:
import matplotlib.pyplot as plt
%matplotlib inline

In [24]:
import pandas as pd
import numpy as np

In [25]:
data_df = pd.read_csv("stocknews/Combined_News_DJIA.csv")

In [26]:
data_df.tail()

Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
1984,2016-06-27,0,Barclays and RBS shares suspended from trading...,Pope says Church should ask forgiveness from g...,Poland 'shocked' by xenophobic abuse of Poles ...,"There will be no second referendum, cabinet ag...","Scotland welcome to join EU, Merkel ally says",Sterling dips below Friday's 31-year low amid ...,No negative news about South African President...,Surge in Hate Crimes in the U.K. Following U.K...,...,German lawyers to probe Erdogan over alleged w...,"Boris Johnson says the UK will continue to ""in...",Richard Branson is calling on the UK governmen...,Turkey 'sorry for downing Russian jet',Edward Snowden lawyer vows new push for pardon...,Brexit opinion poll reveals majority don't wan...,"Conservative MP Leave Campaigner: ""The leave c...","Economists predict UK recession, further weake...","New EU 'superstate plan by France, Germany: Cr...",Pakistani clerics declare transgender marriage...
1985,2016-06-28,1,"2,500 Scientists To Australia: If You Want To ...","The personal details of 112,000 French police ...",S&amp;P cuts United Kingdom sovereign credit r...,Huge helium deposit found in Africa,CEO of the South African state broadcaster qui...,"Brexit cost investors $2 trillion, the worst o...",Hong Kong democracy activists call for return ...,Brexit: Iceland president says UK can join 'tr...,...,"US, Canada and Mexico pledge 50% of power from...",There is increasing evidence that Australia is...,"Richard Branson, the founder of Virgin Group, ...","37,000-yr-old skull from Borneo reveals surpri...",Palestinians stone Western Wall worshipers; po...,Jean-Claude Juncker asks Farage: Why are you h...,"""Romanians for Remainians"" offering a new home...",Brexit: Gibraltar in talks with Scotland to st...,8 Suicide Bombers Strike Lebanon,Mexico's security forces routinely use 'sexual...
1986,2016-06-29,1,Explosion At Airport In Istanbul,Yemeni former president: Terrorism is the offs...,UK must accept freedom of movement to access E...,Devastated: scientists too late to captive bre...,British Labor Party leader Jeremy Corbyn loses...,A Muslim Shop in the UK Was Just Firebombed Wh...,Mexican Authorities Sexually Torture Women in ...,UK shares and pound continue to recover,...,"Escape Tunnel, Dug by Hand, Is Found at Holoca...",The land under Beijing is sinking by as much a...,Car bomb and Anti-Islamic attack on Mosque in ...,Emaciated lions in Taiz Zoo are trapped in blo...,Rupert Murdoch describes Brexit as 'wonderful'...,More than 40 killed in Yemen suicide attacks,Google Found Disastrous Symantec and Norton Vu...,Extremist violence on the rise in Germany: Dom...,BBC News: Labour MPs pass Corbyn no-confidence...,Tiny New Zealand town with 'too many jobs' lau...
1987,2016-06-30,1,Jamaica proposes marijuana dispensers for tour...,Stephen Hawking says pollution and 'stupidity'...,Boris Johnson says he will not run for Tory pa...,Six gay men in Ivory Coast were abused and for...,Switzerland denies citizenship to Muslim immig...,Palestinian terrorist stabs israeli teen girl ...,Puerto Rico will default on $1 billion of debt...,Republic of Ireland fans to be awarded medal f...,...,Googles free wifi at Indian railway stations i...,Mounting evidence suggests 'hobbits' were wipe...,The men who carried out Tuesday's terror attac...,Calls to suspend Saudi Arabia from UN Human Ri...,More Than 100 Nobel Laureates Call Out Greenpe...,British pedophile sentenced to 85 years in US ...,"US permitted 1,200 offshore fracks in Gulf of ...",We will be swimming in ridicule - French beach...,UEFA says no minutes of silence for Istanbul v...,Law Enforcement Sources: Gun Used in Paris Ter...
1988,2016-07-01,1,A 117-year-old woman in Mexico City finally re...,IMF chief backs Athens as permanent Olympic host,"The president of France says if Brexit won, so...",British Man Who Must Give Police 24 Hours' Not...,100+ Nobel laureates urge Greenpeace to stop o...,Brazil: Huge spike in number of police killing...,Austria's highest court annuls presidential el...,"Facebook wins privacy case, can track any Belg...",...,"The United States has placed Myanmar, Uzbekist...",S&amp;P revises European Union credit rating t...,India gets $1 billion loan from World Bank for...,U.S. sailors detained by Iran spoke too much u...,Mass fish kill in Vietnam solved as Taiwan ste...,Philippines president Rodrigo Duterte urges pe...,Spain arrests three Pakistanis accused of prom...,"Venezuela, where anger over food shortages is ...",A Hindu temple worker has been killed by three...,Ozone layer hole seems to be healing - US &amp...


In [27]:
loc = 1984#1983

In [28]:
data_df['Top1'].iloc[loc] = data_df['Top1'].iloc[loc] + "."
data_df.iloc[loc]['Top1']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


'Barclays and RBS shares suspended from trading after tanking more than 8%.'

In [29]:
print(data_df.iloc[loc])

Date                                            2016-06-27
Label                                                    0
Top1     Barclays and RBS shares suspended from trading...
Top2     Pope says Church should ask forgiveness from g...
Top3     Poland 'shocked' by xenophobic abuse of Poles ...
Top4     There will be no second referendum, cabinet ag...
Top5         Scotland welcome to join EU, Merkel ally says
Top6     Sterling dips below Friday's 31-year low amid ...
Top7     No negative news about South African President...
Top8     Surge in Hate Crimes in the U.K. Following U.K...
Top9     Weapons shipped into Jordan by the CIA and Sau...
Top10    Angela Merkel said the U.K. must file exit pap...
Top11    In a birth offering hope to a threatened speci...
Top12    Sky News Journalist Left Speechless As Leave M...
Top13            Giant panda in Macau gives birth to twins
Top14    Get out now: EU leader tells Britain it must i...
Top15    Sea turtle 'beaten and left for dead' on beach.

In [30]:
data_df.iloc[loc]['Top2']

'Pope says Church should ask forgiveness from gays for past treatment'

In [31]:
data_df.iloc[loc]['Top23']

'Economists predict UK recession, further weakening of Pound following Brexit.'

In [32]:
data_df['text'] = data_df[['Top{}'.format(i) for i in range(1, 26)]].apply(lambda vals: '. '.join([x for x in vals if x is not np.nan]), axis=1)

In [33]:
print(data_df.iloc[loc]['text'])
print(data_df.iloc[loc]['Date'])
print(data_df.iloc[loc]['Label'])

Barclays and RBS shares suspended from trading after tanking more than 8%.. Pope says Church should ask forgiveness from gays for past treatment. Poland 'shocked' by xenophobic abuse of Poles in UK. There will be no second referendum, cabinet agrees. Scotland welcome to join EU, Merkel ally says. Sterling dips below Friday's 31-year low amid Brexit uncertainty. No negative news about South African President allowed on state broadcaster.. Surge in Hate Crimes in the U.K. Following U.K.s Brexit Vote. Weapons shipped into Jordan by the CIA and Saudi Arabia intended for Syrian rebels have been systematically stolen by Jordanian intelligence operatives and sold to arms merchants on the black market, according to American and Jordanian officials. Angela Merkel said the U.K. must file exit papers with the European Union before talks can begin. In a birth offering hope to a threatened species, an aquarium in Osaka, Japan, has succeeded in artificially breeding a southern rockhopper penguin for

In [34]:
pip install --user -U nltk

Requirement already up-to-date: nltk in c:\users\bjark\appdata\roaming\python\python37\site-packages (3.5)Note: you may need to restart the kernel to use updated packages.



In [38]:
import nltk

In [39]:
from nltk.tokenize import word_tokenize
from nltk import ngrams

In [37]:
gram1 = word_tokenize(data_df.iloc[loc]['Top2'])
print(gram1)

LookupError: 
**********************************************************************
  Resource [93mpunkt[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt/english.pickle[0m

  Searched in:
    - 'C:\\Users\\bjark/nltk_data'
    - 'C:\\Users\\bjark\\Anaconda3\\envs\\BjarkiLord\\nltk_data'
    - 'C:\\Users\\bjark\\Anaconda3\\envs\\BjarkiLord\\share\\nltk_data'
    - 'C:\\Users\\bjark\\Anaconda3\\envs\\BjarkiLord\\lib\\nltk_data'
    - 'C:\\Users\\bjark\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


In [40]:
xxx = word_tokenize("Giant panda twins born at a zoo in European Union")
print(xxx)

LookupError: 
**********************************************************************
  Resource [93mpunkt[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt/english.pickle[0m

  Searched in:
    - 'C:\\Users\\bjark/nltk_data'
    - 'C:\\Users\\bjark\\Anaconda3\\envs\\BjarkiLord\\nltk_data'
    - 'C:\\Users\\bjark\\Anaconda3\\envs\\BjarkiLord\\share\\nltk_data'
    - 'C:\\Users\\bjark\\Anaconda3\\envs\\BjarkiLord\\lib\\nltk_data'
    - 'C:\\Users\\bjark\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


In [17]:
gram2 = ngrams(gram1, 2)
print([x for x in gram2])

[('Pope', 'says'), ('says', 'Church'), ('Church', 'should'), ('should', 'ask'), ('ask', 'forgiveness'), ('forgiveness', 'from'), ('from', 'gays'), ('gays', 'for'), ('for', 'past'), ('past', 'treatment')]


In [18]:
gram3 = ngrams(gram1, 3)
print([x for x in gram3])

[('Pope', 'says', 'Church'), ('says', 'Church', 'should'), ('Church', 'should', 'ask'), ('should', 'ask', 'forgiveness'), ('ask', 'forgiveness', 'from'), ('forgiveness', 'from', 'gays'), ('from', 'gays', 'for'), ('gays', 'for', 'past'), ('for', 'past', 'treatment')]


In [19]:
def bow_transform(corpora):
    vocabulary = {}
    for d in corpora:
        for w in d:
            if w not in vocabulary:
                vocabulary[w] = len(vocabulary)
    voc_len = len(vocabulary)
    corpora_bow = []
    for d in corpora:
        d_bow = [0 for i in range(voc_len)]
        for w in d:
            d_bow[vocabulary[w]] += 1
        corpora_bow.append(d_bow)
    return vocabulary, corpora_bow

In [20]:
corpora = [
    ['David', 'Cameron', 'to', 'Resign', 'as', 'PM', 'After', 'EU', 'Referendum', '.'],
    ['Prime', 'Minister', 'Resigns', 'After', 'European', 'Union', 'referendum', '.'],
    ['Giant', 'panda', 'twins', 'born', 'at', 'a', 'zoo', 'in', 'European', 'Union', '.'],
    #
    ["Union", "After", "panda", "."]
]
vocabulary, corpora_bow = bow_transform(corpora)
print(vocabulary, corpora_bow)
print(len(vocabulary))

{'David': 0, 'Cameron': 1, 'to': 2, 'Resign': 3, 'as': 4, 'PM': 5, 'After': 6, 'EU': 7, 'Referendum': 8, '.': 9, 'Prime': 10, 'Minister': 11, 'Resigns': 12, 'European': 13, 'Union': 14, 'referendum': 15, 'Giant': 16, 'panda': 17, 'twins': 18, 'born': 19, 'at': 20, 'a': 21, 'zoo': 22, 'in': 23} [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]]
24


In [21]:
v_s = sorted(vocabulary.items(), key=lambda kv: kv[1])
[x[0] for x in v_s]

['David',
 'Cameron',
 'to',
 'Resign',
 'as',
 'PM',
 'After',
 'EU',
 'Referendum',
 '.',
 'Prime',
 'Minister',
 'Resigns',
 'European',
 'Union',
 'referendum',
 'Giant',
 'panda',
 'twins',
 'born',
 'at',
 'a',
 'zoo',
 'in']

In [22]:
corpora_bow

[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]]

In [23]:
[" ".join(d) for d in corpora]

['David Cameron to Resign as PM After EU Referendum .',
 'Prime Minister Resigns After European Union referendum .',
 'Giant panda twins born at a zoo in European Union .',
 'Union After panda .']

In [24]:
from scipy.spatial.distance import cosine

In [25]:
for d in corpora_bow:
    print(np.dot(corpora_bow[-1], d))

2
3
3
4


In [26]:
for d in corpora_bow:
    print(1 - cosine(corpora_bow[-1], d))

0.3162277660168379
0.5303300858899106
0.4522670168666454
1.0


In [27]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer()
corpora_bow_tfidf = tf_transformer.fit_transform(corpora_bow[:-1])
q = tf_transformer.transform([corpora_bow[-1]])

In [28]:
for d in corpora_bow_tfidf:
    print(1 - cosine(q[0].todense(), d.todense()))

0.1960512255643465
0.38562352543990863
0.3948974401446338


## Topic modeling

### Preprocessing

In [29]:
news_df = pd.read_csv("stocknews/RedditNews.csv")

In [30]:
news_df.head()

Unnamed: 0,Date,News
0,2016-07-01,A 117-year-old woman in Mexico City finally re...
1,2016-07-01,IMF chief backs Athens as permanent Olympic host
2,2016-07-01,"The president of France says if Brexit won, so..."
3,2016-07-01,British Man Who Must Give Police 24 Hours' Not...
4,2016-07-01,100+ Nobel laureates urge Greenpeace to stop o...


In [31]:
docs = news_df['News'].values

In [33]:
from stopwords import get_stopwords
import re, string
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize

In [34]:
lemmatizer = WordNetLemmatizer()

In [35]:
def text_processing(text):
    # remove punctuation 
    text = "".join([c for c in text 
                    if c not in string.punctuation])
    # lowercase
    text = "".join([c.lower() for c in text])
    # remove stopwords
    text = " ".join([w for w in text.split() 
                     if w not in get_stopwords("en")])
    # replace numbers with a token
    text = re.sub('[0-9]+', 'number', text).lower()
    # stemming / lematizing (optional)
    text = " ".join([lemmatizer.lemmatize(w) for w in text.split()])
    return text

In [36]:
text_processing_vec = np.vectorize(text_processing)

In [37]:
docs = text_processing_vec(docs)

In [38]:
# tokenize
docs = [word_tokenize(d) for d in docs]

In [39]:
docs[:3]

[['numberyearold',
  'woman',
  'mexico',
  'city',
  'finally',
  'received',
  'birth',
  'certificate',
  'died',
  'hour',
  'later',
  'trinidad',
  'alvarez',
  'lira',
  'waited',
  'year',
  'proof',
  'born',
  'number'],
 ['imf', 'chief', 'back', 'athens', 'permanent', 'olympic', 'host'],
 ['president', 'france', 'say', 'brexit', 'won', 'can', 'donald', 'trump']]

In [40]:
from gensim import corpora

ModuleNotFoundError: No module named 'gensim'

In [None]:
dictionary = corpora.Dictionary(docs)
print(dictionary)

In [None]:
corpus = [dictionary.doc2bow(d) for d in docs]
print(corpus[0])

### LSI

In [None]:
from gensim.models.lsimodel import LsiModel

In [None]:
num_topics_lsi = 25

In [None]:
lsi = LsiModel(corpus=corpus, id2word=dictionary, num_topics=num_topics_lsi, onepass=False, power_iters=5)

In [None]:
lsi[corpus][0]

In [None]:
lsi.print_topics(num_topics_lsi)

### LDA

In [None]:
from gensim.models.ldamodel import LdaModel

In [None]:
num_topics_lda = 25

In [None]:
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics_lda, update_every=1, passes=5, alpha='symmetric')

In [None]:
lda[corpus[0]]

In [None]:
lda.get_document_topics(corpus[0])

In [None]:
lda.print_topics(num_topics_lda)

### Check

In [None]:
news_ids_to_check = [
    2814, # 2016-03-10,Gas pipeline from Israel to Jordan to begin operating in 2017
    2835, # 2016-03-09,China signals interest in denuclearization talks without North Korea
    4034, # 2016-01-21,Chinese president declares support for Palestinian state
    11465, # 2015-03-30,Former Israeli Prime Minister Ehud Olmert found guilty in retrial on corruption charges
    136, # 2016-06-26,The United Nations issued a strongly worded statement on Friday warning that any attempt by the European Union to bypass national parliaments to push through controversial trade deals would violate international human rights norms and standards.
    2242, # 2016-04-03,Bahraini citizens protest for human rights ahead of Formula One race
    13901, # 2014-12-22,"Argentine Court rules: Orang Utans are ""non-human-persons"" with human rights and therefore need to be released from zoo"
    14708, # 2014-11-20,North Korea has threatened to conduct a nuclear test in response to a United Nations move towards a probe into the country's human rights violations
    1732, # 2016-04-23,North Korea seen to fire submarine-launched ballistic missile: South Korea
]

In [None]:
news_df['News'].values[14708]

In [None]:
for news_id in news_ids_to_check:
    print(80*'-')
    print(news_df['News'].values[news_id])
    print(docs[news_id])
    print(sorted(lsi[corpus[news_id]], key=lambda x: abs(x[1]), reverse=True))
    print(sorted(lda[corpus[news_id]], key=lambda x: abs(x[1]), reverse=True))

### Viz

In [None]:
len(corpus)

In [None]:
len(dictionary)

In [None]:
import wordcloud

In [None]:
class SimpleGroupedColorFunc(object):
    """Create a color function object which assigns EXACT colors
       to certain words based on the color to words mapping

       Parameters
       ----------
       color_to_words : dict(str -> list(str))
         A dictionary that maps a color to the list of words.

       default_color : str
         Color that will be assigned to a word that's not a member
         of any value from color_to_words.
    """

    def __init__(self, color_to_words, default_color):
        self.word_to_color = {word: color
                              for (color, words) in color_to_words.items()
                              for word in words}

        self.default_color = default_color

    def __call__(self, word, **kwargs):
        return self.word_to_color.get(word, self.default_color)

In [None]:
def viz_topic(topic, title, color_func):
    fig_wordcloud = wordcloud.WordCloud(
        max_font_size=100, max_words=50, background_color="white", prefer_horizontal=1, relative_scaling=0.8,
    ).fit_words(topic)
    if color_func is not None:
        fig_wordcloud.recolor(color_func=color_func)
    plt.figure(figsize=(10,7), frameon=True)
    plt.imshow(fig_wordcloud, interpolation="bilinear")  
    plt.axis('off')
    plt.title(title, fontsize=20)
    plt.show()

**LSI**

In [None]:
lsi_negative = {}
for i in range(num_topics_lsi):
    topic = lsi.show_topic(i, topn=1000)
    lsi_negative_topic = []
    for w, f in topic:
        if f < 0:
            lsi_negative_topic.append(w)
    lsi_negative[i] = lsi_negative_topic

In [None]:
default_color = '#00AA00'
for i in range(num_topics_lsi):
    color_to_words = {
        # will be colored with a red single color function
        'red': lsi_negative[i]
    }
    grouped_color_func = SimpleGroupedColorFunc(color_to_words, default_color)
    viz_topic(dict(map(lambda x: (x[0], abs(x[1])), lsi.show_topic(i, topn=100))), "Topic #{}".format(i), grouped_color_func)

**LDA**

In [None]:
for i in range(num_topics_lda):
    viz_topic(dict(lda.show_topic(i, topn=100)), "Topic #{}".format(i), None)

## Prediction

### Prerocessing

### ML