[View in Colaboratory](https://colab.research.google.com/github/estoilkov/machine-learning-project-walkthrough/blob/master/SentimentAnalysisFinance.ipynb)

# Sentiment Analysis in Finance

A very hyped use of AI in Finance comes from the application of Sentiment Analysis as a factor to predict the future price of a security.

Below I will explain some of the basics and caveats of using it in a financial context.

Sentiment analysis can be defined as:


> "*Sentiment analysis, or opinion mining, is an active area of study in the field of natural language processing that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions via the **computational treatment** of subjectivity in text.*"

A simple introduction can be found in ["How Quant Traders Use Sentiment To Get An Edge On The Market"](https://www.forbes.com/sites/kumesharoomoogan/2015/08/06/how-quant-traders-use-sentiment-to-get-an-edge-on-the-market/#5a38018d4b5d), and typical academic papers explaining the process in full detail are ["Twitter mood predicts the stock market."](https://arxiv.org/pdf/1010.3003.pdf) and ["Stock Prediction Using Twitter Sentiment Analysis"](http://cs229.stanford.edu/proj2011/GoelMittal-StockMarketPredictionUsingTwitterSentimentAnalysis.pdf).

The basic idea is the following:
* convert a pipeline of text sources (news, tweets, posts) into one or many **quantitative** (numerical) values,
* feed the above values into a complex model (classical econometrics or newly developed neural networks) as input to predict the price of a security.
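
As a rough illustration of the second step, the sketch below fits a one-variable least-squares line from a daily sentiment score to the next day's return. The numbers are made up and deliberately noise-free so the arithmetic is easy to check, and `fit_line` is a hypothetical helper; real models use many more inputs and far noisier data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: daily sentiment score in [-1, 1] and next-day return (%).
sentiment = [-0.6, -0.2, 0.0, 0.3, 0.5]
next_day_return = [-1.2, -0.4, 0.0, 0.6, 1.0]

intercept, slope = fit_line(sentiment, next_day_return)
predicted = intercept + slope * 0.4  # predicted return for a new score of 0.4
print(round(slope, 4), round(predicted, 4))
```

The point is only the shape of the pipeline: text becomes a number, and the number becomes one input of a price model.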

If you did open one of the academic papers, you will have drowned in jargon. By now, though, big companies like Bloomberg and newer ones like RavenPack have jumped on the bandwagon and provide Sentiment Analysis indices as utilities: ["How you can get an edge by trading on news sentiment data"](https://www.bloomberg.com/professional/blog/can-get-edge-trading-news-sentiment-data/) and ["Abnormal Media Attention Impacts Stock Returns"](https://www.ravenpack.com/research/abnormal-media-attention-impacts-stock-returns/).

As in the previous posts, I prefer to produce [reproducible research](https://en.wikipedia.org/wiki/Reproducibility#Reproducible_research) in the form of Jupyter notebooks that actually run (instead of PDF papers like the ones linked above), but in this case the complexity is such that I will only illustrate basic concepts.

Let's start by downloading an off-the-shelf Python sentiment analyzer (["VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text"](https://pdfs.semanticscholar.org/a6e4/a2532510369b8f55c68f049ff11a892fefeb.pdf)).




In [0]:

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize


[nltk_data] Downloading package vader_lexicon to /content/nltk_data...




In [0]:
sid = SentimentIntensityAnalyzer()

That's it. The magic of open source collaboration gets you the underlying tools free of charge, although if you are partial to corporate solutions (and like Jeopardy) you can use IBM's [Watson Natural Language Understanding kit](https://www.ibm.com/watson/services/natural-language-understanding/).

If you have this notebook open in Google Colaboratory, you can change the sentences below.

In [0]:
#@title Example of a negative sentence:
sentence = "North Korea threatens to cancel Trump summit" #@param {type:"string"}
ss = sid.polarity_scores(sentence)
print (ss)

{'neg': 0.479, 'neu': 0.521, 'pos': 0.0, 'compound': -0.5574}


In [0]:
#@title Example of a positive sentence:
sentence = "Colombia's ex-fighters taught skills for peace" #@param {type:"string"}
ss = sid.polarity_scores(sentence)
print (ss)

{'neg': 0.0, 'neu': 0.588, 'pos': 0.412, 'compound': 0.5423}


You can see how the tool converted the text into various numerical values.

The last value ('compound') can be used as a measure (from -1 to 1) of how 'positive' or 'negative' the sentence is, which in turn can now be manipulated numerically:


* aggregate them to create an index of positive and negative total sentiment,
* use metadata to identify clusters of activity geographically,
* separate text by likely subject (company) and use sentiment by company.
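
The first and last ideas can be sketched together as a per-company average of compound scores. This is a minimal sketch: `sentiment_index` is a hypothetical helper, the company tags are assumed to come from an earlier step, and apart from the Jio score of -0.1027 the values are made up.

```python
from collections import defaultdict

def sentiment_index(scored_headlines):
    """Average the compound score of all headlines tagged with each company."""
    by_company = defaultdict(list)
    for company, compound in scored_headlines:
        by_company[company].append(compound)
    return {
        company: sum(scores) / len(scores)
        for company, scores in by_company.items()
    }

# (company, compound score) pairs; tagging headlines by company is assumed
# to have happened upstream.
scored = [
    ("Jio", -0.1027),
    ("Jio", -0.5),      # made-up score
    ("FanDuel", 0.0),
    ("FanDuel", 0.4),   # made-up score
]

index = sentiment_index(scored)
print(index)
```

A plain average is the simplest choice; weighting by source reliability or recency would be a natural next step.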

However, how does this off-the-shelf tool work in a financial context? After all, it was developed as "*a simple rule-based model for **general** sentiment analysis*".

Let's try it on two more examples:


In [0]:
#@title Example of a positive business sentence:
sentence = "Paddy Power Betfair confirms it is in talks to buy FanDuel" #@param {type:"string"}
ss = sid.polarity_scores(sentence)
print (ss)

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


In [0]:
#@title Example of a negative business sentence:
sentence = "Court order threatens deal with Jio and sends shares down by a fifth" #@param {type:"string"}
ss = sid.polarity_scores(sentence)
print (ss)

{'neg': 0.176, 'neu': 0.676, 'pos': 0.149, 'compound': -0.1027}


It does not do well in this very specific domain: a business analyst would have expected FanDuel shares to shoot up, while Jio's were already down 20% (a move that does not square with a compound score of only -0.1027 instead of something close to -1).

## Some issues with Sentiment analysis in Finance

It turns out that this off-the-shelf tool is not great for financial results (and that is only after checking 2 simple cases). Some additional problems:


* Sentiment analysis is very 'domain specific' (academic talk for saying that a set of tools only works in a pre-defined context, so you need to find the right one for your application).
* Regime changes can happen: if you keep using 'old' models on current data you will miss opportunities. Think what happens if you use pre-2013 models (before [HODL](https://litecoinalliance.org/hodl-on-for-dear-life-the-history-and-meaning-of-hodl/) entered the cybersphere for cryptocurrencies); in fact, if you look at VADER's [lexicon data](https://www.kaggle.com/nltkdata/vader-lexicon/data) you will not find 'hodl'.
* As the 'Jio' example shows, a piece of news can be very negative *but* its price impact can be muted (that particular piece of news is very negative, but it is also a lagging indicator, as the price had already tanked).
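
The 'hodl' problem can be monitored with a simple coverage check: count how often terms you care about appear in recent text without having any score in the lexicon. This is a minimal sketch; `unscored_term_counts`, the toy lexicon (with made-up valences), the watchlist and the posts are all hypothetical. With NLTK, the analyzer's own `sid.lexicon` dictionary could presumably be passed in instead of the toy one.

```python
import re
from collections import Counter

def unscored_term_counts(texts, watchlist, lexicon):
    """Count occurrences of watchlist terms that the lexicon cannot score."""
    unscored = {term.lower() for term in watchlist if term.lower() not in lexicon}
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in unscored:
                counts[word] += 1
    return counts

# Toy stand-in for a sentiment lexicon (word -> valence); values are made up.
toy_lexicon = {"threatens": -1.5, "peace": 2.5}
watchlist = ["hodl", "peace", "fomo"]
posts = [
    "HODL through the dip, hodl!",
    "Talks bring peace",
    "Pure FOMO today",
]

counts = unscored_term_counts(posts, watchlist, toy_lexicon)
print(counts)
```

A rising count for an unscored term is a hint that the language has moved on and the lexicon needs an update.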

Instead, we can:

### Create our own sentiment tool

The VADER authors:
> "collected intensity ratings on each of our candidate lexical features from ten independent human raters (for a total of 90,000+ ratings)."

Instead of using their "lexical features", we would have to design and implement a system (including a user interface) to collect something close to 90k ratings (the more the merrier) in a business context. I could not find a publicly available corpus of annotated financial news (financial news with a score).
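
Once ratings are collected, turning them into a lexicon is the easy part. The sketch below averages each term's ratings and drops terms on which raters disagree too much or that average out to zero. It is loosely modeled on the filtering described in the VADER paper; the -4 to +4 scale, the 2.5 threshold, the terms and the ratings are all assumptions for illustration, and `build_lexicon` is a hypothetical helper.

```python
from statistics import mean, pstdev

def build_lexicon(ratings, max_std=2.5):
    """Keep terms with a non-zero mean rating and enough rater agreement."""
    lexicon = {}
    for term, scores in ratings.items():
        average = mean(scores)
        if average != 0 and pstdev(scores) < max_std:
            lexicon[term] = average
    return lexicon

# Made-up ratings on an assumed -4 (very negative) to +4 (very positive) scale.
ratings = {
    "downgrade": [-3, -2, -3, -2],
    "takeover": [4, -4, 3, -3],   # raters disagree: good for target, bad for buyer?
    "dividend": [2, 1, 2, 3],
}

lexicon = build_lexicon(ratings)
print(lexicon)
```

Here "takeover" is dropped because its ratings cancel out, which is exactly the kind of term where a financial context matters most.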

Also, if we are training our own model, we could change the numerical measure to reflect the impact on the stock directly.
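
One way to do that is to label each headline with what the price did afterwards rather than with a human rating. This is a minimal sketch: `return_label`, the 1% threshold and the prices are assumptions, and choosing the time window between the two prices is a design decision left open here.

```python
def return_label(price_before, price_after, threshold=0.01):
    """Label a headline by the stock's return around it: 1, -1 or 0."""
    ret = (price_after - price_before) / price_before
    if ret > threshold:
        return 1    # price rose by more than the threshold
    if ret < -threshold:
        return -1   # price fell by more than the threshold
    return 0        # move too small to count

# Hypothetical prices around three headlines.
labels = [
    return_label(100.0, 80.0),    # shares down by a fifth
    return_label(100.0, 100.5),   # barely moved
    return_label(100.0, 103.0),   # up 3%
]
print(labels)
```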

**Connecting to relevant (and timely) news pipeline:** assuming the sentiment tool is ok, we would need to connect it to a set of reliable news and relevant comments (no fake news, or add a fake news analyser).

### Use professional sentiment indicators
I mentioned some sentiment analysis providers above. The good thing is that these platforms handle the whole data connection. Unfortunately, their systems are proprietary (the blackest black box of all):
* we cannot test whether their sentiment tool is finely tuned to our specific financial topics (or only to generic business topics),
* we cannot control how often they update them,
* they are available to competitors, hence the alpha they can provide will diminish over time.




# Entity Recognition

In financial news we would also like to separate text documents that correspond to different companies. We can do so by using additional modules that perform [Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition): "*a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc*"

Applying the general-purpose ER module on:


> "Court order threatens deal with Jio and sends shares down by a fifth"

We get two entities, "Court" and "Jio": if we were monitoring "Jio" we could now separate this message into the 'bad' news bucket and see how its sentiment would affect the price.






In [0]:

import nltk

# word_tokenize and pos_tag need their own resources; missing ones are the
# most likely cause of a LookupError in this cell. Newer NLTK releases may ask
# for differently named resources: the LookupError message names the one to
# download.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
  """Return the named entities in text, merging adjacent entity chunks."""
  chunked = ne_chunk(pos_tag(word_tokenize(text)))
  continuous_chunk = []
  current_chunk = []
  for i in chunked:
    if isinstance(i, Tree):
      print(i)  # show the raw chunk with its entity label
      current_chunk.append(" ".join(token for token, pos in i.leaves()))
    elif current_chunk:
      # a non-entity token closes the entity collected so far
      named_entity = " ".join(current_chunk)
      if named_entity not in continuous_chunk:
        continuous_chunk.append(named_entity)
      current_chunk = []
  # flush an entity that ends the sentence
  if current_chunk:
    named_entity = " ".join(current_chunk)
    if named_entity not in continuous_chunk:
      continuous_chunk.append(named_entity)
  return continuous_chunk

sentence = "Court order threatens deal with Jio and sends shares down by a fifth"
get_continuous_chunks(sentence)

As with sentiment analysis, entity recognition can be fine-tuned to handle a specific financial domain (companies, equities, bonds, sovereigns, etc.), but that needs lots of data and a training system. It can also be used independently:
* as a measure of media impact (aggregation of news mentioning a company, regardless of whether bad or good)
* to extract information and act on it -- for example in the news "Paddy Power Betfair confirms it is in talks to buy FanDuel" ER can be used to identify which company (Paddy Power Betfair) buys another (FanDuel) and act on it (e.g. [Risk arbitrage](https://en.wikipedia.org/wiki/Risk_arbitrage) - "*a hedge fund investment strategy that speculates on the successful completion of mergers and acquisitions.*")


In [0]:
sentence = "Paddy Power Betfair confirms it is in talks to buy FanDuel"
get_continuous_chunks(sentence)

(PERSON Paddy/NNP)
(PERSON Power/NNP Betfair/NNP)
(ORGANIZATION FanDuel/NNP)


['Paddy Power Betfair', 'FanDuel']
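
The media-impact measure mentioned above needs no sentiment score at all: it is enough to count how many articles mention each company. This is a minimal sketch; `mention_counts` is a hypothetical helper, and the entity lists stand in for what an entity recognizer would return for each article.

```python
from collections import Counter

def mention_counts(entities_per_article, watchlist):
    """Count the number of articles that mention each watched company."""
    counts = Counter()
    for entities in entities_per_article:
        for name in set(entities):  # count each article once per company
            if name in watchlist:
                counts[name] += 1
    return counts

# Hypothetical recognizer output, one list of entities per article.
articles = [
    ["Paddy Power Betfair", "FanDuel"],
    ["FanDuel", "FanDuel"],
    ["Court", "Jio"],
]
watchlist = {"Paddy Power Betfair", "FanDuel", "Jio"}

counts = mention_counts(articles, watchlist)
print(counts)
```

Comparing today's count with a company's usual level would flag abnormal media attention.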