# Machine Learning for Finance Homework 3 NLP Part

## 1. Basic NLP

### 1.1 What is NLP?

Natural Language Processing (NLP) broadly refers to the study and development of computer systems that can interpret speech and text as humans naturally speak and type it.

### 1.2 List three positive use cases of NLP in the field of finance

1. Risk assessment: NLP is used to assess credit risk.
2. Portfolio selection and optimization: NLP is used for semi-log-optimal portfolio optimization.
3. Stock behavior prediction.

### 1.3 What is tokenization?

Tokenization is the process of breaking down a piece of text into smaller units called tokens. In the context of Natural Language Processing (NLP), tokenization is a critical step in text pre-processing, which is essential for various downstream NLP tasks such as sentiment analysis, text classification, and machine translation.
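
As a minimal sketch of the idea, the snippet below splits a sentence into word and punctuation tokens with a regular expression. The function name `simple_tokenize` is a hypothetical helper for illustration only; production models such as BERT use subword (WordPiece) tokenizers rather than this kind of rule.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Profits are flat, but growth is strong.")
print(tokens)
```

Each element of `tokens` is one unit that a downstream step (for example a sentiment classifier) would consume.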

### 1.4 What is the difference between stemming and lemmatization?

Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing them to a common base form correctly most of the time, and often includes the removal of derivational affixes.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.


Stemming is the process of reducing a word to its base or root form by removing the suffixes. For example, stemming the word "running" would result in "run". This process is usually done by using a set of heuristics, such as removing the "ing" or "ed" suffixes. The goal of stemming is to reduce a word to a common base form so that variations of the same word can be treated as the same word, despite their different forms.

Lemmatization, on the other hand, is the process of reducing a word to its base or root form by taking into account the context and morphological analysis of the word. For example, lemmatizing the word "ran" would result in "run", while lemmatizing the word "were" would result in "be". The goal of lemmatization is to transform words to their base form or lemma, while preserving the context and meaning of the words in the sentence.

In general, lemmatization is more sophisticated than stemming because it takes into account the context and morphological analysis of the word, while stemming is a simpler technique that relies on rules and heuristics to reduce words to their base form. Therefore, lemmatization is often preferred in applications where the accuracy of the word analysis is critical, such as in search engines, machine translation, and sentiment analysis.
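
The contrast can be sketched with two toy functions. Both are illustrative assumptions, not a real stemmer or lemmatizer: `toy_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a tiny hypothetical vocabulary (`TOY_LEMMAS`).

```python
def toy_stem(word):
    """Crude suffix stripping: remove the first matching suffix, no vocabulary lookup."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run") after "ing"/"ed"
            if suffix != "s" and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

# Hypothetical mini-vocabulary mapping inflected forms to their lemma
TOY_LEMMAS = {"ran": "run", "running": "run", "were": "be", "studies": "study"}

def toy_lemmatize(word):
    """Dictionary lookup: return the lemma if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ("running", "ran", "were", "studies"):
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based stemmer handles "running" but leaves irregular forms such as "ran" and "were" untouched and turns "studies" into the non-word "studie", whereas the vocabulary lookup returns a dictionary form in each case.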


### 1.5 What is BERT? What makes it different?

BERT stands for Bidirectional Encoder Representations from Transformers.

What makes BERT different?

* BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.
* It builds on breakthrough research on transformers: models that process words in relation to all the other words in a sentence, rather than one by one in order.
* BERT models consider the full context of a word by looking at the words that come before and after it, which is particularly useful for understanding intended meaning.

BERT is a deep learning model for natural language processing (NLP) introduced by Google in 2018. What makes BERT different is its ability to understand the context of a word by considering the words that come before and after it in a sentence. This is achieved through bidirectional training, where the model is trained to predict a missing word in a sentence by considering both the left and right contexts of that word. BERT has significantly improved the performance of various NLP tasks and has been used in applications such as chatbots, text summarization, and machine translation.
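
To make the "predict the missing word from both sides" idea concrete, here is a small sketch of how one masked training example could be laid out. It does not train or call BERT; `make_masked_example` is a hypothetical helper, and hiding a single hand-picked token is a simplification of the random masking used in actual pre-training.

```python
def make_masked_example(tokens, position, mask_token="[MASK]"):
    """Hide one token and return the masked input, the target and both contexts."""
    masked = tokens[:position] + [mask_token] + tokens[position + 1:]
    return {
        "input": masked,                          # what the model sees
        "target": tokens[position],               # what it has to predict
        "left_context": tokens[:position],        # words before the gap
        "right_context": tokens[position + 1:],   # words after the gap
    }

example = make_masked_example(["the", "bank", "raised", "interest", "rates"], 1)
print(example["input"])
print(example["target"])
```

A left-to-right model would only see `left_context` when predicting the hidden word; a bidirectional model can also use `right_context` ("raised interest rates"), which is what disambiguates "bank" here.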

## 2. Pre-trained model: FinBERT

Your hedge fund manager wakes up every morning at 4am and wants a concise summary of the news relevant to his or her portfolio. He/she wants a quick summary showing the news headlines as well as classified sentiment scores.

Furthermore, he/she asks you to aggregate this information per stock and summarize the overall tone for each stock.

### 2.1 Install the transformers package

In [1]:
%%capture
pip install transformers

### 2.2 Import the requests package and other libraries you might need

In [2]:
import json, requests, transformers
import numpy as np

The dictionary below represents our portfolio of stocks, with (key, value) = (holding name, branch), where $\text{branch} \in \{\text{"business"}, \text{"entertainment"}, \text{"general"}, \text{"health"}, \text{"science"}, \text{"sports"}, \text{"technology"}\}$, following the NewsAPI documentation.

To save you some time, a typical request is built similar to:
"https://newsapi.org/v2/top-headlines?country=us&q=*KEYWORD*&category=*CATEGORY*&sortBy=top&apiKey=*YOURKEYHERE*".

For example: 

https://newsapi.org/v2/top-headlines?country=us&q=apple&category=technology&sortBy=top&apiKey=123ab45c687d...
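
One way to assemble such a URL, sketched here under the assumption that the parameter names are exactly those shown in the template above, is to let `urllib.parse.urlencode` build the query string. `build_request_url` and `BASE_URL` are hypothetical names, and `"YOURKEYHERE"` is a placeholder rather than a working key.

```python
from urllib.parse import urlencode

BASE_URL = "https://newsapi.org/v2/top-headlines"

def build_request_url(keyword, category, api_key):
    """Build the GET URL, percent-encoding every parameter value."""
    params = {"country": "us", "q": keyword, "category": category,
              "sortBy": "top", "apiKey": api_key}
    return BASE_URL + "?" + urlencode(params)

url = build_request_url("s&p500", "business", "YOURKEYHERE")
print(url)
```

Encoding matters for keywords that contain reserved characters: pasted into the URL verbatim, `s&p500` would be read as the query `q=s` followed by a separate, empty parameter `p500`.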





Please see:

1. https://newsapi.org/docs (to construct your HTTP GET requests)
2. https://newsapi.org/register (to obtain your API key)




Say our portfolio consists of these four assets:

In [3]:
portfolio = {"apple": "technology", "tesla": "business", "amazon": "technology", "s&p500": "business"}

# We use the S&P 500 to get a general idea of the sentiment of news in the US market

### 2.3 Write the function fetch_news() that returns a dictionary whose keys are the names of the holdings (analogous to our stock portfolio) and whose values are arrays holding the news strings

In [4]:
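import requests

API_KEY = "YOURKEYHERE"  # replace with your own NewsAPI key

def fetch_news():
    """Return {holding name: list of headline strings} for every holding in `portfolio`."""
    news_dict = {}
    for name, sector in portfolio.items():
        # Pass the query via `params` so that names such as "s&p500" are URL-encoded;
        # concatenating them into the URL would split the query at the "&".
        response = requests.get(
            "https://newsapi.org/v2/top-headlines",
            params={"country": "us", "q": name, "category": sector,
                    "sortBy": "top", "apiKey": API_KEY},
        )
        response.raise_for_status()  # fail loudly on HTTP errors (e.g. an invalid key)
        response_json = response.json()
        news_dict[name] = [article["title"] for article in response_json["articles"]]
    return news_dict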

In [5]:
fetch_news()

{'apple': ['New Apple Leak Reveals iPhone 15 Design Surprise - Forbes',
  'Poll: Which new feature do you want to see in iOS 17 and iPadOS 17? - 9to5Mac',
  'Three Products We Might See at WWDC 2023 - MacRumors',
  "Here's why Apple should still make a dedicated Passwords app (and a workaround) - 9to5Mac",
  'New York Phil to launch big Apple app - Slippedisc - Slipped Disc',
  'Apple MR Headset Delayed Again? Analysts See More Hurdles - Sony Group (NYSE:SONY), Apple (NASDAQ:AAPL) - Benzinga'],
 'tesla': ['Court rules Elon Musk broke federal labor law with 2018 tweet | Engadget - Engadget',
  "Tesla Shares 'Crash Test' Video Of Cybertruck, Provoking Amusing Response From Twitter: 'Just Like The Truck! It Never Arrives' - Yahoo Finance"],
 'amazon': ["The newest iPad Mini and Google's Pixel 6A top our favorite deals of the week - The Verge"],
 's&p500': ['Celebrities Complain on Final Day of Free Twitter Blue Checkmarks - TMZ',
  'If You Had $1000 Right Now, Would You Buy Shiba Inu (SHI

### 2.4 Following the lecture notes, use the pre-trained FinBERT classifier to classify the news fetched in 2.3 as neutral, positive or negative by modifying the code below:

### 2.5 Last but not least, find the total tone for each element in our portfolio, where:


*   neutral=0
*   positive=+1
*   negative=-1
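
Steps 2.4 and 2.5 can be sketched without loading the model. The snippet assumes the class order used in the example code (0 = neutral, 1 = positive, 2 = negative); the logits in `fake_logits` are made-up numbers for illustration, not real FinBERT output, and `tone_from_logits` is a hypothetical helper.

```python
LABELS = {0: "neutral", 1: "positive", 2: "negative"}
TONE = {"neutral": 0, "positive": 1, "negative": -1}

def tone_from_logits(logits):
    """Return the tone score of the class with the largest logit."""
    best_class = max(range(len(logits)), key=lambda i: logits[i])
    return TONE[LABELS[best_class]]

# Hypothetical logits for three headlines (not real model output)
fake_logits = [[0.2, 2.1, -1.0], [1.5, 0.1, 0.3], [-0.4, 0.0, 3.2]]
tones = [tone_from_logits(row) for row in fake_logits]
total_tone = sum(tones)
print(tones, total_tone)
```

With one positive, one neutral and one negative headline the tones cancel out, so the total tone of this hypothetical holding is zero.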

In [None]:
# BUILD YOUR CODE ON TOP OF THIS EXAMPLE CODE IN THE CELL BELOW
from transformers import BertTokenizer, BertForSequenceClassification
import numpy as np

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

sentences = ["there is a shortage of capital, and we need extra financing", 
             "growth is strong and we have plenty of liquidity", 
             "there are doubts about our finances", 
             "profits are flat"]

inputs = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = finbert(**inputs)[0]

labels = {0:'neutral', 1:'positive',2:'negative'}
for idx, sent in enumerate(sentences):
    print(sent, '----', labels[np.argmax(outputs.detach().numpy()[idx])])
  

In [None]:
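# Your code here
from transformers import BertTokenizer, BertForSequenceClassification
import numpy as np

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')

labels = {0: 'neutral', 1: 'positive', 2: 'negative'}
tone_scores = {0: 0, 1: 1, 2: -1}  # neutral=0, positive=+1, negative=-1

def classify_sentiment(sentences):
    """Print the FinBERT label of each sentence and return the list of tone scores."""
    if not sentences:  # nothing to classify for a holding without headlines
        return []
    inputs = tokenizer(sentences, return_tensors="pt", padding=True)
    outputs = finbert(**inputs)[0]
    # The index of the largest logit in each row is the predicted class
    predictions = [int(p) for p in np.argmax(outputs.detach().numpy(), axis=1)]
    for sent, pred in zip(sentences, predictions):
        print(sent, '----', labels[pred])
    return [tone_scores[pred] for pred in predictions]

news = fetch_news()  # fetch once instead of re-querying the API for every holding
for name, headlines in news.items():
    print('\n')
    print('Company: ')
    print(name)
    print('\n')
    scores = classify_sentiment(headlines)
    print('\n')
    print('total tone score: ')
    print(sum(scores))
    print('average tone of the stock')
    # Guard against division by zero when no headlines were returned
    print(sum(scores) / len(scores) if scores else 'n/a (no headlines)')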

### 2.6 What do you find from the results?

We find that the FinBERT model classifies the headlines accurately, although we would need a custom corpus to process other financial documents, such as information released by the Fed. Overall, the score depends on the number of different sentences and on specific words; for example, "violating" is considered negative.

We compute the average score for each stock to have a comparable measure. The average score is the total score divided by the length of the list of sentences; this is useful because it looks at the data cumulatively with every headline unweighted, i.e. equally weighted.
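
The difference between the two measures can be illustrated with a short sketch. The scores in `example_scores` are hypothetical and are not the results of the run above; `summarize_tone` is an illustrative helper that also covers the case of a holding with no headlines.

```python
def summarize_tone(scores):
    """Return (total, average) tone; the average is None when there are no headlines."""
    total = sum(scores)
    average = total / len(scores) if scores else None
    return total, average

# Hypothetical tone scores per holding (not real results)
example_scores = {"apple": [1, 1, 0, 0, 0, 1], "tesla": [1, 1], "amazon": []}
summary = {name: summarize_tone(s) for name, s in example_scores.items()}
for name, (total, average) in summary.items():
    print(name, total, average)
```

In this made-up example "apple" has the larger total (3 versus 2) simply because it has more headlines, while "tesla" has the higher average (1.0 versus 0.5), which is why the average is the more comparable figure across holdings.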