# Stock Sentiment Analysis with News Headlines

More and more hedge funds and independent traders are using data science to process the wealth of information in news headlines in the quest for profit. In this project, I will generate investment insights by applying sentiment analysis to financial news headlines scraped from FINVIZ.com. Respecting data science ethics with regard to web scraping, I have downloaded several HTML files for two large firms: Facebook and Tesla.

By conducting sentiment analysis, we can examine the emotion behind the headlines and gauge whether the market feels good or bad about a stock. Then, we can make educated guesses about how certain stocks will perform and trade accordingly. Below, we import these files:

In [19]:
# import libraries
from bs4 import BeautifulSoup
import os
import nltk
nltk.download('stopwords')

html_tables = {}

# for every data set in os dataset folder
for table_name in os.listdir('datasets'):
    # filepath
    table_path = f'datasets/{table_name}'

    # open as read-only (closed automatically), parse into 'html'
    with open(table_path, 'r') as table_file:
        html = BeautifulSoup(table_file, 'html.parser')

    # add news-table to 'html_tables' dictionary
    html_table = html.find(id = 'news-table')
    html_tables[table_name] = html_table

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




We've now obtained the table that contains all the headlines from each stock's HTML file. Before we move any further, we must investigate the structure of the data table. Let's read a single day of headlines for Tesla.

In [None]:
tsla = html_tables['tsla_22sep.html']

# store all table rows with <tr> tags
tsla_tr = tsla.find_all('tr')

for i, table_row in enumerate(tsla_tr):

    # store the text of the <a> element (the headline) in link_text
    link_text = table_row.a.get_text()
    # store the text of the first <td> element (the timestamp) in data_text
    data_text = table_row.td.get_text()

    # print row number & text variables
    print(f'Row number {i + 1}:')
    print(link_text)
    print(data_text)

    # exits loop after 4 rows
    if i == 3: break

Now, we're ready to parse the data for **all** tables into a list.

In [21]:
# parsed news list
parsed_news = []
# iterate through each file's news table, then through every <tr> row in it
for file_name, news_table in html_tables.items():
    for x in news_table.find_all('tr'):
        # split the first cell's text into a list: [date, time] or just [time]
        date_scrape = x.td.text.split()
        
        # if date_scrape only has 1 element, only load 'time'
        if len(date_scrape) == 1:
            time = date_scrape[0]
        # otherwise, load both 'date' and 'time'
        else:
            date = date_scrape[0]
            time = date_scrape[1]

        # extract stock ticker 
        stock_ticker = file_name.split("_")[0]

        # append all our information: ticker, date, time, headline
        parsed_news.append([stock_ticker, date, time, x.a.text])
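
The date handling above relies on a detail worth spelling out: a row whose first cell holds only a time reuses the `date` left over from the most recent row that had both. The sketch below isolates that carry-forward logic on made-up cell strings. `split_timestamps` and the sample values are hypothetical; the only assumption is the date-then-time layout described in the comments above.

```python
def split_timestamps(cells):
    """Pair every time with its date, carrying the last seen date forward."""
    rows = []
    date = None  # no date seen yet
    for cell in cells:
        parts = cell.split()
        if len(parts) == 1:
            # time only: keep the date from the previous row
            time = parts[0]
        else:
            # date and time: update both
            date, time = parts[0], parts[1]
        rows.append((date, time))
    return rows


# hypothetical first-cell strings, in the order they appear in a table
sample_cells = ['Sep-21-18 09:30PM', '05:12PM', 'Sep-20-18 11:45AM']
parsed = split_timestamps(sample_cells)
print(parsed)
```

Starting `date` at `None` means a time-only first row shows up as a missing date instead of raising a `NameError` or silently reusing a date from a previous file.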

Sentiment analysis is very sensitive to context. For example, "This is so addicting." can be a positive statement when describing something exciting, like a video game, but a negative one when we're talking about drugs. Like most professionals, financial journalists have their own writing style, so to extract sentiment from their headlines, we must make NLTK think like a financial journalist. Let's add some keywords and sentiment values to our program.

In [None]:
# NLTK VADER for sentiment analysis
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
nltk.download('stopwords')

# words and values
keywords = {
    'crushes': 10,
    'beats': 5,
    'misses': -5,
    'trouble': -10,
    'falls': -100,
}
# instantiate sentiment intensity analyzer
vader = SentimentIntensityAnalyzer()
# update the lexicon
vader.lexicon.update(keywords)
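
To get a feel for what these values do to the final score, it helps to look at how VADER turns a summed word valence into its `compound` score. To the best of my knowledge, NLTK's implementation normalizes the sum as x / sqrt(x² + α) with α = 15, and the built-in lexicon rates words on a scale from -4 to +4. The sketch below applies that formula to our keyword values in isolation. It ignores VADER's other heuristics (negation, punctuation, capitalization), and `normalize_valence` is a hypothetical stand-in, not a call into NLTK.

```python
import math


def normalize_valence(score, alpha=15):
    """Squash a summed valence into (-1, 1), mirroring the compound formula."""
    return score / math.sqrt(score * score + alpha)


# the same keyword values added to the lexicon above
keywords = {'crushes': 10, 'beats': 5, 'misses': -5, 'trouble': -10, 'falls': -100}

# compound score each keyword would yield if it were the only scored word
compound = {word: normalize_valence(value) for word, value in keywords.items()}
for word, value in compound.items():
    print(f'{word:>8}: {value:+.3f}')
```

Because the curve flattens quickly, a value such as -100 pins a headline's compound score close to -1 almost regardless of the other words in it, which is worth keeping in mind when choosing keyword weights.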