<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Collect-S&amp;P-500-Companies" data-toc-modified-id="Collect-S&amp;P-500-Companies-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Collect S&amp;P 500 Companies</a></span></li><li><span><a href="#Example-code" data-toc-modified-id="Example-code-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Example code</a></span></li><li><span><a href="#Stock-Prices" data-toc-modified-id="Stock-Prices-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Stock Prices</a></span></li><li><span><a href="#Calculate-Correlation" data-toc-modified-id="Calculate-Correlation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Calculate Correlation</a></span></li></ul></div>

In [1]:
# Import libraries 
import pandas as pd
import os
import time

from datetime import datetime
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
from urllib.parse import urlparse

%matplotlib inline

# Collect S&P 500 Companies

In [2]:
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
sandp_df = table[0]

#sandp_df.to_csv('data/S&P500-Info.csv')
#sandp_df.to_csv("data/S&P500-Symbols.csv", columns=['Symbol'])

#https://medium.com/wealthy-bytes/5-lines-of-python-to-automate-getting-the-s-p-500-95a632e5e567

In [3]:
sandp_df.head(5)
# so the symbol is the same as the corresponding stock ticker. 
# It will be used for parsing news results that reference the company that made the headlines.

Unnamed: 0,Symbol,Security,SEC filings,GICS Sector,GICS Sub-Industry,Headquarters Location,Date first added,CIK,Founded
0,MMM,3M Company,reports,Industrials,Industrial Conglomerates,"St. Paul, Minnesota",1976-08-09,66740,1902
1,ABT,Abbott Laboratories,reports,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,1800,1888
2,ABBV,AbbVie Inc.,reports,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
3,ABMD,Abiomed,reports,Health Care,Health Care Equipment,"Danvers, Massachusetts",2018-05-31,815094,1981
4,ACN,Accenture,reports,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989


# Example code

Taken from https://towardsdatascience.com/sentiment-analysis-of-stocks-from-financial-news-using-python-82ebdcefb638

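# function to read news table from finviz
def finviz_news_table(ticker):
    start_time = time.perf_counter()
    
    try:
        url = finviz_url + ticker
        req = Request(url=url, headers={'user-agent': 'my-app/0.0.1'})
        response = urlopen(req)
        # name the parser explicitly so the result does not depend on which parsers are installed
        html = BeautifulSoup(response, 'html.parser')
        news_table = html.find(id='news-table')
    except Exception:
        # any request or parsing failure leaves the table empty for this ticker
        news_table = None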
        
    end_time = time.perf_counter()
    
    return [ticker, news_table, end_time - start_time]

In [4]:
# function to read news table from finviz (use for process pool executor)
def finviz_news_table_process(ticker):
    start_time = time.perf_counter()
    
    pid = os.getpid()
    
    try:
        url = finviz_url + ticker
        req = Request(url=url, headers={'user-agent': 'my-app/0.0.1'})
        response = urlopen(req)
        html = BeautifulSoup(response, 'html.parser')
        news_table = str(html.find(id='news-table'))
    except Exception:
        news_table = None
        
    end_time = time.perf_counter()
    
    # Return [ticker, str_news_table, run_time, pid]
    return [ticker, news_table, end_time - start_time, pid]

# Thread Pool Executor: read html from finviz for each ticker and save the news_table of each as a dataframe
# runs slower than process pool executor
from concurrent.futures import ThreadPoolExecutor
import time


start_test1 = time.perf_counter()


if __name__ == '__main__':
    finviz_url = 'https://finviz.com/quote.ashx?t='
    ticker_list = sandp_df['Symbol']
    # initiate executor
    executor = ThreadPoolExecutor()
    # apply executor to map finviz_news_table on ticker list 
    p = executor.map(finviz_news_table, ticker_list)
    # save news tables as a dataframe (includes run time for each request)
    news_table_df_test1 = pd.DataFrame([[ticker, news_table, run_time] for ticker, news_table, run_time in p], columns=['ticker', 'html_news_table', 'run_time'])
    print(news_table_df_test1.head(10))

end_test1 = time.perf_counter()

print('Thread Pool Executor finished in: ', end_test1 - start_test1, ' seconds')

In [5]:
# Process Pool Executor: read html from finviz for each ticker and save the news_table of each as a dataframe

from loky import get_reusable_executor
import time


start_test1 = time.perf_counter()


if __name__ == '__main__':
    finviz_url = 'https://finviz.com/quote.ashx?t='
    ticker_list = sandp_df['Symbol']
    # initiate executor
    executor = get_reusable_executor(max_workers=10, timeout=2)
    # apply executor to map finviz_news_table on ticker list 
    process_1 = executor.map(finviz_news_table_process, ticker_list)
    # save news tables as a dataframe (includes run time for each request)
    news_table_df = pd.DataFrame([[ticker, pid, run_time, str_news_table] for ticker, str_news_table, run_time, pid in process_1], columns=['ticker', 'pid', 'run_time', 'str_news_table'])
    print(news_table_df.head(10))
    

end_test1 = time.perf_counter()

print('Process Pool Executor finished in: ', end_test1 - start_test1, ' seconds')

  ticker    pid   run_time                                     str_news_table
0    MMM   2464   5.640783  <table border="0" cellpadding="1" cellspacing=...
1    ABT  16096   5.560090  <table border="0" cellpadding="1" cellspacing=...
2   ABBV    548   5.181580  <table border="0" cellpadding="1" cellspacing=...
3   ABMD  12368   7.059482  <table border="0" cellpadding="1" cellspacing=...
4    ACN  11536   5.758770  <table border="0" cellpadding="1" cellspacing=...
5   ATVI  16396   9.440227  <table border="0" cellpadding="1" cellspacing=...
6   ADBE  12452  11.734430  <table border="0" cellpadding="1" cellspacing=...
7    AMD   1632  11.684490  <table border="0" cellpadding="1" cellspacing=...
8    AAP    548   7.100792  <table border="0" cellpadding="1" cellspacing=...
9    AES  16096   5.570064  <table border="0" cellpadding="1" cellspacing=...
Process Pool Executor finished in:  280.5679949  seconds


Plan for next section

Single layer multiprocessing with thread (parallel inside loop outside)  # in progress

Single layer multiprocessing with process (parallel inside loop outside)

Single layer multiprocessing with thread (parallel outside loop inside)

Single layer multiprocessing with process (parallel outside loop inside)

Double layer multiprocessing with thread (both)

Double layer multiprocessing with process

Double layer multiprocessing (thread inside, process outside)

Double layer multiprocessing (thread outside, process inside)
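The two single-layer arrangements named in the plan can be sketched as follows. This is a minimal illustration rather than the notebook's scraper: `fetch_article` is a hypothetical stand-in for the per-article work, and the ticker and article ids are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-article work (request + parse).
def fetch_article(article):
    return article.upper()

# Made-up input: ticker -> list of article ids.
articles_by_ticker = {'MMM': ['a1', 'a2'], 'ABT': ['b1'], 'ABBV': ['c1', 'c2', 'c3']}

def parallel_inside_loop_outside(data):
    # Tickers are visited one at a time; each ticker's articles run in parallel.
    results = {}
    for ticker, articles in data.items():
        with ThreadPoolExecutor() as executor:
            results[ticker] = list(executor.map(fetch_article, articles))
    return results

def process_ticker(item):
    # One worker handles a whole ticker and loops over its articles sequentially.
    ticker, articles = item
    return ticker, [fetch_article(a) for a in articles]

def parallel_outside_loop_inside(data):
    # Tickers run in parallel; the article loop lives inside each worker.
    with ThreadPoolExecutor() as executor:
        return dict(executor.map(process_ticker, data.items()))

inside = parallel_inside_loop_outside(articles_by_ticker)
outside = parallel_outside_loop_inside(articles_by_ticker)
```

Both arrangements return the same data; they differ only in which level of the nesting is handed to the pool. The double-layer variants would put a pool at both levels.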

### Single layer with thread (parallel inside loop outside)
Note: In synchronous execution, soupifying the response takes up the overwhelming majority of the total processing time (93.8%). The second-largest share is waiting for replies, at 6.0%.

Based on actual run time, threading appears to cut the soupifying time substantially (by 90+%). In theory it should reduce the waiting time as well.

Test run time = 191.48199870000008 seconds
Estimated full run time = 322.0 minutes 19.68 seconds
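The estimate above is a linear extrapolation from the 5-ticker test run. A short sketch of the arithmetic, assuming the full table holds 505 tickers (the value of `len(news_table_df)` is not printed in the notebook, so this count is inferred):

```python
# Figure taken from the test run above.
test_run_seconds = 191.48199870000008   # wall-clock time for 5 tickers
test_tickers = 5
total_tickers = 505                     # assumption: len(news_table_df)

estimated_seconds = test_run_seconds * total_tickers / test_tickers
minutes, seconds = divmod(estimated_seconds, 60)
print(f'{minutes:.0f} minutes {seconds:.2f} seconds')
# prints: 322 minutes 19.68 seconds
```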

In [53]:
# define function to generate [date, time, headline, news_source, content, article_site, run_time] 
# input is a single article
def article_details(str_article):
    
    art_det_start = time.perf_counter()
    
    
    # convert str to html
    html_article = BeautifulSoup(str_article, 'html.parser')
    
    # Produce headlines
    headline = html_article.a.get_text() 
    
    # Produce news source company
    news = html_article.span.get_text()
    
    # Produce Date and Time
    # split text in the td tag into a list 
    date_scrape = html_article.td.text.split()
    # ensure most recent date is used
    global date
    # if the length of 'date_scrape' is 1, load 'time' as the only element
    if len(date_scrape) == 1:
        time_ = date_scrape[0]
    # else load 'date' as the 1st element and 'time' as the second    
    else:
        date = date_scrape[0]
        time_ = date_scrape[1]
        
    req_start = 0
    req_end = 0
    wait_end = 0
    soup_end = 0
    cont_end = 0
    
    # Produce news content
    # get link to the full article
    link = html_article.find('a').get('href')
    content = 'empty string'
    url_root = urlparse(link).netloc
    # check if link leads to yahoo.finance
    if url_root == 'finance.yahoo.com':
        try:
            # request from yahoo.finance
            req_start = time.perf_counter()
            req_art = Request(url=link, headers={'user-agent':'my-app/0.0.1'})
            req_end = time.perf_counter()
            
            response_art = urlopen(req_art)
            wait_end = time.perf_counter()
            
            html_art = BeautifulSoup(response_art, 'html.parser')
            soup_end = time.perf_counter()
            # get the article content
            content = str(html_art.find(class_='caas-body').get_text())
            cont_end = time.perf_counter()
        except Exception:
            print('Error following article link: ', link)
    
    
    art_det_end = time.perf_counter()
    
    
    # Return [date, time, headline, news_source, content, article_site, run_time, req_time, wait_time, soup_time, cont_time] 
    return [date, time_, headline, news, content, url_root, art_det_end - art_det_start, req_end - req_start, wait_end - req_end, soup_end - wait_end, cont_end - soup_end]
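`article_details` relies on a `global date` to carry the most recent date over to rows that only list a time. That works when rows are processed in order, but under a thread pool the order in which workers update the global is not guaranteed. One possible way around this, sketched here under the assumption that the first cell of every row has already been split into a list (`fill_dates` is a hypothetical helper, not part of the notebook), is to resolve the dates sequentially before handing the rows to the executor:

```python
def fill_dates(cells):
    """Carry the most recent date forward over rows that only list a time.

    `cells` is a list of the whitespace-split text of each row's first <td>,
    e.g. ['Feb-14-21', '09:27AM'] or just ['07:45AM'].
    """
    filled = []
    current_date = None
    for parts in cells:
        if len(parts) == 1:
            # time only: reuse the date from the closest earlier row
            time_ = parts[0]
        else:
            current_date, time_ = parts[0], parts[1]
        filled.append((current_date, time_))
    return filled

rows = [['Feb-14-21', '09:27AM'], ['07:45AM'], ['Feb-12-21', '10:10AM'], ['06:00AM']]
stamps = fill_dates(rows)
```

Each `(date, time)` pair could then be passed to the worker together with the article, so no shared state is needed inside the threads.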



'''
# define function to generate article content (in a list) given a row in news_table_df
# only works with articles from yahoo.finance right now
def article_content(html_article):
    # get the link to the full article
    link = html_article.find('a').get('href')
    content = 'empty string'
    # check if link leads to yahoo.finance
    if urlparse(link).netloc == 'finance.yahoo.com':
        try:
            # request from yahoo.finance
            req_art = Request(url=link, headers={'user-agent':'my-app/0.0.1'})
            response_art = urlopen(req_art)
            html_art = BeautifulSoup(response_art)
            # get the article content
            content = str(html_art.find(class_='caas-body').get_text())
        except:
            print('Error following article link: ', link)
    else:
        return [content, urlparse(link).netloc]
    return [content]
'''
    

In [13]:
# record time to parse through each ticker and generate necessary details
#time_per_ticker_1 = []

In [51]:
# Define function to prepare dataframe with [ticker, date, time, headline, news, content, article_site, run_time, req_time, wait_time, soup_time, cont_time]
# Thread Pool Processor

# input is a row [ticker, pid, run_time, str_news_table] from news_table_df
# return completed dataframe for 1 ticker (to be appended)
# also records time to process 1 ticker and saves in list 'time_per_ticker'

def ticker_to_dataframe_thr(row):
    
    ticker_time_start = time.perf_counter()
    
    
    # convert str_news_table to html format
    html_news_table = BeautifulSoup(row[3], 'html.parser')
    # split into list of articles in html format
    article_list = html_news_table.findAll('tr')
    # convert all html to str
    article_list = [str(x) for x in article_list]
    
    # executor (threads only, so no __main__ guard is needed; the with-block shuts the pool down)
    with ThreadPoolExecutor() as executor:
        thread_2 = executor.map(article_details, article_list)
        ticker_df = [[date, time_, news_source, headline, content, site, run_time, req_time, wait_time, soup_time, cont_time] for date, time_, headline, news_source, content, site, run_time, req_time, wait_time, soup_time, cont_time in thread_2]
    ticker_df = pd.DataFrame(ticker_df, columns=['date', 'time', 'news', 'headline', 'content', 'article_site', 'run_time', 'req_time', 'wait_time', 'soup_time', 'cont_time'])
        
        
    ticker = row[0]
    ticker_col = pd.Series([ticker] * len(ticker_df))
    
    ticker_df.insert(0, 'ticker', ticker_col)
        
    
    ticker_time_end = time.perf_counter()
    time_per_ticker_1.append(ticker_time_end - ticker_time_start)

    return ticker_df

In [54]:
# test run 

from concurrent.futures import ThreadPoolExecutor
import time

# define test df
test_news_table_df = news_table_df.copy()
test_news_table_df = test_news_table_df.iloc[0:5]
print('Testing input')
print(test_news_table_df)

time_per_ticker_1 = []

compile_start = time.perf_counter()

test_art_det_df = ticker_to_dataframe_thr(test_news_table_df.iloc[0])
print('Ticker complete: MMM')


for i in range(1, len(test_news_table_df)):
    test_ticker_df = ticker_to_dataframe_thr(test_news_table_df.iloc[i])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    test_art_det_df = pd.concat([test_art_det_df, test_ticker_df])
    print('Ticker complete: ', test_ticker_df.iloc[0,0])

compile_end = time.perf_counter()
    
print(test_art_det_df.head(10))
print('Time taken to compile 5 tickers is: ', compile_end - compile_start, ' seconds')
minutes = ((compile_end - compile_start) * len(news_table_df)/5 ) // 60 
seconds = ((compile_end - compile_start) * len(news_table_df)/5 ) % 60
print('Estimated time to compile all tickers is: ', minutes, 'minutes', seconds, 'seconds')

Testing input
  ticker    pid  run_time                                     str_news_table
0    MMM   2464  5.640783  <table border="0" cellpadding="1" cellspacing=...
1    ABT  16096  5.560090  <table border="0" cellpadding="1" cellspacing=...
2   ABBV    548  5.181580  <table border="0" cellpadding="1" cellspacing=...
3   ABMD  12368  7.059482  <table border="0" cellpadding="1" cellspacing=...
4    ACN  11536  5.758770  <table border="0" cellpadding="1" cellspacing=...
Ticker complete: MMM
Ticker complete:  ABT
Ticker complete:  ABBV
Ticker complete:  ABMD
Ticker complete:  ACN
  ticker       date     time             news  \
0    MMM  Feb-14-21  09:27AM      Motley Fool   
1    MMM  Jan-26-21  05:12PM    GuruFocus.com   
2    MMM  Feb-12-21  10:10AM      Motley Fool   
3    MMM  Feb-12-21  07:45AM      Motley Fool   
4    MMM  Feb-10-21  06:00AM      Barrons.com   
5    MMM  Oct-29-20  02:35PM      PR Newswire   
6    MMM  Jan-26-21  11:07AM      PR Newswire   
7    MMM  Oct-29-20  

In [57]:
# synchronous run times

art_det_total_time = sum(test_art_det_df['run_time'])
gen_req_total_time = sum(test_art_det_df['req_time'])
wait_total_time = sum(test_art_det_df['wait_time'])
soup_total_time = sum(test_art_det_df['soup_time'])
cont_total_time = sum(test_art_det_df['cont_time'])

time_series = [gen_req_total_time, wait_total_time, soup_total_time, cont_total_time]
frac_time_series = [x / art_det_total_time for x in time_series]


print('Total art det time is: ', art_det_total_time)
print('Total time series')
print(time_series)
print('Fractional time series')
print(frac_time_series)

Total art det time is:  3427.004503999985
Total time series
[0.011493199994220049, 207.07549440000457, 3215.8481493000018, 3.728625999999167]
Fractional time series
[3.353716045836892e-06, 0.0604246344462947, 0.9383845704160814, 0.0010880131600781763]


In [60]:
# estimated time saved from soupifying
art_det_total_time - 191.48 - 207

3028.524503999985
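The subtraction above treats the summed waiting time as not saved and attributes the rest of the difference to soupifying. A short sketch of how that figure relates to the "90+%" claim, using only numbers copied from the outputs in this section:

```python
# Figures copied from the outputs above (seconds).
art_det_total_time = 3427.004503999985   # summed per-article run time
soup_total_time = 3215.8481493000018     # summed soupifying time
threaded_wall_time = 191.48              # threaded test run
wait_total_time = 207                    # summed waiting time, rounded

saved = art_det_total_time - threaded_wall_time - wait_total_time
saved_fraction = saved / soup_total_time
print(round(saved, 2), round(saved_fraction, 3))
```

Under these assumptions roughly 94% of the summed soupifying time disappears in the threaded run. Note that the totals are sums over articles while 191.48 s is wall-clock time, so the comparison is indicative rather than exact.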

### Single layer with process (parallel inside loop outside)

In [None]:
# Define function to prepare data with ticker, date, time, headline, news, content
# Process Pool Processor

# input is a row [ticker, pid, run_time, str_news_table]
# return completed dataframe for 1 ticker (to be appended)
# also records time to process 1 ticker and saves in list 'time_per_ticker'

def ticker_to_dataframe_pro(row):
    
    ticker_time_start = time.perf_counter()
    
    
    ticker_pid = os.getpid()
    
    # convert str_news_table to html format
    html_news_table = BeautifulSoup(row[3], 'html.parser')
    
    # section of code to be looped
    if __name__ == '__main__':
        executor = None  # placeholder: the executor has not been chosen yet (in progress)
        
        
        
        
    for file_name, news_str in html_dict.items():
        # soupify str
        news_html = BeautifulSoup(news_str, 'html.parser')
        # iterate through all tr tags in 'news_table'
        for x in news_html.findAll('tr'):
            # article content and rejected site (if applicable) (use to expand available sites in the future)
            art_content = article_content(x)
            ticker = file_name.split('_')[0]
            # combine article summary and article content into 1 row of info about the article
            art_row = [ticker] + article_summary(x) + [art_content[0]]
            # append relevant info to the 'parsed_news' list
            parsed_news.append(art_row)
            if len(art_content) > 1:
                rejected_sites.append(art_content[1])
        print('Done parsing through: ', ticker)
        
    
    ticker_time_end = time.perf_counter()
    time_per_ticker.append(ticker_time_end - ticker_time_start)

    
    return

In [None]:
# Process Pool Executor

from loky import get_reusable_executor
import time



start_test1 = time.perf_counter()



if __name__ == '__main__':
    test_news_tables = news_tables.copy()
    test_news_tables = dict(zip(['MMM', 'ABT', 'ABBV'], [str(test_news_tables[x]) for x in ['MMM', 'ABT', 'ABBV']]))
    executor = get_reusable_executor(max_workers=4, timeout=2)
    p = executor.submit(prepare_data, test_news_tables)
    print('Processor ID: ', p.result())
        
        
columns = ['ticker', 'date', 'time', 'headline', 'news', 'content']

# Convert the parsed_news list into a DataFrame called 'parsed_news_updated'
parsed_news_updated = pd.DataFrame(parsed_news, columns=columns)
ori_len = len(parsed_news_updated)
# remove articles whose contents are not available
parsed_news_updated_cont = parsed_news_updated.copy()
parsed_news_updated_cont = parsed_news_updated_cont.loc[parsed_news_updated_cont['content'] != 'empty string']

parsed_news_updated = parsed_news_updated.drop('content', axis=1)

print('Percentage decrease in articles is: ', (ori_len - len(parsed_news_updated_cont))/ori_len)


print(parsed_news_updated_cont.head(10))
print(rejected_sites[:20])

end_test1 = time.perf_counter()

print('Process Pool Executor finished in: ', end_test1 - start_test1, ' seconds')

In [None]:
# Thread Pool Executor

from concurrent.futures import ThreadPoolExecutor
import time



start_test1 = time.perf_counter()



if __name__ == '__main__':
    test_news_tables = news_tables.copy()
    test_news_tables = dict(zip(['MMM', 'ABT', 'ABBV'], [str(test_news_tables[x]) for x in ['MMM', 'ABT', 'ABBV']]))
    executor = ThreadPoolExecutor()
    p = executor.submit(prepare_data, test_news_tables)
    print(p.result())
        
        
columns = ['ticker', 'date', 'time', 'headline', 'news', 'content']

# Convert the parsed_news list into a DataFrame called 'parsed_news_updated'
parsed_news_updated = pd.DataFrame(parsed_news, columns=columns)
ori_len = len(parsed_news_updated)
# remove articles whose contents are not available
parsed_news_updated_cont = parsed_news_updated.copy()
parsed_news_updated_cont = parsed_news_updated_cont.loc[parsed_news_updated_cont['content'] != 'empty string']

parsed_news_updated = parsed_news_updated.drop('content', axis=1)

print('Percentage decrease in articles is: ', (ori_len - len(parsed_news_updated_cont))/ori_len)


print(parsed_news_updated_cont.head(10))
print(rejected_sites[:20])

end_test1 = time.perf_counter()

print('Thread Pool Executor finished in: ', end_test1 - start_test1, ' seconds')

In [None]:
import time


start_test2 = time.perf_counter()


test_news_tables = news_tables.copy()
test_news_tables = dict(zip(['MMM', 'ABT', 'ABBV'], [str(test_news_tables[x]) for x in ['MMM', 'ABT', 'ABBV']]))
print(prepare_data(test_news_tables))

# Set column names
columns = ['ticker', 'date', 'time', 'headline', 'news', 'content']

# Convert the parsed_news list into a DataFrame called 'parsed_news_updated'
parsed_news_updated = pd.DataFrame(parsed_news, columns=columns)
ori_len = len(parsed_news_updated)
# remove articles whose contents are not available
parsed_news_updated_content = parsed_news_updated.copy()
parsed_news_updated_content = parsed_news_updated_content.loc[parsed_news_updated['content'] != 'empty string']
parsed_news_updated = parsed_news_updated.drop('content', axis=1)
print('Percentage decrease in articles is: ', (ori_len - len(parsed_news_updated_content))/ori_len)


print(parsed_news_updated_content.head(10))
print(rejected_sites[:20])


end_test2 = time.perf_counter()

print('Synchronous test finished in: ', end_test2 - start_test2, ' seconds')

In [None]:
parsed_news = []
rejected_sites = []

# Iterate through the news
for file_name, news_table in news_tables.items():
    # ticker is the part of the file name before the first underscore
    ticker = file_name.split('_')[0]
    # Iterate through all tr tags in 'news_table'
    for x in news_table.findAll('tr'):
        art_content = article_content(x)
        # combine ticker, article summary and article content into 1 row of info about the article
        article_row = [ticker] + article_summary(x) + [art_content[0]]
        # Append relevant info as a list to the 'parsed_news' list
        parsed_news.append(article_row)
        # article_content only returns the site when the article was rejected
        if len(art_content) > 1:
            rejected_sites.append(art_content[1])

# Set column names
columns = ['ticker', 'date', 'time', 'headline', 'news', 'content']

# Convert the parsed_news list into a DataFrame called 'parsed_news_updated'
parsed_news_updated = pd.DataFrame(parsed_news, columns=columns)
ori_len = len(parsed_news_updated)
# remove articles whose contents are not available
parsed_news_updated_content = parsed_news_updated.copy()
parsed_news_updated_content = parsed_news_updated_content.loc[parsed_news_updated['content'] != 'empty string']
parsed_news_updated = parsed_news_updated.drop('content', axis=1)
print('Percentage decrease in articles is: ', (ori_len - len(parsed_news_updated_content))/ori_len)
parsed_news_updated_content.head(10)

#parsed_news_updated does not have the content column and has full number of articles (100 per ticker)
#parsed_news_updated_content has content column and is shortened to remove all rows with no content

In [None]:
# Count number of headlines produced by each news source and remove news sources at or below the median count
parsed_news_updated = pd.DataFrame(parsed_news, columns=columns)
# generate series of counts, named 'count' so the merge below adds a single 'count' column
news_count = parsed_news_updated['news'].value_counts().rename('count')
# set minimum count = median
cutoff_point = news_count.median()
# append count of news to dataframe (the 'news' column itself is kept)
parsed_news_updated = parsed_news_updated.merge(news_count, left_on='news', right_index=True)
# remove news which have count < cutoff_point
parsed_news_updated = parsed_news_updated.loc[parsed_news_updated['count'] > cutoff_point]
parsed_news_updated
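A toy illustration of the median cutoff, using made-up data. As an alternative to a merge, this sketch attaches the counts with `Series.map`, which avoids suffixed column names:

```python
import pandas as pd

# Made-up headlines; only the 'news' column matters here.
df = pd.DataFrame({'news': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
                   'headline': list('abcdefg')})

news_count = df['news'].value_counts()          # A: 4, B: 2, C: 1
cutoff_point = news_count.median()              # median of [4, 2, 1]
# map() looks up each row's source in the counts Series.
df['count'] = df['news'].map(news_count)
kept = df.loc[df['count'] > cutoff_point]
```

Because the comparison is strict, a source whose count equals the median (source `B` here) is dropped along with those below it.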


In [None]:
# Need to tokenize each word within the headlines to improve the sentiment score.

import re
import nltk
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def regex(x):
    # raw string so that \- reaches the regex engine without an invalid-escape warning
    special_chars_p = r"[.®'&$’\"\-()#@!?/:]"
    s1 = re.sub(special_chars_p, '', x)  
    return(s1)

parsed_news_updated['headline'] = parsed_news_updated['headline'].apply(regex)

stemmer = PorterStemmer()

def stem_sentences(sentence):
    tokens = sentence.lower().split()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

parsed_news_updated_stem = parsed_news_updated.copy()
parsed_news_updated_stem['headline'] = parsed_news_updated_stem['headline'].apply(stem_sentences)

stop=stopwords.words('english')

# remove stop words word by word (the stop list is lowercase, so capitalised words in the un-stemmed headlines are kept)

parsed_news_updated['headline'] = parsed_news_updated['headline'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)])) 
parsed_news_updated_stem['headline'] = parsed_news_updated_stem['headline'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)])) 
parsed_news_updated.head()
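One detail of the preprocessing above is worth illustrating: the stemmed headlines are lowercased before stop-word removal, while the un-stemmed ones are not, so a capitalised stop word such as "The" survives in the un-stemmed version. The sketch below uses a tiny hypothetical stop list in place of `stopwords.words('english')` so that it runs without any NLTK downloads:

```python
import re

# Tiny hypothetical stop list standing in for nltk's stopwords.words('english').
stop = {'the', 'is', 'a', 'of', 'to'}

def clean(headline, lowercase):
    # Strip the same special characters as regex() above.
    text = re.sub(r"[.®'&$’\"\-()#@!?/:]", '', headline)
    if lowercase:
        text = text.lower()
    # Keep only the words that are not in the stop list.
    return ' '.join(word for word in text.split() if word not in stop)

headline = "The S&P 500 is up: What's next?"
as_is = clean(headline, lowercase=False)    # 'The SP 500 up Whats next'
lowered = clean(headline, lowercase=True)   # 'sp 500 up whats next'
```

Whether lowercasing helps is a separate question, since VADER takes capitalisation into account when scoring.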

In [None]:
# NLTK VADER for sentiment analysis (unstem)
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Instantiate the sentiment intensity analyzer
vader = SentimentIntensityAnalyzer()

# Iterate through the headlines and get the polarity scores using vader
scores = parsed_news_updated['headline'].apply(vader.polarity_scores).tolist()

# Convert the 'scores' list of dicts into a DataFrame
scores_df = pd.DataFrame(scores)

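# Join the DataFrames of the news and the list of dicts
# (reset the index first: rows were filtered earlier, and join aligns on index labels)
parsed_and_scored_news = parsed_news_updated.reset_index(drop=True).join(scores_df, rsuffix='_right')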

# Convert the date column from string to datetime
parsed_and_scored_news['date'] = pd.to_datetime(parsed_and_scored_news.date).dt.date

parsed_and_scored_news.head(10)
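`DataFrame.join` matches rows by index label, not by position. If the headline frame was filtered earlier, its index has gaps, while a frame built from a plain list of score dicts is numbered 0, 1, 2, and so on. A minimal sketch with made-up data shows what can go wrong and one way to keep the rows aligned:

```python
import pandas as pd

# Made-up headlines; filtering leaves a non-contiguous index (0, 2, 3).
news = pd.DataFrame({'headline': ['h0', 'h1', 'h2', 'h3']})
news = news.loc[[0, 2, 3]]

# Hypothetical scores, one per remaining headline, in row order.
scores = [{'compound': 0.1}, {'compound': 0.2}, {'compound': 0.3}]
scores_df = pd.DataFrame(scores)                        # index 0, 1, 2

# Joining directly pairs 'h2' with the third score and leaves 'h3' without one.
misaligned = news.join(scores_df)
# Resetting the index first pairs the rows by position.
aligned = news.reset_index(drop=True).join(scores_df)
```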

In [None]:
# NLTK VADER for sentiment analysis (stem)

# Instantiate the sentiment intensity analyzer
vader = SentimentIntensityAnalyzer()

# Iterate through the headlines and get the polarity scores using vader
scores_stem = parsed_news_updated_stem['headline'].apply(vader.polarity_scores).tolist()

# Convert the 'scores' list of dicts into a DataFrame
scores_stem_df = pd.DataFrame(scores_stem)

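# Join the DataFrames of the news and the list of dicts
# (reset the index first: rows were filtered earlier, and join aligns on index labels)
parsed_and_scored_news_stem = parsed_news_updated_stem.reset_index(drop=True).join(scores_stem_df, rsuffix='_right')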

# Convert the date column from string to datetime
parsed_and_scored_news_stem['date'] = pd.to_datetime(parsed_and_scored_news_stem.date).dt.date

parsed_and_scored_news_stem.head(10)

# Stock Prices

In [None]:
# Get S&P 500 prices
# source: https://www.spglobal.com/spdji/en/indices/equity/sp-500/#overview

df_sp = pd.read_csv('data/S&P500_5years.csv', usecols=[0,1]) # Use only first 2 columns
df_sp.columns = ['date', 'price']
df_sp['date'] = pd.to_datetime(df_sp['date'])
df_sp.head()

In [None]:
# Get S&P 500 individual stock prices

# Create a function to get stock price given a ticker 
def get_stock_price(ticker, start, end):
    '''Get prices of a stock in a given period.
    
    Args:
        ticker (str): ticker of a company 
        start (str): date in format of 'YYYY-MM-DD'
        end (str): date in format of 'YYYY-MM-DD'
    
    Returns:
        A DataFrame containing open, high, low, close, volume, dividends, stock splits
    '''
    import yfinance as yf
    
    ticker = yf.Ticker(ticker)
    data = ticker.history(start=start, end=end)
    data.reset_index(level=0, inplace=True)
    return data 


In [None]:
# Define function to generate merged dataframe (merged by date)
# columns: ['ticker', 'date', 'time', 'headline', 'news', 'neg', 'neu', 'pos', 'compound', 'open', 'close', 'change']
# scored dataframe should be the input (not sure if works for scored sentiment other than Vader)


def generate_final_df(scored_df):
    # Get a list of 505 stocks from S&P 500
    sp500 = sandp_df['Symbol'].unique()
    start = scored_df['date'].min()
    end = scored_df['date'].max()
    
    # Iterate through each stock to get price
    df_stock = pd.DataFrame()
    for ticker in sp500:
        data = get_stock_price(ticker, start, end)
        data['ticker'] = ticker
        df_stock = pd.concat([df_stock, data], axis=0)
        
    # Change all columns names to lowercase  
    df_stock.columns = df_stock.columns.str.lower()
    
    # Convert timestamp to date
    df_stock['date'] = df_stock['date'].apply(datetime.date)
    
    # Reset index
    df_stock.reset_index(drop=True, inplace=True)
    
    # Merge stock price info and sentiment scores
    df_merged = scored_df.merge(df_stock.loc[:, ['date', 'ticker', 'open', 'close']], on=['date', 'ticker'])
    # Add column: price change
    df_merged['change'] = df_merged['close'] - df_merged['open']
    
    return df_merged
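The merge in `generate_final_df` uses the default inner join, so a headline only survives if a price row exists for the same ticker and date. Headlines dated on weekends or market holidays therefore drop out silently. A small sketch with made-up data shows the effect and how `indicator=True` can be used to count the unmatched rows:

```python
import pandas as pd
from datetime import date

# Made-up scored headlines; 2021-02-13 and 2021-02-14 fall on a weekend.
scored = pd.DataFrame({
    'ticker': ['MMM', 'MMM', 'MMM'],
    'date': [date(2021, 2, 12), date(2021, 2, 13), date(2021, 2, 14)],
    'compound': [0.5, -0.2, 0.1],
})
# Made-up prices; only the trading day has a row.
prices = pd.DataFrame({
    'ticker': ['MMM'],
    'date': [date(2021, 2, 12)],
    'open': [100.0],
    'close': [101.5],
})

# Inner join (the default): only rows present in both frames are kept.
merged = scored.merge(prices, on=['date', 'ticker'])
merged['change'] = merged['close'] - merged['open']

# Left join with an indicator column to count headlines without a price row.
unmatched = scored.merge(prices, on=['date', 'ticker'], how='left', indicator=True)
dropped = int((unmatched['_merge'] == 'left_only').sum())
```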

In [None]:
# Generate final df for unstem and stem

df_final_unstem = generate_final_df(parsed_and_scored_news)
df_final_stem = generate_final_df(parsed_and_scored_news_stem)

# Calculate Correlation 

In [None]:
# Calculate pearson correlation coef between sentiment score and price for each news media
scores_close_unstem = df_final_unstem.groupby('news')[['compound', 'close']].corr().unstack().iloc[:, 1].sort_values(ascending=False)
scores_close_stem = df_final_stem.groupby('news')[['compound', 'close']].corr().unstack().iloc[:, 1].sort_values(ascending=False)
scores_change_unstem = df_final_unstem.groupby('news')[['compound', 'change']].corr().unstack().iloc[:, 1].sort_values(ascending=False)
scores_change_stem = df_final_stem.groupby('news')[['compound', 'change']].corr().unstack().iloc[:, 1].sort_values(ascending=False)


# https://stackoverflow.com/questions/28988627/pandas-correlation-groupby

In [None]:
# Calculate spearman correlation coef between sentiment score and price for each news media
scores_close_unstem = df_final_unstem.groupby('news')[['compound', 'close']].corr(method='spearman').unstack().iloc[:, 1].sort_values(ascending=False)
scores_close_stem = df_final_stem.groupby('news')[['compound', 'close']].corr(method='spearman').unstack().iloc[:, 1].sort_values(ascending=False)
scores_change_unstem = df_final_unstem.groupby('news')[['compound', 'change']].corr(method='spearman').unstack().iloc[:, 1].sort_values(ascending=False)
scores_change_stem = df_final_stem.groupby('news')[['compound', 'change']].corr(method='spearman').unstack().iloc[:, 1].sort_values(ascending=False)


In [None]:
# Pearson's Correlation Coefficient as Dataframe
pearson_corr = pd.DataFrame({'variable' : ['close', 'change'], 'unstem' : [df_final_unstem[['compound', 'close']].corr().iloc[0,1], df_final_unstem[['compound', 'change']].corr().iloc[0,1]], 'stem' : [df_final_stem[['compound', 'close']].corr().iloc[0,1], df_final_stem[['compound', 'change']].corr().iloc[0,1]]})
pearson_corr

In [None]:
# Spearman's rank correlation
spearman_corr = pd.DataFrame({'variable' : ['close', 'change'], 'unstem' : [df_final_unstem[['compound', 'close']].corr(method='spearman').iloc[0,1], df_final_unstem[['compound', 'change']].corr(method='spearman').iloc[0,1]], 'stem' : [df_final_stem[['compound', 'close']].corr(method='spearman').iloc[0,1], df_final_stem[['compound', 'change']].corr(method='spearman').iloc[0,1]]})
spearman_corr

Vader Sentiment Analysis

Computed Pearson's correlation coefficient between the compound sentiment score and (1) closing price and (2) change in price (closing price - opening price). Computed Spearman's rank correlation coefficient for the same pairs.

Compared effect of stemming and not stemming words on Pearson's and Spearman's correlation coefficient.


Conclusion:

Pearson:
Correlation with both (1) and (2) is negligible (<1%) without stemming. Stemming appears to improve the correlation, but it is still very small (<2%).

Spearman:
Correlation with both (1) and (2) is still small, but larger than with Pearson. Stemming has inconsistent effects: it slightly lowers (2) but increases (1).

Overall, Vader sentiment analysis produces very weak correlation with both (1) and (2). Try with other models.

Removing headlines from news sources with fewer articles:
Decreased correlation. Most of the correlations are negative but close to 0. This suggests that news sources with lower article counts tend to predict price movements more accurately than the news sources that post more often.


Note: 

1. Might be helpful to determine the most relevant news sources by keeping only highly correlated news sources with more than 20 headlines. Limiting the data might increase correlation. If this method is used, evaluate it on a held-out test set.

2. Doing linear regression on neg, neu and pos score might produce interesting results.

3. Vader sentiment scores might be the problem. Sentiment scores of some sample headlines were inspected, and Vader produced many false negatives. Try using article content.
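For the regression idea in note 2, one point to keep in mind is that VADER's `neg`, `neu` and `pos` scores are proportions that sum to 1 (up to rounding), so using all three together with an intercept makes the design matrix rank deficient. The sketch below uses synthetic data only (the scores and the price change are generated, not taken from the notebook) and drops `neu` to obtain a full-rank fit with `numpy.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for VADER scores: each row of (neg, neu, pos) sums to 1.
scores = rng.dirichlet([1.0, 3.0, 1.0], size=200)
neg, neu, pos = scores.T
# Synthetic price change, made up for illustration only.
change = 2.0 * pos - 1.0 * neg + rng.normal(0.0, 0.05, size=200)

# With an intercept, the three score columns add up to the constant column.
full = np.column_stack([np.ones(200), neg, neu, pos])
full_rank = np.linalg.matrix_rank(full)        # 3, not 4

# Dropping 'neu' gives a full-rank design matrix.
reduced = np.column_stack([np.ones(200), neg, pos])
coef, _, rank, _ = np.linalg.lstsq(reduced, change, rcond=None)
# coef holds [intercept, neg coefficient, pos coefficient]
```

The coefficients are then read relative to the dropped `neu` category.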

In [None]:
df_final_unstem.loc[df_final_unstem['news']==' The Telegraph', ]

Notes:

1. Take note of changes in the composition of S&P 500.