## Christine Li - Sentiment Analysis of S&P 500 Companies' News Data from Finviz Website

### Outline
<span style="font-size: 17px;"> The program is divided into 5 sections:<br> </span>
1. Set Up and Data Gathering
2. Data Processing: Parse and Retrieve the News Data
3. Sentiment Analysis on News Data from Finviz Website with Vader
4. Data Aggregation With Wikipedia Data
5. Interactive Visualization <br>
Each section has subsections with a general description of what each part of the code does. More detailed explanations are given in the comments accompanying specific lines of code.

### 1. Set Up and Data Gathering **(CHANGED)**

<span style="font-size: 17px;"> 1.1 Import the packages needed to extract data from web pages, along with the libraries required for basic data manipulation, sentiment analysis, and visualization.</span>

In [1]:
# libraries for webscraping, parsing and getting stock data
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import yfinance as yf

# for plotting and data manipulation
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
from datetime import timedelta

# NLTK VADER for sentiment analysis
import nltk
# VADER is a lexicon-based sentiment analyzer designed for social media text; download its lexicon
nltk.downloader.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\bradsun\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


<span style="font-size: 17px;">1.2 Access the Wikipedia page listing the S&P 500 companies and extract their tickers into a list.</span>

In [2]:
# Get all dataframes from a given URL
wikipedia_url = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
# Access the first DataFrame in the list 
stocks = wikipedia_url[0]
# Select the column named 'Symbol'
tickers = stocks['Symbol']
# Convert it to a list
tickers = tickers.tolist()


<span style="font-size: 17px;">1.3 Access the Finviz website to concurrently retrieve the news data of each S&P 500 company</span>

&emsp; <span style="font-size: 17px;"> (1) Preparation</span>

In [3]:
import time
import random
# Libraries for executing concurrent tasks
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# URL for fetching financial data from the website "https://finviz.com/"
finwiz_url = 'https://finviz.com/quote.ashx?t='
# A dictionary used to store news data fetched from web pages.
news_tables = {}

&emsp; <span style="font-size: 17px;">(2) Method Definition to Retrieve News Data</span>

&emsp;&emsp;&emsp; <span style="font-size: 17px;">* Helper method to fetch news data for each stock ticker from the Finviz website</span> 

In [4]:
from http.client import IncompleteRead
from urllib.error import HTTPError, URLError

def fetch_news_table(ticker):
    # A random delay of 5 to 10 seconds before making the web request
    time.sleep(np.random.uniform(5, 10))
    # Replaces any '.' in the ticker symbol with '-' to ensure compatibility with the URL structure
    ticker = ticker.replace(".", "-")
    print(ticker)

    # Construct the URL for a specific stock on the Finviz website
    url = finwiz_url + ticker  


    ###### Try to make the request 3 times before giving up ######
    for _ in range(3):
        try:
            # Request object
            # User-Agent header is to mimic a web browser and avoid any restrictions or blocks from the website
            req = Request(url=url,headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'})
            response = urlopen(req)
            # Parsing the HTML content
            html = BeautifulSoup(response, 'html.parser')

            # Get the table containing news data related to the stock
            news_table = html.find(id='news-table')
            return ticker, news_table
        except (IncompleteRead, HTTPError, URLError) as e:
            print("Error occurred while fetching data for {}: {}".format(ticker, str(e)))
            time.sleep(2)  # Wait for 2 seconds before trying again
            continue

    print("Failed to fetch data for {} after 3 attempts".format(ticker))
    return ticker, None
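The retry loop in `fetch_news_table` can also be factored into a small reusable helper. The following is a minimal sketch, not the notebook's own method: the name `retry`, its parameters, and the growing wait between attempts are all assumptions made for illustration, and the `sleep` argument is injectable only so the example runs without real pauses.

```python
import time


def retry(func, attempts=3, delay=2.0, backoff=2.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Call func() up to `attempts` times, waiting longer after each failure.

    Returns func()'s result, or None if every attempt raised one of `exceptions`.
    """
    wait = delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt < attempts:
                sleep(wait)      # pause before the next try
                wait *= backoff  # each pause is `backoff` times the previous one
    return None


# Usage with a stand-in for a flaky request (fails twice, then succeeds)
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # records the requested pauses instead of actually sleeping
result = retry(flaky, exceptions=(ConnectionError,), sleep=waits.append)
```

Passing the exception types explicitly keeps the helper from swallowing unrelated errors, which mirrors the narrow `except (IncompleteRead, HTTPError, URLError)` clause used above.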

&emsp;&emsp;&emsp;  <span style="font-size: 17px;">* Wrapper that calls the helper method **fetch_news_table(ticker)** with rate limiting</span> 

In [5]:
def process_ticker(index, ticker):
    # 1 second delay before processing each ticker
    sleep_duration = 1
    time.sleep(sleep_duration)
    # An additional pause of 5 to 10 seconds on every 50th ticker to avoid overwhelming the website's server
    if index % 50 == 0:
        time.sleep(np.random.uniform(5,10))
    # Call the method fetch_news_table(ticker)
    return fetch_news_table(ticker)
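Note that the pauses in `process_ticker` and `fetch_news_table` run independently inside each worker thread, so with several workers a few requests can still go out at nearly the same moment. One possible alternative is a limiter shared by all threads. The sketch below is an assumption-laden illustration: `MinIntervalLimiter` is a hypothetical class, not part of the notebook or of any library, and the `clock` and `sleep` arguments exist only so the example can run with a fake clock.

```python
import threading
import time


class MinIntervalLimiter:
    """Allow at most one call per `interval` seconds across all threads."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay


# Usage with a frozen fake clock: three back-to-back calls get spaced 1 s apart
delays = []
limiter = MinIntervalLimiter(1.0, clock=lambda: 100.0, sleep=delays.append)
for _ in range(3):
    limiter.wait()
```

In a real run, each worker would call `limiter.wait()` on one shared instance just before opening the URL.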

&emsp; <span style="font-size: 17px;">(3) Run concurrent tasks to fetch the news tables of multiple tickers in parallel</span>

In [6]:
#  A maximum of 8 worker threads
with ThreadPoolExecutor(max_workers=8) as executor:
    # Iterate over the tickers list
    # Submit process_ticker(index, ticker) method to the executor for execution
    # Create a dictionary, keys are futures returned by the executor.submit(), and values are ticker
    futures = {executor.submit(process_ticker, index, ticker): ticker for index, ticker in enumerate(tickers)}
    # Iterate over each future in the futures dictionary as they are completed
    for future in as_completed(futures):
        ticker, news_table = future.result()  # Retrieve the result of a completed future
        news_tables[ticker] = news_table  # Add the retrieved news table to the news_tables dictionary: tickers are the keys, news tables are the values

ABT
ACN
ATVI
ABBV
ADBE
AOS
ADM
AES
APD
AFL
ADP
AAP
MMM
A
AKAM
ALK
ARE
ALLE
ALL
ALB
ALGN
GOOGL
LNT
MO
AMD
GOOG
AEE
AMCR
AMZN
AEP
AAL
AXP
AMT
AMP
AIG
AME
AWK
ABC
AMGN
ANSS
AON
ADI
AAPL
APH
APA
ACGL
AMAT
ANET
APTV
AJG
T
ATO
AVY
ADSK
AZO
AVB
AIZ
AXON
BKR
BALL
BBWI
WRB
BRK-B
BAC
BAX
BDX
BBY
BIO
TECH
BLK
BIIB
BWA
BK
BKNG
BA
BXP
BMY
BSX
BF-B
BRO
BR
AVGO
CHRW
BG
CPB
CDNS
COF
CZR
CPT
CAH
CCL
KMX
CTLT
CARR
CDW
CAT
CBOE
CNC
CBRE
CE
CHTR
CDAY
CVX
CF
CRL
SCHW
CMG
CNP
CB
CI
CINF
CHD
CTAS
C
CSCO
CMS
CFG
CME
CLX
CTSH
KO
CL
CMCSA
CEG
STZ
ED
COP
CMA
CAG
COO
CPRT
CTVA
COST
GLW
CSGP
CTRA
CMICSX

CCI
CVS
DHI
DRI
DE
DHR
DVA
DAL
XRAY
DVN
FANG
DXCM
DFS
DLTR
DG
DIS
D
DPZ
DLR
DOV
DTE
DOW
EMN
DD
ETN
DUK
DXC
EBAY
EMR
EIX
ECL
EA
EW
ELV
LLY
ENPH
EPAM
ETR
EOG
EQT
EFX
EQIX
EQR
ESS
ETSY
EL
EVRG
EG
ES
EXC
EXPE
EXR
EXPD
XOMFDS

FFIV
FICO
FAST
FDX
FRT
FITB
FIS
FSLR
FMC
FLT
FI
FTNT
FE
F
FTV
FOXA
FOX
BEN
FCX
GRMN
IT
GEN
GEHC
GD
GNRC
GM
GE
GIS
GPC
GILD
GL
GPN
GS
HAL
HIG
HAS
HCA
HSY
PEAK
HSIC
HES
HPE
HOLX
HLT
HRL
HD
HON
HPQ
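The submit-then-collect pattern used with `ThreadPoolExecutor` can be tried without any network access. The sketch below is illustrative only: `fake_fetch` and the sample tickers are made up, and one ticker fails on purpose to show that `future.result()` re-raises whatever exception occurred in the worker.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(ticker):
    """Stand-in for a network call; raises for one ticker to show error handling."""
    if ticker == "BAD":
        raise ValueError("simulated failure")
    return ticker, f"<table for {ticker}>"


sample_tickers = ["AAPL", "MSFT", "BAD", "GOOG"]
results, failed = {}, []

with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to the ticker that produced it
    futures = {executor.submit(fake_fetch, t): t for t in sample_tickers}
    # as_completed yields futures in the order they finish, not the order submitted
    for future in as_completed(futures):
        ticker = futures[future]
        try:
            _, table = future.result()  # re-raises any exception from the worker
            results[ticker] = table
        except Exception as exc:
            failed.append(ticker)
            print(f"{ticker} failed: {exc}")
```

Because several threads print at once, lines of output can occasionally run together, which is the likely cause of merged entries such as `CMICSX` in the listing above.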


<span style="font-size: 17px;">Check the number of tickers that are successfully retrieved.</span>

In [7]:
num_keys = len(news_tables.keys())
num_keys

503

### 2. Data Processing: Parse and Retrieve the News Data **(CHANGED)**

<span style="font-size: 17px;">Iterate through each news table obtained from the website, parse the news text to extract the relevant information (ticker symbol, date, time, and headline) and append them to the **parsed_news** list.</span>

In [8]:
parsed_news = []  # To store the parsed news data
# Iterate through the news data (keys are tickers, values are news tables)
for ticker, news_table in news_tables.items():
    # Skip tickers whose page could not be fetched or has no news table
    if news_table is None:
        continue
    # Reset the date so it is never carried over from the previous ticker
    date = None
    # Iterate through all tr tags in 'news_table'
    for x in news_table.find_all('tr'):
        try:
            # get text from tag <a> only to extract news headline 
            text = x.a.get_text() 
            # split text in the td (usually contains date and time info) tag into a list 
            date_scrape = x.td.text.split()
            # if the length of 'date_scrape' is 1, only the time is available and the previous date still applies
            # ('news_time' is used instead of 'time' to avoid shadowing the imported time module)
            if len(date_scrape) == 1:
                news_time = date_scrape[0]
            # else load 'date' as the 1st element and the time as the second
            else:
                date = date_scrape[0]
                news_time = date_scrape[1]
            # Append ticker, date, time and headline to the 'parsed_news' list
            parsed_news.append([ticker, date, news_time, text])
        # Catches any exceptions that occur during the process, e.g. rows without an <a> tag
        except Exception as e:
            print(e)

'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute 'get_text'
'NoneType' object has no attribute
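The date handling in the loop above relies on a carry-forward rule: a cell that holds only a time inherits the date of the most recent cell that held both. A minimal sketch of that rule in isolation follows. The function name `split_timestamps` is hypothetical and the sample strings are made up, in a format assumed to resemble the Finviz cells.

```python
def split_timestamps(cells):
    """Turn cells such as 'Jun-30-23 09:15AM' or '08:40AM' into (date, time) pairs,
    reusing the last seen date when a cell has only a time."""
    pairs = []
    current_date = None  # stays None until the first cell that carries a date
    for cell in cells:
        parts = cell.split()
        if len(parts) == 1:
            news_time = parts[0]
        else:
            current_date, news_time = parts[0], parts[1]
        pairs.append((current_date, news_time))
    return pairs


# Made-up sample cells: the 2nd and 4th carry only a time
sample_cells = ["Jun-30-23 09:15AM", "08:40AM", "Jun-29-23 04:05PM", "01:12PM"]
pairs = split_timestamps(sample_cells)
```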

### 3. Sentiment Analysis on News Data from Finviz Website with Vader **(CHANGED)**

<span style="font-size: 17px;">3.1 Analyze the sentiment of all news data with VADER and get each company's overall sentiment scores<br>
Run VADER sentiment analysis on each news headline and build a dataframe called **parsed_and_scored_news** that holds the key information about each news item (ticker, date, time, headline, and sentiment scores). Then compute each company's overall sentiment scores by averaging the scores of all its headlines.</span>

In [9]:
# Instantiate the sentiment intensity analyzer
vader = SentimentIntensityAnalyzer()
# Set column names
columns = ['ticker', 'date', 'time', 'headline']
# Convert the parsed_news list into a DataFrame called 'parsed_and_scored_news'
parsed_and_scored_news = pd.DataFrame(parsed_news, columns=columns)

# Iterate through the headlines and get the polarity scores using vader
scores = parsed_and_scored_news['headline'].apply(vader.polarity_scores).tolist()
# Convert the 'scores' list of dicts into a DataFrame
scores_df = pd.DataFrame(scores)

# Join the DataFrames of the news and the list of dicts
parsed_and_scored_news = parsed_and_scored_news.join(scores_df, rsuffix='_right')
# Convert the date column from string to datetime
####### parsed_and_scored_news['date'] = pd.to_datetime(parsed_and_scored_news.date).dt.date

# Group by each ticker and get the mean of all sentiment scores
mean_scores = parsed_and_scored_news.groupby(['ticker'])[['neg','neu','pos','compound']].mean()
parsed_and_scored_news['date'] = pd.to_datetime(parsed_and_scored_news['date'])
max_dates = parsed_and_scored_news.groupby('ticker')['date'].max()
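For context on the `compound` column averaged above: in the open-source VADER implementation, the summed word valences are squashed into the range [-1, 1] by dividing by the square root of the squared sum plus a constant (15 by default). The sketch below reproduces only that normalization step and the commonly quoted ±0.05 cut-offs; the `label` helper is a hypothetical convenience, and the raw sums are made-up inputs rather than real headline scores.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    return score / math.sqrt(score * score + alpha)


def label(compound, threshold=0.05):
    """Conventional cut-offs: >= 0.05 positive, <= -0.05 negative, else neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up raw valence sums, from strongly negative to strongly positive
for raw in (-4.0, 0.0, 1.0, 4.0):
    c = normalize(raw)
    print(f"raw={raw:+.1f} -> compound={c:+.4f} ({label(c)})")
```

Because the mapping is nonlinear, the mean of several compound scores is a rough summary of tone rather than an exact average of the underlying valences.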



<span style="font-size: 17px;">3.2 Analyze the sentiment of the news from each company's two most recent days and get its overall recent sentiment scores</span>

In [10]:
# A list to collect each ticker's recent news rows (combined into one dataframe below)
frames = []
# Find recent news data for each ticker
for ticker, max_date in max_dates.items():
    # Filter recent news data in the last two days
    frame = parsed_and_scored_news[(parsed_and_scored_news['ticker'] == ticker) & 
                                   ((parsed_and_scored_news['date'] == max_date) | 
                                    (parsed_and_scored_news['date'] == max_date - timedelta(days=1)))]
    frames.append(frame)

# Combine data for all tickers
recent_two_days_news = pd.concat(frames)

# Calculate the average sentiment score for each ticker in the last two days
mean_scores = recent_two_days_news.groupby(['ticker'])[['neg','neu','pos','compound']].mean()

# Remove tickers that have no data in the last two days
mean_scores = mean_scores.dropna()
# Reindex
mean_scores = mean_scores.reset_index()
mean_scores = mean_scores.rename(columns={'ticker': 'Ticker'})
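The filtering rule above is relative to each ticker's own latest headline, not to today's date: a row is kept if it is dated on that latest day or the day before. The sketch below restates the rule in plain Python so the logic is visible without pandas. All rows are made-up values for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Made-up (ticker, date, compound) rows
rows = [
    ("AAA", date(2023, 6, 30), 0.50),
    ("AAA", date(2023, 6, 29), 0.10),
    ("AAA", date(2023, 6, 20), -0.90),  # too old, ignored
    ("BBB", date(2023, 6, 25), -0.20),
    ("BBB", date(2023, 6, 23), 0.80),   # two days before BBB's latest, ignored
]

# Latest date seen for each ticker
latest = {}
for ticker, day, _ in rows:
    latest[ticker] = max(day, latest.get(ticker, day))

# Keep rows dated on the latest day or the day before it
recent = defaultdict(list)
for ticker, day, compound in rows:
    if latest[ticker] - day <= timedelta(days=1):
        recent[ticker].append(compound)

# Average the surviving compound scores per ticker
recent_mean = {ticker: mean(scores) for ticker, scores in recent.items()}
```

A ticker with sparse coverage, like `BBB` here, ends up with a mean based on a single headline, so recent scores for thinly covered companies can be noisy.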

### 4. Data Aggregation With Wikipedia Data

<span style="font-size: 17px;"> Retrieve the S&P 500 company tickers and their respective sectors from Wikipedia, merge the tickers with the mean sentiment scores to get a new dataframe, and identify the top 5 and bottom 5 companies with the highest and lowest sentiment scores for each sector. </span>

In [12]:
import requests
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
# Retrieve the page and parse its HTML
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the table with the CSS class 'wikitable sortable' within the HTML content
table = soup.find('table', {'class': 'wikitable sortable'})
# Retrieve all the rows of table, excluding the first one which is header row.
rows = table.findAll('tr')[1:]

# To store the extracted tickers and sectors
tickers_and_sectors = []

# Iterate all rows/stocks
for row in rows:
    # Extract the stock's ticker and sector
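    # Replace '.' with '-' so tickers such as BRK.B match the 'BRK-B' form used for the Finviz data
    ticker = row.findAll('td')[0].text.strip().replace('.', '-')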
    sector = row.findAll('td')[2].text.strip()
    # Add the tuple (ticker, sector) to a list
    tickers_and_sectors.append((ticker, sector))

# Convert list to dataframe
tickers = pd.DataFrame(tickers_and_sectors, columns=['Ticker', 'Sector'])

df = tickers.merge(mean_scores,on='Ticker')
df = df.rename(columns={"compound": "Sentiment Score", "neg": "Negative", "neu": "Neutral", "pos": "Positive"})
df = df.reset_index()
grouped = df.groupby('Sector')
# Selects the top 5 stocks with the highest sentiment score for each sector 
top_5_each_sector = grouped.apply(lambda x: x.nlargest(5, 'Sentiment Score')).reset_index(drop=True)
# Selects the bottom 5 stocks with the lowest sentiment score for each sector
low_5_each_sector = grouped.apply(lambda x: x.nsmallest(5, 'Sentiment Score')).reset_index(drop=True)
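The per-sector selection can be illustrated on a tiny dataset. The sketch below uses only the standard library (`heapq.nlargest` and `heapq.nsmallest`) in place of the pandas calls. The tickers, sectors, and scores are made up, and `n` is set to 2 to keep the example short.

```python
from collections import defaultdict
from heapq import nlargest, nsmallest

# Made-up (ticker, sector, score) rows
scored = [
    ("AAA", "Energy", 0.40), ("BBB", "Energy", -0.10), ("CCC", "Energy", 0.25),
    ("DDD", "Health Care", 0.05), ("EEE", "Health Care", 0.30),
]

# Group (ticker, score) pairs by sector
by_sector = defaultdict(list)
for ticker, sector, score in scored:
    by_sector[sector].append((ticker, score))

n = 2
# Highest and lowest n scores within each sector
top_n = {s: nlargest(n, items, key=lambda item: item[1]) for s, items in by_sector.items()}
low_n = {s: nsmallest(n, items, key=lambda item: item[1]) for s, items in by_sector.items()}
```

When a sector has fewer than `2 * n` companies, the same ticker can appear in both selections (as `CCC` does here), which also applies to small sectors in the top 5 and bottom 5 dataframes.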

### 5. Interactive Visualization **(CHANGED)**

<span style="font-size: 17px;">Display a treemap of sentiment scores for the top 5 or bottom 5 companies in each sector, depending on the user's input.<br>
The first figure shows the top 5 of each sector, and the second figure shows the bottom 5 of each sector. </span>

In [13]:
###### Check the valid input
valid = False

while not valid:

    # Prompt user for input to choose the top 5 or lowest 5 sentiment scores 
    answer=input('input 1 for the most positive or 0 for the most negative ')

    # For the most positive
    if answer=='1':
        # Visualization attributes. Sectors serve as the root level of the treemap hierarchy. The color of each treemap cell based on the 'Sentiment Score' column
        fig = px.treemap(top_5_each_sector, path=[px.Constant("Sectors"), 'Sector', 'Ticker'], 
                        color='Sentiment Score', hover_data=['Negative', 'Neutral', 'Positive', 'Sentiment Score'],
                        color_continuous_scale=['#FF0000', "#000000", '#00FF00'],
                        color_continuous_midpoint=0)
        
        # Attach the rounded scores as custom data and show the sentiment score under each cell label
        fig.data[0].customdata = top_5_each_sector[[ 'Negative', 'Neutral', 'Positive', 'Sentiment Score']].round(3) # round to 3 decimal places
        fig.data[0].texttemplate = "%{label}<br>%{customdata[3]}"
        fig.update_traces(textposition="middle center")
        fig.update_layout(margin = dict(t=30, l=10, r=10, b=10), font_size=20)

        # plotly.offline.plot(fig, filename='stock_sentiment.html') # this writes the plot into a html file and opens it
        fig.show()
        valid = True

    # For the most negative
    elif answer=='0':
        
        fig = px.treemap(low_5_each_sector, path=[px.Constant("Sectors"), 'Sector', 'Ticker'], 
                        color='Sentiment Score', hover_data=['Negative', 'Neutral', 'Positive', 'Sentiment Score'],
                        color_continuous_scale=['#FF0000', "#000000", '#00FF00'],
                        color_continuous_midpoint=0)

        fig.data[0].customdata = low_5_each_sector[[ 'Negative', 'Neutral', 'Positive', 'Sentiment Score']].round(3) # round to 3 decimal places
        fig.data[0].texttemplate = "%{label}<br>%{customdata[3]}"
        fig.update_traces(textposition="middle center")
        fig.update_layout(margin = dict(t=30, l=10, r=10, b=10), font_size=20)

        # plotly.offline.plot(fig, filename='stock_sentiment.html') # this writes the plot into a html file and opens it
        fig.show()
        valid = True
        
    # Check invalid input
    else:
        print("Invalid input. Please enter 1 or 0.")
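Since the two branches above differ only in which dataframe they plot, the validation loop can be separated from the plotting. The following is a minimal sketch under stated assumptions: `ask_choice` is a hypothetical helper, the strings stand in for the actual dataframes, and the `read` argument is injectable only so the example can run with scripted replies instead of keyboard input.

```python
def ask_choice(prompt, choices, read=input):
    """Keep asking until the reply is one of the keys in `choices`; return its value."""
    while True:
        answer = read(prompt).strip()
        if answer in choices:
            return choices[answer]
        print(f"Invalid input. Please enter one of: {', '.join(choices)}.")


# Usage with scripted replies: the first is rejected, the second is accepted
replies = iter(["yes", " 1 "])
selected = ask_choice(
    "input 1 for the most positive or 0 for the most negative ",
    {"1": "top_5_each_sector", "0": "low_5_each_sector"},
    read=lambda _prompt: next(replies),
)
```

In the notebook, the dictionary values could be the dataframes themselves, so that a single `px.treemap` call handles both cases.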

In [14]:
###### Check the valid input
valid = False

while not valid:

    # Prompt user for input to choose the top 5 or lowest 5 sentiment scores 
    answer=input('input 1 for the most positive or 0 for the most negative ')

    # For the most positive
    if answer=='1':
        # Visualization attributes. Sectors serve as the root level of the treemap hierarchy. The color of each treemap cell based on the 'Sentiment Score' column
        fig = px.treemap(top_5_each_sector, path=[px.Constant("Sectors"), 'Sector', 'Ticker'], 
                        color='Sentiment Score', hover_data=['Negative', 'Neutral', 'Positive', 'Sentiment Score'],
                        color_continuous_scale=['#FF0000', "#000000", '#00FF00'],
                        color_continuous_midpoint=0)
        
        # Attach the rounded scores as custom data and show the sentiment score under each cell label
        fig.data[0].customdata = top_5_each_sector[[ 'Negative', 'Neutral', 'Positive', 'Sentiment Score']].round(3) # round to 3 decimal places
        fig.data[0].texttemplate = "%{label}<br>%{customdata[3]}"
        fig.update_traces(textposition="middle center")
        fig.update_layout(margin = dict(t=30, l=10, r=10, b=10), font_size=20)

        # plotly.offline.plot(fig, filename='stock_sentiment.html') # this writes the plot into a html file and opens it
        fig.show()
        valid = True

    # For the most negative
    elif answer=='0':
        
        fig = px.treemap(low_5_each_sector, path=[px.Constant("Sectors"), 'Sector', 'Ticker'], 
                        color='Sentiment Score', hover_data=['Negative', 'Neutral', 'Positive', 'Sentiment Score'],
                        color_continuous_scale=['#FF0000', "#000000", '#00FF00'],
                        color_continuous_midpoint=0)

        fig.data[0].customdata = low_5_each_sector[[ 'Negative', 'Neutral', 'Positive', 'Sentiment Score']].round(3) # round to 3 decimal places
        fig.data[0].texttemplate = "%{label}<br>%{customdata[3]}"
        fig.update_traces(textposition="middle center")
        fig.update_layout(margin = dict(t=30, l=10, r=10, b=10), font_size=20)

        # plotly.offline.plot(fig, filename='stock_sentiment.html') # this writes the plot into a html file and opens it
        fig.show()
        valid = True
        
    # Check invalid input
    else:
        print("Invalid input. Please enter 1 or 0.")

Invalid input. Please enter 1 or 0.
