# Classifying Stocks Using Machine Learning
## Collecting Richer Data
<hr>

The purpose of this notebook is to collect full article text for richer sentiment analysis, count the related stocks mentioned in each article, and compute tf-idf scores against a query list to judge whether a stock is projected to rise or fall.

### Objectives
 - investigate most used article sites
 - create collection methods for articles
 - collect:
     - sentiment
     - related stocks mentioned in article
     - captured text for tf-idf
 - implement tf-idf scoring system
 - implement collection of top adjacent stock news and social sentiment
     - create new column for "related_stock_sent"
 - export new df with columns:
     - related_stock_sent
     - tfidf_score
     - full_article_sent
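
The tf-idf objective can be sketched in plain Python. This is a minimal illustration, not the notebook's final scoring system: the function names, the sample articles, and the `bullish_terms` query list are all hypothetical, and the smoothed IDF formula is one common choice among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_query_scores(articles, query_terms):
    """Return one score per article: the summed TF-IDF weight of the query terms.

    Query terms are expected in lowercase, matching tokenize().
    """
    docs = [Counter(tokenize(a)) for a in articles]
    n_docs = len(docs)
    scores = []
    for counts in docs:
        total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            if total == 0 or counts[term] == 0:
                continue
            tf = counts[term] / total
            df = sum(1 for d in docs if d[term] > 0)
            # Smoothed IDF, so a term found in every article still contributes.
            idf = math.log((1 + n_docs) / (1 + df)) + 1
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical articles and query list, for illustration only.
articles = [
    "Airline shares surge as bookings recover",
    "Airline shares plunge after weak guidance",
    "Regulators review the proposed alliance",
]
bullish_terms = ["surge", "recover", "rally"]
print(tfidf_query_scores(articles, bullish_terms))
```

Only the first article contains any of the query terms, so it is the only one with a non-zero score. A second list of bearish terms could be scored the same way and the two scores compared.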

In [48]:
import pandas as pd
import numpy as np
from pycorenlp import StanfordCoreNLP
from collections import Counter
import requests
from bs4 import BeautifulSoup
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
import re
import urllib.request

In [2]:
df = pd.read_hdf('stocksFullCleaned.h5')

In [3]:
# count how many times each source appears across all rows

news = dict(Counter(s for sources in df['source'] for s in sources))
len(news)

330

In [4]:
news

{'SeekingAlpha': 7792,
 'DowJones': 1822,
 'MarketWatch': 7855,
 'Reuters': 2120,
 'Yahoo': 18524,
 'Alliance News': 1382,
 'TalkMarkets': 1932,
 'Associated Press, The': 1191,
 'Stock Options Channel': 184,
 'CNBC': 4226,
 'Investing.com': 1676,
 'Thefly.com': 4252,
 'GlobeNewswire': 92,
 'InvestorPlace': 1960,
 '247WallSt': 404,
 'GuruFocus': 4168,
 'Nasdaq': 7401,
 'PR Newswire': 104,
 'https://www.independent.co.uk': 13,
 'Benzinga': 3212,
 'benzinga': 588,
 'United Press International': 108,
 'PR Web': 446,
 'Market News Video': 75,
 'ETF Channel': 51,
 'The Online Investor': 42,
 'Business Wire': 174,
 'https://www.arabnews.com': 28,
 'abcnews': 99,
 'businesswire': 464,
 'https://www.einpresswire.com': 15,
 'https://www.teslarati.com': 5,
 'https://www.thelincolnianonline.com': 170,
 'https://screenrant.com': 76,
 'https://born2invest.com': 4,
 'https://www.techtimes.com': 71,
 'https://www.markets.co': 5,
 'https://investorplace.com': 39,
 'https://venturebeat.com': 17,
 'https

In [5]:
news = {k: v for k, v in sorted(news.items(), key=lambda item: item[1],reverse=True)}
news

{'Yahoo': 18524,
 'MarketWatch': 7855,
 'SeekingAlpha': 7792,
 'Nasdaq': 7401,
 'Thefly.com': 4252,
 'CNBC': 4226,
 'GuruFocus': 4168,
 'Benzinga': 3212,
 'Reuters': 2120,
 'InvestorPlace': 1960,
 'TalkMarkets': 1932,
 'DowJones': 1822,
 'Investing.com': 1676,
 'Alliance News': 1382,
 'Associated Press, The': 1191,
 'benzinga': 588,
 'businesswire': 464,
 'PR Web': 446,
 '247WallSt': 404,
 'The Guardian': 187,
 'Stock Options Channel': 184,
 'Business Wire': 174,
 'https://www.thelincolnianonline.com': 170,
 'https://www.fool.com': 159,
 'https://www.windowscentral.com': 126,
 'https://www.marketscreener.com': 120,
 'United Press International': 108,
 'PR Newswire': 104,
 'abcnews': 99,
 'https://www.forbes.com': 98,
 'GlobeNewswire': 92,
 'https://screenrant.com': 76,
 'https://www.businessinsider.in': 76,
 'Market News Video': 75,
 'https://www.techtimes.com': 71,
 'https://www.theverge.com': 71,
 'https://learnbonds.com': 59,
 'https://timesofindia.indiatimes.com': 56,
 'ETF Channel
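
The counts above list some publishers under more than one label ('Benzinga' and 'benzinga', 'Business Wire' and 'businesswire'). One possible way to merge such labels before ranking is sketched below. `normalize_source` is a hypothetical helper, not part of the notebook, and it does not merge URL-style labels with their plain-name counterparts (for example 'https://investorplace.com' and 'InvestorPlace').

```python
import re
from collections import Counter


def normalize_source(name):
    """Collapse a source label to a lowercase key without spaces or punctuation."""
    name = name.lower()
    # Reduce URL-style labels such as 'https://www.fool.com' to 'fool.com' first.
    name = re.sub(r"^https?://(www\.)?", "", name)
    return re.sub(r"[^a-z0-9.]", "", name)


# A few counts copied from the output above.
raw_counts = {"Benzinga": 3212, "benzinga": 588, "Business Wire": 174, "businesswire": 464}

merged = Counter()
for label, count in raw_counts.items():
    merged[normalize_source(label)] += count

print(dict(merged))  # {'benzinga': 3800, 'businesswire': 638}
```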

In [6]:
top20 = sum(list(news.values())[:20])
top15 = sum(list(news.values())[:15])
top10 = sum(list(news.values())[:10])
top5 = sum(list(news.values())[:5])

In [7]:
values = news.values()
total = sum(values)

In [8]:
total

75980

In [9]:
# the top 20 sources give us ~94% coverage
# if that is too much work, the top 15 would be good enough,
# still keeping us above 90%

print(top20/total)
print(top15/total)
print(top10/total)
print(top5/total)

0.9423795735719926
0.9148854961832061
0.8095551460910766
0.6031060805475125
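
The same coverage figures can be computed for any cutoff with a small helper. This is a sketch on toy counts; `coverage` is a hypothetical name, and `Counter.most_common` does the ranking, so the input does not need to be sorted first.

```python
from collections import Counter


def coverage(counts, n):
    """Fraction of all source mentions covered by the n most frequent sources."""
    total = sum(counts.values())
    top = sum(count for _, count in Counter(counts).most_common(n))
    return top / total


# Toy counts, not the notebook's data.
toy = {"A": 50, "B": 30, "C": 15, "D": 5}
for n in (1, 2, 3):
    print(n, coverage(toy, n))
```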


In [10]:
# we need article links for the top 20 sources; we can start with Yahoo

top20keys = list(news.keys())[:20]

In [11]:
# sentiment analyzer (download lexicon if needed)

# nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

In [12]:
def findNewsRows(source):
    # return the index of every row whose "source" list contains `source`;
    # each row appears once, in DataFrame order (assumes a unique index)
    return [idx for idx, row in df.iterrows() if source in row["source"]]

In [15]:
def printUrls(idx):
    # pair each source name with its URL for the row at position idx
    a = df.iloc[idx]["source"]
    b = df.iloc[idx]["url"]
    return [[s, u] for s, u in zip(a, b)]

In [305]:
b = findNewsRows(top20keys[11])
b

[0,
 2,
 19,
 20,
 21,
 27,
 30,
 31,
 39,
 45,
 57,
 70,
 81,
 82,
 85,
 89,
 101,
 103,
 109,
 115,
 123,
 124,
 125,
 130,
 133,
 134,
 138,
 139,
 144,
 145,
 146,
 148,
 150,
 151,
 152,
 154,
 157,
 158,
 159,
 161,
 162,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 175,
 176,
 177,
 178,
 180,
 181,
 182,
 184,
 203,
 204,
 205,
 206,
 207,
 208,
 209,
 210,
 211,
 212,
 213,
 218,
 220,
 221,
 224,
 226,
 230,
 232,
 233,
 234,
 236,
 237,
 238,
 240,
 241,
 242,
 247,
 248,
 250,
 251,
 254,
 255,
 256,
 257,
 303,
 304,
 305,
 306,
 312,
 313,
 318,
 319,
 320,
 321,
 326,
 339,
 340,
 346,
 348,
 349,
 350,
 357,
 360,
 378,
 379,
 380,
 381,
 382,
 384,
 385,
 386,
 395,
 400,
 401,
 403,
 406,
 408,
 410,
 413,
 415,
 420,
 421,
 423,
 425,
 428,
 432,
 453,
 454,
 459,
 467,
 468,
 540,
 544,
 551,
 567,
 572,
 606,
 659,
 681,
 683,
 694,
 695,
 700,
 701,
 704,
 705,
 714,
 717,
 721,
 724,
 731,
 732,
 733,
 746,
 747,
 761,
 770,
 773,
 774,
 776,
 777,
 789

In [309]:
a = printUrls(123)

for i in a:
    print(i)

['DowJones', 'https://finnhub.io/api/news?id=7c402ec87cbba4c75195c0a4cf31a517f207c0b27fab97b41d67dfd9aa33ecca']
['MarketWatch', 'https://finnhub.io/api/news?id=67f823db3f6de09760361ee570ce88036b4c7d77a06694fc920e527aa948c253']
['Benzinga', 'https://finnhub.io/api/news?id=2eb33a4edce904c28b4e1f8a5c7585ea3290b4f10a1784ba58b32fb64e690ce8']
['DowJones', 'https://finnhub.io/api/news?id=9a5d98919c885657d69d605051a1dee4806912b943f2846f5a899e5097fbb42e']
['MarketWatch', 'https://finnhub.io/api/news?id=ed3f8587b9f6f6a2879db3b4385061e61e92b50fb281862f8934c4a03d932873']
['benzinga', 'https://finnhub.io/api/news?id=e3f65917153a62d7424e5b4d686808073c893cb2090d76c43f4f4b0827f83704']
['MarketWatch', 'https://finnhub.io/api/news?id=639a2debe2064eb8980bc41b98c3207adee6abf4417cea15ef3798f30472f6e3']
['Benzinga', 'https://finnhub.io/api/news?id=6d9d5591f08f3b83c1b0ed6756c427b96a7a340d0dfc0a15c85bbed4042e9a59']
['benzinga', 'https://finnhub.io/api/news?id=35fe53b354c706dc97f77424d26c14a1d070b0c0842cc83e35

In [128]:
def collectHTML(url):
    headers = {"User-Agent":"Mozilla/5.0"}
    r = requests.get(url,headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

In [19]:
def pYahoo(soup):
    # look for the collapsed "read more" button; when it is present the body is not parsed
    outButton = soup.find_all("div", {"class": "caas-readmore caas-readmore-collapse caas-readmore-outsidebody caas-readmore-asidepresent"})
    paras = []  # stays empty when the body is skipped or missing
    if not outButton:
        body = soup.find("div", {"class": "caas-body"})
        if body is not None:
            paras = body.find_all("p")
    article = " ".join(str(i.text) for i in paras)
    relstocks = soup.find("ul", {"class": "caas-xray-pills xray-as-popup caas-xray-pills-top"})
    # the ticker list is not present on every page
    tickers = relstocks.find_all("li") if relstocks is not None else []
    stocks = [i.text for i in tickers]
    sent = sia.polarity_scores(article)
    return article, stocks, sent['compound']


In [20]:
u = 'https://finance.yahoo.com/news/southwest-ceo-gary-kelly-is-euphoric-about-airline-recovery-191342217.html'

soup = collectHTML(u)

In [21]:
body = soup.find("div", {"class": "caas-body"})

In [22]:
paras = body.find_all("p")

In [23]:
# collecting yahoo articles
u = 'https://finance.yahoo.com/news/airlines-wont-call-travelers-covid-19-vaccination-proof-vaccine-passports-110015915.html'

soup = collectHTML(u)
a,b,c = pYahoo(soup)

In [24]:
u = 'https://www.marketwatch.com/articles/airline-stocks-are-soaring-as-leisure-travel-keeps-growing-51615829245'
soup = collectHTML(u)


In [25]:
content = soup.find("div", {"class": "column column--full article__content"})

In [26]:
def pMWatch(soup):
    content = soup.find("div", {"class": "column column--full article__content"})
    # collapse runs of whitespace (raw string avoids the invalid-escape warning)
    article = re.sub(r'\s+', ' ', content.text).strip()

    tickers = soup.find("ul", {"class": "list list--tickers"})
    stock = tickers.find_all("bg-quote")
    stocks = []
    # keep every other <bg-quote> element; the first token of its text is the ticker
    for i in stock[::2]:
        a = i.text.split()
        stocks.append(a[0])

    sent = sia.polarity_scores(article)
    return article, stocks, sent['compound']

In [27]:
# collecting marketwatch articles
u = 'https://www.marketwatch.com/story/american-airlines-group-inc-stock-underperforms-tuesday-when-compared-to-competitors-01615926803-1d77385fe932'

soup = collectHTML(u)
a,b,c = pMWatch(soup)

In [129]:
u = "https://finnhub.io/api/news?id=c022d52df61b43d33425f3185eb0cc168b3b5b67a997f40c6ce9cd7fc8c95f94"

soup = collectHTML(u)

In [131]:
content = soup.find("div", {"class": "body__content"})
c = re.sub(r'\s+', ' ', content.text)
article = c.strip()

In [163]:
ticks = soup.find("section", {"class": "topics-in-this-story"})
texts = ticks.find_all("a")


In [164]:
for i in texts:
    print(i.text)

BHP
AAL
RIO
TER


In [165]:
def pNasdaq(soup):
    content = soup.find("div", {"class": "body__content"})
    c = re.sub(r'\s+', ' ', content.text)
    article = c.strip()
    
    ticks = soup.find("section", {"class": "topics-in-this-story"})
    texts = ticks.find_all("a")
    stocks = []
    for i in texts:
        stocks.append(i.text)
    
    sent = sia.polarity_scores(article)
    return article, stocks, sent['compound']

In [166]:
# collecting Nasdaq articles
u = 'https://finnhub.io/api/news?id=c022d52df61b43d33425f3185eb0cc168b3b5b67a997f40c6ce9cd7fc8c95f94'

soup = collectHTML(u)
a,b,c = pNasdaq(soup)

In [169]:
c

0.9804

In [187]:
content = soup.find("td", {"class": "newsContent"})
content.find("h1").text

'American Airlines: Future cash flow to be used to reduce debt'

In [188]:
def pFly(soup):
    content = soup.find("td", {"class": "newsContent"})
    article = content.find("h1").text
    sent = sia.polarity_scores(article)
    
    return article, [], sent["compound"]

In [189]:
# collecting theFly articles
u = "https://thefly.com/landingPageNews.php?id=3287478&headline=AAL-American-Airlines-Future-cash-flow-to-be-used-to-reduce-debt"
soup = collectHTML(u)

a,b,c = pFly(soup)
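
Parsers such as `pFly` return an empty list for related stocks because the page exposes no ticker element. One possible fallback, sketched here as an assumption rather than the notebook's method, is to scan the captured text for uppercase tokens that belong to a known ticker set. `find_tickers`, the `known` set, and the sample text are hypothetical.

```python
import re


def find_tickers(text, known_tickers):
    """Return known tickers that appear as standalone uppercase tokens,
    in order of first appearance and without repeats."""
    found = []
    for token in re.findall(r"\b[A-Z]{1,5}\b", text):
        if token in known_tickers and token not in found:
            found.append(token)
    return found


# Hypothetical ticker universe and text, for illustration only.
known = {"AAL", "BHP", "RIO", "TER"}
text = "AAL: American Airlines says future cash flow will reduce debt. BHP and RIO also moved."
print(find_tickers(text, known))  # ['AAL', 'BHP', 'RIO']
```

Tickers that are also ordinary uppercase words or abbreviations would produce false positives, so the known set should be limited to the symbols of interest.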

In [196]:
u = "https://www.cnbc.com/2021/04/22/what-to-watch-today-us-stock-futures-steady-after-a-wall-street-comeback.html"

soup = collectHTML(u)

In [202]:
def pCNBC(soup):
    try:
        content = soup.find("div", {"class": "ArticleBody-articleBody"})
        article = re.sub(r'\s+', ' ', content.text)
        sent = sia.polarity_scores(article)

        return article, [], sent["compound"]
    except AttributeError:
        # soup.find returned None: the page has no article body div
        return '', [], 0

In [208]:
# collecting cnbc articles
u = "https://www.cnbc.com/2021/04/22/what-to-watch-today-us-stock-futures-steady-after-a-wall-street-comeback.html"

soup = collectHTML(u)
a,b,c = pCNBC(soup)

In [217]:
u = "https://www.gurufocus.com/news/1405876/anglo-american-plc-stock-shows-every-sign-of-being-significantly-overvalued"

soup = collectHTML(u)

In [218]:
content = soup.find("div", {"class": "main-body"})

In [226]:
content.text

def pGuru(soup):
    content = soup.find("div", {"class": "main-body"})
    article = content.text
    sent = sia.polarity_scores(article)
    
    return article, [], sent["compound"]

In [227]:
# collecting guru articles
u = "https://www.gurufocus.com/news/1412285/wedbush-morgan-securities-inc-buys-vanguard-sp-500-etf-ishares-short-treasury-bond-etf-apple-inc-sells-pimco-enhanced-short-maturity-active-exchangetrad-aecom-allianzgi-artificial-intelligence-technology-opp"

soup = collectHTML(u)
a,b,c = pGuru(soup)

In [234]:
u = "https://www.benzinga.com/news/21/03/20282484/10-information-technology-stocks-with-unusual-options-alerts-in-todays-session"

soup = collectHTML(u)

In [235]:
content = soup.find("div", {"class": "article-content-body-only"})

In [237]:
article = re.sub(r'\s+', ' ', content.text)

In [240]:
def pBzinga(soup):
    content = soup.find("div", {"class": "article-content-body-only"})
    article = re.sub(r'\s+', ' ', content.text)
    article = article.strip()
    
    sent = sia.polarity_scores(article)
    return article, [], sent["compound"]

In [241]:
# collecting benzinga articles
u = "https://www.benzinga.com/news/21/03/20286063/the-qqq-rallied-today-heres-why"

soup = collectHTML(u)
a,b,c = pBzinga(soup)

In [250]:
u = "https://www.reuters.com/article/us-health-coronavirus-american-airline/american-airlines-readies-more-jets-to-meet-rising-demand-idUSKBN2BL21K"

soup = collectHTML(u)

In [254]:
content = soup.find("div", {"class": "ArticleBodyWrapper"})
c = content.find_all("p",{"class": "Paragraph-paragraph-2Bgue ArticleBody-para-TD_9x"})

In [257]:
article = ''
for i in c:
    article = article + " " + i.text
    
article = article.strip()

(Reuters) - American Airlines said on Monday it expects to fly most of its fleet in the coming months thanks to strong domestic and short-haul international bookings as COVID-19 infection rates and hospitalizations decline and more people receive vaccines. American said that as of March 26, average bookings for the next seven days had reached 90% of levels experienced before the pandemic upended air travel in 2019, with a domestic load factor of about 80%. “The Company presently expects this strength in bookings to continue through the end of the first quarter and into the second quarter,” it said in a regulatory filing. Shares in U.S. airlines, which parked hundreds of jets as demand plummeted last year, have climbed this year amid hopes for a recovery. The U.S. Transportation Security Administration (TSA) screened 1.57 million passengers on Sunday, the highest number since March 2020. Following the increase in travel demand so far this year, American said it expects its system capaci

In [258]:
def pReuters(soup):
    content = soup.find("div", {"class": "ArticleBodyWrapper"})
    c = content.find_all("p",{"class": "Paragraph-paragraph-2Bgue ArticleBody-para-TD_9x"})
    article = ''
    for i in c:
        article = article + " " + i.text

    article = article.strip()
    
    sent = sia.polarity_scores(article)
    return article, [], sent["compound"]

In [261]:
# collecting reuters articles
u = "https://www.reuters.com/article/us-american-airlines-debt/american-airlines-cuts-debt-by-2-8-billion-idUSKBN2BN3DX"

soup = collectHTML(u)
a,b,c = pReuters(soup)

0.6249

In [291]:
u = "https://investorplace.com/2021/04/midday-market-update-the-10-most-active-stocks-today-3/"

soup = collectHTML(u)

content = soup.find("div", {"class": "col-xs article"})


In [292]:
content.text

'\n\nFriday is halfway over and that means it’s time to measure up the stocks seeing the most movement today in our midday market update. We’re seeing a lot of tech stocks show up on the list today with many of them seeing share prices rising.\n\n'

In [296]:
def pInvPlace(soup):
    content = soup.find("div", {"class": "col-xs article"})
    article = re.sub(r'\s+', ' ', content.text)
    article = article.strip()
    
    sent = sia.polarity_scores(article)
    return article, [], sent["compound"]

In [297]:
# collecting investorplace articles
u = "https://investorplace.com/2021/05/3-consumer-cyclical-stocks-to-buy-for-the-coming-travel-explosion/"

soup = collectHTML(u)
a,b,c = pInvPlace(soup)

In [299]:
c

0.9986

In [319]:
u = "https://www.wsj.com/articles/google-to-lower-service-fee-for-play-store-app-developers-11615907010"

soup = collectHTML(u)

content = soup.find("div", {"class": "wsj-snippet-body"})


In [320]:
content.text

'\nGoogle is reducing the cut it takes from app sales in its Play store, joining rival  Apple Inc.  in shrinking commissions as the power the tech giants wield through their digital marketplaces has drawn the ire of developers and scrutiny of regulators. \nThe company behind the world’s largest mobile operating system, Android, said Tuesday that it would reduce service fee it collects from 30% to 15% on the first $1 million developers earn from its app store. The reduction, which begins in July, is a slight departure from  Apple Inc . ’s decision late last year to reduce its rate to 15% for software makers who generate less than $1 million in annual sales.\nGoogle, the main business of  Alphabet Inc.,  and Apple have built multibillion-dollar digital empires over the past decade by becoming the primary gatekeepers for apps that are downloaded to smartphones and other mobile devices world-wide. Their position of power has drawn criticism from developers large and small over the amount o

In [321]:
def pDowjones(soup):
    content = soup.find("div", {"class": "wsj-snippet-body"})
    article = re.sub(r'\s+', ' ', content.text)
    article = article.strip()
    
    sent = sia.polarity_scores(article)
    return article, [], sent["compound"]

In [322]:
# collecting dow articles
u = "https://www.wsj.com/articles/american-jetblue-alliance-draws-increased-scrutiny-from-justice-department-11618423182"

soup = collectHTML(u)
a,b,c = pDowjones(soup)

In [325]:
a

'WASHINGTON—The Justice Department has stepped up an antitrust probe of American Airlines Group Inc.’s recent partnership with JetBlue Airways Corp. and is concerned the deal could lead to anticompetitive coordination and inflated fares at key traffic hubs, according to people familiar with the matter. Department antitrust officials harbored reservations about the deal before the recent change in presidential administrations, but their scrutiny of the alliance has increased in recent months, the people said, an early signal of the Biden administration’s interest in antitrust enforcement. The department is concerned that an American-JetBlue alliance could diminish competition at congested Northeast airports in New York and Boston that are hubs for travel around the U.S. and internationally. The investigation is continuing and no final conclusions have been reached. Any Justice Department decision could be affected by the views of the Transportation Department, which has broad regulatory
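
With one parser per source, a small dispatcher can route each (source, soup) pair to the right function. The sketch below is an assumption about how the pieces might be wired together, not code from the notebook: `pick_parser`, `parse_article`, and `fake_parser` are hypothetical names, and the lookup is case-insensitive so that 'Benzinga' and 'benzinga' share a parser.

```python
def pick_parser(source, parsers):
    """Look up a parser by source name, ignoring case and surrounding spaces."""
    return parsers.get(source.strip().lower())


def parse_article(source, soup, parsers):
    """Run the matching parser; return empty values if none exists or parsing fails."""
    parser = pick_parser(source, parsers)
    if parser is None:
        return "", [], 0.0
    try:
        return parser(soup)
    except AttributeError:
        # A soup.find(...) call returned None because the expected element was missing.
        return "", [], 0.0


# Stand-in parser so the sketch runs on its own. In the notebook the mapping
# would hold the functions defined above, e.g. {"yahoo": pYahoo, "marketwatch": pMWatch}.
def fake_parser(soup):
    return "article text", ["AAL"], 0.5


parsers = {"benzinga": fake_parser}
print(parse_article("Benzinga", None, parsers))
print(parse_article("Unknown Source", None, parsers))
```

Every branch returns the same (article, stocks, compound) triple as the parsers above, so the results can be appended to the DataFrame columns without special cases.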