# News Database
<br>
There was a Kaggle competition (https://www.kaggle.com/c/two-sigma-financial-news) by Two Sigma a few months ago that looked for potential correlations between news events and stock performance. The competition provided both news and stock market data, sourced from Thomson Reuters and Intrinio. One of the key features in the news data is a sentiment score (along with a confidence value for that score). Unfortunately, no details were given on the sentiment scoring model. A sentiment score (sometimes referred to as a polarity score) indicates whether a text expresses negative, positive or neutral sentiment.

![Data is King](https://www.denofprogramming.com/wp-content/uploads/2015/07/KingData-300x236.jpg)

Therefore, this notebook explores how sentiment scoring works, then curates local news and scores news events, with the goal of training an ML model for local news sentiment scoring.

## Strategy
The "Hello World" of sentiment analysis is classifying IMDB movie reviews. The following article is a good starting point; it uses vectorization techniques (bag-of-words) with a machine learning model (https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184). The general strategy for bag-of-words is: <br>
<br>
1. Clean the text: remove non-essential characters (commas, semicolons, angle brackets, etc.), i.e. strip punctuation and HTML tags and convert to lower case<br>
    a. remove stop words - words that carry little weight (e.g. they, we, if, I, you)<br>
    b. stemming and lemmatization<br>
        i. stemming - chop words down to a root form with simple rules (brute force) <br>
        ii. lemmatization - reduce words to their dictionary form (lemma) using more complex, vocabulary-aware transformations <br>
2. Tokenize and vectorize every word in all comments (the collection of documents is known as the corpus) <br>
    a. vectorize combinations of words (n-gram technique) <br>
    b. weight word counts by inverse document frequency (TF-IDF), i.e. words that appear in many documents get a lower weight than words that appear in only a few <br>
3. Pair X (vectors) with Y (scores) and use the sklearn library to train a model <br>
    a. the resulting vectors are typically sparse - lots of zeros (i.e. most words do not appear in most comments). A common choice is an SVM with a linear kernel for separation. <br>
    b. optional: use a neural network to classify sentiment.
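
The steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the method from the linked article: the four reviews and their labels are made up, `clean` is a hypothetical helper, and stemming/lemmatization (step 1b) is left out to keep the example short.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: made-up reviews and labels (1 = positive, 0 = negative).
texts = [
    "A wonderful, moving film.<br />I loved it!",
    "Great acting and a great story.",
    "Terrible plot; I hated every minute.",
    "Boring, awful and far too long.",
]
labels = [1, 1, 0, 0]


def clean(text):
    """Step 1: drop HTML tags and punctuation, then lower-case."""
    text = re.sub(r"<[^>]+>", " ", text)      # HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # punctuation and digits
    return text.lower()


# Step 2: TF-IDF over unigrams and bigrams, with English stop words removed.
vectorizer = TfidfVectorizer(
    preprocessor=clean, stop_words="english", ngram_range=(1, 2)
)

# Step 3: a linear SVM trained on the sparse TF-IDF vectors.
model = make_pipeline(vectorizer, LinearSVC())
model.fit(texts, labels)

predictions = model.predict(["I loved the story", "an awful, boring film"])
print(predictions)
```

With only four training sentences the predictions are not meaningful; the point is the shape of the pipeline, which stays the same when the toy corpus is replaced with real labelled text.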

As a starter, we will collect some news from a local site.

In [28]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

url = "http://www.thestar.com.my/business"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
#print(soup.prettify())

for tag_object in soup.find_all('a'):
    print(tag_object.get_attribute_list("data-content-title"))

['The Star Online']
['ePaper']
['Log In']
[None]
['https://login.thestar.com.my/accountinfo/profile.aspx']
['https://login.thestar.com.my/accountinfo/changepassword.aspx']
['https://login.thestar.com.my/accountinfo/subscriptioninfo.aspx']
['https://login.thestar.com.my/accountinfo/billing.aspx']
['https://login.thestar.com.my/accountinfo/transhistory.aspx']
['http://www.thestar.com.my/foryou/edit']
['http://www.thestar.com.my/saved-articles']
['https://www.thestar.com.my/faqs/']
['https://www.thestar.com.my']
['The Star Online']
['Home']
['For You']
['News']
['Latest']
['Nation']
['Asean+']
['World']
['Environment']
['In Other Media']
['True or Not']
['Focus']
['Business']
['News']
['StarBiz Premium']
['SMEBiz']
['Market Watch']
['Bursa Overview']
['Market Movers']
['Financial Results']
['Dividends']
['Bonus']
['IPO']
['Unit Trust']
['Exchange Rates']
['My Portfolio']
['Sport']
['Football']
['Golf']
['Badminton']
['Tennis']
['Motorsport']
['Community Sports']
['Other Sports']
['Say Wha

Unfortunately, the website uses JavaScript to render the article links, so the static HTML fetched with requests does not contain them. Selenium, which drives a real browser, is a better tool for this purpose.

In [15]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--disable-infobars")
options.add_argument("--incognito")

url = "http://www.thestar.com.my/business"
# `options=` replaces the deprecated `chrome_options=` keyword.
browser = webdriver.Chrome(options=options)
browser.get(url)
# Article headlines are links nested inside <h2> elements.
items = browser.find_elements(By.XPATH, '//h2/a[@href]')

In [19]:
# Read the attributes while the browser is still open;
# the elements cannot be queried once the session is closed.
DF_list = []
for item in items:
    DF_list.append([item.get_attribute('data-content-title'), item.get_attribute('data-content-author'), item.get_attribute('href')])
browser.quit()

In [20]:
import pandas as pd

DF = pd.DataFrame(DF_list, columns = ["Title", "Author", "Link"])

In [30]:
DF = DF.applymap(str)
DF = DF[DF["Title"] != "None"]
DF

| | Title | Author | Link |
|---|---|---|---|
| 0 | Chips sector hard hit by Covid-19 | | https://www.thestar.com.my/business/business-n... |
| 1 | Foreign selling extends to third week | Leong Hung Yee | https://www.thestar.com.my/business/business-n... |
| 2 | Carnage in oil markets batter Bursa's oil and ... | Joseph Chin | https://www.thestar.com.my/business/business-n... |
| 3 | Quick take: Oil and gas counters tumble after ... | | https://www.thestar.com.my/business/business-n... |
| 4 | Affin Hwang maintains 'netural' on telcos, Max... | | https://www.thestar.com.my/business/business-n... |
| 5 | Trading ideas: TRC Synergy, Acoustech, Vsolar,... | | https://www.thestar.com.my/business/business-n... |
| 6 | Saudi Arabia plans big oil output hike, beginn... | | https://www.thestar.com.my/business/business-n... |
| 7 | Vehicle sales expected to grow 9% this year | | https://www.thestar.com.my/business/business-n... |
| 8 | Datasonic MD says firm’s ability to secure gov... | | https://www.thestar.com.my/business/business-n... |
| 9 | Ringgit extends last week's loss against the US$ | | https://www.thestar.com.my/business/business-n... |


In [106]:
import requests
from bs4 import BeautifulSoup
import time

# Notice paragraph used as the end-of-article marker.
UNAVAILABLE_NOTICE = "We're sorry, this article is unavailable at the moment. If you wish to read this article, kindly contact our Customer Service team at 1-300-88-7827. Thank you for your patience - we're bringing you a new and improved experience soon!"

contentDFList = []

for i in DF["Link"]:
    print("Scraping Link",i)
    response = requests.get(i)
    soup = BeautifulSoup(response.content,'html.parser')
    element_list = soup.find_all("p")
    if element_list != []:
        text_list = []
        for element in element_list:
            text_list.append(element.get_text())
        # The body ends at the notice or, failing that, at the first " " paragraph.
        # list.index raises ValueError when the marker is absent.
        try:
            trimmingIndex = text_list.index(UNAVAILABLE_NOTICE)
        except ValueError:
            try:
                trimmingIndex = text_list.index(" ")
            except ValueError:
                contentDFList.append([i,"NA","NA","NA"])
                continue
        # The first paragraph holds the date header, e.g. "Monday, 09 Mar 2020".
        day = text_list[0].replace("\n","").strip().split(",")[0]
        date = text_list[0].replace("\n","").strip().split(",")[1].strip()
        content = ''.join(text_list[1:trimmingIndex])
        contentDFList.append([i,day,date,content])
    else:
        contentDFList.append([i,"NA","NA","NA"])
    # Pause between requests to avoid hammering the site.
    time.sleep(2)

Scraping Link https://www.thestar.com.my/business/business-news/2020/03/09/chips-sector-hard-hit-by-covid-19
Scraping Link https://www.thestar.com.my/business/business-news/2020/03/09/foreign-selling-extends-to-third-week
Scraping Link https://www.thestar.com.my/business/business-news/2020/03/09/carnage-in-oil-markets-batter-bursa039s-oil-and-gas-stocks
Scraping Link https://www.thestar.com.my/business/business-news/2020/03/09/quick-take-oil-and-gas-counters-tumble-after-oil-price-crashes
Scraping Link https://www.thestar.com.my/business/business-news/2020/03/09/affin-hwang-maintains-039netural039-on-telcos-maxis-is-top-pick
Scraping Link https://www.thestar.com.my/business/business-news/2020/03/09/trading-ideas-trc-synergy-acoustech-vsolar-sinotop
Scraping Link https://www.thestar.com.my/business/business-news/2020/03/09/saudi-arabia-plans-big-oil-output-hike-beginning-all-out-price-war
Scraping Link https://www.thestar.com.my/business/business-news/2020/03/09/vehicle-sales-expected-t

In [109]:
contentDF = pd.DataFrame(contentDFList, columns = ["Link","Day","Date","Content"])
mergedDF = DF.merge(contentDF, on="Link")
mergedDF.drop(columns=["Link"])

| | Title | Author | Day | Date | Content |
|---|---|---|---|---|---|
| 0 | Chips sector hard hit by Covid-19 | | Monday | 09 Mar 2020 | By DAVID TANMini-Circuits Technologies In Baya... |
| 1 | Foreign selling extends to third week | Leong Hung Yee | Monday | 09 Mar 2020 | By Leong Hung YeeInternational investors took ... |
| 2 | Carnage in oil markets batter Bursa's oil and ... | Joseph Chin | Monday | 09 Mar 2020 | By Joseph ChinKUALA LUMPUR: The carnage in oil... |
| 3 | Quick take: Oil and gas counters tumble after ... | | Monday | 09 Mar 2020 | KUALA LUMPUR: Oil and gas counters on Bursa Ma... |
| 4 | Affin Hwang maintains 'netural' on telcos, Max... | | Monday | 09 Mar 2020 | |
| 5 | Trading ideas: TRC Synergy, Acoustech, Vsolar,... | | Monday | 09 Mar 2020 | KUALA LUMPUR: JF Apex Research expects TRC SYN... |
| 6 | Saudi Arabia plans big oil output hike, beginn... | | Monday | 09 Mar 2020 | DUBAI: Saudi Arabia plans to increase oil outp... |
| 7 | Vehicle sales expected to grow 9% this year | | Monday | 09 Mar 2020 | By EUGENE MAHALINGAMAnalysts sceptical based o... |
| 8 | Datasonic MD says firm’s ability to secure gov... | | Monday | 09 Mar 2020 | By INTAN FARHANA ZAINULDatasonic MD Chew Ben B... |
| 9 | Ringgit extends last week's loss against the US$ | | Monday | 09 Mar 2020 | KUALA LUMPUR: The ringgit extended last week's... |
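
The `Day` and `Date` columns are plain strings split from the first paragraph of each article. A minimal sketch of turning that header into a real date, assuming it always has the form `Monday, 09 Mar 2020` seen in the output above; `parse_article_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import date, datetime


def parse_article_date(header):
    """Parse a header such as 'Monday, 09 Mar 2020' into a date, or None."""
    parts = header.replace("\n", "").strip().split(",")
    if len(parts) < 2:
        return None  # no comma, so no "day, date" structure to parse
    try:
        return datetime.strptime(parts[1].strip(), "%d %b %Y").date()
    except ValueError:
        return None  # the date part is not in the assumed format


print(parse_article_date("\nMonday, 09 Mar 2020\n"))  # prints 2020-03-09
```

Returning `None` instead of raising keeps a single malformed article from stopping the whole scrape, and real dates make it easier to sort or filter the news later.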


So far this is a good start. Now let's combine each component into a Python class that can be run daily.

In [None]:
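import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Notice paragraph used as the end-of-article marker.
UNAVAILABLE_NOTICE = "We're sorry, this article is unavailable at the moment. If you wish to read this article, kindly contact our Customer Service team at 1-300-88-7827. Thank you for your patience - we're bringing you a new and improved experience soon!"

class theStarBusinessScraper():
    def __init__(self):
        # The scheme is required by both Selenium and requests.
        self.url = "http://www.thestar.com.my/business"
        self.chromeOptions = Options()
        self.chromeOptions.add_argument("--incognito")

    def startBrowser(self):
        # Keep the browser open: elements can only be read while the session is alive.
        self.browser = webdriver.Chrome(options=self.chromeOptions)
        self.browser.get(self.url)
        return self.browser.find_elements(By.XPATH, '//h2/a[@href]')

    def scrapeMainPage(self, items):
        DF_list = []
        for item in items:
            DF_list.append([item.get_attribute('data-content-title'), item.get_attribute('data-content-author'), item.get_attribute('href')])
        # All attributes have been read, so the browser can be shut down.
        self.browser.quit()
        DF = pd.DataFrame(DF_list, columns = ["Title", "Author", "Link"])
        DF = DF.applymap(str)
        DF = DF[DF["Title"] != "None"]
        return DF

    def scrapeIndividualPage(self, DF):
        # Same logic as the article-scraping cell above; returns the merged frame.
        rows = []
        for link in DF["Link"]:
            soup = BeautifulSoup(requests.get(link).content, 'html.parser')
            text_list = [p.get_text() for p in soup.find_all("p")]
            # The body ends at the notice or, failing that, at the first " " paragraph.
            end = next((text_list.index(m) for m in (UNAVAILABLE_NOTICE, " ") if m in text_list), None)
            header = text_list[0].replace("\n", "").strip().split(",") if text_list else []
            row = [link, "NA", "NA", "NA"]
            if end is not None and len(header) > 1:
                row = [link, header[0], header[1].strip(), ''.join(text_list[1:end])]
            rows.append(row)
            time.sleep(2)
        contentDF = pd.DataFrame(rows, columns = ["Link", "Day", "Date", "Content"])
        return DF.merge(contentDF, on="Link")

if __name__ == "__main__":
    scraper = theStarBusinessScraper()
    mainPage = scraper.startBrowser()
    DF = scraper.scrapeMainPage(mainPage)
    mergedDF = scraper.scrapeIndividualPage(DF)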
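
Since the scraper is meant to run daily, the same article will often show up on consecutive days. One possible way to keep a local news database without duplicates is sketched below with the standard-library `sqlite3` module. The table name `news`, the column layout and the `save_articles` helper are assumptions for illustration; the notebook itself does not define a storage format.

```python
import sqlite3


def save_articles(conn, rows):
    """Insert (link, title, author, date, content) rows, skipping known links.

    Returns the number of rows that were actually added.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        "link TEXT PRIMARY KEY, title TEXT, author TEXT, date TEXT, content TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
    # The link is the primary key, so re-scraped articles are silently ignored.
    conn.executemany("INSERT OR IGNORE INTO news VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
    return after - before


# In-memory database and a made-up row, for demonstration only.
conn = sqlite3.connect(":memory:")
rows = [("https://example.com/a", "Title A", "Author", "2020-03-09", "Body text")]
first = save_articles(conn, rows)   # the row is new, so it is stored
second = save_articles(conn, rows)  # same link again, so nothing is added
print(first, second)  # prints 1 0
```

For real use, the in-memory connection would be replaced with a file path, and the rows could be built from the merged DataFrame, for example with `mergedDF[["Link", "Title", "Author", "Date", "Content"]].values.tolist()`.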
    