# News Sentiment Database
<br>
There was a kaggle competition (https://www.kaggle.com/c/two-sigma-financial-news) by Two Sigma a few months ago, looking to identify any potential correlation between news event and stock performance. In the competition, news and stock market data were provided. The source of data are from Thomson Reuters and Intrino (which is part of Thomson Reuters as well). One of the key feature in the news data is sentiment score (and confidence of the score). Unfortunately, there was no clarification/details on the sentiment scoring model. A sentiment score (can sometimes be refer as polarity score) refers to having negative, postive or neutral expression.

![Data is King](https://www.denofprogramming.com/wp-content/uploads/2015/07/KingData-300x236.jpg)

Therefore, the notebook is to explore and understand how sentiment scoring works, then curate local news, score news event with the goal to train a ML model for local news sentiment scoring

## Strategy
The "Hello World" of Sentiment Analysis begins with "Classifying IMDB movie reviews". The following is a good starting point using vectorization techniques (bag-of-words) with machine learning model (https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184). The general strategy for bag-of-words is to : <br>
<br>
1. Remove any non-essential characters (commas, semicolons, angled brackets, etc) - removing punctuation, HTML tags, forced lower case<br>
    a. removing stop words - words that do not carry weights (i.e. they, we, if, I, you, etc)<br>
    b. stemming and lemmatization<br>
        i. stemming - cut off words to root words (brute force) <br>
        ii. lemmatization - complex transformation to root words <br>
2. Vectorize/Tokenize every word in all comments (document database is also known as corpus) <br>
    a. vectorize combination of word (n-gram technique) <br>
    b. vectorize words importance and words count with inverse relationship (TF-IDF) (i.e. words that appear many times have lower importance than words that appear once or twice) <br>
3. Match X (vector) and Y (score) and use sklearn library to train a model <br>
    a. the produced vector are typically sparse in nature - lots of zeros (i.e. some words are not found in other comments). Common model is to use SVM with linear kernel for separation. <br>
    b. optional : use a neural network to classify sentiment.

As a starter, we will collect some news from a local site.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

url = "http://www.thestar.com.my/business"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
#print(soup.prettify())

for tag_object in soup.find_all('a'):
    print(tag_object.get_attribute_list("data-content-title"))

['The Star Online']
['ePaper']
['Log In']
[None]
['https://login.thestar.com.my/accountinfo/profile.aspx']
['https://login.thestar.com.my/accountinfo/changepassword.aspx']
['https://login.thestar.com.my/accountinfo/subscriptioninfo.aspx']
['https://login.thestar.com.my/accountinfo/billing.aspx']
['https://login.thestar.com.my/accountinfo/transhistory.aspx']
['http://www.thestar.com.my/foryou/edit']
['http://www.thestar.com.my/saved-articles']
['https://www.thestar.com.my/faqs/']
['https://www.thestar.com.my']
['The Star Online']
['Home']
['For You']
['News']
['Latest']
['Nation']
['Asean+']
['World']
['Environment']
['In Other Media']
['True or Not']
['Focus']
['Business']
['News']
['StarBiz Premium']
['SMEBiz']
['Market Watch']
['Bursa Overview']
['Market Movers']
['Financial Results']
['Dividends']
['Bonus']
['IPO']
['Unit Trust']
['Exchange Rates']
['My Portfolio']
['Sport']
['Football']
['Golf']
['Badminton']
['Tennis']
['Motorsport']
['Community Sports']
['Other Sports']
['Say Wha

Unfortunately, the website uses javascript to display HTML object and requests does not have framework to understand javascript. Selenium is a better tool for this purpose.

## Combination of Selenium and BeautifulSoup
1. Selenium's webdriver to scrape mainpage for link
2. Scour each link with request and bs4
3. extract day, date, author, title, content


In [2]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("disabled-infobars")
options.add_argument("--incognito")

url = "http://www.thestar.com.my/business"
browser = webdriver.Chrome(chrome_options = options)
browser.get(url)
#tag_element = browser.find_elements_by_xpath('//*[@id="form1"]')
items = browser.find_elements_by_xpath('//h2/a[@href]')


  if __name__ == '__main__':


In [3]:
import pandas as pd
DF_list = []

for item in items:
    DF_list.append([item.get_attribute('data-content-title'), item.get_attribute('data-content-author'), item.get_attribute('href')])

browser.close()

DF = pd.DataFrame(DF_list, columns = ["Title", "Author", "Link"])
DF = DF.applymap(str)
DF = DF[DF["Title"] != "None"]
DF

Unnamed: 0,Title,Author,Link
0,Tough job lies in wait for new Cabinet,,https://www.thestar.com.my/business/business-n...
1,"Bursa stages mild rebound, PChem and banks lift",Joseph Chin,https://www.thestar.com.my/business/business-n...
2,Quick take: Magni-Tech’s falls after earnings ...,,https://www.thestar.com.my/business/business-n...
3,Quick take: Uzma shares rise 9% on contract news,,https://www.thestar.com.my/business/business-n...
4,"Price war could spark new downcycle for oil, d...",,https://www.thestar.com.my/business/business-n...
5,"Trading ides: Leong Hup, Uzma, Kim Teck Cheong...",,https://www.thestar.com.my/business/business-n...
6,Direct hit seen for oil and gas companies,,https://www.thestar.com.my/business/business-n...
7,Markets in turmoil as oil price crashes,,https://www.thestar.com.my/business/business-n...
8,Zafrul quits CIMB CEO post,,https://www.thestar.com.my/business/business-n...
9,"Ringgit weakens against US$ on Covid-19, plung...",,https://www.thestar.com.my/business/business-n...


## Function Declaration for 
##### 1. Displaying html page with tag p
##### 2. Scrape individual link and return as list
##### 3. Scan all tag p, combine list into single DF

In [7]:
import requests
from bs4 import BeautifulSoup
import time

def showSoupP(link):
    print("Scraping Link",link)
    response = requests.get(link)
    soup = BeautifulSoup(response.content,'html.parser')
    element_list = soup.find_all("p")
    for i in element_list:
        print(i.get_text())

def scrapeLink(link):
    print("Scraping Link",link)
    response = requests.get(link)
    soup = BeautifulSoup(response.content,'html.parser')
    element_list = soup.find_all("p")
    if element_list != []:
        text_list = []
        for element in element_list:
            text_list.append(element.get_text())
        try:
            trimmingIndex = text_list.index("We're sorry, this article is unavailable at the moment. If you wish to read this article, kindly contact our Customer Service team at 1-300-88-7827. Thank you for your patience - we're bringing you a new and improved experience soon!")
        except:
            try:
                trimmingIndex = text_list.index(" ")
            except:
                return [link,"NA","NA","NA","NA"]
        day = text_list[0].replace("\n","").strip().split(",")[0]
        date = text_list[0].replace("\n","").strip().split(",")[1].strip()
        if ("by" in text_list[1].lower()) and (len(text_list[1]) < 50 ):
            author = text_list[1].lower().strip("by").strip()
            content = ''.join(text_list[2:trimmingIndex])
        else:
            author = "NA"
            content = ''.join(text_list[1:trimmingIndex])
    else:
        return [link,"NA","NA","NA","NA"]
    
    return [link,day,date,author,content]
 
def scrapeIndividualPage(args):
    contentDFList = []

    if (isinstance(args,pd.DataFrame)):
        for i in args["Link"]:
            result = scrapeLink(i)
            contentDFList.append(result)
            time.sleep(2)
    else:
        result = scrapeLink(args)
        contentDFList.append(result)
    
    DF = pd.DataFrame(contentDFList, columns = ["Link","Day","Date","_Author","Content"])
    return DF

In [8]:
singleDF = scrapeIndividualPage("https://www.thestar.com.my/business/business-news/2020/03/10/markets-in-turmoil-as-oil-price-crashes")
singleDF

Scraping Link https://www.thestar.com.my/business/business-news/2020/03/10/markets-in-turmoil-as-oil-price-crashes


Unnamed: 0,Link,Day,Date,_Author,Content
0,https://www.thestar.com.my/business/business-n...,Tuesday,10 Mar 2020,daniel khoo,"Oil prices tanked by more than 30%, sending th..."


## Final Output

In [13]:
#contentDF = scrapeIndividualPage(DF)
mergedDF = DF.merge(contentDF, left_on="Link", right_on="Link")
mergedDF["Author"].update(mergedDF.pop("_Author"))
mergedDF = mergedDF.drop(columns=["Link"])
mergedDF

Unnamed: 0,Title,Author,Day,Date,Content
0,Tough job lies in wait for new Cabinet,tee lin sa,Tuesday,10 Mar 2020,The new Cabinet has a tall order ahead of them...
1,"Bursa stages mild rebound, PChem and banks lift",joseph chin,Tuesday,10 Mar 2020,"At Bursa on Monday, foreign funds stepped up t..."
2,Quick take: Magni-Tech’s falls after earnings ...,,Tuesday,10 Mar 2020,KUALA LUMPUR: Shares in Magni-Tech Industries ...
3,Quick take: Uzma shares rise 9% on contract news,,Tuesday,10 Mar 2020,KUALA LUMPUR: UZMA BHD shares advanced almost ...
4,"Price war could spark new downcycle for oil, d...",,Tuesday,10 Mar 2020,
5,"Trading ides: Leong Hup, Uzma, Kim Teck Cheong...",,Tuesday,10 Mar 2020,"KUALA LUMPUR: Leong Hup International Bhd, UZM..."
6,Direct hit seen for oil and gas companies,,Tuesday,10 Mar 2020,UOB Kay Hian said that the combination of Covi...
7,Markets in turmoil as oil price crashes,daniel khoo,Tuesday,10 Mar 2020,"Oil prices tanked by more than 30%, sending th..."
8,Zafrul quits CIMB CEO post,"commenting on his new appointment, tengku zafr...",Tuesday,10 Mar 2020,PETALING JAYA: CIMB GROUP HOLDINGS BHD group c...
9,"Ringgit weakens against US$ on Covid-19, plung...",,Tuesday,10 Mar 2020,KUALA LUMPUR: The ringgit remained weaker agai...


So far this is a good start. There are a few place for improvements (i.e. scraping cleaner content text, dropping missing content, cleaner author naming convention). Lets combine each component into a python class that could be executed daily and collect more data points. We will then combine the 5-days output to a larger DF for ranking. Let's start with some cool NLP work !

## NLP (Part-2)