# News Sentiment Database
So far we have curated five days' worth of content from a local news site. The dataset holds basic information such as author, day, date, title and article text. However, the articles have no ranking, rating or score yet. In this notebook we therefore aim to:
1. Use the Google Cloud Natural Language API to score each article, building a reference target score (Y).
2. Compare the target score with the scores given by internal observers.
3. Vectorize each article (X) and build a personal rule-based model to rank/score articles.

## Prerequisites:
1. Google Cloud Platform (GCP) account <br>
(https://cloud.google.com/natural-language/docs/quickstart) <br>
**Google is generous enough to offer first-time users \$300 worth of credit to try the platform.**

The following code snippets and function calls are inspired by Google's tutorial (https://cloud.google.com/natural-language/docs/sentiment-tutorial).

In [60]:
import requests
import json

from bs4 import BeautifulSoup

from google.cloud import language
from google.oauth2 import service_account
from google.cloud.language import enums
from google.cloud.language import types

client = language.LanguageServiceClient.from_service_account_json('thestarnlp-d16fce32e2dc.json')

def googlenlpurl(client, url, invalid_types = ['OTHER'], **data):
   
        # Fetch the article text and stop early if nothing could be extracted.
        text = load_text_from_url(url)
        if not text:
            return None
   
        document = types.Document(
            content=text,
            language="en",
            type=language.enums.Document.Type.PLAIN_TEXT)
        
        # annotate_text bundles several analyses into a single API call instead of calling
        # analyzeSentiment, analyzeEntities, analyzeSyntax and classifyText individually.
        # Toggling each feature between True and False selects which results are returned.
        features = {'extract_syntax': False,
                'extract_entities': False,
                'extract_document_sentiment': True,
                'extract_entity_sentiment': False,
                'classify_text': False
                }
   
        response = client.annotate_text(document=document, features=features)
        sentiment = response.document_sentiment
        entities = response.entities
   
        response = client.classify_text(document)
        categories = response.categories
         
        def get_type(entity_type):
            # Translate the numeric entity type into its enum name.
            return enums.Entity.Type(entity_type).name
   
        result = {}
   
        result['sentiment'] = []    
        result['entities'] = []
        result['categories'] = []

        if sentiment:
            result['sentiment'] = [{'magnitude': sentiment.magnitude, 'score':sentiment.score}]
         
        #for entity in entities:
        #    if get_type(entity.type) not in invalid_types:
        #        result['entities'].append({'name': entity.name, 'type': get_type(entity.type), 'salience': entity.salience, 'wikipedia_url': entity.metadata.get('wikipedia_url', '-')  })
         
        for category in categories:
            result['categories'].append({'name':category.name, 'confidence': category.confidence})
         
         
        return result

def googlenlptext(client, text, invalid_types = ['OTHER'], **data):
   
        document = types.Document(
        content=text,
        language ="en",
        type=language.enums.Document.Type.PLAIN_TEXT )
   
        response = client.analyze_sentiment(document=document)
        sentiment = response.document_sentiment
   
        response = client.classify_text(document)
        categories = response.categories
   
        result = {}
   
        result['sentiment'] = []    
        result['categories'] = []

        if sentiment:
            result['sentiment'] = [{ 'magnitude': sentiment.magnitude, 'score':sentiment.score }]

        for category in categories:
            result['categories'].append({'name':category.name, 'confidence': category.confidence})
         
        return result

    
    
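def load_text_from_url(url):
    """Download an article page and return its body text, or None on failure."""
    print("Extracting text from: {}".format(url))
    try:
        response = requests.get(url)
    except requests.RequestException:
        print('Problem with url: {0}.'.format(url))
        return None

    if response.status_code != 200:
        return None

    # Collect the text of every <p> element on the page.
    soup = BeautifulSoup(response.content, "html.parser")
    text_list = [element.get_text() for element in soup.find_all("p")]

    startingIndex = None
    trimmingIndex = len(text_list)  # keep everything if no tag footer is found
    for index, line in enumerate(text_list):
        if "Tags / Keywords" in line:
            # The article body ends just before the tag footer.
            trimmingIndex = max(index - 1, 0)
            break
        if startingIndex is None and "by" in line.lower() and len(line) < 50:
            # Heuristic: the first short line containing "by" is treated as the byline.
            startingIndex = index + 1

    if startingIndex is None:
        startingIndex = 1  # no byline found: skip only the first paragraph

    return "".join(text_list[startingIndex:trimmingIndex])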


In [3]:
url = "https://www.thestar.com.my/news/nation/2020/03/11/dr-m-vote-of-no-confidence-likely-to-fail-najib-the-real-conspirator"
googlenlpurl(client, url)

Extracting text from: https://www.thestar.com.my/news/nation/2020/03/11/dr-m-vote-of-no-confidence-likely-to-fail-najib-the-real-conspirator


{'sentiment': [{'magnitude': 4.199999809265137,
   'score': -0.30000001192092896}],
 'entities': [],
 'categories': [{'name': '/News/Politics', 'confidence': 0.9800000190734863},
  {'name': '/Law & Government/Government', 'confidence': 0.6299999952316284}]}

## Let's get an idea of what the GCP Natural Language API is capable of
A good place to start is to understand the output.

Sentiment Analysis: determines the sentiment of the overall text (and of each individual sentence, if needed).
1. Score: normalized sentiment score ranging from -1.0 (negative) to 1.0 (positive), with values near 0 being neutral. The threshold for each class is up to the user.
2. Magnitude: overall strength of emotion in the text, both positive and negative, ranging from 0 to +inf. It is not normalized, so longer texts tend to have higher magnitudes.
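
Because the thresholds are left to the user, one possible way to turn a score and magnitude pair into a coarse label is sketched below. The function name `label_sentiment` and both cut-off values are assumptions chosen for illustration; they are not prescribed by the API and should be tuned on your own data.

```python
def label_sentiment(score, magnitude, score_cutoff=0.25, magnitude_cutoff=1.0):
    """Map a (score, magnitude) pair to a coarse label.

    Both cut-offs are illustrative assumptions, not API defaults.
    """
    if score >= score_cutoff:
        return "positive"
    if score <= -score_cutoff:
        return "negative"
    # Score close to zero: a high magnitude suggests positive and negative
    # passages cancelling each other out, a low one suggests little emotion.
    if magnitude >= magnitude_cutoff:
        return "mixed"
    return "neutral"


# Rounded values from the sample article scored earlier in this notebook.
print(label_sentiment(-0.3, 4.2))
```

With these assumed cut-offs, an article whose score sits near zero is only called "mixed" when its magnitude is high; otherwise it is treated as neutral.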

Entity Analysis: identifies the entities mentioned in the text and uses context to tell them apart (e.g. Prince of Persia the movie versus a prince of Persia the person).
1. Name: the entity that was identified
2. Type: type of entity based on context
3. Salience: importance/relevance of the entity in the text, scored from 0 (not important) to 1 (very important)

Syntactic Analysis: extracts sentences and tokens from the text
1. Sentence: an array of sentences, where each element contains the sentence content and its offset (where the sentence starts)
2. Tokenization: the content, lemma and dependency information of each token in a sentence

Classify Text: assigns the text to content categories (e.g. /News/Politics, /Finance/Banking)
1. name: category path
2. confidence: how confident the classifier is that the category fits the text
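
Category names come back as slash-separated paths, so a small helper can split them into levels, for example to group articles by top-level category. The helper name `split_category` is hypothetical, and the sketch assumes that no individual level contains a slash of its own.

```python
def split_category(name):
    """Split a category path such as '/News/Politics' into its levels."""
    # The leading slash produces an empty first piece, which is dropped.
    return [level for level in name.split("/") if level]


levels = split_category("/Finance/Investing/Stocks & Bonds")
print(levels[0])  # top-level category
```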

Now that we understand the API calls, let's open up the curated data. The goal of this step is to create the following table:

| Title | Day | Date | Author | Content | Sentiment Score | Sentiment Magnitude | Text Category | Text Category Confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dr-m-vote-of-no-confidence-likely-to-fail-najib-the-real-conspirator | Wednesday | 11 Mar 2020 | Zakiah Koya | ... | -0.3 | 4.2 | [News/Politics, Law & Government/Government] | [0.98, 0.63] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
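
As a rough sketch of how one API result could be flattened into the four score columns of this table, the hypothetical helper `result_to_row` below takes a result dict in the shape returned by the functions above and rounds the values to two decimals. The rounding precision is an assumption, and category names are kept exactly as returned, including the leading slash.

```python
def result_to_row(result, ndigits=2):
    """Flatten one result dict into the score columns of the target table."""
    sentiment = result["sentiment"][0] if result["sentiment"] else {}
    score = sentiment.get("score")
    magnitude = sentiment.get("magnitude")
    categories = result["categories"]
    return {
        "Sentiment Score": None if score is None else round(score, ndigits),
        "Sentiment Magnitude": None if magnitude is None else round(magnitude, ndigits),
        "Text Category": [c["name"] for c in categories],
        "Text Category Confidence": [round(c["confidence"], ndigits) for c in categories],
    }


# Sample result copied from the output of the article scored earlier.
sample = {
    "sentiment": [{"magnitude": 4.199999809265137, "score": -0.30000001192092896}],
    "entities": [],
    "categories": [
        {"name": "/News/Politics", "confidence": 0.9800000190734863},
        {"name": "/Law & Government/Government", "confidence": 0.6299999952316284},
    ],
}
row = result_to_row(sample)
print(row)
```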

In [110]:
import pandas as pd
import glob

#pd.set_option('display.max_rows', None)

# Load every daily CSV export and stack them into one DataFrame.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
DF = pd.concat([pd.read_csv(file) for file in glob.glob("*0320.csv")])

## Let's do some cleanup
DF["Author"] = DF["Author"].fillna(DF["_Author"])
DF = DF.drop(columns=["Link", "Unnamed: 0", "_Author"])
DF = DF.drop_duplicates(subset="Title", keep="first")
DF = DF.dropna(subset=["Content"])
DF["Length"] = DF["Content"].str.len()
DF = DF[DF["Length"] > 200]  # keep only articles longer than 200 characters
DF = DF.reset_index()

In [81]:
resultList = []
for i in DF["Content"]:
    resultList.append(googlenlptext(client,i))

In [102]:
rearrangedResultList = []
for i in resultList:
    if len(i["categories"]) == 0:
        rearrangedResultList.append([i["sentiment"][0]['score'], i["sentiment"][0]['magnitude'], "NA", 0.0])
    else:
        rearrangedResultList.append([i["sentiment"][0]['score'], i["sentiment"][0]['magnitude'], i["categories"][0]['name'], i["categories"][0]['confidence']])

In [105]:
DFcolumns = ["Sentiment Score","Sentiment Magnitude","Category","Category Confidence"]
sentimentDF = pd.DataFrame(rearrangedResultList, columns=DFcolumns)
len(DF), len(sentimentDF)

(98, 98)

## Final Output

In [111]:
mergedDF = pd.concat([DF, sentimentDF], axis=1)
mergedDF

| | index | Title | Author | Day | Date | Content | Length | Sentiment Score | Sentiment Magnitude | Category | Category Confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Markets in turmoil as oil price crashes | daniel khoo | Tuesday | 10 Mar 2020 | Oil prices tanked by more than 30%, sending th... | 5576 | -0.5 | 4.3 | /Business & Industrial/Energy & Utilities/Oil ... | 0.97 |
| 1 | 1 | Zafrul quits CIMB CEO post | | Tuesday | 10 Mar 2020 | Commenting on his new appointment, Tengku Zafr... | 1931 | 0.0 | 0.3 | /Finance/Banking | 0.78 |
| 2 | 2 | Ringgit weakens against US\$ on Covid-19, plung... | | Tuesday | 10 Mar 2020 | KUALA LUMPUR: The ringgit remained weaker agai... | 1196 | -0.1 | 0.8 | /Finance/Investing | 0.86 |
| 3 | 3 | Bursa stages mild rebound, PChem and banks lift | Joseph Chin | Tuesday | 10 Mar 2020 | At Bursa on Monday, foreign funds stepped up t... | 1902 | -0.1 | 1.0 | /Finance/Investing | 0.95 |
| 4 | 4 | Quick take: Magni-Tech’s falls after earnings ... | | Tuesday | 10 Mar 2020 | KUALA LUMPUR: Shares in Magni-Tech Industries ... | 1473 | -0.5 | 1.0 | /Finance/Investing/Stocks & Bonds | 0.59 |
| 5 | 5 | Quick take: Uzma shares rise 9% on contract news | | Tuesday | 10 Mar 2020 | KUALA LUMPUR: UZMA BHD shares advanced almost ... | 869 | 0.0 | 0.3 | /Business & Industrial | 0.75 |
| 6 | 6 | Direct hit seen for oil and gas companies | | Tuesday | 10 Mar 2020 | UOB Kay Hian said that the combination of Covi... | 3510 | -0.6 | 1.8 | /Business & Industrial/Energy & Utilities/Oil ... | 0.99 |
| 7 | 7 | US stocks plunge most since financial crisis | | Tuesday | 10 Mar 2020 | The S&P 500 sank the most since December 2008,... | 3901 | -0.5 | 7.5 | /Business & Industrial | 0.9 |
| 8 | 8 | Tough job lies in wait for new Cabinet | tee lin sa | Tuesday | 10 Mar 2020 | The new Cabinet has a tall order ahead of them... | 2208 | -0.1 | 4.3 | /Finance | 0.69 |
| 9 | 9 | Tesco sells Thai and M’sian businesses to CP G... | royce tan | Tuesday | 10 Mar 2020 | Checking out: Tesco’s latest store in Tanjung ... | 2544 | 0.1 | 0.9 | /Business & Industrial/Hospitality Industry/Fo... | 0.99 |
