# Article Tagging

- This notebook explores whether analysing sentence structure can flag articles that are wrongly tagged to an entity.
- E.g. an article could be about Upbit getting hacked and mention other entities such as "Binance", even though Binance is not actually related to the hack. Scraping would tag this article to both entities and falsely flag Binance as risky.
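
To make the failure mode concrete, here is a minimal sketch of mention-based tagging. The `naive_tags` helper and the sample sentence are hypothetical illustrations, not the project's actual scraper.

```python
# Hypothetical illustration: tag an article to every entity whose name appears in it.
def naive_tags(text, entities):
    """Return every entity whose name occurs anywhere in the text (case-insensitive)."""
    lowered = text.lower()
    return [e for e in entities if e.lower() in lowered]


# Made-up sentence: the article is about Upbit; Binance is only mentioned in passing.
article = "Upbit confirms hack; stolen funds were later traced to wallets on Binance."
print(naive_tags(article, ["Upbit", "Binance", "Bittrex"]))  # ['Upbit', 'Binance']
```

Both exchanges are tagged even though only Upbit is the subject of the article, which is the mis-tagging this notebook tries to detect.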

In [1]:
import numpy as np
import pandas as pd

# text processing
import sys
sys.path.insert(0, '../sentiment-analysis/utils/')
from text_processing import text_processing
from textblob import TextBlob
import nltk

# entity filter
sys.path.insert(0, '../scraping/utils/')
from data_filter import filter_entity

## Retrieve Example Text

In [2]:
# reading sample test data
path = 'data/'
data = pd.read_csv(path+"bert_base_cased_test.csv")[['entity', 'text']]
print("Before text processing: ", len(data))

# apply entity filter
data = data[data.apply(lambda x: filter_entity(x['text'], x['entity']), axis=1)]
data = data.reset_index(drop=True)
# data['has_entity'] = filter_entity(data['text'], data['entity'])

# apply text processing
text = data.apply(lambda x: text_processing(str(x['text']),
                                             lower=False,
                                             remove_url=True,
                                             remove_punctuation=False,
                                             remove_stopwords=False,
                                             replace_entity=False,
                                             replace_hash=True,
                                             split_alphanumeric=True,
                                             lemmatize=False,
                                             stem=False), axis=1)

data['text'] = text

# remove na
data = data.dropna()

print("After text processing: ", len(data))

Before text processing:  1028
After text processing:  450


In [3]:
# count duplicates
# data[data["text"].duplicated()]

# drop duplicates in entity and text
data = data.drop_duplicates(subset=["entity", "text"], keep='last')
data

Unnamed: 0,entity,text
0,Poloniex,Poloniex Stops Offering Nine Coins in US Due t...
1,Cryptopia,The Cryptopia Hackers Are Moving Funds Into at...
2,Cobinhood,Crypto exchange Cobinhood files for bankruptcy...
3,Cryptopia,ETH Stolen From Crypto Exchange Cryptopia Move...
4,Binance,Binance CEO CZ Is Suing VC Giant Sequoia for R...
...,...,...
445,Upbit,upbit hack public statement - how changenow ma...
446,Upbit,upbit hack public statement . changenow takes ...
447,Bithumb,bk group chairman ( bithumb owner ) and bithum...
448,Bittrex,withdrawal failure and bittrex keeps the coins...


## Syntax Identification Methods
1. TextBlob .noun_phrases
2. TextBlob POS (Part-of-Speech) Tags
3. NLTK POS (Part-of-Speech) Tags

In [4]:
for i in range(2):
    print(f"EXAMPLE {i+1}")
    print(f"Entity: {data.iloc[i]['entity']}")
    example = data.iloc[i]['text']
    print(f"Text  : {example} \n")

    print("Using TextBlob .noun_phrases")
    print(TextBlob(example).noun_phrases)

    # Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with 'NN'
    print("Using TextBlob pos tags")
    pos_textblob = [w for (w, pos) in TextBlob(example).pos_tags if pos.startswith('NN')]
    print(pos_textblob)

    print("Using NLTK pos tags")
    tokens = nltk.word_tokenize(example)
    pos_nltk = [w for (w, pos) in nltk.pos_tag(tokens) if pos.startswith('NN')]
    print(pos_nltk)
    print("\n")

EXAMPLE 1
Entity: Poloniex
Text  : Poloniex Stops Offering Nine Coins in US Due to Uncertain Regulations . Leading cryptocurrency exchange Poloniex will stop offering nine coins in the United States due to uncertain regulations . 

Using TextBlob .noun_phrases
['poloniex stops offering', 'coins', 'uncertain regulations', 'cryptocurrency exchange', 'poloniex', 'uncertain regulations']
Using TextBlob pos tags
['Poloniex', 'Stops', 'Offering', 'Nine', 'Coins', 'US', 'Due', 'Uncertain', 'Regulations', 'cryptocurrency', 'exchange', 'Poloniex', 'offering', 'coins', 'United', 'States', 'regulations']
Using NLTK pos tags
['Poloniex', 'Stops', 'Offering', 'Nine', 'Coins', 'US', 'Due', 'Uncertain', 'Regulations', 'Leading', 'cryptocurrency', 'exchange', 'Poloniex', 'offering', 'coins', 'United', 'States', 'regulations']


EXAMPLE 2
Entity: Cryptopia
Text  : The Cryptopia Hackers Are Moving Funds Into at Least Four Wallets . The thieves who cleared at least $ 16 million of ether from the Cryptopi

## Retrieve POS Tags
- Extract nouns of the text and check whether the entity is identified as a noun.

In [5]:
res = []

for i in range(len(data)):
    entity = str(data.iloc[i]['entity'])
    txt = data.iloc[i]['text']
    
    # get noun_phrases
    txt_noun = TextBlob(txt).noun_phrases
    
    # get pos_tags
    # pos_textblob = [w.lower() for (w, pos) in TextBlob(txt).pos_tags if pos[0:2] == 'NN']
    
    # join the noun phrases and split them into single-word, lower-cased tokens
    txt_noun = nltk.word_tokenize(' '.join(txt_noun).lower())
    
    # check if entity in txt_noun
    # (txt_noun holds single-word tokens, so a multi-word entity name never matches)
    check = (entity.lower() in txt_noun)
    
    # check = entity.lower() in pos_textblob
    
    # if check == False:
    #     print(entity)
    #     print(txt)
    #     print(txt_noun)
    #     print("\n")
        
    res.append(check)

In [6]:
data1 = data.copy()
# check tagging
data1['tag'] = res
print(data1['tag'].value_counts())
data1.head()

True     269
False    100
Name: tag, dtype: int64


Unnamed: 0,entity,text,tag
0,Poloniex,Poloniex Stops Offering Nine Coins in US Due t...,True
1,Cryptopia,The Cryptopia Hackers Are Moving Funds Into at...,True
2,Cobinhood,Crypto exchange Cobinhood files for bankruptcy...,True
3,Cryptopia,ETH Stolen From Crypto Exchange Cryptopia Move...,True
4,Binance,Binance CEO CZ Is Suing VC Giant Sequoia for R...,True


In [7]:
data1_false = data1.loc[~data1['tag']].reset_index(drop=True)
data1_false

Unnamed: 0,entity,text,tag
0,Cryptopia,Komodo Hacks Itself and Saves Crypto Worth $ 1...,False
1,Ethereum Foundation,Hyperledger Gains Microsoft and Ethereum Found...,False
2,Ethereum Foundation,"Microsoft , Ethereum Foundation Join Hyperledg...",False
3,Gods Unchained,Crypto Seems Ready to Solve Gaming Industrys M...,False
4,My Crypto Heroes,Crypto Seems Ready to Solve Gaming Industrys M...,False
...,...,...,...
95,Upbit,weekly crypto market analysis : bitcoin strugg...,False
96,Bittrex,bittrex was first targeted to breach the coins...,False
97,Upbit,the hacker transferred 500 eth stolen from upb...,False
98,Bittrex,withdrawal failure and bittrex keeps the coins...,False


**Conclusions:**
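- Entities with more than one word in their name (e.g. "Ethereum Foundation", "Gods Unchained", "My Crypto Heroes") need a more comprehensive check: the full name is compared against single-word tokens, so it can never match.
- The text processing packages do not always identify nouns accurately. In Example 1, for instance, title-cased headline words such as 'Stops', 'Offering' and 'Due' are tagged as nouns.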
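
One possible direction for the first point, sketched under assumptions: treat the entity name as a sequence of tokens and look for that sequence inside each noun phrase, rather than comparing the full name against single-word tokens. The helper `entity_in_noun_phrases` is hypothetical. It uses plain whitespace splitting instead of `nltk.word_tokenize` and takes the noun phrases as a list of strings (for example the output of `TextBlob(txt).noun_phrases`). The sample phrases below are made up for illustration.

```python
def entity_in_noun_phrases(entity, noun_phrases):
    """Return True if the entity's tokens appear contiguously in any noun phrase."""
    target = entity.lower().split()
    if not target:
        return False
    n = len(target)
    for phrase in noun_phrases:
        tokens = phrase.lower().split()
        # slide a window of len(target) tokens across the phrase
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == target:
                return True
    return False


# Made-up noun phrases, standing in for TextBlob(txt).noun_phrases
phrases = ["microsoft", "ethereum foundation join hyperledger"]
print(entity_in_noun_phrases("Ethereum Foundation", phrases))  # True
print(entity_in_noun_phrases("Gods Unchained", phrases))       # False
print(entity_in_noun_phrases("Binance", ["binance ceo cz"]))   # True
```

Single-word entities behave as before, while multi-word names can now match. A name that the noun-phrase extractor splits across two phrases would still be missed, so this addresses only part of the problem.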