# 5430 NLP | SPRING 2021 | ASSIGNMENT 5 | UNI: CHB2132


---


> Select an article from the Webhose dataset and write a program that:

> 1. Extracts and prints subject-verb-object (SVO) relations from each sentence

> 2. Applies TextRank to rank and select key phrases, and prints the result

> 3. Applies LexRank to produce an extractive summary of 5 sentences.

**1. Extract and print subject-verb-object (SVO) relations from each sentence**

---

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
OBJECTS = ["dobj", "dative", "attr", "oprd"]

In [None]:
def getSubsFromConjunctions(subs):
    moreSubs = []
    for sub in subs:
        # rights is a generator
        rights = list(sub.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreSubs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
            if len(moreSubs) > 0:
                moreSubs.extend(getSubsFromConjunctions(moreSubs))
    return moreSubs

In [None]:
def getObjsFromConjunctions(objs):
    moreObjs = []
    for obj in objs:
        # rights is a generator
        rights = list(obj.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreObjs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
            if len(moreObjs) > 0:
                moreObjs.extend(getObjsFromConjunctions(moreObjs))
    return moreObjs

In [None]:
def getVerbsFromConjunctions(verbs):
    moreVerbs = []
    for verb in verbs:
        rightDeps = {tok.lower_ for tok in verb.rights}
        if "and" in rightDeps:
            moreVerbs.extend([tok for tok in verb.rights if tok.pos_ == "VERB"])
            if len(moreVerbs) > 0:
                moreVerbs.extend(getVerbsFromConjunctions(moreVerbs))
    return moreVerbs

In [None]:
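def findSubs(tok):
    head = tok.head
    # Climb the tree until a verb, a noun, or the root is reached
    while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
        head = head.head
    if head.pos_ == "VERB":
        # NOTE: "SUB" is not a label in spaCy's English dependency scheme, so this
        # list stays empty and the search continues upward. `tok.dep_ in SUBJECTS`
        # is probably what was intended; the check is kept as written because the
        # outputs recorded below were produced with it.
        subs = [tok for tok in head.lefts if tok.dep_ == "SUB"]
        if len(subs) > 0:
            verbNegated = isNegated(head)
            subs.extend(getSubsFromConjunctions(subs))
            return subs, verbNegated
        elif head.head != head:
            return findSubs(head)
    elif head.pos_ == "NOUN":
        return [head], isNegated(tok)
    return [], False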

In [None]:
def isNegated(tok):
    negations = {"no", "not", "n't", "never", "none"}
    for dep in list(tok.lefts) + list(tok.rights):
        if dep.lower_ in negations:
            return True
    return False

In [None]:
def findSVs(tokens):
    svs = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        if len(subs) > 0:
            for sub in subs:
                svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
    return svs

In [None]:
def getObjsFromPrepositions(deps):
    objs = []
    for dep in deps:
        if dep.pos_ == "ADP" and dep.dep_ == "prep":
            objs.extend([tok for tok in dep.rights if tok.dep_  in OBJECTS or (tok.pos_ == "PRON" and tok.lower_ == "me")])
    return objs

In [None]:
def getObjsFromAttrs(deps):
    for dep in deps:
        if dep.pos_ == "NOUN" and dep.dep_ == "attr":
            verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
            if len(verbs) > 0:
                for v in verbs:
                    rights = list(v.rights)
                    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                    objs.extend(getObjsFromPrepositions(rights))
                    if len(objs) > 0:
                        return v, objs
    return None, None

In [None]:
def getObjFromXComp(deps):
    for dep in deps:
        if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
            v = dep
            rights = list(v.rights)
            objs = [tok for tok in rights if tok.dep_ in OBJECTS]
            objs.extend(getObjsFromPrepositions(rights))
            if len(objs) > 0:
                return v, objs
    return None, None

In [None]:
def getAllObjs(v):
    # rights is a generator
    rights = list(v.rights)
    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
    objs.extend(getObjsFromPrepositions(rights))

    #potentialNewVerb, potentialNewObjs = getObjsFromAttrs(rights)
    #if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
    #    objs.extend(potentialNewObjs)
    #    v = potentialNewVerb

    potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
    if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
        objs.extend(potentialNewObjs)
        v = potentialNewVerb
    if len(objs) > 0:
        objs.extend(getObjsFromConjunctions(objs))
    return v, objs

In [None]:
def getAllSubs(v):
    verbNegated = isNegated(v)
    subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
    if len(subs) > 0:
        subs.extend(getSubsFromConjunctions(subs))
    else:
        foundSubs, verbNegated = findSubs(v)
        subs.extend(foundSubs)
    return subs, verbNegated

In [None]:
def findSVOs(tokens):
    svos = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            v, objs = getAllObjs(v)
            for sub in subs:
                for obj in objs:
                    objNegated = isNegated(obj)
                    svos.append((sub.lower_, "!" + v.lower_ if verbNegated or objNegated else v.lower_, obj.lower_))
    return svos

In [None]:
def printDeps(toks):
    for tok in toks:
        print(tok.orth_, tok.dep_, tok.pos_, tok.head.orth_, [t.orth_ for t in tok.lefts], [t.orth_ for t in tok.rights])

def testSVOs():

    SENTENCE0 = '''Apple removed the headphone jack from the iPhone nearly four years ago in September 2016, and this year it may take those efforts one step further by nixing its wired EarPods from the iPhone 12 packaging, according to the latest predictions from TF International Securities analyst Ming-Chi Kuo.'''
    
    SENTENCE1 = '''Kuo, who has been accurate about some Apple product predictions in the past, said that Apple may offer promotions or discounts on the AirPods around the holiday season, according to 9to5Mac.'''
    
    SENTENCE2 = '''The report does not specify whether these expected discounts would pertain to the second-generation regular AirPods, which sell for $159 or $199 with the wireless charging case, or the $249 AirPods Pro.'''

    SENTENCE3 = '''Apple also sells its wired EarPods through its website for $29.'''
    
    SENTENCE4 = '''Apple's removal of the headphone jack with the iPhone 7 was met with controversy, with some tech critics disapproving of the decision and calling it "user-hostile" at the time.'''
    
    SENTENCE5 = '''Kuo reported in a previous note from December that Apple may be planning to release its first totally wireless iPhone in 2021.'''

    SENTENCE6 = '''That phone would likely be the most expensive one in Apple's 2021 lineup, and Kuo didn't offer any additional details other than that it would offer a completely wireless experience.''' 
    
    SENTENCE7 = '''Phil Schiller, Apple's senior vice president of worldwide marketing, talked about the company's move toward wireless when unveiling the original AirPods in 2016.'''
    
    SENTENCE8 = '''It makes no sense to tether ourselves with cables to our mobile devices, he said during the company's keynote.'''
    
    SENTENCE9 = '''AirPods have become an increasingly important part of Apple's product lineup since their 2016 debut.'''
    
    SENTENCE10 = '''The success of both the Apple Watch and AirPods has made Apple the top maker of wearable devices in the world during the fourth quarter of 2019, according to The International Data Corporation.'''

    sentences = [SENTENCE0, SENTENCE1, SENTENCE2, SENTENCE3, SENTENCE4, SENTENCE5,
                 SENTENCE6, SENTENCE7, SENTENCE8, SENTENCE9, SENTENCE10]

    # Parse each sentence and print its SVO triples, followed by a separator line
    for sentence in sentences:
        tok = nlp(sentence)
        svos = findSVOs(tok)
        print(svos)
        print("-----------------------------------------------")

    #printDeps(tok)
    #assert set(svos) == {('apple', 'removed', 'jack')}

In [None]:
if __name__ == "__main__":
    testSVOs()

[('apple', 'removed', 'jack'), ('it', 'take', 'efforts')]
-----------------------------------------------
[('apple', 'offer', 'promotions')]
-----------------------------------------------
[]
-----------------------------------------------
[('apple', 'sells', 'earpods')]
-----------------------------------------------
[]
-----------------------------------------------
[('apple', 'release', 'iphone')]
-----------------------------------------------
[('kuo', '!offer', 'details'), ('it', 'offer', 'experience')]
-----------------------------------------------
[]
-----------------------------------------------
[('it', '!makes', 'sense')]
-----------------------------------------------
[('airpods', 'become', 'part')]
-----------------------------------------------
[]
-----------------------------------------------
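
The `!` prefix in the output above marks a verb that is negated, either directly or through its object. The core pattern (subjects to the left of the verb, objects to the right, then a negation check) can be illustrated without spaCy on hand-built parses. This is only a minimal sketch: the `Tok` class, its fields, and `toy_svos` are hypothetical stand-ins and skip the conjunction, preposition, and `xcomp` handling of the functions above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

SUBJECT_DEPS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}
OBJECT_DEPS = {"dobj", "dative", "attr", "oprd"}
NEGATIONS = {"no", "not", "n't", "never", "none"}


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token (text, dependency label, POS, children)."""
    text: str
    dep: str
    pos: str
    lefts: List["Tok"] = field(default_factory=list)
    rights: List["Tok"] = field(default_factory=list)


def is_negated(tok: Tok) -> bool:
    # A token counts as negated when one of its direct children is a negation word
    return any(child.text.lower() in NEGATIONS for child in tok.lefts + tok.rights)


def toy_svos(verb: Tok) -> List[Tuple[str, str, str]]:
    subs = [t for t in verb.lefts if t.dep in SUBJECT_DEPS]
    objs = [t for t in verb.rights if t.dep in OBJECT_DEPS]
    verb_negated = is_negated(verb)
    triples = []
    for sub in subs:
        for obj in objs:
            negated = verb_negated or is_negated(obj)
            verb_text = ("!" if negated else "") + verb.text.lower()
            triples.append((sub.text.lower(), verb_text, obj.text.lower()))
    return triples


# Hand-built parse of "Kuo didn't offer details": the negation hangs off the verb
offer = Tok("offer", "ROOT", "VERB",
            lefts=[Tok("Kuo", "nsubj", "PROPN"), Tok("did", "aux", "AUX"),
                   Tok("n't", "neg", "PART")],
            rights=[Tok("details", "dobj", "NOUN")])

# Hand-built parse of "It makes no sense": the negation hangs off the object
makes = Tok("makes", "ROOT", "VERB",
            lefts=[Tok("It", "nsubj", "PRON")],
            rights=[Tok("sense", "dobj", "NOUN", lefts=[Tok("no", "det", "DET")])])

print(toy_svos(offer))
print(toy_svos(makes))
```

Both toy parses yield a `!`-prefixed verb, the first because of the verb's own `n't` child and the second because of the `no` attached to the object.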


**2. Apply TextRank for ranking and selecting key phrases, print the result**

---

In [None]:
from collections import OrderedDict
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
class TextRank4Keyword():
    """Extract keywords from text"""
    
    def __init__(self):
        self.d = 0.85 # damping coefficient, usually 0.85
        self.min_diff = 1e-5 # convergence threshold
        self.steps = 10 # maximum number of iteration steps
        self.node_weight = None # stores each keyword and its weight

    
    def set_stopwords(self, stopwords):  
        """Set stop words"""
        for word in STOP_WORDS.union(set(stopwords)):
            lexeme = nlp.vocab[word]
            lexeme.is_stop = True
    
    def sentence_segment(self, doc, candidate_pos, lower):
        """Keep only the words whose POS tag is in candidate_pos"""
        sentences = []
        for sent in doc.sents:
            selected_words = []
            for token in sent:
                # Store only words with a candidate POS tag that are not stop words
                if token.pos_ in candidate_pos and token.is_stop is False:
                    if lower is True:
                        selected_words.append(token.text.lower())
                    else:
                        selected_words.append(token.text)
            sentences.append(selected_words)
        return sentences
        
    def get_vocab(self, sentences):
        """Get all tokens"""
        vocab = OrderedDict()
        i = 0
        for sentence in sentences:
            for word in sentence:
                if word not in vocab:
                    vocab[word] = i
                    i += 1
        return vocab
    
    def get_token_pairs(self, window_size, sentences):
        """Build token_pairs from windows in sentences"""
        token_pairs = list()
        for sentence in sentences:
            for i, word in enumerate(sentence):
                for j in range(i+1, i+window_size):
                    if j >= len(sentence):
                        break
                    pair = (word, sentence[j])
                    if pair not in token_pairs:
                        token_pairs.append(pair)
        return token_pairs
        
    def symmetrize(self, a):
        return a + a.T - np.diag(a.diagonal())
    
    def get_matrix(self, vocab, token_pairs):
        """Get normalized matrix"""
        # Build matrix
        vocab_size = len(vocab)
        g = np.zeros((vocab_size, vocab_size), dtype='float')
        for word1, word2 in token_pairs:
            i, j = vocab[word1], vocab[word2]
            g[i][j] = 1
            
        # Get symmetric matrix
        g = self.symmetrize(g)

        # Normalize matrix by column
        norm = np.sum(g, axis=0)
        # Divide only where the column sum is non-zero; out= keeps the skipped entries at 0
        g_norm = np.divide(g, norm, out=np.zeros_like(g), where=norm!=0)
        
        return g_norm

    
    def get_keywords(self, number=10):
        """Print the top-ranked keywords"""
        node_weight = OrderedDict(sorted(self.node_weight.items(), key=lambda t: t[1], reverse=True))
        for i, (key, value) in enumerate(node_weight.items()):
            print(key + ' - ' + str(value))
            # The check runs after printing, so number + 2 keywords are shown (12 for number=10)
            if i > number:
                break
        
        
    def analyze(self, text, 
                candidate_pos=['NOUN', 'PROPN'], 
                window_size=4, lower=False, stopwords=list()):
        """Main function to analyze text"""
        
        # Set stop words
        self.set_stopwords(stopwords)
        
        # Parse text with spaCy
        doc = nlp(text)
        
        # Filter sentences
        sentences = self.sentence_segment(doc, candidate_pos, lower) # list of list of words
        
        # Build vocabulary
        vocab = self.get_vocab(sentences)
        
        # Get token_pairs from windows
        token_pairs = self.get_token_pairs(window_size, sentences)
        
        # Get normalized matrix
        g = self.get_matrix(vocab, token_pairs)
        
        # Initialization of the weights (PageRank values)
        pr = np.array([1] * len(vocab))
        
        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr = (1-self.d) + self.d * np.dot(g, pr)
            if abs(previous_pr - sum(pr))  < self.min_diff:
                break
            else:
                previous_pr = sum(pr)

        # Get weight for each node
        node_weight = dict()
        for word, index in vocab.items():
            node_weight[word] = pr[index]
        
        self.node_weight = node_weight
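
To make the graph-building steps of the class more concrete, here is a small pure-Python sketch of the co-occurrence window and the column normalization on a toy word list. The helper names `token_pairs` and `column_normalized` are hypothetical, and the sketch simplifies slightly: every edge weight stays at 1, whereas `symmetrize` in the class can produce a weight of 2 when a pair occurs in both orders.

```python
def token_pairs(sentence, window_size):
    """Pair each word with the words that follow it inside the window."""
    pairs = []
    for i, word in enumerate(sentence):
        for j in range(i + 1, min(i + window_size, len(sentence))):
            pair = (word, sentence[j])
            if pair not in pairs:
                pairs.append(pair)
    return pairs


def column_normalized(vocab, pairs):
    """Symmetric 0/1 adjacency matrix with every non-empty column scaled to sum to 1."""
    n = len(vocab)
    g = [[0.0] * n for _ in range(n)]
    for w1, w2 in pairs:
        i, j = vocab[w1], vocab[w2]
        g[i][j] = g[j][i] = 1.0
    col_sums = [sum(g[i][j] for i in range(n)) for j in range(n)]
    return [[g[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


words = ["Apple", "headphone", "jack", "iPhone"]  # toy sentence after POS filtering
vocab = {word: index for index, word in enumerate(words)}
pairs = token_pairs(words, window_size=3)
matrix = column_normalized(vocab, pairs)

print(pairs)
for word, row in zip(words, matrix):
    print(word, [round(x, 2) for x in row])
```

Note that a `window_size` of 3 links each word to the next two words only, since the window counts the word itself.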

In [None]:
keyphrase_extractor = TextRank4Keyword()

In [None]:
text = '''Apple removed the headphone jack from the iPhone nearly four years ago in September 2016, and this year it may take those efforts one step further by nixing its wired EarPods from the iPhone 12 packaging, according to the latest predictions from TF International Securities analyst Ming-Chi Kuo. 

          Kuo, who has been accurate about some Apple product predictions in the past, said that Apple may offer promotions or discounts on the AirPods around the holiday season, according to 9to5Mac. 

          The report does not specify whether these expected discounts would pertain to the second-generation regular AirPods, which sell for $159 or $199 with the wireless charging case, or the $249 AirPods Pro. 
          
          Apple also sells its wired EarPods through its website for $29.
    
          Apple's removal of the headphone jack with the iPhone 7 was met with controversy, with some tech critics disapproving of the decision and calling it "user-hostile" at the time.
    
          Kuo reported in a previous note from December that Apple may be planning to release its first totally wireless iPhone in 2021.

          That phone would likely be the most expensive one in Apple's 2021 lineup, and Kuo didn't offer any additional details other than that it would offer a completely wireless experience.
    
          Phil Schiller, Apple's senior vice president of worldwide marketing, talked about the company's move toward wireless when unveiling the original AirPods in 2016.
    
          It makes no sense to tether ourselves with cables to our mobile devices, he said during the company's keynote.
    
          AirPods have become an increasingly important part of Apple's product lineup since their 2016 debut.
    
          The success of both the Apple Watch and AirPods has made Apple the top maker of wearable devices in the world during the fourth quarter of 2019, according to The International Data Corporation.'''


In [None]:
tr4w = TextRank4Keyword()
tr4w.analyze(text, candidate_pos = ['NOUN', 'PROPN',"ADP"], window_size=8, lower=False)
tr4w.get_keywords(10)

Apple - 4.289341349094153
AirPods - 2.8901379612268823
iPhone - 2.45597989192802
predictions - 1.8447739264264218
Kuo - 1.8224049087597662
International - 1.6813454174664981
devices - 1.4942342561688458
company - 1.4205085157873065
discounts - 1.3437843412018382
jack - 1.2445233386911432
EarPods - 1.1803789770457653
product - 1.16757434664517
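
The scores above come from repeating the update `pr = (1 - d) + d * G·pr` on the column-normalized graph. The following pure-Python sketch applies the same update to a tiny hand-made graph; the graph, the function name `textrank_scores`, and the tolerance are assumptions chosen for illustration. One observation, offered tentatively: because every non-empty column of the normalized matrix sums to 1, the total of all scores stays equal to the number of words from one step to the next, which is also why individual scores can exceed 1. For that reason this sketch stops on the largest per-word change rather than on the change in the total.

```python
def textrank_scores(neighbors, d=0.85, tol=1e-6, max_steps=200):
    """Iterate score = (1 - d) + d * sum(neighbor score / neighbor degree)."""
    nodes = list(neighbors)
    scores = {node: 1.0 for node in nodes}
    for _ in range(max_steps):
        new_scores = {}
        for node in nodes:
            # Each neighbor passes on its score, split evenly over its own neighbors
            incoming = sum(scores[m] / len(neighbors[m]) for m in neighbors[node])
            new_scores[node] = (1 - d) + d * incoming
        change = max(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if change < tol:
            break
    return scores


# Toy undirected co-occurrence graph: "Apple" co-occurs with every other word
graph = {
    "Apple": ["AirPods", "iPhone", "Kuo"],
    "AirPods": ["Apple"],
    "iPhone": ["Apple"],
    "Kuo": ["Apple"],
}

scores = textrank_scores(graph)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for word, score in ranked:
    print(f"{word} - {score:.4f}")
```

The best-connected word ends up at the top of the ranking, mirroring how "Apple" leads the keyword list for the article.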


**3. Apply LexRank to produce an extractive summary of 5 sentences.**

---

In [None]:
!pip install sumy  

import nltk
nltk.download('punkt')

from sumy.parsers.plaintext import PlaintextParser
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
class TextSummary(object):

    def __init__(self, feeds_str, num_sents):
        self.summary = str()
        
        parser = PlaintextParser.from_string(feeds_str, Tokenizer("english"))
        summarizer = LexRankSummarizer()

        sentences = summarizer(parser.document, num_sents)  # Summarize the document with num_sents sentences
        for sentence in sentences:
            self.summary += str(sentence)

    def output(self):
        return self.summary
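
For intuition about what `LexRankSummarizer` does, here is a deliberately simplified pure-Python sketch of the LexRank idea: link sentences whose word overlap is high enough, then rank them with a damped random walk over that graph. It is not sumy's implementation. The plain term-frequency cosine (no IDF weighting), the 0.1 threshold, the whitespace tokenization, the toy sentences, and the names `lexrank_toy` and `summarize` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def lexrank_toy(sentences, threshold=0.1, d=0.85, steps=50):
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    # Link two sentences when their similarity exceeds the threshold (self-links included)
    adj = [[1.0 if cosine(bags[i], bags[j]) > threshold else 0.0 for j in range(n)]
           for i in range(n)]
    # Row-normalize so each sentence spreads its score evenly over its links
    rows = [[v / (sum(row) or 1.0) for v in row] for row in adj]
    scores = [1.0 / n] * n
    for _ in range(steps):
        scores = [(1 - d) / n + d * sum(rows[j][i] * scores[j] for j in range(n))
                  for i in range(n)]
    return scores


def summarize(sentences, k):
    """Return the k highest-scoring sentences in their original order."""
    scores = lexrank_toy(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


# Toy sentences: the first shares one word with each of the others
toy_sentences = [
    "Apple sells AirPods and EarPods for the iPhone",
    "AirPods became important",
    "Wired EarPods cost $29",
    "Kuo expects a wireless iPhone",
]

toy_scores = lexrank_toy(toy_sentences)
print([round(s, 3) for s in toy_scores])
print(summarize(toy_sentences, 2))
```

The sentence that overlaps with the most other sentences receives the highest score, which is the intuition behind picking "central" sentences for an extractive summary.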

In [None]:
input_text = '''Apple removed the headphone jack from the iPhone nearly four years ago in September 2016, and this year it may take those efforts one step further by nixing its wired EarPods from the iPhone 12 packaging, according to the latest predictions from TF International Securities analyst Ming-Chi Kuo. 

          Kuo, who has been accurate about some Apple product predictions in the past, said that Apple may offer promotions or discounts on the AirPods around the holiday season, according to 9to5Mac. 

          The report does not specify whether these expected discounts would pertain to the second-generation regular AirPods, which sell for $159 or $199 with the wireless charging case, or the $249 AirPods Pro. 
          
          Apple also sells its wired EarPods through its website for $29.
    
          Apple's removal of the headphone jack with the iPhone 7 was met with controversy, with some tech critics disapproving of the decision and calling it "user-hostile" at the time.
    
          Kuo reported in a previous note from December that Apple may be planning to release its first totally wireless iPhone in 2021.

          That phone would likely be the most expensive one in Apple's 2021 lineup, and Kuo didn't offer any additional details other than that it would offer a completely wireless experience.
    
          Phil Schiller, Apple's senior vice president of worldwide marketing, talked about the company's move toward wireless when unveiling the original AirPods in 2016.
    
          It makes no sense to tether ourselves with cables to our mobile devices, he said during the company's keynote.
    
          AirPods have become an increasingly important part of Apple's product lineup since their 2016 debut.
    
          The success of both the Apple Watch and AirPods has made Apple the top maker of wearable devices in the world during the fourth quarter of 2019, according to The International Data Corporation.'''


In [None]:
text_to_sum = TextSummary(input_text,5)
print(text_to_sum.output())

Apple removed the headphone jack from the iPhone nearly four years ago in September 2016, and this year it may take those efforts one step further by nixing its wired EarPods from the iPhone 12 packaging, according to the latest predictions from TF International Securities analyst Ming-Chi Kuo.Kuo, who has been accurate about some Apple product predictions in the past, said that Apple may offer promotions or discounts on the AirPods around the holiday season, according to 9to5Mac.Apple's removal of the headphone jack with the iPhone 7 was met with controversy, with some tech critics disapproving of the decision and calling it "user-hostile" at the time.That phone would likely be the most expensive one in Apple's 2021 lineup, and Kuo didn't offer any additional details other than that it would offer a completely wireless experience.Phil Schiller, Apple's senior vice president of worldwide marketing, talked about the company's move toward wireless when unveiling the original AirPods in 2