In [None]:
'''
Based off of this paper: https://www.cs.utexas.edu/~ml/papers/spk-emnlp-05.pdf

For relation extraction (e.g. '5%' is a 'discount rate'), we build a dependency tree
of the sentence and look at the number with the shortest distance from the target metric to extract
(e.g. 'discount rate')

'''

In [265]:
import pandas as pd

In [300]:
exsentence1 = u'Implied present values of cash flows and terminal values were calculated using discount rates ranging from 8% to 9%, reflecting estimates of Advanced Disposal’s weighted average cost of capital.'
exsentence2 = "UBS performed discounted cash flow analyses utilizing financial forecasts and estimates prepared by the management of Advanced Disposal (as more fully described in the section entitled “The Merger (Proposal 1)—Projected Financial Information” beginning on page 52 of this proxy statement). UBS calculated ranges of implied present values (as of January 1, 2019) of the standalone, unlevered, free cash flows that Advanced Disposal was forecasted to generate from January 1, 2019 through December 31, 2023 and of terminal values for Advanced Disposal. Implied terminal values were derived by applying to the projection of Advanced Disposal’s 2023 estimated Market Adjusted EBITDA a range of estimated terminal last twelve months Market Adjusted EBITDA multiples of 9.5x to 11.5x, which range was selected based on current and historical trading Market Adjusted EBITDA multiples of Advanced Disposal and certain of the selected companies. Implied present values of cash flows and terminal values were calculated using discount rates ranging from 8% to 9%, reflecting estimates of Advanced Disposal’s weighted average cost of capital. The discounted cash flow analyses resulted in a range of implied equity values of approximately $25 to $35 per share for Advanced Disposal, as compared to the $33.15 offer price."
exsentence3 = "The discounted cash flow analyses resulted in a range of implied equity values of approximately $25 to $35 per share for Advanced Disposal, as compared to the $33.15 offer price."
exsentence5 = "in performing this analysis, kbw used earnings and asset growth estimates for cadence for 2010 through 2015 from company management and assumed discount rates ranging from 12.0% to 16.0%."

exdocument1 = "discounted cash flow analysis. kbw performed a discounted cash flow analysis to estimate a range of the present values of after-tax cash flows that cadence could provide to equity holders through 2015, assuming a $90.0 million common equity offering in the third quarter of 2010 at $1.00 per share. the analysis assumed a 6.00% underwriting discount and $300,000 of additional offering expenses. in performing this analysis, kbw used earnings and asset growth estimates for cadence for 2010 through 2015 from company management and assumed discount rates ranging from 12.0% to 16.0%. the range of values was determined by adding (1) ;the present value of projected cash dividends to cadence shareholders from 2010 to 2015, assuming excess capital is paid out to cadence shareholders and (2) ;the present value of the terminal value of cadence&#146;s common stock. in calculating the terminal value of cadence, kbw applied multiples ranging from 10.0x to 14.0x 2015 forecasted earnings. in determining cash flows available to shareholders, kbw assumed that cadence would maintain a tangible common equity / tangible assets ratio of 7.00% and any excess capital would be paid out to cadence shareholders. this resulted in a range of values of cadence from $0.93 to $1.32 per share.  kbw also considered and discussed with the cadence board how this analysis would be affected by changes in the underlying assumptions of the capital raise, including variations with respect to the amount raised and offer price per share. to illustrate this impact, kbw performed a similar discounted cash flow analysis assuming capital raises ranging from $80.0 million to $100.0 million at offer prices from $0.50 per share to $1.50 per share. the analysis assumes a 12.0x terminal multiple on 2015 earnings and a 14.0% discount rate. this resulted in a range of values of cadence from $0.57 to $1.64.  kbw stated that the discounted cash flow present value analysis is a widely used valuation methodology but noted that it relies on numerous assumptions, including asset and earnings growth rates, terminal values and discount rates. the analysis did not purport to be indicative of the actual values or expected values of cadence.  "







In [301]:
INPUT_DOCUMENT = exdocument1
INPUT_SENTENCE = exsentence5
TARGET_METRIC = 'discount rates'

In [296]:
'''sentence preprocessing: if target metric has a space, remove the space from the sentence
so that it will be treated as one token
'''
#INPUT_DOCUMENT = INPUT_DOCUMENT.replace(TARGET_METRIC, TARGET_METRIC.replace(' ', ''))

#TARGET_METRIC = TARGET_METRIC.replace(' ', '')

In [289]:
import networkx as nx
import spacy

In [464]:
def is_number(s):

    
    try:
        float(s)
        return True
    except ValueError:
        return False

In [450]:
# removes the edge/tree info from the processed num
def unprocess_scores(num_scores):
    temp = []
    
    
    

    for score in num_scores:
        #if(score[0].split('-')[0] == ''):
            #print('numscores = ', num_scores)
            #print('ragga', score[0].split('-')[0], 'ragga')
        if len(score[0].split('-'))<3:
            temp.append((float(score[0].split('-')[0]), score[1]))
        else: # negative number, e.g. '-1-16'
            temp.append((float('-' + score[0].split('-')[1]), score[1]))

    
    return temp

In [466]:
'''
gets candidate numeric values for the target metric based on the sentence

'''
def get_candidates(sentence):
    sentence = nlp(sentence)
    # Load spacy's dependency tree into a networkx graph
    edges = []

    # get all numbers
    all_nums = []

    target_metric_processed = ''
    #print("Target metric: ", TARGET_METRIC)

    for token in sentence:
        # FYI https://spacy.io/docs/api/token
        for child in token.children:
            edges.append(('{0}-{1}'.format(token.lower_,token.i),
                          '{0}-{1}'.format(child.lower_,child.i)))
            if(is_number(child.lower_)):
                all_nums.append('{0}-{1}'.format(child.lower_,child.i))

            if child.lower_ == str.lower(TARGET_METRIC.replace(' ', '')):
                target_metric_processed = '{0}-{1}'.format(child.lower_,child.i)

    if target_metric_processed == '':
        #print('target metric is empty')
        return [] # will be '' if the target metric is not in the sentence

    graph = nx.Graph(edges)

    # https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html

    num_scores = []

    for num in all_nums:
        try:
            shortest_len = nx.shortest_path_length(graph, source=target_metric_processed, target=num)
            #print('Shortest path for {} is {}'.format(num, shortest_len))
            num_scores.append((num, shortest_len))
        except nx.NetworkXNoPath:
            #print('No path between {} and {}'.format(target_metric_processed, num))
            pass

    # unprocess all target_metrics
    #print('this is ns: ', num_scores)
    #print('sentence: ', sentence)
    num_scores = unprocess_scores(num_scores)
    
    return num_scores


In [452]:
# discard candidates that are obviously not discount rateas
def sanity_check(candidates, sentence):

    #sentence = 'the cash flows and terminal values were then discounted to present value (as of june ;30, 2017) using a selected range of discountrates of 9.5% to 11.5%.'

    #candidates = [(30.0, 7), (2017.0, 8), (9.5, 3), (11.5, 6)]

    # loop over all candidates
    #print('all candidates to loop over: ', candidates)
    
    to_remove = []
    
    for candidate in candidates:
        #candidate = candidates[0]
        #candidate[0]
        #print('curr candidate: ', candidate)

        # check if number followed by a % in original
        window = sentence[sentence.find(str(int(candidate[0]))) : sentence.find(str(int(candidate[0]))) + len(str(candidate[0])) + 3]
        
        if '%' not in window: # is not a percentage
            #print('will remove: ', candidate)
            to_remove.append(candidate)
            
        #else:
            
            #print('window: ', str(window))

        # check if number is too big? TODO
        # before appending to to_remove, check if it's not there
        
    
    # remove candidates
    for candidate in to_remove:
        candidates.remove(candidate)


    return candidates







In [511]:

def most_probable_metric(all_candidates):
    
    #all_candidates = [[(3.0, 13), (9.25, 6), (10.25, 7)], [(3.0, 13), (4.0, 14), (9.75, 6), (10.75, 7)], [(9.25, 6), (10.25, 7)]]
    if all_candidates == []:
        print('empty allcadnidates in mindst')
        return []
    # find the minimal distance in all sentences
    print('all cands right before mindst: ', all_candidates)
    min_dst = all_candidates[0][0][1]

    for sent_candidates in all_candidates:
        for candidate in sent_candidates:
            if candidate[1] < min_dst:
                min_dst = candidate[1]

    # find all candidates within threshold of the minimum distance (+2)
    probable_candidates = []
    for sent_candidates in all_candidates:
        sent_probable_candidates = []
        for candidate in sent_candidates:
            if candidate[1] <= min_dst+2:
                sent_probable_candidates.append(candidate[0])
        if sent_probable_candidates != []:
            probable_candidates.append(sent_probable_candidates)

    return probable_candidates



In [512]:
probable_candidates

[[9.25, 10.25], [9.75, 10.75], [9.25, 10.25]]

In [514]:
import copy

nlp = spacy.load('en')

# https://spacy.io/docs/usage/processing-text
#document = nlp(INPUT_DOCUMENT)

data = pd.read_csv('processedData.csv')

predicted_metrics = []

i = 2

for model in data['Text']:
    
    model = model.replace(TARGET_METRIC, TARGET_METRIC.replace(' ', ''))
    
    
    #print(model)
    document = nlp(model)
    
    all_candidates = [] # for all sentences
    
    for sentence in document.sents:
        sentence = sentence.string.strip()
        
        #print('sentence: ', sentence[:100])
        #print('TARGET_METRIC: ', TARGET_METRIC)
        if TARGET_METRIC.replace(' ', '') in sentence:
            
            candidates = get_candidates(sentence)
            #print('---')
            candidates = sanity_check(candidates, sentence)
            
            print('candidates after SC: ', candidates)
        
            
            if candidates != []:
                all_candidates.append(candidates)
                
    print('all candidates: ', i, all_candidates)
    
    chosen_metrics = most_probable_metric(all_candidates)
    i+=1
        
        
    print('chosen metrics: ', chosen_metrics)
    
    predicted_metrics.append(chosen_metrics)
    
    


candidates after SC:  [(12.0, 4), (16.0, 5)]
candidates after SC:  []
all candidates:  2 [[(12.0, 4), (16.0, 5)]]
all cands right before mindst:  [[(12.0, 4), (16.0, 5)]]
chosen metrics:  [[12.0, 16.0]]
all candidates:  3 []
empty allcadnidates in mindst
chosen metrics:  []
all candidates:  4 []
empty allcadnidates in mindst
chosen metrics:  []
candidates after SC:  []
all candidates:  5 []
empty allcadnidates in mindst
chosen metrics:  []
candidates after SC:  [(23.4, 5)]
candidates after SC:  []
candidates after SC:  []
candidates after SC:  [(19.4, 10), (23.4, 11)]
all candidates:  6 [[(23.4, 5)], [(19.4, 10), (23.4, 11)]]
all cands right before mindst:  [[(23.4, 5)], [(19.4, 10), (23.4, 11)]]
chosen metrics:  [[23.4]]
candidates after SC:  [(11.0, 4), (13.0, 5)]
all candidates:  7 [[(11.0, 4), (13.0, 5)]]
all cands right before mindst:  [[(11.0, 4), (13.0, 5)]]
chosen metrics:  [[11.0, 13.0]]
candidates after SC:  [(9.0, 4), (11.0, 5)]
all candidates:  8 [[(9.0, 4), (11.0, 5)]]
all

candidates after SC:  [(8.0, 16), (10.0, 16), (5.0, 12)]
all candidates:  41 [[(8.0, 16), (10.0, 16), (5.0, 12)]]
all cands right before mindst:  [[(8.0, 16), (10.0, 16), (5.0, 12)]]
chosen metrics:  [[5.0]]
candidates after SC:  [(8.5, 4), (10.5, 5)]
all candidates:  42 [[(8.5, 4), (10.5, 5)]]
all cands right before mindst:  [[(8.5, 4), (10.5, 5)]]
chosen metrics:  [[8.5, 10.5]]
candidates after SC:  [(8.0, 2), (7.0, 3)]
all candidates:  43 [[(8.0, 2), (7.0, 3)]]
all cands right before mindst:  [[(8.0, 2), (7.0, 3)]]
chosen metrics:  [[8.0, 7.0]]
all candidates:  44 []
empty allcadnidates in mindst
chosen metrics:  []
candidates after SC:  [(7.0, 6), (8.0, 7)]
all candidates:  45 [[(7.0, 6), (8.0, 7)]]
all cands right before mindst:  [[(7.0, 6), (8.0, 7)]]
chosen metrics:  [[7.0, 8.0]]
candidates after SC:  [(10.5, 4), (11.5, 6), (5.5, 20)]
candidates after SC:  [(11.0, 4), (12.0, 5), (5.5, 16)]
all candidates:  46 [[(10.5, 4), (11.5, 6), (5.5, 20)], [(11.0, 4), (12.0, 5), (5.5, 16)]]

candidates after SC:  [(15.5, 8), (18.4, 19)]
candidates after SC:  [(15.0, 4), (23.0, 5)]
candidates after SC:  []
all candidates:  83 [[(17.5, 5), (17.4, 5), (1.0, 6), (15.0, 4), (23.0, 5)], [(15.5, 8), (18.4, 19)], [(15.0, 4), (23.0, 5)]]
all cands right before mindst:  [[(17.5, 5), (17.4, 5), (1.0, 6), (15.0, 4), (23.0, 5)], [(15.5, 8), (18.4, 19)], [(15.0, 4), (23.0, 5)]]
chosen metrics:  [[17.5, 17.4, 1.0, 15.0, 23.0], [15.0, 23.0]]
candidates after SC:  [(12.0, 3)]
all candidates:  84 [[(12.0, 3)]]
all cands right before mindst:  [[(12.0, 3)]]
chosen metrics:  [[12.0]]
candidates after SC:  [(16.0, 6), (16.0, 3)]
all candidates:  85 [[(16.0, 6), (16.0, 3)]]
all cands right before mindst:  [[(16.0, 6), (16.0, 3)]]
chosen metrics:  [[16.0]]
candidates after SC:  [(9.5, 3), (13.5, 4)]
candidates after SC:  []
all candidates:  86 [[(9.5, 3), (13.5, 4)]]
all cands right before mindst:  [[(9.5, 3), (13.5, 4)]]
chosen metrics:  [[9.5, 13.5]]
candidates after SC:  []
candidates after SC

candidates after SC:  [(9.0, 4), (13.0, 5)]
candidates after SC:  [(9.0, 4), (13.0, 5)]
all candidates:  128 [[(9.0, 4), (13.0, 5)], [(9.0, 4), (13.0, 5)]]
all cands right before mindst:  [[(9.0, 4), (13.0, 5)], [(9.0, 4), (13.0, 5)]]
chosen metrics:  [[9.0, 13.0], [9.0, 13.0]]
candidates after SC:  [(7.5, 4), (8.5, 5)]
all candidates:  129 [[(7.5, 4), (8.5, 5)]]
all cands right before mindst:  [[(7.5, 4), (8.5, 5)]]
chosen metrics:  [[7.5, 8.5]]
all candidates:  130 []
empty allcadnidates in mindst
chosen metrics:  []
candidates after SC:  [(11.0, 3), (15.0, 5)]
all candidates:  131 [[(11.0, 3), (15.0, 5)]]
all cands right before mindst:  [[(11.0, 3), (15.0, 5)]]
chosen metrics:  [[11.0, 15.0]]
all candidates:  132 []
empty allcadnidates in mindst
chosen metrics:  []
all candidates:  133 []
empty allcadnidates in mindst
chosen metrics:  []
candidates after SC:  [(11.5, 5)]
all candidates:  134 [[(11.5, 5)]]
all cands right before mindst:  [[(11.5, 5)]]
chosen metrics:  [[11.5]]
candid

candidates after SC:  [(100.0, 13), (65.8, 11), (3.0, 13), (10.0, 4)]
all candidates:  179 [[(100.0, 13), (65.8, 11), (3.0, 13), (10.0, 4)]]
all cands right before mindst:  [[(100.0, 13), (65.8, 11), (3.0, 13), (10.0, 4)]]
chosen metrics:  [[10.0]]
candidates after SC:  [(7.0, 6), (8.0, 7)]
candidates after SC:  []
all candidates:  180 [[(7.0, 6), (8.0, 7)]]
all cands right before mindst:  [[(7.0, 6), (8.0, 7)]]
chosen metrics:  [[7.0, 8.0]]
candidates after SC:  [(11.7, 3), (13.7, 4)]
candidates after SC:  [(11.7, 4), (13.7, 5)]
all candidates:  181 [[(11.7, 3), (13.7, 4)], [(11.7, 4), (13.7, 5)]]
all cands right before mindst:  [[(11.7, 3), (13.7, 4)], [(11.7, 4), (13.7, 5)]]
chosen metrics:  [[11.7, 13.7], [11.7, 13.7]]
candidates after SC:  []
candidates after SC:  [(10.0, 4), (12.0, 3)]
candidates after SC:  []
all candidates:  182 [[(10.0, 4), (12.0, 3)]]
all cands right before mindst:  [[(10.0, 4), (12.0, 3)]]
chosen metrics:  [[10.0, 12.0]]
candidates after SC:  [(10.5, 4), (12

candidates after SC:  [(9.0, 4), (10.0, 5)]
candidates after SC:  []
all candidates:  236 [[(9.0, 4), (10.0, 5)]]
all cands right before mindst:  [[(9.0, 4), (10.0, 5)]]
chosen metrics:  [[9.0, 10.0]]
candidates after SC:  [(13.0, 4), (17.0, 5)]
all candidates:  237 [[(13.0, 4), (17.0, 5)]]
all cands right before mindst:  [[(13.0, 4), (17.0, 5)]]
chosen metrics:  [[13.0, 17.0]]
all candidates:  238 []
empty allcadnidates in mindst
chosen metrics:  []
candidates after SC:  [(10.0, 4), (12.0, 5)]
candidates after SC:  [(4.4, 6), (7.8, 7)]
all candidates:  239 [[(10.0, 4), (12.0, 5)], [(4.4, 6), (7.8, 7)]]
all cands right before mindst:  [[(10.0, 4), (12.0, 5)], [(4.4, 6), (7.8, 7)]]
chosen metrics:  [[10.0, 12.0], [4.4]]
all candidates:  240 []
empty allcadnidates in mindst
chosen metrics:  []
candidates after SC:  [(3.0, 12)]
candidates after SC:  [(8.0, 6), (10.0, 4)]
candidates after SC:  []
candidates after SC:  [(8.0, 6), (10.0, 4)]
candidates after SC:  []
all candidates:  241 [[(3

In [515]:
predicted_metrics

[[[12.0, 16.0]],
 [],
 [],
 [],
 [[23.4]],
 [[11.0, 13.0]],
 [[9.0, 11.0]],
 [],
 [[16.0, 17.0]],
 [[9.0, 12.0]],
 [[5.2, 6.1]],
 [],
 [],
 [[9.25, 10.25], [9.75, 10.75], [9.25, 10.25]],
 [[9.5, 11.5]],
 [[9.0, 11.0]],
 [],
 [[9.5, 11.5]],
 [[10.5, 12.5, 1.0, 2.0]],
 [[10.8, 12.8], [9.1, 11.1]],
 [],
 [[8.0, 10.0], [9.0, 11.0]],
 [[12.75, 13.75], [10.0, 11.0]],
 [[9.5]],
 [[9.0, 11.0]],
 [[10.0, 12.0]],
 [[17.8, 20.8]],
 [[18.3, 20.3, 2.0, 4.0]],
 [[25.0, 23.3], [25.0, 23.7]],
 [],
 [[10.0, 12.0]],
 [],
 [],
 [[19.0, 21.0]],
 [[7.0, 8.0]],
 [[7.3, 9.3], [7.7]],
 [[10.2], [16.8]],
 [[13.5]],
 [[12.0, 14.0]],
 [[5.0]],
 [[8.5, 10.5]],
 [[8.0, 7.0]],
 [],
 [[7.0, 8.0]],
 [[10.5, 11.5], [11.0, 12.0]],
 [[7.0]],
 [[8.5, 10.0]],
 [[10.0]],
 [[9.5, 11.5], [9.5, 11.5], [9.5, 11.5], [9.5, 11.5]],
 [[9.5, 10.5]],
 [[11.5, 12.5]],
 [[12.0, 16.0]],
 [],
 [[10.75, 11.75], [10.75, 11.75]],
 [],
 [],
 [],
 [[11.95, 19.0, 25.0]],
 [[8.5, 10.5], [8.5, 10.5]],
 [[9.0, 11.0], [8.0, 10.0, 1.0], [8.0, 10.0

In [139]:
spacy.displacy.serve(sentence, style='dep')

  "__main__", mod_spec)



Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
