In [1]:
from src.modelling.extractive_model import ExtractiveSummarizer
import os

In [7]:
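DATA_ORIGINAL = 'news'        # column holding the full article text
DATA_SUMMARIZED = 'summary'   # column holding the dataset's reference summary
TEST_INDEX = 1                # row used for the single-article demo
SAMPLE_SIZE = 10              # number of articles drawn for evaluation
RANDOM_SEED = 42              # fixed seed so the sample is reproducible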

In [4]:
summarizer = ExtractiveSummarizer()
# Pass path components separately so the path also resolves on non-Windows systems
data_path = os.path.join(os.getcwd(), "data", "bbc_news.csv")
data = summarizer._load_data(data_path)

#### Original text

In [5]:
input_text = data[DATA_ORIGINAL][TEST_INDEX]
print(input_text)

b'Dollar gains on Greenspan speech\n\nThe dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.\n\nAnd Alan Greenspan highlighted the US government\'s willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan\'s speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. "I think the chairman\'s taking a much more sanguine view on the current account deficit than he\'s taken for some time," said Robert Sinche, head of currency strategy at Bank of America in New York. "He\'s taking a longer-term view, laying out a set of conditions 
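
The printed article starts with `b'` and shows escapes such as `\n` and `\'`, which suggests the column may hold the string form of a bytes literal rather than plain text. Whether `_load_data` or `predict` already handles this is not shown here, so the helper below is only a sketch under that assumption; `decode_bytes_literal` is a hypothetical name, not part of `ExtractiveSummarizer`.

```python
import ast


def decode_bytes_literal(text):
    """Return plain text for input like "b'...'"; leave anything else unchanged."""
    if isinstance(text, bytes):
        return text.decode("utf-8")
    if isinstance(text, str) and text[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(text)  # safely parses Python literals only
        except (ValueError, SyntaxError):
            return text  # malformed or truncated literal: keep the input as-is
        if isinstance(value, bytes):
            return value.decode("utf-8")
    return text


# Made-up sample in the same shape as the dataset rows
sample = "b'A \\xc2\\xa31.3bn offer\\n\\nIt\\'s here'"
print(decode_bytes_literal(sample))
```

If the assumption holds, the helper could be applied to a column with `data[DATA_ORIGINAL].map(decode_bytes_literal)` before summarizing, which should stop escape sequences such as `\'` from leaking into the generated summary.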

#### Reference summary from the dataset

In [8]:
actual_summarized_text = data[DATA_SUMMARIZED][TEST_INDEX]
print(actual_summarized_text)

b'The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.China\'s currency remains pegged to the dollar and the US currency\'s sharp falls in recent months have therefore made Chinese export prices highly competitive.Market concerns about the deficit has hit the greenback in recent months."I think the chairman\'s taking a much more sanguine view on the current account deficit than he\'s taken for some time," said Robert Sinche, head of currency strategy at Bank of America in New York.The recent falls have partly been the result of big budget deficits, as well as the US\'s yawning current account gap, both of which need to be funded by the buying of US bonds and assets by foreign firms and governments."He\'s taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next."'


#### Summary generated by the model

In [10]:
summary = summarizer.predict(input_text)
print(summary)

China\'s currency remains pegged to the dollar and the US currency\'s sharp falls in recent months have therefore made Chinese export prices highly competitive. "He\'s taking a longer-term view, laying out a set of conditions under which the current account deficit can improve this year and next. The half-point window, some believe, could be enough to keep US assets looking more attractive, and could help prop up the dollar. The White House will announce its budget on Monday, and many commentators believe the deficit will remain at close to half a trillion dollars.\n' Market concerns about the deficit has hit the greenback in recent months. "\n\nWorries about the deficit concerns about China do, however, remain. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday.


#### Evaluation

In [6]:
summarizer = ExtractiveSummarizer()

data_path = os.path.join(os.getcwd(), "data", "bbc_news.csv")
data = summarizer._load_data(data_path)
data

Unnamed: 0,news,summary,type
0,b'Ad sales boost Time Warner profit\n\nQuarter...,"b""TimeWarner said fourth quarter sales rose 2%...",business
1,b'Dollar gains on Greenspan speech\n\nThe doll...,b'The dollar has hit its highest level against...,business
2,b'Yukos unit buyer faces loan claim\n\nThe own...,b'Yukos\' owner Menatep Group says it will ask...,business
3,b'High fuel prices hit BA\'s profits\n\nBritis...,"b'Rod Eddington, BA\'s chief executive, said t...",business
4,"b""Pernod takeover talk lifts Domecq\n\nShares ...","b""Pernod has reduced the debt it took on to fu...",business
...,...,...,...
2220,b'BT program to beat dialler scams\n\nBT is in...,b'BT is introducing two initiatives to help be...,tech
2221,b'Spam e-mails tempt net shoppers\n\nComputer ...,b'A third of them read unsolicited junk e-mail...,tech
2222,b'Be careful how you code\n\nA new European di...,"b""This goes to the heart of the European proje...",tech
2223,b'US cyber security chief resigns\n\nThe man m...,"b""Amit Yoran was director of the National Cybe...",tech


In [16]:
# Draw SAMPLE_SIZE source articles; the fixed seed makes the draw reproducible
reference_data = data[DATA_ORIGINAL].sample(n=SAMPLE_SIZE, random_state=RANDOM_SEED)
reference_data

414     b'UK house prices dip in November\n\nUK house ...
420     b'LSE \'sets date for takeover deal\'\n\nThe L...
1644    b'Harinordoquy suffers France axe\n\nNumber ei...
416     b'Barclays shares up on merger talk\n\nShares ...
1232    b'Campaign \'cold calls\' questioned\n\nLabour...
1544    b'Wolves appoint Hoddle as manager\n\nGlenn Ho...
1748    b'Hantuchova in Dubai last eight\n\nDaniela Ha...
1264    b'BAA support ahead of court battle\n\nUK airp...
629     b'\'My memories of Marley...\'\n\nTo mark the ...
1043    b'Labour trio \'had vote-rig factory\'\n\nThre...
Name: news, dtype: object

In [18]:
# Same n and seed as for the articles, so the sampled row indices match
reference_summary = data[DATA_SUMMARIZED].sample(n=SAMPLE_SIZE, random_state=RANDOM_SEED)
reference_summary

414     b'All areas saw a rise in annual house price i...
420     b'A \xc2\xa31.3bn offer from Deutsche Boerse h...
1644    b'Harinordoquy was a second-half replacement i...
416     b"Shares in UK banking group Barclays have ris...
1232    b'Assistant information commissioner Phil Jone...
1544    b'Gray, who has been caretaker manager, was as...
1748    b'Williams was also far from content.Davenport...
1264    b'"We do not underestimate the scale of the ch...
629     b'"Bob was a good player.That was the end of i...
1043    b'"When the officers left, all the envelopes a...
Name: summary, dtype: object

In [15]:
# Generate one summary per sampled article
candidate_data = reference_data.map(summarizer.predict)
candidate_data

414     But while the monthly figures may hint at a co...
420     Speculation suggests that Paris-based Euronext...
1644    France, who lost to Wales last week, must defe...
416     Its North American divisions focus on business...
1232    Assistant information commissioner Phil Jones ...
1544    "I\'m delighted to be here," said Hoddle.\n\n"...
1748    "I started well and finished well, but played ...
1264    HACAN chairman John Stewart said: "Almost exac...
629     "I\'m sure if he were alive today he would bel...
1043    The case against the men follows a hearing int...
Name: news, dtype: object

In [17]:
# Score the generated summaries against the full source articles
precision, recall, f1 = summarizer.evaluation(preds=candidate_data, refs=reference_data)
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1: {f1}')

calculating scores...
computing bert embedding.


computing greedy matching.


done in 97.06 seconds, 0.10 sentences/sec
Precision: 0.9311454892158508
Recall: 0.8612416982650757
F1: 0.8946964144706726
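
A quick arithmetic check on the numbers above: the harmonic mean of the printed precision and recall is close to the printed F1 but not identical. One plausible explanation, which is an assumption because the body of `evaluation` is not shown, is that F1 is computed per article and then averaged, rather than derived from the averaged precision and recall.

```python
# Values copied from the printed output above
precision = 0.9311454892158508
recall = 0.8612416982650757
reported_f1 = 0.8946964144706726

# F1 computed from the averaged precision and recall
harmonic = 2 * precision * recall / (precision + recall)

print(f"harmonic mean of averages: {harmonic:.6f}")
print(f"reported F1:               {reported_f1:.6f}")
print(f"difference:                {harmonic - reported_f1:.6f}")
```

The harmonic mean is a concave function of its two arguments, so an average of per-article F1 values cannot exceed the F1 of the averaged precision and recall. The small positive gap is consistent with that reading.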


In [19]:
# Score the generated summaries against the dataset's reference summaries
precision, recall, f1 = summarizer.evaluation(preds=candidate_data, refs=reference_summary)
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1: {f1}')

calculating scores...
computing bert embedding.


computing greedy matching.


done in 51.34 seconds, 0.19 sentences/sec
Precision: 0.9008392095565796
Recall: 0.8920493125915527
F1: 0.8963781595230103
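
The log lines "computing bert embedding" and "computing greedy matching" resemble the verbose output of a BERTScore-style metric, although the implementation of `summarizer.evaluation` is not shown here. As a rough illustration of what greedy matching means, the sketch below scores two made-up sets of 2-D token vectors in plain Python. The function names and toy vectors are hypothetical, and a real metric would use contextual embeddings from a language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Precision, recall and F1 from greedy token matching."""
    # Precision: each candidate token takes its most similar reference token
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token takes its most similar candidate token
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 0.0 if total == 0 else 2 * precision * recall / total
    return precision, recall, f1


# Toy "embeddings": the candidate covers two of the three reference directions
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

p, r, f = greedy_match_scores(candidate, reference)
print(f"Precision: {p:.4f}  Recall: {r:.4f}  F1: {f:.4f}")
```

In the toy case every candidate vector has an exact match in the reference, so precision is 1, while the unmatched reference vector pulls recall down. The first evaluation run shows the same pattern: when summaries are scored against full articles, precision (0.93) exceeds recall (0.86), as would be expected when the candidate is much shorter than the reference.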
