CS551g Data Mining and Visualisation  <br>
Practical 3 - Summarisation with LexRank <br>

LexRank is an unsupervised approach to text summarization based on graph-based centrality
scoring of sentences. The main idea is that sentences "recommend" other similar sentences
to the reader. Thus, if one sentence is very similar to many others, it will likely be a sentence
of great importance. The importance of this sentence also stems from the importance of the
sentences "recommending" it. Thus, to get ranked highly and placed in a summary, a
sentence must be similar to many sentences that are in turn also similar to many other
sentences. This makes intuitive sense and allows the algorithm to be applied to any
new text without labelled training data.
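
The "recommendation" idea above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the implementation in the lexrank package: it assumes a simple bag-of-words cosine similarity (LexRank itself uses an IDF-weighted cosine), omits damping, and the names `cosine_similarity` and `toy_lexrank` are hypothetical. Passing a `threshold` binarises the similarity graph (classical variant); leaving it as `None` keeps the raw similarities as edge weights (continuous variant).

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_lexrank(sentences, threshold=None, iterations=100):
    """Score sentences by their centrality in a sentence-similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    sim = [[cosine_similarity(bags[i], bags[j]) for j in range(n)]
           for i in range(n)]

    # Classical variant: keep an edge only if similarity exceeds the threshold.
    if threshold is not None:
        sim = [[1.0 if v > threshold else 0.0 for v in row] for row in sim]

    # Row-normalise so each sentence splits its "recommendation" among neighbours.
    transition = []
    for row in sim:
        total = sum(row)
        transition.append([v / total for v in row] if total else [1.0 / n] * n)

    # Power iteration: repeatedly pass scores along the edges of the graph.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [sum(scores[i] * transition[i][j] for i in range(n))
                  for j in range(n)]
    return scores
```

Calling `toy_lexrank` on a handful of sentences returns one score per sentence, in input order, and the scores sum to 1. A sentence that overlaps with many others ends up with a higher score than one that shares little vocabulary with the rest.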

More information: https://github.com/crabcamp/lexrank

**STEP 1: Install lexrank**

In [None]:
%pip install lexrank

**STEP 2: Upload Data**<br>
Upload the .txt documents found in docs(Part2).zip. The docs folder contains 12 documents,
organized into 4 clusters of 3 documents (doc1-2.txt refers to document #2 in cluster #1).
Each document is a news article on the topic of Angelina Jolie's preventative cancer
treatment in recent years. Each cluster consists of 3 articles: one from the _Daily Mail_,
another from the _Mirror_, and the last from the _Belfast Telegraph_ (except for cluster #1,
where the last is from the _New York Daily News_). The articles in a cluster were published
around the same time and cover the same subject matter.
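
If you want to work with one cluster at a time, the naming convention above can be used to group the files. The following is a minimal sketch that assumes every file follows the `doc<cluster>-<document>.txt` pattern; the helper name `group_by_cluster` is hypothetical and is not part of the lexrank package.

```python
import re
from collections import defaultdict

# Assumed naming pattern: doc<cluster>-<document>.txt, e.g. doc1-2.txt
DOC_NAME = re.compile(r"doc(\d+)-(\d+)\.txt$")


def group_by_cluster(file_names):
    """Map each cluster number to its file names, ordered by document number."""
    clusters = defaultdict(list)
    for name in file_names:
        match = DOC_NAME.search(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        cluster, document = int(match.group(1)), int(match.group(2))
        clusters[cluster].append((document, name))
    return {c: [n for _, n in sorted(docs)]
            for c, docs in sorted(clusters.items())}
```

Each value in the returned dictionary is the list of file names for one cluster, so the three articles covering the same story can be read and summarised together.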

In [3]:
from lexrank.algorithms.summarizer import LexRank 
from lexrank.mappings.stopwords import STOPWORDS
from path import Path

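documents = []
documents_dir = Path('./')  # directory containing the uploaded .txt files

# read each document as a list of lines; LexRank uses this corpus to compute IDF weights
for file_path in documents_dir.files('*.txt'):
    with file_path.open(mode='rt', encoding='utf-8') as fp:
        documents.append(fp.readlines())

# build the summarizer, ignoring English stopwords
lxr = LexRank(documents, stopwords=STOPWORDS['en'])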



**STEP 3: Run the code and inspect the results**

In [None]:
sentences = [
    "Angelina Jolie's openness about breast cancer has inspired growing numbers of British women to have the same lifesaving operation.",

"The Tomb Raider star made headlines around the world when, in May 2013, she announced she had both breasts removed to prevent her developing cancer.",

"She took the radical action after discovering she carried faulty DNA that gave her an 87 per cent chance of developing breast cancer - eight times the chance for an average woman.",

"With the flawed BRCA1 gene also greatly increasing her the odds of ovarian cancer, she had a second round of major surgery in which her ovaries and Fallopian tubes removed.",

"Her frankness about the surgery is believed to have raised awareness of the procedures and removed some of the fear that surrounds them.",

"Referrals for genetic counselling and testing more than doubled after she spoke out and, as a result, more women are choosing to have both breasts removed to prevent the disease and potentially save their lives.",

"Figures from a leading clinic in Manchester show 83 double mastectomies were carried out in the 18 months from January 2014 - almost treble the figure for the previous year and a half.",

"Many of these women carried the same gene as Miss Jolie, the journal Breast Cancer Research reports.",

"Doctors at the Genesis Breast Cancer Prevention Centre say the rise is likely due to the 'Angelina effect'.",

"Singer Kylie Minogue's breast cancer diagnosis and reality TV star Jade Goody's battle with cervical cancer both led to increased awareness.",

"But the 'Angelina effect' is thought to be bigger and longer-lasting.",

"The Manchester researchers said that her 'glamorous image' and marriage to Brad Pitt may have lessened patients' fears about the surgery and its effect on their sexuality.",

"Lester Barr, chairman of Genesis Breast Cancer Prevention, said: 'Scheduling a preventative double mastectomy is not an easy decision to make, but it's encouraging to know that many of these women will have drawn comfort from by the fact that such a high-profile star had undergone the surgery.",

"Breast Cancer Now said that GPs are referring women appropriately, rather than simply sending on the 'worried well'",

"Yinka Ebo, the charity's health information manager, said: 'The challenge now is to ensure that genetic testing services can cope with the increased demand and that women and GPs have access to the information they need.'",
    ]

# get summary with classical LexRank algorithm
summary = lxr.get_summary(sentences, summary_size=2, threshold=.1)
print(summary)


# get summary with continuous LexRank
summary_cont = lxr.get_summary(sentences, threshold=None)
print(summary_cont)



# get LexRank scores for sentences
# 'fast_power_method' speeds up the calculation, but requires more RAM
scores_cont = lxr.rank_sentences(
    sentences,
    threshold=None,
    fast_power_method=False,
)
print(scores_cont)
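
When inspecting the results, it can help to see each score next to its sentence. The sketch below assumes that the scores are returned in the same order as the input sentences; the helper `rank_with_scores` is hypothetical, and the values in the usage note are made-up illustrative numbers, not real LexRank output.

```python
def rank_with_scores(sentences, scores, top_n=3):
    """Pair each sentence with its score and return the top_n highest-scoring pairs."""
    # sorted() is stable, so tied sentences keep their original order
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]
```

For example, `rank_with_scores(["a", "b", "c"], [0.9, 1.3, 1.0], top_n=2)` pairs "b" with 1.3 and "c" with 1.0. In this notebook you could call `rank_with_scores(sentences, scores_cont)` once the cell above has run.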

