## Extractive Approach

The extractive approach selects sentences directly from the document according to a scoring function and assembles them into a summary. It works by identifying the most important parts of the text, extracting them, and combining them into a condensed version.

This project summarizes text using the TextRank algorithm, an extractive, unsupervised algorithm that ranks sentences with PageRank (see [NetworkX PageRank](https://networkx.org/documentation/networkx-1.2/reference/generated/networkx.pagerank.html)).

## Steps

* Tokenize the texts into sentences to obtain a single list of all sentences
* Clean the sentences (remove punctuation, lowercase, remove stopwords)
* Create a word embeddings dictionary from the GloVe dataset
* Compute a vector for every sentence
* Compute a similarity matrix
* Apply the PageRank algorithm to find the most important sentences

In [2]:
import pandas as pd
import itertools
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
import random
import re

In [3]:
# tennis.csv contains 8 online articles about tennis
df = pd.read_csv("data/tennis.csv")
df

Unnamed: 0,article_id,article_text,source
0,1,Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,"BASEL, Switzerland (AP), Roger Federer advance...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...
5,6,Nadal has not played tennis since he was force...,https://www.express.co.uk/sport/tennis/1037119...
6,7,"Tennis giveth, and tennis taketh away. The end...",http://www.tennis.com/pro-game/2018/10/tennisc...
7,8,Federer won the Swiss Indoors last week by bea...,https://www.express.co.uk/sport/tennis/1038186...


In [4]:
# Show the 8 articles
for art in df['article_text']:
    print(f'{art}\n\n')

Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same s

## Tokenize the texts into sentences

Using [nltk.tokenize](https://www.nltk.org/api/nltk.tokenize.html), split the texts into lists of sentences and then create a single list of all the sentences.

In [8]:
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))

# Extract the longest sentence of each text to check whether
# the algorithm performs better than just picking the longest
# sentence
longest_sentences = []

for item in sentences:
    sorteditems = sorted(item, key=len)
    longest_sentences.append((sorteditems[-1]))

longest_sentences

["When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
 'Copil fired 26 aces past Zverev and never dropped serve, clinching after 2 1/2 hours with a forehand volley winner to break Zverev for the second time in the semifinal.',
 'The 20-time Grand Slam champion has voiced doubts about the wisdom of the one-week format to be introduced by organisers Kosmos, who have promised the International Tennis Federation up to $3 billion in prize money over the next quarter-century.',
 'Kei Nishikori will try to end his long losing streak in ATP finals and Kevin Anderson will go for his second title of the year at the Erste Bank Open on Sunday.',
 '"Not always, but I really feel like in the mid-2000 years there was a huge shift of the attit

In [9]:
# Flatten nested list of sentences 
# to have a single list of all sentences
sentences = list(itertools.chain(*sentences))
sentences[:5]

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
 "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.",
 'I think everyone knows this is my job here.',
 "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
 "I'm a pretty competitive girl."]

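The output above contains one sentence that was not split: in `across the net.So I'm not`, the period is not followed by a space, so the tokenizer keeps both parts together. A minimal sketch of a possible pre-processing step follows. It assumes that a period sitting directly between a lowercase and an uppercase letter always marks a sentence boundary, which can fail for some abbreviations or names. `add_missing_sentence_spaces` is a hypothetical helper, not part of the notebook.

```python
import re

def add_missing_sentence_spaces(text: str) -> str:
    """Insert a space after a period that sits directly between
    a lowercase letter and an uppercase letter (assumed boundary)."""
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)

raw = "I want to beat every single person across the net.So I'm not the one to strike up a conversation."
fixed = add_missing_sentence_spaces(raw)
print(fixed)
```

Applied to each article before `sent_tokenize`, this would likely let the tokenizer see the boundary; decimal numbers such as `2.5` are left untouched because the pattern requires letters on both sides.
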
## Cleaning the sentences

* replace every non-alphabetic character with a space
* lowercase all the words
* remove all stopwords as per `nltk.corpus.stopwords`

In [10]:
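# regex=True is required in pandas >= 2.0, where str.replace treats the pattern literally by default
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)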
clean_sentences = [s.lower() for s in clean_sentences]
stop_words = stopwords.words('english')

def remove_stopwords(words: list) -> str:
    """Remove stopwords from a list of words and return the remaining words
       as a single string joined by spaces.
    """
    sentence = " ".join([word for word in words if word not in stop_words])
    return sentence

clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
print(sentences[:5])
clean_sentences[:5]

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.', "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.", 'I think everyone knows this is my job here.', "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.", "I'm a pretty competitive girl."]


['maria sharapova basically friends tennis players wta tour',
 'russian player problems openly speaking recent interview said really hide feelings much',
 'think everyone knows job',
 'courts court playing competitor want beat every single person whether locker room across net one strike conversation weather know next minutes go try win tennis match',
 'pretty competitive girl']
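
The three cleaning steps can also be combined into one function that works on a single sentence. The sketch below is only an illustration: `clean_sentence` and `TOY_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for `stopwords.words('english')` so that the example runs without NLTK data. Using a set instead of a list makes the membership test faster.

```python
import re

# Hypothetical, deliberately tiny stopword set (assumption: stands in for the NLTK list).
TOY_STOPWORDS = frozenset({"i", "m", "a", "the", "is", "this", "my", "here"})

def clean_sentence(sentence: str, stop_words=TOY_STOPWORDS) -> str:
    # 1. replace every non-alphabetic character with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)
    # 2. lowercase and split on whitespace
    words = letters_only.lower().split()
    # 3. drop stopwords and join the rest with single spaces
    return " ".join(w for w in words if w not in stop_words)

cleaned = clean_sentence("I'm a pretty competitive girl.")
print(cleaned)
```
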

## Create a word embeddings dictionary

The project uses pre-trained word vectors created with the unsupervised learning algorithm GloVe ([GloVe: Global Vectors for Word Representation at Stanford](https://nlp.stanford.edu/projects/glove/)). Specifically, it uses the Wikipedia 2014 + Gigaword 5 vectors (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download), which are published under the `Public Domain Dedication and License v1.0`.

The dictionary `word_embeddings` maps each word (key) to its 300-dimensional vector (value).

In [11]:
word_embeddings = {}

# The context manager closes the file even if parsing fails
with open('../input/glove6b/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [12]:
word_embeddings['world']

array([-0.25831  ,  0.43644  , -0.1138   , -0.5259   ,  0.20213  ,
        0.95247  , -0.58764  , -0.047001 , -0.053704 , -1.744    ,
        0.99583  ,  0.063464 , -0.093147 , -0.26441  , -0.28676  ,
       -0.52357  , -0.17867  ,  0.18171  , -0.71696  , -0.13301  ,
        0.42476  ,  0.42044  ,  0.3775   ,  0.082431 ,  0.13154  ,
       -0.10151  , -0.11898  ,  0.029509 , -0.39635  ,  0.26516  ,
       -0.55091  ,  0.23805  , -0.018748 , -0.039944 , -1.1972   ,
        0.13567  ,  0.09371  , -0.60134  ,  0.12887  ,  0.34876  ,
       -0.25588  , -0.33466  ,  0.069678 ,  0.5429   ,  0.25246  ,
        0.17249  ,  0.099885 ,  0.099456 , -0.01592  ,  0.2617   ,
        0.36155  , -0.12417  ,  0.27516  ,  0.037434 , -0.075003 ,
        0.61096  ,  0.05261  ,  0.017307 ,  0.12576  , -0.11952  ,
       -0.49077  ,  0.026711 , -0.27187  , -0.15268  , -0.22147  ,
        0.18131  , -0.045344 ,  0.76151  ,  0.17489  , -0.44112  ,
        0.027347 ,  0.42676  , -0.0069618, -0.60233  , -0.0166
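
The GloVe text files store one token per line, followed by its vector components separated by spaces. The sketch below parses that layout from an in-memory sample so the format can be tried out without the 822 MB download. The three-dimensional sample values and the names `load_embeddings` and `toy_embeddings` are made up for illustration, and plain Python lists are used instead of NumPy arrays.

```python
import io

# Hypothetical sample in the GloVe text layout (real vectors have 300 components).
sample = io.StringIO("world 0.1 -0.2 0.3\ntennis 0.5 0.0 -0.5\n")

def load_embeddings(lines):
    """Parse 'token c1 c2 ...' lines into a {token: [floats]} dictionary."""
    embeddings = {}
    for line in lines:
        token, *components = line.split()
        embeddings[token] = [float(c) for c in components]
    return embeddings

toy_embeddings = load_embeddings(sample)

# Sanity check: every vector should have the same number of components.
dimensions = {len(vec) for vec in toy_embeddings.values()}
print(toy_embeddings["world"], dimensions)
```
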

## Compute a vector for every sentence

In the last step, the dictionary `word_embeddings` was created. It contains a 300-dimensional vector for every word in the GloVe vocabulary. In this step, a 300-dimensional vector is composed for every sentence by summing the vectors of its words. For normalization, the sentence vector is divided by the number of words in the sentence (plus a small constant of 0.001).
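
Written as a formula (a restatement of the description above; the symbols $n$, $w_i$ and $\mathbf{e}(\cdot)$ are introduced here only for illustration), the vector of a sentence $s$ with the words $w_1, \dots, w_n$ is

$$
\mathbf{v}_s = \frac{1}{n + 0.001} \sum_{i=1}^{n} \mathbf{e}(w_i)
$$

where $\mathbf{e}(w_i)$ is the GloVe vector of $w_i$, or the zero vector if the word is not in the dictionary. The constant $0.001$ is the one used in the notebook's code. For example, a two-word sentence with the word vectors $(1, 0)$ and $(0, 1)$ gets the vector $(1, 1) / 2.001 \approx (0.49975, 0.49975)$.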

In [13]:
sentence_vectors = []
words_not_in_word_embeddings = set()

def get_vectors_from_word_embeddings(w):
    try:
        vec = word_embeddings[w]
    except KeyError:
        vec = np.zeros((300,))
        words_not_in_word_embeddings.add(w)
    return vec
        
for sentence in clean_sentences:
    if len(sentence) != 0:
        word_list = sentence.split()
        vec_list = []
        for w in word_list:
            vec = get_vectors_from_word_embeddings(w)
            vec_list.append(vec)
        v = sum(vec_list)/(len(word_list)+0.001)
    else:
        v = np.zeros((300,))
    sentence_vectors.append(v)

In [14]:
words_not_in_word_embeddings

{'cecchinato', 'khachanov', 'kranjovic', 'struff', 'tsitsipas'}
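
In the loop above, words that are missing from `word_embeddings` (such as `tsitsipas`) add a zero vector to the sum but still count in the denominator, so they shrink the sentence vector. A possible variation, sketched here with toy two-dimensional vectors, averages only over the known words. `average_known_vectors` and the toy dictionary are hypothetical and not part of the notebook.

```python
def average_known_vectors(words, embeddings, dim):
    """Average the vectors of the words found in `embeddings`;
    return a zero vector if none of the words is known."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the vectors component by component
    return [sum(component) / len(known) for component in zip(*known)]

toy = {"tennis": [1.0, 0.0], "match": [0.0, 1.0]}  # made-up vectors
vec = average_known_vectors(["tennis", "tsitsipas", "match"], toy, dim=2)
print(vec)
```

Because cosine similarity ignores the length of a vector, this change should not affect the similarity matrix computed in the next step; it only matters if the magnitudes of the sentence vectors are used elsewhere.
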

In [15]:
print(len(sentences))
print(len(clean_sentences))
len(sentence_vectors)

119
119


119

In [16]:
sentence_vectors[0]

array([-4.61076051e-02,  1.29591778e-01, -5.93765713e-02, -9.25393030e-02,
        9.77605209e-02,  3.80744934e-02, -5.39967835e-01, -1.52831122e-01,
       -1.40714034e-01, -6.11173630e-01,  4.43402022e-01, -5.85616753e-02,
        2.28641272e-01, -1.77562177e-01, -3.79852504e-01, -1.27993613e-01,
       -1.60404950e-01,  1.98072717e-02,  2.03368679e-01,  1.93133727e-01,
       -3.58894654e-02, -4.78439452e-03, -3.17186564e-01, -3.08756888e-01,
       -2.04970121e-01, -1.72748305e-02, -6.77067935e-02,  3.13480824e-01,
       -1.85879245e-02,  1.04714625e-01, -6.04711846e-03,  2.00459920e-02,
        5.60364872e-02,  7.86496699e-02, -8.38511407e-01,  6.67703524e-02,
        3.29794616e-01,  7.84649476e-02, -9.31521133e-02,  8.28586444e-02,
        1.80333838e-01, -6.22619689e-01, -7.29705393e-02,  1.25791267e-01,
        2.01891467e-01,  7.65589178e-02,  3.84096801e-01, -1.04883015e-02,
       -1.06981210e-01,  1.31630525e-01,  5.08688867e-01,  2.08538920e-02,
        5.01151104e-03, -

## Compute a similarity matrix

Use `cosine_similarity` from `sklearn` ([Cosine similarity](https://scikit-learn.org/stable/modules/metrics.html#cosine-similarity)) to compute how similar each sentence is to every other sentence in the dataset.

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

# Spoiler alert: 119
dataset_length = len(sentences)

similarity_matrix = np.zeros([dataset_length, dataset_length])
for i in range(dataset_length):
    for j in range(dataset_length):
        # ignore the diagonal
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,300), sentence_vectors[j].reshape(1,300))[0,0]


In [18]:
# 119x119 Matrix
similarity_matrix[118][117]

0.6195055246353149
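
To show what each entry of the matrix measures, here is a dependency-free sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. It is not how `sklearn` implements it. The names `cosine` and `similarity_matrix_from`, the toy vectors, and the choice to return 0 for all-zero vectors are assumptions made for this illustration. Since the measure is symmetric, only the upper triangle is computed and then mirrored.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty sentences
    return dot / (norm_u * norm_v)

def similarity_matrix_from(vectors):
    """Pairwise cosine similarities with a zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = cosine(vectors[i], vectors[j])
    return matrix

toy_vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]  # made-up sentence vectors
toy_matrix = similarity_matrix_from(toy_vectors)
print(toy_matrix)
```

The first two toy vectors point in the same direction and differ only in length, so their similarity is 1; the third is orthogonal to both, so its similarity to them is 0.
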

## Apply PageRank algorithm

Run the PageRank algorithm on a graph built from the similarity matrix to score how relevant each sentence is. See [NetworkX PageRank](https://networkx.org/documentation/networkx-1.2/reference/generated/networkx.pagerank.html).

### From scores to ranked sentences

The whole NetworkX part is still a bit unclear to me at the moment.

In [19]:
import networkx as nx

nx_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(nx_graph)
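
To make the `nx.pagerank` call less of a black box, the following dependency-free sketch shows the power iteration behind PageRank on a weighted graph. It is an illustration under stated assumptions, not NetworkX's actual implementation: it assumes non-negative edge weights, uses a damping factor of 0.85 (the NetworkX default for `alpha`), and the function name `pagerank_power_iteration` and the toy matrix are made up. Each sentence repeatedly passes its current score to its neighbours in proportion to the edge weights, so a sentence that is similar to many other sentences collects a high score.

```python
def pagerank_power_iteration(weights, damping=0.85, tol=1e-6, max_iter=100):
    """Approximate PageRank scores for a square matrix of non-negative weights."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    ranks = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        # Score held by nodes without outgoing weight is spread uniformly.
        dangling = sum(ranks[i] for i in range(n) if row_sums[i] == 0)
        new_ranks = []
        for j in range(n):
            incoming = sum(
                ranks[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_ranks.append((1 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if change < tol:
            break
    return ranks

# Toy similarity matrix: sentence 0 is similar to both other sentences.
toy_similarity = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
ranks = pagerank_power_iteration(toy_similarity)
print(ranks)
```

In the toy example, sentence 0 ends up with the highest score, and the scores sum to 1, like the values in the dictionary returned by `nx.pagerank`.
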

In [20]:
scores

{0: 0.007885822078952403,
 1: 0.00856722385846422,
 2: 0.007818889627019092,
 3: 0.009937108623006664,
 4: 0.006948421240372102,
 5: 0.007465147237139853,
 6: 0.008104859289518476,
 7: 0.008218248700946833,
 8: 0.008788236303539901,
 9: 0.008068025264877293,
 10: 0.0012695725958021694,
 11: 0.009282832292337088,
 12: 0.008035558063571544,
 13: 0.00790518618590624,
 14: 0.008738186300099638,
 15: 0.008412522857862509,
 16: 0.007651314728852419,
 17: 0.007957040155162624,
 18: 0.008283437931987365,
 19: 0.009176327507459128,
 20: 0.009004824177569802,
 21: 0.007182012332588499,
 22: 0.008304125497996579,
 23: 0.00929505568108357,
 24: 0.007429506687507387,
 25: 0.006014473001368166,
 26: 0.007859709788318944,
 27: 0.009081402102427456,
 28: 0.009463532359947967,
 29: 0.009432608476160597,
 30: 0.009726397109355165,
 31: 0.0094323888284346,
 32: 0.006380907993610014,
 33: 0.008868253130682065,
 34: 0.009292494625094499,
 35: 0.009576898266548057,
 36: 0.00738032159813084,
 37: 0.009164544

In [21]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [22]:
print(len(ranked_sentences))
ranked_sentences[:5]

119


[(0.009937108623006664,
  "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match."),
 (0.009885503389801521,
  'Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.'),
 (0.009800672148492593,
  '"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.'),
 (0.009726397109355165,
  'Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.'),
 (0.

In [23]:
# ranked_sentences is a list of tuples (score, sentence) covering the sentences of all texts,
# sorted by score in descending order. It does not hold one top sentence per text.
# How can this be mapped back to the texts? Why do we need to compute the pagerank for every sentence against every other sentence?
for sentence in ranked_sentences:
    print(sentence[0], sentence[1])

0.009937108623006664 When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
0.009885503389801521 Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
0.009800672148492593 "I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
0.009726397109355165 Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
0.009650774595610931 Currently in ninth 
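
One possible answer to the question of how to map the ranking back to the texts is to remember, while flattening, which article each sentence came from, and then keep the best-scoring sentence per article. The sketch below uses made-up sentences and scores. `articles`, `flat` and `best_per_article` are hypothetical names, and `toy_scores` only mimics the shape of the dictionary returned by `nx.pagerank`.

```python
# Toy stand-in for the tokenized articles (a list of sentence lists).
articles = [
    ["A1 first.", "A1 second."],
    ["A2 first.", "A2 second.", "A2 third."],
]

# Flatten, but keep the article index next to every sentence.
flat = [
    (article_idx, sentence)
    for article_idx, sents in enumerate(articles)
    for sentence in sents
]

# Made-up scores keyed by position in the flat list.
toy_scores = {0: 0.1, 1: 0.3, 2: 0.25, 3: 0.05, 4: 0.3}

# Keep the highest-scoring (score, sentence) pair for every article.
best_per_article = {}
for i, (article_idx, sentence) in enumerate(flat):
    current = best_per_article.get(article_idx)
    if current is None or toy_scores[i] > current[0]:
        best_per_article[article_idx] = (toy_scores[i], sentence)

print(best_per_article)
```
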

In [24]:
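from termcolor import colored
# Note: ranked_sentences[i] is the i-th highest-ranked sentence overall, not
# necessarily a sentence from article i; "Summary in Text?" reports whether it is.
for i, article in enumerate(df['article_text']):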
    print(colored(("ARTICLE:".center(50)),'yellow'))
    print('\n')
    print(colored((article),'blue'))
    print('\n')
    print(colored(("SUMMARY:".center(50)),'green'))
    print('\n')
    print(colored((f"Summary in Text? {ranked_sentences[i][1] in article}".center(50)),'green'))
    print('\n')
    print(colored((f'{ranked_sentences[i][1]} - Score: {ranked_sentences[i][0]}'),'cyan'))
    print('\n')

                     ARTICLE:                     


Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women