In [25]:
import pandas as pd
import itertools
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
from termcolor import colored

In [2]:
# tennis.csv contains 8 online articles about tennis
df = pd.read_csv("./input/tennis.csv")
df

Unnamed: 0,article_id,article_text,source
0,1,Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,"BASEL, Switzerland (AP), Roger Federer advance...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...
5,6,Nadal has not played tennis since he was force...,https://www.express.co.uk/sport/tennis/1037119...
6,7,"Tennis giveth, and tennis taketh away. The end...",http://www.tennis.com/pro-game/2018/10/tennisc...
7,8,Federer won the Swiss Indoors last week by bea...,https://www.express.co.uk/sport/tennis/1038186...


In [3]:
for art in df['article_text']:
    print(f'{art}\n\n')

Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same s

In [4]:
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))

# Extract the longest sentence of each text so we can check whether
# the algorithm performs better than just picking the longest
# sentence
longest_sentences = []
    
for item in sentences:
    sorteditems = sorted(item, key=len)
    longest_sentences.append((sorteditems[-1]))

longest_sentences

["When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
 'Copil fired 26 aces past Zverev and never dropped serve, clinching after 2 1/2 hours with a forehand volley winner to break Zverev for the second time in the semifinal.',
 'The 20-time Grand Slam champion has voiced doubts about the wisdom of the one-week format to be introduced by organisers Kosmos, who have promised the International Tennis Federation up to $3 billion in prize money over the next quarter-century.',
 'Kei Nishikori will try to end his long losing streak in ATP finals and Kevin Anderson will go for his second title of the year at the Erste Bank Open on Sunday.',
 '"Not always, but I really feel like in the mid-2000 years there was a huge shift of the attit

In [5]:
# Flatten nested list of sentences 
# to have a single list of all sentences
sentences = list(itertools.chain(*sentences))
sentences[:5]

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.',
 "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.",
 'I think everyone knows this is my job here.',
 "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.",
 "I'm a pretty competitive girl."]

In [6]:
# regex=True is passed explicitly so the pattern is treated as a regular expression
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)
clean_sentences = [s.lower() for s in clean_sentences]
stop_words = stopwords.words('english')

def remove_stopwords(words: list) -> str:
    """Remove stopwords from a list of words and return the remaining words
       as a single string joined by spaces.
    """
    sentence = " ".join([word for word in words if word not in stop_words])
    return sentence

clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
print(sentences[:5])
clean_sentences[:5]

['Maria Sharapova has basically no friends as tennis players on the WTA Tour.', "The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much.", 'I think everyone knows this is my job here.', "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.", "I'm a pretty competitive girl."]


['maria sharapova basically friends tennis players wta tour',
 'russian player problems openly speaking recent interview said really hide feelings much',
 'think everyone knows job',
 'courts court playing competitor want beat every single person whether locker room across net one strike conversation weather know next minutes go try win tennis match',
 'pretty competitive girl']
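The cleaning step above can also be sketched per sentence with the standard library. This is a minimal illustration, not the notebook's code: `clean_sentence` and the tiny `STOP_WORDS` set are hypothetical names, and the set is a made-up stand-in for nltk's English stop-word list.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses nltk's stopwords.words('english') instead.
STOP_WORDS = {"has", "no", "as", "on", "the", "i", "m", "a"}

def clean_sentence(sentence: str, stop_words: set = STOP_WORDS) -> str:
    """Replace non-letters with spaces, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence).lower()
    return " ".join(w for w in letters_only.split() if w not in stop_words)

print(clean_sentence("Maria Sharapova has basically no friends as tennis players on the WTA Tour."))
print(clean_sentence("I'm a pretty competitive girl."))
```

Note that replacing the apostrophe with a space splits "I'm" into "i" and "m", so both fragments have to be on the stop-word list to disappear.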

In [7]:
word_embeddings = {}

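# Each line of the GloVe file holds a word followed by its 300 coefficients.
# The with-block closes the file even if parsing fails.
with open('./input/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs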
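The loading step can be tried without the large GloVe download by parsing a few lines held in memory. The sketch below assumes the "word v1 v2 ..." line layout used above; `load_embeddings`, `expected_dim`, and the 3-dimensional sample vectors are made up for illustration.

```python
import io
import numpy as np

def load_embeddings(lines, expected_dim=None):
    """Parse GloVe-style text lines ("word v1 v2 ...") into a dict of float32 arrays."""
    embeddings = {}
    for line in lines:
        values = line.rstrip().split(" ")
        word, coefs = values[0], np.asarray(values[1:], dtype="float32")
        # Skip malformed lines instead of storing vectors of the wrong length
        if expected_dim is not None and coefs.shape[0] != expected_dim:
            continue
        embeddings[word] = coefs
    return embeddings

# Made-up 3-dimensional vectors, only to exercise the parser
sample = io.StringIO("tennis 0.1 0.2 0.3\nmatch 0.4 0.5 0.6\nbroken 0.7\n")
toy_embeddings = load_embeddings(sample, expected_dim=3)
print(sorted(toy_embeddings))
```

Because the function accepts any iterable of lines, an open file handle can be passed in the same way as the `StringIO` object.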

In [8]:
word_embeddings['world']

array([-0.25831  ,  0.43644  , -0.1138   , -0.5259   ,  0.20213  ,
        0.95247  , -0.58764  , -0.047001 , -0.053704 , -1.744    ,
        0.99583  ,  0.063464 , -0.093147 , -0.26441  , -0.28676  ,
       -0.52357  , -0.17867  ,  0.18171  , -0.71696  , -0.13301  ,
        0.42476  ,  0.42044  ,  0.3775   ,  0.082431 ,  0.13154  ,
       -0.10151  , -0.11898  ,  0.029509 , -0.39635  ,  0.26516  ,
       -0.55091  ,  0.23805  , -0.018748 , -0.039944 , -1.1972   ,
        0.13567  ,  0.09371  , -0.60134  ,  0.12887  ,  0.34876  ,
       -0.25588  , -0.33466  ,  0.069678 ,  0.5429   ,  0.25246  ,
        0.17249  ,  0.099885 ,  0.099456 , -0.01592  ,  0.2617   ,
        0.36155  , -0.12417  ,  0.27516  ,  0.037434 , -0.075003 ,
        0.61096  ,  0.05261  ,  0.017307 ,  0.12576  , -0.11952  ,
       -0.49077  ,  0.026711 , -0.27187  , -0.15268  , -0.22147  ,
        0.18131  , -0.045344 ,  0.76151  ,  0.17489  , -0.44112  ,
        0.027347 ,  0.42676  , -0.0069618, -0.60233  , -0.0166

In [9]:
sentence_vectors = []
words_not_in_word_embeddings = set()

def get_vectors_from_word_embeddings(w):
    try:
        vec = word_embeddings[w]
    except KeyError:
        vec = np.zeros((300,))
        words_not_in_word_embeddings.add(w)
    return vec
        
for sentence in clean_sentences:
    if len(sentence) != 0:
        word_list = sentence.split()
        vec_list = []
        for w in word_list:
            vec = get_vectors_from_word_embeddings(w)
            vec_list.append(vec)
        # Average the word vectors; the extra 0.001 only rescales the result
        v = sum(vec_list)/(len(word_list)+0.001)
    else:
        v = np.zeros((300,))
    sentence_vectors.append(v)

In [10]:
words_not_in_word_embeddings

{'cecchinato', 'khachanov', 'kranjovic', 'struff', 'tsitsipas'}
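The five names listed above get a zero vector. For the cosine similarity used later this should be harmless: a zero vector does not change the sum, and a larger denominator only shortens the averaged vector without changing its direction. The sketch below checks this on made-up 3-dimensional embeddings; `sentence_vector` and `cosine` are hypothetical helpers that mirror the averaging above.

```python
import numpy as np

# Made-up 3-dimensional embeddings (hypothetical values, not GloVe)
toy_embeddings = {
    "federer": np.array([1.0, 0.0, 2.0]),
    "beat": np.array([0.0, 1.0, 1.0]),
}

def sentence_vector(words, embeddings, dim=3):
    """Average word vectors as above: zeros for unknown words, len + 0.001 as denominator."""
    vecs = [embeddings.get(w, np.zeros(dim)) for w in words]
    return sum(vecs) / (len(words) + 0.001)

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

with_oov = sentence_vector(["federer", "beat", "tsitsipas"], toy_embeddings)
without_oov = sentence_vector(["federer", "beat"], toy_embeddings)

# Same direction (cosine of about 1), but the vector with the unknown word is shorter
print(cosine(with_oov, without_oov))
print(np.linalg.norm(with_oov) < np.linalg.norm(without_oov))
```

The situation is different for a sentence made up only of unknown words: its vector is all zeros and carries no direction at all.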

In [11]:
print(len(sentences))
print(len(clean_sentences))
len(sentence_vectors)

119
119


119

In [12]:
sentence_vectors[0]

array([-4.61076051e-02,  1.29591778e-01, -5.93765713e-02, -9.25393030e-02,
        9.77605209e-02,  3.80744934e-02, -5.39967835e-01, -1.52831122e-01,
       -1.40714034e-01, -6.11173630e-01,  4.43402022e-01, -5.85616753e-02,
        2.28641272e-01, -1.77562177e-01, -3.79852504e-01, -1.27993613e-01,
       -1.60404950e-01,  1.98072717e-02,  2.03368679e-01,  1.93133727e-01,
       -3.58894654e-02, -4.78439452e-03, -3.17186564e-01, -3.08756888e-01,
       -2.04970121e-01, -1.72748305e-02, -6.77067935e-02,  3.13480824e-01,
       -1.85879245e-02,  1.04714625e-01, -6.04711846e-03,  2.00459920e-02,
        5.60364872e-02,  7.86496699e-02, -8.38511407e-01,  6.67703524e-02,
        3.29794616e-01,  7.84649476e-02, -9.31521133e-02,  8.28586444e-02,
        1.80333838e-01, -6.22619689e-01, -7.29705393e-02,  1.25791267e-01,
        2.01891467e-01,  7.65589178e-02,  3.84096801e-01, -1.04883015e-02,
       -1.06981210e-01,  1.31630525e-01,  5.08688867e-01,  2.08538920e-02,
        5.01151104e-03, -

In [17]:
# Spoiler alert: 119
dataset_length = len(sentences)

similarity_matrix = np.zeros([dataset_length, dataset_length])
for i in range(dataset_length):
    for j in range(dataset_length):
        # ignore the diagonal
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,300), sentence_vectors[j].reshape(1,300))[0,0]

In [18]:
similarity_matrix[118][117]

0.6195055246353149
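The nested loop calls `cosine_similarity` once per ordered pair, although the matrix is symmetric. As a sketch of an alternative, all pairs can be computed in one matrix product on the stacked sentence vectors. The helper name `cosine_similarity_matrix` and the toy vectors are made up, and treating all-zero rows as similarity 0 is an assumption of this sketch. Passing the stacked matrix to scikit-learn's `cosine_similarity` in a single call should give the same values apart from the diagonal.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities with a zeroed diagonal."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Assumption of this sketch: all-zero rows get similarity 0 to everything
    norms[norms == 0] = 1.0
    unit = X / norms
    sim = unit @ unit.T
    # Ignore the diagonal, as in the loop above
    np.fill_diagonal(sim, 0.0)
    return sim

# Made-up 2-dimensional vectors: the first two are parallel, the third is orthogonal
toy_vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
sim = cosine_similarity_matrix(toy_vectors)
print(sim)
```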

In [19]:
nx_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(nx_graph)

In [20]:
scores

{0: 0.007885821982884933,
 1: 0.008567224044913526,
 2: 0.007818890205731737,
 3: 0.009937108525290174,
 4: 0.006948420815115995,
 5: 0.0074651472561015056,
 6: 0.00810485920387062,
 7: 0.00821824791313258,
 8: 0.008788235585347835,
 9: 0.008068025315786044,
 10: 0.0012695725958021694,
 11: 0.009282832381744273,
 12: 0.008035557979024586,
 13: 0.007905186321127215,
 14: 0.008738186092527984,
 15: 0.008412522897742504,
 16: 0.00765131463538145,
 17: 0.007957039806613943,
 18: 0.008283437423493719,
 19: 0.009176328270149274,
 20: 0.009004823527359708,
 21: 0.007182012090878639,
 22: 0.008304125364273265,
 23: 0.009295054603912051,
 24: 0.007429507352878927,
 25: 0.0060144730442320765,
 26: 0.007859710263762022,
 27: 0.009081402216231561,
 28: 0.009463531926604955,
 29: 0.009432609315097299,
 30: 0.009726396964829154,
 31: 0.009432388750070885,
 32: 0.006380908575663468,
 33: 0.008868253270499891,
 34: 0.009292494059303411,
 35: 0.009576898630127463,
 36: 0.007380322052458923,
 37: 0.0091
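To make the scores above less of a black box, here is a minimal power-iteration sketch of PageRank on a weighted similarity matrix. It illustrates the idea and is not networkx's implementation; `pagerank_power_iteration` and the toy weights are made up. The damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(weights, alpha=0.85, tol=1e-10, max_iter=1000):
    """PageRank scores for a weighted graph given as a non-negative square matrix."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Normalise each row to a probability distribution; rows without any
    # outgoing weight jump uniformly to every node (a common convention)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Follow an edge with probability alpha, otherwise jump to a random node
        new_scores = alpha * (scores @ transition) + (1 - alpha) / n
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Made-up symmetric similarities: sentence 1 is strongly similar to both others
toy_weights = [[0.0, 0.9, 0.1],
               [0.9, 0.0, 0.8],
               [0.1, 0.8, 0.0]]
toy_scores = pagerank_power_iteration(toy_weights)
print(toy_scores, toy_scores.sum())
```

The scores sum to 1, which is why each of the 119 sentences above ends up with a value in the neighbourhood of 1/119.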

In [21]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [22]:
print(len(ranked_sentences))
ranked_sentences[:5]

119


[(0.009937108525290174,
  "When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match."),
 (0.009885503471476712,
  'Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.'),
 (0.009800672319327613,
  '"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.'),
 (0.009726396964829154,
  'Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.'),
 (0.

In [23]:
# ranked_sentences is a list of (score, sentence) tuples covering all sentences of all articles,
# sorted by score in descending order. It is one global ranking, not the top sentence of each text.
# Open questions: How can this be mapped back to the texts? Why do we need to compute the pagerank for every sentence against every other sentence?
for sentence in ranked_sentences:
    print(sentence[0], sentence[1])

0.009937108525290174 When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
0.009885503471476712 Major players feel that a big event in late November combined with one in January before the Australian Open will mean too much tennis and too little rest.
0.009800672319327613 "I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the Olympic weeks, not necessarily during the tournaments.
0.009726396964829154 Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of any commitment.
0.009650774597349778 Currently in ninth 
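One possible way to map the scores back to the texts is to walk the nested sentence list (one list per article, as it exists before flattening) in the same order in which it was flattened, so that the running index matches the keys of `scores`. This is a sketch under that assumption: `top_sentence_per_article` is a hypothetical helper, the toy data is made up, and in this notebook the nested list would have to be kept under its own name because `sentences` is overwritten in In [5].

```python
def top_sentence_per_article(article_sentences, scores):
    """Return {article_index: (score, sentence)} with the best-scored sentence of each article.

    article_sentences: list of sentence lists, one per article (before flattening)
    scores: mapping from the flattened sentence index to its score
    """
    best = {}
    flat_index = 0
    for article_index, sents in enumerate(article_sentences):
        for sentence in sents:
            candidate = (scores[flat_index], sentence)
            if article_index not in best or candidate[0] > best[article_index][0]:
                best[article_index] = candidate
            flat_index += 1
    return best

# Made-up sentences and scores, only to show the bookkeeping
toy_articles = [["A one.", "A two."], ["B one.", "B two.", "B three."]]
toy_scores = {0: 0.1, 1: 0.3, 2: 0.2, 3: 0.15, 4: 0.25}
print(top_sentence_per_article(toy_articles, toy_scores))
```

With this bookkeeping the scores still come from one graph over all sentences. Building a separate graph per article would be the alternative if each text should be ranked in isolation.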

In [26]:
for article in df['article_text']:
    # ranked_sentences is a global ranking over all articles, so take the
    # best-scored sentence that actually occurs in this article
    score, summary = next((sc, s) for sc, s in ranked_sentences if s in article)
    print(colored(("ARTICLE:".center(50)),'yellow'))
    print('\n')
    print(colored((article),'blue'))
    print('\n')
    print(colored(("SUMMARY:".center(50)),'green'))
    print('\n')
    print(colored((f"Summary in Text? {summary in article}".center(50)),'green'))
    print('\n')
    print(colored((f'{summary} - Score: {score}'),'cyan'))
    print('\n')

                     ARTICLE:                     


Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, 