In [1]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt') # one time execution
import re

[nltk_data] Downloading package punkt to
[nltk_data]     /home/quicksilver/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
df = pd.read_csv("tennis_articles_v4.csv")

In [3]:
df.head()

Unnamed: 0,article_id,article_text,source
0,0,The OnePlus 7 is basically the OnePlus 6T with...,


In [4]:
df['article_text'][0]

'The OnePlus 7 is basically the OnePlus 6T with the guts of the OnePlus 7 Pro, which sounds like a bad thing, but for 500 it is arguably the best bang for your buck going. There was nothing wrong with the design of the 6T, so there isn’t with the 7. The 6.41in AMOLED display is bright and crisp, filling most of the front of the phone with a small chin at the bottom and a teardrop notch in the top for the selfie camera. It’s still one of the best looking designs available, but the notch is more intrusive than the holepunch design of the Honor 20 or similar, and not quite on the same level as the OnePlus 7 Pro.The phone is only available in mirror grey in the UK, which is super-shiny with a slight purple tint to it. It looks sleek but is a bit of a fingerprint magnet. A mirror blue version is available in India, while a red version is available in India and China.The rear glass isn’t as slippery as other phones, which combined with its curved back and relatively narrow 74.8mm width makes

In [5]:
df['article_text'][1]


"BASEL, Switzerland (AP), Roger Federer advanced to the 14th Swiss Indoors final of his career by beating seventh-seeded Daniil Medvedev 6-1, 6-4 on Saturday. Seeking a ninth title at his hometown event, and a 99th overall, Federer will play 93th-ranked Marius Copil on Sunday. Federer dominated the 20th-ranked Medvedev and had his first match-point chance to break serve again at 5-1. He then dropped his serve to love, and let another match point slip in Medvedev's next service game by netting a backhand. He clinched on his fourth chance when Medvedev netted from the baseline. Copil upset expectations of a Federer final against Alexander Zverev in a 6-3, 6-7 (6), 6-4 win over the fifth-ranked German in the earlier semifinal. The Romanian aims for a first title after arriving at Basel without a career win over a top-10 opponent. Copil has two after also beating No. 6 Marin Cilic in the second round. Copil fired 26 aces past Zverev and never dropped serve, clinching after 2 1/2 hours with

In [5]:
from nltk.tokenize import sent_tokenize
sentences = []
for s in df['article_text']:
  sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list

In [7]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2019-09-10 22:14:49--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2019-09-10 22:14:50--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2019-09-10 22:14:51--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2019-0

In [6]:
# Extract word vectors
word_embeddings = {}
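# A context manager closes the file even if parsing fails part-way
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # first token on each line is the word
        coefs = np.asarray(values[1:], dtype='float32')  # the rest is its vector
        word_embeddings[word] = coefs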

In [7]:
len(word_embeddings)

400000

In [8]:
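# remove punctuation, numbers and special characters
# regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)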

# convert all letters to lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [9]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/quicksilver/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # set gives fast membership tests

In [11]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [12]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [14]:
sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
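
The averaging step above can be wrapped in a small helper, which makes it easier to check on toy data. This is a minimal sketch, not part of the original notebook: `sentence_vector`, `toy_embeddings` and the 3-dimensional vectors are hypothetical names and values made up for illustration, and the `+ 0.001` in the denominator simply mirrors the cell above.

```python
import numpy as np

def sentence_vector(sentence, embeddings, dim=100):
    """Average the word vectors of a cleaned sentence (hypothetical helper)."""
    words = sentence.split()
    if not words:
        # Empty sentence: fall back to a zero vector, as in the cell above.
        return np.zeros(dim)
    # Words missing from the embeddings contribute a zero vector.
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.sum(vectors, axis=0) / (len(words) + 0.001)

# Toy 3-dimensional embeddings, invented for this example only.
toy_embeddings = {
    "federer": np.array([1.0, 0.0, 0.0]),
    "wins": np.array([0.0, 1.0, 0.0]),
}

vec = sentence_vector("federer wins title", toy_embeddings, dim=3)
empty = sentence_vector("", toy_embeddings, dim=3)
print(vec, empty)
```

Note that an out-of-vocabulary word such as "title" still counts in the denominator, so sentences with many unknown words end up with shorter vectors.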

In [15]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
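
The nested loop above calls `cosine_similarity` once per sentence pair, which gets slow as the number of sentences grows. A possible alternative is to compute the whole matrix in one step. The sketch below uses plain NumPy; `cosine_similarity_matrix` and `toy_vectors` are hypothetical names, and mapping zero vectors to a similarity of 0 is an assumption of this sketch.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities with a zeroed diagonal (hypothetical helper)."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Avoid dividing by zero: all-zero rows stay all-zero after scaling.
    norms[norms == 0] = 1.0
    unit = X / norms
    sim = unit @ unit.T
    # Match the loop above, which leaves sim_mat[i][i] at 0.
    np.fill_diagonal(sim, 0.0)
    return sim

# Toy 2-dimensional vectors, invented for this example only.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim)
```

In the notebook this would be called as `sim_mat = cosine_similarity_matrix(sentence_vectors)`.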

In [18]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
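
To give an intuition for what the PageRank scores represent, here is a simplified power-iteration sketch on a weighted graph. It is an illustration only and not networkx's actual implementation; `pagerank_power_iteration`, the 0.85 damping factor, the tolerance and `toy_weights` are assumptions chosen for this example.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=100):
    """Simplified weighted PageRank by power iteration (illustrative sketch)."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Normalise each row into transition probabilities; rows with no
    # outgoing weight spread their score uniformly over all nodes.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    P = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1 - damping) / n + damping * (P.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Toy graph: node 0 is linked to nodes 1 and 2, which are not linked to each other.
toy_weights = [[0, 1, 1],
               [1, 0, 0],
               [1, 0, 0]]
toy_scores = pagerank_power_iteration(toy_weights)
print(toy_scores)
```

Node 0 receives the highest score because both other nodes point to it. In the summarizer, a sentence that is similar to many other sentences plays the same role.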

In [19]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [22]:
# Extract the top 7 sentences as the summary
for i in range(7):
  print(ranked_sentences[i][1])

It’s still one of the best looking designs available, but the notch is more intrusive than the holepunch design of the Honor 20 or similar, and not quite on the same level as the OnePlus 7 Pro.The phone is only available in mirror grey in the UK, which is super-shiny with a slight purple tint to it.
That meant the OnePlus 7 would make it from 7am on day one until 4pm on day two, or longer with lighter use or one of the power-saving modes active.There’s no wireless charging, but OnePlus’s included fast charging tech will hit 80% in 50 minutes and a full battery in 90 minutes.
OnePlus gives you the choice of choice standard three-button Android navigation keys, Google’s “pill” navigation button or gestures, which are smooth – swipe up from the centre to go home, up and hold for recently-used apps, or up and over to the right for the previously-used app.
Both smartphones have seen a barrage of recent software updates that have improved the cameras – something that’s very good to see.
The 