In [15]:
import numpy as np
import pandas as pd
import nltk 
nltk.download('punkt')
import re

[nltk_data] Downloading package punkt to C:\Users\Mir
[nltk_data]     Info\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [17]:
df= pd.read_csv("tennis_articles.csv")

In [None]:
from IPython.display import Markdown
    display(Markdown(document))

In [18]:
#inspect the data
df.head()

Unnamed: 0,article_id,article_title,article_text,source
0,1,"I do not have friends in tennis, says Maria Sh...",Maria Sharapova has basically no friends as te...,https://www.tennisworldusa.org/tennis/news/Mar...
1,2,Federer defeats Medvedev to advance to 14th Sw...,"BASEL, Switzerland (AP) — Roger Federer advanc...",http://www.tennis.com/pro-game/2018/10/copil-s...
2,3,Tennis: Roger Federer ignored deadline set by ...,Roger Federer has revealed that organisers of ...,https://scroll.in/field/899938/tennis-roger-fe...
3,4,Nishikori to face off against Anderson in Vien...,Kei Nishikori will try to end his long losing ...,http://www.tennis.com/pro-game/2018/10/nishiko...
4,5,Roger Federer has made this huge change to ten...,"Federer, 37, first broke through on tour over ...",https://www.express.co.uk/sport/tennis/1036101...


In [19]:
df['article_text'][0]

"Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: 'I don't really hide any feelings too much. I think everyone knows this is my job here. When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person whether they're in the locker room or across the net. So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match. I'm a pretty competitive girl. I say my hellos, but I'm not sending any players flowers as well. Uhm, I'm not really friendly or close to many players. I have not a lot of friends away from the courts.' When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men's tour than the women's tour? 'No, not at all. I think just because you're in the same

In [5]:
#extract word vectors 
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding = 'utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype = 'float32')
    word_embeddings[word]=coefs
f.close()
len(word_embeddings)

400000

In [6]:
#do some text cleaning 
sentences = df['article_text']
#remove the punctuations, numbers and special characters 
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")

  clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ")


In [7]:
#make alphabets lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [8]:
#download stopwords from nltk package  
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Mir
[nltk_data]     Info\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
#import the stopwords 
from nltk.corpus import stopwords 
stop_words = stopwords.words('english')

In [10]:
#function to remove stopwords 
def remove_stopwords(sen):
    sen_new = " ". join([i for i in sen if i not in stop_words])
    return sen_new

In [11]:
#use the function to remove stopwords from sentences 
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [12]:
clean_sentences

['maria sharapova basically friends tennis players wta tour russian player problems openly speaking recent interview said really hide feelings much think everyone knows job courts court playing competitor want beat every single person whether locker room across net one strike conversation weather know next minutes go try win tennis match pretty competitive girl say hellos sending players flowers well uhm really friendly close many players lot friends away courts said really close lot players something strategic different men tour women tour think sport mean friends everyone categorized tennis player going get along tennis players think every person different interests friends completely different jobs interests met different parts life think everyone thinks tennis players greatest friends ultimately tennis small part many things interested also read maria sharapova reveals tennis keeps motivated',
 'basel switzerland ap roger federer advanced th swiss indoors final career beating seven

In [13]:
#we will create vectors for our sentences. each words in a sentence as a vector size of 100 elements. we will take the mean of those vectors to come to a consolidated vector for the sentence 
sentence_vectors = []
for i in clean_sentences: 
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001) #100  
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)

In [14]:
sentence_vectors

[array([ 0.00483909,  0.28544757,  0.42578188, -0.0922095 , -0.17029235,
         0.2513602 , -0.07538531,  0.16931818, -0.09813195, -0.14324999,
         0.1356024 , -0.05566843,  0.19561599, -0.00907872, -0.11904312,
        -0.15082781, -0.02610032,  0.23232764, -0.36464164,  0.37236184,
         0.14870016,  0.22674483,  0.09170502, -0.06452769,  0.09085543,
        -0.09743623, -0.06819598, -0.58882797,  0.27535385,  0.02007891,
        -0.19707477,  0.4776674 ,  0.0392753 , -0.16300361,  0.25556332,
         0.09693789, -0.2571401 ,  0.25395647, -0.13058247, -0.1783668 ,
        -0.26708218, -0.20551905,  0.20955761, -0.3373452 , -0.14459155,
        -0.08284358,  0.06940231, -0.11812706,  0.07247393, -0.72735333,
         0.02643214, -0.18338543,  0.02643522,  0.769736  ,  0.04523619,
        -1.9630214 ,  0.01066783,  0.06949498,  1.1767952 ,  0.5375171 ,
        -0.15733175,  0.7271858 , -0.22895402, -0.00910158,  0.45067054,
        -0.02839184,  0.3791105 ,  0.30643913,  0.0

In [15]:
#find similartities betwen the sentences using cosine similarity 
#similarity matrix 
sim_mat = np.zeros([len(sentences), len(sentences)]) #empty square matrix of zeros


In [16]:
#use cosine similarity to compute the similarity between a pair of sentences 
from sklearn.metrics.pairwise import cosine_similarity 

In [17]:
#initialize the empty metrics with cosine similarity score
 
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]

In [18]:
#applying page rank algorithm, the nodes of the graph will represent sentences and the edges will represent the similarity scores between the sentences 
import networkx as nx

In [19]:
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

In [20]:
scores

{0: 0.1216317478725013,
 1: 0.12243770946848995,
 2: 0.12683679471176998,
 3: 0.12448198395108506,
 4: 0.12593658025062596,
 5: 0.1251296501110371,
 6: 0.1268850568792313,
 7: 0.12666047675525943}

In [21]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [22]:
#Extract top 10 sentences as the summary 
for i in range(10):
    print(ranked_sentences[i][1])

Tennis giveth, and tennis taketh away. The end of the season is finally in sight, and with so many players defending—or losing—huge chunks of points in Singapore, Zhuhai and London, podcast co-hosts Nina Pantic and Irina Falconi discuss the art of defending points (02:14). It's no secret that Jack Sock has struggled on the singles court this year (his record is 7-19). He could lose 1,400 points in the next few weeks—but instead of focusing on the negative, it can all be about perspective (06:28). Let's also not forget his two Grand Slam doubles triumphs this season. Two players, Stefanos Tsitsipas and Kyle Edmund, won their first career ATP titles last week (13:26). It's a big deal because...you never forget your first. Irina looks back at her WTA title win in Bogota in 2016, and tells an unforgettable story about her semifinal drama (14:04). In Singapore, one of the biggest storylines (aside from the matches, of course) has been the on-court coaching debate. Nina and Irina give their 

IndexError: list index out of range

In [25]:
#topic modelling with LSA

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

documents = ["doc1.txt", "doc2.txt", "doc3.txt"] 
  
# raw documents to tf-idf matrix: 

vectorizer = TfidfVectorizer(stop_words='english', 
                             use_idf=True, 
                             smooth_idf=True)

# SVD to reduce dimensionality: 

svd_model = TruncatedSVD(n_components=100,   # num dimensions
                         algorithm='randomized',
                         n_iter=10)

# pipeline of tf-idf + SVD, fit to and applied to documents:

svd_transformer = Pipeline([('tfidf', vectorizer), 
                            ('svd', svd_model)])

svd_matrix = svd_transformer.fit_transform(documents)

# svd_matrix can later be used to compare documents, compare words, or compare queries with documents


ValueError: n_components must be < n_features; got 100 >= 4