# NLP for news 

----

## 1. Identify major breaking news events

Major breaking news events can be identified by noticing that many articles reporting on the same news event have similar content, and are published relatively quickly after the event has happened.

For this first part, we ask you to group articles related to the same breaking news event by assigning an eventId to each article. Articles that don’t relate to a major breaking news event should be labelled as “-1”.


### Preliminaries

In [1]:
import numpy as np
import string as string
import operator
from operator import itemgetter
import pandas as pd
import spacy
import sklearn
import time
from datetime import datetime
from collections import Counter

import en_core_web_sm
nlp = en_core_web_sm.load()

from nltk import word_tokenize as tokenize
from nltk import sent_tokenize
from nltk.cluster.util import cosine_distance
from nltk.corpus import stopwords
en_stop = stopwords.words('english')
en_stop.append("'d")
punctuations = list(string.punctuation)

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import adjusted_rand_score
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

import gensim.models as gm
from gensim.models.doc2vec import TaggedDocument
from gensim.corpora.textcorpus import TextCorpus
from gensim.test.utils import datapath, get_tmpfile
from gensim.similarities import Similarity



Read dataset into dataframe.

In [4]:
df = pd.read_csv('20190710.csv')

### Pre-processing step

I assume all articles are in English as mentioned in the task even there are articles in Russian appeared in the dataset.

In [5]:
#look at statistics
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13022 entries, 0 to 13021
Data columns (total 6 columns):
ArticleId               13022 non-null object
ArticleURL              13022 non-null object
ArticleTitle            13022 non-null object
ArticleDescription      9594 non-null object
ArticlePublishedTime    13022 non-null int64
EventId                 0 non-null float64
dtypes: float64(1), int64(1), object(4)
memory usage: 610.5+ KB


Check a lenth of ArticleTitle and ArticleDescription if there are any outliers

In [6]:
df['len_title'] = df.ArticleTitle.apply(lambda x: len(str(x)))
df['len_description'] = df.ArticleDescription.apply(lambda x: len(str(x)))

In [7]:
df.describe()

Unnamed: 0,ArticlePublishedTime,EventId,len_title,len_description
count,13022.0,0.0,13022.0,13022.0
mean,1557744000.0,,69.471203,261.899247
std,46708.56,,24.110344,751.004452
min,1557654000.0,,5.0,3.0
25%,1557703000.0,,54.0,3.0
50%,1557749000.0,,67.0,135.0
75%,1557781000.0,,82.0,275.0
max,1557826000.0,,241.0,26883.0


There are 25% of ArticleDescription have 3 chars len (hypothesise NAN after import), it is not informative for the analysis. Remove them.

Check whether any duplications in ArticleId:

In [13]:
if len(df.ArticleId.unique()) == df.shape[0]:
    print('All ActicleId are unique!')
else:
    print('ArticleId has problems')

All ActicleId are unique!


Transform time in readable format for further work. Truncate seconds.

In [15]:
df['ArticleTime'] = df.ArticlePublishedTime.apply(lambda x: (pd.to_datetime(x, unit = 's'))) 
df['ArticleTime'] = df.ArticleTime.apply(lambda x: str(x)[:-3]) 
print(df['ArticleTime'][0])

2019-05-12 09:32


Concatenate information from Title and Description to have more data about every article.

In [16]:
df['_Article'] = list(df.ArticleTitle + ' ' + df.ArticleDescription)

Remove stopwords, punctuation.

In [18]:
t = time.time()

_remove = en_stop
_remove.append('‘')
_remove.append('’')
_remove.append('\'s')

df._Article = df._Article.apply(lambda x: tokenize(str(x).translate(str.maketrans('', '', string.punctuation))))
df._Article = df._Article.apply(lambda x: [item for item in x if item.lower() not in _remove])

print('Running time: {} mins'.format(round((time.time() - t) / 60, 2)))

Running time: 0.12 mins


Save a clean string for every article in 'output'

In [19]:
#output for further work, case normalisation applied
df['output'] = df._Article.apply(lambda x: ' '.join(x).lower())

#output for working with Named entities, keep original case
df['output_ne'] = df._Article.apply(lambda x: ' '.join(x))

Look at the most frequent tokens in the articles, maybe there is something to remove for better performance:

In [143]:
def vocab_counter(df, column_name):
    vocab = Counter()
    for text in df[column_name]:
        for word in str(text).split(' '):
            vocab[word] += 1
    return vocab

print('Total words in a vocabulary: {}'.format(len(vocab_counter(df, 'output'))))

Total words in a vocabulary: 47446


In [21]:
_remove_extra = sorted(vocab.items(), key=lambda x: x[1], reverse=True)[:7]
_remove_extra

[('nan', 3431),
 ('first', 2032),
 ('new', 1882),
 ('said', 1851),
 ('“', 1836),
 ('…', 1727),
 ('”', 1663)]

Update text for a every article removing extra uninformative tokens. 

In [22]:
df._Article = df._Article.apply(lambda x: [item for item in x if item.lower() not in [i[0] for i in _remove_extra]])
df.output = df._Article.apply(lambda x: ' '.join(x).lower())
df.output_ne = df._Article.apply(lambda x: ' '.join(x))

I do not do stemming and lemmatization as data is not so big and it will not reflect significantly on the result in this case.

# Solution for grouping articles

This approach is a combination of named entity extraction, topic modeling using Latent Dirichlet Allocation (LDA) and KMeans clustering.

Transform articles into sequence of named entities

In [24]:
t = time.time()

#transform articles in Spacy nlp format
df['ne_span'] = df.output_ne.apply(lambda x: nlp(x))

#create unigrams from bigrams replacing a space with underscore
df['ne'] = df.ne_span.apply(lambda x: [ent.text.replace(' ', '_') for ent in x.ents])

#transform a new article presentation into a string for a further work convenience
df['ne'] = df['ne'].apply(lambda x: ' '.join(x))

print('Running time: {} mins'.format(round((time.time() - t) / 60, 2)))

Running time: 3.27 mins


### Topic modeling LDA

Apply topic modeling over named entities. I assume that it was 10 major events.

In [25]:
#Convert a collection of text documents to a matrix of token counts
lda_vectorizer = sklearn.feature_extraction.text.CountVectorizer(min_df = 5, max_df = 0.9)

#Learn the vocabulary dictionary and return term-document matrix.
lda_data_vectorized = lda_vectorizer.fit_transform(df['ne'])

NUM_TOPICS = 10

#LDA Model
lda = LatentDirichletAllocation(n_components = NUM_TOPICS, max_iter = 10, learning_method = 'online', verbose = True)
data_lda = lda.fit_transform(lda_data_vectorized)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


In [26]:
# article - topic matrix
lda_output = lda.transform(lda_vectorizer.transform(df.output_ne))
# column names
topicnames = ["Topic" + str(i) for i in range(0, NUM_TOPICS)]
# index names
docnames = [item for item in df.ArticleId]

df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
df_document_topic['_Topic'] = list(df_document_topic.values)

#consider the topic is correct if the article got a probability > 0.5
df_document_topic['Topic'] = df_document_topic._Topic.apply(lambda x: list(x).index(np.max(x)) if any(x > 0.5) else -1)

#add topic information in general dataframe
df['Topic'] = np.array(df_document_topic.Topic)

In [27]:
#function for printing key words for detected topic

def selected_topic(model, vectorizer, top_n = 10):
    for i, topic in enumerate(model.components_):
        print("Topic %d: "%(i))
        print([(lda_vectorizer.get_feature_names()[i], topic[i])
              for i in topic.argsort()[:-top_n - 1:-1]])

print("Top words for a major events derived by LDA model: ")
print('')
selected_topic(lda, lda_vectorizer)

Top words for a major events derived by LDA model: 

Topic 0: 
[('saturday', 246.49120409487307), ('friday', 235.66770740063112), ('10', 203.79275986202654), ('chinese', 196.6534952298538), ('inquirer_news', 149.8602980706571), ('hong_kong', 148.76605061610135), ('celtic', 96.59537175283356), ('may_13', 79.07611718168809), ('nba', 47.20421689269658), ('swedish', 43.42402057236027)]
Topic 1: 
[('one', 625.0890170326954), ('manchester_united', 157.0151175800864), ('singapore', 135.89034850662645), ('uber', 135.52332333711846), ('chelsea', 133.60424961956974), ('amazon', 122.39093328267725), ('seven', 86.25066688029688), ('ole_gunnar_solskjaer', 79.60561145336007), ('eden_hazard', 72.3765045052068), ('barcelona', 69.65278404759039)]
Topic 2: 
[('us', 792.3198336533801), ('china', 412.9107310094856), ('three', 410.29225417568836), ('tuesday', 287.4857919807417), ('trump', 144.5633379977283), ('summer', 141.75470477196524), ('donald_trump', 102.1803067053905), ('french', 92.00694737083694),

Print out a number of articles per event. I consider an article as a valid if it probability is more than 50% that the article belongs to the particular topic (as LDA adds all documents in all topics varying just their probabilities). Articles having probabilities lower than 0.5 to any topis are considered as not relevant and labeled as -1.

In [28]:
df_topic_distribution = df_document_topic.Topic.value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['EventId', 'Number of Articles']
df_topic_distribution

Unnamed: 0,EventId,Number of Articles
0,-1,9559
1,8,692
2,2,513
3,7,417
4,5,392
5,1,337
6,9,302
7,6,292
8,3,222
9,0,162


The articles are already grouped in a sensible manner, see the result for instance, for a Topic (event) 4 below. But it still needs additional grouping within topics.

In [39]:
df[(df.Topic == 1)].groupby('Topic')[['ArticleTitle', 'ArticleDescription', 'ArticleTime', 'Topic']].head(20).sort_values(by = 'ArticleTime', ascending = True).reset_index(drop=True)

Unnamed: 0,ArticleTitle,ArticleDescription,ArticleTime,Topic
0,Man Utd vs Cardiff TV channel: What TV channel...,MANCHESTER UNITED end their Premier League cam...,2019-05-12 09:44,1
1,Chelsea 'are only team to show interest in Bar...,"Coutinho, who has struggled to impress for Bar...",2019-05-12 09:48,1
2,How Amazon dumps millions of new items it can’...,AMAZON dumps and destroys millions of brand-ne...,2019-05-12 10:00,1
3,Man Utd duo Paul Pogba and Alexis Sanchez at h...,MANCHESTER UNITED stars are jealous of Alexis ...,2019-05-12 10:18,1
4,Silver King: il wrestler messicano muore sul ring,"Cesar Cuauhtemoc Gonzalez Barron, questo il su...",2019-05-12 10:20,1
5,Uber’s rocky road to global powerhouse,"Uber, set to make its stock market debut in on...",2019-05-12 10:27,1
6,Experience Kallang's rich history in new 3km w...,SINGAPORE - Take a 3km walk on a trail that wi...,2019-05-12 10:40,1
7,Man Utd want one transfer deal because of Chel...,MANCHESTER UNITED want to sign Everton’s Idris...,2019-05-12 10:50,1
8,7 nabbed for alleged vote-buying in Iloilo,Seven persons were arrested in Iloilo amid rep...,2019-05-12 10:52,1
9,Ciara can’t get enough of high-top sneakers,Ciara loves the way high-top sneakers can be p...,2019-05-12 10:59,1


As few events can happen at the same moment around the world so I cannot extract articles just by timings (calculating cosine similarity is not too reliable method also in this case). 

I will cluster each of the selection of article by topic (result above) and look at the output.

I omit all articles labeled as -1 for clustering (means they are not involved in any topic)

In [40]:
df_cluster = df[(df.Topic > -1)][['ArticleId', 'output_ne', 'ArticleTime', 'Topic']].sort_values(by = 'Topic', ascending = True).reset_index(drop=True)
df_cluster.shape

(3463, 4)

### KMeans clustering for every topic

In [41]:
#define a column to save clusters'numbers
df_cluster['Topic_KMeans'] = 0

Assume there are 5 events within every major topic discovered above.

In [42]:
n_clusters = 10

model = KMeans(n_clusters = n_clusters, init='k-means++', max_iter=100, n_init=10)

#function to extract clusters within a topic
def get_cluster_topic(model, df):
    # save all article in a list
    articles = list(*[df[c].values.tolist() for c in ['output_ne']])
    kmeans_vectorizer = TfidfVectorizer(stop_words='english')
    X = kmeans_vectorizer.fit_transform(articles)        
    model.fit(X)
    return model.labels_

_clustering = []

#run over all 10 topic
for i in range(0, 10):
    clustering = get_cluster_topic(model, df_cluster[df_cluster.Topic == i])
    #print(clustering)
    _clustering.append(clustering)
    
#update dataframe with cluster numbers for every article
df_cluster.Topic_KMeans = np.concatenate([item for item in _clustering])

Assess the result of clustering articles by topics. See a selection of articles grouped in one particular event.

In [48]:
 df_cluster[(df_cluster.Topic == 1) & (df_cluster.Topic_KMeans == 1)].sort_values(by = 'ArticleTime', ascending = True).reset_index(drop=True)

Unnamed: 0,ArticleId,output_ne,ArticleTime,Topic,Topic_KMeans
0,5cd7ec5f7ddacd3b2b3b5462,Chelsea team show interest Barcelona flop Phil...,2019-05-12 09:48,1,1
1,5cd8375d7ddacd3b2b3b6681,Lionel Messi Chelsea star Eden Hazard Messi to...,2019-05-12 14:06,1,1
2,5cd8522b8e662d1e4435e9d3,Fans BEG Eden Hazard stay Chelsea home banner ...,2019-05-12 16:35,1,1
3,5cd863b47ddacd3b2b3b7211,Chelsea want Barcelona star replace Eden Hazar...,2019-05-12 18:23,1,1
4,5cd867197ddacd3b2b3b72ab,Hazard future still uncertain says Chelsea bos...,2019-05-12 18:32,1,1
5,5cd8a0cb8e662d1e4435f893,Eden Hazard Chelsea must respect forwards deci...,2019-05-12 19:37,1,1
6,5cd892708e662d1e4435f6c6,Hazard drops biggest hint yet told Chelsea Rea...,2019-05-12 21:30,1,1
7,5cd8909a8e662d1e4435f663,Eden Hazard drops Chelsea transfer bombshell m...,2019-05-12 21:30,1,1
8,5cd8911b8e662d1e4435f67e,made decision Eden Hazard reveals told Chelsea...,2019-05-12 21:30,1,1
9,5cd8935c7ddacd3b2b3b7a37,Eden Hazard drops biggest transfer hint yet Ch...,2019-05-12 21:30,1,1


It can be seen from the output above that I grouped articles by an event they describe. All articles above about Eden Hazard and Real Madrid grouped in one cluster, I can say this is one event. There is maybe 1-2 articles are not strongly about the event, this is kind of error of my approach. To exclude that small error it is possible simply calculate *cosine similarity between articles within one cluster* (this data is not sparse so the performance will be accurate enough) to be sure with a high degree the article is exactly describes the event. 
I will not do that simple operation here due to luck of time.

I have not involved time publishing parameter as you noticed. Maybe using another approach it will be useful to use time as a main parameter for a decision about article relevancy to event. 

Also the dataset contains articles for a short period. If I had articles for a larger period I would additionally filter my result above by selecting articles, for example, during 24 hours after first its appearance. But maybe in that case I will be able to put the article into breaking news category. I am not a media specialist and not sure how much time the article can be considered as a 'hot' (24 hours picked as an example). Although the example of extraction above shows they publish articles even over 24 hours boundary, it should be included or not I do not know ). Need media professional judgment.

In conclusion, I grouped the articles as asked in the question, identified 99 events.

In [85]:
#fill EventId column
df_cluster['EventId'] = pd.to_numeric(df_cluster.Topic.map(str) + df_cluster.Topic_KMeans.map(str))
df.drop('EventId', axis = 1, inplace = True)
df_q1 = pd.merge(df, df_cluster[['ArticleId', 'EventId']], on = 'ArticleId', how = 'left')
df_q1.EventId.fillna(-1, inplace = True)

## Additional approaches, less quality result as for now

I have tried clear KMeans and LDA but the performance is not so impressive as from the solution above. Maybe need more time to go a bit deeper into those algorithms.

### KMeans clustering 

In [52]:
# save all article in a list
documents = list(*[df[c].values.tolist() for c in ['output_ne']])
ids = list(*[df[c].values.tolist() for c in ['ArticleId']])

In [53]:
kmeans_vectorizer = TfidfVectorizer(stop_words='english')
X = kmeans_vectorizer.fit_transform(documents)

Assume there are 10 events happened during two days. Define number of clusters equals 10.

In [54]:
n_clusters = 10
model = KMeans(n_clusters = n_clusters, init='k-means++', max_iter=100, n_init=10)
model.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

Add number of relevant cluster to the dataframe with articles:

In [55]:
df['KMeans'] = model.labels_.tolist()

Amount of articles per cluster (event)

In [56]:
df.KMeans.value_counts()

1    12293
8      590
4       84
5       19
7       13
2       10
6        7
3        3
9        2
0        1
Name: KMeans, dtype: int64

The distribution among clusters looks weird.

In [59]:
clusters_dic = {}

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = kmeans_vectorizer.get_feature_names()
for i in range(n_clusters):
    words = ''
    for ind in order_centroids[i, :10]:
        #print (' %s' % terms[ind])
        words += terms[ind]+ ' '
    clusters_dic[i] = words
    print ("Cluster %d:" % i, words)
    #print() #add whitespace
    

Top terms per cluster:
Cluster 0: minnesota eastbound 36 traffic closed 694 36hadley detour sewer installation 
Cluster 1: post appeared monday day game says season police years time 
Cluster 2: charles st avenue kane girl quartuccio geneva driver scene fled 
Cluster 3: progeny business daughter ties lucy enters joined industry compere chows 
Cluster 4: bafta awards tv 2019 carpet red jodie ant eve comer 
Cluster 5: sri lanka mosques easter muslimowned curfew bombings social media colombo 
Cluster 6: brotherinlaw best allegations world sgx confirmed connections customer chief ceos 
Cluster 7: leeds derby playoff lampard marcelo bielsa semifinal championship leg frank 
Cluster 8: league premier liverpool city manchester title season man final champions 
Cluster 9: protesters omar albashir ousted sudan killing charged president involvement prosecutors 


Conclusion: I refuse the idea to use KMean clustering only for this task.

# Topic Modeling. LDA

In [62]:
# Materialize the sparse data
data_dense = lda_data_vectorized.todense()

# Compute Sparsicity = Percentage of Non-Zero cells
print("Sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

Sparsicity:  0.15175036818158819 %


In [None]:
# Build LDA Model
lda_model = LatentDirichletAllocation(n_topics=20,               # Number of topics
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )
lda_output = lda_model.fit_transform(lda_data_vectorized)

print(lda_model)  # Model attributes

In [None]:
# Define Search Param
search_params = {'n_components': [10, 20, 100], 'learning_decay': [.7]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
model.fit(lda_data_vectorized)

In [67]:
# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(lda_data_vectorized))

Best Model's Params:  {'learning_decay': 0.7, 'n_components': 10}
Best Log Likelihood Score:  -82874.59014212845
Model Perplexity:  550.1253864555185


In [76]:
# Create Document - Topic Matrix
lda_output = best_lda_model.transform(lda_data_vectorized)
# column names
topicnames = ["Topic" + str(i) for i in range(0, 10)]

# index names
docnames = ["Doc" + str(i) for i in range(len(df.output))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

# Styling
def color_red(val):
    color = 'red' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

# Apply Style
df_document_topics = df_document_topic.head(10).style.applymap(color_red).applymap(make_bold)
df_document_topics

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,dominant_topic
Doc0,0.05,0.05,0.05,0.05,0.05,0.55,0.05,0.05,0.05,0.05,5
Doc1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0
Doc2,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0
Doc3,0.05,0.05,0.05,0.05,0.55,0.05,0.05,0.05,0.05,0.05,4
Doc4,0.02,0.02,0.44,0.02,0.42,0.02,0.02,0.02,0.02,0.02,2
Doc5,0.03,0.27,0.03,0.03,0.03,0.03,0.03,0.27,0.03,0.28,9
Doc6,0.01,0.01,0.01,0.01,0.01,0.01,0.89,0.01,0.01,0.01,6
Doc7,0.03,0.03,0.77,0.03,0.03,0.03,0.03,0.03,0.03,0.03,2
Doc8,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0
Doc9,0.05,0.05,0.05,0.55,0.05,0.05,0.05,0.05,0.05,0.05,3


In [74]:
df_document_topic.groupby('dominant_topic')[['dominant_topic']].count()

Unnamed: 0_level_0,dominant_topic
dominant_topic,Unnamed: 1_level_1
0,5863
1,1046
2,644
3,611
4,728
5,1049
6,787
7,780
8,842
9,672


The GridSearch from sklearn using cross validation option advised to have 10 topic, but using human judgment I consider too many documents assighed per each topic (more than 1000 few topics). 

Additionally, I tried *doc2vec* but performance was poor I did not include it in the script (I predict it is due to limited vocabulary). Maybe if run it on full article text it will be more sensible.

Conclusion: the LDA has good potential and with realisation provided as first solution above and gives exciting result.

# Question 2. Extract information about breaking news events

Once you have identified which articles are linked to their respective events, please attempt to derive meaningful information about these events.

In [105]:
#gest the data need for this question
df_q2 = pd.concat([df, df_q1.EventId], axis = 1)

### Get Article Subject

Use Spacy to extract Named Entities

In [479]:
from spacy.gold import GoldParse 
  
nlp = spacy.load('en', entity = False, parser = False) 
entities = ['PERSON', 'GPE']

#df_q2['GoldParse'] = df.output_ne.apply(lambda x: GoldParse(nlp.make_doc(x), entities).orig_annot)

I extract most common city, country, person name from the articles to get the subject of the event. 
I did not split country city and person into two columns as there are articles where no location mentioned and I do not want to have an empty field in the ArticleSubject column.

In [577]:
#extract country, city, person name most probable in the articles per event
df_q2['ArticleSubject'] = df.ne_span.apply(lambda x: ' '.join(item for item in set([ent.text for ent in x.ents 
                                                                 if ent.label_ in entities])))
df_sum = pd.DataFrame(df_q2[df_q2.EventId > -1].groupby('EventId')['ArticleSubject'].apply(lambda x: 
                                                                                       Counter([item for item in x if len(item)>0])
                                                                                       .most_common()[:-1-1:-1][0][0]  ))
df_sum.ArticleSubject = df_sum.ArticleSubject.apply(lambda x: ' '.join([item[0] for item in Counter(tokenize(str(x))).most_common(4)[:-4-1:-1]]))
df_sum.ArticleSubject.head(10)

EventId
0.0       Tony West Taiwan Jakarta
1.0            Dundee McIntyre Jim
2.0            Dineo Solo Moeketsi
3.0                         Penang
4.0                       Dry Mark
5.0                 Yusran Maulana
6.0    Kent Essex MacGregor Tartan
7.0                 Bielsa Marcelo
8.0                 Tillie Killian
9.0                   Oakland Port
Name: ArticleSubject, dtype: object

### Get Article Time

In [580]:
df_temp = df_q2[df_q2.EventId > -1].sort_values(['EventId', 'ArticlePublishedTime'], ascending = [True, True]).drop_duplicates('EventId', keep='first')['ArticlePublishedTime']
df_sum['ArticleTime'] = list(df_temp)
df_sum[:2]

Unnamed: 0_level_0,ArticleSubject,ArticleTime
EventId,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,Tony West Taiwan Jakarta,1557660519
1.0,Dundee McIntyre Jim,1557661019
2.0,Dineo Solo Moeketsi,1557656777


### Article summary through page rank algorithm and cosine similarity.

I will extract most relevant sentence within all data given per event.

In [599]:
def PageRank(M, eps = 1.0e-08, d = 0.85):
    N = M.shape[1]
    v = np.random.rand(N, 1)
    v = v/ np.linalg.norm(v, 1)
    last_v = np.ones((N, 1), dtype = np.float32)*np.inf
    M_hat = (d*M)+(((1-d)/N)*np.ones((N,N), dtype = np.float32))
    
    while np.linalg.norm(v - last_v, 2) > eps:
        last_v = v
        v = np.matmul(M_hat, v)
    return v

In [583]:
def cos_sim(sent1, sent2):
    vocab = list(set(sent1 + sent2))
    vector1 = [0]*len(vocab)
    vector2 = [0]*len(vocab)
    
    for  word in sent1:
        vector1[vocab.index(word)] +=1
    
    for  word in sent2:
        vector2[vocab.index(word)] +=1
    
    return 1 - cosine_distance(vector1, vector2)

In [585]:
def similarity_matrix(sentences):
    S = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i == j:
                continue
            else:
                S[i][j] = cos_sim(sentences[i], sentences[j])
                
    for i in range(len(S)):
        S[i] /= S[i].sum()
    return S

In [646]:
df_q1['_Article'] = df_q1.ArticleTitle + ' ' + df_q2.ArticleDescription

#I want to extract one sentence as a summary
SUMMARY_SIZE = 1
final_summary = []

#run over all 100 topic
for i in range(0, 100):
    df_temp = df_q1[df_q1.EventId == i]
    # save all article in a list
    documents = list(*[df_temp[c].values.tolist() for c in ['_Article']])
    S = similarity_matrix(documents)
    #find sentences ranks within every event
    sentence_rank = PageRank(S)
    #find sentence index
    sentence_index = [item[0] for item in sorted(enumerate(sentence_rank), key = lambda item: -item[1])]
    article_sum = itemgetter(*sorted(article_index[:SUMMARY_SIZE]))(documents)
    final_summary.append(article_sum)

Add summary into final output.

In [649]:
df_sum['ArticleMain'] = final_summary

In [654]:
df_sum.head()

Unnamed: 0_level_0,ArticleSubject,ArticleTime,ArticleMain
EventId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,Tony West Taiwan Jakarta,1557660519,Aussie parties turn to WeChat to woo Chinese C...
1.0,Dundee McIntyre Jim,1557661019,Dundee manager Jim McIntyre leaves post Dundee...
2.0,Dineo Solo Moeketsi,1557656777,Climb inside Oakland Airport's massive new fir...
3.0,Penang,1557658811,Why borrowing makes sense as a financial strat...
4.0,Dry Mark,1557662413,Should Rangers ace Jon Flanagan have been sent...


I have not implemented pre-processing at this question due to run out of time. With a sentence pre-processing the performance will be better. This pre-processing not the same as done for the question one. 

**Conclusion**

As a summary I picked the most relevant sentence within all articles (among article titles and descriptions) per event. I realize that this is not a summary but main idea. However, in general it can be suitable in some point when we need to make a short description for example.
