In [26]:
import nltk
import pandas as pd
import numpy as np

In [33]:
df=pd.read_csv('metadata.csv')

In [24]:
#drop rows that have no abstract
filtered=df.dropna(subset=['abstract'])

In [35]:
filtered.reset_index(drop=True,inplace=True)

In [36]:
filtered['abstract'][31]

'The term "Geographic Information Systems" (GIS) has been added to MeSH in 2003, a step reflecting the importance and growing use of GIS in health and healthcare research and practices. GIS have much more to offer than the obvious digital cartography (map) functions. From a community health perspective, GIS could potentially act as powerful evidence-based practice tools for early problem detection and solving. When properly used, GIS can: inform and educate (professionals and the public); empower decision-making at all levels; help in planning and tweaking clinically and cost-effective actions, in predicting outcomes before making any financial commitments and ascribing priorities in a climate of finite resources; change practices; and continually monitor and analyse changes, as well as sentinel events. Yet despite all these potentials for GIS, they remain under-utilised in the UK National Health Service (NHS). This paper has the following objectives: (1) to illustrate with practical, 

In [38]:
from nltk.tokenize import sent_tokenize

In [39]:
#split the first 80 abstracts into sentences (one list of sentences per abstract)
sentences=[]

for i in range(0,80):
    sentences.append(sent_tokenize(filtered['abstract'][i]))

In [40]:
sentences=[y for x in sentences for y in x]

In [41]:
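#clean text: replace every non-letter character with a space
#regex=True is required on pandas 2.0+, where str.replace treats the pattern as a literal string by default
clean_sentences= pd.Series(sentences).str.replace("[^a-zA-Z]"," ",regex=True)

#make sentences in lowercase
clean_sentences=[s.lower() for s in clean_sentences]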

In [42]:
clean_sentences[:5]

['objective  this retrospective chart review describes the epidemiology and clinical features of    patients with culture proven mycoplasma pneumoniae infections at king abdulaziz university hospital  jeddah  saudi arabia ',
 'methods  patients with positive m  pneumoniae cultures from respiratory specimens from january      through december      were identified through the microbiology records ',
 'charts of patients were reviewed ',
 'results     patients were identified             of whom required admission ',
 'most infections         were community acquired ']
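
The runs of spaces in the output above come from the cleaning step: digits and punctuation are each replaced by a single space. A minimal sketch of the same transformation on one made-up sentence, using the standard-library `re` module instead of pandas (the helper name `clean_sentence` and the sample text are assumptions for illustration, not part of the notebook):

```python
import re

def clean_sentence(sentence):
    # Replace every character that is not an ASCII letter with a space,
    # then lowercase the result. The string length is unchanged.
    return re.sub(r"[^a-zA-Z]", " ", sentence).lower()

# Made-up sample sentence, not taken from metadata.csv
sample = "RESULTS: 40 patients were identified."
cleaned = clean_sentence(sample)
print(repr(cleaned))
print(cleaned.split())
```

Because every removed character becomes a space, numbers such as patient counts and percentages disappear from the cleaned text, while `split()` still recovers the remaining words.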

In [43]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\win10\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [44]:
from nltk.corpus import stopwords
stop_words=stopwords.words('english')

In [45]:
#remove stop words
def remove_stopwords(sen):
    #sen is a list of words; returns the remaining words joined by single spaces
    sen_new=" ".join([i for i in sen if i not in stop_words])
    return sen_new

In [46]:
clean_sentences=[remove_stopwords(r.split()) for r in clean_sentences]

In [47]:
#load GloVe vectors: each line holds a word followed by its 100 coefficients
word_embeddings={}
with open('glove.6B.100d.txt',encoding='utf-8') as f:
    for line in f:
        values=line.split()
        word=values[0]
        coefs= np.asarray(values[1:],dtype='float32')
        word_embeddings[word]=coefs

In [48]:
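#sentence vector = sum of the word vectors divided by (word count + 0.002)
#words missing from GloVe contribute a zero vector; empty sentences get a zero vector
sentence_vectors=[]
for i in clean_sentences:
    if len(i)!=0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.002)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)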

In [49]:
len(sentence_vectors)

662
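
To make the averaging step concrete, here is a small sketch with a made-up 3-dimensional embedding table. The names `toy_embeddings` and `average_vector` are hypothetical, and the sketch divides by the plain word count rather than the count plus 0.002 used in the cell above, so the numbers differ slightly from what the notebook would produce.

```python
import numpy as np

# Made-up 3-dimensional embeddings, standing in for the 100-dimensional GloVe vectors
toy_embeddings = {
    "virus": np.array([1.0, 0.0, 2.0]),
    "protein": np.array([3.0, 2.0, 0.0]),
}

def average_vector(sentence, embeddings, dim):
    # Mean of the word vectors; unknown words count as zero vectors,
    # and an empty sentence maps to the zero vector.
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    return total / len(words)

known = average_vector("virus protein", toy_embeddings, 3)
with_unknown = average_vector("virus protein xyz", toy_embeddings, 3)
empty = average_vector("", toy_embeddings, 3)
print(known, with_unknown, empty)
```

Note that an out-of-vocabulary word still counts toward the word total, so it pulls the average toward zero without changing its direction.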

In [50]:
sim_matrix=np.zeros([len(sentences),len(sentences)])

In [51]:
from sklearn.metrics.pairwise import cosine_similarity

In [52]:
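#fill the matrix with the cosine similarity of every pair of distinct sentences; the diagonal stays 0
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i!=j:
            sim_matrix[i][j]=cosine_similarity(sentence_vectors[i].reshape(1,100),sentence_vectors[j].reshape(1,100))[0][0]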
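
The double loop calls `cosine_similarity` once per pair, which gets slow as the number of sentences grows. One possible alternative, sketched here with plain NumPy under the assumption that all sentence vectors have the same length, computes the whole matrix in one step. The function name `cosine_similarity_matrix` and the toy vectors are illustrative only.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    # Pairwise cosine similarity of the rows, with the diagonal set to 0
    # so that no sentence is linked to itself.
    mat = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows keep similarity 0 instead of dividing by zero
    unit = mat / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    return sim

# Toy 2-dimensional vectors; the last one mimics an empty sentence
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim)
```

In the notebook this would be called as `cosine_similarity_matrix(sentence_vectors)`; results may differ from the loop version in the last few decimal places because the loop works on float32 inputs.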

In [53]:
import networkx as nx

In [54]:
nx_graph=nx.from_numpy_array(sim_matrix)
scores=nx.pagerank(nx_graph)
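
For intuition about what the scores mean, the following is a simplified power-iteration sketch of weighted PageRank on a toy 3-sentence similarity matrix. It is not the networkx implementation: the function `pagerank_sketch`, the fixed iteration count, and the toy weights are assumptions made for illustration, and only the damping factor 0.85 matches the networkx default.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100):
    # Each node repeatedly shares its score with its neighbours,
    # in proportion to the edge weights (here: sentence similarities).
    n = len(weights)
    row_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if row_sums[i] > 0:
                    incoming += scores[i] * weights[i][j] / row_sums[i]
                else:
                    incoming += scores[i] / n  # a node with no edges spreads its score evenly
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Toy symmetric similarity matrix: sentence 0 is similar to both others
toy_weights = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank_sketch(toy_weights)
print(toy_scores)
```

The sentence that is most similar to the rest of the collection ends up with the highest score, which is why the top-ranked sentences are used as the summary.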

In [55]:
ranked_sentences=sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [56]:
len(ranked_sentences)

662

In [87]:
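#build the summary from the sn top-ranked sentences
sn=10
abstract=""

for i in range(sn):
    #add a space after each sentence so that consecutive sentences do not run together
    abstract=abstract+ranked_sentences[i][1]+" "
    print(ranked_sentences[i][1])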

RESULTS: Close examination of whole genome sequence of 54 SARS-CoV isolates identified before 14(th )October 2003, including 22 from patients in Singapore, revealed the mutations engendered during human-to-Vero and Vero-to-human transmission as well as in multiple Vero cell passages in order to refine our analysis of human-to-human transmission.
To minimize the effects of sequencing errors and additional mutations during the cell culture, three strategies were applied to estimate the mutation rate by 1) using the closely related sequences as background controls; 2) adjusting the divergence time for cell culture; or 3) using the common variants only.
Thus, the virus DNA-chip development reported in Retrovirology is an important advance in diagnostic application which could be a potent clinical tool for characterizing viral co-infections in AIDS as well as other patients.
We were also able to detect heterogeneous viral sequences in primary lung tissues, suggesting a possible coevolution 

In [115]:
#keywords extraction
#join all ranked sentences into one text, separated by spaces
lent=len(ranked_sentences)
keys=" ".join(ranked_sentences[s][1] for s in range(lent))

In [86]:
import rake

In [66]:
import operator

In [116]:
rake_object=rake.Rake("SmartStoplist.txt",5,2,5)

In [117]:
keywords=rake_object.run(keys)

In [118]:
print(keywords)

[('home thermometer', 4.0), ('spike protein', 3.8), ('ring vaccination', 3.7857142857142856), ('cell culture', 3.7272727272727275), ('common ancestor', 3.666666666666667), ('rna viruses', 3.499092558983666), ('sars coronavirus', 3.4871794871794872), ('viral', 1.8181818181818181), ('protein', 1.8), ('sequence', 1.7391304347826086), ('infection', 1.7083333333333333), ('analysis', 1.6666666666666667), ('common', 1.6666666666666667), ('expression', 1.6363636363636365), ('sequences', 1.625), ('virus', 1.6153846153846154), ('model', 1.6153846153846154), ('epidemic', 1.6153846153846154), ('genes', 1.588235294117647), ('genomes', 1.5833333333333333), ('viruses', 1.5517241379310345), ('genome', 1.5416666666666667), ('coronavirus', 1.5384615384615385), ('important', 1.5294117647058822), ('proteins', 1.5263157894736843), ('based', 1.5), ('suggest', 1.5), ('dependent', 1.5), ('species', 1.5), ('disease', 1.5), ('method', 1.4545454545454546), ('isolates', 1.4545454545454546), ('compared', 1.375), (