![stack](images/kim.jpg)

# Polar Data Insights and Search Analytics for the Deep and Scientific Web 

The pitch:

"Our preliminary work in this area has shown that the unstructured textual data, when combined with structured scientific information can inform answers to grand challenge problems such as identifying ice sheet breakage/melt over decadal time spans; bird migration around Greenland, oil spills and natural disasters, sea ice decline and its relation to natural disasters, and other critical questions for the Polar community derived from the President’s National Strategy for the Arctic Region."


## Polar Data Insights (PDI for short)

![stack](images/stack.png)

In [1]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
# ENGLISH_STOP_WORDS lives in sklearn.feature_extraction.text; the old stop_words module was removed
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import json

import urllib.request

solr_root = 'http://integration.polar-deep-vm.apps.int.nsidc.org:8983/solr/rda/select?fl=url,content,type&indent=on&'
solr_query = 'q=url:"case-statement"&rows=200&start=1&wt=json'
solr_url = solr_root + solr_query
req = urllib.request.Request(solr_url)
# parsing response
r = urllib.request.urlopen(req).read()
json_response = json.loads(r.decode('utf-8'))
solr_documents = json_response['response']['docs']
print("Processing {0} documents. \n".format(len(solr_documents)))

Processing 170 documents. 
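
The query string in the cell above is assembled by hand. A minimal sketch of the same request built with `urllib.parse.urlencode`, which takes care of escaping the quotes and the colon, is shown here. The helper name `build_solr_url` and its defaults are assumptions for illustration and are not part of the original notebook. Solr's `start` parameter is a zero-based offset, so the `start=1` used above skips the first matching document; the sketch defaults to `0`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_url(base, query, fields, rows=200, start=0):
    """Hypothetical helper: build a Solr select URL with escaped parameters."""
    params = {
        "q": query,
        "fl": ",".join(fields),  # fields to return
        "rows": rows,            # maximum number of documents
        "start": start,          # zero-based offset into the result set
        "wt": "json",            # response format
    }
    return base + "?" + urlencode(params)


url = build_solr_url(
    "http://integration.polar-deep-vm.apps.int.nsidc.org:8983/solr/rda/select",
    'url:"case-statement"',
    ["url", "content", "type"],
)
print(url)
# Round-trip check: the decoded query matches what we passed in
print(parse_qs(urlsplit(url).query)["q"])
```

The resulting URL can be passed to `urllib.request.Request` exactly as `solr_url` is above.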



* *Now we tokenize each document, remove stop words, and lemmatize the remaining tokens (WordNet lemmatizer)*


In [2]:
## we need a tokenizer
tokenizer = RegexpTokenizer(r'\w+')
## we need a lemmatizer (kept under the name `stemmer`)
stemmer = WordNetLemmatizer()
## our custom stop words
my_stop_words = {
                    'http', 'www', 'edu', 'org', 'com', 'rda', 'data', 'researcher', 'event', 'service',
                    'group', 'research', 'community', 'use', 'work', 'member', 'case', 'working', 'science',
                    'meeting', 'organisational', 'news', 'plenary', 'recommendation', 'project', 'standard',
                    'statement', 'school', 'university', 'membership', 'output', '2017', 'brokering',
                    'stakeholder', 'repository', 'user', 'citation', 'chair', 'framework', 'information',
                    'metadata', 'content', 'sharing', 'pid'
                }
stop_words = my_stop_words.union(ENGLISH_STOP_WORDS)

# document list will contain our corpus after cleaning it.
document_list = []
# pairs is a list of the urls and the size of their content
pairs = []
# just the documents urls
urls = []

for item in solr_documents:
    # If we apply NER it should be the first step.
    # We tokenize the content and lowercase it (for now)
    tokens = tokenizer.tokenize((item['content'][0]).lower())
    # We lemmatize each token
    stemmed_tokens = [stemmer.lemmatize(i) for i in tokens]
    # If the token is not in our stop words and its length is >2 and <25 we add it to the cleaned document
    document = [i for i in stemmed_tokens if i not in stop_words and (len(i) > 2 and len(i) < 25)]
    # To debug uncomment the next line
    # print("{0}\n Document size before stop words: {1}, after: {2} ".format(item['url'],len(stemmed_tokens),len(document)))
    document_list.append(document)
    pairs.append((item['url'],len(document)))
    urls.append(item['url'])
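
To make the cleaning step easier to experiment with, here is a minimal, dependency-free sketch of the same tokenize-and-filter logic. It is an approximation, not the notebook's pipeline: `re.findall(r"\w+", ...)` stands in for `RegexpTokenizer(r'\w+')`, lemmatization is left out so that no WordNet download is needed, and the tiny `STOP_WORDS` set and the function name `clean_document` are assumptions made for the example.

```python
import re

# Tiny stand-in stop-word set (assumption); the notebook uses a much larger one
STOP_WORDS = {"the", "and", "for", "data", "research"}


def clean_document(text, stop_words=STOP_WORDS, min_len=3, max_len=24):
    """Lowercase, split on word characters, drop stop words and odd-length tokens."""
    tokens = re.findall(r"\w+", text.lower())
    # Same length rule as the cell above: longer than 2 and shorter than 25 characters
    return [t for t in tokens if t not in stop_words and min_len <= len(t) <= max_len]


sample = "The RDA working group shares research data on sea ice and glaciology."
print(clean_document(sample))
# ['rda', 'working', 'group', 'shares', 'sea', 'ice', 'glaciology']
```

Note that `shares` stays in its inflected form here; in the notebook the lemmatizer would normalize such tokens before the stop-word check.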

* We build the LDA model with Gensim, a library for topic modeling. The output is a list of the topics present in our corpus.

In [3]:
num_passes = 10
num_topics = 20
words_per_topic = 6

dictionary = corpora.Dictionary(document_list)
corpus = [dictionary.doc2bow(text) for text in document_list]
lda_model = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=num_passes)
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
# Now let's print the topics found
for topic in topics:
    print(topic)

(0, '0.012*"indigenous" + 0.009*"international" + 0.008*"sov" + 0.008*"policy" + 0.006*"creating" + 0.006*"network"')
(1, '0.013*"array" + 0.011*"database" + 0.008*"chemistry" + 0.006*"library" + 0.005*"chemical" + 0.005*"domain"')
(2, '0.010*"humanity" + 0.009*"empirical" + 0.007*"wheat" + 0.007*"practice" + 0.006*"developing" + 0.006*"provide"')
(3, '0.009*"archive" + 0.009*"transportation" + 0.009*"record" + 0.008*"professional" + 0.007*"geospatial" + 0.006*"usa"')
(4, '0.063*"cost" + 0.039*"recovery" + 0.025*"model" + 0.024*"centre" + 0.013*"study" + 0.010*"policy"')
(5, '0.013*"software" + 0.010*"code" + 0.010*"source" + 0.009*"outcome" + 0.008*"creating" + 0.008*"deliverable"')
(6, '0.028*"rice" + 0.013*"national" + 0.011*"interoperability" + 0.006*"international" + 0.006*"legal" + 0.006*"codata"')
(7, '0.020*"preservation" + 0.012*"tool" + 0.007*"policy" + 0.007*"career" + 0.007*"usa" + 0.007*"early"')
(8, '0.009*"type" + 0.006*"creating" + 0.006*"past" + 0.005*"plenaries" + 0.0
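
Each topic printed above is a tuple of a topic index and a formatted string of `weight*"term"` pairs. If the terms and weights are needed as data, the string can be parsed back with a regular expression. The following is a small sketch; the names `TERM_PATTERN` and `parse_topic` are assumptions introduced for the example. Gensim's `show_topics(formatted=False)` should return the pairs directly, which avoids the parsing step when the model object is still available.

```python
import re

# Matches one weight*"term" pair such as 0.063*"cost"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic):
    """Turn (index, '0.063*"cost" + ...') into (index, [(term, weight), ...])."""
    topic_id, description = topic
    pairs = [(term, float(weight)) for weight, term in TERM_PATTERN.findall(description)]
    return topic_id, pairs


example = (4, '0.063*"cost" + 0.039*"recovery" + 0.025*"model"')
topic_id, terms = parse_topic(example)
print(topic_id, terms)
# 4 [('cost', 0.063), ('recovery', 0.039), ('model', 0.025)]
```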

* The following code builds, for each topic, a list of the documents that belong to it together with their calculated probabilities.

In [4]:
import ipywidgets as widgets

minimum_likelihood = 0.1 # 10%
# Use the actual corpus size instead of a hard-coded count so no document is skipped
total_documents = len(corpus)

# We create a list for each topic containing the terms and an empty document list
docs_in_topics = [{'topic':t[0],'terms': t[1], 'documents': []} for t in topics]
# Add each document to their predicted topics if the probability is above the minimum_likelihood
for i in range(total_documents):
    doc_prob = lda_model.get_document_topics(bow=corpus[i],minimum_probability=minimum_likelihood)
    document_url = urls[i]
    # Each document could contain more than one topic, we traverse them and add the url to n topics
    for prob in doc_prob:
        topic_index = prob[0]
        topic_probability = prob[1]
        docs_in_topics[topic_index]['documents'].append((document_url,topic_probability))
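
The grouping logic can be tried out without a trained model. The sketch below uses made-up data in the shape that `get_document_topics` returns, a list of `(topic_index, probability)` pairs per document. The function name `group_by_topic` and the sample values are assumptions for illustration. Iterating with `zip` avoids keeping a separate document count in sync with the corpus.

```python
def group_by_topic(doc_topics, urls, num_topics, min_probability=0.1):
    """Assign each URL to every topic whose probability reaches the threshold."""
    groups = [[] for _ in range(num_topics)]
    for url, topic_probs in zip(urls, doc_topics):
        for topic_index, probability in topic_probs:
            if probability >= min_probability:
                groups[topic_index].append((url, probability))
    return groups


# Made-up per-document topic distributions (not real model output)
sample_urls = ["doc-a", "doc-b", "doc-c"]
sample_doc_topics = [
    [(0, 0.7), (2, 0.3)],
    [(2, 0.95), (1, 0.05)],  # the 0.05 entry falls below the threshold
    [(1, 0.6), (0, 0.4)],
]
groups = group_by_topic(sample_doc_topics, sample_urls, num_topics=3)
for index, documents in enumerate(groups):
    print(index, documents)
```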

* Now we can select a topic and print all the documents assigned to it.

In [5]:
topic_of_interest = 11 # the topic index

def getkey(doc):
    return doc[1]

print("Documents in Topic {0} ({1})".format(topic_of_interest,docs_in_topics[topic_of_interest]['terms']))
for doc in sorted(docs_in_topics[topic_of_interest]['documents'],key=getkey):
    print(" Document: {0} \n - Probability:{1}".format(doc[0],doc[1]))

Documents in Topic 11 (0.000*"national" + 0.000*"publication" + 0.000*"new" + 0.000*"practice" + 0.000*"infrastructure" + 0.000*"creating")
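
The cell above sorts documents by ascending probability, and a topic such as this one can end up with no documents at all. A small sketch of a ranking helper that lists the strongest matches first and handles the empty case follows; `rank_documents` and the sample data are assumptions, not part of the notebook.

```python
from operator import itemgetter


def rank_documents(documents, limit=None):
    """Sort (url, probability) pairs from most to least likely."""
    ranked = sorted(documents, key=itemgetter(1), reverse=True)
    return ranked if limit is None else ranked[:limit]


sample_docs = [("doc-a", 0.31), ("doc-b", 0.95), ("doc-c", 0.62)]
for url, probability in rank_documents(sample_docs, limit=2):
    print("{0}: {1:.2f}".format(url, probability))

# An empty topic simply yields an empty list
print(rank_documents([]))
```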


* After we execute the next cell we can pick any of the retrieved URLs (170 in this run) and see what LDA predicted for the selected URL.

In [6]:
url_list = widgets.Dropdown(
    options=urls,
    description='URL:',
    disabled=False,)
display(url_list)

In [7]:
# Each time we pick a new URL we should execute this cell
current_url = str(url_list.value)
print("LDA predictions for:\n" + current_url)
for topic in docs_in_topics:
    for doc in topic['documents']:
        if (current_url == doc[0]):
            print("Topic {0} Probability {1} \n - Terms: {2}".format(docs_in_topics.index(topic), doc[1],topic['terms']))

LDA predictions for:
https://rd-alliance.org/group/data-development/case-statement/data-development-case-statement.html
Topic 10 Probability 0.998359203338623 
 - Terms: 0.010*"linguistics" + 0.009*"linguistic" + 0.008*"ldig" + 0.008*"health" + 0.007*"development" + 0.006*"language"
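
The lookup above scans every topic each time a URL is selected. An alternative is to invert the structure once into a dictionary keyed by URL. The sketch below assumes the same `docs_in_topics` layout as the notebook (dictionaries with `topic`, `terms`, and `documents` keys); the function name `index_by_url` and the sample entries are made up for the example.

```python
from collections import defaultdict


def index_by_url(docs_in_topics):
    """Map each URL to the list of (topic index, probability, terms) it belongs to."""
    by_url = defaultdict(list)
    for topic in docs_in_topics:
        for url, probability in topic["documents"]:
            by_url[url].append((topic["topic"], probability, topic["terms"]))
    return dict(by_url)


# Made-up topics in the notebook's docs_in_topics layout
sample_topics = [
    {"topic": 0, "terms": '0.5*"ice"', "documents": [("doc-a", 0.8)]},
    {"topic": 1, "terms": '0.4*"bird"', "documents": [("doc-a", 0.2), ("doc-b", 0.9)]},
]
lookup = index_by_url(sample_topics)
for topic_index, probability, terms in lookup["doc-a"]:
    print("Topic {0} Probability {1} \n - Terms: {2}".format(topic_index, probability, terms))
```

With the dictionary in place, each new dropdown selection becomes a single `lookup.get(current_url, [])` call.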


In [8]:
import pyLDAvis.gensim
import warnings
pyLDAvis.enable_notebook()
warnings.filterwarnings('ignore')



In [9]:
pyLDAvis.gensim.prepare(corpus=corpus, topic_model=lda_model, dictionary=dictionary, sort_topics=False)

### References and Links



> L. A. Lopez, R. Duerr and S. J. S. Khalsa, "Optimizing Apache Nutch for domain specific crawling at large scale," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1967-1971.
> doi: 10.1109/BigData.2015.7363976
