# Polar Data Insights and Search Analytics for the Deep and Scientific Web 

The pitch:

"Our preliminary work in this area has shown that the unstructured textual data, when combined with structured scientific information can inform answers to grand challenge problems such as identifying ice sheet breakage/melt over decadal time spans; bird migration around Greenland, oil spills and natural disasters, sea ice decline and its relation to natural disasters, and other critical questions for the Polar community derived from the President’s National Strategy for the Arctic Region."


## Polar Data Insights (PDI for short)

![stack](stack.png)


## Focused Crawls using BCube's fork of Nutch



## Solr schema (relevant fields)


```xml

 <field name="_version_" type="long" indexed="false" stored="true"/>
    <field name="id" type="string" stored="true" indexed="true"/>
    <!-- fields for index-basic plugin -->
    <field name="host" type="url" stored="true" indexed="true"/>
    <field name="url" type="url" stored="true" indexed="true" required="true"/>
    <!-- stored=true for highlighting, use term vectors  and positions for fast highlighting -->
    <field name="content" type="text_general" stored="true" indexed="false"/>
    <field name="title" type="text_general" stored="true" indexed="true"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>
    <!-- catch-all field -->
    <field name="text" type="text_general" stored="false" indexed="true" multiValued="true"/>
    <!-- fields for index-anchor plugin -->
    <field name="anchor" type="text_general" stored="true" indexed="true" multiValued="true"/>
    <!-- fields for index-more plugin -->
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="contentLength" type="string" stored="true" indexed="false"/>
    <field name="lastModified" type="date" stored="true" indexed="false"/>
    <field name="date" type="tdate" stored="true" indexed="true"/>
    <!-- fields for languageidentifier plugin -->
    <field name="lang" type="string" stored="true" indexed="true"/>

    <!-- - - - - - - BCUBE PLUGINS - - - - - -  -->

    <!-- fields for index-rawcontent plugin -->
    <field name="raw_content" type="text_general" stored="true" indexed="true" multiValued="false"/>
    <!-- field for index-xmlnamespaces plugin -->
    <field name="xml_namespaces" type="string" stored="true" indexed="true" multiValued="true"/>
    <!-- field for index-links plugin -->
    <field name="inlinks" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="outlinks" type="string" stored="true" indexed="true" multiValued="true"/>
    <!-- field for index-bcubefilter plugin -->
    <field name="url_hash" type="string" stored="true" indexed="true" multiValued="false"/>
    <!-- field for index-httpresponse plugin -->
    <field name="response_headers" type="string" stored="true" indexed="true" multiValued="true"/>

```

In [1]:
# We need to download nltk's wordnet first
import nltk
nltk.download('wordnet')
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import json
import urllib.request

[nltk_data] Downloading package wordnet to /home/vagrant/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# This can be customized to use any reachable Solr endpoint
# For convenience, in the case of web crawls we use the content, url and MIME type fields
solr_root = 'http://integration.pdi-solr.labs.nsidc.org/solr/rda/select?'
solr_fields = 'fl=url,content,type&'
# Note: Solr's `start` offset is zero-based, so start=1 skips the first match
solr_query = 'q=url:"case-statement"&rows=200&start=1&wt=json'
solr_url = solr_root + solr_fields + solr_query
req = urllib.request.Request(solr_url)
# parsing response
r = urllib.request.urlopen(req).read()
json_response = json.loads(r.decode('utf-8'))
solr_documents = json_response['response']['docs']
print("Processing {0} documents. \n".format(len(solr_documents)))

Processing 170 documents. 
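
The query string above is assembled by hand, which leaves the quotes and colon in `q` unescaped. A minimal sketch of the same request URL built with `urllib.parse.urlencode`, assuming the same endpoint and parameters; `build_solr_url` is a hypothetical helper that is not part of the original notebook, and it defaults to `start=0` rather than the `start=1` used above:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_solr_url(solr_root, query, fields, rows=200, start=0):
    """Return a Solr select URL with every parameter percent-encoded.

    Hypothetical helper for illustration; `start` is zero-based in Solr.
    """
    params = {
        'q': query,
        'fl': ','.join(fields),
        'rows': rows,
        'start': start,
        'wt': 'json',
    }
    # Make sure there is exactly one '?' between the path and the parameters
    return solr_root.rstrip('?') + '?' + urlencode(params)

example_url = build_solr_url(
    'http://integration.pdi-solr.labs.nsidc.org/solr/rda/select?',
    'url:"case-statement"',
    ['url', 'content', 'type'],
)
print(example_url)
```

The resulting string can be passed to `urllib.request.Request` exactly like `solr_url`.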



* *Now we tokenize each document, remove stop words, and lemmatize the remaining tokens (WordNet lemmatizer)*


In [3]:
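# Tokenizer that splits on runs of word characters
tokenizer = RegexpTokenizer(r'\w+')
# WordNet lemmatizer (the variable is named `stemmer`, but it lemmatizes rather than stems)
stemmer = WordNetLemmatizer()
# Our custom stop words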
my_stop_words = {
                    'http', 'www', 'edu', 'org', 'com', 'rda', 'data', 'researcher', 'event', 'service',
                    'group', 'research', 'community', 'use', 'work', 'member', 'case', 'working', 'science',
                    'meeting', 'organisational', 'news', 'plenary', 'recommendation', 'project', 'standard',
                    'statement', 'school', 'university', 'membership', 'output', '2017', 'brokering',
                    'stakeholder', 'repository', 'user', 'citation', 'chair', 'framework', 'information',
                    'metadata', 'content', 'sharing', 'pid'
                }
stop_words = my_stop_words.union(ENGLISH_STOP_WORDS)

# document list will contain our corpus after cleaning it.
document_list = []
# pairs is a list of the urls and the size of their content
pairs = []
# just the document URLs
urls = []

def clean_document(doc):
    tokens = tokenizer.tokenize((doc).lower())
    # Lemmatize each token
    stemmed_tokens = [stemmer.lemmatize(i) for i in tokens]
    # Keep a token only if it is not a stop word and its length is >2 and <25
    document = [i for i in stemmed_tokens if i not in stop_words and (len(i) > 2 and len(i) < 25)]
    return document

for doc in solr_documents:
    document = clean_document(doc['content'][0])
    document_list.append(document)
    pairs.append((doc['url'],len(document)))
    urls.append(doc['url'])
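
To make the filtering rule concrete, here is a minimal, dependency-free sketch of the same clean-up applied to a single sentence. It assumes a tiny hypothetical stop-word set and skips lemmatization entirely, so it illustrates only the tokenize, lowercase, and length-filter steps, not the notebook's full `clean_document`:

```python
import re

# Hypothetical, deliberately tiny stop-word set for the example
DEMO_STOP_WORDS = {'the', 'and', 'for', 'data'}

def simple_clean(text, stop_words=DEMO_STOP_WORDS, min_len=3, max_len=24):
    """Lowercase, split on word characters, drop stop words and too-short/too-long tokens."""
    tokens = re.findall(r'\w+', text.lower())
    return [t for t in tokens
            if t not in stop_words and min_len <= len(t) <= max_len]

cleaned = simple_clean("The RDA is a forum for sharing data and metadata.")
print(cleaned)
```

The bounds `min_len=3` and `max_len=24` mirror the `len(i) > 2 and len(i) < 25` condition used above.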

* We build the LDA model with Gensim, a library for topic modeling; the output is a list of the topics present in our corpus.

In [4]:
num_passes = 5
num_topics = 20
words_per_topic = 6

dictionary = corpora.Dictionary(document_list)
corpus = [dictionary.doc2bow(text) for text in document_list]
lda_model = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=num_passes)
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
# Now let's print the topics found
for topic in topics:
    print(topic)

(0, '0.005*"creating" + 0.005*"issue" + 0.005*"past" + 0.005*"plenaries" + 0.005*"water" + 0.004*"fabric"')
(1, '0.013*"collection" + 0.009*"sample" + 0.007*"physical" + 0.007*"digital" + 0.007*"provenance" + 0.005*"schema"')
(2, '0.008*"fishery" + 0.006*"interoperability" + 0.005*"open" + 0.005*"outcome" + 0.005*"survey" + 0.005*"creating"')
(3, '0.006*"library" + 0.005*"search" + 0.005*"creating" + 0.005*"past" + 0.004*"plenaries" + 0.004*"new"')
(4, '0.009*"agriculture" + 0.007*"field" + 0.007*"semantic" + 0.007*"agricultural" + 0.006*"interoperability" + 0.005*"network"')
(5, '0.018*"governance" + 0.010*"model" + 0.008*"disciplinary" + 0.008*"interoperability" + 0.007*"infrastructure" + 0.007*"practice"')
(6, '0.027*"rice" + 0.009*"interoperability" + 0.009*"link" + 0.006*"literature" + 0.006*"infrastructure" + 0.005*"ontology"')
(7, '0.020*"national" + 0.017*"mediation" + 0.012*"component" + 0.008*"registry" + 0.008*"infrastructure" + 0.006*"model"')
(8, '0.006*"vocabulary" + 0.00
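
Each topic comes back as a formatted string, which is awkward to post-process. A small sketch that splits such a string into `(term, weight)` pairs with a regular expression; `parse_topic_terms` is a hypothetical helper, and the sample string is the start of topic 1 from the output above:

```python
import re

# Matches fragments such as 0.013*"collection"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic_terms(topic_string):
    """Turn '0.013*"collection" + 0.009*"sample"' into [('collection', 0.013), ...]."""
    return [(term, float(weight))
            for weight, term in TERM_PATTERN.findall(topic_string)]

sample = '0.013*"collection" + 0.009*"sample" + 0.007*"physical"'
parsed = parse_topic_terms(sample)
print(parsed)
```

Gensim's `show_topics(formatted=False)` should return comparable pairs directly, which avoids the string parsing when the model object is still at hand.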

* The following code builds, for each topic, a list of the documents assigned to it together with their estimated probabilities.

In [5]:
minimum_likelihood = 0.1 # 10%
total_documents = len(urls)

# We create one entry per topic containing its terms and an empty document list
topic_list = [{'topic':t[0],'terms': t[1], 'documents': []} for t in topics]
# Add each document to its predicted topics if the probability is above minimum_likelihood
for i in range(total_documents):
    doc_prob = lda_model.get_document_topics(bow=corpus[i],minimum_probability=minimum_likelihood)
    document_url = urls[i]
    # A document can belong to more than one topic, so we add its url to every topic returned
    for prob in doc_prob:
        topic_index = prob[0]
        topic_probability = prob[1]
        topic_list[topic_index]['documents'].append((document_url,topic_probability))
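
The grouping logic is easier to follow on toy data. The sketch below mimics the shape of the `get_document_topics` output (a list of `(topic_index, probability)` pairs per document) with hand-written values; the URLs and probabilities are invented for illustration only, and `group_by_topic` is a hypothetical helper:

```python
from collections import defaultdict

def group_by_topic(urls, doc_topic_probs, min_likelihood=0.1):
    """Map each topic index to the (url, probability) pairs at or above the threshold."""
    grouped = defaultdict(list)
    for url, topic_probs in zip(urls, doc_topic_probs):
        for topic_index, probability in topic_probs:
            if probability >= min_likelihood:
                grouped[topic_index].append((url, probability))
    return dict(grouped)

demo_urls = ['http://example.org/a', 'http://example.org/b']
demo_probs = [
    [(0, 0.70), (3, 0.25), (5, 0.05)],  # document a: topic 5 falls below the threshold
    [(3, 0.90)],                        # document b: a single dominant topic
]
demo_groups = group_by_topic(demo_urls, demo_probs)
print(demo_groups)
```

A dictionary keyed by topic index avoids relying on the list position matching the topic number, which the list-based version above assumes.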

* Now we can select a topic and print all the documents assigned to it.

In [None]:
topic_of_interest = 11 # the topic index
def getkey(doc):
    return doc[1]

print("Documents in Topic {0} ({1})".format(topic_of_interest,topic_list[topic_of_interest]['terms']))
for doc in sorted(topic_list[topic_of_interest]['documents'],key=getkey):
    print(" Document: {0} \n - Probability:{1}".format(doc[0],doc[1]))
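
Sorting on the probability alone lists the weakest matches first. If the strongest matches should come first, one option is to sort in descending order; the sample pairs below are invented for illustration:

```python
from operator import itemgetter

demo_documents = [
    ('http://example.org/a', 0.25),
    ('http://example.org/b', 0.90),
    ('http://example.org/c', 0.55),
]
# itemgetter(1) selects the probability; reverse=True puts the highest first
ranked = sorted(demo_documents, key=itemgetter(1), reverse=True)
for url, probability in ranked:
    print(" Document: {0} \n - Probability:{1:.2f}".format(url, probability))
```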

### Now that we have a trained model we can classify a new unseen document.

In [7]:
# For practical purposes we use a mock document, but we could just as easily query Solr or another store for the content we want to classify
# Eventually all of this should be served as a web service

# Taken from https://rd-alliance.org/groups/farm-data-sharing-ofds-wg
unseen_document = """
Farmers have the capability as they have never had before to critically evaluate management practices 
using field-scale replicated strip trials. Farmers have gained this powerful capability because yield 
monitors on combines enable accurate measurement of yields. Networks of farmers have become
increasingly common to exploit the potential of yield monitors to evaluate management practices 
at the field level. Networks of farmers have also become increasingly common because farmers understand 
the power of evaluating management practices across many fields. Collection of results from strip trials 
across many fields requires protocols for data stewardship, that is, for data reporting, sharing and archiving. 
All farmer networks have developed data stewardship protocols. """

vec = dictionary.doc2bow(clean_document(unseen_document))
predicted_topics = lda_model[vec]
print(predicted_topics)

[(4, 0.98492062)]
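
`predicted_topics` can hold several `(topic_index, probability)` pairs. A short sketch for picking the single most likely topic, using a hand-written list in the same shape as the output above (the second pair is invented for illustration):

```python
demo_prediction = [(4, 0.98492062), (7, 0.0112)]
# Choose the pair with the largest probability (second element of each tuple)
best_topic, best_probability = max(demo_prediction, key=lambda pair: pair[1])
print("Topic {0} with probability {1:.3f}".format(best_topic, best_probability))
```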


### Visualizing our model with pyLDAvis

In [11]:
# Note: pyLDAvis 3.x exposes this module as pyLDAvis.gensim_models
import pyLDAvis.gensim
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
pyLDAvis.enable_notebook()


In [12]:
pyLDAvis.gensim.prepare(corpus=corpus, topic_model=lda_model, dictionary=dictionary, sort_topics=False)

### References and Links



> L. A. Lopez, R. Duerr and S. J. S. Khalsa, "Optimizing Apache Nutch for domain specific crawling at large scale," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1967-1971. doi: 10.1109/BigData.2015.7363976
