## Topic Modeling and Insights Data Visualizations for ASSW abstracts.

--- 

This notebook uses topic modeling to analyze ASSW's abstracts. The notebook can be used to see how well the assigned sessions match the topics discovered from the abstracts themselves, so the organizers can adjust the sessions to improve how abstracts are assigned.

Note that topic modeling algorithms work best with a large, diverse corpus. The topics can be explored using LDAvis, which displays them in an X-Y plot (the intertopic distance map). Topics are represented by circles whose areas are proportional to the relative prevalence of each topic in the corpus. In the display, the user can enter a topic number (LDAvis numbers topics from 1, whereas gensim numbers them from 0; the code cells below add 1 when printing topic numbers so that both agree). The terms of the selected topic are displayed on the right, ranked by significance (weight). Hovering over a circle selects that topic temporarily; clicking keeps it selected. The user can also click on a term in the right-hand panel to show the topics in which that term occurs.

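The slider at the top of the right-hand panel adjusts the relevance parameter λ, which controls how strongly the term ranking favors terms that are unique to the selected topic: at λ = 1 terms are ranked by their probability within the topic, and lower values give more weight to terms that are rare elsewhere in the corpus. A value of about 0.6 is recommended by the authors of LDAvis (Sievert & Shirley, 2014). Blue bars represent overall term frequency, while red bars show term frequency within the selected topic; the list of terms shown changes when λ is set to less than one.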
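
The slider's effect on term ranking can be sketched in a few lines. The function below follows the relevance definition in Sievert & Shirley (2014), λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)). The term probabilities are invented for illustration and are not values from the ASSW corpus.

```python
import math

def relevance(p_term_given_topic, p_term, lam=0.6):
    """Relevance of a term to a topic as defined by Sievert & Shirley (2014)."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical probabilities: (p(term | topic), p(term) in the whole corpus)
terms = {
    "ice":        (0.050, 0.040),   # frequent in the topic, but frequent everywhere
    "permafrost": (0.020, 0.002),   # rarer, but concentrated in this topic
}

def rank_terms(terms, lam):
    """Sort terms from most to least relevant for the given lambda."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam=lam), reverse=True)

print(rank_terms(terms, lam=1.0))  # ranking by p(term | topic) only
print(rank_terms(terms, lam=0.0))  # ranking by lift only
```

With λ = 1 the more frequent term comes first; with λ = 0 the term that is most specific to the topic comes first. Intermediate values blend the two criteria.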

The annotated image below shows the LDAvis display:

![](https://raw.githubusercontent.com/USCDataScience/pdi-topics/master/notebooks/pdi/img/pylda.png)


To provide a quick visual representation of the topics, the notebook uses word2vec to infer additional terms and then creates word clouds from these words.
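
The idea of inferring additional terms from word vectors can be illustrated with cosine similarity. This is a minimal sketch: the three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions), and `most_similar` is a hypothetical helper rather than the notebook's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy word vectors (assumed values, not taken from a trained model)
vectors = {
    "glacier": (0.9, 0.1, 0.0),
    "ice_sheet": (0.8, 0.2, 0.1),
    "policy": (0.0, 0.1, 0.9),
    "governance": (0.1, 0.0, 0.8),
}

def most_similar(term, vectors, topn=1):
    """Return the topn words whose vectors are closest to the given term."""
    scored = [(cosine(vectors[term], vec), word)
              for word, vec in vectors.items() if word != term]
    return [word for _, word in sorted(scored, reverse=True)[:topn]]

print(most_similar("glacier", vectors))
print(most_similar("policy", vectors))
```

Terms found this way could be added to a topic's top terms before drawing its word cloud.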



**Database Schema:**

Example:

```json
{
        "abstract":["Clouds play a key role in the energy balance of the atmosphere due to their radiative effects, and have a critical  influence on the ice sheet's radiation budget. Changes in the glacier system on the Antarctic Peninsula (AP)  have been observed: disintegration of ice shelves, acceleration and thinning of glaciers, variations in the limits  between glacier faces and retreat of glacier fronts... "],
        "entities":["Marta Caballero1 (marta.caballero@fau.de)",
          " Matthias Braun1",
          " Thomas Mölg1 ",
          "1Friedrich Alexander-Universität Erlangen-Nürnberg",
          " Institut für Geographie",
          " Erlangen",
          " Germany "],
        "title":["Evaluation of Satellite Derived Cloud Top Properties on the Antarctic Peninsula "],
        "id":"Tue_30_AC-1_377 ",
        "_version_":1621681835183439872}
```
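
In the example above, `abstract` and `title` are stored as single-element lists and the values carry trailing whitespace. A small accessor can make records easier to read. The helper `first_text` is a hypothetical convenience, not part of the notebook, and it assumes every record follows the example's layout.

```python
def first_text(record, field):
    """Return the first value of a multi-valued field, stripped of whitespace."""
    values = record.get(field) or [""]
    return values[0].strip()

# Abbreviated copy of the example record
record = {
    "abstract": ["Clouds play a key role in the energy balance of the atmosphere... "],
    "title": ["Evaluation of Satellite Derived Cloud Top Properties on the Antarctic Peninsula "],
    "id": "Tue_30_AC-1_377 ",
}

print(first_text(record, "title"))
print(record["id"].strip())
```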



In [None]:
# Cell 1: Import requirements

import warnings
warnings.filterwarnings('ignore')

import nltk
import spacy
import urllib.request
import json
import string
import itertools as it
import numpy as np
import pandas as pd
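# ENGLISH_STOP_WORDS is exposed by sklearn.feature_extraction.text;
# the private stop_words module was removed in later scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS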

from gensim import corpora, models
from gensim.models.ldamulticore import LdaMulticore
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.models.word2vec import LineSentence
from os import path


# wordcloud dependencies
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')

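# Heavy imports
# from gensim.models import KeyedVectors
# pretrained_model = KeyedVectors.load_word2vec_format('models/glove.6B.100d.vec', binary=False)

# spaCy 2.x model shortcut; spaCy 3 removed 'en' (load 'en_core_web_sm' there instead).
# The parser stays enabled because sentence splitting (doc.sents) depends on it.
nlp = spacy.load('en', disable=['tagger', 'ner'])
# nlp = spacy.load('en')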



In [None]:
# Cell 3: Querying Solr

# terms = ['ice', 'climate'] to include only abstracts with specified terms
terms = ['*']
entities = ['*']

# Return "page_size" documents with each Solr query until complete
page_size = 1000
cursorMark = '*'

solr_documents = []
solr_root = 'http://integration.pdi-solr.labs.nsidc.org/solr/assw/select?indent=on&'
more_results = True


# Wrap each term in wildcards so it matches anywhere in the field
if terms[0] != '*':
    terms_wildcard = ['*' + t + '*' for t in terms]
else:
    terms_wildcard = ['*']


if entities[0] != '*':
    entities_wildcard = ['*' + e + '*' for e in entities]
else:
    entities_wildcard = ['*']

# Combine multiple terms (or entities) with OR
terms_query = '%20OR%20abstract:'.join(terms_wildcard)
entities_query = '%20OR%20entities:'.join(entities_wildcard)

# Solr cursor marks can contain characters such as '+', '/' and '=', so they are URL-encoded below
from urllib.parse import quote

query_string = 'q=(abstract:{}%20AND%20abstract:/.{{2}}.*/%20AND%20NOT%20title:/.{{300}}.*/)' + \
                '%20AND%20(entities:{})&wt=json&rows={}&cursorMark={}&sort=id+asc'
while more_results:
    solr_query = query_string.format(terms_query,
                                     entities_query,
                                     page_size,
                                     quote(cursorMark, safe=''))
    solr_url = solr_root + solr_query
    print('Querying: \n' + solr_url)
    req = urllib.request.Request(solr_url)
    # Parse the JSON response and collect this page of documents
    r = urllib.request.urlopen(req).read()
    json_response = json.loads(r.decode('utf-8'))
    solr_documents.extend(json_response['response']['docs'])
    nextCursorMark = json_response['nextCursorMark']
    # Solr returns the same cursor mark once every result has been read
    if nextCursorMark == cursorMark:
        more_results = False
    else:
        cursorMark = nextCursorMark

total_found = json_response['response']['numFound']
print("Processing {0} documents out of {1} total. \n".format(len(solr_documents), total_found))
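
The paging logic above follows Solr's cursor pattern: request a page, append its documents, and stop once the returned `nextCursorMark` equals the one that was sent. Here is a minimal offline sketch of that loop. `fetch_page` is a hypothetical stub standing in for the HTTP request, and its integer cursor marks are a simplification (real Solr cursor marks are opaque strings).

```python
def fetch_page(cursor_mark, page_size=2):
    """Hypothetical stand-in for a Solr request; serves a fixed list in pages."""
    all_docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    start = 0 if cursor_mark == "*" else int(cursor_mark)
    docs = all_docs[start:start + page_size]
    # Like Solr, return the same cursor once there is nothing left to read
    next_mark = str(start + len(docs)) if docs else cursor_mark
    return {"response": {"docs": docs, "numFound": len(all_docs)},
            "nextCursorMark": next_mark}

def fetch_all(fetch):
    """Follow cursor marks until the server repeats the one we sent."""
    documents, cursor_mark = [], "*"
    while True:
        page = fetch(cursor_mark)
        documents.extend(page["response"]["docs"])
        if page["nextCursorMark"] == cursor_mark:
            return documents
        cursor_mark = page["nextCursorMark"]

docs = fetch_all(fetch_page)
print([d["id"] for d in docs])
```

Passing the fetch function as an argument keeps the loop testable without a running Solr instance.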

In [None]:
years = {str(year) for year in range(2000,2020)}

nlp.Defaults.stop_words |= years
nlp.Defaults.stop_words |= ENGLISH_STOP_WORDS

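# Domain-specific stop words; every entry needs a trailing comma, otherwise
# adjacent string literals are silently concatenated into one word
nlp.Defaults.stop_words |= {'data', 'area', 'interesting', 'water', 'region', 'using', 'different', 'science',
                 'change', 'result', 'research', 'technique', 'datum', 'model', 'use',
                 'observation', 'measurement', 'sample', 'study', 'analysis'}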

In [None]:
%%time
# Cell 4: Cleaning our documents  
%load_ext line_profiler


ALL_STOP_WORDS = ENGLISH_STOP_WORDS.union(nlp.Defaults.stop_words)

def flatten(top_list):
    for inner in top_list:
        if isinstance(inner, (list,tuple)):
            for j in flatten(inner):
                yield j
        else:
            yield inner

# Clean a parsed document: lemmatize each token, lowercase it, and drop punctuation and stop words.
# Returns (raw_sentences, processed_sentences), each a list with one token list per sentence.
def clean_document(doc):
    processed_sentences = []
    raw_sentences = []
    for sentence in doc.sents:
        raw_sentences.append(str(sentence).split(' '))
        tokens = [token.lemma_.encode('ascii', errors='ignore').decode().strip().lower()
                  for token in sentence if token.lemma_ not in string.punctuation]
        cleaned_sentence = [token for token in tokens if token not in ALL_STOP_WORDS]
#         cleaned_sentence = [token for token in cleaned_sentence if len(token) > 1 ]
        # Lemmas are lowercased above, so spaCy 2's pronoun placeholder '-PRON-' appears as '-pron-';
        # empty strings are left over from tokens that were entirely non-ASCII or whitespace
        cleaned_sentence = [token for token in cleaned_sentence if token and token != '-pron-']
        processed_sentences.append(cleaned_sentence)
    return (raw_sentences, processed_sentences)

# unigram_documents will contain our corpus after cleaning it:
# one entry per abstract, each a list of tokenized sentences.

unigram_documents = []

texts = [doc['abstract'][0] for doc in solr_documents]
document_list = []

for doc in nlp.pipe(texts, batch_size=200, n_threads=4):
    assert doc.is_parsed
    raw_sentences, cleaned_sentences = clean_document(doc)
    unigram_documents.append(cleaned_sentences)


In [None]:
# Collect every sentence in the corpus to train the phrase (bigram) model
unigram_sentences = []
bigram_corpus = []
bigram_docs = []

for doc in unigram_documents:
    for sentence in doc:
        unigram_sentences.append(sentence)

# Learn word pairs that co-occur often enough to be treated as a single token
bigram_model = Phrases(unigram_sentences, min_count=2)
bigram_phraser = Phraser(bigram_model)

# bigram_docs: for each abstract, a list of sentences (strings) with detected bigrams joined by '_'
for unigram_doc in unigram_documents:
    bigram_sentences = []
    for unigram_sentence in unigram_doc:
        bigram_sentence = ' '.join(bigram_phraser[unigram_sentence])
        bigram_sentences.append(bigram_sentence)
    bigram_docs.append(bigram_sentences)

# bigram_corpus: for each abstract, one flat list of tokens
for bigram_sentdoc in bigram_docs:
    bigram_tokens = []
    for sentence in bigram_sentdoc:
        bigram_tokens.append(sentence.split())
    bigram_corpus.append(list(flatten(bigram_tokens)))

print(bigram_corpus[2])
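
To give a feel for what `Phrases` does, the sketch below scores adjacent word pairs with the formula documented for gensim's default scorer: (count(a, b) − min_count) · vocabulary size / (count(a) · count(b)). It is a simplified pure-Python illustration on a toy corpus, not gensim's implementation. The sentences are invented, and the vocabulary size here counts unigrams only, which is an assumption of this sketch.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs; higher scores suggest a collocation."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

# Toy corpus of already-cleaned sentences
sentences = [
    ["sea", "ice", "extent", "decline"],
    ["sea", "ice", "thickness", "decline"],
    ["ice", "sheet", "mass", "loss"],
]
scores = score_bigrams(sentences)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Pairs whose score exceeds a threshold would be merged into a single token such as `sea_ice`; pairs seen no more than `min_count` times score zero or below and stay separate.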


 **Building the LDA model using bigrams**

In [None]:
# Cell 5: LDA Topic Modeling

# num_passes should be tuned; 3 is only a rough guess at when convergence is reached.
num_passes = 3
num_topics = 20
words_per_topic = 7

dictionary = corpora.Dictionary(bigram_corpus)
lda_corpus = [dictionary.doc2bow(text) for text in bigram_corpus]

lda_model = LdaMulticore(lda_corpus,
                         num_topics=num_topics,
                         id2word=dictionary,
                         passes=num_passes,
                         workers=2
                        )
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
print("Topic List: \n")
for topic_id, topic_terms in topics:
    # gensim numbers topics from 0; add 1 to match the pyLDAvis display
    print('Topic ' + str(topic_id + 1) + ': ', topic_terms)


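# In newer pyLDAvis releases (3.3+) this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim
print("\nPyLDAVis: \n")
pyLDAvis.enable_notebook()
# We recommend setting the lambda slider to 0.6, as suggested by the LDAvis paper's authors.
# sort_topics=False keeps gensim's topic order so the numbers match the Topic List above.
pyLDAvis.gensim.prepare(corpus=lda_corpus,
                        topic_model=lda_model,
                        dictionary=dictionary,
                        sort_topics=False)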

In [None]:
# Cell 6: Classifying an abstract using our GENSIM model

# ESSI abstracts taken from https://meetingorganizer.copernicus.org/EGU2018/EGU2018-9778.pdf

document = """
Submarine slope failure is a ubiquitous process and dominant pathway for sediment and organic carbon flux from 
continental margins to the deep sea. Slope failure occurs over a wide range of temporal and spatial scales, 
from small (10e4-10e5 m3/event), sub-annual failures on heavily sedimented river deltas to margin-altering and 
tsunamigenic (10-100 km3/event) open slope failures occurring on glacial-interglacial timescales. 
Despite their importance to basic (closing the global source-to-sink sediment budget) and applied 
(submarine geohazards) research, submarine slope failure frequency and magnitude on most continental margins 
remains poorly constrained. This is primarily due to difficulty in 1) directly observing events, and 2) reconstructing 
age and size, particularly in the geologic record. The state of knowledge regarding submarine slope failure 
preconditioning and triggering factors is more qualitative than quantitative; a vague hierarchy of factor importance 
has been established in most settings but slope failures cannot yet be forecasted or hindcasted from 
a priori knowledge of these factors.
"""

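# clean_document returns (raw_sentences, cleaned_sentences); doc2bow needs one flat token list,
# so apply the bigram phraser to each cleaned sentence and flatten the result
_, cleaned_sentences = clean_document(nlp(document))
tokens = [token for sentence in cleaned_sentences for token in bigram_phraser[sentence]]
vec = dictionary.doc2bow(tokens)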
predicted_topics = lda_model[vec]
predicted_topics = [(p[0]+1, p[1]) for p in predicted_topics]
print(predicted_topics)
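
To compare assigned sessions with the discovered topics, as described in the introduction, one option is to take each abstract's most probable topic and tabulate it against its session. The sketch below assumes a list of (session, predicted_topics) pairs is already available. The session labels and probabilities are invented for illustration, and how a session label is obtained from a record is left open.

```python
from collections import Counter, defaultdict

def dominant_topic(predicted_topics):
    """Return the topic number with the highest probability."""
    return max(predicted_topics, key=lambda pair: pair[1])[0]

# Hypothetical (session, [(topic, probability), ...]) pairs
predictions = [
    ("AC-1", [(3, 0.70), (8, 0.20)]),
    ("AC-1", [(3, 0.55), (12, 0.30)]),
    ("AC-1", [(8, 0.60), (3, 0.25)]),
    ("OC-2", [(12, 0.80)]),
]

# Count how often each topic dominates within each session
topics_by_session = defaultdict(Counter)
for session, predicted in predictions:
    topics_by_session[session][dominant_topic(predicted)] += 1

for session, counts in sorted(topics_by_session.items()):
    print(session, counts.most_common())
```

A session whose abstracts spread across many dominant topics, or a topic that dominates in several sessions, may indicate that the session boundaries are worth revisiting.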

### References and Links

> L. A. Lopez, R. Duerr and S. J. S. Khalsa, "Optimizing Apache Nutch for domain specific crawling at large scale," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1967-1971. doi: 10.1109/BigData.2015.7363976

> C. Sievert and K. E. Shirley, "LDAvis: A method for visualizing and interpreting topics," Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 2014, pp. 63-70.

> J. Pennington, R. Socher and C. D. Manning, "GloVe: Global Vectors for Word Representation," 2014.