## Topic Modeling and Insights Data Visualizations for EGU sessions

--- 

This notebook performs topic modeling on EGU sessions using the abstracts submitted each year. The objective is to see whether the given sessions match the topics discovered in the abstracts themselves, so that the organizers can adjust the sessions and assign abstracts more effectively.

Keep in mind that topic modeling algorithms work better with a bigger, more diverse corpus. The abstracts submitted to a particular category overlap heavily, which makes picking out distinct topics more of a challenge.

The right question may be: **if we were to reduce the number of individual sessions, what would that look like?**
Likewise, if we want the sessions' topics to be more specific, we could set the LDA algorithm to a higher number of topics and inspect the result. Now let's get to the code.

The current notebook uses Solr to retrieve these abstracts.
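
As a minimal sketch of how such a request could be assembled, the snippet below builds one page of a cursor-paginated Solr `select` URL with `urllib.parse.urlencode`, which takes care of the percent-encoding. The helper name `build_solr_url` and the host `solr.example.org` are placeholders for illustration, and `id` is assumed to be the unique key of the index. No network call is made.

```python
from urllib.parse import urlencode

def build_solr_url(root, query, rows=1000, cursor_mark='*'):
    """Return a Solr select URL for one page of a cursor-paginated query."""
    params = {
        'q': query,
        'wt': 'json',
        'rows': rows,
        # cursorMark requires a sort that includes the unique key (assumed: id)
        'sort': 'id asc',
        'cursorMark': cursor_mark,
    }
    return root + '?' + urlencode(params)

# Placeholder host; the query asks for abstracts in any ESSI session
url = build_solr_url('http://solr.example.org/solr/egu/select',
                     'abstract:* AND sessions:*ESSI*')
print(url)
```

For the next page, the same function would be called again with the `nextCursorMark` value returned by Solr, until that value stops changing.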


**Database Schema:**

Example:

```json
doc = {
"entities":[
    "Jeffrey Obelcz  and Warren T. Wood",
    "NRC Postdoctoral Fellow",
    "Naval Research Lab",
    "Seaﬂoor Sciences",
    "United States jbobelcz@gmail.com",
    "Naval",
    "Research Lab",
    "Seaﬂoor Sciences",
    "United States"],
"id": "EGU2018-9778",
"sessions": ["ESSI4.3"],
"file": ["EGU2018-9778"],
"presentation": ["Posters"],
"year": [2018],
"title": ["Towards a Quantitative Understanding of Parameters Driving Submarine Slope Failure: A Machine Learning Approach"],
"category": ["ESSI"],
"abstract":["Submarine slope failure is a ubiquitous process and dominant pathway for sediment and organic carbon ﬂux from continental margins to the deep sea. Slope failure occurs over a wide range of temporal and spatial scales ..."]
}
```

* category: the main session id, CL, AS etc. Keep in mind that these codes have changed through the years.
* presentation: oral, poster, pico etc.
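
Most fields in the example arrive as single-element lists, which is why the cleaning cell later reads them as `doc['field'][0]`. A small sketch of that unwrapping, assuming a hypothetical helper `flatten_doc` that is not part of the notebook:

```python
def flatten_doc(doc, keep_lists=('entities',)):
    """Unwrap single-valued list fields of a Solr document into scalars."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, list) and key not in keep_lists:
            # Multi-valued Solr fields come back as lists; keep the first value
            flat[key] = value[0] if value else None
        else:
            flat[key] = value
    return flat

# Shortened copy of the example document above
example = {
    'id': 'EGU2018-9778',
    'sessions': ['ESSI4.3'],
    'presentation': ['Posters'],
    'year': [2018],
    'category': ['ESSI'],
    'entities': ['Naval Research Lab', 'United States'],
}
record = flatten_doc(example)
print(record['id'], record['year'], record['sessions'])
```

Fields listed in `keep_lists` stay as lists, since `entities` genuinely holds several values per abstract.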


-----

### Current Disciplinary Individual Sessions for Earth & Space Science Informatics (ESSI) 2018


**ESSI1 – Community-driven challenges and solutions dealing with Informatics**

 * **ESSI1.1** - Informatics in Oceanography and Ocean Science 
 * GI1.3/AS5.15/BG1.30/CL5.10/EMRP4.5/**ESSI1.6**/HS11.12/SM5.03 - Environmental sensor network (co-organized) 
 * IE4.3/SSS13.73/AS5.19/BG1.20/**ESSI1.8**/HS11.4/NH11.13 - Geostatistical and statistical tools to perform the data fusion of large datasets in geo-engineering and environmental studies (co-organized)
 * NH9.12/AS5.17/CL5.30/**ESSI1.9**/GI0.4/GMPV6.12/HS11.44/SM3.15/SSS13.66 - Methods and Tools for Natural Risk Management and Communications – Innovative ways of delivering information to end users and sharing data among the scientific community (co-organized)
 * IE4.7/SSS13.74/BG1.43/**ESSI1.10**/NH9.21/SM1.10 -Media Citizen Science for Earth Systems in the Era of Big Data (co-organized)
   
**ESSI2 – Infrastructures across the Earth and Space Sciences**

 * **ESSI2.1** - Metadata, Data Models, Semantics, and Collaboration
 * **ESSI2.2** - Data cubes of Big Earth Data - a new paradigm for accessing and processing Earth Science Data
 * IE4.1/NP4.3/AS5.13/CL5.18/**ESSI2.3**/GD10.6/HS3.7/NH11.14/SM7.03 - Big data and machine learning in geosciences (co-organized)
 * **ESSI2.4** - Virtual Research Environments: creating online collaborative environments to support research in the Earth Sciences and beyond (co-organised with American Geophysical Union)
 * **ESSI2.6** - Web-based Exchange and Processing of Environmental Data
 * **ESSI2.7** - Future Shock: Evolving Earth Science Data and Information Systems across the entire research lifecycle
 * **ESSI2.8**/GI1.6 - Environmental physical and data infrastructures: practices, access and technologies - towards system level understanding (co-organized)
 * **ESSI2.9** - Integrating data and services in solid Earth sciences
 * GI1.1/EMRP4.3/**ESSI2.10**/SSS13.15 - Applications of Data, Methods and Models in Geosciences (co-organized)
 * GI1.5/EMRP4.6/**ESSI2.11**/NH11.10/PS5.5 - Data fusion, integration, correlation and advances of non-destructive testing methods and numerical developments for engineering and geosciences applications (co-organized)
 * IE4.5/AS5.14/BG1.22/CL5.26/EMRP4.35/**ESSI2.12**/GD10.7/GI1.7 - Information extraction from satellite observations using data-driven methods (co-organized)

**ESSI3 – Open Science 2.0 Informatics for Earth and Space Sciences**

 * **ESSI3.1** - Free and Open Source Software (FOSS) for Geoinformatics and Geosciences
 * **ESSI3.2** - Innovative Evaluation and Prediction for Large Earth Science Datasets
 * **ESSI3.4** - Earth science on Cloud, HPC and Grid
 * **ESSI3.5** - Open Data, Reproducible Research, and Open Science
 
**ESSI4 – Visualization for scientific discovery and communication**

 * **ESSI4.1**/SSS11.6 - State of the Art in Earth Science Data Visualization (co-organized)
 * SC2.6/**ESSI4.2** - Visualization in Earth Science: best practices (co-organized)
 * **ESSI4.3** - Advancing Data-driven Workflows, Analytics and Visualization in Earth System Science
 * IE3.1/GI0.3/BG1.35/CR2.8/**ESSI4.4**/GM2.12/NH6.5 - Close and Long Range Sensing of Environment (co-sponsored by ISPRS) (co-organized)

In [None]:
# Cell 1: Import requirements

import warnings
warnings.filterwarnings('ignore')

import nltk
nltk.download('wordnet')
import urllib.request
import json
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import scattertext as st
import spacy
import pandas as pd
from datetime import datetime
from pandas import json_normalize  # available at top level since pandas 1.0
import random

import string
pseudo_rand = [ random.choice(string.ascii_letters) for i in range(4)]
seed = ''.join(pseudo_rand)


# wordcloud dependencies
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')


In [None]:
# Cell 2: Loading pretrained word2vec model from GloVe 
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('models/glove.6B.100d.vec', binary=False)

In [None]:
# Cell 3: Querying Solr

# terms = ['ice', 'climate'] to include only abstracts with specified terms
terms = ['*']
years = ['*']
entities = ['*']
sessions = ['NH']

# We sample Solr for up to "page_size" documents that comply with our criteria
page_size = 1000
cursorMark = '*'

solr_documents = []
solr_root = 'http://integration.pdi-solr.labs.nsidc.org/solr/egu/select?indent=on&'
more_results = True


if terms[0] != '*':
    terms_wildcard = ['*' + t + '*' for t in terms]
else:
    terms_wildcard = ['*']
    
if sessions[0] != '*':
    sessions_wildcard = ['*' + s + '*' for s in sessions]
else:
    sessions_wildcard = ['*']

terms_query = '%20OR%20abstract:'.join(terms_wildcard)
years_query = '%20OR%20year:'.join(years)  
entities_query = '%20OR%20entities:'.join(entities)
sessions_query = '%20OR%20sessions:'.join(sessions_wildcard)
query_string = 'q=(abstract:{}%20AND%20abstract:/.{{2}}.*/%20AND%20NOT%20title:/.{{300}}.*/)%20AND%20(year:{})' + \
                '%20AND%20(entities:{})%20AND%20(sessions:{})&wt=json&rows={}&cursorMark={}&sort=id+asc'
while (more_results):    
    solr_query = query_string.format(terms_query,
                                     years_query,
                                     entities_query,
                                     sessions_query,
                                     page_size,
                                     cursorMark)
    solr_url = solr_root + solr_query
    print('Querying: \n' + solr_url)
    req = urllib.request.Request(solr_url)
    # parsing response
    r = urllib.request.urlopen(req).read()
    json_response = json.loads(r.decode('utf-8'))
    solr_documents.extend(json_response['response']['docs'])
    nextCursorMark = json_response['nextCursorMark']
    if (nextCursorMark == cursorMark):
        more_results = False
        break
    else: 
        cursorMark = nextCursorMark

total_found = json_response['response']['numFound']
print("Processing {0} documents out of {1} total. \n".format(len(solr_documents), total_found))

In [None]:
# Cell 4: Cleaning our documents 

## we need a tokenizer
tokenizer = RegexpTokenizer(r'\w+')
## we need a lemmatizer (kept under the name `stemmer`)
stemmer = WordNetLemmatizer()
## our custom stop words (used for Gensim only)
my_stop_words = {
                    'area', 'data', 'event', 'doc', 'group', 'research', 'http', 'community', 'result', 
                    'metadata', 'content', 'sharing', 'previous', 'model', 'science', 'scientiﬁc', 'user'
                }
years = [str(year) for year in range(2000,2020)]
words_and_years = my_stop_words.union(years)
stop_words = words_and_years.union(ENGLISH_STOP_WORDS)

# Function to clean up the documents: lemmatizes the words to their base form and removes the stop words.
def clean_document(doc):
    tokens = tokenizer.tokenize((doc).lower())
    # We lemmatize each token (the variable names say "stem", but no stemming is applied)
    stemmed_tokens = [stemmer.lemmatize(i) for i in tokens]
    # If the token is not in our stop words and its length is >2 and <25 we add it to the cleaned document
    document = [i for i in stemmed_tokens if i not in stop_words and (len(i) > 2 and len(i) < 25)]
    return document

# document list will contain our corpus after cleaning it.
document_list = []
gensim_documents = []
word_cloud_text_all = ''

# artifact of parsing the sessions from the pdf documents
garbage_str = '</a>'


for doc in solr_documents:
    document = clean_document(doc['abstract'][0])
    if 'sessions' in doc:
        sindex = doc['sessions'][0].find(garbage_str)
        if sindex != -1:
            sessions = doc['sessions'][0][0:sindex]
        else: 
            sessions = doc['sessions'][0]
    else:
        sessions = 'NAN'
    document_list.append({ 'id': doc['id'],
                                   'text': ' '.join(document), 
                                   'year': str(doc['year'][0]),
                                   'title': doc['title'][0],
                                   'category': doc['category'][0].replace('/', '').replace('<', ''),
                                   'sessions':sessions})
    gensim_documents.append(document)
    # separate documents with a space so the last and first words do not fuse together
    word_cloud_text_all = word_cloud_text_all + ' ' + ' '.join(document)

dictionary = corpora.Dictionary(gensim_documents)
lda_corpus = [dictionary.doc2bow(text) for text in gensim_documents]

df = pd.DataFrame.from_dict(document_list)
axis_category = pd.DataFrame(df.groupby(['category', 'year'])['category'].count()).rename(columns={'category': 'count'})
print(axis_category.to_string())

**Building the LDA model with Gensim, a library for topic modeling. First we reduce the sessions from 4 to 3 and see which topics come out.**
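
Gensim's LDA does not see raw text. The cleaning cell turns each abstract into a bag of words, a list of `(token_id, count)` pairs, via `dictionary.doc2bow`. The pure-Python sketch below mimics that representation on toy documents so the input format is easier to picture. `build_vocab` and `to_bow` are illustrative helpers, not Gensim functions, and Gensim's actual id assignment may differ.

```python
from collections import Counter

def build_vocab(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are ignored."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Toy cleaned documents (already tokenized and lemmatized)
docs = [
    ['slope', 'failure', 'sediment', 'slope'],
    ['sediment', 'flux', 'margin'],
]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation, which is one reason the resulting topics are context-independent.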

In [None]:
# Cell 5: LDA Topic Modeling

# num_passes should be tuned; 5 is just a rough guess of when convergence is reached.
num_passes = 5 
num_topics = 3
words_per_topic = 7

lda_model = models.ldamodel.LdaModel(lda_corpus, num_topics=num_topics, id2word = dictionary, passes=num_passes)
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
print ("Topic List: \n")
for topic in topics:
    print(topic)

import warnings
warnings.filterwarnings('ignore')

import pyLDAvis.gensim
print ("\nPyLDAVis: \n")
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(corpus=lda_corpus, topic_model=lda_model, dictionary=dictionary, sort_topics=False, mds='tsne')


Clearly, if we were to reduce the main sessions to 3, we would end up with less fancy names than the current ones, but they still make sense. This modeling is context-independent; we could bring in some context via Word2Vec, which we discuss later on. For now, let's run an experiment on the current corpus: having trained a simple model with 3 topics, let's classify an abstract and see which topics it falls into.
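
Indexing the model with a bag-of-words vector, as in `lda_model[vec]`, returns a list of `(topic_id, probability)` pairs. A minimal sketch of how one might reduce that list to a single label, assuming a hypothetical helper `dominant_topic` and made-up probabilities:

```python
def dominant_topic(predicted_topics, min_probability=0.1):
    """Return the (topic_id, probability) pair with the highest probability,
    or None when no topic exceeds min_probability."""
    candidates = [p for p in predicted_topics if p[1] > min_probability]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p[1])

# Invented output for a 3-topic model
predicted = [(0, 0.08), (1, 0.71), (2, 0.21)]
best = dominant_topic(predicted)
print(best)
```

The threshold plays the same role as `min_likelihood` in the last cell: topics with a very small share are treated as noise rather than as a real assignment.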


In [None]:
# Cell 6: Classifying an abstract using our GENSIM model

# ESSI abstracts taken from https://meetingorganizer.copernicus.org/EGU2018/EGU2018-9778.pdf

document = """
Submarine slope failure is a ubiquitous process and dominant pathway for sediment and organic carbon ﬂux from 
continental margins to the deep sea. Slope failure occurs over a wide range of temporal and spatial scales, 
from small (10e4-10e5 m3/event), sub-annual failures on heavily sedimented river deltas to margin-altering and 
tsunamigenic (10-100 km3/event) open slope failures occurring on glacial-interglacial timescales. 
Despite their importance to basic (closing the global source-to-sink sediment budget) and applied 
(submarine geohazards) research, submarine slope failure frequency and magnitude on most continental margins 
remains poorly constrained. This is primarily due to difﬁculty in 1) directly observing events, and 2) reconstructing 
age and size, particularly in the geologic record. The state of knowledge regarding submarine slope failure 
preconditioning and triggering factors is more qualitative than quantitative; a vague hierarchy of factor importance 
has been established in most settings but slope failures cannot yet be forecasted or hindcasted from 
a priori knowledge of these factors.
"""

vec = dictionary.doc2bow(clean_document(document))
predicted_topics = lda_model[vec]
print(predicted_topics)

**Now let's increase the number of topics to 20 and see what we get**

In [None]:
# Cell 7: LDA Topic Modeling expanding our topics

from collections import defaultdict
import re
p = re.compile('.(\".*\")')
topic_list = defaultdict(list)
# num_passes should be tuned; 5 is just a rough guess of when convergence is reached.
num_passes = 5
num_topics = 20
words_per_topic = 7

lda_model = models.ldamodel.LdaModel(lda_corpus,
                                     num_topics=num_topics,
                                     id2word = dictionary,
                                     passes=num_passes,
                                     chunksize=17)
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
print ("Topic List:\n")
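for topic in topics:
    # each entry is (topic_id, '0.012*"word" + 0.010*"other" + ...')
    topic_id, weighted_terms = topic[0], topic[1].split(' + ')
    for weighted_term in weighted_terms:
        # drop the weight prefix and the surrounding quotes
        term = weighted_term.split('*', 1)[1].strip().replace('"', '')
        topic_list[topic_id].append(term)
    print(topic)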

import warnings
warnings.filterwarnings('ignore')
import pyLDAvis.gensim



print ("\nPyLDAVis: \n")
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(corpus=lda_corpus, topic_model=lda_model, dictionary=dictionary, sort_topics=False)

In [None]:
# Cell 8, let's infer some context using word2Vec

for topic_number in range(num_topics):
    # Seed with the top 3 topic terms, skipping any that are missing from the GloVe vocabulary
    seed_terms = [w for w in topic_list[topic_number][0:3] if w in model]
    sm = model.most_similar(positive=seed_terms, topn=30) if seed_terms else []
    similar_words = [w[0] for w in sm]
    similar_words.extend(topic_list[topic_number])
    word_cloud_text = ' '.join(similar_words)
    print(word_cloud_text)

    wordcloud = WordCloud(
        scale=4,
        prefer_horizontal=0.60,
        min_font_size=20,
        max_font_size=80,
        max_words=100,
        background_color="white").generate(word_cloud_text)
    
    wordcloud.to_file('topic-' + str(topic_number) + '.png')
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
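
The cell above leans on `most_similar`, which ranks vocabulary words by cosine similarity to the mean of the normalized seed vectors. The sketch below reproduces that idea in plain Python with made-up 3-dimensional vectors. The vectors and helper names are illustrative only and have nothing to do with the real GloVe embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def toy_most_similar(vectors, seeds, topn=2):
    """Rank non-seed words by cosine similarity to the mean of the unit seed vectors."""
    unit = [normalize(vectors[s]) for s in seeds]
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    scores = [(w, cosine(mean, v)) for w, v in vectors.items() if w not in seeds]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

# Invented embeddings: three "water" words and one unrelated word
toy_vectors = {
    'flood': [0.9, 0.1, 0.0],
    'river': [0.8, 0.3, 0.1],
    'rain': [0.7, 0.2, 0.2],
    'volcano': [0.0, 0.2, 0.9],
}
ranked = toy_most_similar(toy_vectors, ['flood', 'river'])
print(ranked)
```

Seeding with several topic terms at once, as the notebook does with the top three, pulls the neighbours toward what the terms have in common rather than toward any single word.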


In [None]:
# Cell 9: let's create a wordcloud of the whole corpus
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(word_cloud_text_all)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
# Cell 10: we list the abstracts and the predicted topics

min_likelihood  = 0.1

def createLink(doc):
    baseURL = 'https://meetingorganizer.copernicus.org/EGU' + str(doc['year']) + '/' + doc['id'] + '.pdf'
    return baseURL

def classify(doc):
    vec = dictionary.doc2bow(clean_document(doc))
    predicted_topics = lda_model[vec]
    return [p for p in predicted_topics if p[1]> min_likelihood]


from IPython.display import display, HTML
for doc in document_list:
    doc['predicted'] = classify(doc['text'])
    display(HTML('<br>Abstract <a href="{}" target="_blank">{}</a> belongs to session {}, predicted in topics -> {}'.format(
        createLink(doc),
        doc['id'],
        doc['sessions'],
        doc['predicted'] )))    
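
To compare the predicted topics against the assigned sessions, which is the stated goal of this notebook, one option is to count how many abstracts of each session land in each dominant topic. The sketch below assumes records shaped like the entries of `document_list` after the cell above (a `sessions` string and a `predicted` list of `(topic_id, probability)` pairs). The helper `session_topic_counts` and the sample values are invented for illustration.

```python
from collections import Counter, defaultdict

def session_topic_counts(records):
    """Count, per session, how many abstracts have each topic as their most likely one."""
    table = defaultdict(Counter)
    for rec in records:
        if not rec['predicted']:
            continue  # no topic passed the likelihood threshold
        topic_id, _ = max(rec['predicted'], key=lambda p: p[1])
        table[rec['sessions']][topic_id] += 1
    return table

# Invented records mimicking document_list entries
sample = [
    {'sessions': 'ESSI4.3', 'predicted': [(2, 0.8), (5, 0.15)]},
    {'sessions': 'ESSI4.3', 'predicted': [(2, 0.6), (7, 0.3)]},
    {'sessions': 'ESSI2.1', 'predicted': [(5, 0.9)]},
    {'sessions': 'ESSI2.1', 'predicted': []},
]
table = session_topic_counts(sample)
for session, counts in sorted(table.items()):
    print(session, dict(counts))
```

A session whose abstracts spread over many topics may be a candidate for splitting, while several sessions concentrated on the same topic may overlap; either reading would still need a human check of the abstracts.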

### References and Links



> L. A. Lopez, R. Duerr and S. J. S. Khalsa, "Optimizing apache nutch for domain specific crawling at large scale," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1967-1971.
doi: 10.1109/BigData.2015.7363976

> Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. Link to preprint: arxiv.org/abs/1703.00565

> Sievert, C. & Shirley, K. E. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. 63-70. 

> Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.