## Topic Modeling and Insights Data Visualizations for EGU sessions

--- 

This notebook runs topic modeling on EGU sessions using the abstracts submitted each year. The objective is to see whether the existing sessions match the topics discovered in the abstracts themselves, so the organizers can adjust the sessions and assign abstracts more effectively.

Note that topic modeling algorithms work better with a larger, more diverse corpus. Abstracts submitted to a single category overlap heavily, so picking out distinct topics is more of a challenge.

The right question may be: **if we were to reduce the number of individual sessions, what would that look like?**
Likewise, if we want the sessions' topics to be more specific, we can set the LDA algorithm to a higher number of topics and inspect the result. Now let's get to the code.



> The current notebook uses Natural Hazards (NH) abstracts from 2011 to 2018 and models what the main topics have been over those years. The query can be modified to consider only specific years or a particular term.
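
As a rough illustration of how the query can be narrowed to specific years or a term, the sketch below assembles the `q` parameter from lists of values. The helper names `field_clause` and `build_q` are hypothetical and not part of the notebook. The field names (`abstract`, `year`, `sessions`) are the ones used in the Solr index, and the leading and trailing `*` wildcards mirror what the query cell does.

```python
from urllib.parse import quote


def field_clause(field, values, wildcard=False):
    """Return a clause such as (year:2016 OR year:2017). Hypothetical helper."""
    values = [str(v) for v in values]
    if values == ['*']:
        return '({}:*)'.format(field)
    if wildcard:
        # Match the value anywhere inside the field, as the query cell does
        values = ['*{}*'.format(v) for v in values]
    return '(' + ' OR '.join('{}:{}'.format(field, v) for v in values) + ')'


def build_q(terms=('*',), years=('*',), sessions=('*',)):
    """Combine the per-field clauses with AND. Hypothetical helper."""
    clauses = [
        field_clause('abstract', terms, wildcard=True),
        field_clause('year', years),
        field_clause('sessions', sessions, wildcard=True),
    ]
    return ' AND '.join(clauses)


q = build_q(terms=['landslide'], years=[2016, 2017], sessions=['NH'])
print(q)
# URL-encode spaces while keeping the Solr operators readable
print('q=' + quote(q, safe='*:()'))
```

Building the clauses from lists keeps the wildcard handling in one place, so restricting the run to a couple of years only means changing the `years` argument.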



**Database Schema:**

Example:

```python
doc = {
"entities":[
    "Jeffrey Obelcz  and Warren T. Wood",
    "NRC Postdoctoral Fellow",
    "Naval Research Lab",
    "Seaﬂoor Sciences",
    "United States jbobelcz@gmail.com",
    "Naval",
    "Research Lab",
    "Seaﬂoor Sciences",
    "United States"],
"id": "EGU2018-9778",
"sessions": ["ESSI4.3"],
"file": ["EGU2018-9778"],
"presentation": ["Posters"],
"year": [2018],
"title": ["Towards a Quantitative Understanding of Parameters Driving Submarine Slope Failure: A Machine Learning Approach"],
"category": ["ESSI"],
"abstract":["Submarine slope failure is a ubiquitous process and dominant pathway for sediment and organic carbon ﬂux from continental margins to the deep sea. Slope failure occurs over a wide range of temporal and spatial scales ..."]
}
```

* **category**: the ID of the main disciplinary programme group (CL, AS, etc.). Keep in mind that these codes have changed over the years.
* **presentation**: the presentation type (oral, poster, PICO, etc.).
* **sessions**: the individual session for a given abstract; sessions can be co-organized.
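
Since most fields come back as single-element lists, a small helper can flatten a document before analysis. This is a minimal sketch: `flatten_doc` is a hypothetical name, and the `</a>` suffix it strips is the parsing artifact that the cleaning cell also removes from session IDs.

```python
def flatten_doc(doc, garbage='</a>'):
    """Flatten one Solr document into scalar fields (hypothetical helper)."""

    def first(field, default=None):
        # Most fields are stored as single-element lists
        value = doc.get(field)
        if isinstance(value, list):
            return value[0] if value else default
        return default if value is None else value

    session = first('sessions', 'NAN')
    cut = session.find(garbage)
    if cut != -1:
        # Drop the HTML residue left over from parsing the PDFs
        session = session[:cut]

    return {
        'id': doc['id'],
        'year': first('year'),
        'category': first('category'),
        'presentation': first('presentation'),
        'session': session,
        'title': first('title'),
    }


example = {
    'id': 'EGU2018-9778',
    'sessions': ['ESSI4.3</a>'],
    'presentation': ['Posters'],
    'year': [2018],
    'category': ['ESSI'],
}
flat = flatten_doc(example)
print(flat)
```

Missing fields fall back to `None` (or `'NAN'` for the session), so documents with incomplete metadata do not raise an error.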

-----

EGU Programme Groups:

### Disciplinary Sessions 

* Atmospheric Sciences (AS)
* Biogeosciences (BG)
* Climate: Past, Present, Future (CL)
* Cryospheric Sciences (CR)
* Earth Magnetism & Rock Physics (EMRP)
* Energy, Resources and the Environment (ERE)
* Earth & Space Science Informatics (ESSI)
* Geodesy (G)
* Geodynamics (GD)
* Geosciences Instrumentation & Data Systems (GI)
* Geomorphology (GM)
* Geochemistry, Mineralogy, Petrology & Volcanology (GMPV)
* Hydrological Sciences (HS)
* Natural Hazards (NH)
* Nonlinear Processes in Geosciences (NP)
* Ocean Sciences (OS)
* Planetary & Solar System Sciences (PS)
* Seismology (SM)
* Stratigraphy, Sedimentology & Palaeontology (SSP)
* Soil System Sciences (SSS)
* Solar-Terrestrial Sciences (ST)
* Tectonics & Structural Geology (TS)

### Union Sessions

* Union Symposia (US)
* Great Debates (GDB)
* Medal Lectures (ML)
* Short courses (SC)
* Educational and Outreach Symposia (EOS)
* EGU Plenary, Ceremonies and Networking (PCN)
* Feedback and administrative meetings (FAM)
* Townhall and splinter meetings (TSM)
* Side events (SEV)
* Press conferences (PC)

#### Interdisciplinary Events (IE)






In [None]:
# Cell 1: Import requirements

import warnings
warnings.filterwarnings('ignore')

import nltk
nltk.download('wordnet')
import urllib.request  # the query cell calls urllib.request explicitly
import json
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora, models
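# ENGLISH_STOP_WORDS is exposed by sklearn.feature_extraction.text; the old stop_words module was removed
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS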
from datetime import datetime
from pandas.io.json import json_normalize
import random

import string
pseudo_rand = [ random.choice(string.ascii_letters) for i in range(4)]
seed = ''.join(pseudo_rand)


# wordcloud dependencies
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')


In [None]:
# Cell 2: Load the pretrained GloVe vectors (stored in word2vec text format); this takes time.
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('models/glove.6B.100d.vec', binary=False)

In [None]:
# Cell 3: Querying Solr

# terms = ['ice', 'climate'] to include only abstracts with specified terms
terms = ['*']
years = ['*']
entities = ['*']
sessions = ['NH']

# Return "page_size" documents with each Solr query until complete
page_size = 1000
cursorMark = '*'

solr_documents = []
solr_root = 'http://integration.pdi-solr.labs.nsidc.org/solr/egu/select?indent=on&'
more_results = True


if terms[0] != '*':
    terms_wildcard = ['*' + t + '*' for t in terms]
else:
    terms_wildcard = ['*']

if sessions[0] != '*':
    sessions_wildcard = ['*' + s + '*' for s in sessions]
else:
    sessions_wildcard = ['*']

if entities[0] != '*':
    entities_wildcard = ['*' + e + '*' for e in entities]
else:
    entities_wildcard = ['*']

terms_query = '%20OR%20abstract:'.join(terms_wildcard)
years_query = '%20OR%20year:'.join(years)
entities_query = '%20OR%20entities:'.join(entities_wildcard)
sessions_query = '%20OR%20sessions:'.join(sessions_wildcard)
query_string = 'q=(abstract:{}%20AND%20abstract:/.{{2}}.*/%20AND%20NOT%20title:/.{{300}}.*/)%20AND%20(year:{})' + \
                '%20AND%20(entities:{})%20AND%20(sessions:{})&wt=json&rows={}&cursorMark={}&sort=id+asc'
while (more_results):    
    solr_query = query_string.format(terms_query,
                                     years_query,
                                     entities_query,
                                     sessions_query,
                                     page_size,
                                     cursorMark)
    solr_url = solr_root + solr_query
    print('Querying: \n' + solr_url)
    req = urllib.request.Request(solr_url)
    # parsing response
    r = urllib.request.urlopen(req).read()
    json_response = json.loads(r.decode('utf-8'))
    solr_documents.extend(json_response['response']['docs'])
    nextCursorMark = json_response['nextCursorMark']
    if (nextCursorMark == cursorMark):
        more_results = False
        break
    else: 
        cursorMark = nextCursorMark

total_found = json_response['response']['numFound']
print("Processing {0} documents out of {1} total. \n".format(len(solr_documents), total_found))
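
The paging logic in the query cell can be isolated from the HTTP call, which makes it easier to test. In this sketch `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` only imitates the shape of a Solr response (`response.docs` and `nextCursorMark`). Solr signals the end of a cursor walk by returning the same cursor mark that was sent.

```python
def fetch_all(fetch_page, page_size=1000):
    """Yield every document, following the cursor until it stops changing."""
    cursor = '*'
    while True:
        response = fetch_page(cursor, page_size)
        yield from response['response']['docs']
        next_cursor = response['nextCursorMark']
        if next_cursor == cursor:
            # Same cursor returned: there are no more results
            break
        cursor = next_cursor


def fake_fetch(cursor, rows, _data=tuple(range(5))):
    """Stand-in for the HTTP request; the cursor is just an offset here."""
    start = 0 if cursor == '*' else int(cursor)
    end = min(start + rows, len(_data))
    docs = [{'id': i} for i in _data[start:end]]
    return {
        'response': {'docs': docs, 'numFound': len(_data)},
        'nextCursorMark': str(end),
    }


ids = [doc['id'] for doc in fetch_all(fake_fetch, page_size=2)]
print(ids)
```

In the notebook, `fetch_page` would wrap the `urllib.request.urlopen` call and the JSON decoding, leaving the loop itself unchanged.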

In [None]:
# Cell 4: Cleaning our documents 

## we need a tokenizer
tokenizer = RegexpTokenizer(r'\w+')
## we need a lemmatizer (WordNet lemmatization rather than true stemming)
stemmer = WordNetLemmatizer()
## our custom stop words (used for Gensim only)
my_stop_words = {
                    'area', 'data', 'event', 'doc', 'group', 'research', 'http', 'community', 'result', 
                    'metadata', 'content', 'sharing', 'previous', 'model', 'science', 'scientiﬁc', 'user'
                }
years = [str(year) for year in range(2000,2020)]
words_and_years = my_stop_words.union(years)
stop_words = words_and_years.union(ENGLISH_STOP_WORDS)

# Function to clean up the documents: lemmatizes the words to their base form and removes the stop words.
def clean_document(doc):
    tokens = tokenizer.tokenize((doc).lower())
    # We lemmatize each token
    stemmed_tokens = [stemmer.lemmatize(i) for i in tokens]
    # Keep a token only if it is not a stop word and its length is > 2 and < 25
    document = [i.encode('ascii', errors='ignore').decode() for i in stemmed_tokens if i not in stop_words and (len(i) > 2 and len(i) < 25)]
    return document

# document list will contain our corpus after cleaning it.
document_list = []
gensim_documents = []
word_cloud_text_all = ''

# artifact of parsing the sessions from the pdf documents
garbage_str = '</a>'


for doc in solr_documents:
    document = clean_document(doc['abstract'][0])
    if 'sessions' in doc:
        sindex = doc['sessions'][0].find(garbage_str)
        if sindex != -1:
            sessions = doc['sessions'][0][0:sindex]
        else: 
            sessions = doc['sessions'][0]
    else:
        sessions = 'NAN'
    document_list.append({ 'id': doc['id'],
                                   'text': ' '.join(document), 
                                   'year': str(doc['year'][0]),
                                   'title': doc['title'][0],
                                   'category': doc['category'][0].replace('/', '').replace('<', ''),
                                   'sessions':sessions})
    gensim_documents.append(document)
    # separate consecutive documents with a space so their boundary words do not merge
    word_cloud_text_all = word_cloud_text_all + ' ' + ' '.join(document)

dictionary = corpora.Dictionary(gensim_documents)
lda_corpus = [dictionary.doc2bow(text) for text in gensim_documents]

df = pd.DataFrame.from_dict(document_list)
axis_category = pd.DataFrame(df.groupby(['category', 'year'])['category'].count()).rename(columns={'category': 'count'})

# We now list the categories. Since we picked one main disciplinary session we should see
# most documents under it; however, there are many co-organized sessions and those are also listed here.
print(axis_category.to_string())
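
To make the `Dictionary` and `doc2bow` step less opaque, here is a plain-Python sketch of the same idea: every distinct token gets an integer ID, and each document becomes a list of `(token_id, count)` pairs. The toy documents and the `to_bow` helper are made up for illustration, and Gensim may assign IDs in a different order.

```python
from collections import Counter

# Toy corpus standing in for the cleaned abstracts
docs = [['flood', 'risk', 'flood'], ['landslide', 'risk']]

# Map each token to an integer ID in order of first appearance
token2id = {}
for tokens in docs:
    for token in tokens:
        token2id.setdefault(token, len(token2id))


def to_bow(tokens):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


bow = [to_bow(d) for d in docs]
print(token2id)
print(bow)
```

Word order is discarded in this representation, which is why LDA on its own cannot use the context in which a word appears.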

**We build the LDA model with Gensim, a library for topic modeling. First we reduce the corpus to 3 topics and look at the terms listed for each one.**

In [None]:
# Cell 5: LDA Topic Modeling

# num_passes should be tuned; 3 is just a guesstimate of when convergence will be achieved.
num_passes = 3 
num_topics = 3
words_per_topic = 7

lda_model = models.ldamodel.LdaModel(lda_corpus, num_topics=num_topics, id2word = dictionary, passes=num_passes)
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
print ("Topic List: \n")
for topic in topics:
    t = str((int(topic[0])+ 1))
    print('Topic ' + t + ': ', topic[1:])

import warnings
warnings.filterwarnings('ignore')

import pyLDAvis.gensim
print ("\nPyLDAVis: \n")
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(corpus=lda_corpus,
                        topic_model=lda_model,
                        dictionary=dictionary,
                        sort_topics=False,
                        mds='tsne')

# We recommend setting lambda to 0.6 in the visualization, as suggested by the LDAvis authors (Sievert & Shirley, 2014).
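
The lambda slider in pyLDAvis controls the relevance metric defined by Sievert & Shirley (2014): relevance(w | t) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). The sketch below evaluates it on two terms whose probabilities are invented for illustration; `relevance` and `rank` are hypothetical helper names, not pyLDAvis functions.

```python
import math


def relevance(p_w_given_t, p_w, lam=0.6):
    """Relevance of a term for a topic, following Sievert & Shirley (2014)."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


# term -> (p(term | topic), p(term) in the whole corpus); made-up numbers
terms = {
    'model': (0.05, 0.04),     # frequent everywhere
    'tsunami': (0.03, 0.005),  # rarer overall, but concentrated in this topic
}


def rank(lam):
    """Order the toy terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam=lam), reverse=True)


print(rank(1.0))  # ranks purely by p(term | topic)
print(rank(0.6))  # rewards terms that are specific to the topic
```

With λ = 1 the generic term comes first, while λ = 0.6 promotes the term that is distinctive for the topic, which usually makes topics easier to name.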

If we were to reduce the main sessions to 3, we end up with less fancy names than the current ones, but they kind of make sense. This modeling is context-independent; we could bring in some context via Word2Vec and will discuss that later on. Now let's run an experiment on the current corpus: we trained a simple model with 3 topics, so let's classify an abstract and see which topics it falls into.


In [None]:
# Cell 6: Classifying an abstract using our GENSIM model

# ESSI abstracts taken from https://meetingorganizer.copernicus.org/EGU2018/EGU2018-9778.pdf

document = """
Submarine slope failure is a ubiquitous process and dominant pathway for sediment and organic carbon ﬂux from 
continental margins to the deep sea. Slope failure occurs over a wide range of temporal and spatial scales, 
from small (10e4-10e5 m3/event), sub-annual failures on heavily sedimented river deltas to margin-altering and 
tsunamigenic (10-100 km3/event) open slope failures occurring on glacial-interglacial timescales. 
Despite their importance to basic (closing the global source-to-sink sediment budget) and applied 
(submarine geohazards) research, submarine slope failure frequency and magnitude on most continental margins 
remains poorly constrained. This is primarily due to difﬁculty in 1) directly observing events, and 2) reconstructing 
age and size, particularly in the geologic record. The state of knowledge regarding submarine slope failure 
preconditioning and triggering factors is more qualitative than quantitative; a vague hierarchy of factor importance 
has been established in most settings but slope failures cannot yet be forecasted or hindcasted from 
a priori knowledge of these factors.
"""

vec = dictionary.doc2bow(clean_document(document))
predicted_topics = lda_model[vec]
predicted_topics = [(p[0]+1, p[1]) for p in predicted_topics]
print(predicted_topics)

**Now let's increase the number of topics to 20 and see what we get**

In [None]:
# Cell 7: LDA Topic Modeling expanding our topics

from collections import defaultdict
import re
p = re.compile('.(\".*\")')
topic_list = defaultdict(list)
# num_passes should be tuned; 3 is just a guesstimate of when convergence will be achieved.
num_passes = 3
num_topics = 20
words_per_topic = 7

lda_model = models.ldamodel.LdaModel(lda_corpus,
                                     num_topics=num_topics,
                                     id2word = dictionary,
                                     passes=num_passes)
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
print ("Topic List:\n")
for topic in topics:
    weighted_terms = topic[1].split(' + ')
    terms = [t[6:] for t in weighted_terms]
    for term in terms:
        topic_list[topics.index(topic)].append(term.replace('"',''))
    t = str((int(topic[0])+ 1))
    print('Topic ' + t + ': ', topic[1:])

import warnings
warnings.filterwarnings('ignore')
import pyLDAvis.gensim



print ("\nPyLDAVis: \n")
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(corpus=lda_corpus,
                        topic_model=lda_model,
                        dictionary=dictionary,
                        sort_topics=False)
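
Cell 7 recovers the terms by slicing the first six characters off each `weight*"term"` chunk, which only works while the weight is printed with exactly three decimals. A regular expression is a slightly more forgiving way to parse the same strings. This is a sketch on a hand-written sample string; `parse_topic` is a hypothetical helper. Gensim's `LdaModel.show_topic(topic_id, topn=...)` also returns `(term, probability)` pairs directly, which avoids string parsing altogether.

```python
import re

# Matches chunks such as 0.021*"flood"
TERM_RE = re.compile(r'([\d.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Return (term, weight) pairs from a print_topics-style string."""
    return [(term, float(weight)) for weight, term in TERM_RE.findall(topic_string)]


# Hand-written sample in the format produced by print_topics
sample = '0.021*"flood" + 0.018*"risk" + 0.009*"landslide"'
parsed = parse_topic(sample)
print(parsed)
print([term for term, _ in parsed])
```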

In [None]:
# Cell 8, let's infer some context using word2Vec

# We are taking the 3 most salient terms for each topic and getting the 30 most similar words
# from the pre-trained GloVe vector model. We visualize the results using a word-cloud

num_words = 30

# the size of each word in the word cloud is determined by its frequency; if the frequencies are the same
# the words are given random sizes.

for topic_number in range(num_topics):
    print('Word cloud for topic {} '.format(topic_number+1))
    print(topic_list[topic_number][0:3])
    # add a separator between the topic number and the first term
    file_name = 'topic-' + str(topic_number + 1) + '-' + '-'.join(topic_list[topic_number][0:3]) + '.png'
    salient_terms = topic_list[topic_number][0:3]
    sm = model.most_similar(salient_terms, topn=num_words)
    similar_words = [w[0] for w in sm]
    similar_words.extend(topic_list[topic_number])
    word_cloud_text = ' '.join(similar_words)


    wordcloud = WordCloud(
        scale=4,
        prefer_horizontal=0.60,
        min_font_size=20,
        max_font_size=80,
        max_words=100,
        background_color="white").generate(word_cloud_text)
    
    wordcloud.to_file('wordclouds/' + file_name)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
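
For readers unfamiliar with `most_similar`, the sketch below reproduces the idea with toy three-dimensional vectors: the query is the mean of the unit-normalized vectors of the given words, and the remaining words are ranked by cosine similarity to it. The vectors are invented, and the function here is a simplified stand-in rather than Gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, topn=2):
    """Rank words by cosine similarity to the mean of the `positive` vectors."""
    normalized = [unit(vectors[w]) for w in positive]
    query = unit([sum(col) / len(positive) for col in zip(*normalized)])
    scores = {
        word: sum(a * b for a, b in zip(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in positive
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]


# Made-up embeddings: the first axis is "water", the second "earthquake"
toy = {
    'flood': [1.0, 0.1, 0.0],
    'river': [0.9, 0.2, 0.1],
    'rain': [0.8, 0.3, 0.0],
    'quake': [0.0, 1.0, 0.2],
    'fault': [0.1, 0.9, 0.3],
}

neighbours = most_similar(toy, ['flood', 'river'], topn=2)
print(neighbours)
```

Because GloVe was trained on general text, the neighbours it returns add everyday context to the topic terms but can also pull in words that never occur in the abstracts.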


In [None]:
# Cell 9: let's create a wordcloud of the whole corpus
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(word_cloud_text_all)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
# Cell 10: we list the first 10 abstracts and the predicted topics

min_likelihood  = 0.1

def createLink(doc):
    baseURL = 'https://meetingorganizer.copernicus.org/EGU' + str(doc['year']) + '/' + doc['id'] + '.pdf'
    return baseURL

def classify(doc):
    vec = dictionary.doc2bow(clean_document(doc))
    predicted_topics = lda_model[vec]
    return [p for p in predicted_topics if p[1]> min_likelihood]


from IPython.display import display, HTML
# Let's predict the first 10 documents
for doc in document_list[0:10]:
    doc['predicted'] = classify(doc['text'])
    display(HTML('<br>Abstract <a href="{}" target="_blank">{}</a> belongs to session {}, predicted in topics -> {}'.format(
        createLink(doc),
        doc['id'],
        doc['sessions'],
        doc['predicted'] )))    
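
To compare sessions against discovered topics across the whole corpus rather than ten abstracts, the predictions can be tallied per session. The sketch below assumes each document already carries the `sessions` and `predicted` keys set in Cell 10; `dominant_topic` and `topics_by_session` are hypothetical helpers and the sample documents are invented.

```python
from collections import Counter, defaultdict


def dominant_topic(predicted):
    """Return the topic ID with the highest probability, or None if empty."""
    return max(predicted, key=lambda p: p[1])[0] if predicted else None


def topics_by_session(documents):
    """Count, for every session, how often each topic is the dominant one."""
    table = defaultdict(Counter)
    for doc in documents:
        table[doc['sessions']][dominant_topic(doc['predicted'])] += 1
    return table


# Invented documents in the shape produced by Cell 10
docs = [
    {'sessions': 'NH1.1', 'predicted': [(0, 0.7), (3, 0.2)]},
    {'sessions': 'NH1.1', 'predicted': [(0, 0.6)]},
    {'sessions': 'NH3.2', 'predicted': [(3, 0.9)]},
    {'sessions': 'NH3.2', 'predicted': []},
]

table = topics_by_session(docs)
for session, counts in sorted(table.items()):
    print(session, dict(counts))
```

A session whose abstracts concentrate on one topic is consistent with the model, while a session spread evenly over many topics may be a candidate for splitting or merging.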

### References and Links



> L. A. Lopez, R. Duerr and S. J. S. Khalsa, "Optimizing apache nutch for domain specific crawling at large scale," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1967-1971.
doi: 10.1109/BigData.2015.7363976

> Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. Link to preprint: arxiv.org/abs/1703.00565

> Sievert, C & Shirley, K.E.. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. 63-70. 

> Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.