## EGU Topic Modeling and Insights Data Visualizations

This notebook performs topic modeling on EGU abstracts from 2009 to 2018. The main premise is that every PDF file is organized in the same way: first the EGU copyright notice, then the abstract title, then the authors, and lastly the abstract content.

We parsed the PDFs using PDFMiner's `pdf2txt.py` utility:

```sh
ls *.pdf | xargs -n1 -P8 bash -c 'pdf2txt.py -o output/$0.txt -t text $0'
```
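
Given the layout premise stated above, one converted text file could be split into its parts roughly as follows. This is a minimal sketch, not the parser used for this notebook: it assumes that blank lines separate the copyright notice, the title, the authors, and the abstract body. The function name `split_abstract_text` and the sample text are hypothetical.

```python
import re

def split_abstract_text(raw_text):
    """Split one converted abstract into its four assumed parts.

    Assumption: blank lines separate the copyright notice, the title,
    the author list, and the abstract body, in that order.
    """
    # Split on one or more blank lines and drop empty blocks.
    blocks = [b.strip() for b in re.split(r'\n\s*\n', raw_text) if b.strip()]
    if len(blocks) < 4:
        raise ValueError('expected at least 4 blocks, found {0}'.format(len(blocks)))
    # Everything after the author block is treated as abstract content.
    abstract = ' '.join(' '.join(blocks[3:]).split())
    return {
        'copyright': ' '.join(blocks[0].split()),
        'title': ' '.join(blocks[1].split()),
        'authors': ' '.join(blocks[2].split()),
        'abstract': abstract,
    }

# Placeholder text that only mimics the assumed layout.
sample = """Copyright notice placeholder

A placeholder title that wraps
over two lines

A. Author and B. Author

First paragraph of the abstract.

Second paragraph of the abstract.
"""

parsed = split_abstract_text(sample)
print(parsed['title'])
```

Files that do not follow the assumed layout raise a `ValueError`, which makes it easy to list the PDFs that need manual inspection.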

The current notebook uses Solr to index the parsed content.



**Database Schema:**

```python
doc = {
    'year': year,
    'file': fname,
    'title': title,
    'entities': entities,
    'abstract': abstract,
    'category': category,
    'sessions': full_session_code,
    'presentation': presentation,
    'timestamp': datetime.now().isoformat(),
}
```

* `category`: the main session code (CL, AS, etc.). Keep in mind that these codes have changed over the years.
* `presentation`: oral, poster, PICO, etc.
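
To illustrate how `category` relates to `sessions`, the sketch below assembles one record with the schema's keys and derives the category from the leading letters of the session code. The helpers `main_category` and `build_doc`, the "letters followed by digits" code layout, and all example values are assumptions made for illustration, not the indexing code behind this notebook.

```python
import re
from datetime import datetime

def main_category(full_session_code):
    """Return the leading letters of a session code, e.g. 'CL' from 'CL4.07'.

    Assumption: session codes start with the division letters followed
    by digits. Codes that do not match yield None.
    """
    match = re.match(r'\s*([A-Za-z]+)', full_session_code or '')
    return match.group(1).upper() if match else None

def build_doc(year, fname, title, abstract, full_session_code,
              presentation, entities=None):
    """Assemble one record using the keys of the schema above."""
    return {
        'year': year,
        'file': fname,
        'title': title,
        'entities': entities or [],
        'abstract': abstract,
        'category': main_category(full_session_code),
        'sessions': full_session_code,
        'presentation': presentation,
        'timestamp': datetime.now().isoformat(),
    }

# Placeholder values, not taken from a real abstract.
example = build_doc(2018, 'placeholder.pdf', 'A placeholder title',
                    'Placeholder abstract text.', 'CL4.07', 'poster')
print(example['category'])
```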


-----

### Disciplinary Sessions 

* Atmospheric Sciences (AS)
* Biogeosciences (BG)
* Climate: Past, Present, Future (CL)
* Cryospheric Sciences (CR)
* Earth Magnetism & Rock Physics (EMRP)
* Energy, Resources and the Environment (ERE)
* Earth & Space Science Informatics (ESSI)
* Geodesy (G)
* Geodynamics (GD)
* Geosciences Instrumentation & Data Systems (GI)
* Geomorphology (GM)
* Geochemistry, Mineralogy, Petrology & Volcanology (GMPV)
* Hydrological Sciences (HS)
* Natural Hazards (NH)
* Nonlinear Processes in Geosciences (NP)
* Ocean Sciences (OS)
* Planetary & Solar System Sciences (PS)
* Seismology (SM)
* Stratigraphy, Sedimentology & Palaeontology (SSP)
* Soil System Sciences (SSS)
* Solar-Terrestrial Sciences (ST)
* Tectonics & Structural Geology (TS)

### Union Sessions

* Union Symposia (US)
* Great Debates (GDB)
* Medal Lectures (ML)
* Short courses (SC)
* Educational and Outreach Symposia (EOS)
* EGU Plenary, Ceremonies and Networking (PCN)
* Feedback and administrative meetings (FAM)
* Townhall and splinter meetings (TSM)
* Side events (SEV)
* Press conferences (PC)

In [None]:
# Import requirements
import json
import urllib.request
from datetime import datetime

import nltk
import pandas as pd
import scattertext as st
import spacy
from gensim import corpora, models
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# The lemmatizer needs the WordNet data
nltk.download('wordnet')

In [None]:
# This query can be customized to use any solr endpoint

solr_root = 'http://integration.pdi-solr.labs.nsidc.org/solr/egu/select?indent=on&'
solr_query = 'q=year:2011%20OR%20year:2018&wt=json&rows=200&sort=random_1234%20desc'
solr_url = solr_root + solr_query

req = urllib.request.Request(solr_url)
# parsing response
r = urllib.request.urlopen(req).read()
json_response = json.loads(r.decode('utf-8'))
solr_documents = json_response['response']['docs']
print("Processing {0} randomly sampled documents. \n".format(len(solr_documents)))
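
The query string above is written by hand with `%20` escapes. As a hedged alternative, the same parameters can be assembled with the standard library's `urllib.parse.urlencode`, which handles the escaping. The helper name `build_solr_url` and the `localhost` endpoint are hypothetical. The `random_<seed>` sort field is copied from the query above and depends on the Solr schema providing it.

```python
from urllib.parse import urlencode, quote, parse_qs, urlsplit

def build_solr_url(select_url, years, rows=200, seed=1234):
    """Build a select URL that samples `rows` random documents from `years`."""
    params = {
        'q': ' OR '.join('year:{0}'.format(y) for y in years),
        'wt': 'json',
        'rows': rows,
        # Same random sort field as the query above (schema dependent).
        'sort': 'random_{0} desc'.format(seed),
    }
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return select_url + '?' + urlencode(params, quote_via=quote)

# Hypothetical endpoint; replace it with the Solr core you want to query.
url = build_solr_url('http://localhost:8983/solr/egu/select', [2011, 2018])
print(url)
```

Changing the seed yields a different random sample, and changing `years` selects other editions of the conference.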

In [None]:
# We need a tokenizer that splits on word characters
tokenizer = RegexpTokenizer(r'\w+')
# We need a lemmatizer (the variable keeps the name `stemmer`)
stemmer = WordNetLemmatizer()
# Our custom stop words (used for Gensim only)
my_stop_words = {
    'area', 'time', 'measurement', 'data', 'event', 'doc', 'group', 'research',
    'study', 'use', 'analysis', 'result', 'case', 'model', 'information',
    'metadata', 'content', 'sharing', 'previous', 'http',
    '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
    '2011', '2012', '2013', '2014', '2015', '2016', '2017',
}

stop_words = my_stop_words.union(ENGLISH_STOP_WORDS)

# Clean up a document: lemmatize the words to their base form and remove the stop words.
def clean_document(doc):
    tokens = tokenizer.tokenize(doc.lower())
    # We lemmatize each token
    stemmed_tokens = [stemmer.lemmatize(i) for i in tokens]
    # A token is kept only if it is not a stop word and its length is >2 and <25
    document = [i for i in stemmed_tokens if i not in stop_words and 2 < len(i) < 25]
    return document
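
To make the effect of the cleaning step easier to see, here is a standard-library approximation of `clean_document`. It is only a sketch: `re.findall` stands in for NLTK's `RegexpTokenizer`, lemmatization is deliberately left out so that no WordNet data is needed, and `DEMO_STOP_WORDS` is a small hypothetical stop-word set rather than the combined list used in the cell above.

```python
import re

# Small illustrative stop-word set (hypothetical, much shorter than the real one).
DEMO_STOP_WORDS = {'the', 'and', 'for', 'with', 'data', 'model', 'study'}

def clean_document_sketch(text, stop_words=DEMO_STOP_WORDS, min_len=3, max_len=24):
    """Lowercase, split on word characters, drop stop words and odd-length tokens.

    Lemmatization is skipped here so the sketch runs without NLTK.
    """
    tokens = re.findall(r'\w+', text.lower())
    return [t for t in tokens
            if t not in stop_words and min_len <= len(t) <= max_len]

cleaned = clean_document_sketch('The model of sea ice data for 2018 in the Arctic')
print(cleaned)
```

Short function words such as "of" and "in" are removed by the length filter alone, while numeric tokens such as years survive unless they are listed as stop words.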

In [None]:
# These lists will contain our corpus after cleaning it, one per library.
scattertext_documents = []
gensim_documents = []

for doc in solr_documents:
    document = clean_document(doc['abstract'][0])
    title = clean_document(doc['title'][0])
    if len(title) <= 10:
        print(title)
    if 'category' in doc:
        category = doc['category'][0]
    else:
        category = 'NAN'
    scattertext_documents.append({ 'text': ' '.join(document), 
                                   'year': str(doc['year'][0]),
                                   'title': doc['title'][0],
                                   'session':category})
    gensim_documents.append(document)

### Building the LDA model

We build the LDA model with Gensim, a library for topic modeling. The output is a list of the topics present in our corpus.

In [None]:
# GENSIM 
num_passes = 5
num_topics = 20
words_per_topic = 6

dictionary = corpora.Dictionary(gensim_documents)
corpus = [dictionary.doc2bow(text) for text in gensim_documents]
lda_model = models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word = dictionary, passes=num_passes)
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
for topic in topics:
    print(topic)
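
The `doc2bow` call turns each token list into a bag of words, that is, a list of `(token_id, count)` pairs. The standard-library sketch below mimics that idea on toy data. The helper names and the toy documents are hypothetical, and ids are assigned here by order of first appearance, which may differ from the ids Gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_bag_of_words(tokens, vocabulary):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())

# Toy token lists standing in for cleaned abstracts.
toy_documents = [['ice', 'sheet', 'melt', 'ice'], ['soil', 'carbon', 'melt']]
vocab = build_vocabulary(toy_documents)
bags = [to_bag_of_words(doc, vocab) for doc in toy_documents]
print(bags)
```

Word order is discarded in this representation. LDA only sees how often each term occurs in each document.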

### ScatterText
> Now we're going to use the Scattertext library to visualize how token distributions differ between two categories.

In [None]:
# Loading our documents into a pandas DataFrame for Scattertext
df = pd.DataFrame(scattertext_documents)
# Number of documents per year and per session
axis_year = pd.DataFrame(df.groupby('year')['year'].count())
axis_session = pd.DataFrame(df.groupby('session')['session'].count())
print(axis_session, axis_year)
print(df.iloc[0])

In [None]:
# Years to compare
years = ['2018', '2011']
# Scattertext category column (year or session)
scatter_category = 'year'

# We load spaCy's small English pipeline, see https://spacy.io/models/
# (install it first with `python -m spacy download en_core_web_sm`)
nlp = spacy.load('en_core_web_sm')
# We plot the results to an HTML D3 plot
main_cat = '2018'
compared_cat = '2011'

# We create a corpus using Scattertext's built-in method.
# It gets its own name so that the Gensim `corpus` built earlier is not overwritten.
st_corpus = st.CorpusFromPandas(df,
                                category_col=scatter_category,
                                text_col='text',
                                nlp=nlp).build()

html = st.produce_scattertext_explorer(st_corpus,
                                       category=main_cat,
                                       category_name=main_cat,
                                       not_category_name=compared_cat,
                                       width_in_pixels=700)

with open('scattertext.html', 'wb') as html_file:
    html_file.write(html.encode('utf-8'))

from IPython.display import IFrame, display
display(IFrame(src='scattertext.html', width=900, height=800))
print("plot ready!")
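
The plot places each term according to how often it occurs in the two categories. As a rough intuition for that comparison, the sketch below computes per-category relative term frequencies with the standard library only. It is a simplified stand-in, not Scattertext's own scoring. The helper names and the toy rows are hypothetical.

```python
from collections import Counter

def term_rates(rows, category_col, text_col, category):
    """Relative frequency of each term inside one category."""
    counts = Counter()
    for row in rows:
        if row[category_col] == category:
            counts.update(row[text_col].split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}

def compare_categories(rows, category_col, text_col, main_cat, compared_cat):
    """Return (term, rate_in_main, rate_in_compared) tuples.

    Terms leaning most toward `main_cat` come first; ties are alphabetical.
    """
    main = term_rates(rows, category_col, text_col, main_cat)
    other = term_rates(rows, category_col, text_col, compared_cat)
    table = [(t, main.get(t, 0.0), other.get(t, 0.0)) for t in set(main) | set(other)]
    return sorted(table, key=lambda r: (r[2] - r[1], r[0]))

# Toy rows shaped like the entries of `scattertext_documents`.
rows = [
    {'year': '2018', 'text': 'machine learning ice'},
    {'year': '2018', 'text': 'machine learning soil'},
    {'year': '2011', 'text': 'ice core soil'},
]
table = compare_categories(rows, 'year', 'text', '2018', '2011')
for term, rate_main, rate_other in table:
    print('{0:10s} {1:.2f} {2:.2f}'.format(term, rate_main, rate_other))
```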

* Now we can select a topic and print all the documents assigned to it.

In [None]:
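topic_of_interest = 11  # the topic index

# Assumption: `topic_list` holds, per topic, its top terms and the (title, probability)
# pairs of its documents; it is derived from the trained model here.
topic_list = [{'terms': [term for term, _ in lda_model.show_topic(t, topn=words_per_topic)], 'documents': []}
              for t in range(num_topics)]
for doc, tokens in zip(scattertext_documents, gensim_documents):
    for topic_id, probability in lda_model.get_document_topics(dictionary.doc2bow(tokens)):
        topic_list[topic_id]['documents'].append((doc['title'], probability))

print("Documents in Topic {0} ({1})".format(topic_of_interest, topic_list[topic_of_interest]['terms']))
for doc in sorted(topic_list[topic_of_interest]['documents'], key=lambda d: d[1]):
    print(" Document: {0} \n - Probability:{1}".format(doc[0], doc[1]))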

### Now that we have a trained model we can classify a new unseen document.

In [None]:
# For practical purposes we use a mocked-up document, but we could just as easily query Solr or another store to get the content we want to classify.
# Eventually all of this should be served as a web service.

# Taken from https://rd-alliance.org/groups/farm-data-sharing-ofds-wg
unseen_document = """
Farmers have the capability as they have never had before to critically evaluate management practices 
using field-scale replicated strip trials. Farmers have gained this powerful capability because yield 
monitors on combines enable accurate measurement of yields. Networks of farmers have become
increasingly common to exploit the potential of yield monitors to evaluate management practices 
at the field level. Networks of farmers have also become increasingly common because farmers understand 
the power of evaluating management practices across many fields. Collection of results from strip trials 
across many fields requires protocols for data stewardship, that is, for data reporting, sharing and archiving. 
All farmer networks have developed data stewardship protocols. """

vec = dictionary.doc2bow(clean_document(unseen_document))
predicted_topics = lda_model[vec]
print(predicted_topics)
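
The model returns a list of `(topic_id, probability)` pairs for the unseen document. A small sketch of how such a list could be ranked to pick the dominant topic follows. The helper `rank_topics`, the threshold, and the example numbers are made up for illustration and are not output from the trained model.

```python
def rank_topics(predicted_topics, min_probability=0.0):
    """Sort (topic_id, probability) pairs from most to least probable."""
    kept = [(topic_id, float(p)) for topic_id, p in predicted_topics
            if p >= min_probability]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical values shaped like the output of `lda_model[vec]`.
example_prediction = [(3, 0.12), (11, 0.71), (17, 0.17)]
ranked = rank_topics(example_prediction, min_probability=0.15)
best_topic, best_probability = ranked[0]
print(best_topic, best_probability)
```

The index of the best topic can then be matched against the topic terms printed when the model was built.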

### Visualizing our model with pyLDAvis

In [None]:
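import warnings

import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # pyLDAvis 3.3 and later
except ImportError:
    import pyLDAvis.gensim as gensimvis  # earlier releases

# Silence deprecation warnings by category
warnings.filterwarnings('ignore', category=DeprecationWarning)
pyLDAvis.enable_notebook()


In [None]:
gensimvis.prepare(corpus=corpus, topic_model=lda_model, dictionary=dictionary, sort_topics=False)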

### References and Links



> L. A. Lopez, R. Duerr and S. J. S. Khalsa, "Optimizing apache nutch for domain specific crawling at large scale," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1967-1971. doi: 10.1109/BigData.2015.7363976