## EGU Topic Modeling

This notebook performs topic modeling on EGU abstracts from 2009 to 2018. The main premise is that all the PDF files share the same layout: first the EGU copyright notice, then the abstract title, then the authors, and lastly the abstract content.

We parsed the PDFs using PDFMiner's `pdf2txt.py` utility:

```sh
ls *.pdf | xargs -n1 -P8 bash -c 'pdf2txt.py -o output/$0.txt -t text $0'
```
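
As a rough illustration of the layout premise above, the sketch below splits one extracted text file into title, authors and abstract. It assumes the `pdf2txt.py` output separates the four parts with blank lines, which is not guaranteed for every file. The helper name `split_abstract` and the sample text are made up for this example.

```python
import re

def split_abstract(text):
    """Split extracted text into title, authors and abstract.

    Assumes blank-line-separated blocks in this order:
    copyright notice, title, authors, one or more abstract paragraphs.
    """
    # Collapse whitespace inside each block so wrapped lines are joined.
    blocks = [re.sub(r'\s+', ' ', b).strip() for b in re.split(r'\n\s*\n', text)]
    blocks = [b for b in blocks if b]
    if len(blocks) < 4:
        raise ValueError('expected at least 4 blocks, got {}'.format(len(blocks)))
    return {
        'title': blocks[1],
        'authors': blocks[2],
        'abstract': ' '.join(blocks[3:]),
    }

# Made-up sample, not a real abstract.
sample = """Placeholder for the EGU copyright notice

A made-up title
spanning two lines

A. Author (1) and B. Author (2)

First paragraph of the abstract.

Second paragraph.
"""

parsed = split_abstract(sample)
print(parsed['title'])
print(parsed['abstract'])
```

Files that deviate from this layout raise a `ValueError` here, which makes them easy to set aside for manual inspection.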

The current notebook uses Solr to index the parsed content.



**Database Schema:**

```python
doc = {
       'year': year,
       'file': fname,
       'title' : title,
       'entities': entities,
       'abstract': abstract,
       'category': category,
       'sessions': full_session_code,
       'presentation': presentation,
       'timestamp': datetime.now().isoformat(),
      }
```

* category: the main session id (CL, AS, etc.). Keep in mind that these codes have changed over the years.
* presentation: oral, poster, PICO, etc.
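
To make the `category` field concrete, here is a minimal sketch that derives a category from a session code by taking its leading capital letters. The session-code format (for example `CR1.2`) and the helper name `category_from_session` are assumptions for illustration. The notebook does not show how the field was actually populated.

```python
import re

def category_from_session(session_code):
    """Return the leading upper-case letters of a session code, or None."""
    match = re.match(r'[A-Z]+', session_code.strip())
    return match.group(0) if match else None

# Hypothetical session codes, used only to show the behavior.
for code in ['CR1.2', 'NH3.4/CR5.1', 'GMPV2.1']:
    print(code, '->', category_from_session(code))
```

For a co-organized session such as the second code, this sketch keeps only the first category; whether that matches the indexed data is not known from the notebook.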


-----

### Disciplinary Sessions 

* Atmospheric Sciences (AS)
* Biogeosciences (BG)
* Climate: Past, Present, Future (CL)
* Cryospheric Sciences (CR)
* Earth Magnetism & Rock Physics (EMRP)
* Energy, Resources and the Environment (ERE)
* Earth & Space Science Informatics (ESSI)
* Geodesy (G)
* Geodynamics (GD)
* Geosciences Instrumentation & Data Systems (GI)
* Geomorphology (GM)
* Geochemistry, Mineralogy, Petrology & Volcanology (GMPV)
* Hydrological Sciences (HS)
* Natural Hazards (NH)
* Nonlinear Processes in Geosciences (NP)
* Ocean Sciences (OS)
* Planetary & Solar System Sciences (PS)
* Seismology (SM)
* Stratigraphy, Sedimentology & Palaeontology (SSP)
* Soil System Sciences (SSS)
* Solar-Terrestrial Sciences (ST)
* Tectonics & Structural Geology (TS)

### Union Sessions

* Union Symposia (US)
* Great Debates (GDB)
* Medal Lectures (ML)
* Short courses (SC)
* Educational and Outreach Symposia (EOS)
* EGU Plenary, Ceremonies and Networking (PCN)
* Feedback and administrative meetings (FAM)
* Townhall and splinter meetings (TSM)
* Side events (SEV)
* Press conferences (PC)

In [None]:
# Cell 1: Import requirements

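import json
import urllib.request

import nltk
from gensim import corpora, models
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
# ENGLISH_STOP_WORDS lives in sklearn.feature_extraction.text; the old
# sklearn.feature_extraction.stop_words module was removed from scikit-learn.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

nltk.download('wordnet')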


In [None]:
# Cell 2: Querying Solr

# The following parameters have to stay in sync with the Scattertext parameters: to compare 2 years,
# use the same years in the Scattertext cell. Here we compare 2 sessions across all years.

terms = ['ice', 'climate']
years = ['*']
entities = ['*']
sessions = ['CR','NH']

# We page through Solr, up to 1000 documents per request, until every document that meets our criteria is retrieved
page_size = 1000
cursorMark = '*'

solr_documents = []
solr_root = 'http://integration.pdi-solr.labs.nsidc.org/solr/egu/select?indent=on&'
more_results = True

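# Wrap each term in wildcards before joining ('*' + terms + '*' would raise a TypeError on a list)
terms_query = '%20OR%20abstract:'.join('*{}*'.format(term) for term in terms)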
years_query = '%20OR%20year:'.join(years)  
entities_query = '%20OR%20entities:'.join(entities)
sessions_query = '%20OR%20category:'.join(sessions)

while (more_results):    
    solr_query = 'q=(abstract:{})%20AND%20(year:{})%20AND%20(entities:{})%20AND%20(category:{})&wt=json&rows={}&cursorMark={}&sort=id+asc'.format(terms_query,
                                                                                        years_query,
                                                                                        entities_query,
                                                                                        sessions_query,
                                                                                        page_size,
                                                                                        cursorMark)
    solr_url = solr_root + solr_query
    print('Querying: \n' + solr_url)
    req = urllib.request.Request(solr_url)
    # parsing response
    r = urllib.request.urlopen(req).read()
    json_response = json.loads(r.decode('utf-8'))
    solr_documents.extend(json_response['response']['docs'])
    nextCursorMark = json_response['nextCursorMark']
    if (nextCursorMark == cursorMark):
        more_results = False
        break
    else: 
        cursorMark = nextCursorMark

print('Processing {} documents'.format(len(solr_documents)))
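
The query string above is assembled by hand with `%20` escapes. As an alternative sketch, the same query can be built as plain text and escaped by `urllib.parse.urlencode`, with the cursor loop factored into a function. The helper names (`or_clause`, `build_params`, `fetch_all`) and the fake pages are assumptions for illustration. `fetch_page` stands in for the HTTP call so that the example runs without a Solr server.

```python
from urllib.parse import parse_qs, urlencode

def or_clause(field, values):
    """Build '(field:a OR field:b)' for one field."""
    return '(' + ' OR '.join('{}:{}'.format(field, v) for v in values) + ')'

def build_params(terms, years, entities, sessions, rows, cursor_mark):
    """Return the URL-encoded Solr parameters for one page of results."""
    q = ' AND '.join([
        or_clause('abstract', ['*{}*'.format(t) for t in terms]),
        or_clause('year', years),
        or_clause('entities', entities),
        or_clause('category', sessions),
    ])
    return urlencode({'q': q, 'wt': 'json', 'rows': rows,
                      'cursorMark': cursor_mark, 'sort': 'id asc'})

def fetch_all(fetch_page, **criteria):
    """Follow Solr cursor marks until the cursor stops advancing."""
    docs, cursor = [], '*'
    while True:
        response = fetch_page(build_params(cursor_mark=cursor, **criteria))
        docs.extend(response['response']['docs'])
        if response['nextCursorMark'] == cursor:
            return docs
        cursor = response['nextCursorMark']

# Stand-in for the HTTP request: one fake page, then the cursor repeats.
fake_pages = {
    '*': {'response': {'docs': [{'id': 1}]}, 'nextCursorMark': 'A'},
    'A': {'response': {'docs': []}, 'nextCursorMark': 'A'},
}

def fake_fetch(query_string):
    return fake_pages[parse_qs(query_string)['cursorMark'][0]]

docs = fetch_all(fake_fetch, terms=['ice', 'climate'], years=['*'],
                 entities=['*'], sessions=['CR', 'NH'], rows=1000)
print(len(docs))
```

In the notebook, `fetch_page` would wrap `urllib.request.urlopen(solr_root + query_string)` and `json.loads`, as the loop above does.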

In [None]:
# Cell 3: Cleaning our documents 

## we need a tokenizer
tokenizer = RegexpTokenizer(r'\w+')
## we need a lemmatizer (the variable is named `stemmer`, but WordNetLemmatizer lemmatizes rather than stems)
stemmer = WordNetLemmatizer()
## our custom stop words (used for Gensim only)
my_stop_words = {
                    'area', 'data', 'event', 'doc', 'group', 'research', 
                    'metadata', 'content', 'sharing', 'previous', 'http', 
                    '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
                    '2011', '2012', '2013', '2014','2015', '2016', '2017',
                }

stop_words = my_stop_words.union(ENGLISH_STOP_WORDS)

# Cleans up a document: lemmatizes each word to its dictionary form and removes the stop words.
def clean_document(doc):
    tokens = tokenizer.tokenize(doc.lower())
    # We lemmatize each token
    stemmed_tokens = [stemmer.lemmatize(i) for i in tokens]
    # Keep a token only if it is not a stop word and its length is >2 and <25
    document = [i for i in stemmed_tokens if i not in stop_words and 2 < len(i) < 25]
    return document

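# These lists will contain our corpus after cleaning it.
scattertext_documents = []
gensim_documents = []

def first_value(field):
    # Solr may return a field as a list (as it does for 'abstract'); take its first value.
    return field[0] if isinstance(field, list) else field

for doc in solr_documents:
    document = clean_document(doc['abstract'][0])
    gensim_documents.append(document)
    # Assumption: the Scattertext cells need 'year', 'session' and the cleaned text per document.
    scattertext_documents.append({'year': first_value(doc['year']),
                                  'session': first_value(doc['category']),
                                  'text': ' '.join(document)})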
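
To show the filtering rule on its own, here is a simplified, standard-library-only sketch of the cleaning step. It skips lemmatization and uses a tiny made-up stop-word set, so it illustrates the token filter rather than replacing `clean_document`. The name `filter_tokens` is hypothetical.

```python
import re

# Tiny stand-in for the notebook's much larger stop_words set.
SAMPLE_STOP_WORDS = {'the', 'and', 'are', 'data', '2016'}

def filter_tokens(text, stop_words=SAMPLE_STOP_WORDS, min_len=3, max_len=24):
    """Lower-case, tokenize on word characters, drop stop words and odd lengths."""
    tokens = re.findall(r'\w+', text.lower())
    return [t for t in tokens
            if t not in stop_words and min_len <= len(t) <= max_len]

print(filter_tokens('The 2016 sea ice data are in km2 and m.'))
# prints ['sea', 'ice', 'km2']
```

The length bounds mirror the `> 2` and `< 25` checks in the cell above. Short tokens such as `in` and `m` are dropped even though they are not in the stop-word set.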

### Building the LDA model using Gensim, a library for topic modeling. The output is a list of topics present in our corpus.

In [None]:
# Cell 4: Using GENSIM to do topic modelling, this cell takes some time... hang on.

num_passes = 10  # num_passes should be adjusted; 10 is just a guesstimate of when convergence will be achieved.
num_topics = 20
words_per_topic = 7

dictionary = corpora.Dictionary(gensim_documents)
lda_corpus = [dictionary.doc2bow(text) for text in gensim_documents]
lda_model = models.ldamodel.LdaModel(lda_corpus, num_topics=num_topics, id2word = dictionary, passes=num_passes)
topics = lda_model.print_topics(num_topics=num_topics, num_words=words_per_topic)
for topic in topics:
    print(topic)
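
The `doc2bow` call above turns each token list into a bag of words: a list of `(token_id, count)` pairs. The sketch below reproduces that idea with the standard library only, to show what the LDA model receives as input. It is not Gensim's implementation, and the ids Gensim assigns may differ. The helper names and the two sample documents are made up.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(document, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens not in vocab."""
    counts = Counter(vocab[t] for t in document if t in vocab)
    return sorted(counts.items())

sample_docs = [['sea', 'ice', 'ice'], ['flood', 'risk', 'ice']]
vocab = build_vocabulary(sample_docs)
print(to_bow(sample_docs[0], vocab))       # prints [(0, 1), (1, 2)]
print(to_bow(['ice', 'glacier'], vocab))   # prints [(1, 1)]
```

The second call shows why an unseen document can only be described by words already in the dictionary: `glacier` never appeared in the corpus, so it contributes nothing.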

### Scattertext
* Now we're going to use the Scattertext library to visualize how token distributions differ between two categories.

In [None]:
# Cell 5: loading our documents into a pandas DataFrame for Scattertext and listing the document distributions

import pandas as pd

df = pd.DataFrame(scattertext_documents)
axis_year = pd.DataFrame(df.groupby('year')['year'].count())
axis_session = pd.DataFrame(df.groupby('session')['session'].count())
print(axis_session, axis_year)

### Now that we have a trained model we can classify a new unseen document.

In [None]:
# Cell 7: Classifying an unseen document using our GENSIM model

# For practical purposes we use a mocked-up document, but we could easily query Solr or another store to get the content we want to classify.
# Eventually all this should be served as a web service.
# Taken from https://meetingorganizer.copernicus.org/EGU2018/EGU2014-2415.pdf

unseen_document = """
Waves  in  the  Southern  Ocean  are  the  largest  in  the  planet.  In  the  Southern  Hemisphere,  the  absence  of  large
landmasses at high latitudes allows the wind to feed energy into the ocean over a virtually unlimited fetch. The
enormous amount air-sea momentum exchanged over the Southern Ocean plays a substantial role on the global
climate. However, large biases affect the estimation of wave regime around the Antarctic continent making climate
prediction susceptible to uncertainty.
 """

vec = dictionary.doc2bow(clean_document(unseen_document))
predicted_topics = lda_model[vec]
print(predicted_topics)
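
Indexing the model with a bag-of-words vector, as above, returns a list of `(topic_id, probability)` pairs. A small helper can rank them to pick the most likely topics. The function name `top_topics` and the sample values below are made up for illustration. They are not output from this model.

```python
def top_topics(topic_probs, n=3):
    """Return the n (topic_id, probability) pairs with the highest probability."""
    return sorted(topic_probs, key=lambda pair: pair[1], reverse=True)[:n]

# Made-up values with the same shape as predicted_topics.
example = [(2, 0.12), (7, 0.61), (15, 0.27)]
print(top_topics(example, n=2))  # prints [(7, 0.61), (15, 0.27)]
```

In the notebook, `top_topics(predicted_topics, n=1)[0][0]` would give the id of the dominant topic, which can then be looked up with `lda_model.print_topic`.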

### Visualizing our model with pyLDAvis

In [None]:
# Cell 8: Using pyLDAvis to visualize our topic distributions in the principal component axis.


import warnings
warnings.filterwarnings('ignore')
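import pyLDAvis
# Newer pyLDAvis releases renamed the gensim helper module to gensim_models.
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis

pyLDAvis.enable_notebook()
gensimvis.prepare(corpus=lda_corpus, topic_model=lda_model, dictionary=dictionary, sort_topics=False)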

### References and Links



> L. A. Lopez, R. Duerr and S. J. S. Khalsa, "Optimizing apache nutch for domain specific crawling at large scale," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1967-1971.
doi: 10.1109/BigData.2015.7363976

> Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. Link to preprint: arxiv.org/abs/1703.00565