## EGU Insights Data Visualizations With ScatterText

This notebook uses ScatterText, a tool for finding distinguishing terms in small-to-medium-sized corpora, to plot binary comparisons of EGU abstracts (for example, one year against another).




**Database Schema:**

Example:

```python
doc = {
"entities":[
    "Jeffrey Obelcz  and Warren T. Wood",
    "NRC Postdoctoral Fellow",
    "Naval Research Lab",
    "Seaﬂoor Sciences",
    "United States jbobelcz@gmail.com",
    "Naval",
    "Research Lab",
    "Seaﬂoor Sciences",
    "United States"],
"id": "EGU2018-9778",
"sessions": ["ESSI4.3"],
"file": ["EGU2018-9778"],
"presentation": ["Posters"],
"year": [2018],
"title": ["Towards a Quantitative Understanding of Parameters Driving Submarine Slope Failure: A Machine Learning Approach"],
"category": ["ESSI"],
"abstract":["Submarine slope failure is a ubiquitous process and dominant pathway for sediment and organic carbon ﬂux from continental margins to the deep sea. Slope failure occurs over a wide range of temporal and spatial scales ..."]
}
```

* **category**: the code of the main disciplinary session (CL, AS, etc.). Keep in mind that these codes have changed over the years.
* **presentation**: the presentation format (oral, poster, PICO, etc.).
* **sessions**: the individual session(s) an abstract belongs to; sessions can be co-organized.

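Most fields in the example document are single-element lists, so code that consumes a document usually unwraps the first value (as Cell 3 does with `doc['abstract'][0]`). The following is a minimal sketch of that pattern, assuming the field layout shown above; the helper name `first_value` is hypothetical and not part of the notebook.

```python
def first_value(doc, field, default=None):
    """Return the first element of a list-valued field, or the value itself."""
    value = doc.get(field, default)
    if isinstance(value, list):
        return value[0] if value else default
    return value


# Trimmed copy of the example document above.
doc = {
    "id": "EGU2018-9778",
    "sessions": ["ESSI4.3"],
    "presentation": ["Posters"],
    "year": [2018],
    "category": ["ESSI"],
}

summary = {field: first_value(doc, field)
           for field in ("id", "category", "sessions", "year")}
print(summary)
```

Passing a `default` (for example `'NAN'`) covers documents where a field such as `sessions` is absent.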

-----

### Disciplinary Sessions 

* Atmospheric Sciences (AS)
* Biogeosciences (BG)
* Climate: Past, Present, Future (CL)
* Cryospheric Sciences (CR)
* Earth Magnetism & Rock Physics (EMRP)
* Energy, Resources and the Environment (ERE)
* Earth & Space Science Informatics (ESSI)
* Geodesy (G)
* Geodynamics (GD)
* Geosciences Instrumentation & Data Systems (GI)
* Geomorphology (GM)
* Geochemistry, Mineralogy, Petrology & Volcanology (GMPV)
* Hydrological Sciences (HS)
* Natural Hazards (NH)
* Nonlinear Processes in Geosciences (NP)
* Ocean Sciences (OS)
* Planetary & Solar System Sciences (PS)
* Seismology (SM)
* Stratigraphy, Sedimentology & Palaeontology (SSP)
* Soil System Sciences (SSS)
* Solar-Terrestrial Sciences (ST)
* Tectonics & Structural Geology (TS)

### Union Sessions

* Union Symposia (US)
* Great Debates (GDB)
* Medal Lectures (ML)
* Short courses (SC)
* Educational and Outreach Symposia (EOS)
* EGU Plenary, Ceremonies and Networking (PCN)
* Feedback and administrative meetings (FAM)
* Townhall and splinter meetings (TSM)
* Side events (SEV)
* Press conferences (PC)

#### Interdisciplinary Events (IE)

In [None]:
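# Cell 1: Import requirements

import json
import random
import string
# `import urllib` alone does not guarantee that the `request` submodule is loaded
import urllib.request
from datetime import datetime

import nltk
import pandas as pd
import scattertext as st
import spacy
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the sklearn.feature_extraction.stop_words module is not available in newer releases.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

nltk.download('wordnet')

# Random 4-letter seed, used in Cell 2 for Solr's random sort field
seed = ''.join(random.choice(string.ascii_letters) for _ in range(4))

# We load an English model from https://spacy.io/models/
# (older spaCy releases accepted the shortcut 'en'; newer ones need the full model name)
nlp = spacy.load('en_core_web_sm')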

In [None]:
# Cell 2: Querying Solr

# terms = ['ice', 'climate'] to include only abstracts with specified terms
terms = ['*']
entities = ['*']

# The following parameters drive the ScatterText comparison. The list that is compared
# (years or sessions) must hold exactly 2 values; with a single value the lookup below fails.
years = ['2012', '2018']
sessions = ['NH']

# ScatterText category to compare on (year or session)
scatter_category = 'year'

# The two values that are plotted against each other
if scatter_category == 'year':
    comparing = [years[0], years[1]]
else:
    comparing = [sessions[0], sessions[1]]

# We sample Solr for up to "page_size" documents that comply with our criteria
page_size = 400
cursorMark = '*'

solr_documents = []
solr_root = 'http://integration.pdi-solr.labs.nsidc.org/solr/egu/select?indent=on&'
more_results = True


# Wrap each value in '*' so Solr matches it anywhere in the field; ['*'] means "match all"
if terms[0] != '*':
    terms_wildcard = ['*' + t + '*' for t in terms]
else:
    terms_wildcard = ['*']

if sessions[0] != '*':
    sessions_wildcard = ['*' + s + '*' for s in sessions]
else:
    sessions_wildcard = ['*']

if entities[0] != '*':
    entities_wildcard = ['*' + e + '*' for e in entities]
else:
    entities_wildcard = ['*']

# Join the values of each field with a URL-encoded " OR <field>:" separator
terms_query = '%20OR%20abstract:'.join(terms_wildcard)
years_query = '%20OR%20year:'.join(years)
entities_query = '%20OR%20entities:'.join(entities_wildcard)
sessions_query = '%20OR%20sessions:'.join(sessions_wildcard)
query_string = 'q=(abstract:{}%20AND%20abstract:/.{{2}}.*/%20AND%20NOT%20title:/.{{300}}.*/)%20AND%20(year:{})' + \
                '%20AND%20(entities:{})%20AND%20(sessions:{})&wt=json&rows={}&sort=random_*{}%20desc'
 
solr_query = query_string.format(terms_query,
                                 years_query,
                                 entities_query,
                                 sessions_query,
                                 page_size,
                                 seed)
solr_url = solr_root + solr_query
print('Querying: \n' + solr_url)
req = urllib.request.Request(solr_url)
# parsing response
r = urllib.request.urlopen(req).read()
json_response = json.loads(r.decode('utf-8'))
solr_documents.extend(json_response['response']['docs'])
    

total_found = json_response['response']['numFound']
print("Processing {0} documents out of {1} total. \n".format(len(solr_documents), total_found))
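
The query construction in Cell 2 joins values by hand with pre-encoded `%20OR%20` separators. As a rough sketch of the same idea, the clause can be built in readable form first and URL-encoded afterwards with the standard library. The helper names `wildcard` and `or_clause` are hypothetical, and only the `abstract` and `year` parts of the query are shown.

```python
from urllib.parse import quote


def wildcard(values):
    """Wrap each value in '*' unless the list is the match-all ['*']."""
    if values and values[0] == '*':
        return ['*']
    return ['*' + v + '*' for v in values]


def or_clause(field, values):
    """Build 'field:a OR field:b' from a list of values."""
    return ' OR '.join('{}:{}'.format(field, v) for v in values)


clause = '({}) AND ({})'.format(
    or_clause('abstract', wildcard(['ice', 'climate'])),
    or_clause('year', ['2012', '2018']),
)
# Keep the characters Solr needs literally; spaces become %20.
encoded = quote(clause, safe='():*')
print(clause)
print(encoded)
```

Building the readable clause first makes it easier to inspect the query before it is sent.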

In [None]:
# Cell 3: Cleaning our documents

# We need a tokenizer
tokenizer = RegexpTokenizer(r'\w+')
# We need a lemmatizer
lemmatizer = WordNetLemmatizer()
# Our custom stop words
my_stop_words = {
    'area', 'data', 'event', 'doc', 'group', 'research',
    'metadata', 'content', 'sharing', 'previous', 'http'
}

# Year tokens are treated as stop words too. A separate name is used here
# so that the `years` list from Cell 2 is not overwritten.
year_tokens = [str(year) for year in range(2000, 2020)]
words_and_years = my_stop_words.union(year_tokens)
stop_words = words_and_years.union(ENGLISH_STOP_WORDS)


# Cleans a document: lemmatizes each word to its base form and removes the stop words.
def clean_document(doc):
    tokens = tokenizer.tokenize(doc.lower())
    # We lemmatize each token
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    # Keep tokens that are not stop words and whose length is >2 and <25,
    # then drop any non-ASCII characters
    document = [token.encode('ascii', errors='ignore').decode()
                for token in lemmas
                if token not in stop_words and 2 < len(token) < 25]
    return document

# document list will contain our corpus after cleaning it.
scattertext_documents = []

garbage_str = '</a>'

for doc in solr_documents:
    document = clean_document(doc['abstract'][0])
    if 'sessions' in doc:
        # Drop HTML residue starting at '</a>'. A separate name is used here
        # so that the `sessions` list from Cell 2 is not overwritten.
        session = doc['sessions'][0].split(garbage_str)[0]
    else:
        session = 'NAN'
    scattertext_documents.append({'id': doc['id'],
                                  'text': ' '.join(document),
                                  'year': str(doc['year'][0]),
                                  'title': doc['title'][0],
                                  'category': doc['category'][0].replace('/', '').replace('<', ''),
                                  'sessions': session})
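
To see what the token filter in Cell 3 does without installing NLTK or scikit-learn, here is a simplified standard-library sketch. It deliberately omits lemmatization and uses a much smaller stop-word set; `clean_text` and `DEMO_STOP_WORDS` are hypothetical names used only for illustration.

```python
import re

TOKEN_RE = re.compile(r'\w+')
# Hypothetical, much smaller stop-word set than the one built in Cell 3.
DEMO_STOP_WORDS = {'the', 'from', 'data'} | {str(y) for y in range(2000, 2020)}


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, tokenize on word characters, drop stop words and tokens
    whose length is <= 2 or >= 25, then strip non-ASCII characters."""
    tokens = TOKEN_RE.findall(text.lower())
    kept = [t for t in tokens if t not in stop_words and 2 < len(t) < 25]
    return [t.encode('ascii', errors='ignore').decode() for t in kept]


cleaned = clean_text('The ice sheet data from 2018 is melting rapidly')
print(cleaned)
```

Here `is` is dropped by the length check, `2018` by the year stop words, and `the`, `data` and `from` by the demo stop-word set.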

## ScatterText

* We use the ScatterText library to visualize how token distributions differ between two categories.
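
To make "distinguishing terms" concrete, the sketch below compares how often each term occurs in two small sets of documents. It is only an illustration of the idea of a binary comparison and is not ScatterText's scoring method; the function name `term_rate_difference` and the toy documents are made up for this example.

```python
from collections import Counter


def term_rate_difference(docs_a, docs_b):
    """Return {term: rate_in_a - rate_in_b}, where a rate is the term's
    share of all tokens in that category."""
    counts_a = Counter(tok for d in docs_a for tok in d.split())
    counts_b = Counter(tok for d in docs_b for tok in d.split())
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    terms = set(counts_a) | set(counts_b)
    return {t: counts_a[t] / total_a - counts_b[t] / total_b for t in terms}


# Toy, already-cleaned documents for two categories.
docs_2012 = ['flood risk model', 'flood hazard map']
docs_2018 = ['landslide risk model', 'machine learning landslide']

diff = term_rate_difference(docs_2012, docs_2018)
ranked = sorted(diff, key=diff.get, reverse=True)
print(ranked[0], ranked[-1])
```

Terms at the two ends of the ranking are the ones most characteristic of each category, while shared terms such as `risk` end up near zero.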

In [None]:
# Cell 4: Load our documents into a pandas DataFrame for ScatterText and list the document distribution

df = pd.DataFrame.from_dict(scattertext_documents)
axis_category = pd.DataFrame(df.groupby(['category', 'year'])['category'].count()).rename(columns={'category': 'count'})
print(axis_category.to_string())
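
A binary comparison is only meaningful when both categories actually contain documents. The following rough check uses only the standard library and assumes records shaped like the `scattertext_documents` entries built in Cell 3; `count_by` and the sample records are hypothetical.

```python
from collections import Counter


def count_by(records, *fields):
    """Count records per combination of the given fields."""
    return Counter(tuple(r[f] for f in fields) for r in records)


records = [
    {'category': 'NH', 'year': '2012'},
    {'category': 'NH', 'year': '2012'},
    {'category': 'NH', 'year': '2018'},
]

counts = count_by(records, 'category', 'year')
# Years from the comparison that have no documents at all
missing = [y for y in ('2012', '2018') if counts[('NH', y)] == 0]
print(dict(counts), missing)
```

If `missing` is not empty, the Solr sample did not return documents for one side of the comparison and the plot would be one-sided.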

In [None]:
# Cell 5: Use ScatterText to compare 2 categories. This cell takes a while; if your browser asks you to wait... wait!

# We create a corpus using ScatterText's built-in method.
scatter_corpus = st.CorpusFromPandas(df, 
                             category_col=scatter_category, 
                             text_col='text',
                             nlp=nlp).build()

html = st.produce_scattertext_explorer(scatter_corpus,
          category=comparing[0],
          category_name=comparing[0],
          not_category_name=comparing[1],
          metadata=scatter_corpus.get_df()['title'],
          minimum_term_frequency=5,
          width_in_pixels=700)

# Write the plot to disk; the `with` block makes sure the file is closed before it is displayed
with open('scattertext.html', 'wb') as html_file:
    html_file.write(html.encode('utf-8'))

from IPython.display import IFrame, display
print("Loading plot...")
display(IFrame(src='scattertext.html', width=900, height=800))
# The search box does not work, presumably because Jupyter gets in the way of ScatterText's JS libraries.

### References and Links



> L. A. Lopez, R. Duerr and S. J. S. Khalsa, "Optimizing apache nutch for domain specific crawling at large scale," 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, 2015, pp. 1967-1971.
doi: 10.1109/BigData.2015.7363976

> Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. Link to preprint: arxiv.org/abs/1703.00565