In [1]:
%matplotlib inline


# Tutorial 02: Words Analysis

Analyzing collected text data and metadata.


## Words Analyses

This tutorial covers exploring & analyzing words data.

For this tutorial, we will reload and use the :class:`~.Words` object with
the data that we collected in the last tutorial.

Note that this tutorial requires some optional dependencies, including the
`WordCloud <https://github.com/amueller/word_cloud>`_ module.




In [2]:
# Import the custom objects that are used to store collected words data
from lisc.data import Articles, ArticlesAll

# Import database and IO utilities to reload our previously collected data
from lisc.utils.db import SCDB
from lisc.utils.io import load_object

# Import plots that are available for words data
from lisc.plts.words import plot_wordcloud

import scispacy
import spacy
nlp = spacy.load("en_ner_bionlp13cg_md")

### Articles Object

LISC uses custom objects to store and organize collected words data.

These objects are used internally in the :class:`~.Words` objects.

If the data collection was set to save out data as it was collected, then
:obj:`~.Articles` objects can be loaded individually, using the label
of the search term.




In [14]:
# Set up database object
db = SCDB('lisc_db')

# Load raw data for a particular term
term = 'whipworm'

arts = Articles(term)
arts.load(db)

### ArticlesAll Object

The :obj:`~.ArticlesAll` object aggregates collected data across all articles collected
for a given search term.




In [15]:
# Collapse data across articles
arts_all = ArticlesAll(arts)

It also has methods to create and check summaries created from the aggregate data.




In [16]:
# Check an example summary
arts_all.create_summary()
arts_all.print_summary()

whipworm :
  Number of articles: 		 99
  First publication: 		 2017
  Most common author: 		 Keiser J
    number of publications: 	 6
  Most common journal: 		 PLoS neglected tropical diseases
    number of publications: 	 6 



### Words Object

The :class:`~.Words` object can also be used to reload and analyze collected data.

The `results` attribute contains a list of :class:`~.Articles` objects, one for each term.




In [17]:
# Reload the words object, specifying to also reload the article data
words = load_object('tutorial_words', directory=SCDB('lisc_db'), reload_results=True)

Note that the reloaded data is the raw data from the data collection.

The :meth:`~.Words.process_articles` method can be used to do some preprocessing on the
collected data.

By default, the :func:`~.process_articles` function is used to process articles, which
preprocesses journal and author names, and tokenizes the text data. You can also pass in
a custom function to apply custom processing to the collected articles data.

Note that some processing steps, like converting to the ArticlesAll representation,
will automatically apply article preprocessing.




In [18]:
# Preprocess article data
words.process_articles()

We can also aggregate data across articles, just as we did before, directly in the Words object.

If you run the :meth:`~.Words.process_combined_results` method, then the
`combined_results` attribute will contain the corresponding list of
:class:`~.ArticlesAll` objects, also one for each term.




In [19]:
# Process collected data into aggregated data objects
words.process_combined_results()

In [20]:
# doc = nlp("gut")

# for token in doc:
#     print(token.text, token.pos_, token.ent_type_, token.head.text)

symptoms_list = ['microbiome', 'microbiota', 'gut', 'microbial', 'metabolites', 'anemia',  
                 'intestinal', 'obstruction',  'growth', 'faltering',  
                 'abdominal', 'pain', 'vomiting', 'cholangitis', 
                 'pancreatitis', 'anorexia', 'weight',  'gall', 'bladder', 'cancer', 'gallbladder',  
                 'cough',  'developmental', 'delay', 'hyperactivity']


term_list = ["roundworm", "whipworm", "hookworms"]

In [21]:
for num in range(0,3):
    print("-------", term_list[num],"-------")
    for i in words.combined_results[num].words:
        doc = nlp(i)
        for token in doc:
            if(token.pos_ == "NOUN" or token.pos_ == "ADJ"):
                if(words.combined_results[0].words[i] > 1 and i in symptoms_list):
                    print(i, "-", words.combined_results[0].words[i], "[", token.ent_type_,"]")
                    
    print()
    print()
                    

------- roundworm -------
intestinal - 86 [  ]
abdominal - 14 [  ]
pain - 15 [  ]
obstruction - 8 [  ]
bladder - 5 [ ORGAN ]
gallbladder - 4 [ ORGAN ]
microbiota - 4 [  ]
cholangitis - 4 [  ]
weight - 2 [  ]
gut - 3 [ ORGANISM_SUBDIVISION ]
metabolites - 2 [  ]
developmental - 2 [  ]
growth - 2 [  ]
anorexia - 2 [  ]
cancer - 3 [ CANCER ]
pancreatitis - 14 [  ]


------- whipworm -------
intestinal - 86 [  ]
metabolites - 2 [  ]
weight - 2 [  ]
developmental - 2 [  ]
abdominal - 14 [  ]
gut - 3 [ ORGANISM_SUBDIVISION ]
growth - 2 [  ]
microbiota - 4 [  ]
cancer - 3 [ CANCER ]
pain - 15 [  ]


------- hookworms -------
metabolites - 2 [  ]
weight - 2 [  ]
growth - 2 [  ]
intestinal - 86 [  ]
cancer - 3 [ CANCER ]
microbiota - 4 [  ]
gut - 3 [ ORGANISM_SUBDIVISION ]
developmental - 2 [  ]
abdominal - 14 [  ]




In [11]:
# Plot a WordCloud of the collected data for the first term
# plot_wordcloud(words.combined_results[1].words, 25)

## Exploring Words Data

The :class:`~.Words` object also has some methods for exploring the data, including
allowing for indexing into and looping through collected results.




In [12]:
# # Index results for a specific label
# print(words['roundworm'])

You can also loop through all the articles found for a specified search term.

The iteration returns a dictionary with all the article data, which can be examined.




In [13]:
# Iterating through articles found for a search term of interest
for art in words['whipworm']:
    if(art['keywords'] == []):
        continue
        
    print(art['title'])
    print(art['words'])
    print()

Mucosal Vaccination With Recombinant Tm-WAP49 Protein Induces Protective Humoral and Cellular Immunity Against Experimental Trichuriasis in AKR Mice.
['trichuriasis', 'one', 'common', 'neglected', 'tropical', 'diseases', 'worlds', 'poorest', 'people', 'recombinant', 'vaccine', 'composed', 'tm-wap49', 'immunodominant', 'antigen', 'secreted', 'adult', 'trichuris', 'stichocytes', 'mucosa', 'cecum', 'parasite', 'attaches', 'development', 'prototype', 'evaluated', 'mouse', 'model', 'trichuris', 'muris', 'infection', 'ultimate', 'goal', 'producing', 'mucosal', 'vaccine', 'intranasal', 'delivery', 'intranasal', 'immunization', 'mice', 'tm-wap49', 'formulated', 'adjuvant', 'och', 'truncated', 'analog', 'alpha-galcer', 'adjuvanticity', 'stimulate', 'natural', 'killer', 'cells', 'nkt', 'mucosal', 'immunity', 'induced', 'significantly', 'high', 'levels', 'igg', 'subclasses', 'igg1', 'igg2a', 'immunized', 'mice', 'also', 'resulted', 'significant', 'reduction', 'worm', 'burden', 'challenge', 'muris

### Analyzing Words Data

Further analysis depends mostly on what one wants to do with the collected data.

For example, this might include building profiles for each search term, based on
data in collected articles. It might also include using methods from natural language
processing, such as vector embeddings and/or similarity measures.

Specific analyses might also be interested in exploring historical patterns in the literature,
examining, for example, the history of when certain topics were written about, and in what
journals, by which authors.


