# Imaging Edge Notebook 5: Validation Test Suite

ImagingEdge detects trends in the radiological research literature before they become mainstream publications, patents and products.

*Part 5 (this notebook) collects the methods that were used to inspect and validate results at different stages of the pipeline.*

Other parts:

Part 1: Scrape PubMed

Part 2: Convert PubMed abstracts to Bag of Words

Part 3: Build graph connecting search terms and trends

Part 4: Graph learns from unstructured sources.

### Created by Eric Barnhill for Insight Health Data Science
#### 2018 No License

Documentation follows the [Google Python Style Guide](http://google.github.io/styleguide/pyguide.html)

## 1. Scrape Statistics

#### Ballpark estimates on query numbers
Questions tested here:

A. Is the "Radiology" keyword sufficient to catch all the abstracts of interest?

B. How many abstracts do I retrieve each month, and how consistent are they month to month?

The function below estimates how many results a PubMed query returns per time window, for a given search term over a given date range:

In [14]:
# ESTIMATE HOW MANY RELATED QUERIES OCCUR OVER A DATE RANGE
import logging

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
from dateutil.relativedelta import relativedelta

# query_pubmed() and format_pubmed_query() are assumed to be available from Part 1.


def estimate_queries(start_date, end_date, mesh_term='radiology', plot=True, daily=False):
    """Estimates number of query results for a given MeSH term over a date range.

    Args:
        start_date, end_date: start and end dates of the estimate
        mesh_term: MeSH term to be queried
        plot: if True, plot the counts over time
        daily: if True, query day by day instead of week by week

    Returns:
        DataFrame with columns 'Counts' and 'Dates'. If plot is True, also
        displays a matplotlib plot across the specified time frame.
    """
    step = relativedelta(days=+1) if daily else relativedelta(weeks=+1)
    counts = []
    start_dates = []
    curr_start = start_date
    while curr_start < end_date:
        logging.info("Scraping window starting: %s", curr_start)
        curr_end = curr_start + step
        # query only the current window, not the whole date range
        query_res = query_pubmed(format_pubmed_query(curr_start, curr_end, mesh_term))
        counts.append(float(query_res['Count']))
        start_dates.append(curr_start)
        curr_start = curr_end
    counts = pd.DataFrame({'Counts': counts})
    logging.info("Median counts: %s", counts['Counts'].median())
    counts['Dates'] = pd.to_datetime(pd.Series(start_dates))
    if plot:
        fig, ax = plt.subplots()
        ax.plot(counts.Dates, counts.Counts)
        if not daily:
            # format the ticks
            ax.xaxis.set_major_locator(mdates.YearLocator())
            ax.xaxis.set_minor_locator(mdates.MonthLocator())
        else:
            # hide every other tick label to keep the daily axis readable
            for n, label in enumerate(ax.xaxis.get_ticklabels()):
                if n % 2 != 0:
                    label.set_visible(False)
    return counts

Query for "Radiology":

In [15]:
#counts = estimate_queries(triple2date(2010,1,1), triple2date(2018,6,1), mesh_term = 'radiology')

Query for "Radiology"

In [16]:
#counts = estimate_queries(2010, 2017, 1, mesh_term = 'radiology')

While Radiology returned more query results per month than MRI, it was only about 50% more. This suggests that the Radiology MeSH term reaches only a small portion of the abstracts that might contain early trends of interest to radiologists.

*UPDATE: "Diagnostic imaging" was added to "Radiology" as a default search term, which encompassed a much broader array of relevant abstracts.*
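
A combined search over both MeSH terms could be expressed as a single PubMed query string. The sketch below is an illustration only: `build_mesh_query` is a hypothetical helper, not the `format_pubmed_query` function from Part 1, and it assumes the standard PubMed field tags `[MeSH Terms]` and `[dp]` (publication date).

```python
import datetime


def build_mesh_query(start_date, end_date, mesh_terms=("radiology", "diagnostic imaging")):
    """Builds a PubMed query string OR-ing several MeSH terms over a date range.

    Hypothetical helper for illustration; not part of the ImagingEdge pipeline.
    """
    # Quote each term and tag it as a MeSH term
    terms = " OR ".join('"{}"[MeSH Terms]'.format(term) for term in mesh_terms)
    # Restrict to a publication-date range using the [dp] field tag
    date_range = "{:%Y/%m/%d}:{:%Y/%m/%d}[dp]".format(start_date, end_date)
    return "({}) AND {}".format(terms, date_range)


query = build_mesh_query(datetime.date(2018, 1, 1), datetime.date(2018, 1, 8))
print(query)
# ("radiology"[MeSH Terms] OR "diagnostic imaging"[MeSH Terms]) AND 2018/01/01:2018/01/08[dp]
```

Passing a single-element tuple such as `("radiology",)` reproduces the narrower, Radiology-only search for comparison.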

The noise is also visible in the daily counts within each month. Here is an example from three months:

In [17]:
#counts = estimate_queries(2011, 2012, months=range(1, 3), daily=True)

Here it is clear that publications also spike on the first of each month. This suggests that PubMed records some publications with year-only dates and some with month-only dates, which are then indexed as the first of the year and the first of the month respectively. These are probably not typical abstracts, since typical abstracts come with a full publication date, so I expect them to wash out when the abstracts are filtered later.

*UPDATE: Analysis of the by-week scrapes in the log files bears this out, and abstract collection is consistent week to week.*
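
One way to quantify the first-of-month spike is to compare the median count on the 1st with the median on all other days. The sketch below is a minimal illustration using synthetic numbers, not PubMed data; `first_of_month_excess` is a hypothetical helper and is not part of the pipeline.

```python
import datetime
import statistics


def first_of_month_excess(daily_counts):
    """Ratio of the median count on the 1st of a month to the median on other days.

    Args:
        daily_counts: list of (date, count) pairs
    """
    firsts = [count for date, count in daily_counts if date.day == 1]
    others = [count for date, count in daily_counts if date.day != 1]
    return statistics.median(firsts) / statistics.median(others)


# Synthetic daily counts for illustration only: flat at 10, spiking to 50 on the 1st.
start = datetime.date(2011, 1, 1)
demo = []
for offset in range(90):  # 1 January to 31 March 2011
    day = start + datetime.timedelta(days=offset)
    demo.append((day, 50 if day.day == 1 else 10))

print(first_of_month_excess(demo))  # 5.0
```

A ratio close to 1 on real data would indicate that the partial-date records have little effect on the counts.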

In [18]:
   # print_trend_dict(trends_list, 20)
   # print("Hottest search terms:")
   # hot_keywords = search_terms(G)
   # print(hot_keywords)

In [19]:
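import os

# Read the OAuth client credentials from the environment instead of hard-coding them.
# The environment variable names are placeholders; use whatever names your setup defines.
client_id = os.environ["GOOGLE_CLIENT_ID"]
client_secret = os.environ["GOOGLE_CLIENT_SECRET"]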

In [20]:
import oauth2client.client, oauth2client.file, oauth2client.tools
import gspread

flow = oauth2client.client.OAuth2WebServerFlow(client_id, client_secret, 'https://spreadsheets.google.com/feeds')
storage = oauth2client.file.Storage('credentials.dat')
credentials = storage.get()
if credentials is None or credentials.invalid:
    import argparse
    flags = argparse.ArgumentParser(parents=[oauth2client.tools.argparser]).parse_args([])
    credentials = oauth2client.tools.run_flow(flow, storage, flags)

gc = gspread.authorize(credentials)

# When this cell is run, your browser will take you to a Google authorization page.
# Once this authorization is complete, the credentials will be cached in a file named credentials.dat

In [21]:
import pandas as pd
sheet = gc.open("Web of Science - Unboosted Graph").sheet1
df = pd.DataFrame(sheet.get_all_records())
df.head(3)

| Unnamed: 0 | Unnamed: 1 | IMEDGE 2014 | IMEDGE 2015 | IMEDGE 2016 | SEARCH TERM | TRENDING TERM | WOS 2013 | WOS 2014 | WOS 2015 | WOS 2016 | WOS 2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 49 | 68 | | tomography | interstitial lung disease | 162 | 178 | 229 | 271 | 285 |
| 1 | | 58 | 95 | | tomography | chronic obstructive pulmonary | 151 | 144 | 183 | 190 | 210 |
| 2 | | 300 | 320 | | tomography | central nervous system | 234 | 252 | 278 | 334 | 320 |


Formula: WoS search must outperform previous slope by 20% or more.
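
As a worked example of this formula, the sketch below applies it to the 2013 to 2015 WoS counts from row 0 of the sheet preview above. It assumes, as the cells in this notebook do, that slopes are measured as differences of log counts and that a ratio above 1.2 counts as a trend.

```python
import math

TREND_THRESH_VAL = 1.2  # the 20% threshold from the formula above

# WoS counts for 2013, 2014 and 2015 from row 0 of the sheet preview
wos_2013, wos_2014, wos_2015 = 162, 178, 229

# Slopes of the log counts between consecutive years
previous_slope = math.log(wos_2014) - math.log(wos_2013)
current_slope = math.log(wos_2015) - math.log(wos_2014)

# A trend is registered when the current slope beats the previous one by 20% or more
gradient_ratio = current_slope / previous_slope
trended_in_2015 = gradient_ratio > TREND_THRESH_VAL

print(round(gradient_ratio, 2), trended_in_2015)  # 2.67 True
```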

In [22]:
import numpy as np
cols = df.columns.values
WOS_cols = [column_name for column_name in cols if column_name.startswith("WOS")]
df_wos = df.loc[:, WOS_cols].apply(pd.to_numeric, axis=1).apply(np.log)
print(df_wos.head(2))
print(df_wos.shape)

   WOS 2013  WOS 2014  WOS 2015  WOS 2016  WOS 2017
0  5.087596  5.181784  5.433722  5.602119  5.652489
1  5.017280  4.969813  5.209486  5.247024  5.347108
(190, 5)


In [43]:
def get_trend_gradients(df_wos):
    """Classifies each row of log WoS counts by the years in which it trended.

    Args:
        df_wos: DataFrame of log-transformed yearly WoS counts

    Returns:
        Copy of df_wos with gradient_ratio_N columns and a trend_eval category column.
    """
    df_wos_grad = df_wos.copy(deep=True)
    for col_num in range(2, len(df_wos.columns)): # taking gradients, so skip first two
        print(col_num)
        ratio_col_name = 'gradient_ratio_' + str(col_num)
        gradient1 = df_wos.iloc[:,(col_num - 1)] - df_wos.iloc[:,(col_num - 2)]
        gradient2 = df_wos.iloc[:,col_num] - df_wos.iloc[:,(col_num - 1)]
        # to avoid negative or undefined ratios, set non-positive first gradients to 1
        # trend ratios are no longer accurate, but will catch all trends
        gradient1[gradient1 <= 0] = 1
        wos_gradient_ratio = (gradient2 / gradient1)
        df_wos_grad[ratio_col_name] = wos_gradient_ratio
    df_wos_grad['trend_eval'] = np.zeros(df_wos_grad.shape[0])
    TREND_THRESH_VAL = 1.2
    # trended_in_2015 = ( df_wos_grad.gradient_ratio_2 > 1.2 ) & ( df_wos_grad.gradient_ratio_3 < 1.2 ) & ( df_wos_grad.gradient_ratio_4 < 1.2 )
    # 2017 not considered, to catch terms that trended in 2015 and 2017 -- those should cou
    trended_in_2015 = ( df_wos_grad.gradient_ratio_2 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_3 < TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_4 < TREND_THRESH_VAL )
    trended_in_2016 = ( df_wos_grad.gradient_ratio_3 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_2 < TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_4 < TREND_THRESH_VAL )
    trended_in_2017 = ( df_wos_grad.gradient_ratio_4 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_3 < TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_2 < TREND_THRESH_VAL )
    trended_15_16 = ( df_wos_grad.gradient_ratio_2 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_3 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_4 < TREND_THRESH_VAL )
    trended_15_17 = ( df_wos_grad.gradient_ratio_2 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_4 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_3 < TREND_THRESH_VAL )
    trended_16_17 = ( df_wos_grad.gradient_ratio_3 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_4 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_2 < TREND_THRESH_VAL )
    trended_all = ( df_wos_grad.gradient_ratio_2 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_3 > TREND_THRESH_VAL ) & ( df_wos_grad.gradient_ratio_4 > TREND_THRESH_VAL )
    # set outcomes (.loc avoids pandas chained-assignment problems)
    df_wos_grad.loc[trended_in_2015, 'trend_eval'] = 1
    df_wos_grad.loc[trended_in_2016, 'trend_eval'] = 2
    df_wos_grad.loc[trended_in_2017, 'trend_eval'] = 3
    df_wos_grad.loc[trended_15_16, 'trend_eval'] = 4
    df_wos_grad.loc[trended_15_17, 'trend_eval'] = 5
    df_wos_grad.loc[trended_16_17, 'trend_eval'] = 6
    df_wos_grad.loc[trended_all, 'trend_eval'] = 7

    THRESH = 2.3 # 2.3 for log trends, 10 for trends
    # remove events with too few samples in any of the five yearly count columns
    too_few = (df_wos_grad.iloc[:, 0:5] < THRESH).any(axis=1)
    df_wos_grad.loc[too_few, 'trend_eval'] = -1
    return df_wos_grad


# the cells below use df_wos_grad, so keep the result
df_wos_grad = get_trend_gradients(df_wos)

2
3
4


In [44]:
counts = df_wos_grad.trend_eval.value_counts().sort_index()
print("Counts by category:")
print(counts)
print("Percentages for all counts:")
print(counts / sum(counts))
# label-based slice on the float index: categories 2.0 through 4.0 inclusive
counts_clear = sum(counts.loc[2:4])
print("Percentages for clear results:")
print(counts.loc[2:4] / counts_clear)

Counts by category:
-1.0    22
 0.0    73
 1.0    42
 2.0    28
 3.0    14
 4.0     4
 5.0     6
 6.0     1
Name: trend_eval, dtype: int64
Percentages for all counts:
-1.0    0.115789
 0.0    0.384211
 1.0    0.221053
 2.0    0.147368
 3.0    0.073684
 4.0    0.021053
 5.0    0.031579
 6.0    0.005263
Name: trend_eval, dtype: float64
Percentages for clear results:
2.0    0.608696
3.0    0.304348
4.0    0.086957
Name: trend_eval, dtype: float64
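
The numeric categories are easier to read with labels attached. The sketch below maps each `trend_eval` code to a description derived from the masks in `get_trend_gradients`, using the counts printed above. The label wording is my own shorthand, not terminology from the pipeline.

```python
# Category codes as assigned in get_trend_gradients; label wording is illustrative.
TREND_LABELS = {
    -1: "too few samples (excluded)",
    0: "no trend pattern matched",
    1: "trended in 2015",
    2: "trended in 2016",
    3: "trended in 2017",
    4: "trended in 2015 and 2016",
    5: "trended in 2015 and 2017",
    6: "trended in 2016 and 2017",
    7: "trended in all three years",
}

# Counts copied from the "Counts by category" output above.
category_counts = {-1: 22, 0: 73, 1: 42, 2: 28, 3: 14, 4: 4, 5: 6, 6: 1}
total = sum(category_counts.values())

for code, count in sorted(category_counts.items()):
    print("{:<28} {:>3} {:6.1%}".format(TREND_LABELS[code], count, count / total))
```

The total of 190 matches the number of rows reported for `df_wos`. Category 7 does not appear because no rows were assigned to it.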


Inspect the rows to understand the behaviors registering as zero:

In [35]:
df_wos_grad[df_wos_grad.trend_eval==0]
df_wos_grad.head(50)

| Unnamed: 0 | WOS 2013 | WOS 2014 | WOS 2015 | WOS 2016 | WOS 2017 | gradient_ratio_2 | gradient_ratio_3 | gradient_ratio_4 | trend_eval |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.087596 | 5.181784 | 5.433722 | 5.602119 | 5.652489 | 2.674869 | 0.668405 | 0.299117 | 1.0 |
| 1 | 5.01728 | 4.969813 | 5.209486 | 5.247024 | 5.347108 | 0.239673 | 0.156621 | 2.666196 | 3.0 |
| 2 | 5.455321 | 5.529429 | 5.627621 | 5.811141 | 5.768321 | 1.324986 | 1.86899 | -0.233326 | 4.0 |
| 3 | 5.303305 | 5.231109 | 5.493061 | 5.602119 | 5.666427 | 0.261953 | 0.416324 | 0.58967 | 0.0 |
| 4 | 5.303305 | 5.517453 | 5.598422 | 5.713733 | 5.723585 | 0.378099 | 1.424135 | 0.085441 | 2.0 |
| 5 | 7.155396 | 7.220374 | 7.490529 | 7.482119 | 7.498316 | 4.157676 | -0.031132 | 0.016197 | 1.0 |
| 6 | 4.762174 | 4.744932 | 4.955827 | 5.198497 | 4.976734 | 0.210895 | 1.150668 | -0.913847 | 0.0 |
| 7 | 5.298317 | 5.4161 | 5.517453 | 5.686975 | 5.666427 | 0.860502 | 1.672603 | -0.121215 | 2.0 |
| 8 | 6.44254 | 6.508769 | 6.668228 | 6.654153 | 6.729824 | 2.407694 | -0.088272 | 0.075672 | 1.0 |
| 9 | 4.317488 | 4.406719 | 4.770685 | 4.744932 | 4.820282 | 4.078906 | -0.070755 | 0.075349 | 1.0 |
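
Row 3 shows one way a row can end up as zero. Its 2013 to 2014 slope is negative, so `get_trend_gradients` replaces that slope with 1 and the ratio becomes the current slope itself. The sketch below reproduces this for a single row. `clamped_gradient_ratio` is a hypothetical helper that mirrors the clamping step and is not a function from the notebook.

```python
import math


def clamped_gradient_ratio(log_prev, log_mid, log_next):
    """Ratio of successive log-count slopes, with non-positive previous slopes set to 1."""
    gradient1 = log_mid - log_prev
    gradient2 = log_next - log_mid
    if gradient1 <= 0:
        gradient1 = 1  # same clamp as in get_trend_gradients
    return gradient2 / gradient1


# Log counts for 2013, 2014 and 2015 from row 3 of the table above
ratio = clamped_gradient_ratio(5.303305, 5.231109, 5.493061)
print(round(ratio, 6))

# After clamping, a ratio above 1.2 requires the raw count to grow by this factor:
print(round(math.exp(1.2), 2))  # 3.32
```

After clamping, such a row is flagged only if its count more than triples in one year, so a recovery after a dip usually registers as no trend.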


In [36]:
np.log(0.2)

-1.6094379124341003

Final results for WoS benchmarking:

25.8% of search-trending combinations trended *simultaneously* with WoS.

17.9% trended in ImagingEdge in 2015, but in WoS in 2016.

6.8% trended in ImagingEdge in 2015, but in WoS in 2017.

