# Question 1 Implementation

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from question_one.impl import JaccardSimilarities, TextEmbeddingSimilarities
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial import distance
from tqdm import tqdm
%load_ext autoreload
%autoreload
%matplotlib inline

## Country Jaccard Similarity

Jaccard similarity is a metric describing how similar two sets are to one another. The resulting number is the proportion of elements that the two sets have in common, out of all the distinct elements that appear in either set.
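
As a minimal sketch of the metric (not the `JaccardSimilarities` implementation, whose internals are not shown in this notebook), Jaccard similarity is the size of the intersection divided by the size of the union. The function name, the example reaction terms, and the choice to score two empty sets as 0 are all illustrative assumptions.

```python
def jaccard_similarity(a, b):
    """Return the size of the intersection over the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are scored 0.0 rather than raising.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical reaction terms for two countries.
country_a = {"nausea", "headache", "fatigue"}
country_b = {"nausea", "headache", "rash", "dizziness"}

print(jaccard_similarity(country_a, country_b))  # 2 shared / 5 distinct = 0.4
```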

In our terms, a score of 1 indicates that two countries have exactly the same *reported* adverse events. A score of 0 means that they have nothing in common. These similarities *must* be considered in light of the underlying data characteristics:

* As FDA data, the reporting will be skewed towards the US.
* Underlying collection factors make FAERS counts noisy, which is part of the justification for this discrete metric.
* Larger populations are correlated with more reports.

In [2]:
jaccard = JaccardSimilarities()
jaccard.create_openfda_session()

This initial version will only remove English stop words.

In [3]:
jaccard.load_country_reactions()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bcf4k\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bcf4k\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
100%|████████████████████████████████████████████████████████████████████████████████| 235/235 [00:19<00:00, 12.30it/s]


### Execution with US and CA Reference Points

In [4]:
us_jaccard_df = jaccard.get_jaccard_similarity_with_reference_to_country('US')
us_jaccard_df[:15]

100%|██████████████████████████████████████████████████████████████████████████████| 235/235 [00:00<00:00, 9817.93it/s]


Unnamed: 0,country,jaccard_simil
100,US,1.0
88,CA,0.609186
177,GB,0.552877
55,BR,0.549263
53,AU,0.540895
58,IE,0.535742
8,PR,0.514784
98,ZA,0.512879
10,NL,0.497751
204,CO,0.49439


In [5]:
ca_jaccard_df = jaccard.get_jaccard_similarity_with_reference_to_country('CA')
ca_jaccard_df[:15]

100%|██████████████████████████████████████████████████████████████████████████████| 235/235 [00:00<00:00, 9415.05it/s]


Unnamed: 0,country,jaccard_simil
88,CA,1.0
100,US,0.609186
53,AU,0.5968
177,GB,0.581616
58,IE,0.577409
55,BR,0.571654
98,ZA,0.560594
204,CO,0.557722
79,AR,0.550466
56,IL,0.549263


From the results, we can see that this metric does distinguish between the adverse events reported in different countries. However, minor surface variations (such as "increases" vs. "increase") are treated as entirely different terms, which makes countries look less similar than they really are.
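
To illustrate how surface variants deflate the score, the sketch below compares two hypothetical term sets before and after collapsing inflected forms. The tiny `LEMMAS` lookup is a hand-written stand-in for a real lemmatizer, and the terms are made up for illustration.

```python
# Hand-written stand-in for a lemmatizer (assumption, for illustration only).
LEMMAS = {"increases": "increase", "increased": "increase"}


def normalize(term):
    """Map each token of a term to its base form when one is known."""
    return " ".join(LEMMAS.get(tok, tok) for tok in term.split())


def jaccard(a, b):
    """Intersection over union of two non-empty sets."""
    return len(a & b) / len(a | b)


# Hypothetical reported terms for two countries.
country_a = {"weight increase", "nausea"}
country_b = {"weight increases", "nausea"}

raw = jaccard(country_a, country_b)
lemmatized = jaccard({normalize(t) for t in country_a},
                     {normalize(t) for t in country_b})

print(raw)         # 1 shared term out of 3 distinct terms
print(lemmatized)  # both terms match after normalization: 1.0
```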

Let's see what the difference looks like after applying lemmatization.

In [6]:
jaccard.load_country_reactions(lemmatize=True)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bcf4k\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bcf4k\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
100%|████████████████████████████████████████████████████████████████████████████████| 235/235 [00:20<00:00, 11.33it/s]


In [7]:
us_lemma_jaccard_df = jaccard.get_jaccard_similarity_with_reference_to_country('US')
us_jaccard_simils = pd.merge(us_jaccard_df, us_lemma_jaccard_df, on='country', suffixes=('_no_lemma', '_lemma'))
us_jaccard_simils

100%|██████████████████████████████████████████████████████████████████████████████| 235/235 [00:00<00:00, 8415.11it/s]


Unnamed: 0,country,jaccard_simil_no_lemma,jaccard_simil_lemma
0,US,1.000000,1.000000
1,CA,0.609186,0.609186
2,GB,0.552877,0.552877
3,BR,0.549263,0.549263
4,AU,0.540895,0.540895
...,...,...,...
230,CX,0.001000,0.001000
231,PM,0.000000,0.000000
232,FK,0.000000,0.000000
233,PW,0.000000,0.000000


In [8]:
us_jaccard_simils.describe()

Unnamed: 0,jaccard_simil_no_lemma,jaccard_simil_lemma
count,235.0,235.0
mean,0.179392,0.179394
std,0.179249,0.179252
min,0.0,0.0
25%,0.019891,0.019891
50%,0.101614,0.101614
75%,0.343876,0.343876
max,1.0,1.0


In [9]:
ca_lemma_jaccard_df = jaccard.get_jaccard_similarity_with_reference_to_country('CA')
ca_jaccard_simils = pd.merge(ca_jaccard_df, ca_lemma_jaccard_df, on='country', suffixes=('_no_lemma', '_lemma'))
ca_jaccard_simils

100%|██████████████████████████████████████████████████████████████████████████████| 235/235 [00:00<00:00, 7854.31it/s]


Unnamed: 0,country,jaccard_simil_no_lemma,jaccard_simil_lemma
0,CA,1.000000,1.000000
1,US,0.609186,0.609186
2,AU,0.596800,0.596800
3,GB,0.581616,0.581616
4,IE,0.577409,0.577409
...,...,...,...
230,PM,0.001001,0.001001
231,CX,0.001001,0.001001
232,FK,0.000000,0.000000
233,SX,0.000000,0.000000


In [10]:
ca_jaccard_simils.describe()

Unnamed: 0,jaccard_simil_no_lemma,jaccard_simil_lemma
count,235.0,235.0
mean,0.193657,0.193659
std,0.194858,0.194861
min,0.0,0.0
25%,0.021317,0.021317
50%,0.113508,0.113508
75%,0.373081,0.373081
max,1.0,1.0


We can see from the distributions that lemmatization has a very minor impact. We can also see that the distributions are right-skewed: a small number of countries are fairly similar to the reference country, but under the literal matching of the Jaccard metric the vast majority have little in common with it (the median similarity is roughly 0.10 for the US and 0.11 for CA).

Based on the nearest neighbors results from the EDA, embeddings may be able to capture a more realistic relationship.

## Country Embedding Distance

A fundamental problem with the patient reaction events is that many are nearly identical. Because Jaccard similarity treats near-identical terms as distinct, it makes two countries look more different than they really are. An embedding representation could help us account for that.
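
The idea can be sketched with toy vectors: each term is mapped to a vector, a country is represented by the average of its term vectors, and two countries are compared by cosine similarity. The 2-dimensional vectors and terms below are invented for illustration; they are not BioWordVec values, and this is not the `TextEmbeddingSimilarities` implementation.

```python
import math

# Invented 2-d vectors; near-identical terms get nearby vectors.
TOY_VECTORS = {
    "nausea": [1.0, 0.0],
    "vomiting": [0.9, 0.1],
    "rash": [0.0, 1.0],
}


def average_embedding(terms):
    """Element-wise mean of the vectors for the given terms."""
    vectors = [TOY_VECTORS[t] for t in terms]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


country_a = ["nausea", "rash"]
country_b = ["vomiting", "rash"]

# The term sets share only "rash", so their Jaccard similarity is 1/3, yet the
# averaged embeddings are almost parallel because "nausea" and "vomiting" are
# close to each other in the toy vector space.
print(cosine_similarity(average_embedding(country_a),
                        average_embedding(country_b)))
```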

In [11]:
# Helper function to perform country comparisons with reference to a specific country
def cosine_similarity_with_reference_to_country(country, embedder, method='word_average'):
    # Embed the reference country passed in (previously hard-coded to 'US')
    country_embed = embedder.transform(embedder.country_reactions_d[country], method=method)
    result_d = {}
    for comp_country in tqdm(embedder.all_countries):
        comp_embed = embedder.transform(embedder.country_reactions_d[comp_country], method=method)
        result_d[comp_country] = 1 - distance.cosine(country_embed, comp_embed)
    df = pd.DataFrame.from_dict(result_d, orient='index')
    df.reset_index(inplace=True)
    df.columns = ['country', 'cosine_simil']
    df.sort_values(by=['cosine_simil'], ascending=False, inplace=True)
    return df

In [12]:
embedder = TextEmbeddingSimilarities('../models/BioWordVec_PubMed_MIMICIII_d200.bin')



In [13]:
embedder.create_openfda_session()
embedder.load_country_reactions()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bcf4k\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bcf4k\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
100%|████████████████████████████████████████████████████████████████████████████████| 235/235 [00:19<00:00, 12.07it/s]


In [14]:
us_cosine_simil = cosine_similarity_with_reference_to_country('US', embedder)
us_cosine_simil[:15]

100%|████████████████████████████████████████████████████████████████████████████████| 235/235 [00:03<00:00, 60.84it/s]


Unnamed: 0,country,cosine_simil
100,US,1.0
88,CA,0.99681
55,BR,0.996715
177,GB,0.99584
108,FI,0.995399
58,IE,0.995357
8,PR,0.995206
33,UM,0.994958
155,SE,0.994344
99,AF,0.99431


In [15]:
us_cosine_simil.describe()

Unnamed: 0,cosine_simil
count,235.0
mean,0.944181
std,0.082257
min,0.555807
25%,0.942931
50%,0.979092
75%,0.987893
max,1.0


Now we see the opposite behavior to the Jaccard metric: almost every country scores as highly similar to the US (median 0.979). However, this could reflect a diluting effect that averaging the embeddings has on each country's representation.

Let's see what happens if we use TF-IDF weights to more intelligently aggregate word embeddings.
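
One way to read "TF-IDF weighted aggregation" is sketched below: each term vector is scaled by its TF-IDF weight before averaging, so terms that appear in every country contribute less. The toy vectors, token lists, and function names are assumptions for illustration, and the smoothed IDF formula is the one scikit-learn's `TfidfVectorizer` uses by default. How `transform(..., method='tfidf')` actually computes its weights is not shown in this notebook.

```python
import math
from collections import Counter

# Invented 2-d vectors standing in for real word embeddings.
TOY_VECTORS = {"nausea": [1.0, 0.0], "rash": [0.0, 1.0]}

# Hypothetical token lists, one per country.
DOCS = {
    "A": ["nausea", "nausea", "rash"],
    "B": ["nausea"],
}


def idf(term):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in tokens for tokens in DOCS.values())
    return math.log((1 + len(DOCS)) / (1 + df)) + 1


def tfidf_weighted_embedding(tokens):
    """Average of term vectors, each weighted by term count times IDF."""
    weights = {t: n * idf(t) for t, n in Counter(tokens).items()}
    total = sum(weights.values())
    dims = len(next(iter(TOY_VECTORS.values())))
    return [
        sum(w * TOY_VECTORS[t][d] for t, w in weights.items()) / total
        for d in range(dims)
    ]


# "rash" appears in fewer countries than "nausea", so it carries more weight
# here than in a plain average, where it would contribute exactly 1/3.
print(tfidf_weighted_embedding(DOCS["A"]))
```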

In [16]:
us_tfidf_cosine_simil = cosine_similarity_with_reference_to_country('US', embedder, method='tfidf')
us_tfidf_cosine_simil[:15]

100%|████████████████████████████████████████████████████████████████████████████████| 235/235 [00:07<00:00, 30.83it/s]


Unnamed: 0,country,cosine_simil
100,US,1.0
88,CA,0.996664
55,BR,0.996572
33,UM,0.99561
177,GB,0.995546
108,FI,0.995312
99,AF,0.99528
8,PR,0.995241
58,IE,0.994831
19,IS,0.994251


In [17]:
us_cosine_simils = pd.merge(us_cosine_simil, us_tfidf_cosine_simil, on='country', suffixes=('_avg', '_tfidf'))
us_cosine_simils

Unnamed: 0,country,cosine_simil_avg,cosine_simil_tfidf
0,US,1.000000,1.000000
1,CA,0.996810,0.996664
2,BR,0.996715,0.996572
3,GB,0.995840,0.995546
4,FI,0.995399,0.995312
...,...,...,...
230,FK,0.635951,0.637203
231,PW,0.622655,0.619369
232,GG,0.605294,0.593881
233,TV,0.563692,0.560999


In [18]:
us_cosine_simils.describe()

Unnamed: 0,cosine_simil_avg,cosine_simil_tfidf
count,235.0,235.0
mean,0.944181,0.944276
std,0.082257,0.082788
min,0.555807,0.498739
25%,0.942931,0.944125
50%,0.979092,0.979721
75%,0.987893,0.987138
max,1.0,1.0


The introduction of TF-IDF event weighting had a negligible impact.

Let's make one last attempt and remove any averaging by treating all the events for a country as if they belong to a single event record.

In [19]:
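# Helper function to perform country comparisons with reference to a specific country.
# Each comparison country's events are flattened into a single list (or set) of tokens.
def cosine_similarity_with_reference_to_country_v2(country, embedder, method='word_average', use_unique_tokens=False):
    # Embed the reference country passed in (previously hard-coded to 'US')
    country_embed = embedder.transform(embedder.country_reactions_d[country], method=method)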
    result_d = {}
    for comp_country in tqdm(embedder.all_countries):
        if not use_unique_tokens:
            comp_embed = embedder.transform(' '.join(embedder.country_reactions_d[comp_country]).split(), method=method)
        else:
            comp_embed = embedder.transform(set(' '.join(embedder.country_reactions_d[comp_country]).split()), method=method)
        result_d[comp_country] = 1 - distance.cosine(country_embed, comp_embed)
    df = pd.DataFrame.from_dict(result_d, orient='index')
    df.reset_index(inplace=True)
    df.columns = ['country', 'cosine_simil']
    df.sort_values(by=['cosine_simil'], ascending=False, inplace=True)
    return df

In [20]:
us_cosine_simil_v2 = cosine_similarity_with_reference_to_country_v2('US', embedder, method='tfidf')
us_cosine_simils = pd.merge(us_cosine_simils, us_cosine_simil_v2, on='country', suffixes=('_v1', '_v2'))
us_cosine_simils.describe()

100%|████████████████████████████████████████████████████████████████████████████████| 235/235 [00:07<00:00, 30.02it/s]


Unnamed: 0,cosine_simil_avg,cosine_simil_tfidf,cosine_simil
count,235.0,235.0,235.0
mean,0.944181,0.944276,0.945753
std,0.082257,0.082788,0.08109
min,0.555807,0.498739,0.555332
25%,0.942931,0.944125,0.946186
50%,0.979092,0.979721,0.9807
75%,0.987893,0.987138,0.98782
max,1.0,1.0,0.998392


Similarity has actually increased slightly (mean 0.9458 vs. 0.9443 for the per-event TF-IDF version).

What about if we only use unique tokens?

In [21]:
us_cosine_simil_v2 = cosine_similarity_with_reference_to_country_v2('US', embedder, method='tfidf', use_unique_tokens=True)
us_cosine_simils = pd.merge(us_cosine_simils, us_cosine_simil_v2, on='country', suffixes=('_v1', '_v2'))
us_cosine_simils.describe()

100%|████████████████████████████████████████████████████████████████████████████████| 235/235 [00:07<00:00, 30.73it/s]


Unnamed: 0,cosine_simil_avg,cosine_simil_tfidf,cosine_simil_v1,cosine_simil_v2
count,235.0,235.0,235.0,235.0
mean,0.944181,0.944276,0.945753,0.946094
std,0.082257,0.082788,0.08109,0.080605
min,0.555807,0.498739,0.555332,0.555332
25%,0.942931,0.944125,0.946186,0.949591
50%,0.979092,0.979721,0.9807,0.981324
75%,0.987893,0.987138,0.98782,0.986551
max,1.0,1.0,0.998392,0.995939


Here, the difference is again negligible.

Looking at these together, it is not necessarily surprising that the embeddings see the meaning of the reported events as so similar across countries. The regulations underlying the data collection do encourage a level of uniformity. Furthermore, even with fine-tuned medical embeddings, the relative distances between medical terms are limited. Plus, the nature of this word2vec-inspired model is that the distance between antonyms that appear in interchangeable contexts (such as "increased" and "decreased") is small.

## Heuristic to Establish a Naive, Conservative Baseline

When conducting the EDA, many groups of terms had the same first or last token. These tokens could be used as naive bins to categorize different types of events. Combining that crude binning with the literal nature of Jaccard similarity offers a rough check on how similar the events can be.

In [22]:
first_country_reactions_d = {}
last_country_reactions_d = {}
for country, events in embedder.country_reactions_d.items():
    first_country_reactions_d[country] = set()
    last_country_reactions_d[country] = set()
    for event in events:
        tokens = event.split()
        # Use the first and last tokens of each event as naive bins
        first_country_reactions_d[country].add(tokens[0])
        last_country_reactions_d[country].add(tokens[-1])

In [23]:
jaccard_first_baseline = JaccardSimilarities()
jaccard_last_baseline = JaccardSimilarities()

In [24]:
jaccard_first_baseline.all_countries = list(first_country_reactions_d.keys())
jaccard_first_baseline.country_reactions_d = first_country_reactions_d

In [25]:
jaccard_first_baseline.get_jaccard_similarity_with_reference_to_country('US').describe()

100%|██████████████████████████████████████████████████████████████████████████████| 235/235 [00:00<00:00, 5746.39it/s]


Unnamed: 0,jaccard_simil
count,235.0
mean,0.240894
std,0.215408
min,0.0
25%,0.037311
50%,0.171561
75%,0.454785
max,1.0


In [26]:
jaccard_last_baseline.all_countries = list(last_country_reactions_d.keys())
jaccard_last_baseline.country_reactions_d = last_country_reactions_d

In [27]:
jaccard_last_baseline.get_jaccard_similarity_with_reference_to_country('US').describe()

100%|█████████████████████████████████████████████████████████████████████████████| 235/235 [00:00<00:00, 10246.49it/s]


Unnamed: 0,jaccard_simil
count,235.0
mean,0.245724
std,0.215412
min,0.0
25%,0.038606
50%,0.177083
75%,0.461536
max,1.0


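Relative to the full-term Jaccard scores for the US, similarity increases at every quartile: roughly 1.9x at the 25th percentile, 1.7x at the median, and 1.3x at the 75th percentile. This lends support to the embedding similarity findings.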
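
The size of the increase can be checked directly from the `describe()` outputs above. The sketch below divides the first-token and last-token quartiles by the corresponding full-term (non-lemmatized) US quartiles. The numbers are copied from the tables in this notebook; the variable and function names are illustrative.

```python
# Quartiles copied from the describe() outputs in this notebook (US reference).
full_term = {"25%": 0.019891, "50%": 0.101614, "75%": 0.343876}
first_token = {"25%": 0.037311, "50%": 0.171561, "75%": 0.454785}
last_token = {"25%": 0.038606, "50%": 0.177083, "75%": 0.461536}


def quartile_ratios(binned, baseline):
    """Ratio of binned to baseline similarity at each quartile."""
    return {q: binned[q] / baseline[q] for q in baseline}


for name, binned in [("first token", first_token), ("last token", last_token)]:
    ratios = quartile_ratios(binned, full_term)
    print(name, {q: round(r, 2) for q, r in ratios.items()})
```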