# Question 1 Implementation

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from question_one.impl import CountrySimilarities, TextEmbedder
from scipy.spatial import distance
%load_ext autoreload
%autoreload
%matplotlib inline

Let's create an instance of `question_one.impl.CountrySimilarities` to manage the features and values we're interested in.

In [2]:
country_similarities = CountrySimilarities()
country_similarities.create_openfda_session()

Load in all of the country and reaction data.

In [3]:
country_similarities.create_openfda_session()
country_similarities.load_country_reactions()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bcf4k\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
100%|██████████| 235/235 [00:19<00:00, 12.07it/s]


## Country Jaccard Similarity

Jaccard similarity measures how similar two sets are. It is the number of elements the two sets have in common divided by the number of distinct elements across both sets, that is, the size of the intersection over the size of the union.
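As a minimal sketch of the definition, the helper below computes the score for two small sets. The function name and the reaction terms are hypothetical and chosen only for illustration; this is not the code inside `CountrySimilarities`.

```python
def jaccard_similarity(a, b):
    """Return |a & b| / |a | b| for two iterables, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: the ratio is undefined; returning 0.0 is one convention.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical reaction sets for two countries (illustrative only).
country_a = {"nausea", "headache", "rash", "fatigue"}
country_b = {"nausea", "headache", "dizziness"}

score = jaccard_similarity(country_a, country_b)
print(score)  # 2 shared terms / 5 distinct terms = 0.4
```

Because only set membership matters, a reaction reported once counts the same as one reported thousands of times.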

In our terms, a score of 1 indicates that two countries have exactly the same *reported* adverse events. A score of 0 means that they have nothing in common. These similarities *must* be considered in light of the underlying data characteristics:

* As FDA data, the reporting will be skewed towards the US.
* Underlying collection factors make FAERS counts noisy, which is part of the justification for this discrete metric.
* Larger populations tend to produce more reports.

Given these factors, we would expect the US to be more similar to Western or developed countries than to others. Let's run a sanity check on this hypothesis.

### Sanity Check on Western and Developed Countries

In [4]:
country_similarities.jaccard_similarity('US', 'CA')

0.6091861402095085

In [5]:
country_similarities.jaccard_similarity('US', 'GB')

0.5528771384136858

In [6]:
country_similarities.jaccard_similarity('US', 'FR')

0.40931545518701484

In [7]:
country_similarities.jaccard_similarity('US', 'NZ')

0.45587162654996355

### Sanity Check on Eastern and Developed Countries

In [8]:
country_similarities.jaccard_similarity('US', 'JP')

0.3760330578512397

In [9]:
country_similarities.jaccard_similarity('US', 'SG')

0.36289222373806274

In [10]:
# Note: in ISO 3166-1 alpha-2, 'SK' is Slovakia; South Korea is 'KR'.
country_similarities.jaccard_similarity('US', 'SK')

0.3447811447811448

In [11]:
country_similarities.jaccard_similarity('US', 'CN')

0.39845938375350143

In [12]:
country_similarities.jaccard_similarity('US', 'MN')

0.037037037037037035

### Execution with US and CA Reference Points

In [13]:
us_jaccard_d = country_similarities.get_jaccard_similarity_with_reference_to_country('US')

100%|██████████| 235/235 [00:00<00:00, 10233.09it/s]


In [14]:
us_jaccard_df = pd.DataFrame.from_dict(us_jaccard_d, orient='index')
us_jaccard_df.columns = ['jaccard_simil']
us_jaccard_df = us_jaccard_df.reset_index()
us_jaccard_df.sort_values(by=['jaccard_simil'], ascending=False)[:15]

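|     | index | jaccard_simil |
|-----|-------|---------------|
| 233 | US    | 1.0           |
| 106 | CA    | 0.609186      |
| 187 | GB    | 0.552877      |
| 49  | BR    | 0.549263      |
| 166 | AU    | 0.540895      |
| 130 | IE    | 0.535742      |
| 199 | PR    | 0.514784      |
| 11  | ZA    | 0.512879      |
| 70  | NL    | 0.497751      |
| 176 | CO    | 0.49439       |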


In [15]:
ca_jaccard_d = country_similarities.get_jaccard_similarity_with_reference_to_country('CA')

100%|██████████| 235/235 [00:00<00:00, 9812.46it/s]


In [16]:
ca_jaccard_df = pd.DataFrame.from_dict(ca_jaccard_d, orient='index')
ca_jaccard_df.columns = ['jaccard_simil']
ca_jaccard_df = ca_jaccard_df.reset_index()
ca_jaccard_df.sort_values(by=['jaccard_simil'], ascending=False)[:15]

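|     | index | jaccard_simil |
|-----|-------|---------------|
| 106 | CA    | 1.0           |
| 233 | US    | 0.609186      |
| 166 | AU    | 0.5968        |
| 187 | GB    | 0.581616      |
| 130 | IE    | 0.577409      |
| 49  | BR    | 0.571654      |
| 11  | ZA    | 0.560594      |
| 176 | CO    | 0.557722      |
| 232 | AR    | 0.550466      |
| 110 | IL    | 0.549263      |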


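These results show that the metric does discriminate between countries. With either the US or Canada as the reference point, the most similar countries score between roughly 0.5 and 0.6, while Mongolia scores only 0.04 against the US.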

## Country Embedding Distance

A fundamental problem with the patient reaction terms is that many are nearly identical in meaning, yet each is treated as a distinct set element. As a result, the Jaccard similarity between two countries can understate how similar their reports really are. An embedding representation could help us account for that.
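To make the problem concrete, here is a small sketch with made-up reaction terms. The two hypothetical countries report closely related events, but two of the three are worded differently, so exact set matching finds only one term in common.

```python
# Hypothetical reaction terms (illustrative only): closely related events,
# but two of them are worded differently in the second country.
country_a = {"nausea", "abdominal pain", "rash"}
country_b = {"nausea", "abdominal pain upper", "rash erythematous"}

shared = country_a & country_b
jaccard = len(shared) / len(country_a | country_b)

print(sorted(shared))  # ['nausea']
print(jaccard)         # 1 shared term / 5 distinct terms = 0.2
```

An embedding places related terms near one another, so a distance computed in that space need not treat them as entirely unrelated.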

In [17]:
embedder = TextEmbedder('models/BioWordVec_PubMed_MIMICIII_d200.bin')



To do this, we will view each country as a "document" composed of "sentences", which are the adverse events. As a first approach, the document embedding is the average of the embeddings of its sentences.
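The averaging step can be sketched in plain Python as follows. The vectors are toy 3-dimensional values invented for illustration, and `mean_embedding` and `cosine_distance` are hypothetical helpers; how `TextEmbedder.transform` works internally is not shown in this notebook, so this is only an assumption about its general shape. The distance uses the same definition as `scipy.spatial.distance.cosine`: one minus the cosine of the angle between the vectors.

```python
import math


def mean_embedding(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means the same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy "sentence" embeddings for the events of two hypothetical countries.
country_a_events = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
country_b_events = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

doc_a = mean_embedding(country_a_events)  # [0.5, 0.5, 0.0]
doc_b = mean_embedding(country_b_events)  # [0.5, 0.0, 0.5]

print(cosine_distance(doc_a, doc_b))  # approximately 0.5
```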

In [18]:
from tqdm import tqdm

def get_cosine_distance_with_reference_to_country(country):
    """Return cosine distances from `country` to every country, closest first."""
    country_embed = embedder.transform(
        country_similarities.country_reactions_d[country])
    result_d = {}
    for comp_country in tqdm(country_similarities.all_countries):
        comp_embed = embedder.transform(
            country_similarities.country_reactions_d[comp_country])
        result_d[comp_country] = distance.cosine(country_embed, comp_embed)
    df = pd.DataFrame.from_dict(result_d, orient='index')
    # Despite the column name, these values are cosine distances (0 = closest).
    df.columns = ['cosine_simil']
    return df.reset_index().sort_values(by=['cosine_simil'])

In [21]:
us_cos_dists = get_cosine_distance_with_reference_to_country('US')
us_cos_dists[:15]

100%|██████████| 235/235 [00:03<00:00, 61.51it/s]


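|     | index | cosine_simil |
|-----|-------|--------------|
| 233 | US    | 0.0          |
| 106 | CA    | 0.00319      |
| 49  | BR    | 0.003285     |
| 187 | GB    | 0.00416      |
| 72  | FI    | 0.004601     |
| 130 | IE    | 0.004643     |
| 199 | PR    | 0.004794     |
| 2   | UM    | 0.005042     |
| 71  | SE    | 0.005656     |
| 8   | AF    | 0.00569      |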


At first glance, these results are in line with the Jaccard similarity approach. However, it seems odd that Afghanistan (AF) is now the 9th closest country to the US.

In [22]:
us_cos_dists[us_cos_dists['index'] == 'CN']

|     | index | cosine_simil |
|-----|-------|--------------|
| 156 | CN    | 0.010451     |


In [23]:
us_cos_dists[us_cos_dists['index'] == 'FR']

|     | index | cosine_simil |
|-----|-------|--------------|
| 227 | FR    | 0.012999     |


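China is also now closer to the US than France is (a cosine distance of 0.0105 versus 0.0130), which reverses their Jaccard ordering (0.398 versus 0.409). This seems surprising.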

In [24]:
us_cos_dists['cosine_simil'].describe()

count    235.000000
mean       0.055819
std        0.082257
min        0.000000
25%        0.012107
50%        0.020908
75%        0.057069
max        0.444193
Name: cosine_simil, dtype: float64

Looking at the distribution, we see that most countries are very close to one another: the median distance is about 0.02, and three quarters of the countries fall below 0.06. This likely reflects the diluting effect that averaging embeddings has on a country's representation.

By applying TF-IDF weights to the embeddings, we may be able to resolve that.
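A rough sketch of that idea follows, under several assumptions: the event names, countries, and 4-dimensional vectors are invented; each event is counted once per country, so the term-frequency part is constant and only the IDF part varies; and `log(N / df) + 1` is just one common IDF variant. The helper names are hypothetical and this is not necessarily how the weighting would be implemented in `question_one.impl`.

```python
import math


def idf_weights(country_events):
    """Map each event to log(N / df) + 1, where df = countries reporting it."""
    n_countries = len(country_events)
    doc_freq = {}
    for events in country_events.values():
        for event in set(events):
            doc_freq[event] = doc_freq.get(event, 0) + 1
    return {e: math.log(n_countries / df) + 1.0 for e, df in doc_freq.items()}


def weighted_mean(events, embeddings, weights):
    """Weighted average of the embeddings of the given events."""
    total = sum(weights[e] for e in events)
    dim = len(embeddings[events[0]])
    return [sum(weights[e] * embeddings[e][i] for e in events) / total
            for i in range(dim)]


def cosine_distance(u, v):
    """1 - cos(angle between u and v)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy data: two events are reported everywhere, two are country-specific.
country_events = {
    "A": ["nausea", "headache", "rash"],
    "B": ["nausea", "headache", "tremor"],
    "C": ["nausea", "headache"],
}
embeddings = {
    "nausea":   [1.0, 0.0, 0.0, 0.0],
    "headache": [0.0, 1.0, 0.0, 0.0],
    "rash":     [0.0, 0.0, 1.0, 0.0],
    "tremor":   [0.0, 0.0, 0.0, 1.0],
}

uniform = {event: 1.0 for event in embeddings}  # plain average
idf = idf_weights(country_events)               # rare events weigh more

plain_dist = cosine_distance(
    weighted_mean(country_events["A"], embeddings, uniform),
    weighted_mean(country_events["B"], embeddings, uniform))
tfidf_dist = cosine_distance(
    weighted_mean(country_events["A"], embeddings, idf),
    weighted_mean(country_events["B"], embeddings, idf))

print(plain_dist, tfidf_dist)  # the IDF-weighted distance is the larger one
```

In this toy case the events shared by every country dominate the plain average, while the IDF weights shift each country's vector toward its distinctive events and so pull countries A and B further apart.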