## Dataset
We will be using the Drug Review Dataset (Drugs.com) for this workshop.

Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH '18). ACM, New York, NY, USA, 121-125. DOI: [https://dl.acm.org/citation.cfm?doid=3194658.3194677]

It is a part of the UCI machine learning repository.

Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.



In [None]:
import csv
import io
import json
import os
import random
import time

import boto3
import matplotlib.pyplot as plt
import pandas as pd
import sagemaker
from sagemaker import get_execution_role

In [None]:
sagemaker_session=sagemaker.Session()
sagemaker_bucket = sagemaker_session.default_bucket()
role = get_execution_role()

reviews_data_prefix = 'reviews'

s3_client = boto3.client('s3')
s3 = boto3.resource('s3')

## First, let's explore the source data

In [None]:
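# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0.
# On pandas older than 1.3, use error_bad_lines=False instead.
notes_partial = pd.read_csv('source/drugsCom_raw.tsv.zip', header=0, delimiter='\t',
                            low_memory=False, on_bad_lines='skip', encoding='utf-8')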
notes_partial.head()

In [None]:
notes_partial.shape

For the purposes of this demo, we will randomly sample 50 rows.

In [None]:
notes_50 = notes_partial.sample(n=50)

In [None]:
notes_50.head()

## Next, pick the number of topics to use

Now that we have loaded the file, we will pick the number of topics to use. This has three steps:
1. Use Amazon Comprehend Medical to identify topics in each review.
2. Plot the distribution of the number of topics per review.
3. Randomly select topics from each review to fill the topic columns.


#### Step 1: Identify topics

In [None]:
topic_cnt = 0  # running count of MEDICAL_CONDITION entities found
cm = boto3.client('comprehendmedical')

topics_per_row = list()

for index,row in notes_50.iterrows():
    topic_list = []
    
    # For each row, use Comprehend Medical to detect entities
    entities = cm.detect_entities(Text=row['review'])
    time.sleep(1)
    
    # Filter by Medical Condition
    for entity in entities['Entities']:
        if entity['Category'] == 'MEDICAL_CONDITION':
            topic_list.append(entity['Text'])
            topic_cnt += 1
    topic_dict=dict(
        Index=index,
        DrugName=row['drugName'],
        TopicList=topic_list
    )
    topics_per_row.append(topic_dict)
    print(index)  # progress indicator
    
    

#### Sample Output

In [None]:
topics_per_row[0:5]

#### Step 2: Plot topic distribution

In [None]:
topic_count = [ len(_['TopicList']) for _ in topics_per_row]
plt.hist(topic_count,bins=20,label='Topic Count')
plt.xlabel('Number of topics')

#### Step 3: Create the topic list and review the output

In this example, we only use reviews with at least 5 germane topics.
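
One detail of the selection step is worth noting: `random.choices` draws with replacement, so the same topic can fill more than one of the five slots. The sketch below contrasts it with `random.sample`, which draws without replacement. The topic list is made up for illustration, and whether repeated topics matter depends on how the downstream notebook uses the columns.

```python
import random

# Hypothetical topic list for one review (illustrative values only).
topic_list = ['nausea', 'headache', 'insomnia', 'anxiety', 'fatigue']

random.seed(0)  # fixed seed so the illustration is repeatable

# With replacement: a topic may be picked more than once.
with_replacement = random.choices(topic_list, k=5)

# Without replacement: each list position is picked at most once.
without_replacement = random.sample(topic_list, k=5)

print(with_replacement)
print(without_replacement)
```

Note that `random.sample` raises `ValueError` when `k` is larger than the list, which the `len(topic_list) >= min_topic_len` check in the cell below already rules out.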

In [None]:
min_topic_len = 5
review_rows = []

for row in topics_per_row:
    topic_list = row['TopicList']
    if len(topic_list) >= min_topic_len:
        # Randomly select topics (random.choices samples with replacement)
        random_topics = random.choices(topic_list, k=min_topic_len)
        review_rows.append({
            'drugName': row['DrugName'],
            'topic1': str(random_topics[0]),
            'topic2': str(random_topics[1]),
            'topic3': str(random_topics[2]),
            'topic4': str(random_topics[3]),
            'topic5': str(random_topics[4])})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
reviews = pd.DataFrame(review_rows,
                       columns=['drugName', 'topic1', 'topic2', 'topic3', 'topic4', 'topic5'])

In [None]:
reviews.head()

*Now get the number of reviews that met the criteria*

In [None]:
reviews.shape


#### Now that the topics have been generated, save a CSV to local disk and to Amazon S3
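
The notebook does not include a cell for this step, so the following is only a minimal sketch of one way to do it. The helper name `save_reviews_csv` and the file name in the usage comment are hypothetical. The only AWS call is `upload_file(Filename, Bucket, Key)` on a boto3 S3 client, which is passed in rather than created here, and the rows are plain dictionaries instead of a DataFrame so the block stays self-contained.

```python
import csv
import os


def save_reviews_csv(rows, local_path, s3_client=None, bucket=None, prefix='reviews'):
    """Write rows (a list of dicts) to a CSV file and optionally upload it to S3.

    Returns the S3 key that was used, or None if nothing was uploaded.
    """
    fieldnames = ['drugName', 'topic1', 'topic2', 'topic3', 'topic4', 'topic5']
    with open(local_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    key = None
    if s3_client is not None and bucket is not None:
        # Store the file under <prefix>/<file name> in the given bucket.
        key = '%s/%s' % (prefix, os.path.basename(local_path))
        s3_client.upload_file(local_path, bucket, key)
    return key


# In this notebook the call might look like this (not run here; the file
# name is an assumption):
# save_reviews_csv(reviews.to_dict('records'), 'reviews_50sample.csv',
#                  s3_client=s3_client, bucket=sagemaker_bucket,
#                  prefix=reviews_data_prefix)
```

With a DataFrame already in hand, `reviews.to_csv(path, header=True, index=False)` followed by a single `upload_file` call would do the same job.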

# Batch preparation of drug name/topics

Running a large number of calls serially through Amazon Comprehend Medical is inefficient. Comprehend Medical gives you the ability to batch process millions of notes in a single API call. For a high-level overview, see https://aws.amazon.com/blogs/aws/introducing-batch-mode-processing-for-amazon-comprehend-medical/.

## N.B. If you run the Comprehend Medical batch calculation across the entire dataset, you may incur charges, and the full analysis will not finish before this workshop ends. Do not do this as part of the workshop. We have provided files for you.

Here are the steps to do this for this data set:
1. For each line, extract the review and convert it to a JSON file named `<drug_id>.json` with the form `{"Text": "...review..."}`.
2. Upload each of the files to S3 with a shared key space. For example: 
```
lfs401-data/json/112.json
lfs401-data/json/3528.json
```
3. Use Amazon Comprehend Medical Batch Mode to process the files. You can either use the console or the Comprehend Medical API. Save the output to a different key space. The job is submitted asynchronously so you can poll for the job status.
4. Once the job is complete, you can access all the files in S3. Pull them back down to your instance to continue.
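
Steps 1 and 3 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that produced the provided files: the helper names `write_review_json` and `wait_for_job` are hypothetical, each row is assumed to be a dictionary with `id` and `review` keys, and the bucket, prefixes, and role ARN in the commented-out submission are placeholders. The Comprehend Medical client is passed in, and the polling helper uses `describe_entities_detection_v2_job` from boto3.

```python
import json
import os
import time


def write_review_json(rows, out_dir):
    """Step 1: write one {"Text": ...} JSON file per review, named <id>.json."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for row in rows:
        path = os.path.join(out_dir, '%s.json' % row['id'])
        with open(path, 'w', encoding='utf-8') as f:
            json.dump({'Text': row['review']}, f)
        paths.append(path)
    return paths


def wait_for_job(cm_client, job_id, poll_seconds=60):
    """Step 3: poll a batch job until it leaves the SUBMITTED/IN_PROGRESS states."""
    while True:
        response = cm_client.describe_entities_detection_v2_job(JobId=job_id)
        status = response['ComprehendMedicalAsyncJobProperties']['JobStatus']
        if status not in ('SUBMITTED', 'IN_PROGRESS'):
            return status
        time.sleep(poll_seconds)


# Submitting the job (not run here; all values below are placeholders):
# cm = boto3.client('comprehendmedical')
# job = cm.start_entities_detection_v2_job(
#     InputDataConfig={'S3Bucket': 'my-bucket', 'S3Key': 'json/'},
#     OutputDataConfig={'S3Bucket': 'my-bucket', 'S3Key': 'output/'},
#     DataAccessRoleArn='arn:aws:iam::123456789012:role/my-role',
#     LanguageCode='en')
# print(wait_for_job(cm, job['JobId']))
```

Passing the client in keeps the helpers easy to try with a stub object before any real job is submitted.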


## The post-processing is covered below. Resume with the following cell.



We have already downloaded the output archive (`output.tgz`) for you. Extract it and get the output path.

In [None]:
os.getcwd()

In [None]:
%%bash
tar xzf ../output.tgz

In [None]:
mypath = os.getcwd()
output_path = os.path.join(mypath,'output', [_ for _ in os.listdir('output') if os.path.isdir(os.path.join('output',_))][0])

print ('Comprehend Medical Output Path: %s' % output_path)
os.chdir(output_path)
os.getcwd()

#### Check the number of files; it should be 215063

In [None]:
! ls -1 | grep -v Manifest | wc -l

#### Next, iterate through each file. If the file is a valid output name, add it to a dictionary. The file format should look similar to `12345.json.out`, where the numeric value is the id for the entry
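
The cell below only skips hidden files and the manifest. If a stricter check on the file name is wanted, a regular expression is one option. The sketch assumes that every id is purely numeric and that the suffix is exactly `.json.out`, as in the example above; the helper name `entry_id_from_name` is hypothetical.

```python
import re

# Matches names such as "12345.json.out" and captures the numeric id.
OUTPUT_NAME = re.compile(r'(\d+)\.json\.out')


def entry_id_from_name(name):
    """Return the id embedded in an output file name, or None if it does not match."""
    match = OUTPUT_NAME.fullmatch(name)
    return match.group(1) if match else None


print(entry_id_from_name('12345.json.out'))
print(entry_id_from_name('Manifest'))
```

The id comes back as a string, which matches how the ids are read from the TSV file later on.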

In [None]:
def get_topics_from_file(filename):
    with open(filename) as f:
        return json.load(f)

In [None]:
results_dict = dict()
counter = 0
with os.scandir(output_path) as it:
    for entry in it:
        if not (entry.name.startswith('.') or entry.name.startswith('Manifest')) and entry.is_file():
            d = get_topics_from_file(entry)
            # The part before the first '.' is the entry id (avoid shadowing the built-in id)
            entry_id = entry.name.split('.')[0]

            results_dict[entry_id] = d
        counter += 1
        if counter % 10000 == 0:
            print (counter)

In [None]:
len(results_dict)

#### Read in original files and iterate through to create a list of python dictionaries containing `id`, `drug name`, `condition`, and `review`

In [None]:
filename = 'drugsCom_raw.tsv'
os.chdir(os.path.join(mypath, 'source'))
orig_list = list()

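# newline='' is what the csv module documentation recommends when opening files for csv.reader
with open(filename, newline='', encoding='utf-8') as csvfile: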
    myreader = csv.reader(csvfile, delimiter='\t')
    for row in myreader:
        if row[0] == '':
            continue
        else:
            orig_list.append({
                'id': row[0],
                'drugName': row[1],
                'condition': row[2],
                'review': row[3]
            })

#### Confirm that the number of results you loaded matches the number of entries in the original file

In [None]:
len(orig_list) == len(results_dict)

#### Now, create a list for each entry that contains the index, drug name, and list of topics identified by Comprehend Medical. This is effectively an application-side inner join on index between `orig_list` and `results_dict`
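
The cell below indexes `results_dict` directly, which raises a `KeyError` if any id has no matching output file. A join that tolerates missing ids could look roughly like the sketch here. The helper name `join_on_id` and the sample data are made up for illustration.

```python
def join_on_id(orig_list, results_dict):
    """Inner-join original entries with Comprehend Medical results on id.

    Returns (joined, missing): the matched rows and the ids with no result.
    """
    joined, missing = [], []
    for entry in orig_list:
        result = results_dict.get(entry['id'])
        if result is None:
            missing.append(entry['id'])
            continue
        joined.append({
            'id': entry['id'],
            'drugName': entry['drugName'],
            'Entities': result['Entities'],
        })
    return joined, missing


# Illustrative data only: id '2' has no result and is dropped by the join.
sample_orig = [{'id': '1', 'drugName': 'DrugA'}, {'id': '2', 'drugName': 'DrugB'}]
sample_results = {'1': {'Entities': [{'Category': 'MEDICAL_CONDITION', 'Text': 'headache'}]}}

joined, missing = join_on_id(sample_orig, sample_results)
print(len(joined), missing)
```

Since the length check earlier in the notebook is expected to pass, the direct lookup in the next cell should work as written; the sketch only matters if some output files are missing.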

In [None]:
topics_per_row = list()
topic_count = 0  # running count of MEDICAL_CONDITION entities found

for entry in orig_list:
    index = entry['id']
    drugName = entry['drugName']
    v = results_dict[index]

    topic_list = []
    for entity in v['Entities']:
        if entity['Category'] == 'MEDICAL_CONDITION':
            topic_list.append(entity['Text'])
            topic_count += 1
    
    topic_dict = dict(
        Index=index,
        DrugName=drugName,
        TopicList=topic_list
    )
    topics_per_row.append(topic_dict)

In [None]:
len(topics_per_row) == len(orig_list)

#### Let's look at an example

In [None]:
topics_per_row[0:5]

#### Now plot a histogram showing the distribution of number of topics per entry

In [None]:
topic_count = [ len(_['TopicList']) for _ in topics_per_row]
plt.hist(topic_count,bins=20,label='Topic Count', range=(0,30))
plt.xlabel('Number of topics')

#### For this example, we select 5 topics per entry. All entries with fewer than 5 relevant topics identified by Comprehend Medical are discarded. This yields ~80k distinct entries. 

In [None]:
%%time

min_topic_len = 5
counter = 0
review_list = []

for row in topics_per_row:
    topic_list = row['TopicList']
    if len(topic_list) >= min_topic_len:
        # Randomly select topics
        random_topics = random.choices(topic_list, k=min_topic_len)
        review_list.append({
            'drugName' : row['DrugName'],
            'topic1' : str(random_topics[0]),
            'topic2' : str(random_topics[1]),
            'topic3' : str(random_topics[2]),
            'topic4' : str(random_topics[3]),
            'topic5' : str(random_topics[4])})
        
reviews = pd.DataFrame(review_list)

In [None]:
reviews.shape

In [None]:
reviews.head()

#### Now that we have identified the topics, save the full file

In [None]:
def write_files_to_disk(sample_nums, reviews):
    for sample_num in sample_nums:
        print (sample_num)
        path = 'reviews_%ssample.csv' % str(sample_num)
        write_file_to_disk(path, sample_num, reviews)
    print ('all reviews')
    write_file_to_disk('reviews_all.csv', reviews.shape[0], reviews)

def write_file_to_disk(path, sample_num, reviews):
    sampled_reviews = reviews.sample(n=sample_num)
    sampled_reviews.to_csv(path, header=True, index=False)


In [None]:
sample_numbers = [1000, 2000, 5000, 10000, 20000, 50000]
write_files_to_disk(sample_numbers, reviews)

## Congrats! You are done with data prep. Move to the next notebook.