# Data Analytics in Healthcare

## Week 1 - Data Acquisition

Objectives: 

- Introduction to JSON
- Learn to get data from an API
- Use a scrapy spider to crawl a website
- Use Pandas to format JSON

In [1]:
import scrapy
import requests
import pandas as pd
from pprint import pprint
import json

#### Intro to JSON (from [this link](https://www.digitalocean.com/community/tutorials/an-introduction-to-json))

“JSON” stands for “JavaScript Object Notation”. It’s a lightweight, text-based format, and is frequently used in conjunction with REST APIs and web-based services. You can find more details on the specifics of the JSON format at the JSON web site.

The basic structures of JSON are:

* A set of name/value pairs

* An ordered list of values

A JSON object is a key-value data format that is typically rendered in curly braces. When you’re working with JSON, you’ll likely see JSON objects in a .json file, but they can also exist as a JSON object or string within the context of a program.

A JSON object looks something like this:


```
{
  "first_name" : "Sammy",
  "last_name" : "Shark",
  "location" : "Ocean",
  "online" : true,
  "followers" : 987 
}
```
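
In Python, the standard-library `json` module converts between JSON text and native objects. As a minimal sketch (the variable names here are illustrative), the object above can be parsed with `json.loads` and written back with `json.dumps`:

```python
import json

# The JSON object from above, held as a plain string.
raw = '''
{
  "first_name" : "Sammy",
  "last_name" : "Shark",
  "location" : "Ocean",
  "online" : true,
  "followers" : 987
}
'''

# json.loads parses JSON text into Python objects:
# object -> dict, true/false -> True/False, number -> int or float.
profile = json.loads(raw)

print(type(profile).__name__)
print(profile["first_name"], profile["online"], profile["followers"])

# json.dumps goes the other way, from Python objects back to JSON text.
round_trip = json.dumps(profile, indent=2)
```
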
JSON can store nested objects as well as nested arrays. These objects and arrays are passed as values assigned to keys, and typically consist of key-value pairs themselves.

Nested Objects

In the users.json file below, each of the four users ("sammy", "jesse", "drew", "jamie") has a nested JSON object as its value, with its own nested keys such as "username" and "location". The first nested JSON object is the value of "sammy".

```
{ 
  "sammy" : {
    "username"  : "SammyShark",
    "location"  : "Indian Ocean",
    "online"    : true,
    "followers" : 987
  },
  "jesse" : {
    "username"  : "JesseOctopus",
    "location"  : "Pacific Ocean",
    "online"    : false,
    "followers" : 432
  },
  "drew" : {
    "username"  : "DrewSquid",
    "location"  : "Atlantic Ocean",
    "online"    : false,
    "followers" : 321
  },
  "jamie" : {
    "username"  : "JamieMantisShrimp",
    "location"  : "Pacific Ocean",
    "online"    : true,
    "followers" : 654
  }
}
```

In the example above, curly braces are used throughout to form a nested JSON object with associated username and location data for each of four users. Just like any other value, when using objects, commas are used to separate elements.
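
Once parsed in Python, nested objects become nested dictionaries. The sketch below uses a shortened copy of users.json (two of the four users) to show how each level of nesting is reached with one more key lookup:

```python
import json

# A shortened version of users.json with two of the four users.
users = json.loads('''
{
  "sammy" : {"username" : "SammyShark", "location" : "Indian Ocean",
             "online" : true, "followers" : 987},
  "jesse" : {"username" : "JesseOctopus", "location" : "Pacific Ocean",
             "online" : false, "followers" : 432}
}
''')

# Each level of nesting is one more key lookup.
sammy_location = users["sammy"]["location"]

# Iterating over the outer dict visits each user's nested object.
online_users = [name for name, info in users.items() if info["online"]]

print(sammy_location)
print(online_users)
```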

Nested Arrays

Data can also be nested within the JSON format by using JavaScript arrays that are passed as a value. JavaScript uses square brackets [ ] on either end of its array type. Arrays are ordered collections and can contain values of differing data types.

We may use an array when we are dealing with a lot of data that can be easily grouped together, like when there are various websites and social media profiles associated with a single user.

A user profile for Sammy may look like this; the first nested array is the value of "websites":

```
{ 
  "first_name" : "Sammy",
  "last_name" : "Shark",
  "location" : "Ocean",
  "websites" : [ 
    {
      "description" : "work",
      "URL" : "https://www.digitalocean.com/"
    },
    {
      "description" : "tutorials",
      "URL" : "https://www.digitalocean.com/community/tutorials"
    }
  ],
  "social_media" : [
    {
      "description" : "twitter",
      "link" : "https://twitter.com/digitalocean"
    },
    {
      "description" : "facebook",
      "link" : "https://www.facebook.com/DigitalOceanCloudHosting"
    },
    {
      "description" : "github",
      "link" : "https://github.com/digitalocean"
    }
  ]
}
```
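
A JSON array becomes a Python list when parsed, so it can be indexed and looped over. The following sketch uses a trimmed copy of the profile (only the "websites" array is kept) to illustrate both:

```python
import json

profile = json.loads('''
{
  "first_name" : "Sammy",
  "websites" : [
    {"description" : "work", "URL" : "https://www.digitalocean.com/"},
    {"description" : "tutorials", "URL" : "https://www.digitalocean.com/community/tutorials"}
  ]
}
''')

# A JSON array becomes a Python list, so it supports indexing and iteration.
first_site = profile["websites"][0]["URL"]

# Collect one field from every element of the array.
descriptions = [site["description"] for site in profile["websites"]]

print(first_site)
print(descriptions)
```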

### 1. Get the data from the API

Go to the website https://clinicaltrialsapi.cancer.gov/v1/ and explore the API. Download data from the trial given as an example (NCT02194738) and view the data. The function given to you uses the GET function from the requests library in Python. pprint will allow you to view the JSON in a formatted manner. Be careful, these responses can be large!
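
Because a full response can run to thousands of lines, it may help to look at its overall shape before printing everything. The helper below is a hypothetical sketch (`summarize` is not part of any library), and the sample dictionary is a small stand-in for a real response:

```python
def summarize(obj):
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in obj.items():
        if isinstance(value, (list, dict)):
            # For containers, report the type and number of elements.
            summary[key] = "%s (%d items)" % (type(value).__name__, len(value))
        else:
            summary[key] = type(value).__name__
    return summary

# A tiny stand-in for an API response (illustrative values only).
sample = {
    "acronym": None,
    "anatomic_sites": ["Lung"],
    "amendment_date": "2016-03-23T00:00:00",
}

overview = summarize(sample)
print(overview)
```

Another option is pprint's own `depth` argument, for example `pprint(json_text, depth=1)`, which abbreviates anything nested below the first level.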

In [2]:
def get_response_json(api_args, api_root="https://clinicaltrialsapi.cancer.gov/v1/"):
    """
    This function returns the parsed JSON of a GET response
    
    arguments:
    api_args -- str, the arguments to the API
    api_root -- str, the root website of the API
    
    returns:
    parsed JSON response (dict or list)
    """
    return requests.get(api_root + api_args).json()

In [3]:
# Retrieve the data using the argument and store it in a variable
json_text = get_response_json(api_args="clinical-trial/NCT02194738")

In [4]:
# Use pprint to print the text
pprint(json_text)

{'accepts_healthy_volunteers_indicator': 'NO',
 'acronym': None,
 'amendment_date': '2016-03-23T00:00:00',
 'anatomic_sites': ['Lung'],
 'arms': [{'arm_description': 'Patients undergo collection of blood and tissue '
                              'samples for EGFR and ALK testing via direct '
                              'sequencing and FISH. Patients that have had '
                              'surgery prior to pre-registration will submit '
                              'samples from the previous surgery for testing.',
           'arm_name': 'Ancillary-Correlative (marker identification and '
                       'sequencing)',
           'arm_type': None,
           'interventions': [{'intervention_code': 'C38113',
                              'intervention_description': None,
                              'intervention_name': 'Cytology Specimen '
                                                   'Collection Procedure',
                              'intervention_type': 'Othe

           {'contact_email': None,
            'contact_name': 'Sherman Baker',
            'contact_phone': '804-628-1939',
            'generic_contact': None,
            'local_site_identifier': '',
            'org_address_line_1': '1701 Thomson Drive',
            'org_address_line_2': 'Alan B Pearson Regional Cancer Center Suite '
                                  '200',
            'org_city': 'Lynchburg',
            'org_coordinates': {'lat': 37.3527, 'lon': -79.1576},
            'org_country': 'United States',
            'org_email': None,
            'org_family': None,
            'org_fax': None,
            'org_name': 'Lynchburg Hematology-Oncology Clinic',
            'org_phone': '434-200-5925',
            'org_postal_code': '24501',
            'org_state_or_province': 'VA',
            'org_status': 'ACTIVE',
            'org_status_date': '2015-07-29',
            'org_to_family_relationship': None,
            'org_tty': None,
            'recruitment_status': 

            'org_name': 'Salina Regional Health Center',
            'org_phone': '785-452-7000',
            'org_postal_code': '67401',
            'org_state_or_province': 'KS',
            'org_status': 'ACTIVE',
            'org_status_date': '2013-01-17',
            'org_to_family_relationship': 'AFFILIATION',
            'org_tty': None,
            'recruitment_status': 'ACTIVE',
            'recruitment_status_date': '2015-05-06'},
           {'contact_email': 'ctnursenav@kumc.edu',
            'contact_name': 'Chao Hui Huang',
            'contact_phone': '913-945-7552',
            'generic_contact': None,
            'local_site_identifier': '',
            'org_address_line_1': '3901 Rainbow Boulevard',
            'org_address_line_2': None,
            'org_city': 'Kansas City',
            'org_coordinates': {'lat': 39.0552, 'lon': -94.6108},
            'org_country': 'United States',
            'org_email': 'ctnursenav@kumc.edu',
            'org_family': 'Universit

### 2. Query the API using arguments

Read the documentation for the API on the website, and retrieve the trial IDs of all trials in NY whose eligibility criteria were restricted to women and whose primary purpose was basic science.
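
Query arguments are chained together as `key=value` pairs joined by `&`. Rather than typing the string by hand, one option is to build it from a dictionary with the standard library's `urlencode`. In this sketch the helper `build_api_args` is hypothetical, and the field names are the ones used in the query in this section:

```python
from urllib.parse import urlencode

def build_api_args(endpoint, filters):
    """Join an endpoint and a dict of query parameters into one string."""
    return endpoint + "?" + urlencode(filters)

filters = {
    "eligibility.structured.gender": "female",
    "include": "nct_id",
    "sites.org_state_or_province": "NY",
}

api_args = build_api_args("clinical-trials", filters)
print(api_args)
```

Keeping the filters in a dictionary makes it easier to add or remove one condition at a time while exploring the API.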

In [5]:
# Use the API documentation to understand the nature of the arguments. Chain them together, and get the response.
# Note: the query below filters on gender and state only; it does not restrict by primary purpose.
api_args = "clinical-trials?eligibility.structured.gender=female&include=nct_id&sites.org_state_or_province=NY"

# store response
json_text = get_response_json(api_args=api_args)

# eyeball the response
pprint(json_text)

{'total': 212,
 'trials': [{'nct_id': 'NCT01272037'},
            {'nct_id': 'NCT02750826'},
            {'nct_id': 'NCT01953588'},
            {'nct_id': 'NCT00565851'},
            {'nct_id': 'NCT02311933'},
            {'nct_id': 'NCT01872975'},
            {'nct_id': 'NCT01101451'},
            {'nct_id': 'NCT02101788'},
            {'nct_id': 'NCT02065687'},
            {'nct_id': 'NCT02446600'}]}
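
The response reports 212 matching trials but lists only 10 of them, which suggests that results are returned one page at a time. A minimal sketch of generating one query per page follows. The paging parameter names `size` and `from` are assumptions to check against the API documentation, and `paged_api_args` is a hypothetical helper:

```python
from urllib.parse import urlencode

def paged_api_args(base_args, total, size=10):
    """Return one argument string per page needed to cover `total` results."""
    pages = []
    for offset in range(0, total, size):
        # ASSUMPTION: the API accepts "size" and "from" for paging.
        pages.append(base_args + "&" + urlencode({"size": size, "from": offset}))
    return pages

base = "clinical-trials?include=nct_id&sites.org_state_or_province=NY"
pages = paged_api_args(base, total=212, size=10)

print(len(pages))
print(pages[0])
print(pages[-1])
```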


### 3. Structure the JSON data

Pandas is a very useful Python library that allows you to view data in tabular form. It is very good (and fast!) for viewing CSV files and even connecting to databases. For this exercise, we shall use it to structure the JSON data in a more readable format.

Hint: use the function:

```pd.io.json.json_normalize()``` 

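(This will help it to read data from nested JSON files, i.e. where dictionaries and lists are embedded within others. In pandas 1.0 and later the same function is available as `pd.json_normalize()`, and the `pd.io.json` path is deprecated.)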
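
To see what "normalizing" nested JSON means, the pure-Python sketch below flattens nested dictionaries into a single level with dot-separated keys, which is the naming pattern json_normalize uses for its columns by default. This only illustrates the idea and is not how pandas implements it; `flatten` is a hypothetical helper:

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects; their keys get the parent's prefix.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as-is.
            flat[full_key] = value
    return flat

# A fragment shaped like one site entry from the trial record.
site = {
    "org_city": "Lynchburg",
    "org_coordinates": {"lat": 37.3527, "lon": -79.1576},
}

row = flatten(site)
print(row)
```

Each flattened record corresponds to one row of the table, with one column per key.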

In [8]:
# From the json text, retrieve the list of trials by using the key
trial_array = json_text['trials']

# Using pandas functions, convert the JSON into a DataFrame.
# (pd.json_normalize is the pandas >= 1.0 name; older versions use pd.io.json.json_normalize)
trial_df = pd.json_normalize(trial_array)

# Use the head function to view what is in the DataFrame. Note the header name "nct_id".
print(trial_df.head())

# Each column can be accessed using the name
trial_list = list(trial_df["nct_id"])

        nct_id
0  NCT01272037
1  NCT02750826
2  NCT01953588
3  NCT00565851
4  NCT02311933


### 4. Scrape Data using a spider

In this part, we shall use Scrapy (another Python library) to create a "Spider" that crawls websites to gather data.
The Scrapy documentation has a nice tutorial to get started at https://doc.scrapy.org/en/latest/intro/tutorial.html. We shall be modifying the code here for our purpose.

To get started you can type "!scrapy startproject tutorial" in a new cell. The leading "!" allows you to run command-line commands from within Jupyter, which can often be handy. The command will create a directory structure that can be used by your Scrapy commands.

The code from the tutorial is below:

Your objectives are:

1. Create a meaningful query (use the query from Part 2, if nothing comes to mind) to get data from the API
2. Then get ALL the data from the trials retrieved by your query, and save them in individual files.

Be sure to change the name of the variable as well as the class to something meaningful, and use "scrapy crawl _spiderName_" from the command line to run the spider.

HINT: Replace the url list in the code snippet below, and construct that list within the start_requests function. Remember to save the code (including the imports) in an appropriately named file (e.g. 

In [9]:
!scrapy startproject tutorial

New Scrapy project 'tutorial', using template directory '/home/sayan/anaconda3/lib/python3.5/site-packages/scrapy/templates/project', created in:
    /home/sayan/Masters/Career/UCSF/health-analytics-ucsf/week_1/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com


In [131]:
class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
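
To adapt the spider to the trials API, the URL list can be built from the trial IDs retrieved in Part 3, and the output filename can be derived from each URL. The helpers below are a hypothetical sketch of those two pieces only (the names `build_trial_urls` and `filename_for` are made up for illustration); the URL pattern follows the `clinical-trial/<id>` form used in Part 1:

```python
API_ROOT = "https://clinicaltrialsapi.cancer.gov/v1/"

def build_trial_urls(trial_ids, api_root=API_ROOT):
    """Return one clinical-trial URL per trial ID."""
    return [api_root + "clinical-trial/" + trial_id for trial_id in trial_ids]

def filename_for(url):
    """Derive an output filename from a trial URL."""
    # The trial ID is the last path segment ([-1]) of these URLs, not the
    # second-to-last ([-2]) as in the quotes example, whose URLs end in "/".
    trial_id = url.rstrip("/").split("/")[-1]
    return "trial-%s.json" % trial_id

urls = build_trial_urls(["NCT01272037", "NCT02750826"])
names = [filename_for(url) for url in urls]

print(urls)
print(names)
```

Inside a spider, `start_requests` could loop over the result of `build_trial_urls`, and `parse` could call `filename_for(response.url)` before writing `response.body` to disk.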

 - METAL https://github.com/Kitware/MetaIO
 - DICOM Waveform – An extension of DICOM for storing waveform data
 - European Data Format (.edf, .rec)
 - CSV (.csv), XML, TXT
 - ecgML – A markup language for electrocardiogram data acquisition and analysis.
 - EDF/EDF+ – European Data Format.
 - FEF – File Exchange Format for Vital signs, CEN TS 14271.
 - GDF v1.x – The General Data Format for biomedical signals – Version 1.x.
 - GDF v2.x – The General Data Format for biomedical signals – Version 2.x.
 - HL7aECG – Health Level 7 v3 annotated ECG.
 - MFER – Medical waveform Format Encoding Rules
 - OpenXDF – Open Exchange Data Format
 - TDMS (.tdms)
 - LVM (.lvm)
 - SCP-ECG – Standard Communication Protocol for Computer assisted electrocardiography EN1064:2007
 - SIGIF – A digital SIGnal Interchange Format with application in neurophysiology.
 - WFDB – Format of Physiobank