# Data Analytics in Healthcare

## Week 1 - Data Acquisition

Objectives: 

- Introduction to JSON
- Learn to get data from an API
- Use a scrapy spider to crawl a website
- Use Pandas to format JSON

In [3]:
import scrapy
import requests
import pandas as pd
from pprint import pprint
import json

#### Intro to JSON (from [this link](https://www.digitalocean.com/community/tutorials/an-introduction-to-json))

“JSON” stands for “JavaScript Object Notation”. It’s a lightweight, text-based format, and is frequently used in conjunction with REST APIs and web-based services. You can find more details on the specifics of the JSON format at the JSON web site.

The basic structures of JSON are:

* A set of name/value pairs

* An ordered list of values

A JSON object is a key-value data format that is typically rendered in curly braces. When you’re working with JSON, you’ll likely see JSON objects in a .json file, but they can also exist as a JSON object or string within the context of a program.

A JSON object looks something like this:


```
{
  "first_name" : "Sammy",
  "last_name" : "Shark",
  "location" : "Ocean",
  "online" : true,
  "followers" : 987 
}
```
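
To connect this to the Python used in the rest of the notebook, here is a minimal sketch of reading that same object with the standard `json` module. The variable names (`raw`, `profile`, `round_trip`) are illustrative only.

```python
import json

# The JSON text from the example above, held in a Python string.
raw = '''
{
  "first_name" : "Sammy",
  "last_name" : "Shark",
  "location" : "Ocean",
  "online" : true,
  "followers" : 987
}
'''

# json.loads parses JSON text into Python objects:
# a JSON object becomes a dict, true/false become True/False,
# and numbers become int or float.
profile = json.loads(raw)
print(profile["first_name"])
print(profile["online"])

# json.dumps goes the other way, from Python objects back to JSON text.
round_trip = json.dumps(profile, indent=2)
print(round_trip)
```

Note that the JSON literal `true` is lowercase, while the parsed Python value is the boolean `True`.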
JSON can store nested objects as well as nested arrays. These objects and arrays are passed as values assigned to keys, and nested objects typically consist of key-value pairs themselves.

Nested Objects

In the users.json file below, each of the four users ("sammy", "jesse", "drew", "jamie") has a nested JSON object as its value, with its own nested keys such as "username" and "location". The first nested JSON object is the value assigned to "sammy".

```
{ 
  "sammy" : {
    "username"  : "SammyShark",
    "location"  : "Indian Ocean",
    "online"    : true,
    "followers" : 987
  },
  "jesse" : {
    "username"  : "JesseOctopus",
    "location"  : "Pacific Ocean",
    "online"    : false,
    "followers" : 432
  },
  "drew" : {
    "username"  : "DrewSquid",
    "location"  : "Atlantic Ocean",
    "online"    : false,
    "followers" : 321
  },
  "jamie" : {
    "username"  : "JamieMantisShrimp",
    "location"  : "Pacific Ocean",
    "online"    : true,
    "followers" : 654
  }
}
```

In the example above, curly braces are used throughout to form a nested JSON object with associated username and location data for each of four users. Just like any other value, when using objects, commas are used to separate elements.
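
Once parsed in Python, a nested object is simply a dict inside a dict, so values are reached by chaining keys. The following is a small sketch using a shortened copy of the users data above; the variable names are illustrative.

```python
import json

# Shortened copy of the users.json example (same keys, same values).
users_json = '''
{
  "sammy" : { "username" : "SammyShark", "location" : "Indian Ocean", "online" : true, "followers" : 987 },
  "jesse" : { "username" : "JesseOctopus", "location" : "Pacific Ocean", "online" : false, "followers" : 432 },
  "drew"  : { "username" : "DrewSquid", "location" : "Atlantic Ocean", "online" : false, "followers" : 321 },
  "jamie" : { "username" : "JamieMantisShrimp", "location" : "Pacific Ocean", "online" : true, "followers" : 654 }
}
'''

users = json.loads(users_json)

# Chain keys to reach a value inside a nested object.
sammy_location = users["sammy"]["location"]
print(sammy_location)

# Iterate over the outer object to filter on a nested field.
online_users = sorted(name for name, info in users.items() if info["online"])
print(online_users)
```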

Nested Arrays

Data can also be nested within the JSON format by using JavaScript arrays that are passed as a value. JavaScript uses square brackets [ ] on either end of its array type. Arrays are ordered collections and can contain values of differing data types.

We may use an array when we are dealing with a lot of data that can be easily grouped together, like when there are various websites and social media profiles associated with a single user.

A user profile for Sammy may look like this, where the first nested array is the value assigned to "websites":

```
{ 
  "first_name" : "Sammy",
  "last_name" : "Shark",
  "location" : "Ocean",
  "websites" : [ 
    {
      "description" : "work",
      "URL" : "https://www.digitalocean.com/"
    },
    {
      "description" : "tutorials",
      "URL" : "https://www.digitalocean.com/community/tutorials"
    }
  ],
  "social_media" : [
    {
      "description" : "twitter",
      "link" : "https://twitter.com/digitalocean"
    },
    {
      "description" : "facebook",
      "link" : "https://www.facebook.com/DigitalOceanCloudHosting"
    },
    {
      "description" : "github",
      "link" : "https://github.com/digitalocean"
    }
  ]
}
```
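
In Python, a nested JSON array becomes a list, so it can be indexed by position or looped over. Below is a brief sketch on a trimmed version of the profile above; the variable names are illustrative.

```python
import json

# Trimmed version of Sammy's profile, keeping both nested arrays.
profile_json = '''
{
  "first_name" : "Sammy",
  "websites" : [
    { "description" : "work", "URL" : "https://www.digitalocean.com/" },
    { "description" : "tutorials", "URL" : "https://www.digitalocean.com/community/tutorials" }
  ],
  "social_media" : [
    { "description" : "twitter", "link" : "https://twitter.com/digitalocean" },
    { "description" : "github", "link" : "https://github.com/digitalocean" }
  ]
}
'''

profile = json.loads(profile_json)

# Arrays keep their order, so the first website is at index 0.
first_site = profile["websites"][0]["URL"]
print(first_site)

# A list comprehension pulls one field out of every element of an array.
social_links = [entry["link"] for entry in profile["social_media"]]
print(social_links)
```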

### 1. Get the data from the API

Go to the website https://clinicaltrialsapi.cancer.gov/v1/ and explore the API. Download the data for the trial given as an example (NCT02194738) and view it. The function provided below uses the `get` function from the Python requests library. pprint lets you view the JSON in a formatted manner. Be careful, these responses can be large!

In [4]:
def get_response_json(api_args, api_root="https://clinicaltrialsapi.cancer.gov/v1/"):
    """
    Return the parsed JSON body of a GET response.
    
    arguments:
    api_args -- str, the arguments to the API (endpoint path and query string)
    api_root -- str, the root URL of the API
    
    returns:
    the parsed JSON response (a dict for this API), not a raw string
    """
    return requests.get(api_root + api_args).json()

In [5]:
# Retrieve the data using the argument and store it in a variable
json_text = get_response_json(api_args="clinical-trial/NCT02194738")

In [6]:
# Use pprint to print the text
pprint(json_text)

{u'accepts_healthy_volunteers_indicator': u'NO',
 u'acronym': None,
 u'amendment_date': u'2016-03-23T00:00:00',
 u'anatomic_sites': [u'Lung'],
 u'arms': [{u'arm_description': u'Patients undergo collection of blood and tissue samples for EGFR and ALK testing via direct sequencing and FISH. Patients that have had surgery prior to pre-registration will submit samples from the previous surgery for testing.',
            u'arm_name': u'Ancillary-Correlative (marker identification and sequencing)',
            u'arm_type': None,
            u'interventions': [{u'intervention_code': u'C38113',
                                u'intervention_description': None,
                                u'intervention_name': u'Cytology Specimen Collection Procedure',
                                u'intervention_type': u'Other',
                                u'synonyms': [u'Cytology Specimen Collection Procedure',
                                              u'Cytologic Sampling']},
                  

                                    {u'description': u'Note: Post-surgical patients should proceed to registration immediately following preregistration',
                                     u'display_order': 11,
                                     u'inclusion_indicator': True},
                                    {u'description': u'PATIENT REGISTRATION ELIGIBILITY CRITERIA:',
                                     u'display_order': 12,
                                     u'inclusion_indicator': True},
                                    {u'description': u'Completely resected NSCLC; patients with squamous cell carcinoma are eligible only if the registering site has EA5142 IRB approved',
                                     u'display_order': 13,
                                     u'inclusion_indicator': True},
                                    {u'description': u'Pathologic stage IIIA, II, or large IB (defined as size >= 4 cm)',
                                     u'display_order':

             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2014-09-22'},
            {u'contact_email': u'som_dcop@wright.edu',
             u'contact_name': u'Howard M. Gross',
             u'contact_phone': u'937-775-1350',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'9000 North Main Street',
             u'org_address_line_2': None,
             u'org_city': u'Dayton',
             u'org_coordinates': {u'lat': 39.8348, u'lon': -84.2576},
             u'org_country': u'United States',
             u'org_email': u'som_dcop@wright.edu',
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Samaritan North Health Center',
             u'org_phone': u'913-775-1350',
             u'org_postal_code': u'45415',
             u'org_state_or_province': u'OH',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2016-12-14',
     

             u'org_state_or_province': u'MI',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2008-12-30',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2014-10-03'},
            {u'contact_email': None,
             u'contact_name': u'Philip J. Stella',
             u'contact_phone': u'734-712-4673',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'22101 Moross Road',
             u'org_address_line_2': None,
             u'org_city': u'Detroit',
             u'org_coordinates': {u'lat': 42.4243, u'lon': -82.8975},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Saint John Hospital and Medical Center',
             u'org_phone': u'313-343-3166

             u'recruitment_status_date': u'2014-10-10'},
            {u'contact_email': None,
             u'contact_name': u'John Allan Ellerton',
             u'contact_phone': u'702-384-0013',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'7445 Peak Drive',
             u'org_address_line_2': None,
             u'org_city': u'Las Vegas',
             u'org_coordinates': {u'lat': 36.1831, u'lon': -115.2587},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Comprehensive Cancer Centers of Nevada - Northwest',
             u'org_phone': u'702-384-0013',
             u'org_postal_code': u'89128',
             u'org_state_or_province': u'NV',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2009-10-18',
             u'org_to_family_relationship': None,
             u'o

            {u'contact_email': u'bernicl@sutterhealth.org',
             u'contact_name': u'Eunpi Cho',
             u'contact_phone': u'415-209-2686',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'301 Old San Francisco Road',
             u'org_address_line_2': None,
             u'org_city': u'Sunnyvale',
             u'org_coordinates': {u'lat': 37.3717, u'lon': -122.0255},
             u'org_country': u'United States',
             u'org_email': u'bernicl@sutterhealth.org',
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Palo Alto Medical Foundation-Sunnyvale',
             u'org_phone': u'(408)-730-2800',
             u'org_postal_code': u'94086',
             u'org_state_or_province': u'CA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2016-03-18',
             u'org_to_family_relationship': None,
             u'org_tty': None,
     

             u'org_coordinates': {u'lat': 40.0429, u'lon': -75.48},
             u'org_country': u'United States',
             u'org_email': u'ewend@mlhs.org',
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Paoli Memorial Hospital',
             u'org_phone': u'610-648-1637',
             u'org_postal_code': u'19301',
             u'org_state_or_province': u'PA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2008-12-31',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2015-02-20'},
            {u'contact_email': u'ewend@mlhs.org',
             u'contact_name': u'Albert S. DeNittis',
             u'contact_phone': u'484-476-2649',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'130 South Bryn Mawr Avenue',
             u

             u'org_coordinates': {u'lat': 33.7714, u'lon': -84.3774},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': u'Winship Cancer Institute of Emory University',
             u'org_fax': None,
             u'org_name': u'Emory University Hospital Midtown',
             u'org_phone': u'888-946-7447',
             u'org_postal_code': u'30308',
             u'org_state_or_province': u'GA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2015-01-22',
             u'org_to_family_relationship': u'ORGANIZATIONAL',
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2014-11-06'},
            {u'contact_email': None,
             u'contact_name': u'Howard A. Zaren',
             u'contact_phone': u'912-819-5704',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'225 

            {u'contact_email': None,
             u'contact_name': u'Tatjana Kolevska',
             u'contact_phone': u'626-564-3455',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'8100 Bruceville Road',
             u'org_address_line_2': None,
             u'org_city': u'Sacramento',
             u'org_coordinates': {u'lat': 38.4748, u'lon': -121.4432},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'South Sacramento Cancer Center',
             u'org_phone': u'916-683-9616',
             u'org_postal_code': u'95823',
             u'org_state_or_province': u'CA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2015-07-24',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
            

            {u'contact_email': None,
             u'contact_name': u'Sherman Baker',
             u'contact_phone': u'804-628-1939',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'6605 West Broad Street',
             u'org_address_line_2': None,
             u'org_city': u'Richmond',
             u'org_coordinates': {u'lat': 37.5861, u'lon': -77.4907},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Virginia Cancer Institute',
             u'org_phone': u'804-287-3000',
             u'org_postal_code': u'23230',
             u'org_state_or_province': u'VA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2015-07-29',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recrui

             u'org_city': u'Los Angeles',
             u'org_coordinates': {u'lat': 34.1252, u'lon': -118.292},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Kaiser Permanente Los Angeles Medical Center',
             u'org_phone': u'626-564-3455',
             u'org_postal_code': u'90027',
             u'org_state_or_province': u'CA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2015-07-23',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2015-02-27'},
            {u'contact_email': None,
             u'contact_name': u'Han A. Koh',
             u'contact_phone': u'626-564-3455',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'6650 Alton P

             u'org_address_line_1': u'1700 South Potomac Street',
             u'org_address_line_2': None,
             u'org_city': u'Aurora',
             u'org_coordinates': {u'lat': 39.6993, u'lon': -104.8375},
             u'org_country': u'United States',
             u'org_email': u'kgeisen@co-cancerresearch.org',
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Rocky Mountain Cancer Centers-Aurora',
             u'org_phone': u'888-259-7622',
             u'org_postal_code': u'80012',
             u'org_state_or_province': u'CO',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2014-01-27',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2015-03-10'},
            {u'contact_email': None,
             u'contact_name': u'Mehmet Sitki Copur',
             u'contact_phone': u'800-998-2119',
 

             u'org_fax': None,
             u'org_name': u'Loma Linda University Medical Center',
             u'org_phone': u'909-558-4000\r909-558-4800\r909-558-4647\r909-558-3375',
             u'org_postal_code': u'92354',
             u'org_state_or_province': u'CA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2010-08-09',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2015-04-30'},
            {u'contact_email': None,
             u'contact_name': u'Bret E.B. Friday',
             u'contact_phone': u'888-203-7267',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'1702 South University Drive',
             u'org_address_line_2': None,
             u'org_city': u'Fargo',
             u'org_coordinates': {u'lat': 46.8551, u'lon': -96.824},
             u'org_co

             u'org_address_line_1': u'240 Indian River Road',
             u'org_address_line_2': u'Building A Suite 1A',
             u'org_city': u'Orange',
             u'org_coordinates': {u'lat': 41.2827, u'lon': -73.0273},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': u'203-795-1665',
             u'org_name': u'Smilow Cancer Hospital-Orange Care Center',
             u'org_phone': u'203-795-1664',
             u'org_postal_code': u'06477',
             u'org_state_or_province': u'CT',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2013-05-25',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2015-07-23'},
            {u'contact_email': None,
             u'contact_name': u'William Patrick Fusselman',
             u'contact_phone': u'319-297-

             u'org_name': u'Bon Secours Saint Francis Medical Center',
             u'org_phone': u'804-594-4900',
             u'org_postal_code': u'23114',
             u'org_state_or_province': u'VA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2014-01-28',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2015-09-16'},
            {u'contact_email': u'angie.yrayta@hhchealth.org',
             u'contact_name': u'Wylie David Hosmer',
             u'contact_phone': u'860-224-5900',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'80 Seymour Street',
             u'org_address_line_2': None,
             u'org_city': u'Hartford',
             u'org_coordinates': {u'lat': 41.7569, u'lon': -72.6855},
             u'org_country': u'United States',
             u'org_ema

             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2016-01-18'},
            {u'contact_email': u'CTO@hmc.psu.edu',
             u'contact_name': u'Chandra P. Belani',
             u'contact_phone': u'717-531-3779',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'1800 East Park Avenue',
             u'org_address_line_2': None,
             u'org_city': u'State College',
             u'org_coordinates': {u'lat': 40.8081, u'lon': -77.9215},
             u'org_country': u'United States',
             u'org_email': u'smoyer@mountnittany.org',
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Mount Nittany Medical Center',
             u'org_phone': u'814-231-7000',
             u'org_postal_code': u'16803',
             u'org_state_or_province': u'PA',
    

             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2016-08-25'},
            {u'contact_email': None,
             u'contact_name': u'John Allan Ellerton',
             u'contact_phone': u'702-384-0013',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'3150 Tenaya Way',
             u'org_address_line_2': u'Suite 200',
             u'org_city': u'Las Vegas',
             u'org_coordinates': {u'lat': 36.1831, u'lon': -115.2587},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Ann M Wierman MD LTD',
             u'org_phone': u'702-749-3700',
             u'org_postal_code': u'89128',
             u'org_state_or_province': u'NV',
             u'org_status': u'ACTIVE',
        

             u'org_state_or_province': u'MI',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2015-03-31',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2017-02-08'},
            {u'contact_email': None,
             u'contact_name': u'Jeffrey Henry Muler',
             u'contact_phone': u'419-824-1842',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'2142 North Cove Boulevard',
             u'org_address_line_2': None,
             u'org_city': u'Toledo',
             u'org_coordinates': {u'lat': 41.6724, u'lon': -83.6099},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u"The Toledo Hospital / Toledo Children's Hospital",
             u'org_ph

             u'org_coordinates': {u'lat': 39.1183, u'lon': -88.6059},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Carle Physician Group-Effingham',
             u'org_phone': u'217-347-6400',
             u'org_postal_code': u'62401',
             u'org_state_or_province': u'IL',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2015-07-09',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2014-11-12'},
            {u'contact_email': u'kcheek@dmhhs.org',
             u'contact_name': u'James Lloyd Wade',
             u'contact_phone': u'217-876-4740',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'450 Mayo Drive',
             u'org_address_li

             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2017-01-04'},
            {u'contact_email': None,
             u'contact_name': u'Robert Anthony Chapman',
             u'contact_phone': u'313-916-1784',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'15855 Nineteen Mile Road',
             u'org_address_line_2': None,
             u'org_city': u'Clinton Township',
             u'org_coordinates': {u'lat': 42.5939, u'lon': -82.9509},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Henry Ford Macomb Hospital-Clinton Township',
             u'org_phone': u'313-916-1784',
             u'org_postal_code': u'48038',
             u'org_state_or_province': u'MI',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2017-01-19',
        

             u'org_address_line_1': u'4647 Zion Avenue',
             u'org_address_line_2': None,
             u'org_city': u'San Diego',
             u'org_coordinates': {u'lat': 32.7956, u'lon': -117.0709},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Kaiser Permanente-San Diego Zion',
             u'org_phone': u'626-564-3455',
             u'org_postal_code': u'92120',
             u'org_state_or_province': u'CA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2015-07-23',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2015-02-27'},
            {u'contact_email': u'bernicl@sutterhealth.org',
             u'contact_name': u'Eunpi Cho',
             u'contact_phone': u'415-209-2686',
             u'generic_co

             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'13 Wolf Creek Drive',
             u'org_address_line_2': u'Suite 1',
             u'org_city': u'Swansea',
             u'org_coordinates': {u'lat': 38.5483, u'lon': -89.9999},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': None,
             u'org_fax': None,
             u'org_name': u'Cancer Care Specialists of Illinois-Swansea',
             u'org_phone': u'618-416-7970',
             u'org_postal_code': u'62226',
             u'org_state_or_province': u'IL',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2015-04-10',
             u'org_to_family_relationship': None,
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2015-04-15'},
            {u'contact_email': u'crcwm-regulatory@crcwm.org',
             u'

             u'recruitment_status_date': u'2015-01-26'},
            {u'contact_email': None,
             u'contact_name': u'Matthew James Schuchert',
             u'contact_phone': u'412-647-2811',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'200 Lothrop Street',
             u'org_address_line_2': None,
             u'org_city': u'Pittsburgh',
             u'org_coordinates': {u'lat': 40.4428, u'lon': -79.9532},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': u'University of Pittsburgh Cancer Institute',
             u'org_fax': None,
             u'org_name': u'UPMC-Presbyterian Hospital',
             u'org_phone': u'412-647-2811',
             u'org_postal_code': u'15213',
             u'org_state_or_province': u'PA',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2015-05-05',
             u'org_to_family_relationship':

             u'org_coordinates': {u'lat': 41.1783, u'lon': -111.9156},
             u'org_country': u'United States',
             u'org_email': None,
             u'org_family': u'Huntsman Cancer Institute',
             u'org_fax': None,
             u'org_name': u'McKay-Dee Hospital Center',
             u'org_phone': u'801-387-7426',
             u'org_postal_code': u'84403',
             u'org_state_or_province': u'UT',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2008-12-31',
             u'org_to_family_relationship': u'AFFILIATION',
             u'org_tty': None,
             u'recruitment_status': u'ACTIVE',
             u'recruitment_status_date': u'2015-12-23'},
            {u'contact_email': u'research@utahcancer.com',
             u'contact_name': u'William Elliott Nibley',
             u'contact_phone': u'801-933-6070',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'170 

             u'contact_name': u'Ramaswamy Govindan',
             u'contact_phone': u'800-600-3606',
             u'generic_contact': None,
             u'local_site_identifier': u'',
             u'org_address_line_1': u'12634 Olive Boulevard',
             u'org_address_line_2': None,
             u'org_city': u'Creve Coeur',
             u'org_coordinates': {u'lat': 38.6572, u'lon': -90.4582},
             u'org_country': u'United States',
             u'org_email': u'info@siteman.wustl.edu',
             u'org_family': u'Alvin J. Siteman Cancer Center',
             u'org_fax': None,
             u'org_name': u'Barnes-Jewish West County Hospital',
             u'org_phone': None,
             u'org_postal_code': u'63141',
             u'org_state_or_province': u'MO',
             u'org_status': u'ACTIVE',
             u'org_status_date': u'2014-02-13',
             u'org_to_family_relationship': u'ORGANIZATIONAL',
             u'org_tty': None,
             u'recruitment_status': u

### 2. Query the API using arguments

Read the documentation for the API on the website, and retrieve the trial IDs for all trials in NY where the eligibility criteria were restricted to women and whose primary purpose was basic science.

In [7]:
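# Use the API documentation to understand the nature of the arguments. Chain them together, and get the response.
# Note: this query string filters on gender and state only; it does not include a primary-purpose filter.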
api_args = "clinical-trials?eligibility.structured.gender=female&include=nct_id&sites.org_state_or_province=NY"

# store response
json_text = get_response_json(api_args=api_args)

# eyeball the response
pprint(json_text)

{u'total': 212,
 u'trials': [{u'nct_id': u'NCT01272037'},
             {u'nct_id': u'NCT02750826'},
             {u'nct_id': u'NCT01953588'},
             {u'nct_id': u'NCT00565851'},
             {u'nct_id': u'NCT02311933'},
             {u'nct_id': u'NCT01872975'},
             {u'nct_id': u'NCT01101451'},
             {u'nct_id': u'NCT02101788'},
             {u'nct_id': u'NCT02065687'},
             {u'nct_id': u'NCT02446600'}]}
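
The response above reports a total of 212 trials but lists only ten IDs, which suggests the API returns results in pages. The sketch below shows one way to loop over pages. It assumes the API accepts `size` and `from` query parameters for paging; those names are an assumption to verify against the API documentation. `collect_trial_ids` and `fake_fetch` are hypothetical helpers, and `fake_fetch` stands in for the real API so the sketch runs offline.

```python
from urllib.parse import parse_qs, urlencode


def collect_trial_ids(fetch, filters, page_size=50):
    """Collect nct_id values across all result pages.

    fetch     -- callable that takes the API argument string and returns
                 the parsed JSON (for example, get_response_json)
    filters   -- dict of query parameters
    page_size -- results per request ("size" and "from" are ASSUMED names)
    """
    ids = []
    offset = 0
    while True:
        params = dict(filters, include="nct_id", size=page_size)
        params["from"] = offset  # "from" is a Python keyword, so set it by key
        page = fetch("clinical-trials?" + urlencode(params))
        trials = page.get("trials", [])
        ids.extend(trial["nct_id"] for trial in trials)
        offset += len(trials)
        # Stop on an empty page or once every reported trial has been seen.
        if not trials or offset >= page.get("total", 0):
            break
    return ids


def fake_fetch(api_args):
    """Offline stand-in for the API: serves seven made-up IDs in pages."""
    query = parse_qs(api_args.split("?", 1)[1])
    start, size = int(query["from"][0]), int(query["size"][0])
    all_ids = ["NCT%08d" % i for i in range(7)]
    page = all_ids[start:start + size]
    return {"total": len(all_ids), "trials": [{"nct_id": i} for i in page]}


ids = collect_trial_ids(fake_fetch, {"sites.org_state_or_province": "NY"}, page_size=3)
print(len(ids))
```

In the notebook, passing `get_response_json` as `fetch` would issue the real requests, provided the paging parameter names match what the API expects.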


### 3. Structure the JSON data

Pandas is a very useful Python library that lets you view data in tabular form. It is very good (and fast!) for viewing CSV files and even connecting to databases. For this exercise, we shall use it to structure the JSON data in a more readable format.

Hint: use the function:

```pd.json_normalize()```

(In older pandas releases this function lives at `pd.io.json.json_normalize()`. It helps to read data from nested JSON, i.e. where dictionaries and lists are embedded within others.)
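
To make "nested" concrete, here is a rough, standard-library-only illustration of the flattening idea. It is not the pandas implementation; `flatten` is a hypothetical helper written for this sketch, and the sample record reuses a few fields from the trial output in Part 1. `json_normalize` applies a similar idea to every record and, with its default separator, joins nested keys with a dot to form column names.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key along.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# A few fields from one site record in the Part 1 output.
site = {
    "org_name": "Samaritan North Health Center",
    "org_city": "Dayton",
    "org_coordinates": {"lat": 39.8348, "lon": -84.2576},
}

row = flatten(site)
print(sorted(row))
```

Each flattened key would correspond to one column in the resulting table, for example `org_coordinates.lat`.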

In [8]:
# From the JSON text, retrieve the list of trials by using the key
trial_array = json_text['trials']

# Using pandas functions, convert the JSON into a DataFrame.
# pd.json_normalize replaces the older pd.io.json.json_normalize (pandas 1.0 and later).
trial_df = pd.json_normalize(trial_array)

# Use the head function to view what is in the data frame. Note the header name "nct_id".
print(trial_df.head())

# Each column can be accessed using its name
trial_list = list(trial_df["nct_id"])

        nct_id
0  NCT01272037
1  NCT02750826
2  NCT01953588
3  NCT00565851
4  NCT02311933


### 4. Scrape Data using a spider

In this part, we shall use Scrapy (another Python library) to create a "Spider" that crawls websites to gather data.
The Scrapy documentation has a nice tutorial to get started at https://doc.scrapy.org/en/latest/intro/tutorial.html. We shall be modifying the code here for our purpose.

To get started you can type "!scrapy startproject tutorial" in a new cell. The leading "!" lets you run shell commands from within Jupyter, which can often be handy. The command will create a directory structure that can be used by your Scrapy commands.

The code from the tutorial is below:

Your objectives are:

1. Create a meaningful query (use the query from Part 2, if nothing comes to mind) to get data from the API
2. Then get ALL the data from the trials retrieved by your query, and save them in individual files.

Be sure to change the value of the `name` variable as well as the class name to something meaningful, and use "scrapy crawl _spiderName_" from the command line to run the spider.

HINT: Replace the url list in the code snippet below, and construct that list within the start_requests function. Remember to save the code (including the imports) in an appropriately named file.

In [None]:
!scrapy startproject tutorial

In [None]:
class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
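
As a starting point for adapting the tutorial spider, the sketch below shows two small helpers that need only the standard library: one builds the per-trial URL list, and one derives an output filename from a URL. `build_trial_urls` and `filename_for` are hypothetical names. The `clinical-trial/<id>` path follows the request used in Part 1, and the `trial-<id>.json` naming is just one possible choice.

```python
API_ROOT = "https://clinicaltrialsapi.cancer.gov/v1/"


def build_trial_urls(trial_ids, api_root=API_ROOT):
    """Return one API URL per trial ID, using the Part 1 endpoint."""
    return [api_root + "clinical-trial/" + trial_id for trial_id in trial_ids]


def filename_for(url):
    """Derive an output filename from the last path segment of the URL."""
    trial_id = url.rstrip("/").split("/")[-1]
    return "trial-%s.json" % trial_id


# Inside a spider, these could be used roughly as follows:
#   start_requests: for url in build_trial_urls(trial_ids): yield scrapy.Request(...)
#   parse:          with open(filename_for(response.url), 'wb') as f: f.write(response.body)
urls = build_trial_urls(["NCT01272037", "NCT02750826"])
for url in urls:
    print(url, filename_for(url))
```

Keeping the URL and filename logic in plain functions makes it easy to check them in a notebook cell before running the spider.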

 - METAL https://github.com/Kitware/MetaIO
 - DICOM – Waveform An extension of Dicom for storing waveform data
 - European Data Format (.edf, .rec)
 - CSV (.csv),XML,TXT
 - ecgML – A markup language for electrocardiogram data acquisition and analysis.
 - EDF/EDF+ – European Data Format.
 - FEF – File Exchange Format for Vital signs, CEN TS 14271.
 - GDF v1.x – The General Data Format for biomedical signals – Version 1.x.
 - GDF v2.x – The General Data Format for biomedical signals – Version 2.x.
 - HL7aECG – Health Level 7 v3 annotated ECG.
 - MFER – Medical waveform Format Encoding Rules
 - OpenXDF – Open Exchange Data Format
 - TDMS (.tdms)
 - LVM (.lvm)
 - SCP-ECG – Standard Communication Protocol for Computer assisted electrocardiography EN1064:2007,
 - SIGIF – A digital SIGnal Interchange Format with application in neurophysiology.
 - WFDB – Format of Physiobank