![TAP](https://avatars2.githubusercontent.com/u/13385739?v=3&s=200 "TAP")
# Module 1: Data Ingest
This module gives an overview of how to load data into the Toolkit from a variety of formats.

#### Import the necessary libraries and connect to the server:

In [None]:
import json
import os.path
import sys  # needed for sys.stderr in the XML helper below
import trustedanalytics as ia
import xml.etree.ElementTree as ET

# Connect to the analytics server...
ia.connect()

In [None]:
# CONSTANTS...
HDFS_DATADIR_PATH = "data/TAPfest"
CSVFILENAME = "mtrees2015.bin"
XMLDIRNAME = "drugbank.xml"
JSONDIRNAME = "Inpat"

### Datasets!
Let's examine them:


|Name   |Size   |nRecords   |Format   |Comment   |
|---|---|---|---|---|
|MeSH   |2MB   |56,341   |_.csv_   |c/o the National Library of Medicine   |
|PubChem   |1.4GB   |36,069   |_.xml_   |c/o the National Library of Medicine   |
|Drugbank   |223MB   |7,740   |_.xml_   |Data set of curated drug metadata   |
|Inpat   |410MB   |474   |_.json_   |PennMed inpatient data (subset)   |



#### Importing _csv_

In [None]:
# Each record has two string columns, separated by ";" in the source file.
mesh_schema = [("NAME", str), ("PATH", str)]
mesh_csv = ia.CsvFile(os.path.join(HDFS_DATADIR_PATH, CSVFILENAME), mesh_schema, delimiter=";")
mesh_frame = ia.Frame(source=mesh_csv, name="tutorial_mesh_frame")
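
To make the two-column schema and the `;` delimiter concrete, here is a minimal local sketch that needs no server connection. It only illustrates how one line maps onto the `NAME` and `PATH` columns. The sample lines and the helper `split_record` are hypothetical and are not part of the Toolkit, and the sketch does not claim to reproduce how `ia.CsvFile` parses files internally.

```python
# Hypothetical sample lines in "NAME;PATH" form (illustrative only,
# not read from the actual MeSH file).
SAMPLE_LINES = ["Body Regions;A01", "Abdomen;A01.923.047"]

# Column names mirror the schema used for the MeSH frame above.
COLUMN_NAMES = ["NAME", "PATH"]

def split_record(line, delimiter=";"):
    """Split one line on the delimiter and pair the fields with the column names."""
    fields = line.split(delimiter)
    if len(fields) != len(COLUMN_NAMES):
        raise ValueError("expected %d fields, got %d" % (len(COLUMN_NAMES), len(fields)))
    return dict(zip(COLUMN_NAMES, fields))

# One dict per line, keyed by column name.
sample_rows = [split_record(line) for line in SAMPLE_LINES]
```

A line with a missing or extra field raises `ValueError` here, which is a handy local check before uploading a file whose delimiter you are unsure of.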

#### Importing _xml_

In [None]:
def parse_xml_to_frame(path, tag, name):
    """
    Helper function to convert an XML file on HDFS into a data frame.
    """
    xml = ia.XmlFile(path, tag)
    
    # Check that the frame doesn't already exist. Drop it, if it does...
    if name in ia.get_frame_names():
        sys.stderr.write("Dropping existing frame named {NAME}...\n".format(NAME=name))
        ia.drop_frames(name)
    frame = ia.Frame(xml, name=name)
    return frame

In [None]:
drugbank_frame = parse_xml_to_frame(path=os.path.join(HDFS_DATADIR_PATH, XMLDIRNAME), tag="drug", name="tutorial_drugbank_frame")
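
The `tag` argument names the element that delimits records. As a rough local illustration, assuming each `<drug>` element corresponds to one record, the standard-library `ElementTree` module can pick out those elements from a small document. The sample XML and the helper `records_for_tag` are hypothetical; the real Drugbank file is far larger and may use XML namespaces that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document (illustrative only, not taken from Drugbank).
SAMPLE_XML = """
<drugbank>
  <drug><name>Example Drug A</name></drug>
  <drug><name>Example Drug B</name></drug>
</drugbank>
"""

def records_for_tag(xml_text, tag):
    """Return every element named `tag` found in the XML text."""
    root = ET.fromstring(xml_text.strip())
    return list(root.iter(tag))

# Each <drug> element is treated as one record in this sketch.
drug_records = records_for_tag(SAMPLE_XML, "drug")
drug_names = [record.findtext("name") for record in drug_records]
```

Trying this on a small excerpt of your own file is a quick way to confirm that the tag name you pass matches the elements you expect.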

#### Importing _json_

In [None]:
inpat_json = ia.JsonFile(os.path.join(HDFS_DATADIR_PATH, JSONDIRNAME))
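# Drop the frame first if it already exists, so this cell can be re-run.
if "tutorial_inpat_frame" in ia.get_frame_names():
    ia.drop_frames(["tutorial_inpat_frame"])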
inpat_frame = ia.Frame(inpat_json, name="tutorial_inpat_frame")

In [None]:
def extract_PATID(row):
    """Return the PATID field of the JSON string in the row's first column, or 'None' if it is absent."""
    record = json.loads(row[0])
    return record.get('PATID', 'None')

In [None]:
inpat_frame.add_columns(extract_PATID, [("PATIENT_ID", str)])
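
Before running the extraction on the server, it can help to check the logic locally. The sketch below applies the same `PATID` lookup to two hypothetical rows, each a list whose first element is a JSON string, which is the row shape the function above assumes. The sample records, the `AGE` field, and the name `extract_patid_local` are made up for illustration.

```python
import json

# Hypothetical rows: the first (and only) column holds one JSON document as text.
SAMPLE_ROWS = [
    ['{"PATID": "P001", "AGE": 54}'],
    ['{"AGE": 61}'],  # no PATID key, so the fallback value is used
]

def extract_patid_local(row):
    """Parse the JSON text in column 0 and return its PATID, or 'None' if missing."""
    record = json.loads(row[0])
    return record.get("PATID", "None")

# One extracted value per row, in the same order as the rows.
patient_ids = [extract_patid_local(row) for row in SAMPLE_ROWS]
```

Note that the fallback is the string `'None'`, not Python's `None`, which keeps every value compatible with the `str` type declared for the `PATIENT_ID` column.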