# Exercise: HGNC in MySQL

Task:
   1. Analyse the JSON file and find a way to automatically scan the whole JSON for data types
   2. Connect to MySQL and create a database
   3. Create appropriate tables in the MySQL database
   4. Store the data in the database

links:
+ [Reference manual](http://dev.mysql.com/doc/refman/5.7/en/)
+ [HUGO Gene Nomenclature Committee (HGNC)](http://www.genenames.org/)

In [21]:
import os
import json
import pymysql

# Load the Data

The data is in JSON format, which means that all atomic values are nested inside dictionaries and lists.

In [3]:
base = os.environ['BUG_FREE_EUREKA_BASE']

In [6]:
data_path = os.path.join(base, 'data', 'exercise02', 'hgnc_complete_set.json')

with open(data_path) as f:
    hgnc_json = json.load(f)

# Exploration of Structure of Data

We want to understand what data is inside this JSON file. To do this, we'll look at the keys of each nested dictionary.

The first set of keys shows that there is a response (the data) and a response header (data about the way it was downloaded). We will further explore the response.

In [13]:
hgnc_json.keys()

dict_keys(['response', 'responseHeader'])

The response contains `numFound`, which gives the number of results in `docs`. Disregard `start`.

In [14]:
hgnc_json['response'].keys()

dict_keys(['numFound', 'docs', 'start'])

In [7]:
hgnc_json['response']['docs'][0].keys()

dict_keys(['rgd_id', 'symbol', 'location_sortable', 'date_modified', 'ensembl_gene_id', 'locus_type', 'ucsc_id', 'ccds_id', 'entrez_id', 'location', 'hgnc_id', 'cosmic', 'locus_group', 'gene_family', 'merops', 'omim_id', 'mgd_id', 'pubmed_id', 'gene_family_id', 'name', 'date_approved_reserved', 'uuid', 'uniprot_ids', 'status', '_version_', 'vega_id', 'refseq_accession'])

`docs` is a list where each entry is the data associated with a gene. Below is an example of the first element of this list:

In [12]:
print(json.dumps(hgnc_json['response']['docs'][0], indent=2))

{
  "rgd_id": [
    "RGD:69417"
  ],
  "symbol": "A1BG",
  "location_sortable": "19q13.43",
  "date_modified": "2015-07-13",
  "ensembl_gene_id": "ENSG00000121410",
  "locus_type": "gene with protein product",
  "ucsc_id": "uc002qsd.5",
  "ccds_id": [
    "CCDS12976"
  ],
  "entrez_id": "1",
  "location": "19q13.43",
  "hgnc_id": "HGNC:5",
  "cosmic": "A1BG",
  "locus_group": "protein-coding gene",
  "gene_family": [
    "Immunoglobulin like domain containing"
  ],
  "merops": "I43.950",
  "omim_id": [
    138670
  ],
  "mgd_id": [
    "MGI:2152878"
  ],
  "pubmed_id": [
    2591067
  ],
  "gene_family_id": [
    594
  ],
  "name": "alpha-1-B glycoprotein",
  "date_approved_reserved": "1989-06-30",
  "uuid": "c5fd27c5-7aa4-447c-83b0-1ccc73d90925",
  "uniprot_ids": [
    "P04217"
  ],
  "status": "Approved",
  "_version_": 1546503090507612160,
  "vega_id": "OTTHUMG00000183507",
  "refseq_accession": [
    "NM_130786"
  ]
}
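
The first record only shows one gene, and other records may lack some keys or hold other types. One way to scan every record automatically is sketched below. The helper name `scan_types` and the second sample record are hypothetical; the first sample record copies a few fields from the example above.

```python
from collections import defaultdict


def scan_types(docs):
    """Return {key: {"types": set of type names, "count": records containing the key}}."""
    summary = defaultdict(lambda: {"types": set(), "count": 0})
    for doc in docs:
        for key, value in doc.items():
            entry = summary[key]
            entry["count"] += 1
            if isinstance(value, list):
                # Record the element types so list[str] and list[int] stay distinct.
                inner = "|".join(sorted({type(v).__name__ for v in value}))
                entry["types"].add(f"list[{inner}]")
            else:
                entry["types"].add(type(value).__name__)
    return dict(summary)


# The second record is made up; it only illustrates that keys can be absent.
sample_docs = [
    {"hgnc_id": "HGNC:5", "symbol": "A1BG", "omim_id": [138670], "rgd_id": ["RGD:69417"]},
    {"hgnc_id": "HGNC:0", "symbol": "EXAMPLE"},
]

for key, info in sorted(scan_types(sample_docs).items()):
    print(key, sorted(info["types"]), info["count"])
```

On the real data the call would be `scan_types(hgnc_json['response']['docs'])`. A key whose count is lower than the number of records needs a nullable column, and a key with a `list[...]` type is a candidate for a separate table.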


# Create a Database Schema

Analyze the structure of each entry. Depending on the data type, decide what sort of relation you need to store the data. For example, a list would correspond to a 1-to-many relationship, while an atomic value would correspond to a 1-to-1 relationship.

Your goal is to connect to your database, build an appropriate schema, and upload the data from `hgnc_json`.

Choose one 1-to-n relationship and a few 1-to-1 relationships for your schema.
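
As a rough illustration of that choice, the sketch below keeps a few atomic fields in one table and puts the `pubmed_id` list in a second table keyed by gene. The table names `gene` and `gene_pubmed`, the column types, and the lengths are all assumptions rather than part of the exercise; check them against the actual data before relying on them.

```python
# Column names come from the example record; every type and length
# below is an assumption, not something the data set guarantees.
CREATE_GENE = """
CREATE TABLE gene (
    hgnc_id     VARCHAR(32)  NOT NULL PRIMARY KEY,
    symbol      VARCHAR(255) NOT NULL,
    name        VARCHAR(255),
    locus_group VARCHAR(255),
    locus_type  VARCHAR(255),
    location    VARCHAR(255),
    status      VARCHAR(32)
)
"""

# One row per (gene, PubMed ID) pair models the 1-to-n relationship.
CREATE_GENE_PUBMED = """
CREATE TABLE gene_pubmed (
    hgnc_id   VARCHAR(32) NOT NULL,
    pubmed_id INT         NOT NULL,
    PRIMARY KEY (hgnc_id, pubmed_id),
    FOREIGN KEY (hgnc_id) REFERENCES gene (hgnc_id)
)
"""

# Kept as a list so each statement can be sent in its own execute() call.
SCHEMA_STATEMENTS = [CREATE_GENE, CREATE_GENE_PUBMED]
```

With an open connection, each statement could be run in turn, for example `for stmt in SCHEMA_STATEMENTS: cursor.execute(stmt)`. The composite primary key on `gene_pubmed` rejects the same PubMed ID twice for one gene, so duplicates would have to be removed before inserting.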

In [22]:
secrets_path = os.path.join(base, 'secrets.json')

with open(secrets_path) as f:
    secrets = json.load(f)

print(json.dumps(secrets, indent=2))

{
  "test_db": {
    "port": 3306,
    "host": "localhost",
    "password": "",
    "user": "root",
    "db": "mysql"
  }
}


In [23]:
db_params = secrets['test_db']

conn = pymysql.connect(**db_params)
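
The connection above opens the `mysql` system database, so the exercise data probably belongs in a database of its own. A minimal sketch follows, assuming a hypothetical database name `hgnc` and a connection object like the `conn` created above; the helper names are made up for illustration.

```python
import re


def create_database_sql(name):
    """Build a CREATE DATABASE statement after validating the identifier."""
    # Identifiers cannot be passed as query parameters, so restrict the name
    # to a conservative character set before placing it in the statement.
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError(f"unsafe database name: {name!r}")
    return f"CREATE DATABASE IF NOT EXISTS `{name}` CHARACTER SET utf8mb4"


def ensure_database(conn, name="hgnc"):
    """Create the database if needed and make it the connection's default."""
    with conn.cursor() as cursor:
        cursor.execute(create_database_sql(name))
    conn.select_db(name)


print(create_database_sql("hgnc"))
```

Calling `ensure_database(conn)` once before creating tables would keep the new tables out of the `mysql` system database.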

In [29]:
schema_sql = """
select 'YOUR CODE HERE'
"""

with conn.cursor() as cursor:
    cursor.execute(schema_sql)
    print(*cursor.fetchone())

YOUR CODE HERE


# Upload the Data

Iterate over the data: open a new cursor and use `cursor.executemany()` to execute the same statement for every entry of `docs`.
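
One possible shape for this step is sketched below. It assumes two hypothetical tables already exist: `gene`, with the columns listed in `GENE_COLUMNS`, and `gene_pubmed(hgnc_id, pubmed_id)`. The `%s` placeholders follow the parameter style that pymysql uses; the function names are made up for illustration.

```python
GENE_COLUMNS = ["hgnc_id", "symbol", "name", "locus_group", "locus_type", "location", "status"]

INSERT_GENE = (
    "INSERT INTO gene (" + ", ".join(GENE_COLUMNS) + ") "
    "VALUES (" + ", ".join(["%s"] * len(GENE_COLUMNS)) + ")"
)
INSERT_GENE_PUBMED = "INSERT INTO gene_pubmed (hgnc_id, pubmed_id) VALUES (%s, %s)"


def gene_rows(docs):
    """One tuple per gene; keys absent from a record become None (SQL NULL)."""
    return [tuple(doc.get(col) for col in GENE_COLUMNS) for doc in docs]


def pubmed_rows(docs):
    """One tuple per (gene, PubMed ID) pair, dropping repeats within a gene."""
    return [
        (doc["hgnc_id"], pubmed_id)
        for doc in docs
        # dict.fromkeys removes duplicates while keeping the original order.
        for pubmed_id in dict.fromkeys(doc.get("pubmed_id", []))
    ]


def upload(conn, docs):
    """Insert all rows through a DB-API connection such as the pymysql `conn`."""
    with conn.cursor() as cursor:
        cursor.executemany(INSERT_GENE, gene_rows(docs))
        cursor.executemany(INSERT_GENE_PUBMED, pubmed_rows(docs))
    conn.commit()
```

With the data loaded earlier, the call would be `upload(conn, hgnc_json['response']['docs'])`. Building the row tuples in separate functions makes them easy to inspect before anything is written to the database.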