Generating FGDC Triples
===================


This notebook does two things. First, it runs a Solr query for FGDC documents and uses the current parsers to write one TTL (Turtle) file per parsed document into a directory. Second, it uploads those TTL files to Parliament.
Adjust the variables below to your needs.

----------


Variables
-------------

* **solr_endpoint**: The Solr URL.
* **solr_query**: The Solr query. In this notebook it is hard-coded to match only FGDC documents.
* **solr_num_docs**: Number of documents to triplelize.
* **solr_page_size**: Number of documents requested per Solr page.
* **solr_user**: Solr username.
* **solr_password**: Solr password.
* **dump_dir**: Directory where the TTL files are written.
* **parliament_endpoint**: Parliament **SPARQL** endpoint.
* **parliament_user**: For future use.
* **parliament_password**: For future use.
* **parliament_graph**: The graph to write to. This should be dev or test, and probably never PROD.
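
Several of these values are read from environment variables, so a missing one only surfaces as a `KeyError` when the variables cell runs. A minimal sketch of an up-front check follows; the helper name `missing_env_vars` is hypothetical, while the variable names are the ones read in the variables cell of this notebook.

```python
import os

# Environment variable names read by the variables cell of this notebook.
REQUIRED_ENV_VARS = ("SOLR_URL", "SOLR_USER", "SOLR_PASS", "PARLIAMENT_ENDPOINT")


def missing_env_vars(environ=None, required=REQUIRED_ENV_VARS):
    """Return the required names that are unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Running this before the variables cell lists everything that is missing at once instead of failing on the first lookup.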


-----------


In [1]:
# Generic imports
import os
import sys
import requests
import json
import glob
import shutil
from requests.auth import HTTPBasicAuth
from rdflib import Graph, URIRef
from rdflib.plugins.stores import sparqlstore

# If we execute this notebook in the top level we don't need this line
sys.path.append('../../')

# BCube imports
from semproc.preprocessors.metadata_preprocessors import FgdcItemReader
from semproc.serializers.rdfgraphs import RdfGrapher
from semproc.parser import Parser

In [3]:
# Variables, some are set on the server side for security.

# Solr
solr_endpoint = os.environ['SOLR_URL']
solr_query = 'q=raw_content%3A"*FGDC-STD-001-1998*"&sort=date+desc&fl=id%2Cdate%2Craw_content%2Curl_hash&wt=json&indent=true'
solr_num_docs = 50
solr_page_size = 10
solr_user = os.environ['SOLR_USER']
solr_password = os.environ['SOLR_PASS']
dump_dir = '../../fgdc_triples'

# Parliament
parliament_endpoint = os.environ['PARLIAMENT_ENDPOINT']
parliament_graph = 'dev'

In [3]:
# Retrieve Solr documents, parse them, triplelize them and store the ttl files in dump_dir

# First purge the output dir
if os.path.exists(dump_dir):
    shutil.rmtree(dump_dir)
os.makedirs(dump_dir)

# We need to remove special chars that might be present
def sanitize_content(content):
    sanitized_content = content.replace('\\\n', '').replace('\r\n', '').\
    replace('\\r', '').replace('\\n', '').replace('\n', '')
    return sanitized_content.encode('unicode_escape')

# Return the ttl representation of parsed_json
def triplelize(parsed_json):
    triplelizer = RdfGrapher(parsed_json)
    triplelizer.serialize()
    return triplelizer.emit_format()

# Get the documents in a paginated way and create the turtle file
for num_docs in range(0, solr_num_docs, solr_page_size):
    # for this round we init the counters
    triplelized = 0
    parse_errors = 0
    s_query = solr_endpoint + '/collection1/select?' + solr_query + \
    '&rows=' + str(solr_page_size) + '&start=' + str(num_docs)
    # print ('Solr query: {0}'.format(s_query))    
    # Retrieve documents
    r = requests.get(s_query, auth=HTTPBasicAuth(solr_user,solr_password))
    # Load the documents using UTF-8
    data = json.loads(r.content.decode(encoding='UTF-8'))
    # Note that we are generating one ttl for each valid parsed document;
    # we could optimize this by having one ttl per page.
    for doc in data['response']['docs']:
        content = sanitize_content(doc['raw_content'])
        parser = Parser(content)
        reader = FgdcItemReader(parser.xml, doc['id'], doc['date'])
        try:
            parsed_json = reader.parse_item()
        except AttributeError:
            parsed_json = None
        if parsed_json is not None:
            triples = triplelize(parsed_json)
            file_name = dump_dir + '/' + doc['url_hash'] + '.ttl'
            with open(file_name, "w") as ttl_file:
                ttl_file.write(triples)
            # Count a document only once its ttl file has been written
            triplelized += 1
            # uncomment the next line just for dev purposes.
            # print ('Triplelized ' + doc['id'] + ' as ' + file_name + '\n')
        else:
            parse_errors += 1
    print ('A total of {0} files were triplelized and {1} couldn\'t be parsed.'.format(triplelized, parse_errors))


A total of 7 files were triplelized and 3 couldn't be parsed.
A total of 2 files were triplelized and 8 couldn't be parsed.
A total of 2 files were triplelized and 8 couldn't be parsed.
A total of 5 files were triplelized and 5 couldn't be parsed.
A total of 4 files were triplelized and 6 couldn't be parsed.
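
The paginated loop above walks the result set with Solr's `rows` and `start` parameters. As a rough sketch of just that arithmetic, the two helpers below (`page_starts` and `page_url`, both hypothetical names) reproduce the offsets and URLs the loop generates. The endpoint and the shortened query in the usage lines are placeholders, not the values used in this notebook.

```python
def page_starts(num_docs, page_size):
    """Offsets for the Solr `start` parameter, one per page."""
    return list(range(0, num_docs, page_size))


def page_url(endpoint, query, page_size, start):
    """Build the select URL with the same layout the loop uses."""
    return "{0}/collection1/select?{1}&rows={2}&start={3}".format(
        endpoint, query, page_size, start)


# Placeholder endpoint and query, for illustration only.
for start in page_starts(50, 10):
    print(page_url("http://localhost:8983/solr", "q=*:*&wt=json", 10, start))
```

Because every page requests a full `solr_page_size` rows, a `solr_num_docs` that is not a multiple of the page size is effectively rounded up: 25 documents with a page size of 10 still issues three requests of 10 rows each.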


In [4]:
# Finally we are going to send the .ttl files to Parliament in the defined Graph
ttls = glob.glob(dump_dir + '/' + '*.ttl')
success = 0
failed = 0

store = sparqlstore.SPARQLUpdateStore(parliament_endpoint, parliament_endpoint)
# This should be updated to take advantage of Parliament's bulk update endpoint
named_graph = 'urn:' + parliament_graph
sg = Graph(store, identifier=URIRef(named_graph))
print ('Sending: {0} ttl files to Parliament'.format(len(ttls)))
for ttl in ttls:
    g = Graph()
    g.parse(ttl, format="turtle")
    as_nt = g.serialize(format='nt')    
    try:
        sg.update("INSERT DATA { GRAPH <%s> { %s } }" % (named_graph, as_nt))
        success += 1
    except Exception:
        # Count any failed update, but let KeyboardInterrupt/SystemExit through
        failed += 1
print('Successfully inserted {0} files in Parliament, {1} failed'.format(success, failed))

Sending: 20 ttl files to Parliament
Successfully inserted 20 files in Parliament, 0 failed
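
One detail worth checking in the upload cell: depending on the rdflib version, `Graph.serialize()` may return bytes instead of a string (older releases return bytes, rdflib 6 and later return `str`). On Python 3, formatting bytes with `%s` embeds a `b'...'` literal in the update text. The sketch below, using a hypothetical helper named `build_insert_data`, normalizes the serialized triples before wrapping them in the same `INSERT DATA` template used above. It assumes the N-Triples are UTF-8 encoded.

```python
def build_insert_data(named_graph, ntriples):
    """Wrap N-Triples in a SPARQL INSERT DATA update for one named graph."""
    if isinstance(ntriples, bytes):
        # Assumption: serialized output is UTF-8 encoded.
        ntriples = ntriples.decode("utf-8")
    return "INSERT DATA { GRAPH <%s> { %s } }" % (named_graph, ntriples.strip())


# Illustrative triple only; real input would come from g.serialize(format='nt').
print(build_insert_data("urn:dev", b"<urn:a> <urn:b> <urn:c> .\n"))
```

The returned string could then be passed to `sg.update(...)` in place of the inline formatting.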


Notes
-----------

This notebook uses the Python osgeo library, a.k.a. GDAL, with all the issues that come from dealing with its dependencies.
Using a virtualenv is strongly advised to replicate the Python environment properly, but it is only the first step: GDAL itself must be installed manually first. On Debian/Ubuntu:


```sh
sudo apt-get install build-essential python-all-dev libgdal-dev
wget http://download.osgeo.org/gdal/1.11.0/gdal-1.11.0.tar.gz
tar xvfz gdal-1.11.0.tar.gz && cd gdal-1.11.0

export LD_PRELOAD=/usr/local/lib/libgdal.so.1
./configure --with-python
make
sudo make install
```

Once GDAL is installed on the system, we can install the Python wrappers:

```sh
pip install gdal
```

Alternatively, we can build and install the bindings that ship with the GDAL source tree:

```sh
cd /PATH/TO/gdal-1.11.0/swig/python
make
python setup.py install --prefix=$VIRTUALENV_PATH
```
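
After either install route, a quick import check inside the virtualenv shows whether the bindings are usable. This is a minimal sketch with a hypothetical helper named `gdal_release`. It returns `None` when the bindings cannot be imported, which is also how linker problems such as an undefined symbol typically show up.

```python
def gdal_release():
    """Return the GDAL release string, or None if the bindings fail to import."""
    try:
        from osgeo import gdal
    except ImportError:
        return None
    return gdal.VersionInfo("RELEASE_NAME")


print("GDAL release: {0}".format(gdal_release()))
```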

* [Set up gdal on Ubuntu 14.04](https://milkator.wordpress.com/2014/05/06/set-up-gdal-on-ubuntu-14-04/)
* [Python GDAL package missing header file when installing via pip](https://gis.stackexchange.com/questions/28966/python-gdal-package-missing-header-file-when-installing-via-pip)
* [Python gdal undefined symbol GDALRasterBandGetVirtualMem](https://stackoverflow.com/questions/27116402/python-gdal-undefined-symbol-gdalrasterbandgetvirtualmem)