Generate report comparing launch dates in GCIS vs dbpedia for same platforms #117

Closed
zednis opened this issue Aug 4, 2015 · 15 comments
@zednis
Contributor

zednis commented Aug 4, 2015

Compare the launch dates of platforms in GCIS (i.e. from CEOS) to launch dates from dbpedia.

@zednis zednis added the report label Aug 4, 2015
@zednis zednis self-assigned this Aug 4, 2015
@zednis
Contributor Author

zednis commented Aug 18, 2015

Query to get launch dates for platforms from GCIS. Next I will update the query to compare with launch dates from dbpedia.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX vivo: <http://vivoweb.org/ontology/core#>
PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?s ?launchDate ?deactivated
FROM <http://data.globalchange.gov>
WHERE {
  ?s a gcis:Platform .
  OPTIONAL { ?s dbpprop:deactivated ?deactivated }
  OPTIONAL { ?s dbpprop:launchDate ?launchDate }
} ORDER BY ?launchDate

@bduggan
Contributor

bduggan commented Aug 18, 2015

Great, thanks, I'm adding this to the (newly created) gcis-sparql repo:

https://github.com/USGCRP/gcis-sparql

I'll send an email (outside this ticket) about this repo.

Brian

@zednis
Contributor Author

zednis commented Aug 25, 2015

I have updated my query to select the dbpedia URI for the matching instance. I will then be able to use a federated query to retrieve the launch date for the platform from dbpedia.

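PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?s ?match ?launchDate ?deactivated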
WHERE {
  SERVICE <https://data.globalchange.gov/sparql> {
    ?s a gcis:Platform .
    ?s skos:exactMatch ?match .
    ?match skos:inScheme <http://data.globalchange.gov/lexicon/dbpedia> .
    OPTIONAL { ?s dbp:deactivated ?deactivated }
    OPTIONAL { ?s dbp:launchDate ?launchDate }
  }
} ORDER BY ?launchDate

I have run into an issue: I am unable to retrieve the skos:inScheme value for the lexicon concept from the triplestore (the query above returns 0 results).

See http://data.globalchange.gov/lexicon/dbpedia.thtml for an example in the REST API.

This statement is generated by the representation.ttl.tut template.

@bduggan is it possible that the RDF from this template is not being included in the triplestore load?

@zednis
Contributor Author

zednis commented Aug 25, 2015

I have updated my query to use owl:sameAs and a regex filter.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX vivo: <http://vivoweb.org/ontology/core#>
PREFIX gcis: <http://data.globalchange.gov/gcis.owl#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX db: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT
?platform_gcis
?platform_dbpedia 
?launchDate_gcis
?launchDate_dbpedia 
?cospar_dbpedia
WHERE {
  FILTER(str(?match) = str(?platform_dbpedia))
  SERVICE <https://data.globalchange.gov/sparql> {
    ?platform_gcis a gcis:Platform .
    ?platform_gcis owl:sameAs ?match .
    ?platform_gcis dbp:launchDate ?launchDate_gcis .
    FILTER regex(str(?match), "dbpedia.org", "i") .
  }
  SERVICE <http://dbpedia.org/sparql> {
    ?platform_dbpedia dbp:launchDate ?launchDate_dbpedia .
    OPTIONAL { ?platform_dbpedia dbp:cosparId ?cospar_dbpedia }
  }
} LIMIT 20

Without a limit on the result set size, the query times out.
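One possible way around the timeout, sketched here under assumptions rather than taken from what was committed to gcis-sparql, is to skip the federated join: fetch the owl:sameAs matches from GCIS on their own, then ask dbpedia about those URIs in small batches using a VALUES clause. The helper names `batches` and `build_dbpedia_query` and the batch size are hypothetical; the sketch only builds the query strings and does not send them anywhere.

```python
from typing import Iterable, Iterator, List

# Query sent to dbpedia for one batch of platform URIs.
# Doubled braces are literal braces in str.format().
DBPEDIA_QUERY_TEMPLATE = """\
PREFIX dbp: <http://dbpedia.org/property/>

SELECT ?platform_dbpedia ?launchDate_dbpedia ?cospar_dbpedia
WHERE {{
  VALUES ?platform_dbpedia {{ {values} }}
  ?platform_dbpedia dbp:launchDate ?launchDate_dbpedia .
  OPTIONAL {{ ?platform_dbpedia dbp:cosparId ?cospar_dbpedia }}
}}
"""


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_dbpedia_query(uris: Iterable[str]) -> str:
    """Build a non-federated dbpedia query restricted to the given URIs."""
    values = " ".join(f"<{uri}>" for uri in uris)
    return DBPEDIA_QUERY_TEMPLATE.format(values=values)


if __name__ == "__main__":
    # URIs as returned by the owl:sameAs lookup against GCIS.
    matches = [
        "http://dbpedia.org/resource/Aqua_(satellite)",
        "http://dbpedia.org/resource/Suomi_NPP",
        "http://dbpedia.org/resource/GOES_2",
    ]
    for group in batches(matches, 2):
        print(build_dbpedia_query(group))
```

Each generated query touches only the listed resources, so dbpedia never has to enumerate every dbp:launchDate triple. The results would then be joined back to the GCIS rows on the URI outside of SPARQL.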

The launch dates in dbpedia generally seem to be missing the year component. Looking at example RDF on dbpedia, the dbp:cosparId property is often (but not always) used to contain the launch year.

results:

"platform_gcis" "platform_dbpedia" "launchDate_gcis" "launchDate_dbpedia" "cospar_dbpedia"
"http://data.globalchange.gov/platform/aqua" "http://dbpedia.org/resource/Aqua_(satellite)" "2002-05-04T00:00:00-06:00" "--05-04" "2002"
"http://data.globalchange.gov/platform/suomi-national-polar-orbiting-partnership" "http://dbpedia.org/resource/Suomi_NPP" "2011-10-28T00:00:00-06:00" "--10-28" "2011"
"http://data.globalchange.gov/platform/geostationary-operational-environmental-satellite-2" "http://dbpedia.org/resource/GOES_2" "1984-02-26T00:00:00-06:00" "--06-16" "1977"
"http://data.globalchange.gov/platform/uk-disaster-monitoring-constellation-2" "http://dbpedia.org/resource/UK-DMC_2" "2009-07-29T00:00:00-06:00" "--07-29" "2009"
"http://data.globalchange.gov/platform/vnredsat-1" "http://dbpedia.org/resource/VNREDSat_1A" "2013-05-07T00:00:00-06:00" "--05-07" "2013"
"http://data.globalchange.gov/platform/quick-scatterometer" "http://dbpedia.org/resource/QuikSCAT" "1999-06-19T00:00:00-06:00" "--06-19" "1999"
"http://data.globalchange.gov/platform/communication-oceanographic-meteorological-satellite" "http://dbpedia.org/resource/Chollian" "2010-06-26T00:00:00-06:00" "--06-26" "2010"
"http://data.globalchange.gov/platform/automatic-identification-system-satellite-1" "http://dbpedia.org/resource/AISSat-1" "2010-07-12T00:00:00-06:00" "--07-12" "2010"
"http://data.globalchange.gov/platform/resource-satellite-2" "http://dbpedia.org/resource/Resourcesat-2" "2011-04-20T00:00:00-06:00" "2011-04-20" "2011"
"http://data.globalchange.gov/platform/geostationary-operational-environmental-satellite-11" "http://dbpedia.org/resource/GOES_11" "2000-05-03T00:00:00-06:00" "--05-03" "2000"
"http://data.globalchange.gov/platform/geostationary-operational-environmental-satellite-6" "http://dbpedia.org/resource/GOES_6" "1984-02-26T00:00:00-06:00" "--04-28" "1983"
"http://data.globalchange.gov/platform/bilsat-research-satellite" "http://dbpedia.org/resource/BILSAT-1" "2003-09-01T00:00:00-06:00" "--09-27" "2003"
"http://data.globalchange.gov/platform/advanced-land-observing-satellite" "http://dbpedia.org/resource/Advanced_Land_Observation_Satellite" "2006-01-24T00:00:00-06:00" "--01-24" "2006"
"http://data.globalchange.gov/platform/landsat-7" "http://dbpedia.org/resource/Landsat_7" "1999-04-15T00:00:00-06:00" "--04-15" "1999"
"http://data.globalchange.gov/platform/oceansat-1" "http://dbpedia.org/resource/Oceansat-1" "1999-05-26T00:00:00-06:00" "1999-05-26"
"http://data.globalchange.gov/platform/odin" "http://dbpedia.org/resource/Odin_(satellite)" "2001-02-20T00:00:00-06:00" "--02-20" "2001"
"http://data.globalchange.gov/platform/earths-magnetic-field-and-environment-explorers" "http://dbpedia.org/resource/Swarm_(spacecraft)" "2013-11-22T00:00:00-06:00" "--11-22" "SWARM A: 2013-067B"
"http://data.globalchange.gov/platform/earths-magnetic-field-and-environment-explorers" "http://dbpedia.org/resource/Swarm_(spacecraft)" "2013-11-22T00:00:00-06:00" "--11-22" "SWARM B: 2013-067A"
"http://data.globalchange.gov/platform/earths-magnetic-field-and-environment-explorers" "http://dbpedia.org/resource/Swarm_(spacecraft)" "2013-11-22T00:00:00-06:00" "--11-22" "SWARM C: 2013-067C"
"http://data.globalchange.gov/platform/earths-magnetic-field-and-environment-explorers" "http://dbpedia.org/resource/Swarm_(spacecraft)" "2013-11-22T00:00:00-06:00" "43349.0" "SWARM A: 2013-067B"

bduggan pushed a commit to USGCRP/gcis-rdf that referenced this issue Aug 26, 2015
@bduggan
Contributor

bduggan commented Aug 26, 2015

On Tuesday, August 25, Stephan Zednik wrote:

see http://data.globalchange.gov/lexicon/dbpedia.thtml for example in REST API.

This statement is generated by the representation.ttl.tut template.

@bduggan is it possible RDF from this template is not being included in the triplestore load?

Yes, fixed.

I'm re-running the import; it should be updated in 30 minutes or so.

Brian

@bduggan
Contributor

bduggan commented Aug 26, 2015

A nice refinement would be to only show entries for which the date differs. e.g. why is the GOES-2 launch listed as 1984 in one system and 1977 in another?

@zednis
Contributor Author

zednis commented Aug 26, 2015

@bduggan that might be a bit hard to do in the query, since we would have to potentially (but not always) combine and reformat the ?launchDate_dbpedia and ?cospar_dbpedia variables into a date. I think it would probably be easier to do that analysis in a spreadsheet, where you can use some simple parsing logic to attempt to process dbpedia's inconsistent dates.
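As a rough illustration of what that parsing logic could look like outside the query, here is a minimal Python sketch that flags rows whose month and day disagree. It deliberately ignores the year, since dbpedia often omits it from the launch date. The function names are hypothetical, and the accepted formats are only the ones seen in the sample results above (`--MM-DD` and `YYYY-MM-DD`); anything else is reported as unparseable rather than guessed at.

```python
import re
from datetime import datetime
from typing import Optional

# dbpedia launch dates seen so far: xsd:gMonthDay (--MM-DD) or a full date.
_DBPEDIA_DATE = re.compile(r"^(?:--|\d{4}-)(\d{2})-(\d{2})$")


def gcis_month_day(value: str) -> str:
    """Return MM-DD from a GCIS value such as 2002-05-04T00:00:00-06:00."""
    return datetime.fromisoformat(value).strftime("%m-%d")


def dbpedia_month_day(value: str) -> Optional[str]:
    """Return MM-DD from --MM-DD or YYYY-MM-DD, or None for anything else."""
    match = _DBPEDIA_DATE.match(value.strip())
    if match is None:
        return None
    return f"{match.group(1)}-{match.group(2)}"


def month_day_differs(gcis_value: str, dbpedia_value: str) -> Optional[bool]:
    """True if month/day disagree, False if they agree, None if unparseable."""
    dbpedia_md = dbpedia_month_day(dbpedia_value)
    if dbpedia_md is None:
        return None
    return gcis_month_day(gcis_value) != dbpedia_md


if __name__ == "__main__":
    # GOES 2 row from the results: the two systems disagree.
    print(month_day_differs("1984-02-26T00:00:00-06:00", "--06-16"))
```

Rows returning None (for example the "43349.0" value in the Swarm results) would need a manual look.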

Also, I am attempting to update the gcis-sparql files for this report but the federated query is frequently timing out.

@zednis
Contributor Author

zednis commented Aug 31, 2015

@justgo129 @bduggan query added to gcis-sparql. Is this ticket ready to be closed?

@justgo129
Contributor

I just took a look. Is there a way to standardize the date formatting in the output?

@zednis
Contributor Author

zednis commented Sep 1, 2015

It would be far easier to apply some post-processing to the query results to fix the dates than to add that logic to the query. The dbpedia RDF uses inconsistent literal types for the launch date values, and updating the query to standardize the formatting would make the query much more complicated and probably make the timeout issue worse. Additionally, they frequently split the year out of the launch date and encode the month and day of the launch using xsd:gMonthDay (which I have never seen used in RDF before). To standardize the output, the query would have to extract the appropriate date components from ?launchDate_dbpedia and ?cospar_dbpedia (with checks because of the data inconsistency) and build a new date serialization using string concatenation.

This could perhaps be done in the query but it would make it very ugly, and probably slower.

It would be much easier to do this as post-processing on the CSV using perl or python.
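For example, a post-processing step along these lines could rebuild a full date from the xsd:gMonthDay value plus a separately supplied year. This is only a sketch under assumptions: the function name `combine_launch_date` is made up for illustration, the year is assumed to have already been extracted from ?cospar_dbpedia (or to be missing), and a full date in ?launchDate_dbpedia is assumed to take precedence over that year.

```python
import re
from datetime import date
from typing import Optional

_GMONTHDAY = re.compile(r"^--(\d{2})-(\d{2})$")        # xsd:gMonthDay, e.g. --05-04
_FULL_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")  # xsd:date, e.g. 2011-04-20


def combine_launch_date(launch_date: str, year: Optional[int]) -> Optional[date]:
    """Build a date from a dbpedia launch date and an optional launch year.

    Returns None when the value cannot be interpreted, so that
    inconsistent rows are surfaced instead of silently "fixed".
    """
    value = launch_date.strip()

    full = _FULL_DATE.match(value)
    if full:
        y, m, d = (int(part) for part in full.groups())
    else:
        partial = _GMONTHDAY.match(value)
        if partial is None or year is None:
            return None
        y = year
        m, d = (int(part) for part in partial.groups())

    try:
        return date(y, m, d)
    except ValueError:
        # e.g. month 13 or February 30 in the source data
        return None


if __name__ == "__main__":
    # Aqua row from the results: --05-04 plus launch year 2002.
    print(combine_launch_date("--05-04", 2002))
```

The resulting `date` objects can be written back to the CSV with `.isoformat()`, which gives every row the same `YYYY-MM-DD` shape.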

@justgo129
Contributor

Works for me. Could we at least get rid of the "Cospar_dbpedia" entries beginning with "SWARM A", or would that also be a post-processing candidate? I'd think we could at least strip out the text within the SPARQL query. After that, feel free to repush and close.

@zednis
Contributor Author

zednis commented Sep 2, 2015

I am not sure we should strip out values from ?cospar_dbpedia. That property is not explicitly for the year of the launch but for the COSPAR ID. It seems that the year is often (but not always) part of the COSPAR ID. I would keep the COSPAR ID intact and leave the logic of parsing it and extracting relevant year information (if any) to post-processing.
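A minimal sketch of that post-processing, keeping the raw value alongside whatever can be parsed out of it, might look like the following. It assumes the usual COSPAR international designator layout of a four-digit launch year, a three-digit launch number, and a piece letter (for example 2013-067B), and it also accepts the bare years that show up in the results. The `Cospar` type and `parse_cospar` function are hypothetical names for illustration.

```python
import re
from typing import NamedTuple, Optional


class Cospar(NamedTuple):
    raw: str                      # original value, kept intact
    year: Optional[int]           # launch year, if one could be found
    launch_number: Optional[int]  # sequence number of the launch in that year
    piece: Optional[str]          # piece letter(s) identifying the object


# Designator such as 2013-067B, possibly preceded by a label ("SWARM A: ...").
_DESIGNATOR = re.compile(r"(?<!\d)(\d{4})-(\d{3})([A-Z]{1,3})")
# Some rows carry only a bare year, e.g. "2002".
_BARE_YEAR = re.compile(r"^\s*(\d{4})\s*$")


def parse_cospar(raw: str) -> Cospar:
    """Extract year information from a dbp:cosparId value without altering it."""
    designator = _DESIGNATOR.search(raw)
    if designator:
        year, number, piece = designator.groups()
        return Cospar(raw, int(year), int(number), piece)

    bare = _BARE_YEAR.match(raw)
    if bare:
        return Cospar(raw, int(bare.group(1)), None, None)

    return Cospar(raw, None, None, None)


if __name__ == "__main__":
    print(parse_cospar("SWARM A: 2013-067B"))
    print(parse_cospar("2002"))
```

Because `raw` is always preserved, the report can still show the full COSPAR ID while the derived `year` feeds the date comparison.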

@justgo129
Contributor

@rewolfe are you all right with the suggestion of @zednis?

@rewolfe
Member

rewolfe commented Sep 3, 2015

@justgo129 - yes, I think that post-processing is the best approach. It looks like it is pretty difficult to parse dates in SPARQL.

Robert Wolfe, NASA GSFC @ USGCRP, o: 202-419-3470, m: 301-257-6966

@justgo129
Contributor

Thanks, @rewolfe. As the query has been added to gcis-sparql, I declare #117 to be closed.
