### Blended Metadata Stats

"Blended" here means a record that mixes standards or formats: ISO in FGDC, FGDC in ISO, or JSON in XML.



In [1]:
%matplotlib inline
import pandas as pd
import json as js  # name conflict with sqla
import sqlalchemy as sqla
from sqlalchemy.orm import sessionmaker

In [2]:
# grab the clean text from the rds
with open('../local/big_rds.conf', 'r') as f:
    conf = js.load(f)

# our connection
engine = sqla.create_engine(conf.get('connection'))

#### Starting with ISO in the FGDC

This is going to be (at least from what I've seen to date) CI_OnlineResource elements only. It's a workaround for including some basic info about a link that isn't part of the FGDC spec: flagging WMS or landing page URLs, things like that.
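
To make the idea concrete, here's a minimal sketch of how an ISO element parked inside an FGDC record could be spotted and turned into a slash-delimited tag path like the ones in the `tags` column. This is not the extraction code behind `blended_metadatas`; the sample XML, the `find_iso_elements` helper, and the "name starts with `CI_` or `MD_`" rule are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical FGDC fragment with an ISO CI_OnlineResource inside a citation.
SAMPLE_FGDC = """
<metadata>
  <idinfo>
    <citation>
      <citeinfo>
        <title>Example dataset</title>
        <CI_OnlineResource>
          <linkage>http://example.com/erddap/griddap/example.html</linkage>
          <name>Landing page</name>
        </CI_OnlineResource>
      </citeinfo>
    </citation>
  </idinfo>
</metadata>
"""


def find_iso_elements(element, prefixes=("CI_", "MD_"), path=""):
    """Return slash-delimited paths of elements whose names look like ISO.

    Assumption: an element name starting with CI_ or MD_ is ISO content.
    """
    # Drop any namespace: '{uri}name' -> 'name'
    name = element.tag.split('}')[-1]
    current = name if not path else path + "/" + name
    if name.startswith(prefixes):
        # Report the ISO block itself and don't descend into it.
        return [current]
    found = []
    for child in element:
        found.extend(find_iso_elements(child, prefixes, current))
    return found


iso_paths = find_iso_elements(ET.fromstring(SAMPLE_FGDC.strip()))
```

For the sample above, `iso_paths` holds a single path ending in `citeinfo/CI_OnlineResource`, which has the same shape as the tags in the query output.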

In [4]:
sql = """
select b.response_id, b.tags, count(b.tags) as num_per_tag
from blended_metadatas b
where source_standard = 'FGDC'
group by b.response_id, b.tags
order by b.response_id, num_per_tag DESC;
"""

df = pd.read_sql(sql, engine)

from IPython.display import display
with pd.option_context('display.max_colwidth', 1000):
    display(df[:10])

Unnamed: 0,response_id,tags,num_per_tag
0,137746,metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource,57
1,137746,metadata/idinfo/citation/citeinfo/CI_OnlineResource,5
2,138577,metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource,57
3,138577,metadata/idinfo/citation/citeinfo/CI_OnlineResource,5
4,140207,metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource,57
5,140207,metadata/idinfo/citation/citeinfo/CI_OnlineResource,5
6,140236,metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource,2
7,140744,metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource,57
8,140744,metadata/idinfo/citation/citeinfo/CI_OnlineResource,5
9,142472,metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource,57


There's a pattern. Are they all from \*DAP servers?

Update: pretty much. The data.gov ones could be, but you can't tell from the URL alone.


| Host                       |  Number of Responses |  Is it ERDDAP? | 
|----------------------------|----------------------|----------------| 
| "upwell.pfeg.noaa.gov"     | 4389                 | "ERDDAP"       | 
| "oos.soest.hawaii.edu"     | 4788                 | "ERDDAP"       | 
| "ecowatch.ncddc.noaa.gov"  | 4218                 | "ERDDAP"       | 
| "data.nanoos.org"          | 204                  | "ERDDAP"       | 
| "coastwatch.pfeg.noaa.gov" | 11172                | "ERDDAP"       | 
| "catalog.data.gov"         | 81                   | "OTHER"        | 
| "bluehub.jrc.ec.europa.eu" | 4902                 | "ERDDAP"       | 


The sql for that:

```
select r.host, count(r.host), case when r.source_url ilike '%erddap%' then 'ERDDAP' else 'OTHER' end as is_it_dap
from blended_metadatas b join responses r on r.id = b.response_id
where source_standard = 'FGDC' and tags ilike '%networka%'
group by is_it_dap, r.host
order by r.host DESC;
```
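
For anyone who'd rather poke at this outside the database, the same host/ERDDAP tally can be sketched in plain Python. The `rows` below are made-up `(host, source_url)` pairs standing in for the `responses` table, and `classify` is a hypothetical helper that mirrors the `case when ... ilike '%erddap%'` expression; none of it reproduces the real counts.

```python
from collections import Counter

# Hypothetical (host, source_url) pairs; not real rows from the responses table.
rows = [
    ("coastwatch.pfeg.noaa.gov",
     "http://coastwatch.pfeg.noaa.gov/erddap/griddap/example_fgdc.xml"),
    ("coastwatch.pfeg.noaa.gov",
     "http://coastwatch.pfeg.noaa.gov/ERDDAP/tabledap/example_fgdc.xml"),
    ("catalog.data.gov",
     "http://catalog.data.gov/harvest/object/example"),
]


def classify(source_url):
    # Mirrors: case when source_url ilike '%erddap%' then 'ERDDAP' else 'OTHER' end
    return "ERDDAP" if "erddap" in source_url.lower() else "OTHER"


# Equivalent of: group by is_it_dap, host
counts = Counter((host, classify(url)) for host, url in rows)
```

Like the SQL, this only looks at the URL string, so a harvested copy of an ERDDAP record served from another host still lands in "OTHER".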

In [11]:
sql = """
select distinct b.tags
from blended_metadatas b
where source_standard = 'FGDC';
"""

df = pd.read_sql(sql, engine)
for i in df.itertuples():
    print(i[1])

metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource
metadata/idinfo/citation/citeinfo/CI_OnlineResource
metadata/idinfo/citation/citeinfo/lworkcit/citeinfo/CI_OnlineResource


These are only used in two places - citations and distributions.

In [13]:
sql = """
select b.tags, count(b.response_id) as num_per_tag
from blended_metadatas b
where source_standard = 'FGDC'
group by b.tags
order by num_per_tag DESC;
"""
df = pd.read_sql(sql, engine)
for i in df.itertuples():
    print('{} is found in {} FGDC records.'.format(i[1], i[2]))

metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource is found in 29754 FGDC records.
metadata/idinfo/citation/citeinfo/CI_OnlineResource is found in 2530 FGDC records.
metadata/idinfo/citation/citeinfo/lworkcit/citeinfo/CI_OnlineResource is found in 3 FGDC records.


One last question - are these all from the same place (and maybe roughly the same time period)?

In [14]:
sql = """
select distinct r.host, date_trunc('month', r.metadata_age)::date as date_bin
from blended_metadatas b join responses r on r.id = b.response_id
where source_standard = 'FGDC'
order by r.host DESC, date_bin ASC;
"""

df = pd.read_sql(sql, engine)

In [15]:
df

Unnamed: 0,host,date_bin
0,upwell.pfeg.noaa.gov,2015-07-01
1,upwell.pfeg.noaa.gov,2015-08-01
2,upwell.pfeg.noaa.gov,2015-09-01
3,oos.soest.hawaii.edu,2015-06-01
4,oos.soest.hawaii.edu,2015-07-01
5,oos.soest.hawaii.edu,2015-08-01
6,oos.soest.hawaii.edu,2015-09-01
7,ecowatch.ncddc.noaa.gov,2015-07-01
8,ecowatch.ncddc.noaa.gov,2015-08-01
9,ecowatch.ncddc.noaa.gov,2015-09-01


No and almost: the hosts differ, but in the rows shown the dates all fall within a few months of 2015, so it is a very recent thing in this set. See the "Is it ERDDAP?" table above.

#### How about FGDC in the ISO?

The XML isn't mix-and-match here; there's no inclusion of an FGDC namespace or anything like that. But there are things people need or want to know about spatial data that are hard to capture in the ISO standard, so the workaround is to park them in a comment.

72,375 comments extracted in total.

58,650 contain more than just the term "ORIGIN".

40,795 are related to some FGDC text, matching one of the following patterns:

```
'%FGDC content not mapped to ISO. From Xpath: %'
'% translated from % to % '
```


The first is the workaround: there's no good place to put the content in ISO. The second is often related to the attribute information and trying to avoid the external 19110 record.

In [30]:
sql = """
with i as (
    select id, response_id, tags, extracted_info, 'not mapped' as code
    from blended_metadatas
    where source_standard = 'ISO'
        and extracted_info ilike '%%FGDC content not mapped to ISO. From Xpath: %%'

    union all
    
    select id, response_id, tags, extracted_info, 'translated' as code
    from blended_metadatas
    where source_standard = 'ISO'
        and extracted_info ilike '%%translated from %% to %%'
)
select tags, code, count(code) as num_per_tag
from i
group by tags, code
order by tags, num_per_tag;
"""


This is also not very good data (the two patterns can coincide in the same comment, so the `union all` counts that comment under both codes).
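
A quick way to see the overlap is to run both patterns against a single comment. The sketch below rewrites the two `ilike` patterns as case-insensitive regexes; the `codes_for` helper and the sample comment are hypothetical, made up to show a comment that matches both.

```python
import re

# The two ILIKE patterns from the query, as case-insensitive regexes.
# re.S lets '.' span newlines, the way '%' does in ILIKE.
NOT_MAPPED = re.compile(r"FGDC content not mapped to ISO\. From Xpath: ", re.I | re.S)
TRANSLATED = re.compile(r"translated from .* to ", re.I | re.S)


def codes_for(comment):
    """Return the set of codes a comment would be counted under."""
    codes = set()
    if NOT_MAPPED.search(comment):
        codes.add("not mapped")
    if TRANSLATED.search(comment):
        codes.add("translated")
    return codes


# Hypothetical comment blob containing both phrases.
both = ("translated from eainfo/detailed to MD_CoverageDescription\n"
        "FGDC content not mapped to ISO. From Xpath: //attrdomv/edom")

overlap = codes_for(both)
```

Any comment where `codes_for` returns both codes shows up twice in the table, once per pattern.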



| Tags                                                                                                                                                | Pattern      | Count | 
|-----------------------------------------------------------------------------------------------------------------------------------------------------|--------------|-------| 
| "MI_Metadata"                                                                                                                                       | "not mapped" | 2647  | 
| "MI_Metadata"                                                                                                                                       | "translated" | 3037  | 
| "MI_Metadata/contentInfo"                                                                                                                           | "not mapped" | 6685  | 
| "MI_Metadata/contentInfo"                                                                                                                           | "translated" | 10197 | 
| "MI_Metadata/contentInfo/MD_CoverageDescription"                                                                                                    | "not mapped" | 6685  | 
| "MI_Metadata/contentInfo/MD_CoverageDescription/dimension/MD_Band"                                                                                  | "not mapped" | 23436 | 
| "MI_Metadata/*/distributionOrderProcess/MD_StandardOrderProcess"                          | "not mapped" | 84    | 
| "MI_Metadata/*/distributionOrderProcess/MD_StandardOrderProcess/plannedAvailableDateTime" | "not mapped" | 6     | 
| "MI_Metadata/spatialRepresentationInfo/MD_GridSpatialRepresentation/cellGeometry"                                                                   | "not mapped" | 1252  | 


In [4]:
# sql for unique comment lines that aren't the xml
# embedded in the comment
sql = """
with b as (
    with i as (
        select id, response_id, tags, regexp_replace(extracted_info, E'[\n\r]+', ' | ', 'g') as the_comment, 'not mapped' as code
        from blended_metadatas
        where source_standard = 'ISO'
            and extracted_info ilike '%FGDC content not mapped to ISO. From Xpath: %'

        union all

        select id, response_id, tags, regexp_replace(extracted_info, E'[\n\r]+', ' | ', 'g') as the_comment, 'translated' as code
        from blended_metadatas
        where source_standard = 'ISO'
            and extracted_info ilike '%translated from % to %'
    )
    -- select id, response_id, trim(leading ' ' from unnest(string_to_array(the_comment, '|')))
    select a_comment, count(*) as comment_count
    from (
        -- unnest first; set-returning functions aren't allowed inside count()
        select trim(leading ' ' from unnest(string_to_array(the_comment, '|'))) as a_comment
        from i
    ) as split_lines
    group by a_comment
)
select a_comment, comment_count
from b
where left(a_comment, 1) != '<' and a_comment not ilike '%%ORIGIN%%'
order by comment_count DESC;
"""

# df = pd.read_sql(sql, engine)

In [3]:
# occurrences of each unique comment line
df = pd.read_csv('outputs/fgdc_in_iso_unique_comment_lines.csv')

In [6]:
from IPython.display import display
with pd.option_context('display.max_colwidth', 1000):
    display(df)

Unnamed: 0,a_comment,count
0,FGDC content not mapped to ISO. From Xpath: //attrdomv/edom,59747
1,FGDC content not mapped to ISO. From Xpath: //attrdomv/udom,48658
2,translated from eainfo/detailed to MD_CoverageDescription,26456
3,FGDC content not mapped to ISO. From Xpath: //enttyp,19317
4,FGDC content not mapped to ISO. From Xpath: //attrdomv/codesetd,14063
5,FGDC content not mapped to ISO. From Xpath: //enttyp,13824
6,FGDC content not mapped to ISO. From Xpath: //attrdomv/edom,10199
7,Other FGDC spatial reference elements not mapped to ISO from Xpath: //spref,5603
8,FGDC content not mapped to ISO. From Xpath: //attrdomv/udom,5400
9,translated from eainfo to MD_FeatureCatalogueDescription,5023


In [7]:
# how many responses contain each unique comment line?

df = pd.read_csv('outputs/fgdc_in_iso_unique_comment_lines_w_response_count.csv')
from IPython.display import display
with pd.option_context('display.max_colwidth', 1000):
    display(df)

Unnamed: 0,a_comment,found_in_responses
0,translated from eainfo to MD_FeatureCatalogueDescription,3187
1,Other FGDC spatial reference elements not mapped to ISO from Xpath: //spref,3091
2,translated from eainfo/detailed to MD_CoverageDescription,2350
3,FGDC content not mapped to ISO. From Xpath: //enttyp,2347
4,translated from eainfo to MD_FeatureCatalogueDescription,1934
5,/horizsys/geograph mapped to MD_GridSpatialRepresentation,1272
6,FGDC content not mapped to ISO. From Xpath: //spdoinfo/rastinfo/rasttype,1252
7,FGDC content not mapped to ISO. From Xpath: //attrdomv/udom,1242
8,FGDC content not mapped to ISO. From Xpath: //attrdomv/edom,767
9,FGDC content not mapped to ISO. From Xpath: //enttyp,660


In [None]:
sql = """
-- count per host
with i as (
	select id, response_id, tags, 
		trim(leading ' ' from unnest(string_to_array(regexp_replace(extracted_info, E'[\n\r]+', ' | ', 'g'), '|'))) as the_comment
	from blended_metadatas
	where source_standard = 'ISO'
		and (extracted_info ilike '%FGDC content not mapped to ISO. From Xpath: %' or extracted_info ilike '%translated from % to %')
)
select r.host, count(r.host) as host_count, the_comment
from i join responses r on r.id = i.response_id
where the_comment != '' and the_comment not like '%ORIGIN%' and left(the_comment, 1) != '<' and left(the_comment, 1) != '2'
group by the_comment, r.host
order by r.host, host_count desc, the_comment;
"""

In [10]:
# so where do they come from?

df = pd.read_csv('outputs/fgdc_in_iso_unique_comment_lines_w_host_count.csv')
from IPython.display import display
with pd.option_context('display.max_colwidth', 1000, 'display.max_rows', 100):
    display(df)

Unnamed: 0,host,host_count,the_comment
0,catalog.data.gov,36078,FGDC content not mapped to ISO. From Xpath: //attrdomv/edom
1,catalog.data.gov,27273,FGDC content not mapped to ISO. From Xpath: //attrdomv/udom
2,catalog.data.gov,12827,translated from eainfo/detailed to MD_CoverageDescription
3,catalog.data.gov,10074,FGDC content not mapped to ISO. From Xpath: //enttyp
4,catalog.data.gov,9823,FGDC content not mapped to ISO. From Xpath: //attrdomv/edom
5,catalog.data.gov,9216,FGDC content not mapped to ISO. From Xpath: //enttyp
6,catalog.data.gov,8478,FGDC content not mapped to ISO. From Xpath: //attrdomv/codesetd
7,catalog.data.gov,5023,FGDC content not mapped to ISO. From Xpath: //attrdomv/udom
8,catalog.data.gov,3759,translated from eainfo to MD_FeatureCatalogueDescription
9,catalog.data.gov,2723,Other FGDC spatial reference elements not mapped to ISO from Xpath: //spref


These fall into two really big buckets: attribute information (and from data.gov, I'm a solid 90% sure that these are for vector datasets) and spatial reference information. It's not necessarily that the attribute information can't be described by an ISO standard; it just can't be described in a 19115/19139 record, so you'd have to generate the external 19110 Feature Catalogue. There isn't, to my knowledge, a spatial data portal platform that handles the relationship between a 19139 and a 19110: the harvesters don't grab the reference from the appropriate contentInfo block and the systems don't map between XML files. They are atomic entities. I'm also pretty sure that this is true of desktop clients as well.

You can capture some of these in the 19139 but in a way that isn't always obvious to the consumer of the record. Something like indirect spatial reference can be encoded in the spatial reference info block (this was part of the geo-ide wiki recommendations and the provided XSLT 2013-ish). But there's no clear indicator that *this* spatial reference block is an indirect spatial reference and *that* block is the EPSG reference. 

I'd note that any time we bin by host, we probably don't want to make too many blanket statements about that host. Some things are part of that system and some are part of its harvest sources. We're one portal away from certain kinds of issues.

I am excluding lines that are clearly identifiable as FGDC XML embedded as a string in the comments. The comments are split by line, so there's a bit of cruft ("Polygon", for example).
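
As a rough Python equivalent of that split-and-filter step, here's a sketch under a few assumptions: `comment_lines` is a hypothetical helper, the sample blobs are made up, and it strips whitespace from both ends of each line where the SQL only trims leading spaces. The SQL also splits on any literal `|` in the text, which this version doesn't.

```python
import re
from collections import Counter


def comment_lines(extracted_info):
    """Split a comment blob into lines, dropping blanks, XML, and ORIGIN lines."""
    kept = []
    for line in re.split(r"[\n\r]+", extracted_info):
        line = line.strip()  # the SQL trims leading spaces only
        if not line:
            continue
        if line.startswith("<"):  # embedded FGDC XML
            continue
        if "origin" in line.lower():  # mirrors: not ilike '%ORIGIN%'
            continue
        kept.append(line)
    return kept


# Hypothetical comment blobs.
blobs = [
    "FGDC content not mapped to ISO. From Xpath: //enttyp\n"
    "<enttyp><enttypl>Polygon</enttypl></enttyp>\n"
    "  Polygon\r\n"
    "ORIGIN",
    "FGDC content not mapped to ISO. From Xpath: //enttyp",
]

line_counts = Counter(line for blob in blobs for line in comment_lines(blob))
```

The stray "Polygon" survives the filter here too, which is the same kind of cruft that shows up in the real output.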

#### Let's see how many comments are about attributes or spref

Not number of responses or by host - just of all the comment blobs, how many are eainfo-related or spref-related?
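
One way this tally could be sketched is below. The marker lists are my assumption about which FGDC element names signal each bucket (`enttyp` and `attrdomv` live under `eainfo`, `horizsys` under `spref`), `bucket` is a hypothetical helper, and the sample lines are copied from the comment output above. It counts per comment line, not per comment blob, and it doesn't produce the real totals.

```python
from collections import Counter

# Assumed markers for each bucket, based on FGDC element names.
EAINFO_MARKERS = ("eainfo", "enttyp", "attrdomv")
SPREF_MARKERS = ("spref", "horizsys")


def bucket(comment):
    """Assign a comment line to 'eainfo', 'spref', or 'other'."""
    text = comment.lower()
    if any(marker in text for marker in EAINFO_MARKERS):
        return "eainfo"
    if any(marker in text for marker in SPREF_MARKERS):
        return "spref"
    return "other"


# Comment lines taken from the output tables above.
sample = [
    "FGDC content not mapped to ISO. From Xpath: //attrdomv/edom",
    "translated from eainfo to MD_FeatureCatalogueDescription",
    "Other FGDC spatial reference elements not mapped to ISO from Xpath: //spref",
    "/horizsys/geograph mapped to MD_GridSpatialRepresentation",
    "FGDC content not mapped to ISO. From Xpath: //spdoinfo/rastinfo/rasttype",
]

bucket_counts = Counter(bucket(line) for line in sample)
```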


In [None]:
# 72375 comments total

#### JSON in the XML



In [3]:
sql = """
select r.host, count(r.source_url) as responses_with_json 
from responses r 
    join xml_with_jsons x on x.file = r.source_url_sha
group by r.host
order by responses_with_json DESC;
"""
df = pd.read_sql(sql, engine)

In [5]:
from IPython.display import display
with pd.option_context('display.max_rows', 100):
    display(df)

Unnamed: 0,host,responses_with_json
0,datasets.antwerpen.be,13062
1,data.cdc.gov,2959
2,data.baltimorecity.gov,2530
3,data.ok.gov,1413
4,data.hartford.gov,1406
5,data.lacity.org,1355
6,data.kingcounty.gov,1227
7,data.seattle.gov,1215
8,catalog.data.gov,1003
9,data.cityofnewyork.us,901


It's almost entirely open data sites.

In [6]:
sql = """
select x.xpath, count(x.xpath) as xpaths 
from responses r 
    join xml_with_jsons x on x.file = r.source_url_sha
group by x.xpath
order by xpaths DESC;
"""
df = pd.read_sql(sql, engine)

In [8]:
from IPython.display import display
with pd.option_context('display.max_colwidth', 1000):
    display(df)

Unnamed: 0,xpath,xpaths
0,dataset/data/record/geometry,13062
1,response/row/row/location_1/@human_address,8361
2,response/row/row/location/@human_address,2808
3,RDF/Dataset/relation/Description/value,2528
4,response/row/row/geom/@human_address,1073
5,response/row/row/mailing_address/@human_address,1062
6,response/row/row/city/@human_address,971
7,view/columns/viewColumn/cachedContents/top/item/item/@human_address,812
8,response/row/row/geocode/@human_address,684
9,response/row/row/geolocation/@human_address,533


And even in that, it's almost always a chunk of GeoJSON or address information. There are a couple related to FGDC or ISO supplemental information blocks.
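
For reference, here's a minimal sketch of how JSON hiding in element text or attribute values could be detected and reported with an xpath-like string of the kind shown above. It is not the code that populated `xml_with_jsons`; the sample XML and the `try_json` and `json_in_xml` helpers are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response with a JSON string in a human_address attribute.
SAMPLE = """<response><row><row>
  <location_1 human_address='{"address":"1 Main St","city":"Example"}' latitude="0" longitude="0"/>
  <name>not json</name>
</row></row></response>"""


def try_json(text):
    """Return parsed JSON if text looks like a JSON object or array, else None."""
    if text is None:
        return None
    text = text.strip()
    if not text or text[0] not in "{[":
        return None
    try:
        return json.loads(text)
    except ValueError:
        return None


def json_in_xml(element, path=""):
    """Walk the tree and return (path, parsed_json) pairs for text and attributes."""
    name = element.tag.split('}')[-1]
    current = name if not path else path + "/" + name
    hits = []
    parsed = try_json(element.text)
    if parsed is not None:
        hits.append((current, parsed))
    for attr, value in element.attrib.items():
        parsed = try_json(value)
        if parsed is not None:
            hits.append((current + "/@" + attr, parsed))
    for child in element:
        hits.extend(json_in_xml(child, current))
    return hits


hits = json_in_xml(ET.fromstring(SAMPLE))
```

Only values that start with `{` or `[` are tried, so bare numbers and quoted strings (which are technically valid JSON) are ignored on purpose.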

In [9]:
sql = """
with i as 
(
	select d.response_id, jsonb_array_elements(d.identity::jsonb) ident
	from identities d
	where d.identity is not null
)

select x.xpath, count(r.source_url) as responses_with_json, i.ident->'protocol' as protocol
from responses r 
	join xml_with_jsons x on x.file = r.source_url_sha
	join i on i.response_id = r.id
group by x.xpath, protocol
order by x.xpath, responses_with_json DESC;
"""
df = pd.read_sql(sql, engine)
from IPython.display import display
with pd.option_context('display.max_colwidth', 1000):
    display(df)

Unnamed: 0,xpath,responses_with_json,protocol
0,metadata/idinfo/descript/supplinf,23,FGDC
1,MI_Metadata/identificationInfo/MD_DataIdentification/supplementalInformation/CharacterString,18,ISO


In [10]:
sql = """
with i as 
(
	select d.response_id, jsonb_array_elements(d.identity::jsonb) ident
	from identities d
	where d.identity is not null
)

select x.xpath,x.extracted_json, i.ident->'protocol' as protocol
from responses r 
	join xml_with_jsons x on x.file = r.source_url_sha
	join i on i.response_id = r.id
order by x.xpath;
"""

df = pd.read_sql(sql, engine)
from IPython.display import display
with pd.option_context('display.max_colwidth', 1000):
    display(df)

Unnamed: 0,xpath,extracted_json,protocol
0,metadata/idinfo/descript/supplinf,{u'gdaId': 6065816},FGDC
1,metadata/idinfo/descript/supplinf,"{u'pageWidth': 22.75, u'scale': 24000, u'gdaId': 5939749, u'cellId': 23661, u'pageHeight': 29}",FGDC
2,metadata/idinfo/descript/supplinf,{u'gdaId': 6065815},FGDC
3,metadata/idinfo/descript/supplinf,"{u'pageWidth': 22.75, u'scale': 24000, u'gdaId': 5938475, u'cellId': 7111, u'pageHeight': 29}",FGDC
4,metadata/idinfo/descript/supplinf,{u'gdaId': 6152781},FGDC
5,metadata/idinfo/descript/supplinf,{u'gdaId': 6065815},FGDC
6,metadata/idinfo/descript/supplinf,{u'gdaId': 7086663},FGDC
7,metadata/idinfo/descript/supplinf,{u'gdaId': 7088320},FGDC
8,metadata/idinfo/descript/supplinf,"{u'pageWidth': 22.75, u'scale': 24000, u'gdaId': 5546652, u'cellId': 23927, u'pageHeight': 29}",FGDC
9,metadata/idinfo/descript/supplinf,{u'gdaId': 6152782},FGDC


These look like representation artifacts: whatever it is (layout information for some reporting system?), it'll be in both. The ISO/FGDC ones are all from data.gov.