### Blended Metadata Stats

"Blended" here means one format embedded in another: ISO in FGDC, FGDC in ISO, or JSON in XML.



In [23]:
%matplotlib inline
import pandas as pd
import json as js  # name conflict with sqla
import sqlalchemy as sqla
from sqlalchemy.orm import sessionmaker

In [24]:
# load the RDS connection settings from a local config file
with open('../local/big_rds.conf', 'r') as f:
    conf = js.load(f)

# our connection (a missing 'connection' key fails here with a KeyError)
engine = sqla.create_engine(conf['connection'])

#### Starting with ISO in the FGDC

At least from what I've seen to date, this is CI_OnlineResource elements only. It's a workaround for including some basic information about a link that isn't part of the FGDC spec: flagging WMS or landing page URLs, things like that.
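To make the `tags` values in the queries below concrete, here is a minimal sketch of how a path to an embedded CI_OnlineResource could be derived from an FGDC record. The sample XML, the `linkage` child, and the `find_paths` helper are all invented for illustration; this is not the extraction code that populated `blended_metadatas`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical FGDC fragment with ISO CI_OnlineResource elements mixed in.
# The URLs and the <linkage> child are assumptions, not taken from real records.
SAMPLE = """<metadata>
  <idinfo><citation><citeinfo>
    <onlink>http://example.org/data</onlink>
    <CI_OnlineResource><linkage>http://example.org/landing</linkage></CI_OnlineResource>
  </citeinfo></citation></idinfo>
  <distinfo><stdorder><digform><digtopt><onlinopt><computer><networka>
    <networkr>http://example.org/wms</networkr>
    <CI_OnlineResource><linkage>http://example.org/wms</linkage></CI_OnlineResource>
  </networka></computer></onlinopt></digtopt></digform></stdorder></distinfo>
</metadata>"""


def local_name(tag):
    """Drop a '{namespace}' prefix if ElementTree attached one."""
    return tag.rsplit('}', 1)[-1]


def find_paths(elem, target, prefix=''):
    """Yield the slash-separated path to every element named `target`."""
    name = local_name(elem.tag)
    path = prefix + '/' + name if prefix else name
    if name == target:
        yield path
        return
    for child in elem:
        for found in find_paths(child, target, path):
            yield found


# Count how often each path occurs in this one (hypothetical) record
tag_counts = Counter(find_paths(ET.fromstring(SAMPLE), 'CI_OnlineResource'))
for path, n in sorted(tag_counts.items()):
    print('{} {}'.format(path, n))
```

The two paths this produces have the same shape as the citation and distribution tags counted in the query results.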

In [4]:
sql = """
select b.response_id, b.tags, count(b.tags) as num_per_tag
from blended_metadatas b
where source_standard = 'FGDC'
group by b.response_id, b.tags
order by b.response_id, num_per_tag DESC;
"""

df = pd.read_sql(sql, engine)

from IPython.display import display
with pd.option_context('display.max_colwidth', 1000):
    display(df[:10])

|   | response_id | tags                                                                                    | num_per_tag |
|---|-------------|-----------------------------------------------------------------------------------------|-------------|
| 0 | 137746      | metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource | 57          |
| 1 | 137746      | metadata/idinfo/citation/citeinfo/CI_OnlineResource                                     | 5           |
| 2 | 138577      | metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource | 57          |
| 3 | 138577      | metadata/idinfo/citation/citeinfo/CI_OnlineResource                                     | 5           |
| 4 | 140207      | metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource | 57          |
| 5 | 140207      | metadata/idinfo/citation/citeinfo/CI_OnlineResource                                     | 5           |
| 6 | 140236      | metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource | 2           |
| 7 | 140744      | metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource | 57          |
| 8 | 140744      | metadata/idinfo/citation/citeinfo/CI_OnlineResource                                     | 5           |
| 9 | 142472      | metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource | 57          |


There's a pattern: the same counts (57 and 5) repeat across responses. Are they all from \*DAP servers?

Update: pretty much. The data.gov ones could be as well, but going solely on the URL, you can't tell for data.gov.


| Host                       |  Number of Responses |  Is it ERDDAP? | 
|----------------------------|----------------------|----------------| 
| "upwell.pfeg.noaa.gov"     | 4389                 | "ERDDAP"       | 
| "oos.soest.hawaii.edu"     | 4788                 | "ERDDAP"       | 
| "ecowatch.ncddc.noaa.gov"  | 4218                 | "ERDDAP"       | 
| "data.nanoos.org"          | 204                  | "ERDDAP"       | 
| "coastwatch.pfeg.noaa.gov" | 11172                | "ERDDAP"       | 
| "catalog.data.gov"         | 81                   | "OTHER"        | 
| "bluehub.jrc.ec.europa.eu" | 4902                 | "ERDDAP"       | 


The sql for that:

```
select r.host, count(r.host), case when r.source_url ilike '%erddap%' then 'ERDDAP' else 'OTHER' end as is_it_dap
from blended_metadatas b join responses r on r.id = b.response_id
where source_standard = 'FGDC' and tags ilike '%networka%'
group by is_it_dap, r.host
order by r.host DESC;
```
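The same host/ERDDAP classification can be sketched in Python for anyone working from an exported list of source URLs rather than the database. The URLs below are invented placeholders and `classify` is a hypothetical helper; the only logic carried over from the SQL is the case-insensitive check for "erddap" anywhere in the URL.

```python
from collections import Counter

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Hypothetical source URLs, invented for illustration only
source_urls = [
    'http://coastwatch.pfeg.noaa.gov/erddap/griddap/exampleA_fgdc.xml',
    'http://coastwatch.pfeg.noaa.gov/ERDDAP/tabledap/exampleB_fgdc.xml',
    'https://catalog.data.gov/harvest/object/example-id',
]


def classify(url):
    """Return (host, label), mirroring the ilike '%erddap%' test in the SQL."""
    host = urlparse(url).netloc
    label = 'ERDDAP' if 'erddap' in url.lower() else 'OTHER'
    return host, label


# Equivalent of "group by is_it_dap, r.host" with a count per group
counts = Counter(classify(u) for u in source_urls)
for (host, label), n in sorted(counts.items(), reverse=True):
    print('{} {} {}'.format(host, n, label))
```

As with the SQL, this only looks at the URL text, so a server that is ERDDAP but doesn't say so in its path (the data.gov case) lands in "OTHER".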

In [11]:
sql = """
select distinct b.tags
from blended_metadatas b
where source_standard = 'FGDC';
"""

df = pd.read_sql(sql, engine)
for i in df.itertuples():
    # i[0] is the index; i[1] is the tags column
    print(i[1])

metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource
metadata/idinfo/citation/citeinfo/CI_OnlineResource
metadata/idinfo/citation/citeinfo/lworkcit/citeinfo/CI_OnlineResource


These are only used in two places: citations (including larger work citations) and distributions.

In [13]:
sql = """
select b.tags, count(b.response_id) as num_per_tag
from blended_metadatas b
where source_standard = 'FGDC'
group by b.tags
order by num_per_tag DESC;
"""
df = pd.read_sql(sql, engine)
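# note: count(b.response_id) counts rows per tag, so a response with
# several CI_OnlineResource elements is counted more than once
for i in df.itertuples():
    print('{} is found in {} FGDC records.'.format(i[1], i[2]))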

metadata/distinfo/stdorder/digform/digtopt/onlinopt/computer/networka/CI_OnlineResource is found in 29754 FGDC records.
metadata/idinfo/citation/citeinfo/CI_OnlineResource is found in 2530 FGDC records.
metadata/idinfo/citation/citeinfo/lworkcit/citeinfo/CI_OnlineResource is found in 3 FGDC records.


One last question - are these all from the same place (and maybe roughly the same time period)?

In [14]:
sql = """
select distinct r.host, date_trunc('month', r.metadata_age)::date as date_bin
from blended_metadatas b join responses r on r.id = b.response_id
where source_standard = 'FGDC'
order by r.host DESC, date_bin ASC;
"""

df = pd.read_sql(sql, engine)

In [15]:
df

|   | host                    | date_bin   |
|---|-------------------------|------------|
| 0 | upwell.pfeg.noaa.gov    | 2015-07-01 |
| 1 | upwell.pfeg.noaa.gov    | 2015-08-01 |
| 2 | upwell.pfeg.noaa.gov    | 2015-09-01 |
| 3 | oos.soest.hawaii.edu    | 2015-06-01 |
| 4 | oos.soest.hawaii.edu    | 2015-07-01 |
| 5 | oos.soest.hawaii.edu    | 2015-08-01 |
| 6 | oos.soest.hawaii.edu    | 2015-09-01 |
| 7 | ecowatch.ncddc.noaa.gov | 2015-07-01 |
| 8 | ecowatch.ncddc.noaa.gov | 2015-08-01 |
| 9 | ecowatch.ncddc.noaa.gov | 2015-09-01 |


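No to the first and almost to the second: the hosts differ, but the dates all fall within a few recent months, so it is a very recent thing in this set. See the "Is it ERDDAP?" table above.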

#### How about FGDC in the ISO?

The XML isn't mix-and-match here; there's no inclusion of an FGDC namespace or anything like that. But some things that people need or want to know about spatial data are hard to capture in the ISO standard, so the workaround is to park them in a comment.

72,375 comments were extracted in total. Of those, 58,650 contain more than just the term "ORIGIN".

40,795 relate to FGDC text matching one of the following patterns:

```
'%FGDC content not mapped to ISO. From Xpath: %'
'% translated from % to % '
```

The first is the workaround itself: there is no good place in ISO to put that content. The second is often related to the attribute information and trying to avoid the external 19110 record.
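As a rough illustration of how the two patterns behave, here is a sketch that translates the ILIKE patterns into regular expressions and tags a comment with every code it matches. The helper names and the sample comments are made up; the second pattern uses the form from the query (no surrounding spaces), which is an assumption about which variant produced the counts.

```python
import re


def ilike_to_regex(pattern):
    """Translate a SQL ILIKE pattern (% and _ wildcards) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')   # % matches any run of characters
        elif ch == '_':
            parts.append('.')    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    # ILIKE matches the whole string, case-insensitively
    return re.compile(''.join(parts) + r'\Z', re.IGNORECASE | re.DOTALL)


PATTERNS = {
    'not mapped': ilike_to_regex('%FGDC content not mapped to ISO. From Xpath: %'),
    'translated': ilike_to_regex('%translated from % to %'),
}


def codes_for(comment):
    """Return the sorted list of codes whose pattern matches the comment."""
    return sorted(code for code, rx in PATTERNS.items() if rx.match(comment))


# Hypothetical comments, invented for illustration
print(codes_for('FGDC content not mapped to ISO. From Xpath: /metadata/eainfo/detailed'))
print(codes_for('Record translated from FGDC to ISO 19115-2'))
print(codes_for('ORIGIN'))
```

A single comment can match both patterns, in which case `codes_for` returns both codes.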

In [30]:
sql = """
with i as (
    select id, response_id, tags, extracted_info, 'not mapped' as code
    from blended_metadatas
    where source_standard = 'ISO'
        and extracted_info ilike '%FGDC content not mapped to ISO. From Xpath: %'

    union all
    
    select id, response_id, tags, extracted_info, 'translated' as code
    from blended_metadatas
    where source_standard = 'ISO'
        and extracted_info ilike '%translated from % to %'
)
select tags, code, count(code) as num_per_tag
from i
group by tags, code
order by tags, num_per_tag;
"""


Ok, no idea why pandas complains about this query. It works directly in Postgres, so here's the data from there. This is also not very good data: the two patterns can coincide, so the same comment can be counted under both codes.



| Tags                                                                                                                                                | Pattern      | Count | 
|-----------------------------------------------------------------------------------------------------------------------------------------------------|--------------|-------| 
| "MI_Metadata"                                                                                                                                       | "not mapped" | 2647  | 
| "MI_Metadata"                                                                                                                                       | "translated" | 3037  | 
| "MI_Metadata/contentInfo"                                                                                                                           | "not mapped" | 6685  | 
| "MI_Metadata/contentInfo"                                                                                                                           | "translated" | 10197 | 
| "MI_Metadata/contentInfo/MD_CoverageDescription"                                                                                                    | "not mapped" | 6685  | 
| "MI_Metadata/contentInfo/MD_CoverageDescription/dimension/MD_Band"                                                                                  | "not mapped" | 23436 | 
| "MI_Metadata/distributionInfo/MD_Distribution/distributor/MD_Distributor/distributionOrderProcess/MD_StandardOrderProcess"                          | "not mapped" | 84    | 
| "MI_Metadata/distributionInfo/MD_Distribution/distributor/MD_Distributor/distributionOrderProcess/MD_StandardOrderProcess/plannedAvailableDateTime" | "not mapped" | 6     | 
| "MI_Metadata/spatialRepresentationInfo/MD_GridSpatialRepresentation/cellGeometry"                                                                   | "not mapped" | 1252  | 
