Setting up csv-reconcile-geo #3

Closed
VojtechDostal opened this issue Apr 13, 2021 · 12 comments

Comments

@VojtechDostal

FWIW, if you don't mind running your own reconciliation service, I've just written a geo scoring plugin for csv-reconcile.

With this you could, say, run a SPARQL query to find the coordinate locations of the points you're looking to match against, export the result as a TSV file, and use that file to run csv-reconcile.

You can get the service up and running as simply as the following:

$ python -m venv serverenv
$ source serverenv/bin/activate
$ python -m pip install csv-reconcile
$ python -m pip install csv-reconcile-geo
$ csv-reconcile --init-db query.tsv item coord --scorer geo 

Here item is the name of the column containing the QIDs, and coord is the name of the coordinate column in well-known text (WKT) format, the default export format for coordinates.

This was just my first pass at it. There's certainly room for improvement, but it may suit your immediate needs.

@gitonthescene Please could you assist me with this? I am a bit disoriented and I am not sure if I understand the overall idea of 'my own' reconciliation service correctly. Am I right in assuming that I need to load File number 1 into openrefine, load File number 2 into command line via the commands above, add a reconciliation service "http://127.0.0.1:5000/reconcile" to OpenRefine and reconcile?

I think I was able to start virtualenv on my system (I am on Windows and "source" did not work, but I think I was able to find a solution at https://stackoverflow.com/questions/8921188/issue-with-virtualenv-cannot-activate) and then I was able to install csv-reconcile and csv-reconcile-geo. However, this is what I get when I run the program:

(venv) C:\Users\vojte\Downloads>csv-reconcile --init-db query.tsv item coord --scorer geo
c:\users\vojte\venv\lib\site-packages\normality\__init__.py:72: ICUWarning: Install 'pyicu' for better text transliteration.
  text = ascii_text(text)
Traceback (most recent call last):
  File "C:\Users\vojte\AppData\Local\Programs\Python\Python37-32\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\vojte\AppData\Local\Programs\Python\Python37-32\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\vojte\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\users\vojte\venv\lib\site-packages\csv_reconcile\__init__.py", line 210, in main
    initdb.init_db()
  File "c:\users\vojte\venv\lib\site-packages\csv_reconcile\initdb.py", line 76, in init_db
    (mid, word) + tuple(matchFields))
sqlite3.IntegrityError: UNIQUE constraint failed: reconcile.id

My query.tsv is from https://w.wiki/3BV9

What do you think is happening? Sorry to spam the issue with my questions

Originally posted by @VojtechDostal in wetneb/openrefine-wikibase#101 (comment)

@gitonthescene
Owner

Hi there. The last part of the error says that a UNIQUE constraint failed. This is most likely because your id column does not contain distinct values. From the description in the README.org:

The CSV file must contain a column containing distinct values to reconcile to. We’ll call this the id column. We’ll call the column being reconciled against the name column.

To fix this, you'll need to make sure you have at most one coordinate for each id. You can do this by loading this file into OpenRefine and removing duplicates.

Checking your query it looks like these items have more than one P625:

wd:Q24971118
wd:Q24971120
wd:Q24971143
wd:Q45118284
wd:Q59770038
wd:Q60480402
wd:Q64504015
wd:Q64759900
wd:Q64800690
wd:Q64815574
wd:Q64815820
wd:Q64815854
wd:Q64816265
wd:Q68029107
wd:Q68915501
wd:Q94433573
wd:Q94435811
wd:Q94443559

Ideally each of these would only have one coordinate location with preferred rank.

The good news is it looks like you're very close to getting this working.
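
If you would rather script the deduplication than do it in OpenRefine, a minimal Python sketch follows. It assumes the file is tab-separated with an item header, as in the command above. The function name dedupe_tsv is made up for illustration and is not part of csv-reconcile, and it simply keeps the first coordinate seen for each id, which may not be the one you want.

```python
import csv
import io


def dedupe_tsv(src, dst, id_col="item"):
    """Copy TSV rows from src to dst, keeping the first row for each id.

    Hypothetical helper, not part of csv-reconcile. Returns the ids whose
    extra rows were dropped, one entry per dropped row.
    """
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.DictWriter(
        dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    seen = set()
    dropped = []
    for row in reader:
        key = row[id_col]
        if key in seen:
            dropped.append(key)  # duplicate id: skip this row
            continue
        seen.add(key)
        writer.writerow(row)
    return dropped


# Made-up sample data: Q1 has two coordinates, Q2 has one.
sample = (
    "item\tcoord\n"
    "http://www.wikidata.org/entity/Q1\tPoint(14.1 50.1)\n"
    "http://www.wikidata.org/entity/Q1\tPoint(14.2 50.2)\n"
    "http://www.wikidata.org/entity/Q2\tPoint(15.0 49.0)\n"
)
out = io.StringIO()
dropped = dedupe_tsv(io.StringIO(sample), out)
print(dropped)
print(out.getvalue())
```

For real files, open both with newline="" and an explicit encoding, for example open("query.tsv", newline="", encoding="utf-8"), and pass the file objects in place of the StringIO buffers.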

@gitonthescene
Owner

Alternatively, you could use GROUP BY with the SAMPLE aggregate in your query to pick one of the coordinates as in the following:

SELECT ?item (SAMPLE(?coord1) AS ?coord) WITH {
  SELECT * WHERE {
    ?item wdt:P31/wdt:P279* wd:Q1746392 .
    ?item wdt:P17 wd:Q213 .
  }
} AS %pamatky WHERE {
  INCLUDE %pamatky .
  ?item wdt:P131|wdt:P131/wdt:P131|wdt:P131/wdt:P131/wdt:P131 wd:Q1085 .
  ?item wdt:P625 ?coord1 .
} GROUP BY ?item

@VojtechDostal
Author

VojtechDostal commented Apr 14, 2021

@gitonthescene Thanks so much! I suspected that something much more serious was going on in the error log. I am now able to run the reconciliation service, add it to OpenRefine, and successfully use it on my OpenRefine project. I can see the correct item QIDs when I use the "cell.recon.match.id" expression on the reconciled column. The only thing that's a bit impractical is that I don't see QIDs in the default view. I gathered from https://github.com/gitonthescene/csv-reconcile that I should create a config file and use the MANIFEST setting to be able to see those. I thus created a "config.txt" file with these contents:

MANIFEST = {
  "identifierSpace": "http://www.wikidata.org/entity/",
  "schemaSpace": "http://www.wikidata.org/prop/direct/",
  "view": {"url":"https://www.wikidata.org/wiki/{{id}}"},
  "name": "My reconciliation service"
}

I've put the file into the same folder as query.tsv and pointed to it using:

$ csv-reconcile --init-db query.tsv item coord --scorer geo --config config.txt

That did create a reconciliation service but did not help with the default view. Could you please help with that last step?

Thank you very much.

@gitonthescene
Owner

I'm not sure what you mean by "the default view". Usually, when something reconciles, what's shown is the value it got reconciled to. In this case it would show the matching coordinates. It looks like you grabbed the part of the manifest which makes this clickable to take you to the Wikidata page.

If you want the preview from hovering over a candidate, you'll want to copy the preview section of the manifest. Namely "preview":{"height":100,"url":"https://wikidata.reconci.link/en/preview?id={{id}}","width":400}. Is this what you're looking for?

Maybe posting a screenshot would help.

@VojtechDostal
Author

I'd like to display QID in the reconciled suggestions (instead of the values it got reconciled to) and make it clickable to go to respective Wikidata item...

[Screenshot: Bez názvu (Untitled)]

This is because I usually need to go through some of the suggestions manually and it's practical to just have a clickable link to Wikidata.

I tried to put your text into my config file, instead of the original contents, but it produced the following error:

(venv) C:\Users\vojte\Downloads>csv-reconcile --init-db query.tsv item coord --scorer geo --config config.txt
Traceback (most recent call last):
  File "c:\users\vojte\appdata\local\programs\python\python37-32\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\vojte\appdata\local\programs\python\python37-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\vojte\AppData\Local\Programs\Python\Python37-32\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "c:\users\vojte\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\vojte\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\users\vojte\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\vojte\appdata\local\programs\python\python37-32\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\users\vojte\appdata\local\programs\python\python37-32\lib\site-packages\csv_reconcile\__init__.py", line 207, in main
    app = create_app(dict(CSVFILE=csvfile, CSVCOLS=(idcol, namecol)), config)
  File "c:\users\vojte\appdata\local\programs\python\python37-32\lib\site-packages\csv_reconcile\__init__.py", line 56, in create_app
    app.config.from_pyfile(config)
  File "c:\users\vojte\appdata\local\programs\python\python37-32\lib\site-packages\flask\config.py", line 132, in from_pyfile
    exec(compile(config_file.read(), filename, "exec"), d.__dict__)
  File "C:\Users\vojte\Downloads\config.txt", line 1
    "preview":{"height":100,"url":"https://wikidata.reconci.link/en/preview?id={{id}}","width":400}
    ^
SyntaxError: illegal target for annotation

@gitonthescene
Owner

gitonthescene commented Apr 14, 2021

Oh, okay. Like I said, what you had originally should make the link clickable once an entry is reconciled, and what’s displayed is the value that matched, not the id. That first value in your screenshot should be clickable. I believe you’ll need to make a choice on the second for it to be clickable, but maybe the candidates are already clickable.

That code for preview needs to be added to the MANIFEST entry you had, not replace it. You’ll need to put it before the closing } and put a , on the line before it to separate it from the name entry. This is because it needs to be valid JSON syntax. But again this is for hovering over candidates. If you don’t need that, you can leave it out.
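
The traceback above shows the config file being loaded through Flask's from_pyfile, which compiles the file as Python source, so a bare "preview": ... line is rejected before the service starts. Below is a rough way to check a config before launching. It assumes only that the file has to define a MANIFEST dict; the helper load_manifest is hypothetical and is not something csv-reconcile provides.

```python
def load_manifest(text, filename="config.txt"):
    """Compile and execute config text as Python, then return its MANIFEST.

    Hypothetical checker. Only run it on config files you wrote yourself,
    because the text is executed.
    """
    namespace = {}
    exec(compile(text, filename, "exec"), namespace)
    manifest = namespace.get("MANIFEST")
    if not isinstance(manifest, dict):
        raise ValueError("config does not define a MANIFEST dict")
    return manifest


# The fragment on its own, as in the failing run.
BAD = (
    '"preview":{"height":100,'
    '"url":"https://wikidata.reconci.link/en/preview?id={{id}}","width":400}\n'
)

# The fragment merged into the existing MANIFEST entry.
GOOD = """MANIFEST = {
  "identifierSpace": "http://www.wikidata.org/entity/",
  "schemaSpace": "http://www.wikidata.org/prop/direct/",
  "view": {"url":"https://www.wikidata.org/wiki/{{id}}"},
  "name": "My reconciliation service",
  "preview":{"height":100,"url":"https://wikidata.reconci.link/en/preview?id={{id}}","width":400}
}
"""

try:
    load_manifest(BAD)
    bad_error = None
except SyntaxError as exc:
    bad_error = exc

manifest = load_manifest(GOOD)
print(type(bad_error).__name__)
print(sorted(manifest))
```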

The reconciliation service always shows the value you reconcile to and not the id, but the link should still work. If you really want the id, you can add a column with cell.recon.match.id and then reconcile that column with “Use values as identifiers”. For this you can use the standard Wikidata reconciliation service.

@VojtechDostal
Author

@gitonthescene At last, it's working now. I made two mistakes: 1) I did not remove and re-add the service in OpenRefine after I changed the config file, and 2) the view URL template in my config file was wrong (by default the Query Service returns not a bare QID but the full URL, with http and everything).
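
To illustrate the second mistake, here is a small sketch of the {{id}} substitution. It assumes the client fills the template by plain string replacement; fill_view_url and qid_from_uri are names invented for this example.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def fill_view_url(template, identifier):
    """Substitute the identifier into a view URL template (assumed behaviour)."""
    return template.replace("{{id}}", identifier)


def qid_from_uri(identifier):
    """Return the bare QID if the identifier is a full entity URL."""
    if identifier.startswith(ENTITY_PREFIX):
        return identifier[len(ENTITY_PREFIX):]
    return identifier


# Ids exported by the Query Service are full URLs, not bare QIDs.
entity = "http://www.wikidata.org/entity/Q1085"

broken = fill_view_url("https://www.wikidata.org/wiki/{{id}}", entity)
working = fill_view_url("{{id}}", entity)

print(broken)
print(working)
print(qid_from_uri(entity))
```

An alternative would be to strip the prefix from the id column before running csv-reconcile, in which case the original "https://www.wikidata.org/wiki/{{id}}" template would fit.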

So, just for my future reference, the steps are:

  1. Create a config file with these contents:
MANIFEST = {
  "identifierSpace": "http://www.wikidata.org/entity/",
  "schemaSpace": "http://www.wikidata.org/prop/direct/",
  "view": {"url":"{{id}}"},
  "name": "Csv-reconcile geo",
  "preview":{"height":100,"url":"https://wikidata.reconci.link/en/preview?id={{id}}","width":400}
}
  2. Create a virtual environment and start the reconciliation service:
C:\Users\vojte\Downloads>virtualenv --system-site-packages -p python ./venv
created virtual environment CPython3.7.0.final.0-32 in 1687ms
  creator CPython3Windows(dest=C:\Users\vojte\Downloads\venv, clear=False, no_vcs_ignore=False, global=True)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=C:\Users\vojte\AppData\Local\pypa\virtualenv)
    added seed packages: pip==21.0.1, setuptools==54.1.2, wheel==0.36.2
  activators BashActivator,BatchActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator

C:\Users\vojte\Downloads>.\venv\Scripts\activate

(venv) C:\Users\vojte\Downloads>csv-reconcile --init-db query.tsv item coord --scorer geo --config config.txt
c:\users\vojte\appdata\local\programs\python\python37-32\lib\site-packages\normality\__init__.py:72: ICUWarning: Install 'pyicu' for better text transliteration.
  text = ascii_text(text)
Initialized the database.
 * Serving Flask app "csv-reconcile" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
  3. Start reconciling in OpenRefine on the coord column, which must be in the default format for coordinates, POINT( longitude latitude ), using this service URL:

http://127.0.0.1:5000/reconcile
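
To check the service outside OpenRefine, something like the following might help. It is only a sketch: it assumes the service accepts the standard reconciliation API "queries" form parameter at the URL above. build_queries and post_queries are hypothetical helper names, and the response is not shown because none was captured in this thread.

```python
import json
import urllib.parse
import urllib.request

SERVICE_URL = "http://127.0.0.1:5000/reconcile"


def build_queries(values):
    """Build a reconciliation query batch keyed q0, q1, ... (one per value)."""
    return {f"q{i}": {"query": value} for i, value in enumerate(values)}


def post_queries(values, url=SERVICE_URL):
    """POST the batch as the 'queries' form field and decode the JSON reply.

    Requires the service to be running; not called in this sketch.
    """
    form = {"queries": json.dumps(build_queries(values))}
    data = urllib.parse.urlencode(form).encode("utf-8")
    with urllib.request.urlopen(url, data=data) as response:
        return json.load(response)


# Made-up coordinate in the same WKT form as the coord column.
queries = build_queries(["Point(14.42 50.09)"])
print(json.dumps(queries))

# With the service running, uncomment to send the batch:
# print(post_queries(["Point(14.42 50.09)"]))
```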

Thank you again, @gitonthescene. I'll spread the word about this great tool and plugin in the Czech Wikidata community :)

@gitonthescene
Owner

@VojtechDostal Hey. Just a heads up. It looks like I swapped latitude and longitude in my distance calculation. I'll be publishing a fix later today. Sorry for the mix up.
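
For anyone curious how much such a swap matters, here is an illustrative sketch. It is not the plugin's code: it assumes coordinates arrive as WKT Point(longitude latitude) and uses the haversine formula with a mean Earth radius of 6371 km. parse_wkt_point and haversine_km are names made up for this example, and the two points are invented.

```python
import math
import re

# Matches "Point(lon lat)" in any letter case, with optional inner spaces.
_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_wkt_point(text):
    """Return (longitude, latitude) from a WKT point string."""
    match = _POINT.match(text)
    if match is None:
        raise ValueError(f"not a WKT point: {text!r}")
    return float(match.group(1)), float(match.group(2))


def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


a = parse_wkt_point("Point(14.0 50.0)")
b = parse_wkt_point("POINT( 15.0 50.0 )")

# Arguments in the right order: one degree of longitude at 50 degrees north.
correct = haversine_km(a[0], a[1], b[0], b[1])
# Longitude and latitude swapped: the same numbers read as one degree of latitude.
swapped = haversine_km(a[1], a[0], b[1], b[0])

print(round(correct, 1), round(swapped, 1))
```

At this latitude the swapped reading overstates the distance by roughly half, which is enough to reorder candidates.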

@VojtechDostal
Author

I think I swapped them in my table to compensate for that. It hadn't surprised me because lat and lon are sometimes deliberately swapped in some tools for some reason.

I noticed that your reconciliation service sometimes does not find the best match it is supposed to find: one that is really close to the input object and is included in the query results. Are you interested in examples of such cases if I come across them in the future?

@gitonthescene
Owner

Yes, please. You can open a new issue for anything strange you see.

Just to be clear, in the most recent version uploaded to PyPI you shouldn’t need to swap lat and longitude. I probably should have updated the minor version.

@VojtechDostal
Author

VojtechDostal commented Nov 4, 2022

The correct invocation now seems to be:

csv-reconcile init --scorer geo --config config.txt query.tsv item coord

Then I need to do the following:

csv-reconcile serve

Then it works.
I must remember to always reconcile on the wkt-literal value column formatted as point(lat lon)

@gitonthescene
Owner

I’m traveling currently but will have a look at this when I’m back.
