duplicate file names and event data in source Dataverse repository #45
For some reason there are two files named "20190309-icews-events.zip" in the dataverse repo (they don't contain the same events). The duplicate file names are causing issues, e.g. the download right now is by the file name/label. For me this is causing a timeout when it tries to download the file, but I'm guessing it's also what leads to the error you are getting. I'll have to change how the files are downloaded and labelled for the database. |
Looking at the metadata of the two files, at least four attributes differ and may help to tell them apart. Those four things are:
Those might help to form a filename if needed. |
Hey, thank you. I've been (and still am) on vacation, but had a chance to look a bit more. Gonna paste this here, partly for myself. Aside from the repeated The files are downloaded using the
library("icews")
library("dataverse")
file_list = get_dataset(get_doi()$daily)
head(file_list$files[, c("label", "id")])
Right now the file name is used to reconcile the local and remote states, so that will have to switch. There are two tables,
library("RSQLite")
con = connect()
dbGetQuery(con, "select * from source_files limit 5;")
dbGetQuery(con, "select * from null_source_files limit 5;")
The source file is also included in the events table (i.e. I'm going to have to switch those internal tables up. Maybe have a new source_file table with something like
The various state and downloader functions will need to be switched to use the integer ID instead of file name. There's probably also going to have to be some kind of one-time upgrade functionality that implements these changes on an existing table, to avoid having to nuke and re-download everything. Well, lesson for me to not use file names as unique IDs when there's already a perfectly good unique ID on dataverse. I hope to get to this at the end of this week or next week. |
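To make the idea of reconciling by ID concrete, here is a minimal sketch, not the package's actual schema. The `source_files` layout, the `file_id` column, and the toy `remote` listing are all assumptions for illustration; only the fact that Dataverse returns both a `label` and an `id` per file comes from the listing above.

```r
library(DBI)

# Hypothetical remote listing: two files share a label but have distinct ids,
# mimicking the duplicate "20190309-icews-events.zip" on Dataverse.
remote <- data.frame(
  id    = c(101L, 102L, 103L),
  label = c("20190309-icews-events.zip",
            "20190309-icews-events.zip",
            "20190310-icews-events.zip"),
  stringsAsFactors = FALSE
)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Assumed layout: the integer file id is the key, the label is just metadata.
dbExecute(con, "
  CREATE TABLE source_files (
    file_id INTEGER PRIMARY KEY,
    label   TEXT NOT NULL
  );")

# Pretend the first of the two same-named files was already ingested.
dbExecute(con,
          "INSERT INTO source_files (file_id, label) VALUES (?, ?);",
          params = list(101L, "20190309-icews-events.zip"))

# Reconcile local and remote state by id rather than by label.
local_ids   <- dbGetQuery(con, "SELECT file_id FROM source_files;")$file_id
to_download <- remote[!remote$id %in% local_ids, ]

dbDisconnect(con)
```

With labels as the key, the second "20190309-icews-events.zip" would look like it was already present; keyed by ID it still shows up in `to_download`.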
Make DVN to local files work again:
Make local files to DB work again:
Misc
|
@mayeulk can you try updating the package and seeing if it works now? |
Hi, it seems to work, although I could not finish the process (full disk):
```
Ingesting records from 'events.2015.20180710092545.tab'
```
|
😀 that sounds about right. I just updated all the way through 15 June, and have ~8GB for the database and ~5GB for the raw ".tsv" files. |
Hi, it repeatedly fails now on '20190409-icews-events.zip' and '20190409-icews-events-1.zip'
|
Maybe related to this, the '@icews' Twitter feed mentions duplicates, see: https://twitter.com/icews?lang=en
|
I lowered the uniqueness requirement (primary key) as a quick, temporary fix, running this against the SQLite database:
I kept the old data to analyse this. Running again |
With my version of the database (with
On this new db, I ran the following to find duplicates:
There are 2434 rows returned ("duplicates"), of which: |
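For anyone who wants to run the same kind of check, this is a rough sketch of a duplicate search, not the query used above. It assumes an `events` table with `event_id` and `source_file` columns (hypothetical names) and uses a toy in-memory table in place of the real database.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy stand-in for the events table: events 1 and 2 appear in both files.
toy <- data.frame(
  event_id    = c(1L, 2L, 1L, 2L, 3L),
  source_file = c("20190409-icews-events.tab",
                  "20190409-icews-events.tab",
                  "20190409-icews-events-1.tab",
                  "20190409-icews-events-1.tab",
                  "20190410-icews-events.tab"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "events", toy)

# Event ids that occur more than once, and in how many distinct files.
dupes <- dbGetQuery(con, "
  SELECT event_id,
         COUNT(*)                    AS n,
         COUNT(DISTINCT source_file) AS n_files
  FROM events
  GROUP BY event_id
  HAVING COUNT(*) > 1
  ORDER BY event_id;")

dbDisconnect(con)
```

The `n_files` column separates events repeated across files from events repeated within a single file, which matters for deciding where the duplicates come from.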
I had the exact same issue with "20190409-icews-events-1.zip" and "20190409-icews-events.zip", and thought I had managed to fix it (#46). The two files contain the exact same set of events (by event ID), so what should happen is this:
Could you check if you get the same results for these queries?:
"20190409-icews-events-1.tab" only
Both "20190409-icews-events.tab" and "20190409-icews-events-1.tab".
All from the file version without "-1":
|
Here are the results (run on my SQLite db with duplicates):
"20190409-icews-events-1.tab"
|
I guess we can think of ways to remove duplicates in SQL, which might be faster than in R (or not). |
Here, I implement in SQL a solution to remove the duplicates linked to this issue.
|
This is a possible solution to duplicates (issue andybega#45).
Run over the full events table, the DELETE SQL query takes 15 s on my laptop. |
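As a sketch of what such a cleanup can look like in SQLite (not the exact query from the comment above), the following keeps the first-inserted row per event ID and deletes the rest. The `events` table, its column names, and the "keep the lowest `rowid`" rule are assumptions; a different rule for which copy to keep may be more appropriate for the real data.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy stand-in for the events table: events 1 and 2 were ingested twice.
toy <- data.frame(
  event_id    = c(1L, 2L, 1L, 2L, 3L),
  source_file = c("20190409-icews-events.tab",
                  "20190409-icews-events.tab",
                  "20190409-icews-events-1.tab",
                  "20190409-icews-events-1.tab",
                  "20190410-icews-events.tab"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "events", toy)

# Keep the row with the smallest rowid for each event_id, delete the others.
# dbExecute() returns the number of rows affected.
removed <- dbExecute(con, "
  DELETE FROM events
  WHERE rowid NOT IN (
    SELECT MIN(rowid) FROM events GROUP BY event_id
  );")

remaining <- dbGetQuery(con, "SELECT * FROM events ORDER BY event_id;")

dbDisconnect(con)
```

Because the surviving copy is simply the one inserted first, the `source_file` recorded for a kept event depends on ingest order.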
Running the update function now puts back the dupes from the 20190409 file:
|
I've changed the title to something more readable. I believe there are two options here:
|
Hi, given the recent changes in ICEWS dataverse, I think this is not an issue anymore. Duplicate events are still a problem, but that should be taken care of when ingesting new data. Have you tried updating the data recently? (It should work even with the previous data present, but might give some essentially ineffectual messages, #54 (comment)) |
After a fresh install on Ubuntu 18.04, the following fails after downloading 151 files (73.1 MB) with an error:
Launching
update_icews(dryrun = FALSE)
again and again does not solve the issue. The following (launched after the error) might help: