
duplicate file names and event data in source Dataverse repository #45

Closed
mayeulk opened this issue May 26, 2019 · 19 comments

@mayeulk
Contributor

mayeulk commented May 26, 2019

After a fresh install on Ubuntu 18.04, the following fails with an error after downloading 151 files (73.1 MB):

library("icews")
library("DBI")
library("dplyr")
library("usethis")
setup_icews(data_dir = "/home/mk/Documents/data/icews", use_db = TRUE, keep_files = TRUE,  r_profile = TRUE)

update_icews(dryrun = TRUE)
update_icews(dryrun = FALSE)

# (...... downloads 151 files, ingesting correctly 294687 rows in sqlite database)
Downloading '20190309-icews-events.zip'
Error in writeBin(as.vector(f), tmp) : can only write vector objects

Running update_icews(dryrun = FALSE) again and again does not resolve the issue.

The following (launched after the error) might help:

> update_icews(dryrun = TRUE)
File system changes:
Found 151 local data file(s)
Downloading 84 file(s)
Removing 0 old file(s)

Database changes:
Deleting old records for 0 file(s)
Ingesting records from 84 file(s)

Plan:
Download            '20190309-icews-events.zip'
Download            '20190309-icews-events.zip'
Ingest records from '20190309-icews-events.tab'
Ingest records from '20190309-icews-events.tab'
Download            '20190311-icews-events.zip'
Ingest records from '20190311-icews-events.tab'
Download            '20190312-icews-events.zip'
Ingest records from '20190312-icews-events.tab'
Download            '20190313-icews-events.zip'
Ingest records from '20190313-icews-events.tab'
Download            '20190314-icews-events.zip'
(etc.)
@andybega
Owner

For some reason there are two files named "20190309-icews-events.zip" in the dataverse repo (they don't contain the same events). The duplicate file names are causing issues; e.g., the download currently goes by the file name/label. For me this causes a timeout when it tries to download the file, but I'm guessing it's also what leads to the error you are getting. I'll have to change how the files are downloaded and labelled for the database.
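
As a side note, the error message itself is easy to reproduce: writeBin() only accepts atomic vectors, such as the raw vector a single-file download returns. A hypothetical illustration (not the package's actual code path):

# as.vector() on a list still returns a list, and writeBin() rejects lists
f <- list(as.raw(1:10), as.raw(1:10))  # stand-in for two files sharing one label
tmp <- tempfile(fileext = ".zip")
writeBin(as.vector(f), tmp)
# Error in writeBin(as.vector(f), tmp) : can only write vector objects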

@mayeulk
Contributor Author

mayeulk commented May 31, 2019

Looking at the metadata of the two files, they have at least four attributes that differ, which may help distinguish them:
[screenshot comparing the two files' metadata]
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/QI2T9A/DER3I5&version=200.0
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/QI2T9A/5DCQQF&version=200.0

Those 4 things are:

  • the MD5
  • the last part of the direct link: DER3I5 vs 5DCQQF
  • the deposit date: 2019-03-10 vs 2019-03-11
  • the first ICEWS version in which they appear

Those might help to form a filename if needed.
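
For example (a sketch only, not the package's current behavior), the DOI suffix could be appended to the label to make the file names unique:

# Hypothetical helper: append the unique DOI suffix (e.g. "DER3I5" or
# "5DCQQF") to the possibly duplicated dataverse label.
unique_file_name <- function(label, doi_suffix) {
  paste0(tools::file_path_sans_ext(label), "-", doi_suffix, ".",
         tools::file_ext(label))
}
unique_file_name("20190309-icews-events.zip", "DER3I5")
# "20190309-icews-events-DER3I5.zip"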
Hope this helps.

@andybega
Owner

andybega commented Jun 3, 2019

Hey, thank you. I've been (and still am) on vacation, but had a chance to look a bit more. Gonna paste this here, partly for myself. Aside from the repeated 20190309-icews-events.zip, there's also 20190409-icews-events-1.zip and 20190410-icews-events-1.zip, as well as 20190503-thru-20190519-icews-events.zip.

The files are downloaded using the dataverse package. Looking at get_file, which does the actual downloading, it seems that it is also possible to use a numeric ID instead of the file name. E.g.:

library("icews")
library("dataverse")

file_list = get_dataset(get_doi()$daily)
head(file_list$files[, c("label", "id")])
                      label      id
1 20181004-icews-events.zip 3234868
2 20181005-icews-events.zip 3235021
3 20181006-icews-events.zip 3238491
4 20181007-icews-events.zip 3238493
5 20181008-icews-events.zip 3238584
6 20181009-icews-events.zip 3238918
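
If so, downloading by the numeric ID would sidestep the duplicated labels entirely. A minimal sketch (assuming get_file() accepts the integer ID and returns the file's raw bytes):

library("dataverse")

# Sketch: fetch one file by its dataverse integer ID instead of by its
# (possibly duplicated) label, then write the raw bytes to disk.
raw_zip <- get_file(3234868)
writeBin(raw_zip, "20181004-icews-events.zip")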

Right now the file name is used to reconcile the local and remote states, so that will have to switch. There are two tables, source_files and null_source_files, that list the ingested source files (and source files that contained no new events and thus wouldn't show up in the events table source_file column).

library("RSQLite")
con = connect()
dbGetQuery(con, "select * from source_files limit 5;")
dbGetQuery(con, "select * from null_source_files limit 5;")

The source file is also included in the events table (i.e. query_icews("select * from events limit 5;") %>% str() will show a source_file column at the end), but that's more for informative purposes since querying the events table at every update would take a really long time.

I'm going to have to switch those internal tables up. Maybe have a new source_file table with something like the following (rough sketch below the list):

  • source_file_id: source file integer ID from dataverse
  • source_file_name: the possible duplicated source file name
  • source_id: maybe another ID for this table since the dataverse source file id is a pretty big integer. Not sure.
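
In SQLite, that could look roughly like this (a sketch of the proposed table only, nothing final; connect() is the package's existing helper):

library("DBI")
library("icews")

con <- connect()
# Rough sketch of the proposed lookup table (column names as in the list above).
dbExecute(con, "
  CREATE TABLE source_files_new (
    source_id        INTEGER PRIMARY KEY,  -- compact internal ID
    source_file_id   INTEGER NOT NULL,     -- dataverse file integer ID
    source_file_name TEXT NOT NULL         -- the possibly duplicated label
  );
")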

The various state and downloader functions will need to be switched to use the integer ID instead of file name.

There's probably also going to have to be some kind of one-time upgrade functionality that implements these changes on an existing table, to avoid having to nuke and re-download everything.

Well, lesson for me to not use file names as unique IDs when there's already a perfectly good unique ID on dataverse. I hope to get to this at the end of this week or next week.

@andybega
Owner

andybega commented Jun 12, 2019

Make DVN to local files work again:

  • create file name normalizer
  • create dictionary with dataverse file labels and IDs, as well as normalized local file names (rough sketch after this list)
  • switch get_dvn_state to new state format
  • switch get_local_state to new state format
  • update plan_file_changes
  • update execute_plan to use file ID for download

Make local files to DB work again:

  • switch get_db_state to new state format
  • update plan_database_sync for local file to DB steps (without downloading new files)
  • update plan_database_changes for DVN to local file to DB steps

Misc

  • check documentation downloader still works
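
For the dictionary mentioned above, something roughly like this (hypothetical sketch, not the final implementation; it appends the dataverse ID to make local names unique):

library("icews")
library("dataverse")

# Hypothetical sketch: one row per dataverse file, with the (possibly
# duplicated) label, the unique integer ID, and a normalized local name.
file_list <- get_dataset(get_doi()$daily)
dict <- data.frame(
  label = file_list$files$label,
  id    = file_list$files$id
)
dict$local_name <- paste0(tools::file_path_sans_ext(dict$label),
                          "-dvn", dict$id, ".zip")
head(dict)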

@andybega
Owner

@mayeulk can you try updating the package and seeing if it works now?

@mayeulk
Contributor Author

mayeulk commented Jun 16, 2019

Hi, it seems to work, although I could not finish the process (disk full):

Ingesting records from 'events.2015.20180710092545.tab'
|================================================================================================================================| 100% 241 MB
Error in result_bind(res@ptr, params) : database or disk is full
Error in result_create(conn@ptr, statement) :
no such savepoint: dbWriteTable

My previous attempts led to a database of 120 MB; this one is 6 GB.
I'll confirm in a few days (getting a new disk).
Thanks! Cheers,
Mayeul

@andybega
Owner

😀 that sounds about right. I just updated all the way through 15 June, and have ~8GB for the database and ~5GB for the raw ".tsv" files.

@mayeulk
Contributor Author

mayeulk commented Jun 19, 2019

Hi, it now repeatedly fails on '20190409-icews-events.zip' and '20190409-icews-events-1.zip':

Downloading '20190406-icews-events.zip'
Ingesting records from '20190406-icews-events.tab'
Downloading '20190407-icews-events.zip'
Ingesting records from '20190407-icews-events.tab'
Downloading '20190408-icews-events.zip'
Ingesting records from '20190408-icews-events.tab'
Downloading '20190409-icews-events.zip'
Ingesting records from '20190409-icews-events.tab'
Downloading '20190409-icews-events-1.zip'
Ingesting records from '20190409-icews-events-1.tab'
Error in result_bind(res@ptr, params) : 
  UNIQUE constraint failed: events.event_id, events.event_date
> date()
[1] "Wed Jun 19 18:59:23 2019"
> update_icews(dryrun = FALSE); date()
Downloading '20190409-icews-events.zip'
Ingesting records from '20190409-icews-events-1.tab'
Error in result_bind(res@ptr, params) : 
  UNIQUE constraint failed: events.event_id, events.event_date
> 

@mayeulk
Contributor Author

mayeulk commented Jun 19, 2019

Maybe related to this, the @icews Twitter feed mentions duplicates, see:

https://twitter.com/icews?lang=en
Direct link to tweet Apr 7, 2019: https://t.co/W3XSnPU0Vo

ICEWS @icews Apr 7
We uploaded the 3 missing events file for Mar 26-28. Unfortunately, there will be duplicate events in these files and files from Mar 27-Apr 6, predominantly affecting Mar 28 and Apr 5. Use the Event ID field to identify these duplicates. http://bit.ly/2ORIfQX #icews #dataverse

@mayeulk
Contributor Author

mayeulk commented Jun 20, 2019

I relaxed the uniqueness requirement (primary key) as a quick, temporary fix by running this against the SQLite database:

CREATE TABLE events_copy (
  event_id INTEGER NOT NULL, event_date INTEGER NOT NULL,
  source_name TEXT, source_sectors TEXT, source_country TEXT,
  event_text TEXT, cameo_code TEXT, intensity REAL,
  target_name TEXT, target_sectors TEXT, target_country TEXT,
  story_id INTEGER, sentence_number INTEGER, publisher TEXT,
  city TEXT, district TEXT, province TEXT, country TEXT,
  latitude REAL, longitude REAL,
  year INTEGER NOT NULL, yearmonth INTEGER NOT NULL,
  source_file TEXT NOT NULL,
  PRIMARY KEY (event_id, event_date, source_file)
);

INSERT INTO events_copy
   SELECT * FROM events;

ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_copy RENAME TO events;

I kept the old data in order to analyse this. Running update_icews(dryrun = FALSE) again now works on this particular setup.

@mayeulk
Contributor Author

mayeulk commented Jun 20, 2019

With my version of the database (with PRIMARY KEY (event_id, event_date, source_file)), it now works:

Ingesting records from '20190618-icews-events.tab'
Cleaning up and optimizing database, this might take a while
Complete
File and/or database update done

On this new db, I ran the following to find duplicates:

SELECT event_id, event_date, COUNT(*)
FROM events
GROUP BY event_id, event_date
HAVING COUNT(*) > 1

There are 2434 rows returned ("duplicates"), of which:

  • 2319 rows are for event_date == 20190409
  • 77 rows are for event_date == 20190410
  • 38 rows are for event_date == 20190408

@andybega
Owner

I had the exact same issue with "20190409-icews-events-1.zip" and "20190409-icews-events.zip", and thought I had managed to fix it (#46).

The two files contain the same exact set of events (by event ID), so what should happen is this:

  1. The version without "-1" is ingested first, and all events are added to the database.
  2. For the next version with "-1", write_data_to_db() should have recognized that there are potentially duplicate events, realized that all of them are duplicates and thus there is nothing to add, and in that case added the "-1" file name to a table tracking source files with no new events (i.e. all duplicates); see the sketch after this list.
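
In code, the intended logic is something like this (a sketch only, not the actual write_data_to_db() source; assumes a data frame of parsed events and an open database connection):

library("DBI")
library("dplyr")

# Sketch of the intended dedup step: keep only incoming events whose
# (event_id, event_date) pair is not already in the database; if nothing
# is left, record the file in null_source_files instead of adding rows.
ingest_events <- function(con, events, file_name) {
  existing <- dbGetQuery(con, "SELECT event_id, event_date FROM events;")
  to_add   <- anti_join(events, existing, by = c("event_id", "event_date"))
  if (nrow(to_add) == 0) {
    dbExecute(con, "INSERT INTO null_source_files (name) VALUES (?);",
              params = list(file_name))
  } else {
    dbWriteTable(con, "events", to_add, append = TRUE)
  }
}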

Could you check whether you get the same results for these queries?

SELECT name FROM null_source_files WHERE name LIKE '20190409%';

"20190409-icews-events-1.tab" only

SELECT name FROM source_files WHERE name LIKE '20190409%';

Both "20190409-icews-events.tab" and "20190409-icews-events-1.tab".

SELECT source_file, count(*) AS n_events 
FROM events 
WHERE source_file LIKE '20190409%' 
GROUP BY source_file;

All from the file version without "-1":

                source_file n_events
1 20190409-icews-events.tab     2434

@mayeulk
Contributor Author

mayeulk commented Jun 22, 2019

Here are the results (run on my SQLite db with duplicates):

SELECT name FROM null_source_files WHERE name LIKE '20190409%';
0 rows returned


SELECT name FROM source_files WHERE name LIKE '20190409%';


"20190409-icews-events-1.tab"
"20190409-icews-events.tab"
2 rows


SELECT source_file, count(*) AS n_events 
FROM events 
WHERE source_file LIKE '20190409%' 
GROUP BY source_file;

source_file                    n_events
"20190409-icews-events-1.tab"  2434
"20190409-icews-events.tab"    2434
2 rows

@mayeulk
Contributor Author

mayeulk commented Jun 22, 2019

I guess we can think of ways to remove the duplicates in SQL, which might be faster than doing it in R (or not).
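
For example, the R-side equivalent could look like this (sketch; it pulls the rows into memory, which is where SQL might win):

library("icews")
library("dplyr")

# Sketch: keep singleton (event_id, event_date) rows, and for duplicated
# pairs drop the copy that came from a "-icews-events-1.tab" file.
deduped <- query_icews("SELECT * FROM events;") %>%
  group_by(event_id, event_date) %>%
  filter(n() == 1 | !grepl("-icews-events-1\\.tab$", source_file)) %>%
  ungroup()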

@mayeulk
Contributor Author

mayeulk commented Jun 22, 2019

Here I implement, in SQL, a solution that removes the duplicates linked to this issue.
To keep the queries fast during this testing phase, I do not copy the full database but only a recent subset, into the table events_extract:

CREATE TABLE events_extract (
  event_id INTEGER NOT NULL, event_date INTEGER NOT NULL,
  source_name TEXT, source_sectors TEXT, source_country TEXT,
  event_text TEXT, cameo_code TEXT, intensity REAL,
  target_name TEXT, target_sectors TEXT, target_country TEXT,
  story_id INTEGER, sentence_number INTEGER, publisher TEXT,
  city TEXT, district TEXT, province TEXT, country TEXT,
  latitude REAL, longitude REAL,
  year INTEGER NOT NULL, yearmonth INTEGER NOT NULL,
  source_file TEXT NOT NULL,
  PRIMARY KEY (event_id, event_date, source_file)
);

INSERT INTO events_extract
   SELECT * FROM events WHERE event_date > 20190400;


-- The duplicates to fix
SELECT event_id, event_date, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- 2434 rows returned 

-- Same, without showing the count
SELECT event_id, event_date --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
   

-- Full rows of duplicates
SELECT * FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
ORDER by event_date, event_id, source_file;
-- 4868 rows returned
-- we see that all duplicates are pairs coming from two files: -icews-events.tab and -icews-events-1.tab

-- From those duplicates, select only those loaded from -icews-events-1.tab
SELECT * FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
AND source_file LIKE '%-icews-events-1.tab'
ORDER by event_date, event_id, source_file;
-- 2434 rows returned

-- DELETE the duplicates
DELETE FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
AND source_file LIKE '%-icews-events-1.tab';
-- 2434 rows affected

-- Check there are no duplicates left
SELECT event_id, event_date, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1;
-- 0 rows returned

mayeulk added a commit to mayeulk/icews that referenced this issue Jun 22, 2019
This is a possible solution to duplicates (issue andybega#45 )
@mayeulk
Contributor Author

mayeulk commented Jun 22, 2019

Run over the full events table, the DELETE SQL query takes about 15 s on my laptop:

took 15465ms, 2434 rows affected

@mayeulk
Contributor Author

mayeulk commented Jun 23, 2019

Running the update function now puts the duplicates from the 20190409 file back:

# R code
> update_icews(dryrun = FALSE)
Ingesting records from '20190409-icews-events-1.tab'
Downloading '20190622-icews-events.zip'
Ingesting records from '20190622-icews-events.tab'
Cleaning up and optimizing database, this might take a while
-- SQL code
-- The duplicates
SELECT event_id, event_date, COUNT(*)
FROM events
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- 2434 rows returned 

@mayeulk mayeulk changed the title Error in writeBin(as.vector(f), tmp) : can only write vector objects duplicate file names and event data in source Dataverse repository Jun 23, 2019
@mayeulk
Contributor Author

mayeulk commented Jun 23, 2019

I've changed the title to something more readable. I believe there are two options here:

  • fix the R icews package
  • fix the data in the ICEWS Harvard Dataverse repository. I'm not sure the data authors would be willing to do so, but it is certainly a better option, among other things because other projects (not in R) might want to use the data, too.

@andybega
Owner

andybega commented Jun 9, 2020

Hi, given the recent changes in the ICEWS dataverse, I think this is not an issue anymore. Duplicate events are still a problem, but that should be taken care of when ingesting new data. Have you tried updating the data recently?

(It should work even with the previous data present, but might give some essentially ineffectual messages, #54 (comment))
