Some text fields include quotes, e.g. ""Fight"" instead of "Fight" #75

Open
andybega opened this issue Oct 26, 2020 · 0 comments

Some text field values include outer double quotes as part of the stored value, e.g.:

> query_icews("select * from events where event_id = 25326166;")
  event_id event_date    source_name                                  source_sectors
1 25326166   20170101 Women (Turkey) "Social,General Population / Civilian / Social"
  source_country                                            event_text cameo_code
1         Turkey "Conduct suicide, car, or other non-military bombing"       <NA>
  intensity target_name target_sectors target_country story_id sentence_number
1       -10      Turkey           NULL         Turkey 43113964               6
                   publisher   city district province country latitude longitude year
1 Associated Press Newswires Ankara     NULL   Ankara  Turkey  39.9199   32.8543 2017
  yearmonth              source_file
1    201701 Events.2017.20201006.tab

The "source_sectors" and "event_text" values include the outer quotes, which they shouldn't. This is the proper format:

> query_icews("select * from events limit 1;")
  event_id event_date        source_name
1   926685   19950101 Extremist (Russia)
                                     source_sectors     source_country        event_text
1 Radicals / Extremists / Fundamentalists,Dissident Russian Federation Praise or endorse
  cameo_code intensity   target_name                              target_sectors
1        051       3.4 Boris Yeltsin Elite,Executive,Executive Office,Government
      target_country story_id sentence_number        publisher   city district province
1 Russian Federation 28235806               5 The Toronto Star Moscow     <NA>   Moskva
             country latitude longitude year yearmonth                    source_file
1 Russian Federation  55.7522   37.6156 1995    199501 events.1995.20150313082510.tab

Is this in the raw data files or a package error?

Some of these are from Events.2017, and I manually checked to verify that these quotes are indeed present in the tab-delimited raw data files.

(Screenshot attached to the issue: Screen Shot 2020-10-26 at 13 37 52)
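The manual check above could also be scripted. The following is a minimal Python sketch for illustration only, not the package's R code: it reads tab-delimited text with quote handling disabled and counts, per column, the values that are wrapped in double quotes. The function name `count_outer_quoted` and the sample rows (including the header names, which are borrowed from the database columns rather than from the raw file) are assumptions.

```python
import csv
import io
from collections import Counter


def count_outer_quoted(lines, delimiter="\t"):
    """Count, per column, the values that start and end with a double quote.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    # QUOTE_NONE keeps quote characters as literal data instead of
    # interpreting them as field delimiters.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    counts = Counter()
    for row in reader:
        for name, value in zip(header, row):
            if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
                counts[name] += 1
    return counts


# Made-up sample shaped like the two rows shown earlier in this issue.
sample = io.StringIO(
    "event_id\tsource_sectors\tevent_text\n"
    '25326166\t"Social,General Population / Civilian / Social"\t'
    '"Conduct suicide, car, or other non-military bombing"\n'
    "926685\tRadicals / Extremists / Fundamentalists,Dissident\t"
    "Praise or endorse\n"
)
counts = count_outer_quoted(sample)
print(dict(counts))
```

For a real file, an open handle such as `open(path, newline="", encoding="utf-8")` could be passed instead of the `StringIO` sample; the encoding of the raw files is not stated here, so that argument is a guess.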

What files are affected?

Check a couple of the fields to see what source file(s) these are coming from:

"event_text"

> query_icews("select distinct(source_file), count(*) as N from events where event_text like '\"%' group by source_file;")
               source_file     N
1 Events.2017.20201006.tab 59000

"source_sectors"

> query_icews("select distinct(source_file), count(*) as N from events where source_sectors like '\"%' group by source_file;")
               source_file      N
1 Events.2017.20201006.tab 512730

"target_sectors"

> query_icews("select distinct(source_file), count(*) as N from events where target_sectors like '\"%' group by source_file;")
               source_file      N
1 Events.2017.20201006.tab 425417

Of course: in all three fields, the only affected file is "Events.2017....tab".
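One possible cleanup, sketched in Python purely as an illustration (the helper name `strip_outer_quotes` is hypothetical and this is not how the package handles it): remove a single pair of outer double quotes from a value and leave everything else untouched. Note that the queries above only test for a leading quote, while this sketch requires both a leading and a trailing one.

```python
def strip_outer_quotes(value):
    """Remove one pair of enclosing double quotes, if present.

    Values without a matching pair of outer quotes, and missing
    values (None), are returned unchanged.
    """
    if value is None:
        return None
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value


cleaned = strip_outer_quotes('"Conduct suicide, car, or other non-military bombing"')
print(cleaned)
```

Whether the affected values also contain escaped inner quotes has not been checked here, so the sketch deliberately leaves inner characters alone.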
