CSV parse error while generating report. #1

jikk · 2016-01-12T18:59:44Z

I encountered the following exception while processing an input from the Taintmark experiment.

Error message (Exception trace)

Traceback (most recent call last):
  File "fairtest_driver.py", line 93, in main
    driver(conf, fpath, sens, target)
  File "fairtest_driver.py", line 119, in driver
    report([inv], "testing", conf.OUTPUT_DIR)
  File "build/bdist.linux-x86_64/egg/fairtest/investigation.py", line 402, in report
    plot_dir=sub_plot_dir)
  File "build/bdist.linux-x86_64/egg/fairtest/modules/bug_report/report.py", line 240, in bug_report
    output_stream)
  File "build/bdist.linux-x86_64/egg/fairtest/modules/bug_report/report.py", line 333, in print_context_ct
    print >> output_stream, pretty_ct(ct)
  File "build/bdist.linux-x86_64/egg/fairtest/modules/bug_report/report.py", line 584, in pretty_ct
    pretty_table = prettytable.from_csv(output)
  File "build/bdist.linux-x86_64/egg/prettytable.py", line 1337, in from_csv
    dialect = csv.Sniffer().sniff(fp.read(1024))
  File "/usr/lib/python2.7/csv.py", line 188, in sniff
    raise Error, "Could not determine delimiter"

To reproduce the problem, use the attached file (test0.csv, https://github.com/columbia/fairtest/files/87797/test0.csv.zip) as input to FairTest with the following settings:

SENS = ['dest', 'msg_type', 'url', 'app_version', 'cc', 'device_type', 'efs', 'lang', 'nonce', 'signature', 'time', 'ywsid', 'application/x-www-form-urlencoded', 'email', 'password']
TARGET = 'input_gps'


jikk · 2016-01-12T19:22:36Z

I dug into this problem a bit, and here is a minimal code snippet that reproduces it.
reproduce.zip contains two files:

bug.csv: the CSV payload that triggers the bug
reproduce.py: a code snippet that reproduces the issue

bug.csv appears to be a valid CSV file, but for some reason the prettytable module has a hard time parsing it.

ftramer · 2016-01-12T21:13:36Z

Hi,

Thanks for letting us know about this.

I don't know exactly what's going wrong with this dataset. There seems to
be some bug in the prettytable library we are using...
However, I don't encounter the bug when I get rid of some of the "weird"
fields in your data (nonces, signatures, and others that look like random
data).
These random data fields should be removed before using FairTest anyhow,
because they are irrelevant as far as fairness is concerned.

By the way, may I ask why you are specifying all of your fields as
sensitive, even those that are seemingly random? The attributes defined as
sensitive should be those for which you want to test for spurious
relationships with your target (input_gps, it seems). Note that random or
constant data cannot present any association with any target.

Cheers,

Florian


jikk · 2016-01-13T14:36:00Z

Thanks a lot, Florian, for your prompt follow-up.

First of all, to answer your question about why we specify all of the fields as sensitive: we do this for the following reasons.

Our project deals with a large amount of data that arrives with a somewhat arbitrary schema, and we don't know what it looks like beforehand.
The data I provided to reproduce the problem is an HTTP GET message generated during communication between a mobile app and its backend. As you can tell, since the communication was not designed by us, we have no prior knowledge of what each field means.
Our purpose in using FairTest in our project is to tell apart fields that are random from fields that are correlated with the TARGET (input) fields.
If you think FairTest isn't a good match for this purpose, please let us know your thoughts.

Thanks again for your help!

Regards, Kangkook

@roxanageambasu @francislan Please have a look and let me know if you have any ideas or comments on this.

ftramer · 2016-01-13T16:06:32Z

Ah okay this makes sense.

However, for performance reasons, it might be a good idea to preprocess the
data beforehand to remove any obviously random data (e.g. non-numeric
fields that have a different value in every instance).
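
A rough sketch of that kind of preprocessing, for illustration only. It assumes the data has already been loaded as a list of dicts (one per row, all with the same keys); the function name `drop_unique_text_columns` is made up for this example and is not part of FairTest.

```python
def drop_unique_text_columns(rows):
    """Drop columns that are entirely non-numeric and never repeat a value.

    `rows` is assumed to be a list of dicts sharing the same keys.
    """
    if not rows:
        return rows

    def is_number(value):
        # A value counts as numeric if float() can parse it.
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    keep = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        all_distinct = len(set(values)) == len(values)
        any_numeric = any(is_number(v) for v in values)
        # Heuristic from the comment above: a non-numeric field with a
        # different value in every instance is treated as random data.
        if len(values) > 1 and all_distinct and not any_numeric:
            continue
        keep.append(name)
    return [{name: row[name] for name in keep} for row in rows]


rows = [
    {"lang": "en", "nonce": "a9f", "time": "10"},
    {"lang": "en", "nonce": "77c", "time": "11"},
    {"lang": "fr", "nonce": "x1z", "time": "12"},
]
cleaned = drop_unique_text_columns(rows)
```

Here `nonce` would be dropped, while `lang` (repeated values) and `time` (numeric) are kept. The heuristic is deliberately crude; a small dataset can make a meaningful field look unique.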

Cheers,

Florian

vatlidak · 2016-01-13T16:46:03Z

Hi,

I traced this bug for quite some time, and it comes from the csv module that prettytable uses to parse CSVs.

The csv module has a Sniffer class that tries to guess the format of the CSV (its "dialect", which includes the delimiter) based on a frequency analysis of characters per line. That is, a character that appears with the same frequency on all lines of the CSV will be selected as the delimiter. This complicated scheme is failing to guess the delimiter properly, and there is no way to shortcut this idiotic process and set the delimiter yourself.
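
To make the failure mode concrete, here is a small sketch of the sniffing step in isolation. The helper name `sniff_delimiter` is hypothetical, and the empty sample is only a guaranteed way to make the Sniffer give up; it is not the data from this issue.

```python
import csv


def sniff_delimiter(sample):
    """Return the delimiter csv.Sniffer guesses, or None if it gives up."""
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        # Raised with "Could not determine delimiter" when no character
        # is consistent enough across the sampled lines.
        return None


# Every line has exactly two commas, so the comma is a consistent candidate.
regular = "a,b,c\n1,2,3\n4,5,6\n"
print(sniff_delimiter(regular))

# With nothing to analyse, the Sniffer cannot pick any delimiter.
print(sniff_delimiter(""))
```

The traceback above shows prettytable passing only the first 1024 characters to `sniff`, so the guess depends entirely on that prefix of the data.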

Kangkook: if Florian's suggestion to remove some funky fields from the sensitive attributes does not work for you, we may consider using some other library for presenting the reports, because I suspect this bug will appear again.

jikk · 2016-01-13T19:43:23Z

Thanks, @vatlidak, for your work. I was at a similar point in debugging the issue, and I wanted to find a way to provide the delimiter character from our side so that we can bypass the problematic code path. Unfortunately, I haven't succeeded yet.
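
One possible direction, sketched under assumptions: parse the CSV text ourselves with an explicit delimiter so that csv.Sniffer is never called, and skip prettytable.from_csv entirely. The function name `parse_with_delimiter` is made up, the code targets Python 3, and it assumes the text handed to prettytable is comma-separated, which I have not confirmed.

```python
import csv
import io


def parse_with_delimiter(text, delimiter=","):
    """Parse CSV text with a caller-chosen delimiter, skipping csv.Sniffer.

    Returns (header, rows); both are empty lists for empty input.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # ignore blank lines
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Odd-looking values no longer matter, because nothing is being guessed.
header, body = parse_with_delimiter("dest,nonce\nhost,a9;f=3\n")
```

The resulting header and rows could then be loaded into a `PrettyTable` by setting `field_names` and calling `add_row` for each row, instead of going through `from_csv`.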

At this point, removing the funky fields through some kind of pre-processing doesn't seem to be an option for us.

jikk assigned vatlidak Jan 12, 2016

jikk mentioned this issue Mar 28, 2016

an workaround patch to resolve prettytable exception with the followi… #2

Open
