Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detection is really slow in some cases #13

Closed
kaskawu opened this issue May 8, 2020 · 5 comments
Closed

detection is really slow in some cases #13

kaskawu opened this issue May 8, 2020 · 5 comments

Comments

@kaskawu
Copy link

kaskawu commented May 8, 2020

Hey there, first of all, great project!

The following commands takes a significant amount of time:

> python3 -m timeit -n 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')" 
1 loop, best of 5: 13.2 sec per loop
python3 -m timeit "from clevercsv import Detector; Detector().detect('a'*18)" 
1 loop, best of 5: 8.24 sec per loop

After benchmarking a little bit, the apparent cause is that the unix_path and url regexes in the detector are susceptible to a ReDOS .

These change, which replace the regexes with (hopefully) equivalent ones fixes the most oblivious issues:

-    "url": "((https?|ftp):\/\/(?!\-))?(((([\p{L}\p{N}]*\-?[\p{L}\p{N}]+)+\.)+([a-z]{2,}|local)(\.[a-z]{2,3})?)|localhost|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(\:\d{1,5})?))(\/[\p{L}\p{N}_\/()~?=&%\-\#\.:]*)?(\.[a-z]+)?",
-    "unix_path": "(\/|~\/|\.\/)(?:[a-zA-Z0-9\.\-\_]+\/?)+",
+    "url": "((https?|ftp):\/\/(?!\-))?(((?:[\p{L}\p{N}-]+\.)+([a-z]{2,}|local)(\.[a-z]{2,3})?)|localhost|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(\:\d{1,5})?))(\/[\p{L}\p{N}_\/()~?=&%\-\#\.:]*)?(\.[a-z]+)?",
+    "unix_path": "[~.]?(?:\/[a-zA-Z0-9\.\-\_]+)+\/?",

New results:

> python3 -m timeit -n 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')" 
1 loop, best of 5: 4.17 msec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (347 msec) was more than four times slower than the best time (4.17 msec).
> python3 -m timeit "from clevercsv import Detector; Detector().detect('a'*18)" 
1 loop, best of 5: 217 usec per loop

Python version: 3.8

@kaskawu kaskawu changed the title detection is really slow detection is really slow in some cases May 8, 2020
@GjjvdBurg
Copy link
Collaborator

Hi @kaskawu! Thanks for your interest in the package and for reporting this issue. Strangely, I have a hard time replicating your results:

$ python3 -m timeit -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')"
500 loops, best of 5: 721 usec per loop

and with the change you propose:

$ python3 -m timeit -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')"
1 loop, best of 5: 638 usec per loop

What version of the regex package are you using?

That said, it does seem to make a massive difference on your system, so I'm certainly open to making this change. I do however want to make sure I fully understand the cause before implementing any changes. Thanks!

@kaskawu
Copy link
Author

kaskawu commented May 8, 2020

> pip3 freeze | grep regex
regex==2020.5.7
> 

That said, I tested across multiple python versions. I tried python 3.7 and 3.8, and the slowdown only happens on 3.8:

Python 3.7:

> python3 --version
Python 3.7.7
> python3 -m timeit -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')"
1 loop, best of 5: 5.75 msec per loop

Python 3.8:

> python3 --version
Python 3.8.2
> python3 -m timeit -n 1 -r 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')"
1 loop, best of 1: 19.7 sec per loop

@GjjvdBurg
Copy link
Collaborator

Wow that's very interesting! Thanks for doing some more digging. I'll take a more detailed look at this soon, hopefully I can reproduce it in someway and figure out a good solution. Thanks again for reporting it!

@lmmentel
Copy link

Same here, performance drops with python3.8

python --version
Python 3.8.1
python -m timeit -n 1 -r 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')" 
1 loop, best of 1: 8.34 sec per loop

@GjjvdBurg
Copy link
Collaborator

Thanks again @kaskawu for reporting this issue. I've updated CleverCSV using the unix_path regex you suggested above (diving into it, that regex seemed to be the problem). I'm preparing an updated release of the package now. Thanks also @lmmentel for confirming!

GjjvdBurg added a commit that referenced this issue May 12, 2020
* Fix speed of ``unix_path`` regex used in type detection. ([issue
  #13](#13)). Thanks
  to @kaskawu.
GjjvdBurg added a commit that referenced this issue May 20, 2020
* Update URL regex to avoid catastrophic backtracking and increase
  performance. See [issue
  #13](#13) and
  [issue #15](#15).
  Thanks to @kaskawu for the fix and @jlumbroso for re-raising the issue.
* Add ``num_chars`` keyword argument to ``read_as_dicts`` and ``csv2df``
  wrappers.
* Improve documentation w.r.t. handling large files. Thanks to @jlumbroso for
  raising this issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants