
Interesting Benchmarks running resilparse 'HTML2text' sequentially vs parallel #23

Closed
getorca opened this issue Oct 26, 2022 · 28 comments

@getorca

getorca commented Oct 26, 2022

After running some benchmarks on Resiliparse's html2text extract_plain_text(tree, main_content=True), it seems the extract_plain_text method is significantly slower in parallel than sequentially.

sequential: 508.147 items/sec
parallel: 62.7322 items/sec

I ran the benchmarking with a tool I wrote, https://github.com/Nootka-io/wee-benchmarking-tool. I'll work on pulling out a minimal example.

It seems strange to me, and I'm not sure where to begin profiling/debugging. The other libraries see little improvement from parallelism, but Resiliparse is the only one showing a dramatic drop, although it's still the fastest.

@getorca
Author

getorca commented Oct 26, 2022

Is part of it written in C++/Cython? That might explain it. I'm using Dask and dask bags, which I think add some significant overhead or possibly prevent parallelism in the Cython code.

@phoerious
Member

Depends on how you run it in parallel. If you start separate Python processes, they should all be equally fast. If you run it in threads, you may see a slowdown due to GIL locking. The latter might perhaps be avoidable, but I'd need a proper test case.
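The distinction can be sketched with a toy pure-Python workload (a hypothetical stand-in for CPU-bound extraction, not Resiliparse itself): threads serialise on the GIL for such code, while processes can run truly in parallel.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound(n):
    # Pure-Python loop: holds the GIL for the whole computation.
    total = 0
    for i in range(n):
        total += i * i
    return total


def bench(executor_cls, workers=4, tasks=8, n=200_000):
    start = time.monotonic()
    with executor_cls(workers) as ex:
        results = list(ex.map(cpu_bound, [n] * tasks))
    return time.monotonic() - start, results


if __name__ == '__main__':
    t_thr, _ = bench(ThreadPoolExecutor)    # roughly sequential speed: threads queue on the GIL
    t_proc, _ = bench(ProcessPoolExecutor)  # scales with core count
    print(f'threads: {t_thr:.3f}s, processes: {t_proc:.3f}s')
```

Cython extensions can release the GIL around C-level work, in which case threads would scale too; whether a given code path does so is exactly what a proper test case would show.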

@phoerious
Member

phoerious commented Oct 26, 2022

BTW, interesting benchmarking suite. I used the ScrapingHub suite for developing and evaluating Resiliparse's accuracy, but this one is new to me.

So far, it's a pretty basic rule-based and precision-oriented extractor, but we have work planned for future improvements. Great to see that it's still the fastest of the bunch.

@getorca
Author

getorca commented Oct 26, 2022

Dask bags run on a multiprocessing scheduler, and there is about 200 µs of overhead per task. This was on a fairly small sample of 181 extractions. The Dask multiprocessing scheduler also serialises functions, and I wonder if the Cython code adds more overhead there. It will be interesting to see how the results differ once I get closer to real-world sample sizes for me, of around 20k documents, and with a different DAG, e.g. 20 tasks running 1,000 HTML files each vs. 200 tasks running 1 HTML file each.
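The batching idea (amortising per-task overhead by handing each task a chunk of documents) can be sketched without Dask; extract_one below is a hypothetical stand-in for the real extraction call:

```python
from multiprocessing import Pool


def extract_one(html):
    # Hypothetical stand-in for the real extraction call.
    return len(html)


def extract_batch(htmls):
    # One task handles a whole chunk, amortising the per-task overhead.
    return [extract_one(h) for h in htmls]


def chunked(seq, size):
    # Split a list into consecutive chunks of at most `size` items.
    return [seq[i:i + size] for i in range(0, len(seq), size)]


if __name__ == '__main__':
    docs = ['<html>%d</html>' % i for i in range(20_000)]
    with Pool(4) as pool:
        # 20 tasks of 1,000 docs each instead of 20,000 tiny tasks.
        results = [r for batch in pool.map(extract_batch, chunked(docs, 1000))
                   for r in batch]
    assert len(results) == len(docs)
```

multiprocessing.Pool.map's chunksize parameter achieves a similar effect automatically, and dask bags expose partitioning via npartitions/partition_size when building the bag.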

I'm incredibly impressed with the speed and accuracy of extract_plain_text(tree, main_content=True) for article extraction. For me it's definitely worth digging into how it's done and working on adding title, author, and published-date extraction.

@huu4ontocord

Have you figured this out yet? I am also seeing a slowdown when running in parallel.

@getorca
Author

getorca commented Oct 31, 2022

Have you figured this out yet? I am also seeing a slowdown when running in parallel.

Not yet; I'm taking a look over the next day or two. I suspect that in my case, with Dask, it's the per-task overhead and how non-Python code is serialised with bags. The first thing I'll try is running on a larger dataset and doing more than one extraction in each task. If that doesn't work, I'll have to reassess; using Cython's parallel processing is another option. Or this could be a case where parallelisation isn't necessary.

@huu4ontocord

huu4ontocord commented Oct 31, 2022

@lexborisov @phoerious I'm wondering if we are using a version of lexbor with threading turned on in resiliparse? If so, then running in multiprocesses might not be helpful. I found a mention of threads in lexbor/lexbor@51930b4 ?

@huu4ontocord

Or it could be some blocking done in FastWARC/Resiliparse itself? https://github.com/chatnoir-eu/chatnoir-resiliparse/tree/1f4b92d2a5f49f903e5f1b3ac087b5ac8db69aac/resiliparse_inc

@lexborisov

@lexborisov @phoerious I'm wondering if we are using a version of lexbor with threading turned on in resiliparse? If so, then running in multiprocesses might not be helpful. I found a mention of threads in lexbor/lexbor@51930b4 ?

The lexbor project does not use multithreading for HTML parsing or for anything else. At the same time, the project is completely thread-safe.

@phoerious
Member

phoerious commented Nov 1, 2022

I am having trouble reproducing this. I was able to reproduce the threading slowdown once after starting Resiliparse directly from my IDE, but that vanished after a clean rebuild of everything with debugging and line tracing turned off. When started from the command line, I could not reproduce it at all.

Here are my benchmarking results for a simple HTML file using different parallel joblib backends:

Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 3.070s (3,257.8 extractions/s)
multiprocessing, 24 workers: 0.274s (36,437.4 extractions/s)
loky, 24 workers: 0.498s (20,066.1 extractions/s)
threading, 24 workers: 3.227s (3,098.9 extractions/s)

The threading backend is about as fast as the sequential run, which is to be expected (it's a little slower due to the added overhead). The multiprocessing backend is 10-12x faster.

I get the same result when I pull a build directly from PyPI instead of using a local build.

@phoerious
Member

phoerious commented Nov 1, 2022

Strange. I can reproduce the slowdown with your benchmarking tool:

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library     ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525   │ 0.8529    │ 0.9852 │ 0.904  │ 0.8819          │ 62.3391   │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

The results are (surprisingly) slightly better if I change the dask compute scheduler to threads:

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library     ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525   │ 0.8529    │ 0.9852 │ 0.904  │ 0.8819          │ 220.0468  │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

The default seems to be processes, which should be faster in theory. synchronous gives slightly worse results than threading:

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library     ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525   │ 0.8529    │ 0.9852 │ 0.904  │ 0.8819          │ 180.6137  │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

Compare this to running it without any scheduler at all:

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library     ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525   │ 0.8529    │ 0.9852 │ 0.904  │ 0.8819          │ 493.7656  │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

Dask must be introducing significant overhead somewhere.
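One way to confirm where the overhead lives is to push a no-op through the same scheduler: any time beyond the sequential no-op run is pure dispatch and serialisation cost. A minimal sketch using a multiprocessing Pool (Dask itself is left out for brevity, but the same ablation applies to a bag's map/compute):

```python
from multiprocessing import Pool
from time import monotonic


def noop(_):
    # Does no work at all: any measured time is scheduler overhead.
    return None


def measure_overhead(n=10_000, workers=4):
    start = monotonic()
    for i in range(n):
        noop(i)
    sequential = monotonic() - start

    with Pool(workers) as pool:
        start = monotonic()  # exclude pool start-up from the timing
        pool.map(noop, range(n))
    parallel = monotonic() - start
    return sequential, parallel


if __name__ == '__main__':
    seq, par = measure_overhead()
    print(f'sequential no-op: {seq:.4f}s, pooled no-op: {par:.4f}s')
```

If the pooled no-op run already costs a substantial fraction of the full parallel extraction time, the scheduler, not the extractor, dominates.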

I extended my own benchmark by adding a custom multiprocessing Pool scheduler as well as a ThreadPoolExecutor and a ProcessPoolExecutor, in addition to what joblib has to offer. The multiprocessing Pool is comparable to joblib's multiprocessing backend, and the ThreadPoolExecutor is about as fast as joblib's threading backend. The ProcessPoolExecutor is significantly worse than a multiprocessing Pool, probably because it creates an entirely new process for each iteration:

Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 2.829s (3,534.2 extractions/s)
multiprocessing, 24 workers: 0.368s (27,205.6 extractions/s)
loky, 24 workers: 0.886s (11,288.3 extractions/s)
threading, 24 workers: 3.082s (3,244.8 extractions/s)
multiprocessingpool, 24 workers: 0.311s (32,114.7 extractions/s)
processpoolexecutor, 24 workers: 1.752s (5,706.5 extractions/s)
threadpoolexecutor, 24 workers: 2.790s (3,584.1 extractions/s)

@phoerious
Member

phoerious commented Nov 1, 2022

I think the problem is not in Resiliparse. When I print self.elapsed_time in your extractor and replace the Resiliparse extractor implementation with return '', I get the following timings:

Sequential: 0.0071163177490234375
Parallel: 2.8729026317596436

Compare that to the timings when I add the real implementation back in (with main content extraction, which is the most expensive extraction):

Sequential: 0.3654904365539551
Parallel: 2.890957832336426

So Resiliparse adds 0.36 and 0.02 seconds, respectively. The rest is overhead of your benchmarking implementation. I suspect that reading and distributing the HTML inputs is really expensive, therefore bottlenecking the entire execution. Perhaps a simpler multiprocessing backend would be better.

Here are the timings of my benchmarks also with main content extraction turned on:

Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 4.542s (2,201.9 extractions/s)
multiprocessing, 24 workers: 0.430s (23,244.4 extractions/s)
loky, 24 workers: 0.953s (10,493.8 extractions/s)
threading, 24 workers: 3.603s (2,775.4 extractions/s)
multiprocessingpool, 24 workers: 0.424s (23,611.7 extractions/s)
processpoolexecutor, 24 workers: 1.878s (5,325.0 extractions/s)
threadpoolexecutor, 24 workers: 3.376s (2,962.1 extractions/s)

@phoerious
Member

phoerious commented Nov 1, 2022

FWIW, here's my benchmark implementation. The sample HTML was taken from issue #24:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from multiprocessing import Pool
import os
import sys

from joblib import delayed, parallel_backend, Parallel
from time import monotonic
from tqdm import tqdm

from resiliparse.parse import detect_encoding
from resiliparse.parse.html import HTMLTree
from resiliparse.extract.html2text import extract_plain_text

html_byte = b'\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\r\n<meta http-equiv="X-UA-Compatible" content="IE=9">\r\n<link rel="stylesheet" type="text/css" href="https://firgraf.oh.gov.hu/include/style.css" media="screen" />\r\n<title>Int\xc3\xa9zm\xc3\xa9nyi adatok</title>\r\n<!-- Global site tag (gtag.js) - Google Analytics -->\r\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-198540847-1"></script>\r\n<script>\r\n  window.dataLayer = window.dataLayer || [];\r\n  function gtag(){dataLayer.push(arguments);}\r\n  gtag(\'js\', new Date());\r\n  gtag(\'config\', \'UA-198540847-1\');\r\n</script>\r\n</head>\r\n<body>\r\n<table width="80%" cellpadding="0" cellspacing="0" align="center" style="border:3px solid;\r\nborder-radius:8px; border: 3px solid #0994dc; background-color:#FFFFFF">\r\n  <tr>\r\n    <td valign="top" rowspan="2" bgcolor=\'#FFFFFF\'></td>\r\n    <td align=\'center\' height=\'70\' bgcolor=\'#FFFFFF\' style=\'font: bold small-caps 28px monospace;\'><img src=\'https://firgraf.oh.gov.hu/images/firgraf_logo.png\' width=\'1200\'></td>\r\n  </tr>\r\n  <tr>\r\n    <td valign="top" align=\'center\' bgcolor="#FFFFFF">\r\n      \r\n      <table>\r\n\t<tr>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/index.php">Kezd\xc5\x91lap</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/kkk.php">K\xc3\xa9pz\xc3\xa9si \xc3\xa9s kimeneti k\xc3\xb6vetelm\xc3\xa9nyek</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/int.php">Int\xc3\xa9zm\xc3\xa9nyi adatok</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/torzs.php">T\xc3\xb6rzsadatok</a></td>\r\n\t  <td class="menu"><a class="menu" 
href="https://firgraf.oh.gov.hu/prg/gyorslista.php">Gyorslist\xc3\xa1k</a></td>\r\n\t  <td class="menu"><a class="menu" href="http://www.felvi.hu/hivataliugyek/">Vissza a felvi.hu-ra</a></td>\r\n\t</tr>\r\n      </table>\r\n    </td>\r\n  </tr>\r\n  <tr>\r\n    <td bgcolor=\'#ffffff\'>\r\n      &nbsp;\r\n    </td>\r\n    <td colspan="2" style="padding: 0.5em">\r\n      <div align="center"><font size="4" color="#000000">Int\xc3\xa9zm\xc3\xa9nyi adatok</font></div><hr>\r\n      <div align=\'left\' valign=\'top\'><form name=\'hataly\' method=\'get\' action=\'/prg/int.php?nyilvantartottszakid=36318\'><a href=\'/prg/int.php?hatalyvalt=hat\xc3\xa1lyoss\xc3\xa1g+bekapcsol\xc3\xa1sa&nyilvantartottszakid=36318\'>[A hat\xc3\xa1lyoss\xc3\xa1gi sz\xc5\xb1r\xc5\x91k bekapcsol\xc3\xa1sa.]</a></form>\n</div><form name=form1 method=post action=\'/prg/int.php?nyilvantartottszakid=36318\'><div align=\'left\' valign=\'top\'>\xe2\x96\xa0 <a href=\'kkk.php?graf=MSZKSMU\'>KKK teljes gr\xc3\xa1f</a> \xe2\x96\xa0 <a href=\'int.php?adatmod=nyilvszak&szervezetid=36\'>SZTE nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9sei</a><br>A gr\xc3\xa1fban a csom\xc3\xb3pontokra kattintva b\xc5\x91vebb inform\xc3\xa1ci\xc3\xb3 olvashat\xc3\xb3 az adott csom\xc3\xb3pontr\xc3\xb3l.<br>Gr\xc3\xa1fn\xc3\xa9zet:   <select name=grafnezet>\n<option value="resz">csak a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n<option value="mind">a teljes gr\xc3\xa1fban a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n</select> mutatja.<br>A gr\xc3\xa1fban a ny\xc3\xadl kezdete \xc3\xa9s v\xc3\xa9ge k\xc3\xb6z\xc3\xb6tti minim\xc3\xa1lis t\xc3\xa1vols\xc3\xa1g:   <select name=grafminlen>\n<option value="0">legkisebb</option>\n<option selected value="1">1 egys\xc3\xa9g</option>\n<option value="2">2 egys\xc3\xa9g</option>\n<option value="3">3 egys\xc3\xa9g</option>\n<option value="4">4 egys\xc3\xa9g</option>\n<option value="5">5 egys\xc3\xa9g</option>\n</select> (A nagyobb \xc3\xa9rt\xc3\xa9k 
szell\xc5\x91sebb\xc3\xa9 teszi az \xc3\xa1br\xc3\xa1t.)<br> <button type=\'submit\'  style="background-color:#E5E5E5; color:#000000; font-size: 12px;" name=\'muv\' value=\'n\xc3\xa9zetet friss\xc3\xadt\'>n\xc3\xa9zetet friss\xc3\xadt</button> </div><br><table width=\'100%\' align=\'center\' border=\'0\'><tr><td width=\'50%\' align=\'left\' valign=\'top\'><a href=\'/prg/int.php?nyilvantartottszakid=36317\'>\xc2\xab el\xc5\x91z\xc5\x91: szoci\xc3\xa1lis munka (36317)</a></td><td width=\'50%\' align=\'right\'><a href=\'/prg/int.php?nyilvantartottszakid=6150\'>k\xc3\xb6vetkez\xc5\x91: szoci\xc3\xa1lpedag\xc3\xb3gia (6150) \xc2\xbb</a></td></tr></table>\n<br><div align=\'left\' valign=\'top\'><b><a href=\'torzsadat.php?tabla=szervezet&sid=70\'>(SZTE) Szegedi Tudom\xc3\xa1nyegyetem</a> - <a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>(MSZKSMU) szoci\xc3\xa1lis munka [36318]</a></b></div><br><div align=\'left\' valign=\'top\'><?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"\n "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">\n<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n -->\n<!-- Title: MSZKSMU Pages: 1 -->\n<svg width="340pt" height="116pt"\n viewBox="0.00 0.00 340.00 116.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">\n<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 112)">\n<title>MSZKSMU</title>\n<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-112 336,-112 336,4 -4,4"/>\n<g id="clust1" class="cluster">\n<title>cluster_vegzettseg</title>\n<polygon fill="none" stroke="#ffff00" points="231,-8 231,-62 324,-62 324,-8 231,-8"/>\n</g>\n<!-- START -->\n<g id="node1" class="node">\n<title>START</title>\n<ellipse fill="#d3d3d3" stroke="#d3d3d3" cx="27" cy="-63" rx="27" ry="18"/>\n<text text-anchor="middle" x="27" y="-60.8" font-family="Times,serif" font-size="9.00" fill="#000000">START</text>\n</g>\n<!-- MSZKSMU -->\n<g 
id="node2" class="node">\n<title>MSZKSMU</title>\n<g id="a_node2"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414" xlink:title="MSZKSMU\\nszoci\xc3\xa1lis munka">\n<polygon fill="#e0ffff" stroke="#e0ffff" points="164,-81 91,-81 91,-45 164,-45 164,-81"/>\n<text text-anchor="middle" x="127.5" y="-65.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSZKSMU</text>\n<text text-anchor="middle" x="127.5" y="-55.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- START&#45;&gt;MSZKSMU -->\n<g id="edge1" class="edge">\n<title>START&#45;&gt;MSZKSMU</title>\n<path fill="none" stroke="#0000ff" stroke-width="2" d="M54.1967,-63C62.3906,-63 71.6286,-63 80.7147,-63"/>\n<polygon fill="#0000ff" stroke="#0000ff" stroke-width="2" points="80.8451,-66.5001 90.8451,-63 80.845,-59.5001 80.8451,-66.5001"/>\n</g>\n<!-- MSPCKSM -->\n<g id="node3" class="node">\n<title>MSPCKSM</title>\n<g id="a_node3"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710" xlink:title="MSPCKSM\\nklinikai szoci\xc3\xa1lis munka">\n<polygon fill="#ffe4e1" stroke="#ffe4e1" points="328,-108 227,-108 227,-72 328,-72 328,-108"/>\n<text text-anchor="middle" x="277.5" y="-92.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSPCKSM</text>\n<text text-anchor="middle" x="277.5" y="-82.8" font-family="Times,serif" font-size="9.00" fill="#000000">klinikai szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU&#45;&gt;MSPCKSM -->\n<g id="edge3" class="edge">\n<title>MSZKSMU&#45;&gt;MSPCKSM</title>\n<path fill="none" stroke="#000000" d="M164.1941,-69.6049C179.9274,-72.4369 198.7348,-75.8223 216.4633,-79.0134"/>\n<polygon fill="#000000" stroke="#000000" points="216.2835,-82.5372 226.7454,-80.8642 217.5237,-75.6479 216.2835,-82.5372"/>\n</g>\n<!-- 1287 -->\n<g id="node4" class="node">\n<title>1287</title>\n<g id="a_node4"><a 
xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=vegzettseg&idmezo=vegzettsegid&id=1287" xlink:title="MMSAZMO\\nokleveles\\nszoci\xc3\xa1lis munk\xc3\xa1s">\n<polygon fill="#ffff00" stroke="#ffff00" points="316,-54 239,-54 239,-16 316,-16 316,-54"/>\n<text text-anchor="middle" x="277.5" y="-42.8" font-family="Times,serif" font-size="9.00" fill="#000000">MMSAZMO</text>\n<text text-anchor="middle" x="277.5" y="-32.8" font-family="Times,serif" font-size="9.00" fill="#000000">okleveles</text>\n<text text-anchor="middle" x="277.5" y="-22.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munk\xc3\xa1s</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU&#45;&gt;1287 -->\n<g id="edge2" class="edge">\n<title>MSZKSMU&#45;&gt;1287</title>\n<path fill="none" stroke="#ff0000" d="M164.1941,-56.1504C183.6481,-52.519 207.8022,-48.0103 228.805,-44.0897"/>\n<polygon fill="#ff0000" stroke="#ff0000" points="229.6399,-47.4944 238.8279,-42.2188 228.3554,-40.6133 229.6399,-47.4944"/>\n<text text-anchor="middle" x="195.5" y="-54.6" font-family="Times,serif" font-size="8.00" fill="#ff0000">START</text>\n</g>\n</g>\n</svg>\n</div><br><br><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott szak:</div></b><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'><tr><td align=\'left\' valign=\'top\'><b>nyilv. szak ID</b></td><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>telephely</b></td><td align=\'left\' valign=\'top\'><b>nyelv</b></td><td align=\'left\' valign=\'top\'><b>munkarend</b></td></tr>\n<tr><td align=\'left\' valign=\'top\'><a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>36318</a></td><td align=\'left\' valign=\'top\'>MSZKSMU</td><td align=\'left\' valign=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>Szeged</td><td align=\'left\' valign=\'top\'>magyar</td><td align=\'left\' valign=\'top\'>levelez\xc5\x91</td></tr>\n</table><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9si elemek:</b></div><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'>\n<tr><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>t\xc3\xadpus</b></td><td align=\'left\' valign=\'top\'><b>minimum kredit</b></td><td align=\'left\' valign=\'top\'><b>maximum kredit</b></td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710\'>MSPCKSM</a></td><td align=\'left\' valig=\'top\'>klinikai szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>specializ\xc3\xa1ci\xc3\xb3</td><td align=\'left\' valig=\'top\'>35</td><td align=\'left\' valig=\'top\'>40</td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414\'>MSZKSMU</a></td><td align=\'left\' valig=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>szak</td><td align=\'left\' valig=\'top\'>120</td><td align=\'left\' valig=\'top\'>120</td></tr></table></form>\r\n    </td>\r\n  </tr>\r\n  <tr>\r\n    <td colspan="2" bgcolor=\'#0994dc\' width="100%">\r\n      <table width="100%">\r\n\t<tr>\r\n\t  <td align=\'left\'>\r\n\t      <font size=\'1\' color=\'#ffffff\'>Az adatb\xc3\xa1zis 2022-09-24 hajnalban friss\xc3\xbclt.</font>\r\n\t  </td>\r\n\t  <td align="right">\r\n\t    <font size=\'1\' color=\'#ffffff\'>K\xc3\xa9sz\xc3\xbclt az EKOP-1.A.1-08/C-2009-0009  "Az Oktat\xc3\xa1si Hivatal k\xc3\xb6zigazgat\xc3\xa1si szolg\xc3\xa1ltat\xc3\xa1sainak elektroniz\xc3\xa1l\xc3\xa1sa" projekt keret\xc3\xa9ben. &copy; 2012.</font>\r\n\t  </td>\r\n\t</tr>\r\n    </td>\r\n  </tr>\r\n</table>\r\n</body>\r\n</html>\r\n\n'
encoding = detect_encoding(html_byte)


ITERATIONS = 10000


def extract(*_):
    tree = HTMLTree.parse_from_bytes(html_byte, encoding)
    extract_plain_text(tree, main_content=True)


def run_bench(backend='sequential', workers=None):
    workers = workers or os.cpu_count()
    start = monotonic()
    if backend == 'sequential':
        for _ in tqdm(range(ITERATIONS), mininterval=0.3, leave=False):
            extract()
    elif backend == 'multiprocessingpool':
        with Pool(workers) as p:
            p.map(extract, tqdm(range(ITERATIONS), mininterval=0.3, leave=False))
    elif backend == 'processpoolexecutor':
        with ProcessPoolExecutor(workers) as p:
            p.map(extract, tqdm(range(ITERATIONS), mininterval=0.3, leave=False))
    elif backend == 'threadpoolexecutor':
        with ThreadPoolExecutor(workers) as p:
            p.map(extract, tqdm(range(ITERATIONS), mininterval=0.3, leave=False))
    else:
        with parallel_backend(backend, n_jobs=workers):
            Parallel()(delayed(extract)() for _ in tqdm(range(ITERATIONS), mininterval=0.3, leave=False))
    time_taken = monotonic() - start
    desc = backend
    if backend != 'sequential':
        desc = f'{backend}, {workers} workers'
    print(f'{desc}: {time_taken:.3f}s ({ITERATIONS / time_taken:,.1f} extractions/s)')


print(f'Using Python {sys.version}.')
print(f'Benchmarking {ITERATIONS:,} extractions...')
print(f'----------------------------------')
run_bench()
run_bench('multiprocessing')
run_bench('loky')
run_bench('threading')
run_bench('multiprocessingpool')
run_bench('processpoolexecutor')
run_bench('threadpoolexecutor')

@getorca
Author

getorca commented Nov 1, 2022

So Resiliparse adds 0.36 and 0.02 seconds, respectively. The rest is overhead of your benchmarking implementation. I suspect that reading and distributing the HTML inputs is really expensive, bottlenecking the entire execution. Perhaps a simpler multiprocessing backend would be better.

Unlikely, since the slowdown grows when I run more extractions, and only the actual extraction is timed. Also, the slowdown is not seen with other libraries like Goose, which speed up; Resiliparse is the only one that shows a slowdown.

I did some further benchmarking with some minimal Cython examples, and those do get a speed-up when run the same way. So I can rule out Cython being an issue with Dask.

@phoerious
Member

phoerious commented Nov 1, 2022

Yes, it adds up, but that could already be the cost of passing the html parameter to the extraction function or of spawning the executor itself. As I said, replacing the whole implementation with return '' should show a significant speedup if the problem were in Resiliparse, but there is barely a difference.

@phoerious
Member

phoerious commented Nov 1, 2022

Just to be clear, this is my extractor implementation, which takes about 2.9 seconds on its own:

from wee_cli.extractors import BaseExtractor


class ResiliparseExtract(BaseExtractor):

    name = 'resiliparse'

    @staticmethod
    def extract(html):
        return ''

Adding the Resiliparse extractor back in makes almost no difference in execution time. I am pretty sure the other libraries are slowed down as well, but their sequential extraction times are long enough that you don't notice it that much. Hence you still get an overall speedup (although a lower one than might be expected). Resiliparse is probably the only library of the bunch whose sequential timing is faster than your overhead.

@getorca

getorca commented Nov 1, 2022

How are you timing that? If I add the extractor as above, it times it at 0.00015997886657714844s

I suspect your timing includes the overhead from loading in the html files into an iterable. If you look at the timings for my benchmark tool only the extraction times are included:

    def extract_in_parallel(self):
        """
        runs the extracts in parallel using dask.
        This isn't the most efficient way with Dask, but the sequence is loaded first so the timings are as close as possible
        :return:
        """
        sequence = []
        for path in Path('datasets/scrappinghub_aeb/html').glob('*.html.gz'):
            with gzip.open(path, 'rt', encoding='utf8') as f:
                html = f.read()
            item_id = path.stem.split('.')[0]
            sequence.append({
                'item_id': item_id,
                'html': html
            })
        start = time.time()
        bagged = db.from_sequence(sequence)\
            .map(self.parallel_extract)
        bagged = bagged.compute()
        self.elapsed_time = time.time() - start
        self.extracts = {item['item_id']:{'articleBody': item['articleBody']} for item in bagged}

https://github.com/Nootka-io/wee-benchmarking-tool/blob/main/wee_cli/extractors/__init__.py

@getorca

getorca commented Nov 1, 2022

However, I added Dask bag to the benchmarking tool, and it now runs quicker than the sequential version. The only difference is that the function called from map reads a global variable, which might skew the results across all the parallel options, since the object doesn't need to be serialised.

Benchmarking 10,000 extractions...
----------------------------------
sequential: 5.175s (1,932.2 extractions/s)
multiprocessing, 16 workers: 0.642s (15,568.2 extractions/s)
loky, 16 workers: 1.145s (8,736.9 extractions/s)
threading, 16 workers: 2.914s (3,432.2 extractions/s)
multiprocessingpool, 16 workers: 0.551s (18,133.1 extractions/s)
processpoolexecutor, 16 workers: 1.797s (5,565.0 extractions/s)
threadpoolexecutor, 16 workers: 2.567s (3,895.2 extractions/s)
dask_bag, 16 workers: 1.718s (5,819.8 extractions/s)

@phoerious

phoerious commented Nov 1, 2022

I added a print(self.elapsed_time) right after the second-to-last line in your snippet above. When I time sequential runs, I get < 0.001 seconds. You can also add the print at the end of __call__, which makes hardly a difference.

I timed the parallel execution with cProfile and it spends 2256ms in <method 'acquire' of '_thread.lock' objects>, which hints at really inefficient parallelization. Second place goes to tokenize with 623ms.
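For reference, a minimal recipe for that kind of profiling (with a stand-in workload, since the benchmark harness isn't reproduced here) looks like this; in the real run you would profile the parallel extraction call and look for entries like <method 'acquire' of '_thread.lock' objects> near the top of the cumulative-time listing:

```python
import cProfile
import io
import pstats


def busy_work():
    # Stand-in for the timed extraction loop.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
busy_work()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()
print(report)
```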

@getorca

getorca commented Nov 1, 2022

NVM, I messed up with my own tool and forgot to run it in parallel. Yes, that's likely the overhead of serialising the HTML object.

@phoerious

The profiler output is very sparse, since a lot of the work happens in native functions, but if you ask me, the pickler is probably the bottleneck. You should definitely try to distribute the tasks to workers before you start running and timing them, similar to what Pool does. Spawning an individual process for each task and feeding it an individually pickled object is going to be really inefficient.
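That per-item pickling cost is easy to make visible in isolation. A rough sketch (the document sizes and count here are made up, not taken from the benchmark) of what a process pool pays in serialisation before any extraction even starts:

```python
import pickle
import time

# Hypothetical documents standing in for the benchmark's HTML strings.
docs = ["<html>" + "x" * 50_000 + "</html>" for _ in range(2_000)]

start = time.perf_counter()
# Roughly what a process pool does for every task item it dispatches:
blobs = [pickle.dumps(d) for d in docs]
elapsed = time.perf_counter() - start
print(f"pickled {len(blobs)} docs in {elapsed:.3f}s")
```

When the extraction itself takes only fractions of a millisecond per document, serialisation like this can easily dominate the timed section.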

@phoerious

phoerious commented Nov 1, 2022

I just replaced the call to bagged.compute() with a call to Pool.map:

from multiprocessing import Pool
with Pool() as p:
    start = time.time()
    bagged = p.map(self.parallel_extract, sequence)
    self.elapsed_time = time.time() - start

which brings the extraction time down from 2.9 seconds to 0.2 seconds.

Output:

┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library     ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ resiliparse │ 0.6525   │ 0.8529    │ 0.9852 │ 0.904  │ 0.8819          │ 948.5001  │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

Still only an overall speedup of around 2x, but I guess the rest is I/O-bound from loading the HTML sources into Dask bags.

@phoerious

phoerious commented Nov 1, 2022

Your report motivated me to further optimise the extractor by minimising the time the GIL is held, so you can speed up execution even with Python threads. Now the GIL is required only for the initial call and a few constant-time setup operations. The extraction itself is done entirely without locking.

Before:

Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 2.824s (3,541.2 extractions/s)
threading, 24 workers: 3.853s (2,595.7 extractions/s)
threadpoolexecutor, 24 workers: 3.610s (2,770.1 extractions/s) 

After:

Using Python 3.10.6 (main, Aug 10 2022, 11:40:04) [GCC 11.3.0].
Benchmarking 10,000 extractions...
----------------------------------
sequential: 2.752s (3,633.8 extractions/s)
threading, 24 workers: 2.264s (4,417.9 extractions/s)
threadpoolexecutor, 24 workers: 1.936s (5,165.3 extractions/s)

A simple ThreadPoolExecutor is now about as fast as a ProcessPoolExecutor and about 1.4x faster than a sequential run. A proper multiprocessing Pool is of course still around 6.4x faster.
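The pattern behind these numbers — native code releasing the GIL so plain Python threads can overlap — can be illustrated with any GIL-releasing C function. A sketch using zlib.compress (which releases the GIL while compressing large buffers) as a stand-in for the extractor; the payloads and worker count are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor
import zlib

# zlib.compress drops the GIL while it works, much like the optimised
# extractor now does, so threads spend their time computing, not waiting.
payloads = [bytes([i % 256]) * 500_000 for i in range(32)]

with ThreadPoolExecutor(max_workers=8) as pool:
    compressed = list(pool.map(zlib.compress, payloads))

print(len(compressed), "buffers compressed")
```

With a pure-Python workload in place of zlib.compress, the same thread pool would be no faster than a sequential loop, which is exactly the "before" situation above.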

@getorca

getorca commented Nov 1, 2022

Thanks, working through it has been a big help. I'm going to add a few other parallel processing options to my benchmarking tool, and likely move my real-world example away from Dask.

@phoerious

Try to avoid serialisation as much as possible and try to pre-load strings into memory to be less dependent on the disk and page cache.

I suppose this issue has resolved itself, so I'm closing it, but feel free to add any new discoveries. 😉

@phoerious

phoerious commented Nov 1, 2022

One more thing I just noticed: You should be using time.monotonic() or time.perf_counter() / time.process_time() (for higher resolution) instead of time.time() when measuring execution timings.
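For example, the timing in the snippet further up could be switched over like this; unlike time.time(), perf_counter() is monotonic and unaffected by system clock adjustments, so short intervals are measured reliably:

```python
import time

start = time.perf_counter()  # monotonic, high-resolution clock
total = sum(range(1_000_000))  # stand-in for the timed extraction loop
elapsed = time.perf_counter() - start
print(f"summed in {elapsed:.6f}s")
```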

@huu4ontocord

Thank you for all this great discussion. The reduction in GIL contention for threading is awesome. When will it be pushed to the official PyPI? Thank you for all your hard work.

@phoerious

It's already pushed. You can get 0.13.7 from PyPI.
