
In parallel trafilatura is marginally slower than goose #262

Closed
getorca opened this issue Oct 24, 2022 · 9 comments
Labels
question Further information is requested

Comments


getorca commented Oct 24, 2022

I'm not quite sure where to begin with this; it's a strange one. In a real-world scenario I tried switching from Goose3 to Trafilatura. I'm processing HTML extractions in parallel with dask, and after switching to trafilatura I noticed a 30% slowdown. I ended up writing my own evaluation library to verify the results.

Results from running in parallel:
┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ goose3 │ 0.9678 │ 0.8561 │ 0.9547 │ 0.9027 │ 0.8343 │ 383.4737 │
│ trafilatura │ 0.9124 │ 0.9485 │ 0.908 │ 0.9278 │ 0.8567 │ 361.3232 │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

Results from running sequentially:
┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ goose3 │ 0.9678 │ 0.8561 │ 0.9547 │ 0.9027 │ 0.8343 │ 9.7953 │
│ trafilatura │ 0.9124 │ 0.9485 │ 0.908 │ 0.9278 │ 0.8567 │ 23.0045 │
└─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

Note: the dataset evaluated is from the scrapinghub/article-extraction-benchmark tool. The only portion of the code that runs in parallel for the benchmarks is the extraction, and only the extraction is timed for calculating items/sec.
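
For reference, the items/sec measurement works roughly like this (a sketch, not the exact benchmark code; load_documents is a placeholder):

import time
import trafilatura

htmls = load_documents()  # placeholder for reading the benchmark HTML files
start = time.perf_counter()
# only the extraction calls are inside the timed region
results = [trafilatura.extract(html) for html in htmls]
elapsed = time.perf_counter() - start
print(f"items/sec: {len(htmls) / elapsed:.4f}")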

In summary: trafilatura is marginally slower than Goose3 in parallel. However, run sequentially, it is more than twice as fast as Goose3.

I'm not sure where to begin with this; it can be difficult to profile parallel processing. It may be related to some of the memory leak issues reported with trafilatura, although it appears those have been resolved. Or it may be the caching; I haven't looked into how that functions.

I will work on publishing my benchmarking tool this afternoon.


adbar commented Oct 26, 2022

Hi @getorca, thanks for the evaluation!

First, you get different results from the Scrapinghub team; I assume it's because of more recent package versions? The Trafilatura results appear slightly degraded, and I wonder if it's a regression or just a different experimental setting on your side.

Second, I am not familiar with the way Dask parallelizes tasks. Your results are odd indeed but I cannot explain this difference.

@adbar adbar added the question Further information is requested label Oct 26, 2022

getorca commented Oct 26, 2022


Hi @adbar, there was a mistake in the timings for parallel, but trafilatura doesn't see anywhere near as much of an improvement from parallelisation as goose3, and resiliparse is significantly slower. It's very odd; I'm trying to dig into why. It might be related to resiliparse's heavy use of Cython/C++.

I'm using dask bags, which use multiprocessing. Dask is a high-level library on top of Python's multiprocessing/threads (I believe) that also builds an optimised DAG. The multiprocessing scheduler adds about 200µs of overhead per task and has some issues with shared memory, which is what made me think of possible memory leaks. It is also fastest on pure-Python objects, so I wonder if some of the libraries trafilatura imports are Cython/C++.
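
A rough sketch of the kind of setup I'm using (file loading, partition count, and scheduler choice here are illustrative, not my exact benchmark code):

import dask.bag as db
import trafilatura

def extract(html):
    # trafilatura.extract returns the main text, or None on failure
    return trafilatura.extract(html)

if __name__ == "__main__":
    htmls = load_documents()  # placeholder for reading the benchmark HTML files
    bag = db.from_sequence(htmls, npartitions=8)
    # the multiprocessing scheduler pickles every task and result,
    # which adds per-task overhead
    texts = bag.map(extract).compute(scheduler="processes")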

> First, you get different results from the Scrapinghub team; I assume it's because of more recent package versions? The Trafilatura results appear slightly degraded, and I wonder if it's a regression or just a different experimental setting on your side.

I use a method similar to Scrapinghub's "shingles" (n-grams) to compute accuracy, precision, and f-score, but with vectors from spaCy for the similarity measure, closer to how the original Moz evaluations worked.
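
The similarity side is along these lines (a minimal sketch; the model name and function are placeholders, not the exact code from my tool):

import spacy

nlp = spacy.load("en_core_web_md")  # any spaCy model with word vectors

def mean_doc_similarity(gold_text, extracted_text):
    # spaCy's Doc.similarity is the cosine similarity of averaged token vectors
    return nlp(gold_text).similarity(nlp(extracted_text))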

My benchmarking tool is available here: https://github.com/Nootka-io/wee-benchmarking-tool. I will work on some minimal samples to try to sort out the parallel "oddities".


adbar commented Oct 27, 2022

@getorca Thanks for sharing!

I don't understand why newspaper3k is performing that well; it's not the case in the Scrapinghub benchmark, nor is it the case in any multilingual benchmark I've seen. My experience is that it is good for English but less so in other settings.

You may also want to use another Trafilatura function; I opened a PR (Nootka-io/wee-benchmarking-tool#1).


getorca commented Oct 27, 2022

Interesting, I haven't looked at too many other benchmarks.

It appears the shingles, as defined in Scrapinghub's library, can produce either 1 or 4 false positives for a single incorrect token depending on its location in the body of text; there are also always three fewer shingles than tokens. Take a look at the minimal example below:

from wee_cli.evaluate import do_complex_scoring, scores_from_cm

"""
A demonstration of how the position of an incorrect token, when using shingles
to calculate precision and recall, can lead to different scores.
With a shingle length of 4 there are also always 3 fewer shingles than tokens,
and a single wrong token can produce up to 4x more false positives and
4x more false negatives than token-level scoring.
"""

def get_shingles(tokens):
    # sliding window of 4 consecutive tokens (a "shingle", i.e. a 4-gram)
    return [tuple(tokens[i:i+4]) for i in range(0, max(1, len(tokens) - 4 + 1))]

gt_tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
pred_tokens_a = ['x', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']  # wrong token at the edge
pred_tokens_b = ['a', 'b', 'c', 'x', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']  # wrong token in the middle

gt_shingles = get_shingles(gt_tokens)
pred_shingles_a = get_shingles(pred_tokens_a)
pred_shingles_b = get_shingles(pred_tokens_b)

# token confusion matrix A
t_a_cm = []
t_a_cm.append(do_complex_scoring(gt_tokens, pred_tokens_a))
print('** Score built with tokens on A **')
print(scores_from_cm(t_a_cm))

# ** Score built with tokens on A **
# {'accuracy': 1.0, 'precision': 0.9166666666666666, 'recall': 1.0, 'fscore': 0.9565217391304348}


# token confusion matrix B
t_b_cm = []
t_b_cm.append(do_complex_scoring(gt_tokens, pred_tokens_b))
print('** Score built with tokens on B **')
print(scores_from_cm(t_b_cm))

# ** Score built with tokens on B **
# {'accuracy': 1.0, 'precision': 0.9166666666666666, 'recall': 1.0, 'fscore': 0.9565217391304348}


# shingle confusion matrix A
a_cm = []
a_cm.append(do_complex_scoring(gt_shingles, pred_shingles_a))
print('** Score built with shingles on A **')
print(scores_from_cm(a_cm))

# ** Score built with shingles on A **
# {'accuracy': 1.0, 'precision': 0.8888888888888888, 'recall': 1.0, 'fscore': 0.9411764705882353}


# shingle confusion matrix B
b_cm = []
b_cm.append(do_complex_scoring(gt_shingles, pred_shingles_b))
print('** Score built with shingles on B **')
print(scores_from_cm(b_cm))

# ** Score built with shingles on B **
# {'accuracy': 0.625, 'precision': 0.5555555555555556, 'recall': 0.625, 'fscore': 0.5882352941176471}


For this reason, I'm not convinced shingles are better. The way you do it is interesting, but a bit hard to annotate. I think the best solution might be pulling all the text from the HTML and using it to calculate true negatives; that should provide more accuracy and better normalisation when combined with the tokens. A sketch of the idea is below.
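
Something like this (a hypothetical sketch; tokenize() and the counting logic are placeholders, not an actual implementation):

from lxml import html as lxml_html

def tokenize(text):
    return text.split()  # placeholder tokenizer

def count_true_negatives(page_html, gt_tokens, pred_tokens):
    # every token visible in the HTML forms the universe; tokens that both
    # the gold standard and the extractor leave out count as true negatives
    all_tokens = set(tokenize(lxml_html.fromstring(page_html).text_content()))
    return len(all_tokens - set(gt_tokens) - set(pred_tokens))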

And thanks, I'll take a look at the pull request shortly.


adbar commented Oct 28, 2022

Thanks again for sharing! Yes, my annotation method needs time and cannot be extrapolated easily; besides, there are other ways to evaluate. But as you demonstrate, results of the shingles method can vary a lot.


getorca commented Oct 28, 2022

No problem. I switched from a ratio of averages to an average of ratios, which appears to give better metrics and should mean outliers have less influence on them. Thanks for pointing out the issue there.
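
A quick illustration of the difference (made-up numbers):

# per-document (correct_tokens, total_tokens) pairs; the last one is an outlier
docs = [(90, 100), (5, 10), (1, 1000)]

# ratio of averages: pool all counts first; the big document dominates
ratio_of_averages = sum(c for c, t in docs) / sum(t for c, t in docs)  # ~0.086

# average of ratios: score each document, then average; equal weight per document
average_of_ratios = sum(c / t for c, t in docs) / len(docs)  # ~0.467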


getorca commented Nov 1, 2022

@adbar Good news: it seems this was related to the overhead from the way dask serialises Python objects. chatnoir-eu/chatnoir-resiliparse#23

I've switched to a Python multiprocessing pool, and trafilatura is a little over 2x as fast in parallel compared to running sequentially, and it stays massively faster than goose3.
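
The new setup is roughly as follows (worker count and chunksize here are illustrative, not the exact benchmark code):

from multiprocessing import Pool
import trafilatura

def extract(html):
    return trafilatura.extract(html)

if __name__ == "__main__":
    htmls = load_documents()  # placeholder for reading the benchmark HTML files
    with Pool(processes=8) as pool:
        # chunksize batches tasks, amortising the per-task pickling overhead
        texts = pool.map(extract, htmls, chunksize=16)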

I'll push an update to my benchmark tool this afternoon.


adbar commented Nov 3, 2022

@getorca Very nice, does that mean we can close this issue now?


getorca commented Nov 4, 2022

Yes, absolutely, go ahead

@adbar adbar closed this as completed Nov 7, 2022