
Scraping always pauses and doesn't finish #51

Open
jnmiller opened this issue Mar 13, 2024 · 16 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@jnmiller

jnmiller commented Mar 13, 2024

Every time I try to scrape a season (men's), the process gets stuck and hangs. Ctrl-C always gives the same stack trace:

Getting data for season 2022
No games on 11/08/21:   4%|███▌                                                                              | 8 of 182 days scraped in 3.3 sec
Scraping 184 games on 11/09/21:   4%|███                                                                   | 8 of 182 days scraped in 204.8 sec
Traceback (most recent call last):
  File "<SNIP>/./scrape.py", line 30, in <module>
    infos, box_scores, pbps = scraper.get_games_season(season, info=True, box=False, pbp=False)
  File "<SNIP>/lib/python3.10/site-packages/cbbpy/mens_scraper.py", line 80, in get_games_season
    return _get_games_season(season, "mens", info, box, pbp)
  File "<SNIP>/lib/python3.10/site-packages/cbbpy/cbbpy_utils.py", line 233, in _get_games_season
    info = _get_games_range(
  File "<SNIP>/lib/python3.10/site-packages/cbbpy/cbbpy_utils.py", line 186, in _get_games_range
    result = Parallel(n_jobs=cpus)(
  File "<SNIP>/lib/python3.10/site-packages/joblib/parallel.py", line 1952, in __call__
    return output if self.return_generator else list(output)
  File "<SNIP>/lib/python3.10/site-packages/joblib/parallel.py", line 1595, in _get_outputs
    yield from self._retrieve()
  File "<SNIP>/lib/python3.10/site-packages/joblib/parallel.py", line 1707, in _retrieve
    time.sleep(0.01)
KeyboardInterrupt

Is the source site just detecting the scraping and blocking my IP address? Or is something else going on?

I can sometimes successfully scrape a very short date range (like a weekend) but immediately after a success, it stops working and hangs again.

@dcstats
Owner

dcstats commented Mar 14, 2024

Will look into this, thanks!

@jnmiller
Author

It's sure looking like a bot detector - starting fresh (no attempts in the last 12-24h) it will scrape 100-250 games, then stop. I removed the joblib parallel loop to make it sequential, then ran the debugger and eventually caught a request returning a 503. When I open that URL in a browser, it also shows an error. But when I browse some other pages and try that page again later, it starts working again both in the browser and the scraper (presumably it identifies my IP address as a human browsing again?).

Some mitigations might be:

  • Add timeouts so it doesn't hang as long, with a longer backoff after a 503
  • Configurable slowdown: use lower concurrency in joblib, or add random sleeps after reading each page
  • Add support for rotating proxies
  • Try adding received cookies to subsequent requests - maybe it detects scraping when cookies are absent (tricky if combined with proxies)

I could possibly contribute if time allows. In the meantime is this data downloadable in bulk anywhere (at least 2010-2024 seasons)? I've looked and haven't yet found a free source with that whole time span and including pbp.
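A minimal sketch of the first two mitigations (hard timeouts, exponential backoff after a 503, random sleeps between requests), using only the standard library. `polite_get` and `backoff_delays` are illustrative names, not part of CBBpy's API:

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(base=2.0, factor=2.0, max_delay=120.0):
    """Yield an exponentially growing sequence of backoff delays, capped."""
    delay = base
    while True:
        yield min(delay, max_delay)
        delay *= factor


def polite_get(url, timeout=15, max_retries=5):
    """Fetch a URL with a hard timeout, backing off on 503 responses."""
    delays = backoff_delays()
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                body = resp.read()
            # Random jitter between successful requests to look less bot-like.
            time.sleep(random.uniform(0.5, 2.0))
            return body
        except urllib.error.HTTPError as e:
            if e.code == 503 and attempt < max_retries - 1:
                # Longer wait after each consecutive 503.
                time.sleep(next(delays))
            else:
                raise
    return None
```

The timeout alone would turn the current indefinite hang into a fast, visible failure, which would also make the root cause easier to diagnose.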

@dcstats
Owner

dcstats commented Mar 15, 2024

@jnmiller interesting... the scraper uses rotating headers that have helped with the bot detection to the point where I've never had it block any of my scrapes. I haven't had the chance to run it since you raised this issue, so it's definitely possible that they've added more robust bot detection, but I don't see any issues raised on the cousin package for R (ncaahoopR), so I'm thinking this might be something different. let me try scraping a season when I get a second, but in the meantime I do have some data I can send you. what's your email?

@dcstats dcstats self-assigned this Mar 15, 2024
@dcstats dcstats added the bug Something isn't working label Mar 15, 2024
@dcstats dcstats added this to the 2.0.3 milestone Mar 15, 2024
@jnmiller
Author

Thanks, that would be great! G-mail: jarednmiller

@dcstats
Owner

dcstats commented Mar 19, 2024

@jnmiller sent. I scraped the 23-24 season last night without issue, so I'm not sure what's causing this. I'll still add some of these mitigations, but I'll have to do some more digging to figure out why the issue pops up selectively

dcstats added a commit that referenced this issue Mar 19, 2024
@Mstolte02

I am having the same issue unfortunately. Any chance you'd have data from 2017 to 2023 handy?

@dcstats
Owner

dcstats commented Mar 20, 2024

@Mstolte02 @jnmiller could you both tell me what versions of python as well as the packages cbbpy, pandas, numpy, python-dateutil, pytz, tqdm, lxml, joblib, beautifulsoup4, and requests you're using? want to see if I can replicate this issue

@Mstolte02 what's your email? I can send you data

@Mstolte02

Mstolte02 commented Mar 20, 2024 via email

@dcstats
Owner

dcstats commented Mar 20, 2024

@Mstolte02 github obfuscates email addresses - send me an email (you can find mine at the bottom of CBBpy's README) and I'll reply with the data

@dgilmore33

I'm on date 11/13/23, looks like it just takes a long **s (not an email) time.

@dcstats could you open & assign me the issue of speeding up the method? I could use multithreading and a rate limiter. Once you do, I'll email you at the CBBpy email.

Thanks for making this repo! Looking forward to working together :)

@dcstats
Owner

dcstats commented Mar 21, 2024

@dgilmore33 could you tell me what versions of python and the required packages you're using? I want to replicate this issue first, because locally I'm able to scrape entire seasons in around 30 minutes

@dgilmore33

@dcstats honestly I don't have an "issue", I'm used to long data-loading times. I'll live.

Also, the more I think about it, it's better to keep a full-season scrape at its current runtime

Python version: 3.9.6
packages:

pip

  • altgraph 0.17.2
  • appnope 0.1.4
  • asttokens 2.4.1
  • attrs 23.2.0
  • bleach 6.1.0
  • certifi 2023.7.22
  • charset-normalizer 3.3.1
  • comm 0.2.2
  • cssselect 1.2.0
  • debugpy 1.8.1
  • decorator 5.1.1
  • exceptiongroup 1.2.0
  • executing 2.0.1
  • fastjsonschema 2.19.1
  • future 0.18.2
  • GDAL 3.8.4
  • idna 3.4
  • importlib_metadata 7.0.2
  • ipykernel 6.29.3
  • ipython 8.18.1
  • jedi 0.19.1
  • jsonschema 4.21.1
  • jsonschema-specifications 2023.12.1
  • jupyter_client 8.6.1
  • jupyter_core 5.7.2
  • kaggle 1.5.16
  • lxml 5.1.0
  • macholib 1.15.2
  • matplotlib-inline 0.1.6
  • nbformat 5.10.3
  • nest-asyncio 1.6.0
  • numpy 1.26.2
  • packaging 24.0
  • pandas 2.1.4
  • parso 0.8.3
  • pexpect 4.9.0
  • pip 21.2.4
  • platformdirs 4.2.0
  • prompt-toolkit 3.0.43
  • psutil 5.9.8
  • ptyprocess 0.7.0
  • pure-eval 0.2.2
  • Pygments 2.17.2
  • pyquery 2.0.0
  • python-dateutil 2.8.2
  • python-slugify 8.0.1
  • pytz 2023.3.post1
  • pyzmq 25.1.2
  • referencing 0.34.0
  • requests 2.31.0
  • rpds-py 0.18.0
  • setuptools 58.0.4
  • six 1.15.0
  • stack-data 0.6.3
  • text-unidecode 1.3
  • tornado 6.4
  • tqdm 4.66.1
  • traitlets 5.14.2
  • typing_extensions 4.10.0
  • tzdata 2023.3
  • urllib3 2.0.7
  • wcwidth 0.2.13
  • webencodings 0.5.1
  • wheel 0.37.0
  • zipp 3.18.1

conda==23.7.4

  • appnope 0.1.3 pyhd8ed1ab_0 conda-forge
  • asttokens 2.4.1 pyhd8ed1ab_0 conda-forge
  • attrs 23.2.0 pyh71513ae_0 conda-forge
  • beautifulsoup4 4.12.3 pypi_0 pypi
  • blas 1.0 mkl
  • blinker 1.7.0 pyhd8ed1ab_0 conda-forge
  • bottleneck 1.3.5 py311hb9e55a9_0
  • brotli 1.0.9 hca72f7f_7
  • brotli-bin 1.0.9 hca72f7f_7
  • brotli-python 1.0.9 py311h814d153_8 conda-forge
  • bs4 0.0.2 pypi_0 pypi
  • bzip2 1.0.8 h1de35cc_0
  • ca-certificates 2024.2.2 h8857fd0_0 conda-forge
  • cbbpy 2.0.2 pypi_0 pypi
  • certifi 2023.11.17 pypi_0 pypi
  • charset-normalizer 3.3.2 pyhd8ed1ab_0 conda-forge
  • click 8.1.7 unix_pyh707e725_0 conda-forge
  • comm 0.1.4 pyhd8ed1ab_0 conda-forge
  • contourpy 1.2.0 py311ha357a0b_0
  • cssselect 1.2.0 pypi_0 pypi
  • cycler 0.11.0 pyhd3eb1b0_0
  • dash 2.16.1 pyhd8ed1ab_0 conda-forge
  • debugpy 1.6.7 py311hcec6c5f_0
  • decorator 5.1.1 pyhd8ed1ab_0 conda-forge
  • exceptiongroup 1.2.0 pyhd8ed1ab_0 conda-forge
  • executing 2.0.1 pyhd8ed1ab_0 conda-forge
  • flask 3.0.2 pyhd8ed1ab_0 conda-forge
  • fonttools 4.25.0 pyhd3eb1b0_0
  • freetype 2.12.1 hd8bbffd_0
  • giflib 5.2.1 h6c40b1e_3
  • idna 3.6 pyhd8ed1ab_0 conda-forge
  • importlib-metadata 7.0.0 pyha770c72_0 conda-forge
  • importlib_metadata 7.0.0 hd8ed1ab_0 conda-forge
  • importlib_resources 6.3.2 pyhd8ed1ab_0 conda-forge
  • intel-openmp 2023.1.0 ha357a0b_43548
  • ipykernel 6.26.0 pyh3cd1d5f_0 conda-forge
  • ipython 8.18.1 pyh707e725_3 conda-forge
  • itsdangerous 2.1.2 pyhd8ed1ab_0 conda-forge
  • jedi 0.19.1 pyhd8ed1ab_0 conda-forge
  • jinja2 3.1.3 pyhd8ed1ab_0 conda-forge
  • joblib 1.3.2 pypi_0 pypi
  • jpeg 9e h6c40b1e_1
  • jsonschema 4.21.1 pyhd8ed1ab_0 conda-forge
  • jsonschema-specifications 2023.12.1 pyhd8ed1ab_0 conda-forge
  • jupyter_client 8.6.0 pyhd8ed1ab_0 conda-forge
  • jupyter_core 5.5.1 py311h6eed73b_0 conda-forge
  • kiwisolver 1.4.4 py311hcec6c5f_0
  • lcms2 2.12 hf1fd2bf_0
  • lerc 3.0 he9d5cce_0
  • libbrotlicommon 1.0.9 hca72f7f_7
  • libbrotlidec 1.0.9 hca72f7f_7
  • libbrotlienc 1.0.9 hca72f7f_7
  • libcxx 14.0.6 h9765a3e_0
  • libdeflate 1.17 hb664fd8_1
  • libffi 3.4.4 hecd8cb5_0
  • libpng 1.6.39 h6c40b1e_0
  • libsodium 1.0.18 hbcb3906_1 conda-forge
  • libtiff 4.5.1 hcec6c5f_0
  • libwebp 1.3.2 hf6ce154_0
  • libwebp-base 1.3.2 h6c40b1e_0
  • lxml 5.1.0 pypi_0 pypi
  • lz4-c 1.9.4 hcec6c5f_0
  • markupsafe 2.1.5 py311he705e18_0 conda-forge
  • matplotlib 3.8.0 py311hecd8cb5_0
  • matplotlib-base 3.8.0 py311h41a4f6b_0
  • matplotlib-inline 0.1.6 pyhd8ed1ab_0 conda-forge
  • mkl 2023.1.0 h8e150cf_43560
  • mkl-service 2.4.0 py311h6c40b1e_1
  • mkl_fft 1.3.8 py311h6c40b1e_0
  • mkl_random 1.2.4 py311ha357a0b_0
  • munkres 1.1.4 py_0
  • nba-api 1.4.1 pypi_0 pypi
  • nbformat 5.10.3 pyhd8ed1ab_0 conda-forge
  • ncurses 6.4 hcec6c5f_0
  • nest-asyncio 1.5.8 pyhd8ed1ab_0 conda-forge
  • numexpr 2.8.7 py311h728a8a3_0
  • numpy 1.26.2 py311h728a8a3_0
  • numpy-base 1.26.2 py311h53bf9ac_0
  • openjpeg 2.4.0 h66ea3da_0
  • openssl 3.2.1 hd75f5a5_1 conda-forge
  • packaging 23.1 py311hecd8cb5_0
  • pandas 2.1.4 py311hdb55bb0_0
  • parso 0.8.3 pyhd8ed1ab_0 conda-forge
  • pexpect 4.8.0 pyh1a96a4e_2 conda-forge
  • pickleshare 0.7.5 py_1003 conda-forge
  • pillow 10.0.1 py311h7d39338_0
  • pip 23.3.1 py311hecd8cb5_0
  • pkgutil-resolve-name 1.3.10 pyhd8ed1ab_1 conda-forge
  • platformdirs 4.1.0 pyhd8ed1ab_0 conda-forge
  • plotly 5.19.0 pyhd8ed1ab_0 conda-forge
  • prompt-toolkit 3.0.42 pyha770c72_0 conda-forge
  • psutil 5.9.7 py311he705e18_0 conda-forge
  • ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
  • pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
  • pygments 2.17.2 pyhd8ed1ab_0 conda-forge
  • pyparsing 3.0.9 py311hecd8cb5_0
  • pyquery 2.0.0 pypi_0 pypi
  • pysocks 1.7.1 pyha2e5f31_6 conda-forge
  • python 3.11.5 hf27a42d_0
  • python-dateutil 2.8.2 pyhd3eb1b0_0
  • python-fastjsonschema 2.19.1 pyhd8ed1ab_0 conda-forge
  • python-tzdata 2023.3 pyhd3eb1b0_0
  • python_abi 3.11 2_cp311 conda-forge
  • pytz 2023.3.post1 py311hecd8cb5_0
  • pyzmq 24.0.1 py311habfacb3_1 conda-forge
  • readline 8.2 hca72f7f_0
  • referencing 0.34.0 pyhd8ed1ab_0 conda-forge
  • requests 2.31.0 pyhd8ed1ab_0 conda-forge
  • retrying 1.3.3 py_2 conda-forge
  • rpds-py 0.18.0 py311hd64b9fd_0 conda-forge
  • setuptools 68.2.2 py311hecd8cb5_0
  • six 1.16.0 pyhd3eb1b0_1
  • soupsieve 2.5 pypi_0 pypi
  • sportsipy 0.6.0 pypi_0 pypi
  • sportsreference 0.5.2 pypi_0 pypi
  • sqlite 3.41.2 h6c40b1e_0
  • stack_data 0.6.2 pyhd8ed1ab_0 conda-forge
  • tbb 2021.8.0 ha357a0b_0
  • tenacity 8.2.3 pyhd8ed1ab_0 conda-forge
  • tk 8.6.12 h5d9f67b_0
  • tornado 6.3.3 py311h6c40b1e_0
  • tqdm 4.66.2 pypi_0 pypi
  • traitlets 5.14.0 pyhd8ed1ab_0 conda-forge
  • typing-extensions 4.9.0 hd8ed1ab_0 conda-forge
  • typing_extensions 4.9.0 pyha770c72_0 conda-forge
  • tzdata 2023c h04d1e81_0
  • urllib3 2.1.0 pypi_0 pypi
  • wcwidth 0.2.12 pyhd8ed1ab_0 conda-forge
  • werkzeug 3.0.1 pyhd8ed1ab_0 conda-forge
  • wheel 0.41.2 py311hecd8cb5_0
  • xz 5.4.5 h6c40b1e_0
  • zeromq 4.3.4 h23ab428_0
  • zipp 3.17.0 pyhd8ed1ab_0 conda-forge
  • zlib 1.2.13 h4dc903c_0
  • zstd 1.5.5 hc035e20_0

@dcstats
Owner

dcstats commented Mar 21, 2024

@dgilmore33 how long is scraping taking for you? if it's anything longer than 30 seconds per day, I think it's worth looking into speeding it up. I could also do something as simple as increasing the number of concurrently running jobs. I'm using multiprocessing, but you mentioned multithreading - would multithreading be better for this than multiprocessing?
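As an aside on the threads-vs-processes question: per-day scraping is I/O-bound (waiting on HTTP), so threads can overlap the waits without multiprocessing's process-startup and pickling overhead; joblib also supports this via `Parallel(prefer="threads")`. A stdlib sketch, where `fetch_day` and `scrape_days` are illustrative names rather than CBBpy functions:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_day(date):
    # Placeholder for the real per-day scrape (an HTTP request + parse).
    return f"games for {date}"


def scrape_days(dates, max_workers=4):
    # max_workers caps concurrency, which also acts as a crude rate limit
    # against the source site's bot detection.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_day, dates))
```

Lowering `max_workers` would trade speed for a gentler request rate, which may matter more than raw throughput if the hangs are caused by rate limiting.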

@crdarlin

crdarlin commented Mar 21, 2024 via email

@dgilmore33

@crdarlin I'll update requests now, thx for the tip

@dcstats multithreading hasn't provided a performance boost w/ multiprocessing in my experience, so I wouldn't expect it to. I forked the repo so I can just change the # of workers.

Ultimately, I got my game_data for the regular season, so I should be fine updating it day-by-day until the end of the tourney. Thanks for the RE's!

@dcstats
Owner

dcstats commented Mar 22, 2024

For now, I'm gonna mark this as an issue for a future release so I can push some other fixes. @jnmiller if you're still experiencing hangs on the latest version of CBBpy, let me know what versions of python and the required packages you're using so I can try to replicate.

@dcstats dcstats modified the milestones: 2.0.3, 2.1.0 Mar 22, 2024

5 participants