Add scraper for understat.com (Fixes #430) #436
This commit adds a scraper `scrape_understat.py` that aims to scrape the goal time and goal scorer for each match from the '15-'16 season. It also pulls in the substitution information.
This commit cleans up the docstring of the `parse_match` function and adds information about the exact structure of the goals and subs data. It also adds the complete data files for the 16-17, 17-18, 18-19, 19-20, and 20-21 seasons, and adds the beautifulsoup dependency to the requirement.txt file.
@jack89roberts I've added a scraper in the |
@jack89roberts I've updated the string that was causing the black check error. It should be fine now, I checked locally too.
Thanks @chahak13, looks good 🙂 I'll just check it all runs ok for me locally before merging, haven't had a chance to do that yet but hopefully will by the end of the week.
Okay, there seem to be a couple of issues. I'll correct them and push the code. Sorry :/
The choices for the `season` CLI argument are now based on the keys of the `base_url` dictionary. This makes it easier to add/remove seasons as only one variable needs to be changed.
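A minimal sketch of the idea in this commit, assuming a `base_url` dictionary keyed by season. The season keys, the placeholder URLs, and the `build_parser` helper are hypothetical and are not taken from `scrape_understat.py`:

```python
import argparse

# Hypothetical mapping: the real keys and URLs in the scraper may differ.
base_url = {
    "1516": "https://example.com/league/2015",
    "1617": "https://example.com/league/2016",
    "1718": "https://example.com/league/2017",
}


def build_parser():
    """Build a CLI parser whose valid seasons come from the keys of base_url."""
    parser = argparse.ArgumentParser(description="Scrape goals and substitutions")
    parser.add_argument(
        "--season",
        choices=sorted(base_url),  # adding a season only requires editing base_url
        required=True,
        help="Season to scrape",
    )
    return parser


args = build_parser().parse_args(["--season", "1617"])
print(base_url[args.season])
```

With this layout, adding or removing a season means changing only the dictionary; argparse rejects any value that is not one of its keys.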
The scraper now raises a KeyError for an invalid season value and also raises an error if no response is received. Furthermore, the code was formatted with the `black` formatter to enforce strict formatting. This led to splitting a couple of strings across two lines with f-strings.
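The error handling described here might look roughly like the sketch below. The helper names `get_season_url` and `fetch_page`, the injected `getter` callable, and the choice of `RuntimeError` for a missing response are assumptions for illustration; the commit message does not say which exception type the scraper uses:

```python
# Hypothetical mapping: the real keys and URLs in the scraper may differ.
base_url = {
    "1516": "https://example.com/league/2015",
    "1617": "https://example.com/league/2016",
}


def get_season_url(season):
    """Return the URL for a season, raising KeyError for unknown seasons."""
    if season not in base_url:
        raise KeyError(
            f"Unknown season {season!r}; "
            f"expected one of {sorted(base_url)}"
        )
    return base_url[season]


def fetch_page(url, getter):
    """Fetch a page with the supplied getter and fail loudly on no response.

    `getter` is any callable that takes a URL and returns a response or None
    (an assumption made here so the sketch runs without network access).
    """
    response = getter(url)
    if response is None:
        raise RuntimeError(f"No response received from {url}")
    return response


page = fetch_page(get_season_url("1516"), lambda url: f"<html>{url}</html>")
print(page)
```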
@jack89roberts the |
@chahak13 Yes, and all the code style checks are passing now 🙂 I just need to run it and have a quick look through myself before merging, which I'll try to do by end of this week.
This mainly applies the isort changes to the imports in most of the files.
@jack89roberts Are the tests in the dev branch not passing? Because the pytest error is not in any file that I changed.
Don't worry about the failing test @chahak13, they fail from forks at the moment because forks don't have access to an environment variable we store in the settings of the repo. I just pushed a couple of small changes so the script runs successfully on my laptop. I think I have spotted a problem though: if I run it myself, the output files I get are slightly different to yours. Digging into it a bit, it looks like it doesn't pick up a team making multiple subs at the same time, e.g. in this match: https://understat.com/match/14097 there are two subs at 60 mins:
In this PR the Aboubakar Kamara sub is in the 2021 file but not the Josh Onomah sub; in my run it's the other way round (maybe it returns one at random?). Do you think it's possible to fix that? (P.S. Fixed it so the tests can run successfully now)
Ah that's weird. Let me look into it. I'll get it done and let you know.
I had been using the `find` function to locate the rows corresponding to a substitution, which meant only one substitution was recorded when more than one was made at the same time. I changed the call to `find_all` and now loop over all the rows to record every substitution. As can be seen in the data files, this has led to the inclusion of several instances where we were previously missing 1 or even 2 other substitutions.
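The difference between the two BeautifulSoup calls can be sketched as follows. The HTML snippet, class names, and player names are made up for illustration and do not reflect the actual understat.com markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two substitutions recorded in the same minute.
html = """
<div class="timeline-row" data-minute="60">
  <div class="sub"><span class="in">Player A</span><span class="out">Player B</span></div>
  <div class="sub"><span class="in">Player C</span><span class="out">Player D</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching element, so the second sub is lost.
first_only = soup.find("div", class_="sub")

# find_all() returns every matching element, so all subs can be recorded.
subs = [
    (
        row.find("span", class_="in").get_text(),
        row.find("span", class_="out").get_text(),
    )
    for row in soup.find_all("div", class_="sub")
]

print(first_only.find("span", class_="in").get_text())
print(subs)
```

Looping over the `find_all` result is what lets every substitution in the same minute be captured, rather than just the first match.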
@jack89roberts That should take care of it. I was using a single |
That's great, I'll merge it to develop now. Thanks again @chahak13!