Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scraper for understat.com (Fixes #430) #436

Merged
merged 16 commits into from
Oct 15, 2021

Conversation

chahak13
Copy link
Contributor

This commit adds a scraper scrape_understat.py that aims to
scrape the goal time and goal scorer for each match from the
'15-'16 season. It also pulls in the substitution information.

Chahak Mehta added 2 commits October 11, 2021 22:31
This commit adds a scraper `scrape_understat.py` that aims to
scrape the goal time and goal scorer for each match from the
'15-'16 season. It also pulls in the substitution information.
This commit clears the docstring of the `parse_match` function to
add information about the exact structure of the goals and subs
data. It also adds the complete datafiles for the seasons 16-17,
17-18, 18-19, 19-20, and 20-21. Also adds beautifulsoup dependency
in the requirement.txt file.
@chahak13
Copy link
Contributor Author

@jack89roberts I've added a scraper in the scraper/ directory and the data files are named goals_subs_data_{season}.json in the data/ directory. Let me know if this looks good or if any other changes are to be made!

@chahak13
Copy link
Contributor Author

@jack89roberts I've updated the string that was causing the black check error. It should be fine now, I checked locally too.

@jack89roberts
Copy link
Contributor

Thanks @chahak13 , looks good 🙂 I'll just check it all runs ok for me locally before merging, haven't had a chance to do that yet but hopefully will by the end of the week.

@chahak13
Copy link
Contributor Author

Okay, there seem to be a couple of issues. I'll correct them and push the code. Sorry :/

Chahak Mehta added 3 commits October 13, 2021 23:16
The choices for the `season` CLI argument are now based on the
keys of the `base_url` dictionary. This makes it easier to
add/remove seasons as only one variable needs to be changed.
Now raise KeyError for wrong season value and also raise Error
if no response is received. Furthermore, used `black` formatter to
add strict formatting to the code. This led to splitting a couple
of strings to two lines with f-strings.
@chahak13
Copy link
Contributor Author

@jack89roberts the black is using 88 characters limit, right? If so, things look good on my local setup. Should hopefully work now. All except one change are majorly formatting things to confirm to black. Sorry for the mess :/

@jack89roberts
Copy link
Contributor

@chahak13 Yes, and all the code style checks are passing now 🙂 I just need to run it and have a quick look through myself before merging, which I'll try to do by end of this week.

@chahak13
Copy link
Contributor Author

@jack89roberts Are the tests in the dev branch not passing? Because the pytest error is not in any file that I changed.

@jack89roberts
Copy link
Contributor

jack89roberts commented Oct 14, 2021

Don't worry about the failing test @chahak13 , they fail from forks at the moment due to not having access to an environment variable we store in the settings of the repo. I just pushed a couple of small changes so the script runs successfully on my laptop.

I think I have spotted a problem though - if I run it myself the output files I get slightly different outputs to yours. Digging into it a bit it looks like it doesn't pick up a team making multiple subs at the same time, e.g. in this match: https://understat.com/match/14097 there are two subs at 60 mins:

Josh Onomah -> Bobby Reid
Aboubakar Kamara -> Neeskens Kebano

In this PR the Aboubakar Kamara sub is in the 2021 file but not the Josh Onomah sub, in my run it's the other way round (maybe it returns one at random?) Do you think it's possible to fix that?

(P.S. Fixed it so the tests can run successfully now)

@chahak13
Copy link
Contributor Author

Digging into it a bit it looks like it doesn't pick up a team making multiple subs at the same time,

Ah that's weird. Let me look into it. I'll get it done and let you know.

Chahak Mehta added 3 commits October 14, 2021 18:42
I had been using `find` function in finding the rows that corresponded
to a substitution which led just recording one substitution if more
than one substitutions were being made at the same time. Changed the
call to `find_all` and looping over all the rows to now record all the
substitutions instead. As can be seen in the data files, this has led
to the inclusion of several instances where we were missing 1 or even
2 other substitutions.
@chahak13
Copy link
Contributor Author

@jack89roberts That should take care of it. I was using a single find command instead of find_all. I've updated that. Let me know if it looks okay now or if there are any other changes. I've also added the data files with the new sub info added.

@jack89roberts
Copy link
Contributor

That's great, I'll merge it to develop now. Thanks again @chahak13 !

@jack89roberts jack89roberts merged commit 820dc13 into alan-turing-institute:develop Oct 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants