golf_scraper

Pulls historical data for the PGA Tour, European Tour, and Web.com Tour. Built on Scrapy. The Medium post this was built for can be found here.

Disclaimer

  1. Please don't use this maliciously or overwhelm servers. Unaltered, it follows the websites' robots.txt files.
  2. The PGA Tour and European Tour are fairly likely to update their websites over time, and as currently set up, the scraper isn't very robust to those changes. I am likely to keep using this in the future, so I will try to maintain it. However, I can't guarantee that I will.
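For anyone tweaking the crawl, politeness lives in standard Scrapy settings. The values below are an illustrative sketch, not the repo's actual configuration:

```python
# Illustrative polite-crawling settings (standard Scrapy options;
# golf_scraper/settings.py may use different values).
ROBOTSTXT_OBEY = True               # respect each site's robots.txt
DOWNLOAD_DELAY = 1.0                # wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # keep per-site concurrency low
AUTOTHROTTLE_ENABLED = True         # back off automatically when the server slows down
```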

Dependencies

  • scrapy
  • scrapy_splash
  • the Splash Docker image used to run scrapy_splash, as detailed in the scrapy_splash docs (see the settings sketch after this list)
  • tabula, which I use on some special-case Euro tournaments
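For reference, the scrapy_splash docs wire Splash into a Scrapy project with settings like the following. Treat this as a sketch; golf_scraper/settings.py should already carry the project's own version of these values:

```python
# Splash settings as shown in the scrapy_splash docs (sketch only;
# check golf_scraper/settings.py for the project's real values).
SPLASH_URL = "http://localhost:8050"  # where the Dockerized Splash server listens

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```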

Get Started

  1. Set up the scrapy_splash server. The typical command is docker run -p 8050:8050 scrapinghub/splash

  2. Run scrapy crawl schedules from the root directory to output a "sched.csv" file there. It will contain all the tournament links for the years set in golf_scraper/settings.py.

  3. Run python3 insp.py to see if any years were missed (this is pretty common). It usually takes 2-3 runs to pick up every year. The schedules spider should automatically detect which year/tour schedules have already been collected.

  • Note: The Web tour doesn't go very far back, so I start in 2010.
  • Another note: I also uploaded a copy of what the schedules crawl should output (sched.csv), since the file isn't that big. If you're lazy and it's still 2019, you can use that.
  4. Once the schedule is complete, it's time to collect leaderboards. scrapy crawl scores will iterate through the schedule links and collect leaderboards. Through trial and error, I found that many tournaments are match play or were cancelled, which makes them unsuitable for my ratings system; I skip them automatically. I usually only collect about 100 tournaments at a time because something usually goes wrong.
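For context, a Scrapy spider that goes through Splash typically wraps its requests in SplashRequest so the JavaScript leaderboards are rendered before parsing. The sketch below is hypothetical; the spider name, URL, and selectors are placeholders, not the repo's actual scores spider:

```python
import scrapy
from scrapy_splash import SplashRequest


class LeaderboardSketchSpider(scrapy.Spider):
    """Hypothetical sketch of a Splash-backed leaderboard spider."""

    name = "scores_sketch"  # the real spider is run with `scrapy crawl scores`

    def start_requests(self):
        # In the real project, the URLs come from sched.csv.
        urls = ["https://www.pgatour.com/placeholder-leaderboard"]
        for url in urls:
            # Let Splash render the page (wait for JavaScript) before parsing.
            yield SplashRequest(url, callback=self.parse, args={"wait": 2.0})

    def parse(self, response):
        # Placeholder selector; real parsing depends on each tour's markup.
        for row in response.css("table tr"):
            yield {"raw_row": row.get()}
```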

There are about 8 tournaments for which the European Tour only has special JavaScript leaderboards. Instead of writing a completely new parse function just for those tournaments, I downloaded a .pdf of the results and parsed that.
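The PDF route relies on tabula (tabula-py). A minimal sketch of that approach, with a placeholder file name:

```python
import tabula

# Pull every table out of the downloaded results PDF as pandas DataFrames.
# "euro_special_results.pdf" is a placeholder, not a file shipped with the repo.
tables = tabula.read_pdf("euro_special_results.pdf", pages="all")

for df in tables:
    print(df.head())  # inspect before reshaping into the leaderboard format
```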

Once a few tournaments are collected, I run python3 save.py. This adds them to a "has saved" csv file so that they're not collected again. If there are collection errors, you can skip save.py and those tournaments won't be recorded. Tournaments are output to a leaderboards folder in the format leaderboard/[season]/[tour]/[tournament_name].csv.
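In outline, that bookkeeping amounts to writing each leaderboard into its season/tour folder and logging the tournament so it isn't collected twice. A hypothetical sketch (function, column, and file names are illustrative, not save.py's actual code):

```python
import os

import pandas as pd


def save_leaderboard(df, season, tour, tournament_name, saved_log="has_saved.csv"):
    """Write one tournament's leaderboard and record it in the 'has saved' log."""
    # e.g. leaderboards/2018/PGA/some_tournament.csv (folder name per the repo layout)
    out_dir = os.path.join("leaderboards", str(season), tour)
    os.makedirs(out_dir, exist_ok=True)
    df.to_csv(os.path.join(out_dir, f"{tournament_name}.csv"), index=False)

    # Append to the log; write the header only if the file doesn't exist yet.
    entry = pd.DataFrame([{"season": season, "tour": tour, "tournament": tournament_name}])
    entry.to_csv(saved_log, mode="a", header=not os.path.exists(saved_log), index=False)
```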

Scraping these sites is inherently rocky, and the websites (especially the PGA Tour's) are not built with scrapers in mind, so this repo will probably run into issues. Let me know if you hit any and I'll try to diagnose them.
