WebCrawler

OVERVIEW:
This is a web crawler that scrapes job postings from a given Workday job postings URL. Each posting is saved to a file named by its job posting ID; the file contains a JSON document with a detailed description of the posting pulled from its sub-URL, along with a list of notable labels from the original listing, such as the job title, location, and posted date.
Once all job URLs on the first page have been collected, the details of each posting can be retrieved from the branching URLs in parallel (see the options below and the sketch that follows).
The crawler has been successfully tested against 3 different Workday job posting websites and should readily extend to more.
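
As a rough illustration of that parallel step, here is a minimal sketch. The fetch_posting() helper, the assumption that detail sub-URLs return JSON, and the ID-from-URL derivation are all illustrative, not the crawler's actual code:

  from concurrent.futures import ThreadPoolExecutor
  import json
  import os

  import requests

  def fetch_posting(url, dest_dir):
      # Hypothetical helper: download one posting's detail page and
      # save it as JSON in a file named by the posting ID.
      resp = requests.get(url, headers={"Accept": "application/json"}, timeout=30)
      resp.raise_for_status()
      posting_id = url.rstrip("/").split("/")[-1]  # assumption: ID is the last URL segment
      with open(os.path.join(dest_dir, posting_id + ".json"), "w") as f:
          json.dump(resp.json(), f, indent=2)

  def fetch_all(job_urls, dest_dir, thread_count=4):
      # Fetch all posting details in parallel; thread_count mirrors the -t option.
      with ThreadPoolExecutor(max_workers=thread_count) as pool:
          list(pool.map(lambda u: fetch_posting(u, dest_dir), job_urls))

A thread pool fits here because the work is I/O-bound: threads overlap the network waits even under Python's GIL.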



CONTENTS:

  • crawler.py
    Main logic for scraping Workday job postings; also the entry point for the program.
  • util.py
    General utility functions used by the crawler; not crawler-dependent.



USAGE: python3 crawler.py [options]

Options:

  • -h, --help
    Show this help message and exit.

  • -u MAIN_LINK, --url=MAIN_LINK
    Job posting URL [Default: https://mastercard.wd1.myworkdayjobs.com/CorporateCareers]

  • -d DEST_DIR, --dest=DEST_DIR
    Destination directory [Default: ./test]

  • -t THREAD_COUNT, --threads=THREAD_COUNT
    Number of parallel threads [Default: 4]

  • -v, --verbose
    Verbose output to stdout [Default: False]
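
EXAMPLES:

Illustrative invocations (the URL shown is the documented default; the destination directory and thread count here are arbitrary):

  python3 crawler.py
  python3 crawler.py -u https://mastercard.wd1.myworkdayjobs.com/CorporateCareers -d ./postings -t 8 -v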
