The challenge involves writing a program that:
1. Scrapes the index webpage hosted at cfcunderwriting.com.
2. Writes a list of all externally loaded resources (e.g. images, scripts, or fonts not hosted on cfcunderwriting.com) to a JSON output file.
3. Enumerates the page's hyperlinks and identifies the location of the "Privacy Policy" page.
4. Uses the privacy policy URL identified in step 3 to scrape that page's content, produces a case-insensitive word frequency count of all visible text on the page, and writes it to a JSON output file.
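
The resource and hyperlink steps above can be sketched with the Python standard library alone. This is a minimal illustration, not necessarily how webscraper.py does it: the names `PageScanner`, `is_external`, and `find_privacy_policy` are hypothetical, the `https://` scheme in `BASE_URL` is assumed, subdomains of cfcunderwriting.com are assumed to count as internal, and the sample HTML stands in for the fetched page so the example runs offline.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

SITE_DOMAIN = "cfcunderwriting.com"
BASE_URL = "https://cfcunderwriting.com/"  # assumption: scheme is not given in the README


class PageScanner(HTMLParser):
    """Collects resource URLs and (href, text) pairs for hyperlinks."""

    # Simplification: every <link href> is treated as a loaded resource.
    RESOURCE_TAGS = {"img": "src", "script": "src", "link": "href", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.resources = []  # URLs referenced by resource tags
        self.links = []      # (href, link text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        attr = self.RESOURCE_TAGS.get(tag)
        if attr and attrs.get(attr):
            self.resources.append(attrs[attr])
        if tag == "a" and attrs.get("href"):
            self._href, self._text = attrs["href"], []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            self.links.append((self._href, text))
            self._href = None


def is_external(url, base=BASE_URL, site=SITE_DOMAIN):
    """True when the resolved host is neither the site nor one of its subdomains."""
    host = urlparse(urljoin(base, url)).hostname
    if not host:  # e.g. data: URIs have no host
        return False
    return not (host == site or host.endswith("." + site))


def find_privacy_policy(links, base=BASE_URL):
    """Returns the absolute URL of the first link whose text mentions 'privacy policy'."""
    for href, text in links:
        if "privacy policy" in text.lower():
            return urljoin(base, href)
    return None


# Hypothetical page content, used here instead of a live HTTP request.
SAMPLE_HTML = """
<html><head>
<script src="https://cdn.example.com/lib.js"></script>
<link rel="stylesheet" href="/css/site.css">
</head><body>
<img src="//images.example.net/logo.png">
<a href="/about">About</a>
<a href="/privacy-policy/">Privacy <b>Policy</b></a>
</body></html>
"""

scanner = PageScanner()
scanner.feed(SAMPLE_HTML)
external = [urljoin(BASE_URL, u) for u in scanner.resources if is_external(u)]
print(json.dumps(external, indent=2))
print(find_privacy_policy(scanner.links))
```

Resolving every URL against the base with `urljoin` before comparing hosts means relative paths and protocol-relative URLs (`//host/path`) are classified consistently.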

To run the program:
1. Install the required dependencies.
   $ pip install -r requirements.txt
2. Execute the program.
   $ python webscraper.py
3. Find the JSON output files, external_resources.json and word_frequency.json, created in the current directory.
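
For orientation, the kind of content word_frequency.json might hold can be sketched as follows. This is an illustration under stated assumptions rather than the program's actual logic: the `VisibleText` class and `word_frequency` function are hypothetical names, "visible" is assumed to mean text outside `head`, `script`, `style`, `noscript`, and `template` elements, words are assumed to be runs of letters and digits with optional internal apostrophes, and the word-to-count mapping is an assumed output shape.

```python
import json
import re
from collections import Counter
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Gathers text that sits outside elements whose content is not rendered."""

    HIDDEN = {"head", "script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._hidden_depth = 0  # > 0 while inside any hidden element

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.chunks.append(data)


def word_frequency(html):
    """Case-insensitive word counts for the visible text, most frequent first."""
    parser = VisibleText()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()  # lowercasing makes the count case-insensitive
    words = re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text)
    return dict(Counter(words).most_common())


# Hypothetical page content standing in for the scraped privacy policy.
SAMPLE_HTML = """
<html><head><title>Privacy</title><style>p { color: red; }</style></head>
<body><p>We value your privacy. Your data is YOUR data.</p>
<script>var tracking = "privacy";</script></body></html>
"""

counts = word_frequency(SAMPLE_HTML)
print(json.dumps(counts, indent=2))
```

Passing the same dictionary to `json.dump` with an open file handle would write it to disk instead of printing it.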