build initial crawler script #1

Closed · corb1999 opened this issue Nov 15, 2021 · 0 comments

corb1999 (Owner) commented:
finish building the script that does the following:

  • make a list of URLs to crawl, one URL per results page; use purrr to generate the list
  • do an initial page read, then calculate how many total pages need to be read
  • write a function that reads the HTML and parses it into a tidy dataframe, including a Sys.sleep() so we don't spam the site
  • do minor cleaning on each page, then compile them all into one dataframe
  • append a timestamp to the result, then export it to a dropzone (a rough sketch of the whole pipeline is below)
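A minimal sketch of that pipeline, assuming rvest for the HTML work. The base URL, the CSS selectors (`.page-count`, `.product-name`, `.product-price`), and the dropzone path are all placeholders, since the target site isn't named in this issue:

```r
library(rvest)
library(purrr)
library(dplyr)
library(readr)

# Hypothetical paginated site; swap in the real base URL and selectors
base_url <- "https://example.com/products?page="

# Initial page read: work out how many total pages need to be crawled.
# ".page-count" is an assumed selector for the pagination element.
first_page <- read_html(paste0(base_url, 1))
total_pages <- first_page %>%
  html_element(".page-count") %>%
  html_text2() %>%
  as.integer()

# Use purrr to generate the list of URLs, one per page
urls <- map_chr(seq_len(total_pages), ~ paste0(base_url, .x))

# Read one page's HTML and parse it into a tidy dataframe;
# Sys.sleep() throttles requests so we don't spam the site
read_page <- function(url) {
  Sys.sleep(2)
  page <- read_html(url)
  tibble(
    product = page %>% html_elements(".product-name") %>% html_text2(),
    price   = page %>% html_elements(".product-price") %>% html_text2()
  )
}

# Crawl every page and compile the results into one dataframe,
# with minor cleaning (trim whitespace, parse prices to numeric)
result <- map_dfr(urls, read_page) %>%
  mutate(
    product = trimws(product),
    price   = parse_number(price)
  )

# Append a timestamp to the result, then export to the dropzone
result <- result %>% mutate(crawl_ts = Sys.time())
write_csv(result, file.path(
  "dropzone",
  paste0("crawl_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".csv")
))
```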
corb1999 self-assigned this Nov 15, 2021
corb1999 added a commit that referenced this issue Nov 20, 2021
…gets the products and prices. partially solves issue #1 but now need to do some initial cleaning and then write out the final dataframe
corb1999 added a commit that referenced this issue Nov 20, 2021
…bject then write to csv. it works, this solves issue #1 and will in the future run this crawler periodically and then later work to compile outputs
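The second commit mentions running the crawler periodically and compiling the outputs later. A short sketch of what that compilation step could look like, assuming the timestamped-filename convention from the pipeline sketch above:

```r
library(purrr)
library(readr)

# Gather every timestamped crawl export sitting in the dropzone
# and stack them into one dataframe for later analysis
crawl_files <- list.files("dropzone", pattern = "^crawl_.*\\.csv$", full.names = TRUE)
compiled <- map_dfr(crawl_files, read_csv)
```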