A web crawler built using Spark RDDs and DataFrames.
- Built a web crawler using Spark that starts from a seed URL; each crawled URL yields a new set of URLs to crawl
- Used RDDs and transformations to output tuples of the form (url, indegree)
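A minimal sketch of the (url, indegree) aggregation described above, written in plain Python rather than Spark so it runs standalone; the link graph, URL names, and `indegrees` helper are illustrative assumptions, not taken from the repository. In Spark this would typically be a `map` of each edge to `(dst, 1)` followed by `reduceByKey`.

```python
from collections import Counter

def indegrees(edges):
    """Count incoming links per URL.

    Mimics the Spark pipeline: map each (src, dst) edge to (dst, 1),
    then reduce by key with addition to get (url, indegree) tuples.
    """
    counts = Counter(dst for _src, dst in edges)
    return sorted(counts.items())

# Hypothetical link graph as (source URL, destination URL) pairs.
edges = [
    ("a.com", "b.com"),
    ("a.com", "c.com"),
    ("b.com", "c.com"),
]
print(indegrees(edges))  # [('b.com', 1), ('c.com', 2)]
```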
The problem statement can be found here.
Two other programs using Spark are also uploaded.
Input data is provided in the CSV files.
Command to merge the output parts into one sorted file:
$ cat webcrawler/part-* | sort > out_q3.txt