Web-Crawler

A web crawler built using Spark RDDs and DataFrames.

Highlights

  • Built a web crawler using Spark that starts from a seed URL; each crawled URL yields a new set of URLs to crawl
  • Used RDDs and their transformations to output tuples of the form (url, indegree); a sketch of this pipeline appears after this list
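
The full implementation is in the repository; the following is only a minimal sketch of the RDD pipeline the highlights describe. The fetch_links helper, the seed URL, the crawl depth, and the webcrawler output directory are illustrative assumptions, not the repository's actual code.

    from pyspark import SparkContext

    def fetch_links(url):
        # Hypothetical helper: fetch `url` and return the URLs it links to.
        # The real fetching/parsing logic lives in the repository's crawler.
        return []

    if __name__ == "__main__":
        sc = SparkContext(appName="webcrawler-sketch")

        frontier = sc.parallelize(["http://example.com"])  # assumed seed URL
        edges = sc.emptyRDD()                               # (source, destination) link pairs

        for _ in range(2):  # fixed crawl depth, chosen only for this sketch
            # Expand each frontier URL into (url, outgoing_link) pairs.
            new_edges = frontier.flatMap(lambda u: [(u, v) for v in fetch_links(u)])
            edges = edges.union(new_edges)
            frontier = new_edges.map(lambda e: e[1]).distinct()

        # Indegree of a URL = number of crawled links pointing at it.
        indegrees = edges.map(lambda e: (e[1], 1)).reduceByKey(lambda a, b: a + b)

        indegrees.saveAsTextFile("webcrawler")  # writes the part-* files merged below
        sc.stop()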

Problem Statement

The problem statement can be found here.

Other details

Two other programs using Spark are also included in this repository.

Their data is provided in the CSV files.
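
As a hedged illustration (not the repository's exact code) of loading such CSV data with Spark DataFrames, assuming a hypothetical file named data.csv with a header row:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-sketch").getOrCreate()

    # Load one of the uploaded CSV files into a DataFrame for inspection.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.show(5)

    spark.stop()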

Command to view the crawler output as a single merged, sorted file:
$ cat webcrawler/part-* | sort > out_q3.txt
