Skip to content

The project implements HTTP protocol Constructed HTTP request and response messages from scratch and emulating browsers requests required to fetch a page. Parsed HTTP request, handled cookies receives in response messages. Handled authentication using HTTP Post method parsed the HTML page using Jsoup Library.

Notifications You must be signed in to change notification settings

aj04/Web_Crawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Web Crawler

The coding of the project is done in JAVA. The project implements HTTP protocol Constructed HTTP request and response messages from scratch and emulating browsers requests required to fetch a page. Parsed HTTP request, handled cookies receives in response messages. Handled authentication using HTTP Post method parsed the HTML page using Jsoup Library.

Program Summary Approach:

Web Crawler is implemented to crawl on web pages of fakebook and return secret flags There are in all five secret flags and as soon as those flags are obtained crawling is ended. I have used Queue of link list which is like frontier and hashset that ensures that duplicate urls are not visited.

Challenges Faced:

-Imitate the HTTP POST request messages from client and parse login credentials to the server successfully. -Handling Cookies: extracting csrf token and sessionid values from received response header and parsing the same on new request messages to server.

-Run time of the program, initially I had run time of around 25min. we tried many possibilities and even mutithreading to reduce the run-time. A close analysis of the Header concluded that Connection header had the value keep alive. Removing that from the loop headers paced up the program exponentially. and the run-time was reduced to 6min. -Error handling:HTTP status codes handled 301, 302, 403/404, 500

Code tested:

I manually disabled and then re-enabled it : This allowed me to check server internal 500 error: the program continued to run untill the connection was reestablished and displayed the five secret flags at the output authentication: User gives invalid credentials at the input of command line. the server responds with "Please enter a correct username and password" same message is redirected to the user.

About

The project implements HTTP protocol Constructed HTTP request and response messages from scratch and emulating browsers requests required to fetch a page. Parsed HTTP request, handled cookies receives in response messages. Handled authentication using HTTP Post method parsed the HTML page using Jsoup Library.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published