Williamarnoclement/Crawler

Web Crawler

The project

This repository holds my from-scratch web search engine: an index stored in a MySQL database, a crawler written in Java, and a ranker written in PHP (not included in this Git version). A demo version indexed ~4,700 pages.

How it works

A recursive function crawls the web. For each new web page found, several classes are instantiated. Multiple checks run before the new page is registered in the index, such as reading the site's robots.txt file, counting how many pages redirect to the current page, and more.

Here is a short diagram of how the crawler works (the labels are in French).

[Diagram: how it works]
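
To make the crawl loop described in this section more concrete, here is a minimal sketch of a recursive crawl with pre-index checks. It is not the repository's actual code: the class `CrawlerSketch`, its fields, the depth limit, and the in-memory map standing in for the network are all assumptions made for illustration. The robots.txt rules are assumed to be already parsed into a list of disallowed URL prefixes, a plain set stands in for the MySQL index, and the "pages pointing to this page" check is approximated by counting inbound links.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: every name here is illustrative, not taken from the repository.
class CrawlerSketch {
    // Stand-in for the network: each URL maps to the links found on that page.
    private final Map<String, List<String>> web;
    // URL prefixes that a parsed robots.txt would disallow (assumed input).
    private final List<String> disallowedPrefixes;
    // Stand-in for the MySQL index: accepted pages, in discovery order.
    final Set<String> index = new LinkedHashSet<>();
    // Number of links seen pointing to each URL.
    final Map<String, Integer> inboundCount = new HashMap<>();

    CrawlerSketch(Map<String, List<String>> web, List<String> disallowedPrefixes) {
        this.web = web;
        this.disallowedPrefixes = new ArrayList<>(disallowedPrefixes);
    }

    // Check 1: refuse any URL that falls under a disallowed prefix.
    boolean isAllowed(String url) {
        for (String prefix : disallowedPrefixes) {
            if (url.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    // Recursive crawl: run the checks, register the page, then follow its links.
    void crawl(String url, int remainingDepth) {
        if (remainingDepth < 0) return;     // assumed depth limit to bound recursion
        if (index.contains(url)) return;    // already registered
        if (!isAllowed(url)) return;        // blocked by the robots rules
        if (!web.containsKey(url)) return;  // page could not be fetched
        index.add(url);
        for (String link : web.get(url)) {
            // Check 2: count each link pointing to the target page.
            inboundCount.merge(link, 1, Integer::sum);
            crawl(link, remainingDepth - 1);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "https://a.example/", List.of("https://a.example/about", "https://a.example/private/x"),
            "https://a.example/about", List.of("https://a.example/", "https://a.example/private/x"),
            "https://a.example/private/x", List.of());
        CrawlerSketch crawler = new CrawlerSketch(web, List.of("https://a.example/private/"));
        crawler.crawl("https://a.example/", 5);
        // prints [https://a.example/, https://a.example/about]
        System.out.println(crawler.index);
        // prints 2 (both crawled pages link to the disallowed page)
        System.out.println(crawler.inboundCount.get("https://a.example/private/x"));
    }
}
```

In this sketch the disallowed page is never registered, yet links to it are still counted, which is the kind of signal a ranker could use later. A real crawler would replace the map lookup with an HTTP fetch and HTML link extraction, and would likely use an explicit queue instead of deep recursion for large crawls.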

About me

My name is William-Arno and I love making things! Discover my code on my GitHub profile and on my website.

I wrote an article (in French) about the process of building a search engine; it is available on my blog.
