Skip to content

An example of the Go Colly scrapper library in action

Notifications You must be signed in to change notification settings

Veerse/go_colly_and_kafka_POC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

reezocar_test

Purpose

This program goes through the pages of a www.autoreflex.com announce listing and extracts for each announce :

  • Announce id
  • URL
  • Title of the announce
  • Price
  • Phone number
  • Garage name
  • Garage address
  • Car mileage
  • Car first registration date
  • Announce postal code
  • Car engine (Diesel/Essence)
  • Transmission (Automatique/Manuelle)
  • Fiscal power (this field actually contains horse power + fiscal power)

Listing used for this POC : http://www.autoreflex.com/137.0.-1.-1.-1.0.999999.1900.999999.-1.99.0.1?fulltext=&geoban=M137R99

Prerequisite

  • GoLang 1.13
  • Kafka 2.5.0
  • Go Colly scrapper library
  • Docker

Initialisation

A) Create an Announces Kafka topic :

  • ./kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic announces

B) Use following Docker commands to run a MongoDB container :

  • Create a volume to persist data docker volume create announces-volume
  • Run MongoDB image docker run --name announces-mongodb -v announces-volume:/data/db -p 27017:27017 -d mongo (for the sake of the POC we use no auth. We could have set environment variables MONGO_INITDB_ROOT_USERNAME and MONGO_INITDB_ROOT_PASSWORD to set credentials)

C) Download Go Colly and MongoDB libraries :

  • go get -u github.com/gocolly/colly/...
  • go get -u go.mongodb.org/mongo-driver/mongo

D) Run consumer and producer

Screenshots

Producer :

alt text

Consumer (log message hidden by default) :

alt text

MongoDB announces collection :

alt text

Areas for improvment

  • Better parsing of announces fields (for instance, fiscal_power field contains actually horse power + fiscal power separated by a "/", car mileage could be formatted to int ("120 000km" to "120000"), etc..)
  • When scanning a lot of pages (15+/20+ pages), there could sometimes be a client timeout error which must be handled
  • Avoid duplicates : can be easily done by overriding MongoDB's _id field with the announce ID

About

An example of the Go Colly scrapper library in action

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages