Crawler for Trip Advisor reviews

Trip Advisor crawler

This is a simple crawler script for Trip Advisor.

It is aimed at researchers and students who want to experiment with text mining problems on review data.

usage: [-h] [-f] [-r MAXRETRIES] [-t TIMEOUT] [-a {Hotel,Restaurant}] [-p PAUSE] [-m MAXREVIEWS] -o OUT ID [ID ...]

required arguments:

-o OUT, --out OUT Output base path

ID IDs for which to download reviews

ID format:

  • domain:locationcode e.g. com:187893 = reviews from any hotel in Tuscany, from the com domain
  • domain:locationcode:citycode e.g. jp:187899:187899 = reviews from any hotel in the city of Pisa from the jp domain
  • domain:locationcode:citycode:hotelcode e.g. it:187899:187899:662603 = all reviews for a specific hotel from the it domain
  • domain:locationcode:citycode:hotelcode:reviewcode e.g. it:187899:187899:662603:322965103 = a single specific review
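
The ID format above can be read as a colon-separated path that narrows from a region down to a single review. The following is a minimal sketch of how such a string might be split and validated; it is not taken from the crawler itself. The names `CrawlTarget` and `parse_id` are hypothetical, and the check that every code is numeric is an assumption based on the examples shown.

```python
from typing import NamedTuple, Optional


class CrawlTarget(NamedTuple):
    """Hypothetical container for the parts of an ID string."""
    domain: str
    location: str
    city: Optional[str] = None
    hotel: Optional[str] = None
    review: Optional[str] = None


def parse_id(raw: str) -> CrawlTarget:
    """Split 'domain:locationcode[:citycode[:hotelcode[:reviewcode]]]'."""
    parts = raw.split(":")
    if not 2 <= len(parts) <= 5:
        raise ValueError(f"expected 2 to 5 colon-separated fields, got {len(parts)}: {raw!r}")
    if any(not part for part in parts):
        raise ValueError(f"empty field in ID: {raw!r}")
    domain, *codes = parts
    # Assumption: all codes are numeric, as in the examples above.
    if not all(code.isdigit() for code in codes):
        raise ValueError(f"non-numeric code in ID: {raw!r}")
    # Missing trailing fields keep their default of None.
    return CrawlTarget(domain, *codes)


if __name__ == "__main__":
    for example in ("com:187893", "it:187899:187899:662603:322965103"):
        print(parse_id(example))
```

Fields that are absent stay `None`, so the depth of the ID (region, city, hotel, or single review) can be read from the first unset field.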

optional arguments:

-h, --help show help message and exit

-f, --force Force download even if already successfully downloaded

-a {Hotel,Restaurant}, --activity {Hotel,Restaurant} Type of activity to crawl (default: Hotel)

-r MAXRETRIES, --maxretries MAXRETRIES Max retries to download a file. Default: 3

-t TIMEOUT, --timeout TIMEOUT Timeout in seconds for http connections. Default: 180

-p PAUSE, --pause PAUSE Seconds to wait between http requests. Default: 0.2

-m MAXREVIEWS, --maxreviews MAXREVIEWS Maximum number of reviews per item to download. Default: unlimited
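
For readers who want to see how the options fit together, here is a sketch of an `argparse` definition that would produce a usage line like the one above. It is a reconstruction from the help text, not the script's actual source: the function name `build_parser`, the destination name `ids`, the argument types, and the use of `None` to mean "unlimited" are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(description="Crawler for Trip Advisor reviews")
    parser.add_argument("-f", "--force", action="store_true",
                        help="Force download even if already successfully downloaded")
    parser.add_argument("-r", "--maxretries", type=int, default=3,
                        help="Max retries to download a file")
    parser.add_argument("-t", "--timeout", type=float, default=180,
                        help="Timeout in seconds for http connections")
    parser.add_argument("-a", "--activity", choices=["Hotel", "Restaurant"],
                        default="Hotel", help="Type of activity to crawl")
    parser.add_argument("-p", "--pause", type=float, default=0.2,
                        help="Seconds to wait between http requests")
    # Assumption: None stands for "unlimited".
    parser.add_argument("-m", "--maxreviews", type=int, default=None,
                        help="Maximum number of reviews per item to download")
    parser.add_argument("-o", "--out", required=True, help="Output base path")
    parser.add_argument("ids", metavar="ID", nargs="+",
                        help="IDs for which to download reviews")
    return parser


# Example: at most 100 hotel reviews per item for Tuscany, from the com domain.
args = build_parser().parse_args(["-o", "reviews", "-m", "100", "com:187893"])
print(args)
```

In this example only `-o` and one ID are strictly required; every other option falls back to the default listed above.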
