This repo contains code and data files for scraping data from the website BoatTrader. This was done as part of the final project for Statistical Data Mining.
The data scraping is done using an API call. The API is not advertised publicly; however, it is used to populate the search results page. The API URI and associated query parameters can be found in the script, `index.js`.
The API can only return a maximum of 1000 results in a single query, so a paging approach is used to retrieve more results. The API also has a maximum limit of 10,000 results in total (i.e. 10 pages of 1000 results each). The latter limit is evidenced by the search results capping out at 357 pages with a page size of 28 results (roughly 10,000 records).
The script uses the paged API query to fetch 10,000 results. A larger data set can be retrieved by switching the sort parameter between `modified-asc` and `modified-desc`, which returns the 10,000 earliest and the 10,000 latest updated records respectively.
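The paging approach can be sketched as follows. The base URL and parameter names here are placeholders — the real endpoint and query parameters live in `index.js`.

```javascript
// Sketch of the paged query construction. BASE_URL and the parameter
// names (page, pageSize, sort) are hypothetical placeholders; see
// index.js for the actual API URI and query parameters.
const BASE_URL = "https://api.example.com/search";

const PAGE_SIZE = 1000; // API maximum per request
const MAX_PAGES = 10;   // API caps total results at 10,000

// Build the request URL for one page of results under a given sort order.
function buildPageUrl(page, sort) {
  const params = new URLSearchParams({
    page: String(page),
    pageSize: String(PAGE_SIZE),
    sort, // "modified-asc" or "modified-desc"
  });
  return `${BASE_URL}?${params.toString()}`;
}

// Generate the URLs for all 10 pages under one sort order.
function allPageUrls(sort) {
  return Array.from({ length: MAX_PAGES }, (_, i) => buildPageUrl(i + 1, sort));
}
```

Running the same loop once with `modified-asc` and once with `modified-desc` yields the two 10,000-record data sets.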
The data generated by the script is saved in CSV format, one file per page, so each run of the script generates 10 CSV files. The following parameters are returned for each record:
- `id` - Unique ID for the record
- `url` - Boat Trader URL for the boat
- `type` - Type of the boat
- `boatClass` - Class of the boat
- `make` - Make of the boat
- `model` - Model of the boat
- `year` - Year of the boat
- `condition` - New/Used
- `length_ft` - Nominal length of the boat in ft
- `beam_ft` - Beam of the boat in ft
- `dryWeight_lb` - Dry weight of the boat in lb
- `created` - Date the posting was created
- `hullMaterial` - Material of the boat's hull
- `fuelType` - Fuel type of the boat
- `numEngines` - Number of engines listed for the boat
- `maxEngineYear` - Newest engine year
- `minEngineYear` - Oldest engine year
- `totalHP` - Total power of the engines combined in HP
- `engineCategory` - Engine category (note: `multiple` is used when the engines are dissimilar)
- `price` - Listing price for the boat
- `city`
- `country`
- `state`
- `zip`
- `seller id`
The script was run to get the 10,000 newest and the 10,000 oldest updated records from the website. This data is available in the `newest` and `oldest` folders respectively. Each folder has 10 page files with 1000 records each. These need to be merged before analysis.
Duplicates might exist after merging the data files. It is recommended to use the `id` and/or `url` columns to filter duplicates.
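A minimal sketch of that de-duplication, assuming the merged rows have been parsed into objects with an `id` field:

```javascript
// Drop duplicate records, keeping the first occurrence of each id.
// The same approach works keyed on the url column instead.
function dedupeById(records) {
  const seen = new Set();
  return records.filter((rec) => {
    if (seen.has(rec.id)) return false;
    seen.add(rec.id);
    return true;
  });
}
```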
- Install all required dependencies: `npm i`
- Run the script: `node index.js`
The code and data files in the repo are provided as-is. The author of the repo provides no guarantee that the script will work at a later date, and further assumes no responsibility for misuse of the data or scripts.
If you plan to use this code or data in your project, make sure to read the LICENSE document.
No pull requests will be accepted.