Data Scraping - Boattrader.com

This repo contains code and data files related to scraping data from the website BoatTrader. This was done as a part of the Final Project for Statistical Data Mining.

Process

The data is scraped through an API call. The API is not publicly advertised, but the site uses it to populate its search results page. The API URI and its query parameters can be found in the script, index.js.

Known limitations

The API returns at most 1,000 results per query, so a paging approach is used to retrieve more. The API also caps the total at 10,000 results (10 pages of 1,000 results each). The latter point is evidenced by the search results page, which shows at most 357 pages of 28 results each (357 × 28 = 9,996).

The script uses the paged API query to retrieve 10,000 results. A larger data set can be retrieved by switching the sort parameter between modified-asc and modified-desc, which return the 10,000 earliest-updated and 10,000 latest-updated records respectively.
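The paging plan described above can be sketched as follows. This is a minimal illustration, not the code in index.js: the parameter names page, pageSize and sort, and the 1-based page numbering, are assumptions made for the example. Only the limits (1,000 per page, 10,000 in total) and the sort values modified-asc and modified-desc come from this README.

```javascript
// Limits taken from the "Known limitations" section of this README.
const PAGE_SIZE = 1000;
const MAX_RESULTS = 10000;

// Build the list of query-parameter objects needed to walk every page
// for one sort order. The keys `page`, `pageSize` and `sort` are
// hypothetical names; check index.js for the real ones.
function buildPageQueries(sort) {
  const pageCount = Math.ceil(MAX_RESULTS / PAGE_SIZE);
  const queries = [];
  for (let page = 1; page <= pageCount; page++) {
    queries.push({ page, pageSize: PAGE_SIZE, sort });
  }
  return queries;
}

// One pass per sort order covers both ends of the data set.
const plan = [
  ...buildPageQueries("modified-asc"),
  ...buildPageQueries("modified-desc"),
];
console.log(`${plan.length} requests planned`);
```

Each object in the plan would then be turned into one API request; the request itself is left out here because the real URI and parameter names live in index.js.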

Returned Data and Parameters

The data generated by the script is saved in CSV format, one file per page, so each run of the script generates 10 CSV files. The following fields are returned for each record.

id - Unique ID for the record
url - Boat Trader URL for the boat
type - Type of the boat
boatClass - Class of the boat
make - Make of the Boat
model - Model of the Boat
year - Year of the Boat
condition - New/Used
length_ft - Nominal Length of the boat in ft
beam_ft - Beam of the Boat in ft
dryWeight_lb - Dry weight of the Boat in lb
created - Date the posting was created
hullMaterial - Material of the Boat's Hull
fuelType - Fuel type of the Boat
numEngines - Number of Engines listed for the Boat
maxEngineYear - Newest engine Year
minEngineYear - Oldest Engine Year
totalHP - Total Power of the Engines combined in HP
engineCategory - Engine Category (note: multiple is used when the engines are dissimilar)
price - Listing price for the boat
city
country
state
zip
seller id

Usage of the Data

The script was run to get the 10,000 newest and 10,000 oldest updated records from the website. This data is available in the newest and oldest folders respectively. Each folder has 10 page files with 1000 records each. These need to be merged before analysis.
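One way to merge the page files is sketched below. It assumes that every page file starts with the same header row and that file names sort into page order; neither is confirmed by this README. The function names mergeCsvTexts and mergeFolder, and the output file name in the usage note, are made up for the example.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Merge the text of several CSV files into one CSV string.
// Assumption: each file begins with an identical header row.
// Only the first line of each file is treated specially, so quoted
// fields containing newlines in the data rows are left untouched.
function mergeCsvTexts(texts) {
  let header = null;
  const bodies = [];
  for (const text of texts) {
    const normalized = text.replace(/\r\n/g, "\n").replace(/\n+$/, "");
    if (normalized === "") continue; // skip empty files
    const newlineAt = normalized.indexOf("\n");
    const firstLine = newlineAt === -1 ? normalized : normalized.slice(0, newlineAt);
    const rest = newlineAt === -1 ? "" : normalized.slice(newlineAt + 1);
    if (header === null) {
      header = firstLine;
    } else if (firstLine !== header) {
      throw new Error("Header mismatch between page files");
    }
    if (rest !== "") bodies.push(rest);
  }
  return header === null ? "" : [header, ...bodies].join("\n") + "\n";
}

// Read every .csv file in a folder (sorted by name) and merge them.
function mergeFolder(dir) {
  const files = fs
    .readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith(".csv"))
    .sort();
  return mergeCsvTexts(files.map((name) => fs.readFileSync(path.join(dir, name), "utf8")));
}
```

For example, fs.writeFileSync("newest-merged.csv", mergeFolder("newest")) would write one combined file for the newest folder. If the page files turn out to have no header row, the header check would need to be removed.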

Duplicate removal

Duplicates may exist after merging the data files. Use the id and/or url columns to filter them out.
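A minimal sketch of that filter is shown below. It assumes the merged records have already been parsed into plain objects with id and url properties (the parsing step is not shown), and it keeps the first occurrence of each record. The function name dedupeListings is made up for the example.

```javascript
// Remove duplicate listings, keeping the first occurrence of each.
// The id column is used as the key; url is the fallback when id is
// missing or empty.
function dedupeListings(rows) {
  const seen = new Set();
  const unique = [];
  for (const row of rows) {
    const hasId = row.id !== undefined && row.id !== null && row.id !== "";
    const key = hasId ? `id:${row.id}` : `url:${row.url}`;
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(row);
  }
  return unique;
}

// Tiny illustration with made-up records.
const sample = [
  { id: "1", url: "https://example.com/boat/1" },
  { id: "2", url: "https://example.com/boat/2" },
  { id: "1", url: "https://example.com/boat/1" },
];
console.log(dedupeListings(sample).length);
```

Keeping the first occurrence means the order in which the newest and oldest files are merged decides which copy of a duplicated record survives.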

Running the script

Install all required dependencies : npm i
Run script : node index.js

Disclaimer

The code and data files in the repo are provided as is. The author of the repo provides no guarantee the script will work at a later date. The author further assumes no responsibility for misuse of data or scripts.

If you plan to use this code or data in your project, make sure to read the LICENSE document.

Contribution

No pull requests will be accepted.

adhokshaja/SDM-JS-DataScraping
