Source real estate prices from public sources.
crawl_id
- The Common Crawl ID this record comes from, e.g. CC-MAIN-2017-22
warc_url
- The WARC path, e.g. s3://commoncrawl/crawl-data/CC-MAIN-2018-30/segments/1531676588961.14/warc/CC-MAIN-20180715183800-20180715203800-00037.warc.gz
warc_record_id
- The WARC record id, e.g. urn:uuid:a04218d3-a49e-4fde-8c2e-16379cfa28c6
url
- The URL, e.g. http://563463.hudsonriverproperties.com/blog/Best+Deals+Of+The+Year
domain
- The domain, e.g. 563463.hudsonriverproperties.com
index
- The index of this record (e.g. if the page has multiple listings), starting at 0
external_id
- Optional, the external identifier
country
- ISO 3166-1 alpha-2 code for the country, e.g. US
address
- Street address, e.g. 704 SAND CREEK CIR
city
- City, e.g. Weston
state
- State, e.g. FL
postal_code
- Optional, postal code, e.g. 33327
warc_date
- Date page was crawled
listing_date
- Date listing was created, if known
page_date
- Date page was authored, if known
sold_date
- Date listing was sold, if known
price
- Listing price, in $
sold_price
- Sale price, in $, if known
beds
- Number of bedrooms
baths
- Number of baths
half_baths
- Number of half baths
sqft
- Floor space in sqft
lot_size
- Lot size in acres
lat
- Latitude
lng
- Longitude
year_built
- Year built (e.g. 1990)
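As a rough illustration of the record layout above, here is a minimal Python sketch of one output row, plus two helpers showing how `crawl_id` and `domain` could be derived from `warc_url` and `url`. The class name `Listing`, the helper names, the field types, and the choice of which fields default to `None` are all assumptions made for this sketch; the field list above only marks some fields as "Optional" or "if known", and the project's own code may represent records differently.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class Listing:
    """Hypothetical container for one extracted listing (not the project's own type)."""
    # Provenance of the record
    crawl_id: str
    warc_url: str
    warc_record_id: str
    url: str
    domain: str
    index: int
    # Location
    country: str
    address: str
    city: str
    state: str
    # Dates are kept as strings here; the real format is not specified above.
    warc_date: str
    # Fields described as optional or "if known", plus numeric details
    # that a page may not state (assumed nullable in this sketch).
    external_id: Optional[str] = None
    postal_code: Optional[str] = None
    listing_date: Optional[str] = None
    page_date: Optional[str] = None
    sold_date: Optional[str] = None
    price: Optional[float] = None
    sold_price: Optional[float] = None
    beds: Optional[int] = None
    baths: Optional[int] = None
    half_baths: Optional[int] = None
    sqft: Optional[float] = None
    lot_size: Optional[float] = None
    lat: Optional[float] = None
    lng: Optional[float] = None
    year_built: Optional[int] = None


# Matches the crawl directory in a path such as
# .../crawl-data/CC-MAIN-2018-30/segments/...
_CRAWL_ID_RE = re.compile(r"crawl-data/(CC-MAIN-\d{4}-\d{2})/")


def crawl_id_from_warc_url(warc_url: str) -> Optional[str]:
    """Return the crawl ID embedded in a WARC path, or None if absent."""
    match = _CRAWL_ID_RE.search(warc_url)
    return match.group(1) if match else None


def domain_from_url(url: str) -> str:
    """Return the host part of a URL (lower-cased), or '' if there is none."""
    return urlsplit(url).hostname or ""
```

Keeping the required fields first and the nullable ones last lets a record be built from only the values a page actually provides.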
The system is designed to be run interactively in a browser while you debug extraction rules. When you're ready to crawl at scale, the rules you have built up are run in a server-side JavaScript environment.
- Run
./server
- Navigate to
javascript:(function(){var s=document.createElement('div');s.innerHTML='Loading...';s.style.color='black';s.style.padding='20px';s.style.position='fixed';s.style.zIndex='9999';s.style.fontSize='3.0em';s.style.border='2px solid black';s.style.right='40px';s.style.top='40px';s.setAttribute('class','selector_gadget_loading');s.style.background='white';document.body.appendChild(s);s=document.createElement('script');s.setAttribute('type','text/javascript');s.setAttribute('src','http://localhost:8000/js/bootstrap.js?' + (new Date()).getTime());document.body.appendChild(s);})();
- Check your console
- Get a WARC path file: http://commoncrawl.org/the-data/get-started/
- Download a random file
pipenv shell
./filter.sh warcfile.gz
rm -rf stripped && mkdir stripped && ./export.py tmp stripped
- The exported files are in
stripped/
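The first two steps above (getting a WARC path file and downloading a random file from it) could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the path file is assumed to be a gzipped text file with one relative WARC path per line, and `BASE_URL` assumes Common Crawl's public HTTPS endpoint, which may differ from how you access the data (for example via `s3://commoncrawl/`).

```python
import gzip
import random
from typing import Iterable, Optional

# Assumption: relative paths from the path file are served under this host.
BASE_URL = "https://data.commoncrawl.org/"


def pick_random_warc_url(paths: Iterable[str],
                         rng: Optional[random.Random] = None) -> str:
    """Pick one non-empty line from a WARC path listing and return its full URL."""
    rng = rng or random.Random()
    candidates = [line.strip() for line in paths if line.strip()]
    if not candidates:
        raise ValueError("no WARC paths found")
    return BASE_URL + rng.choice(candidates)


def pick_from_path_file(path_file: str) -> str:
    """Read a gzipped path listing (e.g. warc.paths.gz) and pick a random entry."""
    with gzip.open(path_file, "rt", encoding="utf-8") as handle:
        return pick_random_warc_url(handle)
```

The returned URL can then be fetched with any download tool and the resulting file passed to `./filter.sh`.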
cd js
yarn test
cd js
yarn test-e2e