This repository has been archived by the owner on Oct 4, 2024. It is now read-only.

using some HPC techniques to wrangle a large, irregular housing-market dataset on an ordinary laptop, in a finite amount of time


bcgov/bcstats_ohcs_craigslist


Lifecycle:Retired status: archive

bcstats ohcs craigslist

Extractor / parser written for web-scraped Craigslist data, as provided to BC Stats by Harmari Inc.

Overview

Wrangle a large, irregularly formatted housing-market dataset on an ordinary computer, in a finite amount of time, using some large-data / HPC-style techniques:

  • out-of-memory processing (the data never has to fit in RAM all at once)
  • parallelism

The challenge

The original data included an irregularly formatted 22 GB CSV file containing approximately 1,000,000 HTML files stuffed into CSV fields, where each HTML-file attribute spans approximately 500 lines. At the time, neither Python 3's "import csv" nor R's "library(vroom)" could read the data, so a custom out-of-memory slice/extract/parse approach was used. In addition, Python 3's BeautifulSoup HTML parsing was accelerated by running it in parallel across the full machine. The data contain sensitive information and will not be posted.
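
The two techniques can be illustrated with a minimal sketch. This is not the repository's code: it assumes each embedded document starts with a lower-case "<html" and ends with "</html>", it extracts only the page title as a stand-in for the real attributes, and it swaps BeautifulSoup for the standard library's html.parser so that the example has no third-party dependencies. The names iter_html_docs, parse_doc and parse_all are hypothetical.

```python
import multiprocessing as mp
from html.parser import HTMLParser

# Assumed record delimiters (hypothetical; the real data may differ).
START, END = b"<html", b"</html>"


def iter_html_docs(stream, block_size=1 << 20):
    """Yield each <html ...>...</html> span from a binary stream.

    The file is read in fixed-size blocks, so only the current,
    unfinished document is held in memory, never the whole file.
    """
    buf = b""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        buf += block
        while True:
            start = buf.find(START)
            if start < 0:
                # Keep a short tail in case START straddles two blocks.
                buf = buf[-(len(START) - 1):]
                break
            end = buf.find(END, start)
            if end < 0:
                # Incomplete document: drop the junk before it, read more.
                buf = buf[start:]
                break
            end += len(END)
            yield buf[start:end]
            buf = buf[end:]


class TitleParser(HTMLParser):
    """Collect the text inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_doc(raw):
    """Parse one extracted document (bytes) and return its title."""
    parser = TitleParser()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return parser.title.strip()


def parse_all(stream, workers=None):
    """Parse every document in the stream using a pool of worker processes.

    workers=None lets multiprocessing use one process per CPU core.
    Results come back in the same order as the documents in the file.
    """
    with mp.Pool(workers) as pool:
        return list(pool.imap(parse_doc, iter_html_docs(stream), chunksize=64))
```

To use it, open the large file in binary mode and pass the file object to parse_all, calling it from under an if __name__ == "__main__": guard so that worker processes can import the module safely. Extraction stays sequential because it is bound by disk reads, while the CPU-heavy HTML parsing is what gets spread across cores.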

How to produce separate outputs for Apartments (vs Sublets)

Place only the apartment-related (or only the sublet-related) input data files in the code directory to produce a merged output file that contains only apartment (or only sublet) data.

Process analytics

Sample visualization of the process monitor for one of the steps in this "big-data" application.

License

Copyright 2020 Province of British Columbia

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
