What is Kuwala?

Kuwala is a tool to build rich features for analytics based on clean data. It uses PySpark in combination with Parquet for data processing. The resulting clean datasets are then loaded into a Postgres data warehouse. We then use dbt for fast, flexible, and reproducible feature generation.
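
To make that flow concrete, here is a minimal sketch of the general pattern: read a cleaned Parquet dataset with PySpark and load it into a Postgres warehouse via JDBC. The Parquet path, JDBC URL, credentials, and table name are placeholder assumptions, not Kuwala's actual configuration.

from pyspark.sql import SparkSession

# Minimal sketch: read a cleaned Parquet dataset and load it into Postgres via JDBC.
# All paths, credentials, and table names are placeholders for illustration only.
spark = (
    SparkSession.builder
    .appName("load-clean-data")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.3.1")
    .getOrCreate()
)

df = spark.read.parquet("tmp/kuwala/osm_files/pois.parquet")  # hypothetical path

(
    df.write
    .format("jdbc")
    .option("driver", "org.postgresql.Driver")
    .option("url", "jdbc:postgresql://localhost:5432/kuwala")
    .option("dbtable", "osm_pois")
    .option("user", "postgres")
    .option("password", "password")
    .mode("overwrite")
    .save()
)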

How can I use Kuwala?

There are currently three ways to work with the clean data:

  1. Write SQL queries directly against the Postgres data warehouse
  2. Build transformations using dbt on top of Postgres
  3. Use Jupyter notebooks with convenience functions
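
As an illustration of the first option, the warehouse can be queried directly from Python. The connection string, table, and column names below are assumptions for illustration, not Kuwala's actual schema.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical example of querying the Postgres warehouse directly.
# Connection string, table, and column names are assumptions.
engine = create_engine("postgresql://postgres:password@localhost:5432/kuwala")

query = """
    SELECT h3_index, categories
    FROM osm_pois
    WHERE 'food' = ANY (categories)
    LIMIT 10;
"""

pois = pd.read_sql(query, engine)
print(pois.head())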

Which data pipelines are available right now?

OpenStreetMap (OSM) POIs

Points of interest are places that are physically accessible. This includes, for example, businesses, restaurants, schools, tourist attractions, and parks. A complete list of categories and further information can be found in our OSM documentation. We take the daily updated .pbf files containing the entire OSM data from Geofabrik, filter out objects whose tags are irrelevant for POIs, aggregate the remaining tags into high-level categories that allow for easy query building, and extract other metadata such as the address, contact details, or the building footprint. By extracting and cleaning the data from OpenStreetMap, Kuwala provides one of the largest POI databases, scalable to any place in the world.
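
The tag-filtering idea can be sketched with pyosmium; Kuwala's actual pipeline runs on PySpark, and the tag keys and file name below are assumptions for illustration only.

import osmium

# Illustration only: tag-based POI filtering on a Geofabrik .pbf extract.
# The set of relevant tag keys and the file name are assumptions.
POI_TAGS = {"amenity", "shop", "tourism", "leisure"}

class PoiHandler(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.pois = []

    def node(self, n):
        tags = {t.k: t.v for t in n.tags}
        if POI_TAGS & tags.keys():
            self.pois.append(
                {"id": n.id, "lat": n.location.lat, "lng": n.location.lon, "tags": tags}
            )

handler = PoiHandler()
handler.apply_file("lisbon-latest.osm.pbf")  # hypothetical Geofabrik extract
print(f"Extracted {len(handler.pois)} POI nodes")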

Google POIs

Google Popular Times

Kuwala offers a scraper that retrieves all available metadata for POIs as seen on Google Search. You can verify the POIs from OSM and enrich them further with an hourly, standardized score for visitation frequency throughout the week. This helps to understand the flow of people throughout a city. We do not use the Google Maps API, so there is no need for registration. Instead, the results are generated based on search strings, which can be built from OpenStreetMap (OSM) data. For more information, please go to the complete documentation.
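
As a rough, hypothetical illustration of how such a search string could be assembled from OSM metadata (the helper function and field names below are made up, not the google-poi pipeline's actual API):

# Hypothetical sketch: building a search string from OSM metadata.
# Function name and field names are assumptions for illustration.
def build_search_string(osm_poi: dict) -> str:
    parts = [
        osm_poi.get("name", ""),
        osm_poi.get("street", ""),
        osm_poi.get("house_number", ""),
        osm_poi.get("city", ""),
    ]
    return " ".join(p for p in parts if p)

print(build_search_string({"name": "Time Out Market", "city": "Lisboa"}))
# -> "Time Out Market Lisboa"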

High-Resolution Demographic Data

The high-resolution demographic data comes from Facebook's Data for Good initiative. It provides population estimates for the whole world at a granularity of roughly 30 x 30 meters for different demographic groups such as total, female, male, or youth. It is a statistical model based on official census data combined with Facebook's data and satellite imagery. This is the most granular and up-to-date population estimate data available.

H3 (Spatial Index)

H3 is a hierarchical indexing method for geospatial data that represents the world as unique hexagons of different sizes (bins). H3 makes it possible to quickly aggregate data at different levels and in different shapes. It is computationally efficient for databases and is useful for weighting data; one example is weighting less granular data, such as income data, with the high-resolution demographic data provided through Kuwala. H3 was developed by Uber. For the complete documentation, please go to the H3 repo.
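
Here is a minimal sketch of that weighting example using the h3 Python bindings (v3 API); the input dataframes, column names, and the district income value are invented for illustration.

import h3
import pandas as pd

# Sketch of the weighting example from the text, using the h3 v3 Python API.
# The input data and column names are assumptions.
population = pd.DataFrame(
    {"lat": [38.7223, 38.7370], "lng": [-9.1393, -9.1427], "population": [120.0, 95.0]}
)

# Index each 30 m population estimate to an H3 hexagon (resolution 9, ~0.1 km² per cell).
population["h3_index"] = [
    h3.geo_to_h3(lat, lng, 9) for lat, lng in zip(population["lat"], population["lng"])
]
population_per_hex = population.groupby("h3_index")["population"].sum()

# Weight a coarse, district-level income value by the population share of each hexagon.
district_income = 21_000  # assumed average income for the whole district
weights = population_per_hex / population_per_hex.sum()
income_per_hex = weights * district_income
print(income_per_hex)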


Quick Start & Demo

Prerequisites

An installed version of Python 3, Docker, and docker-compose v2 (go here for instructions), or use the Binder badge in the next section.

Note: We recommend giving Docker at least 8 GB of RAM (on Docker Desktop, you can set this under Settings -> Resources).

Demo correlating Uber traversals with Google popularities

Binder badge

Screenshot: Jupyter notebook for popularity correlation

We provide a notebook with which you can correlate any geo-referenced value with the Google popularity score. For the demo, we preprocessed popularity data and a test dataset of Uber rides in Lisbon, Portugal.
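
As a minimal sketch of what the notebook computes, assuming two datasets keyed by a shared H3 index (the file and column names below are assumptions, not the notebook's actual code):

import pandas as pd

# Minimal sketch: join two geo-referenced metrics on a shared H3 index and
# compute their Pearson correlation. File and column names are assumptions.
popularity = pd.read_parquet("google_popularity.parquet")   # h3_index, popularity
uber_rides = pd.read_parquet("uber_traversals.parquet")     # h3_index, traversals

merged = popularity.merge(uber_rides, on="h3_index", how="inner")
correlation = merged["popularity"].corr(merged["traversals"])  # Pearson by default
print(f"Correlation between popularity and Uber traversals: {correlation:.2f}")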

Run the demo

You can either use the deployed example on Binder via the badge above or run everything locally. The Binder example simply uses Pandas dataframes and does not connect to a data warehouse.
To run the demo locally, launch Docker in the background and, from inside the root directory, run:

Linux/Mac:

cd kuwala/scripts && sh initialize_core_components.sh && sh run_cli.sh

and for Windows (please use PowerShell or any Docker-integrated terminal):

cd kuwala/scripts && sh initialize_windows.sh && cd windows && sh initialize_core_components.sh && sh run_cli.sh

Run the data pipelines yourself

To run the pipelines yourself, please follow the instructions for the CLI.


Using Pipelines Independently

Apart from using the CLI, you can also run the pipelines individually without Docker. For more detailed instructions, please take a look at ./kuwala/README.md.

We currently have the following pipelines published:

  • osm-poi: Global collection of points of interest (POIs)
  • population-density: Detailed population and demographic data
  • google-poi: Scraping API to retrieve POI information from Google (incl. popularity score)

Experimental:


Further Documentation

For each major component we have dedicated READMEs. This is an overview of all components:


How You Can Contribute

Every new issue, question, or comment is a contribution and very welcome! This project thrives on your feedback and involvement!

Be part of our community

The best first step to get involved is to join the Kuwala Community on Slack. There we discuss everything related to data integration and new pipelines. Every pipeline will be open-source. Which sources we integrate is decided entirely by you, our community. You can reach out to us on Slack or by email to request a new pipeline or to contribute yourself.

Contribute to the project

If you want to contribute yourself, you can use the programming language and database technology of your choice. Our only requirements are that the pipeline can be run locally and that it uses Uber's H3 functionality to handle geographical transformations. We will then take responsibility for maintaining your pipeline.

Note: To submit a pull request, please fork the project and then submit a PR to the base repo.

Liberating the Work With Data

By working together as a community of data enthusiasts, we can create a network of pipelines that integrate seamlessly. Today, integrating third-party data into applications causes headaches. But together, we will make it straightforward to combine, merge, and enrich data sources for powerful models.

What's Coming Next For the Pipelines?

Based on the use cases we have discussed with the community and with potential users, we have identified a variety of data sources to connect next:

Semi-Structured Data

Data that is already structured but not yet adapted to the Kuwala framework:

Unstructured Data

Unstructured data that is turned into structured data:

  • Building Footprints from satellite images

Wishlist

Data we would like to integrate, but a scalable approach is still missing:

  • Small scale events (e.g., a festival, movie premiere, nightclub events)
