Regression analysis of Berlin property listing prices as a function of location data (cafés, parks, hospitals, transport, etc.) and social ratings of these locations.
analysis
cleaning
listings
locations
papers_sources
repository_media
visualisation
.gitignore
README.md
berlin_polygon.json
mongodb_methods.py

Berlin Property / Social Analysis

Heatmap of food and drink Berlin

This repository hosts the code used for a correlation analysis of Berlin property listing data and the social signals (popularity, desirability, etc.) embedded in social media data. The project was originally completed for the Digital Economy and Decision Analytics course at the statistics chair, Humboldt-Universität zu Berlin. It will likely grow from its current state into a master's thesis and an online, interactive tool and / or product.

While the analysis component is the end goal of the project, constructing the tools necessary for building the data pipeline is what I am proud to showcase here.

Technology Used:

  • Python 3
  • MongoDB
  • JavaScript
  • HTML5

Motivation

Traditional real estate price analysis looks at area effects and the obvious structural details of a property for sale. What is relatively new is the availability of place data (Google, Facebook, etc.), where people's ratings and / or check-ins provide new numeric information about the general makeup of a particular neighbourhood. That makeup is obvious to those familiar with a place, but it is not well represented from a data perspective.

With such data available through various APIs and other data gathering methods, the relationship between these sociospatial data and real estate prices can be examined more effectively.

Data

Data is gathered from two APIs and a third web source. One data source provides live property listing data, which, in the context of this project, was gathered over 2-3 months with a result of ~5,700 unique listings. Social data comes in the form of positional location data (lat, lng) and ratings data for these locations. In total, the count of these sociospatial points sums to ~30,000 locations of 7 types: food / drink (cafés, bars, restaurants), transit, medical, parks, education, groceries, shopping.

Data management is done with a local MongoDB instance.
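As a rough illustration of how a listing might be stored, the sketch below builds one document per listing and upserts it by id, so that a listing scraped more than once over the collection period is kept only once. The field names, the `listing_to_document` and `upsert_listings` helpers, and the database and collection names in the comments are assumptions made for this example; they are not taken from `mongodb_methods.py`.

```python
from typing import Any, Dict, Iterable


def listing_to_document(listing_id: str, price: float, lat: float, lng: float) -> Dict[str, Any]:
    """Build one document for a listing (hypothetical schema)."""
    return {
        "_id": listing_id,
        "price": price,
        # GeoJSON points are ordered [longitude, latitude].
        "location": {"type": "Point", "coordinates": [lng, lat]},
    }


def upsert_listings(collection: Any, documents: Iterable[Dict[str, Any]]) -> int:
    """Insert or replace each document keyed on _id; return how many were processed."""
    processed = 0
    for doc in documents:
        # replace_one with upsert=True inserts the document if the _id is new,
        # otherwise it overwrites the stored copy.
        collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        processed += 1
    return processed


# Usage against a local instance (requires the pymongo package and a running server):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# upsert_listings(client["berlin"]["listings"], [listing_to_document("a1", 250000.0, 52.52, 13.405)])
```

Keying on `_id` is one simple way to keep listings unique; the actual project may deduplicate differently.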

Code

The tools for data gathering and management are written in Python, using PyMongo to interface with MongoDB. Ultimately there are two data sources, each of which requires a custom-built process to match the project specifications. Two auxiliary tools perform travel distance computations and spatial clustering, which are used to restructure certain data. From the lat, lng coordinates and the rating data, new features are calculated for each property listing. The listing prices are then regressed on these features in order to study the relationships.
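To make the feature step more concrete, here is a minimal sketch of one possible feature pair: the number of rated locations within a radius of a listing, and their mean rating. It uses straight-line (haversine) distance rather than the travel distances mentioned above, and the function names, the 1 km radius, and the single-feature least-squares fit are assumptions for illustration only, not the features or model used in the project.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance between two (lat, lng) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def listing_features(
    listing: Tuple[float, float],
    places: List[Tuple[float, float, float]],
    radius_km: float = 1.0,  # assumed radius, chosen only for this example
) -> Dict[str, Optional[float]]:
    """Count places (lat, lng, rating) within radius_km of the listing and average their ratings."""
    ratings = [
        rating
        for lat, lng, rating in places
        if haversine_km(listing[0], listing[1], lat, lng) <= radius_km
    ]
    mean_rating = sum(ratings) / len(ratings) if ratings else None
    return {"count": len(ratings), "mean_rating": mean_rating}


def fit_simple_ols(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x for a single feature."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

In practice one such feature would be computed per location type (food / drink, transit, parks, and so on), and a multiple regression would replace the single-feature fit shown here.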

Project Diagram

Visualisations

To visualise the ~30,000 locations of varying types, the gmap Python package is used to generate HTML files with embedded JS, which can then be opened in the browser for an interactive look at the plotted positional data.
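The general idea of generating a standalone HTML file with the data embedded in JavaScript can be sketched without any mapping package. The example below is a stand-in, not the output of the package used in this repository: it embeds the points as JSON and draws them as dots on a plain canvas instead of a Google map. The `render_points_html` function and the output file name are hypothetical.

```python
import html
import json
from typing import Dict, List

_TEMPLATE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>__TITLE__</title></head>
<body>
<canvas id="map" width="800" height="600"></canvas>
<script>
const points = __POINTS__;
const bounds = __BOUNDS__;
const canvas = document.getElementById("map");
const ctx = canvas.getContext("2d");
for (const p of points) {
  // Scale lng to x and lat to y (north at the top).
  const x = (p.lng - bounds.minLng) / ((bounds.maxLng - bounds.minLng) || 1) * canvas.width;
  const y = (1 - (p.lat - bounds.minLat) / ((bounds.maxLat - bounds.minLat) || 1)) * canvas.height;
  ctx.beginPath();
  ctx.arc(x, y, 3, 0, 2 * Math.PI);
  ctx.fill();
}
</script>
</body>
</html>
"""


def render_points_html(points: List[Dict[str, float]], title: str) -> str:
    """Return an HTML page that plots points ({"lat": ..., "lng": ...}) on a canvas."""
    if not points:
        raise ValueError("at least one point is required")
    bounds = {
        "minLat": min(p["lat"] for p in points),
        "maxLat": max(p["lat"] for p in points),
        "minLng": min(p["lng"] for p in points),
        "maxLng": max(p["lng"] for p in points),
    }
    return (
        _TEMPLATE.replace("__TITLE__", html.escape(title))
        .replace("__POINTS__", json.dumps(points))
        .replace("__BOUNDS__", json.dumps(bounds))
    )


# Writing the page to disk (hypothetical file name):
# with open("transport_all.html", "w", encoding="utf-8") as fh:
#     fh.write(render_points_html([{"lat": 52.52, "lng": 13.405}], "Transport All"))
```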

Examples from the 'visualisation' directory:

Transport

Transport All

Education

Education All

Cafés, Bars, Restaurants

Education All
