Berlin Property / Social Analysis
This repository hosts the code used for a correlation analysis of Berlin property listing data and social signals (popularity, desirability, etc.) embedded in social media data. The project was originally completed for the Digital Economy and Decision Analytics course at the Chair of Statistics, Humboldt-Universität zu Berlin. It will likely grow from its current state into a master's thesis and an online, interactive tool and / or product.
While the analysis component is the end goal of the project, constructing the tools necessary for building the data pipeline is what I am proud to showcase here.
Traditional real estate price analysis looks at area effects and the obvious structural details of a sale property. What is relatively new is the availability of place data (Google, Facebook, etc.), where people's ratings and / or check-ins provide new numeric information about the general makeup of a particular neighbourhood. That makeup is obvious to those familiar with a place, but it has not been well represented from a data perspective.
With such data available through various APIs and other data gathering methods, the relationship between these sociospatial data and real estate prices can be examined more effectively.
Data is gathered from two APIs and a third web source. One data source provides live property listing data, which, in the context of this project, was gathered over 2-3 months and yielded ~5,700 unique listings. Social data comes in the form of location coordinates (lat, lng) and ratings for these locations. In total, there are ~30,000 of these sociospatial points, covering 7 types: food / drink (cafés, bars, restaurants), transit, medical, parks, education, groceries, shopping.
Data management is done with a local MongoDB instance.
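As an illustration of how repeated scrapes could be kept down to unique listings, here is a minimal sketch of an upsert against a local MongoDB collection. It is not the repository's actual code: the database name `berlin_property`, the collection name `listings`, the `listing_id` field and both helper functions are assumptions made for this example. Only `MongoClient` and `Collection.update_one(..., upsert=True)` are real PyMongo calls.

```python
def get_local_collection(db_name="berlin_property", coll_name="listings"):
    """Return a collection on a local MongoDB instance.

    The database and collection names are hypothetical placeholders.
    """
    from pymongo import MongoClient  # imported lazily; requires pymongo

    client = MongoClient("mongodb://localhost:27017")
    return client[db_name][coll_name]


def upsert_listing(collection, listing):
    """Insert a listing, or update the stored copy if its ID already exists.

    `listing` is a dict that must contain a `listing_id` key (an assumed
    field name). Matching on that key means a listing scraped on several
    days is stored only once.
    """
    return collection.update_one(
        {"listing_id": listing["listing_id"]},  # match on the unique ID
        {"$set": listing},                      # overwrite with the latest values
        upsert=True,                            # insert when no match exists
    )
```

With a running local instance, `upsert_listing(get_local_collection(), {"listing_id": "abc", "price": 350000})` could be called once per scraped listing. Calling it again with the same ID updates the existing document rather than adding a duplicate.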
The tools for data gathering and management are written in Python, using PyMongo to interface with MongoDB. Ultimately there are two data sources, which require custom-built processes to match project specifications. Two auxiliary tools perform travel distance computations and spatial clustering for restructuring certain data. Using lat, lng coordinates and rating data, new features are calculated for each property listing. These features are then regressed against the listing prices in order to study the relationships.
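The feature step described above can be sketched as follows. This is a simplified stand-in, not the project's implementation: it uses straight-line (haversine) distance instead of travel distance, and the 0.5 km radius, the dictionary keys (`lat`, `lng`, `rating`) and the function names are all assumptions chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def place_features(listing, places, radius_km=0.5):
    """Count the places within `radius_km` of a listing and average their ratings.

    Places without a rating still count towards `n_places` but are
    left out of `mean_rating`.
    """
    nearby = [
        p for p in places
        if haversine_km(listing["lat"], listing["lng"], p["lat"], p["lng"]) <= radius_km
    ]
    rated = [p["rating"] for p in nearby if p.get("rating") is not None]
    return {
        "n_places": len(nearby),
        "mean_rating": sum(rated) / len(rated) if rated else None,
    }


# Made-up sample data: one listing, two close places, one distant place.
listing = {"lat": 52.5200, "lng": 13.4050}
places = [
    {"type": "food_drink", "lat": 52.5210, "lng": 13.4060, "rating": 4.5},
    {"type": "food_drink", "lat": 52.5190, "lng": 13.4040, "rating": 3.5},
    {"type": "parks", "lat": 52.4500, "lng": 13.3000, "rating": 5.0},
]
features = place_features(listing, places)
```

Running the same function once per place type would give one count and one mean rating per type, which could then serve as regressors alongside the structural attributes of a listing.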
To visualise the ~30,000 locations of varying types, the gmap Python package is used to generate HTML files with embedded JS, which can then be opened in the browser for an interactive look at the plotted positional data.
Examples from the 'visualisation' directory: