<a href="https://colab.research.google.com/github/cpwan/citadel-summer-datathon-2021/blob/eda/spatial_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dependencies

In [1]:
!pip install pysal -q
!pip install plotly==5.1.0 -q

# Download data

In [2]:
!gdown --id 1ecgxSTxCmhCvVTSFgUBRGVjyGf3rLREW -O dataset.zip

Downloading...
From: https://drive.google.com/uc?id=1ecgxSTxCmhCvVTSFgUBRGVjyGf3rLREW
To: /content/dataset.zip
69.1MB [00:02, 30.1MB/s]


In [3]:
!unzip dataset.zip

Archive:  dataset.zip
  inflating: Datasets/econ_state.csv  
  inflating: Datasets/demographics.csv  
  inflating: Datasets/venues.csv.gz  
  inflating: Datasets/real_estate.csv.gz  
  inflating: Datasets/listings.csv   
  inflating: Datasets/calendar.csv.gz  


In [4]:
!gzip -dk Datasets/*.gz

In [5]:
linksToGeoJson={
    'los-angeles':'http://data.insideairbnb.com/united-states/ca/los-angeles/2021-04-07/visualisations/neighbourhoods.geojson',
    "asheville":"http://data.insideairbnb.com/united-states/nc/asheville/2021-04-19/visualisations/neighbourhoods.geojson",
    "austin":"http://data.insideairbnb.com/united-states/tx/austin/2021-04-16/visualisations/neighbourhoods.geojson",
    "nashville":"http://data.insideairbnb.com/united-states/tn/nashville/2021-02-19/visualisations/neighbourhoods.geojson",
    "new-orleans":"http://data.insideairbnb.com/united-states/la/new-orleans/2021-04-10/visualisations/neighbourhoods.geojson"
}

In [6]:
!mkdir -p geoJson
import urllib.request
for filename, url in linksToGeoJson.items():
  print('Downloading', filename)
  urllib.request.urlretrieve(url, f'geoJson/{filename}.geojson')

Downloading los-angeles
Downloading asheville
Downloading austin
Downloading nashville
Downloading new-orleans
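Re-running the download cell fetches every file again. A minimal sketch of a skip-if-present variant follows; the helper name `download_if_missing` and its `fetch` parameter are hypothetical and not part of the notebook, and the demonstration uses a stand-in fetcher so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path

def download_if_missing(url, dest, fetch=urllib.request.urlretrieve):
    """Download url to dest unless dest already exists; return True if fetched."""
    dest = Path(dest)
    if dest.exists():
        return False
    # create the target directory if needed
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(dest))
    return True

# Offline demonstration with a stand-in fetcher (hypothetical; no network access)
def fake_fetch(url, filename):
    Path(filename).write_text('{"type": "FeatureCollection", "features": []}')

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'geoJson' / 'demo.geojson'
    first = download_if_missing('http://example.invalid/demo.geojson', target, fetch=fake_fetch)
    second = download_if_missing('http://example.invalid/demo.geojson', target, fetch=fake_fetch)

print(first, second)
```

In the notebook itself, the loop over `linksToGeoJson.items()` could call such a helper with the default `fetch`.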


# Load data

In [18]:
# dsets: names of the CSV files in Datasets/ (without the extension)
# regions: the geo-regions to look at, specified by the filenames of the geojsons
dsets=['listings','calendar','demographics','econ_state','real_estate','venues']
regions=[*linksToGeoJson.keys()]
regions

['los-angeles', 'asheville', 'austin', 'nashville', 'new-orleans']

In [8]:
import esda
import pandas as pd
import geopandas as gpd
from geopandas import GeoDataFrame
import libpysal as lps
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point

from re import sub
from decimal import Decimal

import json
from pathlib import Path

import plotly.express as px

%matplotlib inline



## Helper functions

In [27]:
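# helper function to clean the dataset: convert the `price` strings to floats
def cleanPrice(df):
  def string2float(s):
    # strip everything except digits and the decimal point (currency sign, thousands separators)
    return Decimal(sub(r'[^\d.]', '', s))
  df['price']=df['price'].apply(string2float).astype(float)
  return df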
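To see what the regular expression in `cleanPrice` does, the sketch below applies the same substitution to a few strings using only the standard library. The sample values are made up for illustration and are not taken from the dataset.

```python
from decimal import Decimal
from re import sub

def string2float(s):
    # Same pattern as in cleanPrice: keep only digits and the decimal point
    return float(Decimal(sub(r'[^\d.]', '', s)))

# Hypothetical price strings (not from the dataset)
samples = ['$1,250.00', '$85.00', '99']
cleaned = [string2float(s) for s in samples]
print(cleaned)  # [1250.0, 85.0, 99.0]
```

Note that the pattern also removes a leading minus sign, so it is only suitable for non-negative amounts.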

In [28]:
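# helper function to convert dataframe to GeoDataFrame
def constructGDF(df):
  # build one point per row from the longitude/latitude columns
  geometry = [Point(xy) for xy in zip(df.longitude, df.latitude)]
  # pass the CRS as an '<authority>:<code>' string; the {'init': ...} form is deprecated
  # and produces the CRS-mismatch warning in the spatial join
  crs = 'EPSG:4326'
  bl_gdf = GeoDataFrame(df, crs=crs, geometry=geometry)
  return bl_gdf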

In [29]:
# helper function to aggregate a column per neighbourhood
# NOTE: the result column is named `median_<key>`, but the aggregate computed is the mean
def getMedian(df,gdf,key):
  median=df[key].groupby([df['neighbourhood']]).mean()
  gdf = gdf.join(median, on='neighbourhood')
  gdf.rename(columns={key: f'median_{key}'}, inplace=True)
  return gdf
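Because `getMedian` calls `.mean()` while labelling the result `median_<key>`, it helps to see how far the two statistics can diverge on skewed data. The sketch below uses the standard library only; the neighbourhood names and prices are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (neighbourhood, price) pairs; 'A' contains one outlier
rows = [('A', 100.0), ('A', 120.0), ('A', 2000.0), ('B', 80.0), ('B', 90.0)]

# Group prices by neighbourhood
by_nbhd = defaultdict(list)
for nbhd, price in rows:
    by_nbhd[nbhd].append(price)

# Compute both aggregates per group
summary = {n: {'mean': mean(p), 'median': median(p)} for n, p in by_nbhd.items()}
for n, s in summary.items():
    print(n, s)
```

For neighbourhood `A` the mean is 740.0 while the median is 120.0, so a single expensive listing can dominate a mean-based choropleth. In the notebook, swapping `.mean()` for `.median()` in `getMedian` would give a true median, if that is what is wanted.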

In [30]:
# helper function to plot the choropleth map of one aggregated column and save it as HTML
def plot_plotly(gdf,counties,key,dset,region):
  label=key[len('median_'):] # strip the prefix: `median_XXX` -> `XXX`
  fig = px.choropleth(gdf,geojson=counties,locations='neighbourhood',
                      featureidkey='properties.neighbourhood',
                      color=key,
                      color_continuous_scale="Viridis",
                      scope="usa",
                      labels={key:label}
                      )
  fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
  fig.update_geos(fitbounds="locations")
  # fig.show()
  fname=f"./plotly/{dset}/{region}/{label}.html"
  fig.write_html(fname)
  print('Figure saved to ', fname)
  return fig
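`px.choropleth` pairs each value in `locations` with a GeoJSON feature through `featureidkey`, so a neighbourhood whose name has no matching feature is not drawn. A minimal sketch of a pre-flight check follows; the helper `unmatched_names` and the tiny feature collection are hypothetical and only illustrate the matching rule.

```python
def unmatched_names(names, geojson, prop='neighbourhood'):
    """Return the names that have no feature with properties[prop] equal to them."""
    known = {f['properties'].get(prop) for f in geojson['features']}
    return sorted(set(names) - known)

# Tiny hypothetical FeatureCollection (geometry omitted for brevity)
counties = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'neighbourhood': 'Downtown'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'neighbourhood': 'East Side'}, 'geometry': None},
    ],
}

missing = unmatched_names(['Downtown', 'West End'], counties)
print(missing)  # ['West End']
```

In the notebook, this could be called as `unmatched_names(gdf['neighbourhood'], counties)` before plotting.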

## Driver code for aggregating values in neighbourhoods and plotting in plotly

In [31]:
def aggregateByNbhd(dset,region):
  df=pd.read_csv(f'./Datasets/{dset}.csv')
  try:
    df=cleanPrice(df)
  except Exception as e:
    # e.g. the dataset has no `price` column
    print("Cleaning skipped:", e)
  
  gdf = gpd.read_file(f'geoJson/{region}.geojson')

  # attach every point of the dataset to the neighbourhood polygon it falls in
  region_gdf=constructGDF(df)
  # note: newer geopandas releases name this argument `predicate` instead of `op`
  sj_gdf = gpd.sjoin(gdf, region_gdf, how='inner', op='intersects', lsuffix='left', rsuffix='right')

  # aggregate each column per neighbourhood; columns that cannot be aggregated are skipped
  for k in sj_gdf.keys():
    try:
      gdf=getMedian(sj_gdf,gdf,k)
    except Exception:
      print("Cannot get median of ", k)
  return gdf

def plotAndSave(gdf,dset,region):
  Path(f"./plotly/{dset}/{region}").mkdir(parents=True, exist_ok=True)
  with open(f'geoJson/{region}.geojson') as f:
    counties = json.load(f)
  for key in gdf.keys():
    if 'median' in key:
      plot_plotly(gdf,counties,key,dset,region)
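The spatial join in `aggregateByNbhd` pairs each point with the neighbourhood polygon it falls in. The sketch below illustrates that idea with a plain ray-casting test in pure Python. It is not how geopandas implements `sjoin`, it ignores points lying exactly on a boundary, and the polygons and points are invented for illustration.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_neighbourhood(points, polygons):
    """Map each point to the name of the first polygon containing it, or None."""
    result = []
    for x, y in points:
        name = next((n for n, ring in polygons.items()
                     if point_in_polygon(x, y, ring)), None)
        result.append(name)
    return result

# Two hypothetical unit-square neighbourhoods side by side
polygons = {
    'west': [(0, 0), (1, 0), (1, 1), (0, 1)],
    'east': [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [(0.5, 0.5), (1.5, 0.25), (3.0, 3.0)]

assigned = assign_neighbourhood(points, polygons)
print(assigned)  # ['west', 'east', None]
```

With `how='inner'`, points that fall in no polygon (the `None` case here) are dropped from the join result.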

## Run the driver code to obtain plots

In [33]:
# # Demo
# dset=dsets[0]
# region=regions[0]
# print(f'Dataset: {dset}, Region: {region}')
# gdf=aggregateByNbhd(dset,region)
# plotAndSave(gdf,dset,region)
# print("Done")

Dataset: listings, Region: los-angeles



'+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6


CRS mismatch between the CRS of left geometries and the CRS of right geometries.
Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.

Left CRS: EPSG:4326
Right CRS: +init=epsg:4326 +type=crs




Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  amenities
Cannot get median of  bed_type
Cannot get median of  cancellation_policy
Cannot get median of  city
Cannot get median of  instant_bookable
Cannot get median of  metropolitan
Cannot get median of  name
Cannot get median of  property_type
Cannot get median of  room_type
Cannot get median of  state
Cannot get median of  weekly_price
Cannot get median of  zipcode
Figure saved to  ./plotly/listings/los-angeles/index_right.html
Figure saved to  ./plotly/listings/los-angeles/accommodates.html
Figure saved to  ./plotly/listings/los-angeles/availability_30.html
Figure saved to  ./plotly/listings/los-angeles/bathrooms.html
Figure saved to  ./plotly/listings/los-angeles/bedrooms.html
Figure saved to  ./plotly/listings/los-angeles/beds.html
Figure saved to  ./plotly/listings/los-angeles/has_availability.html
Figure saved to  ./plotly/listings/los-angeles/host

In [37]:
dsets

['listings', 'calendar', 'demographics', 'econ_state', 'real_estate', 'venues']

In [39]:
# only 'listings' and 'venues' have geo-coordinates;
# for the other datasets, consider joining them with listings/venues first
for dset in ['listings','venues']:
  for region in regions:
    print(f'Dataset: {dset}, Region: {region}')
    try:
      gdf=aggregateByNbhd(dset,region)
      plotAndSave(gdf,dset,region)
    except Exception as e:
      print("Aggregation/plotting failed: ",e)
    print("Done")

Dataset: listings, Region: los-angeles



Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  amenities
Cannot get median of  bed_type
Cannot get median of  cancellation_policy
Cannot get median of  city
Cannot get median of  instant_bookable
Cannot get median of  metropolitan
Cannot get median of  name
Cannot get median of  property_type
Cannot get median of  room_type
Cannot get median of  state
Cannot get median of  weekly_price
Cannot get median of  zipcode
Figure saved to  ./plotly/listings/los-angeles/index_right.html
Figure saved to  ./plotly/listings/los-angeles/accommodates.html
Figure saved to  ./plotly/listings/los-angeles/availability_30.html
Figure saved to  ./plotly/listings/los-angeles/bathrooms.html
Figure saved to  ./plotly/listings/los-angeles/bedrooms.html
Figure saved to  ./plotly/listings/los-angeles/beds.html
Figure saved to  ./plotly/listings/los-angeles/has_availability.html
Figure saved to  ./plotly/listings/los-angeles/host


Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  amenities
Cannot get median of  bed_type
Cannot get median of  cancellation_policy
Cannot get median of  city
Cannot get median of  instant_bookable
Cannot get median of  metropolitan
Cannot get median of  name
Cannot get median of  property_type
Cannot get median of  room_type
Cannot get median of  state
Cannot get median of  weekly_price
Cannot get median of  zipcode
Figure saved to  ./plotly/listings/asheville/index_right.html
Figure saved to  ./plotly/listings/asheville/accommodates.html
Figure saved to  ./plotly/listings/asheville/availability_30.html
Figure saved to  ./plotly/listings/asheville/bathrooms.html
Figure saved to  ./plotly/listings/asheville/bedrooms.html
Figure saved to  ./plotly/listings/asheville/beds.html
Figure saved to  ./plotly/listings/asheville/has_availability.html
Figure saved to  ./plotly/listings/asheville/host_id.html
Figure 


Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  amenities
Cannot get median of  bed_type
Cannot get median of  cancellation_policy
Cannot get median of  city
Cannot get median of  instant_bookable
Cannot get median of  metropolitan
Cannot get median of  name
Cannot get median of  property_type
Cannot get median of  room_type
Cannot get median of  state
Cannot get median of  weekly_price
Cannot get median of  zipcode
Figure saved to  ./plotly/listings/austin/index_right.html
Figure saved to  ./plotly/listings/austin/accommodates.html
Figure saved to  ./plotly/listings/austin/availability_30.html
Figure saved to  ./plotly/listings/austin/bathrooms.html
Figure saved to  ./plotly/listings/austin/bedrooms.html
Figure saved to  ./plotly/listings/austin/beds.html
Figure saved to  ./plotly/listings/austin/has_availability.html
Figure saved to  ./plotly/listings/austin/host_id.html
Figure saved to  ./plotly/listi


Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  amenities
Cannot get median of  bed_type
Cannot get median of  cancellation_policy
Cannot get median of  city
Cannot get median of  instant_bookable
Cannot get median of  metropolitan
Cannot get median of  name
Cannot get median of  property_type
Cannot get median of  room_type
Cannot get median of  state
Cannot get median of  weekly_price
Cannot get median of  zipcode
Figure saved to  ./plotly/listings/nashville/index_right.html
Figure saved to  ./plotly/listings/nashville/accommodates.html
Figure saved to  ./plotly/listings/nashville/availability_30.html
Figure saved to  ./plotly/listings/nashville/bathrooms.html
Figure saved to  ./plotly/listings/nashville/bedrooms.html
Figure saved to  ./plotly/listings/nashville/beds.html
Figure saved to  ./plotly/listings/nashville/has_availability.html
Figure saved to  ./plotly/listings/nashville/host_id.html
Figure 


Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  amenities
Cannot get median of  bed_type
Cannot get median of  cancellation_policy
Cannot get median of  city
Cannot get median of  instant_bookable
Cannot get median of  metropolitan
Cannot get median of  name
Cannot get median of  property_type
Cannot get median of  room_type
Cannot get median of  state
Cannot get median of  weekly_price
Cannot get median of  zipcode
Figure saved to  ./plotly/listings/new-orleans/index_right.html
Figure saved to  ./plotly/listings/new-orleans/accommodates.html
Figure saved to  ./plotly/listings/new-orleans/availability_30.html
Figure saved to  ./plotly/listings/new-orleans/bathrooms.html
Figure saved to  ./plotly/listings/new-orleans/bedrooms.html
Figure saved to  ./plotly/listings/new-orleans/beds.html
Figure saved to  ./plotly/listings/new-orleans/has_availability.html
Figure saved to  ./plotly/listings/new-orleans/host


Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  city
Cannot get median of  id
Cannot get median of  name
Cannot get median of  types
Figure saved to  ./plotly/venues/los-angeles/index_right.html
Figure saved to  ./plotly/venues/los-angeles/latitude.html
Figure saved to  ./plotly/venues/los-angeles/longitude.html
Figure saved to  ./plotly/venues/los-angeles/rating.html
Done
Dataset: venues, Region: asheville



Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  city
Cannot get median of  id
Cannot get median of  name
Cannot get median of  types
Figure saved to  ./plotly/venues/asheville/index_right.html
Figure saved to  ./plotly/venues/asheville/latitude.html
Figure saved to  ./plotly/venues/asheville/longitude.html
Figure saved to  ./plotly/venues/asheville/rating.html
Done
Dataset: venues, Region: austin



Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  city
Cannot get median of  id
Cannot get median of  name
Cannot get median of  types
Figure saved to  ./plotly/venues/austin/index_right.html
Figure saved to  ./plotly/venues/austin/latitude.html
Figure saved to  ./plotly/venues/austin/longitude.html
Figure saved to  ./plotly/venues/austin/rating.html
Done
Dataset: venues, Region: nashville



Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  city
Cannot get median of  id
Cannot get median of  name
Cannot get median of  types
Figure saved to  ./plotly/venues/nashville/index_right.html
Figure saved to  ./plotly/venues/nashville/latitude.html
Figure saved to  ./plotly/venues/nashville/longitude.html
Figure saved to  ./plotly/venues/nashville/rating.html
Done
Dataset: venues, Region: new-orleans



Cannot get median of  neighbourhood
Cannot get median of  neighbourhood_group
Cannot get median of  geometry
Cannot get median of  city
Cannot get median of  id
Cannot get median of  name
Cannot get median of  types
Figure saved to  ./plotly/venues/new-orleans/index_right.html
Figure saved to  ./plotly/venues/new-orleans/latitude.html
Figure saved to  ./plotly/venues/new-orleans/longitude.html
Figure saved to  ./plotly/venues/new-orleans/rating.html
Done
