# Iggy Enrich Demo

In this demo we show some of the capabilities of Iggy Place Data and walk through a workflow for two example use cases:
- vacation rental recommendation
- vacation rental ratings prediction

## Install and Import libraries, Download data

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
from shapely import wkt
from keplergl import KeplerGl
from google.cloud import storage
from iggyenrich.iggy_enrich import IggyEnrich
from iggyenrich.iggy_data_package import LocalIggyDataPackage
import sklearn.preprocessing as preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import pairwise, mean_squared_error
from sklearn import linear_model
import matplotlib.pyplot as plt

In [None]:
from google.cloud import storage

def download_public_file(bucket_name, prefix, desired_blobs):
    
    storage_client = storage.Client.create_anonymous_client()
    bucket = storage_client.bucket(bucket_name)

    blobs = bucket.list_blobs(prefix=prefix)
    for blob in blobs:
        if blob.name in desired_blobs:
            blob.download_to_filename(blob.name)

In [None]:
!mkdir -p austin-datasets
download_public_file('iggy-web-demo', 'austin-datasets', ['austin-datasets/samples_iggy-package-wkt-20211209155137_tx_austin_quadkeys.tar.gz', 'austin-datasets/vacation_rentals.tar.gz'])

In [None]:
!tar xzvf austin-datasets/samples_iggy-package-wkt-20211209155137_tx_austin_quadkeys.tar.gz -C austin-datasets/
!tar xzvf austin-datasets/vacation_rentals.tar.gz -C austin-datasets/
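
If the `tar` command is not available in your environment, the same extraction can be done from Python. The sketch below uses the standard-library `tarfile` module. The helper name `extract_archive` is our own and is not part of any Iggy package.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)  # only use on archives from a trusted source
    return names
```

For example, `extract_archive('austin-datasets/vacation_rentals.tar.gz', 'austin-datasets/')` would mirror the second `tar` command above.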

## Load 3rd Party Data and Enrich with Iggy Data
Now we will load some third-party data (information on vacation rentals in Austin, TX) and enrich it with Iggy data. We will use a selected subset of features spanning the zipcode, census block group (CBG), and Isochrone 10 minute walk boundaries. For more information on our data and the boundaries, please read our [Data Readme](https://docs.askiggy.com/reference/place-data).

We will enrich the vacation listings data using the [IggyEnrich](https://pypi.org/project/iggyenrich/0.0.2/) Python package.
This package makes it very easy to enrich your own data using selected features and to selected boundaries.

In [None]:
vacation_rental_data = pd.read_csv('austin-datasets/vacation_rental_austin.csv')
rental_data_clean = vacation_rental_data.copy()
rental_data_clean.drop(['name'], axis=1, inplace=True)

In [None]:
vacation_rental_columns = list(rental_data_clean.columns)

Next we will select particular Iggy features to enrich our dataset with.

In [None]:
features = [
 'acs_housing_units_built_2000_to_2009_cbg',
 'acs_housing_units_built_2010_to_2013_cbg',
 'acs_housing_units_built_2014_or_later_cbg',
 'acs_median_age_cbg',
 'acs_pct_households_cohabiting_couple_with_children_cbg',
 'acs_pct_households_female_head_with_children_cbg',
 'acs_pct_households_male_head_with_children_cbg',
 'acs_pct_households_married_couple_with_children_cbg',
 'acs_pct_housing_units_built_1939_or_earlier_cbg',
 'acs_pct_housing_units_built_1940_to_1949_cbg',
 'acs_pct_housing_units_built_1950_to_1959_cbg',
 'acs_pct_pop_commutes_by_public_transport_any_cbg',
 'park_pct_area_intersecting_boundary_qk_isochrone_walk_10m',
 'poi_count_cbg',
 'poi_count_qk_isochrone_walk_10m',
 'poi_count_zipcode',
 'poi_is_bar_count_per_sqkm_qk_isochrone_walk_10m',
 'poi_is_brand_crossfit_count_qk_isochrone_walk_10m',
 'poi_is_brand_trader_joes_count_qk_isochrone_walk_10m',
 'poi_is_convenience_store_or_pharmacy_count_per_sqkm_qk_isochrone_walk_10m',
 'poi_is_cultural_count_qk_isochrone_walk_10m',
 'poi_is_games_and_amusement_recreation_count_qk_isochrone_walk_10m',
 'poi_is_grocery_store_count_qk_isochrone_walk_10m',
 'poi_is_historical_site_count_per_sqkm_qk_isochrone_walk_10m',
 'poi_is_museum_count_qk_isochrone_walk_10m',
 'poi_is_nature_recreation_count_per_sqkm_qk_isochrone_walk_10m',
 'poi_is_parking_count_per_capita_qk_isochrone_walk_10m',
 'poi_is_performance_venue_count_qk_isochrone_walk_10m',
 'poi_is_restaurant_count_per_sqkm_qk_isochrone_walk_10m',
 'water_intersects_zipcode',
]

In [None]:
pkg_spec = {
    "iggy_version_id": "20211209155137",
    "crosswalk_prefix": "tx_austin_quadkeys",
    "base_loc": "austin-datasets/",
    "iggy_prefix": "tx_austin_quadkeys"
}
pkg = LocalIggyDataPackage(**pkg_spec)
iggy = IggyEnrich(iggy_package=pkg)

iggy.load(features=features)
rentals_enriched_df = iggy.enrich_df(rental_data_clean, latitude_col="latitude", longitude_col="longitude")
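
Before moving on, it can be useful to check how well the enrichment covered the listings. The sketch below is a generic pandas check, not part of the IggyEnrich API: the helper `feature_coverage` is hypothetical, and the toy DataFrame (with made-up values) stands in for `rentals_enriched_df`.

```python
import numpy as np
import pandas as pd


def feature_coverage(df, feature_cols):
    """Return the share of non-null values per requested feature column.

    Features that are absent from df are reported with coverage 0.0.
    """
    coverage = {}
    for col in feature_cols:
        coverage[col] = float(df[col].notna().mean()) if col in df.columns else 0.0
    return pd.Series(coverage, name="coverage").sort_values()


# Toy stand-in for the enriched rentals table (values are made up).
toy = pd.DataFrame({
    "poi_count_zipcode": [120, 120, 340, 340],
    "poi_count_cbg": [15, np.nan, 40, 22],
})
print(feature_coverage(toy, ["poi_count_zipcode", "poi_count_cbg", "acs_median_age_cbg"]))
```

In the notebook this could be called as `feature_coverage(rentals_enriched_df, features)`.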

Now that we have enriched the data, we will plot maps at the different boundary levels, from coarse-grained (zip code) to fine-grained (quadkey zoom 19).

In [None]:
rentals_enriched_df['geometry'] = gpd.points_from_xy(rentals_enriched_df['longitude'], rentals_enriched_df['latitude'])
rentals_enriched_df_g = gpd.GeoDataFrame(rentals_enriched_df, geometry='geometry', crs="WGS84")


In [None]:
%run map_configs/zipcode_poi_count_austin.py
zipcode_poi_count_austin = KeplerGl(height=600, width=400, config=config)
zipcode_poi_count_austin.add_data(rentals_enriched_df_g[["latitude", "longitude", "poi_count_zipcode", "zip_geometry"]], "poi_count_zipcode")
zipcode_poi_count_austin

<center>Figure 1: POI Count for each rental when enriching at a zipcode boundary</center>

In [None]:
%run map_configs/cbg_poi_count_austin.py
cbg_poi_count_austin = KeplerGl(height=600, width=400, config=config)
cbg_poi_count_austin.add_data(rentals_enriched_df_g[["latitude", "longitude", "poi_count_cbg", "cbg_geometry"]], "poi_count_cbg")
cbg_poi_count_austin

<center>Figure 2: POI Count for each rental when enriching at a CBG boundary</center>

In [None]:
%run map_configs/isochrone_poi_count_austin.py
isochrone_poi_count_austin = KeplerGl(height=600, width=400, config=config)
isochrone_poi_count_austin.add_data(rentals_enriched_df_g[["latitude", "longitude", "poi_count_qk_isochrone_walk_10m"]], "poi_count_isochrone_walk_10m")
isochrone_poi_count_austin

<center>Figure 3: POI Count for each rental when enriching at an Isochrone 10 minute walk boundary</center>

These plots show the nuance and detail that can be gained by using higher-fidelity boundaries. For example, at the zipcode level in Figure 1, all of the properties in the Clarkville Historic District area have the same POI Count, indicating they are in the same zipcode. If you built a model with this data, all of those properties would be treated the same with respect to this feature. In Figure 2 we start to see more nuance in the POI Count feature, with different parts of the Clarkville Historic District falling in different CBGs and so having different POI Counts. However, the CBG is an arbitrary administrative boundary that very often does not represent how humans behave.

This is where Figure 3 shows the effect that the POI Count within an Isochrone 10 minute walk boundary can have. Looking within the Clarkville Historic District, we can see that vacation properties in the South East have a higher POI Count within a 10-minute walk than properties in the North West. This is likely because the main Downtown area lies to the East, and so it is included in the catchment area of properties in the South East of the Clarkville District.
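
One rough way to quantify this difference is to count how many distinct values each version of the feature takes across the listings: the fewer distinct values, the more listings a model has to treat identically. The sketch below uses a small made-up table. In the notebook the same call could be applied to the corresponding columns of `rentals_enriched_df`.

```python
import pandas as pd

# Made-up POI counts for six listings in one neighborhood (illustrative only).
toy_listings = pd.DataFrame({
    "poi_count_zipcode": [340, 340, 340, 340, 340, 340],
    "poi_count_cbg": [22, 22, 22, 40, 40, 40],
    "poi_count_qk_isochrone_walk_10m": [5, 7, 9, 14, 18, 25],
})

# Number of distinct values per boundary type: coarser boundaries give fewer.
distinct_values = toy_listings.nunique()
print(distinct_values)
```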

## Explore Iggy Features

### Scenario
In this scenario we will explore Iggy's features through the lens of building a vacation rental recommendation system in Austin, TX. We will take three personas - a retired couple, a family with kids and a group of friends in their 30s. Each persona values certain characteristics of the locations in which they stay, and we can model this with Iggy. 

### Retired Couple
This retired couple values cultural sites, grocery stores, nature, historical sites, museums, public transport access and older houses. Based on their preferences we select the vacation data with corresponding Iggy features.

In [None]:
couple_selected_features = [
    "poi_is_cultural_count_qk_isochrone_walk_10m",
    "poi_is_grocery_store_count_qk_isochrone_walk_10m",
    "poi_is_nature_recreation_count_per_sqkm_qk_isochrone_walk_10m",
    "poi_is_historical_site_count_per_sqkm_qk_isochrone_walk_10m",
    "poi_is_museum_count_qk_isochrone_walk_10m",
    "acs_pct_pop_commutes_by_public_transport_any_cbg",
    "acs_pct_housing_units_built_1939_or_earlier_cbg",
    "acs_pct_housing_units_built_1940_to_1949_cbg",
    "acs_pct_housing_units_built_1950_to_1959_cbg"
]

In [None]:
couple_enriched_df = rentals_enriched_df[couple_selected_features]

In [None]:
couple_enriched_df.head()

### Family with Children
This family values being near Trader Joe's and other grocery stores, parking, games and recreation, being near water (rivers, lakes, sea, etc.), being near parks, newer buildings and areas with children. Based on their preferences we select the vacation data with corresponding Iggy features.

In [None]:
family_selected_features = [
    "poi_is_brand_trader_joes_count_qk_isochrone_walk_10m",
    "poi_is_grocery_store_count_qk_isochrone_walk_10m",
    "poi_is_parking_count_per_capita_qk_isochrone_walk_10m",
    "poi_is_games_and_amusement_recreation_count_qk_isochrone_walk_10m",
    "water_intersects_zipcode",
    "park_pct_area_intersecting_boundary_qk_isochrone_walk_10m",
    "acs_housing_units_built_2000_to_2009_cbg",
    "acs_housing_units_built_2010_to_2013_cbg",
    "acs_housing_units_built_2014_or_later_cbg",
    "acs_pct_households_cohabiting_couple_with_children_cbg",
    "acs_pct_households_female_head_with_children_cbg",
    "acs_pct_households_male_head_with_children_cbg",
    "acs_pct_households_married_couple_with_children_cbg"
]

In [None]:
family_enriched_df = rentals_enriched_df[family_selected_features]

In [None]:
family_enriched_df.head()

### Group of Friends
This group of friends values being near CrossFit facilities, convenience stores, restaurants, bars and performance venues, and being around other young adults. Based on their preferences we select the vacation data with corresponding Iggy features.

In [None]:
friends_selected_features = [
    "poi_is_brand_crossfit_count_qk_isochrone_walk_10m",
    "poi_is_convenience_store_or_pharmacy_count_per_sqkm_qk_isochrone_walk_10m",
    "poi_is_restaurant_count_per_sqkm_qk_isochrone_walk_10m",
    "poi_is_bar_count_per_sqkm_qk_isochrone_walk_10m",
    "poi_is_performance_venue_count_qk_isochrone_walk_10m",
    "acs_median_age_cbg",
]

In [None]:
friends_enriched_df = rentals_enriched_df[friends_selected_features + vacation_rental_columns]

In [None]:
friends_enriched_df.head()

## Use Cases

### Recommendations

The group of friends from the previous section visited Austin in 2021 and stayed in a vacation rental with id number 40956278 in our dataset. They liked that rental and are visiting Austin again in 2022. Based on their stated preferences and the fact that they liked their previous location, we will recommend the 5 listings that most closely match their preferences. Other requirements are that the property accommodates at least 4 people and that the rental allows bookings to be made for 3 nights.

In [None]:
# Filter dataset by number of guests and number of nights
friends_enriched_df_processed = friends_enriched_df[friends_enriched_df.accommodates >= 4]
friends_enriched_df_processed = friends_enriched_df_processed[(friends_enriched_df_processed.minimum_minimum_nights <= 3) & (friends_enriched_df_processed.maximum_minimum_nights >= 3)] 
friends_enriched_df_processed = friends_enriched_df_processed[friends_selected_features + ['id', 'accommodates', 'minimum_minimum_nights', 'maximum_minimum_nights']]

In [None]:
# Clean data: drop rows with missing values and keep the ids aligned with the feature rows
friends_enriched_df_processed = friends_enriched_df_processed.dropna()
friends_enriched_df_processed_ids = friends_enriched_df_processed['id'].reset_index(drop=True)
friends_enriched_df_processed = friends_enriched_df_processed[friends_selected_features]

In [None]:
# Scale data
scaler = preprocessing.StandardScaler()
scaled_enriched_df = scaler.fit_transform(friends_enriched_df_processed)

In [None]:
# Calculate similarities
similarities = pairwise.cosine_similarity(scaled_enriched_df)
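
For reference, the cosine similarity between two scaled feature vectors $a$ and $b$ (two rows of `scaled_enriched_df`) follows the standard definition:

```latex
\[
\operatorname{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
= \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}}
\]
```

It ranges from -1 to 1, and a rental compared with itself scores 1 as long as its scaled vector is non-zero.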

In [None]:
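# Select top 5 most similar rentals to the rental from 2021
rental_of_interest = 40956278
num_similar_rentals = 5
idx_of_rental = friends_enriched_df_processed_ids[friends_enriched_df_processed_ids == rental_of_interest].index[0]
# Take num_similar_rentals + 1 indices because the rental of interest is expected to match itself with similarity 1.
# np.argpartition returns these top entries in no particular order.
ind = np.argpartition(similarities[idx_of_rental], -(num_similar_rentals+1))[-(num_similar_rentals+1):]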
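
The selection step can also be written so that the query rental is excluded explicitly and the results come back ranked. The sketch below is one possible variant on toy data, not the notebook's own method: the helper `top_k_similar` and the toy feature matrix are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler


def top_k_similar(features, query_idx, k):
    """Return indices of the k rows most similar to row query_idx, best first."""
    scaled = StandardScaler().fit_transform(features)
    sims = cosine_similarity(scaled[[query_idx]], scaled)[0]
    sims[query_idx] = -np.inf  # never recommend the query row itself
    order = np.argsort(sims)[::-1]  # most similar first
    return order[:k]


# Toy feature matrix: rows 0 and 1 are close, row 3 is the most different from row 0.
toy_features = np.array([
    [10.0, 1.0],
    [9.0, 1.2],
    [5.0, 5.0],
    [1.0, 10.0],
])
print(top_k_similar(toy_features, query_idx=0, k=2))
```

Sorting costs more than `np.argpartition` on very large tables, but it returns the recommendations in ranked order.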

In [None]:
recommendations_rental_data = vacation_rental_data.set_index('id')
recommendations_rental_data.loc[list(friends_enriched_df_processed_ids.iloc[ind].values), :]

In [None]:
# Convert Pandas dataframe to Geopandas dataframe
recommendations_rental_data['geometry'] = gpd.points_from_xy(vacation_rental_data.longitude, vacation_rental_data.latitude)
rental_data_viz = gpd.GeoDataFrame(recommendations_rental_data.loc[list(friends_enriched_df_processed_ids.iloc[ind].values), :].reset_index()[['geometry', 'accommodates', 'id', 'minimum_minimum_nights', 'maximum_minimum_nights']], geometry='geometry', crs="WGS84")
rental_data_viz.set_index('id', inplace=True)

In [None]:
%run map_configs/recommended_rentals_austin_map.py
recommended_rentals = KeplerGl(height=600, width=400)
recommended_rentals.add_data(rental_data_viz.drop(rental_of_interest), "Most Similar rentals")
recommended_rentals.add_data(rental_data_viz[rental_data_viz.index==rental_of_interest], "Rental of Interest")
recommended_rentals

<center>Figure 4: Recommended rentals in Austin</center>

In [None]:
all_rental_data_viz = gpd.GeoDataFrame(recommendations_rental_data.loc[list(friends_enriched_df_processed_ids.values), :].reset_index()[['geometry', 'accommodates', 'id', 'minimum_minimum_nights', 'maximum_minimum_nights']], geometry='geometry', crs="WGS84")
all_rental_data_viz['similarities'] = similarities[idx_of_rental]

In [None]:
%run map_configs/recommended_rentals_austin_map.py
all_rentals = KeplerGl(height=600, width=400)
all_rentals.add_data(all_rental_data_viz[['geometry', 'similarities']], "Similarity of All Rentals")
all_rentals

<center>Figure 5: Similarities of All Rentals in Austin vs Rental of Interest</center>

As you can see from Figure 4, the recommendation engine has selected 5 rentals that fit the requirements and are also closest to the group of friends' preferences. Given the criteria we used to select these recommended rentals, it makes sense that four are in the same neighborhood as the rental of interest.

### Prediction

For this use case, we would like to train a machine learning model to predict the average `review_scores_location` score a vacation rental has been given by the people who have stayed there. We would also like to learn how the features relate to the score, i.e. whether an increase in a feature is associated with an increase in the score. For this task we will only use the selected Iggy features.

In [None]:
# Subset dataframe and clean
modeling_df = rentals_enriched_df[features + ['review_scores_location']]
modeling_df_clean = modeling_df.dropna(axis=1, thresh=8000).dropna()

In [None]:
# Split into X and y
modeling_df_clean_feats = modeling_df_clean.drop(['review_scores_location', 'poi_count_cbg', 'poi_count_zipcode','poi_count_qk_isochrone_walk_10m'], axis=1)
modeling_df_target = modeling_df_clean[['review_scores_location']]

In [None]:
# Split into Train and Test (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(modeling_df_clean_feats, modeling_df_target, test_size=0.3, random_state=42)

In [None]:
# Scale data
scaler = preprocessing.StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

In [None]:
# Train Lasso (L1-regularized linear regression) model
reg = linear_model.Lasso(alpha=0.01)
reg.fit(scaled_X_train, y_train)
preds = reg.predict(scaled_X_test)
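
The cell above produces `preds` but does not score them. A minimal sketch of one way to evaluate such a model is shown below, on synthetic data so that it runs on its own. It compares the model's test mean squared error with a baseline that always predicts the training mean. The data, the seed, and the `demo_` variable names are illustrative assumptions. In the notebook the same comparison could use `y_test` and `preds`.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the target depends on the first feature only.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 4.5 + 0.3 * X[:, 0] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training split only, then reuse it on the test split.
demo_scaler = StandardScaler().fit(X_tr)
demo_model = linear_model.Lasso(alpha=0.01).fit(demo_scaler.transform(X_tr), y_tr)
demo_preds = demo_model.predict(demo_scaler.transform(X_te))

model_mse = mean_squared_error(y_te, demo_preds)
# Baseline: always predict the mean of the training target.
baseline_mse = mean_squared_error(y_te, np.full_like(y_te, y_tr.mean()))
print(f"model MSE: {model_mse:.4f}, baseline MSE: {baseline_mse:.4f}")
```

A model whose error is not clearly below the baseline is not learning much from the features, which is worth knowing before interpreting its coefficients.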

In [None]:
model_coefs = pd.DataFrame(zip(modeling_df_clean_feats.columns, reg.coef_), columns=['Feature', 'Coefficient'])
model_coefs = model_coefs[abs(model_coefs['Coefficient']) > 1e-5]

In [None]:
ax = model_coefs.plot.barh(x='Feature', y='Coefficient')

<center>Figure 6: Coefficients of Features in Linear Regression Model</center>

We built a Lasso regression model using the selected Iggy features to predict `review_scores_location`, then plotted the non-zero coefficients in Figure 6 to see whether each feature is positively or negatively associated with `review_scores_location`. As you can see, features such as `acs_median_age_cbg`, `acs_pct_housing_units_built_1939_or_earlier_cbg` and `poi_is_restaurant_count_per_sqkm_qk_isochrone_walk_10m` have positive coefficients, whereas `acs_pct_households_female_head_with_children_cbg` has a negative coefficient.
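
To illustrate why only some features appear in Figure 6: Lasso's L1 penalty can shrink the coefficients of uninformative features to exactly zero. The sketch below shows this on synthetic data and applies the same non-zero filter as above. The feature names `a` to `d`, the seed, and `alpha=0.1` are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
# Only "a" (positive effect) and "b" (negative effect) drive the target.
toy_y = 2.0 * toy_X["a"] - 1.0 * toy_X["b"] + rng.normal(scale=0.1, size=500)

toy_model = linear_model.Lasso(alpha=0.1)
toy_model.fit(StandardScaler().fit_transform(toy_X), toy_y)

toy_coefs = pd.DataFrame({"Feature": toy_X.columns, "Coefficient": toy_model.coef_})
# Keep only the features whose coefficient is not (numerically) zero.
toy_coefs = toy_coefs[toy_coefs["Coefficient"].abs() > 1e-5]
print(toy_coefs)
```

Because the inputs are standardized, the sign of each remaining coefficient gives the direction of the association, and its magnitude is comparable across features.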