<a href="https://colab.research.google.com/github/dernameistegal/airbnb_price/blob/main/SavingDataInColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Airbnb Data Set Introduction

This short introduction to the Airbnb data set(s) gives an overview of the available data. The city used here is Vienna; to run the same notebook for a different city, you only need to change a few details such as the download URLs. Otherwise, provided you have downloaded all necessary data sets and the notebook can access them, it should run smoothly.

### Index
1. Load data set
2. Price analysis
    * (Inspect reviews)
3. Main file (listings.csv.gz)
4. "Analyze" Images
5. "Analyze" Reviews
6. Calendar file
7. neighbourhoods.geojson file

In [177]:
#@title imports
%%capture
!pip install transformers
!pip install geopandas
import json
import os
import math
import pandas as pd
import numpy as np
import gzip
from PIL import Image
import matplotlib.pyplot as plt
import descartes
import geopandas as gpd
import requests
from io import BytesIO
import matplotlib.image as mpimg

from shapely.geometry import Point, Polygon

import seaborn as sns

from transformers import pipeline

import folium
from folium.plugins import FastMarkerCluster
from branca.colormap import LinearColormap

In [7]:
#@title mount drive
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [119]:
# make directories in drive
!mkdir -p /content/drive/MyDrive/data/data1/
!mkdir -p  /content/drive/MyDrive/data/hostpics/
!mkdir -p  /content/drive/MyDrive/data/thumbnails/

In [120]:
# load data to drive
%%capture
!wget -O /content/drive/MyDrive/data/data1/listings.csv.gz http://data.insideairbnb.com/austria/vienna/vienna/2021-11-07/data/listings.csv.gz
!wget -O /content/drive/MyDrive/data/data1/calendar.csv.gz http://data.insideairbnb.com/austria/vienna/vienna/2021-11-07/data/calendar.csv.gz
!wget -O /content/drive/MyDrive/data/data1/reviews.csv.gz http://data.insideairbnb.com/austria/vienna/vienna/2021-11-07/data/reviews.csv.gz
!wget -O /content/drive/MyDrive/data/data1/listings.csv http://data.insideairbnb.com/austria/vienna/vienna/2021-11-07/visualisations/listings.csv
!wget -O /content/drive/MyDrive/data/data1/reviews.csv http://data.insideairbnb.com/austria/vienna/vienna/2021-11-07/visualisations/reviews.csv
!wget -O /content/drive/MyDrive/data/data1/neighbourhoods.csv http://data.insideairbnb.com/austria/vienna/vienna/2021-11-07/visualisations/neighbourhoods.csv
!wget -O /content/drive/MyDrive/data/data1/neighbourhoods.geojson http://data.insideairbnb.com/austria/vienna/vienna/2021-11-07/visualisations/neighbourhoods.geojson
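
Switching to another city mostly means changing the download URLs. The sketch below rebuilds the seven URLs used above from their common pattern. `build_urls` is a hypothetical helper, and the path segments and snapshot date for any other city are assumptions that have to be looked up on the Inside Airbnb site.

```python
# Base URL pattern taken from the wget commands above
BASE_URL = "http://data.insideairbnb.com"

# "data" holds the detailed dumps, "visualisations" the summary files
FILES = {
    "data": ["listings.csv.gz", "calendar.csv.gz", "reviews.csv.gz"],
    "visualisations": [
        "listings.csv",
        "reviews.csv",
        "neighbourhoods.csv",
        "neighbourhoods.geojson",
    ],
}


def build_urls(country, region, city, snapshot):
    """Return {file name: download URL} for one city snapshot (hypothetical helper)."""
    prefix = f"{BASE_URL}/{country}/{region}/{city}/{snapshot}"
    return {
        name: f"{prefix}/{folder}/{name}"
        for folder, names in FILES.items()
        for name in names
    }


urls = build_urls("austria", "vienna", "vienna", "2021-11-07")
print(urls["listings.csv.gz"])
```

Each URL can then be passed to `wget -O <target path> <url>` exactly as in the cell above.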

In [121]:
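# read the files from the Drive folder they were downloaded to
data_dir = "/content/drive/MyDrive/data/data1/"
listings = pd.read_csv(data_dir + "listings.csv")
reviews = pd.read_csv(data_dir + "reviews.csv")
listings_meta = pd.read_csv(data_dir + "listings.csv.gz")
reviews_meta = pd.read_csv(data_dir + "reviews.csv.gz")
calendar = pd.read_csv(data_dir + "calendar.csv.gz")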

In [122]:
listings_meta.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,15883,https://www.airbnb.com/rooms/15883,20211107161644,2021-11-08,b&b near Old Danube river,"Four rooms, each one differently and individua...",small and personal<br /><br />Four rooms at th...,https://a0.muscache.com/pictures/18eff738-a737...,62142,https://www.airbnb.com/users/show/62142,Eva,2009-12-11,"Vienna, Wien, Austria",Mein größtes Hobby: Reisen! Am liebsten mit me...,within an hour,100%,100%,f,https://a0.muscache.com/im/pictures/user/24166...,https://a0.muscache.com/im/pictures/user/24166...,Donaustadt,6.0,6.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Vienna, Austria",Donaustadt,,48.24262,16.42767,Room in bed and breakfast,Hotel room,3,,1 private bath,1.0,2.0,"[""Essentials"", ""Smoke alarm"", ""Free street par...",$120.00,1,365,1,1,365,365,1.0,365.0,,t,29,59,89,364,2021-11-08,14,3,0,2017-11-19,2019-07-17,4.71,4.86,4.93,4.93,4.86,4.71,4.5,,f,3,1,0,0,0.29
1,38768,https://www.airbnb.com/rooms/38768,20211107161644,2021-11-08,central cityapartement- wifi- nice neighbourhood,39m² apartment with beautiful courtyard of the...,the Karmeliterviertel became very popular in t...,https://a0.muscache.com/pictures/ad4089a3-5355...,166283,https://www.airbnb.com/users/show/166283,Hannes,2010-07-14,"Wien, Wien, Austria",I am open minded and like travelling myself. I...,,,,t,https://a0.muscache.com/im/users/166283/profil...,https://a0.muscache.com/im/users/166283/profil...,Leopoldstadt,1.0,1.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Vienna, Austria",Leopoldstadt,,48.21924,16.37831,Entire rental unit,Entire home/apt,5,,1 bath,1.0,3.0,"[""Shared patio or balcony"", ""Iron"", ""Portable ...",$61.00,3,100,3,3,1125,1125,3.0,1125.0,,t,11,20,50,140,2021-11-08,334,11,2,2012-06-16,2021-09-05,4.75,4.8,4.66,4.91,4.93,4.74,4.7,,t,3,3,0,0,2.92
2,40625,https://www.airbnb.com/rooms/40625,20211107161644,2021-11-08,"Near Palace Schönbrunn, Apt. 1",Welcome to my Apt. 1!<br /><br />This is a 2be...,The neighbourhood offers plenty of restaurants...,https://a0.muscache.com/pictures/11509144/d55c...,175131,https://www.airbnb.com/users/show/175131,Ingela,2010-07-20,"Vienna, Wien, Austria",I´m originally from Sweden but have been livin...,within a few hours,97%,81%,t,https://a0.muscache.com/im/users/175131/profil...,https://a0.muscache.com/im/users/175131/profil...,Rudolfsheim-Fünfhaus,16.0,16.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"Vienna, Austria",Rudolfsheim-Fnfhaus,,48.18434,16.32701,Entire rental unit,Entire home/apt,6,,1 bath,2.0,4.0,"[""Wine glasses"", ""Clothing storage: wardrobe a...",$131.00,1,180,1,3,180,180,1.0,180.0,,t,0,0,5,275,2021-11-08,162,7,2,2014-09-06,2019-11-24,4.84,4.91,4.87,4.9,4.93,4.59,4.73,,f,15,14,1,0,1.85
3,51287,https://www.airbnb.com/rooms/51287,20211107161644,2021-11-08,little studio- next to citycenter- wifi- nice ...,small studio in new renovated old house and ve...,The neighbourhood has a lot of very nice littl...,https://a0.muscache.com/pictures/25163038/1c4e...,166283,https://www.airbnb.com/users/show/166283,Hannes,2010-07-14,"Wien, Wien, Austria",I am open minded and like travelling myself. I...,,,,t,https://a0.muscache.com/im/users/166283/profil...,https://a0.muscache.com/im/users/166283/profil...,Leopoldstadt,1.0,1.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Vienna, Austria",Leopoldstadt,,48.21778,16.37847,Entire rental unit,Entire home/apt,3,,1 bath,,2.0,"[""Iron"", ""Hair dryer"", ""Bed linens"", ""Cooking ...",$59.00,3,100,3,3,1125,1125,3.0,1125.0,,t,11,22,52,142,2021-11-08,324,15,5,2012-04-12,2021-11-04,4.64,4.76,4.51,4.92,4.95,4.86,4.56,,t,3,3,0,0,2.78
4,70637,https://www.airbnb.com/rooms/70637,20211107161644,2021-11-08,Flat in the Center with Terrace,<b>The space</b><br />My apartment (including ...,,https://a0.muscache.com/pictures/925691/c8c1bd...,358842,https://www.airbnb.com/users/show/358842,Elxe,2011-01-23,"Vienna, Vienna, Austria","Flat in the Center with TerraceWien, Wien, Öst...",within a few hours,100%,72%,t,https://a0.muscache.com/im/users/358842/profil...,https://a0.muscache.com/im/users/358842/profil...,Leopoldstadt,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'off...",t,t,,Leopoldstadt,,48.2176,16.38018,Private room in rental unit,Private room,2,,2 shared baths,1.0,2.0,"[""Bathtub"", ""Extra pillows and blankets"", ""Loc...",$50.00,2,1000,2,2,1000,1000,2.0,1000.0,,t,0,0,27,302,2021-11-08,117,1,0,2011-09-18,2019-12-17,4.77,4.74,4.68,4.8,4.75,4.81,4.71,,f,3,1,2,0,0.95


# Save Images

In [132]:

# descriptive statistics for availability of pictures
n_no_hostpic = listings_meta["host_picture_url"].isnull().sum()
n_no_thumbnail = listings_meta["picture_url"].isnull().sum()
# number of distinct hosts behind the listings without a host picture
n_hosts_no_hostpic = len(np.unique(listings_meta["host_id"][listings_meta["host_picture_url"].isnull()]))
print(f"{n_no_hostpic} listings have no hostpic. In total, {n_hosts_no_hostpic} hosts have no hostpic. {n_no_thumbnail} listings have no thumbnail.")

22 listings have no hostpic. In total, 6 hosts have no hostpic. 0 listings have no thumbnail.


In [None]:
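# instantiate list of all ids where url does not work
pic_malfunction = []

# target (width, height) of the stored images
# ASSUMPTION: IMAGE_SIZE is never defined in this notebook; (224, 224) is a placeholder value
IMAGE_SIZE = (224, 224)

# save hostpics that are available
# NOTE: this range only covers rows 1488 to 1504; use range(len(listings_meta)) to process all listings
for i in range(1488, 1505):

    # get url
    url = listings_meta.loc[i, "host_picture_url"]

    # check if url is not available
    if pd.isna(url):
        continue

    # scrape url and decode the image; any failure marks the url as non-functioning
    # (Exception instead of a bare except, so the loop can still be interrupted manually)
    try:
        response = requests.get(url)
        img_plot = Image.open(BytesIO(response.content)).resize(IMAGE_SIZE)
    except Exception:
        pic_malfunction.append(int(listings_meta.loc[i, "id"]))
        continue

    # save rgb data (np.save appends ".npy" to the path)
    rgb_data = np.array(img_plot)
    save_path = "/content/drive/MyDrive/data/hostpics/hostpic" + str(listings_meta.loc[i, "id"])
    np.save(save_path, rgb_data)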

In [260]:
# save ids where host pics are not available (either no url or non-functioning url) in dictionary
nopic = np.unique(listings_meta["id"][listings_meta["host_picture_url"].isnull()])
nopic = list(nopic)
indices = nopic + pic_malfunction

# convert to plain int, since json cannot serialize numpy integers
missing_data = {"hostpic": [int(ind) for ind in indices]}

with open("/content/drive/MyDrive/data/missing_data.json", "w") as temp_file:
    json.dump(missing_data, temp_file)

# to check the result, read the file back:
#with open("/content/drive/MyDrive/data/missing_data.json", "r") as temp_file:
#    output = json.load(temp_file)

In [238]:
# instantiate list of all ids where url does not work
pic_malfunction = []

# IMAGE_SIZE must be a (width, height) tuple defined before this cell runs

# save thumbnails that are available
for i in range(len(listings_meta)):

    # get url
    url = listings_meta.loc[i, "picture_url"]

    # check if url is not available
    if pd.isna(url):
        continue

    # scrape url and decode the image; any failure marks the url as non-functioning
    # (Exception instead of a bare except, so the loop can still be interrupted manually)
    try:
        response = requests.get(url)
        img_plot = Image.open(BytesIO(response.content)).resize(IMAGE_SIZE)
    except Exception:
        pic_malfunction.append(int(listings_meta.loc[i, "id"]))
        continue

    # save rgb data (np.save appends ".npy" to the path)
    rgb_data = np.array(img_plot)
    save_path = "/content/drive/MyDrive/data/thumbnails/thumbnail" + str(listings_meta.loc[i, "id"])
    np.save(save_path, rgb_data)


In [266]:
# save ids where thumbnails are not available in dictionary
# various reasons, e.g. could not load because of corrupt exif data or image size

with open("/content/drive/MyDrive/data/missing_data.json", "r") as temp_file:
    temp_file_dict = json.load(temp_file)

# convert to plain int, since json cannot serialize numpy integers
temp_file_dict["thumbnail"] = [int(ind) for ind in pic_malfunction]

with open("/content/drive/MyDrive/data/missing_data.json", "w") as temp_file:
    json.dump(temp_file_dict, temp_file)

# to check the result, read the file back:
#with open("/content/drive/MyDrive/data/missing_data.json", "r") as temp_file:
#    output = json.load(temp_file)
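
The cells above store each picture with `np.save` and record failures in `missing_data.json`. As a minimal sketch of how these files might be read back later, the example below uses a temporary folder instead of Drive. `load_picture` is a hypothetical helper, and the listing ids 1 and 2 are made-up demo values.

```python
import json
import os
import tempfile

import numpy as np


def load_picture(folder, prefix, listing_id, missing_ids):
    """Return the saved pixel array for a listing, or None if it is recorded as missing."""
    if listing_id in missing_ids:
        return None
    # np.save appended ".npy" to the path when the array was written
    return np.load(os.path.join(folder, f"{prefix}{listing_id}.npy"))


# Tiny demo in a temporary folder standing in for the Drive directory
with tempfile.TemporaryDirectory() as folder:
    # listing 1 has a saved thumbnail, listing 2 is recorded as missing
    np.save(os.path.join(folder, "thumbnail1"), np.zeros((4, 4, 3), dtype=np.uint8))
    with open(os.path.join(folder, "missing_data.json"), "w") as f:
        json.dump({"hostpic": [], "thumbnail": [2]}, f)

    with open(os.path.join(folder, "missing_data.json")) as f:
        missing = json.load(f)

    missing_thumbnails = set(missing["thumbnail"])
    found = load_picture(folder, "thumbnail", 1, missing_thumbnails)
    absent = load_picture(folder, "thumbnail", 2, missing_thumbnails)

print(found.shape, absent)
```

Checking the missing list first avoids a `FileNotFoundError` for listings whose picture could not be saved.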

# 5. Analyze the reviews

In [239]:
reviews_meta.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,15883,29643839,2015-04-10,30537860,Robert,"If you need a clean, comfortable place to stay..."
1,15883,80590019,2016-06-19,37529754,Chuang,It's so nice to be in the house! It's a peace ...
2,15883,89583522,2016-07-29,3147341,Arber,"A beautiful place, uniquely decorated showing ..."
3,15883,93550424,2016-08-13,29518067,Raphaela,Eine sehr schöne Unterkunft in einem privaten ...
4,15883,114990769,2016-11-21,36016357,Chris,It was a very pleasant stay. Excellent locatio...


## Get all reviews for one entry
The "listing_id" column in the reviews file corresponds to the "id" column in the listings file.

In [None]:
senti = reviews_meta[reviews_meta["listing_id"] == listings_meta.loc[2]["id"]]
senti

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
348,40625,73717,2010-08-04,176849,William,Ingela is a superb host. She personally welco...
349,40625,110809,2010-10-03,222519,Kerri,Ingela was a perfect host! She gave great dire...
350,40625,206046,2011-03-22,273895,Heather,Our stay in Vienna with Ingela could not have ...
351,40625,554329,2011-09-21,254998,Fernando,Our stay in the beautiful city of Vienna was g...
352,40625,584745,2011-10-01,314952,Michael,We really enjoyed our visit and loved the very...
...,...,...,...,...,...,...
505,40625,758726942,2021-05-16,219341323,Jovan,Alles Super gewesen.
506,40625,442480631302769561,2021-09-02,402971261,Richard,"Ingela is an amazing host, very friendly, help..."
507,40625,443274687839876914,2021-09-03,8795608,Anton,We have been traveling across Central Europe f...
508,40625,484538823288562944,2021-10-30,335925400,Tetiana,Many thanks for the generous hospitality! The...
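
The same key relationship can be used to join the two tables instead of filtering one listing at a time. The sketch below uses toy frames with made-up rows; only the column names `id`, `listing_id` and `comments` follow the real files.

```python
import pandas as pd

# Toy frames with made-up rows; only the column names follow the real files
listings_demo = pd.DataFrame({"id": [1, 2], "name": ["Flat A", "Studio B"]})
reviews_demo = pd.DataFrame(
    {
        "listing_id": [1, 1, 2],
        "id": [101, 102, 103],
        "comments": ["Great stay", "Very clean", "Too noisy"],
    }
)

# Attach the listing name to every review: reviews.listing_id matches listings.id.
# Both frames have an "id" column, so suffixes keep the two apart.
merged = reviews_demo.merge(
    listings_demo,
    left_on="listing_id",
    right_on="id",
    how="left",
    suffixes=("_review", "_listing"),
)

# Number of reviews per listing
review_counts = reviews_demo.groupby("listing_id").size()

print(merged[["listing_id", "id_review", "name"]])
print(review_counts.to_dict())
```

A left join keeps every review even if its listing is absent from the listings file.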


## Perform sentiment analysis for these reviews

We use nlptown/bert-base-multilingual-uncased-sentiment because the reviews are written in several languages.

To give a rough idea of the reviews, we run a simple sentiment analysis on a small subset.

In [None]:
classifier = pipeline(
    "sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment"
)

In [None]:
results = []
for i in range(100):
    temp = classifier(senti["comments"].iloc[i])
    results.append(temp)
results[0:10]

[[{'label': '5 stars', 'score': 0.6166882514953613}],
 [{'label': '5 stars', 'score': 0.9249880313873291}],
 [{'label': '4 stars', 'score': 0.3360788822174072}],
 [{'label': '5 stars', 'score': 0.7516244053840637}],
 [{'label': '5 stars', 'score': 0.5675436854362488}],
 [{'label': '5 stars', 'score': 0.6125885844230652}],
 [{'label': '5 stars', 'score': 0.5022594928741455}],
 [{'label': '5 stars', 'score': 0.8312374949455261}],
 [{'label': '5 stars', 'score': 0.49380046129226685}],
 [{'label': '4 stars', 'score': 0.6440829038619995}]]
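
To summarise such results, the star labels can be converted to numbers. This is a minimal sketch on a hand-written list that mimics the nesting of the output above; `stars_from_label` is a hypothetical helper and the scores are rounded demo values.

```python
def stars_from_label(label):
    """Turn a label such as '5 stars' or '1 star' into the integer 5 or 1."""
    return int(label.split()[0])


# Same nesting as the classifier output: one single-element list per review
demo_results = [
    [{"label": "5 stars", "score": 0.62}],
    [{"label": "4 stars", "score": 0.34}],
    [{"label": "5 stars", "score": 0.75}],
]

stars = [stars_from_label(r[0]["label"]) for r in demo_results]
mean_stars = sum(stars) / len(stars)
print(stars, round(mean_stars, 2))
```

The `score` field is the model's confidence in the predicted label, so a low score such as 0.34 suggests that neighbouring star ratings were nearly as likely.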

In [None]:
# Let's look at one of these reviews. Since most of them are rated 5 stars, I would expect a highly positive review.
senti["comments"].iloc[2]

"Our stay in Vienna with Ingela could not have been better!!!  She met us at the U station to personally introduce us to the fantastic apartment.  Anything and everything you could possibly need was at your fingertips, two Vienna cell phones, a small portable internet notebook, city and attraction maps-the works.  The apartment was very accessible to city transport and quick after a long day out in the city(Get the Vienna Card!!)  She even arranged our very early morning taxi to assure us the best rate!!  Vienna was amazing, our stay with Ingela just added to our experience-we can't wait to go back...."

# 6. Inspect the calendar file

In [None]:
calendar

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,305788,2021-11-08,f,$20.00,$20.00,1.0,1125.0
1,15883,2021-11-08,f,$120.00,$120.00,1.0,365.0
2,15883,2021-11-09,t,$120.00,$120.00,1.0,365.0
3,15883,2021-11-10,t,$120.00,$120.00,1.0,365.0
4,15883,2021-11-11,t,$120.00,$120.00,1.0,365.0
...,...,...,...,...,...,...,...
4164285,53198439,2022-11-03,f,$28.00,$28.00,1.0,7.0
4164286,53198439,2022-11-04,f,$28.00,$28.00,1.0,7.0
4164287,53198439,2022-11-05,f,$28.00,$28.00,1.0,7.0
4164288,53198439,2022-11-06,f,$28.00,$28.00,1.0,7.0
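
The `price` and `available` columns are stored as strings, so they need converting before any aggregation. The sketch below shows one way to do this on a few illustrative rows in the same format. The value "$1,200.00" is an assumption about how prices above 999 are formatted.

```python
import pandas as pd

# A few rows in the same format as the calendar file (values are illustrative)
calendar_demo = pd.DataFrame(
    {
        "listing_id": [15883, 15883, 15883, 305788],
        "date": ["2021-11-08", "2021-11-09", "2021-11-10", "2021-11-08"],
        "available": ["f", "t", "t", "f"],
        "price": ["$120.00", "$120.00", "$1,200.00", "$20.00"],
    }
)

# "$1,200.00" -> 1200.0: strip the currency symbol and the thousands separator
calendar_demo["price_num"] = (
    calendar_demo["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)
# "t"/"f" flags -> booleans, date strings -> timestamps
calendar_demo["is_available"] = calendar_demo["available"] == "t"
calendar_demo["date"] = pd.to_datetime(calendar_demo["date"])

# Mean price and share of available days per listing
summary = calendar_demo.groupby("listing_id").agg(
    mean_price=("price_num", "mean"),
    share_available=("is_available", "mean"),
)
print(summary)
```

The same conversion applies to the `price` column of listings.csv.gz, which uses the "$120.00" format as well.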


# 7. Inspect the neighbourhoods.geojson file

In [None]:
neighbours = gpd.read_file("/content/drive/MyDrive/data/data1/neighbourhoods.geojson")
print(neighbours.head())

  neighbourhood  ...                                           geometry
0  Leopoldstadt  ...  MULTIPOLYGON (((16.38484 48.22616, 16.38495 48...
1    Landstra§e  ...  MULTIPOLYGON (((16.38681 48.21271, 16.38683 48...
2  Innere Stadt  ...  MULTIPOLYGON (((16.36497 48.21590, 16.36498 48...
3   Brigittenau  ...  MULTIPOLYGON (((16.38595 48.24764, 16.38611 48...
4   Floridsdorf  ...  MULTIPOLYGON (((16.37817 48.28858, 16.37819 48...

[5 rows x 3 columns]


In [None]:
# average price per neighbourhood for listings that accommodate two guests
# (select by listing id instead of relying on both files having the same row order)
ids_for_two = listings_meta.loc[listings_meta["accommodates"] == 2, "id"]
feq = listings[listings["id"].isin(ids_for_two)]
feq = feq.groupby("neighbourhood")["price"].mean().reset_index()
adam = gpd.read_file("/content/drive/MyDrive/data/data1/neighbourhoods.geojson")
adam = adam.merge(feq, on="neighbourhood", how="left")
adam.rename(columns={"price": "average_price"}, inplace=True)
adam.average_price = adam.average_price.round(decimals=0)
# keep neighbourhoods with an average price below 400; this also drops those without any matching listing
adam = adam[adam["average_price"] < 400]

map_dict = adam.set_index("neighbourhood")["average_price"].to_dict()
color_scale = LinearColormap(
    ["yellow", "red"], vmin=min(map_dict.values()), vmax=max(map_dict.values())
)


In [None]:
def get_color(feature):
    value = map_dict.get(feature["properties"]["neighbourhood"])
    return color_scale(value)


# centre the map on Vienna (the listings lie around latitude 48.2, longitude 16.4)
map3 = folium.Map(location=[48.21, 16.37], zoom_start=11)

In [None]:
folium.GeoJson(
    data=adam,
    name="Vienna",
    tooltip=folium.features.GeoJsonTooltip(
        fields=["neighbourhood", "average_price"], labels=True, sticky=False
    ),
    style_function=lambda feature: {
        "fillColor": get_color(feature),
        "color": "black",
        "weight": 1,
        "dashArray": "5, 5",
        "fillOpacity": 0.5,
    },
    highlight_function=lambda feature: {
        "weight": 3,
        "fillColor": get_color(feature),
        "fillOpacity": 0.8,
    },
).add_to(map3)
map3