# Business Project of Alessandro Derchi 
## June 29th 2021 
## Programming with Advanced Computer Languages


Business Problem:

Due to the global pandemic, a Swiss asset management firm wants to use its newly received capital to invest in the real estate market in Bali before tourism starts to pick up again. The goal is to acquire the right types of properties and renovate them so that they can be listed on Airbnb at a higher price. Therefore, we need to identify the features that contribute to higher prices in order to maximize the firm's return on investment.

Methodology:

To assess the current market situation, we examine the demand and the supply side of listed Airbnb properties for the period July 7 to 10, 2021 by using web scraping.
Based on the results, we select the entries whose numeric features (rating, price per night) are above average and examine the qualitative features that come with them.

We need to interpret the results to be able to design an appropriate acquisition strategy for properties that can be offered at higher prices. Without knowing which types of properties to invest in, the asset management firm is less likely to be profitable and might earn a lower return on its investment.

## 1. Setup 
First, we need to obtain the data from the Airbnb website before analyzing it. Therefore, we need to set up code that extracts information from the web. To do that, we need the URL of the search results page with the listed properties for the specified location and time. We choose Bali as the location and July 7 to July 10, 2021 as the time, which is usually a prime time for tourists to travel.

In [1]:
url = "https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=filter_change&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&flexible_trip_dates%5B%5D=august&flexible_trip_dates%5B%5D=july&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&checkin=2021-07-07&checkout=2021-07-10&adults=2"

The function get_page takes the URL as input and returns the underlying HTML code as a BeautifulSoup object. The libraries requests and bs4 need to be imported in order to run the code.

In [2]:
import requests
import bs4

def get_page(url):
    response = requests.get(url)
    return bs4.BeautifulSoup(response.text, 'html.parser')

soup = get_page(url)
soup

<!DOCTYPE html>

<html data-is-hyperloop="true"><script>window.sherlock_firstbyte = window.performance && window.performance.timing ? window.performance.timing.responseStart : Number(new Date());</script><script>!function(){"use strict";var n=window;const e="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/",o=new RegExp(`^\\d{10}_[${e}]{16}$`);const t=/(?:^| )bev=(.*?)(?:;|$)/;let c=!1;function i(){if(c||"undefined"==typeof document)return null;c=!0;const n=(document.cookie||"").match(t);if(!n||2!==n.length)return null;const e=decodeURIComponent(n[1]);return function(n){return o.test(n)}(e)?e:null}!function(){try{if(n.bev=n.bev||i(),!n.bev){const o=function(){const n=[];for(let o=0;o<16;o+=1)n.push(e[Math.floor(Math.random()*e.length)]);return`${Math.floor(Date.now()/1e3)}_${n.join("")}`}();!function(n){const{hostname:e}=document.location,o="."+e.slice(e.indexOf("airbnb.")),t=new Date;t.setDate(t.getDate()+730),document.cookie=["bev="+encodeURIComponent(n),"expires="+t.

When trying to extract information from a webpage, it helps to see how it is constructed. A brief look at the given webpage shows that the different listings are displayed one below the other in list form.
If we open the URL, we see for every listing a preview image with some standard information that includes a title, a subtitle, the number of guests allowed, the number of bedrooms and bathrooms, the number of beds, information about certain amenities, the price per night, the total price per stay, the average rating and the number of reviews.

The get_listings function should take a BeautifulSoup object containing the code for a whole webpage as input and return a list of the individual pieces of code for each listing.

In [3]:
listing_class = "_8ssblpx"
listing_tag = "div"

def get_listings(soup):
    return soup.find_all(listing_tag,{"class": listing_class})

get_listings(soup)[0]

<div class="_8ssblpx"><div class="_gig1e7"><div itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem"><meta content="-70%- PROMO Romantic Hideaway 1BR Private Villa Ubud" itemprop="name"/><meta content="1" itemprop="position"/><meta content="www.airbnb.com/rooms/46761225?adults=2&amp;check_in=2021-07-07&amp;check_out=2021-07-10&amp;previous_page_section_name=1000&amp;translate_ugc=false" itemprop="url"/><div><div><div style="margin-top:12px;margin-bottom:24px"><div class="_7qp4lh"></div></div><div aria-labelledby="title_46761225" class="_8s3ctt" role="group"><a aria-labelledby="title_46761225" class="_mm360j" href="/rooms/46761225?adults=2&amp;check_in=2021-07-07&amp;check_out=2021-07-10&amp;previous_page_section_name=1000&amp;translate_ugc=false&amp;federated_search_id=f2867145-e583-4a09-a062-031a9b52e1d0" rel="noopener noreferrer" target="listing_46761225"></a><div class="_1nz9l7j"><div class="_1s4ea4t9"><div class="_1mx6kqf" style="background:#EBEBEB;--dls-ba

## 2. Retrieving the data

Now that the code for all the separate listings is retrieved, we need to retrieve the standard information for each listing.

For each piece of information that we can retrieve from the listing preview on Airbnb, we define a separate function.

1. Title

In [4]:
title_class = "_5kaapu"
title_tag = "div"

def get_listing_title(listing):
    try:
        return listing.find(title_tag, {"class": title_class}).text
    except AttributeError:
        # find() returned None: no title element in this listing
        return None

get_listing_title(get_listings(soup)[0])

'-70%- PROMO Romantic Hideaway 1BR Private Villa Ubud'

2. Type of property

This information can be extracted from the subtitle of the listing, which has the form "type of property in location". We keep the part before " in ".

In [5]:
property_class = "_1tanv1h"
property_tag = "div"

def get_listing_property(listing):
    try:
        mystring = listing.find(property_tag, {"class": property_class}).text
        before_keyword, keyword, after_keyword = mystring.partition(" in ")
        return before_keyword
    except: 
        return None
get_listing_property(get_listings(soup)[0])

'Entire villa'

3. Location

This information can also be extracted from the subtitle of the listing. Here we keep the part after " in ".

In [6]:
location_class = "_1tanv1h"
location_tag = "div"

def get_listing_location(listing):
    try:
        mystring = listing.find(location_tag, {"class": location_class}).text
        before_keyword, keyword, after_keyword = mystring.partition(" in ")
        return after_keyword
    except: 
        return None

get_listing_location(get_listings(soup)[0])

'Kecamatan Ubud'

4. Info

In [7]:
info_class = "_3c0zz1"
info_tag = "div"

def get_listing_info(listing):
    try:
        return listing.find_all(info_tag, {"class": info_class})[0].text
    except: 
        return None

get_listing_info(get_listings(soup)[0])

'3 guests · 1 bedroom · 1 bed · 1 bath'

5. Amenities

In [8]:
ammenities_class = "_3c0zz1"
ammenities_tag = "div"

def get_listing_ammenities(listing):
    try:
        return listing.find_all(ammenities_tag, {"class": ammenities_class})[1].text
    except: 
        return None

get_listing_ammenities(get_listings(soup)[0])

'Pool · Wifi · Air conditioning · Kitchen'

6. Rating 

In [9]:
rating_class = "_10fy1f8"
rating_tag = "span"

def get_listing_rating(listing):
    try:
        return float(listing.find(rating_tag, {"class": rating_class}).text)
    except:
        return None

get_listing_rating(get_listings(soup)[0])

4.88

7. Number of reviews

In [10]:
reviews_class = "_a7a5sx"
reviews_tag = "span"

def get_listing_reviews(listing):
    try:
        return int(listing.find(reviews_tag, {"class": reviews_class}).text[2:-1].strip(" reviews"))
    except:
        return None

get_listing_reviews(get_listings(soup)[0])

16
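The review count above relies on slicing and on `str.strip(" reviews")`. Note that `strip` removes any of the given characters from both ends rather than the literal word, which works here only because digits are not in that set. A minimal alternative sketch, assuming the raw label looks like " (16 reviews)" (the exact text is an assumption, and `parse_review_count` is a hypothetical helper), could use a regular expression:

```python
import re

def parse_review_count(text):
    """Return the first integer found in a review label, or None."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*", text)   # digits, optionally with thousands separators
    if match is None:
        return None
    return int(match.group().replace(",", ""))

print(parse_review_count(" (16 reviews)"))    # 16
print(parse_review_count("(1,204 reviews)"))  # 1204
print(parse_review_count("New"))              # None
```

Because the function works on plain strings, it can be checked without scraping a page first.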

8. Price per night

In [11]:
price_per_night_class = "_1gi6jw3f"
price_per_night_tag = "div"

def get_listing_price_per_night(listing):
    try:
        return int(listing.find(price_per_night_tag, {"class": price_per_night_class}).text.split("$")[-1].strip("/ night"))
    except: 
        return None

get_listing_price_per_night(get_listings(soup)[0])

84
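The price extraction keeps the text after the last "$" and strips the characters of "/ night". As a hedged sketch of the same idea, the hypothetical helper `parse_price_per_night` below takes the last dollar amount in the label. The label formats used in the example ("$84 / night", a crossed-out price followed by the current one) are assumptions about the page text, not confirmed by the source.

```python
import re

def parse_price_per_night(text):
    """Return the last dollar amount in a price label as an int, or None."""
    if not text:
        return None
    amounts = re.findall(r"\$\s*(\d[\d,]*)", text)
    if not amounts:
        return None
    # The last amount is used, mirroring split("$")[-1] in the cell above.
    return int(amounts[-1].replace(",", ""))

print(parse_price_per_night("$84 / night"))       # 84
print(parse_price_per_night("$120 $84 / night"))  # 84
print(parse_price_per_night("$1,250 / night"))    # 1250
```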

Next, we need a function that finds the page following the current one. The function find_next_page takes a soup object containing the code for an individual page as input and returns the complete URL of the next page. If there are no more pages left, it returns None. We need the base_url to build the complete URL.

In [12]:
base_url = "https://airbnb.com"
next_page_class = "_za9j7e"
next_page_tag = "a"

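def find_next_page(page):
    # Search the page that was passed in, not the global soup object
    link = page.find(next_page_tag, {"class": next_page_class})
    try: 
        return base_url + link["href"]
    except (TypeError, KeyError):
        # No "next" link: we are on the last page
        return None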

find_next_page(soup)

'https://airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=filter_change&flexible_trip_dates%5B%5D=august&flexible_trip_dates%5B%5D=july&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&checkin=2021-07-07&checkout=2021-07-10&adults=2&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&federated_search_session_id=43dd37ff-bd10-4c5b-bdb2-b518eb87f5bd&pagination_search=true&items_offset=20&section_offset=3'

Next, we need to retrieve the data above for all listings in all webpages. We use a for loop to retrieve the information and store the information in lists.

In [13]:
all_listings = []
url = "https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=filter_change&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&flexible_trip_dates%5B%5D=august&flexible_trip_dates%5B%5D=july&flexible_trip_lengths%5B%5D=weekend_trip&date_picker_type=calendar&checkin=2021-07-07&checkout=2021-07-10&adults=2"

# Follow the "next page" links until find_next_page returns None
while url:
    try:
        soup = get_page(url)
    except requests.RequestException:
        break  # stop on network errors
    all_listings.extend(get_listings(soup))
    url = find_next_page(soup)
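The pagination loop can also be written so that it is bounded and pauses between requests. The sketch below is one possible shape, not the method used in this notebook: `crawl`, `max_pages` and `delay` are hypothetical names, and the fetching and parsing functions are passed in so that the loop can be tried without network access.

```python
import time

def crawl(start_url, fetch, extract_listings, next_url, max_pages=20, delay=1.0):
    """Follow 'next page' links from start_url and collect all listings."""
    listings = []
    url = start_url
    pages = 0
    while url and pages < max_pages:
        page = fetch(url)
        listings.extend(extract_listings(page))
        url = next_url(page)       # None ends the loop
        pages += 1
        if url:
            time.sleep(delay)      # pause between requests

    return listings

# Stand-in pages (hypothetical data, no network access).
fake_site = {
    "p1": {"items": ["a", "b"], "next": "p2"},
    "p2": {"items": ["c"], "next": None},
}
result = crawl("p1", fake_site.get, lambda p: p["items"], lambda p: p["next"], delay=0)
print(result)  # ['a', 'b', 'c']
```

With the functions defined in this notebook, the call would presumably look like `crawl(url, get_page, get_listings, find_next_page)`.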

In [14]:
title = []
info = []
location = []
type_of_property = []
ammenities = []
rating = []
reviews = []
price_per_night = []

for listing in all_listings:
    title.append(get_listing_title(listing))
    location.append(get_listing_location(listing))
    type_of_property.append(get_listing_property(listing))
    info.append(get_listing_info(listing))
    ammenities.append(get_listing_ammenities(listing))
    rating.append(get_listing_rating(listing))
    reviews.append(get_listing_reviews(listing))
    price_per_night.append(get_listing_price_per_night(listing))

## 3. Saving the data

Next, in order to view all the information we retrieved, we need to store it in a DataFrame.

We store the data in a DataFrame object and call it airbnb. The column names match those of the lists we just created: title, location, type_of_property, info, ammenities (spelled ammenitites as a column name in the code below), rating, reviews and price_per_night. However, for further analysis we do not need the title of the listing as it does not give us added value.

In [15]:
import pandas as pd

data = {'title': title,
        'location': location,
        'type_of_property': type_of_property,
        'ammenitites': ammenities,
        'info': info,
        'rating': rating,
        'reviews': reviews,
        'price_per_night': price_per_night,
        }

airbnb = pd.DataFrame(data = data)
airbnb

| | title | location | type_of_property | ammenitites | info | rating | reviews | price_per_night |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ❣️Romantic Staycation-PrivateSunset Pool@megan... | Ubud | Entire villa | Pool · Wifi · Air conditioning · Kitchen | 2 guests · 1 bedroom · 1 bed · 1 bath | 4.94 | 216.0 | 49 |
| 1 | -70%- PROMO Romantic Hideaway 1BR Private Vill... | Kecamatan Ubud | Entire villa | Pool · Wifi · Air conditioning · Kitchen | 3 guests · 1 bedroom · 1 bed · 1 bath | 4.88 | 16.0 | 84 |
| 2 | Cozy 2BR Villa with Panoramic View of Rice Fields | Kecamatan Ubud | Entire villa | Pool · Wifi · Air conditioning · Kitchen | 6 guests · 2 bedrooms · 2 beds · 2 baths | 4.92 | 26.0 | 203 |
| 3 | Private Bohemian villa with pool (no cooking ) | South Kuta | Entire house | Pool · Wifi · Air conditioning · Kitchen | 3 guests · 1 bedroom · 4 beds · 1 bath | 4.88 | 48.0 | 23 |
| 4 | ♥️Private Pool Villa #sunset & paddy view@Mega... | Ubud | Entire villa | Pool · Wifi · Air conditioning · Kitchen | 2 guests · 1 bedroom · 1 bed · 1 bath | 4.88 | 169.0 | 49 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | A spectacular home away from home | Jimbaran | Cave | Pool · Wifi · Air conditioning · Kitchen | 2 guests · 3 bedrooms · 0 beds · 3 baths | 4.81 | 57.0 | 82 |
| 296 | -70% Promo- Design 3 BR next to beach Bali | Seminyak | Entire villa | Pool · Wifi · Air conditioning · Kitchen | 6 guests · 3 bedrooms · 3 beds · 3 baths | 4.75 | 57.0 | 92 |
| 297 | 1 bedroom ubud private villa, river & jungle view | Kecamatan Sukawati | Entire villa | Pool · Wifi · Air conditioning · Kitchen | 2 guests · 1 bedroom · 2 beds · 1 bath | 5.00 | 3.0 | 41 |
| 298 | Pondok Kebun - 1 bd Eco Bamboo House, Pool, Ga... | Abiansemal | Entire guesthouse | Pool · Wifi · Air conditioning · Kitchen | 2 guests · 1 bedroom · 1 bed · 1 bath | 4.73 | 40.0 | 136 |
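Since search results change from one run to the next, it can be useful to persist a snapshot of the DataFrame. The sketch below shows a CSV round trip on two illustrative rows. The in-memory buffer stands in for a file (a name such as "airbnb_bali.csv" would be a hypothetical choice), and `index=False` avoids an extra unnamed index column when the data is read back.

```python
import io
import pandas as pd

# Two stand-in rows with columns like those of the scraped data (values are illustrative).
sample = pd.DataFrame({
    "title": ["Villa A", "Room B"],
    "rating": [4.9, None],
    "price_per_night": [49, 84],
})

buffer = io.StringIO()             # stands in for a file on disk
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape)              # (2, 3)
```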


In [17]:
airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title             300 non-null    object 
 1   location          300 non-null    object 
 2   type_of_property  300 non-null    object 
 3   ammenitites       300 non-null    object 
 4   info              300 non-null    object 
 5   rating            270 non-null    float64
 6   reviews           270 non-null    float64
 7   price_per_night   300 non-null    int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 18.9+ KB


Sanity check: 

The info of the DataFrame shows that there is enough information to conduct an analysis. Missing values appear only in the rating and reviews columns (30 of the 300 listings each). We now have a choice of what to do with these 30 cases: we could delete them, insert average values, or ignore them. We chose to keep these listings and ignore the missing values, as we need enough data points to conduct an analysis that gives us valuable insights.
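The three options for handling missing values can be sketched on a small illustrative DataFrame (the values below are made up for the example and are not taken from the scraped data). Note that `mean()` in pandas skips missing values by default, which is what "ignoring" them amounts to here.

```python
import pandas as pd

sample = pd.DataFrame({
    "rating": [4.8, None, 5.0],
    "price_per_night": [50, 100, 150],
})

# Option 1: delete the rows with a missing rating
dropped = sample.dropna(subset=["rating"])

# Option 2: insert the average rating where it is missing
filled = sample.fillna({"rating": sample["rating"].mean()})

# Option 3: keep all rows; mean() skips NaN by default
ignored_mean = sample["rating"].mean()

print(len(dropped))                       # 2
print(sample["price_per_night"].mean())   # 100.0
```

With option 3 the price column still uses all listings, whereas option 1 would also discard the prices of unrated listings.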

In [32]:
airbnb.mean(axis=0, numeric_only=True)

rating              4.826926
reviews            60.688889
price_per_night    83.773333
dtype: float64

In order to set a benchmark for the asset management firm, we look at the averages of the numeric columns of our DataFrame. We can deduce that the overall rating of the given properties is very high, at about 4.83 out of 5. The average number of reviews is about 61 and the average price per night is about 84 (prices were scraped in dollars) for the filters (location, date) we applied.

In [19]:
airbnb.groupby(['type_of_property'])['type_of_property'].count()


type_of_property
Cave                    1
Entire bungalow         6
Entire cabin            5
Entire cottage          1
Entire guest suite      2
Entire guesthouse       3
Entire house            5
Entire loft             1
Entire villa          213
Hotel room              3
Hut                     5
Private room           33
Resort room             1
Room                    5
Shared room             2
Tent                    1
Tiny house              1
Treehouse              12
Name: type_of_property, dtype: int64

Here, we can see that most properties offered in Bali are entire villas (213 of 300 listings), followed by private rooms (33). This is important to notice, as real estate investors need to realize that the supply side of the Airbnb market is heavily skewed towards entire villas. This information can be used to our advantage. However, the other property types might be taken into consideration as well.

Since we and our investors want to achieve the highest possible return on investment, we are looking for a high price per night. Therefore, we check which location brings in the most money by calculating the average of all numeric columns per location and sorting by price_per_night in descending order.

In [20]:
airbnb_groupby = airbnb.groupby(by=["location"]).mean(numeric_only=True)
airbnb_groupby = airbnb_groupby.sort_values(by=['price_per_night'], ascending = False)
airbnb_groupby.head()

| location | rating | reviews | price_per_night |
| --- | --- | --- | --- |
| Bingin Beach | 4.872 | 71.4 | 219.4 |
| Perenenan | 4.97 | 39.0 | 201.0 |
| Kintamani | 4.83 | 6.0 | 188.0 |
| Kecamatan Baturiti | 5.0 | 8.0 | 176.0 |
| Selat | 4.9325 | 167.0 | 168.75 |


In [23]:
display(airbnb_groupby.loc[(airbnb_groupby['rating']>4.826926) &
                           (airbnb_groupby['price_per_night']>83.773333)])

| location | rating | reviews | price_per_night |
| --- | --- | --- | --- |
| Bingin Beach | 4.872 | 71.4 | 219.4 |
| Perenenan | 4.97 | 39.0 | 201.0 |
| Kintamani | 4.83 | 6.0 | 188.0 |
| Kecamatan Baturiti | 5.0 | 8.0 | 176.0 |
| Selat | 4.9325 | 167.0 | 168.75 |
| Tampaksiring | 4.9675 | 47.25 | 158.5 |
| Kecamatan Sukawati | 5.0 | 4.333333 | 157.0 |
| boutique hotel in Kecamatan Kuta Utara | 5.0 | 3.0 | 135.0 |
| Bingin | 4.86 | 37.0 | 130.0 |
| Balian Beach | 4.85 | 235.0 | 124.0 |
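The filter above compares each location with the overall averages, which are typed in as numbers. A minimal sketch of the same filter with the benchmarks computed from the data is shown below. It uses a small made-up DataFrame, so the names `sample`, `by_location` and `above` and all values are illustrative only.

```python
import pandas as pd

sample = pd.DataFrame({
    "location": ["A", "A", "B", "C"],
    "rating": [4.9, 5.0, 4.5, 4.8],
    "price_per_night": [200, 180, 40, 60],
})

# Benchmarks are taken from the listing-level data instead of being typed in.
mean_rating = sample["rating"].mean()
mean_price = sample["price_per_night"].mean()

by_location = sample.groupby("location")[["rating", "price_per_night"]].mean()
above = by_location[(by_location["rating"] > mean_rating)
                    & (by_location["price_per_night"] > mean_price)]

print(list(above.index))  # ['A']
```

Computing the thresholds this way keeps the filter consistent when the data is scraped again and the averages change.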


Result: Based on this analysis, we can see that ten entries have both an average price per night above the overall average (83.773333) and an average rating above the overall average (4.826926). The locations for the investors to consider are Bingin Beach, Perenenan, Kintamani, Kecamatan Baturiti, Selat, Tampaksiring, Kecamatan Sukawati, Bingin and Balian Beach.

The remaining entry, "boutique hotel in Kecamatan Kuta Utara", has to be disregarded. It is not a location but a by-product of how the subtitle is worded: presumably the subtitle contains " in " twice and is split at the first occurrence.
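One way to avoid such an entry is to split the subtitle at the last " in " instead of the first. The sketch below uses `str.rpartition` for this. The helper name `split_subtitle` is hypothetical, and the example subtitle "Room in boutique hotel in Kecamatan Kuta Utara" is an assumption reconstructed from the table above, not text confirmed from the page.

```python
def split_subtitle(subtitle):
    """Split '<type of property> in <location>' at the last ' in '."""
    property_type, separator, location = subtitle.rpartition(" in ")
    if not separator:              # no " in " found at all
        return subtitle, None
    return property_type, location

print(split_subtitle("Entire villa in Kecamatan Ubud"))
# ('Entire villa', 'Kecamatan Ubud')
print(split_subtitle("Room in boutique hotel in Kecamatan Kuta Utara"))
# ('Room in boutique hotel', 'Kecamatan Kuta Utara')
```

This would still misread a location name that itself contains " in ", so it is a heuristic rather than a complete fix.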

Now it is important to know which type of property the investors should invest in. To find out, we group the listings by type of property and show the result in descending order of the column "rating".

In [22]:
airbnb_groupby2 = airbnb.groupby(by=["type_of_property"]).mean(numeric_only=True)
airbnb_groupby2 = airbnb_groupby2.sort_values(by=['rating'], ascending = False)
airbnb_groupby2.head()

| type_of_property | rating | reviews | price_per_night |
| --- | --- | --- | --- |
| Entire cottage | 5.0 | 3.0 | 26.0 |
| Entire guest suite | 5.0 | 8.0 | 162.0 |
| Tent | 5.0 | 15.0 | 65.0 |
| Resort room | 5.0 | 3.0 | 34.0 |
| Entire cabin | 4.976 | 126.2 | 219.8 |


Based on the assumption that the investors want a successful property that gets high ratings, we consider only the property types whose average rating is above the overall average (4.826926). We also require that the rating is backed by enough reviews (more than 10 reviews on average).

In [24]:
display(airbnb_groupby2.loc[(airbnb_groupby2['rating']>4.826926) &
                           (airbnb_groupby2['reviews']>10)])

| type_of_property | rating | reviews | price_per_night |
| --- | --- | --- | --- |
| Tent | 5.0 | 15.0 | 65.0 |
| Entire cabin | 4.976 | 126.2 | 219.8 |
| Hotel room | 4.893333 | 96.333333 | 48.333333 |
| Tiny house | 4.85 | 60.0 | 45.0 |
| Entire bungalow | 4.838333 | 40.0 | 82.833333 |
| Entire villa | 4.836075 | 57.876344 | 87.779343 |


Result: The most favored types of properties are tent, entire cabin, hotel room, tiny house, entire bungalow and entire villa. However, the listing counts per property type shown earlier make clear that some of these types (tent and tiny house) are represented by only one listing, so their averages do not add value. Therefore, we retain the following result: Entire cabin, Hotel room and Entire villa.
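The check on the number of listings per property type can be built into the grouping itself. The sketch below uses named aggregation on a small made-up DataFrame. The cut-off `min_listings = 3` is a hypothetical value chosen for the illustration, not a threshold stated in this analysis.

```python
import pandas as pd

sample = pd.DataFrame({
    "type_of_property": ["Entire villa", "Entire villa", "Entire villa", "Tent"],
    "rating": [4.8, 4.9, 5.0, 5.0],
    "reviews": [20, 30, 40, 15],
})

# Average rating and reviews per type, plus the number of listings behind them.
summary = sample.groupby("type_of_property").agg(
    rating=("rating", "mean"),
    reviews=("reviews", "mean"),
    n_listings=("rating", "size"),
)

min_listings = 3   # hypothetical cut-off
reliable = summary[summary["n_listings"] >= min_listings]

print(list(reliable.index))  # ['Entire villa']
```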

Next we want to see which features lead to higher ratings in order to satisfy the tourists' expectations for their stay. 

## 4. Features of properties to consider

First, we need to extract more detailed information from the results of get_listing_info and get_listing_ammenities with the following code. Please note that, as part of feature selection, we disregard the feature beds as it is already covered by the feature bedrooms.

In [25]:
guests = []
bedrooms = []
baths = []

def get_listing_info_each(all_listings):
    # Split each info string ("3 guests · 1 bedroom · ...") into its parts
    info_each = []
    info_class = "_3c0zz1"
    info_tag = "div"
    for listing in all_listings:
        try:
            info_each.append(listing.find(info_tag, {"class": info_class}).text.split("·"))
        except AttributeError:
            info_each.append([])  # no info element found for this listing
    return info_each


for y in get_listing_info_each(all_listings):

    #for guests
    try:
        guests.append(int(y[0].split()[0]))
    except (IndexError, ValueError):
        guests.append(None)

    #for bedrooms
    try:
        number_bedrooms = y[1].split()[0]
        bedrooms.append(int(number_bedrooms) if number_bedrooms.isdigit() else None)
    except IndexError:
        bedrooms.append(None)

    #for baths
    try: 
        number_baths = y[3].split()[0]
        baths.append(float(number_baths))
    except (IndexError, ValueError):
        baths.append(None)
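The loop above reads the values by position (guests first, bedrooms second, baths fourth). As a hedged alternative, the sketch below reads them by label, so that a missing or reordered part does not shift the other values. `parse_info`, `PATTERN` and `KEYS` are hypothetical names, and the second example string ("Studio", "1.5 baths") is an assumed format.

```python
import re

# A number, an optional single word, then one of the known labels.
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+(?:\w+\s+)?(guest|bedroom|bed|bath)")
KEYS = {"guest": "guests", "bedroom": "bedrooms", "bed": "beds", "bath": "baths"}

def parse_info(text):
    """Map an info string to a dict keyed by label; missing parts stay None."""
    result = dict.fromkeys(KEYS.values())
    for part in (text or "").split("·"):
        match = PATTERN.match(part.strip())
        if match:
            number, label = match.groups()
            result[KEYS[label]] = float(number)
    return result

print(parse_info("3 guests · 1 bedroom · 1 bed · 1 bath"))
# {'guests': 3.0, 'bedrooms': 1.0, 'beds': 1.0, 'baths': 1.0}
print(parse_info("2 guests · Studio · 1.5 baths"))
# {'guests': 2.0, 'bedrooms': None, 'beds': None, 'baths': 1.5}
```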

In [26]:
wifi = []
kitchen = []
air_conditioning = []
pool = []

for x in ammenities:
    if x:
        if "Wifi" in x: 
            wifi.append(1)
        else:
            wifi.append(0)
        if "Kitchen" in x: 
            kitchen.append(1)
        else:
            kitchen.append(0)
        if "Air conditioning" in x: 
            air_conditioning.append(1)
        else:
            air_conditioning.append(0)
        if "Pool" in x: 
            pool.append(1)
        else:
            pool.append(0)  
    else:
        wifi.append(None)
        kitchen.append(None)
        air_conditioning.append(None)
        pool.append(None)
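The four if/else blocks above follow the same pattern, so they can be condensed into one helper. The sketch below is an illustration with hypothetical names (`AMENITIES`, `amenity_flags`). Unlike the substring test used above, it compares whole entries of the "·"-separated string, which is a slightly stricter check.

```python
AMENITIES = ["Wifi", "Kitchen", "Air conditioning", "Pool"]

def amenity_flags(text, amenities=AMENITIES):
    """Return {amenity: 1 or 0}, or None for every amenity if text is missing."""
    if not text:
        return {name: None for name in amenities}
    present = {item.strip() for item in text.split("·")}
    return {name: int(name in present) for name in amenities}

print(amenity_flags("Pool · Wifi · Air conditioning · Kitchen"))
# {'Wifi': 1, 'Kitchen': 1, 'Air conditioning': 1, 'Pool': 1}
print(amenity_flags("Wifi · Free parking"))
# {'Wifi': 1, 'Kitchen': 0, 'Air conditioning': 0, 'Pool': 0}
```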

Below you can find a DataFrame with the location and type of property per listing, with more detailed information on the amenities as well as other information that is important for the asset management company to consider: the number of bedrooms, guests and baths.

In [27]:
import pandas as pd

data = {"location": location,
        "type_of_property": type_of_property,
        "rating": rating,
        "reviews": reviews,
        "price_per_night": price_per_night,
        "guests": guests, 
        "bedrooms": bedrooms,
        "baths": baths,
        "wifi": wifi,
        "kitchen": kitchen,
        "air_conditioning": air_conditioning,
        "pool": pool,
        }
airbnb2 = pd.DataFrame(data = data)
airbnb2

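| | location | type_of_property | rating | reviews | price_per_night | guests | bedrooms | baths | wifi | kitchen | air_conditioning | pool |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Ubud | Entire villa | 4.94 | 216.0 | 49 | 2 | 1.0 | 1.0 | 1 | 1 | 1 | 1 |
| 1 | Kecamatan Ubud | Entire villa | 4.88 | 16.0 | 84 | 3 | 1.0 | 1.0 | 1 | 1 | 1 | 1 |
| 2 | Kecamatan Ubud | Entire villa | 4.92 | 26.0 | 203 | 6 | 2.0 | 2.0 | 1 | 1 | 1 | 1 |
| 3 | South Kuta | Entire house | 4.88 | 48.0 | 23 | 3 | 1.0 | 1.0 | 1 | 1 | 1 | 1 |
| 4 | Ubud | Entire villa | 4.88 | 169.0 | 49 | 2 | 1.0 | 1.0 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | Jimbaran | Cave | 4.81 | 57.0 | 82 | 2 | 3.0 | 3.0 | 1 | 1 | 1 | 1 |
| 296 | Seminyak | Entire villa | 4.75 | 57.0 | 92 | 6 | 3.0 | 3.0 | 1 | 1 | 1 | 1 |
| 297 | Kecamatan Sukawati | Entire villa | 5.00 | 3.0 | 41 | 2 | 1.0 | 1.0 | 1 | 1 | 1 | 1 |
| 298 | Abiansemal | Entire guesthouse | 4.73 | 40.0 | 136 | 2 | 1.0 | 1.0 | 1 | 1 | 1 | 1 |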


Next, we group all entries by rating, sort them in descending order and look at the first 5 rows, which show the average features of the listings with the highest ratings, starting with the 5-star listings.

In [28]:
airbnb2_groupby = airbnb2.groupby(by=["rating"]).mean(numeric_only=True)
airbnb2_groupby = airbnb2_groupby.sort_values(by=['rating'], ascending = False)
airbnb2_groupby.head()

| rating | reviews | price_per_night | guests | bedrooms | baths | wifi | kitchen | air_conditioning | pool |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5.0 | 10.5 | 91.416667 | 3.483333 | 1.6 | 1.644068 | 0.983333 | 0.666667 | 0.966667 | 0.95 |
| 4.99 | 86.666667 | 55.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4.98 | 46.0 | 210.0 | 4.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4.97 | 111.166667 | 153.5 | 3.0 | 1.0 | 1.166667 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4.95 | 159.5 | 143.0 | 4.5 | 1.5 | 2.0 | 1.0 | 1.0 | 0.25 | 0.25 |


Result: The table suggests that the asset management company should focus on properties for 2 to 4 guests with 1 or 2 bedrooms and 1 to 2 bathrooms, with Wifi, air conditioning and, if possible, a pool.