<h1><center>Project 2: Analyzing Airbnb Data in the DMV Area</center></h1>

**<center>DATS 6103-O10</center>**
**<center>Weirui Liu</center>**

Founded in 2008, Airbnb became a giant of the short-term homestay industry in just nine years. It gives hosts a platform for listing short-term rentals or rooms, and lets travelers search for and book unique properties worldwide based on their needs.

Customers want to know whether they are getting a reasonable price or overpaying, and whether amenities and prices differ between listings with the same property type and room structure in the same area. Hosts want to know which property types and room structures are most prevalent in their area, and which amenities offer a competitive advantage. To address these questions, we use data from [Airbnb](https://www.airbnb.com/) to help customers and hosts better understand lodging in the DMV area.

This will be done in two steps: 

I. Use BeautifulSoup to gather lodging descriptions and data.

II. Analyze and compare the lodging data in different DMV areas.

# I. Web Scraping

In [1]:
# Import packages

from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen    # help in opening URLs 
import re                                      # work with Regular Expressions
import pandas as pd

Define a function that scrapes the listings shown on one Airbnb search-results page and saves that page's information to its own CSV file.
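
Before the full function, here is a minimal standalone sketch of the kind of regex parsing it relies on. It takes a room-structure string like the samples quoted in the function's comments and returns the counts as numbers. The helper names `_first` and `parse_room_details`, and the simplified patterns, are illustrative assumptions and are not the patterns used in the function below.

```python
import re


def _first(pattern, text, cast):
    """Return the first captured group converted with `cast`, or None if no match."""
    match = re.search(pattern, text)
    return cast(match.group(1)) if match else None


def parse_room_details(text):
    """Parse a string such as '4 guests · 1 bedroom · 1 bed · 1 bath'.

    Hypothetical helper: the patterns are simplified for illustration.
    """
    return {
        "guests": _first(r"(\d+) guests?", text, int),
        "bedrooms": _first(r"(\d+) bedrooms?", text, int),
        "studio": "Studio" in text,
        # \b stops "1 bed" from matching the start of "1 bedroom"
        "beds": _first(r"(\d+) beds?\b", text, int),
        # allows "1 bath", "1.5 baths", and one word in between ("1 shared bath")
        "baths": _first(r"(\d+(?:\.\d+)?) (?:\w+ )?baths?", text, float),
    }


print(parse_room_details("4 guests · 1 bedroom · 1 bed · 1 bath"))
print(parse_room_details("2 guests · Studio · 1 bed · 1.5 baths"))
```

Returning `None` for a missing field, rather than appending nothing, keeps one value per listing, which matters when the per-field lists are later combined into a single table.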

In [2]:
def Airbnb_web_scrape(url, location, number_of_places):
    
    # load a page
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    
    # create a soup variable
    page_soup = soup(webpage, "html.parser")
    
    # show all lodging information on an Airbnb page (which shows 20 results per page)
    page = page_soup.find_all("div", {"class":"_8ssblpx"})
        
    # Create an empty list for each feature that we want to scrape 
    
    Type = []
    Location = []
    Title = []

    # Room structure and the Number of Guests that can be accommodated
    Guest = []
    Bedroom = []
    Studio = []
    Bed = []
    Bath = []

    # Amenities and Facilities Description
    Self_Check_in = []
    Air_Conditioning = []
    Wifi = []
    Washer = []
    Kitchen = []
    Breakfast = []
    Pets_allowed = []
    Free_parking = []
    Elevator = []
    Gym = []
    Indoor_fireplace = []
    Pool = []

    Rate = []
    Reviews = []
    Price = []
    Link = []

    for item in page:
    
        # Extract Property Type of Lodging                                 # Example of the corresponding information in the web page:
        try:                                                               
            TYPE_LOCATION = item.find('div', {"class": "_167qordg"}).text  # "Entire apartment in Washington"
            TYPE_lst = re.findall(r"^(.+?) in ", TYPE_LOCATION)[0]         # "Entire apartment" (text before the first " in "; [0] already yields a string)
            TYPE = ''.join(TYPE_lst)                                       # "Entire apartment"   (joining a string leaves it unchanged)
            Type.append(TYPE) 
        except:
            Type.append("NaN")     # if no information is found, "NaN" is recorded

        # Extract Location
        try:
            TYPE_LOCATION = item.find('div', {"class": "_167qordg"}).text  # "Entire apartment in Washington"
            LOCATION_lst = re.findall(r" in (.+?)$", TYPE_LOCATION)[-1]    # "Washington" (text after the first " in ", up to the end of the string)
            LOCATION = ''.join(LOCATION_lst)                               # "Washington"
            Location.append(LOCATION)
        except:
            Location.append("NaN")

        # Extract Title
        try:
            Title.append(item.find("div", {"class":"_1048zci"}).find("a").get("aria-label"))    # "Lovely apartment close to everything"
        except:
            Title.append("NaN")

        # Extract Number of Guests
        try:
            D1, D2 = item.findAll("div",{"class":"_kqh46o"})[:2]           # "4 guests · 1 bedroom · 1 bed · 1 bath", "Wifi · Kitchen · Air Conditioning"
            Num_of_Guest_lst = re.findall(r"(.+?) guest", D1.text)         # ["4"] (extract text before "guest" or "guests")
            Num_of_Guest = ''.join(Num_of_Guest_lst)                       # "4"
            if Num_of_Guest != '':
                Guest.append(int(Num_of_Guest))                            # 4 (convert only this value from string to integer)
            else:
                Guest.append('NaN')
        except:
            Guest.append("NaN")  

        # Extract Number of Bedroom 
        # Extract Studio or not
        try:
            D1, D2 = item.findAll("div",{"class":"_kqh46o"})[:2]             # "4 guests · 1 bedroom · 1 bed · 1 bath", "Wifi · Kitchen · Air Conditioning"
            Num_of_Bedroom_lst = re.findall(r" · (.+?) bedroom", D1.text)    # ["1"] (extract text before "bedroom" or "bedrooms")
            Num_of_Bedroom = ''.join(Num_of_Bedroom_lst)                     # "1"
            if Num_of_Bedroom != '':
                Bedroom.append(Num_of_Bedroom)
                Studio.append("No")                                          # if a bedroom count is found, "No" is recorded in the Studio list
            else:
                Bedroom.append('NaN')
                Studio.append("Yes")                                         # if no bedroom count is found, "Yes" is recorded in the Studio list
        except:
            Bedroom.append("NaN")
            Studio.append("NaN")

        # Extract Number of Bed
        try:
            D1, D2 = item.findAll("div",{"class":"_kqh46o"})[:2]             # "4 guests · 1 bedroom · 1 bed · 1 bath", "Wifi · Kitchen · Air Conditioning"
            Num_of_Bed = re.findall(r"room · (.+?) bed · |room · (.+?) beds · |rooms · (.+?) bed · |rooms · (.+?) beds · |Studio · (.+?) bed · |Studio · (.+?) beds · ", D1.text)   # [("1", "", "", "", "", "")] (one tuple per match; only the matching alternative's group is non-empty)
            if len(Num_of_Bed) == 0:                                         
                Bed.append("NaN")
            else:
                for tupl in Num_of_Bed:
                    for i in tupl:
                        if i != '':
                            Bed.append(i)
        except:
            Bed.append("NaN")

        # Extract Number of Bath
        try:
            D1, D2 = item.findAll("div",{"class":"_kqh46o"})[:2]                                              # "4 guests · 1 bedroom · 1 bed · 1 bath", "Wifi · Kitchen · Air Conditioning"      
            Num_of_Bath_1 = re.findall(r" beds · (\d+\.\d+|\d+)| bed · (\d+\.\d+|\d+)", D1.text)              # [("", "1")] (extract number after "bed" or "beds")
            if len(Num_of_Bath_1) == 0:                                                                       # Some listings do not include the number of beds
                Num_of_Bath_2 = re.findall(r" bedroom · (\d+\.\d+|\d+)| bedrooms · (\d+\.\d+|\d+)", D1.text)  # (extract number after "bedroom" or "bedrooms")
                if len(Num_of_Bath_2) == 0:
                    Bath.append("NaN")     
                else:
                    for tupl_1 in Num_of_Bath_2:
                        for i1 in tupl_1:
                            if i1 != '':
                                Bath.append(i1)
            else:
                for tupl_2 in Num_of_Bath_1:
                    for i2 in tupl_2:
                        if i2 != '':
                            Bath.append(i2)                            
        except:
            Bath.append("NaN")

        # Extract amenities: "Yes" if the keyword appears in the amenity line, "No" if it
        # does not, and "NaN" if the amenity line itself cannot be found
        try:
            D1, D2 = item.findAll("div",{"class":"_kqh46o"})[:2]             # "4 guests · 1 bedroom · 1 bed · 1 bath", "Wifi · Kitchen · Air conditioning"
            amenity_text = D2.text
        except:
            amenity_text = None

        # Each pair is (regex to look for, list that collects the result for this listing)
        for pattern, amenity_list in [
            (r"Self check-in", Self_Check_in),
            (r"Air conditioning", Air_Conditioning),
            (r"Wifi", Wifi),
            (r"Washer|Dishwasher", Washer),
            (r"Kitchen", Kitchen),
            (r"Breakfast", Breakfast),
            (r"Pets allowed", Pets_allowed),
            (r"Free parking", Free_parking),
            (r"Elevator", Elevator),
            (r"Gym", Gym),
            (r"Indoor fireplace", Indoor_fireplace),
            (r"Pool", Pool),
        ]:
            if amenity_text is None:
                amenity_list.append("NaN")
            elif re.search(pattern, amenity_text):
                amenity_list.append("Yes")
            else:
                amenity_list.append("No")

        # Extract Rate (out of 5)
        try:
            Rate.append(item.find("span", {"class":"_10fy1f8"}).text)          # "4.96"
        except:
            Rate.append("NaN")

        # Extract Number of People Review
        try:
            Reviews.append(item.find("span", {"class":"_a7a5sx"}).text[2:-1])  # "166"
        except:
            Reviews.append("NaN")

        # Extract Price (per night)
        try:
            PRICE = item.find("div", {"class":"_1fwiw8gv"}).text               # "Price: $68 / night"
            PRICE_int_lst = re.findall(r"\d+\.\d+|\d+", PRICE)                 # ["68"] (extract number from "Price: $68 / night" )
            PRICE_int = ''.join(PRICE_int_lst)                                 # "68"
            Price.append(PRICE_int)
        except:
            Price.append("NaN")

        # Extract Website Link
        try:
            Link.append("https://www.airbnb.com" + item.find("div", {"class":"_1048zci"}).find("a").get("href"))
        except:
            Link.append("NaN")

    # Save every list into csv file
    data = {}
    data = {
        'Type' : Type,
        'Location' : Location,
        'Title' : Title,
        'Number of Guest' : Guest,
        'Number of Bedroom' : Bedroom,
        'Studio' : Studio,
        'Number of Bed' : Bed,
        'Number of Bath' : Bath,
        'Self Check-in' : Self_Check_in,
        'Air Conditioning' : Air_Conditioning,
        'Wifi' : Wifi,
        'Washer' :Washer,
        'Kitchen' : Kitchen,
        'Breakfast' : Breakfast,
        'Pets allowed' : Pets_allowed,
        'Free parking' : Free_parking,
        'Elevator' : Elevator,
        'Gym' : Gym,
        'Indoor fireplace' : Indoor_fireplace,
        'Pool' : Pool,
        'Rate (out of 5)' : Rate,
        'Reviews' : Reviews,
        'Price (per Night)' : Price,
        'Link' : Link
            }
    Page = pd.DataFrame(data)
    header = [
        'Type',
        'Location',
        'Title', 
        'Number of Guest', 
        'Number of Bedroom', 
        'Studio', 
        'Number of Bed', 
        'Number of Bath',
        'Self Check-in',
        'Air Conditioning',
        'Wifi',
        'Washer',
        'Kitchen',
        'Breakfast',
        'Pets allowed',
        'Free parking',
        'Elevator',
        'Gym',
        'Indoor fireplace',
        'Pool',
        'Rate (out of 5)', 
        'Reviews', 
        'Price (per Night)', 
        'Link'
            ]
    Page.to_csv("Airbnb {} {} places.csv".format(location, number_of_places), columns = header)

We will scrape lodging information for the DMV area (District of Columbia, Maryland, Virginia). For each area searched, Airbnb shows information for 300 lodgings (15 pages of 20 results each).

Because the URL of each area's first results page has a different form from the URLs of the remaining 14 pages, the first page of each area is scraped separately and the loop covers only pages 2 to 15.
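
As a small illustration of the pagination arithmetic, the sketch below lists the `items_offset` values for pages 2 to 15 and appends one to a base URL. The helper names `page_offsets` and `paginated_url` and the `example.com` base URL are placeholders introduced here, not part of the notebook. The real Airbnb URLs in the cells that follow also carry a `section_offset` parameter, which this sketch leaves out.

```python
def page_offsets(total_listings=300, page_size=20):
    """Return the items_offset values for every page after the first."""
    return list(range(page_size, total_listings, page_size))


def paginated_url(base_url, offset):
    """Append the items_offset query parameter to a base search URL."""
    return "{}&items_offset={}".format(base_url, offset)


offsets = page_offsets()
print(len(offsets))             # 14 pages after the first
print(offsets[0], offsets[-1])  # 20 280

# Placeholder base URL, used only to show the string that is built
print(paginated_url("https://example.com/s/homes?search_type=pagination", offsets[0]))
```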

In [3]:
# Grab Washington DC Page 1 source code

url_DC_1 = 'https://www.airbnb.com/s/Washington--DC--United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=autocomplete_click&query=Washington%2C%20DC%2C%20United%20States&place_id=ChIJW-T2Wt7Gt4kRKl2I1CJFUsI'
Airbnb_web_scrape(url_DC_1, 'Washington DC', '0-20')

# we will get 1 csv file with 20 lodgings' information in DC area

In [4]:
# Grab Washington DC Page 2 - Page 15 source code

for i in range (20, 300, 20):
    url_DC_2_15 = 'https://www.airbnb.com/s/Washington--D.C.--United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=pagination&place_id=ChIJW-T2Wt7Gt4kRmKFUAsCO4tY&federated_search_session_id=7895b54c-fbfb-4118-91d1-f2b812b65cde&items_offset=' + str(i) + '&section_offset=3'
    Airbnb_web_scrape(url_DC_2_15, 'Washington DC', '{}-{}'.format(i, i+20))
    
# we will get 14 csv files, each with 20 lodgings' information in the DC area

In [5]:
# Grab Virginia Page 1 source code

url_VA_1 = 'https://www.airbnb.com/s/Virginia--United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=autocomplete_click&query=Virginia%2C%20United%20States&place_id=ChIJzbK8vXDWTIgRlaZGt0lBTsA'
Airbnb_web_scrape(url_VA_1, 'Virginia', '0-20')

# we will get 1 csv file with 20 lodgings' information in VA area

In [6]:
# Grab Virginia Page 2 - Page 15 source code

for i in range (20, 300, 20):
    url_VA_2_15 = 'https://www.airbnb.com/s/Virginia--United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=pagination&place_id=ChIJzbK8vXDWTIgRlaZGt0lBTsA&federated_search_session_id=3b8aef88-12cd-4c0a-b8c1-8edaab09eccc&items_offset=' + str(i) + '&section_offset=6'
    Airbnb_web_scrape(url_VA_2_15, 'Virginia', '{}-{}'.format(i, i+20))
    
# we will get 14 csv files, each with 20 lodgings' information in the VA area

In [7]:
# Grab Maryland Page 1 source code

url_MD_1 = 'https://www.airbnb.com/s/Maryland--United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=autocomplete_click&query=Maryland%2C%20United%20States&place_id=ChIJ35Dx6etNtokRsfZVdmU3r_I'
Airbnb_web_scrape(url_MD_1, 'Maryland', '0-20')

# we will get 1 csv file with 20 lodgings' information in MD area

In [8]:
# Grab Maryland Page 2 - Page 15 source code

for i in range (20, 300, 20):
    url_MD_2_15 = 'https://www.airbnb.com/s/Maryland--United-States/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&source=structured_search_input_header&search_type=pagination&place_id=ChIJ35Dx6etNtokRsfZVdmU3r_I&federated_search_session_id=e13563ca-870d-4723-add6-9cd0a6d2336d&items_offset=' + str(i) + '&section_offset=6'
    Airbnb_web_scrape(url_MD_2_15, 'Maryland', '{}-{}'.format(i, i+20))
    
# we will get 14 csv files, each with 20 lodgings' information in the MD area

It will take about 2 minutes to complete.
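
Once the cells above have finished, a quick sanity check can confirm that every expected file was written. The sketch below rebuilds the file names from the `"Airbnb {} {} places.csv"` pattern used in the scraping function and reports any that are missing. The helpers `expected_csv_names` and `missing_files` are hypothetical additions, and the check assumes the files were saved in the current working directory.

```python
import os


def expected_csv_names(locations=("Washington DC", "Virginia", "Maryland"),
                       total_listings=300, page_size=20):
    """Rebuild the file names written by the scraping cells (15 per area)."""
    names = []
    for location in locations:
        for start in range(0, total_listings, page_size):
            names.append("Airbnb {} {}-{} places.csv".format(
                location, start, start + page_size))
    return names


def missing_files(names, directory="."):
    """Return the expected names that do not exist in `directory`."""
    return [name for name in names
            if not os.path.exists(os.path.join(directory, name))]


names = expected_csv_names()
print(len(names))   # 45
print(names[0])     # Airbnb Washington DC 0-20 places.csv
print(missing_files(names))  # empty list if every page was saved
```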

**Now that we have our DMV area data (45 CSV files), let's head over to <u>Project 2 Cleaning and Analyzing.ipynb</u> to analyze our results.**
