![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

#  Introduction to Text Mining and Natural Language Processing
## Homework 1: Research Project about hotel prices on Booking.com

GROUP 11: Luis Francisco Alvarez Poli, Vanessa Kromm, Clarice Mottet

### Motivation

1) In this project, we analyze accommodation prices in Barcelona during the annual Primavera Sound festival, which takes place from May 30th to June 1st. This three-day event features a lineup of renowned international artists, including 'Vampire Weekend', 'Rels B', 'The National', 'Lana del Rey', and more. Notably, the festival is a unique attraction in Europe, as Barcelona is the only place where it occurs outside of Latin America. This naturally draws a substantial influx of both national and international tourists. We focus on the dynamics of accommodation pricing during the festival period: price trends in Barcelona, influential factors, and potential implications for visitors.

2) We use Booking.com as our primary data source and scrape the website to extract information from the accommodation offers. Key features for our analysis include prices, ratings, distance from the city centre, and the textual description of each listing. We plan to apply a Natural Language Processing (NLP) method to extract relevant information from the descriptions and integrate these more refined details into our model.

The scraping covers two consecutive weeks: the festival week (27/05/2024-02/06/2024), when we expect prices to increase because more people are coming in, and the week before it (20/05/2024-26/05/2024), which serves as a control. The underlying assumption is that tourism has a demand-side effect on accommodations in the festival week, while the control week reflects the normal situation in Barcelona. We also collect data for another city, Rome, for the same two weeks and construct a counterfactual trend from it. We chose Rome as the control city because the two cities are similar in total population, culinary offerings, and cultural diversity, and are close to each other. We believe that Rome can serve as a helpful benchmark for Barcelona when analyzing the effect of the festival on hotel rates.

Since the objective is to analyze the effect of the festival on accommodation prices, we use this control as a counterfactual in a differences-in-differences regression with two treatment variables, time and city. The time variable equals one for observations in the week of the event (27/05-02/06) and zero otherwise. The city variable equals one for all accommodations in Barcelona and zero otherwise. Multiplying both variables identifies the lodgings located in Barcelona during the festival week; we aim to test whether their prices differ significantly from the rest.
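The differences-in-differences setup described above can be sketched on a toy table. This is only an illustration: the prices are made-up numbers rather than scraped values, and the column names `time`, `city`, and `did` are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical prices for illustration only (not scraped data)
toy = pd.DataFrame({
    "city_name": ["Barcelona", "Barcelona", "Rome", "Rome"] * 2,
    "week": ["control"] * 4 + ["festival"] * 4,
    "price": [1000, 1200, 900, 1100, 1300, 1500, 950, 1150],
})

# Treatment dummies: festival week and treated city
toy["time"] = (toy["week"] == "festival").astype(int)
toy["city"] = (toy["city_name"] == "Barcelona").astype(int)
# Interaction term: one only for Barcelona listings in the festival week
toy["did"] = toy["time"] * toy["city"]

# Mean price per (city, time) cell
means = toy.groupby(["city", "time"])["price"].mean()

# Change in Barcelona minus change in Rome
did_estimate = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])
print(did_estimate)  # 250.0
```

In an OLS regression of `price` on `time`, `city`, and `did` without further covariates, the coefficient on `did` equals this difference of group means.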







In [1]:
import json
import os
import time

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException, StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service


In [2]:
def ffx_preferences(dfolder, download=False):
    '''
    Sets the preferences of the firefox browser: download path.
    '''
    profile = webdriver.FirefoxProfile()
    # set download folder:
    profile.set_preference("browser.download.dir", dfolder)
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                           "application/msword,application/rtf, application/csv,text/csv,image/png ,image/jpeg, application/pdf, text/html,text/plain,application/octet-stream")


    # this allows to download pdfs automatically
    if download:
        profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")
        profile.set_preference("pdfjs.disabled", True)

    options = Options()
    options.profile = profile
    return options


def start_up(link, dfolder, geko_path, firefox_binary_path, download=True):
    os.makedirs(dfolder, exist_ok=True)

    options = ffx_preferences(dfolder, download)
    options.binary_location = firefox_binary_path  # Set Firefox binary location
    service = Service(geko_path)
    browser = webdriver.Firefox(service=service, options=options)
    
    # Enter the website address here
    browser.get(link)
    
    time.sleep(5)  
    return browser


def check_and_click(browser, xpath, type, max_tries=15):
    '''
    Function that tries to click on the object. If the object is missing,
    stale, or obscured, waits one second and tries again, giving up after
    max_tries attempts. Returns True if the click succeeded, False otherwise.
    '''
    for _ in range(max_tries):
        clicked = check_obscures(browser, xpath, type)
        # wait after every attempt; after a successful click this also
        # gives the page a moment to react
        time.sleep(1)
        if clicked:
            return True
    return False

def check_obscures(browser, xpath, type):
    '''
    Function that checks whether the object is being "obscured" by any element so
    that it is not clickable. Important: if True, the object is going to be clicked!
    '''
    try:
        if type == "xpath":
            browser.find_element('xpath',xpath).click()
        elif type == "id":
            browser.find_element('id',xpath).click()
        elif type == "css":
            browser.find_element('css selector',xpath).click()
        elif type == "class":
            browser.find_element('class name',xpath).click()
        elif type == "link":
            browser.find_element('link text',xpath).click()
    except (ElementClickInterceptedException, NoSuchElementException, StaleElementReferenceException) as e:
        print(e)
        return False
    return True
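The retry logic of `check_and_click` can also be expressed as a generic helper. The following is a minimal sketch, assuming a hypothetical function `retry_until_success` that is not part of the notebook or of Selenium; `flaky_click` only simulates an element that becomes clickable on the third attempt.

```python
import time

def retry_until_success(action, exceptions, max_tries=15, delay=1.0):
    """Call action() until it stops raising one of `exceptions`.

    Returns True on success and False if every attempt failed.
    """
    for _ in range(max_tries):
        try:
            action()
            return True
        except exceptions:
            # Wait before the next attempt
            time.sleep(delay)
    return False

# Simulated click that fails twice before succeeding (stand-in for a Selenium click)
attempts = {"count": 0}

def flaky_click():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("element not clickable yet")

succeeded = retry_until_success(flaky_click, (RuntimeError,), max_tries=5, delay=0.01)
print(succeeded, attempts["count"])  # True 3
```

With Selenium, `action` could be something like `lambda: browser.find_element('xpath', xpath).click()`, and `exceptions` the tuple of exception classes imported at the top of the notebook.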

In [4]:
# let's open Booking.com:

dfolder='C:/Users/vanes/Downloads'
geko_path='C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/TA-Sessions/geckodriver.exe'
link='https://www.booking.com/index.html?lang=en'
firefox_binary_path = r'C:\Program Files\Mozilla Firefox\firefox.exe'  


browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path, firefox_binary_path=firefox_binary_path)

### Close the cookie banner and the Google pop-up

In [5]:
browser.find_element(by='xpath',value='//*[@id="onetrust-accept-btn-handler"]').click()

In [5]:
# Works only sometimes: raises NoSuchElementException when the element cannot be located
browser.find_element(By.XPATH, '//*[@id="close"]').click()


NoSuchElementException: Message: Unable to locate element: //*[@id="close"]; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:189:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:507:5
dom.find/</<@chrome://remote/content/shared/DOM.sys.mjs:132:16


## Input the place and dates

### First we will look at Barcelona for the time of the event

In [7]:
# input the place
def input_place(place):
    browser.find_element(by='xpath',value='//*[@id=":re:"]').click()
    search1 = browser.find_element(by='xpath',value='//*[@id=":re:"]')
    search1.send_keys(place)


In [8]:
# input the dates
def input_dates(from_day, to_day, change_month = 0):
    css='button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'

    browser.find_element('css selector',css).click()

    # click to change the month to May/June
    if change_month == 1:
        browser.find_element(By.XPATH, '/html/body/div[3]/div[2]/div/form/div[1]/div[2]/div/div[2]/div/nav/div[2]/div/div[1]/button/span/span').click()
        browser.find_element(By.XPATH, '/html/body/div[3]/div[2]/div/form/div[1]/div[2]/div/div[2]/div/nav/div[2]/div/div[1]/button[2]').click()
        browser.find_element(By.XPATH, '/html/body/div[3]/div[2]/div/form/div[1]/div[2]/div/div[2]/div/nav/div[2]/div/div[1]/button[2]').click()
        browser.find_element(By.XPATH, '/html/body/div[3]/div[2]/div/form/div[1]/div[2]/div/div[2]/div/nav/div[2]/div/div[1]/button[2]').click()

    # week from Monday to Sunday covering the whole length of Primavera Sound
    path = '//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'

    dates = browser.find_elements('xpath', path)

    for date in dates:
        date_value = date.get_attribute("data-date")
        
        if date_value == from_day:
            date.click()
        elif date_value == to_day:
            date.click()
            break
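`input_dates` compares the `data-date` attribute of each calendar cell with ISO-formatted date strings. As a small sketch, the Monday-to-Sunday windows passed to it can be derived from the festival start date; the helper name `monday_to_sunday` is hypothetical and not part of the notebook.

```python
from datetime import date, timedelta

def monday_to_sunday(day):
    """Return ISO date strings for the Monday and Sunday of the week containing `day`."""
    monday = day - timedelta(days=day.weekday())
    sunday = monday + timedelta(days=6)
    return monday.isoformat(), sunday.isoformat()

festival_start = date(2024, 5, 30)

# Week that contains the festival, and the week before it
treatment_week = monday_to_sunday(festival_start)
control_week = monday_to_sunday(festival_start - timedelta(days=7))

print(treatment_week)  # ('2024-05-27', '2024-06-02')
print(control_week)    # ('2024-05-20', '2024-05-26')
```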

## Iterate through all pages and hotels to extract the information

In [9]:
# find total number of pages
def get_number_pages(browser):
    '''
    Get the number of pages. 
    '''
    a = browser.find_elements('xpath',
        '/html/body/div[4]/div/div[2]/div/div[2]/div[3]/div[2]/div[2]/div[4]/div[2]/nav/nav/div/div[2]/ol/li[7]/button')
    return int(a[-1].text)


In [10]:
css_pages = 'div.b16a89683f:nth-child(3) > button:nth-child(1) > span:nth-child(1) > span:nth-child(1)'
def get_information(pages):
    # Get the original window handle
    original_window_handle = browser.current_window_handle
    hotel_names = []
    ratings = []
    room_descriptions = []
    prices = []
    location_descriptions = []
    long_descriptions = []

    # beach_distances = []
    

    for page in range(int(pages)+1):    
    # for page in range(int(10)+1):      
        #Print page that it is in 
        print(f'Page: {page + 1}')
        sections = browser.find_elements('xpath', '//div[@class="c066246e13"]')
        for hotel in sections:
            hotel_name = hotel.find_element('xpath', './/div[@class="f6431b446c a15b38c233"]').text
            hotel_names.append(hotel_name)
            # extract ratings
            try:
                rating = hotel.find_element('xpath', './/div[@class="a3b8729ab1 d86cee9b25"]').text
            except:
                rating = np.nan
            ratings.append(rating)    
            # extract prices
            try:
                price = hotel.find_element('xpath','.//span[@class="f6431b446c fbfd7c1165 e84eb96b1f"]').text
            except:
                price = np.nan
            prices.append(price) 
            # extract room_descriptions
            try:
                room_description = hotel.find_element('xpath','.//div[@class="c19beea015"]').text
            except:
                room_description = np.nan
            room_descriptions.append(room_description)
            # extract location description
            try:
                location_description = hotel.find_element('xpath', './/div[@class="abf093bdfe ecc6a9ed89"]').text
            except:
                location_description = np.nan  
            location_descriptions.append(location_description)            
        
            hotel_link = hotel.find_element('xpath', './/div[@class="f6431b446c a15b38c233"]')
            hotel_link.click()

            # Switch to the new window
            new_window_handle = None
            for handle in browser.window_handles:
                if handle != original_window_handle:
                    new_window_handle = handle
                    break

            if new_window_handle:
                browser.switch_to.window(new_window_handle)
                
                # Wait for the hotel details page to load (you might need to adjust the sleep time)
                time.sleep(5)

                # Extract the description from the hotel details page
                try:
                    long_description = browser.find_element(by='xpath', value='//*[@id="summary"]').text
                except:
                    long_description = np.nan
                long_descriptions.append(long_description)    

                # Close the new window
                browser.close()

                # Switch back to the original window
                browser.switch_to.window(original_window_handle)

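                # Wait for the overview page to load (you might need to adjust the sleep time)
                time.sleep(5)
            else:
                # No detail window was opened: append NaN so that all lists keep the same length
                long_descriptions.append(np.nan)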

        next_page = browser.find_element('css selector', css_pages).click()
        time.sleep(2)        

    df = pd.DataFrame({'hotel_name': hotel_names, 'rating': ratings, 'room_description': room_descriptions, 'price': prices, 'location_descirption': location_descriptions, 'long_description': long_descriptions})
    return df
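`pd.DataFrame` requires all column lists to have the same length, so every hotel must contribute exactly one value to each list. One way to guarantee this is to collect one record per hotel instead of parallel lists. A minimal sketch with made-up values; the names `scraped` and `records` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scraped values; None marks a field that could not be found
scraped = [
    ("Hotel A", "8.2", "€ 1,323", "Some description"),
    ("Hotel B", None, "€ 970", None),
]

records = []
for name, rating, price, long_description in scraped:
    # One dictionary per hotel keeps all fields of a listing together
    records.append({
        "hotel_name": name,
        "rating": rating if rating is not None else np.nan,
        "price": price,
        "long_description": long_description if long_description is not None else np.nan,
    })

df = pd.DataFrame(records)
print(df.shape)  # (2, 4)
```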

In [11]:
input_place('Barcelona')
input_dates('2024-05-27', '2024-06-02', 1)
# click on "search"
my_xpath='/html/body/div[3]/div[2]/div/form/div[1]/div[4]/button/span'
check_and_click(browser,my_xpath , type='xpath')    

In [12]:
# close "Genius Pop up"
browser.find_element(By.XPATH, '/html/body/div[49]/div/div/div/div[1]/div[1]/div/button').click()
# browser.find_element(By.XPATH, '/html/body/div[48]/div/div/div/div[1]/div[1]/div/button/span/span').click()

NoSuchElementException: Message: Unable to locate element: /html/body/div[49]/div/div/div/div[1]/div[1]/div/button; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:189:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:507:5
dom.find/</<@chrome://remote/content/shared/DOM.sys.mjs:132:16


In [19]:
pages = get_number_pages(browser)
barcelona_treatment = get_information(pages)

Page: 1
Page: 2
Page: 3
Page: 4
Page: 5
Page: 6
Page: 7
Page: 8
Page: 9
Page: 10
Page: 11
Page: 12
Page: 13
Page: 14
Page: 15
Page: 16
Page: 17
Page: 18
Page: 19
Page: 20
Page: 21
Page: 22
Page: 23
Page: 24
Page: 25
Page: 26
Page: 27
Page: 28
Page: 29
Page: 30
Page: 31
Page: 32
Page: 33
Page: 34
Page: 35
Page: 36
Page: 37
Page: 38
Page: 39
Page: 40


In [20]:
barcelona_treatment

Unnamed: 0,hotel_name,rating,room_description,price,location_descirption,long_description
0,Hotel Turin Barcelona,8.2,Double Room\n1 double bed\nFree cancellation,"€ 1,323","Ciutat Vella, BarcelonaShow on map450 m from c...",You're eligible for a Genius discount at Hotel...
1,Sonder Los Arcos,8.4,King Room\n1 extra-large double bed\nOnly 1 ro...,"€ 1,530","Ciutat Vella, BarcelonaShow on map1 km from ce...",Sonder Los Arcos features accommodation with f...
2,Hotel Sansi Barcelona,8.0,Basic Double\n1 double bed,"€ 1,300","Eixample, BarcelonaShow on map450 m from centr...",You're eligible for a Genius discount at Hotel...
3,ibis Styles Barcelona City Bogatell,8.5,Standard Double Room\n1 large double bed\nBrea...,"€ 1,403","Sant Martí, BarcelonaShow on map2.1 km from ce...","Located in Barcelona, 900 metres from Port Oly..."
4,SM Hotel Sant Antoni,8.6,Double or Twin Room\nBeds: 1 double or 2 singl...,"€ 1,197","Eixample, BarcelonaShow on map1.5 km from cent...",Situated 5 minutes' walk from the Sagrada Fami...
...,...,...,...,...,...,...
997,Catalonia Castellnou,8.3,Double or Twin Room\nBeds: 1 double or 2 singl...,€ 970,"Sarrià-St. Gervasi, BarcelonaShow on map3.6 km...","Catalonia Castellnou offers free Wi-Fi, a 24-h..."
998,Tendency Apartments - Sagrada Familia,7.8,Standard Apartment\nEntire apartment • 1 bedro...,"€ 3,320","Eixample, BarcelonaShow on map2.3 km from cent...",You're eligible for a Genius discount at Tende...
999,numa I Lustre Apartments,8.3,One-Bedroom Apartment\nEntire apartment • 1 be...,"€ 2,910","Ciutat Vella, BarcelonaShow on map0.9 km from ...",You're eligible for a Genius discount at numa ...
1000,Enjoy Apartments Sagrada Familia V,7.3,Three-Bedroom Apartment\nEntire apartment • 1 ...,"€ 4,004","Eixample, BarcelonaShow on map1.5 km from cent...",Set 2.7 km from Nova Icaria Beach and 2.8 km f...


In [21]:
barcelona_treatment.to_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/barcelona_treatment.csv')

In [2]:
barcelona_treatment = pd.read_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/barcelona_treatment.csv')

In [3]:
barcelona_treatment

Unnamed: 0.1,Unnamed: 0,hotel_name,rating,room_description,price,location_descirption,long_description
0,0,Hotel Turin Barcelona,8.2,Double Room\n1 double bed\nFree cancellation,"€ 1,323","Ciutat Vella, BarcelonaShow on map450 m from c...",You're eligible for a Genius discount at Hotel...
1,1,Sonder Los Arcos,8.4,King Room\n1 extra-large double bed\nOnly 1 ro...,"€ 1,530","Ciutat Vella, BarcelonaShow on map1 km from ce...",Sonder Los Arcos features accommodation with f...
2,2,Hotel Sansi Barcelona,8.0,Basic Double\n1 double bed,"€ 1,300","Eixample, BarcelonaShow on map450 m from centr...",You're eligible for a Genius discount at Hotel...
3,3,ibis Styles Barcelona City Bogatell,8.5,Standard Double Room\n1 large double bed\nBrea...,"€ 1,403","Sant Martí, BarcelonaShow on map2.1 km from ce...","Located in Barcelona, 900 metres from Port Oly..."
4,4,SM Hotel Sant Antoni,8.6,Double or Twin Room\nBeds: 1 double or 2 singl...,"€ 1,197","Eixample, BarcelonaShow on map1.5 km from cent...",Situated 5 minutes' walk from the Sagrada Fami...
...,...,...,...,...,...,...,...
997,997,Catalonia Castellnou,8.3,Double or Twin Room\nBeds: 1 double or 2 singl...,€ 970,"Sarrià-St. Gervasi, BarcelonaShow on map3.6 km...","Catalonia Castellnou offers free Wi-Fi, a 24-h..."
998,998,Tendency Apartments - Sagrada Familia,7.8,Standard Apartment\nEntire apartment • 1 bedro...,"€ 3,320","Eixample, BarcelonaShow on map2.3 km from cent...",You're eligible for a Genius discount at Tende...
999,999,numa I Lustre Apartments,8.3,One-Bedroom Apartment\nEntire apartment • 1 be...,"€ 2,910","Ciutat Vella, BarcelonaShow on map0.9 km from ...",You're eligible for a Genius discount at numa ...
1000,1000,Enjoy Apartments Sagrada Familia V,7.3,Three-Bedroom Apartment\nEntire apartment • 1 ...,"€ 4,004","Eixample, BarcelonaShow on map1.5 km from cent...",Set 2.7 km from Nova Icaria Beach and 2.8 km f...


In [7]:
barcelona_treatment.describe()

Unnamed: 0.1,Unnamed: 0,rating
count,1002.0,974.0
mean,500.5,8.132957
std,289.396786,0.893503
min,0.0,1.0
25%,250.25,7.8
50%,500.5,8.3
75%,750.75,8.675
max,1001.0,10.0


### Next, change the dates to extract the hotel prices for the week before the event

We only scrape the name and the price here because rating and description will not change.

In [10]:
css_pages = 'div.b16a89683f:nth-child(3) > button:nth-child(1) > span:nth-child(1) > span:nth-child(1)'
def get_only_prices(pages):
    # Collect hotel names and prices from the result pages
    hotel_names = []
    prices_week_before = []

    for page in range(int(pages)+1):    
    #for page in range(int(1)+1):      
        #Print page that it is in 
        print(f'Page: {page + 1}')
        sections = browser.find_elements('xpath', '//div[@class="c066246e13"]')
        for hotel in sections:
            hotel_name = hotel.find_element('xpath', './/div[@class="f6431b446c a15b38c233"]').text
            hotel_names.append(hotel_name)  
            # extract prices
            try:
                price_week_before = hotel.find_element('xpath','.//span[@class="f6431b446c fbfd7c1165 e84eb96b1f"]').text
            except:
                price_week_before = np.nan
            prices_week_before.append(price_week_before)

        next_page = browser.find_element('css selector', css_pages).click()
        time.sleep(2)        

    df = pd.DataFrame({'hotel_name': hotel_names, 'price_week_before': prices_week_before})
    return df

In [13]:
input_dates('2024-05-20', '2024-05-26')
# click on "search"
my_xpath='/html/body/div[4]/div/div[2]/div/div[1]/div/form/div[1]/div[4]/button/span'
check_and_click(browser,my_xpath , type='xpath') 
pages = get_number_pages(browser)
barcelona_control = get_only_prices(pages)

In [35]:
barcelona_control

Unnamed: 0,hotel_name,price_week_before
0,Mayerling Bisbe Urquinaona,"€ 1,250"
1,Sonder Paseo de Gracia,"€ 1,928"
2,HCC Regente,"€ 1,049"
3,Hostal Orleans,€ 661
4,ibis Styles Barcelona City Bogatell,€ 974
...,...,...
997,Sants Estacio - Modern and comfy 4BD for 8 guests,"€ 2,690"
998,Murillo 18,"€ 1,796"
999,42enf575 - Authentic &Centric Barcelonian 2BR ...,"€ 1,238"
1000,By Cathedral Rooms,"€ 1,004"


In [42]:
barcelona_control['hotel_name'].nunique()

951

In [36]:
barcelona_control.to_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/barcelona_control.csv')

In [22]:
barcelona_control = pd.read_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/barcelona_control.csv')

Create a DataFrame that contains the prices for both weeks.

In [26]:
barcelona_merged = pd.merge(barcelona_treatment, barcelona_control, on='hotel_name', how='left')

barcelona_merged = barcelona_merged.dropna(subset=['price_week_before'])

barcelona_merged = barcelona_merged.drop('Unnamed: 0_x', axis=1)
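Because `hotel_name` is not unique in the control scrape (951 distinct names), a left merge on the name can produce more rows than the treatment table has. A minimal sketch of one way to guard against this, using toy data: duplicates are dropped first, and the `validate` argument of `pd.merge` raises an error if the keys are still not unique. The frames `treatment` and `control` are hypothetical stand-ins for the scraped tables.

```python
import pandas as pd

# Hypothetical listings; "Hotel A" appears twice in the control scrape
treatment = pd.DataFrame({
    "hotel_name": ["Hotel A", "Hotel B"],
    "price": ["€ 1,323", "€ 970"],
})
control = pd.DataFrame({
    "hotel_name": ["Hotel A", "Hotel A", "Hotel C"],
    "price_week_before": ["€ 1,201", "€ 1,201", "€ 800"],
})

# A plain left merge repeats "Hotel A" once per matching control row
naive = pd.merge(treatment, control, on="hotel_name", how="left")
print(len(naive))  # 3

# Keep the first occurrence of each name so the merge cannot multiply rows
control_unique = control.drop_duplicates(subset="hotel_name", keep="first")

merged = pd.merge(treatment, control_unique, on="hotel_name", how="left", validate="one_to_one")
merged = merged.dropna(subset=["price_week_before"])
print(len(merged))  # 1
```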

In [27]:
barcelona_merged

Unnamed: 0,hotel_name,rating,room_description,price,location_descirption,long_description,Unnamed: 0_y,price_week_before
0,Hotel Turin Barcelona,8.2,Double Room\n1 double bed\nFree cancellation,"€ 1,323","Ciutat Vella, BarcelonaShow on map450 m from c...",You're eligible for a Genius discount at Hotel...,333.0,"€ 1,201"
1,Sonder Los Arcos,8.4,King Room\n1 extra-large double bed\nOnly 1 ro...,"€ 1,530","Ciutat Vella, BarcelonaShow on map1 km from ce...",Sonder Los Arcos features accommodation with f...,902.0,"€ 1,704"
3,ibis Styles Barcelona City Bogatell,8.5,Standard Double Room\n1 large double bed\nBrea...,"€ 1,403","Sant Martí, BarcelonaShow on map2.1 km from ce...","Located in Barcelona, 900 metres from Port Oly...",4.0,€ 974
5,Four Points by Sheraton Barcelona Diagonal,8.8,Classic King Room\n1 extra-large double bed\nF...,"€ 1,724","Sant Martí, BarcelonaShow on map3 km from cent...",Four Points by Sheraton Barcelona Diagonal is ...,16.0,"€ 1,335"
6,Hotel del Mar,8.4,Room Assigned on Arrival\n1 single bed,"€ 1,205","Ciutat Vella, BarcelonaShow on map1.3 km from ...",You're eligible for a Genius discount at Hotel...,11.0,"€ 1,232"
...,...,...,...,...,...,...,...,...
1017,Room con baño privado sagrada familia,6.8,Suite\nPrivate suite\n1 double bed\nFree cance...,"€ 1,238","Eixample, BarcelonaShow on map2 km from centre...",Room con baño privado sagrada familia is set i...,889.0,"€ 1,238"
1020,Antiga Casa Buenavista,9.4,Double Room Raval\n1 large double bed\nBreakfa...,"€ 2,953","Ciutat Vella, BarcelonaShow on map0.6 km from ...","Conveniently set in the centre of Barcelona, A...",871.0,"€ 2,476"
1028,Apartment Barcelona Rentals - Sarria Apartment...,5.5,Apartment with Balcony\nEntire apartment • 2 b...,"€ 3,096","Sarrià-St. Gervasi, BarcelonaShow on map3.1 km...",You're eligible for a Genius discount at Apart...,1001.0,"€ 2,556"
1030,Hostal Dragonflybcn,8.1,Twin Room with Shared Toilet\n2 single beds\nF...,€ 975,"Ciutat Vella, BarcelonaShow on map0.7 km from ...",Hostal Dragonflybcn is a guest house located i...,424.0,€ 700


In [28]:
barcelona_merged.to_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/barcelona_merged.csv')

### Next, change the city to Rome and extract the hotel prices for the week before the event

In [14]:
# clear the field first
browser.find_element(by='xpath',value='//*[@id=":re:"]').clear()
input_place('Rome')
time.sleep(3)
browser.find_element(by='xpath',value='//*[@id="autocomplete-result-1"]').click()
# add sleep time, because otherwise it will still show results for Barcelona
time.sleep(3)
# click on "search"
my_xpath='/html/body/div[4]/div/div[2]/div/div[1]/div/form/div[1]/div[4]/button/span'
check_and_click(browser,my_xpath , type='xpath') 
pages = get_number_pages(browser)
rome_control = get_only_prices(pages)

In [15]:
rome_control.to_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/rome_control.csv')

In [16]:
rome_control

Unnamed: 0,hotel_name,price_week_before
0,HOTEL DELLE CIVETTE,"€ 1,139"
1,Sonder Piazza San Pietro,"€ 1,693"
2,"[Centocelle, Metro C] Luminoso quadrilocale",€ 951
3,Pamphili212,€ 711
4,Rhome Apartments,€ 882
...,...,...
1021,ILAMI House,"€ 1,476"
1022,Apartment Bruno's,"€ 1,178"
1023,Mamomi HOUSE,"€ 1,243"
1024,Pantheon Rome Relais,"€ 1,584"


### Extract prices and all other information for Rome during the week of the event

In [15]:
input_dates('2024-05-27', '2024-06-02')
my_xpath='/html/body/div[4]/div/div[2]/div/div[1]/div/form/div[1]/div[4]/button/span'
check_and_click(browser,my_xpath , type='xpath') 
pages = get_number_pages(browser)
rome_treatment = get_information(pages)

Page: 1
Page: 2
Page: 3
Page: 4
Page: 5
Page: 6
Page: 7
Page: 8
Page: 9
Page: 10
Page: 11
Page: 12
Page: 13
Page: 14
Page: 15
Page: 16
Page: 17
Page: 18
Page: 19
Page: 20
Page: 21
Page: 22
Page: 23
Page: 24
Page: 25
Page: 26
Page: 27
Page: 28
Page: 29
Page: 30
Page: 31
Page: 32
Page: 33
Page: 34
Page: 35
Page: 36
Page: 37
Page: 38
Page: 39
Page: 40
Page: 41


In [16]:
rome_treatment

Unnamed: 0,hotel_name,rating,room_description,price,location_descirption,long_description
0,ROMUhouse economy apartment METRO B,8.9,One-Bedroom Apartment\nEntire apartment • 1 be...,€ 525,"Tiburtino, RomeShow on map8.1 km from centreMe...",You're eligible for a Genius discount at ROMUh...
1,Sonder Piazza San Pietro,8.2,Superior Apartment\nEntire apartment • 1 bedro...,"€ 1,777","Vaticano Prati, RomeShow on map2.2 km from centre",You're eligible for a Genius discount at Sonde...
2,The Royals,8.8,Double or Twin Room with Balcony\n3 single bed...,€ 579,"Gianicolense, RomeShow on map4.3 km from centre",Located within 2.1 km of Roma Trastevere Train...
3,Monti Guest House - Affittacamere,7.7,Economy Double Room\n1 double bed\nFree cancel...,€ 654,"Rione Monti, RomeShow on map0.9 km from centre...",You're eligible for a Genius discount at Monti...
4,"[Centocelle, Metro C] Luminoso quadrilocale",,"Two-Bedroom Apartment\n3 beds (2 singles, 1 do...",€ 951,RomeShow on map7.1 km from centreMetro access,"Located 5.9 km from Porta Maggiore, 5.9 km fro..."
...,...,...,...,...,...,...
1021,Campo de' fiori cute apartment,8.1,One-Bedroom Apartment\nEntire apartment • 1 be...,"€ 1,030","Navona, RomeShow on map0.9 km from centre",You're eligible for a Genius discount at Campo...
1022,Appartamento Vacanze Romane,9.5,Two-Bedroom Apartment\nEntire apartment • 2 be...,€ 935,"San Giovanni, RomeShow on map3.1 km from centre",You're eligible for a Genius discount at Appar...
1023,Massi Vatican House,9.0,One-Bedroom Apartment\nEntire apartment • 1 be...,"€ 1,123","Vaticano Prati, RomeShow on map2.8 km from centre",You're eligible for a Genius discount at Massi...
1024,Large appartament near Vaticano,7.9,Deluxe Apartment\nEntire apartment • 2 bedroom...,"€ 1,482","Vaticano Prati, RomeShow on map2.1 km from centre","In the centre of Rome, located within a short ..."


In [17]:
rome_treatment.to_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/rome_treatment.csv')

In [18]:
rome_control = pd.read_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/rome_control.csv')

In [19]:
rome_merged = pd.merge(rome_treatment, rome_control, on='hotel_name', how='left')

rome_merged = rome_merged.dropna(subset=['price_week_before'])

rome_merged = rome_merged.drop('Unnamed: 0', axis=1)

In [20]:
rome_merged

Unnamed: 0,hotel_name,rating,room_description,price,location_descirption,long_description,price_week_before
1,Sonder Piazza San Pietro,8.2,Superior Apartment\nEntire apartment • 1 bedro...,"€ 1,777","Vaticano Prati, RomeShow on map2.2 km from centre",You're eligible for a Genius discount at Sonde...,"€ 1,693"
4,"[Centocelle, Metro C] Luminoso quadrilocale",,"Two-Bedroom Apartment\n3 beds (2 singles, 1 do...",€ 951,RomeShow on map7.1 km from centreMetro access,"Located 5.9 km from Porta Maggiore, 5.9 km fro...",€ 951
6,Pamphili212,,Two-Bedroom Apartment\nEntire apartment • 2 be...,€ 808,"Gianicolense, RomeShow on map2.9 km from centre","Situated in Rome, 1.4 km from Roma Trastevere ...",€ 711
8,4rooms In Rome,9.4,Standard Double Room\n1 double bed\nFree cance...,€ 718,"Central Station, RomeShow on map2.2 km from ce...","Set in the centre of Rome, less than 1 km from...",€ 732
9,Juna's guest house,8.5,Double Room with Private Bathroom\n1 double be...,€ 546,"Tiburtino, RomeShow on map4.3 km from centre",You're eligible for a Genius discount at Juna'...,€ 546
...,...,...,...,...,...,...,...
1028,My Vatican Home in Rome,,"Two-Bedroom Apartment\n3 beds (2 singles, 1 do...",€ 832,"Aurelio, RomeShow on map3.6 km from centreMetr...",My Vatican Home in Rome is set in the Aurelio ...,€ 832
1034,Flaminio Lovely House,9.1,One-Bedroom Apartment\nEntire apartment • 1 be...,"€ 1,648","Villa Borghese Parioli, RomeShow on map4.1 km ...",You're eligible for a Genius discount at Flami...,"€ 1,158"
1035,iFlat Sunny and Colorful Esquilino Apartment,8.5,Two-Bedroom Apartment\nEntire apartment • 2 be...,"€ 1,854","Central Station, RomeShow on map2.2 km from ce...","In the Central Station district of Rome, close...","€ 1,851"
1041,Large appartament near Vaticano,7.9,Deluxe Apartment\nEntire apartment • 2 bedroom...,"€ 1,482","Vaticano Prati, RomeShow on map2.1 km from centre","In the centre of Rome, located within a short ...","€ 1,482"


In [21]:
rome_merged.to_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/rome_merged.csv')

## Regressions

In [22]:
barcelona_merged=pd.read_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/barcelona_merged.csv')
rome_merged=pd.read_csv('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/rome_merged.csv')

To run the regression, the price columns have to be converted to integers.

In [23]:
barcelona_merged['price'] = barcelona_merged['price'].str.replace('€', '').str.replace(',', '').astype(int)
rome_merged['price'] = rome_merged['price'].str.replace('€', '').str.replace(',', '').astype(int)
barcelona_merged['price_week_before'] = barcelona_merged['price_week_before'].str.replace('€', '').str.replace(',', '').astype(int)
rome_merged['price_week_before'] = rome_merged['price_week_before'].str.replace('€', '').str.replace(',', '').astype(int)
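The same conversion is applied to four columns, so it can be wrapped in a small helper. A minimal sketch, assuming a hypothetical function `parse_price` that is not part of the notebook; the sample strings follow the price format shown in the tables above.

```python
import pandas as pd

def parse_price(series):
    """Turn strings such as '€ 1,323' into integers by keeping only the digits."""
    return series.str.replace(r"[^\d]", "", regex=True).astype(int)

prices = pd.Series(["€ 1,323", "€ 970", "€ 4,004"])
parsed = parse_price(prices)
print(parsed.tolist())  # [1323, 970, 4004]
```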

In [36]:
barcelona_merged.dtypes

Unnamed: 0                int64
hotel_name               object
rating                  float64
room_description         object
price                     int32
location_descirption     object
long_description         object
Unnamed: 0_y            float64
price_week_before         int32
dtype: object

### Possible features based on the description

From the location description, it is possible to extract the distance to the city centre.

In [24]:
def extract_distance_city_centre(df):
    # Capture the first "<number> km" or "<number> m" in the location text,
    # e.g. "450 m from centre" or "2.1 km from centre"
    pattern = r'(\d+(?:\.\d+)?)\s?(km|m)\b'

    # Column 0 holds the number, column 1 the unit
    parts = df['location_descirption'].str.extract(pattern)
    value = parts[0].astype(float)

    # Convert kilometers to meters (1 km = 1000 meters); values in meters stay unchanged
    factor = parts[1].map({'km': 1000, 'm': 1})

    # Store the distance as integer meters
    df['distance_city_centre_meter'] = (value * factor).round().astype(int)
    return df


In [25]:
barcelona_merged = extract_distance_city_centre(barcelona_merged)

In [26]:
barcelona_merged

Unnamed: 0.1,Unnamed: 0,hotel_name,rating,room_description,price,location_descirption,long_description,Unnamed: 0_y,price_week_before,distance_city_centre_meter
0,0,Hotel Turin Barcelona,8.2,Double Room\n1 double bed\nFree cancellation,1323,"Ciutat Vella, BarcelonaShow on map450 m from c...",You're eligible for a Genius discount at Hotel...,333.0,1201,450
1,1,Sonder Los Arcos,8.4,King Room\n1 extra-large double bed\nOnly 1 ro...,1530,"Ciutat Vella, BarcelonaShow on map1 km from ce...",Sonder Los Arcos features accommodation with f...,902.0,1704,1000
2,3,ibis Styles Barcelona City Bogatell,8.5,Standard Double Room\n1 large double bed\nBrea...,1403,"Sant Martí, BarcelonaShow on map2.1 km from ce...","Located in Barcelona, 900 metres from Port Oly...",4.0,974,2100
3,5,Four Points by Sheraton Barcelona Diagonal,8.8,Classic King Room\n1 extra-large double bed\nF...,1724,"Sant Martí, BarcelonaShow on map3 km from cent...",Four Points by Sheraton Barcelona Diagonal is ...,16.0,1335,3000
4,6,Hotel del Mar,8.4,Room Assigned on Arrival\n1 single bed,1205,"Ciutat Vella, BarcelonaShow on map1.3 km from ...",You're eligible for a Genius discount at Hotel...,11.0,1232,1300
...,...,...,...,...,...,...,...,...,...,...
739,1017,Room con baño privado sagrada familia,6.8,Suite\nPrivate suite\n1 double bed\nFree cance...,1238,"Eixample, BarcelonaShow on map2 km from centre...",Room con baño privado sagrada familia is set i...,889.0,1238,2000
740,1020,Antiga Casa Buenavista,9.4,Double Room Raval\n1 large double bed\nBreakfa...,2953,"Ciutat Vella, BarcelonaShow on map0.6 km from ...","Conveniently set in the centre of Barcelona, A...",871.0,2476,600
741,1028,Apartment Barcelona Rentals - Sarria Apartment...,5.5,Apartment with Balcony\nEntire apartment • 2 b...,3096,"Sarrià-St. Gervasi, BarcelonaShow on map3.1 km...",You're eligible for a Genius discount at Apart...,1001.0,2556,3100
742,1030,Hostal Dragonflybcn,8.1,Twin Room with Shared Toilet\n2 single beds\nF...,975,"Ciutat Vella, BarcelonaShow on map0.7 km from ...",Hostal Dragonflybcn is a guest house located i...,424.0,700,700


Since the distance is sometimes given in km and sometimes in m, we convert all distances to metres and store them as integers.
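
The unit conversion can be illustrated on single strings. The sketch below is not the notebook's function: it uses a simplified regular expression and a hypothetical helper name, `distance_to_metres`, and assumes the distance appears as a number followed by `km` or `m`, as in the location descriptions shown above.

```python
import re

def distance_to_metres(location_text):
    """Return the first distance found in the text as an integer number of metres, or None."""
    # A number (optionally with a decimal point) followed by a unit of km or m
    match = re.search(r'(\d+(?:\.\d+)?)\s?(km|m)\b', location_text)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    # Kilometres are scaled to metres; round() guards against floating-point error
    return round(value * 1000) if unit == 'km' else round(value)

print(distance_to_metres('Ciutat Vella, BarcelonaShow on map450 m from centre'))
print(distance_to_metres('Vaticano Prati, RomeShow on map2.2 km from centre'))
```

Returning `None` when no distance is found keeps missing values visible, so they can be handled explicitly before the regression.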

In [27]:
rome_merged = extract_distance_city_centre(rome_merged)

In [28]:
rome_merged

Unnamed: 0.1,Unnamed: 0,hotel_name,rating,room_description,price,location_descirption,long_description,price_week_before,distance_city_centre_meter
0,1,Sonder Piazza San Pietro,8.2,Superior Apartment\nEntire apartment • 1 bedro...,1777,"Vaticano Prati, RomeShow on map2.2 km from centre",You're eligible for a Genius discount at Sonde...,1693,2200
1,4,"[Centocelle, Metro C] Luminoso quadrilocale",,"Two-Bedroom Apartment\n3 beds (2 singles, 1 do...",951,RomeShow on map7.1 km from centreMetro access,"Located 5.9 km from Porta Maggiore, 5.9 km fro...",951,7100
2,6,Pamphili212,,Two-Bedroom Apartment\nEntire apartment • 2 be...,808,"Gianicolense, RomeShow on map2.9 km from centre","Situated in Rome, 1.4 km from Roma Trastevere ...",711,2900
3,8,4rooms In Rome,9.4,Standard Double Room\n1 double bed\nFree cance...,718,"Central Station, RomeShow on map2.2 km from ce...","Set in the centre of Rome, less than 1 km from...",732,2200
4,9,Juna's guest house,8.5,Double Room with Private Bathroom\n1 double be...,546,"Tiburtino, RomeShow on map4.3 km from centre",You're eligible for a Genius discount at Juna'...,546,4300
...,...,...,...,...,...,...,...,...,...
448,1028,My Vatican Home in Rome,,"Two-Bedroom Apartment\n3 beds (2 singles, 1 do...",832,"Aurelio, RomeShow on map3.6 km from centreMetr...",My Vatican Home in Rome is set in the Aurelio ...,832,3600
449,1034,Flaminio Lovely House,9.1,One-Bedroom Apartment\nEntire apartment • 1 be...,1648,"Villa Borghese Parioli, RomeShow on map4.1 km ...",You're eligible for a Genius discount at Flami...,1158,4100
450,1035,iFlat Sunny and Colorful Esquilino Apartment,8.5,Two-Bedroom Apartment\nEntire apartment • 2 be...,1854,"Central Station, RomeShow on map2.2 km from ce...","In the Central Station district of Rome, close...",1851,2200
451,1041,Large appartament near Vaticano,7.9,Deluxe Apartment\nEntire apartment • 2 bedroom...,1482,"Vaticano Prati, RomeShow on map2.1 km from centre","In the centre of Rome, located within a short ...",1482,2100


In [29]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

def abbr_or_lower(word):
    if re.match('([A-Z]+[a-z]*){2,}', word):
        return word
    else:
        return word.lower()

def strip(word):
    mod_string = re.sub(r'\W+', '', word)
    return mod_string

def pipeline_DTM(filepath, column_name):
    # Load csv file
    df = pd.read_csv(filepath)

    # Tokenize every entry of the chosen column
    tokenized_texts = [nltk.word_tokenize(str(text)) for text in df[column_name]]

    # Lowercasing, stopword removal, and stemming
    stop_words = set(stopwords.words('english'))  # built once instead of once per token
    stemmer = SnowballStemmer("english")          # reused for all tokens
    corpus_stop = []
    corpus_stem = []
    for words in tokenized_texts:
        cleaned = [abbr_or_lower(strip(w)) for w in words]
        # Drop stopwords and the empty strings left over from pure-punctuation tokens
        lowered_removed_stopwords = [w for w in cleaned if w and w not in stop_words]
        corpus_stop.append(lowered_removed_stopwords)
        stemmed = [stemmer.stem(w) for w in lowered_removed_stopwords]
        corpus_stem.append(" ".join(stemmed))

    # TF-IDF Vectorization
    tfidf = TfidfVectorizer(ngram_range=(1, 2), norm=None, min_df=0.05, max_df=0.8)
    tfidf_vectorized_text = tfidf.fit_transform(corpus_stem)
    # Convert the sparse matrix to a regular NumPy array
    tfidf_vectorized_text = tfidf_vectorized_text.toarray()

    # Creating a DataFrame from the TF-IDF vectorized data
    df_tfidf_vectorized = pd.DataFrame(tfidf_vectorized_text, columns=tfidf.get_feature_names_out())

    return df_tfidf_vectorized


In [31]:
df_tfidf_barcelona = pipeline_DTM('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/barcelona_merged.csv','long_description')

In [32]:
# Sum the TF-IDF scores for each term across all documents
term_scores = df_tfidf_barcelona.sum(axis=0)

# Convert the result to a DataFrame for better visualization
df_term_scores = pd.DataFrame({'Term': term_scores.index, 'TF-IDF Score': term_scores.values})

# Sort the DataFrame by TF-IDF score in descending order
df_term_scores = df_term_scores.sort_values(by='TF-IDF Score', ascending=False)

# Display the top N terms
top_terms = 30  # Set the number of top terms you want to display
print(f"Top {top_terms} Terms:")
print(df_term_scores.head(top_terms))

Top 30 Terms:
           Term  TF-IDF Score
53        apart   2392.482131
218       hotel   1917.729335
232          km   1840.066413
356        room   1502.052565
443        walk   1306.909416
137          de   1212.064499
275       minut   1147.169977
335    properti   1105.033135
294       offer   1083.719552
268        metr   1052.143558
45      airport   1009.634432
65         away   1000.456467
409     station    907.704223
204      gracia    901.986313
178      featur    883.366006
276  minut walk    863.438119
346      rambla    854.109360
271       metro    847.177884
255       locat    829.343916
312     passeig    827.088091
313  passeig de    821.101100
49         also    807.894581
223      includ    802.770514
139   de gracia    798.483348
332      privat    784.637174
207       guest    784.132388
62        avail    758.880555
436          tv    748.923674
112   catalunya    748.894302
241          la    735.533069


Possible features based on these terms: an indicator for whether the listing is a hotel or an apartment, and the distance to the airport.
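
A hotel-versus-apartment control could be built as a simple keyword indicator. The sketch below is one possible rule, not the notebook's implementation: the helper name `is_apartment` is hypothetical, and it assumes apartment listings mention the word "apartment" in their room description, as in the rows displayed earlier.

```python
def is_apartment(room_description):
    """Return 1 if the room description marks the listing as an apartment, else 0."""
    # Assumption: apartment listings contain the word 'apartment' somewhere in the text
    return int('apartment' in room_description.lower())

print(is_apartment('Two-Bedroom Apartment\nEntire apartment • 2 bedrooms'))
print(is_apartment('Double Room\n1 double bed\nFree cancellation'))
```

On the merged data, such a helper could be applied with `barcelona_merged['room_description'].apply(is_apartment)` to create a dummy variable for the regression.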

In [33]:
df_tfidf_rome = pipeline_DTM('C:/Users/vanes/Desktop/BSE/Term 2/Introduction to Text Mining and NLP/HW1/rome_merged.csv','long_description')

In [34]:
# Sum the TF-IDF scores for each term across all documents
term_scores = df_tfidf_rome.sum(axis=0)

# Convert the result to a DataFrame for better visualization
df_term_scores = pd.DataFrame({'Term': term_scores.index, 'TF-IDF Score': term_scores.values})

# Sort the DataFrame by TF-IDF score in descending order
df_term_scores = df_term_scores.sort_values(by='TF-IDF Score', ascending=False)

# Display the top N terms
top_terms = 30  # Set the number of top terms you want to display
print(f"Top {top_terms} Terms:")
print(df_term_scores.head(top_terms))

Top 30 Terms:
              Term  TF-IDF Score
82           apart   1461.057981
455        station   1215.483977
313          metro   1138.358513
314  metro station   1120.837404
230          guest    902.429786
242           hous    793.341121
232     guest hous    702.826189
405           room    643.037562
504        vatican    638.900716
498           unit    614.554566
56        accommod    613.812590
206         featur    579.312155
308           metr    576.873020
373         privat    511.857635
350          peter    495.312414
115            bed    487.447249
353         piazza    485.288810
197          equip    476.463154
241           home    475.625199
453             st    475.099907
454       st peter    472.683906
92            area    472.163969
384         provid    465.863637
248         includ    463.589094
101           away    456.644231
215     flatscreen    439.265467
216  flatscreen tv    438.995257
397           roma    438.269370
98           avail    437.006