# Hotel Search Data

Company XYZ is an online travel agent (OTA) site, similar to Expedia or Booking.com.

They haven't invested in data science yet, and all the data they have about user searches is simply stored in the URLs generated when users search for a hotel. If you are not familiar with URLs, run a search on any OTA site and look at how every search parameter appears in the URL.
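
To make the URL structure concrete, here is a minimal sketch that splits a query string into key/value pairs with the standard library. The URL is hypothetical: the site prefix and the `hotel.` parameter prefix match the constants used in this notebook, but the exact parameter values and their encoding are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical example URL. The host and the "hotel." prefix follow the
# notebook's constants; the values and their encoding are assumed.
url = ("http://www.mysearchforhotels.com/shop/hotelsearch?"
       "hotel.checkin=2015-09-19&hotel.adults=3"
       "&hotel.city=New+York,+NY,+United+States"
       "&hotel.amenities=yes_smoking&hotel.amenities=yes_pet")

def query_pairs(url):
    """Return the (key, value) pairs of the query string, in order."""
    # parse_qsl decodes '+' as a space and keeps repeated keys as separate pairs
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)

pairs = query_pairs(url)
for key, value in pairs:
    print(key, "=", value)
```

Note that a key such as `hotel.amenities` can appear more than once in the same URL, so a parser has to decide how to combine repeated values.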

In [1]:
import pandas as pd

## Question 1

Create a clean data set where each column is a field in the URL, each row is a given search and the cells are the corresponding URL values.

In [2]:
# constant definition
Site = 'http://www.mysearchforhotels.com/shop/hotelsearch?'
LenSite = len(Site)

ParamPrefix = 'hotel.'
LenParaPrefix = len(ParamPrefix)

Separator = ', '

In [4]:
def parse_url(url):
    """
    input: a url string
    output: a dictionary which contains parameter name and its value
    """
    # remove common prefix
    assert url[LenSite-1] == '?'
    segments = url[LenSite:].split('&')

    params = {}
    for segment in segments:
        kvpairs = segment.split('=')
        assert len(kvpairs) == 2

        k = kvpairs[0]
        # remove common prefix
        assert k[LenParaPrefix-1] == '.'
        k = k[LenParaPrefix:]

        if k in params:
            print ("'{}' has already existed in search".format(k))
            params[k] = params[k] + Separator +kvpairs[1]
        else:
            params[k] = kvpairs[1]

    return params

In [7]:
def load_parse():
    succ_urls = []
    fail_urls = []
    with open("url_list.txt",'rt') as inf:
        for index,line in enumerate(inf):
            try:
                url = parse_url(line.strip())
                succ_urls.append(url)
            except Exception:
                # keep the raw line so failed URLs can be inspected later
                fail_urls.append(line)
                print ("failed to parse: {}".format(line))

            # if index % 1000 == 0: print('{} lines parsed'.format(index))

    print ("************ ALL DONE ************")
    return succ_urls,fail_urls

In [8]:
succ_urls,fail_urls = load_parse()
assert len(fail_urls) == 0

'amenities' has already existed in search
'amenities' has already existed in search
'amenities' has already existed in search
'amenities' has already existed in search
'amenities' has already existed in search
************ ALL DONE ************


In [9]:
# convert into DataFrame
urls = pd.DataFrame(succ_urls)

# clean
urls['checkin'] = pd.to_datetime(urls.checkin)
urls['checkout'] = pd.to_datetime(urls.checkout)
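# assign back instead of a chained in-place fillna, which may not update the frame
urls['children'] = urls['children'].fillna(0)
# '+' is a regex metacharacter, so replace it as a literal string
urls['city'] = urls.city.str.replace('+', ' ', regex=False)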
urls['search_page'] = urls.search_page.astype(int)

In [10]:
urls.head()

| | checkin | stars_4 | min_score | adults | city | checkout | search_page | stars_3 | customMaximumPriceFilter | stars_5 | freeCancellation | stars_2 | children | max_score | couponCode | stars_1 | customMinimumPriceFilter | amenities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-09-19 | yes | 4 | 3 | New York, NY, United States | 2015-09-20 | 1 | | | | | | 0 | | | | | |
| 1 | 2015-09-14 | | 4 | 3 | London, United Kingdom | 2015-09-15 | 1 | yes | | | | | 0 | | | | | |
| 2 | 2015-09-26 | yes | 5 | 2 | New York, NY, United States | 2015-09-27 | 1 | | 175.0 | | | | 0 | | | | | |
| 3 | 2015-09-02 | yes | 4 | 1 | Hong Kong, Hong Kong | 2015-09-03 | 1 | | | yes | | | 0 | | | | | |
| 4 | 2015-09-20 | | 5 | 3 | London, United Kingdom | 2015-09-29 | 1 | | 275.0 | | | | 0 | | | | | |


## Question 2

For each search query, how many amenities were selected?

In [11]:
pd.notnull(urls.amenities).value_counts()

False    76973
True       704
Name: amenities, dtype: int64

Most searches do not specify 'amenities': only 704 of the 77,677 searches (about 0.9%) include it.

In [12]:
urls.amenities.value_counts()

internet                272
yes_smoking             170
shuttle                 111
yes_pet                  85
breakfast                39
lounge                   22
yes_smoking, yes_pet      4
breakfast, yes_pet        1
Name: amenities, dtype: int64

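Almost every search that specifies amenities selects exactly one. Only 5 searches combine two ('yes_smoking, yes_pet' four times and 'breakfast, yes_pet' once), so the site does allow multiple amenities in a search, but users rarely select more than one.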

In [13]:
amenities_cnts = urls.amenities.map(lambda s: 0 if pd.isnull(s) else len(s.split(Separator)))

amenities_cnts.value_counts()

0    76973
1      699
2        5
Name: amenities, dtype: int64
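
The same count can also be written with pandas string methods instead of a Python-level `map`. The sketch below uses a small toy Series (made-up values that only mimic the shape of `urls.amenities`) and assumes the same `', '` separator that the parser uses to join repeated values.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; they mimic the shape of `urls.amenities`.
amenities = pd.Series([np.nan, "internet", "yes_smoking, yes_pet"])

# Split on the separator used when parsing, count the pieces,
# and treat a missing value as zero amenities.
counts = amenities.str.split(", ").str.len().fillna(0).astype(int)
print(counts.tolist())
```

Both versions should give the same counts; the vectorized form simply avoids the explicit null check inside a lambda.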

## Question 3

Often, to measure the quality of a search algorithm, data scientists use some metric based on how often users click on the second page, third page, and so on. The idea here is that a great search algorithm should return all interesting results on the first page and never force users to visit the other pages (how often do you click on the second page results when you search on Google? Almost never, right?).

Create a metric based on the above idea and find the city with the worst search algorithm.
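
As a rough sketch of what such a metric could look like, the snippet below computes two candidates per city: the share of searches that stay on page 1 and the mean page number. The data frame `toy` and the helper `page_metrics` are hypothetical and exist only for this illustration; the values are made up and do not come from the real search log.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real search log.
toy = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "search_page": [1, 1, 1, 2, 1, 2, 3, 4],
})

def page_metrics(df):
    """Per-city share of first-page searches and mean page number."""
    grouped = df.groupby("city")["search_page"]
    return pd.DataFrame({
        # fraction of a city's searches that are for page 1
        "first_page_share": grouped.apply(lambda s: (s == 1).mean()),
        # average page number requested; higher means deeper paging
        "mean_page": grouped.mean(),
    }).sort_values("first_page_share")

metrics = page_metrics(toy)
print(metrics)
```

Under either candidate, a city where users page deeper (a lower first-page share or a higher mean page) would be read as having a weaker search algorithm.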

In [14]:
urls.search_page.value_counts()

1     50000
2     11637
3      5864
4      3635
5      2422
6      1636
7      1114
8       740
9       436
10      193
Name: search_page, dtype: int64

In [15]:
urls.groupby('city')['search_page'].apply(lambda s: s.value_counts(normalize=True)).unstack().sort_values(1)

| city | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| London, United Kingdom | 0.526588 | 0.187398 | 0.102502 | 0.065329 | 0.044372 | 0.030152 | 0.020315 | 0.012973 | 0.007199 | 0.003172 |
| New York, NY, United States | 0.557616 | 0.181357 | 0.094575 | 0.058808 | 0.038899 | 0.026443 | 0.018173 | 0.01266 | 0.007929 | 0.003539 |
| Hong Kong, Hong Kong | 0.910826 | 0.064992 | 0.014254 | 0.00526 | 0.002291 | 0.001103 | 0.000848 | 0.000339 | 8.5e-05 | |
| San Francisco, California, United States | 0.959285 | 0.033613 | 0.004853 | 0.00142 | 0.000829 | | | | | |


In [16]:
def firstpage_ratio(s):
    total = s.shape[0]
    n_firstpage = (s == 1).sum()
    return float(n_firstpage)/total

city_page1_ratio = urls.groupby('city')['search_page'].agg(firstpage_ratio).sort_values()

In [17]:
city_page1_ratio

city
London, United Kingdom                      0.526588
New York, NY, United States                 0.557616
Hong Kong, Hong Kong                        0.910826
San Francisco, California, United States    0.959285
Name: search_page, dtype: float64

London has the lowest share of searches that stay on the first page (about 52.7%), so by this metric it has the worst search algorithm, with New York close behind (about 55.8%).