# Hostel World
- Caitlin Mowdy
- DSI-SF-2

In [18]:
import numpy as np
import scipy 
import seaborn as sns
import pandas as pd
import sys
from IPython.display import display

In [19]:
hostel_dist = pd.read_csv('/Users/caitlinmowdy/Desktop/DSI-SF-2-caitlinmowdy/capstone-hostelworld/clean-data/hostel_dist_oct10.csv')
user_dist = pd.read_csv('/Users/caitlinmowdy/Desktop/DSI-SF-2-caitlinmowdy/capstone-hostelworld/clean-data/user_dist_oct10.csv')

## Problem Statement
- Hostelworld is a website that connects independent, budget-conscious, and youthful travelers to hostels. For the past 10 years Hostelworld has led the market in online reservations for this demographic. The website lets users search for hostels by location, group size, and dates. It also lets users filter by preferred features and price.
- The goal of this project is to make hostel recommendations for a user based on other users who are similar to them and on hostels that are similar to those they have rated highly.
- However, most hostels have very similar features, and because hostels are budget accommodation they usually fall in the same price range. For this reason, the text of hostel descriptions and reviews needs to be taken into account when making hostel recommendations.

## Collecting Data
### User IDs
- To view anonymously all the reviews made by a single user, the following URL can be used with a unique ID inserted in place of USER-ID
    - http://www.hostelworld.com/profile/USER-ID/reviews
- I randomly generated codes and checked that they belonged to users’ profiles
- To view code
    - https://github.com/caitlinmowdy/DSI-SF-2-caitlinmowdy/blob/master/capstone-hostelworld/code/Get%20User%20IDs%201.ipynb
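
The ID check described above might look roughly like the sketch below. It is not the code from the linked notebook. It assumes that IDs are 7-digit integers and that the site answers with a non-200 status (or an error) for IDs that are not profiles; both are guesses. The names `profile_exists`, `sample_candidate_ids`, and the `opener` parameter are hypothetical.

```python
import random
import urllib.error
import urllib.request

# URL pattern from the list above, with USER-ID as a placeholder
PROFILE_URL = "http://www.hostelworld.com/profile/{user_id}/reviews"


def sample_candidate_ids(n, low=1000000, high=9999999, seed=None):
    """Draw n random candidate IDs (the 7-digit range is an assumption)."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n)]


def profile_exists(user_id, opener=urllib.request.urlopen):
    """Return True if the reviews page for user_id answers with HTTP 200.

    `opener` is a hypothetical hook so the check can be tried offline.
    """
    url = PROFILE_URL.format(user_id=user_id)
    try:
        with opener(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.URLError:
        # HTTPError (for example a 404) is a subclass of URLError
        return False
```

A list of usable IDs could then be built with `[i for i in sample_candidate_ids(100) if profile_exists(i)]`, ideally with a pause between requests.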

### User and Review Details
- Used the list of IDs that belong to reviewers
- For users: 
    - collected age, travel group type, age group, and number of reviews 
- For reviews:
    - collected review text, date, score, hostel, link to hostel, and location
    
### Hostel Details
- Used links collected in reviews
- Tested that the links belonged to hostels still listed on Hostelworld
- Scraped for location, rating, review features, descriptions, amenities, policies, and awards

- To view code
    - https://github.com/caitlinmowdy/DSI-SF-2-caitlinmowdy/blob/master/capstone-hostelworld/code/Collect%20Raw%20Data.ipynb

## Cleaning Data 
- The first step in cleaning my data was deleting some of my users and reviews
    - Some of the reviews belonged to hostels that no longer exist
    - Reviews that weren't in English also had to be deleted
    - Deleting those reviews left some users who no longer had reviews
- Removed odd spacing and HTML tags
- Separated the about_user, date, and location columns
- Created hostel and country IDs
- To view code
    - https://github.com/caitlinmowdy/DSI-SF-2-caitlinmowdy/blob/master/capstone-hostelworld/code/Clean%20Data.ipynb
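
As a rough illustration of the tag and spacing cleanup, here is a minimal sketch using only the standard library. The regular expressions and the name `clean_text` are assumptions for illustration and may differ from what the linked notebook does.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML tag
SPACE_RE = re.compile(r"\s+")     # runs of spaces, tabs, and newlines


def clean_text(raw):
    """Replace HTML tags with a space, then collapse odd spacing."""
    without_tags = TAG_RE.sub(" ", raw)
    return SPACE_RE.sub(" ", without_tags).strip()


cleaned = clean_text("<p>Great   hostel,<br/>friendly staff!</p>\n")
# cleaned == "Great hostel, friendly staff!"
```

Replacing each tag with a space rather than an empty string keeps words on either side of a `<br/>` from running together.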

## EDA
### Reviews
- Which countries receive the most reviews
    - Italy, Spain, England
- Mean review score by country
    - Tunisia, Antigua, San Marino
- Number of reviews by month
    - January, December, and July are the months with the most reviews
- Distribution of review scores
    - Found that most reviews are positive

### Users 
- Which countries are most users from
    - USA, England, Australia
- In which countries do users leave the most reviews
    - Looked at mean(num_revs) grouped by country
    - Bouvet Island, Poland, and Pakistan
- What are the most popular travel group types
    - male, female, couple
- What are the most common age groups
    - "not specified" was most common by far, 25-30 was second, and 18-24 was third
- Histogram of number of reviews by user
    - most users leave 1 or 2 reviews
    
### Hostels
- histogram of hostel scores
    - most hostels have a score between 8 and 9.5
- What countries have the most hostels
    - Italy, Spain, Australia
    
- To view code
    - https://github.com/caitlinmowdy/DSI-SF-2-caitlinmowdy/blob/master/capstone-hostelworld/code/eda.ipynb
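
The grouped summaries in this section can be sketched with pandas as below. The data are made up, and the column name `country` is an assumption (`num_revs` comes from the notes above). The toy example also suggests why a country with very few users can come out on top when ranking by mean.

```python
import pandas as pd

# Made-up users; only the column name num_revs comes from the notes above
users = pd.DataFrame({
    "country": ["USA", "USA", "England", "Poland"],
    "num_revs": [1, 2, 1, 6],
})

# Which countries are most users from
users_per_country = users["country"].value_counts()

# In which countries do users leave the most reviews, on average
mean_revs = (users.groupby("country")["num_revs"]
                  .mean()
                  .sort_values(ascending=False))
```

Here Poland leads `mean_revs` with a single user who left 6 reviews, while the USA leads `users_per_country`. A minimum user count per country may be worth applying before reading much into the mean ranking.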

## Topic Modeling
- I decided to use topic modeling on both my reviews and hostel descriptions. For each, I created 10 topics and found the topic probabilities for every review and every hostel description.
- To view code 

- Once I had the topic probabilities, I edited some of my hostel features so I could fit a regression for hostel scores. I eventually gave up on fitting regression models for hostel ratings and review scores. The correlation matrix in my code helps explain why my regression scores were so low.
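
The notes above do not say which topic-modeling library was used. As one possible sketch of the topic-probability step, the example below assumes scikit-learn's `LatentDirichletAllocation` on word counts, with 10 topics as described. The review texts are invented for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented review texts, for illustration only
docs = [
    "great location close to the beach and friendly staff",
    "clean rooms but noisy bar downstairs at night",
    "free breakfast and helpful staff, great value",
    "dirty bathroom and no hot water in the shower",
]

# Bag-of-words counts, dropping common English stop words
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# 10 topics, matching the number described above
lda = LatentDirichletAllocation(n_components=10, random_state=0)

# One row per document, one column per topic; each row sums to 1
topic_probs = lda.fit_transform(counts)
```

Each row of `topic_probs` can then be attached to its review or hostel as 10 extra numeric features.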

## Hostel and User Distance
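- I used the Jaccard index to measure how close hostels and users are to each other.
    - len(intersection of setA and setB) divided by len(union of setA and setB)
    - This value is a similarity: 1 means the two sets are identical and 0 means they share nothing (Jaccard distance is 1 minus this value). The functions below store it in a column named `distances`, so the closest matches have the highest values.
- To compute it I first had to change my user and hostel information into sets.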
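
A small worked example of the set conversion and the overlap score may help. The attribute strings below are hypothetical stand-ins for what the data files hold, and the helper names `to_set` and `jaccard_similarity` are not from the original code.

```python
def to_set(raw):
    """Turn a string such as '[a, b, c]' into the set {'a', 'b', 'c'}."""
    items = raw.replace("[", "").replace("]", "").split(", ")
    return {item for item in items if item}  # drop empty strings


def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sets: avoid dividing by zero
    return len(set_a & set_b) / float(len(union))


# Hypothetical user attributes
user_a = to_set("[male, 25-30, USA]")
user_b = to_set("[male, 18-24, USA]")

# Shared: male, USA (2). All distinct items: 4. Score: 2 / 4 = 0.5
score = jaccard_similarity(user_a, user_b)
```

Because the score is a similarity, sorting it in descending order puts the closest matches first.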

### User Distance 
- I created a distance function for users that takes a given user and scores that user against every other user. It returns the three users closest to the given user, along with the lists of hostels those users have rated highly.

In [21]:
def j_user_dist(user):
    """Return the three users most similar to `user` and the hostels they rated highly."""
    def to_set(stuff):
        # 'user_stuff' is stored as a string such as "[a, b, c]"
        return set(stuff.replace('[', '').replace(']', '').split(', '))

    user_a = to_set(user_dist['user_stuff'][user_dist.user_id == user].values[0])

    distance = []
    for stuff in user_dist['user_stuff']:
        user_b = to_set(stuff)
        # Jaccard index: shared items divided by all distinct items
        distance.append(len(user_a & user_b) / float(len(user_a | user_b)))

    distances = pd.DataFrame({
        'users': user_dist['user_id'].values,
        'distances': distance,
        'hostels': user_dist['hsts_liked'].values,
    })
    # Highest score first; the top row is normally the given user (score 1.0), so skip it
    return distances.sort_values('distances', ascending=False)[1:4]

### Hostel Distance
- The function for finding hostel distance is very similar to the function for user distance. It takes a given hostel and a list of hostels, and scores the given hostel against every hostel in that list. It returns the hostel in the list closest to the given hostel, along with a link to it.

In [22]:
def j_hostel_dist(hst, hostel_list):
    """Return the hostel in `hostel_list` most similar to `hst`, with its link."""
    def info_set(name):
        # 'hostel_info' is a comma-separated string describing one hostel
        return set(hostel_dist['hostel_info'][hostel_dist['hostel'] == name].values[0].split(', '))

    hostel_a = info_set(hst)

    distance = []
    links = []
    for h in hostel_list:
        hostel_b = info_set(h)
        # Jaccard index: shared items divided by all distinct items
        distance.append(len(hostel_a & hostel_b) / float(len(hostel_a | hostel_b)))
        links.append(hostel_dist['link'][hostel_dist['hostel'] == h].values[0])

    distances = pd.DataFrame({
        'hostel': list(hostel_list),
        'distances': distance,
        'link': links,
    })
    # Keep only the single best match
    return distances.sort_values('distances', ascending=False)[:1]

## Hostel Recommendations
- Using the hostel distance and user distance functions, I made a function that gives hostel recommendations for a given user. The function first uses the user distance function to find the users closest to the given user. It collects the hostels the given user has rated highly and the hostels the closest users have rated highly. It then takes every hostel the given user has rated highly and, using the hostel distance function, compares it to the list of hostels the best-matched users have rated highly. It returns a list of recommended hostels and links to those hostels.

In [23]:
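def hostel_rec(user):
    """Recommend one hostel for each hostel that `user` has rated highly."""
    # Hostels rated highly by the three most similar users
    h_list = []
    for h in j_user_dist(user)['hostels'].values:
        for H in h.replace('[', '').replace(']', '').split(', '):
            if H != '':
                h_list.append(H)

    # Hostels the given user has rated highly
    users_hsts = [h.replace('[', '').replace(']', '')
                  for h in user_dist['hsts_liked'][user_dist['user_id'] == user].values[0].split(', ')]

    # DataFrame.append was removed in pandas 2.0, so collect the best match
    # for each hostel and concatenate once
    matches = [j_hostel_dist(hst, h_list) for hst in users_hsts]
    rec = pd.concat(matches, ignore_index=True)

    rec['user_hostel'] = users_hsts
    return rec[['user_hostel', 'hostel', 'distances', 'link']]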

In [24]:
hostel_rec(3298399)

| | user_hostel | hostel | distances | link |
|---|---|---|---|---|
| 0 | Bed & Bike Barcelona | Way Hostel | 0.530612 | http://www.hostelworld.com/hosteldetails.php/W... |
| 1 | B&B Giovy | Hostal Andalucia | 0.375 | http://www.hostelworld.com/hosteldetails.php/H... |
| 2 | CroParadise Green Hostel | Way Hostel | 0.418182 | http://www.hostelworld.com/hosteldetails.php/W... |
| 3 | B&B Giovy | Hostal Andalucia | 0.375 | http://www.hostelworld.com/hosteldetails.php/H... |
