# Scraping the SEAsia TripAdvisor Dataset

This is the first part of a three-part series.

1. Scraping
2. <a href="http://nbviewer.ipython.org/github/arhee/tripadvisor/blob/master/tripadvisor.ipynb">Analysis</a>
3. <a href="http://nbviewer.ipython.org/github/arhee/tripadvisor/blob/master/model.ipynb">Rating Prediction</a>

### Introduction

After graduating, I took some time off to travel the world.  One area that I particularly enjoyed visiting was Southeast Asia.  So for a side project, I decided to scrape the TripAdvisor website for all the attraction reviews in Cambodia, Laos, and Vietnam, mine the data for <a href="http://nbviewer.ipython.org/github/arhee/tripadvisor/blob/master/tripadvisor.ipynb">interesting insights</a>, and build a <a href="http://nbviewer.ipython.org/github/arhee/tripadvisor/blob/master/model.ipynb">prediction algorithm</a>.

The TripAdvisor website hosts reviews in a number of categories (restaurants, hotels, attractions, etc.), but this project focuses on attraction reviews.

### The Structure of TripAdvisor.com

The website is heavily nested: the items of interest, the review pages, are buried several levels below the country page, as shown below:

<ul><li>Country 1
    <ul><li>City 1
        <ul><li>Attraction 1
                <ul><li>Review Page 1</li>
                    <li>Review Page 2</li></ul></li>
            <li>Grouped Attraction 1
                <ul><li>Attraction 2
                        <ul><li>Review Page 1</li>
                            <li>Review Page 2</li></ul></li>
                    <li>Attraction 3
                        <ul><li>Review Page 1</li>
                            <li>Review Page 2</li></ul></li>
                </ul></li>
        </ul></li>
    </ul></li>
</ul>

A couple of potential difficulties are immediately noticeable.  The multi-layered nesting, although predictable, necessitates a flexible scraper.  For example, in the figure below, we can see that 'Old Quarter' is an Attraction object, whereas 'Sightseeing Tours' is a Grouped Attraction object.  The scraper needs to differentiate between the two kinds of object and act appropriately.

<img src="figs/attractions.png" style="max-height: 300; max-width: 400px;">

In addition, the site employs JavaScript popups that block access to information.  Given these constraints, I wrote my own <a href="https://github.com/arhee/tripadvisor_scraper">scraper</a> with the help of the Selenium package, a set of tools for web browser automation.
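The distinction between Attraction and Grouped Attraction pages lends itself to a recursive walk. The sketch below is not the scraper's actual code: it assumes each page has already been reduced to a dictionary with hypothetical `kind`, `name`, and `children` keys, and the sample tree is made-up data. It only illustrates how a grouped attraction can be expanded while a plain attraction is collected directly.

```python
def collect_attractions(node, path=()):
    """Yield the full path to every plain attraction beneath `node`."""
    here = path + (node["name"],)
    if node["kind"] == "attraction":
        # A plain attraction: its review pages would be scraped at this point.
        yield here
    else:
        # Countries, cities, and grouped attractions only link to further pages.
        for child in node.get("children", []):
            yield from collect_attractions(child, here)


# Hypothetical stand-in for the pages a crawl would discover.
site = {
    "kind": "country", "name": "Vietnam", "children": [
        {"kind": "city", "name": "Hanoi", "children": [
            {"kind": "attraction", "name": "Old Quarter"},
            {"kind": "group", "name": "Sightseeing Tours", "children": [
                {"kind": "attraction", "name": "Tour A"},
                {"kind": "attraction", "name": "Tour B"},
            ]},
        ]},
    ],
}

attraction_paths = list(collect_attractions(site))
for attraction_path in attraction_paths:
    print(" > ".join(attraction_path))
```

Because the grouped attraction is treated like any other container, deeper nesting needs no extra cases.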


### Scraping Data

Each review page holds up to 10 reviews, which are the data we are interested in.  An example of a review is shown below:

<img src="figs/scraped_review.png" style="max-height: 300; max-width: 400px;">

We can easily extract the HTML with Selenium and parse it using the BeautifulSoup library for Python.  From the parsed HTML, we can obtain the following relevant pieces of information:
<ul>
<li>Review Text/Title</li>
<li>Rating</li>
<li>Review Date</li>
<li>Language</li>
<li>User Properties</li>
<li>Item Rated</li>
</ul>
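As a rough illustration of pulling such fields out of a review's HTML, here is a minimal sketch. It makes two loud assumptions: the markup and class names (`title`, `rating`, `date`, `body`, `data-stars`) are invented for the example and are not TripAdvisor's real ones, and it uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup.

```python
from html.parser import HTMLParser

# Hypothetical markup; the real page structure is not described here.
SAMPLE = """
<div class="review">
  <span class="title">Lovely walk</span>
  <span class="rating" data-stars="5"></span>
  <span class="date">2015-03-02</span>
  <p class="body">Busy but charming streets.</p>
</div>
"""


class ReviewParser(HTMLParser):
    """Collect a few fields from one review, keyed by (assumed) class name."""

    TEXT_FIELDS = {"title", "date", "body"}

    def __init__(self):
        super().__init__()
        self.review = {}
        self._field = None  # text field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class")
        if css_class in self.TEXT_FIELDS:
            self._field = css_class
        elif css_class == "rating":
            # In this made-up markup the rating lives in an attribute, not in text.
            self.review["rating"] = int(attrs["data-stars"])

    def handle_data(self, data):
        if self._field:
            self.review[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        self._field = None


parser = ReviewParser()
parser.feed(SAMPLE)
review = parser.review
print(review)
```

With BeautifulSoup the same lookups would typically be one-line `find` calls, but the idea is the same: map known elements to named fields.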

In addition, from the review page itself, we have the following pieces of information:

<ul>
<li>Country</li>
<li>City</li>
<li>Attraction Name</li>
<li>TripAdvisor Designated Categories</li>
</ul>

These fields can then be stored together as a single row per review in a SQLite database.
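A minimal sketch of what storing one review could look like with Python's built-in `sqlite3` module follows. The table name, column names, and sample values are assumptions made for illustration; the schema actually used by the scraper is not described here.

```python
import sqlite3

# Hypothetical schema covering the fields listed above.
COLUMNS = (
    "review_id", "country", "city", "attraction", "categories",
    "title", "body", "rating", "review_date", "language", "username",
)

conn = sqlite3.connect(":memory:")  # a file path would be used for a real crawl
conn.execute(
    "CREATE TABLE reviews (review_id TEXT PRIMARY KEY, country TEXT, city TEXT, "
    "attraction TEXT, categories TEXT, title TEXT, body TEXT, rating INTEGER, "
    "review_date TEXT, language TEXT, username TEXT)"
)

INSERT_SQL = "INSERT INTO reviews ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join(":" + column for column in COLUMNS)
)


def save_review(conn, review):
    """Insert one review dict as a single row; it must have every key in COLUMNS."""
    conn.execute(INSERT_SQL, review)
    conn.commit()


# Made-up example row.
save_review(conn, {
    "review_id": "r1", "country": "Vietnam", "city": "Hanoi",
    "attraction": "Old Quarter", "categories": "Neighborhoods",
    "title": "Lovely walk", "body": "Busy but charming streets.",
    "rating": 5, "review_date": "2015-03-02", "language": "en",
    "username": "traveler42",
})
```

Named placeholders keep the values out of the SQL string, which avoids quoting problems with free-form review text.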

In all, it took about a week or so of scraping to amass the final database:
<ul>
<li>430k reviews, 200k users, 4.5k reviewed attractions at 600MB.</li></ul>

### Lessons learned
#### Bookmarks to pick up where the crawl left off
The crawler was mostly automatic. However, crashes due to a faulty internet connection or an upset OS were inevitable.  A bookmark recording the precise page last scraped saved a lot of time.  This was more difficult to implement than anticipated, since the scraping order is not explicitly given.
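One way such a bookmark could work is sketched below. It is an illustration, not the scraper's actual mechanism: it assumes the crawl visits pages in a repeatable order (for example, sorted by name), and the file name and position keys are hypothetical.

```python
import json
import os
import tempfile


def save_bookmark(path, position):
    """Write the crawl position to a temp file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(position, handle)
    # Replacing the file in one step avoids a half-written bookmark after a crash.
    os.replace(tmp_path, path)


def load_bookmark(path):
    """Return the saved position, or None when starting a fresh crawl."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None


bookmark_file = os.path.join(tempfile.mkdtemp(), "bookmark.json")
fresh_start = load_bookmark(bookmark_file)

# Hypothetical position: which page of which attraction was scraped last.
save_bookmark(bookmark_file, {
    "country": "Laos", "city": "City 1",
    "attraction": "Attraction 2", "review_page": 7,
})
resumed = load_bookmark(bookmark_file)
```

On restart, the crawler would skip everything that sorts before the saved position and continue from the following review page.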

#### Redundancy checking
The quantity of redundant reviews on the website was not anticipated.  After implementing a feature to check the database for attractions that had already been scraped, scraping sped up considerably.
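A redundancy check of this kind can be sketched as a lookup before each fetch. The table layout, the use of a URL as the unique key, and the sample queue are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per attraction that has been fully scraped.
conn.execute("CREATE TABLE attractions (url TEXT PRIMARY KEY, name TEXT)")


def already_scraped(conn, url):
    """Return True if this attraction is already recorded in the database."""
    row = conn.execute(
        "SELECT 1 FROM attractions WHERE url = ?", (url,)
    ).fetchone()
    return row is not None


def mark_scraped(conn, url, name):
    """Record an attraction; the primary key makes repeat inserts harmless."""
    conn.execute(
        "INSERT OR IGNORE INTO attractions (url, name) VALUES (?, ?)",
        (url, name),
    )
    conn.commit()


# Made-up queue in which the same attraction appears twice.
queue = [
    ("/attraction/1", "Old Quarter"),
    ("/attraction/2", "Tour A"),
    ("/attraction/1", "Old Quarter"),
]

visited = []
for url, name in queue:
    if already_scraped(conn, url):
        continue  # skip the expensive page loads entirely
    visited.append(url)  # the real scraper would fetch review pages here
    mark_scraped(conn, url, name)
```

The saving comes from skipping the browser round trips: a local lookup is far cheaper than loading every review page of an attraction a second time.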

On to the <a href="http://nbviewer.ipython.org/github/arhee/tripadvisor/blob/master/tripadvisor.ipynb">Analysis</a>.