# Analyzing the SEAsia TripAdvisor Dataset

### Summary

<ul>
<li>A handful of cities receive a large share of reviews.</li>
<li>Reviews can be mined to find non-obvious characteristics of tourist attractions.</li>
<li>Tourism trends can possibly be extracted from reviews.</li>
</ul>

### Introduction

After graduating, I took some time off to travel the world. One area that I particularly enjoyed visiting was Southeast Asia. So for a side project, I decided to scrape the TripAdvisor website for all attraction reviews in Cambodia, Laos, and Vietnam, mine the data for insights, and build a <a href="http://nbviewer.ipython.org/github/arhee/tripadvisor/blob/master/model.ipynb">recommendation algorithm</a>.

<b>Data</b>
<ul>
<li>430k reviews, 200k users, 5k reviewed attractions at 800MB.</li>
</ul>

<b>Processing</b>
<ul>
<li>Scraped, cleaned, and processed with Python</li>
<li>Stored in SQLite</li>
<li>Visualizations done in R</li>
</ul>
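
The post does not show the storage layer, so the following is only a minimal sketch of how cleaned reviews might be kept in SQLite using Python's built-in `sqlite3` module. The table and column names (`attractions`, `reviews`, `city`, `lang`, and so on) and the helper functions `open_db` and `add_review` are assumptions made for illustration, not the schema actually used.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS attractions (
    attraction_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    city          TEXT NOT NULL,
    country       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS reviews (
    review_id     INTEGER PRIMARY KEY,
    attraction_id INTEGER NOT NULL REFERENCES attractions(attraction_id),
    user_id       TEXT NOT NULL,
    review_date   TEXT NOT NULL,   -- ISO 8601 date, e.g. '2014-01-15'
    lang          TEXT NOT NULL,   -- language label, e.g. 'en'
    rating        INTEGER,
    body          TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_review(conn, attraction_id, user_id, review_date, lang, rating, body):
    """Insert one cleaned review and return its row id."""
    cur = conn.execute(
        "INSERT INTO reviews "
        "(attraction_id, user_id, review_date, lang, rating, body) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (attraction_id, user_id, review_date, lang, rating, body),
    )
    conn.commit()
    return cur.lastrowid


if __name__ == "__main__":
    conn = open_db()
    conn.execute(
        "INSERT INTO attractions (attraction_id, name, city, country) "
        "VALUES (?, ?, ?, ?)",
        (1, "Angkor Wat", "Siem Reap", "Cambodia"),
    )
    add_review(conn, 1, "user_a", "2014-01-15", "en", 5, "Stunning at sunrise.")
    print(conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0])
```

Keeping the city and country on the attraction row, rather than on each review, is one possible design choice: it means per-city aggregates need a join, but each review row stays small.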

### A few cities take the lion's share of reviews

<img src="figs/city_popularity.png" style="max-height: 400; max-width: 600px;">

We can see that the majority of reviews go to a small number of cities. In particular, the highest number of reviews belongs to the town of Siem Reap, home to the world-famous Angkor Wat. There is also a high number of reviews for the most populous urban areas (Ho Chi Minh City, Hanoi) and a popular tourist town (Hoi An), followed by a steep drop-off and a long tail of many smaller cities. This aligns well with <a href="">observed power laws</a> in user participation and attraction popularity.
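
As a rough illustration of how this concentration could be quantified, the sketch below counts reviews per city and reports the combined share of the top `k` cities. The function name `top_city_share` and the input format (one city name per review) are assumptions for the example; the numbers in the usage snippet are toy data, not values from the dataset.

```python
from collections import Counter


def top_city_share(review_cities, k=4):
    """Return the top-k cities by review count and their combined share.

    `review_cities` is an iterable with one city name per review.
    """
    counts = Counter(review_cities)
    total = sum(counts.values())
    top = counts.most_common(k)
    share = sum(n for _, n in top) / total if total else 0.0
    return top, share


if __name__ == "__main__":
    # Toy data for illustration only.
    toy = ["Siem Reap"] * 5 + ["Hanoi"] * 3 + ["Hue"] * 2
    print(top_city_share(toy, k=2))
```

A share close to 1 for a small `k` would be consistent with the long-tailed distribution described above.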

### The exponential rise of TripAdvisor and the periodicity of tourism

<img src="figs/country_time.png" style="max-height: 400; max-width: 600px;">

By plotting the number of reviews for each country, we notice an exponential increase in the number of reviews over time, likely due to the growing adoption of TripAdvisor as a review platform. We also observe that the number of reviews for Vietnam increased faster than that for either Cambodia or Laos, possibly indicative of increased tourism to Vietnam.
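
One way to put a number on "exponential" is to fit a straight line to the logarithm of the yearly review counts: if counts grow roughly as `c * exp(r * year)`, the slope of that line estimates `r`. The sketch below does this with a plain least-squares fit. The function names and the assumption that review dates are stored as ISO `YYYY-MM-DD` strings are illustrative, and this is not necessarily how the plot above was produced.

```python
import math
from collections import Counter


def yearly_counts(review_dates):
    """Count reviews per year from ISO 'YYYY-MM-DD' date strings."""
    return Counter(int(d[:4]) for d in review_dates)


def exponential_growth_rate(counts_by_year):
    """Least-squares slope of log(count) against year.

    Years with zero reviews are skipped because log(0) is undefined.
    """
    points = [(y, math.log(n)) for y, n in sorted(counts_by_year.items()) if n > 0]
    if len(points) < 2:
        raise ValueError("need at least two years with reviews")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


if __name__ == "__main__":
    # Toy counts that double every year, for illustration only.
    print(exponential_growth_rate({2010: 100, 2011: 200, 2012: 400}))
```

Running the fit separately per country and comparing the slopes would be one way to check whether one country's review count is growing faster than another's.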

There is also a prominent periodicity in Cambodian reviews, peaking in Jan-Feb, which is likely due to an influx of visitors to Angkor Wat in the <a href="http://www.traveldudes.org/travel-tips/climate-and-best-time-visit-angkor-wat-cambodia/2177">Dec-Feb timeframe</a>. On the other hand, for those interested in avoiding the tourist crowds, Laos remains a sleepy and unreviewed country.
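
A simple way to look for this kind of seasonality is to pool reviews by calendar month and compare the shares. The sketch below assumes ISO `YYYY-MM-DD` date strings and uses a hypothetical helper name, `calendar_month_profile`.

```python
from collections import Counter


def calendar_month_profile(review_dates):
    """Share of reviews falling in each calendar month (1 to 12).

    `review_dates` is an iterable of ISO 'YYYY-MM-DD' strings.
    """
    counts = Counter(int(d[5:7]) for d in review_dates)
    total = sum(counts.values())
    return {m: (counts.get(m, 0) / total if total else 0.0) for m in range(1, 13)}


if __name__ == "__main__":
    # Toy dates for illustration only.
    toy = ["2013-01-05", "2013-01-20", "2013-02-11", "2013-07-01"]
    profile = calendar_month_profile(toy)
    print(max(profile, key=profile.get))
```

Because the overall review volume rises over time, pooling all years together tilts the profile toward later months and toward any partial final year. A more careful version would compute the shares within each year before averaging.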

### English as lingua franca (on TripAdvisor)

<img src="figs/location_lang_histo.png" style="max-height: 400; max-width: 600px;">

Next, I wondered how well my observations while traveling were reflected in the data.

The first general observation was that English was the de facto tourist lingua franca everywhere. The tourist infrastructure (signs, menus, hotels, etc.) in all the countries I visited was in English. As a result, conversations in broken English between two non-native speakers were a common occurrence. With the large caveat that TripAdvisor may be more popular in English-speaking countries, the data reflects the dominance of English.

The cities of Nha Trang and Phan Thiet (Mui Ne) stand out for their proportion of Russian reviews, a consequence of being well-known tropical getaways for Russian tourists.

There are a fair number of reviews in French as well. This is unsurprising, as Cambodia, Laos, and Vietnam were formerly part of French Indochina.
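
The per-city language breakdown behind a chart like the one above can be sketched as follows. The code assumes each review already carries a language label; how those labels were obtained is not described in this post, and the function name `language_shares` is hypothetical.

```python
from collections import Counter, defaultdict


def language_shares(reviews):
    """Map each city to {language: share of that city's reviews}.

    `reviews` is an iterable of (city, lang) pairs.
    """
    per_city = defaultdict(Counter)
    for city, lang in reviews:
        per_city[city][lang] += 1
    return {
        city: {lang: n / sum(langs.values()) for lang, n in langs.items()}
        for city, langs in per_city.items()
    }


if __name__ == "__main__":
    # Toy data for illustration only.
    toy = [("Nha Trang", "ru"), ("Nha Trang", "en"), ("Hanoi", "en")]
    print(language_shares(toy))
```

Working with shares rather than raw counts keeps a heavily reviewed city from drowning out a smaller one when comparing language mixes.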

### Where should I be careful?

Unfortunately, theft is not an uncommon part of the tourist experience in SEAsia.  In Vietnam, thieves would typically either slash a woman's purse in the marketplace or do a drive-by motorcycle snatching.  The other favored approach was to wait until tourists were in the ocean and then steal their valuables from the beach or the hotel.

Most tourist websites mention theft, but rarely quantify it. With the TripAdvisor dataset, however, it may be possible to estimate the comparative frequency of theft.

Stop words were removed from the reviews, and the remaining words were stemmed. Any review containing a word strongly related to theft (pickpocket, stole) was marked as theft-related and counted.
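
The filtering step described above can be sketched in a few lines of plain Python. Several simplifications here are assumptions rather than the method actually used: the stop-word list is a tiny illustrative one, prefix matching stands in for a real stemmer, and the per-city rate (flagged reviews divided by all reviews for that city) is added so that cities of different sizes are comparable, whereas the text above only says that flagged reviews were counted.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "my", "was", "were", "in", "at", "of", "to", "i"}

# "pickpocket" and "stole" come from the post. Matching on prefixes is a crude
# stand-in for stemming, so "pickpockets" and "stolen" also match.
THEFT_STEMS = ("pickpocket", "stole")


def tokens(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def is_theft_related(text):
    """True if any token starts with one of the theft stems."""
    return any(tok.startswith(THEFT_STEMS) for tok in tokens(text))


def theft_rate_by_city(reviews):
    """Fraction of each city's reviews flagged as theft-related.

    `reviews` is an iterable of (city, text) pairs.
    """
    totals, flagged = {}, {}
    for city, text in reviews:
        totals[city] = totals.get(city, 0) + 1
        flagged[city] = flagged.get(city, 0) + int(is_theft_related(text))
    return {city: flagged[city] / totals[city] for city in totals}


if __name__ == "__main__":
    # Toy reviews for illustration only.
    toy = [
        ("Nha Trang", "My phone was stolen at the beach"),
        ("Nha Trang", "Lovely sand and warm water"),
        ("Hue", "A peaceful citadel"),
    ]
    print(theft_rate_by_city(toy))
```

Keyword matching of this kind also flags reviews that mention theft without reporting one (for example, "no pickpockets here"), so the resulting rates are best read as how often theft is mentioned, not how often it occurs.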

For a benchmark, the American city of Baltimore was included.

<img src="figs/pickpocket_cities_histo.png" style="max-height: 400; max-width: 600px;">


<b>Lock up your valuables while at the beach and in busy cities!</b>

The top three cities with reviews mentioning theft are popular touristy beach towns, while the next four are the major cities of the region. Notably, theft is mentioned far more often in all of these locations than in Baltimore. This aligns well with my personal experiences. At one point while in Ho Chi Minh City, roughly 20% of my hostel had been victims of some sort of theft!

Are there specific places to be wary of?

<img src="figs/pickpocket_locs_histo.png" style="max-height: 400; max-width: 600px;">

<b>Thieves like busy markets!</b>

With the notable exception of the Why Not Bar, theft is most often mentioned in the reviews of markets in busy cities. Notably, the locations listed for Ho Chi Minh City are the major backpacker areas. All three areas are within ~500m of each other and are favorites of pickpockets and thieves on motorcycles. The inclusion of the Angkor National Museum is likely due to users commenting on cultural artefacts that were stolen.

How about scams, the bane of any tourist's existence?

<img src="figs/scam_locs_histo.png" style="max-height: 400; max-width: 600px;">

<b>Angkor What?</b>

Five of the top six entries are reviews of Tonle Sap lake, near Angkor Wat. It's interesting that mentions of scams are so localized.

### Conclusions

The TripAdvisor dataset is rich in text that can be mined for local characteristics that would otherwise stay buried under thousands of reviews.

In addition, we observe that the TripAdvisor data may reflect real demographic trends in tourism. This could be useful to operators in the tourism industry as well as to tourists themselves.