# TripAdvisor Rating Prediction

This is the last part in a three-part series.

1. Scraping
2. <a href="http://nbviewer.ipython.org/github/arhee/tripadvisor/blob/master/tripadvisor.ipynb">Analysis</a>
3. <a href="http://nbviewer.ipython.org/github/arhee/tripadvisor/blob/master/model.ipynb">Rating Prediction</a>

### Introduction

With the TripAdvisor dataset, I wanted to see if I could build a model to predict a user's rating of an attraction within the computational limits of my laptop.

<b>Tools</b>
<ul>
<li>Modeling done with Python.  Code available <a href="https://github.com/arhee/tripadvisor/tree/master/models">here</a></li>
<li>Visualizations done in R</li>
</ul>

An analysis of the dataset is available <a href="http://nbviewer.ipython.org/github/arhee/tripadvisor/blob/master/tripadvisor.ipynb">here</a>.

### Framework

To begin, we define the problem as predicting a user's rating of an attraction.  The objective is to minimize the root-mean-square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{(u,i)} (r_{u,i} - \hat{r}_{u,i})^2},$$

where $r_{u,i}$ is the rating given by user $u$ for item $i$, $\hat{r}_{u,i}$ is the model's estimate, and $n$ is the number of ratings.
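As a minimal sketch of the metric, the RMSE is the square root of the mean squared difference between actual and predicted ratings. The function name `rmse` and the toy ratings are illustrative assumptions, not taken from the project code.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(r - r_hat) ** 2 for r, r_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy ratings, made up for illustration (not from the dataset).
toy_actual = [5, 4, 3]
toy_predicted = [4, 4, 4]
print(rmse(toy_actual, toy_predicted))
```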

We use a linear model because:
<ul>
<li> Linear models have low computational overhead </li>
<li> The squared-error objective is convex, so optimization converges to the global minimum </li>
<li> Regression models are likely to outperform classification models on RMSE </li>
</ul>

The model therefore takes the following form:

$$\hat{r}_{u,i} = \mu + \sum_{n\in\omega} b_{n},$$

where $\mu$ is the global average rating, and $\sum_{n\in\omega} b_{n}$ is the sum of all the biases $b$ in the set $\omega$.

The corresponding cost function, minimized via stochastic gradient descent, is:

$$\min_{b_{*}}\sum_{(u,i)\in\kappa}\left(r_{u,i} - \left(\mu + \sum_{n\in\omega} b_{n}\right)\right)^2 + \lambda\sum_{n\in\omega} b_{n}^2,$$

where $\kappa$ is the set of all user reviews, and $\lambda$ is the regularization parameter.  
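A rough sketch of how such a cost function could be minimized with stochastic gradient descent is shown below. It is not the project's actual code: the function names, the representation of a review as a tuple of bias keys plus a rating, and the values of `lam`, `lr`, and `epochs` are all assumptions made for illustration. The penalty is applied to each bias whenever a review touches it, which is a common SGD convention.

```python
import random
from collections import defaultdict

def fit_biases(reviews, mu, lam=0.02, lr=0.05, epochs=100, seed=0):
    """Fit additive biases by SGD on the regularized squared error.

    reviews: list of (bias_keys, rating), where bias_keys names the biases
             that apply to that review, e.g. (("attraction", "temple"),).
    """
    rng = random.Random(seed)
    biases = defaultdict(float)  # every bias starts at 0
    order = list(reviews)
    for _ in range(epochs):
        rng.shuffle(order)
        for keys, rating in order:
            err = rating - (mu + sum(biases[k] for k in keys))
            for k in keys:
                # Step against the gradient of err^2 + lam * b^2
                # (the constant factor 2 is folded into lr).
                biases[k] += lr * (err - lam * biases[k])
    return dict(biases)

def predict(biases, mu, keys):
    """Predicted rating: global average plus every known bias that applies."""
    return mu + sum(biases.get(k, 0.0) for k in keys)

# Toy reviews, made up for illustration (not from the dataset).
toy_reviews = [
    ((("attraction", "temple"),), 5),
    ((("attraction", "temple"),), 5),
    ((("attraction", "temple"),), 5),
    ((("attraction", "mall"),), 3),
    ((("attraction", "mall"),), 3),
    ((("attraction", "mall"),), 4),
]
mu = sum(rating for _, rating in toy_reviews) / len(toy_reviews)
biases = fit_biases(toy_reviews, mu)
print(predict(biases, mu, (("attraction", "temple"),)))
```

Because each review carries an arbitrary tuple of keys, the same routine covers any combination of biases: adding a user bias only means adding a `("user", ...)` key to each review.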

We evaluate each model with 5-fold cross-validation to guard against over-fitting, using a constant seed so that every model is scored on the same folds.
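The fold construction could look something like the following sketch, which uses only the standard library. The helper name `kfold_indices` and the seed value 42 are assumptions; the point is only that a fixed seed reproduces the same folds on every run, so RMSE scores from different models are comparable.

```python
import random

def kfold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # constant seed -> identical folds every run
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Each sample appears in exactly one test fold.
for train_idx, test_idx in kfold_indices(10):
    print(len(train_idx), len(test_idx))
```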

### Biases

The general strategy is to search for and evaluate biases independently, checking whether each one improves upon the global average $\mu$ or another baseline chosen for comparison.  Evaluating biases independently saves computation time and reduces complexity.

However, this approach can produce false positives: two biases that are not truly independent may each report an RMSE improvement that does not carry over when they are combined.  To address this, after collecting all the biases we run a greedy search for the combination that gives the best RMSE.

#### Global Average
To begin, we evaluate the global average.  A histogram of the TripAdvisor ratings shows that they are heavily skewed towards the upper end: tourists are very satisfied, and the average rating is a high 4.4 out of 5 stars.

<img src="figs/ratings_histo.png" style="max-height: 400; max-width: 600px;">

By predicting every rating with the global average, using the following model:

$$\hat{r}_{u,i} = \mu$$

we obtain an RMSE of 0.903.  We set this as the baseline to improve upon.

#### Attraction Bias
One obvious possible bias is the attraction itself.  People are likely to rate an attraction close to the attraction average.  With the inclusion of an attraction bias $b_{a}$ in the following model:

$$\hat{r}_{u,i} = \mu + b_{a}$$

we find a considerable RMSE improvement of 0.123 over the global average.

#### User Bias
Another obvious bias is the user.  We all have that one grouchy friend who rates everything fairly negatively.  With the inclusion of a user bias $b_{u}$ in the model $\hat{r}_{u,i} = \mu + b_{u}$, we find a modest RMSE improvement of 0.030 over the global average.

<img src="figs/reviews_user_histo.png" style="max-height: 400; max-width: 600px;">

The likely cause of such a small improvement is that users in this dataset overwhelmingly rate only a single attraction.  There is a noticeable lack of stickiness: users don't keep rating attractions, and reviewing seems to be a one-off endeavour.  Over 97% of users rate fewer than 5 items.  Without a rich user history, predicting a user's future ratings is difficult.
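A statistic like the share of users with fewer than 5 reviews can be computed with a simple count per user. The sketch below assumes one user ID per review; the function name and the toy IDs are hypothetical.

```python
from collections import Counter

def share_of_light_users(user_ids, threshold=5):
    """Fraction of users who wrote fewer than `threshold` reviews.

    user_ids: one entry per review, naming the user who wrote it.
    """
    counts = Counter(user_ids)
    light = sum(1 for c in counts.values() if c < threshold)
    return light / len(counts)

# Toy review log, made up for illustration: u4 is the only prolific user.
toy_user_ids = ["u1", "u2", "u2", "u3"] + ["u4"] * 6
print(share_of_light_users(toy_user_ids))
```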

#### Attraction Group Bias

It is plausible that individuals have preferences for categories of attractions.  Someone might be positively biased towards historical landmarks rather than ecotours.  We term this bias $b_{u,g}$ for user $u$ and attraction group $g$.

Initially, I investigated a model using the pre-determined groups given on TripAdvisor: Eco Tours, Nature & Wildlife Areas, etc.  Unfortunately, there was no improvement in model prediction.

Instead, we can construct our own groups.  By aggregating all the reviews for each attraction, we construct an attraction "document".  These documents are stripped of punctuation, stemmed, and tokenized, and the token counts are weighted into a term frequency-inverse document frequency (tf-idf) matrix.  Non-negative matrix factorization then reduces this matrix to a small set of features for each document.

With a set of features describing each document, we can cluster the attractions into categories with k-means.  Using these categories in the user-group bias model, $\hat{r}_{u,i} = \mu + b_{u,g}$, improves the RMSE by 0.05 over the user bias model, $\hat{r}_{u,i} = \mu + b_{u}$.
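The grouping pipeline described above (tf-idf, then non-negative matrix factorization, then k-means) might be sketched with scikit-learn as follows. This is an assumption about tooling rather than the author's actual code: the function name, the number of features and groups, and the toy documents are all illustrative, and stemming is left out to keep the sketch short.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_attractions(documents, n_features=2, n_groups=2, seed=0):
    """Group attraction "documents" via tf-idf -> NMF features -> k-means."""
    # The default tokenizer lowercases and drops punctuation (no stemming here).
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)
    # Non-negative features for each document.
    nmf = NMF(n_components=n_features, init="nndsvda",
              random_state=seed, max_iter=500)
    features = nmf.fit_transform(tfidf)
    # Cluster the documents in feature space.
    kmeans = KMeans(n_clusters=n_groups, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(features)
    return features, labels

# Toy attraction documents, made up for illustration (not from the dataset).
toy_documents = [
    "ancient temple ruins with stone carvings and a long history",
    "temple history, ancient stone ruins, quiet grounds",
    "beach with white sand, snorkeling and boat trips to the island",
    "island beach, boat rides, sand and snorkeling",
]
features, labels = cluster_attractions(toy_documents)
print(labels)
```

The resulting label for each attraction can then serve as the group $g$ in $b_{u,g}$.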

#### Language Bias

This bias was surprising.  It turns out that reviews written in Japanese were more critical of attractions than reviews in any other language.

<img src="figs/lang_rating.png" style="max-height: 400; max-width: 600px;">

It is possible that the reviewers writing in Japanese simply visited worse attractions.  To control for this, we compared the attraction model, $\hat{r}_{u,i} = \mu + b_{a}$, with the following:

$$\hat{r}_{u,i} = \mu + b_{a} + b_{l},$$

where $b_{l}$ is the language correction term.  We find a small RMSE improvement of 0.02 over the attraction model.  The improvement is small because reviews written in Japanese make up only 3% of all reviews.

#### Temporal Biases

Given that Southeast Asia has both a rainy and a dry season, one would suspect some seasonality in the ratings: higher ratings during the dry season and poorer ratings during the rainy season.  We did not find any such pattern at the country or city level.  However, some attractions showed temporal patterns that were consistent across 7 years.  With this knowledge, we construct the following model:

$$\hat{r}_{u,i} = \mu + b_{a,m},$$

where $b_{a,m}$ is the bias for attraction $a$ in month $m$, and $a \in \tau$, where $\tau$ is the set of all attractions that show monthly periodicity.  We find a small 0.02 improvement in RMSE over the attraction model, $\hat{r}_{u,i} = \mu + b_{a}$.
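To illustrate how a bias can be keyed on an (attraction, month) pair, the sketch below estimates each $b_{a,m}$ as the mean residual from the global average. This is a simplification: it skips the regularized SGD fit, and it takes the set of periodic attractions as given, since how that set is detected is not covered here. All names and toy values are assumptions.

```python
from collections import defaultdict

def attraction_month_biases(reviews, mu, periodic_attractions):
    """Mean residual (rating - mu) per (attraction, month) pair.

    reviews: iterable of (attraction, month, rating) tuples.
    periodic_attractions: the attractions that get a monthly bias.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for attraction, month, rating in reviews:
        if attraction in periodic_attractions:
            totals[(attraction, month)] += rating - mu
            counts[(attraction, month)] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Toy reviews, made up for illustration (not from the dataset).
toy_reviews = [
    ("beach", 1, 5),
    ("beach", 1, 4),
    ("beach", 7, 2),
    ("museum", 1, 3),  # not in the periodic set, so it gets no monthly bias
]
monthly = attraction_month_biases(toy_reviews, mu=3.5, periodic_attractions={"beach"})
print(monthly)
```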

#### Final Model

To determine the final model, we run a greedy search.  By incrementally adding the best-performing bias until the RMSE stops improving, we obtain the following model:

$$\hat{r}_{u,i} = \mu + b_{a} +  b_{a,m} + b_{u}$$

with a total RMSE of 0.762, a considerable 16% improvement over the global-average baseline of 0.903.  Notably, the small gains from the language bias and the group bias were eliminated, leaving behind a relatively simple model.
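The greedy search can be sketched as forward selection: at each step, try adding each remaining bias, keep the one that lowers the score most, and stop when nothing helps. The `evaluate` callable stands in for a cross-validated RMSE run, and the candidate names and gains below are invented purely to exercise the loop; none of this is the project's actual code.

```python
def greedy_select(candidates, evaluate, baseline):
    """Forward selection: add the best candidate until the score stops improving.

    evaluate: callable taking a list of candidates and returning a score
              where lower is better (e.g. cross-validated RMSE).
    """
    selected = []
    best = baseline
    remaining = list(candidates)
    while remaining:
        scores = {c: evaluate(selected + [c]) for c in remaining}
        winner = min(scores, key=scores.get)
        if scores[winner] >= best:
            break  # no remaining candidate improves the score
        selected.append(winner)
        best = scores[winner]
        remaining.remove(winner)
    return selected, best

# Toy scoring function, made up for illustration: "z" makes things worse.
toy_gains = {"x": 0.12, "y": 0.03, "z": -0.01}

def toy_evaluate(chosen):
    return 1.0 - sum(toy_gains[c] for c in chosen)

selected, best = greedy_select(["z", "y", "x"], toy_evaluate, baseline=1.0)
print(selected, best)
```

In a real run, biases whose individual gains overlap with an already selected bias would show little or no improvement at this stage and be left out.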

### Conclusion

The biggest gains came from biases backed by rich data, such as the attraction bias.  Absent a rich user history, a user's ratings are hard to predict: this is the cold start problem.