# APMTH 207: Advanced Scientific Computing: 
## Stochastic Methods for Data Analysis, Inference and Optimization
## Long Homework #3
**Harvard University**<br>
**Spring 2017**<br>
**Instructors: Rahul Dave**<br>
**Due Date:** Friday, April 14th, 2017 at 11:59pm

**Instructions:**

- Upload your final answers as well as your iPython notebook containing all work to Canvas.

- Structure your notebook and your work to maximize readability.

In this course, we've spent a lot of time learning algorithms for performing inference on complex models and we've spent time using these models to make decisions regarding our data. But in nearly every assignment, the model for the data is specified in the problem statement. In real life, the creative and, arguably, much more difficult task is to start with a broadly defined goal and then to customize or create a model which will meet this goal in some way. In this long homework, we will lead you through the process of model building in simulated real-life conditions. 

In the dataset called "sample_reviews", you'll find a fairly representative selection of Yelp reviews for a (now closed) sushi restaurant called Ino's Sushi in San Francisco. The goal in this assignment is to build a model to help a machine classify any given restaurant (or qualities of a restaurant) as "good" or "bad" given Yelp reviews. 

Problem #1 is atypical in that it does not involve any programming or (necessarily) difficult mathematics or statistics. However, answering these questions *seriously* will give you an idea of how one might create or select a model for a particular application, and your answers will help you formalize the model in Problem #2, which is much more technically involved.


## Problem #1: Understanding Yelp Review Data As a Human

***Grading:*** *We want you to make a genuine effort to mold an ambiguous and broad real-life question into a concrete data science or machine learning problem without the pressure of getting the "right answer". As such, we will grade your answer to Problem #1 on a pass/fail basis. Any reasonable answer that demonstrates actual effort will be given a full grade.*

Read the reviews and form an opinion regarding the various qualities of Ino's Sushi. Answer the following:

- If the task is to summarize the quality of a restaurant in a simple and intuitive way, what might be problematic about classifying this restaurant as simply "good" or "bad"? Justify your answers with specific examples from the dataset.


- For Ino's Sushi, categorize the food and the service, separately, as "good" or "bad" based on all the reviews in the dataset. Be as systematic as you can when you do this.

  (**Hint:** Begin by summarizing each review. For each review, summarize the reviewer's opinion on two aspects of the restaurant: food and service. That is, generate a classification ("good" or "bad") for each aspect based on what the reviewer writes.) 
  
  
- Identify statistical weaknesses in breaking each review down into an opinion on the food and an opinion on the service. That is, identify types of reviews that make your method of summarizing the reviewer's opinion on the quality of food and service problematic, if not impossible. Use examples from your dataset to support your argument. 


- Identify all the ways in which the task in bullet #2 might be difficult for a machine to accomplish. That is, break down the classification task into simple, self-contained subtasks and identify how each subtask can be accomplished by a machine (which area of machine learning, e.g. topic modeling or sentiment analysis, addresses this type of task).


- Describe a complete pipeline for processing and transforming the data to obtain a classification for both food and service for each review. (You are welcome to use our schema in Problem #2 to help you do this).

## Problem #2: Modeling Your Understanding

In the dataset "reviews_processed.csv", you'll find a database of Yelp reviews for a number of restaurants. These reviews have already been processed and transformed by someone who has completed the (pre) modeling process described in Problem #1. That is, imagine the dataset in "reviews_processed.csv" is the result of feeding the raw Yelp reviews through the pipeline someone built for Problem #1.

The following is a full list of columns in the dataset and their meanings:

I. Relevant to Part A and B:

  1. "review_id" - the unique identifier for each Yelp review
  2. "topic" - the subject addressed by the review (0 stands for food and 1 stands for service)
  3. "rid" - the unique identifier for each restaurant
  4. "count" - the number of sentences in a particular review on a particular topic
  5. "mean" - the probability of a sentence in a particular review on a particular topic being positive, averaged over total number of sentences in the review related to that topic.
  6. "var" - the variance of the probability of a sentence in a particular review on a particular topic being positive, taken over all sentences in the review related to that topic.
  7. (only relevant

II. Relevant (possibly) to Extra Credit:

  1. "uavg" - the average star rating given by a particular reviewer (taken across all their reviews)
  2. "stars" - the number of stars given in a particular review
  3. "max" - the max probability of a sentence in a particular review on a particular topic being positive
  4. "min" - the min probability of a sentence in a particular review on a particular topic being positive
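
To make the "count", "mean", "var", "max" and "min" columns concrete, here is a minimal sketch of how such summaries could be computed for one review and one topic. The input `sentence_probs`, the function name `summarize_topic` and the toy values are hypothetical. The dataset description does not say whether the population or the sample variance is used; the sketch assumes the population variance.

```python
import numpy as np

def summarize_topic(sentence_probs):
    """Summarize per-sentence positivity probabilities for one review and one topic."""
    p = np.asarray(sentence_probs, dtype=float)
    return {
        "count": int(p.size),
        "mean": float(p.mean()),
        "var": float(p.var()),  # population variance (ddof=0); an assumption
        "max": float(p.max()),
        "min": float(p.min()),
    }

# Made-up probabilities for three sentences about one topic in one review.
summary = summarize_topic([0.9, 0.7, 0.8])
```

Under this definition, a review with a single sentence on a topic has a variance of exactly zero.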

The following schema illustrates the model of the raw data that is used to generate "reviews_processed.csv":
<img src="restuarant_model.pdf">

***Warning:*** *this is a "real" data science problem in the sense that the dataset in "reviews_processed.csv" is large. We understand that a number of you have limited computing resources, so you are encouraged but not required to use the entire dataset. If you wish you may use 10 rows from the dataset, as long as your choice of 10 contains a couple of restaurants with a large number of reviews and a couple with a small number of reviews.*
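
If you work with a subset, one way to pick it is sketched below. This is only an illustration: it assumes the subset is chosen at the restaurant level, and the function name `pick_restaurants` and its parameters are hypothetical. Only the column names "rid" and "review_id" come from the dataset description.

```python
import pandas as pd

def pick_restaurants(df, n_large=2, n_small=2):
    """Keep rows for the restaurants with the most and the fewest reviews."""
    # Number of distinct reviews per restaurant, sorted from most to fewest.
    n_reviews = df.groupby("rid")["review_id"].nunique().sort_values(ascending=False)
    chosen = list(n_reviews.head(n_large).index) + list(n_reviews.tail(n_small).index)
    return df[df["rid"].isin(chosen)]

# Toy stand-in for the real data; with the actual file one would start from
# df = pd.read_csv("reviews_processed.csv")
toy = pd.DataFrame({
    "rid": ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 1,
    "review_id": range(15),
})
subset = pick_restaurants(toy)
```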

### Part A: Modeling

When the value in "count" is low, the "mean" value can be very skewed (refer to your answers for Problem #1 to see why this is a problem if we are interested in summarizing the reviewer's opinion on each aspect of a restaurant).

Following the SAT prep school example discussed in lab (and using your answers for Problem #1), set up a Bayesian model for a reviewer $i$'s opinion of restaurant $j$'s food and service, separately. That is, you will have a model for each restaurant and each aspect (food and service). For restaurant $j$, you will have a model for $\{\theta_{ij}^{\text{food}}\}$ and one for $\{\theta_{ij}^{\text{service}}\}$, where $\theta_{ij}$ is the positivity of the opinion of the $i$-th reviewer regarding the $j$-th restaurant. 
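
For reference, the prep school example has a hierarchical normal structure. Written with the indices used here, one possible reading (not the required answer) is shown below. The symbols $\mu_j$ and $\tau_j$ are notation introduced for this sketch, the priors on them are left unspecified, and the same structure would be fitted separately for food and for service.

$$
\begin{aligned}
\bar{y}_{ij} \mid \theta_{ij} &\sim \mathcal{N}\left(\theta_{ij}, \sigma_{ij}^2\right), \\
\theta_{ij} \mid \mu_j, \tau_j &\sim \mathcal{N}\left(\mu_j, \tau_j^2\right).
\end{aligned}
$$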

**Hint:** what quantity in our data naturally corresponds to the $\bar{y}$'s in the prep school example? How would you calculate the parameter $\sigma^2$ in the distribution of $\bar{y}$ (note that $\sigma^2$ is not provided explicitly in the data)?
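
The shrinkage produced by a normal-normal model can be illustrated with its standard conditional posterior mean: given the population mean and variance, each estimate is a precision-weighted average of the observed mean and the population mean. This is a minimal sketch, not a full inference procedure. Here `mu` and `tau2` are treated as known, whereas in a full model they would be inferred. Taking the sampling variance of a review's mean to be `var / count` is an assumption made for this illustration. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def conditional_posterior_mean(ybar, sigma2, mu, tau2):
    """Posterior mean of theta given ybar ~ N(theta, sigma2) and theta ~ N(mu, tau2)."""
    ybar = np.asarray(ybar, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    precision_data = 1.0 / sigma2
    precision_prior = 1.0 / tau2
    return (precision_data * ybar + precision_prior * mu) / (precision_data + precision_prior)

# Two made-up reviews with the same "mean" and "var" but different "count".
means = np.array([0.9, 0.9])
variances = np.array([0.04, 0.04])
counts = np.array([1, 16])

sigma2 = variances / counts  # assumed sampling variance of each review's mean
theta_hat = conditional_posterior_mean(means, sigma2, mu=0.5, tau2=0.04)
```

In this toy case the one-sentence review is pulled halfway toward the population mean, while the sixteen-sentence review barely moves. A variance of zero would make `sigma2` zero and needs separate handling.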

### Part B: Analysis for Each Restaurant

Use your model to produce estimates for the $\theta_{ij}$'s. Pick a few restaurants and, for each aspect ("food" and "service") of each restaurant, plot your estimates for the $\theta$'s against the values in the "mean" column corresponding to this restaurant. 

For the same restaurants, for each aspect, generate shrinkage plots as follows:

<img src="shrinkage.png">

The $x$-axis shows the posterior means, and the $y$-axis shows the classification probability (1-cdf) or fraction of predictive samples. The colored lines are error bars. (The code to generate this plot is included in this notebook.)
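
As a rough illustration of the "1-cdf" quantity on the $y$-axis, the sketch below computes the probability that a normally distributed quantity exceeds a threshold. The normal approximation, the function name `tail_probability` and its argument order are assumptions made for this example. It is a hypothetical stand-in, not necessarily the `prob` helper that the plotting code in this notebook calls.

```python
import math

def tail_probability(mean, var, threshold):
    """P(X > threshold) for X ~ Normal(mean, var), i.e. 1 - cdf(threshold)."""
    z = (threshold - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Made-up values: an estimate right at the threshold, and one well above it.
p_half = tail_probability(0.5, 0.01, 0.5)
p_high = tail_probability(0.8, 0.01, 0.5)
```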

Use these plots to discuss the statistical benefits of modeling each reviewer's opinion as you did in Part A, rather than approximating the reviewer opinion with the value in "mean".

### Part C: Analysis Across Restaurants

Aggregate, in a simple but reasonable way, the reviewers' opinions to give a pair of overall scores for each restaurant, one for food and one for service. Rank the restaurants by food score and then by service score. Discuss the statistical weakness of ranking by these scores.

(**Hint:** what is problematic about the way you aggregated the reviews of each restaurant to produce an overall food or service score? You've seen this question addressed a number of times in previous homeworks.)
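
One simple aggregation is the unweighted average of the reviewer-level estimates for each restaurant and topic. The sketch below assumes a hypothetical DataFrame with columns `rid`, `topic` and `theta`, where `theta` holds your estimates from Part B. The column name `theta`, the function name and the toy values are illustrative only. The number of reviews behind each score is reported alongside it.

```python
import pandas as pd

def rank_restaurants(estimates, topic):
    """Rank restaurants by the mean of their reviewer-level estimates for one topic."""
    return (
        estimates[estimates["topic"] == topic]
        .groupby("rid")["theta"]
        .agg(score="mean", n_reviews="count")
        .sort_values("score", ascending=False)
    )

# Made-up estimates for three restaurants (topic 0 stands for food).
toy = pd.DataFrame({
    "rid": ["a", "a", "a", "b", "c", "c"],
    "topic": [0, 0, 0, 0, 0, 0],
    "theta": [0.7, 0.8, 0.6, 0.95, 0.5, 0.4],
})
food_ranking = rank_restaurants(toy, topic=0)
```

In this toy example, restaurant `b` ranks first on the strength of a single review.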

### Extra Credit:

Propose a model, that addresses the weakness of your approach in Part C, for the overall quality of food and service for each restaurant given the $\theta$'s. Combine your model for the overall quality with your model for the $\theta$'s. Use this combined model to estimate the overall quality of food and service for each restaurant.

(**Hint:** Homework #7 might be a good reference for building your model for overall quality of food and service)


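```python
import itertools

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Fix a restaurant and an aspect (food or service).
# "means" is the array of values in the "mean" column for the restaurant and the aspect
#         in the dataset
# "thetas" is the array of values representing your estimate of the opinions of reviewers
#          regarding this aspect of this particular restaurant
# "theta_vars" is the array of values of the variances of the thetas
# "counts" is the array of values in the "count" column for the restaurant and the aspect
#          in the dataset
#
# NOTE: the helper `prob` is called below but is not defined in this cell; it must be
#       defined elsewhere before this function is called.

def prob_shrinkage_plot(means, thetas, theta_vars, counts):
    # Convert to arrays so that the elementwise division below also works for lists.
    theta_vars = np.asarray(theta_vars, dtype=float)
    counts = np.asarray(counts, dtype=float)
    data = zip(means, thetas, theta_vars / counts, theta_vars, counts)
    palette = itertools.cycle(sns.color_palette())
    with sns.axes_style('white'):
        for m, t, me2, te2, c in data:
            color = next(palette)
            # Small vertical jitter so that overlapping error bars remain distinguishable.
            noise = 0.001 * np.random.randn()
            noise2 = 0.001 * np.random.randn()
            # Fallback value used when the variance term is zero.
            if me2 == 0:
                me2 = 4
            p = prob(m, me2, 1.)
            peb = prob(t, te2, 1.)
            # Line from the raw mean to the model estimate, with horizontal error bars.
            plt.plot([m, t], [p, peb], 'o-', color=color, lw=1)
            plt.errorbar([m, t], [p + noise, peb + noise2],
                         xerr=[np.sqrt(me2), np.sqrt(te2)], color=color, lw=1)
        ax = plt.gca()
        plt.xlim([0, 1])
        plt.ylim([0, 1.05])
    return ax
```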