# Homework 8

## Rui Fang $\quad$ Collaborator: Yudi (Judy) Wang

**Harvard University**<br>
**Spring 2018**<br>
**Instructors: Rahul Dave**<br>
**Due Date:** Friday, March 30th, 2018 at 11:00am

**Instructions:**

- Upload your iPython notebook containing all work to Canvas.

- Structure your notebook and your work to maximize readability.




## Problem 1: Understanding Yelp Review Data As a Human
In this course, we've spent a lot of time learning algorithms for performing inference on complex models and we've spent time using these models to make decisions regarding our data. But in nearly every assignment, the model for the data is specified in the problem statement. In real life, the creative and, arguably, much more difficult task is to start with a broadly defined goal and then to customize or create a model which will meet this goal in some way. 



Problem #1 is atypical in that it does not involve any programming or (necessarily) difficult mathematics/statistics. The process of answering these questions *seriously* will however give you an idea of how one might create or select a model for a particular application and your answers will help you with formalizing the model if and when you're called upon to do so.

***Grading:*** *We want you to make a genuine effort to mold an ambiguous and broad real-life question into a concrete data science or machine learning problem without the pressure of getting the "right answer". As such, we will grade your answer of Problem #1 on a pass/fail basis. Any reasonable answer that demonstrates actual effort will be given a full grade.*

We've compiled for you a fairly representative selection of [Yelp reviews](./yelp_reviews.zip) for a (now closed) sushi restaurant called Ino's Sushi in San Francisco. Read the reviews and form an opinion regarding the various qualities of Ino's Sushi. Answer the following:

1. If the task is to summarize the quality of a restaurant in a simple and intuitive way, what might be problematic with classifying this restaurant as simply "good" or "bad"? Justify your answers with specific examples from the dataset.

2. For Ino's Sushi, categorize the food and the service, separately, as "good" or "bad" based on all the reviews in the dataset. Be as systematic as you can when you do this.

  (**Hint:** Begin by summarizing each review. For each review, summarize the reviewer's opinion on two aspects of the restaurant: food and service. That is, generate a classification ("good" or "bad") for each aspect based on what the reviewer writes.) 
  
3. Identify statistical weaknesses in breaking each review down into an opinion on the food and an opinion on the service. That is, identify types of reviews that make your method of summarizing the reviewer's opinion on the quality of food and service problematic, if not impossible. Use examples from your dataset to support your argument.

4. Identify all the ways in which the task in #2 might be difficult for a machine to accomplish. That is, break down the classification task into simple self-contained subtasks and identify how each subtask can be accomplished by a machine (i.e. which area of machine learning, e.g. topic modeling, sentiment analysis, etc., addresses this type of task).

5. Describe a complete pipeline for processing and transforming the data to obtain a classification for both food and service for each review.

***

### *Solution*

1. Classifying this restaurant as simply "good" or "bad" is problematic because reviewers have different needs and preferences and may give very different ratings depending on how they weigh different aspects of the restaurant. For example, the reviews in the dataset show an unusual distribution in which every review is either 1-star or 5-star (as noted by review05). Many 1-star reviews harshly criticize the service while acknowledging the quality of the food, and many 5-star reviews are given for the high quality of the food by reviewers with low expectations of the service. A single "good" or "bad" label hides this information, and it is therefore not very useful for making future predictions.

2. Categorize the food and the service as "good" or "bad" based on all the reviews in the dataset:

| Review    | food  | service | 
|-----------|-------|---------|
| 1         | good  |  bad  |
| 2         | good  |  bad  |
| 3         | N/A   |  bad  |
| 4         | good  |  bad  |
| 5         | good  |  good |
| 6         | good  |  N/A  |
| 7         | good  |  N/A  |
| 8         | good  |  N/A  |
| 9         | N/A   |  bad  |
| 10        | good  | good  |

3. Some reviews do not express an opinion on both the food and the service. This appears in the table above as missing data (N/A).

4. The classification task can be broken down into subtasks including topic modeling (discovering the abstract topics that occur in the reviews) and sentiment analysis (determining whether the expressed opinion is positive, negative, or neutral).

5. For each review, we extract text features (e.g. bag of words) for both food and service, then use a model (e.g. naive Bayes, maximum entropy, or a support vector machine) to classify each aspect into the "good" or "bad" category.
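The pipeline in item 5 can be sketched in a few lines. The version below is only a toy illustration: instead of a trained naive Bayes or SVM classifier it swaps in hand-written keyword lists, and every word list and the example review are hypothetical, not taken from the dataset. It shows the three stages (split into sentences, assign each sentence to an aspect, score its sentiment) and how missing opinions end up as `N/A`.

```python
import re

# Hypothetical toy lexicons, chosen for illustration only; a real pipeline
# would learn topics and sentiment from labeled reviews.
ASPECT_WORDS = {
    "food": {"sushi", "fish", "rice", "food", "roll", "fresh"},
    "service": {"service", "waiter", "chef", "rude", "staff", "wait"},
}
POSITIVE = {"fresh", "great", "delicious", "amazing", "friendly"}
NEGATIVE = {"rude", "slow", "terrible", "bland", "awful"}


def classify_review(text):
    """Return {'food': label, 'service': label}; labels are good, bad, or N/A."""
    scores = {aspect: 0 for aspect in ASPECT_WORDS}
    seen = {aspect: False for aspect in ASPECT_WORDS}
    # Step 1: split the review into sentences.
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = set(re.findall(r"[a-z']+", sentence))
        # Step 2: sentence polarity = positive word count - negative word count.
        polarity = len(tokens & POSITIVE) - len(tokens & NEGATIVE)
        # Step 3: credit the polarity to every aspect the sentence mentions.
        for aspect, words in ASPECT_WORDS.items():
            if tokens & words:
                seen[aspect] = True
                scores[aspect] += polarity
    labels = {}
    for aspect in ASPECT_WORDS:
        if not seen[aspect] or scores[aspect] == 0:
            labels[aspect] = "N/A"  # aspect not mentioned, or no net opinion
        else:
            labels[aspect] = "good" if scores[aspect] > 0 else "bad"
    return labels


# Made-up review text, not taken from the Yelp dataset.
example = "The sushi was amazing and fresh. The chef was rude and the service was slow."
print(classify_review(example))
```

A sentence that mentions both aspects is credited to both, which is one of the weaknesses raised in item 3; a learned topic model would have to resolve that ambiguity.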

***

## Problem 2: My Sister-In-Law's Baby Cousin Tracy ...


Wikipedia describes the National Annenberg Election Survey as follows -- "National Annenberg Election Survey (NAES) is the largest academic public opinion survey conducted during the American presidential elections. It is conducted by the Annenberg Public Policy Center at the University of Pennsylvania."  In the file [survey.csv](./survey.csv) we provide the following data from the 2004 National Annenberg Election Survey:  `age` -- the age of the respondents, `numr` -- the number of responses, and `knowlgbtq` -- the number of people at the given age who have at least one LGBTQ acquaintance.  We want you to model how age influences likelihood of interaction with members of the LGBTQ community in three ways. 


1. Using pymc3, create a bayesian regression model (either construct the model directly or use the glm module) with the same feature and dependent variable. Plot the mean predictions for ages 0-100, with a 2-sigma envelope.

2. Using pymc3, create a 1-D Gaussian Process regression model with the same feature and dependent variables.  Use a squared exponential covariance function. Plot the mean predictions for ages 0-100, with a 2-sigma envelope.

3. How do the models compare? Does age influence likelihood of acquaintance with someone LGBTQ? For Bayesian Linear Regression and GP Regression, how does age affect the variance of the estimates?

For GP Regression, we can model the likelihood of knowing someone LGBTQ as a product of binomials -- one binomial distribution per age group. 

$$p(y_a | \theta_a, n_a) = Binom( y_a, n_a, \theta_a)$$

where $y_a$ (i.e. `knowlgbtq`) is the observed number of respondents at age $a$ who know someone LGBTQ, $n_a$ (i.e. `numr`) is the number of trials, and $\theta_a$ is the rate parameter for having an LGBTQ acquaintance at age $a$.

Since `numr` is large, you can use the Gaussian approximation to the Binomial (http://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation): simply use a GP posterior in which the error of each measurement is given by this approximation.
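As a rough sketch of what this approximation implies: if $y_a \sim Binom(n_a, \theta_a)$ with $n_a$ large, then $y_a / n_a$ is approximately Normal with mean $\theta_a$ and standard deviation $\sqrt{\theta_a (1 - \theta_a) / n_a}$. The snippet below plugs the observed ratio in for $\theta_a$ (a plug-in assumption) and uses the first five rows of `survey.csv` as they appear in the solution.

```python
import numpy as np

# First five rows of survey.csv (age, numr, knowlgbtq), copied from the
# table printed in the solution.
age = np.array([18, 19, 20, 21, 22])
numr = np.array([310, 221, 217, 255, 301])
knowlgbtq = np.array([158, 118, 120, 131, 168])

# Plug-in estimate of the rate for each age group.
theta_hat = knowlgbtq / numr

# Normal approximation: sd of the observed ratio y_a / n_a.
sigma_obs = np.sqrt(theta_hat * (1.0 - theta_hat) / numr)

for a, t, s in zip(age, theta_hat, sigma_obs):
    print(f"age {a}: ratio = {t:.3f}, approx. sd = {s:.4f}")
```

Age groups with fewer respondents get a larger measurement error, so a GP that uses these values as per-point noise trusts them less.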

***

### *Solution*

In [1]:
%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set()

In [2]:
# Load in data 
survey_df = pd.read_csv('survey.csv')
survey_df.head()

Unnamed: 0,age,numr,knowlgbtq
0,18,310,158
1,19,221,118
2,20,217,120
3,21,255,131
4,22,301,168


In [26]:
print(survey_df.shape)

age = survey_df['age'].values
numr = survey_df['numr'].values
knowlgbtq = survey_df['knowlgbtq'].values

know_ratio = knowlgbtq/numr

# Visualize data 
plt.figure()
plt.plot(age, know_ratio, 'k.', label='data')
plt.xlabel('age')
plt.ylabel('know ratio = knowlgbtq/numr')
plt.legend(frameon=True)
plt.show()

(78, 3)


[Figure: scatter plot of know ratio (knowlgbtq/numr) versus age]

In [4]:
from theano import shared

age_range = shared(age)

# Construct Bayesian linear regression model
with pm.Model() as blr_model:
    
    # Define priors
    sigma = pm.HalfCauchy('sigma', beta=10, testval=1.)
    intercept = pm.Normal('Intercept', 0, sd=20)
    x_coeff = pm.Normal('x_coeff', 0, sd=20)

    # Define likelihood
    likelihood = pm.Normal('y', mu=intercept + x_coeff * age_range,
                        sd=sigma, observed=know_ratio)
    
    # Draw 3000 posterior samples using NUTS sampling
    blr_trace = pm.sample(3000, cores=2) 

# Define new age range [0, 100] for sampling posterior predictive 
new_age_range = np.arange(0, 101)
age_range.set_value(new_age_range)

blr_samples = pm.sample_ppc(blr_trace, model=blr_model, samples=200)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [x_coeff, Intercept, sigma_log__]
100%|██████████| 3500/3500 [00:14<00:00, 239.83it/s]
The acceptance probability does not match the target. It is 0.896763410083, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.909672216946, but should be close to 0.8. Try to increase the number of tuning steps.
100%|██████████| 200/200 [00:00<00:00, 960.31it/s]


In [27]:
# Plot posterior predictive results for Bayesian linear regression model 
blr_samples_mean = blr_samples['y'].mean(axis=0)
blr_samples_sigma = blr_samples['y'].std(axis=0)

lower_bound = blr_samples_mean - 2*blr_samples_sigma
upper_bound = blr_samples_mean + 2*blr_samples_sigma

plt.figure()
plt.plot(age, know_ratio, 'k.', label='data')
plt.plot(new_age_range, blr_samples_mean, '-', label='mean predictions', c='b')
plt.fill_between(new_age_range, lower_bound, upper_bound, facecolor='g', alpha=0.2, label = '2 sigma envelope')
plt.title('Bayesian linear regression')
plt.legend(loc='best', frameon=True)
plt.xlabel('age')
plt.ylabel('know ratio')
plt.show()

[Figure: "Bayesian linear regression", know ratio versus age with data, mean predictions, and 2 sigma envelope]

In [9]:
# Construct Gaussian process regression model 
with pm.Model() as gpr_model:

    length = pm.HalfCauchy('length', 1)
    sigma = pm.HalfCauchy('sigma', 10)
    
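    # Note: Linear(coeffs=c) defines the mean function c * x with zero intercept;
    # pm.gp.mean.Constant(c) would instead give a flat mean at the average know ratio.
    M = pm.gp.mean.Linear(coeffs=know_ratio.mean())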
    K = (sigma**2) * pm.gp.cov.ExpQuad(1, length) 
    
    noise_term = pm.HalfCauchy('noise_term', 5)
    
    gpr = pm.gp.Marginal(mean_func=M, cov_func=K)
    yobs = gpr.marginal_likelihood('yobs', X=age.reshape(-1,1), y=know_ratio, noise=noise_term)

    gpr_trace = pm.sample(10000)

X_pred = new_age_range.reshape(-1, 1)
with gpr_model:
    ypred = gpr.conditional("ypred", X_pred)
    gpr_samples = pm.sample_ppc(gpr_trace, vars=[ypred], samples=40)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [noise_term_log__, sigma_log__, length_log__]
100%|██████████| 10500/10500 [03:18<00:00, 52.86it/s]
The acceptance probability does not match the target. It is 0.88196848158, but should be close to 0.8. Try to increase the number of tuning steps.
100%|██████████| 40/40 [00:00<00:00, 57.86it/s]


In [28]:
# Plot posterior predictive results for Gaussian process regression model 
gpr_samples_mean = gpr_samples['ypred'].mean(axis=0)
gpr_samples_sigma = gpr_samples['ypred'].std(axis=0)

lower_bound = gpr_samples_mean - 2*gpr_samples_sigma
upper_bound = gpr_samples_mean + 2*gpr_samples_sigma

plt.figure()
plt.plot(age, know_ratio, 'k.', label='data')
plt.plot(X_pred, gpr_samples_mean, 'b', label='mean predictions')
plt.fill_between(new_age_range, lower_bound, upper_bound, facecolor='green', alpha=0.2, label='2 sigma envelope')
plt.xlabel("age")
plt.ylabel("know ratio")
plt.title("Gaussian process regression")
plt.xlim(0, 100)
plt.legend(frameon=True)
plt.show()

[Figure: "Gaussian process regression", know ratio versus age with data, mean predictions, and 2 sigma envelope]

Based on the results of the two models, age does influence the likelihood of acquaintance with someone LGBTQ. The Bayesian linear regression model has large bias and large predictive variance, indicating that it underfits the data. The GP regression model has much smaller bias and variance and captures the general trend well, so it fits the data better than the linear model.

For the Bayesian linear regression model, the variance is roughly constant across age. For the GP regression model, age has a strong influence on the variance of the estimates: the variance is small for ages between 20 and 95 and increases for ages below 20 and above 95. This is expected, since we have no data in these ranges.
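The growth of the GP variance outside the observed ages can be reproduced with a small closed-form GP in NumPy. This is only a sketch under stated assumptions: the hyper-parameters (`amplitude`, `length`, `noise`) are fixed by hand rather than sampled as in the pymc3 model, the prior mean is a constant, and the training targets are a synthetic stand-in for the survey ratios.

```python
import numpy as np


def sq_exp(xa, xb, amplitude, length):
    """Squared exponential covariance between two 1-D arrays of inputs."""
    d = xa[:, None] - xb[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / length) ** 2)


def gp_posterior(x_train, y_train, x_new, amplitude, length, noise):
    """Posterior mean and sd of the latent function at x_new."""
    mean0 = y_train.mean()  # constant prior mean (an assumption of this sketch)
    K = sq_exp(x_train, x_train, amplitude, length)
    K += noise**2 * np.eye(len(x_train))
    Ks = sq_exp(x_new, x_train, amplitude, length)
    L = np.linalg.cholesky(K)
    # Solve K w = (y - mean0) using the Cholesky factor.
    w = np.linalg.solve(L.T, np.linalg.solve(L, y_train - mean0))
    mu = mean0 + Ks @ w
    v = np.linalg.solve(L, Ks.T)
    var = amplitude**2 - np.sum(v**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))


# Synthetic stand-in data on ages 18..95 (hypothetical values, not the survey).
x = np.arange(18, 96, dtype=float)
y = 0.55 - 0.003 * (x - 18)

x_new = np.arange(0, 101, dtype=float)
mu, sd = gp_posterior(x, y, x_new, amplitude=0.2, length=10.0, noise=0.03)
print("sd at age 0, 50, 100:", sd[0], sd[50], sd[100])
```

Inside the data range many nearby observations constrain the function, so the posterior sd is small; away from the data the kernel correlation decays and the sd returns toward the prior amplitude.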

***

## Problem 3:  Like a Punch to the Kidneys 

In this problem we will work with the US Kidney Cancer Dataset (by county), a dataset of kidney cancer frequencies across the US over 5 years on a per county basis. 

The kidney cancer data can be found [here](./kcancer.csv).

A casual inspection of the data might suggest that we independently model cancer rates for each of the provided counties. Our experience in past homeworks/labs/lectures (in particular when we delved into the Rat Tumors problem) suggests potential drawbacks of conclusions based on raw cancer rates. Addressing these drawbacks, let's use a Bayesian model for our analysis of the data. In particular, you will implement an Empirical Bayes model to examine the adjusted cancer rates per county.

Let $N$ be the number of counties; let $y_j$ be the number of kidney cancer cases for the $j$-th county, $n_j$ the population of the $j$-th county, and $\theta_j$ the underlying kidney cancer rate for that county. We can construct a Bayesian model for our data as follows:
\begin{aligned}
y_j &\sim Poisson(5 \cdot n_j \cdot \theta_j), \quad j = 1, \ldots, N\\
\theta_j &\sim Gamma(\alpha, \beta), \quad j = 1, \ldots, N
\end{aligned}
where $\alpha, \beta$ are hyper-parameters of the model.

- (#1) Implement Empirical Bayes via moment matching as follows. Consider the **prior-predictive** distribution (also called the evidence, i.e. the normalizing denominator in Bayes' theorem) of the model: $p(y) = \int p(y \vert \theta) p(\theta) d \theta$. Why the prior-predictive? Because technically we "haven't seen" individual county data yet. For this model, the prior-predictive is a negative binomial. Matching the mean and the variance of the negative binomial to those of the data, you can find appropriate expressions for $\alpha$ and $\beta$. (Hint: You need to be careful with the $5n_j$ multiplier.)

- (#2) Produce a scatter plot of the raw cancer rates (pct mortality) vs the county population size. Highlight the top 300 raw cancer rates in red. Highlight the bottom 300 raw cancer rates in blue. Finally, on the same plot add a scatter plot visualization of the posterior mean cancer rate estimates (pct mortality) vs the county population size, highlight these in green.

- (#3) Using the above scatter plot, explain why using the posterior means from our model to estimate cancer rates is preferable to studying the raw rates themselves.

(**Hint:** You might also find it helpful to follow the Rat Tumor example.)

(**Note:** Up until now we've primarily thought about the posterior predictive: $\int p( y \vert \theta) p(\theta \vert D) d\theta$. The posterior predictive and the prior predictive are related: in conjugate models such as ours, the two distributions have the same form.)

***

### *Solution*

In [29]:
kcancer_df = pd.read_csv('./kcancer.csv')
kcancer_df.head()

Unnamed: 0,state,fips,county,countyfips,dc,pop,pct_mortality
0,ALABAMA,1,AUTAUGA,1001,1.0,64915.0,1.5e-05
1,ALABAMA,1,BALDWIN,1003,15.0,195253.0,7.7e-05
2,ALABAMA,1,BARBOUR,1005,1.0,33987.0,2.9e-05
3,ALABAMA,1,BIBB,1007,1.0,31175.0,3.2e-05
4,ALABAMA,1,BLOUNT,1009,5.0,91547.0,5.5e-05


Following the hint posted on Piazza, the prior-predictive for this model is a negative binomial $y_j \sim Neg Binom \left(\alpha, \frac{5n_j}{5n_j + \beta}\right)$, whose moments we know analytically:
\begin{align}
    \mathbb{E}[y_j] &= \frac{5n_j \alpha}{\beta} \\
    Var[y_j] &= \frac{(5n_j)^2 \alpha}{\beta ^2} + \frac{5n_j \alpha}{\beta} = \mathbb{E}[y_j] \frac{5n_j + \beta}{\beta}
\end{align}

However, the data contain only one count per county, so we cannot estimate these moments county by county. To reduce the $n_j$ dependence, we instead consider the per-county death rate:
\begin{align}
    \mathbb{E}\left[\frac{y_j}{n_j}\right] &= \frac{1}{N}\sum_{j=1}^N\frac{y_j}{n_j} = \frac{5\alpha}{\beta} \\
    Var\left[\frac{y_j}{n_j}\right] &= \frac{25 \alpha}{\beta^2} + \frac{5\alpha}{\beta n_j} = \mathbb{E}\left[\frac{y_j}{n_j}\right]\left(\frac{5}{\beta} + \frac{1}{n_j}\right)
\end{align}

We match this mean and variance to the sample mean and sample variance of the data, replacing the $n_j$ that appears in the variance with the mean county population $\bar{n}$. This gives expressions for $\alpha$ and $\beta$:
\begin{align}
    \alpha &= \frac{sample\_mean \cdot \beta}{5} = \frac{sample\_mean}{\frac{sample\_variance}{sample\_mean}-\frac{1}{\bar{n}}}\\
    \beta &= \frac{5}{\frac{sample\_variance}{sample\_mean}-\frac{1}{\bar{n}}}
\end{align}

In [47]:
# Calculate alpha and beta
pct_mean = kcancer_df['pct_mortality'].mean()
pct_var = kcancer_df['pct_mortality'].var()
n_j_mean = kcancer_df['pop'].mean()
beta = 5 / (pct_var/pct_mean - 1/n_j_mean)
alpha = pct_mean * beta / 5
print("alpha = ", alpha)
print("beta = ", beta)

alpha =  1.5445945866
beta =  133464.150333
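One way to sanity-check the moment-matching formulas is to simulate counties from the Gamma-Poisson model with known hyper-parameters and see whether the formulas recover them. The sketch below makes simplifying assumptions that are not part of the homework: every simulated county has the same population `pop`, and the "true" values `alpha_true` and `beta_true` are hypothetical numbers of roughly the same magnitude as the estimates above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" hyper-parameters and a common county population.
alpha_true, beta_true = 1.5, 130000.0
n_counties, pop = 200_000, 50_000.0

# Simulate the model: theta_j ~ Gamma(alpha, rate=beta), y_j ~ Poisson(5 n theta_j).
theta = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=n_counties)
y = rng.poisson(5.0 * pop * theta)
rate = y / pop

# Moment matching, using the same formulas as the cell above.
m, v = rate.mean(), rate.var(ddof=1)
beta_hat = 5.0 / (v / m - 1.0 / pop)
alpha_hat = m * beta_hat / 5.0

print("alpha_hat =", alpha_hat, " beta_hat =", beta_hat)
```

With unequal populations, as in the real data, substituting the mean population $\bar{n}$ for $n_j$ is an additional approximation that this check does not cover.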


Because the Gamma prior is conjugate to the Poisson likelihood, the posterior can be determined analytically:
\begin{align}
    \theta_j | y_j \sim Gamma\left(\alpha+y_j, \beta+5n_j\right)
\end{align}
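Since the mean of a $Gamma(a, b)$ distribution with rate $b$ is $a/b$, the posterior mean is available in closed form as $(\alpha + y_j)/(\beta + 5n_j)$, with no sampling required. The sketch below evaluates it for the first five counties shown in `kcancer_df.head()`, using the $\alpha$ and $\beta$ printed above, and also rewrites the result as a weighted average of the raw rate and the prior mean rate. The variable names are illustrative.

```python
import numpy as np

# Hyper-parameters as printed by the moment-matching cell.
alpha, beta = 1.5445945866, 133464.150333

# First five counties from kcancer_df.head(): death counts and populations.
dc = np.array([1.0, 15.0, 1.0, 1.0, 5.0])
pop = np.array([64915.0, 195253.0, 33987.0, 31175.0, 91547.0])

# Closed-form posterior mean of theta_j under Gamma(alpha + y_j, beta + 5 n_j).
post_mean_theta = (alpha + dc) / (beta + 5.0 * pop)

# The same quantity on the rate scale, written as shrinkage toward the prior.
raw_rate = dc / pop
prior_rate = 5.0 * alpha / beta
weight = 5.0 * pop / (beta + 5.0 * pop)  # weight given to the raw rate
shrunk_rate = weight * raw_rate + (1.0 - weight) * prior_rate

for r, s, w in zip(raw_rate, shrunk_rate, weight):
    print(f"raw = {r:.2e}, posterior mean rate = {s:.2e}, weight on data = {w:.2f}")
```

The weight on the raw rate grows with the county population, so large counties keep estimates close to their raw rates while small counties are pulled toward the prior mean rate $5\alpha/\beta$.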

In [48]:
# Estimate each county's posterior mean of theta by averaging 5000 Gamma draws
N = kcancer_df.shape[0]
thetas = np.zeros(N)

for i in range(N):
    y_j = kcancer_df['dc'].iloc[i]
    n_j = kcancer_df['pop'].iloc[i]
    thetas[i] = np.random.gamma(alpha+y_j, 1/(beta+5*n_j), size=5000).mean()

In [50]:
# Make scatter plot
ax = kcancer_df.plot(kind='scatter', 
                     x="pop", 
                     y="pct_mortality", 
                     s=5, alpha=0.3, color="grey", logx=True, figsize=(8,8))
bot_kcancer_counties = kcancer_df.sort_values(by='pct_mortality', ascending=True)[:300]
top_kcancer_counties = kcancer_df.sort_values(by='pct_mortality', ascending=False)[:300]
top_kcancer_counties.plot(kind='scatter', 
                          x="pop", 
                          y="pct_mortality", 
                          s=5, alpha=0.3, color="red", 
                          label="top 300 raw cancer rates", 
                          ax=ax, logx=True)
bot_kcancer_counties.plot(kind='scatter', 
                          x="pop", 
                          y="pct_mortality", 
                          s=5, alpha=0.3, color="blue", 
                          label="bottom 300 raw cancer rates", 
                          ax=ax, logx=True)
ax.plot(kcancer_df['pop'], 5*thetas, '.', alpha=0.1, color="green", label="posterior mean cancer rates")
ax.set_ylim([-0.00005, 0.0008])
plt.legend(frameon=True)
plt.show()

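[Figure: scatter plot of pct_mortality versus pop (log scale), with the top 300 raw cancer rates in red, the bottom 300 in blue, and the posterior mean cancer rates in green]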

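According to the above scatter plot, using the posterior means from our model to estimate cancer rates is preferable to studying the raw rates, because the raw rates are very sensitive to population: raw rates from small counties are based on few cases and therefore have high variance, so the most extreme rates tend to come from small counties. The posterior mean pulls each county's estimate toward the overall rate, more strongly for small counties, which reduces the effect of population on the estimated cancer rates.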

***

## Problem 4: In the Blink of a Bayesian Iris

We've done classification before, but the goal of this problem is to introduce you to the idea of classification using Bayesian inference. 

Consider the famous *Fisher flower Iris data set*, a multivariate data set introduced by Sir Ronald Fisher (1936) as an example of discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, you will build a model to predict the species.

For this problem only consider two classes: **virginica** and **not-virginica**. 

The iris data can be obtained [here](./iris.csv).

Let $(X, Y)$ be our dataset, where $X=\{\vec{x}_1, \ldots \vec{x}_n\}$ and $\vec{x}_i$ is the standard feature vector corresponding to an offset 1 and the four components explained above. $Y \in \{0,1\}$ are the scalar class labels. In other words, the species labels are your $Y$ data (not-virginica = 0 and virginica = 1), and the four features -- petal length, petal width, sepal length and sepal width -- along with the offset make up your $X$ data.

The goal is to train a classifier, that will predict an unknown class label $\hat{y}$ from a new data point $x$. 

Consider the following glm (logistic model) for the probability of a class:

$$ p(y = 1 \mid x) = \frac{1}{1+e^{-x^T \beta}} $$

(or $logit(p) = x^T \beta$ in more traditional glm form)

where $\beta$ is a 5D parameter to learn. 

Then given $p$ at a particular data point $x$, we can use a bernoulli likelihood to get 1's and 0's. This should be enough for you to set up your model in pymc3. (Other Hints: also use theano.tensor.exp when you define the inverse logit to go from $\beta$ to $p$, and you might want to set up $p$ as a deterministic explicitly so that pymc3 does the work of giving you the trace).

Use a 60-40 stratified (preserving class membership) split of the dataset into a training set and a test set. (Feel free to take advantage of scikit-learn's `train_test_split`).

1. Choose a prior for $\beta \sim N(0, \sigma^2 I) $ and write down the formula for the normalized posterior $p(\beta| Y,X)$. Since we don't care about regularization here, just use the mostly uninformative value $\sigma = 10$.
2. Find the MAP and mean estimate for the posterior on the training set.
3. Implement a  sampler to sample from this posterior of $\beta$.   Generate samples of $\beta$ and plot the sequence of $\beta$'s  and histograms for each $\beta$ component.



***

### *Solution*

The formula for the normalized posterior $p(\beta|Y,X)$ is given by 
\begin{align}
    p(\beta|Y,X) &= const. \times p(\beta)p(Y|X,\beta) \\
                 &= const. \times \exp\left(-\frac{\beta^T \beta}{2\sigma^2}\right)\prod_{i=1}^N \left[\left( \frac{1}{1+\exp\left(-x_i^T\beta\right)} \right)^{y_i} \left( \frac{\exp\left(-x_i^T\beta\right)}{1+\exp\left(-x_i^T\beta\right)} \right)^{1-y_i}\right].
\end{align}
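Taking the logarithm of this expression gives a form that is easier to evaluate numerically: each likelihood factor contributes $y_i x_i^T\beta - \log(1 + e^{x_i^T\beta})$ and the prior contributes $-\beta^T\beta / (2\sigma^2)$. The function below is a minimal NumPy sketch of this unnormalized log posterior; the function name and the toy data are illustrative and not part of the notebook.

```python
import numpy as np


def log_posterior(beta, X, y, sigma=10.0):
    """Unnormalized log posterior of logistic regression with a N(0, sigma^2 I) prior."""
    z = X @ beta
    # log p(y_i | x_i, beta) = y_i * z_i - log(1 + exp(z_i)), computed stably.
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * (beta @ beta) / sigma**2
    return log_lik + log_prior


# Toy data (hypothetical): an offset column plus one feature.
X_toy = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

print(log_posterior(np.zeros(2), X_toy, y_toy))           # 4 * log(0.5)
print(log_posterior(np.array([0.0, 1.0]), X_toy, y_toy))  # higher: fits the labels
```

A function like this is what a MAP optimizer maximizes and what a hand-written sampler would evaluate at each proposed $\beta$.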

In [74]:
iris_df = pd.read_csv('iris.csv')
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [76]:
from sklearn.model_selection import train_test_split

N = iris_df.shape[0]

X = iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values
X = np.concatenate([np.ones((N, 1)), X], axis=1)

y = np.zeros(N)
y[iris_df['class'].values == ' Iris-virginica'] = 1

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0, stratify=y)

In [77]:
import theano.tensor as T

# Construct the logistic model 
with pm.Model() as model:

    beta = pm.Normal('beta', mu=0, sd=10, shape=5) 
    p = pm.Deterministic('p', 1/(1 + T.exp(-T.dot(X_train, beta))))
    yhat = pm.Bernoulli('yhat', p, observed=y_train)
    
    trace = pm.sample(10000, pm.NUTS())

Multiprocess sampling (2 chains in 2 jobs)
NUTS: [beta]
100%|██████████| 10500/10500 [05:58<00:00, 29.25it/s]
There were 890 divergences after tuning. Increase `target_accept` or reparameterize.
There were 532 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.


In [78]:
# MAP estimate
map_beta = pm.find_MAP(model=model)['beta']
print("MAP estimate: ", map_beta)
    
# Mean posterior estimate 
mean_beta = np.mean(trace['beta'], axis=0)
print("Mean posterior estimate: ", mean_beta)

INFO (theano.gof.compilelock): Refreshing lock /Users/rfang/.theano/compiledir_Darwin-16.7.0-x86_64-i386-64bit-i386-3.6.2-64/lock_dir/lock
logp = -18.514, ||grad|| = 0.00042632: 100%|██████████| 44/44 [00:00<00:00, 528.73it/s]  


MAP estimate:  [ -6.38925752  -6.82498054  -2.76193212   8.05899035  10.87505387]
Mean posterior estimate:  [ -9.1244099   -9.46412212  -3.66904669  11.36731087  14.47537243]


In [97]:
# Create the traceplot of beta
pm.traceplot(trace, varnames=['beta'])
plt.show()

[Figure: trace plot of beta]

In [98]:
# Create histograms of beta 
fig, axes = plt.subplots(1, 5, figsize=(20, 4))

for i in range(5):
    sns.distplot(trace['beta'][:, i], ax=axes[i])
    axes[i].set_ylim(0, 0.12)
    axes[i].set_title(r'$\beta_%s$' % (i+1))

fig.tight_layout()
plt.show()

[Figure: histograms of the five beta components]
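Once posterior draws of $\beta$ are available, a class prediction for a new point can be obtained by averaging the logistic probability over the draws and thresholding at 0.5. The sketch below is illustrative only: `predict_proba` is a hypothetical helper, and instead of the full trace it uses the MAP and posterior-mean vectors printed above as two stand-in "draws", applied to the five setosa rows shown in `iris_df.head()`.

```python
import numpy as np


def predict_proba(X_new, beta_samples):
    """Average P(y = 1 | x, beta) over posterior draws of beta (rows of beta_samples)."""
    z = X_new @ beta_samples.T          # shape (n_points, n_draws)
    p = 1.0 / (1.0 + np.exp(-z))
    return p.mean(axis=1)


# Stand-ins for posterior draws: the MAP and posterior-mean estimates printed above.
beta_samples = np.array([
    [-6.38925752, -6.82498054, -2.76193212, 8.05899035, 10.87505387],
    [-9.1244099, -9.46412212, -3.66904669, 11.36731087, 14.47537243],
])

# First five rows of the iris data (all Iris-setosa), with the offset column added.
features = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])
X_new = np.concatenate([np.ones((len(features), 1)), features], axis=1)

p_virginica = predict_proba(X_new, beta_samples)
y_pred = (p_virginica > 0.5).astype(int)
print(p_virginica, y_pred)
```

In the notebook, the same function could be applied as `predict_proba(X_test, trace['beta'])` and compared against `y_test` to estimate test accuracy.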

***