#  Classification via Probabilistic Modeling

In this lab, we will practice working with Bayesian probabilistic modeling.

You will implement the probabilistic model you designed in Homework 3, and you will also apply a Gaussian generalization of Naive Bayes classification to the breast cancer dataset from our logistic regression demo.

## Warm Up: Implementing a Probabilistic Model

Implement the probabilistic model you came up with for Problem 1, part (a) on written homework 3. Limit your model to only include data for any **three zipcodes** of your choice in New York City.

You will probably want to use one or more functions in the `numpy.random` package.

Feel free to simply select reasonable parameters for your model using your own intuition, but **bonus points will be given if you use outside data to inform the model.** Please include a link to any outside sources and explain how the data was used. 
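
If you are unsure how to turn a written model into sampling code, the sketch below shows one possible shape: draw a zip code, then a size given the zip code, then a rent given both. The function name `sketch_rent_model`, the three zip codes, and every numeric parameter are placeholders invented for illustration. They are not real market data and not the intended solution, so your own model from Homework 3 may be structured quite differently.

```python
import numpy as np

def sketch_rent_model(n, seed=0):
    """Hypothetical generative model: zip code -> size -> rent."""
    rng = np.random.default_rng(seed)

    # Placeholder parameters chosen for illustration only (not real data).
    zips = np.array([10003, 11201, 10463])
    zip_probs = np.array([0.40, 0.35, 0.25])      # P(zip code)
    mean_sqft = np.array([650.0, 800.0, 900.0])   # mean size per zip code
    price_per_sqft = np.array([6.0, 4.5, 2.5])    # mean $/sq ft/month per zip code

    # 1. Draw an index into the zip code table for every example.
    idx = rng.choice(len(zips), size=n, p=zip_probs)

    # 2. Draw size given zip code; clip so that sizes stay at or above 200 sq ft.
    sqft = np.clip(rng.normal(mean_sqft[idx], 150.0), 200.0, None)

    # 3. Draw rent given zip code and size: linear trend plus Gaussian noise.
    rent = price_per_sqft[idx] * sqft + rng.normal(0.0, 300.0, size=n)

    X = np.column_stack([zips[idx], sqft])   # shape (n, 2)
    y = rent.reshape(-1, 1)                  # shape (n, 1)
    return X, y

X_demo, y_demo = sketch_rent_model(1000)
print(X_demo.shape, y_demo.shape)
```

Sampling in this order (zip code first, then size, then rent) mirrors the factorization of a joint distribution into conditionals, which tends to make each parameter easy to set from intuition or outside data.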

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
def generate_rent_data(n):
    """
    Randomly generate a synthetic dataset of apartment rental data with n examples.

    Returns:
        X: an n x 2 numpy array whose first column contains a zip code and whose
           second column contains a square footage number.
        y: an n x 1 numpy array containing a monthly rental price.
    """
    # TODO: sample X and y from your probabilistic model
    
    return X,y

* Generate 1000 data examples using the probabilistic model implemented in `generate_rent_data(n)`. 
* Plot the data on a scatter plot, with the x-axis being apartment size and the y-axis being rent. Color each data point to indicate which of the three zip codes it is from. 

Confirm that the data looks reasonable for the zip codes you selected! I have provided an example result below:

<img src="sample_output.png" width="400">

Your data will look different depending on how you designed your probabilistic model.
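
One way to color points by zip code is to loop over the unique zip codes and call `scatter` once per group, so that each group gets its own color and legend entry. The sketch below uses made-up toy arrays (`X_toy`, `y_toy`) in place of your generated data. The zip codes and numbers are placeholders only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy stand-in data: column 0 is a zip code, column 1 is square footage.
rng = np.random.default_rng(0)
zips_toy = np.array([10003, 11201, 10463])   # placeholder zip codes
X_toy = np.column_stack([rng.choice(zips_toy, size=300),
                         rng.uniform(300, 1500, size=300)])
y_toy = 3.0 * X_toy[:, 1] + rng.normal(0, 200, size=300)

fig, ax = plt.subplots()
for z in np.unique(X_toy[:, 0]):
    mask = X_toy[:, 0] == z                  # rows belonging to this zip code
    ax.scatter(X_toy[mask, 1], y_toy[mask], s=10, label=str(int(z)))
ax.set_xlabel("Apartment size (sq ft)")
ax.set_ylabel("Monthly rent ($)")
ax.legend(title="Zip code")
```

In a notebook with `%matplotlib inline` the figure is displayed when the cell finishes. In a plain script you would add `plt.show()`.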


In [None]:
# TODO
X,y = generate_rent_data(1000)
# plt.plot(...)

##  Breast Cancer Diagnosis via Gaussian Naive Bayes

In this portion of the lab, we will revisit the Breast Cancer Diagnosis problem from `demo_breast_cancer.ipynb` and try to classify examples using the Gaussian Naive Bayes method developed in Problem 2 of Homework 3.

As a reminder, the data set is described here:
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original) 

More information on the problem can be found in the demo.

### Loading and Visualizing the Data

We first load the packages and data as in the demo.

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets, linear_model, preprocessing
%matplotlib inline
names = ['id','thick','size_unif','shape_unif','marg','cell_size','bare',
         'chrom','normal','mit','class']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/' +
                 'breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                names=names,na_values='?',header=None)
df = df.dropna()

# Get the response.  Convert to a zero-one indicator 
yraw = np.array(df['class'])
BEN_VAL = 2   # value in the 'class' label for benign samples
MAL_VAL = 4   # value in the 'class' label for malignant samples
y = (yraw == MAL_VAL).astype(int) # now y has values of 0,1 
Iben = (y==0)
Imal = (y==1)

For this lab we are going to use all predictors we have at our disposal, as was done at the end of the demo. The code below places all of the predictor data in a matrix `Xfull` of dimension (n x d). 

In [None]:
xnames = names[1:-1]  # skip the 'id' column (not a predictor) and the 'class' label
Xfull = np.array(df[xnames])
n = Xfull.shape[0]
d = Xfull.shape[1]

### Naive Bayes Classification

The first step in using the Gaussian Naive Bayes method is to estimate all parameters of the probabilistic model. These include:
* `p` -- the probability that a data example is malignant (label 1).
* `mub = [mub[0], ..., mub[d-1]]` -- the expected value of each predictor variable for benign examples.
* `sigb = [sigb[0], ..., sigb[d-1]]` -- the variance of each predictor variable for benign examples.
* `mum = [mum[0], ..., mum[d-1]]` -- the expected value of each predictor variable for malignant examples.
* `sigm = [sigm[0], ..., sigm[d-1]]` -- the variance of each predictor variable for malignant examples.

Compute estimates for these values below using the data in `Xfull`. 

**Hint**: For a compact approach, you might want to use the boolean arrays `Imal` and `Iben` created above for "mask indexing" (see [these docs](https://docs.scipy.org/doc/numpy/user/basics.indexing.html)).
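
As a small illustration of mask indexing, the sketch below estimates a class prior plus per-class means and variances on a tiny made-up array. All names here (`X_toy`, `y_toy`, `is_pos`, `is_neg`, and so on) are hypothetical stand-ins rather than the lab's variables, and the variance uses NumPy's default normalization by the number of samples, which may or may not match the estimator you derived on the homework.

```python
import numpy as np

# Toy data: 5 examples, 2 features, binary labels.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 12.0],
                  [3.0, 14.0],
                  [8.0, 30.0],
                  [10.0, 34.0]])
y_toy = np.array([0, 0, 0, 1, 1])

is_pos = (y_toy == 1)   # boolean mask selecting class-1 rows
is_neg = (y_toy == 0)   # boolean mask selecting class-0 rows

p_toy = np.mean(is_pos)                  # fraction of class-1 examples
mu_neg = X_toy[is_neg].mean(axis=0)      # per-feature mean, class 0
var_neg = X_toy[is_neg].var(axis=0)      # per-feature variance, class 0
mu_pos = X_toy[is_pos].mean(axis=0)      # per-feature mean, class 1
var_pos = X_toy[is_pos].var(axis=0)      # per-feature variance, class 1

print(p_toy, mu_neg, var_neg, mu_pos, var_pos)
```

Indexing a 2-D array with a boolean vector of length `n` keeps only the rows where the mask is `True`, so `axis=0` reductions then produce one value per feature.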

In [None]:
# TODO
# p = ...
# mub = ..
# sigb = ...
# mum = ..
# sigm = ...

Next, for every row $\vec{x}$ in `Xfull`, use the model parameters determined above to compute the maximum a posteriori (MAP) estimate for the class label $y$. You should use the equations derived in your solution to Problem 3 (b) on the homework. 

You may need to be thoughtful about how you do this computation to avoid numerical underflow or overflow: keep in mind that you just need to determine which of $p(y=0 | \vec{x})$ or $p(y=1 | \vec{x})$ is **larger** -- you don't necessarily need to compute both values explicitly. 

Store your MAP estimates for the data examples in an integer vector `yhat` (with 0 indicating benign, 1 indicating malignant).
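
A common way to sidestep underflow is to compare log-posterior scores instead of probabilities: the shared normalizer $p(\vec{x})$ cancels, and the product of per-feature Gaussian densities becomes a sum of logs. The sketch below illustrates this on made-up parameters and data. The helper `log_gaussian` and all `*_toy` names are hypothetical, and the formula assumes independent Gaussian features within each class, so check it against your own homework derivation before relying on it.

```python
import numpy as np

def log_gaussian(X, mu, var):
    """Sum over features of log N(x_j; mu_j, var_j), one value per row of X."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (X - mu) ** 2 / (2 * var),
                  axis=1)

# Hypothetical toy parameters for two classes and two features.
p_toy = 0.4                                              # P(y = 1)
mu0, var0 = np.array([2.0, 12.0]), np.array([1.0, 4.0])  # class 0
mu1, var1 = np.array([9.0, 32.0]), np.array([1.0, 4.0])  # class 1

# The last row is an extreme outlier: both class likelihoods underflow to 0
# if computed directly, but their logs remain finite and comparable.
X_toy = np.array([[1.5, 11.0],
                  [9.5, 33.0],
                  [1000.0, 5000.0]])

score0 = np.log(1 - p_toy) + log_gaussian(X_toy, mu0, var0)  # log p(y=0) + log p(x|y=0)
score1 = np.log(p_toy) + log_gaussian(X_toy, mu1, var1)      # log p(y=1) + log p(x|y=1)

yhat_toy = (score1 > score0).astype(int)
print(yhat_toy)
print(np.exp(score0[2]), np.exp(score1[2]))  # both underflow to 0.0
```

Because the logarithm is monotonic, choosing the larger score gives the same decision as choosing the larger posterior probability.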

In [None]:
# TODO
# ...
# yhat = 

Compute the accuracy of your estimates using the code below. You should see a result which is very competitive with logistic regression, which is pretty cool given how simple this algorithm is!

**Note**: we would ideally want to do a proper train/test split to evaluate the performance of both logistic regression and Naive Bayes classification. We're just looking at training set accuracy to keep things simple, and because the number of features is relatively small, so we're not super worried about overfitting.
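
For reference, a hold-out split can be sketched with a random permutation of the row indices. The example below uses made-up toy arrays and a 30% test fraction chosen arbitrarily. With a split like this, the model parameters would be estimated on the training rows only and accuracy reported on the held-out rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 10 examples, 2 features, binary labels.
n_toy = 10
X_toy = np.arange(n_toy * 2, dtype=float).reshape(n_toy, 2)
y_toy = np.array([0, 1] * 5)

perm = rng.permutation(n_toy)      # shuffled row indices
n_test = round(0.3 * n_toy)        # arbitrary 30% hold-out fraction
test_idx, train_idx = perm[:n_test], perm[n_test:]

Xtr, ytr = X_toy[train_idx], y_toy[train_idx]   # fit parameters on these rows
Xts, yts = X_toy[test_idx], y_toy[test_idx]     # evaluate on these rows

print(Xtr.shape, Xts.shape)
```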

In [None]:
acc = np.mean(yhat == y)
print("Accuracy on training data = %f" % acc)
recall = np.sum((yhat == 1)*(y == 1))/np.sum(y == 1)
precision = np.sum((yhat == 1)*(y == 1))/np.sum(yhat == 1)
print("Recall: " + str(recall))
print("Precision: " + str(precision))