In [None]:
# initializing otter-grader
import otter
grader = otter.Notebook()

# Lab 7: Fitting Models to Data


In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

## Objectives for Lab 7:

Fitting models to data is a common task in data science. In this lab, you will practice fitting the following models to data:

* Linear fit
* Normal distribution

## Boston Housing Dataset

In [2]:
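# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston
boston_dataset = load_boston()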

print(boston_dataset['DESCR'])

In [3]:
housing = pd.DataFrame(boston_dataset['data'], columns=boston_dataset['feature_names'])
housing['MEDV'] = boston_dataset['target']
housing.head()

In [4]:
fig, ax = plt.subplots(figsize=(10, 7))
sns.scatterplot(x='LSTAT', y='MEDV', data=housing)
plt.show()

The model for the relationship between the response variable MEDV ($y$) and the predictor variable LSTAT ($u$) is
$$ y_i = \beta_0 + \beta_1 u_i + \epsilon_i, $$
where $\epsilon_i$ is random noise.

To fit the linear model to the data, we minimize the sum of squared errors over all observations, $i=1,2,\dots,n$:
$$\begin{aligned}
&\min_{\beta} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 u_i )^2 = \min_{\beta} \sum_{i=1}^n (y_i - x_i^T \beta)^2 = \min_{\beta} \|y - X \beta\|_2^2
\end{aligned}$$
where $\beta = (\beta_0,\beta_1)^T$, $x_i^T = (1, u_i)$, $y = (y_1, y_2, \dots, y_n)^T$, and the $i$-th row of $X$ is $x_i^T$.
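The equivalence of the three forms of the objective can be checked numerically. The following is a minimal sketch on synthetic data (the data, the seed, and the candidate `beta` are all made up for illustration and are not part of the housing dataset):

```python
import numpy as np

# Synthetic data (hypothetical, not the housing data): n observations of one predictor u.
rng = np.random.default_rng(0)
n = 50
u = rng.uniform(0.0, 10.0, size=n)
y = 3.0 - 0.5 * u + rng.normal(scale=0.1, size=n)

# Design matrix: the i-th row is x_i^T = (1, u_i).
X = np.column_stack([np.ones(n), u])

# An arbitrary candidate coefficient vector beta = (beta_0, beta_1)^T.
beta = np.array([2.0, -0.3])

# Three equivalent ways to compute the sum of squared errors.
sse_scalar = sum((y[i] - beta[0] - beta[1] * u[i]) ** 2 for i in range(n))
sse_rows = sum((y[i] - X[i] @ beta) ** 2 for i in range(n))
sse_norm = np.linalg.norm(y - X @ beta) ** 2

print(sse_scalar, sse_rows, sse_norm)
```

The three printed values should agree up to floating-point rounding, which is why the matrix form $\|y - X\beta\|_2^2$ can be used in place of the explicit sum.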

## Question 1: Constructing Data Variables

Define $y$ and $X$ from the `housing` data.

<!--
BEGIN QUESTION
name: q1
manual: false
points: 4
-->

In [5]:
y = ...
X = ...
...
# X.insert(..., 'intercept', ...)

## Installing CVXPY

First, install the `cvxpy` package by running the following shell command:

In [11]:
!pip install cvxpy

## Question 2: Fitting Linear Model to Data

Read this example of how a cvxpy problem is set up and solved: https://www.cvxpy.org/examples/basic/least_squares.html

The usage of cvxpy parallels our conceptual understanding of the components of an optimization problem:
* `beta` holds the variables $\beta$
* `loss` is the sum of squared errors
* `prob` minimizes the loss by choosing $\beta$

Make sure to extract the underlying data array from data frames (or series) by using `values`, e.g., `X.values`.

<!--
BEGIN QUESTION
name: q2
manual: false
points: 4
-->

In [12]:
import cvxpy as cp

beta = ...
loss = ...
prob = ...

prob.solve()

yhat = ...

## Question 3: Visualizing the Resulting Linear Fit

Visualize the fitted model by plotting `MEDV` against `LSTAT`.

<!--
BEGIN QUESTION
name: q3
manual: true
points: 4
-->
<!-- EXPORT TO PDF -->

In [14]:
fig, ax = plt.subplots(figsize=(10, 7))

...

plt.show()

## Question 4: Fitting Quadratic Model to Data

Add a column of squared `LSTAT` values to `X`. The new model is
$$ y_i = \beta_0 + \beta_1 u_i + \beta_2 u_i^2 + \epsilon_i. $$
Then, fit the quadratic model to the data.
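A model that is quadratic in the predictor is still linear in the coefficients, so the same least-squares setup applies once a squared column is added to the design matrix. The sketch below illustrates this on synthetic, noise-free data with made-up coefficients; it uses `np.linalg.lstsq` as a stand-in solver rather than cvxpy, and the names `X_quad`, `y_quad`, and `beta_quad` are hypothetical.

```python
import numpy as np

# Synthetic, noise-free data (hypothetical) generated from a known quadratic.
u = np.linspace(0.0, 10.0, 30)
y_quad = 4.0 - 1.5 * u + 0.2 * u**2

# Design matrix with columns (1, u, u^2); the model stays linear in beta.
X_quad = np.column_stack([np.ones_like(u), u, u**2])

# Least-squares solution via NumPy (a stand-in for the cvxpy solve used in this lab).
beta_quad, *_ = np.linalg.lstsq(X_quad, y_quad, rcond=None)

print(beta_quad)
```

Because the data contain no noise, the recovered coefficients should match the generating ones up to numerical precision.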

<!--
BEGIN QUESTION
name: q4a
manual: false
points: 4
-->

In [15]:
X.insert(2, 'LSTAT^2', X['LSTAT']**2)

beta = ...
loss = ...
prob = ...

prob.solve()

yhat = ...

Visualize the quadratic fit:

<!--
BEGIN QUESTION
name: q4b
manual: true
points: 4
-->
<!-- EXPORT TO PDF -->

In [17]:
fig, ax = plt.subplots(figsize=(10, 7))

...

plt.show()

# Running Built-in Tests
1. All tests are in the `tests` directory
1. Each Python file in `tests` is a test
1. `grader.check('testname')` runs the test `'testname'`, e.g. `'q1'`
1. `grader.check_all()` runs all visible tests

In [None]:
# Run built-in checks
grader.check_all()

In [None]:
# Uncomment to generate pdf in classic notebook (does not work in JupyterLab):
# import nb2pdf
# nb2pdf.convert('lab07.ipynb')

# Uncomment to generate pdf using command-line tool:
# ! nb2pdf lab07.ipynb

# Submission Checklist
1. Check that the filename is `lab07.ipynb`
1. Save the file to confirm all changes are on disk
1. Run *Kernel > Restart & Run All* to execute all code from top to bottom
1. Check the `grader.check_all()` output
1. Save the file again to write any new output to disk
1. Check the generated PDF to confirm that all responses are displayed correctly
1. Submit `lab07.ipynb` and `lab07.pdf` to Gradescope