**1**. (25 points)

- Download the data from the URL given into a `pandas` DataFrame (5 points)
```
https://gist.github.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv
```
- Your objective is to predict the `mpg` of a car from `hp`, `wt` and `am`
    - Use the last 10 rows as test data and the rest as training data (5 points)
    - Train a multiple linear regression model on the training data (10 points)
    - Evaluate the mean squared error on the test data (5 points)
    
You may **only** use the following class from `sklearn`  (default parameters are fine)

- `sklearn.linear_model.LinearRegression`

In particular, splitting into test and train data and calculation of the mean squared error should not use `sklearn`.

In [None]:
import numpy as np
import pandas as pd
url = 'https://gist.github.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
df = pd.read_csv(url)

In [None]:
df.shape

In [None]:
df.head()

In [None]:
X = df[['hp', 'wt', 'am']]
y = df['mpg']

In [None]:
X_train, X_test = X[:-10], X[-10:]
y_train, y_test = y[:-10], y[-10:]

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X_train, y_train)

In [None]:
y_pred = lr.predict(X_test)

In [None]:
np.mean((y_test - y_pred)**2)
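
As a sanity check on the split and the error calculation, the same steps can be reproduced with plain `numpy`. This is a minimal sketch on synthetic data (the values, coefficients and the helper `add_intercept` are assumptions for illustration, not the mtcars data), using `np.linalg.lstsq` in place of `LinearRegression`:

```python
import numpy as np

# Synthetic stand-in for three predictors and one outcome (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
y = 2.0 + X @ np.array([1.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=32)

# Last 10 rows for testing, the rest for training
X_train, X_test = X[:-10], X[-10:]
y_train, y_test = y[:-10], y[-10:]

def add_intercept(A):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(A)), A])

# Ordinary least squares fit on the training rows
beta, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# Predict on the held-out rows and compute the mean squared error by hand
y_pred_np = add_intercept(X_test) @ beta
mse_np = np.mean((y_test - y_pred_np) ** 2)
```

With default parameters, `LinearRegression` also fits an ordinary least squares model with an intercept, so on the same data the two approaches should agree up to floating point error.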

**2**. (25 points)

- Using the `requests` package, download berries 1, 2, and 3 from `https://pokeapi.co/api/v2/berry` in JSON format and convert to a `pandas` DataFrame (5 points)
- Create a new DataFrame that retains only the `name` column and the numeric columns. You should find the appropriate columns programmatically, not hard-code their locations. (10 points)
- Show only rows where the name begins with the letter `c` (5 points)
- Convert to a `numpy` array (excluding `name`) and standardize so each **row** has mean 0 and standard deviation 1 (5 points)

In [None]:
import requests
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

In [None]:
berries = []
for i in range(1, 4):
    berry = requests.get('https://pokeapi.co/api/v2/berry/{}/'.format(i)).json()
    berries.append(berry)

In [None]:
berries_df = pd.DataFrame(berries)

In [None]:
idx = np.nonzero([is_numeric_dtype(x) for x in berries_df.dtypes])

In [None]:
idx = np.r_[idx[0], [berries_df.columns.tolist().index('name')]]

In [None]:
df = berries_df.iloc[:, idx]

In [None]:
df[df.name.str.startswith('c')]

In [None]:
xs = df.iloc[:, :-1].values

In [None]:
(xs - xs.mean(axis=1)[:, None])/xs.std(axis=1)[:, None]
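
An alternative way to pick out the numeric columns is `DataFrame.select_dtypes`. The sketch below uses a small hand-made table shaped loosely like the berry records (the records and the names `demo`, `small`, `c_rows` are assumptions for illustration, so no network access is needed):

```python
import numpy as np
import pandas as pd

# Hypothetical records: a name, a few numeric fields, and one nested (object) field
records = [
    {"name": "cheri", "id": 1, "growth_time": 3, "size": 20, "item": {"name": "cheri-berry"}},
    {"name": "chesto", "id": 2, "growth_time": 3, "size": 80, "item": {"name": "chesto-berry"}},
    {"name": "pecha", "id": 3, "growth_time": 3, "size": 40, "item": {"name": "pecha-berry"}},
]
demo = pd.DataFrame(records)

# Keep `name` plus every numeric column, found by dtype rather than by position
numeric = demo.select_dtypes(include="number")
small = pd.concat([demo[["name"]], numeric], axis=1)

# Rows whose name starts with 'c'
c_rows = small[small["name"].str.startswith("c")]

# Standardize each row: subtract the row mean, divide by the row standard deviation
arr = numeric.to_numpy(dtype=float)
z = (arr - arr.mean(axis=1, keepdims=True)) / arr.std(axis=1, keepdims=True)
```

Passing `keepdims=True` keeps the row statistics as column vectors, which plays the same role as the `[:, None]` indexing in the cell above.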

**3**. (25 points)

We have provided an SQLite3 database in `data/pets.db` with 3 tables `dog`, `treat` and `dog_treat`. The `dog_treat` table is a linker table showing which dog ate which treat. 

- Show a table of ALL dogs and the treats with calories that they ate with column names `dog`, `treat`, `calorie`. A dog that did not eat any treats should still be present in the table (15 points)
- Using a common table expression, show a table with two columns `dog` and `total_calories` where only dogs that have eaten more than 500 calories are displayed (5 points)

In [None]:
%load_ext sql

In [None]:
%sql sqlite:///data/pets.db

In [None]:
%sql SELECT * FROM sqlite_master

In [None]:
%%sql

SELECT dog.name as dog, treat.name as treat, treat.calories as calorie
FROM dog
LEFT JOIN dog_treat
ON dog.dog_id = dog_treat.dog_id
LEFT JOIN treat
ON dog_treat.treat_id = treat.treat_id

In [None]:
%%sql

WITH t AS
(SELECT dog.name as dog, treat.name as treat, treat.calories
FROM dog
LEFT JOIN dog_treat
ON dog.dog_id = dog_treat.dog_id
LEFT JOIN treat
ON dog_treat.treat_id = treat.treat_id)
SELECT dog, SUM(calories) as total_calories
FROM t
GROUP BY dog
HAVING total_calories > 500
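
To see why the `LEFT JOIN` keeps dogs without treats, the two queries can be tried on a tiny in-memory database. This is a sketch only: the schema and rows below are assumptions made up for illustration and need not match `pets.db`.

```python
import sqlite3

# Hypothetical miniature version of the three tables
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dog (dog_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE treat (treat_id INTEGER PRIMARY KEY, name TEXT, calories INTEGER);
CREATE TABLE dog_treat (dog_id INTEGER, treat_id INTEGER);
INSERT INTO dog VALUES (1, 'rex'), (2, 'fido'), (3, 'luna');
INSERT INTO treat VALUES (1, 'bone', 400), (2, 'biscuit', 150);
INSERT INTO dog_treat VALUES (1, 1), (1, 2), (2, 2);
""")

# LEFT JOIN keeps every dog; a dog with no treats gets NULL (None) in the treat columns
all_rows = con.execute("""
SELECT dog.name AS dog, treat.name AS treat, treat.calories AS calorie
FROM dog
LEFT JOIN dog_treat ON dog.dog_id = dog_treat.dog_id
LEFT JOIN treat ON dog_treat.treat_id = treat.treat_id
ORDER BY dog.dog_id, treat.treat_id
""").fetchall()

# Common table expression, then aggregate and filter on the aggregate with HAVING
big_eaters = con.execute("""
WITH t AS (
    SELECT dog.name AS dog, treat.calories AS calories
    FROM dog
    JOIN dog_treat ON dog.dog_id = dog_treat.dog_id
    JOIN treat ON dog_treat.treat_id = treat.treat_id
)
SELECT dog, SUM(calories) AS total_calories
FROM t
GROUP BY dog
HAVING total_calories > 500
""").fetchall()
```

In this toy data `luna` ate nothing, so she appears once in `all_rows` with `None` for both treat columns, and only `rex` (400 + 150 calories) passes the 500-calorie filter.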

**4**. (40 points)

You want to evaluate whether a linear, quadratic or cubic polynomial is the best model for a set of data using leave-one-out cross-validation (LOOCV) and the mean squared error as the evaluation metric. 

- Write a function named `loocv` that takes the predictor variable `x`, the outcome variable `y`, a list of degrees of polynomial models to be evaluated, and an evaluation function, and returns the best model found by LOOCV. For example, you would call the function like this: `loocv(x, y, [1,2,3], mse)`, where `mse` is a function that returns the mean squared error. (30 points)
- Write the `mse` function to provide to the LOOCV routine (5 points)
- Use the `loocv` function to find the best polynomial model for the data provided (5 points)

Notes

- Use the `x` and `y` variables provided
- Do not use any packages except for the standard library and `numpy`
- Code snippets for fitting and evaluation of polynomials are provided
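
For reference, one common way to write the LOOCV score for a polynomial of degree $d$ on $n$ points is given below. The notation is an assumption introduced here, not part of the assignment: $\hat{f}_d^{(-i)}$ denotes the degree-$d$ polynomial fitted to all points except the $i$-th.

$$
\mathrm{CV}(d) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}_d^{(-i)}(x_i) \right)^2
$$

The preferred degree is the one with the smallest $\mathrm{CV}(d)$. Because $n$ is the same for every degree, summing the held-out errors instead of averaging them gives the same ranking.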

In [None]:
import numpy as np

In [None]:
x = np.load('data/x.npy')
y = np.load('data/y.npy')

In [None]:
coeffs = np.polyfit(x, y, 2)
ypred = np.polyval(coeffs, x)

In [None]:
x

In [None]:
x[np.ones(10).astype('bool')]

In [None]:
def loocv(x, y, degrees, metric):
    """Performs LOOCV to find the best polynomial model."""
    
    n = len(x)
    losses = []
    for d in degrees:
        loss = 0
        for i in range(n):
            # Boolean mask that holds out observation i
            idx = np.ones(n, dtype=bool)
            idx[i] = False
            xx = x[idx]
            yy = y[idx]
            # Fit on the remaining n-1 points, then score the held-out point
            coeffs = np.polyfit(xx, yy, d)
            ypred = np.polyval(coeffs, x[i])
            loss += metric(y[i], ypred)
        losses.append(loss)
    # Summed loss ranks the degrees the same way as mean loss (n is fixed)
    k = np.argmin(losses)
    return degrees[k]

In [None]:
def mse(y, ypred):
    """Returns MSE between y and ypred."""
    
    return np.mean((y - ypred)**2)

In [None]:
loocv(x, y, [1, 2, 3, 4], mse)
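
When comparing degrees it can help to look at the per-degree scores rather than only the winner. The following is a sketch under stated assumptions: `loocv_scores`, `x_demo` and `y_demo` are hypothetical names, the data are synthetic (a noisy quadratic) rather than the provided `x` and `y`, and `np.delete` is used as an alternative to the boolean mask.

```python
import numpy as np

def mse(y, ypred):
    """Returns MSE between y and ypred."""
    return np.mean((y - ypred) ** 2)

def loocv_scores(x, y, degrees, metric):
    """Return {degree: mean held-out loss} under leave-one-out CV."""
    scores = {}
    for d in degrees:
        losses = []
        for i in range(len(x)):
            # Drop observation i, fit on the rest, score the held-out point
            coeffs = np.polyfit(np.delete(x, i), np.delete(y, i), d)
            losses.append(metric(y[i], np.polyval(coeffs, x[i])))
        scores[d] = float(np.mean(losses))
    return scores

# Synthetic data from a quadratic with Gaussian noise (made-up coefficients)
rng = np.random.default_rng(1)
x_demo = np.linspace(-3, 3, 30)
y_demo = 1.0 - 2.0 * x_demo + 0.5 * x_demo**2 + rng.normal(scale=0.3, size=x_demo.size)

scores = loocv_scores(x_demo, y_demo, [1, 2, 3], mse)
best_degree = min(scores, key=scores.get)
```

On data like this the linear model should score clearly worse than the quadratic, while the quadratic and cubic scores are typically close, which is worth keeping in mind when the selected degree is higher than expected.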