In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab4.ipynb")

---

<h1><center>SDSE Lab 4 <br><br> Linear regression and Feature selection </center></h1>

---

In this lab we will use linear regression to predict cancer mortality rates based on data obtained from the American Community Survey of the [U.S. Census Bureau](https://www.census.gov/). The lab has five parts. In part 1 you will load the data and do basic manipulations using [pandas](https://pandas.pydata.org/docs/index.html). Pandas is a Python package that specializes in tabular data. It is widely used in data science and machine learning since the data in these fields are usually structured as a table. Pandas is a very powerful library that is well worth investing some time in. [Here](https://pandas.pydata.org/docs/getting_started/index.html#getting-started) and [here](https://pandas.pydata.org/docs/user_guide/index.html) are resources to learn more.

In part 2 you will perform linear regression on the full feature set. In part 3 you will assess the performance of the linear regression model using the coefficient of determination. Then, in part 4 you will compute confidence intervals for the slope parameters of the linear regression model. Finally, in part 5 you will run the forward and backward stepwise feature selection algorithms and estimate the performance of the resulting model using a test dataset.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as stats
from resources.hashutils import *

# Part 1:  Load and clean the data

## 1.1 Load the data into a pandas DataFrame

Use [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to load the data from `cancerdata.csv`.

You can obtain information about the data using these DataFrame methods and attributes:
+ [`data0.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html): displays the first 5 rows of the DataFrame.
+ [`data0.tail()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html): displays the last 5 rows of the DataFrame.
+ [`data0.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html): a tuple with the number of rows and of columns in the DataFrame.
+ [`data0.columns`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html): the column headers.
+ [`data0.index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html): a unique identifier for each row.
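
As a quick illustration of these attributes, here is a minimal sketch on a small made-up table. The DataFrame `toy` and its column names are hypothetical and are not part of `cancerdata.csv`.

```python
import pandas as pd

# Hypothetical toy table (not the lab data)
toy = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E", "F"],
    "income": [41.0, 52.5, 38.2, 60.1, 47.3, 55.0],
    "rate": [180.2, 165.4, 190.8, 150.3, 172.9, 160.0],
})

first_rows = toy.head()        # first 5 rows
last_rows = toy.tail()         # last 5 rows
n_rows, n_cols = toy.shape     # (number of rows, number of columns)
col_names = list(toy.columns)  # column headers
row_labels = list(toy.index)   # default index is 0, 1, ..., n_rows-1
```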

In [None]:
data0 = ...  # TODO

## 1.2 Inspect the columns

Run `data0.info()` and note:
 a) which inputs are non-numerical (Dtype=object), and
 b) which inputs have null entries (Non-Null Count<3047).

Store the names ('Column' entry) of the non-numerical inputs in a [set](https://www.w3schools.com/python/python_sets.asp) called `non_numerical_inputs`. Store the names of inputs with null entries in a set called `null_entry_inputs`.

**Note**: If not all of the rows of `data0.info()` are displayed, you'll probably have this message at the bottom:

*"Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings..."*

In [None]:
data0.info()

In [None]:
non_numerical_inputs = {...,...}      # TODO
null_entry_inputs = {...,...,...}     # TODO

In [None]:
grader.check("q1p2")

## 1.3 Discard non-numerical columns

Remove the two columns with non-numeric data.

Hints:
+ `data0.dtypes` lists the data types for each column.
+ You can construct a boolean mask for non-numeric columns with `data0.dtypes=='object'`.
+ Use that mask to index `data0.columns`
+ Use [`data0.drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to remove the selected columns.
+ Save the result as `data1`
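
The hints above can be tried out on a toy table first. This is only a sketch: `toy`, `mask`, `cols_to_drop`, and `toy_numeric` are hypothetical names, and the text column is created with an explicit `object` dtype so that the comparison with `'object'` behaves as the hint describes.

```python
import pandas as pd

# Hypothetical toy table with one non-numeric column
toy = pd.DataFrame({
    "name": pd.Series(["a", "b", "c"], dtype=object),  # non-numeric
    "x1": [1.0, 2.0, 3.0],
    "x2": [4, 5, 6],
})

mask = toy.dtypes == "object"       # boolean Series indexed by column name
cols_to_drop = toy.columns[mask]    # names of the non-numeric columns
toy_numeric = toy.drop(columns=cols_to_drop)
```

`drop` returns a new DataFrame and leaves `toy` unchanged, which is why the result is assigned to a new name.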

In [None]:
ind = ...               # TODO
drop_cols = ...         # TODO
data1 = data0.drop(...)       # TODO

In [None]:
grader.check("q1p3")

## 1.4 Discard columns where more than 10% of values are nan

Hints:
+ [`.dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
+ Use `axis=1` to drop columns (as opposed to `axis=0` for rows).
+ Use the `thresh` argument. The condition for dropping a column is that it has fewer than `round(0.9*data1.shape[0])` non-nans.
+ Save the result as `data2`
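
To see how `thresh` behaves, here is a small sketch on made-up data with 10 rows. The names `toy`, `min_non_nan`, and `kept` are hypothetical. A column is kept when it has at least `thresh` non-nan values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with 10 rows
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],        # 0% nan
    "b": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],     # 10% nan
    "c": [np.nan, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 20% nan
})

min_non_nan = round(0.9 * toy.shape[0])        # 9 non-nans required
kept = toy.dropna(axis=1, thresh=min_non_nan)  # drops only column "c"
```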

In [None]:
thresh = ...                # TODO
data2 = data1.dropna(...)     # TODO

In [None]:
grader.check("q1p4")

## 1.5 Drop all rows that contain one or more nans.

Use the [`dropna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method of the `data2` DataFrame to remove rows with one or more nans.

Save the result as `data3`.

In [None]:
data3 = ...      # TODO

In [None]:
grader.check("q1p5")

## 1.6 Inspect correlations

Next we'll look at the sample correlation coefficients between each of the inputs and the output (a.k.a. target variable) `target_deathrate`. This is a quick way to check which of the inputs may be most useful to include in a model. Correlations only provide an initial guess, however. Remember that the correlation coefficient only measures the linear relationship between variables. That's perfect when the model is linear (as in this lab activity), but less useful for nonlinear models.

1) Use `data3.corr()` to build the correlation matrix.
2) Inspect the column (or row) corresponding to `target_deathrate`.
3) Rank (i.e. sort) the inputs from most to least correlated with the output. This ranking is in terms of the absolute value of the sample correlation coefficient.
4) Save the top 5 correlated inputs to `top_5_sort`. `top_5_sort` should be a numpy array with shape `(5,)`.

Hints:
+ [`abs`](https://pandas.pydata.org/docs/reference/api/pandas.Series.abs.html)
+ [`sort_values`](https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html)
+ [`to_numpy`](https://pandas.pydata.org/docs/reference/api/pandas.Series.to_numpy.html)
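
The ranking steps can be sketched on a toy table as follows. All names here (`toy`, `C_toy`, `corr_y`, `ranked`, `top_2_names`, `top_2_values`) are hypothetical. The sketch extracts both the names and the absolute correlation values of the top-ranked inputs, since which of the two the grader expects is something to confirm against the instructions above.

```python
import pandas as pd

# Hypothetical toy table: three inputs and one output "y"
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [5.0, 3.0, 4.0, 1.0, 2.0],
    "x3": [2.0, 1.0, 2.0, 1.0, 2.0],
    "y":  [1.1, 1.9, 3.2, 3.9, 5.1],
})

C_toy = toy.corr()                 # full correlation matrix
corr_y = C_toy["y"].drop("y")      # correlations of inputs with the output
ranked = corr_y.abs().sort_values(ascending=False)

top_2_names = ranked.index[:2].to_numpy()   # input names, most correlated first
top_2_values = ranked.to_numpy()[:2]        # matching absolute correlations
```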

In [None]:
# correlation matrix
C = ...         # TODO

# vector of correlations between the inputs and target_deathrate
corr_target = ...         # TODO

# corr_target sorted by absolute value, from most to least correlated
corr_target_sort = ...         # TODO

# top 5 correlations with target_deathrate
top_5_sort = ...         # TODO

In [None]:
grader.check("q1p6")

## 1.7 Make a scatter plot

Make a scatter plot of the data with the most correlated input along the x axis, and the target along the y axis.

Hint: You can use the plotting function attached to the DataFrame: [data3.plot(kind='scatter',x=..., y=...)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)

In [None]:
data3.plot(kind='scatter',x=...,y=...)   # TODO

---

# Part 2: Solve linear regression

## 2.1 Extract `X` and `Y` from `data3`

The next cell extracts the `X` and `Y` arrays from `data3` and constructs an array of input names called `all_inputs`. Define the number of samples `N` and the total number of inputs `D`.

In [None]:
X = data3.drop(columns='target_deathrate').values
all_inputs = data3.columns.values
all_inputs = all_inputs[all_inputs!='target_deathrate']
Y = data3['target_deathrate'].values

N = ...
D = ...

In [None]:
grader.check("q2p1")

## 2.2 Construct $\mathbb{X}$ defined in Eq. (6.112) of the reader

Implement the formula and store the result in the variable name `Xe`.

**Hint**: 
+ [np.hstack](https://numpy.org/doc/2.1/reference/generated/numpy.hstack.html)
+ [np.ones](https://numpy.org/doc/2.3/reference/generated/numpy.ones.html)
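
As an illustration of the two hinted functions, the sketch below prepends a column of ones to a small matrix. The names `X_toy`, `ones_col`, and `Xe_toy` are hypothetical, and placing the ones in the first column is an assumption. Follow the arrangement given in Eq. (6.112).

```python
import numpy as np

# Hypothetical toy input matrix with 3 samples and 2 inputs
X_toy = np.array([[2.0, 3.0],
                  [4.0, 5.0],
                  [6.0, 7.0]])
N_toy = X_toy.shape[0]

ones_col = np.ones((N_toy, 1))          # column vector of ones, shape (3, 1)
Xe_toy = np.hstack((ones_col, X_toy))   # assumed layout: ones first, shape (3, 3)
```

Note that `np.ones((N_toy, 1))` is two-dimensional. A 1D array of ones would not stack side by side with a 2D matrix.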

In [None]:
Xe = ...

In [None]:
grader.check("q2p2")

## 2.3 Find the solution $\underline{\hat\theta}$ of linear regression using Eq. (6.121) of the reader.

Implement the formula and store the result in the variable name `thetahat`.

Hint: 
+ [np.linalg.inv](https://numpy.org/doc/2.3/reference/generated/numpy.linalg.inv.html)
+ Matrix multiplication in NumPy can be achieved with the `@` operator or with [np.matmul](https://numpy.org/doc/2.3/reference/generated/numpy.matmul.html)
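
The sketch below shows `np.linalg.inv` and `@` working together on a toy problem with a known answer. It assumes that Eq. (6.121) is the usual least-squares solution $(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T Y$ and that the column of ones comes first. Both are assumptions to check against the reader. All names ending in `_toy` are hypothetical.

```python
import numpy as np

# Toy data generated exactly from y = 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0])
Y_toy = 1.0 + 2.0 * x

# Augmented matrix: assumed layout with the ones column first
Xe_toy = np.hstack((np.ones((4, 1)), x.reshape(-1, 1)))

# Assumed least-squares formula: inverse of (Xe^T Xe), times Xe^T, times Y
theta_toy = np.linalg.inv(Xe_toy.T @ Xe_toy) @ Xe_toy.T @ Y_toy
```

Because the toy data are noise-free, `theta_toy` recovers the intercept 1 and the slope 2 up to rounding error.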

In [None]:
thetahat = ...

In [None]:
grader.check("q2p3")

## 2.4 Extract $\hat\theta_0$ and $\underline{\hat\theta}_1$ from $\underline{\hat\theta}$ as per Eq. (6.113) of the reader.

Eq. (6.113) shows the arrangement of candidate parameters $\theta_0$ and $\underline{\theta}_1$ into a candidate parameter vector $\underline{\theta}$. The same arrangement applies to the parameter estimates $\hat\theta_0$ and $\underline{\hat\theta}_1$. Unpack $\underline{\hat\theta}$ into $\hat\theta_0$ and $\underline{\hat\theta}_1$, and save these respectively to variables `theta0hat` and `theta1hat`.


In [None]:
theta0hat = ...
theta1hat = ...

In [None]:
grader.check("q2p4")

---

# Part 3: Evaluate model performance

## 3.1 Compute predictions for each of the training samples using Eq. (6.111)

Implement the formula and store the result in the variable name `Yhat`.

In [None]:
Yhat = ... 

In [None]:
grader.check("q3p1")

## 3.2 Compute the coefficient of determination using Eq. (5.21). 

Implement the formula and store the result in the variable name `R2`.
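
For reference, here is a small worked example, assuming Eq. (5.21) is the common definition $R^2 = 1 - \mathrm{SS}_{res}/\mathrm{SS}_{tot}$. That form is an assumption, and the reader is the authority. The arrays `Y_toy` and `Yhat_toy` are made up.

```python
import numpy as np

# Hypothetical measurements and predictions
Y_toy = np.array([1.0, 2.0, 3.0, 4.0])
Yhat_toy = np.array([1.5, 1.5, 3.5, 3.5])

ss_res = np.sum((Y_toy - Yhat_toy) ** 2)       # residual sum of squares = 1.0
ss_tot = np.sum((Y_toy - Y_toy.mean()) ** 2)   # total sum of squares = 5.0
R2_toy = 1.0 - ss_res / ss_tot                 # 0.8
```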

In [None]:
R2 = ...   

In [None]:
grader.check("q3p2")

---

# Part 4: Quantify parameter uncertainty

## 4.1 Compute the average input value using Eq. (6.117) of the reader

Implement the formula and store the result in the variable name `muhatX`.

Hint: You can use the `axis` argument of [np.mean](https://numpy.org/doc/2.2/reference/generated/numpy.mean.html) to take an average across all samples. 

In [None]:
muhatX = ...

In [None]:
grader.check("q4p1")

## 4.2 Compute the centered inputs using Eq. (6.129) 


Implement the formula and store the result in the variable name `Xc`.

Hints:
+ The formula for `Xc` has $\mathbf{1}_N\hat\mu_X$. The broadcasting rules of numpy make multiplying $\hat\mu_X$ by $\mathbf{1}_N$ unnecessary.
+ Check that the column-wise means of `Xc` equal zero (to machine precision). 
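
The broadcasting behavior mentioned in the hint can be seen on a toy matrix. The names `X_toy`, `mu_toy`, `Xc_toy`, and `col_means` are hypothetical.

```python
import numpy as np

# Hypothetical toy input matrix with 3 samples and 2 inputs
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

mu_toy = np.mean(X_toy, axis=0)    # average over samples, shape (2,)
Xc_toy = X_toy - mu_toy            # (3, 2) minus (2,): mu_toy is subtracted from every row
col_means = Xc_toy.mean(axis=0)    # should be zero to machine precision
```

NumPy aligns the trailing dimensions of the two arrays, so the length-2 vector is applied to each of the 3 rows without building the product with a vector of ones explicitly.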

In [None]:
Xc = ...

In [None]:
grader.check("q4p2")

## 4.3 Compute $\hat\sigma^2$, an unbiased estimate of $\sigma^2$

Equation (6.105) of the reader gives the formula for estimating the variance of the measurement noise $\mathcal{E}$ when there is only one input ($D=1$). 

$$\hat\sigma^2 = \frac{1}{N-2} \sum_{i=1}^{N}(y_i-\hat{y}_i)^2$$

The generalization of this formula to the case of $D$ inputs is,

$$\hat\sigma^2 = \frac{1}{N-D-1} \sum_{i=1}^{N}(y_i-\hat{y}_i)^2$$

Implement this formula and assign the result to the variable `sigmahat2`.
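
A small numerical check of the formula, using made-up values with $N=5$ samples and a model assumed to have $D=2$ inputs (all `_toy` names are hypothetical):

```python
import numpy as np

# Hypothetical measurements and predictions
Y_toy = np.array([3.0, 5.0, 8.0, 6.0, 2.0])
Yhat_toy = np.array([2.0, 6.0, 6.0, 6.0, 4.0])
N_toy, D_toy = 5, 2   # assumed sample count and number of inputs

sse_toy = np.sum((Y_toy - Yhat_toy) ** 2)    # 1 + 1 + 4 + 0 + 4 = 10
sigma2_toy = sse_toy / (N_toy - D_toy - 1)   # 10 / 2 = 5.0
```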

In [None]:
sigmahat2 = ...

In [None]:
grader.check("q4p3")

## 4.4 Compute the variances of the slope parameters

The slope parameters are contained in the array $\underline{\hat\theta}_1$. Compute their covariance matrix using Eq. (6.165) of the reader. In this equation you should replace the true noise level $\sigma^2$ with the estimate $\hat\sigma^2$ calculated in part 4.3. Save this matrix to the variable `CovThetaHat1`. Then extract the diagonal entries of this matrix, which correspond to the variances of each of the slope parameters. Save these to a 1D array called `varThetaHat1`.

**Hint**: [`np.diag`](https://numpy.org/doc/stable/reference/generated/numpy.diag.html)

In [None]:
CovThetaHat1 = ...
varThetaHat1 = ...

In [None]:
grader.check("q4p4")

## 4.5 Compute the radiuses of confidence intervals on the slope parameters

Having the variances of each of the slope parameters in `varThetaHat1`, and assuming that the output measurement noise is Gaussian, we can apply our generic formula from lecture for the radius of a confidence interval. This gives the following for the radius of a confidence interval for $\theta_d$:
$$\rho_d =\sqrt{v_d} \left| F^{-1}_{\mathcal{N}}\left( \frac{1-\gamma}{2}\right)\right| $$
Here $v_d$ is the variance of the $d$'th slope parameter, i.e. `varThetaHat1[d]`.

Compute the radiuses of 95\% confidence intervals for each of the slope parameters. Store these in the array `rho`.
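
The inverse CDF of the standard normal, $F^{-1}_{\mathcal{N}}$, is available as `stats.norm.ppf`. The sketch below evaluates the formula for two made-up variances. `gamma_toy`, `v_toy`, `z`, and `rho_toy` are hypothetical names.

```python
import numpy as np
import scipy.stats as stats

gamma_toy = 0.95
v_toy = np.array([4.0, 0.25])   # hypothetical variances

# |F^{-1}((1 - gamma)/2)| for the standard normal, about 1.96 for gamma = 0.95
z = abs(stats.norm.ppf((1 - gamma_toy) / 2))

rho_toy = np.sqrt(v_toy) * z    # one radius per parameter
```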

In [None]:
gamma = 0.95

rho = ... 

In [None]:
grader.check("q4p5")

## 4.6 Tag as "significant" those parameters whose confidence interval does not include zero.

Create a 1D NumPy boolean array of the same size as `theta1hat` called `significant`. The $d$'th entry of `significant` should be `True` if the 95\% confidence interval for the corresponding slope parameter **does not** include 0, and `False` otherwise. In other words, an input is considered significant if its slope parameter is non-zero with 95\% confidence. 
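
One way to express this test with array operations is sketched below on made-up numbers. The interval for each parameter is taken to be the estimate plus or minus the radius, and the names `theta_toy`, `rho_toy`, `lower`, `upper`, and `significant_toy` are hypothetical.

```python
import numpy as np

theta_toy = np.array([2.5, -0.3, -4.0])   # hypothetical slope estimates
rho_toy = np.array([1.0, 0.5, 1.5])       # hypothetical interval radii

lower = theta_toy - rho_toy   # [ 1.5, -0.8, -5.5]
upper = theta_toy + rho_toy   # [ 3.5,  0.2, -2.5]

# The interval excludes zero if it lies entirely above or entirely below zero
significant_toy = (lower > 0) | (upper < 0)
```

An equivalent test is `np.abs(theta_toy) > rho_toy`.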

In [None]:
significant = ...

In [None]:
grader.check("q4p6")

## 4.7 Parameters table (done already)
Make a DataFrame with one row for each input. The index of the table should be the input names. The columns should be:
+ `slope`: the estimates of the slope parameter associated with the input. 
+ `slope stddev`: the standard deviation of the slope parameter.
+ `significant`: whether the input is significant according to part 4.6.

In [None]:
params_table = pd.DataFrame(index=all_inputs,
             data={'slope':theta1hat,
                   'slope stddev':np.sqrt(varThetaHat1),
                   'significant':significant})

params_table

## 4.8 Build an array of significant inputs

Extract the names of the significant inputs from `params_table` (built in part 4.7) using its `significant` column.

Store these significant input names as `significant_inputs`. 

`significant_inputs` should be a NumPy array with shape `(15,)`.

Here's one way you can do this that doesn't require a "for" loop:
1. Use the `significant` column to select the rows of the table corresponding to significant inputs. 
2. Use `.index` to obtain the names of the inputs for those rows. 
3. Use `.to_numpy()` to convert the result to a NumPy array.
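
The three steps can be sketched on a toy table as follows. The names `table`, `selected`, and `names` are hypothetical and the values are made up.

```python
import pandas as pd

# Hypothetical parameters table indexed by input name
table = pd.DataFrame(index=["x1", "x2", "x3"],
                     data={"slope": [2.5, -0.3, -4.0],
                           "significant": [True, False, True]})

selected = table[table["significant"]]   # 1. keep rows where the flag is True
names = selected.index.to_numpy()        # 2. and 3. row labels as a NumPy array
```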

In [None]:
...
significant_inputs = ...

In [None]:
grader.check("q4p8")

## 4.9 Create a new table with significant inputs only (done already)

This table is called `data` and the target variable is now called `Y`.

In [None]:
data = data3[significant_inputs].copy()
data['Y'] = data3['target_deathrate']
data

---

# Part 5: Feature selection 

## 5.1 Split `data` into training, validation, and testing datasets (done already)

We will use 70% of the data for training, 15% for validation, and 15% for testing.

1. Define `Dtrain` as the first `Ntrain` rows of `data`.
2. Define `Dvalidate` as the next `Nvalidate` rows of `data`.
3. Define `Dtest` as the last `Ntest` rows of `data`.

Here we use pandas' [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) indexer, which selects rows by integer position, to build the three datasets.

In [None]:
Ntrain = round(0.7*N)
Nvalidate = round(0.15*N)
Ntest = N - Ntrain - Nvalidate
Ntrain, Nvalidate, Ntest

Dtrain = data.iloc[:Ntrain,:]
Dvalidate = data.iloc[Ntrain:Ntrain+Nvalidate,:]
Dtest = data.iloc[Ntrain+Nvalidate:,:]

## 5.2 Linear regression training function

Create a function called `train` that receives a list of features `S` and a dataset `Dtrain` and does the following:
1. Selects the features `S` from `Dtrain` and stores them in `X`. (done already)
2. Selects the target values from `Dtrain` and stores them in `Y`. (done already)
3. Performs the linear regression calculations from parts 2.2 and 2.3 (copy your code from those parts into the `train` method)
4. Returns the estimated parameter vector `thetahat` containing $\hat\theta_0$ and $\underline{\hat\theta}_1$

In [None]:
def train(S, Dtrain):

    X = Dtrain[list(S)].values
    Y = Dtrain['Y'].values
    N = X.shape[0]

    # 2.2 Construct $\mathbb{X}$
    Xe = ...

    # 2.3 Compute thetahat
    thetahat = ...

    return thetahat

In [None]:
# Use this cell to test your code

thetahat = train(['incidencerate','birthrate'], Dtrain)

In [None]:
grader.check("q5p2")

## 5.3 Model evaluation function

Create a function called `train_eval` that receives
+ `S`: a set of input feature names,
+ `Dtrain`: the training dataset, and
+ `Deval`: the dataset used for evaluation, which may be the training, the validation, or the testing dataset.

The function should train a linear regression model on `Dtrain` using the features in `S`, and then evaluate the coefficient of determination $R^2$ of that model on `Deval`.

The steps are:
1. Train the model by calling `train` from part 5.2.
2. Unpack `thetahat` into `theta0hat` and `theta1hat`, as in part 2.4.
3. Select the features `S` from `Deval` and store them in `X`. (done already)
4. Select the target values from `Deval` and store them in `Y`. (done already)
5. Compute `Yhat`, as in part 3.1.
6. Evaluate $R^2$, as in part 3.2.

In [None]:
def train_eval(S, Dtrain, Deval):

    # fix the order of the inputs S by casting it to a list
    S = list(S)

    # train linear regression (call the training function you already made)
    thetahat = ...

    # unpack the linear regression coefficients
    theta0hat = ...
    theta1hat = ...

    # unpack the evaluation data
    X = Deval[S].values
    Y = Deval['Y'].values

    # compute predictions for each of the samples (same as part 3.1)
    Yhat = ...

    # evaluate performance with R2 (same as part 3.2)
    R2 = ...
    
    return R2

In [None]:
# Use this cell to test your code
r2 = train_eval(['incidencerate','birthrate'], Dtrain, Dvalidate)

In [None]:
grader.check("q5p3")

## 5.4 Forward feature selection (done already)

Below is a method that implements the forward feature selection algorithm that was described by your GSI in lab. 

This part has no deliverables.

In [None]:
def forward_selection(all_inputs):

    # allocate stuff
    P = len(all_inputs)                     # ... number of inputs
    setF = set(all_inputs)                  # ... fixed set of all inputs
    setFk = [set() for i in range(P+1)]     # ... array of best input set at each stage
    ellk = np.empty(P+1)             # ... array of best performance at each stage

    # initialize
    setFk[0] = set()
    ellk[0] = train_eval(set(),Dtrain,Dvalidate)

    # loop through stages
    for k in range(1,P+1):
        
        # allocate for inner loop 
        setA = [set() for i in range(P-k+1)]  # ... array of input sets to evaluate
        pkappa = np.full(P-k+1,np.inf)        # ... array of performance values

        # inner loop: through all remaining inputs to add
        for kappa, phip in enumerate(setF-setFk[k-1]):

            # add the p-th input (number kappa among remaining inputs)
            setA[kappa] = setFk[k-1].union({phip})

            # train with the training data, evaluate with validation data
            pkappa[kappa] = train_eval(setA[kappa],Dtrain,Dvalidate)

        # keep the best set and its performance.
        kappastar = pkappa.argmax()
        setFk[k] = setA[kappastar]
        ellk[k] = pkappa[kappastar]

    # keep the best over all stages
    kstar = ellk.argmax()
    Fstar = setFk[kstar]

    # evaluate the best input set on the test data
    ellstar = train_eval(Fstar, Dtrain, Dtest)

    return ellk, ellstar, kstar

In [None]:
# You can run forward selection on the collection of significant inputs

f_ellk, f_ellstar, f_kstar = forward_selection(significant_inputs)

## 5.5 Backward feature removal

Implement backward feature removal.
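
As a generic illustration of the idea, and not the specific version described in lab, here is a minimal sketch of greedy backward removal with a pluggable scoring function. Everything in it is hypothetical: the function `backward_removal_sketch`, the `score` callable (higher is better), and the toy `weights`. It starts from the full input set and, at each stage, removes the single input whose removal yields the best score. The return values and bookkeeping differ from what `backward_removal` must return in this lab.

```python
def backward_removal_sketch(all_names, score):
    """Greedy backward removal; `score(S)` returns a value to maximize."""
    current = set(all_names)
    best_set, best_score = set(current), score(current)
    history = [(set(current), best_score)]   # one entry per stage

    while current:
        # Try removing each remaining input; keep the removal that scores best
        candidates = [(score(current - {name}), name) for name in sorted(current)]
        stage_score, removed = max(candidates)
        current = current - {removed}
        history.append((set(current), stage_score))

        # Track the best set seen over all stages
        if stage_score > best_score:
            best_set, best_score = set(current), stage_score

    return best_set, best_score, history


# Toy scoring function: sum of made-up per-input weights
weights = {"a": 3.0, "b": 1.0, "c": -2.0}

def toy_score(S):
    return sum(weights[name] for name in S)

best_set, best_score, history = backward_removal_sketch(weights.keys(), toy_score)
```

With these weights, the harmful input `"c"` is removed first, and the best set over all stages is `{"a", "b"}`. In the lab, the role of `score` would be played by an evaluation on the validation data.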

In [None]:
def backward_removal(all_inputs):

    ...

    return ellk, ellstar, kstar

In [None]:
# Use this cell to test your code
b_ellk, b_ellstar, b_kstar = backward_removal(significant_inputs)

b_ellk, b_ellstar, b_kstar

In [None]:
grader.check("q5p5")

## 5.6 Run forward and backward selection (done already)

In [None]:
f_ellk, f_ellstar, f_kstar = forward_selection(significant_inputs)
b_ellk, b_ellstar, b_kstar = backward_removal(significant_inputs)
P = len(significant_inputs)

## 5.7 Plot (done already)

In [None]:
plt.figure(figsize=(10,5))

# Plot forward selection in blue
color = 'blue'
plt.plot(range(P+1),f_ellk,'o-',color=color,linewidth=3,label='forward (validation)')
plt.plot([f_kstar,f_kstar],[f_ellk[f_kstar],f_ellstar],color=color,linestyle='--',linewidth=2)
plt.plot(f_kstar,f_ellstar,'*',color=color,markersize=22,label='forward (test)')

# Plot backward removal in orange
color = 'darkorange'
plt.plot(range(P+1),b_ellk,'o-',color=color,linewidth=2,label='backward (validation)')
plt.plot([b_kstar,b_kstar],[b_ellk[b_kstar],b_ellstar],color=color,linestyle=':',linewidth=2)
plt.plot(b_kstar,b_ellstar,'*',color=color,markersize=16,label='backward (test)')

plt.legend(fontsize=14)
plt.grid(linestyle=':')
plt.xticks(range(P+1),fontsize=16)
plt.xlabel('k (number of inputs included)',fontsize=16)
plt.ylabel(r'$R^2$',fontsize=16,rotation=0,labelpad=20)

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Make sure you submit the .zip file to Gradescope.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)