In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab05.ipynb")

<a id='verytop'></a>

# Lab 5 v2 (updated 11/7/25 at 10:35am) : Linear Regression and Least Squares

## Due Date: Thurs, Nov 13th 11:59 PM on Gradescope


### Detailed Submission Instructions Are Provided at the end of this Notebook


## Collaboration Policy

A key step in learning and retention is **creating solutions on your own.**   Below are examples of acceptable vs unacceptable use of resources and collaboration when doing lab assignments in CSCI 2820.


The following would be some **examples of cheating** when working on lab assignments in CSCI 2820.  Any of these constitute a **violation of the course's collaboration policy and will result in an F in the course and a trip to the honor council**.   


 - Consulting web pages that may have a solution to a given lab problem or one similar is cheating.  However, consulting notes from the class videos, and web pages that explain the material taught in class but do NOT show a solution to the lab problem in question are permissible to view.  Clearly, there's a fuzzy line here between a valid use of resources and cheating. To avoid this line, one should merely consult the course videos, the course textbooks, and references that contain syntax and/or formulas.
 - Copying a segment of code or math solution of three lines or more from another student from a printout, handwritten copy, or by looking at their computer screen 
 - Allowing another student to copy a segment of your code or math solution of three lines or more
 - Taking a copy of another student's work (or a solution found online) and then editing that copy
 - Reading someone else’s solution to a problem on the lab before writing your own.
 - Asking someone to write all or part of a program or solution for you.
 - Asking someone else for the code necessary to fix the error for you, other than for simple syntactical errors
 


On the other hand, the following are some **examples of things which would NOT usually be
considered to be cheating**:
 - Working on a lab problem on your own first and then discussing with a classmate a particular part in the problem solution where you are stuck.  After clarifying any questions you should then continue to write your solution independently.
 - Asking someone (or searching online) how a particular construct in the language works.
 - Asking someone (or searching online) how to formulate a particular construct in the language.
 - Asking someone for help in finding an error in your program.  
 - Asking someone why a particular construct does not work as you expected in a given program.
   

To test whether you are truly doing your own work and retaining what you've learned you should be able to easily reproduce from scratch and explain a lab solution that was your own when asked in office hours by a TA/Instructor or on a quiz/exam.   


If you have difficulty in formulating the general solution to a problem on your own, or
you have difficulty in translating that general solution into a program, it is advisable to see
your instructor or teaching assistant rather than another student as this situation can easily
lead to a, possibly inadvertent, cheating situation.

We are here to help!  Visit office hours and/or post questions on Piazza!



## Grading
Grading is broken down into autograded answers and manually graded answers. 

For autograded answers, the results of your code are compared to provided and/or hidden tests.

For manually graded answers you must show and explain all steps.  Graders will evaluate how well you answered the question and/or fulfilled the requirements of the question.


<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


In [None]:
# import useful libraries
import numpy as np
import matplotlib.pyplot as plt
import sympy as sp
import pandas as pd

sp.init_printing(use_unicode=True)


# Function needed to run in-notebook tests
import hashlib

def get_hash(num):
    """Helper function for assessing correctness"""
    return hashlib.md5(str(num).encode()).hexdigest()


def get_array_hash_normalized(arr):
    """
    Hash numpy array ignoring dtype differences (values only),
    treating -0.0 as 0.0 and canonicalizing NaNs.
    """
    # Work on a contiguous float64 copy so the caller's array is never modified in place
    arr = np.array(arr, dtype=np.float64, order="C")

    # View as unsigned integers so we can manipulate bits
    view = arr.view(np.uint64)

    # Mask for the exponent+mantissa (everything except the sign bit)
    mant_exp_mask = (1 << 63) - 1

    # For entries that are exactly zero (pos or neg), clear sign bit
    zero_mask = (view & mant_exp_mask) == 0
    view[zero_mask] = 0

    # Canonicalize NaNs: all NaNs get the same payload
    nan_mask = np.isnan(arr)
    if nan_mask.any():
        arr[nan_mask] = np.float64(np.nan)

    return hashlib.md5(arr.tobytes() + str(arr.shape).encode()).hexdigest()

# QR Factorization 



Given a matrix $A$ with linearly independent columns, the QR factorization of $A$ is a pair of matrices $Q$ and $R$ such that $Q$ has orthonormal columns (so $Q$ is an orthogonal matrix when $A$ is square), $R$ is upper triangular, and $QR=A$.

### Structure of orthogonalization 

If $A$ is an $m\times n$ matrix with linearly independent columns, it must be that $m \ge n$.  The matrix $Q$ then will be $m\times n$ with orthonormal columns, and $R$ will be $n\times n$ and upper triangular.  For example, if $A$ is a $6\times 4$ matrix, the matrices have the following structures, with the $A_i$ and $U_i$ being vectors in $\mathbb{R}^6$.

$$
\begin{equation}
A =  \left[ \begin{array}{c|c|c|c} & & & \\
A_1 & A_2 & A_3 & A_4 \\ & & & \end{array} \right] \hspace{2cm}
Q =  \left[ \begin{array}{c|c|c|c} & & & \\
U_1 & U_2 & U_3 & U_4 \\ & & & \end{array} \right] \hspace{2cm}
R = \left[ \begin{array}{cccc} * & * & * & * \\ 0 & * & * & * \\ 0 & 0 & * & * \\ 0 & 0 & 0 & *  \end{array}\right]
\end{equation}
$$

The columns of $Q$ are the result of applying the orthogonalization process to the columns of $A$.  If we suppose that this is the case, let's explain why $R$ must be triangular by looking at the product $QR$ one column at a time.  For the first column we have the following vector equation which specifies the linear combination of the $U$ vectors that form $A_1$.

$$
\begin{equation}
\left[ \begin{array}{c|c|c|c} & & & \\
U_1 & U_2 & U_3 & U_4 \\ & & & \end{array} \right]
\left[ \begin{array}{c} r_{11} \\ r_{21} \\ r_{31} \\ r_{41} \end{array} \right]
= r_{11}U_1 + r_{21}U_2 + r_{31}U_3 + r_{41}U_4 = A_1
\end{equation}
$$

We know however that $U_1$ is the unit vector in the direction of $A_1$.  This means that $r_{21}=r_{31}=r_{41}=0$ and 
$r_{11} = ||A_1||$.  Let's also note that  $||A_1|| = U_1\cdot A_1$.

For the second column we have a similar equation.


$$
\begin{equation}
\left[ \begin{array}{c|c|c|c} & & & \\
U_1 & U_2 & U_3 & U_4 \\ & & & \end{array} \right]
\left[ \begin{array}{c} r_{12} \\ r_{22} \\ r_{32} \\ r_{42} \end{array} \right]
= r_{12}U_1 + r_{22}U_2 + r_{32}U_3 + r_{42}U_4 = A_2
\end{equation}
$$

We know from the orthogonalization process that $U_2$ is built by subtracting from $A_2$ the component that is in the $U_1$ direction.  Thus, $A_2$ is a linear combination of $U_1$ and $U_2$.  This means that $r_{32}=r_{42}=0$ and $r_{12}$ and $r_{22}$ are the coordinates of $A_2$ with respect to $U_1$ and $U_2$, which we can compute as $r_{12} = U_1\cdot A_2$ and 
$r_{22} = U_2\cdot A_2$.

Carrying out the same reasoning for the last two columns, we find that in general $r_{ij} = U_i\cdot A_j$ and that $r_{ij} = 0$ for $i>j$, because $A_j$ lies in the span of $\{U_1, U_2, ..., U_j\}$, which is equal to the span of $\{A_1, A_2, ..., A_j\}$.
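The relationship $r_{ij} = U_i\cdot A_j$ can be checked on a small example. The sketch below is only an illustration: `gram_schmidt_qr` and the $3\times 2$ matrix `M` are hypothetical names that are not part of this lab, and the helper assumes the columns of its input are linearly independent.

```python
import sympy as sp

def gram_schmidt_qr(M):
    """Hypothetical helper: orthonormalize the columns of M, then form R."""
    U = []
    for j in range(M.cols):
        v = M[:, j]
        # Subtract the components of A_j along the unit vectors found so far
        for u in U:
            v = v - u.dot(M[:, j]) * u
        U.append(v / v.norm())
    Q = sp.Matrix.hstack(*U)
    # Every entry of R is a dot product: r_ij = U_i . A_j
    R = (Q.T * M).applyfunc(sp.simplify)
    return Q, R

M = sp.Matrix([[1, 1], [1, 0], [0, 1]])   # illustrative matrix, not from the lab
Q_demo, R_demo = gram_schmidt_qr(M)

print(R_demo.is_upper)                                              # entries below the diagonal vanish
print((Q_demo * R_demo - M).applyfunc(sp.simplify) == sp.zeros(3, 2))  # QR reproduces M
```

Note that $R$ is computed here from dot products alone; its triangular shape follows from the orthogonalization and is not imposed by the code.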



### QR Factorization in SymPy

The `.QRdecomposition()` method in SymPy calculates the QR decomposition for a matrix:
                                                                        
                                                             
                                                                        

In [None]:
A = sp.Matrix([[2, 3, 0, -1],[-1, 0, 2, 0],[-1, -1, 4, 2],[0, 3, -3, 2]])

B = sp.Matrix([[0],[1],[2],[5]])


A.QRdecomposition()



Notice this method also provides a factorization given a matrix with *linearly dependent* columns. 

$$
\begin{equation}
A = \left[ \begin{array}{rrr} 
1 & 3 & -1  \\ 
0 & -1 & 1  \\ 
2 & 2 & 2  \\
1 & 1 & 1  \\
1 & 0 & 2  \\
\end{array}\right]
\end{equation}
$$

In this case, the second factor $R$ is no longer a square upper triangular matrix, so it is not invertible.  However, this factorization can still be used to help us in approximating solutions to linear systems, as we will see below. 

In [None]:
A = sp.Matrix([[1, 3, -1],[0, -1, 1],[2, 2, 2],[1, 1, 1], [1, 0, 2]])

A.QRdecomposition()


## Solving a **consistent** linear system with QR Factorization:

Up until now we've been using either inverses or row-reduction to solve consistent linear systems of equations.

The orthogonalization behind the $QR$ factorization provides us another way to solve a linear system $A\mathbf{x}=\mathbf{b}$.  


### Case 1:  $A\mathbf{x}=\mathbf{b}$ is **consistent**. 

Then we can rewrite this system by substituting $A = QR$:  

$A\mathbf{x} = \mathbf{b}$

$\implies QR\mathbf{x} = \mathbf{b}$

$\implies Q^TQR\mathbf{x} = Q^T\mathbf{b}$

$\implies R\mathbf{x} = Q^T\mathbf{b}$

This is an **upper triangular system** that can be solved easily by back substitution.
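Back substitution solves for the last unknown first and then works upward, one row at a time. As a rough sketch of the idea, the helper `back_substitute` and the small system below are hypothetical (they are not part of SymPy or of this lab), and the helper assumes a square upper triangular matrix with nonzero diagonal entries.

```python
import sympy as sp

def back_substitute(T, c):
    """Hypothetical helper: solve T x = c for square upper triangular T."""
    n = T.rows
    x = sp.zeros(n, 1)
    for i in reversed(range(n)):
        # Contribution of the unknowns that have already been solved for
        known = sum((T[i, j] * x[j] for j in range(i + 1, n)), sp.Integer(0))
        x[i] = (c[i] - known) / T[i, i]
    return x

T_demo = sp.Matrix([[1, 2], [0, 3]])   # illustrative system, not from the lab
c_demo = sp.Matrix([5, 6])
x_demo = back_substitute(T_demo, c_demo)

print(x_demo.T, T_demo * x_demo == c_demo)
```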

SymPy has a built-in function `upper_triangular_solve` to solve an upper-triangular system using back-substitution.

For example, to solve

$$
\begin{equation}
\left[ \begin{array}{rrr} 
2 & 3 & 5 \\ 
0 & 1 & 2 \\ 
0 & 0 & 4  \\
\end{array}\right]X = 
\left[ \begin{array}{r} 2 \\ 3 \\ 8 \end{array} \right]
\end{equation}
$$

We can use the following code:

In [None]:

T = sp.Matrix([[2,3,5],[0,1,2],[0,0,4]])

b = sp.Matrix([[2],[3],[8]])

T.upper_triangular_solve(b)



<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 1  ###


Let's try it out on a $4\times 4$ system.  
$$
\begin{equation}
\left[ \begin{array}{rrrr} 
2 & 3 & 0 & -1 \\ 
-1 & 0 & 2 & 0 \\ 
-1 & -1 & 4 & 2 \\
0 & 3 & -3 & 2 \\
\end{array}\right]X = 
\left[ \begin{array}{r} 0 \\ 1 \\ 2 \\ 5 \end{array} \right]
\end{equation}
$$

In [None]:
A = sp.Matrix([[2, 3, 0, -1],[-1, 0, 2, 0],[-1, -1, 4, 2],[0, 3, -3, 2]])

b = sp.Matrix([[0],[1],[2],[5]])

display(A)
display(b)

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 1a  ###

i).  Multiple choice:   This system of equations is:

      A). Consistent 
      B). Inconsistent


ii).  Multiple choice:   The columns of A are 

      C). Linearly independent
      D). Linearly dependent


(Assign your answers in the cell below as either the string "A" or "B", "C" or "D", capitalized)


Tip: The hstack method joins a SymPy matrix and vector to create an augmented matrix
`sp.Matrix.hstack(A, b)`

In [None]:

answer_i = ...

answer_ii = ...


In [None]:
grader.check("q1a")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 1b  ###


Use SymPy to find the $QR$ factorization of $A$
	     


In [None]:

Q, R = ...

display(Q)

display(R)


In [None]:
grader.check("q1b")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 1c  ###


Use the $QR$ factorization of $A$ and SymPy's `upper_triangular_solve` function to solve $A\mathbf{x}=\mathbf{b}$. 
	     


In [None]:

...

x = ...

# Check your answer by making sure Ax = b

display(x)

A@x==b


In [None]:
grader.check("q1c")

In some situations, we might find that we are solving several systems such as 

$AX=B_1$, $AX=B_2$, $AX=B_3$, ..., 

that involve the same matrix but different right hand sides.  In these situations it is useful to solve the systems with a factorization such as $QR$ because the factorization does not need to be recomputed for each system.
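To illustrate reusing one factorization, here is a minimal sketch with a made-up $2\times 2$ matrix and three made-up right-hand sides (none of these values come from the lab). The factorization is computed once, and each system then costs only a multiplication by $Q^T$ and a back substitution.

```python
import sympy as sp

A_demo = sp.Matrix([[3, 0], [4, 5]])          # illustrative matrix
Q_demo, R_demo = A_demo.QRdecomposition()     # factor once

right_hand_sides = [sp.Matrix([3, 4]), sp.Matrix([0, 5]), sp.Matrix([3, 9])]

# Reuse Q_demo and R_demo for every right-hand side
solutions = [R_demo.upper_triangular_solve(Q_demo.T * rhs)
             for rhs in right_hand_sides]

for rhs, sol in zip(right_hand_sides, solutions):
    print(sol.T, A_demo * sol == rhs)
```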

### Case 2:  $A\mathbf{x}=\mathbf{b}$ is **inconsistent**.  

In this section we address the problem of inconsistent systems, and the common resolution known as the least squares solution.

In the case that $AX=B$ is inconsistent, there is no vector $X$ such that the two vectors $AX$ and $B$ are the same.  A natural idea then is to choose a vector $X$ such that $AX$ and $B$ are as close as possible.  


Recall that if the system $AX=B$ is inconsistent, the vector $B$ is not in $\mathcal{C}(A)$, the column space of $A$.  **The error vector $E=AX-B$ has minimum magnitude exactly when it is orthogonal to $\mathcal{C}(A)$**.  

Thus, instead of solving $AX=B$ directly, we project $B$ onto $\mathcal{C}(A)$ and solve $AX=\hat{B}$, where $\hat{B}$ is that projection.

### Solving Least Squares

When solving least squares problems, we seek the vector $x$ that best fits $Ax \approx b$.  
A common approach is to use the *normal equations*:


$A^T A x = A^T b$

but forming $A^T A$ squares the condition number of $A$, which can greatly magnify rounding errors in floating-point arithmetic.
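As a small worked illustration of the normal equations, the sketch below uses a made-up $3\times 2$ system (`A_demo` and `b_demo` are hypothetical and not part of this lab). It solves $A^TAx = A^Tb$ exactly with SymPy, checks that the error vector is orthogonal to the columns of $A$, and then compares condition numbers numerically with NumPy.

```python
import numpy as np
import sympy as sp

A_demo = sp.Matrix([[1, 0], [1, 1], [1, 2]])   # illustrative inconsistent system
b_demo = sp.Matrix([6, 0, 0])

# Solve the normal equations A^T A x = A^T b in exact arithmetic
x_ne = (A_demo.T * A_demo).LUsolve(A_demo.T * b_demo)

# The error vector E = Ax - b should be orthogonal to each column of A
error = A_demo * x_ne - b_demo
print(x_ne.T, (A_demo.T * error).T)

# Numerically, the condition number of A^T A is the square of that of A
A_np = np.array(A_demo.tolist(), dtype=float)
cond_A = np.linalg.cond(A_np)
cond_AtA = np.linalg.cond(A_np.T @ A_np)
print(cond_A**2, cond_AtA)
```

For this well-conditioned example the squaring is harmless; the concern arises when the condition number of $A$ is already large.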





###  QR factorization to solve least squares

Instead, we can use the **QR factorization**:

Let $A = QR$

where $Q$ has orthonormal columns and $R$ is upper triangular.  


The projection of $b$ onto the column space of $A$ is:

$$
\hat{b} = QQ^T b.
$$

The least–squares solution satisfies:

$$
Ax = \hat{b}.
$$

We can substitute $A = QR$:

$$
QRx = QQ^T b.
$$

Multiply both sides on the left by $Q^T$. Because $Q^T Q = I$, this step simplifies both sides and does not amplify errors:

$$
Q^TQRx = Q^TQQ^T b,
$$

$$
Rx = Q^Tb.
$$

This is an upper triangular system, so we can then solve for $x$ using back substitution.

This avoids forming $A^T A$ entirely and reduces the problem to solving a stable triangular system.  
Thus, the $QR$ factorization provides a more **numerically stable** method for least squares, especially when $A$ is close to rank-deficient or ill-conditioned.
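The steps above can be sketched on a small made-up system; `A_demo` and `b_demo` are hypothetical values chosen for illustration and are not taken from the questions in this lab. The sketch assumes the columns of the matrix are linearly independent, so that $R$ is square and invertible.

```python
import sympy as sp

A_demo = sp.Matrix([[1, 0], [1, 1], [1, 2]])   # illustrative overdetermined system
b_demo = sp.Matrix([6, 0, 0])

# Factor A = QR, then solve the triangular system R x = Q^T b
Q_demo, R_demo = A_demo.QRdecomposition()
x_qr = R_demo.upper_triangular_solve(Q_demo.T * b_demo).applyfunc(sp.simplify)

# The residual of a least squares solution is orthogonal to every column of A
residual = A_demo * x_qr - b_demo
print(x_qr.T, (A_demo.T * residual).T)
```

The matrix $A^TA$ is formed here only to check orthogonality of the residual; the solve itself never uses it.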






<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 2  ###


Although the method of least squares can be applied to any inconsistent system, it is usually associated with systems that have more equations than unknowns.  These systems are called overdetermined, and here is one such example.

$$
\begin{eqnarray*}
2x_1 + x_2 & = & 0 \\
2x_1 - x_2 & = & 2 \\
3x_1 + 2x_2 & = & 1 \\
5x_1 + 2x_2 & = & -2
\end{eqnarray*}
$$



## <span style='color:Red'>   Question 2a  ###


Use matrix notation to represent this system above.  That is, rewrite it as $A_2\mathbf{x} = \mathbf{b_2}$ (we're using 2's here to distinguish this from the A and b in our previous example)



i).  Multiple choice:   This system of equations is:

      A). Consistent 
      B). Inconsistent


ii).  Multiple choice:   The columns of $A_2$ are 

      C). Linearly independent
      D). Linearly dependent


(Assign your answers in the cell below as either the string "A" or "B", "C" or "D", capitalized)




In [None]:

# Define A2 to be a SymPy matrix, and b2 to be a sympy 4x1 matrix

A2 = ...

b2 = ...

...

ans_2ai = ...

ans_2aii = ...



In [None]:
grader.check("q2")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 2b  ###


Use QR factorization and SymPy's upper_triangular_solve function to find the least squares solution $\mathbf{\hat{x_2}}$ to the inconsistent system $A_2\mathbf{x}=\mathbf{b_2}$. 





In [None]:

Q2, R2 = ...

...

x2_hat = ...

display(x2_hat)



In [None]:
grader.check("q2b")

## <span style='color:Red'>   Question 3a  ###

Consider the following system:


$$
\begin{equation}
\left[ \begin{array}{rrrr} 
1 & 1 & 0 & 0 \\ 
1 & 1 & 0 & 0 \\ 
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
\end{array}\right]X = 
\left[ \begin{array}{r} -3 \\ -1 \\ 0 \\ 2 \\ 5 \\ 1 \end{array} \right]
\end{equation}
$$


Use matrix notation to represent this system above.  That is, rewrite it as $A_3\mathbf{x} = \mathbf{b_3}$ 

i).  Multiple choice:   This system of equations is:

      A). Consistent 
      B). Inconsistent


ii).  Multiple choice:   The columns of $A_3$ are 

      C). Linearly independent
      D). Linearly dependent


(Assign your answers in the cell below as either the string "A" or "B", "C" or "D", capitalized)




In [None]:

# Define A3 to be a SymPy matrix, and b3 to be a sympy 6x1 matrix

A3 = ...

b3 = ...

...

ans_3ai = ...

ans_3aii = ...



In [None]:
grader.check("q3a")

<!-- BEGIN QUESTION -->

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 3b  ###

i).  What happens when you try  to use QR factorization and SymPy's `upper_triangular_solve` function to find the least squares solution to $A_3\mathbf{x} = \mathbf{b_3}$?    Why does this occur?  Write your answer below in the cell in  Markdown


ii). For this system of equations, there is a least squares solution, but it is not unique.   Find the set of **all** least-squares solutions to $A_3\mathbf{x} = \mathbf{b_3}$.  Write your answer in parametric vector form (hint  - you will have to do the back-substitution step by hand).    Write your answer using LaTeX in the Markdown Cell below.

**Type your answer to Question 3b(i) here**:  

**Type your answer to Question 3b(ii) here**:

In [None]:

Q3, R3 = ...

display(Q3)

display(R3)

...




In [None]:
grader.check("q3b")

<!-- END QUESTION -->

## Application:  Finding the best line to fit a set of data

In the problems below we will be using the library Pandas to load in sets of data.  

## Pandas module:
***
**Pandas** is an open source $\color{red}{\text{data analysis module}}$ in Python used for storing, cleaning, wrangling, and analyzing data.   (Fun fact: It was named as a shortcut for the term "$\textbf{pan}$el  $\textbf{da}$ta", a common term for multidimensional data sets encountered in statistics and econometrics.)





First, let's import the Pandas module.  It's customary in data science to import Pandas with the alias $\texttt{pd}$.  We can then access any function in the Pandas library by prefixing its name with $\texttt{pd.}$  

In [None]:
import pandas as pd

### $\color{red}{\textbf{Pandas}}$ Data Structures




Pandas has three types of data structures: 
- **Series**: A one dimensional array with labeled indices (can be mixed data types). 
-  **DataFrame**: 2D tabular data structure with both row and column labels.  $\color{red}{\text{Rows}}$ have a specific index to access them, which can be $\color{red}{\text{any name or value}}$. The $\color{blue}{\text{columns}}$ are just $\color{blue}{\text{Pandas Series}}$. The Pandas DataFrame data structure can be seen as a spreadsheet, but it is much more flexible. 
-  **Index**:  A sequence of row/column labels


<img src="img/pandas.png" width="600" height="600">
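As a quick illustration of these three structures, here is a minimal sketch; the names `scores` and `table` and their values are made up for this example and are not used elsewhere in the lab.

```python
import pandas as pd

# A Series: one-dimensional values with labeled indices
scores = pd.Series([90, 85], index=["alice", "bob"], name="score")

# A DataFrame: a table with row labels (the index) and column labels
table = pd.DataFrame({"score": [90, 85], "year": [1, 2]},
                     index=["alice", "bob"])

# Selecting one column of a DataFrame gives back a Series
print(type(table["score"]).__name__)

# Row and column labels are both stored as Index objects
print(list(table.index), list(table.columns))

# Look up a single value by row label and column label
print(table.loc["bob", "score"])
```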

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 4  ###

The rate at which crickets chirp is related to the temperature.  If you evaluate the next cell, you will see a small dataset that tells us the rate at which a cricket chirps, in chirps per minute, and the temperature, in degrees Fahrenheit.

Run the cell below to load in the data. The data is represented in what's known as a Pandas dataframe.

In [None]:
df = pd.DataFrame({'Chirp rate': [14.7, 20.0, 18.4, 17.1, 15.5],
                   'Temperature': [69.7, 88.6, 84.3, 80.6, 75.2]})

df


You can select a column from a dataframe and convert it to a SymPy matrix as follows:


In [None]:
sp.Matrix(df['Chirp rate'])

You can select multiple columns from a dataframe and save it as a SymPy matrix as follows:

In [None]:
cols = ["Chirp rate", "Temperature"]

sp.Matrix(df[cols].values.tolist())



 `ones(n, 1)` creates an n-dimensional column vector whose entries are all one.

In [None]:
from sympy.matrices import ones

vector_ones = ones(5,1)

vector_ones

You can take this vector of ones and join it to your original matrix as follows:

In [None]:


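# Build the data matrix from the selected columns, then prepend the column of ones
data = sp.Matrix(df[cols].values.tolist())

sp.Matrix.hstack(vector_ones, data)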

Here's a scatter plot of the data.
We'll use a package called seaborn, which quickly creates scatter plots from dataframes.

In [None]:
import seaborn as sns
sns.scatterplot(df, x = "Chirp rate", y = "Temperature");

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 4a  ###


These data points look like they roughly lie on a line so let's try to model them with
$$
\beta_0 + \beta_1 C = T
$$
where $C$ is the chirp rate and $T$ is the temperature.  The 5 points in our dataset give us 5 equations for the parameters $\beta_0$ and $\beta_1$ that we can express in the form $A_4{\mathbf x} = {\mathbf b_4}$.  Construct the matrix $A_4$ and the vector ${\mathbf b_4}$ below. The matrix $A_4$ is called the "design" matrix.

**i).  Use SymPy and Pandas commands as shown above to create $A_4$ and ${\mathbf b_4}$ - don't manually enter the values by hand, as you will soon have much bigger datasets to handle.**

ii).  Use SymPy's built-in functions to find the $QR$ factorization of $A_4$. 

iii). Then use your QR factorization and SymPy's `upper_triangular_solve` function to find the least squares approximate solution $\widehat{\mathbf x}$.


In [None]:



b_4 = ...

...

A_4 = ...

display(b_4)

display(A_4)



Q4, R4 = ...

...

x4_hat = ...

display(x4_hat)



In [None]:
grader.check("q4a")

Run the next cell to plot your model $\widehat{\mathbf x}$ and the data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the original dataset
sns.scatterplot(df, x = "Chirp rate", y = "Temperature")

#To plot the least squares regression line:

# Create a vector of x values starting at the smallest input value and ending at the largest
xs = np.linspace(num=100, start=np.min(df["Chirp rate"]), stop=np.max(df["Chirp rate"]))

# Calculate the predicted output for each x input
yhats = [x4_hat[0] + x4_hat[1] * x for x in xs]

# Plot the line
plt.plot(xs, yhats, color='red', lw=4);
    


<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 4b  ###

If the chirp rate is 18 chirps per minute, what does this linear model predict for the temperature? (You should be able to calculate this using a dot product of 2 vectors). 



In [None]:


temp = ...


temp

In [None]:
grader.check("q4b")

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 5a  ###


## Baseball salaries

Here's a data set describing the performance and salaries (in thousands) of 263 baseball players during the 1986 season.  



In [None]:
df_5 = pd.read_csv('https://raw.githubusercontent.com/davidaustinm/MTH205-W20/master/data/hitters.csv')



df_5



We would like to predict a player's salary based on only a **subset** of these features. 

Set up a linear system to find the unknown parameters:
$$
\beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5 = y
$$
where the $x_i$ are **hits, homeruns, runs, rbis, and putouts**, and $y$ is the salary.


In other words, we'd like to solve the linear system $A_5 \mathbf{x} = \mathbf b_5$  using least squares.  

i).  Use SymPy and Pandas commands to create $A_5$ and ${\mathbf b_5}$ - don't manually enter the values by hand.  

ii).  Use SymPy's built-in functions to find the $QR$ factorization of $A_5$. 

iii). Then use your QR factorization and SymPy's `upper_triangular_solve` function to find the least squares approximate solution $\widehat{\mathbf x}$.


In [None]:



b_5 = ...

...

A_5 = ...




Q5, R5 = ...

...

x5_hat = ...

display(x5_hat)



In [None]:
grader.check("q5")

<!-- BEGIN QUESTION -->

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


## <span style='color:Red'>   Question 5b  ###

According to the model, if a player hits more home runs, what happens to his salary?


_Type your answer here, replacing this text._

<!-- END QUESTION -->

It turns out this isn't a great model.  We have just pulled out some of the features from the dataset and ignored many of the others.  There are statistical techniques for choosing the best features, an aspect of data science called *feature selection*.  While the features we used here may seem like the best, it turns out that some of the other features are better and there are many tools we can use to compare how "good" our model is.   There's a real art to doing this well. Take **CSCI 3022 (Intro to Data Science)** to learn more!!

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />


### Submission Instructions

Before proceeding any further, **save this notebook.**

Then run the cell below to double check that you don't have any spaces between dollar signs and text when writing LaTeX:

In [None]:
# Run this cell before you run the 'grader.export()' cell below.  
# It will search for LaTeX errors that will cause the LaTeX compiler to fail.  

import simple_latex_checker as slc

nb = slc.Nb_checker()
nb.run_check("lab05.ipynb")

After running the `grader.export()` cell provided below, **2 files will be created**: a zip file and pdf file.  You can download them using the links provided below OR by finding them in the same folder where this Jupyter notebook resides in your JupyterHub.

To receive credit on this assignment, **you must submit BOTH of these files
to their respective Gradescope portals:** 


* **Lab 5 Autograded**: Submit the zip file that is output by the `grader.export()` cell below to the Lab 5 Autograded assignment in Gradescope.

* **Lab 5 Manually Graded**: Submit your lab05.PDF to the Lab 5 Manually Graded assignment in Gradescope.  **It is your responsibility to fully review your PDF file before submitting and make sure that all your lines of code are visible and any LaTeX has correctly compiled and is fully viewable.**  **YOU MUST SELECT THE PAGES CORRESPONDING TO EACH QUESTION WHEN YOU UPLOAD TO GRADESCOPE.** If not, you will lose points.    

[TROUBLESHOOTING TIPS](https://docs.google.com/document/d/1ndr3Wj1PSF5qzlLMaBJznwh6QGeEXjd5TAJ6nf9EJvo/edit?usp=sharing)  If you are having any issues compiling your assignment, please read through these troubleshooting tips first, then post any questions on Piazza.  

**You are responsible for ensuring your submission follows our requirements. We will not be granting regrade requests nor extensions to submissions that don't follow instructions.** If you encounter any difficulties with submission, please don't hesitate to reach out to staff prior to the deadline.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

AFTER running the cell below, click on <a href='lab05.pdf' download>this link to download the PDF </a> to upload to Gradescope.  A separate link to download the zip file for Gradescope will appear once the cell below has finished running.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)