# Regression and Model Selection
By: Adrian Garcia <br>
UCSC: AM-170B

## Regression
1. Implement a linear regression algorithm using your implementation of gradient descent for the following objective/loss functions:
<br>
    a. Mean absolute error $\ell_{1}:E_{1}(f) = \sum_{k = 1}^{n}\lvert f(x_{k}) - y_{k}\rvert$,
<br>
    b. Least squared error $\ell_{2}:E_{2}(f) = \left(\sum_{k = 1}^{n}\lvert f(x_{k}) - y_{k}\rvert^{2}\right)^{\frac{1}{2}}$,
<br>
where $f(x) = mx - b$.

### Solution Road Map
1. Create **two** linear regression functions, each of which:
<br>
(a) Uses the specified loss function for gradient descent.
<br>
(b) Takes the inputs ($x$, $y$, $m_0$, $b_0$, $\delta$, epochs).
<br>
(c) Produces the outputs ($m$, $b$, error).
2. Load the data into the main program, isolate the desired columns, and **split the data into training/testing sets**.
3. Train the algorithm(s) with the specified arguments and plot the desired results.
<br>
<font color='red'>NOTE</font>: epochs -> the number of times the training data is passed through the algorithm.

### Solution Pseudo Code
```python
def algorithm(x, y, m, b, L, epochs = 1000):
    # Initialize error array
    for i in range(epochs):
        # Define model: f(x) = mx - b
        # Calculate the partial derivative of the loss function w.r.t m
        # Calculate the partial derivative of the loss function w.r.t b
        m = m - L * dldm # Update m
        b = b - L * dldb # Update b
        err[i] = sum(abs(y_pred - y)) # l1 error; use sum((y_pred - y)**2)**(1/2) for l2
    return m, b, err # Return calculated coefficients and error (array)
```
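
The pseudo code above can be fleshed out as the following minimal sketch. It is one possible implementation, not the reference solution: the function name `linear_regression_gd` and the `loss` argument are assumptions made here for illustration, and $\ell_{2}$ is taken to be the square root of the sum of squared residuals. Because both losses are non-smooth at their minimum, a fixed learning rate typically ends up oscillating in a small neighborhood of the best fit rather than landing on it exactly.

```python
import numpy as np

def linear_regression_gd(x, y, m, b, L, epochs=1000, loss="l1"):
    """Fit f(x) = m*x - b by gradient descent on the l1 or l2 error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    err = np.zeros(epochs)  # error recorded after each update
    for i in range(epochs):
        r = (m * x - b) - y  # residuals of the current model
        if loss == "l1":
            # d/dm sum|r| = sum(sign(r) * x),  d/db sum|r| = -sum(sign(r))
            dldm = np.sum(np.sign(r) * x)
            dldb = -np.sum(np.sign(r))
        elif loss == "l2":
            # d/dm sqrt(sum r^2) = sum(r * x) / norm,  d/db = -sum(r) / norm
            norm = max(np.sqrt(np.sum(r**2)), 1e-12)  # avoid dividing by zero
            dldm = np.sum(r * x) / norm
            dldb = -np.sum(r) / norm
        else:
            raise ValueError("loss must be 'l1' or 'l2'")
        m = m - L * dldm  # Update m
        b = b - L * dldb  # Update b
        r = (m * x - b) - y
        err[i] = np.sum(np.abs(r)) if loss == "l1" else np.sqrt(np.sum(r**2))
    return m, b, err
```

Note the minus sign in the derivative with respect to $b$: it comes from the model being written as $f(x) = mx - b$ rather than $mx + b$.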

### Caveats
1. Too small a learning rate -> takes too long to reach a minimum.
2. Too large a learning rate -> drastic updates/divergent behavior.
3. Gradient descent does **not** guarantee that the minimum you reach is the **global** minimum.
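
The first two caveats can be seen on a toy problem. The sketch below is an illustration chosen here, not part of the assignment: it runs gradient descent on $f(x) = x^{2}$ from $x_0 = 1$ with three learning rates. The helper name `descend` is hypothetical.

```python
def descend(delta, x0=1.0, steps=50):
    """Run gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x = x - delta * 2 * x  # each step multiplies x by (1 - 2*delta)
    return x

slow = descend(0.01)     # factor 0.98 per step: still far from the minimum at 0
good = descend(0.5)      # factor 0 per step: reaches the minimum immediately
diverged = descend(1.1)  # factor -1.2 per step: overshoots further every time
```

Here the iterate is multiplied by $1 - 2\delta$ at every step, so the method converges only when $\lvert 1 - 2\delta\rvert < 1$.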

2. Test your algorithm on the dataset almost_linear.csv (fit one set of data either TV vs sales or Radio vs sales) using at most three different learning rates. Plot $\ell_{1}$ and $\ell_{2}$ errors as a function of number of iterations of your algorithm. Plot learning rate as a function of the number of iterations in your algorithm. State values for $m$, $b$, and total error for each objective/loss function and learning rate combination.

### Optimal learning rate
For most applications, there exists an optimal learning rate for gradient descent at each iteration step; however, only a subset of those applications allows for a computationally inexpensive search for that rate. So, assuming we can (and want to) find it:

Let $f(x)$ be the function gradient descent is being applied to.
<br>
At each iteration:
$$
x_{k + 1}(\delta) = x_{k} - \delta\nabla f(x_{k})
$$
Substituting this update into $f$ gives a function of the learning rate alone:
$$
F(\delta) = f(x_{k + 1}(\delta))
$$
Thus, the optimal learning rate can be found by setting $\frac{dF}{d\delta} = 0$ and solving for $\delta$. By the chain rule,
$$
\frac{dF}{d\delta} = -\nabla f(x_{k + 1}(\delta))^{\top}\nabla f(x_{k}) = 0 \implies \delta = \dots
$$
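
As a worked instance, assume (beyond what these notes state) that $f$ is quadratic, $f(x) = \frac{1}{2}x^{\top}Ax - c^{\top}x$ with $A$ symmetric positive definite. This is a case where the condition can be solved by hand. Writing $g_{k} = \nabla f(x_{k}) = Ax_{k} - c$:
$$
\begin{aligned}
\nabla f(x_{k + 1}(\delta)) &= A(x_{k} - \delta g_{k}) - c = g_{k} - \delta A g_{k}, \\
\frac{dF}{d\delta} &= -(g_{k} - \delta A g_{k})^{\top} g_{k} = 0
\implies \delta = \frac{g_{k}^{\top} g_{k}}{g_{k}^{\top} A g_{k}}.
\end{aligned}
$$
For a general $f$ no such closed form exists, and the condition has to be solved numerically with a one-dimensional search.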

### Solution Pseudo Code
```python
def algorithm(x, y, m, b, L, epochs = 1000, opt_L = False):
    if opt_L:
        # Initialize error and learning rate array
        for i in range(epochs):
            # Find the optimal learning rate
            # Repeat previously illustrated code
    else:
        # Previously illustrated code
    return m, b, err, L # Return ... and learning rate (array or float)
```
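
As a rough sketch of the `opt_L` branch, the code below uses a loss for which the optimal rate has a closed form. This swaps in a different loss from the assignment: half the sum of squared residuals, $\frac{1}{2}\sum_{k}(mx_{k} - b - y_{k})^{2}$, which is quadratic in $(m, b)$ so that $\delta = g^{\top}g / g^{\top}X^{\top}Xg$ with $g$ the current gradient. For the $\ell_{1}$ and $\ell_{2}$ errors defined earlier, a numerical line search would be needed instead. The function name `linear_regression_opt_step` is hypothetical.

```python
import numpy as np

def linear_regression_opt_step(x, y, m, b, epochs=1000):
    """Steepest descent on 0.5 * sum((m*x - b - y)**2) with the exact step."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([x, -np.ones_like(x)])  # columns for m and b in f(x) = m*x - b
    theta = np.array([m, b], dtype=float)
    err = np.zeros(epochs)    # loss after each update
    rates = np.zeros(epochs)  # learning rate used at each update
    for i in range(epochs):
        g = X.T @ (X @ theta - y)  # gradient at the current iterate
        Xg = X @ g
        denom = Xg @ Xg            # equals g^T (X^T X) g
        if denom == 0.0:           # gradient vanished: already optimal
            err[i:] = 0.5 * np.sum((X @ theta - y) ** 2)
            break
        rates[i] = (g @ g) / denom  # optimal learning rate for this step
        theta = theta - rates[i] * g
        err[i] = 0.5 * np.sum((X @ theta - y) ** 2)
    return theta[0], theta[1], err, rates
```

Plotting `rates` against the iteration index gives the "learning rate as a function of the number of iterations" curve for this loss.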

3. Compare your results with a standard linear regression solver like sklearn.linear_model.LinearRegression. 
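
A minimal sketch of the comparison, using synthetic stand-in data instead of almost_linear.csv (the slope, intercept, and noise level below are made up for illustration). One detail to watch: `LinearRegression` fits $y = \text{coef}\cdot x + \text{intercept}$, so under the convention $f(x) = mx - b$ used here, $b$ corresponds to the negative of the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
y = 3.0 * x - 2.0 + rng.normal(scale=0.5, size=x.size)  # synthetic stand-in data

reference = LinearRegression().fit(x.reshape(-1, 1), y)  # expects a 2-D feature array
m_ref = reference.coef_[0]
b_ref = -reference.intercept_  # sign flip: the notes use m*x - b, sklearn uses +intercept
```

The values `m_ref` and `b_ref` can then be set next to the $m$ and $b$ returned by the gradient descent implementation.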

## Model Selection
1. Explore the data (https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html). Plot the distribution of median house value. Plot median house value vs total bedrooms. Plot the correlation matrix. Discuss plot results.

### Solution Pseudo Code
```python
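import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load in data (the file has no header row)
data = pd.read_csv('cal_housing.data', header = None)
# Set column names
data.columns = [...]
plt.figure()
sns.histplot() # Plot data 1: distribution of median house value
plt.figure()
sns.scatterplot() # Plot data 2: median house value vs total bedrooms
plt.yscale('log')
plt.xscale('log')
corr_matrix = data.corr() # Find correlation matrix
plt.figure()
sns.heatmap(corr_matrix) # Plot data 3: correlation matrix
plt.show()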
```

2. Explore three different linear regression models: these can include least squares, lasso, ridge, elastic net, or any other linear (or nonlinear) regression model. Perform a grid search on the regularization parameters, and use five fold cross validation to find parameter values that minimize the error on your test data for each of the three models.

### k-Fold Cross Validation
**Problem**: How can we compare different models, or the same model with different parameter(s) ;), in a fair and robust way?
<br>
<br>
**Idea**: Split the dataset into $k$ folds. Each fold serves once as the test set while the remaining $k - 1$ folds form the training set. Average the $k$ "scores" for each model (or parameter value) and compare the averages.
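
The idea can be sketched by hand with NumPy. This is only an illustration under assumptions made here: the model is ordinary least squares with an intercept, the score is the mean squared error, and the function name `k_fold_mse` is hypothetical. In practice `sklearn.model_selection` provides this functionality.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Average test MSE of ordinary least squares over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # k disjoint index sets
    scores = []
    for i in range(k):
        test = folds[i]  # fold i is held out
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A_train = np.column_stack([X[train], np.ones(len(train))])  # add intercept column
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([X[test], np.ones(len(test))])
        scores.append(np.mean((A_test @ coef - y[test]) ** 2))
    return float(np.mean(scores)), scores
```

Every sample is used for testing exactly once, so the averaged score depends far less on one lucky or unlucky split than a single train/test partition does.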

### Grid Search
**Problem**: What is the best parameter, or parameters, for a specific model?
<br>
<br>
**Idea**: Build a model for **every** candidate parameter value (a grid defined by the user), score each one, and keep the value that performs best.
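
A bare-bones version of the idea, written as a sketch under assumptions made here: the model is ridge regression in closed form without an intercept, each candidate is scored by mean squared error on a single held-out validation set, and the names `ridge_fit` and `grid_search` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def grid_search(X_train, y_train, X_val, y_val, alphas):
    """Score every candidate alpha on the validation set and keep the best."""
    errors = [np.mean((X_val @ ridge_fit(X_train, y_train, a) - y_val) ** 2)
              for a in alphas]
    best = int(np.argmin(errors))  # index of the smallest validation error
    return alphas[best], errors
```

Replacing the single validation set with a $k$-fold average for each candidate gives a cross-validated grid search.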

### Solution Pseudo Code
```python
# Import packages
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
# Define the parameter grid we wish to search (start, stop, number are user-chosen)
parameters = {'alpha': np.linspace(start, stop, number)}
Lasso = linear_model.Lasso() # Define the model we wish to test
# Define a grid search that will use 5-fold cross validation
Lasso_reg = GridSearchCV(Lasso, parameters, scoring = 'neg_mean_squared_error', cv = 5)
Lasso_reg.fit(X, Y) # Fit the grid search
print(Lasso_reg.best_params_) # Print best alpha parameter
```

2. (cont.) For each model, plot a) regularization parameter value vs. error on test data and error on training data b) regularization parameter value vs predictor coefficient values.

### Solution Pseudo Code
```python
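import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
# X_train, Y_train, X_test, Y_test: the training/testing split of the data
alphas = np.linspace(start, stop, number)
mse = []
coef = []
for a in alphas:
    model = linear_model.Lasso(alpha = a) # Create linear regression object (Lasso shown as an example)
    model.fit(X_train, Y_train) # Train the model using the training set
    Y_pred = model.predict(X_test) # Make prediction using the testing set (use X_train for the training error)
    mean = mean_squared_error(Y_test, Y_pred) # Calculate the error
    mse.append(mean) # add error to list
    coef.append(model.coef_) # add predictor coefficient to list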
```

### Solution Pseudo Code (a)
```python
plt.figure()
plt.plot(alphas, mse, label = 'Model')
# Plot config
plt.xscale('log')
plt.yscale('log')
plt.legend()
plt.show()
```

### Solution Pseudo Code (b)
```python
plt.figure()
plt.plot(alphas, coef, label = 'Model')
# Plot config
plt.legend()
plt.show()
```

3. Plot test errors for all three models for the optimal parameters chosen in question 2. Choose the model that best fits the data. Discuss the results.

### Solution Pseudo Code
```python
plt.bar(models, errors) # models: list of model names; errors: test error of each model at its optimal parameters
plt.show()
```