# Test 01 (Midterm Exam): Scikit-Learn, Classification, Regression

Please answer the following questions using Python and/or written responses in the notebook
cells below.  This test covers the material we have looked at so far in this class, including
Python and the scientific Python stack, supervised learning for classification and regression
problems, using the scikit-learn ML library framework, and linear and logistic regression.
You should fill out the cells to answer the questions, and submit your working notebook
to the correct submission folder in MyLeoOnline/D2L before the deadline for this test.
Please make sure that your notebook runs all cells cleanly from top to bottom when a run
all is performed, as you have been doing for your assignments.  Expected output is given in
many places in this notebook, and it depends on a random number seed being set and the cells being executed
in order.  So you need to ensure you are executing cells in order to get the correct results.
Please ensure you use markdown cells for any written responses you are asked to submit for this test.
Please use the standard Python PEP 8 coding style for your work, and provide Python doc comments
for any functions you create in this notebook.

As a reminder, all work on tests and assignments is required to be the sole product of the individual submitting the work.  You may not work in a group or look at other current or past student work or solutions while working on your test.  Copied work may receive a 0 grade for this midterm exam and may be subject to disciplinary actions.

**Due: Friday 10/21/2022**

Please add your name and the last 5 digits of your CWID here for my reference and in case a notebook
gets accidentally misplaced or copied while grading.

Name: Jane Student

CWID-5: (last 5 digits of cwid)

In the following cells, we import some common libraries, including a few from the scikit-learn
framework, and set some plotting and visualization defaults.  However, you may need to add in additional
imports to your notebook for your work on the questions for this test.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# By convention, we often just import the specific classes/functions
# from scikit-learn we will need to train a model and perform prediction.
# Here we include all of the classes and functions you should need for this
# assignment from the sklearn library, but there could be other methods you might
# want to try or would be useful to the way you approach the problem, so feel free
# to import others you might need or want to try

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
import statsmodels.api as sm


In [2]:
# notebook wide settings to make plots more readable and visually better to understand
np.set_printoptions(suppress=True)
plt.style.use('seaborn-darkgrid')
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
plt.rc('figure', titlesize=24)
plt.rc('legend', fontsize=14)
plt.rcParams['figure.figsize'] = (12.0, 8.0) # default figure size if not specified in plot


## Part 1: Numpy and Generation of Polynomial Function Data
---------------------------

To start this test, we are first going to create a function that can
randomly generate a set of data of a single parameter x, but where the
function or label of the parameter x is some nonlinear polynomial combination
of x.  The function will also add in noise to the randomly generated data,
to make it more difficult to find and fit a model to the data.



### NumPy Practice

But first, perform the following tasks.  You will use the following to
create the function that generates a random polynomial dataset.

In the next cell, set the `NumPy` random seed to 42, so that all of
your following work generating random numbers will get the expected result.

In [3]:
# set NumPy random seed to 42 here.

Generate a NumPy array of shape (5,) of randomly
generated numbers in the range from [-1, 1].  Use the
`uniform()` function of the `NumPy` random library to do this.  Call the
array `x`.  If you set your seed and generate it correctly, you should
get the following results:

```python
> print(x)
[-0.25091976  0.90142861  0.46398788  0.19731697 -0.68796272]

> print(x.shape)
(5,)
```
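
As a reminder of the general pattern, here is a minimal sketch of seeding NumPy's global generator and drawing uniform samples. It deliberately uses a different seed and interval than the ones asked for here, so its values will not match the expected output above. The names `demo_seed`, `samples` and `samples_again` are illustrative only.

```python
import numpy as np

# illustrative values only: a different seed and interval than this test uses
demo_seed = 7
np.random.seed(demo_seed)

# draw 4 samples uniformly from the half-open interval [0.0, 10.0)
samples = np.random.uniform(low=0.0, high=10.0, size=4)
print(samples.shape)

# re-seeding with the same value reproduces exactly the same draws
np.random.seed(demo_seed)
samples_again = np.random.uniform(low=0.0, high=10.0, size=4)
print(np.array_equal(samples, samples_again))
```

Because every draw advances the generator state, running cells out of order (or twice) changes the numbers you get, which is why the cells in this notebook need to be executed sequentially.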

In [4]:
# m is the number of samples, use m = 5 sample size for the cells in this
# question section
m = 5

# generate the array of m = 5 random numbers here, all values in the range [-1, 1]

# uncomment these, make sure you get the shape and values expected
#print(x)
#print(x.shape)

Reshape the vector called `x` so that it is a column matrix, e.g.
a matrix with 5 rows and 1 column, so a shape of (5, 1).  Make sure
to reassign the reshaped result back into `x` so you can use it in the next few cells.

```python
> print(x)
[[-0.25091976]
 [ 0.90142861]
 [ 0.46398788]
 [ 0.19731697]
 [-0.68796272]]

> print(x.shape)
(5, 1)
```
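
For reference, a small sketch of turning a flat vector into a column matrix is shown below. The array `v` and its values are made up for illustration and are unrelated to the expected output above.

```python
import numpy as np

v = np.arange(4)          # a flat vector of shape (4,)

# -1 asks NumPy to infer the number of rows, giving a (4, 1) column matrix
col = v.reshape(-1, 1)

print(v.shape)
print(col.shape)
```

Note that `reshape()` returns a new array object rather than changing `v` in place, which is why the result has to be assigned to a variable.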

In [5]:
# reshape x into a column matrix of the required shape.  Make sure to
# reassign reshaped array back into x so that variable x has the desired shape

# uncomment these and make sure you get the exact results shown
#print(x)
#print(x.shape)

Create an array called `theta` of shape `(3,)` (a vector) with
the following values.  Create a variable named `degree` which should
be determined by querying the shape of `theta` and subtracting 1 from
its size.

```python
> print(theta)
[ 3 -4  5]

> print(theta.shape)
(3,)

> print(degree)
2
```

In [6]:
# Create the array called theta with the values and shape indicated

# uncomment these and make sure you get the exact results shown
#print(theta)
#print(theta.shape)

# create a variable named degree.  Make sure that you query theta shape and subtract 1
# to determine the degree, which should be 2 here since theta is shaped (3,)

# uncomment this and make sure you get the expected result
#print(degree)

Using the scikit-learn `PolynomialFeatures` class, create a new array `X` of features from the original `x`, generating the `degree=2` polynomial features.
Make sure you include the bias term.  The resulting `X` should look like
this:

```python
> print(X)
[[ 1.         -0.25091976  0.06296073]
 [ 1.          0.90142861  0.81257354]
 [ 1.          0.46398788  0.21528476]
 [ 1.          0.19731697  0.03893399]
 [ 1.         -0.68796272  0.4732927 ]]

> print(X.shape)
(5, 3)
```
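
To see what the transformation does on values that are easy to check by hand, here is a small sketch using a made-up two-sample input (`x_demo` is an illustrative name, not part of this test). Each row becomes $[1, x, x^2]$.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# made-up column matrix with two samples of a single feature
x_demo = np.array([[2.0], [3.0]])

# include_bias=True adds the leading column of ones (the x^0 term)
poly = PolynomialFeatures(degree=2, include_bias=True)
X_demo = poly.fit_transform(x_demo)

print(X_demo)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```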

In [7]:
# create a polynomial features instance of the indicated degree

# fit/transform the single feature x column matrix into a degree 2 set of
# polynomial features

# uncomment these and make sure you get exactly the same values and shape
# expected here for the X polynomial features
#print(X)
#print(X.shape)

Perform a matrix multiplication to multiply the polynomial features `X`
times the `theta` parameters (**Hint**: NumPy `dot()` function, or overloaded
operator that performs matrix multiplication in Python 3 NumPy).
The result will be the `y` value of the
quadratic function $y = 3 - 4x + 5x^2$ for the 5 sampled `x` values you generated.
If you perform the matrix multiplication correctly,  you should get an array named `y`
with the following values:

```python
> print(y)
[4.31848268 3.45715327 2.22047225 2.40540206 8.11831439]

> print(y.shape)
(5,)
```
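
The matrix product can be checked by hand on a tiny made-up example. The sketch below assumes a feature matrix whose rows are $[1, x, x^2]$ for $x = 2$ and $x = 3$, and the same quadratic coefficients as above; the `_demo` names are illustrative only.

```python
import numpy as np

# rows are [1, x, x^2] for x = 2 and x = 3
X_demo = np.array([[1.0, 2.0, 4.0],
                   [1.0, 3.0, 9.0]])
theta_demo = np.array([3.0, -4.0, 5.0])

# a (2, 3) matrix times a (3,) vector gives a (2,) vector
y_demo = X_demo @ theta_demo

# row 0: 3 - 4*2 + 5*4 = 15, row 1: 3 - 4*3 + 5*9 = 36
print(y_demo)
print(y_demo.shape)
```

Multiplying by a 1-D `theta` of shape `(3,)` is what yields a 1-D result; a `(3, 1)` column would instead give a `(2, 1)` result.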

In [8]:
# perform matrix multiplication of X polynomial features and theta parameters

# uncomment these and make sure you get exactly the same resulting y
# labels for this quadratic function
#print(y)
#print(y.shape)

Add some gaussian (normally distributed) noise to the y labels.  Use
a mean (`mu` $\mu$) of 0.0 and a standard deviation (`sigma` $\sigma$) of 0.1 for the
generated noise.  Use the `NumPy` random library function named
`normal()` to generate this noise with the indicated mean and
standard deviation, and of the correct shape `m = 5` so `(5,)` to add
the noise to `y`.  The result should be saved back to the variable `y`.

If you generate this noise as asked for, you should now have these exact
values in your `y` variable.

```python
> print(y)
[4.34638681 3.5582048  2.16238443 2.35288508 8.06117637]

> print(y.shape)
(5,)
```
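
As a sanity check on how `normal()` behaves, the following sketch adds noise to an all-zero array and confirms that the sample mean and standard deviation land near the requested values. It uses an illustrative seed, sample size and `scale`, so it does not reproduce the expected output above.

```python
import numpy as np

np.random.seed(0)  # illustrative seed, not the one used in this test

clean = np.zeros(10000)

# loc is the mean and scale is the standard deviation of the noise
noise = np.random.normal(loc=0.0, scale=0.5, size=clean.shape)
noisy = clean + noise

print(noisy.shape)
print(noisy.mean(), noisy.std())  # close to 0.0 and 0.5 respectively
```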

In [9]:
# add in normally distributed noise to the y labels with a mean of mu=0.0
# and a standard deviation of sigma=0.1
mu = 0.0
sigma = 0.1

# uncomment these lines and make sure you are getting the expected noisy y labels now
#print(y)
#print(y.shape)

### Make Polynomial Dataset Generator Function

At this point you have all of the pieces you need to create a function
to generate a random polynomial dataset of any degree.  The
function signature is given to you here in the next cell.  Use the work
you did in the previous cells to generate a set of `x` features randomly
on the interval [-1, 1], determine the degree of the polynomial from
the `theta` array, generate the polynomial features, generate the `y`
labels and then add some gaussian noise to the `y` labels.  This function
returns the resulting randomly generated `x` and `y` arrays.

The cell after the function implementation calls your function to generate
a random dataset.  It expects that `m = 5` and the `theta` array were defined above.
It also expects that you ask for a data set with random noise where `mu = 0.0` and 
`sigma = 0.1`.  Invoke your function and generate random features `x` and
the corresponding noisy labels `y`. If you have set your seed as asked for and run
these cells sequentially, you should get the following result in `x` and `y`
after running your function:

```python
> print(x)
[[ 0.04951286]
 [-0.13610996]
 [-0.41754172]
 [ 0.22370579]
 [-0.72101228]]
> print(x.shape)
(5, 1)

> print(y)
[2.72179788 3.54626705 5.40064195 2.50196312 8.46076501]
> print(y.shape)
(5,)
```

In [10]:
# implement the required function here
def make_polynomial_dataset(m, theta, mu=0.0, sigma=1.0):
    """Make a randomly generated artificial regression dataset based on a polynomial function.  The degree of the
    polynomial is determined by the number of parameters given in the theta array.  For example a quadratic
    (squared) function of the form 3 - 4x + 5x^2 would be generated by passing in theta = [3, -4, 5].
    The m parameter controls the number of random samples of x to be generated in that interval. 
    x values are sampled randomly using a uniform distribution in the interval from [-1.0, 1.0]. 
    We generate y target labels according to the indicated polynomial function, and add in gaussian noise
    using a mean of mu and a standard deviation of sigma, indicated by additional default parameters
    to this function.
    
    Parameters
    ----------
    m - Number of random samples to artificially generate (an integer value >= 1)
    theta - terms of the polynomial to generate.  theta.size-1 indicates the degree of the polynomial
       to generate, and the parameters in theta are of the form [theta_0, theta_1, theta_2, ..., theta_n].  So a 
       theta parameter of [3, -4, 5] indicates a degree 2 polynomial with parameters y = 3 - 4x + 5x^2
    mu, sigma - terms controlling the amount of random noise added to each artificially generated
       y target label.  mu is the mean of the gaussian noise to generate and sigma is the standard
       deviation of the noise to be generated and added to the sample labels.
       
    Returns
    -------
    x,y - Returns a tuple of the randomly generated x features, and the noisy y regression labels.  Both
      will be NumPy arrays with m elements in them.  x is a column matrix of shape (m,1) while y is a vector
      with shape (m,)
    """
    # generate x samples uniformly over range -1 to 1
    
    # reshape x as a column array to perform matrix multiplication directly
    
    # determine polynomial degree
    
    # create polynomial features from original x samples

    # use theta parameters to generate y labels from sampled polynomial features
    
    # add noise to the regression labels with mean mu and standard deviation sigma
    
    # return the random features and their regression targets

In [11]:
# invoke your function here and check that the resulting x and y match the expected results

# Uncomment the following and check your generated x and y values are what are expected at this point
#print(x)
#print(x.shape)
#print(y)
#print(y.shape)

### Generate Regression Data for Part 2 and Visualize It

Next we will use the function you just created to generate a random dataset governed by the
following polynomial:

$$
y = 3 + 5x + 1x^2 - 3x^3 + 2x^4 -3x^5
$$

Generate a sample size of `m = 1000` data points this time using the `theta` parameters indicated in this cell's equation for the
5th degree polynomial.  Use a mean of 0 but a standard deviation of 1.0 this time for the noise added to these labels.

If you have implemented your function and generated the asked for polynomial, you should get the following values in
the first 5 features and labels of this noisy dataset:

```python
> print(x[:5])
[[-0.60065244]
 [ 0.02846888]
 [ 0.18482914]
 [-0.90709917]
 [ 0.2150897 ]]

> print(x.shape)
(1000, 1)

> print(y[:5])
[1.53083608 3.17284304 4.87933597 4.2069929  4.19087997]

> print(y.shape)
(1000,)
```

In [12]:
# declare the correct value for m, theta, mu and sigma here

# invoke your function to generate the degree 5 dataset

# uncomment the following and make sure you seem to have gotten the expected results so far, your values
# should exactly match these if you have set the random seed as asked for and have run the cells sequentially
# to this point
#print(x[:5])
#print(x.shape)
#print(y[:5])
#print(y.shape)

In the next cell you are to visualize this generated dataset that you will be using
in Part 2 of this test to perform some linear regressions.

- Plot the raw data as scatter plot points.
- Plot the true function as a black line.  Make sure that you use
  a different linearly spaced grid of x points when plotting the true
  function as discussed in class (e.g. you should not be using the `x` variable with the randomly generated features here).
- Label your figure axes, even though this made-up data has only the x feature and the y label.
  Also label your figure elements, e.g. use a legend to identify the noisy data points and the
  true polynomial function line.

Your figure should look similar to the following if your data has been generated correctly and you add the asked for
elements to the plot.

![Degree 5 random polynomial dataset](test-01-question-01-dataset.png)
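
One possible way to evaluate the noise-free function on an evenly spaced grid is sketched below. It uses `numpy.polynomial.polynomial.polyval`, which takes coefficients ordered from lowest to highest degree; this is an alternative to the `PolynomialFeatures` approach suggested in the code cell, not a required method. The names `theta_true`, `x_grid` and `y_true` are illustrative, and the grid is kept tiny so the values can be checked by hand.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# coefficients of 3 + 5x + x^2 - 3x^3 + 2x^4 - 3x^5, lowest degree first
theta_true = np.array([3.0, 5.0, 1.0, -3.0, 2.0, -3.0])

# 5 evenly spaced points: -1, -0.5, 0, 0.5, 1 (use many more for a smooth line)
x_grid = np.linspace(-1.0, 1.0, 5)
y_true = P.polyval(x_grid, theta_true)

# at x = -1 the value is 7, at x = 0 it is 3, and at x = 1 it is 5
print(x_grid)
print(y_true)
```

Because the grid is sorted, a line plot of `y_true` against `x_grid` draws a clean curve, whereas the randomly sampled `x` values are unordered and would produce a tangle of line segments.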

In [13]:
# plot randomly generated dataset here first as scatter plot

# create a different set of x grid values, linearly spaced from -1 to 1

# create the true function y values, probably need to use PolynomialFeatures
# again here

# plot the true function as a solid black line on your figure

# add axis labels and legend here



Finally, while we are at it, generate a second set of data named `x_test` and `y_test` this time, with `m = 1000` samples again, using
the same polynomial function and the same `mu` and `sigma` values to add noise to the label.  We will use this data to test how well
fitted models perform on data they have not been fitted with. Create your `x_test` and `y_test` in the next cell for later use.  You will get
the following results for this test data:

```python
> print(x_test[:5])
[[ 0.01482388]
 [ 0.74844505]
 [-0.01290684]
 [ 0.40451753]
 [ 0.98563368]]

> print(x_test.shape)
(1000, 1)

> print(y_test[:5])
[3.97139565 7.24251709 3.58716089 3.87101553 4.92251277]

> print(y_test.shape)
(1000,)
```

In [14]:
# create a set of test data of the same size as our training data

# uncomment these and make sure that you get the expected values for this data as well
#print(x_test[:5])
#print(x_test.shape)
#print(y_test[:5])
#print(y_test.shape)

## Part 2: Linear Regression on Polynomial Function
------------------------

For the next part of this exam you will be performing a sequence of linear regressions on the artificial data set you just generated. 
You will be using your `x` and `y` NumPy arrays to perform all of the training of your regression models in the following.  Hold off using
the `x_test` and `y_test` arrays until asked to do so.

### Fit a Linear Regression

In the next cell(s) find a best fit linear regression (line) to your artificial dataset, the `x` and `y`, fitting a line to all of the
data you have.  Please use Scikit-Learn Library objects to fit your linear model.  Report all of the following for the fitted model on
all of the data.

1. Make a plot of your noisy data with your fitted model line drawn with the data.
2. Report the intercept and slope coefficients of your fitted line.
3. Find and report the $R^2$ measure, which is the goodness of the linear regression fit to the given data.
4. Calculate and report the MSE and RMSE cost of your fitted model on all of the data.

You should get the following values for the parameters, $R^2$, MSE and RMSE if you have fitted the model
correctly and your random seed was set correctly at this point:

```
Intercept:  3.855189716230805
Slope    :  [1.91250359]
R^2      :  0.33425755500437426
MSE      :  2.504874453540251
RMSE     :  1.582679517002811
```

The fitted linear model should match the following figure when you generate it.

![Linear Model Fit to Noisy 5th Degree Dataset](test-01-question-01-linearfit.png)
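
The quantities asked for above relate to each other as in the following sketch, which fits a line to four made-up points (not the test dataset). It assumes the usual definitions: `score()` on a `LinearRegression` returns $R^2$, and RMSE is the square root of MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# four made-up samples of a single feature
x_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.1, 2.9, 5.2, 6.8])

model = LinearRegression().fit(x_demo, y_demo)
y_pred = model.predict(x_demo)

# score() returns the same R^2 value as r2_score(y_true, y_pred)
r2 = model.score(x_demo, y_demo)
mse = mean_squared_error(y_demo, y_pred)
rmse = np.sqrt(mse)

print('Intercept: ', model.intercept_)
print('Slope    : ', model.coef_)
print('R^2      : ', r2)
print('MSE      : ', mse)
print('RMSE     : ', rmse)
```

Taking the square root of the MSE by hand, as done here, avoids depending on version-specific options of `mean_squared_error`.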

In [15]:
# Create the linear regression model and fit the data using Scikit-Learn

# 1. Make a scatter plot with the fitted regression line model
# and plot the linear regression model on the data

# 2. report the intercept and slope

# 3. Report the R^2 fit determined by scikit-learn LinearRegression instance

# 4. Report the MSE and RMSE cost of the final fitted model


This model is of course not a really good fit as the true function is not a linear function that is
generating this data.  You should make a note of the $R^2$ score and the RMSE cost of this linear
model's fit to this data and compare it to later better models.

### Fit a 5th Order Regression

As you know this artificial data set is actually generated from a 5th degree polynomial with some random noise in
the data mapping x to the dependent variable y.  If we knew or suspected this was a degree 5 polynomial
function, we could fit a polynomial of that degree to the data and see how good our fit is.

For example, here I will just show you the result of using `np.polyfit()` to fit a 5th order
polynomial to our noisy data.

```python
# fit a degree 5 polynomial to the data
x = x.reshape((m,))
theta = np.polyfit(x, y, 5) 

# report the fitted parameters, polyfit returns parameters from highest to lowest,
# so flip them to have in order we have been using
theta = np.flip(theta)
print(theta)
>>> [ 3.0735744   4.92857937  1.22891594 -2.74466058  1.71305691 -3.07157513]
```
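
The coefficient ordering is easy to confirm on noise-free data. The sketch below assumes four made-up points that lie exactly on $y = 3 - 4x + 5x^2$, so `polyfit()` recovers the coefficients up to floating-point rounding.

```python
import numpy as np

# points that lie exactly on y = 3 - 4x + 5x^2
x_demo = np.array([-1.0, 0.0, 1.0, 2.0])
y_demo = 3 - 4 * x_demo + 5 * x_demo ** 2

# polyfit returns the highest degree coefficient first: approximately [5, -4, 3]
coef_high_first = np.polyfit(x_demo, y_demo, 2)

# flipping gives lowest degree first, the order used for theta: approximately [3, -4, 5]
coef_low_first = np.flip(coef_high_first)

print(coef_high_first)
print(coef_low_first)
```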

In [16]:
# if your x and y were generated correctly, you should be able to uncomment these and get the same linear
# fit from polyfit here
# fit a degree 5 polynomial to the data
#x = x.reshape((m,))
#theta = np.polyfit(x, y, 5) 

# report the fitted parameters, polyfit returns parameters from highest to lowest,
# so flip them to have in order we have been using
#theta = np.flip(theta)
#print(theta)

Recall once again the true parameters you should be using that generated your random nonlinear dataset:

$$
y = 3 + 5x + 1x^2 - 3x^3 + 2x^4 -3x^5
$$

You should find that, though of course the fit is not exact because of the added noise, the
fitted parameters match pretty well with the true function parameters used to generate the data, despite
a relatively large amount of noise added to the output labels.

Usually, however, you don't know the shape or order of the underlying function that controls your data.
But let's fit a 5th degree polynomial using Scikit-Learn `LinearRegression` and `PolynomialFeatures`.  You
should get the same fit as we just obtained with the NumPy `polyfit()` function (e.g. slope and intercept parameters should match).

In the next cell use the `PolynomialFeatures` class from `scikit-learn` to generate
all combinations of features for a `degree=5` set of polynomial features.  You do not need the bias term
when generating polynomial features to be used by a `scikit-learn` `LinearRegression` estimator, since it fits its own intercept.  Since
you generated the data using a known seed, your resulting set of features should have the following
as its first 5 rows of sample values:

```python
> print(X.shape)
(1000, 5)

> print(X[:5,:])
[[-0.60065244  0.36078335 -0.2167054   0.13016462 -0.0781837 ]
 [ 0.02846888  0.00081048  0.00002307  0.00000066  0.00000002]
 [ 0.18482914  0.03416181  0.0063141   0.00116703  0.0002157 ]
 [-0.90709917  0.82282891 -0.74638743  0.67704742 -0.61414916]
 [ 0.2150897   0.04626358  0.00995082  0.00214032  0.00046036]]
```

In [17]:
# Create a polynomial features instance of degree 5 (do not include the bias term) and fit/transform it to the raw
# x features of your dataset

# Uncomment these and check that the shape should be (1000,5) after transforming to add the
# higher order polynomial features,  and the first 5 samples should look like the following
#print(X.shape)
#print(X[:5,:])

Given this set of features, use a `scikit-learn` `LinearRegression` estimator to fit a regressor to this expanded set of
features, and report the coefficients you end up finding for your fit.  As before,
report your intercept and coefficients, the $R^2$ fit score and the MSE and RMSE of the
cost function when fit to all of the data using this regression.

In [18]:
# fit a new linear regression to the expanded feature matrix

# report the intercept and slope, R^2 score and MSE/RMSE here


Here you should find that the fitted intercept corresponds to the bias or intercept term.  The coefficients are arranged
in the reverse of the order that `polyfit()` returns, from the lowest to the highest degree, so the coefficient at index 0 is the $x^1$ term, and the
last coefficient returned from the model is the $x^5$ term. You should also find that this fit matches
the one from `polyfit()` above (up to floating-point rounding) if you performed it correctly.
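
The agreement between the two approaches can be illustrated on a smaller, made-up problem. The sketch below assumes a noisy quadratic generated with NumPy's `default_rng` (a different generator and seed than this test uses) and compares the `LinearRegression` intercept and coefficients against the flipped `polyfit()` result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# made-up noisy quadratic data, unrelated to the test dataset
rng = np.random.default_rng(1)
x_demo = rng.uniform(-1, 1, size=(50, 1))
y_demo = 1 + 2 * x_demo[:, 0] - 3 * x_demo[:, 0] ** 2 + rng.normal(0, 0.1, size=50)

# no bias column: LinearRegression fits the intercept itself
X_demo = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_demo)
model = LinearRegression().fit(X_demo, y_demo)

# intercept followed by coefficients gives lowest-to-highest ordering
sk_params = np.concatenate(([model.intercept_], model.coef_))

# polyfit wants a 1-D x and returns highest degree first, so flip it
np_params = np.flip(np.polyfit(x_demo[:, 0], y_demo, 2))

print(sk_params)
print(np_params)
print(np.allclose(sk_params, np_params))
```

Both routines solve the same least-squares problem, so any differences should be at the level of floating-point rounding.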

Also take a moment to examine the $R^2$ score and RMSE fit cost. You should find that $R^2$ is much higher now,
double what we got for the linear fit.  Also the RMSE cost should have improved quite a bit (it should be around 
1.0 for this model). Keep these values in mind as we fit the next models.

Plot this fitted model in the next cell.  Show the raw noisy data as a scatter plot, and plot both the
true function (as a red dashed line) and your `scikit-learn` fitted 5th degree linear regression (as a black solid line).
Your plot should look similar to the following figure if you have fit your data correctly and all of your data matches the
expected results so far.

![Fit of 5th order polynomial to noisy 5th degree data](test-01-question-02-5thorderfit.png)


In [19]:
# create scatter plot here of the noisy dataset

# plot the scikit-learn fit of your model using a black line

# plot the true function somehow on the graph as a red dashed line here

# make sure you label your axis and create a legend showing your plot elements here


Also check the performance of this model on the `x_test` and `y_test` held back test sets.  Report the MSE / RMSE on the test data
for this fitted model here.

In [20]:
# determine MSE / RMSE here on test data for the 5th degree fitted model


### Overfit Noisy Data using 50th Degree Polynomial

The function you used in assignment 03 to plot learning curves of a model has been copied once again for you in the next cell.  You will use it
in the following work to fit and regularize a regression model to this data.

In [21]:
# you will use this function once again to visualize the fit performance of models to your noisy data in the next few tasks
def plot_learning_curves(model, X, y):
    """Plot learning curves obtained with training the given scikit-learn model
    with progressively larger amounts of the training data X.
    
    Nothing is returned explicitly from this function, but a plot will be created
    and the resulting learning curves displayed on the plot.
    
    Parameters
    ----------
    model - A scikit-learn estimator model to be trained and evaluated.
    X - The input training data
    y - The target labels for training
    """
    # we actually split out 20% of the data solely for validation, we train on the other 80%
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    
    # keep track of history of the training and validation cost / error function
    train_errors, val_errors = [], []
    
    # train on the first m samples, for m from 5 up to (but not including) the size of the training split
    for m in range(5, len(X_train)):
        # fit/train model on the first m samples of the data
        model.fit(X_train[:m], y_train[:m])
        
        # get model predictions
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        
        # determine MSE errors and save history; the square root is taken when plotting to get RMSE
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
        
    # plot the resulting learning curve
    plt.plot(np.sqrt(train_errors), 'r-', linewidth=2, label='train')
    plt.plot(np.sqrt(val_errors), 'b-', linewidth=2, label='val')
    plt.xlabel('Training set size')
    plt.ylabel('RMSE')
    plt.legend(fontsize=18)

We first want to demonstrate an obviously overfitted model for this data.
Fit a 50th degree polynomial to the noisy data using `scikit-learn` and `PolynomialFeatures`.
You may want to start using `scikit-learn` pipelines here.  

Visualize the fit performance by plotting the learning curves for your overfit 50th degree model.

Report the intercept, coefficients, R^2 score and MSE/RMSE of the final fitted model.  Also visualize the fitted model you obtained on this data.  You should
again plot the raw data, the fitted model and the true function.  You may want to abstract this into a function, as you are doing the same thing here you
did before to create a plot, and will need to perform the same plot a few more times after this as well.
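
If you do use a pipeline, its general shape might look like the following sketch. It uses a low degree and a made-up dataset purely for illustration; the step names `poly`, `scale` and `reg` are arbitrary choices, and including a `StandardScaler` step is an assumption rather than a requirement of this task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# made-up noisy data, unrelated to the test dataset
rng = np.random.default_rng(2)
x_demo = rng.uniform(-1, 1, size=(40, 1))
y_demo = np.sin(3 * x_demo[:, 0]) + rng.normal(0, 0.1, size=40)

# each step is a (name, estimator) pair; the last step is the regressor
pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('scale', StandardScaler()),
    ('reg', LinearRegression()),
])
pipe.fit(x_demo, y_demo)

# the fitted regressor is reachable by its step name
reg = pipe.named_steps['reg']
print(reg.intercept_, reg.coef_)

# score() and predict() run the raw x through every step automatically
print(pipe.score(x_demo, y_demo))
```

Keep in mind that when a scaler sits in the pipeline, the reported coefficients apply to the scaled features, so they are not directly comparable to the `theta` values of the true function. The whole pipeline can be passed as the `model` argument of `plot_learning_curves()` together with the raw `x` column matrix.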

In [22]:
# create a pipeline here if needed for a degree 50 set of PolynomialFeatures that is then
# trained with a standard LinearRegression



In [23]:
# plot the learning curves.  You may need to change the limits of your plot, because if the data is overfitting
# the performance on the validation data may be very bad compared to on the training data.


In [24]:
# display intercept, coefficients and R^2 fit score here

# determine and display MSE/RMSE fitted cost here


In [25]:
# visualize your overfitted model again.  Show the noisy data, the fitted 50th order regression and the
# true function as before


Make a note of the RMSE you achieved for this model on all of the data it was trained with.  Now use the `x_test` and `y_test` data
to make new predictions and calculate MSE and RMSE on the test data.  Observe the RMSE you get on the test data and compare it with what you saw for the
training data.  You will write down your observations at the end of this part of the test, after fitting a model with some regularization.

In [26]:
# determine MSE/RMSE on the test data here for your overfit 50th degree model


### Regularization of Polynomial Regression

As we did in assignments for this class and in our lecture notebooks, demonstrate
fitting a model using regularization this time.  Continue using a degree 50 model, but use regularization to fight overfitting.
Use Ridge ($\ell_2$), Lasso ($\ell_1$) or a combination of both with an Elastic Net.  Your goal is
to demonstrate a model that gets about the same $R^2$ and RMSE score on all of the data as the overfit and 5th degree best fit models,
while reducing (regularizing) the fitted parameters. You should be able to obtain a model where the theta parameters are much lower,
and the plot of the fitted model will look smoother and closer to the true function.

As with the previous step, show the learning curve of the model you select to at least demonstrate it doesn't appear to be overfitting as much.

And also as with the previous steps, report the Intercept, Coefficients, $R^2$ score and MSE/RMSE of the model you demonstrate regularization on.

Finally give a plot again of the noisy data, the fitted model and the true function on a single figure so you can compare with the previous 
when you discuss it below.
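
The effect of regularization on the size of the fitted parameters can be seen in a small sketch like the one below. It assumes a made-up noisy quadratic, a degree 10 expansion and `Ridge` with `alpha=1.0`; these are illustrative choices, not recommended settings for this task, and the helper `make_poly_pipeline` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def make_poly_pipeline(regressor, degree=10):
    """Return a pipeline that expands, scales and then fits the given regressor."""
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('scale', StandardScaler()),
        ('reg', regressor),
    ])


# made-up noisy quadratic data, unrelated to the test dataset
rng = np.random.default_rng(3)
x_demo = rng.uniform(-1, 1, size=(60, 1))
y_demo = 1 + 2 * x_demo[:, 0] - 3 * x_demo[:, 0] ** 2 + rng.normal(0, 0.3, size=60)

plain = make_poly_pipeline(LinearRegression()).fit(x_demo, y_demo)
ridge = make_poly_pipeline(Ridge(alpha=1.0)).fit(x_demo, y_demo)

# compare the overall size of the two coefficient vectors
plain_norm = np.linalg.norm(plain.named_steps['reg'].coef_)
ridge_norm = np.linalg.norm(ridge.named_steps['reg'].coef_)
print(plain_norm, ridge_norm)
```

The ridge penalty shrinks the coefficient vector relative to the unregularized fit on the same features. Scaling the expanded features first matters here because the penalty treats all coefficients alike, while high powers of `x` in [-1, 1] are numerically tiny.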

In [27]:
# create a pipeline here if needed for a degree 50 set of PolynomialFeatures that is then
# trained with a regression that uses regularization to avoid/reduce overfitting

# plot the learning curves.  You may need to change the limits of your plot, because if the data is overfitting
# the performance on the validation data may be very bad compared to on the training data.


In [28]:
# display intercept, coefficients and R^2 fit score here

# determine MSE/RMSE cost of the fit on the training data


In [29]:
# visualize your fitted model with regularization again here.  Show the raw noisy data, the fitted model
# and the true function on the plot.


Again for your model with regularization, run predictions on your `x_test` dataset and calculate the MSE and RMSE.

In [30]:
# determine MSE/RMSE fitted cost here on your test data for your regularized model.


### Part 2 Discussion

In the following markdown cell discuss the following:

- Compare the $R^2$ score and RMSE reached on the training data for the 5th order fitted model, the 50th order overfit model and the model you used regularization on.
- Also look at the RMSE you get on the held back test data for each of these.  What conclusions can you make from the change of RMSE from the training to the test
  data here?


Your written answer for Part 2 should go in this markdown cell.  Use complete sentences and discuss your observations
of the trained models on this dataset.

## Part 3: Logistic Classification

In this section you will generate another artificial data set that contains 5 discrete categories, thus we
are going to perform a classification instead of a regression in this part of the assignment.

This time, however, we will use one of the `scikit-learn` dataset generator methods to generate the data set for us.  You will
then fit a logistic regression classifier to the data.

### Artificial Multiclass Dataset

There are many functions in the `sklearn.datasets` module that can be used to 
[generate artificial datasets](https://scikit-learn.org/stable/datasets/sample_generators.html)
in order to test out various machine learning methods.  The simplest one for generating
labeled datasets suitable for classification tasks is the
[`make_blobs` dataset generator](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html).

In the next cell, use `make_blobs` (it was already imported above) and generate an artificial dataset with 5 categories or
labels.  You will need to read the documentation for the function to determine the parameters
you need.  Make a dataset with 1000 samples and 2 features.  You will need to specify 5 as the
number of centers of the generated blobs, in order to generate data with 5 categories.  Use
a cluster standard deviation of 1.0.  Reset the random seed to 42 in the cell just before the one where
you call `make_blobs`, so that you will get the expected results when you generate your multiclass
dataset.
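
To show the general call shape, here is a sketch of `make_blobs` with different, smaller settings than the ones asked for. For self-containment it fixes the randomness with the `random_state` argument, whereas this test asks you to reset the global NumPy seed instead, so the generated values will not match the expected output.

```python
import numpy as np
from sklearn.datasets import make_blobs

# illustrative settings only: 12 samples, 2 features, 3 blob centers
X_demo, y_demo = make_blobs(n_samples=12, n_features=2, centers=3,
                            cluster_std=0.5, random_state=0)

print(X_demo.shape)       # (12, 2)
print(y_demo.shape)       # (12,)
print(np.unique(y_demo))  # [0 1 2]
```

The integer labels in the returned `y` identify which center each sample was drawn around, so the number of centers determines the number of classes.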

In [31]:
# make sure we reset the random seed to 42 so that you get the expected results when generating your dataset


In [32]:
# generate your multiclass dataset with 5 classes.  There should be 1000 samples in the dataset with 2
# input features.  Use a standard deviation of 1.0 for the cluster centers


If you correctly create your dataset, then you should have an `X` feature matrix of shape (1000, 2) and a
`y` label vector of shape (1000,).  The labels will be the values [0 1 2 3 4] for this artificial
multiclass dataset.  The first few items of the `X` inputs and the `y` labels should be as shown:

```python
> print(X.shape)
(1000, 2)

> print(y.shape)
(1000,)

> print(np.unique(y))
[0 1 2 3 4]

> print(X[:5])
[[ 5.02007669  2.58375543]
 [ 3.23236714  1.195353  ]
 [-6.10792848 -9.72865221]
 [ 5.19966928  3.05395041]
 [ 1.38081864  4.5933741 ]]

> print(y[:5])
[1 1 2 1 4]
```

In [33]:
# uncomment these and run to verify that you have correctly generated your multiclass classification dataset
#print(X.shape)
#print(y.shape)
#print(np.unique(y))

#print(X[:5])
#print(y[:5])

In the next cell create a scatter plot of the artificial data set and use markers or colors to indicate
the 5 classes.  You can use similar methods as we have shown previously in class lecture
notebooks, but you need to extend the concept to display the 5 categories of this dataset.

Your figure details can differ a bit, but it should look something like the following figure.  Here you are using color and/or shape to indicate each of the
5 categories of the data.  The 2 features of the data should use the x and y axis respectively in your figure.

You should find, using this seed and if you have your settings correct for `make_blobs()`, that most of the classes
are pretty well separated (this is controlled by the cluster standard deviation).
The exception is classes 1 and 4, whose centers ended up close enough that the clusters overlap quite a bit,
so these classes will be the most difficult to predict and separate for this data.
You should, for example, expect that if you generate the confusion matrices of a fitted
classifier, these two classes will have the most errors with one another.

![Scatter Plot of Multi-Class dataset](test-01-question-03-blobs.png)
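One possible way to draw each class with its own marker is to loop over the unique labels and plot one masked subset at a time.  This is only a sketch on a hypothetical 3-class demo dataset (`X_demo`, `y_demo`, and the marker list are assumptions for illustration), so you will need to extend it for the dataset in this question.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Hypothetical 3-class demo data, not the dataset for this question.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)

markers = ["o", "s", "^"]  # one marker shape per class
fig, ax = plt.subplots()
for label, marker in zip(np.unique(y_demo), markers):
    mask = y_demo == label  # boolean mask selecting the samples of this class
    ax.scatter(X_demo[mask, 0], X_demo[mask, 1], marker=marker, label=f"class {label}")

ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.legend()
```

Calling `scatter` once per class lets matplotlib assign a different color to each class automatically and gives you a legend entry for each one.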


In [34]:
# create a scatter plot of the artificial multiclass dataset and use color/shape to visualize
# the categories of the data


For this classification you are going to evaluate the goodness of the model fit by doing a train/test
split and evaluating the classification performance on the test data.  In the next cell, create `X_train`, `X_test`,
`y_train` and `y_test` arrays from your artificial multiclass dataset, using an 80%/20% train/test split.  You should
use the `scikit-learn` methods for splitting the data here that we have shown examples of and used in previous
assignments and lectures.  Use a `random_state` of 42 to make sure that you split the data in the same way
every time.
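As a refresher, the following sketch shows the general shape of a call to `train_test_split`.  The toy arrays, the 75/25 split, and `random_state=0` are assumptions chosen for illustration, not the values required here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 samples with 2 features and binary labels.
X_toy = np.arange(80).reshape(40, 2)
y_toy = np.arange(40) % 2

# test_size=0.25 holds back 25% of the samples; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)  # (30, 2) (10, 2)
print(y_tr.shape, y_te.shape)  # (30,) (10,)
```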

If you use an 80/20 split, you should end up with 800 samples in the training data, and 200 in the testing data:

```python
> print(X_train.shape)
(800, 2)

> print(X_test.shape)
(200, 2)

> print(y_train.shape)
(800,)

> print(y_test.shape)
(200,)
```

In [35]:
# perform an 80/20 percent train/test split of the artificial multiclass data here


### Fit a Multiclass Logistic Classifier

In this section you should fit/train a multiclass logistic regression model using a `scikit-learn`
`LogisticRegression` instance on your training data.  I will not tell you the exact parameters to use.
See in the next part if you can tweak the parameters to get good performance on the
test data.  Try comparing `multi_class='multinomial'` vs. `multi_class='ovr'` (one vs. rest).

In the next cell, fit your model to the training data only.

In [36]:
# Create a LogisticRegression instance and
# fit a logistic regression classifier to the training data


Using the model you just fit, show the prediction accuracy on the training data, and then on the
held-back test data, for your fitted model. (**Hint**: recall the `score()` method for
`scikit-learn` model instances, which in the case of classification models returns the accuracy of the model).
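To illustrate what `score()` reports, here is a small sketch on a hypothetical 3-class demo dataset.  The data, the default solver settings, and `max_iter=1000` are assumptions for illustration and are not a recommendation for the parameters to use in this question.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical 3-class demo data, not the dataset for this question.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)  # fit on the training portion only

# For classifiers, score() is the fraction of samples predicted correctly.
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
manual_test_acc = np.mean(clf.predict(X_te) == y_te)  # same quantity computed by hand

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```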

In [37]:
# show the prediction accuracy on the training data


In [38]:
# now show the prediction accuracy for your model on the held back test data


You should be able to get an accuracy of about 0.97 or better on the training data, and not much worse,
if at all, on the test data most of the time if your classifier is fitting correctly.

Now in the next cells, display the confusion matrices on the training data and on the test data for
your logistic classifier.  As a hint, the `confusion_matrix()` function from `scikit-learn` can do this for you
for both train and test data, if you have the labels and the predictions for each of these.
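As a reminder of how to read the result, here is a tiny sketch using hand-made labels (the `y_true` and `y_pred` lists are hypothetical).  Rows correspond to the true class and columns to the predicted class, so off-diagonal entries count the mistakes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for 6 samples and 3 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]
```

For example, the entry in row 0, column 1 shows that one sample of class 0 was mistaken for class 1.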

You might want to confirm that classes 1 and 4 are the ones your classifier confuses most often.

In [39]:
# display training data confusion matrix here


In [40]:
# display test data confusion matrix here


### Visualize Decision Boundaries

Using examples from class and your fitted logistic regression, visualize
the decision boundaries that were learned by your multi-class logistic
regression on the training data.  You should use contour maps for this visualization.
Your figure should look similar to (though it doesn't have to exactly
reproduce) the following example of the learned decision boundaries:

![Multi-class Dataset Decision Boundaries](test-01-question-03-decisionboundaries.png)


Try to visualize the decision boundary
of your fitted logistic classifier here.  Since this is a multiclass classifier, it is somewhat difficult to find the
decision boundary lines for each of the individual classifiers being used.  So the easiest approach is to use
the `predict()` method to form a mesh/grid of prediction values covering the 2 features, and make a contour plot
of the resulting areas. I have given examples of doing this before, but again you will need to expand this
to the multiclass case of 5 classes here.  You should see that, since this is a basic logistic regression,
the model fits linear decision boundaries as best it can to make its classification decisions.
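The mesh/grid approach can be sketched roughly as follows.  This is a minimal illustration on a hypothetical 3-class demo dataset; the grid resolution of 200 points per axis, the 1-unit margin, and all variable names are assumptions, and you will need to adapt it to your own fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Hypothetical 3-class demo data and model, not the ones for this question.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
n_classes = 3

# Build a grid of points covering both features, with a small margin.
x_min, x_max = X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1
y_min, y_max = X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

# Predict a class for every grid point, then reshape back to the grid.
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = clf.predict(grid_points).reshape(xx.shape)

# Filled contours with one band per class label, and the data on top.
fig, ax = plt.subplots()
ax.contourf(xx, yy, Z, levels=np.arange(-0.5, n_classes, 1.0), alpha=0.3)
ax.scatter(X_demo[:, 0], X_demo[:, 1], c=y_demo, edgecolor="k", s=15)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
```

Placing the contour levels halfway between the integer labels gives each predicted class its own filled region, so the edges between regions trace the learned decision boundaries.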

Put your visualization in the next cell of the fitted decision boundaries of your classifier.

In [41]:
# display resulting decision boundaries of the fitted classification model using contour plot function here