# Unit 6: Regression and Classification
---------------------------------------

One way to categorize machine learning problems is to split them into *regression* and *classification*. Most engineers are familiar with basic regression analysis, where we seek to predict a continuous value by defining a function in terms of one or more input variables. An example of regression analysis, which we will look at below, is finding the best fit line for a series of data points. In a classification problem, the output is typically a discrete value. Classification algorithms are wide-ranging and can be used for yes/no predictions or to select the most representative class out of a set of options. For example, a classification algorithm may be *trained* on many medical images that are labeled by experts as "cancer" or "not cancer". The features of these images that lead to a good prediction are learned during the training period and can be used to identify cancer in future patients. This is an example of a two-class (binary) problem, where the predicted result is that the new sample belongs, or does not belong, to the class. In contrast, handwriting recognition is an example of a multi-class problem; each handwritten character may belong to any of the alphanumeric characters or punctuation marks. 

In this unit, we will just scratch the surface of regression and classification techniques using the *scikit-learn* Python package. This package makes it easy to get started, but the effective application of these (and more sophisticated) machine learning algorithms does require some theoretical understanding of how they work. Often, data scientists will experiment with multiple algorithms, and with different parameter settings for each, to identify the best options for a given problem. They will also estimate the error for each algorithm using test data that is purposely excluded from the training set.

**After completing this unit, you should be able to:**

- Understand the theory that underpins least-squares regression
- Compute the best fit polynomial for a dataset, using linear algebra
- Perform a linear regression analysis using the *scikit-learn* package
- Understand the purpose of a machine learning classification problem
- Perform a basic two-class classification using the *scikit-learn* package
- Understand and execute a *k-fold* cross-validation to test a classification algorithm

## 6.1. Regression

The basic concept of regression analysis should already be familiar to engineers, and you have probably fit a linear function to a dataset in a spreadsheet. Observe the data shown below. We may want to find the function $y=f(x)$ that best fits this dataset, so that we can use that function to predict other values. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

# number of values to generate
n = 30

# x values
x = np.arange(0, n, 1)

# y=2x+5 + random noise
y = 2*x + 5 + 10*(np.random.random(n)-0.5)

# plot x, y data
fig, ax = plt.subplots()
ax.scatter(x, y)

### 6.1.2. Linear least-squares regression

A simple and common approach to regression analysis is to find a linear function that approximates the data based on the available input variables. The process for fitting this function can be derived from the fundamentals of linear algebra, as we will show below. I think it can be helpful to understand *why* this process works, but following the math is not actually necessary to use the Python machine learning libraries that we'll cover later on.

We will use $\hat{y}$ to represent our predicted value of $y$, so $\hat{y}=f(x)$. For *least-squares* regression, we define the *cost function* in terms of the squared Euclidean norm (two-norm) of the error between each corresponding actual and predicted value in the vectors $y$ and $\hat{y}$. The *cost function* is the number that we're trying to minimize to improve the fit of the data.

$$\Vert y-\hat{y} \Vert_2^2$$

The Euclidean norm of a vector $v$ can also be written as the vector product $\left(v^Tv\right)^{0.5}$. If you recall from linear algebra, this means that we multiply each element in the array by itself, add up the results, and take the square root. When we square the Euclidean norm, we cancel out the square root.

$$\Vert v \Vert_2^2=\sum v_i^2$$
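
As a quick numerical check of this identity, the sketch below computes the squared norm three ways. The vector `v` is a made-up example, not data from this unit.

```python
import numpy as np

# hypothetical example vector, chosen only for illustration
v = np.array([3.0, -4.0, 12.0])

# three equivalent ways to compute the squared Euclidean norm
norm_sq_from_norm = np.linalg.norm(v)**2   # square of the two-norm
norm_sq_from_dot = v @ v                   # vector product v^T v
norm_sq_from_sum = np.sum(v**2)            # sum of the squared elements

print(norm_sq_from_norm, norm_sq_from_dot, norm_sq_from_sum)
```

All three values should agree to within floating-point rounding.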

Now, the equation of a line is $\hat{y}=mx+b$. We can find the values of $m$ and $b$ that minimize the cost function. For consistency with some literature, we'll call the coefficients the *weights* and represent these as a vector, $w$. Using this notation, the equation of the line becomes:

$$\hat{y}=w_0x^0+w_1x^1$$

$w_0$ is the y-intercept (multiplied by $x^0$, or 1) and $w_1$ is the slope. For our regression, we have many experimental observations, which make up the rows in the matrix, $X$, below:

$$\begin{bmatrix}
| \\
\hat{y} \\
|
\end{bmatrix} = 
\begin{bmatrix}
| & | \\
x^0 & x^1 \\
| & | 
\end{bmatrix}
\begin{bmatrix}
w_0 \\
w_1
\end{bmatrix}$$

When we multiply each row of $X$ by the weight vector, $w$, we get an estimate of $\hat{y}$. This can be substituted into the cost function. 

$$\Vert y-\hat{y} \Vert_2^2= \Vert y-Xw \Vert_2^2$$

We want to find the optimum values for $w$ which minimize this error function. 

$$\min_w \Vert y-Xw \Vert_2^2$$

From calculus, we know that the global minimum of a convex function occurs where the derivative is equal to 0. Because this is a vector equation, we use the gradient operator.

$$\nabla_w \Vert y-Xw \Vert_2^2 = 0$$

$$\nabla_w \left[\left( y-Xw \right)^T \left(y-Xw \right) \right] = 0$$
$$\nabla_w \left[y^Ty - y^TXw - w^TX^Ty + w^TX^TXw \right] = 0$$
$$-X^Ty -X^Ty + 2X^TXw = 0$$
$$X^TXw = X^Ty$$
$$w=(X^TX)^{-1}X^Ty$$
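
If the gradient step feels like a leap, it can be checked numerically. The sketch below compares the gradient from the derivation, $-2X^Ty+2X^TXw$, against a finite-difference estimate. The data and the names `X_demo`, `y_demo`, `cost`, and `analytic_gradient` are hypothetical and exist only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical small problem: 20 observations, intercept and slope columns
x_demo = np.arange(20, dtype=float)
X_demo = np.vstack((np.ones(20), x_demo)).T
y_demo = 2*x_demo + 5 + rng.normal(0, 1, 20)

def cost(w):
    """Squared two-norm of the residual, ||y - Xw||^2."""
    r = y_demo - X_demo @ w
    return r @ r

def analytic_gradient(w):
    """Gradient from the derivation: -2 X^T y + 2 X^T X w."""
    return -2*X_demo.T @ y_demo + 2*X_demo.T @ X_demo @ w

# compare against a central finite difference at an arbitrary point
w_test = np.array([1.0, 1.0])
h = 1e-6
numeric_gradient = np.array([
    (cost(w_test + h*e_k) - cost(w_test - h*e_k)) / (2*h)
    for e_k in np.eye(2)
])

# the gradient should vanish at the solution of X^T X w = X^T y
w_star = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

print(analytic_gradient(w_test), numeric_gradient)
print(analytic_gradient(w_star))
```

The two gradients at `w_test` should match closely, and the gradient at `w_star` should be zero to within rounding error.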

After a bit of math, we find that the best weights (coefficients) for a least-squares regression are simply the result of this matrix equation! Let's apply this formula to our data to calculate the values for $w$ that minimize the squared error.

In [None]:
# create a matrix with ones in the first column, and our x values in the second column
# if you're not confident in what values are in X, add a cell and print out the contents
X = np.vstack((np.ones(n), x)).T

# reshape the y values into a column vector
y_col = y.reshape(-1, 1)

# compute the weights using the equation derived above
# the np.linalg.inv function computes the matrix inverse
w = np.linalg.inv(X.T@X)@X.T@y_col
y_hat = X@w

# plot the actual data and regression line
fig, ax = plt.subplots()
ax.scatter(x, y, label='Actual Data')
ax.plot(x, y_hat, label=r'$\hat{y}$')

# add the equation and legend
# raw strings (r'...') keep the LaTeX backslash from being read as an escape
ax.text(0, 45, rf'$\hat{{y}}$={w[1, 0]:0.2f}x+{w[0, 0]:0.2f}')
ax.legend()


Notice that the resulting values for the slope and intercept are close to what we used to generate the data. To visualize what we have done, let's plot out the error as a function of $w$. We see in the following plot that the error forms a parabolic bowl. The best fit line occurs when we select the slope and intercept that correspond to the bottom of the bowl, where error is minimized.

In [None]:
# number of values that we will use to map out the slope, intercept
n_weightvals = 500

# array of possible slope values
w1_arr = np.linspace(0, 4, n_weightvals)

# array of possible y-intercept values
w0_arr = np.linspace(-25, 35, n_weightvals)

# meshgrid allows for 3d plotting
# this forms two 2d matrices
w1_mat, w0_mat = np.meshgrid(w1_arr, w0_arr)
error = np.zeros(w1_mat.shape)

for i in range(n_weightvals):
    for j in range(n_weightvals):
        # residual vector for this (slope, intercept) pair
        e = y - (w1_mat[i, j]*x + w0_mat[i, j])
        # sum of squared residuals, as a scalar
        error[i, j] = e@e

fig, ax = plt.subplots()

# plot contour lines for the error function, clipped to a max of 5000
ax.contour(w1_mat, w0_mat, np.clip(error, 0, 5000))

ax.set_xlabel('Slope ($w_1$)')
ax.set_ylabel('Intercept ($w_0$)')

# plot the point calculated by our regression formula
ax.scatter(w[1, 0], w[0, 0], c='black')
ax.annotate(r'Best fit at $\nabla_w \Vert y-\hat{y} \Vert_2^2=0$', xy=(w[1, 0], w[0, 0]), xytext=(2.5, 20),
            arrowprops={'facecolor': 'black', 'width': 2})

ax.text(0.1, -19, r'Contours show increasing error in $\hat{y}$')
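
The explicit inverse used in this section mirrors the derivation, but it is not the only way to get the weights. As a hedged aside, the sketch below shows two NumPy alternatives that avoid forming the inverse: `np.linalg.solve` applied to $X^TXw=X^Ty$, and `np.linalg.lstsq` applied directly to $X$ and $y$. The data is regenerated here under hypothetical names (`x_demo`, `y_demo`, `X_demo`) so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical data, generated the same way as the example in this section
n_demo = 30
x_demo = np.arange(n_demo, dtype=float)
y_demo = 2*x_demo + 5 + 10*(rng.random(n_demo) - 0.5)
X_demo = np.vstack((np.ones(n_demo), x_demo)).T

# formula from the derivation, using an explicit inverse
w_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# solve X^T X w = X^T y without forming the inverse
w_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# let numpy solve the least-squares problem directly from X and y
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(w_inv, w_solve, w_lstsq)
```

For a small, well-conditioned problem like this one, all three approaches should give essentially the same weights. The solver-based versions are generally preferred when $X^TX$ is large or close to singular.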

### 6.1.3. Leveraging the [`scikit-learn`](https://scikit-learn.org/stable/index.html) package

As we have seen in previous units, there are often Python packages available to make our life easier. The `scikit-learn` package includes many useful models for regression and classification. For a simple linear model like this, the time savings is minor. But, as you get to more complicated models, it is much easier to use the pre-built models versus building your own.

We will import individual classes from the `sklearn` library. For this example, we'll use the `LinearRegression` class. We first need to create a `LinearRegression` object. By default, the model will try to calculate an intercept term. However, we've already included that term in our $X$ matrix, so we set `fit_intercept=False`.

The regression model has a method called `LinearRegression.fit()` that will calculate the weight vector, just as we have previously. Once the model has been fitted, we can use the `LinearRegression.predict()` method to apply the weights and generate a vector of $\hat{y}$ values.

If you want to print out the weight vector, or use it in additional calculations, it is available as the attribute `LinearRegression.coef_`.

In [None]:
from sklearn.linear_model import LinearRegression

# create the LinearRegression object from scikit-learn
# we set fit_intercept=False because we already have an intercept term in the X matrix
reg = LinearRegression(fit_intercept=False)

# execute the regression operation
reg.fit(X, y)
y_hat = reg.predict(X)

# plot the actual data and regression line
fig, ax = plt.subplots()
ax.scatter(x, y, label='Actual Data')
ax.plot(x, y_hat, label=r'$\hat{y}$')

# add the equation and legend
# these are stored in an array labeled .coef_ in the regression object
ax.text(0.5, 45, rf'$\hat{{y}}$={reg.coef_[1]:0.2f}x+{reg.coef_[0]:0.2f}')
ax.text(0.5, 41, f'$R^2$={reg.score(X, y):0.2f}')
ax.legend()
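
For comparison, here is a minimal sketch of the other option: leave the column of ones out of $X$ and let `scikit-learn` fit the intercept. In that case the intercept is stored in `intercept_` and only the slope appears in `coef_`. The data and variable names below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# hypothetical data following y = 2x + 5 plus noise
x_demo = np.arange(30, dtype=float)
y_demo = 2*x_demo + 5 + 10*(rng.random(30) - 0.5)

# option 1: column of ones in X, no separate intercept
X_with_ones = np.vstack((np.ones(30), x_demo)).T
reg_a = LinearRegression(fit_intercept=False).fit(X_with_ones, y_demo)

# option 2: only the x column, let scikit-learn fit the intercept
reg_b = LinearRegression(fit_intercept=True).fit(x_demo.reshape(-1, 1), y_demo)

print('option 1 (intercept, slope):', reg_a.coef_)
print('option 2 (intercept, slope):', reg_b.intercept_, reg_b.coef_[0])
```

Both options should report the same intercept and slope, so the choice is mostly a matter of how you prefer to build the $X$ matrix.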

### 6.1.4. Multi-variate linear regression

When we derived the formula for linear regression, we imagined a simple equation of a line $y=mx+b$. However, nothing in our derivation limited us to a single independent variable. A generic linear function with $n$ input variables can be written as:

$$\hat{y}_i=w_0x_{i0}+w_1x_{i1}+\dots+w_nx_{in}$$

When we stack each observed datapoint, $i$, we get the matrix form:

$$\begin{bmatrix}
| \\
\hat{y} \\
|
\end{bmatrix} = 
\begin{bmatrix}
| & | & | & | \\
x_0 & x_1 & ... & x_n \\
| & | & | & |
\end{bmatrix}
\begin{bmatrix}
| \\
w \\
|
\end{bmatrix}$$

There is no difference between this and our original derivation, so our result, $w=(X^TX)^{-1}X^Ty$, will work for a *feature* matrix $X$ with any number of columns. To further expand on this, we can include polynomial terms as columns in the $X$ matrix. These values: $x^2$, $x^3$, ... would be calculated in advance and stacked into the matrix $X$. Again, the same basic regression formula applies.

$$\begin{bmatrix}
| \\
\hat{y} \\
|
\end{bmatrix} = 
\begin{bmatrix}
| & | & | & | \\
x^0 & x^1 & x^2 & x^3\\
| & | & | & |
\end{bmatrix}
\begin{bmatrix}
| \\
w \\
|
\end{bmatrix}$$

We could generate this polynomial matrix directly using `numpy`, but `scikit-learn` has a class available to generate polynomial expansions of a feature. In the example below, we have data on the tensile modulus of various polyethylene grades as a function of the polymer density. It appears to follow a parabolic shape, so we might try adding a squared term to the regression. To do this, create a `PolynomialFeatures` object. The parameter that you pass should be the maximum order of the polynomial (in this case, 2). Then, call the `PolynomialFeatures.fit_transform(x)` method to generate a matrix that has (in this case) $\begin{bmatrix} x^0 &  x^1 & x^2 \end{bmatrix}$ as the columns.

The first 10 rows are printed out so that you can see the matrix that was generated.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# load the array of polyethylene modulus (col 1) versus density (col 0)
pe_data = np.loadtxt('../../data/density_modulus.csv', delimiter=',')

# sort the data in increasing x order, so that it plots correctly
# returns the row index values, in the order that will sort the matrix
sort_order = pe_data[:, 0].argsort()

# selects the rows in the sort order that we computed above
pe_data_sorted = pe_data[sort_order, :]

pe_x = pe_data_sorted[:, 0]
pe_y = pe_data_sorted[:, 1]

# scatter plot the data
fig, ax = plt.subplots()
ax.scatter(pe_x, pe_y)

ax.set_xlabel('Density (g/cm$^3$)')
ax.set_ylabel('Tensile Modulus (MPa)')

# create an object to transform the x as a 2nd order polynomial (squared)
poly = PolynomialFeatures(2)

# create the polynomial terms from our x array
# x.reshape(-1, 1) creates a column vector from the 1d array
pe_X = poly.fit_transform(pe_x.reshape(-1, 1))

# print the first 10 rows. columns are: x**0, x**1, x**2
pe_X[:10, :]
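
As mentioned above, the same matrix could be generated directly in `numpy`. One way to do this, shown as a sketch with made-up density values in `x_demo`, is the `np.vander` function with `increasing=True`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# hypothetical density values, used only to compare the two approaches
x_demo = np.array([0.90, 0.92, 0.94, 0.96])

# scikit-learn: columns are x**0, x**1, x**2
X_sklearn = PolynomialFeatures(2).fit_transform(x_demo.reshape(-1, 1))

# numpy: a Vandermonde matrix with increasing powers gives the same columns
X_numpy = np.vander(x_demo, N=3, increasing=True)

print(X_sklearn)
print(np.allclose(X_sklearn, X_numpy))
```

For a single input feature the two results should be identical. `PolynomialFeatures` becomes more convenient when there are several input columns, because it also generates the cross terms.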

Once we have the feature matrix, $X$, the linear regression is identical to what was done previously.

In [None]:
# sklearn LinearRegression object
reg = LinearRegression(fit_intercept=False)

# train the regression model, based on the supplied data
reg.fit(pe_X, pe_y)

# compute predicted values, based on the original X matrix
pe_y_hat = reg.predict(pe_X)

# plot the raw data and regression fit
fig, ax = plt.subplots()
ax.scatter(pe_x, pe_y)
ax.plot(pe_x, pe_y_hat)

ax.set_xlabel('Density (g/cm$^3$)')
ax.set_ylabel('Tensile Modulus (MPa)')

# display the equation on the axis
equation_text = ' '.join([f'{a:+0.0f}$x^{{{i}}}$' for i, a in enumerate(reg.coef_)])
ax.text(0.88, 820, f'y={equation_text}')
ax.text(0.88, 720, f'$R^2$={reg.score(pe_X, pe_y):0.3f}')

Regression analysis works best when the data is evenly distributed. In the example above, we see that there is a cluster of data in the 0.91-0.93 g/cm$^3$ density range. This causes the regression process to weight this region more heavily than the ends, which are more sparsely populated. One method to improve the performance could be to use stratified sampling, which selects the same number of points from different ranges in the data.

Additionally, we might think to improve the fit by adding additional polynomial terms. This is easily done by changing the order in the `PolynomialFeatures()` object. 

In [None]:
# create polynomial features up to a power of 6
poly = PolynomialFeatures(6)
pe_X = poly.fit_transform(pe_x.reshape(-1, 1))

# create a separate x range for the predictions
predict_x = np.linspace(0.88, 0.96, 100)
# the transformer was already fit above, so we only need transform() here
predict_X = poly.transform(predict_x.reshape(-1, 1))

# train the regression model, based on the supplied data
reg.fit(pe_X, pe_y)

# compute predicted values over the separate prediction range
pe_y_hat = reg.predict(predict_X)

# plot the raw data and regression fit
fig, ax = plt.subplots()
ax.scatter(pe_x, pe_y)
ax.plot(predict_x, pe_y_hat)

ax.set_xlabel('Density (g/cm$^3$)')
ax.set_ylabel('Tensile Modulus (MPa)')

# display the equation on the axis
equation_text = ' '.join([f'{a:+0.1e}$x^{{{i}}}$' for i, a in enumerate(reg.coef_)])
ax.text(0.88, 820, f'y={equation_text}', fontsize='x-small')
ax.text(0.88, 720, f'$R^2$={reg.score(pe_X, pe_y):0.3f}')

As the number of terms increases, the $R^2$ will continue to improve. However, we risk *overfitting* the data. This means that we may be fitting the noise in the data, which will increase error when we compare to data that is not part of the training set. High-order polynomials will also have unusual behavior at the edges of the data set.
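
One way to see overfitting is to hold back some of the data and compare the $R^2$ on the training points with the $R^2$ on the held-out points. The sketch below does this for a 2nd and a 12th order polynomial. It uses synthetic data rather than the polyethylene dataset, and the names (`x_demo`, `scores`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)

# hypothetical noisy quadratic data (not the polyethylene dataset)
x_demo = rng.uniform(-1, 1, 40)
y_demo = 1 + 2*x_demo + 3*x_demo**2 + rng.normal(0, 0.5, 40)

# hold back 30% of the points as a test set
x_train, x_test, y_train, y_test = train_test_split(
    x_demo.reshape(-1, 1), y_demo, test_size=0.3, random_state=0
)

scores = {}
for degree in (2, 12):
    # fit the polynomial expansion on the training data only
    poly_demo = PolynomialFeatures(degree)
    X_train = poly_demo.fit_transform(x_train)
    X_test = poly_demo.transform(x_test)

    reg_demo = LinearRegression(fit_intercept=False).fit(X_train, y_train)
    scores[degree] = (reg_demo.score(X_train, y_train),
                      reg_demo.score(X_test, y_test))
    print(f'degree {degree}: train R^2={scores[degree][0]:0.3f}, '
          f'test R^2={scores[degree][1]:0.3f}')
```

The training $R^2$ cannot get worse as terms are added. The test $R^2$ has no such guarantee, and for a high-order fit it will often (though not always) be noticeably lower than the training value.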

### 6.1.5. Fitting exponential trends

Another common technique for non-linear data is to look at the data on a log or semilog scale. When we view the same data on a semilog scale, we observe that the trend looks pretty linear.

In [None]:
# plot the raw data and regression fit
fig, ax = plt.subplots()
ax.scatter(pe_x, pe_y)

# put the y axis on a log scale
ax.semilogy()

ax.set_xlabel('Density (g/cm$^3$)')
ax.set_ylabel('Tensile Modulus (MPa)')

Knowing this, we can set up the regression to predict $\ln y$. Then, we can transform the data back by taking $e^{\ln \hat{y}}$. This leads to a smooth curve that appears to fit the values well. We would need more data to test the quality of fit at the higher end of the density spectrum.

In [None]:
# sklearn LinearRegression object, including the intercept
reg = LinearRegression(fit_intercept=True)

# train the regression model, based on the supplied data
pe_y_log = np.log(pe_y)
reg.fit(pe_x.reshape(-1, 1), pe_y_log)

# compute predicted values of ln(y) over the prediction range, then undo the log
y_log_hat = reg.predict(predict_x.reshape(-1, 1))
y_hat = np.exp(y_log_hat)

# plot the raw data and regression fit
fig, ax = plt.subplots()

ax.scatter(pe_x, pe_y)
ax.plot(predict_x, y_hat)

ax.set_xlabel('Density (g/cm$^3$)')
ax.set_ylabel('Tensile Modulus (MPa)')
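
Fitting a line to $\ln y$ is equivalent to fitting the exponential model $y=Ae^{kx}$, because $\ln y=\ln A+kx$. The sketch below recovers $A$ and $k$ from the intercept and slope of the log fit. It uses synthetic data with known parameters (`A_true`, `k_true`), which are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)

# hypothetical exponential data y = A*exp(k*x) with multiplicative noise
A_true, k_true = 3.0, 1.5
x_demo = np.linspace(0, 2, 50)
y_demo = A_true*np.exp(k_true*x_demo)*np.exp(rng.normal(0, 0.05, 50))

# fit a line to ln(y): ln(y) = ln(A) + k*x
reg_demo = LinearRegression(fit_intercept=True)
reg_demo.fit(x_demo.reshape(-1, 1), np.log(y_demo))

# recover the exponential parameters from the intercept and slope
A_fit = np.exp(reg_demo.intercept_)
k_fit = reg_demo.coef_[0]

print(f'A={A_fit:0.2f}, k={k_fit:0.2f}')
```

Keep in mind that this approach minimizes the squared error in $\ln y$, not in $y$, so it weights relative errors rather than absolute errors.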

## 6.2. Classification

The types of techniques we use for regression analysis can be applied to the problem of classification. Consider the example below, where we have points representing polymer films that are either BOPET (gray) or BOPP (green). Numerically, these are represented by a $1$ if the film belongs to the BOPET class or $-1$ if it belongs to the BOPP class. A machine learning classifier is trained to distinguish between these categories, much like the regression model that we fit previously.

In the example below, we plot out the Tensile Modulus (MPa) and Dart Impact Energy (J) for each film grade. The color represents the film type. We could train a machine learning model to classify a new film as either BOPP or BOPET based on its measured properties. The methods that we'll show will allow for even more features (columns in the $X$ matrix) to be used, but it gets difficult to visualize higher dimensionality in a simple plot.

In [None]:
# load the raw data into an ndarray, the first row is headings, so we skip it
raw_data = np.loadtxt('../../data/film_classification.csv', delimiter=',', skiprows=1)

# the first two columns contain the dart impact and modulus data that we want
film_X = raw_data[:, :2]

# the last column contains the 1=BOPET, -1=BOPP class indicator
film_class = raw_data[:, 4]

# plot the data
fig, ax = plt.subplots()

# scatter plot, using color to indicate class (BOPET/BOPP)
ax.scatter(film_X[:, 0], film_X[:, 1], c=film_class, cmap='Dark2')

ax.set_xlabel(r'$x_0$: Dart Impact Energy (J)')
ax.set_ylabel(r'$x_1$: Tensile Modulus (MPa)')

### 6.2.1. Linear classification

The linear regression formula that we derived previously can be applied to classification as well. For classification, our $y$ vector contains a $1$ if the film belongs to the BOPET class and $-1$ if it belongs to the BOPP class. Using our regression formula, we compute the weights, $w$. The decision boundary is the set of points, each with a feature row vector $x^T$, where the prediction is zero:

$$x^Tw=\hat{y}=0$$ 

For the $X$ matrix that we have defined for this problem, this is equivalent to the equation $w_0x_0+w_1x_1+w_2=0$, so we can rearrange the equation to plot the decision boundary in terms of $\left(x_0,x_1\right)$. 

$$x_1=-\frac{w_0}{w_1}x_0-\frac{w_2}{w_1}$$

To determine if a new film will be classified as BOPET or BOPP, we compute $\hat{y}$ using the weights that we have determined in the training step. If $\hat{y}>0$, we classify the new point as "1" or BOPET. If $\hat{y}<0$, the point is classified as "-1" or BOPP. These values correspond to the new point being on one side or the other of the decision boundary.

In [None]:
# add a column of ones to allow for a non-zero y-intercept
film_X_lm = np.hstack((film_X, np.ones(len(film_X)).reshape(-1, 1)))

# execute the training
w = np.linalg.inv(film_X_lm.T@film_X_lm)@film_X_lm.T@film_class

m = -w[0]/w[1]
b = -w[2]/w[1]

boundary_x0 = np.linspace(0.2, 1.4, 10)
boundary_x1 = m*boundary_x0 + b

# plot the data
fig, ax = plt.subplots()

# scatter plot, using color to indicate class (BOPET/BOPP)
ax.scatter(film_X[:, 0], film_X[:, 1], c=film_class, cmap='Dark2')
ax.plot(boundary_x0, boundary_x1, ls='--')

ax.set_xlabel(r'$x_0$: Dart Impact Energy (J)')
ax.set_ylabel(r'$x_1$: Tensile Modulus (MPa)')


In the linear classifier above, we see that there are some points that are misclassified. Some error percentage is to be expected with a machine learning classifier. One common issue with a linear least-squares classifier is that it can be very sensitive to outliers, even outliers that are properly classified. This is because the algorithm minimizes squared error, so far-off points have a significant effect.
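
To put a number on the misclassification, we can apply the rule described above (classify by the sign of $\hat{y}$) and count how many points land on the correct side. The sketch below does this on synthetic clusters, not the film data. The names `X_lm`, `labels`, and `w_demo` are hypothetical, and `np.linalg.solve` is used in place of the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(5)

# hypothetical two-feature data: two clusters labeled +1 and -1
X_pos = rng.normal([2.0, 2.0], 0.5, size=(50, 2))
X_neg = rng.normal([-2.0, -2.0], 0.5, size=(50, 2))
X_feat = np.vstack((X_pos, X_neg))
labels = np.hstack((np.ones(50), -np.ones(50)))

# append a column of ones, matching the layout used for film_X_lm
X_lm = np.hstack((X_feat, np.ones((100, 1))))

# least-squares weights, then classify by the sign of y_hat
w_demo = np.linalg.solve(X_lm.T @ X_lm, X_lm.T @ labels)
y_hat_demo = X_lm @ w_demo
predicted = np.where(y_hat_demo > 0, 1, -1)

# fraction of training points on the correct side of the boundary
accuracy = np.mean(predicted == labels)
print(f'training accuracy: {accuracy:0.2f}')
```

This is the accuracy on the training data, which tends to be optimistic compared with the accuracy on new data.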

### 6.2.2. Support vector machines

The *support vector machine* (SVM) classifier reduces the effect of outliers. It is sometimes referred to as a maximum-margin classifier, because the algorithm maximizes the distance between the decision boundary and the points classified on either side. Only points that are misclassified, or that fall inside the margin, are penalized in the cost function, so correctly-classified extreme values have no impact on the boundary. The *support vectors* in the name refer to the data points that effectively set the decision boundary. We won't go into the math behind this algorithm (or the neural network, which follows) here. These would be covered in a university machine learning course, such as ECE 532 at UW-Madison.

Many of the more sophisticated machine learning algorithms are most effective with some preprocessing of the data to put the different features (columns of $X$) on the same scale. One simple approach would be min-max scaling, which maps each column into the range [0, 1]. The *scikit-learn* package contains several options. In the SVM example below, we apply the [`sklearn.preprocessing.StandardScaler`](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing), which shifts and scales each feature such that $\bar{x}_i=0$ and $s_{x_i}=1$. 

In [None]:
from sklearn import svm
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.preprocessing import StandardScaler

# scale the features using sklearn
x_scaler = StandardScaler().fit(film_X)
film_X_scaled = x_scaler.transform(film_X)

# create the classifier object
clf_svm = svm.SVC()

# train the classifier, using the scaled X matrix
clf_svm.fit(film_X_scaled, film_class)

# scatter plot, using color to indicate class (BOPET/BOPP)
# note that we are plotting x0, x1 in the scaled coordinates
fig, ax = plt.subplots()

ax.scatter(film_X_scaled[:, 0], film_X_scaled[:, 1], c=film_class, cmap='Dark2')

ax.set_xlabel(r'$x_0$: Scaled Dart Impact Energy')
ax.set_ylabel(r'$x_1$: Scaled Tensile Modulus')

# use the sklearn function to display the decision boundary on the axis
DecisionBoundaryDisplay.from_estimator(
    clf_svm, film_X_scaled, response_method='predict',
    alpha=0.2, ax=ax
)
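
To see what the scaling step does, the sketch below applies both `StandardScaler` and `MinMaxScaler` to synthetic features with very different magnitudes and prints the resulting column statistics. The data in `X_demo` is made up and only loosely mimics the ranges of the film properties.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(6)

# hypothetical features on very different scales (roughly J and MPa)
X_demo = np.column_stack((
    rng.uniform(0.2, 1.4, 100),     # stand-in for dart impact energy
    rng.uniform(1500, 4500, 100),   # stand-in for tensile modulus
))

# standard scaling: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X_demo)
print(X_std.mean(axis=0), X_std.std(axis=0))

# min-max scaling: each column is mapped onto the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_demo)
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

Either scaler puts both columns on a comparable footing, so that the larger numerical values of one feature do not dominate the distance calculations inside the classifier.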

The trained classifier can be used to classify a new film, based on its measured properties. After our classifier is trained, we test a new film and measure a dart impact energy of 0.8 J and a modulus of 3500 MPa. So, is it BOPET or BOPP? We create a row vector with our data, and apply the preprocessing scaler that we created previously. Then, we call the `predict()` function for our trained classifier to predict BOPET (1) or BOPP (-1). In this example we predict and plot a single value, but this code could be modified to predict multiple values at the same time.

In [None]:
# new film that we would like to classify
new_x = np.array([0.8, 3500]).reshape(1, -1)

# scale the new point
new_x_scaled = x_scaler.transform(new_x)

# predict the value, using our SVM classifier
new_x_prediction = clf_svm.predict(new_x_scaled)

# scatter plot, using color to indicate class (BOPET/BOPP)
# note that we are plotting x0, x1 in the scaled coordinates
fig, ax = plt.subplots()

ax.scatter(film_X_scaled[:, 0], film_X_scaled[:, 1], c=film_class, cmap='Dark2')

ax.set_xlabel(r'$x_0$: Scaled Dart Impact Energy')
ax.set_ylabel(r'$x_1$: Scaled Tensile Modulus')

# use the sklearn function to display the decision boundary on the axis
DecisionBoundaryDisplay.from_estimator(
    clf_svm, film_X_scaled, response_method='predict',
    alpha=0.2, ax=ax
)

# plot the new point as an x
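# vmin/vmax pin the color scale to the class values, so the color matches the points above
ax.scatter(new_x_scaled[:, 0], new_x_scaled[:, 1], c=new_x_prediction, cmap='Dark2',
           vmin=-1, vmax=1, marker='x')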

ax.annotate(r'New Point: Classified as BOPP', xy=new_x_scaled[0, :], xytext=[1, 1.5],  
            arrowprops={'facecolor': 'black', 'width': 2})
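
Predicting several films at once only requires stacking the new measurements as rows of a 2D array. The sketch below shows the pattern on synthetic training data. The names `scaler_demo`, `clf_demo`, and `new_points` are hypothetical stand-ins for the scaler, classifier, and measurements used above.

```python
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# hypothetical training data: two clusters labeled +1 and -1
X_train = np.vstack((
    rng.normal([0.5, 4000], [0.1, 300], size=(40, 2)),
    rng.normal([1.0, 2000], [0.1, 300], size=(40, 2)),
))
y_train = np.hstack((np.ones(40), -np.ones(40)))

# fit the scaler and classifier on the training data
scaler_demo = StandardScaler().fit(X_train)
clf_demo = svm.SVC().fit(scaler_demo.transform(X_train), y_train)

# each row is one new sample: [feature 0, feature 1]
new_points = np.array([
    [0.5, 4100],
    [1.1, 1900],
    [0.6, 3800],
])

# scale with the same scaler, then predict all rows in one call
predictions = clf_demo.predict(scaler_demo.transform(new_points))
print(predictions)
```

The result is an array with one predicted class per row of `new_points`. The important detail is that the new points are transformed with the scaler that was fit on the training data, not with a newly fitted one.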

### 6.2.3. Deep learning (neural networks)

Another popular category of classifiers is the *neural network*. This is really a family of classifiers that share a similar architecture. These are highly flexible algorithms, but they do require a) large training data sets to avoid fitting noise, and b) experimentation to determine the most effective network architecture. 

The architecture that we will test here is the *multilayer perceptron* network, referred to as the [`MLPClassifier`](https://scikit-learn.org/stable/modules/neural_networks_supervised.html) in *scikit-learn*. In this architecture, we must define the number of *neurons* in each layer of the network. The greater the number of layers, and number of neurons, the more flexible the network will be. However, this can also lead to overfitting, especially for small datasets. Here, the network is shown with 3 hidden layers, each with 5 neurons. Notice the non-linearity in the decision boundary.

In [None]:
from sklearn.neural_network import MLPClassifier

# set up the MLPClassifier object
clf_nn = MLPClassifier(solver='lbfgs', alpha=1e-5,
                       hidden_layer_sizes=(5, 5, 5), 
                       random_state=1)

# fit the scaled data
clf_nn.fit(film_X_scaled, film_class)

# scatter plot, using color to indicate class (BOPET/BOPP)
fig, ax = plt.subplots()

ax.scatter(film_X_scaled[:, 0], film_X_scaled[:, 1], c=film_class, cmap='Dark2')

ax.set_xlabel(r'$x_0$: Scaled Dart Impact Energy')
ax.set_ylabel(r'$x_1$: Scaled Tensile Modulus')

# use the sklearn function to display the decision boundary on the axis
DecisionBoundaryDisplay.from_estimator(
    clf_nn, film_X_scaled, response_method='predict',
    alpha=0.2, ax=ax
)

The trained neural network can be used to predict new values, as we did with the previous SVM classifier. Try to modify the code in the previous cell, using the `predict()` function, to classify a new point.

## 6.3. Testing the performance of machine learning algorithms

A sophisticated classifier, like a neural network, can be designed to fit a training data set extremely well. But, testing an algorithm with the same data that was used to train it does not provide a good measure of its accuracy. As with regression, overfitting noise in the data is a risk as the algorithm gets to be more complex. This can improve accuracy with the training data, while reducing accuracy with new, untrained data.

One common method of evaluating a classifier is known as *k-fold cross-validation*. The available data is partitioned into $k$ subsets, or *folds*. Then, the algorithm is trained and tested $k$ times. For each test, $k-1$ of the folds are used to train the classifier and the remaining fold is held back as the test set. Once the algorithm is trained, the accuracy is tested with the hold-out data. This process repeats until each fold has been used once for testing. The score (accuracy or error rate) from each test is averaged to provide a final score.
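
The procedure described above can be sketched as an explicit loop. This version uses `sklearn.model_selection.KFold` to generate the row indices for each fold and synthetic data in place of the film dataset. The variable names are hypothetical.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import KFold

rng = np.random.default_rng(8)

# hypothetical, already-scaled data: two overlapping clusters labeled +1 and -1
X_demo = np.vstack((
    rng.normal([1.0, 1.0], 1.0, size=(60, 2)),
    rng.normal([-1.0, -1.0], 1.0, size=(60, 2)),
))
y_demo = np.hstack((np.ones(60), -np.ones(60)))

# shuffle before splitting, because the rows are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_accuracy = []
for train_idx, test_idx in kfold.split(X_demo):
    # train a fresh classifier on k-1 folds
    clf_demo = svm.SVC().fit(X_demo[train_idx], y_demo[train_idx])

    # score it on the fold that was held out
    fold_accuracy.append(clf_demo.score(X_demo[test_idx], y_demo[test_idx]))

print(fold_accuracy)
print(np.mean(fold_accuracy))
```

Each entry in `fold_accuracy` comes from data that the classifier did not see during that round of training, and the mean is the cross-validated score.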

You could code this yourself, but the [`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/cross_validation.html) function can take care of this work for you. This function takes the untrained classifier, the $X$ matrix, and the class/$y$ vector. By default, this function performs a 5-fold validation. In the example below, we see the accuracy for the SVM classifier is about 98%, compared to 97% for the neural network. It is possible that we could improve the design of these classifiers and/or add new features to further improve their accuracy.

In [None]:
from sklearn.model_selection import cross_validate
    
# test the SVM classifier built previously
acc_svm = cross_validate(clf_svm, film_X_scaled, film_class)['test_score'].mean()

# test the neural network classifier built previously
acc_nn = cross_validate(clf_nn, film_X_scaled, film_class)['test_score'].mean()

acc_svm, acc_nn
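
One refinement worth knowing about: in the cell above, the features were scaled using all of the data before the folds were split, so each test fold has a small influence on the scaler. A common way to avoid this is to wrap the scaler and classifier in a `scikit-learn` pipeline, so that the scaler is refit on the training folds of each split. The sketch below shows the pattern with synthetic, unscaled data under hypothetical names.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)

# hypothetical unscaled data: two clusters labeled +1 and -1
X_demo = np.vstack((
    rng.normal([0.5, 4000], [0.15, 500], size=(60, 2)),
    rng.normal([1.0, 2000], [0.15, 500], size=(60, 2)),
))
y_demo = np.hstack((np.ones(60), -np.ones(60)))

# the scaler is refit on the training folds of each split
model = make_pipeline(StandardScaler(), svm.SVC())

results = cross_validate(model, X_demo, y_demo, cv=5)
print(results['test_score'])
print(results['test_score'].mean())
```

For a simple scaler and a dataset of this size the difference in score is usually small, but the pipeline pattern keeps the test folds fully separate from every step of the training.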


--------------
## Next Steps:

1. Complete the [Unit 6 Problems](./unit06-solutions.ipynb) to test your understanding
2. Advance to [Unit 7](../07-advanced-plotting/unit07-lesson.ipynb) when you're ready for the next step