<img src="images/kiksmeisedwengougent.png" alt="Banner" width="1100">

<div>
    <font color=#690027 markdown="1">   
<h1>ALGORITHM FOR REGRESSION WITH DATA ON THE IRIS VIRGINICA</h1>    </font>
</div>

<div class="alert alert-box alert-success">
In the previous notebook, you determined the regression line with the built-in functions of the SciPy module. In this notebook, you will see how a <em>machine learning</em> system manages to find a <b>best fitting line</b> for a given collection of points. <br>The algorithm starts with a randomly chosen straight line. The algorithm adjusts the coefficients in the equation of this straight line, based on the given data, until finally the <b>regression line</b> is obtained.<br>In this notebook, the algorithm is explained for those who want to know more.</div>

The Iris dataset was published in 1936 by the British statistician Ronald Fisher in 'The Use of Multiple Measurements in Taxonomic Problems' [1][2].<br>The dataset covers three species of iris (*Iris setosa*, *Iris virginica*, and *Iris versicolor*), with 50 samples of each species. Fisher could distinguish the species from each other based on four characteristics: the length and width of the sepals and petals.

<img src="images/kelkbladkroonblad.jpg" alt="Drawing" width="400"/> <br>
<center>Figure 1: Sepals and petals.</center>

In this notebook, you only use the data on the length of the sepals and petals of the *Iris virginica*.

### Importing the necessary modules

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

from matplotlib import animation
from IPython.display import HTML

<div>
    <font color=#690027 markdown="1">   
<h2>1. The data of the <em>Iris virginica</em></h2>    </font>
</div>

<center><img src="images/irisvirginica.jpg" alt="Drawing" width="203"/></center><br>
<center>Figure 2: <em>Iris virginica</em> [3].</center>

Read in the dataset with the `pandas` module.

In [None]:
# read in dataset
virginica = pd.read_csv("data/virginica.csv")

Look at the data. This can be done very simply by entering the name of the table. The lengths of some sepals and some petals are displayed, and the number of samples is easy to read off.

In [None]:
# display dataset in table
virginica

The relationship between the length of the sepal and the length of the petal is studied. <br> To this end, the length of the petal is plotted as a function of the length of the sepal, so the length of the petal goes on the y-axis and the length of the sepal on the x-axis.

<div class="alert alert-box alert-info">
For the machine learning system, the <em>length of the sepals</em> will serve as <b>input</b> and the <em>length of the petals</em> as <b>output</b>.</div>

In [None]:
x = virginica["length of sepal"]       # use column name as index
y = virginica["petal length"]

We convert the data into NumPy arrays.

In [None]:
x = np.array(x)
y = np.array(y)

<div>
    <font color=#690027 markdown="1">   
<h2>2. Formatting the data correctly</h2>    </font>
</div>

You standardize the data and display it in a scatter plot.

<div>
    <font color=#690027 markdown="1">   
<h3>2.1 Standardizing the data</h3>    </font>
</div>

To standardize, we switch to the Z-scores of the characteristics.

<div class="alert alert-block alert-warning">
More explanation about the importance of standardization can be found in the notebook 'Standardize'.</div>

In [None]:
x = (x - np.mean(x)) / np.std(x)
y = (y - np.mean(y)) / np.std(y)
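
What standardizing does can be checked on a few made-up values. In the sketch below, the array `lengths` is hypothetical and not taken from `virginica.csv`; after the transformation, the mean is 0 and the standard deviation is 1, up to rounding.

```python
import numpy as np

# hypothetical sepal lengths in cm, for illustration only
lengths = np.array([6.3, 5.8, 7.1, 6.5, 7.6])

# Z-scores: subtract the mean, divide by the (population) standard deviation
z = (lengths - np.mean(lengths)) / np.std(lengths)

print(np.mean(z))   # very close to 0
print(np.std(z))    # very close to 1
```

Note that `np.std()` divides by the number of points by default (population standard deviation), which is the convention this sketch assumes.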

<div>
    <font color=#690027 markdown="1">   
<h3>2.2 Displaying the standardized data in a scatter plot</h3>    </font>
</div>

In [None]:
# Petal length vs. sepal length
# length of sepal goes on x-axis, length of petal goes on y-axis

plt.figure(figsize=(10,8))    # to get a larger graph, so that points are more spread out
# select range so that it is suitable for larger and smaller leaves
plt.xlim(x.min()-2, x.max()+3)
plt.ylim(y.min()-2, y.max()+3)

plt.scatter(x, y, color="blue", marker="o")              # scatter plot

plt.title("Iris virginica standardized")
plt.xlabel("length of sepal")                            # xlabel gives description on x-axis
plt.ylabel("petal length")                               # ylabel gives description on y-axis

plt.show()

<div class="alert alert-block alert-warning">
In the previous notebook, you determined a straight line as a regression line based on linear regression, and you showed this straight line together with the scatter plot that visualizes the data.</div>

<div>
    <font color=#690027 markdown="1">   
<h2>3. The algorithm behind the regression line</h2>    </font>
</div>

<div>
    <font color=#690027 markdown="1">   
<h3>3.1 Structure of the algorithm</h3>    </font>
</div>

Such a regression line is sought with an algorithm. In the previous notebook, you used the algorithm provided by the function `curve_fit()` of the SciPy module. Here you see how such an algorithm is structured.

<div class="alert alert-box alert-info">
To find a line that fits well with the given data, the ML system starts from a randomly chosen line. This is done by randomly choosing the slope and the y-intercept of this line.<br>The system is <em>trained</em> with the training set (the inputs and the corresponding outputs): for each point of the training set, it is examined how much the y-value on the provisional straight line deviates from the given y-value. The coefficients in the equation of the straight line are adjusted so that the mean squared deviation over the entire dataset becomes minimal. <br>The entire training set is run through a number of times. Each such pass is called an <em>epoch</em>. The system <em>learns</em> during these <em>epochs</em>.</div>

The same standardized data x and y are still being used.

In [None]:
# training set with input x and output y
print(x, y)

The system must be able to calculate the **mean squared deviation** of the data points from a given straight line.<br>To this end, the **residual** $y-\hat{y}$ is calculated for each point. Here $y$ is the given y-value and $\hat{y}$ is the **predicted value**, i.e., the value you obtain by substituting the given x-value into the equation of the straight line.<br> The squares of the residuals are summed. This total divided by the number of data points is the **error** sought.
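Written as a formula, this description reads as follows. The index $i$, the number of points $n$ and the symbol $E$ for the error are notation introduced here for illustration.

```latex
E = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```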

In [None]:
def gka(b, a, x, y):
    """Calculate mean squared deviation of points from the straight line y = a x + b."""
    total_dev = 0
    n = len(x)            # number of points
    y_line = a * x + b    # y-values on the given straight line

    # sum of squared deviations over all points
    for i in range(n):
        total_dev += (y[i] - y_line[i])**2

    return total_dev / n

As an example, you can calculate the mean squared deviation with respect to the x-axis, i.e. the line y = 0:

In [None]:
# mean squared deviation of the training data compared to the line y = 0
error = gka(0, 0, x, y)
print(error)
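
The same quantity can also be computed with NumPy array operations instead of a loop. The sketch below uses a hypothetical helper `mse_vectorized` and made-up sample values. It also suggests why the error with respect to the line y = 0 comes out as 1 for standardized data: the mean of the squared Z-scores equals their variance.

```python
import numpy as np

def mse_vectorized(b, a, x, y):
    """Mean squared deviation of the points (x, y) from the line y = a x + b."""
    return np.mean((y - (a * x + b)) ** 2)

# hypothetical measurements, for illustration only
raw_x = np.array([6.3, 5.8, 7.1, 6.5, 7.6])
raw_y = np.array([6.0, 5.1, 5.9, 5.8, 6.6])

# standardize in the same way as in this notebook
x_std = (raw_x - np.mean(raw_x)) / np.std(raw_x)
y_std = (raw_y - np.mean(raw_y)) / np.std(raw_y)

print(mse_vectorized(0, 0, x_std, y_std))   # very close to 1
```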

<div class="alert alert-box alert-info">
The ML system starts from a random straight line with equation <em>y = mx + q</em>. At the start of the training, <em>m</em> and <em>q</em> are chosen randomly. The number of <em>epochs</em> and the <em>learning rate</em> $\eta$ are determined.</div>

The algorithm will determine the coefficients of the straight line in such a way that the error is minimized. It does this using the method ***gradient descent***.<br>After each *epoch*, the coefficients are adjusted, depending on the values of the partial derivatives and the *learning rate*.
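For reference, the partial derivatives and the update rule can be written out as below. The index $i$, the number of points $n$ and the symbol $E$ for the error are notation assumed here; the line is $y = mx + q$ and $\eta$ is the learning rate.

```latex
E(m, q) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + q) \bigr)^2

\frac{\partial E}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl( y_i - (m x_i + q) \bigr)
\qquad
\frac{\partial E}{\partial q} = -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + q) \bigr)

m \leftarrow m - \eta \, \frac{\partial E}{\partial m}
\qquad
q \leftarrow q - \eta \, \frac{\partial E}{\partial q}
```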

In [None]:
def gradient_descent(q, m, x, y, eta):
    """Adjustment of parameters q and m after completed epoch with learning rate eta."""

    n = len(x)
    y_current = m * x + q      # line found at this point in the process
    derivative_m = 0           # declare and initialize partial derivative with respect to m
    derivative_q = 0           # declare and initialize partial derivative with respect to q

    # calculation of partial derivatives
    for i in range(n):
        derivative_m += - (2/n) * x[i] * (y[i] - y_current[i])
        derivative_q += - (2/n) * (y[i] - y_current[i])

    # adjust values of m and q
    m = m - eta * derivative_m
    q = q - eta * derivative_q

    # return modified values of m and q
    return q, m

<div class="alert alert-block alert-warning">
Also see the notebook 'Gradient descent' from the learning path 'Advanced Deep Learning'.</div>
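The update of one epoch can also be sketched without an explicit loop over the points. The function name `gradient_step` and the demo points are assumptions made for this illustration. The points lie exactly on the line y = 2x + 1, so repeated updates should approach q = 1 and m = 2.

```python
import numpy as np

def gradient_step(q, m, x, y, eta):
    """One gradient descent update of intercept q and slope m (vectorized sketch)."""
    residuals = y - (m * x + q)
    derivative_m = -2 * np.mean(x * residuals)   # partial derivative with respect to m
    derivative_q = -2 * np.mean(residuals)       # partial derivative with respect to q
    return q - eta * derivative_q, m - eta * derivative_m

# hypothetical points that lie exactly on the line y = 2x + 1
x_demo = np.array([-1.0, 0.0, 1.0, 2.0])
y_demo = 2 * x_demo + 1

q, m = 0.0, 0.0
for _ in range(2000):
    q, m = gradient_step(q, m, x_demo, y_demo, eta=0.05)

print(round(q, 3), round(m, 3))   # close to 1.0 and 2.0
```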

<div>
    <font color=#690027 markdown="1">   
<h3>3.2 Testing the gradient descent algorithm for multiple epochs</h3>    </font>
</div>

Take 0 as the initial value for $m$ and for $q$. Perform gradient descent for 3000 epochs with a learning rate of 0.01, showing the adjusted values of $m$ and $q$ and the error after each *epoch*.

In [None]:
# testing algorithm
q = 0
m = 0
eta = 0.01

for j in range(3000):
    error = gka(q, m, x, y)                    # calculate mean squared deviation after each epoch
    print(q, m, error)                         # display values of q, m and error after each epoch
    q, m = gradient_descent(q, m, x, y, eta)   # adjust values of q and m after each epoch

print("The line intersects the y-axis at: %.3f" % q)
print("The line has slope: %.3f" % m)
print("Mean squared deviation of the line with respect to the data: %.2f" % error)

In the example, you can see that the number of epochs helps determine the accuracy of the regression line. The line found after, for example, 50 epochs is still very far from the intended regression line. Also observe the progression of the error: as long as it keeps decreasing, it has not yet been minimized, and the system still *underfits*.<br>You also see that 3000 epochs are more than needed: from a certain point on, the error no longer decreases and the values are no longer adjusted. That means that the minimum has been reached.<br>
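For standardized data, the line that the algorithm is heading for can be checked independently: the least-squares slope equals the Pearson correlation coefficient, the intercept is 0, and the minimum mean squared deviation is 1 minus the square of that coefficient. The sketch below checks this with `np.polyfit` and `np.corrcoef` on illustrative values, which are assumptions and not read from `virginica.csv`.

```python
import numpy as np

# illustrative measurements in cm (assumed values, for demonstration only)
sepal = np.array([6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3])
petal = np.array([6.0, 5.1, 5.9, 5.6, 5.8, 6.6, 4.5, 6.3])

# standardize in the same way as in this notebook
xs = (sepal - np.mean(sepal)) / np.std(sepal)
ys = (petal - np.mean(petal)) / np.std(petal)

slope, intercept = np.polyfit(xs, ys, 1)   # least-squares line of degree 1
r = np.corrcoef(xs, ys)[0, 1]              # Pearson correlation coefficient

print(slope, r)        # the two values agree up to rounding
print(intercept)       # very close to 0
print(1 - r**2)        # minimum mean squared deviation for standardized data
```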

### Assignment 3.2.1
Adjust the *learning rate*, the initial values of *m* and *q*, or the number of *epochs* and see what effect this has.

<div>
    <font color=#690027 markdown="1">   
<h3>3.3 How do the error and the position of the line change during the process?</h3>    </font>
</div>

The process to determine the regression line given data (x,y) depends on the starting values of m and q, the *learning rate* (eta), and the number of times the data are run through (*epochs*). <br>To view the evolution of the position of the line and the size of the error, the values of m, q and the error must be stored after each epoch.<br>Three lists are created for this purpose, which are supplemented after each epoch.

In [None]:
def gradient_descent_process(x, y, q, m, eta, epochs):
    """Go through the process and gradually make lists of q, m, and error."""
    list_error = [gka(q, m, x, y)]      # declare and initialize error list
    list_q = [q]                        # declare and initialize list of q's
    list_m = [m]                        # declare and initialize list of slopes

    # fill lists for each epoch
    for i in range(epochs):
        q, m = gradient_descent(q, m, x, y, eta)    # adjusted parameters after epoch
        error = gka(q, m, x, y)                     # error after epoch
        list_q.append(q)                            # add adjusted q
        list_m.append(m)                            # add adjusted m
        list_error.append(error)                    # add this error

    return [list_q, list_m, list_error]

Run this algorithm for chosen *m*, *q*, *epochs* and *learning rate*:

In [None]:
# initialization of m and q
m = 0
q = 0

# setting the number of epochs and the learning rate eta
eta = 0.01
epochs = 500

# run the linear regression algorithm for the chosen m, q, eta and epochs
list_q, list_m, list_error = gradient_descent_process(x, y, q, m, eta, epochs)

# regression line
print("Y-intercept: %.3f" % list_q[-1])
print("Slope: %.3f" % list_m[-1])

# mean squared deviation of the regression line
print("Minimized error: %.2f" % list_error[-1])

Create an animation for this process:

In [None]:
# all lines
xcoord = np.linspace(-4, 4, 67)

ycoord = []
for j in range(epochs):
    y_r = list_m[j] * xcoord + list_q[j]         # calculate y-values of all x's from xcoord for the respective line
    ycoord.append(y_r)
ycoord = np.array(ycoord)    # type casting

# initialize plot window
fig, ax = plt.subplots()
line, = ax.plot(xcoord, ycoord[0], color="green")   # plot the line

plt.scatter(x, y, color="blue", marker="o")         # scatter plot
ax.axis([x.min()-2, x.max()+3, y.min()-2, y.max()+3])  # range of axes

plt.title("Iris virginica standardized")
plt.xlabel("length of sepal")      # xlabel gives description on x-axis
plt.ylabel("petal length")         # ylabel gives description on y-axis

def animate(i):
    line.set_ydata(ycoord[i])    # update the straight line
    return line,

plt.close()  # close the temporary plot window, only the animation is needed

anim = animation.FuncAnimation(
    fig, animate, repeat=False, frames=len(ycoord))

HTML(anim.to_jshtml())

In [None]:
# graph of the evolution of the error
plt.figure(figsize=(10,8))

plt.plot(list_error)

plt.xlabel("epoch")
plt.ylabel("mean squared deviation")
plt.title("Evolution of the error")

plt.show()

### Assignment 3.3.1
Now experiment yourself. Run the algorithm with self-chosen *m*, *q*, *epochs* and *learning rate*.
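
As a starting point for experimenting, the sketch below compares three learning rates on illustrative values (assumed, not read from `virginica.csv`); the helper `slope_after` is a hypothetical name. Because standardized inputs have mean 0 and mean square 1, each epoch multiplies the distance between *m* and its final value by (1 - 2η). A small η therefore converges slowly, η = 0.5 reaches the final slope in one step, and η > 1 makes the slope move further away with every epoch.

```python
import numpy as np

def slope_after(eta, epochs, x, y):
    """Slope m after `epochs` gradient descent updates, starting from m = q = 0."""
    q, m = 0.0, 0.0
    for _ in range(epochs):
        residuals = y - (m * x + q)
        derivative_m = -2 * np.mean(x * residuals)
        derivative_q = -2 * np.mean(residuals)
        m, q = m - eta * derivative_m, q - eta * derivative_q
    return m

# illustrative measurements in cm (assumed values, for demonstration only)
sepal = np.array([6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3])
petal = np.array([6.0, 5.1, 5.9, 5.6, 5.8, 6.6, 4.5, 6.3])
xs = (sepal - np.mean(sepal)) / np.std(sepal)
ys = (petal - np.mean(petal)) / np.std(petal)

r = np.corrcoef(xs, ys)[0, 1]   # the slope the algorithm should end up with
print("target slope:", r)

for eta in (0.01, 0.5, 1.1):
    print(eta, slope_after(eta, 50, xs, ys))   # slope after 50 epochs
```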

<div class="alert alert-block alert-warning">
Linear regression is also covered in the notebook 'Sea Level' and the notebook 'Tree height and stomata dimensions in the Amazon rainforest'.</div>

<div>
<h2>Reference List</h2></div>

[1] Dua, D., & Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. <br> &nbsp; &nbsp; &nbsp; &nbsp; Irvine, CA: University of California, School of Information and Computer Sciences.<br>[2] Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. *Annals of Eugenics*. 7(2), 179–188. <br> &nbsp; &nbsp; &nbsp; &nbsp; https://doi.org/10.1111/j.1469-1809.1936.tb02137.x<br>[3] No machine-readable author provided. Dlanglois assumed (based on copyright claims). <br> &nbsp; &nbsp; &nbsp; &nbsp;[CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons.

<div>
<h2>With support from</h2></div>

<img src="images/kikssteun.png" alt="Banner" width="1100"/>

<img src="images/cclic.png" alt="Banner" align="left" width="100"/><br><br>
Notebook KIKS, see <a href="http://www.aiopschool.be">AI At School</a>, by F. wyffels & N. Gesquière, is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.