<img src="images/kiksmeisedwengougent.png" alt="Banner" width="1100">

<div>
    <font color=#690027 markdown="1">   
<h1>REGRESSION WITH DATA ON THE IRIS VIRGINICA</h1>    </font>
</div>

<div class="alert alert-box alert-success">
In this notebook, you will see how a <em>machine learning</em> system manages to find a <b>best fitting line</b> for a given collection of points. The algorithm starts with a randomly chosen line. The algorithm adjusts the coefficients in the equation of this line, based on the given data, until eventually the <b>regression line</b> is obtained.<br>First you determine the regression line with the built-in functions of the scikit-learn module. Afterwards, the algorithm is explained in case you want to know more.</div>

The Iris dataset was published in 1936 by the Briton Ronald Fisher in 'The use of multiple measurements in taxonomic problems' [1][2].<br>The dataset concerns three species of irises (*Iris setosa*, *Iris virginica* and *Iris versicolor*), with 50 samples of each species. Fisher could distinguish the species from each other based on four characteristics: the length and width of the sepals and petals.

<img src="images/kelkbladkroonblad.jpg" alt="Drawing" width="400"/> <br>
<center>Figure 1: Calyx and Corolla.</center>

In this notebook, you only use the data on the length of the sepals and petals of the *Iris virginica*.

### Importing the necessary modules

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

from matplotlib import animation
from IPython.display import HTML

<div style='color: #690027;' markdown="1">
<h2>1. The data of the <em>Iris virginica</em></h2></div>

<center><img src="images/irisvirginica.jpg" alt="Drawing" width="203"/></center><br>
<center>Figure 2: <em>Iris virginica</em> [3]</center>

Read the dataset using the `pandas` module.

In [None]:
# read dataset
virginica = pd.read_csv("data/virginica.csv")

Check the data. This can be done very simply by entering the name of the table. The length of some sepals and some petals is displayed. The number of samples is easy to read.

In [None]:
# display dataset in table
virginica

The relationship between the length of the sepal and the length of the petal is studied. <br> For this, the length of the petal is plotted as a function of the length of the sepal: the petal length goes on the y-axis and the sepal length on the x-axis.

<div class="alert alert-box alert-info">
For the machine learning system, the <em>length of the sepal</em> will serve as <b>input</b> and the <em>length of the petal</em> as <b>output</b>.</div>

In [None]:
x = virginica["length sepal"]       # use column name as index
y = virginica["petal length"]

We convert the data into NumPy arrays.

In [None]:
x = np.array(x)
y = np.array(y)

<div style='color: #690027;' markdown="1">
<h2>2. Visualizing the connection between both characteristics through a regression line</h2></div>

We standardize the data and display it in a scatter plot. We calculate the correlation coefficient to see how strong the association between the two features is.<br>The regression line is then determined and drawn.<br>This regression line predicts the length of a petal for a known length of a sepal.

<div style='color: #690027;' markdown="1">
<h3>2.1 Standardize the data</h3></div>

To standardize, we convert the features to their Z-scores: subtract the mean and divide by the standard deviation.

<div class="alert alert-block alert-warning">
More explanation about the importance of standardizing can be found in the notebook 'Standardizing'.</div>

In [None]:
x = (x - np.mean(x)) / np.std(x)
y = (y - np.mean(y)) / np.std(y)
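
As a quick sanity check, the following minimal sketch shows that Z-scores have mean 0 and standard deviation 1. The array `lengths` contains made-up values for illustration only; it is not taken from `virginica.csv`.

```python
import numpy as np

# hypothetical sepal lengths in cm, for illustration only
lengths = np.array([5.8, 6.3, 6.5, 7.1, 7.6])

# Z-score: subtract the mean, then divide by the standard deviation
z = (lengths - np.mean(lengths)) / np.std(lengths)

print(np.mean(z))   # approximately 0
print(np.std(z))    # approximately 1
```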

<div style='color: #690027;' markdown="1">
    <h3>2.2 Display the standardized data in a scatter plot</h3></div>

In [None]:
# petal length vs. sepal length
# sepal length goes on the x-axis, petal length on the y-axis
plt.scatter(x, y, color="blue", marker="o")  # scatter plot
plt.title("Iris virginica standardized")
plt.xlabel("sepal length")          # xlabel provides description on x-axis
plt.ylabel("petal length")          # ylabel provides description on y-axis
plt.show()

In [None]:
plt.figure(figsize=(10,8))    # larger graph, so that the points are more spread out
# choose the range so that it also suits larger and smaller leaves
plt.xlim(x.min()-2, x.max()+3)
plt.ylim(y.min()-2, y.max()+3)
plt.scatter(x, y, color="blue", marker="o")
plt.title("Iris virginica standardized")
plt.xlabel("sepal length")
plt.ylabel("petal length")
plt.show()

<div style='color: #690027;' markdown="1">
<h3>2.3 Correlation between x and y?</h3></div>


In [None]:
# to what extent is there a correlation between the x and y coordinates of these points?
# determine correlation coefficient (lies between -1 and 1; the closer to 0, the weaker the correlation)
r = np.corrcoef(x, y)[0,1]
print("R = ", r)

A strong correlation!
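
To make the meaning of this coefficient more concrete, here is a minimal sketch on made-up numbers (the arrays `a` and `b` are hypothetical). It illustrates that the correlation coefficient equals the mean of the products of the Z-scores, provided the population standard deviation is used, which is the default of `np.std`.

```python
import numpy as np

# hypothetical measurements, for illustration only
a = np.array([5.8, 6.3, 6.5, 7.1, 7.6])
b = np.array([5.1, 5.0, 5.5, 5.9, 6.6])

# standardize both features
za = (a - np.mean(a)) / np.std(a)
zb = (b - np.mean(b)) / np.std(b)

r_manual = np.mean(za * zb)          # mean of the products of the Z-scores
r_numpy = np.corrcoef(a, b)[0, 1]    # the same coefficient via NumPy

print(r_manual, r_numpy)             # both values agree
```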

<div style='color: #690027;' markdown="1">
<h3>2.4 Regression Line</h3></div>

Determine the regression line using built-in functions from the scikit-learn module, a Python module with *machine learning* algorithms. <br> In order to use such an algorithm, the *data must be presented in the desired format*. A 1D array suffices for the y-values, but the 1D array must be converted to a 2D array for the x-values.

In [None]:
# linear regression
X = x[:, np.newaxis]          # provide data in desired format to ML system
rechte = LinearRegression()   # rechte (the line) is determined using linear regression
rechte.fit(X, y)              # this line should fit the data (X, y)
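
The effect of `np.newaxis` is easiest to see on a small array. The sketch below uses a hypothetical array `v`; it only illustrates the change of shape from 1D to 2D and is not part of the analysis itself.

```python
import numpy as np

v = np.array([0.5, -1.2, 0.3])   # hypothetical 1D array with 3 values
V = v[:, np.newaxis]             # add a second axis: 3 rows, 1 column

print(v.shape)   # (3,)
print(V.shape)   # (3, 1)
```

Writing `v.reshape(-1, 1)` gives the same 2D array.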

Calculate R² and the mean squared deviation:

In [None]:
# key figures
print("R² for the line in relation to the data: %.3f" % r2_score(y, rechte.predict(X)))
print("Mean squared deviation for the line with respect to the data: %.2f" % mean_squared_error(y, rechte.predict(X)))

Show graph of scatter plot and regression line:

In [None]:
# graph of scatter plot and regression line
plt.figure(figsize=(10,8))
plt.xlim(x.min()-2, x.max()+3)
plt.ylim(y.min()-2, y.max()+3)
plt.title("Iris virginica standardized")
plt.xlabel("sepal length")          # xlabel provides description on x-axis
plt.ylabel("petal length")          # ylabel provides description on y-axis
plt.scatter(x, y, color="blue", marker="o")     # scatter plot
plt.plot(x, rechte.predict(X), color='green')   # found regression line; substitute x-values into its equation
plt.show()

From the model, you can directly determine the slope of the regression line and where it intersects the y-axis.

In [None]:
# calculate slope and y-axis intersection
# provide data in desired format
x_O = np.array([0])
X_O = x_O[:, np.newaxis]
x_1 = np.array([1])
X_1 = x_1[:, np.newaxis]
y_O_predict = rechte.predict(X_O)
y_1_predict = rechte.predict(X_1)
# predict() returns an array, so take element 0 before formatting
print("The regression line intersects the y-axis at: %.3f" % y_O_predict[0])
print("The regression line has a slope of: %.3f" % (y_1_predict[0] - y_O_predict[0]))
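
As an alternative sketch, a fitted scikit-learn `LinearRegression` object also stores the slope and the y-intercept in its attributes `coef_` and `intercept_`. The data `x_demo` and `y_demo` and the name `model` below are hypothetical and chosen so that the expected line is known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical points lying exactly on y = 2x + 1, for illustration only
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2 * x_demo + 1

model = LinearRegression()
model.fit(x_demo[:, np.newaxis], y_demo)

print(model.coef_[0])     # slope, approximately 2
print(model.intercept_)   # y-intercept, approximately 1
```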

<div style='color: #690027;' markdown="1">
<h3>2.5 Making predictions with the model</h3></div>

You can use the model to make predictions with new data: e.g. predict the length of the petal if you know the length of a sepal.

In [None]:
# predict petal length with known sepal length
x_known = np.array([3])               # sepal with standardized length equal to 3
X_known = x_known[:, np.newaxis]      # provide data in desired format
y_predict = rechte.predict(X_known)   # determine petal length with model

# graph
plt.figure(figsize=(10,8))

x_new = np.linspace(-4, 4, 67)      # draw longer straight line
X_new = x_new[:, np.newaxis]        # desired format

plt.xlim(x.min()-2, x.max()+3)
plt.ylim(y.min()-2, y.max()+3)
plt.title("Iris virginica standardized")
plt.xlabel("sepal length")
plt.ylabel("petal length")

plt.scatter(x, y, color="blue", marker="o")     # scatter plot
plt.plot(x, rechte.predict(X), color='green')   # found regression line
plt.plot(x_new, rechte.predict(X_new), color='yellow')   # extended regression line
plt.plot(x_known[0], y_predict[0], color="black", marker="o")  # predicted point

plt.show()

print("For a sepal with standardized length " + str(x_known[0]) + ", the standardized length of the petal is approximately " + str(y_predict[0]) + ".")

### Assignment 2.5.1
Try doing the same with a different length for the sepal.

<div style='color: #690027;' markdown="1">
<h2>3. The algorithm behind the regression line</h2></div>

<div style='color: #690027;' markdown="1">
    <h3>3.1 Structure of the algorithm</h3></div>

Such a regression line is sought using an algorithm. Here you can see how such an algorithm is constructed.

We continue to work with the same standardized data x and y.

<div class="alert alert-box alert-info">
To find a line that fits the given data well, the ML system starts from a randomly chosen line. This is done by randomly choosing the slope and the y-intercept of this line.<br>The system is <em>trained</em> with the training set (the inputs and the corresponding outputs): for each point of the training set, it checks how much the y-value on the provisional line deviates from the given y-value. The coefficients in the equation of the line are adjusted so that the mean squared deviation over the entire dataset becomes minimal. <br>The entire training set is run through several times. Each such pass is called an <em>epoch</em>. The system <em>learns</em> during these <em>epochs</em>.</div>

In [None]:
# training set with input x and output y
print(x, y)

The system should be able to calculate the mean squared deviation of the data points from a given straight line.<br>To do this, the residual $y-\hat{y}$ is calculated for each point. Here, $y$ is the given y-value and $\hat{y}$ is the predicted value, i.e. the value obtained by substituting the given x-value into the equation of the straight line.<br> The squares of the residuals are added together. This sum divided by the number of data points is the desired error.
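
Written as a formula, with $n$ the number of data points and $E$ a symbol chosen here for the error (it does not appear elsewhere in this notebook), this description reads:

$$E = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$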

In [None]:
def gka(b, a, x, y):
    """Calculate average squared deviation of points from the line y = a*x + b."""

    total_dev = 0
    n = len(x)            # number of points
    y_rechte = a * x + b  # y-values on the given straight line

    # sum of squared deviations at all points
    for i in range(n):
        total_dev += (y[i] - y_rechte[i])**2

    return total_dev / n  # divide by the number of points

As an example, you can have the average quadratic deviation calculated with respect to the x-axis:

In [None]:
# average squared deviation of the training data compared to the line y = 0
error = gka(0, 0, x, y)
print(error)

<div class="alert alert-box alert-info">
The ML system starts with a random straight line with equation <em>y = mx + q</em>. At the start of the training, <em>m</em> and <em>q</em> are randomly chosen. The number of <em>epochs</em> and the <em>learning rate</em> $\eta$ are determined.</div>

The algorithm will determine the coefficients of the line in such a way that the error is minimized. It does this using the **gradient descent** method.<br>After each *epoch*, the coefficients are adjusted, depending on the values of the partial derivatives and the *learning rate*.
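
For the line $y = mx + q$, these partial derivatives and the resulting update step can be written out as follows. This is a sketch of the standard derivation for the mean squared deviation; the symbol $E$ for the error is a notation chosen here.

$$E(m, q) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (m x_i + q)\bigr)^2$$

$$\frac{\partial E}{\partial m} = -\frac{2}{n}\sum_{i=1}^{n} x_i\bigl(y_i - (m x_i + q)\bigr) \qquad \frac{\partial E}{\partial q} = -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - (m x_i + q)\bigr)$$

$$m \leftarrow m - \eta\,\frac{\partial E}{\partial m} \qquad q \leftarrow q - \eta\,\frac{\partial E}{\partial q}$$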

In [None]:
def gradient_descent(q, m, x, y, eta):
    """Adjustment of parameters q and m after completed epoch with learning rate eta."""

    n = len(x)
    y_current = m * x + q      # line found at this point in the process
    derivative_m = 0           # declare and initialize partial derivative with respect to m
    derivative_q = 0           # declare and initialize partial derivative with respect to q

    # calculation of the partial derivatives
    for i in range(n):
        derivative_m += - (2/n) * x[i] * (y[i] - y_current[i])
        derivative_q += - (2/n) * (y[i] - y_current[i])

    # adjust values of m and q
    m = m - eta * derivative_m
    q = q - eta * derivative_q

    # return modified values of m and q
    return q, m
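
The same update step can also be written without an explicit loop by using NumPy array operations. The sketch below is one possible variant, not the version used in the rest of this notebook; the function name `gradient_descent_vectorized` and the demo data are hypothetical.

```python
import numpy as np

def gradient_descent_vectorized(q, m, x, y, eta):
    """One gradient descent step for the line y = m*x + q (hypothetical helper)."""
    residuals = y - (m * x + q)                  # y - y_hat for every point
    derivative_m = -2 * np.mean(x * residuals)   # partial derivative with respect to m
    derivative_q = -2 * np.mean(residuals)       # partial derivative with respect to q
    return q - eta * derivative_q, m - eta * derivative_m

# hypothetical points lying exactly on y = 2x + 1, for illustration only
x_demo = np.array([-1.0, 0.0, 1.0, 2.0])
y_demo = 2 * x_demo + 1

q_demo, m_demo = 0.0, 0.0
for _ in range(2000):
    q_demo, m_demo = gradient_descent_vectorized(q_demo, m_demo, x_demo, y_demo, 0.05)

print(q_demo, m_demo)   # close to 1 and 2
```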

<div style='color: #690027;' markdown="1">
<h3>3.2 Testing the gradient descent algorithm for multiple epochs</h3></div>

Take 0 as the initial value for m and for q. Perform gradient descent for 3000 epochs with a learning rate of 0.01 and show the adjustments of $m$ and $q$ and the error after each *epoch*.

In [None]:
# testing algorithm
q = 0
m = 0
eta = 0.01

for j in range(3000):
    fout = gka(q, m, x, y)                     # calculate average squared deviation after each epoch
    print(q, m, fout)                          # display values q, m and fout after each epoch
    q, m = gradient_descent(q, m, x, y, eta)   # adjust values of q and m after each epoch

print("The line intersects the y-axis at: %.3f" % q)
print("The line has slope: %.3f" % m)
print("Average squared deviation for the line with respect to the data: %.2f" % fout)

In the example, you can see that the number of epochs helps determine how accurately the regression line is found. The line found after, for example, 50 epochs is still far from the intended regression line. Also note how the error evolves: as long as it keeps falling, it has not yet been minimized, and the system still *underfits*.<br>You can also see that 3000 epochs is more than needed: at a certain point the error no longer decreases and the values are no longer adjusted. This means that the minimum has been reached.
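
One way to act on this observation is to stop as soon as the error barely decreases any more. The sketch below illustrates the idea on made-up data; the helper names `mse` and `step`, the demo arrays and the threshold `tolerance` are all assumptions made for this illustration and do not appear elsewhere in the notebook.

```python
import numpy as np

def mse(q, m, x, y):
    """Mean squared deviation of the points from the line y = m*x + q."""
    return np.mean((y - (m * x + q)) ** 2)

def step(q, m, x, y, eta):
    """One gradient descent update of q and m."""
    residuals = y - (m * x + q)
    return q + eta * 2 * np.mean(residuals), m + eta * 2 * np.mean(x * residuals)

# hypothetical data, for illustration only
x_demo = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
y_demo = np.array([-1.2, -0.6, 0.1, 0.4, 1.3])

q_demo, m_demo, eta = 0.0, 0.0, 0.01
tolerance = 1e-9                        # assumed threshold for "no longer decreases"
previous = mse(q_demo, m_demo, x_demo, y_demo)

for epoch in range(1, 3001):
    q_demo, m_demo = step(q_demo, m_demo, x_demo, y_demo, eta)
    current = mse(q_demo, m_demo, x_demo, y_demo)
    if previous - current < tolerance:  # the error has stopped decreasing
        break
    previous = current

print("stopped after", epoch, "epochs with error", current)
```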

### Assignment 3.2.1
You can also adjust the *learning rate* or the initial values of m and q and see what effect this has.

<div style='color: #690027;' markdown="1">
<h3>3.3 How do the error and the position of the line change during the process?</h3></div>

The process of determining the regression line for given data (x, y) depends on the initial values of m and q, the *learning rate* (eta) and the number of times the data is run through (epochs). <br>To observe how the position of the line and the size of the error evolve, the values of m, q and the error must be stored after each epoch.<br>For this purpose, three lists are created that are extended after each epoch.

In [None]:
def gradient_descent_process(x, y, q, m, eta, epochs):
    """Go through the process and gradually make lists of q, m and error."""
    list_error = [gka(q, m, x, y)]     # declare and initialize error list
    list_q = [q]                       # declare and initialize list of q's
    list_m = [m]                       # declare and initialize list of slopes

    # fill in lists for each epoch
    for i in range(epochs):
        q, m = gradient_descent(q, m, x, y, eta)    # adjusted parameters after epoch
        fout = gka(q, m, x, y)                      # error after epoch
        list_q.append(q)                            # add modified q
        list_m.append(m)                            # add modified m
        list_error.append(fout)                     # add this error

    return [list_q, list_m, list_error]

Run this algorithm for chosen *m*, *q*, *epochs* and *learning rate*:

In [None]:
# initialization of m and q
m = 0
q = 0

# set the number of epochs and the learning rate eta
eta = 0.01
epochs = 500

# run linear regression algorithm for choice of m, q, eta and epochs
list_q, list_m, list_error = gradient_descent_process(x, y, q, m, eta, epochs)

# regression line
print("Y-intercept: %.3f" % list_q[-1])
print("Slope: %.3f" % list_m[-1])

# average squared deviation for the regression line
print("Minimized error: %.2f" % list_error[-1])

Create an animation for this process:

In [None]:
# all lines
xcoord = np.linspace(-4, 4, 67)

ycoord = []
for j in range(epochs):
    y_r = list_m[j] * xcoord + list_q[j]   # y-values for all x's in xcoord on the line of epoch j
    ycoord.append(y_r)
ycoord = np.array(ycoord)    # type casting

# initialize plot window
fig, ax = plt.subplots()
line, = ax.plot(xcoord, ycoord[0], color="green")   # plot straight line

plt.scatter(x, y, color="blue", marker="o")         # scatter plot
ax.axis([x.min()-2, x.max()+3, y.min()-2, y.max()+3])  # range of the axes

plt.title("Iris virginica standardized")
plt.xlabel("sepal length")          # xlabel provides description on x-axis
plt.ylabel("petal length")          # ylabel provides description on y-axis

def animate(i):
    line.set_ydata(ycoord[i])    # update the y-values of the line
    return line,

plt.close()  # to temporarily close plot window, only need animation screen
anim = animation.FuncAnimation(fig, animate, repeat=False, frames=len(ycoord))    
HTML(anim.to_jshtml())

In [None]:
# graph of the evolution of the error
plt.figure(figsize=(10,8))
plt.plot(list_error)
plt.xlabel('epoch')
plt.ylabel('average squared deviation')
plt.title('Evolution of the error')
plt.show()

### Assignment 3.3.1
Now experiment yourself. Run the algorithm with your own choice of *m*, *q*, *epochs* and *learning rate*.

<div class="alert alert-block alert-warning">
Linear regression is also covered in the notebook 'Sea level' and the notebook 'Tree height and stomata dimensions in the Amazon rainforest'.</div>

<div>
<h2>Reference List</h2></div>

[1] Dua, D., & Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. <br> &nbsp; &nbsp; &nbsp; &nbsp; Irvine, CA: University of California, School of Information and Computer Science.<br>[2] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. *Annals of Eugenics*. 7(2), 179–188. <br> &nbsp; &nbsp; &nbsp; &nbsp; https://doi.org/10.1111/j.1469-1809.1936.tb02137.x<br>[3] No machine-readable author provided. Dlanglois assumed (based on copyright claims). <br> &nbsp; &nbsp; &nbsp; &nbsp;[CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons.

<div>
<h2>With support from</h2></div>

<img src="images/kikssteun.png" alt="Banner" width="1100"/>

<img src="images/cclic.png" alt="Banner" align="left" width="100"/><br><br>
Notebook KIKS, see <a href="http://www.aiopschool.be">AI At School</a>, by F. Wyffels & N. Gesquière is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.