<img src="images/kiksmeisedwengougent.png" alt="Banner" width="1100">

<div>
    <font color=#690027 markdown="1">   
<h1>REGRESSION WITH DATA ON THE IRIS VIRGINICA</h1>    </font>
</div>

<div class="alert alert-box alert-success">
In this notebook, you determine for a given collection of points a <b>best fitting straight line</b>, the so-called <b>regression line</b>. The given collection is a commonly used dataset containing data from three types of irises.<br>To determine the regression line, you use a function of the SciPy module; you determine the strength of the correlation with a function of the scikit-learn module.<br>Based on this regression line, you actually have an <b>AI system</b> capable of predicting the length of the petal from a given length of a sepal for a particular type of iris.</div>

<div class="alert alert-block alert-warning">
The basic knowledge for this can be found in the learning path 'Linear regression'. Because a non-standardized dataset can sometimes cause problems (as you could see in the 'Sea Level' notebook on regression), you will work with the <b>standardized dataset</b> here.</div>

<div class="alert alert-block alert-warning">
Linear regression is also addressed in the 'Sea Level' notebook and the 'Height trees and dimensions of stomata in the Amazon rainforest' notebook.</div>

<div class="alert alert-block alert-warning">
More explanation about the importance of standardization can be found in the notebook 'Standardize'.</div>

The Iris dataset was published in 1936 by the Brit Ronald Fisher in 'The Use of Multiple Measurements in Taxonomic Problems' [1][2].<br>The dataset concerns three types of irises (*Iris setosa*, *Iris virginica*, and *Iris versicolor*), with 50 samples of each type. Fisher could distinguish the species from each other based on four characteristics: the length and width of the sepals and the petals.

<img src="images/kelkbladkroonblad.jpg" alt="Drawing" width="400"/> <br>
<center>Figure 1: Sepals and petals.</center>

In this notebook, you only use the data about the length of the sepals and petals of the *Iris virginica*.

### Assignment

Investigate the correlation between the length of the sepals and the petals of the *Iris virginica*. Determine a regression line and use it to make predictions regarding new data.

### Importing the necessary modules

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from scipy.optimize import curve_fit            # for regression
from sklearn.metrics import r2_score            # coefficient of determination R²
from sklearn.metrics import mean_squared_error  # mean squared deviation (method of least squares)

<div>
    <font color=#690027 markdown="1">   
<h2>1. The data of the <em>Iris virginica</em></h2>    </font>
</div>

<center><img src="images/irisvirginica.jpg" alt="Drawing" width="203"/></center><br>
<center>Figure 2: <em>Iris virginica</em> [3].</center>

Read the dataset using the `pandas` module.

In [None]:
# read in dataset
virginica = pd.read_csv("data/virginica.csv")

Look into the data. This can be done very simply by entering the name of the table. The lengths of some sepals and some petals are displayed. The number of samples is easy to read.

In [None]:
# display dataset in table
virginica

You will now study the relationship between the length of the sepal and the length of the petal. <br> For this, you plot the length of the petal as a function of the length of the sepal. So the length of the petal goes on the y-axis and the length of the sepal on the x-axis.

In [None]:
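# use column name as index; the names must match the header of virginica.csv
x = virginica["sepal length"]       # sepal length on the x-axis (column name assumed)
y = virginica["petal length"]       # petal length on the y-axis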

You convert the data into NumPy arrays.

In [None]:
x = np.array(x)
y = np.array(y)

<div>
    <font color=#690027 markdown="1">   
<h2>2. The correlation between both characteristics</h2>    </font>
</div>

<div>
    <font color=#690027 markdown="1">   
<h3>2.1 Displaying the data in a scatter plot</h3>    </font>
</div>

In [None]:
# petal length vs. sepal length
# length of sepal goes on x-axis, length of petal goes on y-axis
plt.scatter(x, y, color="blue", marker="o")  # scatter plot

plt.title("Iris virginica")
plt.xlabel("sepal length")         # xlabel gives description on x-axis
plt.ylabel("petal length")         # ylabel gives description on y-axis
plt.show()

What kind of regression is meaningful here?

Answer:

<div>
    <font color=#690027 markdown="1">   
<h3>2.2 Linear correlation between x and y?</h3>    </font>
</div>

You calculate the correlation coefficient to see how strong the linear relationship between the two features is.<br>

In [None]:
# to what extent is there a correlation between the x and y coordinates of these points?
# determine correlation coefficient (lies between -1 and 1; the closer to 0, the weaker the linear relationship)
r = np.corrcoef(x, y)[0,1]
print("R = ", r)
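
As an illustration of what `np.corrcoef` computes, the sketch below evaluates the Pearson correlation coefficient by hand and compares it with NumPy's result. The arrays `x_demo` and `y_demo` are invented values for demonstration only; they are not taken from the Iris dataset.

```python
import numpy as np

# Hypothetical measurements (in cm), invented for illustration only
x_demo = np.array([6.3, 5.8, 7.1, 6.5, 7.6])
y_demo = np.array([6.0, 5.1, 5.9, 5.8, 6.6])

# deviations from the mean
dx = x_demo - x_demo.mean()
dy = y_demo - y_demo.mean()

# Pearson correlation: sum of products of deviations,
# divided by the square root of the product of the sums of squared deviations
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# the same quantity via NumPy
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]

print("manual:", r_manual)
print("NumPy: ", r_numpy)
```

Both values should agree up to rounding, which shows that the coefficient depends only on how the deviations of x and y from their means vary together.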

What do you conclude about the correlation?

Answer:

In what follows:

- You standardize the data, and also display it in a scatter plot.
- You adjust the range of the graph so that the predictions you make can be displayed on it.
- Then the regression line is determined and drawn in.
- This regression line predicts the length of a petal for a known length of a sepal.

<div>
    <font color=#690027 markdown="1">   
<h2>3. Standardize Data</h2>    </font>
</div>

<div>
    <font color=#690027 markdown="1">   
<h3>3.1 Standardizing the data</h3>    </font>
</div>

To standardize, we will shift to the Z-scores of the features.

In [None]:
x = (x - np.mean(x)) / np.std(x)
y = (y - np.mean(y)) / np.std(y)
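
A quick way to see what the Z-score transformation does is to check the mean and standard deviation afterwards. The sketch below uses a small invented array `values` (not the real data); after standardizing, the mean should be approximately 0 and the standard deviation approximately 1.

```python
import numpy as np

# Hypothetical lengths in cm, invented for illustration only
values = np.array([6.3, 5.8, 7.1, 6.5, 7.6])

# Z-scores: subtract the mean, divide by the standard deviation
z = (values - np.mean(values)) / np.std(values)

print("mean of z:", np.mean(z))   # close to 0, up to rounding
print("std of z: ", np.std(z))    # close to 1, up to rounding
```

Note that overwriting `x` and `y` as in the cell above discards the original mean and standard deviation; if you want to convert results back to centimetres later, store those values first.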

<div>
    <font color=#690027 markdown="1">   
<h3>3.2 Adjust graph range to predictions</h3>    </font>
</div>

In [None]:
plt.figure(figsize=(10,8))    # to get a larger graph, so points are more spread out

# choose the range so that it also suits larger and smaller leaves
plt.xlim(x.min()-2, x.max()+3)       # x-axis range is automatically adjusted to smallest and largest x-value
plt.ylim(y.min()-2, y.max()+3)       # y-axis range is automatically adjusted to smallest and largest y-value
plt.scatter(x, y, color="blue", marker="o")

plt.title("Iris virginica standardized")
plt.xlabel("sepal length")
plt.ylabel("petal length")
plt.show()

Pay attention to the range of the x-axis and the y-axis.

<div>
    <font color=#690027 markdown="1">   
<h2>4. Regression Line</h2>    </font>
</div>

Determine the regression line as you learned in the learning path 'Linear Regression'.

In [None]:
# regression line is straight

# specify how the equation of the line is constructed
def straight_line(x, a, b):
    """Equation of a (non-vertical) straight line with variable x and coefficients a and b."""
    return a * x + b

# find the line that best fits the given data, print its equation and return the coefficients
def linreg(x, y):
    """Best-fitting straight line for data x and y."""
    popt, pcov = curve_fit(straight_line, x, y)     # curve_fit() looks in straight_line() to see what the function looks like
    # curve_fit() returns two things, referred to as popt and pcov
    # only the first is needed, popt, which gives a and b of the sought straight line
    a, b = popt                                     # coefficients
    print("y = ", a, "x +", b)                      # display regression line equation
    return a, b                                     # return coefficients of the regression line equation

In [None]:
# coefficients of the regression line for the given points
a, b = linreg(x, y)
print(a, b)

From this you directly read the slope of the regression line, i.e. the value of a, and where it intersects the y-axis, i.e. the value of b.

You can draw this line with Python by connecting a number of points on this line with each other.

In [None]:
y_regressionline = straight_line(x, a, b)
# y_regressionline refers to the list of y-values of points located on the regression line
# to calculate the y-values, we start from the known x-values
# those x-values are inserted into the expression a x + b
print(y_regressionline)

Show scatter plot and regression line graph:

In [None]:
# graph of scatter plot and regression line
plt.figure(figsize=(10,8))

plt.xlim(x.min()-2, x.max()+3)
plt.ylim(y.min()-2, y.max()+3)
plt.title("Iris virginica standardized")
plt.xlabel("sepal length")         # xlabel gives description on x-axis
plt.ylabel("petal length")         # ylabel gives description on y-axis

plt.scatter(x, y, color="blue", marker="o")     # scatter plot
plt.plot(x, y_regressionline, color="green")    # found regression line; x-values inserted into its equation
plt.show()

Calculating R² and the mean squared deviation:

In [None]:
# important numbers
print("R² for the regression line w.r.t. the data: %.3f" % r2_score(y, y_regressionline))
print("Mean squared deviation for the regression line w.r.t. the data: %.2f" % mean_squared_error(y, y_regressionline))

Compare this $R^{2}$ with the correlation coefficient.

Answer:

<div class="alert alert-box alert-info">
Based on this regression line, you actually have an AI system capable of predicting the length of the petal from a given sepal length.<br> For this system, the <em>length of the sepal</em> will serve as <b>input</b> and the <em>length of the petal</em> as <b>output</b>. The AI system was developed based on the above dataset.</div>

<div>
    <font color=#690027 markdown="1">   
<h2>5. Making predictions with the AI system</h2>    </font>
</div>

To emphasize that the regression line is used to make predictions, you draw it longer on the graph.<br>You predict the standardized petal length for a sepal of standardized length 3.

In [None]:
# predict petal length for a known sepal length
x_known = 3                                  # sepal with standardized length equal to 3
y_predicted = straight_line(x_known, a, b)   # determine standardized petal length based on model

# chart
plt.figure(figsize=(10,8))

x_new = np.linspace(-4, 4, 67)               # draw longer straight line
y_predicted_new = straight_line(x_new, a, b)

plt.xlim(x.min()-2, x.max()+3)
plt.ylim(y.min()-2, y.max()+3)
plt.title("Iris virginica standardized")
plt.xlabel("sepal length")
plt.ylabel("petal length")

plt.scatter(x, y, color="blue", marker="o")                 # scatter plot
plt.plot(x, y_regressionline, color="green")                # found regression line
plt.plot(x_new, y_predicted_new, color="yellow")            # extended regression line
plt.plot(x_known, y_predicted, color="black", marker="o")   # predicted point

plt.show()

print("For a sepal with standardized length " + str(x_known) + ", the standardized length of the petal is approximately " + str(y_predicted) + ".")
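
The prediction above is expressed in standardized units. The sketch below shows one possible way to translate between centimetres and Z-scores, assuming the mean and standard deviation were stored before standardizing. All names (`x_raw`, `y_raw`, `to_z`, `from_z`, `a_demo`, `b_demo`) and all numbers are hypothetical and serve only as an illustration; they are not part of the notebook or the dataset.

```python
import numpy as np

# Hypothetical raw measurements in cm (not the real dataset)
x_raw = np.array([6.3, 5.8, 7.1, 6.5, 7.6])   # sepal lengths
y_raw = np.array([6.0, 5.1, 5.9, 5.8, 6.6])   # petal lengths

# store the statistics BEFORE standardizing
x_mean, x_std = x_raw.mean(), x_raw.std()
y_mean, y_std = y_raw.mean(), y_raw.std()

def to_z(value, mean, std):
    """Convert a value in cm to a Z-score."""
    return (value - mean) / std

def from_z(z, mean, std):
    """Convert a Z-score back to cm."""
    return z * std + mean

# hypothetical coefficients of a regression line in standardized units
a_demo, b_demo = 0.8, 0.0

sepal_cm = 7.0
sepal_z = to_z(sepal_cm, x_mean, x_std)       # input in standardized units
petal_z = a_demo * sepal_z + b_demo           # prediction in standardized units
petal_cm = from_z(petal_z, y_mean, y_std)     # prediction converted back to cm

print("Predicted petal length in cm:", petal_cm)
```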

### Assignment 5.1

Try the same with a different length for the sepal.

<div class="alert alert-block alert-info">
The equation of a straight line that best fits a dataset can also be found using mathematical formulas. For this, you use the method of least squares.</div>
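
The formulas of the method of least squares can be sketched as follows: the slope is the sum of the products of the deviations of x and y from their means, divided by the sum of the squared deviations of x; the intercept follows from the fact that the line passes through the point of the means. The data `x_demo` and `y_demo` are invented for illustration, and `np.polyfit` is used here only as an independent check.

```python
import numpy as np

# Hypothetical data, invented for illustration only
x_demo = np.array([6.3, 5.8, 7.1, 6.5, 7.6])
y_demo = np.array([6.0, 5.1, 5.9, 5.8, 6.6])

x_bar, y_bar = x_demo.mean(), y_demo.mean()

# least-squares slope and intercept from the closed-form formulas
a_ls = np.sum((x_demo - x_bar) * (y_demo - y_bar)) / np.sum((x_demo - x_bar) ** 2)
b_ls = y_bar - a_ls * x_bar

# independent check: a degree-1 polynomial fit returns [slope, intercept]
a_np, b_np = np.polyfit(x_demo, y_demo, 1)

print("formulas:", a_ls, b_ls)
print("polyfit: ", a_np, b_np)
```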

<div class="alert alert-box alert-info">
To find a straight line that fits the given data well, an ML system can start from a randomly chosen straight line. This is done by randomly choosing the slope and the intercept with the y-axis of this line.<br>The system is <em>trained</em> with the training set (the inputs and the corresponding outputs): for each point of the training set, it is checked how much the y-value on the provisional straight line deviates from the given y-value. The coefficients in the equation of the straight line are adjusted so that the mean squared deviation for the entire dataset is minimal. <br> You can study this ML technique in the <em>next notebook</em>.</div>
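
The training idea described above can be sketched with gradient descent, one common way to adjust the coefficients step by step. This is a minimal sketch on synthetic data, not necessarily the approach of the next notebook; the learning rate, the number of steps and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic training set (not the Iris data): roughly y = 0.8 x plus noise
x_demo = rng.normal(size=50)
y_demo = 0.8 * x_demo + rng.normal(scale=0.3, size=50)

# start from a randomly chosen straight line
a_gd, b_gd = rng.normal(), rng.normal()
learning_rate = 0.1   # assumed step size

for _ in range(500):
    # deviation of the provisional line from the given y-values
    error = a_gd * x_demo + b_gd - y_demo
    # partial derivatives of the mean squared deviation w.r.t. a and b
    grad_a = 2 * np.mean(error * x_demo)
    grad_b = 2 * np.mean(error)
    # adjust the coefficients in the direction that lowers the deviation
    a_gd -= learning_rate * grad_a
    b_gd -= learning_rate * grad_b

# reference: direct least-squares fit, returns [slope, intercept]
a_ref, b_ref = np.polyfit(x_demo, y_demo, 1)

print("gradient descent:", a_gd, b_gd)
print("least squares:   ", a_ref, b_ref)
```

After enough steps, the coefficients found by gradient descent should be very close to those of the direct least-squares fit.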

<div>
<h2>Reference List</h2></div>

[1] Dua, D., & Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. <br> &nbsp; &nbsp; &nbsp; &nbsp; Irvine, CA: University of California, School of Information and Computer Sciences.<br>[2] Fisher, R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. *Annals of Eugenics*. 7(2), 179–188. <br> &nbsp; &nbsp; &nbsp; &nbsp; https://doi.org/10.1111/j.1469-1809.1936.tb02137.x<br>[3] No machine-readable author provided. Dlanglois assumed (based on copyright claims). <br> &nbsp; &nbsp; &nbsp; &nbsp;[CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0/)], via Wikimedia Commons.

<div>
<h2>With support from</h2></div>

<img src="images/kikssteun.png" alt="Banner" width="1100"/>

<img src="images/cclic.png" alt="Banner" align="left" width="100"/><br><br>
The KIKS notebook, see <a href="http://www.aiopschool.be">AI At School</a>, by F. wyffels & N. Gesquière, is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/"> Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License </a>.