<img src="images/logodwengo.png" alt="Banner" width="150"/>

<div>
    <font color=#690027 markdown="1">
        <h1>STANDARDIZATION</h1>
    </font>
</div>

<div class="alert alert-box alert-success">
In this notebook, you will learn to display bivariate data in a scatter plot. You will learn that you cannot always estimate the degree of correlation by sight and why standardizing the data is important in this regard. 
</div>

For this notebook, we were inspired by the teaching material 'Statistics for Secondary Education' which can be found on the website https://www.uhasselt.be/lesmateriaal-statistiek of Hasselt University [1][2]. The examples from [2] were used in this notebook with permission from Professor Callaert.

### Importing the necessary modules

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In what follows, you will compare three scatter plots and estimate which data show the best linear correlation.<br><br>
Lotte, Robbe, and Kato each receive a number of flower petals and must measure their length and width in millimeters. <br>
They list their measurement results in table1, table2, and table3 respectively.<br>
They then represent the bivariate data with a scatter plot.

Estimating the strength of the linear correlation directly from a figure is not so easy.

<div>
    <font color=#690027 markdown="1">
        <h2>1. Reading the data</h2> 
    </font>
</div>

Read the datasets with the pandas module.

In [None]:
table1 = pd.read_csv("data/datastandaard1.dat", header=None)  # table to be read has no header
table2 = pd.read_csv("data/datastandaard2.dat", header=None)  # table to be read has no header
table3 = pd.read_csv("data/datastandaard3.dat", header=None)  # table to be read has no header

<div>
    <font color=#690027 markdown="1">
        <h2>2. Displaying the read data</h2> 
    </font>
</div>

### Task 2.1
Examine the data from the three tables.

### Task 2.2
Provide the correct instruction to obtain information about the second table. <br>
- What are the names of the columns?
- How many petals has Robbe measured?

Answer:

<div>
    <font color=#690027 markdown="1">
        <h2>3. Displaying the data in a scatter plot</h2> 
    </font>
</div>

Because the tables have no header, you cannot use a header as a key. Instead, you can use the column number.
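
To illustrate what "using the column number" means, here is a minimal sketch with made-up values. The data and the use of `io.StringIO` as a stand-in for a file are assumptions for this example, not the contents of the `.dat` files. With `header=None`, pandas labels the columns with the integers 0, 1, ... and those integers serve as keys.

```python
import io
import pandas as pd

# Made-up measurements standing in for a headerless data file (assumption).
raw = io.StringIO("10,21\n12,25\n15,30\n")

demo = pd.read_csv(raw, header=None)   # no header row, so columns are labelled 0, 1, ...

print(list(demo.columns))   # the column labels are integers
width = demo[0]             # first column, selected by its number
length = demo[1]            # second column, selected by its number
print(width.tolist(), length.tolist())
```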

In [None]:
# table1
x1 = table1[0]
y1 = table1[1]
x1 = np.array(x1)
y1 = np.array(y1)

plt.figure()

plt.title("Table 1 Lotte")
plt.xlabel("petal width (mm)")
plt.ylabel("petal length (mm)")
plt.xlim(0, 45)
plt.ylim(0, 45)

plt.scatter(x1, y1, color="red", marker="o")

plt.show()

In [None]:
# table2
x2 = table2[0]
y2 = table2[1]
x2 = np.array(x2)
y2 = np.array(y2)

plt.figure() 

plt.title("Table 2 Robbe")
plt.xlabel("petal width (mm)")
plt.ylabel("petal length (mm)")
plt.xlim(0, 45)
plt.ylim(0, 45)

plt.scatter(x2, y2, color="yellow", marker=">")  

plt.show()

In [None]:
# table3
x3 = table3[0]
y3 = table3[1]
x3 = np.array(x3)
y3 = np.array(y3)

plt.figure()


plt.title("Table 3 Kato")
plt.xlabel("petal width (mm)")
plt.ylabel("petal length (mm)")
plt.xlim(0, 45)
plt.ylim(0, 45)

plt.scatter(x3, y3, color="lightblue", marker="<")  

plt.show()

Which table has the best linear correlation?

At first glance, in table .......

<div>
    <font color=#690027 markdown="1">
        <h2>4. Correlation</h2> 
    </font>
</div>

<div class="alert alert-box alert-info">
It is not straightforward to estimate the linear correlation between data just by looking at the scatter plot. A better measure is the correlation coefficient. 
</div>

The *correlation coefficient R* measures the extent to which there is a linear relationship between the x- and y-coordinates of the given points.

<div class="alert alert-box alert-info">
The correlation coefficient, a real number R, always lies between -1 and 1. The closer R is to 0, the weaker the linear relationship; the closer R is to -1 or 1, the stronger it is. <br>
A positive R indicates a positive linear relationship, while a negative R indicates a negative linear relationship. 
</div>
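
As an illustration, the sketch below writes out the usual formula for the (Pearson) correlation coefficient on a small made-up dataset and compares the result with `np.corrcoef`. The data and the names `xs`, `ys`, `r_manual` and `r_numpy` are assumptions for this example.

```python
import numpy as np

# Made-up petal measurements in mm (assumption, not the notebook's data).
xs = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
ys = np.array([21.0, 24.0, 32.0, 35.0, 41.0])

# Deviations of each value from the mean of its variable.
dx = xs - np.mean(xs)
dy = ys - np.mean(ys)

# Correlation coefficient written out: sum of products of deviations,
# divided by the square root of the product of the sums of squares.
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# np.corrcoef returns a 2x2 matrix; element [0, 1] is the correlation of xs with ys.
r_numpy = np.corrcoef(xs, ys)[0, 1]

print("R by hand:   ", r_manual)
print("R with NumPy:", r_numpy)
```

Both values should agree up to rounding, and both lie between -1 and 1.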

Compare the correlation coefficients of the three tables. First, complete the instructions in the code cell.

In [None]:
# correlation coefficient of table 1
print("Correlation coefficient R1 of the first table =", np.corrcoef(x1, y1)[0,1])

In [None]:
# correlation coefficient of table 2 and table 3
print("Correlation coefficient R2 of the second ...
print("...

What do you conclude about the linear correlation for the three tables?

Answer:

<div>
    <font color=#690027 markdown="1">
        <h2>5. Standardization</h2> 
    </font>
</div>

The data can only be compared if they are standardized. To do this, you calculate the *Z-score* of the data: you subtract the mean and then divide by the standard deviation.
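
A minimal sketch of this calculation on made-up numbers; the helper name `z_score` and the data are assumptions for this example. Note that `np.std` divides by n by default (the population standard deviation). After standardization, the values have mean 0 and standard deviation 1.

```python
import numpy as np

def z_score(values):
    """Return the Z-scores: subtract the mean, then divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)   # np.std divides by n by default

widths_mm = np.array([10.0, 12.0, 15.0, 18.0, 20.0])   # made-up widths in mm (assumption)
z = z_score(widths_mm)

print(z)
print("mean:", np.mean(z))                # approximately 0
print("standard deviation:", np.std(z))   # approximately 1
```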

In [None]:
# replace each variable by its Z-score: subtract the mean, divide by the standard deviation
x1 = (x1-np.mean(x1))/np.std(x1)
y1 = (y1-np.mean(y1))/np.std(y1)
x2 = (x2-np.mean(x2))/np.std(x2)
y2 = (y2-np.mean(y2))/np.std(y2)
x3 = (x3-np.mean(x3))/np.std(x3)
y3 = (y3-np.mean(y3))/np.std(y3)

You can now plot the scatter plots that correspond to these adjusted data.

In [None]:
# standardized scatter plot

plt.figure()

plt.scatter(x1, y1, color="red", marker="o")      
plt.scatter(x2, y2, color="yellow", marker=">")
plt.scatter(x3, y3, color="lightblue", marker="<")

plt.title("Standardized")
plt.xlabel("petal width (Z-score)")     # standardized values have no unit
plt.ylabel("petal length (Z-score)")
plt.xlim(-2, 2)
plt.ylim(-2, 2)

plt.show()

The linear correlation is almost the same for all three tables. After standardization, the points almost coincide.<br>

More explanation on these concepts can be found in the course 'Correlation: exploratory methods: workbook for the student'. [1]

<div class="alert alert-box alert-info">
Comparing the correlation of multiple datasets is best done after standardizing the data. Standardizing has no effect on the value of the correlation coefficient.<br> The correlation between <b>bivariate data</b> can be visually estimated by looking at the corresponding <b>scatter plot</b>. However, the shape of this plot is only reliable if the data are standardized. <br>
<b>Standardizing</b> variables means rescaling the variables in such a way that variables of, for example, different magnitudes or in different units can be compared or related.
</div>
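
The statement that standardizing leaves the correlation coefficient unchanged can be checked on made-up data. In the sketch below (data, variable names and the helper `standardize` are assumptions for this example), the same widths are expressed in millimeters and in centimeters: the raw numbers differ by a factor of 10, but R and the Z-scores are the same.

```python
import numpy as np

def standardize(values):
    # Z-score: subtract the mean, divide by the (population) standard deviation.
    return (values - np.mean(values)) / np.std(values)

# Made-up petal measurements (assumption).
width_mm = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
length_mm = np.array([21.0, 24.0, 32.0, 35.0, 41.0])

# Same petals, width now expressed in centimeters.
width_cm = width_mm / 10

r_raw = np.corrcoef(width_mm, length_mm)[0, 1]
r_cm = np.corrcoef(width_cm, length_mm)[0, 1]
r_std = np.corrcoef(standardize(width_mm), standardize(length_mm))[0, 1]

print(r_raw, r_cm, r_std)   # three (nearly) equal values

# The Z-scores do not depend on the unit of measurement.
print(np.allclose(standardize(width_mm), standardize(width_cm)))
```

A scatter plot of the raw data would look different in each unit, whereas the plot of the Z-scores would not, which is why the standardized plots are the ones to compare by eye.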

<div class="alert alert-box alert-info">
After standardization, the values are centered around 0 and most of them lie between -2 and 2, which is beneficial for computer calculations.<br>
Calculating with very large numbers quickly leads to even larger numbers and numerical instability, which is an additional reason why the data are standardized. <br> In addition, some machine learning algorithms only work well if the data are standardized, because they assume that all variables are on a comparable scale.
</div>

<div>
    <h2>References</h2> 
</div>

[1] Callaert, H., Bekaert, H., Goethals, C., Provoost, L., & Vancaudenberg, M. (2012). Correlation: exploratory methods: workbook for the student.<br> &nbsp; &nbsp; &nbsp; &nbsp; *Statistics for secondary education.* University of Hasselt. Accessed on April 15, 2019, via <br> &nbsp; &nbsp; &nbsp; &nbsp; https://www.uhasselt.be/documents/uhasselt@school/lesmateriaal/statistiek/Lesmateriaal/LEERLING%20Correlatie_02.pdf.<br>
[2] Callaert, H., & Bogaerts, S. (2004). Statistical Intelligence: discovering the relationship: exploration of bivariate data. University of Hasselt. <br> &nbsp; &nbsp; &nbsp; &nbsp; Accessed on April 15, 2019, via https://docplayer.nl/32671814-Statistische-intelligentie.html.<br>

<img src="images/cclic.png" alt="Banner" align="left" width="100"/><br><br>
Notebook Python and mathematics, see <a href="http://www.aiopschool.be">AI At School</a>, by F. Wyffels & N. Gesquière is licensed under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.