<img width="100%" src="callystoBannerTop.jpg"/>

# For Teachers
**Delete this portion before handing over to students:**<br>
**If you want to change or personalize one of the practice questions and need to change the answer, uncomment the print statement in the check_answer function in any of the "# ANSWER HANDLING." cells, type your desired answer into the text field, and copy the printed hash into the 'answer' variable.**

# Overview
In this notebook, we will introduce the concept of standard deviation and show how to calculate variance. We expect students to understand square roots and how to calculate a mean. Note that this module is an optional extension and is not necessary to meet any curriculum outcomes.

 

# Introduction
The average 3-year-old girl is 94 cm tall. If a daycare has 30 three-year-old girls, would they all be the same height? No, because heights vary from the average. But by how much? If heights are so varied, how can we use this average to determine whether a child is healthy? Is the relationship between height and age really that distinct? We can ask so many questions about data like this, and our first step toward answering them is learning about standard deviation.

# Table of Contents
I. <a href = "#sd"> What is Standard Deviation? </a> <br>
II. <a href = "#var"> Variance </a> <br>
III. <a href ="#exercise"> Exercise </a> <br>
IV. <a href = "#samples"> Standard Deviation of Samples </a> <br>
V. <a href = "#summary"> Summary </a> <br>
VI. <a href = "#concl"> Conclusion </a> <br>

<h1><a name="sd" style="text-decoration: none; color: black;">What is Standard Deviation?</a></h1>

Standard deviation measures how spread out data points are around their mean value. When a relationship is graphed, the standard deviation is key to judging how strong that relationship is: a small standard deviation means the points sit close to the mean, and a large one means they are widely scattered.
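To make "spread" concrete, here is a minimal sketch comparing two made-up groups of heights that share the same mean. The numbers are hypothetical, chosen only for illustration, and `statistics.pstdev` from Python's standard library is used as a shortcut before we work through the calculation by hand.

```python
import statistics

# Hypothetical heights in cm (made up for illustration); both groups average 94.
group_a = [93, 94, 95]   # values close to the mean
group_b = [84, 94, 104]  # values far from the mean

# pstdev treats the list as a whole population.
spread_a = statistics.pstdev(group_a)
spread_b = statistics.pstdev(group_b)

print("Group A mean:", statistics.mean(group_a), "standard deviation:", round(spread_a, 2))
print("Group B mean:", statistics.mean(group_b), "standard deviation:", round(spread_b, 2))
```

Both groups have the same mean, but group B's standard deviation is ten times larger because its values sit ten times farther from that mean.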

Don't worry too much about the code below. If you can pick out what certain lines do, that's great, but if not, just run it and observe the graph it produces. Notice how the data doesn't form a perfect line, yet the relationship between age and height is still clearly defined? This is because the values are spread around the mean line by a limited amount, and that amount is what standard deviation describes.

In [12]:
# Imports, borrowing code from a 'library' that is used to generate the graph.
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import numpy as np
from numpy import random as rand
import math
init_notebook_mode(connected=True)

# Initializes data sets.
ages = []
heights = []

# Generates data set.
for i in range(30):
    ages.append(rand.uniform(3.0, 3.9))
    heights.append(rand.uniform(94-rand.random_sample(), 94+rand.random_sample()))

# Sorts each list from least to greatest, so the youngest age is paired with the shortest height.
ages.sort()
heights.sort()

# Generates graph.
trace = go.Scatter(
    x = ages,
    y = heights,
    mode = 'markers', 
    name = "Age vs. Height"
)

data = [trace]
py.iplot(data, filename='scatter-mode')

<h1><a name="var" style="text-decoration: none; color: black;">Variance</a></h1>

Standard deviation is defined as the square root of the variance. The variance, in turn, is the average of the squared distances between each value in a set and the mean of that set.
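The two definitions above can also be written as formulas. The notation is our own choice and does not appear elsewhere in this notebook: $x_i$ stands for each value, $\mu$ for the mean, $N$ for the number of values, $\sigma^2$ for the variance, and $\sigma$ for the standard deviation.

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\qquad\qquad
\sigma = \sqrt{\sigma^2}
$$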

<h1><a name="exercise" style="text-decoration: none; color: black;">Exercise</a></h1>

What is the variance of the data set below? The code cells are out of order: reorder them, using the up and down arrows to the left of the Run button, so that they calculate the variance. Leave the START and END cells in their positions. The END cell will output the final value, which you can check later.

In [4]:
#START
values = [1, 2, 3, 4, 5] # Your dataset.
print(values)

[1, 2, 3, 4, 5]


In [5]:
value = sum(values)/len(values) #divide the sum of the list of values by the length of the list
print(value)

3.0


In [6]:
for i in range(len(values)): #for every position in the list
  squared = values[i]**2 #square the value at that position
  values.append(squared) #add it to the list
del values[0:5] #remove the old list values
print(values)

[1, 4, 9, 16, 25]


In [7]:
value = sum(values)/len(values) #divide the sum of the list of values by the length of the list
print(value)

11.0


In [8]:
for i in range(len(values)): #for every position in the list
  difference = values[i] - value #subtract the average from the value at that position
  values.append(difference) #add it to the list
del values[0:5] #remove the old list values
print(values)

[-10.0, -7.0, -2.0, 5.0, 14.0]


In [24]:
#END (final value)
print(value)

11.0


## Check your answer by running the cells below and inputting the final value to the textbox:
Again, you don't have to do anything in the code cells, just run them :)

In [16]:
# ANSWER HANDLING. Uncomment the 'print("This is the hash: ", temp)' line by deleting the # before it.

# Imports.
from IPython.display import display 
from ipywidgets import widgets
import hashlib

# Check the answer given by the student.
def check_answer(x):
    temp = None
    try:
        temp = hashlib.md5(str.encode(str(text.value))).hexdigest()
        #print("This is the hash: ", temp)
    except Exception:
        print("Could not read the answer.")
    if(temp == answer):
        print("Correct!")
    else:
        print("Incorrect. Try again!")

In [17]:
# Hashed answer (MD5 hash of the correct value).
answer = '9f41f9f1c434718ae6e50ffba61152d0'

# Create answer box.
text = widgets.Text()
display(text)
text.on_submit(check_answer)


You might be wondering why we square the deviations. Here's a simplified explanation. First, standard deviation describes a typical distance from the mean, so a negative value does not make sense. If we simply added up the raw deviations, the positive and negative ones would cancel out and the total would always be zero. Second, to relate variance to standard deviation, we need an operation that can be undone: when you square a value, you can then take its square root. Taking the absolute value would also remove the negative signs, but squaring gives more weight to points that are far from the mean, and it is the convention that other statistical tools build on.
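A quick way to see why the raw deviations cannot be averaged directly is to add them up. The sketch below uses a made-up data set (an assumption for illustration, not data from this notebook).

```python
# Hypothetical data set, chosen so the mean is a whole number.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)          # 40 / 8 = 5.0
deviations = [v - mean for v in values]   # distance of each value from the mean
squared = [d ** 2 for d in deviations]    # squaring removes the negative signs

print("Sum of deviations:", sum(deviations))       # positives and negatives cancel: 0.0
print("Sum of squared deviations:", sum(squared))  # 32.0
```

The deviations always cancel to zero, so they say nothing about spread. The squared deviations do not cancel, which is why they are the ones we average.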

<h1><a name="samples" style="text-decoration: none; color: black;">Standard Deviation of Samples</a></h1>

We’ve just calculated the variance of a population: a small data set composed of five values. A sample is only a part of the total population, so if we calculated its standard deviation the same way, the result would tend to underestimate the spread of the whole population. To compensate for this, when calculating the average of the squared differences, we divide by the total number of values minus 1 instead of the total number of values.
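As a rough illustration of the difference, Python's standard `statistics` module provides both versions: `pvariance` and `pstdev` divide by the number of values, while `variance` and `stdev` divide by one less. The heights below are hypothetical, made up for this sketch.

```python
import statistics

# Hypothetical sample of five heights in cm, imagined as drawn from a larger group.
sample = [91, 94, 97, 95, 93]

# Squared differences from the mean (94) add up to 9 + 0 + 9 + 1 + 1 = 20.
population_variance = statistics.pvariance(sample)  # divides by n:     20 / 5
sample_variance = statistics.variance(sample)       # divides by n - 1: 20 / 4

print("Dividing by n:    ", population_variance, statistics.pstdev(sample))
print("Dividing by n - 1:", sample_variance, statistics.stdev(sample))
```

Dividing by the smaller number gives a slightly larger result, which offsets the tendency of a sample to understate the spread of the full population.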

<h1><a name="summary" style="text-decoration: none; color: black;">Summary</a></h1>

1. Calculate the mean (average value) of the data points by adding up the data points and dividing by the total number of data points <br>

2. Calculate the deviation of each data point by subtracting the mean value calculated in the first step from it <br> _(Fun fact: if the data point is above the mean value, its deviation is positive; if the data point is below the mean value, its deviation is negative)_ <br>

3. Square each deviation calculated in the second step to make each value positive  <br>

4. Add up the squared deviations calculated in the third step <br>

5. Divide the sum calculated in the fourth step by the total number of data points (or by one less than the number of data points for samples); the result is the variance <br>

6. Take the square root of the value resulting from step 5; the result is the standard deviation <br>
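The six steps above can be sketched as one small function. The function name `standard_deviation`, its `sample` flag, and the data set are our own choices for illustration, not part of the exercise.

```python
import math

def standard_deviation(data, sample=False):
    """Follow the six summary steps; set sample=True to divide by n - 1."""
    n = len(data)
    mean = sum(data) / n                      # Step 1: mean
    deviations = [x - mean for x in data]     # Step 2: deviation of each point
    squared = [d ** 2 for d in deviations]    # Step 3: square each deviation
    total = sum(squared)                      # Step 4: add up the squares
    divisor = n - 1 if sample else n          # Step 5: divide to get the variance
    variance = total / divisor
    return math.sqrt(variance)                # Step 6: square root

# Hypothetical data set, made up for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
population_sd = standard_deviation(data)
sample_sd = standard_deviation(data, sample=True)

print("Population standard deviation:", population_sd)  # 2.0
print("Sample standard deviation:", round(sample_sd, 3))
```

Keeping each step on its own line makes it easy to print the intermediate values and compare them with a calculation done by hand.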

<h1><a name="concl" style="text-decoration: none; color: black;">Conclusion</a></h1>

In this notebook, we introduced standard deviation for populations and samples, and showed how both are calculated from the variance.

<img width="100%" src="callystoBannerBottom.jpg"/>