---
title: Math Primer 1 + Intro to NumPy
duration: "1:5"
creator:
    name: Kiefer Katovich & David Yerrington
---

<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">
# Math Primer 1 + Intro to NumPy
Week 1 | Lesson 3.3

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Understand the measures of Central Tendency (mean, median, and mode)
- Understand how mean, median and mode are affected by skewness in data
- Understand measures of variability (variance and standard deviation)

### STUDENT PRE-WORK
*Before this lesson, you should already be able to:*
- This should have been completed as pre-work before starting the course, but if you haven't watched it yet, please watch Lesson 3: Estimation
[Intro to Stats](https://www.udacity.com/course/intro-to-statistics--st101)

### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 5 min  | [Introduction](#introduction)   |  Descriptive Statistics |
| 20 min  | [Demo / Guided Practice](#demo)  | Mean, Median, and Mode  |
| 20 min  | [Demo / Guided Practice](#demo)  | Skewness  |
| 20 min  | [Demo / Guided Practice](#demo)  | Range, Variance and Standard Deviation  |
| 20 min  | [Independent Practice](#ind-practice)  |   |
| 5 min  | [Conclusion](#conclusion)  |   |

---

Note: We use inferential statistics to judge whether an observed difference between groups in a study is dependable or might have happened by chance. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data. Classic **descriptive statistics** include **mean, min, max, standard deviation, median, skew, kurtosis**. An inferential statistic is a function of the sample data that helps you draw an inference regarding a hypothesis about a population parameter.

<a name="Mean, median, and mode"></a>
## Guided Practice: Mean, median, and mode (20 mins)

## Mean

_The mean is the sum of the numbers divided by the length of the list._

**Check:** Find the **mean** of this list using Python:


In [1]:
import pandas as pd

samples = [1,2,3,4,5]
float(sum(samples))/len(samples)

# Alternatively, with pandas:
# df = pd.DataFrame(samples)
# df.describe()[0]['mean']

3.0

---

## Median

We will be reviewing and coding the following cases for calculating the median of a distribution of numbers.

_For **odd-length** lists: the median is the middle number of the ordered list._

_For **even-length** lists: the median is the average of the two middle numbers of the ordered list._


### Calculate the Median

Here are two samples we will work with. One has an odd number of values, while the other has an even number. From here on, we will work with these two lists.

In [2]:
n_odd = [1,5,9,2,8,3,10,15,7]
n_even = [8,2,3,1,0,-1,-5,20]

### Median  STEP 1:  Order the numbers
First, we must sort our lists in **ascending** order, making sure to store the sorted results so that the later steps use the ordered lists.

In [3]:
# How do we order the numbers?  Google: "python sort a list"

sorted_n_odd = sorted(n_odd)
sorted_n_even = sorted(n_even)
print sorted_n_odd
print sorted_n_even

[1, 2, 3, 5, 7, 8, 9, 10, 15]
[-5, -1, 0, 1, 2, 3, 8, 20]


### Median STEP 2a: Find the middle (odd numbers)

First, we will find the middle number of our sorted list `sorted_n_odd`.

To find the **median** of an odd-length sequence of numbers, we will need to perform the following operations:

1. **Divide the length of our list in half.** <br>_Set to a variable._  (expected: 4.5)
1. **Subtract .5 from the divided length.**<br>_Subtract .5 from the variable to get a whole-number index (expected: 4)._
1. **Select the item of the sorted list at the offset given by the halved length - .5.** <br>_Use the variable, converted to an integer, to select the index._

Integer division of the length by 2 (9 divided by 2 gives 4) performs steps 1 and 2 in a single operation, which is the shortcut the solution below takes.

In [4]:
# Try it out.  Find the median 

half_length = len(sorted_n_odd) // 2  # floor division keeps the index an integer
sorted_n_odd[half_length]

7

### Median STEP 2b: Find the middle (even numbers)
    
The process for finding the median of an even-length sequence of numbers is a little more straightforward.

1. Divide the length of our list in half.<br><i>Set a variable called **sample_length** to store this.</i>
1. Add the item at index **sample_length** - 1 of the sorted even list (`sorted_n_even`) to the item at index **sample_length**, then divide the sum by 2.<br><i>Remember, the two middle values sit at offsets (**sample_length** - 1) and (**sample_length**) of the sorted list.</i>

In [5]:
# Seek the median using Python!  You can do it!

sample_length = len(sorted_n_even) // 2  # floor division keeps the index an integer
(sorted_n_even[sample_length - 1] + sorted_n_even[sample_length]) / 2.0

1.5
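
Steps 1, 2a, and 2b can be combined into a single helper. The following is a minimal sketch: the function name `median` is our own choice for illustration and is not part of the lesson's code. It uses floor division (`//`) so that the index is an integer under both Python 2 and Python 3.

```python
def median(numbers):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(numbers)       # Step 1: order the numbers
    middle = len(ordered) // 2      # floor division gives an integer index
    if len(ordered) % 2 == 1:
        # Step 2a: odd length, take the single middle item
        return ordered[middle]
    # Step 2b: even length, average the two middle items
    return (ordered[middle - 1] + ordered[middle]) / 2.0


n_odd = [1, 5, 9, 2, 8, 3, 10, 15, 7]
n_even = [8, 2, 3, 1, 0, -1, -5, 20]

print(median(n_odd))   # 7
print(median(n_even))  # 1.5
```

Because the helper sorts a copy of its input, the original lists are left unchanged.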

---

## Mode

**The mode is the most frequently occurring number.**

Finding the mode is not as trivial as the mean or median, so here it is calculated using scipy.stats.mode().  With a partner, take some time to review the docs on scipy.stats.mode(), and implement the solution to find the mode of the assigned variable `sample` below.  Google is a good place to start with this activity.

_Notice the output of the initial state of the code cell below.  This is the output of the correct solution, however it will disappear after you run the cell for the first time._

Note: doing this without scipy.stats.mode() is a challenge problem in the independent practice section.

In [6]:
from scipy import stats

sample = [0,1,1,2,2,2,2,3,3,4,4,4,5]

stats.mode(sample)

ModeResult(mode=array([2]), count=array([4]))

> Additional information here:
> [Descriptive Stats:  University of South Alabama](http://www.southalabama.edu/coe/bset/johnson/lectures/lec15.htm)


### Once Again, with Scipy

Thankfully, we have libraries that are good, fast, and cheap, and that can help us do the heavy lifting.  In most cases, it's best to use the conveniences of Scipy and Numpy since they are well tested, high quality, and performant.

With Scipy and Numpy, find the mean, median, and mode of the distribution `sample` below.  You will need to properly import these modules and reference the appropriate methods to complete this task.  We want you to get comfortable looking up these types of operations in the relevant documentation.

In [7]:
# Import python modules here
import numpy as np
import scipy as sp
import scipy.stats  # ensures the sp.stats submodule is loaded

sample = [3, 75, 98, 2, 10, 3, 14, 99, 44, 25, 31, 100, 356, 4, 23, 55, 327, 64, 6, 20]

# Your solution here
print 'Mean:', np.mean(sample)
print 'Median:', np.median(sample)
print 'Mode:', sp.stats.mode(sample)[0][0]
print 'Mode count:', sp.stats.mode(sample)[1][0]

Mean: 67.95
Median: 28.0
Mode: 3
Mode count: 2


**Check:** Explain the output of the mode() function.

<a name="Skewness"></a>
## Guided Practice: Skewness (20 mins)

**Skewness** is lack of symmetry in a distribution of data.

[Technical note: we will be talking about skewness here in the context of _unimodal_ distributions.]

![](../assets/images/skewness.png)

A **positive-skewed** distribution means the right side tail of the distribution is longer or fatter than the left.

Likewise a **negative-skewed** distribution means the left side tail is longer or fatter than the right.

Symmetric distributions have no skewness!

---

## Skewness and measures of central tendency

The mean, median, and mode are affected by skewness.

When a distribution is **symmetrical**, the mean, median, and mode are the same number.

When a distribution is negatively skewed, the mean is less than the median, which is less than the mode.

**Negative skew: mean < median < mode**

When a distribution is positively skewed, the mean is greater than the median, which is greater than the mode!

**Positive skew: mode < median < mean**

This rule of thumb can help you, especially if you can't see a line graph of the data. All you need are the mean and the median. Nice! It usually holds for unimodal data, although it is not guaranteed for every data set.

1. If the mean < median, the data are skewed left.
2. If the mean > median, the data are skewed right.
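
The two rules above can be sketched as a small function. This is an illustrative sketch only: the name `skew_direction` and its return labels are hypothetical, and the comparison is a heuristic rather than a formal skewness statistic.

```python
def skew_direction(numbers):
    """Guess the skew direction by comparing the mean with the median."""
    ordered = sorted(numbers)
    count = len(ordered)
    mean = float(sum(ordered)) / count
    middle = count // 2
    if count % 2 == 1:
        median = ordered[middle]
    else:
        median = (ordered[middle - 1] + ordered[middle]) / 2.0

    if mean < median:
        return 'left'    # negative skew
    if mean > median:
        return 'right'   # positive skew
    return 'none'        # mean equals median, consistent with symmetry


print(skew_direction([1, 2, 3, 4, 100]))   # right
print(skew_direction([-100, 1, 2, 3, 4]))  # left
print(skew_direction([1, 2, 3, 4, 5]))     # none
```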

---

## Practice (~5 Mins)

Using our understanding of skewness, does the list of numbers form a symmetric distribution? Is it skewed left or right?

Look at it again, and let's explain how it looks.


In [8]:
# This is our sample
sample = [3, 75, 98, 2, 10, 3, 14, 99, 44, 25, 31, 100, 356, 4, 23, 55, 327, 64, 6, 20]

In [9]:
# In this space we may code a quick example of plotting this distribution

print 'Mean:', np.mean(sample)
print 'Median:', np.median(sample)
print 'Mode:', sp.stats.mode(sample)[0][0]
print 'Mode count:', sp.stats.mode(sample)[1][0]

# Skewed right/positively

Mean: 67.95
Median: 28.0
Mode: 3
Mode count: 2


<a name="Range, Variance and Standard Deviation"></a>
## Guided / Independent Practice: Range, Variance and Standard Deviation (20 mins)

Measures of variability like the **range**, **variance**, and **standard deviation** tell you about the spread of your data.

These measurements give complementary (and no less important!) information to the measures of central tendency (mean, median, mode).

---



## Range

The **range** is the difference between the lowest and highest values of a distribution.  Since this metric only depends on two observations, it's a very simple calculation.  Range shows how much the numbers in a set vary.


To calculate the range:

1. Sort the `sample` object in `ascending` order.
1. Subtract the first item from the last item.
1. Compare your result with Numpy's `ptp()` method that does the same thing.


---

In [10]:
sample = [3, 75, 98, 2, 10, 3, 14, 99, 44, 25, 31, 100, 356, 4, 23, 55, 327, 64, 6, 20]

# With numpy:
n_range = np.ptp(sample)
print n_range

# Now implement this in pure Python

sorted_sample = sorted(sample)
print sorted_sample[-1] - sorted_sample[0]

354
354


## Variance

The **variance** is a numeric value used to describe how widely the numbers in a distribution vary.

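In Python, variance can be calculated with:
```python
import numpy as np

n = sample  # the list of numbers defined in the cell above
n_mean = np.mean(n)

# Collect the squared distance of each number from the mean
squared_distances = []
for n_ in n:
    squared_distances.append((n_ - n_mean) ** 2)

# Average the squared distances (divide by the number of items)
variance = np.sum(squared_distances) / len(n)
```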

This is **the average of the squared distances of each number from the mean of the numbers.**

![](../assets/images/dist_with_variance.png)

**Check:** What could a distribution with a large variance look like? A small?

**Check:** What does a variance of 0 mean?

Using numpy the variance is simply:
```python
variance = np.var(n)

print(variance)
# 9414.6475
```

---



## Standard deviation

The **standard deviation** is the square root of the variance.

Because the variance is the average of the distances from the mean _squared_, the standard deviation tells us approximately, on average, the distance of numbers in a distribution from the mean.

The standard deviation can be calculated with:
```python
std = np.std(n)

print(std)
# 97.029106457804716
```
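
To see how the two measures relate, here is a minimal sketch that computes both by hand using only the standard library. It assumes the same `sample` list used throughout this lesson and divides by the number of items, which matches the default behavior of `np.var` and `np.std`.

```python
import math

sample = [3, 75, 98, 2, 10, 3, 14, 99, 44, 25, 31, 100,
          356, 4, 23, 55, 327, 64, 6, 20]

mean = float(sum(sample)) / len(sample)

# Variance: average of the squared distances from the mean
variance = sum((x - mean) ** 2 for x in sample) / len(sample)

# Standard deviation: square root of the variance
std = math.sqrt(variance)

print(round(variance, 4))  # 9414.6475
print(round(std, 4))       # 97.0291
```

Taking the square root brings the result back to the same units as the original data, which is why the standard deviation is easier to interpret than the variance.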

![](../assets/images/dist_with_var_std.png)

**Optional Check (time permitting)**  Is this the same as the average of the absolute deviations from the mean? If not, what is the difference between the measures?
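
As a starting point for this discussion, the sketch below computes both measures for the lesson's `sample` list. The variable name `mean_abs_dev` is our own. Squaring weights large deviations more heavily than taking absolute values does, so the standard deviation is never smaller than the average absolute deviation.

```python
import math

sample = [3, 75, 98, 2, 10, 3, 14, 99, 44, 25, 31, 100,
          356, 4, 23, 55, 327, 64, 6, 20]

mean = float(sum(sample)) / len(sample)

# Average absolute deviation: mean of |x - mean|
mean_abs_dev = sum(abs(x - mean) for x in sample) / len(sample)

# Standard deviation: square root of the mean squared deviation
std = math.sqrt(sum((x - mean) ** 2 for x in sample) / len(sample))

print(mean_abs_dev < std)  # True
```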



<a name="ind-practice"></a>
## Independent Practice (20 minutes)
- With the provided data, determine the mean, median, and mode.
- Is the data skewed left or right? How do you know?
- Find the range, variance and standard deviation of your data set. What does the standard deviation tell you about the distribution?
- Challenge: calculate the mode without using scipy!

In [11]:
sample = [3, 75, 98, 2, 10, 3, 14, 99, 44, 25, 31, 100, 356, 4, 23, 55, 327, 64, 6, 20]
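
One possible approach to the challenge, sketched with `collections.Counter` from the standard library. The function name `mode_with_count` is hypothetical. When several values share the highest count, this sketch returns the smallest one, which mirrors the tie-breaking documented for `scipy.stats.mode`.

```python
from collections import Counter


def mode_with_count(numbers):
    """Return (mode, count); ties go to the smallest value."""
    counts = Counter(numbers)
    highest = max(counts.values())
    # Several values can share the highest count, so pick the smallest
    mode = min(value for value, count in counts.items() if count == highest)
    return mode, highest


sample = [3, 75, 98, 2, 10, 3, 14, 99, 44, 25, 31, 100,
          356, 4, 23, 55, 327, 64, 6, 20]

print(mode_with_count(sample))  # (3, 2)
```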

<a name="conclusion"></a>
## Conclusion (5 mins)

In [17]:
# NumPy provides mean, median, ptp, var, and std; SciPy provides stats.mode
print 'Mean:', np.mean(sample)
print 'Median:', np.median(sample)
print 'Mode:', sp.stats.mode(sample)[0][0]
print 'Mode count:', sp.stats.mode(sample)[1][0]

# Because the mean is greater than the median, the data is skewed right/positively

print 'Range:', np.ptp(sample)
print 'Variance:', np.var(sample)
print 'Standard deviation:', np.std(sample)

Mean: 67.95
Median: 28.0
Mode: 3
Mode count: 2
Range: 354
Variance: 9414.6475
Standard deviation: 97.0291064578
