# Demo: Let's Make Some Data!

In [1]:
# range from 20 to 59
ages = range(20, 60)
ages

range(20, 60)

In [2]:
import random
random_ages = []
for num in range(100):
    random_ages.append(random.choice(ages))

In [5]:
print(random_ages) # using print() eschews pretty-printing

[54, 53, 53, 58, 46, 22, 37, 24, 45, 50, 44, 27, 29, 28, 35, 47, 50, 51, 57, 41, 36, 27, 39, 32, 26, 37, 24, 37, 50, 22, 56, 34, 44, 20, 50, 30, 30, 36, 54, 54, 47, 40, 40, 52, 29, 52, 53, 47, 49, 59, 52, 29, 48, 42, 41, 57, 29, 41, 43, 55, 39, 42, 50, 35, 37, 57, 48, 36, 49, 54, 33, 38, 33, 44, 38, 30, 54, 58, 32, 50, 30, 26, 53, 33, 32, 57, 42, 51, 57, 28, 53, 53, 23, 37, 23, 33, 37, 58, 40, 27]


## Question: What is our oldest age? Youngest age? Average age?

In [3]:
max(random_ages)

59

In [4]:
min(random_ages)

20

# Range: How Wide is the Dispersion of the Data?

In [5]:
def my_range(x):
    '''Python would let us call this function range,
       but if we did that, we would lose access to
       the builtin function range'''
    # pretty simple really..
    return max(x) - min(x)

In [6]:
my_range(random_ages)

39

In [7]:
nums = [10, 10, 100, 100]

In [8]:
my_range(nums)

90

In [9]:
nums = [10, 50, 50, 50, 50, 100]

In [10]:
my_range(nums)

90

## Question: How well does range explain the dataset? How useful is it?

In [11]:
# NumPy's version of our range function is called ptp
import numpy as np
np.ptp(random_ages) # "peak to peak"

39

# Mean: Average Value
* Mean, unlike range, is affected by all the values in the set.
* A way to measure the center of a dataset.

In [12]:
def mean(x):
    return sum(x) / len(x)

In [13]:
mean(random_ages)

39.77

In [14]:
np.mean(random_ages)

39.77

## Question: What do outliers do to the average?
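
One way to explore this question is to add a single extreme value to a small list and watch the mean move. The ages below are hypothetical, chosen only for illustration, and `500` stands in for something like a data-entry error.

```python
def mean(x):
    return sum(x) / len(x)

# Hypothetical ages, not taken from random_ages
typical_ages = [30, 35, 40, 45, 50]
with_outlier = typical_ages + [500]  # one extreme value

print(mean(typical_ages))  # 40.0
print(mean(with_outlier))  # about 116.67
```

Because every value feeds into the sum, one bad value is enough to pull the mean far away from where most of the data sits.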

# Median: Mid-Point of Values
* 50% of values fall above the median; 50% fall below it
* Another way to measure the center of a dataset.

In [15]:
def median(x):
    n = len(x)
    sorted_x = sorted(x)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_x[mid - 1] + sorted_x[mid]) / 2
    else:
        return (sorted_x[mid])

In [16]:
median(random_ages)

40.0

In [17]:
np.median(random_ages)

40.0

## Question: What do outliers do to the median?
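
A quick experiment can help here: take a small list, add one extreme value, and compare the medians. The ages are hypothetical, and this sketch uses the standard library's `statistics.median` rather than the `median` function defined earlier.

```python
import statistics

# Hypothetical ages, not taken from random_ages
typical_ages = [30, 35, 40, 45, 50]
with_outlier = typical_ages + [500]  # one extreme value

print(statistics.median(typical_ages))  # 40
print(statistics.median(with_outlier))  # 42.5
```

The median depends only on the middle of the sorted list, so it would be the same whether the extreme value were 500 or 5,000.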

## Percentile: How Much Data Falls Below?
* The median is the 50th percentile
* You can pick an arbitrary percentile

In [18]:
np.percentile(random_ages, 50)

40.0

In [19]:
np.percentile(random_ages, 75)

49.0

In [20]:
np.percentile(random_ages, 25)

30.0

## Interquartile Range (IQR)
* $ IQR = Q_3 - Q_1 $
* 75th percentile minus 25th percentile: the width of the "middle" half of the dataset
* A quick way to measure spread of the data and ignore the outliers

In [21]:
from scipy import stats
stats.iqr(random_ages)

19.0
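
To see what the IQR adds over the plain range, here is a minimal sketch that computes it by hand for the two small lists used with `my_range` earlier. The helper name `iqr` is made up for this example, and it relies on `statistics.quantiles` with `method="inclusive"`, which should correspond to NumPy's default linear interpolation.

```python
import statistics

def iqr(x):
    """Q3 - Q1, using inclusive quartiles (linear interpolation)."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return q3 - q1

clustered = [10, 50, 50, 50, 50, 100]
split = [10, 10, 100, 100]

# Same range, very different IQR
print(max(clustered) - min(clustered), iqr(clustered))  # 90 0.0
print(max(split) - min(split), iqr(split))              # 90 90.0
```

Both lists have a range of 90, but the IQR shows that the middle half of the first list does not spread at all.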

## Mode
* The most common (frequently occurring) value in a set 
* Another way of measuring center

In [22]:
stats.mode(random_ages)

ModeResult(mode=array([36]), count=array([5]))
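
`stats.mode` reports a single value, but a dataset can have several values tied for most common. A small sketch with `collections.Counter` can list all of them; the function name `modes` and the sample lists are invented for this example.

```python
from collections import Counter

def modes(x):
    """Return every value tied for the highest count, sorted."""
    counts = Counter(x)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

print(modes([22, 36, 36, 41, 50]))      # [36]
print(modes([22, 36, 36, 41, 41, 50]))  # [36, 41]
```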

## Consider the spread of data in two hypothetical datasets

<img src="images/skew-2.png" width=400 height=400>

* How can we identify/quantify different spreads?
* Normal distribution or not? 
* Focus on the <span style="color:blue;font-weight:bold;">mean</span>, <span style="color:green;font-weight:bold;">median</span>, and <span style="color:red;font-weight:bold;">mode</span>

## Variance: How much spread is there in the dataset?
* $Var(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2$
* Why is it squared?
* What are the units of variance, assuming our dataset from above?

In [23]:
np.var(random_ages)

130.0371
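
The formula translates almost word for word into Python. This sketch divides by `n`, as the formula does and as `np.var` does by default; the small list is the one used with `my_range` earlier.

```python
def variance(x):
    """Population variance: mean of squared deviations from the mean."""
    x_bar = sum(x) / len(x)
    return sum((x_i - x_bar) ** 2 for x_i in x) / len(x)

nums = [10, 10, 100, 100]
# The mean is 55, so every value is 45 away, and 45 squared is 2025
print(variance(nums))  # 2025.0
```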

## Standard Deviation
* $\sqrt {Var(X)}$
* Puts the units back into something we are more familiar with
* The "standard" (typical) deviation from the mean
* Another measure of dispersion

In [27]:
np.std(random_ages)

11.13901252355881

<img style="height: 350px;" src="images/ss-01.png">

## Question: How do we use this?

## In a normal distribution...
* About 68% of the data will fall within 1 standard deviation of the mean
* About 95% of the data will fall within 2 standard deviations of the mean
* About 99.7% of the data will fall within 3 standard deviations of the mean
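
These percentages can be checked against simulated data. The sketch below draws values with `random.gauss` and counts how many land within 1, 2, and 3 standard deviations of the mean. The sample size, the seed, and the helper name `fraction_within` are arbitrary choices for this example.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable
sample = [random.gauss(0, 1) for _ in range(50_000)]

mu = statistics.fmean(sample)
sigma = statistics.pstdev(sample)

def fraction_within(data, center, spread, k):
    """Share of values within k spreads of the center."""
    inside = sum(1 for v in data if abs(v - center) <= k * spread)
    return inside / len(data)

for k in (1, 2, 3):
    print(k, round(fraction_within(sample, mu, sigma, k), 3))
```

The printed fractions should land close to the rule of thumb. Running the same check on `random_ages` is a useful contrast, since those ages were drawn uniformly rather than from a normal distribution.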

## Skewness
* If we assume our sample follows a normal distribution and generalize from it to the population at large, our conclusions will be wrong when the sample is actually skewed
* e.g., polling only people who have home phones (landlines). What's wrong with that?

<img style="height: 200px;" src="images/skew-1.png">

In [28]:
stats.skew(random_ages)

-0.025550022737476627

In [32]:
data = [1, 2, 3, 75, 75, 75, 77, 78, 78, 91, 78, 81, 93, 94, 95, 96, 105, 106]

In [34]:
stats.skew(data)

-1.4291927105469837

In [33]:
np.mean(data)

72.38888888888889
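
A negative skew means a long tail on the low side, which drags the mean below the median. The sketch below checks that for `data` and computes skewness by hand as the third central moment divided by the second central moment raised to the power 1.5. That is my understanding of what `stats.skew` computes by default, so treat the match as an assumption to verify.

```python
import statistics

def skewness(x):
    """Third central moment divided by the 1.5 power of the second."""
    n = len(x)
    x_bar = sum(x) / n
    m2 = sum((v - x_bar) ** 2 for v in x) / n
    m3 = sum((v - x_bar) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

data = [1, 2, 3, 75, 75, 75, 77, 78, 78, 91, 78, 81, 93, 94, 95, 96, 105, 106]

# The three small values pull the mean down, but not the median
print(statistics.mean(data) < statistics.median(data))  # True
print(skewness(data))
```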