### Descriptive Statistics

We will start by looking at some methods for summarizing large datasets. Examples of such datasets include:

(a) Scores in a test

(b) Height of a group of athletes

(c) Time it takes to go from Point A to Point B

Say a lecturer has to assign grades A to D based on test scores out of 50, giving roughly 10% of the class an A, 30% a B, 40% a C, and the remaining 20% a D. How would the lecturer do this?

First, let us import the libraries that we need.

In [2]:
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

Let us now examine the data.

In [15]:
test_scores = np.asarray([19, 29, 19, 19, 37, 42, 21, 34, 12, 33, 46, 21, 27, 23, 24, 32, 49, 23, 15, 28, 17, 26, 26, 29, 28,
 21, 34, 26, 37, 21, 35, 27, 22, 25, 27, 34, 45, 42, 15, 11, 21, 36, 14, 37, 14])

print("Number of students:",len(test_scores))
print("Minimum score: {} Maximum score: {}".format(np.min(test_scores), np.amax(test_scores)))

plt.figure()
plt.plot(test_scores,".")
plt.xlabel("Student ID")
plt.ylabel("Score")
plt.grid()
plt.show()

Number of students: 45
Minimum score: 11 Maximum score: 49

[Figure: scatter plot of Score against Student ID]

Several students scored more than 40, and no one scored below 10. Seems like a good class!
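Coming back to the grading question posed earlier, one possible approach (an assumption here, not necessarily what the lecturer would do) is to place the grade boundaries at percentiles of the score distribution. The sketch below puts the cutoffs at the 20th, 60th and 90th percentiles, so that roughly the top 10% get an A, the next 30% a B, the next 40% a C, and the rest a D. The names `cutoffs`, `grades` and `grade_counts` are illustrative.

```python
from collections import Counter

import numpy as np

# Same scores as in the cell above, repeated so the snippet runs on its own.
test_scores = np.asarray([19, 29, 19, 19, 37, 42, 21, 34, 12, 33, 46, 21, 27, 23, 24,
                          32, 49, 23, 15, 28, 17, 26, 26, 29, 28, 21, 34, 26, 37, 21,
                          35, 27, 22, 25, 27, 34, 45, 42, 15, 11, 21, 36, 14, 37, 14])

# Assumed grade boundaries: D below the 20th percentile, C up to the 60th,
# B up to the 90th, and A at or above the 90th percentile.
cutoffs = np.percentile(test_scores, [20, 60, 90])

# np.digitize returns 0..3: the number of cutoffs each score reaches or exceeds.
grade_index = np.digitize(test_scores, cutoffs)
grades = np.array(["D", "C", "B", "A"])[grade_index]

grade_counts = Counter(grades.tolist())
for grade in "ABCD":
    share = 100.0 * grade_counts[grade] / len(test_scores)
    print("Grade {}: {} students ({:0.0f}%)".format(grade, grade_counts[grade], share))
```

Because several students share the same score, the resulting shares only approximate the 10/30/40/20 targets.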

### Central Tendency

You want to know how the class performed overall so that you can compare it with other classes in your organization.

What metric could be used to summarize the given test scores?

Three metrics are commonly used to summarize data:

1. Mean: Simple arithmetic average of all data values

2. Median: Value corresponding to the 50th percentile

3. Mode: Value that occurs most often

Which is most appropriate? That will depend on the specific application. 
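As a quick illustration, all three metrics can be computed for the test scores. This is a minimal sketch: the mode is found with `collections.Counter`, which is just one of several ways to do it, and if several values tie for most frequent it reports only one of them.

```python
from collections import Counter

import numpy as np

# Same scores as in the cell above, repeated so the snippet runs on its own.
test_scores = np.asarray([19, 29, 19, 19, 37, 42, 21, 34, 12, 33, 46, 21, 27, 23, 24,
                          32, 49, 23, 15, 28, 17, 26, 26, 29, 28, 21, 34, 26, 37, 21,
                          35, 27, 22, 25, 27, 34, 45, 42, 15, 11, 21, 36, 14, 37, 14])

mean_score = np.mean(test_scores)      # arithmetic average
median_score = np.median(test_scores)  # 50th percentile

# most_common(1) returns a list with one (value, count) pair.
mode_score, mode_count = Counter(test_scores.tolist()).most_common(1)[0]

print("Mean: {0:0.2f}".format(mean_score))
print("Median: {0:0.1f}".format(median_score))
print("Mode: {} (occurs {} times)".format(mode_score, mode_count))
```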

In this case we focus only on the mean.

In [16]:
print("Test Mean: {0:0.2f} of class".format(np.mean(test_scores)))

Test Mean: 27.18 of class


Next, let us look at how the data is distributed using a histogram.

In [17]:
plt.figure()
plt.hist(test_scores,color='blue', bins = 10)
plt.show()

[Figure: histogram of the test scores with 10 bins]

### Variability

Another useful way to summarize data is to quantify the "spread" of the distribution. We already know the minimum and maximum scores (11 and 49). These scores seem quite far from the mean. But how do we answer specific questions like:

a. Is a score of 41 an outlier? 

b. Is a score of 32 typical? 

We need a measure of how far scores typically deviate from the mean. Averaging the signed deviations does not work, because positive and negative deviations from the mean always cancel and sum to zero. Instead, we average the squared deviations. This quantity is referred to as the "variance" of the distribution. If $\{x_1,x_2,\cdots,x_N\}$ is the dataset and $\bar{X}$ is its mean, the variance is
\begin{equation*}
\sigma^2 = \frac{1}{N}\sum_{i=1}^N{\left(x_i-\bar{X}\right)^2}.
\end{equation*}
Notice that we use $\sigma^2$ to denote the variance. This makes sense because its units are the square of the units of the data. The square root of the variance, $\sigma$, is defined as the standard deviation. We can regard values in the range $\bar{X}\pm \sigma$ as typical.
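To make the formula concrete, the sketch below evaluates it step by step and compares the result with `np.var`, which by default (`ddof=0`) also divides by $N$. It also checks that the signed deviations sum to zero, up to floating-point rounding. The variable names are illustrative.

```python
import numpy as np

# Same scores as in the cell above, repeated so the snippet runs on its own.
test_scores = np.asarray([19, 29, 19, 19, 37, 42, 21, 34, 12, 33, 46, 21, 27, 23, 24,
                          32, 49, 23, 15, 28, 17, 26, 26, 29, 28, 21, 34, 26, 37, 21,
                          35, 27, 22, 25, 27, 34, 45, 42, 15, 11, 21, 36, 14, 37, 14])

x_bar = np.mean(test_scores)
deviations = test_scores - x_bar

# Signed deviations cancel: their sum is zero up to rounding error.
signed_sum = np.sum(deviations)

# Variance as defined above: the mean of the squared deviations.
variance_manual = np.sum(deviations ** 2) / len(test_scores)
variance_numpy = np.var(test_scores)  # ddof=0 by default, i.e. divides by N

print("Sum of signed deviations: {0:0.6f}".format(signed_sum))
print("Variance (formula): {0:0.2f}".format(variance_manual))
print("Variance (np.var):  {0:0.2f}".format(variance_numpy))
```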

Let us evaluate these quantities for the dataset we have been looking at.


In [18]:
# Variance, standard deviation and mean of the test scores
sigma_squared = np.var(test_scores)
sigma = np.sqrt(sigma_squared)
average = np.mean(test_scores)

print("Variance and standard deviation: {0: 0.2f}, {1: 0.2f}".format(sigma_squared, sigma) )
print("Typical test score: ({0: 0.2f}, {1: 0.2f})".format(average - sigma, average + sigma))
print("")

Variance and standard deviation:  87.39,  9.35
Typical test score: ( 17.83,  36.53)
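With these numbers, the two questions asked earlier can be approached by measuring how many standard deviations each score lies from the mean (often called a z-score). This is a sketch under stated assumptions: treating scores within one standard deviation as typical follows the text above, while flagging scores more than two standard deviations away as possible outliers is a common rule of thumb assumed here, not a definition from this section.

```python
import numpy as np

# Same scores as in the cell above, repeated so the snippet runs on its own.
test_scores = np.asarray([19, 29, 19, 19, 37, 42, 21, 34, 12, 33, 46, 21, 27, 23, 24,
                          32, 49, 23, 15, 28, 17, 26, 26, 29, 28, 21, 34, 26, 37, 21,
                          35, 27, 22, 25, 27, 34, 45, 42, 15, 11, 21, 36, 14, 37, 14])

average = np.mean(test_scores)
sigma = np.std(test_scores)  # square root of np.var(test_scores)

z_scores = {}
for score in (32, 41):
    # Distance from the mean, measured in standard deviations.
    z = (score - average) / sigma
    z_scores[score] = z
    if abs(z) <= 1:
        verdict = "typical (within one standard deviation)"
    elif abs(z) <= 2:
        verdict = "above or below typical, but not extreme"
    else:
        verdict = "possible outlier (assumed 2-sigma rule of thumb)"
    print("Score {}: z = {:0.2f}, {}".format(score, z, verdict))
```

Under these assumptions, a score of 32 falls inside the typical range, while 41 lies outside it but within two standard deviations of the mean.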

