<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Describing Sales Data with `numpy`

---

<h1>Lab Guide<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Describing-Sales-Data-with-numpy" data-toc-modified-id="Describing-Sales-Data-with-numpy-1">Describing Sales Data with <code>numpy</code></a></span></li><li><span><a href="#Descriptive-Statistics" data-toc-modified-id="Descriptive-Statistics-2">Descriptive Statistics</a></span></li><li><span><a href="#Loading-CSV-files-with-python" data-toc-modified-id="Loading-CSV-files-with-python-3">Loading CSV files with python</a></span><ul class="toc-item"><li><span><a href="#1.-Loading-the-data" data-toc-modified-id="1.-Loading-the-data-3.1">1. Loading the data</a></span></li><li><span><a href="#2.-Separate-header-and-data" data-toc-modified-id="2.-Separate-header-and-data-3.2">2. Separate header and data</a></span></li><li><span><a href="#3.-Create-a-dictionary-with-the-data" data-toc-modified-id="3.-Create-a-dictionary-with-the-data-3.3">3. Create a dictionary with the data</a></span></li><li><span><a href="#4.-Convert-data-from-string-to-float" data-toc-modified-id="4.-Convert-data-from-string-to-float-3.4">4. 
Convert data from string to float</a></span></li></ul></li><li><span><a href="#Numpy-for-descriptive-statistics" data-toc-modified-id="Numpy-for-descriptive-statistics-4">Numpy for descriptive statistics</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#The-mean" data-toc-modified-id="The-mean-4.0.1">The mean</a></span></li><li><span><a href="#The-median" data-toc-modified-id="The-median-4.0.2">The median</a></span></li><li><span><a href="#The-variance" data-toc-modified-id="The-variance-4.0.3">The variance</a></span></li><li><span><a href="#Standard-deviation" data-toc-modified-id="Standard-deviation-4.0.4">Standard deviation</a></span></li><li><span><a href="#The-minimum" data-toc-modified-id="The-minimum-4.0.5">The minimum</a></span></li><li><span><a href="#The-maximum" data-toc-modified-id="The-maximum-4.0.6">The maximum</a></span></li><li><span><a href="#The-range" data-toc-modified-id="The-range-4.0.7">The range</a></span></li><li><span><a href="#The-skew" data-toc-modified-id="The-skew-4.0.8">The skew</a></span></li></ul></li><li><span><a href="#5.-Write-a-function-to-print-summary-statistics" data-toc-modified-id="5.-Write-a-function-to-print-summary-statistics-4.1">5. Write a function to print summary statistics</a></span></li><li><span><a href="#More-on-Skewness" data-toc-modified-id="More-on-Skewness-4.2">More on Skewness</a></span></li><li><span><a href="#6.-[Bonus]-Plot-the-distributions" data-toc-modified-id="6.-[Bonus]-Plot-the-distributions-4.3">6. [Bonus] Plot the distributions</a></span></li></ul></li><li><span><a href="#Sample-variance" data-toc-modified-id="Sample-variance-5">Sample variance</a></span></li><li><span><a href="#Some-useful-notation" data-toc-modified-id="Some-useful-notation-6">Some useful notation</a></span></li></ul></div>

## Descriptive Statistics

---

There are two main fields of statistics: **descriptive** and **inferential**.

- We use **descriptive** statistics to **summarize and describe** the data we have actually observed. For example, what is the mean height of the people in a group, what is the largest/smallest observed height, etc. 
- When we start covering **modeling and hypothesis testing**, our focus will shift to **inferential statistics**, where we use **samples of data** to **make judgments** about the wider world: for example, how strongly is the height of a person related to their weight, gender, etc.

Right now, we're going to focus on descriptive statistics: **describing, summarizing, and understanding data**.

Run the cell below to load the required packages and set up plotting in the notebook!

In [1]:
import numpy as np
import csv
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(font_scale=1.5)

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Loading CSV files with python

---

For this lab you will be using a truncated version of some sales data.

The csv has about 200 rows of data and four columns. The relative path to the file (`sales.csv`) is provided below.


Let's take a look at the `csv` module we imported. The csv module’s reader and writer objects read and write sequences. The following python code demonstrates a process for loading data from a CSV file and (in this case) appending each row to a list.


```python
import csv

# my_csv_path is a placeholder for the path to your csv file
print('Opening File. Data: ')
with open(my_csv_path, 'r') as f:
    reader = csv.reader(f)
    rows = list(reader)  # each row becomes a list of strings
```

The first argument to `open` is the string path to the file. The second argument specifies the "mode" for the open file object:
- `'r'` - Read (Default)
- `'w'` - Write; creates the file, or overwrites it if it already exists.
- `'a'` - Append; new data is added to the end of the file and the existing contents are left unchanged.
- `'b'` - Binary (combined with another mode when working with a binary file, e.g. `'rb'` for reading a binary file)
- `'U'` - Universal Newline mode (legacy). Because `'\r'`, `'\n'`, and `'\r\n'` can all indicate a newline depending on the operating system the file was written on, this mode reads each of them as Python's `'\n'`. In Python 3 this is already the default behavior in text mode; the `'U'` flag is deprecated and was removed in Python 3.11.

See more details for [open](https://docs.python.org/3/tutorial/inputoutput.html) and the [csv library](https://docs.python.org/3.7/library/csv.html).
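
To make the modes concrete, here is a minimal sketch that writes, appends to, and then reads back a small csv file in a temporary directory. The file name `example.csv` and the rows are made up for illustration and have nothing to do with the lab's sales data.

```python
import csv
import os
import tempfile

# Hypothetical example rows; the real lab file has different columns and values.
example_rows = [["item", "price"], ["apple", "1.25"], ["pear", "0.80"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.csv")

    # 'w' creates (or truncates) the file; newline='' lets the csv module
    # handle line endings itself, as the csv documentation recommends.
    with open(example_path, "w", newline="") as f:
        csv.writer(f).writerows(example_rows)

    # 'a' adds one more row at the end without touching the existing rows.
    with open(example_path, "a", newline="") as f:
        csv.writer(f).writerow(["plum", "0.50"])

    # 'r' reads the file back; every value comes back as a string.
    with open(example_path, "r", newline="") as f:
        loaded_rows = list(csv.reader(f))

print(loaded_rows)
```

Note that the numbers come back as strings such as `'1.25'`, which is why a conversion step is needed before doing any arithmetic.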

In [2]:
sales_csv_path = '../../../../../resource-datasets/sales_data_simple/sales.csv'

### 1. Loading the data

Set up an empty list called ```rows```.

Using the pattern for loading csvs we learned earlier, add all of the rows in the csv file to the rows list.

For your reference, the pattern is:
```python
with open(my_csv_path, 'r') as f:
    reader = csv.reader(f)
    ...
```

Beyond this, adding the rows in the csv file to the ```rows``` variable is up to you.



In [3]:
with open(sales_csv_path, 'r') as f:
    reader = csv.reader(f)
    rows = list(reader)

### 2. Separate header and data

The header of the csv is the first element of the ```rows``` list (index 0), as it is the first row in the csv file. 

Use python indexing to create two new variables: ```header``` which contains the 4 column names, and ```data``` which contains the list of lists, each sub-list representing a row from the csv.

Lastly, print ```header``` to see the names of the columns.

In [4]:
# A:
# header =
# data =

### 3. Create a dictionary with the data

Use loops or list comprehensions to create a dictionary called ```sales_data```, where the keys of the dictionary are the column names, and the values of the dictionary are lists of the data points of the column corresponding to that column name.

*The dictionary should look like this:* {'2015_margin': ['93.8022814583',
  '21.0824246877',
  '93.6124943024',
  '16.8247038328',...
  
*Note:* The order of the dictionary's keys might differ
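
As a hint for the general pattern, here is a small sketch on a made-up table. The names `toy_header`, `toy_data`, and `toy_columns` and all values are hypothetical; your own solution should work on `header` and `data` instead.

```python
# Hypothetical toy table; column names and values are made up for illustration.
toy_header = ["units", "price"]
toy_data = [["3", "1.50"], ["7", "2.25"], ["2", "0.99"]]

# Loop version: start with an empty list per column, then fill it row by row.
toy_columns = {name: [] for name in toy_header}
for row in toy_data:
    for name, value in zip(toy_header, row):
        toy_columns[name].append(value)

# Comprehension version: the i-th column collects the i-th entry of every row.
toy_columns_alt = {
    name: [row[i] for row in toy_data] for i, name in enumerate(toy_header)
}

print(toy_columns)
```

Both versions build the same dictionary; which one reads better is largely a matter of taste.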

In [5]:
# using a for loop


In [6]:
# using a list comprehension


**3.A Print out the first 10 items of the `volume_sold` column. The first item should be roughly 18.42 and the tenth item should be roughly 5.388.**

In [7]:
# A:


### 4. Convert data from string to float

As you can see, the data is still in string format (which is how it is read in from the csv). For each key:value pair in our ```sales_data``` dictionary, convert the values (column data) from string values to float values.
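
One possible shape for this conversion, sketched on a made-up dictionary (`toy_columns` and its values are hypothetical, not the lab data):

```python
# Hypothetical column dictionary with string values, as csv.reader would give us.
toy_columns = {"units": ["3", "7", "2"], "price": ["1.50", "2.25", "0.99"]}

# Build a new dictionary with the same keys, converting every value with float().
toy_numeric = {
    name: [float(value) for value in values]
    for name, values in toy_columns.items()
}

print(toy_numeric)
```

Building a new dictionary rather than changing the original in place keeps the raw strings available if something goes wrong during the conversion.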

## Numpy for descriptive statistics


Numpy can be used to get quick insight about the data. We can calculate quantities like the mean, the median, the variance or the standard deviation of any numpy array or list using numpy functions.

Example:

In [8]:
my_list = [1, 2, 3, 4, 5]

#### The mean

The mean is the sum of the numbers in a list, divided by the length of that list.

In [9]:
np.mean(my_list)

3.0

#### The median

For **odd-length lists**: The median is the middle number of the ordered list.

For **even-length lists**: The median is the average of the two middle-most numbers of the ordered list.

The position of the median in the ordered list (counting from 1) is given by (n + 1)/2, with n being the length of the list:
- if the answer is an integer, like 9, then the median is the 9th number
- if the answer is a decimal, like 12.5, then the median is the mean of the 12th and 13th numbers
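
The position rule can be sketched by hand and compared with `np.median`. The helper name `median_by_position` and the two example lists are assumptions made for this illustration.

```python
import numpy as np

def median_by_position(values):
    """Median via the (n + 1) / 2 position rule (positions count from 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    if position.is_integer():
        # Odd length: the median is the value at exactly that position.
        return float(ordered[int(position) - 1])
    # Even length: average the two values on either side of the position.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

odd_example = [5, 1, 4, 2, 3]    # hypothetical data, n = 5
even_example = [10, 2, 8, 4]     # hypothetical data, n = 4

print(median_by_position(odd_example), np.median(odd_example))
print(median_by_position(even_example), np.median(even_example))
```

The `- 1` in the indexing is only there because Python counts from 0 while the position rule counts from 1.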

In [10]:
np.median(my_list)

3.0

#### The variance

The variance is a numeric value used to describe the **degree of spread** around the mean in a distribution of numbers.

The variance is calculated by:
- subtracting the mean from each value
- squaring each of these differences
- finding the average of the squared differences (i.e. finding the sum, then dividing by n)

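In Python, the variance can be calculated with:
```python
numbers = [1, 2, 3, 4, 5]  # example data
n_mean = np.mean(numbers)

# squared difference between each value and the mean
squared_diffs = []
for num in numbers:
    squared_diffs.append((num - n_mean) ** 2)

# average of the squared differences
variance = np.sum(squared_diffs) / len(numbers)
```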


Using `numpy`, the variance is simply:
```python
variance = np.var(numbers)
```

In [11]:
variance = []
n_mean = np.mean(my_list)

for num in my_list:
    variance.append((num - n_mean) ** 2)

variance = np.sum(variance)
variance = variance / len(my_list)
variance

2.0

In [12]:
np.var(my_list)

2.0

#### Standard deviation

The **standard deviation** (often written lowercase sigma: σ) is the **square root of the variance**.

In other words, for a random variable X, 

$$
\sigma(X) = \sqrt{{\rm Var}(X)}
$$

Because the **variance** is the **average of the _squared_ distances from the mean**, taking its square root gives a number that tells us, roughly, the **typical distance of numbers in a distribution from the mean of the distribution**. The standard deviation is measured in the same units as the observed sample values and the mean (the variance is in squared units), and is therefore better for direct comparison.

The standard deviation can be calculated using numpy:
```python
std = np.std(numbers)
```
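
The point about units can be illustrated with a small sketch. The list `heights_cm` is made-up data, and the mean absolute deviation is included only as a comparison to show what "roughly the typical distance from the mean" means here.

```python
import numpy as np

# Hypothetical heights in centimetres.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

variance_cm2 = np.var(heights_cm)   # in squared centimetres
std_cm = np.std(heights_cm)         # back in centimetres

# For comparison: the plain average distance from the mean.
mean_abs_dev_cm = np.mean(np.abs(heights_cm - heights_cm.mean()))

print(variance_cm2, std_cm, mean_abs_dev_cm)
```

The standard deviation and the mean absolute deviation are not identical, but both are on the scale of the data, whereas the variance is not.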

In [13]:
np.std(my_list)

1.4142135623730951

#### The minimum

This is the smallest value in the list.

In [14]:
np.min(my_list)

1

#### The maximum

This is the largest value in the list.

In [15]:
np.max(my_list)

5

#### The range

The range of values in the list is the difference between maximum and minimum and can be calculated with `np.ptp` (point-to-point).

In [16]:
np.ptp(my_list)

4

#### The skew 

The skew is calculated in a similar way to the variance, except that the differences from the mean are cubed rather than squared, and at the end we divide by the cube of the standard deviation of the array.

In [17]:
skew = []
n_mean = np.mean(my_list)

for num in my_list:
    skew.append((num - n_mean) ** 3)

skew = np.mean(skew)
skew = skew / np.std(my_list)**3
skew

0.0

Rather than computing the skew by hand, we can use the `skew` function from the `scipy.stats` library.

In [18]:
import scipy.stats as stats

In [19]:
stats.skew(my_list)

0.0

### 5. Write a function to print summary statistics

Now write a function to print out summary statistics for the data.

Your function should:

- Accept two arguments: the column name and the data that column is associated with.
- Print out information, clearly labeling each item when you print it:
    1. Print out the column name
    2. Print the mean of the data using ```np.mean()```
    3. Print out the median of the data using ```np.median()```
    4. Print out the variance of the data using ```np.var()```
    5. Print out the standard deviation of the data using ```np.std()```
    6. Print out the minimum of the data with `np.min()`.
    7. Print out the maximum of the data with `np.max()`.
    8. Print out the range of the data with `np.ptp()`.
    9. Print out the skew of the data with `stats.skew()`.


In [20]:
# A:


**5.A Using your function, print the summary statistics for `volume_sold`.  Check: the mean should be roughly 10.02.**

In [21]:
# A:


**5.B Using your function, print the summary statistics for `2015_margin`. Check: the mean should be roughly 46.86.**

In [22]:
# A:


**5.C Using your function, print the summary statistics for `2015_q1_sales`. Check: the mean should be roughly 154632.**

In [23]:
# A:


**5.D Using your function, print the summary statistics for `2016_q1_sales`. Check: the mean should be roughly 154699.**

In [24]:
# A:


### More on Skewness

---

We observe a high skewness in the data.
Skewness refers to the **lack of symmetry** in a distribution of data.

> Technical note: we will be talking about skewness here only in the context of _unimodal_ distributions. The mode is the most frequent value.

![](./assets/images/skewness.png)

**A positively-skewed distribution is one whose right tail is longer or fatter than its left.**

Conversely, **a negatively-skewed distribution is one whose left tail is longer or fatter than its right**.


Symmetric distributions have no skewness!
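
The sign of the skew can be checked on small made-up lists. This sketch uses the same cubed-differences formula as earlier in this lab; the helper name `sample_skew` and the three example lists are hypothetical.

```python
import numpy as np

def sample_skew(values):
    """Mean of cubed deviations divided by the cubed standard deviation."""
    arr = np.asarray(values, dtype=float)
    deviations = arr - arr.mean()
    return np.mean(deviations ** 3) / arr.std() ** 3

right_tailed = [1, 1, 2, 2, 3, 10]            # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]      # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(sample_skew(right_tailed))   # positive
print(sample_skew(left_tailed))    # negative
print(sample_skew(symmetric))      # zero
```

Mirroring the data flips the sign of the skew but not its size, which is a quick sanity check on any skew calculation.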

### 6. [Bonus] Plot the distributions

- Use `plt.hist()` to plot a histogram for each of the variables. 
- Mark the locations of the means and medians. You can create vertical lines at their positions using `plt.vlines(np.mean(array), 0, 1000)`, etc. 
- Compare to the situation in the example plot.

## Sample variance

> **Note:** Often, the variance for data samples comes with a correction factor of $n/(n-1)$, where $n$ is the sample size (number of elements in the list). This makes sense in the context of inference, when one wants to infer the variance of a whole population by just looking at a sample: dividing by $n$ tends to underestimate the population variance, and the correction (known as Bessel's correction) removes that bias. You will often find that some software (e.g. pandas) uses the correction factor as the default setting, while `np.var` does not. We can easily take care of that.

In [25]:
np.var(my_list)

2.0

In [26]:
(np.var(my_list)
 *len(my_list)/(len(my_list)-1))

2.5

The same can be done using the keyword `ddof` (delta degrees of freedom) in the numpy function: the sum of squared differences is divided by `n - ddof`.

In [27]:
np.var(my_list, ddof=0)

2.0

In [28]:
np.var(my_list, ddof=1)

2.5
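
Why the correction matters for inference can be seen in a small simulation. This is only a sketch under stated assumptions: the data are drawn from a normal distribution with a known variance of 4, and the seed, sample size, and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_variance = 4.0

# 20,000 hypothetical samples (rows), each with n = 5 observations.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_variance), size=(20_000, 5))

# Average the two variance estimates over all samples.
mean_var_ddof0 = np.var(samples, axis=1, ddof=0).mean()
mean_var_ddof1 = np.var(samples, axis=1, ddof=1).mean()

print(mean_var_ddof0)  # should land near (n - 1) / n * 4 = 3.2
print(mean_var_ddof1)  # should land near the true variance, 4
```

With `ddof=0` the estimates are too small on average, while `ddof=1` centres them on the true value, which is the bias the correction factor removes.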

## Some useful notation


In textbooks you will often find the following notation for a data sample $X=(X_1, X_2, \ldots, X_n)$:

- Mean $\mu$ (mu):

$$\mu = \frac{1}{n}(X_1+X_2+\ldots+X_n) = \frac{1}{n}\sum_{i=1}^{n}X_i$$

- Variance: 

$${\rm Var}(X) = \frac{1}{n}\left((X_1-\mu)^2+(X_2-\mu)^2+\ldots+(X_n-\mu)^2\right) = \frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2$$

- Skew:

$${\rm Skew}(X) = \frac{1}{n\ \sigma(X)^3}\left((X_1-\mu)^3+(X_2-\mu)^3+\ldots+(X_n-\mu)^3\right) = \frac{1}{n}\sum_{i=1}^{n}\frac{(X_i-\mu)^3}{\sigma(X)^3}$$


The capital Sigma $\Sigma$ is just shorthand for carrying out the summation. Think of this notation in the same way we programmed it: we iterate over the different elements, always perform the same operation, and at the end sum all the results before dividing by the number of elements in the list.
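
As a worked example, here are the three formulas applied to the list $X=(1, 2, 3, 4, 5)$ used throughout this lab, so $n=5$. The results agree with the values computed with `numpy` above.

$$\mu = \frac{1}{5}(1+2+3+4+5) = \frac{15}{5} = 3$$

$${\rm Var}(X) = \frac{1}{5}\left((-2)^2+(-1)^2+0^2+1^2+2^2\right) = \frac{10}{5} = 2, \qquad \sigma(X) = \sqrt{2} \approx 1.414$$

$${\rm Skew}(X) = \frac{1}{5\ \sigma(X)^3}\left((-2)^3+(-1)^3+0^3+1^3+2^3\right) = \frac{0}{5\ \sigma(X)^3} = 0$$

The cubed terms keep their signs, so values below and above the mean cancel exactly for a symmetric list, whereas the squared terms in the variance can only add up.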