<h1>Python Statistics Fundamentals: How to Describe Your Data</h1>

<h2><b>1. Introduction</b></h2>
<p>In this notebook, I'll be taking notes from Real Python's lesson on the <b>fundamentals of statistics</b>. If you are interested, follow this <a href='https://realpython.com/python-statistics/'>link</a> to read the full content. Thanks once again to the whole Real Python team and, in this case, especially to <a href='https://realpython.com/team/mstojiljkovic/'>Mirko Stojiljković</a> for this course!</p>
<p>This notebook will cover the following <b>objectives</b>:</p>
<ul>
    <li>What numerical quantities you can use to describe and summarize your datasets</li>
    <li>How to calculate descriptive statistics in pure Python</li>
    <li>How to get descriptive statistics with available Python libraries</li>
    <li>How to visualize your datasets</li>
</ul>

<h2><b>2. Understanding Descriptive Statistics</b></h2>
<p>The term <b>descriptive statistics</b> refers to the process of <b>describing</b> and <b>summarizing</b> data. It can use <b>two main approaches</b>:</p>
<ol>
    <li><b>Quantitative approach</b> - describes data numerically.</li>
    <li><b>Visual approach</b> - describes data with the assistance of data visualization tools (charts, histograms, plots, etc.).</li>
</ol>
<p>When you describe a <b>single variable</b>, you're conducting a <b>univariate analysis</b>. When two variables are chosen to verify a statistical relationship, you are performing a <b>bivariate analysis</b>. Finally, <b>multivariate analysis</b> considers multiple variables.</p>


<h3><b>2.1 Types of Measures</b></h3>
<ol>
    <li><b>Central tendency</b> - describes the center of the data. Some of the measures include: <b>mean, mode, and median</b>.</li>
    <li><b>Variability</b> - describes the <b>spread</b> of the data. <b>Variance</b> and <b>standard deviation</b> represent useful measures.</li>
    <li><b>Correlation</b> or <b>joint variability</b> - tells you about the <b>relation</b> between a pair of variables in a dataset. <b>Covariance</b> and the <b>correlation coefficient</b> are useful measures here.</li>
</ol>

<h3><b>2.2 Populations and Samples</b></h3>
<p>A <b>sample</b> constitutes a <b>subset of a population</b> and, ideally, should preserve the fundamental <b>features</b> of the population.</p>

<h3><b>2.3 Outliers</b></h3>
<p>An <b>outlier</b> is a data point that <b>significantly differs</b> from the majority of the data taken from a sample or a population. <b>Natural variation</b> in data, <b>change</b> in the behavior of the observed system, and <b>errors</b> during the process of data collection are some of the most frequent reasons for outliers, although you must always rely on <b>experience</b> to properly identify them.</p>
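<p>As a rough illustration of how a candidate outlier might be flagged, the sketch below applies the common 1.5 × IQR rule of thumb. The rule and the variable names are assumptions added here, not part of the original notes, and a flagged point is only a starting point for the judgment described above.</p>

```python
import numpy as np

# Hypothetical sample; 28.0 sits far from the other values.
data = np.array([8.0, 1.0, 2.5, 4.0, 28.0])

# First and third quartiles (NumPy's default linear interpolation).
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Rule of thumb (an assumption, not a definition): flag points that lie
# more than 1.5 * IQR below the first quartile or above the third quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)
```

<p>Whether a flagged value is an error or a legitimate extreme observation still depends on what you know about the data.</p>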

<h2><b>3. Python Statistics Libraries</b></h2>
<p>We'll be using the following libraries:</p>
<ul>
    <li><code>statistics</code></li>
    <li><code>NumPy</code></li>
    <li><code>SciPy</code></li>
    <li><code>Pandas</code></li>
    <li><code>Matplotlib</code></li>
</ul>
<hr>
<h2><b>4. Calculating Descriptive Statistics</b></h2>
<p>Let's import all the necessary packages:</p>

In [12]:
import math
import statistics
import numpy as np
import pandas as pd
import scipy.stats

<p>Let's now create some data:</p>

In [13]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0] # You can also use float('nan') and np.nan

In [14]:
x

[8.0, 1, 2.5, 4, 28.0]

In [15]:
x_with_nan

[8.0, 1, 2.5, nan, 4, 28.0]
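<p>Because <code>nan</code> values affect most of the calculations that follow, it may help to see how they behave. The short sketch below (with illustrative variable names) shows that <code>nan</code> cannot be found with <code>==</code> and has to be detected with <code>math.isnan()</code> instead.</p>

```python
import math
import numpy as np

# Three equivalent ways of writing a nan value.
nan_values = [math.nan, float('nan'), np.nan]

# nan never compares equal to anything, not even to itself.
nan_equals_itself = math.nan == math.nan
print(nan_equals_itself)

# math.isnan() is the reliable way to detect it.
detected = [math.isnan(value) for value in nan_values]
print(detected)

# Count the nan observations in a list.
sample = [8.0, 1, 2.5, math.nan, 4, 28.0]
nan_count = sum(math.isnan(value) for value in sample)
print(nan_count)
```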

<p>Let's now create <code>np.ndarray</code> and <code>pd.Series</code> objects corresponding  to <code>x</code> and <code>x_with_nan</code>:</p>

In [16]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)

In [17]:
y

array([ 8. ,  1. ,  2.5,  4. , 28. ])

In [18]:
y_with_nan

array([ 8. ,  1. ,  2.5,  nan,  4. , 28. ])

In [19]:
z

0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64

In [20]:
z_with_nan

0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64

<h3><b>4.1 Measures of Central Tendency</b></h3>
<h4><b>4.1.1 Mean</b></h4>
<p>The <b>mean</b> (arithmetic average) is expressed as <b>Σᵢ𝑥ᵢ/𝑛</b>, that is, the <b>sum</b> of all elements <b>𝑥ᵢ</b> divided by the <b>number of items in the dataset</b>.</p>
<p>In pure Python, you can compute it with <code>sum()</code> and <code>len()</code>:</p>

In [21]:
mean_ = sum(x) / len(x)
mean_

8.7

<p>You can also use the functions in the <code>statistics</code> module from Python's standard library:</p>

In [22]:
mean_ = statistics.mean(x)
mean_

8.7

In [23]:
mean_ = statistics.fmean(x)
mean_

8.7

Although both <code>mean()</code> and <code>fmean()</code> produce the same result here, <code>fmean()</code> was introduced in Python 3.8 as a faster alternative that always returns a <b>floating-point</b> number. Keep in mind, however, that if there are <code>nan</code> values among your data, these functions will also return <code>nan</code>:

In [24]:
mean_ = statistics.mean(x_with_nan)
mean_

nan

In [25]:
mean_ = statistics.fmean(x_with_nan)
mean_

nan
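<p>If you want to stay in pure Python and still ignore the missing values, one possible approach is to filter them out before calling the function. This is a minimal sketch; the name <code>x_clean</code> is introduced here for illustration only.</p>

```python
import math
import statistics

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Keep only the observations that are not nan.
x_clean = [value for value in x_with_nan if not math.isnan(value)]

# The mean of the remaining five values.
mean_clean = statistics.fmean(x_clean)
print(mean_clean)
```

<p>Silently dropping values changes the sample size, so it is worth checking how many observations were removed.</p>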

You can also use NumPy to get the mean with <code>np.mean()</code>:

In [26]:
mean_ = np.mean(x)
mean_

8.7

NumPy's <code>np.mean()</code> function and the <code>.mean()</code> array method deliver the same output, and both return <code>nan</code> when the data contains <code>nan</code> values:

In [27]:
np.mean(y_with_nan)

nan

In [28]:
y_with_nan.mean()

nan

You can ignore these <code>nan</code> values with <code>np.nanmean()</code>:

In [29]:
np.nanmean(y_with_nan)

8.7

Finally, <code>pd.Series</code> objects also have a <code>.mean()</code> method:

In [30]:
mean_ = z.mean()
mean_

8.7

The Pandas <code>.mean()</code> method, however, <b>ignores</b> <code>nan</code> values by default:

In [31]:
z_with_nan.mean()

8.7

To change this behavior, you need to modify the optional parameter <code>skipna</code>:

In [32]:
z_with_nan.mean(skipna=False)

nan

<h4><b>4.1.2 Weighted Mean</b></h4>
<p>Also known as the <b>weighted average</b>, it is a generalization of the arithmetic mean that allows you to define the relative contribution of each observation in the dataset to the result.</p>
<p>You define one weight <i>w<sub>i</sub></i> for each data point <i>x<sub>i</sub></i> of the dataset <i>x</i>, where <i>i</i> = 1, 2, ..., <i>n</i> and <i>n</i> is the number of items in <i>x</i>. Then you multiply each observation by its corresponding weight, sum all the products, and divide the resulting sum by the sum of the weights: <i>&Sigma;<sub>i</sub>(w<sub>i</sub>x<sub>i</sub>) / &Sigma;<sub>i</sub>w<sub>i</sub></i>.</p>
<p>The weighted mean is handy when a dataset contains items that occur with given relative frequencies.</p>
<p><b>Example</b>: consider a set in which 20% of all items are equal to 2, 50% are equal to 4, and the remaining 30% are equal to 8. You can calculate the weighted average as follows:</p>


In [33]:
0.2 * 2 + 0.5 * 4 + 0.3 * 8

4.8

<p>Because the calculation relies only on the relative frequency of the observations, you don't need to know the total number of items in advance.</p>
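<p>One way to convince yourself of this is to build datasets of different sizes that share the same 20% / 50% / 30% proportions and compare their ordinary means with the weighted result. The datasets <code>small</code> and <code>large</code> below are hypothetical and exist only for this check.</p>

```python
import statistics

# Weighted mean computed from the relative frequencies alone.
weighted = 0.2 * 2 + 0.5 * 4 + 0.3 * 8

# Hypothetical datasets with the same proportions but different sizes.
small = [2] * 2 + [4] * 5 + [8] * 3          # 10 items
large = [2] * 200 + [4] * 500 + [8] * 300    # 1000 items

mean_small = statistics.fmean(small)
mean_large = statistics.fmean(large)
print(weighted, mean_small, mean_large)
```

<p>All three values agree up to floating-point rounding, regardless of how many items each dataset holds.</p>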
<p>Using pure Python, you can implement the weighted mean by combining <code>sum()</code> with either <code>range()</code> or <code>zip()</code>:</p>

In [34]:
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

wmean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
wmean

6.95

In [35]:
wmean = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
wmean

6.95

<p>When dealing with larger datasets, <code>np.average()</code> is a better choice for both NumPy arrays and Pandas Series:</p>

In [36]:
y, z, w = np.array(x), pd.Series(x), np.array(w)
wmean = np.average(y, weights=w)
wmean

6.95

In [37]:
wmean = np.average(z, weights=w)
wmean

6.95

<p>You can also use <code>w * y</code> with <code>np.sum()</code> or <code>.sum()</code>:</p>

In [38]:
(w * y).sum() / w.sum()

6.95

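You must be careful, however, if your dataset contains <code>nan</code> observations. Even a weight of 0.0 does not remove a <code>nan</code>, because <code>0.0 * nan</code> is still <code>nan</code>: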

In [39]:
w = np.array([0.1, 0.2, 0.3, 0.0, 0.2, 0.1])
(w * y_with_nan).sum() / w.sum()

nan

In [40]:
np.average(y_with_nan, weights=w)

nan

In [41]:
np.average(z_with_nan, weights=w)

nan
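<p>If the <code>nan</code> observations should simply be left out, one possible workaround is to mask them, together with their weights, before averaging. This is a sketch under that assumption; the name <code>valid</code> is illustrative.</p>

```python
import numpy as np

y_with_nan = np.array([8.0, 1, 2.5, np.nan, 4, 28.0])
w = np.array([0.1, 0.2, 0.3, 0.0, 0.2, 0.1])

# Boolean mask that is True wherever the observation is a real number.
valid = ~np.isnan(y_with_nan)

# Drop the nan observation together with its weight, then average the rest.
wmean = np.average(y_with_nan[valid], weights=w[valid])
print(wmean)
```

<p>Applying the same mask to both arrays keeps each remaining observation paired with its own weight.</p>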

<h4><b>4.1.3 Harmonic Mean</b></h4>
<p>The <b>harmonic mean</b> is the reciprocal of the mean of the reciprocals of all items in the dataset: <i>n</i> / &Sigma;<sub>i</sub>(1/<i>x</i><sub>i</sub>), where <i>i</i> = 1, 2, ..., <i>n</i> and <i>n</i> is the number of observations in the dataset <i>x</i>.</p>

In [42]:
hmean = len(x) / sum(1 / item for item in x)
hmean

2.7613412228796843

<p>You can achieve the same result with <code>statistics.harmonic_mean()</code>:</p>

In [43]:
hmean = statistics.harmonic_mean(x)
hmean

2.7613412228796843
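<p>To get a feel for what the harmonic mean measures, consider averaging rates. The example below is a hypothetical illustration, not taken from the original lesson: the same distance is driven once at 40 km/h and once at 60 km/h, and the true average speed matches the harmonic mean rather than the arithmetic mean.</p>

```python
import statistics

# Hypothetical trip: the same distance driven at two different speeds (km/h).
speeds = [40, 60]

harmonic = statistics.harmonic_mean(speeds)
arithmetic = statistics.mean(speeds)

# Direct check: total distance divided by total travel time.
distance = 120  # km per leg, chosen arbitrarily
total_time = distance / 40 + distance / 60
average_speed = 2 * distance / total_time

print(harmonic, arithmetic, average_speed)
```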

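<p>A dataset containing a <code>nan</code>, a 0, or a negative number is handled differently in each case: a <code>nan</code> propagates to the result, a zero makes the harmonic mean 0, and a negative value raises a <code>StatisticsError</code>:</p>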

In [44]:
statistics.harmonic_mean(x_with_nan)

nan

In [45]:
statistics.harmonic_mean([1,0,2])

0

In [None]:
# Will raise a StatisticsError
statistics.harmonic_mean([1, 2, -2])
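<p>If you would rather handle this case than let the notebook stop, you can catch the exception. A minimal sketch, where the <code>caught</code> flag is only there to make the outcome visible:</p>

```python
import statistics

caught = False
try:
    statistics.harmonic_mean([1, 2, -2])
except statistics.StatisticsError as error:
    # Negative values are rejected, so execution ends up here.
    caught = True
    print(f"StatisticsError: {error}")
```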

<p>You can also use <code>scipy.stats.hmean()</code>:</p>

In [48]:
scipy.stats.hmean(y)

2.7613412228796843

In [49]:
scipy.stats.hmean(z)

2.7613412228796843