# Descriptive statistics in Python

## Distribution of a single variable

### Measures of central tendency and positional measures

- arithmetic mean
- median
- geometric mean (useful for calculating an average growth rate)
- trimmed mean
- positional measures - quantiles: quartiles, deciles, percentiles

### Measures of variability / dispersion

- variance, standard deviation
- IQR (interquartile range)
- mean absolute deviation (MAD)
- coefficient of variation: standard deviation divided by the mean

### Measures of shape of the distribution

- skewness (measure of asymmetry)
- kurtosis (measure of tailedness, extremity of tails)


In [4]:
import pandas as pd

csv_url = "https://docs.google.com/spreadsheets/d/1H6b5mkq68MeRQyP0Cr2weCpVkzmpR0c2Oi7p147o2a0/export?format=csv"

# Read the sheet into a DataFrame
d = pd.read_csv(csv_url)

print(d.head())

   height  handedness  right_hand_span  left_hand_span  head_circ eye_colour  \
0     159        0.88             19.0            19.0       54.0       Blue   
1     160       -1.00             19.0            20.0       57.0      Green   
2     161        0.79             17.0            16.5       57.0      hazel   
3     162        0.79             16.0            16.0       57.0       gray   
4     162        0.79             16.0            16.0       54.0      Brown   

   gender  siblings  movies  soda   bedtime       fb_freq  fb_friends  \
0  Female         2     3.0   7.0  02:00:00    once a day       135.0   
1  Female         2     0.5   2.0  04:30:00             0         1.0   
2  Female         3     3.0   2.0  23:50:00   once a week       354.0   
3  Female         2     0.0   2.0  23:10:00  almost never       192.0   
4  Female         2     1.0   3.0  00:00:00         never         1.0   

                  stat_likert  
0  Neither agree nor disagree  
1              S

In [9]:
import numpy as np
# arithmetic mean
print("Average height: ")
print(np.mean(d['height']))
print("Average number of Facebook connections: ")
print(np.mean(d['fb_friends']))

Average height: 
176.21666666666667
Average number of Facebook connections: 
289.9642857142857


In [14]:
# median
print("Median height: ")
print(np.median(d['height']))
print("Median number of Facebook connections: ")
print(np.median(d['fb_friends']))
print("Median number of Facebook connections (NAs excluded): ")
print(np.nanmedian(d['fb_friends']))
print("Median number of Facebook connections (NAs excluded -- method 2): ")
print(np.median(d['fb_friends'].dropna()))


Median height: 
176.0
Median number of Facebook connections: 
nan
Median number of Facebook connections (NAs excluded): 
222.0
Median number of Facebook connections (NAs excluded -- method 2): 
222.0
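
The outputs above show `np.mean` returning a number for `fb_friends` while `np.median` returns `nan`. A minimal sketch of why, using hypothetical toy data rather than the survey data: on a pandas Series, `np.mean` appears to hand the work to the Series' own `mean()` method (which skips missing values), whereas `np.median` converts the Series to a plain array, so the `NaN` propagates.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value (not the survey data)
s = pd.Series([1.0, np.nan, 3.0])

# On a Series, np.mean ends up using the pandas method, which skips NaN
mean_on_series = np.mean(s)

# On a plain NumPy array the NaN propagates into the mean
mean_on_array = np.mean(s.to_numpy())

# np.median works on the underlying array, so the NaN propagates as well
median_on_series = np.median(s)

# NaN-aware alternatives
median_nan_aware = np.nanmedian(s)
median_pandas = s.median()  # pandas skips NaN by default

print(mean_on_series, mean_on_array)
print(median_on_series, median_nan_aware, median_pandas)
```

Calling the pandas methods directly (`s.mean()`, `s.median()`) or the `np.nan*` functions makes the handling of missing values explicit instead of depending on which NumPy function happens to defer to pandas.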


Geometric mean:

$$\left(\prod_{i=1}^n{x_i}\right)^{1/n}$$

Using (natural) logarithms ($\exp(x)$ means $e^x$):

$$\exp\left(\frac{1}{n}\left(\sum_{i=1}^n{\ln(x_i)}\right)\right)$$
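
As a quick check that the two formulas agree, and to illustrate the growth-rate use case, here is a small sketch with hypothetical yearly growth factors (+10%, +20%, -10%). `scipy.stats.gmean` is included as a third way to get the same number.

```python
import numpy as np
from scipy.stats import gmean

# Hypothetical yearly growth factors: +10%, +20%, -10%
factors = np.array([1.10, 1.20, 0.90])
n = len(factors)

gm_product = np.prod(factors) ** (1 / n)    # product formula
gm_logs = np.exp(np.mean(np.log(factors)))  # logarithm formula
gm_scipy = gmean(factors)                   # SciPy's implementation

print(gm_product, gm_logs, gm_scipy)

# Average yearly growth rate implied by the geometric mean
print(f"average growth rate: {gm_product - 1:.2%}")

# Growing at the average rate for n years reproduces the total growth
total_growth_matches = np.isclose(gm_product ** n, np.prod(factors))
print(total_growth_matches)
```

The logarithm form is usually preferable for long series, because a product of many values can overflow or underflow.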



In [25]:
print("Geometric mean for the height:")
print(np.exp(np.mean(np.log(d['height']))))
print("Geometric mean for the number of Facebook friends (the geometric mean is defined only for positive numbers):")
print(np.exp(np.mean(np.log(d['fb_friends'][d['fb_friends']>0].dropna()))))

Geometric mean for the height:
175.92789854951246
Geometric mean for the number of Facebook friends (the geometric mean is defined only for positive numbers):
127.79432124392397


In [31]:
# trimmed mean example
from scipy.stats import trim_mean

data = [1, 2, 3, 3, 3, 6, 7, 8, 9, 10]
# Trim 20% from each end
tmean = trim_mean(data, 0.2)
print(tmean)
# check
print(np.mean([3,3,3,6,7,8]))

5.0
5.0
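
To make the trimming step explicit, the following sketch re-implements the calculation by hand: sort the values, drop the same number of observations from each end, and average the rest. `manual_trim_mean` is a hypothetical helper written for this illustration; the number of values cut per side is rounded down, which matches the result of `trim_mean` on this example.

```python
import numpy as np
from scipy.stats import trim_mean

def manual_trim_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * len(x))  # values cut per side, rounded down
    return x[k:len(x) - k].mean()

data = [1, 2, 3, 3, 3, 6, 7, 8, 9, 10]

manual_result = manual_trim_mean(data, 0.2)  # averages [3, 3, 3, 6, 7, 8]
scipy_result = trim_mean(data, 0.2)

print(manual_result, scipy_result)
```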

In [36]:
# arithmetic mean vs. trimmed mean (5% cut from each end)
from scipy.stats import trim_mean
print("Average height: ")
print(np.mean(d['height']))
print("Trimmed mean: ")
print(trim_mean(d['height'], 0.05))
print("Average number of Facebook connections: ")
print(np.mean(d['fb_friends']))
print("Trimmed mean: ")
print(trim_mean(d['fb_friends'].dropna(), 0.05))

Average height: 
176.21666666666667
Trimmed mean: 
175.94444444444446
Average number of Facebook connections: 
289.9642857142857
Trimmed mean: 
268.03846153846155


In [43]:
# quantiles / percentiles
print("Q1, Q2=median, Q3 of height:")
print(np.quantile(d['height'], [.25, .5, .75]))
print("Q1, Q2=median, Q3 of the number of FB connections:")
print(np.quantile(d['fb_friends'].dropna(), [.25, .5, .75]))

Q1, Q2=median, Q3 of height:
[168. 176. 184.]
Q1, Q2=median, Q3 of the number of FB connections:
[ 75. 222. 417.]


There are many different algorithms for calculating quantiles (see: https://bookdown.org/blazej_kochanski/statistics1/centraltendency.html#determining-quantiles-in-practice). By default, Excel (`PERCENTILE.INC`), R (`quantile()`), and NumPy (`np.quantile()`) all use algorithm number 7, which interpolates linearly between neighbouring order statistics (https://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/stats/html/quantile.html).
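
A minimal sketch of algorithm 7, assuming the usual definition (position `h = (n - 1) * q` in the sorted data, then linear interpolation between the two neighbouring values). `quantile_type7` is a hypothetical helper written for this illustration, and the `method` argument of `np.quantile` is assumed to be available (NumPy 1.22 or later).

```python
import numpy as np

def quantile_type7(values, q):
    """Quantile by linear interpolation between order statistics (type 7)."""
    x = np.sort(np.asarray(values, dtype=float))
    h = (len(x) - 1) * q            # fractional position in the sorted data
    lo = int(np.floor(h))
    hi = min(lo + 1, len(x) - 1)    # guard against running past the last value
    return x[lo] + (h - lo) * (x[hi] - x[lo])

data = np.arange(1, 11, dtype=float)  # toy data: 1, 2, ..., 10

manual_q1 = quantile_type7(data, 0.25)
numpy_default_q1 = np.quantile(data, 0.25)
numpy_linear_q1 = np.quantile(data, 0.25, method="linear")
print(manual_q1, numpy_default_q1, numpy_linear_q1)

# Other algorithms can give different answers on the same data
for method in ["inverted_cdf", "median_unbiased", "weibull"]:
    print(method, np.quantile(data, 0.25, method=method))
```

The differences between methods shrink as the sample grows, so the choice matters mostly for small data sets.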

In [48]:
# Standard deviation ('sample' formula)
print("SD of height:")
print(np.std(d['height'], ddof=1))
print("coefficient of variation:")
print(np.std(d['height'], ddof=1)/np.mean(d['height']))
print("SD of the number of FB friends:")
print(np.std(d['fb_friends'], ddof=1))
print("coefficient of variation:")
print(np.std(d['fb_friends'], ddof=1)/np.mean(d['fb_friends']))

SD of height:
10.23137690862416
coefficient of variation:
0.058061346308280484
SD of the number of FB friends:
285.8139232110617
coefficient of variation:
0.985686642432532
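
The `ddof=1` argument above selects the 'sample' formula (division by `n - 1`). A short sketch on hypothetical toy data shows the difference from the 'population' formula, and a default that is easy to trip over: `np.std` on a list or array uses `ddof=0`, while the pandas `Series.std()` method uses `ddof=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (mean 5, sum of squared deviations 32)
x = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = np.std(x)          # NumPy default ddof=0: divide by n
sample_sd = np.std(x, ddof=1)      # 'sample' formula: divide by n - 1
pandas_sd = pd.Series(x).std()     # pandas default is ddof=1

print(population_sd, sample_sd, pandas_sd)

# Coefficient of variation based on the sample standard deviation
cv = sample_sd / np.mean(x)
print(cv)
```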


In [53]:
# Variance ('sample' formula)
print("Variance of height:")
print(np.var(d['height'], ddof=1))
print(np.std(d['height'], ddof=1)**2) #SD^2 - check

print("Variance of the number of FB connections:")
print(np.var(d['fb_friends'], ddof=1))


Variance of height:
104.68107344632767
104.68107344632767
Variance of the number of FB connections:
81689.59870129869


In [59]:
# IQR
from scipy.stats import iqr
print(iqr(d['height']))
print(iqr(d['fb_friends'].dropna()))

16.0
342.0
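
The IQR is simply Q3 minus Q1, so it can be cross-checked against `np.quantile`. The sketch below uses hypothetical toy data and also shows the `nan_policy="omit"` option of `scipy.stats.iqr` as an alternative to calling `.dropna()` first.

```python
import numpy as np
from scipy.stats import iqr

x = np.arange(1, 11, dtype=float)  # toy data: 1, 2, ..., 10

q1, q3 = np.quantile(x, [0.25, 0.75])
iqr_manual = q3 - q1
iqr_scipy = iqr(x)
print(iqr_manual, iqr_scipy)

# With a missing value, ask SciPy to ignore it instead of dropping it by hand
x_with_nan = np.append(x, np.nan)
iqr_omit = iqr(x_with_nan, nan_policy="omit")
print(iqr_omit)
```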


Mean absolute deviation (MAD) formula:

$$\text{MAD} = \frac{1}{n}\sum_{i=1}^n |x_i - \bar{x}|$$

In [68]:
print("MAD of height: ")
print(np.mean(np.abs(d['height']-np.mean(d['height']))))
print("MAD of the number of FB connections: ")
print(np.mean(np.abs(d['fb_friends']-np.mean(d['fb_friends']))))

MAD of height: 
8.497777777777776
MAD of the number of FB connections: 
216.24234693877548
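
The cell above relies on pandas silently skipping missing values in `fb_friends`. A sketch of the same calculation as a reusable function that drops missing values explicitly; `mean_abs_deviation` is a hypothetical helper name, and the data are toy values.

```python
import numpy as np

def mean_abs_deviation(values):
    """Mean absolute deviation around the arithmetic mean, ignoring NaN."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                   # drop missing values explicitly
    return np.mean(np.abs(x - x.mean()))

# Hypothetical toy data: mean 5, absolute deviations sum to 12
x = [2, 4, 4, 4, 5, 5, 7, 9, np.nan]

mad_value = mean_abs_deviation(x)
print(mad_value)
```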


In [73]:
from scipy.stats import skew, kurtosis
print('Height - skewness')
print(skew(d['height']))
print('Height - kurtosis')
print(kurtosis(d['height']))

print('FB friends - skewness')
print(skew(d['fb_friends'].dropna()))
print('FB friends - kurtosis')
print(kurtosis(d['fb_friends'].dropna()))

Height - skewness
0.38243032414032746
Height - kurtosis
-0.3409675285397866
FB friends - skewness
1.3909068422064903
FB friends - kurtosis
1.6654507022012153
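
To show what these functions compute with their default settings, the sketch below rebuilds both measures from central moments on hypothetical toy data. By default `scipy.stats.kurtosis` reports excess kurtosis (the normal distribution scores 0), and both functions use the uncorrected moment formulas, so results can differ slightly from bias-corrected versions.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical toy data
x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

dev = x - x.mean()
m2 = np.mean(dev ** 2)  # second central moment (population variance)
m3 = np.mean(dev ** 3)  # third central moment
m4 = np.mean(dev ** 4)  # fourth central moment

skew_manual = m3 / m2 ** 1.5
excess_kurtosis_manual = m4 / m2 ** 2 - 3  # subtract 3 so that normal = 0

print(skew_manual, skew(x))
print(excess_kurtosis_manual, kurtosis(x))

# fisher=False returns the 'raw' kurtosis, where the normal distribution scores 3
raw_kurtosis = kurtosis(x, fisher=False)
print(raw_kurtosis)
```

Read this way, the negative kurtosis reported for height means lighter tails than a normal distribution, and the positive value for the number of FB friends means heavier tails.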
