In [None]:
## May 12 2021
## Author: Benjamin Diethelm-Varela
## Crash course on statistics

# Variable types
The data types influence how data is summarized and visualized

Types are:
- Quantitative: numerical variables. Amenable to descriptive statistics (e.g. mean)
    - Continuous: any value within an interval (height, weight, time)
    - Discrete: countable values, often with a finite number of possible values (e.g. number of children)
- Categorical (qualitative): classifies items into groups with no numerical meaning to them
    - Ordinal: has some order or ranking associated to it (e.g. college class)
    - Nominal: no ranking (e.g. race, US States)
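
As a rough illustration of how these types can be represented in pandas, the sketch below builds a small table with one column per type. The `students` table and all of its values are hypothetical and are not part of the toy dataset used later in these notes.

```python
import pandas as pd

# Hypothetical toy records, one row per student (illustrative values only)
students = pd.DataFrame({
    'height_cm': [172.4, 165.0, 180.3, 158.9],                   # quantitative, continuous
    'siblings': [0, 2, 1, 3],                                    # quantitative, discrete
    'class_year': ['Freshman', 'Senior', 'Junior', 'Freshman'],  # categorical, ordinal
    'home_state': ['MD', 'CA', 'MD', 'TX'],                      # categorical, nominal
})

# Ordinal: declare the categories together with their order
year_order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
students['class_year'] = pd.Categorical(students['class_year'],
                                        categories=year_order, ordered=True)

# Nominal: categories without any order
students['home_state'] = students['home_state'].astype('category')

print(students.dtypes)
# Order-based operations are meaningful only for the ordinal column
print(students['class_year'].max())
```

Declaring the order explicitly lets pandas sort and compare the ordinal column by rank rather than alphabetically.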

# Study design
There is a myriad of study designs available, ranging from exploratory analysis of existing data to carefully planned data collection efforts

Study designs are performed in many fields: clinical trials, public opinion surveys, etc.

The general types of study designs are:
- Exploratory (watch for p-hacking here) vs confirmatory studies
- Comparative (contrast one quantity to another) vs non-comparative studies (predict absolute quantities, e.g. blood pressure)
- Observational (arises naturally) vs experimental (involves manipulation) studies

Typical epidemiological studies, such as effects of tobacco on lifespan, are observational. An example of experimental studies would be observing the effect of a fertilizer on crop yields.

In experiments, we randomly assign subjects to groups. In observational analyses, subjects are said to be exposed
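
Random assignment can be sketched in a few lines of NumPy. This is a minimal illustration that assumes 20 hypothetical subjects split evenly into two groups; the subject IDs and group sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed fixed only so the sketch is reproducible

# Hypothetical subject identifiers
subjects = np.arange(1, 21)           # 20 subjects

# Shuffle, then split in half: every subject is equally likely to land in either group
shuffled = rng.permutation(subjects)
treatment, control = shuffled[:10], shuffled[10:]

print('Treatment:', np.sort(treatment))
print('Control:  ', np.sort(control))
```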

## Statistical power
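Statistical power is the probability that a study detects an effect when the effect truly exists. A power analysis assesses whether a study design is likely to yield meaningful findings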
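
One way to get a feel for power is to simulate many studies and count how often a true effect is detected. The sketch below is only an illustration under simplifying assumptions: two groups with a known standard deviation of 1, a two-sided z-test at the 5% level, and a hypothetical helper named `simulated_power`.

```python
import numpy as np

def simulated_power(n_per_group, effect, n_sims=2000, seed=0):
    """Fraction of simulated studies in which a two-sided z-test (alpha = 0.05)
    detects a true difference of `effect` standard deviations between two groups."""
    rng = np.random.default_rng(seed)
    # Both groups have standard deviation 1, assumed known to keep the test simple
    a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    b = rng.normal(effect, 1.0, size=(n_sims, n_per_group))
    # z statistic for the difference in group means
    z = (b.mean(axis=1) - a.mean(axis=1)) / np.sqrt(2.0 / n_per_group)
    return np.mean(np.abs(z) > 1.96)

# Power grows with the number of subjects per group
for n in (10, 30, 100):
    print(n, simulated_power(n, effect=0.5))
```

With the same true effect, larger groups detect it more often, which is why sample size is usually the main lever when planning a study.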

## Bias
Bias arises when measurements are systematically off-target, or when the sample is not representative of the population. This is especially acute in observational studies

# Categorical and quantitative data

In [31]:
# We'll use a toy dataset for these notes
# Visualizations in seaborn
# Import pandas and numpy
%matplotlib widget
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Create some lists
topics = ['Computational chemistry', 'Medicinal chemistry', 'Biochemistry', 
          'Structural biology', 'Microbiology', 'Immunology', 'Pharmacology',
          'Genetics', 'Pharmacogenomics', 'Pharmaceutics', 'Pharmacometrics',
         'Cosmetics', 'Regulatory Science', 'Public Health']
researchers = [82, 45, 50, 96, 84, 75, 77, 60, 42, 87, 62, 35, 73, 61]
labs = [6,8,6,12,10,9,8,6,5,11,5,4,5,4]

# Weld them into a dictionary
department = {'topic': topics, 'number of researchers': researchers,
              'number of laboratories': labs}

# Create a dataframe
df = pd.DataFrame(department)
df

Unnamed: 0,topic,number of researchers,number of laboratories
0,Computational chemistry,82,6
1,Medicinal chemistry,45,8
2,Biochemistry,50,6
3,Structural biology,96,12
4,Microbiology,84,10
5,Immunology,75,9
6,Pharmacology,77,8
7,Genetics,60,6
8,Pharmacogenomics,42,5
9,Pharmaceutics,87,11


## Categorical data
This data classifies items into groups. An example is marital status. Another is ethnicity.

__Frequency tables__ are a common way of summarizing this data. For each category, we report a count and a percentage

__Bar charts__ are a common way of visualizing this data. On the x axis we collect categories, and on the y axis we record either frequency or percentage

Pie charts are also common, but they have issues, including labeling difficulties for small categories and sometimes misleading visuals if strict labeling is not used
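
When the data come as one raw observation per item, a frequency table can be built directly with pandas. The sketch below assumes a hypothetical list of marital statuses; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw observations: one marital status per respondent
status = pd.Series(['Married', 'Single', 'Married', 'Divorced', 'Single',
                    'Married', 'Widowed', 'Single', 'Married', 'Married'],
                   name='marital status')

counts = status.value_counts()                       # count per category
percent = status.value_counts(normalize=True) * 100  # share per category, in %

# Both Series are indexed by category, so they align row by row
freq = pd.DataFrame({'count': counts, 'percent': percent})
print(freq)
```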

In [23]:
# frequency data for number of researchers
total = np.sum(df['number of researchers'])
totalr = np.sum(df['number of laboratories'])
totals = {'topic': 'Total', 'number of researchers': total,
              'number of laboratories': totalr}
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
dftot = pd.concat([df, pd.DataFrame([totals])], ignore_index=True)
dftot

Unnamed: 0,topic,number of researchers,number of laboratories
0,Computational chemistry,82,6
1,Medicinal chemistry,45,8
2,Biochemistry,50,6
3,Structural biology,96,12
4,Microbiology,84,10
5,Immunology,75,9
6,Pharmacology,77,8
7,Genetics,60,6
8,Pharmacogenomics,42,5
9,Pharmaceutics,87,11


In [24]:
percentages = dftot['number of researchers']/dftot.loc[14, 'number of researchers']*100
freqtable = pd.DataFrame({'topic' :dftot['topic'], 'number of researchers': dftot['number of researchers'],
                         'frequency': percentages})
freqtable

Unnamed: 0,topic,number of researchers,frequency
0,Computational chemistry,82,8.826695
1,Medicinal chemistry,45,4.843918
2,Biochemistry,50,5.382131
3,Structural biology,96,10.333692
4,Microbiology,84,9.041981
5,Immunology,75,8.073197
6,Pharmacology,77,8.288482
7,Genetics,60,6.458558
8,Pharmacogenomics,42,4.52099
9,Pharmaceutics,87,9.364909


In [26]:
# Bar chart with the data
freqtable2 = freqtable.iloc[0:14,:];
plt.figure();
sns.barplot(x=freqtable2['topic'], y=freqtable2['number of researchers'], color='darkred');
plt.xlabel('Department')
plt.ylabel('Number of researchers')
plt.title('Pharmaceutical Sciences research roster composition')
plt.xticks(rotation=90);
plt.tight_layout();

## Quantitative data
### Histograms
__Histograms__ are typically the first graphical display of quantitative data, that is, data with a numerical value for which mathematical operations make sense. Histograms work with either discrete or continuous variables

Histograms are excellent for a first look at the data. On any histogram, the x-axis shows the range of possible values for the variable, split into intervals called bins, while the y-axis shows the frequency associated with each bin

Histograms have the following four features:
- Shape: overall appearance. Might be symmetric, skewed to a side, or conforming to a distribution
- Center: the mean or the median of the histogram
- Spread: how far the data spreads to the extreme values of the variable. The main spread descriptor is the range (max - min)
- Outliers: data points that fall far from most other data points

So, for example, a dataset, when represented on a histogram could be: bell-shaped unimodal, with a mean and median of 50, a range of 20, and no apparent outliers.

Bell-shaped distributions are easy to work with in terms of statistics. Unimodal distributions make things even easier.

Do not confuse histograms with bar charts. They look similar but are very different. Histograms always look at quantitative data. Bar charts always look at categorical data.

A bimodal distribution has two distinct peaks, that is, two values (or intervals) of high frequency.
Right- or left-skewed distributions have long "tails" on the right or left side, respectively. Those tails often contain the outliers

The __median__ cuts the data into two groups of equal total frequency. It is a nice robust measure of center

Especially when there are big outliers, make sure to indicate the range where the bulk of the data is.

Remember: the __mean__ is sensitive to extreme values. The __median__ isn't
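
A small numerical check makes the difference concrete. The numbers below are hypothetical measurements chosen so the arithmetic is easy to follow.

```python
import numpy as np

values = np.array([48, 49, 50, 50, 51, 52])   # hypothetical measurements
with_outlier = np.append(values, 500)          # same data plus one extreme value

# Without the outlier, mean and median agree (both 50)
print(np.mean(values), np.median(values))

# With the outlier, the mean jumps to about 114.3 while the median stays at 50
print(np.mean(with_outlier), np.median(with_outlier))
```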

In [27]:
# The histogram below shows how many rows of the dataset fall into each interval of researcher counts
plt.figure()
sns.histplot(data=df['number of researchers'], bins = 10)
plt.xlabel('Number of researchers in lab')
plt.ylabel('Frequency')
plt.title('Lab staffing')

Text(0.5, 1.0, 'Lab staffing')

### Numerical summaries
Histograms provide a good visual feel for the data, but sometimes it's important to dive deeper. Numerical summaries allow for that, down to the decimal.

The describe method in pandas provides a numerical summary.

It is most common to see a five-number summary: 

- minimum, 
- 1st quartile (Q1; 25th percentile)
- median (Q2; 50th percentile)
- 3rd quartile (Q3; 75th percentile)
- maximum

The 1st quartile is the value below which 25% of the data falls. The remaining quartiles are read the same way.

#### IQR

An important statistic is the Interquartile Range (IQR). It is defined as:
    
    IQR = Q3 - Q1

IQR is another measure of spread. It gives you an idea of where most of the data is falling. The value represents the spread of the middle 50% of the data.

Data that are skewed to either side will show a mean that differs noticeably from the median, because the mean is pulled toward the tail. They will often also have a large stdev

Median and IQR are robust measures, however, as they are insensitive to outliers

#### Standard deviation

Another important measure of spread is the standard deviation. In a nutshell, it describes how far, on average, the data spreads from the mean
It can be interpreted as the typical distance between a given value and the mean
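
To make the definition concrete, the standard deviation can be computed by hand and compared with the library results. The sketch below re-enters the researcher counts from the toy dataset so it runs on its own. Note that pandas divides by n - 1 (sample standard deviation) by default, while NumPy divides by n.

```python
import numpy as np
import pandas as pd

# Same researcher counts as in the toy dataset above
res = pd.Series([82, 45, 50, 96, 84, 75, 77, 60, 42, 87, 62, 35, 73, 61])

deviations = res - res.mean()
# Sample standard deviation: divide by n - 1 (this is what pandas reports)
sd_sample = np.sqrt((deviations ** 2).sum() / (len(res) - 1))
# Population standard deviation: divide by n (NumPy's default, ddof=0)
sd_population = np.sqrt((deviations ** 2).sum() / len(res))

print(sd_sample, res.std())
print(sd_population, np.std(res.to_numpy()))
```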

### Getting descriptive statistics with pandas and numpy
Pandas and NumPy have methods to get common statistical parameters: mean, 25th percentile (1st quartile), median (50th percentile, 2nd quartile), and 75th percentile (3rd quartile)
Let's use the lab dataset for that

In [17]:
res = df['number of researchers']

# Pandas methods
print('Mean:')
print(res.mean())
print('25th percentile:')
print(res.quantile(0.25))
print('Median:')
print(res.median()) # Remember that the median is the 50th percentile, or 2nd quartile
print('75th percentile:')
print(res.quantile(0.75))

print('\n')

# Numpy methods
print('Mean:')
print(np.mean(res))
print('25th percentile:')
print(np.percentile(res, 25))
print('Median:')
print(np.percentile(res, 50))
print('75th percentile:')
print(np.percentile(res, 75))

Mean:
66.35714285714286
25th percentile:
52.5
Median:
67.5
75th percentile:
80.75


Mean:
66.35714285714286
25th percentile:
52.5
Median:
67.5
75th percentile:
80.75


In [18]:
# The describe method provides the same functionality but in an easier fashion
# Numerical summary for number of researchers
df['number of researchers'].describe()

count    14.000000
mean     66.357143
std      18.566335
min      35.000000
25%      52.500000
50%      67.500000
75%      80.750000
max      96.000000
Name: number of researchers, dtype: float64

In [19]:
# Calculate IQR for number of researchers (Q3 - Q1) from the data rather than by hand
q1, q3 = df['number of researchers'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr

28.25

### Standard score
Mean and standard deviation are commonly reported statistics for distributions. They are informative and give a good idea of how the data behaves if the data is approximately normally distributed

On a typical bell-shaped curve, we mark the mean, as well as the points one, two, and three stdevs above and below it

So, for a curve with a mean of 0 and stdev of 1, we would mark 0, as well as 1, 2, and 3 to the right, and -1, -2, and -3 to the left.
On a normal distribution, 68% of values fall within 1 stdev of the mean, 95% of values fall within 2 stdevs, and 99.7% of values fall within 3 stdevs. Those values are known as the 68-95-99.7 rule, or the "empirical rule"
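
The empirical rule can be checked approximately by simulation. The sketch below draws a large sample from a standard normal distribution (mean 0, stdev 1); the sample size and seed are arbitrary choices, and the printed shares will only be close to 68, 95, and 99.7, not exact.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)   # simulated standard normal data

# Share of values (in %) within k stdevs of the mean, for k = 1, 2, 3
shares = {k: np.mean(np.abs(sample) <= k) * 100 for k in (1, 2, 3)}

for k, share in shares.items():
    print(f'within {k} stdev: {share:.1f}%')
```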

The standard score (Z score) tells us how unusual a value is. It's defined as:

    Standard score = (value-mean)/stdev
    
The closer to zero Z is, either on the negative or positive side, the less unusual the value is (again, assuming a normal dist.).
Negative Z values are below the mean. Positive Z values are above the mean.
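
As a quick sketch, the standard score can be computed for every value at once with pandas. The researcher counts from the toy dataset are re-entered here so the snippet is self-contained, and the sample standard deviation (the pandas default) is used.

```python
import pandas as pd

res = pd.Series([82, 45, 50, 96, 84, 75, 77, 60, 42, 87, 62, 35, 73, 61],
                name='number of researchers')

z = (res - res.mean()) / res.std()   # standard score of every value

print(z.round(2))
# The value with the largest |Z| is the most unusual one in this dataset
print('Most unusual value:', res[z.abs().idxmax()])
```

With only 14 values that are not clearly bell-shaped, these scores are best read as a rough ranking of how far each value sits from the mean.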

### Box plots
Another useful visual for quantitative data

Boxplots visualize the five values of the numerical summary: min, Q1, median, Q3, max. They also depict the IQR.

In a boxplot with no points flagged as outliers, the upper whisker ends at the max. The upper edge of the box is the Q3, the middle line is the median (Q2), the lower edge of the box is the Q1, and the lower whisker ends at the min. The length of the box is the IQR.

Thus, boxplots immediately convey the spread of both all the data, and of the 50% middle data.

In [30]:
plt.figure();
sns.boxplot(y=df['number of researchers']);
plt.ylabel('Number of researchers in lab')
plt.xlabel('Pharmaceutical Sciences Division')
plt.title('Pharmaceutical Sciences lab staffing')

Text(0.5, 1.0, 'Pharmaceutical Sciences lab staffing')

On skewed distributions, boxplots typically identify outliers specifically as dots above or below the whiskers.
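
By default, matplotlib and seaborn flag a point as an outlier when it lies more than 1.5 times the IQR beyond the box (the `whis=1.5` setting). The sketch below reproduces that rule by hand on a hypothetical right-skewed sample; the data values are made up for illustration.

```python
import numpy as np

# Hypothetical right-skewed sample with one extreme value
data = np.array([35, 42, 45, 50, 60, 61, 62, 73, 75, 77, 82, 84, 87, 150])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn as individual dots
outliers = data[(data < lower_fence) | (data > upper_fence)]
# Whiskers end at the most extreme points that are still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]

print('Whiskers:', inside.min(), inside.max())
print('Outliers:', outliers)
```

Here the upper whisker stops at 87 rather than at the maximum, and 150 is shown as a separate dot.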

Note: boxplots can hide gaps and clusters. Consider the histogram above; the missing bins are not captured in the boxplot

Boxplots are very useful for comparing sets of observations.