# Statistical Study Design

## Design Types
### Prospective vs. Retrospective (Data Collection)
  * **Prospective** study: data are collected from the *beginning* of the study
  * **Retrospective** study: data are acquired from previous events

### Longitudinal vs. Cross-Sectional (Duration)
  * **Longitudinal** study: the researcher collects information over a period of time (possibly multiple times from each patient)
    * record patterns of change
    * establish direction & magnitude of causal relationships
  * **Cross-sectional** study: individuals are observed *only once* (at 1 point in time)

Most surveys are cross-sectional, whereas experiments are usually longitudinal. Cross-sectional research provides the "big picture" and a foundation for proposing more expensive research.

### Case-Control vs. Cohort Studies (Grouping)
  * **Case-Control** study: first, patients are treated and then they are selected for inclusion in the study *based on a certain criterion* (e.g. whether they responded to a certain medication)
  * **Cohort** study: first, subjects of interest are selected and then these subjects are studied over time (e.g. for their response to a treatment)

### Randomized Controlled Trial (Experiment)
The gold standard for experimental clinical trials and the basis for the approval of new medications.

## Statistical Study Types
### Study Design Tree

In [2]:
from IPython.display import Image
Image(url="http://www.cebm.net/wp-content/uploads/2014/04/263_study-design.jpg")

### Study Types
1. **Sample Study/Survey**
  * *estimate parameter of a broader population*
2. **Observational Study**
  * *determine if a relationship (positive/negative correlation) exists between variables* (cannot conclude causality due to confounding variables)
  * observe outcomes *without imposing any treatment*
3. **Experimental Design**
  * *establish or show causality*
  * *actively impose* some treatment to observe the response

#### Observational vs. Experimental Comparative Studies
**Observational** and **Experimental** (randomized experiments) studies are *comparative* statistical studies that explore differences in explanatory variables (`x`)

##### Observational
Differences occurring in a *natural setting/grouping* (no randomly assigned intervention)
  * some patients **are already undergoing the treatment** and others **are already not undergoing the treatment**
  
##### Randomized Experiment
Differences occurring as a result of *randomly assigned interventions*.
  * randomly assign patients into **treatment group** (receive treatment) and **control group** (placebo)

#### Relevant Links
  1. https://www.illustrativemathematics.org/content-standards/HSS/IC/B/3/tasks/2118
  2. https://onlinecourses.science.psu.edu/stat100/node/54
  3. https://www.khanacademy.org/math/ap-statistics/gathering-data-ap#types-of-studies-experimental-vs-observational

## Sample Survey
**Question**: *What is the `<population parameter>` for population `A`?*

Estimate the population parameter (e.g. average blood pressure) by computing the corresponding statistic for a sample that was *randomly sampled from the same population*.

This approach is useful because it is often impractical to measure the parameter for an entire population. Essentially, a sample survey uses a sample of people who are intended to represent the larger population.
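The idea can be sketched with a small simulation. Everything here is an illustrative assumption (a made-up population of blood-pressure values, its size, and the sample size), not data from a real survey:

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: systolic blood pressure (mmHg) for 100,000 people.
population = [rng.gauss(120, 15) for _ in range(100_000)]

# Simple random sample without replacement.
sample = rng.sample(population, 500)

population_mean = statistics.fmean(population)  # usually unknown in practice
sample_mean = statistics.fmean(sample)          # the estimate of the parameter
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"sample mean     = {sample_mean:.1f} +/- {1.96 * standard_error:.1f} (approx. 95% interval)")
print(f"population mean = {population_mean:.1f}")
```

The sample mean should land close to the population mean, with the standard error giving a rough sense of how far off it is likely to be.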

## Observational Study (Natural or Quasi-Experiment)
A type of study in which the researcher uses *observed information* to learn about a group of individuals in a natural setting. The researcher *only* collects information and does not "interact" with the study population (e.g. introduce an intervention/treatment).

**Goal**: strive for association conclusions since "cause and effect" conclusions are not possible

#### Disclaimer: Correlation does not mean Causality
Observational studies do not prove causality between 2 variables, regardless of any correlation.

*Why?*  
**Confounding (*aka lurking*) variables** are inherent. They are *uncontrolled inputs* that provide alternative explanations for the results (effect on dependent variables).
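A toy simulation can make this concrete. In the sketch below, a hypothetical confounder `z` drives both `x` and `y`, while `x` has no effect on `y` at all; the variable names and the noise model are assumptions made purely for illustration:

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)

# Hypothetical confounder z influences both x and y; x never enters y.
z = [rng.gauss(0, 1) for _ in range(20_000)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

overall = pearson(x, y)

# Hold the confounder (nearly) constant by looking at a narrow slice of z.
stratum = [(xi, yi) for xi, yi, zi in zip(x, y, z) if abs(zi) < 0.1]
within = pearson([p[0] for p in stratum], [p[1] for p in stratum])

print(f"overall correlation of x and y: {overall:.2f}")
print(f"correlation with z held near 0: {within:.2f}")
```

The overall correlation is clearly positive even though `x` does not cause `y`; once `z` is held roughly fixed, the correlation largely disappears.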

### Descriptive Studies
Descriptive studies are often conducted as surveys. They explore characteristics of the population, but do not answer questions about *how/when/why* the characteristics occurred (this is done under analytical research).

Results are data descriptions that are factual, accurate, and systematic, but such research cannot describe *what caused the situation*. Hence, descriptive research cannot be used to establish a causal relationship where 1 variable affects another.

### Analytical Studies (Comparative)
Research testing hypotheses to answer questions about *how/when/why* the characteristics occur within a specific population.

Study Type | Directionality
:----|:---:
Cohort Study | *Exposure -> Outcome*
Case-Control Study | *Outcome -> Exposure*
Cross-Sectional Study | Exposure & Outcome *at the same time*

#### Cohort Studies
A form of longitudinal observational study that
  * begins with a group of people *who do **not** have the disease* and have been exposed to a substance ("intervention")
  * takes baseline measurements
  * follows them over time to determine whether there are correlations between the exposure and the outcome

**Cohort**: a group of people who share *a common characteristic or experience* within a defined period (e.g. are born, are exposed to a drug/vaccine/pollutant, undergo a medical procedure).

The **comparison group** is otherwise similar but may be drawn from
  * the general population from which the cohort was drawn
  * another cohort thought to have had little or no exposure to the substance ("intervention") under investigation
  
Research Question: *Is exposure to **`X`** (e.g. smoking) associated with outcome **`Y`** (e.g. lung cancer)?*
  * Such a study would recruit and follow a group/cohort of *smokers* (exposed group) and *non-smokers* (unexposed group) for a set period of time. The differences in incidence of **`Y`** (lung cancer) between the groups would be noted.
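One common way to summarize the difference in incidence between the two groups is the relative risk (risk ratio). The sketch below uses made-up follow-up counts; the function name and the numbers are assumptions for illustration only:

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Ratio of outcome incidence in the exposed vs. the unexposed cohort."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Hypothetical follow-up counts (not real data): outcome cases per 1,000 people.
rr = relative_risk(exposed_cases=30, exposed_total=1000,
                   unexposed_cases=5, unexposed_total=1000)
print(f"relative risk = {rr:.1f}")
```

A relative risk above 1 indicates that the outcome was more frequent among the exposed group; on its own it still describes an association, not a proven cause.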
  
#### Case-Control Study
A type of observational study in which 2 existing groups *differing in outcome* are identified and compared on the basis of some supposed causal attribute. Case-control studies are often used to **identify factors that may contribute to a medical condition** by comparing subjects who have the condition/disease (the **'cases'**) with patients who do not have it but are otherwise similar (the **'controls'**).

It is less expensive than a cohort study.
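Because the researcher chooses how many cases and controls to include, incidence cannot be estimated directly from a case-control study; the association is commonly summarized with an odds ratio instead. A minimal sketch, with hypothetical counts and a hypothetical function name:

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = cases_exposed / cases_unexposed
    odds_controls = controls_exposed / controls_unexposed
    return odds_cases / odds_controls


# Hypothetical 2x2 table (not real data): 100 cases and 100 controls.
estimate = odds_ratio(cases_exposed=80, cases_unexposed=20,
                      controls_exposed=40, controls_unexposed=60)
print(f"odds ratio = {estimate:.1f}")
```

An odds ratio of 1 would mean exposure is equally common among cases and controls.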

#### Cross-Sectional Study
A descriptive study in which the *disease & exposure status are measured simultaneously* in a given population. Cross-sectional studies can be thought of as providing a **"snapshot"** of the frequency and characteristics of a disease in a population ***at a particular point in time***.

This type of data can be used to assess the prevalence of acute or chronic conditions in a population. However, since exposure and disease status are measured *at the same point in time*, *it may not be possible to distinguish whether the exposure **preceded or followed** the disease*. Thus, cause and effect relationships are not certain.

## Experimental Studies
Experimental studies deliberately influence events, creating differences in the explanatory variables (e.g. treating patients with a new type of medication), and investigate the effects of these interventions. The goal is to establish or demonstrate *causality*, allowing for possible "**cause & effect**" conclusions.

They are *cohort studies*, where allocation of patients (investigational units) into **treatment** and **control** groups is achieved by a random process (*akin to a coin flip*).

In a designed experiment, there may be several conditions (*factors* or explanatory variables) that are **controlled** by the experimenter. By having the groups differ in *only 1 aspect* (**factor: *treatment***), the study can detect the effect of the treatment on the patients (*causality*).

In [1]:
Image(url="http://www.itl.nist.gov/div898/handbook/pri/section1/gifs/img1351.gif")

# Design of Experiments (DOE) (Experimental Design)

## What is Experimental Design?
Experimental design is a framework to carefully plan experiments *in advance* so that the data obtained can be analyzed to yield *valid* and *objective* conclusions. Essentially, a rigorous study design reduces bias.

Experimental research design is concerned with the examination of the effect of independent variable(s) on the dependent variable, where the independent variable is manipulated by some treatment/intervention. The effect of such interventions is observed on the dependent variable.

The goal is to:
  * describe how participants are allocated into experimental groups
  * minimize or eliminate **confounding variables** offering alternative explanations for experimental results
  * allow inferences about the relationship between independent variables (`x`) and dependent variables (`y`)
  * reduce variability, making it easier to identify differences between treatment outcomes

#### Example
The importance of a good study design (*prior to any actual analysis*) is illustrated by the effect of introducing the *clinicaltrials.gov* registry, which requires the recording of trial methods and outcome measures *before* data collection: **a drastic reduction of cardiovascular drug studies showing positive outcomes, from 57% to 8%**.

#### Resources
1. http://www.statisticshowto.com/experimental-design/



## Principles
### 1) Control
Control the effects of extraneous variables on the response (`y`). This is achieved by comparing *treatment* groups to a ***control*** group (placebo).

### 2) Replication
It's vital to replicate the experiment on *many subjects* to quantify the natural variation in the experiment.

### 3) Randomization
Random assignment of groups helps reduce *selection bias* and *allocation bias*. Hence, errors introduced by confounding variables are minimized by balancing their effects across the groups. In other words, bias is avoided by splitting the subjects to be tested into an *intervention* and *control* group randomly.

Essentially, the random assignment of treatments (intervention/placebo) ensures that measured *and* unmeasured characteristics/factors are **evenly divided** over the groups. Thus, any differences in these factors are most likely due to chance (better "control" of them to isolate the factor of interest - *treatment*).
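A minimal sketch of simple random assignment, assuming a hypothetical roster of 1,000 subjects with one measured characteristic (age). The roster, the seed, and the field names are made up for illustration:

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical subjects with one measured characteristic (age).
subjects = [{"id": i, "age": rng.randint(18, 69)} for i in range(1000)]

# Shuffle, then split in half: every subject is equally likely to land in either group.
shuffled = subjects[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
treatment, control = shuffled[:half], shuffled[half:]

mean_age_treatment = statistics.fmean(s["age"] for s in treatment)
mean_age_control = statistics.fmean(s["age"] for s in control)

print(f"treatment: n={len(treatment)}, mean age={mean_age_treatment:.1f}")
print(f"control:   n={len(control)}, mean age={mean_age_control:.1f}")
```

Age was never used to form the groups, yet the two mean ages should come out close to each other; the same balancing applies, on average, to characteristics that were not measured at all.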

### Note: *Blinding*
Blinding also controls for biases that are inherent in experimental design and difficult to control for (e.g. differences in individuals, such as age, race, or sex, that may skew results).
  * **Single Blind**: the participant doesn't know whether they are receiving a treatment or placebo (*prevents the placebo effect*)
    * **Placebo Effect**: a patient's response to *any treatment*, regardless of whether it's effective or not
  * **Double Blind**: neither the participant nor the researcher knows who is receiving the treatment or the placebo


Essentially, blinding prevents subjective influence.

In [3]:
Image(url="http://www.statisticshowto.com/wp-content/uploads/2015/09/randomized-controlled-trial.png")

# Types of Experimental Designs (Randomization Techniques)
## Randomized Controlled Trial
Randomized Controlled Trials are the staple of DOE. They are the gold standard for clinical trials and validating new treatments.

Properties:
  * Random assignment of subjects into **treatment(s)** & **control** groups
    * random assignment minimizes the effect of confounding factors by balancing them across the groups, which reduces bias
  * several conditions (*factors*) are controlled so that the groups differ in *only 1 aspect* (**treatment**), which enables the results to illustrate the effect of the treatment on the patients

## Independent Measures Design (Between Subjects)
Separate groups are created for *each* **treatment**. Each participant is only assigned to 1 treatment group.

The samples/groups in the study are *independent* of one another. Thus, the results of one group do not affect the other's.

#### Example: *Test 2 New Depression Medications*
Randomly assign patients into these groups:
  * Group 1: **Medication 1**
  * Group 2: **Medication 2**
  * Group 3: **Placebo**
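The assignment for this example could be sketched as follows. The helper name `assign_groups`, the patient labels, and the choice of 90 patients are assumptions made for illustration:

```python
import random


def assign_groups(participants, group_names, seed=None):
    """Shuffle participants, then deal them round-robin so each lands in exactly one group."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {name: [] for name in group_names}
    for index, participant in enumerate(shuffled):
        groups[group_names[index % len(group_names)]].append(participant)
    return groups


# Hypothetical roster of 90 patients.
patients = [f"patient_{i:02d}" for i in range(90)]
groups = assign_groups(patients, ["Medication 1", "Medication 2", "Placebo"], seed=7)

for name, members in groups.items():
    print(name, len(members))
```

Dealing round-robin after a shuffle keeps the group sizes equal while leaving group membership random.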

## Factorial Design
Factorial experimental design investigates the effect of 2+ independent variables (`x` == factors) on 1 dependent variable (`y`). Essentially, **each combination of factor levels** is tested (*full factorial design*).

### Factorial Combinations (n<sub>levels</sub><sup>n<sub>factors</sub></sup>)
  * An experiment with 3 factors with 2 levels each: 2x2x2 = 2<sup>3</sup> = 8 combinations
  * An experiment with 2 factors with 3 levels each: 3x3 = 3<sup>2</sup> = 9 combinations
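The combinations can be enumerated directly. This is a small sketch in which the helper name `full_factorial`, the factor names (`A`, `B`, `C`), and the level labels are placeholders:

```python
from itertools import product


def full_factorial(factors):
    """Return every combination of levels for a dict of {factor name: list of levels}."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*(factors[n] for n in names))]


# 3 factors with 2 levels each -> 2**3 runs.
runs_three_factors = full_factorial(
    {"A": ["low", "high"], "B": ["low", "high"], "C": ["low", "high"]}
)
# 2 factors with 3 levels each -> 3**2 runs.
runs_two_factors = full_factorial({"A": [1, 2, 3], "B": [1, 2, 3]})

print(len(runs_three_factors), len(runs_two_factors))  # 8 9
```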

### Null Outcome
A null outcome occurs when the experiment's outcome (the dependent variable, e.g. SAT score) remains the same (*unaffected*) regardless of how the levels and factors were combined. The factors did not have any impact on the dependent variable.

### Main Effect & Interaction Effect
Results are in the form of 2 types of effects:
  * **Main Effect**: the effect of an independent variable (*1 of the factors*) on the dependent variable
    * for a main effect to exist: there must be a consistent trend across the different levels
  * **Interaction Effect**: the effect occurring *between factors* (pair of factors causing the effect)


#### Example: *Investigate Components for Increasing SAT Scores*
Factorial experimental design exploring components (SAT class, SAT prep book, extra homework) in increasing SAT scores.
 
Components (**Factors**):
  * SAT class (yes or no)
  * SAT prep book (yes or no)
  * Extra homework (yes or no)
  
**3 factors with 2 levels** (yes/no): 2<sup>3</sup>

Effects:
  * **Main Effect**: SAT class causing an increase in SAT scores (those who took the class were consistently higher than those who didn't)
  * **Interaction Effect**: interaction between factors SAT class & SAT prep book
    * group of students who took the **SAT class** *AND* used the **SAT prep book** showed an increase in SAT scores
    * group of students who took the **SAT class** *BUT did NOT* use the **SAT prep book** didn't show any increase
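To make the two kinds of effects concrete, the sketch below computes them from cell means for two of the factors (SAT class and SAT prep book). The scores are invented for illustration, and the interaction is expressed as a simple difference of differences, which is one of several conventions:

```python
# Hypothetical mean SAT scores per cell, keyed by (took_class, used_book).
cell_means = {
    (False, False): 1000,
    (False, True): 1010,
    (True, False): 1050,
    (True, True): 1120,
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Main effect: average over the other factor, then compare the two levels.
main_effect_class = (mean(v for (c, b), v in cell_means.items() if c)
                     - mean(v for (c, b), v in cell_means.items() if not c))
main_effect_book = (mean(v for (c, b), v in cell_means.items() if b)
                    - mean(v for (c, b), v in cell_means.items() if not b))

# Interaction: does the effect of the book depend on whether the class was taken?
book_effect_with_class = cell_means[(True, True)] - cell_means[(True, False)]
book_effect_without_class = cell_means[(False, True)] - cell_means[(False, False)]
interaction = book_effect_with_class - book_effect_without_class

print(main_effect_class, main_effect_book, interaction)  # 80.0 40.0 60
```

A non-zero interaction means the benefit of one factor changes with the level of the other; with these made-up numbers the prep book helps far more when combined with the class.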

## Randomized Block Design
The experimental design divides experimental subjects into **homogeneous blocks**. Then, treatments are *randomly assigned* to the subjects within each block. The purpose of doing this is to **reduce variability** in the experiments.
  * Each block contains subjects that are *very similar* (e.g. blocks by sex - male & female)

Properties:
  * Divide/stratify patients into **homogeneous blocks**, reducing variability in the experiment (**blocking**)
  * Randomly assign *treatment/placebo* within each block (**randomization**)
    * each block has a treatment and placebo group

### Blocks by Confounding Variable (Account for Source of Variability)
Blocking by a confounding variable (source of potential variability) allows the experimental design to account for it and reduce its effect on results.

For example, if a variable was believed to be a *confounding variable* (source of variability), dividing the groups into blocks of that variable will mitigate any bias associated with that confounding variable. In other words, the block design ensures that the variable is closely balanced and equally representative in each group (*treatment/placebo*).
  * Example: simple random assignment with a population containing an age imbalance may produce skewed results due to disproportionate representation of certain age groups in one of the groups.
  
    
#### Example: Homogeneous Blocks of Age for a Drug Study (Treatment/Placebo)
**Age** was determined to be a confounding variable, potentially affecting the results of the study. Thus the randomized block design divided the subjects into different age groups, each forming a **homogeneous block**.

The study tests the new drug on 1,000 people (ages 18-69):

Age|Placebo|Drug
:--|:--:|:--:
18-29|100|100
30-39|100|100
40-49|100|100
50-59|100|100
60-69|100|100

Each block contains 200 people from *a single age group*. Within each block, they are then randomly assigned to the placebo or drug group (100 each). Doing so *removes age as a potential source of variability*.
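The allocation in the table can be sketched as a block-then-randomize procedure. The subject identifiers and the seed below are hypothetical; only the block labels and sizes follow the example:

```python
import random
from collections import Counter, defaultdict

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
age_blocks = ["18-29", "30-39", "40-49", "50-59", "60-69"]

# Hypothetical roster: 200 subjects per age block, 1,000 in total.
subjects = [(f"{block}#{i}", block) for block in age_blocks for i in range(200)]

# 1) Block: group subjects by the blocking variable (age group).
blocks = defaultdict(list)
for subject_id, block in subjects:
    blocks[block].append(subject_id)

# 2) Randomize within each block: shuffle, then split the block in half.
assignment = {}
for block, members in blocks.items():
    rng.shuffle(members)
    half = len(members) // 2
    for subject_id in members[:half]:
        assignment[subject_id] = "Drug"
    for subject_id in members[half:]:
        assignment[subject_id] = "Placebo"

counts = Counter((block, assignment[subject_id]) for subject_id, block in subjects)
for block in age_blocks:
    print(block, counts[(block, "Placebo")], counts[(block, "Drug")])
```

Because the split happens inside each block, every age group contributes exactly 100 subjects to each arm, regardless of how the shuffle turns out.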

## Matched-Pairs Design (Dependent Samples)
A special case of randomized block design. Like randomized block design, the random assignment of subjects into *treatment* and *placebo* groups is done for each block. In the matched-pairs design, the *blocks are composed of **matched pairs***. Thus, within each pair, one member is randomly assigned to the treatment group and the other to the placebo group.

**Goal**: maximize homogeneity in each pair. You want each pair to be as similar as possible.

#### Example: Homogeneous Blocks of Age-Pairs for a Drug Study (Treatment/Placebo)
An experiment designed to test a new drug has blocks of 200 males and 200 females. Each block contains 100 pairs, matched according to some criteria other than sex (e.g. age, medications, health conditions, etc.).

Then *each pair* is treated like a block, with one member randomly assigned to receive the drug and the other the placebo.
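A minimal sketch of the within-pair randomization, assuming the pairs have already been matched. The helper name `assign_matched_pairs` and the subject labels are hypothetical:

```python
import random


def assign_matched_pairs(pairs, seed=None):
    """For each matched pair, flip a coin to decide which member receives the drug."""
    rng = random.Random(seed)
    assignment = {}
    for first, second in pairs:
        if rng.random() < 0.5:
            first, second = second, first
        assignment[first] = "Drug"
        assignment[second] = "Placebo"
    return assignment


# Hypothetical block of 100 already-matched pairs (200 subjects).
pairs = [(f"pair{i}_a", f"pair{i}_b") for i in range(100)]
assignment = assign_matched_pairs(pairs, seed=3)

n_drug = sum(1 for arm in assignment.values() if arm == "Drug")
print(n_drug, len(assignment) - n_drug)  # 100 100
```

Each pair always contributes one subject to each arm, so the arms stay balanced on whatever criteria were used for matching.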

