In [35]:
import numpy as np
import matplotlib.pyplot as plt

# Let us welcome SciPy!
from scipy.stats import (trim_mean, mode, skew, 
gaussian_kde, pearsonr, spearmanr, beta)

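import seaborn as sns
# sns.set() resets the plotting context to its default, so call it first
# and apply the 'poster' context and grid style afterwards.
sns.set(rc={'figure.figsize': (16., 9.)})
sns.set_context('poster')
sns.set_style('whitegrid')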

import pandas as pd
import scipy.stats.distributions as dist

import matplotlib as mpl
from dateutil.parser import parse
from statsmodels.tsa.seasonal import seasonal_decompose

# Intro to probability

## Rules

Laplace: $$ P(\text{event})=\frac{\text{favorable events}}{\text{total events}} $$

Union of events: $$ P(A \cup B)=P(A)+P(B)-P(A \cap B) $$

Intersection of events: $$ P(A \cap B)=P(A) \cdot P(B|A) $$
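As a quick sanity check of the union and intersection rules, here is a minimal sketch on a single roll of a fair die. The events `A` (even number) and `B` (greater than 3) and the helper `prob` are hypothetical choices made only for this illustration.

```python
from fractions import Fraction

# Sample space of one fair die roll (illustrative example).
omega = set(range(1, 7))
A = {x for x in omega if x % 2 == 0}  # even numbers: {2, 4, 6}
B = {x for x in omega if x > 3}       # greater than 3: {4, 5, 6}

def prob(event):
    """Laplace rule: favorable outcomes over total outcomes."""
    return Fraction(len(event), len(omega))

# Union rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
union_direct = prob(A | B)
union_rule = prob(A) + prob(B) - prob(A & B)

# Intersection rule: P(A ∩ B) = P(A) * P(B|A)
p_b_given_a = Fraction(len(A & B), len(A))
intersection_rule = prob(A) * p_b_given_a

print(union_direct, union_rule)          # both sides of the union rule
print(prob(A & B), intersection_rule)    # both sides of the intersection rule
```

Using `Fraction` keeps the probabilities exact, so both sides of each rule can be compared with `==`.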

[Monty Hall problem](https://en.wikipedia.org/wiki/Monty_Hall_problem)?

## Questions
* 3 coins are tossed: what is the probability of 3 heads in a row?
* 3 coins are tossed: what is the probability of 2 heads in total?
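One way to check your answers is to enumerate every outcome and apply the Laplace rule. The sketch below assumes fair coins and reads "2 heads in total" as exactly two heads; both are assumptions, not something the questions state.

```python
from itertools import product

# All 2**3 equally likely outcomes of tossing 3 fair coins.
outcomes = list(product("HT", repeat=3))

# Laplace rule: favorable outcomes / total outcomes.
p_three_heads = sum(o.count("H") == 3 for o in outcomes) / len(outcomes)

# Assumption: "2 heads in total" means exactly 2 heads.
p_two_heads = sum(o.count("H") == 2 for o in outcomes) / len(outcomes)

print(p_three_heads, p_two_heads)
```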

# Discrete probability distributions

## Bernoulli distribution

[Wikipedia](https://en.wikipedia.org/wiki/Bernoulli_distribution)

Keys: 2 possible outcomes $\Omega=\{0,1\}$ with probability $1-p$ and $p$. Example: biased coin.
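A minimal sketch of the biased coin using `scipy.stats.bernoulli`. The value `p = 0.3` is an arbitrary choice for illustration.

```python
from scipy.stats import bernoulli

p = 0.3  # hypothetical probability of outcome 1 (e.g. heads)
coin = bernoulli(p)

pmf_0 = coin.pmf(0)  # probability of outcome 0, i.e. 1 - p
pmf_1 = coin.pmf(1)  # probability of outcome 1, i.e. p

# Simulate 1000 tosses; random_state makes the draw reproducible.
flips = coin.rvs(size=1000, random_state=0)
print(pmf_0, pmf_1, flips.mean())
```

The sample mean of `flips` should land close to `p`, since the mean of a Bernoulli variable is $p$.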

## Binomial Distribution

[Wikipedia](https://en.wikipedia.org/wiki/Binomial_distribution)

The Binomial distribution generalizes the Bernoulli distribution to the case where we perform more than one "trial".
It gives us the probability of having $k$ successes in $N$ independent Bernoulli trials, each with success probability $p$.
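The probability mass function is $P(k) = \binom{N}{k} p^k (1-p)^{N-k}$. The sketch below compares that formula with `scipy.stats.binom.pmf`; the values of `N`, `p` and `k` are hypothetical.

```python
from math import comb
from scipy.stats import binom

N, p, k = 10, 0.5, 3  # hypothetical: 3 heads in 10 fair coin tosses

# Direct evaluation of the binomial formula.
manual = comb(N, k) * p**k * (1 - p)**(N - k)

# Same quantity from SciPy: pmf(k, n, p).
scipy_value = binom.pmf(k, N, p)

print(manual, scipy_value)
```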

## Poisson distribution

The Poisson distribution is used to describe _how many times something might happen in a specific timeframe_.
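For an average of $\lambda$ events per timeframe, the probability of observing exactly $k$ events is $P(k) = \frac{\lambda^k e^{-\lambda}}{k!}$. A small sketch with `scipy.stats.poisson`, where `lam = 4` is a made-up rate:

```python
import math
from scipy.stats import poisson

lam = 4  # hypothetical average number of events per timeframe
k = 2

# Direct evaluation of the Poisson formula.
manual = lam**k * math.exp(-lam) / math.factorial(k)

p_exactly_k = poisson.pmf(k, mu=lam)  # P(X = k)
p_at_most_k = poisson.cdf(k, mu=lam)  # P(X <= k)

print(manual, p_exactly_k, p_at_most_k)
```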

## Exponential distribution

Models the time it takes for a random event (with a constant rate) to occur.
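A minimal sketch with `scipy.stats.expon`, assuming a hypothetical rate of 2 events per unit of time. SciPy parametrizes the distribution by `scale`, which equals 1 divided by the rate.

```python
import numpy as np
from scipy.stats import expon

rate = 2.0  # hypothetical: 2 events per unit of time
waiting = expon(scale=1 / rate)

mean_wait = waiting.mean()      # expected waiting time, 1 / rate
p_within_1 = waiting.cdf(1.0)   # P(T <= 1) = 1 - exp(-rate * 1)

print(mean_wait, p_within_1, 1 - np.exp(-rate))
```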

# Linear regression

# Inferential statistics

In real-life situations we only have access to samples of data, not to the entire population. How, then, can we draw conclusions about the underlying population as a whole? How confident can we be in these conclusions? The answer lies in inferential statistics.

## The Bootstrap

In real life we cannot recreate the sampling distribution. We can infer the values of some statistics with tricks such as the ones described above and the CLT. But can we do something more general that works for any statistic?


Bootstrapping consists of recreating an approximate sampling distribution from a single sample! Let's use it to calculate, in a different way, the above estimate of the mean and its CI.

1) First, sample from the sample you already have!

2) Compute the mean of the bootstrapped sample.

3) Repeat that process many times.

4) Compute the mean for all the repetitions you have made.
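The four steps above can be sketched as follows. The `sample` array is synthetic data generated only for this illustration, and the confidence interval uses the percentile method, which is one common choice among several.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample standing in for the one you already have.
sample = rng.normal(loc=50, scale=10, size=200)

n_boot = 5000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # 1) Resample from the sample, with replacement, keeping the same size.
    resample = rng.choice(sample, size=sample.size, replace=True)
    # 2) Compute the mean of the bootstrapped sample.
    boot_means[i] = resample.mean()
# 3) The loop repeats the process n_boot times.

# 4) Mean over all repetitions, plus a 95% percentile confidence interval.
estimate = boot_means.mean()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(estimate, (ci_low, ci_high))
```

Replacing `resample.mean()` with any other statistic (median, trimmed mean, and so on) gives the bootstrap distribution of that statistic with no other change.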

# Hypothesis testing

The goal of classical hypothesis testing is to answer the question, “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” Here’s how we answer that question:

1. The first step is to quantify the size of the apparent effect by choosing a test statistic.

2. The second step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real (i.e., that it can be due to chance).

3. The third step is to compute a p-value, which is the probability of seeing an effect at least as large as the apparent one if the null hypothesis is true.

4. The last step is to interpret the result. If the p-value is low, the effect is said to be statistically significant, which means that it is unlikely to have occurred by chance. In that case we infer that the effect is more likely to appear in the larger population.

The logic of this process is similar to a proof by contradiction. To prove a mathematical statement, $A$, you assume temporarily that $A$ is false. If that assumption leads to a contradiction, you conclude that $A$ must actually be true. Similarly, to test a hypothesis like, “This effect is real,” we assume, temporarily, that it is not. That’s the null hypothesis.

The research question for this section is: “The proportion of the population of Ireland with heart disease is 42%. Are more people suffering from heart disease in the US?”

Now, find the answer to this research question step by step.

Step 1: Define the null hypothesis and the alternative hypothesis.

Step 2: Assume that the dataset above is a representative sample from the population of the US. Then calculate the sample proportion of people with heart disease, which serves as our estimate of the US population proportion.

Step 3: Calculate the test statistic

Step 4: Compute the p-value

Step 5: Infer the conclusion from the p-value
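A minimal sketch of these five steps as a one-sided, one-sample z-test for a proportion. The counts `n` and `n_disease` are made-up placeholders, not values from the dataset, and the significance level `alpha = 0.05` is a conventional assumption; substitute the numbers from your own sample.

```python
import numpy as np
from scipy.stats import norm

# Step 1: H0: p = 0.42 (same as Ireland), H1: p > 0.42 (higher in the US).
p0 = 0.42

# Step 2: sample proportion. These counts are hypothetical placeholders.
n = 1000         # hypothetical sample size
n_disease = 470  # hypothetical number of people with heart disease
p_hat = n_disease / n

# Step 3: z statistic, using the standard error under the null hypothesis.
se = np.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Step 4: one-sided p-value, P(Z >= z) under H0.
p_value = norm.sf(z)

# Step 5: compare with the chosen significance level (assumed 0.05).
alpha = 0.05
reject_null = p_value < alpha

print(z, p_value, reject_null)
```

The normal approximation used here is reasonable when both `n * p0` and `n * (1 - p0)` are comfortably large; for small samples an exact binomial test would be a safer choice.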

# Time series

In [85]:
# import Australian drugs time series dataframe
df = pd.read_csv(
    'https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', 
    parse_dates=['date'],
    #index_col='date'
)