# Chapter 1: Exploratory data analysis

### By the standards of reliable data, anecdotal evidence usually fails, because:
1. Small number of observations: If pregnancy length is longer for first babies, the difference is probably small compared to natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.
2. Selection bias: People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.
3. Confirmation bias: People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
4. Inaccuracy: Anecdotes are often personal stories, and are often misremembered, misrepresented, repeated inaccurately, etc.

## 1.1. A statistical approach
1. Data collection: We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
2. Descriptive statistics: We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
3. Exploratory data analysis: We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
4. Estimation: We will use data from a sample to estimate characteristics of the general population.
5. Hypothesis testing: Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.

By performing these steps with care to avoid pitfalls, we can reach conclusions
that are more justifiable and more likely to be correct.

## 1.2. Extracting information from data
1. Variables: understand what each variable in the data records and how it encodes information.
2. Transformation (data cleaning): check for errors, deal with special values, and convert data into different formats.
3. Validation: when data is exported from one environment and imported into another, errors may occur, so verify the result. For example, `df.birthwgt_lb.value_counts().sort_index()` summarizes the `birthwgt_lb` Series of the DataFrame as a table of counts per value, which makes unexpected values easy to spot.
4. Interpretation: to work with data effectively, we have to think on two levels at the same time: the level of statistics and the level of context.
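The validation and cleaning steps above can be sketched without pandas. This is only an illustration: the values in `birthwgt_lb`, the codes in `SPECIAL_CODES`, and the 20 lb plausibility limit are all assumptions made up for the example, not values taken from the survey.

```python
from collections import Counter

# Hypothetical raw values; 97, 98 and 99 stand in for special codes such as
# "not ascertained", "refused" and "don't know" (assumed for illustration).
birthwgt_lb = [7, 6, 8, 7, 99, 7, 51, 6, 98, 8]
SPECIAL_CODES = {97, 98, 99}
MAX_PLAUSIBLE_LB = 20  # assumed plausibility limit

# Validation: tally each value and print in sorted order, similar in spirit
# to value_counts().sort_index().
counts = Counter(birthwgt_lb)
for value in sorted(counts):
    print(value, counts[value])

# Transformation: replace special codes and implausible values with None.
cleaned = [
    x if x not in SPECIAL_CODES and x <= MAX_PLAUSIBLE_LB else None
    for x in birthwgt_lb
]
valid = [x for x in cleaned if x is not None]
```

The sorted tally makes the stray 51 and the special codes stand out, which is the point of the validation step.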

# Chapter 2: Distributions

## 2.1. Histograms
A histogram is a graph that shows the frequency of each value, that is, the number of times the value appears. Things to look for:

1. How the data are distributed: the most common value (the mode), the center (mean or median), and the overall shape (for example, approximately normal/Gaussian).
2. Whether the values match what we would expect (the mode, a long tail to the left or right, ...). This depends on the variable being plotted.
3. Outliers: looking at a histogram, it is easy to identify the most common values and the shape of the distribution, but rare values are not always visible.

For example, in the list of pregnancy lengths for live births, the 10 lowest values are [0, 4, 9, 13, 17, 18, 19, 20, 21, 22]. Values below 10 weeks are certainly errors; the most likely explanation is that the outcome was not coded correctly. Values higher than 30 weeks are probably legitimate. Between 10 and 30 weeks, it is hard to be sure; some values are probably errors, but some represent premature babies.

- In particular, unlikely values are those that fall far outside what we expect for the variable.

The best way to handle outliers depends on “domain knowledge”; that is, information about where the data come from and what they mean. And it depends on what analysis you are planning to perform.

In this example, the motivating question is whether first babies tend to be early (or late). When people ask this question, they are usually interested in full-term pregnancies, so for this analysis I will focus on pregnancies longer than 27 weeks.
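A minimal sketch of tallying a histogram, listing its extreme values, and applying the 27-week cutoff, using only the standard library. The `lengths` list is made-up illustrative data, not the survey data.

```python
from collections import Counter

# Hypothetical pregnancy lengths in weeks (illustrative values only).
lengths = [39, 40, 39, 38, 4, 41, 39, 40, 22, 39, 37, 40]

# A histogram maps each value to the number of times it appears.
hist = Counter(lengths)

# The mode is the most frequent value.
mode, mode_count = hist.most_common(1)[0]

# Rare values are hard to see in a plot, so list the extremes explicitly.
lowest = sorted(hist)[:3]

# Keep only pregnancies longer than 27 weeks.
full_term = [x for x in lengths if x > 27]

print(mode, mode_count, lowest, len(full_term))
```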

## 2.2. Summarizing distributions
Once we have looked at a distribution, we often want to describe it with a few descriptive statistics. Statistics designed for this purpose are called **summary statistics**. Some of the characteristics we might want to report are:

1. central tendency: Do the values tend to cluster around a particular point?
2. modes: Is there more than one cluster?
3. spread: How quickly do the probabilities drop off as we move away from the modes?
4. outliers: Are there extreme values far from the modes?

The most common summary statistic is the mean, which is meant to describe the central tendency of the distribution.

## 2.3. Variance
Variance is a summary statistic intended to describe the variability or spread of a distribution. The variance of a set of values is

$ \displaystyle{S^2 = \frac{1}{n}\sum_i{\left(x_i - \bar{x}\right)^2}}  $

- The square root of the variance, $S$, is the standard deviation.

For example, the mean pregnancy length is 38.6 weeks and the standard deviation is 2.7 weeks, which means we should expect deviations of 2-3 weeks to be common.

The variance of pregnancy length is 7.3, which is hard to interpret, especially since the units are weeks squared ($\text{weeks}^2$).
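The variance formula above translates directly into code. This is a small sketch on made-up numbers; `variance` is a name chosen for the example.

```python
import math


def variance(xs):
    """Mean squared deviation from the mean (divides by n, as in the formula)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


xs = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
var = variance(xs)
std = math.sqrt(var)  # same units as the data, so easier to interpret
print(var, std)
```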

## 2.4. Effect size
In statistics, an effect size is a value measuring the strength of the relationship between two variables in a population, or a sample-based estimate of that quantity.

An effect size is a summary statistic intended to describe (wait for it) the size of an effect. For example, to describe the difference between two groups, one obvious choice is the difference in the means.

Cohen's d is a statistic intended to compare the difference between two groups to the variability within groups:

$$\displaystyle{d = \frac{\bar{x}_1 - \bar{x}_2}{s}}$$

- $s$ is the pooled standard deviation, the square root of the pooled variance. For populations, `pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)`; for samples, `pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)`.
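A short sketch of Cohen's d using the population form of the pooled variance. The function names and the two groups are assumptions for illustration.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def cohen_effect_size(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (n1 * var(group1) + n2 * var(group2)) / (n1 + n2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)


a = [1, 2, 3, 4, 5]  # mean 3, variance 2
b = [2, 3, 4, 5, 6]  # mean 4, variance 2
d = cohen_effect_size(a, b)
print(d)
```

Because d is expressed in standard deviations, it can be compared across variables measured in different units.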

## 2.5. Reporting results
How you report results also depends on your goals. If you are trying to demonstrate the importance of an effect, you might choose summary statistics that emphasize differences. If you are trying to reassure a patient, you might choose statistics that put the differences in context.

Of course your decisions should also be guided by professional ethics. It’s ok to be persuasive; you should design statistical reports and visualizations that tell a story clearly. But you should also do your best to make your reports honest, and to acknowledge uncertainty and limitations.

# Chapter 3: Probability Mass Functions
## 3.1. PMFs
Another way to represent a distribution is a probability mass function (PMF), which maps from each value to its probability. A probability is a frequency expressed as a fraction of the sample size, n. To get from frequencies to probabilities, we divide through by n, which is called normalization.

$ p_X(x) = P(X = x)$
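Normalization can be sketched in a few lines. `make_pmf` is a hypothetical helper name and the input values are illustrative.

```python
from collections import Counter


def make_pmf(values):
    """Map each value to its frequency divided by the sample size n."""
    counts = Counter(values)
    n = len(values)
    return {x: freq / n for x, freq in counts.items()}


pmf = make_pmf([1, 2, 2, 3, 5])
print(pmf)
```

After normalization the probabilities add up to 1.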

**limits of PMFs**

PMFs work well if the number of values is small. But as the number of values increases, the probability associated with each value gets smaller and the effect of random noise increases.

Overall, these distributions resemble the bell shape of a normal distribution, with many values near the mean and a few values much higher and lower.

But parts of this figure are hard to interpret. There are many spikes and valleys, and some apparent differences between the distributions. It is hard to tell which of these features are meaningful. Also, it is hard to see overall patterns; for example, which distribution do you think has the higher mean?

These problems can be mitigated by binning the data; that is, dividing the range of values into non-overlapping intervals and counting the number of values in each bin. Binning can be useful, but it is tricky to get the size of the bins right. If they are big enough to smooth out noise, they might also smooth out useful information.

# Chapter 4: Cumulative distribution functions
## 4.1. Percentiles
If you have taken a standardized test, you probably got your results in the form of a raw score and a percentile rank. In this context, the percentile rank is the fraction of people who scored lower than you (or the same). So if you are “in the 90th percentile,” you did as well as or better than 90% of the people who took the exam.

## 4.2. CDFs
The CDF is the function that maps from a value to its percentile rank.

The CDF is a function of x, where x is any value that might appear in the distribution. To evaluate CDF(x) for a particular value of x, we compute the fraction of values in the distribution less than or equal to x.
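Evaluating an empirical CDF follows directly from this definition. A minimal sketch, with `eval_cdf` as an assumed name and a small made-up sample:

```python
def eval_cdf(sample, x):
    """Fraction of values in the sample that are less than or equal to x."""
    count = sum(1 for value in sample if value <= x)
    return count / len(sample)


sample = [1, 2, 2, 3, 5]
for x in [0, 1, 2, 3, 4, 5]:
    print(x, eval_cdf(sample, x))
```

Note that `x` does not have to appear in the sample: the CDF is defined for any value, and it is a step function between observed values.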

## 4.3. Percentile-based statistics
Percentiles can be used to compute percentile-based summary statistics. For example, the median is the 50th percentile: the value that divides the distribution in half.

Interquartile range (IQR): a measure of the spread of a distribution, computed as the difference between the 75th and 25th percentiles.

Percentiles are often used to summarize the shape of a distribution. For example, the distribution of income is often reported in “quintiles”; that is, it is split at the 20th, 40th, 60th and 80th percentiles. Other distributions are divided into ten “deciles”. Statistics like these that represent equally-spaced points in a CDF are called quantiles.
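A sketch of percentile-based statistics on a small sample. There are several conventions for computing percentiles; the index rule used in `percentile` below is just one simple choice, and the scores are made up.

```python
def percentile(sample, rank):
    """Value at a percentile rank between 0 and 100 (simple index rule)."""
    ordered = sorted(sample)
    index = rank * (len(ordered) - 1) // 100
    return ordered[index]


scores = [55, 66, 77, 88, 99]
median = percentile(scores, 50)
iqr = percentile(scores, 75) - percentile(scores, 25)
print(median, iqr)
```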

## 4.4. Glossary
• percentile rank: The percentage of values in a distribution that are less than or equal to a given value.

• percentile: The value associated with a given percentile rank.

• cumulative distribution function (CDF): A function that maps from values to their cumulative probabilities. CDF(x) is the fraction of the sample less than or equal to x.

• inverse CDF: A function that maps from a cumulative probability, p, to the corresponding value.

• median: The 50th percentile, often used as a measure of central tendency.

• interquartile range: The difference between the 75th and 25th percentiles, used as a measure of spread.

• quantile: A sequence of values that correspond to equally spaced percentile ranks; for example, the quartiles of a distribution are the 25th, 50th and 75th percentiles.

• replacement: A property of a sampling process. “With replacement” means that the same value can be chosen more than once; “without replacement” means that once a value is chosen, it is removed from the population.

# Chapter 5: Modeling distributions
The distributions we have used so far are called empirical distributions because they are based on empirical observations, which are necessarily finite samples.

The alternative is an analytic distribution, which is characterized by a CDF that is a mathematical function. Analytic distributions can be used to model empirical distributions. In this context, a model is a simplification that leaves out unneeded details.

## 5.1. The Exponential distribution
The CDF of the exponential distribution:

$$ \displaystyle{CDF(x) = 1 - e^{-\lambda x}} $$

- The parameter $\lambda$ determines the shape of the distribution; typical illustrative values are 0.5, 1, and 2.

In the real world, exponential distributions come up when we look at a series of events and measure the times between events, called **interarrival times**. If the events are equally likely to occur at any time, the distribution of interarrival times tends to look like an exponential distribution.
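To see how an analytic CDF can model an empirical one, the sketch below simulates interarrival times and compares the fraction of simulated gaps below a point with the exponential CDF at that point. The rate `lam`, the seed, and the evaluation point are arbitrary choices for illustration.

```python
import math
import random


def exponential_cdf(x, lam):
    """CDF of the exponential distribution for x >= 0."""
    return 1 - math.exp(-lam * x)


rng = random.Random(1)
lam = 2.0
# Simulated interarrival times drawn from an exponential distribution.
gaps = [rng.expovariate(lam) for _ in range(10000)]

x = 0.5
empirical = sum(1 for g in gaps if g <= x) / len(gaps)
analytic = exponential_cdf(x, lam)
print(empirical, analytic)
```

With a sample this large the two numbers should be close, which is what it means for the model to fit.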

## 5.2. The Normal distribution
The normal distribution is also known as the Gaussian distribution.

If the logarithms of a set of values have a normal distribution, the values have a lognormal distribution:

$$ \displaystyle{CDF_{lognormal}(x) = CDF_{normal}(\log x)}$$

The parameters of the lognormal distribution are usually denoted $\mu$ and $\sigma$, but they are not the mean and standard deviation. The mean of a lognormal distribution is $\displaystyle{\exp(\mu + \sigma^2/2)}$, and the expression for the standard deviation is more complicated.

## 5.3. The Pareto distribution
The CDF of the Pareto distribution:

$$ CDF(x) = 1 - \left(\frac{x}{x_m}\right)^{-\alpha} $$

The parameters $x_m$ (the minimum possible value) and $\alpha$ determine the location and shape of the distribution.
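The Pareto CDF is easy to evaluate directly. In this sketch the parameter values are arbitrary, and the median is obtained by solving $CDF(x) = 0.5$ for $x$, which gives $x_m \cdot 2^{1/\alpha}$.

```python
def pareto_cdf(x, xm, alpha):
    """CDF of the Pareto distribution; zero below the minimum value xm."""
    if x < xm:
        return 0.0
    return 1 - (x / xm) ** -alpha


xm, alpha = 1.0, 2.0          # illustrative parameters
median = xm * 2 ** (1 / alpha)  # value where the CDF equals 0.5
print(median, pareto_cdf(median, xm, alpha))
```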

## 5.4. Why model?
Like all models, analytic distributions are abstractions, which means they leave out details that are considered irrelevant. For example, an observed distribution might have measurement errors or quirks that are specific to the sample; analytic models smooth out these idiosyncrasies.

Analytic models are also a form of data compression. When a model fits a dataset well, a small set of parameters can summarize a large amount of data.

#### Glossary

• **empirical distribution**: The distribution of values in a sample.

• **analytic distribution**: A distribution whose CDF is an analytic function.

• **model**: A useful simplification. Analytic distributions are often good models of more complex empirical distributions.

• **interarrival time**: The elapsed time between two events.

• **complementary CDF**: A function that maps from a value, x, to the fraction of values that exceed x, which is 1 − CDF(x).

• **standard normal distribution**: The normal distribution with mean 0 and standard deviation 1.

• **normal probability plot**: A plot of the values in a sample versus random values from a standard normal distribution.

# Chapter 6: Probability Density Functions
Probability density measures probability per unit of x. In order to get a probability mass, you have to integrate over x.

#### Kernel Density Estimation
KDE is an algorithm that takes a sample and finds an appropriately smooth PDF that fits the data.

Estimating a density function with KDE is useful for several purposes:

• **Visualization**: During the exploration phase of a project, CDFs are usually the best visualization of a distribution. After you look at a CDF, you can decide whether an estimated PDF is an appropriate model of the distribution. If so, it can be a better choice for presenting the distribution to an audience that is unfamiliar with CDFs.

• **Interpolation**: An estimated PDF is a way to get from a sample to a model of the population. If you have reason to believe that the population distribution is smooth, you can use KDE to interpolate the density for values that don’t appear in the sample.

• **Simulation**: Simulations are often based on the distribution of a sample. If the sample size is small, it might be appropriate to smooth the sample distribution using KDE, which allows the simulation to explore more possible outcomes, rather than replicating the observed data.
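One common form of KDE places a Gaussian bump on every sample point and averages them. The sketch below is a bare-bones version of that idea; the sample, the fixed `bandwidth`, and the function name are assumptions, and real libraries usually choose the bandwidth automatically.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a density function built from Gaussian kernels on each point."""
    n = len(sample)
    norm = 1 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


sample = [1.0, 2.0, 2.5, 4.0]  # illustrative values
pdf = gaussian_kde(sample, bandwidth=0.5)

# Interpolation: the density is defined for values not in the sample.
print(pdf(3.0))

# Integrating the density over a wide range should give a total close to 1.
dx = 0.01
total = sum(pdf(-5 + i * dx) * dx for i in range(1500))
print(total)
```

A larger bandwidth gives a smoother estimate; a smaller one follows the sample more closely.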

## 6.1. Moments
Any time you take a sample and reduce it to a single number, that number is a statistic. The statistics we have seen so far include mean, variance, median, and interquartile range.

A raw moment is a kind of statistic. If you have a sample of values, $x_i$, the kth raw moment is:

$$ \displaystyle{m^{'}_k = \frac{1}{n}\sum_i x^k_i}$$

When k = 1 the result is the sample mean.

The central moments are more useful. The kth central moment is:

$$ \displaystyle{m_k = \frac{1}{n}\sum_i (x_i - \bar{x})^k}$$

When k = 2, the result is the variance.
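Both kinds of moment can be computed with a few lines of code. A minimal sketch on illustrative data:

```python
def raw_moment(xs, k):
    """kth raw moment: the mean of x**k."""
    return sum(x ** k for x in xs) / len(xs)


def central_moment(xs, k):
    """kth central moment: the mean of (x - mean)**k."""
    mean = raw_moment(xs, 1)
    return sum((x - mean) ** k for x in xs) / len(xs)


xs = [2, 4, 4, 4, 5, 5, 7, 9]
print(raw_moment(xs, 1))      # the sample mean
print(central_moment(xs, 2))  # the variance
```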

## 6.2. Skewness
Skewness is a property that describes the shape of a distribution. If the distribution is symmetric around its central tendency, it is unskewed. If the values extend farther to the right, it is “right skewed” and if the values extend left, it is “left skewed.”

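Pearson's median skewness coefficient, $g_p = 3(\bar{x} - m)/S$, where $m$ is the median and $S$ is the standard deviation, measures skewness from the difference between the sample mean and the median. This statistic is robust, which means that it is less vulnerable to the effect of outliers.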
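Two ways to quantify skewness can be sketched as follows: the moment-based sample skewness (third central moment divided by the variance raised to the power 3/2) and Pearson's median skewness coefficient, 3 × (mean − median) / standard deviation. The function names and data are chosen for illustration.

```python
import math
from statistics import median


def central_moment(xs, k):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def sample_skewness(xs):
    """Third central moment, standardized by the variance to the power 3/2."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def pearson_median_skewness(xs):
    """3 * (mean - median) / standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(central_moment(xs, 2))
    return 3 * (mean - median(xs)) / std


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 10]
print(sample_skewness(symmetric), pearson_median_skewness(symmetric))
print(sample_skewness(right_skewed), pearson_median_skewness(right_skewed))
```

Both statistics are zero for the symmetric sample and positive for the right-skewed one.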

# Chapter 7: Relationship between variables
Two variables are related if knowing one gives you information about the other. For example, height and weight are related; people who are taller tend to be heavier.

## 7.1. Scatter plots
The simplest way to check for a relationship between two variables is a scatter plot. When many points overlap, drawing them partly transparent makes overlapping regions look darker, so darkness is proportional to density. A good plot shows the relationship clearly without introducing misleading artifacts.

## 7.2. Characterizing relationships
For example, the plot shows that between 140 and 200 cm the relationship between these variables is roughly linear. This range includes more than 99% of the data.

## 7.3. Correlation
A correlation is a statistic intended to quantify the strength of the relationship between two variables.

A challenge in measuring correlation is that the variables we want to compare are often not expressed in the same units, and even when they are, they may come from different distributions.

There are two common solutions:

1. Transform each value to a **standard score**, which is the number of standard deviations from the mean. This transform leads to the Pearson product-moment correlation coefficient.

2. Transform each value to its rank, which is its index in the sorted list of values. This transform leads to the Spearman rank correlation coefficient.

For example:

1. If X is a series of n values $x_i$, the standard scores are $z_i = (x_i - \mu)/\sigma$, so the Z values are dimensionless (no units).

2. If X is normally distributed, so is Z. But if X is skewed or has outliers, so is Z; in those cases it is more robust to use percentile ranks. If we compute a new variable, R, so that $r_i$ is the rank of $x_i$, the distribution of R is uniform from 1 to n, regardless of the distribution of X.

## 7.4. Covariance
Covariance is a measure of the tendency of two variables to vary together. If we have two series, X and Y, their deviations from the mean are:

$ dx_i = x_i - \bar x $ and $ dy_i = y_i - \bar y $

$$ Cov(X,Y) = \frac{1}{n}\sum dx_idy_i $$

Where n is the length of the two series (they have to be the same length).
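The covariance formula can be written out as a short function. The name `cov` and the data are illustrative.

```python
def cov(xs, ys):
    """Mean of the products of deviations from the respective means."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(cov(xs, ys))
```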

## 7.5. Pearson's correlation
The problem with covariance as a summary statistic is that it is hard to interpret. For example, for height and weight the units are kilogram-centimeters, whatever that means.

**Using the standard score**, compute the product of the standard scores of X and Y:

$ \displaystyle{p_i = \frac{(x_i - \bar x)}{S_X} \frac{(y_i - \bar y)}{S_Y}} $

- where $S_X$ and $S_Y$ are the standard deviations of X and Y. The mean of these products is:

$ \displaystyle{\rho = \frac{1}{n}\sum_i p_i}$

- Equivalently, Pearson's correlation is: $\displaystyle{\rho = \frac{Cov(X,Y)}{S_X S_Y}}$

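Pearson's correlation is always between -1 and +1 (inclusive). If $\rho$ is positive, the correlation is positive: when one variable is high, the other tends to be high. If $\rho$ is negative, the correlation is negative: when one variable is high, the other tends to be low. The magnitude of $\rho$ indicates the strength of the correlation.

Pearson's correlation only measures linear relationships, so $\rho = 0$ means there is no linear relationship between the two variables; a nonlinear relationship may still exist.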
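A small sketch of Pearson's correlation computed as covariance divided by the product of standard deviations. The second example uses made-up data where y is exactly x squared, to show that a correlation of zero does not rule out a strong nonlinear relationship.

```python
import math


def pearson_corr(xs, ys):
    """Cov(X, Y) / (S_X * S_Y), with all moments dividing by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


# Perfect linear relationship.
linear = pearson_corr([1, 2, 3, 4], [2, 4, 6, 8])

# Perfect but nonlinear (quadratic) relationship.
xs = [-2, -1, 0, 1, 2]
quadratic = pearson_corr(xs, [x ** 2 for x in xs])

print(linear, quadratic)
```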

## 7.6. Nonlinear relationships
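When Pearson's correlation is near zero, it is tempting to conclude that there is no relationship between the variables, but that conclusion is not valid because Pearson's correlation only measures linear relationships. Look at a scatter plot of the data before blindly computing a correlation coefficient.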

**Spearman’s rank correlation** is an alternative that mitigates the effect of outliers and skewed distributions. To compute Spearman’s correlation, we have to compute the rank of each value, which is its index in the sorted sample. For example, in the sample [1, 2, 5, 7] the rank of the value 5 is 3, because it appears third in the sorted list. Then we compute Pearson’s correlation for the ranks.
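The two steps of Spearman's correlation (rank, then Pearson) can be sketched as below. For simplicity this version assumes there are no tied values; handling ties, usually by averaging their ranks, is left out. The data are illustrative.

```python
import math


def ranks(xs):
    """Rank of each value (1 = smallest). Assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def pearson_corr(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


def spearman_corr(xs, ys):
    """Pearson's correlation computed on the ranks."""
    return pearson_corr(ranks(xs), ranks(ys))


xs = [1, 2, 5, 7]
ys = [x ** 3 for x in xs]  # monotonic but nonlinear
print(ranks(xs), spearman_corr(xs, ys), pearson_corr(xs, ys))
```

Because the relationship is monotonic, the rank correlation is 1 even though Pearson's correlation on the raw values is below 1.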

#### Correlation and causation.
If variables A and B are correlated, there are three possible explanations: A causes B, or B causes A, or some other set of factors causes both A and B.

So what can you do to provide evidence of causation?

If A comes before B, then A can cause B, but not the other way around. The order of events can help to infer the direction of causation. 

# Chapter 8: Estimation
Suppose we want to estimate the mean of a population that follows a normal distribution, using the sample mean as the estimator.
1. If the data may contain outliers, one option is to identify and discard them, then compute the sample mean of the rest. Another option is to use the median as an estimator.
2. If there are no outliers, the sample mean minimizes the mean squared error (MSE), computed over many repetitions of the estimation:

$$ MSE = \frac{1}{m}\sum (\bar x - \mu)^2$$

Here m is the number of times the estimation is repeated (not n, the size of the sample).

## 8.1. Guess the variance
To estimate the variance $\sigma^2$, an obvious choice is the sample variance, $S^2 = \frac{1}{n}\sum (x_i - \bar x)^2$. This estimator is biased for small samples; the **unbiased** estimator of $\sigma^2$ is $S^2_{n-1} = \frac{1}{n - 1}\sum (x_i - \bar x)^2$.
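The bias can be seen in a simulation: draw many small samples from a distribution with known variance and average each estimator. The sample size, number of repetitions, and seed below are arbitrary choices for illustration.

```python
import random


def biased_var(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def unbiased_var(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(0)
n, iters = 5, 5000  # small samples make the bias easy to see
biased, unbiased = [], []
for _ in range(iters):
    xs = [rng.gauss(0, 1) for _ in range(n)]  # true variance is 1
    biased.append(biased_var(xs))
    unbiased.append(unbiased_var(xs))

mean_biased = sum(biased) / iters
mean_unbiased = sum(unbiased) / iters
print(mean_biased, mean_unbiased)
```

On average the biased estimator comes out near (n − 1)/n = 0.8 of the true variance, while the unbiased one comes out near 1.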

## 8.2. Sampling distributions
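How much would an estimate vary if the experiment were repeated with a different sample? The distribution of the estimate over such repetitions is the sampling distribution.

Variation in the estimate caused by random selection is called **sampling error**.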

There are two common ways to summarize the sampling distribution:
1. **Standard Error (SE)**: a measure of how far we expect the estimate to be off, on average. For each simulated experiment, we compute the error of sampling ($\bar x - \mu$), and then compute Root Mean Squared Error (RMSE).
2. **Confidence Interval (CI)**: a range that includes a given fraction of the sampling distribution. For example, the 90% CI is the range from the 5th to the 95th percentile. 
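Both summaries can be computed from a simulated sampling distribution. In this sketch the population parameters (`mu`, `sigma`), the sample size, and the number of simulated experiments are assumptions chosen for illustration.

```python
import math
import random


def simulate_sample_means(mu, sigma, n, iters, seed=0):
    """Run `iters` simulated experiments and return the sample mean of each."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(mu, sigma) for _ in range(n)) / n for _ in range(iters)
    ]


mu, sigma, n = 90, 7.5, 9
means = simulate_sample_means(mu, sigma, n, iters=1000)

# Standard error: RMSE of the estimates around the true mean.
stderr = math.sqrt(sum((m - mu) ** 2 for m in means) / len(means))

# 90% confidence interval: 5th to 95th percentile of the sampling distribution.
ordered = sorted(means)
ci = (ordered[int(0.05 * len(ordered))], ordered[int(0.95 * len(ordered))])

print(stderr, ci)
```

For a normal population the standard error of the mean is sigma divided by the square root of n, which is 2.5 here, so the simulated value should land close to that.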

#### Glossary
• estimation: The process of inferring the parameters of a distribution from a sample.

• estimator: A statistic used to estimate a parameter.

• mean squared error (MSE): A measure of estimation error.

• root mean squared error (RMSE): The square root of MSE, a more meaningful representation of typical error magnitude.

• maximum likelihood estimator (MLE): An estimator that computes the point estimate most likely to be correct.

• bias (of an estimator): The tendency of an estimator to be above or below the actual value of the parameter, when averaged over repeated experiments.

• sampling error: Error in an estimate due to the limited size of the sample and variation due to chance.

• sampling bias: Error in an estimate due to a sampling process that is not representative of the population.

• measurement error: Error in an estimate due to inaccuracy collecting or recording data.

• sampling distribution: The distribution of a statistic if an experiment is repeated many times.

• standard error: The RMSE of an estimate, which quantifies variability due to sampling error (but not other sources of error).

• confidence interval: An interval that represents the expected range of an estimator if an experiment is repeated many times.

# Chapter 9: Hypothesis testing
## 9.1. Classical Hypothesis testing
Are the effects we see in a sample likely to appear in the larger population? For example, we see a difference in mean pregnancy length for first babies and others. We would like to know whether that effect reflects a real difference for women in the population, or whether it might appear in the sample by chance.

The goal of classical hypothesis testing is to answer the question, "Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?" Here is how we answer that question:

1. Quantify the size of the apparent effect by choosing a **test statistic**. For example, the apparent effect is a difference in pregnancy length between first babies and others, so a natural choice for the test statistic is the difference in means between the two groups.
2. Define a **null hypothesis**, which is a model of the system based on the assumption that the apparent effect is not real. For example, the null hypothesis is that there is no difference between first babies and others; that is, that pregnancy lengths for both groups have the same distribution.
3. Compute a **p-value**, which is the probability of seeing the apparent effect if the null hypothesis is true. For example, we would compute the actual difference in means, then compute the probability of seeing a difference as big, or bigger, under the null hypothesis.
4. The last step is to interpret the result. If the p-value is low, the effect is said to be **statistically significant**, which means that it is unlikely to have occurred by chance.

The logic of this process is similar to a proof by contradiction. To prove a mathematical statement, A, you assume temporarily that A is false. If that assumption leads to a contradiction, you conclude that A must actually be true.

Similarly, to test a hypothesis like, “This effect is real,” we assume, temporarily, that it is not. That’s the null hypothesis. Based on that assumption, we compute the probability of the apparent effect. That’s the p-value. If the p-value is low, we conclude that the null hypothesis is unlikely to be true.

Example:

- Null hypothesis: the distributions for the two groups are the same.

- For the difference in mean pregnancy length, the computed p-value is about 0.17, which means that we expect to see a difference as big as the observed effect about 17% of the time. So this effect is not statistically significant.

- For the difference in birth weight, after 1000 attempts the simulation never yields an effect as big as the observed difference, 0.12 lbs. So we would report p < 0.001, and conclude that the difference in birth weight is statistically significant.

One-sided versus two-sided tests: a two-sided test counts differences in either direction, while a one-sided test counts only differences in one direction. For pregnancy length, the one-sided p-value is about 0.09; in general, the p-value for a one-sided test is about half the p-value for a two-sided test (here, about 0.17).
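One way to simulate the null hypothesis for a difference in means is a permutation test: pool the two groups, shuffle, split again, and count how often the shuffled difference is at least as large as the observed one. The sketch below is two-sided (it uses the absolute difference); the function names, data, seed, and iteration count are assumptions for illustration.

```python
import random


def diff_means(a, b):
    """Two-sided test statistic: absolute difference in means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(group1, group2, iters=1000, seed=17):
    rng = random.Random(seed)
    observed = diff_means(group1, group2)
    pool = list(group1) + list(group2)
    n = len(group1)
    count = 0
    for _ in range(iters):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pool)
        if diff_means(pool[:n], pool[n:]) >= observed:
            count += 1
    return count / iters


# Hypothetical pregnancy lengths in weeks for two small groups.
first = [38.5, 39.1, 40.2, 38.9, 39.6, 40.0]
others = [38.7, 39.0, 39.4, 38.8, 39.9, 39.2]
p_value = permutation_p_value(first, others)
print(p_value)
```

If the count is zero after 1000 iterations, the result would be reported as p < 0.001 rather than p = 0.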

#### Testing a correlation
The test statistic is Pearson's correlation (Spearman's would work as well). The null hypothesis is that there is no correlation between the two variables. In the example, although the observed correlation is small, it is statistically significant.

#### Testing proportions
To test whether observed frequencies match the expected ones (for example, whether a die is fair), we compute the expected frequency for each value, the difference between the expected and observed frequencies, and the total absolute difference.

#### Chi-squared tests
For testing proportions, it is more common to use the chi-squared statistic, where $O_i$ are the observed frequencies and $E_i$ are the expected frequencies:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$

This example demonstrates an important point: the p-value depends on the choice of test statistic and the model of the null hypothesis, and sometimes these choices determine whether an effect is statistically significant or not.

This example demonstrates a limitation of chi-squared tests: they indicate that there is a difference between the two groups, but they don’t say anything specific about what the difference is.
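The two test statistics for proportions, total absolute difference and chi-squared, can be compared on the same frequencies. The observed counts below are illustrative values for 60 rolls of a six-sided die, with a fair die as the null hypothesis.

```python
def total_abs_difference(observed, expected):
    return sum(abs(o - e) for o, e in zip(observed, expected))


def chi_squared(observed, expected):
    """Sum of squared deviations, each scaled by its expected frequency."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [8, 9, 19, 5, 8, 11]   # illustrative counts for faces 1..6
n = sum(observed)
expected = [n / 6] * 6            # a fair die: each face expected n/6 times

abs_stat = total_abs_difference(observed, expected)
chi2_stat = chi_squared(observed, expected)
print(abs_stat, chi2_stat)
```

Squaring gives more weight to large deviations, so a single face that is far from its expected count contributes much more to chi-squared than to the total absolute difference.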

#### Errors
In classical hypothesis testing, an effect is considered statistically significant if the p-value is below some threshold, commonly 5%. This procedure raises two questions:

• If the effect is actually due to chance, what is the probability that we will wrongly consider it significant? This probability is the false positive rate.

• If the effect is real, what is the chance that the hypothesis test will fail? This probability is the false negative rate.

The false positive rate is relatively easy to compute: if the threshold is 5%, the false positive rate is 5%. Here’s why:

• If there is no real effect, the null hypothesis is true, so we can compute the distribution of the test statistic by simulating the null hypothesis. Call this distribution $CDF_T$.

• Each time we run an experiment, we get a test statistic, t, which is drawn from $CDF_T$. Then we compute a p-value, which is the probability that a random value from $CDF_T$ exceeds t, so that’s $1 - CDF_T(t)$.

• The p-value is less than 5% if $CDF_T(t)$ is greater than 95%; that is, if t exceeds the 95th percentile. And how often does a value chosen from $CDF_T$ exceed the 95th percentile? 5% of the time. So if you perform one hypothesis test with a 5% threshold, you expect a false positive 1 time in 20.

#### Power
The false negative rate is harder to compute because it depends on the actual effect size, and normally we don’t know that. One option is to compute a rate conditioned on a hypothetical effect size.

For example, if we assume that the observed difference between groups is accurate, we can use the observed samples as a model of the population and run hypothesis tests with simulated data.

## 9.2. Replication
1. Multiple tests: if we run one hypothesis test, the chance of a false positive is about 1 in 20, which might be acceptable. But if we run 20 tests, we should expect at least one false positive most of the time.
2. Reusing data: there is a good chance of generating a false positive when the same dataset is used for both exploration and testing. To compensate for multiple tests, we can adjust the p-value threshold, or partition the data into one set for exploration and another for testing.
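As a rough check on the first point, and assuming the 20 tests are independent and every null hypothesis is true (both are simplifying assumptions), the probability of at least one false positive at a 5% threshold is:

$$ P(\text{at least one false positive}) = 1 - (1 - 0.05)^{20} \approx 0.64 $$

One common way to adjust the threshold is to divide it by the number of tests (the Bonferroni correction), which here gives $0.05 / 20 = 0.0025$ per test.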

In the example:

• The difference in mean pregnancy length is 0.16 weeks and statistically significant with p < 0.001 (compared to 0.078 weeks in the original dataset).

• The difference in birth weight is 0.17 pounds with p < 0.001 (compared to 0.12 lbs in the original dataset).

• The correlation between birth weight and mother’s age is 0.08 with p < 0.001 (compared to 0.07).

• The chi-squared test is statistically significant with p < 0.001 (as it was in the original).

In summary, all of the effects that were statistically significant in the original dataset were replicated in the new dataset, and the difference in pregnancy length, which was not significant in the original, is bigger in the new dataset and significant.

```python
# Test proportions: is a six-sided die fair?
import numpy as np


class DiceTest:
    def testStatistic(self, data):
        # Total absolute difference between observed and expected frequencies.
        observed = np.asarray(data)
        n = observed.sum()
        expected = np.ones(6) * n / 6
        return np.abs(observed - expected).sum()

    def runModel(self, n=60):
        # Simulate n rolls of a fair die and count how often each face appears.
        values = [1, 2, 3, 4, 5, 6]
        rolls = np.random.choice(values, n, replace=True)
        # minlength=7 keeps a slot for every face even if a 6 is never rolled.
        freqs = np.bincount(rolls, minlength=7)
        return freqs[1:]


test = DiceTest()
freqs = test.runModel()

test_stat = test.testStatistic(freqs)
test_stat
```

# Chapter 10: Linear least squares
Correlation coefficients measure the strength and sign of a relationship, but not the slope. There are several ways to estimate the slope; the most common is a linear least squares fit. A “linear fit” is a line intended to model the relationship between variables. A “least squares” fit is one that minimizes the mean squared error (MSE) between the line and the data.

$$ y = B_0 + B_1x$$

The vertical deviation from the line is the **residual**:

$ residual = y - (B_0 + B_1x)$

The residual might be due to random factors like measurement error, or non-random factors that are unknown. For example, if we are trying to predict weight as a function of height, unknown factors might include diet, exercise, and body type.

## 10.1. Residuals
A residuals function takes x and y along with the estimated parameters inter ($B_0$) and slope ($B_1$). It returns the differences between the actual values and the fitted line.

To visualize the residuals, group respondents by age and compute percentiles of the residuals in each group. Ideally the resulting percentile lines should be flat, indicating that the residuals are random, and parallel, indicating that the variance of the residuals is the same for all age groups.
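A minimal sketch of a least squares fit and its residuals, using the closed-form solution for one explanatory variable (slope = covariance / variance of x). The function names and the four data points are assumptions for illustration.

```python
def least_squares(xs, ys):
    """Return (inter, slope) of the line that minimizes squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    slope = cov_xy / var_x
    inter = mean_y - slope * mean_x
    return inter, slope


def residuals(xs, ys, inter, slope):
    """Differences between the actual values and the fitted line."""
    return [y - (inter + slope * x) for x, y in zip(xs, ys)]


xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
inter, slope = least_squares(xs, ys)
res = residuals(xs, ys, inter, slope)
print(inter, slope, res)
```

For a least squares fit with an intercept, the residuals always sum to zero (up to floating-point error), which is a handy sanity check.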

## 10.2. Estimation
The parameters slope and inter are estimates based on the sample. They are vulnerable to sampling bias, measurement error, and sampling error. Sampling bias is caused by non-representative sampling, measurement error is caused by error in collecting and recording data, and sampling error is the result of measuring a sample rather than the entire population.

To assess sampling error, we ask, “If we run this experiment again, how much variability do we expect in the estimates?” We can answer this question by running simulated experiments and computing sampling distributions of the estimates. 

## 10.3. Goodness of fit
There are several ways to measure the quality, or goodness of fit, of a linear model. One of the simplest is the standard deviation of the residuals.

If you use a linear model to make predictions, std(res) is the root mean squared error (RMSE) of the predictions. For example, using mother's age to guess birth weight.

$R^2$ (R squared), the coefficient of determination, is another way to measure goodness of fit.
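Both measures can be computed from the residuals. The sketch below uses the common definition $R^2 = 1 - Var(res) / Var(y)$; the data are made up, and the line with `inter = 0.0` and `slope = 1.9` is the least squares fit for these four points.

```python
import math


def var(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def coef_determination(ys, res):
    """Fraction of the variance in ys that the model accounts for."""
    return 1 - var(res) / var(ys)


xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
inter, slope = 0.0, 1.9  # least squares fit for the points above

res = [y - (inter + slope * x) for x, y in zip(xs, ys)]
rmse = math.sqrt(var(res))          # standard deviation of the residuals
r2 = coef_determination(ys, res)
print(rmse, r2)
```

Comparing `rmse` with the standard deviation of `ys` shows how much the model improves on simply guessing the mean.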

## 10.4. Testing a linear model
One test statistic is $R^2$; the null hypothesis is that there is no relationship between the variables.

Another approach is to test whether the apparent slope is due to chance. The null hypothesis is that the slope is actually zero; in that case we can model the birth weights as random variations around their mean. The p-value is less than 0.001, so although the estimated slope is small, it is unlikely to be due to chance.

We can estimate the p-value in two ways. Simulating the null hypothesis is the strictly correct approach; using the sampling distribution of the slope is usually a good approximation:

- Compute the probability that the slope under null hypothesis exceeds the observed slope.
- Compute the probability that the slope in the sampling distribution falls below 0. (if the estimated slope were negative, we would compute the probability that the slope in the sampling distribution exceeds 0).

## 10.5. Weighted resampling
The survey deliberately oversamples several groups in order to improve the chance of getting statistically significant results; that is, in order to improve the power of tests involving these groups.

To correct for oversampling, we can use resampling; that is, we can draw samples from the survey using probabilities proportional to sampling weights. Then, for any quantity we want to estimate, we can generate sampling distributions, standard errors, and confidence intervals. As an example, I will estimate mean birth weight with and without sampling weights.
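A minimal sketch of weighted resampling with NumPy. The two synthetic groups, their weights, and the helper name `resample_weighted` are assumptions made for illustration; they are not the survey's actual groups or weights:

```python
import numpy as np

def resample_weighted(values, weights, rng):
    """Draw a sample of the same size, with probability proportional to weight."""
    probs = weights / weights.sum()
    idx = rng.choice(len(values), size=len(values), replace=True, p=probs)
    return values[idx]

rng = np.random.default_rng(0)

# Group A is oversampled (each row stands for few people),
# group B is undersampled (each row stands for many people).
values = np.concatenate([rng.normal(7.0, 1.0, 800), rng.normal(7.5, 1.0, 200)])
weights = np.concatenate([np.full(800, 1.0), np.full(200, 16.0)])

unweighted_means = [rng.choice(values, size=len(values), replace=True).mean()
                    for _ in range(200)]
weighted_means = [resample_weighted(values, weights, rng).mean()
                  for _ in range(200)]

stderr_weighted = np.std(weighted_means)               # standard error
ci_weighted = np.percentile(weighted_means, [5, 95])   # 90% confidence interval
```

Here the weighted estimate moves toward the mean of the undersampled group, because each of its rows represents a larger part of the population.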

#### Glossary
• linear fit: A line intended to model the relationship between variables.

• least squares fit: A model of a dataset that minimizes the sum of squares of the residuals.
• residual: The deviation of an actual value from a model.

• goodness of fit: A measure of how well a model fits data.

• coefficient of determination: A statistic intended to quantify goodness of fit.

• sampling weight: A value associated with an observation in a sample that indicates what part of the population it represents.

# Chapter 11: Regression
The goal of regression analysis is to describe the relationship between one set of variables, called the dependent variables, and another set of variables, called independent or explanatory variables.

In the previous chapter we used mother’s age as an explanatory variable to predict birth weight as a dependent variable. When there is only one dependent and one explanatory variable, that’s simple regression. In this chapter, we move on to multiple regression, with more than one explanatory variable. If there is more than one dependent variable, that’s multivariate regression.

If the relationship between the dependent and explanatory variables is linear, that’s linear regression. For example, if the dependent variable is $y$ and the explanatory variables are $x_1$ and $x_2$, we would write the following linear regression model:

$$ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon $$

where $\beta_0$ is the intercept, $\beta_1$ is the parameter associated with $x_1$, $\beta_2$ is the parameter associated with $x_2$, and $\epsilon$ is the residual due to random variation or other unknown factors.

Given a sequence of values for $y$ and sequences for $x_1$ and $x_2$, we can find the parameters $\beta_0$, $\beta_1$, and $\beta_2$ that minimize the sum of $\epsilon^2$. This process is called **ordinary least squares**.
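As a sketch, ordinary least squares for this model can be computed with `numpy.linalg.lstsq`. The data are synthetic, generated from known parameters chosen for illustration, so the fit can be compared with them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data generated from assumed parameters (1.0, 2.0, -3.0)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(scale=0.1, size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + eps

# Design matrix: a column of ones for the intercept, then x1 and x2
X = np.column_stack([np.ones(n), x1, x2])

# lstsq finds the betas that minimize the sum of squared residuals
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ betas
```

The recovered `betas` should be close to the parameters used to generate the data.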

## 11.1. Multiple regression
In Section 4.5 we saw that first babies tend to be lighter than others, and this effect is statistically significant. But it is a strange result because there is no obvious mechanism that would cause first babies to be lighter. So we might wonder whether this relationship is spurious.

In fact, there is a possible explanation for this effect. We have seen that birth weight depends on mother’s age, and we might expect that mothers of first babies are younger than others.
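The following sketch shows how multiple regression can separate the two explanations. The data are synthetic and built on a stated assumption: weight depends only on mother's age, and mothers of first babies are younger. None of the numbers come from the survey, and `ols` and `isfirst` are hypothetical names:

```python
import numpy as np

def ols(columns, y):
    """Ordinary least squares with an intercept; returns the parameters."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)
    return betas

rng = np.random.default_rng(0)
n = 20000

isfirst = rng.integers(0, 2, size=n).astype(float)
# Assumption: mothers of first babies are about 3.5 years younger.
age = rng.normal(26.0, 5.0, size=n) - 3.5 * isfirst
# Assumption: weight depends on age only, not on birth order.
weight = 5.0 + 0.05 * age + rng.normal(0, 0.5, size=n)

simple = ols([isfirst], weight)           # weight ~ isfirst
controlled = ols([isfirst, age], weight)  # weight ~ isfirst + age
```

In this setup `simple[1]` is clearly negative, while `controlled[1]` is close to zero: once age is in the model, the apparent effect of being a first baby disappears.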

## 11.2. Data mining
So far we have used regression models for explanation; for example, in the previous section we discovered that an apparent difference in birth weight is actually due to a difference in mother’s age. But the $R^2$ values of those models are very low, which means that they have little predictive power. In this section we’ll try to do better.

For most of these variables, we haven’t done any cleaning. Some of them are encoded in ways that don’t work very well for linear regression. As a result, we might overlook some variables that would be useful if they were cleaned properly. But maybe we will find some good candidates.

## 11.3. Prediction
Sort the results of the data mining step and choose the variables with the highest values of $R^2$.

Each selected variable contributes predictive power, assuming its value is known when the prediction is made.

Put the selected variables into the regression formula and test a few models.

Check that the p-values are small enough to reject the null hypothesis and that $R^2$ is large enough for the model to be useful.
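The sort-and-select step can be sketched as a loop over candidate variables. The variable names and the synthetic data are hypothetical, not survey columns:

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x."""
    slope, inter = np.polyfit(x, y, 1)
    res = y - (inter + slope * x)
    return 1 - np.var(res) / np.var(y)

rng = np.random.default_rng(0)
n = 1000

# Hypothetical candidate explanatory variables
candidates = {
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
}
y = 2.0 * candidates["strong"] + 0.5 * candidates["weak"] + rng.normal(size=n)

# Fit one model per variable, then sort by R^2, highest first
ranked = sorted(((r_squared(x, y), name) for name, x in candidates.items()),
                reverse=True)
best_r2, best_name = ranked[0]
```

Testing many variables this way raises the risk of false positives, so the top candidates still need the p-value check described above.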

## 11.4. Logistic regression
Linear regression can be generalized to handle other kinds of dependent variables. If the dependent variable is boolean, the generalized model is called logistic regression. If the dependent variable is an integer count, it’s called Poisson regression.

As an example of logistic regression, let’s consider a variation on the office pool scenario. Suppose a friend of yours is pregnant and you want to predict whether the baby is a boy or a girl. You could use data from the NSFG to find factors that affect the “sex ratio”, which is conventionally defined to be the probability of having a boy.
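As a rough sketch, logistic regression models the log-odds of the outcome as a linear function of the explanatory variables, $\log(p / (1-p)) = \beta_0 + \beta_1 x_1$. This formulation is the standard one and is not spelled out in these notes. The code fits it with Newton's method on synthetic data; `fit_logistic` and the single explanatory variable `x` are hypothetical:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Fit log-odds = X @ beta by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))        # predicted probabilities
        gradient = X.T @ (y - p)               # gradient of log-likelihood
        hessian = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

rng = np.random.default_rng(0)
n = 5000

# Synthetic data generated from assumed parameters (0.1, 0.8)
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([0.1, 0.8])
p_true = 1 / (1 + np.exp(-X @ true_beta))
y = (rng.random(n) < p_true).astype(float)   # boolean outcome as 0/1

beta = fit_logistic(X, y)

# Predicted probability of the outcome when x = 1
prob = 1 / (1 + np.exp(-(beta[0] + beta[1] * 1.0)))
```

In practice a statistics library would also report standard errors and p-values for each parameter.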

# Chapter 12: Time series analysis
A time series is a sequence of measurements from a system that varies in time. One famous example is the “hockey stick graph” that shows global average temperature over time (see https://en.wikipedia.org/wiki/Hockey_stick_graph).

## 12.1. Linear regression
Linear regression can be applied to time series by using time as the explanatory variable. The estimated slopes indicate that the price of high quality cannabis dropped by about 71 cents per year during the observed interval; for medium quality it increased by 28 cents per year, and for low quality it increased by 57 cents per year. These estimates are all statistically significant with very small p-values.

The $R^2$ value for high quality cannabis is 0.44, which means that time as an explanatory variable accounts for 44% of the observed variability in price. For the other qualities, the change in price is smaller, and variability in prices is higher, so the values of $R^2$ are smaller (but still statistically significant).
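A minimal sketch of this kind of fit, with time in years as the explanatory variable. The prices are synthetic: the starting price and noise level are assumptions, and only the slope of -0.71 dollars per year echoes the text:

```python
import numpy as np

rng = np.random.default_rng(0)

days = np.arange(3 * 365)
years = days / 365.0   # explanatory variable: time in years

# Synthetic daily prices (assumed start of 13 dollars, assumed noise)
prices = 13.0 - 0.71 * years + rng.normal(0, 0.5, size=len(days))

slope, inter = np.polyfit(years, prices, 1)   # slope in dollars per year
res = prices - (inter + slope * years)
r2 = 1 - np.var(res) / np.var(prices)         # share of variability explained
```

Measuring time in years rather than days makes the slope directly readable as a change in price per year.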

## 12.2. Serial correlation
As prices vary from day to day, you might expect to see patterns. If the price is high on Monday, you might expect it to be high for a few more days; and if it’s low, you might expect it to stay low. A pattern like this is called serial correlation, because each value is correlated with the next one in the series.
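Serial correlation can be sketched by correlating a series with a shifted copy of itself. The function name `serial_corr` and both synthetic series are illustrative:

```python
import numpy as np

def serial_corr(series, lag=1):
    """Correlation between the series and itself shifted by `lag` steps."""
    xs = series[lag:]
    ys = series[:-lag]
    return np.corrcoef(xs, ys)[0, 1]

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)

# A persistent series: each value carries over 80% of the previous one
persistent = np.zeros(1000)
for i in range(1, 1000):
    persistent[i] = 0.8 * persistent[i - 1] + noise[i]

corr_persistent = serial_corr(persistent)  # high: values track their neighbors
corr_noise = serial_corr(noise)            # near zero: no day-to-day pattern
```

A long-term trend also produces serial correlation, so it is common to compute it on the residuals after removing the trend.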

## 12.3. Autocorrelation
The autocorrelation function maps each lag to the serial correlation of the series with a copy of itself shifted by that lag.

![time series](images/time_series.png "Figures 12.5")

Figure 12.5 (left) shows autocorrelation functions for the three quality categories, with nlags=40. The gray region shows the normal variability we would expect if there is no actual autocorrelation; anything that falls outside this range is statistically significant, with a p-value less than 5%. Since the false positive rate is 5%, and we are computing 120 correlations (40 lags for each of 3 time series), we expect to see about 6 points outside this region. In fact, there are 7. We conclude that there are no autocorrelations in these series that could not be explained by chance.

Figure 12.5 (right) shows autocorrelation functions for prices with simulated weekly seasonality added. As expected, the correlations are highest when the lag is a multiple of 7. For high and medium quality, the new correlations are statistically significant. For low quality they are not, because residuals in this category are large; the effect would have to be bigger to be visible through the noise.
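The effect of weekly seasonality on the autocorrelation function can be sketched with synthetic residuals. The weekend bump of 2.0 is an arbitrary assumption, and the significance bound uses the common approximation $1.96 / \sqrt{n}$ rather than the method behind the figure:

```python
import numpy as np

def serial_corr(series, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    return np.corrcoef(series[lag:], series[:-lag])[0, 1]

def autocorr(series, nlags):
    """Autocorrelation function for lags 1..nlags."""
    return np.array([serial_corr(series, lag) for lag in range(1, nlags + 1)])

rng = np.random.default_rng(0)
n = 700
residuals = rng.normal(0, 1.0, size=n)

# Simulated seasonality (assumption): values are higher on two days a week
day_of_week = np.arange(n) % 7
seasonal = residuals + np.where(day_of_week >= 5, 2.0, 0.0)

acf_plain = autocorr(residuals, nlags=21)
acf_seasonal = autocorr(seasonal, nlags=21)

# Approximate 95% bound expected if there is no autocorrelation
threshold = 1.96 / np.sqrt(n)
best_lag = int(np.argmax(acf_seasonal)) + 1   # lags start at 1
```

With the seasonal component added, the largest correlations appear at lags 7, 14, and 21.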

Time series analysis can be used to investigate, and sometimes explain, the behavior of systems that vary in time. It can also make predictions. But for most purposes it is important to quantify error. In other words, we want to know how accurate the prediction is likely to be.

There are three sources of error we should take into account:

• Sampling error: The prediction is based on estimated parameters, which depend on random variation in the sample. If we run the experiment again, we expect the estimates to vary.

• Random variation: Even if the estimated parameters are perfect, the observed data varies randomly around the long-term trend, and we expect this variation to continue in the future.

• Modeling error: We have already seen evidence that the long-term trend is not linear, so predictions based on a linear model will eventually fail.

![](images/predicted_price.png "figure 12.6")
Figure 12.6: Predictions based on the linear fits, showing variation due to sampling error and prediction error

Figure 12.6 shows the result. The dark gray region represents a 90% confidence interval for the sampling error; that is, uncertainty about the estimated slope and intercept due to sampling.

The lighter region shows a 90% confidence interval for prediction error, which is the sum of sampling error and random variation.
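One way to produce two such regions is to resample the residuals, refit the line, and collect the predictions. This is a sketch under that assumption; `simulate_predictions` and the synthetic prices are hypothetical:

```python
import numpy as np

def simulate_predictions(xs, ys, future_xs, iters=500, add_resid=False, seed=0):
    """Refit the line on resampled residuals and predict at future_xs.
    With add_resid=True, random variation is added to each prediction."""
    rng = np.random.default_rng(seed)
    slope, inter = np.polyfit(xs, ys, 1)
    fitted = inter + slope * xs
    res = ys - fitted
    preds = []
    for _ in range(iters):
        fake_ys = fitted + rng.choice(res, size=len(res), replace=True)
        s, i = np.polyfit(xs, fake_ys, 1)
        pred = i + s * future_xs
        if add_resid:
            pred = pred + rng.choice(res, size=len(future_xs), replace=True)
        preds.append(pred)
    return np.array(preds)

# Synthetic prices over three years (assumed values)
rng = np.random.default_rng(1)
years = np.linspace(0, 3, 300)
prices = 10 + 0.3 * years + rng.normal(0, 0.5, size=300)
future = np.linspace(3, 4, 20)

sampling = simulate_predictions(years, prices, future)
prediction = simulate_predictions(years, prices, future, add_resid=True)

sampling_ci = np.percentile(sampling, [5, 95], axis=0)      # sampling error only
prediction_ci = np.percentile(prediction, [5, 95], axis=0)  # plus random variation
```

The prediction interval is wider than the sampling interval at every future point, because it includes both sources of error.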

These regions quantify sampling error and random variation, but not modeling error. In general modeling error is hard to quantify, but in this case we can address at least one source of error, unpredictable external events.

The regression model is based on the assumption that the system is stationary; that is, that the parameters of the model don’t change over time. Specifically, it assumes that the slope and intercept are constant, as well as the distribution of residuals.

<img src="images/predicted_price_medium.png" alt="figure 12.7" style="width:613px; high:234px"/>

But looking at the moving averages in Figure 12.3, it seems like the slope changes at least once during the observed interval, and the variance of the residuals seems bigger in the first half than the second.

Figure 12.7 shows the result for the medium quality category. The lightest gray area shows a confidence interval that includes uncertainty due to sampling error, random variation, and variation in the interval of observation.

The model based on the entire interval has positive slope, indicating that prices were increasing. But the most recent interval shows signs of decreasing prices, so models based on the most recent data have negative slope. As a result, the widest predictive interval includes the possibility of decreasing prices over the next year.

# Chapter 13: Survival analysis
Survival analysis is a way to describe how long things last. It is often used to study human lifetimes, but it also applies to “survival” of mechanical and electronic components, or more generally to intervals in time before an event.

If someone you know has been diagnosed with a life-threatening disease, you might have seen a “5-year survival rate,” which is the probability of surviving five years after diagnosis. That estimate and related statistics are the result of survival analysis.
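A minimal sketch of a survival curve, $S(t)$ = the fraction of lifetimes longer than $t$, and the corresponding hazard function. It assumes every lifetime is fully observed, with no censoring, and the lifetimes are made up:

```python
import numpy as np

def survival_curve(lifetimes):
    """Return the sorted unique times and S(t) = fraction of lifetimes > t."""
    lifetimes = np.asarray(lifetimes)
    ts = np.unique(lifetimes)
    ss = np.array([(lifetimes > t).mean() for t in ts])
    return ts, ss

def hazard_function(ss):
    """Fraction of those who survived until t whose lifetime ends at t."""
    prev = np.concatenate([[1.0], ss[:-1]])  # survival just before each t
    return (prev - ss) / prev

# Made-up lifetimes in years
lifetimes = [1, 2, 2, 3, 5, 5, 5, 8]

ts, ss = survival_curve(lifetimes)
lams = hazard_function(ss)

five_year_survival = ss[ts == 5][0]   # probability of surviving past 5 years
```

When some lifetimes are still in progress, this direct calculation is not enough; Kaplan-Meier estimation handles that case.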

#### Glossary
• survival analysis: A set of methods for describing and predicting lifetimes, or more generally time until an event occurs.

• survival curve: A function that maps from a time, t, to the probability of surviving past t.

• hazard function: A function that maps from t to the fraction of people alive until t who die at t.

• Kaplan-Meier estimation: An algorithm for estimating hazard and survival functions.

• cohort: A group of subjects defined by an event, like date of birth, in a particular interval of time.

• cohort effect: A difference between cohorts.

• NBUE: A property of expected remaining lifetime, “New better than used in expectation.”

• UBNE: A property of expected remaining lifetime, “Used better than new in expectation.”

# Chapter 14: Analytic methods
## 14.1. Normal distributions
Suppose you are a scientist studying gorillas in a wildlife preserve. Having weighed 9 gorillas, you find sample mean x̄ = 90 kg and sample standard deviation, S = 7.5 kg. If you use x̄ to estimate the population mean, what is the standard error of the estimate?

To answer that question, we can simulate the experiment many times and collect the estimates. The result is an approximation of the sampling distribution. Then we use the sampling distribution to compute standard errors and confidence intervals:

1. The standard deviation of the sampling distribution is the standard error of the estimate; in the example, it is about 2.5 kg.
2. The interval between the 5th and 95th percentile of the sampling distribution is a 90% confidence interval. If we run the experiment many times, we expect the estimate to fall in this interval 90% of the time. In the example, the 90% CI is (86, 94) kg.

Assume the weights of adult gorillas are roughly normally distributed:

$$ X \sim N(\mu, \sigma^2)$$

where $\sim$ means "is distributed as" and $N$ stands for Normal. A linear transformation of $X$ is $X' = aX + b$, where $a$ and $b$ are real numbers. $X'$ belongs to the same family as $X$: $X' \sim N(a\mu + b, a^2\sigma^2)$. If $Z = X + Y$, where $X$ and $Y$ are independent with $X \sim N(\mu_X, \sigma^2_X)$ and $Y \sim N(\mu_Y, \sigma^2_Y)$, then

$$Z \sim N(\mu_X + \mu_Y, \sigma^2_X + \sigma^2_Y)$$
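These two rules give the sampling distribution of the mean without simulation. The sum of $n$ independent values from $N(\mu, \sigma^2)$ is $N(n\mu, n\sigma^2)$, and dividing by $n$ gives $\bar{x} \sim N(\mu, \sigma^2 / n)$. A short sketch for the gorilla example, assuming the sample values $\bar{x}$ and $S$ stand in for the unknown $\mu$ and $\sigma$:

```python
from math import sqrt
from statistics import NormalDist

n = 9         # number of gorillas weighed
xbar = 90.0   # sample mean, kg
s = 7.5       # sample standard deviation, kg

# Standard error: standard deviation of the sampling distribution of the mean
stderr = s / sqrt(n)

# Sampling distribution of the mean, N(xbar, s^2 / n)
sampling_dist = NormalDist(mu=xbar, sigma=stderr)

# 90% confidence interval: 5th to 95th percentile
ci = (sampling_dist.inv_cdf(0.05), sampling_dist.inv_cdf(0.95))
```

This reproduces the standard error of 2.5 kg and, after rounding, the 90% CI of (86, 94) kg.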

## 14.2. Central limit theorem
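If we add up $n$ values drawn from almost any distribution, the distribution of the sum converges to normal as $n$ increases. More specifically, if the distribution of the values has mean $\mu$ and standard deviation $\sigma$, the distribution of the sum is approximately $N(n\mu, n\sigma^2)$.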
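A quick simulation can illustrate this. The sketch below sums values from an exponential distribution, which is far from normal; the choice of distribution and of $n = 100$ is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100       # number of values in each sum
mu = 2.0      # mean of an exponential distribution with scale 2
sigma = 2.0   # its standard deviation is also 2

# 10,000 sums, each of n exponential values
sums = rng.exponential(scale=2.0, size=(10000, n)).sum(axis=1)

expected_mean = n * mu             # n * mu = 200
expected_std = np.sqrt(n) * sigma  # sqrt(n * sigma^2) = 20

observed_mean = sums.mean()
observed_std = sums.std()
```

The observed mean and standard deviation of the sums should land close to $n\mu$ and $\sqrt{n}\sigma$, and a histogram of `sums` looks roughly bell-shaped.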