# 3.1 - Given a scenario, apply the appropriate descriptive statistical methods.

### Distribution

The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. We can think of a distribution as a function that describes the relationship between observations in a sample space.

For example, think of a histogram. Histograms that represent very large data sets grouped into many classes have a relatively smooth appearance. Consequently, the distribution can be modeled by a smooth curve that is close to the tops of the bars. This curve is called a distribution curve.

**Probability Distribution**: Probability distributions indicate the likelihood of an event or outcome. P(x) = the likelihood that a random variable takes the specific value x.

For example, in an experiment where three fair coins are tossed, the sample space is:

``` S = [HHH, HHT, HTH, THH, HTT, TTH, THT, TTT] ```

Let X, our random variable, be the number of heads in an outcome, so it can take the values 3, 2, 1, 0. So:

$ P(X = 0) = P(TTT) = 1/8 $

$ P(X = 1) = P(HTT) + P(TTH) + P(THT) = 3/8 $

$ P(X = 2) = P(HHT) + P(HTH) + P(THH) = 3/8 $

$ P(X = 3) = P(HHH) = 1/8 $

We can make a table from this calculation, called the probability distribution of the random variable X:

| X (random variable) | P(X) |
| ------------------- | ---- |
| 0 | 1/8 |
| 1 | 3/8 |
| 2 | 3/8 |
| 3 | 1/8 |
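
The table above can be reproduced by brute force. The following is a minimal sketch that enumerates the sample space and counts heads; the variable names (`outcomes`, `head_counts`, `pmf`) are illustrative choices, not part of the course material.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 2^3 equally likely outcomes of tossing three fair coins.
outcomes = ["".join(toss) for toss in product("HT", repeat=3)]

# X = number of heads in each outcome; count how often each value occurs.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# P(X = x) = (number of outcomes with x heads) / (total number of outcomes).
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(head_counts.items())}

for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
```

Using `Fraction` keeps the probabilities exact, so they match the 1/8 and 3/8 entries in the table rather than showing rounded decimals.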

**Distributions can be divided into 2 types:**
1. Discrete distribution:
    - Based on a discrete random variable. Examples are the Binomial Distribution and the Poisson Distribution.
2. Continuous distribution:
    - Based on a continuous random variable. Examples are the Normal Distribution, the Uniform Distribution and the Exponential Distribution.

**Probability Mass Function:**

A function that gives the probability that a discrete random variable is exactly equal to some value. Sometimes it is also known as the discrete density function. The probability mass function is often the primary means of defining a discrete probability distribution, and such functions exist for either scalar or multivariate random variables whose domain is discrete.

Understanding this concept in detail is far beyond the scope of this course, as it starts to involve some complex mathematics. I will list the equations below for later reference, but if you don't understand what is going on, that's okay; just understand that this is essentially calculating the probability that a variable takes a given value.

Let X be a discrete random variable; then its probability mass function p(x) is defined such that:

1. $p(x) \ge 0$
2. $\sum_{x} p(x) = 1$
3. $p(x) = P(X = x)$

Let X be a continuous random variable; then its probability density function f(x) is defined such that:

1. $f(x) \ge 0$
2. $\int_{-\infty}^{+\infty} f(x) \, dx = 1$
3. $P(a < X < b) = \int_{a}^{b} f(x) \, dx$

**Properties of Discrete Distribution:**

1. $\sum P(x) = 1$
2. $E(x) = \sum x \cdot P(x)$
3. $V(x) = E(x^2) - (E(x))^2$

**Properties of Continuous Distribution:**

1. $\int_{-\infty}^{+\infty} f(x) \, dx = 1$
2. $E(x) = \int_{-\infty}^{+\infty} x \cdot f(x) \, dx$
3. $V(x) = E(x^2) - (E(x))^2$
4. $P(a < x < b) = \int_{a}^{b} f(x) \, dx$

- E(x) denotes expected value or average value of the random variable x.
- V(x) denotes the variance of the random variable x.
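
As a rough illustration of the discrete properties, the sketch below applies them to the three-coin distribution from the table earlier in this section. The variable names are illustrative only.

```python
from fractions import Fraction

# PMF of X = number of heads in three fair coin tosses.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

total = sum(pmf.values())                            # property 1: probabilities sum to 1
expected = sum(x * p for x, p in pmf.items())        # property 2: E(x) = sum of x * P(x)
expected_sq = sum(x**2 * p for x, p in pmf.items())  # E(x^2)
variance = expected_sq - expected**2                 # property 3: V(x) = E(x^2) - (E(x))^2

print(total)     # 1
print(expected)  # 3/2
print(variance)  # 3/4
```

So on average we expect 1.5 heads per three tosses, with a variance of 0.75.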

### Variance

Variance is a numerical measure of how spread out the points in our data set are around the mean.

Consider the following sequence:

``` [3, 1, 2, 5, 3, 5, 6, 10, 2, 9] ```

To find the variance of this sequence, we first need to calculate the mean of the sequence.

**Mean** = $(3 + 1 + 2 + ... + 9) / 10 = 4.6$

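Next, we subtract the mean from each number in our sequence, square each result, and add all of the squares together. Finally, we divide that sum by the count of numbers in our sequence. (Dividing by the count gives the population variance; the sample variance divides by the count minus one instead.)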

**Variance** = $((3 - 4.6)^2 + (1 - 4.6)^2 + ... + (9 - 4.6)^2) / 10 = 8.24$
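
The same steps can be sketched in Python. This is a minimal example using only the standard library; it follows the divide-by-count (population) calculation used above.

```python
import statistics

data = [3, 1, 2, 5, 3, 5, 6, 10, 2, 9]

# Step 1: the mean of the sequence.
mean = sum(data) / len(data)

# Step 2: squared distance of each point from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Step 3: average the squared deviations (population variance).
variance = sum(squared_deviations) / len(data)

print(mean)                # 4.6
print(round(variance, 2))  # 8.24

# The standard library offers the same calculation as statistics.pvariance.
# Note that statistics.variance divides by n - 1 (sample variance) and
# therefore gives a larger number for the same data.
library_variance = statistics.pvariance(data)
```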

### Standard deviation

Shows us roughly how far, on average, each point in our data set lies from the mean. Because it is the square root of the variance, it is expressed in the same units as the data.

If we have already calculated the variance of the data set, finding the standard deviation is quite easy: we simply take the square root of the variance.

**Standard deviation** = $\sqrt{8.24} \approx 2.87$
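
A short sketch of the same calculation, reusing the sequence from the Variance section and assuming the population (divide-by-count) form:

```python
import math
import statistics

data = [3, 1, 2, 5, 3, 5, 6, 10, 2, 9]

# Standard deviation is the square root of the (population) variance.
population_sd = math.sqrt(statistics.pvariance(data))
print(round(population_sd, 2))  # 2.87

# statistics.pstdev performs the same calculation in one call.
library_sd = statistics.pstdev(data)
```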

### Percent change

A way to express a change in a variable. It represents the relative change between the old value and the new one.

If $V_1$ represents the old value and $V_2$ the new one:

Percentage change = $\frac{\Delta V}{V_1} \times 100\% = \frac{V_2 - V_1}{V_1} \times 100\%$
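
A minimal sketch of the formula as a helper function. The name `percent_change` and the sample values are hypothetical, chosen only for illustration.

```python
def percent_change(old_value: float, new_value: float) -> float:
    """Relative change from old_value (V1) to new_value (V2), as a percentage."""
    if old_value == 0:
        # The formula divides by V1, so it is undefined when the old value is 0.
        raise ValueError("percent change is undefined when the old value is 0")
    return (new_value - old_value) / old_value * 100

print(percent_change(50, 75))  # 50.0  (an increase)
print(percent_change(80, 60))  # -25.0 (a decrease)
```

A positive result means the value went up and a negative result means it went down.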

### Percent difference

The difference between two values divided by the average of the two values. Shown as a percentage.

Percentage Difference is used when both values mean the same kind of thing (for example the heights of two people).

- But if there is an old value and a new value, we should use Percentage Change
- Or if there is an approximate value and an exact value, we should use Percentage Error

**Example**

$\frac{|25 - 15|}{(25 + 15)/2} \times 100\% = 50\%$
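
The example can be checked with a small helper. The function name `percent_difference` is an illustrative choice, and the sketch assumes the two values do not average to zero.

```python
def percent_difference(a: float, b: float) -> float:
    """Absolute difference of two values relative to their average, as a percentage."""
    average = (a + b) / 2
    if average == 0:
        raise ValueError("percent difference is undefined when the average is 0")
    return abs(a - b) / average * 100

print(percent_difference(25, 15))  # 50.0
print(percent_difference(15, 25))  # 50.0 (order does not matter)
```

Unlike percent change, the result is the same whichever value is passed first, which is why it suits comparisons where neither value is the "old" one.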

### Confidence intervals

A confidence interval measures the degree of uncertainty or certainty in a sampling method. Confidence intervals can be built for any confidence level, with 95% and 99% being the most common. They are constructed using statistical methods, such as a t-test (covered in 3.2).

The biggest misconception regarding confidence intervals is that they represent the percentage of data from a given sample that falls between the upper and lower bounds. For example, one might erroneously interpret a 99% confidence interval of 70 to 78 inches as indicating that 99% of the data in a random sample falls between these numbers. This is incorrect, though a separate method of statistical analysis exists to make such a determination. Doing so involves identifying the sample's mean and standard deviation and plotting these figures on a bell curve.

**Calculating Confidence Interval**

Suppose a group of researchers is studying the heights of high school basketball players. The researchers take a random sample from the population and establish a mean height of 74 inches.

The mean of 74 inches is a point estimate of the population mean. A point estimate by itself is of limited usefulness because it does not reveal the uncertainty associated with the estimate; you do not have a good sense of how far away this 74-inch sample mean might be from the population mean. What's missing is the degree of uncertainty in this single sample.

Confidence intervals provide more information than point estimates. By establishing a 95% confidence interval using the sample's mean and standard deviation, and assuming a normal distribution as represented by the bell curve, the researchers arrive at an upper and lower bound that contains the true mean 95% of the time.

Assume the interval is between 72 inches and 76 inches. If the researchers took 100 random samples from the population of high school basketball players and built a 95% confidence interval from each one, about 95 of those intervals would be expected to contain the true population mean.

If the researchers want even greater confidence, they can widen the interval to a 99% confidence level. For the same sample, doing so always produces a broader range, because a wider interval is more likely to capture the population mean. If they establish the 99% confidence interval as being between 70 inches and 78 inches, they can expect about 99 of 100 intervals constructed this way to contain the population mean.

A 90% confidence level, on the other hand, implies that we would expect 90% of the interval estimates to include the population parameter, and so forth.
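
To make the calculation concrete, here is a minimal sketch that builds a confidence interval for a mean using the normal (z) approximation. Everything specific in it is an assumption: the `heights` list is invented for illustration (the text gives only the 74-inch mean, not the raw data), and `mean_confidence_interval` is a hypothetical helper name. For a sample this small, a t-based critical value is the more usual choice and gives a slightly wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    centre = mean(sample)
    standard_error = stdev(sample) / sqrt(n)    # stdev uses the n - 1 divisor
    z = NormalDist().inv_cdf((1 + level) / 2)   # about 1.96 for a 95% level
    margin = z * standard_error
    return centre - margin, centre + margin

# Hypothetical heights in inches, invented so that the sample mean is 74.
heights = [70, 72, 73, 74, 74, 75, 76, 78, 71, 77]

for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(heights, level)
    print(f"{level:.0%} confidence interval: {low:.2f} to {high:.2f}")
```

Running it shows the pattern described above: the interval stays centred on the sample mean and gets wider as the confidence level rises from 90% to 99%.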