<a href="https://colab.research.google.com/github/adeeconometrics/literate-programming/blob/main/Basic_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Statistics and Simulations

---- 
## Contents
- Basic Concepts and Definitions
    - Data and Variables 
    - Sampling Techniques
- Data Presentation
    - Constructing Frequency Distribution Table
    - Basic Data Visualization
- Measures of Central Tendency
    - Ungrouped Data 
    - Grouped Data 
- Measures of Dispersion, Positions, and Shape
    - Dispersion
    - Position
    - Shape
- The Normal Distribution
    - Definitions and Properties 
    - Concepts behind: Integrals
    - Applications

## Introduction
*The processing of statistical information has a long history that extends back to the beginning of mankind. - R.E. Walpole, Introduction to Statistics, 3rd edition*

Data processing and descriptive methods have long been used to extract relevant information for managing matters such as taxation, vote tallies, and wars. Today, thanks to the development of probability theory, these descriptions can be extended to deal with uncertain events, as in decision-making and forecasting, and to extract relevant features that we can generalize, given a sufficient amount of collected data.


On that note, there are two main branches of statistics: (1) **descriptive statistics**, which comprises the methods concerned with collecting and describing a set of data so as to yield meaningful information, and (2) **inferential statistics**, which comprises the methods concerned with analyzing a subset of data to arrive at **predictions** (or inferences) about the entire set of data.

As you might expect, the generalizations associated with statistical inference are *always* subject to uncertainty. To deal with such *uncertain* events, we rely on probability theory (which is a subject for a separate discussion).

In this notebook, we will explore how to implement these concepts and bring them into practice. 


<!-- the format is as follows: concept, formula, worked problem, solution using Python (this should explain the process algorithmically and not hide some details)-->

## Problems for Statisticians
<!-- this assumption makes it more fun and engaging, shoudl we include this or not?-->

Although this notebook is not written exclusively for statisticians, it is helpful to imagine ourselves as practitioners of statistics and to empathize with the problems statisticians face.

Statistical analysis gives us the power to test claims or assumptions and examine their *validity*. There is an ongoing intellectual discourse about the rigour of this subject; however, we will not address those philosophical issues in this notebook.

But how do we evaluate different statistical tests and compare them with each other? Does a more sophisticated statistical method guarantee a stronger result, one that is coherent with our understanding of the world?

<!-- should you further this discussion? -->
----

## Basic Concepts and Definitions 

Before we go further, it is important to ground ourselves in a set of definitions. Here is the list of basic concepts that we shall begin with:

- Variable - a characteristic or attribute that can assume different values 
- Data - the values that a variable assumes
- Random Data - values of a variable that are determined by chance
- Data set - a collection of data values
- Data value (or datum) - each value in the data set
- Population - all subjects that are being studied 
    - Parameter -  numerical summary or any measurement coming from a **population**
- Sample - group of subjects selected from a **population**
    - Statistic - numerical summary or any measurement coming from a **sample**

### Data and Variables 
Here we define the basic types (or forms) that our data or variables can embody.

- Qualitative -  a broad category for any variable that can’t be counted (i.e. has no numerical value). Nominal and ordinal variables fall under this umbrella term.

- Quantitative - a broad category that includes any variable that can be counted, or has a numerical value associated with it. Examples of variables that fall into this category include discrete variables and ratio variables.

- Dependent variable -  the outcome of an experiment. As you change the independent variable, you watch what happens to the dependent variable.

- Independent variable - a variable whose value does not depend on the other variables in the study; it is the one that you, the researcher, vary or select. Usually plotted on the x-axis.

Remark: a more extensive list of variable types is linked [here](https://www.statisticshowto.com/probability-and-statistics/types-of-variables/).

#### Levels of Measurement 
- Nominal - names or labels with no inherent order
- Ordinal - categories that can be ordered, e.g. rankings
- Interval - differences between measurements are meaningful, but ratios of measurements are not (there is no true zero)
- Ratio - there is a true zero, so both differences and ratios of measurements are meaningful

----
### Data Collection
*Get the Facts first, and then you can distort them as much as you please. - M. Twain*

We may define data as a collection of facts and statistics used for reference or analysis.

#### Methods of Data Collection
1. Direct Method

2. Indirect Method 

3. Registration Method

4. Experimental Method

<!-- where can we find or collect some data? Link some database.-->

### Sampling Techniques 
Typically we study samples of the population rather than the population itself, because of the tremendous cost of examining every member. Instead, we rely on several sampling techniques to get the best approximation of the behavior (or trends) exhibited by the population.

To determine a *sufficient* sample size for a given population, **Slovin's formula** is commonly used. It is defined as follows: $$n= \frac{N}{1+Ne^2}$$

where $n$ is for the sample size, $N$  for population size, and $e$ for the margin of error. 
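
As a quick check of the formula, take a hypothetical population of $N = 1000$ and a margin of error of $e = 0.05$ (both values are chosen only for illustration):

$$n = \frac{1000}{1 + 1000(0.05)^2} = \frac{1000}{3.5} \approx 285.7$$

Since a sample size must be a whole number, this would be rounded up to $n = 286$.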

On that note, here are the basic kinds of sampling techniques. 
<!-- explain with illustrations and examples-->

1. Simple Random Sampling 
2. Systematic Sampling
3. Stratified Sampling
4. Cluster Sampling 

It is important to note that the choice of technique depends on the study (or experiment) that you are working on; different kinds of investigations call for different best practices. A more extensive list of sampling techniques is linked [here](https://en.wikipedia.org/wiki/Sampling_(statistics)).
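
As a rough sketch of how the four techniques differ, the code below draws each kind of sample from a hypothetical population of 100 subject IDs using only Python's standard `random` module. The function names, the two strata, and the ten clusters are assumptions made for illustration; they do not come from any library or from a real study.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every subset of size n is equally likely to be drawn."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Random start, then every k-th subject, where k = N // n."""
    rng = random.Random(seed)
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, seed=None):
    """Draw the same fraction from each stratum (proportional allocation)."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every subject in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [subject for name in chosen for subject in clusters[name]]

# hypothetical population: subject IDs 1..100
population = list(range(1, 101))
strata = {"group_a": population[:60], "group_b": population[60:]}
clusters = {f"block_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, seed=1)
systematic = systematic_sample(population, 10, seed=1)
stratified = stratified_sample(strata, 0.10, seed=1)
clustered = cluster_sample(clusters, 2, seed=1)

for name, sample in [("simple random", srs), ("systematic", systematic),
                     ("stratified", stratified), ("cluster", clustered)]:
    print(f"{name:>13}: {sorted(sample)}")
```

Passing a `seed` makes each draw reproducible, which is convenient when comparing the techniques side by side.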

In [None]:
# applying Slovin's formula in Python: n = N / (1 + N * e^2)
def sample_size(population_size, error):
    # population_size is N, error is the margin of error e
    return population_size / (1 + population_size * error**2)

## Data Presentation 

### Constructing Frequency Distribution Table

### Basic Data Visualization

In [None]:
# visualizing data using matplotlib

# Statistical Measures of Data 
*A variety of Statistical Measures are employed to summarize and describe sets of data. - R.E. Walpole*

## Measures of Central Tendency
The central tendency of a distribution pertains to the typical value for a [probability distribution](https://en.wikipedia.org/wiki/Probability_distribution). The most common measures of central tendency are the arithmetic mean, the median, and the mode, which we shall discuss relative to the kind of data that we are working on, i.e. ungrouped and grouped data.

A measure of central tendency gives us a single value that represents the typical (or central) value of the data set.

----
Before we get the measures of our central tendency, it is important that we take note of the *data structure* of our data set. We begin by asking ourselves the question: are the data grouped or not? Once we get the structure of our data, we can perform the following operations to get what we desire on our data set. 

### Ungrouped Data 
Ungrouped data can be represented as a list or array of values. It is the data you first gather from an experiment or study: raw, not yet sorted into categories or classes. An ungrouped data set is basically a list of numbers.

#### Mean

- **Population mean**. If the set of data $x_1,x_2,...,x_N$, not necessarily all distinct, represents a finite population of size $N$, then the population mean is given by
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

<!-- explain in terms of code what this means-->

- **Sample mean**. If the set of data $x_1,x_2,...,x_n$, not necessarily all distinct, represents a finite sample of size $n$, then the sample mean is given by
$$\bar{x}= \frac{1}{n} \sum_{i=1}^{n} x_i$$

#### Median
The median of a set of observations sorted in increasing order of magnitude is the middle value. When the number of observations $n$ is odd, the middle value is the median; when $n$ is even, the median is the arithmetic mean of the two middle values.

- if even $\tilde{x} = \frac{x_{n/2} + x_{n/2+1}}{2}$
- if odd $\tilde{x} = x_{\frac{n+1}{2}}$

#### Mode 
The mode of a set of observations is the value that occurs most often, i.e. the value with the greatest frequency.
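
The three definitions for ungrouped data can be sketched in plain Python, without hiding the steps behind a library call. The function names and the `scores` list are hypothetical and chosen only for illustration; `modes` returns a list because a data set may have more than one most frequent value.

```python
from collections import Counter

def mean(data):
    # sum of the observations divided by their count
    return sum(data) / len(data)

def median(data):
    ordered = sorted(data)          # the median is defined on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                  # odd n: the single middle value
        return ordered[mid]
    # even n: average of the two middle values (0-based mid - 1 and mid)
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(data):
    counts = Counter(data)          # tally how often each value occurs
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

scores = [7, 3, 9, 3, 5, 8, 3, 6]   # hypothetical ungrouped data

print(mean(scores))     # 5.5
print(median(scores))   # 5.5
print(modes(scores))    # [3]
```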

----

### Grouped Data 
Some data sets are grouped into ranges of values (class intervals) rather than listed individually. For such cases, the measures of central tendency are approximated with the following formulas.

<!-- how do we group data and tally them, describe briefly -->

#### Mean
$$\bar{x} = \frac{1}{n} \sum f x_m$$
Where: 

> $f:= \text{frequency of each class}$

> $x_m:= \text{class mark (midpoint) of each class}$

> $n:= \text{total number of observations}$


#### Median

$$\tilde{x} = L+ \left(  \frac{\frac{n}{2} - S_b}{f_m} \right) i$$

Where:

> $f_m:= \text{frequency of the median class}$ 

> $n:= \text{total number of observations}$

> $L:= \text{lower boundary of the median class}$

> $S_b:=  \text{cumulative frequency of the class before the median class}$

> $i :=\text{size of the class interval}$

#### Mode

$$\hat{x}= L + \left( \frac{\Delta_1}{\Delta_1 +\Delta_2} \right)i$$

Where:
> $\Delta_1:= \text{difference in the frequencies of the modal class and the next lower class}$

> $\Delta_2:= \text{difference in the frequencies of the modal class and the next higher class}$

>$i := \text{size of the class interval}$

>$L:=\text{lower boundary of the modal class}$
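
The grouped formulas can be sketched as follows, assuming a frequency table with equal class widths that is stored as a list of lower class boundaries and a list of frequencies. The table values and the function names are hypothetical and chosen only for illustration.

```python
# hypothetical frequency table: classes 9.5-19.5, 19.5-29.5, ..., 49.5-59.5
lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5]
frequencies = [3, 7, 12, 6, 2]
width = 10  # size of the class interval, i

def grouped_mean(lowers, freqs, width):
    n = sum(freqs)
    marks = [low + width / 2 for low in lowers]      # class marks x_m
    return sum(f * x for f, x in zip(freqs, marks)) / n

def grouped_median(lowers, freqs, width):
    n = sum(freqs)
    cumulative = 0                                   # S_b, cf before the class
    for low, f in zip(lowers, freqs):
        if cumulative + f >= n / 2:                  # first class reaching n/2
            return low + (n / 2 - cumulative) / f * width
        cumulative += f

def grouped_mode(lowers, freqs, width):
    k = freqs.index(max(freqs))                      # modal class (first if tied)
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k + 1 < len(freqs) else 0
    d1 = freqs[k] - before                           # Delta_1
    d2 = freqs[k] - after                            # Delta_2
    return lowers[k] + d1 / (d1 + d2) * width

print(grouped_mean(lower_boundaries, frequencies, width))    # 33.5
print(grouped_median(lower_boundaries, frequencies, width))  # about 33.67
print(grouped_mode(lower_boundaries, frequencies, width))    # about 34.05
```

These are approximations: the formulas assume the observations are spread evenly within each class, so the results will generally differ slightly from the values computed on the raw data.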

In [None]:
# measures of central tendencies 

## Measures of Dispersion, Position, and Shape
----
### Measures of Dispersion
Measures of dispersion (or variability) tell us how far, on average, the observations lie from the *center of the distribution*; they summarize and describe the extent to which scores in a distribution differ from each other.

#### Measures of Absolute Dispersion
- **Range** is the difference between the highest and the lowest values; it is the simplest but least reliable measure of dispersion. Range is given by: $range = HV- LV$ where HV is the highest value, and LV is the lowest value. 

- **Variance** is the average of the squared deviations of the scores from the **mean**.
    - Ungrouped data 
        - Population Variance 
        $$\sigma ^2=\frac{1}{N} \sum (x- \mu)^2$$

        - Sample Variance 
        $$s^2 = \frac{1}{n-1} \sum (x- \bar{x})^2$$

    - Grouped data 
        - Population Variance 
        $$\sigma ^2=\frac{1}{N} \sum f(x_m- \mu)^2$$

        - Sample Variance 
        $$s^2 = \frac{1}{n-1} \sum f(x_m- \bar{x})^2 \quad \text{or} \quad s^2 = \frac{n \sum fx_m^2 - (\sum fx_m)^2}{n(n-1)}$$

- **Standard Deviation** is the square root of the variance
<!-- demonstrate this by giving the code -->
    - Ungrouped data 
        - Population Standard Deviation
        $$\sigma= \sqrt{\frac{1}{N} \sum (x- \mu)^2}$$

        - Sample Standard Deviation
        $$s = \sqrt{\frac{1}{n-1} \sum (x- \bar{x})^2} $$

    - Grouped data 
        - Population Standard Deviation
        $$\sigma =\sqrt{\frac{1}{N} \sum f(x_m- \mu)^2}$$

        - Sample Standard Deviation
        $$s = \sqrt{\frac{1}{n-1} \sum f(x_m- \bar{x})^2} $$
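
For ungrouped data, the range, variance, and standard deviation can be sketched in a few lines. The function names and the `data` list are hypothetical; the `sample` flag is an assumption introduced here to switch between dividing by $n-1$ (sample) and by $N$ (population).

```python
import math

def value_range(data):
    # highest value minus lowest value
    return max(data) - min(data)

def variance(data, sample=True):
    n = len(data)
    m = sum(data) / n
    squared_deviations = sum((x - m) ** 2 for x in data)
    # a sample divides by n - 1, a population divides by N
    return squared_deviations / (n - 1) if sample else squared_deviations / n

def std_dev(data, sample=True):
    # the standard deviation is the square root of the variance
    return math.sqrt(variance(data, sample))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical ungrouped data

print(value_range(data))              # 7
print(variance(data, sample=False))   # 4.0
print(std_dev(data, sample=False))    # 2.0
print(variance(data))                 # 32/7, about 4.571
```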
    
----
#### Measures of Relative Dispersion

**Coefficient of Variation** is the ratio of the standard deviation to the mean; it is used to compare the variability of two or more sets of data even when they are expressed in different units: $cv = \frac{s}{\bar{x}}$

**Chebyshev's Theorem.** For any set of numbers and any $k > 1$, the fraction of the numbers lying within $k$ standard deviations of their mean is at least 
$1 - \frac{1}{k^2}$.
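
A small sketch of both ideas, under stated assumptions: the height and weight lists are made-up values in different units, and the skewed data set is simulated from an exponential distribution with a fixed seed. The function names are hypothetical. The Chebyshev check uses the population standard deviation of the data set itself.

```python
import math
import random

def mean(data):
    return sum(data) / len(data)

def population_sd(data):
    m = mean(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))

def sample_sd(data):
    m = mean(data)
    return math.sqrt(sum((x - m) ** 2 for x in data) / (len(data) - 1))

def coefficient_of_variation(data):
    # unit-free: standard deviation expressed as a fraction of the mean
    return sample_sd(data) / mean(data)

def fraction_within(data, k):
    # share of observations within k standard deviations of the mean
    m, sd = mean(data), population_sd(data)
    return sum(abs(x - m) <= k * sd for x in data) / len(data)

heights_cm = [150, 160, 165, 170, 180]   # hypothetical
weights_kg = [45, 55, 60, 72, 90]        # hypothetical
print("cv heights:", round(coefficient_of_variation(heights_cm), 3))
print("cv weights:", round(coefficient_of_variation(weights_kg), 3))

rng = random.Random(42)
skewed = [rng.expovariate(1.0) for _ in range(1000)]  # deliberately non-normal
for k in (1.5, 2, 3):
    observed = fraction_within(skewed, k)
    bound = 1 - 1 / k ** 2
    print(f"k={k}: observed {observed:.3f} >= bound {bound:.3f}")
```

The bound is usually far from tight; its value lies in holding for any data set, whatever its shape.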


In [None]:
# measures of dispersion

### Measures of Position
The measures of position are used to locate a non-central piece of data relative to the entire data set.

- **z-score** measures how many standard deviations an observation is above or below the mean. 
    - Population: $z= \frac{x- \mu}{\sigma}$
    - Sample: $z= \frac{x-\bar{x}}{s}$

- **Fractiles** are values below which a specific fraction or percentage of the observations in a set must fall. Since the other fractiles can be defined in terms of percentiles, we give the formulas for computing percentiles, followed by the conversions from percentiles to quartiles and deciles. 
    - Percentiles 
        - Ungrouped 
            1. Arrange the data from lowest to highest
            2. Substitute into the formula $c=\frac{np}{100}$ where $n$ = total number of values and $p$ = percentile rank.
            3. If $c$ is a whole number, use the value halfway between the $c$-th and $(c+1)$-th values, counting up from the lowest value; otherwise round $c$ up to the next whole number and, starting at the lowest value, count up to the value at that position.
        - Grouped. The $k$-th percentile lies in the first class interval whose cumulative frequency is at least $\frac{kn}{100}$ (the percentile class). It is given by 
        $$P_k = L + \frac{\left( \frac{kn}{100} -S_b \right) i }{f_p}$$ 
        Where:
            - $f_p$ = frequency of the percentile class 
            - $n$ =  total number of observations
            - $i$ = size of the class interval
            - $L$ = lower boundary of the percentile class
            - $S_b$ = cumulative frequency of the class before the percentile class 
    - Quartiles: $Q_1 = P_{25}$, $Q_2 = P_{50}$, $Q_3 = P_{75}$
    - Deciles: $D_k = P_{10k}$, e.g. $D_1 = P_{10}$ 
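
The z-score and the three-step percentile procedure for ungrouped data can be sketched as below. The function names and the `data` list are hypothetical; `percentile` is restricted to $0 < p < 100$ as a simplifying assumption, and `z_scores` uses the sample standard deviation.

```python
import math

def z_scores(data):
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))  # sample sd
    return [(x - m) / s for x in data]

def percentile(data, p):
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)                  # step 1: lowest to highest
    n = len(ordered)
    c = n * p / 100                         # step 2: c = np / 100
    if c == int(c):                         # step 3a: c is a whole number
        c = int(c)
        # halfway between the c-th and (c+1)-th values (1-based positions)
        return (ordered[c - 1] + ordered[c]) / 2
    return ordered[math.ceil(c) - 1]        # step 3b: round c up

data = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]  # hypothetical ungrouped data

print(percentile(data, 25))   # Q1 = P25 -> 5
print(percentile(data, 50))   # Q2 = P50 -> 9.0
print(percentile(data, 75))   # Q3 = P75 -> 15
print(percentile(data, 10))   # D1 = P10 -> 2.5
print([round(z, 2) for z in z_scores(data)])
```

Other conventions for percentiles exist (for example, linear interpolation between neighbouring values), so library functions may return slightly different numbers than this procedure.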

In [None]:
# determining the measures of position

### Measures of Shape
**Skewness** refers to the degree of asymmetry of a distribution. The normal distribution is bell-shaped and symmetric about the mean, so it has the property $\text{mean = median = mode}$.
<!-- Insert illustration-->

- **negatively skewed** - the case where the mean is less than the median; we say that *the mass of the distribution is concentrated on the right of the figure*
- **positively skewed** - the case where the mean is greater than median; we say that *the mass of the distribution is concentrated on the left of the figure*

**Computing for Skewness**
$$SK = \frac{3(\bar{x} - \tilde{x})}{s}, \quad s\; \text{= standard deviation}$$

Interpretation
- if $SK=0$, the distribution is *symmetric*
- if $SK<0$, the distribution is skewed to the *left*
- if $SK>0$, the distribution is skewed to the *right*
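
A minimal sketch of the skewness coefficient $SK = 3(\bar{x} - \tilde{x})/s$ follows. The function name and the three small data sets are hypothetical, and $s$ is taken to be the sample standard deviation.

```python
import math

def pearson_skewness(data):
    ordered = sorted(data)
    n = len(ordered)
    mean = sum(ordered) / n
    mid = n // 2
    # median: middle value, or the average of the two middle values
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    s = math.sqrt(sum((x - mean) ** 2 for x in ordered) / (n - 1))
    return 3 * (mean - median) / s

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 3, 3, 4, 10]    # one large value pulls the mean up
left_tail = [-x for x in right_tail]      # mirror image of right_tail

print(pearson_skewness(symmetric))    # 0.0
print(pearson_skewness(right_tail))   # positive: skewed to the right
print(pearson_skewness(left_tail))    # negative: skewed to the left
```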


**Kurtosis** measures the peakedness of a distribution; it describes how heavily the tails of a distribution differ from the tails of a normal distribution. 
<!-- Insert illustration-->
- Mesokurtic - means a normal distribution
- Leptokurtic -  means that the distribution is more peaked than the normal distribution
- Platykurtic -  means that the distribution is flatter than the normal distribution

**Computing for Kurtosis**
- Ungrouped
$$Ku=\frac{\sum (x- \bar{x})^4}{ns^4}$$
- Grouped 
$$Ku=\frac{\sum f(x_m- \bar{x})^4}{ns^4}$$

Interpretation
- if $Ku=3$, the distribution is *mesokurtic*
- if $Ku>3$, the distribution is *leptokurtic*
- if $Ku<3$, the distribution is *platykurtic*
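
As a sketch, kurtosis for ungrouped data can be computed as the sum of fourth-power deviations from the mean divided by $ns^4$, and then compared against the reference value of 3 for a normal distribution. The simulated data sets, the seed, the function names, and the `tolerance` used to call a value "close to 3" are all assumptions made for illustration.

```python
import math
import random

def kurtosis(data):
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample sd
    return sum((x - mean) ** 4 for x in data) / (n * s ** 4)

def classify(ku, tolerance=0.2):
    # tolerance is an arbitrary allowance for sampling noise around 3
    if abs(ku - 3) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if ku > 3 else "platykurtic"

rng = random.Random(0)
normal_like = [rng.gauss(0, 1) for _ in range(5000)]
flat = [rng.uniform(-1, 1) for _ in range(5000)]            # light tails
heavy_tailed = [rng.expovariate(1.0) for _ in range(5000)]  # heavy right tail

for name, sample in [("normal", normal_like), ("uniform", flat),
                     ("exponential", heavy_tailed)]:
    ku = kurtosis(sample)
    print(f"{name:>11}: Ku = {ku:.2f} ({classify(ku)})")
```

Sample kurtosis is noisy, so a simulated normal sample will land near 3 rather than exactly on it.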

In [None]:
# checking skewness and kurtosis of a distribution