# BA2 - Probability & statistics for engineers 
## Author : Max Ciriani
![image.png](attachment:d3d1b0de-84c9-48ce-9c4b-a91ebe214026.png)
# Introduction to statistics

## Math (mean, median, variance etc.)
**Quantitative** (Numbers) vs **qualitative** (descriptive)

**Mean / Average / Expected Value** are all equivalent and interchangeable. The mean is easily affected by **outliers**.

**Inferential mean of a sample** : $\huge \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i $, which estimates $E(y)$. *(E as in expected value)*

**Mean of a population** : $\huge \eta = \frac{1}{N} \sum_{i=1}^{N} y_i $

**Median** = the middle number in a descending or ascending list of numbers. It can be more descriptive than the mean because 
it is less influenced by **outliers**, whereas the mean is sensitive to extreme values.

*Remark*, when the number of data points is even, the median is located between two values: take their average.

The **Mode** is the value that appears most frequently in a data set. A set of data may have one mode, more than one mode, or no mode at all.
*Remark*, in a normal distribution plot, the mode is the same as the median and the mean.
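
The three measures above can be compared on a small data set. The sketch below uses made-up values (not course data) and a hypothetical helper `central_tendency`. It reports every value tied for the highest count as a mode, and no mode when nothing repeats.

```python
import numpy as np
from collections import Counter

def central_tendency(values):
    """Return (mean, median, modes) for a 1-D sequence of numbers."""
    arr = np.asarray(values, dtype=float)
    counts = Counter(arr.tolist())
    top = max(counts.values())
    # Every value tied for the highest count is a mode; no repeats means no mode.
    modes = sorted(v for v, c in counts.items() if c == top) if top > 1 else []
    return float(np.mean(arr)), float(np.median(arr)), modes

values = [2, 3, 3, 4, 5]        # illustrative data, not from the course
with_outlier = values + [100]   # same data plus one extreme value

print("without outlier:", central_tendency(values))
print("with outlier:   ", central_tendency(with_outlier))
```

Adding the single value 100 moves the mean from 3.4 to 19.5, while the median only moves from 3 to 3.5 and the mode stays at 3.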


**Population Variance** : $\huge \sigma^2 = \frac{\sum (y-\eta)^2}{N}$

**Sample variance** : $\huge S^2 = \frac{\sum (y-\bar{y})^2}{n-1}$    *If the denominator is $n-1$ $\rightarrow$ unbiased, if $n$ $\rightarrow$ biased*
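
As a rough check of the two formulas, NumPy's `np.var` switches between the two denominators through its `ddof` argument. The sketch below reuses six values from this notebook's quartile exercise. Treating them as either a whole population or a sample is only for illustration.

```python
import numpy as np

y = np.array([58.8, 30.8, 27.3, 29.9, 17.7, 76.5])
n = len(y)

pop_var = np.var(y, ddof=0)      # denominator N   (population formula)
sample_var = np.var(y, ddof=1)   # denominator n-1 (unbiased sample formula)

# Same sample variance written out from the definition
manual = np.sum((y - y.mean()) ** 2) / (n - 1)

print("population variance:", round(pop_var, 3))
print("sample variance:    ", round(sample_var, 3))
print("manual sample var.: ", round(manual, 3))
```

The sample variance is always the larger of the two, by the factor $n/(n-1)$.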

**Residuals and degrees of freedom**: *Definition*: The *residuals* are the deviations of the individual data points from the mean : 
$$\large y_1-\bar{y},\ y_2 - \bar{y},\ y_3 - \bar{y},\ \dots,\ y_n - \bar{y}$$

The sum of the residuals is always 0 : $\large \sum (y-\bar{y}) = 0$

The **number of degrees of freedom** is $\nu = n-1$. This is why we divide by $n-1$ in the sample variance (and therefore in the sample standard deviation).
Otherwise, we would "double-count" one piece of data, because the mean plus $n-1$ residuals fully define the data set.
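
The degrees-of-freedom argument can be checked numerically. In this minimal sketch (values reused from this notebook's exercise 2b), the residuals sum to zero up to floating-point rounding, so the last residual can be recovered from the other $n-1$.

```python
import numpy as np

y = np.array([44.5, 47.1, 48.7, 41.6, 32.8, 18.3])

residuals = y - y.mean()
total = residuals.sum()                   # zero, up to floating-point rounding

# The last residual is fixed once the other n-1 residuals are known
last_from_others = -residuals[:-1].sum()

dof = len(y) - 1                          # degrees of freedom: nu = n - 1

print("sum of residuals:", total)
print("last residual:", residuals[-1], "recovered:", last_from_others)
print("degrees of freedom:", dof)
```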

**Noise** = Effects of things outside our control





## Quartiles
**Quartiles** extend the idea of the median (the 50% point) and can be used to generate a **box plot**.
    
    (i) Interquartile range  IQR = Q3 - Q1

    (ii) Q1 = 25% point    Like the median, but 25% of the data lies below it

    (iii) Q3 = 75% point   75% of the data lies below it

    (iv) Lower outlier(s) < Q1 - (1.5 * IQR)

    (v) Upper outlier(s) > Q3 + (1.5 * IQR)

In [None]:
'''
Import useful libraries here
Run this cell first for convenience
'''
import numpy as np
from scipy import stats
import scipy
import warnings
import pandas as pd
import matplotlib.pyplot as plt
from scipy.ndimage import mean, median, variance
warnings.simplefilter('ignore', DeprecationWarning)

In [173]:
dataP = np.array([58.8, 30.8, 27.3, 29.9, 17.7, 76.5])
dataP = dataP.reshape(len(dataP), 1) # reshape to a single column: len(dataP) rows, 1 column
df = pd.DataFrame(data=dataP, columns = ['Series 2, Exercice 2'])
print("Data: \n", df, flush=True)

Data: 
    Series 2, Exercice 2
0                  58.8
1                  30.8
2                  27.3
3                  29.9
4                  17.7
5                  76.5


In [175]:
Q1 = df.quantile(0.25).values[0] # quantile returns a pandas Series
Q3 = df.quantile(0.75).values[0] # .values returns the ndarray stored in the Series
IQR = Q3 - Q1
print("Q1: ", Q1, flush=True)
print("Q3: ", Q3, flush=True)
print("IQR: ", round(IQR, 2))
    
del Q1, Q3, IQR, df, dataP

Q1:  27.95
Q3:  51.8
IQR:  23.85


To calculate the quartiles manually, sort the data in ascending order and let n = number of data points and q = percentile (25% and 75% in our case):
$$\huge Q_{position} = q \cdot (n+1)$$
If the position is not a whole number, e.g. 1.75, then the quartile lies 75% of the way between data point 1 and data point 2.

*Remark*, when the data set is very large, truncating the position doesn't affect precision much. With small data sets, interpolation is preferred.
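
The position rule can be sketched in plain Python. The helper name `quantile_n_plus_1` and the clamping of positions outside 1..n are assumptions of this sketch, not part of the course material. Note that this rule does not reproduce the pandas output shown earlier: by default `DataFrame.quantile` interpolates at a different position, $q \cdot (n-1) + 1$ in 1-based terms.

```python
import math

def quantile_n_plus_1(values, q):
    """Quantile from the position rule q * (n + 1), with linear interpolation.

    Positions are 1-based on the data sorted in ascending order.
    Positions outside [1, n] are clamped to the ends (assumption of this sketch).
    """
    data = sorted(values)
    n = len(data)
    pos = min(max(q * (n + 1), 1), n)
    lower = math.floor(pos)      # data point just below the position
    frac = pos - lower           # how far to move toward the next data point
    if lower >= n:               # position falls on the last data point
        return data[-1]
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])

data = [58.8, 30.8, 27.3, 29.9, 17.7, 76.5]
q1 = quantile_n_plus_1(data, 0.25)   # position 0.25 * 7 = 1.75
q3 = quantile_n_plus_1(data, 0.75)   # position 0.75 * 7 = 5.25
print("Q1:", round(q1, 3), "Q3:", round(q3, 3), "IQR:", round(q3 - q1, 3))
```

For these six values the rule gives Q1 = 24.9 and Q3 = 63.225, whereas the pandas default gives 27.95 and 51.8. With small data sets the choice of convention matters, so state which one is used.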

In [201]:
data = np.array([44.5, 47.1, 48.7, 41.6, 32.8, 18.3])
data = data.reshape(len(data), 1)
df = pd.DataFrame(data=data, columns=["Ex 2"])
Q1 = df.quantile(0.25).values[0]
Q3 = df.quantile(0.75).values[0]
IQR = Q3 - Q1
print("Exercice 2b")
print("Q1: ", Q1, flush=True)
print("Q3: ", Q3, flush=True)
print("IQR: ", round(IQR, 2))
del data, df, Q1, Q3, IQR

Exercice 2b
Q1:  35.0
Q3:  46.45
IQR:  11.45


## Ruling out outliers using quartiles

In [198]:
data = np.array([58.8, 30.8, 27.3, 29.9, 17.7, 76.5, 44.5, 47.1, 48.7, 41.6, 32.8, 18.3, 73.3, 57.1, 66, 93.8, 133.2, 81.1, 30.6, 24.2, 16.6, 38.9, 28.7, 23.6])
data = data.reshape(len(data),1)
df = pd.DataFrame(data=data, columns=["data:"])
Q1 = df.quantile(0.25).values[0]
Q3 = df.quantile(0.75).values[0]
IQR = Q3 - Q1
mask = (df < Q1 - 1.5 * IQR) | (df > Q3 + 1.5 * IQR)  # True where a value lies outside the fences
outliers = df[mask].dropna().to_numpy()  # keep only the flagged rows
print("Outliers:")
for e in outliers:
    print(e)

del data, df, Q1, Q3, IQR, mask, outliers

Outliers:
[133.2]


### Sampling

A **sample** is a subset of a **population**. **Sampling in inferential statistics** involves (i) selecting a sample from the population and (ii) generalizing the results from the sample to the population. In other words, **inference** assumes that the characteristics of the *sample* 
apply to the population.

If the sampling results in an inaccurate representation of the population, there is poor **external validity**.
After drawing a **hypothesis** from the inference process, we need to **test** it.
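
The two sampling steps can be sketched with NumPy. The population below is synthetic (drawn from a normal distribution with an arbitrary mean and spread) and the sample size of 100 is an arbitrary choice, both made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable

# Hypothetical population: 10,000 synthetic measurements
population = rng.normal(loc=50.0, scale=10.0, size=10_000)

# (i) select a random sample from the population (without replacement)
sample = rng.choice(population, size=100, replace=False)

# (ii) generalize: use the sample mean as an estimate of the population mean
pop_mean = population.mean()
sample_mean = sample.mean()

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"difference:      {sample_mean - pop_mean:+.2f}")
```

A random sample like this one tends to track the population closely. A sample chosen in a biased way (for example, only the largest values) would not, which is the poor external validity described above.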


# Different types of plots / graphs

## Histogram (closely related to the "frequency polygon")
The x-axis indicates ranges of values (bins); the y-axis indicates the frequency of the values in each range. 
![image.png](attachment:f7424931-2014-4461-8340-44ad52156282.png)


## Ogive or cumulative frequency plot 
![image.png](attachment:1ea73e40-81a6-461c-a92e-b99df3eb71b9.png)
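
The numbers behind a histogram and an ogive can be computed with `np.histogram` and `np.cumsum`. This is a minimal sketch: the bin width of 20 is an arbitrary assumption, and the data are twelve values reused from this notebook's exercises.

```python
import numpy as np

data = np.array([58.8, 30.8, 27.3, 29.9, 17.7, 76.5,
                 44.5, 47.1, 48.7, 41.6, 32.8, 18.3])

bin_edges = np.arange(0, 101, 20)            # bins 0-20, 20-40, ..., 80-100
counts, edges = np.histogram(data, bins=bin_edges)
cumulative = np.cumsum(counts)               # running total: the ogive values

for lo, hi, c, cum in zip(edges[:-1], edges[1:], counts, cumulative):
    print(f"{lo:>3}-{hi:<3}  frequency: {c}  cumulative: {cum}")
```

Plotting `counts` as bars over the bins gives the histogram. Plotting `cumulative` against the upper bin edges `edges[1:]` gives the ogive, whose last point always equals the number of data points.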


## Pie chart
## Bar chart
*Definition* : A bar chart shows a numerical value for each category (**ordinal scaled data**), whilst a histogram shows the frequency of the values falling in each range.
## Time series graph
*Definition* : The horizontal axis represents time periods, the vertical axis represents the numerical values associated with the time periods.