## Statistics for machine learning

### Types of Data

Understanding data types is the foundation of statistics.

1. Qualitative (Categorical) - Features (columns) in your dataset that represent categories or groups.

- Nominal - Categories with no order (e.g., gender, colors).
- Ordinal - Categories with order (e.g., education level, satisfaction ratings).

2. Quantitative (Numerical) - Features (columns) in your dataset that represent counts or measurements.

- Discrete - Countable numbers (e.g., number of students).
- Continuous - Measurable values (e.g., height, weight, temperature).
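
In ML work, the data type usually decides how a feature is encoded. The following is a minimal sketch in plain Python; the category levels and the helper names `encode_ordinal` and `encode_one_hot` are hypothetical and chosen only for illustration.

```python
# Hypothetical category levels, chosen only for illustration.
EDUCATION_LEVELS = ["primary", "secondary", "bachelor", "master"]  # ordinal: order matters
COLORS = ["red", "green", "blue"]                                  # nominal: no order


def encode_ordinal(value, levels):
    """Map an ordinal category to its rank so the order is preserved."""
    return levels.index(value)


def encode_one_hot(value, levels):
    """Map a nominal category to a one-hot vector (no order implied)."""
    return [1 if value == level else 0 for level in levels]


print(encode_ordinal("bachelor", EDUCATION_LEVELS))  # 2
print(encode_one_hot("green", COLORS))               # [0, 1, 0]
```

Ranks keep the ordering of ordinal categories, while one-hot vectors avoid suggesting an order that nominal categories do not have.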

Statistics is the science of collecting, organizing, analyzing, and interpreting data to make decisions.
It helps us summarize information and understand patterns.

Two main branches:

- Descriptive Statistics - Summarizes, organizes, and presents data in a meaningful way.
- Inferential Statistics - Draws conclusions or makes predictions about a population from a sample (e.g., hypothesis testing, confidence intervals).

## Types of Sampling
Sampling is the process of selecting a subset of data (a sample) from a larger group (the population) in order to draw conclusions about the whole population.

### 1. Probability Sampling

- In this type, every element in the population has a known, non-zero chance of being selected.
- It reduces bias and allows for statistical inference.

#### a) Simple Random Sampling (SRS)

- Every item has an equal chance of selection.
- Example: A teacher wants to select 5 students at random from a class of 30.
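
The teacher example can be sketched with the standard library. The roster names and the seed value are assumptions made for illustration.

```python
import random

random.seed(42)  # fixed seed so the draw is repeatable; the value is arbitrary

# Hypothetical class roster: 30 students labelled by number.
students = [f"student_{i}" for i in range(1, 31)]

# random.sample draws without replacement, so no student is picked twice.
chosen = random.sample(students, k=5)
print(chosen)
```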

#### b) Systematic Sampling

- Select every k-th element from an ordered list, usually after a random starting point.
- Example: From a list of 1,000 students, pick every 10th student.
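
A minimal sketch of the 1,000-student example, assuming the students are identified by the numbers 1 to 1000 and that the starting position is drawn at random within the first interval:

```python
import random

random.seed(0)  # arbitrary seed, only for repeatability

population = list(range(1, 1001))  # hypothetical student IDs 1..1000
k = 10                             # sampling interval: every 10th student

# Start at a random position within the first interval, then step by k.
start = random.randrange(k)
sample = population[start::k]

print(len(sample))  # 100
print(sample[:3])
```

If the list has a periodic pattern that lines up with `k`, the sample can be biased, so the ordering of the list matters.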

#### c) Stratified Sampling

- Population is divided into strata (groups) based on characteristics (e.g., gender, age).
- A random sample is taken from each stratum.
- Example: In a company with 70 men and 30 women, a researcher wants 40 employees. They randomly pick 28 men and 12 women (keeping proportions).
- Ensures representation of subgroups.
- Often used in ML (e.g., stratified k-fold cross-validation keeps the class proportions of the full dataset in every fold).
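
The company example can be sketched as proportional allocation followed by a simple random sample inside each stratum. The employee labels and the `strata` dictionary are hypothetical.

```python
import random

random.seed(1)  # arbitrary seed, only for repeatability

# Hypothetical employee list matching the example: 70 men and 30 women.
strata = {
    "men": [f"man_{i}" for i in range(70)],
    "women": [f"woman_{i}" for i in range(30)],
}
total = sum(len(members) for members in strata.values())
sample_size = 40

sample = {}
for name, members in strata.items():
    # Proportional allocation: each stratum keeps its share of the population.
    n_stratum = round(sample_size * len(members) / total)
    sample[name] = random.sample(members, n_stratum)

print({name: len(chosen) for name, chosen in sample.items()})  # {'men': 28, 'women': 12}
```

With other proportions the rounded stratum sizes may not add up exactly to `sample_size`, so a real implementation would need a rule for distributing the remainder.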

#### d) Cluster Sampling

- Population is divided into clusters (e.g., schools, cities).
- Randomly select clusters, then sample all members in those clusters.
- Example: A researcher wants to study high school students in a county with 20 schools. Instead of sampling from every school, she randomly selects 4 schools and surveys every student in them.
- Useful when population is spread across a wide area.
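
A minimal sketch of the school example, assuming each of the 20 schools has 50 students (the school and student labels are made up for illustration):

```python
import random

random.seed(7)  # arbitrary seed, only for repeatability

# Hypothetical county: 20 schools, each with its own list of students.
schools = {
    f"school_{s}": [f"school_{s}_student_{i}" for i in range(50)]
    for s in range(1, 21)
}

# Stage 1: randomly select whole clusters (schools).
selected_schools = random.sample(sorted(schools), k=4)

# Stage 2: include every student from the selected schools.
sample = [student for school in selected_schools for student in schools[school]]

print(selected_schools)
print(len(sample))  # 200
```

Unlike stratified sampling, only some groups are visited, but every member of a visited group is included.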

### 2. Non-Probability Sampling

Here, selection is not random: the chance of being selected is unknown, and some members of the population may have no chance of being selected at all.
It’s easier and cheaper, but more bias-prone.

#### a) Convenience Sampling

- Choose the easiest members to reach.
- Example: Surveying friends, people at the mall.

#### b) Judgmental/Purposive Sampling

- Researcher selects participants based on judgment of who is most useful.
- Example: Interviewing only industry experts.

#### c) Quota Sampling

- The population is divided into groups, and participants are picked non-randomly until each group's quota is met.
- Example: A surveyor needs 20 responses: 10 men and 10 women. Instead of randomly selecting, they stop people on the street until they reach 10 men and 10 women.

#### d) Snowball Sampling

- Existing participants recruit future participants.
- Example: Imagine you are researching drug users in a city. You start by finding one participant who uses drugs, who then refers you to other users they know, and so on.
- Common in hard-to-reach groups (e.g., people with rare diseases).


## Descriptive Statistics
### 1. Measures of Central Tendency

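- Mean (Arithmetic Average) - Sum of the values divided by the number of values. Sensitive to outliers (extreme values).
- Median (Middle Value) - The middle value of the sorted data. Robust to outliers.
- Mode (Most Frequent Value) - The value that occurs most often. Also works for categorical data.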
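
The effect of an outlier on each measure can be checked with Python's `statistics` module. The dataset below is a small illustrative one, not taken from any real source.

```python
import statistics

scores = [70, 75, 80, 80, 95]   # small illustrative dataset
with_outlier = scores + [500]   # same data plus one extreme value

# Without the outlier: mean, median, and mode all equal 80.
print(statistics.mean(scores), statistics.median(scores), statistics.mode(scores))

# With the outlier: the mean jumps to 150, the median and mode stay at 80.
print(statistics.mean(with_outlier), statistics.median(with_outlier), statistics.mode(with_outlier))
```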

### 2. Measures of Dispersion (Spread)
These describe how much variation exists in the data.

- a) Range - Difference between the maximum and minimum values in a dataset.
- b) Variance
Variance measures how far the values in a dataset spread out from the mean.
It is the average squared deviation from the mean.
When working with a sample (not the whole population), we divide by `n - 1` instead of `n`.
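
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \qquad\qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Here $\sigma^2$ is the population variance (population mean $\mu$, size $N$) and $s^2$ is the sample variance (sample mean $\bar{x}$, size $n$).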

- Low variance → data points are close to the mean (less spread).
- High variance → data points are spread out widely from the mean.
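
To make the definition concrete, here is a step-by-step sketch in plain Python that computes both versions of the variance for the five scores used in the notebook cells of this document:

```python
scores = [70, 75, 80, 85, 90]
n = len(scores)
mean = sum(scores) / n

# Sum of squared deviations from the mean: 100 + 25 + 0 + 25 + 100.
squared_deviations = sum((x - mean) ** 2 for x in scores)

population_variance = squared_deviations / n    # divide by n
sample_variance = squared_deviations / (n - 1)  # divide by n - 1

print(population_variance)  # 50.0
print(sample_variance)      # 62.5
```

The sample variance is always a little larger than the population formula applied to the same numbers, and the gap shrinks as `n` grows.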

In Machine Learning

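- The variance of a feature shows how much that feature varies across the dataset.
- "Variance" is also used for models (the bias-variance trade-off), where it describes how much a model's predictions change when the training data changes.
- High model variance → overfitting (model is too sensitive to the training data).
- Low model variance combined with high bias → underfitting (model is too simplistic).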

- c) Standard Deviation
Square root of variance → measures spread of data in the same unit as the data, easier to interpret.

- d) Interquartile Range (IQR)
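The IQR is a measure of statistical dispersion (spread).
It tells us how spread out the middle 50% of the values in a dataset are.
Because it focuses only on the central part of the data, it is not affected by extreme values.

Calculated as: `IQR = Q3 − Q1`, where Q1 is the 25th percentile (first quartile) and Q3 is the 75th percentile (third quartile).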
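
A minimal sketch comparing the IQR with the range, using `statistics.quantiles` from the standard library. The helper name `iqr` is hypothetical, and the `"inclusive"` quartile convention is an assumption; other conventions can give slightly different quartiles for small datasets.

```python
import statistics


def iqr(values):
    """Interquartile range using the 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


scores = [70, 75, 80, 85, 90]
scores_with_outlier = [70, 75, 80, 85, 500]  # last value replaced by an extreme one

# IQR first, then range (max - min).
print(iqr(scores), max(scores) - min(scores))  # 10.0 20
print(iqr(scores_with_outlier),
      max(scores_with_outlier) - min(scores_with_outlier))  # 10.0 430
```

The extreme value changes the range from 20 to 430 but leaves the IQR at 10.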

## Inferential Statistics
### Key Concepts in Inferential Statistics

#### Population vs. Sample

- Population: Entire group we are interested in (e.g., all university students in Kenya).
- Sample: Subset of the population used for study (e.g., 200 students from 5 universities).

#### Parameters vs. Statistics

- Parameter: A numerical summary of the population (e.g., true average height of all students).
- Statistic: A numerical summary of the sample (e.g., average height of sampled 200 students).
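
The difference between a parameter and a statistic can be seen in a small simulation. The population below is simulated (heights drawn from a normal distribution with mean 170 cm and standard deviation 10 cm); these numbers are assumptions for illustration, not real data.

```python
import random
import statistics

random.seed(123)  # arbitrary seed, only for repeatability

# Hypothetical population: heights (cm) of 10,000 students, simulated.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Draw a simple random sample of 200 students.
sample = random.sample(population, k=200)

parameter = statistics.mean(population)  # population mean (usually unknown in practice)
statistic = statistics.mean(sample)      # sample mean (what we can actually compute)

print(f"Parameter (population mean): {parameter:.2f}")
print(f"Statistic (sample mean):     {statistic:.2f}")
```

The two values are close but not identical; a different sample would give a slightly different statistic, while the parameter stays fixed.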

#### Estimation

- Point Estimation: Using sample data to calculate a single value (estimate) of a population parameter.
- Example: Mean salary of a sample = estimate of population mean salary.

- Interval Estimation: Gives a range of values (confidence interval) likely to contain the population parameter.
- Example: “The average salary is between \$900 and \$1,100, with 95% confidence.”
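
Both kinds of estimate can be sketched with the standard library. The salary values are hypothetical, and the interval uses the normal approximation (z ≈ 1.96); for a sample this small a t-based interval would normally be preferred and would be somewhat wider.

```python
import math
import statistics

# Hypothetical monthly salaries for a sample of 10 employees (illustrative values).
salaries = [950, 1020, 980, 1100, 900, 1050, 990, 1010, 970, 1030]

n = len(salaries)
point_estimate = statistics.mean(salaries)  # point estimate of the population mean

# Standard error of the mean; statistics.stdev uses the sample formula (n - 1).
standard_error = statistics.stdev(salaries) / math.sqrt(n)

# z-value for 95% confidence from the standard normal distribution (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error

print(f"Point estimate: {point_estimate:.1f}")
print(f"95% confidence interval: ({lower:.1f}, {upper:.1f})")
```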

In [1]:
import numpy as np

In [3]:
scores = [70, 75, 80, 85, 90]
scores

[70, 75, 80, 85, 90]

In [4]:
# Step 1: Mean
mean_score = np.mean(scores)
print("Mean:", mean_score)

Mean: 80.0


In [4]:
# Step 2: Variance
variance = np.var(scores, ddof=0)   # Population variance
print("Variance:", variance)

Variance: 50.0


In [5]:
# Step 3: Standard Deviation
std_dev = np.std(scores, ddof=0)    # Population standard deviation
print("Standard Deviation:", std_dev)

Standard Deviation: 7.0710678118654755


`ddof` stands for Delta Degrees of Freedom. NumPy divides the sum of squared deviations by `N - ddof`.

When calculating variance and standard deviation, you can use two approaches:

1. Population formula - divide by N (total number of data points).
2. Sample formula - divide by N - 1 (this corrects the underestimate that comes from measuring deviations around the sample mean instead of the unknown population mean).

In NumPy:

- `ddof=0` (the default) → uses the population formula (divide by N).
- `ddof=1` → uses the sample formula (divide by N - 1).
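
For comparison with the `ddof=0` results in the cells above, this short sketch runs the same scores through both settings:

```python
import numpy as np

scores = [70, 75, 80, 85, 90]

pop_var = np.var(scores, ddof=0)     # divides by N      -> 250 / 5
sample_var = np.var(scores, ddof=1)  # divides by N - 1  -> 250 / 4

print("Population variance:", pop_var)
print("Sample variance:", sample_var)
print("Sample standard deviation:", np.std(scores, ddof=1))
```

Use `ddof=1` when the scores are a sample from a larger group and `ddof=0` when they are the entire population of interest.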