## Statistics

Statistics is a foundational pillar of machine learning, providing essential tools for data understanding, uncertainty quantification, and reliable model building. 

#### Descriptive Statistics

Descriptive statistics are used to describe and summarize dataset features, helping to uncover patterns, outliers, and distributions early in ML projects. Key measures include:

- Mean (average): The sum of all values divided by the number of values.

- Median: The middle value in a sorted dataset (splits the data in half).

- Mode: The value that appears most frequently.

- Variance: The average of the squared differences from the mean, which shows how spread out the data is. The sample variance divides by n - 1 rather than n.

- Standard deviation: The square root of the variance, indicating the typical distance from the mean.

- Range: The difference between the maximum and minimum values.

- Interquartile range (IQR): The spread of the middle 50% of values (75th percentile minus 25th percentile).

In [None]:
import pandas as pd

data = [1, 2, 2, 3, 4, 5]
df = pd.DataFrame(data, columns=["Values"])
df.describe()

| Statistic | Values |
|---|---|
| count | 6.0 |
| mean | 2.833333 |
| std | 1.47196 |
| min | 1.0 |
| 25% | 2.0 |
| 50% | 2.5 |
| 75% | 3.75 |
| max | 5.0 |
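
The `describe()` summary leaves out several of the measures listed above: the mode, variance, range, and IQR. Below is a minimal sketch of how they could be computed, assuming Python's standard-library `statistics` module rather than pandas. The `summary` dictionary is an illustrative name, not part of the original notebook.

```python
import statistics

data = [1, 2, 2, 3, 4, 5]

# method="inclusive" interpolates linearly between data points; for this
# data it yields 2.0, 2.5 and 3.75, the same quartiles describe() reports
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

summary = {
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "population_variance": statistics.pvariance(data),  # divides by n
    "sample_variance": statistics.variance(data),       # divides by n - 1
    "sample_std": statistics.stdev(data),               # square root of sample variance
    "range": max(data) - min(data),
    "iqr": q3 - q1,
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The sample standard deviation computed here is the quantity pandas reports as `std`, since pandas also divides by n - 1 by default.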


#### Inferential Statistics

Inferential statistics allow us to make generalizations or predictions about an entire population based on a sample. Common techniques:

- Hypothesis testing: Assesses whether a result observed in a sample reflects a real effect in the population or could plausibly be due to chance.

- Confidence intervals: Give a range of values likely to contain the population parameter at a stated confidence level (for example, 95%).

- Regression analysis: Models relationships between variables (see below).

Python libraries like scipy.stats and statsmodels offer built-in functions for t-tests, confidence intervals, etc.

In [None]:
from scipy import stats

# Independent two-sample t-test: compares the means of two groups
# to see whether they are statistically different
# t_stat: the difference between the group means relative to the variation in the sample data
# p_value: probability of observing a difference at least this extreme
#          if the null hypothesis (equal means) is true

t_stat, p_value = stats.ttest_ind([1,2,3], [4,5,6])
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# A p-value below a chosen threshold (commonly 0.05) is treated as evidence against the null hypothesis
# The two groups here are tiny and serve only as a demonstration
# In practice, use larger samples for reliable results

T-statistic: -3.6742346141747673, P-value: 0.021311641128756727
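
Confidence intervals, listed above, can be illustrated in a similar way. The following is a rough sketch, not a definitive recipe: it assumes a normal approximation (using the standard-library `NormalDist`) rather than the t-distribution, and the `sample` values and the `mean_confidence_interval` helper are hypothetical, invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical measurements, invented purely for illustration
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    center = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    # Two-sided critical value, roughly 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error

low, high = mean_confidence_interval(sample)
low_99, high_99 = mean_confidence_interval(sample, confidence=0.99)
print(f"95% interval: ({low:.3f}, {high:.3f})")
print(f"99% interval: ({low_99:.3f}, {high_99:.3f})")
```

For a sample this small, a t-based interval (for instance via `scipy.stats`) would be somewhat wider. The normal approximation is used here only to keep the sketch dependency-free.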


#### Probability Theory

Probability quantifies uncertainty in data and models. ML relies on:

- Random variables: Variables whose values are outcomes of a random phenomenon.

- Probability distributions: Functions like the normal (Gaussian) or binomial distributions that model the probability of outcomes.

- Bayes' theorem: Updates probability estimates based on new evidence.

In [10]:
# Probability of getting exactly 2 heads in 3 fair coin tosses:

from math import comb
p = 0.5
prob_2_heads = comb(3,2) * (p**2) * ((1-p)**1)  # Binomial probability
print(f"Probability of getting exactly 2 heads in 3 tosses: {prob_2_heads}")


Probability of getting exactly 2 heads in 3 tosses: 0.375


In [11]:
# For normal distribution:

from scipy.stats import norm
prob = norm.cdf(1.96)  # Probability value below z=1.96
print(f"Probability of z < 1.96: {prob}")  # Should be close to 0.975

Probability of z < 1.96: 0.9750021048517795
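
Bayes' theorem, the third item in the list above, can be sketched with a diagnostic-test example. All numbers below (prevalence, sensitivity, false positive rate) are hypothetical values chosen for illustration, and `posterior_probability` is an illustrative helper name.

```python
def posterior_probability(prior, likelihood, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of a positive test, with or without the condition
    evidence = prior * likelihood + (1 - prior) * false_positive_rate
    return prior * likelihood / evidence

# Hypothetical values, not taken from any real test
prior = 0.01                # P(condition)
sensitivity = 0.95          # P(positive | condition)
false_positive_rate = 0.05  # P(positive | no condition)

posterior = posterior_probability(prior, sensitivity, false_positive_rate)
print(f"P(condition | positive test): {posterior:.4f}")
```

Under these assumed numbers the posterior stays well below the test's sensitivity, because the low prior means most positive results come from the much larger group without the condition.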


#### Sampling Techniques

Sampling selects a representative subset of a population, which matters because ML models are almost always trained on samples rather than entire populations.

- Random sampling: Each data point has an equal chance of being selected.

- Stratified sampling: Preserves the proportions of key subgroups (for example, class labels) in each sample.

In [12]:
# Stratified random train/test split using scikit-learn:

from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[i] for i in range(10)])  # Features
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # Binary target variable

# stratify=y keeps the class proportions of y similar in the train and test sets
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
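
To make the idea behind stratification concrete, here is a hand-rolled sketch in plain Python. It is not how `train_test_split` is implemented; `stratified_split` is a hypothetical helper that shuffles the indices of each class separately and takes the same fraction from each.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Split indices so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices, test_indices = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)  # random sampling within each class
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)

y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
train_idx, test_idx = stratified_split(y, test_fraction=0.5)

train_counts = Counter(y[i] for i in train_idx)
test_counts = Counter(y[i] for i in test_idx)
print(f"Train class counts: {dict(train_counts)}")
print(f"Test class counts: {dict(test_counts)}")
```

Both halves keep the original 4:6 class ratio (2 and 3 samples per class), which a purely random split of ten points would not guarantee.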


#### Regression Analysis

Regression models the relationship between variables and is essential for prediction.

- Linear regression: Predicts a continuous variable.

- Logistic regression: Predicts categorical outcomes (classification).

In [None]:
# Using scikit-learn (reuses X_train, X_test, y_train, y_test from the sampling example above):

from sklearn.linear_model import LinearRegression, LogisticRegression
model = LinearRegression().fit(X_train, y_train)
log_model = LogisticRegression().fit(X_train, y_train)
# score() returns R^2 for LinearRegression and mean accuracy for LogisticRegression
print(f"Linear Regression R^2: {model.score(X_test, y_test)}")
print(f"Logistic Regression Accuracy: {log_model.score(X_test, y_test)}")
# The target here is binary, so the linear model is fitted only to illustrate the API

Linear Regression R^2: 0.8260679098062556
Logistic Regression Accuracy: 1.0
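
As a rough illustration of what a linear regression fit computes, the sketch below applies the ordinary least squares formulas for a single feature in plain Python. The data points and the `fit_simple_linear_regression` helper are hypothetical, and this is not scikit-learn's implementation.

```python
from statistics import mean

def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for one feature."""
    x_mean, y_mean = mean(xs), mean(ys)
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical data lying roughly on the line y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
```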


#### Hypothesis Testing

Hypothesis testing checks whether observed differences in data are statistically significant.

- Null hypothesis: No effect or difference (the default assumption).

- Alternative hypothesis: Contradicts the null hypothesis by asserting a real effect.
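
One way to see the null hypothesis at work is an exact permutation test, sketched below. This is a different technique from the t-test used earlier: it assumes that, under the null hypothesis, group labels are interchangeable, and it counts how many relabelings give a mean difference at least as large as the one observed. The groups reuse the small demonstration values from the t-test example.

```python
from itertools import combinations

group_a = [1, 2, 3]
group_b = [4, 5, 6]

def mean_difference(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

observed = abs(mean_difference(group_a, group_b))
pooled = group_a + group_b

extreme = 0
total = 0
# Enumerate every way of assigning len(group_a) pooled values to "group A"
for chosen in combinations(range(len(pooled)), len(group_a)):
    chosen_set = set(chosen)
    a = [pooled[i] for i in chosen_set]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
    total += 1
    # Small tolerance guards against floating-point rounding
    if abs(mean_difference(a, b)) >= observed - 1e-12:
        extreme += 1

p_value = extreme / total
print(f"{extreme} of {total} relabelings are at least as extreme; p-value = {p_value}")
```

With only three values per group there are just 20 possible relabelings, so the smallest two-sided p-value this test can produce is 2/20 = 0.1. This is one reason tiny samples give unreliable conclusions.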

#### Statistical Learning Theory

Statistical learning theory provides the mathematical basis of machine learning. It studies how well models generalize from training data to unseen data, focusing on:

- Generalization: Model performance on new, unseen data.

- Bias-variance tradeoff: Balancing the error of overly simple models (bias) against the error of models that are overly sensitive to the training data (variance).
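
The gap between training and held-out error can be sketched with two deliberately extreme toy models. Everything here is an assumption made for illustration: the synthetic data (a noisy line), the constant predictor standing in for a high-bias model, and the 1-nearest-neighbor predictor standing in for a high-variance model.

```python
import random

rng = random.Random(0)

def make_data(n):
    """Synthetic data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(30)
test_x, test_y = make_data(30)

train_mean = sum(train_y) / len(train_y)

def constant_model(x):
    # Ignores the input entirely: too simple to capture the trend
    return train_mean

def nearest_neighbor_model(x):
    # Returns the target of the closest training point: memorizes the noise
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

models = {"constant": constant_model, "1-nearest neighbor": nearest_neighbor_model}
results = {
    name: (mse(model, train_x, train_y), mse(model, test_x, test_y))
    for name, model in models.items()
}
for name, (train_error, test_error) in results.items():
    print(f"{name}: train MSE = {train_error:.3f}, test MSE = {test_error:.3f}")
```

The nearest-neighbor model reproduces its training targets exactly, so its training error is zero, yet it still makes errors on the held-out points. A low training error is therefore not evidence of good generalization on its own.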
