# Discrete Math & Statistics for Machine Learning

This notebook covers the discrete mathematics and statistics that underpin machine learning:

- Combinatorics (counting & probability foundations)
- Statistics (mean, variance, covariance)
- Hypothesis testing
- Information theory (entropy, cross-entropy)

These concepts directly power:
- Naive Bayes
- Decision trees
- Cross-entropy loss
- Feature selection

In [5]:
import numpy as np
import matplotlib.pyplot as plt

np.set_printoptions(precision=4, suppress=True)

# 1. Combinatorics

Combinatorics helps us count possibilities.

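Factorial:
n! = n × (n-1) × ... × 1, with 0! = 1 by convention

Factorials appear in the permutation and combination formulas below and in distributions such as the binomial.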

In [6]:
def factorial_manual(n):
    if n == 0:
        return 1
    result = 1
    for i in range(1, n+1):
        result *= i
    return result

print("5! =", factorial_manual(5))

5! = 120


## Permutations

Number of ordered arrangements of r objects chosen from n distinct objects:

P(n, r) = n! / (n-r)!

In [7]:
from math import factorial, log2

def permutations(n, r):
    return factorial(n) // factorial(n-r)

print("P(5,2) =", permutations(5,2))

P(5,2) = 20
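
As a sanity check on the formula, the sketch below counts ordered arrangements by brute-force enumeration with `itertools.permutations` and compares the count with n! / (n-r)!. The helper names `permutations_formula` and `permutations_by_enumeration` are illustrative and not part of the original notebook.

```python
from itertools import permutations as iter_permutations
from math import factorial

def permutations_formula(n, r):
    # Closed form: n! / (n - r)!
    return factorial(n) // factorial(n - r)

def permutations_by_enumeration(n, r):
    # Count every ordered arrangement of r items drawn from n distinct items
    return sum(1 for _ in iter_permutations(range(n), r))

print("Formula:    ", permutations_formula(5, 2))
print("Enumeration:", permutations_by_enumeration(5, 2))
```

Enumeration is only practical for small n, but it makes the meaning of "ordered" concrete: (0, 1) and (1, 0) are counted separately.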


## Combinations

Number of ways to choose r objects from n when order does not matter:

C(n, r) = n! / (r! (n-r)!)

Used in the binomial distribution.

In [8]:
def combinations(n, r):
    return factorial(n) // (factorial(r) * factorial(n-r))

print("C(5,2) =", combinations(5,2))

C(5,2) = 10
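
The link between the two formulas is that every unordered selection of r objects is counted r! times among the ordered arrangements, so C(n, r) = P(n, r) / r!. A minimal sketch of that relationship, assuming Python 3.8+ for `math.perm` and `math.comb`:

```python
from math import comb, factorial, perm

n, r = 5, 2
ordered = perm(n, r)                  # ordered arrangements, P(n, r)
unordered = ordered // factorial(r)   # each selection appears r! times in `ordered`

print("P(5,2) / 2! =", unordered)
print("math.comb(5, 2) =", comb(n, r))
```

In practice `math.comb` is the simpler choice, since it avoids building large intermediate factorials by hand.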


# 2. Binomial Distribution

Probability of exactly k successes in n independent trials, each with success probability p:

P(X=k) = C(n,k) p^k (1-p)^(n-k)

Used in:
- Binary classification
- Bernoulli processes

In [9]:
def binomial_probability(n, k, p):
    return combinations(n, k) * (p**k) * ((1-p)**(n-k))

print("P(3 successes in 5 trials, p=0.5):", binomial_probability(5,3,0.5))

P(3 successes in 5 trials, p=0.5): 0.3125
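
Two properties are worth checking for any probability mass function: the probabilities over k = 0..n sum to 1, and the mean equals n·p. The sketch below verifies both for n = 5, p = 0.5; `binomial_pmf` is an illustrative stand-in for `binomial_probability` that uses `math.comb`.

```python
from math import comb

def binomial_pmf(n, k, p):
    # C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, 0.5
pmf = [binomial_pmf(n, k, p) for k in range(n + 1)]

total = sum(pmf)                                  # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))    # should be n * p

print("PMF:", pmf)
print("Sum of probabilities:", total)
print("Mean:", mean, "| n*p:", n * p)
```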


# 3. Descriptive Statistics

Statistics summarize data.

Key measures:
- Mean
- Variance
- Standard deviation

In [10]:
data = np.random.randn(1000)

mean = np.mean(data)
variance = np.var(data)  # population variance (ddof=0, NumPy's default)
std = np.std(data)       # population standard deviation (ddof=0)

print("Mean:", mean)
print("Variance:", variance)
print("Std Dev:", std)

Mean: 0.02999326587781465
Variance: 0.9824353249435221
Std Dev: 0.9911787552926677
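
`np.var` and `np.std` divide by n by default (`ddof=0`), which gives the population variance. When the data are a sample and the goal is to estimate the variance of the underlying population, the usual choice is to divide by n-1 (`ddof=1`). A small sketch of the difference on a hand-picked array (the values are illustrative only):

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(values)               # sum of squared deviations / n
sample_var = np.var(values, ddof=1)    # sum of squared deviations / (n - 1)

print("Population variance (ddof=0):", pop_var)
print("Sample variance (ddof=1):   ", sample_var)
```

For 1000 points the two estimates are nearly identical, but the gap matters for small samples.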


# 4. Covariance

Covariance measures how two variables move together.

Cov(X,Y) = E[(X - μx)(Y - μy)]

Used in:
- PCA
- Feature relationships

In [11]:
x = np.random.randn(1000)
y = 2*x + np.random.randn(1000)*0.5

cov_matrix = np.cov(x, y)  # 2x2 matrix; np.cov divides by n-1 by default

print("Covariance matrix:\n", cov_matrix)

Covariance matrix:
 [[0.9977 1.9986]
 [1.9986 4.283 ]]
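
The values are close to what the construction predicts: with Var(x) ≈ 1 and independent noise of standard deviation 0.5, Cov(x, y) ≈ 2 and Var(y) ≈ 4 + 0.25 = 4.25. The sketch below computes the off-diagonal entry directly from the definition and rescales it into a correlation coefficient. The fixed seed and the use of `np.random.default_rng` are assumptions made here for reproducibility; the original cell is unseeded.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
x = rng.standard_normal(1000)
y = 2 * x + 0.5 * rng.standard_normal(1000)

# Sample covariance from the definition, dividing by n - 1 to match np.cov
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
cov_numpy = np.cov(x, y)[0, 1]

# Correlation is covariance rescaled by both standard deviations
corr = cov_manual / (np.std(x, ddof=1) * np.std(y, ddof=1))

print("Manual covariance:", cov_manual)
print("np.cov entry:     ", cov_numpy)
print("Correlation:      ", corr)
```

Covariance depends on the units of both variables, while correlation always lies in [-1, 1], which makes it easier to compare across feature pairs.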


# 5. Hypothesis Testing (Intuition)

Null hypothesis (H₀):
No effect / no difference.

Alternative hypothesis (H₁):
There is an effect.

We compute a test statistic and compare against a threshold.

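Example: one-sample z-test of H₀: μ = 0. The population standard deviation is estimated from the sample, which is a reasonable approximation at n = 100; for small samples a t-test is the usual choice.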

In [12]:
sample = np.random.normal(loc=1.0, scale=1.0, size=100)

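population_mean = 0  # mean under the null hypothesis
sample_mean = np.mean(sample)
sample_std = np.std(sample)  # ddof=0 here; ddof=1 gives the usual sample estimate

# Standard error of the mean is sample_std / sqrt(n)
z_score = (sample_mean - population_mean) / (sample_std / np.sqrt(len(sample)))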

print("Z-score:", z_score)

Z-score: 9.343825080808616


If |z| exceeds the critical value for the chosen significance level (1.96 for a two-sided test at the 5% level), we reject the null hypothesis.
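
A z-score can also be converted into a p-value, the probability of seeing a statistic at least this extreme if the null hypothesis were true. For a standard normal statistic the two-sided p-value is erfc(|z| / √2). A minimal sketch using only the standard library; `two_sided_p_value` is an illustrative helper name:

```python
from math import erfc, sqrt

def two_sided_p_value(z):
    # P(|Z| >= |z|) for a standard normal Z
    return erfc(abs(z) / sqrt(2))

for z in (0.5, 1.96, 3.0):
    print(f"z = {z:4.2f} -> p = {two_sided_p_value(z):.4f}")
```

A z-score above 9, like the one in the cell above, corresponds to a p-value far below any conventional threshold.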

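In ML:
Hypothesis tests help decide whether an observed difference, such as a gap in accuracy between two models, is larger than chance variation alone would explain.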

# 6. Information Theory

Information theory is foundational for:

- Decision Trees
- Cross-Entropy Loss
- KL Divergence
- Language Models

The key concept is entropy.

## Entropy

Entropy measures uncertainty:

H(X) = - Σ p(x) log₂ p(x)

High entropy → high uncertainty
Low entropy → predictable

In [13]:
def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

probs = [0.5, 0.5]
print("Entropy (fair coin):", entropy(probs))

probs = [0.9, 0.1]
print("Entropy (biased coin):", entropy(probs))

Entropy (fair coin): 1.0
Entropy (biased coin): 0.4689955935892812
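
In decision trees, entropy is used to score candidate splits: the information gain of a split is the parent node's entropy minus the size-weighted entropy of the child nodes. The sketch below illustrates the idea on toy labels; `label_entropy` and `information_gain` are hypothetical helpers, not taken from any particular library.

```python
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

def label_entropy(labels):
    # Entropy of the empirical class distribution in a list of labels
    n = len(labels)
    return entropy([labels.count(c) / n for c in set(labels)])

def information_gain(parent, children):
    # Parent entropy minus the size-weighted average entropy of the children
    n = len(parent)
    weighted = sum(len(child) / n * label_entropy(child) for child in children)
    return label_entropy(parent) - weighted

parent = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_split = [[1, 1, 1, 1], [0, 0, 0, 0]]   # each child is pure
useless_split = [[1, 1, 0, 0], [1, 1, 0, 0]]   # children look like the parent

print("Gain (perfect split):", information_gain(parent, perfect_split))
print("Gain (useless split):", information_gain(parent, useless_split))
```

A split that separates the classes completely recovers the full parent entropy, while a split that leaves the class mix unchanged gains nothing.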


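# 7. Cross-Entropy

Cross-entropy is the standard loss function for classification.

Cross-Entropy:

H(p, q) = - Σ p(x) log q(x)

The code below uses the natural logarithm, so the result is in nats; the entropy and KL divergence cells use log₂ and report bits.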

In neural networks:
- p = true labels
- q = predicted probabilities

In [14]:
def cross_entropy(true, pred):
    # 1e-9 keeps log() finite when a predicted probability is exactly 0
    return -np.sum(true * np.log(pred + 1e-9))

true = np.array([1, 0])
pred = np.array([0.8, 0.2])

print("Cross-Entropy:", cross_entropy(true, pred))

Cross-Entropy: 0.22314355006420974
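
With a one-hot label, only the true class contributes to the sum, so the loss reduces to -log of the probability the model assigns to the correct class. The sketch below shows how the loss grows as that probability shrinks. It is a pure-Python variant of `cross_entropy` written for illustration, and the `eps` parameter is an assumption that mirrors the 1e-9 guard above.

```python
from math import log

def cross_entropy(true, pred, eps=1e-9):
    # Natural log, so the loss is in nats; eps avoids log(0)
    return -sum(t * log(q + eps) for t, q in zip(true, pred))

true = [1, 0]
confidences = [0.99, 0.8, 0.5, 0.1]   # probability assigned to the true class
losses = [cross_entropy(true, [c, 1 - c]) for c in confidences]

for c, loss in zip(confidences, losses):
    print(f"q(true class) = {c:.2f} -> loss = {loss:.4f}")
```

Confident wrong predictions are penalized much more heavily than mildly uncertain ones, which is what pushes a classifier toward calibrated probabilities.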


# 8. KL Divergence

Measures how much a distribution q differs from a reference distribution p. It is not symmetric: in general KL(p || q) ≠ KL(q || p).

KL(p || q) = Σ p(x) log(p(x)/q(x))

Used in:
- Variational Autoencoders
- Regularization
- Probabilistic modeling

In [15]:
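def kl_divergence(p, q):
    # Result is in bits (log base 2). Assumes q[i] > 0 wherever p[i] > 0;
    # otherwise the divergence is infinite and this code raises ZeroDivisionError.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)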

p = [0.6, 0.4]
q = [0.5, 0.5]

print("KL Divergence:", kl_divergence(p, q))

KL Divergence: 0.029049405545331364
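
Entropy, cross-entropy, and KL divergence are tied together by the identity H(p, q) = H(p) + KL(p || q), provided all three use the same logarithm base. The sketch below checks the identity in bits and also shows the asymmetry of KL divergence. `cross_entropy_bits` is an illustrative helper introduced here so that every quantity uses log₂.

```python
from math import log2

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.4]
q = [0.5, 0.5]

forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)

print("KL(p || q):", forward)
print("KL(q || p):", reverse)
print("H(p, q):         ", cross_entropy_bits(p, q))
print("H(p) + KL(p||q): ", entropy(p) + forward)
```

Because H(p) does not depend on the model, minimizing cross-entropy with respect to q is equivalent to minimizing KL(p || q).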


# Discrete Math → Machine Learning

Combinatorics → Binomial models  
Statistics → Data summarization  
Covariance → PCA  
Entropy → Decision Trees  
Cross-Entropy → Neural Network Loss  
KL Divergence → Modern probabilistic models  

Together, discrete mathematics and statistics provide the structure for reasoning about data and model uncertainty.