# VARIANCE

Variance is a fundamental concept in statistics that measures the spread or dispersion of a set of data points around their mean (average). It provides an idea of how much the values in the dataset differ from the mean. A higher variance indicates that the data points are more spread out, while a lower variance indicates that they are closer to the mean.

### Understanding Variance

#### 1. **Population Variance**
The population variance is used when you have data for the entire population. It is denoted by $\sigma^2$.

#### 2. **Sample Variance**
The sample variance is used when you have data from a sample of the population. It is denoted by $s^2$.

### Formula for Variance

#### Population Variance
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
where:
- $\sigma^2$ is the population variance.
- $N$ is the number of data points in the population.
- $x_i$ is each individual data point.
- $\mu$ is the population mean.

#### Sample Variance
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
where:
- $s^2$ is the sample variance.
- $n$ is the number of data points in the sample.
- $x_i$ is each individual data point.
- $\bar{x}$ is the sample mean.

### Steps to Calculate Variance

1. **Find the Mean (Average):**
   - For population: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
   - For sample: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

2. **Subtract the Mean from Each Data Point and Square the Result:**
   - Calculate $(x_i - \mu)^2$ for each data point in a population.
   - Calculate $(x_i - \bar{x})^2$ for each data point in a sample.

3. **Sum All the Squared Differences:**
   - Sum the squared differences obtained in the previous step.

4. **Divide by the Number of Data Points (or One Less for a Sample):**
   - For population: Divide by $N$.
   - For sample: Divide by $n-1$.
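
The four steps above can be sketched as a small pure-Python function. The function name `variance`, its `sample` flag, and the list `values` are illustrative choices made here, not part of the original text.

```python
def variance(data, sample=True):
    """Return the variance of `data` by following the four steps above.

    `sample=True` divides by n - 1 (sample variance);
    `sample=False` divides by n (population variance).
    """
    n = len(data)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data points to compute the variance")

    # Step 1: find the mean.
    mean = sum(data) / n

    # Step 2: subtract the mean from each data point and square the result.
    squared_diffs = [(x - mean) ** 2 for x in data]

    # Step 3: sum all the squared differences.
    total = sum(squared_diffs)

    # Step 4: divide by n (population) or n - 1 (sample).
    divisor = n - 1 if sample else n
    return total / divisor


# Hypothetical data, chosen only so the arithmetic is easy to follow.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(values, sample=False))  # 4.0
print(variance(values, sample=True))   # 32 / 7, about 4.57
```

The two results differ only in the final divisor, which is why the gap between them shrinks as the number of data points grows.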

### Example Calculation

Suppose we have the following sample data: $5, 10, 15, 20, 25$.

1. **Calculate the Mean:**
   $$\bar{x} = \frac{5 + 10 + 15 + 20 + 25}{5} = \frac{75}{5} = 15$$

2. **Subtract the Mean and Square the Result:**
   - $(5 - 15)^2 = (-10)^2 = 100$
   - $(10 - 15)^2 = (-5)^2 = 25$
   - $(15 - 15)^2 = 0^2 = 0$
   - $(20 - 15)^2 = 5^2 = 25$
   - $(25 - 15)^2 = 10^2 = 100$

3. **Sum of Squared Differences:**
   $$100 + 25 + 0 + 25 + 100 = 250$$

4. **Calculate the Sample Variance:**
   $$s^2 = \frac{250}{5-1} = \frac{250}{4} = 62.5$$

So, the sample variance $s^2$ is 62.5.
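
The hand calculation can be cross-checked with NumPy. This is a minimal sketch: `ddof=1` asks `np.var` and `np.std` to divide by $n-1$, and the variable names are chosen here for illustration.

```python
import numpy as np

data = np.array([5, 10, 15, 20, 25])

# ddof=1 selects the sample formula (divide by n - 1).
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print(sample_var)            # 62.5
print(round(sample_std, 2))  # 7.91
```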

### Interpretation

Variance summarizes how widely the data points are spread around the mean. In this example, a variance of 62.5 corresponds to a standard deviation of $\sqrt{62.5} \approx 7.91$, so a typical data point lies roughly 8 units from the mean of 15. Because variance is expressed in squared units, the standard deviation (the square root of the variance) is usually the quantity used to interpret spread in the original units of the data.
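
One way to see the effect of squared units is to change the unit of measurement. The sketch below uses hypothetical heights (not data from the text) and converts them from centimeters to meters: the variance shrinks by a factor of $100^2$, while the standard deviation shrinks by a factor of 100, just like the data.

```python
import numpy as np

# Hypothetical heights in centimeters, for illustration only.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
heights_m = heights_cm / 100  # the same heights in meters

var_cm = np.var(heights_cm, ddof=1)  # units: cm^2
var_m = np.var(heights_m, ddof=1)    # units: m^2
std_cm = np.std(heights_cm, ddof=1)  # units: cm
std_m = np.std(heights_m, ddof=1)    # units: m

print(f"variance: {var_cm:.1f} cm^2 vs {var_m:.4f} m^2")
print(f"std dev:  {std_cm:.2f} cm   vs {std_m:.4f} m")
```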

### Key Points
- **Variance measures dispersion:** It shows how much the data points deviate from the mean.
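- **Population vs. Sample:** Population variance divides by $N$ (the total number of data points), while sample variance divides by $n-1$ (one less than the sample size). Dividing by $n-1$, known as Bessel's correction, offsets the downward bias that arises from measuring deviations around the sample mean instead of the true population mean.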
- **Squared units:** Variance is in squared units of the original data, which can make interpretation less intuitive.
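
The bias mentioned in the population vs. sample point can be observed in a small simulation. The sketch below is illustrative only: the true variance of 4, the sample size of 5, the number of trials, and the seed are all assumptions made here. It draws many small samples and averages the two estimators.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
true_var = 4.0                  # assumed population variance
n = 5                           # assumed (small) sample size
trials = 200_000

# Each row is one independent sample of size n.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))

# Average the estimate over all samples, once per divisor.
mean_div_n = samples.var(axis=1, ddof=0).mean()          # divide by n
mean_div_n_minus_1 = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print(f"true variance:            {true_var}")
print(f"average, dividing by n:   {mean_div_n:.3f}")
print(f"average, dividing by n-1: {mean_div_n_minus_1:.3f}")
```

Dividing by $n$ should land near $\frac{n-1}{n}\sigma^2 = 3.2$ on average, while dividing by $n-1$ should land near the true value of 4.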

By understanding variance, you gain insights into the variability and consistency of your data, which is essential for many statistical analyses and decision-making processes.

# WHY VARIANCE WHEN WE ALREADY HAVE STANDARD DEVIATION?

Variance and standard deviation are both measures of dispersion or spread in a set of data. While they are closely related, they serve slightly different purposes and are used in different contexts. Here's why both measures are important:

### Variance

1. **Mathematical Simplicity in Formulas:**
   - Variance ($\sigma^2$ for a population, $s^2$ for a sample) is the foundation for many statistical formulas and techniques. For example, in inferential statistics, the calculation of confidence intervals, hypothesis tests, and other advanced statistical methods often starts with variance.
   - Many theoretical properties in statistics are more conveniently expressed in terms of variance; the additive property below is a common example.

2. **Additive Property:**
   - Variance has an additive property that standard deviation does not have. For two independent random variables $X$ and $Y$, the variance of their sum is equal to the sum of their variances:
     $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$

3. **Quadratic Nature:**
   - The squaring of deviations in variance amplifies the effect of larger deviations. This is useful when large departures from the mean should weigh heavily, but it also makes variance sensitive to outliers: a single extreme value can dominate the result.
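
As a small worked check of the additive property in item 2, take two independent variables with standard deviations $\sigma_X = 1$ and $\sigma_Y = 2$ (values assumed here purely for illustration). The variances add, but the standard deviations do not:

$$
\text{Var}(X + Y) = 1^2 + 2^2 = 5, \qquad \sigma_{X+Y} = \sqrt{5} \approx 2.24 \neq \sigma_X + \sigma_Y = 3
$$

This is why derivations are usually carried out in terms of variance, with the square root taken only at the end.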

### Standard Deviation

1. **Interpretability:**
   - Standard deviation ($\sigma$ for a population, $s$ for a sample) is in the same units as the original data, making it easier to interpret. For example, if you are measuring heights in centimeters, the standard deviation will also be in centimeters.
   - It provides a direct sense of how much the data varies from the mean, making it more intuitive for reporting and explaining variability.

2. **Comparability:**
   - Standard deviation allows for easier comparison of variability between different datasets, especially when the units of measurement are the same. It is often used in descriptive statistics for this reason.

### Example to Illustrate the Difference

Consider a dataset of exam scores, treated here as a complete population: $60, 70, 80, 90, 100$.

1. **Calculate the Mean:**
   $$\mu = \frac{60 + 70 + 80 + 90 + 100}{5} = 80$$

2. **Calculate the Variance:**
   - Find the squared deviations from the mean:

     $$(60 - 80)^2 = 400$$
     $$(70 - 80)^2 = 100$$
     $$(80 - 80)^2 = 0$$
     $$(90 - 80)^2 = 100$$
     $$(100 - 80)^2 = 400$$
   - Average the squared deviations:
     $$\text{Variance} = \frac{400 + 100 + 0 + 100 + 400}{5} = 200$$

3. **Calculate the Standard Deviation:**
   - Take the square root of the variance:
     $$\text{Standard Deviation} = \sqrt{200} \approx 14.14$$
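
This example divides by 5, which is the population formula. A minimal NumPy sketch of the same calculation is shown below. `np.var` and `np.std` use `ddof=0` (divide by $N$) by default, so they reproduce the figures above, and passing `ddof=1` shows what the sample formula would give instead. The variable names are illustrative.

```python
import numpy as np

scores = np.array([60, 70, 80, 90, 100])

# Default ddof=0: population variance and standard deviation.
pop_var = np.var(scores)
pop_std = np.std(scores)

# ddof=1: sample variance, for comparison.
sample_var = np.var(scores, ddof=1)

print(pop_var)            # 200.0
print(round(pop_std, 2))  # 14.14
print(sample_var)         # 250.0
```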

### Why Both Are Needed

- **Theoretical and Practical Applications:** Variance is used in theoretical work and for deriving other statistical properties. Standard deviation is more practical for interpretation and communication.
- **Statistical Modeling:** Many statistical models, such as linear regression, use variance to estimate parameters and assess the fit of the model.
- **Decision Making:** In finance, for instance, the variance of returns is used in portfolio-risk calculations, while volatility is usually quoted as a standard deviation so that investors can read risk in the same units as the returns themselves.

In summary, variance and standard deviation are both crucial for understanding and describing data. Variance serves as a fundamental building block in many statistical methods, while standard deviation offers a more intuitive measure of spread that is easier to interpret and communicate.

In [1]:
import numpy as np

# Generate two independent random variables
np.random.seed(42)  # For reproducibility
X = np.random.normal(0, 1, 1000)  # Mean = 0, Standard Deviation = 1
Y = np.random.normal(0, 2, 1000)  # Mean = 0, Standard Deviation = 2

# Calculate individual variances
var_X = np.var(X, ddof=1)  # ddof=1 to get sample variance
var_Y = np.var(Y, ddof=1)

# Calculate the variance of their sum
Z = X + Y
var_Z = np.var(Z, ddof=1)

print(f"Variance of X: {var_X}")
print(f"Variance of Y: {var_Y}")
print(f"Variance of X + Y: {var_Z}")
print(f"Sum of variances of X and Y: {var_X + var_Y}")
print(f"Difference: {var_Z - (var_X + var_Y)}")


Variance of X: 0.9588638535851024
Variance of Y: 3.979660938600569
Variance of X + Y: 4.780686822064231
Sum of variances of X and Y: 4.938524792185671
Difference: -0.15783797012143985
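
The gap of about -0.158 in the output above does not contradict the additive property. For any two samples the general identity is $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$, and two finite samples drawn independently almost never have a sample covariance of exactly zero. The sketch below regenerates the same seeded data and checks that twice the sample covariance accounts for the gap. The names `cov_XY` and `gap` are introduced here for illustration.

```python
import numpy as np

# Same seed and distributions as the cell above.
np.random.seed(42)
X = np.random.normal(0, 1, 1000)
Y = np.random.normal(0, 2, 1000)

var_X = np.var(X, ddof=1)
var_Y = np.var(Y, ddof=1)
var_Z = np.var(X + Y, ddof=1)

# Off-diagonal entry of the 2x2 covariance matrix is the sample Cov(X, Y).
cov_XY = np.cov(X, Y, ddof=1)[0, 1]

gap = var_Z - (var_X + var_Y)
print(f"Gap between Var(X + Y) and Var(X) + Var(Y): {gap}")
print(f"2 * Cov(X, Y):                              {2 * cov_XY}")
```

The two printed values should agree up to floating-point rounding, so the gap is sampling noise in the covariance and not an error in the variances.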


In [2]:
import numpy as np

# Generate two independent random variables X and Y with a larger sample size
X = np.random.normal(loc=0, scale=1, size=1000000)  # Mean = 0, Std Dev = 1
Y = np.random.normal(loc=0, scale=2, size=1000000)  # Mean = 0, Std Dev = 2

# Calculate variances of X and Y using population variance (ddof=0)
var_X = np.var(X, ddof=0)
var_Y = np.var(Y, ddof=0)

# Calculate the variance of X + Y
Z = X + Y
var_Z = np.var(Z, ddof=0)

# Sum of the empirical variances of X and Y; for independent variables this should be close to var_Z
var_X_plus_Y = var_X + var_Y

# Print results
print(f"Variance of X: {var_X}")
print(f"Variance of Y: {var_Y}")
print(f"Variance of X + Y (calculated): {var_Z}")
print(f"Sum of variances of X and Y: {var_X_plus_Y}")
print(f"Difference: {var_Z - var_X_plus_Y}")


Variance of X: 1.000480489545111
Variance of Y: 4.010056629791974
Variance of X + Y (calculated): 5.01394464567955
Sum of variances of X and Y: 5.0105371193370845
Difference: 0.003407526342465772


In [1]:
import numpy as np

# Generate two independent random variables X and Y with a larger sample size
X = np.random.normal(loc=0, scale=1, size=1000000)  # Mean = 0, Std Dev = 1
Y = np.random.normal(loc=0, scale=2, size=1000000)  # Mean = 0, Std Dev = 2

# Calculate variances of X and Y using population variance (ddof=0)
var_X = np.var(X, ddof=0)
var_Y = np.var(Y, ddof=0)

# Calculate the variance of X + Y
Z = X + Y
var_Z = np.var(Z, ddof=0)

# Sum of the empirical variances of X and Y; for independent variables this should be close to var_Z
var_X_plus_Y = var_X + var_Y

# Print results
print(f"Variance of X: {var_X}")
print(f"Variance of Y: {var_Y}")
print(f"Variance of X + Y (calculated): {var_Z}")
print(f"Sum of variances of X and Y: {var_X_plus_Y}")
print(f"Difference: {var_Z - var_X_plus_Y}")


Variance of X: 0.9982339276247678
Variance of Y: 4.0141250409532265
Variance of X + Y (calculated): 5.012632327522757
Sum of variances of X and Y: 5.012358968577995
Difference: 0.00027335894476276934
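
Across the runs above, the gap shrinks from about 0.158 in magnitude with 1,000 points to about 0.003 and 0.0003 with 1,000,000 points. That is consistent with the gap being twice the sample covariance, whose typical size for independent samples falls off roughly like $2\sigma_X\sigma_Y/\sqrt{n} = 4/\sqrt{n}$ for the standard deviations used here. The sketch below, with an arbitrarily chosen seed and sample sizes, prints the gap next to that rough scale.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
gaps = {}

for n in (1_000, 100_000, 1_000_000):
    X = rng.normal(0, 1, n)  # Mean = 0, Std Dev = 1
    Y = rng.normal(0, 2, n)  # Mean = 0, Std Dev = 2

    # Gap between the variance of the sum and the sum of the variances.
    gap = np.var(X + Y) - (np.var(X) + np.var(Y))
    gaps[n] = gap

    rough_scale = 4 / np.sqrt(n)  # approximate typical size of the gap
    print(f"n={n:>9,}  gap={gap:+.5f}  4/sqrt(n)={rough_scale:.5f}")
```

Individual gaps can be positive or negative, and only their typical magnitude is expected to shrink as $n$ grows.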
