# Elementary Statistical Testing

**Prerequisites**

- Familiarity with the Descriptive Statistics notebook.
- Python libraries such as:
    - NumPy
    - SciPy
    - scikit-learn
    - statsmodels
    - pandas

Elementary statistical testing refers to the basic methods used to make decisions or draw conclusions from data. These tests help determine whether observed patterns or differences are real or occurred by chance. Here are some key aspects:

1. **Null and Alternative Hypothesis**: As in all hypothesis testing, we start with the null hypothesis (H₀), which assumes no effect or no difference, and the alternative hypothesis (H₁), which suggests a real effect or difference.

2. **Common Types of Tests**:
   - **Z-test**: Used to compare a sample mean to a population mean when the sample size is large (typically $n > 30$) and the population variance is known.
   - **T-test**: Used when the sample size is small or the population variance is unknown, typically to compare means.
   - **Chi-square test**: Used for categorical data to test relationships or compare observed frequencies to expected ones.

3. **Significance Level (α)**: Commonly set at 0.05, this is the threshold for deciding whether a result is statistically significant. If the p-value is below α, we reject the null hypothesis.

4. **P-value**: The p-value is the probability of observing data at least as extreme as what was actually observed, assuming the null hypothesis is true. A small p-value (less than α) suggests that the observed result is unlikely to be due to chance alone, leading us to reject the null hypothesis.

Elementary statistical testing is foundational in analyzing data, used in everything from comparing averages to examining relationships between variables.

### Hypothesis Testing
Hypothesis testing is a way to make decisions or draw conclusions about a population based on data from a sample. It helps us figure out if something we observe (like a difference or an effect) is real or just happened by chance.

1. **Null Hypothesis (H₀)**: This is the starting assumption that nothing has changed, or that there is no effect. We assume it is true until the data provide enough evidence against it.

2. **Alternative Hypothesis (H₁)**: This is what we are looking for evidence of: that there is a change, effect, or difference.

We collect data, calculate a test result, and then decide whether the evidence is strong enough to reject the null hypothesis and support the alternative hypothesis.

3. **Test Statistic**: A test statistic is calculated from the sample data and is used to assess the strength of the evidence against the null hypothesis. The type of test statistic depends on the hypothesis test being conducted (e.g., the t-statistic).

4. **Significance Level (α)**: The threshold against which the p-value is compared. It is the probability of rejecting the null hypothesis when it is actually true that we are willing to accept. Commonly used values are 5% or 1%.

5. **P-Value**: The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. If the p-value is less than the significance level α, we reject the null hypothesis.
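
The decision rule described above can be sketched in a few lines. This is only an illustration: the helper `decide` is a hypothetical name introduced here, and the sample values and the hypothesized mean of 3.0 are assumed for the example.

```python
from scipy import stats

def decide(p_value, alpha=0.05):
    """Return the decision for a test at significance level alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Illustrative sample; H0: the population mean equals 3.0
sample = [2.3, 2.9, 3.1, 2.8, 3.0]
t_stat, p_value = stats.ttest_1samp(sample, popmean=3.0)

decision = decide(p_value, alpha=0.05)
print(f"t = {t_stat:.3f}, p = {p_value:.3f} -> {decision}")
```

Note that "fail to reject H0" is not the same as proving H₀: it only means the sample does not provide strong enough evidence against it.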

### Types of Tests:
The most commonly used tests are tests for a mean. The goal is to make inferences about the population mean based on sample data. Commonly used tests for a mean are:

#### **1. T-Test/Z-Test**: The t-test and z-test are both statistical tests used to determine if there are significant differences between means.

Before using either test, we have to assume that the observations are independent and come from a normal distribution.

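Formula:

$$
t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}
$$

where $\bar{X}$ is the sample mean, $\mu_0$ is the hypothesized population mean, $s$ is the sample standard deviation, and $n$ is the number of observations in the sample.

The critical value of $t$ can be looked up in a t-table using the degrees of freedom (df):

$$ df = n - 1 $$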

Two-tailed critical values of the t-statistic for selected significance levels:

| Degrees of Freedom | 0.10  | 0.05  | 0.01  | 0.001  |
|--------------------|-------|-------|-------|--------|
| 1                  | 6.314 | 12.71 | 63.66 | 636.62 |
| 2                  | 2.920 | 4.303 | 9.925 | 31.599 |
| 3                  | 2.353 | 3.182 | 5.841 | 12.924 |
| 4                  | 2.132 | 2.776 | 4.604 | 8.610  |
| 5                  | 2.015 | 2.571 | 4.032 | 6.869  |
| 6                  | 1.943 | 2.447 | 3.707 | 5.959  |
| 7                  | 1.895 | 2.364 | 3.499 | 5.408  |
| 8                  | 1.860 | 2.306 | 3.355 | 5.041  |
| 9                  | 1.833 | 2.262 | 3.250 | 4.781  |
| 10                 | 1.812 | 2.228 | 3.169 | 4.587  |
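
As a minimal sketch of how the formula and the table fit together, the t-statistic can be computed by hand and compared with a critical value obtained from `scipy.stats.t.ppf` instead of a printed table. The sample, the hypothesized mean `mu_0`, and the choice of a two-tailed test at α = 0.05 are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats

data = np.array([2.3, 2.9, 3.1, 2.8, 3.0])  # illustrative sample
mu_0 = 3.0                                   # hypothesized mean (assumed)
alpha = 0.05

n = data.size
x_bar = data.mean()
s = data.std(ddof=1)                         # sample standard deviation
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

df = n - 1
t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-tailed critical value

print(f"t = {t_manual:.4f}, df = {df}, critical value = {t_crit:.3f}")
print("reject H0" if abs(t_manual) > t_crit else "fail to reject H0")
```

For df = 4 and α = 0.05 the computed critical value is about 2.776, the same number a t-table gives. Because |t| is smaller than that, the null hypothesis is not rejected for this sample.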

We use the t-test when the sample size is below 30 or when the population standard deviation is unknown; otherwise the z-test is used.
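
SciPy does not need a dedicated function for a one-sample z-test, since the statistic is simple to compute directly. The sketch below assumes a known population standard deviation `sigma` and uses synthetic data; all the numbers are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma = 2.0    # population standard deviation, assumed known
mu_0 = 10.0    # hypothesized population mean
sample = rng.normal(loc=10.5, scale=sigma, size=50)  # synthetic sample, n > 30

# z = (sample mean - mu_0) / (sigma / sqrt(n))
z = (sample.mean() - mu_0) / (sigma / np.sqrt(sample.size))
p_value = 2 * stats.norm.sf(abs(z))   # two-sided p-value from the standard normal

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The only difference from the t-test is that the known `sigma` replaces the sample standard deviation, and the normal distribution replaces the t-distribution.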

**Central Limit Theorem (CLT)**: The theorem states that the distribution of the sample mean approaches a normal distribution as the sample size becomes large, regardless of the shape of the original population's distribution (provided its variance is finite). This is what justifies the z-test for large samples. When the population standard deviation is unknown and has to be estimated from the sample, the t-test is used instead.
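
A small simulation can make the theorem concrete. The sketch below assumes an exponential population (strongly skewed, with mean 1 and standard deviation 1) and checks that the means of many samples of size `n` are centered on the population mean with a spread close to $\sigma / \sqrt{n}$. The population, the seed, and the sample sizes are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50              # observations per sample
n_samples = 10_000  # number of repeated samples

# Exponential population with mean 1 and standard deviation 1
draws = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = draws.mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f} (population mean: 1.0)")
print(f"std of sample means:  {sample_means.std(ddof=1):.3f} "
      f"(sigma / sqrt(n): {1 / np.sqrt(n):.3f})")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the individual observations are far from normal.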

**Types of the test depending on the sample**:

##### 1. **One-Sample t-Test**:
   - **Objective**: Test if the sample mean is significantly different from a known value (e.g., a population mean).
   - **When to Use**: When you have one sample and want to compare its mean to a known value.
   - **Code**:


In [None]:
import numpy as np
from scipy import stats

# Sample data
data = [2.3, 2.9, 3.1, 2.8, 3.0]

# Hypothesized population mean
population_mean = 3.0

# Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(data, population_mean)

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")


t-statistic: -1.2923246855119326
p-value: 0.2658472860753787



##### 2. **Independent Samples t-Test**:
   - **Objective**: Compare the means of two independent groups to determine if they are significantly different.
   - **When to Use**: When you have two separate groups and want to compare their means.


In [None]:
import numpy as np
from scipy import stats

# Sample data for two groups
group1 = [2.3, 2.9, 3.1, 2.8, 3.0]
group2 = [3.2, 3.5, 3.6, 3.3, 3.7]

# Perform independent two-sample t-test
t_stat, p_value = stats.ttest_ind(group1, group2)

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")


t-statistic: -3.8247315498700623
p-value: 0.00505561083180125



##### 3. **Paired Samples t-Test (Dependent)**:
   - **Objective**: Compare means from the same group at different times or under different conditions.
   - **When to Use**: When you have paired or matched observations, such as before-and-after measurements on the same subjects.


In [None]:
import numpy as np
from scipy import stats

# Sample data before and after treatment
before = [2.3, 2.9, 3.1, 2.8, 3.0]
after = [2.8, 3.2, 3.4, 3.1, 3.3]

# Perform paired sample t-test
t_stat, p_value = stats.ttest_rel(before, after)

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")


t-statistic: -8.500000000000004
p-value: 0.0010505780707892483



#### **2. ANOVA Test**: An extension of the t-test to more than two groups
ANOVA (Analysis of Variance) is a statistical test used to determine if there are significant differences among the means of three or more groups. It helps assess whether at least one group mean is different from the others, based on sample data.

**Key Concepts**

- **Purpose**: To compare the means of multiple groups to see if there are statistically significant differences between them.
- **Types**:
  - **One-Way ANOVA**: Tests differences among the means of three or more independent groups based on one factor (e.g., comparing the effectiveness of three different teaching methods).
  - **Two-Way ANOVA**: Tests differences among means based on two factors and their interaction (e.g., examining the effects of teaching method and class size on student performance).

**Assumptions**
1. **Independence**: Observations in each group should be independent of each other.
2. **Normality**: Data in each group should be approximately normally distributed (especially important for small sample sizes).
3. **Homogeneity of Variances**: The variances among the groups should be roughly equal.

**Procedure**
1. **Formulate Hypotheses**:
   - **Null Hypothesis (H₀)**: All group means are equal.
   - **Alternative Hypothesis (H₁)**: At least one group mean is different.

2. **Calculate F-Statistic**: Compares the variance between groups to the variance within groups.

3. **Compare F-Statistic to Critical Value**: Determine if the observed F-Statistic is significantly large compared to a critical value from the F-distribution.

ANOVA is widely used in experimental designs and research to evaluate the effects of different treatments or conditions.
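
A one-way ANOVA can be run with `scipy.stats.f_oneway`. The sketch below uses hypothetical scores for three teaching methods and also recomputes the F-statistic by hand as the ratio of between-group to within-group variance, to show what the function is doing.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three teaching methods
method_a = np.array([85, 86, 88, 75, 78])
method_b = np.array([80, 82, 84, 70, 72])
method_c = np.array([90, 92, 94, 85, 88])
groups = [method_a, method_b, method_c]

f_stat, p_value = stats.f_oneway(*groups)

# Same F-statistic by hand: between-group vs. within-group variance
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

print(f"F = {f_stat:.4f}, p-value = {p_value:.4f}")
print(f"F by hand = {f_manual:.4f}")
```

A significant result only says that at least one mean differs; identifying which groups differ requires a follow-up (post hoc) comparison.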

#### **Non-parametric Tests**
Non-parametric tests are statistical methods used when data doesn't meet the assumptions of parametric tests, such as normal distribution or equal variances. Many of them rely on the ranks of the data rather than the raw data values. These tests are well suited to categorical, ordinal, skewed, or non-normally distributed data.

Common tests used are:

### **1. Chi-Square**
The Chi-Square Test is a non-parametric test used to assess the association between categorical variables or to compare observed frequencies with expected frequencies. There are two primary types of Chi-Square Tests:

#### **1. Chi-Square Test of Independence**

**Purpose**: To determine if there is a significant association between two categorical variables.

**Example**: Testing if there is a relationship between gender (male/female) and voting preference (yes/no).

**Steps**:

1. **Create a Contingency Table**: Construct a table that displays the frequency counts for each combination of categories for the two variables.

2. **Calculate Expected Frequencies**: For each cell in the table, calculate the expected frequency under the assumption that the variables are independent. The formula is:
   $$
   E_{ij} = \frac{(R_i \cdot C_j)}{N}
   $$
   Where:
   - $E_{ij}$ = Expected frequency for cell $(i, j)$
   - $R_i$ = Total frequency for row $i$
   - $C_j$ = Total frequency for column $j$
   - $N$ = Total number of observations

3. **Compute the Chi-Square Statistic**:
The Chi-Square statistic is calculated as follows:

$$
\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

where:
- $O_{ij}$ = Observed frequency for cell $(i, j)$
- $E_{ij}$ = Expected frequency for cell $(i, j)$


4. **Determine Degrees of Freedom**: Calculate the degrees of freedom using:
   $$
   \text{df} = (r - 1) \times (c - 1)
   $$
   Where:
   - r = Number of rows
   - c = Number of columns

5. **Compare to Critical Value**: Compare the calculated $\chi^2$ value to the critical value from the Chi-Square distribution table with the appropriate degrees of freedom or compute the p-value.

6. **Draw Conclusions**: If the p-value is less than the significance level (e.g., 0.05), reject the null hypothesis and conclude that there is a significant association between the variables.


In [None]:
import numpy as np
from scipy import stats

# Example contingency table: rows are gender (Male, Female) and columns are voting preferences (Yes, No)
data = np.array([[30, 10],   # Male
                 [20, 40]])  # Female

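# Perform Chi-Square Test of Independence
# Note: for 2x2 tables, chi2_contingency applies Yates' continuity correction
# by default (correction=True), so the statistic is smaller than the plain formula gives
chi2_stat, p_value, dof, expected = stats.chi2_contingency(data)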

print("Chi-Square Statistic:", chi2_stat)
print("p-Value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies Table:")
print(expected)


Chi-Square Statistic: 15.041666666666666
p-Value: 0.00010516355403363098
Degrees of Freedom: 1
Expected Frequencies Table:
[[20. 20.]
 [30. 30.]]
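
The steps listed above can also be followed by hand with NumPy. The sketch below recomputes the expected frequencies and the uncorrected $\chi^2$ statistic for the same table, then checks the result against `chi2_contingency` called with `correction=False`.

```python
import numpy as np
from scipy import stats

observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

expected = row_totals * col_totals / n          # E_ij = R_i * C_j / N
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

# correction=False disables Yates' continuity correction
chi2_scipy, p_scipy, _, _ = stats.chi2_contingency(observed, correction=False)

print(f"chi2 by hand: {chi2_manual:.4f}, p-value: {p_manual:.6f}")
print(f"chi2 (SciPy, no correction): {chi2_scipy:.4f}")
```

The hand-computed value is about 16.667, slightly larger than the 15.04 reported above, because that run used SciPy's default continuity correction for 2x2 tables.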



#### **2. Chi-Square Test of Goodness of Fit**

**Purpose**: To determine if a sample distribution matches an expected distribution.

**Example**: Testing if the observed distribution of colors in a bag of candies fits the expected proportions.

**Steps**:

1. **State the Expected Frequencies**: Define the expected proportions or frequencies based on a theoretical distribution.

2. **Compute the Chi-Square Statistic**:
   $$
   \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
   $$
   Where:
   - $O_i$ = Observed frequency for category $i$
   - $E_i$ = Expected frequency for category $i$
3. **Determine Degrees of Freedom**: Calculate the degrees of freedom using:
   $$
   \text{df} = k - 1
   $$
   Where:
   - $k$ = Number of categories

4. **Compare to Critical Value**: Compare the calculated $\chi^2$ value to the critical value from the Chi-Square distribution table with the appropriate degrees of freedom or compute the p-value.

5. **Draw Conclusions**: If the p-value is less than the significance level, reject the null hypothesis and conclude that the observed distribution differs significantly from the expected distribution.

The Chi-Square Test is widely used in categorical data analysis and helps determine if there are significant deviations from what is expected or if there is an association between categorical variables.

In [None]:
import numpy as np
from scipy import stats

# Observed frequencies of candies (e.g., Red, Blue, Green, Yellow)
observed = np.array([20, 30, 10, 10])

# Expected frequencies (assuming equal distribution)
expected = np.array([17.5, 17.5, 17.5, 17.5])

# Perform Chi-Square Test of Goodness of Fit
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

print("Chi-Square Statistic:", chi2_stat)
print("p-Value:", p_value)


Chi-Square Statistic: 15.714285714285715
p-Value: 0.0012976436580202741


The following tests use rank sums instead of differences in sample means. The advantage of working with ranks rather than means is that the data need not be normally distributed.


### **2. Mann-Whitney U Test**

- **Purpose**: To compare differences between two independent groups when the data is not normally distributed.
- **Example**: Comparing scores from two different teaching methods.
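- **Formula**:
  $$
  U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}
  $$
  Where $R_1$ is the sum of ranks for the first group (after ranking the pooled data) and $n_1$ is the number of observations in that group. $U_2$ is defined the same way for the second group, and $U_1 + U_2 = n_1 n_2$.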


In [None]:
from scipy import stats

# Sample data for two independent groups
group1 = [23, 45, 67, 89, 12]
group2 = [34, 56, 78, 90, 23]

# Perform the Mann-Whitney U test
u_statistic, p_value = stats.mannwhitneyu(group1, group2, alternative='two-sided')

# Print results
print(f"Mann-Whitney U Statistic: {u_statistic}")
print(f"p-Value: {p_value}")


Mann-Whitney U Statistic: 9.5
p-Value: 0.6004018480969686



### **3. Wilcoxon Signed-Rank Test**

- **Purpose**: To compare two related samples or paired observations.
- **Example**: Comparing pre-treatment and post-treatment scores from the same subjects.
- **Formula**:
  $$
  W = \text{min}(W^+, W^-)
  $$
  Where $W^+$ and $W^-$ are the sums of ranks for positive and negative differences, respectively.


In [None]:
from scipy import stats

# Sample data: paired observations before and after treatment
before = [15, 20, 18, 24, 29]
after = [21, 25, 18, 23, 28]

# Perform the Wilcoxon Signed-Rank Test
# (by default, pairs with a zero difference are discarded before ranking)
statistic, p_value = stats.wilcoxon(before, after)

# Print results
print(f"Wilcoxon Signed-Rank Statistic: {statistic}")
print(f"p-Value: {p_value}")


Wilcoxon Signed-Rank Statistic: 3.0
p-Value: 0.4614509878333607
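
The statistic can be reproduced step by step: take the paired differences, drop the zeros, rank the absolute values, and sum the ranks by sign. The sketch below does this for the same data, assuming the default treatment of zero differences (they are discarded).

```python
import numpy as np
from scipy import stats

before = np.array([15, 20, 18, 24, 29])
after = np.array([21, 25, 18, 23, 28])

diff = before - after
diff = diff[diff != 0]                 # zero differences are dropped
ranks = stats.rankdata(np.abs(diff))   # rank the absolute differences

w_plus = ranks[diff > 0].sum()         # rank sum of positive differences
w_minus = ranks[diff < 0].sum()        # rank sum of negative differences
w = min(w_plus, w_minus)

print(f"W+ = {w_plus}, W- = {w_minus}, W = {w}")
```

Here the two differences of +1 are tied and share rank 1.5, so W⁺ = 3 and W⁻ = 7, giving W = 3 as in the output above.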



### **4. Kruskal-Wallis H Test**

- **Purpose**: To compare medians among three or more independent groups. It is best suited to data that is not normally distributed.
- **Example**: Comparing median salaries among different job positions.
- **Formula**:
  $$
  H = \frac{12}{N(N + 1)} \sum \frac{R_j^2}{n_j} - 3(N + 1)
  $$
  Where $R_j$ is the sum of ranks for group $j$, $n_j$ is the number of observations in group $j$, and $N$ is the total number of observations.
  
  The test is similar to one-way ANOVA, but instead of using the group means, all observations are pooled and ranked, and the statistic is computed from the rank sums of the groups. This makes it usable when the ANOVA assumptions are violated, for example when the data is not normally distributed.



In [1]:
import scipy.stats as stats

# Sample data for three independent groups
group1 = [10, 20, 30, 40, 50]
group2 = [15, 25, 35, 45, 55]
group3 = [12, 22, 32, 42, 52]

# Perform Kruskal-Wallis H Test
statistic, p_value = stats.kruskal(group1, group2, group3)

# Print results
print(f"Kruskal-Wallis H Statistic: {statistic}")
print(f"p-value: {p_value}")

Kruskal-Wallis H Statistic: 0.5
p-value: 0.7788007830714049
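
The H statistic can be checked against the formula by ranking the pooled data and summing the ranks per group. The sketch below uses the same three groups. It applies no tie correction, which is fine here because the data contain no ties.

```python
import numpy as np
from scipy import stats

groups = [np.array([10, 20, 30, 40, 50]),
          np.array([15, 25, 35, 45, 55]),
          np.array([12, 22, 32, 42, 52])]

pooled = np.concatenate(groups)
ranks = stats.rankdata(pooled)
n_total = pooled.size

# Split the pooled ranks back into their groups and sum each
split_points = np.cumsum([g.size for g in groups])[:-1]
rank_sums = [r.sum() for r in np.split(ranks, split_points)]

h = 12 / (n_total * (n_total + 1)) * sum(
    r ** 2 / g.size for r, g in zip(rank_sums, groups)
) - 3 * (n_total + 1)
p_value = stats.chi2.sf(h, len(groups) - 1)   # H is compared to chi-square with k - 1 df

print(f"rank sums = {rank_sums}, H = {h:.4f}, p-value = {p_value:.4f}")
```

The rank sums are 35, 45, and 40, which gives H = 0.5, in agreement with `stats.kruskal`.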



### **5. Friedman Test**

- **Purpose**: To compare medians among three or more related groups.
- **Example**: Comparing ratings of different products by the same group of people over time.
- **Formula**:
  $$
  \chi^2_F = \frac{12}{n k (k + 1)} \sum_{j=1}^k R_j^2 - 3 n (k + 1)
  $$
  Where $R_j$ is the sum of ranks for treatment $j$ (ranks are assigned within each subject), $n$ is the number of subjects, and $k$ is the number of treatments.


In [2]:
import scipy.stats as stats

# Sample data: three sets of paired data (e.g., measurements at different times or treatments)
group1 = [10, 20, 30]
group2 = [15, 25, 35]
group3 = [12, 22, 32]

# Perform Friedman Test
statistic, p_value = stats.friedmanchisquare(group1, group2, group3)

# Print results
print(f"Friedman Statistic: {statistic}")
print(f"p-value: {p_value}")

Friedman Statistic: 6.0
p-value: 0.04978706836786395



### **6. Spearman’s Rank Correlation**

- **Purpose**: To measure the strength and direction of the monotonic relationship between two variables.
- **Example**: Assessing the relationship between the ranks of students' performances in two subjects.
- **Formula**:
  $$
  \rho = 1 - \frac{6 \sum d_i^2}{n (n^2 - 1)}
  $$
  Where $d_i$ is the difference between the ranks of each pair of observations, and $n$ is the number of observations. This shortcut formula is exact only when there are no tied ranks.


In [3]:
import scipy.stats as stats

# Sample data for two variables
x = [10, 20, 30, 40, 50]
y = [12, 24, 33, 47, 55]

# Calculate Spearman's correlation
corr, p_value = stats.spearmanr(x, y)

# Print results
print(f"Spearman's Correlation Coefficient: {corr}")
print(f"p-value: {p_value}")


Spearman's Correlation Coefficient: 0.9999999999999999
p-value: 1.4042654220543672e-24
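
The rank-difference formula is easy to apply directly. In the sketch below, `spearman_from_ranks` is a hypothetical helper written for this illustration, and `y_swapped` is an assumed variant of the data with one pair out of order, added so that the coefficient is not simply 1.

```python
import numpy as np
from scipy import stats

def spearman_from_ranks(x, y):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    d = stats.rankdata(x) - stats.rankdata(y)
    n = len(x)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

x = [10, 20, 30, 40, 50]
y = [12, 24, 33, 47, 55]
y_swapped = [12, 33, 24, 47, 55]   # hypothetical variant with one pair out of order

rho_perfect = spearman_from_ranks(x, y)
rho_swapped = spearman_from_ranks(x, y_swapped)
rho_scipy, _ = stats.spearmanr(x, y_swapped)

print(f"rho (original data): {rho_perfect}")
print(f"rho (one swap): {rho_swapped:.2f}, SciPy: {rho_scipy:.2f}")
```

For the original data every rank difference is zero, so ρ = 1. With the swapped pair, $\sum d_i^2 = 2$ and ρ = 1 − 12/120 = 0.9.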


### **7. Kendall’s Tau-a**

- **Purpose**: To measure the strength and direction of association between two ranked variables, providing a non-parametric alternative to Pearson's correlation and a rank-based counterpart to Spearman's correlation.
- **Example**: Assessing the relationship between rankings of two different judges on a set of performances.
- **Formula**:
  $$
  \tau_a = \frac{C - D}{\frac{1}{2} n(n - 1)}
  $$
  Where $C$ is the number of concordant pairs, $D$ is the number of discordant pairs, and $n$ is the number of observations. When there are no ties, $C + D = \frac{1}{2} n(n - 1)$, so $\tau_a = \frac{C - D}{C + D}$.
  
Kendall's tau is particularly useful for ordinal data or non-normally distributed continuous data, and it is robust against outliers. Tau-a makes no adjustment for ties; the tau-b variant does.

In [4]:
import numpy as np
from scipy.stats import kendalltau

# Example data
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 6, 7, 8, 7])

# Calculate Kendall's Tau (SciPy computes the tau-b variant by default, which adjusts for ties)
tau, p_value = kendalltau(x, y)

print(f"Kendall's Tau-b: {tau}, p-value: {p_value}")

Kendall's Tau-b: 0.7378647873726218, p-value: 0.07697417298126674
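
SciPy's `kendalltau` does not offer a tau-a variant, so the sketch below counts concordant and discordant pairs by hand to obtain tau-a from the formula, and compares it with the tau-b value SciPy returns by default. The data are the same as above.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 6, 7, 8, 7])

concordant = discordant = 0
for i, j in combinations(range(len(x)), 2):
    sign = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    if sign > 0:
        concordant += 1
    elif sign < 0:
        discordant += 1
    # sign == 0: the pair is tied in x or y and counts as neither

n = len(x)
tau_a = (concordant - discordant) / (n * (n - 1) / 2)
tau_b, _ = kendalltau(x, y)   # SciPy's default variant is tau-b

print(f"C = {concordant}, D = {discordant}")
print(f"tau-a = {tau_a:.4f}, tau-b = {tau_b:.4f}")
```

With 8 concordant pairs, 1 discordant pair, and 1 pair tied in `y`, tau-a is 7/10 = 0.7, while tau-b is about 0.738 because its denominator is reduced to account for the tie.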
