## Types of Hypothesis Testing

## Chi-Square Test

The chi-square (χ²) test is used to check whether there is a relationship between two categorical variables. Handling data often involves testing hypotheses to extract useful information. In categorical analysis, chi-square tests determine whether observed frequencies differ significantly from the frequencies expected under a given hypothesis.

When that hypothesis is independence, the χ² test helps determine whether the two variables are associated with each other.

This test is widely used in market research, healthcare, social sciences, and more to analyze categorical relationships.

For example, consider two variables: people's favorite color (Entity 1) and their ice cream preference (Entity 2).

* Null Hypothesis (H₀): Favorite color and ice cream preference are independent (no relationship).
* Alternative Hypothesis (H₁): They are dependent (a relationship exists).

By comparing the observed survey data with the frequencies expected if no relationship existed, the Chi-Square test calculates a test statistic (χ²). If this value is large enough, we reject H₀ and conclude that favorite color and ice cream choice are associated. The test shows association only; it does not show that one preference causes the other.

### Formula For Chi-Square Test
$$
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
$$

Symbols are broken down as follows:

* $O_i$: Observed frequency
* $E_i$: Expected frequency

### Categorical Variables

Categorical variables classify data into distinct, non-numerical groups (e.g., colors, fruit types).

#### Key Characteristics:

* Distinct Groups: No overlap (e.g., hair color: blonde, brunette). <br>
* Non-Numerical: No inherent order (e.g., "apple" ≠ "orange" numerically). <br>
* Limited Options: Fixed categories (e.g., traffic lights: red, yellow, green).<br>

#### Example:

<i> "Do you prefer tea, coffee, or juice?" → Categories: tea/coffee/juice. </i>

### Steps for Chi-Square Test

The steps below walk through a chi-square test using an example: testing whether sex is associated with the flavor of ice cream a person chooses.

#### Step 1: Define Hypothesis

* Null Hypothesis (H₀): Sex and ice cream flavor are independent, so the observed frequencies match the expected distribution.
* Alternative Hypothesis (H₁): Sex and ice cream flavor are associated, so the observed frequencies do not match the expected distribution.

#### Step 2: Gather and Organize Data

Gather information about the two categorical variables:

Before performing a chi-square test, you need data on the two categorical variables you wish to study.

Here, that means recording each person’s sex (male or female) and favorite flavor (e.g., chocolate, vanilla, strawberry). Once this information is collected, it can be arranged in a contingency table.

Suppose the working hypothesis is that men prefer vanilla while women prefer chocolate. We therefore count how many respondents of each sex chose each flavor.

**Here's an example of what a contingency table might look like:**

<table> 
    <tr> <th>    </th> <th> Chocolate </th> <th> Vanilla </th> <th> Strawberry </th></tr> 
    <tr> <th> Male </th> <td> 20 </td> <td> 15 </td> <td> 10 </td> </tr>
    <tr> <th> Female </th> <td> 25 </td> <td> 20 </td> <td> 30 </td> </tr>
    
</table>
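
As a minimal sketch, the contingency table above can be held in plain Python and its marginal totals computed with the standard library alone. The variable names (`observed`, `row_totals`, and so on) are illustrative choices, not part of the original example.

```python
# Observed counts from the contingency table (rows: sex, columns: flavor).
flavors = ["Chocolate", "Vanilla", "Strawberry"]
observed = {
    "Male": [20, 15, 10],
    "Female": [25, 20, 30],
}

# Marginal totals are needed later for the expected frequencies.
row_totals = {sex: sum(counts) for sex, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

print(row_totals)                      # {'Male': 45, 'Female': 75}
print(dict(zip(flavors, col_totals)))  # {'Chocolate': 45, 'Vanilla': 35, 'Strawberry': 40}
print(grand_total)                     # 120
```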

#### Step 3: Calculate Expected Frequencies
* **Observed Frequency:** The observed frequencies are the counts in the contingency table above.
* **Expected Frequency:** For any cell, the expected frequency is the count we would expect if the two variables were independent.
* **Expected Frequency Calculation:** Multiply the cell's row total by its column total, then divide by the total number of observations in the table.

$$
E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}}
$$

Here the row totals are 45 (male) and 75 (female), the column totals are 45 (chocolate), 35 (vanilla), and 40 (strawberry), and the grand total is 120.

* Male and Chocolate: $\frac{45 \times 45}{120} = 16.875$
* Male and Vanilla: $\frac{45 \times 35}{120} = 13.125$

Summarizing,
* Male: Chocolate: 16.875, Vanilla: 13.125, Strawberry: 15.0
* Female: Chocolate: 28.125, Vanilla: 21.875, Strawberry: 25.0
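
The expected-frequency rule can be sketched in a few lines of Python. This is an illustrative sketch using only the counts from the contingency table; the variable names are assumptions made for the example.

```python
observed = [
    [20, 15, 10],  # Male: chocolate, vanilla, strawberry
    [25, 20, 30],  # Female: chocolate, vanilla, strawberry
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E_ij = (row total i) * (column total j) / grand total
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for label, row in zip(["Male", "Female"], expected):
    print(label, row)
# Male [16.875, 13.125, 15.0]
# Female [28.125, 21.875, 25.0]
```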

#### Step 4: Perform Chi-Square Test
Use the Chi-Square formula, summing over every cell of the table:
$$
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
$$
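
The sum can be sketched as a short function. This is a minimal illustration rather than a reference implementation; `chi_square_statistic` is a hypothetical helper name, and it computes the expected counts under independence before summing over every cell.

```python
def chi_square_statistic(observed):
    """Return the chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count for cell (i, j) if the variables are independent.
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    return statistic


observed = [[20, 15, 10], [25, 20, 30]]
chi2 = chi_square_statistic(observed)
print(round(chi2, 2))
```

In practice a library routine such as `scipy.stats.chi2_contingency` is commonly used for this; it returns the statistic together with the p-value, degrees of freedom, and expected frequencies.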

#### Step 5: Determine Degrees of Freedom (df)

df = (number of rows - 1) × (number of columns - 1)

$$
df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2
$$

#### Step 6: Find p-value

* Compare χ² to the Chi-Square Distribution Table for the given df.
* For the table above, χ² = 0.579 + 0.268 + 1.667 + 0.347 + 0.161 + 1.000 ≈ 4.02 with df = 2. The critical value at α = 0.05 is 5.991. Since 4.02 < 5.991, p > 0.05 (p ≈ 0.13).
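
For df = 2 specifically, the chi-square distribution coincides with an exponential distribution with mean 2, so the p-value and the critical value have closed forms and no table is needed. The sketch below relies on that special case only; the function names are hypothetical, and for other degrees of freedom a statistics library would be the usual route.

```python
import math


def chi2_p_value_df2(statistic):
    """Upper-tail probability of a chi-square variable; valid for df = 2 only."""
    return math.exp(-statistic / 2)


def chi2_critical_df2(alpha):
    """Critical value for df = 2 only, found by inverting the p-value formula."""
    return -2 * math.log(alpha)


observed = [[20, 15, 10], [25, 20, 30]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e

alpha = 0.05
critical = chi2_critical_df2(alpha)
p_value = chi2_p_value_df2(chi2)

print(f"critical value = {critical:.3f}")  # critical value = 5.991
print(f"reject H0: {chi2 > critical}")     # reject H0: False
```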

#### Step 7: Interpret Results
If the p-value is less than the chosen significance level, commonly denoted by α (e.g., 0.05), we reject the null hypothesis. This means there is statistically significant evidence that the categorical variables are associated; it does not by itself indicate how strong the association is.

When the p-value is above α, we cannot reject the null hypothesis; there is insufficient evidence of a relationship between the variables.

In this example, no significant evidence supports the claim that men prefer vanilla or women prefer chocolate (p > 0.05).





## ANOVA

**ANOVA (Analysis of Variance)** is a statistical method used to determine whether there are significant differences between the means of three or more independent groups by analyzing the variability within each group and between the groups. It helps in testing the null hypothesis that all group means are equal.

It does this by comparing two types of variation, whose ratio forms the F-statistic:

1. Differences BETWEEN groups (how much group averages differ from each other).
2. Differences WITHIN groups (how much individuals in the same group vary naturally).

If the between-group differences are significantly larger than the within-group variation, ANOVA tells us that at least one group mean is truly different. Otherwise, the observed differences are consistent with random chance.

For example:

<i> Compare test scores of students taught with 3 methods (Traditional, Online, Hybrid). ANOVA is used to determine if at least one teaching method yields significantly different average scores. </i>

### ANOVA Formula
The **ANOVA formula** is made up of several parts: sums of squares, degrees of freedom, mean squares, and the F-statistic. The best way to tackle an ANOVA test problem is to organize these quantities inside an ANOVA table.

### Assumptions of ANOVA
These must be validated before analysis:

* Independence: Observations are randomly sampled, and groups are independent.
* Normality: Residuals (errors) are approximately normally distributed (checked via Q-Q plots or the Shapiro-Wilk test).
* Homoscedasticity: Equal variances across groups (verified using Levene’s or Bartlett’s test).

ANOVA is robust to minor violations of normality and homoscedasticity when sample sizes are balanced.

### Calculating ANOVA
Let's explore calculating ANOVA for the scenario:

Compare plant growth under 3 fertilizers (A, B, C):

* Fertilizer A: [10, 11, 12]
* Fertilizer B: [7, 8, 9]
* Fertilizer C: [4, 5, 6]
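1. State Hypotheses

Null Hypothesis ($H_0$): $\mu_A = \mu_B = \mu_C$

Alternative Hypothesis ($H_a$): At least one $\mu$ differs.

2. Calculate Group Means and Grand Mean

Group Means: $\bar{x}_A = 11$, $\bar{x}_B = 8$, $\bar{x}_C = 5$

Grand Mean: $\bar{x} = \frac{72}{9} = 8$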

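3. Compute Sum of Squares (SS)

SSB (Sum of Squares Between Groups): Accounts for variation due to the treatment or independent variable.

$$
SSB = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2
$$

SSE (Sum of Squares Error or Within Groups, also written SSW): Accounts for variation within groups (random error or residuals).

$$
SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2
$$

SST (Total Sum of Squares): Accounts for total variation from the overall mean.

$$
SST = SSB + SSE
$$

Here $k$ is the number of groups, $n_j$ is the size of group $j$, $\bar{x}_j$ is its mean, and $\bar{x}$ is the grand mean. For the fertilizer data, the group means are 11, 8, and 5, and the grand mean is 8:

$$
SSB = 3(11 - 8)^2 + 3(8 - 8)^2 + 3(5 - 8)^2 = 3(9) + 3(0) + 3(9) = 54
$$

SSE, group by group:

$$
\text{Fertilizer A: } (10 - 11)^2 + (11 - 11)^2 + (12 - 11)^2 = 1 + 0 + 1 = 2
$$

$$
\text{Fertilizer B: } (7 - 8)^2 + (8 - 8)^2 + (9 - 8)^2 = 1 + 0 + 1 = 2
$$

$$
\text{Fertilizer C: } (4 - 5)^2 + (5 - 5)^2 + (6 - 5)^2 = 1 + 0 + 1 = 2
$$
$$
SSE = 2 + 2 + 2 = 6
$$
$$
SST = 54 + 6 = 60
$$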
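
The sums of squares can be checked with a short Python sketch. It uses only the fertilizer measurements listed earlier; the variable names are illustrative, and SST is computed directly from the grand mean so that it can be compared with SSB + SSE.

```python
from statistics import mean

groups = {
    "A": [10, 11, 12],
    "B": [7, 8, 9],
    "C": [4, 5, 6],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)
group_means = {name: mean(values) for name, values in groups.items()}

# Between-group variation: group size times squared distance of each group mean
# from the grand mean.
ssb = sum(len(v) * (group_means[g] - grand_mean) ** 2 for g, v in groups.items())

# Within-group variation: squared distance of each value from its own group mean.
sse = sum((x - group_means[g]) ** 2 for g, v in groups.items() for x in v)

# Total variation: squared distance of each value from the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)

print(f"SSB={ssb:g}, SSE={sse:g}, SST={sst:g}")  # SSB=54, SSE=6, SST=60
```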

4. Calculate Degrees of Freedom (df)

$$
df_1 \text{ (Between Groups)} = k - 1, \text{ where } k \text{ is the number of groups}
$$
$$
df_2 \text{ (Within Groups)} = N - k, \text{ where } N \text{ is the total number of observations}
$$
$$
df_3 \text{ (Total)} = N - 1
$$

$$
df_1 = 3 - 1 = 2, \quad df_2 = 9 - 3 = 6, \quad df_3 = 9 - 1 = 8
$$

5. Calculate Mean Squares (MS)

MSB (Mean Square Between Groups) = SSB / df1 = 54 / 2 = 27

MSE (Mean Square Error) = SSE / df2 = 6 / 6 = 1

6. F-statistic

The F-statistic is calculated as the ratio of MSB to MSE:

$$
F = \frac{MSB}{MSE} = \frac{27}{1} = 27
$$


7. P-value

The p-value is used to decide whether differences among groups are statistically significant. When the p-value is smaller than the significance level (α), the null hypothesis is rejected. Equivalently, if $F > F_{critical}$, then p < α and the null hypothesis is rejected.

Use the F-distribution table or software with numerator df1 = 2, denominator df2 = 6, and α = 0.05:

* Critical F-value, $F_{critical}$: 5.14 (from the F-distribution table)
* $F > F_{critical}$: 27 > 5.14 → p < 0.05, so we reject the null hypothesis and conclude that at least one fertilizer's mean growth differs.
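
With a numerator df of exactly 2, the upper-tail probability of the F-distribution has a closed form, $P(F > f) = (1 + 2f/df_2)^{-df_2/2}$, so this example can be checked without a table. The sketch below uses that special case; the helper names are hypothetical, and the formula does not apply to other numerator degrees of freedom.

```python
def f_p_value_df1_2(f_stat, df2):
    """Upper-tail probability of F(2, df2); valid only for numerator df = 2."""
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)


def f_critical_df1_2(alpha, df2):
    """Critical value of F(2, df2), obtained by inverting the formula above."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


ssb, sse = 54, 6   # sums of squares from the fertilizer example
k, n = 3, 9        # number of groups, total number of observations
df1, df2 = k - 1, n - k

msb = ssb / df1    # mean square between groups
mse = sse / df2    # mean square error (within groups)
f_stat = msb / mse

critical = f_critical_df1_2(0.05, df2)
p_value = f_p_value_df1_2(f_stat, df2)

print(f"F = {f_stat:.1f}")                 # F = 27.0
print(f"critical value = {critical:.2f}")  # critical value = 5.14
print(f"p = {p_value:.4f}")                # p = 0.0010
```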

### Types of ANOVA
An ANOVA test can be classified as either one-way or two-way based on the number of independent variables involved.

#### One-Way ANOVA
This test is used to see whether the mean values of three or more groups differ. It applies when the data set has only one independent variable. If the test statistic exceeds the critical value, the null hypothesis is rejected, meaning the means of at least two groups differ significantly.

#### Two-Way ANOVA
Two independent variables are used in the two-way ANOVA. A two-way ANOVA test is used to determine the main effect of each independent variable and whether there is an interaction effect. Each factor is examined independently to determine its main effect, as in a one-way ANOVA. In addition, both factors are analyzed together to test the interaction effect.