# Day 7: Hypothesis Testing and Statistical Inference

Today, we will explore hypothesis testing and statistical inference, two critical concepts in data science that allow us to make decisions and draw conclusions from data. These tools help us determine whether the patterns and relationships we observe in data are statistically significant or could have occurred by chance.

## Topics Covered:
- Introduction to Hypothesis Testing
- Test Statistics
- p-Values
- Confidence Intervals


## 1. Introduction to Hypothesis Testing

Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. 
It involves 
- formulating a hypothesis, 
- collecting data, 
- and determining whether the data supports or refutes the hypothesis. 

Hypothesis testing is a fundamental tool in data science, as it allows data scientists to make objective conclusions about their data and the models they build.

### Steps in Hypothesis Testing:
1. Formulate the hypotheses:
    - Null hypothesis $ H_0 $: the default claim of no effect or no difference.
    - Alternative hypothesis $ H_1 $ or $ H_A $: the claim you are looking for evidence to support.

2. Select a significance level $ \alpha $:
    - The significance level is the probability of rejecting the null hypothesis when it is actually true (a Type I error). Common choices are 0.05 (5%) or 0.01 (1%).

3. Choose the Appropriate Test:
    - The choice of test depends on the type of data and the hypothesis being tested (e.g., t-test, chi-square test, ANOVA).

4. Calculate the test statistic:
    - Use the sample data to calculate a test statistic, which is compared to a critical value (or converted to a p-value) to determine whether to reject the null hypothesis.

5. Make a decision:
    - Compare the p-value (the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true) to the significance level. If the p-value is less than $ \alpha $, reject the null hypothesis.

6. Draw a conclusion:
    - Based on the decision, conclude whether there is sufficient evidence to support the alternative hypothesis.

#### Example: A/B Testing in Online Marketing

Suppose you are working for an e-commerce company and want to test whether a new website design (Version B) leads to higher conversion rates (i.e., more purchases) compared to the current design (Version A).

1. Formulate the hypotheses:
    - Null hypothesis $ H_0 $:
        - The conversion rates for Version A and Version B are the same.
    - Alternative hypothesis $ H_1 $ or $ H_A $:
        - The conversion rate for Version B is higher than that for Version A.

2. Select a significance level $ \alpha $:
    - Set $ \alpha $ = 0.05, meaning you are willing to accept a 5% chance of incorrectly rejecting the null hypothesis.

3. Choose the Appropriate Test:
    - Use a two-sample t-test to compare the conversion rates between the two versions of the website.

4. Calculate the test statistic:
    - Collect data on the number of visitors and the number of conversions for both versions over a fixed period.
    - Calculate the conversion rate for each version and use the t-test formula to calculate the test statistic.

5. Make a decision:
    - Compute the p-value from the test statistic. If the p-value is less than 0.05, reject the null hypothesis.

6. Draw a conclusion:
    - If the null hypothesis is rejected, conclude that the data support a higher conversion rate for the new website design (Version B). If not, conclude that there is not enough evidence of a difference between the two designs.

In [1]:
import numpy as np
from scipy import stats

# Conversion rates for Version A and Version B
conversions_A = np.array([50, 55, 60, 65, 70])
conversions_B = np.array([60, 65, 70, 75, 80])

# Perform a two-sample t-test
# (two-sided by default; pass alternative="greater" to match the one-sided H1 above)
t_statistic, p_value = stats.ttest_ind(conversions_B, conversions_A)

# Display the results
print(f"t-Statistic: {t_statistic:.2f}")
print(f"p-Value: {p_value:.2f}")

# Decision rule
if p_value < 0.05:
    print("Reject the null hypothesis: The new design has a higher conversion rate.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in conversion rates.")


t-Statistic: 2.00
p-Value: 0.08
Fail to reject the null hypothesis: There is no significant difference in conversion rates.
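Because a conversion rate is a proportion (conversions divided by visitors), a two-proportion z-test is a common alternative to the t-test for this kind of A/B comparison. The sketch below is only an illustration: the visitor and conversion counts are hypothetical, and it assumes samples large enough for the normal approximation to hold.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, for illustration only
visitors_A, conversions_A = 1000, 100
visitors_B, conversions_B = 1000, 130

rate_A = conversions_A / visitors_A
rate_B = conversions_B / visitors_B

# Pooled conversion rate under H0 (both versions convert equally)
pooled = (conversions_A + conversions_B) / (visitors_A + visitors_B)
standard_error = np.sqrt(pooled * (1 - pooled) * (1 / visitors_A + 1 / visitors_B))

# z-statistic and one-sided p-value (H1: rate_B > rate_A)
z_statistic = (rate_B - rate_A) / standard_error
p_value = stats.norm.sf(z_statistic)

print(f"Rate A: {rate_A:.3f}, Rate B: {rate_B:.3f}")
print(f"z-statistic: {z_statistic:.2f}, one-sided p-value: {p_value:.4f}")
```

Unlike the t-test on a handful of values, this version uses the raw visitor and conversion counts directly, which is usually the data an A/B test produces.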


## 2. Test Statistics

In hypothesis testing, the test statistic is a standardized value calculated from the sample data. The type of test statistic depends on the hypothesis test being performed and the type of data involved (e.g., t-tests for means, chi-square for categorical data). The p-value is the probability of observing the sample results, or more extreme results, if the null hypothesis is true.

### Common Types of Statistical Tests
Selecting the correct test depends on the type of data, the distribution of the data, and the hypothesis being tested. Here are some of the most commonly used tests in hypothesis testing:

#### T-test (Student's t-test)

- Purpose:
    - The t-test compares the means of two groups to see if they are statistically different from each other.

- Types of t-test:
    - One-sample t-test:
        - Compares the sample mean to a known population mean.
    - Independent (two-sample) t-test:
        - Compares the means of two independent groups.
    - Paired t-test:
        - Compares means from the same group at two different times.

- Test statistic:
    - The t-statistic is used when the population variance is unknown and the sample size is relatively small.

##### When to Use:
- The data is approximately normally distributed.
- You are comparing means between one or two groups.

##### Example: Sales performance comparison

A company wants to compare the sales performance of two regions (A and B). The independent t-test is used to compare the average sales between the two regions.

In [2]:
from scipy import stats

# Sales data for Region A and Region B
region_A = [100, 102, 105, 98, 97]
region_B = [110, 108, 112, 115, 111]

# Independent t-test
t_statistic, p_value = stats.ttest_ind(region_A, region_B)
print(f't-statistic: {t_statistic:.2f}, p-value: {p_value:.2f}')


t-statistic: -5.86, p-value: 0.00
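The one-sample and paired variants listed above can be sketched in the same way. The numbers here are hypothetical and only meant to show which SciPy function fits which design: `stats.ttest_1samp` for a sample against a benchmark, and `stats.ttest_rel` for two measurements on the same units.

```python
from scipy import stats

# Hypothetical data, for illustration only
weekly_sales = [102, 98, 105, 110, 99, 101, 104]  # one region's weekly sales
target_mean = 100                                 # assumed benchmark value

# One-sample t-test: does the sample mean differ from the benchmark?
t_one, p_one = stats.ttest_1samp(weekly_sales, popmean=target_mean)
print(f"One-sample: t = {t_one:.2f}, p = {p_one:.3f}")

# Paired t-test: the same five stores measured before and after a change
sales_before = [100, 102, 105, 98, 97]
sales_after = [104, 105, 109, 99, 101]
t_paired, p_paired = stats.ttest_rel(sales_after, sales_before)
print(f"Paired: t = {t_paired:.2f}, p = {p_paired:.3f}")
```

The paired test works on the per-store differences, so it is appropriate only when each "before" value has a matching "after" value from the same unit.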


### ANOVA (Analysis of Variance)

- Purpose:
    - ANOVA is used to compare the means of three or more groups to see if at least one group mean is different from the others.

- Types:
    - One-way ANOVA:
        - Compares the means of three or more independent groups.
    - Two-way ANOVA:
        - Examines the effect of two factors on the means of groups.

- Test statistic:
    - The F-statistic is used to determine whether the variability between group means is larger than the variability within groups.

##### When to Use:
- You have more than two groups or conditions.
- The data is normally distributed.

#### Example: Comparing ROI of marketing campaigns

A company wants to test whether different marketing strategies (TV, online, and print) result in different sales outcomes. One-way ANOVA can be used to test for differences between the means of sales for the three strategies.
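A minimal sketch of this one-way ANOVA, using `scipy.stats.f_oneway`. The sales figures for each strategy are hypothetical and chosen only to illustrate the call.

```python
from scipy import stats

# Hypothetical sales figures per campaign, for illustration only
tv_sales = [120, 135, 128, 140, 132]
online_sales = [150, 145, 160, 155, 148]
print_sales = [110, 115, 108, 120, 112]

# One-way ANOVA: H0 is that all three group means are equal
f_statistic, p_value = stats.f_oneway(tv_sales, online_sales, print_sales)
print(f"F-statistic: {f_statistic:.2f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: at least one strategy has a different mean.")
else:
    print("Fail to reject the null hypothesis: no significant difference in means.")
```

A significant F-statistic only says that at least one mean differs; identifying which groups differ requires a follow-up (post hoc) comparison.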

### Chi-Square Test

- Purpose:
    - The chi-square test is used to test the association between two categorical variables.
- Types:
    - Chi-square test of independence:
        - Determines whether two categorical variables are independent.
    - Chi-square goodness of fit test:
        - Determines whether observed data fits an expected distribution.
- Test statistic:
    - The chi-square $ \chi^2 $ statistic measures the discrepancy between observed and expected frequencies in categorical data.

#### When to Use:
- Both variables are categorical.
- You want to test the independence or distribution of categorical data.

#### Example: Customer satisfaction analysis


A survey is conducted to determine whether customer satisfaction (satisfied/unsatisfied) is associated with customer age group (young/middle-aged/old). A chi-square test can determine if there is a significant association between age group and satisfaction.

In [3]:
import scipy.stats as stats

# Contingency table of observed frequencies
data = [[30, 20], [20, 40], [50, 30]]  # Example counts for 3 groups and 2 categories

# Chi-square test
chi2_stat, p_value, _, _ = stats.chi2_contingency(data)
print(f'Chi-square statistic: {chi2_stat:.2f}, p-value: {p_value:.2f}')


Chi-square statistic: 13.18, p-value: 0.00
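The goodness of fit variant can be sketched with `scipy.stats.chisquare`. As an illustration, this takes the three row totals of the contingency table above (50, 60, and 80 respondents per age group) and tests them against the assumed null hypothesis that respondents are spread evenly across the groups. That null is a choice made for this sketch, not part of the original survey design.

```python
from scipy import stats

# Respondents per age group (row totals of the contingency table above)
observed = [50, 60, 80]

# With no f_exp argument, chisquare assumes equal expected frequencies
chi2_stat, p_value = stats.chisquare(observed)
print(f"Chi-square statistic: {chi2_stat:.2f}, p-value: {p_value:.3f}")
```

To test against a non-uniform distribution, pass the expected counts through the `f_exp` argument; they must sum to the same total as the observed counts.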


### Z-Test

- Purpose:
    - The z-test compares sample and population means to determine if they are significantly different, assuming the population variance is known.

- Types:
    - One-sample z-test: 
        - Tests whether the sample mean differs from the population mean.
    - Two-sample z-test: 
        - Compares means from two independent samples.

- Test statistic:
    - The z-statistic is used when the population variance is known and the sample size is large.

#### When to Use:
- The sample size is large (n > 30).
- The population variance is known.


#### Example: Quality control

A manufacturing company wants to test if the average weight of their product is equal to the advertised weight of 500 grams. A z-test can be used to check if the difference is statistically significant.
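A one-sample z-test for this scenario can be computed directly from its definition, $ z = (\bar{x} - \mu_0) / (\sigma / \sqrt{n}) $. The sample size, sample mean, and known standard deviation below are hypothetical values picked for illustration; only the 500 g target comes from the example.

```python
import numpy as np
from scipy import stats

advertised_weight = 500   # grams, from the example
# Hypothetical measurements, for illustration only
sample_mean = 498.5       # grams
population_std = 5        # grams, assumed known
sample_size = 50

# z-statistic: distance of the sample mean from the target, in standard errors
standard_error = population_std / np.sqrt(sample_size)
z_statistic = (sample_mean - advertised_weight) / standard_error

# Two-sided p-value (H1: the mean weight differs from 500 g)
p_value = 2 * stats.norm.sf(abs(z_statistic))
print(f"z-statistic: {z_statistic:.2f}, p-value: {p_value:.4f}")
```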

### Pearson Correlation Coefficient (r)

- Purpose
    - Measures the linear relationship between two continuous variables. The Pearson correlation coefficient ranges from -1 to 1, where:

        - 1: A perfect positive linear relationship.
        - 0: No linear relationship.
        - -1: A perfect negative linear relationship.
- Test statistic:
    - The Pearson correlation coefficient $ r $ is used to measure the strength and direction of the relationship.

#### When to Use:
- Both variables are continuous and normally distributed.
- The relationship between the variables is linear.


#### Example: Optimum production analysis



A manufacturing company wants to determine the relationship between the total production time and the number of units produced. The aim is to assess whether more production time is associated with more units produced.

In [4]:
import numpy as np
from scipy.stats import pearsonr

# Production data: [production time in hours, units produced]
production_time = [5, 6, 7, 8, 10]
units_produced = [50, 60, 70, 80, 100]

# Calculate Pearson correlation coefficient
corr, p_value = pearsonr(production_time, units_produced)
print(f'Pearson correlation: {corr:.2f}, p-value: {p_value:.2f}')


Pearson correlation: 1.00, p-value: 0.00
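The data above are perfectly linear, which is why the correlation is exactly 1. With noisier data, the p-value reported by `pearsonr` corresponds to a t-test on $ r $ with $ n - 2 $ degrees of freedom, where $ t = r \sqrt{(n - 2) / (1 - r^2)} $. The sketch below checks this on hypothetical, slightly noisy production figures.

```python
import numpy as np
from scipy import stats

# Hypothetical noisy production data, for illustration only
production_time = np.array([5, 6, 7, 8, 10, 12])
units_produced = np.array([48, 63, 69, 84, 97, 125])

r, p_value = stats.pearsonr(production_time, units_produced)

# Equivalent t-statistic for testing H0: no linear correlation
n = len(production_time)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.3f}, p-value from pearsonr = {p_value:.6f}")
print(f"t = {t_stat:.2f}, p-value from t-distribution = {p_manual:.6f}")
```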


## 3. p-Values

The p-value is a crucial concept in hypothesis testing that helps determine the strength of evidence against the null hypothesis. It represents the probability of obtaining results at least as extreme as the observed data, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.

- Interpreting p-Values
    - p-value < α (significance level):
        - If the p-value is less than the chosen significance level (typically 0.05), we reject the null hypothesis. This means the observed result would be unlikely if the null hypothesis were true, so the data favor the alternative hypothesis.

    - p-value ≥ α:
        - If the p-value is greater than or equal to the significance level, we fail to reject the null hypothesis. This indicates insufficient evidence to conclude that the observed result is statistically significant.
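To make the definition concrete, a p-value can be computed from a test statistic and its reference distribution. The sketch below uses the A/B testing result from Section 1 (t = 2.00 from two groups of five, so 5 + 5 - 2 = 8 degrees of freedom) and shows how the two-sided and one-sided p-values relate.

```python
from scipy import stats

t_statistic = 2.00   # from the A/B testing example
df = 5 + 5 - 2       # degrees of freedom for two groups of five

# Two-sided: probability of a result at least this extreme in either direction
p_two_sided = 2 * stats.t.sf(abs(t_statistic), df)

# One-sided: probability of a result at least this large in the direction of H1
p_one_sided = stats.t.sf(t_statistic, df)

print(f"Two-sided p-value: {p_two_sided:.3f}")
print(f"One-sided p-value: {p_one_sided:.3f}")
```

The two-sided value matches the 0.08 printed earlier. The one-sided value is half of it, which shows that the conclusion at α = 0.05 can depend on whether the alternative hypothesis is directional.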

### Example: Effectiveness of a New Drug

Suppose a pharmaceutical company is testing whether a new drug reduces blood pressure more effectively than a placebo. They conduct a clinical trial where a group of patients receives the drug, and another group receives a placebo.

- Formulate the Hypothesis:

    - Null Hypothesis ($ H_0 $):
        - The new drug has the same effect on reducing blood pressure as the placebo.
    - Alternative Hypothesis ($ H_1 $):
        - The new drug reduces blood pressure more effectively than the placebo.
- Collect Data: 
    - The company measures the change in blood pressure for patients in both groups.

- Calculate the p-value: 
    - After collecting the data, the company uses a two-sample t-test to determine if there is a significant difference between the blood pressure reductions in the two groups.



In [5]:
from scipy import stats

# Sample data: blood pressure reduction in the drug group and placebo group
drug_group = [10, 12, 9, 15, 13, 10, 12]
placebo_group = [5, 7, 6, 4, 6, 8, 7]

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(drug_group, placebo_group)

# Print the p-value
print(f"p-Value: {p_value:.3f}")

# Decision
if p_value < 0.05:
    print("Reject the null hypothesis: The drug is more effective at reducing blood pressure.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the drug and placebo.")


p-Value: 0.000
Reject the null hypothesis: The drug is more effective at reducing blood pressure.
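The alternative hypothesis here is directional (the drug reduces blood pressure more than the placebo), while `ttest_ind` runs a two-sided test by default. A one-sided version can be sketched with the `alternative` argument, assuming a SciPy version that supports it (1.6 or later).

```python
from scipy import stats

# Same sample data as above
drug_group = [10, 12, 9, 15, 13, 10, 12]
placebo_group = [5, 7, 6, 4, 6, 8, 7]

# Two-sided test (default) and one-sided test (H1: drug mean > placebo mean)
_, p_two_sided = stats.ttest_ind(drug_group, placebo_group)
_, p_one_sided = stats.ttest_ind(drug_group, placebo_group, alternative="greater")

print(f"Two-sided p-value: {p_two_sided:.5f}")
print(f"One-sided p-value: {p_one_sided:.5f}")
```

Because the observed difference is in the direction of the alternative, the one-sided p-value is half the two-sided one. Both lead to the same decision for this data.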


## 4. Confidence Intervals

A confidence interval (CI) provides a range of values that is likely to contain the population parameter (such as a mean or proportion) with a certain level of confidence (e.g., 95%). Unlike point estimates, which give a single value (like a sample mean), confidence intervals offer a range within which the true population value is expected to lie, giving a better understanding of the parameter's uncertainty.

### How to interpret a confidence interval

A 95% confidence interval means that if we were to repeat the sampling process 100 times and compute an interval each time, we would expect about 95 of those intervals to contain the true population parameter. It does not mean that there is a 95% chance the parameter is within any one calculated interval; instead, it's a reflection of the method's reliability over many repeated samples.
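This repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes a hypothetical normal population with a known mean and standard deviation, draws many samples, builds a 95% interval from each, and counts how often the interval contains the true mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population values, for illustration only
true_mean, sigma, n = 5000, 500, 40
z_value = 1.96
n_repeats = 1000

covered = 0
for _ in range(n_repeats):
    sample = rng.normal(true_mean, sigma, size=n)
    margin = z_value * sigma / np.sqrt(n)
    lower, upper = sample.mean() - margin, sample.mean() + margin
    if lower <= true_mean <= upper:
        covered += 1

coverage = covered / n_repeats
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95, which is the sense in which the method is "95% reliable".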

### Key Concepts

- Point Estimate:
    - The sample statistic (e.g., sample mean) used as a point estimate for the population parameter.

- Margin of Error:
    - The amount added to and subtracted from the point estimate, accounting for the uncertainty in the estimate.

- Confidence Level:
    - The percentage (e.g., 95%) of intervals that, over repeated sampling, would be expected to contain the true parameter.

#### Formula for Confidence Interval:

$ CI = \bar{x} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} $


This formula represents the confidence interval (CI) where:

- $ \bar{x} $ is the sample mean,
- $ Z_{\alpha/2} $ is the Z-value corresponding to the confidence level (e.g., 1.96 for 95% confidence),
- $ \sigma $ is the population standard deviation (the sample standard deviation is often substituted when the population value is unknown, in which case a t-value is more appropriate for small samples),
- $ n $ is the sample size.
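The Z-value for a given confidence level does not have to be looked up in a table. A short sketch using the inverse CDF of the standard normal distribution (`scipy.stats.norm.ppf`):

```python
from scipy import stats

# Z-value for each confidence level: the (1 - alpha/2) quantile of N(0, 1)
z_values = {}
for confidence in (0.90, 0.95, 0.99):
    alpha = 1 - confidence
    z_values[confidence] = stats.norm.ppf(1 - alpha / 2)
    print(f"{confidence:.0%} confidence: Z = {z_values[confidence]:.3f}")
```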

#### Example: Estimating the Average Sales per Store

Let's assume Carrefour wants to estimate the average daily sales of its stores across different locations. They take a random sample of 40 stores and find that the average daily sales are €5,000 with a standard deviation of €500. The company wants to calculate a 95% confidence interval to estimate the average sales across all stores.

- $ \bar{x} $ (sample mean) = 5000,
- $ Z_{\alpha/2} $ = 1.96 for a 95% confidence level,
- $ \sigma $ (standard deviation) = 500,
- $ n $ (sample size) = 40.

In [7]:
import numpy as np
import scipy.stats as stats

# Sample data
sample_mean = 5000  # Average sales in €
sample_std = 500  # Standard deviation in €
sample_size = 40  # Number of stores

# Z-value for 95% confidence level
z_value = 1.96

# Calculate margin of error
margin_of_error = z_value * (sample_std / np.sqrt(sample_size))

# Calculate confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# Display the result
print(f"95% Confidence Interval for Average Sales: (€{lower_bound:.2f}, €{upper_bound:.2f})")


95% Confidence Interval for Average Sales: (€4845.05, €5154.95)


This confidence interval gives a range of values that the company can be 95% confident contains the true average daily sales across all its stores. Here the calculated interval is (4845.05, 5154.95), so the company can be 95% confident that the true average daily sales for all stores fall between €4,845.05 and €5,154.95.
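Since the €500 standard deviation comes from the sample rather than the whole population, a t-based interval is a reasonable alternative. The sketch below reuses the same numbers and swaps the Z-value for a t critical value with $ n - 1 $ degrees of freedom; the result is slightly wider than the Z-based interval.

```python
import numpy as np
from scipy import stats

# Same sample summary as above
sample_mean = 5000   # Average sales in €
sample_std = 500     # Sample standard deviation in €
sample_size = 40     # Number of stores
confidence = 0.95

# t critical value with n - 1 degrees of freedom
df = sample_size - 1
t_value = stats.t.ppf(1 - (1 - confidence) / 2, df)

margin_of_error = t_value * sample_std / np.sqrt(sample_size)
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

print(f"t critical value: {t_value:.3f}")
print(f"95% t-based interval: (€{lower_bound:.2f}, €{upper_bound:.2f})")
```

With 40 stores the difference from the Z-based interval is small; it matters more as the sample size shrinks.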

### Conclusion

On Day 7, we covered essential concepts such as Hypothesis Testing, Test Statistics, p-Values, and Confidence Intervals. These statistical tools are fundamental for analyzing data, drawing conclusions, and making decisions based on evidence in data science.

Key Takeaways:
- Hypothesis Testing: 
    - A method for making decisions or inferences about a population based on sample data.
- Test Statistics: 
    - These are calculated values used to decide whether to reject the null hypothesis, depending on the hypothesis being tested.
- p-Values: 
    - A measure of the strength of evidence against the null hypothesis. A smaller p-value (typically less than 0.05) indicates stronger evidence to reject the null hypothesis.
- Confidence Intervals: 
    - Provide a range within which the true population parameter is likely to fall, giving more context to point estimates by showing uncertainty.



This concludes Week 1 of our 100-day data science challenge.

In Week 1, we covered the foundational topics in data science to ensure a strong understanding of the essential tools and techniques for tackling data challenges. Here's a summary of what we've accomplished:

#### Week 1 Overview
- Day 1: Introduction, Setup, and Python Basics
    - Introduced the challenge and set up the working environment.
    - Covered Python basics, including data types, control flow, and functions.

- Day 2: Data Science Libraries
    - Explored core data science libraries such as NumPy, Pandas, Matplotlib, and Seaborn.
    - Implemented basic operations for data manipulation, analysis, and visualization.

- Day 3: Data Cleaning and Exploration
    - Learned essential data cleaning techniques, such as handling missing data, removing duplicates, and data transformations.
    - Introduced exploratory data analysis (EDA) to summarize data, identify patterns, and visualize key insights.

- Day 4: Data Cleaning and Exploration Project
    - Applied data cleaning techniques to a real dataset (e.g., Titanic dataset).
    - Performed EDA to generate meaningful insights and documented the findings.

- Day 5: Introduction to Mathematics for Data Science
    - Introduced mathematical concepts essential for data science, such as linear equations, polynomials, and calculus.
    - Explored how these concepts are applied in real-life data science tasks, including house price predictions and product sales modeling.

- Day 6: Introduction to Statistics for Data Science
    - Covered basic statistics, including descriptive statistics, probability theory, and probability distributions.
    - Explored examples such as customer purchase probability and analyzing age distributions in datasets.

- Day 7: Hypothesis Testing and Statistical Inference
    - Introduced hypothesis testing, p-values, and confidence intervals.
    - Conducted hypothesis tests using real-world examples like A/B testing in marketing and product comparison using t-tests, ANOVA, and chi-square tests.



#### What's Coming in Week 2:
In Week 2, we'll dive deeper into more complex data science techniques, including supervised learning and its mathematical foundations. We'll cover:

- Regression Models:
    - Linear Regression, Multiple Regression, and their applications in predictive modeling.
- Classification Techniques:
    - Algorithms such as Logistic Regression, Decision Trees, and k-NN.
- Mathematical Foundations for Machine Learning:
    - Building on calculus and linear algebra concepts, with hands-on implementation.
- Model Evaluation:
    - Metrics like Accuracy, Precision, Recall, F1 Score, and ROC-AUC for understanding model performance.

These topics will equip us with the essential tools to handle complex data and build models that solve real-world business problems. Looking forward to your progress!