# Hypothesis Testing and Statistical Analysis Notebook

## Assignment Task 6: Hypothesis Formulation
## Assignment Task 7: Hypothesis Testing & Significance Analysis

This notebook covers hypothesis testing and statistical analysis, including:
- Formulation of data-driven hypotheses
- Implementation of various statistical tests
- Significance analysis and interpretation
- Advanced statistical modeling
- Comprehensive reporting of findings

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import sys
import os

# Add src directory to path
sys.path.append('../src')

# Import our custom modules
from data.data_loader import DataLoader
from analysis.statistics import HypothesisTester, StatisticalSummary
from analysis.eda import ExploratoryDataAnalysis

# Import statistical and visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## 1. Load and Prepare Dataset

In [None]:
# Initialize data loader
loader = DataLoader('../data/processed')

# Load dataset (replace with your actual dataset)
try:
    # Try to load processed data first
    dataset = loader.load_dataset('cleaned_dataset.csv')
    print(f"Processed dataset loaded with {len(dataset)} rows and {len(dataset.columns)} columns")
except FileNotFoundError:
    # If processed data doesn't exist, fall back to a synthetic dataset
    print("Processed dataset not found. Generating synthetic data for demonstration.")
    loader = DataLoader('../data/raw')
    
    # Sample dataset for demonstration
    np.random.seed(42)
    n_samples = 1000
    
    sample_data = {
        'id': range(1, n_samples + 1),
        'age': np.random.normal(35, 10, n_samples),
        'income': np.random.lognormal(10, 1, n_samples),
        'education_level': np.random.randint(1, 5, n_samples),  # 1=HS, 2=Bachelor, 3=Master, 4=PhD
        'experience': np.random.normal(8, 5, n_samples),
        'satisfaction': np.random.uniform(1, 10, n_samples),
        'department_code': np.random.randint(1, 6, n_samples),  # 1-5 departments
        'target': np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
    }
    
    # Add some realistic correlations
    sample_data['income'] = sample_data['income'] + sample_data['education_level'] * 10000 + sample_data['experience'] * 2000
    sample_data['satisfaction'] = sample_data['satisfaction'] + sample_data['income'] / 50000 + np.random.normal(0, 1, n_samples)
    
    dataset = pd.DataFrame(sample_data)
    loader.dataset = dataset
    loader._extract_dataset_info()
    
    print(f"Sample dataset created with {len(dataset)} rows and {len(dataset.columns)} columns")

# Initialize analysis components
tester = HypothesisTester(dataset)
summary = StatisticalSummary()
eda = ExploratoryDataAnalysis(dataset, target_column='target')

print("Analysis components initialized successfully.")

## 2. Dataset Overview and Statistical Summary

In [None]:
print("=" * 50)
print("DATASET OVERVIEW AND STATISTICAL SUMMARY")
print("=" * 50)

# Basic dataset information
print(f"Dataset Shape: {dataset.shape}")
print(f"Memory Usage: {dataset.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"Numeric Columns: {len(dataset.select_dtypes(include=[np.number]).columns)}")
print(f"Missing Values: {dataset.isnull().sum().sum()}")

# Descriptive statistics
print("\nDESCRIPTIVE STATISTICS:")
print("-" * 25)
descriptive_stats = summary.descriptive_statistics(dataset)
print(descriptive_stats)

# Data types
print("\nDATA TYPES:")
print("-" * 12)
print(dataset.dtypes)

# Target variable analysis
if 'target' in dataset.columns:
    print("\nTARGET VARIABLE DISTRIBUTION:")
    print("-" * 32)
    target_dist = dataset['target'].value_counts()
    print(target_dist)
    print(f"Target Proportion: {dataset['target'].mean():.3f}")

## 3. Hypothesis Formulation (Task 6)

Based on the exploratory data analysis and domain knowledge, let's formulate relevant hypotheses.

In [None]:
print("=" * 40)
print("HYPOTHESIS FORMULATION")
print("=" * 40)

print("""
HYPOTHESIS 1: Income and Education Level Relationship
----------------------------------------------------
H0 (Null Hypothesis): There is no significant difference in income across different education levels.
H1 (Alternative Hypothesis): There is a significant difference in income across different education levels.
Test: One-way ANOVA
Significance Level: α = 0.05

HYPOTHESIS 2: Age and Income Correlation
---------------------------------------
H0 (Null Hypothesis): There is no significant correlation between age and income.
H1 (Alternative Hypothesis): There is a significant correlation between age and income.
Test: Pearson Correlation Test
Significance Level: α = 0.05

HYPOTHESIS 3: Target Variable and Satisfaction
---------------------------------------------
H0 (Null Hypothesis): There is no significant difference in satisfaction levels between target groups.
H1 (Alternative Hypothesis): There is a significant difference in satisfaction levels between target groups.
Test: Independent T-Test
Significance Level: α = 0.05

HYPOTHESIS 4: Department and Target Variable Association
------------------------------------------------------
H0 (Null Hypothesis): Department and target variable are independent.
H1 (Alternative Hypothesis): Department and target variable are associated.
Test: Chi-Square Test of Independence
Significance Level: α = 0.05

HYPOTHESIS 5: Experience and Income Relationship
-----------------------------------------------
H0 (Null Hypothesis): There is no significant relationship between experience and income.
H1 (Alternative Hypothesis): There is a significant positive relationship between experience and income.
Test: Spearman Rank Correlation (for monotonic, possibly non-linear relationships)
Significance Level: α = 0.05
""")

print("\nHYPOTHESES FORMULATED SUCCESSFULLY!")
print("These hypotheses are based on logical relationships observed in the data")
print("and common business/domain knowledge patterns.")

## 4. Hypothesis Testing (Task 7)

Now let's test our formulated hypotheses using appropriate statistical tests.
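
As a rough illustration of what a one-way ANOVA computes for Hypothesis 1, the sketch below derives the F-statistic by hand on toy data. The function name `one_way_anova_f` and the toy values are assumptions made for illustration, and this is not necessarily how `HypothesisTester.anova_test` is implemented. In practice `scipy.stats.f_oneway` returns the same statistic together with a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy values standing in for income at three education levels
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the group means differ by more than the within-group scatter would suggest; the p-value then comes from the F distribution with these two degrees of freedom.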

In [None]:
print("=" * 40)
print("HYPOTHESIS TESTING")
print("=" * 40)

# Identify available columns
numeric_columns = dataset.select_dtypes(include=[np.number]).columns.tolist()
print(f"Available numeric columns: {numeric_columns}")

# Test Hypothesis 1: Income and Education Level (ANOVA)
print("\n1. TESTING HYPOTHESIS 1: Income and Education Level Relationship")
print("-" * 65)
if 'income' in dataset.columns and 'education_level' in dataset.columns:
    try:
        result1 = tester.anova_test('income', 'education_level')
        print(f"Test Type: {result1['test_type']}")
        print(f"F-Statistic: {result1['f_statistic']:.4f}")
        print(f"P-Value: {result1['p_value']:.6f}")
        print(f"Significant: {result1['significant']}")
        print(f"Number of Groups: {result1['num_groups']}")
        
        if result1['significant']:
            print("✓ REJECT NULL HYPOTHESIS: Income differs significantly across education levels")
        else:
            print("✗ FAIL TO REJECT NULL HYPOTHESIS: No significant difference in income by education level")
    except Exception as e:
        print(f"Error in ANOVA test: {e}")
else:
    print("Required columns not available for this test.")

# Test Hypothesis 2: Age and Income Correlation
print("\n2. TESTING HYPOTHESIS 2: Age and Income Correlation")
print("-" * 50)
if 'age' in dataset.columns and 'income' in dataset.columns:
    try:
        result2 = tester.correlation_test('age', 'income', method='pearson')
        print(f"Test Type: {result2['test_type']}")
        print(f"Correlation Coefficient: {result2['correlation']:.4f}")
        print(f"P-Value: {result2['p_value']:.6f}")
        print(f"Significant: {result2['significant']}")
        
        if result2['significant']:
            print("✓ REJECT NULL HYPOTHESIS: Significant correlation between age and income")
            print(f"Correlation Strength: {abs(result2['correlation']):.3f} ({'Strong' if abs(result2['correlation']) > 0.7 else 'Moderate' if abs(result2['correlation']) > 0.3 else 'Weak'})")
            print(f"Correlation Direction: {'Positive' if result2['correlation'] > 0 else 'Negative'}")
        else:
            print("✗ FAIL TO REJECT NULL HYPOTHESIS: No significant correlation between age and income")
    except Exception as e:
        print(f"Error in correlation test: {e}")
else:
    print("Required columns not available for this test.")

# Test Hypothesis 3: Target and Satisfaction (T-Test)
print("\n3. TESTING HYPOTHESIS 3: Target Variable and Satisfaction")
print("-" * 55)
if 'target' in dataset.columns and 'satisfaction' in dataset.columns:
    try:
        # Create groups based on target variable
        group0 = dataset[dataset['target'] == 0]['satisfaction'].dropna()
        group1 = dataset[dataset['target'] == 1]['satisfaction'].dropna()
        
        # Perform independent t-test
        t_stat, p_value = stats.ttest_ind(group0, group1)
        
        result3 = {
            'test_type': 'Independent T-Test',
            't_statistic': t_stat,
            'p_value': p_value,
            'significant': p_value < 0.05,
            'group0_mean': group0.mean(),
            'group1_mean': group1.mean()
        }
        
        print(f"Test Type: {result3['test_type']}")
        print(f"T-Statistic: {result3['t_statistic']:.4f}")
        print(f"P-Value: {result3['p_value']:.6f}")
        print(f"Significant: {result3['significant']}")
        print(f"Group 0 Mean Satisfaction: {result3['group0_mean']:.4f}")
        print(f"Group 1 Mean Satisfaction: {result3['group1_mean']:.4f}")
        
        if result3['significant']:
            print("✓ REJECT NULL HYPOTHESIS: Significant difference in satisfaction between target groups")
        else:
            print("✗ FAIL TO REJECT NULL HYPOTHESIS: No significant difference in satisfaction between target groups")
    except Exception as e:
        print(f"Error in t-test: {e}")
else:
    print("Required columns not available for this test.")

# Test Hypothesis 4: Department and Target (Chi-Square)
print("\n4. TESTING HYPOTHESIS 4: Department and Target Variable Association")
print("-" * 65)
if 'department_code' in dataset.columns and 'target' in dataset.columns:
    try:
        result4 = tester.chi_square_test('department_code', 'target')
        print(f"Test Type: {result4['test_type']}")
        print(f"Chi-Square Statistic: {result4['chi2_statistic']:.4f}")
        print(f"P-Value: {result4['p_value']:.6f}")
        print(f"Degrees of Freedom: {result4['degrees_of_freedom']}")
        print(f"Significant: {result4['significant']}")
        
        if result4['significant']:
            print("✓ REJECT NULL HYPOTHESIS: Department and target variable are associated")
        else:
            print("✗ FAIL TO REJECT NULL HYPOTHESIS: No significant evidence that department and target variable are associated")
    except Exception as e:
        print(f"Error in chi-square test: {e}")
else:
    print("Required columns not available for this test.")

# Test Hypothesis 5: Experience and Income (Spearman Correlation)
print("\n5. TESTING HYPOTHESIS 5: Experience and Income Relationship")
print("-" * 55)
if 'experience' in dataset.columns and 'income' in dataset.columns:
    try:
        result5 = tester.correlation_test('experience', 'income', method='spearman')
        print(f"Test Type: {result5['test_type']}")
        print(f"Spearman Correlation: {result5['correlation']:.4f}")
        print(f"P-Value: {result5['p_value']:.6f}")
        print(f"Significant: {result5['significant']}")
        
        if result5['significant']:
            print("✓ REJECT NULL HYPOTHESIS: Significant relationship between experience and income")
            print(f"Correlation Strength: {abs(result5['correlation']):.3f} ({'Strong' if abs(result5['correlation']) > 0.7 else 'Moderate' if abs(result5['correlation']) > 0.3 else 'Weak'})")
            print(f"Correlation Direction: {'Positive' if result5['correlation'] > 0 else 'Negative'}")
        else:
            print("✗ FAIL TO REJECT NULL HYPOTHESIS: No significant relationship between experience and income")
    except Exception as e:
        print(f"Error in Spearman correlation test: {e}")
else:
    print("Required columns not available for this test.")
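
Because five tests are each run at α = 0.05, the chance of at least one false positive across the whole family is higher than 5%. One common remedy is the Holm step-down correction, sketched below in plain Python. The helper name `holm_adjust` and the p-values are hypothetical; they are not outputs of this notebook.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for five tests (not results from this notebook)
raw = [0.001, 0.04, 0.03, 0.20, 0.012]
adj = holm_adjust(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f}, Holm-adjusted p = {q:.3f}, significant = {q < 0.05}")
```

With these made-up inputs, two of the four raw p-values below 0.05 no longer pass after adjustment, which is the kind of change worth reporting next to the uncorrected results.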

## 5. Advanced Statistical Analysis

In [None]:
print("=" * 45)
print("ADVANCED STATISTICAL ANALYSIS")
print("=" * 45)

# Correlation analysis
print("\n1. COMPREHENSIVE CORRELATION ANALYSIS")
print("-" * 38)
try:
    corr_matrix, p_values = summary.correlation_analysis(dataset)
    print("Correlation Matrix:")
    print(corr_matrix.round(3))
    
    # Highlight significant correlations
    print("\nSignificant Correlations (p < 0.05):")
    significant_corrs = []
    for i in range(len(p_values.columns)):
        for j in range(i+1, len(p_values.columns)):
            if p_values.iloc[i, j] < 0.05:
                significant_corrs.append((p_values.columns[i], p_values.columns[j], 
                                        corr_matrix.iloc[i, j], p_values.iloc[i, j]))
    
    if significant_corrs:
        for var1, var2, corr, p_val in significant_corrs:
            print(f"  {var1} ↔ {var2}: r = {corr:.3f}, p = {p_val:.6f}")
    else:
        print("  No significant correlations found.")
        
except Exception as e:
    print(f"Error in correlation analysis: {e}")

# Distribution normality tests
print("\n2. NORMALITY TESTS")
print("-" * 18)
numeric_columns = dataset.select_dtypes(include=[np.number]).columns.tolist()
for col in numeric_columns[:5]:  # Test first 5 numeric columns
    values = dataset[col].dropna()
    if len(values) < 3:
        print(f"  {col}: Skipped (fewer than 3 non-missing values)")
        continue
    try:
        # Test a reproducible subsample of at most 1000 values
        sample = values.sample(min(1000, len(values)), random_state=42)
        stat, p_value = stats.shapiro(sample)
        normal = p_value > 0.05
        print(f"  {col}: {'No evidence against normality' if normal else 'Not Normal'} (W = {stat:.4f}, p = {p_value:.6f})")
    except Exception as e:
        print(f"  {col}: Test failed ({e})")

# Additional hypothesis tests
print("\n3. ADDITIONAL HYPOTHESIS TESTS")
print("-" * 32)

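# Mann-Whitney U test: non-parametric comparison of income between target groups
# (scipy is called directly here; the original call compared income with itself)
if 'target' in dataset.columns and 'income' in dataset.columns:
    try:
        income_0 = dataset.loc[dataset['target'] == 0, 'income'].dropna()
        income_1 = dataset.loc[dataset['target'] == 1, 'income'].dropna()
        u_stat, p_value = stats.mannwhitneyu(income_0, income_1, alternative='two-sided')
        print(f"Mann-Whitney U Test (income by target): U = {u_stat:.1f}, p = {p_value:.6f}")
    except Exception as e:
        print(f"Mann-Whitney U Test failed: {e}")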

# Paired t-test example (if applicable)
print("Paired T-Test: Available for before/after comparisons")
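
The group comparisons in this notebook (the t-test on satisfaction and the Mann-Whitney U test) indicate whether a difference is detectable, not how large it is. A standardized effect size such as Cohen's d can fill that gap. The sketch below is a minimal version using the pooled standard deviation; the function name `cohens_d` and the sample values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 in the denominator)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical satisfaction scores for two target groups
group_a = [2, 4, 6, 8]
group_b = [1, 3, 5, 7]
d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.3f}")
```

By Cohen's conventional benchmarks, values near 0.2, 0.5, and 0.8 are read as small, medium, and large, though what counts as meaningful depends on the domain.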

## 6. Visualization of Statistical Results

In [None]:
print("=" * 40)
print("VISUALIZATION OF STATISTICAL RESULTS")
print("=" * 40)

# Correlation heatmap
print("\n1. CORRELATION HEATMAP")
print("-" * 22)
try:
    numeric_data = dataset.select_dtypes(include=[np.number])
    if len(numeric_data.columns) > 1:
        plt.figure(figsize=(10, 8))
        correlation_matrix = numeric_data.corr()
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
                   square=True, linewidths=0.5)
        plt.title('Correlation Matrix Heatmap')
        plt.tight_layout()
        plt.show()
    else:
        print("Not enough numeric variables for correlation heatmap.")
except Exception as e:
    print(f"Error creating correlation heatmap: {e}")

# Boxplots for group comparisons
print("\n2. GROUP COMPARISON BOXPLOTS")
print("-" * 29)
if 'target' in dataset.columns:
    numeric_columns = dataset.select_dtypes(include=[np.number]).columns.tolist()
    target_col = 'target'
    
    # Plot first few numeric variables by target
    plot_columns = [col for col in numeric_columns if col != target_col][:4]
    
    if plot_columns:
        n_cols = min(2, len(plot_columns))
        n_rows = (len(plot_columns) + n_cols - 1) // n_cols
        
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))
        # plt.subplots returns a single Axes or an array; normalize to a flat array
        axes = np.atleast_1d(axes).ravel()
        
        for i, col in enumerate(plot_columns):
            if i < len(axes):
                dataset.boxplot(column=col, by=target_col, ax=axes[i])
                axes[i].set_title(f'{col} by {target_col}')
                axes[i].set_xlabel(target_col)
        
        # Hide empty subplots
        for i in range(len(plot_columns), len(axes)):
            axes[i].set_visible(False)
        
        plt.tight_layout()
        plt.show()

# Distribution plots
print("\n3. DISTRIBUTION PLOTS")
print("-" * 21)
numeric_columns = dataset.select_dtypes(include=[np.number]).columns.tolist()
plot_columns = numeric_columns[:4]  # First 4 numeric columns

if plot_columns:
    n_cols = min(2, len(plot_columns))
    n_rows = (len(plot_columns) + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))
    # plt.subplots returns a single Axes or an array; normalize to a flat array
    axes = np.atleast_1d(axes).ravel()
    
    for i, col in enumerate(plot_columns):
        if i < len(axes):
            dataset[col].hist(bins=30, ax=axes[i], alpha=0.7, color='skyblue', edgecolor='black')
            axes[i].set_title(f'Distribution of {col}')
            axes[i].set_xlabel(col)
            axes[i].set_ylabel('Frequency')
    
    # Hide empty subplots
    for i in range(len(plot_columns), len(axes)):
        axes[i].set_visible(False)
    
    plt.tight_layout()
    plt.show()
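
The correlation heatmap shows point estimates only. One hedged way to convey their uncertainty is an approximate confidence interval for a Pearson coefficient via the Fisher z-transformation, sketched below. The function name `pearson_ci` and the inputs (r = 0.30, n = 1000) are assumptions for illustration, and the approximation relies on roughly bivariate-normal data.

```python
from math import atanh, sqrt, tanh


def pearson_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r (default 95%) via Fisher's z."""
    z = atanh(r)              # Fisher transformation of r
    se = 1.0 / sqrt(n - 3)    # Approximate standard error on the z scale
    # Build the interval on the z scale, then map it back to the r scale
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


# Hypothetical correlation and sample size (not results from this notebook)
low, high = pearson_ci(0.30, 1000)
print(f"95% CI for r = 0.30 with n = 1000: [{low:.3f}, {high:.3f}]")
```

The interval narrows as n grows, so with large samples even weak correlations can be statistically significant while remaining small in practical terms.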

## 7. Comprehensive Hypothesis Testing Report

In [None]:
print("=" * 50)
print("COMPREHENSIVE HYPOTHESIS TESTING REPORT")
print("=" * 50)

# Generate hypothesis testing report
hypothesis_report = tester.generate_hypothesis_report()
print(hypothesis_report)

# Summary statistics
print("=" * 50)
print("SUMMARY STATISTICS")
print("=" * 50)
descriptive_stats = summary.descriptive_statistics(dataset)
print(descriptive_stats)

# Key findings
print("=" * 30)
print("KEY FINDINGS")
print("=" * 30)
print("""
Summary of what this notebook covers:

1. HYPOTHESIS TESTING RESULTS:
   - Five hypotheses were tested (ANOVA, Pearson, independent t-test, chi-square, Spearman)
   - Each test used α = 0.05; no multiple-comparison correction was applied
   - Both parametric and non-parametric tests were employed
   - See the per-test output above for which null hypotheses were rejected

2. STATISTICAL RELATIONSHIPS EXAMINED:
   - Pairwise correlations between numeric variables
   - Group differences by target variable
   - Association between department and target
   - Monotonic relationships via Spearman rank correlation

3. DATA DISTRIBUTION CHARACTERISTICS:
   - Shapiro-Wilk normality checks for the first five numeric columns
   - Histograms and boxplots for visual inspection of spread and outliers
   - Variance homogeneity, skewness, and kurtosis were not assessed here

4. PRACTICAL IMPLICATIONS:
   - Statistical significance does not by itself imply practical relevance
   - Effect sizes and confidence intervals should be reported alongside p-values
   - Results on the synthetic fallback dataset are illustrative only
   - Limitations and assumptions of each test should be documented
""")

## 8. Save Statistical Analysis Results

In [None]:
print("=" * 40)
print("SAVING ANALYSIS RESULTS")
print("=" * 40)

try:
    # Create reports directory if it doesn't exist
    os.makedirs('../reports', exist_ok=True)
    
    # Save hypothesis testing report
    hypothesis_report = tester.generate_hypothesis_report()
    with open('../reports/hypothesis_testing_report.txt', 'w') as f:
        f.write("HYPOTHESIS TESTING REPORT\n")
        f.write("=" * 50 + "\n\n")
        f.write(hypothesis_report)
        f.write("\n\nSTATISTICAL SUMMARY\n")
        f.write("=" * 50 + "\n\n")
        descriptive_stats = summary.descriptive_statistics(dataset)
        f.write(descriptive_stats.to_string())
    
    print("✓ Hypothesis testing report saved to reports/hypothesis_testing_report.txt")
    
    # Save test results as CSV
    test_results = tester.get_test_results()
    if test_results:
        results_df = pd.DataFrame(test_results)
        results_df.to_csv('../reports/statistical_test_results.csv', index=False)
        print("✓ Statistical test results saved to reports/statistical_test_results.csv")
    
    # Save correlation matrix
    numeric_data = dataset.select_dtypes(include=[np.number])
    if len(numeric_data.columns) > 1:
        correlation_matrix = numeric_data.corr()
        correlation_matrix.to_csv('../reports/correlation_matrix.csv')
        print("✓ Correlation matrix saved to reports/correlation_matrix.csv")
    
except Exception as e:
    print(f"Error saving results: {e}")

print("\nSTATISTICAL ANALYSIS COMPLETED SUCCESSFULLY!")

## 9. Next Steps and Recommendations

In [None]:
print("=" * 40)
print("NEXT STEPS AND RECOMMENDATIONS")
print("=" * 40)

print("""
NEXT STEPS:
-----------
1. Model Development:
   - Use statistically significant variables as features
   - Apply appropriate preprocessing based on distribution analysis
   - Consider interaction effects identified in correlation analysis
   - Validate assumptions for chosen machine learning algorithms

2. Further Analysis:
   - Conduct multivariate analysis and regression modeling
   - Perform dimensionality reduction if needed
   - Explore advanced statistical techniques (time series, clustering)
   - Validate findings with additional datasets if available

3. Reporting and Documentation:
   - Create comprehensive final report incorporating all findings
   - Prepare presentation slides for stakeholders
   - Document methodology and limitations
   - Provide actionable recommendations based on statistical evidence

RECOMMENDATIONS:
----------------
1. For Future Projects:
   - Implement more sophisticated statistical modeling techniques
   - Consider Bayesian approaches for uncertainty quantification
   - Use resampling methods (bootstrap, permutation tests) for robust hypothesis testing
   - Implement automated statistical reporting

2. For Production Use:
   - Establish statistical process control monitoring
   - Implement A/B testing frameworks
   - Create automated alerting for significant changes
   - Maintain statistical model versioning and lineage

3. For Team Development:
   - Provide training on advanced statistical methods
   - Establish statistical best practices guidelines
   - Create reusable statistical analysis templates
   - Implement peer review processes for statistical work

HYPOTHESIS TESTING AND STATISTICAL ANALYSIS COMPLETED!
=====================================================
The tests above indicate which relationships in the data are statistically
supported at α = 0.05. Findings should be validated on real data before they
inform business strategies and model development efforts.
""")