
# Data Quality Analysis Using Hypothesis Testing Techniques


### Objective:
Perform data quality checks and analysis using hypothesis testing techniques, with a focus on:
- Assessing the difference in data distribution.
- Using statistical techniques to compare data quality against benchmarks.


### 1. Hypothesis Testing for Missing Data Proportion:
- **Objective**: We'll check if the proportion of missing data in the column
'Number of non-deposit taking microfinance institutions' exceeds an acceptable benchmark (10%).
- **Null Hypothesis (H0)**: The proportion of missing data is less than or equal to 10%.
- **Alternative Hypothesis (H1)**: The proportion of missing data exceeds 10%.


### 2. Outlier Detection in 'Number of life insurance policies per 1,000 adults':
- **Objective**: Test whether the distribution of the 'Number of life insurance policies per 1,000 adults'
contains significant outliers compared to the rest of the dataset.
- **Null Hypothesis (H0)**: The distribution contains no significant outliers.
- **Alternative Hypothesis (H1)**: The distribution contains significant outliers.







In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from scipy import stats

In [None]:
# Load the cleaned India FAS data
india_data_cleaned = pd.read_csv('india_fas_cleaned.csv')

### 1. Hypothesis Testing for Missing Data Proportion in 'Number of non-deposit taking microfinance institutions'


In [None]:
# Set benchmark for acceptable missing data proportion (10%)
benchmark_proportion = 0.10

# Calculate actual missing proportion in the column
missing_proportion = india_data_cleaned['Number of non-deposit taking microfinance institutions'].isna().mean()
print(missing_proportion)

0.35


In [None]:
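from statsmodels.stats.weightstats import ztest
# One-sample z-test on the 0/1 missing-value indicator against the 10% benchmark.
# Note: ztest defaults to a two-sided alternative and a sample standard deviation (ddof=1).
missing_dist = india_data_cleaned['Number of non-deposit taking microfinance institutions'].isna().astype(int)
z_score, p_value = ztest(missing_dist.values, value=benchmark_proportion)
print(f"P-value: {p_value}")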

P-value: 0.02233137323552469


In [None]:
# Manual one-sample z-test on the missing-value indicator
n = len(missing_dist)
p_hat = missing_proportion  # sample proportion (mean of the 0/1 indicator)
p_0 = benchmark_proportion  # benchmark proportion under H0
s_std = np.std(missing_dist.values)  # standard deviation with NumPy's default ddof=0
z_stat = (p_hat - p_0) / (s_std / np.sqrt(n))

In [None]:
# Two-sided p-value from the standard normal distribution
p_value_missing_data = stats.norm.sf(abs(z_stat)) * 2
print(p_value_missing_data)

0.019076322101778374
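
The two p-values above (0.0223 from `ztest` and 0.0191 from the manual calculation) differ only in how the standard deviation is estimated: `ztest` uses `ddof=1` by default, while `np.std` uses `ddof=0`. The sketch below illustrates this with the standard library alone. The indicator of 7 missing values in 20 rows is an assumed example chosen to match a missing proportion of 0.35, and `z_and_two_sided_p` is a hypothetical helper, not part of the notebook.

```python
import math
import statistics

# Illustrative indicator: 1 = missing, 0 = present (7 of 20 rows missing, an assumed count)
indicator = [1] * 7 + [0] * 13
benchmark = 0.10

def z_and_two_sided_p(values, null_value, ddof):
    """One-sample z statistic and two-sided p-value for the mean of `values`."""
    n = len(values)
    mean = statistics.fmean(values)
    # pstdev divides by n (ddof=0); stdev divides by n - 1 (ddof=1)
    sd = statistics.pstdev(values) if ddof == 0 else statistics.stdev(values)
    z = (mean - null_value) / (sd / math.sqrt(n))
    # Two-sided normal tail: 2 * sf(|z|) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

z_ddof0, p_ddof0 = z_and_two_sided_p(indicator, benchmark, ddof=0)
z_ddof1, p_ddof1 = z_and_two_sided_p(indicator, benchmark, ddof=1)
print(f"ddof=0: z={z_ddof0:.4f}, p={p_ddof0:.4f}")
print(f"ddof=1: z={z_ddof1:.4f}, p={p_ddof1:.4f}")
```

Both variants lead to the same decision at the 5% level here; the choice of `ddof` matters most when the sample is small.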


### 2. Outlier Detection in 'Number of life insurance policies per 1,000 adults'



In [None]:
# Using the IQR method to detect outliers
life_insurance_data = india_data_cleaned['Number of life insurance policies per 1,000 adults'].dropna()

Q1 = life_insurance_data.quantile(0.25)
Q3 = life_insurance_data.quantile(0.75)
IQR = Q3 - Q1


In [None]:
# Define outlier range
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers
outliers = life_insurance_data[(life_insurance_data < lower_bound) | (life_insurance_data > upper_bound)]
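
To make the IQR rule concrete, here is a minimal sketch on made-up values (not taken from the FAS data). `iqr_outliers` is a hypothetical helper; it uses `statistics.quantiles` with `method="inclusive"`, which is intended to mirror the linear interpolation that `Series.quantile` applies by default.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    # method="inclusive" interpolates linearly between order statistics
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged

sample = [10, 12, 13, 14, 15, 16, 18, 95]  # made-up values, not from the FAS data
lower, upper, flagged = iqr_outliers(sample)
print(lower, upper, flagged)
```

Only the value 95 falls outside the fences in this sample. The multiplier `k=1.5` is the conventional choice; a larger value such as 3 flags only more extreme points.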


In [None]:
# Compare the mean of the flagged outliers with the overall mean (if any were flagged).
# Note: the denominator is the sample standard deviation, not a standard error.
if len(outliers) > 0:
    z_stat_outliers = (outliers.mean() - life_insurance_data.mean()) / life_insurance_data.std()
    p_value_outliers = stats.norm.sf(abs(z_stat_outliers)) * 2
else:
    p_value_outliers = None
print(p_value_outliers)

0.9965990095690176


In [None]:
# Cross-check with statsmodels: one-sample z-test of the outliers' mean against the overall mean
from statsmodels.stats.weightstats import ztest
z_score, p_value = ztest(outliers.values, value=life_insurance_data.mean())
print(f"P-value: {p_value}")

P-value: 0.9978742744046479


In [None]:

# Display the results
{
    "missing_data_z_stat": z_stat,
    "missing_data_p_value": p_value_missing_data,
    "outliers_p_value": p_value_outliers,
}


{'missing_data_z_stat': 2.3440361546924775,
 'missing_data_p_value': 0.019076322101778374,
 'outliers_p_value': 0.9965990095690176}

### Results Interpretation:

1. **Hypothesis Testing for Missing Data Proportion:**
   - **Z-Statistic:** 2.34
   - **P-Value:** 0.019 (two-sided)
   The p-value is below 0.05 and the observed missing proportion (35%) lies above the benchmark, so we reject H0: the missing data proportion exceeds the acceptable benchmark of 10%, indicating a potential data quality issue.

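2. **Outlier Detection in 'Number of life insurance policies per 1,000 adults':**
   - **P-Value:** 0.9966
   The high p-value means the mean of the flagged outliers is not significantly different from the overall mean, so this test gives no evidence that outliers distort the distribution and indicates no major data quality issue. Note that the test compares means only; the IQR rule is what flags individual extreme values.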
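
Because the missing-data hypothesis in item 1 is one-sided and concerns a proportion, a useful cross-check is an upper-tailed z-test whose standard error is evaluated at the benchmark, together with an exact binomial tail probability. The sketch below assumes 7 missing values in 20 rows (illustrative counts consistent with a missing proportion of 0.35); `one_sided_proportion_tests` is a hypothetical helper, not part of the notebook.

```python
import math

def one_sided_proportion_tests(n_missing, n_rows, p0):
    """Upper-tailed tests of H0: p <= p0 against H1: p > p0."""
    p_hat = n_missing / n_rows
    # z-test with the standard error evaluated at the null proportion p0
    se_null = math.sqrt(p0 * (1 - p0) / n_rows)
    z = (p_hat - p0) / se_null
    p_z = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    # Exact binomial tail: P(X >= n_missing) when X ~ Binomial(n_rows, p0)
    p_exact = sum(
        math.comb(n_rows, k) * p0**k * (1 - p0) ** (n_rows - k)
        for k in range(n_missing, n_rows + 1)
    )
    return z, p_z, p_exact

z, p_z, p_exact = one_sided_proportion_tests(n_missing=7, n_rows=20, p0=0.10)
print(f"z={z:.3f}, one-sided z p-value={p_z:.2e}, exact binomial p-value={p_exact:.4f}")
```

With these assumed counts the expected number of missing values under H0 is only 2, so the normal approximation is rough and the exact binomial tail is the safer figure to report. Both still fall well below 0.05.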
