# Data Quality Analysis Using Hypothesis Testing Techniques


### Objective:
Perform data quality checks and analysis using hypothesis testing techniques, with a focus on:
- Assessing the difference in data distribution.
- Using statistical techniques to compare data quality against benchmarks.


### 1. Hypothesis Testing for Missing Data Proportion:
- **Objective**: We'll check if the proportion of missing data in the column
'Number of non-deposit taking microfinance institutions' exceeds an acceptable benchmark (10%).
- **Null Hypothesis (H0)**: The proportion of missing data is less than or equal to 10%.
- **Alternative Hypothesis (H1)**: The proportion of missing data exceeds 10%.


### 2. Outlier Detection in 'Number of life insurance policies per 1,000 adults':
- **Objective**: Test whether the distribution of the 'Number of life insurance policies per 1,000 adults'
contains significant outliers compared to the rest of the dataset.
- **Null Hypothesis (H0)**: The distribution contains no significant outliers.
- **Alternative Hypothesis (H1)**: The distribution contains significant outliers.







In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from scipy import stats


In [2]:
# Load the cleaned India FAS data
india_data_cleaned = pd.read_csv('datasets-1/india_fas_cleaned.csv')

In [4]:
india_data_cleaned

Unnamed: 0,Economy,Year,ISO-3 code,Institutions of commercial banks,Number of credit unions and credit cooperatives,Number of other deposit takers,Number of non-deposit taking microfinance institutions,Number of insurance corporations,Branches of commercial banks,Branches of credit unions and credit cooperatives,...,Outstanding balances on active mobile money accounts (% of GDP),"Number of mobile money transactions (during the reference year) per 1,000 adults",Value of mobile money transactions (during the reference year) (% of GDP),Average no. of mobile money transactions per active mob mon account,"Number of credit cards per 1,000 adults","Number of debit cards per 1,000 adults","No of mobile and internet banking transac (during reference yr) per 1,000 adults",Value of mobile and internet banking transac (during reference yr) (% of GDP),"Number of life insurance policies per 1,000 adults","Number of Non-life insurance policies per 1,000 adults"
0,India,2004,IND,291.0,108848.0,604.0,,28.0,67428.0,23503.0,...,,,,,,,,,,55.203012
1,India,2005,IND,289.0,111797.0,507.0,,29.0,68628.0,22927.0,...,,,,,,,,,,64.486561
2,India,2006,IND,222.0,109354.0,428.0,,30.0,69382.0,23224.0,...,,,,,29.306822,95.032027,,,245.376039,64.765758
3,India,2007,IND,182.0,96155.0,401.0,,32.0,71793.0,23694.0,...,,,,,34.18674,127.131435,,,560.901062,57.956248
4,India,2008,IND,176.0,97840.0,364.0,,37.0,75799.0,23594.0,...,,,,,30.028991,167.087101,,,313.567643,69.665218
5,India,2009,IND,172.0,98475.0,336.0,,43.0,79765.0,23505.0,...,,,,,21.842193,216.825263,,,347.47568,79.952603
6,India,2010,IND,169.0,97441.0,308.0,,48.0,85143.0,23917.0,...,,,,,21.066517,266.078783,,,372.453072,78.828642
7,India,2011,IND,171.0,96177.0,297.0,41.0,48.0,90849.0,24730.0,...,,,,,20.203262,318.536185,,,376.807428,90.771595
8,India,2012,IND,173.0,95186.0,271.0,41.0,51.0,98558.0,24974.0,...,,,,,21.94401,371.671971,28.683664,0.018302,376.258754,96.173319
9,India,2013,IND,157.0,96230.0,254.0,46.0,51.0,106531.0,25414.0,...,0.000184,35.98414,0.00891,,21.519099,364.475132,58.660272,0.053323,369.884296,117.751284


### 1. Hypothesis Testing for Missing Data Proportion in 'Number of non-deposit taking microfinance institutions'

- **Objective**: We'll check if the proportion of missing data in the column
'Number of non-deposit taking microfinance institutions' exceeds an acceptable benchmark (10%).
- **Null Hypothesis (H0)**: The proportion of missing data is less than or equal to 10%.
- **Alternative Hypothesis (H1)**: The proportion of missing data exceeds 10%.

In [5]:
# Set benchmark for acceptable missing data proportion (10%)
benchmark_proportion = 0.10

# Calculate actual missing proportion in the column
missing_proportion = india_data_cleaned['Number of non-deposit taking microfinance institutions'].isna().mean()
print(missing_proportion)

0.35
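
Because missingness is a 0/1 outcome, the null hypothesis can also be checked with an exact binomial tail probability, which does not rely on a normal approximation. The sketch below is an illustration, not part of the original analysis. It assumes 7 missing values out of 20 rows (hypothetical counts, chosen only because they are consistent with the printed proportion of 0.35) and uses the standard library alone; the function name is illustrative.

```python
from math import comb


def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): exact one-sided p-value for H1: p > p0."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Assumed counts (not read from the dataset): 7 missing out of 20 rows, i.e. 0.35.
n_missing, n_rows = 7, 20
benchmark = 0.10

p_exact = binomial_upper_tail(n_missing, n_rows, benchmark)
print(f"Exact one-sided p-value: {p_exact:.4f}")
```

With a small sample and a 10% benchmark, the expected number of missing rows under H0 is only n × 0.10, so an exact test of this kind can serve as a cross-check on a z-test.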


In [6]:
from statsmodels.stats.weightstats import ztest
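# Perform a one-sample z-test on the 0/1 missing-value indicator.
# ztest defaults to a two-sided alternative and a sample standard deviation (ddof=1).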
missing_dist = india_data_cleaned['Number of non-deposit taking microfinance institutions'].isna().astype(int)
z_score, p_value = ztest(missing_dist.values, value=benchmark_proportion)
print(f"P-value: {p_value}")

P-value: 0.02233137323552469


In [9]:
z_score

2.2846838336584385
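
The call above estimates the standard error from the sample and reports a two-sided p-value (the `ztest` defaults). A common alternative for a single proportion uses the standard error implied by H0, sqrt(p0(1 - p0)/n), together with a one-sided p-value that matches H1. The following is a minimal sketch under the assumption of 7 missing values out of 20 rows (hypothetical counts consistent with the printed proportion of 0.35); the function name is illustrative and not part of the original notebook.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ztest_one_sided(n_missing, n_total, p0):
    """One-sided z-test of H0: p <= p0 against H1: p > p0.

    The standard error is computed under H0, i.e. sqrt(p0 * (1 - p0) / n).
    """
    p_hat = n_missing / n_total
    se_null = sqrt(p0 * (1 - p0) / n_total)
    z = (p_hat - p0) / se_null
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value


# Assumed counts (not read from the dataset): 7 missing out of 20 rows.
z_null, p_one_sided = proportion_ztest_one_sided(7, 20, 0.10)
print(f"z = {z_null:.4f}, one-sided p-value = {p_one_sided:.6f}")
```

Because the null standard error is smaller than the sample-based one whenever the observed proportion is closer to 0.5 than the benchmark, this variant tends to give a larger z-statistic for the same data.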

In [7]:
# Perform a one-sample z-test for the missing-data proportion by hand
n = len(missing_dist)
p_hat = missing_proportion  # sample proportion (mean of the 0/1 indicator)
p_0 = benchmark_proportion  # hypothesized population proportion
s_std = np.std(missing_dist.values)  # np.std defaults to ddof=0 (population formula)
z_stat = (p_hat - p_0) / (s_std / np.sqrt(n))

In [8]:
# Get the two-sided p-value from the standard normal distribution
p_value_missing_data = stats.norm.sf(abs(z_stat)) * 2
print(p_value_missing_data)

0.019076322101778374
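
The hand calculation (p ≈ 0.0191) and the `ztest` call (p ≈ 0.0223) differ only in the standard deviation they use: `np.std` defaults to `ddof=0`, while `ztest` defaults to `ddof=1`. The sketch below reproduces both z-statistics with the standard library, assuming a 0/1 indicator with 7 ones among 20 values (hypothetical counts consistent with the printed proportion of 0.35).

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Assumed indicator (not read from the dataset): 1 = missing, 0 = present.
indicator = [1] * 7 + [0] * 13
p0 = 0.10

n = len(indicator)
p_hat = mean(indicator)

# Population formula (divide by n), matching np.std's default ddof=0.
z_ddof0 = (p_hat - p0) / (pstdev(indicator) / sqrt(n))
# Sample formula (divide by n - 1), matching ztest's default ddof=1.
z_ddof1 = (p_hat - p0) / (stdev(indicator) / sqrt(n))

print(f"ddof=0: z = {z_ddof0:.4f}")
print(f"ddof=1: z = {z_ddof1:.4f}")
```

The gap between the two shrinks as n grows, since the ratio of the two standard deviations is sqrt(n / (n - 1)).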


### 2. Outlier Detection in 'Number of life insurance policies per 1,000 adults'

- **Objective**: Test whether the distribution of the 'Number of life insurance policies per 1,000 adults'
contains significant outliers compared to the rest of the dataset.
- **Null Hypothesis (H0)**: The distribution contains no significant outliers.
- **Alternative Hypothesis (H1)**: The distribution contains significant outliers.


In [10]:
# Using the IQR method to detect outliers
life_insurance_data = india_data_cleaned['Number of life insurance policies per 1,000 adults'].dropna()

Q1 = life_insurance_data.quantile(0.25)
Q3 = life_insurance_data.quantile(0.75)
IQR = Q3 - Q1


In [11]:
# Define outlier range
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers
outliers = life_insurance_data[(life_insurance_data < lower_bound) | (life_insurance_data > upper_bound)]


In [12]:
outliers

2     245.376039
3     560.901062
19    224.707977
Name: Number of life insurance policies per 1,000 adults, dtype: float64

In [13]:
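# Compare the mean of the flagged outliers with the overall mean (if any were found).
# Note: the difference is scaled by the overall standard deviation, not a standard error.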
if len(outliers) > 0:
    z_stat_outliers = (outliers.mean() - life_insurance_data.mean()) / life_insurance_data.std()
    p_value_outliers = stats.norm.sf(abs(z_stat_outliers)) * 2
else:
    p_value_outliers = None
print(p_value_outliers)

0.9965990095690176


In [14]:
from statsmodels.stats.weightstats import ztest
z_score, p_value = ztest(outliers.values, value=life_insurance_data.mean())
print(f"P-value: {p_value}")

P-value: 0.9978742744046479
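
Both tests above compare the *mean* of the flagged values with the overall mean, so low and high outliers can offset each other. To assess points individually, one option is a robust per-point score such as the modified z-score, 0.6745 × (x − median) / MAD, with the commonly used cutoff of 3.5. The sketch below is illustrative and not part of the original analysis. It uses only the eight life insurance values visible in the table preview above (rows 2 to 9), not the full column, and the function name is hypothetical.

```python
from statistics import median


def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value.

    MAD is the median absolute deviation from the median; this sketch
    assumes MAD > 0 (it would divide by zero otherwise).
    """
    med = median(values)
    mad = median([abs(x - med) for x in values])
    return [0.6745 * (x - med) / mad for x in values]


# Values copied from preview rows 2-9; the remaining rows are not shown there.
sample = [245.376039, 560.901062, 313.567643, 347.47568,
          372.453072, 376.807428, 376.258754, 369.884296]

scores = modified_z_scores(sample)
flagged = [x for x, s in zip(sample, scores) if abs(s) > 3.5]
print(flagged)
```

Because the median and MAD are barely affected by a few extreme values, this score avoids the masking that can occur when the mean and standard deviation are themselves inflated by the outliers.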


In [15]:

# Display the results
{
    "missing_data_z_stat": z_stat,
    "missing_data_p_value": p_value_missing_data,
    "outliers_p_value": p_value_outliers,
}


{'missing_data_z_stat': 2.3440361546924775,
 'missing_data_p_value': 0.019076322101778374,
 'outliers_p_value': 0.9965990095690176}

### Results Interpretation:

1. **Hypothesis Testing for Missing Data Proportion:**
   - **Z-Statistic:** 2.34
   - **P-Value:** 0.019 (two-sided)

   The p-value is below 0.05, so H0 is rejected: the observed missing data proportion (35%) exceeds the acceptable benchmark of 10%, indicating a potential data quality issue in this column.

2. **Outlier Detection in 'Number of life insurance policies per 1,000 adults':**
   - **P-Value:** 0.9966
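
   The high p-value indicates that the mean of the flagged values does not differ significantly from the overall mean, so this test points to no major data quality issue. Note that the three flagged values fall on both sides of the distribution (two low, one high), so the result describes their average rather than each point individually.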
