In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

## Data Loading and Preparation

In this initial step, we load and prepare the datasets for hypothesis testing. The training and testing datasets are read from CSV files, and the features and labels are separated.

1. **Loading Data**: The training and testing datasets are loaded from the specified file paths.
2. **Feature and Label Separation**: The `label` column, which represents the target variable (plain or no plain language), is separated from the feature set for both training and testing datasets.
3. **Data Type Conversion**: Every feature column, that is, every column except `label` and `file`, is cast to `float` so that the statistical tests receive numeric input.

In [10]:
# Load the training and testing datasets
data_train = pd.read_csv('data_clean/data_train.csv')
data_test = pd.read_csv('data_clean/data_test.csv')

# Split the training and test data into features (X) and labels (y)
X_train, y_train = data_train.drop('label', axis=1), data_train['label']
X_test, y_test = data_test.drop('label', axis=1), data_test['label']

# Cast every feature column to float (the label was dropped above; `file` stays a string)
feature_cols = [col for col in X_train.columns if col != 'file']
X_train[feature_cols] = X_train[feature_cols].astype(float)
X_test[feature_cols] = X_test[feature_cols].astype(float)

## Ensuring Data Independence

The datasets contain augmented entries, which can be recognised by the term 'section' in the `file` column. Because augmented entries are derived from other documents, keeping them would violate the independence assumption of our hypothesis tests. We therefore remove every entry whose `file` value contains 'section' from both the training and testing data, then combine the two cleaned datasets for further analysis.

In [11]:
# Filter out entries in the training data where the 'file' column contains 'section'
data_train = data_train[~data_train['file'].str.contains('section')]
# Filter out entries in the testing data where the 'file' column contains 'section'
data_test = data_test[~data_test['file'].str.contains('section')]

# Combine the cleaned training and testing datasets
data = pd.concat([data_train, data_test])
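
The filter above can be checked on a small toy frame. This is a minimal sketch: the file names are hypothetical stand-ins, and `regex=False` is an optional argument (not used in the cell above) that makes pandas treat 'section' as a literal substring rather than a regular expression.

```python
import pandas as pd

# Toy frame; the file names are hypothetical stand-ins for the real ones
toy = pd.DataFrame({
    'file': ['doc_a.txt', 'doc_a_section_1.txt', 'doc_b.txt', 'doc_b_section_2.txt'],
    'label': [1, 1, 0, 0],
})

# True where the file name contains the literal substring 'section'
is_augmented = toy['file'].str.contains('section', regex=False)

# Keep only the rows that are not flagged as augmented
originals = toy[~is_augmented].reset_index(drop=True)
print(originals)
```

Only `doc_a.txt` and `doc_b.txt` remain after the filter.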

To account for multiple comparisons in our hypothesis tests, we adjust the significance level using the Bonferroni correction. This reduces the risk of Type I errors (false positives) when testing many hypotheses at once. The adjusted significance level is the desired alpha level (0.05) divided by the number of features. `X_train` no longer contains the `label` column, so subtracting 1 from its column count excludes the `file` column.

In [12]:
# X_train has no label column, so subtracting 1 excludes the file column
num_features = X_train.shape[1] - 1

# Calculate the adjusted significance level using the Bonferroni correction
a = 0.05 / num_features

# Print the adjusted significance level and the number of features
print('Adjusted significance level (alpha):', a)
print('Number of features (excluding file column):', num_features)

Adjusted significance level (alpha): 0.00078125
Number of features (excluding file column): 64
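
The correction above can be sketched as a small helper. The function name `bonferroni_alpha` and the p-values in the dictionary are hypothetical and serve only to illustrate how the adjusted threshold is applied.

```python
def bonferroni_alpha(alpha, n_tests):
    """Return the per-test significance level for n_tests comparisons."""
    if n_tests < 1:
        raise ValueError('n_tests must be at least 1')
    return alpha / n_tests

# 0.05 spread over 64 tests, as in the cell above
adjusted = bonferroni_alpha(0.05, 64)
print(adjusted)

# Hypothetical p-values, for illustration only
p_values = {'feature_a': 0.0001, 'feature_b': 0.002, 'feature_c': 0.04}

# A feature counts as significant only if its p-value is below the adjusted level
significant = [name for name, p in p_values.items() if p < adjusted]
print(significant)
```

With 64 tests, `feature_b` and `feature_c` would pass an unadjusted 0.05 threshold but not the adjusted one.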


## Kolmogorov-Smirnov Test for Feature Differences

We use the two-sample Kolmogorov-Smirnov (KS) test to compare the distribution of each feature between the two groups: plain language (label = 1) and no plain language (label = 0). The null hypothesis is that both groups are drawn from the same distribution, so a small p-value indicates that the feature is distributed differently in the two groups.

For each feature, we calculate the KS statistic and the corresponding p-value. We append the results to a file and print the features whose p-values are greater than the adjusted significance level, that is, the features for which we cannot reject the null hypothesis. Finally, we store the results in a DataFrame for further analysis.

In [13]:
# Initialize a DataFrame to store the KS test results
ks_df = pd.DataFrame(columns=['feature', 'p_value'])

# Split the combined data into the two groups once
no_plain = data[data['label'] == 0]
plain = data[data['label'] == 1]

# Perform the KS test for each feature (columns[1:] skips the first column, assumed to be `file`)
with open('tests_results/ks.txt', 'a') as f:
    for i, col in enumerate(X_train.columns[1:]):
        # Calculate the KS statistic and p-value
        statistic, p_value = stats.ks_2samp(no_plain[col], plain[col], alternative='two-sided')

        # Append the p-value to the results file
        f.write(f'{col}: {p_value}\n')

        # Print features that are not significant at the adjusted level
        if p_value > a:
            print(col + ': ', p_value)

        # Store the results in the DataFrame
        ks_df.loc[i] = [col, p_value]

passive_voice:  0.002218541515625423
money:  0.0015298603986525772
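
As an alternative to growing a DataFrame row by row, the per-feature test can be wrapped in a function that collects the results in a list and builds the DataFrame once. This is only a sketch on synthetic data: the helper `ks_by_feature` and the column names `shifted` and `same` are hypothetical, not part of the analysis above.

```python
import numpy as np
import pandas as pd
from scipy import stats

def ks_by_feature(df, features, label_col='label'):
    """Run a two-sided two-sample KS test per feature; one result row per feature."""
    group0 = df[df[label_col] == 0]
    group1 = df[df[label_col] == 1]
    rows = []
    for col in features:
        result = stats.ks_2samp(group0[col], group1[col], alternative='two-sided')
        rows.append({'feature': col,
                     'statistic': result.statistic,
                     'p_value': result.pvalue})
    return pd.DataFrame(rows)

# Synthetic data: 'shifted' differs between the groups, 'same' does not
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'label': np.repeat([0, 1], n),
    'shifted': np.concatenate([rng.normal(0, 1, n), rng.normal(1, 1, n)]),
    'same': rng.normal(0, 1, 2 * n),
})

ks_toy = ks_by_feature(toy, ['shifted', 'same'])
print(ks_toy)
```

Keeping the KS statistic next to the p-value makes it possible to rank features by how far apart the two distributions are, not only by whether they pass the threshold.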


## Mann-Whitney U Test

In addition to the Kolmogorov-Smirnov test, we use the Mann-Whitney U test to further analyze the differences between the plain language (label = 1) and no plain language (label = 0) groups. The Mann-Whitney U test is a non-parametric test for two independent samples that assesses whether values in one group tend to be larger than values in the other.

For each feature, we calculate the Mann-Whitney U statistic and the corresponding p-value. We append the results to a file and print the features whose p-values are greater than the adjusted significance level, that is, the features that are not significant under this test. The results are also stored in a DataFrame for further analysis.


In [14]:
# Initialize a DataFrame to store the Mann-Whitney U test results
mannwhitneyu_df = pd.DataFrame(columns=['feature', 'p_value'])

# Split the combined data into the two groups once
no_plain = data[data['label'] == 0]
plain = data[data['label'] == 1]

# Perform the Mann-Whitney U test for each feature (columns[1:] skips the first column, assumed to be `file`)
with open('tests_results/mannwhitneyu.txt', 'a') as f:
    for i, col in enumerate(X_train.columns[1:]):
        # Calculate the Mann-Whitney U statistic and p-value
        statistic, p_value = stats.mannwhitneyu(no_plain[col], plain[col], alternative='two-sided')

        # Append the p-value to the results file
        f.write(f'{col}: {p_value}\n')

        # Print features that are not significant at the adjusted level
        if p_value > a:
            print(col + ': ', p_value)

        # Store the results in the DataFrame
        mannwhitneyu_df.loc[i] = [col, p_value]

SMOGIndex:  0.02647392849429105
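
The two tests flag different features (`passive_voice` and `money` for KS, `SMOGIndex` for Mann-Whitney U). One possible explanation, offered here as an assumption rather than a finding, is that the tests are sensitive to different kinds of difference: KS reacts to any change in the shape of the distribution, while Mann-Whitney U mainly reacts to one group tending to have larger values. The sketch below illustrates this on synthetic samples that share a centre but differ in spread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples with the same centre but different spread
narrow = rng.normal(loc=0.0, scale=1.0, size=2000)
wide = rng.normal(loc=0.0, scale=3.0, size=2000)

# KS compares the full empirical distributions
ks = stats.ks_2samp(narrow, wide, alternative='two-sided')

# Mann-Whitney U compares ranks, so it mostly detects shifts in location
mwu = stats.mannwhitneyu(narrow, wide, alternative='two-sided')

print('KS p-value:', ks.pvalue)
print('Mann-Whitney U p-value:', mwu.pvalue)
```

On data like this, the KS p-value is far smaller than the Mann-Whitney U p-value, because only the spread differs between the samples.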


## Wilcoxon Signed-Rank Test for Paired Samples

To further investigate the differences between plain language (label = 1) and no plain language (label = 0) groups, we employ the Wilcoxon signed-rank test. This non-parametric test is designed for paired samples: it tests whether the differences between paired observations are distributed symmetrically around zero. Our two groups are not paired by nature, so we form pairs by randomly sampling from the no plain language group.

Because a single random pairing could be unrepresentative, we repeat the sampling many times. For each feature, we:

1. **Sample Generation**: Generate random samples from the no plain language group to match the size of the plain language group.
2. **Wilcoxon Test Execution**: Perform the Wilcoxon test for the paired samples.
3. **Result Averaging**: Average the p-values over multiple iterations to obtain a stable estimate.
4. **Result Storage and Display**: Save the results to a file and print features with p-values greater than the adjusted significance level.

This iterative sampling approach helps to account for variability in the data and provides a comprehensive analysis of feature differences.
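
The four steps above can be sketched as a single function. This is a minimal illustration on synthetic data, not the exact procedure used in this notebook: the helper name `resampled_wilcoxon`, the seeds, and the group sizes are assumptions.

```python
import numpy as np
from scipy import stats

def resampled_wilcoxon(reference, other, num_samples=100, seed=0):
    """Average Wilcoxon p-values over random resamples of `other`.

    Each resample has the same length as `reference`, because the
    signed-rank test needs two samples of equal length.
    """
    rng = np.random.default_rng(seed)
    reference = np.asarray(reference, dtype=float)
    other = np.asarray(other, dtype=float)
    p_values = []
    for _ in range(num_samples):
        # Draw with replacement so the sample matches the reference size
        sample = rng.choice(other, size=len(reference), replace=True)
        result = stats.wilcoxon(sample, reference, alternative='two-sided')
        p_values.append(result.pvalue)
    return float(np.mean(p_values))

# Synthetic groups of different sizes with a shift between them
rng = np.random.default_rng(1)
plain = rng.normal(0.0, 1.0, 200)       # stands in for label = 1
not_plain = rng.normal(1.0, 1.0, 350)   # stands in for label = 0

mean_p = resampled_wilcoxon(plain, not_plain)
print(mean_p)
```

Passing a seed makes the resampling reproducible, which the global `np.random` state does not guarantee across runs.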

In [15]:
# Number of random samples drawn per feature
num_samples = 1000

# Perform the Wilcoxon signed-rank test for each feature
for col in X_train.columns[1:]:
    results = []
    no_plain_values = data[data['label'] == 0][col]
    plain_values = data[data['label'] == 1][col]

    # wilcoxon() needs two samples of equal length, so match the plain language group
    sample_size = len(plain_values)
    for j in range(num_samples):
        # Generate a random sample from the no plain language group
        random_sample1 = np.random.choice(no_plain_values, size=sample_size, replace=True)

        # Perform the Wilcoxon test on the randomly paired samples
        statistic, p_value = stats.wilcoxon(random_sample1, plain_values, alternative='two-sided')
        results.append(p_value)
        print(f"{col}: {p_value} at sample {j}")

    # Average the p-values over all iterations
    average_p_value = np.mean(results)

    # Append the average p-value to a text file
    with open('tests_results/wilcoxon.txt', 'a') as f:
        f.write(f'{col}: {average_p_value}\n')

    # Print features whose average p-value is not significant at the adjusted level
    if average_p_value > a:
        print(col + ': ', average_p_value)

Kincaid: 1.1089203751132156e-118 at sample 0
Kincaid: 2.7920216690559565e-118 at sample 1
Kincaid: 1.542508970324655e-117 at sample 2
Kincaid: 1.9217934792147002e-112 at sample 3
Kincaid: 1.355434075091787e-113 at sample 4
Kincaid: 7.035477405180306e-112 at sample 5
Kincaid: 3.883434575271734e-115 at sample 6
Kincaid: 2.4517470139463136e-113 at sample 7
Kincaid: 3.103468963789575e-118 at sample 8
Kincaid: 1.8638884884518648e-111 at sample 9
Kincaid: 9.190098140413486e-102 at sample 10
Kincaid: 1.7992029559736838e-112 at sample 11
Kincaid: 9.216215837864819e-114 at sample 12
Kincaid: 3.734225951177189e-119 at sample 13
Kincaid: 5.6229936219174016e-118 at sample 14
Kincaid: 1.5156689962107613e-122 at sample 15
Kincaid: 4.031528887387566e-124 at sample 16
Kincaid: 3.471158378736752e-114 at sample 17
Kincaid: 4.563761918359909e-105 at sample 18
Kincaid: 3.0204719296823082e-117 at sample 19
Kincaid: 3.8483720241187565e-118 at sample 20
Kincaid: 1.405671402899109e-114 at sample 21
Kincaid: 4

### Re-evaluating SMOG Index with Wilcoxon Signed-Rank Test

The SMOG Index is a readability measure. In the Mann-Whitney U test above it was the only feature whose p-value (about 0.026) stayed above the adjusted significance level, which suggests it might not differ significantly between the plain language and no plain language groups. To check this result, we rerun the Wilcoxon signed-rank test on the SMOG Index alone.

As before, we draw multiple random samples from the no plain language group, run the Wilcoxon test on each, and average the p-values to obtain a more stable estimate than a single sample would give. This tells us whether the SMOG Index should be treated as uninformative or whether it does help to differentiate the two groups.

A feature that appears insignificant under one test can still show a clear difference under another, so it is worth checking such features with more than one method.
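
When p-values span many orders of magnitude, their arithmetic mean is dominated by the largest ones, so it can help to look at a few more summaries. The sketch below is one possible way to do that; the helper `summarise_p_values` is hypothetical, and the input is the first five SMOGIndex p-values printed by the cell below.

```python
import numpy as np

def summarise_p_values(p_values, alpha):
    """Summarise resampled p-values beyond their plain average."""
    p = np.asarray(p_values, dtype=float)
    return {
        'mean': float(p.mean()),
        'median': float(np.median(p)),
        'max': float(p.max()),
        # Share of resamples that are significant at the given level
        'share_below_alpha': float((p < alpha).mean()),
    }

# First five SMOGIndex p-values from the output of the cell below
smog_p = [
    4.1081290962217335e-10,
    1.0948558254076112e-17,
    2.7162773253278663e-12,
    2.9821456949871285e-12,
    1.5950438998599587e-15,
]

summary = summarise_p_values(smog_p, alpha=0.05 / 64)
print(summary)
```

For these five values every resample falls below the adjusted significance level, and the mean is much larger than the median because of the single largest p-value.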


In [16]:
results = []

no_plain_smog = data[data['label'] == 0]['SMOGIndex']
plain_smog = data[data['label'] == 1]['SMOGIndex']

# wilcoxon() needs two samples of equal length, so match the plain language group
sample_size = len(plain_smog)

# Perform the Wilcoxon signed-rank test for multiple samples
for j in range(num_samples):
    # Generate a random sample from the no plain language group
    random_sample1 = np.random.choice(no_plain_smog, size=sample_size, replace=True)

    # Perform the Wilcoxon test on the randomly paired samples
    statistic, p_value = stats.wilcoxon(random_sample1, plain_smog, alternative='two-sided')
    results.append(p_value)
    print(f"SMOGIndex: {p_value} at sample {j}")

# Calculate the average p-value over all iterations
average_p_value = np.mean(results)

# Print the average p-value
print('Average p-value for SMOGIndex:', average_p_value)

SMOGIndex: 4.1081290962217335e-10 at sample 0
SMOGIndex: 1.0948558254076112e-17 at sample 1
SMOGIndex: 2.7162773253278663e-12 at sample 2
SMOGIndex: 2.9821456949871285e-12 at sample 3
SMOGIndex: 1.5950438998599587e-15 at sample 4
SMOGIndex: 1.579705687357294e-13 at sample 5
SMOGIndex: 1.6072201755615835e-11 at sample 6
SMOGIndex: 7.591243820198843e-12 at sample 7
SMOGIndex: 2.793320924324168e-13 at sample 8
SMOGIndex: 2.8816632947609676e-13 at sample 9
SMOGIndex: 2.627706883905058e-12 at sample 10
SMOGIndex: 1.0610656064384753e-11 at sample 11
SMOGIndex: 1.4182900396601227e-09 at sample 12
SMOGIndex: 7.432797557653439e-13 at sample 13
SMOGIndex: 4.33981025203395e-11 at sample 14
SMOGIndex: 6.976742457000181e-13 at sample 15
SMOGIndex: 1.5744374763269262e-13 at sample 16
SMOGIndex: 3.679246713470977e-12 at sample 17
SMOGIndex: 6.091736287872386e-17 at sample 18
SMOGIndex: 9.810268971433516e-11 at sample 19
SMOGIndex: 4.698256238109275e-12 at sample 20
SMOGIndex: 5.926454848925305e-12 at