# Can we beat Welch's t-test?

The current version of this script is super-stupid. But to do any interesting research, you have to plant
a few small acorns, and this is the smallest of acorns.

Suppose you have to evaluate NUMBER_OF_EXPERIMENTS very small experiments. By a weird
coincidence, they all have the same number of participants: CONTROL_GROUP_SIZE patients
getting a placebo, and EXPERIMENT_GROUP_SIZE getting a drug.

Unfortunately, approximately PROPORTION_OF_EXPERIMENTS_THAT_DO_NOTHING of those drugs have no effect at all.

You want to create a classifier to determine whether or not the experimental drug worked.

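Also, amazingly, the control group and the experimental group metrics are normally distributed. Obviously, in the
experiments where the drug is useless, both groups have the same mean and standard deviation. In the experiments where
the drug is useful, the experimental group gets its own independently drawn mean and standard deviation. The means are uniformly distributed between 0 and 10. The standard deviations are uniformly distributed between 1 and 5 (the code draws them with loc=1, scale=4, and scipy's uniform distribution covers the interval from loc to loc + scale).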

Being a well-trained statistician with years of experience, you use Welch's t-test.

Your offsider is a clueless data scientist whose sole training consisted of reading the sklearn documentation.

You have a future self who will implement a beautifully trained transformer or fully-connected neural network. (I haven't done that in this code yet. One day soon.)

Who wins?

In [1]:
CONTROL_GROUP_SIZE=5
EXPERIMENT_GROUP_SIZE=6
PROPORTION_OF_EXPERIMENTS_THAT_DO_NOTHING=0.66
NUMBER_OF_EXPERIMENTS=10000

In [2]:
import scipy
import scipy.stats
import collections
import pandas
import sklearn.model_selection
import sklearn.svm
import sklearn.dummy
import sklearn.metrics
import sklearn.tree
import sklearn.ensemble
import sklearn.neighbors

In [3]:
Experiment = collections.namedtuple("Experiment",
                                    ["ControlLoc", "ControlScale", "ControlValues",
                                     "ExperimentLoc", "ExperimentScale", "ExperimentValues",
                                     "ShouldShowResult"])

In [4]:
def generate_experiment():
    null_experiment = scipy.stats.uniform.rvs() <= PROPORTION_OF_EXPERIMENTS_THAT_DO_NOTHING
    control_loc = scipy.stats.uniform.rvs(loc=0, scale=10)
    control_scale = scipy.stats.uniform.rvs(loc=1, scale=4)
    if null_experiment:
        experiment_loc = control_loc
        experiment_scale = control_scale
    else:
        experiment_loc = scipy.stats.uniform.rvs(loc=0, scale=10)
        experiment_scale = scipy.stats.uniform.rvs(loc=1, scale=4)
    control_values = scipy.stats.norm.rvs(loc=control_loc, scale=control_scale, size=CONTROL_GROUP_SIZE)
    experiment_values = scipy.stats.norm.rvs(loc=experiment_loc, scale=experiment_scale, size=EXPERIMENT_GROUP_SIZE)
    return Experiment(ControlLoc=control_loc,
                      ControlScale=control_scale,
                      ControlValues=control_values,
                      ExperimentLoc=experiment_loc,
                      ExperimentScale=experiment_scale,
                      ExperimentValues=experiment_values,
                      ShouldShowResult=not null_experiment)

def generate_experiments(number_of_experiments):
    return [generate_experiment() for _ in range(number_of_experiments)]
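
The generator above draws fresh random numbers on every run, so the figures later in this notebook will shift slightly each time. A minimal sketch of a reproducible variant follows. It uses only the standard library rather than scipy, the name generate_experiment_seeded is hypothetical, and it returns a plain dict instead of the Experiment namedtuple. Treat it as an illustration of the sampling scheme, not as a drop-in replacement.

```python
import random

CONTROL_GROUP_SIZE = 5
EXPERIMENT_GROUP_SIZE = 6
PROPORTION_OF_EXPERIMENTS_THAT_DO_NOTHING = 0.66

def generate_experiment_seeded(rng):
    """Stdlib-only sketch of generate_experiment(), driven by an explicit random.Random."""
    null_experiment = rng.random() <= PROPORTION_OF_EXPERIMENTS_THAT_DO_NOTHING
    control_loc = rng.uniform(0, 10)
    # scipy.stats.uniform(loc=1, scale=4) covers [1, 5], hence the upper bound of 5.
    control_scale = rng.uniform(1, 5)
    if null_experiment:
        # A useless drug: both groups share one distribution.
        experiment_loc, experiment_scale = control_loc, control_scale
    else:
        experiment_loc = rng.uniform(0, 10)
        experiment_scale = rng.uniform(1, 5)
    return {
        "ControlLoc": control_loc,
        "ControlScale": control_scale,
        "ControlValues": [rng.gauss(control_loc, control_scale)
                          for _ in range(CONTROL_GROUP_SIZE)],
        "ExperimentLoc": experiment_loc,
        "ExperimentScale": experiment_scale,
        "ExperimentValues": [rng.gauss(experiment_loc, experiment_scale)
                             for _ in range(EXPERIMENT_GROUP_SIZE)],
        "ShouldShowResult": not null_experiment,
    }

# The same seed always yields the same experiment.
first = generate_experiment_seeded(random.Random(42))
second = generate_experiment_seeded(random.Random(42))
print(first == second)
```

Passing the random.Random instance explicitly keeps the seed in one place and avoids touching the global random state.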

In [5]:
def create_feature_dataframe(experiments):
    records = []
    for experiment in experiments:
        record = {}
        for i in range(CONTROL_GROUP_SIZE):
            record[f"c{i}"] = experiment.ControlValues[i]
        for i in range(EXPERIMENT_GROUP_SIZE):
            record[f"x{i}"] = experiment.ExperimentValues[i]
        records.append(record)
    return pandas.DataFrame.from_records(records)

def create_target_series(experiments):
    return pandas.Series([x.ShouldShowResult for x in experiments])

In [6]:
experiments = generate_experiments(NUMBER_OF_EXPERIMENTS)
create_feature_dataframe(experiments)

Unnamed: 0,c0,c1,c2,c3,c4,x0,x1,x2,x3,x4,x5
0,-0.206126,8.522555,10.064102,-0.853037,8.452155,-4.433929,4.602424,8.761225,2.613004,1.967372,0.990905
1,0.501765,11.001093,7.251494,14.495405,10.152011,7.130783,10.953517,10.992327,17.070594,4.902604,4.919637
2,1.751106,2.741874,5.272238,4.575064,2.102541,5.414399,4.340850,4.936145,4.348434,2.791548,3.419493
3,9.060280,8.686143,9.621933,9.113837,7.500838,9.598649,8.075901,7.975168,10.110625,6.638207,7.513025
4,5.583946,0.336508,2.344874,5.757725,-1.038442,9.173740,4.693067,2.313086,2.273397,3.966556,3.555573
...,...,...,...,...,...,...,...,...,...,...,...
9995,7.507727,7.504997,7.683371,7.330439,7.131142,-2.130101,-0.003836,8.515860,3.426279,0.930638,2.000744
9996,-1.288976,2.701757,-1.093941,-1.493802,1.597263,7.900321,9.869044,0.611283,7.947053,6.435785,3.343748
9997,15.739962,9.606768,5.739813,12.978162,15.704103,10.242445,3.246099,11.414767,4.416069,6.439252,1.910737
9998,1.624305,6.375156,4.740752,-4.512823,6.580758,18.655564,1.270996,0.176932,7.189917,-0.601702,-1.252281


In [7]:
create_target_series(experiments).mean()

0.3307

In [8]:
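class WelchTTest:
    """Scikit-learn-style wrapper around Welch's t-test; there is nothing to train."""
    def fit(self, X, y, weights=None):
        # None instead of a mutable default list; the test has no parameters to fit.
        pass
    def predict(self, Xs):
        answer = []
        for experiment in Xs:
            # equal_var=False selects Welch's unequal-variance test.
            outcome = scipy.stats.ttest_ind(experiment.ControlValues, experiment.ExperimentValues, equal_var=False)
            # Call the drug effective when the two-sided p-value is below 0.05.
            answer.append(bool(outcome.pvalue < 0.05))
        return answer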
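
For readers who want to see what the equal_var=False call computes, here is a minimal sketch of the Welch statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The function name welch_statistic and the toy samples are made up for illustration. The p-value step is left out because it needs the t-distribution CDF, which scipy supplies.

```python
import math
import statistics

def welch_statistic(control, experiment):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test."""
    n1, n2 = len(control), len(experiment)
    # Squared standard errors of each group mean (sample variance uses n - 1).
    v1 = statistics.variance(control) / n1
    v2 = statistics.variance(experiment) / n2
    # Difference of means divided by its estimated standard error.
    t = (statistics.mean(control) - statistics.mean(experiment)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Toy samples with the same group sizes as the notebook (5 and 6).
control = [1.0, 2.0, 3.0, 4.0, 5.0]
experiment = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
t, df = welch_statistic(control, experiment)
print(f"t = {t:.3f}, df = {df:.2f}")
```

With groups this small the degrees of freedom are low, so the t-distribution has heavy tails and the test needs a large difference in means before the p-value drops below 0.05.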

In [9]:
welch = WelchTTest()
welch_answers = welch.predict(experiments)
print(sklearn.metrics.confusion_matrix(create_target_series(experiments), welch_answers))

[[6394  299]
 [2048 1259]]


In [10]:
sklearn.metrics.confusion_matrix([False, False, False, False, True, True, True, True],
                                [True, False, False, False, True, True, True, True])[0][1]

1

In [11]:
def type_1_error_ratio(y_true, y_pred):
    return sklearn.metrics.confusion_matrix(y_true, y_pred)[0][1] / len(y_true)

type_1_error_score = sklearn.metrics.make_scorer(type_1_error_ratio, greater_is_better=False)

## The target to beat...

For accuracy.

In [12]:
welch_accuracy = sklearn.metrics.accuracy_score(create_target_series(experiments), welch_answers)
welch_accuracy

0.7653

But you have to keep the type 1 error ratio (false positives as a fraction of all experiments) below this...

In [13]:
type_1_error_ratio(create_target_series(experiments), welch_answers)

0.0299
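
Note that type_1_error_ratio divides by the total number of experiments, not by the number of experiments where the drug does nothing. The sketch below recomputes both quantities from the Welch confusion matrix printed earlier, so the only inputs are the four counts from that output. The variable names are my own.

```python
# Welch confusion matrix copied from the earlier output:
# rows are the truth (False, True), columns are the predictions (False, True).
true_negatives, false_positives = 6394, 299
false_negatives, true_positives = 2048, 1259
total = true_negatives + false_positives + false_negatives + true_positives

# What type_1_error_ratio() returns: false positives over all experiments.
ratio_over_all = false_positives / total
# Conventional false positive rate: false positives over the truly null experiments.
false_positive_rate = false_positives / (false_positives + true_negatives)
# Accuracy: correct calls over all experiments.
accuracy = (true_negatives + true_positives) / total

print(f"ratio over all experiments: {ratio_over_all:.4f}")
print(f"false positive rate:        {false_positive_rate:.4f}")
print(f"accuracy:                   {accuracy:.4f}")
```

The conventional rate comes out a little under 0.045, which is consistent with a test run at the 0.05 level. The ratio over all experiments is smaller because roughly a third of the experiments can never produce a false positive.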

## Various stupid ways to beat it

Dummy gives us a baseline to be sure that the real models aren't just predicting the most common case.

In [14]:
dummy_scores = sklearn.model_selection.cross_validate(sklearn.dummy.DummyClassifier(strategy='most_frequent'), 
                                       create_feature_dataframe(experiments),
                                       create_target_series(experiments),
                                       scoring={'accuracy': 'accuracy', 
                                                'type_1_error_ratio': type_1_error_score}
                                      )
print("Mean accuracy:", dummy_scores['test_accuracy'].mean())
print("Mean type 1 error:", abs(dummy_scores['test_type_1_error_ratio'].mean()))

Mean accuracy: 0.6693
Mean type 1 error: 0.0
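
These numbers follow directly from the class balance: about 33% of the experiments have a real effect, so always answering "no effect" is right about 67% of the time and never produces a false positive. A small sketch of that arithmetic, with a hypothetical helper majority_baseline and labels built to match the 0.3307 positive rate reported earlier:

```python
def majority_baseline(labels):
    """Accuracy and false-positive ratio of always predicting the most common label."""
    n_true = sum(labels)
    n_false = len(labels) - n_true
    predict_true = n_true > n_false
    accuracy = max(n_true, n_false) / len(labels)
    # False positives only occur if the constant prediction is True.
    false_positive_ratio = n_false / len(labels) if predict_true else 0.0
    return accuracy, false_positive_ratio

# 3307 effective drugs out of 10000 experiments, as in the target series mean.
labels = [True] * 3307 + [False] * 6693
accuracy, false_positive_ratio = majority_baseline(labels)
print(accuracy, false_positive_ratio)
```

Any model worth keeping has to clear this accuracy, and the gap between it and Welch's accuracy is the real margin the classifiers are fighting over.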


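Random forest seems to beat Welch on accuracy if there is enough training data, although its type 1 error ratio (0.0315 in the run below) is slightly higher than Welch's (0.0299).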

In [15]:
%%time
rfc_scores = sklearn.model_selection.cross_validate(sklearn.ensemble.RandomForestClassifier(), 
                                       create_feature_dataframe(experiments),
                                       create_target_series(experiments),
                                       scoring={'accuracy': 'accuracy', 
                                                'type_1_error_ratio': type_1_error_score}
                                               )
print("Mean accuracy:", rfc_scores['test_accuracy'].mean())
print("Mean type 1 error:", abs(rfc_scores['test_type_1_error_ratio'].mean()))

Mean accuracy: 0.7724
Mean type 1 error: 0.0315
CPU times: user 14.7 s, sys: 87.7 ms, total: 14.8 s
Wall time: 14.8 s


In [16]:
if abs(rfc_scores['test_type_1_error_ratio'].mean()) < 0.05 and rfc_scores['test_accuracy'].mean() > welch_accuracy:
    print("Random forest classifier beat the Welch t-test while maintaining a low type 1 error ratio")

Random forest classifier beat the Welch t-test while maintaining a low type 1 error ratio
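
The check above uses a fixed 0.05 ceiling on the type 1 error ratio. The stricter criterion stated earlier, keeping the ratio no higher than Welch's own, gives a different verdict on this run. A minimal sketch, assuming a hypothetical helper beats_baseline and plugging in the four numbers printed in this notebook:

```python
def beats_baseline(accuracy, type_1_ratio, baseline_accuracy, baseline_type_1_ratio):
    """True only if the model is more accurate AND makes no more type 1 errors."""
    return accuracy > baseline_accuracy and type_1_ratio <= baseline_type_1_ratio

# Values copied from the outputs in this notebook.
rfc_accuracy, rfc_type_1 = 0.7724, 0.0315
welch_accuracy, welch_type_1 = 0.7653, 0.0299

strict_win = beats_baseline(rfc_accuracy, rfc_type_1, welch_accuracy, welch_type_1)
print(strict_win)
```

Under this stricter rule the random forest does not win on this run, because its extra accuracy comes with slightly more false positives. Which rule is appropriate depends on whether 0.05 or Welch's observed ratio is treated as the budget.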


### Lost causes

You would think that SVM models would work really well, but they take too long to be practical. It's also
not clear whether they actually improve with more data.

In [17]:
#%%time
#sklearn.model_selection.cross_validate(sklearn.svm.SVC(kernel='rbf', C=1e9), 
#                                       create_feature_dataframe(experiments),
#                                       create_target_series(experiments))['test_score'].mean()

In [18]:
#%%time
#sklearn.model_selection.cross_validate(sklearn.svm.SVC(kernel='poly', C=1e9, degree=2), 
#                                       create_feature_dataframe(experiments),
#                                       create_target_series(experiments))['test_score'].mean()

In [19]:
#sklearn.model_selection.cross_validate(sklearn.svm.LinearSVC(dual=True, C=1e9), 
#                                       create_feature_dataframe(experiments),
#                                       create_target_series(experiments))['test_score'].mean()