<center><img src='https://drive.google.com/uc?id=1_utx_ZGclmCwNttSe40kYA6VHzNocdET' height="60">

AI TECH - Akademia Innowacyjnych Zastosowań Technologii Cyfrowych. Program Operacyjny Polska Cyfrowa na lata 2014-2020
<hr>

<img src='https://drive.google.com/uc?id=1BXZ0u3562N_MqCLcekI-Ens77Kk4LpPm'>


Project co-financed by the European Union under the European Regional Development Fund 
Operational Programme Digital Poland for 2014-2020,
Priority Axis 3 "Digital competences of the society", Measure 3.2 "Innovative solutions for digital activation".   
Project title:  „Akademia Innowacyjnych Zastosowań Technologii Cyfrowych (AI Tech)”
    </center>

# Statistical machine learning - Notebook 6, version for students
**Author: Michał Ciach**  
**Date: 20.11.2021**

## Description
In this class, we will deal with some advanced topics about statistical hypothesis testing.  
We will cover non-parametric tests that do not rely on assuming a particular distribution of the data. We will perform a power analysis for the Student's t-test and use it to illustrate a common misconception about tests in general.  Finally, we will learn how to control false positive results when multiple tests are performed repeatedly.  

In [1]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1GW1pjKOCoKOlC4Jqbqql_ghYD_n0iC6O
!gdown https://drive.google.com/uc?id=1FInZ2jrlZGNColU4sHF9JKGHP39fTVut
!gdown https://drive.google.com/uc?id=1n1qS6dcVVKcVJOuUIIm0VTz6cSyrtzDH
!pip install --upgrade scipy

Downloading...
From: https://drive.google.com/uc?id=1GW1pjKOCoKOlC4Jqbqql_ghYD_n0iC6O
To: /content/BDL municipality incomes 2015-2020.csv
100% 228k/228k [00:00<00:00, 36.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1FInZ2jrlZGNColU4sHF9JKGHP39fTVut
To: /content/BDL municipality area km2 2015-2020.csv
100% 180k/180k [00:00<00:00, 58.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1n1qS6dcVVKcVJOuUIIm0VTz6cSyrtzDH
To: /content/BDL municipality population 2015-2020.csv
100% 222k/222k [00:00<00:00, 57.0MB/s]


## Data & library imports

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
from scipy.stats import mannwhitneyu, ttest_ind, ttest_rel, norm
from statsmodels.stats.multitest import fdrcorrection
from statsmodels.stats.power import TTestPower


pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.



In [3]:
income = pd.read_csv('BDL municipality incomes 2015-2020.csv', sep=';', dtype={'Code': 'str'})
population = pd.read_csv('BDL municipality population 2015-2020.csv', sep='\t', dtype={'Code': 'str'})

In [4]:
voivodeship_names = {
    '02': 'Dolnośląskie',
    '04': 'Kujawsko-pomorskie',
    '06': 'Lubelskie',
    '08': 'Lubuskie',
    '10': 'Łódzkie',
    '12': 'Małopolskie',
    '14': 'Mazowieckie',
    '16': 'Opolskie',
    '18': 'Podkarpackie',
    '20': 'Podlaskie',
    '22': 'Pomorskie',
    '24': 'Śląskie',
    '26': 'Świętokrzyskie',
    '28': 'Warmińsko-mazurskie',
    '30': 'Wielkopolskie',
    '32': 'Zachodniopomorskie'
}

In [5]:
code_list = [s[:2] for s in income["Code"]]
name_list = [voivodeship_names[code] for code in code_list]
income['Voivodeship'] = name_list

## The Mann-Whitney (a.k.a. Wilcoxon) test

The two-sample Student's t-test assumes a normal distribution of the data. When this assumption is only slightly violated, as for the log-income data, the results are still reliable, especially for large sample sizes. As we have seen in the previous classes, the estimator of the mean is more normally distributed than the original data, which makes the test robust to small deviations from normality. However, when this assumption is strongly violated, as for the non-transformed income data, the results are no longer reliable. One way to solve this problem is to use non-parametric tests. A non-parametric test is a test that does not rely on assuming a particular distribution of the data.   

One of the most common non-parametric tests is the Mann-Whitney U-test, also known as the two-sample Wilcoxon's test. It's often used as a replacement for the Student's t-test when the data is not distributed normally. However, the null hypotheses of these two tests are different, and it's important to understand this difference to avoid misleading results. 

In contrast to the Student's t-test, the Mann-Whitney test doesn't test the equality of parameters like the mean, hence the name *non-parametric*. Instead, its null hypothesis is that $\mathbb{P}(X > Y) = 1/2$, i.e. that if we take a random observation $X$ from the first sample and a random observation $Y$ from the second sample, the first is equally likely to be greater or smaller than the second. A one-sided alternative hypothesis may be e.g. that $\mathbb{P}(X > Y) > 1/2$, i.e. that observations from the first population tend to be larger than observations from the second one. In this case, we say that the first sample is *stochastically greater* than the second one.  

Sidenote: the actual null hypothesis of the Mann-Whitney test is slightly different, but the one described above is a very close and much simpler approximation that's usually used in practice.   
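
To make the null hypothesis concrete, the sketch below estimates $\mathbb{P}(X > Y)$ directly by comparing every pair of observations. The two samples and the helper name `prob_greater` are made up for illustration, and ties are counted as one half, which is a common convention.

```python
from itertools import product

def prob_greater(xs, ys):
    """Estimate P(X > Y) as the fraction of pairs (x, y) with x > y (ties count as 1/2)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x, y in product(xs, ys))
    return wins / (len(xs) * len(ys))

# Hypothetical samples, chosen only for illustration
first = [1.2, 3.4, 2.2, 5.0]
second = [0.7, 2.9, 1.1]

estimate = prob_greater(first, second)
print(estimate)
```

The numerator is the $U$ statistic of the first sample, so $U / (n_1 n_2)$ can be read as an estimate of $\mathbb{P}(X > Y)$; values far from 1/2 speak against the null hypothesis.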

**Exercise 1.** Select the data about the incomes of Mazowieckie and Wielkopolskie municipalities in 2020 and remove the rows with missing observations. Implement your own version of the Mann-Whitney test to compare the (untransformed) incomes between the two voivodeships. You can find the necessary equations [here](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) (use the normal approximation for the test statistic). 

Based on the value of the test statistic, use the `norm.cdf()` function from `scipy` to compute the p-value (i.e. the probability, under the null hypothesis, that the test statistic is at least as extreme in the direction of the alternative as the observed value) in a one-sided test with the alternative that Wielkopolskie is stochastically richer than Mazowieckie. You may assume that there are no ties in the data (i.e. that there are no two identical incomes) to use the simpler formulas. 

Compare your results to the `mannwhitneyu` function from `scipy` (both the p-value and the value of the test statistic). Pay attention to the default parameters to obtain identical results. Compare the results to Welch's test (a two-sample t-test that does not assume equal variances) implemented in `ttest_ind` from `scipy`. Are the results of the two tests consistent? Can you conclude that one of the voivodeships is richer than the other? Hint: Compare the means and the medians of the incomes. 





In [6]:
## Write your code here

voivodeship = ['Wielkopolskie', 'Mazowieckie']
voivodeship_income = income.loc[income['Voivodeship'].isin(voivodeship)]
voivodeship_income_subset = voivodeship_income[['2020', 'Voivodeship']]
voivodeship_income_subset = voivodeship_income_subset.dropna()

In [7]:
wielkopolskie = voivodeship_income_subset.loc[(voivodeship_income_subset['Voivodeship'] == 'Wielkopolskie')]
mazowieckie = voivodeship_income_subset.loc[(voivodeship_income_subset['Voivodeship'] == 'Mazowieckie')]

In [8]:
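def s(x, y):
  # Score of a single pair: 1 if x > y, 0.5 for a tie, 0 otherwise
  if x > y:
    return 1
  elif x == y:
    return 0.5
  else:
    return 0

def mann_whitney_test(sample_1, sample_2):
  # U statistic of sample_1: the sum of pair scores over all pairs (x, y)
  # with x from sample_1 and y from sample_2
  u_statistic = 0
  for x in sample_1:
    for y in sample_2:
      u_statistic += s(x, y)
  return u_statistic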

In [9]:
test_statistic = mann_whitney_test(mazowieckie['2020'].array, wielkopolskie['2020'].array)
test_statistic

28105

In [10]:
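n1 = len(mazowieckie)
n2 = len(wielkopolskie)

mean_u = n1 * n2 / 2
std_u = np.sqrt((n1 * n2 * (n1+n2+1)) / 12)
test_statistic_norm = (test_statistic-mean_u)/std_u
# test_statistic counts pairs with Mazowieckie > Wielkopolskie, so the alternative
# "Wielkopolskie is stochastically greater" corresponds to small values: use the lower tail
p = norm.cdf(test_statistic_norm)
print(f'{p:.3e}')

1.859e-05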
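
Which tail of the normal approximation to use depends on which sample $U$ was computed for. The sketch below uses made-up data and only the standard library (`statistics.NormalDist` stands in for `norm.cdf`). It shows that the lower tail for $U_1$ gives the same p-value as the upper tail for $U_2 = n_1 n_2 - U_1$. It assumes no ties and applies no continuity correction.

```python
from math import sqrt
from statistics import NormalDist

def u_statistic(xs, ys):
    """Number of pairs (x, y) with x > y; assumes there are no ties."""
    return sum(x > y for x in xs for y in ys)

# Hypothetical samples: the second one tends to be larger
a = [3.1, 4.7, 2.0, 5.5, 3.9, 4.1]
b = [6.2, 5.8, 7.4, 4.9, 8.0, 6.6, 5.1]

n1, n2 = len(a), len(b)
u1 = u_statistic(a, b)   # pairs where the value from a is larger
u2 = u_statistic(b, a)   # pairs where the value from b is larger

mean_u = n1 * n2 / 2
std_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# Alternative "b is stochastically greater than a": small u1, or equivalently large u2
p_lower = NormalDist().cdf((u1 - mean_u) / std_u)
p_upper = 1 - NormalDist().cdf((u2 - mean_u) / std_u)
print(u1, u2, p_lower, p_upper)
```

Using the wrong tail gives one minus the intended p-value, so a p-value very close to 1 in a one-sided test is a hint that the direction may be reversed.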

In [11]:
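# Defaults: alternative='two-sided' and use_continuity=True, so this p-value is not
# directly comparable with the one-sided normal approximation computed above
mannwhitneyu(mazowieckie['2020'].array, wielkopolskie['2020'].array)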

MannwhitneyuResult(statistic=28105.0, pvalue=3.722776778222481e-05)

In [12]:
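# Note: with the default equal_var=True this is the pooled-variance Student's t-test;
# Welch's test requires equal_var=False
ttest_ind(mazowieckie['2020'].array, wielkopolskie['2020'].array)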

Ttest_indResult(statistic=0.5484201472954098, pvalue=0.5836309389884577)

In [13]:
mazowieckie['2020'].mean(), np.median(mazowieckie['2020'])

(74146361.68592356, 14185640.715)

In [14]:
wielkopolskie['2020'].mean(), np.median(wielkopolskie['2020'])

(47549763.05986725, 18980941.075)
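
The hint points at why the two tests can disagree: the t-test compares means, which a few very large values can dominate, while the Mann-Whitney test depends only on the ordering of the observations. A small sketch with made-up numbers (not the municipal data) shows a group with the higher mean but the lower median:

```python
from statistics import mean, median

# Hypothetical incomes: group_a contains one extreme value, group_b does not
group_a = [10, 11, 12, 13, 14, 500]
group_b = [15, 16, 17, 18, 19, 20]

# Fraction of pairs in which the observation from group_a is the larger one
pairs_a_larger = sum(a > b for a in group_a for b in group_b) / (len(group_a) * len(group_b))

print(mean(group_a), median(group_a))   # the mean is pulled up by the extreme value
print(mean(group_b), median(group_b))
print(pairs_a_larger)
```

The outputs above show the same pattern in the real data: Mazowieckie has the higher mean but the lower median, so "richer" depends on whether we mean the average income or the income of a typical municipality.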

## Power analysis

In the previous classes, we've learned how to determine the number of observations needed for a sufficiently accurate estimation of the parameter. Power analysis is an analogous procedure for statistical tests. The power of a test is the probability of rejecting the null hypothesis when the alternative is true (obtaining a *true positive* result). You can read more about it [here](https://nickmccullum.com/power-analysis-in-python/).  

When analyzing the properties of a statistical test, we have four parameters in total: 
1. Significance level $\alpha$ (the probability of a false positive result)
2. Sample size $N$ 
3. Effect size (e.g. mean over standard deviation in the one-sample t-test)
4. Power (the probability of a true positive)

Setting any three of those, we can determine the fourth one. For example, we may look for the effect size necessary to get a 5% probability of false positive and 80% of a true positive result on 20 observations. 

Unfortunately, theoretical power analysis usually relies on complex formulas, and there are only a few functions in Python libraries to perform it for only the most common tests.
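
For intuition, the relationship between the four quantities can be sketched with a normal approximation. In this approximation, the sample size required for a one-sample or paired two-sided test with effect size $d$ is roughly $n \approx \left((z_{1-\alpha/2} + z_{\text{power}})/d\right)^2$. It is only an approximation and not an exact calculation based on the t distribution, so it tends to give somewhat smaller numbers for small samples. The function name `approx_sample_size` and the effect sizes are made up for this example.

```python
from statistics import NormalDist

def approx_sample_size(effect_size, alpha=0.05, power=0.8, two_sided=True):
    """Normal-approximation sample size for a one-sample (or paired) test."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_power = z(power)
    return ((z_alpha + z_power) / effect_size) ** 2

# Effect sizes chosen for illustration only
for d in (0.2, 0.5, 0.8):
    two_sided_n = approx_sample_size(d)
    one_sided_n = approx_sample_size(d, two_sided=False)
    print(d, round(two_sided_n, 1), round(one_sided_n, 1))
```

The required sample size grows with the inverse square of the effect size, and a one-sided alternative needs fewer observations than a two-sided one at the same $\alpha$ and power.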

**Exercise 2.** In this exercise, we will perform a power analysis of the paired t-test that we did in the last class in order to determine the sample size necessary to get a power of 80% at a significance level of 5%. We will also see how the power depends on the distribution of the data set.  

Take the data about the income of all municipalities in 2019 and 2020 and remove the missing rows and rows where any income is equal to zero. Compute the log10 of the income. Compute the effect sizes (the mean over the standard deviation of the difference of observations) for the log-income and the non-transformed income. Use the `TTestPower.solve_power()` function to estimate the necessary number of observations for both data sets. If you have trouble running this function, ask Google for help. Compare the required numbers of observations of a one-sided and a two-sided alternative.   

Validate the results of `TTestPower` empirically. For 1000 repetitions, get a sample of a given size, perform a paired t-test, and check if the result is positive (i.e. if the p-value is below 0.05). Compute the proportions of true positive results for the log-transformed and the non-transformed data. Do the results agree with theoretical predictions? If not, explain why this may happen and whether in this case it's good or bad.   


In [15]:
## Write your code here
income_subset = income[['2020', '2019']]
income_subset = income_subset.dropna()
income_subset = income_subset.loc[(income_subset!=0).all(axis=1)]
income_subset['log-income_2020'] = np.log10(income_subset['2020'])
income_subset['log-income_2019'] = np.log10(income_subset['2019'])
income_subset['no_log_difference'] = income_subset['2020'] - income_subset['2019']
income_subset['log_difference'] = income_subset['log-income_2020'] - income_subset['log-income_2019']
income_subset

|  | 2020 | 2019 | log-income_2020 | log-income_2019 | no_log_difference | log_difference |
|---|---|---|---|---|---|---|
| 0 | 1.138563e+08 | 1.107985e+08 | 8.056357 | 8.044534 | 3.057754e+06 | 0.011823 |
| 1 | 4.288890e+07 | 3.871533e+07 | 7.632345 | 7.587883 | 4.173570e+06 | 0.044462 |
| 2 | 2.754443e+07 | 2.572157e+07 | 7.440034 | 7.410298 | 1.822855e+06 | 0.029736 |
| 3 | 3.341908e+07 | 3.913420e+07 | 7.523995 | 7.592556 | -5.715115e+06 | -0.068562 |
| 4 | 2.484304e+07 | 2.177626e+07 | 7.395205 | 7.337983 | 3.066784e+06 | 0.057222 |
| ... | ... | ... | ... | ... | ... | ... |
| 2504 | 1.941628e+07 | 1.760038e+07 | 7.288166 | 7.245522 | 1.815902e+06 | 0.042644 |
| 2505 | 2.503944e+07 | 1.359405e+07 | 7.398625 | 7.133349 | 1.144539e+07 | 0.265276 |
| 2506 | 3.840694e+08 | 3.432316e+08 | 8.584410 | 8.535587 | 4.083774e+07 | 0.048822 |
| 2507 | 1.739014e+09 | 1.545381e+09 | 9.240303 | 9.189035 | 1.936335e+08 | 0.051268 |


In [16]:
mean_no_log, std_no_log = np.mean(income_subset['no_log_difference']), np.std(income_subset['no_log_difference'])
mean_no_log, std_no_log

(3275689.4963511396, 12383943.548743062)

In [17]:
mean_log, std_log = np.mean(income_subset['log_difference']), np.std(income_subset['log_difference'])
mean_log, std_log

(0.06759688088595536, 0.07417980200903816)

In [18]:
TTestPower().solve_power(effect_size=mean_no_log/std_no_log, nobs=None,
                       alpha=0.05, power=0.8, alternative='two-sided')

114.1165722901473

In [19]:
true_positive = 0
for _ in range(1000):
    # Sample whole rows so that each 2020 income stays paired with the 2019 income
    # of the same municipality
    sample = income_subset.sample(114)
    pval = ttest_rel(sample['2020'], sample['2019'], alternative='two-sided').pvalue
    if pval < 0.05:
        true_positive += 1
true_positive, true_positive/1000

In [20]:
TTestPower().solve_power(effect_size=mean_no_log/std_no_log, nobs=None,
                       alpha=0.05, power=0.8, alternative='larger')

89.73347056260931

In [21]:
true_positive = 0
for _ in range(1000):
    # Sample whole rows so that each 2020 income stays paired with the 2019 income
    # of the same municipality
    sample = income_subset.sample(90)
    pval = ttest_rel(sample['2020'], sample['2019'], alternative='greater').pvalue
    if pval < 0.05:
        true_positive += 1
true_positive, true_positive/1000

In [22]:
TTestPower().solve_power(effect_size=mean_log/std_log, nobs=None,
                       alpha=0.05, power=0.8, alternative='two-sided')

11.518743471586536

In [23]:
true_positive = 0
for _ in range(1000):
    # Sample whole rows so that each 2020 log-income stays paired with the 2019 log-income
    # of the same municipality
    sample = income_subset.sample(12)
    pval = ttest_rel(sample['log-income_2020'], sample['log-income_2019'], alternative='two-sided').pvalue
    if pval < 0.05:
        true_positive += 1
true_positive, true_positive/1000

In [24]:
TTestPower().solve_power(effect_size=mean_log/std_log, nobs=None,
                       alpha=0.05, power=0.8, alternative='larger')

8.962829877130206

In [25]:
true_positive = 0
for _ in range(1000):
    # Sample whole rows so that each 2020 log-income stays paired with the 2019 log-income
    # of the same municipality
    sample = income_subset.sample(9)
    pval = ttest_rel(sample['log-income_2020'], sample['log-income_2019'], alternative='greater').pvalue
    if pval < 0.05:
        true_positive += 1
true_positive, true_positive/1000
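
A paired test only reaches the predicted power when each 2020 value is matched with the 2019 value of the same municipality. The sketch below uses synthetic data (an assumption made purely for illustration: large differences between units, a small and fairly consistent yearly change) to compare the effect size when pairs are kept with the effect size when the two columns are sampled independently.

```python
import random
from statistics import mean, stdev

random.seed(0)

# Synthetic "municipalities": large differences between units, small yearly change
n_units = 2000
before = [random.gauss(100.0, 30.0) for _ in range(n_units)]
after = [b + random.gauss(2.0, 2.0) for b in before]

def effect_size(diffs):
    """Mean of the differences divided by their standard deviation."""
    return mean(diffs) / stdev(diffs)

# Paired sampling: draw unit indices once and use them for both years
idx = random.sample(range(n_units), 400)
paired = [after[i] - before[i] for i in idx]

# Independent sampling: each year gets its own random set of units, so pairs are lost
idx_after = random.sample(range(n_units), 400)
idx_before = random.sample(range(n_units), 400)
unpaired = [after[i] - before[j] for i, j in zip(idx_after, idx_before)]

print(effect_size(paired), effect_size(unpaired))
```

When the pairing is lost, the variation between units enters the differences and the effect size collapses, so the test rejects the null hypothesis far less often than the power analysis predicts.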

## Multiple hypothesis testing



When performing statistical tests, we reject the null hypothesis when the p-value is below a set threshold, traditionally 0.05. A consequence of this approach is that about 5% of the tests in which the null hypothesis is true will give false positive results, which may be a huge number when thousands of tests are performed in large-scale studies.  

In order to limit the number of false positives, we use *multiple testing corrections*. One of the most useful is the Benjamini-Hochberg correction, which controls the false discovery rate, i.e. the expected proportion of false positives among all positive results. Note the difference from the significance level: the false discovery rate concerns the proportion of false positives among all positives in the *results* of the tests, while the significance level is the proportion of false positives among the *true negatives* (cases when $H_0$ is true). 
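
The difference between the two proportions can be seen in a small simulation. The setup is hypothetical: 900 one-sided z-tests in which $H_0$ is true and 100 tests with a real effect (test statistic shifted by 3); all names and numbers are chosen only for illustration.

```python
import random
from statistics import NormalDist

random.seed(1)
norm_dist = NormalDist()

n_null, n_effect = 900, 100   # hypothetical numbers of true nulls and true effects
alpha = 0.05

def one_sided_p(z):
    """P-value of a one-sided z-test with the alternative 'mean > 0'."""
    return 1 - norm_dist.cdf(z)

# Test statistics: N(0, 1) when H0 is true, N(3, 1) when there is a real effect
p_null = [one_sided_p(random.gauss(0, 1)) for _ in range(n_null)]
p_effect = [one_sided_p(random.gauss(3, 1)) for _ in range(n_effect)]

false_pos = sum(p < alpha for p in p_null)
true_pos = sum(p < alpha for p in p_effect)

false_positive_rate = false_pos / n_null                    # close to alpha
false_discovery_prop = false_pos / (false_pos + true_pos)   # share of false results among positives
print(false_pos, true_pos, false_positive_rate, false_discovery_prop)
```

Even though each test keeps its false positive rate near $\alpha$, the share of false results among the positives can be much larger when true effects are rare.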

**Exercise 3.** A common way to perform the Benjamini-Hochberg correction is to transform the p-values $p_1, \dots, p_m$ into so-called *q-values* $q_1, \dots, q_m$, such that the FDR is controlled at the level $Q$ when we reject all null hypotheses $H_i$ with $q_i \leq Q$.  

Based on the description of the Benjamini-Hochberg procedure from [this Wikipedia article](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure), figure out how to compute the q-values. Write a function that accepts a vector of p-values and returns the corresponding q-values. Compare your results to the `fdrcorrection` function from the `statsmodels` library on a p-value vector (0.01, 0.1, 0.01, 0.2, 0.01, 0.1).  

Select the data about the income of Mazowieckie municipalities from the years 2015-2020. For each year from 2016 onwards, perform a t-test at a significance level $\alpha$=0.05 to check if the average income has increased compared to the previous year. How many positive results did you get? How many false positives do you expect? Use the Benjamini-Hochberg correction to perform a study in which you expect at most 1% of the positive results (years with increased average income) to be false.  

In [26]:
test_p = [0.01, 0.1, 0.01, 0.2, 0.01, 0.1]

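def BH_correct(p):
    # Returns the q-values in the same order as the input, without modifying the input
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])  # indices of p-values in ascending order
    q_vals = [0.0] * n
    running_min = 1.0
    # Go from the largest p-value down, so that the q-values stay monotone
    for rank in reversed(range(n)):
        idx = order[rank]
        running_min = min(running_min, p[idx] * (n/(rank+1)))
        q_vals[idx] = running_min
    return q_vals

print('My q-values:', BH_correct(test_p))
print('Statsmodels:', fdrcorrection(test_p)[1])

My q-values: [0.02, 0.12, 0.02, 0.2, 0.02, 0.12]
Statsmodels: [0.02 0.12 0.02 0.2  0.02 0.12]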


In [27]:
## Analyze the income increases here
mazowieckie = income.loc[(income['Voivodeship'] == 'Mazowieckie')]
years = ['2015', '2016', '2017', '2018', '2019', '2020']
mazowieckie = mazowieckie[years]
mazowieckie = mazowieckie.dropna()

In [28]:
alpha = 0.05

for i in range(1, len(years)):
  p_val = ttest_rel(mazowieckie[years[i]], mazowieckie[years[i-1]], alternative='greater').pvalue
  if p_val <= alpha:
    print(f'Income has increased for years {years[i]} - {years[i-1]}, p-value: {p_val}')
  else:
    print(f'Income has not increased for years {years[i]} - {years[i-1]}, p-value: {p_val}')

Income has increased for years 2016 - 2015, p-value: 0.0012370001919484626
Income has increased for years 2017 - 2016, p-value: 0.03223112010521277
Income has not increased for years 2018 - 2017, p-value: 0.05710655643099397
Income has increased for years 2019 - 2018, p-value: 4.646533361072804e-07
Income has not increased for years 2020 - 2019, p-value: 0.12105067210305027
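
The remaining part of the exercise can be sketched from the p-values printed above (copied here, rounded). If all five null hypotheses were true, the expected number of false positives at $\alpha = 0.05$ would be $5 \cdot 0.05 = 0.25$. The helper `bh_qvalues` is a compact re-implementation written for this example; `fdrcorrection` with `alpha=0.01` should lead to the same decisions.

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values, returned in the order of the input p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

# p-values for 2016 vs 2015, ..., 2020 vs 2019, rounded from the output above
comparisons = ['2016 vs 2015', '2017 vs 2016', '2018 vs 2017', '2019 vs 2018', '2020 vs 2019']
pvals = [0.0012370, 0.0322311, 0.0571066, 4.6465e-07, 0.1210507]

Q = 0.01  # target false discovery rate
qvals = bh_qvalues(pvals)
significant = [name for name, q in zip(comparisons, qvals) if q <= Q]
print(significant)
```

With these p-values, only the comparisons whose q-value is at most 0.01 count as discoveries, which is a stricter selection than the uncorrected threshold of 0.05.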

