# Assignment 6: Differences between empirical data sets

---
## Background

### Problem Analysis
1. Analyse the differences between the perspectives using the data in table 6. Do the different perspectives identify different defects, i.e. is there a statistically significant difference?
2. How can assumed differences between the documents be analysed using the information from tables 6 and 7? Analyse if the outcome from the two documents is different. What are the implications if there happens to be a difference between the documents?
3. Are there significant differences between the perspectives in terms of outcome? Use the data in table

### The Methods
##### Kruskal-Wallis
The Kruskal-Wallis test is a non-parametric method for testing whether there are statistically significant differences between the medians of three or more independent groups. It is an alternative to one-way ANOVA when the assumption of normality is not met.

**Assumptions**  
- The samples are independent of each other
- The data in each group is ordinal or continuous
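
As a minimal sketch of how the test is called in SciPy, the following runs `scipy.stats.kruskal` on three small made-up samples. The values and the names `group_a`, `group_b` and `group_c` are illustrative assumptions and are not taken from the assignment data.

```python
from scipy import stats

# Hypothetical samples (made up for illustration only)
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
group_c = [10, 11, 12, 13, 14]

# kruskal accepts two or more samples and returns the H statistic and p-value
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")

alpha = 0.05
if p_value > alpha:
    print("Fail to reject H0: no evidence that the groups differ")
else:
    print("Reject H0: at least one group differs from the others")
```

Here `group_c` is clearly shifted away from the other two samples, so the test rejects the null hypothesis at the 0.05 level.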

##### Mann-Whitney U
The Mann-Whitney U test is a non-parametric test used to determine if there is a difference between two independent groups. It is an alternative to the independent t-test when data do not meet the assumption of normality.

**Assumptions**
- The two samples are independent of each other
- The data are ordinal or continuous
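
A minimal sketch of the corresponding SciPy call, again on made-up data: `sample_x` and `sample_y` are hypothetical names and values chosen only to illustrate `scipy.stats.mannwhitneyu`.

```python
from scipy import stats

# Hypothetical samples (made up for illustration only)
sample_x = [1, 2, 3, 4, 5, 6]
sample_y = [7, 8, 9, 10, 11, 12]

# Two-sided test: is there a difference in either direction?
u_stat, p_value = stats.mannwhitneyu(sample_x, sample_y, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```

Because every value in `sample_x` lies below every value in `sample_y`, the U statistic is 0 and the p-value is small.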

---
## Solution

In [2]:
import pandas as pd
import scipy.stats as stats
import numpy as np


df_6 = pd.read_csv('data/table_6.csv', delimiter=";", header=[1])
df_6.head()

Unnamed: 0,Defect number,User,Tester,Designer,User.1,Tester.1,Designer.1
0,1,2,1,2,1,1,3
1,2,4,2,3,2,4,4
2,3,0,3,1,0,0,2
3,4,0,1,1,3,2,4
4,5,3,2,2,0,0,0


In [27]:
df_7 = pd.read_csv('data/table_7.csv', delimiter=";")
df_7.head()

Unnamed: 0,Id,Perspective,Document,Time,Defects,Efficiency,Rate
0,1,U,ATM,187,8,2.567,0.276
1,2,D,PG,150,8,3.2,0.267
2,3,T,ATM,165,9,3.273,0.31
3,4,U,PG,185,11,3.568,0.367
4,5,D,ATM,155,8,3.097,0.276


### 6.1 Is there a statistically significant difference between the three perspectives?

1. Concatenate the counts from both documents, as we care about the difference between perspectives rather than between documents. Since the defects are not the same in both documents, it would not make sense to sum the counts per defect number.
2. The counts are small discrete values (0 to 5), so normality is doubtful and the variance is hard to estimate reliably. Hence we go directly to non-parametric tests. Two candidates are the Kruskal-Wallis test and the chi-squared test of independence. Comparing the perspectives pairwise would require three separate tests, so we use the Kruskal-Wallis test, which handles all three groups at once.

In [4]:
df_6_1 = df_6[['User', 'Tester', 'Designer']]
df_6_2 = df_6[['User.1', 'Tester.1', 'Designer.1']].rename(columns={'User.1': 'User', 'Tester.1': 'Tester', 'Designer.1': 'Designer'})
df = pd.concat([df_6_1, df_6_2], ignore_index=True)
df.drop(df.tail(1).index, inplace=True)
df = df.astype(int)
df.shape

(59, 3)

In [5]:
# perform kruskal-wallis test
h_stats, p = stats.kruskal(df['User'], df['Tester'], df['Designer'])

# interpret
alpha = 0.05
print("Significance level: ", alpha)
print('H statistics:', h_stats)
print("p-value: ", p)
if p > alpha:
    print("Fail to reject H0: The perspective has no impact on the identification of defects")
else:
    print("Reject H0: The perspective has an impact on the identification of defects")

Significance level:  0.05
H statistics: 0.6018682913012375
p-value:  0.7401265116917417
Fail to reject H0: The perspective has no impact on the identification of defects


**Interpretation**  
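The H statistic is low (about 0.60, well below the chi-squared critical value of 5.99 for 2 degrees of freedom at a significance level of 0.05), so we fail to reject the null hypothesis (p-value = 0.74 > 0.05). Hence there does not seem to be a statistically significant difference between the three perspectives.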
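
The decision can also be read off the chi-squared distribution that Kruskal-Wallis uses as its reference. The following is a small sketch, assuming the usual approximation with `k - 1` degrees of freedom for `k` groups; the variable names are illustrative, and `h_stat` is copied from the output above.

```python
from scipy import stats

alpha = 0.05
k = 3                          # number of groups: User, Tester, Designer
dof = k - 1                    # degrees of freedom of the reference distribution
h_stat = 0.6018682913012375    # H statistic reported above

# H must exceed this value for the test to be significant at level alpha
critical_value = stats.chi2.ppf(1 - alpha, dof)

# p-value implied by H under the chi-squared approximation
p_value = stats.chi2.sf(h_stat, dof)

print(f"critical value: {critical_value:.3f}")
print(f"p-value: {p_value:.4f}")
print("significant" if h_stat > critical_value else "not significant")
```

The H statistic is far below the critical value, which is the same conclusion as comparing the p-value with `alpha`.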

### 6.2 Are the documents statistically significantly different?
To evaluate whether the two documents differ, we compare, for each perspective, the defect counts for document PG with those for document ATM (table 6). We could use the independent t-test, but since it is a parametric approach we would need to check normality and related assumptions. Instead, we employ the Mann-Whitney U test as the non-parametric alternative for comparing two independent samples.

In [16]:
# use mann-whitney u test to compare User and User.1 from df_6
df_6_local = df_6.drop(df_6.tail(1).index, inplace=False).astype(int)

for col in ['User', 'Tester', 'Designer']:
    u, p = stats.mannwhitneyu(df_6_local[col], df_6_local[col + '.1'])
    print(f"Mann-Whitney U Test for perspective {col} in PG document and {col} in ATM document: U-statistics={u}, p-value = {p}")

Mann-Whitney U Test for perspective User in PG document and User in ATM document: U-statistics=459.0, p-value = 0.5349541850206717
Mann-Whitney U Test for perspective Tester in PG document and Tester in ATM document: U-statistics=404.0, p-value = 0.7932018367857695
Mann-Whitney U Test for perspective Designer in PG document and Designer in ATM document: U-statistics=364.0, p-value = 0.3663125334185646


**Interpretation**  
For each perspective, the difference between the documents is not statistically significant, as all p-values are well above 0.05.

For table 7, we split the data into two groups, one for document PG and one for document ATM.
We then test each variable (`Time`, `Defects`, `Efficiency`, `Rate`) for a statistically significant difference between the two groups, again with the Mann-Whitney U test.

In [28]:
variables = ['Time', 'Defects', 'Efficiency', 'Rate']
# cast Time, Defects and Efficiency to int, and Rate to float
# (note: the int cast truncates the decimals of Efficiency)
df_7[variables[:-1]] = df_7[variables[:-1]].astype(int)
df_7[variables[-1]] = df_7[variables[-1]].astype(float)

# Mann-Whitney U test per variable: does the document (PG vs. ATM) affect it?
for column in variables:
    u, p = stats.mannwhitneyu(df_7[df_7['Document'] == 'PG'][column], df_7[df_7['Document'] == 'ATM'][column])
    print(f"Mann-Whitney U Test for variable {column} in PG in ATM document: U-statistics={u}, p-value = {p}")

Mann-Whitney U Test for variable Time in PG in ATM document: U-statistics=51.0, p-value = 0.011355298125865448
Mann-Whitney U Test for variable Defects in PG in ATM document: U-statistics=82.5, p-value = 0.21597411416320444
Mann-Whitney U Test for variable Efficiency in PG in ATM document: U-statistics=138.0, p-value = 0.2683465527582466
Mann-Whitney U Test for variable Rate in PG in ATM document: U-statistics=68.0, p-value = 0.06713848168803553

**Interpretation**  
At the 0.05 level, only `Time` differs significantly between the two documents (p-value = 0.011). For `Defects`, `Efficiency` and `Rate` the p-values (0.216, 0.268 and 0.067) are above 0.05, so no significant difference between the documents is detected for these variables.


### 6.3 Do the perspectives significantly impact the outcome variables?
Here we focus on the perspectives. We split the dataset by perspective, ignoring the document, and then test each outcome variable for differences between the perspectives. Since we compare three groups, we again use the Kruskal-Wallis test, as in 6.1.
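
The split-then-test procedure can be sketched compactly with `groupby`. The frame `toy` below is a hypothetical stand-in with the same column layout as table 7; its values are made up and are not the assignment data.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data in the layout of table 7 (values are made up)
toy = pd.DataFrame({
    "Perspective": ["U", "U", "U", "T", "T", "T", "D", "D", "D"],
    "Time": [187, 185, 190, 165, 160, 170, 150, 155, 152],
    "Defects": [8, 11, 9, 9, 7, 10, 8, 8, 6],
})

results = {}
for column in ["Time", "Defects"]:
    # one array of observations per perspective
    groups = [g[column].to_numpy() for _, g in toy.groupby("Perspective")]
    h, p = stats.kruskal(*groups)
    results[column] = (h, p)
    print(f"{column}: H = {h:.3f}, p = {p:.4f}")
```

Collecting the groups in a list keeps the call to `stats.kruskal` independent of the number of perspectives.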

In [29]:
variables = ['Time', 'Defects', 'Efficiency', 'Rate']

# Kruskal-Wallis test per variable: does the perspective (U, T, D) affect it?
for column in variables:
    h, p = stats.kruskal(df_7[df_7['Perspective'] == 'U'][column], df_7[df_7['Perspective'] == 'T'][column], df_7[df_7['Perspective'] == 'D'][column])
    print(f"Kruskal-Wallis for variable {column} over all three perspectives: h-statistics={h:.3f}, p-value = {p}")

Kruskal-Wallis for variable Time over all three perspectives: h-statistics=4.106, p-value = 0.1283584253184742
Kruskal-Wallis for variable Defects over all three perspectives: h-statistics=2.308, p-value = 0.31533214931759795
Kruskal-Wallis for variable Efficiency over all three perspectives: h-statistics=0.386, p-value = 0.824510188833907
Kruskal-Wallis for variable Rate over all three perspectives: h-statistics=2.129, p-value = 0.34494439053893794


**Interpretation**  
For all variables the p-value is above 0.05, so we cannot reject the null hypothesis. The perspectives do not appear to have a significant impact on any of these variables.