**Investigating the Adoption of Research Software**

Category: Software Quality

Profile: Users and Developers


***1) Import Required Libraries***

In [None]:
%matplotlib inline

import os, sys
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from functools import reduce
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import statistics as s
import scipy.stats as stats
from scipy.stats import ttest_ind
from scipy.stats import bootstrap

%matplotlib notebook
import seaborn as sns




***2) Loading and Preprocessing Data***

In [None]:
from google.colab import files
import io
uploaded = files.upload()


Saving Survey_Perfil_Todos_Usuario_e_Desenvolvedor.xlsx to Survey_Perfil_Todos_Usuario_e_Desenvolvedor.xlsx


In [None]:
file_name = next(iter(uploaded))
file_name

'Survey_Perfil_Todos_Usuario_e_Desenvolvedor.xlsx'

In [None]:
df = pd.read_excel(file_name)

df

Unnamed: 0,PERFIL,TEMPO_USO,FREQUENCIA_USO,IDADE,GENERO,FORMACAO,ANOS_EXPERIENCIA,UTILIZACAO_D01F01,UTILIZACAO_D01F02,UTILIZACAO_D01F03,...,QUALIDADE_D02F07,QUALIDADE_D02F08,NAO_ADOCAO_D03F01,NAO_ADOCAO_D03F02,NAO_ADOCAO_D03F03,NAO_ADOCAO_D03F04,NAO_ADOCAO_D03F05,NAO_ADOCAO_D03F06,NAO_ADOCAO_D03F07,NAO_ADOCAO_D03F08
0,1.0,3.0,3.0,2.0,1.0,4.0,3.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
1,1.0,4.0,4.0,1.0,2.0,4.0,2.0,4.0,2.0,3.0,...,2.0,2.0,4.0,5.0,2.0,4.0,2.0,3.0,4.0,3.0
2,1.0,3.0,4.0,2.0,2.0,2.0,2.0,2.0,5.0,4.0,...,5.0,5.0,4.0,5.0,2.0,3.0,4.0,4.0,5.0,4.0
3,1.0,2.0,3.0,3.0,1.0,4.0,2.0,4.0,3.0,4.0,...,4.0,4.0,5.0,4.0,5.0,5.0,3.0,3.0,5.0,3.0
4,1.0,1.0,5.0,3.0,1.0,4.0,2.0,5.0,3.0,5.0,...,4.0,5.0,5.0,5.0,3.0,4.0,2.0,5.0,4.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168,2.0,4.0,4.0,2.0,2.0,4.0,3.0,4.0,3.0,4.0,...,4.0,4.0,5.0,5.0,4.0,4.0,3.0,0.0,4.0,4.0
169,2.0,2.0,3.0,2.0,2.0,4.0,2.0,4.0,3.0,5.0,...,2.0,4.0,5.0,5.0,4.0,4.0,2.0,3.0,2.0,4.0
170,2.0,1.0,3.0,1.0,2.0,2.0,2.0,3.0,2.0,4.0,...,4.0,5.0,4.0,5.0,3.0,4.0,3.0,2.0,4.0,4.0
171,2.0,4.0,3.0,2.0,2.0,5.0,3.0,4.0,4.0,5.0,...,4.0,5.0,5.0,5.0,5.0,5.0,1.0,2.0,5.0,5.0




***3) Non-Adoption of Research Software***

D03F01 = Not having ease of use

D03F02 = Not having documentation about the usage

D03F03 = Not having scientific disclosure about the software

D03F04 = Not having quality

D03F05 = Not having open source

D03F06 = Not being free

D03F07 = Not having maintenance and continuous evolution

D03F08 = Not having adoption by researchers/professors

Total user respondents = 143

Total developer respondents = 30

In [None]:
print("Profile: Users - Non-Adoption of Research Software")

# Select the eight non-adoption columns for users (PERFIL == 1.0)
nao_adocao_cols = [f'NAO_ADOCAO_D03F{i:02d}' for i in range(1, 9)]
df3 = df.loc[df['PERFIL'] == 1.0, nao_adocao_cols].copy()
# Shorten the column names to the factor codes D03F01..D03F08
df3.columns = [col.replace('NAO_ADOCAO_', '') for col in nao_adocao_cols]
df3_fa = df3
df3

Profile: Users - Non-Adoption of Research Software


Unnamed: 0,D03F01,D03F02,D03F03,D03F04,D03F05,D03F06,D03F07,D03F08
0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
1,4.0,5.0,2.0,4.0,2.0,3.0,4.0,3.0
2,4.0,5.0,2.0,3.0,4.0,4.0,5.0,4.0
3,5.0,4.0,5.0,5.0,3.0,3.0,5.0,3.0
4,5.0,5.0,3.0,4.0,2.0,5.0,4.0,4.0
...,...,...,...,...,...,...,...,...
138,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
139,2.0,3.0,1.0,3.0,1.0,4.0,2.0,1.0
140,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0
141,5.0,4.0,2.0,5.0,2.0,5.0,4.0,4.0


In [None]:
print("Profile: Developers - Non-Adoption of Research Software")

# Select the eight non-adoption columns for developers (PERFIL == 2.0)
nao_adocao_cols = [f'NAO_ADOCAO_D03F{i:02d}' for i in range(1, 9)]
df4 = df.loc[df['PERFIL'] == 2.0, nao_adocao_cols].copy()
# Shorten the column names to the factor codes D03F01..D03F08
df4.columns = [col.replace('NAO_ADOCAO_', '') for col in nao_adocao_cols]
df4_fa = df4
df4

Profile: Developers - Non-Adoption of Research Software


Unnamed: 0,D03F01,D03F02,D03F03,D03F04,D03F05,D03F06,D03F07,D03F08
143,3.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0
144,4.0,4.0,5.0,5.0,3.0,4.0,3.0,3.0
145,4.0,5.0,3.0,4.0,5.0,4.0,4.0,3.0
146,4.0,5.0,5.0,5.0,5.0,4.0,4.0,5.0
147,5.0,4.0,4.0,5.0,5.0,4.0,4.0,4.0
148,4.0,5.0,4.0,3.0,3.0,3.0,5.0,4.0
149,5.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0
150,3.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0
151,5.0,3.0,4.0,5.0,1.0,3.0,3.0,3.0
152,5.0,3.0,3.0,5.0,2.0,5.0,2.0,4.0


***3.1) Independent T Test***




Two-sample t-test: assumptions

- The two samples are independent.

- The data in each group follow a normal distribution.

- The two samples have similar variances (the homogeneity assumption).

Independent t-test

The two-sample t-test has the following hypotheses:

H0 => µ1 = µ2 (the population mean of dataset_1 is equal to that of dataset_2)

HA => µ1 ≠ µ2 (the population mean of dataset_1 is different from that of dataset_2)
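As a minimal sketch of what the equal-variance test computes, the pooled t statistic can be written out by hand and compared with `scipy.stats.ttest_ind`. The data below are synthetic and the helper name `pooled_t_statistic` is hypothetical; neither comes from the survey.

```python
import numpy as np
from scipy import stats


def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples, assuming equal variances."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two unbiased sample variances
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Synthetic stand-ins for two groups of sizes 143 and 30 (not survey data)
rng = np.random.default_rng(0)
group_1 = rng.normal(loc=4.0, scale=1.0, size=143)
group_2 = rng.normal(loc=4.2, scale=1.0, size=30)

manual_t = pooled_t_statistic(group_1, group_2)
scipy_t = stats.ttest_ind(group_1, group_2, equal_var=True).statistic
print(manual_t, scipy_t)  # the two values should agree
```

The statistic is then compared against a Student t distribution with n1 + n2 - 2 degrees of freedom to obtain the two-tailed p-value.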

The Bootstrap method

We test the hypothesis that the means are different (a two-tailed test) and use the 'bias-corrected and accelerated' (BCa) correction on the confidence interval.

**Assumptions: Levene’s Test for Equality of Variances**

Levene’s test is a widely used inferential test of equal variances that remains reliable when the data are drawn from non-normal distributions.

First, we test for homogeneity of variances using Levene’s test.

The Levene test tests the null hypothesis that all input samples are from populations with equal variances. A small p-value suggests that the populations do not have equal variances.

Hypotheses:

H0: The variances are homogeneous (equal variances assumed) if p > 0.05


H1: The variances are not homogeneous (equal variances not assumed) if p ≤ 0.05

In [None]:
# Run Levene's test for each factor, comparing users (df3) with developers (df4)
for col in df3.columns:
    print(stats.levene(df3[col], df4[col]))


LeveneResult(statistic=0.944396842429332, pvalue=0.33252226241648275)
LeveneResult(statistic=2.698810210800768, pvalue=0.10226073140080995)
LeveneResult(statistic=0.04786478641014595, pvalue=0.8270821517092352)
LeveneResult(statistic=0.567319271815448, pvalue=0.4523618813852538)
LeveneResult(statistic=1.4176230658661217, pvalue=0.23544522850327487)
LeveneResult(statistic=4.563789892826454e-06, pvalue=0.9982979683915618)
LeveneResult(statistic=1.408884048029599, pvalue=0.23688807089215025)
LeveneResult(statistic=2.2094892074783563, pvalue=0.1390055119303004)


Levene's test is not significant for any of the eight factors (all p > 0.05), so we assume homogeneity of variances and proceed.
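The decision rule above can be sketched in code: the Levene p-value is compared with alpha, and the outcome selects the `equal_var` argument of the t-test. This is an illustrative sketch on synthetic data, not the survey responses; the variable names are assumptions.

```python
import numpy as np
from scipy import stats

alpha = 0.05
rng = np.random.default_rng(42)

# Synthetic stand-ins for one factor in each group (not survey data)
users = rng.normal(loc=4.0, scale=1.0, size=143)
developers = rng.normal(loc=4.0, scale=1.0, size=30)

levene_stat, levene_p = stats.levene(users, developers)

# Fail to reject H0 (equal variances) when p > alpha
equal_var = bool(levene_p > alpha)

# equal_var=True gives Student's t-test, equal_var=False gives Welch's t-test
t_stat, t_p = stats.ttest_ind(users, developers, equal_var=equal_var)
print(f"Levene p = {levene_p:.3f}, equal_var = {equal_var}, t-test p = {t_p:.3f}")
```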


In [None]:
# Print the variance of each factor for both data groups
# (np.var uses ddof=0 by default, i.e. the population variance)
print(np.var(df3), np.var(df4))

D03F01    1.290332
D03F02    1.412392
D03F03    1.915497
D03F04    1.061861
D03F05    1.749719
D03F06    1.276053
D03F07    1.565847
D03F08    1.532104
dtype: float64 D03F01    0.440000
D03F02    0.512222
D03F03    1.782222
D03F04    0.448889
D03F05    2.288889
D03F06    1.382222
D03F07    0.890000
D03F08    0.898889
dtype: float64


**Assumptions: Distribution of groups (normal distribution or not)**

See the separate notebook with the Kolmogorov-Smirnov and Shapiro-Wilk tests.

**Traditional Independent T Test**

In [None]:
import scipy.stats as stats
from scipy.stats import ttest_ind

# Significance level for the two-sample t-test
alpha = 0.05

# equal_var=True runs the standard independent two-sample t-test, which assumes equal population variances
ttest_ind(df3, df4, equal_var=True)

Ttest_indResult(statistic=array([-0.94564456, -1.97584469, -1.15099271, -0.753206  , -2.46509197,
        1.15836253, -1.01540161, -1.37119063]), pvalue=array([0.34566421, 0.04978112, 0.25134212, 0.45236188, 0.01468546,
       0.24833133, 0.31134896, 0.17211231]))

**Independent T Test Results:**

For six of the eight factors the p-value is greater than alpha = 0.05, so we cannot reject the null hypothesis of the independent t-test for those factors.

For the remaining two factors, D03F02 (p = 0.0498) and D03F05 (p = 0.0147), the p-value is below alpha, so we have evidence that the mean scores of users and developers differ on these two non-adoption factors.
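A short sketch of how the per-factor decision can be read off the p-values printed above. The p-values are copied from the `ttest_ind` output of this notebook; the factor labels and variable names are introduced here only for illustration.

```python
import numpy as np

alpha = 0.05
factors = [f"D03F{i:02d}" for i in range(1, 9)]

# p-values copied from the ttest_ind(df3, df4, equal_var=True) output above
pvalues = np.array([0.34566421, 0.04978112, 0.25134212, 0.45236188,
                    0.01468546, 0.24833133, 0.31134896, 0.17211231])

# Keep the factors whose p-value falls below the significance level
significant = [factor for factor, p in zip(factors, pvalues) if p < alpha]
print(significant)  # ['D03F02', 'D03F05']
```

Note that D03F02 sits very close to the threshold, so its classification is sensitive to the choice of alpha.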


***3.2) Independent T Test with Bootstrap***

To obtain a two-tailed 95% CI, we can simply order the bootstrap means from smallest to largest and take the 2.5% and 97.5% quantiles. With 1,000 resamples, the 2.5% quantile is approximately the 25th smallest mean (because 1,000 * 0.025 = 25), and the 97.5% quantile is approximately the 975th mean in ascending order, or roughly the 25th largest mean.
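The percentile interval described here can be sketched with plain NumPy. The sample is synthetic (random Likert-style scores from 1 to 5), and the seed and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic Likert-style sample of 143 responses (not survey data)
sample = rng.integers(1, 6, size=143).astype(float)

n_resamples = 1000
# Each row of idx holds the indices of one resample drawn with replacement
idx = rng.integers(0, len(sample), size=(n_resamples, len(sample)))
sorted_means = np.sort(sample[idx].mean(axis=1))

# 2.5% and 97.5% quantiles of the bootstrap means (linear interpolation)
low, high = np.percentile(sorted_means, [2.5, 97.5])

# With 1,000 resamples these lie next to the 25th and 975th sorted means
print("Percentile CI:", low, high)
print("25th smallest mean:", sorted_means[24])
print("975th smallest mean:", sorted_means[974])
```

The interpolated quantiles and the order statistics differ only slightly, which is why the counting rule in the text works as an approximation.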

The bootstrap can also be used to estimate confidence intervals of multi-sample statistics, including those calculated by hypothesis tests. scipy.stats.ttest_ind performs the independent t-test for equal means and returns two outputs: a statistic and a p-value. To get a confidence interval for the test statistic, we first wrap scipy.stats.ttest_ind in a function that accepts two sample arguments, accepts an axis keyword argument, and returns only the statistic.
We then pass this function to scipy.stats.bootstrap to get a confidence interval for the test statistic.
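A minimal sketch of such a wrapper for a single factor, assuming one-dimensional samples. The data are synthetic, and the function name `t_statistic` and the choice of the percentile method are assumptions for illustration rather than the notebook's own setup.

```python
import numpy as np
from scipy.stats import bootstrap, ttest_ind


def t_statistic(sample_1, sample_2, axis=-1):
    """Return only the t statistic, forwarding `axis` so resamples are handled in batches."""
    return ttest_ind(sample_1, sample_2, equal_var=True, axis=axis).statistic


rng = np.random.default_rng(0)
# Synthetic Likert-style scores for one factor (not survey data)
users = rng.integers(1, 6, size=143).astype(float)
developers = rng.integers(1, 6, size=30).astype(float)

res = bootstrap((users, developers), t_statistic, n_resamples=1000,
                confidence_level=0.95, method='percentile', random_state=rng)

ci = res.confidence_interval
print("95% CI for the t statistic:", ci.low, ci.high)
```

Forwarding `axis` matters because `bootstrap` calls the statistic on stacked resamples; a wrapper that ignores it computes the test along the wrong dimension.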

In [None]:
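def my_statistic(df3, df4, axis=1):
    # Note: bootstrap passes `axis`, but it is not forwarded here, so ttest_ind
    # runs with its default axis=0 and returns both the statistic and the p-value.
    statistic = ttest_ind(df3, df4, equal_var=True)
    return statistic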

df3_t = df3.values.tolist()
df4_t = df4.values.tolist()

data = (df3_t, df4_t)
res = bootstrap(data, my_statistic, axis=-1, confidence_level=0.95, n_resamples=1000, 
                         random_state=1000, method='percentile')
print("independent t test array ")
print(ttest_ind(df3, df4)[0]) 
print("independent t test p-value")
print(ttest_ind(df3, df4)[1]) 

print("CI Low 2,5%")
print(res.confidence_interval.low)

print("CI High 97,5%")
print(res.confidence_interval.high)


independent t test array 
[-0.94564456 -1.97584469 -1.15099271 -0.753206   -2.46509197  1.15836253
 -1.01540161 -1.37119063]
independent t test p-value
[0.34566421 0.04978112 0.25134212 0.45236188 0.01468546 0.24833133
 0.31134896 0.17211231]
CI Low 2,5%
[[-4.39721153e+00 -6.17629810e+00 -7.44848618e+00 ... -6.42565301e+00
  -3.38932584e+00 -7.52313141e+00]
 [ 1.58681019e-04  2.16255653e-04  8.09949357e-12 ...  2.33198502e-04
   8.69893548e-04  2.90439581e-12]]
CI High 97,5%
[[1.98844096 3.20465423 0.00727232 ... 1.13924409 1.4502219  2.08847696]
 [0.71356455 0.82740296 0.81780316 ... 0.86351105 0.50267057 0.94393834]]


In [None]:
# Published by https://github.com/mayer79/Bootstrap-p-values/blob/master/Bootstrap%20p%20values.ipynb

%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
from scipy.stats import bootstrap


def boot_matrix(z, B):
    """Bootstrap sample
    
    Returns all bootstrap samples in a matrix"""
    
    n = len(z)  # sample size
    idz = np.random.randint(0, n, size=(B, n))  # indices to pick for all bootstrap samples
    return z[idz]

def bootstrap_mean(x, B=10000, alpha=0.05, plot=False):
    """Bootstrap standard error and (1-alpha)*100% c.i. for the population mean

    Returns bootstrapped standard error and different types of confidence intervals"""
   
    # Deterministic things
    n = len(x)  # sample size
    orig = x.mean()  # sample mean
    se_mean = x.std()/np.sqrt(n) # standard error of the mean
    qt = stats.t.ppf(q=1 - alpha/2, df=n - 1) # Student quantile
    
    # Generate bootstrap distribution of sample mean
    xboot = boot_matrix(x, B=B)
    sampling_distribution = xboot.mean(axis=1)
   
   # Standard error and sample quantiles
    se_mean_boot = sampling_distribution.std()
    quantile_boot = np.percentile(sampling_distribution, q=(100*alpha/2, 100*(1-alpha/2)))
 
    # RESULTS
    print("Estimated mean:", orig)
    print("Classic standard error:", se_mean)
    print("Classic student c.i.:", orig + np.array([-qt, qt])*se_mean)
    print("\nBootstrap results:")
    print("Standard error:", se_mean_boot)
    print("t-type c.i.:", orig + np.array([-qt, qt])*se_mean_boot)
    print("Percentile c.i.:", quantile_boot)
    print("Basic c.i.:", 2*orig - quantile_boot[::-1])

    if plot:
        plt.hist(sampling_distribution, bins="fd")  

def bootstrap_t_pvalue(x, y, equal_var=False, B=10000, plot=False):
    """Bootstrap p values for two-sample t test
    
    Prints bootstrap p value and parametric p value"""
    
    # Original t test statistic
    orig = stats.ttest_ind(x, y, equal_var=equal_var)
    
    # Generate bootstrap distribution of t statistic
    # Note: for 2-D inputs, x.mean() is the grand mean and the proportion below is
    # taken over all columns, so a single pooled bootstrap p value is printed.
    xboot = boot_matrix(x - x.mean(), B=B) # important centering step to get sampling distribution under the null
    yboot = boot_matrix(y - y.mean(), B=B)
    sampling_distribution = stats.ttest_ind(xboot, yboot, axis=1, equal_var=equal_var)[0]

    # Calculate proportion of bootstrap samples with at least as strong evidence against null    
    p = np.mean(sampling_distribution >= orig[0])
    
    # RESULTS
    print("p value for null hypothesis of equal population means:")
    print("Parametric:", orig[1])
    print("Bootstrap:", 2*min(p, 1-p))
    
    # Plot bootstrap distribution
    if plot:
        plt.figure()
        plt.hist(sampling_distribution, bins="fd") 

# Example: bootstrap 95% confidence intervals for the true mean of a right-skewed sample.
#np.random.seed(984564) # for reproducibility
#x = np.random.exponential(scale=20, size=30)
#%time bootstrap_mean(x, plot=True)

# Normal distributions with equal variances.
#np.random.seed(984564) # for reproducibility
#x = np.random.normal(loc=11, scale=20, size=30)
#y = np.random.normal(loc=15, scale=20, size=20)
#%time bootstrap_t_pvalue(x, y)

# Welch's t-test for unequal variances
#np.random.seed(345244) # for reproducibility
#x = np.random.normal(loc=11, scale=20, size=30)
#y = np.random.normal(loc=15, scale=10, size=20)
#%time bootstrap_t_pvalue(x, y, plot=True)

# Welch's t-test for unequal variances and non-normality
np.random.seed(399888) # for reproducibility
#x = np.random.exponential(scale=20, size=30)
#y = np.random.exponential(scale=10, size=20)
# Welch's t-test for equal variances and non-normality (group sizes match the survey: 143 and 30)
x = np.random.exponential(scale=20, size=143)
y = np.random.exponential(scale=20, size=30)
%time bootstrap_t_pvalue(x, y, plot=True)

x = df3.values
y = df4.values

bootstrap_t_pvalue(x, y, equal_var=False, B=10000, plot=False)

p value for null hypothesis of equal population means:
Parametric: 0.926926664263225
Bootstrap: 0.911



CPU times: user 149 ms, sys: 3.37 ms, total: 153 ms
Wall time: 160 ms
p value for null hypothesis of equal population means:
Parametric: [0.19419384 0.00897302 0.2491984  0.33021136 0.03108258 0.26994238
 0.23226133 0.11224694]
Bootstrap: 0.22567500000000007
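The output above shows eight parametric p-values but a single bootstrap p-value, because the function centers on the grand mean and averages over all columns when given two-dimensional input. A per-factor variant could be sketched as follows. This is an illustrative adaptation on synthetic data; the function name `bootstrap_t_pvalue_per_column`, the seed, and the smaller `B` are assumptions, not part of the cited notebook.

```python
import numpy as np
from scipy import stats


def bootstrap_t_pvalue_per_column(x, y, equal_var=False, B=2000, seed=0):
    """Two-sided bootstrap p value of the two-sample t test, one per column."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Observed t statistic for each column, shape (k,)
    orig = stats.ttest_ind(x, y, axis=0, equal_var=equal_var).statistic

    # Center each column on its own mean to impose the null hypothesis
    x_centered = x - x.mean(axis=0)
    y_centered = y - y.mean(axis=0)

    # Resample rows with replacement: result has shape (B, n, k)
    xboot = x_centered[rng.integers(0, len(x), size=(B, len(x)))]
    yboot = y_centered[rng.integers(0, len(y), size=(B, len(y)))]
    t_boot = stats.ttest_ind(xboot, yboot, axis=1, equal_var=equal_var).statistic

    # Proportion of resampled statistics at least as large as the observed one, per column
    p = np.mean(t_boot >= orig, axis=0)
    return 2 * np.minimum(p, 1 - p)


# Synthetic data: three columns, the first with a large true difference in means
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(143, 3))
y_demo = rng.normal(size=(30, 3))
y_demo[:, 0] += 3.0

p_boot = bootstrap_t_pvalue_per_column(x_demo, y_demo)
print(p_boot)
```

For one-dimensional input this reduces to the same calculation as the original function, apart from the random number generator used.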


In [None]:
from scipy.stats import bootstrap
import numpy as np

# Resample the data: for each of n_resamples, draw a random sample (with replacement) from the original sample, of the same size as the original sample.
# Method: bias-corrected and accelerated bootstrap confidence interval ('BCa')

df3_t = (df3,)

# Calculate the 95% bootstrapped confidence interval for the mean of each factor (users)
bootstrap_ci = bootstrap(df3_t, np.mean, confidence_level=0.95, n_resamples=1000, 
                         random_state=1000, method='BCa')

# View the 95% bootstrapped confidence interval
print(bootstrap_ci.confidence_interval)


ConfidenceInterval(low=array([3.95804196, 3.79020979, 2.90574639, 4.17482517, 2.45392971,
       3.92307692, 3.63557276, 3.41958042]), high=array([4.34965035, 4.16733557, 3.35664336, 4.52447552, 2.87412587,
       4.29370629, 4.04195804, 3.83167095]))


In [None]:
from scipy.stats import bootstrap
import numpy as np

# Resample the data: for each of n_resamples, draw a random sample (with replacement) from the original sample, of the same size as the original sample.
# Method: bias-corrected and accelerated bootstrap confidence interval ('BCa')

df4_t = (df4,)

# Calculate the 95% bootstrapped confidence interval for the mean of each factor (developers)
bootstrap_ci = bootstrap(df4_t, np.mean, confidence_level=0.95, n_resamples=9999, 
                         random_state=1000, method='BCa')

# View the 95% bootstrapped confidence interval
print(bootstrap_ci.confidence_interval)

ConfidenceInterval(low=array([4.1       , 4.1       , 2.93333333, 4.2       , 2.73333333,
       3.33333333, 3.66666667, 3.53333333]), high=array([4.6       , 4.63333333, 3.9       , 4.7       , 3.8       ,
       4.2       , 4.36666667, 4.23333333]))


**Independent T Test with Bootstrap Results**

The independent Student’s t-test with bootstrap was also statistically significant. The bootstrap-t approach can work better than the standard t-test CI when normality is in doubt.
The parametric p-values reported by bootstrap_t_pvalue (equal_var=False) show that “Not having documentation about the usage” (p = 0.009) and “Not having open source” (p = 0.031) scored significantly differently between users and developers at the 95% confidence level.