# Linus’s Law: More Eyes Fewer Flaws in Open Source Projects

Danilo Favato, Daniel Ishitani, Johnatan Oliveira, Eduardo Figueiredo

Department of Computer Science, Federal University of Minas Gerais (UFMG)

Belo Horizonte, Brazil

This Python notebook reproduces our sampling steps.

## Main study methodology

Importing required packages.

In [None]:
import random
import json
import time  # Needed for the pause between GitHub API retries
import requests  # External lib, requires installation

from collections import defaultdict
import pandas as pd  # External lib, requires installation

### Sampling

**Caution**:
Because we cannot control the GitHub search API or how the projects evolve, the steps below may generate a different sample from the one used in the original study. If you want to use the exact data sampled by the authors, skip to the SonarQube Analysis section.

We tried to avoid selection bias by employing a method as close to simple random sampling as possible.

The GitHub search API returns only the first 1000 results, paginated in pages of 30. We drew 100 distinct numbers from 0 to 999 (that is, sampling without replacement). We then computed the page, and the position within that page, for each number in the sequence.


In [None]:
random.seed(27091989)
items = random.sample(range(0, 1000), 100)
items.sort()
sample_pages = defaultdict(list)
for item in items:
    sample_pages[item // 30 + 1].append(item % 30)
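
The index arithmetic in the cell above can be checked on a few boundary values. The snippet below is only an illustrative sketch: the helper `locate` and the constant `PAGE_SIZE` are hypothetical names that do not appear in the original notebook. It assumes, as the cell above does, 0-based result indices, 1-based page numbers, and 30 results per page.

```python
# Illustrative helper (not part of the original notebook): map a 0-based
# search-result index to its 1-based results page and 0-based position.
PAGE_SIZE = 30  # page size assumed by the sampling cell


def locate(index, page_size=PAGE_SIZE):
    """Return (page, position) for a 0-based result index."""
    return index // page_size + 1, index % page_size


print(locate(0))    # (1, 0)
print(locate(29))   # (1, 29)
print(locate(30))   # (2, 0)
print(locate(999))  # (34, 9)
```

Under these assumptions the last reachable result, index 999, falls on page 34, so the sampled pages range from 1 to 34.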

Querying the GitHub API

In [None]:
sample = []
for page in sample_pages:
    search_url = f'https://api.github.com/search/repositories?page={page}&q=python+language:python'
    r = requests.get(search_url)
    while not r.ok:
        time.sleep(10)
        r = requests.get(search_url)
    for item in sample_pages[page]:
        sample.append(r.json()['items'][item])
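
The loop above retries indefinitely while the response is not OK. If a bounded number of attempts is preferred, one possible variant is sketched below. This is not the authors' code: `get_with_retry`, `max_attempts`, `wait_seconds`, and `sleep` are hypothetical names introduced here for illustration, and the limit of 5 attempts is an arbitrary assumption.

```python
import time


def get_with_retry(fetch, max_attempts=5, wait_seconds=10, sleep=time.sleep):
    """Call fetch() until its response is OK or the attempts run out.

    fetch is any zero-argument callable returning an object with an `ok`
    attribute, for example: lambda: requests.get(search_url)
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        if response.ok:
            return response
        if attempt < max_attempts:
            sleep(wait_seconds)  # pause before the next attempt
    raise RuntimeError(f'request still failing after {max_attempts} attempts')
```

In the querying cell, `r = get_with_retry(lambda: requests.get(search_url))` would then stand in for the `while not r.ok` loop. The `sleep` parameter is injectable only so that the helper can be exercised without actually waiting.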

Saving selected projects. Uncomment if you wish to overwrite original data.

In [None]:
#with open('sample.json', 'w') as f:
#    f.write(json.dumps(sample))

Cloning every project. Uncomment if you wish to overwrite original data.

In [None]:
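# Build the DataFrame here: df is otherwise only defined in the SonarQube section below.
df = pd.DataFrame.from_records(sample)
for row in df.itertuples():
    # Each project is cloned into a folder named after the repository owner.
    project_folder = 'projects/' + row.full_name.split('/')[0]
#    !mkdir -p {project_folder}
#    !git -C {project_folder} clone {row.git_url} --depth=1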

### SonarQube Analysis

**Requirements:** You must have a SonarQube instance running on http://localhost:9000 that can be accessed with the credentials admin:admin

**Caution:** Using a different version or quality profile may change the results obtained by the authors. The original study used SonarQube version 7.7 with the standard Python quality profile. If you wish to use the exact data from the original study, proceed to the Statistical Analysis section.


Load sampled projects

In [None]:
with open('sample.json', 'r') as f:
    sample = json.load(f)
df = pd.DataFrame.from_records(sample)
df['project_name'] = df.apply(lambda x: x['full_name'].replace('/', '.'), axis=1)
df.set_index('project_name', inplace=True)
fields = [
    'size', 'open_issues', 'forks', 'watchers'
]

Extract the source code from the compressed archive. Skip this step if you have downloaded the projects yourself:

In [None]:
!cat projects_source_code.tar.7z.parta* > projects_source_code.tar.7z
!7z x -so projects_source_code.tar.7z | tar xf - -C .

Creating the projects in SonarQube

In [None]:
for row in df.itertuples():
    project_name = row.full_name.replace('/', '.')
    !curl -u admin:admin -X POST 'http://localhost:9000/api/projects/create?key={project_name}&name={project_name}' 

Running the Sonar-Scanner

In [None]:
for row in df.itertuples():
    project_name = row.full_name.replace('/', '.')
    project_folder = 'projects/' + row.full_name.split('/')[0]
    !sonar-scanner -Dsonar.projectKey={project_name} -Dsonar.sources={project_folder} -Dsonar.host.url=http://localhost:9000

Extract analysis results from SonarQube.

In [None]:
metrics = (
    'ncloc',
    'complexity',
    'duplicated_lines_density',
    'sqale_rating',
    'sqale_index',
    'sqale_debt_ratio',
    'reliability_remediation_effort',
    'security_remediation_effort'
)
_metrics = ','.join(metrics)

test_data = []
for row in df.itertuples():
    project_key = row.full_name.replace('/', '.')
    url = f'http://localhost:9000/api/measures/search?projectKeys={project_key}&metricKeys={_metrics}'
    r = requests.get(url)
    measures = r.json()['measures']
    for m in measures:
        test_data.append({
            'project': m['component'],
            'metric': m['metric'],
            'value': float(m['value'])
        })
        
TYPES = ('CODE_SMELL', 'BUG')
SEVERITIES = ('BLOCKER', 'CRITICAL', 'MAJOR', 'MINOR', 'INFO')
issues_data = []
for row in df.itertuples():
    project_key = row.full_name.replace('/', '.')
    project_metrics = {'project_key': project_key}
    for t in TYPES:
        for s in SEVERITIES:
            url = f'http://localhost:9000/api/issues/search?componentKeys={project_key}&languages=py&types={t}&severities={s}'
            r = requests.get(url)
            project_metrics['_'.join((t, s)).lower()] = r.json()['total']
    issues_data.append(project_metrics)
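
The issue counts above are stored under keys built from each type and severity pair. As a quick check of which column names this produces, the short sketch below enumerates them using the same `TYPES` and `SEVERITIES` tuples. The variable `columns` is introduced here only for illustration.

```python
from itertools import product

TYPES = ('CODE_SMELL', 'BUG')
SEVERITIES = ('BLOCKER', 'CRITICAL', 'MAJOR', 'MINOR', 'INFO')

# Same key construction as in the cell above: lower-cased "type_severity".
columns = ['_'.join((t, s)).lower() for t, s in product(TYPES, SEVERITIES)]

print(len(columns))  # 10
print(columns[0])    # code_smell_blocker
print(columns[-1])   # bug_info
```

Each project therefore contributes ten issue-count columns, in addition to `project_key`.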

Data transformation

In [None]:
issues_df = pd.DataFrame(
    issues_data, index=[x['project_key'] for x in issues_data]
)

stat_df = pd.merge(
    df[fields],
    issues_df,
    how='left', left_index=True, right_index=True
).drop_duplicates()

Save results. Uncomment if you wish to overwrite original data.

In [None]:
#stat_df.to_excel('data.xlsx')

### Statistical Analysis

Please refer to the R notebook for the Statistical Analysis.