# OLIVIA
**Open-source Library Indexes Vulnerability Identification and Analysis**

Centralized library repositories are used to reduce development time and cost in virtually all languages and types of software projects. Due to the transitivity of dependencies, a single defect in the repository can have extensive and difficult-to-predict effects on the ecosystem. These defects can cause functional errors, performance problems, or security issues. The risk is difficult to grasp for developers, who explicitly import only a small part of their dependencies.

OLIVIA uses an approach based on the vulnerability of the dependency network of software packages, which measures how sensitive the repository is to the random introduction of defects. The goals of the model are to contribute to the understanding of the propagation mechanisms of software defects and to study feasible protection strategies. This can benefit multiple parties:

* **Centralised package managers**, to establish policies and manual or automatic control processes that improve the security and stability of the repositories.
* **Software developers** in general, to assess the different risks introduced by the dependencies used in their projects, and **package developers** in particular, to understand their responsibility towards the ecosystem.
* Developers of **continuous quality tools**, to define the concept of vulnerability based on the modeling of the network of package dependencies.
---

**Author**: Daniel Setó Rey

https://github.com/dsr0018/olivia

**License**: Olivia and this notebook are published under an MIT [license](https://github.com/dsr0018/olivia/blob/master/LICENSE). The dependency information has been obtained from the libraries.io [data snapshots](https://libraries.io/data) (by Tidelift).

---

*This notebook is part of a user guide series that covers the operation of the library in detail.*
*This installment covers the basic operations of loading and creating models and working with package properties and metrics.*



**<span style="color:red">WARNING</span>**

The ***numpy==1.18.5*** package is a dependency of OLIVIA, but a prebuilt version is no longer available.

Compiling the package takes a while and requires additional dependencies to be installed on the system.

In [2]:
import sys
sys.path.append('../../../olivia/')

!pip install -r ../../../olivia/requirements.txt



## A - Basic model usage

[01 - Load model](#01---Load-model)&ensp;|&ensp;[02 - Package properties](#02---Package-properties)&ensp;|&ensp;[03 - Package metrics](#03---Package-metrics)&ensp;|&ensp;[04 - Custom models](#04---Custom-models) 

*OliviaNetwork* is essentially a directed graph with some additional structures to facilitate working with metrics in large dependency networks. The model can be built from a NetworkX directed network or from a file in adjacency list format.
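To make the adjacency list idea concrete, the sketch below parses a few lines of text into a plain dictionary of directed arcs. The line layout (a package name followed by the packages that depend on it), the direction of the arcs, and the `parse_adjacency_list` helper are all assumptions made for this illustration; the file format that OLIVIA actually reads is not described here.

```python
from collections import defaultdict


def parse_adjacency_list(text):
    """Parse lines of the form 'source target1 target2 ...' into a dict of sets."""
    graph = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        source, *targets = line.split()
        graph[source].update(targets)
        for target in targets:
            graph.setdefault(target, set())  # every node gets an entry
    return dict(graph)


# Hypothetical toy data: each arc points from a package to one that depends on it
toy = """
R methods stats
methods BiocGenerics
stats BiocGenerics
"""
network = parse_adjacency_list(toy)
print(len(network))  # number of packages in the toy network
```

A dictionary of sets is enough for small experiments; a real model would need extra structures to stay fast on large networks.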

In [3]:
from olivia.model import OliviaNetwork

### 01 - Load model

Load a pre-built model from file. This one is based on a snapshot of Bioconductor (https://bioconductor.org/) from 2020-01 (data from https://libraries.io/):

In [4]:
bioconductor = OliviaNetwork(r'../results/olivia_prebuilts/bioconductor.olv')

As expected, *len()* returns the number of packages in the network:

In [5]:
len(bioconductor)

3509

You may iterate over package names:

In [6]:
# Packages in bioconductor starting with 'b'
for package in bioconductor:
    if package.startswith('b'):
        print(package)

basecallQC
benchdamic
biscuiteer
branchpointer
bamsignals
bayNorm
bigPint
biomformat
banocc
bcSeq
bigmelon
biodbKegg
bnbc
bugsigdbr
bandle
bioDist
biodb
biodbNci
blacksheepr
bsseq
barcodetrackR
beadarray
bioCancer
biodbHmdb
biodbUniprot
blima
breakpointR
bambu
baySeq
bgx
biocViews
biodbLipidmaps
bnem
bumphunter
ballgown
bioassayR
biodbExpasy
biodbNcbi
biovizBase
brainflowprobes
bacon
beer
biobroom
biodbChebi
biodbMirbase
biotmle
borealis
batchelor
biobtreeR
biomaRt
bluster
biocGraph
biomvRCNS
beachmat
biosigner
basilisk.utils
beadarraySNP
basilisk
brendaDb
betareg
biocthis
broom
base64enc
bit64
biglm
beanplot
boot
biclust
bitops
biscuiteerData
bezier
bigmemory
binom
bbmle
batchtools
bookdown
biwt
biganalytics
beeswarm
bibtex
bslib
base
bestNormalize
bigrquery
bnlearn
bsplus
baseline
brglm
bootstrap
brew
bamlss
benchmarkme
breastCancerVDX
bench
backbone
bs4Dash
bdsmatrix
breakpointRdata
bit
base64url
bladderbatch
base64
blockmodeling
binr
blme
babelgene
bnstruct
bigstatsr
biomartr
broom

### 02 - Package properties

Access via *getitem* returns a special view object:

In [7]:
bioconductor['BiocGenerics']

<olivia.model.PackageInfoView at 0x7f8c9d6f4940>

*PackageInfoView* contains methods returning stats for specific packages. For example, the direct dependants of BiocGenerics are the other packages that import it in their code:

In [8]:
print(f"BiocGenerics has {len(bioconductor['BiocGenerics'].direct_dependants())} direct dependants")  

BiocGenerics has 480 direct dependants


Packages may depend on BiocGenerics not only directly, but also via transitive dependencies:

![Dependants and transitive dependants](docs/img/dependants.png "Transitive dependencies")  
<br>

In [9]:
print(f"BiocGenerics has {len(bioconductor['BiocGenerics'].transitive_dependants())} transitive dependants (includes direct dependants)")

BiocGenerics has 1704 transitive dependants (includes direct dependants)
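The difference between direct and transitive dependants can be reproduced on a toy graph with a breadth-first traversal. This is only a sketch in plain Python: the `dependants` dictionary and the `transitive_dependants` function are hypothetical stand-ins, not OLIVIA's implementation.

```python
from collections import deque

# Hypothetical toy network: each key maps to the packages that directly depend on it
dependants = {
    'core': {'utils', 'io'},
    'utils': {'app'},
    'io': {'app'},
    'app': set(),
}


def transitive_dependants(graph, package):
    """Return every package reachable from `package` by following dependant arcs."""
    seen = set()
    queue = deque(graph[package])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph[current])
    seen.discard(package)  # a package is not its own dependant, even inside a cycle
    return seen


direct = dependants['core']
transitive = transitive_dependants(dependants, 'core')
print(sorted(direct))
print(sorted(transitive - direct))  # dependants that never import 'core' explicitly
```

Because both results are sets, the set difference in the last line isolates the packages that are affected only indirectly.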


Packages are returned as sets so we may apply the usual set operators. For example, these packages depend on BiocGenerics but do not explicitly import it:

In [10]:
print(bioconductor['BiocGenerics'].transitive_dependants() - bioconductor['BiocGenerics'].direct_dependants())

{'dpeak', 'SimBindProfiles', 'scDDboost', 'circRNAprofiler', 'CancerSubtypes', 'TOAST', 'sscore', 'PrInCE', 'MungeSumstats', 'coMET', 'cbaf', 'GEOsubmission', 'DExMA', 'ZygosityPredictor', 'RegEnrich', 'multiSight', 'FunChIP', 'CSSQ', 'InteractiveComplexHeatmap', 'missRows', 'derfinderPlot', 'BiocOncoTK', 'scde', 'hyperdraw', 'HiCDCPlus', 'DIAlignR', 'BiocSklearn', 'staRank', 'GRridge', 'countsimQC', 'CSAR', 'getDEE2', 'StarBioTrek', 'MAGAR', 'ChIPsim', 'affycomp', 'rCGH', 'CTdata', 'GeneStructureTools', 'immunoClust', 'scRepertoire', 'OCplus', 'crisprseekplus', 'pqsfinder', 'miRSM', 'GenomicDistributions', 'vissE', 'iterClust', 'metagenomeSeq', 'DMRforPairs', 'scry', 'goTools', 'Rhisat2', 'IFAA', 'SparseSignatures', 'dasper', 'MsFeatures', 'MGFM', 'HiTC', 'EpiMix', 'Cormotif', 'HTSFilter', 'CoCiteStats', 'ASSIGN', 'BgeeDB', 'flowPloidy', 'scone', 'GEOmetadb', 'BBCAnalyzer', 'pvac', 'fedup', 'AgiMicroRna', 'FastqCleaner', 'geneXtendeR', 'EBSEA', 'MPRAnalyze', 'TOP', 'MethReg', 'netZooR

Let's now check the direct dependencies of BiocGenerics, that is, the packages on which it depends directly:

In [11]:
bioconductor['BiocGenerics'].direct_dependencies()

{'R', 'graphics', 'methods', 'stats', 'utils'}

Dependencies can also be followed transitively. For MetaboCoreUtils, BiocGenerics is not a direct dependency but a transitive one:

In [12]:
bioconductor['MetaboCoreUtils'].transitive_dependencies()

{'BiocGenerics',
 'MASS',
 'MsCoreUtils',
 'R',
 'S4Vectors',
 'clue',
 'graphics',
 'methods',
 'stats',
 'stats4',
 'utils'}

### 03 - Package metrics

**REACH** of a package *u* is the number of transitive dependents of *u* plus 1.

In [13]:
bioconductor['BiocGenerics'].reach()

1705

REACH represents the number of packages in the network that could be affected by the occurrence of a defect in *u*, like a bug or a security vulnerability. A bug in BiocGenerics could affect 1705 packages, including BiocGenerics itself.
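As a small worked example of the definition, the sketch below computes REACH on a four-package toy graph. The graph and the `reach` function are hypothetical and only mirror the definition given above; they say nothing about how OLIVIA computes the metric internally.

```python
# Hypothetical toy network: each key maps to its direct dependants
dependants = {
    'core': {'utils', 'io'},
    'utils': {'app'},
    'io': {'app'},
    'app': set(),
}


def reach(graph, package):
    """REACH = number of transitive dependants of `package`, plus the package itself."""
    seen = {package}
    stack = [package]
    while stack:
        for dependant in graph[stack.pop()]:
            if dependant not in seen:
                seen.add(dependant)
                stack.append(dependant)
    return len(seen)


for name in dependants:
    print(name, reach(dependants, name))
```

Note that each call repeats a full traversal, which is the redundancy that makes a package-by-package computation slow on large networks.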

You may calculate REACH package by package, as in the previous example. However, this involves many redundant computations and is very slow. OLIVIA provides efficient methods to calculate REACH for all the nodes in the network.

In [14]:
from olivia.packagemetrics import Reach
bioconductor_reach = bioconductor.get_metric(Reach)

Computing Reach
     Processing node: 3K      


*OliviaNetwork.get_metric(...)* returns a *MetricStats* object with the results of the computation. *get_metric(...)* accepts as a parameter any class implementing the *compute()* method, such as the ones in *olivia.packagemetrics*.

In [15]:
bioconductor_reach

<olivia.packagemetrics.MetricStats at 0x7f8c9d6f4880>

In [16]:
bioconductor_reach['BiocGenerics']

1705

Once calculated through *get_metric*, the *MetricStats* object is cached into the *OliviaNetwork* model. In this way, other complex algorithms that use large metric results are freed from managing each one on their own. 

So there is really no need to store the results into an independent variable like we did.

In [17]:
bioconductor_reach = bioconductor.get_metric(Reach)

Reach retrieved from metrics cache


In [18]:
bioconductor.get_metric(Reach)['BiocGenerics']

Reach retrieved from metrics cache


1705
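The caching behaviour seen in the previous cells can be pictured as a dictionary keyed by metric class. The following is a minimal sketch under that assumption: `ToyNetwork`, `ToyReach`, and the constructor signature of the metric class are hypothetical, and only the idea of a class with a *compute()* method comes from the text above.

```python
class ToyReach:
    """Hypothetical metric: counts the nodes reachable from each node, itself included."""

    def __init__(self, graph):
        self.graph = graph

    def compute(self):
        results = {}
        for start in self.graph:
            seen, stack = {start}, [start]
            while stack:
                for nxt in self.graph[stack.pop()]:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            results[start] = len(seen)
        return results


class ToyNetwork:
    """Hypothetical model that keeps one cached result per metric class."""

    def __init__(self, graph):
        self.graph = graph
        self._cache = {}

    def get_metric(self, metric_class):
        if metric_class not in self._cache:
            print(f"Computing {metric_class.__name__}")
            self._cache[metric_class] = metric_class(self.graph).compute()
        else:
            print(f"{metric_class.__name__} retrieved from metrics cache")
        return self._cache[metric_class]


net = ToyNetwork({'core': {'utils'}, 'utils': {'app'}, 'app': set()})
first = net.get_metric(ToyReach)   # computed on the first request
second = net.get_metric(ToyReach)  # served from the cache afterwards
print(first is second)
```

Keying the cache by class means callers never need to pass results around: any code holding the model can ask for the metric again at negligible cost.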

The management of the cache is semi-automatic. You can request a value from a network-wide metric that has not yet been calculated and it will be computed and cached the first time:

In [19]:
from olivia.packagemetrics import Impact

%time bioconductor.get_metric(Impact)['BiocGenerics']

Computing Impact
     Processing node: 3K      
CPU times: user 69.6 ms, sys: 498 µs, total: 70.1 ms
Wall time: 396 ms


7357

In [20]:
%time bioconductor.get_metric(Impact)['BiocGenerics']

Impact retrieved from metrics cache
CPU times: user 1.31 ms, sys: 163 µs, total: 1.47 ms
Wall time: 7 ms


7357

In [21]:
bioconductor.get_metric(Impact)['BANDITS']

Impact retrieved from metrics cache


1

In [22]:
bioconductor.get_metric(Impact)['GenomicFeatures']

Impact retrieved from metrics cache


462

By the way, **IMPACT** is an alternative way of measuring the effect of a defect appearing in the network. It corresponds to the number of "links" affected (the number of "imports", in Python terms), which could make it a better measure of the effort required to recover the network. Technically speaking, it is the number of arcs in the subgraph induced by a node and its transitive dependents.
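A literal reading of this definition can be sketched as follows: collect a node together with its transitive dependents, then count the arcs whose two endpoints both belong to that set. The toy graph and the `affected_set` and `impact` helpers are hypothetical, and OLIVIA's own implementation may count arcs differently, so the values need not match the ones shown in this notebook.

```python
# Hypothetical toy network: each key maps to its direct dependants
dependants = {
    'core': {'utils', 'io'},
    'utils': {'app'},
    'io': {'app'},
    'other': {'app'},
    'app': set(),
}


def affected_set(graph, package):
    """Return `package` together with all of its transitive dependants."""
    seen, stack = {package}, [package]
    while stack:
        for dependant in graph[stack.pop()]:
            if dependant not in seen:
                seen.add(dependant)
                stack.append(dependant)
    return seen


def impact(graph, package):
    """Count the arcs of the subgraph induced by the affected set."""
    nodes = affected_set(graph, package)
    return sum(1 for source in nodes for target in graph[source] if target in nodes)


print(impact(dependants, 'core'))
print(impact(dependants, 'utils'))
```

In this toy graph the arc from `other` to `app` is not counted for `core`, because `other` is outside the set affected by a defect in `core`.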

On the other hand, **SURFACE** is the size of the set of transitive dependencies plus 1. SURFACE(*u*) is the number of packages whose defects could affect *u*. High SURFACE packages are more vulnerable to random failures.
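SURFACE mirrors REACH with the arcs followed in the opposite direction, from a package towards what it depends on. The sketch below illustrates this on a toy graph; the `dependencies` dictionary and the `surface` function are hypothetical names used only for this example.

```python
# Hypothetical toy network: each key maps to its direct dependencies
dependencies = {
    'app': {'utils', 'io'},
    'utils': {'core'},
    'io': {'core'},
    'core': set(),
}


def surface(graph, package):
    """SURFACE = number of transitive dependencies of `package`, plus the package itself."""
    seen, stack = {package}, [package]
    while stack:
        for dependency in graph[stack.pop()]:
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return len(seen)


for name in dependencies:
    # Dividing by the network size gives the share of packages whose defects reach `name`
    print(name, surface(dependencies, name), surface(dependencies, name) / len(dependencies))
```

Here `app` sits on top of every other package, so a defect anywhere in this toy network would reach it.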

In [23]:
from olivia.packagemetrics import Surface

bioconductor.get_metric(Surface)['BiocGenerics']

Computing Surface
     Processing node: 0K      


481

![Package metrics](docs/img/pmetrics.png "Olivia package metrics")
<br>

*MetricStats* is not just about storing values. It also has some basic methods that are useful for working with metrics. For example, you can get top and bottom packages according to the metric value:

In [24]:
bioconductor_reach.top(10)

[('R', 2109),
 ('stats', 1997),
 ('methods', 1982),
 ('utils', 1957),
 ('graphics', 1860),
 ('BiocGenerics', 1705),
 ('stats4', 1533),
 ('grDevices', 1490),
 ('S4Vectors', 1462),
 ('Biobase', 1437)]

In [25]:
bioconductor_reach.bottom()

[('metaseqR2', 1)]

In [26]:
bioconductor.get_metric(Surface).top(10)

Surface retrieved from metrics cache


[('R', 1796),
 ('methods', 1493),
 ('stats', 1294),
 ('utils', 1064),
 ('ggplot2', 718),
 ('S4Vectors', 662),
 ('graphics', 640),
 ('grDevices', 594),
 ('GenomicRanges', 540),
 ('IRanges', 516)]

As *MetricStats* implements arithmetic operators, you may define compound metrics or other operations like corrections or normalization:

In [27]:
normalized_reach = bioconductor.get_metric(Reach)/len(bioconductor)
normalized_reach.top(10)

Reach retrieved from metrics cache


[('R', 0.6010259333143345),
 ('stats', 0.5691080079794814),
 ('methods', 0.5648332858364207),
 ('utils', 0.5577087489313195),
 ('graphics', 0.5300655457395269),
 ('BiocGenerics', 0.48589341692789967),
 ('stats4', 0.43687660302080367),
 ('grDevices', 0.42462239954402964),
 ('S4Vectors', 0.41664291821031635),
 ('Biobase', 0.4095183813052152)]

Here we can see that a failure in *R* could affect about 60% of the packages in the network, and that a failure in any of the top ten packages could affect more than 40%.
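Operator support of this kind is typically obtained by implementing Python's arithmetic special methods. The class below is a hypothetical, dictionary-based sketch of the idea; `ToyMetricStats` is not OLIVIA's *MetricStats*, whose internals are not shown in this notebook.

```python
import operator


class ToyMetricStats:
    """Hypothetical container mapping package names to metric values."""

    def __init__(self, values):
        self.values = dict(values)

    def _apply(self, other, op):
        # Combine element-wise with another container, or with a plain number
        if isinstance(other, ToyMetricStats):
            return ToyMetricStats({k: op(v, other.values[k]) for k, v in self.values.items()})
        return ToyMetricStats({k: op(v, other) for k, v in self.values.items()})

    def __truediv__(self, other):
        return self._apply(other, operator.truediv)

    def __sub__(self, other):
        return self._apply(other, operator.sub)

    def __pow__(self, exponent):
        return self._apply(exponent, operator.pow)

    def top(self, n=1):
        """Return the n (package, value) pairs with the highest values."""
        return sorted(self.values.items(), key=lambda kv: kv[1], reverse=True)[:n]


reach = ToyMetricStats({'core': 4, 'utils': 2, 'io': 2, 'app': 1})
normalized = reach / len(reach.values)  # normalize by network size
squared_gap = (reach - 2) ** 2          # a simple compound metric
print(normalized.top(1))
print(squared_gap.top(1))
```

Each operator returns a new container, so expressions can be chained and the result still offers methods such as `top`.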

Likewise, if we normalize SURFACE in relation to the size of the network, we obtain the probability that a uniformly random failure will affect each package:


In [28]:
(bioconductor.get_metric(Surface)/len(bioconductor)).top(10)

Surface retrieved from metrics cache


[('R', 0.5118267312624679),
 ('methods', 0.42547734397264175),
 ('stats', 0.3687660302080365),
 ('utils', 0.30322029068110573),
 ('ggplot2', 0.20461669991450557),
 ('S4Vectors', 0.18865773724707893),
 ('graphics', 0.1823881447705899),
 ('grDevices', 0.16927899686520376),
 ('GenomicRanges', 0.15388999715018523),
 ('IRanges', 0.14705044172128812)]

Some examples of compound metrics:

In [29]:
from olivia.packagemetrics import DependentsCount

mean_degree = bioconductor.get_metric(DependentsCount).values.mean()
degree_divergence = (bioconductor.get_metric(DependentsCount)-mean_degree)**2
degree_divergence.top(5)

Computing Dependents Count
DependentsCount retrieved from metrics cache


[('R', 3193116.4110936164),
 ('methods', 2202046.240389712),
 ('stats', 1651043.3692013395),
 ('utils', 1112875.8798881448),
 ('ggplot2', 502580.78726916516)]

In [30]:
# Impact / Reach ratio
(bioconductor.get_metric(Impact)/bioconductor.get_metric(Reach)).top(5)

Impact retrieved from metrics cache
Reach retrieved from metrics cache


[('methods', 5.011604439959637),
 ('R', 4.87624466571835),
 ('stats', 4.815723585378067),
 ('utils', 4.7884517118037815),
 ('graphics', 4.51505376344086)]