# OLIVIA
**Open-source Library Indexes Vulnerability Identification and Analysis**

The use of centralized library repositories to reduce development time and costs is nearly universal across languages and types of software projects. Because dependencies are transitive, a single defect in the repository can have extensive and hard-to-predict effects on the ecosystem. Such defects can cause functional errors, performance problems, or security issues. The risk is difficult for developers to grasp, since they explicitly import only a small part of their dependencies.

OLIVIA uses an approach based on the vulnerability of the dependency network of software packages, which measures how sensitive the repository is to the random introduction of defects. The goals of the model are to contribute to the understanding of the propagation mechanisms of software defects and to study feasible protection strategies. This can benefit multiple parties:

* **Centralised package managers**, to establish policies and manual or automatic control processes that improve the security and stability of the repositories.
* **Software developers** in general, to assess the different risks introduced by the dependencies used in their projects, and **package developers** in particular, to understand their responsibility towards the ecosystem.
* Developers of **continuous quality tools**, to define the concept of vulnerability based on the modeling of the network of package dependencies.

---

**Author**: Daniel Setó Rey

https://github.com/dsr0018/olivia

**License**: Olivia and this notebook are published under an MIT [license](https://github.com/dsr0018/olivia/blob/master/LICENSE). Dependency information was obtained from the libraries.io [data snapshots](https://libraries.io/data) (by Tidelift).

---

*This notebook is part of a user guide series that covers in detail the operation of the library.*
*This notebook covers the basic operations of loading and creating models and working with package properties and metrics.*




## A - Basic model usage

[01 - Load model](#01---Load-model)&ensp;|&ensp;[02 - Package properties](#02---Package-properties)&ensp;|&ensp;[03 - Package metrics](#03---Package-metrics)&ensp;|&ensp;[04 - Custom models](#04---Custom-models) 

*OliviaNetwork* is essentially a directed graph with some additional structures to facilitate working with metrics in large dependency networks. The model can be built from a NetworkX directed network or from a file in adjacency list format.

In [1]:
from olivia.model import OliviaNetwork

### 01 - Load model

Load a pre-built model from a file. This one is based on a snapshot of the Python Package Index (https://pypi.org/) from 2020-01 (data from https://libraries.io/):

In [2]:
pypi = OliviaNetwork(r'data/pypi-2020-01-12.olv')

As expected, *len()* returns the number of packages in the network:

In [3]:
len(pypi)

50766

You may iterate over package names:

In [126]:
# Packages in PyPI starting with 'zai'
for package in pypi:
    if package.startswith('zai'):
        print(package)

zaim
zaifer
zaius-export
zaidan


### 02 - Package properties

Access via *getitem* returns a special view object:

In [5]:
pypi['networkx']

<olivia.model.PackageInfoView at 0x91dc198>

*PackageInfoView* contains methods returning stats for specific packages. For example, direct dependants are other packages that import networkx in their code:

In [6]:
print(f"NetworkX has {len(pypi['networkx'].direct_dependants())} direct dependants")  

NetworkX has 372 direct dependants


Packages may depend on NetworkX not only directly, but also via transitive dependencies:

![Dependants and transitive dependants](docs/img/dependants.png "Transitive dependencies")  
<br>

In [38]:
print(f"NetworkX has {len(pypi['networkx'].transitive_dependants())} transitive dependants (includes direct dependants)")

NetworkX has 589 transitive dependants (includes direct dependants)
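
The difference between direct and transitive dependants can be illustrated on a tiny graph. The following is a minimal sketch in plain Python, not OLIVIA's implementation; the package names and the `transitive_dependants` helper are hypothetical, and the toy network is assumed to be acyclic.

```python
from collections import deque

# Hypothetical toy network: each key maps to its direct dependants,
# i.e. an arc u -> v means "v depends on u".
direct_dependants = {
    "core": {"lib-a", "lib-b"},
    "lib-a": {"app"},
    "lib-b": set(),
    "app": set(),
}


def transitive_dependants(graph, package):
    """Return every package reachable from `package`, direct dependants included."""
    seen = set()
    queue = deque(graph[package])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph[current])
    return seen


direct = direct_dependants["core"]
transitive = transitive_dependants(direct_dependants, "core")
print(sorted(direct))               # ['lib-a', 'lib-b']
print(sorted(transitive - direct))  # ['app']
```

Here `app` never names `core` among its own dependencies, yet a defect in `core` can still reach it through `lib-a`.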


Packages are returned as sets so we may apply the usual set operators. For example, these packages depend on NetworkX but do not explicitly import it:

In [40]:
print(pypi['networkx'].transitive_dependants() - pypi['networkx'].direct_dependants())

{'ceilometer-powervm', 'nobrainer', 'irdap', 'mosaik.Demo-semver', 'LiSE', 'nslsii', 'angrop', 'mosaik.ScenarioTools', 'imply', 'xenonpy', 'score.jinja2', 'mriqc', 'nodepool', 'score.sa.db', 'pyvcloud', 'balkhash', 'dvc-cc', 'envdump-sha1n', 'ue4-ci-helpers', 'ndexgenehancerloader', 'toai', 'brainio', 'networking-powervm', 'mosaik.SimConfig', 'pyxrf', 'degas', 'fastr', 'termlink', 'scan-to-paperless', 'compath-utils', 'galaxy-lib', 'prov-db-connector', 'biobakery-workflows', 'biobb-analysis', 'sos-python', 'dbt-spark', 'scriptcwl', 'omesa', 'Orange3-Text', 'cifsdk_zyre', 'mosaik.scenario-tools', 'csirtgsdk', 'lime', 'xd-cwl-utils', 'sdcflows', 'score.jsapi', 'ephysiopy', 'score.ctx', 'qpsphere', 'score.ws', 'sceptre-aws-resolver', 'score.auth', 'dbt-snowflake', 'trove', 'mzml2isa', 'biobb-io', 'tridentx', 'girder-tech-journal', 'niflow-nipype1-workflows', 'cwltool', 'followthemoney-enrich', 'score.webassets', 'colorific', 'score.projects', 'girder-large-image', 'calrissian', 'malwareco

Let's pick one of them and check its dependencies, that is, the packages on which it depends:

In [43]:
pypi['nobrainer'].direct_dependencies()

{'click', 'nibabel', 'numpy', 'scikit-image', 'tensorflow-probability'}

NetworkX is not a direct dependency but a transitive one:

In [41]:
pypi['nobrainer'].transitive_dependencies()

{'PyWavelets',
 'bz2file',
 'click',
 'cloudpickle',
 'decorator',
 'gast',
 'matplotlib',
 'networkx',
 'nibabel',
 'numpy',
 'pillow',
 'scikit-image',
 'scipy',
 'six',
 'tensorflow-probability'}

### 03 - Package metrics

**REACH** of a package *u* is the number of transitive dependents of *u* plus 1.

In [4]:
pypi['networkx'].reach()

590

REACH represents the number of packages in the network that could be affected by the occurrence of a defect in *u*, such as a bug or a security vulnerability. A bug in NetworkX could affect 590 packages, including NetworkX itself.

You may calculate REACH package by package, as in the previous example. However, this involves many redundant computations and is very slow. OLIVIA provides efficient methods to calculate REACH for all the nodes in the network.

In [3]:
from olivia.packagemetrics import Reach
pypi_reach = pypi.get_metric(Reach)

Computing Reach
     Processing node: 50K      
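
To see why a network-wide computation can avoid the redundant work of a package-by-package one, consider the sketch below. It is one possible approach under stated assumptions (an acyclic toy graph, hypothetical package names and a hypothetical `reach_all` helper) and is not claimed to be the algorithm OLIVIA uses. Nodes are visited so that every dependant is processed before the package it depends on, which lets each node reuse the sets already computed for its dependants.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical acyclic toy network: an arc u -> v means "v depends on u".
dependants = {
    "core": {"lib-a", "lib-b"},
    "lib-a": {"app"},
    "lib-b": {"app"},
    "app": set(),
}


def reach_all(graph):
    """Compute REACH for every node of an acyclic graph in a single pass."""
    # TopologicalSorter treats each mapping value as the nodes that must come
    # first, so passing dependants yields them before the packages they use.
    order = TopologicalSorter(graph).static_order()
    descendants = {}
    for node in order:
        reached = set()
        for child in graph[node]:
            reached.add(child)
            reached |= descendants[child]  # reuse the child's finished result
        descendants[node] = reached
    # REACH is the number of transitive dependents plus 1 (the package itself).
    return {node: len(found) + 1 for node, found in descendants.items()}


reach = reach_all(dependants)
print(reach["core"], reach["lib-a"], reach["app"])  # 4 2 1
```

Sets rather than counts are propagated because `app` is reachable from `core` along two paths and must be counted only once.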


*OliviaNetwork.get_metric(...)* returns a *MetricStats* object with the results of the computation. *get_metric(...)* accepts as its parameter any class implementing the *compute()* method, such as the ones in *olivia.packagemetrics*.

In [50]:
pypi_reach

<olivia.packagemetrics.MetricStats at 0x248710b8>

In [6]:
pypi_reach['networkx']

590

Once calculated through *get_metric*, the *MetricStats* object is cached into the *OliviaNetwork* model. In this way, other complex algorithms that use large metric results are freed from managing each one on their own. 

So there is really no need to store the results into an independent variable like we did.

In [16]:
pypi_reach = pypi.get_metric(Reach)

Reach retrieved from metrics cache


In [53]:
pypi.get_metric(Reach)['networkx']

Reach retrieved from metrics cache


590
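
The caching behaviour described above can be pictured with a small stand-in. This is a hedged sketch only: `ToyModel`, `DirectDependantsCount`, and the assumption that a metric class is constructed with the model and exposes `compute()` are illustrative choices, not OLIVIA's actual internals.

```python
class ToyModel:
    """Hypothetical stand-in for a model that caches network-wide metrics."""

    def __init__(self, graph):
        self.graph = graph
        self._metrics_cache = {}

    def get_metric(self, metric_class):
        name = metric_class.__name__
        if name in self._metrics_cache:
            print(f"{name} retrieved from metrics cache")
        else:
            print(f"Computing {name}")
            # Assumption: a metric is built from the model and exposes compute().
            self._metrics_cache[name] = metric_class(self).compute()
        return self._metrics_cache[name]


class DirectDependantsCount:
    """Hypothetical metric: number of direct dependants of each node."""

    def __init__(self, model):
        self.model = model

    def compute(self):
        return {node: len(deps) for node, deps in self.model.graph.items()}


model = ToyModel({"core": {"lib", "app"}, "lib": {"app"}, "app": set()})
first = model.get_metric(DirectDependantsCount)   # computed on first request
second = model.get_metric(DirectDependantsCount)  # served from the cache
```

Keying the cache by metric class means any code holding the model can ask for a metric without knowing whether it has been computed before.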

The management of the cache is semi-automatic. You can request a value from a network-wide metric that has not yet been calculated and it will be computed and cached the first time:

In [7]:
from olivia.packagemetrics import Impact

%time pypi.get_metric(Impact)['networkx']

Computing Impact
     Processing node: 50K      
Wall time: 2.64 s


680

In [10]:
%time pypi.get_metric(Impact)['networkx']

Impact retrieved from metrics cache
Wall time: 0 ns


680

In [56]:
pypi.get_metric(Impact)['numpy']

Impact retrieved from metrics cache


8104

In [57]:
pypi.get_metric(Impact)['matplotlib']

Impact retrieved from metrics cache


831

By the way, **IMPACT** is an alternative way of measuring the effect of a defect appearing in the network. It corresponds to the number of "links" affected (the number of "imports" in Python terms), so it could be a better measure of the effort required to recover the network. Technically speaking, it is the number of arcs in the subgraph induced by a node and its transitive dependents.
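
The definition of IMPACT as an arc count over an induced subgraph can be checked on a toy example. The sketch below uses plain Python with hypothetical package names and helper functions; it follows the definition literally and is not OLIVIA's implementation.

```python
from collections import deque

# Hypothetical toy network: an arc u -> v means "v depends on u".
dependants = {
    "core": {"lib-a", "lib-b"},
    "lib-a": {"app"},
    "lib-b": {"app"},
    "app": set(),
}


def descendants(graph, source):
    """All transitive dependents of `source`."""
    seen = set()
    queue = deque(graph[source])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph[node])
    return seen


def reach(graph, source):
    """Number of transitive dependents of `source`, plus 1."""
    return len(descendants(graph, source) | {source})


def impact(graph, source):
    """Number of arcs in the subgraph induced by `source` and its dependents."""
    nodes = descendants(graph, source) | {source}
    return sum(1 for u in nodes for v in graph[u] if v in nodes)


print(reach(dependants, "core"), impact(dependants, "core"))    # 4 4
print(reach(dependants, "lib-a"), impact(dependants, "lib-a"))  # 2 1
```

In this toy network `core` affects four packages and four dependency links, while `lib-a` affects two packages but only one link.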

On the other hand, **SURFACE** is the size of the set of transitive dependencies plus 1. SURFACE(*u*) is the number of packages whose defects could affect *u*. High SURFACE packages are more vulnerable to random failures.

In [12]:
from olivia.packagemetrics import Surface

pypi.get_metric(Surface)['pandas']

Computing Surface
     Processing node: 0K       


5

![Package metrics](docs/img/pmetrics.png "Olivia package metrics")
<br>
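
SURFACE looks in the opposite direction to REACH: it follows dependencies rather than dependents. A minimal sketch, assuming a hypothetical toy network stored as direct dependencies and a hypothetical `surface` helper:

```python
# Hypothetical toy network given as direct dependencies
# (each package maps to the packages it uses).
dependencies = {
    "app": {"lib-a", "lib-b"},
    "lib-a": {"core"},
    "lib-b": {"core"},
    "core": set(),
}


def surface(graph, package):
    """Size of the set of transitive dependencies of `package`, plus 1."""
    seen = set()
    stack = list(graph[package])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    seen.discard(package)  # in a cycle, do not count the package twice
    return len(seen) + 1


print(surface(dependencies, "app"))   # 4
print(surface(dependencies, "core"))  # 1

# Dividing by the network size gives the fraction of packages whose
# defects could affect the given package.
print(surface(dependencies, "app") / len(dependencies))  # 1.0
```

In this toy network a defect anywhere reaches `app`, whereas `core` is exposed only to its own defects.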

*MetricStats* is not just about storing values. It also has some basic methods that are useful for working with metrics. For example, you can get top and bottom packages according to the metric value:

In [61]:
pypi_reach.top(10)

[('six', 22315),
 ('idna', 17223),
 ('certifi', 16760),
 ('urllib3', 16438),
 ('chardet', 16384),
 ('chardet2', 15732),
 ('requests', 15731),
 ('attrs', 14606),
 ('pyparsing', 13873),
 ('appdirs', 13828)]

In [62]:
pypi_reach.bottom()

[('1pass', 1)]

In [83]:
pypi.get_metric(Surface).top(10)

Surface retrieved from metrics cache


[('sos-papermill', 248),
 ('dsc', 248),
 ('dvc-cc', 235),
 ('paasta-tools', 227),
 ('sos-python', 226),
 ('networking-baremetal', 225),
 ('sos-notebook', 225),
 ('molecule-azure', 223),
 ('magnum', 223),
 ('karbor', 222)]

As *MetricStats* implements arithmetic operators, you may define compound metrics or other operations like corrections or normalization:

In [4]:
normalized_reach = pypi.get_metric(Reach)/len(pypi)
normalized_reach.top(10)

Reach retrieved from metrics cache


[('six', 0.43956585116022534),
 ('idna', 0.33926249852263324),
 ('certifi', 0.3301422211716503),
 ('urllib3', 0.3237993932947248),
 ('chardet', 0.32273568924083046),
 ('chardet2', 0.3098924477012174),
 ('requests', 0.3098727494779971),
 ('attrs', 0.2877122483551984),
 ('pyparsing', 0.27327345073474374),
 ('appdirs', 0.2723870306898318)]

Here we can see that a failure in *six* could affect more than 40% of the packages in the network.

Likewise, if we normalize SURFACE in relation to the size of the network, we obtain the probability that a uniformly random failure will affect each package:


In [13]:
(pypi.get_metric(Surface)/len(pypi)).top(10)

Surface retrieved from metrics cache


[('sos-papermill', 0.004885159358625852),
 ('dsc', 0.004885159358625852),
 ('dvc-cc', 0.0046290824567624),
 ('paasta-tools', 0.0044714966710002755),
 ('sos-python', 0.00445179844778001),
 ('networking-baremetal', 0.0044321002245597445),
 ('sos-notebook', 0.0044321002245597445),
 ('molecule-azure', 0.0043927037781192136),
 ('magnum', 0.0043927037781192136),
 ('karbor', 0.004373005554898948)]

Some examples of compound metrics:

In [5]:
from olivia.packagemetrics import DependentsCount

mean_degree = pypi.get_metric(DependentsCount).values.mean()
degree_divergence = (pypi.get_metric(DependentsCount)-mean_degree)**2
degree_divergence.top(5)

Computing Dependents Count
DependentsCount retrieved from metrics cache


[('requests', 112783115.1271673),
 ('six', 25109514.74037298),
 ('numpy', 15271991.188468162),
 ('click', 6329951.601658093),
 ('setuptools', 4468740.2382258745)]

In [10]:
# Impact / Reach ratio
(pypi.get_metric(Impact)/pypi.get_metric(Reach)).top(5)

Impact retrieved from metrics cache
Reach retrieved from metrics cache


[('jsii', 5.434782608695652),
 ('publication', 5.407407407407407),
 ('aws-cdk.aws-iam', 5.174418604651163),
 ('aws-cdk.region-info', 5.149425287356322),
 ('cattrs', 4.829787234042553)]
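
The operator-based composition used in this section (division by a scalar, subtraction, powers, metric-by-metric ratios) can be pictured with a toy container. This is a sketch under assumptions: `ToyStats`, its `data` attribute, and its `top` method are hypothetical and only mimic the behaviour shown above; they say nothing about how *MetricStats* is actually implemented.

```python
class ToyStats:
    """Hypothetical container mapping package names to metric values."""

    def __init__(self, data):
        self.data = dict(data)

    def _apply(self, other, op):
        # Element-wise against another ToyStats, or broadcast a scalar.
        if isinstance(other, ToyStats):
            return ToyStats({k: op(v, other.data[k]) for k, v in self.data.items()})
        return ToyStats({k: op(v, other) for k, v in self.data.items()})

    def __truediv__(self, other):
        return self._apply(other, lambda a, b: a / b)

    def __sub__(self, other):
        return self._apply(other, lambda a, b: a - b)

    def __pow__(self, exponent):
        return self._apply(exponent, lambda a, b: a ** b)

    def top(self, n=1):
        ranked = sorted(self.data.items(), key=lambda item: item[1], reverse=True)
        return ranked[:n]


reach = ToyStats({"core": 4, "lib-a": 2, "lib-b": 2, "app": 1})
impact = ToyStats({"core": 4, "lib-a": 1, "lib-b": 1, "app": 0})

normalized = reach / len(reach.data)  # scalar division
ratio = impact / reach                # metric-by-metric ratio
spread = (reach - 2) ** 2             # squared distance from a reference value

print(normalized.top(1))  # [('core', 1.0)]
print(ratio.top(1))       # [('core', 1.0)]
print(spread.top(1))      # [('core', 4)]
```

Returning a new container from every operator is what allows expressions to be chained and then ranked with `top`.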

### 04 - Custom models

An *OliviaNetwork* model can be built from a text file with package dependencies in adjacency list format. You may provide compressed GZIP or BZ2 files.

In [121]:
net = OliviaNetwork()
net.build_model(r'data/pypi-dependencies-net-2020-01-12.bz2')

Reading dependencies file...
Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
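
For readers preparing their own input, the sketch below shows one way to read a compressed adjacency list with the standard library. The line layout (`node neighbour1 neighbour2 ...`, whitespace separated, `#` for comments) is an assumption borrowed from the common adjacency list convention; the exact layout OLIVIA expects should be checked against its documentation. The helper names and the sample file are hypothetical.

```python
import bz2
import gzip
import os
import tempfile


def open_maybe_compressed(path):
    """Open a text file, transparently handling .gz and .bz2 extensions."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def read_adjacency_list(path):
    """Parse lines of the assumed form 'node neighbour1 neighbour2 ...'."""
    graph = {}
    with open_maybe_compressed(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields or fields[0].startswith("#"):
                continue  # skip blank lines and comments
            source, targets = fields[0], fields[1:]
            graph.setdefault(source, set()).update(targets)
            for target in targets:
                graph.setdefault(target, set())  # make sure leaf nodes exist
    return graph


# Write a tiny hypothetical file and read it back.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "toy-net.bz2")
    with bz2.open(sample, "wt", encoding="utf-8") as handle:
        handle.write("core lib-a lib-b\nlib-a app\nlib-b app\n")
    toy = read_adjacency_list(sample)

print(sorted(toy))  # ['app', 'core', 'lib-a', 'lib-b']
```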


In [122]:
net.get_metric(Reach).top(10)

Computing Reach
     Processing node: 50K      


[('six', 22315),
 ('idna', 17223),
 ('certifi', 16760),
 ('urllib3', 16438),
 ('chardet', 16384),
 ('chardet2', 15732),
 ('requests', 15731),
 ('attrs', 14606),
 ('pyparsing', 13873),
 ('appdirs', 13828)]

*build_model(...)* also accepts an arbitrary NetworkX directed network as input:

In [123]:
import networkx as nx
net = OliviaNetwork()
net.build_model(nx.path_graph(5, create_using=nx.DiGraph))

Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
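
The build log mentions strongly connected components (SCCs) and a condensation network. The sketch below illustrates what these terms mean on a toy graph with one dependency cycle, using Kosaraju's algorithm in plain Python. It is an illustration of the concepts only, with hypothetical node names and helpers, and is not OLIVIA's implementation.

```python
# Hypothetical toy network with one cycle (a <-> b);
# an arc u -> v means "v depends on u".
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"d"}, "d": set()}


def strongly_connected_components(graph):
    """Kosaraju's algorithm: return a mapping node -> component label."""
    order, visited = [], set()

    def visit(node):
        visited.add(node)
        for nxt in graph[node]:
            if nxt not in visited:
                visit(nxt)
        order.append(node)  # record nodes in order of completion

    for node in graph:
        if node not in visited:
            visit(node)

    # Build the reversed graph for the second pass.
    reverse = {node: set() for node in graph}
    for u, targets in graph.items():
        for v in targets:
            reverse[v].add(u)

    component = {}

    def assign(node, label):
        component[node] = label
        for prev in reverse[node]:
            if prev not in component:
                assign(prev, label)

    for node in reversed(order):
        if node not in component:
            assign(node, node)
    return component


component = strongly_connected_components(graph)

# Condensation: one node per SCC, keeping only arcs between different SCCs.
condensation = {label: set() for label in set(component.values())}
for u, targets in graph.items():
    for v in targets:
        if component[u] != component[v]:
            condensation[component[u]].add(component[v])

print(component["a"] == component["b"])  # True
print(len(condensation))                 # 3
```

Packages in the same SCC reach exactly the same packages downstream, so they share the same REACH, and the condensation is always acyclic, which makes network-wide computations easier to organise.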


In [125]:
net.get_metric(Reach).top(5)

Reach retrieved from metrics cache


[(0, 5), (1, 4), (2, 3), (3, 2), (4, 1)]

The *network* property is a reference to a NetworkX (https://networkx.org/) DiGraph object representing the dependency structure.

Once built, models may be saved with the *OliviaNetwork.save(filename)* method. Files are GZIP compressed and also store the cached metrics.