# OLIVIA
**Open-source Library Indexes Vulnerability Identification and Analysis**

Centralized library repositories are used to reduce development time and cost in virtually all languages and types of software projects. Because dependencies are transitive, a single defect in the repository can have extensive and difficult-to-predict effects on the ecosystem. Such defects cause functional errors or performance or security problems. The risk is hard for developers to grasp, since they explicitly import only a small part of their dependencies.

OLIVIA uses an approach based on the vulnerability of the dependency network of software packages, which measures how sensitive the repository is to the random introduction of defects. The goals of the model are to contribute to the understanding of propagation mechanisms of software defects and to study feasible protection strategies. This can benefit multiple parties:

* **Centralised package managers**, to establish policies and manual or automatic control processes that improve the security and stability of the repositories.
* **Software developers** in general, to assess the different risks introduced by the dependencies used in their projects, and **package developers** in particular, to understand their responsibility towards the ecosystem.
* Developers of **continuous quality tools**, to define the concept of vulnerability based on the modeling of the network of package dependencies.

---

**Author**: Daniel Setó Rey

https://github.com/dsr0018/olivia

**License**: Olivia and this notebook are published under an MIT [license](https://github.com/dsr0018/olivia/blob/master/LICENSE). Dependency information was obtained from the libraries.io [data snapshots](https://libraries.io/data) (by Tidelift).

---

*This notebook is part of a user guide series that covers the operation of the library in detail.*
*This installment covers the basic operations of loading and creating models and working with package properties and metrics.*




## A - Basic model usage

[01 - Load model](#01---Load-model)&ensp;|&ensp;[02 - Package properties](#02---Package-properties)&ensp;|&ensp;[03 - Package metrics](#03---Package-metrics)&ensp;|&ensp;[04 - Custom models](#04---Custom-models) 

*OliviaNetwork* is essentially a directed graph with some additional structures to facilitate working with metrics in large dependency networks. The model can be built from a NetworkX directed network or from a file in adjacency list format.

Install the requirements of the olivia library:

In [1]:
!pip install -r "../../olivia_finder/olivia/requirements.txt"



In [2]:
# Add olivia to the path
import sys, os
sys.path.append('../../olivia_finder/olivia')

In [3]:
from olivia.model import OliviaNetwork

### 0 - Build olivia model from graph

In [4]:
# cran_network_graph_path = os.path.abspath('../networks/cran_nrework.bz2')
# cran = OliviaNetwork()
# cran.build_model(cran_network_graph_path)
# cran.save('cran_model.olv')

### 01 - Load model

Load a pre-built model from file. This one, `cran_model.olv`, is the CRAN dependency network saved by the build step in the previous section:

In [5]:
cran = OliviaNetwork(r'cran_model.olv')

As expected, *len()* returns the number of packages in the network:

In [6]:
len(cran)

18671

### 02 - Package properties

Access via *getitem* returns a special view object:

In [7]:
cran['ggplot2']

<olivia.model.PackageInfoView at 0x7fa0e4975be0>

*PackageInfoView* contains methods returning stats for specific packages. For example, direct dependants are other packages that import ggplot2 in their code:

In [8]:
print(f"ggplot2 has {len(cran['ggplot2'].direct_dependants())} direct dependants")  

ggplot2 has 16 direct dependants


Packages may depend on ggplot2 not only directly, but also via transitive dependencies:

![Dependants and transitive dependants](../../olivia_finder/olivia/docs/img/dependants.png "Transitive dependencies")  
<br>

In [9]:
print(f"ggplot2 has {len(cran['ggplot2'].transitive_dependants())} transitive dependants (includes direct dependants)")

ggplot2 has 35 transitive dependants (includes direct dependants)


Packages are returned as sets, so we may apply the usual set operators. For example, according to the model, these packages depend on ggplot2 but do not explicitly import it:

In [10]:
print(cran['ggplot2'].transitive_dependants() - cran['ggplot2'].direct_dependants())

{'pillar', 'methods', 'utf8', 'colorspace', 'fansi', 'lattice', 'viridisLite', 'munsell', 'splines', 'pkgconfig', 'graphics', 'labeling', 'RColorBrewer', 'nlme', 'R6', 'farver', 'Matrix', 'magrittr', 'utils'}
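The distinction between direct and transitive dependants can be illustrated without OLIVIA. The following is a minimal pure-Python sketch on a hypothetical toy network (the package names and the `dependants` mapping are assumptions made for illustration, not CRAN data, and this is not OLIVIA's implementation). It uses a breadth-first traversal and the same set difference as the cell above.

```python
from collections import deque

# Hypothetical toy network: dependants[u] is the set of packages that
# directly depend on u (illustrative names, not CRAN data).
dependants = {
    "core": {"plot", "stats"},
    "plot": {"dashboard"},
    "stats": {"dashboard", "report"},
    "dashboard": set(),
    "report": set(),
}


def transitive_dependants(graph, package):
    """Return every package reachable from `package`, excluding itself."""
    seen = set()
    queue = deque(graph[package])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph[current])
    seen.discard(package)  # only relevant if the graph contains cycles
    return seen


direct = dependants["core"]
transitive = transitive_dependants(dependants, "core")
indirect_only = transitive - direct

print(sorted(direct))         # ['plot', 'stats']
print(sorted(transitive))     # ['dashboard', 'plot', 'report', 'stats']
print(sorted(indirect_only))  # ['dashboard', 'report']
```

As in the OLIVIA output, the transitive set includes the direct dependants, so the difference leaves only the packages that are affected indirectly.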


Now let's check the dependencies of ggplot2, that is, the packages on which it depends:

**Direct Dependencies**

In [11]:
ggplot2_dd_count = len(cran['ggplot2'].direct_dependencies())
print(f"ggplot2 has {ggplot2_dd_count} direct dependencies")

ggplot2 has 3192 direct dependencies


**Transitive Dependencies**

In [12]:
ggplot2_td_count = len(cran['ggplot2'].transitive_dependencies())
print(f"ggplot2 has {ggplot2_td_count} transitive dependencies")

ggplot2 has 4719 transitive dependencies


### 03 - Package metrics

**REACH** of a package *u* is the number of transitive dependents of *u* plus 1.

In [13]:
cran['A3'].reach()

7

REACH represents the number of packages in the network that could be affected by the occurrence of a defect in *u*, like a bug or a security vulnerability. According to this model, a bug in A3 could affect 7 packages, including A3 itself.

You may calculate REACH package by package, as in the previous example. However, this involves many redundant computations and is very slow. OLIVIA provides efficient methods to calculate REACH for all the nodes in the network.

In [14]:
from olivia.packagemetrics import Reach
cran_reach = cran.get_metric(Reach)

Computing Reach
     Processing node: 18K      


*OliviaNetwork.get_metric(...)* returns a *MetricStats* object with the results of the computation. *get_metric(...)* accepts as its parameter a class implementing the *compute()* method, such as the ones in *olivia.packagemetrics*.

In [15]:
cran_reach

<olivia.packagemetrics.MetricStats at 0x7fa0e49f3b80>

In [16]:
cran_reach['ggplot2']

36
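To see why a network-wide computation can avoid the redundancy of a package-by-package traversal, here is a minimal sketch that computes REACH for every node in a single pass. It is not OLIVIA's algorithm: it assumes a hypothetical, acyclic toy network (real networks may contain cycles, which would have to be collapsed first), and it stores full reachable sets, which would be too memory-hungry for large networks. It relies on `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical toy network: dependants[u] is the set of packages that
# directly depend on u. Assumed to be acyclic for this sketch.
dependants = {
    "core": {"plot", "stats"},
    "plot": {"dashboard"},
    "stats": {"dashboard", "report"},
    "dashboard": set(),
    "report": set(),
}


def reach_all(graph):
    """Compute REACH (transitive dependants + 1) for every node in one pass."""
    # TopologicalSorter reads the mapping as node -> predecessors, so
    # passing node -> dependants yields an order where dependants come first.
    order = TopologicalSorter(graph).static_order()
    reachable = {}
    for node in order:
        nodes = {node}
        for dependant in graph[node]:
            # Reuse the set already computed for each direct dependant.
            nodes |= reachable[dependant]
        reachable[node] = nodes
    return {node: len(nodes) for node, nodes in reachable.items()}


reach = reach_all(dependants)
print(sorted(reach.items()))
# [('core', 5), ('dashboard', 1), ('plot', 2), ('report', 1), ('stats', 3)]
```

Each node's result is built from the results of its direct dependants, so no part of the graph is traversed more than once per arc.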

Once calculated through *get_metric*, the *MetricStats* object is cached into the *OliviaNetwork* model. In this way, other complex algorithms that use large metric results are freed from managing each one on their own. 

So there is really no need to store the results in a separate variable, as we did above.

In [17]:
cran_reach = cran.get_metric(Reach)

Reach retrieved from metrics cache


In [18]:
cran.get_metric(Reach)['ggplot2']

Reach retrieved from metrics cache


36
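The caching behaviour shown above can be sketched as a dictionary keyed by the metric class name. This is only an illustration of the idea: `ToyNetwork`, `ToyDependantsCount`, and the constructor signature of the metric class are hypothetical and are not taken from the OLIVIA source.

```python
class ToyDependantsCount:
    """Hypothetical metric class exposing a compute() method."""

    def __init__(self, network):
        self.network = network

    def compute(self):
        # Number of direct dependants of each package.
        return {pkg: len(deps) for pkg, deps in self.network.dependants.items()}


class ToyNetwork:
    """Hypothetical model that caches metric results by metric class name."""

    def __init__(self, dependants):
        self.dependants = dependants
        self._metrics_cache = {}

    def get_metric(self, metric_class):
        name = metric_class.__name__
        if name in self._metrics_cache:
            print(f"{name} retrieved from metrics cache")
        else:
            print(f"Computing {name}")
            self._metrics_cache[name] = metric_class(self).compute()
        return self._metrics_cache[name]


net = ToyNetwork({"core": {"plot", "stats"}, "plot": set(), "stats": set()})
first = net.get_metric(ToyDependantsCount)   # computed on first request
second = net.get_metric(ToyDependantsCount)  # served from the cache
print(first is second, first["core"])        # True 2
```

Because the cache lives in the model object, any algorithm that receives the model can request a metric without knowing whether it has already been computed.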

The management of the cache is semi-automatic. You can request a value from a network-wide metric that has not yet been calculated and it will be computed and cached the first time:

In [19]:
from olivia.packagemetrics import Impact

%time cran.get_metric(Impact)['ggplot2']

Computing Impact
     Processing node: 18K      
CPU times: user 410 ms, sys: 1.17 ms, total: 411 ms
Wall time: 403 ms


116

In [20]:
%time cran.get_metric(Impact)['ggplot2']

Impact retrieved from metrics cache
CPU times: user 49 µs, sys: 18 µs, total: 67 µs
Wall time: 70.6 µs


116

In [21]:
cran.get_metric(Impact)['MASS']

Impact retrieved from metrics cache


6

In [22]:
cran.get_metric(Impact)['nlme']

Impact retrieved from metrics cache


11

By the way, **IMPACT** is an alternative way of measuring the effect of a defect appearing in the network. It corresponds to the number of "links" affected (the number of "imports" in Python terms) and could be a better measure of the effort required to recover the network. Technically speaking, it is the number of arcs in the graph induced by a node and its transitive dependents.

On the other hand, **SURFACE** is the size of the set of transitive dependencies plus 1. SURFACE(*u*) is the number of packages whose defects could affect *u*. High SURFACE packages are more vulnerable to random failures.

In [23]:
from olivia.packagemetrics import Surface

cran.get_metric(Surface)['igraph']

Computing Surface
     Processing node: 0K       


1088
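The definitions of IMPACT and SURFACE given above can be checked by hand on a small example. The sketch below applies them literally to a hypothetical toy network (the names and the `dependants` mapping are assumptions, and the code is not OLIVIA's implementation).

```python
# Hypothetical toy network: dependants[u] is the set of packages that
# directly depend on u, so each entry u -> v is one arc of the graph.
dependants = {
    "core": {"plot", "stats"},
    "plot": {"dashboard"},
    "stats": {"plot", "dashboard"},
    "dashboard": set(),
}


def descendants(graph, start):
    """Nodes reachable from `start` following the arcs, excluding `start`."""
    seen, stack = set(), list(graph[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    seen.discard(start)
    return seen


def reverse(graph):
    """Flip every arc: the result maps each package to its direct dependencies."""
    flipped = {node: set() for node in graph}
    for node, targets in graph.items():
        for target in targets:
            flipped[target].add(node)
    return flipped


def impact(graph, package):
    """Arcs in the subgraph induced by `package` and its transitive dependants."""
    affected = descendants(graph, package) | {package}
    return sum(1 for u in affected for v in graph[u] if v in affected)


def surface(graph, package):
    """Number of transitive dependencies of `package`, plus 1."""
    return len(descendants(reverse(graph), package)) + 1


print(impact(dependants, "core"))        # 5
print(impact(dependants, "stats"))       # 3
print(surface(dependants, "dashboard"))  # 4
print(surface(dependants, "core"))       # 1
```

In this toy network a defect in `core` reaches 4 packages but touches 5 arcs, which shows how IMPACT can differ from REACH when the affected packages are densely connected.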

![Package metrics](../../olivia_finder/olivia/docs/img/pmetrics.png "Olivia package metrics")
<br>

*MetricStats* is not just about storing values. It also has some basic methods that are useful for working with metrics. For example, you can get top and bottom packages according to the metric value:

In [24]:
cran_reach.top(10)

[('sMSROC', 253),
 ('popstudy', 237),
 ('PsychWordVec', 235),
 ('PALMO', 225),
 ('wallace', 224),
 ('jsmodule', 224),
 ('RISCA', 223),
 ('TidyConsultant', 221),
 ('mlmts', 218),
 ('packDAMipd', 215)]

<ins>**Note:**</ins>  

- **sMSROC**: a package that provides a function to calculate the area under the ROC curve for survival data.

- **popstudy**: a package that provides tools for conducting population studies and calculating measures of prevalence and incidence.

- **PsychWordVec**: a package that provides word vectors for the analysis of psychology and linguistics data.

- **PALMO**: a package that provides a platform for analyzing longitudinal multi-omics data.

- **wallace**: a package that provides tools for visualization and analysis of data on species and their geographic distribution.

- **jsmodule**: a package that provides RStudio addins and Shiny modules for medical research.

- **RISCA**: a package that provides tools for causal inference and prediction in cohort-based analyses.

- **TidyConsultant**: a package that provides tools for enterprise data cleansing and analysis.

- **mlmts**: a package that provides tools for multivariate time series analysis.

- **packDAMipd**: a package that provides tools for decision-analysis modelling, with parameter estimation from individual patient-level data.

In [25]:
cran_reach.bottom()

[('R', 1)]

In [26]:
cran.get_metric(Surface).top(10)

Surface retrieved from metrics cache


[('R', 17223),
 ('methods', 15103),
 ('utils', 15037),
 ('stats', 14360),
 ('graphics', 12809),
 ('grDevices', 12763),
 ('grid', 9241),
 ('rlang', 8993),
 ('lattice', 8945),
 ('magrittr', 8916)]

As *MetricStats* implements arithmetic operators, you may define compound metrics or other operations like corrections or normalization:

In [27]:
normalized_reach = cran.get_metric(Reach)/len(cran)
normalized_reach.top(10)

Reach retrieved from metrics cache


[('sMSROC', 0.013550425794012104),
 ('popstudy', 0.012693481870280113),
 ('PsychWordVec', 0.012586363879813614),
 ('PALMO', 0.01205077392748112),
 ('wallace', 0.011997214932247872),
 ('jsmodule', 0.011997214932247872),
 ('RISCA', 0.011943655937014621),
 ('TidyConsultant', 0.011836537946548122),
 ('mlmts', 0.011675860960848375),
 ('packDAMipd', 0.011515183975148627)]

Here we can see that, in this model, a failure in *sMSROC* could affect about 1.4% of the packages in the network.
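Arithmetic on metric results, as used for the normalization above, can be sketched with a small dict-backed container. `ToyMetric` is a hypothetical class written for illustration. It is not the *MetricStats* implementation, and the values are made up.

```python
import operator


class ToyMetric:
    """Hypothetical dict-backed container with element-wise arithmetic."""

    def __init__(self, results):
        self.results = dict(results)

    def __getitem__(self, package):
        return self.results[package]

    def _apply(self, other, op):
        # Combine with another ToyMetric package by package, or with a scalar.
        if isinstance(other, ToyMetric):
            return ToyMetric({k: op(v, other.results[k]) for k, v in self.results.items()})
        return ToyMetric({k: op(v, other) for k, v in self.results.items()})

    def __truediv__(self, other):
        return self._apply(other, operator.truediv)

    def __sub__(self, other):
        return self._apply(other, operator.sub)

    def __pow__(self, other):
        return self._apply(other, operator.pow)

    def top(self, n=1):
        """Return the n (package, value) pairs with the highest values."""
        return sorted(self.results.items(), key=lambda kv: kv[1], reverse=True)[:n]


reach = ToyMetric({"core": 4, "stats": 3, "plot": 2, "dashboard": 1})
impact = ToyMetric({"core": 5, "stats": 3, "plot": 1, "dashboard": 0})

normalized = reach / 4              # scalar operand: normalize by network size
ratio = impact / reach              # metric operand: element-wise ratio
squared_dev = (reach - 2.5) ** 2    # squared deviation from the mean (2.5)

print(normalized.top(2))            # [('core', 1.0), ('stats', 0.75)]
print(ratio["core"])                # 1.25
print(squared_dev["dashboard"])     # 2.25
```

Returning a new container from each operator is what allows expressions to be chained and then queried with `top()`, as in the cells of this section.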

Likewise, if we normalize SURFACE in relation to the size of the network, we obtain the probability that a uniformly random failure will affect each package:


In [28]:
(cran.get_metric(Surface)/len(cran)).top(10)

Surface retrieved from metrics cache


[('R', 0.9224465749022548),
 ('methods', 0.808901505007766),
 ('utils', 0.8053666113223716),
 ('stats', 0.7691071715494617),
 ('graphics', 0.6860371699426919),
 ('grDevices', 0.6835734561619624),
 ('grid', 0.4949386749504579),
 ('rlang', 0.4816560441326121),
 ('lattice', 0.4790852123614161),
 ('magrittr', 0.47753200149965186)]
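The probabilistic reading of normalized SURFACE can be verified by enumeration on a small example. The sketch below uses a hypothetical toy network (an assumption for illustration, not CRAN data): it lets the defect originate in each package in turn, with equal probability, and counts how often every other package is affected.

```python
from fractions import Fraction

# Hypothetical toy network: dependants[u] is the set of packages that
# directly depend on u.
dependants = {
    "core": {"plot", "stats"},
    "plot": {"dashboard"},
    "stats": {"plot", "dashboard"},
    "dashboard": set(),
}


def affected_by(graph, origin):
    """Packages hit by a defect in `origin`: itself plus its transitive dependants."""
    seen, stack = {origin}, list(graph[origin])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


packages = sorted(dependants)
n = len(packages)

# Probability that a defect placed uniformly at random affects each target.
probability = {
    target: Fraction(
        sum(target in affected_by(dependants, origin) for origin in packages), n
    )
    for target in packages
}

print({pkg: str(p) for pkg, p in probability.items()})
# {'core': '1/4', 'dashboard': '1', 'plot': '3/4', 'stats': '1/2'}
```

The SURFACE values of this toy network are 1 for `core`, 2 for `stats`, 3 for `plot`, and 4 for `dashboard`, so each probability equals SURFACE divided by the number of packages.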

<ins>**Note:**</ins>

-   **R**: the base R package, which provides the core functionality of the R language.

-   **methods**: a package that provides methods for generic functions, which are functions that can operate on objects of different classes.

-   **utils**: a package that provides various utility functions for R, such as functions for reading and writing data, managing files and directories, and managing the R environment.

-   **stats**: a package that provides functions for statistical modeling, inference, and analysis.

-   **graphics**: a package that provides functions for creating and manipulating graphics in R, such as plots, charts, and diagrams.

-   **grDevices**: a package that provides functions for interacting with graphical devices, such as displays, printers, and files.

-   **grid**: a package that provides functions for creating and manipulating grid-based graphics, such as tables and plots.

-   **rlang**: a package that provides functions for working with expressions and symbols in R, such as parsing, evaluation, and manipulation of expressions.

-   **lattice**: a package that provides functions for creating and manipulating high-level graphics in R, such as trellis plots and conditioned plots.

-   **magrittr**: a package that provides a pipe operator (%>%) for chaining together multiple operations in R, allowing for more readable and concise code.

Some examples of compound metrics:

In [29]:
from olivia.packagemetrics import DependentsCount

mean_degree = cran.get_metric(DependentsCount).values.mean()
degree_divergence = (cran.get_metric(DependentsCount)-mean_degree)**2
degree_divergence.top(5)

Computing Dependents Count
DependentsCount retrieved from metrics cache


[('Seurat', 1930.1271107449854),
 ('immunarch', 1594.6614152814323),
 ('pguIMP', 1220.3292959519908),
 ('epitweetr', 1151.4628720861026),
 ('MetaIntegrator', 956.8636004884377)]

<ins>**Note:**</ins>

- **Seurat**: a package that provides tools for single cell data analysis, including clustering of similar cells and identification of differentially expressed genes.

- **immunarch**: a package that provides tools for the analysis of immune cell sequencing data, including immune cell sorting and clone identification.

- **pguIMP**: a package that provides tools for comparative genomics data analysis, including the detection of genes that are present or absent in different genomes.

- **epitweetr**: a package that provides tools for the collection and analysis of epidemiology-related Twitter data.

- **MetaIntegrator**: a package that provides tools for meta-analysis, including combining results from individual studies and identifying heterogeneity between studies.

In [30]:
# Impact / Reach ratio
(cran.get_metric(Impact)/cran.get_metric(Reach)).top(5)

Impact retrieved from metrics cache
Reach retrieved from metrics cache


[('margaret', 5.348066298342541),
 ('packDAMipd', 5.255813953488372),
 ('hlaR', 5.15),
 ('healthyverse', 5.137931034482759),
 ('spatstat.local', 5.107142857142857)]