# OLIVIA
**Open-source Library Indexes Vulnerability Identification and Analysis**

https://github.com/dsr0018/olivia

The use of centralized library repositories to reduce development times and costs is universal, in virtually all languages and types of software projects. Due to the transitivity of dependencies, the appearance of a single defect in the repository can have extensive and difficult-to-predict effects on the ecosystem. These defects cause functional errors or performance or security problems. The risk is difficult to grasp for developers, who only explicitly import a small part of the dependencies.

OLIVIA uses an approach based on the vulnerability of the dependency network of software packages, which measures how sensitive the repository is to the random introduction of defects. The goals of the model are to contribute to the understanding of the propagation mechanisms of software defects and to study feasible protection strategies. This can benefit multiple parties:

* **Centralised package managers**, to establish policies and manual or automatic control processes that improve the security and stability of the repositories.
* **Software developers** in general, to assess the different risks introduced by the dependencies used in their projects, and **package developers** in particular, to understand their responsibility towards the ecosystem.
* Developers of **continuous quality tools**, to define the concept of vulnerability based on the modeling of the network of package dependencies.

***
*This notebook is part of a user guide series that covers in detail the operation of the library.
In this case, we show how to find sets of packages to protect, with the goal of minimizing the vulnerability of the network.*
***


## C - Immunization
[01 - Immunization Delta](#01---Immunization-Delta)&ensp;|&ensp;[02 - Subcritical networks](#02---Selecting-Immunization-targets---Subcritical-networks)&ensp;|&ensp;[03 - Supercritical networks](#03---Selecting-Immunization-targets---Supercritical-networks)&ensp;|&ensp;[03 - Advanced](#03---Selecting-Immunization-targets---Advanced)

The vulnerability of package repositories can be lowered if we protect (immunize) certain packages against failure and propagation of failure. For example, we could subject a package or set of packages to very strict quality controls, security audits or even contractually shield certain non-functional requirements, such as the type of license. All this comes at a cost and it would be good to know which package sets are most cost-effective for reducing the vulnerability.

### 01 - Immunization Delta
Immunization Delta is the decrease in vulnerability of a network after immunizing a given set of packages. Take for example the following figure, which shows three simple network models and possible immunization sets (black nodes). Under each option the REACH and IMPACT-vulnerability and the corresponding immunization delta are shown.

<br>

![Immunization](docs/img/inmunizacion.png "Simple immunization examples")  

<br>
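To make the definition concrete, here is a minimal sketch in plain Python. It is not OLIVIA's implementation. It assumes a toy model in which REACH counts the packages affected by a failure (the failing package included), the vulnerability is the average REACH over all packages, and an immunized package neither fails nor passes a failure on. The graph and the names `reach`, `reach_vulnerability` and `toy_immunization_delta` are hypothetical.

```python
from collections import deque

# Toy network (assumed data): each package maps to its direct dependants.
DEPENDANTS = {"a": ["b"], "b": ["c", "d"], "c": [], "d": []}


def reach(dependants, start, immune=frozenset()):
    """Count the packages affected by a failure in `start`, itself included."""
    if start in immune:
        return 0  # assumption: an immunized package cannot fail
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependant in dependants[node]:
            # assumption: an immunized package does not propagate failures
            if dependant not in seen and dependant not in immune:
                seen.add(dependant)
                queue.append(dependant)
    return len(seen)


def reach_vulnerability(dependants, immune=frozenset()):
    """Average REACH over all packages of the toy network."""
    return sum(reach(dependants, p, immune) for p in dependants) / len(dependants)


def toy_immunization_delta(dependants, immune):
    """Vulnerability before immunization minus vulnerability after it."""
    return reach_vulnerability(dependants) - reach_vulnerability(dependants, immune)


print(reach_vulnerability(DEPENDANTS))            # 2.25
print(toy_immunization_delta(DEPENDANTS, {"b"}))  # 1.5
```

Immunizing `b` cuts the only path from `a` to `c` and `d`, so the average REACH drops from 2.25 to 0.75 in this toy model.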

In [49]:
from olivia.model import OliviaNetwork
from olivia.immunization import *

pypi = OliviaNetwork(r'data/pypi-2020-01-12.olv')

In [50]:
failure_vulnerability(pypi)

Computing Reach
     Processing node: 50K      


15.730114643659142

In [5]:
immunization_delta(pypi,{'numpy','pandas','matplotlib'})

Reach retrieved from metrics cache
Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
Computing Reach
     Processing node: 50K      


0.20080368750738664

By immunizing this set of three well-known Python packages, we reduce the REACH-vulnerability by 0.20, i.e. we reduce by 0.20 the expected number of packages potentially affected by a random failure (from 15.73 to 15.53). It is not a drastic reduction, but it is not bad either, considering that we are working on only three packages out of about 50,000.

The default algorithm of *immunization_delta(...)* is called *'network'*. It computes the cost function network-wide, immunizes the target set of packages provided, rebuilds the model, computes the cost function again and returns the difference in vulnerability. In the previous code example, you may notice that the first calculation was retrieved from the model's cache, as we had computed REACH just before by calling *failure_vulnerability(pypi)*.

This does not seem to be a particularly efficient method and indeed it is not. However, it is not clear that a better technique exists for calculating the immunization delta of arbitrary sets and cost functions.


For the particular case of the REACH cost function, OLIVIA provides another algorithm for computing the immunization delta, *'analytic'*. In this case, the entire network is not processed. Instead, a mathematical formulation that considers only the packages transitively related to the target set is used to calculate analytically the vulnerability reduction associated with its immunization. This can be much faster for certain types of immunization sets, but in large, complex networks this is often not the case, and the *'network'* method, whose running time is bounded, is the better choice.

The two methods are exact but there may be small differences due to rounding.
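The difference between a network-wide recomputation and one restricted to the packages related to the target set can be sketched on a toy graph. This is only an illustration under assumptions: it uses a simplified model (an immunized package neither fails nor propagates failure, and REACH includes the failing package itself), it is not the mathematical formulation OLIVIA uses, and `delta_full` and `delta_local` are hypothetical names.

```python
import math


def affected(dependants, start, immune=frozenset()):
    """Set of packages hit by a failure in `start` (empty if `start` is immune)."""
    if start in immune:
        return set()
    seen, stack = {start}, [start]
    while stack:
        for nxt in dependants[stack.pop()]:
            if nxt not in seen and nxt not in immune:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def delta_full(dependants, targets):
    """Recompute the average REACH of every package, before and after."""
    n = len(dependants)
    before = sum(len(affected(dependants, p)) for p in dependants) / n
    after = sum(len(affected(dependants, p, targets)) for p in dependants) / n
    return before - after


def delta_local(dependants, targets):
    """Only revisit the targets and the packages they transitively depend on."""
    dependencies = {p: [] for p in dependants}  # reversed edges
    for package, deps in dependants.items():
        for dependant in deps:
            dependencies[dependant].append(package)
    related = set()
    for target in targets:
        related |= affected(dependencies, target)  # target plus its dependencies
    change = sum(
        len(affected(dependants, p)) - len(affected(dependants, p, targets))
        for p in related
    )
    return change / len(dependants)


# Toy network (assumed data): each package maps to its direct dependants.
DEPENDANTS = {"a": ["b"], "b": ["c", "d"], "c": [], "d": [], "e": ["d"]}
full = delta_full(DEPENDANTS, {"c"})
local = delta_local(DEPENDANTS, {"c"})
print(full, local, math.isclose(full, local))
```

In this model only the targets and their transitive dependencies can change their REACH, so the local version visits three packages instead of five. The two results may differ in the last decimal places because the floating-point operations are carried out in a different order.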

In [6]:
target = {'networkx','spacy'}

In [7]:
%time immunization_delta(pypi, target)

Reach retrieved from metrics cache
Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
Computing Reach
     Processing node: 50K      
Wall time: 9.91 s


0.06565417799314766

In [8]:
%time immunization_delta(pypi, target, algorithm='analytic')

Wall time: 1.53 s


0.06565417799314502

Of course, you may be interested in computing the immunization delta according to another cost metric:

In [9]:
immunization_delta(pypi,{'numpy','pandas','matplotlib'}, cost_metric = Impact)

Computing Impact
     Processing node: 50K      
Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
Computing Impact
     Processing node: 50K      


0.2860969940511353

Notice how we needed to compute IMPACT twice. However the values for the original network are cached and will be reused in subsequent queries:

In [82]:
immunization_delta(pypi,{'six'}, cost_metric = Impact)

Impact retrieved from metrics cache
Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
Computing Impact
     Processing node: 50K      


0.9497498325651037

### 02 - Selecting Immunization targets - Subcritical networks
OLIVIA includes tools to help locate good immunization sets in package dependency networks. 

Subcritical networks do not have many dependency cycles, and the largest strongly connected component (SCC) is small, logarithmic in relation to the size of the repository. PyPI is subcritical.

From here on we will deal with REACH-vulnerability unless we note otherwise.

How effective is it to immunize random packages?

In [95]:
failure_vulnerability(pypi)

Reach retrieved from metrics cache


15.730114643659142

In [99]:
immunization_delta(pypi, iset_random(pypi, 20), algorithm='analytic')

0.006579206555568688

*iset_...* methods from *olivia.immunization* produce sets of packages intended as immunization targets. *iset_random(...)* is really more of a baseline tool that selects packages uniformly at random. Immunizing 20 random packages does not seem to do much to decrease the vulnerability (-0.0066 out of 15.73 in this case).

In [105]:
immunization_delta(pypi, iset_random(pypi, 1000), algorithm='analytic')

0.9571563644959225

Although appreciable, the immunization of 1000 random packages does not look spectacular either. *iset_random(...)* also provides an option to make an indirect selection, that is, to select randomly chosen dependencies of randomly chosen packages. This is a well-known immunization tactic from Network Science, often used to exemplify advanced vaccination strategies. In this case it is also much better than pure random selection.

In [112]:
immunization_delta(pypi, iset_random(pypi, 20, indirect=True), algorithm='analytic')

3.6588070756017808
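The indirect selection idea can be sketched in a few lines of plain Python. This is a hypothetical helper (`indirect_random_set`), not the code behind *iset_random(...)*, and the toy dependency data is assumed for illustration.

```python
import random


def indirect_random_set(dependencies, size, rng):
    """Pick `size` packages by sampling random dependencies of random packages."""
    with_deps = sorted(p for p, deps in dependencies.items() if deps)
    selectable = set().union(*(dependencies[p] for p in with_deps))
    size = min(size, len(selectable))  # cannot pick more than is selectable
    chosen = set()
    while len(chosen) < size:
        package = rng.choice(with_deps)                  # step 1: a random package
        options = sorted(dependencies[package])
        chosen.add(rng.choice(options))                  # step 2: one of its dependencies
    return chosen


# Toy data (assumed): every application depends on "hub", one also on "lib".
DEPENDENCIES = {
    "app1": {"hub"},
    "app2": {"hub"},
    "app3": {"hub", "lib"},
    "hub": set(),
    "lib": set(),
}
print(indirect_random_set(DEPENDENCIES, 1, random.Random(0)))
```

A package that many others depend on is drawn far more often than one with few dependants, which is why the indirect strategy tends to land on widely used packages.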

If we look at it, this indirect strategy is in fact a probabilistic approach to selection based on the number of dependents. Here we can just try the exact approach:

In [113]:
target = pypi.get_metric(DependentsCount).top(20)
target

DependentsCount retrieved from metrics cache


[('requests', 10623),
 ('six', 5014),
 ('numpy', 3911),
 ('click', 2519),
 ('setuptools', 2117),
 ('python-dateutil', 1873),
 ('pyyaml', 1836),
 ('PyYAML', 1572),
 ('lxml', 1368),
 ('future', 1218),
 ('pandas', 1202),
 ('urllib3', 1180),
 ('pytz', 1092),
 ('boto3', 1070),
 ('beautifulsoup4', 1057),
 ('pytest', 1045),
 ('Django', 995),
 ('django', 945),
 ('Click', 933),
 ('jinja2', 899)]

In [114]:
immunization_delta(pypi, {l[0] for l in target}, algorithm='analytic')

5.278276799432692

*iset_naive_ranking(...)* is a shortcut to select a set of nodes based simply on a ranking over a MetricStats object:

In [10]:
immunization_delta(pypi, 
                   iset_naive_ranking(20, pypi.get_metric(DependentsCount)),
                   algorithm='analytic')

Computing Dependents Count


5.278276799432692

Choosing the 20 packages with the highest count of direct dependants is even better than indirect selection. Now we achieve a vulnerability reduction of about a third, acting on only 20 out of 50,000 packages.

If we want to reduce REACH-vulnerability, a simple idea to consider could be to immunize the nodes with the greatest REACH value. It is somewhat surprising that this strategy is worse than the dependants count based one:

In [11]:
immunization_delta(pypi, 
                   iset_naive_ranking(20, pypi.get_metric(Reach)),
                   algorithm='analytic')

Reach retrieved from metrics cache


4.87391167316708

In addition to ranking methods, OLIVIA provides some specific techniques to find good targets. *iset_delta_set_reach(...)* computes a set of nodes that meets certain theoretical constraints on REACH and SURFACE values to ensure that it contains the best single node to immunize from the network. It is possible that the set contains other good targets.

The size of the delta set is given by the algorithm. For PyPI it is really small, of only 18 packages.

In [51]:
delta_set = iset_delta_set_reach(pypi)
print(delta_set)

Reach retrieved from metrics cache
Computing Surface
     Processing node: 0K       
Reach retrieved from metrics cache
Surface retrieved from metrics cache
{'notebook', 'pluggy', 'distribute', 'importlib-metadata', 'packaging', 'cryptography', 'zipp', 'boto3', 's3transfer', 'ipykernel', 'wheel', 'pbr', 'jsonschema', 'keyring', 'pytest', 'setuptools', 'configparser', 'requests'}


In [37]:
immunization_delta(pypi, delta_set, algorithm='analytic')

6.401134617657488

This is our best result so far, a 6.40 out of 15.73 (41%) decrease in vulnerability by immunizing 18 packages.

Of course we can also rank the results within the delta set for finding smaller sets:

In [24]:
smaller = iset_naive_ranking(4, pypi.get_metric(DependentsCount), subset = delta_set)
print(smaller)
immunization_delta(pypi, smaller, algorithm='analytic')

DependentsCount retrieved from metrics cache
{'setuptools', 'boto3', 'pytest', 'requests'}


3.6809872749477996
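Ranking within a subset amounts to a filter followed by a sort. The sketch below uses a plain dictionary and a hypothetical helper, `top_ranked`; it is not how *iset_naive_ranking(...)* is implemented. The counts are copied from the DependentsCount listing shown earlier in this section.

```python
def top_ranked(metric, size, subset=None):
    """Return the `size` best-scored packages, optionally restricted to `subset`."""
    if subset is None:
        candidates = dict(metric)
    else:
        candidates = {p: metric[p] for p in subset if p in metric}
    # Highest score first; ties broken alphabetically to keep the result stable.
    ranking = sorted(candidates, key=lambda p: (-candidates[p], p))
    return set(ranking[:size])


# Direct dependants counts taken from the listing above.
DEPENDENTS_COUNT = {
    "requests": 10623,
    "six": 5014,
    "numpy": 3911,
    "setuptools": 2117,
    "boto3": 1070,
    "pytest": 1045,
}

print(top_ranked(DEPENDENTS_COUNT, 2))
print(top_ranked(DEPENDENTS_COUNT, 2, subset={"setuptools", "boto3", "pytest"}))
```

Restricting the ranking to a candidate set keeps the selection cheap and lets a structural criterion (the subset) and a popularity criterion (the metric) work together.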

### 03 - Selecting Immunization targets - Supercritical networks

Supercritical package dependency networks are those that contain a strongly connected component (SCC) of significant size. Maven is one of them.

In [30]:
maven = OliviaNetwork(r'data/maven-2020-01-12.olv')

Supercritical networks are much more vulnerable, because the big SCC contributes greatly to the propagation of defects.

In [31]:
failure_vulnerability(maven)

Computing Reach
     Processing node: 124K      


1805.5391236430194

Let's see what happens if we immunize the delta set for Maven:

In [32]:
maven_delta_set = iset_delta_set_reach (maven)
immunization_delta(maven, maven_delta_set)

Reach retrieved from metrics cache
Computing Surface
     Processing node: 0K        
Reach retrieved from metrics cache
Surface retrieved from metrics cache
Reach retrieved from metrics cache
Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
Computing Reach
     Processing node: 119K      


1800.3279159303206

We reduced the vulnerability of Maven by 1800! This means that before immunization, the failure of a single package could affect an average of 1805 other packages. After immunization, this number is 5 (five packages).

However, there is a catch.

In [33]:
len(maven_delta_set)

6540

In supercritical networks, the delta set is larger. Here it represents approximately 5% of the network (it was 0.04% in PyPI). In fact, the delta set is not very useful in supercritical networks, as it usually contains the largest SCC, which is the real cause of the high vulnerability.

*iset_sap(...)* computes an immunization set by detecting the strong articulation points (SAP) of the biggest SCC in the network. SAPs are packages whose immunization is likely to contribute to the break-up of the SCC, reducing its ability to spread defects.

In [40]:
sap = iset_sap(maven)

immunization_delta(maven, sap)

Reach retrieved from metrics cache
Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
Computing Reach
     Processing node: 125K      


1695.3186300807877

In [42]:
len(sap)

351

The SAP set is much smaller (0.28% of the network), and still achieves a reduction in vulnerability of nearly 94%.

Note that here we have not used the *'analytic'* method for the calculation of the immunization delta, since the sets are larger and it could be really slow.
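To illustrate what a strong articulation point is, the following brute-force sketch removes each node of a strongly connected component in turn and checks whether the rest is still strongly connected. It is meant for tiny examples only and is not the algorithm used by *iset_sap(...)*; the helper names and the toy graph are assumptions.

```python
def is_strongly_connected(nodes, successors):
    """True if every node in `nodes` can reach every other one within `nodes`."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    predecessors = {n: set() for n in nodes}
    for n in nodes:
        for m in successors.get(n, ()):
            if m in nodes:
                predecessors[m].add(n)

    def covers_all(adjacency):
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            for nxt in adjacency.get(stack.pop(), ()):
                if nxt in nodes and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen == nodes

    # Strongly connected: one node reaches all, and all reach that node.
    return covers_all(successors) and covers_all(predecessors)


def strong_articulation_points(scc, successors):
    """Nodes whose removal breaks the strongly connected component `scc`."""
    scc = set(scc)
    return {v for v in scc if not is_strongly_connected(scc - {v}, successors)}


# Toy SCC (assumed): two fully connected triangles that share node "c".
SUCCESSORS = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"},
    "e": {"c", "d"},
}
print(strong_articulation_points(SUCCESSORS, SUCCESSORS))  # {'c'}
```

Removing `c` splits the component into `{a, b}` and `{d, e}`, while removing any other node leaves the rest strongly connected. This is the sense in which immunizing SAPs helps to break up a large SCC.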

Ranking on the SAP set gives us a tool for finding smaller sets:

In [52]:
immunization_delta(maven, iset_naive_ranking(100, maven.get_metric(DependentsCount), subset=sap))

DependentsCount retrieved from metrics cache
Reach retrieved from metrics cache
Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
Computing Reach
     Processing node: 125K      


1544.4391883362787

Notice that the SAP approach is completely ineffective in subcritical networks:

In [55]:
immunization_delta(pypi, iset_sap(pypi), algorithm='analytic')

0.013040223771815782

Even the immunization of the largest SCC in full has a negligible effect on the network's vulnerability:

In [56]:
immunization_delta(pypi, pypi.sorted_clusters()[0], algorithm='analytic')

0.019717921443485796

### 03 - Selecting Immunization targets - Advanced

Calculating the immunization delta is computationally expensive and the problem of selecting an optimal immunization set of a given size is mathematically intractable for large repositories.

In this guide we have used PyPI and Maven, but there are much larger networks, such as npm with over a million packages.

We will now give some ideas of additional techniques that can be used for the heuristic exploration of the problem.

#### Manipulating immunization sets

Set operations may be interesting to facilitate the experimental search for good immunization sets.

Packages that are both in the PyPI delta set and in the top 100 by DependentsCount:

In [59]:
delta_set & iset_naive_ranking(100, pypi.get_metric(DependentsCount))

DependentsCount retrieved from metrics cache


{'boto3',
 'configparser',
 'cryptography',
 'jsonschema',
 'packaging',
 'pbr',
 'pytest',
 'requests',
 'setuptools',
 'wheel'}

In [64]:
immunization_delta(pypi, {'boto3',
                         'configparser',
                         'cryptography',
                         'jsonschema',
                         'packaging',
                         'pbr',
                         'pytest',
                         'requests',
                         'setuptools',
                         'wheel'}, algorithm='analytic')

5.7339360989638735

That approach is much better than simply selecting the top 10 packages by DependentsCount:

In [65]:
immunization_delta(pypi,
                   iset_naive_ranking(10, pypi.get_metric(DependentsCount)), 
                   algorithm='analytic')

DependentsCount retrieved from metrics cache


3.7312768388291375

#### Ranking by compound metrics

Since arithmetic operations can be performed with MetricStats objects, it is easy to employ ranking functions with compound metrics. For example, REACH*SURFACE is a theoretical upper bound of immunization delta.

In [71]:
upper = iset_naive_ranking(50, pypi.get_metric(Reach)*pypi.get_metric(Surface))
immunization_delta(pypi, upper, algorithm='analytic')

Reach retrieved from metrics cache
Surface retrieved from metrics cache


8.024760666587873

In fact, in this case we can obtain exactly the PyPI delta set using this technique, with the added benefit of being able to regulate the number of elements in the set.

In [72]:
len(delta_set)

18

In [73]:
upper = iset_naive_ranking(18, pypi.get_metric(Reach)*pypi.get_metric(Surface))

Reach retrieved from metrics cache
Surface retrieved from metrics cache


In [75]:
# void set
delta_set-upper

set()
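The REACH*SURFACE bound can be checked by hand on a toy graph. The sketch below relies on assumptions that are not taken from OLIVIA: REACH counts a package plus everything that transitively depends on it, SURFACE counts a package plus everything it transitively depends on, an immunized package neither fails nor propagates failure, and the bound is compared against the total REACH lost (the immunization delta multiplied by the number of packages). All names are hypothetical.

```python
def closure(adjacency, start, blocked=frozenset()):
    """Packages reachable from `start` (itself included), skipping blocked ones."""
    if start in blocked:
        return set()
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Toy diamond (assumed data): package -> direct dependants.
DEPENDANTS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
# Reverse the edges to obtain package -> direct dependencies.
DEPENDENCIES = {p: [] for p in DEPENDANTS}
for package, deps in DEPENDANTS.items():
    for dependant in deps:
        DEPENDENCIES[dependant].append(package)


def reach(package):
    return len(closure(DEPENDANTS, package))


def surface(package):
    return len(closure(DEPENDENCIES, package))


def total_reach(immune=frozenset()):
    return sum(len(closure(DEPENDANTS, p, immune)) for p in DEPENDANTS)


def reach_lost(package):
    """Total REACH removed from the network by immunizing a single package."""
    return total_reach() - total_reach({package})


for package in sorted(DEPENDANTS):
    print(package, reach_lost(package), reach(package) * surface(package))
```

For `b` the loss (3) stays below the bound (4), because `d` remains reachable from `a` through `c`. For the other packages the bound is tight in this example.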

#### Brute force and Greedy selection

Once we get small enough candidate sets, we could try brute-force and greedy approaches for finding smaller sets. 

Considering the PyPI delta set, let's see which package individually contributes most to vulnerability reduction:

In [38]:
[(i, immunization_delta(pypi, {i}, algorithm='analytic')) for i in delta_set]

[('importlib-metadata', 1.3886065476893985),
 ('requests', 1.6734625536776582),
 ('setuptools', 0.9829807351376906),
 ('zipp', 0.3518496631603829),
 ('distribute', 0.2246582358271284),
 ('pytest', 0.7602135287397077),
 ('boto3', 0.3025450104400583),
 ('packaging', 0.6895757002718355),
 ('pbr', 0.34966316038293344),
 ('ipykernel', 0.05127447504235118),
 ('cryptography', 0.4491982823149352),
 ('s3transfer', 0.06293582318874838),
 ('jsonschema', 0.5747350588976874),
 ('pluggy', 0.06888468660126856),
 ('configparser', 0.3583894732695111),
 ('keyring', 0.25920891935547413),
 ('notebook', 0.1722018673915613),
 ('wheel', 0.5350037426624118)]

It should be noted that, as the sets of dependants and transitive dependencies are interrelated, the immunization delta is not additive. For example, if you take *'importlib-metadata'* out of the delta set, the immunization delta decreases by only a small amount:

In [52]:
immunization_delta(pypi, delta_set-{'importlib-metadata'}, algorithm='analytic')

6.274553835244061
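A greedy search adds, at each step, the candidate that increases the immunization delta the most given what has already been chosen. The sketch below runs on a toy chain with a simplified model (an immunized package neither fails nor propagates failure); `toy_delta` and `greedy_set` are hypothetical helpers rather than OLIVIA functions.

```python
def affected(dependants, start, immune):
    """Set of packages hit by a failure in `start` (empty if `start` is immune)."""
    if start in immune:
        return set()
    seen, stack = {start}, [start]
    while stack:
        for nxt in dependants[stack.pop()]:
            if nxt not in seen and nxt not in immune:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def toy_delta(dependants, immune):
    """Decrease in average REACH after immunizing the set `immune`."""
    before = sum(len(affected(dependants, p, frozenset())) for p in dependants)
    after = sum(len(affected(dependants, p, immune)) for p in dependants)
    return (before - after) / len(dependants)


def greedy_set(dependants, candidates, size):
    """Grow the set one package at a time, always taking the best next addition."""
    chosen = []
    remaining = sorted(candidates)
    while remaining and len(chosen) < size:
        best = max(remaining, key=lambda p: toy_delta(dependants, set(chosen) | {p}))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy chain (assumed data): a <- b <- c <- d, stored as package -> dependants.
CHAIN = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
picked = greedy_set(CHAIN, CHAIN, 2)
print(picked, toy_delta(CHAIN, set(picked)))               # ['b', 'c'] 2.0
print(toy_delta(CHAIN, {"b"}) + toy_delta(CHAIN, {"c"}))   # 3.0
```

The joint delta of `b` and `c` (2.0) is smaller than the sum of their individual deltas (3.0), which is the non-additivity described above. Each greedy step costs one delta evaluation per remaining candidate, so it only pays off on small candidate sets.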

#### Using other centrality measures

Common measures of centrality in the field of network science, or more custom or specific ones, can be used to select immunization sets. You may need to narrow down the set of candidates to make the computation feasible.

For example, using betweenness centrality on the SAP set gives the best results we have found for small immunization sets in Maven:

In [79]:
import networkx as nx

# Using the NetworkX implementation of betweenness centrality
# You can access the full NetworkX network underlying the repository using OliviaNetwork.network.
# Here we restrict the centrality computation to the subgraph induced by the SAP set:
sap_betweenness = nx.betweenness_centrality(maven.network.subgraph(sap))

In [84]:
from olivia.packagemetrics import MetricStats

# Build a MetricStats object to use in iset_naive_ranking
# MetricStats constructor admits a dictionary of values such as those produced by NetworkX
sap_betweenness = MetricStats(sap_betweenness)

In [86]:
immunization_delta(maven, iset_naive_ranking(10, sap_betweenness))

Reach retrieved from metrics cache
Building Olivia Model
     Finding strongly connected components (SCCs)...
     Building condensation network...
     Adding structural meta-data...
     Done
Computing Reach
     Processing node: 125K      


471.4254134057055

A visual representation of this technique follows. The drawn portion (981 nodes) is the biggest SCC of Maven, representing about 0.8% of the network.

The highlighted white nodes correspond to the SAP set (351 nodes), whose immunization reduces the REACH-vulnerability of the network as a whole by 93.9%. The blue highlights are the 10 SAPs with the highest out-degree centrality (number of dependants) and the red ones are the 10 with the highest betweenness. By immunizing these sets, reductions in network vulnerability of 13% and 26%, respectively, are achieved.
The representation omits the direction of the arcs and uses the Kamada-Kawai layout algorithm.

<br>

![Maven SCC](docs/img/maven.png "SAP immunization of Maven with betweenness ranking")  

<br>