# Olivia-Finder introduction

## 0 - Prerequisites

**Make sure the required dependencies are installed**

In [None]:
%pip install -r ../requirements.txt

**Add the library path to `sys.path`**

In [1]:
# Add the path to the olivia_finder package
import sys
sys.path.append('../')
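
A relative entry such as `'../'` depends on the notebook's current working directory. As a minimal sketch, assuming the package lives one directory above the notebook, the path can be resolved first and added only once. The helper name `add_to_sys_path` is hypothetical and not part of olivia_finder.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: str) -> str:
    """Resolve `directory` to an absolute path and append it to sys.path once."""
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Assumption: the olivia_finder package sits one level above the notebook.
package_root = add_to_sys_path("../")
print(package_root)
```

Calling the helper again with the same directory leaves `sys.path` unchanged, which keeps repeated cell executions from piling up duplicate entries.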

## 1 - DataSource

The **DataSource** class provides an interface for obtaining data from the different existing package managers.

A data source can be one of the following implementations:

- **Scraper**, to obtain the data directly from the website of the package manager
- **CSVNetwork**, to obtain the data from a CSV file
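
To make the idea of a shared interface concrete, here is a rough sketch of an abstract base class with one toy implementation. It only mirrors the method names used later in this notebook (`get_info`, `obtain_package_names`, `obtain_package_data`). The classes `DataSourceSketch` and `InMemorySource` are hypothetical and are not the library's actual code.

```python
from abc import ABC, abstractmethod


class DataSourceSketch(ABC):
    """Hypothetical stand-in for the DataSource interface (not the library's code)."""

    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description

    def get_info(self) -> dict:
        return {"name": self.name, "description": self.description}

    @abstractmethod
    def obtain_package_names(self) -> list:
        """Return the names of all packages known to this source."""

    @abstractmethod
    def obtain_package_data(self, package_name: str) -> dict:
        """Return the data of one package, or raise if it is unknown."""


class InMemorySource(DataSourceSketch):
    """Toy implementation backed by a dict, standing in for a scraper or a CSV."""

    def __init__(self, packages: dict):
        super().__init__("InMemory", "Toy data source for illustration")
        self._packages = packages

    def obtain_package_names(self) -> list:
        return sorted(self._packages)

    def obtain_package_data(self, package_name: str) -> dict:
        return self._packages[package_name]  # KeyError if the name is unknown


source = InMemorySource(
    {"ABSSeq": {"name": "ABSSeq", "version": "1.52.0", "dependencies": []}}
)
print(source.get_info())
print(source.obtain_package_names())
```

Code that only calls these methods does not need to know whether the data comes from a website or from a file.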

In this example we will instantiate the desired implementation of the data source class

Below are some of its most relevant features

### Obtaining data via web scraping

First we import the data source implementation we want. For this example we will use the Bioconductor scraper

In [2]:
from olivia_finder.data_source.scrapers.bioconductor import BiocScraper
bioc_scraper_ds = BiocScraper()

Show relevant information about the data source

In [3]:
print(bioc_scraper_ds.get_info())

{'name': 'Bioconductor', 'description': 'Scraper class implementation for the Bioconductor package network'}


Get a list with the name of the packages obtained from this source

Specifically, the BiocScraper class gets the list of packages from the URL:

-   https://bioconductor.org/packages/release/BiocViews.html#___Software

Each specific implementation of a Scraper must manage this process on its own.

In [3]:
package_list = bioc_scraper_ds.obtain_package_names()
package_list[:10]

['ABSSeq',
 'ABarray',
 'ACE',
 'ACME',
 'ADAM',
 'ADAMgui',
 'ADImpute',
 'ADaCGH2',
 'AGDEX',
 'AIMS']

We can obtain the data from a specific package, for example the **`DeepBlueR`** package

In [4]:
deepbluer = bioc_scraper_ds.obtain_package_data("DeepBlueR")
deepbluer

{'name': 'DeepBlueR',
 'version': '1.24.1',
 'dependencies': [{'name': 'R', 'version': '>= 3.3'},
  {'name': 'XML', 'version': ''},
  {'name': 'RCurl', 'version': ''},
  {'name': 'GenomicRanges', 'version': ''},
  {'name': 'data.table', 'version': ''},
  {'name': 'stringr', 'version': ''},
  {'name': 'diffr', 'version': ''},
  {'name': 'dplyr', 'version': ''},
  {'name': 'methods', 'version': ''},
  {'name': 'rjson', 'version': ''},
  {'name': 'utils', 'version': ''},
  {'name': 'R.utils', 'version': ''},
  {'name': 'foreach', 'version': ''},
  {'name': 'withr', 'version': ''},
  {'name': 'rtracklayer', 'version': ''},
  {'name': 'GenomeInfoDb', 'version': ''},
  {'name': 'settings', 'version': ''},
  {'name': 'filehash', 'version': ''}],
 'url': 'https://www.bioconductor.org/packages/release/bioc/html/DeepBlueR.html'}
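
Each dependency carries its version constraint as a plain string, for example `'>= 3.3'`, or `''` when no constraint is declared. A minimal sketch of splitting that string, assuming the "operator, then version" layout seen in the output above. `split_constraint` is a hypothetical helper, not an olivia_finder function.

```python
import re

# An optional comparison operator followed by a version string.
_CONSTRAINT = re.compile(r"^\s*(>=|<=|==|>|<)?\s*(\S+)\s*$")


def split_constraint(constraint: str):
    """Return (operator, version); both are None when the constraint is empty."""
    match = _CONSTRAINT.match(constraint or "")
    if match is None:
        return None, None
    return match.group(1), match.group(2)


print(split_constraint(">= 3.3"))
print(split_constraint(""))
```

Other package managers may use different constraint syntaxes, so this pattern would need adjusting per source.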

Package names are **case-sensitive**. If the package is not found, a **ScraperError** exception is raised

In [5]:
try:
    deepbluer2 = bioc_scraper_ds.obtain_package_data("deepbluer")
except Exception as e:
    print(e)

ScraperError: Package deepbluer not found
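
One way to guard against wrong capitalization is to map the requested name onto the canonical spelling before querying the data source. The sketch below assumes the list of known names has already been obtained (for instance from `obtain_package_names()`). `resolve_package_name` is a hypothetical helper.

```python
def resolve_package_name(name: str, known_names: list):
    """Return the canonical spelling of `name`, or None if no package matches."""
    by_lowercase = {known.lower(): known for known in known_names}
    return by_lowercase.get(name.lower())


known = ["ABSSeq", "ABarray", "DeepBlueR"]
print(resolve_package_name("deepbluer", known))  # DeepBlueR
print(resolve_package_name("missing", known))    # None
```

If two packages differ only in capitalization, this lookup keeps the last one in the list, so it is only a convenience and not a replacement for exact names.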


We can get the data from a list of package names using the function:
-   ```python
    obtain_packages_data(list[str])
    ```

In [6]:
pkgs_data, not_found = bioc_scraper_ds.obtain_packages_data(package_list[:3])
pkgs_data

[{'name': 'ABSSeq',
  'version': '1.52.0',
  'dependencies': [{'name': 'R', 'version': '>= 2.10'},
   {'name': 'methods', 'version': ''},
   {'name': 'locfit', 'version': ''},
   {'name': 'limma', 'version': ''}],
  'url': 'https://www.bioconductor.org/packages/release/bioc/html/ABSSeq.html'},
 {'name': 'ABarray',
  'version': '1.66.0',
  'dependencies': [{'name': 'Biobase', 'version': ''},
   {'name': 'graphics', 'version': ''},
   {'name': 'grDevices', 'version': ''},
   {'name': 'methods', 'version': ''},
   {'name': 'multtest', 'version': ''},
   {'name': 'stats', 'version': ''},
   {'name': 'tcltk', 'version': ''},
   {'name': 'utils', 'version': ''}],
  'url': 'https://www.bioconductor.org/packages/release/bioc/html/ABarray.html'},
 {'name': 'ACE',
  'version': '1.16.0',
  'dependencies': [{'name': 'R', 'version': '>= 3.4'},
   {'name': 'Biobase', 'version': ''},
   {'name': 'QDNAseq', 'version': ''},
   {'name': 'ggplot2', 'version': ''},
   {'name': 'grid', 'version': ''},
   {

In [9]:
for p in pkgs_data:
    print(f'Package: {p["name"]} ({p["version"]})')

    for d in p["dependencies"]:
        print(f'-   Dependency: {d["name"]} {d["version"]}')

Package: ABSSeq (1.52.0)
-   Dependency: R >= 2.10
-   Dependency: methods 
-   Dependency: locfit 
-   Dependency: limma 
Package: ABarray (1.66.0)
-   Dependency: Biobase 
-   Dependency: graphics 
-   Dependency: grDevices 
-   Dependency: methods 
-   Dependency: multtest 
-   Dependency: stats 
-   Dependency: tcltk 
-   Dependency: utils 
Package: ACE (1.16.0)
-   Dependency: R >= 3.4
-   Dependency: Biobase 
-   Dependency: QDNAseq 
-   Dependency: ggplot2 
-   Dependency: grid 
-   Dependency: stats 
-   Dependency: utils 
-   Dependency: methods 
-   Dependency: grDevices 
-   Dependency: GenomicRanges 


Packages that were not found are returned as the second element of the tuple

In [7]:
pkgs_data, not_found = bioc_scraper_ds.obtain_packages_data(["deepbluer", "DeepBlueR"])
not_found

['deepbluer']
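
The "found data plus missing names" return shape can be sketched with a small loop. This is an illustration of the pattern only, not the library's implementation: `fetch_many` is hypothetical, and `KeyError` stands in for whatever "not found" error the real data source raises.

```python
from typing import Callable


def fetch_many(names: list, fetch_one: Callable[[str], dict]):
    """Fetch each name with `fetch_one`; return (found_data, not_found_names)."""
    found, not_found = [], []
    for name in names:
        try:
            found.append(fetch_one(name))
        except KeyError:  # stand-in for the data source's "not found" error
            not_found.append(name)
    return found, not_found


catalog = {"DeepBlueR": {"name": "DeepBlueR", "version": "1.24.1"}}
data, missing = fetch_many(["deepbluer", "DeepBlueR"], catalog.__getitem__)
print(missing)  # ['deepbluer']
```

Returning the missing names instead of raising lets a long batch finish even when a few names are wrong.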

### Obtaining data from CSV files

In [2]:
from olivia_finder.data_source.csv_network import CSVNetwork

In [4]:
# Load the network
bioc_csv_ds = CSVNetwork(
    "results/csv_datasets/bioconductor_adjlist_scraping.csv",  # Path to the CSV file
    "Bioconductor",                         # Name of the data source
    "Bioconductor as a CSV file",            # Description of the data source
    dependent_field="name",                 # Name of the field that contains the name of the package
    dependency_field="dependency",          # Name of the field that contains the name of the dependency
    dependent_version_field="version",      # Name of the field that contains the version of the package
    dependency_version_field="dependency_version",     # Name of the field that contains the version of the dependency
    dependent_url_field="url",              # Name of the field that contains the URL of the package
)
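
The field arguments describe an adjacency-list layout: one CSV row per (package, dependency) pair. As a rough sketch of how such rows can be grouped into one record per package, using only the standard library and made-up sample rows with the column names from the cell above (the URL column is left out for brevity). `load_adjlist` is hypothetical and not how `CSVNetwork` is necessarily implemented.

```python
import csv
import io

# Hypothetical sample content using the column names from the cell above.
CSV_TEXT = """name,version,dependency,dependency_version
ABSSeq,1.52.0,R,>= 2.10
ABSSeq,1.52.0,methods,
ACE,1.16.0,Biobase,
"""


def load_adjlist(text: str) -> dict:
    """Group adjacency-list rows into one record per dependent package."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(text)):
        package = grouped.setdefault(
            row["name"],
            {"name": row["name"], "version": row["version"], "dependencies": []},
        )
        package["dependencies"].append(
            {"name": row["dependency"], "version": row["dependency_version"]}
        )
    return grouped


adjlist = load_adjlist(CSV_TEXT)
print(list(adjlist))
print(adjlist["ABSSeq"]["dependencies"])
```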

In [5]:
package_list = bioc_csv_ds.obtain_package_names()
package_list[:10]

['ABSSeq',
 'ABarray',
 'ACE',
 'ACME',
 'ADAM',
 'ADAMgui',
 'ADImpute',
 'ADaCGH2',
 'AGDEX',
 'AIMS']

In [6]:
deepbluer = bioc_csv_ds.obtain_package_data("DeepBlueR")
deepbluer

{'name': 'DeepBlueR',
 'version': '1.24.1',
 'url': 'https://www.bioconductor.org/packages/release/bioc/html/DeepBlueR.html',
 'dependencies': [{'name': 'R', 'version': '>= 3.3'},
  {'name': 'XML', 'version': nan},
  {'name': 'RCurl', 'version': nan},
  {'name': 'GenomicRanges', 'version': nan},
  {'name': 'data.table', 'version': nan},
  {'name': 'stringr', 'version': nan},
  {'name': 'diffr', 'version': nan},
  {'name': 'dplyr', 'version': nan},
  {'name': 'methods', 'version': nan},
  {'name': 'rjson', 'version': nan},
  {'name': 'utils', 'version': nan},
  {'name': 'R.utils', 'version': nan},
  {'name': 'foreach', 'version': nan},
  {'name': 'withr', 'version': nan},
  {'name': 'rtracklayer', 'version': nan},
  {'name': 'GenomeInfoDb', 'version': nan},
  {'name': 'settings', 'version': nan},
  {'name': 'filehash', 'version': nan}]}
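
Unlike the scraper output, which uses `''` for a missing version, the CSV-based output shows `nan`. These values presumably come from empty cells in the CSV file. A minimal sketch of normalizing them back to empty strings, so both sources can be compared directly. `blank_missing_versions` is a hypothetical helper.

```python
import math


def _is_nan(value) -> bool:
    """True only for float NaN values (strings are never NaN)."""
    return isinstance(value, float) and math.isnan(value)


def blank_missing_versions(package: dict) -> dict:
    """Return a copy of `package` where NaN dependency versions become ''."""
    cleaned = dict(package)
    cleaned["dependencies"] = [
        {"name": dep["name"], "version": "" if _is_nan(dep["version"]) else dep["version"]}
        for dep in package["dependencies"]
    ]
    return cleaned


example = {
    "name": "DeepBlueR",
    "version": "1.24.1",
    "dependencies": [
        {"name": "R", "version": ">= 3.3"},
        {"name": "XML", "version": float("nan")},
    ],
}
print(blank_missing_versions(example)["dependencies"])
```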

In [7]:
packages = bioc_csv_ds.obtain_packages_data(package_list[:3])
packages

[{'name': 'ABSSeq',
  'version': '1.52.0',
  'url': 'https://www.bioconductor.org/packages/release/bioc/html/ABSSeq.html',
  'dependencies': [{'name': 'R', 'version': '>= 2.10'},
   {'name': 'methods', 'version': nan},
   {'name': 'locfit', 'version': nan},
   {'name': 'limma', 'version': nan}]},
 {'name': 'ABarray',
  'version': '1.66.0',
  'url': 'https://www.bioconductor.org/packages/release/bioc/html/ABarray.html',
  'dependencies': [{'name': 'Biobase', 'version': nan},
   {'name': 'graphics', 'version': nan},
   {'name': 'grDevices', 'version': nan},
   {'name': 'methods', 'version': nan},
   {'name': 'multtest', 'version': nan},
   {'name': 'stats', 'version': nan},
   {'name': 'tcltk', 'version': nan},
   {'name': 'utils', 'version': nan}]},
 {'name': 'ACE',
  'version': '1.16.0',
  'url': 'https://www.bioconductor.org/packages/release/bioc/html/ACE.html',
  'dependencies': [{'name': 'R', 'version': '>= 3.4'},
   {'name': 'Biobase', 'version': nan},
   {'name': 'QDNAseq', 'versi

---

## 2 - Initialization of a package manager

In [8]:
from olivia_finder.package_manager import PackageManager

**Declare the class**

Initialize the PackageManager class with the scraper implementation we want

In [9]:
from olivia_finder.data_source.scrapers.pypi import PypiScraper
pypi_scraper_pm = PackageManager(PypiScraper())

Or initialize the class from a CSV file

In [5]:
from olivia_finder.data_source.csv_network import CSVNetwork

In [6]:
cran_scraped_csv_pm = PackageManager(
    CSVNetwork(
        "results/csv_datasets/cran_adjlist_scraping.csv",   # Path to the CSV file
        "CRAN",                                             # Name of the data source
        "CRAN as a CSV file",                               # Description of the data source
        dependent_field="name",                             # Name of the field that contains the name of the package
        dependency_field="dependency",                      # Name of the field that contains the name of the dependency
        dependent_version_field="version",                  # Name of the field that contains the version of the package
        dependency_version_field="dependency_version",     # Name of the field that contains the version of the dependency
        dependent_url_field="url",                          # Name of the field that contains the URL of the package
    )
)

print(f'Scraped CRAN packages: {len(cran_scraped_csv_pm.data_source.obtain_package_names())}')

Scraped CRAN packages: 18195


In [7]:
cran_librariesio_csv_pm = PackageManager(
    CSVNetwork(
        "results/csv_datasets/cran_librariesio_dependencies.csv",   # Path to the CSV file  
        "CRAN",                                                     # Name of the data source
        "CRAN as a CSV file",                                       # Description of the data source
        dependent_field="Project Name",                             # Name of the field that contains the name of the package
        dependency_field="Dependency Name",                         # Name of the field that contains the name of the dependency
        dependent_version_field="Version Number",                   # Name of the field that contains the version of the package
        dependency_version_field="Dependency Requirements"          # Name of the field that contains the version of the dependency
    )
)
print(f'Libraries.io CRAN packages: {len(cran_librariesio_csv_pm.data_source.obtain_package_names())}')

Libraries.io CRAN packages: 15522


**Get a package from package manager**

In [10]:
networkx = pypi_scraper_pm.obtain_package("networkx")
networkx.print()

Package:
  name: networkx
  version: 3.0
  url: https://pypi.org/project/networkx/
  dependencies:
    numpy:(>=1.20)
    scipy:(>=1.8)
    matplotlib:(>=3.4)
    pandas:(>=1.3)
    pre-commit:(>=2.20)
    mypy:(>=0.991)
    sphinx:(==5.2.3)
    pydata-sphinx-theme:(>=0.11)
    sphinx-gallery:(>=0.11)
    numpydoc:(>=1.5)
    pillow:(>=9.2)
    nb2plots:(>=0.6)
    texext:(>=0.6.7)
    lxml:(>=4.6)
    pygraphviz:(>=1.10)
    pydot:(>=1.4.2)
    sympy:(>=1.10)
    pytest:(>=7.2)
    pytest-cov:(>=4.0)
    codecov:(>=2.1)


In [8]:
cran_scraped_csv_pm.obtain_package("A3").print()
print("\n------------------\n")
cran_librariesio_csv_pm.obtain_package("A3").print()

Package:
  name: A3
  version: 1.0.0
  url: https://cran.r-project.org/package=A3
  dependencies:
    R:≥ 2.15.0
    xtable:nan
    pbapply:nan

------------------

Package:
  name: A3
  version: 1.0.0
  url: None
  dependencies:
    R:>= 2.15.0
    randomForest:*


**Get packages from a list of package names**

The web-scraping-based implementation obtains the data from the package manager's website

In [11]:
packages = pypi_scraper_pm.obtain_packages(["networkx", "numpy", "pandas"])
packages

[<olivia_finder.package.Package at 0x7fed5937bd90>,
 <olivia_finder.package.Package at 0x7fed27ae2250>,
 <olivia_finder.package.Package at 0x7fed27ae26a0>]

In [20]:
packages[2].print()

Package:
  name: pandas
  version: 1.5.3
  url: https://pypi.org/project/pandas/
  dependencies:
    python-dateutil:(>=2.8.1)
    pytz:(>=2020.1)
    numpy:(>=1.23.2)
    hypothesis:(>=5.5.3)
    pytest:(>=6.0)
    pytest-xdist:(>=1.31)


The CSV-based implementation obtains the data from the CSV file

In [14]:
cran_packages = cran_scraped_csv_pm.obtain_packages(["A3", "pbapply", "xtable"])
for p in cran_packages:
    print("\n------------------\n")
    p.print()


------------------

Package:
  name: A3
  version: 1.0.0
  url: https://cran.r-project.org/package=A3
  dependencies:
    R:≥ 2.15.0
    xtable:nan
    pbapply:nan

------------------

Package:
  name: pbapply
  version: 1.7-0
  url: https://cran.r-project.org/package=pbapply
  dependencies:
    R:≥ 3.2.0
    parallel:nan

------------------

Package:
  name: xtable
  version: 1.8-4
  url: https://cran.r-project.org/package=xtable
  dependencies:
    R:≥ 2.10.0
    stats:nan
    utils:nan


**Get all packages from a package manager**

***Note:***

The packages obtained can be stored in the PackageManager object

-   This is activated with the flag

    ```python
    extend=True
    ```

The progress of obtaining the packages can be shown

-   This is activated with the flag

    ```python
    show_progress=True
    ```
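
As a rough illustration of what these two flags do conceptually, the sketch below keeps results on the manager when `extend=True` and prints a simple counter when `show_progress=True`. `TinyManager` is a hypothetical stand-in written for this example. The real `PackageManager` may behave differently (for instance, it can be called without a list of names to obtain every package).

```python
class TinyManager:
    """Hypothetical stand-in for a package manager with an in-memory store."""

    def __init__(self, fetch_one):
        self.fetch_one = fetch_one   # callable: package name -> package data
        self.packages = {}           # store filled only when extend=True

    def obtain_packages(self, names, extend=False, show_progress=False):
        results = []
        total = len(names)
        for index, name in enumerate(names, start=1):
            data = self.fetch_one(name)
            results.append(data)
            if extend:
                self.packages[name] = data   # keep the result on the manager
            if show_progress:
                print(f"{index}/{total} packages obtained")
        return results


manager = TinyManager(lambda name: {"name": name})
manager.obtain_packages(["numpy", "pandas"], extend=True, show_progress=True)
print(sorted(manager.packages))
```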

Getting all the packages from a package manager can take a while, so it is recommended to save the data to a CSV file for later use.

As the progress bars below show, obtaining the roughly 440,000 PyPI packages takes around 7 to 8 hours.
For Bioconductor, obtaining its roughly 2,200 packages takes around 4 minutes.

-   From **Scraper** data source implementation

In [7]:
pypi_packages = pypi_scraper_pm.obtain_packages(extend=True, show_progress=True)

 15%|█▌        | 66423/438514 [1:14:50<6:40:39, 15.48it/s] 

In [None]:
bioc_scraper_pm = PackageManager(BiocScraper())
bioconductor_packages = bioc_scraper_pm.obtain_packages(extend=True, show_progress=True)

100%|██████████| 2183/2183 [03:47<00:00,  9.59it/s]


-   From **CSVNetwork** data source implementation

In [8]:
cran_packages = cran_scraped_csv_pm.obtain_packages(extend=True, show_progress=True)

100%|██████████| 18195/18195 [02:04<00:00, 145.67it/s]
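
The durations can be cross-checked from the totals and rates printed in the progress bars above: total time is roughly the number of packages divided by the throughput. This is a back-of-the-envelope sketch. It assumes the rate stays constant, which is only approximately true for the PyPI run, whose rate was read at 15% progress.

```python
def estimated_minutes(total_items: int, items_per_second: float) -> float:
    """Estimate total run time in minutes from item count and throughput."""
    return total_items / items_per_second / 60


# Totals and rates read from the progress bars above.
runs = {
    "PyPI (scraper)": (438514, 15.48),
    "Bioconductor (scraper)": (2183, 9.59),
    "CRAN (CSV)": (18195, 145.67),
}
for label, (total, rate) in runs.items():
    print(f"{label}: {estimated_minutes(total, rate):.1f} min")
```

The PyPI estimate comes out at several hundred minutes, while the two smaller runs finish in a few minutes, which is why saving the results for later use is recommended.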


In [15]:
cran_scraped_csv_pm.obtain_package("A3").print()

Package:
  name: A3
  version: 1.0.0
  url: https://cran.r-project.org/package=A3
  dependencies:
    R:≥ 2.15.0
    xtable:nan
    pbapply:nan


In [16]:
cran_librariesio_csv_pm.obtain_package("A3").print()

Package:
  name: A3
  version: 1.0.0
  url: None
  dependencies:
    R:>= 2.15.0
    randomForest:*


As can be seen, the data sources are inconsistent with each other (for example, the dependencies of `A3` differ), so it is recommended to use the most up-to-date source
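
The difference between two sources can be made explicit with set operations on the dependency names. The sketch below uses the `A3` dependencies printed above for each source. `compare_dependencies` is a hypothetical helper.

```python
def compare_dependencies(first: set, second: set) -> dict:
    """Split two dependency-name sets into shared and source-specific names."""
    return {
        "common": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


# Dependency names of A3 as printed above for each source.
scraped = {"R", "xtable", "pbapply"}
libraries_io = {"R", "randomForest"}
print(compare_dependencies(scraped, libraries_io))
```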

#### Data persistence

The PackageManager object can also be saved to disk and loaded back, which preserves the data and avoids repeating costly processes such as web scraping.

**Save the PackageManager object**

We can save the object with the `save` function.

The file extension is irrelevant since the file is a binary serialization, but by convention the **.olvpm** extension is used to identify PackageManager files

In [18]:
cran_scraped_csv_pm.save("results/package_managers/cran.olvpm")

**Load the PackageManager object**

We can load the PackageManager object with the static method

```python 
    PackageManager.load(path:str)
```

In [13]:
cran_loaded_csv_pm = PackageManager.load("results/package_managers/cran.olvpm")
cran_loaded_csv_pm.obtain_package("A3").print()

Package:
  name: A3
  version: 1.0.0
  url: https://cran.r-project.org/package=A3
  dependencies:
    R:≥ 2.15.0
    xtable:nan
    pbapply:nan
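
The text above only states that the file is a binary serialization. As a minimal sketch of such a save and load round trip, assuming a pickle-style format (the library's actual format is not confirmed here), with a hypothetical `TinyStore` class and a temporary file:

```python
import pickle
import tempfile
from pathlib import Path


class TinyStore:
    """Hypothetical object with save/load methods built on pickle."""

    def __init__(self, packages: dict):
        self.packages = packages

    def save(self, path: str) -> None:
        # Serialize only the plain data, not the object itself.
        with open(path, "wb") as handle:
            pickle.dump(self.packages, handle)

    @staticmethod
    def load(path: str) -> "TinyStore":
        with open(path, "rb") as handle:
            return TinyStore(pickle.load(handle))


with tempfile.TemporaryDirectory() as folder:
    target = str(Path(folder) / "example.olvpm")
    TinyStore({"A3": "1.0.0"}).save(target)
    restored = TinyStore.load(target)

print(restored.packages)
```

Pickle files can execute arbitrary code when loaded, so files in such a format should only be loaded when they come from a trusted source.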


**Export the CSV format**

We can export the package data to a CSV file with a structure similar to that of the Libraries.io data.

We can use the following function to generate a pandas DataFrame and then write it to disk as a CSV file

-   
    ```python
    pandas_df = package_manager.export_full_adjlist()
    ```
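
To show the shape of an adjacency-list export, the sketch below flattens package records into one row per (package, dependency) pair and writes them as CSV using only the standard library. The column names and the helper `to_adjacency_rows` are assumptions made for this illustration. The actual columns produced by the library's export may differ.

```python
import csv
import io


def to_adjacency_rows(packages: list) -> list:
    """Flatten package records into one row per (package, dependency) pair."""
    rows = []
    for package in packages:
        for dep in package["dependencies"]:
            rows.append({
                "name": package["name"],
                "version": package["version"],
                "dependency": dep["name"],
                "dependency_version": dep["version"],
            })
    return rows


sample = [{
    "name": "A3",
    "version": "1.0.0",
    "dependencies": [
        {"name": "xtable", "version": ""},
        {"name": "pbapply", "version": ""},
    ],
}]
rows = to_adjacency_rows(sample)

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Note that in this simple layout a package without dependencies produces no rows, so it would not appear in the exported file.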


In [15]:
# Store the package manager as an adjacency list
cran_df = cran_loaded_csv_pm.export_full_adjlist()
cran_df.to_csv("results/csv_datasets/cran_full_adjlist.csv", index=False)
cran_df.head()