First, make sure the required dependencies are installed

In [None]:
%pip install -r requirements.txt

## Initialization of a data source

In this example we instantiate the desired implementation of the data source class

Several implementations are available for this task: some obtain the data by scraping the web pages of the respective package managers, and others read it from CSV text files
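The idea of interchangeable data sources can be sketched as a small abstract interface with one concrete implementation. This is only an illustration: the names `PackageSource` and `CSVSource`, and the two-column CSV layout, are hypothetical and are not the actual olivia_finder classes.

```python
import csv
import io
from abc import ABC, abstractmethod


class PackageSource(ABC):
    """Hypothetical interface: anything that can list packages and their dependencies."""

    @abstractmethod
    def package_names(self):
        ...

    @abstractmethod
    def dependencies_of(self, name):
        ...


class CSVSource(PackageSource):
    """Hypothetical CSV-backed source; assumes columns 'name' and 'dependency'."""

    def __init__(self, text):
        self._deps = {}
        # Each CSV row is one (package, dependency) pair
        for row in csv.DictReader(io.StringIO(text)):
            self._deps.setdefault(row["name"], []).append(row["dependency"])

    def package_names(self):
        return list(self._deps)

    def dependencies_of(self, name):
        return self._deps[name]


sample_csv = (
    "name,dependency\n"
    "S4Vectors,BiocGenerics\n"
    "S4Vectors,methods\n"
    "BiocGenerics,methods\n"
)
source = CSVSource(sample_csv)
print(source.package_names())
print(source.dependencies_of("S4Vectors"))
```

A scraper-based source would implement the same two methods by fetching and parsing web pages, so the code that consumes the data would not need to change.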

### Obtaining data via web scraping

#### How to proceed

First, we import the data source implementation we want. For this example we will use the Bioconductor scraper

In [1]:
from olivia_finder.scraping.bioconductor import BiocScraper

In [2]:
data_source = BiocScraper()

Show relevant information about the data source

In [3]:
print(data_source.get_info())

Name: Bioconductor
Description: Scraper class implementation for the Bioconductor package network


Get the list of package names available from this source

In [4]:
package_list = data_source.obtain_package_names()
package_list[:10]

['BiocGenerics',
 'S4Vectors',
 'BiocVersion',
 'GenomeInfoDb',
 'IRanges',
 'Biobase',
 'zlibbioc',
 'XVector',
 'Biostrings',
 'BiocParallel']

We can obtain the data for a specific package, for example the **`DeepBlueR`** package

In [6]:
deepbluer = data_source.obtain_package_data("DeepBlueR")
deepbluer

{'name': 'DeepBlueR',
 'version': '1.24.1',
 'url': 'https://www.bioconductor.org/packages/release/bioc/html/DeepBlueR.html',
 'dependencies': [<olivia_finder.package.Package at 0x7feccb508970>,
  <olivia_finder.package.Package at 0x7feccb508c40>,
  <olivia_finder.package.Package at 0x7feccb5086d0>,
  <olivia_finder.package.Package at 0x7feccb508b20>,
  <olivia_finder.package.Package at 0x7feccb508b80>,
  <olivia_finder.package.Package at 0x7feccb508ac0>,
  <olivia_finder.package.Package at 0x7feccb508d00>,
  <olivia_finder.package.Package at 0x7feccb508820>,
  <olivia_finder.package.Package at 0x7feccb508940>,
  <olivia_finder.package.Package at 0x7feccb508d60>,
  <olivia_finder.package.Package at 0x7feccb508610>,
  <olivia_finder.package.Package at 0x7feccb508ca0>,
  <olivia_finder.package.Package at 0x7feccb508be0>,
  <olivia_finder.package.Package at 0x7feccb508a60>,
  <olivia_finder.package.Package at 0x7feccb5086a0>,
  <olivia_finder.package.Package at 0x7feccb508a90>,
  <olivia_

Be careful: package names are **case-sensitive**. If the package is not found, a ***ScraperError*** is raised

In [7]:
from olivia_finder.scraping.scraper import ScraperError

try:
    deepbluer2 = data_source.obtain_package_data("deepbluer")
except ScraperError as e:
    print(e)

Package deepbluer not found
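One way to guard against capitalization mistakes is to resolve the user's input against the list of known names before querying the data source. The helper below is a minimal sketch; `resolve_package_name` is a hypothetical function, not part of olivia_finder, and the `known` list stands in for the result of `obtain_package_names()`.

```python
def resolve_package_name(query, known_names):
    """Return the canonical spelling of `query` from `known_names`, or None."""
    # Map each lowercased name to its canonical spelling
    by_lower = {name.lower(): name for name in known_names}
    return by_lower.get(query.lower())


known = ["BiocGenerics", "S4Vectors", "DeepBlueR"]
print(resolve_package_name("deepbluer", known))  # DeepBlueR
print(resolve_package_name("missing", known))    # None
```

Note that this assumes no two packages differ only by case; if they did, the later name in the list would win.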


In [9]:
pkgs_data = data_source.obtain_packages_data(package_list[:3])
pkgs_data

[{'name': 'BiocGenerics',
  'version': '0.44.0',
  'dependencies': [<olivia_finder.package.Package at 0x7feccb517a90>,
   <olivia_finder.package.Package at 0x7feccb517760>,
   <olivia_finder.package.Package at 0x7feccb51f1c0>,
   <olivia_finder.package.Package at 0x7feccb517d90>,
   <olivia_finder.package.Package at 0x7feccb517c70>],
  'url': 'https://www.bioconductor.org/packages/release/bioc/html/BiocGenerics.html'},
 {'name': 'S4Vectors',
  'version': '0.36.2',
  'dependencies': [<olivia_finder.package.Package at 0x7feccb1e7100>,
   <olivia_finder.package.Package at 0x7feccb1e7160>,
   <olivia_finder.package.Package at 0x7feccc17df70>,
   <olivia_finder.package.Package at 0x7feccb1e71c0>,
   <olivia_finder.package.Package at 0x7feccb1e7220>,
   <olivia_finder.package.Package at 0x7feccb1e70d0>],
  'url': 'https://www.bioconductor.org/packages/release/bioc/html/S4Vectors.html'},
 {'name': 'BiocVersion',
  'version': '3.16.0',
  'dependencies': [<olivia_finder.package.Package at 0x7fe

In [18]:
for p in pkgs_data:
    print(f'Package: {p["name"]} {p["version"]}')

    for d in p["dependencies"]:
        print(f'-   Dependency: {d.name} {d.version}')


Package: BiocGenerics 0.44.0
-   Dependency: utils 
-   Dependency: stats 
-   Dependency: R >= 4.0.0
-   Dependency: graphics 
-   Dependency: methods 
Package: S4Vectors 0.36.2
-   Dependency: utils 
-   Dependency: stats 
-   Dependency: R >= 4.0.0
-   Dependency: stats4 
-   Dependency: BiocGenerics 
-   Dependency: methods 
Package: BiocVersion 3.16.0
-   Dependency: R >= 4.2.0
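The nested structure printed above can also be flattened into a list of edges, which is often easier to analyze. The sketch below uses a `namedtuple` as a stand-in for the dependency objects; it assumes only the `.name` and `.version` attributes used in the loop above, and `to_edges` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the dependency objects; only .name and .version are assumed
Dep = namedtuple("Dep", ["name", "version"])

sample_data = [
    {"name": "BiocGenerics", "version": "0.44.0",
     "dependencies": [Dep("utils", ""), Dep("R", ">= 4.0.0")]},
    {"name": "BiocVersion", "version": "3.16.0",
     "dependencies": [Dep("R", ">= 4.2.0")]},
]


def to_edges(packages):
    """Flatten package dicts into (package, dependency, constraint) tuples."""
    return [
        (p["name"], d.name, d.version)
        for p in packages
        for d in p["dependencies"]
    ]


edges = to_edges(sample_data)
for edge in edges:
    print(edge)
```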


#### Scrapers implementation

## Initialization of a package manager

In [19]:
from olivia_finder.package_manager import PackageManager

In [20]:
bioconductor = PackageManager(data_source)

In [21]:
p = bioconductor.obtain_package("DeepBlueR")
p.print()

Package:
  name: DeepBlueR
  version: 1.24.1
  url: https://www.bioconductor.org/packages/release/bioc/html/DeepBlueR.html
  dependencies:
    diffr:
    rtracklayer:
    R:>= 3.3
    R.utils:
    foreach:
    utils:
    settings:
    RCurl:
    stringr:
    filehash:
    data.table:
    GenomeInfoDb:
    withr:
    XML:
    methods:
    rjson:
    GenomicRanges:
    dplyr:


In [9]:
bioconductor_packages = bioconductor.obtain_packages(
    package_list[:2],
    extend_repo=True, 
    show_progress=True
)
bioconductor_packages

100%|██████████| 2/2 [00:05<00:00,  2.58s/it]


[<olivia_finder.package.Package at 0x7f0608303fd0>,
 <olivia_finder.package.Package at 0x7f05fbf18f50>]

In [10]:
for p in bioconductor_packages:
    p.print()

Package:
  name: BiocGenerics
  version: 0.44.0
  url: https://www.bioconductor.org/packages/release/bioc/html/BiocGenerics.html
  dependencies:
    utils:
    R:>= 4.0.0
    stats:
    methods:
    graphics:
Package:
  name: S4Vectors
  version: 0.36.2
  url: https://www.bioconductor.org/packages/release/bioc/html/S4Vectors.html
  dependencies:
    utils:
    R:>= 4.0.0
    stats:
    stats4:
    methods:
    BiocGenerics:


In [11]:
bioconductor_packages = bioconductor.obtain_packages(extend_repo=True, show_progress=True)

100%|██████████| 2183/2183 [04:00<00:00,  9.07it/s]


In [14]:
import pickle
# Save the list of obtained Bioconductor packages as a pickle file
with open("./results/package_managers/bioconductor_pm_scraping.pkl", "wb") as f:
    pickle.dump(bioconductor_packages, f)
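Reading the data back is the mirror image of saving it. The following is a self-contained sketch of a pickle round trip using stand-in data and a temporary directory; the file name `packages.pkl` is a placeholder.

```python
import os
import pickle
import tempfile

# Stand-in data; in the notebook this would be the list of package objects
packages = [{"name": "BiocGenerics", "version": "0.44.0"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "packages.pkl")  # placeholder file name

    # Write the object to disk
    with open(path, "wb") as f:
        pickle.dump(packages, f)

    # Read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == packages)
```

When unpickling instances of custom classes, the defining module must be importable in the loading session. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.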

In [16]:
# Export the package manager as an adjacency list
b_df = bioconductor.to_full_adj_list()

# Write the adjacency list to a CSV file
b_df.to_csv("./results/csv_datasets/bioconductor_adjlist_scraping.csv", index=False)
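Once the adjacency list is on disk, it can be used to answer questions such as "which packages are depended on the most?". The sketch below works on an in-memory sample and assumes one row per edge with columns `name` and `dependency`; the column names in the real CSV may differ.

```python
import csv
import io
from collections import defaultdict

# Assumed layout: one row per (package, dependency) edge.
# The column names here are an assumption, not taken from the real file.
sample = """name,dependency
BiocGenerics,utils
BiocGenerics,methods
S4Vectors,BiocGenerics
S4Vectors,methods
"""

# Build the reverse mapping: dependency -> set of packages that need it
dependents = defaultdict(set)
for row in csv.DictReader(io.StringIO(sample)):
    dependents[row["dependency"]].add(row["name"])

# Rank dependencies by number of dependents, breaking ties alphabetically
most_used = sorted(dependents, key=lambda d: (-len(dependents[d]), d))
for dep in most_used:
    print(dep, len(dependents[dep]))
```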
