We make sure the necessary packages are installed

In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Initialization of a data source

In this example we instantiate the desired implementation of the data source class

Several implementations are available: some obtain the data by scraping the web pages of the respective package managers, and others read it from CSV text files
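To make the idea of interchangeable data sources concrete, here is a minimal sketch. It is not olivia_finder's actual class hierarchy: the `DataSource` base class, the `CsvTextDataSource` class and the CSV column names (`name`, `dependency`, `dependency_version`) are assumptions made for illustration. Only the method names `obtain_package_names` and `obtain_package_data` are taken from the cells in this notebook.

```python
import csv
import io
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical minimal interface, not olivia_finder's actual base class."""

    @abstractmethod
    def obtain_package_names(self):
        """Return the names of all packages the source knows about."""

    @abstractmethod
    def obtain_package_data(self, name):
        """Return a dict with the package name and its dependencies."""


class CsvTextDataSource(DataSource):
    """Illustrative source backed by CSV text with assumed column names."""

    def __init__(self, csv_text):
        # Map each package name to its list of dependency dicts.
        self._dependencies = {}
        for row in csv.DictReader(io.StringIO(csv_text)):
            deps = self._dependencies.setdefault(row["name"], [])
            if row["dependency"]:
                deps.append(
                    {"name": row["dependency"], "version": row["dependency_version"]}
                )

    def obtain_package_names(self):
        return list(self._dependencies)

    def obtain_package_data(self, name):
        if name not in self._dependencies:
            raise LookupError(f"Package {name} not found")
        return {"name": name, "dependencies": list(self._dependencies[name])}


# Illustrative rows only; the real CSV layout may differ.
csv_text = (
    "name,dependency,dependency_version\n"
    "S4Vectors,R,>= 4.0.0\n"
    "S4Vectors,BiocGenerics,\n"
    "BiocVersion,R,>= 4.2.0\n"
)
source = CsvTextDataSource(csv_text)
print(source.obtain_package_names())
print(source.obtain_package_data("BiocVersion"))
```

Because both kinds of source would expose the same two methods, code that consumes them does not need to know whether the data came from a web page or a file.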

### Obtaining data via web scraping

#### How to proceed

First we import the data source implementation we want. For this example we use the Bioconductor scraper

In [2]:
from olivia_finder.scraping.bioconductor import BiocScraper

In [3]:
data_source = BiocScraper(use_logger=True)

2023-03-09 20:01:38 [    INFO] Added SSLProxies to proxy builders (logger.py:84)
2023-03-09 20:01:38 [    INFO] Added FreeProxyList to proxy builders (logger.py:84)
2023-03-09 20:01:38 [    INFO] Added GeonodeProxy to proxy builders (logger.py:84)
2023-03-09 20:01:38 [   DEBUG] Starting new HTTPS connection (1): www.sslproxies.org:443 (connectionpool.py:1003)
2023-03-09 20:01:39 [   DEBUG] https://www.sslproxies.org:443 "GET / HTTP/1.1" 200 None (connectionpool.py:456)
2023-03-09 20:01:39 [    INFO] Found 100 proxies from SSLProxies (logger.py:84)
2023-03-09 20:01:39 [   DEBUG] Starting new HTTPS connection (1): free-proxy-list.net:443 (connectionpool.py:1003)
2023-03-09 20:01:39 [   DEBUG] https://free-proxy-list.net:443 "GET /anonymous-proxy.html HTTP/1.1" 200 None (connectionpool.py:456)
2023-03-09 20:01:39 [    INFO] Found 100 proxies from FreeProxyList (logger.py:84)
2023-03-09 20:01:39 [   DEBUG] Starting new HTTPS connection (1): proxylist.geonode.com:443 (connectionpool.py:1003

Show relevant information about the data source

In [4]:
print(data_source.get_info())

Name: Bioconductor
Description: Scraper class implementation for the Bioconductor package network


Get a list with the names of the packages available from this source

In [5]:
data_source.disable_logger()
package_list = data_source.obtain_package_names()
package_list[:10]

['BiocGenerics',
 'S4Vectors',
 'BiocVersion',
 'GenomeInfoDb',
 'IRanges',
 'Biobase',
 'zlibbioc',
 'XVector',
 'Biostrings',
 'BiocParallel']

We can obtain the data of a specific package, for example the **`DeepBlueR`** package

In [6]:
data_source.enable_logger()
deepbluer = data_source.obtain_package_data("DeepBlueR")
deepbluer

2023-03-09 20:01:47 [    INFO] Scraping package DeepBlueR (logger.py:84)
2023-03-09 20:01:47 [    INFO] Getting next proxy (logger.py:84)
2023-03-09 20:01:47 [    INFO] Proxy list rotated, new: 158.140.160.86:10808 (logger.py:84)
2023-03-09 20:01:47 [    INFO] Using proxy: {'http': 'http://104.223.135.178:10000'} (logger.py:84)
2023-03-09 20:01:47 [    INFO] Getting next useragent (logger.py:84)
2023-03-09 20:01:47 [    INFO] Using user agent: Mozilla/5.0 (Windows NT 6.1; rv:36.0) Gecko/20100101 Firefox/36.0 (logger.py:84)
2023-03-09 20:01:47 [   DEBUG] Starting new HTTPS connection (1): www.bioconductor.org:443 (connectionpool.py:1003)
2023-03-09 20:01:47 [   DEBUG] https://www.bioconductor.org:443 "GET /packages/release/bioc/html/DeepBlueR.html HTTP/1.1" 200 5671 (connectionpool.py:456)
2023-03-09 20:01:47 [    INFO] Response status code: 200 (logger.py:84)


{'name': 'DeepBlueR',
 'version': '1.24.1',
 'url': 'https://www.bioconductor.org/packages/release/bioc/html/DeepBlueR.html',
 'dependencies': [{'name': 'R', 'version': '>= 3.3'},
  {'name': 'XML', 'version': ''},
  {'name': 'RCurl', 'version': ''},
  {'name': 'GenomicRanges', 'version': ''},
  {'name': 'data.table', 'version': ''},
  {'name': 'stringr', 'version': ''},
  {'name': 'diffr', 'version': ''},
  {'name': 'dplyr', 'version': ''},
  {'name': 'methods', 'version': ''},
  {'name': 'rjson', 'version': ''},
  {'name': 'utils', 'version': ''},
  {'name': 'R.utils', 'version': ''},
  {'name': 'foreach', 'version': ''},
  {'name': 'withr', 'version': ''},
  {'name': 'rtracklayer', 'version': ''},
  {'name': 'GenomeInfoDb', 'version': ''},
  {'name': 'settings', 'version': ''},
  {'name': 'filehash', 'version': ''}]}
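The returned value is a plain dictionary, so it can be processed with ordinary Python. The sketch below uses a shortened copy of the output above. The helper names `dependency_names` and `version_constraints` are hypothetical, and skipping the `R` entry is a choice made here because it refers to the R interpreter rather than to a package.

```python
# Shortened copy of the dictionary shown above (first four dependencies only).
deepbluer = {
    "name": "DeepBlueR",
    "version": "1.24.1",
    "dependencies": [
        {"name": "R", "version": ">= 3.3"},
        {"name": "XML", "version": ""},
        {"name": "RCurl", "version": ""},
        {"name": "GenomicRanges", "version": ""},
    ],
}


def dependency_names(package, skip=("R",)):
    """Return dependency names, leaving out the entries listed in `skip`."""
    return [d["name"] for d in package["dependencies"] if d["name"] not in skip]


def version_constraints(package):
    """Return only the dependencies that declare a version constraint."""
    return {d["name"]: d["version"] for d in package["dependencies"] if d["version"]}


print(dependency_names(deepbluer))
print(version_constraints(deepbluer))
```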

Be careful: package names are **case-sensitive**. If the package is not found, an exception is raised; the next cell catches it as `ScraperError`

In [7]:
from olivia_finder.scraping.scraper import ScraperError

try:
    deepbluer2 = data_source.obtain_package_data("deepbluer")
except ScraperError as e:
    print(e)

2023-03-09 20:01:47 [    INFO] Scraping package deepbluer (logger.py:84)
2023-03-09 20:01:47 [    INFO] Getting next proxy (logger.py:84)
2023-03-09 20:01:47 [    INFO] Proxy list rotated, new: 86.110.189.118:42539 (logger.py:84)
2023-03-09 20:01:47 [    INFO] Using proxy: {'http': 'http://158.140.160.86:10808'} (logger.py:84)
2023-03-09 20:01:47 [    INFO] Getting next useragent (logger.py:84)
2023-03-09 20:01:47 [    INFO] Using user agent: Mozilla/5.0 (Windows NT 6.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 AOL/9.7 AOLBuild/4343.4049.US Safari/537.36 (logger.py:84)
2023-03-09 20:01:47 [   DEBUG] Starting new HTTPS connection (1): www.bioconductor.org:443 (connectionpool.py:1003)
2023-03-09 20:01:48 [   DEBUG] https://www.bioconductor.org:443 "GET /packages/release/bioc/html/deepbluer.html HTTP/1.1" 404 6873 (connectionpool.py:456)
2023-03-09 20:01:48 [    INFO] Response status code: 404 (logger.py:84)
2023-03-09 20:01:48 [    INFO] Package deepbluer not fou

Package deepbluer not found
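One way to avoid this error is to resolve the user's spelling against the list of known names before querying. This is a sketch and not part of the library: `resolve_name` is a hypothetical helper, and `known_names` stands in for the list returned by `obtain_package_names()`. If two names differed only by case, the later one would win.

```python
def resolve_name(query, known_names):
    """Return the canonical spelling of `query`, or None if it is unknown."""
    lookup = {name.lower(): name for name in known_names}
    return lookup.get(query.lower())


# Stand-in for the first entries of package_list shown earlier.
known_names = ["BiocGenerics", "S4Vectors", "BiocVersion", "GenomeInfoDb", "IRanges"]

print(resolve_name("biocgenerics", known_names))
print(resolve_name("not-a-package", known_names))
```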


In [8]:
data_source.disable_logger()
pkgs_data = data_source.obtain_packages_data(package_list[:3])
pkgs_data

[{'name': 'BiocGenerics',
  'version': '0.44.0',
  'dependencies': [{'name': 'R', 'version': '>= 4.0.0'},
   {'name': 'methods', 'version': ''},
   {'name': 'utils', 'version': ''},
   {'name': 'graphics', 'version': ''},
   {'name': 'stats', 'version': ''},
   {'name': 'methods', 'version': ''},
   {'name': 'utils', 'version': ''},
   {'name': 'graphics', 'version': ''},
   {'name': 'stats', 'version': ''}],
  'url': 'https://www.bioconductor.org/packages/release/bioc/html/BiocGenerics.html'},
 {'name': 'S4Vectors',
  'version': '0.36.2',
  'dependencies': [{'name': 'R', 'version': '>= 4.0.0'},
   {'name': 'methods', 'version': ''},
   {'name': 'utils', 'version': ''},
   {'name': 'stats', 'version': ''},
   {'name': 'stats4', 'version': ''},
   {'name': 'BiocGenerics', 'version': ''}],
  'url': 'https://www.bioconductor.org/packages/release/bioc/html/S4Vectors.html'},
 {'name': 'BiocVersion',
  'version': '3.16.0',
  'dependencies': [{'name': 'R', 'version': '>= 4.2.0'}],
  'url': 'h

In [9]:
for p in pkgs_data:
    print(f'Package: {p["name"]} {p["version"]}')

    for d in p["dependencies"]:
        print(f'-   Dependency: {d["name"]} {d["version"]}')


Package: BiocGenerics 0.44.0
-   Dependency: R >= 4.0.0
-   Dependency: methods 
-   Dependency: utils 
-   Dependency: graphics 
-   Dependency: stats 
-   Dependency: methods 
-   Dependency: utils 
-   Dependency: graphics 
-   Dependency: stats 
Package: S4Vectors 0.36.2
-   Dependency: R >= 4.0.0
-   Dependency: methods 
-   Dependency: utils 
-   Dependency: stats 
-   Dependency: stats4 
-   Dependency: BiocGenerics 
Package: BiocVersion 3.16.0
-   Dependency: R >= 4.2.0
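In the output above, `BiocGenerics` lists `methods`, `utils`, `graphics` and `stats` twice. If the repeats are unwanted, they can be removed while keeping the original order. The helper `unique_dependencies` below is a hypothetical sketch, not a function of the library.

```python
def unique_dependencies(dependencies):
    """Drop repeated (name, version) pairs while preserving the first occurrence."""
    seen = set()
    unique = []
    for dep in dependencies:
        key = (dep["name"], dep["version"])
        if key not in seen:
            seen.add(key)
            unique.append(dep)
    return unique


# Dependencies of BiocGenerics as printed above.
biocgenerics_deps = [
    {"name": "R", "version": ">= 4.0.0"},
    {"name": "methods", "version": ""},
    {"name": "utils", "version": ""},
    {"name": "graphics", "version": ""},
    {"name": "stats", "version": ""},
    {"name": "methods", "version": ""},
    {"name": "utils", "version": ""},
    {"name": "graphics", "version": ""},
    {"name": "stats", "version": ""},
]

cleaned = unique_dependencies(biocgenerics_deps)
print([d["name"] for d in cleaned])
```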


#### Scrapers implementation

## Initialization of a package manager

In [10]:
from olivia_finder.package_manager import PackageManager

In [11]:
bioconductor = PackageManager(data_source)

In [12]:
p = bioconductor.obtain_package("DeepBlueR")
p.print()

Package:
  name: DeepBlueR
  version: 1.24.1
  url: https://www.bioconductor.org/packages/release/bioc/html/DeepBlueR.html
  dependencies:
    R:>= 3.3
    XML:
    RCurl:
    GenomicRanges:
    data.table:
    stringr:
    diffr:
    dplyr:
    methods:
    rjson:
    utils:
    R.utils:
    foreach:
    withr:
    rtracklayer:
    GenomeInfoDb:
    settings:
    filehash:


In [13]:
bioconductor_packages = bioconductor.obtain_packages(
    package_list[:2],
    extend_repo=True, 
    show_progress=True
)
bioconductor_packages

100%|██████████| 2/2 [00:00<00:00, 10.17it/s]


[<olivia_finder.package.Package at 0x7ff75d4db890>,
 <olivia_finder.package.Package at 0x7ff75d4ee850>]

In [14]:
for p in bioconductor_packages:
    p.print()

Package:
  name: BiocGenerics
  version: 0.44.0
  url: https://www.bioconductor.org/packages/release/bioc/html/BiocGenerics.html
  dependencies:
    R:>= 4.0.0
    methods:
    utils:
    graphics:
    stats:
    methods:
    utils:
    graphics:
    stats:
Package:
  name: S4Vectors
  version: 0.36.2
  url: https://www.bioconductor.org/packages/release/bioc/html/S4Vectors.html
  dependencies:
    R:>= 4.0.0
    methods:
    utils:
    stats:
    stats4:
    BiocGenerics:


In [15]:
bioconductor_packages = bioconductor.obtain_packages(extend_repo=True, show_progress=True)

100%|██████████| 2183/2183 [03:47<00:00,  9.59it/s]


In [16]:
import pickle
# Save the list of scraped Bioconductor packages as a pickle file
with open("./results/package_managers/bioconductor_pm_scraping.pkl", "wb") as f:
    pickle.dump(bioconductor_packages, f)
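A saved pickle can be loaded back in a later session, which avoids repeating the scraping. The sketch below shows the round trip with stand-in data and a temporary file instead of the real path. Two general facts about `pickle` apply: unpickling real `Package` objects requires `olivia_finder` to be importable, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in data; the notebook pickles a list of Package objects instead.
packages = [
    {"name": "BiocGenerics", "version": "0.44.0"},
    {"name": "S4Vectors", "version": "0.36.2"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "packages.pkl"

    # Write the list to disk.
    with open(path, "wb") as f:
        pickle.dump(packages, f)

    # Read it back, as a later session would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == packages)
```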

In [17]:
# Build the full adjacency list of the package manager
b_df = bioconductor.to_full_adj_list()

# Save the adjacency list as a CSV file
b_df.to_csv("./results/csv_datasets/bioconductor_adjlist_scraping.csv", index=False)
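An adjacency list stored as CSV is easy to analyse later, for example to count how many packages depend on each dependency. The column names `name` and `dependency` below are assumptions, since the notebook does not show the layout produced by `to_full_adj_list()`. The rows are a small sample taken from the outputs earlier in this notebook.

```python
import csv
import io
from collections import Counter

# Assumed layout: one row per (package, dependency) edge.
csv_text = (
    "name,dependency\n"
    "BiocGenerics,methods\n"
    "BiocGenerics,utils\n"
    "S4Vectors,methods\n"
    "S4Vectors,utils\n"
    "S4Vectors,BiocGenerics\n"
)

# Count how many packages depend on each dependency (in-degree).
dependents = Counter(row["dependency"] for row in csv.DictReader(io.StringIO(csv_text)))

for dependency, count in dependents.most_common():
    print(f"{dependency}: {count}")
```

With the real file, `io.StringIO(csv_text)` would be replaced by an open file handle.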


In [19]:
# Get the package with the most dependencies
max_deps = max(bioconductor_packages, key=lambda p: len(p.dependencies))
max_deps.print()
print(f'Number of dependencies: {len(max_deps.dependencies)}')

Package:
  name: singleCellTK
  version: 2.8.0
  url: https://www.bioconductor.org/packages/release/bioc/html/singleCellTK.html
  dependencies:
    R:>= 4.0
    SummarizedExperiment:
    SingleCellExperiment:
    DelayedArray:
    Biobase:
    ape:
    AnnotationHub:
    batchelor:
    BiocParallel:
    celldex:
    colourpicker:
    colorspace:
    cowplot:
    cluster:
    ComplexHeatmap:
    data.table:
    DelayedMatrixStats:
    DESeq2:
    dplyr:
    DT:
    ExperimentHub:
    ensembldb:
    fields:
    ggplot2:
    ggplotify:
    ggrepel:
    ggtree:
    gridExtra:
    GSVA:
    GSVAdata:
    igraph:
    KernSmooth:
    limma:
    MAST:
    Matrix:
    matrixStats:
    methods:
    msigdbr:
    multtest:
    plotly:
    plyr:
    ROCR:
    Rtsne:
    S4Vectors:
    scater:
    scMerge:
    scran:
    Seurat:>= 3.1.3
    shiny:
    shinyjs:
    SingleR:
    SoupX:
    sva:
    reshape2:
    shinyalert:
    circlize:
    enrichR:
    celda:
    shinycssloaders:
    DropletUtils:
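The same idea extends from the single largest package to the top few. The sketch below uses `heapq.nlargest` on a stand-in mapping from package name to dependency names, taken from the outputs earlier in this notebook with duplicates removed. With the real `Package` objects, the key would be `len(p.dependencies)` as in the cell above.

```python
import heapq

# Stand-in for the scraped packages: name -> list of dependency names.
packages = {
    "BiocGenerics": ["R", "methods", "utils", "graphics", "stats"],
    "S4Vectors": ["R", "methods", "utils", "stats", "stats4", "BiocGenerics"],
    "BiocVersion": ["R"],
}

# Select the two packages with the most dependencies.
top_two = heapq.nlargest(2, packages.items(), key=lambda item: len(item[1]))

for name, deps in top_two:
    print(f"{name}: {len(deps)} dependencies")
```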
 