
alishibli97/Web-Scraping

Scraping Pipeline

  1. Create the .secrets folder and set up the MongoDB, RabbitMQ, and Chrome credentials:

    cp -r .secrets_example .secrets
  2. Start all containers:

    docker-compose up -d
    docker-compose ps
  3. Install the package:

    pip install --editable '.'
  4. Launch a process to generate predicates:

    python -m webly.predicates \
        --output amqp \
        --amqp-url amqp://user@localhost \
        --amqp-pass-file .secrets/rabbitmq_default_pass_file \
        data/vrd/predicates.txt
  5. Launch one or more expander processes:

    python -m webly.expander \
        --ngrams 4 5 \
        --ngrams-dir data/ngrams/processed \
        --ngrams-max 2 \
        --languages fr it \
        --input amqp \
        --output amqp \
        --amqp-url amqp://user@localhost \
        --amqp-pass-file .secrets/rabbitmq_default_pass_file
  6. Launch one or more scraper processes:

    python -m webly.scraper \
        --engines google yahoo flickr \
        --chrome-url http://localhost:3000/webdriver \
        --chrome-token-file .secrets/chrome_token \
        --input amqp \
        --output mongo \
        --amqp-url amqp://user@localhost \
        --amqp-pass-file .secrets/rabbitmq_default_pass_file \
        --mongo-url mongodb://user@localhost \
        --mongo-pass-file .secrets/mongo_initdb_root_password
  7. Kill the expander and scraper processes, then stop the containers:

    docker-compose stop
  8. Download the images:

    python -m webly.downloader \
        --mongo-url mongodb://user@localhost \
        --mongo-pass-file .secrets/mongo_initdb_root_password \
        --output-dir images
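
The commands above pass a URL without a password (amqp://user@localhost, mongodb://user@localhost) together with a separate password file. As a minimal sketch of how such a pair could be combined into a full connection URL, assuming the password file holds the password as plain text with an optional trailing newline. The helper names `read_secret` and `with_password` are hypothetical, and this is not necessarily how webly handles it internally:

```python
from pathlib import Path
from urllib.parse import quote, urlsplit, urlunsplit


def read_secret(path):
    """Return the contents of a secret file without surrounding whitespace."""
    # Assumption: the file contains only the secret, possibly followed by a newline.
    return Path(path).read_text(encoding="utf-8").strip()


def with_password(url, password):
    """Insert a password into a URL of the form scheme://user@host[:port][/path]."""
    parts = urlsplit(url)
    userinfo, at, hostinfo = parts.netloc.rpartition("@")
    if not at or not userinfo:
        raise ValueError(f"URL has no user name: {url!r}")
    # Drop any password already present and keep only the user name.
    user = userinfo.split(":", 1)[0]
    # Percent-encode the password so characters like '@' or '/' stay unambiguous.
    netloc = f"{user}:{quote(password, safe='')}@{hostinfo}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `with_password("amqp://user@localhost", read_secret(".secrets/rabbitmq_default_pass_file"))` would yield a URL with the password embedded. Keeping the password in a file rather than on the command line avoids exposing it in shell history and process listings.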

Monitoring:

Test queries manually (input from stdin):

python -m webly.scraper \
    --engines google yahoo flickr \
    --chrome-url http://localhost:3000/webdriver \
    --chrome-token-file .secrets/chrome_token \
    --input text \
    --output text
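
To script this manual test instead of typing queries interactively, something like the following could work. It is a sketch that assumes the scraper reads one query per line from stdin and prints its results to stdout, which the command above suggests but does not spell out. `build_scraper_cmd` and `run_queries` are hypothetical helper names, not part of webly:

```python
import subprocess
import sys


def build_scraper_cmd(engines,
                      chrome_url="http://localhost:3000/webdriver",
                      chrome_token_file=".secrets/chrome_token"):
    """Build the argument list for the manual-test invocation shown above."""
    return [
        sys.executable, "-m", "webly.scraper",
        "--engines", *engines,
        "--chrome-url", chrome_url,
        "--chrome-token-file", chrome_token_file,
        "--input", "text",
        "--output", "text",
    ]


def run_queries(queries, engines=("google", "yahoo", "flickr")):
    """Feed queries to the scraper on stdin and return whatever it prints."""
    # Assumption: one query per line; the exact input format is not documented here.
    result = subprocess.run(
        build_scraper_cmd(engines),
        input="\n".join(queries) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the scraper exits with an error
    )
    return result.stdout
```

Calling `run_queries(["person riding horse"])` (an illustrative query) requires the package to be installed and the Chrome container from step 2 to be running.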
