scripts

These scripts run the various processes that make browserify search work. The diagram below shows the system's data flow at a high level.

[Data flow chart]

Starting with the first column from the left:

  • npm - in general we use the npm registry (registry.npmjs.org) for fetching metadata about modules (a fetch sketch follows this list)
  • npm download counts - an API provided by npm, Inc. that reports per-module download counts
  • manual test results - data collected by manually testing 399 randomly selected modules on npm; the data file is test-summary.json
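
To make the first column concrete, here is a minimal sketch, not this repo's actual code, of fetching a module's registry document ("packument") from registry.npmjs.org with plain Node:

    // Minimal sketch, not this repo's actual code: fetch a module's
    // registry document from registry.npmjs.org.
    var https = require('https');

    function fetchMetadata(name, cb) {
      https.get('https://registry.npmjs.org/' + encodeURIComponent(name), function (res) {
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () { cb(null, JSON.parse(body)); });
      }).on('error', cb);
    }

    fetchMetadata('browserify', function (err, doc) {
      if (err) throw err;
      console.log(doc['dist-tags'].latest); // latest published version
    });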

Second column:

  • process-module - processes each module using the data sources above, producing the per-module records stored downstream (see Processing Modules below)

Third column:

  • modules - a mongodb collection containing one document per module, storing the data produced by process-module
  • moduleStats - a mongodb collection storing the mean and variance of download counts across the entire npm registry, calculated via a map-reduce command in aggregate_stats.js (the idea is sketched below)
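
aggregate_stats.js itself isn't reproduced here, but the map-reduce idea is standard: accumulate the count, sum, and sum of squares of download counts, then derive the mean and variance. A hedged mongo-shell sketch (the downloads field name is an assumption):

    // Sketch of the mean/variance map-reduce idea (field names assumed,
    // not the actual contents of aggregate_stats.js).
    db.modules.mapReduce(
      function () { emit('downloads', { n: 1, sum: this.downloads, sumSq: this.downloads * this.downloads }); },
      function (key, values) {
        var out = { n: 0, sum: 0, sumSq: 0 };
        values.forEach(function (v) { out.n += v.n; out.sum += v.sum; out.sumSq += v.sumSq; });
        return out;
      },
      {
        finalize: function (key, v) {
          var mean = v.sum / v.n;
          return { mean: mean, variance: v.sumSq / v.n - mean * mean };
        },
        out: 'moduleStats'
      }
    );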

Fourth column:

  • import ES - a script that takes modules and moduleStats from mongodb as input and exports the data into Elastic Search
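
This README doesn't spell out how import ES uses moduleStats; one plausible use (an assumption, not documented here) is normalizing each module's download count against the registry-wide statistics before indexing:

    // Assumption for illustration only: normalize a download count
    // against the registry-wide mean and variance from moduleStats.
    function zScore(downloads, stats) {
      return (downloads - stats.mean) / Math.sqrt(stats.variance);
    }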

Fifth column:

  • the Elastic Search instance provides the full-text search capability behind the search engine

Sixth column:

  • www - implements the module search engine website

Develop On This

Initial setup if you want to develop on this.

    git clone git@github.com:browserify-search/scripts.git
    cd scripts
    npm install
    cp config.sample.json config.json
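
The contents of config.sample.json aren't shown here, but the scripts need to know at least where mongodb and Elastic Search live, so a purely hypothetical config.json might look like:

    // Hypothetical illustration only - the real keys are in config.sample.json.
    {
      "mongodb": "mongodb://localhost:27017/browserify-search",
      "elasticsearch": "http://localhost:9200"
    }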

Processing Modules

  • ./follower.js - this script runs continuously in the background on the web server, tracking changes in the main npm registry using follow-registry. As soon as a module is published, it runs process-module to process it.
  • ./dispatcher.js and ./worker.js - this pair of scripts enables distributed parallel processing of npm modules. dispatcher.js is started on the web server (alongside the mongodb instance), while worker.js can be started on any number of worker machines. They communicate via ZeroMQ, and all results are saved to the mongodb instance (a minimal sketch of the pattern follows this list).
  • ./process_module.js - process a single module.
  • ./update_browserifiability.js - recalculate browserifiability scores for every module in the db.
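
The dispatcher/worker wire protocol isn't documented here, but the underlying ZeroMQ pattern is push/pull. A minimal sketch with the zeromq npm package (v6 API; the repo itself may use an older binding and a different message format):

    // Minimal push/pull sketch with the "zeromq" package (v6 API).
    // The actual dispatcher/worker protocol in this repo may differ.
    const zmq = require('zeromq');

    async function dispatcher(moduleNames) {
      const push = new zmq.Push();
      await push.bind('tcp://*:5555');      // workers connect to this port
      for (const name of moduleNames) {
        await push.send(name);              // each module goes to one worker
      }
    }

    async function worker() {
      const pull = new zmq.Pull();
      pull.connect('tcp://dispatcher-host:5555'); // hypothetical hostname
      for await (const [msg] of pull) {
        console.log('processing', msg.toString()); // a real worker would run process_module here
      }
    }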

Elastic Search

Configuration

First things first, you need to install Elastic Search. Then you need to make sure you have the following settings in elasticsearch.yml (usually in the config directory within the location where Elastic Search is installed; note that script.disable_dynamic only exists in the 1.x-era releases this setup targets):

    http.max_content_length: 1000mb
    script.disable_dynamic: false

Updating Elastic Search

  • update_mapping - this drops the data collection on Elastic Search (starting over) and updates the schema. You can tweak the schema prior to running it if you want the search to weight certain fields more heavily than others.

  • bulk_insert_elasticsearch_from_db.js - this script reads from the modules and moduleStats collections in mongodb and emits line-separated json suitable for bulk inserting into Elastic Search (the bulk format is sketched after this list). To actually do the insert, you can use the command

      ./bulk_insert_elasticsearch_from_db.js | curl -s -XPOST localhost:9200/browserify-search/module/_bulk --data-binary @-
    
  • bulk_insert_elasticsearch_from_files.js - instead of reading from mongodb, you can read from a mongodb data dump with this script. Get the data dump files, then run

      ./bulk_insert_elasticsearch_from_files.js modules.json moduleStats.json | curl -s -XPOST localhost:9200/browserify-search/module/_bulk --data-binary @-
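
Both scripts emit Elastic Search's bulk format: alternating action and document lines of newline-separated JSON (the body must end with a trailing newline). Since the curl URL above already names the index and type, each pair presumably looks something like this (the document fields shown are assumptions):

    { "index" : { "_id" : "some-module" } }
    { "name" : "some-module", "description" : "document fields here are assumed" }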
    

Update Download Counts

  • update_download_counts.js all - updates the download counts for every module in mongodb.
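
The counts come from npm's download-counts API. As a minimal sketch of the kind of request involved (the script's actual requests and batching may differ):

    // Sketch of querying npm's download-counts API for one module;
    // update_download_counts.js's actual requests/batching may differ.
    var https = require('https');

    https.get('https://api.npmjs.org/downloads/point/last-month/browserify', function (res) {
      var body = '';
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        var stats = JSON.parse(body);
        console.log(stats.package, stats.downloads);
      });
    });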
