Capture system for non-scraping data sources
Package to download and process datasets from non-scraping sources such as Microsoft Academic Graph (MAG) and SciELO. For MAG, this package downloads the data from Azure and, in parallel, dumps it into MongoDB collections. It can also create indexes in an ElasticSearch database to perform searches by document title.
https://docs.microsoft.com/en-us/academic-services/graph/get-started-setup-provisioning
- Install MongoDB:
  - Debian-based systems:
    apt install mongodb
  - RedHat-based systems: here
- Install ElasticSearch: here
- The other dependencies are installed automatically when you pip-install this package:
pip install inti
Very important for MAG; otherwise it won't work: the maximum number of open files has to be increased. See https://docs.mongodb.com/manual/reference/ulimit/.
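One way to raise the limit permanently is through /etc/security/limits.conf. A minimal sketch, assuming mongod runs as the mongodb user; the 64000 value follows the figure suggested in MongoDB's ulimit documentation, so adjust it to your workload:

```
# /etc/security/limits.conf — raise max open files for the mongodb user
mongodb soft nofile 64000
mongodb hard nofile 64000
```

Log out and back in (or restart the service) for the new limits to take effect.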
Download MAG and dump it into MongoDB collections:

inti_maloader --ma_dir=/storage/colav/mag_sample/ --db=MA --all
Create the ElasticSearch index from the Papers collection, using the PaperTitle field:

inti_maesloader --mag_dir=/home/colav/mag/ --col_name=Papers --field_name=PaperTitle --index_name=mag
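Once the index exists, titles can be searched with a standard ElasticSearch match query. A minimal sketch of the request body, assuming the `mag` index and `PaperTitle` field created by the command above (the helper name is illustrative, not part of this package):

```python
import json


def build_title_query(title, size=10):
    """Build an ElasticSearch request body that matches documents by title."""
    return {
        "size": size,
        "query": {
            "match": {
                "PaperTitle": {          # field indexed by inti_maesloader
                    "query": title,
                    "operator": "and",   # require every term to match
                }
            }
        },
    }


body = build_title_query("deep learning for citation networks")
# POST this body to http://localhost:9200/mag/_search
print(json.dumps(body, indent=2))
```

The `operator: and` setting narrows results to documents containing every term of the title, which is usually what you want when matching on known paper titles.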
There is a special requirement for the MongoDB server to allow running multithreaded sessions.
To avoid the error
"ok" : 0, "errmsg" : "cannot add session into the cache", "code" : 261, "codeName" : "TooManyLogicalSessions",
you need to start the server with the option
mongod --setParameter maxSessions=10000000 --config /etc/mongodb.conf
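Alternatively, the same parameter can be set persistently in the YAML config file referenced above, so it survives restarts. A sketch; the value mirrors the command line:

```
# /etc/mongodb.conf
setParameter:
  maxSessions: 10000000
```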
In the file /etc/elasticsearch/elasticsearch.yml, add:
thread_pool.get.queue_size: 10000
thread_pool.write.queue_size: 10000
If disk space runs low, this error can appear:
('1 document(s) failed to index.', [{'index': {'_index': 'mag', '_type': '_doc', '_id': '9915517', 'status': 429, 'error': {'type': 'cluster_block_exception', 'reason': 'index [mag] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];'}, 'data': {'PaperTitle': '...'}}}])
Solve it with these commands:
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
Increase the index build memory to 6 GB of RAM to improve performance (use this with caution):
db.adminCommand({getParameter: 1, maxIndexBuildMemoryUsageMegabytes: 1})
db.adminCommand({setParameter: 1, maxIndexBuildMemoryUsageMegabytes: 6144})
Be aware that when running this package, MongoDB produces a huge amount of information in its logs; please clean the file /var/log/mongodb.log (it can grow to more than 65 GB).
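One way to keep the log under control is a logrotate rule. A sketch, assuming the log path above; `copytruncate` lets the log be rotated without restarting mongod:

```
# /etc/logrotate.d/mongodb
/var/log/mongodb.log {
    weekly
    rotate 4
    compress
    copytruncate
    missingok
}
```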
This is required to perform massive insertions in parallel!
BSD-3-Clause License