Golos.io recommendation system

This repo contains the files of the recommendation system for golos.io.

.
+-- install.sh - Bash script to fill crontab tasks for a model rebuilding
+-- uninstall.sh - Bash script to clean crontab and to stop all daemons
+-- setup.py - Package configuration
+-- golosio_recommendation_model
   +-- config.py - Overall model configuration
   +-- daemonize.py - Function for making daemon of a specified function
   +-- server
      +-- server.py - Flask server for recommendation system
      +-- config.py - Server configuration
   +-- sync
      +-- convert_events.py - Convert events in MongoDB for training FFM model
      +-- sync_comments.py - Synchronizing MongoDB with Golos node
      +-- sync_events.py - Synchronizing Golosio MySQL events with MongoDB
      +-- sync_accounts.py - Synchronizing Golosio MySQL accounts with MongoDB
   +-- model
      +-- utils.py - Helpers for preprocessing, process management, etc.
      +-- train
         +-- ann.py - Process of training model to find similar posts
         +-- doc2vec.py - Process of training model to find doc2vec vectors for each post
         +-- ffm.py - Process of training FFM model to arrange recommendations for each user
      +-- predict
         +-- ann.py - Process of finding similar posts for new posts in database
         +-- doc2vec.py - Process of finding doc2vec vectors for each new post in database
         +-- ffm.py - Process of creating recommendations list for each active user
+-- bin - These scripts will appear in /usr/local/bin directory
   +-- doc2vec_train - Daemon that trains doc2vec model
   +-- doc2vec_predict - Daemon that makes doc2vec predictions for all posts in database
   +-- ann_train - Daemon that trains ANN model
   +-- ann_predict - Daemon that makes ANN predictions for all posts in database
   +-- ffm_train - Daemon that trains FFM model
   +-- ffm_predict - Daemon that makes FFM predictions and stores them to a database
   +-- recommendations_server - Daemon for a recommendation model server
   +-- sync_comments - Daemon that loads new comments from a golos node to a database
   +-- sync_events - Daemon that loads events from a specified mysql DB to a database
   +-- sync_accounts - Daemon that loads accounts from a specified csv file to a database

Architecture

Recommendation model architecture: see architecture.png in the repository root.

Installation

Install LibFFM before use. Instructions can be found here: http://github.com/alexeygrigorev/libffm-python

Prepare the Mongo database before installation. You can download the current Mongo dumps with:

$ scp earth@earth.cyber.fund:~/Documents/golosio-recommendation-model/golosio-recommendation-dump-comment.json ./
$ scp earth@earth.cyber.fund:~/Documents/golosio-recommendation-model/golosio-recommendation-dump-event.json ./

Prepare the config file before installation. It should look like this:

# golosio_recommendation_model/config.py
config = {
  'database_url': "localhost:27017", # Your mongo database url
  'database_name': "golos_comments", # Mongo database with dumps content
  'accounts_path': "/home/anatoli/Documents/golosio_recommendation_model/accounts.csv", # Path to csv file with accounts, only for debug
  'node_url': 'ws://localhost:8090', # Golos.io websocket url
  'model_path': "/tmp/", # Path to model files
  'log_path': "/tmp/recommendation_model.log", # Path to model log
  'events_database': { # Credentials for mysql database with events
    'host': 'localhost',
    'database': 'golos',
    'user': 'root',
    'password': 'root'
  }
}
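
Before installing, it can help to confirm that the config dictionary has every key shown above. The following is a minimal sketch, not part of the package: `missing_config_keys` and the two key tuples are hypothetical names, and the choice to treat `accounts_path` as optional is an assumption based on its "only for debug" comment.

```python
# Hypothetical helper, not shipped with golosio_recommendation_model.
REQUIRED_KEYS = (
    "database_url", "database_name", "node_url",
    "model_path", "log_path", "events_database",
)
REQUIRED_EVENTS_DB_KEYS = ("host", "database", "user", "password")


def missing_config_keys(config):
    """Return a sorted list of required keys that are absent from the config dict."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    events_db = config.get("events_database", {})
    missing += [
        f"events_database.{key}"
        for key in REQUIRED_EVENTS_DB_KEYS
        if key not in events_db
    ]
    return sorted(missing)


# Same shape as the example config above; accounts_path is omitted on purpose.
example_config = {
    "database_url": "localhost:27017",
    "database_name": "golos_comments",
    "node_url": "ws://localhost:8090",
    "model_path": "/tmp/",
    "log_path": "/tmp/recommendation_model.log",
    "events_database": {
        "host": "localhost",
        "database": "golos",
        "user": "root",
        "password": "root",
    },
}

# An empty list means nothing is missing.
print(missing_config_keys(example_config))
```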

Install a package with:

$ pip3 install .

To add model daemons to a crontab, use:

$ install.sh

This script adds training tasks to the crontab and starts comment synchronization.

Generating a new version of the model takes some time. For example, if you run the installation script at 22:00, you'll get a new model only after a full day. To get the first version as quickly as possible, run the daemons manually:

$ doc2vec_train start
$ ann_train start
$ ffm_train start

To stop model daemons and to clean crontab, run:

$ uninstall.sh

How to use it

To add new events to a database, run:

$ sync_events start

To update accounts in a database, run:

$ sync_accounts start

To start the server, run:

$ recommendations_server start

Base server actions

To get the posts similar to a specified post, along with the distance to each of them, run:

$ curl http://localhost:8080/similar?permlink=POST_PERMLINK

For example:

$ curl http://localhost:8080/similar?permlink=@cka3ka/0x-zrx-naverno-zatuzemunit-skoro-50-50

[
  [
    "@cka3ka/0x-zrx-naverno-zatuzemunit-skoro-50-50", 
    0.00017143443983513862
  ], 
  [
    "@cka3ka/golos-tuzemun", 
    0.11337035149335861
  ], 
  [
    "@cka3ka/3oo5vp-sozdatel-ethereum-vitalik-buterin-voshel-v-spisok-50", 
    0.528253972530365
  ], 
  [
    "@cka3ka/sozdatel-ethereum-vitalik-buterin-voshel-v-spisok-50", 
    0.6705115437507629
  ], 
  [
    "@cka3ka/bitcoin-stal-shestym-po-populyarnosti-sredi-mirovykh-valyut", 
    0.7635799050331116
  ], 
  [
    "@abdulazizov/v-zimbabe-bitkoin-likhoradka", 
    0.9337025880813599
  ], 
  [
    "@abdulazizov/bitcoin-fork", 
    0.937701404094696
  ],
  ...
]
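
The same request can be made from Python. This is a minimal sketch using only the standard library; `BASE_URL`, `similar_url`, `parse_similar`, and `fetch_similar` are hypothetical names, and the sketch assumes the server is reachable on localhost:8080 and returns `[permlink, distance]` pairs as in the example above. The queried post appears in its own result list in that example, so the parser can optionally exclude it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # assumption: local server on the default port


def similar_url(permlink, base_url=BASE_URL):
    """Build the /similar URL, percent-encoding '@' and '/' in the permlink."""
    return f"{base_url}/similar?{urlencode({'permlink': permlink})}"


def parse_similar(body, exclude=None):
    """Turn the JSON response into (permlink, distance) tuples, skipping `exclude`."""
    return [
        (permlink, distance)
        for permlink, distance in json.loads(body)
        if permlink != exclude
    ]


def fetch_similar(permlink, base_url=BASE_URL):
    """Query a running server; requires recommendations_server to be started."""
    with urlopen(similar_url(permlink, base_url)) as response:
        return parse_similar(response.read().decode("utf-8"), exclude=permlink)


# Offline demonstration on the first two entries of the response shown above.
sample = (
    '[["@cka3ka/0x-zrx-naverno-zatuzemunit-skoro-50-50", 0.00017143443983513862],'
    ' ["@cka3ka/golos-tuzemun", 0.11337035149335861]]'
)
print(parse_similar(sample, exclude="@cka3ka/0x-zrx-naverno-zatuzemunit-skoro-50-50"))
```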

To get recommendations for a specified user, run:

$ curl http://localhost:8080/recommendations?user=USER_ID

For example:

$ curl http://localhost:8080/recommendations?user=58158

[
  {
    "post_permlink": "@tarimta/obektivnyi-marafon-etap-3", 
    "prediction": 0.9400154948234558
  }, 
  {
    "post_permlink": "@lumia/estafeta-prodolzhi-pesnyu-zadushevnaya", 
    "prediction": 0.9309653043746948
  }, 
  {
    "post_permlink": "@oksi969/dizain-cheloveka-lyubov-i-napravlenie-g-centr", 
    "prediction": 0.9016984701156616
  }, 
  {
    "post_permlink": "@is-pain/vzveshennye-lyudi-or-minus-16-kilogramm-za-dva-mesyaca", 
    "prediction": 0.8760964870452881
  }, 
  {
    "post_permlink": "@miroslav/golos-photography-awards-edinstvennaya", 
    "prediction": 0.8590876460075378
  },
  ...
]
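
A client will often want only the strongest recommendations. The sketch below filters a response of the shape shown above by a minimum `prediction` value; `top_recommendations`, `min_prediction`, and `limit` are hypothetical names for illustration, not part of the server API.

```python
import json


def top_recommendations(body, min_prediction=0.0, limit=None):
    """Return permlinks with prediction >= min_prediction, best first."""
    items = json.loads(body)
    kept = [item for item in items if item["prediction"] >= min_prediction]
    kept.sort(key=lambda item: item["prediction"], reverse=True)
    # limit=None keeps every remaining item.
    return [item["post_permlink"] for item in kept[:limit]]


# Offline demonstration on three entries taken from the response above.
sample = json.dumps([
    {"post_permlink": "@tarimta/obektivnyi-marafon-etap-3",
     "prediction": 0.9400154948234558},
    {"post_permlink": "@lumia/estafeta-prodolzhi-pesnyu-zadushevnaya",
     "prediction": 0.9309653043746948},
    {"post_permlink": "@miroslav/golos-photography-awards-edinstvennaya",
     "prediction": 0.8590876460075378},
])
print(top_recommendations(sample, min_prediction=0.9))
```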

To get recommendations for a specified user and a specified post, run (the URL is quoted so that the shell does not interpret the & character):

$ curl "http://localhost:8080/post_recommendations?user=USER_ID&permlink=POST_PERMLINK"

For example:

$ curl "http://localhost:8080/post_recommendations?user=71116&permlink=@cka3ka/golos-tuzemun"

[
  {
    "post_permlink": "@cka3ka/0x-zrx-naverno-zatuzemunit-skoro-50-50", 
    "prediction": 0.34954845905303955
  }, 
  {
    "post_permlink": "@igrinov50-50/skonchalsya-leonid-bronevoi", 
    "prediction": 0.3138478994369507
  }, 
  {
    "post_permlink": "@ljpromo/isportilas-autentichnost", 
    "prediction": 0.16339488327503204
  }, 
  {
    "post_permlink": "@cka3ka/bitcoin-stal-shestym-po-populyarnosti-sredi-mirovykh-valyut", 
    "prediction": 0.07751113921403885
  }
]
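
When a request has more than one query parameter, building the URL programmatically avoids quoting mistakes. This is a small sketch using the standard library; `endpoint_url` and `BASE_URL` are hypothetical names, and the host and port are assumed to match the examples above.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080"  # assumption: local server on the default port


def endpoint_url(action, base_url=BASE_URL, **params):
    """Build a server URL with percent-encoded query parameters."""
    return f"{base_url}/{action}?{urlencode(params)}"


url = endpoint_url(
    "post_recommendations", user=71116, permlink="@cka3ka/golos-tuzemun"
)
print(url)
```

Parameters are emitted in the order they are passed, and reserved characters such as `@` and `/` are percent-encoded.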

Debug server actions

To get the supported user ids, run:

$ curl http://localhost:8080/users

To find user id by username, run:

$ curl http://localhost:8080/user_id?user_name=USER_NAME

For example:

$ curl http://localhost:8080/user_id?user_name=smartcity-admin

{
  "user_id": 60837
}

To get page views for some user, run:

$ curl http://localhost:8080/history?user=USER_ID

For example:

$ curl http://localhost:8080/history?user=58158

[
  "@vik/test-redaktora-dlya-botov-ot-vik-11-10", 
  "@vox-populi/otchyot-kuratora-30-oktyabrya-5-noyabrya", 
  "@vp-freelance/4kpmi-rezultaty-ezhenedelnogo-konkursa-luchshaya-rabota-po-itogam-nedeli", 
  "@vp-freelance/khudozhestvennyi-perevod", 
  "@vp-freelance/konkursnaya-rabota-16-odnazhdy-na-rabote-ikra-belugi", 
  "@vp-freelance/realnosti-frilansa-mysli", 
  "@vp-freelance/rezultaty-konkursa-odnazhdy-na-rabote-za-oktyabr-2017-goda", 
  "@vp-freelance/treiding-kak-vid-frilansa"
]
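
The debug actions can be chained: look up a user id by name, then fetch that user's page views. The following is a sketch under the assumption that the responses have the shapes shown above; `http_get_json` and `history_for_username` are hypothetical names, and the fetch function is injectable so the logic can be tried without a running server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # assumption: local server on the default port


def http_get_json(action, params, base_url=BASE_URL):
    """GET an action from a running server and decode the JSON body."""
    with urlopen(f"{base_url}/{action}?{urlencode(params)}") as response:
        return json.loads(response.read().decode("utf-8"))


def history_for_username(user_name, get_json=http_get_json):
    """Resolve a username to its id, then return (user_id, viewed permlinks)."""
    user_id = get_json("user_id", {"user_name": user_name})["user_id"]
    return user_id, get_json("history", {"user": user_id})
```

With the server running, `history_for_username("smartcity-admin")` would issue the two requests in sequence.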

Configuration

The overall service configuration is in the config.py file, but most settings are defined in .py files deeper inside the package. You can modify the lines below to tune the results.

You can change the service port here:

# server/server.py
port = 8080 # Use desired port

It's highly recommended not to play with these parameters, but you can do it at your own risk in these files:

# sync/convert_events.py
...
HOURS_LIMIT = 14 * 24 # Time window size (in hours) for events extraction. Bigger values make recommendations less sensitive to changes in preferences
...
# model/train/doc2vec.py
...
WORD_LENGTH_QUANTILE = 10 # Remove words shorter than the 10th percentile of word length
TEXT_LENGTH_QUANTILE = 66 # Remove texts shorter than the 66th percentile of text length
HIGH_WORD_FREQUENCY_QUANTILE = 99.5 # Remove words that appear more often than the 99.5th percentile of word frequency
LOW_WORD_FREQUENCY_QUANTILE = 60 # Remove words that appear less often than the 60th percentile of word frequency
# Parameters of doc2vec model. You can read about them in this article: https://radimrehurek.com/gensim/models/doc2vec.html
DOC2VEC_PARAMETERS = {
  'size': 300,
  'window': 20,
  'min_count': 5,
  'workers': 13
}
...
# model/train/ann.py
...
NUMBER_OF_TREES = 1000 # Number of trees in the ANN model. Bigger values mean slower prediction and higher-quality results
NUMBER_OF_VALUES = 1000 # Number of values for one-hot encoding of categorical features. Bigger values mean slower preparation and higher-quality results
...
# model/train/ffm.py
...
# FFM parameters. You can read about them in this article: https://github.com/alexeygrigorev/libffm-python
MODEL_PARAMETERS = {
  'eta': 0.1,
  'lam': 0.01,
  'k': 70
}

ITERATIONS = 10 # Iterations of training process
WORKERS = 13 # Number of workers for dataset processing. Should be equal to AVAILABLE_CORES + 1
...
# model/predict/doc2vec.py
...
DOC2VEC_STEPS = 2500 # Number of steps for doc2vec model. Bigger values mean slower prediction and better convergence
DOC2VEC_ALPHA = 0.03 # Learning rate for doc2vec model. Bigger values mean faster prediction and slower convergence
...
# model/predict/ann.py
...
NUMBER_OF_RECOMMENDATIONS = 50 # Number of similar posts for a post
...
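
To give a feel for what a quantile threshold such as WORD_LENGTH_QUANTILE does, here is an illustrative sketch of percentile-based filtering. It is not the package's code: `percentile` and `drop_short_words` are hypothetical helpers, the nearest-rank percentile definition is an assumption, and the actual preprocessing in model/train/doc2vec.py may compute its thresholds differently.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def drop_short_words(words, quantile):
    """Keep only words whose length is at least the given percentile of word lengths."""
    threshold = percentile([len(word) for word in words], quantile)
    return [word for word in words if len(word) >= threshold]


words = ["a", "to", "the", "post", "golos", "recommendation",
         "model", "vector", "system", "daemon"]
# A threshold at the 30th percentile removes the two shortest words here.
print(drop_short_words(words, 30))
```

A higher quantile removes more of the short words, which is the trade-off the constants above control.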

Tests and logs

To run load tests, download the first version of the model and use:

$ python3 -m unittest tests.load_test_case

It'll show the average response time for the actions that return recommendations and similar posts.

To see model logs, run:

tail -f ./model.log

Timing

Processing time:

  • doc2vec - 2.5h
    • train - 1.5h
    • predict - 1h
  • ann - 2h
    • train - 1h
    • predict - 1h
  • ffm - 5h
    • train - 4h
    • predict - 1h

Tested on a server with an i7-5930K, 128 GB DDR4, and a 1 TB PCIe SSD.
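
The per-stage figures above can be summed to estimate one full rebuild. This sketch assumes the three stages run one after another, which the timings above do not state; if stages overlap, the wall-clock time would be shorter. `STAGE_HOURS` is a hypothetical name holding the numbers from the list.

```python
# Hours per step, copied from the timing list above.
STAGE_HOURS = {
    "doc2vec": {"train": 1.5, "predict": 1.0},
    "ann": {"train": 1.0, "predict": 1.0},
    "ffm": {"train": 4.0, "predict": 1.0},
}

# Sum train and predict time for each stage, then across stages.
per_stage = {name: sum(steps.values()) for name, steps in STAGE_HOURS.items()}
total_hours = sum(per_stage.values())

print(per_stage)
print(total_hours)
```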
