# Popularity

This notebook demonstrates using scoring functions to rank news articles. The aim is to use a popularity factor (0-10) to boost trustworthy articles to the top. This is achieved by using `function_score` and `script_score`.
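
As a rough sketch of the request shape, the body can be built as a Python dict and serialized with `json.dumps`, which takes care of quoting the script. The helper name `build_script_score_query` is hypothetical; the field names are the ones queried later in this notebook.

```python
import json


def build_script_score_query(term, script):
    """Return a function_score body that rescores matches with a script."""
    return {
        "_source": ["title", "content", "resource_label"],
        "query": {
            "function_score": {
                # The inner query selects the documents and produces _score.
                "query": {"match": {"content": term}},
                # The script receives _score and returns the new score.
                "functions": [{"script_score": {"script": script}}],
            }
        },
    }


body = build_script_score_query("spyware", "return 5 * _score")
print(json.dumps(body, indent=2))
```

Building the body as a dict avoids the manual escaping of newlines and single quotes that string concatenation needs.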

Relevant links:
* https://www.elastic.co/guide/en/elasticsearch/guide/current/script-score.html
* https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
* https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-scripting-fields.html

TODO:
* Individual articles can be boosted by their number of votes. We should probably implement this. It will probably require reindexing. https://www.elastic.co/guide/en/elasticsearch/guide/current/boosting-by-popularity.html
* This might be the way to update the number of votes: https://www.elastic.co/guide/en/elasticsearch/reference/current/_updating_documents.html
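
A possible starting point for the votes TODO, following the `field_value_factor` approach from the boosting-by-popularity guide linked above. This is only a sketch: the `votes` field does not exist in the index yet and is an assumption, as is the choice of `log1p` and `boost_mode: sum`.

```python
import json

# Hypothetical: assumes articles have been reindexed with a numeric "votes" field.
votes_query = {
    "query": {
        "function_score": {
            "query": {"match": {"content": "spyware"}},
            "field_value_factor": {
                "field": "votes",
                "modifier": "log1p",  # dampen the effect of large vote counts
                "missing": 0,         # value to use for articles without the field
            },
            # Add the vote factor to the relevance score instead of multiplying,
            # so that articles with zero votes keep their original score.
            "boost_mode": "sum",
        }
    }
}

print(json.dumps(votes_query, indent=2))
```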

In [None]:
import subprocess
import json

In [None]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds/_count" -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

In [None]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds/_mapping" -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

In [None]:
term = "spyware"

In [None]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds_test/article/_search" -H 'Content-Type: application/json' -d'
{
    "_source": ["title", "content", "resource_label"],
    "query": {
        "match": {
            "content": " """ + term + """ "
        }
    }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res, strict=False)
print(res['hits']['total'])
print()

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_source']['resource_label'], hit['_source']['title'])
    print('-'*80)
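
The same result-printing loop is repeated for each query in this notebook. A minimal sketch of a reusable version is shown here; the helper names `hit_total` and `format_hits` are hypothetical, and the sample response is made up. The `dict` branch is there because newer Elasticsearch versions return `hits.total` as an object with a `value` key rather than a plain number.

```python
def hit_total(res):
    """Return the hit count whether hits.total is a number or an object."""
    total = res["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def format_hits(res):
    """Return one 'score label title' line per hit."""
    return [
        "{} {} {}".format(
            hit["_score"],
            hit["_source"]["resource_label"],
            hit["_source"]["title"],
        )
        for hit in res["hits"]["hits"]
    ]


# Made-up response used only to exercise the helpers.
sample = {
    "hits": {
        "total": 1,
        "hits": [
            {"_score": 2.5, "_source": {"resource_label": "wired", "title": "Example title"}}
        ],
    }
}

print(hit_total(sample))
for line in format_hits(sample):
    print(line)
    print("-" * 80)
```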

In [None]:
#script = """weight = 5
#if (doc['resource_label.keyword'].value == 'trendmicro') { weight = 7 }
#if (doc['resource_label.keyword'].value == 'itsecurityguru') { weight = 4 }
#return weight * _score"""

In [None]:
script = """if (doc['resource_label.keyword'].value == 'trendmicro') { return 7 * _score }
if (doc['resource_label.keyword'].value == 'itsecurityguru') { return 4 * _score }
return 5 * _score"""

In [None]:
popularity = """arstechnica 8
bankinfosecurity 6
bleepingcomputer 5
csoonline 7
darkreading 7
euractiv 7
itsecurityguru 4
malwarebytes 8
nakedsecurity 5
politico 6
reuters 7
securelist 7
securityaffairs 4
securityintelligence 7
securityweek 7
techcrunch 6
thehackernews 5
threatpost 8
trendmicro 7
wired 6"""

# Not in RSS feeds or websites:
# symantec blog 7
# fire eye 7
# talos blog 7
# scmagazine 6
# bbc 6
# independent 6
# forbes 6
# secureworks 6
# tripwire 6

# Not rated by Adrien:
# cert
# cisco
# securityweekly
# welivesecurity

In [None]:
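# Build a scoring script with one branch per rated source.
script = ""

for line in popularity.split('\n'):
    label, rating = line.split(' ')
    script += "if (doc['resource_label.keyword'].value == '" + label + "') { return " + rating + " * _score }\n"

# Sources without a rating fall back to a neutral factor of 5.
script += "return 5 * _score"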
print(script)
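
An alternative to generating one `if` branch per source is to pass the ratings to the script as parameters and look them up in a map. The sketch below has not been run against this cluster. It assumes the script language is Painless, that the script object accepts the `source` key (older releases used `inline`), and that the `params` map supports `getOrDefault`. The helper `parse_popularity` and the two-line sample table are illustrative only.

```python
import json


def parse_popularity(table):
    """Parse 'label rating' lines into a {label: rating} dict."""
    weights = {}
    for line in table.splitlines():
        if not line.strip():
            continue
        label, rating = line.split()
        weights[label] = int(rating)
    return weights


# Two entries from the popularity table, used here as a short sample.
weights = parse_popularity("trendmicro 7\nitsecurityguru 4")

script_score = {
    "script": {
        # Assumed Painless: look up the source's rating, falling back to 5.
        "source": (
            "params.weights.getOrDefault("
            "doc['resource_label.keyword'].value, params.fallback) * _score"
        ),
        "params": {"weights": weights, "fallback": 5},
    }
}

print(json.dumps(script_score, indent=2))
```

With parameters the script text stays constant while the ratings change, so only the `params` object has to be regenerated when a rating is updated.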

In [None]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds_test/article/_search" -H 'Content-Type: application/json' -d'
{
    "_source": ["title", "content", "resource_label"],
    "query": {
        "function_score": {
            "functions": [
                {
                    "script_score": {
                        "script": " """ + script.replace('\n', '\\n').replace("'", '\\u0027') + """ "
                    }
                }
            ],
            "query": {   
                "match": {
                    "content": " """ + term + """ "
                }
            }
        }
    }
}
' -u guest:teradata
"""
print(query)

In [None]:
res = subprocess.getoutput(query)
#res

In [None]:
res = json.loads(res, strict=False)
print(res['hits']['total'])
print()

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_source']['resource_label'], hit['_source']['title'])
    print('-'*80)

In [None]:
functions = []

for line in popularity.split('\n'):
    values = line.split(' ')
    functions.append("""        {
          "filter": {
            "match": {
              "resource_label": """ + '"' + values[0] + '"' + """   
            }
          },
          "weight": """ + str(float(values[1])/5.) + """
        }""")

functions = ",\n".join(functions)
print(functions)
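
The weights above are the ratings divided by 5, and the query that uses them sets `"boost": "5"` with `"boost_mode": "multiply"`. Assuming Elasticsearch multiplies the boost, the matching function's weight and the query score, and assuming a document that matches no filter gets a function score of 1, the effective multiplier equals the rating, with 5 for unrated sources. That is the same as the `script_score` version. The arithmetic can be checked with a small sketch; the function name `effective_multiplier` is hypothetical.

```python
import math


def effective_multiplier(rating=None, boost=5.0, baseline=5.0):
    """Multiplier applied to _score under the assumptions stated above."""
    # A source without a rating matches no filter, so its weight is taken as 1.
    weight = 1.0 if rating is None else rating / baseline
    return boost * weight


for rating in (7, 4, None):
    print(rating, effective_multiplier(rating))
```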

In [None]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds_test/article/_search" -H 'Content-Type: application/json' -d'
{
  "_source": [
    "title",
    "content",
    "resource_label"
  ],
  "query": {
    "function_score": {
      "boost": "5",
      "functions": [
""" + functions + """
      ],
      "boost_mode": "multiply",

      "query": {   
        "match": {
          "content": " """ + term + """ "
        }
      }
    }
  }
}
' -u guest:teradata
"""
print(query)

In [None]:
res = subprocess.getoutput(query)
#res

In [None]:
res = json.loads(res, strict=False)
print(res['hits']['total'])
print()

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_source']['resource_label'], hit['_source']['title'])
    print('-'*80)