# Multiple analyzers

This notebook demonstrates searching news articles with multiple analyzers. The aim is to use the topics of the vsdx knowledge graph as synonyms, so that the hierarchy among the topics is respected at search time: searching for a parent category (e.g. malware) also returns the news articles related to all of its subcategories (e.g. spyware, ransomware, trojan, ...).
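
The idea can be illustrated without Elasticsearch. The following is a minimal sketch, assuming a toy taxonomy and two hypothetical helpers, `expand` and `search`, that are not part of the notebook: a query term is replaced by itself plus all of its descendants before matching.

```python
# Toy taxonomy: an illustrative subset of the topics used later in this notebook.
TAXONOMY = {
    "threats": ["malware", "phishing"],
    "malware": ["spyware", "ransomware", "trojan"],
}


def expand(term, taxonomy=TAXONOMY):
    """Return the term followed by all of its descendants, depth-first."""
    terms = [term]
    for child in taxonomy.get(term, []):
        terms.extend(expand(child, taxonomy))
    return terms


def search(term, documents):
    """Return the documents that mention the term or any of its descendants."""
    wanted = expand(term)
    return [doc for doc in documents if any(t in doc.lower() for t in wanted)]


docs = ["New ransomware hits hospitals", "Quarterly earnings report"]
print(search("malware", docs))  # the ransomware article matches via its parent
```

The rest of the notebook obtains the same effect inside Elasticsearch by applying a synonym filter at search time.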

In [1]:
import subprocess
import json

In [2]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds/_count" -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'count': 4808}
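
Every cell in this notebook repeats the same pattern: build a curl command, run it, decode the JSON. A possible way to factor this out is sketched below. The helper names `build_curl` and `parse_response` are hypothetical; the base URL and credentials are the ones used in the cells, and the check for an `error` key relies on the usual shape of Elasticsearch error responses.

```python
import json
import shlex

# Base URL and credentials exactly as used in the cells of this notebook.
BASE_URL = "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch"
AUTH = "guest:teradata"


def build_curl(method, path, body=None):
    """Assemble a curl command line; an optional JSON body is shell-quoted."""
    parts = ["curl", "-s", "-X", method, shlex.quote(f"{BASE_URL}/{path}"), "-u", AUTH]
    if body is not None:
        parts += ["-H", shlex.quote("Content-Type: application/json"),
                  "-d", shlex.quote(json.dumps(body))]
    return " ".join(parts)


def parse_response(raw):
    """Decode a response and raise if Elasticsearch reported an error."""
    res = json.loads(raw, strict=False)
    if isinstance(res, dict) and "error" in res:
        raise RuntimeError(res["error"])
    return res
```

With these helpers, the count above could be obtained as `parse_response(subprocess.getoutput(build_curl("GET", "rssfeeds/_count")))`.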

In [3]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds/_mapping" -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'rssfeeds': {'mappings': {'article': {'properties': {'categories': {'fields': {'keyword': {'ignore_above': 256,
        'type': 'keyword'}},
      'type': 'text'},
     'content': {'fields': {'processed': {'analyzer': 'custom_text_analyzer',
        'type': 'text'},
       'tagged': {'analyzer': 'custom_text_analyzer',
        'fielddata': True,
        'fielddata_frequency_filter': {'max': 0.1,
         'min': 0.001,
         'min_segment_size': 10},
        'type': 'text'}},
      'type': 'text'},
     'description': {'fields': {'processed': {'analyzer': 'custom_text_analyzer',
        'type': 'text'},
       'tagged': {'analyzer': 'custom_text_analyzer',
        'fielddata': True,
        'fielddata_frequency_filter': {'max': 0.1,
         'min': 0.001,
         'min_segment_size': 10},
        'type': 'text'}},
      'type': 'text'},
     'link': {'ignore_above': 256, 'type': 'keyword'},
     'published': {'type': 'date'},
     'resource_label': {'ignore_above': 256, 'type': 'keyw

In [4]:
# Drop the test index left over from a previous run so that it can be recreated below.
query = """
curl -s -X DELETE "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds_test" -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'acknowledged': True}

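It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. Here, the `content` field gets a `knowledge_graph` sub-field that shares the index-time analyzer of the main field but applies the knowledge-graph synonyms at search time.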

https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
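
As a small illustration of how a multi-field is declared and addressed, the sketch below writes the `properties` part of such a mapping as a Python dict. The analyzer names are the ones this notebook defines in its later cells; the helper `sub_field_paths` is hypothetical and only shows the dotted path under which a sub-field is queried.

```python
import json

# `content` keeps its main field and gains one sub-field under "fields".
mapping = {
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "analyzer_default",
            "fields": {
                "knowledge_graph": {
                    "type": "text",
                    "analyzer": "analyzer_default",
                    "search_analyzer": "analyzer_knowledge_graph",
                }
            },
        }
    }
}


def sub_field_paths(mapping):
    """List the dotted paths (field.sub_field) usable in queries."""
    paths = []
    for name, spec in mapping["properties"].items():
        for sub in spec.get("fields", {}):
            paths.append(f"{name}.{sub}")
    return paths


print(sub_field_paths(mapping))
print(json.dumps(mapping, indent=2))  # JSON form, ready to embed in a request body
```

A query on `content` uses the main field, while a query on `content.knowledge_graph` uses the sub-field and therefore its search analyzer.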

In [5]:
# Search-time analyzer: expands a topic into itself plus its subtopics from the knowledge graph.
analyzer_knowledge_graph = """
        "analyzer_knowledge_graph": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "knowledge_graph"]
        }"""

knowledge_graph = """
          "synonyms" : [
"privacy => privacy",
"accountability => accountability",
"authenticity => authenticity",
"integrity => integrity",
"trustworthiness => trustworthiness",
"confidentiality => confidentiality",
"availability => availability",
"auditability => auditability",
"non-repudiation => non-repudiation",
"general terms => general terms, privacy, accountability, authenticity, integrity, trustworthiness, confidentiality, availability, auditability, non-repudiation",

"directives => directives, network and information security directive, nis directive, nis, national law",
"ban => ban",
"regulation => regulation, general data protection regulation, gdpr, electronic identification authentication and trust services regulation, eidas",
"cybersecurity strategy => cybersecurity strategy",
"cybersecurity doctrine => cybersecurity doctrine",
"policy => policy, directives, network and information security directive, nis directive, nis, national law, ban, regulation, general data protection regulation, gdpr, electronic identification authentication and trust services regulation, eidas, cybersecurity strategy, cybersecurity doctrine",

"fake news => fake news",
"elections => elections",
"cyber conflict => cyber conflict",
"disinformation => disinformation",
"destabilization => destabilization",
"information warfare => information warfare",
"psychological warfare => psychological warfare",
"cyber war => cyber war",
"deterrence => deterrence",
"propaganda => propaganda",
"government interference => government interference",
"geopolitics => geopolitics, fake news, elections, cyber conflict, disinformation, destabilization, information warfare, psychological warfare, cyber war, deterrence, propaganda, government interference",

"nation states => nation states",
"cyber-terrorists => cyber-terrorists",
"insiders => insiders",
"cyber-criminals => cyber-criminals",
"hacktivists => hacktivists",
"script-kiddies => script-kiddies",
"corporations => corporations",
"threat actors => threat actors, nation states, cyber-terrorists, insiders, cyber-criminals, hacktivists, script-kiddies, corporations",

"ipo => ipo",
"acquisition => acquisition",
"sme => sme",
"award => award",
"operator of essential service => operator of essential service",
"merger => merger",
"insurance => insurance",
"stock market => stock market",
"partnership => partnership",
"stock options => stock options",
"fine => fine",
"funding => funding",
"technology companies => technology companies, google, amazon, facebook, apple, twitter, linkedin",
"industry => industry",
"valuation => valuation",
"black market => black market",
"startup => startup",
"cybersecurity companies => cybersecurity companies",
"public private partnerships, ppp => public private partnerships, ppp",
"alliance => alliance",
"bankruptcy => bankruptcy",
"critical infrastructure => critical infrastructure",
"business => business, ipo, acquisition, sme, award, operator of essential service, merger, insurance, stock market, partnership, stock options, fine, funding, technology companies, google, amazon, facebook, apple, twitter, linkedin, industry, valuation, black market, startup, cybersecurity companies, public private partnerships, ppp, alliance, bankruptcy, critical infrastructure",

"cyber espionage => cyber espionage, nation state espionage, corporation espionage, financial espionage, targeted attacks, denial and deception",
"information leakage => information leakage, misconfiguration, data leaks, personal data",
"botnets => botnets, ddos, iot botnet",
"denial of service => denial of service, amplification attacks, reflection attacks",
"data breaches => data breaches, personal data, exploitation, hacking, security vulnerabilities, security incident, credential theft, data dump",
"exploit kits => exploit kits, vulnerabilities, zero-day, 0-day",
"phishing => phishing, social engineering, spear-phishing, malware, spam, data stealing",
"web application attacks => web application attacks, cross-site scripting, xss, local file inclusion, lfi, remote file inclusion, rfi, sql injection, cross-site request forgery, csrf",
"spam => spam, social engineering, malware",
"insider threat => insider threat, human error, malicious insider, cyber espionage, un-authorised access, data leak",
"identity theft => identity theft, social engineering, social media abuse, dark web shopping, confidential information, sensitive information, impersonation, credential stealing, personal information, personal data",
"vulnerabilities => vulnerabilities, zero-day, 0-day, exploitation, hardware vulnerabilities, software vulnerabilities",
"physical manipulation damage => physical manipulation damage, outage, failures, malfunction, environmental disaster, natural disaster, physical attack, damage caused by third party",
"malware => malware, advanced persistent threat, apt, virus, worm, ransomware, trojan, cryptominer, rootkit, bootkit, backdoor, spyware, scareware, adware, keylogger",
"web based attacks => web based attacks, drive-by downloads, cryptojacking, man-in-the-browser, waterholing, supply-chain attack",
"threats => threats, cyber espionage, nation state espionage, corporation espionage, financial espionage, targeted attacks, denial and deception, information leakage, misconfiguration, data leaks, personal data, botnets, ddos, iot botnet, denial of service, amplification attacks, reflection attacks, data breaches, personal data, exploitation, hacking, security vulnerabilities, security incident, credential theft, data dump, exploit kits, vulnerabilities, zero-day, 0-day, phishing, social engineering, spear-phishing, malware, spam, data stealing, web application attacks, cross-site scripting, xss, local file inclusion, lfi, remote file inclusion, rfi, sql injection, cross-site request forgery, csrf, spam, social engineering, malware, insider threat, human error, malicious insider, cyber espionage, un-authorised access, data leak, identity theft, social engineering, social media abuse, dark web shopping, confidential information, sensitive information, impersonation, credential stealing, personal information, personal data, vulnerabilities, zero-day, 0-day, exploitation, hardware vulnerabilities, software vulnerabilities, physical manipulation damage, outage, failures, malfunction, environmental disaster, natural disaster, physical attack, damage caused by third party, malware, advanced persistent threat, apt, virus, worm, ransomware, trojan, cryptominer, rootkit, bootkit, backdoor, spyware, scareware, adware, keylogger, web based attacks, drive-by downloads, cryptojacking, man-in-the-browser, waterholing, supply-chain attack",

"log management software => log management software",
"multi-factor authentication => multi-factor authentication",
"siem => siem",
"e-mail screening => e-mail screening",
"ransomware prevention => ransomware prevention",
"anti-subversion software => anti-subversion software",
"security information management => security information management",
"records management => records management",
"intrusion detection system, ids => intrusion detection system, ids",
"cryptographic software => cryptographic software",
"anti-keyloggers => anti-keyloggers",
"access control => access control",
"anti-tamper software => anti-tamper software",
"anti-malware => anti-malware",
"vpn => vpn",
"computer aided dispatch, cad => computer aided dispatch, cad",
"antivirus software => antivirus software",
"secure coding => secure coding",
"anti-spyware => anti-spyware",
"firewall => firewall",
"secure operating systems => secure operating systems",
"hardware security module => hardware security module",
"intrusion prevention system, ips => intrusion prevention system, ips",
"sandbox => sandbox",
"security by design => security by design",
"security software and hardware => security software and hardware, log management software, multi-factor authentication, siem, e-mail screening, ransomware prevention, anti-subversion software, security information management, records management, intrusion detection system, ids, cryptographic software, anti-keyloggers, access control, anti-tamper software, anti-malware, vpn, computer aided dispatch, cad, antivirus software, secure coding, anti-spyware, firewall, secure operating systems, hardware security module, intrusion prevention system, ips, sandbox, security by design",

"software => software",
"data => data",
"physical assets equipment, devices, hardware => physical assets equipment, devices, hardware",
"information => information",
"brand => brand",
"reputation => reputation",
"intellectual property, ip => intellectual property, ip",
"people => people",
"services => services",
"assets => assets, software, data, physical assets equipment, devices, hardware, information, brand, reputation, intellectual property, ip, people, services",

"big data => big data",
"machine learning => machine learning",
"artificial intelligence => artificial intelligence",
"virtual reality, vr => virtual reality, vr",
"deep learning => deep learning",
"autonomous systems => autonomous systems",
"industrial control systems, ics => industrial control systems, ics",
"iot => iot",
"nano technology => nano technology",
"quantum computing => quantum computing",
"biometrics => biometrics",
"cloud technology => cloud technology",
"blockchain => blockchain",
"augmented reality, ar => augmented reality, ar",
"smart infrastructure => smart infrastructure",
"robotics => robotics",
"software-defined networking, sdn => software-defined networking, sdn, 5g",
"emerging technology => emerging technology, big data, machine learning, artificial intelligence, virtual reality, vr, deep learning, autonomous systems, industrial control systems, ics, iot, nano technology, quantum computing, biometrics, cloud technology, blockchain, augmented reality, ar, smart infrastructure, robotics, software-defined networking, sdn, 5g"
]
"""
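
The rules above are written out by hand, including the long flattened lines for the top-level topics. They could instead be generated from the hierarchy. The following is a sketch under assumptions: the nested dict is a small hand-copied excerpt of the `threats` branch, and `flatten` and `synonym_rules` are hypothetical helpers, not part of the notebook.

```python
import json


def flatten(topic, taxonomy):
    """Return the topic followed by all of its descendants, depth-first."""
    out = [topic]
    for child in taxonomy.get(topic, []):
        out.extend(flatten(child, taxonomy))
    return out


def synonym_rules(taxonomy):
    """Build one explicit-mapping rule ("a => a, b, c") per parent topic."""
    return [f"{topic} => {', '.join(flatten(topic, taxonomy))}" for topic in taxonomy]


# Excerpt of the hierarchy: each key maps a topic to its direct subtopics.
taxonomy = {
    "botnets": ["ddos", "iot botnet"],
    "denial of service": ["amplification attacks", "reflection attacks"],
    "threats": ["botnets", "denial of service"],
}

rules = synonym_rules(taxonomy)
for rule in rules:
    print(rule)

# The rules can then be serialized as the body of a synonym token filter.
print(json.dumps({"type": "synonym", "synonyms": rules}, indent=2))
```

Generating the rules this way keeps a parent's rule in step with its children, so a subtopic added in one place cannot be forgotten in the flattened line of its ancestors.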

In [6]:
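# Default analyzer: treats equivalent terms (full names, abbreviations, alternative names) as synonyms.
analyzer_default = """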
        "analyzer_default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "synonym"]
        }
"""

synonyms = """
          "synonyms" : [
"network and information security directive, nis directive, nis => network and information security directive, nis directive, nis",
"general data protection regulation, gdpr => general data protection regulation, gdpr",
"electronic identification authentication and trust services regulation, eidas => electronic identification authentication and trust services regulation, eidas",
"public private partnerships, ppp => public private partnerships, ppp",
"amplification attacks, reflection attacks => amplification attacks, reflection attacks",
"zero-day, 0-day => zero-day, 0-day",
"cross-site scripting, xss => cross-site scripting, xss",
"local file inclusion, lfi => local file inclusion, lfi",
"remote file inclusion, rfi => remote file inclusion, rfi",
"cross-site request forgery, csrf => cross-site request forgery, csrf",
"confidential information, sensitive information => confidential information, sensitive information",
"personal information, personal data => personal information, personal data",
"failures, malfunction => failures, malfunction",
"advanced persistent threat, apt => advanced persistent threat, apt",
"intrusion detection system, ids => intrusion detection system, ids",
"computer aided dispatch, cad => computer aided dispatch, cad",
"intrusion prevention system, ips => intrusion prevention system, ips",
"physical assets equipment, devices, hardware => physical assets equipment, devices, hardware",
"intellectual property, ip => intellectual property, ip",
"virtual reality, vr => virtual reality, vr",
"industrial control systems, ics => industrial control systems, ics",
"augmented reality, ar => augmented reality, ar",
"software-defined networking, sdn => software-defined networking, sdn"
]
"""

In [7]:
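# Create rssfeeds_test with both analyzers. The content.knowledge_graph sub-field is
# indexed with analyzer_default but searched with analyzer_knowledge_graph.
query = """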
curl -s -X PUT "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds_test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {""" + analyzer_default + """,
      """ + analyzer_knowledge_graph + """
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          """ + synonyms + """
        },
        "knowledge_graph": {
          "type": "synonym",
          """ + knowledge_graph + """
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "analyzer_default",
          "fields": {
            "knowledge_graph": {
              "type": "text",
              "analyzer": "analyzer_default",
              "search_analyzer": "analyzer_knowledge_graph"
            }
          }
        }
      }
    }
  }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'acknowledged': True, 'index': 'rssfeeds_test', 'shards_acknowledged': True}
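
Once the index exists, the `_analyze` API of Elasticsearch can be used to check which tokens an analyzer produces for a given text, for example to confirm that `analyzer_knowledge_graph` expands a parent topic. The sketch below only prepares the request and decodes a response; the helper names are hypothetical and no request is sent here.

```python
import json


def analyze_request(index, analyzer, text):
    """Return the (path, JSON body) pair for the _analyze API of an index."""
    return f"{index}/_analyze", json.dumps({"analyzer": analyzer, "text": text})


def token_strings(response):
    """Extract the token texts from a decoded _analyze response."""
    return [tok["token"] for tok in response.get("tokens", [])]


path, body = analyze_request("rssfeeds_test", "analyzer_knowledge_graph", "malware")
print(path)
print(body)
```

The path and body can be sent with the same curl pattern as the other cells, and `token_strings` applied to the decoded result.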

In [8]:
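# Copy all documents from rssfeeds into rssfeeds_test so that they are analyzed with the new mapping.
query = """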
curl -s -X POST "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "rssfeeds"
  },
  "dest": {
    "index": "rssfeeds_test"
  }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res

'{"took":4947,"timed_out":false,"total":4808,"updated":0,"created":4808,"deleted":0,"batches":5,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}'

In [9]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds_test/_count" -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'count': 4441}
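
The count here (4441) is lower than the 4808 documents that the reindex reported as created. The notebook does not explain the gap. One plausible cause, stated here as an assumption, is that the count was requested before the new index had been fully refreshed, since newly indexed documents only become visible to search and count after a refresh. A hedged sketch of a polling helper follows; `wait_for_count` is a hypothetical name, and `get_count` stands for any callable that returns the current count.

```python
import time


def wait_for_count(get_count, expected, retries=5, delay=1.0):
    """Poll get_count() until it returns `expected` or the retries run out.

    Returns the last count seen, so the caller can tell whether it converged.
    """
    count = get_count()
    for _ in range(retries):
        if count == expected:
            break
        time.sleep(delay)  # give Elasticsearch time to refresh the index
        count = get_count()
    return count
```

An alternative is to request a refresh explicitly with `POST rssfeeds_test/_refresh` before counting.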

In [10]:
child = "spyware"
parent = "malware"
grandparent = "threats"
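
The search cells that follow build their request body by string concatenation and wrap the term in braces, which are not part of the term itself. A possible alternative is to build the body as a dict and serialize it, which also makes it easy to run the same search for `grandparent`. This is a sketch; `match_query` is a hypothetical helper.

```python
import json


def match_query(field, term, source=("title", "content")):
    """Return the JSON body of a match query on one field."""
    body = {"_source": list(source), "query": {"match": {field: term}}}
    return json.dumps(body)


# The three levels of the hierarchy, searched through the expanding sub-field.
for term in ("spyware", "malware", "threats"):
    print(match_query("content.knowledge_graph", term))
```

The resulting string can be passed to curl with `-d`, provided it is quoted for the shell (for instance with `shlex.quote`), because a term containing a single quote would otherwise break the command.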

In [11]:
# Search the main content field for the child term.
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds_test/article/_search" -H 'Content-Type: application/json' -d'
{
    "_source": ["title", "content"],
    "query": {
        "match": {
            "content": "{""" + child + """}"
        }
    }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res, strict=False)
print(res['hits']['total'])
print()

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_id'], hit['_source']['title'])
    print('- '*40)
    for line in hit['_source']['content'].replace('\n', '').split('.'):
        if child in line.lower():
            print(line.strip())
    print('-'*80)

3

5.419101 7f639c10d25908048588617a1e89d01ee012d5cf Maikspy Spyware Poses as Adult Game, Targets Windows and Android Users
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
By Ecular Xu and Grey GuoWe discovered a malware family called Maikspy — a multi-platform spyware that can steal users’ private data
The spyware targets Windows and Android users, and first posed as an adult game named after a popular U
Maikspy, which is an alias that combines the name of the adult film actress and spyware, has been around since 2016
Our analysis of the latest Maikspy variants revealed that users contracted the spyware from hxxp://miakhalifagame[
However, the spyware just hides itself and runs in the background
If the user has multiple Twitter accounts, the spyware will use the account where the user is logged in
Code snippet of the process where the Unix timestamp, the device’s Bluetooth adapter name, and the name of user’s Twitter account are combined to produce a d

In [12]:
# Search the main content field for the parent term: only literal mentions of it match.
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds_test/article/_search" -H 'Content-Type: application/json' -d'
{
    "_source": ["title", "content"],
    "query": {
        "match": {
            "content": "{""" + parent + """}"
        }
    }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res, strict=False)
print(res['hits']['total'])
print()

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_id'], hit['_source']['title'])
    print('- '*40)
    for line in hit['_source']['content'].replace('\n', '').split('.'):
        if parent in line.lower():
            print(line.strip())
    print('-'*80)

66

2.8633888 8634f7b3b41daf88c39b4064e6404ab485b79376 Fortinet Threat Landscape Report Reveals an Evolution of Malware to Exploit Cryptocurrencies
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Highlights of the report follow:Cybercrime Attack Methods Evolve to Ensure Success at Speed and Scale Data indicates that cybercriminals are getting better and more sophisticated in their use of malware and leveraging newly announced zero-day vulnerabilities to attack at speed and scale
Spike in Cryptojacking: Malware is evolving and becoming more difficult to prevent and detect
The prevalence of cryptomining malware more than doubled from quarter to quarter from 13% to 28%
Cryptomining malware is also showing incredible diversity for such a relatively new threat
Cybercriminals are creating stealthier file-less malware to inject infected code into browsers with less detection
Targeted Attacks for Maximum Impact: The impact of destructive malware remains high, p

In [13]:
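# Same parent search against the knowledge_graph sub-field, whose search analyzer
# expands the term into its subtopics. The sentence filter below still looks for
# the literal parent term only.
query = """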
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/rssfeeds_test/article/_search" -H 'Content-Type: application/json' -d'
{
    "_source": ["title", "content"],
    "query": {
        "match": {
            "content.knowledge_graph": "{""" + parent + """}"
        }
    }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res, strict=False)
print(res['hits']['total'])
print()

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_id'], hit['_source']['title'])
    print('- '*40)
    for line in hit['_source']['content'].replace('\n', '').split('.'):
        if parent in line.lower():
            print(line.strip())
    print('-'*80)

115

9.341919 3b7350e6fe211e2072731106ab33b45ea83cdb34 North Korea-linked Sun Team APT group targets deflectors with Android Malware
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Researchers at McAfee discovered that the malware was on Google Play as ‘unreleased’ versions and it accounts for only around 100 infections, they also notified it to Google that has already removed the threat from the store
Once installed, the malware starts copying sensitive information from the device, including personal photos, contacts, and SMS messages, and then sends them to the threat actors
While the 음식궁합 and Fast AppLock apps are data stealer malware that receives commands and additional executable (
dex) files from a cloud control server, the  AppLockFree is a reconnaissance malware that prepares the installations to further payloads
The malware spread to friends, asking them to install the malicious apps and offer feedback via a Facebook account with a fake profil
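
The two searches for the parent term return 66 hits on `content` and 115 hits on `content.knowledge_graph`. To see which articles are found only through the knowledge-graph expansion, the two result sets can be compared by document ID. This is a sketch with hypothetical helper names. It assumes both responses contain all hits, which requires a `size` parameter in the request because Elasticsearch returns only 10 hits by default.

```python
def hit_ids(response):
    """Collect the document IDs of the hits in a decoded search response."""
    return {hit["_id"] for hit in response["hits"]["hits"]}


def gained_by_expansion(plain_response, expanded_response):
    """IDs returned by the expanded search but not by the plain one."""
    return hit_ids(expanded_response) - hit_ids(plain_response)
```

Printing the titles of the documents in `gained_by_expansion(...)` shows the articles that mention a subcategory without mentioning the parent term itself.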