# Searching using synonyms

This notebook demonstrates how the topics from the vsdx knowledge graph can be used as synonyms when searching the text of the summary sections extracted from the PDF documents.

* Searching for "spyware" returns only 1 document. "spyware" has no subcategories, so no synonyms are used in this case.
* Searching for "malware" returns 77 documents, because all subcategories of malware are included in the search. The following terms are tried as synonyms: malware, advanced persistent threat, apt, virus, worm, ransomware, trojan, cryptominer, rootkit, bootkit, backdoor, spyware, scareware, addware, keylogger
* Searching for "threats" returns 251 documents, because all topics that belong to threats are searched for.
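
The expansion described above can be sketched as follows: every topic that has subcategories maps to itself plus all of its descendants, while leaf topics such as "spyware" get no rule. The `TOPIC_TREE` dictionary and the `descendants` and `synonym_rules` functions are illustrative assumptions covering only a small excerpt of the topics; the notebook itself lists the rules by hand in the index settings.

```python
# Hypothetical excerpt of the topic hierarchy: parent -> list of child terms.
TOPIC_TREE = {
    "threats": ["malware", "botnets"],
    "malware": ["virus", "worm", "spyware"],
    "botnets": ["ddos", "iot botnet"],
}


def descendants(topic, tree):
    """Return all terms below `topic`, depth-first, without duplicates."""
    found = []
    for child in tree.get(topic, []):
        if child not in found:
            found.append(child)
        for term in descendants(child, tree):
            if term not in found:
                found.append(term)
    return found


def synonym_rules(tree):
    """Build 'term => term, descendants...' rules for every non-leaf topic."""
    return [
        "{} => {}".format(topic, ", ".join([topic] + descendants(topic, tree)))
        for topic in tree
    ]


for rule in synonym_rules(TOPIC_TREE):
    print(rule)
```

With this excerpt, the rule for "threats" contains everything listed under "malware" and "botnets", which mirrors why the broader term matches far more documents.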

In [1]:
import subprocess
import json

In [2]:
# Delete the test index left over from a previous run so that it can be recreated below.
query = """
curl -s -X DELETE "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3" -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'acknowledged': True}
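
Every cell in this notebook builds a similar curl command by hand. As a minimal sketch, the command string could be assembled by a small helper instead. The names `BASE_URL`, `AUTH` and `curl_command` are assumptions for illustration, and `BASE_URL` is a placeholder rather than the endpoint used in the cells.

```python
import json
import shlex

# Assumed placeholder; substitute the Elasticsearch endpoint used in the notebook.
BASE_URL = "http://localhost:9200"
AUTH = "guest:teradata"


def curl_command(method, path, body=None):
    """Return a shell-safe curl command string for an Elasticsearch request."""
    parts = ["curl", "-s", "-X", method, BASE_URL + path, "-u", AUTH]
    if body is not None:
        # Serialize the body with json.dumps and let shlex.quote handle shell quoting.
        parts += ["-H", "Content-Type: application/json", "-d", json.dumps(body)]
    return " ".join(shlex.quote(part) for part in parts)


print(curl_command("DELETE", "/pdf_documents_test3"))
print(curl_command("GET", "/pdf_documents_test3/_analyze",
                   {"analyzer": "test_analyzer", "text": "malware"}))
```

The returned string can be passed to `subprocess.getoutput` in the same way as the hand-written `query` strings.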

In [3]:
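# Create the index with two analyzers: test_analyzer0 (no synonyms) is applied at
# index time, test_analyzer (with the synonym filter) is applied at search time.
query = """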
curl -s -X PUT "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "synonym"]
        },
        "test_analyzer0": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase"]
        }
      },
      "filter": {
        "synonym" : {
          "type" : "synonym",
          "synonyms" : [
"privacy => privacy",
"accountability => accountability",
"authenticity => authenticity",
"integrity => integrity",
"trustworthiness => trustworthiness",
"confidentiality => confidentiality",
"availability => availability",
"auditability => auditability",
"non-repudiation => non-repudiation",
"general terms => general terms, privacy, accountability, authenticity, integrity, trustworthiness, confidentiality, availability, auditability, non-repudiation",

"directives => directives, network and information security directive, nis directive, nis, national law",
"ban => ban",
"regulation => regulation, general data protection regulation, gdpr, electronic identification authentication and trust services regulation, eidas",
"cybersecurity strategy => cybersecurity strategy",
"cybersecurity doctrine => cybersecurity doctrine",
"policy => policy, directives, network and information security directive, nis directive, nis, national law, ban, regulation, general data protection regulation, gdpr, electronic identification authentication and trust services regulation, eidas, cybersecurity strategy, cybersecurity doctrine",

"fake news => fake news",
"elections => elections",
"cyber conflict => cyber conflict",
"disinformation => disinformation",
"destabilization => destabilization",
"information warfare => information warfare",
"psychological warfare => psychological warfare",
"cyber war => cyber war",
"deterrence => deterrence",
"propaganda => propaganda",
"government interference => government interference",
"geopolitics => geopolitics, fake news, elections, cyber conflict, disinformation, destabilization, information warfare, psychological warfare, cyber war, deterrence, propaganda, government interference",

"nation states => nation states",
"cyber-terrorists => cyber-terrorists",
"insiders => insiders",
"cyber-criminals => cyber-criminals",
"hacktivists => hacktivists",
"script-kidies => script-kidies",
"corporations => corporations",
"threat actors => threat actors, nation states, cyber-terrorists, insiders, cyber-criminals, hacktivists, script-kidies, corporations",

"ipo => ipo",
"acquisition => acquisition",
"sme => sme",
"award => award",
"operator of essential service => operator of essential service",
"merger => merger",
"insurance => insurance",
"stock market => stock market",
"partnership => partnership",
"stock options => stock options",
"fine => fine",
"funding => funding",
"technology companies => technology companies, google, amazon, facebook, apple, twitter, linkedin",
"industry => industry",
"valuation => valuation",
"black market => black market",
"startup => startup",
"cybersecurity companies => cybersecurity companies",
"public private partnerships, ppp => public private partnerships, ppp",
"alliance => alliance",
"bankruptcy => bankruptcy",
"critical infrastructure => critical infrastructure",
"business => business, ipo, acquisition, sme, award, operator of essential service, merger, insurance, stock market, partnership, stock options, fine, funding, technology companies, google, amazon, facebook, apple, twitter, linkedin, industry, valuation, black market, startup, cybersecurity companies, public private partnerships, ppp, alliance, bankruptcy, critical infrastructure",

"cyber espionage => cyber espionage, nation state espionage, corporation espionage, financial espionage, targeted attacks, denial and deception",
"information leakage => information leakage, misconfiguration, data leaks, personal data",
"botnets => botnets, ddos, iot botnet",
"denial of service => denial of service, amplification attacks, reflection attacks",
"data breaches => data breaches, personal data, exploitation, hacking, security vulnerabilities, security incident, credential theft, data dump",
"exploit kits => exploit kits, vulnerabilities, zero-day, 0-day",
"phishing => phishing, social engineering, spear-phishing, malware, spam, data stealing",
"web application attacks => web application attacks, cross-site scripting, xss, local file inclusion, lfi, remote file inclusion, rfi, sql injection, cross-site request forgery, csrf",
"spam => spam, social engineering, malware",
"insider threat => insider threat, human error, malicious insider, cyber espionage, un-authorised access, data leak",
"identity theft => identity theft, social engineering, social media abuse, dark web shopping, confidential information, sensitive information, impersonation, credential stealing, personal information, personal data",
"vulnerabilities => vulnerabilities, zero-day, 0-day, exploitation, hardware vulnerabilities, software vulnerabilities",
"physical manipulation damage => physical manipulation damage, outage, failures, malfunction, environmental disaster, natural disaster, physical attack, damage caused by third party",
"malware => malware, advanced persistent threat, apt, virus, worm, ransomware, trojan, cryptominer, rootkit, bootkit, backdoor, spyware, scareware, addware, keylogger",
"web based attacks => web based attacks, drive-by downloads, cryptojacking, man-in-the-browser, waterholing, supply-chain attack",
"threats => threats, cyber espionage, nation state espionage, corporation espionage, financial espionage, targeted attacks, denial and deception, information leakage, misconfiguration, data leaks, personal data, botnets, ddos, iot botnet, denial of service, amplification attacks, reflection attacks, data breaches, personal data, exploitation, hacking, security vulnerabilities, security incident, credential theft, data dump, exploit kits, vulnerabilities, zero-day, 0-day, phishing, social engineering, spear-phishing, malware, spam, data stealing, web application attacks, cross-site scripting, xss, local file inclusion, lfi, remote file inclusion, rfi, sql injection, cross-site request forgery, csrf, spam, social engineering, malware, insider threat, human error, malicious insider, cyber espionage, un-authorised access, data leak, identity theft, social engineering, social media abuse, dark web shopping, confidential information, sensitive information, impersonation, credential stealing, personal information, personal data, vulnerabilities, zero-day, 0-day, exploitation, hardware vulnerabilities, software vulnerabilities, physical manipulation damage, outage, failures, malfunction, environmental disaster, natural disaster, physical attack, damage caused by third party, malware, advanced persistent threat, apt, virus, worm, ransomware, trojan, cryptominer, rootkit, bootkit, backdoor, spyware, scareware, addware, keylogger, web based attacks, drive-by downloads, cryptojacking, man-in-the-browser, waterholing, supply-chain attack",

"log management software => log management software",
"multi-factor authentication => multi-factor authentication",
"siem => siem",
"e-mail screening => e-mail screening",
"ransomware prevention => ransomware prevention",
"anti-subversion software => anti-subversion software",
"security information management => security information management",
"records management => records management",
"intrusion detection system, ids => intrusion detection system, ids",
"cryptographic software => cryptographic software",
"anti-keyloggers => anti-keyloggers",
"access control => access control",
"anti-tamper software => anti-tamper software",
"anti-malware => anti-malware",
"vpn => vpn",
"computer aided dispatch, cad => computer aided dispatch, cad",
"antivirus software => antivirus software",
"secure coding => secure coding",
"anti-spyware => anti-spyware",
"firewall => firewall",
"secure operating systems => secure operating systems",
"hardware security module => hardware security module",
"intrusion prevention system, ips => intrusion prevention system, ips",
"sandbox => sandbox",
"security by design => security by design",
"security software and hardware => security software and hardware, log management software, multi-factor authentication, siem, e-mail screening, ransomware prevention, anti-subversion software, security information management, records management, intrusion detection system, ids, cryptographic software, anti-keyloggers, access control, anti-tamper software, anti-malware, vpn, computer aided dispatch, cad, antivirus software, secure coding, anti-spyware, firewall, secure operating systems, hardware security module, intrusion prevention system, ips, sandbox, security by design",

"software => software",
"data => data",
"physical assets equipment, devices, hardware => physical assets equipment, devices, hardware",
"information => information",
"brand => brand",
"reputation => reputation",
"intellectual property, ip => intellectual property, ip",
"people => people",
"services => services",
"assets => assets, software, data, physical assets equipment, devices, hardware, information, brand, reputation, intellectual property, ip, people, services",

"big data => big data",
"machine learning => machine learning",
"artificial intelligence => artificial intelligence",
"virtual reality, vr => virtual reality, vr",
"deep learning => deep learning",
"autonomous systems => autonomous systems",
"industrial control systems, ics => industrial control systems, ics",
"iot => iot",
"nano technology => nano technology",
"quantum computing => quantum computing",
"biometrics => biometrics",
"cloud technology => cloud technology",
"blockchain => blockchain",
"augmented reality, ar => augmented reality, ar",
"smart infrastructure => smart infrastructure",
"robotics => robotics",
"software-defined networking, sdn => software-defined networking, sdn, 5g",
"emerging technology => emerging technology, big data, machine learning, artificial intelligence, virtual reality, vr, deep learning, autonomous systems, industrial control systems, ics, iot, nano technology, quantum computing, biometrics, cloud technology, blockchain, augmented reality, ar, smart infrastructure, robotics, software-defined networking, sdn, 5g"]
        }
      }
    }
  },
  "mappings": {
    "document": {
      "properties": {
        "summary": {
          "type": "text",
          "search_analyzer": "test_analyzer",
          "analyzer": "test_analyzer0" 
        }
      }
    }
  }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'acknowledged': True,
 'index': 'pdf_documents_test3',
 'shards_acknowledged': True}
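
The key design choice in the settings above is that the synonym filter is applied only to the search text, not to the indexed summaries. The sketch below rebuilds the same structure as a Python dictionary for a single rule so that this split is easier to see. The function name `index_body` is an assumption; the analyzer, filter and field names are taken from the cell above.

```python
import json


def index_body(synonym_rules):
    """Build index settings: synonyms at search time only, none at index time."""
    return {
        "settings": {"analysis": {
            "analyzer": {
                "test_analyzer": {"type": "custom", "tokenizer": "standard",
                                  "filter": ["standard", "lowercase", "synonym"]},
                "test_analyzer0": {"type": "custom", "tokenizer": "standard",
                                   "filter": ["standard", "lowercase"]},
            },
            "filter": {"synonym": {"type": "synonym", "synonyms": synonym_rules}},
        }},
        "mappings": {"document": {"properties": {"summary": {
            "type": "text",
            "analyzer": "test_analyzer0",        # index time: no synonyms
            "search_analyzer": "test_analyzer",  # search time: synonyms expanded
        }}}},
    }


body = index_body(["malware => malware, virus, worm, spyware"])
print(json.dumps(body["mappings"], indent=2))
```

Because the expansion happens on the query side, the synonym list can be changed by recreating the index settings without altering how the summary text itself is tokenized.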

In [4]:
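# Check how test_analyzer tokenizes "spyware". It has no synonym rule, so a single token is expected.
query = """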
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "test_analyzer",
  "text":     "spyware"
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'tokens': [{'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'spyware',
   'type': '<ALPHANUM>'}]}
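
The `_analyze` responses are verbose, so a short summary of the tokens can be easier to read. The following is a sketch with two hypothetical helpers, `tokens_of` and `uses_synonyms`, applied to the response shown above for "spyware".

```python
def tokens_of(analyze_response):
    """Return the token strings from an _analyze response, in order."""
    return [entry["token"] for entry in analyze_response.get("tokens", [])]


def uses_synonyms(analyze_response):
    """True if any token was produced by the synonym filter."""
    return any(entry.get("type") == "SYNONYM"
               for entry in analyze_response.get("tokens", []))


# Response copied from the "spyware" cell above.
spyware = {"tokens": [{"end_offset": 7, "position": 0, "start_offset": 0,
                       "token": "spyware", "type": "<ALPHANUM>"}]}
print(tokens_of(spyware), uses_synonyms(spyware))
```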

In [5]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "test_analyzer",
  "text":     "malware"
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'tokens': [{'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'malware',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'advanced',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'apt',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'virus',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'worm',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'ransomware',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'trojan',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'cryptominer',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'rootkit',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,


In [6]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "test_analyzer",
  "text":     "threats"
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'tokens': [{'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'threats',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'cyber',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'nation',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'corporation',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'financial',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'targeted',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'denial',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'information',
   'type': 'SYNONYM'},
  {'end_offset': 7,
   'position': 0,
   'start_offset': 0,
   'token': 'misconfiguration',
   'type': 'SYNONYM'},
  {'end_offset': 7,


In [7]:
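# Copy all documents from pdf_documents into the new index so that they are indexed under the new mapping.
query = """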
curl -s -X POST "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "pdf_documents"
  },
  "dest": {
    "index": "pdf_documents_test3"
  }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)

In [8]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3/_count" -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'count': 344}

In [9]:
# Count only the documents that are flagged as having a summary (has_summary:true).
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3/_count?q=has_summary:true" -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 5, 'total': 5},
 'count': 262}
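
The two counts above give the share of documents that a search on `summary` can match at all. This is a small worked calculation, assuming that `has_summary:true` marks exactly the documents with a summary field.

```python
total_docs = 344         # _count for the whole index
docs_with_summary = 262  # _count with q=has_summary:true

share = docs_with_summary / total_docs
print("{:.1%} of documents have a summary".format(share))
print(total_docs - docs_with_summary, "documents have no summary")
```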

In [10]:
child = "spyware"
parent = "malware"
grandparent = "threats"

In [11]:
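# Search the summary field for the term and print the summary lines that contain it literally.
# The braces around the term are stripped by the standard tokenizer and do not affect the match.
query = """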
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3/document/_search" -H 'Content-Type: application/json' -d'
{
    "_source": ["title", "summary"],
    "query": {
        "match": {
            "summary": "{""" + child + """}"
        }
    }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res, strict=False)
print(res['hits']['total'])
print()

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_id'], hit['_source']['title'])
    print('- '*40)
    for line in hit['_source']['summary'].split('\n'):
        if child in line.lower():
            print(line)
    print('-'*80)

1

3.6434639 213 Mobile Identity Management
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
 Eavesdropping and spyware: Weaknesses in GSM and 802.11x encryption make 
--------------------------------------------------------------------------------


In [12]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3/document/_search" -H 'Content-Type: application/json' -d'
{
    "_source": ["title", "summary"],
    "query": {
        "match": {
            "summary": "{""" + parent + """}"
        }
    }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res, strict=False)
print(res['hits']['total'])
print()

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_id'], hit['_source']['title'])
    print('- '*40)
    for line in hit['_source']['summary'].split('\n'):
        if parent in line.lower():
            print(line)
    print('-'*80)

77

10.55539 141 ENISA Threat Landscape 2015
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
effective complexity, achieved with next generation malware and attack vectors. Our conclusions in 2015 
 Highly efficient development of malware weaponization and automated tools to detect and exploit 
 Campaigning with highly profitable malicious infrastructures and malware to breach data and hold 
--------------------------------------------------------------------------------
8.719921 138 ENISA Threat Landscape Report 2017
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
methods. In particular: malware contains all necessary “intelligence” and functions to autonomously 
advancements in phishing practices. Malware incorporates command and control functions; this reduces 
these changes of malware, attack vectors and malicious infrastructures more thoroughly and develop 
485 https://blog.malwarebytes.com/malwarebytes-news/2017/

In [13]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pdf_documents_test3/document/_search" -H 'Content-Type: application/json' -d'
{
    "_source": ["title", "summary"],
    "query": {
        "match": {
            "summary": "{""" + grandparent + """}"
        }
    }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res, strict=False)
print(res['hits']['total'])
print()

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_id'], hit['_source']['title'])
    print('- '*40)
    for line in hit['_source']['summary'].split('\n'):
        if grandparent in line.lower():
            print(line)
    print('-'*80)

251

7.06011 14 Annual Incident Reports 2011
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
--------------------------------------------------------------------------------
6.940102 266 Auditing Security Measures
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
--------------------------------------------------------------------------------
6.856991 99 Smart Hospitals
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Threats to smart hospitals are, however, not limited to malicious actions in terms of their root cause. Human errors 
and system failures as well as third-party failures also play an important role. The risks that result from these threats 
--------------------------------------------------------------------------------
6.2878833 331 Trusted e-ID Infrastructures and services in EU
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
-----------------
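
The loops above print only the lines that contain the query term itself, so a hit that matched through a synonym can show no lines at all, as with the first two results for "threats". A minimal sketch of a variant that checks every expanded term is shown below. The function `matching_lines` and the sample summary are hypothetical, and like the loops above it uses plain substring matching rather than the analyzer's tokens.

```python
def matching_lines(summary, terms):
    """Yield (line, matched_terms) for each line containing any of `terms`."""
    for line in summary.split("\n"):
        lowered = line.lower()
        matched = [term for term in terms if term in lowered]
        if matched:
            yield line.strip(), matched


# Hypothetical summary text and a short excerpt of the "malware" synonym list.
sample_summary = ("Ransomware campaigns increased.\n"
                  "Budgets were stable.\n"
                  "A new worm spread via e-mail.")
malware_terms = ["malware", "ransomware", "worm", "spyware"]

for line, matched in matching_lines(sample_summary, malware_terms):
    print(matched, line)
```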