# Knowledge graph

This notebook attempts to extract the knowledge graph.

Automatic extraction from the vsdx file was not successful: tools such as OmniGraffle only convert it to a graphical format. I did not find a way to export the information to a structured format such as XML that would preserve the hierarchy among the topics.

Therefore, I converted the vsdx file to a PDF file and used the tika parser to extract the topics. The topics were then copied and pasted manually according to the hierarchy. This notebook converts the strings with the topics and hierarchy to the format expected for Elasticsearch synonyms.

See synonyms.ipynb for a detailed study on how to define synonyms in order to get the expected search performance.
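
For orientation, the rules built in this notebook are meant to be listed in an Elasticsearch `synonym` token filter. The following is only a minimal sketch of such index settings written as a Python dict. The filter name `cyber_synonyms`, the analyzer name `cyber_analyzer`, the helper `build_index_settings` and the two sample rules are assumptions made for illustration; they are not taken from synonyms.ipynb.

```python
import json

# Two sample rules in the "parent => parent, children" form built later on.
sample_rules = [
    "botnets => botnets, ddos, iot botnet",
    "spam => spam, social engineering, malware",
]


def build_index_settings(rules):
    """Return index settings with a synonym filter holding the given rules.

    The filter and analyzer names are hypothetical placeholders.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "cyber_synonyms": {"type": "synonym", "synonyms": list(rules)}
                },
                "analyzer": {
                    "cyber_analyzer": {
                        "tokenizer": "standard",
                        # Lowercase first, because the rules are all lower case.
                        "filter": ["lowercase", "cyber_synonyms"],
                    }
                },
            }
        }
    }


settings = build_index_settings(sample_rules)
print(json.dumps(settings, indent=2))
```

The dict could then be passed as the body when creating an index; which analyzer chain actually gives the expected search performance is the subject of synonyms.ipynb.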

In [1]:
import tika
from tika import parser

In [2]:
path = "/Users/ds186095/Desktop/Cybersecurity_taxonomy adrien_first_final.pdf"

In [3]:
#parsed = parser.from_file(path)

In [4]:
#terms = [x for x in parsed['content'].split('\n') if x]

In [5]:
#for x in terms:
#    print(x)

Manually extracted knowledge graph:

In [6]:
threats = """
Web based attacks
. Drive-by downloads
. Cryptojacking
. Man-in-the-browser
. Waterholing
. Supply-chain attack

Web application attacks
. Cross-Site Scripting, XSS
. Local File Inclusion, LFI
. Remote File Inclusion, RFI
. SQL injection
. Cross-Site Request Forgery, CSRF

Phishing
. social engineering
. spear-phishing
. malware
. spam
. data stealing

Spam
. social engineering
. malware

Denial of service
. Amplification attacks, Reflection attacks

Botnets
. DDoS
. IoT botnet

Insider threat
. Human error
. Malicious Insider
. Cyber espionage
. Un-authorised access
. Data leak

Physical manipulation damage
. Outage
. Failures, malfunction
. Environmental disaster
. Natural disaster
. Physical attack
. Damage caused by third party

Data breaches
. Personal data
. Exploitation
. Hacking
. Security vulnerabilities
. Security incident
. Credential theft
. Data dump

Identity theft
. Social engineering
. Social media abuse
. Dark web shopping
. Confidential information, sensitive information
. Impersonation
. Credential stealing
. Personal information, personal data

Information leakage
. Misconfiguration
. Data leaks
. Personal data

Exploit kits
. Vulnerabilities
. Zero-day, 0-day

Vulnerabilities
. Zero-day, 0-day
. Exploitation
. Hardware vulnerabilities
. Software vulnerabilities

Cyber espionage
. Nation state espionage
. Corporation espionage
. Financial espionage
. Targeted attacks
. Denial and deception

Malware
. Advanced Persistent Threat, APT
. Virus
. Worm
. Ransomware
. Trojan
. Cryptominer
. Rootkit
. Bootkit
. Backdoor
. Spyware
. Scareware
. Adware
. Keylogger
"""

In [7]:
threat_actors = """
Nation States
Cyber-criminals
Corporations
Hacktivists
Script-kiddies
Cyber-terrorists
Insiders
"""

In [8]:
business = """
Public Private Partnerships, PPP
Cybersecurity companies
Black market
Valuation
Stock options
Stock market
Insurance
Fine
Bankruptcy
Alliance
IPO
Partnership
Funding
Acquisition
Merger
Industry
Critical Infrastructure
Operator of essential service
SME
Startup
Award

Technology companies
. Google
. Amazon
. Facebook
. Apple
. Twitter
. LinkedIn
"""

In [9]:
assets = """
Data
Services
Physical assets equipment, devices, hardware
Software
information
People
Reputation
Intellectual property, IP
Brand
"""

In [10]:
geopolitics = """
Destabilization
fake news
Deterrence
elections
disinformation
Cyber war
Cyber conflict
Government interference
Propaganda
Psychological warfare
Information warfare
"""

In [11]:
general_terms = """
Confidentiality
Integrity
Availability
Accountability
Authenticity
Trustworthiness
Auditability
Non-repudiation
Privacy
"""

In [12]:
policy = """
Ban
Cybersecurity strategy

Regulation
. General Data Protection Regulation, GDPR
. Electronic Identification Authentication and Trust Services Regulation, EIDAS

Directives
. Network and Information Security Directive, NIS Directive, NIS
. National law

Cybersecurity doctrine
"""

In [13]:
security_software_hardware = """
Hardware Security Module
Secure coding
Security by design
Access control
Anti-keyloggers
Anti-malware
Anti-spyware
Anti-subversion software
Anti-tamper software
Antivirus software
Cryptographic software
Computer Aided Dispatch, CAD
E-mail Screening
Firewall
Intrusion detection system, IDS
Intrusion prevention system, IPS
Log management software
Ransomware prevention
Records Management
Sandbox
Security information management
SIEM
VPN
Multi-factor authentication
Secure operating systems
"""

In [14]:
emerging_technology = """
Biometrics
Cloud technology
Deep learning
Machine learning
Virtual Reality, VR
Augmented Reality, AR
Nano technology
Robotics

Software-Defined Networking, SDN
. 5G

IoT
Autonomous systems
Artificial Intelligence
smart infrastructure
big data
Quantum computing
Blockchain
Industrial Control Systems, ICS
"""

In [15]:
cyber_security = dict()
cyber_security["threats"] = threats
cyber_security["threat actors"] = threat_actors
cyber_security["business"] = business
cyber_security["assets"] = assets
cyber_security["geopolitics"] = geopolitics
cyber_security["general terms"] = general_terms
cyber_security["policy"] = policy
cyber_security["security software and hardware"] = security_software_hardware
cyber_security["emerging technology"] = emerging_technology

Functions to process these topic strings into synonym rules in the format expected by Elasticsearch:

In [16]:
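def get_dict_per_topic(topic_string):
    """Turn the topic string into a dictionary mapping each parent to its children."""

    d = dict()
    for line in topic_string.split('\n'):
        # Skip the blank lines that separate the groups.
        if line.strip() == "":
            continue
        if line[0] == '.':
            # Child line of the form ". name": drop the ". " prefix.
            d[topic].append(line[2:].lower())
        else:
            # Any other line starts a new parent topic.
            topic = line.lower()
            d[topic] = list()
    return d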
        
threats_dict = get_dict_per_topic(threats)
threats_dict

{'botnets': ['ddos', 'iot botnet'],
 'cyber espionage': ['nation state espionage',
  'corporation espionage',
  'financial espionage',
  'targeted attacks',
  'denial and deception'],
 'data breaches': ['personal data',
  'exploitation',
  'hacking',
  'security vulnerabilities',
  'security incident',
  'credential theft',
  'data dump'],
 'denial of service': ['amplification attacks, reflection attacks'],
 'exploit kits': ['vulnerabilities', 'zero-day, 0-day'],
 'identity theft': ['social engineering',
  'social media abuse',
  'dark web shopping',
  'confidential information, sensitive information',
  'impersonation',
  'credential stealing',
  'personal information, personal data'],
 'information leakage': ['misconfiguration', 'data leaks', 'personal data'],
 'insider threat': ['human error',
  'malicious insider',
  'cyber espionage',
  'un-authorised access',
  'data leak'],
 'malware': ['advanced persistent threat, apt',
  'virus',
  'worm',
  'ransomware',
  'trojan',
  'crypto
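
The parsing convention used above (a plain line is a parent, a line starting with `. ` is a child of the most recent parent) can be checked on a small input. The following is a sketch of a slightly more defensive variant, not the function used in this notebook; the name `parse_topics` and the sample string are assumptions made for illustration.

```python
def parse_topics(topic_string):
    """Map each parent line to the list of its '. '-prefixed children."""
    topics = {}
    parent = None
    for raw in topic_string.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator line
        if line.startswith("."):
            if parent is None:
                raise ValueError("child line before any parent: " + repr(raw))
            # Drop the leading dot and the whitespace that follows it.
            topics[parent].append(line[1:].strip().lower())
        else:
            parent = line.lower()
            topics[parent] = []
    return topics


sample = """
Botnets
. DDoS
. IoT botnet

Spam
"""
result = parse_topics(sample)
print(result)
# {'botnets': ['ddos', 'iot botnet'], 'spam': []}
```

Unlike `get_dict_per_topic`, this variant raises a `ValueError` instead of a `NameError` when a child line appears before any parent.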

In [17]:
def get_synonyms_per_topic_parent(topic_dict):
    """Turn the dictionary with offspring into a mapping of parent => parent, children.
    
    In case there are no children, parent => parent will be kept.
    This doesn't define any synonyms but it is important for
    get_synonyms_per_topic_grandparent to do the
    grandparent => grandparent, parents mapping correctly.
    """
    
    synonyms = list()

    for k, v in topic_dict.items():
        s = '"' + k + ' => ' + k
        if len(v):
            s += ', ' + ', '.join(v)
        s += '",'
        synonyms.append(s)
        print(s)
    
    return synonyms

synonyms = get_synonyms_per_topic_parent(threats_dict)

"cyber espionage => cyber espionage, nation state espionage, corporation espionage, financial espionage, targeted attacks, denial and deception",
"information leakage => information leakage, misconfiguration, data leaks, personal data",
"botnets => botnets, ddos, iot botnet",
"denial of service => denial of service, amplification attacks, reflection attacks",
"data breaches => data breaches, personal data, exploitation, hacking, security vulnerabilities, security incident, credential theft, data dump",
"exploit kits => exploit kits, vulnerabilities, zero-day, 0-day",
"phishing => phishing, social engineering, spear-phishing, malware, spam, data stealing",
"web application attacks => web application attacks, cross-site scripting, xss, local file inclusion, lfi, remote file inclusion, rfi, sql injection, cross-site request forgery, csrf",
"spam => spam, social engineering, malware",
"insider threat => insider threat, human error, malicious insider, cyber espionage, un-authorised access, 

In [18]:
def get_synonyms_per_topic_grandparent(grandparent, topic_dict):
    """Turn the dictionary with offspring into a mapping of grandparent => grandparent, parents and
    parent => parent, children (for each parent).
    """
    
    # Start from the per-parent rules, then append the grandparent rule below.
    synonyms = get_synonyms_per_topic_parent(topic_dict)
    
    s = '"{} => {}, '.format(grandparent, grandparent)
    for k, v in topic_dict.items():
        s += k + ', '
        if len(v):
            s += ', '.join(v) + ', '
    s = s[:-2] + '",'

    synonyms.append(s)
    print(s)
    print()
    
    return synonyms

synonyms = get_synonyms_per_topic_grandparent("threats", threats_dict)

"cyber espionage => cyber espionage, nation state espionage, corporation espionage, financial espionage, targeted attacks, denial and deception",
"information leakage => information leakage, misconfiguration, data leaks, personal data",
"botnets => botnets, ddos, iot botnet",
"denial of service => denial of service, amplification attacks, reflection attacks",
"data breaches => data breaches, personal data, exploitation, hacking, security vulnerabilities, security incident, credential theft, data dump",
"exploit kits => exploit kits, vulnerabilities, zero-day, 0-day",
"phishing => phishing, social engineering, spear-phishing, malware, spam, data stealing",
"web application attacks => web application attacks, cross-site scripting, xss, local file inclusion, lfi, remote file inclusion, rfi, sql injection, cross-site request forgery, csrf",
"spam => spam, social engineering, malware",
"insider threat => insider threat, human error, malicious insider, cyber espionage, un-authorised access, 

In [19]:
for line in synonyms:
    print(line)

"cyber espionage => cyber espionage, nation state espionage, corporation espionage, financial espionage, targeted attacks, denial and deception",
"information leakage => information leakage, misconfiguration, data leaks, personal data",
"botnets => botnets, ddos, iot botnet",
"denial of service => denial of service, amplification attacks, reflection attacks",
"data breaches => data breaches, personal data, exploitation, hacking, security vulnerabilities, security incident, credential theft, data dump",
"exploit kits => exploit kits, vulnerabilities, zero-day, 0-day",
"phishing => phishing, social engineering, spear-phishing, malware, spam, data stealing",
"web application attacks => web application attacks, cross-site scripting, xss, local file inclusion, lfi, remote file inclusion, rfi, sql injection, cross-site request forgery, csrf",
"spam => spam, social engineering, malware",
"insider threat => insider threat, human error, malicious insider, cyber espionage, un-authorised access, 
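
The two mappings above can also be built without printing and without the surrounding quotes and trailing commas, which is convenient when the rules are passed on programmatically. This is a sketch under that assumption; the names `parent_rules` and `grandparent_rule` and the sample dictionary are hypothetical.

```python
def parent_rules(topic_dict):
    """Return one 'parent => parent, child, ...' rule per parent."""
    return [
        "{} => {}".format(parent, ", ".join([parent] + children))
        for parent, children in topic_dict.items()
    ]


def grandparent_rule(grandparent, topic_dict):
    """Return 'grandparent => grandparent, parents and all their children'."""
    terms = [grandparent]
    for parent, children in topic_dict.items():
        terms.append(parent)
        terms.extend(children)
    return "{} => {}".format(grandparent, ", ".join(terms))


sample = {"botnets": ["ddos", "iot botnet"], "spam": []}
rules = parent_rules(sample) + [grandparent_rule("threats", sample)]
for rule in rules:
    print(rule)
# botnets => botnets, ddos, iot botnet
# spam => spam
# threats => threats, botnets, ddos, iot botnet, spam
```

Joining a list of terms avoids the need to slice off a trailing `', '` as done in `get_synonyms_per_topic_grandparent`.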

Get the complete list of synonyms:

In [20]:
for k, v in cyber_security.items():
    d = get_dict_per_topic(v)
    synonyms = get_synonyms_per_topic_grandparent(k, d)
    #for line in synonyms:
    #    print(line)

"privacy => privacy",
"accountability => accountability",
"authenticity => authenticity",
"integrity => integrity",
"trustworthiness => trustworthiness",
"confidentiality => confidentiality",
"availability => availability",
"auditability => auditability",
"non-repudiation => non-repudiation",
"general terms => general terms, privacy, accountability, authenticity, integrity, trustworthiness, confidentiality, availability, auditability, non-repudiation",

"directives => directives, network and information security directive, nis directive, nis, national law",
"ban => ban",
"regulation => regulation, general data protection regulation, gdpr, electronic identification authentication and trust services regulation, eidas",
"cybersecurity strategy => cybersecurity strategy",
"cybersecurity doctrine => cybersecurity doctrine",
"policy => policy, directives, network and information security directive, nis directive, nis, national law, ban, regulation, general data protection regulation, gdpr, e

In [21]:
def extract_synonyms(s):
    """Print a rule that maps a comma-separated group of aliases onto itself."""
    if ',' in s:
        print('"{} => {}",'.format(s, s))
        
for grandparent, offsprings in cyber_security.items():
    topic_dict = get_dict_per_topic(offsprings)
    for parent, children in topic_dict.items():
        extract_synonyms(parent)
        for child in children:
            extract_synonyms(child)

"network and information security directive, nis directive, nis => network and information security directive, nis directive, nis",
"general data protection regulation, gdpr => general data protection regulation, gdpr",
"electronic identification authentication and trust services regulation, eidas => electronic identification authentication and trust services regulation, eidas",
"public private partnerships, ppp => public private partnerships, ppp",
"amplification attacks, reflection attacks => amplification attacks, reflection attacks",
"zero-day, 0-day => zero-day, 0-day",
"cross-site scripting, xss => cross-site scripting, xss",
"local file inclusion, lfi => local file inclusion, lfi",
"remote file inclusion, rfi => remote file inclusion, rfi",
"cross-site request forgery, csrf => cross-site request forgery, csrf",
"confidential information, sensitive information => confidential information, sensitive information",
"personal information, personal data => personal information, person
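
Entries that contain a comma list several names for the same concept. As far as I know, a comma-separated list without `=>` declares the terms as equivalent in the Solr synonym format, which with the default `expand` setting should behave like the self-mapping printed above; this is worth checking against synonyms.ipynb. The following sketch collects each alias group only once. The name `alias_groups` and the sample dictionary are assumptions made for illustration.

```python
def alias_groups(topic_dict):
    """Collect entries that list several names for one concept, without repeats."""
    seen = set()
    groups = []
    for parent, children in topic_dict.items():
        for entry in [parent] + children:
            if "," in entry and entry not in seen:
                seen.add(entry)
                groups.append([alias.strip() for alias in entry.split(",")])
    return groups


sample = {
    "exploit kits": ["vulnerabilities", "zero-day, 0-day"],
    "vulnerabilities": ["zero-day, 0-day", "exploitation"],
}
groups = alias_groups(sample)
equivalence_rules = [", ".join(group) for group in groups]
print(equivalence_rules)
# ['zero-day, 0-day']
```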