## Extending NER with EntityRuler

**This notebook shows how to add a rule-based stage to a spaCy pipeline to handle some cybersecurity-relevant entity types.**

 * spaCy's EntityRuler lets us find and type entities based on rules
 * The rules can be based on a simple enumeration of possible strings or phrases, or on a regular expression
 * This can be used in addition to patterns learned through training on annotated data
 * One issue with this approach is that some categories are ambiguous. For MALWARE_NAME, a mention like "bad bunny" may or may not be malware (e.g., "Peter was a bad bunny because he kept getting into Mr. McGregor's garden"). If we are mostly processing text from cybersecurity-related sources, this may not be a big problem.


In [None]:
import spacy
from spacy import displacy

### Load one of spaCy's language models. This is a medium-sized one for English

In [2]:
nlp = spacy.load("en_core_web_md")

### Add an EntityRuler to spaCy's pipeline
We need to put it before NER in the pipeline or, if we put it after, allow it to overwrite labels found by the NER step. When the ruler runs first, the statistical NER keeps the ruler's entity spans and predicts around them. I'm not sure which is the better choice.

In [3]:
ruler = nlp.add_pipe("entity_ruler", before="ner")
# Alternative: add the ruler after NER and let it overwrite NER's labels
# ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})

### Add rules to recognize an email address as an EMAIL and a URL as a URL

In [4]:
patterns = [{"label": "EMAIL", "pattern": [{'LIKE_EMAIL':True}]},
            {"label": "URL", "pattern": [{'LIKE_URL':True}]}]         
ruler.add_patterns(patterns)

### Add rules for known malware names as MALWARE_NAME
 * We can collect names easily from Wikidata and other sources. One problem is that new malware appears all the time, and new documents are unlikely to mention malware from 10 years ago. Maybe we can use these lists both to recognize malware directly and to train the NER system to recognize these names and others.

In [5]:
pattern = [{"label": "MALWARE_NAME", "pattern": [{'LOWER':"wannacry"}]}]
ruler.add_patterns(pattern)
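
A longer list of names can be turned into patterns programmatically. The following is a minimal sketch, not the notebook's own method: `names_to_patterns` is a hypothetical helper, the three names are an illustrative sample, and the whitespace split is an assumption that only matches spaCy's tokenization for names without internal punctuation.

```python
def names_to_patterns(names, label):
    """Turn plain-text names into case-insensitive token patterns.

    Hypothetical helper: each name is split on whitespace (an assumption
    about tokenization) and every piece is matched on its LOWER form.
    """
    patterns = []
    for name in names:
        tokens = name.split()
        if not tokens:
            continue  # skip empty or whitespace-only entries
        patterns.append({
            "label": label,
            "pattern": [{"LOWER": tok.lower()} for tok in tokens],
        })
    return patterns


# Illustrative sample; a real list might come from Wikidata or a threat feed.
malware_names = ["WannaCry", "Bad Rabbit", "NotPetya"]
malware_patterns = names_to_patterns(malware_names, "MALWARE_NAME")
print(malware_patterns[1])

# The result has the same shape as the hand-written pattern above, so it
# could be passed to the ruler: ruler.add_patterns(malware_patterns)
```

Matching on `LOWER` keeps the rule case-insensitive, which helps with inconsistent capitalization but also increases the chance of false positives for names that are ordinary words.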

### Add rules for known malware subclasses/types

In [78]:
pattern = [{"label": "MALWARE_TYPE", "pattern": [{'LOWER':"ransomware"}]}]
ruler.add_patterns(pattern)

### Add a rule to match a valid IP address or a domain name

In [6]:
octet_rx = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
ip_rx= [ {"TEXT": {"REGEX": r"^{0}(?:\.{0}){{3}}$".format(octet_rx)}}]
ruler.add_patterns([{"label":"IP_ADDRESS", "pattern":ip_rx}])

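# Quantifiers must not contain spaces: Python's re reads {1, 63} as literal text.
# The regex is wrapped in a token pattern; a bare string would be matched as a literal phrase.
domain_rx = [{"TEXT": {"REGEX": r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}$"}}]
ruler.add_patterns([{"label":"DOMAIN_NAME", "pattern":domain_rx}])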
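
Before attaching regular expressions to the ruler, it can help to check them on a few strings with Python's `re` module alone. This is a small sketch under that assumption; the sample strings are made up for illustration, and the domain quantifiers are written without spaces because `re` treats `{1, 63}` as literal characters rather than a repetition count.

```python
import re

octet_rx = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
ip_re = re.compile(r"^{0}(?:\.{0}){{3}}$".format(octet_rx))
domain_re = re.compile(r"^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,6}$")

# Illustrative inputs: a valid IP, an out-of-range IP, a valid domain,
# a label with leading/trailing hyphens, and a name with no dot.
samples = ["71.244.148.58", "256.1.1.1", "umbc.edu", "-bad-.com", "localhost"]

results = {s: (bool(ip_re.match(s)), bool(domain_re.match(s))) for s in samples}
for s, (is_ip, is_domain) in results.items():
    print(f"{s:15} ip={is_ip} domain={is_domain}")
```

Only the first sample should be accepted as an IP address and only `umbc.edu` as a domain name. Note that these checks apply to a whole string, so inside spaCy they only behave the same way when the tokenizer keeps the address or domain as a single token.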

### Add a rule for an MD5 hash value
 * we can extend the pattern to cover SHA-1, SHA-2, etc. They are just longer.

In [7]:
md5_rx = [{"TEXT": {"REGEX": r"^[0-9a-fA-F]{32}$"}}]
ruler.add_patterns([{"label":"HASH", "pattern":md5_rx}])
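
One way to extend the rule to longer digests is to generate one pattern per hex length. The sketch below assumes three common lengths (32 for MD5, 40 for SHA-1, 64 for SHA-256); `HASH_LENGTHS` and `hash_kind` are hypothetical names, and a hex string of a given length is only a guess at the algorithm, not proof of it.

```python
import re

# Assumed mapping from hex-digit count to the most likely digest type.
HASH_LENGTHS = {32: "MD5", 40: "SHA-1", 64: "SHA-256"}
hex_re = re.compile(r"^[0-9a-fA-F]+$")


def hash_kind(token):
    """Return the likely digest name for a hex token, or None."""
    if hex_re.match(token):
        return HASH_LENGTHS.get(len(token))
    return None


# One token pattern per digest length, in the same shape as md5_rx above.
hash_patterns = [
    {"label": "HASH", "pattern": [{"TEXT": {"REGEX": r"^[0-9a-fA-F]{%d}$" % n}}]}
    for n in HASH_LENGTHS
]

print(hash_kind("327b6f07435811239bc47e1544353273"))
# The patterns could then be registered with: ruler.add_patterns(hash_patterns)
```

Keeping a single `HASH` label matches the notebook's choice; separate labels per algorithm would be an easy variation if downstream code needs the distinction.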

### Add a rule for a network port
 * we'll require the word "port" preceding a number between 0 and 65535
 * we'll approximate the number as requiring a string of 1-5 digits

In [8]:
port_number_rx = {"TEXT": {"REGEX": r"^\d{1,5}$"}}
ruler.add_patterns([{"label":"PORT", "pattern": [{'LOWER':'port'}, port_number_rx ]}])
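
If the 1-5 digit approximation turns out to be too loose (it accepts 99999, for example), the range can be encoded in the regex itself. The following is a sketch of one such expression, checked here with plain `re`; `strict_port_rx` and `is_port` are hypothetical names, and the expression deliberately rejects five-digit strings with leading zeros.

```python
import re

# Alternatives cover 65530-65535, 65500-65529, 65000-65499,
# 60000-64999, 10000-59999, and any number of 1 to 4 digits.
strict_port_rx = (
    r"^(?:6553[0-5]|655[0-2]\d|65[0-4]\d{2}|6[0-4]\d{3}|[1-5]\d{4}|\d{1,4})$"
)
strict_port_re = re.compile(strict_port_rx)


def is_port(text):
    """True if text is a decimal port number in the range 0-65535."""
    return bool(strict_port_re.match(text))


for candidate in ["0", "8080", "65535", "65536", "99999"]:
    print(candidate, is_port(candidate))

# The same expression could replace the approximate one in a token pattern:
# {"TEXT": {"REGEX": strict_port_rx}}
```

The stricter form is harder to read, so the simpler approximation may still be the better trade-off when the preceding word "port" already filters out most false matches.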

### Add a rule for CVE vulnerabilities
 * the simple regex r"cve-\d{4}-\d{4,7}" won't work because spaCy's tokenizer splits the identifier into several tokens

In [23]:
# Anchors keep each regex from matching a longer token that merely contains the pattern
cve_pat = [{"LOWER": {"REGEX": r"^cve-\d{4}$"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": r"^\d{4,7}$"}}]
ruler.add_patterns([{"label":"VULNERABILITY", "pattern": cve_pat}])
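
The single-regex form does work on raw text, where tokenization is not involved. As a sketch of that alternative (not what this notebook does), the code below finds CVE identifiers and their character offsets with plain `re`; `find_cves` is a hypothetical helper and the sample sentence is made up. Offsets like these could in principle be mapped back onto a spaCy document with `Doc.char_span`, though that wiring is not shown here.

```python
import re

# Word boundaries stop the match from starting or ending inside a longer token.
cve_re = re.compile(r"\bCVE-\d{4}-\d{4,7}\b", re.IGNORECASE)


def find_cves(text):
    """Return (start, end, matched_text) for each CVE identifier in text."""
    return [(m.start(), m.end(), m.group()) for m in cve_re.finditer(text)]


sample = "It exploited CVE-2017-0144. See also cve-2017-0144 in lower case."
cve_spans = find_cves(sample)
for start, end, matched in cve_spans:
    print(start, end, matched)
```

The token-based pattern above has the advantage of staying inside the EntityRuler, so it benefits from the ruler's handling of overlapping matches.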

### Add rules for threat actors with names from Wikidata
  * this entity rule data was generated by 

In [10]:
ruler.add_patterns(
 [
  {'label':'THREAT_ACTOR', 'pattern': [{'TEXT':'Lazarus'},{'TEXT':'Group'}]},
  {'label':'THREAT_ACTOR', 'pattern': [{'TEXT':'Equation'},{'TEXT':'Group'}]},
  {'label':'THREAT_ACTOR', 'pattern': [{'TEXT':'Cozy'},{'TEXT':'Bear'}]}
 ])

### Input some text and run it through the spaCy pipeline

In [86]:
text = "The UMBC website is http://umbc.edu/ and its email address is info@umbc.edu. \
It was taken offline by the WannaCry ransomware which exploited CVE-2017-0144. \
The attack from Cozy Bear came from 71.244.148.58 via port 8080. \
The file hash was 327b6f07435811239bc47e1544353273."

doc = nlp(text)
print("done!")

done!


### Display the text, marking its entities and their types. The default types are the 18 types from [OntoNotes](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf)

In [87]:
# Keys are written in upper case to match the labels used in the rules above
colors = {"URL": "#85C1E9", "EMAIL": "red", "MALWARE_NAME": "orange", "VULNERABILITY": "#CAFF70",
          "IP_ADDRESS": "#EE82EE", "THREAT_ACTOR": "#FFB90F", "PORT": "#E0FFFF", "HASH": "#FFFF00",
          "MALWARE_TYPE": "#3CB371"}

In [88]:
displacy.render(doc, style="ent", options={"colors": colors})

### *fin*