Skip to content

anuraagkansara/SecureNLP

Repository files navigation

Secure NLP

  1. SubTask-1 [Subtask1_Bert.ipynb, Subtask1_BiLSTM.ipynb & preProcess.ipynb(preprocessing data)]
  2. SubTask-2 [T2_BERT.ipynb & T2_NER_bilstm_CRf.ipynb] are in "code" folder.

EVALUATION DATA Available in "SemEval_eval_input" folder for all 4 SubTasks

MALWARETEXTDBV2.0 DATASET

This folder contains the datasets that constitute MalwareTextDB-V2.0. The dataset is used in SemEval-2018 Task 8: Semantic Extraction from CybersecUrity REports using Natural Language Processing (SecureNLP)

It contains 5 subfolders:

  • train: contains the training materials released to the participants
  • dev: contains the development data used in the Practice phase
  • test_1: contains the gold data for SubTask 1 and 2
  • test_2: contains the gold data for SubTask 3
  • test_3: contains the gold data for SubTask 4

The following are the explanation of content inside each of the subfolder

plaintext/ (train only)

  • contains 65 plaintext files after the APT PDF reports are processed with PDFMiner

tokenized/

  • Contains the tokenized reports with the annotated labels in IOB format

  • 3 different types of labels are used: Entity, Action, Modifier

  • If a token is not fallen under any label type, it is labeled as O(means the outside of the labels)

  • If a token is the first word of a label, then it is labeled as B-<Label_Type> (means the beginning of a label)

  • If a token is the subsequent word of the text span of a label, then it is labeled as I-<Label_Type> (means the inside of a label)

  • Example, to O direct B-Action site B-Entity visitors I-Entity

  • For more details about the IOB format, please refer http://www.nltk.org/book/ch07.html section 2.6

annotations/

  • contains the plaintext files with XML tags denoting nonsentence sections such as headings and covers

  • contains the annotations files (.ann) for each plaintext file; the positions of the annotations are based on character counts

  • In .ann files, 3 different annotation ID types are used : T(text-bound annotation), R(relation), A(attribute)

  • Different attribute labels are ActionName, Capability, StrategicObjectives, TacticalObjectives

  • Example,

Sentence : The dynamic analysis showed the malware sample contacted the C&C server, but wasn't sending any URL parameters (id1, id2).

Annotations : T34 Action 47 56 contacted T28 Subject 28 46 the malware sample R23 SubjAction Subject:T28 Action:T34 T30 Object 57 71 the C&C server R24 ActionObj Action:T34 Object:T30

additional_plaintext/ (train only)

  • contains additional 84 plaintext files after the APT PDF reports are processed with PDFMiner
  • If the participants need to generate special features such brown clusters, these textfiles can be used
  • Tokenized versions of this set of files are not provided in the tokenized folder

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published