Please Help: Regarding fine tuning #216

Open
abhishekdhiman25 opened this issue Mar 14, 2024 · 1 comment

@abhishekdhiman25

Hi Reader,

I hope you are well. I am trying to understand the fine-tuning part of the "fine_tune_multi_label.ipynb" notebook.
A few questions:
Q 1. - What is the order of the 50 ATT&CK labels defined in the CLASSES variable?
Q 2. - Why is it recommended not to change the code in that particular cell?
Q 3. - If somebody wants to change the classes to fine-tune the model on other ATT&CK labels, what is the correct way to do so, and in what order should the labels be placed?
Q 4. - If somebody wants to increase the number of classes, what is the correct approach?

Thanks in advance for your support.

For reference, CLASSES:
CLASSES = [
'T1003.001', 'T1005', 'T1012', 'T1016', 'T1021.001', 'T1027',
'T1033', 'T1036.005', 'T1041', 'T1047', 'T1053.005', 'T1055',
'T1056.001', 'T1057', 'T1059.003', 'T1068', 'T1070.004',
'T1071.001', 'T1072', 'T1074.001', 'T1078', 'T1082', 'T1083',
'T1090', 'T1095', 'T1105', 'T1106', 'T1110', 'T1112', 'T1113',
'T1140', 'T1190', 'T1204.002', 'T1210', 'T1218.011', 'T1219',
'T1484.001', 'T1518.001', 'T1543.003', 'T1547.001', 'T1548.002',
'T1552.001', 'T1557.001', 'T1562.001', 'T1564.001', 'T1566.001',
'T1569.002', 'T1570', 'T1573.001', 'T1574.002'
]

@abhishekdhiman25 abhishekdhiman25 changed the title Doubt: Regarding order of 50 Classes used for fine tuning Please Help: Regarding fine tuning Mar 15, 2024
@mehaase
Contributor

mehaase commented Mar 18, 2024

Hi @abhishekdhiman25,

Q1 - They are in lexical order, but that choice is somewhat arbitrary. The order of the classes determines how the labels are vectorized, i.e. turned from strings like "T1003.001" into fixed-length binary vectors with one position per class. E.g. the vector [1, 0, 0, 0, 0, ...] means that the associated technique is the first item in CLASSES: T1003.001.
Q2 - The notebook says not to modify that cell because we have already fine-tuned SciBERT using that vectorization scheme. This notebook is intended for continuing to fine-tune with additional training data for the same set of labels. If you change the order of the labels, then additional fine-tuning will be counter-productive, because the model has to relearn what each position in the label vector represents.
Q3 - If you want to fine-tune SciBERT using different labels, you should look at the model-development/train_multi_label.ipynb notebook. That notebook illustrates how to start with an upstream SciBERT checkpoint and fine-tune it on the training data in data/tram2-data/multi_label.json.
Q4 - Same as for Q3. You'll also want to set up the MITRE Annotation Toolkit for labeling your additional training data. See: https://github.com/center-for-threat-informed-defense/tram/wiki/Data-Annotation
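
To make the vectorization from Q1 and Q2 concrete, here is a minimal sketch in plain Python. It is not the notebook's actual code: the helper names `encode` and `decode`, the shortened four-item class list, and the `samples` records are all assumptions made up for illustration. It shows how a label list becomes a position-based vector, why the same vector means something different under another class order, and one way to fix a deterministic order when starting over with a new label set.

```python
# Illustrative sketch only; the helpers and data below are hypothetical,
# not taken from the TRAM notebooks.

# Shortened stand-in for the 50-item CLASSES list, in lexical order.
CLASSES = ['T1003.001', 'T1005', 'T1012', 'T1016']


def encode(labels, classes):
    """Return a binary vector with a 1 at each position whose class is in labels."""
    present = set(labels)
    unknown = present - set(classes)
    if unknown:
        raise ValueError(f"labels not in classes: {sorted(unknown)}")
    return [1 if c in present else 0 for c in classes]


def decode(vector, classes):
    """Map a binary vector back to technique IDs, given a class order."""
    return [c for c, bit in zip(classes, vector) if bit]


# One label: only the first position is set.
vec = encode(['T1003.001'], CLASSES)

# Two labels on the same sentence: two positions are set.
both = encode(['T1016', 'T1005'], CLASSES)

# Decoding with the original order recovers the label...
original_reading = decode(vec, CLASSES)

# ...but decoding the same vector with a different order does not.
# This is why reordering CLASSES conflicts with an already fine-tuned model.
reordered = ['T1016', 'T1012', 'T1005', 'T1003.001']
misread = decode(vec, reordered)

# When training from an upstream checkpoint on a new label set, any order
# works as long as it stays fixed. Sorting the distinct labels found in the
# training data is one simple way to get a reproducible order.
samples = [  # hypothetical records; the real data format may differ
    {"text": "example sentence A", "labels": ["T1059.001", "T1005"]},
    {"text": "example sentence B", "labels": ["T1005"]},
    {"text": "example sentence C", "labels": ["T1003.001"]},
]
new_classes = sorted({label for s in samples for label in s["labels"]})
```

Whatever order you pick for a new label set, the same list has to be used for training, for any later fine-tuning, and for decoding predictions.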

I hope this helps!
