The full Technical Report can be found directly in the repo under the name Scrutinizer_TR.pdf.
Data have been removed from the repo due to sensitivity reasons. Please contact us for more information.
Tested with Python 3.5.6 and Python 3.7.1
- `src/tokenizer/tokenizer_driver.py`
  - Tokenizes input data using `spacy`.
- `src/featurizer/featurizer_extractor.py`
  - Can be either `tfidf` or `word-embeddings`.
  - Each object contains the features as an instance variable.
- `src/featurizer/sentence_embedding.py`
  - Implements a `scikit-learn` transformer that gives us the word (GloVe) embeddings using `spacy`. I use average pooling here.
- `src/classifier/classifier_linear_svm.py`
  - Contains the Linear `SVM` model along with the sigmoid on top of it, which gives us calibrated probabilities for the `topn` predictions.
  - The instance variable `cv` determines the number of cross-validation folds. Hence, the input dataset needs to have at least `cv` samples per class; otherwise it will not work.
  - The default `cv` is 3, but one could try larger values (4, 5) and see experimentally which gives better results.
- `src/parser/dataset_parser.py`
  - Contains the logic that creates the dataset for `template` and `row_index` predictions.
  - Look at `src/templates/template_transformer.py` for how the templates are created from the original Excel formulas.
  - Also has logic for combining features of sentences and claims.
  - This class is a little "messy", which could be fixed later.
- `src/templates/template_transformer.py`
  - Logic for creating templates from Excel formulas.
  - Uses various regexes (look at `src/regex/regex.py`) to filter (to some extent) Excel formulas.
  - Uses the `pandas` `apply` function to create a series of transformations for each row of the `DataFrame`.
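Two of the steps mentioned above, average pooling of word vectors and picking the `topn` predictions from calibrated probabilities, can be illustrated with a minimal sketch. This is not the code from `src/`: it uses only the standard library instead of `spacy` and `scikit-learn`, and the function names `average_pool` and `top_n`, the toy vectors, and the label names are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def average_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average token vectors element-wise into a single sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def top_n(probabilities: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n labels with the highest probability, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Toy 3-dimensional "word vectors" (hypothetical; real GloVe vectors are much larger).
sentence_vector = average_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])

# Made-up calibrated probabilities for three hypothetical template labels.
best_two = top_n({"template_a": 0.2, "template_b": 0.7, "template_c": 0.1}, n=2)
```

In the actual pipeline the token vectors would come from the `spacy` model and the probabilities from the calibrated classifier; the sketch only shows the arithmetic of the two steps.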
The experiments currently supported are `src/experiments/exp_only_row_idx.py` and `src/experiments/exp_only_templates.py`. Read the code in both to see how I use the above classes for tokenization, featurization, and classification. The code is not very clean, and I expect to improve it as I go on.
You need to have:
(1) The data specified at the top of each experiment Python file, i.e. the variable `DATA_PATH` needs to point to an existing file.
(2) A CSV file in the same format as the current one in this repo (`data/main_annotated_dataset_12-16-2019.csv`).
(3) All the requirements from `requirements.txt`.
Before you can run, do the following:
- Create a virtual environment with Python version `3.5.6` (although probably anything above that will do)
- `python -m spacy download en_core_web_md`
- `pip install -r requirements.txt`
From the root directory of the repo, run:
python -m src.experiments.exp_only_templates --num_runs 1 --cv 3 --min_samples_per_label 20 --topn 3
Where:
- `num_runs`: The number of times the task is run. The final accuracy numbers are the average of the accuracies of each run.
- `cv`: Number of cross-validation folds used for classification.
- `min_samples_per_label`: Minimum number of samples we keep for each label. Note that `min_samples_per_label` >= `cv`.
- `topn`: Number of top predictions to return.
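The interaction between `min_samples_per_label`, `cv`, and `num_runs` can be sketched roughly as follows. This is a simplified illustration using only the standard library, not the experiment code itself: the helper names `filter_rare_labels` and `mean_accuracy`, the toy samples, and the choice to raise an error when `min_samples_per_label` is smaller than `cv` are assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def filter_rare_labels(
    samples: Sequence[Tuple[str, str]], min_samples_per_label: int, cv: int
) -> List[Tuple[str, str]]:
    """Keep only (text, label) pairs whose label occurs often enough."""
    # Each cross-validation fold needs at least one sample of every class,
    # so keeping labels with fewer than cv samples would break the split.
    if min_samples_per_label < cv:
        raise ValueError("min_samples_per_label must be >= cv")
    counts = Counter(label for _, label in samples)
    return [
        (text, label)
        for text, label in samples
        if counts[label] >= min_samples_per_label
    ]


def mean_accuracy(run_accuracies: Sequence[float]) -> float:
    """Average the accuracy over all runs (one value per run)."""
    if not run_accuracies:
        raise ValueError("need at least one run")
    return sum(run_accuracies) / len(run_accuracies)


# Hypothetical toy data: label "a" has three samples, label "b" only one.
toy_samples = [("s1", "a"), ("s2", "a"), ("s3", "a"), ("s4", "b")]
kept = filter_rare_labels(toy_samples, min_samples_per_label=3, cv=3)

# Made-up accuracies of two runs, averaged as with --num_runs 2.
average = mean_accuracy([0.5, 1.0])
```

With these toy values, label `b` is dropped because it has fewer than three samples, which is why a larger `min_samples_per_label` trades label coverage for more reliable folds.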