TO-DO List

This document lists the things I should look at more carefully in the future.

  • If an element ID contains _ then the sneaking part in relation_matrix won't work.
  • Insert some sample files in input and output folders.
  • Update the code to support the latest version of the NLTK ParentedTree implementation.
  • The writers should use an XML library instead of writing strings to a file.
  • Write an installation script.
  • Implement an error measurement framework (in ManTIME class) to get statistics from the models.
  • Implement a shuffle method and k-fold cross-validation for the data.
  • Do I really need to load Stanford Core NLP every time, for every document? Once the problem with long texts (dasmith/stanford-corenlp-python#13) is solved, I should switch to the new stanford-core-nlp.
  • Unit-test the code with a proper testing framework (py.test).
  • Comment the code better and more verbosely, following the Google commenting style.
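
For the item about the writers, a minimal sketch of building the output with the standard library's xml.etree.ElementTree instead of concatenating strings. The function build_timeml, the element names (TimeML, TEXT, TIMEX3) and the attribute values are used here only for illustration; ManTIME's actual writers and data structures may look different.

```python
import xml.etree.ElementTree as ET


def build_timeml(parts):
    """Build a TimeML-like tree from (text, attributes-or-None) pairs.

    Hypothetical helper: plain text has attributes None, annotated
    spans carry a dict of attributes for a TIMEX3 element.
    """
    root = ET.Element("TimeML")
    text_el = ET.SubElement(root, "TEXT")
    last = None  # most recently created TIMEX3 element, if any
    for text, attrs in parts:
        if attrs is None:
            if last is None:
                # Plain text before the first annotated span.
                text_el.text = (text_el.text or "") + text
            else:
                # Plain text following an annotated span.
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(text_el, "TIMEX3", attrs)
            last.text = text
    return root


# Illustrative input; the library escapes "<" and "&" on serialisation.
parts = [
    ("Profits rose <5% ", None),
    ("last week", {"tid": "t1", "type": "DATE", "value": "2013-W10"}),
    (" & beyond.", None),
]
xml_string = ET.tostring(build_timeml(parts), encoding="unicode")
```

The main gain over writing strings by hand is that special characters in the document text are escaped by the library, so the output stays well-formed.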

Done:

  • Make the code general with respect to different annotation standards for CRF (IO, BIO, WIO, WBIO, WBIOE, BIOE).
  • Can the same two objects be connected by two different types of temporal relations? No.
  • Can an event be anchored to two different MAKEINSTANCE tags? Yes. (not supported yet.)
  • Move the output folder up.
  • Complete the InTextEntity class development.
  • Implement features for the temporal relation extraction task looking at my notes from the literature.
  • Implement the classifier for Temporal Links.
  • Adapt the writers to output temporal links too.
  • Implement the feature extractor for Temporal Links.
  • Probably some variables in Document and Sentence objects can be deleted.
  • What's id_token in Word class?
  • Do we need EventInstance? Yes, we do.
  • In the attribute training phase, the multi-word expressions should be represented as one sample. The features will be merged according to the order of appearance.
  • Add useful folders (models, output, buffer) to the Git repo.
  • Pickle the num2py arrays and remove the dependency.
  • Activate the post processing pipeline.
  • Fix the logging messages (info, warning, debug)
  • The method search_subsequence is called many times. A more adequate ADT should be used.
  • Implement an HTML (CSS3) writer (timesheet.js, TimelineJS).
  • Make the features as light as possible (in terms of storage space).
  • Show the #_files_processed/#files.
  • Convert the gazetteers to Unicode.
  • Should the attribute data matrix be made of positive samples only?
  • Correct some morphological gazetteer features according to English grammar. Are all the things called prepositions actually prepositions? (ask Marilena Di Bari)
  • Implement the bufferisation at feature level.
  • Fix the Unicode-related bug at utilities.py:76.
  • Have a look at argparse: it's not correct right now.
  • Filter out useless features such as female gazetteers, male gazetteers, US cities. (commented)
  • Look carefully at all the features and possibly cut them. (commented)
  • Instead of the settings.py file, use environment variables (os.environ).
  • Implement the i2b2 reader.
  • Implement the i2b2 writer.
  • Implement a caching system for Stanford Core NLP.
  • Remove the output produced by CRF++ in the training phase.
  • Integrate Norma (https://github.com/filannim/timex-normaliser).
  • Introduce model folders instead of files.
  • Fix and connect the post-processing pipeline.
  • Attribute models should include the identification features (heavier but hopefully better).
  • Split identification models (TIMEXes and EVENTs).
  • CRF-based attribute extraction.
  • There are some print statements somewhere (WARNING cases). I should use something more appropriate for them (log).
  • Remove the output that Stanford Parser writes to stdout/stderr (if everything goes OK).
  • Implement AttributeDataMatrix writer.
  • Implement TempEval-3 writer.
  • Implement TempEval-3 reader.
  • Implement the classifier for events and timexes.
  • Implement the universal feature extractor for events and timexes.
  • Find documentation about how to comment the code so that nice Python-doc style web pages can be automatically generated.
  • Love ManTIME and refactor it!
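
The caching item for Stanford Core NLP can be illustrated with a small on-disk cache keyed by a hash of the input text. This is only a sketch under assumptions, not ManTIME's actual implementation: cached_parse, fake_parse and the cache layout are hypothetical, and fake_parse stands in for the expensive parser call.

```python
import hashlib
import os
import pickle
import tempfile


def cached_parse(text, parse, cache_dir):
    """Return parse(text), reusing a pickled result on disk when available."""
    os.makedirs(cache_dir, exist_ok=True)
    # The cache key depends only on the text, so identical documents
    # are parsed once.
    key = hashlib.sha1(text.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".pickle")
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = parse(text)
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


# Stand-in for the real parser (hypothetical); it records each call.
calls = []


def fake_parse(text):
    calls.append(text)
    return {"tokens": text.split()}


cache_dir = tempfile.mkdtemp()
first = cached_parse("He left yesterday .", fake_parse, cache_dir)
second = cached_parse("He left yesterday .", fake_parse, cache_dir)
```

The second call is served from disk, so the parser runs only once for the same text.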