David McClure davidmcclure

David McClure
  • David McClure ae2fb2f
    Refactors html_to_text() so that it takes a file path, not a raw HT…
David McClure
  • David McClure 7ef7912
    In MockCorpus#add_file, make segment='000' by default, since it's oft…
David McClure
  • David McClure cb3e0f5
    Refactors the office/pdf converters so that they just take file paths.
David McClure
  • David McClure f9ce3dd
    Tightens up office_to_text().
David McClure
  • David McClure d26a4cb
    For now, get rid of the unused MockFile#write_txt() method.
David McClure
  • David McClure b48fadb
    Gets rid of the explicit add_segment calls in the pdf/office conversi…
David McClure
  • David McClure 253603b
    Style refactoring in MockCorpus#write_txt().
David McClure
  • David McClure 7af8c08
    In MockCorpus#add_file(), ensure the segment exists.
David McClure
  • David McClure ea00697
    Adds a MockCorpus#write_txt() method.
David McClure
  • David McClure 90c3efc
    In the mock corpus class, just use a single add_file method, which de…
David McClure
  • David McClure 610288c
    Converts the Corpus class to Google-style docstrings.
David McClure
  • David McClure 3651619
    Stubbing in a MockFile base class.
David McClure
  • David McClure 42319b4
    When testing PDF text extraction, just use a single page.
David McClure
  • David McClure 9d36931
    Test the office_to_text() helper.
David McClure
  • David McClure f143389
    Handle connection-refused exceptions in tika_is_online(), correctly w…
David McClure
  • David McClure 6c99629
    Adds a tika_is_online() helper.
David McClure
  • David McClure 7174005
    Tightening up formatting in the html_to_text() helper.
David McClure
  • David McClure 0bff63b
    Stubs in a Tika adapter class.
David McClure
  • David McClure 806ad3a
    Stubbing in tests for text extraction with Tika.
David McClure
  • David McClure d7bedb5
    Spacing out the docstrings in the corpus utils.
David McClure
  • David McClure 3363630
    Standardizing spacing in the corpus tests.
David McClure
  • David McClure 102c759
    Tests the pdf_to_text() helper.
David McClure
  • David McClure 7e17071
    In the corpus tests, migrate to Google-style docstrings.
David McClure
  • David McClure 8a79835
    Make the add_pdf method return a handle on the new PDF file.
David McClure
  • David McClure f8ffdc4
    Adds a MockCorpus#add_pdf() method.
David McClure
  • David McClure 9417a0a
    Block in the beginnings of a MockCorpus class.