Add datathon derivatives to app (binary info, web pages, web graph) #447

@ruebot

Is your feature request related to a problem? Please describe.

The only way to create the derivatives we used for the recent datathon(s) is to run them by hand in spark-shell. We should add them to the app.

Describe the solution you'd like

Add the following derivatives to app:

  • Binaries
    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Word processor files
    • Videos
  • Web pages
    • .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
  • Web graph
    • .webgraph()
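
For reference, here is a minimal spark-shell sketch of how these derivatives might be produced by hand today. The `webpages()` and `webgraph()` calls come from the list above. Everything else is an assumption: the input and output paths are placeholders, the `io.archivesunleashed.df._` import location may differ between toolkit releases, `pdfs()` stands in for the other binary extractors, and Parquet is just one possible output format.

```scala
// Run inside spark-shell with the toolkit loaded; `sc` and `spark` are provided by the shell.
import io.archivesunleashed._
import io.archivesunleashed.df._ // assumption: where RemoveHTMLDF / RemoveHTTPHeaderDF live in this release
import spark.implicits._         // enables the $"column" syntax

val warcs = "/path/to/warcs"     // placeholder input path
val out = "/path/to/derivatives" // placeholder output directory

// Web pages: crawl date, URL, MIME types, and text with HTTP headers and HTML removed.
val webpages = RecordLoader.loadArchives(warcs, sc)
  .webpages()
  .select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika",
    RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).alias("content"))
webpages.write.parquet(s"$out/webpages")

// Web graph: written as produced by the toolkit.
val webgraph = RecordLoader.loadArchives(warcs, sc).webgraph()
webgraph.write.parquet(s"$out/webgraph")

// Binaries: one DataFrame per type. pdfs() is assumed here; the other types
// in the list would follow the same pattern with their own extractor.
val pdfs = RecordLoader.loadArchives(warcs, sc).pdfs()
pdfs.write.parquet(s"$out/pdfs")
```

In the app, each of these would presumably become its own job with its own output location, rather than one script run end to end.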

Additional context

  • For the binary derivatives, we need to decide what to produce: just the binaries, just the information about each binary, or both?
  • For webpages, should we add a domain column, so it is similar to the "full-text" derivative, or should it completely replace the "full-text" derivative?
  • For webgraph, should this just be the DomainGraphExtractor as "TEXT"?
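
To make the first two questions concrete, here is a rough sketch of what the "binary info only" and "webpages with a domain column" variants could look like in spark-shell. It is a sketch under assumptions, not a proposal for the final implementation: the `bytes` column name, the `pdfs()` extractor, the `ExtractDomainDF` function, and the import location are assumed from the toolkit release in use and should be checked against the actual schema. The path is a placeholder.

```scala
// Run inside spark-shell with the toolkit loaded; `sc` and `spark` are provided by the shell.
import io.archivesunleashed._
import io.archivesunleashed.df._ // assumption: where the *DF functions live in this release
import spark.implicits._

val warcs = "/path/to/warcs" // placeholder input path

// Binary info only: keep the descriptive columns and drop the payload.
// Assumes the payload column is named "bytes".
val pdfInfo = RecordLoader.loadArchives(warcs, sc)
  .pdfs()
  .drop("bytes")

// Web pages with a domain column, closer in shape to the "full-text" derivative.
// ExtractDomainDF is assumed to derive the domain from the URL.
val pagesWithDomain = RecordLoader.loadArchives(warcs, sc)
  .webpages()
  .select($"crawl_date", ExtractDomainDF($"url").alias("domain"), $"url",
    $"mime_type_web_server", $"mime_type_tika",
    RemoveHTMLDF(RemoveHTTPHeaderDF($"content")).alias("content"))
```

Dropping the payload keeps the info derivative small enough to download and inspect, while the full binaries could be offered as a separate, larger derivative.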
