Closed
Description
Is your feature request related to a problem? Please describe.
The only way to create the derivatives we used for the recent datathon(s) is via the Spark shell. We should add them to the app.
Describe the solution you'd like
Add the following derivatives to the app:
- Binaries
- Audio
- Images
- PDFs
- Presentation program files
- Spreadsheets
- Text files
- Word processor files
- Videos
- Web pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
- Web graph
.webgraph()
Additional context
- For the binary derivatives, we might want to sort out if we do just the binaries, all the info about the binary, or binaries + binary info?
- For webpages, should we add a domain column, so it is similar to the "full-text" derivative, or should it completely replace the "full-text" derivative?
- For webgraph, should this just be the DomainGraphExtractor as "TEXT"?