Document the differences in the data format of the different sources (PageFreezer, Versionista). #46

Closed
titaniumbones opened this issue Jun 21, 2017 · 25 comments

Comments

@titaniumbones
Contributor

titaniumbones commented Jun 21, 2017

We are building a flexible framework designed to accommodate a variety of crawled page snapshots. Different services produce different data formats. By documenting them carefully, we set ourselves up for success.

@titaniumbones titaniumbones added this to the Phase I Evaluations milestone Jun 21, 2017
@Mr0grog
Member

Mr0grog commented Jun 21, 2017

@titaniumbones are you talking about differences in the actual systems themselves or the data we are storing and making public? If the latter, that is documented here (in the source_metadata item): https://github.com/edgi-govdata-archiving/web-monitoring#versions

There’s no pagefreezer info there yet because we do not have a consistent format to document for it yet.

@Mr0grog
Member

Mr0grog commented Jun 21, 2017

There’s no pagefreezer info there yet because we do not have a consistent format to document for it yet.

Same for IA, too.

@titaniumbones
Contributor Author

@Mr0grog I think it was the data in the actual systems themselves. But... good question! Um. This came out of the GSoC meeting, where we were trying to turn the candidates' proposals into issues which could be grouped into milestones. So I think @janakrajchadha @suchthis @mhucka @danielballan will probably refine the issue together!

@danielballan
Contributor

@janakrajchadha Your proposal included "Understand the differences...." Once the differences are clear in your mind, documentation can be a concrete achievement for this task.

@danielballan
Contributor

Oops, @Mr0grog's comment and the subsequent ones hadn't loaded for me when I posted the above. Yes, the task here is to both determine and document source_metadata for PF and IA.

@Mr0grog
Member

Mr0grog commented Jun 21, 2017

Ah, sorry! Didn’t realize this was coming out of another discussion. 👍

@janakrajchadha

@danielballan @titaniumbones Can either of you point me to where something similar has been done for Versionista (if it exists)?

@Mr0grog
Member

Mr0grog commented Jun 22, 2017

@janakrajchadha in terms of the raw data we can get out of Versionista, that’s never been documented:

  • Partially because it’s always changing—we are scraping, so occasionally access to some piece of information disappears or we figure out how to extract some new piece of data
  • Partially because I have just failed to clearly document; I’ve focused on what we store in the DB server

You could check out a recent output file to get a feel, though: https://s3-us-west-2.amazonaws.com/edgi-versionista-archive/versionista1/metadata-2017-06-20T00%3A00Z.json

@janakrajchadha

@titaniumbones are you talking about differences in the actual systems themselves or the data we are storing and making public? If the latter, that is documented here (in the source_metadata item): https://github.com/edgi-govdata-archiving/web-monitoring#versions

@Mr0grog I may be wrong here, but what we're storing as versions contains the different fields which we want in our DB and the source_metadata is what we get from the source itself. After taking a look at the recent Versionista output file, I would say that the source_metadata for a general case of Versionista output is in fact documented well. How is the data which we are storing and making public (in the source_metadata field) different from the data in the actual system output?
Am I confusing terms here?

@Mr0grog
Member

Mr0grog commented Jun 27, 2017

source_metadata is what we get from the source itself

It’s close, but not exactly the same. source_metadata doesn’t include fields that are already represented in the page and version records and also flattens some fields (e.g. diff.hash → diff_hash). See here for the script that converts raw Versionista scraper output to DB input: https://github.com/edgi-govdata-archiving/web-monitoring-versionista-scraper/blob/master/bin/import-to-db#L55-L77
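
A rough Python sketch of the conversion described above, for illustration only. The real logic lives in the linked import-to-db script. Everything named here is an assumption: `RECORD_FIELDS` is a made-up stand-in for "fields already represented in the page and version records", and flattening every nested object (rather than only some fields) is a simplification.

```python
# Illustrative only: these field names are assumptions, not the real schema.
RECORD_FIELDS = {"url", "title", "date"}  # hypothetical fields already stored on page/version records


def to_source_metadata(raw_version):
    """Build a source_metadata dict from one raw scraper record."""
    metadata = {}
    for key, value in raw_version.items():
        if key in RECORD_FIELDS:
            # Already represented on the page or version record, so leave it out.
            continue
        if isinstance(value, dict):
            # Flatten one level: {"diff": {"hash": ...}} becomes {"diff_hash": ...}
            for sub_key, sub_value in value.items():
                metadata[f"{key}_{sub_key}"] = sub_value
        else:
            metadata[key] = value
    return metadata
```

For example, a record with a nested `diff` object holding a `hash` key would come out with a top-level `diff_hash` key and no `url` key.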

@janakrajchadha

@Mr0grog Oh, that was a little confusing earlier because of the term source_metadata being used for different things. Thanks for the clarification!

I was just adding the information in the documentation for some of the fields in the PageFreezer output and I was wondering if summarizing the internal info is a better way to go. Since we were talking about the data in the actual systems themselves, there are some fields which do not concern us. I just wanted to get a sense of how detailed we want this documentation to be. @danielballan @titaniumbones @suchthis @mhucka

@janakrajchadha

@titaniumbones @mhucka @suchthis @danielballan @Mr0grog I've documented the data format of the different sources and I've also added a table for differences between them. A few fields don't have a description as I wasn't sure what they meant.
Also, there's another IPython notebook which can be used to view an example of output for IA and PF. I've added a link to a Versionista output in the document itself.
See edgi-govdata-archiving/web-monitoring-processing@d92164f
Please review

@janakrajchadha

@danielballan Should this be closed or should we keep this open as I still have to add a little more information to the document?

@danielballan
Contributor

Let's leave it open to track our progress. Would you enumerate the blank entries here? Then we can ask for external help.

@janakrajchadha

PageFreezer

  • Data:
  • Depth:
  • TaskId:
  • Url0:
  • Url1:
  • UrlType:
  • Writeflag:

Versionista

  • diffWithPreviousDate:
  • diffWithFirstDate:

@Mr0grog
Member

Mr0grog commented Jul 5, 2017

A few fields don't have a description as I wasn't sure what they meant…

Versionista

  • diffWithPreviousDate
  • diffWithFirstDate

These two fields are kind of weird and are sort of a result of the CSVs that analysts are currently using.

diffWithFirstDate is listed in the CSVs as the “date” of the diff between the current version and the first-ever-captured version. Diffs don’t really have a date, though, so this is actually just the capture date of the first-ever captured version of this page.

diffWithPreviousDate is the date of the current version (not the date of the previous version being diffed with, as you might expect from the name).
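
Since the names are misleading, one way to keep the two fields straight is to read them under names that say what they hold. This is only a sketch: `explain_diff_dates` and the output key names are hypothetical, and it assumes a version record is a plain dict carrying the two fields.

```python
def explain_diff_dates(version):
    """Return the two confusing date fields under self-describing names."""
    return {
        # Capture date of the first-ever captured version of this page.
        "first_version_capture_date": version.get("diffWithFirstDate"),
        # Capture date of the current version (not of the previous version).
        "current_version_capture_date": version.get("diffWithPreviousDate"),
    }
```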

@Mr0grog
Member

Mr0grog commented Jul 5, 2017

Three other minor notes:

  • hasContent indicates whether Versionista stored any content, not whether the version actually had any. One of the drawbacks of Versionista is that it won’t store content for files of a certain type or files that are too large (I think the threshold is probably somewhere around 1 or 2 MB, but Versionista has no docs on the actual number).

  • filePath should be where it is stored in our public archive, not on Versionista. e.g. if you have:

    "filePath": "versionista1/72879-6127248/version-11822980.html"

    Then you should be able to retrieve content from:

    http://edgi-versionista-archive.s3.amazonaws.com/versionista1/72879-6127248/version-11822980.html

    (That said, filePath is incorrect in some older metadata files, where it is the actual path on disk where we temporarily downloaded the version content before uploading to S3.)

  • hash is missing (I think it got accidentally combined with filePath above)
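
The hasContent and filePath notes above could be combined into a small helper like the one below. It is a sketch under assumptions: the function name is made up, a version record is assumed to be a plain dict, and treating a leading "/" as a sign of the old on-disk paths is a guess rather than anything confirmed here.

```python
from urllib.parse import quote

# Public archive bucket, taken from the example URL above.
ARCHIVE_BASE_URL = "http://edgi-versionista-archive.s3.amazonaws.com/"


def archive_url(version):
    """Return the public archive URL for a version, or None if there is nothing to fetch."""
    if not version.get("hasContent"):
        # Versionista stored no content for this version.
        return None
    file_path = version.get("filePath")
    if not file_path or file_path.startswith("/"):
        # Assumption: an absolute path is one of the old temporary disk paths.
        return None
    return ARCHIVE_BASE_URL + quote(file_path)
```

With the filePath from the example above and hasContent set to true, this returns the same S3 URL shown there.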

@janakrajchadha

janakrajchadha commented Jul 6, 2017

Thanks a lot @Mr0grog! The date fields are ambiguous.

hash is missing (I think it got accidentally combined with filePath above)

Yeah, I probably mixed this up as the hash and filePath are kept as a single object in the output.

@Mr0grog
Member

Mr0grog commented Jul 6, 2017

Hmmm, hash and filePath should not be a single object. Are there metadata files where they are? If so, we should correct those.

@janakrajchadha

The hash and path of the diff are in a single object. The version hash and filePath aren't. I think I confused those two. Apologies.

@janakrajchadha

@Mr0grog I think Internet Archive and Versionista have been well documented here. There are a few fields missing in the PageFreezer part and I was hoping that we could bring up the topic of them providing us better API documentation in the coming discussions with them.
cc: @ambergman

@stale

stale bot commented Jan 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@stale stale bot added the stale label Jan 9, 2019
@Mr0grog
Member

Mr0grog commented Jan 10, 2019

We have source_metadata_versionista documented in the DB docs; we should probably do the same for source_metadata_web_monitoring. Or we should document that info somewhere else. In any case, this still seems relevant.

@stale

stale bot commented Jul 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@stale stale bot added the stale label Jul 9, 2019
@stale stale bot removed stale labels Jul 15, 2019
@Mr0grog Mr0grog self-assigned this Jul 15, 2019
@stale stale bot added the stale label Jan 11, 2020
@stale stale bot removed the stale label Jan 16, 2020
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Jan 16, 2020
@Mr0grog Mr0grog moved this from Ready to Icebox in Web Monitoring Oct 21, 2020
@Mr0grog
Member

Mr0grog commented Jul 17, 2023

At this point, I’m just going to close this. The project is shutting down.

@Mr0grog Mr0grog closed this as completed Jul 17, 2023
Web Monitoring automation moved this from Icebox to Done! Jul 17, 2023