Document the differences in the data format of the different sources (PageFreezer, Versionista). #46
Comments
@titaniumbones are you talking about differences in the actual systems themselves or the data we are storing and making public? If the latter, that is documented here (in the […]). There’s no PageFreezer info there yet because we do not have a consistent format to document for it yet. |
Same for IA, too. |
@Mr0grog I think it was the data in the actual systems themselves. But... good question! Um. This came out of the GSoC meeting, where we were trying to turn the candidates' proposals into issues which could be grouped into milestones. So I think @janakrajchadha @suchthis @mhucka @danielballan will probably refine the issue together! |
@janakrajchadha Your proposal included "Understand the differences...." Once the differences are clear in your mind, documentation can be a concrete achievement for this task. |
Oops, @Mr0grog's comment and the subsequent ones hadn't loaded for me when I posted the above. Yes, the task here is to both determine and document source_metadata for PF and IA. |
Ah, sorry! Didn’t realize this was coming out of another discussion. 👍 |
@danielballan @titaniumbones Can either one of you redirect me to the place where a similar thing has been done for Versionista (if it exists)? |
@janakrajchadha in terms of the raw data we can get out of Versionista, that’s never been documented:
You could check out a recent output file to get a feel, though: https://s3-us-west-2.amazonaws.com/edgi-versionista-archive/versionista1/metadata-2017-06-20T00%3A00Z.json |
@Mr0grog I may be wrong here, but what we're storing as versions contains the different fields which we want in our DB and the source_metadata is what we get from the source itself. After taking a look at the recent Versionista output file, I would say that the source_metadata for a general case of Versionista output is in fact documented well. How is the data which we are storing and making public (in the source_metadata field) different from the data in the actual system output? |
It’s close, but not exactly the same. |
@Mr0grog Oh, that was a little confusing earlier because of the term I was just adding the information in the documentation for some of the fields in the PageFreezer output and I was wondering if summarizing the internal info is a better way to go. Since we were talking about the data in the actual systems themselves, there are some fields which do not concern us. I just wanted to get a sense of how detailed we want this documentation to be. @danielballan @titaniumbones @suchthis @mhucka |
@titaniumbones @mhucka @suchthis @danielballan @Mr0grog I've documented the data format of the different sources and I've also added a table for differences between them. A few fields don't have a description as I wasn't sure what they meant. |
@danielballan Should this be closed or should we keep this open as I still have to add a little more information to the document? |
Let's leave it open to track our progress. Would you enumerate the blank entries here? Then we can ask for external help. |
PageFreezer
Versionista
|
These two fields are kind of weird and are sort of a result of the CSVs that analysts are currently using.
|
Three other minor notes:
|
Thanks a lot @Mr0grog! The date fields are ambiguous.
Yeah, I probably mixed this up as the hash and filePath are kept as a single object in the output. |
Hmmm, |
The |
@Mr0grog I think Internet Archive and Versionista have been well documented here. There are a few fields missing in the PageFreezer part and I was hoping that we could bring up the topic of them providing us better API documentation in the coming discussions with them. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions. |
We have |
At this point, I’m just going to close this. The project is shutting down. |
We are building a flexible framework designed to accommodate a variety of crawled page snapshots. Different services produce different data formats. By documenting them carefully, we set ourselves up for success.
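A minimal sketch of that idea, assuming each source's raw record can be split into a few common fields plus everything else, which is kept verbatim under `source_metadata`. Only `source_metadata` comes from this thread; the `Version` class, `to_version`, the common field names (`url`, `capture_time`), and the sample `diff_length` key are hypothetical and do not reflect the real PageFreezer, Versionista, or Internet Archive output.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical names for the fields every source is assumed to share.
COMMON_KEYS = ("url", "capture_time")


@dataclass
class Version:
    """Common record for a page snapshot, whatever service produced it."""
    url: str
    capture_time: str
    source_type: str
    # Source-specific fields are kept untouched so nothing is lost.
    source_metadata: Dict[str, Any] = field(default_factory=dict)


def to_version(source_type: str, raw: Dict[str, Any]) -> Version:
    """Split a raw record into common fields and source-specific leftovers."""
    extras = {k: v for k, v in raw.items() if k not in COMMON_KEYS}
    return Version(
        url=raw["url"],
        capture_time=raw["capture_time"],
        source_type=source_type,
        source_metadata=extras,
    )


# Usage with a made-up record; "diff_length" is an illustrative key only.
sample = {
    "url": "https://example.com/",
    "capture_time": "2017-06-20T00:00:00Z",
    "diff_length": 12,
}
version = to_version("versionista", sample)
```

Keeping the leftovers in one dictionary means a new source can be added without changing the common record, which is why documenting what each source puts in `source_metadata` matters.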