Using Micro-collections in Social Media to Generate Seeds for Web Archive Collections

Brief description of Code

MicroCols.py: Generates collections of URIs from the different social media sites.
SegmentCols.py: Labels collections of URIs with post classes.
PrecEval.py: Performs precision evaluation with reference gold standard.
cdSegmentCols.py: Performs carbon dating (creation date estimation) of links.
genericCommon.py: Utility functions used by all of the scripts above.

In the dataset, the post classes are named SS, MS, SM, and MM. In the paper, the same classes are written P_1A_1, P_nA_1, P_1A_n, and P_nA_n, respectively.

The dataset topics:

Each collection of URIs was labeled with one of the four post class labels.

name: social media name - index number (string)
extraction-timestamp: datetime when the entire collection was created (string)
segmented-cols: (array[objects])
- (object)
  - ss or sm or ms or mm or mc: post classes (array[objects])
    - (object)
      - timestamp: datetime when URIs were extracted (string)
      - sim-coeff: if predicted-precision ≥ sim-coeff, uris are labeled relevant (float)
      - predicted-precision: precision score of uris (float)
      - uris: collection of uris extracted from social media sources (array[objects])
        - (object)
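
The structure above can be walked with a few lines of Python. The following is a minimal sketch, not code from this repository: the function name `relevant_segments`, the sample values, and the timestamp format are assumptions made for illustration, and the entries of `uris` are left as empty objects because their fields are not described here. It applies the rule stated for sim-coeff, keeping a segment when predicted-precision ≥ sim-coeff.

```python
import json

# Post class keys listed in the format description above.
POST_CLASSES = ("ss", "sm", "ms", "mm", "mc")


def relevant_segments(collection):
    """Yield (post_class, segment) pairs whose predicted-precision >= sim-coeff."""
    for group in collection.get("segmented-cols", []):
        for post_class in POST_CLASSES:
            for segment in group.get(post_class, []):
                if segment["predicted-precision"] >= segment["sim-coeff"]:
                    yield post_class, segment


# Hypothetical record shaped like the description above; all values are made up.
sample = json.loads("""
{
  "name": "example-0",
  "extraction-timestamp": "2018-01-01T00:00:00",
  "segmented-cols": [
    {
      "ss": [
        {"timestamp": "2018-01-01T00:00:00", "sim-coeff": 0.2,
         "predicted-precision": 0.5, "uris": [{}, {}]}
      ],
      "mm": [
        {"timestamp": "2018-01-01T00:00:00", "sim-coeff": 0.2,
         "predicted-precision": 0.1, "uris": [{}]}
      ]
    }
  ]
}
""")

result = list(relevant_segments(sample))
for post_class, segment in result:
    print(post_class, len(segment["uris"]))
```

With a real file from the Data folder, the `sample` assignment would be replaced by `json.load` on the opened file, assuming the file holds a single record of this shape.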
