Skip to content

Code and data associated with: ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2019) full paper

Notifications You must be signed in to change notification settings

anwala/MicroCollectionsJCDL2019

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Using Micro-collections in Social Media to Generate Seeds for Web Archive Collections

Code and data associated with: ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2019) full paper and tech report.

Brief description of Code

  • MicroCols.py: Generates collections of URIs from the different social media sites.
  • SegmentCols.py: Labels collections of URIs with post classes.
  • PrecEval.py: Performs precision evaluation with reference gold standard.
  • cdSegmentCols.py: Performs carbondating (creation date estimation) of links.
  • genericCommon.py: consists of utility functions used by all previously described scripts.

Brief description of Data

In the dataset, the post classes were SS, MS, SM, and MM. In the paper, we used P_1A_1, P_nA_1, P_1A_n, and P_nA_n:

  • Single Post, Single Author = SS = P_1A_1
  • Single Post, Single Author = MS = P_nA_1
  • Single Post, Single Author = SM = P_1A_n
  • Single Post, Single Author = MM = P_nA_n
  • MC (Micro-Collections) = P_nA_1 ∪ P_1A_n ∪ P_nA_n

The dataset topics:

  1. Ebola Virus Outbreak
  2. Flint Water Crisis
  3. MSD Shooting
  4. 2018 World Cup
  5. 2018 Midterm Elections

Collection of URIs were labeled with one of the four post post class labels.

Description of a sample of fields of single JSON file (e.g., reddit ebola virus outbreak)

  • name: social media name - index number (string)
  • extraction-timestamp: datetime entire collection was created (string)
  • segmented-cols: (array[objects])
    • (object)
      • ss or sm or ms or mm or mc: post classes (array[objects])
        • (object)
          • timestamp: datetime when URIs were extracted (string)
          • sim-coeff: if predicted-precision ≥ sim-coeff, uris are labeled relevant (float)
          • predicted-precision: precision score of uris (float)
          • uris: collection of uris extracted from social media sources (array[object])
            • (object)

About

Code and data associated with: ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2019) full paper

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages