# easy2edm

This notebook creates proceedings for the [International Conference on Educational Data Mining](https://educationaldatamining.org/conferences/) using reviewing data from [EasyChair](https://easychair.org).

## Steps

- See the [README](README.md) for how to download and structure the EasyChair data
- Change the ISBN and decision types below to suit
- Make the frontmatter in `./book-proceedings/book-proceedings.tex` so you know what page publications start on
- Run the cells below **using that start page as a parameter** to generate the bibtex in `./proceedings/cdrom/bib` and to organize the accepted pdfs in `./proceedings/cdrom/pdf`
- Elsewhere, generate the DOIs, add them to the bibtex above, and place the updated bibtex files in `./proceedings/cdrom/bib-with-doi`
- Run the cell below to stamp the pdfs with the doi-enhanced bibtex, putting the stamped pdfs into `./proceedings/cdrom/stamped`
- Build `./book-proceedings/book-proceedings.tex`; this will bring in the stamped pdfs and compile the proceedings as a pdf book
    - Check that the stamp on several papers matches their actual pagination in the book
- Use pdftk or similar to slice out the frontmatter, then use the bibtex cell below to create bibtex for it (if desired)
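
Before running the cells, it can help to confirm that each track directory contains the files the utility functions open. The following is a minimal sketch, assuming the layout implied by the code in this notebook (a `meta`, `submissions`, and `accepted` file plus a `pdf` directory per track). `missing_track_files` is a hypothetical helper and is not part of easy2acl; the README remains the authoritative description of the layout.

```python
import os

# Files the utility functions in this notebook open for every track.
# 'submission.csv' (abstracts) is optional there, so it is not required here.
REQUIRED = ("meta", "submissions", "accepted", "pdf")

def missing_track_files(tracks, root="."):
    """Return a list of expected paths that do not exist under root."""
    missing = []
    for track in tracks:
        for name in REQUIRED:
            path = os.path.join(root, track, name)
            if not os.path.exists(path):
                missing.append(path)
    return missing

if __name__ == "__main__":
    # Example call with a subset of the track names used in this notebook
    for path in missing_track_files(["long-papers", "short-papers", "posters"]):
        print("missing:", path)
```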

## Utility functions

In [1]:
# easy2acl.py - Convert data from EasyChair for use with ACLPUB
#
# Original Author: Nils Blomqvist
# Forked/modified by: Asad Sayeed
# Further modifications and docs (for 2019 Anthology): Matt Post
# Index for LaTeX book proceedings: Mehdi Ghanimifard and Simon Dobnik
# Modified for EDM by Andrew Olney 
# Please see the documentation in the README file at http://github.com/acl-org/easy2acl.

import os
import re
import sys

from csv import DictReader
from glob import glob
from shutil import copy, rmtree
from unicode_tex import unicode_to_tex
from pybtex.database import BibliographyData, Entry
from PyPDF2 import PdfFileReader
from functools import cmp_to_key
from pylatex.utils import escape_latex

isbn = "978-1-7336736-3-1"

# Specify conference tracks  here IN PROCEEDINGS ORDER
tracks =  ['long-papers','short-papers','posters','doctoral-consortium','industry-track','workshop-tutorials']
pretty_tracks =  ['Long Papers','Short Papers','Posters','Doctoral Consortium','Industry Track','Workshops \\& Tutorials']

# specify decision types here
decisions = {
    "Accept in current track" : None,
    "Accept+move to short" :"short-papers",
    "Accept+move to posters":"posters",
    "ACCEPT":None
}

#---------------------------------------------------------------------------------------------------------

# for later sorting papers by track
# (submission_id, title, authors, track)
track_order = ['front'] + tracks 
def paper_cmp(a, b):
    if track_order.index(a[3]) > track_order.index(b[3]):
        return 1
    elif a[3] == b[3]:
        if int(a[0]) > int(b[0]):
            return 1
        else:
            return -1
    else:
        return -1
paper_cmp_key = cmp_to_key(paper_cmp)

def texify(string):
    """Return a modified version of the argument string where non-ASCII symbols have
    been converted into LaTeX escape codes.

    """
#     unicode_to_tex does not handle german sharp s well, giving \ss instead of {\ss}
#     return ' '.join(map(unicode_to_tex, string.split())).replace(r'\textquotesingle', "'")
    return ' '.join(map(escape_latex, string.split())).replace(r'\textquotesingle', "'")

def get_track_metadata(directory):
    #,----
    #| Metadata
    #`----
    metadata = { 'chairs': [] }
    with open(os.path.join(directory, 'meta')) as metadata_file:
        for line in metadata_file:
            key, value = line.rstrip().split(maxsplit=1)
            if key == 'chairs':
                metadata[key].append(value)
            else:
                metadata[key] = value

    for key in 'abbrev volume title shortbooktitle booktitle month year location publisher chairs'.split():
        if key not in metadata:
            print('Fatal: missing key "{}" from "meta" file'.format(key))
            print("Please see the documentation at https://acl-org.github.io/ACLPUB/anthology.html.")
            sys.exit(1)

    for key in "bib_url volume_name short_booktitle type".split():
        if key in metadata:
            print('Fatal: bad key "{}" in the "meta" file'.format(key))
            print("Please see the documentation at https://acl-org.github.io/ACLPUB/anthology.html.")
            sys.exit(1)

    venue = metadata["abbrev"]
    volume_name = metadata["volume"]
    year = metadata["year"]
    return metadata

def collect_track_metadata():
    metadata ={}
    for d in tracks :
        metadata[d] = get_track_metadata(d)
    return metadata

# Across all tracks, build a dictionary of submissions (which has author 
# information). We do this across tracks because some submissions have 
# decisions that move them to other tracks 

def collect_submissions_and_acceptances( decision_map, metadata ):
    # it seems we only need id and authors from submissions, not title; keeping title in for debugging purposes
    submissions = {}
    for d in tracks :
        with open(os.path.join(d, 'submissions')) as submissions_file:
            for line in submissions_file:
                entry = line.rstrip().split("\t")
                submission_id = entry[0]
                authors = entry[1].replace(' and', ',').split(', ')
                title = entry[2]

                submissions[submission_id] = (title, authors)
            print("Found ", len(submissions), " submitted files in ", d)

    #
    # Append each accepted submission, as a tuple, to the 'accepted' list.
    # Order in this file is used to determine program order.
    #
    accepted = []
    for d in tracks :
        with open(os.path.join(d, 'accepted')) as accepted_file:
            for line in accepted_file:
                entry = line.rstrip().split("\t")
                # modified here to filter out the rejected files rather than doing
                # that by hand
                #if entry[-1] == 'ACCEPT':
                if entry[-1] in decision_map:
                    #print(d)
                    submission_id = entry[0]
                    title = entry[1]
                    authors = submissions[submission_id][1]
                    # if we defined an explicit mapping, use it
                    if decision_map[ entry[-1] ]:
                        track = decision_map[ entry[-1] ]
                    # otherwise we should place in current track
                    else:
                        track = d

                    accepted.append((submission_id, title, authors, track))
            print("Found ", len(accepted), " accepted files in ", d)

    # Read abstracts
    abstracts = {}
    for d in tracks :
        if os.path.exists(os.path.join(d, 'submission.csv')):
            with open(os.path.join(d, 'submission.csv')) as csv_file:
                # use a separate name so the track name in d is not overwritten
                reader = DictReader(csv_file)
                for row in reader:
                    abstracts[row['#']] = row['abstract']
            print('Found ', len(abstracts), 'abstracts in ',d)
        else:
            print('No abstracts available.')

    #
    # Find all relevant PDFs
    #
    venue = metadata['long-papers']["abbrev"]
    year = metadata['long-papers']["year"]
    booktitle = metadata['long-papers']['booktitle']
    chairs = metadata['long-papers']['chairs']
    
    # The PDF of the full proceedings
    full_pdf_file = 'pdf/{}_{}.pdf'.format(venue, year)
    if not os.path.exists(full_pdf_file):
        print("Fatal: could not find full volume PDF '{}'".format(full_pdf_file))
        sys.exit(1)

    # The PDF of the frontmatter
    frontmatter_pdf_file = 'pdf/{}_{}_frontmatter.pdf'.format(venue, year)
    if not os.path.exists(frontmatter_pdf_file):
        print("Fatal: could not find frontmatter PDF file '{}'".format(frontmatter_pdf_file))
        sys.exit(1)

    # File locations of all PDFs (seeded with PDF for frontmatter)
    pdfs = { '0': frontmatter_pdf_file }
    for d in tracks :
        for pdf_file in glob(os.path.join(d,'pdf/{}_{}_paper_*.pdf'.format(venue, year))):
            submission_id = pdf_file.split('_')[-1].replace('.pdf', '')
            pdfs[submission_id] = pdf_file

    # List of accepted papers (seeded with frontmatter)
    accepted.insert(0, ('0', booktitle, chairs, 'front'))
#     return (submissions, accepted, abstracts, pdfs)
    return (accepted, abstracts, pdfs)

#
# Create Anthology tarball
#

# def render_bibtex_and_track_assigned_pdf(metadata, submissions, accepted, abstracts, pdfs, start_page = 1):
def render_bibtex_and_track_assigned_pdf(metadata, accepted, abstracts, pdfs, start_page = 1):
    
    # All this information is shared across tracks, so we can use long-papers
    venue = metadata['long-papers']["abbrev"]
    year = metadata['long-papers']["year"]
    booktitle = metadata['long-papers']['booktitle']
    chairs = metadata['long-papers']['chairs']
    # volume name is track name
    #volume_name = metadata['long-papers']["volume"]
    location = metadata['long-papers']["location"]
    publisher = metadata['long-papers']["publisher"]
    month= metadata['long-papers']["month"]
    editors =  metadata['long-papers']["chairs"]
    
    # Create destination directories
    for dir in ['bib', 'pdf']:
        dest_dir = os.path.join('proceedings/cdrom', dir)
        if not os.path.exists(dest_dir):
            os.makedirs(dest_dir)

    # Copy over "meta" file
    print('COPYING long papers meta -> proceedings/meta', file=sys.stderr)
    copy('long-papers/meta', 'proceedings/meta')

    final_bibs = []
#     start_page = 1
    # sort in place into proceedings order (list.sort returns None)
    accepted.sort(key=paper_cmp_key)
    for paper_id, entry in enumerate(accepted):
        #print( entry)
        submission_id, paper_title, authors, track = entry
        authors = ' and '.join(authors)
        if submission_id not in pdfs:
            print('Fatal: no PDF found for submission', submission_id, file=sys.stderr)
            sys.exit(1)

        pdf_path = pdfs[submission_id]
        dest_path = 'proceedings/cdrom/pdf/{}.{}-{}.{}.pdf'.format(year, venue,track, paper_id)

        copy(pdf_path, dest_path)
        print('COPYING', pdf_path, '->', dest_path, file=sys.stderr)

        bib_path = dest_path.replace('pdf', 'bib')
        if not os.path.exists(os.path.dirname(bib_path)):
            os.makedirs(os.path.dirname(bib_path))

        anthology_id = os.path.basename(dest_path).replace('.pdf', '')

        bib_type = 'inproceedings' if submission_id != '0' else 'proceedings'
        bib_entry = Entry(bib_type, [
            ('author', authors),
            ('title', paper_title),
            ('year', year ),
            ('month',month),
            ('address', location),
            ('publisher', publisher),
            ('editor', ' and '.join(editors)),
            ('isbn', isbn)
        ])

        # Add page range if not frontmatter
        if paper_id > 0:
            with open(pdf_path, 'rb') as in_:
                file = PdfFileReader(in_)
                last_page = start_page + file.getNumPages() - 1
                bib_entry.fields['pages'] = '{}--{}'.format(start_page, last_page)
                start_page = last_page + 1

        # Add the abstract if present
        if submission_id in abstracts:
            bib_entry.fields['abstract'] = abstracts.get(submission_id)

        # Add booktitle for non-proceedings entries
        if bib_type == 'inproceedings':
            bib_entry.fields['booktitle'] = booktitle

        try:
            bib_string = BibliographyData({ anthology_id: bib_entry }).to_string('bibtex')
        except TypeError as e:
            print('Fatal: Error in BibTeX-encoding paper', submission_id, file=sys.stderr)
            sys.exit(1)
        final_bibs.append(bib_string)
        with open(bib_path, 'w') as out_bib:
            print(bib_string, file=out_bib)
            print('CREATED', bib_path)
    return final_bibs
            
def make_book(metadata, accepted, pdfs, final_bibs):
    # All this information is shared across tracks, so we can use long-papers
    venue = metadata['long-papers']["abbrev"]
    year = metadata['long-papers']["year"]
    booktitle = metadata['long-papers']['booktitle']
    chairs = metadata['long-papers']['chairs']
    # volume name is track name
    #volume_name = metadata['long-papers']["volume"]
    location = metadata['long-papers']["location"]
    publisher = metadata['long-papers']["publisher"]
    month= metadata['long-papers']["month"]
    
    # Create an index for LaTeX book proceedings
    if not os.path.exists('book-proceedings'):
        os.makedirs('book-proceedings')

    current_track = ""
    with open('book-proceedings/all_papers.tex', 'w') as book_file:
        for paper_id, entry in enumerate(accepted):
            submission_id, paper_title, authors, track = entry
            if submission_id == '0':
                continue
            if len(authors) > 1:
                authors = ', '.join(authors[:-1]) + ' and ' + authors[-1]
            else:
                authors = authors[0]
            
            # insert toc heading for track
#             ['long-papers','short-papers','posters','doctoral-consortium','industry-track','workshop-tutorials']
            if current_track != track:
                pretty_track = pretty_tracks[ tracks.index(track) ]
                print("""\\pdfbookmark{{{pretty_track}}}{{{track}}}\n\\addtocontents{{toc}}{{\\vspace{{10pt}}\\textbf{{{pretty_track}}}\\vspace{{5pt}}}}\n""".format( pretty_track=pretty_track, track=track), file=book_file)
                current_track = track
            
            # consider the proceedings path as the version of record
            pdf_path = '../proceedings/cdrom/stamped/{}.{}-{}.{}.pdf'.format(year, venue,track, paper_id)
            print("""\\goodpaper{{{pdf_file}}}{{{title}}}%
    {{{authors}}}\n""".format(authors=texify(authors), pdf_file=pdf_path, title=texify(paper_title)), file=book_file)

#             print("""\goodpaper{{../{pdf_file}}}{{{title}}}%
#     {{{authors}}}\n""".format(authors=texify(authors), pdf_file=pdfs[submission_id], title=texify(paper_title)), file=book_file)


    # Write the volume-level bib with all the entries
    dest_bib = 'proceedings/cdrom/{}-{}.bib'.format(venue, year)
    with open(dest_bib, 'w') as whole_bib:
        print('\n'.join(final_bibs), file=whole_bib)
        print('CREATED', dest_bib)

    # Copy over the volume-level PDF
    full_pdf_file = 'pdf/{}_{}.pdf'.format(venue, year)
    dest_pdf = dest_bib.replace('bib', 'pdf')
    print('COPYING', full_pdf_file, '->', dest_pdf, file=sys.stderr)
    copy(full_pdf_file, dest_pdf)
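
The comparator `paper_cmp` above orders papers by track position and then by numeric submission id. For distinct submission ids, the same ordering can be sketched with a plain key function, which avoids `cmp_to_key`. `paper_sort_key` is a hypothetical name, and the sample tuples are made up for illustration.

```python
tracks = ['long-papers', 'short-papers', 'posters']
track_order = ['front'] + tracks

def paper_sort_key(paper):
    """Sort by track position, then by numeric submission id."""
    submission_id, _title, _authors, track = paper
    return (track_order.index(track), int(submission_id))

# Made-up sample data in the (submission_id, title, authors, track) layout
papers = [
    ('12', 'B', ['Ann Lee'], 'short-papers'),
    ('7', 'A', ['Bo Chen'], 'long-papers'),
    ('0', 'Proceedings', ['Chair One'], 'front'),
    ('3', 'C', ['Cy Diaz'], 'short-papers'),
]
ordered = sorted(papers, key=paper_sort_key)
print([p[0] for p in ordered])  # ['0', '7', '3', '12']
```

Converting the id with `int` matters: sorting the ids as strings would place `'12'` before `'3'`.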

## Make everything

- You must make the frontmatter first to find out which page the first paper starts on
- This cell does not make the front matter or the combined proceedings
- Run it again after making them in LaTeX to generate their bibtex
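
The start page matters because each paper's page range is computed cumulatively from it. Here is a small sketch of that arithmetic, mirroring the `pages` logic in `render_bibtex_and_track_assigned_pdf`; the page counts are made up and `page_ranges` is a hypothetical helper.

```python
def page_ranges(page_counts, start_page=1):
    """Return BibTeX-style 'first--last' ranges for consecutive papers."""
    ranges = []
    for count in page_counts:
        last_page = start_page + count - 1
        ranges.append('{}--{}'.format(start_page, last_page))
        # the next paper starts on the page after this one ends
        start_page = last_page + 1
    return ranges

# Three papers of 10, 8 and 2 pages, with the first paper starting on page 5
print(page_ranges([10, 8, 2], start_page=5))  # ['5--14', '15--22', '23--24']
```

If the frontmatter later grows or shrinks, every range shifts by the same amount, which is why the start page has to be settled first.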

In [2]:
metadata =collect_track_metadata()

# submissions, accepted, abstracts, pdfs  = collect_submissions_and_acceptances( decisions, metadata )

# final_bibs = render_bibtex_and_track_assigned_pdf(metadata, submissions, accepted, abstracts, pdfs,5)

accepted, abstracts, pdfs  = collect_submissions_and_acceptances( decisions, metadata )

final_bibs = render_bibtex_and_track_assigned_pdf(metadata, accepted, abstracts, pdfs,5)

make_book(metadata, accepted, pdfs, final_bibs)

Found  90  submitted files in  long-papers
Found  146  submitted files in  short-papers
Found  166  submitted files in  posters
Found  179  submitted files in  doctoral-consortium
Found  185  submitted files in  industry-track
Found  192  submitted files in  workshop-tutorials
Found  52  accepted files in  long-papers
Found  89  accepted files in  short-papers
Found  99  accepted files in  posters
Found  108  accepted files in  doctoral-consortium
Found  112  accepted files in  industry-track
Found  118  accepted files in  workshop-tutorials
Found  90 abstracts in  <csv.DictReader object at 0x7f22bc124290>
Found  146 abstracts in  <csv.DictReader object at 0x7f22bc124250>
Found  166 abstracts in  <csv.DictReader object at 0x7f22ad47dc10>
Found  179 abstracts in  <csv.DictReader object at 0x7f22bc124250>
Found  185 abstracts in  <csv.DictReader object at 0x7f22ad47cf10>
Found  192 abstracts in  <csv.DictReader object at 0x7f22bc124250>
CREATED proceedings/cdrom/bib/2022.EDM-front.0.bib


COPYING long papers meta -> proceedings/meta
COPYING pdf/EDM_2022_frontmatter.pdf -> proceedings/cdrom/pdf/2022.EDM-front.0.pdf
COPYING long-papers/pdf/EDM_2022_paper_7.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.1.pdf
COPYING long-papers/pdf/EDM_2022_paper_10.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.2.pdf
COPYING long-papers/pdf/EDM_2022_paper_12.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.3.pdf
COPYING long-papers/pdf/EDM_2022_paper_18.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.4.pdf
COPYING long-papers/pdf/EDM_2022_paper_21.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.5.pdf
COPYING long-papers/pdf/EDM_2022_paper_22.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.6.pdf
COPYING long-papers/pdf/EDM_2022_paper_35.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.7.pdf
COPYING long-papers/pdf/EDM_2022_paper_37.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.8.pdf
COPYING long-papers/pdf/EDM_2022_paper_51.pdf -> proceedings/cdrom/pdf/2022.EDM-long-pape

CREATED proceedings/cdrom/bib/2022.EDM-long-papers.13.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.14.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.15.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.16.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.17.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.18.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.19.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.20.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.21.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.22.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.23.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.24.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.25.bib
CREATED proceedings/cdrom/bib/2022.EDM-long-papers.26.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.27.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.28.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.29.bib
CREATED pro

COPYING long-papers/pdf/EDM_2022_paper_78.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.16.pdf
COPYING long-papers/pdf/EDM_2022_paper_80.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.17.pdf
COPYING long-papers/pdf/EDM_2022_paper_84.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.18.pdf
COPYING long-papers/pdf/EDM_2022_paper_94.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.19.pdf
COPYING long-papers/pdf/EDM_2022_paper_108.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.20.pdf
COPYING long-papers/pdf/EDM_2022_paper_109.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.21.pdf
COPYING long-papers/pdf/EDM_2022_paper_125.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.22.pdf
COPYING long-papers/pdf/EDM_2022_paper_133.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.23.pdf
COPYING long-papers/pdf/EDM_2022_paper_141.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.24.pdf
COPYING long-papers/pdf/EDM_2022_paper_148.pdf -> proceedings/cdrom/pdf/2022.EDM-long-papers.25.pdf
COPY

CREATED proceedings/cdrom/bib/2022.EDM-short-papers.34.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.35.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.36.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.37.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.38.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.39.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.40.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.41.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.42.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.43.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.44.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.45.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.46.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.47.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.48.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.49.bib
CREATED proceedings/cdrom/bib/2022.EDM-short-papers.50.b

COPYING short-papers/pdf/EDM_2022_paper_52.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.37.pdf
COPYING long-papers/pdf/EDM_2022_paper_59.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.38.pdf
COPYING long-papers/pdf/EDM_2022_paper_69.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.39.pdf
COPYING long-papers/pdf/EDM_2022_paper_72.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.40.pdf
COPYING long-papers/pdf/EDM_2022_paper_74.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.41.pdf
COPYING short-papers/pdf/EDM_2022_paper_82.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.42.pdf
COPYING long-papers/pdf/EDM_2022_paper_93.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.43.pdf
COPYING long-papers/pdf/EDM_2022_paper_97.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.44.pdf
COPYING long-papers/pdf/EDM_2022_paper_105.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.45.pdf
COPYING short-papers/pdf/EDM_2022_paper_107.pdf -> proceedings/cdrom/pdf/2022.EDM-short-papers.46

CREATED proceedings/cdrom/bib/2022.EDM-posters.60.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.61.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.62.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.63.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.64.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.65.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.66.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.67.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.68.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.69.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.70.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.71.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.72.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.73.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.74.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.75.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.76.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.77.bib
CREATED proceedings/cdrom/bi

proceedings/cdrom/pdf/2022.EDM-posters.63.pdf
COPYING short-papers/pdf/EDM_2022_paper_57.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.64.pdf
COPYING long-papers/pdf/EDM_2022_paper_61.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.65.pdf
COPYING short-papers/pdf/EDM_2022_paper_67.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.66.pdf
COPYING long-papers/pdf/EDM_2022_paper_71.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.67.pdf
COPYING short-papers/pdf/EDM_2022_paper_86.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.68.pdf
COPYING short-papers/pdf/EDM_2022_paper_87.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.69.pdf
COPYING long-papers/pdf/EDM_2022_paper_88.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.70.pdf
COPYING short-papers/pdf/EDM_2022_paper_90.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.71.pdf
COPYING long-papers/pdf/EDM_2022_paper_91.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.72.pdf
COPYING short-papers/pdf/EDM_2022_paper_101.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.73.p

CREATED proceedings/cdrom/bib/2022.EDM-posters.88.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.89.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.90.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.91.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.92.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.93.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.94.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.95.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.96.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.97.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.98.bib
CREATED proceedings/cdrom/bib/2022.EDM-posters.99.bib
CREATED proceedings/cdrom/bib/2022.EDM-doctoral-consortium.100.bib
CREATED proceedings/cdrom/bib/2022.EDM-doctoral-consortium.101.bib
CREATED proceedings/cdrom/bib/2022.EDM-doctoral-consortium.102.bib
CREATED proceedings/cdrom/bib/2022.EDM-doctoral-consortium.103.bib
CREATED proceedings/cdrom/bib/2022.EDM-doctoral-consortium.104.bib
CREATED proceedin

COPYING posters/pdf/EDM_2022_paper_200.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.90.pdf
COPYING posters/pdf/EDM_2022_paper_202.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.91.pdf
COPYING posters/pdf/EDM_2022_paper_205.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.92.pdf
COPYING posters/pdf/EDM_2022_paper_211.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.93.pdf
COPYING posters/pdf/EDM_2022_paper_214.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.94.pdf
COPYING posters/pdf/EDM_2022_paper_215.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.95.pdf
COPYING posters/pdf/EDM_2022_paper_220.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.96.pdf
COPYING posters/pdf/EDM_2022_paper_221.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.97.pdf
COPYING posters/pdf/EDM_2022_paper_222.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.98.pdf
COPYING posters/pdf/EDM_2022_paper_223.pdf -> proceedings/cdrom/pdf/2022.EDM-posters.99.pdf
COPYING doctoral-consortium/pdf/EDM_2022_paper_34.pdf -> proceedings/cdrom/pdf/2

CREATED proceedings/cdrom/bib/2022.EDM-workshop-tutorials.113.bib
CREATED proceedings/cdrom/bib/2022.EDM-workshop-tutorials.114.bib
CREATED proceedings/cdrom/bib/2022.EDM-workshop-tutorials.115.bib
CREATED proceedings/cdrom/bib/2022.EDM-workshop-tutorials.116.bib
CREATED proceedings/cdrom/bib/2022.EDM-workshop-tutorials.117.bib
CREATED proceedings/cdrom/bib/2022.EDM-workshop-tutorials.118.bib
CREATED proceedings/cdrom/EDM-2022.bib


## Add citation copyright block

- DOIs have been reserved elsewhere and written to the bibtex
- We create a citation block for each paper and add it to the left corner of its first page
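
The citation text itself is taken from a formatted bibliography by discarding everything up to and including the `\bibitem{key}` marker. A minimal sketch of that cleanup follows, assuming the formatter emits a `thebibliography` environment shaped like the hand-written sample below (the real output may differ in detail); `clean_citation` is a hypothetical helper name.

```python
# Hand-written sample, shaped like a LaTeX bibliography; not real output
sample = r"""\begin{thebibliography}{1}

\bibitem{2022.EDM-long-papers.1}
A.~Author and B.~Writer.
\newblock An example title.
\newblock In {\em Example Proceedings}, pages 5--14, 2022.

\end{thebibliography}
"""

def clean_citation(formatted, key):
    """Keep only the citation text that follows \\bibitem{key}."""
    marker = key + '}'
    start = formatted.find(marker)
    if start == -1:
        raise ValueError('key not found: ' + key)
    body = formatted[start + len(marker):]
    body = body.replace('\\newblock', ' ').replace('\\end{thebibliography}', '')
    # str.split() with no arguments collapses any run of whitespace
    return ' '.join(body.split())

print(clean_citation(sample, '2022.EDM-long-papers.1'))
```

Joining on `split()` collapses runs of whitespace of any length, whereas chaining a fixed number of double-space replacements handles only short runs.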

#### Notes

Attempts to remove the citation block warning from Word-authored PDFs using pdfbox or PyPDF2 were not successful.
Only Inkscape succeeded, but it could not be automated robustly.
Inkscape also turned the first page into SVG graphics, so no text could be extracted from it.
As a result, we ruled out Inkscape and instead opted to cover the warning with a white rectangle and place the new citation block over it.
`pdf_annotate` was explored for this and worked; however, one cannot copy/paste text from the resulting block, hyperlink from the block, or use any font but Helvetica.
So I switched to `PyPDF2`, which provides most of the needed functionality through watermarking, but the PDF used as the watermark still has to be generated, which is done with `pdflatex`.
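
Some citation blocks are too tall for the space authors leave, so the stamping cell shrinks a vertical gap in the block when the citation is long. Below is a sketch of that length-based rule as a standalone function, using the thresholds from the cell (425 and 400 characters). `choose_offset` is a hypothetical name, and the per-title override in the cell is omitted here.

```python
def choose_offset(citation, long_cutoff=425, medium_cutoff=400):
    """Return the vertical offset in mm (as a string) for a citation."""
    if len(citation) > long_cutoff:
        return "-2"
    if len(citation) > medium_cutoff:
        return "-1"
    return "0"

# Placeholder strings of a given length stand in for real citations
for length in (341, 418, 444):
    print(length, choose_offset("x" * length))
```

The cutoffs are keyword arguments so they can be tuned if a different template leaves more or less room.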

In [4]:
#from PyPDF2 import PdfWriter, PdfReader
import PyPDF2
import os, sys, re, pybtex
from pybtex.database import parse_file,BibliographyData, Entry, Person
import subprocess
from glob import glob

#change pdflatex location to match your path
BIN_PDFLATEX = '/usr/local/texlive/2022/bin/x86_64-linux/pdflatex'

#this template is used to generate a latex file that pdflatex compiles to a pdf citation/copyright block for merging with the paper
template_block = r'''\documentclass{edm_article}
\usepackage{tikz} %can be removed if tikz is removed
\usepackage[pdflang={en-US},pdftex,hidelinks]{hyperref}
\usetikzlibrary{fit} %can be removed if tikz is removed
\toappear{\scriptsize #CITATION \\
\\[#OFFSETmm] %a full space is great but some citation blocks are too big and we need to reclaim the space
\copyright~#YEAR Copyright is held by the author(s). This work is distributed under the Creative Commons Attribution NonCommercial NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license. \\
\url{https://doi.org/#DOI}}

\begin{document}
\title{}
\maketitle
\begin{abstract}
\end{abstract}
%BEGIN TIKZ: This is only to cover warnings and can be removed if warnings are removed
    \begin{tikzpicture}[remember picture,overlay,
mynode/.style 2 args = {
    fill=white, 
    inner sep=0pt, outer sep=0pt,
    fit=(#1) (#2)}
                        ]
\coordinate (bottom left) at (0,-17.5);
\coordinate (top right) at (8.5,-15); %(8.5,-14.5)
    \node[mynode={bottom left}{top right}] {};
    \end{tikzpicture}
%END TIKZ
\end{document}'''

def make_name(person):
    name = ''
    if person.first_names:
        name += ' '.join(person.first_names) 
    if person.middle_names:
        name += ' ' + ' '.join(person.middle_names)
    if person.prelast_names:
        name += ' ' + ' '.join(person.prelast_names)
    if person.last_names:
        name += ' ' + ' '.join(person.last_names)
    if person.lineage_names:
        name += ' ' + ' '.join(person.lineage_names)
    return name

def stamp_citation(pdf_file, bib_file, output_dir ):
    #output file has same name as input, just different location
    output_file = os.path.join(output_dir, os.path.basename(pdf_file))
    #extract bibtex elements for metadata, etc
    bib_data = parse_file(bib_file)
    key,entry = list(bib_data.entries.items())[0]
    #generate latex from template
    dirty_citation = pybtex.bibtex.format_from_file(bib_file,style='abbrv')
    key_brace = key + '}'
    key_end = dirty_citation.find(key_brace) + len(key_brace)
    clean_citation=dirty_citation[key_end:].replace('\n',' ').replace('\\newblock', ' ').replace('\\end{thebibliography}','').replace('  ',' ').replace('  ',' ').strip()
    year = entry.fields['year']
    if "doi" in entry.fields:
        doi = entry.fields['doi']
    else:
        print("WARNING: DOI MISSING")
        doi = "WARNING: DOI MISSING"
    #Customize here for bizarre things authors might do
    if entry.fields['title'].startswith("{Process-BERT"):
        offset = "-3"
        print( "*** Offsetting " + offset + " for (" + str(len(clean_citation)) + ") " + entry.fields['title'] )
    #This is the default length adjustment for long citations (long title, authors, etc)
    elif len(clean_citation) > 425: #500 is a reasonable value here, but it seems that some people reduce the space of the copyright block
        offset = "-2"
        print( "*** Offsetting " + offset + " for (" + str(len(clean_citation)) + ") " + entry.fields['title'] )
    elif len(clean_citation) > 400:
        offset = "-1"
        print( "*** Offsetting " + offset + " for (" + str(len(clean_citation)) + ") " + entry.fields['title'] )
    else:
        offset = "0"
    citation_block = template_block.replace('#YEAR',year).replace('#CITATION',clean_citation).replace('#DOI',doi).replace('#OFFSET',offset)
    with open('watermark.tex', 'w') as f:
        f.write(citation_block)
    subprocess.run([BIN_PDFLATEX, 'watermark.tex'], stdout=subprocess.DEVNULL, check=True)
    
    with open(pdf_file, "rb") as filehandle_input:
        pdf = PyPDF2.PdfReader(filehandle_input)
        with open('watermark.pdf', "rb") as filehandle_watermark:
            watermark = PyPDF2.PdfReader(filehandle_watermark)

            first_page = pdf.pages[0]
            first_page_watermark = watermark.pages[0]
            # overlay the citation block on the first page
            first_page.merge_page(first_page_watermark)

            pdf_writer = PyPDF2.PdfWriter()
            pdf_writer.add_page(first_page)
            # copy the remaining pages unchanged
            for i in range(1, len(pdf.pages)):
                pdf_writer.add_page(pdf.pages[i])

            pdf_writer.add_metadata(
                {
                    "/Author": ', '.join( make_name(i) for i in  entry.persons['author']),
                    "/Title": entry.fields["title"],
                    "/Subject": entry.fields["abstract"],
                }
            )

            with open(output_file, "wb") as filehandle_output:
                # write the watermarked file to the new file
                pdf_writer.write(filehandle_output)
                
# stamp_citation( "2022.EDM-long-papers.7.pdf", "2022.EDM-long-papers.7.bib", "proceedings/cdrom/stamped" )
for pdf_path in glob(os.path.join('proceedings/cdrom/pdf/*.pdf')):
    base_name = os.path.split(pdf_path)[1]
    # To test before the DOIs exist, read from 'proceedings/cdrom/bib' instead of 'bib-with-doi'
    bib_path = os.path.join('proceedings/cdrom/bib-with-doi', os.path.splitext(base_name)[0] + ".bib")
#     print(pdf_path)
    print(base_name)
#     print(bib_path)
    try:
        stamp_citation( pdf_path, bib_path, "proceedings/cdrom/stamped")
    except Exception as e:
        # report the failure and keep stamping the remaining papers
        print(type(e).__name__, e)

2022.EDM-posters.72.pdf
2022.EDM-posters.88.pdf
*** Offsetting -3 for (341) {Process-BERT}: A Framework for Representation Learning on Educational Process Data
2022.EDM-short-papers.33.pdf
2022.EDM-long-papers.7.pdf
*** Offsetting -1 for (418) Detecting {SMART} Model Cognitive Operations in Mathematical Problem-Solving Process
2022.EDM-long-papers.26.pdf
*** Offsetting -2 for (444) Challenges and Feasibility of Automatic Speech Recognition for Modeling Student Collaborative Discourse in Classrooms
2022.EDM-posters.75.pdf
2022.EDM-posters.65.pdf
2022.EDM-posters.74.pdf
2022.EDM-long-papers.17.pdf
2022.EDM-long-papers.16.pdf
2022.EDM-posters.82.pdf
2022.EDM-doctoral-consortium.102.pdf
2022.EDM-posters.84.pdf
2022.EDM-short-papers.49.pdf
*** Offsetting -2 for (429) Using {Markov} Models and Random Walks to Examine Strategy Use of More or Less Successful Comprehenders
2022.EDM-long-papers.14.pdf
2022.EDM-short-papers.28.pdf
2022.EDM-long-papers.23.pdf
2022.EDM-short-papers.45.pdf
2022.EDM-
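
The loop above only prints an exception when stamping fails, so a PDF with no matching `.bib` file in `bib-with-doi` is easy to miss among the file names. Below is a minimal sketch of a pre-flight check, assuming the same directory layout and the same base-name pairing the loop uses. The helper name `find_unpaired_pdfs` is hypothetical and not part of the notebook.

```python
import os
from glob import glob


def find_unpaired_pdfs(pdf_dir, bib_dir):
    """Return base names of PDFs in pdf_dir that lack a same-named .bib in bib_dir."""
    missing = []
    for pdf_path in sorted(glob(os.path.join(pdf_dir, "*.pdf"))):
        # "2022.EDM-posters.72.pdf" -> "2022.EDM-posters.72"
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        if not os.path.exists(os.path.join(bib_dir, stem + ".bib")):
            missing.append(os.path.basename(pdf_path))
    return missing
```

Calling `find_unpaired_pdfs('proceedings/cdrom/pdf', 'proceedings/cdrom/bib-with-doi')` before the stamping loop should return an empty list once every accepted paper has DOI-enhanced bibtex.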

## Discarded approaches

In [170]:
from pybtex.database import parse_file,BibliographyData, Entry, Person

entry.persons['editor'] = [Person('Mitrovic, Tanja'), Person('Bosch, Nigel')]
entry.persons['editor']

[Person('Mitrovic, Tanja'), Person('Bosch, Nigel')]

In [171]:
print(bib_data)

BibliographyData(
  entries=OrderedCaseInsensitiveDict([
    ('2022.EDM-doctoral-consortium.102', Entry('inproceedings',
      fields=[
        ('abstract', 'How to measure the semantic similarity of natural language is a fundamental issue in many tasks, such as paraphrase identification (PI) and plagiarism detection (PD) which are intended to solve major issues in education. There are many approaches that have been suggested, such as machine learning (ML) and deep learning (DL) methods. Unlike in prior research, where detecting paraphrases in short and sentence-level texts has been done, we focus on the not yet explored area of paraphrase detection in paragraphs. We consider that the meaning of a piece of text can be broken into more than one sentence, this is over and above the sentences as extracted from two benchmark datasets (Webis-CPC-11 and MSRP). TF-IDF, Bleu metric, N-gram overlap, and Word2vec are used as features, then SVM is invoked as a classifier. The contribution of this

In [13]:
from pdf_annotate import PdfAnnotator, Appearance, Location
import os, pybtex
from pylatexenc.latex2text import LatexNodes2Text 
from unidecode import unidecode

# named copyright_notice so it does not shadow the built-in `copyright`
copyright_notice = "Copyright 2022 by the authors. This publication is distributed under the terms and conditions of the Creative Commons Attribution NonCommercial NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license (https://creativecommons.org/licenses/by-nc-nd/4.0/)"
def stamp_citation(pdf_file, bib_file, output_dir ):
    #works but style is slightly off 
    citation = pybtex.format_from_file(bib_file,style='plain',output_backend="plaintext")
    # citation[4:] drops the leading "[1] " label
    text_block =  copyright_notice + "\n\n" +  unidecode(citation[4:].rstrip())
    #requires post processing and loses doi
    #citation = pybtex.bibtex.format_from_file(bib_file,style='abbrv',output_backend="html")
    annotator = PdfAnnotator(pdf_file)
    #x_start, y_start, x_end, y_end = 53,240,290,10
    x_start, y_start, x_end, y_end = 53,170,290,70
    annotator.add_annotation(
        'square',
        Location(x1=x_start, y1=y_start, x2=x_end, y2=y_end, page=0),
        Appearance(stroke_color=(1,1,1), stroke_width=3, fill=(1,1,1))
        #Appearance(content= text_block, font_size=7,  fill=(0,0,0)), 
    )
    annotator.add_annotation(
        'text',
        Location(x1=x_start, y1=y_start, x2=x_end, y2=y_end, page=0),
        Appearance(content= text_block, font_size=7,  fill=(0,0,0)), 
    )

    annotator.write(os.path.join(output_dir, os.path.basename(pdf_file)))
#2022.EDM-doctoral-consortium.102
# stamp_citation('2022.EDM-posters.96.pdf',"2022.EDM-posters.96.bib","proceedings/cdrom/stamped")
stamp_citation('2022.EDM-doctoral-consortium.102.pdf',"2022.EDM-doctoral-consortium.102.bib","proceedings/cdrom/stamped")

In [15]:
bib_file = "/z/aolney/reviews/conferences/edm/2022/publications-chair/easy2edm/proceedings/cdrom/bib-with-doi/2022.EDM-doctoral-consortium.100.bib"
pybtex.format_from_file(bib_file,style='plain',output_backend="plaintext")

'[1] Ethan Prihar, Alexander Moore, and Neil Heffernan. Identifying explanations within student-tutor chat logs. In Proceedings of the 15th International Conference on Educational Data Mining, 768–772. Durham, United Kingdom, July 2022. International Educational Data Mining Society. doi:10.5072/zenodo.1073398.\n'
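
The `[1] ` label at the start of this output is what `citation[4:]` removes in `stamp_citation` above. Slicing a fixed four characters assumes a single-digit label, which holds as long as each `.bib` file contains one entry. Below is a minimal sketch of a regex alternative that does not depend on the label width. The helper name `strip_citation_label` is hypothetical.

```python
import re


def strip_citation_label(citation):
    """Remove a leading '[n] ' label and trailing whitespace from a formatted citation."""
    return re.sub(r"^\[\d+\]\s*", "", citation).rstrip()
```

A string without a leading label passes through unchanged apart from the trailing whitespace.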

In [58]:
bib_file = "2022.EDM-doctoral-consortium.102.bib"
latex_text = pybtex.bibtex.format_from_file(bib_file,style='abbrv')
# from pylatexenc.latex2text import LatexNodes2Text 

# dirty_citation = LatexNodes2Text().latex_to_text(latex_text)
latex_text
key = '2022.EDM-doctoral-consortium.102}'
key_end = latex_text.find(key) + len(key)
latex_text[key_end:].replace('\n',' ').replace('\\newblock', ' ').replace('\\end{thebibliography}','').replace('  ',' ').replace('  ',' ').strip()

'A.~A. Saqaabi, C.~Stewart, E.~Akrida, and A.~Cristea. A paraphrase identification approach in paragraph length texts. In T.~Mitrovic and N.~Bosch, editors, {\\em Proceedings of the 15th International Conference on Educational Data Mining}, pages 777--783, Durham, United Kingdom, July 2022. International Educational Data Mining Society.'

In [16]:
' '.join(dirty_citation.split('\n')[3:]).rstrip().replace('  ',' ').replace('  ',' ').replace('\xa0', ' ')

'A. A. Saqaabi, C. Stewart, E. Akrida, and A. Cristea. A paraphrase identification approach in paragraph length texts. In T. Mitrovic and N. Bosch, editors, Proceedings of the 15th International Conference on Educational Data Mining, pages 777–783, Durham, United Kingdom, July 2022. International Educational Data Mining Society.'
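
The two cleanup steps above (cutting the entry out of the `abbrv` output, then flattening the LaTeX to plain text) can be folded into one function. Below is a minimal standard-library sketch, assuming the input is the `thebibliography` string returned by `pybtex.bibtex.format_from_file`. The name `bbl_entry_to_text` is hypothetical, and the function handles only the markup visible in the output above (`~`, `--`, `\newblock`, `{\em ...}`). It is not a general LaTeX-to-text converter like `LatexNodes2Text`.

```python
import re


def bbl_entry_to_text(bbl, key):
    """Return the text that follows the bibitem for `key`, flattened to plain text."""
    marker = key + "}"
    start = bbl.find(marker)
    if start == -1:
        raise ValueError("key not found: " + key)
    body = bbl[start + len(marker):]
    # drop the structural commands that surround the entry
    body = body.replace("\\end{thebibliography}", "").replace("\\newblock", " ")
    # non-breaking tie -> space, page-range double hyphen -> en dash
    body = body.replace("~", " ").replace("--", "\u2013")
    # unwrap {\em ...} groups that contain no nested braces
    body = re.sub(r"\{\\em\s+([^{}]*)\}", r"\1", body)
    # collapse newlines and repeated spaces
    return re.sub(r"\s+", " ", body).strip()
```

Case-protecting braces in titles, such as `{SMART}`, are left in place by this sketch and would need a further pass.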