# Author detection

Following the date detection in the first notebook of this series, we now detect the span that covers the authors of a publication. We use the same cell setup as before, with three stages for the training, validation, and test sets.

We use a simple paradigm for the detection of `Author`: boundary matching. Instead of detecting the content of the author span, we detect only its two boundaries and then connect them. The authors are normally listed at the beginning of a reference, so the first token of the reference (`FirstInRef`) serves as the start. For the end (`AuthorStop`), we combine several indicators such as `Date` annotations, `PERIOD`s, and `COLON`s. Additional wordlists help to rule out spans that actually refer to something else, such as editors.
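
To make the idea concrete outside of Ruta, here is a minimal sketch of boundary matching in plain Python. It is not the Ruta script used in this notebook: the regular expressions `YEAR`, `INITIAL`, and `EDITOR_HINT` and the functions `author_stop` and `author_span` are hypothetical, heavily simplified stand-ins for the `Date`, `Initial`, and wordlist annotations, and they assume references that start with the author list.

```python
import re

# Simplified, hypothetical stand-ins for the annotations in the Ruta script.
YEAR = re.compile(r"\(?\b(?:19|20)\d{2}\b")  # plays the role of `Date`
INITIAL = re.compile(r"\b[A-Z]\.")           # plays the role of `Initial`, e.g. "P."
EDITOR_HINT = re.compile(r"\b(?:editors?|eds?)\b", re.IGNORECASE)  # tiny "wordlist"


def author_stop(reference):
    """Return the offset of the earliest stop indicator, or None."""
    stops = []
    year = YEAR.search(reference)
    if year:
        stops.append(year.start())
    # Periods that close an initial are not boundaries.
    initial_periods = {m.end() - 1 for m in INITIAL.finditer(reference)}
    for offset, char in enumerate(reference):
        if char == ":" or (char == "." and offset not in initial_periods):
            stops.append(offset)
            break
    return min(stops) if stops else None


def author_span(reference):
    """Return the text from the start of the reference up to the stop boundary."""
    stop = author_stop(reference)
    if stop is None:
        return None
    span = reference[:stop].rstrip(" ,(")
    # Wordlist-style disambiguation: drop spans that look like editor lists.
    if not span or EDITOR_HINT.search(span):
        return None
    return span
```

For instance, `author_span("Knuth, D. E.: The Art of Computer Programming. 1968.")` stops at the colon, because the periods after `D` and `E` belong to initials and are skipped. The Ruta rules below follow the same logic but operate on annotations instead of character offsets.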

The Ruta script for detecting dates needs to be executed before running this script.

#### Let's first inspect a single training example with gold standard information

In [None]:
%displayMode RUTA_COLORING
%loadCas data/train/A00-1042.txt.xmi
%typeSystemDir typesystems/
TYPESYSTEM ReferencesTypeSystem;
COLOR(Author,"lightgreen");

#### Let's now write some rules to detect authors

In [None]:
%inputDir data/train
%outputDir temp/out_author_train

// Evaluation mode for Author annotations
%displayMode EVALUATION
%evalTypes Author

// Set the paths for resources. The script for detecting dates was written to temp/ when exercise 1 was executed.
%scriptDir temp/
%typeSystemDir typesystems/

// Writing this script and the TypeSystem
%writescript temp/Author.ruta
%saveTypeSystem typesystems/AuthorTypeSystem.xml

TYPESYSTEM ReferencesTypeSystem;
SCRIPT Date;

// Try to find dates if there aren't any yet
Document{-CONTAINS(Date)-> CALL(Date)};

DECLARE FirstInRef, AuthorStopInd, AuthorStop;
DECLARE Initial, EditorInd, NoAuthorInd;

WORDLIST EditorList = "resources/editor_ind.txt";
MARKFAST(EditorInd, EditorList, true);
WORDLIST NoAuthorList = "resources/no_author.txt";
MARKFAST(NoAuthorInd, NoAuthorList, true);

BLOCK(utils) Document{}{
    Reference{-> MARKFIRST(FirstInRef)};
    
    // Detect author initials, e.g. "P."
    CW{REGEXP(".")-> Initial};
    (CW{REGEXP("..")} PERIOD){-> Initial};
    CAP{REGEXP(".{2,3}")-> Initial};
    i:Initial{->i.end=p.end} p:PERIOD; 
    
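    // Find candidate end boundaries of the Author annotation
    // The token directly before a Date is a candidate
    ANY{-> AuthorStopInd} @Date;
    // Periods that do not belong to an initial, and colons, are candidates
    PERIOD{-PARTOF(Initial)-> AuthorStopInd};
    COLON{-> AuthorStopInd};
    // Drop a candidate that is directly followed by an initial, since the author list continues
    as:AuthorStopInd{-> UNMARK(as)} Initial{ENDSWITH(PERIOD)};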
}

BLOCK(Author) Reference{}{
   
    // The first stop indicator in the reference becomes the actual boundary
    # AuthorStopInd{-> AuthorStop};
    
    // Create actual Author annotations
    (FirstInRef # AuthorStop){-> Author};
    
    // Disambiguation of Author annotations based on wordlists
    a:Author{CONTAINS(EditorInd)-> UNMARK(a)};
    a:Author{CONTAINS(NoAuthorInd)-> UNMARK(a)};
}

#### Error analysis on train data set

In [None]:
%inputDir temp/out_author_train
%outputDir temp/unused
%displayMode CSV
%csvConfig BadReference

DECLARE BadReference;
Reference{OR(CONTAINS(FalsePositive),CONTAINS(FalseNegative)),-PARTOF(BadReference)-> BadReference};

COLOR(AuthorStop, "red");
COLOR(TruePositive, "lightgreen");
COLOR(FalsePositive, "lightblue");
COLOR(FalseNegative, "pink");

#### Error analysis on validation data set

In [None]:
%inputDir data/validation
%outputDir temp/out_author_validation
%displayMode EVALUATION
%evalTypes Author

SCRIPT Author;
CALL(Author);

In [None]:
%inputDir temp/out_author_validation
%outputDir temp/unused
%displayMode CSV
%csvConfig BadReference

DECLARE BadReference;
Reference{OR(CONTAINS(FalsePositive),CONTAINS(FalseNegative)),-PARTOF(BadReference)-> BadReference};

COLOR(AuthorStop, "red");
COLOR(TruePositive, "lightgreen");
COLOR(FalsePositive, "lightblue");
COLOR(FalseNegative, "pink");

#### Results on test data set

In [None]:
%inputDir data/test
%displayMode EVALUATION
%evalTypes Author

SCRIPT Author;
CALL(Author);