This is a command-line tool. It reads ASCII text from the specified files (or its standard input), and will print on the standard output lines that (probably) match a Greek surname. Various command-line options can direct the matching to be performed on specified fields or the longest part of a field.
Run make install
The classifier requires two files containing n-grams derived from
large collections of Greek and international surnames.
Therefore, run it from the directory containing the source code
(as perl greek-classifier.pl
), or install it in order to run it
from any directory (as greek-classifier
).
perl greek-classifier.pl highly-cited-cs-all.txt
ALAMOUTI
ALEXOPOULOS
CURTIS
DENNIS
KOMLOS
PAPADIMITRIOU
POLYDOROS
TRIVEDI
VALIANT
VARANASI
VARDI
VAZIRANI
VOLAKIS
YANNAKAKIS
greek-classifier [-d distance] [-k field] [-l] [-t separator] [file ...]
greek-classifier -g [file ...]
-d distance Specify the distance that generates a match (default 9)
Higher values increase precision (fewer wrong entries)
Lower values increase recall (fewer missed entries)
-D Print the calculated distances
-g Generate an n-gram table
-k field Specify field to match; first is 1 (default whole line)
-l Match only line's / field's longest word
-t separator Specify field separator RE (space characters by default)
-u Normalize matched part to uppercase
-w Print matching word, rather than matching line
These are the classifier's performance metrics,
as reported by the script evaluate.sh
- Precision: .94
- Recall: .86
- Specificity: .97
- Accuracy: .94
- MCC: .86
The Python 3 script greek-scientists.py
is a small CGI web application
that queries the DBLP bibliographic database for Greek scientists who have
published in a given venue over a specific period.