Classify surnames as Greek

greek-classifier is a command-line tool. It reads ASCII text from the specified files (or from its standard input) and prints on its standard output the lines that probably contain a Greek surname. Command-line options can restrict the matching to a specific field, or to the longest word of a line or field.

Installation

Run make install

Execution

The classifier requires two files containing n-grams derived from large collections of Greek and international surnames. Either run it from the directory containing the source code (as perl greek-classifier.pl), or install it so that it can be run from any directory (as greek-classifier).
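
To illustrate how two n-gram tables can separate Greek from international surnames, here is a minimal Python sketch. It is not the algorithm used by greek-classifier.pl: the function names, the choice of trigrams, the probability floor, and the log-ratio score are all assumptions made for illustration, and the tiny name lists stand in for the large collections mentioned above.

```python
import math
from collections import Counter


def ngram_table(names, n=3):
    """Return the relative frequency of each character n-gram in `names`."""
    counts = Counter()
    for name in names:
        padded = "^" + name.upper() + "$"  # mark the start and end of the word
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def greek_score(word, table_el, table_all, n=3, floor=1e-4):
    """Sum the log-ratios of Greek to overall n-gram probabilities.

    Positive scores mean the word's n-grams are more typical of the
    Greek list. `floor` (an assumed value) stands in for unseen n-grams.
    """
    padded = "^" + word.upper() + "$"
    score = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        p_el = table_el.get(gram, floor)
        p_all = table_all.get(gram, floor)
        score += math.log(p_el / p_all)
    return score


# Toy training data; real tables would come from large surname collections.
greek = ["PAPADIMITRIOU", "ALEXOPOULOS", "YANNAKAKIS", "VOLAKIS"]
everyone = greek + ["SMITH", "CURTIS", "TRIVEDI", "VALIANT", "DENNIS"]

table_el = ngram_table(greek)
table_all = ngram_table(everyone)

for surname in ("PAPADOPOULOS", "SMITHSON"):
    print(surname, round(greek_score(surname, table_el, table_all), 2))
```

With these toy lists, PAPADOPOULOS scores above zero and SMITHSON below it. A threshold on such a score would play a role loosely comparable to the -d option, although the real tool reports a distance rather than this score.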

Example

perl greek-classifier.pl highly-cited-cs-all.txt
ALAMOUTI
ALEXOPOULOS
CURTIS
DENNIS
KOMLOS
PAPADIMITRIOU
POLYDOROS
TRIVEDI
VALIANT
VARANASI
VARDI
VAZIRANI
VOLAKIS
YANNAKAKIS

Command-line options

greek-classifier [-D] [-d distance] [-k field] [-l] [-t separator] [-u] [-w] [file ...]
greek-classifier -g [file ...]
-d distance     Specify the distance that generates a match (default 9)
                Higher values increase precision (fewer wrong entries)
                Lower values increase recall (fewer missed entries)
-D              Print the calculated distances
-g              Generate an n-gram table
-k field        Specify field to match; first is 1 (default whole line)
-l              Match only line's / field's longest word
-t separator    Specify field separator RE (space characters by default)
-u              Normalize matched part to uppercase
-w              Print matching word, rather than matching line
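
The interplay of -k, -t and -l can be pictured with the following Python sketch. It only mimics the behaviour described in the option list. The function name candidate, its parameters, and the definition of a word as a run of ASCII letters are assumptions, not code taken from greek-classifier.pl.

```python
import re


def candidate(line, field=None, separator=r"\s+", longest=False):
    """Pick the part of `line` that would be handed to the classifier.

    field      1-based field number (like -k); None means the whole line
    separator  regular expression that splits fields (like -t)
    longest    keep only the longest word of the selected part (like -l)
    """
    part = line.rstrip("\n")
    if field is not None:
        fields = re.split(separator, part)
        if not 1 <= field <= len(fields):
            return ""  # the requested field does not exist on this line
        part = fields[field - 1]
    if longest:
        # Assumption: a word is a run of ASCII letters.
        words = re.findall(r"[A-Za-z]+", part)
        part = max(words, key=len, default="")
    return part


# Hypothetical semicolon-separated record with the name in field 2.
record = "7;C H PAPADIMITRIOU;cs"
print(candidate(record, field=2, separator=";", longest=True))
```

Under these assumptions the call is loosely comparable to running greek-classifier -k 2 -t ';' -l on a file of such records: only the longest word of the second field is considered for matching.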

Performance

These are the classifier's performance metrics, as reported by the script evaluate.sh:

  • Precision: .94
  • Recall: .86
  • Specificity: .97
  • Accuracy: .94
  • MCC: .86
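
For reference, the five metrics above follow from the counts of true and false positives and negatives by their standard definitions. The Python sketch below shows those definitions. The function name and the example counts are made up for illustration and are not the counts produced by evaluate.sh.

```python
import math


def metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from raw counts.

    tp/fp: surnames flagged as Greek that are / are not Greek
    tn/fn: surnames not flagged that are not / are Greek
    Every denominator is assumed to be non-zero.
    """
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Matthews correlation coefficient
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Made-up counts, for illustration only.
for name, value in metrics(tp=8, fp=2, tn=18, fn=2).items():
    print(f"{name}: {value:.2f}")
```

Precision and recall are the two quantities traded against each other by the -d option, while MCC summarizes all four counts in a single value between -1 and 1.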

Use case

The Python 3 script greek-scientists.py is a small CGI web application that queries the DBLP bibliographic database for Greek scientists who have published in a given venue over a specific period.